On 18.06.2015 22:33, M. Amin Farajian wrote:
Hi all,
I am working on a toolkit that performs text analysis on given text
documents. The toolkit was originally meant to work with XML files, but
since the input files in real applications are mostly
*.doc/*.docx/*.odt/*.ppt/*.pptx/*.odp/*.pdf/etc., I need to write a
library that reads these file formats and converts their contents into
the desired XML format. I was looking for such a library and learned
that LibreOffice has this functionality.
I think your best option is to use LibreOffice's --convert-to
command-line option (or equivalent wrappers, e.g. unoconv /
LibreOfficeKit) to convert all sorts of formats to ODF, which is
relatively easy to handle with XML tools. (Flat ODF is even easier than
the regular zip-based ODF, but gives you huge base64-encoded images.)
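As a rough illustration of the --convert-to route, here is a minimal Python sketch. It assumes a `soffice` binary is on the PATH; the helper names (`build_convert_command`, `expected_output`, `convert`) and the default target "odt" are made up for this example, and presentations would presumably want "odp" instead.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, target="odt", soffice="soffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    return [
        soffice,
        "--headless",             # run without a GUI
        "--convert-to", target,   # target extension, e.g. "odt" or "fodt"
        "--outdir", str(out_dir),
        str(src),
    ]


def expected_output(src, out_dir, target="odt"):
    """Guess the output path: same base name as the input, target extension.

    A target may carry a filter name after a colon ("odt:writer8");
    only the part before the colon is used as the extension here.
    """
    extension = target.split(":")[0]
    return Path(out_dir) / (Path(src).stem + "." + extension)


def convert(src, out_dir, target="odt", soffice="soffice"):
    """Run the conversion and return the path where the result is expected."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir, target, soffice), check=True)
    return expected_output(src, out_dir, target)
```

Calling `convert("report.docx", "converted")` would then be expected to leave `converted/report.odt` behind. Whether a given input format converts cleanly to a given target depends on the filters in the installed LibreOffice.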
If you use Java or DotNet, you can use the libraries from the Apache ODF
Toolkit to easily manipulate ODF files, or use an XML DOM, or XSLT ...
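To show what "an XML DOM" on ODF can look like, here is a small sketch that uses only Python's standard library rather than the ODF Toolkit. It relies on a zip-based ODF text document keeping its body in `content.xml`; the function names are invented for this example, and the extraction is deliberately naive (it ignores special elements such as `text:s` and `text:tab`, and does not treat nested paragraphs specially).

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF "text" namespace, as used by paragraphs and headings in content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def paragraphs_from_content_xml(xml_bytes):
    """Return the plain text of every text:h / text:p, in document order."""
    root = ET.fromstring(xml_bytes)
    return [
        "".join(element.itertext())
        for element in root.iter()
        if element.tag in BLOCK_TAGS
    ]


def read_odf_paragraphs(path):
    """Open a zip-based ODF file and extract paragraph text from content.xml."""
    with zipfile.ZipFile(path) as archive:
        return paragraphs_from_content_xml(archive.read("content.xml"))
```

From there, the list of paragraphs can be serialized into whatever XML format the analysis toolkit expects. For Flat ODF there is no zip container, so the file's bytes could be passed to `paragraphs_from_content_xml` directly.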
I searched for the part of the code in LibreOffice that is responsible
for reading the given files (in different formats), but couldn't find it.
Could you please point me to this part of the code in the LibreOffice
project?
It's distributed all around the code base, basically. The only filters
that are easily separable from the LO code are the Document Liberation
ones, and those already produce ODF as output.