A word of warning about text retrieved from PDF documents.
Recovering text blocks from PDFs is inherently risky. PDF is a page
description format, so it has no notion of the semantics of the text
it contains. It simply places bits of text at given positions on the page.
You can create a whole page of text by taking the individual characters
and their attributes and position on the page, shuffling them, and
writing them to the file. That will produce a readable file, but try
extracting the text from that file. Unless you have a very, very smart
text extractor that reverse-engineers the process of creating the page
and then calculates the _visual_ order of the text elements, you will
end up with gibberish.
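
To make the shuffling argument concrete, here is a minimal sketch in Python. It does not read or write real PDF files: it models a page as a list of (character, x, y) placements, which is an assumption made purely for illustration. The helper names (place_text, extract_stream_order, extract_visual_order) and the fixed-width layout are hypothetical, and a real extractor has to cope with much more (variable glyph widths, rotation, columns, missing space characters).

```python
import random

def place_text(lines, char_width=6.0, line_height=12.0, top=700.0):
    """Model a page: one (char, x, y) placement per character.

    Hypothetical fixed-width layout; y decreases as we move down the page,
    as it does in PDF user space.
    """
    glyphs = []
    for row, line in enumerate(lines):
        y = top - row * line_height
        for col, ch in enumerate(line):
            glyphs.append((ch, col * char_width, y))
    return glyphs

def extract_stream_order(glyphs):
    """Naive extraction: emit characters in the order they were written."""
    return "".join(ch for ch, _, _ in glyphs)

def extract_visual_order(glyphs):
    """Position-aware extraction: group by baseline, then sort left to right."""
    rows = {}
    for ch, x, y in glyphs:
        rows.setdefault(y, []).append((x, ch))
    ordered_lines = []
    for y in sorted(rows, reverse=True):  # top of the page first
        ordered_lines.append("".join(ch for _, ch in sorted(rows[y])))
    return "\n".join(ordered_lines)

lines = ["PDF places glyphs", "at page positions"]
glyphs = place_text(lines)

# Shuffle the placements: the rendered page would look identical,
# because every character still lands at the same coordinates.
shuffled = glyphs[:]
random.Random(0).shuffle(shuffled)

print(extract_stream_order(shuffled))   # scrambled characters
print(extract_visual_order(shuffled))   # the original two lines
```

The position-aware version only works here because the toy page has exact baselines and one column. That is precisely the part that is hard to get right on real documents.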
_Most_ PDF text, _most_ of the time, is laid out on the page in visual
order, but even in the best-behaved files, you are likely to be surprised.
If you don't _know_ that your PDF text extractor program is completely
visually accurate by design, don't tell your boss that you can easily
extract that PDF text without allowing time to proofread every
page. You will get burned.
I don't know how LO extracts PDF text; perhaps it is very sophisticated.
I have my doubts.
--
Peter West
"Other seed fell among thorns, and the thorns grew up and choked it..."