Re: [libreoffice-users] A word of warning about PDF text

Dominique Michel <dominique.michel -AT- vtxnet.ch>
Sat, 1 Feb 2014 04:16:31 +0100

On Sat, 1 Feb 2014 01:18:22 +0100,
Dominique Michel <dominique.michel@vtxnet.ch> wrote:

On Fri, 31 Jan 2014 13:22:41 +1000,
Peter West <lists@pbw.id.au> wrote:

A word of warning about text retrieved from PDF documents.

Recovering text blocks from PDFs is inherently risky.  PDF is a
page definition format, and so it has no notion of the semantics of
the text it contains. It places bits of text at certain positions
on the page. You can create a whole page of text by taking the
individual characters and their attributes and position on the
page, shuffling them, and writing them to the file.  That will
produce a readable file, but try extracting the text from that
file. Unless you have a very, very smart text extractor that
reverse-engineers the process of creating the page, then calculates
the _visual_ order of the text elements, you will end up with
gibberish.
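
To illustrate the point above, here is a minimal Python sketch. It is not how any real PDF extractor works; the fragment format `(x, y, text)`, the coordinates, and the function names are all assumptions made up for the example. It shows that concatenating fragments in file order depends entirely on how the file was written, while grouping by vertical position and sorting by horizontal position recovers the visual order in this simple single-column case.

```python
import random

def make_fragments():
    """Build one fragment per character as (x, y, text).

    Hypothetical layout: 6 units per character, 12 units per line,
    with y growing downward.
    """
    lines = ["PDF is a page", "definition format"]
    frags = []
    for row, line in enumerate(lines):
        for col, ch in enumerate(line):
            frags.append((col * 6.0, row * 12.0, ch))
    return frags

def naive_extract(frags):
    """Concatenate the text in the order the fragments are stored."""
    return "".join(text for _, _, text in frags)

def visual_extract(frags, line_tolerance=2.0):
    """Group fragments into lines by y, then order each line by x."""
    lines = []
    for x, y, text in sorted(frags, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return "\n".join(
        "".join(text for _, text in sorted(items)) for _, items in lines
    )

if __name__ == "__main__":
    frags = make_fragments()
    random.shuffle(frags)          # same page visually, different file order
    print(naive_extract(frags))    # scrambled characters
    print(visual_extract(frags))   # the two lines, in reading order
```

Real pages add multiple columns, rotated text, missing space characters and overlapping fragments, so this kind of sorting is only a starting point and does not remove the need for proof-reading.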

_Most_ PDF text, _most_ of the time, is laid on the page in visual
order, but in even the best-behaved files, you are likely to be
surprised.

If you don't _know_ that your PDF text extractor program is
completely visually accurate by design, don't tell your boss that
you can easily extract that PDF text, without allowing time for
proof-reading every page. You will get burned.


That is why I open the PDF file in a separate program, select the
text with the mouse, and copy and paste it (Ctrl-C/Ctrl-V). That way,
I have full control over how the text appears when I select it.

I also use other programs, such as pdfimages, pdftoppm and convert,
to extract the images directly from the PDF. The images can be rotated
or mirrored, which is why convert is useful too. When the images are
split into small pieces, pdftoppm gives me an exact copy of each page
of the PDF, one ppm file per page, which is then converted to jpeg.
In that case, gimp is useful to extract only the images from these
files and cut away the text.
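
As a rough sketch of the workflow described above (this is not the content of the pdf2jpg script, and the function names, file names and output layout are assumptions), the following Python code builds the three kinds of command line. It only prints them; pass a list to subprocess.run() if you want to execute it. The options used are `-j` for pdfimages (write JPEG-encoded images as .jpg files), `-r` for pdftoppm (resolution in DPI), and `-rotate` and `-flop` (horizontal mirror) for ImageMagick's convert.

```python
import shlex
from pathlib import Path

def pdfimages_cmd(pdf, out_dir):
    """Command to extract the embedded images of a PDF."""
    prefix = Path(out_dir) / Path(pdf).stem
    return ["pdfimages", "-j", str(pdf), str(prefix)]

def pdftoppm_cmd(pdf, out_dir, dpi=150):
    """Command to render every page of a PDF to a ppm file."""
    prefix = Path(out_dir) / Path(pdf).stem
    return ["pdftoppm", "-r", str(dpi), str(pdf), str(prefix)]

def convert_cmd(src, dst, rotate=0, mirror=False):
    """Command to convert an image, optionally rotating or mirroring it."""
    cmd = ["convert", str(src)]
    if rotate:
        cmd += ["-rotate", str(rotate)]
    if mirror:
        cmd.append("-flop")  # mirror left to right
    cmd.append(str(dst))
    return cmd

if __name__ == "__main__":
    # "report.pdf" and "out" are hypothetical names used for illustration.
    for cmd in (
        pdfimages_cmd("report.pdf", "out"),
        pdftoppm_cmd("report.pdf", "out"),
        convert_cmd("out/page.ppm", "out/page.jpg", rotate=90, mirror=True),
    ):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids quoting problems with file names that contain spaces.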

The script I use for the images is attached. To use it, place it
somewhere in your path, make sure it is executable, go into the
directory where your PDF file is, and run 'pdf2jpg'. Run that way, it
will only print a help message. Be aware that it will extract from all
the PDF files in that directory in one run. Be also aware that, when
the final output is jpeg files, ppm files are automatically used as
intermediates when needed; the conversion is then much slower, and the
ppm files can use a lot of disk space.

So, if you want to extract pictures from a 100MB PDF file, count on at
least 2GB of temporary disk usage to be safe in all cases. (This is an
estimate from memory, so make your own tests if you don't have a lot
of free disk space.)
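
For a rough idea of where such numbers come from, here is a small Python sketch of the arithmetic. It is not taken from the script, and the assumptions are mine: a binary 24-bit ppm file takes about 3 bytes per pixel, pdftoppm renders at 150 DPI unless told otherwise, every page is US Letter size (8.5 x 11 inches), and all the ppm files exist on disk at the same time. The page count of 300 is purely hypothetical.

```python
def ppm_page_bytes(width_in=8.5, height_in=11.0, dpi=150):
    """Approximate size of one rendered page as a 24-bit ppm (header ignored)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel: R, G, B

def temp_usage_bytes(pages, width_in=8.5, height_in=11.0, dpi=150):
    """Approximate disk space if every page is kept as a ppm at once."""
    return pages * ppm_page_bytes(width_in, height_in, dpi)

if __name__ == "__main__":
    per_page = ppm_page_bytes()
    total = temp_usage_bytes(300)  # hypothetical 300-page document
    print(f"{per_page / 1e6:.1f} MB per page, {total / 1e9:.2f} GB in total")
```

Under these assumptions a page is about 6.3 MB and 300 pages come to about 1.9 GB. The size depends on the page count and resolution, not on the size of the PDF file itself, so your own tests remain the only reliable guide.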

Also, with some distributions, you may have to adjust the names of the
pdfimages and pdftoppm commands in the script. They are part of
poppler on Gentoo (poppler-utils or something like that on Debian);
in the past, they were part of xpdf.

Dominique


The script didn't make it. Here it is:
http://fvwm-crystal.sourceforge.net/other/pdf2jpg

Dominique


I don't know how LO extracts PDF text; perhaps it is very
sophisticated. I have my doubts.

