Re: [libreoffice-design] pdf import design docs?

Thorsten Behrens <thb -AT- libreoffice.org>
Thu, 6 Oct 2016 01:56:28 +0200

Larry Evans wrote:

> Well, I did see code here:
>
>   sdext/source/pdfimport/pdfparse/pdfparse.cxx
>
> but that looked like it used boost/spirit to parse the pdf file
> (about line 553):
>
>             boost::spirit::parse( pBuffer,
>                                   pBuffer+nLen,
>                                   aGrammar,
>                                   boost::spirit::space_p );

That's chiefly to deal with hybrid PDF: the importer needs to detect
early on that, instead of parsing the PDF, it should load the embedded
ODF file. So for understanding real PDF import, simply ignore that part.
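
To make the "detect early on" step concrete, here is a minimal sketch of
that decision, not the actual LibreOffice code. It swaps the spirit-based
trailer parse for a plain byte scan of the end of the file. The helper
names and the marker key "/AdditionalStreams" are assumptions made for
illustration and are not taken from this thread.

```cpp
#include <cstddef>
#include <string_view>

// Which import path to take for a given file.
enum class ImportPath { EmbeddedOdf, PopplerImport };

// Hypothetical helper: returns true if the tail of the file mentions the
// marker key. ASSUMPTION: the default key name is illustrative only.
bool looksLikeHybridPdf(std::string_view fileBytes,
                        std::string_view markerKey = "/AdditionalStreams")
{
    // A PDF trailer sits at the end of the file, so only the last few KiB
    // are inspected instead of the whole document.
    constexpr std::size_t kTailSize = 4096;
    const std::size_t start =
        fileBytes.size() > kTailSize ? fileBytes.size() - kTailSize : 0;
    return fileBytes.substr(start).find(markerKey) != std::string_view::npos;
}

// Hybrid files go straight to the embedded ODF; everything else takes the
// real PDF import path.
ImportPath chooseImportPath(std::string_view fileBytes)
{
    return looksLikeHybridPdf(fileBytes) ? ImportPath::EmbeddedOdf
                                         : ImportPath::PopplerImport;
}
```

A caller would read the file into a buffer and branch on
chooseImportPath(buffer). A byte scan can give false positives if the key
text happens to appear in the tail of a non-hybrid file, which is one
reason a real implementation would parse the trailer properly.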

> Hence, I guess Poppler/xpdf does some sophisticated
> processing that the use of boost::spirit does not do or is
> incapable of doing.  Of course, I'm jumping to conclusions
> which hopefully people of the devel list will correct :)

Yes. Poppler does the actual PDF processing (it's also powering most
of the Linux desktop PDF viewers, like Okular or Evince).

>> In general - it would be -way- better to pick up something like e.g.
>> pdfium - and add a rendering front-end there to match, first, the same
>> protocol (but we can do this in-process), and subsequently to simplify
>> and factor lots of that madness out =) PDFium seems to be gaining
>> traction in browsers (Chrome + Firefox) and so on.


> Thanks for the pointer.  I'm googling for PDFium now.

For the import of PDF into Draw/Writer (compared to simply rendering
PDF as a picture), the above is a bit of a red herring. The added
complexity in terms of code for doing this in a separate process is
pretty low; the challenge for that sort of thing really is decent
layout detection. There's been a GSoC project proposal to hook up
something like Tesseract or other OCR engines to help with that, sadly
with little traction so far. ;)

> I'm trying to solve the problem I posed earlier in this
> post:
>
> https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html

Ah, XFA. Well then, poppler does not have support for that; pdfium
apparently has a branch: https://pdfium.googlesource.com/pdfium/+/xfa
(no idea how usable that is, though). And from the grapevine, XFA
seems pretty dead as an architecture?

> I've also noticed that the font sizes and location of
> letters are sometimes not correct; hence, I'd like to figure
> out how to correct that.

That's mostly due to prioritizing editability over accuracy. The code
to look at is in sdext/source/pdfimport/tree/drawtreevisiting.cxx,
which writes out ODF from the render tree.

Hope that helps,

-- Thorsten
