Re: [libreoffice-design] pdf import design docs?

Larry Evans <cppljevans -AT- suddenlink.net>
Wed, 5 Oct 2016 13:06:06 -0500

On 10/05/2016 10:24 AM, Michael Meeks wrote:
> Hi Larry,
>
>    First - really great to have you looking at that
>       code ! =)

Thanks for the encouragement Michael.

>
> On 10/05/2016 04:10 PM, Larry Evans wrote:
>> I'm trying to understand how the pdf import code works.
>> I've tried looking at the code; however, that's hard to
>> follow; hence, I was hoping there was some sort of design
>> document explaining somewhat how the code works.
>
>    Second - the design list is really for User Experience / developer
> interaction, and this seems like a real gnarly coding problem - so I've
> re-sent it to the dev-list =)

OOPS.  Sorry about that.

>
>> TIA for any pointers.
>
>    Sure - so the PDF import is a bit of a mess; it currently spawns a
> remote process using poppler to parse the PDF, and then extracts (via a
> simple text protocol) data from poppler's rendering to re-constitute into
> internal ODF callbacks to produce an internal document; at least -
> that's if I got it right =)

Well, I did see code here:

  sdext/source/pdfimport/pdfparse/pdfparse.cxx

but that looked like it used boost/spirit to parse the pdf file
(about line 553):

            boost::spirit::parse( pBuffer,
                                  pBuffer+nLen,
                                  aGrammar,
                                  boost::spirit::space_p );

but then, trying to find where that (or the caller of that) was called
led me to:

  sdext/source/pdfimport/wrapper/wrapper.cxx

where there is a call (around line 927):

  std::unique_ptr<pdfparse::PDFEntry> pEntry(
      pdfparse::PDFReader::read( aPDFFile.getStr() ));

but that's called in a function:

  bool checkEncryption

whose name doesn't suggest any translation into something
like the xml which is what libreoffice stores its files as,
IIUC:

  https://en.wikipedia.org/wiki/OpenOffice.org_XML

but, looking further in that file, there's, as you mention,
what looks like a remote process call in function:

  bool xpdf_ImportFromFile

on about line 1079:

        osl_executeProcess_WithRedirectedIO(converterURL.pData,
                                            args,
                                            nArgs,
                                            osl_Process_SEARCHPATH|osl_Process_HIDDEN,
                                            pSecurity,
                                            nullptr, nullptr, 0,
                                            &aProcess, &pIn, &pOut, &pErr);

So that's where I wanted some overall design help, because I
thought it odd that boost::spirit was used to parse the
file, I guess, just to determine whether it was encrypted,
and then, an xpdf process was used to parse the same file
again. That seemed awfully redundant.
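
To check my reading of that second step, here is a minimal sketch of how
I picture the parent side consuming the child's output: read one line at
a time, take the first token as a command, and hand the remaining tokens
to a callback. Everything in it is an assumption for illustration only:
the command names (startPage, drawChar, endPage), their arguments, the
Sink type, and dispatchLines are made up and are not claimed to match
what wrapper.cxx actually does.

```cpp
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the importer's document callbacks:
// it only records what it was told.
struct Sink {
    std::vector<std::string> events;
};

// Reads "command arg arg ..." lines from `in` and dispatches on the
// first token. Returns the number of lines with an unknown command.
inline int dispatchLines(std::istream& in, Sink& sink) {
    using Handler = std::function<void(std::istringstream&, Sink&)>;

    // Command names are invented for this sketch.
    static const std::map<std::string, Handler> handlers = {
        {"startPage", [](std::istringstream& args, Sink& s) {
            double width = 0, height = 0;
            args >> width >> height;
            std::ostringstream out;
            out << "page " << width << "x" << height;
            s.events.push_back(out.str());
        }},
        {"drawChar", [](std::istringstream& args, Sink& s) {
            double x = 0, y = 0, size = 0;
            std::string text;
            args >> x >> y >> size >> text;
            std::ostringstream out;
            out << "char " << text << " at " << x << "," << y
                << " size " << size;
            s.events.push_back(out.str());
        }},
        {"endPage", [](std::istringstream&, Sink& s) {
            s.events.push_back("end page");
        }},
    };

    int unknown = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;                 // skip blank lines
        std::istringstream args(line);
        std::string command;
        args >> command;
        auto it = handlers.find(command);
        if (it == handlers.end()) {
            ++unknown;                // unknown command: count and move on
            continue;
        }
        it->second(args, sink);
    }
    return unknown;
}
```

In the real code the stream would presumably be the pipe connected to
the child's standard output rather than an in-memory stream, but the
dispatch idea would be the same.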



>
>    Poppler/xpdf has a GPL license and so requires all this silliness.
>

Hence, I guess Poppler/xpdf does some sophisticated
processing that the use of boost::spirit does not do or is
incapable of doing.  Of course, I'm jumping to conclusions
which hopefully people on the dev list will correct :)
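
For contrast, an encryption check by itself should need only the file's
structure and no rendering: as I understand the PDF format, an encrypted
file carries an /Encrypt entry in its trailer dictionary. Below is a
deliberately naive sketch of that idea. It is a plain byte scan, not a
parser, and it is not what pdfparse does; the function name
looksEncrypted is made up, and the fallback for files without a
"trailer" keyword is my assumption.

```cpp
#include <string>

// Naive check: does the text from the last "trailer" keyword onward
// mention /Encrypt? A real parser would walk the object structure.
inline bool looksEncrypted(const std::string& pdfBytes) {
    const std::string trailerKeyword = "trailer";
    const std::string encryptKey = "/Encrypt";

    // Incremental updates append new trailers, so start at the last one.
    std::size_t start = pdfBytes.rfind(trailerKeyword);
    if (start == std::string::npos) {
        // Assumption: no "trailer" keyword (for example, a file using
        // cross-reference streams), so fall back to scanning everything.
        start = 0;
    }
    return pdfBytes.find(encryptKey, start) != std::string::npos;
}
```

A scan like this can be fooled (for instance by the same bytes appearing
inside a stream), which is presumably one reason to use a real grammar
even for this small job.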

>    In general - it would be -way- better to pick up something like eg.
> pdfium - and add a rendering front-end there to match first, the same
> protocol (but we can do this in-process), and subsequently to simplify
> and factor lots of that madness out =) PDFium seems to be gaining
> traction in browsers (Chrome + Firefox) and so on.

Thanks for the pointer.  I'm googling for PDFium now.

>
>    Does that make sense ? out of interest, what bug or mis-feature are you
> interested in there ? are you looking at:
>
>    filter/source/pdf
> and        sdext/source/pdfimport

The latter.

>
>    ? =)

I'm trying to solve the problem I posed earlier in this
post:


https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html

I've also noticed that the font sizes and locations of
letters are sometimes not correct; hence, I'd like to figure
out how to correct that.

Thanks for your interest, Michael.

-regards,
Larry
