On 10/05/2016 10:24 AM, Michael Meeks wrote:
> Hi Larry,
>
> First - really great to have you looking at that
> code ! =)
Thanks for the encouragement Michael.
>
> On 10/05/2016 04:10 PM, Larry Evans wrote:
>> I'm trying to understand how the pdf import code works.
>> I've tried looking at the code; however, that's hard to
>> follow; hence, I was hoping there was some sort of design
>> document explaining somewhat how the code works.
>
> Second - the design list is really for User Experience / developer
> interaction, and this seems like a real gnarly coding problem - so I've
> re-sent it to the dev-list =)
OOPS. Sorry about that.
>
>> TIA for any pointers.
>
> Sure - so the PDF import is a bit of a mess; it currently spawns a
> remote process using poppler to parse the PDF, and then extracts (via a
> simple text protocol) data from poppler's rendering to re-constitute into
> internal ODF callbacks to produce an internal document; at least -
> that's if I got it right =)
Well, I did see code here:
    sdext/source/pdfimport/pdfparse/pdfparse.cxx
but that looked like it used boost/spirit to parse the pdf file
(about line 553):
    boost::spirit::parse( pBuffer,
                          pBuffer+nLen,
                          aGrammar,
                          boost::spirit::space_p );
but then, trying to find where that (or the caller of that) was called
led me to:
    sdext/source/pdfimport/wrapper/wrapper.cxx
where there is a call (around line 927):
    std::unique_ptr<pdfparse::PDFEntry> pEntry(
        pdfparse::PDFReader::read( aPDFFile.getStr() ));
but that's called in a function:
    bool checkEncryption
whose name doesn't suggest any translation into something
like the XML that LibreOffice stores its files as,
IIUC:
    https://en.wikipedia.org/wiki/OpenOffice.org_XML
but, looking further in that file, there's, as you mention,
what looks like a remote process call in function:
    bool xpdf_ImportFromFile
on about line 1079:
    osl_executeProcess_WithRedirectedIO(converterURL.pData,
                                        args,
                                        nArgs,
                                        osl_Process_SEARCHPATH|osl_Process_HIDDEN,
                                        pSecurity,
                                        nullptr, nullptr, 0,
                                        &aProcess, &pIn, &pOut, &pErr);
So that's where I wanted some overall design help, because I
thought it odd that boost::spirit was used to parse the
file, I guess, just to determine whether it was encrypted,
and then, an xpdf process was used to parse the same file
again. That seemed awfully redundant.
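
To make the "simple text protocol" idea concrete, here is a minimal
sketch of how the parent side of such a design might consume the
child's output: one command per line, a command name followed by
whitespace-separated arguments. This is NOT the protocol that
wrapper.cxx actually uses. The command names (setFont, drawText),
the DocumentSink struct, and the functions parseLine and consume are
all hypothetical, made up for illustration only.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One decoded protocol line: a command name plus its arguments.
struct Command {
    std::string name;
    std::vector<std::string> args;
};

// Split a line such as "setFont Helvetica 12.5" on whitespace.
// The first token is the command name, the rest are arguments.
Command parseLine(const std::string& line) {
    std::istringstream in(line);
    Command cmd;
    in >> cmd.name;
    for (std::string tok; in >> tok;)
        cmd.args.push_back(tok);
    return cmd;
}

// Hypothetical stand-in for the callbacks that build the document.
struct DocumentSink {
    std::string fontName;
    double fontSize = 0.0;
    std::string text;
};

// Read commands line by line (in the real design this stream would be
// the child process's redirected stdout) and dispatch on the name.
// Returns the number of lines that were not understood.
// Note: std::stod throws on a malformed number; a real reader would
// have to handle that.
int consume(std::istream& in, DocumentSink& sink) {
    int unknown = 0;
    for (std::string line; std::getline(in, line);) {
        if (line.empty())
            continue;
        const Command cmd = parseLine(line);
        if (cmd.name == "setFont" && cmd.args.size() == 2) {
            sink.fontName = cmd.args[0];
            sink.fontSize = std::stod(cmd.args[1]);
        } else if (cmd.name == "drawText" && !cmd.args.empty()) {
            // Re-join the arguments with single spaces.
            for (std::size_t i = 0; i < cmd.args.size(); ++i) {
                if (i != 0)
                    sink.text += ' ';
                sink.text += cmd.args[i];
            }
        } else {
            ++unknown;
        }
    }
    return unknown;
}
```

Feeding consume() a std::istringstream holding a few such lines is
enough to try it out. The point of the design is that the parent only
ever sees plain text, so it needs no knowledge of the child's internal
data structures.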
>
> Poppler/xpdf has a GPL license and so requires all this silliness.
>
Hence, I guess Poppler/xpdf does some sophisticated
processing that the boost::spirit parser does not do or is
incapable of doing. Of course, I'm jumping to conclusions,
which hopefully people on the dev list will correct :)
> In general - it would be -way- better to pick up something like e.g.
> pdfium - and add a rendering front-end there to match first, the same
> protocol (but we can do this in-process), and subsequently to simplify
> and factor lots of that madness out =) PDFium seems to be gaining
> traction in browsers (Chrome + Firefox) and so on.
Thanks for the pointer. I'm googling for PDFium now.
>
> Does that make sense ? out of interest, what bug or mis-feature are you
> interested in there ? are you looking at:
>
> filter/source/pdf
> and sdext/source/pdfimport
The latter.
>
> ? =)
I'm trying to solve the problem I posed earlier in this
post:
https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html
I've also noticed that the font sizes and locations of
letters are sometimes not correct; hence, I'd like to figure
out how to correct that.
Thanks for your interest, Michael.
-regards,
Larry