Larry Evans wrote:
Well, I did see code here: sdext/source/pdfimport/pdfparse/pdfparse.cxx but that looked like it used boost/spirit to parse the pdf file (about line 553): boost::spirit::parse( pBuffer, pBuffer+nLen, aGrammar, boost::spirit::space_p );
That's chiefly to deal with hybrid PDF: the parser needs to detect early on that, instead of parsing the PDF, it should load the embedded ODF file. So for understanding real PDF import, simply ignore that part.
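To illustrate the early hybrid-PDF check, here is a minimal C++ sketch. It is not the actual sdext code, which goes through the parser to inspect the trailer. The helper name `looksLikeHybridPdf`, the `/AdditionalStreams` marker key, and the 4096-byte tail window are all assumptions made for illustration; the sketch only does a naive text scan near the end of the file.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of sdext: reports whether the tail of a PDF
// buffer mentions a marker key that is assumed to flag an embedded ODF stream.
// Both the default marker and the default tail size are assumptions.
bool looksLikeHybridPdf(std::string_view pdf,
                        std::string_view marker = "/AdditionalStreams",
                        std::size_t tailBytes = 4096)
{
    // The trailer normally sits near the end of the file, so only the last
    // tailBytes bytes are scanned instead of the whole document.
    const std::size_t start = pdf.size() > tailBytes ? pdf.size() - tailBytes : 0;
    return pdf.substr(start).find(marker) != std::string_view::npos;
}
```

A caller would run this check first and hand the file to the ODF loader when it returns true, falling back to the regular PDF import path otherwise.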
Hence, I guess Poppler/xpdf does some sophisticated processing that the use of boost::spirit does not do or is incapable of doing. Of course, I'm jumping to conclusions which hopefully people of the devel list will correct :)
Yes. Poppler does the actual pdf processing (it's also powering most of the linux desktop pdf viewers, like okular or evince).
In general - it would be -way- better to pick up something like e.g. pdfium - and add a rendering front-end there to match, first, the same protocol (but we can do this in-process), and subsequently to simplify and factor lots of that madness out =) PDFium seems to be gaining traction in browsers (Chrome + Firefox) and so on.
Thanks for the pointer. I'm googling for PDFium now.
For the import of PDF into Draw/Writer (compared to simply rendering PDF as a picture), the above is a bit of a red herring. The added complexity in terms of code for doing this in a separate process is pretty low; the challenge for that sort of thing really is decent layout detection. There's been a GSoC project proposal to hook up something like Tesseract or other OCR engines to help with that, sadly with little traction so far. ;)
I'm trying to solve the problem I posed earlier in this post: https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html
Ah, XFA. Well then, poppler does not have support for that; pdfium apparently has a branch: https://pdfium.googlesource.com/pdfium/+/xfa - no idea how usable that is though. And from the grapevine, XFA seems pretty dead as an architecture?
I've also noticed that the font sizes and locations of letters are sometimes not correct; hence, I'd like to figure out how to correct that.
That's mostly due to prioritizing editability over accuracy. The code to look at is in sdext/source/pdfimport/tree/drawtreevisiting.cxx, which writes out ODF from the render tree.
Hope that helps,
-- Thorsten