2017-07-09 23:58 GMT+02:00 Jean-Francois Nifenecker <jean-francois.nifenecker@laposte.net>:

Hello Gilles,

On 09/07/2017 at 19:20, Gilles wrote:

Hello,

This PDF file
<https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf>
has no Table of Contents, and I was wondering if LO could grab all the
headers and build a TOC.


In order to create a PDF with a TOC/index you'll have to set heading
styles to the appropriate paragraphs.

Opening the PDF in LibO won't get you anywhere: PDFs open in Draw, which
can't apply the paragraph styles a word processor uses.

I can't see a way to do that quickly, I'm afraid: a copy/paste from the
PDF document to Writer is possible, but you'll have to fix a lot of things
(e.g. useless carriage returns) and apply heading styles by hand. On a 400+
page document this is a big PITA.

Hopefully someone else will come up with brighter ideas.



You want brighter ideas? Say no more!

So... hmm... I'm afraid there won't be many fully automated tools that can
build a TOC for you. A PDF basically contains a lot of individual elements
that are arranged to look like something coherent.
From the document you linked, it could theoretically be possible to write a
tool that splits out every page, grabs the raw text, uses a regex to find the
actual titles, builds a TOC, and injects it into the PDF. This would assume:
- Text extraction works correctly (it's not always the case with PDF)
- Titles always follow the same format

But on this kind of document, you could definitely get some acceptable
results. I experimented a bit. The output is here:
http://www.cjoint.com/c/GGjw0OtPkGc
And for the curious, the "script" I used is here:
https://pastebin.com/icQSZxQr

As you'll see, it is VERY specific to this document, but it is possible to
do something.
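
To give an idea of the "regex to find actual titles" step, here is a minimal
sketch. It is not the pastebin script. It assumes the per-page text has
already been extracted into a list of strings by some PDF library, and that
headings start with "Livre", "Titre" or "Chapitre" followed by a number. The
pattern, the level mapping and the function names are all hypothetical, and
injecting the result back into the PDF is left out.

```python
import re

# Hypothetical heading pattern: a keyword, then a Roman or Arabic number,
# optionally followed by "er"/"e" (as in "Ier"). Adjust for the real document.
HEADING_RE = re.compile(r"^(Livre|Titre|Chapitre)\s+(?:[IVXLC]+|\d+)(?:er|e)?\b")

# Assumed nesting: Livre > Titre > Chapitre.
LEVELS = {"livre": 1, "titre": 2, "chapitre": 3}


def build_toc(pages):
    """Scan per-page text and return (level, title, page_number) tuples."""
    toc = []
    for page_number, text in enumerate(pages, start=1):
        for raw_line in text.splitlines():
            line = raw_line.strip()
            match = HEADING_RE.match(line)
            if match:
                level = LEVELS[match.group(1).lower()]
                toc.append((level, line, page_number))
    return toc


def format_toc(toc):
    """Render the entries as indented plain text, one entry per line."""
    return "\n".join(
        f"{'  ' * (level - 1)}{title} ... {page}" for level, title, page in toc
    )
```

Calling build_toc() on the list of page strings and passing the result to
format_toc() gives a plain-text outline. Turning that into real PDF bookmarks
still needs a PDF library, and it only works as long as both assumptions
above hold.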
