Re: [libreoffice-users] Scanned and OCR's PDF to text

Albrecht Dreß <albrecht.dress -AT- arcor.de>
Mon, 09 Jun 2025 12:01:53 +0200

On 09.06.25 at 01:45, Leo L te Braake wrote:

  * If somehow I get this text without the graphics in a LO Draw file,
    will I be able to make a Writer file out of it?
  * Is there a better route between the PDF and a .csv file?


Not sure if I understood your issue completely…

If you have a PDF that contains both the scanned bitmap and the plain text from OCR, you can use a
command-line tool such as “pdftotext” (in the “poppler-utils” package on Debian) or similar to
extract the text layer.
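
A minimal sketch of that extraction step, driven from Python so it can feed a later script. It assumes “pdftotext” is installed and on the PATH; the function names and the file names “scan.pdf” and “scan.txt” are made up for illustration. The “-layout” option asks pdftotext to keep the physical column layout, which tends to help when the text is tabular.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call (helper name is illustrative)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout keeps the physical layout, so table columns stay aligned
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text_layer(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; on Debian it is in poppler-utils")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical file names, adjust to your own files
    print(extract_text_layer("scan.pdf", "scan.txt")[:500])
```

The same thing on the command line would be “pdftotext -layout scan.pdf scan.txt”.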

If the quality of the scan is (as you mentioned) rather poor, but you have access to a
higher-quality scan as PDF, have a look at OCRmyPDF (<https://github.com/ocrmypdf/OCRmyPDF>).  It
runs Tesseract as the OCR engine on the PDF input and produces a combined bitmap/text PDF.  It can
also write the OCR text to a separate file (have a look at the “--sidecar” option).
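
As a rough sketch, an OCRmyPDF run with a sidecar text file could be wrapped like this. It assumes “ocrmypdf” is installed and on the PATH; the function names and all file names are placeholders, and the optional language code is only passed on if you set it.

```python
import shutil
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language=None):
    """Build the argument list for an ocrmypdf call (helper name is illustrative)."""
    cmd = ["ocrmypdf", "--sidecar", str(sidecar_txt)]
    if language:
        # Tesseract language code, e.g. "eng" or "deu"
        cmd += ["-l", language]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def run_ocr(input_pdf, output_pdf, sidecar_txt, language=None):
    """Run OCRmyPDF; the plain OCR text ends up in sidecar_txt."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(
        ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical file names, adjust to your own files
    run_ocr("scan-hires.pdf", "scan-ocr.pdf", "scan-ocr.txt", language="eng")
```

Note that OCRmyPDF by default refuses input pages that already carry a text layer; in that case an option such as “--force-ocr” may be needed.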

Once you have the plain-text output, it should be feasible to write a script (Python, Perl,
whatever) to extract the relevant data as CSV.
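
What such a script looks like depends entirely on your data, which I have not seen. Purely as an illustration, the following sketch assumes that the columns in the plain text are separated by runs of two or more spaces and that lines with fewer than two columns (page headers, stray words) can be dropped. The function names and the sample text are invented.

```python
import csv
import io
import re


def text_to_rows(text, min_columns=2):
    """Split each non-empty line into fields at runs of 2+ whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = re.split(r"\s{2,}", line)
        # Assumption: lines with too few columns are not table data
        if len(fields) >= min_columns:
            rows.append(fields)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV; the csv module takes care of quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Invented sample, standing in for the extracted OCR text
    sample = (
        "Date        Item       Amount\n"
        "\n"
        "Page 1\n"
        "2025-06-01  Paper, A4  12.50\n"
    )
    print(rows_to_csv(text_to_rows(sample)), end="")
```

The resulting CSV can be opened directly in Calc.  OCR errors (e.g. a single space where a column gap should be) will need extra handling specific to your documents.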

Hth,
Albrecht.