On 04/07/2022 17:59, Ian Bertram wrote:
I have been sent a graphic-heavy document in ODT format. However, it looks as if it has been badly
converted from a PDF file. The layout is scrambled, headers don’t align properly and there are a
host of other issues. It is also in columns. Is there a simple way to strip out everything bar the
words? I have tried saving it as a txt file, but this loses a lot of the paragraph numbering and
introduces other layout issues. Saving in RTF format is even worse.

You might try this.

Open the .odt file with an archive manager (it's just a zip file). Extract content.xml, then run this Perl program in the same directory; it prints the text of each top-level paragraph in that file:


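use strict;
use warnings;

use XML::LibXML;

# content.xml as extracted from the .odt archive
my $filename = 'content.xml';

# The document text is Unicode, so write UTF-8
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

my $dom = XML::LibXML->load_xml(location => $filename) or die "open?\n";

# Visit each paragraph that sits directly under office:text
foreach my $para ($dom->findnodes('/office:document-content/office:body/office:text/text:p')) {
        # to_literal gives the paragraph's text with the markup stripped
        my $text = $para->to_literal;
        print $text, "\n";
}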

Works for me but YMMV. BEWARE email line wrap in the 'foreach' line. The single line should say foreach ....... p')) {
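
One limitation worth knowing about: the XPath in that program only matches text:p elements directly under office:text. Headings (text:h) and paragraphs inside sections (often how columns are stored), tables, lists or frames are skipped. The variant below is a rough sketch rather than something I have run against your document. It reads content.xml straight out of the .odt, so there is no manual extraction step, and it searches at any depth. The sub name odt_paragraphs and the script name in the comment are made up for the example; IO::Uncompress::Unzip ships with Perl, XML::LibXML does not.

```perl
use strict;
use warnings;

use IO::Uncompress::Unzip qw(unzip $UnzipError);
use XML::LibXML;

# Return the text of every paragraph and heading in an .odt file,
# in document order. (odt_paragraphs is a name invented for this sketch.)
sub odt_paragraphs {
    my ($odt) = @_;

    # Pull the content.xml member out of the zip container into memory.
    my $xml;
    unzip($odt => \$xml, Name => 'content.xml')
        or die "cannot read content.xml from $odt: $UnzipError\n";

    my $dom = XML::LibXML->load_xml(string => $xml);

    # '//' matches text:p and text:h at any depth, so text inside
    # sections, tables, lists and frames is picked up as well.
    # A paragraph nested inside another one (e.g. in a text box) comes
    # out twice: once within its parent's text and once on its own.
    return map { $_->textContent } $dom->findnodes('//text:p | //text:h');
}

# Usage (hypothetical script name): perl odt2txt.pl file.odt > file.txt
if (my $odt = shift @ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$_\n" for odt_paragraphs($odt);
}
```

Paragraph numbering that the document generates automatically is not stored as text in content.xml, so it will not appear in the output of either program.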

Mike Scott (unet2 <at> [deletethis]
Harlow Essex England
"The only way is Brexit" -- anon.
