On 04/07/2022 17:59, Ian Bertram wrote:
....
I have been sent a graphic-heavy document in ODT format. However, it looks as if it has been badly
converted from a PDF file. The layout is scrambled, headers don’t align properly and there are a
host of other issues. It is also in columns. Is there a simple way to strip out everything bar the
words? I have tried saving it as a txt file, but this loses a lot of the paragraph numbering and
introduces other layout issues. Saving in RTF format is even worse.
You might try this.
Open the .odt with an archive manager (it's just a zip file), extract
content.xml, then run this Perl program, which will extract the text
from it:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $filename = 'content.xml';

# Emit UTF-8 so non-ASCII characters survive.
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# load_xml dies with its own message if the file cannot be read or parsed.
my $dom = XML::LibXML->load_xml(location => $filename);

# Print each top-level paragraph of the document body on its own line.
foreach my $para ($dom->findnodes('/office:document-content/office:body/office:text/text:p')) {
    print $para->to_literal, "\n";
}
Works for me, but YMMV. Beware email line wrap in the 'foreach' line: it
should be a single line reading foreach ....... p')) {
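
If extracting content.xml by hand gets tedious, a variation along these
lines might be worth a try. It is only a sketch: it reads content.xml
straight out of the .odt with IO::Uncompress::Unzip and, unlike the
program above, also picks up headings (text:h) and paragraphs nested
inside lists, tables and sections, which the path above does not match.
The sub name odf_text_lines and the script name odt2txt.pl are made up
for illustration, and I have assumed the usual ODF text namespace URI.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);
use XML::LibXML;

# Hypothetical helper: return the text of every heading and paragraph
# found in an ODF content.xml string, in document order.
sub odf_text_lines {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($dom);

    # Assumption: the standard ODF text namespace. Registering it here
    # avoids relying on whatever prefixes the document itself declares.
    $xpc->registerNs(text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    # Skip anything nested inside another paragraph (e.g. text boxes),
    # because the outer paragraph's text already includes it.
    my $xpath = '//text:h[not(ancestor::text:p)]'
              . ' | //text:p[not(ancestor::text:p)]';
    return map { $_->textContent } $xpc->findnodes($xpath);
}

# Only runs when called as a script with a file name argument.
if (!caller && @ARGV) {
    my $odt = shift @ARGV;
    my $xml;
    unzip($odt => \$xml, Name => 'content.xml')
        or die "Cannot read content.xml from $odt: $UnzipError\n";
    binmode(STDOUT, ':utf8');
    print "$_\n" for odf_text_lines($xml);
}
```

Run it as: perl odt2txt.pl document.odt > document.txt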
--
Mike Scott (unet2 <at> [deletethis] scottsonline.org.uk)
Harlow Essex England
"The only way is Brexit" -- anon.