mso-dumper: utf16 conversion

jf -AT- dockes.org
Mon, 18 Nov 2013 10:35:32 +0100


Hi,

I am looking into building a PowerPoint text extractor on top of mso-dumper.

While doing this, I found that the routine which converts the UTF-16 text
of TextChars Atoms was not working for me.

The new version (first attached patch) was tested on a variety of inputs,
including Chinese, Vietnamese and several European languages. It produces
text matching what LibreOffice displays, instead of outputting
"<xx invalid chars>" messages. See the commit comment for more detailed
explanations.
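
For readers without the patch at hand, here is a minimal sketch of the kind
of conversion involved. It is not the code from the attached diff: the
function name decode_text_chars is hypothetical, and it assumes the atom
payload is UTF-16 little-endian without a byte order mark.

```python
def decode_text_chars(raw: bytes) -> str:
    """Decode the payload of a TextChars atom to a Python string.

    Assumption: the payload is UTF-16 little-endian with no BOM.
    Undecodable sequences (for example a truncated trailing byte) are
    replaced with U+FFFD instead of raising an exception.
    """
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    sample = "Tiếng Việt 中文".encode("utf-16-le")
    print(decode_text_chars(sample))
```

Using errors="replace" keeps the extractor running on damaged input, at the
cost of silently marking the bad spots with a replacement character.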

Also, the method which processed the text of TextBytes Atoms assumed that
it consisted of ASCII characters, which sometimes caused problems too
(wrong display or exceptions).

The new version (second attached patch) decodes from cp1252, and works
better on the files I tried. See the commit message for more details about
the choice of encoding.
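
To illustrate the difference, here is a small sketch, again not the code
from the attached diff. The function name decode_text_bytes is hypothetical.
It decodes 8-bit text as cp1252 and replaces the few byte values that
Python's cp1252 codec leaves undefined rather than raising.

```python
def decode_text_bytes(raw: bytes) -> str:
    """Decode the payload of a TextBytes atom, assuming cp1252.

    A strict ASCII decode raises UnicodeDecodeError on any byte >= 0x80.
    cp1252 maps almost all of them; the handful of undefined values
    (such as 0x81) become U+FFFD thanks to errors="replace".
    """
    return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    data = b"caf\xe9 \x93quoted\x94"
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        print("ascii decoding fails:", exc)
    print(decode_text_bytes(data))
```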

Cheers,

jf

Attachment: utf16conv.diff
Description: Binary data

Attachment: 8bitbytes.diff
Description: Binary data
