Hi,

I am looking into building a PowerPoint text extractor on top of mso-dumper. While doing this, I found that the routine used to convert UTF-16 text from the TextChars Atoms was not working for me. The new version (first attached patch) was tested on a variety of inputs, including Chinese, Vietnamese and several European languages, and it produces text matching what LibreOffice displays, instead of outputting "<xx invalid chars>" messages. See the commit message for a more detailed explanation.

Also, the method that extracted text from TextBytes Atoms assumed that these contained ASCII characters, which sometimes caused problems (wrong display or exceptions). The new version decodes from cp1252 and works better on the files I tried. Again, see the commit message for more details about the choice of encoding.

Cheers,
jf
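P.S. For readers without the patches at hand, here is a minimal sketch of the two decoding steps described above. It is not the code from the patches: the function names are hypothetical, and it assumes that the TextChars payload is little-endian UTF-16 and that undecodable input should be replaced rather than raise an exception.

```python
# Illustrative sketch only; the function names are hypothetical and are
# not taken from mso-dumper or from the attached patches.

def decode_text_chars(payload: bytes) -> str:
    """Decode a TextChars atom payload, assumed here to be UTF-16LE.

    errors="replace" turns malformed sequences into U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return payload.decode("utf-16-le", errors="replace")


def decode_text_bytes(payload: bytes) -> str:
    """Decode a TextBytes atom payload as cp1252 rather than ASCII.

    A few byte values (for example 0x81) are undefined in Python's
    cp1252 codec; errors="replace" maps them to U+FFFD.
    """
    return payload.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_text_chars("Ti\u1ebfng Vi\u1ec7t \u4e2d\u6587".encode("utf-16-le")))
    print(decode_text_bytes(b"caf\xe9 \x80"))
```

With plain ASCII decoding, the second call would raise an exception on the bytes 0xE9 and 0x80; with cp1252 they decode to "é" and "€".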
Attachment: utf16conv.diff
Attachment: 8bitbytes.diff