Re: Character encoding not being detected when using Link to external source in calc

Chris Sherlock <chris.sherlock79 -AT- gmail.com>
Wed, 6 Jan 2016 04:27:51 +1100

Thanks Mark, appreciate these code pointers!

(I’m cc’ing in the mailing list so others can comment)

Chris

On 4 Jan 2016, at 8:21 PM, Mark Hung <marklh9@gmail.com> wrote:


I meant there is a chance for SvParser::GetNextChar() to switch encoding, but yes it is less 
relevant.

Grepping for content-type under ucb, there is some suspicious code:
http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav-neon/ContentProperties.cxx#454
http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav/ContentProperties.cxx#471

Which seems inconsistent with
http://opengrok.libreoffice.org/xref/core/sc/source/filter/html/htmlpars.cxx#264


2016-01-04 16:17 GMT+08:00 Chris Sherlock <chris.sherlock79@gmail.com>:
Hi Mark,

BOM detection is irrelevant here. The HTTP header states that it should be UTF8, but this is not 
being honoured. 

There is something further down the stack that isn’t recording the HTTP headers. 

Chris

On 4 Jan 2016, at 4:23 PM, Mark Hung <marklh9@gmail.com> wrote:

Hi Chris,

I've recently been working on SvParser and HTMLParser.

There is BOM detection in SvParser::GetNextChar().
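
For readers unfamiliar with BOM detection, here is a minimal sketch of the general idea. It is not the code in SvParser::GetNextChar(); the function name detectBom is hypothetical, and UTF-32 marks are deliberately left out to keep it short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not LibreOffice code: name the encoding implied by a
// leading byte order mark, or return an empty string when there is none.
std::string detectBom(const std::string& bytes)
{
    // Read a byte as an unsigned value so comparisons against 0xEF etc. work.
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return "UTF-8";
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return "UTF-16BE";
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return "UTF-16LE";
    return std::string(); // no BOM found; the caller needs another source
}
```

A page served without a BOM, which is the usual case for web content, yields an empty result here, so the encoding has to come from somewhere else, such as the HTTP headers.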

A quick look at eehtml: EditHTMLParser::EditHTMLParser seems relevant.

Best regards.


2016-01-04 12:02 GMT+08:00 Chris Sherlock <chris.sherlock79@gmail.com>:
Hey guys, 

Probably nobody saw this because of the time of year (Happy New Year, incidentally!!!).

Just a quick ping to the list to see if anyone can give me some pointers. 

Chris

On 30 Dec 2015, at 12:15 PM, Chris Sherlock <chris.sherlock79@gmail.com> wrote:

Hi guys,

In bug 95217 - https://bugs.documentfoundation.org/show_bug.cgi?id=95217 - Persian text in a 
webpage encoded as UTF-8 is being corrupted.

If I take the webpage and save it to an HTML file encoded as UTF-8, then there are no problems and 
the Persian text comes through fine. However, when connecting to a webserver directly the text is 
corrupted, even though the HTTP header correctly gives the charset as UTF-8.

I did a test using Charles Proxy with its SSL interception feature turned on and pointed Safari 
to https://bugs.documentfoundation.org/attachment.cgi?id=119818

The following headers are gathered:

HTTP/1.1 200 OK
Server: nginx/1.2.1
Date: Sat, 26 Dec 2015 01:41:30 GMT
Content-Type: text/html; name="text.html"; charset=UTF-8
Content-Length: 982
Connection: keep-alive
X-xss-protection: 1; mode=block
Content-disposition: inline; filename="text.html"
X-content-type-options: nosniff

Some warnings are spat out saying that editeng's eehtml can't detect the encoding. I initially 
thought it was looking for a BOM, which makes no sense for a webpage, but that's wrong. 
Instead, for some reason the headers don't seem to be processed and the HTML parser is falling 
back to ISO-8859-1 rather than UTF-8 as the character encoding.
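
To make concrete what honouring the header would involve, here is a rough sketch that pulls the charset parameter out of a Content-Type value like the one captured above. The function name charsetFromContentType is made up for illustration and is not an existing LibreOffice or Neon function; it also assumes, as a simplification, that no ';' appears inside a quoted parameter value.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: return the charset parameter of a Content-Type value,
// or an empty string when the header carries no charset.
std::string charsetFromContentType(const std::string& value)
{
    // Search a lower-cased copy so "Charset=" and "CHARSET=" also match.
    std::string lower(value);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = 0;
    while ((pos = lower.find(';', pos)) != std::string::npos)
    {
        ++pos; // step past the ';' that starts a parameter
        while (pos < lower.size() && std::isspace(static_cast<unsigned char>(lower[pos])))
            ++pos;
        if (lower.compare(pos, key.size(), key) != 0)
            continue; // some other parameter, e.g. name="text.html"

        // Take the value from the original string to preserve its case.
        const std::size_t start = pos + key.size();
        const std::size_t end = value.find(';', start);
        std::string charset = value.substr(
            start, end == std::string::npos ? std::string::npos : end - start);

        // Drop trailing whitespace and an optional pair of double quotes.
        while (!charset.empty() && std::isspace(static_cast<unsigned char>(charset.back())))
            charset.pop_back();
        if (charset.size() >= 2 && charset.front() == '"' && charset.back() == '"')
            charset = charset.substr(1, charset.size() - 2);
        return charset;
    }
    return std::string();
}
```

Fed the header captured above, this sketch yields "UTF-8", which is the value the parser would need to see instead of falling back.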

We seem to use Neon to make the GET request to the webserver. A few observations:

1. We detect a server OK response as an error
2. (Probably more to the point) I believed PROPFIND was being used, but even though the function 
being called indicates a PROPFIND verb, a GET is actually issued, as is normal, and the 
headers aren't being stored. This means that when the parser looks for the headers to find the 
encoding it doesn't find anything, resulting in a fallback to ISO-8859-1.
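
To illustrate the second point, here is a sketch of the kind of header store the parser would need to be able to query. The class ResponseHeaders is purely hypothetical and is not how Neon or the UCB layer represent headers; it only shows that lookups should ignore case and that a missing entry leaves the caller with nothing but a fallback.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical container, not the Neon or UCB API.
class ResponseHeaders
{
public:
    // Record one raw "Name: value" line; lines without a colon, such as the
    // status line "HTTP/1.1 200 OK", are ignored.
    void addLine(const std::string& line)
    {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            return;
        const std::size_t start = line.find_first_not_of(" \t", colon + 1);
        std::string value = start == std::string::npos ? std::string() : line.substr(start);
        // Strip the trailing CR/LF and any trailing blanks.
        while (!value.empty() && std::isspace(static_cast<unsigned char>(value.back())))
            value.pop_back();
        headers_[toLower(line.substr(0, colon))] = value;
    }

    // Look a header up by name, ignoring case; empty when it was never stored.
    std::string get(const std::string& name) const
    {
        const auto it = headers_.find(toLower(name));
        return it == headers_.end() ? std::string() : it->second;
    }

private:
    static std::string toLower(std::string s)
    {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return s;
    }

    std::map<std::string, std::string> headers_;
};
```

If the GET response never reaches a store like this, a later get("Content-Type") returns an empty string, which matches the symptom described above.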

One easy thing (it doesn't solve the root issue): wouldn't it be a better idea to fall back 
to UTF-8 and not ISO-8859-1, given that the ASCII range is encoded identically in both?
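
One caveat when weighing the fallback: bytes above 0x7F in ISO-8859-1 text are generally not well-formed UTF-8, so a blind switch could break pages that really are Latin-1. A possible middle ground, sketched below under that assumption, is to check whether the bytes are well-formed UTF-8 and only then prefer it. Both function names are hypothetical and this is not what the HTML parser currently does.

```cpp
#include <cstddef>
#include <string>

// Returns true when the bytes are well-formed UTF-8 (no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool isValidUtf8(const std::string& bytes)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long minCp[] = { 0x0, 0x80, 0x800, 0x10000 };

    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0; // number of continuation bytes expected
        unsigned long cp = 0;
        if (b0 < 0x80) { ++i; continue; }                       // plain ASCII
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
        else return false;                                      // stray byte

        if (n - i <= extra)
            return false; // sequence is cut off at the end of the input
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical policy, meant only for when no charset was declared anywhere.
std::string guessFallbackEncoding(const std::string& bytes)
{
    return isValidUtf8(bytes) ? "UTF-8" : "ISO-8859-1";
}
```

With this policy the UTF-8 encoded Persian text from the bug would be picked up as UTF-8, while a Latin-1 page containing a lone 0xE9 byte would still be read as ISO-8859-1.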

Any pointers on how to get to the bottom of this would be appreciated, I'm honestly not up on 
webdav or Neon.

Chris Sherlock







-- 
Mark Hung





