

On 01/10/12 13:25, Michael Meeks wrote:

On Mon, 2012-10-01 at 13:02 +0200, Noel Grandin wrote:
That was something I was thinking about the other day - given that the 
bulk of our strings are pure 7-bit ASCII, it might be a worthwhile 
optimisation to store a bit that says "this string is 7-bit ASCII", and 
then store the string as a sequence of bytes.
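
[Editor's sketch, not from the thread: a minimal illustration of the "7-bit ASCII flag" idea, written in Python for brevity rather than in LibreOffice's C++. The class name `CompactString` and its methods are hypothetical and do not correspond to any existing LibreOffice or UNO API; it assumes Python 3.7+ for `str.isascii()`.]

```python
class CompactString:
    """Hypothetical wrapper: one byte per character for pure 7-bit ASCII
    text, two-byte UTF-16 code units for everything else."""

    def __init__(self, text: str) -> None:
        # The single flag the proposal asks for.
        self.is_ascii = text.isascii()
        self.data = text.encode(self._encoding())

    def _encoding(self) -> str:
        # ASCII strings are stored as plain bytes; the rest as UTF-16
        # (little-endian, no byte-order mark).
        return "ascii" if self.is_ascii else "utf-16-le"

    def storage_bytes(self) -> int:
        """Number of bytes used for the character data."""
        return len(self.data)

    def __str__(self) -> str:
        return self.data.decode(self._encoding())


plain = CompactString("hello")
accented = CompactString("h\u00e9llo")  # contains a non-ASCII character

print(plain.is_ascii, plain.storage_bytes())
print(accented.is_ascii, accented.storage_bytes())
```

Under these assumptions the ASCII string takes half the space of its UTF-16 form, at the cost of a branch on the flag wherever the data is read.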

      Optimisation ? :-) IMHO the ideal is to store all strings as UTF-8
underneath the hatches anyway. All the people I've discussed this with
that objected to that, turned out (after some discussion) to have a weak
understanding of UTF-8, UTF-16 and of rendering complex text ;-) Of
course, perhaps I should discuss with more people.

      The only problem with a change there is our ABI - which explicitly
exposes the encoding of that.

the right time to do it is for LO4.  sadly nobody has signed up for that
yet :( ... (while there are volunteers for far sillier proposals, like
getting rid of com.sun.star...)

of course this would only affect C++ binding (and possibly Python -- am
not up to date how that does Unicode; there are differences between 2
and 3 iirc; of course we should migrate to Python 3 as well...), while
Java binding still uses UTF-16 but i assume we have to copy strings
passed over the Java UNO bridge anyway.

The latest Java VM does this trick internally - it pretends that String 
is stored with an array of 16-bit values, but actually it stores them as 
UTF-8.

      Interesting - for all strings ? is there a pointer to the code / docs
for that detail somewhere ? :-) Last I looked Java also stored partial

i would expect they take advantage of JVM's tendency to generate code at
runtime to some non-trivial extent :)

strings chained to its parent; so 'substring' takes a reference on the
parent (be it ever so large), and can return a single character string
out of it without re-allocation. IIRC that can cause huge grief when
parsing big files into little ones ;-)
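
[Editor's sketch, not from the thread: a small Python model of the buffer-sharing 'substring' behaviour described above, to show why it can pin a large parent in memory. `Buffer`, `SharedString`, `compact` and `retained_chars` are hypothetical names chosen for illustration; this is not the JVM's actual implementation.]

```python
class Buffer:
    """Hypothetical immutable backing store for character data."""

    def __init__(self, chars: str) -> None:
        self.chars = chars


class SharedString:
    """Hypothetical string that views a slice of a shared Buffer."""

    def __init__(self, buffer: Buffer, offset: int, length: int) -> None:
        self.buffer = buffer
        self.offset = offset
        self.length = length

    @classmethod
    def from_text(cls, text: str) -> "SharedString":
        return cls(Buffer(text), 0, len(text))

    def substring(self, begin: int, end: int) -> "SharedString":
        """No copy: the result references the parent's whole buffer."""
        if not 0 <= begin <= end <= self.length:
            raise IndexError("substring range out of bounds")
        return SharedString(self.buffer, self.offset + begin, end - begin)

    def compact(self) -> "SharedString":
        """Copy only the characters in use, dropping the parent buffer."""
        return SharedString.from_text(str(self))

    def retained_chars(self) -> int:
        """Size of the buffer this string keeps alive."""
        return len(self.buffer.chars)

    def __str__(self) -> str:
        return self.buffer.chars[self.offset:self.offset + self.length]


big = SharedString.from_text("x" * 1000 + "y")
tiny = big.substring(1000, 1001)      # one character, no allocation
print(str(tiny), tiny.retained_chars())
print(tiny.compact().retained_chars())
```

In this model the one-character `tiny` still holds the full 1001-character buffer until `compact()` makes an explicit copy, which is the "huge grief" case when many small pieces are cut from one big input.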

that is a potential advantage of immutable string buffers that afaik we
don't take advantage of in LO so far.




