

On 01/10/12 13:02, Noel Grandin wrote:

On 2012-10-01 12:38, Michael Meeks wrote:
We could do some magic there; of course - space is a bit of an issue - 
we already pointlessly bloat bazillions of ascii strings into UCS-2 
(nominally UTF-16) representations and nail a ref-count and length on 
the beginning. If you turn on the lifecycle diagnostics in 
sal/rtl/source/strimp.hxx with the #ifdef and re-build sal, you can 
start to see the scale of the problem when you launch libreoffice ;-)

Changing subject because I'm changing the topic.

That was something I was thinking about the other day - given that the 
bulk of our strings are pure 7-bit ASCII, it might be a worthwhile 
optimisation to store a bit that says "this string is 7-bit ASCII", and 
then store the string as a sequence of bytes.
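A rough sketch of what that could look like (this is not the actual
sal/rtl layout; all field names here are made up for illustration): one
extra flag next to the existing ref-count and length, and an 8-bit
payload whenever the flag is set.

    #include <cstdint>

    struct CompactStringData
    {
        std::int32_t refCount;   // lifecycle, as in the existing rtl strings
        std::int32_t length;     // number of characters
        bool         isAscii;    // true => payload is stored as 8-bit bytes
        union
        {
            char     ascii[1];   // used when isAscii is true
            char16_t utf16[1];   // used otherwise
        } payload;               // really allocated with the full length
    };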

The latest Java VM does this trick internally - it pretends that String 
is stored with an array of 16-bit values, but actually it stores them as 
UTF-8.

it does that?  impressive that they could dig their way out of the
utf-16 hole... but whatever they are doing won't be possible with our
OUStrings that directly expose the internal sal_Unicode array.

Even in an app running in a language other than US-English, strings are 
used for so many internal things that >90% of the strings are 7-bit ASCII.

space overhead is one problem with UTF-16 strings, but there are other
problems as well: they are very error-prone to use in an application
like LO that really must be 100% i18n-able. with UTF-16 it's all too
easy to write loops over the 16-bit code units without taking into
account that some Unicode code points are actually represented by not
one but two UTF-16 code units; that leads to real i18n bugs that are
very difficult to detect, because they only happen with rather obscure
languages. i.e. UTF-16 manages to combine the size overhead of UCS-4
and the variable length of UTF-8 into the worst of both worlds.
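A minimal sketch of that buggy loop pattern, and a corrected variant
(not actual LibreOffice code; sal_Unicode is just stood in for here with
char16_t): any character outside the BMP occupies two code units, so the
naive count is wrong for it.

    #include <cstddef>

    typedef char16_t sal_Unicode;   // stand-in for the real sal typedef

    // Counts "characters" by counting 16-bit units -- the buggy pattern.
    std::size_t countNaive(const sal_Unicode* s, std::size_t len)
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < len; ++i)
            ++n;                    // treats every code unit as a character
        return n;
    }

    // Corrected: a high surrogate followed by a low surrogate is one code point.
    std::size_t countCodePoints(const sal_Unicode* s, std::size_t len)
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < len; ++i)
        {
            ++n;
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                ++i;                // skip the low surrogate
        }
        return n;
    }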

with a UTF-8 string these i18n bugs would be very easy to detect since
they happen in pretty much every non-English language; you don't need to
be able to write Cuneiform to see the problem. iteration should be done
with a dedicated method that returns the next code point as an int32_t.
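A sketch of such a dedicated iteration method (the name nextCodePoint is
made up, and error handling for invalid sequences is omitted; a real
implementation would have to validate or substitute U+FFFD):

    #include <cstddef>
    #include <cstdint>

    // Decodes the code point starting at s[i] and advances i past it.
    std::int32_t nextCodePoint(const unsigned char* s, std::size_t& i)
    {
        unsigned char c = s[i++];
        if (c < 0x80)                       // 1-byte sequence (ASCII)
            return c;
        if ((c & 0xE0) == 0xC0)             // 2-byte sequence
            return ((c & 0x1F) << 6) | (s[i++] & 0x3F);
        if ((c & 0xF0) == 0xE0)             // 3-byte sequence
        {
            std::int32_t cp = (c & 0x0F) << 12;
            cp |= (s[i++] & 0x3F) << 6;
            cp |= (s[i++] & 0x3F);
            return cp;
        }
        std::int32_t cp = (c & 0x07) << 18; // 4-byte sequence
        cp |= (s[i++] & 0x3F) << 12;
        cp |= (s[i++] & 0x3F) << 6;
        cp |= (s[i++] & 0x3F);
        return cp;
    }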

also a UTF-8 string could be really constant: just write an ordinary
string literal in C++ and wrap a value class around it, no memory
allocation needed.
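For instance, a value class wrapping a literal could be nothing more
than a non-owning view (UTF8StringLiteral is an illustrative name, not
an existing class):

    #include <cstddef>

    class UTF8StringLiteral
    {
        const char*  m_data;
        std::size_t  m_length;
    public:
        // The array-reference template deduces the literal's length
        // at compile time; no copy, no heap allocation.
        template <std::size_t N>
        UTF8StringLiteral(const char (&literal)[N])
            : m_data(literal), m_length(N - 1) {}

        const char*  data()   const { return m_data; }
        std::size_t  length() const { return m_length; }
    };

    // wraps the static literal directly
    static const UTF8StringLiteral kGreeting("hello world");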

... which brings me to another point: in a hypothetical future when we
could efficiently create a UTF8String from a string literal in C++
without copying the darn thing, what should hypothetical operations to
mutate the string's buffer do?
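One conceivable answer (purely a sketch of the trade-off, not a proposal
from this thread) is to detach on first mutation: keep pointing at the
static literal for reads, and copy into an owned buffer only when the
string is actually written to.

    #include <string>

    class UTF8String
    {
        const char* m_literal;   // non-null while backed by a static literal
        std::string m_owned;     // used once the string has been detached
    public:
        explicit UTF8String(const char* literal) : m_literal(literal) {}

        // Read access never allocates.
        const char* data() const
        {
            return m_literal ? m_literal : m_owned.c_str();
        }

        // Any mutation first copies the literal into an owned buffer.
        void append(char c)
        {
            if (m_literal)
            {
                m_owned = m_literal;   // detach: one-time copy
                m_literal = nullptr;
            }
            m_owned.push_back(c);
        }
    };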


