On 01/10/12 13:02, Noel Grandin wrote:
> On 2012-10-01 12:38, Michael Meeks wrote:
> > We could do some magic there; of course - space is a bit of an issue -
> > we already pointlessly bloat bazillions of ascii strings into UCS-2
> > (nominally UTF-16) representations and nail a ref-count and length on
> > the beginning. If you turn on the lifecycle diagnostics in
> > sal/rtl/source/strimp.hxx with the #ifdef and re-build sal, you can
> > start to see the scale of the problem when you launch libreoffice ;-)
> Changing subject because I'm changing the topic.
> That was something I was thinking about the other day - given that the
> bulk of our strings are pure 7-bit ASCII, it might be a worthwhile
> optimisation to store a bit that says "this string is 7-bit ASCII", and
> then store the string as a sequence of bytes.
> The latest Java VM does this trick internally - it pretends that String
> is stored with an array of 16-bit values, but actually it stores them as
> UTF-8.
it does that? impressive that they could dig their way out of the
utf-16 hole... but whatever they are doing won't be possible with our
OUStrings that directly expose the internal sal_Unicode array.
> Even in an app running in a language other than US-English, strings are
> used for so many internal things that >90% of the strings are 7-bit ASCII.
space overhead is one problem with UTF-16 strings, but there are other
problems as well: they are very error prone to use in an application
like LO that really must be 100% i18n-able: with UTF-16 it's all too
easy to write loops over the 16-bit code units without taking into
account the possibility that there are Unicode code points that are
actually represented by not one but two UTF-16 code units, leading to
real i18n bugs that are very difficult to detect because they only
happen with rather obscure languages; i.e. UTF-16 manages to combine the
size overhead of UCS-4 and variable length of UTF-8 into the worst of
both worlds.
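
to illustrate the kind of loop bug described above, here is a minimal
sketch using std::u16string as a stand-in for OUString. the helper
names (next_utf16_code_point, naive_length, code_point_length) are
hypothetical and not part of sal; the point is only that counting
code units and counting code points disagree as soon as a character
outside the BMP shows up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a sal API): decode the code point that starts at
// index i and advance i past it. An unpaired surrogate is returned unchanged.
inline std::int32_t next_utf16_code_point(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate as part of the same code point
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}

// The error-prone version: every 16-bit code unit counts as one "character".
inline std::size_t naive_length(const std::u16string& s) {
    return s.size();
}

// The surrogate-aware version: a surrogate pair counts once.
inline std::size_t code_point_length(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) {
        next_utf16_code_point(s, i);
    }
    return n;
}
```

for a string such as u"a\U00012000b" (with a Cuneiform sign in the
middle) the two length functions disagree, and nothing in an
ASCII-only or BMP-only test run would reveal that.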
with a UTF-8 string these i18n bugs would be very easy to detect since
they happen in pretty much every non-English language; you don't need to
be able to write Cuneiform to see the problem. iteration should be done
with a dedicated method that returns the next code point as an int32_t.
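
such an iteration method could look roughly like the sketch below. it
assumes the UTF-8 bytes live in a std::string; the name
next_code_point and the "-1 on malformed input" convention are made up
for illustration, and the decoder deliberately skips the stricter
checks (overlong forms, encoded surrogates) that a real
implementation would want.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical iteration helper: returns the code point that starts at byte
// offset pos and advances pos past it. On malformed input it returns -1 and
// leaves pos just after the bytes it managed to consume.
// Simplification: overlong encodings and encoded surrogates are not rejected.
inline std::int32_t next_code_point(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    unsigned char b0 = static_cast<unsigned char>(s[pos++]);
    if (b0 < 0x80) {
        return b0;  // plain 7-bit ASCII, the common case
    }
    int extra = 0;
    std::int32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0) {
        extra = 1; cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {
        extra = 2; cp = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {
        extra = 3; cp = b0 & 0x07;
    } else {
        return -1;  // stray continuation byte or invalid lead byte
    }
    for (int k = 0; k < extra; ++k) {
        if (pos >= s.size()) {
            return -1;  // sequence truncated at end of string
        }
        unsigned char b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80) {
            return -1;  // expected a continuation byte
        }
        cp = (cp << 6) | (b & 0x3F);
        ++pos;
    }
    return cp;
}
```

a caller would loop with "while (pos < s.size())" and never index the
bytes directly, so code that forgets about multi-byte sequences breaks
on the first accented letter rather than on the first Cuneiform sign.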
also a UTF-8 string could be really constant: just write an ordinary
string literal in C++ and wrap a value class around it, no memory
allocation needed.
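
a rough sketch of what such a value class might look like, assuming
the source file and execution character set are UTF-8. the class name
Utf8Literal is hypothetical; it just records a pointer to the
literal's storage plus its length, so constructing one neither copies
nor allocates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical value class (not an existing sal type): a read-only view of a
// string literal. It stores only a pointer and a length.
// Caveat: the constructor accepts any const char array, so it is up to the
// caller to pass a NUL-terminated literal with static lifetime.
class Utf8Literal {
public:
    template <std::size_t N>
    constexpr Utf8Literal(const char (&lit)[N])
        : data_(lit), size_(N - 1) {}  // N counts the trailing NUL

    constexpr const char* data() const { return data_; }
    constexpr std::size_t size() const { return size_; }  // length in bytes

    // Byte access only; code points would go through a separate iterator.
    char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const char* data_;
    std::size_t size_;
};
```

with this, something like Utf8Literal s("caf\xC3\xA9") gives
s.size() == 5 bytes and s.data() pointing straight at the literal in
the binary's read-only data.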
... which brings me to another point: in a hypothetical future when we
could efficiently create a UTF8String from a string literal in C++
without copying the darn thing, what should hypothetical operations to
mutate the string's buffer do?