Re: RTL_CONSTASCII_USTRINGPARAM: cleanup wanted?

Lubos Lunak <l.lunak -AT- suse.cz>
Wed, 22 Feb 2012 14:56:12 +0100

On Wednesday 22 of February 2012, Stephan Bergmann wrote:

> On 02/22/2012 11:25 AM, Michael Meeks wrote:
>
>>     Great ! :-) incidentally, I had one minor point around the ASCII vs.
>> UTF-8 side; the rtl_string2UString (cf. sal/rtl/source/string.cxx) does
>> a typically slower UTF-8 length counting loop; I suggest that we could
>> do better performance-wise (and we do create a biggish scad of these
>> strings) by sticking with ASCII, and doing a single, simple copy/expand
>> of the string. Perhaps in a new rtl_uString_newFromAsciiL method.


 Actually rtl_string2UString() is reasonably optimized for the case when the 
data is ASCII or UTF-8-that-in-fact-is-ASCII, so the one loop analysing the 
contents is the only overhead. Makes me wonder if avoiding that one loop is 
really worth it. I'll go with 'no' for the time being, until somebody shows 
me otherwise.
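
 To make the trade-off concrete, here is a rough sketch of the two steps being discussed. It is not the actual code in sal/rtl/source/string.cxx: isAllAscii and widenAscii are hypothetical names, and std::u16string merely stands in for rtl_uString. One loop analyses the contents; if everything is ASCII, the conversion itself is a plain byte-by-byte widening.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the "one loop analysing the contents".
// Returns true if every byte in [s, s + len) is 7-bit ASCII.
inline bool isAllAscii(const char* s, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i)
        if (static_cast<unsigned char>(s[i]) & 0x80)
            return false;
    return true;
}

// Hypothetical helper: the "single, simple copy/expand".
// Each ASCII byte becomes exactly one UTF-16 code unit.
inline std::u16string widenAscii(const char* s, std::size_t len)
{
    assert(isAllAscii(s, len)); // precondition: ASCII-only input
    std::u16string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(s[i])));
    return out;
}
```

 An ASCII-only entry point would skip the first function entirely; the question above is whether saving that one pass is measurable.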

> Thinking about it again, the restriction to ASCII could become a
> hindrance in the longer run.  C++11 has provision for UTF-8 string
> literals (u8"..."), but they still have type char const[], so are not
> distinguishable from traditional plain "..." literals via function
> overloading.  So, if we ever wanted to extend the new facilities to also
> support UTF-8 string literals, but would want to keep the performance
> benefit for the ASCII-only case, we could not offer the same simple syntax
>
>    rtl::OUString("foo");
>    rtl::OUString(u8"I\u2764C++");
>
> for both.


 We could have OUString::fromUtf8( utf8literal ), which I consider acceptable, 
especially given that IMO we are unlikely to have a larger number of UTF-8 
literals anyway. But I think it's better to go for UTF-8 always and optimize 
if we find out it's worth it.
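
 A minimal sketch of what the fromUtf8() variant could look like. UStr is a hypothetical stand-in class, not the real rtl::OUString, and the decoder assumes well-formed UTF-8 and does no validation. The constructor treats plain literals as ASCII, and the named function is the explicit entry point for UTF-8 data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a UTF-16 string class (not rtl::OUString).
class UStr
{
public:
    // Plain "..." literal: assumed ASCII-only, widened byte by byte.
    template <std::size_t N>
    UStr(const char (&literal)[N])
    {
        for (std::size_t i = 0; i + 1 < N; ++i)
        {
            unsigned char c = static_cast<unsigned char>(literal[i]);
            assert(c < 0x80 && "plain literals must be ASCII");
            data_.push_back(static_cast<char16_t>(c));
        }
    }

    // Explicit entry point for UTF-8; assumes well-formed input.
    static UStr fromUtf8(const char* s, std::size_t len)
    {
        UStr r;
        std::size_t i = 0;
        while (i < len)
        {
            unsigned char c = static_cast<unsigned char>(s[i++]);
            // Number of continuation bytes that follow this lead byte.
            int extra = c < 0x80 ? 0
                : (c & 0xe0) == 0xc0 ? 1
                : (c & 0xf0) == 0xe0 ? 2
                : 3;
            char32_t cp = extra == 0 ? c : (c & (0x3f >> extra));
            for (int k = 0; k < extra && i < len; ++k, ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3f);
            if (cp < 0x10000)
                r.data_.push_back(static_cast<char16_t>(cp));
            else
            {
                // Outside the BMP: encode as a UTF-16 surrogate pair.
                cp -= 0x10000;
                r.data_.push_back(static_cast<char16_t>(0xd800 + (cp >> 10)));
                r.data_.push_back(static_cast<char16_t>(0xdc00 + (cp & 0x3ff)));
            }
        }
        return r;
    }

    template <std::size_t N>
    static UStr fromUtf8(const char (&literal)[N])
    {
        return fromUtf8(literal, N - 1); // drop the terminating NUL
    }

    const std::u16string& str() const { return data_; }

private:
    UStr() {}
    std::u16string data_;
};
```

 Usage would then be UStr( "foo" ) for the common case and UStr::fromUtf8( "I\xe2\x9d\xa4" "C++" ) for the rare non-ASCII literal; the bytes are spelled out here instead of using u8"..." only to keep the sketch independent of the language version.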

 I thought there could be a way to test string literal contents at 
compile-time, but string literals are not considered to be compile-time 
constants, simply because the standard says so, so templates can't take them 
as arguments. I've eventually found a way to do it, based on 
http://www.macieira.org/blog/2011/07/initialising-an-array-with-cx0x-using-constexpr-and-variadic-templates/
(see attachment), but it turns out to be unusable in practice. Maybe later.

-- 
 Lubos Lunak
 l.lunak@suse.cz

// With gcc-4.5.1 this is awfully slow to compile.
// Also, for longer strings the computation is no longer done at compile
// time and instead code for handling it at runtime is generated.

#include <stdio.h>

constexpr inline
int sum()
    {
    return 0;
    }

template< typename... T >
constexpr inline 
int sum( int v1, T... v2 )
    {
    return v1 + sum( v2... );
    }

// TODO BUG
// This is the wrong way around: a return value of 'ret' should in fact lead
// to skipping the ret-1 following bytes, so this needs to be handled as
// { utf8LengthChar( s[ i ] )... } (i.e. as an array) to ensure ordering.
constexpr inline 
int utf8LengthChar( unsigned char c )
    {
    return !( c & 0x80 ) ? 1
        : ( c & 0xe0 ) == 0xc0 ? 2
        : ( c & 0xf0 ) == 0xe0 ? 3
        : ( c & 0xf8 ) == 0xf0 ? 4
        : ( c & 0xfc ) == 0xf8 ? 5
        : ( c & 0xfe ) == 0xfc ? 6
        : 1;
    }

template< int... >
struct IndexList
    {
    };

template< typename IndexList, int Right >
struct Merge;

template< int... Left, int Right >
struct Merge< IndexList< Left... >, Right >
    {
    typedef IndexList< Left..., Right > Range;
    };

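// Indexes< N >::Range is IndexList< 0, 1, ..., N - 1 >, i.e. the valid
// indexes of an N-element array (appending N instead would skip s[ 0 ]).
template< int N >
struct Indexes
    {
    typedef typename Merge< typename Indexes< N - 1 >::Range, N - 1 >::Range Range;
    };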

template<>
struct Indexes< 0 >
    {
    typedef IndexList<> Range;
    };

template< int N, typename T >
struct Utf8LengthHelper;

template< int N, int... i >
struct Utf8LengthHelper< N, IndexList< i... > >
    {
    constexpr inline Utf8LengthHelper( const char s[ N ] )
        : value( sum( utf8LengthChar( s[ i ] )... ))
        {
        }
    const int value;
    };

template< int N >
constexpr inline int utf8Length( const char s[ N ] )
    {
    return Utf8LengthHelper< N, typename Indexes< N >::Range >( s ).value;
    }

template< int N >
inline
void foo( const char (&s)[ N ] )
    {
    fprintf( stderr, "%s %d\n", s, utf8Length< N - 1 >( s ));
    }

int main()
    {
    foo( "testé" );
    }
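
For comparison, a minimal sketch that sidesteps the ordering problem noted in the TODO above: instead of summing sequence lengths, count every byte that is not a continuation byte (10xxxxxx), since each such byte starts exactly one character. This assumes well-formed UTF-8 and uses plain recursive C++11 constexpr functions rather than the variadic index list; all names here are made up for illustration, and the compiler's constexpr recursion limit still caps the usable literal length.

```cpp
#include <cstddef>

// 1 for a byte that starts a character, 0 for a continuation byte (10xxxxxx).
constexpr int startsChar( char c )
    {
    return ( static_cast< unsigned char >( c ) & 0xc0 ) != 0x80 ? 1 : 0;
    }

// Number of characters in the first len bytes of s (well-formed UTF-8 assumed).
constexpr int countChars( const char* s, std::size_t len )
    {
    return len == 0 ? 0 : startsChar( s[ 0 ] ) + countChars( s + 1, len - 1 );
    }

// True if none of the first len bytes of s has the high bit set.
constexpr bool allAsciiBytes( const char* s, std::size_t len )
    {
    return len == 0 ? true
        : ( static_cast< unsigned char >( s[ 0 ] ) < 0x80
            && allAsciiBytes( s + 1, len - 1 ));
    }

// Literal-taking wrappers; N includes the terminating NUL, hence N - 1.
template< std::size_t N >
constexpr int utf8Chars( const char (&s)[ N ] )
    {
    return countChars( s, N - 1 );
    }

template< std::size_t N >
constexpr bool isAsciiLiteral( const char (&s)[ N ] )
    {
    return allAsciiBytes( s, N - 1 );
    }

// Evaluated at compile time: "test" plus U+00E9 encoded as two bytes.
static_assert( utf8Chars( "test\xc3\xa9" ) == 5, "5 characters in 6 bytes" );
static_assert( isAsciiLiteral( "foo" ), "plain ASCII literal" );
```

Whether this compiles any faster than the attachment for long literals has not been measured here.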
