Re: Character classification

Chris Sherlock <chris.sherlock79 -AT- gmail.com>
Thu, 6 Apr 2017 08:02:21 +1000

On 23 Mar 2017, at 10:47 pm, Stephan Bergmann <sbergman@redhat.com> wrote:

C character classification and case mapping functions in <ctype.h> are notoriously hard to use.


[snip]

I audited all uses of such functions (outside of external/ source blobs) on recent master:

In C++ code, I replaced them with calls to corresponding rtl/character.hxx functions (and adding 
missing casts from char to unsigned char where necessary; also see 
<https://cgit.freedesktop.org/libreoffice/core/commit/?id=7778d9f51bd1f4d086cafe95995406c3157afb89>
 "Prevent calls to rtl/character.hxx functions with (signed) char arguments").  A function 
corresponding to isspace was missing and has been added with 
<https://cgit.freedesktop.org/libreoffice/core/commit/?id=f5c93d4149e7ae967e98dbce72528a04a204ca95>
 "Use rtl::isAscii* instead of ctype.h is* (and fix passing plain char) and add 
rtl::isAsciiWhiteSpace".  Any calls thus replaced appeared to either never pass in EOF or have 
been adapted accordingly 
(<https://cgit.freedesktop.org/libreoffice/core/commit/?id=4a3f2cb747b2553485f48dc440e141e30ade5a70>
 "Fix some usage of std::istream unformatted input in hwpfilter/source/hwpeq.cxx").  (And there 
had been no uses of the corresponding std-namespaced functions from <cctype>


Nice work! When I get to the point where I write about the RTL I'll make sure I note this in the 
part I write about string handling.

What remains is the source of the five C programs

 rsc/Executable_rsc.mk (rsc/source/rscpp/cpp{2,3,5,6}.c)
 shell/Executable_uri_encode.mk (shell/source/unix/misc/uri-encode.c)
 solenv/Executable_concat-deps.mk (solenv/bin/concat-deps.c)
 soltools/Executable_cpp.mk (soltools/cpp/_{tokens,unix}.c)
 soltools/Executable_mkdepend.mk (soltools/mkdepend/{cppsetup,ifparser,parse}.c)

For one, I have added any casts from char to unsigned char where missing.  (But note that in some 
cases the input already was of the expected form.)


So the recommendation is to avoid C string functions in LibreOffice code in future?

I realise this may be a silly question, but does this mean we have a portable, cross-culture string 
handling module that makes things like character case handling consistent across platforms?

If so, I've always wondered what effort it would take to hive off our string handling code to its 
own project - there is a lot of thought and decades worth of testing for this C and C++ code. 
Perhaps it might help other projects outside of LibreOffice in the future...

For another, with a recent set of commits to master I have removed all but one call to setlocale 
from the LO code base itself.  (The remaining one is in SetSystemLocale in 
vcl/unx/generic/app/i18n_im.cxx, and smells like it is necessary for proper IME support in 
VCL-based applications on Linux.  None of those five C programs should be affected by it.)  So 
barring any calls to setlocale in external code, and ignoring the somewhat fuzzy definition of 
isprint as called from rsc/source/rscpp/cpp{5,6}.c, those five C programs should not (any longer) 
be affected by locale issues.


Was this done because of the character casing challenges mentioned above? Or was calling on this 
causing problems elsewhere?

I'm assuming setting locales is frowned on...

Note that there are three remaining calls to toupper/tolower that are currently being addressed 
by <https://gerrit.libreoffice.org/#/c/35303/> "tdf#99589 change tolower/toupper to 
easytolower/upper."

Please avoid adding any new uses of any of those functions.


Nice work! Please pardon me if any of my questions are silly, I am relatively inexperienced with 
topics related to localization.

Chris

Context

Character classification · Stephan Bergmann
- Re: Character classification · Chris Sherlock
  - Re: Character classification · Stephan Bergmann

Privacy Policy | Impressum (Legal Info) | Copyright information: Unless otherwise specified, all text and images on this website are licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the source code of LibreOffice, which is licensed under the Mozilla Public License (MPLv2). "LibreOffice" and "The Document Foundation" are registered trademarks of their corresponding registered owners or are in actual use as trademarks in one or more countries. Their respective logos and icons are also subject to international copyright laws. Use thereof is explained in our trademark policy.