Character classification

Stephan Bergmann <sbergman -AT- redhat.com>
Thu, 23 Mar 2017 12:47:43 +0100

C character classification and case mapping functions in <ctype.h> are notoriously hard to use.

For one, they take arguments of type int, but (quoting the C Standard): "In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined." So passing in an argument of type plain char is generally undefined behavior (when char is signed and the value is negative). (It appears that glibc (un-)helpfully tries to mitigate that, by making the functions work "as intended" for negative char values, but that hides portability issues and can break down, depending on locale, for '\xFF' if EOF is -1 (which it generally is). The MSVC debug runtime, on the other hand, triggers asserts on invalid input to these functions.)
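A minimal sketch of the usual remedy, converting to unsigned char before the call. The helper name count_alpha is hypothetical and not taken from the LO code base:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the alphabetic characters in a
   NUL-terminated string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Convert to unsigned char first, so the int passed to isalpha
           is always in the range the C Standard requires, even where
           plain char is signed and *s is negative. */
        if (isalpha((unsigned char)*s)) {
            ++n;
        }
    }
    return n;
}
```

Without the cast, a byte such as '\xFF' would be passed as a negative int on platforms where plain char is signed.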

For another, those functions are locale dependent. For one, they change behavior depending on preceding calls to setlocale. For another, even in a process that never called setlocale, while most of the functions (isalnum, isalpha, isblank, isdigit, islower, isspace, isupper, isxdigit, tolower, toupper) have uniquely defined behavior, some (iscntrl, isgraph, isprint, ispunct) still depend on what exactly constitutes a printing character or a control character, something left open by the C Standard.
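One way to avoid both the argument-range and the locale problem is a set of ASCII-only predicates written as plain range checks. The following is only a sketch: the function names are hypothetical, and it assumes an ASCII-compatible execution character set (so that 'a'..'z', 'A'..'Z' and '\t'..'\r' are contiguous):

```c
#include <stdbool.h>

/* Hypothetical ASCII-only helpers. They accept any int value, never
   consult the locale, and simply return false (or the unchanged value)
   for anything outside the ASCII ranges they test.
   Assumption: ASCII-compatible execution character set. */

bool is_ascii_digit(int c)
{
    return c >= '0' && c <= '9';
}

bool is_ascii_alpha(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Same set as isspace in the "C" locale: space, \t, \n, \v, \f, \r. */
bool is_ascii_white_space(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

int to_ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}
```

Because these are ordinary comparisons on int, a negative value or a value above 127 is well defined and just yields false.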

(POSIX and MSVC have an additional isascii, which however accepts arbitrary int values and returns true iff the value is in the 0..127 7-bit ASCII range, so using it is "harmless" (modulo portability issues beyond POSIX and MSVC). POSIX has additional is*_l variants that take an additional locale_t argument, and MSVC has similar _is*_l variants. There appear to be no uses of such across the LO code base.)

I audited all uses of such functions (outside of external/ source blobs) on recent master:

In C++ code, I replaced them with calls to corresponding rtl/character.hxx functions (adding missing casts from char to unsigned char where necessary; also see <https://cgit.freedesktop.org/libreoffice/core/commit/?id=7778d9f51bd1f4d086cafe95995406c3157afb89> "Prevent calls to rtl/character.hxx functions with (signed) char arguments"). A function corresponding to isspace was missing and has been added with <https://cgit.freedesktop.org/libreoffice/core/commit/?id=f5c93d4149e7ae967e98dbce72528a04a204ca95> "Use rtl::isAscii* instead of ctype.h is* (and fix passing plain char) and add rtl::isAsciiWhiteSpace". Any calls thus replaced appeared to either never pass in EOF or have been adapted accordingly (<https://cgit.freedesktop.org/libreoffice/core/commit/?id=4a3f2cb747b2553485f48dc440e141e30ade5a70> "Fix some usage of std::istream unformatted input in hwpfilter/source/hwpeq.cxx"). (And there had been no uses of the corresponding std-namespaced functions from <cctype>.)


What remains is the source of the five C programs

  rsc/Executable_rsc.mk (rsc/source/rscpp/cpp{2,3,5,6}.c)
  shell/Executable_uri_encode.mk (shell/source/unix/misc/uri-encode.c)
  solenv/Executable_concat-deps.mk (solenv/bin/concat-deps.c)
  soltools/Executable_cpp.mk (soltools/cpp/_{tokens,unix}.c)
  soltools/Executable_mkdepend.mk (soltools/mkdepend/{cppsetup,ifparser,parse}.c)

For one, I have added any casts from char to unsigned char where missing. (But note that in some cases the input already was of the expected form.)
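As an illustration of input that already has the expected form: getc returns either EOF or a byte value converted to unsigned char, so its result can be passed on unchanged as long as it is kept in an int. A small sketch (count_spaces is a hypothetical helper, not code from any of the programs above):

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: counts the whitespace characters read from a
   stream until end-of-file or a read error. */
long count_spaces(FILE *f)
{
    long n = 0;
    int c; /* int, not char: must hold EOF distinct from every byte value */
    while ((c = getc(f)) != EOF) {
        /* c is in the unsigned char range here, so no cast is needed. */
        if (isspace(c)) {
            ++n;
        }
    }
    return n;
}
```

Storing the result of getc in a plain char first would lose that guarantee and bring back the need for a cast.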

For another, with a recent set of commits to master I have removed all but one call to setlocale from the LO code base itself. (The remaining one is in SetSystemLocale in vcl/unx/generic/app/i18n_im.cxx, and smells like it is necessary for proper IME support in VCL-based applications on Linux. None of those five C programs should be affected by it.) So barring any calls to setlocale in external code, and ignoring the somewhat fuzzy definition of isprint as called from rsc/source/rscpp/cpp{5,6}.c, those five C programs should not (any longer) be affected by locale issues.

Note that there are three remaining calls to toupper/tolower that are currently being addressed by <https://gerrit.libreoffice.org/#/c/35303/> "tdf#99589 change tolower/toupper to easytolower/upper."


Please avoid adding any new uses of any of those functions.
