C character classification and case mapping functions in <ctype.h> are
notoriously hard to use.
For one, they take arguments of type int, but (quoting the C Standard):
"In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro
EOF. If the argument has any other value, the behavior is undefined."
So passing in an argument of type plain char is undefined behavior
whenever char is signed and the value is negative. (glibc appears to
(un-)helpfully mitigate that, by making the functions work "as intended"
for negative char values, but that hides portability issues and can
break down, depending on locale, for '\xFF' if EOF is -1 (which it
generally is). The MSVC debug runtime, on the other hand, triggers
asserts on invalid input to these functions.)
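
To illustrate the cast that avoids the undefined behavior described
above, here is a minimal sketch. The helper name count_spaces is made up
for this example and does not appear in the LO code base.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the whitespace characters in a
   NUL-terminated string. */
size_t count_spaces(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Plain char may be signed, so a byte such as '\xFF' can be
           negative. Converting to unsigned char first yields a value
           that isspace is required to accept. */
        if (isspace((unsigned char)*s)) {
            ++n;
        }
    }
    return n;
}
```

Writing isspace(*s) instead would compile without complaint, and would
only misbehave for bytes outside the 0..127 range on platforms where
char is signed.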
For another, those functions are locale dependent. They change behavior
depending on preceding calls to setlocale. And even in a process that
never called setlocale, while most of the functions (isalnum, isalpha,
isblank, isdigit, islower, isspace, isupper, isxdigit, tolower, toupper)
have uniquely defined behavior, some (iscntrl, isgraph, isprint,
ispunct) still depend on what exactly constitutes a printing character
or a control character, something left open by the C Standard.
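
A locale-independent alternative is to classify by ASCII code directly.
The following is only a sketch: the names ascii_is_digit, ascii_is_print
and ascii_is_cntrl are invented for illustration, and the code is not
the implementation found in rtl/character.hxx.

```c
#include <stdbool.h>

/* Hypothetical ASCII-only classifiers. They use numeric ASCII codes,
   so the result depends neither on setlocale nor on how the platform
   defines "printing" or "control" characters. The parameter type is
   unsigned char, so a negative plain char argument is converted
   (modulo UCHAR_MAX + 1) instead of causing undefined behavior. */

bool ascii_is_digit(unsigned char c)
{
    return c >= 0x30 && c <= 0x39; /* '0'..'9' */
}

bool ascii_is_print(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E; /* space..'~' */
}

bool ascii_is_cntrl(unsigned char c)
{
    return c < 0x20 || c == 0x7F; /* C0 controls and DEL */
}
```

Everything outside the 0..127 range is reported as neither printable nor
a control character, which is an assumption of this sketch and not
something the C Standard prescribes.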
(POSIX and MSVC have an additional isascii, which however accepts
arbitrary int values and returns true iff the value is in the 0..127
7-bit ASCII range, so using it is "harmless" (modulo portability issues
beyond POSIX and MSVC). POSIX has additional is*_l variants that take an
additional locale_t argument, and MSVC has similar _is*_l variants.
There appear to be no uses of such across the LO code base.)
I audited all uses of such functions (outside of external/ source blobs)
on recent master:
In C++ code, I replaced them with calls to the corresponding
rtl/character.hxx functions (and added missing casts from char to
unsigned char where necessary; also see
<https://cgit.freedesktop.org/libreoffice/core/commit/?id=7778d9f51bd1f4d086cafe95995406c3157afb89>
"Prevent calls to rtl/character.hxx functions with (signed) char
arguments"). A function corresponding to isspace was missing and has
been added with
<https://cgit.freedesktop.org/libreoffice/core/commit/?id=f5c93d4149e7ae967e98dbce72528a04a204ca95>
"Use rtl::isAscii* instead of ctype.h is* (and fix passing plain char)
and add rtl::isAsciiWhiteSpace". Any calls thus replaced appeared to
either never pass in EOF or have been adapted accordingly
(<https://cgit.freedesktop.org/libreoffice/core/commit/?id=4a3f2cb747b2553485f48dc440e141e30ade5a70>
"Fix some usage of std::istream unformatted input in
hwpfilter/source/hwpeq.cxx"). (And there had been no uses of the
corresponding std-namespaced functions from <cctype>.)
What remains is the source of the five C programs:

  rsc/Executable_rsc.mk (rsc/source/rscpp/cpp{2,3,5,6}.c)
  shell/Executable_uri_encode.mk (shell/source/unix/misc/uri-encode.c)
  solenv/Executable_concat-deps.mk (solenv/bin/concat-deps.c)
  soltools/Executable_cpp.mk (soltools/cpp/_{tokens,unix}.c)
  soltools/Executable_mkdepend.mk
    (soltools/mkdepend/{cppsetup,ifparser,parse}.c)

For one, I have added any casts from char to unsigned char where
missing. (But note that in some cases the input already was of the
expected form.)
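
One common case where the input already has the expected form is a
value that comes straight from getc, which returns either EOF or the
character converted to unsigned char. A minimal sketch, with the
hypothetical helper name count_digits chosen only for this example:

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: count the decimal digits read from a stream. */
long count_digits(FILE *f)
{
    long n = 0;
    int c; /* int, not char, so that EOF stays distinct from any byte */
    while ((c = getc(f)) != EOF) {
        /* c is in the range 0..UCHAR_MAX here, so it can be passed to
           isdigit without a cast. */
        if (isdigit(c)) {
            ++n;
        }
    }
    return n;
}
```

Storing the result of getc in a plain char before classifying it would
lose that guarantee and bring back the need for a cast.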
For another, with a recent set of commits to master I have removed all
but one call to setlocale from the LO code base itself. (The remaining
one is in SetSystemLocale in vcl/unx/generic/app/i18n_im.cxx, and smells
like it is necessary for proper IME support in VCL-based applications on
Linux. None of those five C programs should be affected by it.) So
barring any calls to setlocale in external code, and ignoring the
somewhat fuzzy definition of isprint as called from
rsc/source/rscpp/cpp{5,6}.c, those five C programs should not (any
longer) be affected by locale issues.
Note that there are three remaining calls to toupper/tolower that are
currently being addressed by <https://gerrit.libreoffice.org/#/c/35303/>
"tdf#99589 change tolower/toupper to easytolower/upper."
Please avoid adding any new uses of any of those functions.
Context
- Character classification · Stephan Bergmann