In the context of fdo#41166 (a bug in how our numbers/date parser
handles ambiguities), I thought "wouldn't it be better if we used a
proper parser generator and wrote a proper grammar instead of a
hand-coded parser"? It would be easier to debug, easier to maintain,
easier to read, ... and thus probably less buggy in the long run.
Well, I've a good "proof of concept" start on making it a
Boost::Spirit grammar. I've chosen Boost::Spirit because:
1) It is a *dynamic* parser generator, that is the grammar can be
constructed at runtime. That's a requirement for us, for i18n
reasons.
2) It is in Boost, so no new external dependency: we already require
Boost.
The patch is attached so that you can get an idea. Alas, it is
polluted by the fact I converted everything to rtl::OUString from
tools::String. It is against libreoffice-3-5 simply because that's my
current main "dogfood / development" branch these days. I'm *not*
suggesting we do the change in libreoffice-3-5, only in master!
Highlights:
1) Yes, it works on the basis of UCS4 codepoints, so no problem with
UTF-16 surrogate pairs.
2) I add a const_iterator interface to rtl::OUString. Instead of
coding it manually to use iterateCodePoints, I again reuse the
power of boost. But I have in my git history a manually-coded
version based on iterateCodePoints if for some reason it is
desirable (I don't think it is).
That's also more generally a more "C++ natural" interface to
OUStrings, so we can use it throughout our codebase instead of
iterateCodePoints.
I don't define an iterator interface, as this could be *extremely*
slow: writing one character in the middle of a string is O(n)
(linear in the length of the string) when replacing a
surrogates-encoded codepoint by a non-surrogate encoded one and
vice-versa.
3) For now, it does not replace the old parser, but just does a
parallel parse and displays the result of the parse to stderr for
demonstration / debugging purposes.
4) Yes, this is heavily template / metaprogramming based and
compilation time is significant. But it is only this one file.
5) I've stuck the LibreOffice<->Spirit integration in
svl/source/numbers/spirit/, but possibly we could move it to a more
general location (sal/rtl?) so that it can be reused in other parts
of LibreOffice.
For now, I'm using only the "teach rtl::OUString to spirit" part of
the integration, so only that one is in the patch.
But there is also a "use the character classes notions of
<unotools/charclass.hxx> instead of the spirit ones", which I
attach as separate files. A "character class" is something like
"Alpha", "AlphaNumeric", "Letter", ... In particular, the notions
in unotools are language-dependent. However, I find the coverage
rather spotty. For example, there is no "is this character a
whitespace" predicate... So we would have to complement from the
native spirit notions anyway?
I'm not sure we will need it, since the current parser basically
does not use these character classifications at all: it only uses
toUpper to make things case-insensitive.
5) The grammar itself is in struct
ImpSvNumberInputScan::number_grammar. That's where you should
look.
Let's look for example at a simple, yet internationalised example,
booleans:
const ImpSvNumberformatScan* pFS = parent.pFormatter->GetFormatScanner();
::rtl::OUString sTrue (pFS->GetTrueString());
::rtl::OUString sFalse(pFS->GetTrueString());
rBoolean = (qi::lit(sTrue) >> qi::eoi) [ ref(fOutNumber) = 1, ref(parent.eScannedType) =
NUMBERFORMAT_LOGICAL ]
| (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1, ref(parent.eScannedType) =
NUMBERFORMAT_LOGICAL ];
A boolean is a either a lit(eral) sTrue or a literal sFalse, and
sTrue and sFalse are runtime values. For sTrue, "1" is written in
fOutNumber, for sFalse, -1 is written. In Both cases,
NUMBERFORMAT_LOGICAL is written in eScannedType. Hmm... I could
have factorised that, as in:
rBoolean = ( (qi::lit(sTrue) >> qi::eoi) [ ref(fOutNumber) = 1 ]
| (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1 ] )
eoi [ ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]
For a more complicated example, see dates:
rDate = -dow >>
( ( eps [ reset_date(pCal) ] >> rISO8601 >> eoi )
| rPlainDate )
>> eoi [ ref(parent.eScannedType) = NUMBERFORMAT_DATE,
ref(fOutNumber) = make_date(parent.pNullDate, ref(pCal)) ];
A date is an optional day-of-week (the "-" means "optional")
followed by either a ISO8601 date or a (localised) "plain date".
A plain date is:
if ( true /* DMY format */ )
rPlainDate =
( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
| ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> -rDateSep >>
rYear >> eoi );
else /* MDY format */
rPlainDate =
( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
| ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> -rDateSep >> rYear
eoi );
That is, either a day-of-month, a month and a year or only two of
those, with "day-of-month and month" taking precedence over "month
and year" so that "5/3" is fifth of march (or third of may,
depending on language), not may of year 3 (or 2003).
In turn a month is either an unsigned short integer between 1 and
number of months in a year or the name of month:
const int nMonths = pCal->getNumberOfMonthsInYear();
rMonth = ushort_ [ _pass = _1 <= nMonths,
bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 - 1
) ]
| month [ bind(&CalendarWrapper::setValue, ref(pCal), CalendarFieldIndex::MONTH, _1 ) ];
"_pass = _1 <= nMonths" means this rule should fail if the read
unsigned short is not <= nMonths. "month" is the rule for a "name
of month as a string":
for ( int i = 0; i < nMonths; ++i)
{
month.add(parent.pUpperMonthText[i], i)(parent.pUpperAbbrevMonthText[i], i);
}
Meaning $\forall i$, pUpperMonthText[i] and
pUpperAbbrevMonthText[i] map to month number i.
6) The grammar is constructed and invoked in IsNumberFormatMain:
std::cerr << "***** DBG before parse ****** r=(undefined)\tfOutNumber=" << fOutNumber <<
"\teScannedType=" << eScannedType << std::endl;
std::cerr << "To parse: '" << ::rtl::OUStringToOString( rString, RTL_TEXTENCODING_UTF8
).getStr() << "'" << std::endl;
bool r = qi::phrase_parse(rString.begin(),
rString.end(),
number_grammar<rtl::OUString::const_iterator>(*this, fOutNumber),
enc::space);
std::cerr << "***** DBG after parse ****** r=" << r << "\tfOutNumber=" << fOutNumber <<
"\teScannedType=" << eScannedType << std::endl;
So, this is not feature-complete yet (e.g. does not recognise
fractions, mantissa/exponent notation such as "1.52225E14", dates are
not fully i18ned), and especially the date part will probably need
some cooperation / guidance from an i18n expert :)
But I think it shows we can do that, IMO to great long-term
advantage. Do I have some buy-in from you on that?
Questions you'll ask or should ask ;-)
- Hey, this constructs a grammar for each parse. We should do that
only when the grammar (language) changes.
Yes, in ChangeIntl. TODO.
P.S.: I'm in vacation from Fri 2 march 2012 to Sun 11 march 2012.
--
Lionel
Context
- Using a real parser generator to parse numbers (and dates) · Lionel Elie Mamane
Privacy Policy |
Impressum (Legal Info) |
Copyright information: Unless otherwise specified, all text and images
on this website are licensed under the
Creative Commons Attribution-Share Alike 3.0 License.
This does not include the source code of LibreOffice, which is
licensed under the Mozilla Public License (
MPLv2).
"LibreOffice" and "The Document Foundation" are
registered trademarks of their corresponding registered owners or are
in actual use as trademarks in one or more countries. Their respective
logos and icons are also subject to international copyright laws. Use
thereof is explained in our
trademark policy.