Using a real parser generator to parse numbers (and dates)

Lionel Elie Mamane <lionel -AT- mamane.lu>
Wed, 29 Feb 2012 11:49:38 +0100

In the context of fdo#41166 (a bug in how our number/date parser
handles ambiguities), I thought: wouldn't it be better if we used a
proper parser generator and wrote a proper grammar instead of a
hand-coded parser? It would be easier to debug, easier to maintain,
easier to read, ... and thus probably less buggy in the long run.

Well, I've a good "proof of concept" start on making it a
Boost::Spirit grammar. I've chosen Boost::Spirit because:

1) It is a *dynamic* parser generator, that is the grammar can be
   constructed at runtime. That's a requirement for us, for i18n
   reasons.

2) It is in Boost, so no new external dependency: we already require
   Boost.
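
To make point 1 concrete, here is a minimal sketch of what "constructed
at runtime" means. It deliberately uses only the standard library, not
Spirit, and every name in it (BoolKeywords, makeBoolParser) is
hypothetical; the French keywords are only illustrative sample data.
The point is merely that the recogniser is assembled from strings that
are not known until the locale is.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for locale data: the keywords are runtime values.
struct BoolKeywords
{
    std::string sTrue;
    std::string sFalse;
};

// A recogniser returns true on a match and stores the value in 'out'.
using BoolParser = std::function<bool(const std::string&, bool&)>;

// Build the recogniser from runtime strings, in the spirit of a dynamic
// grammar assembled from localised keywords (this is not Spirit code).
inline BoolParser makeBoolParser(const BoolKeywords& kw)
{
    assert(!kw.sTrue.empty() && !kw.sFalse.empty());
    return [kw](const std::string& input, bool& out) -> bool
    {
        if (input == kw.sTrue)  { out = true;  return true; }
        if (input == kw.sFalse) { out = false; return true; }
        return false;   // not a boolean in this locale
    };
}
```

Calling makeBoolParser twice with different keyword sets yields two
independent recognisers, which is what a fixed compile-time grammar
cannot do.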


The patch is attached so that you can get an idea. Alas, it is
polluted by the fact I converted everything to rtl::OUString from
tools::String. It is against libreoffice-3-5 simply because that's my
current main "dogfood / development" branch these days. I'm *not*
suggesting we do the change in libreoffice-3-5, only in master!

Highlights:

1) Yes, it works on the basis of UCS4 codepoints, so no problem with
   UTF-16 surrogate pairs.

2) I add a const_iterator interface to rtl::OUString. Instead of
   coding it manually to use iterateCodePoints, I again reuse the
   power of boost. But I have in my git history a manually-coded
   version based on iterateCodePoints if for some reason it is
   desirable (I don't think it is).

   That's also more generally a more "C++ natural" interface to
   OUStrings, so we can use it throughout our codebase instead of
   iterateCodePoints.

   I don't define a mutable (non-const) iterator interface, as this
   could be *extremely* slow: writing one character in the middle of a
   string is O(n) (linear in the length of the string) when replacing
   a surrogate-encoded codepoint with a non-surrogate-encoded one or
   vice versa, because the rest of the string has to be shifted.
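
   As an illustration of read-only iteration by codepoint over UTF-16
   data, here is a minimal sketch. It uses std::u16string as a stand-in
   for rtl::OUString, and nextCodePoint / codePoints are hypothetical
   helpers, not the interface from the patch and not the
   iterateCodePoints API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode the code point starting at index i of a
// UTF-16 string and advance i past it (one or two code units).
inline char32_t nextCodePoint(const std::u16string& s, std::size_t& i)
{
    assert(i < s.size());
    const char16_t hi = s[i++];
    // A high surrogate followed by a low surrogate encodes one code point
    // above U+FFFF; anything else (including a lone surrogate) is
    // returned unchanged.
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size())
    {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF)
        {
            ++i;
            return 0x10000
                 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                 + (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    return hi;
}

// Read-only traversal: collect every code point of the string in order.
inline std::vector<char32_t> codePoints(const std::u16string& s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); )
        out.push_back(nextCodePoint(s, i));
    return out;
}
```

   A string holding 'a', the surrogate pair 0xD835 0xDFD8 and 'b' has
   four code units but yields three codepoints, the middle one being
   U+1D7D8. Overwriting that middle codepoint with a one-unit character
   would change the string's length, which is where the O(n) cost of a
   writable iterator comes from.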

3) For now, it does not replace the old parser, but just does a
   parallel parse and displays the result of the parse to stderr for
   demonstration / debugging purposes.

4) Yes, this is heavily template / metaprogramming based and
   compilation time is significant. But it is only this one file.

5) I've stuck the LibreOffice<->Spirit integration in
   svl/source/numbers/spirit/, but possibly we could move it to a more
   general location (sal/rtl?) so that it can be reused in other parts
   of LibreOffice.

   For now, I'm using only the "teach rtl::OUString to spirit" part of
   the integration, so only that one is in the patch.

   But there is also a "use the character classes notions of
   <unotools/charclass.hxx> instead of the spirit ones", which I
   attach as separate files. A "character class" is something like
   "Alpha", "AlphaNumeric", "Letter", ... In particular, the notions
   in unotools are language-dependent. However, I find the coverage
   rather spotty. For example, there is no "is this character a
   whitespace" predicate... So we would have to complement from the
   native spirit notions anyway?

   I'm not sure we will need it, since the current parser basically
   does not use these character classifications at all: it only uses
   toUpper to make things case-insensitive.

6) The grammar itself is in struct
   ImpSvNumberInputScan::number_grammar. That's where you should
   look.

   Let's start with a simple, yet internationalised, example:
   booleans.

   const ImpSvNumberformatScan* pFS = parent.pFormatter->GetFormatScanner();

   ::rtl::OUString sTrue (pFS->GetTrueString());
   ::rtl::OUString sFalse(pFS->GetFalseString());
   rBoolean =   (qi::lit(sTrue)  >> qi::eoi) [ ref(fOutNumber) =  1,
                                               ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ]
              | (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1,
                                               ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ];

   A boolean is either a lit(eral) sTrue or a literal sFalse, and
   sTrue and sFalse are runtime values. For sTrue, 1 is written to
   fOutNumber; for sFalse, -1 is written. In both cases,
   NUMBERFORMAT_LOGICAL is written to eScannedType. Hmm... I could
   have factorised that, as in:

   rBoolean = (   (qi::lit(sTrue)  >> qi::eoi) [ ref(fOutNumber) =  1 ]
                | (qi::lit(sFalse) >> qi::eoi) [ ref(fOutNumber) = -1 ] )
              >> qi::eoi [ ref(parent.eScannedType) = NUMBERFORMAT_LOGICAL ];


   For a more complicated example, see dates:

   rDate = -dow >>
       (   ( eps [ reset_date(pCal) ] >> rISO8601 >> eoi )
         | rPlainDate )
        >> eoi [ ref(parent.eScannedType) = NUMBERFORMAT_DATE,
                 ref(fOutNumber) =  make_date(parent.pNullDate, ref(pCal)) ];

   A date is an optional day-of-week (the "-" means "optional")
   followed by either an ISO 8601 date or a (localised) "plain date".

   A plain date is:

   if ( true /* DMY format */ )
       rPlainDate =
             ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
           | ( eps [ reset_date(pCal) ] >> rDayOfMonth >> -rDateSep >> rMonth
                                        >> -rDateSep >> rYear >> eoi );
   else /* MDY format */
       rPlainDate =
             ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rYear >> eoi )
           | ( eps [ reset_date(pCal) ] >> rMonth >> -rDateSep >> rDayOfMonth
                                        >> -rDateSep >> rYear >> eoi );


   That is, either a day-of-month, a month and a year, or only two of
   those, with "day-of-month and month" taking precedence over "month
   and year", so that "5/3" is the fifth of March (or the third of
   May, depending on language), not May of year 3 (or 2003).
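
   The precedence rule can be sketched without Spirit as an ordered
   choice: try the alternatives in order and take the first one that
   consumes the whole input. This is only an illustration of the DMY
   branch under simplifying assumptions that are mine, not the
   patch's: the separator is always "/" and mandatory, months are
   fixed at 12 and days at 31, and PlainDate, splitNumbers and
   parsePlainDateDMY are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int kAbsent = -1;                               // field not given
struct PlainDate { int day; int month; int year; };

// Split "5/3/2012" into {5, 3, 2012}. Fails on anything other than
// digit runs separated by single '/' characters.
inline bool splitNumbers(const std::string& s, std::vector<int>& out)
{
    out.clear();
    int cur = 0;
    bool have = false;
    for (char c : s)
    {
        if (c >= '0' && c <= '9')
        {
            cur = cur * 10 + (c - '0');
            have = true;
            if (cur > 9999) return false;             // keep the sketch overflow-free
        }
        else if (c == '/' && have) { out.push_back(cur); cur = 0; have = false; }
        else return false;
    }
    if (!have) return false;
    out.push_back(cur);
    return true;
}

inline bool isDay(int n)   { return n >= 1 && n <= 31; }   // assumption: fixed bound
inline bool isMonth(int n) { return n >= 1 && n <= 12; }   // assumption: 12 months

// DMY ordered choice: the first alternative matching the whole input wins.
inline bool parsePlainDateDMY(const std::string& s, PlainDate& d)
{
    std::vector<int> n;
    if (!splitNumbers(s, n)) return false;
    // 1) day-of-month, month
    if (n.size() == 2 && isDay(n[0]) && isMonth(n[1]))
    { d = PlainDate{n[0], n[1], kAbsent}; return true; }
    // 2) month, year (only reached when alternative 1 failed)
    if (n.size() == 2 && isMonth(n[0]))
    { d = PlainDate{kAbsent, n[0], n[1]}; return true; }
    // 3) day-of-month, month, year
    if (n.size() == 3 && isDay(n[0]) && isMonth(n[1]))
    { d = PlainDate{n[0], n[1], n[2]}; return true; }
    return false;
}
```

   With this ordering "5/3" comes out as day 5, month 3, while
   "3/2012" falls through to the month-and-year alternative because
   2012 is not a valid month.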

   In turn, a month is either an unsigned short integer between 1 and
   the number of months in a year, or the name of a month:

   const int nMonths =  pCal->getNumberOfMonthsInYear();
   rMonth = ushort_ [ _pass = _1 <= nMonths,
                      bind(&CalendarWrapper::setValue, ref(pCal),
                           CalendarFieldIndex::MONTH, _1 - 1 ) ]
       | month      [ bind(&CalendarWrapper::setValue, ref(pCal),
                           CalendarFieldIndex::MONTH, _1 ) ];


   "_pass = _1 <= nMonths" means this rule should fail if the read
   unsigned short is not <= nMonths. "month" is the rule for a "name
   of month as a string":

   for ( int i = 0; i < nMonths; ++i)
   {
       month.add(parent.pUpperMonthText[i], i)(parent.pUpperAbbrevMonthText[i], i);
   }

   That is, for every i, pUpperMonthText[i] and
   pUpperAbbrevMonthText[i] both map to month number i.
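
   The two branches of the month rule (a range-checked number, or a
   name looked up in a table filled at runtime) can be sketched in
   plain C++ as follows. MonthTable is a hypothetical stand-in for the
   Spirit symbol table, not code from the patch. Unlike the rule
   above, the sketch also rejects 0, which is my addition.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the symbol table: upper-cased full and
// abbreviated names both map to the same 0-based month index.
class MonthTable
{
public:
    MonthTable(const std::vector<std::string>& full,
               const std::vector<std::string>& abbrev)
        : nMonths_(static_cast<int>(full.size()))
    {
        assert(full.size() == abbrev.size());
        for (int i = 0; i < nMonths_; ++i)
        {
            names_[full[i]]   = i;
            names_[abbrev[i]] = i;
        }
    }

    // Returns the 0-based month index, or -1 if the token is not a month.
    int parse(const std::string& token) const
    {
        if (token.empty()) return -1;
        bool digits = true;
        for (char c : token)
        {
            if (c < '0' || c > '9') { digits = false; break; }
        }
        if (digits)
        {
            if (token.size() > 4) return -1;          // avoid overflow in stoi
            const int n = std::stoi(token);
            // Numeric branch: 1-based input, must not exceed nMonths_.
            return (n >= 1 && n <= nMonths_) ? n - 1 : -1;
        }
        // Name branch: case-insensitive by upper-casing (ASCII only here).
        std::string upper;
        for (unsigned char c : token)
            upper += static_cast<char>(std::toupper(c));
        const std::map<std::string, int>::const_iterator it = names_.find(upper);
        return it == names_.end() ? -1 : it->second;
    }

private:
    int nMonths_;
    std::map<std::string, int> names_;
};
```

   Note the off-by-one between the branches, as in the rule above: a
   numeric "3" is stored as index 2, while a name is stored as the
   index it was registered with.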

7) The grammar is constructed and invoked in IsNumberFormatMain:


   std::cerr << "***** DBG before parse ****** r=(undefined)\tfOutNumber=" << fOutNumber
             << "\teScannedType=" << eScannedType << std::endl;
   std::cerr << "To parse: '"
             << ::rtl::OUStringToOString( rString, RTL_TEXTENCODING_UTF8 ).getStr()
             << "'" << std::endl;

   bool r = qi::phrase_parse(rString.begin(),
                             rString.end(),
                             number_grammar<rtl::OUString::const_iterator>(*this, fOutNumber),
                             enc::space);

   std::cerr << "***** DBG after  parse ****** r=" << r << "\tfOutNumber=" << fOutNumber
             << "\teScannedType=" << eScannedType << std::endl;



So, this is not feature-complete yet (e.g. it does not recognise
fractions or mantissa/exponent notation such as "1.52225E14", and
dates are not fully i18ned), and especially the date part will
probably need some cooperation / guidance from an i18n expert :)


But I think it shows we can do that, IMO to great long-term
advantage. Do I have some buy-in from you on that?



Questions you'll ask or should ask ;-)

 - Hey, this constructs a grammar for each parse. We should do that
   only when the grammar (language) changes.

   Yes, in ChangeIntl. TODO.
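
   A minimal sketch of that TODO, assuming the grammar can be kept as
   a member and rebuilt lazily: Grammar and GrammarCache are
   hypothetical placeholders, and changeLocale only plays the role
   that ChangeIntl has in the real code.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical placeholder for an expensive-to-build grammar object.
struct Grammar
{
    std::string locale;
    explicit Grammar(const std::string& l) : locale(l) {}
};

class GrammarCache
{
public:
    GrammarCache() : builds_(0) {}

    // Called when the language changes: drop the grammar only if the
    // locale really is different.
    void changeLocale(const std::string& locale)
    {
        if (locale != locale_)
        {
            locale_ = locale;
            grammar_.reset();
        }
    }

    // Called for every parse: builds at most once per locale change.
    const Grammar& get()
    {
        if (!grammar_)
        {
            grammar_.reset(new Grammar(locale_));
            ++builds_;
        }
        assert(grammar_);
        return *grammar_;
    }

    int builds() const { return builds_; }   // for demonstration only

private:
    std::string locale_;
    std::unique_ptr<Grammar> grammar_;
    int builds_;
};
```

   Parsing many strings in the same locale then costs one grammar
   construction instead of one per parse.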


P.S.: I'm on vacation from Fri 2 March 2012 to Sun 11 March 2012.

-- 
Lionel
