

Hi

[Including the original off-list discussion below for context for anyone who cares]

So I took a look at Daniel Sikeler's branch at
   https://cgit.freedesktop.org/libreoffice/core/log/?h=feature/fastparser
and it looks like he did a pretty thorough job of converting everything to XFastParser.

What was the reason this did not get merged?

Would it suffice to simply pull the commits out of this tree one-by-one, dust them off, pretty them up, verify them through 'make check' and push them to master?

Regards, Noel


On 2016/02/29 12:36 PM, Michael Meeks wrote:
Hi Noel,

        This belongs CC'd to the dev. list; please do fwd it there to continue
the discussion =)

On Sun, 2016-02-28 at 09:05 +0200, Noel Grandin wrote:
When you guys did the SAX parsing improvements (XFastParser2), why did
we maintain the UNO API?

        Is there an XFastParser2 API ?

Why not use libxml/expat directly ?

        The libxml2 API (the faster parser) is horrendous - the XFastParser API
is at least a tokenized API - which is essentially what we want the code
to consume; ultimately we want to patch libxml2 some more as well to
improve load performance - removing some of the more stupid pieces;
quite possibly we also want to implement an even faster compressed XML
parsing scheme I have up my sleeve behind that API.
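
        Roughly, the difference between the two callback styles is this - a
simplified plain-C++ sketch, where the token constants, handler names and
signatures are invented for illustration and are not the real UNO
interfaces:

    #include <cstdint>
    #include <cstring>

    // Hypothetical token values; in the real code these come from a
    // generated token list shared by the parser and the filters.
    enum : int32_t { TOKEN_TEXT_P = 1, TOKEN_TEXT_SPAN = 2 };

    void handleParagraph() {}
    void handleSpan() {}

    // Classic SAX (expat/libxml2/XParser style): every element name arrives
    // as a string, so each handler does its own string comparisons.
    void startElement(const char* pName)
    {
        if (std::strcmp(pName, "text:p") == 0)
            handleParagraph();
        else if (std::strcmp(pName, "text:span") == 0)
            handleSpan();
    }

    // Tokenized SAX (XFastParser style): the parser maps each name to an
    // integer token once, so handlers dispatch on cheap integer compares.
    void startFastElement(int32_t nElement)
    {
        switch (nElement)
        {
            case TOKEN_TEXT_P:    handleParagraph(); break;
            case TOKEN_TEXT_SPAN: handleSpan();      break;
        }
    }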

        We did short-circuit UNO for the tokenization piece - which saved a
huge chunk of time, and profiled it rather intensively. Last I looked, I
saw no significant performance cost from the UNO interface.

        Finally - the libxml2 and expat APIs are (like most SAX APIs)
synchronous, and same-thread; a big part of our load-time speed win
comes from doing the XML parse + tokenize in another thread, and
emitting the events in the main thread [ cf. slide decks at several
LibreOffice conferences on the topic ].
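
        The shape of that win is roughly a producer/consumer hand-off between
threads - a minimal sketch, where the SaxEvent record and the hard-coded
events are invented for illustration (the real code queues batches of
tokenized SAX events):

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Invented event record standing in for a tokenized SAX event
    // (element token, attribute list, character data, ...).
    struct SaxEvent { int32_t nToken; bool bDone; };

    std::queue<SaxEvent> aQueue;
    std::mutex aMutex;
    std::condition_variable aCond;

    // Producer: runs the XML parse + tokenization off the main thread.
    void parserThread()
    {
        // ... drive libxml2/expat here and turn its callbacks into events ...
        std::lock_guard<std::mutex> aGuard(aMutex);
        aQueue.push({ 42, false });   // some tokenized start-element event
        aQueue.push({ 0, true });     // end-of-document marker
        aCond.notify_one();
    }

    // Consumer: the main thread pops events and feeds them to the document
    // handler, so all document mutation stays on the main thread.
    int main()
    {
        std::thread aWorker(parserThread);
        for (;;)
        {
            std::unique_lock<std::mutex> aLock(aMutex);
            aCond.wait(aLock, [] { return !aQueue.empty(); });
            SaxEvent aEvent = aQueue.front();
            aQueue.pop();
            if (aEvent.bDone)
                break;
            // hand aEvent to the import filter here
        }
        aWorker.join();
        return 0;
    }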

        ie. nothing to 'fix' there =)

I'm assuming there is something I'm missing?

        Depends what you're trying to achieve =) if you want to improve
performance and cleanliness -by-far- the most useful thing remaining to
be done there is to switch the ODF filters in xmloff/ to use the
FastParser API - currently they do tokenization themselves in a horribly
inefficient way; and of course they don't take advantage of the threaded
parsing etc.
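
        Concretely, the cost looks roughly like this (names invented for
illustration, not the actual xmloff/ code): each filter keeps its own
name-to-token maps and pays a per-element lookup that a tokenized parser
has already done:

    #include <cstdint>
    #include <map>
    #include <string>

    // Invented stand-in for the per-filter token maps the ODF import builds.
    enum : int32_t { XML_TOK_TEXT_P = 1, XML_TOK_UNKNOWN = -1 };

    // Roughly what the string-based filters do today: every startElement()
    // pays a map lookup (plus string hashing/compares) just to turn the
    // element name back into a token the filter can switch on.
    int32_t lookupToken(const std::map<std::string, int32_t>& rTokenMap,
                        const std::string& rElementName)
    {
        auto it = rTokenMap.find(rElementName);
        return it != rTokenMap.end() ? it->second : XML_TOK_UNKNOWN;
    }

    // With a fast-parser handler the parser has already done that mapping
    // once, so a startFastElement(nToken, ...) can dispatch directly and the
    // per-filter maps - and the string round-trip - can eventually go away.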

        There was a Munich student (Daniel Sikeler) working on that -
unfortunately with very little time for mentoring; so it may be a
challenge to try to rescue that work. xmloff/ is quite big - and built
on from outside in the main components too. So - almost certainly by far
the best way here is an incremental one.

        We need to write a good, clean XFastParser <-> XParser mapping; prolly
that will require some love in sax/, since some of the semantics don't
map entirely perfectly in corner cases. I believe Daniel's branch is
feature/fastparser - and you could rescue just this mapper from there I
think.
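
        The rough shape of such a mapper - sketched here with invented
stand-in types rather than the real sax/ and UNO interfaces - is a fast
(tokenized) handler that de-tokenizes events back into the string-based
calls legacy clients expect:

    #include <cstdint>
    #include <string>

    // Plays the role of XDocumentHandler (invented stand-in).
    struct LegacyDocumentHandler
    {
        virtual void startElement(const std::string& rName) = 0;
        virtual void endElement(const std::string& rName) = 0;
        virtual ~LegacyDocumentHandler() = default;
    };

    // Reverse token lookup; in the real code this would come from the
    // token handler rather than being hard-coded.
    static std::string getNameFromToken(int32_t nToken)
    {
        return nToken == 1 ? std::string("text:p") : std::string("unknown");
    }

    // Plays the role of the XFastDocumentHandler implementation in sax/.
    class FastToLegacyBridge
    {
        LegacyDocumentHandler& mrLegacy;
    public:
        explicit FastToLegacyBridge(LegacyDocumentHandler& rLegacy)
            : mrLegacy(rLegacy) {}

        // De-tokenize: rebuild the qualified name and forward to the old API.
        void startFastElement(int32_t nToken)
        { mrLegacy.startElement(getNameFromToken(nToken)); }

        void endFastElement(int32_t nToken)
        { mrLegacy.endElement(getNameFromToken(nToken)); }
    };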

        That would then allow the threaded processing & tokenization (we would
need to de-tokenize again to the XParser interface but I think we would
still get some nice wins ;-). When that works nicely - we need to
connect the xmloff/ tokenization code to the XFastParser tokenized
results to avoid doing all of that twice, and slowly and carefully push
the interface change across the code to kill the XParser variant.

        At least - that would be my suggestion of something worthwhile & juicy
to dig teeth into =)

        ATB,

                Michael.

