Re: XFastParser - next steps ...

Michael Meeks <michael.meeks -AT- collabora.com>
Thu, 28 Jul 2016 14:16:33 +0100

Hi Mohammed,

Summarizing where we're at:

 No. of records in ods   sax_expat   #1      #2
 1 million               8s          20.5s   11s
 50 million              23.4s       26.3s   22s

#1 - first cut of XParser implemented on FastParser
#2 - second cut avoiding bogus tokenizer calls

...

After this change, load time for an ods file with 1 million entries has
come down to around 11s and with 50 million entries to around 22s.


        Great; so we're winning for 50m cells =)

I've included complete traces of when an ods file with 1 million entries
is imported for both xparser and legacyfastparser.


        Lovely.

I'm looking into the suggestions you gave in the irc channel.


        Let me log that here:

<mmeeks> Azorpid: so the maUsedEvents are re-used; but when they are
re-used it is necessary to clear the attributes on any elements in there
- which takes time.
<mmeeks> Azorpid: so we could clear them instead en-masse here:
<mmeeks>                     if (!consume(pEventList))
<mmeeks>                         done = true;
<mmeeks>                     aGuard.reset(); // lock
<mmeeks>                     rEntity.maUsedEvents.push(pEventList);
<mmeeks> .
<mmeeks> Azorpid: IIRC we used to have an ambition to have code to work
out if the producer or consumer had more time,
<mmeeks> Azorpid: and do that work in whichever one was getting ahead of
the other ;-) still not a bad idea IMHO.
<mmeeks> Azorpid: something like if (rEntity.maPendingEvents.size() <=
rEntity.mnEventLowWater) ... do the work in the consumer thread of
cleaning those attributes up.

        Let's get that done - but reading the code, I don't think this is going
to give us a huge win. It is useful to have done anyway, and I'm sure in
some cases it will really help.
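
        A rough sketch of the low-water idea from the log above, to make the
shape concrete. Everything here is a simplified stand-in: Event, EventList,
Entity and recycle() are hypothetical, the threshold value is made up, and
the locking around the queues is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real parser's event and entity types are richer.
struct Event
{
    std::vector<std::string> attributes;
};
using EventList = std::vector<Event>;

struct Entity
{
    std::queue<EventList*> maPendingEvents; // produced, waiting to be consumed
    std::queue<EventList*> maUsedEvents;    // consumed, waiting to be re-used
    std::size_t mnEventLowWater = 4;        // assumed threshold, not the real value
};

// Drop the attributes of every event in the list in one pass.
void clearAttributes(EventList& rList)
{
    for (Event& rEvent : rList)
        rEvent.attributes.clear();
}

// Called by the consumer once it has finished with pList.
// Returns true if the consumer did the clean-up itself.
bool recycle(Entity& rEntity, EventList* pList)
{
    assert(pList != nullptr);
    // Few pending lists means the consumer is ahead of the producer and has
    // time to spare, so it does the clearing; otherwise the list is pushed
    // back uncleared and the producer has to clear it on re-use.
    const bool bConsumerAhead
        = rEntity.maPendingEvents.size() <= rEntity.mnEventLowWater;
    if (bConsumerAhead)
        clearAttributes(*pList);
    rEntity.maUsedEvents.push(pList);
    return bConsumerAhead;
}
```

        In the real code the clearing would presumably want to happen before
re-taking the lock, so that the producer is not blocked while we tidy up.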

        Particularly after this next optimization.

        I think we need to move something based on your legacyfastparser.cxx,
which is now working nicely, into xmloff/ itself. Re-constructing UNO
Attributes objects with re-constructed namespace pieces etc. there is a
bit inefficient, I think.

        Ultimately, we want to fuse xmloff/ into using XFastParser directly -
so, I think the first step is to move a copy of the legacyfastparser.cxx
code into xmloff/ itself.

        Check out the CallbackDocumentHandler::startUnknown method in the 2nd
profile to see what that can look like: we're swamped with allocation
and de-allocation per element / attribute =)

        Can you implement the XFastParser interface and the proxying onwards to
the legacy interface inside the SvXMLImport class in
xmloff/source/core/xmlimp.cxx ?
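
        To illustrate the proxying shape only: a minimal sketch, assuming
simplified stand-ins rather than the real UNO interfaces. LegacyHandler,
ImportProxy and RecordingHandler are hypothetical names, and plain
std::string is used where the real code has OUString and attribute lists.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Attribute = std::pair<std::string, std::string>; // (qualified name, value)

// Hypothetical stand-in for the legacy, string-based document handler.
struct LegacyHandler
{
    virtual ~LegacyHandler() = default;
    virtual void startElement(const std::string& rQName,
                              const std::vector<Attribute>& rAttrs) = 0;
    virtual void endElement(const std::string& rQName) = 0;
};

// Receives the fast parser's "unknown element" callbacks (prefix and local
// name delivered separately) and forwards them to the legacy interface.
class ImportProxy
{
public:
    explicit ImportProxy(LegacyHandler& rLegacy) : mrLegacy(rLegacy) {}

    void startUnknownElement(const std::string& rPrefix, const std::string& rName,
                             const std::vector<Attribute>& rAttrs)
    {
        mrLegacy.startElement(makeQName(rPrefix, rName), rAttrs);
    }

    void endUnknownElement(const std::string& rPrefix, const std::string& rName)
    {
        mrLegacy.endElement(makeQName(rPrefix, rName));
    }

private:
    // Rebuild "prefix:name"; this per-element string work is the cost
    // that a later tokenized fast-path would try to avoid.
    static std::string makeQName(const std::string& rPrefix, const std::string& rName)
    {
        assert(!rName.empty());
        return rPrefix.empty() ? rName : rPrefix + ":" + rName;
    }

    LegacyHandler& mrLegacy;
};

// Test double that records what reaches the legacy side.
struct RecordingHandler : LegacyHandler
{
    std::vector<std::string> maLog;

    void startElement(const std::string& rQName,
                      const std::vector<Attribute>& rAttrs) override
    {
        std::string aLine = "start " + rQName;
        for (const Attribute& rAttr : rAttrs)
            aLine += " " + rAttr.first + "=" + rAttr.second;
        maLog.push_back(aLine);
    }

    void endElement(const std::string& rQName) override
    {
        maLog.push_back("end " + rQName);
    }
};
```

        With the proxy living in the importer itself, individual contexts
can later bypass it one at a time without touching the rest.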

        After that works - I think we need to consider how we can convert the
ScXMLTableRowCellContext to have a fast-path that tokenizes its input in
the parsing thread: thus avoiding all of the SvXMLNamespaceMap pieces,
and also avoiding all of the string allocation, free and copying
associated with that.

        How does that sound ? I think the next steps are:

        A. optimize clearing the pending events - unlikely to give
           a big win, but nice.

        B. merge the legacyfastparser pieces into SvXMLImport

        C. consider how to allow XFastParser tokenization selectively
           just for the elements, e.g. ScXMLTableRowCellContext, that
           can get the maximum benefit in the short run.
                1. this will involve slowing us down again by
                   adding a tokenizer.
                2. in fact this may speed things up by avoiding
                   lots of allocations.

        C.2. is quite exciting; we'll need to implement a nice getTokenDirect
for ODF - and we will need to be able (on demand) to switch those tokens
back into OUStrings. In fact - this seems like a great idea anyway IMHO
- surely much faster than allocating all of those strings. Perhaps worth
trying C.2. after A ;-)
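
        A minimal sketch of the two directions, assuming a tiny hand-written
table. The token ids and the three names are purely illustrative,
getNameFromToken is a hypothetical helper, std::string stands in for
OUString, and a real table would more likely be generated with a perfect
hash than scanned linearly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical token ids; illustrative only, not the real ODF token set.
enum Token : int
{
    TOKEN_UNKNOWN = -1,
    TOKEN_TABLE_CELL,
    TOKEN_TABLE_ROW,
    TOKEN_VALUE,
    TOKEN_COUNT
};

namespace
{
// Index in this array == token id.
const char* const aTokenNames[TOKEN_COUNT] = { "table-cell", "table-row", "value" };
}

// Map a raw name straight out of the parser buffer (not NUL-terminated,
// hence the explicit length) to a token without allocating a string.
int getTokenDirect(const char* pName, std::size_t nLength)
{
    assert(pName != nullptr);
    for (int i = 0; i < TOKEN_COUNT; ++i)
    {
        const char* pCandidate = aTokenNames[i];
        if (std::strlen(pCandidate) == nLength
            && std::memcmp(pCandidate, pName, nLength) == 0)
            return i;
    }
    return TOKEN_UNKNOWN;
}

// Turn a token back into its name, on demand only; unknown tokens
// give an empty string.
std::string getNameFromToken(int nToken)
{
    if (nToken < 0 || nToken >= TOKEN_COUNT)
        return std::string();
    return aTokenNames[nToken];
}
```

        The point is that the string is only materialized for the callers
that still ask for one; contexts that switch on the token never pay
for it.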

        For C.2. you'll want to check out the:

        $ git log -u feature/fastparser

        branch that Daniel Siekler worked on. I'll send you an archive of his
xmloff/ fastparser pieces - but it would be great to cherry-pick some of
that work across to master. My -hope- is that as/when we have an
incremental approach working here - we can pick and test his patches
across one by one and enable fast-parsing for each of those contexts. 

        How does that sound ?

        ATB,

                Michael.

-- 
michael.meeks@collabora.com <><, GM Collabora Productivity
 Skype: mmeeks, Google Hangout: mejmeeks@gmail.com
 (M) +44 7795 666 147 - timezone usually UK / Europe
