Hi there,
On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote:
fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because
of the extra checks I added to sanity test .doc file during parsing.
:-)
1st attachment gets us back to where we came from in terms of
performance there.
Great - I've pushed that to libreoffice-3-5.
The second one is maybe a bit more contentious for 3-5, but I include it
for a look-over. Basically there's a "page chain" in the ole2 format and
to seek to a location you walk through the chain until you get to the
correct page.
Yep - it's a FAT structure; nasty one-slot linked list.
I see that in practice most documents have their chain in
a presorted order (I have no idea if having them unsorted is actually a
sign of a broken document or if its legal)
It is legal, but it's highly unlikely.
So second attachment keeps the chain if its one of the circumstances
where we have to parse the whole thing anyway, and if not (the usual
one) then if we're seeking an (fairly arbitrary) large number of steps
where its likely that the cost-benefit is in favour of it then generate
the chain once and hold on to it for reuse until there any event which
might invalidate the chain.
So - I'm a bit lost with the second patch. The FAT is essentially a
linked list; and we walk it to find the block we want, so:
nRel = 3; -> go three steps along the linked list.
A -> B -> C == C ...
Your pageCache looks like the right approach, and turns this into a
simple vector we can rapidly look up:
[0] = A, [1] = B, [2] = C
And so on - hopefully I'm there so far :-) With the 512 byte block size
we expect, that turns a 100Mb OLE2 stream into a 800k vector (which is
just fine).
Second patch knocks about 50 seconds off load time. First patch knocks
some unknown but > 10 mins off load time.
My question is thus about:
[snip]
std::vector<sal_Int32>::iterator aI;
if (m_bSortedPageChain)
aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
else
aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
if (aI == m_aPagesCache.end())
{
SAL_WARN("sot", "Unknown page position");
nBgn = STG_EOF;
}
else
{
size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);
size_t nIndex = nBgnDistance + nRel;
if (nIndex > m_aPagesCache.size())
{
nRel = m_aPagesCache.size() - nBgnDistance;
nIndex = m_aPagesCache.size() - 1;
}
else
nRel = 0;
nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
nBgn = m_aPagesCache[nIndex];
}
[/snip]
Surely that should read:
nBgn = m_aPageCache[nRel];
or did I go a bit crazy ? [ I'm trying to turn the above into that
mentally and giving up ;-].
Otherwise the clearing of the cache looks sensible; though the:
StgFATStrm::SetPage(); should prolly just update that vector as well as
doing the nasty fat walk so:
m_aPageCache[nBlocks] = nNewPage;
or somesuch (assuming it is long enough for that).
Otherwise it looks really lovely.
Thanks !
Michael.
--
michael.meeks@suse.com <><, Pseudo Engineer, itinerant idiot
Context
Privacy Policy |
Impressum (Legal Info) |
Copyright information: Unless otherwise specified, all text and images
on this website are licensed under the
Creative Commons Attribution-Share Alike 3.0 License.
This does not include the source code of LibreOffice, which is
licensed under the Mozilla Public License (
MPLv2).
"LibreOffice" and "The Document Foundation" are
registered trademarks of their corresponding registered owners or are
in actual use as trademarks in one or more countries. Their respective
logos and icons are also subject to international copyright laws. Use
thereof is explained in our
trademark policy.