Hi Michael

On 1 February 2011 01:17, Michael Meeks <michael.meeks@novell.com> wrote:
Hi Steve,

       Sure - so; in response to user input I suspect we can take a second to
parse the thesaurus; we have around 20Mb of text to load for en_US;
perhaps 32Mb is a reasonable upper-bound; it does seem a lot to parse so
quickly.

Where it will hurt is if the file is not in the cache and the user has
some background task running that hits the disk.

An example might be on Windows with virus scanning (or viruses :) ).

       Right. I think we could easily serialize a small skip-list to disk too
- if we simply store ~8 or ~32 or so indexes into the data - we can
parse only a fraction of it, and pop that in our home directory. We
could also drop the MyThes code as a dependency to manage.

I'm not sure what you mean by a skip list, unless you simply mean a
file similar to the existing .idx, or just a list of offsets for where
the words are so we can skip loading the whole file.  The trouble with
that approach is that readahead will likely pull in the whole file
anyway, as the words aren't generally _that_ far apart in it, so you'd
still do all the IO and just skip a bit of the CPU time.
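
Just to make sure we're talking about the same thing, here is roughly
what I picture by "a list of offsets" - purely a sketch, the names and
layout are made up rather than anything MyThes actually uses:

#include <string>
#include <vector>
#include <algorithm>

// Hypothetical sparse index: the byte offset of every Nth entry in the
// .dat file, so a lookup only has to scan one small block of it.
struct SparseEntry
{
    std::string word;   // first word covered by this block
    long        offset; // byte offset of that word's line in the .dat
};

static bool entryLess(const SparseEntry& a, const SparseEntry& b)
{
    return a.word < b.word;
}

// Binary-search the sparse index; the caller then seeks to the returned
// offset and scans forward linearly until it hits the word (or the next
// indexed block).  Returns -1 if the word sorts before the first entry.
long findBlockStart(const std::vector<SparseEntry>& index,
                    const std::string& word)
{
    SparseEntry probe;
    probe.word = word;
    std::vector<SparseEntry>::const_iterator it =
        std::upper_bound(index.begin(), index.end(), probe, entryLess);
    if (it == index.begin())
        return -1;
    --it;
    return it->offset;
}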


       The code using it is in:

       lingucomponent/source/thesaurus/libnth/nthesimp.cxx

BTW, if I did that I'd probably do some major surgery on mythes and
just use STL, because it basically does C-style memory management and
processing, and I think I would screw it up if I started messing with
it.  The only problem with simplifying it with STL constructs is that
I would want to change the interface (string vs char *), maybe use STL
vectors for the list of synonyms, etc.

       Heh; sure.
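
For reference, the sort of shape I had in mind was roughly this - an
illustration only, the names are made up and it ignores encodings and
error handling:

#include <string>
#include <vector>

// One meaning entry for a word: a part-of-speech/definition tag plus
// its synonyms (the names here are made up for illustration).
struct Meaning
{
    std::string              pos;
    std::vector<std::string> synonyms;
};

// Rough shape of an STL-ified lookup interface: a value-returning
// std::vector instead of a C-style out-parameter that the caller has
// to free again afterwards.
class Thesaurus
{
public:
    Thesaurus(const std::string& idxPath, const std::string& datPath);

    // Returns every meaning found for 'word'; empty if it is absent.
    std::vector<Meaning> lookup(const std::string& word) const;
};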

I've cooled off on this a bit, though, as performance is slower when
using lots of strings etc.  I was able to change the approach to
loading the .idx to treat it as one big buffer, which sped it up
considerably too.  This did mean resorting to lots of pointer
tomfoolery, but it is easy to clean up as there are only 3 allocations
instead of 100k+.
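
Very roughly, the idea is something like this (a sketch from memory
rather than the actual patch - it skips the .idx header lines for the
encoding and the entry count, and assumes "word|offset" lines for the
rest of the file):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// One index entry: a pointer into the big buffer plus the byte offset
// of the corresponding entry in the .dat file.
struct IdxEntry
{
    const char* word;   // points into 'buf', NUL-terminated in place
    long        offset;
};

// Slurp the whole .idx into one buffer and index it in place, so there
// are only a couple of allocations instead of one per line.
bool loadIdx(const char* path, std::vector<char>& buf,
             std::vector<IdxEntry>& entries)
{
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp)
        return false;
    std::fseek(fp, 0, SEEK_END);
    long size = std::ftell(fp);
    std::fseek(fp, 0, SEEK_SET);

    buf.resize(size + 1);                  // allocation #1: the raw text
    std::fread(&buf[0], 1, size, fp);
    std::fclose(fp);
    buf[size] = '\0';

    // allocation #2 (give or take a reallocation): the entry table
    for (char* p = &buf[0]; *p; )
    {
        char* nl = std::strchr(p, '\n');
        if (nl)
            *nl = '\0';                    // terminate the line in place
        char* bar = std::strchr(p, '|');   // split "word|offset"
        if (bar)
        {
            *bar = '\0';
            IdxEntry e;
            e.word = p;
            e.offset = std::atol(bar + 1);
            entries.push_back(e);
        }
        if (!nl)
            break;
        p = nl + 1;
    }
    return true;
}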

       I guess we could re-write it inside lingucomponent then (?) but we
should prolly get a better understanding of how frequently this code is
called first - is it hooked into from the spell checking code ? or is it
really just the Tools->Language->Thesaurus ?

It's actually hooked into the right-click menu (probably amongst other
things).  The first time you right-click on a word, the thesaurus
dictionary for the current locale is loaded before the right-click
menu shows up.  After that, subsequent lookups use the cached
thesaurus dictionary.

If you look in the right-click menu, you'll notice that a list of
synonyms from the thesaurus shows up (assuming the word is found) :).

Regards,
Steven Butler
