Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

Michael Meeks <michael.meeks -AT- novell.com>
Mon, 31 Jan 2011 15:17:47 +0000

Hi Steve,

On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:

I haven't had a look at this yet as I thought getting a script to
analyze the existing thesaurus files would be helpful to get those
errors looked at.


        Nice work with that :-)

I thought I would discuss your idea about not using the index at all
to see what reception it gets, but I think you may also have been
suggesting a similar thing: are the index files even useful on modern gear?


        I suspect the index files are mostly useless (personally).

I can populate the en_US index in memory from the .dat file with the
C++ code in 0.287 s after dropping all cache, and 0.188s when the
cache is hot.


        Sure - so, in response to user input I suspect we can afford to take
a second to parse the thesaurus. We have around 20MB of text to load for
en_US; perhaps 32MB is a reasonable upper bound. It does seem a lot to
parse so quickly.

I do admit that my desktop is pretty quick though, with 4 cores, SATA
II drives etc.


        Sure - but it will only use one of these ;-)

If the thesaurus is only loaded when the user pops it up, then
couldn't mythes be taught to generate its own in-memory index
from the dictionary and not bother with an index file at all?


        Right. I think we could easily serialize a small skip-list to disk too
and pop that in our home directory - if we simply store ~8 or ~32 or so
offsets into the data, we only need to parse a fraction of it per lookup.
We could also drop the MyThes code as a dependency to manage.

        The code using it is in:

        lingucomponent/source/thesaurus/libnth/nthesimp.cxx

BTW, if I did that I'd probably do some major surgery on mythes and
just use STL because it basically is doing C style memory management
and processing and I think I would screw it up if I started messing
with it.  The only problem with simplifying it with STL constructs is
that I would want to change the interface (string vs char *), maybe
use STL vectors for the list of synonyms, etc.


        Heh; sure.
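
As a rough idea of what a string/vector based interface might look like, here is a sketch that parses one meaning line into STL types. It assumes meaning lines have the form "(pos)|syn1|syn2|...". Meaning and parseMeaning are hypothetical names, not the existing MyThes API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical STL-based replacement for a char*-based meaning record.
struct Meaning {
    std::string partOfSpeech;           // first field, e.g. "(adj)"
    std::vector<std::string> synonyms;  // remaining non-empty fields
};

// Splits one meaning line on '|'. The first field becomes partOfSpeech,
// every later non-empty field becomes a synonym.
Meaning parseMeaning(const std::string& line)
{
    Meaning m;
    std::size_t start = 0;
    bool first = true;
    while (start <= line.size()) {
        std::size_t bar = line.find('|', start);
        if (bar == std::string::npos)
            bar = line.size();          // last field runs to end of line
        const std::string field = line.substr(start, bar - start);
        if (first) {
            m.partOfSpeech = field;
            first = false;
        } else if (!field.empty()) {
            m.synonyms.push_back(field);
        }
        start = bar + 1;
    }
    return m;
}
```

Ownership is handled by the containers here, so callers have nothing to free by hand.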

By this stage it's not looking much like mythes anymore ...


        I guess we could re-write it inside lingucomponent then (?) but we
should probably get a better understanding of how frequently this code is
called first - is it hooked into from the spell checking code? Or is it
really just Tools->Language->Thesaurus?

        Thanks !

                Michael.

-- 
 michael.meeks@novell.com  <><, Pseudo Engineer, itinerant idiot
