

Hi Steven,

On Wed, 2011-01-26 at 15:17 +1000, Steven Butler wrote:
One idea: can we generate the thesaurus idx file during install? That may
save a few megabytes.
..
I have had an attempt at this - code attached, it is dual licensed under
LGPL / MIT although there are no (c) headers in the file (feel free to add
some).

        Wow - great work :-) I've just pushed this to dictionaries/source in
master, and compiled it there. Still need some tweaks to get it called
in the various dictionaries/ makefiles I suppose - but it is a great
start thanks !

        Licensing wise - I'd like to add the standard LGPLv3+/MPL header to it
(see bootstrap/) but having MIT too is fine if you want.

        I was going to add it as an easy hack, but you beat me to it :-)

I have no idea how this would be integrated into the build process as I'm
not even sure where it is called from, but happy if someone wants to
take up the challenge and/or incorporate it as an installer process.

        So - the installer process is more exciting on Windows I think - we'll
need to see how the setup_native/ tools are called and be inspired by
that I think.

Here's timing of the CPP version on a Core i5 amd64 generating the
following indices:
..      
The same set of files using th_gen_idx.pl took around 5 seconds (although
some basic fixups got it down to 3.5 seconds).

        Great - it's trivial; indeed - it rather makes you wonder whether we
need the indexes at all ? [ I wonder what they are good for, and/or what
code loads and uses them ;-]. We may discover that in fact there is no
need for them to be indexed - any chance of a dig around ?
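For readers following the thread, here is a minimal sketch of what generating such an index might look like. It is not the attached code. It assumes a MyThes-style layout: the first line of the data file names the encoding, each entry starts with a `word|N` header followed by N meaning lines, and the index lists the encoding, the entry count, and sorted `word|byte-offset` lines. The function name `buildIndex` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build index text from thesaurus data held in memory.
// Assumed layout (not confirmed by the thread): line 1 is the encoding,
// each entry is "word|N" followed by N meaning lines, lines end in '\n'.
std::string buildIndex(const std::string& dat)
{
    std::istringstream in(dat);
    std::string encoding, line;
    std::getline(in, encoding);
    std::size_t offset = encoding.size() + 1; // byte offset of the next line

    std::vector<std::pair<std::string, std::size_t>> entries;
    while (std::getline(in, line)) {
        const std::size_t entryOffset = offset;
        offset += line.size() + 1;
        const std::size_t bar = line.find('|');
        if (bar == std::string::npos)
            continue; // not an entry header, skip it
        entries.emplace_back(line.substr(0, bar), entryOffset);

        // Trust the declared count and skip that many meaning lines.
        const unsigned long meanings =
            std::strtoul(line.c_str() + bar + 1, nullptr, 10);
        for (unsigned long i = 0; i < meanings && std::getline(in, line); ++i)
            offset += line.size() + 1;
    }

    // Bytewise sort on the word; stable, so duplicates keep file order.
    std::stable_sort(entries.begin(), entries.end(),
                     [](const std::pair<std::string, std::size_t>& a,
                        const std::pair<std::string, std::size_t>& b) {
                         return a.first < b.first;
                     });

    std::ostringstream out;
    out << encoding << '\n' << entries.size() << '\n';
    for (const auto& e : entries)
        out << e.first << '|' << e.second << '\n';
    return out.str();
}
```

Because the sketch trusts each declared count, a wrong count makes it misread the following lines, which is the failure mode described further down. Using a stable sort is one way to keep duplicate words in a predictable order.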

What I have noticed while testing the change was that a lot of the
dictionaries I processed have errors.

        Nasty.

These range from having the entry count incorrect, causing the index
process to miss a word (lots of these in some dictionaries), to having
words apparently duplicated either as the next entry, or sometimes a long
way apart.

        That is bad; we should mail the l10n list to ask them to have a look I
suppose.

I have not attempted to fix these dictionary issues, but if they are
serious it might be worth having a perl script that is able to validate
the dictionaries are internally consistent.  Unfortunately, it would have
to use heuristics as the file format makes it difficult to tell in general
what kind of line is being processed.

        Right; we should validate them as we compile the index perhaps - or at
least, look at the parser and see how it has traditionally interpreted
them.
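A consistency check along these lines could be sketched as follows. This is only an illustration under the same assumed layout (an encoding line, then `word|N` headers each followed by N meaning lines); the names `looksLikeHeader` and `checkThesaurus` are hypothetical, and the header test is a heuristic, since a meaning line of the form `text|123` would be misread as a header.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Heuristic: an entry header looks like "word|<digits>" and nothing more.
bool looksLikeHeader(const std::string& line)
{
    const std::size_t bar = line.find('|');
    if (bar == std::string::npos || bar == 0 || bar + 1 == line.size())
        return false;
    for (std::size_t i = bar + 1; i < line.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(line[i])))
            return false;
    return true;
}

// Hypothetical checker: returns one message per suspected problem.
std::vector<std::string> checkThesaurus(const std::string& dat)
{
    std::vector<std::string> lines;
    std::istringstream in(dat);
    for (std::string line; std::getline(in, line);) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back(); // tolerate CRLF line endings
        lines.push_back(line);
    }

    std::vector<std::string> problems;
    std::map<std::string, std::size_t> firstSeen; // word -> 1-based line
    std::size_t i = 1;                            // index 0 is the encoding
    while (i < lines.size()) {
        const std::string where = "line " + std::to_string(i + 1) + ": ";
        if (!looksLikeHeader(lines[i])) {
            problems.push_back(where + "expected an entry header "
                                       "(previous count too low?)");
            ++i;
            continue;
        }
        const std::size_t bar = lines[i].find('|');
        const std::string word = lines[i].substr(0, bar);
        const unsigned long count =
            std::strtoul(lines[i].c_str() + bar + 1, nullptr, 10);

        const auto seen = firstSeen.find(word);
        if (seen != firstSeen.end())
            problems.push_back(where + "duplicate entry '" + word +
                               "', first seen at line " +
                               std::to_string(seen->second));
        else
            firstSeen[word] = i + 1;

        ++i; // step past the header, then consume the meaning lines
        for (unsigned long k = 0; k < count && i < lines.size(); ++k, ++i) {
            if (looksLikeHeader(lines[i])) {
                problems.push_back(where + "count too high for '" + word + "'");
                break; // treat this line as the next header
            }
        }
    }
    return problems;
}
```

The same checks could run inside the index generator itself, so that a bad count is reported at build time instead of silently dropping a word from the index.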

The CPP version attached has a difference from the perl script in that
when multiple entries are found, they appear to be coming out in reverse
order to the original perl script.  What I'm curious about is what impact
having multiple entries for a word has when loaded into LibreOffice.

        Me too ;-)

For reference I have attached an improved perl version of the perl script
that runs a couple of seconds faster than the original.  I had three to
four versions in my tree but changing none of them triggered a git diff to
show the changes so I've attached the full copy.

        The native code thing is great; it'd be wonderful if you had some time
to look at hooking it into the build process in dictionaries/ (?)

        Thanks muchly !

                Michael.


-- 
 michael.meeks@novell.com  <><, Pseudo Engineer, itinerant idiot


