Hi Michael,

On 28 January 2011 04:04, Michael Meeks <michael.meeks@novell.com> wrote:
Licensing wise - I'd like to add the standard LGPLv3+/MPL header to it (see bootstrap/) but having MIT too is fine if you want.
This patch adds the (c) header from the template to idxdict.cpp, although I had to tweak the year to 2011.
I have no idea how this would be integrated into the build process as I'm not even sure where it is called from, but happy if someone wants to take up the challenge and/or incorporate it as an installer process.
So - the installer process is more exciting on Windows I think - we'll need to see how the setup_native/ tools are called and be inspired by that I think.
I think in order to do any work on the Windows installer I would have to work out how to get a Windows compile environment set up. I currently only have one set up on my Ubuntu machine.
The same set of files using th_gen_idx.pl took around 5 seconds (although some basic fixups got it down to 3.5 seconds).
Great - it's trivial; indeed - it rather makes you wonder whether we need the indexes at all ? [ I wonder what they are good for, and/or what code loads and uses them ;-]. We may discover that in fact there is no need for them to be indexed - any chance of a dig around ?
I imagine my timings are a bit skewed by the machine I tested on, and the number of times I ran it. I'm sure all the dictionaries were well and truly in the buffer cache, so there was no I/O for the test. On slower machines (are you targeting these?) or slower disks there is a chance the index files may offer a performance improvement. Here is the same test after I dropped all my buffer cache:
real 0m2.300s
user 0m0.700s
sys 0m0.150s
These range from having the entry count incorrect, causing the index process to miss a word (lots of these in some dictionaries), to having words apparently duplicated either as the next entry, or sometimes a long way apart.
That is bad; we should mail the l10n list to ask them to have a look I suppose.
I wasn't aware there was such a list and I can't find one on freedesktop.org - is it a LibreOffice-related l10n list, or are these dictionaries sourced from another project?
I have not attempted to fix these dictionary issues, but if they are serious it might be worth having a Perl script that is able to validate that the dictionaries are internally consistent. Unfortunately, it would have to use heuristics as the file format makes it difficult to tell in general what kind of line is being processed.
Right; we should validate them as we compile the index perhaps - or at least, look at the parser and see how it has traditionally interpreted them.
If a utility were written that can validate the files, would it be possible to make it reject on commit if it detected errors?
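To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is in Python rather than Perl purely for illustration, and it rests on assumptions I have not verified against the real parser: that the first line of a dictionary names the encoding, that each entry starts with a "word|count" line, and that exactly count meaning lines follow it. The function name validate_thesaurus and the header heuristic (a line with exactly two fields, the second all digits) are my own inventions, and the heuristic will misfire on any meaning line that happens to look like that.

```python
import re

# Heuristic: an entry header is assumed to look like "word|3".
HEADER = re.compile(r"^([^|]+)\|(\d+)$")

def validate_thesaurus(lines):
    """Return (line_number, message) pairs for suspected problems.

    `lines` is the dictionary content, one string per line, without
    trailing newlines. Line numbers in the result are 1-based.
    """
    problems = []
    seen = {}   # word -> line number of its first header
    i = 1       # line 0 is assumed to hold the encoding name
    while i < len(lines):
        match = HEADER.match(lines[i])
        if not match:
            # Likely a meaning line left over because the previous
            # entry declared too small a count.
            problems.append((i + 1, "expected 'word|count' header"))
            i += 1
            continue
        word, count = match.group(1), int(match.group(2))
        header_line = i + 1
        if word in seen:
            problems.append(
                (header_line, f"duplicate entry '{word}' (first at line {seen[word]})"))
        else:
            seen[word] = header_line
        i += 1
        # Consume up to `count` meaning lines, stopping early if the
        # next header (or the end of the file) arrives too soon.
        found = 0
        while found < count and i < len(lines) and not HEADER.match(lines[i]):
            found += 1
            i += 1
        if found < count:
            problems.append(
                (header_line, f"'{word}' declares {count} meanings, found {found}"))
    return problems

if __name__ == "__main__":
    sample = [
        "UTF-8",
        "cat|1", "(noun)|feline",
        "dog|1", "(noun)|hound", "(noun)|canine",   # count too low
        "cat|1", "(noun)|kitty",                    # duplicate word
        "bird|2", "(noun)|fowl",                    # count too high
    ]
    for line_number, message in validate_thesaurus(sample):
        print(f"line {line_number}: {message}")
```

A commit hook could run something like this over each changed dictionary and refuse the commit if the returned list is non-empty, but whether the heuristic is strict enough for that depends on how the existing parser treats these files.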
Having multiple entries for a word when loaded into LibreOffice?
The native code thing is great; it'd be wonderful if you had some time to look at hooking it into the build process in dictionaries/ (?)
Yep... I will have to try to figure out how the build works though. Back to the wiki; at least I've realised how to make git work across the multiple checkouts now.

--
Regards,
Steven Butler
Attachment: copyright.patch