

One idea: can we generate the thesaurus idx file during install? That
might save a few megabytes.

      Oh - right; 4MB of that - which we can (I assume easily) build at
install time; I've added that to the spreadsheet and re-uploaded it.
It should in fact be quite fun to rewrite the somewhat trivial
dictionaries/util/th_gen_idx.pl script as a standalone C++ tool - it would
be faster too: it takes ~5 CPU seconds per file to index those beasties in
perl, which would be near-instant in C++.

I have had an attempt at this - code attached. It is dual-licensed under
LGPL / MIT, although there are no (c) headers in the file (feel free to add
some).

I have no idea how this would be integrated into the build process, as I'm
not even sure where it is called from, but I'm happy if someone wants to
take up the challenge and/or incorporate it as an installer step.

Here's the timing of the C++ version on a Core i5 (amd64) generating the
following indices:

libo/clone/libs-extern-sys/dictionaries/ca/th_ca_ES_v3.dat.idx2
libo/clone/libs-extern-sys/dictionaries/cs_CZ/th_cs_CZ_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/da_DK/th_da_DK.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_AT/th_de_AT_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_CH/th_de_CH_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_DE/th_de_DE_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/en/th_en_US_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/fr_FR/thes_fr.dat.idx2
libo/clone/libs-extern-sys/dictionaries/hu_HU/th_hu_HU_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/it_IT/th_it_IT_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ne_NP/th_ne_NP_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/no/th_nb_NO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/no/th_nn_NO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/pl_PL/th_pl_PL_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ro/th_ro_RO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ru_RU/th_ru_RU_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/sk_SK/th_sk_SK_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/sl_SI/th_sl_SI_v2.dat.idx2

real    0m0.792s
user    0m0.630s
sys     0m0.080s

The same set of files took around 5 seconds using th_gen_idx.pl (although
some basic fixups got that down to 3.5 seconds).

What I noticed while testing the change is that a lot of the
dictionaries I processed contain errors.

These range from incorrect entry counts, which cause the indexing
process to miss a word (lots of these in some dictionaries), to words
apparently duplicated, either as the very next entry or sometimes a long
way apart.

I have not attempted to fix these dictionary issues, but if they are
serious it might be worth having a script that validates that the
dictionaries are internally consistent.  Unfortunately, it would have to
use heuristics, as the file format makes it difficult to tell in general
what kind of line is being processed.
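
As a starting point, a consistency checker along those lines could look
roughly like this. This is only a sketch, not the attached tool: it relies
on the heuristic that a headword line has the form word|count with a purely
numeric count, so an entry line that happens to end in |<number> would still
fool it (parseHeadword and validateThesaurus are illustrative names):

```cpp
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

// Heuristic: a headword line looks like "word|count" where everything
// after the last '|' is a plain decimal number.
static bool parseHeadword(const string &line, string &word, long &count)
{
        size_t bar = line.rfind('|');
        if (bar == string::npos || bar + 1 >= line.size())
                return false;
        const string tail(line.substr(bar + 1));
        char *end = NULL;
        count = strtol(tail.c_str(), &end, 10);
        if (*end != '\0' || count <= 0)
                return false;
        word = line.substr(0, bar);
        return true;
}

// Walk a .dat stream (first line is the encoding) and count lines that
// should be headwords but do not parse as one, plus headwords repeated
// back-to-back.  Returns the number of problems found.
int validateThesaurus(istream &in, ostream &log)
{
        string line, word, lastWord;
        long count;
        int problems = 0;
        getline(in, line);              // skip the encoding line
        while (getline(in, line))
        {
                if (!parseHeadword(line, word, count))
                {
                        log << "not a headword (entry count wrong above?): "
                            << line << '\n';
                        ++problems;
                        continue;       // try to resynchronise on the next line
                }
                if (word == lastWord)
                {
                        log << "duplicate headword: " << word << '\n';
                        ++problems;
                }
                lastWord = word;
                for (long i = 0; i < count && getline(in, line); ++i)
                        ;               // skip the declared entry lines
        }
        return problems;
}
```

Wiring it up would just be a matter of opening the .dat file and calling
validateThesaurus(file, cerr).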

The attached C++ version differs from the perl script in one respect: when
a word has multiple entries, they come out in reverse order relative to the
original perl script.  What I'm curious about is what impact having
multiple index entries for a word has when the index is loaded into
LibreOffice.
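
For the curious, the reversal comes from the hinted insert in the code
below: the hint is the iterator returned by the previous insert, and a
hinted multimap insert places the element as close as possible to just
before the hint, so each duplicate key lands in front of the previous one
(at least under the C++11 rules, and with libstdc++ in practice).  A
minimal sketch, where plainOrder and hintedOrder are illustrative helpers,
not part of the tool:

```cpp
#include <map>
#include <string>
#include <vector>

using namespace std;

// Insert values under a single key with plain insert(); equal keys are
// kept in insertion order (guaranteed since C++11).
vector<int> plainOrder(const vector<int> &values)
{
        multimap<string, int> m;
        for (size_t i = 0; i < values.size(); ++i)
                m.insert(make_pair(string("word"), values[i]));
        vector<int> out;
        for (multimap<string, int>::const_iterator it = m.begin();
             it != m.end(); ++it)
                out.push_back(it->second);
        return out;
}

// Insert the same values the way the attached tool does: a hinted insert
// whose hint is the iterator returned by the previous insert.  Each
// duplicate is placed just before the hint, so equal keys come out in
// reverse insertion order.
vector<int> hintedOrder(const vector<int> &values)
{
        multimap<string, int> m;
        multimap<string, int>::iterator ret = m.begin();
        for (size_t i = 0; i < values.size(); ++i)
                ret = m.insert(ret, make_pair(string("word"), values[i]));
        vector<int> out;
        for (multimap<string, int>::const_iterator it = m.begin();
             it != m.end(); ++it)
                out.push_back(it->second);
        return out;
}
```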

For reference I have also attached an improved version of the perl script
that runs a couple of seconds faster than the original.  I had three or
four versions in my tree, and none of them would produce a git diff showing
the changes, so I've attached a full copy.

Cheers
Steve.

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <stdlib.h>
#include <string.h>

static const int MAXLINE = 1024*64;

using namespace std;

int main(int argc, char *argv[])
{
        if (argc != 3 || strcmp(argv[1],"-o"))
        {
                cout << "Usage: th_gen_idx -o outputfile < input\n";
                ::exit(99);
        }
        // This call improves performance by approx 5x
        cin.sync_with_stdio(false);

        const char * outputFile(argv[2]);
        char inputBuffer[MAXLINE];
        multimap<string, size_t> entries;
        multimap<string,size_t>::iterator ret(entries.begin());

        cin.getline(inputBuffer, MAXLINE);
        const string encoding(inputBuffer);
        size_t currentOffset(encoding.size()+1);
        while (true)
        {
                // Extract the next headword, up to (but not including) the '|'
                cin.getline(inputBuffer, MAXLINE, '|');

                if (cin.eof()) break;

                string word(inputBuffer);
                // Hinted insert: the hint is the previous insertion point,
                // which is fast for the mostly-sorted input
                ret = entries.insert(ret, pair<string, size_t>(word, currentOffset));
                currentOffset += word.size() + 1;
                // The rest of the line is the entry count for this headword
                cin.getline(inputBuffer, MAXLINE);
                if (!cin.good())
                {
                        cerr << "Unable to read entry - insufficient buffer?\n";
                        exit(99);
                }
                currentOffset += strlen(inputBuffer)+1;
                int entryCount(strtol(inputBuffer, NULL, 10));
                // Skip the entry lines; only their lengths matter for the offsets
                for (int i(0); i < entryCount; ++i)
                {
                        cin.getline(inputBuffer, MAXLINE);
                        currentOffset += strlen(inputBuffer)+1;
                }
        }

        // Use binary mode to prevent any translation of LF to CRLF on Windows
        ofstream outputStream(outputFile, ios_base::binary| ios_base::trunc|ios_base::out);
        if (!outputStream.is_open())
        {
                cerr << "Unable to open output file " << outputFile << endl;
                ::exit(99);
        }

        cout << outputFile << endl;

        outputStream << encoding << '\n' << entries.size() << '\n';

        for (multimap<string, size_t>::const_iterator ii(entries.begin());
                ii != entries.end();
                ++ii
        )
        {
                outputStream << ii->first << '|' << ii->second << '\n';
        }

        return 0;
}
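
If it helps with testing, the indexing loop above can also be factored
over arbitrary streams so it can be exercised with toy data.  This is a
sketch mirroring the attached code, not a replacement (it uses plain
rather than hinted inserts, so duplicate headwords keep their input order;
genIdx is an illustrative name):

```cpp
#include <cstdlib>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

using namespace std;

// Mirror of the indexing loop in the attached tool, but over arbitrary
// streams so it can be fed toy data.
void genIdx(istream &in, ostream &out)
{
        string line;
        getline(in, line);                      // encoding line
        const string encoding(line);
        size_t currentOffset = encoding.size() + 1;

        multimap<string, size_t> entries;
        while (getline(in, line, '|'))          // headword, up to the '|'
        {
                const string word(line);
                entries.insert(make_pair(word, currentOffset));
                currentOffset += word.size() + 1;
                getline(in, line);              // entry count
                currentOffset += line.size() + 1;
                long entryCount = strtol(line.c_str(), NULL, 10);
                for (long i = 0; i < entryCount && getline(in, line); ++i)
                        currentOffset += line.size() + 1;
        }
        out << encoding << '\n' << entries.size() << '\n';
        for (multimap<string, size_t>::const_iterator it = entries.begin();
             it != entries.end(); ++it)
                out << it->first << '|' << it->second << '\n';
}
```

For the toy input "UTF-8\nhello|1\n(noun)|hi\n" this produces a one-entry
index, "hello|6", since the headword starts right after the six bytes of
the encoding line plus its newline.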


