One idea: can we generate the thesaurus idx file during install? That might save a few megabytes.
Oh - right; 4MB of that - which we can (I assume easily) build at install time; I've added that to the spreadsheet and re-uploaded it. It should be quite fun, in fact, to re-write the somewhat trivial dictionaries/util/th_gen_idx.pl script as a standalone C++ tool - it would be faster too: it takes ~5 CPU seconds each to index those beasties in perl, which would be ~instant in C++.
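For context, a MyThes-style thesaurus .dat file and the derived index look roughly like this. This is an illustrative sample I made up, not taken from a real dictionary; the offsets follow the logic of the attached tool (byte offset of each head word's line, counting the encoding line plus its newline):

```
ISO8859-1
aberration|1
(noun)|deviation|distortion
abandon|2
(verb)|desert|forsake
(noun)|wild enthusiasm
```

The generated index holds the encoding, the entry count, and one sorted "word|byte offset" line per head word:

```
ISO8859-1
2
abandon|51
aberration|10
```

Note that the index is sorted alphabetically, not in file order, which is why "abandon" precedes "aberration" despite appearing later in the .dat file.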
I have had an attempt at this - code attached. It is dual licensed under LGPL / MIT, although there are no (c) headers in the file (feel free to add some). I have no idea how this would be integrated into the build process, as I'm not even sure where it is called from, but I'm happy if someone wants to take up the challenge and/or incorporate it as an installer step.

Here is the timing of the C++ version on a Core i5 amd64 generating the following indices:

  libo/clone/libs-extern-sys/dictionaries/ca/th_ca_ES_v3.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/cs_CZ/th_cs_CZ_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/da_DK/th_da_DK.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/de_AT/th_de_AT_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/de_CH/th_de_CH_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/de_DE/th_de_DE_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/en/th_en_US_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/fr_FR/thes_fr.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/hu_HU/th_hu_HU_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/it_IT/th_it_IT_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/ne_NP/th_ne_NP_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/no/th_nb_NO_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/no/th_nn_NO_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/pl_PL/th_pl_PL_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/ro/th_ro_RO_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/ru_RU/th_ru_RU_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/sk_SK/th_sk_SK_v2.dat.idx2
  libo/clone/libs-extern-sys/dictionaries/sl_SI/th_sl_SI_v2.dat.idx2

  real  0m0.792s
  user  0m0.630s
  sys   0m0.080s

The same set of files using th_gen_idx.pl took around 5 seconds (although some basic fixups got it down to 3.5 seconds).

What I noticed while testing the change was that a lot of the dictionaries I processed have errors.
These range from an incorrect entry count, causing the indexing process to miss a word (there are lots of these in some dictionaries), to words apparently duplicated, either as the very next entry or sometimes a long way apart. I have not attempted to fix these dictionary issues, but if they are serious it might be worth having a script that validates that the dictionaries are internally consistent. Unfortunately, it would have to use heuristics, as the file format makes it difficult to tell in general what kind of line is being processed.

The attached C++ version differs from the perl script in one respect: when multiple entries are found for the same word, they come out in the reverse of the order the perl script produces. What I'm curious about is what impact having multiple index entries for a word has when they are loaded into LibreOffice.

For reference, I have also attached an improved version of the perl script that runs a couple of seconds faster than the original. I had three or four versions in my tree, but none of the changes showed up in a git diff, so I've attached the full copy.

Cheers,
Steve.
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <stdlib.h>
#include <string.h>

static const int MAXLINE = 1024*64;

using namespace std;

int main(int argc, char *argv[])
{
    if (argc != 3 || strcmp(argv[1], "-o"))
    {
        cout << "Usage: th_gen_idx -o outputfile < input\n";
        ::exit(99);
    }

    // This call improves performance by approx 5x
    cin.sync_with_stdio(false);

    const char *outputFile(argv[2]);
    char inputBuffer[MAXLINE];
    multimap<string, size_t> entries;
    multimap<string, size_t>::iterator ret(entries.begin());
    int line(1);

    cin.getline(inputBuffer, MAXLINE);
    const string encoding(inputBuffer);
    size_t currentOffset(encoding.size() + 1);

    while (true)
    {
        // Extract the next word, but not the entry count
        cin.getline(inputBuffer, MAXLINE, '|');
        if (cin.eof())
            break;
        string word(inputBuffer);
        ret = entries.insert(ret, pair<string, size_t>(word, currentOffset));
        currentOffset += word.size() + 1;

        // Next is the entry count
        cin.getline(inputBuffer, MAXLINE);
        if (!cin.good())
        {
            cerr << "Unable to read entry - insufficient buffer?\n";
            exit(99);
        }
        currentOffset += strlen(inputBuffer) + 1;
        int entryCount(strtol(inputBuffer, NULL, 10));
        for (int i(0); i < entryCount; ++i)
        {
            cin.getline(inputBuffer, MAXLINE);
            currentOffset += strlen(inputBuffer) + 1;
            ++line;
        }
    }

    // Use binary mode to prevent any translation of LF to CRLF on Windows
    ofstream outputStream(outputFile, ios_base::binary | ios_base::trunc | ios_base::out);
    if (!outputStream.is_open())
    {
        cerr << "Unable to open output file " << outputFile << endl;
        ::exit(99);
    }

    cout << outputFile << endl;
    outputStream << encoding << '\n' << entries.size() << '\n';
    for (multimap<string, size_t>::const_iterator ii(entries.begin());
         ii != entries.end(); ++ii)
    {
        outputStream << ii->first << '|' << ii->second << '\n';
    }
}
Attachment:
th_gen_idx.pl
Description: Binary data