Hi Michael,

On 29 January 2011 21:45, Steve Butler <sebutler@gmail.com> wrote:

I thought I would raise your idea of not using the index at all, to see
what reception it gets; I think you may also have been suggesting a
similar thing: are the index files even useful on modern gear?

I can populate the en_US index in memory from the .dat file with the
C++ code in 0.287 s after dropping all caches, and 0.188 s when the
cache is hot.
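
For anyone who hasn't looked inside these files: the .dat is plain text,
and what the loader scans is roughly the following layout - the first line
is the character encoding, then each word appears as "word|meaning-count"
followed by one "(pos)|synonym|synonym|..." line per meaning (the entries
below are invented just to show the shape):

ISO8859-1
aback|1
(adv)|backward|rearward|astern
abandoned|2
(adj)|deserted
(adj)|vacated|empty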

I do admit that my desktop is pretty quick though, with 4 cores, SATA
II drives etc.

I have plugged the idxdict.cpp code (modified) into the mythes index
loader and made it load from the .dat file directly.  The index file
is no longer touched.

Here are some comparison timings on the above system (measured with
gettimeofday either side of the call in swriter).

USING AN INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.
US Thesaurus - hot OS cache
2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.

USING NO INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.
US Thesaurus - hot OS cache
2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.

As these numbers show, loading without the index file is roughly 2.5-3x
slower for the US thesaurus, whether the OS cache is cold or hot.
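
For reference, the measurement is nothing fancier than gettimeofday()
either side of the load; a standalone equivalent looks roughly like this
(the file names are illustrative, and the real numbers above were taken
around the call inside swriter):

#include <sys/time.h>
#include <cstdio>
#include "mythes.hxx"

int main()
{
    timeval start, end;
    gettimeofday(&start, NULL);

    // time just the construction, i.e. the index load
    MyThes thes("th_en_US_v2.idx", "th_en_US_v2.dat");

    gettimeofday(&end, NULL);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1000000.0;
    printf("Loaded in %f seconds.\n", elapsed);
    return 0;
}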

BTW, if I did that I'd probably do some major surgery on mythes and just
use the STL, because it is basically doing C-style memory management and
string processing and I think I would screw it up if I started messing
with it piecemeal.  The only problem with simplifying it with STL
constructs is that I would want to change the interface (std::string vs
char *), maybe use STL vectors for the synonym lists, etc.

I've kept the public interface of mythes the same with my changes
(although the index file name passed to the constructor is now ignored),
apart from this one:
const char* get_th_encoding();

I didn't change the mentry struct or the code that reads an entry from
the dat file at all.  The offset is now looked up in the std::map by
word, but from there processing falls back to the existing mythes C-style
code.
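
To be concrete about what keeping the public interface means, a caller
still looks like this (the word and file names are only for illustration):

#include <cstdio>
#include <cstring>
#include "mythes.hxx"

int main()
{
    // the idx path is still accepted, but the new loader ignores it
    MyThes thes("th_en_US_v2.idx", "th_en_US_v2.dat");

    const char* word = "aback";
    mentry* meanings = NULL;
    int nmeanings = thes.Lookup(word, (int)strlen(word), &meanings);

    for (int i = 0; i < nmeanings; ++i) {
        printf("%s\n", meanings[i].defn);
        for (int j = 0; j < meanings[i].count; ++j)
            printf("    %s\n", meanings[i].psyns[j]);
    }

    // hand the original pointer and count back so mythes can free them
    if (nmeanings)
        thes.CleanUpAfterLookup(&meanings, nmeanings);
    return 0;
}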

It might be possible to make the index creation run quicker by avoiding
the use of so many std::strings, but I probably wouldn't do this as it
would make the code harder to understand.

I did remove some private member functions that were no longer needed,
and some private data is now using std::string and std::map (as
per idxdict).

Now, assuming anyone thinks this is a good idea and the tradeoff of
initial lookup speed vs installation size is appropriate, I would
appreciate pointers on how we would go about packaging up such a change
when it is completely isolated to third-party source.  Naturally, if this
approach were selected, then building the .idx files and adding them to
the language pack zips would need to be dropped.  A further option could
be to use the .idx files if they exist, but fall back to using only the
.dat files (a rough sketch of that follows below).
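
If the fallback option sounds preferable, thInitialize could simply probe
for the .idx file first, something along these lines (loadFromIdx and
loadFromDat are hypothetical helpers, not functions in the attached patch):

int MyThes::thInitialize(const char* idxpath, const char* datpath)
{
    std::ifstream idx(idxpath);
    if (idx.is_open())
        return loadFromIdx(idx, datpath);  // existing binary-search idx path
    return loadFromDat(datpath);           // new dat-only scan (as attached)
}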

The changes are LGPLv3+/MPL licensed.  I've attached the two altered
files here in case anyone wants to have a look and provide feedback on
the approach.

As this is simply a proof of concept for the timing, I haven't yet
tested for memory leaks or data corruption.

I'm also not sure how to format it, as the original code is not well formatted.

Regards,
Steven Butler
#ifndef _MYTHES_HXX_
#define _MYTHES_HXX_

// some maximum sizes for buffers
#define MAX_WD_LEN 200
#define MAX_LN_LEN 16384


#include <cstdio>
#include <iostream>
#include <fstream>
#include <string>
#include <map>

// a meaning with definition, count of synonyms and synonym list
struct mentry {
  char*  defn;
  int  count;
  char** psyns;
};

typedef std::map<std::string, long> WordLocationMap;

class MyThes
{

        std::string  encoding;           /* stores text encoding; */
        WordLocationMap wordList;
 
        FILE  *pdfile;

        // disallow default construction, copying and assignment for now
        MyThes();
        MyThes(const MyThes &);
        MyThes & operator = (const MyThes &);

public:
        MyThes(const char* idxpath, const char* datpath);
        ~MyThes();

        // lookup text in index and return number of meanings
        // each meaning entry has a definition, a synonym count and a pointer
        // to its list of synonyms
        // when complete, return the *original* meaning entry and count via
        // CleanUpAfterLookup to properly handle memory deallocation

        int Lookup(const char * pText, int len, mentry** pme); 

        void CleanUpAfterLookup(mentry** pme, int nmean);

        const char* get_th_encoding(); 

private:
        // Scan the dat file to build the word/offset map, then open it for lookups
        int thInitialize (const char* indxpath, const char* datpath);
        
        // internal close and cleanup of the dat file
        int thCleanup ();
/*
        // read a text line (\n terminated) stripping off line terminator
        int readLine(FILE * pf, char * buf, int nc);

        // binary search on null terminated character strings
        int binsearch(char * wrd, char* list[], int nlst);
*/
};

#endif





#include "COPYING"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>

#include "mythes.hxx"
#include <fstream>
// some basic utility routines

// remove cross-platform text line end characters
void mytheschomp(char * s)
{
  int k = strlen(s);
  if ((k > 0) && ((*(s+k-1)=='\r') || (*(s+k-1)=='\n'))) *(s+k-1) = '\0';
  if ((k > 1) && (*(s+k-2) == '\r')) *(s+k-2) = '\0';
}




// read a line of text from a text file stripping
// off the line terminator and replacing it with
// a null string terminator.
// returns:  -1 on error or the number of characters in
//           the returned string

// At most nc-1 characters will be stored in buf

static int readLine(FILE * pf, char * buf, int nc)
{
    
  if (fgets(buf,nc,pf)) {
    mytheschomp(buf);
    return strlen(buf);
  }
  return -1;
}


// string duplication routine
char * mythesstrdup(const char * p)
{
   
  int sl = strlen(p) + 1;
  char * d = (char *)malloc(sl);
  if (d) {
    memcpy(d,p,sl);
    return d;
  }
  return NULL;
}


// return index of char in string
int mystr_indexOfChar(const char * d, int c)
{
  const char * p = strchr(d,c);
  if (p) return (int)(p-d);
  return -1;
}


MyThes::MyThes(const char* idxpath, const char * datpath)
{
/*
    encoding = NULL;
    list = NULL;
    offst = NULL;
*/
    if (thInitialize(idxpath, datpath) != 1) {
        fprintf(stderr,"Error - can't open %s or %s\n",idxpath, datpath);
        fflush(stderr);
/*        if (encoding) free((void*)encoding);
        if (list)  free((void*)list);
        if (offst) free((void*)offst);
*/
        // did not initialize properly - throw exception?
    }
}


MyThes::~MyThes()
{
    if (thCleanup() != 1) {
        /* did not cleanup properly - throw exception? */
    }
/*
    if (encoding) free((void*)encoding);
    encoding = NULL;
    list = NULL;
    offst = NULL;
*/
}


int MyThes::thInitialize(const char* idxpath, const char* datpath)
{
        std::ifstream input(datpath);
        if (!input.is_open())
        {
                return 0;
        }

        char inputBuffer[MAX_LN_LEN];
        WordLocationMap::iterator ret(wordList.begin());

        int line(1);
        input.getline(inputBuffer, MAX_LN_LEN);
        encoding.assign(inputBuffer);
        int currentOffset(encoding.size()+1);
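        // currentOffset tracks the byte offset of the next line in the .dat
        // file: each getline() consumes the characters placed in the buffer
        // plus the single delimiter it strips, hence the "+ 1" adjustments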
        while (true)
        {
                // Extract the next word, but not the entry count
                input.getline(inputBuffer, MAX_LN_LEN, '|');

                if (input.eof()) break;

                std::string word(inputBuffer);
                ret = wordList.insert(ret, WordLocationMap::value_type(word, currentOffset));
                currentOffset += word.size() + 1;
                // Next is the entry count
                input.getline(inputBuffer, MAX_LN_LEN);
                if (!input.good())
                {
                        
                        std::cerr << "Unable to read entry - insufficient buffer?\n";
                        wordList.clear();
                        return 0;
                }
                currentOffset += strlen(inputBuffer)+1;
                int entryCount(strtol(inputBuffer, NULL, 10));
                for (int i(0); i < entryCount; ++i)
                {
                        input.getline(inputBuffer, MAX_LN_LEN);
                        currentOffset += strlen(inputBuffer)+1;
                        ++line;
                }
        }
        input.close();
    /* next open the data file for legacy code  */
    pdfile = fopen(datpath,"r");
    if (!pdfile) {
        pdfile = NULL;
        wordList.clear();
        return 0;
    } 
        
    return 1;        
}


int MyThes::thCleanup()
{
    /* first close the data file */
    if (pdfile) {
        fclose(pdfile);
        pdfile=NULL;
    }

    /* now free up all the allocated strings on the list */
/*
    for (int i=0; i < nw; i++) 
    {
        if (list[i]) {
            free(list[i]);
            list[i] = 0;
        }
    }
*/
/*    if (list)  free((void*)list);
    if (offst) free((void*)offst);

    nw = 0;
*/
    return 1;
}



// lookup text in index and return the count of meanings and a list of
// meaning entries, each entry having a synonym count and a pointer to an
// array of char * (i.e. the synonyms)
//
// note: the calling routine should call CleanUpAfterLookup with the original
// meaning pointer and count to properly deallocate memory

int MyThes::Lookup(const char * pText, int len, mentry** pme)
{ 

    *pme = NULL;

    // handle the case of missing file or file related errors
    if (! pdfile) return 0;


    std::string wrd(pText, len);
  
    /* find it in the word/offset map */
    WordLocationMap::const_iterator ii(wordList.find(wrd));
    if (ii == wordList.end())
    {
        return 0;
    }
    long offset(ii->second);

    // now seek to the offset
    int rc = fseek(pdfile,offset,SEEK_SET);
    if (rc) {
       return 0;
    }

    // grab the count of the number of meanings
    // and allocate a list of meaning entries
    char * buf = NULL;
    buf  = (char *) malloc( MAX_LN_LEN );
    if (!buf) return 0;
    readLine(pdfile, buf, (MAX_LN_LEN-1));
    int np = mystr_indexOfChar(buf,'|');
    if (np < 0) {
         free(buf);
         return 0;
    }          
    int nmeanings = atoi(buf+np+1);
    *pme = (mentry*) malloc( nmeanings * sizeof(mentry) );
    if (!(*pme)) {
        free(buf);
        return 0;
    }

    // now read in each meaning and parse it to get defn, count and synonym lists
    mentry* pm = *(pme);
    char dfn[MAX_WD_LEN];

    for (int j = 0; j < nmeanings; j++) {
        readLine(pdfile, buf, (MAX_LN_LEN-1));

        pm->count = 0;
        pm->psyns = NULL;
        pm->defn = NULL;

        // store away the part of speech for later use
        char * p = buf;
        char * pos = NULL;
        np = mystr_indexOfChar(p,'|');
        if (np >= 0) {
           *(buf+np) = '\0';
           pos = mythesstrdup(p);
           p = p + np + 1;
        } else {
          pos = mythesstrdup("");
        }
        
        // count the number of fields in the remaining line
        int nf = 1;
        char * d = p;
        np = mystr_indexOfChar(d,'|');        
        while ( np >= 0 ) {
          nf++;
          d = d + np + 1;
          np = mystr_indexOfChar(d,'|');          
        }
        pm->count = nf;
        pm->psyns = (char **) malloc(nf*sizeof(char*)); 
        
        // fill in the synonym list
        d = p;
        for (int j = 0; j < nf; j++) {
            np = mystr_indexOfChar(d,'|');
            if (np > 0) {
              *(d+np) = '\0';
              pm->psyns[j] = mythesstrdup(d);
              d = d + np + 1;
            } else {
              pm->psyns[j] = mythesstrdup(d);
            }            
        }

        // add pos to first synonym to create the definition
        int k = strlen(pos);
        int m = strlen(pm->psyns[0]);
        if ((k+m) < (MAX_WD_LEN - 1)) {
             strncpy(dfn,pos,k);
             *(dfn+k) = ' ';
             strncpy((dfn+k+1),(pm->psyns[0]),m+1);
             pm->defn = mythesstrdup(dfn);
        } else {
             pm->defn = mythesstrdup(pm->psyns[0]);
        }
        free(pos);
        pm++;

    }
    free(buf);
   
    return nmeanings;
} 



void MyThes::CleanUpAfterLookup(mentry ** pme, int nmeanings)
{ 

    if (nmeanings == 0) return;
    if ((*pme) == NULL) return;

    mentry * pm = *pme;
       
    for (int i = 0; i < nmeanings; i++) {
       int count = pm->count;
       for (int j = 0; j < count; j++) {
          if (pm->psyns[j]) free(pm->psyns[j]);
          pm->psyns[j] = NULL;
       }
       if (pm->psyns) free(pm->psyns);
       pm->psyns = NULL;
       if (pm->defn) free(pm->defn);
       pm->defn = NULL;
       pm->count = 0;
       pm++;
    }
    pm = *pme;
    free(pm);
    *pme = NULL;
    return;
}


/* 
//  performs a binary search on null terminated character
//  strings
//
//  returns: -1 on not found
//           index of wrd in the list[]

int MyThes::binsearch(char * sw, char* list[], int nlst) 
{
    int lp, up, mp, j, indx;
    lp = 0;
    up = nlst-1;
    indx = -1;
    if (strcmp(sw,list[lp]) < 0) return -1;
    if (strcmp(sw,list[up]) > 0) return -1;
    while (indx < 0 ) {
        mp = (int)((lp+up) >> 1);
        j = strcmp(sw,list[mp]);
        if ( j > 0) {
            lp = mp + 1;
        } else if (j < 0 ) {
            up = mp - 1;
        } else {
            indx = mp;
        }
        if (lp > up) return -1;      
    }
    return indx;
}
*/
const char * MyThes::get_th_encoding()
{
  if (encoding.size()) return encoding.c_str();
  return NULL;
}

