On 01/01/2012 08:09 PM, Dan Lewis wrote:
      Something to remember: the main dictionary for the language used by
LO is a binary file kept in the installation folder. If a language pack
is added, its dictionary is also a binary file kept in the same place. These
are large files.
      User created dictionary files (.dic) are kept in the personal
settings folder. These are text files.
      Some time ago, someone asked about dictionary file sizes, referring to
the user created .dic files. The reply was that 22K or less per file seemed
like a good number. It was mentioned that OOo would not use a dictionary
file if it was too large.
       The dictionary files (.dic) are text documents, and
the first four lines are very important as far as content is concerned.
Below are the first four lines of an English user created .dic file, followed
by those of a German user created .dic file.

            OOoUserDict1             OOoUserDict1
            lang: en-US      OR        lang: de-DE
            type: positive           type: positive
            ---                      ---
It appears that the second line is the one that has to be changed from
language to language. The letters before the hyphen are the language
(en, English; de, Deutsch) and the letters afterward are the country
(US, USA; GB, Great Britain; etc.)
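As a rough illustration of that header layout, here is a minimal Python sketch. The function names are hypothetical, and it assumes (this is not confirmed above) that the words simply follow the "---" line, one per line.

```python
def user_dict_header(lang_tag):
    """Return the four header lines shown above for a given language tag."""
    return ["OOoUserDict1", f"lang: {lang_tag}", "type: positive", "---"]


def split_lang_tag(lang_tag):
    """Split a tag such as 'en-US' into (language, country)."""
    language, _, country = lang_tag.partition("-")
    return language, country


def build_user_dict(lang_tag, words):
    """Build the text of a user .dic file.

    Assumption: one word per line after the '---' separator.
    """
    return "\n".join(user_dict_header(lang_tag) + list(words)) + "\n"


if __name__ == "__main__":
    print(split_lang_tag("de-DE"))
    print(build_user_dict("de-DE", ["Haus", "Garten"]))
```

Only the second header line changes between the English and German examples; the other three stay the same.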
      But with the number of entries you have, you need to find some way
to make a binary file that LO can read as a .dic file. From what I remember
about the creation of the Australian dictionary, it is very time consuming
to create the binary files.


Every dictionary .dic file that I have looked at (English, German, French, etc.), whether for LibreOffice or from OOo's own site, is laid out as follows:
line 1:  the number of words in the list
line 2 through the end:  the list of words and any control codes
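
A minimal Python sketch of reading that layout is below. The function name is hypothetical, and it assumes the control codes follow a "/" after the word, as they do in the Hunspell-format .dic files I am familiar with; adjust the separator if your files differ.

```python
def parse_dic(text):
    """Parse .dic text: line 1 is the word count, the rest are entries.

    Assumption: control codes, when present, follow a '/' after the word.
    Returns (declared_count, [(word, codes), ...]).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    declared_count = int(lines[0])
    entries = []
    for line in lines[1:]:
        word, _, codes = line.partition("/")
        entries.append((word, codes))
    return declared_count, entries


if __name__ == "__main__":
    sample = "3\napple\nbanana/S\ncherry\n"
    declared, entries = parse_dic(sample)
    print(declared, len(entries), entries)
```

Comparing the declared count with the number of entries actually read is a quick sanity check on a hand-built word list.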

All of the dictionaries are packaged in .oxt files, which are a type of archive.
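
To look inside one without installing it, something like the following Python sketch should work. It assumes the .oxt archive is in ZIP format and can be opened with the standard zipfile module; the function name and the file name in the usage comment are hypothetical.

```python
import zipfile


def list_dic_files(oxt):
    """Return (name, uncompressed size in bytes) for each .dic file in an .oxt.

    `oxt` may be a file path or a binary file-like object.
    Assumption: the .oxt archive is a ZIP file.
    """
    with zipfile.ZipFile(oxt) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.lower().endswith(".dic")
        ]


# Usage (hypothetical file name):
#   for name, size in list_dic_files("my-dictionary.oxt"):
#       print(name, size)
```

This reports the uncompressed size of each .dic file, which is the figure to compare against the size of the .oxt file itself.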

The largest dictionary I have is Irish/Gaeilge, which is about 7 MB for the .oxt file, but the .dic file inside it is only about 2 MB. The thesaurus and hyphenation files make up most of the rest of the file size.

I do not know of any .oxt files that are in a "binary" format. Maybe the .dic files are converted to binary somewhere, but not on the dictionary creator's end.

When using my dictionaries, I do not notice any slowing down of the loading process, even with all of my American English .oxt files enabled, plus the largest British English and Canadian English ones. I do not know which "binary conversion" is supposed to take a long time, but I do not see it.

I have the Australian .oxt file with the .dic file from 2008-12-15, and also the Australian Medical dictionary with its .dic file from 2008-07-01. Neither .dic file is in a binary format. When opened, each is just an ASCII text file inside an archive. There are control codes after some of the words in that .dic file, so maybe that was what took time to create: the words and their control codes.

Here is a link to the English dictionary section; Australian is the first one listed.

Now, I have large word lists in my .dic files: 6.4 MB for the 638K word list. But there are no control codes in my .dic files, only the top line stating the number of words in the list.

I offer several word list sizes for my dictionaries: 98K, 217K, 390K, and 638K words. There is no 98K list for Canada, since I do not [yet] have a word list of that size to use for one. So if users want the 98K word list for their spelling words, they can have it. The 638K word list dictionary exists because someone on this list asked me for a dictionary with the largest word list that I had. I asked before I made them.

As for the following header lines, sorry, I did not see them in the .dic files I got from the OOo dictionary list:
           OOoUserDict1             OOoUserDict1
           lang: en-US      OR        lang: de-DE
           type: positive           type: positive

Maybe they are created in the folder that they reside in after LO/OOo loads them through the Extension Manager. I know that I used some of these dictionaries when I used OOo 3.x.x, and the .dic files were still 500K or more then.

There is an 8,074 word list with a .dic file of 87.9KB. How many words would be in a 22K .dic file? Where did you get that 22K size info? I went to the .libreoffice hidden folder [Ubuntu 10.04] and not one of the .dic files listed there is anywhere near that 22K size. Most are in the 1 to 3 MB range.
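
For a rough answer to the word-count question, here is a small Python estimate. It assumes the average number of bytes per word in a 22K file would be the same as in the 8,074 word, 87.9KB list; the function name is hypothetical.

```python
def estimate_words(target_kb, known_kb, known_words):
    """Scale a known word count by file size.

    Assumption: the average bytes per word is the same in both files.
    """
    return round(known_words * target_kb / known_kb)


if __name__ == "__main__":
    # 8,074 words in 87.9KB is roughly 11 bytes per word, so a 22K file
    # would hold roughly 2,000 words.
    print(estimate_words(22, 87.9, 8074))
```

By that estimate a 22K limit would allow only about 2,000 words, far fewer than any of the word lists discussed here.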

I did a lot of looking into what documentation I could find for creating a language dictionary, and nowhere did I find any info about file sizes or about converting the .dic files to a binary format. I understand binary to mean something other than a file that shows its actual "text" in a text editor. I have seen "true binary" files when I had to program in Assembly and C. The resulting files ended up in a binary format unreadable in a text editor. The .dic files are not like that, as far as I can see. So I do not know what "binary conversion" you are talking about.

Also, if you download my dictionaries [making sure the .oxt file extension is there] and then install them using the Extension Manager, you will have no issues. I have installed them on Ubuntu computers and Windows computers with equal results. So, my dictionaries work as they are.

All I really wanted to know was how far I should go with the number and types of words for these dictionaries. All the English words, plus English medical and chemistry terms, add up to over 736,000 words [with no control codes after them].


