My connection dropped while I was posting. Here is the full post:
Hello everyone,
## Build indexable binary grammatically tagged dictionaries for Lightproof/Grammalecte ##
The most important limitation when building a grammar checker with Lightproof
was the lack of grammatically tagged dictionaries. Most Hunspell
dictionaries, which Lightproof can handle via LibreOffice-UNO, are not
grammatically tagged and are of no help for retrieving morphological
information about words.
LanguageTool does not have this problem, since it uses indexable binary
dictionaries built from huge grammatically tagged lexicons with finite-state
automaton (FSA) software
(http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html)
written in C. Java has a dedicated library to read these binary files.
But we had nothing like this in Python.
So I tried to understand how this C FSA software works, but as I am not a
C expert, and as I was reluctant to depend on yet another external tool, I
finally decided to write my own FSA tool to build such indexable binary
dictionaries.
Why build such dictionaries, you may ask? Because lexicons which contain
words, lemmas and morphological tags are HUGE, up to several megabytes; they
are not indexable as is, and making them indexable in memory consumes even
more RAM. So the goal is to make them small, compressed, quick to load and
to parse, low on memory, and indexable, i.e. readable without having to
uncompress them.
That’s what I did with Python 3.3.
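To give an idea of the core trick, here is a simplified sketch (NOT the
actual fsa_builder.py, which also encodes stems and tags into the
automaton): words are inserted in sorted order and equivalent suffix nodes
are merged, which collapses the huge word list into a compact graph that
can still be searched directly, without decompression.

    class Node:
        def __init__(self):
            self.children = {}   # char -> Node
            self.final = False

        def key(self):
            # Two nodes are equivalent if they are equally final and have
            # identical transitions to already-merged children.
            return (self.final,
                    tuple(sorted((c, id(n)) for c, n in self.children.items())))

    class Dawg:
        def __init__(self):
            self.root = Node()
            self.merged = {}      # node key -> canonical node
            self.unchecked = []   # (parent, char, child) not yet merged
            self.previous = ""

        def insert(self, word):
            # Words MUST be inserted in sorted order, like a lexicon dump.
            assert word > self.previous, "insert words in sorted order"
            # Length of the common prefix with the previous word.
            prefix = 0
            while (prefix < min(len(word), len(self.previous))
                   and word[prefix] == self.previous[prefix]):
                prefix += 1
            self._merge_down_to(prefix)
            node = self.unchecked[-1][2] if self.unchecked else self.root
            for char in word[prefix:]:
                child = Node()
                node.children[char] = child
                self.unchecked.append((node, char, child))
                node = child
            node.final = True
            self.previous = word

        def finish(self):
            self._merge_down_to(0)

        def _merge_down_to(self, depth):
            # Bottom-up: replace each pending suffix node by an equivalent
            # node already in the automaton, if one exists.
            while len(self.unchecked) > depth:
                parent, char, child = self.unchecked.pop()
                key = child.key()
                if key in self.merged:
                    parent.children[char] = self.merged[key]
                else:
                    self.merged[key] = child

        def lookup(self, word):
            node = self.root
            for char in word:
                if char not in node.children:
                    return False
                node = node.children[char]
            return node.final

    dawg = Dawg()
    for word in ["chanta", "chante", "chanter", "chantez"]:
        dawg.insert(word)
    dawg.finish()
    print(dawg.lookup("chante"))   # True
    print(dawg.lookup("chant"))    # False

The real tool then serializes this graph into the compact binary form that
makes the dictionaries small and quick to load.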
I took all the lexicons from LanguageTool and compressed them into
indexable binary dictionaries readable with my own script.
The built dictionaries are not as small as the ones made with the C FSA tool
used by LT, but they are close enough, and there is still room for
improvement. I'll work on this later.
Here are the results: these dictionaries are about 5-30 % bigger than the LT
ones (and sometimes, surprisingly, half the size), but in any case they are
perfectly usable as is.
Consequences:
— it will be possible to use all existing LT lexicons with Lightproof,
— we will be able to make a stand-alone version of Lightproof/Grammalecte as
it won’t be necessary to use Hunspell anymore,
— we will be able to write automated tests and prevent regressions when
writing/modifying rules.
# Lexicons
Lexicons are simple text files listing all flexions, their stem and their
morphological tags, one entry per line.
Each field is separated by a tab.
With the new tool, lexicons MUST be UTF-8 encoded to be properly converted.
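For example, entries look like this, with a tab between fields (the tags
shown here are purely illustrative; each LT lexicon has its own tag set):

    mangeait	manger	V ind impa 3sg
    mangeons	manger	V ind pres 1pl
    chevaux	cheval	N m pl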
# Want to test it?
The code is written in Python 3.3. License: MPL 2.
Two files:
— fsa_builder.py reads all files listed in "_lexicons.list.txt" and
builds binary dictionaries with a specific stemming command.
— fsa_reader.py reads all files whose name is "[lang].bdic", and if it
finds a test file named "[lang].test.txt", writes the results for each word
to a new file.
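In other words, the reader's loop is roughly this (a simplified sketch, NOT
the actual fsa_reader.py; the lookup parameter and the output file name are
assumptions, standing in for whatever query function and naming the real
script uses):

    import glob, os

    def run_tests(folder, lookup):
        # For each [lang].bdic, read the matching [lang].test.txt
        # (one word per line) and write the results to a new file.
        for bdic in sorted(glob.glob(os.path.join(folder, "*.bdic"))):
            lang = os.path.splitext(os.path.basename(bdic))[0]
            test = os.path.join(folder, lang + ".test.txt")
            if not os.path.isfile(test):
                continue
            with open(test, "r", encoding="utf-8") as f:
                words = f.read().split()
            results = ["%s\t%s" % (w, lookup(bdic, w)) for w in words]
            out = os.path.join(folder, lang + ".test.results.txt")
            with open(out, "w", encoding="utf-8") as f:
                f.write("\n".join(results))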
The builder with uncompressed LT lexicons encoded in UTF-8:
http://dicollecte.free.fr/download/fsa1/pyFSA_builder.7z [130 MB]
Type:

    python3 fsa_builder.py
And let it run. Warning: building dictionaries is slow, as lexicons are
huge. For most languages it takes 1 or 2 minutes each. But for German,
Polish, Galician, Russian and Czech, it takes 5 to 10 minutes each, and it
consumes a huge amount of memory. The Czech one uses up to 6 GB! You have
been warned. :)
The dictionary reader with binary dictionaries and test files:
http://dicollecte.free.fr/download/fsa1/pyFSA_reader.7z [11 MB]
Type:

    python3 fsa_reader.py
Let it run. Count to 1 (or 2 if you have a slow computer), and it's already
finished. :)
It has read all the binary dictionaries, read the test files, and written
the results to new files.
I'll try to write a more complete web page about this when I have the time.
I still have to compress the dictionaries better, for those who might think
they are not small enough.
Regards,
Olivier R.