On 17.08.2017 12:08, Andrej Warkentin wrote:
Hello,
in a talk at the PyData Berlin meetup I saw this project:
https://github.com/lusy/hora-de-decir-bye-bye , where Spanish articles
are scraped and searched for English words. To identify English words,
she used the Open Office dictionaries and compared the scraped words
against them. She mentioned the problem that not all words were in the
dictionaries.
So I thought this could be used to find (or at least help find) most of
the missing words in dictionaries for all languages. One could scrape
e.g. all Wikipedia articles of a certain language and create a candidate
list of missing words. It could also be used to find domain-specific
words by scraping e.g. scientific articles, articles from certain types
of websites, and so on.
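For illustration, a rough Python sketch of that pipeline, not a finished
tool: it assumes a Hunspell-style word list (as shipped with Open Office)
and Wikipedia's public REST API. The file name "es_ES.dic", the language
code, the sample size, and the frequency threshold are made-up placeholders.

import collections
import json
import re
import urllib.request


def load_word_list(path):
    # Hunspell .dic lines may carry affix flags after a "/" (e.g. "casa/S");
    # stripping them is a simplification that ignores affix expansion.
    with open(path, encoding="utf-8") as f:
        return {line.split("/")[0].strip().lower()
                for line in f if line.strip()}


def fetch_random_summary(lang):
    # Wikipedia's REST API returns a JSON summary with a plain-text "extract".
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/random/summary"
    req = urllib.request.Request(url,
                                 headers={"User-Agent": "dict-gap-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("extract", "")


def missing_word_candidates(texts, known_words, min_count=3):
    # Count alphabetic tokens absent from the word list; words recurring
    # across many articles are more likely real gaps than typos.
    counts = collections.Counter()
    for text in texts:
        for token in re.findall(r"[^\W\d_]+", text):
            word = token.lower()
            if word not in known_words:
                counts[word] += 1
    return [(w, n) for w, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    known = load_word_list("es_ES.dic")  # hypothetical dictionary file
    texts = [fetch_random_summary("es") for _ in range(50)]
    for word, count in missing_word_candidates(texts, known):
        print(f"{count:4d}  {word}")

A real run would want affix expansion (e.g. via a hunspell binding) and a
much larger article sample, but the frequency-filtered candidate list is
the core idea.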
My question is whether this would be helpful at all, or whether missing
words in dictionaries are no longer a problem. Also, I unfortunately
don't have much spare time at the moment to work on this, so if anyone
wants to pick this up, feel free to do so. I will let you know when I
have implemented something myself.
by "missing words in dictionaries", do you mean that if "teh" was used
as an archaic spelling of "tea" in a work of Shakespeare (completely
made up and hypothetical example), that we should add "teh" to the
dictionary and no longer flag it as a wrongly spelled word?