Hi,
The data files for libexttextcat in this directory:
https://github.com/giuliopaci/libexttextcat/tree/master/langclass/ShortTexts
include a garbled Hungarian version. It is almost in ISO-8859-1, but some
characters are destroyed because ISO-8859-1 doesn't contain all Hungarian
characters (it has no ő or ű).
A good UTF-8 version is easy to get from
http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=hng
and comparing the two shows the difference.
It's not clear whether this prevents libexttextcat from classifying
Hungarian text correctly, but it may stop classification from working on
UTF-8 input, because most of the other files are in UTF-8.
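For anyone who wants to check this, here is a rough Python sketch of the two
tests I have in mind: whether a file's bytes are valid UTF-8, and which
characters of a known-good text cannot be stored in ISO-8859-1 at all. The
function names are my own for illustration and are not part of
libexttextcat.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def missing_in_latin1(text):
    """Return the sorted distinct characters that ISO-8859-1 cannot encode."""
    missing = set()
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


if __name__ == "__main__":
    # A common Hungarian test phrase containing all the accented vowels.
    sample = "árvíztűrő tükörfúrógép"
    print(missing_in_latin1(sample))
    print(is_valid_utf8(sample.encode("utf-8")))
```

To test a real file, read it in binary mode and pass the bytes to
is_valid_utf8; run missing_in_latin1 on the decoded UTF-8 text from the OHCHR
page to see which letters would have been lost.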
Cheers
Mark
Context
- libexttextcat data garbled in Hungarian · Mark Robson