[libreoffice-l10n] State-of-the-art hyphenation for Danish, Dutch, German, Hungarian, Norwegian and Swedish

Németh László <nemeth -AT- numbertext.org>
Wed, 27 Dec 2023 13:43:47 +0100

Hi,

As a complement to my ongoing typography developments (
https://numbertext.org/typography/), I am working on an extension of the
hyphenation zone that hyphenates compound words at stem boundaries,
combining the pattern-based libhyphen hyphenation with the morphological
analysis of Hunspell. This would result in state-of-the-art automatic
hyphenation in several languages that frequently write long or very long
non-dictionary compound words. My plan is to commit the first working
version within a few days, so that it will be possible to test it.

There is no need to extend the Hunspell dictionaries to get some
improvement, but doing so can help (see “Specifying stem boundaries in .dic
file”). Checking and fixing the bugs of the pattern-based libhyphen
hyphenation later is also an option.

How the extension works: the hyphenation zone is a custom distance from the
end of the line, within which the preferred line break is the last space,
if there is one, even if a hyphenation would be possible after it. The
planned extension distinguishes between the possible (libhyphen)
hyphenations within the hyphenation zone, and prefers the last one that is
also a stem boundary (according to Hunspell), even if other hyphenations
would be possible after it.
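
A minimal sketch of this selection rule, in Python. It is not the
LibreOffice implementation: the function name, the offset-based
parameters (zone_start, max_fit) and the fallback to the last usable
hyphenation point are assumptions made for illustration, and the "break at
the last space" case is left out.

```python
from typing import Iterable, Optional


def choose_break(hyphen_points: Iterable[int],
                 stem_points: Iterable[int],
                 zone_start: int,
                 max_fit: int) -> Optional[int]:
    """Return the character offset in the word after which to break.

    hyphen_points: offsets allowed by the pattern-based hyphenator
    stem_points:   offsets that are stem boundaries
    zone_start:    first offset that lies inside the hyphenation zone
    max_fit:       number of characters of the word that fit on the line
    """
    # Only hyphenation points that still fit on the current line are usable.
    usable = sorted(p for p in hyphen_points if p <= max_fit)
    if not usable:
        return None
    stems = set(stem_points)
    # Preferred: the last usable point in the zone that is a stem boundary.
    preferred = [p for p in usable if p >= zone_start and p in stems]
    if preferred:
        return preferred[-1]
    # Assumed fallback: the last usable hyphenation point.
    return usable[-1]


word = "rättstavningskontroll"
cut = choose_break([4, 8, 13, 16], [13], zone_start=10, max_fit=17)
print(word[:cut] + "-" + word[cut:])  # rättstavnings-kontroll
```

Here offset 16 (after "kon") would also fit on the line, but offset 13 is
chosen because it is the last point in the zone that is a stem boundary.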

Issue with references:
https://bugs.documentfoundation.org/show_bug.cgi?id=158885

I have attached a Hungarian test file to the issue, with a hyphenation zone
(see Text Flow in the paragraph settings) and some explanation. You can
make your own OpenDocument test file based on this example, if your
Hunspell dictionary uses compound word features such as COMPOUNDFLAG,
COMPOUNDBEGIN etc.

It's already possible to test Hunspell compound word decomposition with the
command-line Hunspell, using -m (morphological analysis). For example, for
Swedish:

~/libreoffice/dictionaries/sv_SE$ hunspell -d sv_SE -m
rättstavningskontroll
rättstavningskontroll  pa:kontroll

So far, this analysis contains only the last stem or part (pa:) of the
compound, but I plan to extend Hunspell to get the other parts, too (which
can contain in-word suffixes, e.g. Fuge-elements in German, fogemorphemes
in Swedish).
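
As a rough illustration, the stem boundary can be derived from such an
analysis line by checking that the word ends with the pa: part. The Python
sketch below assumes the whitespace-separated "word  key:value" output
shown above; the function names are hypothetical and it does not call
Hunspell itself.

```python
from typing import Dict, List, Optional, Tuple


def parse_analysis(line: str) -> Tuple[str, Dict[str, List[str]]]:
    """Split one analysis line into the word and its key:value fields."""
    tokens = line.split()
    if not tokens:
        return "", {}
    fields: Dict[str, List[str]] = {}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        if sep:
            fields.setdefault(key, []).append(value)
    return tokens[0], fields


def last_stem_boundary(line: str) -> Optional[int]:
    """Offset of the boundary before the last part (pa:), if it can be found."""
    word, fields = parse_analysis(line)
    for part in fields.get("pa", []):
        # The part must be a proper suffix of the analysed word.
        if part and word.endswith(part) and len(part) < len(word):
            return len(word) - len(part)
    return None


print(last_stem_boundary("rättstavningskontroll  pa:kontroll"))  # 13
```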

Compare this with libhyphen hyphenation:

~/libreoffice/dictionaries/sv_SE$ /home/laci/libreoffice/workdir/UnpackedTarball/hyphen/example hyph_sv.dic <(echo rättstavningskontroll)
rätt=stav=nings=kon=troll
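
To compare the two outputs, the "=" marks can be converted to character
offsets in the unhyphenated word. A small Python sketch; the function name
is hypothetical and it only assumes the "="-separated format printed
above:

```python
from typing import List


def hyphen_offsets(marked: str, mark: str = "=") -> List[int]:
    """Return the offsets (in the unmarked word) of each hyphenation point."""
    offsets: List[int] = []
    count = 0  # number of word characters seen so far
    for ch in marked:
        if ch == mark:
            offsets.append(count)
        else:
            count += 1
    return offsets


print(hyphen_offsets("rätt=stav=nings=kon=troll"))  # [4, 8, 13, 16]
```

Offset 13 coincides with the boundary before "kontroll" reported by
Hunspell, while 16 (kon=troll) falls inside a stem.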

With the existing Hunspell analysis, the upcoming development code can
prefer "rättstavnings-kontroll" instead of the hyphenation
"rättstavningskon-troll" in the specified hyphenation zone. With the
planned extension of Hunspell, it could also prefer
"rätt-stavningskontroll" instead of "rättstav-ningskontroll".

= Specifying stem boundaries in .dic file =

Using morphological fields, it's possible to add information for the stem
boundaries of a word with the following syntax:

stavningskontroll hy:stavnings|kontroll

or using the index of the character before the hyphenation point:

stavningskontroll hy:9

The first syntax allows specifying several stem boundaries:

rättstavningskontroll hy:rätt|stavnings|kontroll

or giving more weight to one of the stem boundaries for the hyphenation zone:

rättstavningskontroll hy:rätt||stavnings|kontroll

(Note: the real preference is likely the opposite, because the meaning is
good|spelling||control, but I wanted to show an example where we can
prefer a stem boundary other than the last one within the hyphenation
zone.)
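
A rough Python sketch of how the hy: values above could be read. The
function name is hypothetical, and two points are assumptions: that the
numeric form is a 1-based index (so hy:9 means a boundary after the first 9
characters), and that the number of consecutive "|" characters acts as the
weight of a boundary.

```python
import re
from typing import Dict


def parse_hy(word: str, value: str) -> Dict[int, int]:
    """Map each stem boundary offset in `word` to its weight."""
    if value.isdigit():
        # Numeric form: a single boundary with the default weight 1.
        return {int(value): 1}
    if value.replace("|", "") != word:
        raise ValueError("hy: value does not spell the word")
    boundaries: Dict[int, int] = {}
    offset = 0
    # The capture group keeps the runs of bars in the split result.
    for token in re.split(r"(\|+)", value):
        if token.startswith("|"):
            boundaries[offset] = len(token)  # assumed: more bars, more weight
        else:
            offset += len(token)
    return boundaries


print(parse_hy("stavningskontroll", "stavnings|kontroll"))
print(parse_hy("stavningskontroll", "9"))
print(parse_hy("rättstavningskontroll", "rätt||stavnings|kontroll"))
```

Under these assumptions the first two calls describe the same single
boundary, and the third marks the boundary after "rätt" as the weighted
one.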

See also man 5 hunspell for morphological fields.

I am interested in hearing your opinions and welcome any further
suggestions, including for other languages.

Best regards and Happy New Year!

László

