[libreoffice-l10n] Re: State-of-the-art hyphenation for Danish, Dutch, German, Hungarian, Norwegian and Sweden

Németh László <nemeth -AT- numbertext.org>
Wed, 8 Jan 2025 15:32:58 +0100

Hi all,

Last year I extended the compound-constituent-based hyphenation, thanks to
the support of the FSF.hu Foundation. There is now also a new option in the
paragraph hyphenation settings to set the minimum distance between a
hyphenation point and the compound constituent boundary to its left (it is
also available on the sidebar:
https://numbertext.org/typography/#__RefHeading___Toc2799_2584835629,
thanks to the support of the NLnet Foundation).

I realized that this feature is much more important than I had thought.
I wrote an article about it (so far only in Hungarian); see its English
abstract:
https://numbertext.org/typography/automatikus_magyar_elv%C3%A1laszt%C3%A1s_a_LibreOffice-ban.pdf

For example, hyphenation is traditionally forbidden in headings, or allowed
only at compound constituent boundaries. In languages with very long
compound words, the current pattern-based automatic hyphenation can cause
really annoying problems there, which need either manual intervention
(proofreading) or stylistic constraints (very small font sizes in
headings). A similar area is column/page/spread boundaries (spread =
visible page pair), where hyphenation is often forbidden; this is the
default setting in MSO. But its current implementation, shifting full lines
to the next column/page/spread, is an ugly and imperfect solution to the
problem in several languages compared to compound-constituent-based
hyphenation, because it does not guarantee the elimination of a
hard-to-read or ugly hyphenation on the last line (if the previous line was
also hyphenated).

My plan is to continue extending the support for compound-constituent-based
hyphenation, which can be applied to fully automatic line breaking of
headings, and also across columns, pages and spreads, in German and in
languages with German-type orthography, e.g. Danish, Dutch, Hungarian,
Norwegian and Swedish.

Recent LibreOffice ODF extension:

loext:hyphenation-compound-left-char-count (now
loext:hyphenation-compound-remain-char-count)

Minimum hyphenation distance from the left compound constituent boundary.
The default setting is 2, which means that a 1-character distance is
forbidden. Setting it to 3 also forbids a 2-character distance: e.g.
"counter-intelligence" is allowed, but "counterin-telligence" is not. The
value 0 means maximal distance, i.e. no hyphenation after the left compound
constituent boundary (except at constituent boundaries). This setting
affects neither the hyphenation within the first compound constituent nor
the hyphenation in the suffixes of the compound word.

The planned improvements:

loext:hyphenation-compound-right-char-count

Minimum hyphenation distance from the right compound constituent boundary.
The default setting is 2, which means that a 1-character distance is
forbidden. Setting it to 3 also forbids a 2-character distance: e.g.
"honeybee" may not be hyphenated as "hon-eybee". The value 0 means maximal
distance, i.e. no hyphenation before the right compound constituent
boundary (except at constituent boundaries). This setting affects the
hyphenation of the first compound constituent.

When this option and loext:hyphenation-compound-left-char-count are both
set to 0, hyphenation is allowed only at compound constituent boundaries
and in the suffixes of the compound word, but not within the (stems of the)
compound constituents.
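
A minimal Python sketch of how I read the two distance settings above. This
is only an illustration, not the LibreOffice implementation: the function
name, the character offsets and the candidate break points of
"counterintelligence" are assumptions, and suffixes of the compound word
are not modelled at all.

```python
def allowed_breaks(breaks, boundaries, left_min=2, right_min=2):
    """Filter candidate break offsets by distance from compound boundaries.

    breaks     -- candidate offsets (number of characters before the hyphen)
    boundaries -- offsets of the compound constituent boundaries
    left_min   -- minimum distance from the boundary on the left (0 = none)
    right_min  -- minimum distance from the boundary on the right (0 = none)
    """
    kept = []
    for b in breaks:
        if b in boundaries:
            kept.append(b)  # breaks at a boundary are always allowed
            continue
        left = max((x for x in boundaries if x < b), default=None)
        right = min((x for x in boundaries if x > b), default=None)
        ok = True
        # The left rule is skipped inside the first constituent (left is None).
        if left is not None and (left_min == 0 or b - left < left_min):
            ok = False
        # The right rule is skipped inside the last constituent (right is None).
        if right is not None and (right_min == 0 or right - b < right_min):
            ok = False
        if ok:
            kept.append(b)
    return kept


# Assumed pattern-based breaks: coun-ter-in-tel-li-gence, boundary after "counter"
print(allowed_breaks([4, 7, 9, 12, 14], [7], left_min=3))
# Both settings 0: only the constituent boundary survives
print(allowed_breaks([4, 7, 9, 12, 14], [7], left_min=0, right_min=0))
```

With left_min=3 the break "counterin-telligence" (offset 9) is dropped,
because it is only 2 characters after the boundary at offset 7.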

loext:hyphenation-remain-char-count-compound

The minimum number of characters in a compound word before the hyphenation
character, if the word is hyphenated at a compound constituent boundary.
This is the minimum number of characters in the compound constituent(s)
left on the line ending with the hyphenation character. The default value 0
has a special meaning: apply the setting of fo:hyphenation-remain-char-count.

loext:hyphenation-push-char-count-compound

The minimum number of characters in a compound word after the hyphenation
character, if the word is hyphenated at a compound constituent boundary.
This is the minimum number of characters in the compound constituent(s)
pushed to the next line after the line ending with the hyphenation
character. The default value 0 has a special meaning: apply the setting of
fo:hyphenation-push-char-count.

Example: to forbid hyphenation in the suffix of compound words, but not at
their compound boundaries, set fo:hyphenation-push-char-count to the size
of the largest suffix, and set loext:hyphenation-push-char-count-compound
to a lower non-zero value.
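
The fallback rule of the two planned settings above could look like the
following Python sketch. It is only an illustration under my own
assumptions: the function name, its parameters and the numbers in the
example are hypothetical, and only the "0 means use the fo: setting" rule
comes from the description.

```python
def break_allowed(word_len, pos, at_compound_boundary,
                  remain, push, remain_compound=0, push_compound=0):
    """Check the remain/push character limits for one break position.

    pos             -- number of characters left before the hyphen
    remain, push    -- fo:hyphenation-remain/push-char-count
    remain_compound -- loext:hyphenation-remain-char-count-compound (0 = use remain)
    push_compound   -- loext:hyphenation-push-char-count-compound (0 = use push)
    """
    if at_compound_boundary:
        min_remain = remain_compound or remain  # 0 falls back to the fo: value
        min_push = push_compound or push
    else:
        min_remain, min_push = remain, push
    return pos >= min_remain and word_len - pos >= min_push


# Hypothetical 12-character compound with a constituent boundary at offset 8.
# push=5 blocks short endings, push_compound=2 still allows the boundary.
print(break_allowed(12, 9, False, remain=2, push=5, push_compound=2))
print(break_allowed(12, 8, True, remain=2, push=5, push_compound=2))
```

The first call refuses a break that would push only 3 characters, while the
second accepts the break at the compound boundary, which pushes 4.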

I am interested to hear your opinions, especially from experts in the
listed languages, and I welcome any further suggestions, including for
other languages.

Best regards and Happy New Year!

László

Németh László <nemeth@numbertext.org> wrote (on Wed, 27 Dec 2023, 13:43):

Hi,

As a complement to my ongoing typography developments (
https://numbertext.org/typography/), I am working on an extension of the
hyphenation zone that hyphenates compound words at stem boundaries,
combining the pattern-based libhyphen hyphenation with the morphological
analysis of Hunspell. This would result in state-of-the-art automatic
hyphenation in several languages that frequently write long or very long
non-dictionary compound words. My plan is to commit the first working
version within a few days, so it will be possible to test it.

There is no need to extend the Hunspell dictionaries to get some
improvement, but doing so can help (see “Specifying stem boundaries in .dic
file”). It can also be an option for checking and fixing the bugs of the
pattern-based libhyphen hyphenation later.

An explanation of the extension: the hyphenation zone is a custom distance
from the end of the line within which the preferred line break is the last
space, if there is one, despite any possible hyphenation after it. The
planned extension differentiates between the possible (libhyphen)
hyphenations within the hyphenation zone and prefers the last one that is
also a stem boundary (according to Hunspell), despite any possible
hyphenations after it.

Issue with references:
https://bugs.documentfoundation.org/show_bug.cgi?id=158885

I have attached a Hungarian test file to the issue, with a hyphenation
zone (see Text Flow in paragraph settings), and with some explanation. You
can make your OpenDocument test file based on this example, if your
Hunspell dictionary uses COMPOUNDFLAG, COMPOUNDBEGIN etc. compound word
features.

It's already possible to test Hunspell compound word decomposition with the
command-line Hunspell, using -m (morphological analysis). For example, for
Swedish:

~/libreoffice/dictionaries/sv_SE$ hunspell -d sv_SE -m
rättstavningskontroll
rättstavningskontroll  pa:kontroll

So far this analysis contains only the last stem or part (pa:) of the
compound, but I plan to extend Hunspell to return the other parts, too
(which can contain in-word suffixes, e.g. Fuge-elements in German,
fogemorphemes in Swedish).
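
For illustration, here is a minimal Python sketch of how the last stem
boundary could be derived from such an output line. The function name is
hypothetical, and the sketch assumes that the pa: value appears literally
at the end of the analysed word, which will not hold for suffixed
compounds.

```python
def last_part_boundary(hunspell_line):
    """Return the offset where the last compound part (pa:) starts, or None.

    hunspell_line -- one output line of "hunspell -m": the word followed
                     by its morphological fields.
    """
    tokens = hunspell_line.split()
    if not tokens:
        return None
    word, fields = tokens[0], tokens[1:]
    for field in fields:
        if field.startswith("pa:"):
            part = field[len("pa:"):]
            # Assumption: the part is the literal, proper ending of the word.
            if part and len(part) < len(word) and word.endswith(part):
                return len(word) - len(part)
    return None


line = "rättstavningskontroll  pa:kontroll"
offset = last_part_boundary(line)
word = line.split()[0]
print(word[:offset] + "-" + word[offset:])  # rättstavnings-kontroll
```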

Compare this with libhyphen hyphenation:

~/libreoffice/dictionaries/sv_SE$ /home/laci/libreoffice/workdir/UnpackedTarball/hyphen/example hyph_sv.dic <(echo rättstavningskontroll)
rätt=stav=nings=kon=troll

With the existing Hunspell analysis, the upcoming development code can
prefer "rättstavnings-kontroll" over the hyphenation
"rättstavningskon-troll" in the specified hyphenation zone. With the
planned extension of Hunspell, it could also prefer
"rätt-stavningskontroll" over "rättstav-ningskontroll".
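
The preference described above can be sketched in a few lines of Python.
This is not the LibreOffice code: the function names are hypothetical, the
hyphenation zone is simplified to a range of character offsets, and the
fallback to the last ordinary hyphenation point when the zone contains no
stem boundary is my assumption.

```python
def hyphen_positions(hyphenated, sep="="):
    """Turn libhyphen-style output ("rätt=stav=...") into break offsets."""
    positions, offset = [], 0
    for piece in hyphenated.split(sep)[:-1]:
        offset += len(piece)
        positions.append(offset)
    return positions


def preferred_break(positions, stem_boundaries, zone_start, zone_end):
    """Pick the last break in the zone that is also a stem boundary.

    zone_start, zone_end -- hypothetical character offsets delimiting the
                            breaks that fall into the hyphenation zone
    """
    in_zone = [p for p in positions if zone_start <= p <= zone_end]
    stems = [p for p in in_zone if p in stem_boundaries]
    if stems:
        return stems[-1]
    # Assumed fallback: the last pattern-based break, or no break at all.
    return in_zone[-1] if in_zone else None


breaks = hyphen_positions("rätt=stav=nings=kon=troll")
print(breaks)                                # [4, 8, 13, 16]
print(preferred_break(breaks, {13}, 10, 17))  # 13: rättstavnings-kontroll
print(preferred_break(breaks, {4, 13}, 2, 10))  # 4: rätt-stavningskontroll
```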

= Specifying stem boundaries in .dic file =

Using morphological fields, it's possible to add information for the stem
boundaries of a word with the following syntax:

stavningskontroll hy:stavnings|kontroll

or using the index of the character before the hyphenation point:

stavningskontroll hy:9

The first solution allows specifying several stem boundaries:

rättstavningskontroll hy:rätt|stavnings|kontroll

or giving more weight to one of the stem boundaries for the hyphenation zone:

rättstavningskontroll hy:rätt||stavnings|kontroll

(Note: the real preference is likely the opposite, because the meaning is
good|spelling||control, but I wanted to show an example where we can prefer
a stem boundary other than the last one within the hyphenation zone.)

See also man 5 hunspell for morphological fields.
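
To make the hy: syntax concrete, here is a minimal Python sketch of a
parser for the three forms shown above. It is not Hunspell code: the
function name is hypothetical, and treating the number of "|" characters as
the weight of a boundary is my reading of the weighted example.

```python
import re


def parse_hy_field(value):
    """Parse the value of a hy: field into (offset, weight) pairs.

    "9"                        -> [(9, 1)]
    "stavnings|kontroll"       -> [(9, 1)]
    "rätt||stavnings|kontroll" -> [(4, 2), (13, 1)]
    """
    if value.isdigit():
        # Index of the character before the hyphenation point.
        return [(int(value), 1)]
    boundaries, offset = [], 0
    for token in re.findall(r"\|+|[^|]+", value):
        if token.startswith("|"):
            # Assumption: more bars mean a stronger preference.
            boundaries.append((offset, len(token)))
        else:
            offset += len(token)
    return boundaries


print(parse_hy_field("stavnings|kontroll"))
print(parse_hy_field("rätt||stavnings|kontroll"))
```

Both forms for "stavningskontroll" give the same boundary, offset 9, so the
caller does not need to know which notation the dictionary used.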

I am interested to hear your opinions and welcome any further suggestions,
including for other languages, too.

Best regards and Happy New Year!

László

