Re: Adding Extension for Experimental Thai Spelling

Richard Wordingham <richard.wordingham -AT- ntlworld.com>
Thu, 26 Jul 2012 23:53:57 +0100

On Thu, 26 Jul 2012 16:33:00 +0700
Martin Hosken <martin_hosken@sil.org> wrote:

> 1. use of U+2060 makes string searching and spell checking harder
> (unless WJ chars are stripped for searching and spell checking). They
> are not part of the spelling of a word, so their introduction in the
> underlying text stream is problematic for other text processing
> processes (like searching as mentioned). This is less of an issue for
> U+200B ZWSP because that occurs between words and searching across
> word boundaries is a rarer activity. Likewise spell checking across
> word boundaries isn't really needed.


U+2060 WJ should definitely be skipped for searching and, once it has
done its gluing job, for spell-checking look-up, just like U+00AD SOFT
HYPHEN.  Both are unquestionably completely ignorable for collation, and
therefore for UCA (Unicode Collation Algorithm) searching.
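
As a rough illustration of skipping the two characters before a search
or dictionary look-up, here is a minimal Python sketch.  The helper
names strip_ignorables and contains_ignoring are made up for this
example; a real UCA search would ignore many more characters than
these two.

```python
# Minimal sketch: drop WJ and SOFT HYPHEN before searching or looking a
# word up. Helper names are hypothetical, not from any existing library.
WORD_JOINER = "\u2060"
SOFT_HYPHEN = "\u00ad"

# str.translate deletes every code point that maps to None.
_IGNORABLE = {ord(WORD_JOINER): None, ord(SOFT_HYPHEN): None}


def strip_ignorables(text: str) -> str:
    """Return text with U+2060 and U+00AD removed."""
    return text.translate(_IGNORABLE)


def contains_ignoring(haystack: str, needle: str) -> bool:
    """Substring test that treats WJ and SOFT HYPHEN as absent."""
    return strip_ignorables(needle) in strip_ignorables(haystack)
```

With this, a word that carries an embedded WJ still matches a search
for its plain spelling, and the stripped form is what would be handed
to the spelling dictionary.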

> Now what happens if I want to put zw around a word that occurs < 20
> chars after my last zw? The on off nature of the zw has now been
> inverted. One option is to say that zw must always occur in pairs and
> you would have to bracket your first or second word there. But then
> management of which zw is on and which is off will get confusing for
> users.


I think that is the wrong way of looking at it.  Various characters,
some ZWSP, others more natural, such as SP, tell the break iterators
where some word boundaries are.  The rule we would have is that the
break iterator should not try to break runs of fewer than, say, 20
characters if one of the boundaries is provided by ZWSP.  I am not
proposing that we limit how many breaks it makes in a run - 21
characters could be broken into seven words.  The short runs the break
iterator is prohibited from breaking can still be checked for spelling.
If they are not words, then the user can respond to the red wiggly line
appropriately, e.g. by putting extra word breaks in.
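
The rule can be sketched in Python as follows.  This is only an
illustration under stated assumptions: segment and min_run are names
invented here, dictionary_break stands in for whatever dictionary-based
breaker (for example ICU's) would normally split a run, and the
threshold of 20 is the "say, 20" figure above, counted in code points.

```python
import re

ZWSP = "\u200b"
MIN_RUN = 20  # the "say, 20" threshold; an assumption, counted in code points


def segment(text, dictionary_break, min_run=MIN_RUN):
    """Split text at SP and ZWSP, then break each run further only if allowed.

    A run shorter than min_run with a ZWSP on at least one side is kept
    whole; every other run is passed to dictionary_break, which may split
    it into any number of words.
    """
    words = []
    # The capturing group keeps delimiters: runs sit at even indices,
    # the delimiter that followed each run at the next odd index.
    parts = re.split("([ \u200b])", text)
    for i in range(0, len(parts), 2):
        run = parts[i]
        if not run:
            continue  # adjacent delimiters produce empty runs
        before = parts[i - 1] if i > 0 else ""
        after = parts[i + 1] if i + 1 < len(parts) else ""
        if len(run) < min_run and ZWSP in (before, after):
            words.append(run)  # short ZWSP-delimited run: do not break
        else:
            words.extend(dictionary_break(run))
    return words
```

Each run kept whole can still be passed to the spelling checker as a
single word, so a bad run shows up as a misspelling that the user fixes
by inserting further breaks.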

In the example you gave, one would have to split the words between the
delimited words.  I think the users must accept that - the rule we
would be working with is that the break iterator does not break short
runs created by inserted ZWSP, and that is a simple rule to
understand.  I suppose there may be some question of what to count -
base consonants perhaps? (In Unicode jargon, that would be extended
grapheme clusters.)  That might be a luxury feature we never need to
add.
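
To show how the choice of counting unit changes the length of a run,
here is a small Python sketch that counts Thai consonant letters
instead of code points.  It is an approximation of "base consonants"
only, not a grapheme cluster count, and count_thai_consonants is a
name invented for this example.

```python
# Sketch: count Thai consonant letters in a run rather than code points.
# The range is the consonant part of the Thai block; it also contains
# U+0E24 RU and U+0E26 LU, which this approximation counts as consonants.
THAI_CONSONANT_FIRST = 0x0E01  # THAI CHARACTER KO KAI
THAI_CONSONANT_LAST = 0x0E2E   # THAI CHARACTER HO NOKHUK


def count_thai_consonants(run: str) -> int:
    """Return how many code points in run fall in the consonant range."""
    return sum(
        THAI_CONSONANT_FIRST <= ord(ch) <= THAI_CONSONANT_LAST for ch in run
    )
```

For a syllable made of one consonant, a tone mark and a vowel sign,
len() gives 3 while this count gives 1, so a threshold of 20 would cover
noticeably longer stretches of text under this measure.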

Richard.
