Thanks for your input, Richard.
Firstly, you are right: I was mistaken about the ICU breakiterator working
for sentences. I just tried it, and it does work, but only with Latin
sentence markers and not with the normal Khmer "khan" (period), which is
not enough. I had thought that when we put in the code for the
breakiterator it also covered sentences, but I guess not (I will work
towards getting it working for Khmer).
In response to your comments:
> 1) The user always marks word breaks with ZWSP.
> In this case, the ideal is to switch off the break iterator for the
> language.
There is some truth to this, and that is why I had it as my last option
(just turning the whole thing off). But the ICU breakiterator for Khmer
actually works quite well with normal language; it breaks down when there
are proper names. So turning it off is an option, but not the ideal
solution. Some users will continue to always mark breaks with a ZWSP (for
full control), but I also think having the option to turn it off for more
complex sentences would be ideal.
> 2) The user never marks word breaks.
> In this case, the user is totally dependent on the break iterator, and
> cannot be helped when it fails.
As I said above, I think a both/and solution would be ideal for Khmer. But
if in the end it would work better for Thai to have an "off" and "on"
option only, that would be fine for Khmer as well for now, until we can
come up with a more ideal solution.
> 3) The user only marks word breaks and non-word breaks when the iterator
> fails.
The problem with this in Khmer is the user cannot tell when the
breakiterator fails, unless it is on a line-break. A word could be broken
up into three parts and the user would never know it. This is why the issue
is so complex. Actually, if users could see where the breakiterator is
breaking words, that would simplify things a lot. Though I still think the
option to turn the breakiterator "on" or "off" for certain sentences would
be ideal (especially sentences with a ton of proper nouns where the ICU
breakiterator for Khmer has the most trouble).
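
To make the "let users see the breaks" idea concrete, here is a minimal sketch in plain C++ (no ICU). It assumes the break offsets have already been obtained from a word break iterator; the function name `markBreaks` and the `|` marker are hypothetical, not anything that exists in LibreOffice or ICU.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns a copy of `text` with `marker` inserted at
// every break offset. `breaks` holds offsets into `text` (in code points),
// sorted ascending, as a word break iterator might report them.
// Offsets at 0 or at the end of the text produce no marker.
std::u32string markBreaks(const std::u32string& text,
                          const std::vector<std::size_t>& breaks,
                          char32_t marker = U'|') {
    std::u32string out;
    out.reserve(text.size() + breaks.size());
    std::size_t next = 0;  // index of the next unprocessed entry in `breaks`
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Drop offsets already passed (this also collapses duplicates).
        while (next < breaks.size() && breaks[next] < i) ++next;
        if (i != 0 && next < breaks.size() && breaks[next] == i) {
            out.push_back(marker);
        }
        out.push_back(text[i]);
    }
    return out;
}
```

With the text `wordwordword` and break offsets 2, 3, 4 and 8, this yields `wo|r|d|word|word`, so a wrongly split first word becomes visible even when it does not fall on a line-break.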
As for finding re-syncing points (when to turn the breakiterator back on
after it has been turned off by a ZWSP), I agree with you:
> The obvious re-synching points
> are word external punctuation, such as end-of-line, white space,
> quotation marks, commas and dandas (and as dandas I would include U+0E2F
> THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
> KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
> ฯลฯ and ฯเปฯ).
The only problem with this would be at the beginning of a document or the
beginning of any new "re-syncing" segment because you might run into
something like this:
User input (example in English so others can make sense of it, I hope):
  wordwordwordwordword.
How the sentence is broken up by the breakiterator:
  wo r d word word wo rd word.
User adds ZWSP to fix the broken word on the line-break:
  wo r d word word ZWSPwordword.
But the user has no idea that the first word is broken incorrectly and
that it is also spelled incorrectly.
This is why it would be best (I think), as Martin suggested, that when a
ZWSP is detected it also turns off break iteration for the previous words,
back to the preceding re-sync point. This would practically give the user
an "off" option for the whole document if they so chose, and without the
confusion of having to find some option in the Tools menu to turn it on or
off - it would just be automatic, depending on the user's habit.
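
A rough sketch of that rule in plain C++ (no ICU), only to pin down the behaviour being proposed. The function names are hypothetical, and the list of re-sync characters is an assumption based on the punctuation named above (I have added the Latin full stop and KHMER SIGN KHAN for the sake of the example); the Thai ฯลฯ and ฯเปฯ exceptions and the BOM use of U+FEFF are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed re-sync characters; illustrative, not exhaustive.
bool isResyncPoint(char32_t c) {
    switch (c) {
        case U'\n': case U' ': case U'\t':  // end of line, white space
        case U',': case U'.': case U'"':    // Latin punctuation (assumed)
        case U'\u201C': case U'\u201D':     // curly quotation marks
        case U'\u0E2F':                     // THAI CHARACTER PAIYANNOI
        case U'\u17D4':                     // KHMER SIGN KHAN (assumed)
        case U'\u17D5':                     // KHMER SIGN BARIYOOSAN
            return true;
        default:
            return false;
    }
}

// Characters a user types to control breaks by hand.
bool isManualBreakMark(char32_t c) {
    return c == U'\u200B'    // ZERO WIDTH SPACE
        || c == U'\u2060'    // WORD JOINER
        || c == U'\uFEFF';   // ZERO WIDTH NO-BREAK SPACE
}

// Returns one flag per code point: true where dictionary break iteration
// stays enabled, false inside any segment (bounded by re-sync points) that
// contains a manual mark. Text before the mark is disabled as well.
std::vector<bool> iteratorEnabled(const std::u32string& text) {
    std::vector<bool> enabled(text.size(), true);
    std::size_t start = 0;
    while (start < text.size()) {
        // Scan to the end of the current segment, noting any manual mark.
        std::size_t end = start;
        bool hasMark = false;
        while (end < text.size() && !isResyncPoint(text[end])) {
            hasMark = hasMark || isManualBreakMark(text[end]);
            ++end;
        }
        if (hasMark) {
            for (std::size_t i = start; i < end; ++i) enabled[i] = false;
        }
        start = end + 1;  // step over the re-sync character itself
    }
    return enabled;
}
```

For `abc<ZWSP>def ghi`, everything up to the space is flagged as disabled and `ghi` stays enabled. A user who puts a ZWSP in every segment therefore gets the iterator switched off everywhere, with no menu option involved.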
I agree with this:
> Considering these four use cases, it seems simplest to let ZWSP, WJ and
> ZWNBSP disable the iterator for the extent of the dictionariless word
> in which it occurs.
Except, it also should disable the breakiterator back to the previous
re-sync point, to enable users to functionally "turn off" the breakiterator
if they so choose. For Khmer this is necessary because a book editor like
myself will want to put in the breaks manually and not let the
breakiterator do anything automatically. The feature is still nice for the
casual user, because for Cambodians it is much faster and more intuitive
not to type spaces between words.
> A related issue that seems not to be handled is repetition mark U+0E46
> THAI CHARACTER MAIYAMOK. It should be separated from the preceding
> alphabetic characters by a space, but LibreOffice doesn't recognise
> the sequence as a possible continuation of the word. Sometimes it
> is a necessary part of a word. I don't know what the situation is in
> Khmer.
In Khmer the repeat character (U+17D7 LEK TOO) is not separated from the
preceding word by a space, but is connected, so this is not an issue for
us. But actually, there is a rule in ICU for the MAIYAMOK so unless that
is not working properly, I am not sure why LibreOffice doesn't break
correctly...
Here's the code from ICU4c for the Thai MAIYAMOK from dictbe.cpp if anyone
is interested...
    if (uc == THAI_MAIYAMOK) {
        if (utext_previous32(text) != THAI_MAIYAMOK) {
            // Skip over previous end and MAIYAMOK
            utext_next32(text);
            utext_next32(text);
            wordLength += 1; // Add MAIYAMOK to word
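
As I read that excerpt, the rule amounts to the following standalone sketch in plain C++ (no ICU types). The function name `extendOverMaiyamok` and the offset-based interface are hypothetical; this illustrates the logic and is not the ICU implementation.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kThaiMaiyamok = U'\u0E46';  // THAI CHARACTER MAIYAMOK

// Given a word occupying [start, start + length) in `text`, returns the
// length extended by one if the word is immediately followed by a MAIYAMOK
// and does not itself end in one; otherwise returns `length` unchanged.
std::size_t extendOverMaiyamok(const std::u32string& text,
                               std::size_t start, std::size_t length) {
    const std::size_t end = start + length;
    if (length == 0 || end >= text.size()) return length;
    if (text[end] == kThaiMaiyamok && text[end - 1] != kThaiMaiyamok) {
        return length + 1;  // add MAIYAMOK to the word
    }
    return length;
}
```

Note that in this sketch the MAIYAMOK has to follow the word directly. If a space is typed between the word and the MAIYAMOK, the check does not match and the word length is left as it was.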
Thoughts?
-Nathan
On Thu, Sep 27, 2012 at 6:14 PM, Richard Wordingham <richard.wordingham@ntlworld.com> wrote:
On Thu, 27 Sep 2012 11:52:26 +0700
Nathan Wells <sungkhum@gmail.com> wrote:
> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a
> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
> iteration is disabled for the whole sentence.
Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
break iteration should be disabled for the whole sentence.
What is the logic of this?
The use cases I see are:
1) The user always marks word breaks with ZWSP.
In this case, the ideal is to switch off the break iterator for the
language.
2) The user never marks word breaks.
In this case, the user is totally dependent on the break iterator, and
cannot be helped when it fails.
3) The user only marks word breaks and non-word breaks when the iterator
fails.
In this case, the iterator need only be switched off from the point of
override until it can clearly re-synch. The obvious re-synching points
are word external punctuation, such as end-of-line, white space,
quotation marks, commas and dandas (and as dandas I would include U+0E2F
THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
ฯลฯ and ฯเปฯ).
Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.
4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.
A ZWSP in a chunk of text would then tag the text as having come from a
user in case 1 or 3; we have no reliable way of distinguishing the
two cases. A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
paragraph initial is suspect) would strongly suggest use case 3 - but
might occur in use case 1 if the user has had to fight a break
iterator.
(end of use cases)
Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which it occurs.
What is the definition of an ICU sentence boundary? I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.
The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.
A related issue that seems not to be handled is repetition mark U+0E46
THAI CHARACTER MAIYAMOK. It should be separated from the preceding
alphabetic characters by a space, but LibreOffice doesn't recognise
the sequence as a possible continuation of the word. Sometimes it
is a necessary part of a word. I don't know what the situation is in
Khmer.
Richard.