

On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells <sungkhum@gmail.com> wrote:

Firstly, you are right, I was mistaken about ICU and the breakiterator
working for sentences. I just tried it right now and it does work,
but not with the normal Khmer "khan" (the Khmer period); it only
works with Latin sentence markers, which is not enough.  I had
thought when we put in the code for the breakiterator that it also
covered the sentence, but I guess not (I will work towards getting it
working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).
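As a rough illustration of the *clause* (rather than *sentence*) breaking discussed above, here is a minimal Python sketch that treats KHMER SIGN KHAN (U+17D4) as a terminator alongside the Latin markers. It is not the ICU or CLDR rule syntax, and the name split_clauses is hypothetical; it only shows the effect of adding khan to the terminator set.

```python
import re

KHAN = "\u17d4"  # KHMER SIGN KHAN, the Khmer "period"

# Assumption: a clause is a run of non-terminators followed by any
# terminators. Real CLDR sentence-break rules are far more involved.
_CLAUSE = re.compile("[^.!?\u17d4]+[.!?\u17d4]*")

def split_clauses(text):
    """Split text into clauses ending at '.', '!', '?' or khan."""
    clauses = []
    for match in _CLAUSE.finditer(text):
        clause = match.group().strip()
        if clause:
            clauses.append(clause)
    return clauses
```

This only finds clause boundaries at explicit punctuation; it does nothing for the harder problem of sentence boundaries that are not marked at all.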

In response to your comments:

1) The user always marks word breaks with ZWSP.
In this case, the ideal is to switch off the break iterator for the
language.


There is some truth to this - and that is why I had it as my last
option (just turning the whole thing off). But the ICU breakiterator
for Khmer actually works quite well with normal language - it breaks
down when there are proper names. So turning it off is an option, but
not the most ideal solution. Some users will continue to always mark
breaks with a ZWSP (for full control), but I also think having the
option to turn it off for more complex sentences would be ideal.

2) The user never marks word breaks.
In this case, the user is totally dependent on the break iterator,
and cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer.
But if in the end it would work better for Thai to have an "off" and
"on" option only, that would be fine for Khmer as well for now, until
we can come up with a more ideal solution.


3) The user only marks word breaks and non-word breaks when the
iterator fails.

The problem with this in Khmer is the user cannot tell when the
breakiterator fails, unless it is on a line-break.  A word could be
broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced
word boundaries.

Actually, if users could see where the
breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

The only problem with this would be at the beginning of a document or
the beginning of any new "re-syncing" segment because you might run
into something like this:

User input (example in English so others can make sense of it I hope):
wordwordwordwordword.
How the sentence is broken up by the breakiterator: wo r d word word
wo rd word.
User adds ZWSP to fix broken word on line-break: wo r d word word
ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

But user has no idea the first word is broken incorrectly and that it
is also spelled incorrectly.

This is why it would be best (I think) as Martin suggested that when
a ZWSP is detected it also turn off break iteration for the previous
words up until a re-sync point.  This would practically give the user
an "off" option for the whole document if they so chose, and without
the confusion of having to find some option in the Tools menu to turn
it on or off - it would just be automatic, depending on the user's
habit.

I was clearly not clear enough.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary
contains all the words.

I agree with this:

Considering these four use cases, it seems simplest to let ZWSP, WJ
and ZWNBSP disable the iterator for the extent of the
dictionariless word in which it occurs.

Except, it also should disable the breakiterator up to the previous
re-sync point...

But that is what I meant!
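To make the proposal concrete, here is a minimal Python sketch, under stated assumptions. A "dictionariless word" is approximated as a whitespace-delimited run (punctuation is ignored), and dictionary_break is a hypothetical greedy longest-match stand-in for the ICU dictionary-based iterator, not its real algorithm. Any run containing ZWSP, WJ or ZWNBSP is broken only at the user's ZWSPs.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE: user-marked word break
WJ = "\u2060"      # WORD JOINER: user-marked non-break
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE: user-marked non-break
MANUAL_MARKS = (ZWSP, WJ, ZWNBSP)

def dictionary_break(run, dictionary):
    """Hypothetical stand-in for the iterator: greedy longest match."""
    words, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in dictionary:
                break
        else:
            j = i + 1  # no dictionary entry starts here: emit one character
        words.append(run[i:j])
        i = j
    return words

def segment(text, dictionary):
    """Break text into words, honouring manual marks per run."""
    words = []
    for run in text.split():  # whitespace-delimited "dictionariless words"
        if any(mark in run for mark in MANUAL_MARKS):
            # The user has taken control of this run: break only at ZWSP
            # and drop the invisible non-break marks from the output.
            for part in run.split(ZWSP):
                part = part.replace(WJ, "").replace(ZWNBSP, "")
                if part:
                    words.append(part)
        else:
            words.extend(dictionary_break(run, dictionary))
    return words
```

With a dictionary containing only "word", the run wordwordword is broken into three words, while wordwordZWSPword comes out as wordword and word. That is the cost noted above: once one ZWSP is present, the user has to mark the remaining breaks in that run by hand.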

But actually, there is a rule in ICU for the MAIYAMOK
so unless that is not working properly, I am not sure why LibreOffice
doesn't break correctly...

I'll have to look further into this - and check that misbehaviour is
still happening.  Squiggly lines are what I chiefly remember.  There may
also be a Hunspell issue - the entries in the dictionary don't have
spaces before maiyamok.  The difference between finding word boundaries
and finding line boundaries may be significant here.  

Richard.
