

Thanks Martin,


1. If you are shutting off the ICU breakiterator for text following, we
should probably also do it for text preceding. Thus if there is a ZWSP or
ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
for the whole sentence.


Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break
iteration should be disabled for the whole sentence.
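
A minimal sketch of this whole-sentence rule, in Python rather than in LibreOffice's own code. It assumes the text has already been split into sentences (the thread suggests the ICU sentence boundary for that), and `auto_break` is a hypothetical stand-in for the ICU word breaker, not a real API:

```python
from typing import Callable, List

ZWSP = "\u200b"  # ZERO WIDTH SPACE
WJ = "\u2060"    # WORD JOINER


def segment_sentence(sentence: str,
                     auto_break: Callable[[str], List[str]]) -> List[str]:
    """Return the words of one sentence.

    If ZWSP or WJ occurs anywhere in the sentence, automatic breaking is
    skipped for the whole sentence and only explicit ZWSP and U+0020
    breaks are honoured. Otherwise the automatic breaker is used.
    """
    if ZWSP in sentence or WJ in sentence:
        words = []
        for chunk in sentence.replace(ZWSP, " ").split(" "):
            chunk = chunk.replace(WJ, "")  # WJ is not part of the spelling
            if chunk:
                words.append(chunk)
        return words
    return auto_break(sentence)


def toy_breaker(run: str) -> List[str]:
    """Toy automatic breaker for demonstration: cuts every two characters."""
    return [run[k:k + 2] for k in range(0, len(run), 2)]


print(segment_sentence("ab" + ZWSP + "cdef", toy_breaker))
print(segment_sentence("abcdef", toy_breaker))
```

The design choice here is that one marker anywhere switches the sentence to fully manual breaking, so text both before and after the marker is left alone.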


2. Why limit this to Khmer? I suspect as a model it should work for any
non-space broken text.


I am only limiting it to Khmer because that is my expertise and I didn't
want to cause problems for other languages - but it is possible these
changes would be beneficial for other languages that are not broken by
spaces (like Thai).


Thanks,
Nathan

On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken <martin_hosken@sil.org> wrote:

Dear Nathan,

Here are some new ideas, ordered by desirability, with number one being the
most desired and number three the least.

1) When a zero-width space (U+200B) is detected, shut off the ICU
breakiterator for Khmer spell checking for the characters following the
zero-width space until it encounters a real space (U+0020) or the end of
the sentence (detected using the ICU sentence boundary).

I think this is a good direction to head. I have two follow-on comments:

* 1. If you are shutting off the ICU breakiterator for text following, we
should probably also do it for text preceding. Thus if there is a ZWSP or
ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
for the whole sentence.

2. Why limit this to Khmer? I suspect as a model it should work for any
non-space broken text.*

Yours,
Martin




2) Disable use of ICU breakiterator for Khmer spell checking by default,
but allow users to enable it by adding a check-box to enable ICU
breakiterator in the Tools > Options > Language Settings > Writing Aids >
Options dialogue when a Khmer Hunspell dictionary is present
(http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version).

3) Disable use of ICU breakiterator for Khmer spell checking until the ICU
breakiterator for Khmer is more accurate.

Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6,
a lot of spelling errors go unnoticed because the ICU breakiterator breaks
words up incorrectly. So hopefully we can find a solution that will work
with the current ICU breakiterator, though with ICU 50.1 the breakiterator
for Khmer will have some improvements. But I do feel that if solution 1 or 2
(or a better idea from someone else) cannot be implemented, the
breakiterator for spelling with Khmer should be turned off in LibreOffice
until the ICU breakiterator for Khmer is more accurate.


Thanks again for your help and time, your input is greatly appreciated!

Sincerely,

Nathan



On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <martin_hosken@sil.org> wrote:

Dear All,

An automatic word and line breaker is very necessary for Khmer and Thai
because traditionally they have no spaces between words, and so
line-breaking and spell checking require the use of a zero-width space
between words, which is counterintuitive for most native speakers, and so
spell checking goes widely unused.

I agree that automatic word breaking is a good thing and I am relieved to
see that LibreOffice does it based on language selection and not on
automatic language guessing based on scripts. There are more languages that
use Thai script and Khmer script than just Thai and Khmer. So one of my
fears is already alleviated :)

But now with the ICU code you implemented, Thai and Khmer can be
automatically broken, and the results are quite good. But with its
implementation in the real world, I have found some issues that I wanted to
raise and also suggest possible solutions. I write this as an end-user, not
so much as a programmer, nor do I claim to fully understand the
inner workings of ICU and LibreOffice (because I don't!).

First, I will do my best to explain the current results of the ICU
break iterator with Khmer:

Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ

Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ

Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|ឈ្មោះ|សិវកឥវលិយៈ

The differences should be clear – the ICU break iterator does not
break the words with 100% accuracy.
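
One rough way to put a number on that difference is to compare break positions between the automatic and the manual segmentation. The sketch below is only an illustration: the function names are made up for this example, and it counts positions in code points of the unsegmented text, which assumes both strings contain the same underlying characters.

```python
from typing import Set, Tuple


def break_offsets(segmented: str, sep: str = "|") -> Set[int]:
    """Code-point offsets (in the unsegmented text) of the internal breaks."""
    offsets = set()
    position = 0
    words = segmented.split(sep)
    for word in words[:-1]:  # no break is counted after the final word
        position += len(word)
        offsets.add(position)
    return offsets


def boundary_scores(automatic: str, manual: str) -> Tuple[float, float]:
    """Return (precision, recall) of the automatic breaks against the manual ones."""
    auto = break_offsets(automatic)
    gold = break_offsets(manual)
    shared = auto & gold
    precision = len(shared) / len(auto) if auto else 1.0
    recall = len(shared) / len(gold) if gold else 1.0
    return precision, recall


# The two segmentations quoted above.
icu = "មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ"
manual = "មាន|ប្រាជ្ញា|ឈ្លាសវៃ|ឈ្មោះ|សិវកឥវលិយៈ"
print(boundary_scores(icu, manual))
```

Precision here is the share of automatic breaks that a human also made; recall is the share of human breaks that the automatic breaker found.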

One possible solution to this issue lies in how the ICU break iterator
interacts with zero-width spaces (U+200B) in LibreOffice. Before the ICU
code was enabled to automatically break Khmer, an end-user who wanted to
spell check Khmer had to manually place U+200B characters to separate
words. This solution worked quite well, but was counterintuitive to most
native speakers, because Khmer has no spaces (as stated before). But with
this solution, an end-user could be sure that their document was broken
with 100% accuracy, barring human error (something automatic solutions
cannot do; they are more along the lines of 80% accurate). What I propose
is that the break iterator code in LibreOffice looks for U+200B characters
in a given string and considers them a sign NOT to break automatically,
but to allow the end-user full control to manually break words. Let me
explain:

     1. The code starts processing the text and automatically breaking
        it until it comes across a U+200B character. If one is found,
        it searches to see if there are any additional U+200B or U+0020
        characters in the following 20 characters (or so). If there
        are, the break iterator skips over those characters and starts
        again from the second U+200B character (or U+0020), and the
        process repeats, looking for U+200B characters and so on. A
        U+0020 character would only signify the “close” of the manual
        break: sometimes a phrase ends with an actual space, so if the
        word that the user wants to manually break has a “real” U+0020
        space at the end of it, the user does not need to add another
        U+200B character to close it.

     2. This would allow end-users to choose to manually break their
        whole document so they can have precise control, as well as
        allow end-users to place U+200B characters around names of
        people, places or transliterations in order to tell the break
        iterator not to try to break those words.
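
The look-ahead described in the steps above could be sketched roughly as follows. This is one reading of the proposal, not LibreOffice code: `auto_break` is a hypothetical stand-in for the ICU breaker, the 20-character window is the tentative figure from the message, and a lone U+200B with no closing character in the window is treated as an ordinary break after which automatic breaking resumes.

```python
from typing import Callable, List

ZWSP = "\u200b"  # ZERO WIDTH SPACE
SPACE = " "      # U+0020
LOOKAHEAD = 20   # tentative window size from the proposal


def segment(text: str,
            auto_break: Callable[[str], List[str]],
            lookahead: int = LOOKAHEAD) -> List[str]:
    words: List[str] = []
    pending: List[str] = []  # characters waiting for the automatic breaker

    def flush() -> None:
        run = "".join(pending)
        pending.clear()
        if run:
            words.extend(auto_break(run))

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == SPACE:
            flush()
            i += 1
        elif ch == ZWSP:
            flush()
            window = text[i + 1:i + 1 + lookahead]
            close = next((k for k, c in enumerate(window)
                          if c in (ZWSP, SPACE)), None)
            if close is None:
                i += 1  # lone ZWSP: plain break, automatic breaking resumes
            else:
                if close > 0:
                    words.append(window[:close])  # protected word, kept whole
                i += 1 + close  # resume at the closing character
        else:
            pending.append(ch)
            i += 1
    flush()
    return words


def toy_breaker(run: str) -> List[str]:
    """Toy automatic breaker for demonstration: cuts every two characters."""
    return [run[k:k + 2] for k in range(0, len(run), 2)]


print(segment("abcd" + ZWSP + "efghi" + ZWSP + "jklm", toy_breaker))
```

Because the scan resumes at the closing U+200B, that character can open the next protected span, which is what lets a fully hand-broken document stay fully manual.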

In principle I like this approach. I like the idea of being able to force
breaks and non-breaks. But I don't think we are quite there with this
solution yet. Here are my difficulties with it:

1. Use of U+2060 makes string searching and spell checking harder (unless
WJ chars are stripped for searching and spell checking). They are not part
of the spelling of a word, so their introduction in the underlying text
stream is problematic for other text processing (like searching, as
mentioned). This is less of an issue for U+200B ZWSP because that occurs
between words and searching across word boundaries is a rarer activity.
Likewise spell checking across word boundaries isn't really needed.
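
To make the stripping idea concrete, here is a small Python sketch of removing WJ before a dictionary lookup or a search. The helper names are invented for this example, and the returned index refers to the stripped text, so mapping it back to the original string would need extra bookkeeping.

```python
WJ = "\u2060"  # WORD JOINER
_STRIP_WJ = {ord(WJ): None}  # translation table that deletes WJ


def spelling_form(word: str) -> str:
    """The word as it should be looked up: WJ is not part of the spelling."""
    return word.translate(_STRIP_WJ)


def find_ignoring_wj(haystack: str, needle: str) -> int:
    """Index of needle in the WJ-stripped haystack, or -1 if absent."""
    return spelling_form(haystack).find(spelling_form(needle))


text = "xx ab" + WJ + "cd"
print(text.find("abcd"))               # plain search misses the word
print(find_ignoring_wj(text, "abcd"))  # search on the stripped text finds it
```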

2. How do we come up with the range of what is considered a word between
two zwsp chars as opposed to two words? How close to the end of a string
must a zwsp occur to disable all breaking before the end of the string?
Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I think we need
to think harder (deeper) about the use of zwsp in this way and see if we
can come up with something with a little less ambiguity. Having said that,
I think we are going to have to think really hard, because I don't think
this is an easy problem.

     4. I then notice that "ម្នាក់ទៀត" line breaks together (since the
        automatic line-breaking breaks them as one word), and I decide
        I would rather line-break after “ម្នាក់” than have both words
        break connected to each other, so I place a zero-width space
        between the words:
        មាន​ប្រាជ្ញាឈ្លាស​វៃ​ឈ្មោះ<zw>សិវ​កឥ​វលិ​យៈ<sp>​អ្នកប្រាជ្ញ
        ម្នាក់<zw>ទៀត​ដែល​ល្បីល្បាញ​ជាងគេ
        The automatic break iterator comes to the zero-width space,
        stops automatically breaking and looks ahead to see if there is
        a zero-width space or a “real” space within 20 characters (this
        number might need refining, but I think 20 characters would be
        enough). As there are no zero-width or “real” spaces within 20
        characters, the break iterator then goes back to the previous
        zero-width space and starts breaking again from that character.

Now what happens if I want to put zw around a word that occurs < 20 chars
after my last zw? The on/off nature of the zw has now been inverted. One
option is to say that zw must always occur in pairs and you would have to
bracket your first or second word there. But then management of which zw is
on and which is off will get confusing for users.

An alternative model is to weight breakpoints. An explicit breakpoint
weighs more highly than an automatically generated one. Then when it comes
to line breaking, the weight of a breakpoint counts towards its choice as
the actual break. For example, say an explicit break is 2 and an automatic
one is 1. Then we might use a square rule for distance and say: an explicit
break is preferred if it occurs closer to the end of a line than 4x the
distance to the last automatic break on the line. Or somesuch.
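
One way to read the square rule is to score each candidate as weight squared divided by its distance from the end of the line, so an explicit break (2 squared = 4) beats an automatic one (1) as long as it is less than 4x as far from the line end. The Python sketch below follows that reading; the `BreakPoint` type, the scoring formula and the tie-break towards the later position are assumptions made for illustration, not something the message specifies.

```python
from dataclasses import dataclass
from typing import Iterable

EXPLICIT = 2   # weight of a user-entered break (e.g. ZWSP)
AUTOMATIC = 1  # weight of a break proposed by the automatic breaker


@dataclass(frozen=True)
class BreakPoint:
    position: int  # character offset of the break within the line
    weight: int


def choose_break(candidates: Iterable[BreakPoint], line_end: int) -> BreakPoint:
    """Pick the break with the highest weight**2 / distance-to-line-end."""
    def key(bp: BreakPoint):
        distance = line_end - bp.position
        score = float("inf") if distance == 0 else bp.weight ** 2 / distance
        return (score, bp.position)  # ties go to the break nearer the line end

    usable = [bp for bp in candidates if bp.position <= line_end]
    if not usable:
        raise ValueError("no breakpoint fits on the line")
    return max(usable, key=key)


# Automatic break 2 characters from the line end, explicit break 6 away:
# 6 < 4 * 2, so the explicit break is preferred under this rule.
print(choose_break([BreakPoint(38, AUTOMATIC), BreakPoint(34, EXPLICIT)], 40))
```

With the explicit break 10 characters from the end instead, the automatic break at distance 2 would win, since 10 is more than 4x the automatic distance.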

Yours,
Martin


