On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:
...
> However, looking at textsearch.cxx in Open Grok --
> http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
> -- can see this comment before the various types of calls to a search
> routine:
> // use transliteration here, but only if not RegEx, which does it different
> One can also see other exclusion of the regexp search algorithm from the
> transliteration search prep and search result code in textsearch.cxx around
> the calls to the search routines, but I'm not absolutely sure that exclusion
> is complete. If the regexp search truly *never* uses transliteration then
> the swap out will be simpler and the change-over may actually enable
> transliteration. I haven't looked at the internal code of the regexp -
> perhaps it 'does it's own thing' internally for transliteration...
Right. I have only a vague idea what "transliteration" means here. From
a web search I can see that it must be an attempt to deal with things
like accented characters (Is "a" the same as "ä", or not? Is "ss" the
same as "ß"?), but I couldn't find any clear description of exactly what
the transliteration was doing.
There is a letter-case filter applied to the text before a regex search,
changing all characters to one single case, lower case for English text.
If the user indicates that case is significant, the filter is not applied.
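A minimal sketch of that kind of pre-search filter, for illustration only. The function name and signature are hypothetical and not taken from textsearch.cxx, and the folding is ASCII-only; real text would need locale-aware or Unicode-aware case folding.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not the actual OOo code: returns a copy of `text`
// folded to lower case, unless the user asked for a case-sensitive search.
// ASCII-only for brevity.
std::string prepareForSearch(const std::string& text, bool caseSensitive)
{
    if (caseSensitive)
        return text;  // case is significant, so the filter is not applied

    std::string folded(text);
    for (char& c : folded)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return folded;
}
```

For a match to succeed, the search pattern would presumably need the same folding as the text.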
The actual searches get a text buffer and a pair of indices (first,
last) indicating the region to search. The results are returned as a
list of matches, also with indices into the text buffer. The code does a
lot of adjusting of the indices, I suppose to account for
character-level changes due to the transliteration, but again, I can't
really tell what the adjustment code is supposed to do.
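To illustrate why such index adjustment might be needed, here is a toy sketch. It is a guess at the general idea, not a description of what textsearch.cxx does. It assumes a transliteration that expands "ß" to "ss", half-open [first, last) ranges, and an offset table that maps each transliterated character back to its source position. All names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical structure: the transliterated text plus, for each of its
// characters, the index of the source character it was produced from.
struct Transliterated
{
    std::u32string text;
    std::vector<std::size_t> offsets;  // offsets[i] = source index of text[i]
};

// Toy transliteration: expands U+00DF (sharp s) to "ss", copies the rest.
Transliterated transliterate(const std::u32string& source)
{
    Transliterated out;
    for (std::size_t i = 0; i < source.size(); ++i)
    {
        if (source[i] == U'\u00DF')
        {
            out.text += U"ss";
            out.offsets.push_back(i);  // both output characters come
            out.offsets.push_back(i);  // from the same source character
        }
        else
        {
            out.text += source[i];
            out.offsets.push_back(i);
        }
    }
    return out;
}

// Maps a non-empty half-open match [first, last) in the transliterated text
// back to a half-open range in the source text.
// Precondition: first < last <= t.text.size().
std::pair<std::size_t, std::size_t>
toSourceRange(const Transliterated& t, std::size_t first, std::size_t last)
{
    return {t.offsets[first], t.offsets[last - 1] + 1};
}
```

With the source "Straße", the transliterated text is "Strasse", one character longer. A match on "sse" at [4, 7) in the transliterated text maps back to [4, 6) in the source, which is "ße".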
I was also having a lot of trouble learning anything from running OOo
under gdb. Gdb was acting weird and I couldn't step through the code and
poke around. I ended up trying to do it by adding a printf, rebuild,
run, rinse, repeat. No fun; less progress.
My thought was maybe to just avoid all that and start out with an
extension testbed that uses the Boost regexp. I'm sure I can get access
to paragraphs of text without any transliteration or filtering, and see
how well the Boost functions work. If that goes well, then move on to
replacing code.
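A rough sketch of what the core of such a testbed could look like. Note that it uses std::wregex from the C++ standard library as a stand-in rather than Boost itself, on the assumption that switching to Boost.Regex would be straightforward because the standard interface was modelled on it. The function name, the half-open [first, last) region, and the result type are all hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

using MatchRange = std::pair<std::size_t, std::size_t>;

// Hypothetical testbed function: finds every match of `pattern` inside
// text[first, last) and reports each as a half-open pair of indices into
// the whole of `text`. No transliteration or case filtering is applied.
// Precondition: first <= last <= text.size().
std::vector<MatchRange> findAll(const std::wstring& text,
                                const std::wstring& pattern,
                                std::size_t first, std::size_t last)
{
    std::vector<MatchRange> results;
    const std::wregex re(pattern);
    const auto begin = text.begin() + first;
    const auto end = text.begin() + last;

    for (std::wsregex_iterator it(begin, end, re), stop; it != stop; ++it)
    {
        const auto& whole = (*it)[0];  // sub-match 0 is the entire match
        results.emplace_back(
            static_cast<std::size_t>(whole.first - text.begin()),
            static_cast<std::size_t>(whole.second - text.begin()));
    }
    return results;
}
```

For example, searching "cat bat rat" for "[a-z]at" in the region [4, 11) should report two matches, [4, 7) and [8, 11), skipping "cat" because it lies outside the region.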
I think Boost looks like the way to go, since it has a lot of
functionality, supports Unicode (16- or 32-bit chars), and OOo already
uses it.
Performance could be a problem. I saw a comment in the code somewhere
saying that performance is critical for some spreadsheets--I assume
because Calc's lookups default to using regular expression matching.
As far as I can see, that's a faulty design: the lookups should not use
regexp matching unless it is specifically requested. But it may be too
late to change that now.
I've seen benchmarks indicating that the Boost regexp is fairly fast
compared to other regexp engines, but I'm guessing that it's still
slower than the current primitive engine.
<Joe