

On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:

...
However, looking at textsearch.cxx in Open Grok --
http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
--  can see this comment before the various types of calls to a search
routine:
// use transliteration here, but only if not RegEx, which does it different

One can also see other exclusion of the regexp search algorithm from the
transliteration search prep and search result code in textsearch.cxx around
the calls to the search routines, but I'm not absolutely sure that exclusion
is complete.  If the regexp search truly *never* uses transliteration then
the swap out will be simpler and the change-over may actually enable
transliteration.  I haven't looked at the internal code of the regexp -
perhaps it 'does its own thing' internally for transliteration...

Right. I have only a vague idea what "transliteration" means here. From a web search I can see that it must be an attempt to deal with things like accented characters (Is "a" the same as "ä", or not? Is "ss" the same as "ß"?), but I couldn't find any clear description of exactly what the transliteration was doing.

There is a letter-case filter applied to the text before a regex search. It changes all characters to a single case (lower case for English text). If the user indicates that case is significant, the filter is not applied.
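To make that concrete, here is a rough sketch of the kind of pre-filter I mean. It is not the OOo code: foldCase and findFirst are names I made up, it uses std::regex rather than the current engine, and it only handles ASCII. The pattern is left alone, so it is assumed to be written in lower case already when case is not significant (blindly lower-casing a pattern would turn escapes like \S into \s).

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: lower-case an ASCII string (not locale- or Unicode-aware).
std::string foldCase(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Returns the offset of the first match in `text`, or -1 if there is none.
// When caseSensitive is false, the text is folded to lower case first;
// the pattern is NOT folded and is assumed to be lower case already.
long findFirst(const std::string& text, const std::string& pattern, bool caseSensitive) {
    const std::string haystack = caseSensitive ? text : foldCase(text);
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(haystack, m, re)) {
        return -1;
    }
    return static_cast<long>(m.position(0));
}
```

Since folding does not change the length of ASCII text, the returned offset is also valid in the unfiltered text. That stops being true once a filter can insert or remove characters.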

The actual searches get a text buffer and a pair of indices (first, last) indicating the region to search. The results are returned as a list of matches, also with indices into the text buffer. The code does a lot of adjusting of the indices, I suppose to account for character-level changes due to the transliteration, but again, I can't really tell what the adjustment code is supposed to do.
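My guess at what the index adjustment is for, as a sketch only: if the transliteration can change the number of characters, a match found in the transliterated text has to be mapped back to positions in the original buffer. Everything below is an assumption for illustration. The Transliterated struct, transliterate and searchRegion are hypothetical names, the only rule is "ß" -> "ss", and it uses std::wregex instead of the OOo engine.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Transliterated text plus, for each of its characters, the index of the
// original character it was produced from.
struct Transliterated {
    std::wstring text;
    std::vector<std::size_t> origin;
};

// Toy rule for illustration: expand U+00DF (sharp s) to "ss".
Transliterated transliterate(const std::wstring& original,
                             std::size_t first, std::size_t last) {
    Transliterated out;
    for (std::size_t i = first; i < last; ++i) {
        if (original[i] == L'\u00DF') {
            out.text += L"ss";
            out.origin.push_back(i);  // both output characters
            out.origin.push_back(i);  // come from the same input character
        } else {
            out.text += original[i];
            out.origin.push_back(i);
        }
    }
    return out;
}

// Search original[first, last) and return each match as a half-open
// [start, end) pair of indices into the ORIGINAL text.
std::vector<std::pair<std::size_t, std::size_t>>
searchRegion(const std::wstring& original, std::size_t first, std::size_t last,
             const std::wstring& pattern) {
    assert(first <= last && last <= original.size());
    const Transliterated t = transliterate(original, first, last);
    const std::wregex re(pattern);
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::wsregex_iterator it(t.text.begin(), t.text.end(), re), end;
         it != end; ++it) {
        if (it->length(0) == 0) continue;  // ignore empty matches
        const std::size_t s = static_cast<std::size_t>(it->position(0));
        const std::size_t e = s + static_cast<std::size_t>(it->length(0));
        // Map both ends back through the origin table.
        matches.emplace_back(t.origin[s], t.origin[e - 1] + 1);
    }
    return matches;
}
```

With the text "Straße und Strasse" and the pattern "Strasse", this reports the first match as [0, 6) and the second as [11, 18) in the original buffer, even though both are 7 characters long in the transliterated text. A match that ends halfway through an expanded character is widened to cover the whole original character.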

I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code or poke around. I ended up adding a printf, rebuilding, running, rinse, repeat. No fun; less progress.

My thought was maybe to just avoid all that and start out with an extension testbed that uses the Boost regexp. I'm sure I can get access to paragraphs of text without any transliteration or filtering, and see how well the Boost functions work. If that goes well, then move on to replacing code.

I think Boost looks like the way to go, since it has a lot of functionality, supports Unicode (16- or 32-bit chars), and OOo already uses it.

Performance could be a problem. I saw a comment in the code somewhere saying that performance is critical for some spreadsheets--I assume because Calc's lookups default to using regular expression matching.

As far as I can see, that's a faulty design: the lookups should not use regexp matching unless it is specifically requested. But it may be too late to change that now.

I've seen benchmarks indicating that the Boost regexp is fairly fast compared to other regexp engines, but I'm guessing that it's still slower than the current primitive engine.
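Rather than guess, the testbed could time things directly. Below is a minimal harness along those lines, with assumptions stated up front: countPlain, countRegex and meanMicros are names I invented, std::regex stands in for whichever engine is being measured, and a plain substring search stands in for a non-regex lookup. The numbers it produces say nothing about the actual OOo engine.

```cpp
#include <chrono>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Count non-overlapping occurrences with plain substring search.
std::size_t countPlain(const std::string& text, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Count non-overlapping matches of a precompiled regular expression.
std::size_t countRegex(const std::string& text, const std::regex& re) {
    return static_cast<std::size_t>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                      std::sregex_iterator()));
}

// Run fn() `repeats` times and return the mean wall-clock time per run in
// microseconds. fn must return a std::size_t; the result is written to a
// volatile so the compiler cannot discard the call.
template <typename F>
double meanMicros(F&& fn, int repeats) {
    volatile std::size_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        sink = sink + fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / repeats;
}
```

Calling meanMicros once with a lambda that wraps countPlain and once with one that wraps countRegex, on the same text, gives a rough comparison. The regex is compiled outside the timed loop on purpose, since a lookup that recompiles its pattern for every cell would be measuring something else.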

<Joe



