Re: [Libreoffice] [Crazy Ideas] Discuss

Joe Smith <jes -AT- martnet.com>
Mon, 29 Nov 2010 19:34:31 -0500

On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:


...
However, looking at textsearch.cxx in Open Grok --
http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
--  can see this comment before the various types of calls to a search
routine:
// use transliteration here, but only if not RegEx, which does it different

One can also see other exclusion of the regexp search algorithm from the
transliteration search prep and search result code in textsearch.cxx around
the calls to the search routines, but I'm not absolutely sure that exclusion
is complete.  If the regexp search truly *never* uses transliteration then
the swap out will be simpler and the change-over may actually enable
transliteration.  I haven't looked at the internal code of the regexp -
perhaps it 'does it's own thing' internally for transliteration...

Right. I have only a vague idea what "transliteration" means here. From a web search I can see that it must be an attempt to deal with things like accented characters (Is "a" the same as "ä", or not? Is "ss" the same as "ß"?), but I couldn't find any clear description of exactly what the transliteration was doing.

There is a letter-case filter applied to the text before a regex search, changing all characters to one single case, lower case for English text. If the user indicates that case is significant, the filter is not applied.
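To make the filter step concrete, here is a minimal sketch of the idea. It is not the code in textsearch.cxx: the names foldCase and matchesAfterFilter are made up for illustration, std::wregex stands in for the real search engine, and std::towlower is assumed to be an adequate stand-in for whatever case mapping OOo actually applies.

```cpp
#include <cwctype>
#include <regex>
#include <string>

// Hypothetical helper: fold the text to lower case unless the user has
// asked for a case-sensitive search, in which case it is left untouched.
std::wstring foldCase(const std::wstring& text, bool caseSignificant)
{
    if (caseSignificant)
        return text;
    std::wstring folded(text);
    for (wchar_t& c : folded)
        c = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(c)));
    return folded;
}

// Hypothetical helper: run the regex over the filtered text.
// The caller is assumed to supply a pattern that is already lower case
// when caseSignificant is false.
bool matchesAfterFilter(const std::wstring& text,
                        const std::wregex& pattern,
                        bool caseSignificant)
{
    return std::regex_search(foldCase(text, caseSignificant), pattern);
}
```

Because this filter maps one character to one character, match positions in the filtered text are also valid positions in the original text.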

The actual searches get a text buffer and a pair of indices (first, last) indicating the region to search. The results are returned as a list of matches, also with indices into the text buffer. The code does a lot of adjusting of the indices, I suppose to account for character-level changes due to the transliteration, but again, I can't really tell what the adjustment code is supposed to do.
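A rough sketch of why index adjustment would be needed if the guess above is right. Everything here is an assumption for illustration: the Transliterated struct, expandSharpS, and searchRegion are invented names, the single rule "ß becomes ss" is a toy stand-in for the real transliteration, and std::wregex again stands in for the search engine. The region is treated as half-open, [first, last).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result of a transliteration pass: the rewritten text plus,
// for each character of that text, the index of the original character
// it was produced from.
struct Transliterated
{
    std::wstring text;
    std::vector<std::size_t> origin;
};

// Toy rule, assumed purely for illustration: expand U+00DF to "ss".
Transliterated expandSharpS(const std::wstring& original)
{
    Transliterated t;
    for (std::size_t i = 0; i < original.size(); ++i)
    {
        if (original[i] == L'\u00DF')
        {
            t.text += L"ss";
            t.origin.push_back(i); // both output characters map back
            t.origin.push_back(i); // to the same original index
        }
        else
        {
            t.text += original[i];
            t.origin.push_back(i);
        }
    }
    return t;
}

// Search the region [first, last) of the buffer and return each match as a
// half-open (start, end) pair of indices into the ORIGINAL buffer.
std::vector<std::pair<std::size_t, std::size_t>>
searchRegion(const std::wstring& buffer, std::size_t first, std::size_t last,
             const std::wregex& pattern)
{
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    if (first > last || last > buffer.size())
        return matches; // invalid region: report no matches

    const Transliterated t = expandSharpS(buffer.substr(first, last - first));
    const std::wsregex_iterator end;
    for (std::wsregex_iterator it(t.text.begin(), t.text.end(), pattern);
         it != end; ++it)
    {
        if (it->length() == 0)
            continue; // skip empty matches in this sketch
        const std::size_t s = static_cast<std::size_t>(it->position());
        const std::size_t e = s + static_cast<std::size_t>(it->length());
        // Map positions in the transliterated text back to the buffer.
        matches.emplace_back(first + t.origin[s],
                             first + t.origin[e - 1] + 1);
    }
    return matches;
}
```

With the buffer "Die Straße ist lang" and the pattern "Strasse", the match covers seven characters of the transliterated text but only six of the original, so the returned end index has to be mapped back rather than copied.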

I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code and poke around. I ended up trying to do it by adding a printf, rebuild, run, rinse, repeat. No fun; less progress.

My thought was maybe to just avoid all that and start out with an extension testbed that uses the Boost regexp. I'm sure I can get access to paragraphs of text without any transliteration or filtering, and see how well the Boost functions work. If that goes well, then move on to replacing code.

I think Boost looks like the way to go, since it has a lot of functionality, supports Unicode (16- or 32-bit chars), and OOo already uses it.

Performance could be a problem. I saw a comment in the code somewhere saying that performance is critical for some spreadsheets--I assume because Calc's lookups default to using regular expression matching.

As far as I can see, that's a faulty design: the lookups should not use regexp matching unless it is specifically requested, but it may be too late to change that now.

I've seen benchmarks indicating that the Boost regexp is fairly fast compared to other regexp engines, but I'm guessing that it's still slower than the current primitive engine.


<Joe

