[EasyHack] #44681 port to CLucene from java/Lucene

Gert van Valkenhoef <g.h.m.van.valkenhoef -AT- rug.nl>
Fri, 10 Feb 2012 23:11:31 +0100

Dear LibreOffice developers,

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=44681

Attached are initial implementations of the HelpIndexer and HelpSearch in C++ using CLucene, to replace the Java implementations using Lucene.

The code that interfaces with Lucene to do the indexing and searching is complete. I have a test set up where I create an index with both the HelpIndexerTool.jar and the C++ indexer, and search it using the C++ searcher. These give identical results. Thus, luckily, the index format is compatible between CLucene and Java Lucene.

I've also looked into where the HelpIndexerTool is currently used, and found these:


 - xmlhelp/source/cxxhelp/provider/databases.cxx:

    * In extension mode (enabled by HelpIndexer), through XInvocation

    * Does not ZIP the result

 - helpcontent2/util/target.pmk

    * Called as a command-line tool

    * ZIPs the result, but already has an alternative code path to do it (the final .ELSE)

Based on this, it looks like the Java HelpIndexerTool is a lot more complex than it needs to be, and does a few things that are better handled by other tools. In particular, the "extension mode" seems to be a relic of the convoluted code path (through XInvocation etc.) and doesn't do much more than suppress error messages. In addition, couldn't the ZIP creation just always be replaced by this alternative code path? It's quite possible that I missed a few things here.

If "extension mode" and ZIP archiving are not needed, the implementation is complete, and the remaining work would be integrating with the build process. Here are a couple of caveats and/or questions related to that:

* This implementation is using the master branch of CLucene's git, with clucene-contribs-lib enabled (for CJK support). The released version of CLucene is compatible with Lucene 1.9.x, whereas LibreOffice uses Lucene 2.3.

* Can someone help to figure out how to make CLucene part of the LO build process? CLucene is using CMake and there seems to be no way to 'make install' the clucene-contribs-lib, so this might be tricky.

* I'm not sure exactly how to make my code build as part of the LO build, but could probably figure it out as long as the previous point is addressed.

* CLucene (like Java) uses wide characters throughout, and defines its own TCHAR type for that. Can we make this play nice with how LO handles strings?

* I'm using some Unix headers; are these available on Windows, or should I use some kind of LO equivalent of them?

* I tried replacing the HelpIndexerTool in helpcontent2/util/target.pmk, which seems to work fine, except that I'm returning an error code when the content/caption directory doesn't exist (unlike HelpIndexerTool), which breaks on "shared".


I hope this is useful (and not too verbose :-P).

Best regards,

Gert van Valkenhoef

#include <CLucene/StdHeader.h>
#include <CLucene.h>
#include <CLucene/analysis/LanguageBasedAnalyzer.h>

#include <unistd.h>
#include <sys/stat.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>

#include <string>
#include <iostream>
#include <algorithm>
#include <set>

// I assume that TCHAR is defined as wchar_t throughout

using namespace lucene::document;

class HelpIndexer {
        private:
                std::string d_lang;
                std::string d_module;
                std::string d_captionDir;
                std::string d_contentDir;
                std::string d_indexDir;
                std::string d_error;
                std::set<std::string> d_files;

        public:

        /**
         * @param lang Help files language.
         * @param module The module of the helpfiles.
         * @param captionDir The directory to scan for caption files.
         * @param contentDir The directory to scan for content files.
         * @param indexDir The directory to write the index to.
         */
        HelpIndexer(std::string const &lang, std::string const &module,
                std::string const &captionDir, std::string const &contentDir,
                std::string const &indexDir);

        /**
         * Run the indexer.
         * @return true if index successfully generated.
         */
        bool indexDocuments();

        /**
         * Get the error string (empty if no error occurred).
         */
        std::string const & getErrorMessage();

        private:

        /**
         * Scan the caption & contents directories for help files.
         */
        bool scanForFiles();

        /**
         * Scan for files in the given directory.
         */
        bool scanForFiles(std::string const &path);

        /**
         * Fill the Document with information on the given help file.
         */
        bool helpDocument(std::string const & fileName, Document *doc);

        /**
         * Create a reader for the given file, and create an "empty" reader in case the file
         * doesn't exist.
         */
        lucene::util::Reader *helpFileReader(std::string const & path);

        std::wstring string2wstring(std::string const &source);
};

HelpIndexer::HelpIndexer(std::string const &lang, std::string const &module,
        std::string const &captionDir, std::string const &contentDir, std::string const &indexDir) :
d_lang(lang), d_module(module), d_captionDir(captionDir), d_contentDir(contentDir), 
d_indexDir(indexDir), d_error(""), d_files() {}

bool HelpIndexer::indexDocuments() {
        if (!scanForFiles()) {
                return false;
        }

        // Construct the analyzer appropriate for the given language
        lucene::analysis::Analyzer *analyzer = (
                d_lang.compare("ja") == 0 ?
                (lucene::analysis::Analyzer*)new lucene::analysis::LanguageBasedAnalyzer(L"cjk") :
                (lucene::analysis::Analyzer*)new lucene::analysis::standard::StandardAnalyzer());

        lucene::index::IndexWriter writer(d_indexDir.c_str(), analyzer, true);

        // Index the identified help files
        Document doc;
        for (std::set<std::string>::iterator i = d_files.begin(); i != d_files.end(); ++i) {
                doc.clear();
                if (!helpDocument(*i, &doc)) {
                        delete analyzer;
                        return false;
                }
                writer.addDocument(&doc);
        }

        // Optimize the index
        writer.optimize();

        delete analyzer;
        return true;
}

std::string const & HelpIndexer::getErrorMessage() {
        return d_error;
}

bool HelpIndexer::scanForFiles() {
        if (!scanForFiles(d_contentDir)) {
                return false;
        }
        if (!scanForFiles(d_captionDir)) {
                return false;
        }
        return true;
}

bool HelpIndexer::scanForFiles(std::string const & path) {
        DIR *dir = opendir(path.c_str());
        if (dir == 0) {
                d_error = "Error reading directory " + path + ": " + strerror(errno);
                return false;
        }

        struct dirent *ent;
        struct stat info;
        while ((ent = readdir(dir)) != 0) {
                if (stat((path + "/" + ent->d_name).c_str(), &info) == 0 && S_ISREG(info.st_mode)) {
                        d_files.insert(ent->d_name);
                }
        }

        closedir(dir);

        return true;
}

bool HelpIndexer::helpDocument(std::string const & fileName, Document *doc) {
        // Add the help path as an indexed, untokenized field.
        std::wstring path(L"#HLP#" + string2wstring(d_module) + L"/" + string2wstring(fileName));
        doc->add(*new Field(_T("path"), path.c_str(), Field::STORE_YES | Field::INDEX_UNTOKENIZED));

        // Add the caption as a field.
        std::string captionPath = d_captionDir + "/" + fileName;
        doc->add(*new Field(_T("caption"), helpFileReader(captionPath),
                Field::STORE_NO | Field::INDEX_TOKENIZED));
        // FIXME: does the Document take responsibility for the FileReader or should I free it
        // somewhere?

        // Add the content as a field.
        std::string contentPath = d_contentDir + "/" + fileName;
        doc->add(*new Field(_T("content"), helpFileReader(contentPath),
                Field::STORE_NO | Field::INDEX_TOKENIZED));
        // FIXME: does the Document take responsibility for the FileReader or should I free it
        // somewhere?

        return true;
}

lucene::util::Reader *HelpIndexer::helpFileReader(std::string const & path) {
        if (access(path.c_str(), R_OK) == 0) {
                return new lucene::util::FileReader(path.c_str(), "UTF-8");
        } else {
                return new lucene::util::StringReader(L"");
        }
}

std::wstring HelpIndexer::string2wstring(std::string const &source) {
        std::wstring target(source.length(), L' ');
        std::copy(source.begin(), source.end(), target.begin());
        return target;
}

int main(int argc, char **argv) {
        const std::string pLang("-lang");
        const std::string pModule("-mod");
        const std::string pOutDir("-zipdir");
        const std::string pSrcDir("-srcdir");

        std::string lang;
        std::string module;
        std::string srcDir;
        std::string outDir;

        bool error = false;
        for (int i = 1; i < argc; ++i) {
                if (pLang.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                lang = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pModule.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                module = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pOutDir.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                outDir = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pSrcDir.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                srcDir = argv[++i];
                        } else {
                                error = true;
                        }
                } else {
                        error = true;
                }
        }

        if (error) {
                std::cerr << "Error parsing command-line arguments" << std::endl;
        }

        if (error || lang.empty() || module.empty() || srcDir.empty() || outDir.empty()) {
                std::cerr << "Usage: HelpIndexer -lang ISOLangCode -mod HelpModule -srcdir "
                        "SourceDir -zipdir OutputDir" << std::endl;
                return 1;
        }

        std::string captionDir(srcDir + "/caption");
        std::string contentDir(srcDir + "/content");
        std::string indexDir(outDir + "/" + module + ".idxl");
        HelpIndexer indexer(lang, module, captionDir, contentDir, indexDir);
        if (!indexer.indexDocuments()) {
                std::cerr << indexer.getErrorMessage() << std::endl;
                return 2;
        }
        return 0;
}

#include <CLucene/StdHeader.h>
#include <CLucene.h>

#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

class HelpSearch {
        private:

        std::string d_lang;
        std::string d_indexDir;
        

        public:

        HelpSearch(std::string const &lang, std::string const &indexDir);

        /**
         * Query the index for a certain query string.
         * @param queryStr The query.
         * @param captionOnly Set to true to search in the caption, not the content.
         * @param rDocuments Vector to write the paths of the found documents.
         * @param rScores Vector to write the scores to.
         */
        bool query(std::wstring const &queryStr, bool captionOnly,
                std::vector<std::wstring> &rDocuments, std::vector<float> &rScores);
};

std::wstring string2wstring(std::string const &source) {
        std::wstring target(source.length(), L' ');
        std::copy(source.begin(), source.end(), target.begin());
        return target;
}

HelpSearch::HelpSearch(std::string const &lang, std::string const &indexDir) :
d_lang(lang), d_indexDir(indexDir) {}

bool HelpSearch::query(std::wstring const &queryStr, bool captionOnly,
                std::vector<std::wstring> &rDocuments, std::vector<float> &rScores) {
        lucene::index::IndexReader *reader = lucene::index::IndexReader::open(d_indexDir.c_str());
        lucene::search::IndexSearcher searcher(reader);

        std::wstring field = captionOnly ? L"caption" : L"content";

        // Guard against an empty query before looking at its last character.
        bool isWildcard = !queryStr.empty() && queryStr[queryStr.length() - 1] == L'*';
        lucene::search::Query *query = (isWildcard ?
                (lucene::search::Query*)new lucene::search::WildcardQuery(
                        new lucene::index::Term(field.c_str(), queryStr.c_str())) :
                (lucene::search::Query*)new lucene::search::TermQuery(
                        new lucene::index::Term(field.c_str(), queryStr.c_str())));
        // FIXME: who is responsible for the Query* and Term*?

        // FIXME: who is responsible for the Hits*?
        lucene::search::Hits *hits = searcher.search(query);
        for (int i = 0; i < hits->length(); ++i) {
                lucene::document::Document &doc = hits->doc(i); // Document* belongs to Hits.
                wchar_t const *path = doc.get(L"path");
                rDocuments.push_back(std::wstring(path != 0 ? path : L""));
                rScores.push_back(hits->score(i));
        }

        reader->close();
        return true;
}

int main(int argc, char **argv) {
        const std::string pLang("-lang");
        const std::string pIndex("-index");
        const std::string pQuery("-query");
        const std::string pCaptionOnly("-caption");

        std::string lang;
        std::string index;
        std::string query;
        bool captionOnly = false;

        bool error = false;
        for (int i = 1; i < argc; ++i) {
                if (pLang.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                lang = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pIndex.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                index = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pQuery.compare(argv[i]) == 0) {
                        if (i + 1 < argc) {
                                query = argv[++i];
                        } else {
                                error = true;
                        }
                } else if (pCaptionOnly.compare(argv[i]) == 0) {
                        captionOnly = true;
                } else {
                        error = true;
                }
        }

        if (error) {
                std::cerr << "Error parsing command-line arguments" << std::endl;
        }

        if (error || lang.empty() || index.empty() || query.empty()) {
                std::cerr << "Usage: HelpSearch -lang ISOLangCode -index IndexDir [-caption] -query "
                        "QueryString" << std::endl;
                return 1;
        }

        HelpSearch search(lang, index);
        std::vector<std::wstring> paths;
        std::vector<float> scores;
        if (!search.query(string2wstring(query), captionOnly, paths, scores)) {
                std::cerr << "Error in search." << std::endl;
                return 2;
        }

        for (std::size_t i = 0; i < paths.size(); ++i) {
                std::wstring &path = paths[i];
                float &score = scores[i];
                std::wcout << score << " " << path << std::endl;
        }

        return 0;
}
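
On the string handling question: string2wstring as written copies each char into a wchar_t, which only gives the right result for ASCII input. Below is a minimal sketch of a UTF-8 aware replacement, assuming the input really is UTF-8 and that wchar_t is wide enough to hold any code point (32-bit wchar_t, as on Linux; on Windows, with 16-bit wchar_t, surrogate pairs would be needed instead). The name utf8ToWide is hypothetical. Malformed sequences are replaced by U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical replacement for string2wstring (not part of the attached
// code): decode UTF-8 into one wchar_t per code point.
// Assumes a 32-bit wchar_t. Malformed input becomes U+FFFD; overlong
// encodings and encoded surrogates are not checked for.
std::wstring utf8ToWide(std::string const &source) {
    const wchar_t replacement = static_cast<wchar_t>(0xFFFD);
    std::wstring target;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char lead = static_cast<unsigned char>(source[i++]);
        unsigned long cp = 0;
        int extra = 0; // Number of continuation bytes expected.
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            target.push_back(replacement);
            continue;
        }
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= source.size() ||
                (static_cast<unsigned char>(source[i]) & 0xC0) != 0x80) {
                ok = false; // Truncated sequence; do not consume this byte.
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(source[i++]) & 0x3F);
        }
        target.push_back(ok ? static_cast<wchar_t>(cp) : replacement);
    }
    return target;
}
```

This would matter mostly for the searcher, where the query string comes from the command line and may contain non-ASCII text; the module and file names passed through the indexer are presumably ASCII anyway.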

