On Tue, 2012-02-14 at 09:45 +0100, G.H.M.Valkenhoef, van wrote:
Yes, I found that java code (the HelpIndexer I refer to). I'll work on
a patch to replace the XInvocations of the Java code with calls to my
code.
I can try and knock together a skeleton of a conversion of that Java
component to a C++ component for you to integrate the clucene stuff
into.
Presumably just editing l10ntools/source/help/makefile.mk and adding
another target or so in there will do the trick. I can hook this up
and see how it goes.
Great, send me a patch if you get it going, then I can work on some of
the other stuff.
Attached is your code back again but added into the l10ntools
makefile.mk and other build-foo. And a patch to helpcontent2 (different
repository) to use it to build the helpcontent and just use normal zip
to zip them up.
I hacked the helplinker to let the "shared" ones through without error
and cut out the cjk.analyzer for now as I don't happen to have that
compiled up here.
I've got an update on this: I managed to create all the indexes and
doing a few searches on both the Java-generated and the C++-generated
indexes seems to give identical results (at least if I pipe the
results through sort).
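(For reference, a standalone dump along the lines below is enough to produce
sortable output from either index. It is only an illustrative sketch, not the
exact comparison code: it assumes the stock CLucene 2.x
IndexSearcher/QueryParser API, the field names simply mirror the indexer
above, and the index path and query term are placeholders.)

// compareindex.cxx - illustrative sketch: print the stored "path" field of
// every document matching a query, so the output from the Java-built and the
// C++-built index can be piped through sort and compared with diff.
#include <CLucene/StdHeader.h>
#include <CLucene.h>

#include <string.h>

#include <algorithm>
#include <iostream>
#include <string>

using namespace lucene::document;

int main(int argc, char **argv) {
    if (argc != 3) {
        std::cerr << "Usage: compareindex IndexDir QueryTerm" << std::endl;
        return 1;
    }

    // Same naive widening as string2wstring() in the indexer above.
    std::wstring term(strlen(argv[2]), L' ');
    std::copy(argv[2], argv[2] + strlen(argv[2]), term.begin());

    // Same analyzer the indexer uses for non-CJK languages.
    lucene::analysis::standard::StandardAnalyzer analyzer;

    // Open either <module>.idxl and run the query against the "content" field.
    lucene::search::IndexSearcher searcher(argv[1]);
    lucene::search::Query *query =
        lucene::queryParser::QueryParser::parse(term.c_str(), _T("content"), &analyzer);

    // One stored path per line; pipe this through sort for the comparison.
    lucene::search::Hits *hits = searcher.search(query);
    for (size_t i = 0; i < hits->length(); ++i) {
        Document &doc = hits->doc(i);
        std::wcout << doc.get(_T("path")) << std::endl;
    }

    delete hits;
    delete query;
    searcher.close();
    return 0;
}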
sounds great, as does the other news that building clucene only takes a
short time.
C.
From 83dc0151cde6b765f4235456ae1e58813d3746bd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Caol=C3=A1n=20McNamara?= <caolanm@redhat.com>
Date: Tue, 14 Feb 2012 11:53:27 +0000
Subject: [PATCH] add clucene helpindexer program
---
l10ntools/prj/build.lst | 2 +-
l10ntools/prj/d.lst | 6 +-
l10ntools/source/help/helpindexer.cxx | 247 +++++++++++++++++++++++++++++++++
l10ntools/source/help/makefile.mk | 30 ++---
4 files changed, 263 insertions(+), 22 deletions(-)
create mode 100644 l10ntools/source/help/helpindexer.cxx
diff --git a/l10ntools/prj/build.lst b/l10ntools/prj/build.lst
index 3cce7a3..c714256 100644
--- a/l10ntools/prj/build.lst
+++ b/l10ntools/prj/build.lst
@@ -1,4 +1,4 @@
-tr l10ntools : tools LIBXSLT:libxslt BERKELEYDB:berkeleydb LUCENE:lucene NULL
+tr l10ntools : tools LIBXSLT:libxslt BERKELEYDB:berkeleydb NULL
tr l10ntools usr1 - all tr_mkout NULL
tr l10ntools\inc nmake - all tr_inc NULL
tr l10ntools\source nmake - all tr_src tr_inc NULL
diff --git a/l10ntools/prj/d.lst b/l10ntools/prj/d.lst
index eded848..174bb6c 100644
--- a/l10ntools/prj/d.lst
+++ b/l10ntools/prj/d.lst
@@ -26,12 +26,14 @@ mkdir: %_DEST%\bin\help\com\sun\star\help
..\%__SRC%\bin\txtconv %_DEST%\bin\txtconv
..\%__SRC%\bin\ulfconv %_DEST%\bin\ulfconv
..\%__SRC%\class\FCFGMerge.jar %_DEST%\bin\FCFGMerge.jar
-..\%__SRC%\class\HelpIndexerTool.jar %_DEST%\bin\HelpIndexerTool.jar
-..\%__SRC%\bin\HelpLinker %_DEST%\bin\HelpLinker
..\%__SRC%\bin\HelpCompiler %_DEST%\bin\HelpCompiler
..\%__SRC%\bin\HelpCompiler.exe %_DEST%\bin\HelpCompiler.exe
+..\%__SRC%\bin\HelpLinker %_DEST%\bin\HelpLinker
..\%__SRC%\bin\HelpLinker.exe %_DEST%\bin\HelpLinker.exe
..\%__SRC%\bin\HelpLinker* %_DEST%\bin
+..\%__SRC%\bin\HelpIndexer %_DEST%\bin\HelpIndexer
+..\%__SRC%\bin\HelpIndexer.exe %_DEST%\bin\HelpIndexer.exe
+..\%__SRC%\bin\HelpIndexer* %_DEST%\bin
..\scripts\localize %_DEST%\bin\localize
..\scripts\fast_merge.pl %_DEST%\bin\fast_merge.pl
diff --git a/l10ntools/source/help/helpindexer.cxx b/l10ntools/source/help/helpindexer.cxx
new file mode 100644
index 0000000..c327119
--- /dev/null
+++ b/l10ntools/source/help/helpindexer.cxx
@@ -0,0 +1,247 @@
+#include <CLucene/StdHeader.h>
+#include <CLucene.h>
+#ifdef TODO
+#include <CLucene/analysis/LanguageBasedAnalyzer.h>
+#endif
+
+#include <unistd.h>
+#include <sys/stat.h>
+#include <dirent.h>
+#include <errno.h>
+#include <string.h>
+
+#include <string>
+#include <iostream>
+#include <algorithm>
+#include <set>
+
+// I assume that TCHAR is defined as wchar_t throughout
+
+using namespace lucene::document;
+
+class HelpIndexer {
+ private:
+ std::string d_lang;
+ std::string d_module;
+ std::string d_captionDir;
+ std::string d_contentDir;
+ std::string d_indexDir;
+ std::string d_error;
+ std::set<std::string> d_files;
+
+ public:
+
+ /**
+ * @param lang Help files language.
+ * @param module The module of the helpfiles.
+ * @param captionDir The directory to scan for caption files.
+ * @param contentDir The directory to scan for content files.
+ * @param indexDir The directory to write the index to.
+ */
+ HelpIndexer(std::string const &lang, std::string const &module,
+ std::string const &captionDir, std::string const &contentDir,
+ std::string const &indexDir);
+
+ /**
+ * Run the indexer.
+ * @return true if index successfully generated.
+ */
+ bool indexDocuments();
+
+ /**
+ * Get the error string (empty if no error occurred).
+ */
+ std::string const & getErrorMessage();
+
+ private:
+
+ /**
+ * Scan the caption & contents directories for help files.
+ */
+ bool scanForFiles();
+
+ /**
+ * Scan for files in the given directory.
+ */
+ bool scanForFiles(std::string const &path);
+
+ /**
+ * Fill the Document with information on the given help file.
+ */
+ bool helpDocument(std::string const & fileName, Document *doc);
+
+ /**
+ * Create a reader for the given file, and create an "empty" reader in case the file doesn't exist.
+ */
+ lucene::util::Reader *helpFileReader(std::string const & path);
+
+ std::wstring string2wstring(std::string const &source);
+};
+
+HelpIndexer::HelpIndexer(std::string const &lang, std::string const &module,
+ std::string const &captionDir, std::string const &contentDir, std::string const &indexDir) :
+d_lang(lang), d_module(module), d_captionDir(captionDir), d_contentDir(contentDir), d_indexDir(indexDir), d_error(""), d_files() {}
+
+bool HelpIndexer::indexDocuments() {
+ if (!scanForFiles()) {
+ return false;
+ }
+
+#ifdef TODO
+ // Construct the analyzer appropriate for the given language
+ lucene::analysis::Analyzer *analyzer = (
+ d_lang.compare("ja") == 0 ?
+ (lucene::analysis::Analyzer*)new lucene::analysis::LanguageBasedAnalyzer(L"cjk") :
+ (lucene::analysis::Analyzer*)new lucene::analysis::standard::StandardAnalyzer());
+#else
+ lucene::analysis::Analyzer *analyzer = (
+ (lucene::analysis::Analyzer*)new lucene::analysis::standard::StandardAnalyzer());
+#endif
+
+ lucene::index::IndexWriter writer(d_indexDir.c_str(), analyzer, true);
+
+ // Index the identified help files
+ Document doc;
+ for (std::set<std::string>::iterator i = d_files.begin(); i != d_files.end(); ++i) {
+ doc.clear();
+ if (!helpDocument(*i, &doc)) {
+ delete analyzer;
+ return false;
+ }
+ writer.addDocument(&doc);
+ }
+
+ // Optimize the index
+ writer.optimize();
+
+ delete analyzer;
+ return true;
+}
+
+std::string const & HelpIndexer::getErrorMessage() {
+ return d_error;
+}
+
+bool HelpIndexer::scanForFiles() {
+ if (!scanForFiles(d_contentDir)) {
+ return false;
+ }
+ if (!scanForFiles(d_captionDir)) {
+ return false;
+ }
+ return true;
+}
+
+bool HelpIndexer::scanForFiles(std::string const & path) {
+ DIR *dir = opendir(path.c_str());
+ if (dir == 0) {
+ d_error = "Error reading directory " + path + strerror(errno);
+ return true;
+ }
+
+ struct dirent *ent;
+ struct stat info;
+ while ((ent = readdir(dir)) != 0) {
+ if (stat((path + "/" + ent->d_name).c_str(), &info) == 0 && S_ISREG(info.st_mode)) {
+ d_files.insert(ent->d_name);
+ }
+ }
+
+ closedir(dir);
+
+ return true;
+}
+
+bool HelpIndexer::helpDocument(std::string const & fileName, Document *doc) {
+ // Add the help path as an indexed, untokenized field.
+ std::wstring path(L"#HLP#" + string2wstring(d_module) + L"/" + string2wstring(fileName));
+ doc->add(*new Field(_T("path"), path.c_str(), Field::STORE_YES | Field::INDEX_UNTOKENIZED));
+
+ // Add the caption as a field.
+ std::string captionPath = d_captionDir + "/" + fileName;
+ doc->add(*new Field(_T("caption"), helpFileReader(captionPath), Field::STORE_NO |
Field::INDEX_TOKENIZED));
+ // FIXME: does the Document take responsibility for the FileReader or should I free it
somewhere?
+
+ // Add the content as a field.
+ std::string contentPath = d_contentDir + "/" + fileName;
+ doc->add(*new Field(_T("content"), helpFileReader(contentPath), Field::STORE_NO |
Field::INDEX_TOKENIZED));
+ // FIXME: does the Document take responsibility for the FileReader or should I free it
somewhere?
+
+ return true;
+}
+
+lucene::util::Reader *HelpIndexer::helpFileReader(std::string const & path) {
+ if (access(path.c_str(), R_OK) == 0) {
+ return new lucene::util::FileReader(path.c_str(), "UTF-8");
+ } else {
+ return new lucene::util::StringReader(L"");
+ }
+}
+
+std::wstring HelpIndexer::string2wstring(std::string const &source) {
+ std::wstring target(source.length(), L' ');
+ std::copy(source.begin(), source.end(), target.begin());
+ return target;
+}
+
+int main(int argc, char **argv) {
+ const std::string pLang("-lang");
+ const std::string pModule("-mod");
+ const std::string pOutDir("-zipdir");
+ const std::string pSrcDir("-srcdir");
+
+ std::string lang;
+ std::string module;
+ std::string srcDir;
+ std::string outDir;
+
+ bool error = false;
+ for (int i = 1; i < argc; ++i) {
+ if (pLang.compare(argv[i]) == 0) {
+ if (i + 1 < argc) {
+ lang = argv[++i];
+ } else {
+ error = true;
+ }
+ } else if (pModule.compare(argv[i]) == 0) {
+ if (i + 1 < argc) {
+ module = argv[++i];
+ } else {
+ error = true;
+ }
+ } else if (pOutDir.compare(argv[i]) == 0) {
+ if (i + 1 < argc) {
+ outDir = argv[++i];
+ } else {
+ error = true;
+ }
+ } else if (pSrcDir.compare(argv[i]) == 0) {
+ if (i + 1 < argc) {
+ srcDir = argv[++i];
+ } else {
+ error = true;
+ }
+ } else {
+ error = true;
+ }
+ }
+
+ if (error) {
+ std::cerr << "Error parsing command-line arguments" << std::endl;
+ }
+
+ if (error || lang.empty() || module.empty() || srcDir.empty() || outDir.empty()) {
+ std::cerr << "Usage: HelpIndexer -lang ISOLangCode -mod HelpModule -srcdir
SourceDir -zipdir OutputDir" << std::endl;
+ return 1;
+ }
+
+ std::string captionDir(srcDir + "/caption");
+ std::string contentDir(srcDir + "/content");
+ std::string indexDir(outDir + "/" + module + ".idxl");
+ HelpIndexer indexer(lang, module, captionDir, contentDir, indexDir);
+ if (!indexer.indexDocuments()) {
+ std::cerr << indexer.getErrorMessage() << std::endl;
+ return 2;
+ }
+ return 0;
+}
diff --git a/l10ntools/source/help/makefile.mk b/l10ntools/source/help/makefile.mk
index bab01b8..e22c6a3 100644
--- a/l10ntools/source/help/makefile.mk
+++ b/l10ntools/source/help/makefile.mk
@@ -60,8 +60,10 @@ SLOFILES=\
EXCEPTIONSFILES=\
$(OBJ)$/HelpLinker.obj \
$(OBJ)$/HelpCompiler.obj \
+ $(OBJ)$/helpindexer.obj \
$(SLO)$/HelpLinker.obj \
$(SLO)$/HelpCompiler.obj
+
.IF "$(OS)" == "MACOSX" && "$(CPU)" == "P" && "$(COM)" == "GCC"
# There appears to be a GCC 4.0.1 optimization error causing _file:good() to
# report true right before the call to writeOut at HelpLinker.cxx:1.12 l. 954
@@ -72,6 +74,9 @@ NOOPTFILES=\
$(SLO)$/HelpLinker.obj
.ENDIF
+PKGCONFIG_MODULES=libclucene-core
+.INCLUDE : pkg_config.mk
+
APP1TARGET= $(TARGET)
APP1OBJS=\
$(OBJ)$/HelpLinker.obj \
@@ -79,6 +84,12 @@ APP1OBJS=\
APP1RPATH = NONE
APP1STDLIBS+=$(SALLIB) $(BERKELEYLIB) $(XSLTLIB) $(EXPATASCII3RDLIB)
+APP2TARGET=HelpIndexer
+APP2OBJS=\
+ $(OBJ)$/helpindexer.obj
+APP2RPATH = NONE
+APP2STDLIBS+=$(SALLIB) $(PKGCONFIG_LIBS)
+
SHL1TARGET =$(LIBBASENAME)$(DLLPOSTFIX)
SHL1LIBS= $(SLB)$/$(TARGET).lib
.IF "$(COM)" == "MSC"
@@ -93,26 +104,7 @@ SHL1USE_EXPORTS =ordinal
DEF1NAME =$(SHL1TARGET)
DEFLIB1NAME =$(TARGET)
-JAVAFILES = \
- HelpIndexerTool.java \
- HelpFileDocument.java
-
-
-JAVACLASSFILES = \
- $(CLASSDIR)$/$(PACKAGE)$/HelpIndexerTool.class \
- $(CLASSDIR)$/$(PACKAGE)$/HelpFileDocument.class
-.IF "$(SYSTEM_LUCENE)" == "YES"
-EXTRAJARFILES += $(LUCENE_CORE_JAR) $(LUCENE_ANALYZERS_JAR)
-.ELSE
-JARFILES += lucene-core-2.3.jar lucene-analyzers-2.3.jar
-.ENDIF
-JAVAFILES = $(subst,$(CLASSDIR)$/$(PACKAGE)$/, $(subst,.class,.java $(JAVACLASSFILES)))
-
-JARCLASSDIRS = $(PACKAGE)/*
-JARTARGET = HelpIndexerTool.jar
-JARCOMPRESS = TRUE
-
# --- Targets ------------------------------------------------------
.INCLUDE : target.mk
--
1.7.7.6
From b0fee7a4c8c4aa177ef47988721108c7f466f0b9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Caol=C3=A1n=20McNamara?= <caolanm@redhat.com>
Date: Tue, 14 Feb 2012 11:52:29 +0000
Subject: [PATCH] use clucene indexer
---
helpcontent2/settings.pmk | 12 ------------
helpcontent2/util/target.pmk | 21 +++------------------
2 files changed, 3 insertions(+), 30 deletions(-)
diff --git a/helpcontent2/settings.pmk b/helpcontent2/settings.pmk
index 185438e..3716281 100755
--- a/helpcontent2/settings.pmk
+++ b/helpcontent2/settings.pmk
@@ -1,17 +1,5 @@
.INCLUDE : $(LOCAL_COMMON_OUT)/inc$/aux_langs.mk
.INCLUDE : $(LOCAL_COMMON_OUT)/inc$/help_exist.mk
-my_cp:=$(CLASSPATH)$(PATH_SEPERATOR)$(SOLARBINDIR)$/jaxp.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/juh.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/parser.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/xt.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/unoil.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/ridl.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/jurt.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/xmlsearch.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/LuceneHelpWrapper.jar$(PATH_SEPERATOR)$(SOLARBINDIR)$/HelpIndexerTool.jar$
-
-.IF "$(SYSTEM_LUCENE)" == "YES"
-my_cp!:=$(my_cp)$(PATH_SEPERATOR)$(LUCENE_CORE_JAR)$(PATH_SEPERATOR)$(LUCENE_ANALYZERS_JAR)
-.ELSE
-my_cp!:=$(my_cp)$(PATH_SEPERATOR)$(SOLARBINDIR)/lucene-core-2.3.jar$(PATH_SEPERATOR)$(SOLARBINDIR)/lucene-analyzers-2.3.jar
-.ENDIF
-
-.IF "$(SYSTEM_DB)" != "YES"
-JAVA_LIBRARY_PATH= -Djava.library.path=$(SOLARSHAREDBIN)
-.ENDIF
-
aux_alllangiso_all:=$(foreach,i,$(alllangiso) $(foreach,j,$(aux_langdirs) $(eq,$i,$j $i $(NULL))))
aux_alllangiso:=$(foreach,i,$(aux_alllangiso_all) $(foreach,j,$(help_exist) $(eq,$i,$j $i $(NULL))))
diff --git a/helpcontent2/util/target.pmk b/helpcontent2/util/target.pmk
index 40f6e5d..7dd7e5b 100755
--- a/helpcontent2/util/target.pmk
+++ b/helpcontent2/util/target.pmk
@@ -30,25 +30,10 @@ LINKALLADDEDDEPS=$(foreach,i,$(aux_alllangiso) $(subst,LANGUAGE,$i $(LINKADDEDDP
ALLTAR : $(LINKALLTARGETS)
-.IF "$(SYSTEM_DB)" != "YES"
-JAVA_LIBRARY_PATH= -Djava.library.path=$(SOLARSHAREDBIN)
-.ENDIF
-
XSL_DIR*:=$(SOLARBINDIR)
$(LINKALLTARGETS) : $(foreach,i,$(LINKLINKFILES) $(COMMONMISC)$/$$(@:b:s/_/./:e:s/.//)/$i) $(subst,LANGUAGE,$$(@:b:s/_/./:e:s/.//) $(LINKADDEDDEPS)) $(COMMONMISC)$/xhp_changed.flag
$(HELPLINKER) @$(mktmp -mod $(LINKNAME) -src $(COMMONMISC) -sty $(XSL_DIR)/embed.xsl -zipdir $(MISC)$/ziptmp$(@:b) -idxcaption $(XSL_DIR)/idxcaption.xsl -idxcontent $(XSL_DIR)/idxcontent.xsl -lang {$(subst,$(LINKNAME)_, $(@:b))} $(subst,LANGUAGE,{$(subst,$(LINKNAME)_, $(@:b))} $(LINKADDEDFILES)) $(foreach,i,$(LINKLINKFILES) $(COMMONMISC)$/{$(subst,$(LINKNAME)_, $(@:b))}/$i) -o $@.$(INPATH))
-.IF "$(SOLAR_JAVA)" == "TRUE"
-.IF "$(CHECK_LUCENCE_INDEXER_OUTPUT)" == ""
- $(JAVAI) $(JAVAIFLAGS) $(JAVA_LIBRARY_PATH) -cp "$(my_cp)" com.sun.star.help.HelpIndexerTool -lang $(@:b:s/_/./:e:s/.//) -mod $(LINKNAME) -zipdir $(MISC)$/ziptmp$(@:b) -o $@.$(INPATH)
-.ELSE
- $(JAVAI) $(JAVAIFLAGS) $(JAVA_LIBRARY_PATH) -cp "$(my_cp)" com.sun.star.help.HelpIndexerTool -lang $(@:b:s/_/./:e:s/.//) -mod $(LINKNAME) -zipdir $(MISC)$/ziptmp$(@:b) -o $@.$(INPATH) -checkcfsandsegname _0 _3
-.ENDIF
- $(RENAME) $@.$(INPATH) $@
-.ELSE
- -$(RM) $(MISC)$/ziptmp$(@:b)$/content/*.*
- -$(RM) $(MISC)$/ziptmp$(@:b)$/caption/*.*
- zip -j -D $@.$(INPATH) $(MISC)$/ziptmp$(@:b)$/*
- $(RENAME) $@.$(INPATH) $@
- -$(RM) $(MISC)$/ziptmp$(@:b)$/*.*
-.ENDIF
+ $(HELPINDEXER) -lang $(@:b:s/_/./:e:s/.//) -mod $(LINKNAME) -srcdir $(MISC)$/ziptmp$(@:b) -zipdir $(MISC)$/ziptmp$(@:b)
+ cd $(MISC)$/ziptmp$(@:b) && zip -rX --filesync zipfile.zip $(LINKNAME).*
+ $(RENAME) $(MISC)$/ziptmp$(@:b)$/zipfile.zip $@
--
1.7.7.6