


On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
nice to hear from someone so "up the ranks" like you.. makes me feel
much more important :-)

        Ho hum; we try to avoid unpleasant hierarchy as much as possible.

I probably won't be able to do a conversion engine by myself...
but I can definitely mess around with code...

        Great :-)

Yes, it's definitely something I can do... I do believe that the
hardest part is getting that "large corpus of documents out
there...". At least in my experience, I've found that it's hard
to get users to send us the documents they use... either due to privacy
concerns or enterprise policies... But a tool like that makes a lot
of sense

        Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:

        filetype:docx

        And start scraping; as well as having 7 million files to draw on, we
get to take advantage of Google's popularity ranking to fetch the most
popular 100 (or whatever) first :-)

For now then I'll start doing as you suggest and look in bugzilla for
documents with conversion problems to try and compile as many examples
as I can. Then maybe use the latest beta to do the conversion and
see which problems are still there. Then maybe start a Perl script
that can scrape the OOXML files to find the most used tags... and start
from there...

        We also have tools for dumping all the documents out of bugzilla - see
the main 'core' repository:

        bin/get-bugzilla-attachments-by-mimetype

        so really the fun piece is writing the parser and the element /
attribute value database to analyse which pieces are popular, and
providing a pretty UI or command-line for hackers to grok that.
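
        A minimal sketch of the counting step, in Python rather than Perl and
using only the standard library. It assumes each OOXML file is a zip
package whose XML parts can be parsed one at a time; the names
count_tags and most_popular are hypothetical and do not refer to
anything in the core repository.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source, elements=None, attributes=None):
    """Tally element and attribute names in one OOXML package.

    `source` is a path or a binary file-like object. Pass the same
    Counter objects in again to accumulate totals across many files.
    """
    elements = Counter() if elements is None else elements
    attributes = Counter() if attributes is None else attributes
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            # Only the XML parts matter here; skip images and other media.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # assumption: a damaged part is skipped, not fatal
            for node in root.iter():
                # Names come back as "{namespace-uri}localname".
                elements[node.tag] += 1
                for attr in node.attrib:
                    attributes[attr] += 1
    return elements, attributes


def most_popular(counter, limit=20):
    """Return the `limit` most frequent names with their counts."""
    return counter.most_common(limit)
```

        Calling count_tags once per file with shared counters, then
most_popular on the result, gives the raw ranking; storing the counts in
a database and putting a UI on top is left open here.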

        It'd be just great to have that data in hand.

        Thanks !

                Michael.

-- 
michael.meeks@suse.com  <><, Pseudo Engineer, itinerant idiot

