Re: [Libreoffice] [GSoc] Progress report - Visio import filter

Fridrich Strba <fridrich.strba -AT- graduateinstitute.ch>
Thu, 26 May 2011 09:39:17 +0200

Hello, Eilidh,

In our private conversation you asked for some guidance on how to
structure the library. Here are my basic thoughts (again, they come
from having contributed to several libraries, but they are not the
word of God):

1) Since you will quite often have to parse compressed chunks of a
stream, it might be useful to write a class like the following one:

class VSDInternalStream : public WPXInputStream
{
public:
  VSDInternalStream(WPXInputStream *input, size_t dataSize,
                    bool isCompressed);
  ~VSDInternalStream();
  (...)
private:
  std::vector<unsigned char> m_buffer;
  // declared private and left unimplemented: no default construction,
  // no copying
  VSDInternalStream();
  VSDInternalStream(const VSDInternalStream&);
};

It would be constructed by reading dataSize bytes of the input into
m_buffer and, if the data is compressed, decompressing it on the fly.
That way this task, which will be quite a frequent one, lives in one
place. A further advantage is that the resulting stream would be
seekable and you could read it just like any other WPXInputStream.
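
To make the idea concrete, here is a minimal, self-contained sketch.
It is not libwpd code: InputStream is a simplified, hypothetical
stand-in for WPXInputStream (the real interface differs), BufferStream
and InternalStream are illustrative names, and the decompression
routine is passed in as a callback because the actual Visio algorithm
is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, simplified stand-in for libwpd's WPXInputStream.
class InputStream
{
public:
    virtual ~InputStream() {}
    // Returns a pointer to at most numBytes bytes; numBytesRead says how many.
    virtual const unsigned char *read(unsigned long numBytes,
                                      unsigned long &numBytesRead) = 0;
    virtual int seek(long offset, bool fromStart) = 0; // 0 on success
    virtual long tell() = 0;
    virtual bool atEOS() = 0;
};

// Assumption: the decompressor is any function from bytes to bytes.
typedef std::function<std::vector<unsigned char>(
    const std::vector<unsigned char> &)> Decompressor;

// A seekable stream served entirely from an in-memory buffer.
class BufferStream : public InputStream
{
public:
    explicit BufferStream(const std::vector<unsigned char> &data)
        : m_buffer(data), m_offset(0) {}

    const unsigned char *read(unsigned long numBytes,
                              unsigned long &numBytesRead) override
    {
        numBytesRead = 0;
        if (numBytes == 0 || atEOS())
            return nullptr;
        unsigned long available =
            static_cast<unsigned long>(m_buffer.size() - m_offset);
        numBytesRead = numBytes < available ? numBytes : available;
        const unsigned char *p = &m_buffer[m_offset];
        m_offset += numBytesRead;
        return p;
    }
    int seek(long offset, bool fromStart) override
    {
        long target = fromStart ? offset : static_cast<long>(m_offset) + offset;
        if (target < 0 || target > static_cast<long>(m_buffer.size()))
            return -1;
        m_offset = static_cast<std::size_t>(target);
        return 0;
    }
    long tell() override { return static_cast<long>(m_offset); }
    bool atEOS() override { return m_offset >= m_buffer.size(); }

private:
    std::vector<unsigned char> m_buffer;
    std::size_t m_offset;
};

// Reads dataSize bytes from the parent stream once, decompresses them if
// needed, and afterwards behaves like any other seekable stream.
class InternalStream : public BufferStream
{
public:
    InternalStream(InputStream *input, unsigned long dataSize,
                   bool isCompressed, const Decompressor &decompress)
        : BufferStream(fetch(input, dataSize, isCompressed, decompress)) {}

private:
    static std::vector<unsigned char> fetch(InputStream *input,
                                            unsigned long dataSize,
                                            bool isCompressed,
                                            const Decompressor &decompress)
    {
        assert(input != nullptr);
        unsigned long numRead = 0;
        const unsigned char *data = input->read(dataSize, numRead);
        std::vector<unsigned char> raw;
        if (data && numRead > 0)
            raw.assign(data, data + numRead);
        return isCompressed ? decompress(raw) : raw;
    }
};
```

The parent stream is touched only in the constructor, so all later
reads and seeks are cheap and cannot disturb the parent's position.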

2) Since I see in the isSupported function that you are distinguishing
two versions of the Visio document format, I would suggest that you
write a base parser class, something like:

class VSDXParser
{
public:
  VSDXParser(WPXInputStream *input);
  ~VSDXParser();
protected:
  ....
private:
  ....
};

It would contain the functions common to all the formats as well as the
common state that you will need to keep. It could have two derived
classes, for n=11 and n=6:

class VSD<n>Parser : protected VSDXParser
{
public:
  VSD<n>Parser(WPXInputStream *input);
  ~VSD<n>Parser();
  bool parse(libwpg::WPGPaintInterface *iface);
private:
  ....
};

Those would contain the functions specific to the given file-format
version, as well as version-specific state that cannot be factored out
into VSDXParser.

Now, in the VisioDocument::parse(...) function, one could detect which
file-format version we are parsing, construct the corresponding
VSD<n>Parser and call parse on it.
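
A rough sketch of that dispatch, under stated assumptions:
InputStream and PaintInterface are empty hypothetical stand-ins for
WPXInputStream and libwpg::WPGPaintInterface, the version number is
assumed to be already known as an integer, parseDocument is an
illustrative free function rather than the real VisioDocument::parse,
and the parsers only record that they were called.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct InputStream {};   // hypothetical stand-in for WPXInputStream

struct PaintInterface    // hypothetical stand-in for the paint interface
{
    std::vector<std::string> calls; // records what the parsers did
};

// Common functions and common state shared by all format versions.
class VSDXParser
{
public:
    explicit VSDXParser(InputStream *input) : m_input(input) {}
    ~VSDXParser() {}

protected:
    bool haveInput() const { return m_input != nullptr; } // shared helper
    InputStream *m_input;
};

// Version-specific parsers; protected inheritance as in the outline above.
class VSD6Parser : protected VSDXParser
{
public:
    explicit VSD6Parser(InputStream *input) : VSDXParser(input) {}
    bool parse(PaintInterface *iface)
    {
        if (!haveInput())
            return false;
        iface->calls.push_back("parsed as version 6");
        return true;
    }
};

class VSD11Parser : protected VSDXParser
{
public:
    explicit VSD11Parser(InputStream *input) : VSDXParser(input) {}
    bool parse(PaintInterface *iface)
    {
        if (!haveInput())
            return false;
        iface->calls.push_back("parsed as version 11");
        return true;
    }
};

// Picks the parser that matches the detected version and runs it.
bool parseDocument(InputStream *input, unsigned version, PaintInterface *iface)
{
    assert(iface != nullptr); // a missing interface is a programming error
    switch (version)
    {
    case 6:
    {
        VSD6Parser parser(input);
        return parser.parse(iface);
    }
    case 11:
    {
        VSD11Parser parser(input);
        return parser.parse(iface);
    }
    default:
        return false; // unsupported version
    }
}
```

Because each parser is constructed locally in its own case, the caller
never needs a pointer to the base class, which is why protected
inheritance is enough here.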

3) As to the development process, I would suggest first getting some
dry parsing in place, with functions that read the different elements
of the Visio document without really processing them. You can plant
several VSD_DEBUG_MSG((...)); statements inside the functions (include
libvisio_utils.h and, optionally, un-comment the
#define VERBOSE_DEBUG 1 during heavy development). That way you get the
maximum of information on your console without the parser actually
calling any of the interface callbacks. Then you can go on from there
and start actually processing the useful content.
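
As an illustration only, a dry-parsing pass could look like the sketch
below. The macro definition is an assumed shape inferred from the
double-parenthesis call syntax, not a copy of libvisio_utils.h, and
the record layout (one type byte, one length byte, then the payload)
is invented for the example; it is not the Visio format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

#define VERBOSE_DEBUG 1 // comment out to silence the messages

// Assumed shape of the macro: the argument is a complete, parenthesised
// printf argument list, hence the double parentheses at the call site.
#ifdef VERBOSE_DEBUG
#define VSD_DEBUG_MSG(M) std::printf M
#else
#define VSD_DEBUG_MSG(M)
#endif

// Walks over toy records of the form [type:1][length:1][payload:length]
// and only logs what it sees; no interface callback is ever called.
// Returns the number of complete records found.
std::size_t dryParse(const std::vector<unsigned char> &data)
{
    std::size_t pos = 0;
    std::size_t records = 0;
    while (pos + 2 <= data.size())
    {
        unsigned type = data[pos];
        std::size_t length = data[pos + 1];
        if (pos + 2 + length > data.size())
        {
            VSD_DEBUG_MSG(("truncated record at offset %lu\n",
                           static_cast<unsigned long>(pos)));
            break;
        }
        VSD_DEBUG_MSG(("record type 0x%02x, length %lu, offset %lu\n",
                       type, static_cast<unsigned long>(length),
                       static_cast<unsigned long>(pos)));
        pos += 2 + length;
        ++records;
    }
    assert(pos <= data.size()); // the loop never steps past the end
    return records;
}
```

Once the log output looks right for a few sample files, the logging
calls mark the places where real processing can be added.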

Myself, I would maybe write a VSDElement class that constructs itself
from a pointer to the current input stream and has some kind of
processContent function that decides whether to call a private
_readContent(...) for supported elements or _skipContent(...) for
unsupported ones. But again, this is too much implementation detail,
and I freely confess that I am biased by what we did in libwpd and
libwpg.
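
One possible reading of that idea, as a sketch under assumptions:
Cursor is a hypothetical byte cursor standing in for the input stream,
the element's type and length are handed to the constructor instead of
being read from a header, and the "supported" type code 0x01 is made
up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical byte cursor, standing in for the current input stream.
struct Cursor
{
    std::vector<unsigned char> data;
    std::size_t pos;
};

class VSDElement
{
public:
    VSDElement(Cursor *input, unsigned type, std::size_t length)
        : m_input(input), m_type(type), m_length(length), m_payload()
    {
        assert(input != nullptr && input->pos <= input->data.size());
    }

    // Reads supported elements, skips unsupported ones; either way the
    // cursor ends up just past the element. Returns true if it was read.
    bool processContent()
    {
        if (isSupported(m_type))
        {
            _readContent();
            return true;
        }
        _skipContent();
        return false;
    }

    const std::vector<unsigned char> &payload() const { return m_payload; }

private:
    // Invented rule for the example: only type 0x01 is understood.
    static bool isSupported(unsigned type) { return type == 0x01; }

    // Never consume more bytes than the cursor still has.
    std::size_t clampedLength() const
    {
        std::size_t remaining = m_input->data.size() - m_input->pos;
        return m_length < remaining ? m_length : remaining;
    }

    void _readContent()
    {
        std::size_t n = clampedLength();
        m_payload.assign(m_input->data.begin() + m_input->pos,
                         m_input->data.begin() + m_input->pos + n);
        m_input->pos += n;
    }

    void _skipContent() { m_input->pos += clampedLength(); }

    Cursor *m_input;
    unsigned m_type;
    std::size_t m_length;
    std::vector<unsigned char> m_payload;
};
```

Keeping the skip path next to the read path means an unknown element
can never leave the stream position in the middle of a record.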

4) The bottom line of a good FOSS development model is to push small
changes often. It has two big advantages:
a) it is easier to bisect changes when something breaks;
b) it gives a nice overview of progress.
If atomic changes are committed and pushed (or at least the day's work
at the end of the day), I will be able to look at them often and pat
you on the back if things are wonderful, marvelous, beyond the wildest
imagination; or ask questions, seek clarification and discuss
directions if needed. Communication is the main challenge of any GSoC
endeavour, and the git repository can help us get it right.
Sometimes GSoC students are scared that publicly pushing code of
questionable quality will hurt them when a prospective employer
googles for their work. This is largely a myth; the evidence is that,
if it were true, I would probably have spent all my life living on
social assistance :)

Happy hacking

Fridrich

On Sun, 2011-05-08 at 17:08 +0100, Tibby Lickle wrote:

Hi,

Just an update on where I am. So far I've been working on the basics
of extracting the data from the .vsd file.
To read Visio files, the steps are roughly:
1. Get the interesting part ("VisioDocument") from the OLE container.
2. Parse the header to get a pointer to the trailer stream (as well as
version, length of file, etc.)
3. Inflate compressed trailer.
4. Parse out pointers in trailer to the various - potentially
compressed - streams that hold the actual Visio document content.

I've done 1 - 3. I'm using the WPXStream and its implementation from
libwpd (WPXStreamImplementation.h here) to read/extract OLE streams.
The implementation of LZW-esque decompression of the trailer is
translated from Python to C++ (i.e. shamelessly ripped off) from
oletoy (thanks frob). 
I suspect most of what I'll be doing will be stand-alone for now -
developing and debugging will be too slow if LO integration is
included at this early stage. Once I've got a very basic parser, the
callback interface discussed in my proposal will be implemented and
integration with LO should in theory be relatively easy.

Note to my mentor -- I've got a paper due for next Saturday so my main
focus will be on that. I will, however, be spending some time on the
next stage.

Eilidh 