

Hi Maxim,

On Fri, 2014-05-02 at 12:41 +0300, Maxim Monastirsky wrote:
On Thursday 01 May 2014 09:29:48 Kohei Yoshida wrote:
So, I looked over those changes, and I do like the changes. :-)
Thanks Kohei!

He was concerned about having to "detect" zip
storage over and over again which he rightly said was not great for
performance.

It makes me think of another point. There are some detectors that do exactly 
the same detection procedure for all supported types. For example - oox, xml, 
and now the new storage one. If such a detector didn't detect anything useful 
the first time, we can be sure that it won't detect anything in later runs 
either. So it doesn't make sense to run it again and again.
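
To make the idea concrete, here is a rough sketch of a detection loop that skips a type-independent detector after its first failure. Everything in it is hypothetical: `Step`, `runDetection`, the `uniform` set and the detector names are made up for illustration and are not the actual code in typedetection.cxx.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical detection step: a candidate type and the detector that checks it.
struct Step {
    std::string type;
    std::string detector;
};

// Runs the steps in order and returns the first matching type, or an empty
// string if nothing matched. `uniform` names the detectors whose procedure is
// the same for every type, so a single failure is treated as final.
// `calls` counts how many times a detector was actually invoked.
std::string runDetection(const std::vector<Step>& steps,
                         const std::set<std::string>& uniform,
                         const std::function<bool(const Step&)>& detect,
                         int& calls)
{
    assert(static_cast<bool>(detect)); // a callable must be supplied
    std::set<std::string> failed;      // uniform detectors that already failed
    for (const Step& s : steps) {
        if (failed.count(s.detector))
            continue;                  // same procedure, same answer: skip it
        ++calls;
        if (detect(s))
            return s.type;
        if (uniform.count(s.detector))
            failed.insert(s.detector);
    }
    return {};
}
```

With three types handled by one uniform detector followed by one other type, the uniform detector is called once instead of three times.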

I agree.  I think it makes sense to leave some data such as

* this is (not) a zip storage.
* this is (not) a valid ooxml format.
* this is (not) a valid ODF format.
* this is (not) a valid BIFF storage.

etc., and I can imagine storing these pieces of information with the
MediaDescriptor instance to help the subsequent detectors skip
redundant detection routines.  Actually maybe we could just specify the
detected storage type, such as

"DetectedStorage"

  + not detected -> detector should try to detect and store the result.
  + zip
  + gzip
  + biff
  + etc

"DetectedXMLType"

  + not detected -> detector should try to detect the XML type and store
the result.
  + ODF
  + OOXML

so that we can just store all this information using just one slot of
the MediaDescriptor rather than storing multiple boolean values.
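
A minimal sketch of how such a slot might be used, under loud assumptions: `Descriptor` is a plain string map standing in for the real MediaDescriptor, `probeStorage` is a toy check on the leading bytes, and the value "none" for "probed, nothing recognized" is my own addition. None of these names exist in the code base.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for MediaDescriptor: a string-keyed property bag.
using Descriptor = std::map<std::string, std::string>;

// Slot name taken from the proposal above.
const std::string kDetectedStorage = "DetectedStorage";

// Counts how often the (supposedly expensive) probe really runs.
int g_probeCount = 0;

// Toy probe: only recognizes the "PK" signature that zip packages start with.
// "none" means the probe ran and recognized nothing (an assumed value).
std::string probeStorage(const std::string& bytes)
{
    ++g_probeCount;
    if (bytes.compare(0, 2, "PK") == 0)
        return "zip";
    return "none";
}

// Returns the cached storage type; probes only when the slot is missing,
// i.e. in the "not detected" state, and stores the result for later callers.
const std::string& detectedStorage(Descriptor& desc, const std::string& bytes)
{
    auto it = desc.find(kDetectedStorage);
    if (it == desc.end())
        it = desc.emplace(kDetectedStorage, probeStorage(bytes)).first;
    assert(!it->second.empty()); // the slot always holds a decided value here
    return it->second;
}

// Two hypothetical detectors that both depend on the storage type.
bool isOoxmlCandidate(Descriptor& desc, const std::string& bytes)
{
    return detectedStorage(desc, bytes) == "zip";
}

bool isOdfCandidate(Descriptor& desc, const std::string& bytes)
{
    return detectedStorage(desc, bytes) == "zip";
}
```

Both detectors consult the same slot, so the zip check runs once per descriptor no matter how many detectors ask. A "DetectedXMLType" slot could follow the same pattern.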

Having said that, I don't think we have to go to the extent of saying
"hey, this is definitely not XYZ format, don't bother trying to detect
it".  The idea itself may make sense, but the way the detection services
are currently set up would make it a bit challenging to implement such
additional checks.  And since the number of file formats to detect
against is quite small (~120), simply iterating over all of them should
not cause a performance issue once we put the above mechanism in place
to avoid redundant checks.
 
Maybe we can store a list of such detectors in some config file, and add a 
corresponding check to the detection loop. This also would be a bit cleaner 
solution for fdo#46310. What is the best place to store such list?

We already have a list of detectors, and they are sorted in order of
complexity for strategic reasons.
filter/source/config/cache/typedetection.cxx is the place where the list
is stored and maintained.  But as I said above, I'd like us to try the
above mechanism first and see if that will improve the situation a bit.
I'm a bit cautious about trying to either shorten or reorder this master
detector list, since I've seen such changes cause quite
hard-to-debug (and hard-to-fix) format detection bugs in the past.

Best,

Kohei

