

Hi Ben,

On Sunday, 2012-07-29 14:18:32 -0400, Ben Manashirov wrote:

> The basic idea is to take a sample of lines (e.g. 100).
>
> - For each line
> - - Count the number of times each character occurs
> - Compute the "peakiness" of each character's occurrence count over the lines.
> - Find the character with the smallest peakiness.
>
> The idea is that the delimiter will occur the same number of times on each
> line, and hence its peakiness will ideally be 0.

Nice idea. Unfortunately it fails as soon as quoted field content is
involved, as all characters within a quoted field are part of the
content, and one field may even wrap over multiple lines. Furthermore,
things are complicated by broken CSV generators that write improperly
quoted fields, in which case the boundaries of a field can only be
determined (or better call it guessed) if the field separator is known.
So while the approach will probably deliver usable results for simple
data, it will easily deliver unusable results for complicated data.

Btw, I wouldn't evaluate all characters <256, only common separator
characters.

However, in the simple data case that does not involve quoted field
content, your approach could be used to preselect the separator in the
import dialog. To detect that case I think the " double quote character
could be assumed as the quote character, and if it is not present in the
sample the result could be used.
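
To make that concrete, here is a rough Python sketch of the heuristic with
both restrictions applied. It is only an illustration, not LibreOffice code.
Everything specific in it is an assumption: the function name
guess_separator, the candidate list, taking "peakiness" to mean the
population variance of the per-line counts, requiring a candidate to occur
on every sampled line, and the tie-break on total occurrences.

```python
from statistics import pvariance

# Assumed candidate set; the mail only says "common separator characters".
CANDIDATES = [",", ";", "\t", "|", ":", " "]


def guess_separator(lines, sample_size=100, candidates=CANDIDATES):
    """Return a likely field separator, or None if no safe guess exists."""
    # Sample the first non-blank lines only.
    sample = [ln for ln in lines[:sample_size] if ln.strip()]

    # Bail out when quoted content may be involved: counting characters
    # per line is unreliable once fields can contain separators or newlines.
    if not sample or any('"' in ln for ln in sample):
        return None

    best_key, best_char = None, None
    for ch in candidates:
        counts = [ln.count(ch) for ln in sample]
        if min(counts) == 0:
            # Assumption: a separator has to occur on every sampled line.
            continue
        # "Peakiness" taken as the variance of the counts (assumption);
        # on ties, prefer the character that occurs more often overall.
        key = (pvariance(counts), -sum(counts))
        if best_key is None or key < best_key:
            best_key, best_char = key, ch
    return best_char


if __name__ == "__main__":
    print(repr(guess_separator(["a;b;c", "1,5;2;3", "x;y;z"])))
```

Returning None rather than a low-confidence guess would let the dialog keep
its current default whenever the sample looks complicated.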

> I'm just presenting this so perhaps someone will add this feature.

Ah, pity, I thought you'd like to implement it :)
That would go into sc/source/ui/dbgui/scuiasciiopt.cxx for the
mbFileImport case.

  Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD


