Re: adding autodetection of delimiter character for CSV files

Eike Rathke <erack -AT- redhat.com>

Mon, 30 Jul 2012 21:13:07 +0200

Hi Ben,

On Sunday, 2012-07-29 14:18:32 -0400, Ben Manashirov wrote:

The basic idea is to take a sample amount of lines (e.g. 100).

- For each line
- - Count the number of times each character occurs
- Compute the "peakiness" of each character's occurrence count over the lines.
- Find the character with smallest peakiness.

The idea is that the delimiter will occur the same number of times on each
line, and hence its peakiness will ideally be 0.
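The steps above can be sketched roughly as follows. This is only an illustration, not code from LibreOffice: "peakiness" is not defined in the thread, so the sketch assumes it means the population variance of a character's per-line counts, and the function name `guess_delimiter` and the tie-break (prefer the character that occurs more often overall) are hypothetical choices.

```python
from collections import Counter
from statistics import pvariance


def guess_delimiter(text, sample_lines=100):
    """Return the character whose per-line count varies least, or None.

    Assumption: "peakiness" is taken to be the population variance of
    the per-line occurrence counts. The thread does not define it.
    """
    # Take a sample of non-empty lines.
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    if not lines:
        return None

    # Count how often each character occurs on each line.
    counts = [Counter(ln) for ln in lines]
    seen = sorted(set().union(*counts))

    def score(ch):
        per_line = [c[ch] for c in counts]
        # Lowest variance wins; ties go to the more frequent character
        # (a hypothetical tie-break, not part of the original proposal).
        return (pvariance(per_line), -sum(per_line))

    return min(seen, key=score)
```

Calling `guess_delimiter("a,b,c\n1,2,3\nfoo,bar,baz\n")` picks the comma, since it is the only character with the same count on every line of that sample.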

Nice idea. Unfortunately it fails as soon as quoted field content is involved, as all characters within a quoted field are part of the content, and one field may even wrap over multiple lines. Furthermore, things are complicated by broken CSV generators that write improperly quoted fields, in which case the boundaries of a field can only be determined (or rather, guessed) if the field separator is known. So while the approach will probably deliver usable results for simple data, it will easily deliver unusable results for complicated data.

Btw, I wouldn't evaluate all characters <256, only common separator characters.

However, in the simple case that does not involve quoted field content, your approach could be used to preselect the separator in the import dialog. I think the " double quote character could be assumed as the quote character; if it is not present in the sample, the result can be used.
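A minimal sketch of that restricted variant, under stated assumptions: only a short list of common separators is evaluated, and no separator is preselected when a double quote appears anywhere in the sample. The candidate list, the function name `preselect_separator`, and the use of variance as "peakiness" are all hypothetical; nothing here is taken from scuiasciiopt.cxx.

```python
from statistics import pvariance

# Assumed set of "common separator characters"; the thread does not list them.
CANDIDATES = [",", ";", "\t", "|"]


def preselect_separator(text, sample_lines=100):
    """Suggest a separator for simple, unquoted data, else return None."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    if not lines:
        return None
    # Quoted content makes plain per-line counting unreliable, so give up.
    if any('"' in ln for ln in lines):
        return None

    best, best_score = None, None
    for ch in CANDIDATES:
        per_line = [ln.count(ch) for ln in lines]
        # A separator should occur on every sampled line.
        if min(per_line) == 0:
            continue
        score = pvariance(per_line)
        if best_score is None or score < best_score:
            best, best_score = ch, score
    return best
```

Returning None leaves the decision to the user in the import dialog, which matches the idea of only preselecting a value in the simple case.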

I'm just presenting this so perhaps someone will add this feature.

Ah, pity, I thought you'd like to implement it :) That would go into sc/source/ui/dbgui/scuiasciiopt.cxx for the mbFileImport case.

  Eike

--
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3 9E96 2F1A D073 293C 05FD

