Hi Ben,

On Sunday, 2012-07-29 14:18:32 -0400, Ben Manashirov wrote:
> The basic idea is to take a sample of lines (e.g. 100).
> - For each line
>   - Count the number of times each character occurs
> - Compute the "peakiness" of each character's occurrence count over the lines.
> - Find the character with the smallest peakiness.
> The idea is that the delimiter will occur the same number of times on each
> line, and hence its peakiness will ideally be 0.
Nice idea. Unfortunately it fails as soon as quoted field content is involved, because all characters within a quoted field are part of the content, and a single field may even wrap over multiple lines. Things are further complicated by broken CSV generators that write improperly quoted fields, in which case the boundaries of a field can only be determined (or rather, guessed) if the field separator is already known. So while the approach will probably deliver usable results for simple data, it will easily deliver unusable results for complicated data.

Btw, I wouldn't evaluate all characters < 256, only common separator characters.

However, in the simple case that does not involve quoted field content, your approach could be used to preselect the separator in the import dialog. I think the " double quote character could be assumed as the quote character for that check: if it is not present in the sample, the result can be used.
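To make the restricted variant concrete, here is a minimal Python sketch; it is not LibreOffice code and only illustrates the idea. Several things in it are assumptions: "peakiness" is not defined in this thread, so the population variance of the per-line counts stands in for it; the candidate list is a guess at "common separator characters"; a candidate is only considered if it occurs on every sampled line; and the name guess_separator is hypothetical.

```python
from statistics import pvariance

# Assumed candidate set; the thread only says "common separator characters".
CANDIDATES = [",", ";", "\t", "|", ":"]


def guess_separator(lines, candidates=CANDIDATES, quote='"'):
    """Return a plausible separator for the sample, or None if no safe guess."""
    # Ignore blank lines so they do not distort the counts.
    sample = [ln for ln in lines if ln.strip()]
    if not sample:
        return None

    # Quoted content may contain separators or span several lines,
    # so refuse to guess as soon as the quote character shows up.
    if any(quote in ln for ln in sample):
        return None

    best, best_score = None, None
    for ch in candidates:
        counts = [ln.count(ch) for ln in sample]
        # Assumption: a real separator occurs at least once on every line.
        if min(counts) == 0:
            continue
        # Assumption: variance of the counts stands in for "peakiness".
        score = pvariance(counts)
        if best_score is None or score < best_score:
            best, best_score = ch, score
    return best
```

If two candidates have the same score, the one listed first in CANDIDATES wins, so the order of that list acts as a priority. A None result means the dialog should leave the separator selection as it is.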
> I'm just presenting this so perhaps someone will add this feature.
Ah, pity, I thought you'd like to implement it :)
That would go into sc/source/ui/dbgui/scuiasciiopt.cxx for the mbFileImport case.

  Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3 9E96 2F1A D073 293C 05FD