The following sample from a git patch shows one way in which the current
counting method comes up with fewer words than other methods do:
+1747,9
1.7.0.4
That is 14 characters on two lines: either 2, 3 or 6 words depending on how
you count.
Gedit says: 2 lines, 6 words, 15 chars, 14 chars (no spaces)
LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (there is no stat
line for lines, though it has paragraph counts)
Gedit takes each number as a word, breaking the words on punctuation.
Gedit also counts the newline as whitespace.
LibOdev counts any block of contiguous characters as one word.
LibOdev's in-node word counter never sees the newline.
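
To make the two readings concrete, here is a minimal Python sketch of both
counting styles run over the sample above. It is only an approximation of
the behaviour described, not code from gedit or LibOdev, and all function
names are made up for illustration.

```python
import re

# The two-line sample from the patch, joined by a single newline.
SAMPLE = "+1747,9\n1.7.0.4"

def words_split_on_punctuation(text):
    """Gedit-like guess: every run of letters/digits is its own word."""
    return len(re.findall(r"\w+", text))

def words_split_on_whitespace(text):
    """LibOdev-like guess: any block of contiguous non-space chars is a word."""
    return len(text.split())

def chars_total(text):
    """All characters, including the newline."""
    return len(text)

def chars_per_node(text):
    """Characters counted node by node, so the newline is never seen."""
    return sum(len(line) for line in text.split("\n"))

def chars_without_spaces(text):
    """Characters that are not whitespace of any kind."""
    return sum(1 for ch in text if not ch.isspace())

print(words_split_on_punctuation(SAMPLE))  # 6
print(words_split_on_whitespace(SAMPLE))   # 2
print(chars_total(SAMPLE))                 # 15
print(chars_per_node(SAMPLE))              # 14
print(chars_without_spaces(SAMPLE))        # 14
```

Only the non-space count comes out the same whichever way the text is cut
up, which fits the numbers reported above.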
Over the diff part (from qgit) of Mattias' part 1 sw patch file, showing
gedit / LibOdev:
Words: 2418 / 2414
Chars: 24241 / 24241
Chars (excl. spaces): 16830 / 16830
Now a near match in words and a perfect match on chars excl. spaces.
Testing with a different, entire patch file, the major difference is in
words: 1338 versus 1533, or ~200 out of ~1400 words. The total char and char
excl. spaces counts agree completely, at 13459 and 10157.
Taking into account the different word handling (top) and the way the counts
match and then don't match, I suspect a second difference in the counting
method between gedit and LibOdev, plus differences in the line breaks in the
files after cut and paste.
So far gedit and LibOdev agree completely ONLY on the non-space counts.
I didn't check results on your reference odt because gedit won't open odt
and cut and paste just dumps the XML into the text...
Words: 3997 / 18
Chars: 33429 / 125
Chars (excl. spaces): 28469 / 107
where the second, smaller numbers are one page footer's counts. AFAIR
LibOdev doesn't count the footer content, and that might be the difference.
There are 20+ pages, so that's 360+ words and ~2500 chars in the footers.
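
The footer estimate is just the per-footer counts multiplied by the page
count. A small Python sketch of that arithmetic, assuming exactly 20 pages
with identical footers (the real document has "20+", so these are lower
bounds):

```python
# Counts for a single page footer, taken from the figures above.
FOOTER_WORDS = 18
FOOTER_CHARS = 125

def footer_totals(pages):
    """Return (words, chars) contributed by the footers of `pages` pages."""
    return pages * FOOTER_WORDS, pages * FOOTER_CHARS

words, chars = footer_totals(20)  # 20 is the assumed minimum page count
print(words, chars)  # 360 2500
```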
I also saw that the LibOdev count is zero when the odt is loaded. Perhaps
the count is made somewhere else and saved on the doc without this code, or
it is stored in the doc and loaded. Either way the word count is marked
clean, so it is not re-counted when the dialog box calls updateStats and the
excl. spaces count remains zero. Just clicking in the document causes a full
recount though, and that seems too busy somehow... <-- more than enough
guessing there.
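
For what it is worth, the guess above can be modelled with a plain
dirty-flag cache. The Python sketch below is purely hypothetical: the class,
its fields and its methods are invented for illustration and do not mirror
the real LibOdev code. It only shows how a value that is not in the stored
stats stays at zero while the cache is marked clean.

```python
class CachedStats:
    """Hypothetical model of document stats guarded by a dirty flag."""

    def __init__(self, text, stored_words):
        self.text = text
        self.words = stored_words   # value loaded with the document
        self.chars_no_spaces = 0    # new counter, absent from the stored stats
        self.dirty = False          # loaded stats are treated as clean

    def touch(self):
        """Stand-in for any edit or click that invalidates the stats."""
        self.dirty = True

    def update_stats(self):
        """Recount only when the cache has been marked dirty."""
        if not self.dirty:
            return
        self.words = len(self.text.split())
        self.chars_no_spaces = sum(1 for ch in self.text if not ch.isspace())
        self.dirty = False


stats = CachedStats("one two three", stored_words=3)
stats.update_stats()
print(stats.chars_no_spaces)  # 0, because the cache was clean

stats.touch()
stats.update_stats()
print(stats.chars_no_spaces)  # 11, after a forced recount
```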
All these tests are with the aScanner.GetLen() > 1 check in place. With
that Len >= 2 check, the new counting routine has no problem with
single-letter words like A, a, 1, -, or just ,
It is puzzling that Mattias removed the check to handle single-char words on
his machine, but a build out of master/LibOdev works (at least for me) with
that same check in …
I will test changing back to Mattias' simpler submission (building now).
I must note that the block immediately after this count area word-counts the
outline numbers (and counts the bullets as words!?!); it does not have any
such length check at all. I think all the len=1 strings that the scanner
might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
the Scanner's zero-length string. The Scanner's GetEnd points one slot past
the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen()
(no -1 there), and that end spot likely has a break marker.
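
The "one slot past the end" point is the usual half-open range convention.
A short Python sketch, assuming a made-up text in which a break marker
happens to follow the word (the helper name and the sample text are
illustrative only; the 0x01 value is the one quoted above):

```python
# Value quoted in the message above; its position in `text` is invented.
CH_TXTATR_BREAKWORD = "\x01"

def get_end(begin, length):
    """End index as described above: begin + length, with no -1."""
    return begin + length

text = "word" + CH_TXTATR_BREAKWORD + "next"
begin, length = 0, 4

end = get_end(begin, length)
token = text[begin:end]  # the half-open slice covers exactly the word
after = text[end]        # the slot at `end` is the first char past the word

print(repr(token))  # 'word'
print(repr(after))  # '\x01'
```

With this convention the character at the end index is not part of the
word, so a marker sitting there would only be picked up by code that reads
one slot too far.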
Again gedit and LibOdev agree completely ONLY on the non-space counts.
--
View this message in context:
http://nabble.documentfoundation.org/PATCH-Fix-for-bug-feature-request-30550-Character-count-without-spaces-tp1778667p1782965.html
Sent from the Dev mailing list archive at Nabble.com.