Date: prev next · Thread: first prev next last
2011 Archives by date, by thread · List index


Norbert,

thanks for the feedback.

2011/1/30 Norbert Thiebaud <nthiebaud@gmail.com>

On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken
<kenneth.venken@gmail.com> wrote:
Hi,

i came across gendict.cxx while fixing a possible memleak. It took me
some
time to figure out what the code did.
I notice a lot of very very long function bodies in LO-code, gendict was
no
exception. So i refactored the code, did some google searches on gendict
and
was able to fix the memleak.
I ended up submitting only the fix, not the refactoring, because i didn't
want to break any de facto coding style guideliness and i am still a
fairly
new contributer to LO.

I submit these patches now, so you guys can decide if you push them or
not.
In case it could save some other new contributor some time in
understanding
gendict ;)
BTW, i tested the code on ja.dic and output is still the same.

Even though the program is not _that_ long, I think the intent is
good, but I do have few questions, remarks:

Preamble:
- Some of the remarks below concern code-choice you made, some of them
concern
things that were already like that in the original code. I did not
distinguished the two because
a/ the remarks are on the resulting code, not on who wrote it nor when
b/ since this is a complete re factoring, minimizing changes are not a
priority; So, existing
oddities should be fixed along the way.

- This is by now means an exhaustive review.

1/ implement before use.
I'd strongly prefer if you implemented-before-use rather than
prototype + use + implementation.
in other words: main() at the end and the function above it, in order
such as they are implemented before they are used.

Ok, i'll do it like that from now on.


2/ keep private function private: in C that means qualify the print*
and init* functions you created to break-down
the big bad main function, as 'static'.

Good point.

3/ patch 0002: you factor printFunctions() but I can't see where it is
being called. ( oh, it 're-appear' in patch 0004, but still)

yeah, sorry about that, i noticed it was gone after a few commits.


4/ the indentation of the original was horrible, but if you are going
to refactor, you may want to fix that.
i.e
+        if (*u != current) {
+        if (*u < current)
+        printf("u %x, current %x, count %d, lenArray.size() %d\n", *u,
current,
+                    sal::static_int_cast<int>(count),
sal::static_int_cast<int>(lenArray.size()));
+        current = *u;
+        charArray[current] = lenArray.size();
+        }

+        if (*u != current)
+        {
+            if (*u < current)
+                printf("u %x, current %x, count %d, lenArray.size()
%d\n", *u, current,
+                         sal::static_int_cast<int>(count),
sal::static_int_cast<int>(lenArray.size()));
+            current = *u;
+            charArray[current] = lenArray.size();
+        }

Note that *I* prefer systematic {} even for one-liner if/for/while,
but that is not universally shared... so I won't insist on it.
Note2: while we are at it the
sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid.
printf(" %d %d", count, lenArray.size()) works just fine. and btw
there is no reason for count (or in general any local counter) to be a
sal_int32. int is just fine)

the if statement used here should probably be an ASSERT( *u < current )
because the input dictionary file should be a sorted list of words.
if *u < current is true for the last word in the dictionary, the line
charArray[current+1] = lenArray.size(); (after loop) could produce
unintended behaviour.


5/ patch 004:
+void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray,
+                   sal_Int32 count)
+{
+    fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t");
+    count = 1;

what's the point passing cont as a parameter if you are going to
override it's value right away ? (note: ok so, patch 0005 actually fix
that...)

these patches should be viewed as a wholel. The refactoring was a process.
But you're wright.


+    fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping
0 in index2 array.
+    for (size_t k = 0; k < lenArray.size(); k++)
+    {
+        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
int>(lenArray[k]));

lenArray, in the generated code, is declared as sal_int32[] so why
casting it's element to long unsigned int ?

+        if (count == 0xf)
+        {
+            count = 0;
+            fprintf(source_fp, "\n\t");
+        }
+            else count++;
        if( (k & 0xe) == 0xe)
        {
           fprintf(source_fp, "\n\t");
        }
achieve the same result without the need to maintain 'count'
It is still a bit odd, due to the shift introduce by the special element 0.

I 'd prefer
+    for (size_t k = 0; k < lenArray.size(); k++)
+    {
+        if (!(k &0xf)
+        {
+            fprintf(source_fp, "\n\t");
+        }
+        fprintf(source_fp, "0x%lx, ", static_cast<long unsigned
int>(lenArray[k]));
+    }

that change a bit the output (by having the first special 0 entry
standing out, but actually that is a good thing since it _is_ special

+void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set)
+{
+    fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t");
+    sal_Int32 prev = 0;
+    for (sal_Int32 i = 0; i < 0x100; i++) {
+        if (set[i] != 0xff) {
+        for (sal_Int32 j = 0; j < 0x100; j++) {
+            sal_Int32 k = (i*0x100) + j;

the compiler should be smart enough to detect that multiply by 256 is
left-shift by 8, but still why take a risk:
            k = (i << 8 | j);


+            if (prev != 0 && charArray[k] == 0) {
+            for (k++; k < 0x10000; k++)
+                if (charArray[k] != 0)
+                break;
+            }
+            prev = charArray[(i*0x100) + j];
+            fprintf(
+                source_fp, "0x%lx, ",
+                sal::static_int_cast< unsigned long >(
+                    k < 0x10000 ? charArray[k] + 1 : 0));
+            if ((j+1) % 0x10 == 0)

            if (j & 0x0f == 0x0f)
does the same thing without an addition AND a modulo operation, note
that you are in the inner loop here. this will
be called up to 65K times. granted that since we use fprintf() to push
constant string to a file, performance is clearly not
very high on the list of concern.
fputs("\n\t", source_fp) is at least an order of magnitude faster than
fprintf(source_cp, "\n\t");


+void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count)
+{
+    count = 0;
+    fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t");
+    for (sal_Int32 i = 0; i < 0x1FFF; i++) {
+        sal_uInt8 bit = 0;
+        for (sal_Int32 j = 0; j < 8; j++)
+        if (exists[i * 8 + j])
+            bit |= 1 << j;
+        fprintf(source_fp, "0x%02x, ", bit);
+        if (count == 0xf) {
+        count = 0;
+        fprintf(source_fp, "\n\t");
+        } else count++;
+    }
+    fprintf (source_fp, "\n};\n");
 }
since exist[] is really an array of bool that is init to false and the
set to true when needed, only to be then packed and dump as a
bitfield, you could do

static inline set_exist(int index)
{
   exist[index>>3] |= 1 << (index & 0x07);
}

then replace the two place that set exist
exist[u[0]] = sal_True; ->set_exist(u[0]);
and
exist[u[i]] = sal_True; -> set_exist(u[i]);

now exist[] is then defined as
static unsigned char exist[0x2000];
(note declaring it static means that you don't need to initialize it
anymore since the C standard guarantee that it will be initialized
with 0, much more efficiently than the loop that currently do that in
the code, and the code is not meant to be re-entrant anyway)

and you can simplify printExistMark in a simple byte dump.

Nice.


6/ patch 006

-    if (argc < 3) exit(-1);
+    if (argc < 3)
+    {
+        printf("2 arguments required: dictionary_file_name
source_file_name");
+        exit(-1);
+    }

The convention is to send error message to stderr.
(and the output -- here a source file -- should default to stdout)

Ok, i'll stick to that convention from now on.

7/ patch 008
/* FIXME?: what happens if in every range i there is at least one charArray
!= 0
+       => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff }
+       => then in index2, the last range will be ignored incorrectly */

luckily this is not possible since utf16 characters in the range
D800-DBFF are not supported/valid
see http://en.wikipedia.org/wiki/UTF-16/UCS-2 for a reason why.

so you will have, at least, 4 ranges without any hit.

So "It is not possible to encode these code points in UTF-16. The Unicode
standard permanently reserves these values for UTF-16 encoding only, so a
single 16-bit code unit in the range 0xD800 to 0xDFFF never represents a
character in Plane 0."
But since we are using the UTF-16 encoding, it could be possible that a
sal_Unicode value (a unicode code unit)  is in this range.
See section about Code points U+10000..U+10FFFF in the link you sent me.

Norbert

-- Kenneth



_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice




So, should i make the changes to the code or have you already done them?

-- Kenneth

Context


Privacy Policy | Impressum (Legal Info) | Copyright information: Unless otherwise specified, all text and images on this website are licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the source code of LibreOffice, which is licensed under the Mozilla Public License (MPLv2). "LibreOffice" and "The Document Foundation" are registered trademarks of their corresponding registered owners or are in actual use as trademarks in one or more countries. Their respective logos and icons are also subject to international copyright laws. Use thereof is explained in our trademark policy.