

Hi,

On Thu, Oct 29, 2015 at 3:21 PM, Michael Meeks
<michael.meeks@collabora.com> wrote:
> Hi Kohei,
>
>         I'd love some input (if you have a minute) on the attached. The
> punch-line is, that if we want to do really fast arithmetic, we start to
> need to do some odd things; while I suspect that this piece of unrolling
> can be done with the iterator - the next step I'm poking at (SSE3
> assembler ;-) is not going to like that.

You don't need SSE3 assembler for that - just use SSE(3) intrinsics.

SSE uses 128-bit registers, so you can process 2 doubles at the same time.
It is best to keep a running twosums as an __m128d and then sum its two
doubles at the end.

__m128d twosums = _mm_set_pd (0.0, 0.0);

then do a similar unrolled for loop to sum 8 values at a time; the first
4 of them look like this:
__m128d first = _mm_loadu_pd(&p[i]);
__m128d second = _mm_loadu_pd(&p[i + 2]);

twosums = _mm_add_pd(twosums, first);
twosums = _mm_add_pd(twosums, second);

At the end, just sum the two doubles in twosums and handle the remaining
corner cases (leftover elements that do not fill a whole iteration).
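
Putting those pieces together, a minimal sketch of the whole summation could look like the following. The function name sum_sse2 and its parameters are hypothetical and not taken from the attached patch; the loop is unrolled by 4 rather than 8 to keep it short, and it uses unaligned loads so it makes no assumption about where the array starts.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_add_pd

// Hypothetical helper: sums n doubles starting at p (p need not be 16-byte aligned).
double sum_sse2(const double* p, std::size_t n)
{
    // Two running partial sums, one per 64-bit lane.
    __m128d twosums = _mm_setzero_pd();

    std::size_t i = 0;
    // Unrolled main loop: two 128-bit loads cover 4 doubles per iteration.
    for (; i + 4 <= n; i += 4)
    {
        __m128d first  = _mm_loadu_pd(p + i);      // p[i],   p[i+1]
        __m128d second = _mm_loadu_pd(p + i + 2);  // p[i+2], p[i+3]
        twosums = _mm_add_pd(twosums, first);
        twosums = _mm_add_pd(twosums, second);
    }

    // Horizontal step: add the two lanes of the accumulator.
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    double total = lanes[0] + lanes[1];

    // Corner case: the 0 to 3 elements left over after the unrolled loop.
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```

Note that this adds the values in a different order than a plain sequential loop, so for general floating-point input the result can differ in the last bits.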

It would be even faster if the array is aligned to a 16-byte boundary -
then you can use the aligned load _mm_load_pd.
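
If the array itself cannot be guaranteed to start on a 16-byte boundary, one possible way to still use _mm_load_pd is to add leading elements one by one until the pointer is aligned. The sketch below is only an illustration under that assumption; sum_sse2_aligned is a hypothetical name, and whether aligned loads are measurably faster than unaligned ones depends on the CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd

// Hypothetical helper: sums n doubles, using aligned loads in the main loop.
double sum_sse2_aligned(const double* p, std::size_t n)
{
    double total = 0.0;
    std::size_t i = 0;

    // Peel leading elements until p + i sits on a 16-byte boundary.
    // For a normally (8-byte) aligned double array this peels at most one
    // element; otherwise the whole array is simply summed here.
    while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % 16 != 0)
        total += p[i++];

    __m128d twosums = _mm_setzero_pd();
    // p + i is 16-byte aligned and i advances by 4 doubles (32 bytes),
    // so both loads in each iteration stay aligned.
    for (; i + 4 <= n; i += 4)
    {
        twosums = _mm_add_pd(twosums, _mm_load_pd(p + i));
        twosums = _mm_add_pd(twosums, _mm_load_pd(p + i + 2));
    }

    // Add the two lanes of the accumulator to the scalar total.
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    total += lanes[0] + lanes[1];

    // Remaining elements that do not fill a whole unrolled iteration.
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```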

>         ATB,
>
>                 Michael.

Regards, Tomaž


