Hi,
On Thu, Oct 29, 2015 at 3:21 PM, Michael Meeks
<michael.meeks@collabora.com> wrote:
Hi Kohei,
I'd love some input (if you have a minute) on the attached. The
punch-line is, that if we want to do really fast arithmetic, we start to
need to do some odd things; while I suspect that this piece of unrolling
can be done with the iterator - the next step I'm poking at (SSE3
assembler ;-) is not going to like that.
You don't need SSE3 assembler for that - just use SSE(3) intrinsics.
SSE uses 128-bit registers, so you can process 2 doubles at the same time.
The best approach is to keep a twosums accumulator as an __m128d and sum
its two doubles at the end.
__m128d twosums = _mm_set_pd (0.0, 0.0);
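then do a similar unrolled for loop; the two loads below cover 4 doubles,
so repeat them for p[i+4] and p[i+6] to sum 8 values at a time:
__m128d first = _mm_loadu_pd(&p[i]);      /* unaligned load of p[i], p[i+1] */
__m128d second = _mm_loadu_pd(&p[i + 2]); /* unaligned load of p[i+2], p[i+3] */
twosums = _mm_add_pd(twosums, first);     /* _mm_add_pd returns the sum, so keep it */
twosums = _mm_add_pd(twosums, second);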
At the end, just sum the two doubles in twosums and handle the leftover
elements that did not fill a whole unrolled step (the corner cases)...
It would be even faster if the array is aligned to a 16-byte boundary -
then you can use _mm_load_pd.
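Putting the pieces above together, a minimal sketch could look like the following. It assumes an x86 target with SSE2 enabled (the default on x86-64) and that p points to plain doubles. The function name sum_doubles_sse2 is hypothetical and not taken from the LibreOffice sources. It uses unaligned loads and only unrolls 4 doubles per iteration to stay short.

```c
#include <emmintrin.h> /* SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_add_pd */
#include <stddef.h>

/* Hypothetical helper for illustration only; not LibreOffice code. */
static double sum_doubles_sse2(const double *p, size_t n)
{
    __m128d twosums = _mm_setzero_pd(); /* two running sums, both 0.0 */
    size_t i = 0;

    /* Unrolled main loop: two unaligned loads cover 4 doubles per step. */
    for (; i + 4 <= n; i += 4) {
        __m128d first  = _mm_loadu_pd(p + i);     /* p[i],   p[i+1] */
        __m128d second = _mm_loadu_pd(p + i + 2); /* p[i+2], p[i+3] */
        twosums = _mm_add_pd(twosums, first);
        twosums = _mm_add_pd(twosums, second);
    }

    /* Sum the two doubles held in twosums. */
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    double total = lanes[0] + lanes[1];

    /* Corner cases: the 0 to 3 elements left after the unrolled loop. */
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```

If the array is known to be 16-byte aligned, _mm_loadu_pd can be swapped for _mm_load_pd in the main loop. Note that the additions happen in a different order than in a plain scalar loop, so results on general floating-point data may differ in the last bits.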
ATB,
Michael.
Regards, Tomaž