
comp.programming

To Melzzzzz

Ramine

6/28/2014 10:52:00 PM


Hello,

I have looked at your matrix multiplication using SSE2 and AVX that you
posted on the assembler.x86 forum. I just wanted to tell you that you don't
need to write it in assembler, because GCC auto-vectorizes floating-point
calculations, and it auto-vectorizes matrix multiplication even at -O1
optimization... and GCC auto-vectorizes beautifully: its auto-vectorization
has given good results on the scimark2 benchmark too.


Thank you,
Amine Moulay Ramdane.
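
For reference, a minimal C test case one could use to check this claim (the
file name matmul.c is hypothetical). Rather than eyeballing the -S output,
GCC 4.9 and later can report vectorization decisions directly with
-fopt-info-vec; note that in the GCC versions discussed in this thread the
vectorizer is only enabled by default at -O3, not at -O1.

/* matmul.c -- hypothetical test case for GCC auto-vectorization.
 * Check with:   gcc -O3 -S -fopt-info-vec matmul.c
 * Compare with: gcc -O1 -S -fopt-info-vec matmul.c
 * (-fopt-info-vec is available in GCC 4.9 and later)              */
#define N 512

void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            double sum = 0.0;               /* naive triple loop   */
            for (k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* column access of B  */
            C[i][j] = sum;
        }
}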




14 Answers

Melzzzzz

6/29/2014 2:10:00 AM


On Sat, 28 Jun 2014 15:51:59 -0700
Ramine <ramine@1.1> wrote:

>
> Hello,
>
> I have looked at your matrix multiplication using
> SSE2 and AVX that you have posted on assembler.x86 forum,
> i just wanted to tell you that you don't need to write
> it in assembler, cause GCC do auto-vectorize the floating
> point calculation and it auto-vectorize also the Matrix
> multiplication even at -O1 optimization...

As I see it, it doesn't...


and GCC do
> auto-vectorize beautifully since the GCC auto-vectorization
> have given a good results on scimark2 benchmark too.

Care to share an example?

>
>
> Thank you,
> Amine Moulay Ramdane.
>
>
>
>



--
Click OK to continue...

Melzzzzz

6/29/2014 2:40:00 PM


On Sun, 29 Jun 2014 10:33:41 -0700
Ramine <ramine@1.1> wrote:

> On 6/28/2014 7:10 PM, Melzzzzz wrote:
> > On Sat, 28 Jun 2014 15:51:59 -0700
> > Ramine <ramine@1.1> wrote:
> >
> >>
> >> Hello,
> >>
> >> I have looked at your matrix multiplication using
> >> SSE2 and AVX that you have posted on assembler.x86 forum,
> >> i just wanted to tell you that you don't need to write
> >> it in assembler, cause GCC do auto-vectorize the floating
> >> point calculation and it auto-vectorize also the Matrix
> >> multiplication even at -O1 optimization...
> >
> > as I see it it don't....
> >
>
>
> I am using the newest version of tdm-gcc and auto-vectorization is
> working even at -O1 level optimization.
>
>
> You can download tdb-gcc from here and try it yourself:
>
> http://tdm-gcc.td...

That is based on gcc 4.8.1; I'm using 4.9 and trunk versions, and there is
no vectorization of matrix multiplication (even for 4x4 matrices).
Care to show the compiler's assembly output and the C source code,
so I can see what I am missing?

>
>
>
>
> Thank you,
> Amine Moulay Ramdane.
>
>
>
>
>
>
>
> >
> > and GCC do
> >> auto-vectorize beautifully since the GCC auto-vectorization
> >> have given a good results on scimark2 benchmark too.
> >
> > Care to share example?
> >
> >>
> >>
> >> Thank you,
> >> Amine Moulay Ramdane.
> >>
> >>
> >>
> >>
> >
> >
> >
>



--
Click OK to continue...
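
For context, the kind of hand-vectorized code being discussed (a 4x4
double-precision matrix multiply written with AVX intrinsics) might look
roughly like the sketch below. This is an illustrative reconstruction, not
the code from the assembler.x86 thread; compile with -mavx.

#include <immintrin.h>

/* Illustrative sketch only: 4x4 double matrix multiply C = A*B, row-major,
 * flat 16-element arrays, using AVX intrinsics.                          */
static void matmul4x4_avx(const double A[16], const double B[16], double C[16])
{
    __m256d b0 = _mm256_loadu_pd(&B[0]);     /* row 0 of B */
    __m256d b1 = _mm256_loadu_pd(&B[4]);     /* row 1 of B */
    __m256d b2 = _mm256_loadu_pd(&B[8]);     /* row 2 of B */
    __m256d b3 = _mm256_loadu_pd(&B[12]);    /* row 3 of B */
    int i;
    for (i = 0; i < 4; i++) {
        /* row i of C = sum over k of A[i][k] * (row k of B) */
        __m256d a0 = _mm256_broadcast_sd(&A[4*i + 0]);
        __m256d a1 = _mm256_broadcast_sd(&A[4*i + 1]);
        __m256d a2 = _mm256_broadcast_sd(&A[4*i + 2]);
        __m256d a3 = _mm256_broadcast_sd(&A[4*i + 3]);
        __m256d row = _mm256_add_pd(
            _mm256_add_pd(_mm256_mul_pd(a0, b0), _mm256_mul_pd(a1, b1)),
            _mm256_add_pd(_mm256_mul_pd(a2, b2), _mm256_mul_pd(a3, b3)));
        _mm256_storeu_pd(&C[4*i], row);      /* store row i of C */
    }
}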

Melzzzzz

6/29/2014 2:46:00 PM


On Sun, 29 Jun 2014 10:41:30 -0700
Ramine <ramine@1.1> wrote:

> On 6/28/2014 7:10 PM, Melzzzzz wrote:
> > On Sat, 28 Jun 2014 15:51:59 -0700
> > Ramine <ramine@1.1> wrote:
> >
> >>
> >> Hello,
> >>
> >> I have looked at your matrix multiplication using
> >> SSE2 and AVX that you have posted on assembler.x86 forum,
> >> i just wanted to tell you that you don't need to write
> >> it in assembler, cause GCC do auto-vectorize the floating
> >> point calculation and it auto-vectorize also the Matrix
> >> multiplication even at -O1 optimization...
> >
> > as I see it it don't....
> >
> >
> > and GCC do
> >> auto-vectorize beautifully since the GCC auto-vectorization
> >> have given a good results on scimark2 benchmark too.
> >
> > Care to share example?
>
>
> You can download the following scimark2 and compile it with
> the newest version of tdm-gcc with just -O1 level optimization
> and -S to generate the assembler code and after that look at the
> assembler code and you will see that it is auto-vectorizing the code:
>
> http://math.nist.gov...

I have had this benchmark for a long time, and it seems to auto-vectorize
only LU, and only at -O3.
bmaxa@maxa:~/examples/forth/sci$ gcc-trunk -O2 *.c -o scimark2 -lm
bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.g... **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 1739.60
FFT Mflops: 1782.75 (N=1024)
SOR Mflops: 1305.62 (100 x 100)
MonteCarlo: Mflops: 618.10
Sparse matmult Mflops: 2255.16 (N=1000, nz=5000)
LU Mflops: 2736.35 (M=100, N=100)

real 0m33.408s
user 0m33.331s
sys 0m0.052s
bmaxa@maxa:~/examples/forth/sci$ gcc-trunk -O3 *.c -o scimark2 -lm
bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.g... **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 2102.19
FFT Mflops: 1855.88 (N=1024)
SOR Mflops: 1848.64 (100 x 100)
MonteCarlo: Mflops: 621.80
Sparse matmult Mflops: 2275.65 (N=1000, nz=5000)
LU Mflops: 3908.98 (M=100, N=100)

real 0m28.933s
user 0m28.849s
sys 0m0.050s


>
>
> Thank you,
> Amine Moulay Ramdane.
>
>
> >
> >>
> >>
> >> Thank you,
> >> Amine Moulay Ramdane.
> >>
> >>
> >>
> >>
> >
> >
> >
>



--
Click OK to continue...
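
As a sanity check for what the vectorizer actually touches, a simple
unit-stride loop makes a useful positive control. The example below is
hypothetical (not part of SciMark2), and one that GCC typically does
vectorize at -O3:

/* axpy.c -- hypothetical positive control for auto-vectorization.
 * gcc -std=c99 -O3 -S axpy.c        : expect packed SSE2 (mulpd/addpd)
 * gcc -std=c99 -O3 -mavx -S axpy.c  : expect AVX (vmulpd/vaddpd)
 * gcc -std=c99 -O1 -S axpy.c        : scalar only (mulsd/addsd), since
 *                                     the vectorizer is off by default
 *                                     at -O1 in these GCC versions     */
void axpy(double *restrict y, const double *restrict x, double a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* unit stride, no dependences */
}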

Melzzzzz

6/29/2014 3:01:00 PM


On Sun, 29 Jun 2014 10:47:51 -0700
Ramine <ramine@1.1> wrote:

>
> Hello,
>
> Look for example at the assembler source code of
> SparseCompRow.c that you will find inside the
> scimark2 source code benchmark, look carefully
> at the follwing assembler code, and you will notice
> that it is auto-vectorizing, i am using just using
> -O1 level optimization and the -S flag to generate
> the assembler code, here it is:
>
>
> ===
>
>
> .file "SparseCompRow.c"
> .text
> .globl SparseCompRow_num_flops
> .def SparseCompRow_num_flops; .scl 2; .type 32; .endef
> .seh_proc SparseCompRow_num_flops
> SparseCompRow_num_flops:
> .seh_endprologue
> movl %edx, %eax
> sarl $31, %edx
> idivl %ecx
> imull %eax, %ecx
> cvtsi2sd %ecx, %xmm0
> addsd %xmm0, %xmm0
> cvtsi2sd %r8d, %xmm1
> mulsd %xmm1, %xmm0
> ret
> .seh_endproc
> .globl SparseCompRow_matmult
> .def SparseCompRow_matmult; .scl 2; .type 32; .endef
> .seh_proc SparseCompRow_matmult
> SparseCompRow_matmult:
> pushq %r13
> .seh_pushreg %r13
> pushq %r12
> .seh_pushreg %r12
> pushq %rbp
> .seh_pushreg %rbp
> pushq %rdi
> .seh_pushreg %rdi
> pushq %rsi
> .seh_pushreg %rsi
> pushq %rbx
> .seh_pushreg %rbx
> .seh_endprologue
> movl %ecx, %edi
> movq %rdx, %rbp
> movq %r8, %rdx
> movq %r9, %rsi
> movq 88(%rsp), %r9
> movq 96(%rsp), %rcx
> movl 104(%rsp), %r13d
> movl $0, %r12d
> xorpd %xmm2, %xmm2
> movapd %xmm2, %xmm3
> testl %r13d, %r13d
> jg .L15
> jmp .L2
> .L12:
> movl (%rsi,%rbx,4), %eax
> movl 4(%rsi,%rbx,4), %r8d
> cmpl %r8d, %eax
> jge .L10
> movapd %xmm2, %xmm0
> .L6:
> movslq %eax, %r10
> movslq (%r9,%r10,4), %r11
> movsd (%rcx,%r11,8), %xmm1
> mulsd (%rdx,%r10,8), %xmm1
> addsd %xmm1, %xmm0
> addl $1, %eax
> cmpl %r8d, %eax
> jne .L6
> jmp .L5
> .L10:
> movapd %xmm3, %xmm0
> .L5:
> movsd %xmm0, 0(%rbp,%rbx,8)
> addq $1, %rbx
> cmpl %ebx, %edi
> jg .L12
> .L8:
> addl $1, %r12d
> cmpl %r13d, %r12d
> jne .L15
> jmp .L2
> .L15:
> movl $0, %ebx
> testl %edi, %edi
> jg .L12
> .p2align 4,,4
> jmp .L8
> .L2:
> popq %rbx
> popq %rsi
> popq %rdi
> popq %rbp
> popq %r12
> popq %r13
> ret
> .seh_endproc
>
> ==

No, this is just scalar code. It does not use mulpd/addpd but rather
mulsd/addsd, which are the scalar (single-data) forms of the SIMD
instructions.



--
Click OK to continue...
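
For context, the loop that assembly corresponds to is the classic
compressed-row sparse multiply, roughly as sketched below (parameter names
are reconstructed from the assembly above, not copied from the SciMark2
sources). The indirect load x[col[i]] is a gather, for which plain SSE2 has
no instruction, so the compiler emits scalar mulsd/addsd here regardless of
the optimization level.

/* Approximate shape of SparseCompRow_matmult's loops, reconstructed from
 * the assembly above (names are illustrative, not verbatim source).
 * Computes y = A*x for a matrix in compressed-row storage, repeated
 * num_iterations times for benchmarking.                                */
void sparse_matmult(int M, double *y, const double *val, const int *row,
                    const int *col, const double *x, int num_iterations)
{
    int reps, r, i;
    for (reps = 0; reps < num_iterations; reps++) {
        for (r = 0; r < M; r++) {
            double sum = 0.0;
            for (i = row[r]; i < row[r + 1]; i++)
                sum += x[col[i]] * val[i];  /* x[col[i]] is an indirect
                                               (gather) access: no packed
                                               SSE2 equivalent exists    */
            y[r] = sum;
        }
    }
}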

Melzzzzz

6/29/2014 3:20:00 PM


On Sun, 29 Jun 2014 11:14:31 -0700
Ramine <ramine@1.1> wrote:

> so it is not auto-vectorizing the code... so that's amazing.
> how then gcc is generating 2X times faster code than Delphi and
> FreePascal ? i think that Delphi and FreePascal are slow compilers.

gcc's C and C++ front ends are being developed at a fast pace and improved
constantly. Many, many commits.
Heck, gcc is even better than Intel's icc now, I think.

bmaxa@maxa:~/examples/forth/sci$ gcc-trunk -Ofast -mavx *.c -o scimark2
-lm bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.g... **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 2147.56
FFT Mflops: 1810.43 (N=1024)
SOR Mflops: 1848.91 (100 x 100)
MonteCarlo: Mflops: 631.23
Sparse matmult Mflops: 2253.35 (N=1000, nz=5000)
LU Mflops: 4193.91 (M=100, N=100)

real 0m28.592s
user 0m28.516s
sys 0m0.050s
bmaxa@maxa:~/examples/forth/sci$ icc -fast -mavx *.c -o scimark2 -lm
bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.g... **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 2079.37
FFT Mflops: 1707.05 (N=1024)
SOR Mflops: 1527.84 (100 x 100)
MonteCarlo: Mflops: 1203.64
Sparse matmult Mflops: 1695.46 (N=1000, nz=5000)
LU Mflops: 4262.85 (M=100, N=100)

real 0m27.656s
user 0m27.601s
sys 0m0.034s

(Intel seems to auto-vectorize MonteCarlo, which gcc doesn't),
but overall Intel is slower.

>
>
>
> Thank you,
> Amine Moulay Ramdane.
>
>
>
>



--
Click OK to continue...

Melzzzzz

6/29/2014 3:31:00 PM


On Sun, 29 Jun 2014 11:23:30 -0700
Ramine <ramine@1.1> wrote:

>
> Hello,
>
> Sorry Melzzzzz i have forgot:
>
> Look at the assembler code it is using many:
>
> xorpd and movpd
>
> that's also gives much better performance , no ?
>

There is no xorsd, and movapd is simply the instruction used to move from
one xmm register to another. So, no, it does not give better performance.


>
> Thank you,
> Amine Moulay Rmdane.
>
>



--
Click OK to continue...
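
To illustrate that point: xorpd shows up in purely scalar code simply as the
idiom for zeroing an xmm register (there is no xorsd), and movapd is just a
register-to-register copy. A hypothetical example:

/* dot.c -- hypothetical example: xorpd appears even in scalar code.
 * At -O1, gcc typically materializes the 0.0 below with
 * "xorpd %xmmN, %xmmN" (or pxor) and compiles the loop body with
 * scalar mulsd/addsd -- no packed mulpd/addpd anywhere.            */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;            /* this zero is what becomes xorpd */
    int i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];      /* scalar mulsd/addsd at -O1       */
    return sum;
}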

Ramine

6/29/2014 5:34:00 PM


On 6/28/2014 7:10 PM, Melzzzzz wrote:
> On Sat, 28 Jun 2014 15:51:59 -0700
> Ramine <ramine@1.1> wrote:
>
>>
>> Hello,
>>
>> I have looked at your matrix multiplication using
>> SSE2 and AVX that you have posted on assembler.x86 forum,
>> i just wanted to tell you that you don't need to write
>> it in assembler, cause GCC do auto-vectorize the floating
>> point calculation and it auto-vectorize also the Matrix
>> multiplication even at -O1 optimization...
>
> as I see it it don't....
>


I am using the newest version of tdm-gcc, and auto-vectorization is working
even at the -O1 optimization level.


You can download tdm-gcc from here and try it yourself:

http://tdm-gcc.td...




Thank you,
Amine Moulay Ramdane.







>
> and GCC do
>> auto-vectorize beautifully since the GCC auto-vectorization
>> have given a good results on scimark2 benchmark too.
>
> Care to share example?
>
>>
>>
>> Thank you,
>> Amine Moulay Ramdane.
>>
>>
>>
>>
>
>
>

Ramine

6/29/2014 5:42:00 PM


On 6/28/2014 7:10 PM, Melzzzzz wrote:
> On Sat, 28 Jun 2014 15:51:59 -0700
> Ramine <ramine@1.1> wrote:
>
>>
>> Hello,
>>
>> I have looked at your matrix multiplication using
>> SSE2 and AVX that you have posted on assembler.x86 forum,
>> i just wanted to tell you that you don't need to write
>> it in assembler, cause GCC do auto-vectorize the floating
>> point calculation and it auto-vectorize also the Matrix
>> multiplication even at -O1 optimization...
>
> as I see it it don't....
>
>
> and GCC do
>> auto-vectorize beautifully since the GCC auto-vectorization
>> have given a good results on scimark2 benchmark too.
>
> Care to share example?


You can download the following scimark2 and compile it with the newest
version of tdm-gcc with just -O1 optimization and the -S flag to generate
the assembler code; then look at the assembler code and you will see that
it is auto-vectorizing:

http://math.nist.gov...


Thank you,
Amine Moulay Ramdane.


>
>>
>>
>> Thank you,
>> Amine Moulay Ramdane.
>>
>>
>>
>>
>
>
>

Ramine

6/29/2014 5:48:00 PM



Hello,

Look, for example, at the assembler output for SparseCompRow.c, which you
will find inside the scimark2 benchmark source code. Look carefully at the
following assembler code and you will notice that it is auto-vectorizing.
I am using just -O1 optimization and the -S flag to generate the assembler
code; here it is:


===


.file "SparseCompRow.c"
.text
.globl SparseCompRow_num_flops
.def SparseCompRow_num_flops; .scl 2; .type 32; .endef
.seh_proc SparseCompRow_num_flops
SparseCompRow_num_flops:
.seh_endprologue
movl %edx, %eax
sarl $31, %edx
idivl %ecx
imull %eax, %ecx
cvtsi2sd %ecx, %xmm0
addsd %xmm0, %xmm0
cvtsi2sd %r8d, %xmm1
mulsd %xmm1, %xmm0
ret
.seh_endproc
.globl SparseCompRow_matmult
.def SparseCompRow_matmult; .scl 2; .type 32; .endef
.seh_proc SparseCompRow_matmult
SparseCompRow_matmult:
pushq %r13
.seh_pushreg %r13
pushq %r12
.seh_pushreg %r12
pushq %rbp
.seh_pushreg %rbp
pushq %rdi
.seh_pushreg %rdi
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
movl %ecx, %edi
movq %rdx, %rbp
movq %r8, %rdx
movq %r9, %rsi
movq 88(%rsp), %r9
movq 96(%rsp), %rcx
movl 104(%rsp), %r13d
movl $0, %r12d
xorpd %xmm2, %xmm2
movapd %xmm2, %xmm3
testl %r13d, %r13d
jg .L15
jmp .L2
.L12:
movl (%rsi,%rbx,4), %eax
movl 4(%rsi,%rbx,4), %r8d
cmpl %r8d, %eax
jge .L10
movapd %xmm2, %xmm0
.L6:
movslq %eax, %r10
movslq (%r9,%r10,4), %r11
movsd (%rcx,%r11,8), %xmm1
mulsd (%rdx,%r10,8), %xmm1
addsd %xmm1, %xmm0
addl $1, %eax
cmpl %r8d, %eax
jne .L6
jmp .L5
.L10:
movapd %xmm3, %xmm0
.L5:
movsd %xmm0, 0(%rbp,%rbx,8)
addq $1, %rbx
cmpl %ebx, %edi
jg .L12
.L8:
addl $1, %r12d
cmpl %r13d, %r12d
jne .L15
jmp .L2
.L15:
movl $0, %ebx
testl %edi, %edi
jg .L12
.p2align 4,,4
jmp .L8
.L2:
popq %rbx
popq %rsi
popq %rdi
popq %rbp
popq %r12
popq %r13
ret
.seh_endproc

==



Thank you,
Amine Moulay Ramdane.



On 6/29/2014 7:39 AM, Melzzzzz wrote:
> On Sun, 29 Jun 2014 10:33:41 -0700
> Ramine <ramine@1.1> wrote:
>
>> On 6/28/2014 7:10 PM, Melzzzzz wrote:
>>> On Sat, 28 Jun 2014 15:51:59 -0700
>>> Ramine <ramine@1.1> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> I have looked at your matrix multiplication using
>>>> SSE2 and AVX that you have posted on assembler.x86 forum,
>>>> i just wanted to tell you that you don't need to write
>>>> it in assembler, cause GCC do auto-vectorize the floating
>>>> point calculation and it auto-vectorize also the Matrix
>>>> multiplication even at -O1 optimization...
>>>
>>> as I see it it don't....
>>>
>>
>>
>> I am using the newest version of tdm-gcc and auto-vectorization is
>> working even at -O1 level optimization.
>>
>>
>> You can download tdb-gcc from here and try it yourself:
>>
>> http://tdm-gcc.td...
>
> That is based on gcc 4.8.1 , Im using 4.9 and trunk versions
> and there is no vectorization of matrix multiplication
> (even for 4x4 matrices).
> Care to show compiler output of assembly and c source code
> so I can see what I am missing?
>
>>
>>
>>
>>
>> Thank you,
>> Amine Moulay Ramdane.
>>
>>
>>
>>
>>
>>
>>
>>>
>>> and GCC do
>>>> auto-vectorize beautifully since the GCC auto-vectorization
>>>> have given a good results on scimark2 benchmark too.
>>>
>>> Care to share example?
>>>
>>>>
>>>>
>>>> Thank you,
>>>> Amine Moulay Ramdane.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>

Ramine

6/29/2014 5:55:00 PM


Melzzzzz wrote:
> I have this benchmark long ago and it seems to autovectorize onli LU
> and that at -O3.


No, I don't think so. I am using the newest version of tdm-gcc, and it is
auto-vectorizing the FFT, MonteCarlo, Sparse MatMult, and LU too.



Thank you,
Amine Moulay Ramdane.



On 6/29/2014 7:46 AM, Melzzzzz wrote:
> On Sun, 29 Jun 2014 10:41:30 -0700
> Ramine <ramine@1.1> wrote:
>
>> On 6/28/2014 7:10 PM, Melzzzzz wrote:
>>> On Sat, 28 Jun 2014 15:51:59 -0700
>>> Ramine <ramine@1.1> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> I have looked at your matrix multiplication using
>>>> SSE2 and AVX that you have posted on assembler.x86 forum,
>>>> i just wanted to tell you that you don't need to write
>>>> it in assembler, cause GCC do auto-vectorize the floating
>>>> point calculation and it auto-vectorize also the Matrix
>>>> multiplication even at -O1 optimization...
>>>
>>> as I see it it don't....
>>>
>>>
>>> and GCC do
>>>> auto-vectorize beautifully since the GCC auto-vectorization
>>>> have given a good results on scimark2 benchmark too.
>>>
>>> Care to share example?
>>
>>
>> You can download the following scimark2 and compile it with
>> the newest version of tdm-gcc with just -O1 level optimization
>> and -S to generate the assembler code and after that look at the
>> assembler code and you will see that it is auto-vectorizing the code:
>>
>> http://math.nist.gov...
>
> I have this benchmark long ago and it seems to autovectorize onli LU
> and that at -O3.
> bmaxa@maxa:~/examples/forth/sci$ gcc-trunk -O2 *.c -o scimark2 -lm
> bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
> ** **
> ** SciMark2 Numeric Benchmark, see http://math.nist.g... **
> ** for details. (Results can be submitted to pozo@nist.gov) **
> ** **
> Using 2.00 seconds min time per kenel.
> Composite Score: 1739.60
> FFT Mflops: 1782.75 (N=1024)
> SOR Mflops: 1305.62 (100 x 100)
> MonteCarlo: Mflops: 618.10
> Sparse matmult Mflops: 2255.16 (N=1000, nz=5000)
> LU Mflops: 2736.35 (M=100, N=100)
>
> real 0m33.408s
> user 0m33.331s
> sys 0m0.052s
> bmaxa@maxa:~/examples/forth/sci$ gcc-trunk -O3 *.c -o scimark2 -lm
> bmaxa@maxa:~/examples/forth/sci$ time ./scimark2
> ** **
> ** SciMark2 Numeric Benchmark, see http://math.nist.g... **
> ** for details. (Results can be submitted to pozo@nist.gov) **
> ** **
> Using 2.00 seconds min time per kenel.
> Composite Score: 2102.19
> FFT Mflops: 1855.88 (N=1024)
> SOR Mflops: 1848.64 (100 x 100)
> MonteCarlo: Mflops: 621.80
> Sparse matmult Mflops: 2275.65 (N=1000, nz=5000)
> LU Mflops: 3908.98 (M=100, N=100)
>
> real 0m28.933s
> user 0m28.849s
> sys 0m0.050s
>
>
>>
>>
>> Thank you,
>> Amine Moulay Ramdane.
>>
>>
>>>
>>>>
>>>>
>>>> Thank you,
>>>> Amine Moulay Ramdane.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>