SBR DSP x86: implement SSE sbr_sum_square_sse
author Christophe GISQUET <christophe.gisquet@gmail.com>
Thu, 23 Feb 2012 18:48:58 +0000 (19:48 +0100)
committer Ronald S. Bultje <rsbultje@gmail.com>
Thu, 23 Feb 2012 23:50:06 +0000 (15:50 -0800)
commit 34454c761f01275d4adaf40df6d70a59011c4a6c
tree a25a23c028ddee97c1195567f855ce064bdbe916
parent 2e74a5abc2fda6cfbc86589852d6194d502332cb

The 32-bit targets were compiled with -mfpmath=sse to provide a proper reference.
sbr_sum_square timings, in cycles (c):
  C,   32-bit: 82c (unrolled) / 102c
  C,   64-bit: 69c (unrolled) / 82c
  SSE, 32-bit: 42c
  SSE, 64-bit: 31c

Using the SSE4.1 dpps instruction to perform the final sum is slower.
Without unrolling the loop to perform 8 operations per iteration, the
function takes 10 more cycles.
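
For illustration only, the approach described above can be sketched in C
with SSE intrinsics. This is not the assembly added in
libavcodec/x86/sbrdsp.asm. The function names are hypothetical, and the
sketch assumes sbr_sum_square returns the sum of the squared real and
imaginary parts of n complex floats. It uses two accumulators to process
8 floats per iteration, then a movehl/shuffle sequence (rather than dpps)
for the final horizontal sum. Without SSE it falls back to the scalar
reference.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Scalar reference (hypothetical name): sum of re^2 + im^2 over n values. */
static float sum_square_ref(float (*x)[2], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i][0] * x[i][0] + x[i][1] * x[i][1];
    return sum;
}

/* SSE sketch (hypothetical name), unrolled to 8 floats per iteration. */
static float sum_square_sse_sketch(float (*x)[2], int n)
{
    assert(n >= 0);
#ifdef __SSE__
    const float *p = (const float *)x; /* view the pairs as a flat array */
    int len = 2 * n;                   /* number of floats to process */
    int i = 0;
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();

    /* Main loop: two independent accumulators, 4 floats each. */
    for (; i + 8 <= len; i += 8) {
        __m128 a = _mm_loadu_ps(p + i);
        __m128 b = _mm_loadu_ps(p + i + 4);
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(a, a));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(b, b));
    }

    /* Final horizontal sum without dpps. */
    acc0 = _mm_add_ps(acc0, acc1);
    acc0 = _mm_add_ps(acc0, _mm_movehl_ps(acc0, acc0)); /* lanes 0,1 += 2,3 */
    acc0 = _mm_add_ss(acc0,
                      _mm_shuffle_ps(acc0, acc0, _MM_SHUFFLE(1, 1, 1, 1)));
    float sum = _mm_cvtss_f32(acc0);

    /* Scalar tail for the floats left over after the unrolled loop. */
    for (; i < len; i++)
        sum += p[i] * p[i];
    return sum;
#else
    return sum_square_ref(x, n);
#endif
}
```

The order of the additions differs from the scalar loop, so results may
differ by rounding for general inputs.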

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
libavcodec/sbrdsp.c
libavcodec/sbrdsp.h
libavcodec/x86/Makefile
libavcodec/x86/sbrdsp.asm [new file with mode: 0644]
libavcodec/x86/sbrdsp_init.c [new file with mode: 0644]