aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm
authorMartin Storsjö <martin@martin.st>
Tue, 3 Jan 2017 12:35:54 +0000 (14:35 +0200)
committerMartin Storsjö <martin@martin.st>
Tue, 24 Jan 2017 20:36:08 +0000 (22:36 +0200)
commitceb36b81781fc62814780bc3654ded53f239994b
tree07d1cb3af43aa12e1130d46d591e9538babeaf37
parent638eceed4752dbde9114718d53f199e07848f1c2
aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm

This work is sponsored by, and copyright, Google.

Compared to the arm version, on aarch64 we can keep the full 8x8
transform in registers, and for 16x16 and 32x32, we can process
it in slices of 4 pixels instead of 2.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                ARM  AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon:       111.0    109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon:       914.0    733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   5184.0   3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon:          65.0     65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon:         100.0     96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon:         111.0    119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon:         618.0    494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       295.1    284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2303.2   1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      2984.8   2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3890.0   2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1044.4   1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     13333.7   9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    18531.3  12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    24470.7  16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon:          83.0     79.7

The larger transforms are significantly faster than the corresponding
ARM versions.

The speedup vs C code is smaller than in 32 bit mode, probably
because the 64 bit intermediates in the C code can be expressed
more efficiently in aarch64.

Signed-off-by: Martin Storsjö <martin@martin.st>
libavcodec/aarch64/Makefile
libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
libavcodec/aarch64/vp9itxfm_16bpp_neon.S [new file with mode: 0644]