aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
author Martin Storsjö <martin@martin.st>
Thu, 5 Jan 2017 10:52:06 +0000 (12:52 +0200)
committer Martin Storsjö <martin@martin.st>
Tue, 24 Jan 2017 20:36:11 +0000 (22:36 +0200)
commit 9f10cff61042dbc0c27efd2dea7f1d3da83eff1b
tree 44710314dd7bb298be9a2488881be33eb197c98e
parent ceb36b81781fc62814780bc3654ded53f239994b

This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers
on aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7

This is significantly faster than the ARM version in almost
all cases; the exception is the mix2 functions, where the
gain is smaller.
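
As a rough illustration of the comparison above, the ratio of the ARM
runtime to the AArch64 runtime can be computed for a few rows of the
table. This is only a sketch: the numbers are copied from the table,
while the helper name speedup and the choice of rows are assumptions
made for this example and are not part of the patch.

```python
# Runtimes copied from the table above as (ARM, AArch64) pairs.
# The selection of rows is arbitrary and only for illustration.
RUNTIMES = {
    "vp9_loop_filter_h_16_8_10bpp_neon": (657.0, 444.5),
    "vp9_loop_filter_v_16_16_10bpp_neon": (951.0, 682.7),
    "vp9_loop_filter_mix2_h_44_16_10bpp_neon": (397.7, 358.0),
    "vp9_loop_filter_mix2_v_88_16_10bpp_neon": (386.0, 365.0),
}


def speedup(arm, aarch64):
    """Return how many times faster the AArch64 version is (hypothetical helper)."""
    return arm / aarch64


if __name__ == "__main__":
    for name, (arm, a64) in RUNTIMES.items():
        print(f"{name}: {speedup(arm, a64):.2f}x")
```

For these rows, the non-mix2 functions come out noticeably further
above 1.0x than the mix2 ones.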

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.

Signed-off-by: Martin Storsjö <martin@martin.st>
libavcodec/aarch64/Makefile
libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
libavcodec/aarch64/vp9lpf_16bpp_neon.S [new file with mode: 0644]