Commit 2ed67eba authored by Martin Storsjö

arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm


This work is sponsored by, and copyright, Google.

This is structured similarly to the 8 bit version. In the 8 bit
version, the coefficients are 16 bits, and intermediates are 32 bits.

Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
content, the intermediates also fit in 32 bits, but for all other
transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
and 12 bit) the intermediates are 64 bit.
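
As a rough illustration of what this means for the core
multiply-and-round step (the real code is hand-written NEON assembly
rather than intrinsics; the helper names below are made up, 11585 is
the cospi_16_64 constant and 14 the fixed-point shift used throughout
the VP9 transforms):

    #include <arm_neon.h>

    /* 8 bit content: 16 bit coefficients, 32 bit intermediates. */
    static int16x4_t mul_round_8bit(int16x4_t coef)
    {
        int32x4_t t = vmull_s16(coef, vdup_n_s16(11585)); /* 16x16 -> 32 bit */
        return vrshrn_n_s32(t, 14);              /* round, narrow to 16 bit */
    }

    /* 10/12 bit content: 32 bit coefficients, 64 bit intermediates. */
    static int32x2_t mul_round_hbd(int32x2_t coef)
    {
        int64x2_t t = vmull_s32(coef, vdup_n_s32(11585)); /* 32x32 -> 64 bit */
        return vrshrn_n_s64(t, 14);              /* round, narrow to 32 bit */
    }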

For the existing 8 bit case, the 8x8 transform fits all coefficients
in registers; for 10/12 bit, where the coefficients are 32 bit, the
8x8 transform also has to be done in slices of 4 pixels (just as the
16x16 and 32x32 transforms are for 8 bit).

The slice width also shrinks from 4 elements to 2 elements in parallel
for the 16x16 and 32x32 cases.
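
To illustrate the register budget behind this slicing (a sketch only;
load_8x4_slice, coeffs and stride are made-up names, and the actual
code keeps each slice in named q registers in assembly):

    #include <arm_neon.h>

    /* One 8x4 slice of 32 bit coefficients fills eight q registers
     * (8 rows x 4 lanes); the full 8x8 block (64 values) would already
     * occupy all 16 q registers, leaving no room for constants and
     * temporaries, hence processing 4 pixels at a time. */
    static void load_8x4_slice(const int32_t *coeffs, int stride, int col,
                               int32x4_t rows[8])
    {
        for (int r = 0; r < 8; r++)
            rows[r] = vld1q_s32(coeffs + r * stride + col);
    }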

The 16 bit coefficients from idct_coeffs and similar tables also need
to be lengthened to 32 bit in order to be multiplied with vectors of
32 bit elements. The fixed coefficient vectors therefore need more
space, so there are more cases where they have to be reloaded within
the transform (in iadst16).
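
A minimal intrinsics sketch of that lengthening (load_widened_coeffs
and tab are hypothetical; the real code does the equivalent widening
in assembly after loading the table):

    #include <arm_neon.h>

    /* Four 16 bit constants from a table such as idct_coeffs are loaded
     * and sign-extended so they can be multiplied with vectors of 32 bit
     * elements. Each group of four constants now fills a whole q register
     * instead of half of one, hence the extra reloads in iadst16. */
    static int32x4_t load_widened_coeffs(const int16_t *tab)
    {
        int16x4_t c16 = vld1_s16(tab); /* four 16 bit constants */
        return vmovl_s16(c16);         /* sign-extend to 32 bit lanes */
    }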

Technically, this would need testing in checkasm for subpartitions in
increments of 2, but that slows down normal checkasm runs excessively.

Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97

Compared to the C functions, the speedup across the cases above ranges
from roughly 1.4x to 13x; for the full transforms it is typically
around 3 to 7x, and the smaller subpartitions usually gain more.

Signed-off-by: Martin Storsjö <martin@martin.st>
parent a4d4bad7