
Exploiting the Cache: Faster Separable Filters

Nils Liaaen Corneliusen
9 March 2018

2023: Updated ARM Neon Assembler and intrinsics versions in the article A Fast Image Scaler for ARM Neon.
2019: Added old ARM Neon Assembler implementation from 2014.
2018: The majority of the work was done in 2013; I didn't have time to write this article until 2018.

Introduction

Image transformations using separable filters can be implemented as a vertical pass followed by a horizontal pass. The two passes usually differ in their implementation, and the horizontal one is slower. A new method is presented where the passes become nearly identical by ordering and transposing the data in a specific manner. This can be faster on architectures where the L1 cache is large enough to hold a temporary dataset of 8 lines. SSE2 and ARM NEON implementations are provided.

Two Pass

Let's quickly look at how a common image transformation, a 4-tap scaler, can be implemented in two passes: The vertical scaler reads 4 lines from the source, multiplies them with the coefficients, sums the products, and stores one row in a temp buffer. This row is passed to the horizontal scaler, which does the same horizontally and writes the row to the destination buffer.
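
To make the baseline concrete, here is a minimal scalar sketch of both passes for one destination row. The names, the 256-phase coefficient lookup and the 8-bit fixed-point format (4 taps summing to 128) are assumptions for illustration, not code from this article; edge handling is omitted:

#include <stdint.h>

static uint8_t clamp_u8( int v ) { return v < 0 ? 0 : v > 255 ? 255 : v; }

// Vertical pass: weight 4 source lines with one phase's coefficients, store one temp row.
static void vertical_row_ref( const uint8_t *r0, const uint8_t *r1,
                              const uint8_t *r2, const uint8_t *r3,
                              const int8_t co[4], uint8_t *tmp, int w )
{
    for( int x = 0; x < w; x++ ) {
        int sum = r0[x]*co[0] + r1[x]*co[1] + r2[x]*co[2] + r3[x]*co[3];
        tmp[x] = clamp_u8( (sum + 64) >> 7 );   // round, drop the 7 fraction bits
    }
}

// Horizontal pass: same 4-tap filter along the temp row, stepping a 16.16
// fixed-point source position and reloading coefficients per destination pixel.
static void horizontal_row_ref( const uint8_t *tmp, uint8_t *dst, int dstw,
                                int xx, int xadd, const int8_t (*co)[4] )
{
    for( int x = 0; x < dstw; x++ ) {
        int pos = xx >> 16;
        const int8_t *c = co[(xx >> 8) & 0xff];   // 256 phases assumed
        int sum = tmp[pos]*c[0] + tmp[pos+1]*c[1] + tmp[pos+2]*c[2] + tmp[pos+3]*c[3];
        dst[x] = clamp_u8( (sum + 64) >> 7 );
        xx += xadd;
    }
}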

One way to improve a standard two pass scaler is to process more rows at a time: Make the temp buffer larger and have, for example, 8 vertically scaled rows ready before horizontal processing to fill up the L1 cache. The horizontal pass can also be made deeper: On Intel SSE2, it would make sense to process 4 horizontal destination pixels at a time. But this does not change the fact that a lot of instructions are needed to load unaligned and overlapping data, do horizontal dot products, and stuff everything into the right place.

None Shall Pass

Let's look at how a naive implementation would work:

[Figure: the scaling pipeline, steps (1)-(6)]

A naive dual vertical pass scaler: Source picture (1) is scaled vertically (2) until 8 rows are ready (3). These rows are rotated 90 degrees left (4), then scaled in the same manner as the vertical scaler (5). The output is rotated 90 degrees to the right (6). 8 rows are now done; repeat until the entire picture is done. (Any similarity to a Rube Goldberg machine is purely coincidental.)

The problem with this solution is that rotating data is expensive, and an extra temp buffer is needed for the first rotation. There has to be a way to eliminate it.

As it turns out, the extra temp buffer can be eliminated by interleaving the writes while doing the first pass:

row 0, 0..7  | row 1, 0..7  | row 2, 0..7  | row 3, 0..7  | row 4, 0..7  | row 5, 0..7  | row 6, 0..7  | row 7, 0..7
row 0, 8..15 | row 1, 8..15 | row 2, 8..15 | row 3, 8..15 | row 4, 8..15 | row 5, 8..15 | row 6, 8..15 | row 7, 8..15
etc.

After this, do an in-place 8x8 transpose of each 64-byte block and the data is ready for the new horizontal pass. Since block b holds rows 0..7 of columns 8b..8b+7, the transpose leaves column c's eight vertical results in 8 consecutive bytes at offset c*8, which is exactly what the horizontal pass indexes below. The horizontal pass should look like the vertical one except for coefficient reloading and transposing back into place before writing.

It's important to note that it doesn't have to be a full transpose: It's enough to guarantee that each 8-byte row ends up with the bytes of the corresponding column, regardless of their order, since each such group is multiplied by a single coefficient in the horizontal pass and the back-transpose only has to undo the same permutation. This can be done very efficiently on ARM NEON since it has de-interleaving loads and destructive trn/zip/uzp instructions. One 8x8 block can be done in 6 instructions:

    vld4.8  {d0-d3},   [%[src]:64]!    @ rows 0-3, de-interleaved: d0 = {00,04,10,14,20,24,30,34} etc
    vld4.8  {d4-d7},   [%[src]:64]!    @ rows 4-7, same layout
    vtrn.8  q0, q2                     @ 2x2 byte transposes between q0/q1 and q2/q3
    vtrn.8  q1, q3
    vst1.64 {d0-d3},   [%[dst]:64]!    @ d0-d7 now hold one (permuted) column each
    vst1.64 {d4-d7},   [%[dst]:64]!

Input                       Output
00 01 02 03 04 05 06 07     00 40 10 50 20 60 30 70
10 11 12 13 14 15 16 17     01 41 11 51 21 61 31 71
20 21 22 23 24 25 26 27     02 42 12 52 22 62 32 72
30 31 32 33 34 35 36 37     03 43 13 53 23 63 33 73
40 41 42 43 44 45 46 47     04 44 14 54 24 64 34 74
50 51 52 53 54 55 56 57     05 45 15 55 25 65 35 75
60 61 62 63 64 65 66 67     06 46 16 56 26 66 36 76
70 71 72 73 74 75 76 77     07 47 17 57 27 67 37 77

(Obviously it wouldn't be implemented like that, but you get the idea.)

To make a full scaler, some support stuff is needed. It's the usual: Functions for loading/saving basic images, converting them to/from 8-bit, calculating filter coefficients etc. I've covered this in earlier articles, so let's skip a long presentation of that. The necessary files are provided in the Source Code section.

A Working SSE2 Implementation

Let's review the integer Intel SSE2 implementation. SSE2 has 128-bit loads and stores, and also some nifty streaming store instructions: non-temporal stores that write the destination without dragging it into the cache, leaving more room for the temp buffer. Unfortunately, with the interleaved destination layout a single set only produces 64-bit stores in the vertical pass. But that's no problem: Doing two sets in parallel allows for full-width 128-bit stores. So let's try a vertical pass that does two sets.

Let's start off with the high-level function that finds the coefficients and scales a planar RGB image:

static void scale_picture( bmp_planar *src, bmp_planar *dst )
{
    uint64_t t0, t1;
    uint32_t *xco, *yco;
    uint32_t x[COEFFS_PHASES];
    uint32_t y[COEFFS_PHASES];

    float xfac = dst->w/(float)src->w;
    float yfac = dst->h/(float)src->h;

    coeffs_get( xfac, x );
    xco = x;
    if( xfac != yfac ) {
        coeffs_get( yfac, y );
        yco = y;
    } else {
        yco = x;
    }

    scale_plane( src->r, src->w, src->h, src->stride, dst->r, dst->w, dst->h, dst->stride, xco, yco );
    scale_plane( src->g, src->w, src->h, src->stride, dst->g, dst->w, dst->h, dst->stride, xco, yco );
    scale_plane( src->b, src->w, src->h, src->stride, dst->b, dst->w, dst->h, dst->stride, xco, yco );

}
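
For context, a hypothetical call sequence would look something like the sketch below. The bmp helper names are placeholders, not the actual API in bmp_planar.c:

// Sketch only: loader/saver names are assumed; the real ones live in bmp_planar.c/main.c.
bmp_planar *src = bmp_load( "ebu3325.bmp" );   // 1920x1080 source
bmp_planar *dst = bmp_alloc( 1280, 720 );      // destination planes + stride
scale_picture( src, dst );
bmp_save( "out.bmp", dst );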

Next up is the function that scales a single 8-bit plane:

void scale_plane( uint8_t *src, int srcw, int srch, int srcstr,
                  uint8_t *dst, int dstw, int dsth, int dststr,
                  uint32_t *xco, uint32_t *yco )
{
    int xadd = (srcw<<16)/dstw;
    int yadd = (srch<<16)/dsth;

    int xoffset = (xadd>>1)-(1<<15);
    int yoffset = (yadd>>1)-(1<<15);

    int yy = yoffset + (-1<<16);

    uint8_t *buf = alloca_aligned( (srcstr+16+32)*8, 16 );
    buf += 64;

    for( int y = 0; y < dsth; y += 8 ) {

        for( int i = 0; i < 8; i += 2 ) {

            scale_vertical_x2( src, srcw, srch, srcstr, buf + i*8, yco, yy, yadd );

            yy += yadd*2;

        }

        h2v( buf-64, srcw+16 );

        scale_horizontal_vertical( buf, dst + y * dststr, dstw, dststr, xadd, xoffset, xco );

    }

}
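
scale_plane() uses two small helpers that live in the support files and aren't reproduced here. A minimal sketch of the assumed behaviour (the real versions may differ):

#include <alloca.h>
#include <stdint.h>

// Assumed: stack allocation rounded up to the requested alignment.
#define alloca_aligned( size, align ) \
    (uint8_t *)(((uintptr_t)alloca( (size) + (align) - 1 ) + ((align) - 1)) & ~(uintptr_t)((align) - 1))

// Assumed: clamp a line index to [lo,hi], used for edge replication below.
static inline int clamp( int v, int lo, int hi )
{
    return v < lo ? lo : v > hi ? hi : v;
}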

The vertical scaler must handle padding on the edges. That's very simple with this solution: Just allocate extra 8x8 blocks at the start/end and copy the first/last pixel. The addressing is handled by the GETOFFSET macro:

#define GETOFFSET(p) ( ((p)>>3)*64+((p)&0x07) )
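
As a worked example, GETOFFSET(19) = (19>>3)*64 + (19&7) = 2*64 + 3 = 131: pixel 19 of the current row lands in the third 64-byte block, at byte 3 of that block's first 8-byte row. The +8 variants in the padding code below address the second row of the pair.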

static void scale_vertical_x2( uint8_t *src, int srcw, int srch, int srcstr, uint8_t *dst0, uint32_t *yco, int yy, int yadd )
{
    int ypos;
    uint32_t vco;

    srch -= 1;

    ypos = yy>>16;
    __m128i *src00 = (__m128i *)(src + srcstr * clamp( ypos  , 0, srch ));
    __m128i *src01 = (__m128i *)(src + srcstr * clamp( ypos+1, 0, srch ));
    __m128i *src02 = (__m128i *)(src + srcstr * clamp( ypos+2, 0, srch ));
    __m128i *src03 = (__m128i *)(src + srcstr * clamp( ypos+3, 0, srch ));

    vco = *(yco + __bextr_u32( yy, ((COEFFS_STOPBIT-COEFFS_STARTBIT+1)<<8) | COEFFS_STARTBIT ) );
    __m128i vco00 = _mm_set1_epi16( (int8_t)(vco & 0xff) );
    __m128i vco01 = _mm_set1_epi16( (int8_t)__bextr_u32( vco, (8<<8) | 8  ) );
    __m128i vco02 = _mm_set1_epi16( (int8_t)__bextr_u32( vco, (8<<8) | 16 ) );
    __m128i vco03 = _mm_set1_epi16( (int8_t)__bextr_u32( vco, (8<<8) | 24 ) );

    yy += yadd;

    // ...
    // ditto for src10,11,12,13 and vco10,11,12,13
    //

    uint8_t *dst00 = dst0;

    __m128i zero     = _mm_set1_epi32( 0x00000000 );
    __m128i roundval = _mm_set1_epi16( COEFFS_ROUNDVAL );

    for( int x = 0; x < srcw; x += 16 ) {
        __m128i i0, i1, i2, i3;
        __m128i lo, hi;
        __m128i r00, r10;

        i0 = _mm_load_si128( src00++ );
        i1 = _mm_load_si128( src01++ );
        i2 = _mm_load_si128( src02++ );
        i3 = _mm_load_si128( src03++ );

        lo = _mm_add_epi16( _mm_add_epi16( _mm_mullo_epi16( _mm_unpacklo_epi8( i0, zero ), vco00 ),
                                           _mm_mullo_epi16( _mm_unpacklo_epi8( i1, zero ), vco01 ) ),
                            _mm_add_epi16( _mm_mullo_epi16( _mm_unpacklo_epi8( i2, zero ), vco02 ),
                                           _mm_mullo_epi16( _mm_unpacklo_epi8( i3, zero ), vco03 ) ) );

        hi = _mm_add_epi16( _mm_add_epi16( _mm_mullo_epi16( _mm_unpackhi_epi8( i0, zero ), vco00 ),
                                           _mm_mullo_epi16( _mm_unpackhi_epi8( i1, zero ), vco01 ) ),
                            _mm_add_epi16( _mm_mullo_epi16( _mm_unpackhi_epi8( i2, zero ), vco02 ),
                                           _mm_mullo_epi16( _mm_unpackhi_epi8( i3, zero ), vco03 ) ) );

        r00 = _mm_packus_epi16( _mm_srai_epi16( _mm_add_epi16( lo, roundval ), COEFFS_SUM_BITS ),
                                _mm_srai_epi16( _mm_add_epi16( hi, roundval ), COEFFS_SUM_BITS ) );

        // ...
        // ditto for r10: src10,11,12,13
        // ...

        // r00 holds 16 pixels of row i, r10 the same 16 pixels of row i+1; pairing
        // their low/high halves fills the row pair's two 8-byte slots of each 64-byte block
        _mm_store_si128( (__m128i *)dst00, _mm_unpacklo_epi64( r00, r10 ) ); dst00 += 64;
        _mm_store_si128( (__m128i *)dst00, _mm_unpackhi_epi64( r00, r10 ) ); dst00 += 64;

    }

    // pad left
    __m128i left;
    left = _mm_set1_epi8( dst0[0] );
    _mm_storel_epi64( (__m128i *)(dst0-64), left );
    left = _mm_set1_epi8( dst0[8] );
    _mm_storel_epi64( (__m128i *)(dst0+8-64), left );

    // pad right
    uint8_t right;
    right = *(dst0+GETOFFSET(srcw-1));
    *(dst0+GETOFFSET(srcw+0)) = right;
    *(dst0+GETOFFSET(srcw+1)) = right;
    right = *(dst0+8+GETOFFSET(srcw-1));
    *(dst0+8+GETOFFSET(srcw+0)) = right;
    *(dst0+8+GETOFFSET(srcw+1)) = right;
}

When 2*4=8 rows are done, it's time to transpose the data in place. We cannot do any ARM tricks, so it's a standard 8x8 transpose:

static void h2v( uint8_t *in, int cnt )
{
    __m128i *inw = (__m128i *)in;
    __m128i *outw = inw;

    do {

        __m128i in01 = _mm_load_si128( inw++ ); // 00000000.11111111
        __m128i in23 = _mm_load_si128( inw++ ); // 22222222.33333333
        __m128i in45 = _mm_load_si128( inw++ ); // 44444444.55555555
        __m128i in67 = _mm_load_si128( inw++ ); // 66666666.77777777

        __m128i in02 = _mm_unpacklo_epi8( in01, in23 ); // 02020202.02020202
        __m128i in13 = _mm_unpackhi_epi8( in01, in23 ); // 13131313.13131313
        __m128i in46 = _mm_unpacklo_epi8( in45, in67 ); // 46464646.46464646
        __m128i in57 = _mm_unpackhi_epi8( in45, in67 ); // 57575757.57575757

        __m128i in0123l = _mm_unpacklo_epi8( in02, in13 ); // 01230123.01230123
        __m128i in0123h = _mm_unpackhi_epi8( in02, in13 ); // ..
        __m128i in4567l = _mm_unpacklo_epi8( in46, in57 ); // 45674567.45674567
        __m128i in4567h = _mm_unpackhi_epi8( in46, in57 ); // ..

        __m128i out0 = _mm_unpacklo_epi32( in0123l, in4567l ); // 01234567.01234567
        __m128i out1 = _mm_unpackhi_epi32( in0123l, in4567l ); // ..
        __m128i out2 = _mm_unpacklo_epi32( in0123h, in4567h ); // ..
        __m128i out3 = _mm_unpackhi_epi32( in0123h, in4567h ); // ..

        _mm_store_si128( outw++, out0 );
        _mm_store_si128( outw++, out1 );
        _mm_store_si128( outw++, out2 );
        _mm_store_si128( outw++, out3 );

        cnt -= 8;

    } while( cnt > 0 );
}

The output is sent to the "horizontal" pass, which is now very similar to the vertical one. The only difference is coefficient reloading and transposing data back before writing:

static void scale_horizontal_vertical( uint8_t *src, uint8_t *dst, int dstw, int dststr, int xadd, int xoffset, uint32_t *xco )
{
    int xx = xoffset + (-1<<16);

    __m128i zero     = _mm_set1_epi32( 0x00000000 );
    __m128i roundval = _mm_set1_epi16( COEFFS_ROUNDVAL );

    for( int x = 0; x < dstw; x += 16 ) {

        __m128i dbuf[8];

        for( int i = 0; i < 8; i++ ) {
            uint32_t hco;
            __m128i hco0, hco1, hco2, hco3;
            __m128i in01, in23;
            uint8_t *in;

            hco = *(xco + __bextr_u32( xx, ((COEFFS_STOPBIT-COEFFS_STARTBIT+1)<<8) | COEFFS_STARTBIT ) );
            hco0 = _mm_set1_epi16( (int8_t)(hco & 0xff) );
            hco1 = _mm_set1_epi16( (int8_t)__bextr_u32( hco, (8<<8) | 8  ) );
            hco2 = _mm_set1_epi16( (int8_t)__bextr_u32( hco, (8<<8) | 16 ) );
            hco3 = _mm_set1_epi16( (int8_t)__bextr_u32( hco, (8<<8) | 24 ) );

            in  = src + ((xx>>16)<<3);
            xx += xadd;

            in01 = _mm_loadu_si128( (__m128i *)in ); in += 16;
            in23 = _mm_loadu_si128( (__m128i *)in );

            __m128i lo = _mm_add_epi16( _mm_add_epi16( _mm_mullo_epi16( _mm_unpacklo_epi8( in01, zero ), hco0 ),
                                                       _mm_mullo_epi16( _mm_unpackhi_epi8( in01, zero ), hco1 ) ),
                                        _mm_add_epi16( _mm_mullo_epi16( _mm_unpacklo_epi8( in23, zero ), hco2 ),
                                                       _mm_mullo_epi16( _mm_unpackhi_epi8( in23, zero ), hco3 ) ) );

            // ...
            // ditto for hi
            // ...

            dbuf[i] = _mm_packus_epi16( _mm_srai_epi16( _mm_add_epi16( lo, roundval ), COEFFS_SUM_BITS ),
                                        _mm_srai_epi16( _mm_add_epi16( hi, roundval ), COEFFS_SUM_BITS ) );

        }

        __m128i in02, in13, in46, in57;
        __m128i in0123l, in0123h, in4567l, in4567h;
        __m128i out0l, out1l, out2l, out3l;
        __m128i out0h, out1h, out2h, out3h;

        in02 = _mm_unpacklo_epi8( dbuf[0], dbuf[1] );
        in13 = _mm_unpackhi_epi8( dbuf[0], dbuf[1] );
        in46 = _mm_unpacklo_epi8( dbuf[2], dbuf[3] );
        in57 = _mm_unpackhi_epi8( dbuf[2], dbuf[3] );

        in0123l = _mm_unpacklo_epi8( in02, in13 );
        in0123h = _mm_unpackhi_epi8( in02, in13 );
        in4567l = _mm_unpacklo_epi8( in46, in57 );
        in4567h = _mm_unpackhi_epi8( in46, in57 );

        out0l = _mm_unpacklo_epi32( in0123l, in4567l );
        out1l = _mm_unpackhi_epi32( in0123l, in4567l );
        out2l = _mm_unpacklo_epi32( in0123h, in4567h );
        out3l = _mm_unpackhi_epi32( in0123h, in4567h );

        in02 = _mm_unpacklo_epi8( dbuf[4], dbuf[5] );
        in13 = _mm_unpackhi_epi8( dbuf[4], dbuf[5] );
        in46 = _mm_unpacklo_epi8( dbuf[6], dbuf[7] );
        in57 = _mm_unpackhi_epi8( dbuf[6], dbuf[7] );

        in0123l = _mm_unpacklo_epi8( in02, in13 );
        in0123h = _mm_unpackhi_epi8( in02, in13 );
        in4567l = _mm_unpacklo_epi8( in46, in57 );
        in4567h = _mm_unpackhi_epi8( in46, in57 );

        out0h = _mm_unpacklo_epi32( in0123l, in4567l );
        out1h = _mm_unpackhi_epi32( in0123l, in4567l );
        out2h = _mm_unpacklo_epi32( in0123h, in4567h );
        out3h = _mm_unpackhi_epi32( in0123h, in4567h );

        uint8_t *out = dst + x;

        _mm_stream_si128( (__m128i *)out, _mm_unpacklo_epi64( out0l, out0h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpackhi_epi64( out0l, out0h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpacklo_epi64( out1l, out1h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpackhi_epi64( out1l, out1h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpacklo_epi64( out2l, out2h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpackhi_epi64( out2l, out2h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpacklo_epi64( out3l, out3h ) ); out += dststr;
        _mm_stream_si128( (__m128i *)out, _mm_unpackhi_epi64( out3l, out3h ) );

    }

}

Source Code

Source code (zip archive): imagetrans_2018.zip (browse). Contains all the files listed below.

Functions to allocate, load and save 24-bit bitmap files as planar data: bmp_planar.c

Functions for calculating 8-bit Lanczos-2 filter coefficients: coeffs.c

Two pass 4-tap scaler for Intel SSE2 that uses the new dual vertical pass method: scale_sse2.c

Test code to scale pictures and measure time used: main.c

From the Pits of Hell (26 July 2019)

This is an old attempt to make an ARM NEON version of this. Using intrinsics quickly leads to register chaos and GCC frantically dumping things on the stack when it shouldn't. Some dork said it would be simple to do using GCC inline assembly. It wasn't. Next time I'll write a pure assembler version. The 8-bit scaler should work ok; I can't remember if I ever got the NV12 routines to work properly.

Compile, Link and Test

The -m buffers option enables the performance test. Its main purpose is to measure the effect of increased memory load: -m 1 should have everything in cache and run at max speed, while -m 300 should make sure nothing is in the cache. The output below is compressed a bit for readability.

Intel SSE2, Xeon E5-2699 v3 @ 2.3GHz:

[nco@marvin scaler]$ gcc -ffast-math -O3 -march=native -msse2 -mbmi -Wall -std=gnu99 -o scaler scale_sse2.c bmp_planar.c coeffs.c main.c -lm

[nco@marvin scaler]$ ./scaler -i ebu3325.bmp -o out.bmp -s 1280x720
Loading bitmap ebu3325.bmp
Width: 1920 Height: 1080 Rowbytes: 5760
File loaded!
Input:   1920*1080
Output:  1280*720
Scale time: 7934
Saving bitmap out.bmp
Width: 1280 Height: 720 Rowbytes: 3840

[nco@marvin scaler]$ ./scaler -m 1
total: 5400837, per frame: 5400, fps: 185.156479
[nco@marvin scaler]$ ./scaler -m 300
total: 6558348, per frame: 6558, fps: 152.477432

Comments are always appreciated. Contact information is available here.

Remember to appreciate this reasonably new XKCD strip.

Article Licensing Information

This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary: You may copy and redistribute the material in any medium or format for any purpose, even commercially. You must give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you may not distribute the modified material.

