
A Fast Image Scaler for ARM Neon

Nils Liaaen Corneliusen

15 May 2023

Update 25 October 2023: Added chapter about tests on Apple M1 and M2.

Abstract

Introducing a fast Lanczos-2 image scaler for ARM Neon written in C and intrinsics. It uses a unique method for applying separable filters described in the article Exploiting the Cache: Faster Separable Filters (Corneliusen 2018). An ARMv7 Neon Assembler implementation from 2014 gets a makeover. Full source code included.

A Brief History of Separable Image Filters

The method is presented in more detail in the book Real Programming (Corneliusen/Julin 2021). Probably nobody has read it or the article, so here's a description of how separable filters are applied, from the article Separable Filter [en.wikipedia.org]:

A separable filter in image processing can be written as product of two more simple filters. Typically a 2-dimensional convolution operation is separated into two 1-dimensional filters.

That translates to applying one filter in one direction (for a number of destination lines) and the other in the other direction. The complexity is reduced a lot and everybody's happy. Another problem solved. Wrap it up in a library and stop thinking about it. Unfortunately, that's not how high-performance computing works. In this case, the niggling issue is that the two passes differ in their complexity: The vertical pass is easy, the horizontal pass is hard. It is slightly less hard on modern CPUs since unaligned loads cost less, but it's still a royal pain keeping track of everything.
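
As a rough illustration of the definition quoted above (a sketch, not code from the scaler): when a 2-D kernel is the outer product of two 1-D kernels, filtering directly with K*K taps per pixel gives the same result as a vertical pass followed by a horizontal pass with K taps each. The image size, kernel values, and function names are all made up for the example.

```c
#define N 6   // image is N x N
#define K 3   // taps per 1-D kernel

// Clamp a coordinate so edge pixels get duplicated.
static int clampi( int v, int lo, int hi )
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Direct 2-D filter with the outer product kv[j]*kh[i]: K*K taps per pixel.
static void filter_2d( int src[N][N], int dst[N][N], const int kv[K], const int kh[K] )
{
    for( int y = 0; y < N; y++ )
        for( int x = 0; x < N; x++ ) {
            int sum = 0;
            for( int j = 0; j < K; j++ )
                for( int i = 0; i < K; i++ )
                    sum += kv[j] * kh[i] *
                           src[clampi( y + j - 1, 0, N - 1 )][clampi( x + i - 1, 0, N - 1 )];
            dst[y][x] = sum;
        }
}

// Two 1-D passes: vertical into tmp, then horizontal. K+K taps per pixel.
static void filter_separable( int src[N][N], int dst[N][N], const int kv[K], const int kh[K] )
{
    int tmp[N][N];

    for( int y = 0; y < N; y++ )
        for( int x = 0; x < N; x++ ) {
            int sum = 0;
            for( int j = 0; j < K; j++ )
                sum += kv[j] * src[clampi( y + j - 1, 0, N - 1 )][x];
            tmp[y][x] = sum;
        }

    for( int y = 0; y < N; y++ )
        for( int x = 0; x < N; x++ ) {
            int sum = 0;
            for( int i = 0; i < K; i++ )
                sum += kh[i] * tmp[y][clampi( x + i - 1, 0, N - 1 )];
            dst[y][x] = sum;
        }
}

// Returns 1 when both routes give identical output for a small test image.
static int separable_matches_2d( void )
{
    const int kv[K] = { 1, 2, 1 };
    const int kh[K] = { -1, 4, -1 };
    int src[N][N], a[N][N], b[N][N];

    for( int y = 0; y < N; y++ )
        for( int x = 0; x < N; x++ )
            src[y][x] = (y * 7 + x * 13) % 32;

    filter_2d( src, a, kv, kh );
    filter_separable( src, b, kv, kh );

    for( int y = 0; y < N; y++ )
        for( int x = 0; x < N; x++ )
            if( a[y][x] != b[y][x] ) return 0;
    return 1;
}
```

The sketch uses plain integers so the comparison is exact. The real scaler rounds and packs to 8 bits after each pass, so there the two routes would only agree to within rounding.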

The brilliant (or stupid, I forget which) idea I had back then was to rotate the data before the horizontal pass, thus making it look like the vertical one. This will require two temporary buffers and not run very quickly. However, my other finding was that one of the buffers can be eliminated by writing the output data interleaved and transposing it while it is in the cache. The result: Massive speed increase.

row 0, 0..7	row 1, 0..7	row 2, 0..7	row 3, 0..7	row 4, 0..7	row 5, 0..7	row 6, 0..7	row 7, 0..7
row 0, 8..15	row 1, 8..15	row 2, 8..15	row 3, 8..15	row 4, 8..15	row 5, 8..15	row 6, 8..15	row 7, 8..15
etc.
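
The layout above can be modelled in a few lines of scalar C. This is a sketch under the assumption that each 8-pixel chunk of each row occupies 8 bytes and that 8 rows share one 64-byte block; the function names are hypothetical. It writes 8 rows interleaved, transposes each 64-byte block as an 8x8 byte matrix, and checks that pixel x of row r ends up at offset x*8 + r. In other words, the buffer then looks like an 8-pixel-wide image with one "row" per source column.

```c
#include <stdint.h>

#define ROWS 8    // rows per batch, as in the layout above
#define W    16   // row width in pixels; a multiple of 8 for this sketch

// Byte offset of pixel x of row r in the interleaved temporary buffer.
static int interleaved_offset( int r, int x )
{
    return (x >> 3) * 64 + r * 8 + (x & 7);
}

// Transpose every 64-byte block in place, treating it as an 8x8 byte matrix.
static void transpose_blocks( uint8_t *buf, int width )
{
    for( int b = 0; b < width / 8; b++ ) {
        uint8_t *blk = buf + b * 64;
        for( int i = 0; i < 8; i++ )
            for( int j = i + 1; j < 8; j++ ) {
                uint8_t t      = blk[i * 8 + j];
                blk[i * 8 + j] = blk[j * 8 + i];
                blk[j * 8 + i] = t;
            }
    }
}

// Returns 1 if, after the transpose, pixel (r, x) sits at offset x*8 + r.
static int layout_check( void )
{
    uint8_t buf[ROWS * W];

    for( int r = 0; r < ROWS; r++ )
        for( int x = 0; x < W; x++ )
            buf[interleaved_offset( r, x )] = (uint8_t)(r * W + x);

    transpose_blocks( buf, W );

    for( int r = 0; r < ROWS; r++ )
        for( int x = 0; x < W; x++ )
            if( buf[x * 8 + r] != (uint8_t)(r * W + x) ) return 0;
    return 1;
}
```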

The 2018 article presents a functional example: An SSE2 implementation of a Lanczos-2 scaler that I made in 2013. The 5-year discrepancy can be attributed to a boss (not bus) error. Read the book. Since Intel and AMD now support even wider integer vector operations in AVX2 ~~and AVX-512~~, an update would have been appropriate. Unfortunately, Intel's latest AVX implementations are... crap [anandtech.com].

In 2019, I recovered an ARM Neon Assembler implementation that I made in 2014 and added it to the 2018 article. (Yeah, this story is not very linear.) That version is semi-functional: It doesn't compile out of the box, it has a quirky interface, it doesn't calculate start offsets correctly, and there's the case of a missing register in the second pass. In short, time for a refresh. Since modern ARM CPUs have 32 Neon registers and perform quite well, it's also time to make a completely new C and intrinsics implementation. I'll keep it readable, simple, and fast: All those words are traits of good C code, the only programming language that endures in this age of snake pits.

A Trip Down Memory Lane

Before getting to the new ARM scaler, let's have a look at the old SSE2 implementation from 2013. Does it still compile? Will it work correctly? What will the performance test say? What is the airspeed velocity of an unladen swallow? Many questions, few answers. Grab the archive called imagetrans_2018.zip (or browse it). Compile and run it on a newish AMD:

SSE2 intrinsics version (2013) on AMD Ryzen 9 5900X @ 4.5 GHz (2020)

nils@lunkwill:~/src/imagetrans$ gcc -O3 -Wall -march=native -msse2 -mbmi -o scaler scale_sse2.c bmp_planar.c coeffs.c main.c -lm
nils@lunkwill:~/src/imagetrans$

nils@lunkwill:~/src/imagetrans$ ./scaler -m 1
Loading bitmap ebu3325.bmp
Width: 1920 Height: 1080 Rowbytes: 5760
File loaded!
Input:  1920*1080
Scaling 1000 frames from 1920*1080 to 1280*720 using 1/1 buffers
Memory estimate: 5.93MB src, 2.64MB dst
total: 1514.000000ms, per frame: 1.514000ms, fps: 660.501953

nils@lunkwill:~/src/imagetrans$ ./scaler -m 300
(...)
Memory estimate: 1779.79MB src, 791.02MB dst
total: 1694.000122ms, per frame: 1.694000ms, fps: 590.318726

What does this tell us? Properly written code in C and intrinsics stands the test of time. Not a single warning from the compiler then; not a single warning from the compiler now. It was fast and readable then; it is fast and readable now. I cannot spell it out any clearer.

I made a half-assed attempt at writing an AVX2 version: Just the first pass and transposer were (poorly) converted. The test run showed a promising 780/683 fps. There's definite potential there. Let's see where AVX ends up first. My guess is a landfill in Alamogordo, New Mexico [en.wikipedia.org]. I might be wrong, though.

Raspberry Pi 2

Beating a Dead Horse

As mentioned in the introduction, the old ARMv7 Neon Assembler implementation has some issues. They have all been fixed. In addition, I've made use of lane-based multiplication, reordered the instructions, and generally improved it. The output should be exactly the same as from the intrinsics version. The instruction ordering can probably be improved further, but I'm not gonna bother with it. Here's the updated version:

scale_neon_asm.c

Unfortunately, finding an old Cortex-A9 with Neon to test it on was hard: Some dork (probably me) had thrown away lots of hardware when I moved last year. Luckily, Sjur had an old Raspberry Pi 2 Model B v1.1 hidden in his bag of holding. It's not gonna be quick, but it should compile and run. Toss the new file into the pond with the other files.

ARMv7 Neon Assembler version (2014,2023) on Pi 2 Model B v1.1, ARMv7 @ 900 MHz (2015)

nils@pitfall:~/src/imagetrans$ gcc -O3 -Wall -march=native -mfpu=neon -o scaler scale_neon_asm.c bmp_planar.c coeffs.c main.c -lm
nils@pitfall:~/src/imagetrans$

nils@pitfall:~/src/imagetrans$ ./scaler -m 1
(...)
total: 3949.499023ms, per frame: 39.494987ms, fps: 25.376513

nlc@pitfall:~/src/imagetrans $ ./scaler -m 50
(...)
total: 3933.701904ms, per frame: 39.337021ms, fps: 25.421347

Didn't know what to expect, but at least it runs. Use it if you need it.

Ask Not What You Can Do for the Compiler - Ask What the Compiler Can Do for You

There are great things about ARM Neon intrinsics... and some things that are at the other end of the scale. Vreinterpret, vector construction, the naming of loads and stores: All areas in need of significant improvement. No need to rant about it here, since I already did that in the book. Anyway, a decision was made to only use common v7/A32/A64 intrinsics. (Who makes these important decisions? Seriously, I have to start paying more attention.) That means the result will compile for older targets with 16 Neon registers. Let's see how that works out later.

So why am I writing this in C and intrinsics again? To get the compiler to do the register allocation and instruction scheduling. Those 32 AArch64 Neon registers are just begging to be fully exploited by some heavy Neon code. Scheduling instructions is hard: Look at the Assembler implementation. I have high hopes for the C and intrinsics version! I get disappointed quite often, though.

All this talk is getting nobody nowhere at any speed. To quote a former Transmeta employee: "Talk is cheap. Show me the code." Wise words. Let's jump to the matter at hand and review the three major parts of the implementation: The vertical pass, the transposer, and the vertical-looking horizontal pass.

None Shall Pass

First is the (almost) normal vertical pass. Lanczos-2 is a separable 4-tap filter (duh), so it converts 4 contiguous source rows to 1 destination row. The only sign of something different is how the temporary buffer is filled.

The signed 8-bit coefficients are getting long in the tooth. Could have updated the coeffs code to produce sets of 4 16-bit ints. Didn't, so extracting them is quirky. Uses less cache, though. Feel free to change it. Note that the coeffs error distribution method in the generator was nifty in 2013. It's still nifty.

Start by setting up the row pointers: Duplicate top and bottom rows as needed.

static void vert_interpolate_4to1_8( uint8_t *src, int srcw, int srch, int src_stride, uint8_t *dst, uint32_t *yco, int yy )
{
    int16x8_t rv = vdupq_n_s16( COEFFS_ROUNDVAL );

    // Set up row pointers
    int ypos = yy>>16;
    uint64_t *sp0 = (uint64_t *)(src + src_stride * clamp(ypos,   0, srch));
    uint64_t *sp1 = (uint64_t *)(src + src_stride * clamp(ypos+1, 0, srch));
    uint64_t *sp2 = (uint64_t *)(src + src_stride * clamp(ypos+2, 0, srch));
    uint64_t *sp3 = (uint64_t *)(src + src_stride * clamp(ypos+3, 0, srch));

    uint8_t *dst0 = dst;

The quirky coeffs fetching. Get 32 bits from the table, expand, and store in vco0w:

    // Fetch 8-bit coeffs, convert to 16x4
    int16x4_t vco0w = vget_low_s16( vmovl_s8( s8_u64(vcreate_u64( *(yco + ((((uint32_t)yy)>>10)&63)) )) ) );
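
To make the indexing in that line concrete, here is a scalar sketch of the same decoding. It assumes what the listing suggests: yy is a 16.16 fixed-point position, the top 6 fraction bits select one of 64 coefficient sets, and each table entry packs four signed 8-bit taps with tap 0 in the lowest byte. The helper names are hypothetical, and the example word in the usage note is made up; real values come from coeffs.c.

```c
#include <stdint.h>

// Integer part of a 16.16 position: index of the first of the 4 source rows.
// (Only meaningful for non-negative positions in this sketch.)
static int pos_row( int32_t yy )
{
    return yy >> 16;
}

// Top 6 bits of the fraction: which of the 64 coefficient sets to use.
static unsigned pos_phase( int32_t yy )
{
    return ((uint32_t)yy >> 10) & 63;
}

// Extract tap k (0..3) from a word holding four signed 8-bit coefficients.
static int coeff_tap( uint32_t packed, int k )
{
    int v = (int)((packed >> (8 * k)) & 0xFF);
    return v >= 128 ? v - 256 : v;   // sign-extend without relying on casts
}
```

For example, position 5.5 is 0x00058000, which decodes to row 5 and phase 32. A made-up halfway set of -4, 36, 36, -4 (summing to 64, to match the shift by 6 further down) would be stored as 0xFC2424FC.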

Set up the loop and load data from the 4 rows:

    for( int x = 0; x < srcw; x += 32 ) {
        int16x8_t res0, res1, res2, res3;

        // Row 0-3: fetch
        uint64x1x4_t in0 = vld4_u64( sp0 ); sp0 += 4;
        uint64x1x4_t in1 = vld4_u64( sp1 ); sp1 += 4;
        uint64x1x4_t in2 = vld4_u64( sp2 ); sp2 += 4;
        uint64x1x4_t in3 = vld4_u64( sp3 ); sp3 += 4;

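This is why I hate vreinterpret: They're long and ugly. As a bonus, they rarely do anything useful except clutter the code. All the vreinterprets have had their arms and legs cut off, like all ARM programmers I know do: The short names in the listings, such as u8_u64 and qs16_u16, are the amputated versions (qs16_u16 standing in for vreinterpretq_s16_u16, and so on). This is the same code as in a normal vertical pass. Or should be.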

        // Row 0: Expand & multiply
        res0 = vmulq_lane_s16( qs16_u16( vmovl_u8( u8_u64(in0.val[0]) ) ), vco0w, 0 );
        res1 = vmulq_lane_s16( qs16_u16( vmovl_u8( u8_u64(in0.val[1]) ) ), vco0w, 0 );
        res2 = vmulq_lane_s16( qs16_u16( vmovl_u8( u8_u64(in0.val[2]) ) ), vco0w, 0 );
        res3 = vmulq_lane_s16( qs16_u16( vmovl_u8( u8_u64(in0.val[3]) ) ), vco0w, 0 );

        // Row 1: Expand & fma
        res0 = vmlaq_lane_s16( res0, qs16_u16( vmovl_u8( u8_u64(in1.val[0]) ) ), vco0w, 1 );
        res1 = vmlaq_lane_s16( res1, qs16_u16( vmovl_u8( u8_u64(in1.val[1]) ) ), vco0w, 1 );
        res2 = vmlaq_lane_s16( res2, qs16_u16( vmovl_u8( u8_u64(in1.val[2]) ) ), vco0w, 1 );
        res3 = vmlaq_lane_s16( res3, qs16_u16( vmovl_u8( u8_u64(in1.val[3]) ) ), vco0w, 1 );

        // Row 2: Etc.
        res0 = vmlaq_lane_s16( res0, qs16_u16( vmovl_u8( u8_u64(in2.val[0]) ) ), vco0w, 2 );
        res1 = vmlaq_lane_s16( res1, qs16_u16( vmovl_u8( u8_u64(in2.val[1]) ) ), vco0w, 2 );
        res2 = vmlaq_lane_s16( res2, qs16_u16( vmovl_u8( u8_u64(in2.val[2]) ) ), vco0w, 2 );
        res3 = vmlaq_lane_s16( res3, qs16_u16( vmovl_u8( u8_u64(in2.val[3]) ) ), vco0w, 2 );

        // Row 3
        res0 = vmlaq_lane_s16( res0, qs16_u16( vmovl_u8( u8_u64(in3.val[0]) ) ), vco0w, 3 );
        res1 = vmlaq_lane_s16( res1, qs16_u16( vmovl_u8( u8_u64(in3.val[1]) ) ), vco0w, 3 );
        res2 = vmlaq_lane_s16( res2, qs16_u16( vmovl_u8( u8_u64(in3.val[2]) ) ), vco0w, 3 );
        res3 = vmlaq_lane_s16( res3, qs16_u16( vmovl_u8( u8_u64(in3.val[3]) ) ), vco0w, 3 );

As mentioned in the book everybody has an opinion about, but nobody has read, vqshrun is very versatile. It's useful for shifting and packing results. And that's about it.

        // Round and pack
        uint8x8_t r0 = vqshrun_n_s16( vaddq_s16( res0, rv ), 6 );
        uint8x8_t r1 = vqshrun_n_s16( vaddq_s16( res1, rv ), 6 );
        uint8x8_t r2 = vqshrun_n_s16( vaddq_s16( res2, rv ), 6 );
        uint8x8_t r3 = vqshrun_n_s16( vaddq_s16( res3, rv ), 6 );
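
For reference, a scalar sketch of what one lane goes through: four multiply-accumulates, add the rounding value, shift right by 6, and saturate to 0..255. It assumes COEFFS_ROUNDVAL is half of 1<<6, which is a guess since the definition is not shown here; SKETCH_ROUNDVAL, the function names, and the example coefficients are hypothetical.

```c
#include <stdint.h>

#define SKETCH_ROUNDVAL 32   // assumption: half of 1<<6, for round-to-nearest

// Made-up halfway coefficients that sum to 64; not taken from coeffs.c.
static const int sketch_halfway[4] = { -4, 36, 36, -4 };

// Scalar model of add-round, shift right by 6, saturate to unsigned 8 bits.
static uint8_t round_pack( int acc )
{
    int v = acc + SKETCH_ROUNDVAL;
    if( v < 0 ) return 0;        // negative results saturate to 0
    v >>= 6;
    return v > 255 ? 255 : (uint8_t)v;
}

// One output pixel from 4 vertically adjacent source pixels.
static uint8_t filter4( int p0, int p1, int p2, int p3, const int c[4] )
{
    return round_pack( p0 * c[0] + p1 * c[1] + p2 * c[2] + p3 * c[3] );
}
```

With these example coefficients a flat area stays flat (four pixels of 100 give 100), while overshoot on a sharp edge is clipped: 0, 255, 255, 0 would give 287 before saturation and 255 after.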

Write data interleaved:

        // Store result
        vst1_u8( dst,       r0 );
        vst1_u8( dst +  64, r1 );
        vst1_u8( dst + 128, r2 );
        vst1_u8( dst + 192, r3 );
        dst += 256;
    }

The buffer has added room for padding at the edges. It's slightly complicated since data was written interleaved. The nifty GETOFFSET macro takes care of finding the right spots.

    // pad left
    vst1_u64( (uint64_t *)(dst0  -64), u64_u8(vdup_n_u8( dst0[0] )) );
    vst1_u64( (uint64_t *)(dst0-8-64), u64_u8(vdup_n_u8( dst0[8] )) );

    // pad right
    uint8_t pix;
    pix = *(dst0+  GETOFFSET(srcw-1)); *(dst0+  GETOFFSET(srcw+0)) = pix; *(dst0+  GETOFFSET(srcw+1)) = pix;
    pix = *(dst0+8+GETOFFSET(srcw-1)); *(dst0+8+GETOFFSET(srcw+0)) = pix; *(dst0+8+GETOFFSET(srcw+1)) = pix;
}
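
The GETOFFSET definition is not shown in the listing. A plausible stand-in, consistent with the stores at dst, dst + 64, dst + 128 and so on, maps pixel x to block x/8 plus the position inside the 8-byte chunk. The sketch below uses that guess (named SKETCH_GETOFFSET to make clear it is hypothetical) to pad a single interleaved row on both sides.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for GETOFFSET: offset of pixel x (x >= 0) within
// one row of the interleaved buffer.
#define SKETCH_GETOFFSET(x) ((((x) >> 3) << 6) + ((x) & 7))

// Pad one interleaved row. 'row' points at pixel 0. The buffer is assumed to
// have one block (64 bytes) of room before it and room for 2 pixels after.
static void pad_row( uint8_t *row, int width )
{
    // Left: 8 copies of the first pixel, one block to the left
    memset( row - 64, row[0], 8 );

    // Right: repeat the last pixel twice
    uint8_t last = row[SKETCH_GETOFFSET( width - 1 )];
    row[SKETCH_GETOFFSET( width )]     = last;
    row[SKETCH_GETOFFSET( width + 1 )] = last;
}

// Returns 1 if a 16-pixel row ends up padded as expected.
static int pad_check( void )
{
    uint8_t buf[256] = { 0 };
    uint8_t *row = buf + 64;

    for( int x = 0; x < 16; x++ )
        row[SKETCH_GETOFFSET( x )] = (uint8_t)(10 + x);

    pad_row( row, 16 );

    return buf[0] == 10 && buf[7] == 10 &&
           row[SKETCH_GETOFFSET( 16 )] == 25 &&
           row[SKETCH_GETOFFSET( 17 )] == 25;
}
```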

NVidia Jetson Nano Developer Kit

Dawn of the Posers

When 8 destination rows are done, they are passed to the transposer. Since data was written interleaved, the effect of transposing 8x8 blocks will give 8-wide horizontal data that can be treated as vertical data. Talk about killing two birds with one stone. Or 64 birds with 8 stones.

On older ARMs, it's possible to cheat a bit. Consult the 2018 article. That method is (probably) not very efficient on modern ARMs where vtrn is split into 2 instructions, so I made a port of the SSE2 version. Try the older method for fun! High-performance computing is about trying out stuff, failing, and sometimes succeeding.

It is possible to use _x4 loads/stores here. I always use older compilers: Bugs are known, surprises are few... and new stuff is missing. D'oh. Try them out if your compiler supports them. The vzips can be combined and the extra variables eliminated. It's a matter of style: The code produced should be exactly the same. Jinx!

static void h2v( uint8_t *src, int cnt )
{
    uint32_t *dst32 = (uint32_t *)src;

    do {
        uint8x16_t in0 = vld1q_u8( src      ); // 00000000.11111111 r0.r1
        uint8x16_t in1 = vld1q_u8( src + 16 ); // 22222222.33333333 r2.r3
        uint8x16_t in2 = vld1q_u8( src + 32 ); // 44444444.55555555 r4.r5
        uint8x16_t in3 = vld1q_u8( src + 48 ); // 66666666.77777777 r6.r7
        src += 64;

        uint8x16x2_t tr0 = vzipq_u8( in0, in1 );
        uint8x16x2_t tr1 = vzipq_u8( in2, in3 );

        uint8x16x2_t tr2 = vzipq_u8( tr0.val[0], tr0.val[1] );
        uint8x16x2_t tr3 = vzipq_u8( tr1.val[0], tr1.val[1] );

        uint32x4x2_t tr4 = vzipq_u32( qu32_u8(tr2.val[0]), qu32_u8(tr3.val[0]) );
        uint32x4x2_t tr5 = vzipq_u32( qu32_u8(tr2.val[1]), qu32_u8(tr3.val[1]) );

        vst1q_u32( dst32,      tr4.val[0] );  // 01234567.01234567 c0.c1
        vst1q_u32( dst32 +  4, tr4.val[1] );  // 01234567.01234567 c2.c3
        vst1q_u32( dst32 +  8, tr5.val[0] );  // 01234567.01234567 c4.c5
        vst1q_u32( dst32 + 12, tr5.val[1] );  // 01234567.01234567 c6.c7
        dst32 += 16;

        cnt -= 8;

    } while( cnt > 0 );
}
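
It may not be obvious that three rounds of zips amount to an 8x8 byte transpose. The scalar model below mimics the lane behaviour of vzipq_u8 and vzipq_u32 with plain byte arrays and runs the same sequence as h2v() on one 64-byte block. It is a sketch for checking the data flow on any machine, not a replacement for the Neon code; zip8, zip32, and the other names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

// Scalar model of vzipq_u8: interleave the bytes of a and b.
// lo gets the first halves (like val[0]), hi the second halves (like val[1]).
static void zip8( const uint8_t *a, const uint8_t *b, uint8_t *lo, uint8_t *hi )
{
    for( int i = 0; i < 8; i++ ) {
        lo[2 * i] = a[i];      lo[2 * i + 1] = b[i];
        hi[2 * i] = a[i + 8];  hi[2 * i + 1] = b[i + 8];
    }
}

// Scalar model of vzipq_u32: the same interleave on 4-byte lanes.
static void zip32( const uint8_t *a, const uint8_t *b, uint8_t *lo, uint8_t *hi )
{
    for( int i = 0; i < 2; i++ ) {
        memcpy( lo + 8 * i,     a + 4 * i,     4 );
        memcpy( lo + 8 * i + 4, b + 4 * i,     4 );
        memcpy( hi + 8 * i,     a + 4 * i + 8, 4 );
        memcpy( hi + 8 * i + 4, b + 4 * i + 8, 4 );
    }
}

// Transpose one 64-byte block in place with the same three zip stages.
static void transpose8x8_zip( uint8_t *blk )
{
    uint8_t t0a[16], t0b[16], t1a[16], t1b[16];
    uint8_t t2a[16], t2b[16], t3a[16], t3b[16];

    zip8( blk,      blk + 16, t0a, t0b );    // rows 0,2 / 1,3 interleaved
    zip8( blk + 32, blk + 48, t1a, t1b );    // rows 4,6 / 5,7 interleaved
    zip8( t0a, t0b, t2a, t2b );              // rows 0-3, columns 0-3 / 4-7
    zip8( t1a, t1b, t3a, t3b );              // rows 4-7, columns 0-3 / 4-7
    zip32( t2a, t3a, blk,      blk + 16 );   // columns 0,1 / 2,3
    zip32( t2b, t3b, blk + 32, blk + 48 );   // columns 4,5 / 6,7
}

// Returns 1 if element (r, c) has moved to (c, r) for every r and c.
static int zip_transpose_check( void )
{
    uint8_t blk[64];

    for( int i = 0; i < 64; i++ ) blk[i] = (uint8_t)i;
    transpose8x8_zip( blk );

    for( int r = 0; r < 8; r++ )
        for( int c = 0; c < 8; c++ )
            if( blk[c * 8 + r] != r * 8 + c ) return 0;
    return 1;
}
```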

Even This Shall Pass Away

While the vertical pass turns 4 rows into 1, the horizontal pass turns 8*4 possibly overlapping rows into 8 non-overlapping destination rows. This raises the question: What about restrict? Have a stab at it! I didn't notice any spectacular changes. Your mileage may vary.

static void horiz_interpolate_8_8( uint8_t *src, int xadd, int xoffset, uint32_t *xcoeffs, int width, uint8_t *dst, int dst_stride )
{
    int xx = xoffset - (1<<16) - 65536;

    int16x8_t rv = vdupq_n_s16( COEFFS_ROUNDVAL );

    for( int x = 0; x < width; x += 8 ) {
        int16x8_t res[8];
        int16x8_t in0w, in1w, in2w, in3w;
        int16x4_t hco0w;

8 output rows are needed, so 2 loop iterations. It can be unrolled further manually. It uses a result array so the compiler can be lured into spilling registers outside the loop on old ARM Neons with 16 registers. Yeah, that's never gonna work. Don't know why I said it.

        for( int i = 0; i < 8; i += 4 ) {

            // Fetch data and raw coeffs
            uint64x1x4_t in0 = vld4_u64( (uint64_t *)(src + ((xx>>16)<<3)) ); uint32_t hco0 = *(xcoeffs + ((((uint32_t)xx)>>10)&63)); xx += xadd;
            uint64x1x4_t in1 = vld4_u64( (uint64_t *)(src + ((xx>>16)<<3)) ); uint32_t hco1 = *(xcoeffs + ((((uint32_t)xx)>>10)&63)); xx += xadd;
            uint64x1x4_t in2 = vld4_u64( (uint64_t *)(src + ((xx>>16)<<3)) ); uint32_t hco2 = *(xcoeffs + ((((uint32_t)xx)>>10)&63)); xx += xadd;
            uint64x1x4_t in3 = vld4_u64( (uint64_t *)(src + ((xx>>16)<<3)) ); uint32_t hco3 = *(xcoeffs + ((((uint32_t)xx)>>10)&63)); xx += xadd;

            // Col 0/4: Convert coeffs, expand & mul/fma
            hco0w = vget_low_s16( vmovl_s8( s8_u64(vcreate_u64( hco0 )) ) );
            in0w = qs16_u16( vmovl_u8( u8_u64(in0.val[0]) ) );
            in1w = qs16_u16( vmovl_u8( u8_u64(in0.val[1]) ) );
            in2w = qs16_u16( vmovl_u8( u8_u64(in0.val[2]) ) );
            in3w = qs16_u16( vmovl_u8( u8_u64(in0.val[3]) ) );
            res[i+0] = vmulq_lane_s16(           in0w, hco0w, 0 );
            res[i+0] = vmlaq_lane_s16( res[i+0], in1w, hco0w, 1 );
            res[i+0] = vmlaq_lane_s16( res[i+0], in2w, hco0w, 2 );
            res[i+0] = vmlaq_lane_s16( res[i+0], in3w, hco0w, 3 );

            // Col 1/5: Convert coeffs, expand & mul/fma
            hco0w = vget_low_s16( vmovl_s8( s8_u64(vcreate_u64( hco1 )) ) );
            in0w = qs16_u16( vmovl_u8( u8_u64(in1.val[0]) ) );
            in1w = qs16_u16( vmovl_u8( u8_u64(in1.val[1]) ) );
            in2w = qs16_u16( vmovl_u8( u8_u64(in1.val[2]) ) );
            in3w = qs16_u16( vmovl_u8( u8_u64(in1.val[3]) ) );
            res[i+1] = vmulq_lane_s16(           in0w, hco0w, 0 );
            res[i+1] = vmlaq_lane_s16( res[i+1], in1w, hco0w, 1 );
            res[i+1] = vmlaq_lane_s16( res[i+1], in2w, hco0w, 2 );
            res[i+1] = vmlaq_lane_s16( res[i+1], in3w, hco0w, 3 );

            // Col 2/6: Etc.
            hco0w = vget_low_s16( vmovl_s8( s8_u64(vcreate_u64( hco2 )) ) );
            in0w = qs16_u16( vmovl_u8( u8_u64(in2.val[0]) ) );
            in1w = qs16_u16( vmovl_u8( u8_u64(in2.val[1]) ) );
            in2w = qs16_u16( vmovl_u8( u8_u64(in2.val[2]) ) );
            in3w = qs16_u16( vmovl_u8( u8_u64(in2.val[3]) ) );
            res[i+2] = vmulq_lane_s16(           in0w, hco0w, 0 );
            res[i+2] = vmlaq_lane_s16( res[i+2], in1w, hco0w, 1 );
            res[i+2] = vmlaq_lane_s16( res[i+2], in2w, hco0w, 2 );
            res[i+2] = vmlaq_lane_s16( res[i+2], in3w, hco0w, 3 );

            // Col 3/7
            hco0w = vget_low_s16( vmovl_s8( s8_u64(vcreate_u64( hco3 )) ) );
            in0w = qs16_u16( vmovl_u8( u8_u64(in3.val[0]) ) );
            in1w = qs16_u16( vmovl_u8( u8_u64(in3.val[1]) ) );
            in2w = qs16_u16( vmovl_u8( u8_u64(in3.val[2]) ) );
            in3w = qs16_u16( vmovl_u8( u8_u64(in3.val[3]) ) );
            res[i+3] = vmulq_lane_s16(           in0w, hco0w, 0 );
            res[i+3] = vmlaq_lane_s16( res[i+3], in1w, hco0w, 1 );
            res[i+3] = vmlaq_lane_s16( res[i+3], in2w, hco0w, 2 );
            res[i+3] = vmlaq_lane_s16( res[i+3], in3w, hco0w, 3 );
        }

Rounding and packing is the same, except the results are combined pairwise into 16-byte vectors for the wide transpose:

        // Round and pack
        uint8x16_t r01 = vcombine_u8( vqshrun_n_s16( vaddq_s16( res[0], rv ), 6 ), vqshrun_n_s16( vaddq_s16( res[1], rv ), 6 ) );
        uint8x16_t r23 = vcombine_u8( vqshrun_n_s16( vaddq_s16( res[2], rv ), 6 ), vqshrun_n_s16( vaddq_s16( res[3], rv ), 6 ) );
        uint8x16_t r45 = vcombine_u8( vqshrun_n_s16( vaddq_s16( res[4], rv ), 6 ), vqshrun_n_s16( vaddq_s16( res[5], rv ), 6 ) );
        uint8x16_t r67 = vcombine_u8( vqshrun_n_s16( vaddq_s16( res[6], rv ), 6 ), vqshrun_n_s16( vaddq_s16( res[7], rv ), 6 ) );

The transpose. Note similarities with the h2v() function:

        // Undo transpose
        uint8x16x2_t tr0 = vzipq_u8( r01, r23 );
        uint8x16x2_t tr1 = vzipq_u8( r45, r67 );
        uint8x16x2_t tr2 = vzipq_u8( tr0.val[0], tr0.val[1] );
        uint8x16x2_t tr3 = vzipq_u8( tr1.val[0], tr1.val[1] );
        uint32x4x2_t tr4 = vzipq_u32( qu32_u8(tr2.val[0]), qu32_u8(tr3.val[0]) );
        uint32x4x2_t tr5 = vzipq_u32( qu32_u8(tr2.val[1]), qu32_u8(tr3.val[1]) );

Write 8x8 results to the final destination buffer.

        // Store final result
        uint8_t *tmp = dst;
        vst1_u32( (uint32_t *)tmp, vget_low_u32(  tr4.val[0] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_high_u32( tr4.val[0] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_low_u32(  tr4.val[1] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_high_u32( tr4.val[1] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_low_u32(  tr5.val[0] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_high_u32( tr5.val[0] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_low_u32(  tr5.val[1] ) ); tmp += dst_stride;
        vst1_u32( (uint32_t *)tmp, vget_high_u32( tr5.val[1] ) ); tmp += dst_stride;

        dst += 8;
    }
}

We're done! That wasn't so hard, was it?
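
For cross-checking on a machine without Neon, here is a scalar sketch of what the horizontal pass computes on the transposed buffer. It rests on several assumptions that are not taken from the archive: 8 bytes per source column (one per row), coefficients already unpacked into a flat table of 64 phases with 4 taps each, a rounding value of 32, and a start position that already points at the first tap (the real code adjusts xoffset and relies on the padding). All names are hypothetical.

```c
#include <stdint.h>

// Add-round, shift right by 6, saturate to 0..255 (rounding value assumed).
static uint8_t pack6( int acc )
{
    int v = acc + 32;
    if( v < 0 ) return 0;
    v >>= 6;
    return v > 255 ? 255 : (uint8_t)v;
}

// src:    transposed buffer, 8 bytes per source column
// xx:     16.16 position of the first tap for output pixel 0 (non-negative)
// xadd:   16.16 step per output pixel
// coeffs: 64 phases * 4 taps, unpacked
static void horiz_scalar_8rows( const uint8_t *src, int32_t xx, int32_t xadd,
                                const int8_t *coeffs, int width,
                                uint8_t *dst, int dst_stride )
{
    for( int x = 0; x < width; x++, xx += xadd ) {
        const uint8_t *col = src + ((xx >> 16) << 3);              // first tap column
        const int8_t  *c   = coeffs + 4 * (((uint32_t)xx >> 10) & 63);

        for( int r = 0; r < 8; r++ ) {                             // 8 rows per column
            int acc = 0;
            for( int k = 0; k < 4; k++ )
                acc += c[k] * col[k * 8 + r];
            dst[r * dst_stride + x] = pack6( acc );
        }
    }
}

// Returns 1 if scaling a horizontal ramp by a step of 1.5 gives the expected pixels.
static int horiz_check( void )
{
    static int8_t coeffs[64 * 4];     // zeroed; only phases 0 and 32 are filled in
    uint8_t src[12 * 8], dst[8 * 4];
    const int expect[4] = { 1, 3, 4, 6 };

    coeffs[0 * 4 + 1] = 64;                                   // phase 0: pick tap 1
    coeffs[32 * 4 + 0] = -4; coeffs[32 * 4 + 1] = 36;         // phase 32: made-up
    coeffs[32 * 4 + 2] = 36; coeffs[32 * 4 + 3] = -4;         // halfway set

    for( int x = 0; x < 12; x++ )
        for( int r = 0; r < 8; r++ )
            src[x * 8 + r] = (uint8_t)(r * 16 + x);           // pixel (r, x) = r*16 + x

    horiz_scalar_8rows( src, 0, 0x18000, coeffs, 4, dst, 4 );

    for( int r = 0; r < 8; r++ )
        for( int x = 0; x < 4; x++ )
            if( dst[r * 4 + x] != r * 16 + expect[x] ) return 0;
    return 1;
}
```

Comparing the output of a loop like this against the Neon version for a few scale factors is one way to catch off-by-one errors in the start offset.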

NVidia Jetson AGX Xavier Developer Kit

Pulling the Trigger

New source file with support functions:

scale_neon_intrinsics.c

It needs all the old files unchanged, so, again, toss the new file into the pond. Next, locate a suitable modern ARM to test it on. First out is the good, old NVidia AGX Xavier Developer Kit, which I did the majority of development on. Here goes nothing!

Intrinsics version (2023) on NVidia Xavier, Carmel ARMv8.2-A @ 2.26 GHz (2018)

nils@magneto:~/src/imagetrans$ gcc -O3 -Wall -o scaler scale_neon_intrinsics.c bmp_planar.c coeffs.c main.c -lm
nils@magneto:~/src/imagetrans$

nils@magneto:~/src/imagetrans$ ./scaler -m 1
(...)
total: 3451.478271ms, per frame: 3.451478ms, fps: 289.730957

nils@magneto:~/src/imagetrans$ ./scaler -m 300
(...)
total: 3465.585205ms, per frame: 3.465585ms, fps: 288.551575

As always with such weird cores, results are doomed to be weird. For once, even I was befuddled. Why on planet Earth (and possibly some moons of Jupiter) are the results that good? Is their code translation actually working? Did they make a less crap cache and memory subsystem for a change? Why do the hoi polloi write programs in Python? Many questions, but this time there are no answers in sight. Maybe my skepticism about the Carmel cores in the book was off the mark. Oh well, it happens. I was right about the Denver 2 cores, though. They both suck and blow at the same time.

Anyway, I'll pawn the Raspberry Pi 2 for a busload of cash and buy an NVidia Orin (with normal ARM cores) for the next article. Come to think of it: An Orin is exorbitantly expensive, whereas the value of a used Pi 2 is, at best, dubious. Yeah, that's not gonna happen.

Back to the testing. Sjur found a Jetson Nano in his bag. I like the Nano. And Sjur's bag of many things. The Nano is based on aging hardware, but it's cheap, small, and has a fully usable GPU (not that mobile crap) for not a lot of cash. Did someone just say Weeaboo... [pbfcomics.com] sorry, Jetson Nano Xmas Demo? We have top men working on it right now.

Intrinsics version (2023) on NVidia Jetson Nano, Cortex-A57 ARMv8-A @ 1.43 GHz (2019, based on the 2015 TX1)

nils@mork:~/src/imagetrans$ gcc -O3 -Wall -o scaler scale_neon_intrinsics.c bmp_planar.c coeffs.c main.c -lm
nils@mork:~/src/imagetrans$

nils@mork:~/src/imagetrans$ ./scaler -m 1
(...)
total: 10964.515625ms, per frame: 10.964516ms, fps: 91.203300

nils@mork:~/src/imagetrans$ ./scaler -m 300
(...)
total: 11152.904297ms, per frame: 11.152905ms, fps: 89.662743

Interesting. It does 90 fps on a clunky old A57 from 2015 running at a measly 1.43 GHz. Remember, it's a 4-tap filter in both directions. Is my scaler fast? My Magic 8 Ball [en.wikipedia.org] says "Signs point to yes". It's rarely wrong. Job done.

Things Left Unsaid

I almost forgot that! Let's try the intrinsics version on the Pi 2 with just 16 Neon registers:

Intrinsics version (2023) on Pi 2 Model B v1.1, ARMv7 @ 900 MHz (2015)

nils@pitfall:~/src/imagetrans$ gcc -O3 -Wall -march=native -mfpu=neon -o scaler scale_neon_intrinsics.c bmp_planar.c coeffs.c main.c -lm
nils@pitfall:~/src/imagetrans$

nils@pitfall:~/src/imagetrans$ ./scaler -m 1
(...)
total: 54601.652344ms, per frame: 54.601654ms, fps: 18.314466

nlc@pitfall:~/src/imagetrans $ ./scaler -m 50
(...)
total: 53738.363281ms, per frame: 53.738365ms, fps: 18.608681

The Assembler version is 38% faster on the Pi 2. Not unexpected. Better results can probably be obtained with both versions. The basic structure is there: Feel free to improve it further.
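
The percentage follows from the two "-m 1" runs above: 25.376513 fps for the Assembler version against 18.314466 fps for the intrinsics version, which works out to a little over 38%. A tiny helper for redoing that arithmetic with other runs (the function name is made up):

```c
// Percentage by which fps_a exceeds fps_b.
static double faster_percent( double fps_a, double fps_b )
{
    return (fps_a / fps_b - 1.0) * 100.0;
}
```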

Everything Everywhere All the Source

Complete source code archive:

neon_scaler_2023.zip (browse)

Complete list of files in the archive. New and modified files in bold at the top.

scale_neon_intrinsics.c
scale_neon_asm.c
scale_sse2.c
scale.h
coeffs.c - generates Lanczos-2 coefficients
coeffs.h
bmp_planar.c - loads/saves BMP files, splits into planes
bmp_planar.h
main.c - has compilation instructions

An Addendum About Apple

As mentioned in News 2023, the code was tested on Apple's M1 and M2 chips in September/October 2023. Here are the numbers:

Intrinsics version (2023) on Apple Mac Mini, M1 ARMv8.5-A @ 3.2 GHz (2020)

nils@merrimac scaler % gcc -O3 -Wall -o scaler scale_neon_intrinsics.c bmp_planar.c coeffs.c main.c -lm

nils@merrimac scaler % ./scaler -m 10
(...)
total: 1365.284058ms, per frame: 1.365284ms, fps: 732.448303

nils@merrimac scaler % ./scaler -m 300
(...)
total: 1373.613037ms, per frame: 1.373613ms, fps: 728.007080

Intrinsics version (2023) on Apple MacBook Air, M2 ARMv8.5-A @ 3.49 GHz (2022)

pak@helvetia scaler % ./scaler -m 10
(...)
total: 1261.817993ms, per frame: 1.261818ms, fps: 792.507324

pak@helvetia scaler % ./scaler -m 300
(...)
total: 1256.837036ms, per frame: 1.256837ms, fps: 795.648071

It's very fast on Apple's ARM variants. Cool!

Feedback

Technical comments are welcome. Contact information is available here.

If you have a problem with the article text in any way, please watch this informative video:

Remember to appreciate this SMBC strip called "Sheep" [smbc-comics.com].

Article Licensing Information

This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary: You may copy and redistribute the material in any medium or format for any purpose, even commercially. You must give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you may not distribute the modified material.

Licensed Items

This page uses a panel from the webcomic Abstruse Goose, strip 432: O.P.C.
Credits: "The Abstruse Goose comic is a subsidiary of the powerful and evil Abstruse Goose Corporation."
License: Attribution-NonCommercial 3.0 United States (CC BY-NC 3.0 US)

Ignorantus AS