Nils Liaaen Corneliusen

24 August 2012

Halide has been covered in the tech press lately. The PDF called Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines is indeed an interesting read. So why do I bring it up here? Well, to have a look at the initial code example used, a 3x3 box filter. Page 1, top right corner, "Fast C++ (for x86)".

Exactly how effective is that code on a modern x86 core? Clearly, steps have been taken to try to make it reasonably efficient, but must the code look like a train wreck to accomplish that? I dust off my i7-950 to do the initial work. Later I'll look at some newer and older Intel chips to find out what's going on here.

I'm not into C++ so let's just rewrite it in straight C using uint16_t pointers.

void fast_blur_halide( const uint16_t *in, uint16_t *blurred, int width, int height ) { int x, y; int xTile, yTile; __m128i one_third; one_third = _mm_set1_epi16(21846); for( yTile = 0; yTile < height; yTile += 32 ) { __m128i a, b, c; __m128i sum, avg; __m128i tmp[(256/8)*(32+2)]; for( xTile = 0; xTile < width; xTile += 256 ) { __m128i *tmpPtr = tmp; for (y = -1; y < 32+1; y++) { const uint16_t *inPtr = &in[(yTile+y)*width + xTile]; for (x = 0; x < 256; x += 8) { a = _mm_loadu_si128( (__m128i*)(inPtr-1) ); b = _mm_loadu_si128( (__m128i*)(inPtr+1) ); c = _mm_load_si128( (__m128i*)(inPtr) ); sum = _mm_add_epi16( _mm_add_epi16( a, b ), c ); avg = _mm_mulhi_epi16( sum, one_third ); _mm_store_si128( tmpPtr++, avg ); inPtr += 8; } } tmpPtr = tmp; for (y = 0; y < 32; y++) { __m128i *outPtr = (__m128i *)&blurred[(yTile+y)*width + xTile]; for( x = 0; x < 256; x += 8 ) { a = _mm_load_si128( tmpPtr+(2*256)/8 ); b = _mm_load_si128( tmpPtr+256/8 ); c = _mm_load_si128( tmpPtr++ ); sum = _mm_add_epi16( _mm_add_epi16( a, b ), c ); avg = _mm_mulhi_epi16( sum, one_third ); _mm_store_si128( outPtr++, avg ); } } } } }

As mentioned they've put some work into finding the correct sizes of the blocks. I will look into that later. The result is an instruction count that's not too bad:

Box filter 3x3 instruction count, 8192*8192 image Halide mul 17301504 add 34603008 loadu 17825792 loada 34078720 store 17301504

So let's try to attack the instruction count first. We try out the first extreme and do everything vertical to reuse multiplies maximally:

void fast_blur_halide( const uint16_t *in, uint16_t *blurred, int width, int height ) { int x, y; __m128i one_third; __m128i *dst; one_third = _mm_set1_epi16(21846); for( x = 0; x < width; x += 8 ) { __m128i s00, s01, s02, s10, s11, s12, s20, s21, s22; __m128i r00, r10, r20; const uint16_t *row0, *rowp, *rown; row0 = in + x; rowp = row0 - width; rown = row0 + width; dst = (__m128i *)(blurred + x); s00 = _mm_loadu_si128( (__m128i*)(rowp-1) ); s01 = _mm_loadu_si128( (__m128i*)(rowp+1) ); s02 = _mm_load_si128( (__m128i*)(rowp) ); r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third ); s10 = _mm_loadu_si128( (__m128i*)(row0-1) ); s11 = _mm_loadu_si128( (__m128i*)(row0+1) ); s12 = _mm_load_si128( (__m128i*)(row0) ); r10 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s10, s11 ), s12 ), one_third ); for( y = 0; y < height; y++ ) { s20 = _mm_loadu_si128( (__m128i*)(rown-1) ); s21 = _mm_loadu_si128( (__m128i*)(rown+1) ); s22 = _mm_load_si128( (__m128i*)(rown) ); r20 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s20, s21 ), s22 ), one_third ); _mm_store_si128( dst, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r10 ), r20 ), one_third ) ); r00 = r10; r10 = r20; rown += width; dst += (width>>3); } } }

Now we got a couple of code sets to compare. This gives the following instruction count:

Box filter 3x3 instruction count, 8192*8192 image Halide Vertical mul 17301504 8390656 add 34603008 16781312 loadu 17825792 16781312 loada 34078720 8390656 store 17301504 8388608

That's pretty sweet. Unfortunately this goes all the way to hell when measuring execution time. It's 4.5 times slower on the i7-950. Crap. Let's try the other extreme instead.

The other extreme is of course a fully horizontal routine where we read three lines in parallel and completely ignore the overlap. (Actually I wrote this version first due to earlier experiences in the yuv converter, but never mind.)

void fast_blur_horiz( const uint16_t *in, uint16_t *blurred, int width, int height ) { int x, y; __m128i one_third; __m128i *dst; one_third = _mm_set1_epi16(21846); dst = (__m128i *)blurred; for( y = 0; y < height; y++ ) { const uint16_t *row0, *rowp, *rown; row0 = &in[y*width]; rowp = row0 - width; rown = row0 + width; for( x = 0; x < width; x += 8 ) { __m128i s0 ,s1 ,s2; __m128i r0, r1, r2; s0 = _mm_loadu_si128( (__m128i*)(row0-1) ); s1 = _mm_loadu_si128( (__m128i*)(row0+1) ); s2 = _mm_load_si128( (__m128i*)(row0) ); r0 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third ); s0 = _mm_loadu_si128( (__m128i*)(rowp-1) ); s1 = _mm_loadu_si128( (__m128i*)(rowp+1) ); s2 = _mm_load_si128( (__m128i*)(rowp) ); r1 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third ); s0 = _mm_loadu_si128( (__m128i*)(rown-1) ); s1 = _mm_loadu_si128( (__m128i*)(rown+1) ); s2 = _mm_load_si128( (__m128i*)(rown) ); r2 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third ); _mm_store_si128( dst++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r0, r1 ), r2 ), one_third ) ); row0 += 8; rowp += 8; rown += 8; } } }

That's pretty and readable, but the instruction count is a complete disaster:

Box filter 3x3 instruction count, 8192*8192 image Halide Horiz Vertical mul 17301504 33554432 8390656 add 34603008 67108864 16781312 loadu 17825792 50331648 16781312 loada 34078720 25165824 8390656 store 17301504 8388608 8388608

Fortunately, it's a complete win when we consider execution times. It's about 31% faster than the original code on the i7-950. Neat.

Can we do any better? Actually, yes. We can read one more line and have two destinations to make it 36% faster:

void fast_blur_horiz2d( const uint16_t *in, uint16_t *blurred, int width, int height ) { int x, y; __m128i one_third; __m128i *dst0, *dst1; one_third = _mm_set1_epi16(21846); dst0 = (__m128i *)blurred; dst1 = (__m128i *)(blurred+width); for( y = 0; y < height; y += 2 ) { const uint16_t *row0, *row1, *row2, *row3; row1 = in + y*width; row0 = row1 - width; row2 = row1 + width; row3 = row2 + width; for( x = 0; x < width; x += 8 ) { __m128i s00 ,s01 ,s02; __m128i r00, r01, r02; s00 = _mm_loadu_si128( (__m128i*)(row0-1) ); s01 = _mm_loadu_si128( (__m128i*)(row0+1) ); s02 = _mm_load_si128( (__m128i*)(row0) ); r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third ); s00 = _mm_loadu_si128( (__m128i*)(row1-1) ); s01 = _mm_loadu_si128( (__m128i*)(row1+1) ); s02 = _mm_load_si128( (__m128i*)(row1) ); r01 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third ); s00 = _mm_loadu_si128( (__m128i*)(row2-1) ); s01 = _mm_loadu_si128( (__m128i*)(row2+1) ); s02 = _mm_load_si128( (__m128i*)(row2) ); r02 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third ); _mm_store_si128( dst0++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r01 ), r02 ), one_third ) ); s00 = _mm_loadu_si128( (__m128i*)(row3-1) ); s01 = _mm_loadu_si128( (__m128i*)(row3+1) ); s02 = _mm_load_si128( (__m128i*)(row3) ); r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third ); _mm_store_si128( dst1++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r01 ), r02 ), one_third ) ); row0 += 8; row1 += 8; row2 += 8; row3 += 8; } dst0 += width>>3; dst1 += width>>3; } }

Source code: box_sse2_2012.cpp

Instruction count for completeness:

Box filter 3x3 instruction count, 8192*8192 image Halide Horiz Horiz2d Vertical mul 17301504 33554432 16777216 8390656 add 34603008 67108864 33554432 16781312 loadu 17825792 50331648 33554432 16781312 loada 34078720 25165824 16777216 8390656 store 17301504 8388608 8388608 8388608

Expanding more in the same manner does not give any further reductions. So reading from 4 lines and writing 2 seems to be the optimal limit.

Let's sum up the execution times for my test system:

Image: 8192*8192, 100 iterations CPU: i7-950@3Ghz RAM: 6x2GB@1066Mhz Time Perc Vertical: 29541786 -326.2% Halide: 6931559 0.0% Horizontal: 4780441 31.0% Horizontal2d: 4421372 36.2%

How about Sandy Bridge? Everything runs significantly faster, and the gap is now even wider:

Image: 8192*8192, 100 iterations CPU: i7-2600K@3.4Ghz RAM: 4x4GB@1600Mhz Time Perc Vertical: 25291926 -474.3% Halide: 4403894 0.0% Horizontal2d: 2608385 40.8%

Basically, when there's not a lot of processing being done on the data, memory accesses will dominate. If we access memory linearly over not too many rows, things execute extremely fast on any newer Intel CPU.

But that doesn't explain the original code. How on earth did they come up with that code and those block sizes? A test on an old Core2 system makes it all clear. I'm only doing 10 iterations, for obvious reasons.

Image: 8192*8192, 10 iterations CPU: Q9550@2.8Ghz RAM: 4x2GB@800Mhz Time Perc Vertical: 7358867 -732.1% Halide: 884382 0.0% Horizontal: 1563866 -76.8% Horizontal2d: 1250236 -41.4%

Aha! The code is optimized for older Core2 based systems. The memory subsystem is obviously much smarter in newer models. While their approach works well for Core2, it's not the way to go for Core-i7 and later. That's good news; The Core2 approach is too messy.

Carl Zimmerman sent me an email on 11 July 2015:

The box filter code at https://www.ignorantus.com/box_sse2/

overflows. Is there a reason you or Halide didn't use the saturating add instruction?

i.e:

_mm_add_epi16 becomes _mm_adds_epi16

Which sounds correct. The test code only compared the outputs with memcmp(), so if you're going to use the code for anything, please replace the adds. It's always a good idea to check the output. Thanks, Carl!

If I've missed anything obvious you can figure out my email from the front page. Also, remember to appreciate this classic Abstruse Goose strip.