Ok, so I got this cool new i7 Sandy Bridge system. I haven't written much Intel-specific code, so there's a perfect opportunity to learn something new. In somewhat related news, gcc has actually started to suck less and provide decent intrinsics support. Gcc still sucks, only slightly less.
So... let's try to make a reasonably fast YUV420 to RGB conversion routine using SSE2 intrinsics. I didn't bother searching the net for other implementations, so I have no idea how it performs. It's a learning exercise. On my test data set it performs pretty decent. And it'll be readable and comprehensible. Or illegible and incomprehensible, I forget which. Either way is good.
YUV to RGB conversion is defined as follows in Video Demystified:
B = 1.164(Y - 16) + 2.018(U - 128) G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128) R = 1.164(Y - 16) + 1.596(V - 128) saturate results to 0..255.
The correctness of this formula can be discussed. That's not the point of this exercise. Replacing the code below with your own formula should be easy.
SSE2 has 128 bit registers and can do 8 16-bit multiplies extremely fast. So, let's start off by eliminating the floating point math and using 16 bit integers instead. Easy, multiply the factors by 256. Next, ponder how to fit it into signed 16 bit. Also easy, divide by 4. Shift factors by 6 and we're good to go. I said "reasonably fast", not "precise".
We're converting YUV420 to RGB, meaning we have separate Y/U/V planes and want 32-bit packed RGB output. The U/V planes are 1/4 the size of Y. So, we got a lot of common factors and can process two rows at a time.
As mentioned, the code is divided into somewhat coherent sections for readability. Surely it can be rearranged for more speed. Or even changed completely in case I missed something obvious. Send me a mail if you think I did. Complete code available at the bottom.
Some necessary constants first, as reasoned above.
ysub = _mm_set1_epi32( 0x00100010 ); uvsub = _mm_set1_epi32( 0x00800080 ); facy = _mm_set1_epi32( 0x004a004a ); facrv = _mm_set1_epi32( 0x00660066 ); facgu = _mm_set1_epi32( 0x00190019 ); facgv = _mm_set1_epi32( 0x00340034 ); facbu = _mm_set1_epi32( 0x00810081 ); zero = _mm_set1_epi32( 0x00000000 );
We start off by loading 8 bytes of data from u and v, and 16 from the two y rows. We're processing 32 pixels at a time.
u0 = _mm_loadl_epi64( (__m128i *)srcu64 ); srcu64++; v0 = _mm_loadl_epi64( (__m128i *)srcv64 ); srcv64++; y0r0 = _mm_load_si128( srcy128r0++ ); y0r1 = _mm_load_si128( srcy128r1++ );
Next, we expand to 16 bit, calculate the constant y factors, and subtract 16:
y00r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r0, zero ), ysub ), facy ); y01r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r0, zero ), ysub ), facy ); y00r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r1, zero ), ysub ), facy ); y01r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r1, zero ), ysub ), facy );
Then it's time to prepare the uv factors that are common on both rows. Since using SSE2 multipliers is cheap, I expand the u and v data first and use 8 mullo instead of the obvious 4. Expanding afterwards is costly. I tried it.
// expand u and v so they're aligned with y values u0 = _mm_unpacklo_epi8( u0, zero ); u00 = _mm_sub_epi16( _mm_unpacklo_epi16( u0, u0 ), uvsub ); u01 = _mm_sub_epi16( _mm_unpackhi_epi16( u0, u0 ), uvsub ); v0 = _mm_unpacklo_epi8( v0, zero ); v00 = _mm_sub_epi16( _mm_unpacklo_epi16( v0, v0 ), uvsub ); v01 = _mm_sub_epi16( _mm_unpackhi_epi16( v0, v0 ), uvsub ); // common factors on both rows. rv00 = _mm_mullo_epi16( facrv, v00 ); rv01 = _mm_mullo_epi16( facrv, v01 ); gu00 = _mm_mullo_epi16( facgu, u00 ); gu01 = _mm_mullo_epi16( facgu, u01 ); gv00 = _mm_mullo_epi16( facgv, v00 ); gv01 = _mm_mullo_epi16( facgv, v01 ); bu00 = _mm_mullo_epi16( facbu, u00 ); bu01 = _mm_mullo_epi16( facbu, u01 );
Now it's trivial to calculate the r/g/b planar values by summing things together as specified and shifting down by 6, the multiplier we used on the factors.
r00 = _mm_srai_epi16( _mm_add_epi16( y00r0, rv00 ), 6 ); r01 = _mm_srai_epi16( _mm_add_epi16( y01r0, rv01 ), 6 ); g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r0, gu00 ), gv00 ), 6 ); g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r0, gu01 ), gv01 ), 6 ); b00 = _mm_srai_epi16( _mm_add_epi16( y00r0, bu00 ), 6 ); b01 = _mm_srai_epi16( _mm_add_epi16( y01r0, bu01 ), 6 );
The remaining challenge is saturating and packing the results into chunky pixels efficiently. Luckily, we have just the instructions for the job:
r00 = _mm_packus_epi16( r00, r01 ); // rrrr.. saturated g00 = _mm_packus_epi16( g00, g01 ); // gggg.. saturated b00 = _mm_packus_epi16( b00, b01 ); // bbbb.. saturated r01 = _mm_unpacklo_epi8( r00, zero ); // 0r0r.. gbgb = _mm_unpacklo_epi8( b00, g00 ); // gbgb.. rgb0123 = _mm_unpacklo_epi16( gbgb, r01 ); // 0rgb0rgb.. rgb4567 = _mm_unpackhi_epi16( gbgb, r01 ); // 0rgb0rgb.. r01 = _mm_unpackhi_epi8( r00, zero ); gbgb = _mm_unpackhi_epi8( b00, g00 ); rgb89ab = _mm_unpacklo_epi16( gbgb, r01 ); rgbcdef = _mm_unpackhi_epi16( gbgb, r01 );
We're just about done, just store the finished pixels first:
_mm_store_si128( dstrgb128r0++, rgb0123 ); _mm_store_si128( dstrgb128r0++, rgb4567 ); _mm_store_si128( dstrgb128r0++, rgb89ab ); _mm_store_si128( dstrgb128r0++, rgbcdef );
That concludes the work necessary for row 0. Repeat the last three steps for row 1, replacing the y values and target pointer, and we're done.
A complete routine would look something like this. You might want to put some more work into calculating the strides. The code is fast enough for all practical purposes I have.
void yuv420_to_argb8888( uint8_t *yp, uint8_t *up, uint8_t *vp, uint32_t sy, uint32_t suv, int width, int height, uint32_t *rgb, uint32_t srgb ) { __m128i y0r0, y0r1, u0, v0; __m128i y00r0, y01r0, y00r1, y01r1; __m128i u00, u01, v00, v01; __m128i rv00, rv01, gu00, gu01, gv00, gv01, bu00, bu01; __m128i r00, r01, g00, g01, b00, b01; __m128i rgb0123, rgb4567, rgb89ab, rgbcdef; __m128i gbgb; __m128i ysub, uvsub; __m128i zero, facy, facrv, facgu, facgv, facbu; __m128i *srcy128r0, *srcy128r1; __m128i *dstrgb128r0, *dstrgb128r1; __m64 *srcu64, *srcv64; int x, y; ysub = _mm_set1_epi32( 0x00100010 ); uvsub = _mm_set1_epi32( 0x00800080 ); facy = _mm_set1_epi32( 0x004a004a ); facrv = _mm_set1_epi32( 0x00660066 ); facgu = _mm_set1_epi32( 0x00190019 ); facgv = _mm_set1_epi32( 0x00340034 ); facbu = _mm_set1_epi32( 0x00810081 ); zero = _mm_set1_epi32( 0x00000000 ); for( y = 0; y < height; y += 2 ) { srcy128r0 = (__m128i *)(yp + sy*y); srcy128r1 = (__m128i *)(yp + sy*y + sy); srcu64 = (__m64 *)(up + suv*(y/2)); srcv64 = (__m64 *)(vp + suv*(y/2)); dstrgb128r0 = (__m128i *)(rgb + srgb*y); dstrgb128r1 = (__m128i *)(rgb + srgb*y + srgb); for( x = 0; x < width; x += 16 ) { u0 = _mm_loadl_epi64( (__m128i *)srcu64 ); srcu64++; v0 = _mm_loadl_epi64( (__m128i *)srcv64 ); srcv64++; y0r0 = _mm_load_si128( srcy128r0++ ); y0r1 = _mm_load_si128( srcy128r1++ ); // constant y factors y00r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r0, zero ), ysub ), facy ); y01r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r0, zero ), ysub ), facy ); y00r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r1, zero ), ysub ), facy ); y01r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r1, zero ), ysub ), facy ); // expand u and v so they're aligned with y values u0 = _mm_unpacklo_epi8( u0, zero ); u00 = _mm_sub_epi16( _mm_unpacklo_epi16( u0, u0 ), uvsub ); u01 = _mm_sub_epi16( _mm_unpackhi_epi16( u0, u0 ), uvsub ); v0 = _mm_unpacklo_epi8( v0, zero ); v00 = _mm_sub_epi16( _mm_unpacklo_epi16( v0, v0 ), uvsub ); v01 = _mm_sub_epi16( _mm_unpackhi_epi16( v0, v0 ), uvsub ); // common factors on both rows. rv00 = _mm_mullo_epi16( facrv, v00 ); rv01 = _mm_mullo_epi16( facrv, v01 ); gu00 = _mm_mullo_epi16( facgu, u00 ); gu01 = _mm_mullo_epi16( facgu, u01 ); gv00 = _mm_mullo_epi16( facgv, v00 ); gv01 = _mm_mullo_epi16( facgv, v01 ); bu00 = _mm_mullo_epi16( facbu, u00 ); bu01 = _mm_mullo_epi16( facbu, u01 ); // row 0 r00 = _mm_srai_epi16( _mm_add_epi16( y00r0, rv00 ), 6 ); r01 = _mm_srai_epi16( _mm_add_epi16( y01r0, rv01 ), 6 ); g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r0, gu00 ), gv00 ), 6 ); g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r0, gu01 ), gv01 ), 6 ); b00 = _mm_srai_epi16( _mm_add_epi16( y00r0, bu00 ), 6 ); b01 = _mm_srai_epi16( _mm_add_epi16( y01r0, bu01 ), 6 ); r00 = _mm_packus_epi16( r00, r01 ); // rrrr.. saturated g00 = _mm_packus_epi16( g00, g01 ); // gggg.. saturated b00 = _mm_packus_epi16( b00, b01 ); // bbbb.. saturated r01 = _mm_unpacklo_epi8( r00, zero ); // 0r0r.. gbgb = _mm_unpacklo_epi8( b00, g00 ); // gbgb.. rgb0123 = _mm_unpacklo_epi16( gbgb, r01 ); // 0rgb0rgb.. rgb4567 = _mm_unpackhi_epi16( gbgb, r01 ); // 0rgb0rgb.. r01 = _mm_unpackhi_epi8( r00, zero ); gbgb = _mm_unpackhi_epi8( b00, g00 ); rgb89ab = _mm_unpacklo_epi16( gbgb, r01 ); rgbcdef = _mm_unpackhi_epi16( gbgb, r01 ); _mm_store_si128( dstrgb128r0++, rgb0123 ); _mm_store_si128( dstrgb128r0++, rgb4567 ); _mm_store_si128( dstrgb128r0++, rgb89ab ); _mm_store_si128( dstrgb128r0++, rgbcdef ); // row 1 r00 = _mm_srai_epi16( _mm_add_epi16( y00r1, rv00 ), 6 ); r01 = _mm_srai_epi16( _mm_add_epi16( y01r1, rv01 ), 6 ); g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r1, gu00 ), gv00 ), 6 ); g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r1, gu01 ), gv01 ), 6 ); b00 = _mm_srai_epi16( _mm_add_epi16( y00r1, bu00 ), 6 ); b01 = _mm_srai_epi16( _mm_add_epi16( y01r1, bu01 ), 6 ); r00 = _mm_packus_epi16( r00, r01 ); // rrrr.. saturated g00 = _mm_packus_epi16( g00, g01 ); // gggg.. saturated b00 = _mm_packus_epi16( b00, b01 ); // bbbb.. saturated r01 = _mm_unpacklo_epi8( r00, zero ); // 0r0r.. gbgb = _mm_unpacklo_epi8( b00, g00 ); // gbgb.. rgb0123 = _mm_unpacklo_epi16( gbgb, r01 ); // 0rgb0rgb.. rgb4567 = _mm_unpackhi_epi16( gbgb, r01 ); // 0rgb0rgb.. r01 = _mm_unpackhi_epi8( r00, zero ); gbgb = _mm_unpackhi_epi8( b00, g00 ); rgb89ab = _mm_unpacklo_epi16( gbgb, r01 ); rgbcdef = _mm_unpackhi_epi16( gbgb, r01 ); _mm_store_si128( dstrgb128r1++, rgb0123 ); _mm_store_si128( dstrgb128r1++, rgb4567 ); _mm_store_si128( dstrgb128r1++, rgb89ab ); _mm_store_si128( dstrgb128r1++, rgbcdef ); } } }
Source code: yuv2rgb_sse2_2012.cpp
Comments are always appreciated. Contact information is available here.
Remember to appreciate this classic Abstruse Goose strip.
This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary:
You may copy and redistribute the material in any medium or format for any purpose, even commercially.
You must give appropriate credit, provide a link to the license, and indicate if changes were made.
If you remix, transform, or build upon the material, you may not distribute the modified material.