## YUV420 to RGB Conversion using SSE2

Nils L. Corneliusen
23 October 2012

## Introduction

Ok, so I got this cool new i7 Sandy Bridge system. I haven't written much Intel-specific code, so this is a perfect opportunity to learn something new. In somewhat related news, gcc has actually started to suck less and provide decent intrinsics support. It still sucks, just slightly less.

So... let's try to make a reasonably fast YUV420 to RGB conversion routine using SSE2 intrinsics. I didn't bother searching the net for other implementations, so I have no idea how it compares. It's a learning exercise. On my test data set it performs pretty decently. And it'll be readable and comprehensible. Or illegible and incomprehensible, I forget which. Either way is good.

YUV to RGB conversion is defined as follows in Video Demystified:

```
B = 1.164(Y - 16)                   + 2.018(U - 128)
G = 1.164(Y - 16) - 0.813(V - 128)  - 0.391(U - 128)
R = 1.164(Y - 16) + 1.596(V - 128)
saturate results to 0..255.
```

The correctness of this formula can be debated. That's not the point of this exercise. Replacing the code below with your own formula should be easy.

SSE2 has 128-bit registers and can do 8 16-bit multiplies wicked fast. So let's start off by eliminating the floating point math and using 16-bit integers instead. Easy: multiply the factors by 256. Next, ponder how to fit them into signed 16 bits. Also easy: divide by 4. The net scale is 64, so shift the factors by 6 and we're good to go. I said "reasonably fast", not "precise".

We're converting YUV420 to RGB, meaning we have separate Y/U/V planes and want 32-bit packed RGB output. The U/V planes are each 1/4 the size of Y (half resolution in both directions). So we get a lot of common factors and can process two rows at a time.

## Look Out Below

As mentioned, the code is divided into somewhat coherent sections for readability. Surely it can be rearranged for more speed. Or even changed completely, in case I missed something obvious. Send me a mail if you think I did. Complete code available at the bottom.

Some necessary constants first, as reasoned above.

```
ysub  = _mm_set1_epi32( 0x00100010 );
uvsub = _mm_set1_epi32( 0x00800080 );
facy  = _mm_set1_epi32( 0x004a004a );
facrv = _mm_set1_epi32( 0x00660066 );
facgu = _mm_set1_epi32( 0x00190019 );
facgv = _mm_set1_epi32( 0x00340034 );
facbu = _mm_set1_epi32( 0x00810081 );
zero  = _mm_set1_epi32( 0x00000000 );
```

We start off by loading 8 bytes of data from u and v, and 16 from each of the two y rows. So we're processing 32 pixels at a time: 16 per row across two rows.

```
u0   = _mm_loadl_epi64( (__m128i *)srcu64 ); srcu64++;
v0   = _mm_loadl_epi64( (__m128i *)srcv64 ); srcv64++;

y0r0 = _mm_load_si128( srcy128r0++ );
y0r1 = _mm_load_si128( srcy128r1++ );
```

Next, we expand to 16 bit, subtract 16, and multiply by the constant y factor:

```
y00r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r0, zero ), ysub ), facy );
y01r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r0, zero ), ysub ), facy );
y00r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r1, zero ), ysub ), facy );
y01r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r1, zero ), ysub ), facy );
```

Then it's time to prepare the uv factors that are common to both rows. Since SSE2 multiplies are cheap, I expand the u and v data first and use 8 mullos instead of the obvious 4. Expanding afterwards is costly. I tried it.

```
// expand u and v so they're aligned with y values
u0  = _mm_unpacklo_epi8( u0,  zero );
u00 = _mm_sub_epi16( _mm_unpacklo_epi16( u0, u0 ), uvsub );
u01 = _mm_sub_epi16( _mm_unpackhi_epi16( u0, u0 ), uvsub );

v0  = _mm_unpacklo_epi8( v0,  zero );
v00 = _mm_sub_epi16( _mm_unpacklo_epi16( v0, v0 ), uvsub );
v01 = _mm_sub_epi16( _mm_unpackhi_epi16( v0, v0 ), uvsub );

// common factors on both rows.
rv00 = _mm_mullo_epi16( facrv, v00 );
rv01 = _mm_mullo_epi16( facrv, v01 );
gu00 = _mm_mullo_epi16( facgu, u00 );
gu01 = _mm_mullo_epi16( facgu, u01 );
gv00 = _mm_mullo_epi16( facgv, v00 );
gv01 = _mm_mullo_epi16( facgv, v01 );
bu00 = _mm_mullo_epi16( facbu, u00 );
bu01 = _mm_mullo_epi16( facbu, u01 );
```

Now it's trivial to calculate the r/g/b planar values by summing things together as specified and shifting down by 6, undoing the scale we applied to the factors.

```
r00 = _mm_srai_epi16( _mm_add_epi16( y00r0, rv00 ), 6 );
r01 = _mm_srai_epi16( _mm_add_epi16( y01r0, rv01 ), 6 );
g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r0, gu00 ), gv00 ), 6 );
g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r0, gu01 ), gv01 ), 6 );
b00 = _mm_srai_epi16( _mm_add_epi16( y00r0, bu00 ), 6 );
b01 = _mm_srai_epi16( _mm_add_epi16( y01r0, bu01 ), 6 );
```

The remaining challenge is saturating and packing the results into chunky pixels efficiently. Luckily we have just the instructions for the job:

```
r00 = _mm_packus_epi16( r00, r01 );         // rrrr.. saturated
g00 = _mm_packus_epi16( g00, g01 );         // gggg.. saturated
b00 = _mm_packus_epi16( b00, b01 );         // bbbb.. saturated

r01     = _mm_unpacklo_epi8(  r00,  zero ); // 0r0r..
gbgb    = _mm_unpacklo_epi8(  b00,  g00 );  // gbgb..
rgb0123 = _mm_unpacklo_epi16( gbgb, r01 );  // 0rgb0rgb..
rgb4567 = _mm_unpackhi_epi16( gbgb, r01 );  // 0rgb0rgb..

r01     = _mm_unpackhi_epi8(  r00,  zero );
gbgb    = _mm_unpackhi_epi8(  b00,  g00 );
rgb89ab = _mm_unpacklo_epi16( gbgb, r01 );
rgbcdef = _mm_unpackhi_epi16( gbgb, r01 );
```

We're just about done; all that remains is storing the finished pixels:

```
_mm_store_si128( dstrgb128r0++, rgb0123 );
_mm_store_si128( dstrgb128r0++, rgb4567 );
_mm_store_si128( dstrgb128r0++, rgb89ab );
_mm_store_si128( dstrgb128r0++, rgbcdef );
```

That concludes the work necessary for row 0. Repeat the last three steps for row 1, replacing the y values and target pointer, and we're done.

## Mystery Maze

A complete routine would look something like this. You might want to put some more work into calculating the strides. The code is fast enough for any practical purpose I have for it.

```
#include <stdint.h>
#include <emmintrin.h>

void yuv420_to_argb8888( uint8_t *yp, uint8_t *up, uint8_t *vp,
                         uint32_t sy, uint32_t suv,
                         int width, int height,
                         uint32_t *rgb, uint32_t srgb )
{
    __m128i y0r0, y0r1, u0, v0;
    __m128i y00r0, y01r0, y00r1, y01r1;
    __m128i u00, u01, v00, v01;
    __m128i rv00, rv01, gu00, gu01, gv00, gv01, bu00, bu01;
    __m128i r00, r01, g00, g01, b00, b01;
    __m128i rgb0123, rgb4567, rgb89ab, rgbcdef;
    __m128i gbgb;
    __m128i ysub, uvsub;
    __m128i zero, facy, facrv, facgu, facgv, facbu;
    __m128i *srcy128r0, *srcy128r1;
    __m128i *dstrgb128r0, *dstrgb128r1;
    __m64   *srcu64, *srcv64;
    int x, y;

    ysub  = _mm_set1_epi32( 0x00100010 );
    uvsub = _mm_set1_epi32( 0x00800080 );

    facy  = _mm_set1_epi32( 0x004a004a );
    facrv = _mm_set1_epi32( 0x00660066 );
    facgu = _mm_set1_epi32( 0x00190019 );
    facgv = _mm_set1_epi32( 0x00340034 );
    facbu = _mm_set1_epi32( 0x00810081 );

    zero  = _mm_set1_epi32( 0x00000000 );

    for( y = 0; y < height; y += 2 ) {

        srcy128r0 = (__m128i *)(yp + sy*y);
        srcy128r1 = (__m128i *)(yp + sy*y + sy);
        srcu64    = (__m64 *)(up + suv*(y/2));
        srcv64    = (__m64 *)(vp + suv*(y/2));

        dstrgb128r0 = (__m128i *)(rgb + srgb*y);
        dstrgb128r1 = (__m128i *)(rgb + srgb*y + srgb);

        for( x = 0; x < width; x += 16 ) {

            u0 = _mm_loadl_epi64( (__m128i *)srcu64 ); srcu64++;
            v0 = _mm_loadl_epi64( (__m128i *)srcv64 ); srcv64++;

            y0r0 = _mm_load_si128( srcy128r0++ );
            y0r1 = _mm_load_si128( srcy128r1++ );

            // constant y factors
            y00r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r0, zero ), ysub ), facy );
            y01r0 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r0, zero ), ysub ), facy );
            y00r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpacklo_epi8( y0r1, zero ), ysub ), facy );
            y01r1 = _mm_mullo_epi16( _mm_sub_epi16( _mm_unpackhi_epi8( y0r1, zero ), ysub ), facy );

            // expand u and v so they're aligned with y values
            u0  = _mm_unpacklo_epi8( u0,  zero );
            u00 = _mm_sub_epi16( _mm_unpacklo_epi16( u0, u0 ), uvsub );
            u01 = _mm_sub_epi16( _mm_unpackhi_epi16( u0, u0 ), uvsub );

            v0  = _mm_unpacklo_epi8( v0,  zero );
            v00 = _mm_sub_epi16( _mm_unpacklo_epi16( v0, v0 ), uvsub );
            v01 = _mm_sub_epi16( _mm_unpackhi_epi16( v0, v0 ), uvsub );

            // common factors on both rows.
            rv00 = _mm_mullo_epi16( facrv, v00 );
            rv01 = _mm_mullo_epi16( facrv, v01 );
            gu00 = _mm_mullo_epi16( facgu, u00 );
            gu01 = _mm_mullo_epi16( facgu, u01 );
            gv00 = _mm_mullo_epi16( facgv, v00 );
            gv01 = _mm_mullo_epi16( facgv, v01 );
            bu00 = _mm_mullo_epi16( facbu, u00 );
            bu01 = _mm_mullo_epi16( facbu, u01 );

            // row 0
            r00 = _mm_srai_epi16( _mm_add_epi16( y00r0, rv00 ), 6 );
            r01 = _mm_srai_epi16( _mm_add_epi16( y01r0, rv01 ), 6 );
            g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r0, gu00 ), gv00 ), 6 );
            g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r0, gu01 ), gv01 ), 6 );
            b00 = _mm_srai_epi16( _mm_add_epi16( y00r0, bu00 ), 6 );
            b01 = _mm_srai_epi16( _mm_add_epi16( y01r0, bu01 ), 6 );

            r00 = _mm_packus_epi16( r00, r01 );         // rrrr.. saturated
            g00 = _mm_packus_epi16( g00, g01 );         // gggg.. saturated
            b00 = _mm_packus_epi16( b00, b01 );         // bbbb.. saturated

            r01     = _mm_unpacklo_epi8(  r00,  zero ); // 0r0r..
            gbgb    = _mm_unpacklo_epi8(  b00,  g00 );  // gbgb..
            rgb0123 = _mm_unpacklo_epi16( gbgb, r01 );  // 0rgb0rgb..
            rgb4567 = _mm_unpackhi_epi16( gbgb, r01 );  // 0rgb0rgb..

            r01     = _mm_unpackhi_epi8(  r00,  zero );
            gbgb    = _mm_unpackhi_epi8(  b00,  g00 );
            rgb89ab = _mm_unpacklo_epi16( gbgb, r01 );
            rgbcdef = _mm_unpackhi_epi16( gbgb, r01 );

            _mm_store_si128( dstrgb128r0++, rgb0123 );
            _mm_store_si128( dstrgb128r0++, rgb4567 );
            _mm_store_si128( dstrgb128r0++, rgb89ab );
            _mm_store_si128( dstrgb128r0++, rgbcdef );

            // row 1
            r00 = _mm_srai_epi16( _mm_add_epi16( y00r1, rv00 ), 6 );
            r01 = _mm_srai_epi16( _mm_add_epi16( y01r1, rv01 ), 6 );
            g00 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y00r1, gu00 ), gv00 ), 6 );
            g01 = _mm_srai_epi16( _mm_sub_epi16( _mm_sub_epi16( y01r1, gu01 ), gv01 ), 6 );
            b00 = _mm_srai_epi16( _mm_add_epi16( y00r1, bu00 ), 6 );
            b01 = _mm_srai_epi16( _mm_add_epi16( y01r1, bu01 ), 6 );

            r00 = _mm_packus_epi16( r00, r01 );         // rrrr.. saturated
            g00 = _mm_packus_epi16( g00, g01 );         // gggg.. saturated
            b00 = _mm_packus_epi16( b00, b01 );         // bbbb.. saturated

            r01     = _mm_unpacklo_epi8(  r00,  zero ); // 0r0r..
            gbgb    = _mm_unpacklo_epi8(  b00,  g00 );  // gbgb..
            rgb0123 = _mm_unpacklo_epi16( gbgb, r01 );  // 0rgb0rgb..
            rgb4567 = _mm_unpackhi_epi16( gbgb, r01 );  // 0rgb0rgb..

            r01     = _mm_unpackhi_epi8(  r00,  zero );
            gbgb    = _mm_unpackhi_epi8(  b00,  g00 );
            rgb89ab = _mm_unpacklo_epi16( gbgb, r01 );
            rgbcdef = _mm_unpackhi_epi16( gbgb, r01 );

            _mm_store_si128( dstrgb128r1++, rgb0123 );
            _mm_store_si128( dstrgb128r1++, rgb4567 );
            _mm_store_si128( dstrgb128r1++, rgb89ab );
            _mm_store_si128( dstrgb128r1++, rgbcdef );
        }
    }
}
```

## Source Code

Source code: yuv2rgb_sse2_2012.cpp

## Done

If I've missed anything obvious you can figure out my email from the front page. Also, remember to appreciate this classic Abstruse Goose strip.