## RGB to YUV conversion on Tilera TILE-Gx

Nils L. Corneliusen
12 April 2013

## Introduction

I've previously looked at making a reasonably fast YUV to RGB conversion routine for the Tilera TILE-Gx CPU. That conversion can't use the dual dot product instructions: some factors are subtracted and we run out of range and resolution, so muls and adds have to be used instead.

Going from RGB to YUV is another matter. All the factors are below 1 and Y is only positive. That means there's a decent chance to try these new, cool instructions. The dual dot product variants do two dot products of 4 bytes by 4 bytes at once, with the low and high 32 bits of the result holding the two results.
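To make the data flow concrete, here is a plain C model of the unsigned dual dot product. It's a sketch built from the description above, not Tilera's reference semantics, and the name `model_v1ddotpu` is made up for illustration:

```c
#include <stdint.h>

/* Hypothetical scalar model of an unsigned dual dot product.
 * Each 32 bit half of a and b is treated as four unsigned bytes;
 * the bytes are multiplied pairwise and summed into a 32 bit result
 * that lands in the same half of the output. */
static uint64_t model_v1ddotpu(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int half = 0; half < 2; half++) {
        uint32_t sum = 0;
        for (int i = 0; i < 4; i++) {
            unsigned shift = (unsigned)(half * 32 + i * 8);
            uint32_t x = (uint32_t)((a >> shift) & 0xff);
            uint32_t y = (uint32_t)((b >> shift) & 0xff);
            sum += x * y;
        }
        out |= (uint64_t)sum << (half * 32);
    }
    return out;
}
```

With four bytes of at most 255 each on both sides, a sum can never exceed 4 * 255 * 255, so it always fits in its 32 bit half.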

The conversion requirements are as follows, this time from CCIR601:

```
Y = R *  .299 + G *  .587 + B *   .114 + 16;
U = R * -.169 + G * -.332 + B *   .500 + 128;
V = R *  .500 + G * -.419 + B * -.0813 + 128;
saturate Y results.
```
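For checking the fixed point code later on, a floating point reference is handy. The following is a minimal sketch of the formulas above; the names `sat_u8` and `ref_rgb2yuv` are mine, and rounding to nearest plus clamping all three outputs are assumptions, since the requirements only mention saturating Y:

```c
#include <stdint.h>

/* Clamp to 0..255 and round to nearest (rounding mode is an assumption). */
static uint8_t sat_u8(double x)
{
    if (x < 0.0)   return 0;
    if (x > 255.0) return 255;
    return (uint8_t)(x + 0.5);
}

/* Hypothetical reference conversion for a single pixel. */
static void ref_rgb2yuv(uint8_t r, uint8_t g, uint8_t b,
                        uint8_t *y, uint8_t *u, uint8_t *v)
{
    *y = sat_u8(r *  0.299 + g *  0.587 + b *  0.114  +  16.0);
    *u = sat_u8(r * -0.169 + g * -0.332 + b *  0.500  + 128.0);
    *v = sat_u8(r *  0.500 + g * -0.419 + b * -0.0813 + 128.0);
}
```

Black comes out as Y=16, U=128, V=128, and white pushes Y past 255, which is why the saturation is needed.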

## Another visitor

First, constants for the factors are needed. All Y factors are positive and below 1 so they can be multiplied by 256. U and V have to be fitted into signed range, but they're all below 1 here too so multiplying by 128 gives us the best possible results with 8 bit signed multipliers.

```
#define cony 0x4d961d004d961d00
#define conu 0xebd64000ebd64000
#define conv 0x40cbf60040cbf600
```
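To see what's packed into those constants, the bytes can be pulled apart again. This is only an illustration: the helper names are made up, and the mapping of R, G and B to bytes 3, 2 and 1 of each 32 bit half is my reading of the constants against the formulas.

```c
#include <stdint.h>

/* Constants repeated from above so the block stands alone. */
#define cony 0x4d961d004d961d00
#define conu 0xebd64000ebd64000
#define conv 0x40cbf60040cbf600

/* Hypothetical helper: byte idx (0 = lowest) as an unsigned factor. */
static int factor_u8(uint64_t con, int idx)
{
    return (int)((con >> (idx * 8)) & 0xff);
}

/* Hypothetical helper: byte idx as a signed (two's complement) factor. */
static int factor_s8(uint64_t con, int idx)
{
    int v = factor_u8(con, idx);
    return v >= 128 ? v - 256 : v;
}
```

Read that way, the Y factors are 77, 150 and 29 out of 256, U is -21, -42 and 64 out of 128, and V is 64, -53 and -10 out of 128.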

Next, some rgba data is fetched and it's time to look at the basic steps needed for calculating y:

```
rgb01 = *src++; rgb23 = *src++; rgb45 = *src++; rgb67 = *src++;
rgb89 = *src++; rgbab = *src++; rgbcd = *src++; rgbef = *src++;

*pY++ = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
__insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
__insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );

(...)

*pY++ = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
__insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
__insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );
```

Let's look at that from the inside out. The 32 bit dot product results are too wide: since the factors were scaled up by 256, only the upper 8 bits of the lower 16 bits are needed. That would be very simple with a v4pack and v2pack pair as below, but an unsigned v4pack is not available. Since unsigned dot products are used and the factors sum to 1, the results can't overflow, so no clamping is needed until the constant add at the end. That's ok, we can get by with another cool instruction: shufflebytes. It takes any set of bytes from two registers and shuffles them as you want. Unfortunately it's limited to slot x0, but the code analysis later shows that's not a big problem. So each shufflebytes fills either the upper or the lower half of the result and clears the other half. OR the two together, add the constant with clamping. Job done.
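Per pixel, the arithmetic above boils down to very little. Here's a scalar sketch of what one Y byte goes through, using the factors from cony; `model_y` is a made-up name and this models only the math, not the byte shuffling:

```c
#include <stdint.h>

/* Hypothetical scalar model of the Y path for one pixel. */
static uint8_t model_y(uint8_t r, uint8_t g, uint8_t b)
{
    /* Unsigned dot product with factors scaled by 256.
     * The factors sum to 256, so this is at most 255 * 256. */
    uint32_t dot = 77u * r + 150u * g + 29u * b;

    /* Keep the upper 8 bits of the lower 16 bits, then add 16. */
    uint32_t y = (dot >> 8) + 16u;

    /* The final add saturates, like the clamping add in the vector code. */
    return (uint8_t)(y > 255u ? 255u : y);
}
```

Black gives 16, pure red gives 92, and white saturates at 255.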

Next, it's time to look at the U and V values:

```
*pU++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
__insn_v1ddotpus( rgb45, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
__insn_v1ddotpus( rgb01, conu ) ), 1 ) ), -128 );

*pV++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
__insn_v1ddotpus( rgb45, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
__insn_v1ddotpus( rgb01, conv ) ), 1 ) ), -128 );

(...)

*pU++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
__insn_v1ddotpus( rgbcd, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
__insn_v1ddotpus( rgb89, conu ) ), 1 ) ), -128 );

*pV++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
__insn_v1ddotpus( rgbcd, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
__insn_v1ddotpus( rgb89, conv ) ), 1 ) ), -128 );
```

The concept is the same here, but the details differ slightly due to the different scaling. First of all, the signed 32 bit results are packed down to 16 bits with v4packsc, which also clamps them. Since the factors were scaled by 128, those results are shifted up by one and the high bytes picked out with v2packh. Add the constant and another part of the job's done.
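The same steps can be written out for a single pixel. This is a scalar sketch of my reading of the instruction sequence, with made-up function names and the factors taken from conu and conv:

```c
#include <stdint.h>

/* Hypothetical scalar model of the chroma path after the dot product. */
static uint8_t model_chroma(int32_t dot)
{
    /* Pack to signed 16 bits with clamping. */
    if (dot >  32767) dot =  32767;
    if (dot < -32768) dot = -32768;

    /* Shift up by one within 16 bits, then keep the high byte. */
    uint16_t shifted = (uint16_t)(dot * 2);
    uint8_t  hi      = (uint8_t)(shifted >> 8);

    /* Adding -128 to a byte is the same as adding 128 modulo 256. */
    return (uint8_t)(hi + 128u);
}

static uint8_t model_u(uint8_t r, uint8_t g, uint8_t b)
{
    return model_chroma(-21 * r - 42 * g + 64 * b);
}

static uint8_t model_v(uint8_t r, uint8_t g, uint8_t b)
{
    return model_chroma(64 * r - 53 * g - 10 * b);
}
```

One thing the model makes visible: the integer U and V factors each sum to 1 rather than 0, so in this sketch pure white comes out as 129 instead of 128.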

## Stay awhile

A full routine based on that theory would look something like this:

```
void rgba2yuv444_ccir601( const uint64_t *src, uint64_t *pY, uint64_t *pU, uint64_t *pV, uint32_t cnt )
{
uint64_t rgb01, rgb23, rgb45, rgb67, rgb89, rgbab, rgbcd, rgbef;

do {

rgb01 = *src++; rgb23 = *src++; rgb45 = *src++; rgb67 = *src++;
rgb89 = *src++; rgbab = *src++; rgbcd = *src++; rgbef = *src++;

*pY++ = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
__insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
__insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );

*pU++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
__insn_v1ddotpus( rgb45, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
__insn_v1ddotpus( rgb01, conu ) ), 1 ) ), -128 );

*pV++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
__insn_v1ddotpus( rgb45, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
__insn_v1ddotpus( rgb01, conv ) ), 1 ) ), -128 );

*pY++ = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
__insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
__insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );

*pU++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
__insn_v1ddotpus( rgbcd, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
__insn_v1ddotpus( rgb89, conu ) ), 1 ) ), -128 );

*pV++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
__insn_v1ddotpus( rgbcd, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
__insn_v1ddotpus( rgb89, conv ) ), 1 ) ), -128 );

} while( --cnt );
}
```
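Each pass through that loop reads eight 64 bit words, which is 16 rgba pixels, and writes two words to each plane. Since it's a do/while on `--cnt`, a count of 0 would wrap around. A caller might want a small guard like the sketch below; `blocks_for_pixels` is a hypothetical helper, not part of the routine:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of loop iterations for npixels pixels,
 * or 0 if the routine can't be used as is (no pixels, a pixel count
 * that isn't a multiple of 16, or a count that doesn't fit cnt). */
static uint32_t blocks_for_pixels(size_t npixels)
{
    if (npixels == 0 || npixels % 16 != 0)
        return 0;
    if (npixels / 16 > UINT32_MAX)
        return 0;
    return (uint32_t)(npixels / 16);
}
```

A 1920x1080 frame would then need 129600 iterations, and the caller skips the conversion when the helper returns 0.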

## Stay Forever!

How does this look with the latest compiler release (4.1.4.152692)? Compile the code with --save-temps and look at the generated assembler .s file. Do a double facepalm, since the output looks really, really terrible. There are probably numerous easy ways to make it look better. I did it the hard way and wrote a nifty awk script for reformatting the code. Feed the script the code and look at the output from the inner loop part:

```
(...)branch into .L4:
.L9:
{ addi               r7, r7, 16                                                                                            }
.L4:
{ addi               r9, r0, 16           addi               r8, r0, 24           ld                 r10, r0               }
{ addi               r11, r0, 8           ld                 r12, r9              addi               r14, r0, 32           }
{ ld                 r13, r8              v1ddotpu           r27, r10, r6                                                  }
{ ld                 r11, r11             v1ddotpu           r29, r12, r6                                                  }
{ addi               r9, r0, 40           addi               r15, r0, 48          ld                 r14, r14              }
{ addi               r8, r0, 56           v1ddotpu           r16, r13, r6                                                  }
{ v1ddotpu           r17, r11, r6         ld                 r9, r9                                                        }
{ ld                 r15, r15             v1ddotpus          r30, r12, r5                                                  }
{ ld                 r8, r8               v1ddotpus          r25, r11, r5                                                  }
{ v1ddotpus          r28, r13, r5         addi               r24, r1, 8                                                    }
{ v1ddotpus          r26, r10, r5         addi               r22, r3, 8                                                    }
{ shufflebytes       r16, r29, r20        v4packsc           r28, r28, r30                                                 }
{ shufflebytes       r17, r27, r19        v4packsc           r26, r25, r26                                                 }
{ v1ddotpus          lr, r8, r4           or                 r17, r17, r16                                                 }
{ v1ddotpus          r25, r9, r4          v2shli             r28, r28, 1                                                   }
{ v1ddotpus          r31, r15, r4         v1adduc            r17, r17, r18                                                 }
{ v1ddotpus          r30, r14, r4         v2shli             r26, r26, 1                                                   }
{ st                 r1, r17              v2packh            r26, r28, r26                                                 }
{ v4packsc           r30, r25, r30        v1ddotpus          r29, r15, r5                                                  }
{ v1ddotpus          r25, r8, r5          v4packsc           lr, lr, r31                                                   }
{ v1ddotpus          r27, r9, r5          v2shli             lr, lr, 1                                                     }
{ v1ddotpu           r17, r14, r6         v2shli             r16, r30, 1                                                   }
{ v1ddotpus          r28, r14, r5         v2packh            r16, lr, r16                                                  }
{ v1ddotpu           r8, r8, r6           v4packsc           r25, r25, r29                                                 }
{ v1ddotpu           r15, r15, r6         v4packsc           r14, r27, r28                                                 }
{ v1ddotpu           r9, r9, r6           v1addi             r26, r26, -128                                                }
{ v1ddotpus          r13, r13, r4         st                 r2, r26                                                       }
{ v1ddotpus          r12, r12, r4         v1addi             r16, r16, -128                                                }
{ v1ddotpus          r11, r11, r4         v2shli             r14, r14, 1                                                   }
{ v1ddotpus          r10, r10, r4         v4packsc           r12, r13, r12                                                 }
{ shufflebytes       r8, r15, r20         v2shli             r13, r25, 1                                                   }
{ shufflebytes       r9, r17, r19         v4packsc           r10, r11, r10                                                 }
{ or                 r8, r9, r8           st                 r3, r16              addi               r23, r2, 8            }
{ v1adduc            r8, r8, r18          v2packh            r14, r13, r14                                                 }
{ v2shli             r12, r12, 1          v2shli             r3, r10, 1                                                    }
{ st                 r24, r8              v2packh            r3, r12, r3                                                   }
{ st                 r23, r8              cmpeq              r21, r32, r7         addi               r0, r0, 64            }
{ st                 r22, r3              addi               r1, r1, 16           addi               r2, r2, 16            }
{ move               r3, r7               beqzt              r21, .L9                                                      }
```

As usual, gcc does not manage to use ld_add or st_add, but otherwise it looks promising. It does appear to look further ahead than earlier versions, so shuffling blocks around is probably not necessary. Let's add ld_add/st_add:

```
.L11:
{ ld_add             r11, r0, 8           v1ddotpus          r22, r10, r6                                                  }
{ v1ddotpu           r8, r10, r7                                                                                           }
{ ld_add             r12, r0, 8           v1ddotpus          r9, r11, r6                                                   }
{ v1ddotpu           r17, r11, r7                                                                                          }
{ ld_add             r13, r0, 8           v1ddotpus          r21, r12, r6                                                  }
{ v4packsc           r22, r9, r22         v1ddotpu           r23, r12, r7                                                  }
{ ld_add             r14, r0, 8           v1ddotpus          r15, r13, r6                                                  }
{ v1ddotpu           r16, r13, r7         v2shli             r22, r22, 1                                                   }
{ ld_add             r9, r0, 8            v4packsc           r21, r15, r21                                                 }
{ shufflebytes       r16, r23, r20        v2shli             r21, r21, 1                                                   }
{ ld_add             r15, r0, 8           v1ddotpus          r23, r9, r5                                                   }
{ v1ddotpus          r27, r14, r5         v2packh            r22, r21, r22                                                 }
{ shufflebytes       r17, r8, r19         ld_add             r8, r0, 8                                                     }
{ v1ddotpus          r26, r15, r5         or                 r17, r17, r16                                                 }
{ v4packsc           r27, r23, r27        v1ddotpus          r23, r8, r5                                                   }
{ v1adduc            r17, r17, r18        v1ddotpus          r21, r8, r6                                                   }
{ st_add             r1, r17, 8           v4packsc           r26, r23, r26                                                 }
{ v1ddotpus          r25, r15, r6         v2shli             r26, r26, 1                                                   }
{ v1ddotpus          r23, r9, r6          v2shli             r16, r27, 1                                                   }
{ v1ddotpu           r17, r14, r7         v2packh            r16, r26, r16                                                 }
{ v1ddotpus          r24, r14, r6         v4packsc           r21, r21, r25                                                 }
{ v1ddotpu           r8, r8, r7           v1addi             r22, r22, -128                                                }
{ v1ddotpu           r15, r15, r7         v4packsc           r14, r23, r24                                                 }
{ v1ddotpu           r9, r9, r7           st_add             r2, r22, 8                                                    }
{ v1ddotpus          r13, r13, r5         v1addi             r16, r16, -128                                                }
{ v1ddotpus          r12, r12, r5         v2shli             r14, r14, 1                                                   }
{ v1ddotpus          r11, r11, r5         st_add             r3, r16, 8                                                    }
{ v1ddotpus          r10, r10, r5         v4packsc           r12, r13, r12                                                 }
{ shufflebytes       r8, r15, r20         v2shli             r13, r21, 1                                                   }
{ shufflebytes       r9, r17, r19         v4packsc           r10, r11, r10                                                 }
{ or                 r9, r9, r8           v2packh            r14, r13, r14                                                 }
{ v1adduc            r9, r9, r18          v2shli             r12, r12, 1                                                   }
{ v2shli             r8, r10, 1           st_add             r1, r9, 8                                                     }
{ v2packh            r8, r12, r8          v1addi             r9, r14, -128                                                 }
{ st_add             r3, r8, 8                                                                                             }
{ bnezt              r4, .L11                                                                                              }
```

That leaves 3-4 holes, but there are no extra instructions anywhere, and it's 38 instead of 41 bundles. Neat. Now, if gcc would only stop screwing up inlining and loop unrolling, it'd soon be a useful compiler. Still can't have it all, it seems.

## Destroy him, my robots

The complete routine would look like this:

```
void rgba2yuv444_ccir601_ld( const uint64_t *src, uint64_t *pY, uint64_t *pU, uint64_t *pV, uint32_t cnt )
{
uint64_t rgb01, rgb23, rgb45, rgb67, rgb89, rgbab, rgbcd, rgbef;
uint64_t y0, u0, v0;

do {
// load 16 rgba pixels, post-incrementing src by 8 bytes per load
rgb01 = __insn_ld_add( src, 8 );
rgb23 = __insn_ld_add( src, 8 );
rgb45 = __insn_ld_add( src, 8 );
rgb67 = __insn_ld_add( src, 8 );
rgb89 = __insn_ld_add( src, 8 );
rgbab = __insn_ld_add( src, 8 );
rgbcd = __insn_ld_add( src, 8 );
rgbef = __insn_ld_add( src, 8 );

// pixels 0-7
y0 = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
__insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
__insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );
__insn_st_add( pY, y0, 8 ); // store and post-increment, matching st_add in the listing above

u0 = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
__insn_v1ddotpus( rgb45, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
__insn_v1ddotpus( rgb01, conu ) ), 1 ) ), -128 );
__insn_st_add( pU, u0, 8 );

v0 = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
__insn_v1ddotpus( rgb45, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
__insn_v1ddotpus( rgb01, conv ) ), 1 ) ), -128 );
__insn_st_add( pV, v0, 8 );

// pixels 8-15
y0 = __insn_v1adduc( __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
__insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f ) |
__insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
__insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ), 0x1010101010101010 );
__insn_st_add( pY, y0, 8 );

u0 = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
__insn_v1ddotpus( rgbcd, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
__insn_v1ddotpus( rgb89, conu ) ), 1 ) ), -128 );
__insn_st_add( pU, u0, 8 );

v0 = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
__insn_v1ddotpus( rgbcd, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
__insn_v1ddotpus( rgb89, conv ) ), 1 ) ), -128 );
__insn_st_add( pV, v0, 8 );

} while( --cnt );
}
```

## No, no, no.

If the needed output is YUV420, it's just a matter of reading rgba data from two rows and downsampling with the convenient v1avgu instruction. I haven't bothered with ld_add/st_add here. rgb0 is row 0, rgb1 is row 1:

```
rgb001 = *rgb0++; rgb023 = *rgb0++; rgb045 = *rgb0++; rgb067 = *rgb0++;
rgb089 = *rgb0++; rgb0ab = *rgb0++; rgb0cd = *rgb0++; rgb0ef = *rgb0++;

rgb101 = *rgb1++; rgb123 = *rgb1++; rgb145 = *rgb1++; rgb167 = *rgb1++;
rgb189 = *rgb1++; rgb1ab = *rgb1++; rgb1cd = *rgb1++; rgb1ef = *rgb1++;

(...)

uint64_t t0, t1;
uint64_t uv00,uv01,uv02,uv03;
t0 = __insn_v1avgu( rgb0ef, rgb1ef ); t1 = __insn_v1avgu( rgb0cd, rgb1cd ); uv00 = __insn_v1avgu( __insn_v4int_h( t0, t1 ),__insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb0ab, rgb1ab ); t1 = __insn_v1avgu( rgb089, rgb189 ); uv01 = __insn_v1avgu( __insn_v4int_h( t0, t1 ),__insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb067, rgb167 ); t1 = __insn_v1avgu( rgb045, rgb145 ); uv02 = __insn_v1avgu( __insn_v4int_h( t0, t1 ),__insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb023, rgb123 ); t1 = __insn_v1avgu( rgb001, rgb101 ); uv03 = __insn_v1avgu( __insn_v4int_h( t0, t1 ),__insn_v4int_l( t0, t1 ) );

*pU++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv00, conu ),
__insn_v1ddotpus( uv01, conu ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv02, conu ),
__insn_v1ddotpus( uv03, conu ) ), 1 ) ), -128 );

*pV++ = __insn_v1addi( __insn_v2packh( __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv00, conv ),
__insn_v1ddotpus( uv01, conv ) ), 1 ),
__insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv02, conv ),
__insn_v1ddotpus( uv03, conv ) ), 1 ) ), -128 );
```
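The downsampling step averages vertically first and horizontally second. For one colour channel of one 2x2 block, that can be sketched as below. The names are made up, and the round-up behaviour of the byte average is my assumption about v1avgu rather than something stated above:

```c
#include <stdint.h>

/* Assumed behaviour of a rounding byte average: (a + b + 1) / 2. */
static uint8_t avg_round(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical scalar model of the 2x2 downsample for one channel.
 * p00 and p01 are neighbours in row 0, p10 and p11 in row 1. */
static uint8_t model_downsample(uint8_t p00, uint8_t p01,
                                uint8_t p10, uint8_t p11)
{
    uint8_t left  = avg_round(p00, p10);  /* vertical average, column 0 */
    uint8_t right = avg_round(p01, p11);  /* vertical average, column 1 */
    return avg_round(left, right);        /* horizontal average */
}
```

Under that assumption, averaging twice rounds twice, so a block of 0, 1, 0, 0 comes out as 1 rather than 0. For chroma that small upward bias is unlikely to matter.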

## Mission accomplished

If I've missed anything obvious you can figure out my email from the front page. Also, remember to appreciate this new Abstruse Goose strip.