I earlier looked at making a reasonably fast YUV to RGB conversion routine for the Tilera TILE-Gx CPU here. That conversion can't use the dual dot product instructions, since some factors are subtracted and we run out of range and resolution, so muls and adds had to be used instead.
Going from RGB to YUV is another matter. All the factors are below 1 in magnitude and Y is only positive, so there's a decent chance to try these new, cool instructions. The dual dot product variants perform two 4-element dot products on 8-bit values, returning the two results in the low and high 32 bits of the destination.
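To make the semantics concrete, here's a portable C model of how I read v1ddotpu (my interpretation, not vendor code): each 32-bit half of the result is the unsigned dot product of the four bytes in the corresponding halves of the operands.

```c
#include <stdint.h>

/* Portable model of v1ddotpu as I read it: two 4-byte unsigned dot
   products, one per 32-bit half of the two operands. */
static uint64_t v1ddotpu_model( uint64_t a, uint64_t b )
{
    uint64_t lo = 0, hi = 0;
    for ( int i = 0; i < 4; i++ ) {
        lo += ( ( a >> ( 8 * i ) ) & 0xff ) * ( ( b >> ( 8 * i ) ) & 0xff );
        hi += ( ( a >> ( 8 * ( i + 4 ) ) ) & 0xff ) * ( ( b >> ( 8 * ( i + 4 ) ) ) & 0xff );
    }
    return lo | ( hi << 32 );
}
```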
The conversion requirements are as follows, this time from CCIR601:
Y = R *  .299 + G *  .587 + B *  .114 +  16
U = R * -.169 + G * -.332 + B *  .500 + 128
V = R *  .500 + G * -.419 + B * -.081 + 128

Saturate the Y result.
First, constants for the factors are needed. All Y factors are positive and below 1, so they can be scaled by 256. The U and V factors have to fit into signed range, but they're all below 1 in magnitude too, so scaling by 128 gives the best possible resolution with 8-bit signed multipliers.
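Before going vector, a scalar reference helps keep the scaling straight. This is a sketch using the integer factors behind the constants below (Y factors in Q8, U/V factors in Q7); it assumes an arithmetic right shift on negative ints, and the SIMD path may differ from it by an LSB here and there due to rounding:

```c
#include <stdint.h>

static uint8_t clamp_u8( int v )
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* Scalar reference sketch: Y factors in Q8 (77/150/29), U/V factors
   in Q7 (-21/-42/64 and 64/-53/-10). Assumes arithmetic >> on
   negative ints. */
static void rgb2yuv_ref( uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *u, uint8_t *v )
{
    *y = clamp_u8( ( (  77 * r + 150 * g + 29 * b ) >> 8 ) + 16 );
    *u = clamp_u8( ( ( -21 * r -  42 * g + 64 * b ) >> 7 ) + 128 );
    *v = clamp_u8( ( (  64 * r -  53 * g - 10 * b ) >> 7 ) + 128 );
}
```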
#define cony 0x4d961d004d961d00
#define conu 0xebd64000ebd64000
#define conv 0x40cbf60040cbf600
Next, some RGBA data is fetched and it's time to look at the basic steps needed for calculating Y:
rgb01 = *src++; rgb23 = *src++; rgb45 = *src++; rgb67 = *src++;
rgb89 = *src++; rgbab = *src++; rgbcd = *src++; rgbef = *src++;

*pY++ = __insn_v1adduc(
            __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
                                 __insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f )
          | __insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
                                 __insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ),
            0x1010101010101010 );
(...)
*pY++ = __insn_v1adduc(
            __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
                                 __insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f )
          | __insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
                                 __insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ),
            0x1010101010101010 );
Let's look at that from the inside out. The 32-bit dot product results are too wide: only the upper 8 bits of the lower 16 bits are needed, since the factors were scaled up by 256. That would be very simple with a pair of v4pack and v2pack instructions, but an unsigned v4pack is not available. Since unsigned dot products are used and the factors sum to 1, no clamping is needed until the constant add at the end. That's OK: we can get by with another cool instruction, shufflebytes. It takes any set of bytes from two registers and shuffles them however you want. Unfortunately, it's limited to slot x0, but the code analysis later will show that's not a big problem. So shufflebytes is used to fill one half of the result and clear the other, the two halves are ORed together, and the constant is added with clamping. Job done.
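The no-clamping claim is easy to sanity-check: the Y factors in cony sum to exactly 256 (i.e. 1.0 in Q8), so even an all-white pixel's dot product fits in 16 bits, and the byte picked out by shufflebytes never needs saturation before the final add:

```c
/* The Y factors from cony sum to exactly 256, so the worst-case dot
   product is 256 * 255 = 65280, which fits in 16 bits. The byte at
   bits 15:8 is therefore always a valid, unsaturated Y value. */
static int y_dot_fits_16_bits( void )
{
    int sum = 0x4d + 0x96 + 0x1d;   /* 77 + 150 + 29 */
    return sum == 256 && sum * 255 < 65536;
}
```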
Next, it's time to look at the U and V values:
*pU++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
                                                __insn_v1ddotpus( rgb45, conu ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
                                                __insn_v1ddotpus( rgb01, conu ) ), 1 ) ),
            -128 );
*pV++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
                                                __insn_v1ddotpus( rgb45, conv ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
                                                __insn_v1ddotpus( rgb01, conv ) ), 1 ) ),
            -128 );
(...)
*pU++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
                                                __insn_v1ddotpus( rgbcd, conu ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
                                                __insn_v1ddotpus( rgb89, conu ) ), 1 ) ),
            -128 );
*pV++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
                                                __insn_v1ddotpus( rgbcd, conv ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
                                                __insn_v1ddotpus( rgb89, conv ) ), 1 ) ),
            -128 );
The concept is the same here, with slight differences due to the different scaling. First, the signed 16-bit results are packed with v4packsc, which also clamps the results. Since the factors were scaled by 128, those results are shifted up by one and the high bytes picked out with v2packh. Add the constant and another part of the job's done.
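For one 16-bit lane, that scaling step can be sketched in portable C (the function name is mine): shifting left by one and then taking the high byte is an arithmetic shift right by 7, and the byte-wise add of -128 wraps modulo 256, mapping the signed chroma value to the usual offset-binary form.

```c
#include <stdint.h>

/* Model of the per-lane U/V scaling: the dot product result is in Q7
   (factors were scaled by 128). v2shli by 1 plus v2packh's high-byte
   pick equals an arithmetic >>7; the byte-wise -128 add wraps modulo
   256, converting signed chroma to offset binary. */
static uint8_t pack_uv_lane( int16_t sum )
{
    int8_t high = (int8_t)( ( sum * 2 ) >> 8 );
    return (uint8_t)( high + 128 );
}
```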
A full routine based on that theory would look something like this:
void rgba2yuv444_ccir601( const uint64_t *src, uint64_t *pY, uint64_t *pU,
                          uint64_t *pV, uint32_t cnt )
{
    uint64_t rgb01, rgb23, rgb45, rgb67, rgb89, rgbab, rgbcd, rgbef;

    do {
        rgb01 = *src++; rgb23 = *src++; rgb45 = *src++; rgb67 = *src++;
        rgb89 = *src++; rgbab = *src++; rgbcd = *src++; rgbef = *src++;
        *pY++ = __insn_v1adduc(
                    __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
                                         __insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f )
                  | __insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
                                         __insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ),
                    0x1010101010101010 );
        *pU++ = __insn_v1addi(
                    __insn_v2packh(
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
                                                        __insn_v1ddotpus( rgb45, conu ) ), 1 ),
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
                                                        __insn_v1ddotpus( rgb01, conu ) ), 1 ) ),
                    -128 );
        *pV++ = __insn_v1addi(
                    __insn_v2packh(
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
                                                        __insn_v1ddotpus( rgb45, conv ) ), 1 ),
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
                                                        __insn_v1ddotpus( rgb01, conv ) ), 1 ) ),
                    -128 );
        *pY++ = __insn_v1adduc(
                    __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
                                         __insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f )
                  | __insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
                                         __insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ),
                    0x1010101010101010 );
        *pU++ = __insn_v1addi(
                    __insn_v2packh(
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
                                                        __insn_v1ddotpus( rgbcd, conu ) ), 1 ),
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
                                                        __insn_v1ddotpus( rgb89, conu ) ), 1 ) ),
                    -128 );
        *pV++ = __insn_v1addi(
                    __insn_v2packh(
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
                                                        __insn_v1ddotpus( rgbcd, conv ) ), 1 ),
                        __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
                                                        __insn_v1ddotpus( rgb89, conv ) ), 1 ) ),
                    -128 );
    } while( --cnt );
}
How does this look with the latest compiler release (4.1.4.152692)? Compile the code with --save-temps and look at the generated assembler .s file. Then do a double facepalm, since the output formatting looks really, really terrible. There are probably numerous easy ways to make it more readable; I did it the hard way and wrote a nifty awk script for reformatting the code. Feed the script the compiler output and look at the inner loop part:
(...) branch into .L4:
.L9:
{ addi r7, r7, 16 }
.L4:
{ addi r9, r0, 16 ; addi r8, r0, 24 ; ld r10, r0 }
{ addi r11, r0, 8 ; ld r12, r9 ; addi r14, r0, 32 }
{ ld r13, r8 ; v1ddotpu r27, r10, r6 }
{ ld r11, r11 ; v1ddotpu r29, r12, r6 }
{ addi r9, r0, 40 ; addi r15, r0, 48 ; ld r14, r14 }
{ addi r8, r0, 56 ; v1ddotpu r16, r13, r6 }
{ v1ddotpu r17, r11, r6 ; ld r9, r9 }
{ ld r15, r15 ; v1ddotpus r30, r12, r5 }
{ ld r8, r8 ; v1ddotpus r25, r11, r5 }
{ v1ddotpus r28, r13, r5 ; addi r24, r1, 8 }
{ v1ddotpus r26, r10, r5 ; addi r22, r3, 8 }
{ shufflebytes r16, r29, r20 ; v4packsc r28, r28, r30 }
{ shufflebytes r17, r27, r19 ; v4packsc r26, r25, r26 }
{ v1ddotpus lr, r8, r4 ; or r17, r17, r16 }
{ v1ddotpus r25, r9, r4 ; v2shli r28, r28, 1 }
{ v1ddotpus r31, r15, r4 ; v1adduc r17, r17, r18 }
{ v1ddotpus r30, r14, r4 ; v2shli r26, r26, 1 }
{ st r1, r17 ; v2packh r26, r28, r26 }
{ v4packsc r30, r25, r30 ; v1ddotpus r29, r15, r5 }
{ v1ddotpus r25, r8, r5 ; v4packsc lr, lr, r31 }
{ v1ddotpus r27, r9, r5 ; v2shli lr, lr, 1 }
{ v1ddotpu r17, r14, r6 ; v2shli r16, r30, 1 }
{ v1ddotpus r28, r14, r5 ; v2packh r16, lr, r16 }
{ v1ddotpu r8, r8, r6 ; v4packsc r25, r25, r29 }
{ v1ddotpu r15, r15, r6 ; v4packsc r14, r27, r28 }
{ v1ddotpu r9, r9, r6 ; v1addi r26, r26, -128 }
{ v1ddotpus r13, r13, r4 ; st r2, r26 }
{ v1ddotpus r12, r12, r4 ; v1addi r16, r16, -128 }
{ v1ddotpus r11, r11, r4 ; v2shli r14, r14, 1 }
{ v1ddotpus r10, r10, r4 ; v4packsc r12, r13, r12 }
{ shufflebytes r8, r15, r20 ; v2shli r13, r25, 1 }
{ shufflebytes r9, r17, r19 ; v4packsc r10, r11, r10 }
{ or r8, r9, r8 ; st r3, r16 ; addi r23, r2, 8 }
{ v1adduc r8, r8, r18 ; v2packh r14, r13, r14 }
{ v2shli r12, r12, 1 ; v2shli r3, r10, 1 }
{ st r24, r8 ; v2packh r3, r12, r3 }
{ v1addi r8, r14, -128 ; v1addi r3, r3, -128 }
{ st r23, r8 ; cmpeq r21, r32, r7 ; addi r0, r0, 64 }
{ st r22, r3 ; addi r1, r1, 16 ; addi r2, r2, 16 }
{ move r3, r7 ; beqzt r21, .L9 }
As usual, gcc does not manage to use ld_add or st_add, but otherwise it looks promising. It does appear to look further ahead than earlier versions, so shuffling blocks around is probably not necessary. Let's add ld_add/st_add:
.L11:
{ ld_add r10, r0, 8 ; addxi r4, r4, -1 }
{ ld_add r11, r0, 8 ; v1ddotpus r22, r10, r6 }
{ v1ddotpu r8, r10, r7 }
{ ld_add r12, r0, 8 ; v1ddotpus r9, r11, r6 }
{ v1ddotpu r17, r11, r7 }
{ ld_add r13, r0, 8 ; v1ddotpus r21, r12, r6 }
{ v4packsc r22, r9, r22 ; v1ddotpu r23, r12, r7 }
{ ld_add r14, r0, 8 ; v1ddotpus r15, r13, r6 }
{ v1ddotpu r16, r13, r7 ; v2shli r22, r22, 1 }
{ ld_add r9, r0, 8 ; v4packsc r21, r15, r21 }
{ shufflebytes r16, r23, r20 ; v2shli r21, r21, 1 }
{ ld_add r15, r0, 8 ; v1ddotpus r23, r9, r5 }
{ v1ddotpus r27, r14, r5 ; v2packh r22, r21, r22 }
{ shufflebytes r17, r8, r19 ; ld_add r8, r0, 8 }
{ v1ddotpus r26, r15, r5 ; or r17, r17, r16 }
{ v4packsc r27, r23, r27 ; v1ddotpus r23, r8, r5 }
{ v1adduc r17, r17, r18 ; v1ddotpus r21, r8, r6 }
{ st_add r1, r17, 8 ; v4packsc r26, r23, r26 }
{ v1ddotpus r25, r15, r6 ; v2shli r26, r26, 1 }
{ v1ddotpus r23, r9, r6 ; v2shli r16, r27, 1 }
{ v1ddotpu r17, r14, r7 ; v2packh r16, r26, r16 }
{ v1ddotpus r24, r14, r6 ; v4packsc r21, r21, r25 }
{ v1ddotpu r8, r8, r7 ; v1addi r22, r22, -128 }
{ v1ddotpu r15, r15, r7 ; v4packsc r14, r23, r24 }
{ v1ddotpu r9, r9, r7 ; st_add r2, r22, 8 }
{ v1ddotpus r13, r13, r5 ; v1addi r16, r16, -128 }
{ v1ddotpus r12, r12, r5 ; v2shli r14, r14, 1 }
{ v1ddotpus r11, r11, r5 ; st_add r3, r16, 8 }
{ v1ddotpus r10, r10, r5 ; v4packsc r12, r13, r12 }
{ shufflebytes r8, r15, r20 ; v2shli r13, r21, 1 }
{ shufflebytes r9, r17, r19 ; v4packsc r10, r11, r10 }
{ or r9, r9, r8 ; v2packh r14, r13, r14 }
{ v1adduc r9, r9, r18 ; v2shli r12, r12, 1 }
{ v2shli r8, r10, 1 ; st_add r1, r9, 8 }
{ v2packh r8, r12, r8 ; v1addi r9, r14, -128 }
{ st_add r2, r9, 8 ; v1addi r8, r8, -128 }
{ st_add r3, r8, 8 }
{ bnezt r4, .L11 }
That leaves 3-4 holes, but there are no extra instructions anywhere, and it's 38 bundles instead of 41. Neat. Now, if gcc would only stop screwing up inlining and loop unrolling, it'd soon be a useful compiler. Still can't have it all, it seems.
The complete routine would look like this:
void rgba2yuv444_ccir601_ld( const uint64_t *src, uint64_t *pY, uint64_t *pU,
                             uint64_t *pV, uint32_t cnt )
{
    uint64_t rgb01, rgb23, rgb45, rgb67, rgb89, rgbab, rgbcd, rgbef;
    uint64_t y0, u0, v0;

    do {
        rgb01 = __insn_ld_add( src, 8 );
        rgb23 = __insn_ld_add( src, 8 );
        rgb45 = __insn_ld_add( src, 8 );
        rgb67 = __insn_ld_add( src, 8 );
        rgb89 = __insn_ld_add( src, 8 );
        rgbab = __insn_ld_add( src, 8 );
        rgbcd = __insn_ld_add( src, 8 );
        rgbef = __insn_ld_add( src, 8 );
        y0 = __insn_v1adduc(
                 __insn_shufflebytes( __insn_v1ddotpu( rgb67, cony ),
                                      __insn_v1ddotpu( rgb45, cony ), 0x05010d090f0f0f0f )
               | __insn_shufflebytes( __insn_v1ddotpu( rgb23, cony ),
                                      __insn_v1ddotpu( rgb01, cony ), 0x0f0f0f0f05010d09 ),
                 0x1010101010101010 );
        u0 = __insn_v1addi(
                 __insn_v2packh(
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conu ),
                                                     __insn_v1ddotpus( rgb45, conu ) ), 1 ),
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conu ),
                                                     __insn_v1ddotpus( rgb01, conu ) ), 1 ) ),
                 -128 );
        v0 = __insn_v1addi(
                 __insn_v2packh(
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb67, conv ),
                                                     __insn_v1ddotpus( rgb45, conv ) ), 1 ),
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgb23, conv ),
                                                     __insn_v1ddotpus( rgb01, conv ) ), 1 ) ),
                 -128 );
        __insn_st_add( pY, y0, 8 );
        __insn_st_add( pU, u0, 8 );
        __insn_st_add( pV, v0, 8 );
        y0 = __insn_v1adduc(
                 __insn_shufflebytes( __insn_v1ddotpu( rgbef, cony ),
                                      __insn_v1ddotpu( rgbcd, cony ), 0x05010d090f0f0f0f )
               | __insn_shufflebytes( __insn_v1ddotpu( rgbab, cony ),
                                      __insn_v1ddotpu( rgb89, cony ), 0x0f0f0f0f05010d09 ),
                 0x1010101010101010 );
        u0 = __insn_v1addi(
                 __insn_v2packh(
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conu ),
                                                     __insn_v1ddotpus( rgbcd, conu ) ), 1 ),
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conu ),
                                                     __insn_v1ddotpus( rgb89, conu ) ), 1 ) ),
                 -128 );
        v0 = __insn_v1addi(
                 __insn_v2packh(
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbef, conv ),
                                                     __insn_v1ddotpus( rgbcd, conv ) ), 1 ),
                     __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( rgbab, conv ),
                                                     __insn_v1ddotpus( rgb89, conv ) ), 1 ) ),
                 -128 );
        __insn_st_add( pY, y0, 8 );
        __insn_st_add( pU, u0, 8 );
        __insn_st_add( pV, v0, 8 );
    } while( --cnt );
}
If the needed output is YUV420, it's just a matter of reading RGBA data from two rows and downsampling with the convenient v1avgu instruction. I haven't bothered with ld_add/st_add here. rgb0 is row 0, rgb1 is row 1:
rgb001 = *rgb0++; rgb023 = *rgb0++; rgb045 = *rgb0++; rgb067 = *rgb0++;
rgb089 = *rgb0++; rgb0ab = *rgb0++; rgb0cd = *rgb0++; rgb0ef = *rgb0++;
rgb101 = *rgb1++; rgb123 = *rgb1++; rgb145 = *rgb1++; rgb167 = *rgb1++;
rgb189 = *rgb1++; rgb1ab = *rgb1++; rgb1cd = *rgb1++; rgb1ef = *rgb1++;
(...)
uint64_t t0, t1;
uint64_t uv00, uv01, uv02, uv03;

t0 = __insn_v1avgu( rgb0ef, rgb1ef );
t1 = __insn_v1avgu( rgb0cd, rgb1cd );
uv00 = __insn_v1avgu( __insn_v4int_h( t0, t1 ), __insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb0ab, rgb1ab );
t1 = __insn_v1avgu( rgb089, rgb189 );
uv01 = __insn_v1avgu( __insn_v4int_h( t0, t1 ), __insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb067, rgb167 );
t1 = __insn_v1avgu( rgb045, rgb145 );
uv02 = __insn_v1avgu( __insn_v4int_h( t0, t1 ), __insn_v4int_l( t0, t1 ) );
t0 = __insn_v1avgu( rgb023, rgb123 );
t1 = __insn_v1avgu( rgb001, rgb101 );
uv03 = __insn_v1avgu( __insn_v4int_h( t0, t1 ), __insn_v4int_l( t0, t1 ) );

*pU++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv00, conu ),
                                                __insn_v1ddotpus( uv01, conu ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv02, conu ),
                                                __insn_v1ddotpus( uv03, conu ) ), 1 ) ),
            -128 );
*pV++ = __insn_v1addi(
            __insn_v2packh(
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv00, conv ),
                                                __insn_v1ddotpus( uv01, conv ) ), 1 ),
                __insn_v2shli( __insn_v4packsc( __insn_v1ddotpus( uv02, conv ),
                                                __insn_v1ddotpus( uv03, conv ) ), 1 ) ),
            -128 );
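For reference, here's a portable byte-wise model of how I read v1avgu (assuming it's a rounding average, i.e. (a + b + 1) >> 1 per byte):

```c
#include <stdint.h>

/* Byte-wise model of v1avgu, assuming it rounds up:
   each result byte is ( a + b + 1 ) >> 1. */
static uint64_t v1avgu_model( uint64_t a, uint64_t b )
{
    uint64_t r = 0;
    for ( int i = 0; i < 8; i++ ) {
        uint64_t x = ( a >> ( 8 * i ) ) & 0xff;
        uint64_t y = ( b >> ( 8 * i ) ) & 0xff;
        r |= ( ( x + y + 1 ) >> 1 ) << ( 8 * i );
    }
    return r;
}
```

Note that cascading three rounding averages, as the 2x2 downsample above does, can bias the result up slightly compared to a true ( sum + 2 ) >> 2 box average.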
Source code: rgb2yuv_tilegx_2013.cpp
Comments are always appreciated. Contact information is available here.
Remember to appreciate this new Abstruse Goose strip.
This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary:
You may copy and redistribute the material in any medium or format for any purpose, even commercially.
You must give appropriate credit, provide a link to the license, and indicate if changes were made.
If you remix, transform, or build upon the material, you may not distribute the modified material.