As described in the article AES Optimization on Tilera TILE-Gx, there are several ways to speed up AES encryption on the Tilera TILE-Gx 36-core CPU: use tblidx/ld4u/xor bundles to compact the code, preload the round keys into registers, and inline the encryption routine where needed.
Sometimes that's not an option and you're stuck with OpenSSL. In that case it would be nice to have a drop-in replacement for aes_core.c that reuses as much as possible from the previous article and also adds some improvements for loading unaligned data. But before we get to that, let's address one niggling issue: what to do about the last round?
Applying the last round and mapping the cipher state to the byte array block (phew) looks like this in the OpenSSL code:
s0 =
    (Te2[(t0 >> 24)       ] & 0xff000000) ^
    (Te3[(t1 >> 16) & 0xff] & 0x00ff0000) ^
    (Te0[(t2 >>  8) & 0xff] & 0x0000ff00) ^
    (Te1[(t3      ) & 0xff] & 0x000000ff) ^
    rk[0];
PUTU32(out     , s0);
s1 =
    (Te2[(t1 >> 24)       ] & 0xff000000) ^
    (Te3[(t2 >> 16) & 0xff] & 0x00ff0000) ^
    (Te0[(t3 >>  8) & 0xff] & 0x0000ff00) ^
    (Te1[(t0      ) & 0xff] & 0x000000ff) ^
    rk[1];
PUTU32(out +  4, s1);
s2 =
    (Te2[(t2 >> 24)       ] & 0xff000000) ^
    (Te3[(t3 >> 16) & 0xff] & 0x00ff0000) ^
    (Te0[(t0 >>  8) & 0xff] & 0x0000ff00) ^
    (Te1[(t1      ) & 0xff] & 0x000000ff) ^
    rk[2];
PUTU32(out +  8, s2);
s3 =
    (Te2[(t3 >> 24)       ] & 0xff000000) ^
    (Te3[(t0 >> 16) & 0xff] & 0x00ff0000) ^
    (Te0[(t1 >>  8) & 0xff] & 0x0000ff00) ^
    (Te1[(t2      ) & 0xff] & 0x000000ff) ^
    rk[3];
PUTU32(out + 12, s3);
which I converted to this:
pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t0 );
pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t1 );
pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t2 );
pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t3 );
tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) |
        (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk40;
pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t1 );
pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t2 );
pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t3 );
pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t0 );
tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) |
        (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk41;
st_add( out, v4int_l(tmp1,tmp0), 8 );
pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t2 );
pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t3 );
pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t0 );
pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t1 );
tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) |
        (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk42;
pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t3 );
pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t0 );
pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t1 );
pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t2 );
tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) |
        (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk43;
st_add( out, v4int_l(tmp1,tmp0), 8 );
That's rather cumbersome: 28 logical operations just for concatenating the data. Gcc refuses to use interleave instructions, so some cycles can be saved by issuing them by hand. Surely there's a better solution? Let's go to the source: the Te tables. Recall that every entry contains the plain S-box byte twice, and the masked lookups above always pick out one of those duplicated bytes, so it doesn't matter which copy is selected. Thanks to where the duplicates sit, the pairs can be picked out with v2int_l, which leaves every other byte in the upper and lower halves of two registers holding the correct values. These can then be merged with mm() and packed with v2packh. This drawing might do a better job of explaining it:
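In text form (a sketch assuming OpenSSL's standard Te tables; p = S[i], y = 2·S[i], x = 3·S[i] in GF(2^8), matching the comment in the listing below; check_te_pairs is a hypothetical helper, not part of the replacement file), the layouts are:

/* Byte layout of the Te entries, most significant byte first:
 *   Te0[i] = y p p x      Te2[i] = p x y p
 *   Te1[i] = x y p p      Te3[i] = p p x y
 * The plain S-box byte p occurs twice in every entry, so the last round can
 * take either copy. */
#include <assert.h>

static void check_te_pairs(int i)
{
    assert(((Te0[i] >> 16) & 0xff) == ((Te0[i] >>  8) & 0xff)); /* p p in the middle */
    assert(((Te1[i] >>  8) & 0xff) == ( Te1[i]        & 0xff)); /* p p at the bottom */
    assert(((Te2[i] >> 24) & 0xff) == ( Te2[i]        & 0xff)); /* p at both ends    */
    assert(((Te3[i] >> 24) & 0xff) == ((Te3[i] >> 16) & 0xff)); /* p p at the top    */
}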
I combine s0/s1 and s2/s3 in the code below to avoid extra packs:
// Exploit pairs in Te tables: 0=yppx,1=xypp,2=pxyp,3=ppxy
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t0 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t1 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t2 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t3 );
s0 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t1 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t2 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t3 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t0 );
s1 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
s0 = __insn_revbytes( __insn_v2packh( s0, s1 ) ^
                      __insn_v4int_l( __insn_ld4u( rk + 0 ), __insn_ld4u( rk + 1 ) ) );
__insn_st_add( out, s0, 8 );
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t2 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t3 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t0 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t1 );
s2 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t3 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t0 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t1 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t2 );
s3 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
s2 = __insn_revbytes( __insn_v2packh( s2, s3 ) ^
                      __insn_v4int_l( __insn_ld4u( rk + 2 ), __insn_ld4u( rk + 3 ) ) );
__insn_st( out, s2 );
That's a cool 12 instructions instead of 28. The net result is about 3% faster execution time. Neat.
The TILE-Gx has instructions that make unaligned loads easier: ldna loads from the specified address with the lower 3 bits masked out, and dblalign aligns two such registers according to those lower 3 bits. That means 16 bytes can be loaded with 3 ldna and 2 dblalign, taking care not to read past the end of the buffer when the data happens to be aligned (hence the third load uses in + 15 rather than in + 16):
uint64_t in0 = __insn_ldna( in );
uint64_t in1 = __insn_ldna( in + 8 );
uint64_t in2 = __insn_ldna( in + 15 );
uint64_t a   = __insn_revbytes( __insn_dblalign( in0, in1, in ) );
uint64_t b   = __insn_revbytes( __insn_dblalign( in1, in2, in ) );
s0 = (a>>32) ^ __insn_ld4u( rk );
s1 = a       ^ __insn_ld4u( rk + 1 );
s2 = (b>>32) ^ __insn_ld4u( rk + 2 );
s3 = b       ^ __insn_ld4u( rk + 3 );
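The whole sequence computes the same s0-s3 as the reference code's big-endian GETU32 loads; a rough portable-C equivalent (a sketch, not the TILE-Gx path; getu32 is a hypothetical helper mirroring OpenSSL's GETU32 macro) would be:

#include <stdint.h>

/* Big-endian 32-bit load from a possibly unaligned pointer, as GETU32 does. */
static inline uint32_t getu32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* s0 = getu32(in)      ^ rk[0];
 * s1 = getu32(in + 4)  ^ rk[1];
 * s2 = getu32(in + 8)  ^ rk[2];
 * s3 = getu32(in + 12) ^ rk[3];  */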
What to do about unaligned stores? Not much, really. Either use an array of st1u stores or define GX_STORE_ALIGNED if the destination buffer is 8-byte aligned.
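If you go the bytewise route, it could look roughly like this sketch (store8_le is a hypothetical helper; the unaligned path in the listing further down simply uses PUTU32 instead):

#include <stdint.h>

/* Store a 64-bit value one byte at a time (little-endian, matching what a
 * single aligned st would write), so the destination may be unaligned. */
static inline void store8_le(unsigned char *out, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));
}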
Let's start off by aligning the Te0-3 tables to 1 KB boundaries (a 256-entry table of 32-bit words is exactly 1 KB) so tblidx can be used:
static const u32 __attribute__ ((aligned (1024))) Te0[256] = { (...)
static const u32 __attribute__ ((aligned (1024))) Te1[256] = { (...)
static const u32 __attribute__ ((aligned (1024))) Te2[256] = { (...)
static const u32 __attribute__ ((aligned (1024))) Te3[256] = { (...)
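For anyone unfamiliar with tblidx: as far as I understand it, tblidxbN splices byte N of the index into bits 9..2 of the table pointer, which is exactly an index into a 1 KB-aligned table of 32-bit words. Conceptually (a sketch of that understanding, not the intrinsic itself):

#include <stdint.h>

/* Rough C equivalent of pt = (uint32_t *)__insn_tblidxb3( (uint64_t)pt, s0 ),
 * assuming pt already points somewhere inside a 1 KB-aligned, 256-entry
 * 32-bit table: keep the table base, replace the entry offset. */
static inline const uint32_t *tblidxb3_equiv(const uint32_t *pt, uint32_t s)
{
    uint64_t base = (uint64_t)pt & ~(uint64_t)0x3ff;   /* back to the 1 KB boundary */
    return (const uint32_t *)(base | (((s >> 24) & 0xff) << 2));
}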
Define the encryption round macros; the two variants ping-pong between the s and t state variables so no copies are needed between rounds:
#define ROUND_E_T(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s3 ); \
    t0 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s0 ); \
    t1 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s1 ); \
    t2 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s2 ); \
    t3 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

#define ROUND_E_S(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t3 ); \
    s0 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t0 ); \
    s1 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t1 ); \
    s2 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t2 ); \
    s3 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );
Load the input data:
uint64_t in0 = __insn_ldna( in );
uint64_t in1 = __insn_ldna( in + 8 );
uint64_t in2 = __insn_ldna( in + 15 );
uint64_t a   = __insn_revbytes( __insn_dblalign( in0, in1, in ) );
uint64_t b   = __insn_revbytes( __insn_dblalign( in1, in2, in ) );
s0 = (a>>32) ^ __insn_ld4u( rk );
s1 = a       ^ __insn_ld4u( rk + 1 );
s2 = (b>>32) ^ __insn_ld4u( rk + 2 );
s3 = b       ^ __insn_ld4u( rk + 3 );
Do the rounds using the macros (key->rounds is 10, 12 or 14 for AES-128/192/256, so the trailing rk += key->rounds << 2 leaves rk pointing at the last round key):
/* rounds 1-9 */
ROUND_E_T(  4, 5, 6, 7 )
ROUND_E_S(  8, 9,10,11 )
ROUND_E_T( 12,13,14,15 )
ROUND_E_S( 16,17,18,19 )
ROUND_E_T( 20,21,22,23 )
ROUND_E_S( 24,25,26,27 )
ROUND_E_T( 28,29,30,31 )
ROUND_E_S( 32,33,34,35 )
ROUND_E_T( 36,37,38,39 )
if (key->rounds > 10) {
    /* rounds 10-11 */
    ROUND_E_S( 40,41,42,43 )
    ROUND_E_T( 44,45,46,47 )
    if (key->rounds > 12) {
        /* rounds 12-13 */
        ROUND_E_S( 48,49,50,51 )
        ROUND_E_T( 52,53,54,55 )
    }
}
rk += key->rounds << 2;
Use the method described above for the last round and store things, now with added support for the GX_STORE_ALIGNED macro:
/*
 * apply last round and
 * map cipher state to byte array block:
 */
// Exploit pairs in Te tables: 0=yppx,1=xypp,2=pxyp,3=ppxy
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t0 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t1 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t2 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t3 );
s0 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t1 );
pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t2 );
pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t3 );
pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t0 );
s1 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ),
                __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
#if defined(GX_STORE_ALIGNED)
s0 = __insn_revbytes( __insn_v2packh( s0, s1 ) ^
                      __insn_v4int_l( __insn_ld4u( rk + 0 ), __insn_ld4u( rk + 1 ) ) );
__insn_st_add( out, s0, 8 );
#else
s0 = __insn_v2packh( s1, s0 ) ^ __insn_v4int_l( __insn_ld4u( rk + 1 ), __insn_ld4u( rk + 0 ) );
s1 = s0>>32;
PUTU32(out, s0);
PUTU32(out + 4, s1);
#endif
Same for s2 and s3, obviously (not shown).
On to decryption. First, align the Td0-3 tables in the same manner as the Te tables (not shown).
Define the decrypt round macros:
#define ROUND_D_T(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s1 ); \
    t0 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s2 ); \
    t1 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s3 ); \
    t2 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s0 ); \
    t3 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

#define ROUND_D_S(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t1 ); \
    s0 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t2 ); \
    s1 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t3 ); \
    s2 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t0 ); \
    s3 = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );
Loading the data looks exactly the same, and the ROUND_E macros are replaced with ROUND_D, so let's not waste time on that.
The last round of decryption uses Td4, which is a byte table, so tblidx (geared towards 1 KB tables of 4-byte entries) is of no use here. The OpenSSL code looks like this:
/*
 * apply last round and
 * map cipher state to byte array block:
 */
s0 =
    ((u32)Td4[(t0 >> 24)       ] << 24) ^
    ((u32)Td4[(t3 >> 16) & 0xff] << 16) ^
    ((u32)Td4[(t2 >>  8) & 0xff] <<  8) ^
    ((u32)Td4[(t1      ) & 0xff])       ^
    rk[0];
PUTU32(out     , s0);
s1 =
    ((u32)Td4[(t1 >> 24)       ] << 24) ^
    ((u32)Td4[(t0 >> 16) & 0xff] << 16) ^
    ((u32)Td4[(t3 >>  8) & 0xff] <<  8) ^
    ((u32)Td4[(t2      ) & 0xff])       ^
    rk[1];
PUTU32(out +  4, s1);
s2 =
    ((u32)Td4[(t2 >> 24)       ] << 24) ^
    ((u32)Td4[(t1 >> 16) & 0xff] << 16) ^
    ((u32)Td4[(t0 >>  8) & 0xff] <<  8) ^
    ((u32)Td4[(t3      ) & 0xff])       ^
    rk[2];
PUTU32(out +  8, s2);
s3 =
    ((u32)Td4[(t3 >> 24)       ] << 24) ^
    ((u32)Td4[(t2 >> 16) & 0xff] << 16) ^
    ((u32)Td4[(t1 >>  8) & 0xff] <<  8) ^
    ((u32)Td4[(t0      ) & 0xff])       ^
    rk[3];
PUTU32(out + 12, s3);
As mentioned earlier, gcc refuses to use interleave instructions, so a straight C version would spend a lot of time shifting. The indexed loads look fine, though, so let's keep those and do the merging with intrinsics:
s0 = __insn_v2int_l( __insn_v1int_l( Td4[(t0>>24)     ], Td4[(t3>>16)&0xff] ),
                     __insn_v1int_l( Td4[(t2>> 8)&0xff], Td4[(t1    )&0xff] ) ) ^ rk[0];
s1 = __insn_v2int_l( __insn_v1int_l( Td4[(t1>>24)     ], Td4[(t0>>16)&0xff] ),
                     __insn_v1int_l( Td4[(t3>> 8)&0xff], Td4[(t2    )&0xff] ) ) ^ rk[1];
s2 = __insn_v2int_l( __insn_v1int_l( Td4[(t2>>24)     ], Td4[(t1>>16)&0xff] ),
                     __insn_v1int_l( Td4[(t0>> 8)&0xff], Td4[(t3    )&0xff] ) ) ^ rk[2];
s3 = __insn_v2int_l( __insn_v1int_l( Td4[(t3>>24)     ], Td4[(t2>>16)&0xff] ),
                     __insn_v1int_l( Td4[(t1>> 8)&0xff], Td4[(t0    )&0xff] ) ) ^ rk[3];
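With byte-sized Td4 values, the interleaves reduce to simple shift-and-or merges; the following sketch (my reading of v1int_l/v2int_l, not Tilera documentation) shows why the result matches the reference code:

#include <stdint.h>

/* For operands that fit in a byte / halfword, interleaving the low parts is
 * just a shift and an or: */
static inline uint32_t v1int_l_equiv(uint32_t a, uint32_t b) /* a, b < 0x100   */
{
    return (a << 8) | b;
}
static inline uint32_t v2int_l_equiv(uint32_t a, uint32_t b) /* a, b < 0x10000 */
{
    return (a << 16) | b;
}
/* so s0 becomes (Td4[..] << 24) | (Td4[..] << 16) | (Td4[..] << 8) | Td4[..],
 * xored with rk[0], exactly as in the reference code. */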
The indexing expands to bfextu (bit-field extract) for the 12 masked indexes and plain shifts for the remaining four, and the merging is done with interleaves instead of shifts and xors.
I wrote a simple test program and ran it on the simulator in functional mode, so all memory accesses are reduced to L1 latency (2 cycles). get_cycle_count() is used to measure the cycles spent.
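The measurement itself is nothing fancy; a minimal sketch (assuming get_cycle_count() from <arch/cycle.h> and OpenSSL's usual AES_encrypt() entry point; the actual test.c is in the zip archive below) looks like this:

#include <stdint.h>
#include <arch/cycle.h>    /* get_cycle_count() in the TILE-Gx toolchain */
#include <openssl/aes.h>

static uint64_t bench_encrypt(const unsigned char *in, unsigned char *out,
                              const AES_KEY *key, int iterations)
{
    uint64_t start = get_cycle_count();
    for (int i = 0; i < iterations; i++)
        AES_encrypt(in, out, key);
    return get_cycle_count() - start;
}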
Let's first try with the regular unaligned store. In the output, the first cycle count is for the reference aes_core.c, the second for the TILE-Gx version, and the percentage is the relative reduction:
[user@cloudy openssl]$ tile-cc -Wall -o test_aes -O3 -std=gnu99 -static --save-temps test.c aes_core.c aes_core_tilegx.c
[user@cloudy openssl]$ tile-monitor --simulator --image 1x1 --upload test_aes /tmp/aes_gx --run -+- /tmp/aes_gx -+- --quit --functional
enc10 01234567 ok
enc12 01234567 ok
enc14 01234567 ok
dec10 01234567 ok
dec12 01234567 ok
dec14 01234567 ok
AES_Encrypt() 15052685 11469981 23.801096%
AES_Decrypt() 14994468 11703889 21.945286%
done
And then define GX_STORE_ALIGNED for faster stores:
[user@cloudy openssl]$ tile-cc -DGX_STORE_ALIGNED -Wall -o test_aes -O3 -std=gnu99 -static --save-temps test.c aes_core.c aes_core_tilegx.c
[user@cloudy openssl]$ tile-monitor --simulator --image 1x1 --upload test_aes /tmp/aes_gx --run -+- /tmp/aes_gx -+- --quit --functional
enc10 01234567 ok
enc12 01234567 ok
enc14 01234567 ok
dec10 01234567 ok
dec12 01234567 ok
dec14 01234567 ok
AES_Encrypt() 15052685 10989981 26.989897%
AES_Decrypt() 14994468 11103889 25.946762%
done
Source code (zip archive): tilegx_aes_openssl_2015.zip
Source files:
If there's anybody out there who would like to integrate this into OpenSSL, please do so.
Comments are always appreciated. Contact information is available here.
Remember to appreciate this classic Abstruse Goose strip.
This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary:
You may copy and redistribute the material in any medium or format for any purpose, even commercially.
You must give appropriate credit, provide a link to the license, and indicate if changes were made.
If you remix, transform, or build upon the material, you may not distribute the modified material.