AES Optimization on Tilera TILE-Gx

Nils L. Corneliusen
23 December 2011

Introduction

It's Christmas, and what better new toy to play with than the new Tilera Tile-Gx chip? 36 cores, and full ISA documentation on Tilera's webpage. Let's see how fast we can get 128 bit AES encryption going on that chip.

The Tile-Gx has a full battery of vector operations, very few of which are useful here. However, there are a couple that really make this interesting.

Starting off easy

A typical packet encryption loop would look something like this:

void encrypt_packet( uint8_t *src, uint8_t *dst, AES_KEY *key, int length )
{
    int i;

    for( i = 0; i < length; i += 16 ) {
        AES_encrypt( src, dst, key );
        src += 16;
        dst += 16;
    }
}

You'd probably stuff in an IV and xor with the data, but the general principle is the same. It also makes this article easier to understand.
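For reference, the IV chaining mentioned above looks roughly like this. This is a sketch of CBC-style chaining only; toy_encrypt is a placeholder standing in for the real AES_encrypt(), just to make the xor-and-chain structure visible:

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for AES_encrypt(): xors with a fixed pattern.
   Only here to make the chaining visible; NOT a cipher. */
static void toy_encrypt(const uint8_t *src, uint8_t *dst, const uint8_t *key)
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[i] ^ key[i];
}

/* CBC-style chaining: xor each block with the previous ciphertext
   (the IV for the first block) before encrypting it. */
void encrypt_packet_cbc(const uint8_t *src, uint8_t *dst,
                        const uint8_t *key, uint8_t iv[16], int length)
{
    uint8_t block[16];

    for (int i = 0; i < length; i += 16) {
        for (int j = 0; j < 16; j++)
            block[j] = src[i + j] ^ iv[j];
        toy_encrypt(block, dst + i, key);
        memcpy(iv, dst + i, 16);  /* next block chains on this ciphertext */
    }
}
```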

This is of course not very efficient. As explained in the SRTP article, first strip down the AES_encrypt() function, inline it, and expect aligned data:

void encrypt_packet( uint32_t *in, uint32_t *out, uint32_t *rk, int length )
{
    int i;
    uint32_t s0, s1, s2, s3;
    uint32_t t0, t1, t2, t3;

    for( i = 0; i < length; i += 16 ) {
        /* round 0 */
        s0 = in[0] ^ rk[0];
        s1 = in[1] ^ rk[1];
        s2 = in[2] ^ rk[2];
        s3 = in[3] ^ rk[3];
        /* round 1: */
        t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >>  8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[ 4];
        t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >>  8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[ 5];
        t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >>  8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[ 6];
        t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >>  8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[ 7];
        (...)

        in += 4;
        out += 4;
    }
}

Looking at the generated code for this, a couple of things are apparent: 32 bit operations are not very efficient, gcc fails to use the full register file, and the Te tables could be accessed much more cleverly.

The Tile-Gx is a VLIW architecture that can issue up to 3 instructions per cycle, and we should be able to hit that rate more frequently. Right now we spend 344 cycles in the inner loop, of which only 33% issue 3 instructions per cycle and the rest 2 per cycle. The restrictions on issuing 3 per cycle are many, but surely we can get around them.

A look at the Te0-3 tables

The Tile-Gx has a nifty instruction called tblidx. It extracts one of bytes 0-3 from a given register and inserts it into bits 2-9 of another register. In other words, we get the table index extract/shift/add in just one instruction. The downside is that the tables must be aligned on 1KB boundaries. So we copy the Te0-3 tables to 1KB aligned buffers (do that yourself, please), and use tblidx to access them.
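In C terms, tblidxbN does roughly this (my reading of the ISA docs; a model, not the real intrinsic):

```c
#include <stdint.h>

/* Model of tblidxbN: take byte n of 'val' and insert it into
   bits 2..9 of 'base', leaving the other bits of 'base' alone.
   With 'base' pointing at a 1KB-aligned uint32_t table, the
   result is table_base + 4*byte in a single instruction. */
static uint64_t tblidx_model(uint64_t base, uint64_t val, int n)
{
    uint64_t byte = (val >> (8 * n)) & 0xff;
    return (base & ~0x3FCull) | (byte << 2);
}
```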

Let's first make some defines to improve readability:

#define tblidxb0 __insn_tblidxb0
#define tblidxb1 __insn_tblidxb1
#define tblidxb2 __insn_tblidxb2
#define tblidxb3 __insn_tblidxb3
#define ld4u     __insn_ld4u
#define ld_add   __insn_ld_add
#define st_add   __insn_st_add
#define v4int_l  __insn_v4int_l

Then we make ROUND macros using tblidx:

#define ROUND_T(RKINDX) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s3 ); \
    t0 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX];   \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s0 ); \
    t1 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+1]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s1 ); \
    t2 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+2]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s2 ); \
    t3 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+3];

#define ROUND_S(RKINDX) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t3 ); \
    s0 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX];   \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t0 ); \
    s1 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+1]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t1 ); \
    s2 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+2]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t2 ); \
    s3 = *pte0 ^ *pte1 ^ *pte2 ^ *pte3 ^ rk[RKINDX+3];

Next, we stuff that into the loop:

void encrypt_packet( uint32_t *in, uint32_t *out, uint32_t *rk, int length )
{
    int i;
    uint64_t s0, s1, s2, s3;
    uint64_t t0, t1, t2, t3;
    uint32_t *pte0, *pte1, *pte2, *pte3;

    // These are the 1K aligned Te-buffers
    pte0 = Te0a;
    pte1 = Te1a;
    pte2 = Te2a;
    pte3 = Te3a;

    for( i = 0; i < length; i += 16 ) {

        /* round 0 */
        s0 = in[0] ^ rk[0];
        s1 = in[1] ^ rk[1];
        s2 = in[2] ^ rk[2];
        s3 = in[3] ^ rk[3];

        /* round 1-9: */
        ROUND_T( 4 )
        ROUND_S( 8 )
        ROUND_T( 12 )
        ROUND_S( 16 )
        ROUND_T( 20 )
        ROUND_S( 24 )
        ROUND_T( 28 )
        ROUND_S( 32 )
        ROUND_T( 36 )

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t0 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t1 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t2 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t3 );
        out[0] = ((*pte2&0xff000000) | (*pte3&0x00ff0000) | (*pte0&0x0000ff00) | (*pte1&0x000000ff)) ^ rk[40];

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t1 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t2 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t3 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t0 );
        out[1] = ((*pte2&0xff000000) | (*pte3&0x00ff0000) | (*pte0&0x0000ff00) | (*pte1&0x000000ff)) ^ rk[41];

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t2 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t3 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t0 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t1 );
        out[2] = ((*pte2&0xff000000) | (*pte3&0x00ff0000) | (*pte0&0x0000ff00) | (*pte1&0x000000ff)) ^ rk[42];

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t3 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t0 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t1 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t2 );
        out[3] = ((*pte2&0xff000000) | (*pte3&0x00ff0000) | (*pte0&0x0000ff00) | (*pte1&0x000000ff)) ^ rk[43];
        
        in += 4;
        out += 4;
    }
}

That's all well and good, but we're still only seeing a 5% speedup over the previous version.

Only 32 bits and nothing more

Looking at the generated code, it's easy to see that gcc does not like 32 bit memory operations. It uses ld4s and clears the upper half afterwards. The generated code is truly puzzling, and this seems to be a compiler bug: given a uint32 array, two loads feeding a simple intrinsic generate ld4u, but adding the xor gives ld4s+v4int_l, which is a disaster.
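For clarity, ld4u and ld4s differ only in how the upper half of the destination register is filled. A C model of the two loads (not the intrinsics themselves):

```c
#include <stdint.h>

/* Models of the two 32-bit loads: ld4u zero-extends into the
   upper 32 bits, ld4s sign-extends. Feeding a sign-extended
   value to tblidx would be wrong, hence the extra cleanup
   instructions gcc emits. */
static uint64_t ld4u_model(const uint32_t *p)
{
    return (uint64_t)*p;
}

static uint64_t ld4s_model(const uint32_t *p)
{
    return (uint64_t)(int64_t)(int32_t)*p;
}
```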

We avoid that problem by enforcing the correct ld4u load operations in the ROUND macros as follows:

#define ROUND_T(RKINDX) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s3 ); \
    t0 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s0 ); \
    t1 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+1]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s1 ); \
    t2 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+2]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s2 ); \
    t3 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+3];

#define ROUND_S(RKINDX) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t3 ); \
    s0 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t0 ); \
    s1 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+1]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t1 ); \
    s2 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+2]; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t2 ); \
    s3 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk[RKINDX+3];

This certainly improves the situation, but what more can be done to make this truly efficient?

Mmmmm registers

First, there are 64 registers in the Tile-Gx chip. Quickly skimming the generated code, it's obvious that we have lots of unused registers. And with packet sizes regularly at 1600+ bytes, we need to use them. The trick is to stuff the entire key (rk) into registers, saving 44 loads per iteration. This makes the bundling extremely efficient and symmetrical, since we get exactly one xor/tblidx/ld4u bundle per step.

Second, we go full 64 bit. I assume everything is aligned on 64 bit boundaries. It's trivial to rewrite the load/store part using 32 bit alignment, but the cost is about 5 percentage points.
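The key-load pattern below relies on the Tile-Gx being little-endian: one 64-bit load covers two consecutive 32-bit round-key words, low word first, which is why rk1 is just rk0 shifted down by 32. A portable C model of one such paired load (load_rk_pair is a hypothetical helper, with memcpy standing in for ld_add):

```c
#include <stdint.h>
#include <string.h>

/* The round keys live in memory as an array of 32-bit words.
   On a little-endian machine, a single 64-bit load of two
   adjacent words puts the first word in the low half. */
static void load_rk_pair(const uint32_t *rk, uint64_t *w0, uint64_t *w1)
{
    uint64_t v;
    memcpy(&v, rk, 8);       /* stands in for ld_add */
    *w0 = v & 0xffffffffu;   /* rk[0] (little-endian) */
    *w1 = v >> 32;           /* rk[1] */
}
```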

#define ROUND_T(I0,I1,I2,I3) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s3 ); \
    t0 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I0; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s0 ); \
    t1 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I1; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s1 ); \
    t2 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I2; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, s3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, s0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, s1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, s2 ); \
    t3 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I3;

#define ROUND_S(I0,I1,I2,I3) \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t0 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t1 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t2 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t3 ); \
    s0 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I0; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t1 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t2 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t3 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t0 ); \
    s1 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I1; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t2 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t3 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t0 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t1 ); \
    s2 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I2; \
    pte0 = (uint32_t *)tblidxb3( (uint64_t)pte0, t3 ); \
    pte1 = (uint32_t *)tblidxb2( (uint64_t)pte1, t0 ); \
    pte2 = (uint32_t *)tblidxb1( (uint64_t)pte2, t1 ); \
    pte3 = (uint32_t *)tblidxb0( (uint64_t)pte3, t2 ); \
    s3 = ld4u( pte0 ) ^ ld4u( pte1 ) ^ ld4u( pte2 ) ^ ld4u( pte3 ) ^ rk##I3;

void encrypt_packet( const uint64_t *in, uint64_t *out, const uint64_t *rk, uint64_t len )
{
    uint64_t i;
    uint64_t s0, s1, s2, s3;
    uint64_t t0, t1, t2, t3;
    uint32_t *pte0, *pte1, *pte2, *pte3;
    uint64_t rk0, rk1, rk2, rk3, rk4, rk5, rk6, rk7;
    uint64_t rk8, rk9, rk10,rk11,rk12,rk13,rk14,rk15;
    uint64_t rk16,rk17,rk18,rk19,rk20,rk21,rk22,rk23;
    uint64_t rk24,rk25,rk26,rk27,rk28,rk29,rk30,rk31;
    uint64_t rk32,rk33,rk34,rk35,rk36,rk37,rk38,rk39;
    uint64_t rk40,rk41,rk42,rk43;
    uint64_t tmp0,tmp1;

    pte0 = Te0a;
    pte1 = Te1a;
    pte2 = Te2a;
    pte3 = Te3a;

    rk0  = ld_add( rk, 8 ); rk1  = rk0 >>32; rk2  = ld_add( rk, 8 ); rk3  = rk2 >>32;
    rk4  = ld_add( rk, 8 ); rk5  = rk4 >>32; rk6  = ld_add( rk, 8 ); rk7  = rk6 >>32;
    rk8  = ld_add( rk, 8 ); rk9  = rk8 >>32; rk10 = ld_add( rk, 8 ); rk11 = rk10>>32;
    rk12 = ld_add( rk, 8 ); rk13 = rk12>>32; rk14 = ld_add( rk, 8 ); rk15 = rk14>>32;
    rk16 = ld_add( rk, 8 ); rk17 = rk16>>32; rk18 = ld_add( rk, 8 ); rk19 = rk18>>32;
    rk20 = ld_add( rk, 8 ); rk21 = rk20>>32; rk22 = ld_add( rk, 8 ); rk23 = rk22>>32;
    rk24 = ld_add( rk, 8 ); rk25 = rk24>>32; rk26 = ld_add( rk, 8 ); rk27 = rk26>>32;
    rk28 = ld_add( rk, 8 ); rk29 = rk28>>32; rk30 = ld_add( rk, 8 ); rk31 = rk30>>32;
    rk32 = ld_add( rk, 8 ); rk33 = rk32>>32; rk34 = ld_add( rk, 8 ); rk35 = rk34>>32;
    rk36 = ld_add( rk, 8 ); rk37 = rk36>>32; rk38 = ld_add( rk, 8 ); rk39 = rk38>>32;
    rk40 = ld_add( rk, 8 ); rk41 = rk40>>32; rk42 = ld_add( rk, 8 ); rk43 = rk42>>32;

    for( i = 0; i < len; i += 16 ) {

        /* round 0 */
        tmp0 = ld_add( in, 8 ); s0 = tmp0 ^ rk0; s1 = (tmp0>>32) ^ rk1;
        tmp0 = ld_add( in, 8 ); s2 = tmp0 ^ rk2; s3 = (tmp0>>32) ^ rk3;

        /* round 1-9: */
        ROUND_T(  4, 5, 6, 7 )
        ROUND_S(  8, 9,10,11 )
        ROUND_T( 12,13,14,15 )
        ROUND_S( 16,17,18,19 )
        ROUND_T( 20,21,22,23 )
        ROUND_S( 24,25,26,27 )
        ROUND_T( 28,29,30,31 )
        ROUND_S( 32,33,34,35 )
        ROUND_T( 36,37,38,39 )

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t0 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t1 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t2 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t3 );
        tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk40;

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t1 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t2 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t3 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t0 );
        tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk41;

        st_add( out, v4int_l(tmp1,tmp0), 8 );

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t2 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t3 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t0 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t1 );
        tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk42;

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t3 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t0 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t1 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t2 );
        tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk43;

        st_add( out, v4int_l(tmp1,tmp0), 8 );

    }

}

And there we have it, an AES routine that does what it's supposed to. The inner loop now takes 208 cycles total, of which 162 issue 3 instructions per cycle (77%), 40 issue 2 per cycle, and the remaining 6 issue 1 per cycle. Gcc manages to replace the 4 obvious ld4us with ld1u. I could probably play around with it a bit more, but I feel there's little point in that.

So what's in it for me?

In my test scenario, I process 256 1600 byte packets in 5880589 cycles, versus 9497409 cycles with the standard OpenSSL AES_encrypt. That's 38% fewer cycles, or roughly 14.4 cycles per byte instead of 23.2. Sweet.

Addendum 21 September 2015

Please check out the new method for doing the last round here.

As usual, your feedback is appreciated. My email address is at the front page.


www.ignorantus.com