OpenSSL aes_core.c Replacement for EZchip TILE-Gx

Nils L. Corneliusen
24 September 2015

Introduction

As specified in the article AES Optimization on Tilera TILE-Gx, there are several things that can be done to speed up AES encryption on the Tilera EZchip TILE-Gx CPU: Use tblidx/ld4u/xor bundles to compact the code, preload the key to registers and inline the encryption routine where needed.

Sometimes that's not an option and you're stuck with using OpenSSL. Then it would be nice to have a drop-in replacement for aes_core.c that reuses as much as possible from the previous article, and also includes some improvements to loading unaligned data. But before we get to that, let's address one niggling issue: what to do about the last round?

The Last Round, Round Two

Applying the last round and mapping the cipher state back to the byte array block (phew) looks like this in the OpenSSL code:

    s0 =
        (Te2[(t0 >> 24)       ] & 0xff000000) ^
        (Te3[(t1 >> 16) & 0xff] & 0x00ff0000) ^
        (Te0[(t2 >>  8) & 0xff] & 0x0000ff00) ^
        (Te1[(t3      ) & 0xff] & 0x000000ff) ^
        rk[0];
    PUTU32(out     , s0);
    s1 =
        (Te2[(t1 >> 24)       ] & 0xff000000) ^
        (Te3[(t2 >> 16) & 0xff] & 0x00ff0000) ^
        (Te0[(t3 >>  8) & 0xff] & 0x0000ff00) ^
        (Te1[(t0      ) & 0xff] & 0x000000ff) ^
        rk[1];
    PUTU32(out +  4, s1);
    s2 =
        (Te2[(t2 >> 24)       ] & 0xff000000) ^
        (Te3[(t3 >> 16) & 0xff] & 0x00ff0000) ^
        (Te0[(t0 >>  8) & 0xff] & 0x0000ff00) ^
        (Te1[(t1      ) & 0xff] & 0x000000ff) ^
        rk[2];
    PUTU32(out +  8, s2);
    s3 =
        (Te2[(t3 >> 24)       ] & 0xff000000) ^
        (Te3[(t0 >> 16) & 0xff] & 0x00ff0000) ^
        (Te0[(t1 >>  8) & 0xff] & 0x0000ff00) ^
        (Te1[(t2      ) & 0xff] & 0x000000ff) ^
        rk[3];
    PUTU32(out + 12, s3);

which I converted to this:

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t0 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t1 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t2 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t3 );
        tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk40;

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t1 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t2 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t3 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t0 );
        tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk41;

        st_add( out, v4int_l(tmp1,tmp0), 8 );

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t2 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t3 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t0 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t1 );
        tmp0 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk42;

        pte2 = (uint32_t *)tblidxb3( (uint64_t)pte2, t3 );
        pte3 = (uint32_t *)tblidxb2( (uint64_t)pte3, t0 );
        pte0 = (uint32_t *)tblidxb1( (uint64_t)pte0, t1 );
        pte1 = (uint32_t *)tblidxb0( (uint64_t)pte1, t2 );
        tmp1 = ((ld4u(pte2)&0xff000000) | (ld4u(pte3)&0x00ff0000) | (ld4u(pte0)&0x0000ff00) | (ld4u(pte1)&0x000000ff)) ^ rk43;

        st_add( out, v4int_l(tmp1,tmp0), 8 );

That's rather cumbersome: 28 logical operations just for concatenating the data. GCC refuses to generate interleave instructions, so some cycles can be saved by using them explicitly. Surely there's a better solution? Let's go to the source: the Te tables. Recall that each table entry contains a pair of identical bytes (the plain S-box value), and that the masking pattern above always picks its byte from one of those pairs, so either byte of the pair will do. Thanks to how the pairs are placed, they can be picked out with v2int_l, which leaves every other byte in the upper and lower halves of two registers holding the correct values. These can then be merged with mm() and packed with v2packh. This drawing might do a better job of explaining it:

[Drawing: The Last Round]

I combine s0/s1 and s2/s3 in the code below to avoid extra packs:

    // Exploit pairs in Te tables: 0=yppx,1=xypp,2=pxyp,3=ppxy
    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t0 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t1 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t2 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t3 );
    s0 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t1 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t2 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t3 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t0 );
    s1 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );

    s0 = __insn_revbytes( __insn_v2packh( s0, s1 ) ^ __insn_v4int_l( __insn_ld4u( rk + 0 ), __insn_ld4u( rk + 1 ) ) );
    __insn_st_add( out, s0, 8 );

    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t2 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t3 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t0 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t1 );
    s2 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t3 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t0 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t1 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t2 );
    s3 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );

    s2 = __insn_revbytes( __insn_v2packh( s2, s3 ) ^ __insn_v4int_l( __insn_ld4u( rk + 2 ), __insn_ld4u( rk + 3 ) ) );
    __insn_st( out, s2 );

That's a cool 12 instructions instead of 28. The net result is about 3% faster execution time. Neat.

Loading Unaligned Data

The TILE-Gx has instructions that make unaligned loads easier: ldna loads from the given address with the lower 3 bits masked off, and dblalign shifts a pair of registers into alignment according to the lower 3 bits of the pointer. Loading 16 bytes thus takes 3 ldna and 2 dblalign. Note the last load at in + 15 rather than in + 16: since ldna masks the low bits anyway, it still fetches the word holding the final byte, but never reads past the buffer when in happens to be 8-byte aligned:

    uint64_t in0 = __insn_ldna( in      );
    uint64_t in1 = __insn_ldna( in + 8  );
    uint64_t in2 = __insn_ldna( in + 15 );
    uint64_t a = __insn_revbytes( __insn_dblalign( in0, in1, in ) );
    uint64_t b = __insn_revbytes( __insn_dblalign( in1, in2, in ) );
    s0 = (a>>32) ^ __insn_ld4u( rk );
    s1 =  a      ^ __insn_ld4u( rk + 1 );
    s2 = (b>>32) ^ __insn_ld4u( rk + 2 );
    s3 =  b      ^ __insn_ld4u( rk + 3 );

What to do about unaligned stores? Not much, really. Either use a series of st1u byte stores, or define GX_STORE_ALIGNED if the destination buffer is guaranteed to be 8-byte aligned.

AES_Encrypt()

Let's start off by aligning the Te0-3 tables to 1 KB boundaries so tblidx can be used:

static const u32 __attribute__ ((aligned (1024))) Te0[256] = {
(...)
static const u32 __attribute__ ((aligned (1024))) Te1[256] = {
(...)
static const u32 __attribute__ ((aligned (1024))) Te2[256] = {
(...)
static const u32 __attribute__ ((aligned (1024))) Te3[256] = {
(...)

Define the encrypt round macros:

#define ROUND_E_T(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s3 ); \
    t0  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s0 ); \
    t1  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s1 ); \
    t2  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s2 ); \
    t3  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

#define ROUND_E_S(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t3 ); \
    s0  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t0 ); \
    s1  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t1 ); \
    s2  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t2 ); \
    s3  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

Load the input data:

    uint64_t in0 = __insn_ldna( in      );
    uint64_t in1 = __insn_ldna( in + 8  );
    uint64_t in2 = __insn_ldna( in + 15 );
    uint64_t a = __insn_revbytes( __insn_dblalign( in0, in1, in ) );
    uint64_t b = __insn_revbytes( __insn_dblalign( in1, in2, in ) );
    s0 = (a>>32) ^ __insn_ld4u( rk );
    s1 =  a      ^ __insn_ld4u( rk + 1 );
    s2 = (b>>32) ^ __insn_ld4u( rk + 2 );
    s3 =  b      ^ __insn_ld4u( rk + 3 );

Do the rounds using the macros:

    /* rounds 1-9 */
    ROUND_E_T(  4, 5, 6, 7 )
    ROUND_E_S(  8, 9,10,11 )
    ROUND_E_T( 12,13,14,15 )
    ROUND_E_S( 16,17,18,19 )
    ROUND_E_T( 20,21,22,23 )
    ROUND_E_S( 24,25,26,27 )
    ROUND_E_T( 28,29,30,31 )
    ROUND_E_S( 32,33,34,35 )
    ROUND_E_T( 36,37,38,39 )

    if (key->rounds > 10) {
        /* rounds 10-11 */
        ROUND_E_S( 40,41,42,43 )
        ROUND_E_T( 44,45,46,47 )

        if (key->rounds > 12) {
            /* rounds 12-13 */
            ROUND_E_S( 48,49,50,51 )
            ROUND_E_T( 52,53,54,55 )
        }
    }
    rk += key->rounds << 2;

Use the method described above for the last round and store things, now with added support for the GX_STORE_ALIGNED macro:

    /*
     * apply last round and
     * map cipher state to byte array block:
     */
    // Exploit pairs in Te tables: 0=yppx,1=xypp,2=pxyp,3=ppxy
    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t0 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t1 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t2 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t3 );
    s0 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );
    pt2 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt2, t1 );
    pt3 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt3, t2 );
    pt0 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt0, t3 );
    pt1 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt1, t0 );
    s1 = __insn_mm( __insn_v2int_l( __insn_ld4u(pt2), __insn_ld4u(pt3) ), __insn_v2int_l( __insn_ld4u(pt0), __insn_ld4u(pt1) ), 32, 63 );

#if defined(GX_STORE_ALIGNED)
    s0 = __insn_revbytes( __insn_v2packh( s0, s1 ) ^ __insn_v4int_l( __insn_ld4u( rk + 0 ), __insn_ld4u( rk + 1 ) ) );
    __insn_st_add( out, s0, 8 );
#else
    s0 = __insn_v2packh( s1, s0 ) ^ __insn_v4int_l( __insn_ld4u( rk + 1 ), __insn_ld4u( rk + 0 ) );
    s1 = s0>>32;
    PUTU32(out,      s0);
    PUTU32(out +  4, s1);
#endif

Same for s2 and s3, obviously (not shown).

AES_Decrypt()

First, align the Td0-3 tables in the same manner as the Te tables (not shown).

Define the decrypt round macros:

#define ROUND_D_T(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s1 ); \
    t0  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s2 ); \
    t1  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s3 ); \
    t2  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, s3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, s2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, s1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, s0 ); \
    t3  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

#define ROUND_D_S(I0,I1,I2,I3) \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t0 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t3 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t2 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t1 ); \
    s0  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I0 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t1 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t0 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t3 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t2 ); \
    s1  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I1 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t2 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t1 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t0 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t3 ); \
    s2  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I2 ); \
    pt0 = (uint32_t *)__insn_tblidxb3( (uint64_t)pt0, t3 ); \
    pt1 = (uint32_t *)__insn_tblidxb2( (uint64_t)pt1, t2 ); \
    pt2 = (uint32_t *)__insn_tblidxb1( (uint64_t)pt2, t1 ); \
    pt3 = (uint32_t *)__insn_tblidxb0( (uint64_t)pt3, t0 ); \
    s3  = __insn_ld4u( pt0 ) ^ __insn_ld4u( pt1 ) ^ __insn_ld4u( pt2 ) ^ __insn_ld4u( pt3 ) ^ __insn_ld4u( rk + I3 );

Loading the data looks exactly the same, and the ROUND_E macros are replaced with ROUND_D, so let's not waste time on that.

The last round now uses a byte table, so using tblidx is not possible. The OpenSSL code looks like this:

    /*
     * apply last round and
     * map cipher state to byte array block:
     */
    s0 =
        ((u32)Td4[(t0 >> 24)       ] << 24) ^
        ((u32)Td4[(t3 >> 16) & 0xff] << 16) ^
        ((u32)Td4[(t2 >>  8) & 0xff] <<  8) ^
        ((u32)Td4[(t1      ) & 0xff])       ^
        rk[0];
    PUTU32(out     , s0);
    s1 =
        ((u32)Td4[(t1 >> 24)       ] << 24) ^
        ((u32)Td4[(t0 >> 16) & 0xff] << 16) ^
        ((u32)Td4[(t3 >>  8) & 0xff] <<  8) ^
        ((u32)Td4[(t2      ) & 0xff])       ^
        rk[1];
    PUTU32(out +  4, s1);
    s2 =
        ((u32)Td4[(t2 >> 24)       ] << 24) ^
        ((u32)Td4[(t1 >> 16) & 0xff] << 16) ^
        ((u32)Td4[(t0 >>  8) & 0xff] <<  8) ^
        ((u32)Td4[(t3      ) & 0xff])       ^
        rk[2];
    PUTU32(out +  8, s2);
    s3 =
        ((u32)Td4[(t3 >> 24)       ] << 24) ^
        ((u32)Td4[(t2 >> 16) & 0xff] << 16) ^
        ((u32)Td4[(t1 >>  8) & 0xff] <<  8) ^
        ((u32)Td4[(t0      ) & 0xff])       ^
        rk[3];
    PUTU32(out + 12, s3);

As mentioned earlier, gcc refuses to use interleave instructions, so there'll be lots of shifting. The indexed loads look ok, so let's keep them:

    s0 = __insn_v2int_l( __insn_v1int_l( Td4[(t0>>24)     ], Td4[(t3>>16)&0xff] ),
                         __insn_v1int_l( Td4[(t2>> 8)&0xff], Td4[(t1    )&0xff] ) ) ^ rk[0];
    s1 = __insn_v2int_l( __insn_v1int_l( Td4[(t1>>24)     ], Td4[(t0>>16)&0xff] ),
                         __insn_v1int_l( Td4[(t3>> 8)&0xff], Td4[(t2    )&0xff] ) ) ^ rk[1];
    s2 = __insn_v2int_l( __insn_v1int_l( Td4[(t2>>24)     ], Td4[(t1>>16)&0xff] ),
                         __insn_v1int_l( Td4[(t0>> 8)&0xff], Td4[(t3    )&0xff] ) ) ^ rk[2];
    s3 = __insn_v2int_l( __insn_v1int_l( Td4[(t3>>24)     ], Td4[(t2>>16)&0xff] ),
                         __insn_v1int_l( Td4[(t1>> 8)&0xff], Td4[(t0    )&0xff] ) ) ^ rk[3];

That expands to bfextu for 12 of the 16 indices and plain shifts for the remaining 4. And the merging is done by interleaves, not shifts and xors.

Measurements

I wrote a simple test program that uses the simulator in functional mode, so all memory accesses are reduced to L1 latency (2 cycles). get_cycle_count() is used to measure cycles used.

Let's first try with regular unaligned store:

[user@cloudy openssl]$ tile-cc -Wall -o test_aes -O3 -std=gnu99 -static --save-temps test.c aes_core.c aes_core_tilegx.c
[user@cloudy openssl]$ tile-monitor --simulator --image 1x1 --upload test_aes /tmp/aes_gx --run -+- /tmp/aes_gx -+- --quit --functional
enc10 01234567 ok
enc12 01234567 ok
enc14 01234567 ok
dec10 01234567 ok
dec12 01234567 ok
dec14 01234567 ok
AES_Encrypt()
15052685 11469981 23.801096%
AES_Decrypt()
14994468 11703889 21.945286%
done

And then define GX_STORE_ALIGNED for faster stores:

[user@cloudy openssl]$ tile-cc -DGX_STORE_ALIGNED -Wall -o test_aes -O3 -std=gnu99 -static --save-temps test.c aes_core.c aes_core_tilegx.c
[user@cloudy openssl]$ tile-monitor --simulator --image 1x1 --upload test_aes /tmp/aes_gx --run -+- /tmp/aes_gx -+- --quit --functional
enc10 01234567 ok
enc12 01234567 ok
enc14 01234567 ok
dec10 01234567 ok
dec12 01234567 ok
dec14 01234567 ok
AES_Encrypt()
15052685 10989981 26.989897%
AES_Decrypt()
14994468 11103889 25.946762%
done

Source Code

Source files:

If there's anybody out there who would like to integrate this into OpenSSL, please do so.

Comments are always appreciated. I prefer being contacted on LinkedIn. Please refer to this webpage if we have no connections in common. If you're using old school email, you can figure out my mail address from the front page. Be aware that the spam filtering is extreme; hardly anything gets through.

Remember to appreciate this classic Abstruse Goose strip.


www.ignorantus.com