Integer Raytracing on Mellanox TILE-Gx

Nils L. Corneliusen
5 June 2017

Introduction

A second attempt at raytracing on the Mellanox TILE-Gx. The TILE-Gx has very limited floating point support in hardware, so let's try using fixed point math instead. Unfortunately, calculating square roots is very time consuming. An alternative approach is explored where custom conversion routines and integer math does this quickly enough to render 40 spheres in 1080p60.

Command and Control

The main part of this is duct taping code from three old projects together:

Let's extract the sphere control code from the GPU based raytracer and keep that mostly unchanged. So it's still in floating point, and the positions etc. only get converted as the last step before rendering. That makes the job much simpler. It runs in roughly 1.5ms on the Gx36 with 40 spheres. The parallelization library from the bilinear scaler is used for controlling the parallel jobs. The base raytracer code from the TILE-Gx floating point raytracer is also used, but without floating point of course.

The main rendering loop is very simple when using the parallelization library. Convert the float data to integer and start the rendering jobs:

    while( time_now < time_quit ) {

        (...)

        uint64_t t0 = get_usec();

        // convert world data to fixed point
        world_to_iworld( w0, iw0 );

        // start jobs
        raytracerdata rd;

        rd.dst = dst;

        for( int y = 0; y < rh; y += BLOCKY ) {

            rd.y = y;

            for( int x = 0; x < rw; x += BLOCKX ) {

                rd.x = x;

                par_sendjob( raytracer_func, &rd, sizeof(rd) );

            }

        }

Then move the spheres while jobs are wrapping up:

        if( staticframe == -1 ) {
            for( int i = 0; i < timeskip; i++ ) {
                world_timer( w0 );
            }
        }

Wait for all done, grab the time and occasionally write how much time was spent. And draw a smoothed load bar:

        par_wait();

        uint64_t t1 = get_usec();

        if( time_now%30 == 0 ) printf( "%lu\n", t1-t0 );

        uint64_t frametime = t1-t0;
        total += frametime;

        loadlist[loadpos++] =  (frametime*100.0f)/(1000000.0f/60.0f);
        if( loadpos >= LOADS ) loadpos = 0;

        float loadsum = 0.0f;
        for( int i = 0; i < LOADS; i++ ) {
            loadsum += loadlist[i];
        }

        float load = loadsum / LOADS;
        if( load > 100.0f ) load = 100.0f;

        drawbar( dst, (int)(load*((float)WIDTH/100.0f)) );
        time_now++;
    }

The raytracerdata structure and control function looks like this:

// Parallel raytracer
typedef struct {
    int x;
    int y;
    uint8_t *dst;
} raytracerdata;

static void raytracer_func( void *data, unsigned int len )
{
    (void)len;
    raytracerdata *rd = data;

    render( rd->dst, rd->x, rd->y );
}

We'll get to the render() function soon, but first:

Integers

As mentioned in the introduction, square root (and inverse square root) calculations are needed. Doing that in fixed point is slow and laborious. But all is not lost: There's still some floating point support in the TILE-Gx. An fp32 add/sub/mul can be done in 4 instructions. Now, here's a brilliant idea (or ridiculously stupid, I forget which): If fixed point numbers can be converted to and from floating point quickly, square roots can be calculated using normal integer operations on the floats..

Chris Hecker's article from Game Developer Magazine Feb/Mar 1996 covers the conversion basics. Super short version: If the range of the floating point number is known, float add enough so the exponent is fixed, extract the mantissa and integer subtract. Shift the result as needed. And vice versa it's slightly simpler. Using this method, a float to fixed point conversion costs roughly 7 instructions on the TILE-Gx, which bundle nicely if more stuff is done at the same time. The downside is that precision is lost, since there's only 23 bits of mantissa. So let's try to keep enough precision where needed.

Let's define the datatypes and conversion operations in ivec3.h. 20 bits is used for the fraction. This is more than enough, and makes square root calculations later slightly quicker:

#define FRACBITS 20

typedef int64_t fixed;

typedef struct {
    fixed x;
    fixed y;
    fixed z;
} ivec3;

For the sphere data, the number range -256.x to 255.x is enough:

union fi32_u
{
    float f;
    int32_t i;
};

union fu32_u
{
    float f;
    uint32_t u;
};

// Float range -256.x to 255.x, mn 9.14
static inline int32_t f2i( float f )
{
    union fi32_u x;

    x.f  = f;
    x.f += 768.0f;

    x.i  &= 0x7fffff;
    x.i  -= (1<<8)<<14;
    x.i <<= (FRACBITS-14);

    return x.i;
}

static inline ivec3 v2iv( vec3 v )
{
    ivec3 iv;

    iv.x = f2i( v.x );
    iv.y = f2i( v.y );
    iv.z = f2i( v.z );

    return iv;
}

Next is multiplication and dot operations. mul(a)_ls_ls() does 32x32->64 bit signed multiply which can be scheduled in both x0 and y0 slots. Since fixed point is used, mula_ls_ls() is only useful where 2 or more multiplications can be stacked:

static inline fixed imul( fixed a, fixed b )
{
    return __insn_mul_ls_ls( a, b )>>FRACBITS;
}

static inline fixed idot( ivec3 a, ivec3 b )
{
    fixed r = __insn_mul_ls_ls(     a.x, b.x );
          r = __insn_mula_ls_ls( r, a.y, b.y );
          r = __insn_mula_ls_ls( r, a.z, b.z );
    return r>>FRACBITS;
}

static inline fixed idotsq( ivec3 a, ivec3 b )
{
    fixed r = __insn_mul_ls_ls(     a.x, b.x );
          r = __insn_mula_ls_ls( r, a.y, b.y );
          r = __insn_mula_ls_ls( r, a.z, b.z );
          r >>= FRACBITS;
    return imul( r, r );
}

static inline fixed inorm2( ivec3 a )
{
    fixed r = __insn_mul_ls_ls(     a.x, a.x );
          r = __insn_mula_ls_ls( r, a.y, a.y );
          r = __insn_mula_ls_ls( r, a.z, a.z );
    return r>>FRACBITS;
}

Square root is only needed in the inner loop. The range is low, 0.x-7.x, meaning that the mantissa is 3 bits integer + 20 bits fraction. No shifts are needed to convert. The integer part is based on code from the Wikipedia article Methods of computing square roots: Approximations that depend on the floating point representation:

// Note: Range 0.x-7.x, mn 3.20
static inline fixed isqrt( fixed num )
{
    union fu32_u x;

    x.u  = num;
//    x.u >>= FRACBITS-20;
    x.u |= 0x41000000;
    x.f -= 8.0f;

    x.u >>= 1;
    x.u  += (1<<29)-(1<<22);

    x.f += 8.0f;
    x.u &= 0x7fffff;
//    x.u <<= FRACBITS-20;

    return x.u;
}

The normalization function needs larger range, 0.x-511.x. It incorporates the classic trick from Quake III for the estimate and two NR steps to get precision under control. It's covered in the Wikipedia article Fast inverse square root:

// Note: Range 0.x-511.x, mn 9.14
static inline ivec3 inormalize( ivec3 a )
{
    union fu32_u x;

    fixed aa = inorm2( a );

    x.u   = aa;
    x.u >>= FRACBITS-14;
    x.u  |= 0x44000000;
    x.f  -= 512.0f;

    x.u = 0x5f3759df - (x.u >> 1);

    x.f  += 512.0f;
    x.u  &= 0x7fffff;
    x.u <<= FRACBITS-14;

    aa >>= 1;

    x.u = imul( x.u, (0x03<<(FRACBITS-1)) - imul( imul( x.u, x.u ), aa ) );
    x.u = imul( x.u, (0x03<<(FRACBITS-1)) - imul( imul( x.u, x.u ), aa ) );

    return ivec3_scale( a, x.u );
}

That's all the basics, now let's get some raytracing done!

The Integer Raytracer

The render() function does the actual raytracing. The input coordinates is the upper left corner of a 16x16 block to be rendered. First, check if the corners hit anything and just clear the block if not:

    ivec3 startpos = ivec3_set( imul(-iw0->swidth,  1<<(FRACBITS-1) ) - iw0->eye.x + imul(xstart<<FRACBITS, iw0->ax ),
                                imul( iw0->sheight, 1<<(FRACBITS-1) ) - iw0->eye.y - imul(ystart<<FRACBITS, iw0->ay ),
                                                                      - iw0->eye.z );

    // just clear block if no hits on corners
    ivec3 spos;
    spos       = startpos;                           ivec3 ul = inormalize( spos );
    spos.y    -= imul(iw0->ay,(BLOCKY-1)<<FRACBITS); ivec3 bl = inormalize( spos );
    spos.x    += imul(iw0->ax,(BLOCKX-1)<<FRACBITS); ivec3 br = inormalize( spos );
    spos.y    += imul(iw0->ay,(BLOCKY-1)<<FRACBITS); ivec3 ur = inormalize( spos );

    int i;
    for( i = 0; i < OBJNUM; i++ ) {
        isphere *sp = &iw0->spheres[i];
        ivec3 v     = sp->poseye;
        fixed rv    = sp->rv;

        if( idotsq( v, ul ) + rv > 0 || idotsq( v, bl ) + rv > 0 ||
            idotsq( v, br ) + rv > 0 || idotsq( v, ur ) + rv > 0    ) break;
     }

    int lastrow = MIN( iw0->height, ystart+BLOCKY );

    if( i == OBJNUM ) {

        for( int y = ystart; y < lastrow; y++ ) {

            // adjust this if BLOCKX is changed
            outr[0] = BG_R64; outr[1] = BG_R64; outr += rowstride;
            outg[0] = BG_R64; outg[1] = BG_G64; outg += rowstride;
            outb[0] = BG_R64; outb[1] = BG_B64; outb += rowstride;
        }

        return;
    }

Set up the y and x loops and get the inital ray_pos and ray_dir:

    spos.y = startpos.y;

    uint64_t r0 = 0, g0 = 0, b0 = 0;

    for( int y = ystart; y < lastrow; y++, spos.y -= iw0->ay ) {

        spos.x = startpos.x;

        for( int x = 0; x < BLOCKX; x++, spos.x += iw0->ax ) {

            ivec3 ray_pos = iw0->eye;
            ivec3 ray_dir = inormalize( spos );

Then comes the main reflection loop where the closest sphere and distance is tracked:

            ivec3 col = { 0, 0, 0 };

            for( int j = 0; j < MAXREF; j++ ) {

                int   rt_hit  = NOHIT;
                fixed rt_dist = FARDIST;

For the first reflection pass, use the precalculated values to save time. Unroll 4 times to get as many instructions bundled as possible. Gcc, unlike Nvcc, can't stack if()s correctly, so do early cutoff if all miss. If there's a hit, calculate the distance and store it if closer:

                if( j == 0 ) {

                    for( int i = 0; i < OBJNUM; i += 4 ) {

                        fixed b0 = idot( iw0->spheres[i+0].poseye, ray_dir );
                        fixed b1 = idot( iw0->spheres[i+1].poseye, ray_dir );
                        fixed b2 = idot( iw0->spheres[i+2].poseye, ray_dir );
                        fixed b3 = idot( iw0->spheres[i+3].poseye, ray_dir );
                        fixed d0 = imul( b0, b0 ) + iw0->spheres[i+0].rv;
                        fixed d1 = imul( b1, b1 ) + iw0->spheres[i+1].rv;
                        fixed d2 = imul( b2, b2 ) + iw0->spheres[i+2].rv;
                        fixed d3 = imul( b3, b3 ) + iw0->spheres[i+3].rv;

                        if( d0 <= 0 && d1 <= 0 && d2 <= 0 && d3 <= 0 ) continue;

                        if( d0 > 0 ) {
                            fixed t0 = -b0 - isqrt( d0 );
                            if( t0 > 0 && t0 < rt_dist ) {
                                rt_dist = t0;
                                rt_hit  = i + 0;
                            }
                        }

                        // ...
                        // ditto for d1/d2/d3
                        // ...

                    }

For the other reflections, calculate v0-3. Otherwise it's the same:

                } else {

                    for( int i = 0; i < OBJNUM; i += 4 ) {

                        ivec3 v0 = ivec3_sub( ray_pos, iw0->spheres[i+0].obj_pos );
                        ivec3 v1 = ivec3_sub( ray_pos, iw0->spheres[i+1].obj_pos );
                        ivec3 v2 = ivec3_sub( ray_pos, iw0->spheres[i+2].obj_pos );
                        ivec3 v3 = ivec3_sub( ray_pos, iw0->spheres[i+3].obj_pos );
                        fixed b0 = idot( v0, ray_dir );
                        fixed b1 = idot( v1, ray_dir );
                        fixed b2 = idot( v2, ray_dir );
                        fixed b3 = idot( v3, ray_dir );
                        fixed d0 = imul( b0, b0 ) - inorm2( v0 ) + iw0->spheres[i+0].radsq;
                        fixed d1 = imul( b1, b1 ) - inorm2( v1 ) + iw0->spheres[i+1].radsq;
                        fixed d2 = imul( b2, b2 ) - inorm2( v2 ) + iw0->spheres[i+2].radsq;
                        fixed d3 = imul( b3, b3 ) - inorm2( v3 ) + iw0->spheres[i+3].radsq;

                        if( d0 <= 0 && d1 <= 0 && d2 <= 0 && d3 <= 0 ) continue;

                        if( d0 > 0 ) {
                            fixed t0 = -b0 - isqrt( d0 );
                            if( t0 > 0 && t0 < rt_dist ) {
                                rt_dist = t0;
                                rt_hit  = i + 0;
                            }
                        }

                        // ...
                        // ditto for d1/d2/d3
                        // ...

                    }

If nothing was hit, break out of the loops. If something was hit, calculate specular/diffuse and reflect:

                }

                if( rt_hit == NOHIT ) break;

                ray_pos  = ivec3_add( ray_pos, ivec3_scale( ray_dir, rt_dist ) );
                ivec3 n  = inormalize( ivec3_sub( ray_pos, iw0->spheres[rt_hit].obj_pos ) );
                ivec3 l  = inormalize( ivec3_sub( iw0->light_pos, ray_pos ) );

                fixed diffuse  = imax( idot( n, l ), 0 );
                fixed specular = idot( ray_dir, ivec3_sub( l, ivec3_scale( n, imul( diffuse, 2<<FRACBITS ) ) ) );
                      specular = ipower_spec( imax( specular, 0 ) );

                col = ivec3_add( col, ivec3_add1( ivec3_scale( iw0->spheres[rt_hit].obj_col, diffuse  ), specular ) );

                ray_dir = ivec3_sub( ray_dir, ivec3_scale( n, imul( idot( ray_dir, n ), 2<<FRACBITS ) ) );
            } // j refl

When that's done, convert the color values to 8 bit integers and store them. The test system is still using planar rgb, so change this as needed:

            r0 |= imin( __insn_bfextu( col.x, FRACBITS-8, FRACBITS+8 ), 255 ); r0 = __insn_rotli( r0, 56 );
            g0 |= imin( __insn_bfextu( col.y, FRACBITS-8, FRACBITS+8 ), 255 ); g0 = __insn_rotli( g0, 56 );
            b0 |= imin( __insn_bfextu( col.z, FRACBITS-8, FRACBITS+8 ), 255 ); b0 = __insn_rotli( b0, 56 );

            if( (x&7) == 7 ) {
                outr[x/8] = r0; r0 = 0;
                outg[x/8] = g0; g0 = 0;
                outb[x/8] = b0; b0 = 0;
            }

        } // x

        outr += rowstride;
        outg += rowstride;
        outb += rowstride;

    } // y

Summing Up

A library to parallelize code on the Mellanox TILE-Gx easily:

Header files for fixed and floating point operations on the Mellanox TILE-Gx:

An integer raytracer for the Mellanox TILE-Gx with control code:

Compile, link and run:

[ig@weeaboo]$ tile-cc -Wall -Wextra -O3 -std=gnu99 -ffast-math -o raytracer raytracer.c par.c -ltmc -lpthread -lrt -lm -lgxio -ljpeg
[ig@weeaboo]$ scp raytracer root@lastv36:/tmp
[lastv36:/tmp] $ ./raytracer -m
Measurement config being used
Frames:    900
Cores:     35
Blocksize: 16*16
Timeskip:  1
Channel:   0
Output:    1920*1080
Huge: 1 Large: 39 Small: 0
23372
15119
14850
13491
(...)
14044
total ms: 12858.049805
average frame time: 14.286721
[lastv36:/tmp] $ 

CPU load graph of 3600 frames:

TILE-Gx Integer Raytracer, 40 spheres measurement.

Video Archive

v5: 40 spheres
v4: 36 spheres
v3: 32 spheres
v2: 20 spheres

Comments are always appreciated. My email address is on the front page. I switched to a disposable email address system, so it will change regularly (ie. when spam starts piling up). I'm also available on LinkedIn

Remember to appreciate this classic XKCD strip.


www.ignorantus.com