GPURay - GPU+CPU Raytracer for NVidia Tegra X1

Nils L. Corneliusen
3 February 2017

Latest update: 10 March 2016: Changes to GLSL code for higher performance.
See Pipeline for details.

Introduction

GPURay started out in the GPU Hacks article as a GPU only raytracer. 80 spheres seems to be the max possible with that version. Would it be possible to do 128 spheres? Well, that's a challenge! To accomplish that, the CPU is used to do some preprocessing while keeping the GPU part mostly the same. It's recommended to read the original GPU Hacks article first and check updates done in the Pipeline section.

Most of the CPU code is boring and involves initializing OpenGL, setting up the spheres and handling the movement of the spheres. Also, I'm a terrible at writing application code, so it looks crap. Let's focus on the interesting parts instead.

CPU Threads, Sorting and Probing

main() will set up a thread that will do the probing and sorting. Communication between the threads is done via a simple barrier and a pair of global buffers. It's advisable to keep the thread on a separate core:

static void setcore( int coreid )
{
    int mask = 1<<(coreid%CORES);
    int syscallres = syscall( __NR_sched_setaffinity, gettid(), sizeof(mask), &mask );
    if( syscallres )
        printf( "Error syscall setaffinity: syscallres=%d, errno=%d\n", syscallres, errno );
}

static void *thread_start( void *arg )
{
    int coreid = (int)arg;

    setcore( coreid );

    printf( "Thread core: %d\n", coreid );

    while( true ) {

        barrier_wait( &bar );

        if( staticframe == -1 ) {
            world_timer( w1 );
        }

        world_probe( w1 );

        barrier_wait( &bar );

    }

    return NULL;
}

static int thread_init( void )
{
    barrier_init( &bar, NULL, 2 );

    pthread_attr_t attr;
    int rc = pthread_attr_init( &attr );
    if( rc != 0 ) {
        perror( "pthread_attr_init" );
        return rc;
    }

    pthread_t thread_id;
    int arg = 2;
    rc = pthread_create( &thread_id, &attr, &thread_start, (void *)arg );
    if( rc != 0 ) {
        perror( "pthread_create" );
        return rc;
    }

    return 0;
}

world_timer() moves the spheres one frame forward. The interesting stuff is in world_probe(). First, the objects are (crudely) sorted on the Z axis. This is done to cut off the main sphere loop on the GPU when a sphere that is close enough is found. It's implemented as follows:

typedef struct {
    v4 obj_pos;
    v4 obj_col;
} poscol;


static int sortfunc( const void *a, const void *b )
{
    poscol *va = (poscol *)a;
    poscol *vb = (poscol *)b;

    return va->obj_pos.z - vb->obj_pos.z > 0.0f ? 1 : -1;
}


static void world_probe( worldinfo *w )
{
    uint8_t r0[BLOCKSX+1];
    uint8_t r1[BLOCKSX+1];
    uint8_t *row0, *row1;

    // Depth sort and copy to GPU table
    poscol pc[OBJNUM];

    for( int i = 0; i < OBJNUM; i++ ) {
        pc[i].obj_pos = w->spheres[i].obj_pos;
        pc[i].obj_col = w->spheres[i].obj_col;

    }

    qsort( &pc[1], OBJNUM-1, sizeof(poscol), sortfunc );

    for( int i = 0; i < OBJNUM; i++ ) {
        w->gt.obj_pos[i] = pc[i].obj_pos;
        w->gt.obj_col[i] = pc[i].obj_col;
    }

Next up is the probing. Each corner of a 16x16 block is evaluated, and if all four match, it's either background only or sphere only. This means that the entire sphere evaluation can be eliminated on the GPU if the block only has background in it:

CPU Preprocessing

There's no need to go overboard with the optmizations here: The entire function runs in about 5ms per frame, which leaves a good margin to 16.67ms.

    // Check corners of 16x16 blocks
    row0 = r0;
    row1 = NULL;

    v3 ray_pos = v4_get3( w->gt.eye );

    int widthr  = (w->width +BLOCKW-1)&~(BLOCKW-1);
    int heightr = (w->height+BLOCKH-1)&~(BLOCKH-1);

    for( int y = 0; y <= heightr; y += BLOCKH ) {

        float fy = y/(float)w->height;

        uint8_t *dst = row0;

        for( int x = 0; x <= widthr; x += BLOCKW ) {

            float fx = x/(float)w->width;

            v3 startpos = v3_normalize( v3_set( w->gt.startpos.x + w->gt.ax * fx,
                                                w->gt.startpos.y + w->gt.ay * fy,
                                                w->gt.startpos.z ) );

            int i;
            for( i = 0; i < OBJNUM; i += 4 ) {

                v3 vv0 = v3_sub( ray_pos, v4_get3( w->gt.obj_pos[i+0] ) );
                v3 vv1 = v3_sub( ray_pos, v4_get3( w->gt.obj_pos[i+1] ) );
                v3 vv2 = v3_sub( ray_pos, v4_get3( w->gt.obj_pos[i+2] ) );
                v3 vv3 = v3_sub( ray_pos, v4_get3( w->gt.obj_pos[i+3] ) );

                float rv0 = -v3_dot( vv0, vv0 ) + w->gt.obj_pos[i+0].w;
                float rv1 = -v3_dot( vv1, vv1 ) + w->gt.obj_pos[i+1].w;
                float rv2 = -v3_dot( vv2, vv2 ) + w->gt.obj_pos[i+2].w;
                float rv3 = -v3_dot( vv3, vv3 ) + w->gt.obj_pos[i+3].w;

                if( v3_dotsq( vv0, startpos ) + rv0 > 0.0f ||
                    v3_dotsq( vv1, startpos ) + rv1 > 0.0f ||
                    v3_dotsq( vv2, startpos ) + rv2 > 0.0f ||
                    v3_dotsq( vv3, startpos ) + rv3 > 0.0f ) break;

            }

            *dst++ = i == OBJNUM ? BLOCK_BG : BLOCK_SPH;
            // It's reasonably cheap to separate sky/plane here, but haven't figured out how to exploit it

        }

        if( row1 == NULL ) {
            row0 = r1;
            row1 = r0;
            continue;
        }

        uint8_t *tmp;
        tmp  = row0;
        row0 = row1;
        row1 = tmp;

        uint8_t *proberow = w->gt.probe + ((y-BLOCKH)/BLOCKH)*BLOCKSX;

        for( int x = 0; x < widthr/BLOCKW; x++ ) {
            *proberow++ = row0[x] & row0[x+1] & row1[x] & row1[x+1];
        }

    }

}

These optimizations are not foolproof, and there are some occasions where they might fail. However, the failures should be very few and hardly visible. Some corners had to be cut to be able to render 128 spheres on the Tegra X1.

GPU Raytracing

The uniform buffer holding the data is largely the same, except that the probe block has been added:

layout(binding = 0) uniform stuff
{
    vec4 obj_pos[OBJNUM];
    vec4 obj_col[OBJNUM];
    vec4 plane_norm;
    vec4 plane_col;
    vec4 eye;
    vec4 light_pos;
    vec4 startpos;
    float ax;
    float ay;
    float frame;
    float width;
    float height;
    uint8_t probe[BLOCKSX*BLOCKSY];
};

Using uint8_t is overkill, but packing the data tighter and extracting does not make any noticeable change.

In main(), use the texture coord to determine where the correct probe byte is and fetch it:

    // Find probe result...
    int blx = int(floor(v_texCoord.x*width /float(BLOCKW)));
    int bly = int(floor(v_texCoord.y*height/float(BLOCKH)));
    int prval = int(probe[bly*BLOCKSX+blx]);

And use it to skip the entire sphere calculation if only background needs to be rendered:

    // ...and only do sphere calculation if needed
    if( prval == BLOCK_SPH || prval == BLOCK_ANY ) {

Next up is the reflection loop:

        int refl;
        for( refl = 0; refl < REFLECTIONS; refl++ ) {

            float rt_dist = 0.0f;
            int rt_i = -1;
            vec3 rt_t0;

            vec4 v0, v1, v2, v3;

            // Preload 0-3
            v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w );
            v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w );
            v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w );
            v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w );

Then iterate through blocks of 32 spheres and determine which is close enough:

            // Find closest (or close enough) sphere
            for( int i = 0; i < OBJNUM; i += 32 ) {
                vec4 v4, v5, v6, v7;
                vec4 b0, d0;

                // Preload 4-7
                v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w );
                v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w );
                v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w );
                v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w );

                // Calc 0-3
                b0 = vec4( dot( v0.xyz, ray_dir ),
                           dot( v1.xyz, ray_dir ),
                           dot( v2.xyz, ray_dir ),
                           dot( v3.xyz, ray_dir ) );

                d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w,
                           b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w,
                           b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w,
                           b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w );

                if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) {

                    vec4  t0 = 1.0f / (-b0 - sqrt( d0 ));
                    float mv = max( max( t0.x, t0.y ), max( t0.z, t0.w ) );

                    if( mv < rt_dist ) {
                        rt_dist = mv;
                        rt_t0   = t0.yzw;
                        rt_i    = i + 0;
                    }

                }

                // ...
                // ditto for preload 8-11, calc 4-7 etc.
                // ...

                if( rt_dist < 3.0f ) break;
            }

If nothing was hit, we jump out. If we did hit something, find the id and calculate the color, reflect and iterate:

            // Stop if nothing hit in this pass
            if( rt_i == -1 ) break;

            // Find id of sphere hit
            ivec3 iv = ivec3( equal( rt_t0, vec3( rt_dist ) ) );
            rt_i += iv.x + iv.y + iv.y + iv.z + iv.z + iv.z;

            ray_pos += ray_dir * (1.0f / rt_dist);
            vec3 n   = normalize( ray_pos - obj_pos[rt_i].xyz );
            vec3 l   = normalize( light_pos.xyz - ray_pos );

            float diffuse  = max( dot( n, l ), 0.0f );
            float specular = pow( max( dot( ray_dir, l - n * 2.0f * diffuse ), 0.0f ), 8.0f );

            col += vec3( specular ) + obj_col[rt_i].xyz * diffuse;

            // Reflect and go on
            ray_dir = reflect( ray_dir, n );

        }

Spheres

Now it'll either keep reflecting until max reflections is reached or it's a miss. If anything was hit/reflected, we're done:

        // Done if something was hit/reflected
        if( refl != 0 ) {
            outColor = vec4( col, 1.0f );
            return;
        }

    }

With all sphere calculations out of the way, the rest of the code deals with the plane+shadows and sky. First, check if the plane is hit. If not, draw a pulsating sky and exit. Since the objects are sorted, the color is calculated differently:

Sky

    // BG only or nothing hit on first refl pass. Check plane/sky.
    float rt_dist = -(dot( ray_pos, plane_norm.xyz ) + plane_norm.w) / dot( ray_dir, plane_norm.xyz );

    if( rt_dist <= 0.0f ) {
        // Plane miss, draw sky
        vec2 dxy = v_texCoord - 0.5f;
        float dv = 1.0f - sqrt( dot( dxy, dxy ) + (sin(fract(frame/120.0f)*2.0f*3.14f)+1.0f)*0.5f );
        outColor = vec4( plane_col.yzx * dv, 1.0f );
        return;
    }

If we get here, it's certain that the plane is hit. Calculate new ray_pos and and make ray_dir point at the light source:

    // Plane hit
    ray_pos += ray_dir * rt_dist;
    ray_dir  = normalize( light_pos.xyz - ray_pos );

Checking for shadow is a bit simpler since it's only interesting if anything at all is hit, not what it is or how far away. It follows the same structure as the sphere calculation, except it's unrolled 16:

    // Check shadow

    // Preload 0-3
    vec4 v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w );
    vec4 v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w );
    vec4 v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w );
    vec4 v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w );

    int i;
    for( i = 0; i < OBJNUM; i += 16 ) {
        vec4 v4, v5, v6, v7;
        vec4 d0, d1, d2, d3;
        vec4 b0;

        // Preload 4-7
        v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w );
        v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w );
        v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w );
        v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w );

        // Calc 0-3
        b0 = vec4( dot( v0.xyz, ray_dir ),
                   dot( v1.xyz, ray_dir ),
                   dot( v2.xyz, ray_dir ),
                   dot( v3.xyz, ray_dir ) );

        d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w,
                   b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w,
                   b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w,
                   b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w );

        // ...
        // ditto for preload 8-11, calc 4-7 etc.
        // ...

        if( any( greaterThan( d0, vec4( 0.0f ) ) ) ||
            any( greaterThan( d1, vec4( 0.0f ) ) ) ||
            any( greaterThan( d2, vec4( 0.0f ) ) ) ||
            any( greaterThan( d3, vec4( 0.0f ) ) ) ) break;

    }

Then determine if it's shadow or plane and calculate the final output color:

Plane with shadows

    // Black shadow if anything was hit
    float diffuse = i == OBJNUM ? dot( plane_norm.xyz, ray_dir ) * 1.2f : 0.0f;
    outColor = vec4( plane_col.xyz * diffuse, 1.0f );

}

Source Code

Source files:

The vec.h file implements some basic vec3/vec4 stuff for the CPU side. I'm sure there's something like this hidden somewhere in OpenGL. Couldn't find it, so I reinvented the wheel. Feel free to replace it.

I compile this for Android, so there's some things missing. Barrier code can be found on locklessinc.com. WindowSurface.cpp/.h is probably part of some devkit. It's needed to get a NativeWindow to display things in. Pick it up at googlesource.com.

Performance Measurements

Run gpuray with the -q option. This renders a predetermined set of frames with display and vsync off for easy comparison. It will display frames per second, gpu load and GPU power usage:

[shieldup:/tmp] $ echo $$ > /sys/fs/cgroup/cpuset/tasks
[shieldup:/tmp] $ ./gpuray -q
Program linking log for default vertex and fragtracer.glsl:
Fragment info
-------------
0(76) : warning C7050: "rt_t0" might be used before being initialized

Large: 64 Small: 63
Objects:    128
Resolution: 1920x1080
Thread core: 2
53.4508 97.900  9.999
67.7886 96.700 10.189
68.7328 98.200 10.141
69.4878 97.900 10.189
64.0440 98.300 10.094
(...)
67.6468 97.600 10.094
65.2511 97.700 10.196
Time (ms): 36810.703125 Frames: 2401 FPS: 65.225601
[shieldup:/tmp] $ 

The warning is totally unexpected. "Might be" warnings have been bugging gcc for awhile, and now they're pouring out of nvcc. We're not banging rocks together here! Every cycle is needed. Adding a redundant initialization makes performance plummet. NVidia, please give us an option to disable useless warnings when using GL calls to compile and link.

Video Archive

Playlist with all videos

GPURay 2.0: 128 spheres.
GPURay 2.0 alpha 3: 120 spheres.
GPURay 2.0 alpha 2: 112 spheres. Music by Xerxes/Triumph.
GPURay 2.0 alpha 1: 96 spheres.
GPURay 1.7: 80 spheres. Music by Jan Wegger.
GPURay 1.6: 72 spheres.
GPURay 1.5: 64 spheres.
GPURay 1.4: 60 spheres.
GPURay 1.3: 52 spheres.
GPURay 1.2: 30 spheres. Shadows added.
GPURay 1.1: 36 spheres.
GPURay 1.0: 31 spheres.

Comments are always appreciated. I prefer being contacted on LinkedIn. Email is also available, you can figure it out from the front page. I switched to a disposable email address system, so it will change regularly.

Remember to appreciate this classic XKCD strip.


www.ignorantus.com