Nvidia Jetson TX2 Xmas Demo 2017

Nils L. Corneliusen
11 April 2018

Introduction

The original Xmas Demo 2017 was released on 18 December 2017. Unfortunately, not all the effects would run at 60 fps. Presenting The Xmas Demo 2017 Remastered: it runs at 60 fps all the way on the Nvidia Jetson TX2 Developer Kit. It's based on the same shader files that were released on 8 January 2018. Should probably have made it in January, but I guess I'm lazy.

Shader Source Code

Most of the effects are based on shaders from ShaderToy. They are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Different licenses are noted where applicable.

I'm guessing the ShaderToy shaders were supposed to run on bigger GPUs than the TX2, and only optimized until their speed targets were met. But is it possible to squeeze out every last cycle, cut corners and do all kinds of radical changes to get them to run in 60 fps on the TX2? And glue it all together so it resembles a semi-useful demo? This is an attempt at doing just that.

Control Code

Music playback has been removed due to assorted issues. Still some files missing, like the font. Need to regenerate it. The code in stuff.c/stuff.h has been copied from various places on the internet. It's pretty standard stuff: OpenGL initialization code, shader loading, reading/writing jpg/bmp files.

Music

Music 1: Galactic Damages by Jingle Punks.
License: "You're free to use this song in any of your videos."
Website: Link missing. YouTube doesn't give much info.

Music 2: Cephalopod by Kevin MacLeod.
License: CC BY 4.0
Website: incompetech.com

Shader Changes

I've gathered all the info about shader changes from Pipeline 2017 and 2018. Some of it is rewritten, and I've noted some dubious claims (by me) here and there that should be rechecked. Probably not gonna happen.

Galactic Dance

Based on code by Sinuousity.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: galdance.glsl

No big changes: Just more of them and move them around a bit. I was running out of time, so the crap text etc. is just scaled and rendered by the CPU.

Glow City

Based on code by mhnewman.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: glowcity.glsl

The original Glow City runs at roughly 16 fps on the Jetson TX2, mainly because there are too many buildings and they're too low. So reduce the building count to 100 and do one iteration of the j-loop. Increase the base height of the buildings and reduce the randomness slightly to cover up the background:

    float height = (0.5 + hash1(block)) * (2.0 + 4.0 * pow(noise1(0.1 * block), 2.5)) * 0.80f + 2.0f;

Reduce the noise() and some of the hash() functions to simple sin() expressions:

vec2 hash2( vec2 p2 )
{
    return vec2( ( sin( p2.x * p2.y ) + 1.0f ) / 2.0f, cos( p2.x * p2.y ) + 1.0f );
}

vec2 hash2( vec2 p2, float p )
{
    return vec2( sin( p2.x ) + 1.0f, cos( p2.y ) * p );
}

float noise1( vec2 p )
{
    return ( ( sin( p.x * 5.78f ) * cos( p.y * 4.23f ) ) + 1.0f ) / 2.0f;
}

Increase the beacon probability and change the window weighting. A side effect is that there'll now be colored windows without doing any work:

const float windowSize = 0.075;
const float windowSpead = 1.0;
(...)
const float beaconProb = 0.0006;

And now it should run at a comfortable 73 fps. It doesn't have the gloomy feeling of the original, but it's flashier. I like flashy stuff.

Raytracer

Based on code by Nils L. Corneliusen.
License: CC0 1.0
Modified Source: tracer.glsl

The raytracer has already been covered in the GPURay article. Only did some minor changes to the pathing to show off more reflections.

Noise 3D

Based on code by revers.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: noise3d.glsl

The main work is done in the noise() function. The trick is to come up with a significantly quicker one that doesn't result in too much or too little open space, and whose values repeat. On the CPU, precalculate a uniform table of vec4 values with a power-of-two size so it wraps nicely. The formula used is the same as in the original, but on a significantly smaller scale. It has just 256 entries, which seems to be enough to generate an interesting result. 128 looked crap.

#define TABLELEN  256
#define TABLEMASK (TABLELEN-1)
(...)
    // fract() is not standard C; assumed here to be a GLSL-style helper: x - floorf(x)
    for( int i = 0; i < TABLELEN; i++ ) {
        v4 v;

        v.x = fract( sinf( ((i +   0)&TABLEMASK) * 43758.5453123f ) );
        v.y = fract( sinf( ((i + 157)&TABLEMASK) * 43758.5453123f ) );
        v.z = fract( sinf( ((i + 113)&TABLEMASK) * 43758.5453123f ) );
        v.w = fract( sinf( ((i + 270)&TABLEMASK) * 43758.5453123f ) );
        gpuj->table[i] = v;
    }
(...)

Then make a new noise function that just reads from the table with LDC.F32X4 before mix(). Take care to handle negative numbers correctly:

float noise( vec3 x00 )
{
    ivec3 p00 = ivec3( floor( x00 ) );

    vec3 f00 = fract( x00 );
    f00 = f00 * f00 * (3.0 - 2.0 * f00);

    uint n00 = uint(abs(p00.x + p00.y * 157 + 113 * p00.z));
    vec4 h00 = table[(n00+0u)&uint(TABLEMASK)];
    vec4 h01 = table[(n00+1u)&uint(TABLEMASK)];

    h00    = mix( h00,    h01,    f00.x );
    h00.xy = mix( h00.xz, h00.yw, f00.y );
    h00.x  = mix( h00.x,  h00.y,  f00.z );

    return h00.x;
}

Unfortunately, the compiler refuses to generate LRP instructions. Interleaving multiple noise() calls to avoid stalls doesn't work well, since it will deinterleave them to save registers. D'oh!

The fbm() function has three noise() calls, and that's a bit much. Remove one and adjust the weights:

float fbm(vec3 p)
{
    float f;
    f  = 0.500 * noise(p); p *= 2.91;
    f += 0.250 * noise(p);
    return f;
}

It starts out at 13 fps in 1280x720. Adding the table-based noise() function increases this to 23 fps. Reducing depth to 96 gives 25 fps. Increasing MarchDumping to 0.9 gives 27, and killing off soft shadows and ambient occlusion gives 34 fps. Increase cutoff in castRay() to 0.03 and fix the if()-order.

The result of adjusting all these things is serious visual artifacts. But couldn't we just adjust MD on the fly so there's more detail when it's close and less detail further in? MD starts at 0.9, so multiply it by 1.025 for each step after the first 16 steps are done. This gives the needed 60 fps and looks almost like the original. Yeah, there's some artifacts, but it's running in 720p60 on a small SoC, not a discrete GPU.

Quaternions

Based on code by Keenan Crane.
License: Unspecified License.
Modified Source: quat.glsl

Quaternions have been reasonably well covered in the GPU Hacks article.

The detail level is much higher in this video. Iterations were increased to 10 by reshuffling the main loop a bit. Also added rotation:

void main()
{
    vec3 rD = normalize( vec3( (start_pos.x + ax * v_texCoord.x),
                               (start_pos.y + ay * v_texCoord.y),
                               (start_pos.z                    ) ) );

    // was probably not a good idea to hardcode this
    float angle = atan( rD.x ) * (2.0f/16.0f);

    vec3 rO = rotate_xz( rot_xz,         22.0f );
    vec3 xz = rotate_xz( rot_xz + angle,  8.0f );

    rD = normalize( vec3( -xz.x, rD.y, -xz.z ) );

    vec3 light = rO;

    // inline intersect
    float b0 = dot( rO, rD );
    float d0 = b0 * b0 - dot( rO, rO ) + BOUNDING_RADIUS_2;

    // redundant at correct zoom level
    if( d0 <= 0.0f ) {
        outColor = getBG();
        outColor.a = colscale;
        return;
    }

    float t0 = -b0 - sqrt( d0 );

    rO += t0 * rD;

    float dist = intersectQJulia( rO, rD, mu_pos );

    if( dist >= epsilon ) {
        outColor = getBG();
        outColor.a = colscale;
        return;
    }

    vec3 N = normEstimate( rO, mu_pos );

    outColor.rgb = Phong( light, rD, rO, N );
    outColor.a = colscale;
}

Colorful

Based on code by ollj.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: colorful.glsl

No big changes here. Reformatted the loop and reduced resolution to 1600x900. Reduced iterations. Tweaked the start point so it's sync'ed with the music. Added restart after 3100 frames to avoid it getting stuck. Not important for the video.

Seascape

Based on code by TDM.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: seascape.glsl

The original code gives around 23 fps, so let's start replacing the costly functions with low budget ones and hope the result looks somewhat like the original. There are a couple of places to save rendering time. First is the noise() function. Strip that out and replace it with the much simpler:

float noise( vec2 p )
{
    return ( sin( p.x * 1.91f ) + cos( p.y * 1.71f ) ) * 0.75f;
}

Next are the map() and map_detailed() functions. It's enough to remove one sea_octave() call to stay marginally above 60 fps with some skyline, so remove the last one in map() and multiply the result of the first one by 1.4.

I like a straight skyline, so remove the "dir.z += length(uv) * 0.15;" in main().

To get a constant 60 fps, let's try reducing some more stuff. Move the ang, orig and m calculations to the CPU. Untangle heightMapTracing(). Discover that one of the two initial map() calls resolves to a static value for all iterations: map( ori ). But just using a base value of 1.0f is more than precise enough. So main() becomes simpler:

(...)
    float hx = map( ori.xyz + dir * 1000.0f );
    if( hx > 0.0f ) {
        outColor = vec4( getSkyColor(dir), colscale );
        return;
    }

    vec3 p = heightMapTracing(ori.xyz,dir, hx );
(...)

And heightMapTracing() is changed to:

// inout, not out: hx is read in the first iteration before it is written
vec3 heightMapTracing( vec3 ori, vec3 dir, inout float hx )
{
    vec3 p;
    float tm = 0.0;
    float tx = 1000.0;
//    float hm = map(ori);
    float hm = 1.0f; // ori fixed per frame, so close enough
    float tmid = 0.0;

    for( int i = 0; i < NUM_STEPS; i++ ) {
        tmid = mix( tm, tx, hm / (hm-hx) );
        p = ori + dir * tmid;
        float hmid = map( p );
        if( hmid < 0.0 ) {
            tx = tmid;
            hx = hmid;
        } else {
            tm = tmid;
            hm = hmid;
        }
    }
    return p;
}
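
To see why the fixed hm = 1.0f guess is good enough, here's a C sketch of the same loop against a trivially flat sea. sea_map() is a hypothetical stand-in for map(), NUM_STEPS = 8 is an assumption, and trace_sea() is just a made-up wrapper for the sky test in main(). hx is passed by pointer since the loop both reads and updates it.

```c
#include <math.h>

#define NUM_STEPS 8   /* assumption: not given in the article */

typedef struct { float x, y, z; } vec3;

/* Hypothetical stand-in for map(): height above a flat sea at y = 0. */
static float sea_map(vec3 p) { return p.y; }

static vec3 ray_at(vec3 o, vec3 d, float t)
{
    vec3 p = { o.x + d.x * t, o.y + d.y * t, o.z + d.z * t };
    return p;
}

static float mixf(float a, float b, float t) { return a + (b - a) * t; }

/* Same interpolation search as the shader version: tm/hm bracket the ray
   from above the surface, tx/hx from below. */
vec3 height_map_tracing(vec3 ori, vec3 dir, float *hx)
{
    vec3  p  = ori;
    float tm = 0.0f;
    float tx = 1000.0f;
    float hm = 1.0f;   /* fixed guess instead of sea_map(ori) */

    for (int i = 0; i < NUM_STEPS; i++) {
        float tmid = mixf(tm, tx, hm / (hm - *hx));
        p = ray_at(ori, dir, tmid);
        float hmid = sea_map(p);
        if (hmid < 0.0f) { tx = tmid; *hx = hmid; }
        else             { tm = tmid; hm  = hmid; }
    }
    return p;
}

/* Returns 1 and writes the hit point if the ray reaches the sea, else 0. */
int trace_sea(vec3 ori, vec3 dir, vec3 *hit)
{
    float hx = sea_map(ray_at(ori, dir, 1000.0f));
    if (hx > 0.0f) return 0;   /* far end is above the sea: sky */
    *hit = height_map_tracing(ori, dir, &hx);
    return 1;
}
```

With a camera at height 3.5 looking down along (0, -0.6, 0.8), the first step undershoots because of the wrong hm, but the second step replaces hm with a real sample and lands on the surface. The bad guess costs roughly one iteration in this flat case.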

You'll end up with something resembling the original. The sea is much more synthetic, but the patterns look kinda neat.

Transparent Blobs

Based on code by Shane.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: trans.glsl

It doesn't look anything like the original since I enabled the fake noise-looking field. Iterations reduced to 40. The texture effect is an accident: Tried making a faster calcNormal(), fumbled inlining the map() function, one thing led to another, and the result was a nice texture on the blobs:

Old version:

float map( vec3 p )
{
    p = (cos(p*.315*2.5 + sin(p.zxy*.875*2.5)));

    float n = length(p);

    p = sin(p*6. + cos(p.yzx*6.));

    return n - 1. - abs(p.x*p.y*p.z)*.05;
}

vec3 calcNormal( vec3 p, float d )
{
    const vec2 e = vec2(0.01, 0);
    return normalize(vec3(d - map(p - e.xyy), d - map(p - e.yxy), d - map(p - e.yyx)));
}

New version after removing all the excess stuff:

vec3 calcNormal( vec3 p, float d )
{
    vec3 r0 = cos( p * .315 * 2.5 + sin( p.zxy * .875 * 2.5 ) ); p -= 0.01f;
    vec3 r1 = cos( p * .315 * 2.5 + sin( p.zxy * .875 * 2.5 ) );

    vec3 v0 = vec3( r1.x, r0.y, r0.z );
    vec3 v1 = vec3( r0.x, r1.y, r0.z );
    vec3 v2 = vec3( r0.x, r0.y, r1.z );

    float n0 = length( v0 );
    float n1 = length( v1 );
    float n2 = length( v2 );

    v0 = sin( v0 * 6. + cos( v0.yzx * 6. ) );
    v1 = sin( v0 * 6. + cos( v0.yzx * 6. ) );

    float f = - 1.0f - abs( v1.x * v1.y * v1.z ) * .05;

    n0 = n0 - 1.0f - abs( v0.x * v0.y * v0.z ) * .05;
    n1 = n1 + f;
    n2 = n2 + f;

    return normalize( vec3( d - n0, d - n1, d - n2 ) );
}

v1 and v2 depend on v0 but are still varied by the length. It's also slightly quicker.

Twofield

Based on code by w23.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: twofield.glsl

The original code produces 26 fps on the NVidia Jetson TX2. The compiler actually manages to make mostly useful assembler code of it, since functions are short and it figures out the overlaps between the world()/w1()/w2()/t1()/t2() functions. (Or does it? I probably meant the opposite. Not important, I guess. The rewritten world() stuff looks much nicer.) So let's look elsewhere for speedups. 1080p60 or bust!

Start off by moving the trajectory and O-calculation to the CPU side. Also move the color and lightpos tables for later use. The CPU can adjust those for free while rendering a frame and costs roughly 0.5 fps. I never liked the shadows, and removing them increases the fps to 43. Adjust the fog-like attenuation from 10-20 to 12-18 and remove the left and right 256 columns for a boost to 58 fps. In trace(), start L at 12.5 instead of 0 for 78 fps, so we're way over target. That means the number of spheres (N) can be increased to 8 and we end up at 64 fps. Any more spheres would be too much without changing the paths.

Let's do some other minor tweaks since I like flashy stuff: Rotate the light sources around the spheres, make the lights bigger/smaller depending on Z, remove gamma correction, add fade in/fade out, adjust the specular values. Average fps ends up at around 62 fps and the load at 93%.

So it's running in 60 fps except during the transition. Let's save a bit more by chopping off more columns in the middle left and right sections. Probably simpler to cover the area with high-level triangles, but I just stuffed it in the shader:

void main( void )
{
    float xedge  = 256.0f/1920.0f;
    float yedge2 = 208.0f/1080.0f;
    float xedge2 = 416.0f/1920.0f;

    if(  (vt.x < xedge  || vt.x > 1.0f - xedge) ||
        ((vt.x < xedge2 || vt.x > 1.0f - xedge2) && vt.y > yedge2 && vt.y < 1.0f - yedge2 ) ) {
         outColor = vec4( 0.0f, 0.0f, 0.0f, colscale );
         return;
    }
    (...)
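
To get a feel for how much work the early-out saves, the same test can be run on the CPU over every pixel centre. A small sketch, with is_masked() and masked_pixels() as hypothetical helper names and vt assumed to run from 0 to 1 across the frame:

```c
#include <stdbool.h>

/* Same edge test as the shader: true if the pixel is painted black. */
bool is_masked(float x, float y)
{
    const float xedge  = 256.0f / 1920.0f;
    const float yedge2 = 208.0f / 1080.0f;
    const float xedge2 = 416.0f / 1920.0f;

    return (x < xedge || x > 1.0f - xedge) ||
           ((x < xedge2 || x > 1.0f - xedge2) && y > yedge2 && y < 1.0f - yedge2);
}

/* Counts masked pixels in a w x h frame, sampling at pixel centres. */
long masked_pixels(int w, int h)
{
    long count = 0;
    for (int j = 0; j < h; j++) {
        for (int i = 0; i < w; i++) {
            if (is_masked((i + 0.5f) / (float)w, (j + 0.5f) / (float)h))
                count++;
        }
    }
    return count;
}
```

At 1920x1080 this skips 765440 of 2073600 pixels, roughly 37% of the frame, before any tracing starts.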

Also a trivial reduction in lightball() that the compiler doesn't do:

vec3 lightball( vec3 lpos, vec3 lcolor, vec3 O, vec3 D, float L )
{
    vec3 ldir = lpos - O;

    if( dot( ldir, ldir ) > L*L ) return vec3( 0.0f );

    float lv = 2.07f - ( ( ( lpos.z / 10.0f + 1.0f ) / 2.0f ) + 1.0f );

    float pw = pow( max( 0.0f, dot( normalize( ldir ), D ) ), 20000.0f * lv );

    return ( normalize( lcolor ) + 1.0f ) * pw;
}

Torus Thingy

Based on code by bal-khan.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Modified Source: torus.glsl

Remove unused code. Replace mylength() with standard length(). Replace march() with a standard one from noise3d and unroll once:

vec2 march(vec3 pos, vec3 dir)
{
    vec2 dist = vec2(0.0, 0.0);

    float esc = 0.01f;

    int i;
    for( i = 0; i < I_MAX; i += 2 ) {
        dist.x = scene( pos + dir * dist.y ); if( dist.x < esc ) break; dist.y += dist.x; if( dist.y > FAR ) break;
        dist.x = scene( pos + dir * dist.y ); if( dist.x < esc ) break; dist.y += dist.x; if( dist.y > FAR ) break;
    }

    return vec2( float(i), dist.y );
}

Yeah, it'll be noisy at distance, but it moves quickly, so it's not very important. The dodecahedrons will be a bit fatter too.

Replace mod() calculations in scene():

(...)
    p.xy = rotate( p.xy, iTime * 0.25f * ((int(var-0.5f)&1) == 1 ? -1. : 1.) );
(...)
    p.xz = rotate( p.xz, iTime * 1.5f * ((int(var)&1) == 1 ? -1. : 1.) * ((int(vir)&1) == 1 ? -1. : 1.) );
(...)

(Did that really pay off? Should recheck it.)

And the kicker is redoing the atan2() calls. The compiler will just replace the calls without considering rescheduling at all, so it's a stallfest. Use this replacement function:

// IEEE Signal Processing Magazine ( Volume: 30, Issue: 1, Jan. 2013 )
// Full Quadrant Approximations for the Arctangent Function

#define f2u( x ) floatBitsToUint( x )
#define u2f( x ) uintBitsToFloat( x )

float satan2( float y, float x )
{
    uint sign_mask = 0x80000000u;
    float b = 0.596227f;

    // Extract the sign bits
    uint ux_s  = (sign_mask & f2u(x) )^sign_mask;
    uint uy_s  = (sign_mask & f2u(y) )^sign_mask;

    // Determine the quadrant offset
    float q = float( ( ~ux_s & uy_s ) >> 29 | ux_s >> 30 );

    // Calculate the arctangent in the first quadrant
    float bxy_a = abs( b * x * y );
    float num = bxy_a + y * y;
    float atan_1q =  num / ( x * x + bxy_a + num );

    // Translate it to the proper quadrant
    uint uatan_2q = ((ux_s ^ uy_s) | f2u(atan_1q));
    return (q + u2f(uatan_2q)) * PI/2.0f - PI;
}
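
The bit fiddling is easier to trust after checking it against the real thing. Here's a straight C port for comparing with atan2f() on the CPU. f2u()/u2f() use memcpy as a stand-in for floatBitsToUint()/uintBitsToFloat(), and the pi constant is written out since PI is defined elsewhere in the shader.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bit casts, standing in for floatBitsToUint() / uintBitsToFloat(). */
static uint32_t f2u(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float    u2f(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

float satan2(float y, float x)
{
    const uint32_t sign_mask = 0x80000000u;
    const float b  = 0.596227f;
    const float pi = 3.14159265f;

    /* Extract the sign bits, inverted (the "- pi" at the end undoes this). */
    uint32_t ux_s = (sign_mask & f2u(x)) ^ sign_mask;
    uint32_t uy_s = (sign_mask & f2u(y)) ^ sign_mask;

    /* Quadrant offset: 0, 2 or 4 quarter turns. */
    float q = (float)(((~ux_s & uy_s) >> 29) | (ux_s >> 30));

    /* Arctangent in the first quadrant, normalised to [0, 1]. */
    float bxy_a   = fabsf(b * x * y);
    float num     = bxy_a + y * y;
    float atan_1q = num / (x * x + bxy_a + num);

    /* Apply the sign for the quadrant and scale to radians in [-pi, pi]. */
    uint32_t uatan_2q = (ux_s ^ uy_s) | f2u(atan_1q);
    return (q + u2f(uatan_2q)) * pi / 2.0f - pi;
}
```

It's an approximation, so expect it to differ from atan2f() by something on the order of a few milliradians. Both inputs being zero gives 0/0, so avoid that case.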

atan2() is called in atan2()+modA() pairs. The modA() call can be simplified by just adding PI/4 in a couple of instances:

vec2 modAs2( vec2 p, float count, float s2 )
{
    float an = TAU/count;
    float a = s2+an*.5 + PI/4.0f;
    a = mod(a, an)-an*.5;
    return vec2(cos(a),sin(a))*length(p);
}

So the center spikes calculation becomes:

    float s2 = satan2( p.x, p.z );
    float var = ( s2 + PI )/ TAU * 40.0f;
    p.xz = modAs2( p.xz, 40.0f, s2 );
    p.x -= 8.0f;

And ditto for the orange filler:

    s2 = satan2( p.x, p.y );
    float vir = ( s2 + PI ) / TAU;
    var = vir * 30.0f;
    p.xy = modAs2( p.xy, 30.0f, s2 );
    p.x -= 4.0f;
    q = vec2( length( p.zx ) - 0.25f, p.y );

I considered redoing the dodecahedron max stuff, but it seems like the compiler manages to make sense of most of it. Replacing the vec3 b var with a float doesn't hurt, though.

It's a mess, but it runs in 60ish frames on the Jetson TX2.

Let's follow up with some more reductions. D'oh of the week: I_MAX is 100. That sounds rather arbitrary, and it never gets to 100 iterations anyway. Reduce it until the framerate is above 60 all the way. A good value for I_MAX seems to be 52. Visual artifacts minimal. Framerate awesome.

Earlier Versions

Original almost 60 fps version:

Some earlier tests:

Comments are always appreciated. My email address is on the front page. I switched to a disposable email address system, so it will change regularly (i.e. when spam starts piling up). I'm also available on LinkedIn.

Remember to appreciate this classic Abstruse Goose strip.


www.ignorantus.com