Random stuff I'm working on, in chronological order.

*
Note: Latest info about the Xmas demo is available here.
*

Final version of the demo! Everything runs in (almost) 60 fps on the NVidia Jetson TX2. As always, the YouTube video is 60 fps from start to end: I dumped the 18000 frames to files and built it with ffmpeg. The fps and load indicators track the actual TX2 output and should be considered correct. (Yeah, there are a couple of scaling issues, but I'm not regenerating it now.)

Most of the effects are based on shaders from ShaderToy. They are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. They have been modified and radically optimized to run in 60 fps on the NVidia Jetson TX2 256 core Pascal-based GPU. The source code for the modified shaders can be downloaded below:

- galdance.glsl - based on code by Sinuousity
- glowcity.glsl - based on code by mhnewman
- noise3d.glsl - based on code by revers
- colorful.glsl - based on code by ollj
- seascape.glsl - based on code by TDM
- trans.glsl - based on code by Shane
- twofield.glsl - based on code by w23
- torus.glsl - based on code by bal-khan

Shaders with different licensing:

- tracer.glsl - based on code by Nils L. Corneliusen (CC0 1.0 license)
- quat.glsl - based on code by Keenan Crane (unspecified license)

Rendered in 4k and downscaled, everything fullscreen. The load bar and fps values are from the 1080p version on the TX2, so not accurate at all. Looks nicer now, though.

Mockup of this year's highlights. Overlay needs some work, maybe add a scrolltext, some flashy crap, the usual.

The texture effect is a result of trying to manually untangle the calls in calcNormal():

    float map( vec3 p )
    {
        p = (cos(p*.315*2.5 + sin(p.zxy*.875*2.5)));
        float n = length(p);
        p = sin(p*6. + cos(p.yzx*6.));
        return n - 1. - abs(p.x*p.y*p.z)*.05;
    }

    vec3 calcNormal( vec3 p, float d )
    {
        const vec2 e = vec2(0.01, 0);
        return normalize(vec3(d - map(p - e.xyy), d - map(p - e.yxy), d - map(p - e.yyx)));
    }

I made a typo in the v0-v2 calculations, so the result after scraping away the excess stuff is:

    vec3 calcNormal( vec3 p, float d )
    {
        vec3 r0 = cos( p * .315 * 2.5 + sin( p.zxy * .875 * 2.5 ) );
        p -= 0.01f;
        vec3 r1 = cos( p * .315 * 2.5 + sin( p.zxy * .875 * 2.5 ) );

        vec3 v0 = vec3( r1.x, r0.y, r0.z );
        vec3 v1 = vec3( r0.x, r1.y, r0.z );
        vec3 v2 = vec3( r0.x, r0.y, r1.z );

        float n0 = length( v0 );
        float n1 = length( v1 );
        float n2 = length( v2 );

        v0 = sin( v0 * 6. + cos( v0.yzx * 6. ) );
        v1 = sin( v0 * 6. + cos( v0.yzx * 6. ) );

        float f = - 1.0f - abs( v1.x * v1.y * v1.z ) * .05;

        n0 = n0 - 1.0f - abs( v0.x * v0.y * v0.z ) * .05;
        n1 = n1 + f;
        n2 = n2 + f;

        return normalize( vec3( d - n0, d - n1, d - n2 ) );
    }

v1 and v2 depend on v0 but are still varied by the length. It's also slightly quicker.

Based on code by Shane which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

It doesn't look anything like the original since I enabled the fake noise looking field. Iterations reduced to 40. The texture effect is an accident: Tried making a faster calcNormal(), stumbled while inlining map(), one thing led to another, and you got that effect. Have to figure out why it didn't work as expected before posting it.

Remove unused code. Replace mylength() with standard length(). Replace march() with a standard one from noise3d and unroll once:

    vec2 march(vec3 pos, vec3 dir)
    {
        vec2 dist = vec2(0.0, 0.0);
        float esc = 0.01f;
        int i;

        for( i = 0; i < I_MAX; i += 2 ) {
            dist.x = scene( pos + dir * dist.y );
            if( dist.x < esc ) break;
            dist.y += dist.x;
            if( dist.y > FAR ) break;

            dist.x = scene( pos + dir * dist.y );
            if( dist.x < esc ) break;
            dist.y += dist.x;
            if( dist.y > FAR ) break;
        }

        return vec2( float(i), dist.y );
    }

Yeah, it'll be noisy at distance, but it moves quickly, so it's not very important. The dodecahedrons will be a bit fatter too.

Replace mod() calculations in scene():

    (...)
    p.xy = rotate( p.xy, iTime * 0.25f * ((int(var-0.5f)&1) == 1 ? -1. : 1.) );
    (...)
    p.xz = rotate( p.xz, iTime * 1.5f * ((int(var)&1) == 1 ? -1. : 1.) * ((int(vir)&1) == 1 ? -1. : 1.) );
    (...)

And the kicker is redoing the atan2() calls. Nvcc will just replace the calls without considering rescheduling at all. So it's a stallfest. Use this replacement function:

    // IEEE Signal Processing Magazine ( Volume: 30, Issue: 1, Jan. 2013 )
    // Full Quadrant Approximations for the Arctangent Function

    #define f2u( x ) floatBitsToUint( x )
    #define u2f( x ) uintBitsToFloat( x )

    float satan2( float y, float x )
    {
        uint sign_mask = 0x80000000u;
        float b = 0.596227f;

        // Extract the sign bits
        uint ux_s = (sign_mask & f2u(x) )^sign_mask;
        uint uy_s = (sign_mask & f2u(y) )^sign_mask;

        // Determine the quadrant offset
        float q = float( ( ~ux_s & uy_s ) >> 29 | ux_s >> 30 );

        // Calculate the arctangent in the first quadrant
        float bxy_a = abs( b * x * y );
        float num = bxy_a + y * y;
        float atan_1q = num / ( x * x + bxy_a + num );

        // Translate it to the proper quadrant
        uint uatan_2q = ((ux_s ^ uy_s) | f2u(atan_1q));
        return (q + u2f(uatan_2q)) * PI/2.0f - PI;
    }

atan2() is called in atan2()+modA() pairs. The modA() call can be simplified by just adding PI/4 in a couple of instances:

    vec2 modAs2( vec2 p, float count, float s2 )
    {
        float an = TAU/count;
        float a = s2+an*.5 + PI/4.0f;
        a = mod(a, an)-an*.5;
        return vec2(cos(a),sin(a))*length(p);
    }

So the center spikes calculation becomes:

    float s2 = satan2( p.x, p.z );
    float var = ( s2 + PI )/ TAU * 40.0f;
    p.xz = modAs2( p.xz, 40.0f, s2 );
    p.x -= 8.0f;

And ditto for the orange filler:

    s2 = satan2( p.x, p.y );
    float vir = ( s2 + PI ) / TAU;
    var = vir * 30.0f;
    p.xy = modAs2( p.xy, 30.0f, s2 );
    p.x -= 4.0f;
    q = vec2( length( p.zx ) - 0.25f, p.y );

Consider redoing the dodecahedron max stuff, but seems like the compiler manages to make sense of most of it. Replacing the vec3 b var with a float doesn't hurt, though.

It's a mess, but it runs in 576p60 on the Jetson TX2, roughly 50% faster than the original. Had to render the video in 720p, since YouTube did the best it could to destroy the quality of the 576p video.

Based on code by bal-khan which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Had to do some serious corner cutting to make this one run in 1024x576p60 on the NVidia Jetson TX2. Had to replace the atan2(), marching() and scene() calculations. It's a bit iffy: The new marcher is much more crude, so there's some visible noise in the centre at distance. Will document it later.

Rendered in 3840x2160 and downscaled to 1920x1080 for higher quality. Max depth 2048 using the GPU routine described below. The poor X1 rendered it at 5.5 fps. Should have used the X2, but couldn't get libjpeg to work.

Recall the GPUJulia implementation which is optimized for depth first, ie. the opposite of the usual theory. It can render a 1080p screen where all pixels are at max depth at 45 fps on the X1. Looking at the generated code, it seems that the outer j-loop is somewhat inefficient. The register usage is high and unnecessary automatic unrolling makes it bloated. Let's try folding them into one another and replacing the shifts and ors with bitfield operations:

    void main()
    {
        vec4 xv0 = vec4( 0.0f, 0.0f, 0.0f, width/height * (v_texCoord.x - 0.5f) / (0.5f * sz) + sx );
        vec4 yv0 = vec4( 0.0f, 0.0f, 0.0f, (v_texCoord.y - 0.5f) / (0.5f * sz) + sy );

        uint bits = 0u;
        int i;

        for( i = 0; i < MAXITER; i += DEPTH ) {

            bits = 0u; // !

            for( int j = 0; j < DEPTH; j += 4 ) {

                xv0.x = xv0.w * xv0.w - yv0.w * yv0.w + re;
                yv0.x = 2.0f * xv0.w * yv0.w + im;
                xv0.y = xv0.x * xv0.x - yv0.x * yv0.x + re;
                yv0.y = 2.0f * xv0.x * yv0.x + im;
                xv0.z = xv0.y * xv0.y - yv0.y * yv0.y + re;
                yv0.z = 2.0f * xv0.y * yv0.y + im;
                xv0.w = xv0.z * xv0.z - yv0.z * yv0.z + re;
                yv0.w = 2.0f * xv0.z * yv0.z + im;

                uvec4 uv = uvec4( xv0.x * xv0.x + yv0.x * yv0.x > 4.0f,
                                  xv0.y * xv0.y + yv0.y * yv0.y > 4.0f,
                                  xv0.z * xv0.z + yv0.z * yv0.z > 4.0f,
                                  xv0.w * xv0.w + yv0.w * yv0.w > 4.0f );

                bits = bitfieldInsert( bits, uv.x, j+0, 1 );
                bits = bitfieldInsert( bits, uv.y, j+1, 1 );
                bits = bitfieldInsert( bits, uv.z, j+2, 1 );
                bits = bitfieldInsert( bits, uv.w, j+3, 1 );
            }

            if( bits != 0u ) break;
        }

        int lsb = findLSB( bits );
        i += lsb == -1 ? 0 : lsb;

        outColor = vec4( getRGB( float(i) / float(MAXITER) ), 1.0f );
    }

This will reduce the register usage to 4 and the number of code lines to 94. The result is roughly 15% faster. The X1 can now render a black screen at 52 fps, and the Jetson TX2 at 71 fps. Neat.
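The core idea, stripped of the vec4 packing, is to iterate a whole block without branching, record one escape bit per iteration, and recover the exact iteration afterwards from the lowest set bit. A scalar C sketch of that follows. The MAXITER and DEPTH values are hypothetical, find_lsb() is a plain loop standing in for GLSL's findLSB(), and julia_naive() is only there as a reference that tests after every step.

```c
#define MAXITER 64   /* hypothetical values; MAXITER must be a multiple of DEPTH */
#define DEPTH   4

/* Index of the lowest set bit, or -1 if none (mirrors GLSL findLSB()). */
static int find_lsb(unsigned bits)
{
    for (int k = 0; k < 32; k++)
        if (bits & (1u << k)) return k;
    return -1;
}

/* Blocked escape test: run DEPTH iterations unconditionally, record one
   escape bit per iteration, and only branch once per block. */
static int julia_blocked(float x, float y, float re, float im)
{
    unsigned bits = 0u;
    int i;
    for (i = 0; i < MAXITER; i += DEPTH) {
        bits = 0u;
        for (int j = 0; j < DEPTH; j++) {
            float nx = x * x - y * y + re;
            float ny = 2.0f * x * y + im;
            x = nx;
            y = ny;
            bits |= (unsigned)(x * x + y * y > 4.0f) << j;
        }
        if (bits != 0u) break;
    }
    int lsb = find_lsb(bits);
    return i + (lsb == -1 ? 0 : lsb);
}

/* Reference: test after every single iteration. */
static int julia_naive(float x, float y, float re, float im)
{
    for (int i = 0; i < MAXITER; i++) {
        float nx = x * x - y * y + re;
        float ny = 2.0f * x * y + im;
        x = nx;
        y = ny;
        if (x * x + y * y > 4.0f) return i;
    }
    return MAXITER;
}
```

Iterations after the first escape are wasted work, which is why this only pays off when most pixels run deep.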

*
19 October 2017: Added redundant "bits = 0u;" between the loops. Only adds an extra MOV.S between the REPs,
but it runs slightly faster. Requires further investigation. Jury's still out on using greaterThan(). Code will
be more compact, but execution time still suffers. Compacted the conversions for readability.
*

Brute force anti-aliasing! High-quality version of Seascape. Rendered in 7680x4320 and downscaled to 1920x1080. The poor NVidia Tegra X1 GPU spent the night producing roughly 3.5 fps. Using the ARM to scale afterwards was probably not a good idea, it ended up at around 0.5 fps total. Should have just stuffed the loops in the shader. Oh well, at least the aliasing is mostly gone.
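The CPU-side scaling amounts to a box filter: every output pixel is the average of a 4x4 block of the 8k frame. A minimal C sketch for one 8-bit channel is shown here. The function name, the single-channel layout and the rounding are my assumptions, not the code actually used for the video.

```c
#include <stddef.h>
#include <stdint.h>

/* Average each factor x factor block of a single-channel 8-bit image.
   src is sw x sh, dst is (sw/factor) x (sh/factor). Sizes are assumed to
   divide evenly, as 7680x4320 -> 1920x1080 does with factor 4. */
static void downscale_box(const uint8_t *src, int sw, int sh, int factor, uint8_t *dst)
{
    int dw = sw / factor, dh = sh / factor;
    unsigned area = (unsigned)(factor * factor);

    for (int y = 0; y < dh; y++) {
        for (int x = 0; x < dw; x++) {
            unsigned sum = 0;
            for (int v = 0; v < factor; v++)
                for (int u = 0; u < factor; u++)
                    sum += src[(size_t)(y * factor + v) * sw + (x * factor + u)];
            /* Round to nearest instead of truncating. */
            dst[(size_t)y * dw + x] = (uint8_t)((sum + area / 2) / area);
        }
    }
}
```

Doing the same sum inside the fragment shader avoids reading back the huge frame at all, which is presumably what stuffing the loops in the shader would have bought.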

The original Glow City runs in roughly 16 fps on the Jetson TX2, mainly because there are too many buildings and they're too low. So reduce the building count to 100 and do one iteration of the j-loop. Increase the base height of the buildings and reduce the randomness slightly to cover up the background:

float height = (0.5 + hash1(block)) * (2.0 + 4.0 * pow(noise1(0.1 * block), 2.5)) * 0.80f + 2.0f;

Reduce the noise() and some of the hash() functions to simple sin() expressions:

    vec2 hash2( vec2 p2 )
    {
        return vec2( ( sin( p2.x * p2.y ) + 1.0f ) / 2.0f, cos( p2.x * p2.y ) + 1.0f );
    }

    vec2 hash2( vec2 p2, float p )
    {
        return vec2( sin( p2.x ) + 1.0f, cos( p2.y ) * p );
    }

    float noise1( vec2 p )
    {
        return ( ( sin( p.x * 5.78f ) * cos( p.y * 4.23f ) ) + 1.0f ) / 2.0f;
    }

Increase the beacon probability and change the window weighting. A side effect is that there'll now be colored windows without doing any work:

    const float windowSize = 0.075;
    const float windowSpead = 1.0;
    (...)
    const float beaconProb = 0.0006;

And now it should run at a comfortable 73 fps. It doesn't have the gloomy feeling of the original, but it's flashier. I still like flashy stuff.

Based on code by mhnewman which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Need to document the details later. It was the usual replacing of noise() and hash() functions, and some simplifications, and adding some colors and stuff, and increasing the general height of everything. The light size was increased, but I see now that they should have been aligned better.

While investigating some performance discrepancies between the A57 and Denver2 cores on the NVidia Jetson TX2, somebody said "screw this crap, can the TX2 GPU render an ocean in 1080p60?". As usual, somebody else has already done this pretty well. But will it run in 60fps on the Jetson TX2?

Based on code by Alexander Alekseev which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The original code gives around 23 fps, so let's start replacing the costly functions with low budget ones and hope the result looks somewhat like the original. There are a couple of places to save rendering time. First is the noise() function. Strip that out and replace it with the much simpler

    float noise( vec2 p )
    {
        return (sin( p.x * 1.91f ) + cos( p.y * 1.71f ))*0.75f;
    }

Next is the map() and map_detailed() functions. It's enough to remove one sea_octave() call to stay marginally above 60 fps, so remove the last one in map() and multiply the first one result by 1.4.

I like a straight skyline, so remove the "dir.z += length(uv) * 0.15;" in main().

You'll end up with something resembling the original. The sea is much more synthetic, but the patterns look kinda neat. Should really do something about the aliasing, though. This problem can easily be solved by brute force: Rendering in 8k and downscaling gives excellent results, but the framerate will suffer badly. 1.5 fps with the original code or 4 with the changes suggested above.

Based on code by w23 which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The original code produces 26 fps on the NVidia Jetson TX2. Nvcc actually manages to make mostly useful assembler code of it, since functions are short and it figures out the overlaps between the world()/w1()/w2()/t1()/t2() functions. So let's look elsewhere for speedups. 1080p60 or bust!

Start off by moving the trajectory and O-calculation to the CPU side. Also move the color and lightpos tables for later use. The CPU can adjust those for free while rendering a frame and costs roughly 0.5 fps. I never liked the shadows, removing them increases the fps to 43. Adjust the fog-like attenuation from 10-20 to 12-18 and cover the left and right 256 columns with some triangles for a boost to 58 fps. In trace(), start L at 12.5 instead of 0 for 78 fps, so we're way over target. That means the number of spheres (N) can be increased to 8 and we end up at 64 fps. Any more spheres is too much without changing the paths.

Let's do some other minor tweaks since I like flashy stuff: Rotate the light sources around the spheres, make the lights bigger/smaller depending on Z, remove gamma correction, add fade in/fade out, adjust the specular values. Average fps ends up at around 62 fps and the load at 93%.

Again based on code by Kamil Kolaczynski which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Let's try to make a new version of Noise 3D Fly Through that uses the NVidia Jetson TX2 and looks even closer to the original. The Pascal 256 core GPU runs at 1300 MHz, so there's 30 percent more processing power available.

Let's do this again from scratch. Start with the original code and apply each change from the previous post that doesn't totally destroy the look. It starts out at 13 fps in 1280x720. Adding the table-based noise() function increases this to 23 fps. Reducing depth to 96 gives 25 fps. Increasing MarchDumping to 0.9 gives 27, and killing off soft shadows and ambient occlusion gives 34 fps. So there's still some work to do before it'll generate 60 fps.

Adjusting the mentioned parameters more will give serious visual artifacts. The trick this time is to mitigate these artifacts as much as possible by adjusting MD on the fly. Change castRay() so MD starts at 0.9, and multiply it by 1.025 for each step after the first 16 steps are done. This gives roughly 60 fps and looks a lot more like the original. Yeah, there are some artifacts, but it's running in 720p60 on a small SoC, not a GTX1080Ti. Must resist urge to order a GTX1080Ti!

Based on code by Kamil Kolaczynski which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The NVidia Tegra X1 is getting a bit long in the tooth, and only manages around 5 frames per second in 1080p. Serious corner cutting had to be done to make it run in 60 fps: Reduce resolution to 720p (doh), remove soft shadows and ambient occlusion, reduce max steps to 96, increase cutoff in castRay() to 0.03 and fix the if()-order in castRay(). Those changes are trivial.

The main work is done in the noise() function. The trick is to come up with a significantly quicker one that doesn't result in too much or too little open space. On the CPU, precalculate a uniform table with vec4 values that's a power of two in size so it wraps nicely. The formula used is the same as in the original, but on a significantly smaller scale. It has just 256 entries, which seems to be just enough to generate an interesting result.

    #define TABLELEN 256
    #define TABLEMASK (TABLELEN-1)
    (...)
    for( int i = 0; i < TABLELEN; i++ ) {
        v4 v;
        v.x = fract( sinf( ((i + 0)&TABLEMASK) * 43758.5453123f ) );
        v.y = fract( sinf( ((i + 157)&TABLEMASK) * 43758.5453123f ) );
        v.z = fract( sinf( ((i + 113)&TABLEMASK) * 43758.5453123f ) );
        v.w = fract( sinf( ((i + 270)&TABLEMASK) * 43758.5453123f ) );
        gpuj->table[i] = v;
    }
    (...)

Then make a new noise function that just reads from the table with LDC.F32X4 before mix(). Take care to handle negative numbers correctly:

    float noise( vec3 x00 )
    {
        ivec3 p00 = ivec3( floor( x00 ) );
        vec3 f00 = fract( x00 );
        f00 = f00 * f00 * (3.0 - 2.0 * f00);
        uint n00 = uint(abs(p00.x + p00.y * 157 + 113 * p00.z));
        vec4 h00 = table[(n00+0u)&uint(TABLEMASK)];
        vec4 h01 = table[(n00+1u)&uint(TABLEMASK)];
        h00 = mix( h00, h01, f00.x );
        h00.xy = mix( h00.xz, h00.yw, f00.y );
        h00.x = mix( h00.x, h00.y, f00.z );
        return h00.x;
    }

Unfortunately, the compiler refuses to generate LRP instructions. Interleaving multiple noise() calls to avoid stalls doesn't work well, since it will deinterleave them to save registers.

The fbm() function has three noise() calls, and that's a bit much. Remove one and adjust the weights:

    float fbm(vec3 p)
    {
        float f;
        f = 0.500 * noise(p);
        p *= 2.91;
        f += 0.250 * noise(p);
        return f;
    }

The result is not as cool as the original, but it still looks pretty nice and runs in 60 fps.

A rehash of the 1080p60 julia fractals video. Rendered in 4k and downscaled for higher quality. Fixed the pathing a bit, and better synchronization with the music (I got lucky).

Test video traced in 4k and downscaled to 1080p. No advanced filters, just simple averaging. Looks a lot better than the other GPURay videos, but it's far from realtime on the X1. Obviously.

To get 60 fps Thorn fractals on the X1, the code has to be changed a bit. The original fractal code was optimized for a single pixel at max iterations. The method used in the video samples four pixels, averages them and assumes the iteration count is always low. So let's rewrite stuff to match the video:

First, set up the coordinates. Move each a quarter pixel left/right and up/down. Could be simplified, and the 16:9 scaling is now really out of place:

    void main()
    {
        vec4 xv1 = vec4 ( 1.7778f * (v_texCoord.x - 0.5f - 0.25f/1920.0f) / (0.5f * sz) + sx,
                          1.7778f * (v_texCoord.x - 0.5f + 0.25f/1920.0f) / (0.5f * sz) + sx,
                          1.7778f * (v_texCoord.x - 0.5f - 0.25f/1920.0f) / (0.5f * sz) + sx,
                          1.7778f * (v_texCoord.x - 0.5f + 0.25f/1920.0f) / (0.5f * sz) + sx );

        vec4 yv1 = vec4( (v_texCoord.y - 0.5f - 0.25f/1080.0f) / (0.5f * sz) + sy,
                         (v_texCoord.y - 0.5f - 0.25f/1080.0f) / (0.5f * sz) + sy,
                         (v_texCoord.y - 0.5f + 0.25f/1080.0f) / (0.5f * sz) + sy,
                         (v_texCoord.y - 0.5f + 0.25f/1080.0f) / (0.5f * sz) + sy );

Need some vars to keep track of which of the four are done:

        ivec4 done = ivec4( MAXITER );
        int donecnt = 0;

The new main loop:

        int i;
        for( i = 0; i < MAXITER; i++ ) {

            vec4 xv0 = xv1 / cos( yv1 ) + re;
            vec4 yv0 = yv1 / sin( xv1 ) + im;

            if( done.x == MAXITER && xv0.x * xv0.x + yv0.x * yv0.x > ESCAPE ) { done.x = i; donecnt++; }
            if( done.y == MAXITER && xv0.y * xv0.y + yv0.y * yv0.y > ESCAPE ) { done.y = i; donecnt++; }
            if( done.z == MAXITER && xv0.z * xv0.z + yv0.z * yv0.z > ESCAPE ) { done.z = i; donecnt++; }
            if( done.w == MAXITER && xv0.w * xv0.w + yv0.w * yv0.w > ESCAPE ) { done.w = i; donecnt++; }

            if( donecnt == 4 ) break;

            xv1 = xv0;
            yv1 = yv0;
        }

Get the colors, average and display:

        vec3 col0 = getRGB( float(done.x) / float(MAXITER) * colscale );
        vec3 col1 = getRGB( float(done.y) / float(MAXITER) * colscale );
        vec3 col2 = getRGB( float(done.z) / float(MAXITER) * colscale );
        vec3 col3 = getRGB( float(done.w) / float(MAXITER) * colscale );

        outColor = vec4( (col0 + col1 + col2 + col3)*0.25f, 1.0f );
    }

This will run the path used in the video at 60 fps and look exactly the same. Load graph:

There's now an abundance of free GPU capacity. A simple improvement would be to add a center point, move the four other points further out and try some different weightings. A quick test shows that this does improve the result, but probably little point in making a new video.

Paul Bourke's 2012 article about Thorn fractals is quite interesting. He recently posted a short video on YouTube demonstrating the effect.

It's quite simple to change the GPU Fractal GLSL code to support Thorn fractals. Replace the inner loop as follows:

    (...)
    for( int j = 0; j < DEPTH/4; j++ ) {

        xv0.x = xv0.w / cos(yv0.w) + fre;
        yv0.x = yv0.w / sin(xv0.w) + fim;
        xv0.y = xv0.x / cos(yv0.x) + fre;
        yv0.y = yv0.x / sin(xv0.x) + fim;
        xv0.z = xv0.y / cos(yv0.y) + fre;
        yv0.z = yv0.y / sin(xv0.y) + fim;
        xv0.w = xv0.z / cos(yv0.z) + fre;
        yv0.w = yv0.z / sin(xv0.z) + fim;

        bv[j].x = xv0.x * xv0.x + yv0.x * yv0.x > ESCAPE;
        bv[j].y = xv0.y * xv0.y + yv0.y * yv0.y > ESCAPE;
        bv[j].z = xv0.z * xv0.z + yv0.z * yv0.z > ESCAPE;
        bv[j].w = xv0.w * xv0.w + yv0.w * yv0.w > ESCAPE;
    }
    (...)

The results are somewhat tacky. Bourke suggests sampling a much larger picture and downscaling with a high quality filter. Unfortunately, the X1 GPU doesn't have the horsepower for that if we want it to run at 60 fps. Let's instead try a cheaper solution: Sample 4 pixels per output pixel (+/-0.25) and average them in a single pass. The palette used is red-yellow-black with adjusted weighting depending on the re/im values:

If it's easy to project the quaternion on a plane, it should be easy to rotate it too. Also found some spare cycles, so detail is significantly increased:

It's alive! NVidia Jetson TX2 module with Tegra P1 "Parker"!

    [inky:/tmp] $ cat /proc/cpuinfo
    processor : 0
    model name : ARMv8 Processor rev 3 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x1
    CPU part : 0xd07
    CPU revision : 3

    processor : 1
    model name : ARMv8 Processor rev 0 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x4e
    CPU architecture: 8
    CPU variant : 0x0
    CPU part : 0x003
    CPU revision : 0
    MTS version : 38019512

    processor : 2
    model name : ARMv8 Processor rev 0 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x4e
    CPU architecture: 8
    CPU variant : 0x0
    CPU part : 0x003
    CPU revision : 0
    MTS version : 38019512

    processor : 3
    model name : ARMv8 Processor rev 3 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x1
    CPU part : 0xd07
    CPU revision : 3

    processor : 4
    model name : ARMv8 Processor rev 3 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x1
    CPU part : 0xd07
    CPU revision : 3

    processor : 5
    model name : ARMv8 Processor rev 3 (v8l)
    BogoMIPS : 62.50
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x1
    CPU part : 0xd07
    CPU revision : 3

    [inky:/tmp] $ cat /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/available_frequencies
    114750000 216750000 318750000 420750000 522750000 624750000 726750000 828750000 930750000 1032750000 1134750000 1236750000 1300500000

Seems like 1.3 GHz is the max GPU speed available. Let's force it to that:

    [inky:/tmp] $ echo userspace >/sys/devices/17000000.gp10b/devfreq/17000000.gp10b/governor
    [inky:/tmp] $ echo 1300500000 >/sys/devices/17000000.gp10b/devfreq/17000000.gp10b/userspace/set_freq
    [inky:/tmp] $ ./gpuray -q
    GL Vendor: NVIDIA Corporation
    GL Renderer: NVIDIA Tegra
    GL Version: OpenGL ES 3.2 NVIDIA 381.00
    (...)
    85.1386 87.7119 88.7040 89.4596 82.6418 81.5514 78.9375 85.7998 86.3342
    (...)
    Time (ms): 28378.757812
    Frames: 2401
    FPS: 84.605537
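If the frequency should be picked by a program instead of by eye, the contents of available_frequencies are easy to parse. A small C sketch that returns the highest entry of such a list is shown here. The function name is made up, and reading the sysfs file and writing the result to set_freq are left out.

```c
#include <stdlib.h>

/* Return the largest value in a whitespace-separated list of frequencies (Hz),
   or 0 if the list holds no numbers. */
static unsigned long long max_frequency(const char *list)
{
    unsigned long long best = 0;
    const char *p = list;

    for (;;) {
        char *end;
        unsigned long long f = strtoull(p, &end, 10);
        if (end == p) break;          /* no more numbers */
        if (f > best) best = f;
        p = end;
    }
    return best;
}
```

Taking the maximum rather than the last entry means the order of the list doesn't matter.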

That's almost exactly 30% faster than the X1. The shape of the graph changes with 160 spheres, meaning it's time to reconsider the unrolling. So no 160 spheres at 60 fps yet:

I started out trying to add a shadow plane to the GPU Raytraced Quaternion Julia Sets code. That's very simple, but unfortunately they're not very pretty. They mostly look like blobs (duh). The fun part is that instead of drawing a shadow, it's almost free drawing the quaternion instead. So the shadow calculation can be used to distort it and make a semi-interesting video. Added some background 8-bit style pulsing effect for fun:

Revised the color scheme slightly in GPU Raytraced Quaternion Julia Sets and made a new video. Who says I don't recycle stuff?

New article with source code for the Mellanox TILE-Gx integer raytracer is available.

Time for modern art! An attempt to replicate one of the Burning Ship fractal test pictures from Wikipedia. Precise coordinates found on Paul Bourke's page.

The fractal code from the GPU Hacks article is used, just replace the inner loop calculation as follows:

    xv0.x = xv0.w * xv0.w - yv0.w * yv0.w - fre;
    yv0.x = 2.0f * abs( xv0.w * yv0.w ) - fim;
    xv0.y = xv0.x * xv0.x - yv0.x * yv0.x - fre;
    yv0.y = 2.0f * abs( xv0.x * yv0.x ) - fim;
    xv0.z = xv0.y * xv0.y - yv0.y * yv0.y - fre;
    yv0.z = 2.0f * abs( xv0.y * yv0.y ) - fim;
    xv0.w = xv0.z * xv0.z - yv0.z * yv0.z - fre;
    yv0.w = 2.0f * abs( xv0.z * yv0.z ) - fim;

And flip the y calculation, set iterations to 64 and set output color rg=(1.0f - float(i) / float(MAXITER)).

Increase to 40 spheres by forcing multiply-accumulate in some places. Made the center sphere a bit bigger and placed the rotating spheres a bit closer to each other, since there were cycles to spare:

Also a slight movement speedup, so it doesn't compare directly to the previous version:

Finally! 36 spheres on the Mellanox TILE-Gx36, ie. 1 sphere per core. (Actually slightly more since 1 core is used for administration) Had to do some counter-intuitive restructuring of the loops to max it out. Gcc 6.3 still cannot create optimal loops. Hmm.

Load measurement of 3600 frames:

32 spheres integer raytracer on Mellanox TILE-Gx36:

Pretty much same as the old one, but uses custom float to integer and vice versa conversion routines with only as much precision as needed. N-R steps in inv_float() moved to int. Higher precision fixed point removes most of the bugs in the first video.

20 spheres integer raytracer on Mellanox TILE-Gx36:

Not as simple as it sounds. Using fix16 format is one option, but there's no point in getting stuck with max 32 bit operations on a 64 bit architecture. The compiler tends to pollute 32 bit operations with interleaves, and there's no 64x64 bit multiply. But there's a collection of 32x32->64 bit multiply operations, so force them with (int64_t)__insn_mul_ls_ls() and keep the base data type as 64 bit. Leave add/sub etc. as is. Keep fast_sqrt() and fast_inv_sqrt() in float. Check generated code often! Could probably pack data tighter and optimize it more, but 20 spheres was more than expected. Let's quit while ahead!
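The arrangement described above, a 64-bit base type combined with 32x32->64 bit multiplies, can be sketched in portable C as follows. FX_FRAC, the fx_ names and the conversion helper are hypothetical. The cast pair is the generic way to ask for a signed 32x32->64 multiply and only stands in for the TILE-Gx intrinsic.

```c
#include <stdint.h>

#define FX_FRAC 16                       /* hypothetical number of fraction bits */
#define FX_ONE  ((int64_t)1 << FX_FRAC)

typedef int64_t fx_t;                    /* 64-bit base type */

static fx_t fx_from_double(double v)
{
    return (fx_t)(v * (double)FX_ONE);
}

/* Multiply two fixed-point values whose raw representation fits in 32 bits.
   The casts spell out a signed 32x32->64 multiply; on TILE-Gx the post
   forces the same thing with (int64_t)__insn_mul_ls_ls(). */
static fx_t fx_mul(fx_t a, fx_t b)
{
    int64_t wide = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return wide >> FX_FRAC;              /* assumes arithmetic right shift */
}
```

Add and subtract work directly on fx_t, which matches the advice to leave add/sub etc. as is.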

I found an old TILE-Gx card in the toolshed. Dusted off the old TILE-Gx raytracer, ported the control code from GPURay and did some optimizations. Notably, using precalculated poseye/rv for the initial reflection pass, chopping off the lower 8 rows, switched to using the Par library for parallelization, removed shadows. It can now do 12 spheres instead of a measly 6. Lab demo picture (background needs more work):

Writing parallel code on the TILE-Gx is still fun! Too bad the floating point support is very limited. Must resist urge to make an integer version.

The load indication in previous videos is read from /sys/devices/platform/host1x/gpu.0/load. It updates 60 times a second. Let's replace the boring number with a load bar that averages the last 10 readings so it doesn't jump around too much. And throw in a clock and some bouncing spheres to generate load:
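Averaging the last 10 readings is a classic ring buffer with a running sum. A minimal C sketch follows. The struct and function names are made up, and it only covers the averaging, not reading the sysfs file or drawing the bar.

```c
#define LOAD_WINDOW 10

typedef struct {
    int samples[LOAD_WINDOW];
    int count;   /* number of valid samples, up to LOAD_WINDOW */
    int next;    /* slot that the next sample overwrites */
    int sum;     /* running sum of the valid samples */
} load_avg;      /* zero-initialise before first use */

/* Add one reading and return the average of the last LOAD_WINDOW readings. */
static int load_avg_push(load_avg *a, int reading)
{
    if (a->count == LOAD_WINDOW)
        a->sum -= a->samples[a->next];   /* drop the oldest reading */
    else
        a->count++;

    a->samples[a->next] = reading;
    a->sum += reading;
    a->next = (a->next + 1) % LOAD_WINDOW;

    return a->sum / a->count;
}
```

Calling load_avg_push() once per frame with the latest reading gives a bar that reacts within ten frames but doesn't flicker on single spikes.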

Let's try bouncing some spheres around:

In GPURay, the intBitsToFloat() trick to eliminate negative values unfortunately generates a redundant MOV instruction:

    (...)
    vec4 t0 = -b0 - sqrt( d0 );
    t0 += intBitsToFloat(floatBitsToInt( t0 )>>31);
    float mv = min( min( t0.x, t0.y ), min( t0.z, t0.w ) );
    (...)

Another way to eliminate them is by inversion. There'll be more generated code, but fewer effective operations:

    (...)
    vec4 t0 = 1.0f / (-b0 - sqrt( d0 ));
    float mv = max( max( t0.x, t0.y ), max( t0.z, t0.w ) );
    (...)

The min() calls and if() tests have to be flipped and the value inverted again outside the loop:

    (...)
    int refl;
    for( refl = 0; refl < REFLECTIONS; refl++ ) {

        float rt_dist = 0.0f;
        (...)
            if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) {
                vec4 t0 = 1.0f / (-b0 - sqrt( d0 ));
                float mv = max( max( t0.x, t0.y ), max( t0.z, t0.w ) );
                if( mv > rt_dist ) {
                    rt_dist = mv;
                    rt_t0 = t0.yzw;
                    rt_i = i + 0;
                }
            }
            // ...
            // ditto for the other groups
            // ...
            (...)
            if( rt_dist > 0.33f ) break;
        }

        if( rt_i == -1 ) break;

        ivec3 iv = ivec3( equal( rt_t0, vec3( rt_dist ) ) );
        rt_i += iv.x + iv.y + iv.y + iv.z + iv.z + iv.z;

        // sphere hit + reflection
        ray_pos += ray_dir * (1.0f / rt_dist);

This is slightly quicker and eliminates the remaining split second drop below 60 fps:

The goal of 128 spheres is finally reached! Unroll 32 first loop, unroll 16 second loop. Z sort spheres on the CPU and cut off when something close enough is found.

All the details are wrapped up in the new GPURay article.

120 spheres. Only a minor change to unrolling makes this possible with the latest code. The unrolls should match a multiple of 8, so unroll the sphere loop 24 times. If the shadow loop is unrolled 24 times, the extension to the any(greaterThan()) if-statement will make things horrendously slow, so de-unroll that one to 8.

Let's reintroduce an old classic: Preloading. The assembler code shows that loads from obj_pos[] are done once with a LDC.F32X4, but the first in each block is used right after load. Let's try preloading 4 at a time and do the ray_pos subs at the same time. This gives neatly stacked sets of 4 loads + 4 subs that are far enough away. Also, it runs quicker:

    for( refl = 0; refl < REFLECTIONS; refl++ ) {

        float rt_dist = FARDIST;
        int rt_i = -1;
        vec4 rt_t0;

        vec4 v0, v1, v2, v3;
        vec4 v4, v5, v6, v7;

        // preload 0-3
        v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w );
        v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w );
        v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w );
        v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w );

        // trace_ray()
        for( int i = 0; i < OBJNUM; i += 16 ) {

            vec4 b0, d0;

            // preload 4-7
            v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w );
            v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w );
            v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w );
            v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w );

            // calc 0-3
            b0 = vec4( dot( v0.xyz, ray_dir ), dot( v1.xyz, ray_dir ), dot( v2.xyz, ray_dir ), dot( v3.xyz, ray_dir ) );
            d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w,
                       b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w,
                       b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w,
                       b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w );

            if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) {
                vec4 t0 = -b0 - sqrt( d0 );
                t0 += intBitsToFloat(floatBitsToInt( t0 )>>31);
                vec2 mv = min( t0.xy, t0.zw );
                mv.x = min( mv.x, mv.y );
                if( mv.x < rt_dist ) {
                    rt_dist = mv.x;
                    rt_t0 = t0;
                    rt_i = i + 0;
                }
            }

            // ...
            // ditto for preload 8-11, calc 4-7, preload 12-15, calc 8-11, preload 16-19, calc 12-15
            // ...

The subs could be shifted around so they're not done in the last iteration, but the compiler is being difficult about this. Needs some more work.

The same can be done in the shadow calculation for a slight improvement there too:

    vec4 v0, v1, v2, v3;
    vec4 v4, v5, v6, v7;

    // preload 0-3
    v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w );
    v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w );
    v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w );
    v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w );

    // check shadow
    int i;
    for( i = 0; i < OBJNUM; i += 16 ) {

        vec4 d0, d1, d2, d3;
        vec4 b0;

        // preload 4-7
        v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w );
        v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w );
        v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w );
        v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w );

        // calc 0-3
        b0 = vec4( dot( v0.xyz, l ), dot( v1.xyz, l ), dot( v2.xyz, l ), dot( v3.xyz, l ) );
        d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w,
                   b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w,
                   b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w,
                   b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w );

        // ...
        // ditto for preload 8-11, calc 4-7, preload 12-15, calc 8-11, preload 16-19, calc 12-15
        // ...

Some more cycles can be saved by simplifying the rt_hit calculation and removing the light_col parameter:

(...)
ivec3 iv = ivec3( equal( rt_t0.yzw, vec3( rt_dist ) ) );
rt_i += iv.x + iv.y + iv.y + iv.z + iv.z + iv.z;
(...)
col += vec3( specular ) + obj_col[rt_i].xyz * diffuse;
ray_dir = reflect( ray_dir, n );
(...)

I forgot to post measurements of the last steps, so let's wrap them up in one graph:

Still not enough for 128 spheres, but I'm trying some different unrolls for 120 at the moment. Stay tuned!

It's possible to get rid of another any/lessThan vec4 combination by calculating min first and doing a single test. This is about 5% faster. Not enough for 128 spheres, but a step on the way. Same in all 4 sphere sections, just add +4/+8/+12 to rt_i:

(...)
if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) {
    vec4 t0 = -b0 - sqrt( d0 );
    t0 += intBitsToFloat(floatBitsToInt( t0 )>>31);
    vec2 mv = min( t0.xy, t0.zw );
    mv.x = min( mv.x, mv.y );
    if( mv.x < rt_dist ) {
        rt_dist = mv.x;
        rt_t0 = t0;
        rt_i = i + 0;
    }
}
(...)

112 spheres this time. Let's go through the necessary changes. The CPU preprocessing is still under wraps, but there are changes in the GPU code that can be published.

A simple start: The shadow calculation is quicker if all the tests are gathered at the bottom. The generated code isn't pretty, but it's quicker than max/add for some reason. Never mind:

// check shadow
int i;
for( i = 0; i < OBJNUM; i += 16 ) {
    vec3 v0, v1, v2, v3;
    vec4 d0, d1, d2, d3;
    vec4 b0;

    // 0..3
    v0 = ray_pos - obj_pos[i+0].xyz;
    v1 = ray_pos - obj_pos[i+1].xyz;
    v2 = ray_pos - obj_pos[i+2].xyz;
    v3 = ray_pos - obj_pos[i+3].xyz;
    b0 = vec4( dot( v0, l ),
               dot( v1, l ),
               dot( v2, l ),
               dot( v3, l ) );
    d0 = vec4( b0.x * b0.x - dot( v0, v0 ) + obj_pos[i+0].w,
               b0.y * b0.y - dot( v1, v1 ) + obj_pos[i+1].w,
               b0.z * b0.z - dot( v2, v2 ) + obj_pos[i+2].w,
               b0.w * b0.w - dot( v3, v3 ) + obj_pos[i+3].w );

    // ...
    // ditto for 4-7, 8-11, 12-15
    // ...

    if( any( greaterThan( d0, vec4( 0.0f ) ) ) ||
        any( greaterThan( d1, vec4( 0.0f ) ) ) ||
        any( greaterThan( d2, vec4( 0.0f ) ) ) ||
        any( greaterThan( d3, vec4( 0.0f ) ) ) )
        break;
}
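For reference, the b0/d0 expressions match the usual ray-sphere quadratic. This assumes the direction (l or ray_dir) is unit length and that obj_pos[i].w holds the squared radius. Neither is stated here, but both are what the code needs for the terms to line up. With ray origin $\mathbf{o}$, direction $\mathbf{d}$, sphere centre $\mathbf{c}$, radius $r$ and $\mathbf{v} = \mathbf{o} - \mathbf{c}$:

```latex
\begin{align}
|\mathbf{v} + t\,\mathbf{d}|^2 &= r^2 \\
t^2 + 2\,(\mathbf{v}\cdot\mathbf{d})\,t + \mathbf{v}\cdot\mathbf{v} - r^2 &= 0 \\
b &= \mathbf{v}\cdot\mathbf{d} \\
D &= b^2 - \mathbf{v}\cdot\mathbf{v} + r^2 \\
t &= -b \pm \sqrt{D}
\end{align}
```

So d0 is $D$ per lane, and a positive lane means the line through the ray crosses that sphere. The trace loop then takes the nearer root, $-b - \sqrt{D}$, as t0.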

The main sphere loop has lots of room for improvement. The dv greaterThan() and if() construct is less than optimal. floatBitsToInt() can be used to convert negative numbers to (-)NaN so the negative t0 values get eliminated. Then just find the lowest value and save what's needed to find the id later instead of doing it in place:

// trace_ray()
for( int i = 0; i < OBJNUM; i += 16 ) {
    vec3 v0, v1, v2, v3;
    vec4 b0, d0;

    // 0-3
    v0 = ray_pos - obj_pos[i+0].xyz;
    v1 = ray_pos - obj_pos[i+1].xyz;
    v2 = ray_pos - obj_pos[i+2].xyz;
    v3 = ray_pos - obj_pos[i+3].xyz;
    b0 = vec4( dot( v0, ray_dir ),
               dot( v1, ray_dir ),
               dot( v2, ray_dir ),
               dot( v3, ray_dir ) );
    d0 = vec4( b0.x * b0.x - dot( v0, v0 ) + obj_pos[i+0].w,
               b0.y * b0.y - dot( v1, v1 ) + obj_pos[i+1].w,
               b0.z * b0.z - dot( v2, v2 ) + obj_pos[i+2].w,
               b0.w * b0.w - dot( v3, v3 ) + obj_pos[i+3].w );

    if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) {
        vec4 t0 = -b0 - sqrt( d0 );
        t0 += intBitsToFloat(floatBitsToInt( t0 )>>31);
        if( any( lessThan( t0, vec4( rt_dist ) ) ) ) {
            vec2 mv = min( t0.xy, t0.zw );
            mv.x = min( mv.x, mv.y );
            rt_dist = mv.x;
            rt_t0 = t0;
            rt_i = i;
        }
    }

    // ...
    // ditto for 4-7, 8-11, 12-15
    // ...
}
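The sign-bit trick is easy to check on the CPU. The C sketch below models one lane of `t0 += intBitsToFloat(floatBitsToInt( t0 )>>31)`, assuming IEEE 754 single precision. `kill_negative()` is a made-up name. The mask is built explicitly because right-shifting a negative signed integer is implementation-defined in C, whereas GLSL sign-extends.

```c
#include <stdint.h>
#include <string.h>

/* One lane of: t += intBitsToFloat( floatBitsToInt( t ) >> 31 ); */
static float kill_negative(float t)
{
    int32_t bits;
    memcpy(&bits, &t, sizeof bits);          /* floatBitsToInt */

    /* sign bit set   -> all ones, which is a NaN bit pattern
     * sign bit clear -> all zeros, which is +0.0f            */
    uint32_t mask = (bits < 0) ? 0xFFFFFFFFu : 0u;

    float add;
    memcpy(&add, &mask, sizeof add);         /* intBitsToFloat */

    return t + add;  /* negative t turns into NaN, the rest passes through */
}
```

A NaN lane fails every ordered comparison, so it drops out of the lessThan test without a separate sign check.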

After the sphere loop is done, find out what was hit:

if( rt_i == -1 ) break;

ivec4 iv = mix( ivec4( 0 ), ivec4( rt_i, rt_i+1, rt_i+2, rt_i+3 ), equal( vec4( rt_dist ), rt_t0 ) );
iv.xy += iv.zw;
int rt_hit = iv.x + iv.y;