Random stuff I'm working on, in chronological order.

Based on code by w23 which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The original code produces 26 fps on the NVidia Jetson TX2. Nvcc actually manages to make mostly useful assembler code of it, since functions are short and it figures out the overlaps between the world()/w1()/w2()/t1()/t2() functions. So let's look elsewhere for speedups. 1080p60 or bust!

Start off by moving the trajectory and O-calculation to the CPU side. Also move the color and lightpos tables for later use. The CPU can adjust those for free while rendering a frame and costs roughly 0.5 fps. I never liked the shadows, removing them increases the fps to 43. Adjust the fog-like attenuation from 10-20 to 12-18 and cover the left and right 256 columns with some triangles for a boost to 58 fps. In trace(), start L at 12.5 instead of 0 for 78 fps, so we're way over target. That means the number of spheres (N) can be increased to 8 and we end up at 64 fps. Any more spheres is too much without changing the paths.

Let's do some other minor tweaks since I like flashy stuff: Rotate the light sources around the spheres, make the lights bigger/smaller depending on Z, remove gamma correction, add fade in/fade out, adjust the specular values. Average fps ends up at around 62 fps and the load at 93%.

Again based on code by Kamil Kolaczynski which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Let's try to make a new version of Noise 3D Fly Through that uses the NVidia Jetson TX2 and looks even closer to the original. The Pascal 256 core GPU runs at 1300mhz, so there's 30 percent more processing power available.

Let's do this again from scratch. Start with the original code and apply each change from the previous post that doesn't totally destroy the look. It starts out at 13 fps in 1280x720. Adding the table-based noise() function increases this to 23 fps. Reducing depth to 96 gives 25 fps. Increasing MarchDumping to 0.9 gives 27, and killing off soft shadows and ambient occlusion gives 34 fps. So there's still some work to do before it'll generate 60 fps.

Adjusting the mentioned parameters more will give serious visual artifacts. The trick this time is to mitigate these artifacts as much as possible by adjusting MD on the fly. Change castRay() so MD starts at 0.9, and multiply it by 1.025 for each step after the first 16 steps are done. This gives roughly 60 fps and looks alot more like the original. Yeah, there's some artifacts, but it's running in 720p60 on a small SoC, not a GTX1080Ti. Must resist urge to order a GTX1080Ti!

Based on code by Kamil Kolaczynski which is available on ShaderToy. It's licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The NVidia Tegra X1 is getting a bit long in the tooth, and only manages around 5 frames per second in 1080p. Serious corner cutting had to be done to make it run in 60 fps: Reduce resolution to 720p (doh), remove soft shadows and ambient occlusion, reduce max steps to 96, increase cutoff in castRay() to 0.03 and fix the if()-order in castRay(). Those changes are trivial.

The main work is done in the noise() function. The trick is to come up with a significantly quicker one that doesn't result in too much or too little open space. On the CPU, precalculate a uniform table with vec4 values that's a power of two in size so it wraps nicely. The formula used is the same as in the original, but on a significantly smaller scale. It has just 256 entries, which seems to be just enough generate an interesting result.

#define TABLELEN 256 #define TABLEMASK (TABLELEN-1) (...) for( int i = 0; i < TABLELEN; i++ ) { v4 v; v.x = fract( sinf( ((i + 0)&TABLEMASK) * 43758.5453123f ) ); v.y = fract( sinf( ((i + 157)&TABLEMASK) * 43758.5453123f ) ); v.z = fract( sinf( ((i + 113)&TABLEMASK) * 43758.5453123f ) ); v.w = fract( sinf( ((i + 270)&TABLEMASK) * 43758.5453123f ) ); gpuj->table[i] = v; } (...)

Then make a new noise function that just reads from the table with LDC.F32X4 before mix(). Take care to handle negative numbers correctly:

float noise( vec3 x00 ) { ivec3 p00 = ivec3( floor( x00 ) ); vec3 f00 = fract( x00 ); f00 = f00 * f00 * (3.0 - 2.0 * f00); uint n00 = uint(abs(p00.x + p00.y * 157 + 113 * p00.z)); vec4 h00 = table[(n00+0u)&uint(TABLEMASK)]; vec4 h01 = table[(n00+1u)&uint(TABLEMASK)]; h00 = mix( h00, h01, f00.x ); h00.xy = mix( h00.xz, h00.yw, f00.y ); h00.x = mix( h00.x, h00.y, f00.z ); return h00.x; }

Unfortunately, the compiler refuses to generate LRP instructions. Interleaving multiple noise() calls to avoid stalls doesn't work well, since it will deinterleave them to save registers.

The fbm() function has three noise() calls, and that's a bit much. Remove one and adjust the weights:

float fbm(vec3 p) { float f; f = 0.500 * noise(p); p *= 2.91; f += 0.250 * noise(p); return f; }

The result is not as cool as the original, but it still looks pretty nice and runs in 60 fps.

A rehash of the 1080p60 julia fractals video. Rendered in 4k and downscaled for higher quality. Fixed the pathing a bit, and better synchronization with the music (I got lucky).

Test video traced in 4k and downscaled to 1080p. No advanced filters, just simple averaging. Looks alot better than the other GPURay videos, but it's far from realtime on the X1. Obviously.

To get 60 fps Thorn fractals on the X1, the code has to be changed a bit. The original fractal code was optimized for a single pixel at max iterations. The method used in the video samples four pixels, averages them and assumes the iteration count is always low. So let's rewrite stuff to match the video:

First, set up the coordinates. Move each a quarter pixel left/right and up/down. Could be simplified, and the 16:9 scaling is now really out of place:

void main() { vec4 xv1 = vec4 ( 1.7778f * (v_texCoord.x - 0.5f - 0.25f/1920.0f) / (0.5f * sz) + sx, 1.7778f * (v_texCoord.x - 0.5f + 0.25f/1920.0f) / (0.5f * sz) + sx, 1.7778f * (v_texCoord.x - 0.5f - 0.25f/1920.0f) / (0.5f * sz) + sx, 1.7778f * (v_texCoord.x - 0.5f + 0.25f/1920.0f) / (0.5f * sz) + sx ); vec4 yv1 = vec4( (v_texCoord.y - 0.5f - 0.25f/1080.0f) / (0.5f * sz) + sy, (v_texCoord.y - 0.5f - 0.25f/1080.0f) / (0.5f * sz) + sy, (v_texCoord.y - 0.5f + 0.25f/1080.0f) / (0.5f * sz) + sy, (v_texCoord.y - 0.5f + 0.25f/1080.0f) / (0.5f * sz) + sy );

Need some vars to keep track of which of the four are done:

ivec4 done = ivec4( MAXITER ); int donecnt = 0;

The new main loop:

int i; for( i = 0; i < MAXITER; i++ ) { vec4 xv0 = xv1 / cos( yv1 ) + re; vec4 yv0 = yv1 / sin( xv1 ) + im; if( done.x == MAXITER && xv0.x * xv0.x + yv0.x * yv0.x > ESCAPE ) { done.x = i; donecnt++; } if( done.y == MAXITER && xv0.y * xv0.y + yv0.y * yv0.y > ESCAPE ) { done.y = i; donecnt++; } if( done.z == MAXITER && xv0.z * xv0.z + yv0.z * yv0.z > ESCAPE ) { done.z = i; donecnt++; } if( done.w == MAXITER && xv0.w * xv0.w + yv0.w * yv0.w > ESCAPE ) { done.w = i; donecnt++; } if( donecnt == 4 ) break; xv1 = xv0; yv1 = yv0; }

Get the colors, average and display:

vec3 col0 = getRGB( float(done.x) / float(MAXITER) * colscale ); vec3 col1 = getRGB( float(done.y) / float(MAXITER) * colscale ); vec3 col2 = getRGB( float(done.z) / float(MAXITER) * colscale ); vec3 col3 = getRGB( float(done.w) / float(MAXITER) * colscale ); outColor = vec4( (col0 + col1 + col2 + col3)*0.25f, 1.0f ); }

This will run the path used in the video at 60 fps and look exactly the same. Load graph:

There's now an abundance of free GPU capacity. A simple improvement would be to add a center point, move the four other points further out and try some different weightings. A quick test shows that this does improve the result, but probably little point in making a new video.

Paul Bourke's 2012 article about Thorn fractals is quite interesting. He recently posted a short video on Youtube demonstrating the effect.

It's quite simple to change the GPU Fractal GLSL code to support Thorn fractals. Replace the inner loop as follows:

(...) for( int j = 0; j < DEPTH/4; j++ ) { xv0.x = xv0.w / cos(yv0.w) + fre; yv0.x = yv0.w / sin(xv0.w) + fim; xv0.y = xv0.x / cos(yv0.x) + fre; yv0.y = yv0.x / sin(xv0.x) + fim; xv0.z = xv0.y / cos(yv0.y) + fre; yv0.z = yv0.y / sin(xv0.y) + fim; xv0.w = xv0.z / cos(yv0.z) + fre; yv0.w = yv0.z / sin(xv0.z) + fim; bv[j].x = xv0.x * xv0.x + yv0.x * yv0.x > ESCAPE; bv[j].y = xv0.y * xv0.y + yv0.y * yv0.y > ESCAPE; bv[j].z = xv0.z * xv0.z + yv0.z * yv0.z > ESCAPE; bv[j].w = xv0.w * xv0.w + yv0.w * yv0.w > ESCAPE; } (...)

The results are somewhat tacky. Bourke suggests sampling a much larger picture and downscaling with a high quality filter. Unfortunately, the X1 GPU doesn't have the horsepower for that if we want it to run at 60 fps. Let's instead try a cheaper solution: Sample 4 pixels per output pixel (+/-0.25) and average them in a single pass. The palette used is red-yellow-black with adjusted weighting depending on the re/im values:

If it's easy to project the quaternion on a plane, it should be easy to rotate it too. Also found some spare cycles, so detail is significantly increased:

It's alive! NVidia Jetson TX2 module with Tegra P1 "Parker"!

[inky:/tmp] $ cat /proc/cpuinfo processor : 0 model name : ARMv8 Processor rev 3 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd07 CPU revision : 3 processor : 1 model name : ARMv8 Processor rev 0 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x4e CPU architecture: 8 CPU variant : 0x0 CPU part : 0x003 CPU revision : 0 MTS version : 38019512 processor : 2 model name : ARMv8 Processor rev 0 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x4e CPU architecture: 8 CPU variant : 0x0 CPU part : 0x003 CPU revision : 0 MTS version : 38019512 processor : 3 model name : ARMv8 Processor rev 3 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd07 CPU revision : 3 processor : 4 model name : ARMv8 Processor rev 3 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd07 CPU revision : 3 processor : 5 model name : ARMv8 Processor rev 3 (v8l) BogoMIPS : 62.50 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd07 CPU revision : 3

[inky:/tmp] $ cat /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/available_frequencies 114750000 216750000 318750000 420750000 522750000 624750000 726750000 828750000 930750000 1032750000 1134750000 1236750000 1300500000

Seems like 1.3ghz is the max GPU speed available. Let's force it to that:

[inky:/tmp] $ echo userspace >/sys/devices/17000000.gp10b/devfreq/17000000.gp10b/governor [inky:/tmp] $ echo 1300500000 >/sys/devices/17000000.gp10b/devfreq/17000000.gp10b/userspace/set_freq [inky:/tmp] $ ./gpuray -q GL Vendor: NVIDIA Corporation GL Renderer: NVIDIA Tegra GL Version: OpenGL ES 3.2 NVIDIA 381.00 (...) 85.1386 87.7119 88.7040 89.4596 82.6418 81.5514 78.9375 85.7998 86.3342 (...) Time (ms): 28378.757812 Frames: 2401 FPS: 84.605537

That's almost exactly 30% faster than the X1. The shape of the graph changes with 160 spheres, meaning it's time to reconsider the unrolling. So no 160 spheres at 60 fps yet:

I started out trying to add a shadow plane to the GPU Raytraced Quaternion Julia Sets code. That's very simple, but unfortunately they're not very pretty. They mostly look like blobs (duh). The fun part is that instead of drawing a shadow, it's almost free drawing the quaternion instead. So the shadow calculation can be used to distort it and make a semi-interesting video. Added some background 8-bit style pulsing effect for fun:

Revised the color scheme slightly in GPU Raytraced Quaternion Julia Sets and made a new video. Who says I don't recycle stuff?

New article with source code for the Mellanox TILE-Gx integer raytracer is available.

Time for modern art! An attempt to replicate one of the Burning Ship fractal test pictures from Wikipedia. Precise coordinates found on Paul Bourke's page.

The fractal code from the GPU Hacks article is used, just replace the inner loop calculation as follows:

xv0.x = xv0.w * xv0.w - yv0.w * yv0.w - fre; yv0.x = 2.0f * abs( xv0.w * yv0.w ) - fim; xv0.y = xv0.x * xv0.x - yv0.x * yv0.x - fre; yv0.y = 2.0f * abs( xv0.x * yv0.x ) - fim; xv0.z = xv0.y * xv0.y - yv0.y * yv0.y - fre; yv0.z = 2.0f * abs( xv0.y * yv0.y ) - fim; xv0.w = xv0.z * xv0.z - yv0.z * yv0.z - fre; yv0.w = 2.0f * abs( xv0.z * yv0.z ) - fim;

And flip y calculation, set iterations to 64 and set output color rg=(1.0f - float(i) / float(MAXITER).

Increase to 40 spheres by forcing multiply-accumulate in some places. Made the center sphere a bit bigger and placed the rotating spheres a bit closer to each other, since there was cycles to spare:

Also a slight movement speedup, so it doesn't compare directly to the previous version:

Finally! 36 spheres on the Mellanox TILE-Gx36, ie. 1 sphere per core. (Actually slightly more since 1 core is used for administration) Had to do some counter-intuitive restructuring of the loops to max it out. Gcc 6.3 still cannot create optimal loops. Hmm.

Load measurement of 3600 frames:

32 spheres integer raytracer on Mellanox TILE-Gx36:

Pretty much same as the old one, but uses custom float to integer and vice versa conversion routines with only as much precision as needed. N-R steps in inv_float() moved to int. Higher precision fixed point removes most of the bugs in the first video.

20 spheres integer raytracer on Mellanox TILE-Gx36:

Not as simple as it sounds. Using fix16 format is one option, but there's no point in getting stuck with max 32 bit operations on a 64 bit architecture. The compiler tends to pollute 32 bit operations with interleaves, and there's no 64x64 bit multiply. But there's a collection of 32x32->64 bit multiply operations, so force them with (int64_t)__insn_mul_ls_ls() and keep the base data type as 64 bit. Leave add/sub etc. as is. Keep fast_sqrt() and fast_inv_sqrt() in float. Check generated code often! Could probably pack data tighter and optimize it more, but 20 spheres was more than expected. Let's quit while ahead!

I found an old TILE-Gx card in the toolshed. Dusted off the old TILE-Gx raytracer, ported the control code from GPURay and did some optimizations. Notably, using precalculated poseye/rv for the initial reflection pass, chopping off the lower 8 rows, switched to using the Par library for parallelization, removed shadows. It can now do 12 spheres instead of a measly 6. Lab demo picture (background needs more work):

Writing parallel code on the TILE-Gx is still fun! Too bad the floating point support is very limited. Must resist urge to make an integer version.

The load indication in previous videos is read from /sys/devices/platform/host1x/gpu.0/load. It updates 60 times a second. Let's replace the boring number with a load bar that averages the last 10 readings so it doesn't jump around too much. And throw in a clock and some bouncing spheres to generate load:

Let's try bouncing some spheres around:

In GPURay, the intBitsToFloat() trick to eliminate negative values unfortunately generates a redundant MOV instruction:

(...) vec4 t0 = -b0 - sqrt( d0 ); t0 += intBitsToFloat(floatBitsToInt( t0 )>>31); float mv = min( min( t0.x, t0.y ), min( t0.z, t0.w ) ); (...)

Another way to eliminate them is by inversion. There'll be more generated code, but fewer effective operations:

(...) vec4 t0 = 1.0f / (-b0 - sqrt( d0 )); float mv = max( max( t0.x, t0.y ), max( t0.z, t0.w ) ); (...)

The min() calls and if() tests have to be flipped and the value inverted again outside the loop:

(...) int refl; for( refl = 0; refl < REFLECTIONS; refl++ ) { float rt_dist = 0.0f; (...) if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) { vec4 t0 = 1.0f / (-b0 - sqrt( d0 )); float mv = max( max( t0.x, t0.y ), max( t0.z, t0.w ) ); if( mv > rt_dist ) { rt_dist = mv; rt_t0 = t0.yzw; rt_i = i + 0; } } // ... // ditto for the other groups // ... (...) if( rt_dist > 0.33f ) break; } if( rt_i == -1 ) break; ivec3 iv = ivec3( equal( rt_t0, vec3( rt_dist ) ) ); rt_i += iv.x + iv.y + iv.y + iv.z + iv.z + iv.z; // sphere hit + reflection ray_pos += ray_dir * (1.0f / rt_dist);

This is slightly quicker and eliminates the remaining split second drop below 60 fps:

The goal of 128 spheres is finally reached! Unroll 32 first loop, unroll 16 second loop. Z sort spheres on the CPU and cut off when something close enough is found.

All the details are wrapped up in the new GPURay article.

120 spheres. Only a minor change to unrolling makes this possible with the latest code. The unrolls should match a multiple of 8, so unroll the sphere loop 24 times. If the shadow loop is unrolled 24 times, the extension to the any(greaterThan()) if-statement will make things horrendously slow, so de-unroll that one to 8.

Let's reintroduce an old classic: Preloading. The assembler code shows that loads from obj_pos[] are done once with a LDC.F32X4, but the first in each block is used right after load. Let's try preloading 4 at a time and do the ray_pos subs at the same time. This gives neatly stacked sets of 4 loads + 4 subs that are far enough away. Also, it runs quicker:

for( refl = 0; refl < REFLECTIONS; refl++ ) { float rt_dist = FARDIST; int rt_i = -1; vec4 rt_t0; vec4 v0, v1, v2, v3; vec4 v4, v5, v6, v7; // preload 0-3 v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w ); v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w ); v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w ); v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w ); // trace_ray() for( int i = 0; i < OBJNUM; i += 16 ) { vec4 b0, d0; // preload 4-7 v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w ); v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w ); v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w ); v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w ); // calc 0-3 b0 = vec4( dot( v0.xyz, ray_dir ), dot( v1.xyz, ray_dir ), dot( v2.xyz, ray_dir ), dot( v3.xyz, ray_dir ) ); d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w, b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w, b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w, b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w ); if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) { vec4 t0 = -b0 - sqrt( d0 ); t0 += intBitsToFloat(floatBitsToInt( t0 )>>31); vec2 mv = min( t0.xy, t0.zw ); mv.x = min( mv.x, mv.y ); if( mv.x < rt_dist ) { rt_dist = mv.x; rt_t0 = t0; rt_i = i + 0; } } // ... // ditto for preload 8-11, calc 4-7, preload 12-15, calc 8-11, preload 16-19, calc 12-15 // ...

The subs could be shifted around so they're not done in the last iteration, but the compiler is being difficult about this. Needs some more work.

The same can be done in the shadow calculation for a slight improvement there too:

vec4 v0, v1, v2, v3; vec4 v4, v5, v6, v7; // preload 0-3 v0 = vec4( ray_pos - obj_pos[0].xyz, obj_pos[0].w ); v1 = vec4( ray_pos - obj_pos[1].xyz, obj_pos[1].w ); v2 = vec4( ray_pos - obj_pos[2].xyz, obj_pos[2].w ); v3 = vec4( ray_pos - obj_pos[3].xyz, obj_pos[3].w ); // check shadow int i; for( i = 0; i < OBJNUM; i += 16 ) { vec4 d0, d1, d2, d3; vec4 b0; // preload 4-7 v4 = vec4( ray_pos - obj_pos[i+4].xyz, obj_pos[i+4].w ); v5 = vec4( ray_pos - obj_pos[i+5].xyz, obj_pos[i+5].w ); v6 = vec4( ray_pos - obj_pos[i+6].xyz, obj_pos[i+6].w ); v7 = vec4( ray_pos - obj_pos[i+7].xyz, obj_pos[i+7].w ); // calc 0-3 b0 = vec4( dot( v0.xyz, l ), dot( v1.xyz, l ), dot( v2.xyz, l ), dot( v3.xyz, l ) ); d0 = vec4( b0.x * b0.x - dot( v0.xyz, v0.xyz ) + v0.w, b0.y * b0.y - dot( v1.xyz, v1.xyz ) + v1.w, b0.z * b0.z - dot( v2.xyz, v2.xyz ) + v2.w, b0.w * b0.w - dot( v3.xyz, v3.xyz ) + v3.w ); // ... // ditto for preload 8-11, calc 4-7, preload 12-15, calc 8-11, preload 16-19, calc 12-15 // ...

Some more cycles can be save by simplifying the rt_hit calculation and removing the light_col parameter:

(...) ivec3 iv = ivec3( equal( rt_t0, vec3( rt_dist ) ) ); rt_i += iv.x + iv.y + iv.y + iv.z + iv.z + iv.z; (...) col += vec3( specular ) + obj_col[rt_i].xyz * diffuse; ray_dir = reflect( ray_dir, n ); (...)

I forgot to post measurements of the last steps, so let's wrap them up in one graph:

Still not enough for 128 spheres, but I'm trying some different unrolls for 120 at the moment. Stay tuned!

It's possible to get rid of another any/lessThan vec4 combination by calculating min first and doing a single test. This is about 5% faster. Not enough for 128 spheres, but a step on the way. Same in all 4 sphere sections, just add +4/+8/+12 to rt_i:

(...) if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) { vec4 t0 = -b0 - sqrt( d0 ); t0 += intBitsToFloat(floatBitsToInt( t0 )>>31); vec2 mv = min( t0.xy, t0.zw ); mv.x = min( mv.x, mv.y ); if( mv.x < rt_dist ) { rt_dist = mv.x; rt_t0 = t0; rt_i = i + 0; } } (...)

112 spheres this time. Let's go through the necessary changes. The CPU preprocessing is still under wraps, but there's changes in the GPU code that can be published.

A simple start: The shadow calculation is quicker if all the tests are gathered at the bottom. The generated code isn't pretty, but it's quicker than max/add for some reason. Never mind:

// check shadow int i; for( i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 d0, d1, d2, d3; vec4 b0; // 0..3 v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b0 = vec4( dot( v0, l ), dot( v1, l ), dot( v2, l ), dot( v3, l ) ); d0 = vec4( b0.x * b0.x - dot( v0, v0 ) + obj_pos[i+0].w, b0.y * b0.y - dot( v1, v1 ) + obj_pos[i+1].w, b0.z * b0.z - dot( v2, v2 ) + obj_pos[i+2].w, b0.w * b0.w - dot( v3, v3 ) + obj_pos[i+3].w ); // ... // ditto for 4-7, 8-11, 12-15 // ... if( any( greaterThan( d0, vec4( 0.0f ) ) ) || any( greaterThan( d1, vec4( 0.0f ) ) ) || any( greaterThan( d2, vec4( 0.0f ) ) ) || any( greaterThan( d3, vec4( 0.0f ) ) ) ) break; }

The main sphere loop has lots of room for improvement. The dv greaterThan() and if() construct is less than optimal. floatBitsToInt() can be used to convert negative numbers to (-)NaN so the negative t0 values get eliminated. Then just find the lowest value and save what's needed to find the id later instead of doing it in place:

// trace_ray() for( int i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 b0, d0; // 0-3 v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b0 = vec4( dot( v0, ray_dir ), dot( v1, ray_dir ), dot( v2, ray_dir ), dot( v3, ray_dir ) ); d0 = vec4( b0.x * b0.x - dot( v0, v0 ) + obj_pos[i+0].w, b0.y * b0.y - dot( v1, v1 ) + obj_pos[i+1].w, b0.z * b0.z - dot( v2, v2 ) + obj_pos[i+2].w, b0.w * b0.w - dot( v3, v3 ) + obj_pos[i+3].w ); if( any( greaterThan( d0, vec4( 0.0f ) ) ) ) { vec4 t0 = -b0 - sqrt( d0 ); t0 += intBitsToFloat(floatBitsToInt( t0 )>>31); if( any( lessThan( t0, vec4( rt_dist ) ) ) ) { vec2 mv = min( t0.xy, t0.zw ); mv.x = min( mv.x, mv.y ); rt_dist = mv.x; rt_t0 = t0; rt_i = i; } } // ... // ditto for 4-7, 8-11, 12-15 // ... }

After the sphere loop is done, find out what was hit:

if( rt_i == -1 ) break; ivec4 iv = mix( ivec4( 0 ), ivec4( rt_i, rt_i+1, rt_i+2, rt_i+3 ), equal( vec4( rt_dist ), rt_t0 ) ); iv.xy += iv.zw; int rt_hit = iv.x + iv.y;

Video with 96 spheres. Doing 16x16 block probing on the CPU before rendering it on the GPU:

In the never ending hunt for more spheres, it's time to turn to the CPU for answers. It's mostly idling anwyay. Let the CPU check the corners of each 16x16 block of the display and determine if the area is covered by only spheres, only shadow plane/sky or both. This makes the GPU code significantly quicker, since half the code can be eliminated for most of the screen:

Corner checking takes around 6ms CPU time, so let a separate thread calculate that for the next frame while the current one is rendering. Synchronize it all with a barrier. Simple. This calls for an increase to 96 spheres at 60 fps (barely):

Code needs some work before publication.

There was a slight mishap in the raytracer code which makes some shadows disappear. It's noticeable in the video if you pay close attention to the shadows. The measurments are a bit off too, but it's still above 60 fps after fixing it, so I'm not gonna bother updating them. The GLSL source code in the GPU Hacks Article has been updated.

GPURay 1.7, 80 spheres at 60 fps on NVidia Tegra X1:

Music: 1992 Amiga Oktalyzer module by Jan Wegger.

There's an interesting pragma that can be used in NVidia GLSL code: #pragma optionNV (unroll all). Trying it on the raytracer gives interesting results: A huge assembler file that has no REP loops at all, and slightly quicker execution time.

So the trick is to find the correct unroll level and do it manually. Earlier attempts were clearly too conservative. Automatic unrolling doesn't work if there's if() statements in the loop. After some trial and error it seems that 16 is a good unroll value for a target of 80 spheres. Since the unroll is greater, using vec4 in the second loops starts paying off too. Let's increase to 80 spheres and do a test run. Blue is old code, orange is unroll all and green is unroll 16 + vec4:

The loops end up looking like this:

// trace_ray() for( int i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 b03, d03; bvec4 dv; // 0-3 v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b03 = vec4( dot( v0, ray_dir ), dot( v1, ray_dir ), dot( v2, ray_dir ), dot( v3, ray_dir ) ); d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); dv = greaterThan( d03, vec4( 0.0f ) ); if( any( dv ) ) { vec4 t03 = -b03 - sqrt( d03 ); dv = dv && greaterThan( t03, vec4( 0.0f ) ); if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i+0; } if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; } if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; } if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; } } // ditto for 4-7,8-11,12-15... } // ... // check shadow int i; for( i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 b03, d03; v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b03 = vec4( dot( v0, l ), dot( v1, l ), dot( v2, l ), dot( v3, l ) ); d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); if( any( greaterThan( d03, vec4( 0.0f ) ) ) ) break; // ditto for 4-7,8-11,12-15... }

An attempt to maximize the Tegra X1 GPU power usage. Around 11 watts is the current highscore while still looking cool:

One issue with the solution below is the excessive use of if()-arguments. By wrapping things up in vectors we can get most of them out of the way. greaterThan() etc. in OpenGL compiles to the corresponding sgt etc.

for( int i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; vec4 b03 = vec4( dot( v0, ray_dir ), dot( v1, ray_dir ), dot( v2, ray_dir ), dot( v3, ray_dir ) ); vec4 d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); bvec4 dv = greaterThanEqual( d03, vec4( 0.0f ) ); if( any( dv ) ) { vec4 t03 = -b03 - sqrt( d03 ); dv = dv && greaterThanEqual( t03, vec4( 0.0f ) ); if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i; } if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; } if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; } if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; } } }

The if() set can be reduced with some mix() and min() and stuff. The generated code is neat, but it's not very quick yet. Have to investigate that further. Luckily, this is not like in the old days.

The cool part is that it can now do 72 spheres:

Video with 64 spheres:

While perusing more generated code and trying out other unroll structures, it's clear that the one from 28 October is not totally optimal for larger numbers of spheres. A different approach yielding better results seems to be turning down the lower unroll to 4 (not shown), and increasing the upper to 4 while moving the sqrts out of the way. So there's some double tests, but the compiler seems to figure it out. The whole point of that is to insert a continue if all 4 doesn't hit at all. The miss ratio is pretty high and we're going through everything anyway:

for( int i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; float b0 = dot( v0, ray_dir ); float b1 = dot( v1, ray_dir ); float b2 = dot( v2, ray_dir ); float b3 = dot( v3, ray_dir ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; if( d0 < 0.0f && d1 < 0.0f && d2 < 0.0f && d3 < 0.0f ) continue; float t0 = -b0 - sqrt( d0 ); float t1 = -b1 - sqrt( d1 ); float t2 = -b2 - sqrt( d2 ); float t3 = -b3 - sqrt( d3 ); if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) { rt_dist = t0; rt_hit = i; } if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) { rt_dist = t1; rt_hit = i+1; } if( d2 >= 0.0f && t2 > 0.0f && t2 < rt_dist ) { rt_dist = t2; rt_hit = i+2; } if( d3 >= 0.0f && t3 > 0.0f && t3 < rt_dist ) { rt_dist = t3; rt_hit = i+3; } }

The good news is that the number of spheres can be increased to 64 with plenty of cycles to spare. The bad news is that there's not enough cycles for next multiple of 4. Have to find a way to get around that.

In the never-ending hunt for more spheres in the GPU Raytracer, I noticed that the glGetProgramBinary() call outputs assembler code too, in NV_gpu_program5 format. That makes it reasonably easy to see where improvements can be made.

The conditional construct in the first loop seems to confuse the compiler, and blocking negative square roots is pointless when they're basically free. Wrap it all up in a simple if() that reduces nicely in the generated assembler code. Still need to unroll 2 by hand, though, since the compiler still cannot unroll automatically when there's if() expressions in the loop:

for( int i = 0; i < OBJNUM; i += 2 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; float b0 = dot( v0, ray_dir ); float b1 = dot( v1, ray_dir ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float t0 = -b0 - sqrt( d0 ); float t1 = -b1 - sqrt( d1 ); if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) { rt_dist = t0; rt_hit = i; } if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) { rt_dist = t1; rt_hit = i+1; } }

A balance has to be made between unrolling loop 1 and 2. Seems like unrolling loop 1 twice, as above, and loop 2 10 (!) times is the best combination:

for( i = 0; i < OBJNUM; i += 10 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; vec3 v4 = ray_pos - obj_pos[i+4].xyz; vec3 v5 = ray_pos - obj_pos[i+5].xyz; vec3 v6 = ray_pos - obj_pos[i+6].xyz; vec3 v7 = ray_pos - obj_pos[i+7].xyz; vec3 v8 = ray_pos - obj_pos[i+8].xyz; vec3 v9 = ray_pos - obj_pos[i+9].xyz; float b0 = dot( v0, l ); float b1 = dot( v1, l ); float b2 = dot( v2, l ); float b3 = dot( v3, l ); float b4 = dot( v4, l ); float b5 = dot( v5, l ); float b6 = dot( v6, l ); float b7 = dot( v7, l ); float b8 = dot( v8, l ); float b9 = dot( v9, l ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; float d4 = b4 * b4 - dot( v4, v4 ) + obj_pos[i+4].w; float d5 = b5 * b5 - dot( v5, v5 ) + obj_pos[i+5].w; float d6 = b6 * b6 - dot( v6, v6 ) + obj_pos[i+6].w; float d7 = b7 * b7 - dot( v7, v7 ) + obj_pos[i+7].w; float d8 = b8 * b8 - dot( v8, v8 ) + obj_pos[i+8].w; float d9 = b9 * b9 - dot( v9, v9 ) + obj_pos[i+9].w; if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f || d4 > 0.0f || d5 > 0.0f || d6 > 0.0f || d7 > 0.0f || d8 > 0.0f || d9 > 0.0f ) break; }

Unlike gcc, the NVidia compiler seems to be able to look ahead far enough to convert this into a coherent set of dp3/mad/or/trunc instructions. It's now considered wise to keep number of spheres a multiple of 10. Looks like it can finally do 60 spheres at 60 fps:

New video:

I missed a trivial optimization in the shadow calculation in the GPU Raytracer. There's no point in calculating the distance, it's enough to know whether any sphere is hit or not. That simplifies the loop enough so it can be unrolled further. A limitation is that the sphere count has to be a multiple of 4, not 2 as earlier:

(...) int i; for( i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; float b0 = dot( v0, l ); float b1 = dot( v1, l ); float b2 = dot( v2, l ); float b3 = dot( v3, l ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f ) break; } (...)

Let's compare the old and the new version, now with 56 spheres:

Obviously, the same is true for the input data in the fractal and quaternion code too. The results are less dramatic, but the quaternion routine is now stable above 60 fps. Updated the article and made new measurements.

The NVidia article How About Constant Buffers gives some interesting information about how data is stored and accessed on the GPU. The scene data in the GPU Raytracer is constant and the usage pattern fairly regular, so storing it in a constant buffer sounds like a good idea. Constant buffers translate to uniform buffers in OpenGL, so it's a matter of just replacing "shader storage" with "uniform" in the right places.

The results are spectacular. Old version (SSBO) versus new version (UBO), 32 spheres:

Video below. Can now have 52 spheres and keep fps above 60 all the time:

Made videos of MultiDoom and MultiQuake running on the Mellanox TILE-Gx36 multicore CPU:

It's been a slow summer, so I finally had time to update the SRTP AES Optimization article from 2010. Should now support all CPU types and packet sizes larger than 4096. It's available here: SRTP AES Optimization Revisited

Source files for the generic raytracer:

Just stuff the files in a directory and compile and link as specified below. Option -t0 (the default) will run it directly without setting up any threads.

A quick port of the GPU Raytracer to more generic C code. Also includes optional Pthreads support.

[user@phyreztorm]$ gcc -O3 -ffast-math -Wall -Wextra -std=gnu99 -o raytracer -lm -lpthread raytracer.c bmp.c [user@phyreztorm]$ ./raytracer -t 16 -o 7680x4320 Threads: 16 Thread 0: 270 lines ( 0- 269) Thread 2: 270 lines ( 540- 809) Thread 3: 270 lines ( 810-1079) (...etc...) Time: 892.050049 ms Saving bitmap out.bmp

Just dividing into chunks of lines isn't a very good idea. Should use blocks like in the TILE-Gx raytracer, but that'd require more work. Test run using assorted thread counts on Intel Xeon E5-2699v3:

Source code for GPU fractals, raytracer and quaternion raytraced julia sets now available here.

A friend scanned all issues of the old Norwegian Amiga magazine Digital 1989-1992. They're available here.

Back to 32 spheres in raytracer after removing some deadwood.

But that's boring, raytraced quaternion julia sets are cooler:

Still pretty rough, but it runs. Based on code by Keenan Crane.

Sacrificed some spheres to get 60 fps with plane shadows:

April is GPU appreciation month! Some attempts at making old cr... err, old stuff run on an NVidia Tegra X1.

36 objects, 3 reflections, non-reflective shadows on center sphere. 126 lines shader code, could be less if nvcc would stop screwing up unrolling.

Based on a GPU implementation by John Tsiombikas. I spent some time optimizing it and used a HSV'ish palette instead. 256 iterations. 45 fps if all pixels are at max iterations, aka. black screen. So 60 fps for all normal cases. Need bigger GPU.

Based on a CPU implementation by Lode Vandevenne. Relatively easy to port to a GLSL shader.

Using a HSV-based palette when making effects is useful, since it wraps without any edges. But the color has to be converted back to RGB before displaying, and this can be complicated. Several methods of assorted quality can be found on the net. This one seems popular, and this one is also interesting. Donald Reynold's implementation uses cosines for everything, but then the hexagonal shape is gone. So let's do a twist: Use a cosine for the basic H slope, clamp() for the flat top/bottom and mix() or similar for S and V:

// Input: HSV 0.0f..1.0f vec3 HSV2RGB( vec3 hsv ) { float h = hsv.x * 2.0f*PI; // h 0..2*PI float d120 = 2.0f*PI/3.0f; vec3 hv = vec3( h, h-d120, h+d120 ); vec3 rgb = clamp( 0.5f + cos( hv ), 0.0f, 1.0f ); // flatten h before s,v return ((1.0f-hsv.y) + rgb*hsv.y) * hsv.z; // mix( 1.0f,rgb,hsv.y) => ((rgb-1.0f)*hsv.y+1.0f)*hsv.z }

A straight H2RGB where S=V=1.0f can be made even simpler:

vec3 H2RGB( float h ) { h *= 2.0f*PI; // h 0..2*PI float d120 = 2.0f*PI/3.0f; vec3 hv = vec3( h, h-d120, h+d120 ); return 0.5f + cos( hv ); // add clamp() if needed }