## GPU Hacks: Fractals, a Raytracer and Raytraced Quaternion Julia Sets

Nils L. Corneliusen
10 June 2016

## Introduction

The NVidia Tegra X1 has a Maxwell-based GPU with a theoretical FP32 peak of 512 GFLOPS per second. It can easily be programmed using OpenGL GLSL shaders. However, making fast GPU code is different from making fast CPU code. 3 non-typical GPU jobs are implemented in fragment shaders and optimized for better performance. Videos and source code is included.

## GPU Fractals

Note: The theory of this section is still good, but has been revised numerous times since the publication of this article. Refer to the Pipeline in 2017 and 2018 for more info.

As a start, look at this code by John Tsiombikas:

```// Julia GLSL Fragment Shader
uniform sampler1D tex;
uniform vec2 c;
uniform int iter;

void main() {
vec2 z;
z.x = 3.0 * (gl_TexCoord[0].x - 0.5);
z.y = 2.0 * (gl_TexCoord[0].y - 0.5);

int i;
for(i=0; i<iter; i++) {
float x = (z.x * z.x - z.y * z.y) + c.x;
float y = (z.y * z.x + z.x * z.y) + c.y;

if((x * x + y * y) > 4.0) break;
z.x = x;
z.y = y;
}

gl_FragColor = texture1D(tex, (i == iter ? 0.0 : float(i)) / 100.0);
}

uniform sampler1D tex;
uniform vec2 center;
uniform float scale;
uniform int iter;

void main() {
vec2 z, c;

c.x = 1.3333 * (gl_TexCoord[0].x - 0.5) * scale - center.x;
c.y = (gl_TexCoord[0].y - 0.5) * scale - center.y;

int i;
z = c;
for(i=0; i<iter; i++) {
float x = (z.x * z.x - z.y * z.y) + c.x;
float y = (z.y * z.x + z.x * z.y) + c.y;

if((x * x + y * y) > 4.0) break;
z.x = x;
z.y = y;
}

gl_FragColor = texture1D(tex, (i == iter ? 0.0 : float(i)) / 100.0);
}
```

The code for both is essentially the same; two conditionals is enough to separate them. Iterations as input is not a good idea if the main loop can be unrolled. Currently it can't, but we'll get to that later. Let's start by defining an interface:

```#version 310 es
precision highp float;
layout(location = 0) in vec2 v_texCoord;
layout(location = 0) out vec4 outColor;

#define MAXITER 256
#define DEPTH 32

layout(binding = 0) uniform stuff
{
float sx, sy, sz;
float re, im;
float mandel;
};
```

Now both Julia and Mandelbrot sets can be represented. Use mandel=0.0f for Julia sets. Interesting Julia values can be found on Wikipedia.

For coloring, let's try a variation of the HSV method that has black at the end. I just use a table:

```// Hue table with black at end instead of wrapping
const vec3 cols[7] = vec3[7](
vec3( 1.0f, 0.0f, 0.0f ),
vec3( 1.0f, 1.0f, 0.0f ),
vec3( 0.0f, 1.0f, 0.0f ),
vec3( 0.0f, 1.0f, 1.0f ),
vec3( 0.0f, 0.0f, 1.0f ),
vec3( 1.0f, 0.0f, 1.0f ),
vec3( 0.0f, 0.0f, 0.0f )
);

vec3 getRGB( float h )
{
float Slice     = 6.0f * h;
float SliceInt  = floor( Slice );
float SliceFrac = Slice - SliceInt;

int i = int( SliceInt );
return mix( cols[i], cols[i+1], SliceFrac );
}
```

In main(), set up the initial values:

```    vec4  xv0 = vec4( 0.0f, 0.0f, 0.0f, 1.7778f * (v_texCoord.x - 0.5f) / (0.5f * sz) + sx );
vec4  yv0 = vec4( 0.0f, 0.0f, 0.0f,           (v_texCoord.y - 0.5f) / (0.5f * sz) + sy );
bvec4 bv[DEPTH/4];

float fre = mandel != 0.0f ? xv0.w : re;
float fim = mandel != 0.0f ? yv0.w : im;
```

And, finally, it's time to roll things out. Just duplicating the calculations + if() won't work. Let's try something different. Instead of optimizing for early cutoff, let's optimize for the worst case, ie. max iterations is expected. As long as it's possible to quickly identify the first value above 4.0 later, it doesn't matter if the code overcalculates several values. The trick is to figure out the optimal depth and make the compiler stuff the comparisons between the calculations to avoid stalls.

As it turns out, 32 seems to be the optimal depth. And packing the calculations into a vec4 and comparisons into a bvec4 array seems to work ok. What doesn't seem to work is using greaterThan(), so skip that:

```    int i;
for( i = 0; i < MAXITER; i += DEPTH ) {

for( int j = 0; j < DEPTH/4; j++ ) {

xv0.x = xv0.w * xv0.w - yv0.w * yv0.w + fre; yv0.x = 2.0f * xv0.w * yv0.w + fim;
xv0.y = xv0.x * xv0.x - yv0.x * yv0.x + fre; yv0.y = 2.0f * xv0.x * yv0.x + fim;
xv0.z = xv0.y * xv0.y - yv0.y * yv0.y + fre; yv0.z = 2.0f * xv0.y * yv0.y + fim;
xv0.w = xv0.z * xv0.z - yv0.z * yv0.z + fre; yv0.w = 2.0f * xv0.z * yv0.z + fim;

bv[j].x = xv0.x * xv0.x + yv0.x * yv0.x > 4.0f;
bv[j].y = xv0.y * xv0.y + yv0.y * yv0.y > 4.0f;
bv[j].z = xv0.z * xv0.z + yv0.z * yv0.z > 4.0f;
bv[j].w = xv0.w * xv0.w + yv0.w * yv0.w > 4.0f;

}

if( any( bv[0] ) || any( bv[1] ) || any( bv[2] ) || any( bv[3] ) ||
any( bv[4] ) || any( bv[5] ) || any( bv[6] ) || any( bv[7] ) ) break;

}
```

Then it's a matter of figuring out which value is the first to be greater than 4.0. Let's bundle up the comparison results as bits in a 32-bit uint and use findLSB() to find the lowest bit in the current block of 32 results:

```    uint bits = 0u;
for( int j = 0; j < DEPTH/4; j++ )
bits |= (uint(bv[j].x)<<(j*4)) | (uint(bv[j].y)<<(j*4+1)) | (uint(bv[j].z)<<(j*4+2)) | (uint(bv[j].w)<<(j*4+3));

int lsb = findLSB( bits );
i += lsb == -1 ? 0 : lsb;

outColor = vec4( getRGB( float(i) / float(MAXITER) ), 1.0f );
```

Source files:

If there's 256 iterations and all pixels are black, the result should be about 45 fps in 1920x1080 resolution. That should be very close to what's theoretically possible with this method.

FPS measurement, rendering to PBuffer (no vsync), same path as in the video:

## GPU Raytracer

Note: This section is a bit outdated. Check out the separate GPURay article for the latest version which can do 128 spheres with a little help from the CPU.

80 spheres at 60 fps in version 1.7: (Yes, shadows are bugged)

Music: 1992 Amiga Oktalyzer module by Jan Wegger.

I used the TILE-Gx Raytracer as a basis. It's based on code written by Shodruky Rhyammer.

Let's start off by defining an interface:

```#version 310 es
precision highp float;
layout(location = 0) in vec2 v_texCoord;
layout(location = 0) out vec4 outColor;

#define OBJNUM (5*16)
#define REFLECTIONS 3
#define FARDIST 256.0f

layout(binding = 0) uniform stuff
{
vec4 obj_pos[OBJNUM];
vec4 obj_col[OBJNUM];
vec4 plane_norm;
vec4 plane_col;
vec4 eye;
vec4 light_pos;
vec4 light_col;
vec4 startpos;
float ax;
float ay;
float fx_offset;
};
```

obj_pos[].w is the sphere's radius squared. startpos, ax, ay should be calculated as in Mr Rhyammer's code.

Initialize the positions:

```    vec3 ray_dir = normalize( vec3( startpos.x + ax * v_texCoord.x,
startpos.y + ay * v_texCoord.y,
startpos.z ) );

vec3 col = vec3( 0.0f );

vec3 ray_pos = eye.xyz;
```

The wobble effect in some of the videos is made by adjusting fx_offset from 0.0 to 0.7 slowly. Just make sure obj_pos[1] moves a bit:

```    // wobble
if( fx_offset != 0.0f ) {
ray_pos.x += sin( v_texCoord.x*4.0f + obj_pos[1].x ) * fx_offset;
ray_pos.y += sin( v_texCoord.y*2.0f                ) * fx_offset;
ray_pos.z += sin( v_texCoord.x*4.0f + obj_pos[1].z ) * fx_offset;
}
```

Next is the main reflection loop and the crucial first trace_ray() call, which is inlined here. trace_ray() will stall badly since every calculation depend on the previous. And the compiler doesn't unroll or interleave since there's an if() at the end, so it has to be done manually.

```    for( int refl = 0; refl < REFLECTIONS; refl++ ) {

float rt_dist = FARDIST;
int rt_hit = -1;

// trace_ray()
float rt_dist = FARDIST;
int rt_hit = -1;

// trace_ray()
for( int i = 0; i < OBJNUM; i += 16 ) {
vec3 v0, v1, v2, v3;
vec4 b03, d03;
bvec4 dv;

// 0-3
v0 = ray_pos - obj_pos[i+0].xyz;
v1 = ray_pos - obj_pos[i+1].xyz;
v2 = ray_pos - obj_pos[i+2].xyz;
v3 = ray_pos - obj_pos[i+3].xyz;

b03 = vec4( dot( v0, ray_dir ),
dot( v1, ray_dir ),
dot( v2, ray_dir ),
dot( v3, ray_dir ) );

d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w,
b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w,
b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w,
b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w );

dv = greaterThan( d03, vec4( 0.0f ) );

if( any( dv ) ) {

vec4 t03 = -b03 - sqrt( d03 );

dv = dv && greaterThan( t03, vec4( 0.0f ) );

if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i+0; }
if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; }
if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; }
if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; }

}

// ...
// ditto for 4-7, 8-11, 12-15
// ...

}
```

The plane is not reflected and only has non-reflective shadows. So let's exit if we didn't hit anything and it's not the first reflection:

```        if( rt_hit == -1 && refl > 0 ) break;
```

That means checking for plane intersection only happens on the first iteration when all spheres are missed. If the plane is missed, generate a pulsing skyline instead. obj_pos[1].z should of course vary a bit, or it gets boring quick.

```        // plane with shadows
if( rt_hit == -1 && refl == 0 ) {
rt_dist = -(dot( ray_pos, plane_norm.xyz ) + plane_norm.w) / dot( ray_dir, plane_norm.xyz );

if( rt_dist <= 0.0f ) {
// plane miss, background
vec2 dxy = v_texCoord - 0.5f;
float dv = 1.0f - sqrt( dot( dxy, dxy ) + (sin(obj_pos[1].z)+1.0f)/2.0f );
col = plane_col.yzx * dv;
break;
}
```

The plane is hit, so determine if there's a shadow. Trace a ray back to the light pos and see if anything is hit. Since only hit/no hit is interesting, it can be simplified compared to the original trace_ray().

```            //  plane hit
ray_pos += ray_dir * rt_dist;
vec3 l   = normalize( light_pos.xyz - ray_pos );

int i;
for( i = 0; i < OBJNUM; i += 16 ) {
vec3 v0, v1, v2, v3;
vec4 b03, d03;

v0 = ray_pos - obj_pos[i+0].xyz;
v1 = ray_pos - obj_pos[i+1].xyz;
v2 = ray_pos - obj_pos[i+2].xyz;
v3 = ray_pos - obj_pos[i+3].xyz;

b03 = vec4( dot( v0, l ),
dot( v1, l ),
dot( v2, l ),
dot( v3, l ) );

d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w,
b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w,
b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w,
b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w );

if( any( greaterThan( d03, vec4( 0.0f ) ) ) ) break;

}

// ...
// ditto for 4-7, 8-11, 12-15
// ...

float diffuse = i == OBJNUM ? dot( plane_norm.xyz, l ) * 1.2f : 0.0f;
col = plane_col.xyz * diffuse;
break;
}
```

And finally, normal sphere hits and reflection are handled in the mainline:

```        // sphere hit + reflection
ray_pos += ray_dir * rt_dist;
vec3 n   = normalize( ray_pos - obj_pos[rt_hit].xyz );
vec3 l   = normalize( light_pos.xyz - ray_pos );

float diffuse  = max( dot( n, l ), 0.0f );
float specular = pow( max( dot( ray_dir, l - n * 2.0f * diffuse ), 0.0f ), 8.0f ); // ! always pos diffuse

col += light_col.xyz * specular + obj_col[rt_hit].xyz * diffuse;

ray_dir = reflect( ray_dir, n );

}

outColor = vec4( col, 1.0f );

}
```

Source files:

It's grown a bit in size due to the manual unrolling, so now it clocks in at about 300 lines. It can easily be reduced to around 100 lines.

FPS measurement, rendering to PBuffer (no vsync). 80 spheres, 3 reflections, shadows on plane:

What about increasing resolution to 7680x4320? Not very quick (around 4 fps) but it works:

A collection of videos:

GPURay 1.7: 80 spheres + shadows
GPURay 1.6: 72 spheres + shadows
GPURay 1.5: 64 spheres + shadows
GPURay 1.4: 60 spheres + shadows
GPURay 1.3: 52 spheres + shadows
GPURay 1.2: 30 spheres + shadows
GPURay 1.1: 36 spheres
GPURay 1.0: 31 spheres

## GPU Raytraced Quaternion Julia Sets

Note: This section is still technically correct, but some spare cycles were found and detail was increased in 2017-2018. Also proper rotation added. Please refer to the quaternion code in the Revised 2017 Xmas demo.

22 June 2017: Plane projection:

6 June 2017: Slightly revised color scheme in new video:

Original video:

The natural extension of a fractal routine and a raytracer. Naah, I just came across it randomly on the net and it looked cool.

Keenan Crane has written a paper about the theory behind it, and he's also made a GPU implementation. This was way back in 2004. Is it possible to get it to run quickly on the Tegra X1? Let's give it a shot.

For simplicity, I recycled some of the raytracer stuff, so the interface is similar:

```#version 310 es
precision highp float;
layout(location = 0) in vec2 v_texCoord;
layout(location = 0) out vec4 outColor;

layout(binding = 0) uniform stuff
{
vec4 eye_pos;
vec4 light_pos;
vec4 startpos;
vec4 mu_pos;
float epsilon;
float ax;
float ay;
};

#define ITERATIONS          6
#define ESCAPE_THRESHOLD   10.0f
#define DEL                 0.0001f
```

Iterations is moved to a define to make unrolling more viable later. startpos, ax, ay should be calculated as in the raytracer. An interesting position would be mu_pos=(-0.125, -0.256, 0.847, 0.0895), epsilon=0.02 where the left half resembles Doh from the Arkanoid breakout game. More positions can be found on Paul Bourke's page.

Let's go through the routines from top to bottom. quatMult() and quatSq() are ok, except there's performance issues with using the GLSL functions dot()/cross() there. I have no idea why, so let's skip using them in these two crucial functions. Paul Bourke has a good explanation of the mathematics involved.

```// Q1 Q2 = [ x1 x2 - y1 y2 - z1 z2 - w1 w2 ] + | q1.x*q2.x     - dot( q1.yzw, q2.yzw )
//         [ x1 y2 + y1 x2 + z1 w2 - w1 z2 ] + | q1.x * q2.yzw + q2x * q1.yzw + cross( q1.yzw, q2.yzw )
//         [ x1 z2 + z1 x2 + w1 y2 - y1 w2 ] + | ...             ...            ...
//         [ x1 w2 + w1 x2 + y1 z2 - z1 y2 ] + | ...             ...            ...

vec4 quatMult( vec4 q1, vec4 q2 )
{
vec4 r;
r.x   = q1.x   * q2.x   - q1.y   * q2.y   - q1.z   * q2.z   - q1.w   * q2.w;
r.yzw = q1.xxx * q2.yzw + q1.yzw * q2.xxx + q1.zwy * q2.wyz - q1.wyz * q2.zwy;
return r;
}

vec4 quatSq( vec4 q )
{
vec4 r;
r.x   = q.x * q.x - (q.y * q.y + q.z * q.z + q.w * q.w );
r.yzw = 2.0f * q.x * q.yzw;
return r;
}

```

inout vars seems to cost a bit, so the ones not needed are eliminated. Some of the loops have constant inputs for first iteration, so they can be trivially reduced. (Not by the compiler, of course.) And again manual interleave-unrolling must be done by hand. This is becoming routine:

```vec4 iterateIntersect( inout vec4 q1, vec4 c )
{
bool r0,  r1;
vec4 q0;
vec4 qp0, qp1;

// reduce for qp1=vec4( 1.0f, 0.0f, 0.0f, 0.0f )
qp0 = 2.0f *           q1;        q0 = quatSq( q1 ) + c;
qp1 = 2.0f * quatMult( q0, qp0 ); q1 = quatSq( q0 ) + c;
r0  = dot( q0, q0 ) > ESCAPE_THRESHOLD;
r1  = dot( q1, q1 ) > ESCAPE_THRESHOLD;
if( r0 || r1 ) {
q1   = r0 ? q0  : q1;
return r0 ? qp0 : qp1;
}

// unroll 2 to reduce stalls
for( int i = 0; i < ITERATIONS-2; i += 2 ) {
qp0 = 2.0f * quatMult( q1, qp1 ); q0 = quatSq( q1 ) + c;
qp1 = 2.0f * quatMult( q0, qp0 ); q1 = quatSq( q0 ) + c;
r0  = dot( q0, q0 ) > ESCAPE_THRESHOLD;
r1  = dot( q1, q1 ) > ESCAPE_THRESHOLD;
if( r0 || r1 ) break;
}

q1   = r0 ? q0  : q1;
return r0 ? qp0 : qp1;
}

float intersectQJulia( inout vec3 rO, vec3 rD, vec4 c )
{
float dist;

while( true ) {
vec4 z  = vec4( rO, 0.0f );
vec4 zp = iterateIntersect( z, c );

float normZ = length( z );
dist = 0.5f * normZ * log( normZ ) / length( zp );

rO += rD * dist;

if( dist < epsilon || dot( rO, rO ) > BOUNDING_RADIUS_2 ) break;
}

return dist;
}

vec3 normEstimate( vec3 qP, vec4 c )
{
vec4 gx1 = vec4( qP, 0.0f ); gx1.x -= DEL;
vec4 gx2 = vec4( qP, 0.0f ); gx2.x += DEL;
vec4 gy1 = vec4( qP, 0.0f ); gy1.y -= DEL;
vec4 gy2 = vec4( qP, 0.0f ); gy2.y += DEL;
vec4 gz1 = vec4( qP, 0.0f ); gz1.z -= DEL;
vec4 gz2 = vec4( qP, 0.0f ); gz2.z += DEL;

// reduce iteration 0 where z=0
float dyz = dot( qP.yz, qP.yz );
float qx  = qP.x * qP.x;
float qx2 = 2.0f * qP.x;

gx1 = vec4( gx1.x * gx1.x - dyz,                   2.0f * gx1.x * qP.yz,  0.0f ) + c;
gx2 = vec4( gx2.x * gx2.x - dyz,                   2.0f * gx2.x * qP.yz,  0.0f ) + c;
gy1 = vec4( qx            - dot( gy1.yz, gy1.yz ), qx2          * gy1.yz, 0.0f ) + c;
gy2 = vec4( qx            - dot( gy2.yz, gy2.yz ), qx2          * gy2.yz, 0.0f ) + c;
gz1 = vec4( qx            - dot( gz1.yz, gz1.yz ), qx2          * gz1.yz, 0.0f ) + c;
gz2 = vec4( qx            - dot( gz2.yz, gz2.yz ), qx2          * gz2.yz, 0.0f ) + c;

for( int i = 0; i < ITERATIONS-1; i++ ) {
gx1 = quatSq( gx1 ) + c;
gx2 = quatSq( gx2 ) + c;
gy1 = quatSq( gy1 ) + c;
gy2 = quatSq( gy2 ) + c;
gz1 = quatSq( gz1 ) + c;
gz2 = quatSq( gz2 ) + c;
}

return normalize( vec3( length( gx2 ) - length( gx1 ),
length( gy2 ) - length( gy1 ),
length( gz2 ) - length( gz1 ) ) );
}
```

I like flashy stuff, so the base color is changed to red, the normal vector is weighted less and the specular calculation is made more aggressive:

```vec3 Phong( vec3 light, vec3 eye, vec3 pt, vec3 N )
{
vec3 obj_col = vec3( 1.0f, 0.0f, 0.0f );

vec3  L  = normalize( light - pt );
vec3  E  = normalize( eye   - pt );
float NL = dot( N, L );
vec3  R  = L - 2.0f * NL * N;

float diffuse = max( NL, 0.0f );

obj_col += abs( N ) * 0.2f;

return obj_col * diffuse + pow( max( dot( E, R ), 0.0f ), 8.0f );
}
```

intersectSphere() always hits on the front side, so it can be reduced too:

```vec3 intersectSphere( vec3 rO, vec3 rD )
{
float b0 = dot( rO, rD );
float d0 = b0 * b0 - dot( rO, rO ) + BOUNDING_RADIUS_2;
float t0 = -b0 - sqrt( d0 );

rO += t0 * rD;

return rO;
}
```

The main loop uses the same position calculations as the raytracer. Shadows are removed; they were expensive.

```void main()
{
vec3 rD = normalize( vec3( (startpos.x + ax * v_texCoord.x),
(startpos.y + ay * v_texCoord.y),
(startpos.z                    ) ) );

vec3 rO = eye_pos.xyz;

vec3 light = light_pos.xyz;

rO = intersectSphere( rO, rD );

float dist = intersectQJulia( rO, rD, mu_pos );

if( dist >= epsilon ) return;

vec3 N = normEstimate( rO, mu_pos );

outColor.rgb = Phong( light, rD, rO, N );
outColor.a = 1.0f;
}
```

Source files:

Obviously, this code is heavier. But there's just one object, and the lowest result seems to be about 62 fps when morphing through some of the points from Paul Bourke's article.

FPS measurement, rendering to PBuffer (no vsync), same path as in the video:

Comments are always appreciated. I prefer being contacted on LinkedIn. Email is also available, you can figure it out from the front page. I switched to a disposable email address system, so it will change regularly.

Remember to appreciate this classic Abstruse Goose strip.