Pipeline, 2016 Archive

Random stuff I'm working on, in chronological order.

2018 -- 2017 -- 2016

22 December 2016

Video with 96 spheres. Doing 16x16 block probing on the CPU before rendering it on the GPU:

21 December 2016

In the never ending hunt for more spheres, it's time to turn to the CPU for answers. It's mostly idling anwyay. Let the CPU check the corners of each 16x16 block of the display and determine if the area is covered by only spheres, only shadow plane/sky or both. This makes the GPU code significantly quicker, since half the code can be eliminated for most of the screen:

GPURay: 96 spheres probe grid

Corner checking takes around 6ms CPU time, so let a separate thread calculate that for the next frame while the current one is rendering. Synchronize it all with a barrier. Simple. This calls for an increase to 96 spheres at 60 fps (barely):

GPURay: 96 spheres

Code needs some work before publication.

14 December 2016

There was a slight mishap in the raytracer code which makes some shadows disappear. It's noticeable in the video if you pay close attention to the shadows. The measurments are a bit off too, but it's still above 60 fps after fixing it, so I'm not gonna bother updating them. The GLSL source code in the GPU Hacks Article has been updated.

14 December 2016

GPURay 1.7, 80 spheres at 60 fps on NVidia Tegra X1:

Music: 1992 Amiga Oktalyzer module by Jan Wegger.

There's an interesting pragma that can be used in NVidia GLSL code: #pragma optionNV (unroll all). Trying it on the raytracer gives interesting results: A huge assembler file that has no REP loops at all, and slightly quicker execution time.

So the trick is to find the correct unroll level and do it manually. Earlier attempts were clearly too conservative. Automatic unrolling doesn't work if there's if() statements in the loop. After some trial and error it seems that 16 is a good unroll value for a target of 80 spheres. Since the unroll is greater, using vec4 in the second loops starts paying off too. Let's increase to 80 spheres and do a test run. Blue is old code, orange is unroll all and green is unroll 16 + vec4:

GPURay: 80 spheres

The loops end up looking like this:

        // trace_ray()
        for( int i = 0; i < OBJNUM; i += 16 ) {
            vec3 v0, v1, v2, v3;
            vec4 b03, d03;
            bvec4 dv;

            // 0-3
            v0 = ray_pos - obj_pos[i+0].xyz;
            v1 = ray_pos - obj_pos[i+1].xyz;
            v2 = ray_pos - obj_pos[i+2].xyz;
            v3 = ray_pos - obj_pos[i+3].xyz;

            b03 = vec4( dot( v0, ray_dir ),
                        dot( v1, ray_dir ),
                        dot( v2, ray_dir ),
                        dot( v3, ray_dir ) );

            d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w,
                        b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w,
                        b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w,
                        b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w );

            dv = greaterThan( d03, vec4( 0.0f ) );

            if( any( dv ) ) {

                vec4 t03 = -b03 - sqrt( d03 );

                dv = dv && greaterThan( t03, vec4( 0.0f ) );

                if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i+0; }
                if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; }
                if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; }
                if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; }

            }

            // ditto for 4-7,8-11,12-15...
        }

        // ...

        // check shadow
        int i;
        for( i = 0; i < OBJNUM; i += 16 ) {
            vec3 v0, v1, v2, v3;
            vec4 b03, d03;

            v0 = ray_pos - obj_pos[i+0].xyz;
            v1 = ray_pos - obj_pos[i+1].xyz;
            v2 = ray_pos - obj_pos[i+2].xyz;
            v3 = ray_pos - obj_pos[i+3].xyz;

            b03 = vec4( dot( v0, l ),
                        dot( v1, l ),
                        dot( v2, l ),
                        dot( v3, l ) );

            d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w,
                        b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w,
                        b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w,
                        b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w );

            if( any( greaterThan( d03, vec4( 0.0f ) ) ) ) break;

            // ditto for 4-7,8-11,12-15...

        }

12 December 2016

An attempt to maximize the Tegra X1 GPU power usage. Around 11 watts is the current highscore while still looking cool:

4 November 2016

One issue with the solution below is the excessive use of if()-arguments. By wrapping things up in vectors we can get most of them out of the way. greaterThan() etc. in OpenGL compiles to the corresponding sgt etc.

        for( int i = 0; i < OBJNUM; i += 4 ) {

            vec3  v0 = ray_pos - obj_pos[i+0].xyz;
            vec3  v1 = ray_pos - obj_pos[i+1].xyz;
            vec3  v2 = ray_pos - obj_pos[i+2].xyz;
            vec3  v3 = ray_pos - obj_pos[i+3].xyz;

            vec4 b03 = vec4( dot( v0, ray_dir ),
                             dot( v1, ray_dir ),
                             dot( v2, ray_dir ),
                             dot( v3, ray_dir ) );

            vec4 d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w,
                             b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w,
                             b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w,
                             b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w );

            bvec4 dv = greaterThanEqual( d03, vec4( 0.0f ) );

            if( any( dv ) ) {

                vec4 t03 = -b03 - sqrt( d03 );

                dv = dv && greaterThanEqual( t03, vec4( 0.0f ) );

                if( dv.x && t03.x < rt_dist ) {
                    rt_dist = t03.x;
                    rt_hit  = i;
                }
                if( dv.y && t03.y < rt_dist ) {
                    rt_dist = t03.y;
                    rt_hit  = i+1;
                }
                if( dv.z && t03.z < rt_dist ) {
                    rt_dist = t03.z;
                    rt_hit  = i+2;
                }
                if( dv.w && t03.w < rt_dist ) {
                    rt_dist = t03.w;
                    rt_hit  = i+3;
                }

            }

        }

The if() set can be reduced with some mix() and min() and stuff. The generated code is neat, but it's not very quick yet. Have to investigate that further. Luckily, this is not like in the old days.

The cool part is that it can now do 72 spheres:

2 November 2016

Video with 64 spheres:

1 November 2016

While perusing more generated code and trying out other unroll structures, it's clear that the one from 28 October is not totally optimal for larger numbers of spheres. A different approach yielding better results seems to be turning down the lower unroll to 4 (not shown), and increasing the upper to 4 while moving the sqrts out of the way. So there's some double tests, but the compiler seems to figure it out. The whole point of that is to insert a continue if all 4 doesn't hit at all. The miss ratio is pretty high and we're going through everything anyway:

        for( int i = 0; i < OBJNUM; i += 4 ) {

            vec3  v0 = ray_pos - obj_pos[i+0].xyz;
            vec3  v1 = ray_pos - obj_pos[i+1].xyz;
            vec3  v2 = ray_pos - obj_pos[i+2].xyz;
            vec3  v3 = ray_pos - obj_pos[i+3].xyz;
            float b0 = dot( v0, ray_dir );
            float b1 = dot( v1, ray_dir );
            float b2 = dot( v2, ray_dir );
            float b3 = dot( v3, ray_dir );
            float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w;
            float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w;
            float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w;
            float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w;

            if( d0 < 0.0f && d1 < 0.0f && d2 < 0.0f && d3 < 0.0f ) continue;

            float t0 = -b0 - sqrt( d0 );
            float t1 = -b1 - sqrt( d1 );
            float t2 = -b2 - sqrt( d2 );
            float t3 = -b3 - sqrt( d3 );

            if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) {
                rt_dist = t0;
                rt_hit  = i;
            }
            if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) {
                rt_dist = t1;
                rt_hit  = i+1;
            }
            if( d2 >= 0.0f && t2 > 0.0f && t2 < rt_dist ) {
                rt_dist = t2;
                rt_hit  = i+2;
            }
            if( d3 >= 0.0f && t3 > 0.0f && t3 < rt_dist ) {
                rt_dist = t3;
                rt_hit  = i+3;
            }

        }

The good news is that the number of spheres can be increased to 64 with plenty of cycles to spare. The bad news is that there's not enough cycles for next multiple of 4. Have to find a way to get around that.

28 October 2016

In the never-ending hunt for more spheres in the GPU Raytracer, I noticed that the glGetProgramBinary() call outputs assembler code too, in NV_gpu_program5 format. That makes it reasonably easy to see where improvements can be made.

The conditional construct in the first loop seems to confuse the compiler, and blocking negative square roots is pointless when they're basically free. Wrap it all up in a simple if() that reduces nicely in the generated assembler code. Still need to unroll 2 by hand, though, since the compiler still cannot unroll automatically when there's if() expressions in the loop:

        for( int i = 0; i < OBJNUM; i += 2 ) {

            vec3  v0 = ray_pos - obj_pos[i+0].xyz;
            vec3  v1 = ray_pos - obj_pos[i+1].xyz;
            float b0 = dot( v0, ray_dir );
            float b1 = dot( v1, ray_dir );
            float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w;
            float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w;
            float t0 = -b0 - sqrt( d0 );
            float t1 = -b1 - sqrt( d1 );

            if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) {
                rt_dist = t0;
                rt_hit  = i;
            }
            if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) {
                rt_dist = t1;
                rt_hit  = i+1;
            }

        }

A balance has to be made between unrolling loop 1 and 2. Seems like unrolling loop 1 twice, as above, and loop 2 10 (!) times is the best combination:

            for( i = 0; i < OBJNUM; i += 10 ) {

                vec3  v0 = ray_pos - obj_pos[i+0].xyz;
                vec3  v1 = ray_pos - obj_pos[i+1].xyz;
                vec3  v2 = ray_pos - obj_pos[i+2].xyz;
                vec3  v3 = ray_pos - obj_pos[i+3].xyz;
                vec3  v4 = ray_pos - obj_pos[i+4].xyz;
                vec3  v5 = ray_pos - obj_pos[i+5].xyz;
                vec3  v6 = ray_pos - obj_pos[i+6].xyz;
                vec3  v7 = ray_pos - obj_pos[i+7].xyz;
                vec3  v8 = ray_pos - obj_pos[i+8].xyz;
                vec3  v9 = ray_pos - obj_pos[i+9].xyz;
                float b0 = dot( v0, l );
                float b1 = dot( v1, l );
                float b2 = dot( v2, l );
                float b3 = dot( v3, l );
                float b4 = dot( v4, l );
                float b5 = dot( v5, l );
                float b6 = dot( v6, l );
                float b7 = dot( v7, l );
                float b8 = dot( v8, l );
                float b9 = dot( v9, l );
                float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w;
                float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w;
                float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w;
                float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w;
                float d4 = b4 * b4 - dot( v4, v4 ) + obj_pos[i+4].w;
                float d5 = b5 * b5 - dot( v5, v5 ) + obj_pos[i+5].w;
                float d6 = b6 * b6 - dot( v6, v6 ) + obj_pos[i+6].w;
                float d7 = b7 * b7 - dot( v7, v7 ) + obj_pos[i+7].w;
                float d8 = b8 * b8 - dot( v8, v8 ) + obj_pos[i+8].w;
                float d9 = b9 * b9 - dot( v9, v9 ) + obj_pos[i+9].w;
                if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f || d4 > 0.0f || d5 > 0.0f || d6 > 0.0f || d7 > 0.0f || d8 > 0.0f || d9 > 0.0f ) break;

            }

Unlike gcc, the NVidia compiler seems to be able to look ahead far enough to convert this into a coherent set of dp3/mad/or/trunc instructions. It's now considered wise to keep number of spheres a multiple of 10. Looks like it can finally do 60 spheres at 60 fps:

GPURay: 60 spheres, unrolling

New video:

20 September 2016

I missed a trivial optimization in the shadow calculation in the GPU Raytracer. There's no point in calculating the distance, it's enough to know whether any sphere is hit or not. That simplifies the loop enough so it can be unrolled further. A limitation is that the sphere count has to be a multiple of 4, not 2 as earlier:

            (...)
            int i;
            for( i = 0; i < OBJNUM; i += 4 ) {

                vec3  v0 = ray_pos - obj_pos[i+0].xyz;
                vec3  v1 = ray_pos - obj_pos[i+1].xyz;
                vec3  v2 = ray_pos - obj_pos[i+2].xyz;
                vec3  v3 = ray_pos - obj_pos[i+3].xyz;
                float b0 = dot( v0, l );
                float b1 = dot( v1, l );
                float b2 = dot( v2, l );
                float b3 = dot( v3, l );
                float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w;
                float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w;
                float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w;
                float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w;
                if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f ) break;

            }
            (...)

Let's compare the old and the new version, now with 56 spheres:

GPURay: 56 spheres, simpler shadow calculation

5 September 2016

Obviously, the same is true for the input data in the fractal and quaternion code too. The results are less dramatic, but the quaternion routine is now stable above 60 fps. Updated the article and made new measurements.

31 August 2016

The NVidia article How About Constant Buffers gives some interesting information about how data is stored and accessed on the GPU. The scene data in the GPU Raytracer is constant and the usage pattern fairly regular, so storing it in a constant buffer sounds like a good idea. Constant buffers translate to uniform buffers in OpenGL, so it's a matter of just replacing "shader storage" with "uniform" in the right places.

The results are spectacular. Old version (SSBO) versus new version (UBO), 32 spheres:

GPURay: SSBO vs UBO performance

Video below. Can now have 52 spheres and keep fps above 60 all the time:

2 August 2016

Made videos of MultiDoom and MultiQuake running on the Mellanox TILE-Gx36 multicore CPU:

27 July 2016

It's been a slow summer, so I finally had time to update the SRTP AES Optimization article from 2010. Should now support all CPU types and packet sizes larger than 4096. It's available here: SRTP AES Optimization Revisited

27 June 2016

Source files for the generic raytracer:

Just stuff the files in a directory and compile and link as specified below. Option -t0 (the default) will run it directly without setting up any threads.

24 June 2016

A quick port of the GPU Raytracer to more generic C code. Also includes optional Pthreads support.

[user@phyreztorm]$ gcc -O3 -ffast-math -Wall -Wextra -std=gnu99 -o raytracer -lm -lpthread raytracer.c bmp.c
[user@phyreztorm]$ ./raytracer -t 16 -o 7680x4320
Threads: 16
Thread    0:  270 lines (   0- 269)
Thread    2:  270 lines ( 540- 809)
Thread    3:  270 lines ( 810-1079)
(...etc...)
Time: 892.050049 ms
Saving bitmap out.bmp

CPU Raytracer

Just dividing into chunks of lines isn't a very good idea. Should use blocks like in the TILE-Gx raytracer, but that'd require more work. Test run using assorted thread counts on Intel Xeon E5-2699v3:

CPU Raytracer, measurement

10 June 2016

Source code for GPU fractals, raytracer and quaternion raytraced julia sets now available here.

9 June 2016

A friend scanned all issues of the old Norwegian Amiga magazine Digital 1989-1992. They're available here.

1 June 2016

26 May 2016

Back to 32 spheres in raytracer after removing some deadwood.

But that's boring, raytraced quaternion julia sets are cooler:

GPU Plasma

Still pretty rough, but it runs. Based on code by Keenan Crane.

13 May 2016

Sacrificed some spheres to get 60 fps with plane shadows:

April 2016

April is GPU appreciation month! Some attempts at making old cr... err, old stuff run on an NVidia Tegra X1.

GPU Raytracing

36 objects, 3 reflections, non-reflective shadows on center sphere. 126 lines shader code, could be less if nvcc would stop screwing up unrolling.

GPU Julia

Based on a GPU implementation by John Tsiombikas. I spent some time optimizing it and used a HSV'ish palette instead. 256 iterations. 45 fps if all pixels are at max iterations, aka. black screen. So 60 fps for all normal cases. Need bigger GPU.

GPU Plasma Effect

GPU Plasma

Based on a CPU implementation by Lode Vandevenne. Relatively easy to port to a GLSL shader.

HSV to RGB Conversion

Using a HSV-based palette when making effects is useful, since it wraps without any edges. But the color has to be converted back to RGB before displaying, and this can be complicated. Several methods of assorted quality can be found on the net. This one seems popular, and this one is also interesting. Donald Reynold's implementation uses cosines for everything, but then the hexagonal shape is gone. So let's do a twist: Use a cosine for the basic H slope, clamp() for the flat top/bottom and mix() or similar for S and V:

// Input: HSV 0.0f..1.0f
vec3 HSV2RGB( vec3 hsv )
{
    float h = hsv.x * 2.0f*PI; // h 0..2*PI

    float d120 = 2.0f*PI/3.0f;

    vec3 hv  = vec3( h, h-d120, h+d120 );

    vec3 rgb = clamp( 0.5f + cos( hv ), 0.0f, 1.0f ); // flatten h before s,v

    return ((1.0f-hsv.y) + rgb*hsv.y) * hsv.z; // mix( 1.0f,rgb,hsv.y) => ((rgb-1.0f)*hsv.y+1.0f)*hsv.z
}

A straight H2RGB where S=V=1.0f can be made even simpler:

vec3 H2RGB( float h )
{
    h *= 2.0f*PI; // h 0..2*PI

    float d120 = 2.0f*PI/3.0f;

    vec3 hv = vec3( h, h-d120, h+d120 );

    return 0.5f + cos( hv ); // add clamp() if needed
}

www.ignorantus.com