Latest News: Updated 12 June 2018 --
YouTube Channel --
- A Different Method for Image Transformations (9 March 2018)
Image transformations using separable filters can be implemented as a
vertical pass followed by a horizontal pass.
The two passes usually differ in implementation, and the horizontal one is slower.
A new method is presented where the passes are nearly identical,
achieved by ordering and transposing data in a specific manner.
This may be faster on architectures where the L1 cache is large enough
to hold a temporary dataset of 8 lines. An integer Intel SSE2 implementation is provided.
- Integer Raytracing on Mellanox TILE-Gx (5 June 2017)
A second attempt at raytracing on the Mellanox TILE-Gx. The TILE-Gx has very limited floating
point support in hardware, so let's try using fixed point math instead. Unfortunately, calculating
square roots is very time consuming. An alternative approach is explored where custom conversion
routines and integer math do this quickly enough to render 40 spheres in 1080p60. Videos and source code are included.
- GPURay - GPU+CPU Raytracer for NVidia Tegra X1 (3 February 2017)
The GPU-only raytracer described in my 2016 article "GPU Hacks" can trace 80 spheres at 60 fps
on the NVidia Tegra X1. This new article describes how to use CPU preprocessing to boost
the sphere count to 128 with minimal changes to the GPU part. Videos and source code are included.
- SRTP AES Optimization Revisited (27 July 2016)
- In a 2010 article titled "SRTP AES Optimization" I presented
a method to make SRTP AES run significantly quicker. Unfortunately, there were some caveats:
Packet length had to be 4096 bytes or less and a multiple of 16, and the target CPU was expected to be big endian.
Let's try to address these issues in a new and improved version that will run on any 32-bit CPU.
- GPU Hacks: Fractals, a Raytracer and Raytraced Quaternion Julia Sets (10 June 2016)
- The NVidia Tegra X1
has a Maxwell-based GPU with a theoretical FP32 peak of 512 GFLOPS. It can easily be programmed
using OpenGL GLSL shaders. However, making fast GPU code is different from making fast CPU code.
Three atypical GPU jobs are implemented in fragment shaders and optimized for better performance.
Videos and source code are included.
- OpenSSL aes_core.c Replacement for EZchip TILE-Gx (24 September 2015)
- Drop-in replacement for aes_core.c that's significantly faster. Includes a second
look at how to do the last round in less than half the instructions.
- Bilinear Picture Scaling on Tilera TILE-Gx (7 November 2014)
- A look at how to do bilinear picture scaling on the Tilera TILE-Gx. Two
different approaches are tried out. Measurements are done on different core counts and data
sizes. Uses a new parallelization library, presented in the article, to split the work across multiple cores.
- Raytracing on Tilera TILE-Gx (21 April 2014)
- Raytracing is a job well suited for multicore CPUs. Challenge of the day: Make a raytracer for
the Tilera TILE-Gx36 that's quick enough to output 1920x1080p60 video. Source code, pictures,
videos and performance measurements included.
- RGB to YUV Conversion on Tilera TILE-Gx (12 April 2013)
- An RGB to YUV conversion routine for Tilera TILE-Gx that uses the new dual dot product
instructions for maximum efficiency.
- YUV to RGB Conversion on Tilera TILE-Gx (7 November 2012)
- Optimizing for the Tilera TILE-Gx CPU is very different from Intel SSE2.
An attempt to get optimal performance using 8 bit multipliers as much as possible.
- YUV to RGB Conversion Using SSE2 (23 October 2012)
- A common error in this class of conversion routines on SSE2 is overly conservative use of
multipliers, leading to complicated data shuffling before and/or after the multiplications.
SSE2 multipliers are inherently cheap to use, so let's try to maximize their usage instead.
- A Look at Halide's SSE2 3x3 Box Filter (24 August 2012)
- A look at the SSE2 3x3 box filter used as example code in the Halide language specification.
I get significantly better results using normal C code and SSE2 intrinsics. The code is also comprehensible.
- AES Optimization on Tilera TILE-Gx (23 December 2011)
- A TILE-Gx core can issue 3 instructions in parallel, given a set of strict restrictions.
This paper explores how to exploit that in an AES encryption routine using TILE-Gx intrinsics.
- SRTP SHA1 Optimization (13 December 2010)
- Calculating SHA1 hashes on SRTP packets can be quite costly on low end CPUs. Since lengths etc.
are static, let's try to strip out the code that actually does SHA1 calculation in OpenSSL and make it
as fast as possible. Tests are performed on a Freescale MPC8270 CPU.
- SRTP AES Optimization (2 April 2010)
- A "feature" in the SRTP specification makes it possible to reduce the CPU cost
of AES encryption and decryption by 30%.
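The "vertical pass followed by a horizontal pass" structure mentioned in the image transformations entry above can be sketched in plain C. This is only a scalar illustration of a conventional separable filter, not the transposing method or the SSE2 code from that article. The 1-2-1 kernel, the border replication, and all names (`clamp_index`, `vertical_pass`, `horizontal_pass`, `blur_3x3_separable`) are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an index to [0, n-1] so that border pixels are replicated. */
static size_t clamp_index(ptrdiff_t i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Vertical pass: mix the pixel above, the pixel itself and the pixel below
   with weights 1, 2, 1. Sums stay unnormalized (0..1020) in 16 bits. */
static void vertical_pass(const uint8_t *src, uint16_t *tmp,
                          size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        const uint8_t *up   = src + clamp_index((ptrdiff_t)y - 1, height) * width;
        const uint8_t *mid  = src + y * width;
        const uint8_t *down = src + clamp_index((ptrdiff_t)y + 1, height) * width;
        for (size_t x = 0; x < width; x++)
            tmp[y * width + x] = (uint16_t)(up[x] + 2 * mid[x] + down[x]);
    }
}

/* Horizontal pass: same weights along each row, then divide the combined
   weight of 16 out again with rounding. */
static void horizontal_pass(const uint16_t *tmp, uint8_t *dst,
                            size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        const uint16_t *row = tmp + y * width;
        for (size_t x = 0; x < width; x++) {
            unsigned left  = row[clamp_index((ptrdiff_t)x - 1, width)];
            unsigned right = row[clamp_index((ptrdiff_t)x + 1, width)];
            unsigned sum   = left + 2u * row[x] + right;  /* at most 4080 */
            dst[y * width + x] = (uint8_t)((sum + 8u) >> 4);
        }
    }
}

/* 3x3 blur on an 8-bit grayscale image, done as two 1-D passes.
   tmp must hold width * height 16-bit values. */
void blur_3x3_separable(const uint8_t *src, uint8_t *dst, uint16_t *tmp,
                        size_t width, size_t height)
{
    assert(src && dst && tmp && width > 0 && height > 0);
    vertical_pass(src, tmp, width, height);
    horizontal_pass(tmp, dst, width, height);
}
```

Note how the two passes access memory differently: the vertical pass reads three whole rows in step, while the horizontal pass needs neighbours within a row. That difference is what usually forces two separate implementations.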
- Random stuff I'm working on, in chronological order.
- NVidia Jetson TX2 Xmas Demo 2017 (2017 - 2018)
Everything in one place. Contains new video of the remastered edition that's 60 fps all the way.
- MultiQuake - Quake for Mellanox TILE-Gx CPUs (5 May 2014 - 4 Aug 2016)
Port of the original Quake to Mellanox TILE-Gx multicore CPUs.
Number of Quakes possible to run in parallel is only limited by your screen size.
Only native version available. Has custom, intrinsics based TILE-Gx 2x-5x scalers.
- MultiDoom - Doom for Mellanox TILE-Gx CPUs (2 June 2009 - 4 Aug 2016)
Port of SDL Doom 1.10 to Mellanox TILE-Gx multicore CPUs.
Number of Dooms possible to run in parallel is only limited by your screen size.
Host based version for PCI Express cards and native version available. 2x and 3x EPX scalers.
- TANDBERG Secrets (1999-2010)
- Pictures of hidden menus and unreleased prototypes I worked on in the Classic and MXP days.
- Source Code and Schematics for PCTVNet HomePilot Set Top Box (1998-1999)
- Source code used to run the HomePilot internet box back in 98-99. Not all; it's only
what I had checked out on my last day. Might be interesting for code archaeologists.
Also includes schematics for the 2.0 hardware.
- Triumph Amiga Demos and Source Code (1985-1995)
- Source code, videos and other fun stuff from old Commodore Amiga demos I worked on.
The contact e-mail address is disposable and will change regularly. Any replies will be sent from a GMail address.
Source code license for nonderivative works, unless clearly labelled otherwise: CC0 1.0.
Articles, publications, pictures, media files etc.: CC BY 4.0.