My friend Hannibal found a spare Cisco SX80 [cisco.com] videoconferencing unit in the back of his van. Might need a quick cleanup. Released in 2014, it featured a cutting-edge 36-core Tilera TILE-Gx CPU [wikipedia.org]. We used it to write cool mega-parallel versions of Doom and Quake and integer raytracers and scalers and other really important stuff. I spent a lot of time doing actual work on it, too, hand-optimizing the video pipeline. Fun times with a short (V)LIW pipeline and loads of cores! There are several articles about the TILE-Gx in the Articles section.
Applying separable filters efficiently on multicore CPUs seems to be a difficult problem. A popular image processing package does it the wrong way, as explained in that book nobody has read: just dividing a picture into completely separate segments, one per core, leads to cache thrashing very quickly. 8 cores seemed to be the limit for FHD source pictures on a random Xeon back then.
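Rough numbers, mine rather than the book's: an 8-bit FHD luma plane is 1920 x 1080, about 2 MB. Split it into 8 bands and each core's 135-row band plus its horizontally filtered intermediate is already around half a megabyte, so by the time the vertical pass comes back for those rows, the early ones have long since been evicted from the 32 KB L1 (and the 256 KB per-core L2 of the Xeons of that era), and all eight cores are fighting over the shared L3 at once.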
My newer method from the articles A Fast Image Scaler for ARM Neon (2023) and Exploiting the Cache: Faster Separable Filters (2013, 2018) lets you split jobs into blocks of 8 rows without much fuss. It is still extremely fast on a single core, including the new Apple ARM CPUs. I never made a parallel version of it, but the higher degree of read overlap you get from many smaller jobs should yield much better multicore results than the popular segment-per-core approach.
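To make the block idea concrete, here is a minimal single-core sketch in plain C. It is not the code from the articles, the names are mine, and the kernel is just a 1-2-1 blur to keep it self-contained. The point is only the shape of the loop: the horizontal pass for a block of 8 rows (plus one row of overlap above and below) goes into a small scratch buffer, and the vertical pass reads it back while it is still hot in cache.

    #include <stdint.h>
    #include <stdlib.h>

    #define BLOCK_ROWS 8

    /* horizontal 1-2-1 blur of one row */
    static void blur_h(const uint8_t *src, uint8_t *dst, int w)
    {
        dst[0] = src[0];
        for (int x = 1; x < w - 1; x++)
            dst[x] = (uint8_t)((src[x - 1] + 2 * src[x] + src[x + 1] + 2) >> 2);
        dst[w - 1] = src[w - 1];
    }

    /* vertical 1-2-1 blur between three horizontally filtered rows */
    static void blur_v(const uint8_t *above, const uint8_t *mid,
                       const uint8_t *below, uint8_t *dst, int w)
    {
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((above[x] + 2 * mid[x] + below[x] + 2) >> 2);
    }

    /* separable 3x3 blur processed in blocks of 8 rows */
    void blur_blocked(const uint8_t *src, uint8_t *dst, int w, int h, int pitch)
    {
        /* scratch holds the horizontally filtered rows of one block,
           plus one row of overlap above and below */
        uint8_t *scratch = malloc((size_t)(BLOCK_ROWS + 2) * (size_t)w);
        if (!scratch)
            return;

        for (int y0 = 0; y0 < h; y0 += BLOCK_ROWS) {
            int y1 = y0 + BLOCK_ROWS < h ? y0 + BLOCK_ROWS : h;

            /* horizontal pass for rows y0-1 .. y1, clamped at the picture edges */
            for (int y = y0 - 1; y <= y1; y++) {
                int yc = y < 0 ? 0 : (y >= h ? h - 1 : y);
                blur_h(src + (size_t)yc * pitch,
                       scratch + (size_t)(y - y0 + 1) * w, w);
            }

            /* vertical pass for rows y0 .. y1-1, reading only the scratch rows */
            for (int y = y0; y < y1; y++) {
                uint8_t *mid = scratch + (size_t)(y - y0 + 1) * w;
                blur_v(mid - w, mid, mid + w, dst + (size_t)y * pitch, w);
            }
        }
        free(scratch);
    }

The scratch buffer is BLOCK_ROWS + 2 rows, roughly 19 KB for an FHD-wide 8-bit plane, so both passes for a block stay in the fast caches instead of dragging a full-size intermediate picture through them.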
With the advent of bigger multicore CPUs and larger caches, it's time to rethink the entire thing. Again. Weird how often that happens. Anyway, I had this brilliant (or derivative, I forget which) idea the other day: if each horizontal job, hereafter referred to as a sweep, could be made as small as possible, the speedup would be great. This would require a CPU with comparatively short cache lines and an ultra-quick fetchadd. Long story short: any modern ARMv8+ with 64-byte cache lines and load-store exclusive instructions would suffice. You also have to figure out how to track the start/phase and how to stack the loads and such crap, but those are implementation details.
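Here is a minimal sketch of the dispatch part only, under my own assumptions: every sweep produces 64 destination bytes (one cache line), and the workers grab sweep indices from one shared counter with a C11 atomic fetch-add, which on ARMv8.1+ compiles down to an LSE LDADD (or a load/store-exclusive loop on plain ARMv8). The counter, the thread count and the names are all mine; the start/phase tracking and the actual filtering are left out, so run_sweep() below is just a stand-in.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SWEEP_BYTES 64   /* one destination cache line per sweep */
    #define NUM_THREADS 4

    static uint8_t    *g_dst;
    static atomic_uint g_next_sweep;   /* shared ticket counter */
    static unsigned    g_total_sweeps;

    /* stand-in for the real horizontal filter: just stamps the sweep's 64 bytes */
    static void run_sweep(unsigned idx)
    {
        memset(g_dst + (size_t)idx * SWEEP_BYTES, (int)(idx & 0xff), SWEEP_BYTES);
    }

    /* each worker pulls sweep indices with fetch-add until the picture is done */
    static void *sweep_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            unsigned idx = atomic_fetch_add_explicit(&g_next_sweep, 1,
                                                     memory_order_relaxed);
            if (idx >= g_total_sweeps)
                break;
            run_sweep(idx);
        }
        return NULL;
    }

    int main(void)
    {
        /* FHD, one byte per pixel: 30 sweeps per line, 1080 lines */
        g_total_sweeps = 1920 / SWEEP_BYTES * 1080;
        g_dst = malloc((size_t)g_total_sweeps * SWEEP_BYTES);
        if (!g_dst)
            return 1;

        pthread_t t[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, sweep_worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);

        printf("done: %u sweeps of %d bytes\n", g_total_sweeps, SWEEP_BYTES);
        free(g_dst);
        return 0;
    }

Relaxed ordering is enough for a pure ticket counter like this; the join at the end is what publishes the results. Whether a single global counter actually scales to 32+ cores, or needs splitting per line or per cluster, is exactly the kind of implementation detail waved away above.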
How to deal with the first vertical pass is another matter. This solution would not benefit from making the passes similar, as in my old articles, since the sweeps are just 64 dst bytes long. I'm thinking of a nifty method where dst lines are handed out to the cores in the same manner as the sweeps. It might be too nifty, since it would produce lines at a much lower rate. Again: implementation details. The code should give excellent cache locality and a near linear speed increase for larger hordes of orcs, sorry, cores. I'm thinking 32 and up.
What's missing here is, obviously, the code. I'll make a version in C and intrinsics when I feel like it. Might be this week, might be next year. I'm reconsidering which license to use on new code that I create, so don't hold your breath. Feel free to rip off the idea. The world needs more coders who care about energy usage and execution speed.
So, before you all scurry back to your unstable rust compilers and snake pits, let's set the record straight. Imagine a world where energy waste is frowned upon, except in one area: "Modern" programming. In my opinion, if a new programming language can't solve a problem without a measurable increase in execution time or energy usage compared to C, it's a failure. (For obvious reasons stated elsewhere, it's impossible to be faster than correctly written C. Duh.) Conversely, if a new language ruins the expressiveness of C by severely constraining memory access and enforcing specific formatting rules, programming conventions, definition files, verbose semantics etc., what you have is just a very inefficient form of Newspeak. Read Nineteen Eighty-Four [wikipedia.org] and think about this: Why are the hoi polloi trying to enforce Newspeak, also known as unstable rust, in programming? George Orwell's dystopian vision is slowly coming true, in ways that could not be predicted in 1949.
Before Xmas, I installed an Elekit TU-8900 amplifier [elekit.co.jp] to power the old Sonus Faber Venere speakers. Never got round to posting a picture of it. It's reasonably priced and not too hard to build yourself. I did splurge on some WE 300B and Telefunken 12AU7 tubes:
Recently, I got hold of a Brocksieper EarMax Silver Edition [brocksieper.com] OTL headphone amp to drive the assortment of Focal cans. I have to find out whether OTL is cool or not. I've still got a Mac [mcintoshlabs.com] for serious listening, but it needs a minor overhaul. Anyway, the Brocksieper is also reasonably priced. Sounds pretty OK with the stock GE 12AT7 and Toshiba 6922s, and some random (Chord-)wiring from the bag of holding:
The Norwegian version of Real Programming was released 4 years ago today. As mentioned below, we handed out virtual awards last year. Nobody really took a staunch stand against fast and energy-saving code like the Waterhousers and Ryghs this time, so I guess no new awards. Saving them for the five-year anniversary next year.
Interesting things did happen, like AI replacing many menial programming tasks, and all the big tech companies reducing their head counts. Who saw that coming? My palantir might not be trustworthy, but it's never wrong.
Happy new year! 2024 ended with the launch of the Xmas Demo 2024. Sjur and I went back to the roots and made effects run in 1080p60 on a $99 NVidia Jetson Nano. We're pretty happy with it. Control code is 1199 lines, shaders 1585 lines. Complete source code can be found on the info page.
For the hoi polloi that use base 10, ignorantus.com celebrated 15 years online last year. Us hardcore guys will be celebrating 0x10 years this year. The first Real Programming anniversary awards were handed out virtually. Will there be another one? My palantir is keeping quiet at the moment. kode24.no and digi.no were revealed to be wasting enormous amounts of energy. (Speaking of wastage, the AI frenzy is continuing. However, out of the ashes, something useful may arise [theregister.com]. Bad news for those in the idioting business. Feels like I predicted this already. Spooky.)
In the audio lab, we've been toying with a hybrid tube amp and the trusty, old TubeCube. Sjur built a prototype speaker cabinet with his CNC machine and installed some interesting new speaker elements. Lately, we've been going more hardcore [westernelectric.com]. Is there any other way? Report coming.
In the Nano lab, we've been working with assorted sensors, cameras, lenses, well, the usual array of things that can be connected to it and utilized in interesting new ways. We're running some custom AI stuff on it, too. Remember Corneliusen's two laws of software management. They're in that book. Anyway, we're keeping it under wraps for now. The projects, not the laws. Duh.
Stay tuned for more top notch stuff in 2025! (May be considered regular stuff in some regions)
This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary:
You may copy and redistribute the material in any medium or format for any purpose, even commercially.
You must give appropriate credit, provide a link to the license, and indicate if changes were made.
If you remix, transform, or build upon the material, you may not distribute the modified material.