Fluid simulation (DX11/DirectCompute)


Download Fluid3D V1.3



Improved version that has no more artifacts and is 70% faster.

The simulation now looks more smoke-like, and the smoke bounces off the walls.

Instead of visualizing the velocity, an advected fire reaction coordinate is rendered.

The simulation speedup is realized by vectorizing some compute kernels, especially the Jacobi iterations.

Thread group size for all kernels is now 256, either 16x4x4 or 16x16x1.

Rendering was made faster by sampling a scalar volume instead of a float4 volume.

The space bar switches to showing the velocity magnitude instead of the advected smoke.

The whole simulation now uses the second-order MacCormack scheme, including limiters.




Download Fluid3D V1.1




Here is another volume simulator, this time simulating an incompressible fluid by solving the Navier-Stokes differential equations. The simulation runs in a 200x200x200 voxel box.

The calculations use a well-known scheme: velocity advection, Jacobi pressure solving, and making the velocity divergence-free by subtracting the gradient of the pressure.
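The pressure step can be sketched on a tiny 2D grid. This is only an illustration, not the shader code of the demo: the staggered face layout, the closed-box wall handling, the grid size and all function names are assumptions made for the example.

```python
def divergence(u, v, n):
    # u holds (n+1) x n face velocities, v holds n x (n+1) face velocities
    # (assumed staggered layout; cell size is taken as 1).
    return [[u[i + 1][j] - u[i][j] + v[i][j + 1] - v[i][j] for j in range(n)]
            for i in range(n)]

def jacobi_pressure(div, n, iterations):
    # Jacobi iterations for laplacian(p) = div inside a closed box.
    p = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        new_p = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # A neighbour outside the box reuses the centre value (solid wall).
                left = p[i - 1][j] if i > 0 else p[i][j]
                right = p[i + 1][j] if i < n - 1 else p[i][j]
                down = p[i][j - 1] if j > 0 else p[i][j]
                up = p[i][j + 1] if j < n - 1 else p[i][j]
                new_p[i][j] = (left + right + down + up - div[i][j]) / 4.0
        p = new_p
    return p

def project(u, v, p, n):
    # Subtract the pressure gradient on interior faces; wall faces stay zero.
    for i in range(1, n):
        for j in range(n):
            u[i][j] -= p[i][j] - p[i - 1][j]
    for i in range(n):
        for j in range(1, n):
            v[i][j] -= p[i][j] - p[i][j - 1]

n = 6
u = [[0.0] * n for _ in range(n + 1)]
v = [[0.0] * (n + 1) for _ in range(n)]
u[3][2] = 1.0    # two arbitrary velocity disturbances
v[2][4] = -0.5
before = max(abs(d) for row in divergence(u, v, n) for d in row)
p = jacobi_pressure(divergence(u, v, n), n, 300)
project(u, v, p, n)
after = max(abs(d) for row in divergence(u, v, n) for d in row)
```

After enough Jacobi iterations the remaining divergence (`after`) is tiny compared to the initial one (`before`). A real-time solver would typically run far fewer iterations per frame and accept a small residual.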

This is the so-called Semi-Lagrangian scheme. A more accurate solver uses the second-order MacCormack technique, and the simulation uses the latter. However, it makes the simulation unstable and introduces artifacts. Limiting the generated extremes can fix this. Unfortunately I was not able to get this working, so the simulation runs without limiters. Still, the result shows some visually interesting turbulent behavior.

The magnitude of the velocity vectors is visualized. To make a 3D rendering, a simple maximum projection along rays is used: rays are shot through the volume, searching for the maximum speed along each ray. The speed is then mapped to a color by linear interpolation.
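As a rough 1D illustration of the difference between the two schemes and of what a limiter does, consider the sketch below. It assumes a periodic grid, a constant velocity and a clamp to the two values used by the forward interpolation; none of these details are taken from the demo itself.

```python
import math

def sample(field, x):
    # Linear interpolation on a periodic 1D grid.
    n = len(field)
    i = math.floor(x)
    t = x - i
    return (1 - t) * field[i % n] + t * field[(i + 1) % n]

def semi_lagrangian(field, vel, dt):
    # Trace each grid point backwards and sample the old field there.
    return [sample(field, i - vel * dt) for i in range(len(field))]

def maccormack(field, vel, dt, limit=True):
    n = len(field)
    forward = semi_lagrangian(field, vel, dt)       # predictor
    backward = semi_lagrangian(forward, -vel, dt)   # advect back to estimate the error
    out = []
    for i in range(n):
        value = forward[i] + 0.5 * (field[i] - backward[i])
        if limit:
            # Clamp to the range of the values the forward step interpolated.
            j = math.floor(i - vel * dt)
            lo = min(field[j % n], field[(j + 1) % n])
            hi = max(field[j % n], field[(j + 1) % n])
            value = min(max(value, lo), hi)
        out.append(value)
    return out

step = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
unlimited = maccormack(step, 0.5, 1.0, limit=False)
limited = maccormack(step, 0.5, 1.0, limit=True)
```

On this step profile the unlimited result leaves the original 0..1 range near the edges, which is the kind of generated extreme that shows up as artifacts, while the limited result stays inside the range.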
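A minimal sketch of such a maximum projection, assuming nearest-voxel sampling, a fixed step of one voxel, and two arbitrary end colors (the function names and colors are made up for the example):

```python
def max_along_ray(volume, origin, direction, steps):
    # volume[z][y][x] holds the speed; the ray advances one `direction` per step.
    size = len(volume)
    best = 0.0
    for k in range(steps):
        x, y, z = (int(round(o + k * d)) for o, d in zip(origin, direction))
        if 0 <= x < size and 0 <= y < size and 0 <= z < size:
            best = max(best, volume[z][y][x])
    return best

def speed_to_color(speed, max_speed, cold=(0, 0, 64), hot=(255, 255, 255)):
    # Linear interpolation between two colors, clamped to [0, 1].
    t = min(max(speed / max_speed, 0.0), 1.0)
    return tuple(c + t * (h - c) for c, h in zip(cold, hot))

size = 4
volume = [[[0.0] * size for _ in range(size)] for _ in range(size)]
volume[2][1][1] = 3.0                      # one fast voxel
hit = max_along_ray(volume, (1, 1, 0), (0, 0, 1), size)
miss = max_along_ray(volume, (0, 0, 0), (0, 0, 1), size)
color = speed_to_color(hit, 6.0, cold=(0, 0, 0), hot=(200, 100, 0))
```

Each screen pixel would get its own ray; only the largest value along the ray contributes to the pixel color.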



            - mouse left drags a source on a plane through the center of the volume and parallel to the screen

            - mouse right to rotate the volume

            - mouse wheel to zoom in and out

            - space bar to toggle between MacCormack and Semi-Lagrangian simulation




26 Dec 2009   /   Jan Vlietinck




3D waves simulator (DX11)


Download Waves3D V1.1



A long time ago (16 years) I wrote a 2D wave simulator, based on finite differencing of the Laplace wave equation. The simulation grid was 80 x 64 points and it ran in real time.


Now, in a similar spirit, I've written a wave simulator that simulates volumetric 3D waves. These could be sound or electromagnetic waves. The simulation grid is now 400 x 400 x 400, or 64 million voxels. The simulation equation becomes slightly more complex but is still fairly simple, namely:

            p(x,y,z) = 2*p'(x,y,z) - p''(x,y,z)
                       + c*( p'(x-1,y,z) + p'(x+1,y,z) + p'(x,y-1,z) + p'(x,y+1,z) + p'(x,y,z-1) + p'(x,y,z+1) - 6*p'(x,y,z) )


With p(x,y,z) the pressure (for sound waves) at time t, p' the pressure at time t-dt, and p'' the pressure at time t - 2*dt.
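One time step of this update can be sketched as follows. The tiny grid, the fixed zero boundary and the value of c are assumptions for illustration; for this kind of scheme c normally equals (wave speed * dt / grid spacing)^2 and has to stay at or below 1/3 in 3D for the simulation to remain stable.

```python
def wave_step(prev, prev2, c):
    # prev is p' (time t - dt), prev2 is p'' (time t - 2*dt); returns p (time t).
    n = len(prev)
    new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(1, n - 1):          # boundary voxels are simply held at zero
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                neighbours = (prev[x - 1][y][z] + prev[x + 1][y][z] +
                              prev[x][y - 1][z] + prev[x][y + 1][z] +
                              prev[x][y][z - 1] + prev[x][y][z + 1])
                new[x][y][z] = (2 * prev[x][y][z] - prev2[x][y][z]
                                + c * (neighbours - 6 * prev[x][y][z]))
    return new

n = 5
prev = [[[0.0] * n for _ in range(n)] for _ in range(n)]
prev2 = [[[0.0] * n for _ in range(n)] for _ in range(n)]
prev[2][2][2] = 1.0                    # a single pressure impulse in the centre
new = wave_step(prev, prev2, 0.25)
```

After one step the impulse has dropped to 2 - 6c at the centre and a value of c has spread to each of the six direct neighbours. For the next step, prev becomes prev2 and new becomes prev.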


There is a scalar and a vectorized implementation.

On an HD 5870, the scalar simulation runs at about 119 frames per second.

The vectorized version runs at 257 frames per second, corresponding to 16.5 GigaVoxels/s.
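The throughput figure follows directly from the grid size and the frame rate, assuming one full grid update per frame:

```python
GRID = 400
voxels_per_frame = GRID ** 3                 # 64,000,000 voxels

def gigavoxels_per_second(frames_per_second):
    return voxels_per_frame * frames_per_second / 1e9

vector_rate = gigavoxels_per_second(257)     # about 16.4, quoted as 16.5 in the text
scalar_rate = gigavoxels_per_second(119)     # about 7.6
```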


Only one slice of the simulation is visualized; moving the mouse vertically shows other slices.

The S key toggles between the scalar and vectorized mode.

By pressing the space bar, other wave source constellations can be viewed, namely two sources apart, a 4-source swirl, and a 2-source dipole.

Pressing the 'Enter' key cycles the slice viewing direction through the 3 main axes (scalar mode only).





23 Oct 2009   /   Jan Vlietinck



DX11 DirectCompute Julia 4D


Download DX11 Julia 4D fractal generator V1.4



Here is another fractal generator, this time rendering 4-dimensional quaternion Julia fractals.

It continuously morphs the shape and colors of the fractal.


The shader code is a port of the original Cg version written by Keenan Crane.


The program tries to use a DX11 compute shader 5. If this is not possible, as on DX10 hardware, a pixel shader 4 is used instead. This enables the code to run on any DX11 or DX10 GPU.


The fractal can be rotated with the mouse to see it in all its 3D glory. Zoom with the mouse wheel.

The morphing can also be toggled with the space bar, for better inspection.

Fractal detail can be increased and decreased with the +/- keys of the numeric pad.

Self shadowing can be toggled with the S key.

With the P key, it is possible to switch between pixel and compute shader (if available).

ALT + Enter switches from windowed to full screen.




11 Oct 2009   /   Jan Vlietinck



DX11 DirectCompute Mandelbrot and Julia viewer


Download DX11 / AVX 2 / AVX 512 Mandelbrot and Julia viewer V2.3

Download DX11 Mandelbrot and Julia viewer V1.8

Download DX11 feature level DX10 version





Here is a quite fast Mandelbrot and Julia viewer, making use of DX11 and the DirectCompute API.

The software detects whether your GPU supports doubles; if not, it will run using only floats.


The set is calculated with up to 1024 iterations. Making use of the horsepower of DX11 GPUs enables real-time panning and zooming even at high resolution.


Both a scalar and a vectorized version of the computation are included.

Both generate the same output. The vectorized version was made after suboptimal performance on the ATI HD 5870 with scalar calculation. The vectorization is done by calculating 2x2 pixels at once.
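The idea of handling 2x2 pixels at once while producing the same output can be sketched on the CPU like this. It is a plain Python emulation with four lanes stepped in lockstep, not the shader from the download, and the function names and sample coordinates are invented for the example.

```python
def escape_scalar(cx, cy, max_iter):
    # Classic escape-time loop for one pixel.
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter

def escape_block(cxs, cys, max_iter):
    # Four pixels (a 2x2 block) advance together; finished lanes are masked out.
    xs, ys = [0.0] * 4, [0.0] * 4
    counts = [max_iter] * 4
    active = [True] * 4
    for i in range(max_iter):
        for k in range(4):
            if active[k] and xs[k] * xs[k] + ys[k] * ys[k] > 4.0:
                counts[k] = i
                active[k] = False
        if not any(active):
            break                      # the whole block has escaped
        for k in range(4):
            if active[k]:
                xs[k], ys[k] = (xs[k] * xs[k] - ys[k] * ys[k] + cxs[k],
                                2.0 * xs[k] * ys[k] + cys[k])
    return counts

# A 2x2 block of neighbouring sample points near the edge of the set.
cxs = [-0.75, -0.74, -0.75, -0.74]
cys = [0.10, 0.10, 0.11, 0.11]
block = escape_block(cxs, cys, 1024)
scalar = [escape_scalar(cx, cy, 1024) for cx, cy in zip(cxs, cys)]
```

The block loop only ends early once all four pixels have escaped, so neighbouring pixels with similar iteration counts waste little work.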

Compared to the scalar version, it runs twice as fast on this GPU, at over 1.9 TFLOP/s.

In doubles mode, performance is less than 400 GFLOP/s.


A GTX 480 is about half as fast in float mode and about a quarter as fast in double mode, compared to an HD 5870.


Full source code is included.

Note that no drawing code was needed. It is possible to write directly to the backbuffer from the compute shader.


Key controls


Space bar        : Toggles between Mandelbrot and Julia

M key               : Toggles between 1024 and 2048 maximum iterations

A/Z keys           : Cycle colors


V key               : Vector calculations (only used for floats, not much effect for doubles)

S key               : Scalar calculations

F key               : Float  calculations

D key               : Double calculations

E key               : Toggle between two different double versions


The first version is a straight conversion of the float version. However, this version is currently buggy and slow on ATI. The second version does less loop unrolling and works fine on ATI. The first version is about 1/3 faster (at least currently on Nvidia).




Move                     : Pan

Move +  SHIFT      : In Julia mode, move base point around to get a morphing fractal

Drag left and right  : Zoom in / out


Pressing the space bar switches to Julia calculation. It takes the point in the center of the Mandelbrot view as the Julia base point. Holding the SHIFT key while moving the mouse moves the Julia base point, which results in an animation with the Julia set changing shape.

Pressing the space bar again switches back to Mandelbrot calculation. With a more deeply zoomed-in Mandelbrot view, the Julia set will change shape more gradually.
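The relation between the two modes can be sketched with one shared iteration loop. The function names are made up for the example, and the escape radius of 2 is the usual choice rather than something stated here.

```python
def escape_count(z, c, max_iter):
    # Iterate z -> z*z + c until |z| exceeds 2 or the iteration limit is hit.
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter

def mandelbrot(pixel, max_iter=1024):
    # Mandelbrot: start at zero, the pixel supplies the constant.
    return escape_count(0j, pixel, max_iter)

def julia(pixel, base_point, max_iter=1024):
    # Julia: the pixel supplies the start value, the base point is the constant.
    return escape_count(pixel, base_point, max_iter)

view_center = complex(-0.75, 0.1)      # hypothetical centre of the Mandelbrot view
julia_at_origin = julia(0j, view_center)
```

Moving the base point, as the SHIFT drag does, changes the constant for every pixel at once, which is why the whole Julia set morphs.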


DX10 support


In order to support DX10 GPUs the DX11 feature level DX10 version should be used.

It will try to make use of compute shader 4 instead of compute shader 5.

This requires an additional pass with a pixel shader to copy compute shader output to the screen.


In case there is no support for compute shaders, as currently is the case with Nvidia on Vista, the calculations will be done with pixel shaders only.

In this case only the scalar version of the algorithm can be used.


On a GTX280, computational throughput is 1/4 of that of an HD 5870. This is to be expected, as the former has around 600 GFLOP/s, whereas the latter has over 4 times more.





7 Oct 2009   /   Jan Vlietinck



Fast software renderer



Download demo version 1.1



Here is a rather fast software rendering engine demo called FQuake.

It renders some level of the original Quake game.


The special thing about this renderer is that it runs purely on the CPU, without using the GPU. It is a port of an original renderer I wrote back in 1997, which ran on ARM processors and was based on a reverse-engineered format of the PC game data. In contrast to those original versions, this version does texture mapping in software with bilinear interpolation instead of point sampling. I wrote this software to learn about SSE and how fast it can be. I then used that knowledge to write a software version of my volume rendering engine at Agfa.
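The difference between the two sampling modes can be sketched as follows, using a single-channel texture and the assumption that texel centres sit at integer coordinates (the engine itself does this with SSE on color texels):

```python
import math

def point_sample(tex, u, v):
    # Pick the nearest texel.
    h, w = len(tex), len(tex[0])
    x = min(max(int(round(u)), 0), w - 1)
    y = min(max(int(round(v)), 0), h - 1)
    return tex[y][x]

def bilinear_sample(tex, u, v):
    # Blend the four texels surrounding (u, v).
    h, w = len(tex), len(tex[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 10.0],
       [20.0, 30.0]]
nearest = point_sample(tex, 0.2, 0.2)
blended = bilinear_sample(tex, 0.5, 0.5)
```

Bilinear sampling reads four texels and does three blends per pixel instead of one read, which is why it is the more demanding mode.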


This demo engine is highly optimized making use of multiple threads and SSE code.

Perspective-correct, bilinear texture mapping runs at 650 Mpix/s on a Core 2 Quad at 3.2 GHz.

A 64-bit version is also included, running 15% faster at 750 Mpix/s.


For comparison with GPU rendering, a DX10 version is included. On a GTX280 GPU, rendering is about six times faster than native FQuake CPU rendering. A DX10 WARP software rendering version is also included; this CPU rendering version runs about 5 times slower than native FQuake CPU rendering. Finally, a DX9 version is included, enabling comparison with the SwiftShader software renderer, whose rendering speed is similar to WARP.


When looking at CPU usage, one can see that utilization is only between 80 and 90%. The cause seems to be the copying of the rendered image from CPU memory to GPU memory, done by the graphics card drivers. After upgrading to an HD 5870 graphics card I noticed that rendering speed is now 800 Mpix/s, as this card seems to have a more efficient way of copying than the GTX280.


The engine makes use of an algorithm that ensures zero overdraw.

At 2560x1600 resolution the engine runs at between 120 and 160 frames per second, depending only slightly on scene complexity. This corresponds to a texture and pixel fill rate of between 500 and 650 Mpix/s.


The mapped texture consists of two layers, a material map and a light map. Though normally you would do this with multi-texturing, the engine does it with single texturing. To make this possible, an LRU texture cache is maintained, compositing material/light texture maps on the fly as needed.
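A minimal sketch of such a cache is shown below. The class name, the key type, the flat texel lists and the use of a per-texel multiply to combine material and light are all assumptions for illustration; the engine's real data layout is not described here.

```python
from collections import OrderedDict

class CompositeTextureCache:
    """LRU cache of material*light composites (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest entry first
        self.misses = 0

    def get(self, key, material, light):
        if key in self.entries:
            self.entries.move_to_end(key)          # mark as most recently used
            return self.entries[key]
        self.misses += 1
        # Composite on demand: modulate each material texel by its light value.
        combined = [m * l for m, l in zip(material, light)]
        self.entries[key] = combined
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used
        return combined

cache = CompositeTextureCache(capacity=2)
wall = cache.get("wall", [100, 200], [0.5, 1.0])
cache.get("floor", [50, 50], [1.0, 1.0])
cache.get("wall", [100, 200], [0.5, 1.0])     # hit, "wall" becomes most recent
cache.get("door", [80, 80], [0.25, 0.25])     # evicts "floor"
```

With the composite cached, the inner rasterizer loop only has to sample a single texture per pixel.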


To make optimal use of all CPU cores, the screen is split up according to the number of cores. The splitting positions on the screen are continuously moved to adapt to the scene complexity, so that all cores are maximally loaded.
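One possible way to move the split positions is sketched below. It assumes the screen is cut into horizontal bands, that the render time of each band was measured in the previous frame, and that the cost is spread evenly inside a band; the engine's actual strategy may differ.

```python
def rebalance(bounds, times):
    # bounds: band edges [0, b1, ..., height]; times: positive render time per band.
    # Returns new edges so that each band gets an equal share of the estimated cost.
    total = sum(times)
    n = len(times)
    new_bounds = [bounds[0]]
    band = 0
    done = 0.0                         # cost of all bands before `band`
    for k in range(1, n):
        target = total * k / n
        while done + times[band] < target:
            done += times[band]
            band += 1
        frac = (target - done) / times[band]
        new_bounds.append(bounds[band] + frac * (bounds[band + 1] - bounds[band]))
    new_bounds.append(bounds[-1])
    return new_bounds

# Two cores: the top half took three times as long as the bottom half.
shifted = rebalance([0, 50, 100], [3.0, 1.0])
# Four cores with equal times: the split positions stay where they are.
unchanged = rebalance([0, 25, 50, 75, 100], [1.0, 1.0, 1.0, 1.0])
```

In the first case the split moves up to about row 33, so the expensive top band shrinks and the cheap bottom band grows.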


You can fly through the scene, clicking the left and right mouse buttons for forward and backward movement. Holding the middle button together with left/right gives four times the speed.

You can also switch between bilinear and point sampling, by pressing the space bar.


The image is displayed via DirectDraw. For some reason, on systems with dual screens the rendering can be slow. To get normal rendering speed you may have to disable one of the screens. For graphics cards with PCIe 1.x, the rendering speed will be limited to 500 Mpix/s by the 2 GB/s graphics bus.
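The 500 Mpix/s ceiling follows from the bus bandwidth if each pixel takes 4 bytes (32-bit color), which is an assumption here:

```python
BUS_BYTES_PER_SECOND = 2e9      # 2 GB/s, as stated above
BYTES_PER_PIXEL = 4             # assumption: 32-bit pixels

max_mpix_per_second = BUS_BYTES_PER_SECOND / BYTES_PER_PIXEL / 1e6    # 500.0

# At 2560x1600 the same bus caps the frame rate at roughly 122 frames per second.
frame_bytes = 2560 * 1600 * BYTES_PER_PIXEL
max_frames_per_second = BUS_BYTES_PER_SECOND / frame_bytes
```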








20 Jul 2009         jvlietinck <at> gmail <dot> com