Fluid simulation (DX11/DirectCompute)

Improved version that has no more artifacts and is 70% faster.

The simulation now has a more smoke-like aspect, and the smoke bounces off the walls.

Instead of visualizing the velocity, an advected fire reaction coordinate is rendered.

The simulation speedup is realized by vectorizing some compute kernels, especially the Jacobi iterations.

Thread group size for all kernels is now 256, either 16x4x4 or 16x16x1.

Rendering was made faster by sampling a scalar volume instead of a float4 volume.

The space bar switches to showing the velocity magnitude instead of the advected smoke.

All simulation now uses second-order MacCormack, including limiters.

Here is another volume simulator, this time simulating an incompressible fluid by solving the Navier-Stokes differential equations. The simulation runs in a 200x200x200 voxel box.

The calculations make use of a well-known scheme: velocity advection, Jacobi pressure solving, and making the velocity divergence-free by subtracting the gradient of the pressure.
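The pressure solve and gradient subtraction can be sketched on a tiny 2D grid. This is a minimal Python sketch under stated assumptions, not the demo's compute kernels: the 2D periodic staggered grid (velocities stored on cell faces), the unit grid spacing and the names `divergence`, `jacobi_pressure` and `project` are all hypothetical choices made to keep the example short.

```python
def divergence(u, v, n):
    """Divergence per cell on a periodic staggered grid (u, v live on cell faces)."""
    return [[u[(i + 1) % n][j] - u[i][j] + v[i][(j + 1) % n] - v[i][j]
             for j in range(n)] for i in range(n)]

def jacobi_pressure(div, n, iterations):
    """Approximately solve laplacian(p) = div with plain Jacobi sweeps."""
    p = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        # Each sweep reads only the previous iterate, which is what makes
        # Jacobi easy to run in parallel on a GPU.
        p = [[(p[(i - 1) % n][j] + p[(i + 1) % n][j] +
               p[i][(j - 1) % n] + p[i][(j + 1) % n] - div[i][j]) / 4.0
              for j in range(n)] for i in range(n)]
    return p

def project(u, v, n, iterations=200):
    """Subtract the pressure gradient so the velocity becomes (nearly) divergence-free."""
    p = jacobi_pressure(divergence(u, v, n), n, iterations)
    for i in range(n):
        for j in range(n):
            u[i][j] -= p[i][j] - p[(i - 1) % n][j]
            v[i][j] -= p[i][j] - p[i][(j - 1) % n]
```

After `project`, calling `divergence` again should give values close to zero; how close depends on the number of Jacobi iterations.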

This sequence of advection, pressure solve and projection is the so-called Semi-Lagrangian scheme. A more accurate solver makes use of the second-order MacCormack technique. The simulation makes use of the latter. However, it makes the simulation unstable and introduces artifacts. Limiting the generated extremes can fix this; unfortunately I was not able to get this working, so the simulation runs without limiters. Still, the result is some visually interesting turbulent behavior.
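The relation between the two advection schemes, and the role of a limiter, can be sketched in one dimension. This is a minimal Python sketch rather than the demo's code: the periodic 1D grid, the constant velocity and all function names are assumptions made for illustration. The limiter shown clamps the corrected value to the range of the two samples used by the semi-Lagrangian lookup, which is one common choice and not necessarily the one used here.

```python
import math

def sample(field, x):
    """Linearly interpolate a periodic 1D field at position x (in cells)."""
    n = len(field)
    i = math.floor(x)
    t = x - i
    return (1.0 - t) * field[i % n] + t * field[(i + 1) % n]

def semi_lagrangian(field, vel, dt):
    """Trace each cell back along the velocity and sample the old field there."""
    return [sample(field, i - vel * dt) for i in range(len(field))]

def maccormack(field, vel, dt, limit=True):
    """Second-order MacCormack step built from two semi-Lagrangian passes."""
    n = len(field)
    forward = semi_lagrangian(field, vel, dt)       # advect forward in time
    backward = semi_lagrangian(forward, -vel, dt)   # advect the result back
    out = []
    for i in range(n):
        # Correct the forward result by half of the round-trip error.
        value = forward[i] + 0.5 * (field[i] - backward[i])
        if limit:
            # Clamp to the two samples that surround the back-traced point.
            j = math.floor(i - vel * dt)
            lo = min(field[j % n], field[(j + 1) % n])
            hi = max(field[j % n], field[(j + 1) % n])
            value = min(max(value, lo), hi)
        out.append(value)
    return out
```

On a step-shaped field the unlimited version overshoots and undershoots near the edges, while the limited version stays within the original value range.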

The magnitude of the velocity vectors is visualized. To make a 3D rendering, a simple maximum projection along rays is used. This shoots rays through the volume, searching for the maximum speed along each ray. The speed is then mapped to a color by linear interpolation.
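The maximum projection and the color mapping can be sketched as follows. This Python sketch assumes axis-aligned rays along z through a nested-list volume indexed as `volume[x][y][z]`, and the two end colors are made up; the demo itself casts rays through a freely rotated volume.

```python
def max_projection(volume):
    """For each (x, y) pixel, walk along z and keep the largest value."""
    return [[max(ray) for ray in row] for row in volume]

def lerp_color(value, vmax, cold=(0, 0, 255), hot=(255, 255, 0)):
    """Linearly interpolate between two RGB colors (hypothetical palette)."""
    t = 0.0 if vmax <= 0 else min(max(value / vmax, 0.0), 1.0)
    return tuple(round(c + t * (h - c)) for c, h in zip(cold, hot))
```

A frame is then `[[lerp_color(s, vmax) for s in row] for row in max_projection(volume)]`, with `vmax` the speed that should map to the hottest color.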

Controls:

- mouse left drags a source on a plane through the center of the volume and parallel to the screen

- mouse right to rotate the volume

- mouse wheel to zoom in and out

- space bar to toggle between MacCormack and Semi-Lagrangian simulation

______________________________________________________________________

26 Dec 2009 / Jan Vlietinck

3D waves simulator (DX11)

A long time ago (16 years) I wrote a 2D wave simulator, based on finite differencing the

Now, in a similar spirit, I've written a wave simulator that simulates volumetric 3D waves. These could be sound or electromagnetic waves. The simulation grid is now 400 x 400 x 400, or 64 million voxels. The simulation equation becomes slightly more complex but is still fairly simple, namely:

p(x,y,z) = 2*p'(x,y,z) - p''(x,y,z) +
           c*( p'(x-1,y,z) + p'(x+1,y,z) + p'(x,y-1,z) + p'(x,y+1,z) +
               p'(x,y,z-1) + p'(x,y,z+1) - 6*p'(x,y,z) )

With p(x,y,z) the pressure (for sound waves) at time t, p' the pressure at time t-dt, and p'' the pressure at time t - 2*dt.
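The update rule above translates almost directly into code. The following is a plain Python sketch on a small cubic grid, not the demo's compute shader; holding the boundary voxels at zero and the name `wave_step` are assumptions. The text does not define c; in the usual derivation it is (v*dt/h)^2 for wave speed v and grid spacing h, and the explicit scheme is only stable when c is at most 1/3 in 3D.

```python
def wave_step(prev, prev2, c):
    """One time step: prev is p' (time t - dt), prev2 is p'' (time t - 2*dt)."""
    n = len(prev)
    new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    # Boundary voxels are left at zero (an assumption for this sketch).
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                neighbours = (prev[x - 1][y][z] + prev[x + 1][y][z] +
                              prev[x][y - 1][z] + prev[x][y + 1][z] +
                              prev[x][y][z - 1] + prev[x][y][z + 1])
                new[x][y][z] = (2.0 * prev[x][y][z] - prev2[x][y][z] +
                                c * (neighbours - 6.0 * prev[x][y][z]))
    return new
```

Stepping the simulation means rotating three buffers: the result becomes the new p', and the old p' becomes p''.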

There is a scalar and a vectorized implementation.

On an HD 5870, scalar simulation speed is about 119 frames per second.

The vectorized version runs at 257 frames per second, corresponding to 16.5 GigaVoxels/s.
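As a quick check of the throughput figure, assuming each displayed frame corresponds to exactly one simulation step over the full 400 x 400 x 400 grid (the function name is hypothetical):

```python
def gigavoxels_per_second(grid_size, frames_per_second):
    """Voxel updates per second for a cubic grid, in units of 1e9."""
    return grid_size ** 3 * frames_per_second / 1e9
```

Under that assumption, `gigavoxels_per_second(400, 257)` gives roughly 16.4, in line with the quoted 16.5 GigaVoxels/s, and the scalar version at 119 frames per second comes to roughly 7.6.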

Only one slice of the simulation is visualized; moving the mouse vertically shows other slices.

The S key toggles between the scalar and vectorized mode.

Pressing the space bar cycles through other wave source constellations, namely two separated sources, a 4-source swirl and a 2-source dipole.

Pressing the 'Enter' key cycles the slice viewing direction through the 3 main axes (scalar mode only).

______________________________________________________________________

23 Oct 2009 / Jan Vlietinck

DX11 DirectCompute Julia 4D

Download DX11 Julia 4D fractal generator V1.4

Here is another fractal generator, this time rendering 4-dimensional Quaternion Julia fractals.

It continuously morphs the shape and colors of the fractal.

The shader code is a port of the original Cg version written by Keenan Crane.
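The iteration behind a quaternion Julia set is the familiar q -> q*q + c, carried out with quaternions instead of complex numbers. The sketch below shows only this membership test in Python; it does not attempt to reproduce the ray tracing or shading of the actual shader, and the function names, iteration count and bailout value are assumptions.

```python
def quat_square(q):
    """Square of a quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    return (w * w - x * x - y * y - z * z, 2 * w * x, 2 * w * y, 2 * w * z)

def escapes(q, c, max_iter=10, bailout_sq=4.0):
    """True if q leaves the bailout radius under q -> q*q + c."""
    for _ in range(max_iter):
        q = tuple(a + b for a, b in zip(quat_square(q), c))
        if sum(a * a for a in q) > bailout_sq:
            return True
    return False
```

Points for which `escapes` returns False are treated as belonging to the set for the chosen constant c; morphing the fractal corresponds to slowly changing c.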

The program tries to use a DX11 compute shader 5; if this is not possible, as on DX10 hardware, a pixel shader 4 is used instead of a compute shader. This enables the code to run on any DX11 or DX10 GPU.

The fractal can be rotated with the mouse to see it in all its 3D glory. Zoom with the mouse wheel.

The morphing can also be toggled with the space bar, for better inspection.

Fractal detail can be increased and decreased with the +/- keys of the numeric pad.

Self-shadowing can be toggled with the S key.

With the P key, it is possible to switch between pixel and compute shader (if available).

ALT + Enter switches from windowed to full screen.

______________________________________________________________________

11 Oct 2009 / Jan Vlietinck

DX11 DirectCompute Mandelbrot and Julia viewer

Download DX11 / AVX 2 / AVX 512 Mandelbrot and Julia viewer V2.3

Download DX11 Mandelbrot and Julia viewer V1.8

Download DX11 feature level DX10 version

Here is a quite fast Mandelbrot and Julia viewer, making use of DX11 and the DirectCompute API.

The software detects whether your GPU supports doubles; if not, it will run using only floats.

The set is calculated with up to 1024 iterations. Making use of the horsepower of DX11 GPUs enables real-time panning and zooming even at high resolution.

A scalar and a vectorized computation version are included.

Both generate the same output. The vectorized version was made after suboptimal performance on the ATI HD 5870 with scalar calculation. The vectorization is done by calculating 2x2 pixels at once.
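The idea of calculating 2x2 pixels at once can be illustrated by iterating four points in lockstep, with a per-lane flag that freezes a lane once it has escaped. This Python sketch only mimics the control flow of such a kernel; how the demo actually packs pixels into vector registers is not stated in the text, and the function name is hypothetical.

```python
def escape_counts_2x2(c_values, max_iter=1024):
    """Iterate four Mandelbrot points in lockstep, one 'lane' per pixel."""
    z = [0j] * 4
    counts = [max_iter] * 4
    active = [True] * 4
    for n in range(max_iter):
        if not any(active):
            break                      # all four lanes have escaped
        for k in range(4):
            if active[k]:
                z[k] = z[k] * z[k] + c_values[k]
                if abs(z[k]) > 2.0:
                    counts[k] = n + 1  # record the escape iteration
                    active[k] = False
    return counts
```

The loop runs as long as the slowest of the four pixels needs, which is why neighbouring pixels (with similar iteration counts) are a natural group.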

Compared to the scalar version, the vectorized version runs twice as fast on this GPU, at over 1.9 TFLOP/s.

In doubles mode, performance is less than 400 GFLOP/s.

A GTX 480 is about half as fast in float mode and about a quarter as fast in double mode, compared to an HD 5870.

Full source code is included.

Note that no drawing code was needed. It is possible to write directly to the backbuffer from the compute shader.

**Key controls**

Space bar : Toggles between Mandelbrot and Julia

M key : Toggles between 1024 and 2048 maximum iterations

A/Z keys : Cycle colors

V key : Vector calculations (only used for floats, not much effect for doubles)

S key : Scalar calculations

F key : Float calculations

D key : Double calculations

E key : Toggle between two different double versions

The first version is a straight conversion of the float version. However, ATI is currently buggy and slow for this version. The second version does less loop unrolling and works OK on ATI. The first version is about 1/3 faster (at least currently on Nvidia).

**Mouse**

Move : Pan

Move + SHIFT : In Julia mode, move base point around to get a morphing fractal

Drag left and right : Zoom in / out

Pressing the space bar switches to Julia calculation. It takes the point in the center of the Mandelbrot view as the Julia base point. Holding the SHIFT key while moving the mouse moves the Julia base point, which results in an animation with the Julia set changing shape.

Pressing the space bar again switches back to Mandelbrot calculation. The deeper the Mandelbrot view is zoomed in, the more gradually the Julia set changes shape.
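The link between the two modes is that both use the same iteration z -> z*z + c; only the roles of the pixel and the constant are swapped. A minimal Python sketch with hypothetical function names:

```python
def escape_time(z, c, max_iter=1024):
    """Number of iterations of z -> z*z + c before |z| exceeds 2."""
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def mandelbrot(pixel, max_iter=1024):
    """Mandelbrot: start at zero, the pixel supplies the constant c."""
    return escape_time(0j, pixel, max_iter)

def julia(pixel, base_point, max_iter=1024):
    """Julia: the pixel supplies the start value, c is the fixed base point."""
    return escape_time(pixel, base_point, max_iter)
```

In these terms, pressing the space bar would fix `base_point` to the complex number at the center of the current Mandelbrot view, and moving the mouse with SHIFT held would change `base_point` from frame to frame.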

DX10 support

In order to support DX10 GPUs, the DX11 feature level DX10 version should be used.

It will try to make use of compute shader 4 instead of compute shader 5.

This requires an additional pass with a pixel shader to copy the compute shader output to the screen.

In case there is no support for compute shaders, as currently is the case with Nvidia on

In this case only the scalar version of the algorithm can be used.

On a GTX280, computational throughput is 1/4 of that of an HD 5870. This is to be expected, as the former has around 600 GFLOP/s, whereas the latter has over 4 times more.

______________________________________________________________________

7 Oct 2009 / Jan Vlietinck

Fast software renderer

Here is a rather fast software rendering engine demo called FQuake.

It renders a level of the original Quake game.

The special thing about this renderer is that it runs purely on the CPU, without using the GPU. It is a port of a renderer I originally wrote back in 1997, which ran on ARM processors and was based on a reverse-engineered format of the PC game data. In contrast to the original, this version does texture mapping in software with bilinear interpolation instead of point sampling. I wrote this software to learn about SSE and how fast it can be; I used the knowledge to write a software version of my volume rendering engine at Agfa.

This demo engine is highly optimized, making use of multiple threads and SSE code.

Perspective, bilinear texture mapping runs at 650 Mpix/s on a Quad Core 2, 3.2 GHz.

A 64-bit version is also included, running 15% faster at 750 Mpix/s.

For comparison with GPU rendering, a DX10 version is included. On a GTX280 GPU, rendering is about six times faster than native FQuake CPU rendering. A DX10 WARP software rendering version is also included. This CPU rendering version runs about 5 times slower than native FQuake CPU rendering. A DX9 version is included as well, enabling comparison with the SwiftShader software renderer, which has a rendering speed similar to WARP.

When looking at CPU usage, one can see that utilization is only between 80 and 90%. The cause seems to be the copying of the rendered image from CPU memory to GPU memory by the graphics card drivers. After upgrading to an HD5870 graphics card I noticed that rendering speed is now 800 Mpix/s, as this card seems to have a more efficient way of copying than the GTX280.

The engine makes use of an algorithm that ensures zero overdraw.

At 2560x1600 resolution the engine runs at between 120 and 160 frames per second, depending only slightly on scene complexity. This corresponds to a texture and pixel fill rate of between 500 and 650 Mpix/s.
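With zero overdraw every pixel is written once per frame, so the fill rate follows directly from resolution and frame rate. A small Python check of the numbers above (the function name is hypothetical):

```python
def fill_rate_mpix(width, height, frames_per_second):
    """Pixels written per second, in millions, assuming zero overdraw."""
    return width * height * frames_per_second / 1e6
```

`fill_rate_mpix(2560, 1600, 120)` and `fill_rate_mpix(2560, 1600, 160)` come to roughly 490 and 655 Mpix/s, matching the quoted range of 500 to 650.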

The mapped texture consists of two layers, a material map and a light map. Though normally you would do this with multitexturing, the engine does it with single texturing. To make this possible, an LRU texture cache is maintained, compositing the material and light texture maps on the fly as needed.
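The caching scheme can be sketched with an ordered dictionary. This Python sketch is only an illustration of an LRU cache of composited textures: the class name, the key made of a material id and a light map id, and the `composite` callback are all assumptions, and the real engine will certainly manage memory differently.

```python
from collections import OrderedDict

class SurfaceCache:
    """LRU cache of pre-lit textures keyed by (material id, light map id)."""

    def __init__(self, capacity, composite):
        self.capacity = capacity
        self.composite = composite   # hypothetical: (material_id, lightmap_id) -> texels
        self.entries = OrderedDict()

    def get(self, material_id, lightmap_id):
        key = (material_id, lightmap_id)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        texels = self.composite(material_id, lightmap_id)
        self.entries[key] = texels
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return texels
```

The rasterizer then only ever samples one texture per surface, and the compositing cost is paid only when a surface enters the cache.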

To make optimal use of all CPU cores, the screen is split up according to the number of cores. The splitting positions of the screen are continuously moved to adapt to the scene complexity, so that all cores are maximally loaded.
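One way to move the splitting positions is to measure how long each core took for its part of the previous frame and then redistribute the rows so that the predicted cost per core is equal. The Python sketch below assumes horizontal bands of at least one row each, a first boundary at row 0, and a cost spread evenly over the rows of a band; the text does not say how the engine splits the screen or estimates cost, so this is purely illustrative.

```python
def rebalance(boundaries, times):
    """Return new band boundaries with roughly equal predicted cost per band.

    boundaries: row indices [0, b1, ..., height], one band per core.
    times: measured render time of each band in the previous frame.
    """
    bands = len(times)
    # Spread each band's measured time evenly over its rows.
    row_cost = []
    for k, t in enumerate(times):
        rows = boundaries[k + 1] - boundaries[k]
        row_cost.extend([t / rows] * rows)
    total = sum(row_cost)
    new = [0]
    running = 0.0
    for row, cost in enumerate(row_cost):
        running += cost
        # Place the next boundary once its share of the total cost is reached.
        if len(new) < bands and running >= total * len(new) / bands:
            new.append(row + 1)
    new.append(boundaries[-1])
    return new
```

For two cores on 128 rows where the top half took three times as long as the bottom half, `rebalance([0, 64, 128], [3.0, 1.0])` moves the split up to row 43, giving the core with the expensive region fewer rows.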

You can fly through the scene by clicking the left and right mouse buttons for forward and backward movement. Holding the middle button together with left/right gives quadruple speed.

You can also switch between bilinear and point sampling by pressing the space bar.

The image is displayed via DirectDraw. For some reason, on systems with dual screens the rendering can be slow. To get normal rendering speed you may have to disable one of the screens. For graphics cards with PCIe 1.x the rendering speed will be limited to 500 Mpix/s by the 2 GB/s graphics bus.
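The 500 Mpix/s bus limit is consistent with copying 4 bytes per rendered pixel to the graphics card. The pixel size is an assumption here (a 32-bit frame buffer), as is the function name:

```python
def bus_limited_mpix(bus_bytes_per_second, bytes_per_pixel=4):
    """Upper bound on displayed pixels per second, in millions."""
    return bus_bytes_per_second / bytes_per_pixel / 1e6
```

With a 2 GB/s bus, `bus_limited_mpix(2e9)` gives 500.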

Enjoy,

Jan

______________________________________________________________________

20 Jul 2009 jvlietinck <at> gmail <dot> com