Lattice-Boltzmann-Model

This repository contains a 3D implementation of a Lattice-Boltzmann model on a D3Q19 or D3Q27 lattice for high Reynolds number flow.

The code is written in NVIDIA's CUDA Fortran and runs on a single core, on multiple cores using OPEN-MP or MPI, on a single GPU, or on multiple GPUs using MPI, with the optimization effort aimed primarily at single- and multi-GPU runs. The code runs in single precision by default, but a compile flag builds a double-precision version.

The collision operator is a single-relaxation-time operator with relaxation to a third-order Hermite expansion of the equilibrium distribution, similar to the approach taken by Jacob et al. (2018) and Feng et al. (2018), but excluding the hybrid regularization using finite differences, which I have not yet needed. It is also possible to run the code for viscous flow without regularization and turbulence closure, using a second-order BGK expansion.
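For reference (this is the generic textbook form, given here as a reminder rather than the exact expression coded in the model), the equilibrium distribution expanded to third order in Hermite polynomials reads, with lattice sound speed $c_s$, weights $w_i$, and discrete velocities $\mathbf{c}_i$:

$$
f_i^{\mathrm{eq}} = w_i\,\rho\left[\,1
+ \frac{\mathbf{c}_i\cdot\mathbf{u}}{c_s^2}
+ \frac{(\mathbf{c}_i\cdot\mathbf{u})^2 - c_s^2\,|\mathbf{u}|^2}{2 c_s^4}
+ \frac{(\mathbf{c}_i\cdot\mathbf{u})^3 - 3 c_s^2\,|\mathbf{u}|^2\,(\mathbf{c}_i\cdot\mathbf{u})}{6 c_s^6}\right],
$$

and the second-order BGK expansion mentioned above simply drops the last term.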

The turbulence closure scheme is the one described by Vreman (2004).
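As a minimal illustrative sketch (not the model's actual routine, which is fused into the postcoll kernel; the function name and the constant value below are assumptions), the Vreman (2004) eddy viscosity at one grid point can be computed from the resolved velocity-gradient tensor as:

function vreman_viscosity(dudx) result(nu_e)
   ! Sketch of the Vreman (2004) eddy viscosity; dudx(i,j) = du_j/dx_i.
   implicit none
   real, intent(in) :: dudx(3,3)
   real             :: nu_e
   real, parameter  :: cvreman = 0.07   ! ~2.5*C_Smagorinsky**2 (illustrative value)
   real, parameter  :: delta   = 1.0    ! filter width (lattice units)
   real    :: beta(3,3), alpha2, bbeta
   integer :: i, j
   alpha2 = sum(dudx*dudx)                              ! alpha_ij alpha_ij
   do j = 1, 3
      do i = 1, 3
         beta(i,j) = delta**2*sum(dudx(:,i)*dudx(:,j))  ! beta_ij = delta^2 alpha_mi alpha_mj
      end do
   end do
   bbeta = beta(1,1)*beta(2,2) - beta(1,2)**2 &
         + beta(1,1)*beta(3,3) - beta(1,3)**2 &
         + beta(2,2)*beta(3,3) - beta(2,3)**2
   if (alpha2 > tiny(1.0)) then
      nu_e = cvreman*sqrt(max(bbeta, 0.0)/alpha2)       ! vanishes for uniform flow
   else
      nu_e = 0.0
   end if
end function vreman_viscosity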

The model boundary conditions are periodic, or closed no-slip or free-slip two-timestep bounce-back, in the i-, j-, and k-directions. Additionally, there are inflow-outflow conditions in the i-direction.

The code allows for inserting solid bodies within the model domain to simulate, e.g., flow around an airfoil or a cylinder.

Additionally, there is a complete implementation of an actuator line model for the NREL 5-MW wind turbine on an arbitrary rotor plane, and it is possible to include multiple turbines at any location in the model domain.

Inflow turbulence is introduced at a section just inside the inflow boundary at i=1 (typically at the slice i=10) by applying a pseudo-random force, smooth in space and time, on the fluid.

The model also allows for including buoyancy forcing by advecting potential temperature as a tracer.

The forcing scheme used for the inflow turbulence, the turbines, and the buoyancy forcing is that of Kupershtokh (2009).
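For context (a standard statement of the exact-difference method, not necessarily the exact coded form), the scheme adds to each distribution the change in the equilibrium caused by the velocity shift produced by the body force $\mathbf{F}$ over one time step:

$$
\Delta f_i = f_i^{\mathrm{eq}}(\rho,\mathbf{u}+\Delta\mathbf{u}) - f_i^{\mathrm{eq}}(\rho,\mathbf{u}),
\qquad \Delta\mathbf{u} = \frac{\mathbf{F}\,\Delta t}{\rho}.
$$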

Release notes:

(Jan 2026): Code upgraded to allow for MPI parallelization and buoyancy forcing

Previous version

  • As this release is a major upgrade, I may have introduced some issues. The previous operational code is stored in the version_gpu branch.
  • The NetCDF diagnostics dump needs an update for use with MPI tiles and the buoyancy (potential temperature) variable.

MPI parallelization

  • To use MPI, compile with MPI=1 (make -B MPI=1 CUDA=1 for multiple GPU boards, and make -B MPI=1 for multiple CPUs). The MPI parallelization should scale almost linearly on multiple CPUs and GPUs and is much more efficient than using OPEN-MP.
  • The MPI parallelization splits the model domain into a number of tiles in the j-direction.
  • Edit mod_dimensions, where ny is now the local tile dimension and nyg is the global grid dimension in the j-direction. Note that nyg = nrtiles x ny (see the illustrative snippet at the end of this list).
  • I have completely rewritten the actuator-line model for MPI parallelization, allowing blades and forces to intersect tile boundaries. You can also add a tilt and yaw angle to the rotor in infile.in, where the format for turbines is slightly updated.
  • Restart and diagnostic files now include the tile number in the file name, e.g., tec_0002_020050.plt and restart_0002_020050.uf, which refer to a diagnostic file and a restart file for tile number 2 at timestep 20050. For plotting, you now have to read multiple files per timestep. For consistency, all runs, also those without MPI parallelization, include a tile number in the file name (0000 in the non-MPI case). Due to the larger number of output files, they are now dumped into the directories output, restart, and testing.
  • To start an MPI simulation to run on 4 GPUs or 4 CPUs, type
   ulimit -s unlimited (when running on CPUs)
   mpirun -np 4 boltzmann
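
As an illustration of the tiling (a hypothetical excerpt in the spirit of mod_dimensions.F90; the actual file and values will differ), the relation between the local and global j-dimensions is simply:

! Hypothetical example only; the real mod_dimensions.F90 may look different.
integer, parameter :: nrtiles = 4            ! number of MPI tiles in the j-direction
integer, parameter :: ny      = 128          ! local tile dimension in j
integer, parameter :: nyg     = nrtiles*ny   ! global grid dimension in j (here 512)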

Buoyancy forcing

  • infile.in now includes additional lines specifying whether buoyancy forcing is included and whether the atmospheric boundary layer is stable, neutral, or unstable.
 # Atmospheric boundary layer
 2                ! iablvisc      : mode for ABL (0-none, 1-mechanical layer, 2-buoyancy scheme)
 400.0            ! ablheight     : height of ABL in meters
 1                ! istable       : Stability of ABL when iablvisc=2 (1=stable, 0=neutral, -1=unstable)
  • When buoyancy forcing is included, the model carries an additional macroscopic potential-temperature variable that is advected and diffused by solving an advection-diffusion equation with an RK2 method using upstream spatial differences for the advection term (a minimal 1D sketch is given after this list). The inputs to the advection-diffusion solver are the macroscopic fluid velocities and the eddy diffusivity calculated from tau, with some constraints introduced for stability.
  • The buoyancy force is a global force applied at every grid point (which adds to the computational cost).
  • The stable case uses Dirichlet boundary conditions on inflow and the closed boundaries, and a radiation condition for the outflow.
  • The unstable case uses periodic conditions in the x direction, a Neumann condition for heat flux on the surface boundary, and a Dirichlet condition for constant temperature at the top boundary.
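
The sketch below shows, for a 1D field with hypothetical names, the kind of RK2 (midpoint) update with first-order upstream differences for the advection term that the bullet above refers to; it is not the model's actual solver.

subroutine rk2_step(theta, u, kappa, dx, dt, n)
   ! Illustrative only: one RK2 (midpoint) step of
   ! d(theta)/dt = -u d(theta)/dx + kappa d2(theta)/dx2
   implicit none
   integer, intent(in)    :: n
   real,    intent(in)    :: u(n), kappa(n), dx, dt
   real,    intent(inout) :: theta(n)
   real :: k1(n), thalf(n), k2(n)

   call tendency(theta, u, kappa, dx, n, k1)
   thalf = theta + 0.5*dt*k1                       ! midpoint state
   call tendency(thalf, u, kappa, dx, n, k2)
   theta = theta + dt*k2                           ! full step with midpoint tendency
contains
   subroutine tendency(t, uu, kk, h, m, dtdt)
      integer, intent(in)  :: m
      real,    intent(in)  :: t(m), uu(m), kk(m), h
      real,    intent(out) :: dtdt(m)
      integer :: i
      dtdt = 0.0                                   ! boundary points handled elsewhere
      do i = 2, m-1
         if (uu(i) >= 0.0) then                    ! first-order upstream (upwind) advection
            dtdt(i) = -uu(i)*(t(i) - t(i-1))/h
         else
            dtdt(i) = -uu(i)*(t(i+1) - t(i))/h
         end if
         dtdt(i) = dtdt(i) + kk(i)*(t(i+1) - 2.0*t(i) + t(i-1))/h**2
      end do
   end subroutine tendency
end subroutine rk2_step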

Notes

  • The forcing_apply routine currently uses only a second-order Hermite expansion to save some CPU time. Revert to third order in case of instabilities related to strong forces leading to high Mach numbers.

  • Note also the minor change in infile.in where lnodump is replaced with ldump, and where ldump needs to be true for saving diagnostics and restart files.

 T                ! ldump         : Saving of diagnostics and restart files

(Oct 2025): The latest pushes have been major upgrades. The speedup on the GPU is now around 100 relative to a single CPU core.

  • The code uses pointers to switch between f1 and f2 every second timestep (a minimal sketch of the swap is given after this list). This significantly simplifies the implementation of boundary conditions and reduces the load on the GPU in the postcoll routine.
  • postcoll now replaces fequil2, regularization, vreman and collisions, which are all done in one common kernel.
  • The actuator-line model was a mess, and I have now cleaned it up and retested it.
  • I have changed the format of the infile.in. Also, if there is no infile.in present, Boltzmann will generate one for you.
  • I have developed a relatively robust test environment. If you activate ltesting in infile.in, the code will dump the whole distribution function f to a file at the end of the simulation, e.g., testing000200.uf if you run 200 time steps. All subsequent 200-time-step runs will then compute the difference between the latest simulation and the reference testing000200.uf file. Tolerances of RMSE=0.1E-06 and MAXERR=0.1E-05 are acceptable in single precision.
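
A minimal sketch of the ping-pong pointer swap (array and routine names and the dimensions are placeholders, not the model's actual code):

! Illustrative only: alternate the roles of two distribution-function buffers
! with pointers so that f is never copied between timesteps.
real, allocatable, target :: fa(:,:,:,:), fb(:,:,:,:)
real, pointer             :: f1(:,:,:,:), f2(:,:,:,:), ftmp(:,:,:,:)
integer :: it
allocate(fa(nl,nx,ny,nz), fb(nl,nx,ny,nz))   ! dimensions are placeholders
f1 => fa
f2 => fb
do it = 1, nt                    ! nt timesteps
   call postcoll(f2, f1)         ! hypothetical call: read f1, write f2
   ftmp => f1                    ! swap the pointers instead of copying the arrays
   f1   => f2
   f2   => ftmp
end do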

License

This project is dual-licensed to accommodate both academic/research and commercial use:

  1. Academic / Research License

    • Free for non-commercial academic and research purposes.
    • Based on Apache License 2.0 terms (with non-commercial restriction).
    • See LICENSE-Academic.txt for full details.
    • Commercial use is not permitted under this license.
  2. Commercial / Proprietary License

Summary: Academic/research users can use the software freely under the Academic License. Commercial users must obtain a commercial license before using the software in any for-profit or proprietary project.

A note on the forcing function

Previously, for the turbine forcing, it was also possible to use the forcing formulation of Guo et al. (2002). However, when using regularization, we project the non-equilibrium distribution onto the third-order Hermite polynomials. In the Guo scheme, the equilibrium distribution is computed on the forcing-updated velocities, and the resulting non-equilibrium part is poorly represented by the Hermite polynomials, so we partly lose the effect of the forcing. Thus, in the Guo scheme with regularization, it is necessary to compute the regularization first on R(fneq)=R(f-feq(u)) and then recover f=feq(u)+R(fneq) before computing the updated forcing velocities u+du and the forcing distribution df. Next, we must compute feq(u+du) and fneq(u+du)=f-feq(u+du), which go into the collision and vreman calls. The cost of Guo is therefore much higher, as it requires two calls to fequil, an extra computation of f=feq+R(fneq), and an extra computation of fneq=f-feq(u+du). It is possible to reduce the computational cost by updating only at the turbine locations, but for now, it is not worth the effort.
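
In outline (shorthand names only, mirroring the description above, not actual code), the extra steps required by Guo forcing under regularization are:

! Outline only (shorthand names, mirroring the text above):
! Guo forcing combined with regularization:
!   fneq = f - feq(u)                  ! non-equilibrium part on the pre-forcing velocity
!   f    = feq(u) + R(fneq)            ! recover f after regularization
!   du, df                             ! forcing velocity shift and Guo forcing distribution
!   feq2 = feq(u + du)                 ! second equilibrium, on the forcing-updated velocity
!   fneq = f - feq2                    ! recomputed non-equilibrium part
!   ...feq2, fneq, and df then go into the collision and vreman calls,
!   i.e., two fequil calls plus the extra f and fneq updates.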


Installation:

1. Building the Project

If you plan to collaborate on or contribute to the project, use the Advanced installation option (1b).

1a. Basic installation

Create a directory and clone the repository:

git clone git@github.com:geirev/LBM.git

1b. Advanced installation

  • Make a personal github account unless you already have one.
  • Fork the LBM repository.
  • Next, clone the forked repository, replacing <userid> with your GitHub user ID.
  • Set upstream to the original repository.
git clone git@github.com:<userid>/LBM.git
pushd LBM
git remote add upstream git@github.com:geirev/LBM
popd

or, if you have not set up git over SSH (see instructions below):

git clone https://github.com/<userid>/LBM.git
pushd LBM
git remote add upstream https://github.com/geirev/LBM
popd

If you are new to Git, read the section Git instructions

2. Required Packages

FFTW3 Package

sudo apt-get -y update
sudo apt-get -y install libfftw3-dev  # fft library used when sampling pseudo-random fields

Gfortran compiler

sudo apt-get -y install gfortran

NVIDIA CUDA NVfortran

NVIDIA CUDA Fortran compiler and utilities installation:

Install the NVIDIA HPC SDK from https://developer.nvidia.com/hpc-sdk-downloads

Nvidia CUDA toolkit and drivers can be installed from https://developer.nvidia.com/cuda-downloads

Add the following to your .bashrc file:

# path to executable
PATH="$PATH:$HOME/bin"
# nvfortran paths
export NVHPC=/opt/nvidia/hpc_sdk
export PATH=$NVHPC/Linux_x86_64/2025/compilers/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC/Linux_x86_64/2025/math_libs/lib64:$LD_LIBRARY_PATH

Optional: mpi

Optionally, you can compile with Open MPI to run on multiple GPUs or CPUs by installing

sudo apt install openmpi-bin libopenmpi-dev

and add the following to your .bashrc file:

export PATH="$PATH:$HOME/bin"
# nvfortran paths
export PATH=$NVHPC/Linux_x86_64/2025/compilers/bin:$PATH
export PATH=$NVHPC/Linux_x86_64/2025/comm_libs/mpi/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC/Linux_x86_64/2025/comm_libs/mpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$NVHPC/Linux_x86_64/2025/math_libs/lib64:$LD_LIBRARY_PATH
export HWLOC_HIDE_ERRORS=1
# Clean NVIDIA MPI defaults (no IB hardware)
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=sm,tcp,cuda_copy,cuda_ipc

Optional: Netcdf

Optionally, you can use NetCDF for output; if you wish to use NetCDF, you must install it on your system.

sudo apt install netcdf-bin libnetcdf-dev libnetcdff-dev

In case of problems, you can use the included installation script, which compiles NetCDF manually:

./bin/install_netcdf.sh

Some editing of paths and makefile may be necessary in this case.

3. Compile the LBM code

For GPU compilation, run 'nvidia-smi' or 'lshw -C display' to identify your GPU card. Then check the compute capability of your GPU in the table at https://developer.nvidia.com/cuda-gpus. Note also the list of old/legacy GPU cards at https://developer.nvidia.com/cuda-legacy-gpus, and be aware that some old GPU cards may require you to install older versions than nvhpc-25-7.

Navigate to the src folder, open the makefile, and specify the correct -gpu=ccXX flag.

cd LBM/src
vi makefile

Typically, you will only need to edit the file mod_dimensions.F90 to set the grid dimensions before compiling.

Then compile and install the executable in the target directory, which defaults to $HOME/bin, so make sure $HOME/bin exists and is included in your path.

The default is compilation for a single core in single precision on the D3Q27 lattice using nvfortran/pgf90:

make -B

where the -B option forces compilation from scratch.

When running on a single core or using OPEN-MP, gfortran generates a faster executable than nvfortran, so use

make -B GFORTRAN=1

to generate an executable for a single core in single precision on D3Q27 lattice using gfortran.

Similarly to compile for OPEN-MP in single precision on D3Q27 lattice using gfortran

make -B GFORTRAN=1 MP=1

nvfortran compilation for single core in single precision on D3Q19 lattice

make -B D3Q19=1

nvfortran compilation for OPEN-MP in single precision on D3Q27 lattice

make -B MP=1

nvfortran compilation for CUDA GPU in single precision and D3Q27 lattice

make -B CUDA=1

nvfortran compilation for CUDA GPU in single precision and D3Q27 lattice with MPI parallelization

make -B CUDA=1 MPI=1

nvfortran compilation for MPI parallelization on CPUs in single precision and D3Q27 lattice

make -B MPI=1

Compilation for CUDA GPU in double precision and D3Q27 lattice

make -B DP=1 CUDA=1

To recompile from scratch, add the -B flag (necessary if you switch between parallelization settings such as CUDA or OPEN-MP).

E.g., recompile for GPU in double precision using the D3Q19 lattice:

make -B CUDA=1 DP=1 D3Q19=1

To compile and link with the NetCDF library, compile as

make -B CUDA=1 NETCDF=1

Some editing of the paths to the NetCDF libraries, etc., in the makefile might be necessary.

Running in single precision is about three times faster than using double precision.

Running on the D3Q19 lattice reduces the CPU time by around 40% but seems to introduce some noise for high Reynolds number flow.

A test running 200 time steps on a single CPU core, with OPEN-MP, and on the GPU for a domain of 121x121x928 gave the following wall times:

single-core     (gfortran)  :  753.83 s   (make -B GFORTRAN=1)
single-core     (nvfortran) : 1096.98 s   (make -B)
open-mp 18 cores (nvfortran):  176.39 s   (make -B MP=1)
open-mp 24 cores (gfortran) :  140.07 s   (make -B GFORTRAN=1 MP=1)
GPU             (nvfortran) :    7.00 s   (make -B CUDA=1)

The simulations were run on a Lenovo Legion 7 Pro laptop with a Core Ultra 9 275HX CPU (24 independent cores) and an NVIDIA RTX 5090 GPU.

These timings clearly show that the GPU is the optimal choice for heavy simulations, while CPU scaling beyond 10-16 cores is inefficient. Also, for single-core and OPEN-MP runs, gfortran is faster than nvfortran with the compiler flags currently used.

4. Run the code

Start by defining the required grid dimensions in the src/mod_dimensions.F90 file, and compile.

Create a separate run directory, preferably on a large scratch disk or work area, e.g.,

mkdir rundir
cd rundir
ulimit -s unlimited
boltzmann

The model will then generate a template infile.in file. Spend some time understanding the inputs you can provide through the infile.in.

If you want to run with wind turbines, link or copy the Airfoils directory to the rundir and define a number of turbines larger than 0 with their grid locations at the bottom of the infile.in.

The example infile.in corresponds to a 2D city case with flow through the city, and the program should run without any other input files. Just choose the city2 case in mod_dimensions.F90. You can also use this grid for the cylinder case by changing city2 to cylinder in the infile.in. For more realistic 3D runs, increase the number of vertical grid points.

The example/run.sh script may be required for large grids, as it sets ulimit -s unlimited and also defines the number of cores used in OPEN-MP simulations.

The example/uvel.orig file defines an atmospheric boundary layer if it is found in the run directory.

In summary, to execute the code on a single core:

ulimit -s unlimited
boltzmann

To run the code on 18 cores using OPEN-MP:

export OMP_NUM_THREADS=18
ulimit -s unlimited
boltzmann

To run on the GPU it's just:

make -B CUDA=1
boltzmann

To run on multiple GPUs (4 in this case) using MPI:

make -B CUDA=1 MPI=1
mpirun -np 4 ./boltzmann

Note the definition of ny_global vs ny in mod_dimensions.F90.

To run on multiple CPU cores (24 here) using MPI:

make -B MPI=1
ulimit -s unlimited
mpirun -np 24 --bind-to core ./boltzmann

Note the definition of ny_global vs ny.

To run on 24 CPU cores using MPI (12 ranks) combined with OPEN-MP on two cores per rank:

make -B MPI=1 MP=1
export OMP_NUM_THREADS=2
mpirun -np 12 --map-by ppr:12:socket:pe=2 --bind-to core  ./boltzmann

Note the definition of ny_global vs ny.

5. Code profiling

To profile the code, run, e.g.,

nsys profile --stats=true boltzmann

or on prehistoric GPU architectures

nvprof boltzmann

which gives a detailed listing of the CPU time used by each kernel.

To obtain more realistic profiling, removing the device-to-host copying that starts dominating for short GPU runs, set ldump (formerly lnodump, with the opposite meaning) to false and ltiming to false in infile.in. This avoids writing diagnostic and restart files and eliminates all the syncs before and after kernel launches, which the profiler picks up but which have little impact on the total simulation time.

6. Plotting

The boltzmann program stores all diagnostic output files in a directory named output. There are a lot of Tecplot files, in particular when using MPI, as we store every tile in its own file. Similarly, we store the restart files in a directory named restart and the testing files in the directory testing.

The current code version outputs Tecplot plt files read by tec360.

The plotting routine is m_diag.F90, which dumps a file tecGRID.plt containing the grid layout, i.e., the i, j, and k indices and the blanking variable. For each solution time, the routine saves the density and the three velocity components at each grid point. When using Tecplot, one must load the tecGRID.plt file as the first file and add any number of solution files. Thus, the saved diagnostics are minimal, and we compute the absolute velocity, vorticity, and Q-criterion diagnostics within Tecplot using the loaded velocity fields.
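
For reference, the vorticity and Q-criterion evaluated in Tecplot from the saved velocity field are the standard definitions:

$$
\boldsymbol{\omega} = \nabla\times\mathbf{u},\qquad
Q = \tfrac{1}{2}\left(\|\boldsymbol{\Omega}\|^2 - \|\mathbf{S}\|^2\right),
$$

where $\mathbf{S}$ and $\boldsymbol{\Omega}$ are the symmetric and antisymmetric parts of the velocity gradient $\nabla\mathbf{u}$.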

Additionally, it is possible to compute averages over any number of time steps, and these are then saved to tecAVERAGE.plt. This file also contains the turbulent kinetic energy.

An alternative to Tecplot is the open-source program ParaView, which also reads Tecplot files. However, ensure that the option tecout is set to 0 to dump full Tecplot files.

Note also the option to output NetCDF files by setting tecout to 3 in infile.in and compiling with NETCDF=1.


7. Code standards

If you plan to change the code, note the following:

I always define subroutines in new modules:

module m_name_of_subroutine
! define global variables here
contains
subroutine name_of_subroutine
! define local variables here
...
end subroutine
end module

In the main program, you write

program name
use m_name_of_subroutine
call  name_of_subroutine
end program

The main program then has access to all the global variables defined in the module and knows the interface of the subroutine, so the compiler checks the consistency between the call and the subroutine definition.

make new  -> updates the dependencies for the makefile
make tags -> runs ctags (useful if you use vim)

The current makefile updates the dependencies at every compilation, so if you add a file with a new subroutine you can just type make and it will be included in the compilation.

For this to work, install the scripts in ./bin somewhere in your path and install ctags.


8. Git instructions

When working with git repositories other than the ones you own, and when you expect to contribute to the code, a good way to organize your git project is described in https://opensource.com/article/19/7/create-pull-request-github. A general Git tutorial is also a good read.

This organization allows you to make changes and propose them for inclusion in the original code through a pull request.

So, you need a GitHub account. Then you fork the repository to your account (make your personal copy of it) using the fork button on github.com. You then clone this to your local system, where you can compile and run.

git clone https://github.com/<YourUserName>/LBM
cd LBM
git remote add upstream https://github.com/geirev/LBM
git remote set-url origin git@github.com:<YourUserName>/LBM   #   optional: switch origin to ssh
git remote -v                   #   should list both origin and upstream

To keep your local main branch up to date with the upstream code (my original repository)

git switch main             #   unless you are already there
git fetch upstream              #   get info about upstream repo
git merge upstream/main       #   merges upstream main with your local main

If you want to make changes to the code:

git switch -c branchname      #   Makes a new branch and moves to it

Make your changes

git add .                       #   In the root of your repo, stage for commit
git status                      #   Tells you status
git commit                      #   Commits your changes to the local repo

Push to your remote origin repo

git push -u origin branchname   #   FIRST TIME to create the branch on the remote origin
git push                        #   Thereafter: push your local changes to your forked  origin repo

To make a pull request:

  1. Commit your changes on the local branch
git add .                       #   In the root of your repo, stage for commit
git status                      #   Tells you status
git commit -m"Commit message"   #   Commits your changes to the local repo
git commit --amend              #   Add changes to previous commit
git push --force                #   If using --amend and previous commit was pushed
  2. Update the branch where you are working to be consistent with the upstream main
git switch main             #   unless you are already there
git fetch upstream              #   get info about upstream repo
git merge upstream/main       #   merges upstream main with your local main
git switch branchname         #   back to your local branch
git rebase main               #   your branch is updated by adding your local changes to the updated main
  3. Squash commits into one (if you have many commits)
git log                      #   lists commits
git rebase -i indexofcommit  #   index of commit before your first commit

Change pick to squash for all commits except the first (oldest) one, save, and then write a new unified commit message.

git push --force             #   force push branch to origin
  4. Open github.com, choose your branch, make a pull request, and check that there are no conflicts.

Then we are all synced.

If you manage all this, you are a git guru. Every time you need to know something, just search for "git how to do something"; there are tons of examples out there.

For advanced users: set up an SSH key so you don't have to type a password every time you push to your remote repo. Check the settings/keys tab on GitHub and follow the instructions in https://help.github.com/en/github/using-git/changing-a-remotes-url

To make your Linux terminal show the current branch in the prompt, include the following in your .bashrc:

parse_git_branch() {
     git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1)/'
}
export PS1="\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[1;31m\]\w\[\033[0;93m\]\$(parse_git_branch)\[\033[0;97m\]\$ "
