Skip to content

Examples

Lennard-Jones Langevin droplet

examples/lj_langevin/ simulates a self-bound Lennard-Jones liquid droplet (non-periodic) with a Langevin thermostat (Allen–Tildesley integrator, reduced units), writing an XYZ trajectory. It comes in three implementations that share the physics but make different points — and demonstrate that the library stays neighbour-list-only: all Lennard-Jones code lives in the examples.

  • Python NumPy/CuPy (lj_langevin.py)prototyping. The neighbour list returns the distance vectors D, so the LJ force is a few array operations and the per-atom forces are scattered with a bincount-per-component accumulation. The same code runs on CPU (NumPy) or GPU (CuPy) via --device.
  • Python JAX (lj_langevin_jax.py) — uses the dense fixed-capacity neighbour_matrix (array_namespace=jax.numpy) so shapes are static and the per-step force + Langevin update jit-compile once; forces are a masked sum over the neighbour axis (no scatter). See the note below.
  • C++ (lj_langevin_cpu.cc / lj_langevin_gpu.cu)performance. The neighbour list supplies only the ij connectivity; a single fused pass recomputes distances and accumulates the LJ force, never materialising per-pair arrays.
  • Python Warp (lj_langevin_warp.py)interop. The LJ force/energy and the Langevin integrator are NVIDIA Warp kernels (compiled once, launched every step), but the neighbour list is built by this library — or, for comparison, by vesin or the classic matscipy 1.2.0 package (--neighbours {matscipy,matscipy-classic,vesin}). Positions live in one device buffer that Warp wraps zero-copy through DLPack and the list builder reads directly; the fused kernel recomputes each pair's distance and atomically accumulates the force on i. Timing is broken down per phase with muTimer (build list / LJ force / integrate), with the neighbour-list build reported separately.

The array and Warp examples also take --neighbours {matscipy,matscipy-classic,vesin}, so the Benchmark compares this library's list against the classic matscipy 1.2.0 package and vesin feeding the same kernels.

Running

# Python (extension on PYTHONPATH)
python examples/lj_langevin/lj_langevin.py     --device cpu --steps 2000 --out traj.xyz
python examples/lj_langevin/lj_langevin.py     --device gpu --steps 2000 --out traj.xyz
python examples/lj_langevin/lj_langevin_jax.py --device cpu --steps 2000 --out traj.xyz

# Warp kernels + this library's neighbour list (or vesin, --neighbours vesin)
python examples/lj_langevin/lj_langevin_warp.py --device gpu --neighbours matscipy --atoms 2000

# C++ (BUILD_EXAMPLES=ON; the GPU binary needs ENABLE_CUDA=ON)
./build/examples/lj_langevin/lj_langevin_cpu --steps 2000 --out traj.xyz
./build/examples/lj_langevin/lj_langevin_gpu --steps 2000 --out traj.xyz

The Warp example additionally needs warp-lang, muTimer, and (for the vesin comparison) vesin: pip install warp-lang muTimer vesin. On WSL, put libcuda.so on LD_LIBRARY_PATH (e.g. /usr/lib/wsl/lib) for the vesin GPU path.

Scaling benchmark

The Benchmark page sweeps logarithmically spaced droplet sizes (100, 1000, … up to the GPU memory) over the full cross-product of device (CPU/GPU), neighbour list and kernels (Warp / array / JAX / C++), with a time-vs-atoms plot faceted by kernel. The neighbour-list backends are matscipy-neighbours (this library), matscipy 1.2.0 (the classic matscipy package, the CPU reference this library descends from) and vesin. It is generated by examples/lj_langevin/benchmark.py (re-run it to refresh the numbers for your own hardware). The broad picture:

  • The neighbour-list build dominates the step, so the list choice drives the scaling: matscipy-neighbours' cell list stays close to linear on both devices, matscipy 1.2.0 is a single-threaded CPU reference, while vesin's GPU path falls behind for these large, low-density droplets.
  • The kernel choice mostly shifts the curve — the fused C++/CUDA and Warp kernels avoid materialising per-pair arrays; the array (NumPy/CuPy) path is the simplest; JAX jit-compiles a dense masked sum.
  • On the CPU the matscipy list is benchmarked single-threaded ((1t)) and multi-threaded ((mt)); the gap widens with size, and multithreading actually hurts for tiny droplets (thread-spawn overhead).

vesin and matscipy 1.2.0 only feed the Warp and array kernels (JAX needs the dense neighbour_matrix and C++ uses the in-tree core); matscipy 1.2.0 is also CPU-only. Those cells are left empty.

JAX and the dense neighbour list

JAX compiles for static array shapes, so the variable-size pair list is a poor fit: it makes XLA recompile every step (a naive eager loop is ~1.3 s/step, compile-bound, on both CPU and GPU). The dense fixed-capacity neighbour_matrix (n x K rows) gives static shapes so the per-step update jit-compiles once — which is what makes the JAX rows above fast. Pick --max-neighbours large enough for the densest atom (the call raises if it is too small). JAX defaults to float32; the example enables jax_enable_x64 (the list works in float64).