Examples¶
Lennard-Jones Langevin droplet¶
examples/lj_langevin/ simulates a self-bound Lennard-Jones liquid droplet
(non-periodic) with a Langevin thermostat (Allen–Tildesley integrator, reduced
units), writing an XYZ trajectory. It comes in three implementations that share
the physics but make different points — and demonstrate that the library stays
neighbour-list-only: all Lennard-Jones code lives in the examples.
- Python NumPy/CuPy (
lj_langevin.py) — prototyping. The neighbour list returns the distance vectorsD, so the LJ force is a few array operations and the per-atom forces are scattered with abincount-per-component accumulation. The same code runs on CPU (NumPy) or GPU (CuPy) via--device. - Python JAX (
lj_langevin_jax.py) — uses the dense fixed-capacityneighbour_matrix(array_namespace=jax.numpy) so shapes are static and the per-step force + Langevin updatejit-compile once; forces are a masked sum over the neighbour axis (no scatter). See the note below. - C++ (
lj_langevin_cpu.cc/lj_langevin_gpu.cu) — performance. The neighbour list supplies only theijconnectivity; a single fused pass recomputes distances and accumulates the LJ force, never materialising per-pair arrays. - Python Warp (
lj_langevin_warp.py) — interop. The LJ force/energy and the Langevin integrator are NVIDIA Warp kernels (compiled once, launched every step), but the neighbour list is built by this library — or, for comparison, byvesinor the classicmatscipy1.2.0 package (--neighbours {matscipy,matscipy-classic,vesin}). Positions live in one device buffer that Warp wraps zero-copy through DLPack and the list builder reads directly; the fused kernel recomputes each pair's distance and atomically accumulates the force oni. Timing is broken down per phase withmuTimer(build list / LJ force / integrate), with the neighbour-list build reported separately.
The array and Warp examples also take
--neighbours {matscipy,matscipy-classic,vesin}, so the
Benchmark compares this library's list against the classic
matscipy 1.2.0 package and vesin feeding
the same kernels.
Running¶
# Python (extension on PYTHONPATH)
python examples/lj_langevin/lj_langevin.py --device cpu --steps 2000 --out traj.xyz
python examples/lj_langevin/lj_langevin.py --device gpu --steps 2000 --out traj.xyz
python examples/lj_langevin/lj_langevin_jax.py --device cpu --steps 2000 --out traj.xyz
# Warp kernels + this library's neighbour list (or vesin, --neighbours vesin)
python examples/lj_langevin/lj_langevin_warp.py --device gpu --neighbours matscipy --atoms 2000
# C++ (BUILD_EXAMPLES=ON; the GPU binary needs ENABLE_CUDA=ON)
./build/examples/lj_langevin/lj_langevin_cpu --steps 2000 --out traj.xyz
./build/examples/lj_langevin/lj_langevin_gpu --steps 2000 --out traj.xyz
The Warp example additionally needs warp-lang, muTimer, and (for the vesin
comparison) vesin: pip install warp-lang muTimer vesin. On WSL, put
libcuda.so on LD_LIBRARY_PATH (e.g. /usr/lib/wsl/lib) for the vesin GPU
path.
Scaling benchmark¶
The Benchmark page sweeps logarithmically spaced droplet sizes
(100, 1000, … up to the GPU memory) over the full cross-product of device
(CPU/GPU), neighbour list and kernels (Warp / array / JAX / C++), with a
time-vs-atoms plot faceted by kernel. The neighbour-list backends are
matscipy-neighbours (this library), matscipy 1.2.0 (the classic
matscipy package, the CPU reference this library descends from) and vesin.
It is generated by examples/lj_langevin/benchmark.py (re-run it to refresh the
numbers for your own hardware). The broad picture:
- The neighbour-list build dominates the step, so the list choice drives the scaling: matscipy-neighbours' cell list stays close to linear on both devices, matscipy 1.2.0 is a single-threaded CPU reference, while vesin's GPU path falls behind for these large, low-density droplets.
- The kernel choice mostly shifts the curve — the fused C++/CUDA and Warp
kernels avoid materialising per-pair arrays; the array (NumPy/CuPy) path is the
simplest; JAX
jit-compiles a dense masked sum. - On the CPU the matscipy list is benchmarked single-threaded (
(1t)) and multi-threaded ((mt)); the gap widens with size, and multithreading actually hurts for tiny droplets (thread-spawn overhead).
vesin and matscipy 1.2.0 only feed the Warp and array kernels (JAX needs the
dense neighbour_matrix and C++ uses the in-tree core); matscipy 1.2.0 is also
CPU-only. Those cells are left empty.
JAX and the dense neighbour list¶
JAX compiles for static array shapes, so the variable-size pair list is a poor
fit: it makes XLA recompile every step (a naive eager loop is ~1.3 s/step,
compile-bound, on both CPU and GPU). The dense fixed-capacity neighbour_matrix
(n x K rows) gives static shapes so the per-step update jit-compiles once —
which is what makes the JAX rows above fast. Pick --max-neighbours large enough
for the densest atom (the call raises if it is too small). JAX defaults to
float32; the example enables jax_enable_x64 (the list works in float64).