Benchmark¶

Per-step wall time of the Lennard-Jones Langevin droplet example across droplet sizes, for the full cross-product of device (CPU / GPU), neighbour list and kernels (Warp / array (NumPy/CuPy) / JAX / C++). Lower is better. The neighbour-list backends are:

matscipy-neighbours — this library (matscipy_neighbours), CPU + GPU;
matscipy 1.2.0 — the classic matscipy package's neighbour_list, the CPU reference this library descends from;
vesin — vesin, CPU + GPU.

Test machine

CPU: Intel(R) Core(TM) Ultra 7 356H (16 logical cores)
GPU: NVIDIA RTX PRO 500 Blackwell Generation Laptop GPU

CPU threading

On the CPU the matscipy-neighbours list is benchmarked both single-threaded (OMP_NUM_THREADS=1, the (1t) rows) and multi-threaded (all 16 logical cores, the (mt) rows). The C++ force loop is OpenMP-parallel and follows the same setting; the Warp and array kernels and the JAX backend use their own threading. The classic matscipy 1.2.0 and vesin CPU lists are single-threaded. The GPU rows are unaffected.

Empty cells

vesin and matscipy 1.2.0 only feed the Warp and array kernels: JAX uses the dense neighbour_matrix and the C++ example uses the in-tree C++ core, so those rows are left empty. matscipy 1.2.0 is CPU-only, so it has no GPU rows. A blank in an otherwise-populated row marks a size that exceeded the GPU memory (e.g. the JAX dense matrix and the CuPy/vesin GPU paths at the largest sizes on this card).

Run configuration: reduced LJ units, cutoff 2.5, dt 0.005, friction 1.0, temperature 0.7; logarithmically spaced sizes (100 → 1000000 atoms). Up to 40 steps per point (fewer for the largest systems; JAX and Warp are compiled once during an untimed warm-up).

Time vs. number of atoms

Configuration	100 atoms	1000 atoms	10000 atoms	100000 atoms	1000000 atoms
Warp · matscipy-neighbours · GPU	0.85	1.42	5.32	31.22	244.88
Warp · matscipy-neighbours · CPU (mt)	0.31	3.77	13.88	91.11	1093.46
Warp · matscipy-neighbours · CPU (1t)	0.15	1.97	25.04	256.83	2807.42
Warp · matscipy 1.2.0 · CPU	0.19	3.19	41.62	461.17	5314.42
Warp · vesin · GPU	1.66	3.60	11.53	275.64	5426.82
Warp · vesin · CPU	0.37	3.78	45.86	568.37	7309.84
array (NumPy/CuPy) · matscipy-neighbours · GPU	2.75	7.27	49.40	477.16	5858.19
array (NumPy/CuPy) · matscipy-neighbours · CPU (mt)	0.87	6.11	39.11	518.21	5851.86
array (NumPy/CuPy) · matscipy-neighbours · CPU (1t)	0.32	4.02	54.21	723.14	7660.85
array (NumPy/CuPy) · matscipy 1.2.0 · CPU	0.47	5.49	74.91	907.25	10138.81
array (NumPy/CuPy) · vesin · GPU	4.28	9.49	72.26	946.87	—
array (NumPy/CuPy) · vesin · CPU	0.49	6.96	84.99	1054.04	12846.20
JAX · matscipy-neighbours · GPU	5.10	6.48	10.74	69.57	—
JAX · matscipy-neighbours · CPU (mt)	3.30	6.63	19.96	324.35	3042.44
JAX · matscipy-neighbours · CPU (1t)	2.16	4.91	30.27	501.67	4883.77
JAX · matscipy 1.2.0 · CPU	—	—	—	—	—
JAX · vesin · GPU	—	—	—	—	—
JAX · vesin · CPU	—	—	—	—	—
C++ · matscipy-neighbours · GPU	1.03	1.54	3.83	27.39	287.22
C++ · matscipy-neighbours · CPU (mt)	0.83	6.95	21.49	73.30	612.11
C++ · matscipy-neighbours · CPU (1t)	0.08	1.88	23.79	240.28	2384.93
C++ · matscipy 1.2.0 · CPU	—	—	—	—	—
C++ · vesin · GPU	—	—	—	—	—
C++ · vesin · CPU	—	—	—	—	—

(values are ms/step)

How to read it:

The neighbour-list build dominates the step, so the list choice drives the scaling: matscipy-neighbours' cell list stays close to linear on both devices, the classic matscipy 1.2.0 list is a single-threaded CPU reference, and vesin's GPU path falls behind for these large, low-density droplets.
The kernel choice mostly shifts the curve: the fused C++/CUDA and Warp kernels avoid materialising per-pair arrays, the array (NumPy/CuPy) path is the simplest, and JAX jit-compiles a dense masked sum.
On the CPU, the matscipy-neighbours (mt) rows pull away from (1t) as the system grows; and even single-threaded, matscipy-neighbours (1t) is already faster than the classic matscipy 1.2.0 and vesin CPU lists.

This page is generated by examples/lj_langevin/benchmark.py. Regenerate it on your own hardware with:

python examples/lj_langevin/benchmark.py --build build --doc-out docs/benchmark.md

For the C++ rows, build with -DBUILD_EXAMPLES=ON (and -DENABLE_CUDA=ON for the GPU binary); the other rows need pip install jax warp-lang vesin muTimer matscipy==1.2.0 in the interpreter that runs this driver.