This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel and on GPUs. Note that input and sample output files for many of these benchmark tests are provided in the bench directory of the LAMMPS distribution. See the bench/README file for details.
These are the parallel machines for which benchmark data is given. The "Processors" column is the largest number of processors on that machine that LAMMPS was run on. Message-passing bandwidth and latency are in units of Mb/sec and microsecs at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics, including their "birth" year, is given at the bottom of the page.
Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
Dell T7500 dual hex-core desktop | 12 | SNL | 3.47 GHz Xeon | on-chip | ?? | ?? |
Xeon/Myrinet cluster | 512 | SNL | 3.4 GHz dual Xeons (64-bit) | Myrinet | 230 | 9 |
IBM p690+ | 512 | Daresbury | 1.7 GHz Power4+ | custom | 1450 | 6 |
IBM BG/L | 65536 | LLNL | 700 MHz PowerPC 440 | custom | 150 | 3 |
Cray XT3 | 10000 | SNL | 2.0 GHz Opteron | SeaStar | 1100 | 7 |
Cray XT5 | 1920 | SNL | 2.4 GHz Opteron | SeaStar | 1100 | 7 |
One-processor timings are also listed for some older machines whose characteristics are also given below.
Name | Machine | Processors | Site | CPU | Interconnect | Bandwidth | Latency |
Laptop | Mac PowerBook | 1 | SNL | 1 GHz G4 PowerPC | N/A | N/A | N/A |
ASCI Red | Intel | 1500 | SNL | 333 MHz Pentium III | custom | 310 | 18 |
Ross | custom Linux cluster | 64 | SNL | 500 MHz DEC Alpha | Myrinet | 100 | 65 |
Liberty | HP Linux cluster | 64 | SNL | 3.0 GHz dual Xeons (32-bit) | Myrinet | 230 | 9 |
Cheetah | IBM p690 | 64 | ORNL | 1.3 GHz Power4 | custom | 1490 | 7 |
Billion-atom LJ timings are also given for GPU clusters, with more characteristics given below.
Name | Machine | GPUs | Site | GPU | Interconnect | Bandwidth | Latency |
Keeneland | Intel/NVIDIA cluster | 360 | ORNL | Tesla M2090 | Qlogic QDR | ??? | ??? |
Lincoln | Intel/NVIDIA cluster | 384 | NCSA | Tesla C1060 | Infiniband | 1500 | 12 |
For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same problem with 32,000 atoms was run on varying numbers of processors. Scaled-size means that when run on P processors, the number of atoms in the simulation was P times larger than the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms.
All listed CPU times are in seconds for 100 timesteps. Parallel efficiencies refer to the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run-time of 10 seconds, and the actual run time was 12 seconds, then the efficiency is 10/12 or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
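A minimal sketch of that efficiency calculation:

```python
def parallel_efficiency(ideal_time, actual_time):
    """Ratio of ideal (perfect speed-up) run time to actual run time."""
    return ideal_time / actual_time

# The example from the text: perfect speed-up would give 10 s, actual was 12 s.
print(f"{parallel_efficiency(10.0, 12.0):.1%}")  # 83.3%
```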
The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.
Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms
These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly with the number of atoms simulated, you can use the timing and parallel-efficiency data to estimate the CPU cost of problems you want to run on a given number of processors. As the data below illustrate, fixed-size problems generally achieve parallel efficiencies of 50% or better so long as there are at least a few hundred atoms per processor. Scaled-size problems generally achieve parallel efficiencies of 80% or more across a wide range of processor counts.
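As a sketch of such an estimate (the per-atom timing comes from the single-processor summary table below; the 80% efficiency figure is an assumption for illustration):

```python
def estimate_cpu_cost(secs_per_atom_step, natoms, nsteps, nprocs=1, efficiency=1.0):
    """Extrapolate a measured CPU secs/atom/timestep figure to a new problem,
    assuming run time linear in N and the given parallel efficiency."""
    return secs_per_atom_step * natoms * nsteps / (nprocs * efficiency)

# LJ benchmark at ~7.02e-7 CPU secs/atom/step on one core: a 1M-atom,
# 1000-step run on 64 procs at an assumed 80% parallel efficiency:
t = estimate_cpu_cost(7.02e-7, 1_000_000, 1000, nprocs=64, efficiency=0.80)
```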
Thanks to the following individuals for running the various benchmarks:
This is a summary of single-processor LAMMPS performance in CPU secs per atom per timestep for the 5 benchmark problems which follow. This is on a Dell Precision T7500 desktop Red Hat Linux box with dual hex-core 3.47 GHz Intel Xeon processors, using the Intel 11.1 icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chains and granular systems run 2x and 4x faster, while the EAM metal and solvated protein models run 2.6x and 16x slower respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.
Problem: | LJ | Chain | EAM | Chute | Rhodopsin |
CPU/atom/step: | 7.02E-7 | 3.15E-7 | 1.85E-6 | 1.71E-7 | 1.14E-5 |
Ratio to LJ: | 1.0 | 0.45 | 2.64 | 0.24 | 16.2 |
Input script for this problem.
Atomic fluid:
Performance data:
These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Input script for this problem.
Bead-spring polymer melt with 100-mer chains and FENE bonds:
Performance data:
These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Input script for this problem.
Cu metallic solid with embedded atom method (EAM) potential:
Performance data:
These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Input script for this problem.
Chute flow of packed granular particles with frictional history potential:
Performance data:
These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
Input script for this problem.
All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle mesh), SHAKE constraints. This model contains counter-ions and a reduced amount of water to make a 32K atom system:
Performance data:
These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses. Click on the plot for a larger version.
This section shows performance results for a desktop system with dual hex-core Xeon processors and 2 NVIDIA Tesla/Fermi GPUs. More system details are given below, for the "Desktop" entry.
The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these GPU test cases are included in the bench/GPU directory of the LAMMPS distribution.
The performance is plotted as a function of system size (number of atoms), where the size of the benchmark problems was varied. The Y-axis is millions of atom-timesteps per second. Thus a value of 10 for a 1M atom system means it ran at a rate of 10 timesteps/second.
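The conversion between the plotted rate and timesteps/second can be sketched as follows (treating plotted values as millions of atom-timesteps per second, consistent with the worked examples in the text):

```python
def timesteps_per_sec(plot_value_matom_steps, natoms):
    """Convert a plotted rate (millions of atom-timesteps/sec) to timesteps/sec."""
    return plot_value_matom_steps * 1.0e6 / natoms

print(timesteps_per_sec(10, 1_000_000))  # 10.0 timesteps/sec for a 1M-atom run
print(timesteps_per_sec(10, 256_000))    # ~39 timesteps/sec for a 256K-atom run
```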
Results are shown for running in CPU-only mode, and on 1 or 2 GPUs, using either the GPU or the USER-CUDA package.
The CPU-only results are for running on a single core and on all 12 cores, always in double precision.
For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For small problems this is typically less than all 12; for large problems it is typically all 12. The precision refers to the portion of the calculation performed on the GPU (pairwise interactions). Results are shown for single precision, double precision, and mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision.
For the USER-CUDA package, the number of CPU cores used is always equal to the number of GPUs used, i.e. 1 or 2 for this system. The three precisions have the same meaning as for the GPU package, except that other portions of the calculation are also performed on the GPU, e.g. time integration.
Click on the plots for a larger version.
This section shows performance results for the Titan development system, each node of which has a 16-core AMD CPU and an NVIDIA Tesla/Fermi GPU. More system details are given below, for the "Titan Development" entry. Note that the eventual Titan machine will have Tesla/Kepler GPUs, and more of them. We will post updated benchmark results for that machine, when available.
The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these GPU test cases are included in the bench/GPU directory of the LAMMPS distribution.
For the rhodopsin benchmark, which computes long-range Coulombics via the PPPM option of the kspace_style command, these benchmarks were run with the run_style verlet/split command, to split the real-space versus K-space computations across the CPUs. This makes little difference on small node counts, but on large node counts, it enables better scaling, since the FFTs computed by PPPM are performed on fewer processors. This was done for both the strong- and weak-scaling results below. The ratio of real-to-kspace processors was chosen to give the best performance, and was 7:1 on this 16 core/node machine.
For the strong-scaling plots, a fixed-size problem of 256,000 atoms was run for all node counts. The node count varied from 1 to 128, or 16 to 2048 cores. The Y-axis is millions of atom-timesteps per second. Thus a value of 10 for the 256,000-atom system means it ran at a rate of roughly 40 timesteps/second.
Strong-scaling results are shown for running in CPU-only mode, and on the GPU, using the GPU package. The CPU-only results are double-precision, the GPU results are for mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.
For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the strong-scaling results with the large per-node atom count (256000), this was typically nearly all 16 cores.
Click on the plots for a larger version.
For the weak-scaling plots, a scaled-size problem of 32,000 atoms/node was run for all node counts. The node count varied from 1 to 8192, or 16 to 131072 cores; only 960 nodes on the current development machine have GPUs. Thus the largest system on 8192 nodes has ~262 million atoms.
The Y-axis is millions of atom-timesteps per second. Thus a value of 100 for a 1M atom system (32 nodes) means it ran at a rate of 100 timesteps/second.
Weak-scaling results are shown for running in CPU-only mode, in CPU-only mode with the numa option invoked for the processors command, and on the GPU, using the GPU package. The CPU-only and NUMA results are double-precision, the GPU results are for mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.
The NUMA results alter the layout of cores to the logical 3d grid of processors that overlays the simulation domain. The processors numa command does this so that cores within a node and within a NUMA region (inside the node) are close together in the topology of the 3d grid, to reduce off-node communication costs. This can give a speed-up of 10-15% on large node counts, as shown in the plots.
For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the weak-scaling results with the smaller per-node atom count (32000), this was typically 4-8 cores out of 16.
Click on the plots for a larger version.
The following table summarizes the CPU cost of various potentials, as implemented in LAMMPS, each for a system commonly modeled by that potential. The desktop machine these were run on is described below. The last 3 entries are for VASP timings, to give a comparison with DFT calculations. The details for the VASP runs are described below.
The listed timing is CPU seconds per timestep per atom for a one processor (core) run. Note that this is per timestep, as is the ratio to LJ; the timestep size is listed in the table. In each case a short 100-step run of a roughly 32000 atom system was performed. The speed-up is for a 4-processor run of the same 32000-atom system. Speed-ups greater than 4x are due to cache effects.
To first order, the CPU and memory cost for simulations with all these potentials scales linearly with the number of atoms N, and inversely with the number of processors P when running in parallel. This assumes the density doesn't change so that the neighbors per atom stays constant as you change N. This holds for N/P ratios larger than some threshold, say 1000 atoms per processor. Thus you can use this data to estimate the run-time of different size problems on varying numbers of processors.
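The first-order scaling rule above can be written as a one-line estimator (the baseline timing in the example is illustrative, not a measured value):

```python
def scale_run_time(t_base, n_base, p_base, n_new, p_new):
    """First-order estimate: run time scales ~linearly with atom count N and
    ~inversely with processor count P (valid above ~1000 atoms/processor)."""
    return t_base * (n_new / n_base) * (p_base / p_new)

# If a 32000-atom run takes 6.0 s on 1 core (illustrative), then a 40x larger
# 1,280,000-atom run on 40 cores should take roughly the same ~6 s:
t = scale_run_time(6.0, 32000, 1, 1_280_000, 40)
```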
Potential | System | # Atoms | Timestep | Neighs/atom | Memory | CPU | LJ Ratio | P=4 Speed-up | Input script | Tarball |
Granular | chute flow | 32000 | 0.0001 tau | 7.2 | 33 Mb | 2.08e-7 | 0.26x | 4.28x | in.granular | bench_granular.tar.gz |
FENE bead/spring | polymer melt | 32000 | 0.012 tau | 9.7 | 8.4 Mb | 2.86e-7 | 0.36x | 3.78x | in.fene | bench_fene.tar.gz |
Lennard-Jones | LJ liquid | 32000 | 0.005 tau | 76.9 | 12 Mb | 8.01e-7 | 1.0x | 3.56x | in.lj | bench_lj.tar.gz |
DPD | pure solvent | 32000 | 0.04 tau | 41.3 | 9.4 Mb | 1.22e-6 | 1.53x | 3.54x | in.dpd | bench_dpd.tar.gz |
EAM | bulk Cu | 32000 | 5 fmsec | 75.5 | 13 Mb | 1.87e-6 | 2.34x | 3.83x | in.eam | bench_eam.tar.gz |
REBO | polyethylene | 32640 | 0.5 fmsec | 149 | 33 Mb | 3.18e-6 | 3.97x | 3.61x | in.rebo | bench_rebo.tar.gz |
Stillinger-Weber | bulk Si | 32000 | 1 fmsec | 30.0 | 11 Mb | 3.28e-6 | 4.10x | 3.83x | in.sw | bench_sw.tar.gz |
Tersoff | bulk Si | 32000 | 1 fmsec | 16.6 | 9.2 Mb | 3.74e-6 | 4.67x | 3.92x | in.tersoff | bench_tersoff.tar.gz |
ADP | bulk Ni | 32000 | 5 fmsec | 83.6 | 25 Mb | 5.58e-6 | 6.97x | 3.61x | in.adp | bench_adp.tar.gz |
EIM | crystalline NaCl | 32000 | 0.5 fmsec | 98.9 | 14 Mb | 5.60e-6 | 6.99x | 3.86x | in.eim | bench_eim.tar.gz |
Peridynamics | glass fracture | 32000 | 22.2 nsec | 422 | 144 Mb | 7.46e-6 | 9.31x | 3.78x | in.peri | bench_peri.tar.gz |
SPC/E | liquid water | 36000 | 2 fmsec | 700 | 86 Mb | 8.77e-6 | 11.0x | 3.46x | in.spce | bench_spce.tar.gz |
CHARMM + PPPM | solvated protein | 32000 | 2 fmsec | 376 | 124 Mb | 1.13e-5 | 14.1x | 3.66x | in.protein | bench_protein.tar.gz |
MEAM | bulk Ni | 32000 | 5 fmsec | 48.8 | 54 Mb | 1.32e-5 | 16.5x | 3.73x | in.meam | bench_meam.tar.gz |
Gay-Berne | ellipsoid mixture | 32768 | 0.002 tau | 140 | 21 Mb | 2.20e-5 | 27.5x | 3.63x | in.gb | bench_gb.tar.gz |
BOP | bulk CdTe | 32000 | 1 fmsec | 4.4 | 74 Mb | 2.51e-5 | 31.3x | 3.88x | in.bop | bench_bop.tar.gz |
AIREBO | polyethylene | 32640 | 0.5 fmsec | 681 | 101 Mb | 3.25e-5 | 40.6x | 3.66x | in.airebo | bench_airebo.tar.gz |
ReaxFF/C | PETN crystal | 32480 | 0.1 fmsec | 667 | 976 Mb | 1.09e-4 | 136x | 3.17x | in.reaxc | bench_reaxc.tar.gz |
COMB | crystalline SiO2 | 32400 | 0.2 fmsec | 572 | 85 Mb | 2.00e-4 | 250x | 3.89x | in.comb | bench_comb.tar.gz |
eFF | H plasma | 32000 | 0.001 fmsec | 5066 | 365 Mb | 2.16e-4 | 270x | 3.71x | in.eff | bench_eff.tar.gz |
ReaxFF | PETN crystal | 16240 | 0.1 fmsec | 667 | 425 Mb | 2.84e-4 | 354x | 3.78x | in.reax | bench_reax.tar.gz |
VASP/small | water | 192/512 | 0.3 fmsec | N/A | 320 procs | 26.2 | 17.7e6 | 100% | N/A | N/A |
VASP/medium | CO2 | 192/1024 | 0.8 fmsec | N/A | 384 procs | 252 | 170e6 | 100% | N/A | N/A |
VASP/large | Xe | 432/3456 | 2.0 fmsec | N/A | 384 procs | 1344 | 908e6 | 100% | N/A | N/A |
Notes:
Details for different systems:
The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc) has been run on different machines for billion-atom tests. For the LJ benchmark LAMMPS requires a little less than 1/2 Terabyte of memory per billion atoms, which is used mostly for neighbor lists.
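That footprint works out to roughly 500 bytes per atom; a small sizing sketch based on that figure:

```python
BYTES_PER_ATOM = 0.5e12 / 1.0e9  # ~1/2 TB per billion atoms, per the text above

def lj_memory_tb(natoms):
    """Approximate aggregate memory (TB) for the LJ benchmark,
    dominated by neighbor-list storage."""
    return natoms * BYTES_PER_ATOM / 1.0e12

print(lj_memory_tb(40_000_000_000))  # the 40-billion-atom run needs ~20 TB
```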
Machine | # of Atoms | Processors | CPU Time (secs) | Parallel Efficiency | Flop Rate | Date |
Keeneland | 1 million | 1 GPU | 2.35 | 100% | 27.0 Gflops | 2012 |
Keeneland | 1 billion | 288 GPUs | 17.7 | 46.3% | 3.60 Tflops | 2012 |
Lincoln | 1 million | 1 GPU | 4.24 | 100% | 15.0 Gflops | 2011 |
Lincoln | 1 billion | 288 GPUs | 28.7 | 51.3% | 2.21 Tflops | 2011 |
Cray XT5 | 1 million | 1 | 148.7 | 100% | 427 Mflops | 2011 |
Cray XT5 | 1 billion | 1920 | 103.0 | 75.1% | 616 Gflops | 2011 |
Cray XT3 | 1 million | 1 | 235.3 | 100% | 270 Mflops | 2006 |
Cray XT3 | 1 billion | 10000 | 25.1 | 93.6% | 2.53 Tflops | 2006 |
Cray XT3 | 10 billion | 10000 | 246.8 | 95.2% | 2.57 Tflops | 2006 |
Cray XT3 | 40 billion | 10000 | 979.0 | 96.0% | 2.59 Tflops | 2006 |
IBM BG/L | 1 million | 1 | 898.3 | 100% | 70.7 Mflops | 2005 |
IBM BG/L | 1 billion | 4096 | 227.6 | 96.3% | 279 Gflops | 2005 |
IBM BG/L | 1 billion | 32K | 30.2 | 90.7% | 2.10 Tflops | 2005 |
IBM BG/L | 1 billion | 64K | 16.0 | 85.6% | 3.97 Tflops | 2005 |
IBM BG/L | 10 billion | 64K | 148.9 | 92.0% | 4.26 Tflops | 2005 |
IBM BG/L | 40 billion | 64K | 585.4 | 93.6% | 4.34 Tflops | 2005 |
ASCI Red | 32000 | 1 | 62.88 | 100% | 32.3 Mflops | 2004 |
ASCI Red | 750 million | 1500 | 1156 | 85.0% | 41.2 Gflops | 2004 |
The parallel efficiencies are estimated from the per-atom CPU or GPU time for a large single processor (or GPU) run on each machine:
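As a sketch of that estimate, using the Cray XT3 rows from the table above:

```python
# Per-atom-per-timestep cost from the 1-processor Cray XT3 run
# (1 million atoms, 100 steps, 235.3 s, from the table above):
t_atom_step = 235.3 / (1.0e6 * 100)

# Ideal (perfect-scaling) time for 1 billion atoms, 100 steps, 10000 procs:
t_ideal = t_atom_step * 1.0e9 * 100 / 10000

# The actual 10000-proc run took 25.1 s:
efficiency = t_ideal / 25.1
print(f"{efficiency:.1%}")  # ~93.7%, close to the table's 93.6%
                            # (difference is rounding of the listed timings)
```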
The aggregate flop rate is estimated using the following values for the pairwise interactions, which dominate the run time:
This is a conservative estimate in the sense that flops computed for atom pairs outside the force cutoff, building neighbor lists, and time integration are not counted. For the USER-CUDA package running on GPUs, Newton's 3rd law is not used (because it's faster not to), which doubles the pairwise interaction count, but that is not included in the flop rate either.
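A sketch of how such an estimate is assembled; the flops-per-pair number below is a placeholder assumption, not the value actually used for the table above:

```python
def pairwise_flop_rate(natoms, neighs_per_atom, flops_per_pair, steps, wall_time,
                       newton=True):
    """Estimate aggregate flop rate counting only pairwise interactions inside
    the cutoff; cutoff misses, neighbor-list builds, and time integration are
    ignored, so this is a conservative estimate."""
    # With Newton's 3rd law each pair is computed once; without it, twice.
    pairs_per_step = natoms * neighs_per_atom / (2.0 if newton else 1.0)
    return pairs_per_step * flops_per_pair * steps / wall_time

# Turning Newton's 3rd law off (as the USER-CUDA GPU runs do) doubles the
# pairwise interaction count for the same wall time; flops_per_pair=25 here
# is a placeholder value:
r_on = pairwise_flop_rate(1.0e6, 76.9, 25.0, 100, 10.0)
r_off = pairwise_flop_rate(1.0e6, 76.9, 25.0, 100, 10.0, newton=False)
```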
This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.
Desktop = Dell Precision T7500 desktop workstation running Red Hat linux
Mac laptop = PowerBook G4 running OS X 10.3
ASCI Red = ASCI Intel Tflops MPP
Ross = CPlant DEC Alpha/Myrinet cluster
Liberty = Intel/Myrinet cluster packaged by HP
Cheetah = IBM p690 cluster
Xeon/Myrinet cluster = Spirit
IBM p690+ cluster = HPCx
IBM BG/L = Blue Gene/L
Cray XT3 = Red Storm
Cray XT5 = xtp
Lincoln = GPU cluster
Keeneland = GPU cluster
Titan Development = GPU-enabled supercomputer (formerly Jaguar)