
TEA2023 Blind Test 

The TEA2023 blind test aims to evaluate the performance of modern Machine Learning Force Fields (MLFFs) in reproducing the dynamics of biomolecules, molecules on surfaces, and complex periodic systems, which is essential for a broad range of practical applications, from biochemistry to solar energy.

Within this project, we have defined a series of tests that aim to clearly assess the quality of the most advanced MLFFs in the literature, highlighting their application ranges, limitations, and the necessary steps for further improvement.


All MLFF models need to be assessed for accuracy, stability, performance, and portability, yet a clear and fair comparison between the most widely used cutting-edge approaches is still missing. For this purpose, the TEA project was born as a collaboration between the TCP group of Prof. Alexandre Tkatchenko at the University of Luxembourg and leading international experts in the field of MLFFs, such as Prof. Gabor Csanyi, Prof. Klaus-Robert Müller, Prof. O. Anatole von Lilienfeld, Prof. Markus Meuwly, and others. Within this project, each group of MLFF developers received identical training, validation, and test datasets with which to train and assess their models. Afterward, in the post-test phase, each participant was given the opportunity to progressively improve their models, which were then uniformly evaluated by the group of Prof. Tkatchenko in Luxembourg on the same high-performance computing infrastructure, the MeluXina HPC at LuxProvide.


All MLFF models were assessed for accuracy, stability, and performance. Stability tests involved running 12 independent NVT Molecular Dynamics (MD) simulations at temperatures of 300 K, 500 K, and 700 K, using a Langevin thermostat with a 1 fs timestep. All models started from identical configurations, and each simulation was monitored for the onset of nonphysical states.
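The stability protocol above can be sketched in a few lines. The Langevin thermostat, the three test temperatures, and the 12 independent runs follow the protocol described here; the harmonic placeholder force (standing in for a trained MLFF), the reduced units, the run length, and the friction constant are illustrative assumptions:

```python
import numpy as np

KB = 1.0  # Boltzmann constant in reduced units (real runs would use eV/K)

def mlff_forces(positions):
    """Placeholder for an MLFF force evaluation; here a simple harmonic well."""
    return -positions

def run_nvt(positions, temperature, n_steps=2000, dt=0.001, gamma=1.0, mass=1.0, seed=0):
    """One Langevin (BAOAB-splitting) NVT run; dt is the reduced-unit analogue
    of the protocol's 1 fs timestep. Returns final positions and a stability flag."""
    rng = np.random.default_rng(seed)
    x = positions.copy()
    v = rng.normal(0.0, np.sqrt(KB * temperature / mass), size=x.shape)
    c1 = np.exp(-gamma * dt)
    c2 = np.sqrt((1.0 - c1**2) * KB * temperature / mass)
    for _ in range(n_steps):
        v += 0.5 * dt * mlff_forces(x) / mass      # B: half kick
        x += 0.5 * dt * v                          # A: half drift
        v = c1 * v + c2 * rng.normal(size=v.shape) # O: thermostat
        x += 0.5 * dt * v                          # A: half drift
        v += 0.5 * dt * mlff_forces(x) / mass      # B: half kick
        if not np.isfinite(x).all():               # nonphysical state detected
            return x, False
    return x, True

# 12 independent runs: 4 seeds at each of the three test temperatures,
# all starting from the same configuration, as in the protocol.
results = [run_nvt(np.zeros((8, 3)), T, seed=s)
           for T in (300, 500, 700) for s in range(4)]
stable = all(ok for _, ok in results)
```

A real stability check would also flag physically implausible geometries (e.g. collapsed bond distances), not only numerical divergence.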

Challenge 1: Reproducing PES for Flexible Organic Molecules

This challenge focused on the alanine tetrapeptide dataset, split into folded and unfolded configurations. Participants trained MLFFs on one subset and predicted the other. Evaluation criteria included accuracy, stability, and the ability to extrapolate in configurational space. Training/validation sets ranged from 200 to 1000 samples to confirm model convergence.

Challenge 2: Handling Incomplete Reference Data

The Ac-Phe-Ala5-Lys dataset was used to test MLFFs on complete and incomplete data. Models were trained on configurations from 125 MD trajectories, while 75 trajectories remained unseen. The performance of MLFFs trained on 4000 samples was compared for both seen and unseen datasets. Stability and performance tests followed the same protocol as Challenge 1.
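The seen-versus-unseen comparison reduces to a grouped split by trajectory, sketched below with synthetic data. The trajectory counts (125 seen, 75 unseen) come from the protocol; the frame count, the mock predictions, and the noise levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 trajectories x 50 frames with scalar "forces".
n_traj, n_frames = 200, 50
traj_ids = np.repeat(np.arange(n_traj), n_frames)
forces_ref = rng.normal(size=n_traj * n_frames)

# Grouped split by trajectory: 125 seen (training) / 75 unseen, per the protocol.
seen = traj_ids < 125
unseen = ~seen

# Mock MLFF predictions: noisier on unseen trajectories, mimicking the larger
# extrapolation error a real model typically shows outside its training data.
noise = np.where(seen, 0.05, 0.10) * rng.normal(size=forces_ref.shape)
forces_pred = forces_ref + noise

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_seen = rmse(forces_pred[seen], forces_ref[seen])
rmse_unseen = rmse(forces_pred[unseen], forces_ref[unseen])
```

The key point is that the split is by whole trajectory, not by frame: frames from one trajectory are strongly correlated, so a frame-level split would leak training information into the "unseen" set.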

Challenge 3: Multi-Component Systems

This challenge utilized a 1,8-naphthyridine molecule adsorbed on a graphene sheet, focusing on molecule-surface interactions. Training/validation sets ranged from 200 to 1000 samples. Stability tests were performed on models trained on 1000 configurations under periodic boundary conditions, with temperatures capped at 500 K to avoid desorption of the molecule.

Challenge 4: Complex Multi-Component Systems with Heavy and Light Atoms

The Methylammonium (MA) Lead Iodide perovskite dataset, consisting of 384 atoms, tested MLFFs on interactions within MA molecules, Pb and I atoms, and between MA and PbI3 subsystems. Training/validation sets ranged from 100 to 500 samples, with stability tests performed on models trained on 500 configurations under three-dimensional periodic boundary conditions.

Each challenge in the TEA2023 blind test provided detailed insights into the robustness, accuracy, and performance of the MLFF models across varied conditions and configurations. The use of the MeluXina HPC environment ensured that all models were compared under consistent and controlled settings, leading to reliable and unbiased results. Over the course of the test, approximately 600 million configurations of molecular and periodic systems were processed using MLFFs. This is equivalent to hundreds of billions of CPU hours if computed using traditional reference quantum chemistry methods.


The TEA2023 blind test successfully benchmarked the capabilities of modern MLFFs across a range of complex scenarios, revealing both their strengths and areas needing improvement in extrapolation power, stability, and computational efficiency. The high-performance computing (HPC) resources provided by MeluXina were instrumental in this endeavor, enabling rigorous and uniform testing conditions. The availability of such advanced computational infrastructure was crucial for executing the extensive simulations and evaluations required, thereby ensuring the reliability and fairness of the comparative analysis. This underscores the importance of robust HPC support in advancing the development and validation of sophisticated ML models in computational chemistry.

Further Work 

Based on the insights gained from the blind test and the positive feedback from the community, similar blind tests should be conducted regularly to assess milestones achieved and to identify new directions for future development of the field of MLFFs.