文章精读 - Paper Reading 2：Machine learning potentials for metal-organic frameworks using an incremental learning approach

发表于 - Posted on 2025/07/09 编辑于 - Edited on 2025/09/30 字数 - Word count: 2.4k 阅读时间 - Reading time ≈ 9 mins.

使用不完美的MLP，利用Metadynamics对相空间采样，再利用采样的数据点训练MLP。发明出主动学习的策略的人简直是天才。

看懂文章后，其实我们自己也能做。

虽然最后我们组没用这个方法，但Gabe也提出来一个类似的方法用于采样。

They used imperfect MLPs, utilized Metadynamics for phase space sampling, and then trained the MLP with the sampled data points. It's simply genius to invent the active learning strategy.

Actually we can do it by ourselves after understanding what they did.

Although our group didn't use this method, Gabe proposed a similar approach for sampling.

链接 - Link：https://www.nature.com/articles/s41524-023-00969-x

作者 - Author：Sander Vandenhaute, Maarten Cools-Ceuppens, Simon DeKeyser, Toon Verstraelen & Veronique Van Speybroeck

年份 - Year：2023

期刊 - Journal (IF)：npj Computational Materials (9.4)

标题 - Title：Machine learning potentials for metal-organic frameworks using an incremental learning approach

关键词 - keywords：

摘要 - Abstract：

由于金属有机骨架 (MOF) 中存在空间异质性以及影响其行为的复杂操作条件，对其物理过程进行计算建模极具挑战性。密度泛函理论 (DFT) 可以在量子力学层面描述原子间相互作用，但对于纳米和皮秒级以外的系统，计算成本过高。本文，我们提出了一种增量学习方案，用于构建精确且数据高效的 MOF 机器学习势。该方案基于等变神经网络势的强大功能，并结合并行增强采样和动态训练，以迭代方式同时探索和学习相空间。每种材料只需进行数百次单点 DFT 计算，即可获得精确且可迁移的势，即使对于具有多个结构不同相的柔性骨架也是如此。该增量学习方案具有普遍适用性，并可能为在更大的时空窗口内以更高的精度建模骨架材料铺平道路。

Computational modeling of physical processes in metal-organic frameworks (MOFs) is highly challenging due to the presence of spatial heterogeneities and complex operating conditions which affect their behavior. Density functional theory (DFT) may describe interatomic interactions at the quantum mechanical level, but is computationally too expensive for systems beyond the nanometer and picosecond range. Herein, we propose an incremental learning scheme to construct accurate and data-efficient machine learning potentials for MOFs. The scheme builds on the power of equivariant neural network potentials in combination with parallelized enhanced sampling and on-the-fly training to simultaneously explore and learn the phase space in an iterative manner. With only a few hundred single-point DFT evaluations per material, accurate and transferable potentials are obtained, even for flexible frameworks with multiple structurally different phases. The incremental learning scheme is universally applicable and may pave the way to model framework materials in larger spatiotemporal windows with higher accuracy.

研究方法 - Research Methods:

使用增量学习方案（让模型在已有知识的基础上，逐步学习新数据，而不是用新数据从头开始训练整个模型），使用 MLP 本身对相空间进行采样，并在遇到未知区域时结合即时训练，采用NequIP （等变神经网络势）来构建 MOF 的机器学习势。

In this work, we present an incremental learning scheme，sampling the phase space using the MLP itself, in combination with on-the-fly training whenever it encounters unknown regions.

具体来说，该算法首先构建第一代机器学习势（MLP），该 MLP 基于一小组初始配置进行训练，这些初始配置是通过对初始配置的粒子位置和应变分量施加随机扰动获得的。然后，将第一代多层感知器 (MLP) 用于短时间（约 1 ps）的多行走器赝动力学 (MTD) 模拟，以探索初始构型周围的相空间。提取每个行走器的最终构型，并进行新的 DFT 计算以获取能量和力。将后者添加到训练集和验证集中，之后通过短时间训练模型获得下一代多层感知器 (MLP)。即使采样是使用最初不准确的 MLP 进行的，仍然足以探索相空间中有意义的区域并生成几乎不相关的样本。每次迭代的采样时间可以保持相对较短，因为它可以用于逐步探索更大范围的相空间。训练完成后，模型被认为已经学习了相空间中的一个增量区域，然后可以传递到下一次迭代，继续进行 MTD 采样。偏置势的状态从上一次迭代中保留下来，以确保步行者在每次迭代中探索略有不同的相空间区域。通过不断交替增强的采样和训练步骤，可以即时探索和学习集体变量的整个相空间，而无需执行昂贵的基于 DFT 的分子动力学模拟。重要的是，我们的方法确保进行 QM 评估的原子配置始终由 ~1 ps 的 MTD 轨迹分隔，这意味着它们最多只是弱相关的。因此，即使我们不依赖于其他主动学习方案 30、32 中发现的任何专门的不确定性测量，我们也能保证训练数据中 QM 评估配置之间几乎没有冗余。

The algorithm starts by constructing a first generation MLP, which is trained based on a small set of initial configurations that are obtained by applying random perturbations to the particle positions and strain components of the initial configuration. This first generation MLP is then used in a short (~1 ps) multiple walker metadynamics (MTD) simulation in order to explore the phase space around the initial configuration. The final configuration of each of the walkers is extracted and subjected to a new DFT calculation to obtain the energy and forces. The latter are added to the training and validation sets after which a next generation MLP is obtained by training the model for a short amount of time. Even though the sampling was performed with an initially inaccurate MLP, it still suffices to explore a meaningful region of phase space and generate almost decorrelated samples. The sampling time per iteration may be kept relatively short as it serves the purpose to gradually explore larger portions of phase space. After training, the model is considered to have learned an incremental region in phase space, and may then be passed on to the next iteration in which it continues the MTD sampling. The state of the bias potential is retained from the previous iteration as to ensure that the walkers explore a slightly different region of phase space in each iteration. By continuously alternating the enhanced sampling and training steps, the entire phase space of the collective variable is explored and learned on-the-fly without the need to perform expensive DFT-based molecular dynamics simulations. Importantly, our approach ensures that atomic configurations for which a QM evaluation is performed are always separated by a ~1 ps MTD trajectory, which implies that they are at most only weakly correlated. As such, we are guaranteed that there is little redundancy between the QM-evaluated configurations in the training data even though we did not rely on any specialized uncertainty measures as found in other active learning schemes30,32.

结论 - Conclusions：

在本研究中，我们提出了一种构建用于框架材料的精确且可迁移的多层势函数 (MLP) 的有效方法。即使对于多相体系，我们也表明大约 1000 次量子力学 (QM) 评估足以构建精确的等变多层势函数 (MPNN)。这种计算效率的提升对未来的研究至关重要，因为现在可以在 MLP 训练期间采用更先进的量子力学 (QM) 方法（例如混合泛函或更高），从而更准确地描述这些材料中的动态现象。此外，构建用于描述多种框架的单一势函数的能力前景广阔，尤其是因为我们观察到，每种材料的量子力学 (QM) 评估次数实际上会随着训练集多样性的增加而减少（从大约 1000 次减少到仅约 300 次）。

In this work, we propose an efficient approach for the construction of accurate and transferable MLPs for framework materials. Even for systems with multiple phases, we show that about 1000 QM evaluations are sufficient to construct accurate equivariant MPNNs. This increased computational efficiency is important for future research, as it is now possible to employ more advanced QM methods (e.g., hybrid functionals or beyond) during MLP training and in this way allow for a more accurate description of dynamic phenomena in these materials. In addition, the ability to construct a single potential for the description of multiple frameworks is highly promising, especially because we observed that the number of QM evaluations per material actually decreases with increasing variety in the training set (from about 1000 to only about 300).

金句 - Quotes:

广义上讲，这些模型（机器学习势能）可以分为核回归方法，通过将给定配置与一组参考配置进行比较来确定相互作用能量，或神经网络势，它直接根据数千甚至数百万个参数确定 PES 的高维表示。

Broadly speaking, these models (Machine learning potentials) may be categorized into either kernel regression methods, which determine the interaction energy by comparing a given configuration to a set of reference configurations21,22, or neural network potentials, which directly determine a high-dimensional representation of the PES based on thousands or even millions of parameters23.

对于具有多个稳定相的灵活框架，需要对相空间中所有相关区域进行适当的采样，不仅包括目标热力学条件下材料的稳定相，还包括活化区域以及重要转变路径上的所有点。使用平衡第一性原理分子动力学很难做到这一点，因为使用密度泛函理论 (DFT) 评估每个时间步长的原子间力时，模拟时间有限。此外，从平衡分子动力学获得的构型通常采样接近自由能最小值的区域，无法捕捉相间的重要过渡态，尽管这些过渡态对于训练数据至关重要。

For flexible frameworks with multiple stable phases, all relevant regions in phase space need to be properly sampled, including not only the stable phases of the material within the thermodynamic conditions of interest but also activated regions and all points along important transition paths. This is very difficult to achieve using equilibrium first principles MD because simulation times are limited when using DFT to evaluate the interatomic forces at each timestep. In addition, configurations obtained from equilibrium MD typically sample regions close to free energy minima and fail to capture important transition states in between phases, even though these are essential to include in the training data.