# Benchmark Dataset used in the paper **Fast Evaluation of Unbiased Atomic Forces in ab initio Variational Monte Carlo via the Lagrangian Technique**

This repository provides the **structures**, **input files**, **output files**, and a **Jupyter notebook** used for the paper [arXiv:2511.05222](https://doi.org/10.48550/arXiv.2511.05222).

`Archive.zip` (SHA256: `ac3f4832c268b54fb21df9a690d432e8831367790231c2798240074050e5576f`) contains four directories:

- `structures_org/`
- `structures_shifted/`
- `results/`
- `analysis/`

The structures are taken from the [rMD17 dataset](https://doi.org/10.6084/m9.figshare.12672038).
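
To confirm that a downloaded copy of `Archive.zip` matches the SHA256 digest listed above, something like the following minimal sketch can be used. It relies only on the Python standard library; the function names and the default path are illustrative assumptions, not part of this repository.

```python
import hashlib

# Digest copied from this README.
EXPECTED_SHA256 = "ac3f4832c268b54fb21df9a690d432e8831367790231c2798240074050e5576f"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path="Archive.zip"):
    """Return True if the file at `path` has the digest listed in this README."""
    return sha256_of_file(path) == EXPECTED_SHA256


# Example (assumes Archive.zip is in the current directory):
#   print(archive_matches("Archive.zip"))
```

Reading the file in chunks keeps memory use low regardless of the archive size.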

---

## Overview

### 1) `structures_org/` and `structures_shifted/`

These directories contain the structures (XYZ format) of the three molecules used in the rMD17 benchmark test in the paper:

- `ethanol/`
- `malonaldehyde/`
- `benzene/`

Each molecule directory contains **100 structures** extracted from the rMD17 dataset.

#### Naming convention

Each structure file is named using three indices:

- `{molecule}_0_idx0_old8.xyz`

where:

- The first integer (before `_idx`) is the **consecutive index in this benchmark set**.
- The second integer (after `idx`) is the **consecutive index in the rMD17 dataset**.
- The third integer (after `old`) is the **index of the conformation in the original MD17 dataset**.
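
The naming convention above can be parsed with a short regular expression. This is a sketch that assumes every file name follows the `{molecule}_0_idx0_old8.xyz` pattern exactly; the `StructureIndices` type and the function name are hypothetical helpers introduced here for illustration.

```python
import os
import re
from typing import NamedTuple


class StructureIndices(NamedTuple):
    molecule: str
    benchmark_index: int  # consecutive index in this benchmark set
    rmd17_index: int      # consecutive index in the rMD17 dataset
    md17_index: int       # conformation index in the original MD17 dataset


_NAME_PATTERN = re.compile(
    r"^(?P<molecule>[A-Za-z]+)_(?P<bench>\d+)_idx(?P<rmd17>\d+)_old(?P<md17>\d+)\.xyz$"
)


def parse_structure_name(path):
    """Split a structure file name into the molecule name and its three indices."""
    match = _NAME_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected structure file name: {path!r}")
    return StructureIndices(
        molecule=match.group("molecule"),
        benchmark_index=int(match.group("bench")),
        rmd17_index=int(match.group("rmd17")),
        md17_index=int(match.group("md17")),
    )


# Example:
#   parse_structure_name("ethanol_0_idx0_old8.xyz")
```

Raising `ValueError` on a non-matching name makes it easy to spot files that do not follow the convention.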

#### Difference between `structures_org/` and `structures_shifted/`

- `structures_org/` contains the structures **exactly as extracted** from the distributed rMD17 dataset.
- `structures_shifted/` contains the same structures after shifting the **molecular centroid** to the following coordinates:

 - ethanol: `[9.63857306, 9.63857306, 9.63857306]`
 - malonaldehyde: `[9.87397841, 9.87397841, 9.87397841]`
 - benzene: `[10.07516, 10.07516, 10.07516]`

The coordinates are given in angstrom. The shift is applied because the molecule should be placed at the center of the simulation cell in CP2K calculations.
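
The centroid shift described above amounts to a rigid translation of all atoms. The following sketch assumes that "centroid" means the unweighted mean of the atomic positions (not the mass-weighted center of mass); this is an assumption, and the function names are illustrative rather than taken from the scripts used for the paper.

```python
def centroid(positions):
    """Unweighted mean of a list of [x, y, z] positions."""
    count = len(positions)
    return [sum(atom[axis] for atom in positions) / count for axis in range(3)]


def shift_centroid(positions, target):
    """Translate all atoms rigidly so that their centroid lands on `target`."""
    current = centroid(positions)
    delta = [target[axis] - current[axis] for axis in range(3)]
    return [[atom[axis] + delta[axis] for axis in range(3)] for atom in positions]


# Example with the ethanol target listed above (coordinates in angstrom):
#   shifted = shift_centroid(positions, [9.63857306, 9.63857306, 9.63857306])
```

Because the same displacement is added to every atom, interatomic distances are unchanged by the shift.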

---

### 2) `results/`

This directory contains the input files used for, and the output files obtained from, the **Psi4** and **TurboRVB** calculations. **CP2K** was used to generate the trial wavefunctions for **TurboRVB**.

---

#### A) `psi4-xxx/`

All-electron DFT and coupled-cluster (CC) calculations performed with Psi4.

- `{molecule}.xyz`  
 Structure.
- `{molecule}_E_F.xyz`  
 Energies and forces stored in the extended XYZ (extxyz) format (units are eV and eV/angstrom, respectively).
- `run_psi4.py`  
 Psi4 running script.
- `psi.out`  
 Psi4 output file.

**Basis sets**:
- `def2-QZVPPD` for all DFT calculations
- `cc-pVQZ` for HF, MP2, CCSD, and CCSD(T)

---

#### B) `cp2k-xxx/`

CP2K DFT calculations with an effective core potential (ECP), used to generate trial wavefunctions for TurboRVB.

- `{molecule}.xyz`  
 Structure.
- `{molecule}.inp`  
 CP2K input file.
- `basis.cp2k`  
 Basis set in the CP2K format.
- `ecp.cp2k`  
 Effective core potential (ECP) in the CP2K format.
- `{molecule}.out`  
 CP2K output file.
- `{molecule}-TREXIO.h5`  
 Generated TREXIO file.

---

#### C) `cp2k-xxx-lr/`

Linear-response calculations with an effective core potential (ECP), using the VMC parameter derivatives obtained from the wavefunctions stored in `turborvb-vmc-JSD/`.

- `{molecule}.xyz`  
 Structure used for the calculation.
- `{molecule}.inp`  
 CP2K input file.
- `basis.cp2k`  
 Basis set in the CP2K format.
- `ecp.cp2k`  
 Effective core potential (ECP) in the CP2K format.
- `{molecule}.out`  
 CP2K output file.
- `{molecule}-TREXIO.h5`  
 TREXIO file.
- `{molecule}-TREXIO.dEdP.dat`  
 Parameter derivatives generated by TurboRVB.
- `{molecule}-resp.frc`  
 Force corrections obtained from the linear-response calculation (unit is Ha/bohr).
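
Since the force files in this archive use both Ha/bohr and eV/angstrom, a unit conversion is often needed when comparing them. The sketch below uses the CODATA 2018 values of the Hartree energy and the Bohr radius; whether the paper used exactly these constants is an assumption. The function operates on a plain nested list of force components and does not parse `{molecule}-resp.frc`, whose layout is not described here.

```python
# CODATA 2018 values (assumed here; the constants used in the paper may differ slightly).
HARTREE_TO_EV = 27.211386245988
BOHR_TO_ANGSTROM = 0.529177210903

# 1 Ha/bohr expressed in eV/angstrom (about 51.42).
HARTREE_PER_BOHR_TO_EV_PER_ANGSTROM = HARTREE_TO_EV / BOHR_TO_ANGSTROM


def forces_to_ev_per_angstrom(forces_ha_per_bohr):
    """Convert a list of [fx, fy, fz] rows from Ha/bohr to eV/angstrom."""
    factor = HARTREE_PER_BOHR_TO_EV_PER_ANGSTROM
    return [[component * factor for component in row] for row in forces_ha_per_bohr]


# Example:
#   forces_to_ev_per_angstrom([[0.001, 0.0, -0.002]])
```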

---

#### D) `turborvb-vmc-JSD/`

VMC calculations with an effective core potential (ECP), using the Jastrow-Slater determinant (JSD) wavefunction with frozen DFT orbitals. The parameter derivatives used by `cp2k-xxx-lr/` are obtained with the wavefunction stored here.

- `{molecule}.xyz`  
 Structure used for the calculation.
- `{molecule}_E_bF.xyz`  
 Energies and **biased** forces in extxyz format (units are eV and eV/angstrom, respectively).
- `{molecule}_E_cF.xyz`  
 Energies and **unbiased** forces in extxyz format (units are eV and eV/angstrom, respectively).
- `vmc_0.input`, `vmc_1.input`  
 TurboRVB input files.
- `wavefunction.dat`  
 Many-body wavefunction in the TurboRVB format.
- `pseudo.dat`  
 Effective core potential (ECP) in the TurboRVB format.
- `vmc_0.output`, `vmc_1.output`  
 TurboRVB output files.
- `energy.dat`  
 Energy (unit: Hartree).
- `forces.dat`  
 Biased forces (unit: Hartree/bohr).

---

#### E) `turborvb-vmc-JSDopt/`

VMC calculations with an effective core potential (ECP), using the Jastrow-Slater determinant (JSD) wavefunction, where **all variational parameters** (in both the Jastrow and determinant parts) are optimized.

- `{molecule}.xyz`  
 Structure.
- `{molecule}_E_F.xyz`  
 Energies and forces in the extxyz format (units are eV and eV/angstrom, respectively).
- `vmc_0.input`, `vmc_1.input`  
 TurboRVB input files.
- `wavefunction.dat`  
 Many-body wavefunction in the TurboRVB format.
- `pseudo.dat`  
 Effective core potential in the TurboRVB format.
- `vmc_0.output`, `vmc_1.output`  
 TurboRVB output files.
- `energy.dat`  
 Energy (unit: Hartree).
- `forces.dat`  
 Forces (unit: Hartree/bohr).

---

#### JSON summary files

Each setting directory also contains a JSON summary file:

- `XXXX-E-F-summary.json`

For every structure computed with the given setting, this JSON file contains:

- the structure index,
- the atomic positions (in angstrom),
- the energies and forces (in eV and eV/angstrom, respectively),
- and, for VMC calculations, the error bars.
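
The key names inside the JSON summary files are not documented in this README, so the following sketch inspects a file without assuming any of them. It prints a compact outline of the nested structure (keys, list lengths, value types). The `describe` helper and the example path are hypothetical.

```python
import json


def describe(node, max_depth=2, _depth=0):
    """Return a compact outline of a parsed JSON value without assuming key names."""
    if isinstance(node, dict):
        if _depth >= max_depth:
            return f"dict({len(node)} keys)"
        return {key: describe(value, max_depth, _depth + 1) for key, value in node.items()}
    if isinstance(node, list):
        if not node or _depth >= max_depth:
            return f"list({len(node)})"
        # Only the first item is outlined; the others are assumed to look alike.
        return {"list_length": len(node), "first_item": describe(node[0], max_depth, _depth + 1)}
    return type(node).__name__


def describe_file(path, max_depth=2):
    """Load a JSON file and return its outline."""
    with open(path, "r", encoding="utf-8") as handle:
        return describe(json.load(handle), max_depth)


# Example (hypothetical path; substitute an actual summary file from results/):
#   print(json.dumps(describe_file("XXXX-E-F-summary.json"), indent=2))
```

Once the outline reveals the actual key names, the energies, forces, and error bars can be read directly with `json.load`.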

---

### 3) `analysis/`

This directory contains the Jupyter notebook used to analyze the benchmark results:

- `plot_tables_and_graphs.ipynb`

---

## LICENSE

- This dataset is distributed under the CC0 license.
