# Fileset

[GLambard_SMILESX_20260128.pdf](https://mdr.nims.go.jp/filesets/3829f9a4-e727-4527-8cc0-736f15212789/download)

## Creator

[LAMBARD Guillaume](https://orcid.org/0000-0003-0275-4079)

## Rights

Copyright 2026 LAMBARD Guillaume

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


## Other metadata

[SMILES-X: A Tailored and Effective Molecular Property Inference and Generation Pipeline](https://mdr.nims.go.jp/datasets/1d7646e7-badd-4764-88b9-aa8f623de6f0)

## Fulltext

### SMILES-X: A tailored and effective molecular property inference and generation pipeline

LAMBARD Guillaume (ラムバール ギヨム), LAMBARD.Guillaume@nims.go.jp

National Institute for Materials Science (NIMS), Centre for Basic Research on Materials (CBRM), Data-driven Materials Design Group

International Institute for Carbon-Neutral Energy Research, Center for Energy Systems Design, Workshop, 2026/01/28

### SMILES-X context

Example molecule: `C[N]1C=NC2=C1C(=O)N(C)C(=O)N2C`

- Representations: 1D SMILES, 2D graph, 3D mol
- Features: 1D, 2D, 3D fingerprints and/or 1D, 2D, 3D physical descriptors
  - Binary ECFP
  - Physical descriptors: [Valence electrons, charge, molecular weight, number of aromatic rings, etc.]
- A.I. models: linear regressions, decision trees, neural networks
- Target: property

Inconveniences:

- Small datasets (<< 10,000 samples)
- Multiple representations
- Domain(task)-specific features
- Design new features? Yes, but hard
- Time-consuming to compute
- Try them all: Representations × Features space × Models
- No/little interpretation of outcomes
- Prediction uncertainties assessment

https://doi.org/10.1016/j.ymeth.2014.08.005

### SMILES-X: Efficient physicochemical properties prediction for small molecules and homopolymers

- Molecular representation: 1D SMILES, 2D graph
  - Small molecule: `C[N]1C=NC2=C1C(=O)N(C)C(=O)N2C`
  - Repeating unit: `*CC*` (the `-C-C-` unit repeated n times)
- Automated molecular description (i.e. featurization)
- Targeted property
- From small (< 100) to big (>> 10,000 samples) datasets
- Automated “SMILES-to-Property” inference model design
- “SMILES-to-property” high accuracy prediction + uncertainty
- “SMILES-to-property” interpretation

[Figure: SMILES-X prediction and interpretation panels. Properties: water solubility (small molecules), refractive index (homopolymers).]

G. Lambard et al., Mach. Learn.: Sci. Technol., 1(2), 025004 (2020)

https://github.com/Lambard-ML-Team/SMILES-X

SMILES-X 2.x coming…

### Successful depiction of homopolymers with SMILES

- IUPAC name: poly[2-(dimethylcarbamoyl)styrene]
- Canonical SMILES: `*CC(*)c1ccccc1C(=O)N(C)C`
- 2D graphs: panels a) and b)

Non-canonical SMILES:

- `C(C(*)c1ccccc1C(=O)N(C)C)*`
- `C(*)(c1ccccc1C(=O)N(C)C)C*`
- `*C(c1ccccc1C(=O)N(C)C)C*`
- `c1(C(C*)*)ccccc1C(=O)N(C)C`
- `c1cccc(C(=O)N(C)C)c1C(C*)*`
- `c1ccc(C(=O)N(C)C)c(C(C*)*)c1`
- `c1cc(C(=O)N(C)C)c(C(C*)*)cc1`
- `c1c(C(=O)N(C)C)c(C(C*)*)ccc1`
- `c1(C(=O)N(C)C)c(C(C*)*)cccc1`
- `C(=O)(N(C)C)c1c(C(C*)*)cccc1`
- `O=C(N(C)C)c1c(C(C*)*)cccc1`
- `N(C)(C)C(c1c(C(C*)*)cccc1)=O`
- `CN(C)C(c1c(C(C*)*)cccc1)=O`
- `CN(C(c1c(C(C*)*)cccc1)=O)C`

Used by the SMILES-X software as augmentation of canonical SMILES → Common methodology for improving convergence and performance of deep learning models

### SMILES-X pipeline

- Input SMILES: `C[N]1C=NC2=C1C(=O)N(C)C(=O)N2C`
- Non-canonical augmentation of SMILES: `[N]1(C)C=NC2=C1C(=O)N(C)C(=O)N2C`, …
- Tokenization: 'C' '[N]' '1' 'C' '=' 'N' 'C' '2' '=' 'C' '1' 'C' '(' '=' 'O' ')' … etc.
- Training - Validation - Test sets (80% - 10% - 10%)
- Bayesian optimization: $h^* = \arg\min_h f(h)$
  - $h$: architecture hyper-parameters (# units, batch size, learning rate)
  - $f$: root-mean-square error (RMSE) on the prediction of the target property

### SMILES-X architecture

- Inputs: token IDs from the chemical vocabulary (token 1, token 2, …, token n)
- EMBED (elemental encoding): numerical vector / token
- ENCODE (contextual encoding): numerical matrix / SMILES
- ATTEND: weighted vector / SMILES; weight tokens according to their link to the target property
- PREDICT or INTERPRET: linear regression / classification → property

### Some results

- Prediction of the coefficient of linear thermal expansion (CLTE) for amorphous homopolymers ($10^{-5}\,\mathrm{K}^{-1}$) - #compounds: 106

E. Gracheva, G. Lambard, S. Samitsu, K. Sodeyama, A. Nakata, STAM: Methods, 1:1, 213-224 (2021)

[Figure: result panels for ESOL, FreeSolv, Lipophilicity, and FreeSolv MD.]

Physical Chemistry Datasets from http://moleculenet.ai/datasets-1

- ESOL: Water solubility experimental data for common organic small molecules (log solubility in mol/L) - #compounds: 1128
- FreeSolv: Hydration free energy experimental and computational data for small molecules in water (kcal/mol) - #compounds: 642
- Lipophilicity: Octanol/water distribution coefficient (logD at pH 7.4) experimental data - #compounds: 4200

G. Lambard et al., Mach. Learn.: Sci. Technol., 1(2), 025004 (2020)

### Context of generative AI for SMILES

https://doi.org/10.1038/432823a

- Exploring the chemical space is hard
- $\gg 10^{60}$ possibilities for small molecules, or homopolymers repeated unit
- More for copolymers (× structural dependency)
- More for polymer blends (× mixing ratios)
- Combinatorial puzzle with limited hardware/software, time, cost
- Can’t rely on experiments alone
- Can’t rely on computational chemistry alone (e.g. DFT, TD-DFT, MD, etc.)
- Materials → properties: likelihood $p(o|s)$ is estimated
- But limited to joint space of states $s$ (tokens) and observables $o$ (properties) presently known
- Properties → Materials: posterior $p(s|o)$ can be estimated through Bayesian inversion of the likelihood $p(o|s)$
- No need of Generative Adversarial Networks (GANs), or reinforcement learning (RL) here
- Bayesian principle: $p(s|o) \propto p(o|s)\,p(s)$ (neglecting the evidence, $p(o) = \sum_{s} p(o|s)\,p(s)$), with $p(s)$ the prior over states $s$
- A molecular structure: $p(s) = p(s_0) \prod_{t} p(s_t \mid s_{t-1}, \dots, s_0)$

### SMILES-X: AI-assisted generation of small molecules or homopolymers

- SMILES-X on tokens enumeration
- SMILES-X on %Biodeg. predictions
- Tokens = any SMILES characters in a given dataset, e.g., PoLyInfo

### AI-assisted generation of high Tg homopolymers

[Figure: predicted Tg (℃) versus observed Tg (℃). Panels: SMILES-X training on PoLyInfo data; SMILES-X testing on lab data. Legend: found in training set; seen for the first time by SMILES-X.]

- MAE ~ 20.9 ℃, RMSE ~ 32 ℃, $R^2$ ~ 0.91
- Even though PoLyInfo data comes from various sources, SMILES-X performs very well on unseen lab data

[Figure: SMILES-X + SMILES-Neo generation. PoLyInfo data distribution versus generated data distribution.]

- Polymethylene with predicted Tg = 340.2 ± 115.2 ℃ and Tg > 300 ℃ in the laboratory
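The tokenization step shown on the pipeline slide ('C' '[N]' '1' 'C' '=' …) can be illustrated with a small regular-expression sketch. This is not the SMILES-X tokenizer: the pattern, the function names `tokenize_smiles` and `build_vocabulary`, and the convention of reserving ID 0 for padding are assumptions made for illustration only.

```python
import re

# Hypothetical tokenizer for illustration; not the SMILES-X implementation.
# Bracket atoms ([N], [nH], ...) and the halogens Br/Cl are kept whole,
# two-digit ring closures (%10) form one token, anything else is one character.
_TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into a list of tokens."""
    return _TOKEN_PATTERN.findall(smiles)


def build_vocabulary(smiles_list: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer ID, starting at 1 (0 is left for padding)."""
    tokens = sorted({tok for smi in smiles_list for tok in tokenize_smiles(smi)})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


caffeine = "C[N]1C=NC2=C1C(=O)N(C)C(=O)N2C"   # SMILES string used on the slides
repeat_unit = "*CC*"                           # homopolymer repeating unit from the slides

tokens = tokenize_smiles(caffeine)
vocab = build_vocabulary([caffeine, repeat_unit])
token_ids = [vocab[tok] for tok in tokens]     # input to an embedding layer
print(tokens)
print(token_ids)
```

Joining the tokens back together reproduces the input string, which is a convenient sanity check when the same tokenizer is applied to augmented (non-canonical) SMILES.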
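The ATTEND step on the architecture slide turns a matrix of per-token vectors into one weighted vector per SMILES. A minimal sketch of that pooling, assuming a plain softmax over one score per token, is given below. In a trained model the scores would come from learned parameters; here they are hand-picked numbers, and the function name `attend` is hypothetical.

```python
import math


def attend(token_vectors: list[list[float]], scores: list[float]):
    """Softmax-weight the token vectors and sum them into a single vector.

    token_vectors: one encoded vector per token (n tokens x d features).
    scores: one unnormalized relevance score per token (assumed given here).
    Returns (weights, pooled_vector).
    """
    top = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                # non-negative, sum to 1
    dim = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return weights, pooled


# Three tokens with 2-dimensional encodings and illustrative scores.
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attend(encoded, scores=[0.0, 0.0, 2.0])
print(weights)   # the third token receives the largest weight
print(pooled)
```

The weights are what makes the INTERPRET branch possible: each one can be read as the relative contribution of a token to the pooled vector that feeds the prediction.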
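The Bayesian inversion stated on the generative-AI slide, with the posterior p(s|o) proportional to p(o|s) times p(s) and an autoregressive prior p(s), can be made concrete with a toy enumeration. Everything numeric below is an assumption for illustration: the three-token vocabulary, the probability tables, the heteroatom-counting "property", and the Gaussian likelihood. The prior is also simplified to first order, p(s_t | s_{t-1}), whereas the slide conditions on the full history.

```python
import math
from itertools import product

VOCAB = ["C", "O", "N"]

# Hypothetical prior tables: p(s_0) and p(s_t | s_{t-1}); each row sums to 1.
P_FIRST = {"C": 0.6, "O": 0.2, "N": 0.2}
P_NEXT = {
    "C": {"C": 0.6, "O": 0.2, "N": 0.2},
    "O": {"C": 0.8, "O": 0.1, "N": 0.1},
    "N": {"C": 0.8, "O": 0.1, "N": 0.1},
}


def prior(seq: str) -> float:
    """Autoregressive prior p(s) = p(s_0) * prod_t p(s_t | s_{t-1})."""
    p = P_FIRST[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= P_NEXT[prev][cur]
    return p


def predict_property(seq: str) -> float:
    """Stand-in for a trained property model: counts heteroatoms (O or N)."""
    return float(sum(tok in "ON" for tok in seq))


def likelihood(target: float, seq: str, sigma: float = 0.5) -> float:
    """Gaussian p(o | s) centred on the predicted property of s."""
    z = (target - predict_property(seq)) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(target: float, length: int = 3) -> dict[str, float]:
    """p(s | o) by exhaustive enumeration; the evidence p(o) is the normalizer."""
    candidates = ["".join(c) for c in product(VOCAB, repeat=length)]
    unnorm = {s: likelihood(target, s) * prior(s) for s in candidates}
    evidence = sum(unnorm.values())
    return {s: w / evidence for s, w in unnorm.items()}


post = posterior(target=2.0)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Exhaustive enumeration only works for a vocabulary and sequence length this small. For realistic SMILES the candidates would have to be sampled token by token from the prior and reweighted by the likelihood, which is outside the scope of this sketch.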