# Fileset

[s41524-019-0203-2.pdf](https://mdr.nims.go.jp/filesets/a1f7e342-d5cf-43d6-9000-1106f509d5da/download)

## Creator

Stephen Wu, Yukiko Kondo, Masa-aki Kakimoto, Bin Yang, Hironao Yamada, [Isao Kuwajima](https://orcid.org/0000-0002-5994-3834), [Guillaume Lambard](https://orcid.org/0000-0003-0275-4079), Kenta Hongo, [Yibin Xu](https://orcid.org/0000-0001-8600-8748), Junichiro Shiomi, Christoph Schick, Junko Morikawa, Ryo Yoshida

## Rights

[Creative Commons BY Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)

## Other metadata

[Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm](https://mdr.nims.go.jp/datasets/f2b2dca6-0f86-4212-ab79-a332ce6ae4c3)

## Fulltext

ARTICLE OPEN

Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm

Stephen Wu (1,2), Yukiko Kondo (3), Masa-aki Kakimoto (3), Bin Yang (4), Hironao Yamada (1), Isao Kuwajima (3), Guillaume Lambard (3), Kenta Hongo (3,5,6), Yibin Xu (3), Junichiro Shiomi (3,7), Christoph Schick (4,8), Junko Morikawa (3,9) and Ryo Yoshida (1,2,3)

The use of machine learning in computational molecular design has great potential to accelerate the discovery of innovative materials. However, its practical benefits still remain unproven in real-world applications, particularly in polymer science. We demonstrate the successful discovery of new polymers with high thermal conductivity, inspired by machine-learning-assisted polymer chemistry. This discovery was made by the interplay between machine intelligence trained on a substantially limited amount of polymeric properties data, expertise from laboratory synthesis and advanced technologies for thermophysical property measurements. Using a molecular design algorithm trained to recognize quantitative structure–property relationships with respect to thermal conductivity and other targeted polymeric properties, we identified thousands of promising hypothetical polymers. From these candidates, three were selected for monomer synthesis and polymerization because of their synthetic accessibility and their potential for ease of processing in further applications.
The synthesized polymers reached thermal conductivities of 0.18–0.41 W/mK, which are comparable to those of state-of-the-art polymers in non-composite thermoplastics.

npj Computational Materials (2019) 5:66; https://doi.org/10.1038/s41524-019-0203-2

### INTRODUCTION

The ability of machine intelligence trained on massive amounts of data to match or even outperform humans has been demonstrated in intellectually demanding tasks across various fields [1–3]. As such, there is growing interest in the use of machine learning (ML) to reap substantial time and cost savings in the development of new materials [4,5]. In particular, remarkable advances have recently been made in ML for de novo molecular design [6–10]. The goal of computational molecular design is the identification of new promising molecules whose physicochemical properties meet arbitrary given requirements. Despite the growing potential of ML in materials science, its practical impacts have not been fully verified. To the best of our knowledge, the emphasis of recent studies has largely been on algorithmic developments, whereas much less work has been done on the experimental verification of computationally designed materials (except for a few works [11,12]). In the particular case of polymers, the synthesis and experimental confirmation of computationally designed polymers is unprecedented. Major challenges in polymer informatics, for example, arise from the lack of data on polymeric properties and from the structural complexity/diversity of polymers [13–15]. In this study, we demonstrate the successful discovery of new polymers with high thermal conductivity that were designed by our ML algorithm, referred to as Bayesian molecular design [16]. This proof-of-concept study is intended to highlight a promising new example of polymer informatics and to raise several issues that should be addressed to enable the widespread use of ML.

This study focused on the design of a chemical structure in the repeat unit of a polymer.
The objective of molecular design is to generate promising hypothetical chemical structures that exhibit a set of desired properties. The chemical space of small organic molecules is known to consist of as many as $10^{60}$ potential candidates [17], whereas the total number of currently known compounds is at most $10^{8}$ [18]. The emergence of ML algorithms, which can exhaustively search this very large space, can contribute significantly to expanding the frontier of the vast chemical universe. In the history of chemical informatics, there have been extensive studies into computational molecular design. Their origin dates back to the pioneering work by Venkatasubramanian et al. [19] Most such studies have focused on the use of a limited number of chemical fragments and their stochastic recombination to sequentially transform starting compounds into desired targets [20,21]. However, this approach significantly narrows the design space. To broaden the search space, more advanced ML techniques using probabilistic language models have appeared in recent years [7,10,22,23]. The Bayesian method developed in our previous work has also contributed to technological advancement in this stream [16].

Despite remarkable methodological innovations in computational molecular design, there are still barriers to achieving a successful proof of concept.
Such barriers arise mainly from the substantially limited amount of polymeric properties data, in addition to the synthetic difficulty of designed candidates, disagreements between expert knowledge and machine-acquired intelligence and the difficulty of meeting stringent requirements in practical applications. Indeed, the experimental data set on thermal conductivity that we used was limited in size, as it consisted of only 28 training instances. The limited amount of training data rendered ordinary ML methods impractical for prediction, as demonstrated later (Fig. 3d). In addition, as a second-rank tensor, thermal conductivity can vary substantially across polymer processing operations, such as laminating films and spinning fibres, where anisotropic molecular orientation is introduced. Most of these variations have not been recorded in the current database. Therefore, we failed to derive practically useful prediction models directly from the given data.

Corrected: Publisher Correction (https://doi.org/10.1038/s41524-019-0217-9). Received: 21 October 2018; Accepted: 28 May 2019.

1 The Institute of Statistical Mathematics, Research Organization of Information and Systems, Tachikawa, Tokyo 190-8562, Japan; 2 The Graduate University for Advanced Studies, Tachikawa, Tokyo 190-8562, Japan; 3 Center for Materials research by Information Integration (CMI2), Research and Services Division of Materials Data and Integrated System (MaDIS), National Institute for Materials Science (NIMS), Tsukuba, Ibaraki 305-0047, Japan; 4 Institute of Physics and Competence Centre CALOR, University of Rostock, 18059 Rostock, Germany; 5 Japan Advanced Institute of Science and Technology, Nomi, Ishikawa 923-1292, Japan; 6 PRESTO, JST, Kawaguchi, Saitama 332-0012, Japan; 7 The University of Tokyo, Bunkyo-ku, Tokyo 113-8656, Japan; 8 Tokyo Tech World Research Hub Initiative (WRHI), Tokyo Institute of Technology, Tokyo 226-8503, Japan and 9 Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan.

Correspondence: Junko Morikawa (morikawa.j.aa@m.titech.ac.jp) or Ryo Yoshida (yoshidar@ism.ac.jp)

Author ORCID links: http://orcid.org/0000-0002-7847-8106, http://orcid.org/0000-0002-2580-0907, http://orcid.org/0000-0001-8092-0162

www.nature.com/npjcompumats. Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences.

Our ML workflow was designed to overcome the issue of limited data. A solution to mitigate this barrier was to exploit proxy properties related to thermal conductivity as alternative design targets. In the Bayesian molecular design process that generated a library of virtual chemical structures, we specified a higher region of glass transition temperatures and melting temperatures as alternative design targets, for which sufficient data were given to obtain reliable prediction models.
We know empirically that polymers with higher glass transition temperatures tend to be achieved by rigid structures, which result in higher thermal conductivity. In addition, taking into account the ease of processing of polymers, we selected designed candidates by eliminating those with exceedingly high glass transition temperatures. Furthermore, an ML framework referred to as "transfer learning" was introduced to obtain a thermal conductivity model with the given small data set. For the given target property to be predicted from the limited supply of data, models on physically related proxy properties were pre-trained using an adequate amount of data, which captured common features relevant to the target task of predicting thermal conductivity. Repurposing such machine-acquired features for the target task markedly improved the prediction accuracy even with the exceedingly small data set. We used the transferred thermal conductivity model to screen promising candidates over the virtual library that was produced by targeting the glass transition and melting temperatures, and then proceeded with laboratory synthesis and experimental characterization of the thermophysical properties. Figure 1 outlines the analytic workflow of this study. R codes to reproduce key results are available at https://github.com/stewu5/HighTCond_Polymer_iqspr.

Finally, three chemical structures were selected from a list of 1000 designed candidates on the basis of criteria involving synthetic accessibility (SA) and ease of processing, which are required for the practical use of enhanced newly designed polymers with high thermal conductivity. Then, the monomers of these candidates were synthesized and polymerized using retrosynthetic routes designed by synthetic chemists. The synthesized polymers exhibited a glassy state, and two of them were crystallized by annealing. We also observed the change in the crystal system resulting from additional chemical reaction during annealing.
Their thermal conductivities reached 0.18–0.41 W/mK within non-composite thermoplastics in amorphous and semi-crystalline states.

### RESULTS

#### Data

PoLyInfo [24] has recorded approximately one hundred kinds of polymeric properties of chemical structures in terms of the constitutional repeat units. Narrowing the focus to 14,423 unique homopolymers in the database, we generated ML models that describe a set of properties as a function of the chemical structures. We extracted a total of 38,310 structure–property relationships with respect to thermal conductivity (λ), glass transition temperature (Tg), melting temperature (Tm) and density (ρ), as summarized in Table 1. When multiple property values were recorded for a polymer under the same experimental condition, they were reduced to the mean value.

Fig. 1 Machine learning (ML)-assisted de novo design and experimental validation of new polymers. a The objective of forward prediction is to derive a model that describes polymeric properties (e.g., glass transition temperature (Tg) and melting temperature (Tm)) as a function of chemical structures in the constitutional repeat units. The forward model trained on the data set from PoLyInfo was inverted to obtain a backward model, which was conditioned by desired property regions ($U_{T_g}$ and $U_{T_m}$). The backward model produced a library of hypothetical chemical structures that exhibit the desired properties. In addition, we developed a prediction model of thermal conductivity, which was utilized in the post-screening of the produced library. Here, an ML framework called transfer learning was used to overcome the issue of limited data on thermal conductivity: prediction models of proxy properties were pre-trained on given large data sets from PoLyInfo and QM9, and then the pre-trained models were fine-tuned using the limited data on the target property. We did not use the transferred models directly for the molecular design calculation because their generalization capability would likely be restricted on the design space spanned by the few training polymers. b Analytic workflow consisting of four internal steps towards materials discovery.

The volume of data varies significantly across different properties. For example, PoLyInfo recorded multiple values of Tg and Tm for 5917 and 3234 unique homopolymers, respectively. In contrast, there were 322 observations for only 28 homopolymers with respect to λ around room temperature (10–35 °C). Moreover, λ varied considerably even within the same polymer, as shown in Fig. 2a (unreliable data were removed by curation). Such within-polymer fluctuations could arise from differences in processing operations, higher-order molecular structures or any other measurement conditions that varied in different studies. Unfortunately, such information was mostly not recorded in the database. Consequently, supervised learning directly using the given data on λ failed to reach desirable levels of prediction accuracy (Fig. 3d).

The lack of data in terms of both quantity and quality prompted us to pursue a strategic solution based on the use of Tg and Tm as proxy target properties in the de novo design calculation, as described later. In addition, we applied transfer learning to obtain a prediction model on λ, which was used in the post-screening process.
In the construction of pre-trained models for transfer learning, we utilized the four data sets from PoLyInfo and the QM9 data set [25,26] that records the computational data of specific heat capacity at constant volume (CV) for 133,805 small organic molecules, which were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry.

Table 1. Summary of the structure–property relationship data sets from PoLyInfo and QM9 and their classification by use

| Use | Database | Property | Number of structures | Number of samples | Max σ of within-polymer fluctuation | Range of temperature |
| --- | --- | --- | --- | --- | --- | --- |
| CMD, TLλ | PoLyInfo | Tg | 5917 | 17,001 | 30 °C | N/A |
| CMD, TLλ | PoLyInfo | Tm | 3234 | 12,374 | 30 °C | N/A |
| TLλ | PoLyInfo | ρ | 1516 | 8613 | 0.50 g/cm3 | 10–35 °C |
| TLλ | QM9 | CV | 133,805 | 133,885 | 0.97 cal/molK | 25 °C |
| Post-screening | PoLyInfo | λ | 28 | 322 | 0.10 W/mK | 10–35 °C |

For the PoLyInfo data sets, only homopolymers that have linearly connected structures with no additives or fillers were selected. CMD, used for forward modelling in the molecular design calculation; post-screening, used for transfer learning to obtain a screening model of λ; TLλ, used to obtain pre-trained source models for transfer learning; σ, standard deviation; Tg, glass transition temperature; Tm, melting temperature.

Fig. 2 Summary of PoLyInfo data. a Average properties of recorded polymers are plotted in ascending order with error bars indicating ±1σ (σ: standard deviation). b Scatterplot matrix that summarizes the joint distribution of the five polymeric properties. CP denotes specific heat capacity at constant pressure.

#### Overview of Bayesian molecular design

The objective of the de novo design calculation is to algorithmically create a chemical structure S in a polymer repeat unit, that is, monomer, for which n polymeric properties $Y = (Y_1, \ldots, Y_n)$ lie in a desired region U. The chemical structure S that represents a configuration of atoms and chemical bonding is encoded as a sequence of SMILES symbols (simplified molecular-input line-entry system [27]) in which $S = s_1 s_2 \ldots s_g$ forms a variable-length string, here consisting of g letters. For example, a SMILES string representing phenol (C6H6O) is C1=CC=C(C=C1)O, where C and O indicate the aliphatic carbon and oxygen atoms, and = indicates the double bond. The start and terminal of a ring closure are designated by a common digit, 1 in this case, and the side chain is enclosed in parentheses, "(" and ")".

The Bayesian molecular design framework relies on the statement of Bayes' law:

$$p(S \mid Y \in U) \propto p(Y \in U \mid S)\, p(S), \tag{1}$$

where p(A|B) denotes the conditional probability distribution of A given B. ML models on n properties were trained with structure–property relationship data sets that define the forward model $p(Y \mid S) = \prod_{i=1}^{n} p(Y_i \mid S)$. Imposing the desired region U on Y provides p(Y ∈ U|S) on the right-hand side of Eq. (1). This probability evaluates the goodness of fit of S with respect to the property requirements. The prior distribution p(S) serves to reduce the occurrence of chemically unfavourable or unrealistic structures in designed molecules as it assigns zero or lower probability masses to invalid or unrealistic chemical structures. For a given p(S), Bayes' law inverts the forward model (S → Y) to obtain the backward model p(S|Y ∈ U) (Y → S). We then draw a random sample of the SMILES string (S) from high-probability regions of the backward model using a sequential Monte Carlo (SMC) method [28] to identify promising monomers that exhibit the desired U. The R language library iqspr 1.0 [16] (the latest version is 2.4) that we developed was used to pipeline the forward and backward calculations.

The SMC method shares a common algorithmic structure with genetic algorithms. The prior p(S) constitutes the most important factor that influences the structural features of the produced sample.
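The inversion in Eq. (1) can be illustrated numerically. The following is a minimal sketch, not the iqspr implementation: it assumes a Gaussian predictive distribution for each property, and the candidate names, prior weights, predicted means and standard deviations are all invented for illustration. The property ranges are the Tg and Tm design ranges quoted in the Design targets section.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_in_region(mean, sd, lower, upper):
    """p(Y_i in [lower, upper] | S) under an assumed Gaussian prediction."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)

def posterior_weights(candidates, regions):
    """Normalized p(S | Y in U) over a finite candidate set, following Eq. (1).

    candidates: {name: {"prior": p(S), "pred": {property: (mean, sd)}}}
    regions:    {property: (lower, upper)}
    """
    unnormalized = {}
    for name, cand in candidates.items():
        likelihood = 1.0
        for prop, (lower, upper) in regions.items():
            mean, sd = cand["pred"][prop]
            # Forward model factorizes as a product over properties.
            likelihood *= prob_in_region(mean, sd, lower, upper)
        unnormalized[name] = likelihood * cand["prior"]
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}

# Hypothetical candidates with made-up predictions (degrees Celsius).
candidates = {
    "A": {"prior": 0.5, "pred": {"Tg": (250.0, 30.0), "Tm": (400.0, 30.0)}},
    "B": {"prior": 0.3, "pred": {"Tg": (120.0, 30.0), "Tm": (310.0, 30.0)}},
    "C": {"prior": 0.2, "pred": {"Tg": (210.0, 60.0), "Tm": (350.0, 60.0)}},
}
regions = {"Tg": (200.0, 500.0), "Tm": (300.0, 600.0)}
weights = posterior_weights(candidates, regions)
```

Candidate C has a predicted Tg close to that of A but a larger predictive standard deviation, so more of its probability mass falls outside the target region and it receives a lower weight. This is the sense in which the design result depends on whether prediction uncertainty is taken into account.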
In the implementation of iqspr, the prior is modelled by a probabilistic language model that we call the extended n-gram, which takes the form $p(S) = p(s_1) \prod_{i=2}^{g} p(s_i \mid s_{i-1}, \ldots, s_1)$. The occurrence probability of the ith letter, $s_i$, depends on the preceding $s_{i-1}, \ldots, s_1$. The conditional probability $p(s_i \mid s_{i-1}, \ldots, s_1)$ is estimated by the frequencies of substring patterns in a training set of existing chemical structures. The trained language model is anticipated to successfully learn structural patterns of the existing compounds or implied contexts of "chemically favourable or realistic" structures. For a given randomly chosen substring $s_{i-1}, \ldots, s_1$, the trained probabilistic model is used to modify the rest of the components by recursively adding subsequent letters according to the conditional probabilities, which encode the acquired chemical reality. In this way, a currently given set $\{S_1, \ldots, S_M\}$ of M chemical structures could be consecutively updated to a new population. The fitness scores of the updated structures are assessed based on the forward model. Structures with better fitness have a better chance to survive in the next generation. This process is iterated many times, and at the end, samples from the targeted posterior are produced. The algorithmic details are shown in Ikebata et al. [16]

As mentioned in the beginning, molecular design techniques using probabilistic language models have appeared rapidly since 2017. The present method has some distinctive methodological features, which are briefly noted here. One of the distinctive features of our method is that it relies on the Bayesian framework, which provides a natural way to pipeline the workflow between the forward and backward prediction processes. In addition, the Bayesian approach benefits from the principle-based handling of "uncertainty" in the prediction models.
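The frequency-based estimate of the conditional probabilities, and the recursive extension of a retained substring, can be sketched with a plain character-level n-gram. This is a deliberately simplified stand-in, not the extended n-gram of iqspr: it treats every character as one symbol (so two-letter atoms such as Cl would be split), ignores ring and branch grammar, and is trained on a toy list of strings chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed boundary markers, not SMILES symbols

def train_ngram(smiles_list, order=3):
    """Count which symbol follows each context of 1..order-1 symbols."""
    counts = defaultdict(Counter)
    for smiles in smiles_list:
        symbols = [START] + list(smiles) + [END]
        for i in range(1, len(symbols)):
            for k in range(1, min(order - 1, i) + 1):
                counts[tuple(symbols[i - k:i])][symbols[i]] += 1
    return counts

def next_symbol_probs(counts, prefix, order=3):
    """Relative-frequency estimate of p(s_i | preceding symbols).

    Uses the longest context seen in training, backing off to shorter ones.
    """
    symbols = [START] + list(prefix)
    for k in range(min(order - 1, len(symbols)), 0, -1):
        context = tuple(symbols[-k:])
        if context in counts:
            total = sum(counts[context].values())
            return {s: c / total for s, c in counts[context].items()}
    return {END: 1.0}  # unseen context: terminate the string

def log_prob(counts, smiles, order=3):
    """log p(S) as a sum of conditional log-probabilities."""
    total = 0.0
    for i, symbol in enumerate(list(smiles) + [END]):
        p = next_symbol_probs(counts, smiles[:i], order).get(symbol, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def extend(counts, prefix, rng, order=3, max_len=40):
    """Keep a prefix and regrow the rest symbol by symbol from the model."""
    out = list(prefix)
    while len(out) < max_len:
        probs = next_symbol_probs(counts, out, order)
        symbols, weights = zip(*sorted(probs.items()))
        nxt = rng.choices(symbols, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
    return "".join(out)

model = train_ngram(["CCO", "CCN", "CCO"])   # toy training set
proposal = extend(model, "CC", random.Random(0))
```

With this toy training set, the prefix CC is followed by O in two of three strings, so the model assigns O a probability of 2/3 and N a probability of 1/3. Strings containing transitions that never occur in the training set receive zero probability, which is how a prior of this kind suppresses unfamiliar structures.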
A chemical structure S is designed based on p(Y ∈ U|S), the probability that for a given S, its property Y lies in a desired region U in the presence of prediction uncertainty in the trained forward model S → Y. The design results depend strongly on whether or not the uncertainty is considered.

Another feature lies in the architecture of probabilistic language models. One major difficulty of constructing a SMILES generator is associated with the rules of grammar regarding the expression of rings and branching components. To be specific, unclosed ring and branch indicators must be prohibited. For instance, any strings extended rightward from a given $s_{1:6}$ = CC(C(C should eventually contain two closing letters, ")". In addition, the issue of "long-term dependency" must be addressed: neighbours in a string are not always adjacent in the original molecular graph. For example, the occurrence probability of the last carbon in a structure expressed by CCCCC(CCCCC)C should be affected more by the letters in the main chain, that is, the first five "C" in the string, than by the adjacent letters because the substring in the parentheses constitutes a branch from the main chain. It is quite difficult for ordinary language models to capture such intrinsic patterns in SMILES representations without any special operations. Most recent works have relied on deep neural networks (DNNs) as molecular generators, such as recurrent neural networks [7] or variational autoencoders [8]. In general, massive numbers of training instances are needed for such DNNs to learn the underlying high-level contexts of chemical rules and the grammatical rules of SMILES in fully data-driven analysis without any prior knowledge. However, the extended n-gram that we developed is a highly engineered model specifically developed for the ML of the SMILES language.
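The grammar constraint on branches and rings can be made concrete with a small bookkeeping function. This is an illustrative sketch rather than a SMILES parser: it assumes single-digit ring labels and ignores bracket atoms such as isotope or charge specifications, whose digits would be miscounted.

```python
def unclosed_indicators(prefix):
    """Return (open_branches, open_ring_labels) for a partial SMILES string.

    Simplifying assumptions: ring-closure labels are single digits and the
    string contains no bracket atoms (e.g. isotopes), which also use digits.
    """
    open_branches = 0
    open_rings = set()
    for ch in prefix:
        if ch == "(":
            open_branches += 1
        elif ch == ")":
            if open_branches == 0:
                raise ValueError("')' without a matching '('")
            open_branches -= 1
        elif ch.isdigit():
            # A ring label opens on its first occurrence and closes on the next.
            open_rings ^= {ch}
    return open_branches, open_rings

def is_grammatically_closed(smiles):
    """True if every branch and ring opened in the string has been closed."""
    branches, rings = unclosed_indicators(smiles)
    return branches == 0 and not rings
```

For the prefix CC(C(C from the text, the function reports two open branches, so a generator would have to emit two closing parentheses before terminating the string. A generator that tracks this state can restrict its next-symbol choices accordingly instead of having to learn the rule from data.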
It required significantly less data to train than the DNNs.

Fig. 3 Performance of forward prediction models. a, b Five-fold cross-validation of trained linear models for glass transition temperature (Tg) and melting temperature (Tm). All predicted values in the five validation sets are plotted against observed values, denoted by blue dots (red for the training). The mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R) are shown in each plot. c, d Validation results for the prediction model on λ that exhibited the best transferability (MAE = 0.0204 W/mK) out of 1000 pre-trained models on Tm. The prediction results of the best transferred model and a random forest model trained directly using the 28 data points for λ (MAE = 0.0327 W/mK) are shown in c, d respectively.

#### Forward prediction on Tg and Tm

Forward models on Tg and Tm were used as the proxy targets in the Bayesian design calculation. The chemical structure of a monomer was encoded into a descriptor vector of binary digits comprised of multiple molecular fingerprints, such as the extended connectivity fingerprint [29]. For Tg or Tm, a linear regression model, which described the polymeric property as a function of molecular fingerprints, was trained on a random selection of 80% of the instances of the given data in PoLyInfo. Figure 3a, b show the prediction performance of these models on the validation data set.

#### Forward prediction on λ

For the post-screening, we developed neural network models for λ using a transfer learning technique to break the barrier of the exceedingly limited data.
First, we generated 1000 pre-trained neural networks for each of Tg, Tm and ρ using the data from PoLyInfo, as well as 1000 models for CV with the QM9 data set. Each neural network consisted of a fully connected pyramid structure in which the size of layers and the number of neurons were randomly chosen. For a given pre-trained model, we refined the weight parameters using the small data set on λ, for which the initial values of parameters were taken from the pre-trained neural networks of the related tasks. Among the 1000 pre-trained models of each property, we identified the best transferable model of predicting λ that exhibited the highest generalization capability on the five validation sets, each randomly constructed from 20% of the given data. Figures 3c and 4b show the performance of two models on λ that were transferred from Tm and CV, respectively. The model that performed best in predicting λ was transferred from a pre-trained model of the monomer-level CV. The prediction accuracy of the transferred model reached a mean absolute error (MAE) of 0.0204 W/mK, a reduction of about 40% compared with that of a random forest model trained directly using the 28 data points (MAE = 0.0327 W/mK) (see Fig. 3c, d). Further details are described in the Supplementary Information (SI), for example, a successful transfer from Tg to λ.

#### Design targets

Transfer learning has substantially improved the accuracy of predicting λ. Nevertheless, we could not dispel uncertainty in the generalization capability because the given model was validated only on an input subspace spanned by the 28 training polymers, which was rather small with respect to the entire materials space. The use of such an unreliable forward model, in turn, could lead to significant inaccuracy or bias in designed molecules. Thus, instead of directly targeting λ in the design calculation, we decided to use the relatively reliable models on Tg and Tm as intermediate targets, and the transferred model on λ was used in the post-screening step.
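The validation metrics and the selection of the best transferable model can be sketched as follows. This is an illustration of the bookkeeping only: the neural networks and their fine-tuning are omitted, the model identifiers and per-split errors are invented, and only the two MAE values (0.0204 and 0.0327 W/mK) come from the text.

```python
import math

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def select_best_model(validation_maes):
    """Pick the model with the lowest MAE averaged over the validation sets.

    validation_maes: {model_id: [MAE on each validation set]}
    """
    return min(
        validation_maes,
        key=lambda m: sum(validation_maes[m]) / len(validation_maes[m]),
    )

# Hypothetical per-split errors for three fine-tuned models (W/mK).
validation_maes = {
    "pretrained_017": [0.031, 0.028, 0.035, 0.030, 0.029],
    "pretrained_402": [0.021, 0.019, 0.022, 0.020, 0.020],
    "pretrained_815": [0.026, 0.030, 0.024, 0.027, 0.025],
}
best = select_best_model(validation_maes)

# Relative MAE reduction implied by the two values reported in the text.
relative_reduction = (0.0327 - 0.0204) / 0.0327
```

The relative reduction evaluates to roughly 0.38, which the text quotes as 40%.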
Though the connection between λ and these surrogate properties has not yet been fully understood, there is some evidence to support our strategy.

Fig. 4 Summary of screening results. a Repeat units of 24 screened polymers. The synthesized polymers are numbered in red. A zoomed version is available in SI (Fig. S3). b The predicted and observed values of λ for the 28 existing polymers recorded in PoLyInfo (grey) and the three synthesized polyamides (coloured and numbered). c Predicted properties are shown on λ vs. SA scores. Grey dots denote the 1000 designed candidates, and the 24 screened candidates are colour coded as described in the legends on the right-hand side. The numbers are assigned to the newly synthesized polyamides.

It is widely known that increasing the rigidity of polymer chains can increase the values of Tg and Tm, consequently leading to high values of λ. For example, it has been reported that the maximum value of λ in a glassy phase depends on the level of Tg [30,31]. Theoretically, lattice heat conduction in crystals can be conceived in terms of the kinetics of propagating phonons, where thermal conductivity is determined by the heat capacity, group velocity and mean free path (or velocity times lifetime) of phonons. Here, velocity can be related to harmonic interatomic/intermolecular force constants (IFCs) and the lifetime to anharmonic IFCs. Of course, polymers are often disorderly in structure, which reduces the mean free path so that phonons no longer propagate, and their thermal conductivity can be expressed by the heat capacity and mode diffusivity obtainable by harmonic IFCs [32]. However, this does not mean that disorder terminates the propagation of all phonons. Even in amorphous polymers, some phonons can still propagate depending on the frequency.
In both scenarios (harmonic and anharmonic), the strength of intermolecular forces affects thermal conductivity. Therefore, we expect to see some correlation, either directly or indirectly, between thermal conductivity and Tg and Tm, which are also strongly affected by the strength of intermolecular forces, as transition fundamentally involves the breaking of bonds or a cooperative mode change, where harmonic and anharmonic forces correspond to small and large intermolecular displacements.

The observed data also showed weak positive correlations between Tg, Tm and λ, as shown in Fig. 2b. Indeed, the success of the model transfer from Tg or Tm to λ constitutes evidence in favour of using Tg and Tm as proxy design targets (Fig. 3c, d and SI). We chose target design ranges of 200–500 °C for Tg and 300–600 °C for Tm.

High λ is produced not only by rigid polymer main chains with high Tg or Tm but also by the highly oriented molecular chain that is often observed in ultra-drawn fibres, axially oriented thin films and injection-moulded pieces [33]. In addition, processing ease is indispensable for the practical use of polymeric materials to shape them as films, fibres, moulding and so on. From the perspective of further developments and industrial applications, we targeted liquid-crystalline polymers (LCPs) in both the de novo design calculation and the post-screening process. We chose this particular target because of its practical importance in effective thermal management applications, heat exchangers and energy storage. In general, polymers have quite low thermal conductivity, typically 0.1–0.2 W/mK, because of their semi-crystalline, electrically insulating structures.
The side chains or main chains of LCPs make up a family of thermoplastics that exhibit high heat resistance and tolerance, high electrical resistance and high chemical resistance [34–36]. The ordered stacked orientation along one direction of LCPs significantly increases their thermal conductivity in the direction of the molecular orientation. In this study, LCP likeliness was set as a design objective because of the intrinsic processability and rigidity of LCPs to enhance thermal conductivity in further applications. We compiled a list of LCP-like substructures (Fig. 5) based on expert knowledge. During the de novo design calculation, sequentially generated structures were scored higher if they contained one or more fragments in the list so as to create a library of LCP-like structures. Thus, the forward model in Eq. (1) takes the form

$$p(Y \in U \mid S) \propto p(Y_{T_g} \in U_{T_g} \mid S)\, p(Y_{T_m} \in U_{T_m} \mid S)\, \theta^{\mathbb{1}(Y_f(S) \cap U_f \neq \emptyset)},$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, which takes the value one if the argument is true and zero otherwise. In addition to the probabilities that Tg and Tm of S lie in the desired regions, $U_{T_g}$ and $U_{T_m}$, the additional score θ > 1 is assigned to S if its substructures $Y_f(S)$ coincide with at least one fragment listed in the LCP-likeliness filter $U_f$. Furthermore, in the post-screening step, we once again selected LCP-like candidates that contained one or more fragments while assessing the predicted values of λ and SA.

#### Backward prediction: generation of candidates

The iqspr package consists of two main modules: (1) ML algorithms to train the forward prediction models and the prior distribution and (2) the Monte Carlo generation of de novo molecules from the backward model. The preparation of the forward model has already been described. The prior distribution p(S) takes the form of a probabilistic language model. We then trained the model on the SMILES strings of the 14,423 unique homopolymers recorded in PoLyInfo.
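The LCP-weighted forward score given earlier in this section, together with the fitness-proportional survival step of the SMC sampler, can be sketched as follows. This is a minimal illustration under stated assumptions, not the iqspr code: the value of θ (the text only requires θ > 1), the fragment labels, the structure names and the probabilities are all hypothetical.

```python
import random

def lcp_weighted_score(p_tg, p_tm, substructures, lcp_fragments, theta=2.0):
    """p(Tg in U_Tg | S) * p(Tm in U_Tm | S) * theta ** indicator.

    The indicator is 1 when the structure shares at least one fragment with
    the LCP-likeliness list and 0 otherwise. theta=2.0 is an assumed value.
    """
    has_lcp_fragment = bool(set(substructures) & set(lcp_fragments))
    return p_tg * p_tm * (theta if has_lcp_fragment else 1.0)

def resample(population, scores, rng):
    """Draw the next generation with probability proportional to fitness."""
    return rng.choices(population, weights=scores, k=len(population))

# Hypothetical fragment labels standing in for the entries of the LCP list.
lcp_fragments = {"frag_a", "frag_b"}

# Hypothetical population: (name, p_tg, p_tm, substructures found in S).
population = [
    ("S1", 0.5, 0.4, {"frag_a", "frag_x"}),
    ("S2", 0.5, 0.4, {"frag_x"}),
    ("S3", 0.1, 0.2, {"frag_b"}),
]
names = [name for name, *_ in population]
scores = [
    lcp_weighted_score(p_tg, p_tm, subs, lcp_fragments)
    for _, p_tg, p_tm, subs in population
]
next_generation = resample(names, scores, random.Random(0))
```

S1 and S2 have identical property probabilities, so the only difference in their scores is the factor θ contributed by the LCP fragment in S1. A structure without any listed fragment is not excluded; it is merely less likely to survive resampling.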
The trained prior implicitly encoded frequently appearing atomic configuration and chemical bonding in the existing polymers with the given instances of the SMILES character sequences. Monte Carlo samples drawn from this prior are anticipated to recognize implied contexts in the chemical language such as exclusion rules of invalid chemical bonding, SA and chemical stability.

With the prior and the forward model to form the backward model, the SMC calculation was executed to successively refine SMILES strings of seed molecules such that their resulting properties lay in the desired property region. The iqspr script that we used is provided at the GitHub repository, https://github.com/stewu5/HighTCond_Polymer_iqspr, along with the models trained on Tg and Tm, and the chemical language model. We generated 1000 promising synthetic targets with predicted polymeric properties lying in the prescribed ranges of Tg and Tm. Examples of the generated chemical structures are depicted in Fig. 4a. Supplementary Movie S1 shows the process of transforming chemical structures and refining the target properties.

Fig. 5 List of fragments compiled on the liquid-crystalline polymer (LCP)-likeliness filter that were used in the de novo design and post-screening process.

#### Selection of synthetic targets

To assist in the selection of synthetic targets, we imposed screening steps on the 1000 designed candidates. First, to identify LCP-like structures, candidates that exhibited one or more components on the list in Fig. 5a were moved forward. Next, we evaluated their synthesizability using Schuffenhauer's SA scores [37]. Finally, considering the ease of processing required in industry, we prioritized candidates with Tg ≤ 300 °C.
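The three screening steps can be sketched as a simple filter chain. This is a hedged illustration only: the text does not state an SA cutoff, so `max_sa` is an assumed parameter, the "prioritized" Tg criterion is simplified here to a hard filter, and every candidate record below is invented. The sketch assumes the usual convention that a lower SA score means easier synthesis.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    has_lcp_fragment: bool  # contains at least one listed LCP-like fragment
    sa_score: float         # synthetic accessibility; lower is assumed easier
    tg_pred: float          # predicted Tg in degrees Celsius
    lambda_pred: float      # predicted thermal conductivity in W/mK

def screen(candidates, max_sa=4.0, max_tg=300.0):
    """Apply the LCP, SA and Tg criteria, then rank by predicted lambda.

    max_sa is a hypothetical cutoff; max_tg follows the Tg <= 300 criterion.
    """
    kept = [c for c in candidates if c.has_lcp_fragment]
    kept = [c for c in kept if c.sa_score <= max_sa]
    kept = [c for c in kept if c.tg_pred <= max_tg]
    return sorted(kept, key=lambda c: c.lambda_pred, reverse=True)

# Invented candidates for illustration.
candidates = [
    Candidate("p1", True, 3.0, 250.0, 0.30),
    Candidate("p2", True, 3.5, 350.0, 0.40),   # predicted Tg too high
    Candidate("p3", False, 2.0, 200.0, 0.35),  # no LCP-like fragment
    Candidate("p4", True, 2.5, 280.0, 0.33),
    Candidate("p5", True, 6.0, 220.0, 0.45),   # SA score above the cutoff
]
shortlist = screen(candidates)
```

In this toy example the candidates with the highest predicted λ (p5 and p2) are removed by the SA and Tg criteria, which mirrors the trade-off between predicted performance and practicality that the selection involves.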
As a result, 24 candidates were identified for further investigation of potential routes of chemical synthesis (Fig. 4a). Eventually, synthetic routes could be identified for three kinds of polyamides (Fig. 6a), which were successfully synthesized (see SI for more details): polyamides 4, 13 and 19, which are a wholly aromatic polyamide, an aromatic polyhydrazide and an aliphatic–aromatic polyamide, respectively. Figure 4c shows the predicted values of λ with the SA scores of the three polyamides. In the decision-making process, we placed particular importance on the SA and the ease of processing of the created polymers. As a consequence, the predicted values of λ for the three selected polyamides were not particularly high.

Fig. 6 Details of the three synthesized polymers. a Chemical structures of the three synthesized polymers and b–e designed synthetic routes to the targets (see SI for further details).

#### Experimental validation

As shown in Fig. 6b–e, polymers 4 and 13 were prepared by the reaction between dicarboxylic acids (dicarboxyl chloride) and diamines, whereas polymer 19 was prepared starting from a self-condensation AB type monomer. In addition, an analogous polyamide to 19, denoted as 19a, was prepared from asymmetric dicarboxylic acid monomer M6 and m-phenylenediamine M7. Polyamide 19a has three different sequences, as shown in Fig. 6e. The monomers for 19 and 19a were newly synthesized, and the preparative procedure is described in SI.

Among the three synthesized polyamides (4, wholly aromatic polyamide; 13, aromatic polyhydrazide; 19 or 19a, aliphatic–aromatic polyamide), 19 is a completely new substance. Chemical analysis was carried out by elemental analysis, nuclear magnetic resonance (¹H NMR) and Fourier-transform infrared spectroscopy.
The thermophysical properties of inherent viscosity, thermal diffusivity, specific heat capacity at constant pressure (CP), ρ, Tg and Tm were measured using an Ostwald viscometer, temperature wave analysis (TWA) [38,39], differential scanning calorimetry (DSC), Archimedes' method and a fast scanning calorimeter (FSC) [40]. Thermogravimetric analysis and thermomechanical analysis indicated that the weight loss of polymers 4 and 13 was as low as 5 and 20%, respectively, even at 500 °C, demonstrating high heat resistance. Using the FSC technique, the Tg and Tm of all three polymers were observed; at 500 °C or less, these values were not detectable by conventional DSC, except for the Tg of polymer 19. We confirmed the crystallinity of the polymers by X-ray diffraction measurements. For thermal conductivity near room temperature, compressed polymers 4 and 13 reached 0.26 and 0.22 W/mK, respectively. Polymer 19 was soluble in organic solvent; thus, film formation is possible. Polymer 19 could be categorized as an amorphous polymer, with a Tg of 194 °C clearly observed by conventional DSC; its thermal conductivity, 0.195 W/mK, is notably high for an amorphous polymer. Polymers 13 and 4 reached thermal conductivities of 0.39 and 0.41 W/mK after annealing at 370 and 420 °C, respectively. These values were comparable to those of state-of-the-art non-composite thermoplastics. As summarized in Table 2 and Fig. 4c, the experimentally confirmed Tg, Tm and λ were highly consistent with their predicted values for polymers 4, 13 and 19a. A full summary of all the material properties tested is available in SI (Table S2).

### Discussion

The high level of agreement between the predicted and experimental thermal conductivities validates the ML protocols as the first stage of molecular design in this study.
The absolute prediction errors for 4, 13 and 19a were 0.015, 0.001 and 0.017 W/mK for λ and 65, 2 and 70 °C for Tg, respectively.

In addition, to evaluate the thermophysical properties of the limited amount of newly synthesized polymers, recent measurement techniques were introduced. Thermal diffusivity was measured by micro-scale temperature wave analysis (TWA), originally developed for the small-scale measurement of polymers [38] (Fig. S13 in SI). Thermal conductivity was calculated from the measured thermal diffusivity along with the measured density and specific heat capacity (Table S2). The ultra-fast scanning nano-scale calorimetric technique (FSC) [40] (Fig. S9 in SI) was applied to measure the Tg and Tm of aromatic polyamides for the first time; these temperatures had not been observed previously because of thermal degradation during conventional DSC measurements. By using a scan rate of 30,000 K/s, we could experimentally observe Tg, Tm and, in the case of polymer 13, cold crystallization phenomena.

The thermal conductivity of new and existing polymers is compared in Table 3. The new polymers, three kinds of polyamide containing mesogen groups, as depicted in Fig. 4a, were compared with typical polyimide films utilized in electronic applications. Typical polyimides, such as Kapton and Upilex, in the amorphous state exhibited thermal conductivity values of approximately 0.17–0.22 W/mK, whereas the thermal conductivity of the new polymers was 18–80% higher, in the range of 0.20–0.41 W/mK. The post-screening by the LCP filter successfully produced a liquid-crystalline-like polymer with a moderate targeted Tg (<300 °C), chosen in consideration of other important factors, such as SA and the ease of processing required in industry.
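The Discussion notes that thermal conductivity was calculated from the measured thermal diffusivity, density and specific heat capacity. The text does not spell out the formula; the sketch below assumes the standard relation λ = α·ρ·c_p in SI units. The function name and the input values are hypothetical and are not measurements from this study.

```python
def thermal_conductivity(alpha_m2_per_s, rho_kg_per_m3, cp_j_per_kg_k):
    """Return lambda in W/(m K) from diffusivity, density and specific heat.

    Assumes the standard relation lambda = alpha * rho * c_p.
    """
    return alpha_m2_per_s * rho_kg_per_m3 * cp_j_per_kg_k

# Made-up inputs of a plausible order of magnitude for a polymer film.
alpha = 1.5e-7   # thermal diffusivity, m^2/s
rho = 1300.0     # density, kg/m^3
cp = 1100.0      # specific heat capacity, J/(kg K)
print(round(thermal_conductivity(alpha, rho, cp), 4))
```

Because the three inputs are multiplied, a relative error in any one of them carries over directly to λ, which is why the density and heat capacity are reported alongside the diffusivity.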
A film-shaped sample was realized for the synthesized polymer 19a, which is soluble in organic solvent.

Table 2. Experimental properties of the three newly synthesized polymers compared with predictions from ML models

| Polymer | 4 (pre) | 4 (obs) | 4 (anneal) | 13 (pre) | 13 (obs) | 13 (anneal) | 19 (pre) | 19a (obs) |
|---|---|---|---|---|---|---|---|---|
| Tg (°C) (DSC) | 286 | N/A (a) | – | 228 | N/A (a) | – | 121 | 194 |
| Tg (°C) (FSC) | 286 | 221 | – | 228 | 226 | – | 121 | 191 |
| Tm (°C) (FSC) | 404 | 513 | – | 426 | 494 | – | 321 | 303 |
| λ (W/mK) | 0.246 | 0.261 | 0.408 (b) | 0.225 | 0.224 | 0.387 | 0.218 | 0.195 |
| Xc | – | 0.16 | – | – | 0.30 | 0.30 | – | 0.09 (c) |

Compressed film-shaped samples were used in all cases except the X-ray diffraction of polymer 19a. We report values from prediction (pre), observation (obs) and observation after annealing (anneal). DSC differential scanning calorimetry, FSC fast scanning calorimetry, Tg glass transition temperature, Tm melting temperature.
(a) Conventional DSC did not yield Tg values; instead, FSC was introduced to determine Tg and Tm.
(b) Thermal conductivity of annealed polymer 4 was obtained using the heat capacity and density measured for non-heat-treated samples.
(c) Crystallinity (Xc) of polymer 19a was measured in powder form.

To conclude, we have demonstrated the discovery of new thermally conductive polymers by the use of a series of ML methods in combination with a comprehensive database of polymer properties, expertise from organic synthesis and advanced measurement technologies for thermal properties. In particular, the experimentally confirmed properties of the computationally designed polymers are highly consistent with the predicted values from ML. We identified retrosynthesis routes to the designed monomers, which were then synthesized and polymerized. Some of the resulting polymers exhibited crystallinity, glassy states and promising thermal properties.
Their potential processability and suitability for film casting provide the basis for revealing further optimized properties.

To fully realize the great potential of ML-driven polymer chemistry, there are still some hurdles to be overcome. A wide variety of databases have been developed in various fields of materials science, providing the starting point for data-intensive and ML-centric workflows (Materials Project [41], AtomWorks [42], OQMD [43] and so on). However, very little such work has been done for polymers; there are no comprehensive databases of polymeric properties other than PoLyInfo and Polymer Genome [44], at least in the public domain. In addition, where polymers are concerned, high-throughput, automated computations such as molecular dynamics simulations are currently difficult to execute. In this study, the available data on thermal conductivity were too sparse to obtain models generally applicable to a diverse set of input materials. Even for the indeterminate target property Tg, the available data set carries considerable uncertainty, as it consists of several thousand polymers spanning only a tiny fraction of the vast polymer landscape. Therefore, our workflow was constructed on the premise that predicted properties have a certain level of discrepancy from reality, and computationally designed candidates were used as a guideline for chemists' decision-making. Furthermore, this study focused only on considerably simplified models that ignored any key covariates other than the chemical structures of repeat units. The inability of the current models to account for observed within-polymer fluctuations in polymeric properties might be largely due to the lack of data on processing parameters, higher-order molecular structures and so on. This lack of data is one of the most fundamental issues in polymer informatics.

Another issue concerns the lack of ML methods to facilitate chemical synthesis.
In this study, the polymers to be synthesized were selected by emphasizing synthetic accessibility over the novelty of the designed structures and thermal properties. In recent years, several researchers have begun to develop ML methods for chemical synthesis [45,46]. Unfortunately, many chemists are still unconvinced of the utility of such strategies, as well as of de novo design methods, because their practical impacts remain unexplored in real-world applications. In future work, ML methods for design and synthesis should be pipelined and practised [5]. We hope that this proof-of-concept study will contribute to the widespread use of such ML platforms, opening up new opportunities in the next generation of polymer chemistry.

Table 3. Comparison of the thermal conductivity of new and existing polymers at approximately 300 K, as reported in the literature and as measured by temperature wave analysis in this study

| No. | Film grade | Manufacturer | Chemical structure | d (μm) | λ (W/mK), thickness direction | Ref. |
|---|---|---|---|---|---|---|
| | Kapton | Toray | PMDA/ODA | 7.3 | 0.198 | [38] |
| | Kapton | Toray | PMDA/ODA | 12.7 | 0.194 | [38] |
| | Kapton | Toray | PMDA/ODA | 25 | 0.194 | [38] |
| | Kapton | Toray | PMDA/ODA | 50 | 0.186 | [38] |
| | Kapton | Toray | PMDA/ODA | 76.4 | 0.189 | [38] |
| | Kapton | Toray | PMDA/ODA | 124.6 | 0.189 | [38] |
| | Kapton | Toray | PMDA/ODA | 175 | 0.191 | [38] |
| | UPILEX-S | Ube | BPDA/p-PDA | 7.5 | 0.168 | [48] |
| | UPILEX-S | Ube | BPDA/p-PDA | 12.6 | 0.211 | [48] |
| | UPILEX-S | Ube | BPDA/p-PDA | 20.5 | 0.216 | [48] |
| | UPILEX-R | Ube | BPDA/ODA | 7.5 | 0.183 | [48] |
| | UPILEX-R | Ube | BPDA/ODA | 12.2 | 0.186 | [48] |
| | UPILEX-R | Ube | BPDA/ODA | 20.5 | 0.194 | [48] |
| 4 | Predicted | This study | Fig. 4a | – | 0.246 | This study |
| 4 | Observed | This study | Fig. 4a | 97 | 0.261 | This study |
| 4 | Annealed | This study | Fig. 4a | – | 0.408 | This study |
| 13 | Predicted | This study | Fig. 4a | – | 0.225 | This study |
| 13 | Observed | This study | Fig. 4a | 112 | 0.224 | This study |
| 13 | Annealed | This study | Fig. 4a | – | 0.387 | This study |
| 19 | Predicted | This study | Fig. 4a | – | 0.218 | This study |
| 19 | Observed | This study | Fig. 4a | 103 | 0.195 | This study |

PMDA/ODA pyromellitic dianhydride and 4,4′-oxydianiline, BPDA/p-PDA 3,3′,4,4′-biphenyltetracarboxylic dianhydride and p-phenylenediamine, BPDA/ODA 3,3′,4,4′-biphenyltetracarboxylic dianhydride and 4,4′-oxydianiline, d thickness of the plate/film-shaped specimens.

### Methods

#### Polymer design using iqspr

The 1000 candidates were generated using iqspr 1.0. The script available at https://github.com/stewu5/HighTCond_Polymer_iqspr can be used to reproduce the results of this study. To summarize, first, for the prior p(S), we used the extended n-gram of order n = 10 as the chemical language model for SMILES strings; this approach was developed in our previous study [16]. The language model was trained on 14,423 homopolymers recorded in PoLyInfo. The forward models consisted of two Bayesian linear models trained on 5917 and 3234 instances of Tg and Tm, respectively. The training was performed with default hyperparameters. The descriptor was calculated by combining seven different fingerprints implemented in iqspr: standard, extended, hybridization, maccs, circular and pubchem (see https://cran.r-project.org/web/packages/rcdk/rcdk.pdf for descriptions of these fingerprints). One hundred structures randomly selected from the 14,423 existing polymers were sequentially modified over 500 iterations; molecules created in the burn-in period (first 100 iterations) were discarded. In the SMC run, annealing was scheduled to lower the temperature linearly from T = 30 to T = 1 at every step during the burn-in period and to maintain T = 1 after the burn-in. As described in the Results section, we applied the LCP-likeliness filter, $\theta^{\mathbb{1}(Y_f(S) \cap U_f \neq \emptyset)}$, in every step of the SMC run. The score was set to θ = 10. Note that iqspr 1.0 does not permit the use of such additional filters.
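The annealing schedule and the filter score described here can be written down in a few lines. This is a sketch only and not the customized iqspr script: both function names are hypothetical, and the assumption that the temperature starts at exactly 30 on step 0 and first reaches 1 on step 100 is one possible reading of the text.

```python
def annealing_temperature(step, burn_in=100, t_start=30.0, t_end=1.0):
    """Temperature at a 0-based SMC step: linear decrease during burn-in, then constant."""
    if step >= burn_in:
        return t_end
    return t_start + (t_end - t_start) * step / burn_in

def lcp_filter_factor(n_matching_fragments, theta=10.0):
    """Multiplicative score: theta if at least one listed fragment is present, else 1."""
    return theta if n_matching_fragments > 0 else 1.0

# Temperatures at a few steps of a 500-iteration run.
print([annealing_temperature(k) for k in (0, 50, 100, 499)])
# Score multipliers for a structure without and with LCP-like fragments.
print(lcp_filter_factor(0), lcp_filter_factor(2))
```

With θ = 10, a structure containing any listed fragment is weighted ten times more heavily than an otherwise identical structure without one; the number of matching fragments does not change the factor.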
Therefore, we customized the original execution command lines by simple scripting. Finally, we selected the 1000 candidates with the highest values of $p(Y_{T_g} \mid S)\,p(Y_{T_m} \mid S)$ among all the generated structures.

#### Transfer learning

We used the MXNet package [47] to train the neural network models for predicting Tg, Tm and ρ of polymers and CV of monomers, which served as pre-trained models. A pre-trained model was then re-trained by fine-tuning it on the limited available data on λ.

We started by building a "shotgun pre-trained model library" for Tg, Tm, ρ and CV. For each property, we generated and trained 1000 neural networks with randomly constructed network structures. Each network formed a fully connected pyramid in which the number of hidden layers was randomly chosen from {3, 4}. The input layer consisted of a randomly selected subset of 400–600 descriptors drawn from all the fingerprints. Then, the number of neurons was randomly reduced by 20–80% in each of the following layers, and the number of neurons in the last hidden layer was bounded by 10–30 (pre-determined randomly). Neurons in all hidden layers were activated by ReLU (Rectified Linear Unit), and a linear activation function was configured on the output layer. The details of the predictive performance of the best transferred model for λ among the 1000 fine-tuning trials are shown in SI.

#### Monomer and polymer synthesis

Details on the synthesis of monomers and polymers are provided in SI.

#### Measurement of thermophysical properties

Detailed procedures for the measurement of the polymer properties are provided in SI. In particular, recent measurement techniques were introduced to evaluate the limited number of new polymers with high Tg and Tm. Thermal diffusivity was measured by micro-scale TWA [38] (see Fig. S13 in SI). Ultra-fast scanning nano-scale calorimetry (FSC) [40] (see Fig. S9 in SI) was introduced to execute a 30,000 K/s temperature scan to observe Tg, Tm and cold crystallization, the last of which is unique to semi-crystalline polymers.

### Data availability

The digital data in PoLyInfo were manually extracted because acquisition using an application programming interface is not currently supported. The QM9 data are publicly available at http://quantum-machine.org/datasets/. The trained models, constructed using R, and other data are available upon request.

### Acknowledgements

This work was supported in part by the "Materials Research by Information Integration" Initiative (MI2I) project of the Support Program for Starting Up Innovation Hub from Japan Science and Technology Agency (JST) and a Grant-in-Aid for Scientific Research (B) 15H02672 from the Japan Society for the Promotion of Science (JSPS). S.W. gratefully acknowledges financial support from JSPS KAKENHI Grant Number JP18K18017. K.H. gratefully acknowledges financial support from JSPS KAKENHI Grant Number JP17K17762, a Grant-in-Aid for Scientific Research on Innovative Areas (16H06439) and PRESTO (JPMJPR16NA). C.S. gratefully acknowledges financial support from the Ministry of Education and Science of the Russian Federation (Grant 14.Y26.31.0019), and J.M. acknowledges partial financial support by JSPS KAKENHI Grant Number JP16K06768.

### Author contributions

S.W., M.-a.K., J.M. and R.Y. planned the study. Computational design calculations were performed by S.W., H.Y., I.K., G.L., K.H., Y.X., J.S. and R.Y. Polymer synthesis and property measurements were performed by Y.K., M.-a.K., B.Y., C.S. and J.M.
All authors discussed the results and were involved in the development and writing of the manuscript, and take accountability for all aspects of the work in this manuscript.

### Additional information

Supplementary information accompanies the paper on the npj Computational Materials website (https://doi.org/10.1038/s41524-019-0203-2).

Competing interests: The authors declare one potential patent application to be submitted in the near future. Patent applicant (whether author or institution): NIMS. Name of inventor(s): R.Y., S.W., M.K., J.M. and Y.X. Application number: not yet available. Status of application: in preparation. Specific aspect of manuscript covered in patent application: not specified.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

### References

1. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 770–778 (IEEE, Las Vegas, NV, USA, 2016).
2. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
3. Brown, N. & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 418–424 (2017).
4. Green, M. L. et al. Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies. Appl. Phys. Rev. 4, 011105 (2017).
5. Tabor, D. P. et al. Accelerating the discovery of materials for clean energy in the era of smart automation. Nat. Rev. Mater. 3, 5–20 (2018).
6. Yoshikawa, N. et al. Population-based de novo molecule generation, using grammatical evolution. Chem. Lett. 47, 1431–1434 (2018).
7. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
8. Lim, J., Ryu, S., Kim, J. W. & Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminform. 10, 31 (2018).
9. Kadurin, A., Nikolenko, S., Khrabrov, K., Aliper, A. & Zhavoronkov, A. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14, 3098–3104 (2017).
10. Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. & Tsuda, K. ChemTS: an efficient Python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18, 972–976 (2017).
11. Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
12. Mannodi-Kanakkithodi, A. et al. Scoping the polymer genome: a roadmap for rational polymer dielectrics design and beyond. Mater. Today 21, 785–796 (2017).
13. Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A. & Kim, C. Machine learning in materials informatics: recent applications and prospects. npj Comput. Mater. 3, 54 (2017).
14. Audus, D. J. & de Pablo, J. J. Polymer informatics: opportunities and challenges. ACS Macro Lett. 6, 1078–1082 (2017).
15. Peerless, J. S., Milliken, N. J. B., Oweida, T. J., Manning, M. D. & Yingling, Y. G. Soft matter informatics: current progress and challenges. Adv. Theory Simul. 2, 1800129 (2019).
16. Ikebata, H., Hongo, K., Isomura, T., Maezono, R. & Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 31, 379–391 (2017).
17. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).
18. Kim, S. et al. PubChem substance and compound databases. Nucleic Acids Res. 44, D1202–D1213 (2016).
19. Venkatasubramanian, V., Chan, K. & Caruthers, J. M. Computer-aided molecular design using genetic algorithms. Comput. Chem. Eng. 18, 833–844 (1994).
20. Mannodi-Kanakkithodi, A., Pilania, G., Huan, T. D., Lookman, T. & Ramprasad, R. Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 6, 20952 (2016).
21. Venkatraman, V. & Alsberg, B. Designing high-refractive index polymers using materials informatics. Polymers 10, 103 (2018).
22. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
23. Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
24. Otsuka, S., Kuwajima, I., Hosoya, J., Xu, Y. & Yamazaki, M. PoLyInfo: polymer database for polymeric materials design. In 2011 International Conference on Emerging Intelligent Data and Web Technologies. 22–29 (Tirana, Albania, 2011).
25. Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).
26. Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).
27. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
28. Del Moral, P., Doucet, A. & Jasra, A. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B 68, 411–436 (2006).
29. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
30. Morikawa, J., Tan, J. & Hashimoto, T. Study of change in thermal diffusivity of amorphous polymers during glass transition. Polymers 36, 4439–4443 (1995).
31. Morikawa, J. & Hashimoto, T. Study on thermal diffusivity of poly(ethylene terephthalate) and poly(ethylene naphthalate). Polymers 38, 5397–5400 (1997).
32. Allen, P. B. & Feldman, J. L. Thermal conductivity of disordered harmonic solids. Phys. Rev. B 48, 12581–12588 (1993).
33. Shen, S., Henry, A., Tong, J., Zheng, R. & Chen, G. Polyethylene nanofibres with very high thermal conductivities. Nat. Nanotechnol. 5, 251–255 (2010).
34. Sugimoto, A., Yoshioka, Y., Kang, S. & Tokita, M. Thermal diffusivity of side-chain-polymer smectic liquid crystals. Polymers 106, 35–42 (2016).
35. Shin, J. et al. Thermally functional liquid crystal networks by magnetic field driven molecular orientation. ACS Macro Lett. 5, 955–960 (2016).
36. Wang, M. et al. Homeotropically-aligned main-chain and side-on liquid crystalline elastomer films with high anisotropic thermal conductivities. Chem. Commun. 52, 4313–4316 (2016).
37. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
38. Morikawa, J. & Hashimoto, T. Thermal diffusivity of aromatic polyimide thin films by temperature wave analysis. J. Appl. Phys. 105, 113506 (2009).
39. Tawade, B. V., Valsange, N. G. & Wadgaonkar, P. P. Synthesis and characterization of polyhydrazides and poly(1,3,4-oxadiazole)s containing multiple arylene ether linkages and pendent pentadecyl chains. High Perform. Polym. 29, 836–848 (2017).
40. Gao, Y. L. et al. Calorimetric measurements of undercooling in single micron sized SnAgCu particles in a wide range of cooling rates. Thermochim. Acta 482, 1–7 (2009).
41. Jain, A. et al. The Materials Project: materials genome approach to accelerating materials innovation. APL Mater. 1, 15010 (2013).
42. Xu, Y., Yamazaki, M. & Villars, P. Inorganic materials database for exploring the nature of material. Jpn. J. Appl. Phys. 50, 11RH02 (2011).
43. Kirklin, S. et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).
44. Huan, T. D. et al. A polymer dataset for accelerated property prediction and design. Sci. Data 3, 160012 (2016).
45. Liu, B. et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113 (2017).
46. Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
47. Chen, T. et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv https://arxiv.org/abs/1512.01274 (2015).
48. Choy, C. L., Leung, W. P. & Ng, Y. K. Thermal diffusivity of polymer films by the flash radiometry method. J. Polym. Sci. Part B 25, 1779–1799 (1987).

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2019