# Fileset

[c7cp08280k.pdf](https://mdr.nims.go.jp/filesets/afc4614c-2709-46c8-a8ff-6240d06789fe/download)

## Creator

[SODEYAMA, Keitaro](https://orcid.org/0000-0002-9228-0729), [IGARASHI, Yasuhiko](https://orcid.org/0000-0003-1042-6657), [NAKAYAMA, Tomofumi](https://orcid.org/0000-0003-1240-3571), [TATEYAMA, Yoshitaka](https://orcid.org/0000-0002-5532-6134), [OKADA, Masato](https://orcid.org/0000-0002-9040-8784)

## Rights



## Other metadata

[Liquid electrolyte informatics using an exhaustive search with linear regression](https://mdr.nims.go.jp/datasets/f036da11-e329-46f1-bcb5-498ff5f11169)

## Fulltext

Liquid electrolyte informatics using an exhaustive search with linear regression†

Keitaro Sodeyama\* (a, b, c), Yasuhiko Igarashi (a, b, d), Tomofumi Nakayama (d), Yoshitaka Tateyama (a, c, e) and Masato Okada (a, d)

This journal is © the Owner Societies 2018. Phys. Chem. Chem. Phys., 2018, 20, 22585–22591. Cite this: Phys. Chem. Chem. Phys., 2018, 20, 22585.

Exploring new liquid electrolyte materials is a fundamental target for developing new high-performance lithium-ion batteries. In contrast to solid materials, disordered liquid solution properties have been less studied by data-driven information techniques. Here, we examined the estimation accuracy and efficiency of three information techniques, multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), and exhaustive search with linear regression (ES-LiR), by using coordination energy and melting point as test liquid properties. We then confirmed that ES-LiR gives the most accurate estimation among the techniques. We also found that ES-LiR can provide the relationship between the "prediction accuracy" and "calculation cost" of the properties via a weight diagram of descriptors. This technique makes it possible to choose the balance of "accuracy" and "cost" when a search over a huge number of new materials is carried out.

### 1. Introduction

Computational material design with data-driven information techniques has recently become popular in materials research [1]. The materials for next-generation lithium-ion batteries (LIBs) are representative targets. Future LIBs require a higher voltage, a higher capacity, and a longer cycle life, and need to be safer [2,3]. For such properties, a variety of new "electrode" materials have been reported [4–6]. However, new "electrolyte" materials, typically consisting of liquid solvents and Li-salts, have not appeared for commercial use since 1991.
This is because the search for liquid materials is more difficult than that for solid materials, owing to the disordered structure of liquids. Exploring new liquid materials with desirable properties is a challenging issue [7–9].

In order to discover new liquid electrolytes with desirable properties, virtual screening with a data-driven information technique is one possible option. In this screening, a database of the features of materials, called descriptors, is first constructed with data from first-principles calculations or molecular dynamics simulations and/or experiments. Next, we determine the estimation rule (fitting equation) to predict the target properties based on the selected descriptors in the database by using the information techniques. Finally, we handle a huge number of candidate materials under the rule. Several applications of virtual screening to explore new LIB materials have been reported, though most of them are limited to solid materials research [10–13]. Only a few applications have been reported for liquid materials [14–16].

To extract the estimation rule for predicting the target properties, we have to select descriptors using data-driven techniques. This is called the variable selection problem. In general, multiple linear regression (MLR) [17], in which all the descriptors are used for the estimation, is the most standard treatment for the estimation of the properties of materials. However, irrelevant and redundant descriptors do not contribute to the accuracy of a predictive model and may in fact decrease it. Thus, we have to remove these descriptors.
Moreover, fewer descriptors are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

When there are $N$ explanatory variables, the simplest variable selection method is a search over all combinations of the variables, which requires $2^N - 1 = {}_N C_1 + {}_N C_2 + \cdots + {}_N C_N$ estimations [18]. We call this naive method the exhaustive search (ES) method [19–21]. Although the ES method comes at the expense of a computational complexity of at least $O(2^N)$, we can use the ES method for $N$ up to 30, and the ES method can select the best descriptors for predicting the target properties. In this study, we apply the ES method for linear regression and propose a set of descriptor combinations that can produce better estimations. For comparison, we also apply the least absolute shrinkage and selection operator (LASSO) [22], which uses an L1-norm regularization term, as a standard approximate method for sparse variable selection, for which the computational complexity is $O(N^3)$.

(a) Center for Materials Research by Information Integration (cMI2), Research and Services Division of Materials Data and Integrated System (MaDIS), National Institute for Materials Science (NIMS), 1-2-1 Sengen, Tsukuba, Ibaraki, 305-0047, Japan. E-mail: SODEYAMA.Keitaro@nims.go.jp

(b) PRESTO, Japan Science and Technology Agency (JST), 4-1-8 Honcho, Kawaguchi, Saitama 333-0012, Japan

(c) Elements Strategy Initiative for Catalysts & Batteries (ESICB), Kyoto University, Nishikyo-ku, Kyoto 615-8510, Japan

(d) Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan

(e) Center for Green Research on Energy and Environmental Materials (GREEN), and International Center for Materials Nanoarchitectonics, National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan

† Electronic supplementary information (ESI) available. See DOI: 10.1039/c7cp08280k

Received 11th December 2017, Accepted 24th May 2018. DOI: 10.1039/c7cp08280k. rsc.li/pccp

PCCP, Paper. Open Access Article. Published on 14 June 2018. Downloaded on 4/7/2020 7:23:18 AM. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence (http://creativecommons.org/licenses/by/3.0/). https://doi.org/10.1039/c7cp08280k

In the search for LIB liquid electrolytes, the evaluation of the properties of ion transport and electrochemical stability is indispensable. For the transport, solvation to and desolvation from Li-ions at the electrolyte/electrode interface play a crucial role, and thus the coordination energy of the solvent to Li-ions is an important measure. In order to keep the liquid state for fast Li-ion transport, the melting point of the electrolyte is also a fundamental property. For the electrochemical stability, quantities such as the ionization potential and electron affinity are significant.
Here, however, we focus on the quantities related to Li-ion transport as the first target.

In this study, we investigated the estimation accuracy of the MLR, LASSO, and ES-LiR techniques in the search for liquid electrolyte materials. We estimated the coordination energies and melting points as the required properties of the LIB liquid electrolytes and discussed the descriptors extracted by LASSO and ES with linear regression (ES-LiR). The strategy of the ES-LiR method will be useful and applicable in the search for liquid electrolytes with other desired properties.

### 2. Computational details

#### 2.1. Database

To predict novel LIB liquid electrolytes with desired properties by the information techniques, we constructed a database of known liquid electrolytes. We selected 103 solvent molecules which were commercialized as battery-grade materials by KISHIDA Chemical Co., Ltd. [23] We adopted the values of melting point, boiling point, flash point, density of solvent, and molecular weight from the catalogue data. Representative solvent molecules are shown in Scheme 1 and the complete list is shown in Scheme S1 of the ESI.†

#### 2.2. Cluster model calculations

To make the database of the electrolytes more substantial, we added the following values obtained by density functional theory (DFT) calculations of the molecular systems using the Gaussian 09 code [24]: the coordination energy between a Li-ion and a solvent molecule, the Mulliken charge of the atom (typically an oxygen atom) that is coordinated to a Li-ion, the distance between a Li-ion and the coordinated atom (typically the Li–O distance, R(Li–O)), the HOMO energy, the LUMO energy, and the dipole moment of the 103 solvent molecules. The calculated data of the representative solvent molecules are shown in Table 1, and the complete data are listed in Table S1 in the ESI.† The coordination energies (Ecoord) are evaluated as the difference between the total energy of a Li–solvent complex and the sum of the total energies of a solvent molecule and a Li-ion (Ecoord = E(Li–solvent) − {E(solvent) + E(Li-ion)}). We adopted the B3LYP functional [25] with cc-pVDZ basis sets [26]. The Mulliken charges and the dipole moments are obtained from the DFT calculations of pure solvent molecules without Li-ions. Geometry optimizations of the Li–solvent complexes and the pure solvent molecules were also carried out. In this study, a total of 10 descriptors (explanatory variables) were adopted for the database. Several data are missing from the catalogue; we omitted them from the prediction. When the catalogue gives a range of values rather than a specific value, we used the average.

#### 2.3. Data-driven information techniques

We applied the data-driven information techniques of MLR, LASSO, and ES-LiR to the electrolyte materials search. MLR is a typical supervised machine learning technique for predicting the values of properties.
The method tries to represent the relationship between the set of given values of the properties, called explanatory variables, and the target value for the prediction, called the dependent variable, by constructing a linear model. We set the target value and the $i$-th explanatory variable as $z$ and $x_i$ ($i = 1, \ldots, 10$), respectively. We then assume that the relationship between them is linear and derive it by minimizing eqn (1),

$$E = \sum_{m=1}^{103} \left( z_m - \sum_{i=1}^{10} w_i x_{mi} \right)^2, \qquad (1)$$

where $w_i$ ($i = 1, \ldots, 10$) is the coefficient of the $i$-th explanatory variable.

As descriptors $x_i$, we adopted the following features for the prediction of the coordination energies: x1 = boiling point, x2 = density, x3 = dipole moment, x4 = flash point, x5 = HOMO, x6 = LUMO, x7 = melting point, x8 = molecular weight, x9 = Mulliken charge, and x10 = distance between the Li-ion and the coordinated oxygen atom. In the case of the melting point prediction, x7 is redefined as the coordination energy and the other descriptors are the same as in the former case.

Scheme 1. Representative 25 solvent molecules for the database (Li, purple; O, red; N, blue; C, grey; F, light blue; S, yellow; P, orange; H, white). Whole molecules are shown in Scheme S1 in the ESI.† The solvent names are referred to in Table 1.

LASSO is also a supervised machine learning method.
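The least-squares fit of eqn (1), whose linear model LASSO shares, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the toy data and the helper names `solve_linear` and `fit_mlr` are hypothetical, the model has no intercept (as in eqn (1)), and the weights are obtained by solving the normal equations.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(xs, zs):
    """Weights w minimizing sum_m (z_m - sum_i w_i x_mi)^2, as in eqn (1)."""
    n = len(xs[0])
    # Normal equations: (X^T X) w = X^T z
    xtx = [[sum(row[i] * row[j] for row in xs) for j in range(n)]
           for i in range(n)]
    xtz = [sum(row[i] * z for row, z in zip(xs, zs)) for i in range(n)]
    return solve_linear(xtx, xtz)


# Toy data (hypothetical, not from the paper): z = 2*x1 - 1*x2 exactly.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
zs = [2.0, -1.0, 1.0, 3.0]
weights = fit_mlr(xs, zs)  # close to [2.0, -1.0]
```

With the real database, each row of `xs` would hold the ten descriptors of one solvent and `zs` the corresponding coordination energies or melting points.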
The linear equation of the fitting is the same as that of the MLR method, while LASSO involves a penalty term, expressed as the second term of eqn (2).

$$E = \sum_{m=1}^{103} \left( z_m - \sum_{i=1}^{10} w_i x_{mi} \right)^2 + \lambda \sum_{i=1}^{10} |w_i| \qquad (2)$$

In eqn (2), $\lambda$ is the penalty parameter and the penalty term is linear in the absolute values of the coefficients. This method is a sparse estimation technique and can minimize the error function with extracted descriptor sets. If $\lambda$ is sufficiently large, some of the coefficients are driven to zero, leading to a sparse model in which the corresponding descriptors play no role. On the other hand, in the case where $\lambda = 0$, the results are the same as those of MLR. The penalty term allows complex models to be trained on data sets of limited size without severe over-fitting.

To determine a suitable value of the penalty parameter, $\lambda$, we use cross validation (CV), which approximately extracts the prediction error from the limited data. For the CV, the given data from the database are divided into training data and validation data to evaluate the prediction accuracy. After iterating this training and validation process with different dividing positions, the CV error is obtained with less variability. We carried out 10-fold (10 iterations) cross validation and chose the optimal $\lambda$ as the value at which the CV error was at its minimum. In this study, the CV error of LASSO is derived from the coefficients in eqn (2), which are affected by the optimal penalty parameter.

We then consider the proposed sparse estimation technique, ES-LiR. Assuming that the coefficients are sparse, namely, that only a small number of them are non-zero, we estimate which coefficients of the explanatory variables are non-zero. To be more precise, let the number of explanatory variables be $N$. In ES-LiR, in contrast to LASSO, whether each coefficient is zero or not is determined by exhaustively evaluating all $2^N - 1$ combinations of the $N$ explanatory variables.
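As a quick check of this count, assuming $N = 10$ as for the descriptors of this study and $N = 30$ as the upper limit quoted in the Introduction:

$$2^{N} - 1 = \sum_{k=1}^{N} \binom{N}{k}, \qquad 2^{10} - 1 = 1023, \qquad 2^{30} - 1 = 1\,073\,741\,823 \approx 1.07 \times 10^{9}.$$

The ten-descriptor database therefore involves 1023 descriptor combinations, whereas $N = 30$ already involves about $10^9$.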
To evaluate each combination, the value of each non-zero coefficient is determined by the least squares method and we calculate the CV error (CVE) for each combination. Finally, we obtain the optimal non-zero elements. This approach requires a longer calculation time than MLR and LASSO. In this study, the size of the data is not large and we can easily apply the ES-LiR method for the estimation.

We formulate the exhaustive search for the linear regression problem (ES-LiR) by using an indicator variable that represents a combination of non-zero explanatory variables. The indicator is defined as an $N$-dimensional binary vector,

$$c = (c_1, c_2, \ldots, c_N) \in \{0, 1\}^N \qquad (3)$$

Each variable $c_i$ takes 0 or 1: $c_i = 1$ if the $i$-th variable belongs to the combination and $c_i = 0$ if it does not. Using the indicator, $c$, we can write the linear regression problem as minimizing

$$E = \sum_{m=1}^{p} \left( z_m - \sum_{i=1}^{N} w_i c_i x_{mi} \right)^2,$$

where $p$ is the number of samples. This formulation makes the essence of the problem more explicit, and the best $c$ for modeling and predicting a target variable, $z$, is searched for by minimizing the CVE in ES-LiR.

It is easy to imagine that the ES method becomes intractable for large $N$. To reduce the computational load, it is effective to use sampling methods, such as the Markov chain Monte Carlo (MCMC) method and the replica exchange Monte Carlo (REMC) method. In our previous study [21], to deal with this difficulty, we proposed the approximate exhaustive search (AES) method for linear regression, using the above sampling methods.

Table 1. Calculated values of the coordination energy (Ecoord), the HOMO energy, the LUMO energy, the dipole moment, the Mulliken charge of the oxygen (nitrogen) atom, and the distance between the Li-ion and the oxygen (nitrogen) atom (R(Li–O)) of 25 solvent molecules for the database

| Abbreviation | Solvent name | Chemical formula | Ecoord (kcal mol⁻¹) | HOMO (eV) | LUMO (eV) | Dipole moment (Debye) | Mulliken charge | R(Li–O) (Å) |
|---|---|---|---|---|---|---|---|---|
| PC | Propylene carbonate | C4H6O3 | −57.4 | −7.93 | 0.946 | 5.255 | −0.243 | 1.747 |
| EC | Ethylene carbonate | C3H4O3 | −55.9 | −8.017 | 0.919 | 5.07 | −0.24 | 1.752 |
| VC | Vinylene carbonate | C3H2O3 | −51.7 | −6.973 | −0.137 | 4.365 | −0.231 | 1.76 |
| FEC | Fluoroethylene carbonate | C3H3O3F | −51.2 | −8.468 | 0.493 | 4.487 | −0.222 | 1.763 |
| DMC | Dimethyl carbonate | C3H6O3 | −50.0 | −7.774 | 1.115 | 0.342 | −0.306 | 1.747 |
| DEC | Diethyl carbonate | C5H10O3 | −52.6 | −7.654 | 1.217 | 0.613 | −0.308 | 1.74 |
| EMC | Ethyl methyl carbonate | C4H8O3 | −51.3 | −7.713 | 1.168 | 0.514 | −0.307 | 1.744 |
| DAC | Diallyl carbonate | C7H14O3 | −31.7 | −7.419 | −0.238 | 0.494 | −0.306 | 1.74 |
| Furan | Furan | C4H4O | −48.7 | −6.265 | 0.296 | 0.511 | −0.17 | 1.866 |
| THF | Tetrahydrofuran | C4H8O | −47.2 | −6.832 | 1.38 | 1.434 | −0.323 | 1.808 |
| THP | Tetrahydropyran | C5H10O | −43.2 | −6.711 | 1.537 | 1.301 | −0.324 | 1.804 |
| DOL | 1,3-Dioxolane | C3H6O2 | −64.4 | −6.955 | 1.493 | 1.324 | −0.315 | 1.818 |
| DMM | Dimethoxy methane | C3H8O2 | −52.0 | −6.846 | 1.459 | 2.165 | −0.298 | 1.905 |
| MA | Methyl acetate | C3H6O2 | −53.5 | −7.371 | 0.339 | 1.733 | −0.265 | 1.755 |
| EP | Ethyl propionate | C5H10O2 | −58.6 | −7.31 | 0.414 | 1.763 | −0.269 | 1.787 |
| GBL | γ-Butyrolactone | C4H6O2 | −54.7 | −7.269 | 0.254 | 4.296 | −0.237 | 1.758 |
| TMP | Trimethyl phosphate | C3H9O4P | −56.8 | −7.765 | 1.112 | 3.356 | −0.467 | 1.74 |
| NMP | N-Methyl-2-pyrrolidone | C5H9ON | −65.1 | −6.421 | 0.842 | 3.609 | −0.299 | 1.724 |
| ES | Ethylene sulfite | C2H4O3S | −63.9 | −7.725 | −0.823 | 3.123 | −0.423 | 1.758 |
| SL | Sulfolane | C4H8O2S | −63.7 | −7.383 | 0.826 | 5.087 | −0.459 | 2.014 |
| PS | 1,3-Propane sultone | C3H6O3S | −57.3 | −7.917 | 0.549 | 5.468 | −0.426 | 2.034 |
| DMSO | Dimethyl sulfoxide | C2H6OS | −67.8 | −6.01 | 0.963 | 3.821 | −0.542 | 1.718 |
| AN | Acetonitrile | C2H3N | −47.0 | −8.933 | 0.898 | 3.743 | −0.181 | 1.92 |
| PN | Propionitrile | C3H5N | −48.4 | −8.802 | 0.587 | 3.826 | −0.185 | 1.914 |
| MEK | Methyl ethyl ketone | C4H8O | −53.0 | −6.601 | −0.386 | 2.771 | −0.225 | 1.759 |

### 3. Results and discussion

#### 3.1. Coordination energy prediction

The correlation between the calculated coordination energies and those estimated by MLR, LASSO, and ES-LiR is shown in Fig. 1, and their predicted values are shown in Table 2. In these data, the estimated values correlate well with the true values (DFT calculated data). For the samples with the lowest coordination energies of around −100 kcal mol⁻¹ (true value), the estimation accuracy is not high. These solvents are 12-crown-4 ether and 18-crown-6 ether, as shown in Tables S1 and S2 (ESI†). They coordinate to Li-ions through four or more oxygen atoms of the solvent. Thus, the coordination manner is different from that of the other solvents, which may explain the low estimation accuracy of the coordination energy.

The CV errors of the MLR, LASSO and ES-LiR methods were calculated to be 10.2, 9.18, and 8.78 kcal mol⁻¹, respectively (Table 3). This suggests that the prediction accuracy of ES-LiR is the best among the three methods. The accuracy is mainly affected by the quality of the descriptor choice and the selection of the data-driven technique. Regarding the choice of descriptors, we can generate descriptors from first-principles calculation results to improve the prediction accuracy, though too many descriptors may cause over-fitting in some information techniques and decrease the accuracy, especially in the MLR case. The ES-LiR method can consider all combination patterns of the descriptors, and over-fitting is easily detected as a lower prediction accuracy of the affected combinations. This indicates that we do not suffer from the selection of the information technique. The remaining way to improve the prediction accuracy is to increase the number of descriptors.

Fig. 2 shows the histogram of the CV errors of descriptor combinations calculated by the ES-LiR method.
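The CV errors behind such a histogram can be gathered by looping over every indicator vector $c$ of eqn (3). The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the data are synthetic, the helper names (`solve_linear`, `fit`, `cv_error`, `exhaustive_search`) are hypothetical, folds are assigned by sample index, and the CV error is taken as a root-mean-square error, which the text does not specify.

```python
import itertools
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(xs, zs):
    """Least-squares weights from the normal equations (X^T X) w = X^T z."""
    n = len(xs[0])
    xtx = [[sum(row[i] * row[j] for row in xs) for j in range(n)]
           for i in range(n)]
    xtz = [sum(row[i] * z for row, z in zip(xs, zs)) for i in range(n)]
    return solve_linear(xtx, xtz)


def cv_error(xs, zs, n_folds):
    """Root-mean-square prediction error from k-fold cross validation."""
    sq_err = 0.0
    for fold in range(n_folds):
        train = [m for m in range(len(zs)) if m % n_folds != fold]
        test = [m for m in range(len(zs)) if m % n_folds == fold]
        w = fit([xs[m] for m in train], [zs[m] for m in train])
        for m in test:
            pred = sum(wi * xi for wi, xi in zip(w, xs[m]))
            sq_err += (zs[m] - pred) ** 2
    return math.sqrt(sq_err / len(zs))


def exhaustive_search(xs, zs, n_folds=10):
    """CV error of every non-empty descriptor subset, sorted best first."""
    n = len(xs[0])
    results = []
    for c in itertools.product([0, 1], repeat=n):  # indicator vectors c
        if not any(c):
            continue  # skip the empty combination
        cols = [i for i in range(n) if c[i]]
        sub = [[row[i] for i in cols] for row in xs]
        results.append((cv_error(sub, zs, n_folds), c))
    return sorted(results)


# Synthetic data (hypothetical): 4 descriptors, only x0 and x2 matter.
rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
zs = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0, 0.1) for row in xs]
results = exhaustive_search(xs, zs)
best_error, best_c = results[0]
```

Plotting the first element of each entry of `results` as a histogram gives a figure of the same kind as Fig. 2, and the leading entries of the sorted list are the raw material for a weight diagram.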
The histogram can extract not only the optimal solution but all the solutions, which enables us to map the solutions of various machine learning and data-driven methods and scientists' hypotheses. Then, we can evaluate these methods and hypotheses [21]. Fig. 2 also depicts the CV errors of MLR and LASSO and the best value of ES-LiR. This suggests that LASSO, which has been widely used in recent studies, is not the best prediction method and that its extracted descriptors are not the best combination (Table 3) among the combinations with small CVE.

Fig. 1. Coordination energies of 103 solvent molecules with true values (calculated by the first-principles method) and estimated values (calculated by data-driven techniques) of MLR, LASSO, and ES-LiR (the least error combination of the descriptors).

Table 2. Estimated and first-principles calculation values of the coordination energies of solvents (kcal mol⁻¹)

| Solvent | True value | MLR | LASSO | ES-LiR |
|---|---|---|---|---|
| PC | −57.4 | −50.7 | −55.5 | −57.1 |
| EC | −55.9 | −55.5 | −55.6 | −57.6 |
| VC | −51.7 | −54.1 | −53.1 | −53.0 |
| FEC | −51.2 | −49.3 | −53.3 | −55.8 |
| DMC | −50.0 | −55.0 | −53.6 | −53.9 |
| DEC | −52.6 | −51.0 | −52.7 | −53.8 |
| EMC | −51.3 | −52.3 | −54.9 | −54.8 |
| Furan | −31.7 | −48.0 | −48.4 | −46.1 |
| THF | −48.7 | −51.3 | −53.4 | −52.5 |
| THP | −47.2 | −50.1 | −52.0 | −51.9 |
| DOL | −43.2 | −47.0 | −53.6 | −53.6 |
| DMM | −64.4 | −49.7 | −50.6 | −49.2 |
| MA | −52.0 | −50.3 | −51.5 | −51.8 |
| EP | −53.5 | −51.5 | −51.6 | −51.5 |
| MCA | −58.6 | −50.8 | −52.1 | −54.6 |
| VA | −54.7 | −52.0 | −51.0 | −49.6 |
| GBL | −56.8 | −52.5 | −54.5 | −55.5 |
| TMP | −65.1 | −59.7 | −62.8 | −64.8 |
| NMP | −63.9 | −58.7 | −57.3 | −57.7 |
| SL | −63.7 | −56.8 | −61.4 | −66.3 |
| PS | −57.3 | −60.3 | −59.5 | −61.1 |
| DMSO | −67.8 | −68.2 | −64.7 | −67.2 |
| AN | −47.0 | −46.5 | −45.6 | −46.6 |
| PN | −48.4 | −45.3 | −46.4 | −47.2 |
| MEK | −53.0 | −51.9 | −49.3 | −49.4 |

Table 3. Cross-validation errors of the coordination energies and the extracted combination of descriptors of MLR, LASSO, and ES-LiR

| Data-driven technique | Combination of descriptors | CV error (kcal mol⁻¹) |
|---|---|---|
| MLR | x1–x10 | 10.2 |
| LASSO | x4, x8, x9, x10 | 9.18 |
| ES-LiR | x4, x9, x10 | 8.78 |

x1 = boiling point, x2 = density, x3 = dipole moment, x4 = flash point, x5 = HOMO, x6 = LUMO, x7 = melting point, x8 = molecular weight, x9 = Mulliken charge, and x10 = R(Li–O).

The ES-LiR method not only minimizes the CVE but also derives the CVE for all combinations, so that the whole picture can be seen. Using this whole picture, the ES-LiR method can be used to construct the weight diagram, which shows the top 25 combinations of the descriptors, as shown in Fig. 3. The weight diagram reveals the stability of the important descriptors for the estimation, even if the error is at the same level as that of the other methods. Each colour represents the fitted coefficient of each descriptor, which shows its importance for the coordination energy prediction. The white blocks of the map correspond to descriptors which are not adopted for the prediction. From these data, the Mulliken charge is the significant descriptor for the coordination energy prediction, and the flash point and R(Li–O) can also contribute to it. The coordination energy is highly affected by the Coulomb interaction between the Li cation and the oxygen atom, which has a negative charge. Thus, the extraction of the Mulliken charge as a good descriptor fits our chemical intuition, even if Mulliken charge values are sometimes quantitatively unstable with respect to the basis functions. R(Li–O) is also a trivial descriptor for the estimation of the solvation energy because the distance corresponds to the strength of the interaction between Li and O. On the other hand, the flash point is not a trivial descriptor.
There might be a weak relationship between "the oxygen radical reaction for burning" and "the Li cation–solvent interaction", though the number of samples should be increased for such a discussion.

In materials informatics, the proper combination of descriptors changes depending on the purpose of the data analysis. In this paper, our goal is both to accurately predict the coordination energy and to reduce the calculation cost. The weight diagram (Fig. 3) lets us achieve this goal. As shown in Fig. 3, the 11th most accurate combination does not include the descriptor R(Li–O). To obtain the distance between Li and oxygen, additional Li–solvent complex calculations are required, whereas the other descriptors, density, flash point, and Mulliken charge, are obtained from catalogue data and solvent-only calculations. The difference between the first and 11th CV errors is quite small, 0.126 kcal mol⁻¹. This is not a significant difference when comparing the coordination energies of various solvents. According to Table 1, the target accuracy for coordination energies is of the order of 10⁻¹ kcal mol⁻¹. Then, if we choose the 11th best combination of descriptors ("Flash point" and "Mulliken charge"), we can reduce the calculation cost to half because the extra calculation for obtaining R(Li–O) is omitted. This indicates that we can choose the balance between the "prediction accuracy" and the "calculation cost for obtaining the descriptors" for the combinatorial material search when we employ the ES-LiR method and calculate the histogram and weight diagram.

#### 3.2. Melting point

Fig. 4 shows the correlation between the melting points from the catalogue data and the values estimated by MLR, LASSO, and ES-LiR. Their CV errors were 30.06, 29.75, and 28.49 °C, respectively (Table 4). Although the CV error is still large in ES-LiR, it is smaller than those of LASSO and MLR.
From the extraction of the descriptors by LASSO, density is one of the significant descriptors for the melting point. This matches chemical intuition because the density is highly related to the interaction between the solvent molecules in the liquid state, and the melting point is also highly affected by the interaction between the solvent molecules. Since LASSO is an approximation method, even if the choice of the descriptors matches the scientific background, it may be just a coincidence. There is a possibility that a completely different set of descriptors can produce a more accurate estimation. In contrast, the ES-LiR method can propose a reliable set of descriptors from the best to the worst estimations. Fig. 5 shows the histogram of all combination patterns of descriptors obtained by ES-LiR. Fig. 6 confirms that, at least in the top 25 combinations, density is one of the most important descriptors, and flash point, molecular weight and Mulliken charge also make large contributions to the melting point prediction.

Fig. 2. Histogram of the CV errors of descriptor combinations obtained by the ES-LiR method for the coordination energy prediction. The smallest CV error values of ES-LiR and the CV errors of LASSO and MLR are also shown.

Fig. 3. Weight diagram of the descriptors on accurate top 25 combinations of descriptors for the coordination energy prediction.

Fig. 4. Melting points of 103 solvent molecules with true values (calculated by first-principles method) and the estimated values (calculated by data-driven technique) of MLR, LASSO, and ES-LiR which is the least error combination.

#### 3.3. Statistical significance of the proposed methods about the CV error

Let us consider the statistical significance of the difference in the CV errors of MLR, LASSO, and ES-LiR. For the evaluation of the CV errors, we calculated the CV error for each data set in ES-LiR, under the same condition as for LASSO. As a result of applying it to the coordination energy prediction, the CV errors of MLR, LASSO, and ES-LiR are 10.20, 9.18, and 6.34, respectively. We conducted paired-sample t-tests on the 10-fold CV errors of "MLR and ES-LiR" and "LASSO and ES-LiR", and the p value was less than 0.001, which was a significant result.

### 4. Conclusions

In order to explore new LIB electrolyte materials, we investigated the estimation procedure by data-driven information techniques. We predicted the coordination energies and melting points of solvents by information techniques such as MLR, LASSO, and ES-LiR. ES-LiR gave the most accurate estimation of the properties among them. We found that ES-LiR allows us to choose the balance between the "prediction accuracy" and the "calculation cost to obtain the descriptors" when a combinatorial material search by virtual screening is carried out. This feature is general for all material exploration studies with virtual screening. This treatment can be a key technique for future material searches.

### Conflicts of interest

There are no conflicts to declare.

### Acknowledgements

This research was supported by the JST, PRESTO and NIMS, "Materials research by information" integration initiative. The calculations in this work were carried out at the supercomputer center of NIMS. The work was supported in part by the K computer at the RIKEN AICS through the HPCI System Research Projects (Proposal no. hp160174, hp170198, and hp180134).
This work was also supported in part by MEXT KAKENHI (JP15H05701).

Table 4. Cross-validation errors of the melting points and the extracted combination of descriptors of MLR, LASSO, and ES-LiR

| Data-driven technique | Combination of descriptors | CV error (°C) |
|---|---|---|
| MLR | x1–x10 | 30.06 |
| LASSO | x2–x10 | 29.75 |
| ES-LiR | x2, x3, x4, x5, x8, x9 | 28.49 |

x1 = boiling point, x2 = density, x3 = dipole moment, x4 = flash point, x5 = HOMO, x6 = LUMO, x7 = coordination energy, x8 = molecular weight, x9 = Mulliken charge, and x10 = R(Li–O).

Fig. 5. Histogram of the CV error of descriptor combinations obtained by the ES-LiR method for the melting point prediction. The smallest CV error values of ES-LiR and the CV errors of LASSO and MLR are also shown.

Fig. 6. Weight diagram of descriptors based on the accurate top 25 combinations of descriptors for the melting point prediction.

### References

1. T. Lookman, F. J. Alexander and K. Rajan, Information Science for Materials Discovery and Design, Springer, New York, 2015.
2. J. B. Goodenough and Y. Kim, Chem. Mater., 2010, 22, 587–603.
3. K. Xu, Chem. Rev., 2004, 104, 4303–4417.
4. H.-J. Peng, S. Urbonaite, C. Villevieille, H. Wolf, K. Leitner and P. Novak, J. Electrochem. Soc., 2015, 162, A7072–A7077.
5. N. Yabuuchi, M. Takeuchi, M. Nakayama, H. Shiiba, M. Ogawa, K. Nakayama, T. Ohta, D. Endo, T. Ozaki, T. Inamasu, K. Sato and S. Komaba, Proc. Natl. Acad. Sci. U. S. A., 2015, 112, 7650–7655.
6. F. Luo, B. Liu, J. Zheng, G. Chu, K. Zhong, H. Li, X. Huang and L. Chen, J. Electrochem. Soc., 2015, 162, A2509–A2528.
7. Y. Yamada, K. Furukawa, K. Sodeyama, M. Yaegashi, K. Kikuchi, Y. Tateyama and A. Yamada, J. Am. Chem. Soc., 2014, 136, 5039–5046.
8. K. Sodeyama, Y. Yamada, K. Aikawa, A. Yamada and Y. Tateyama, J. Phys. Chem. C, 2014, 118, 14091–14097.
9. J. Haruyama, K. Sodeyama, L. Han, K. Takada and Y. Tateyama, Chem. Mater., 2014, 26, 4248–4255.
10. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder and K. A. Persson, APL Mater., 2013, 1, 011002.
11. M. Nishijima, T. Ootani, Y. Kamimura, T. Sueki, S. Esaki, S. Murai, K. Fujita, K. Tanaka, K. Ohira, Y. Koyama and I. Tanaka, Nat. Commun., 2014, 5, 4553.
12. R. Jalem, T. Aoyama, M. Nakayama and M. Nogami, Chem. Mater., 2012, 24, 1357–1364.
13. R. Jalem, M. Kimura, M. Nakayama and T. Kasuga, J. Chem. Inf. Model., 2015, 55, 1158–1168.
14. M. Korth, Phys. Chem. Chem. Phys., 2014, 16, 7919–7926.
15. T. Husch, N. D. Yilmazer, A. Balducci and M. Korth, Phys. Chem. Chem. Phys., 2015, 17, 3394–3401.
16. N. N. Rajput, X. Qu, N. Sa, A. K. Burrell and K. A. Persson, J. Am. Chem. Soc., 2015, 137, 3411–3420.
17. C. M. Bishop, in Pattern Recognition and Machine Learning, ed. M. Jordan, J. Kleinberg and B. Schölkopf, Springer Science + Business Media LLC, New York, 2006, 128.
18. T. M. Cover and J. M. Van Campenhout, IEEE Trans. Syst. Man Cybern., 1977, 7(9), 657–661.
19. K. Nagata, J. Kitazono, S. Nakajima, S. Eifuku, R. Tamura and M. Okada, IPSJ Online Trans., 2015, 8, 25–32.
20. Y. Igarashi, K. Nagata, T. Kuwatani, T. Omori, Y. Nakanishi-Ohno and M. Okada, J. Phys.: Conf. Ser., 2016, 699, 012001.
21. Y. Igarashi, H. Takenaka, Y. Nakanishi-Ohno, M. Uemura, S. Ikeda and M. Okada, J. Phys. Soc. Jpn., 2018, 87, 044802.
22. R. Tibshirani, J. Royal Stat. Soc. B, 1996, 58, 267–288.
23. KISHIDA product information, http://www.kishida.co.jp/english/product, accessed July 2016.
24. M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Ö. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski and D. J. Fox, Gaussian 09 (Revision D.01), Gaussian, Inc., Wallingford CT, 2009.
25. A. D. Becke, J. Chem. Phys., 1993, 98, 5648–5652.
26. T. H. Dunning Jr., J. Chem. Phys., 1989, 90, 1007–1023.