# Fileset

[shrine20230330-1595-1im5f61.pdf](https://mdr.nims.go.jp/filesets/c19f81e1-45b9-4b0d-a96e-6114161bc2a6/download)

## Creator

Xun Liu, Zhufeng Hou, Dabao Lu, [Bo Da](https://orcid.org/0000-0002-0785-8662), [Hideki Yoshikawa](https://orcid.org/0000-0002-7389-8865), [Shigeo Tanuma](https://orcid.org/0000-0003-2628-9941), Yang Sun, Zejun Ding

## Rights

[Creative Commons BY Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)

## Other metadata

[Unveiling the principle descriptor for predicting the electron inelastic mean free path based on a machine learning framework](https://mdr.nims.go.jp/datasets/4b0d56e8-be2f-4781-b43b-5053a60245f6)

## Fulltext

Unveiling the principle descriptor for predicting the electron inelastic mean free path based on a machine learning framework

Xun Liu (a,b,c), Zhufeng Hou (d), Dabao Lu (a,b,c), Bo Da (b,c), Hideki Yoshikawa (b), Shigeo Tanuma (c), Yang Sun (e) and Zejun Ding (a)

(a) Hefei National Laboratory for Physical Sciences at Microscale and Department of Physics, University of Science and Technology of China, Hefei, Anhui, People’s Republic of China; (b) Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, Tsukuba, Ibaraki, Japan; (c) Research Center for Advanced Measurement and Characterization, National Institute for Materials Science, Tsukuba, Ibaraki, Japan; (d) State Key Laboratory of Structural Chemistry, Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, Fuzhou, China; (e) US Department of Energy, Ames Laboratory, Ames, IA, USA

Science and Technology of Advanced Materials, ISSN: 1468-6996 (Print) 1878-5514 (Online). Journal homepage: https://www.tandfonline.com/loi/tsta20

To cite this article: Xun Liu, Zhufeng Hou, Dabao Lu, Bo Da, Hideki Yoshikawa, Shigeo Tanuma, Yang Sun & Zejun Ding (2019) Unveiling the principle descriptor for predicting the electron inelastic mean free path based on a machine learning framework, Science and Technology of Advanced Materials, 20:1, 1090-1102, DOI: 10.1080/14686996.2019.1689785

To link to this article: https://doi.org/10.1080/14686996.2019.1689785

© 2019 The Author(s). Published by National Institute for Materials Science in partnership with Taylor & Francis Group. Accepted author version posted online: 07 Nov 2019. Published online: 26 Nov 2019. Supplementary material: https://www.tandfonline.com/doi/suppl/10.1080/14686996.2019.1689785. Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=tsta20

### Abstract

The TPP-2M formula is the most popular empirical formula for the estimation of the electron inelastic mean free paths (IMFPs) in solids from several simple material parameters. The TPP-2M formula, however, poorly describes several materials because it relies heavily on the traditional least-squares analysis. Herein, we propose a new framework based on machine learning to overcome this weakness. This framework allows a selection from an enormous number of combined terms (descriptors) to build a new formula that describes the electron IMFPs. The resulting framework not only provides higher average accuracy and stability but also reveals the physical meanings of several newly found descriptors. Using the identified principal descriptors, a complete physical picture of electron IMFPs is obtained, including both single and collective electron behaviors of inelastic scattering. Our findings suggest that machine learning is robust and efficient in predicting the IMFP and has great potential in building a regression framework for data-driven problems. Furthermore, this method could be applied to find empirical formulae for given experimental data using a series of parameters given a priori, and it holds potential to find a deeper connection between experimental data and a priori parameters.

Article history: Received 25 July 2019; Revised 4 November 2019; Accepted 4 November 2019

Keywords: Surface science; machine learning; inelastic mean free path; the Least Absolute Shrinkage and Selection Operator (LASSO)

Classification: 212 Surface and interfaces; 404 Materials informatics / Genomics

### 1. Introduction

The electron inelastic mean free path (IMFP) [1,2], which describes the mean distance an electron travels through a solid before losing energy, is of fundamental importance to electron-based surface analysis techniques, such as scanning electron microscopy, X-ray photoelectron spectroscopy, and Auger electron spectroscopy [2–7].
With a dielectric formalism, the IMFP can be calculated by various algorithms, such as the Penn algorithm [8,9], the Mermin algorithm [10–15] and the ex-Mermin algorithm [16]. The full Penn algorithm (FPA) has been used to produce the largest IMFP database, and thus has had considerable influence in the field of surface analysis. In recent years, Tanuma et al. calculated IMFPs for 27 elemental materials [17,18], 15 inorganic compounds [19], and 14 organic compounds [20] in a wide energy range from 50 to 2000 eV. Furthermore, to increase the accuracy of IMFP calculations and expand the contents of the database, the IMFPs of 41 elemental materials [21] and 42 inorganic compounds [22] for energies up to 200 keV were calculated.

Contact: Bo Da (DA.Bo@nims.go.jp), Research Center for Advanced Measurement and Characterization, National Institute for Materials Science, Tsukuba, Ibaraki, Japan; Zejun Ding (zjding@ustc.edu.cn), Hefei National Laboratory for Physical Sciences at Microscale and Department of Physics, University of Science and Technology of China, Hefei, Anhui, People’s Republic of China. Bo Da and Zejun Ding are corresponding authors. Supplemental data for this article can be accessed at https://www.tandfonline.com/doi/suppl/10.1080/14686996.2019.1689785. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. ORCID: http://orcid.org/0000-0002-0785-8662, http://orcid.org/0000-0002-4344-2920

We note that the database adopted here includes IMFPs for 41 elemental materials [21] and 42 compounds [22] calculated by the FPA.

Unfortunately, the calculations made by such algorithms and formulae need the energy loss function (ELF) [23] for the material of interest, which is usually difficult to obtain [24–27]. ELFs are thus still unavailable for many materials. To overcome this problem, researchers have developed empirical formulae whose independent variables are simple material-dependent parameters. In fact, researchers in the area of surface analysis tend to use empirical formulae instead of the FPA in applications. Moreover, empirical formulae have a simple form that unifies the information of IMFP data. It is therefore very convenient for researchers to search for a relationship between IMFP data and material-dependent parameters. Although the use of empirical formulae may cause some loss of accuracy, the formulae can be evaluated quickly and provide good descriptors of the IMFP.

As a starting point, the Bethe equation [28] for inelastic scattering was used to parameterize calculated or measured IMFP data. All parameters of the equation are microscopic quantities. However, the original Bethe formula has an obvious shortcoming in that it is only valid for sufficiently high energies (above 200 eV).

Many formulae based on the Bethe formula (e.g. TPP-2M [21,22], G1 [29], and S1 [30]) have been derived successively. For example, Tanuma et al. [21,22] used macroscopic quantities for parameters in the Bethe formula while trying to extend the Bethe equation to low energies such as 50 eV. They established a new empirical formula, the TPP-2M formula. Two correction terms were introduced into the denominator to expand the energy range to lower energies. To address higher energies, a relativistic revision was made for the most recent version of the TPP-2M equation [21], allowing an accurate description of the IMFP.
The use of the TPP-2M equation allows the convenient determination of the IMFP for a certain material and even the prediction of unknown IMFPs for some materials.

Although there are many formulae for predicting the IMFP, there are still problems to be solved, mainly relating to the manual selection of the combination of terms. The combination space of terms is nearly infinite. The descriptions of several materials, such as carbon allotropes and boron nitride (BN), are very poor, because manually chosen terms can capture only the relatively obvious physics of most materials and lack an overall, comprehensive understanding. Furthermore, Tanuma and co-workers have spent more than 20 years building a database of IMFPs for elemental solids, inorganic and organic compounds, and validating the applicability of the TPP-2M formula to many materials (see their initial work [17] to their most recent work [22]). Beyond the fitting work itself, however, one cannot ensure the applicability of the formula to materials not in the fitting database; that is, one cannot ensure generalization ability in machine learning (ML) terminology. Generally speaking, the manual selection of features is no longer efficient or even reliable.

In this work, we develop a framework instead of using the existing regression procedure, successfully avoiding the problems mentioned above. We first establish a suitable prototype formula and obtain values of key parameters using the prototype formula and the least-squares method. Meanwhile, a descriptor pool is established using fundamental and important material-dependent parameters. We thus set the values of key parameters as a training target and descriptors in the pool as features. The Least Absolute Shrinkage and Selection Operator (LASSO) [31] is used to form the linear combination of the principal descriptors, which means the unimportant terms are automatically eliminated.
Following this core step of ML, a brand-new empirical IMFP formula is produced after merging similar terms and making adjustments. Through this method, the new descriptors ensure robustness and generalization performance on all materials. Moreover, features selected in a data-driven way are more likely to hold deeper physical meaning than features obtained through repeated manual attempts, as with the TPP-2M formula, which is one of the most important aspects of our work. We note that this new framework is not limited to the formula for IMFPs but can be easily applied in other fields. The simplification of empirical formulae and the further discovery of information behind the terms in the formulae are the strengths of our framework.

### 2. Methods

#### 2.1. LASSO

LASSO [31] is a well-known technique used in many data-driven statistical analyses in different fields. It provides low-dimensional solutions by recasting a problem into a convex minimization problem. That is to say, a sharp reduction in the number of terms (i.e. the number of selected descriptors) is mathematically achieved by solving a minimization problem:

$$\underset{w}{\arg\min}\ \frac{1}{2n}\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_1, \qquad (1)$$

where the first term is similar to a term in the least-squares algorithm while the second term is the so-called penalty term. The penalty term combines a constant λ with the l1-norm of the parameter vector, ||w||1. On the one hand, a larger value of λ will eliminate more descriptors in the linear regression; on the other hand, the use of the l1-norm ||w||1 is crucial. In fact, the shrinkage function of LASSO relies on this l1-norm.

Figure 1 is a simple example of the LASSO algorithm applied to two-dimensional descriptors. The red ellipse is a contour of candidate values for the target parameter vector w. The least-squares term takes the same value everywhere on a given ellipse, and a smaller ellipse corresponds to a better value of w.
The square centered on the origin in Figure 1 represents the set of points that satisfy the constraint of the l1-norm in Equation (1); only points that fall into the square can be selected. An optimization method is applied, and the LASSO estimate is the point where the ellipse meets the square. Unless the ellipse is exactly tangent to one side of the square, the intersection falls on a vertex of the square, and the estimated value of one parameter is compressed to zero. That is to say, the variable has been removed from the model. If the penalty term uses not the l1-norm but, say, the l2-norm, the square in Figure 1 becomes a circle. A circle has no vertices, so the ellipse is unlikely to touch it at a point lying on an axis, and there is no shrinkage ability in this case. The l1 penalty term is thus the key ingredient of LASSO.

#### 2.2. Cross validation

Our new formula fits the IMFP with accuracy similar to or even better than that of the TPP-2M formula owing to the chosen descriptors. Such accuracy is achieved for all materials in the dataset because a traditional k-fold cross validation (CV) [32] is used in our ML work. In k-fold CV, the original sample is randomly partitioned into k equally sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The CV process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate.

We note that LASSO and CV used in this work are powered by the Scikit-learn library [32].

#### 2.3. Details of building the descriptor pool

Our goal is to create combined descriptors that have physical meanings.
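As a concrete reference for how such combined descriptors are later screened, the selection mechanism of Sections 2.1 and 2.2 can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it minimizes the objective of Equation (1) by coordinate descent with soft-thresholding and picks λ from a small hand-written grid by k-fold CV. The synthetic data, the grid `lams`, the iteration count, and all function names are assumptions made for this sketch; Scikit-learn's `Lasso` estimator minimizes the same objective (with λ called `alpha`).

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||Xw - y||^2 + lam*||w||_1 by cyclic coordinate descent.
    No intercept is fitted; this sketch assumes roughly centered data."""
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n              # per-feature curvature
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]  # residual with feature j left out
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, lam) / z[j]
    return w

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE over k folds, each fold used once for validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = lasso_cd(X[train], y[train], lam)
        errors.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errors))

# Synthetic stand-in for a descriptor matrix: only columns 1 and 4 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -1.0]
y = X @ w_true + 0.05 * rng.normal(size=60)

lams = [1.0, 0.3, 0.1, 0.03, 0.01]            # hypothetical grid for lambda
best_lam = min(lams, key=lambda lam: kfold_mse(X, y, lam))
w_hat = lasso_cd(X, y, best_lam)
selected = np.flatnonzero(w_hat)              # indices of surviving descriptors
```

The soft-thresholding step is where the l1 penalty acts: any coordinate whose correlation with the residual is below λ is set exactly to zero, which is the algebraic counterpart of the vertex picture in Figure 1.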
Generally speaking, the completeness of descriptors that hold the same complexity must be ensured in this establishment procedure, but without introducing unphysical operations or quantities. Here a step-by-step framework like that shown in Table 1 is applied to the combination procedure.

Figure 1. A simple LASSO algorithm application example of two-dimensional descriptors. Here w1 and w2 are the two dimensions of the target parameter. The red ellipse is the branch of values for target parameter vector w and the square below represents the l1-norm in LASSO.

Table 1. Feature combination framework based on seven basic features. The count (#) is listed once per series, as in the original table.

| ID | Description | # |
|---|---|---|
| A1 | 7 basic features | 7 |
| B1 | (Ei + Eg), (Ei − Eg), (Ei² + Eg²), (Ei² − Eg²) | 61 |
| B2 | f·g; f, g ∈ {A1} | |
| B3 | f/g; f, g ∈ {A1} | |
| C1 | f·g; f ∈ {A}, g ∈ {B} | 1136 |
| C2 | f/g; f, g ∈ {A, B} | |
| D1 | f^i; f ∈ {A, B, C}, i ∈ {−0.9, ..., −0.1, 0.1, ..., 0.9} | 16,524 |

(I) The starting point is the seven basic features shown as series A in Table 1. Their introduction and the analysis of their necessity are given in the Results and discussion section.

(II) The first step is vital in establishing a well-formed combination of descriptors. This is because the descriptors created in each step strictly relate to those created in the previous step. The first combination is shown in Table 1 as series B (including B1, B2, and B3). The decision point in series B considers the physical meanings of descriptors; therefore, summation and difference operations between inhomogeneous quantities, such as Ei + Z and Ei² + Eg, are not accepted. A careful inspection of the seven basic features reveals that only energies (Eg and Ei) can be combined as in series B1 (limited to quadratic terms). Moreover, a multiplication or division operation will not create inhomogeneous quantities, like descriptors in series B2 or B3 in Table 1. We note that owing to zero values existing for some basic features, descriptors with divide-by-zero problems are automatically excluded.

(III) On the basis of series A and B, more complex descriptors can be combined. In series C (including C1 and C2), descriptors in series B are further multiplied or divided by basic features in series A. This procedure raises the complexity of the descriptor by one step and ensures that all descriptors of the same complexity are included. Although there will obviously be repeated descriptors, as seen in the result, we use a linear regression after LASSO to merge similar terms. This type of step-by-step procedure can theoretically be repeated time after time, but this is not done here considering the calculation ability of the program and for the sake of simplicity; further analysis can be seen in the Results and discussion section.

(IV) Referring to the original terms in TPP-2M [21,22] and other formulae [29,30] previously developed and the need for a root operation, powered terms are used for series A, B, and C to make series D. This series has the largest volume in the descriptor pool and provides alternative choices for precise terms in the formula.

Descriptors in the pool are created as described above. All descriptors in series A, B, C, and D are used with LASSO. This automatic brute-force and step-by-step method of establishing a descriptor pool can be used in normal empirical-formula regression work. The framework can enumerate the descriptors needed, fulfilling the need of corresponding complexity, and thus gives good control of the detailed operation base for the specific needs of the empirical formula.

### 3. Results and discussion

#### 3.1. Selection of the prototype formula

The prototype formula and target values of the key parameters in the formula must first be decided. The first point is the prototype formula.
Early work by Bethe [28] treated inelastic scattering by atoms and established the so-called Bethe theory and Bethe formula for the description of the energy dependence of inelastic cross sections.

Tanuma et al. then proposed the following predictive equation for the IMFP over 200 eV based on the Bethe formula [33]:

$$\lambda = \frac{E}{E_p^2\left[\beta \ln(\gamma E)\right]}, \qquad (2)$$

where λ is the IMFP, E is the electron energy, Ep is the free-electron plasmon energy, and β and γ are parameters. They determined general formulae for these parameters based on the IMFP data from 200 eV to 2000 eV for only 31 materials.

Meanwhile, the most compatible and well-received mature formula is the relativistic TPP-2M formula [21,22]. It has a wider applicable energy region (50 eV – 200 keV) but its performance at energies lower than 100 eV is not reliable. This must be due to the limitations of the accuracy of the used IMFP database at low energies.

The prototype can be treated as a modified Bethe formula:

$$\lambda = \frac{\alpha(E)E}{E_p^2\left[\beta_r \ln\left(\gamma_r \alpha(E) E\right) - \dfrac{C_r}{E} + \dfrac{D_r}{E^2}\right]}, \qquad (3)$$

where α(E) is the relativistic modification term only associated with E while βr, γr, Cr, and Dr are parameters to be determined for each target material. It is seen that the modified Bethe formula adds Cr and Dr as first- and second-order corrections compared with the original Bethe formula. α(E) is then introduced to meet the requirement of higher energy (>10 keV). The necessity of the modification terms can be validated by using a Fano plot in which (E/λ) is plotted versus ln E. Whether the additional terms are needed can be judged from whether the data points lie sufficiently close to a straight line in Figure 2(a).

Figure 2(a) shows a clear linear relationship at energies ~200 eV and has a uniform rising trend at energies higher than 10 keV. The values of C and D are effective only at energies lower than 200 eV, as was shown in Equation (2) [33].

For the higher energy region, however, Shinotsuka et al. [21] reported that the trend is due to the relativistic effect, which is not negligible in the higher energy region, showing that the α(E) term is necessary. We therefore present a relativistic Fano plot (Figure 2(b)) in which (α(E)E/λ) versus ln[α(E)E] has a good linear relationship in our selected energy region above 200 eV. We have

$$\lambda = \frac{\alpha(E)E}{E_p^2\left[\beta_r \ln\left(\gamma_r \alpha(E) E\right)\right]}, \qquad (4)$$

which we refer to as the TPP-LASSO formula. The Ep and some of the basic features that are mentioned later are extracted from Refs [21,22]. We have also made some attempts showing that Equation (4) is suitable for this work as the prototype formula.

#### 3.2. Fitting of the parameters in the prototype formula

Another consideration is the values of key parameters, namely βr and γr. The least-squares method can be applied to a selected IMFP database and the TPP-LASSO formula to fit βr and γr. The essential problem is thus to select a robust IMFP database. Fortunately, through decades of study, a large quantity of IMFP results has been accumulated to serve as a reliable database with which to build the ML model. Shinotsuka et al. [21] theoretically computed the IMFP with the FPA for 41 elemental materials that have complete data of optical constants over a wide energy range. Shinotsuka et al. [22] similarly calculated IMFPs for 42 compound materials. These IMFP data for 83 materials can be included in the initial database for the fitting of key parameters. However, the low-energy (<50 eV) IMFPs calculated with the FPA are not reliable. We thus adopt only IMFPs above 200 eV. The information for the 83 solids is therefore included in the model to obtain the parameters.

According to the target formula, β and γ are fitted using the least-squares method.
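One way to see why a straight-line fit suffices is to rearrange Equation (4) as α(E)E/λ = Ep²βr ln[α(E)E] + Ep²βr ln γr, so that the slope of the relativistic Fano plot gives βr and the ratio of intercept to slope gives ln γr. The sketch below is an illustration under that assumption; the text does not state which residual the authors actually minimized. It uses the expression for α(E) quoted with Equation (7) and the relative RMSD used as the accuracy measure in this section. The values of `Ep`, `beta_true`, and `gamma_true` are made up, and the "reference" IMFPs are synthetic rather than FPA data.

```python
import math

MEC2_EV = 510_998.9  # electron rest energy in eV, as quoted with Equation (7)

def alpha(E):
    """Relativistic term: [1 + E/(2 mc^2)] / [1 + E/(mc^2)]^2."""
    return (1.0 + E / (2.0 * MEC2_EV)) / (1.0 + E / MEC2_EV) ** 2

def imfp_eq4(E, Ep, beta_r, gamma_r):
    """Equation (4): IMFP for electron energy E (eV) and plasmon energy Ep (eV)."""
    aE = alpha(E) * E
    return aE / (Ep ** 2 * beta_r * math.log(gamma_r * aE))

def fit_beta_gamma(energies, imfps, Ep):
    """Ordinary least-squares line through the relativistic Fano plot."""
    xs = [math.log(alpha(E) * E) for E in energies]
    ys = [alpha(E) * E / lam for E, lam in zip(energies, imfps)]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # slope = Ep^2 * beta_r ; intercept = Ep^2 * beta_r * ln(gamma_r)
    return slope / Ep ** 2, math.exp(intercept / slope)

def relative_rmsd(fitted, reference):
    """Root mean square of the relative deviations (fitted - reference) / reference."""
    n = len(reference)
    return math.sqrt(sum(((f - r) / r) ** 2 for f, r in zip(fitted, reference)) / n)

# Synthetic check with made-up parameters (not values from the paper).
Ep, beta_true, gamma_true = 16.6, 0.031, 0.12
energies = [200.0 * 10 ** (3 * k / 29) for k in range(30)]  # 200 eV to 200 keV
reference = [imfp_eq4(E, Ep, beta_true, gamma_true) for E in energies]
beta_fit, gamma_fit = fit_beta_gamma(energies, reference, Ep)
fitted = [imfp_eq4(E, Ep, beta_fit, gamma_fit) for E in energies]
deviation = relative_rmsd(fitted, reference)
```

Because the synthetic data obey Equation (4) exactly, the fit recovers the input parameters; with real FPA data the points scatter around the line and the RMSD becomes a meaningful quality measure.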
Here, the accuracy of the fitting is measured as the root-mean-square deviation (RMSD):

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\lambda_{\mathrm{fit}}(E_i) - \lambda(E_i)}{\lambda(E_i)}\right]^2}, \qquad (5)$$

where n is the total number of data points in the dataset, Ei is the electron energy, λfit(Ei) is the IMFP calculated using our fitted β and γ, and λ(Ei) is the target value calculated by the FPA. The prediction improves as the RMSD approaches zero. The RMSD averaged across all materials is 1.8%, with the deviation being a maximum for Ni (3.0%). The fitting of β and γ is relatively accurate and can be applied to the LASSO procedure as training data.

#### 3.3. Establishment of the descriptor pool

Ghiringhelli et al. recently introduced ML and data-driven concepts to materials science [34]. In their work, they gave a framework for the choice of the set of descriptive parameters (termed the descriptor) to reveal the scientific connection between the descriptor and the actuating mechanisms. They not only provided the descriptor piling-up framework but also suggested LASSO as the latest ML method [34] to select important descriptors. We believe that this method will be vital to work on formula regression, which is relevant to the present work. Utilizing the ‘feature selection’ characteristic, we can search a large number of enumerated descriptors for those that describe the IMFP better than ever. The next part in building the database for ML is thus to select proper input parameters, namely the descriptor of the material feature.
Following ‘feature selection’ for finding the best descriptors in Ref. [34], it is apparent that a complex build of the feature space is required.

Figure 2. (a) Non-relativistic and (b) relativistic Fano plots for Si (an exemplary representative of elemental materials, open circles) and h-GaN (an exemplary representative of compounds, open squares). Two energy ranges are considered: from 50 eV to 1 MeV for elemental materials and from 200 eV to 200 keV for compounds.

According to the flow presented in Table 1, the starting point is the seven basic features (series A). The features are Z (atomic number), M (atomic mass), ρ (density), Nv (number of valence electrons per atom), Eg (bandgap energy), Ei (starting-point energy), and R (atomic radius). First, it is obvious that some of the features are defined for elemental materials and are not directly compatible with compounds. Our solution is to extend their definitions to compounds in a reasonable way. For Z, the total number of electrons per molecule is used for compounds; for M, the molecular mass is used instead; for R, the average radius over all atoms in the molecule is used instead, similar to the case for elemental materials. Second, Ei is a new feature and is the valence-band width plus the bandgap energy. That is to say, the starting point of electron energy in the TPP(−2M) formula is the Fermi energy (EF) for conductors [21] and the bottom of the conduction band for non-conductors [22]. Ei is defined in this way for both elemental materials and compounds because we believe that the starting point of the electron energy has a distinct physical meaning across the mix of elemental and compound materials.
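Starting from these seven features, the combination scheme of Table 1 can be enumerated mechanically. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: it builds series B, C, and D for a single material, skips zero denominators and non-positive bases for fractional powers, and makes no attempt to remove algebraically equivalent terms, so its pool is much larger than the counts listed in Table 1. The function name, the key format, and the feature values in `si` (in particular `Ei` and `R`) are illustrative placeholders.

```python
from itertools import combinations_with_replacement, permutations

def build_pool(a):
    """Enumerate series A-D of Table 1 for one material.
    `a` maps the seven basic feature names to their values."""
    Ei, Eg = a["Ei"], a["Eg"]
    # B1: only the two energies are added or subtracted (same physical units).
    b = {
        "(Ei+Eg)": Ei + Eg,
        "(Ei-Eg)": Ei - Eg,
        "(Ei^2+Eg^2)": Ei ** 2 + Eg ** 2,
        "(Ei^2-Eg^2)": Ei ** 2 - Eg ** 2,
    }
    # B2: products of two basic features.
    for f, g in combinations_with_replacement(a, 2):
        b[f"{f}*{g}"] = a[f] * a[g]
    # B3: ratios of two basic features, skipping zero denominators.
    for f, g in permutations(a, 2):
        if a[g] != 0:
            b[f"{f}/{g}"] = a[f] / a[g]

    ab = {**a, **b}
    c = {}
    # C1: basic feature times a series-B descriptor.
    for f in a:
        for g in b:
            c[f"{f}*({g})"] = a[f] * b[g]
    # C2: ratios within A and B, again skipping zero denominators.
    for f, g in permutations(ab, 2):
        if ab[g] != 0:
            c[f"({f})/({g})"] = ab[f] / ab[g]

    abc = {**ab, **c}
    # D1: fractional powers -0.9 ... -0.1 and 0.1 ... 0.9 of positive descriptors.
    exponents = [s * k / 10 for s in (-1, 1) for k in range(1, 10)]
    d = {}
    for name, value in abc.items():
        if value > 0:
            for e in exponents:
                d[f"({name})^{e}"] = value ** e
    return {**abc, **d}

# Placeholder feature values for a silicon-like material (illustrative only).
si = {"Z": 14, "M": 28.0855, "rho": 2.33, "Nv": 4, "Eg": 1.1, "Ei": 13.6, "R": 1.11}
pool = build_pool(si)
```

A descriptor of the form (M/ρNv)^0.5 then appears under the key `"((M/rho)/(Nv))^0.5"`, while sums of inhomogeneous quantities such as Ei + Z are never generated.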
On the basis of the most basic features, a step-by-step combination is carried out to establish the descriptor pool (series B, C, and D), ensuring the completeness of descriptors in each step but excluding inhomogeneous descriptors. Detailed information can be found in the Methods section and Table 1.

#### 3.4. Selecting principal terms with LASSO

LASSO is run on the set of ~17,000 candidate descriptors. The least-squares results for β and γ were set as the target values; that is, the y values in Equation (1) of the LASSO method. Cross-validation (CV) is adopted to adjust the hyper-parameters in LASSO. With the best hyper-parameters obtained by CV, 23 descriptors for β and 29 descriptors for γ, whose coefficients are not zero, are selected. Coefficients of the other descriptors are reduced to zero on the basis of the ‘feature selection’ of LASSO. The TPP-LASSO formula is built with our framework, where β and γ are linear combinations of the descriptors together with the intercept, and the coefficients are also given by LASSO.

Although the number of descriptors is sharply reduced from ~17,000 to ~30 (only about 0.2% are selected as principal descriptors), there are still too many descriptors (terms) for a formula. A natural proposal is to select descriptors according to their importance, but it is difficult to see the importance directly. Here an intuitive importance measure is introduced. The importance of the m-th descriptor for a certain material can be measured as

$$I_m = \frac{\lvert a_m F_m\rvert}{\sum_{i=1}^{n}\lvert a_i F_i\rvert}, \qquad (6)$$

where a is the coefficient of a descriptor, F is the descriptor value, and n is the total number of descriptors. A larger value of Im corresponds to a more important descriptor. The sum of Im over all descriptors is 100% for a certain material. To obtain the overall importance of the m-th feature for all materials, the average importance for β and γ over all materials is calculated.

Figure 3 shows the accumulation process when descriptors are added in turn.
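The importance measure of Equation (6), its average over materials, and the accumulation shown in Figure 3 can be written down directly. The following is a small sketch with hypothetical coefficients and descriptor values (none are taken from the paper); it also ignores the intercept, whose importance the authors fold into the first term.

```python
def importance(coefs, values):
    """Equation (6): share of |a_m * F_m| in the summed absolute contributions."""
    contrib = [abs(a * f) for a, f in zip(coefs, values)]
    total = sum(contrib)
    return [c / total for c in contrib]

def mean_importance(coefs, values_by_material):
    """Average the per-material importances over all materials."""
    per_material = [importance(coefs, v) for v in values_by_material]
    n = len(per_material)
    return [sum(column) / n for column in zip(*per_material)]

def accumulate(names, scores):
    """Rank descriptors by importance and return (name, running total) pairs."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    running, out = 0.0, []
    for name, score in ranked:
        running += score
        out.append((name, running))
    return out

# Hypothetical example: three descriptors, two materials.
names = ["d1", "d2", "d3"]
coefs = [0.5, -0.2, 0.01]
values_by_material = [[2.0, 1.0, 10.0], [1.0, 4.0, 20.0]]
scores = mean_importance(coefs, values_by_material)
ranking = accumulate(names, scores)
```

Truncating `ranking` at the point where the running total is judged large enough corresponds to keeping only the leading descriptors in a shortened formula.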
For β, (M/ρNv)^0.5 and (M/ρNv)^0.4 obviously have the largest proportions except for the intercept and are considered the main descriptors, while Z/Nv and others are considered the correction. Similarly, for γ, [(Eg + Ei)ρ]^−0.2 is the main descriptor and (Zρ/M)^−0.8 and others are the correction. It naturally follows to reduce the number of descriptors and omit the unimportant terms in the formula. Therefore, only three descriptors of β and two descriptors of γ are used for our shortened TPP-LASSO formula, which we refer to as the TPP-LASSO-S formula. Considering that the coefficients of the descriptors for the shortened formula may no longer be precise, a linear regressor is used to update the formula for the same IMFP database. Our TPP-LASSO-S formula is given by

$$\lambda = \frac{\alpha(E)E}{E_p^2\left[\beta_r \ln\left(\gamma_r \alpha(E) E\right)\right]}\ (\text{Å}), \qquad (7a)$$

$$\beta_r = -0.0012 + 0.046\left(\frac{M}{\rho N_v}\right)^{0.5} - 0.035\left(\frac{M}{\rho N_v}\right)^{0.4} + 0.0019\,\frac{Z}{N_v}, \qquad (7b)$$

$$\gamma_r = -0.07 + 0.26\left[\rho\,(E_i + E_g)\right]^{-0.2} + 0.066\left(\frac{Z\rho}{M}\right)^{-0.8}, \qquad (7c)$$

where $\alpha(E) = \left[1 + E/(2m_ec^2)\right]/\left[1 + E/(m_ec^2)\right]^2$, mec² is the electron rest energy (510,998.9 eV), Ep is the free-electron plasmon energy (in eV), Ei is the starting-point energy (in eV), Eg is the bandgap energy for nonconductors (in eV), ρ is the bulk density (in g cm⁻³) and Nv is the number of valence electrons per atom or molecule.

Figure 3. Percentage histogram of the importance ratio for LASSO-selected descriptors for (a) β and (b) γ. Only descriptors with an importance ratio greater than 1% are shown in detail. The upper shadowed parts in the columns on the right of each panel summarize the importance of descriptors with minor importance (<1%). The upper red parts represent the importance increase when terms are accumulated following the importance order from high to low. We note that the importance of the constant term, namely, the intercept of the linear combination, is included in the importance of the first term.

#### 3.5. Necessity of ~17,000 candidate descriptors

Table 1 shows that the descriptor pool in our framework is established step by step in four major parts, namely A, B, C, and D, giving a total of ~17,000 descriptors. This quantity of descriptors is appropriate when accuracy, stability, formula length, and computational cost are considered together. To support this statement, a series of different simplified TPP-LASSO formulae were produced by repeating the entire framework, in which the procedures were the same except for the descriptor pool size. The simplification for each TPP-LASSO formula was conducted using the least-squares method, and the target numbers of terms for β and γ were the same as or less than those in Equation (7) to allow fair comparison.

Figure 4 compares the average RMSDs and variation of RMSDs for the different simplified TPP-LASSO formulae produced. On the one hand, the simplified TPP-LASSO formulae produced using less than the A + B + C + D descriptor pool showed poorer accuracy and stability compared to those obtained using the original TPP-2M formula. This reflects a shortage of descriptors if the pool does not reach step D. On the other hand, the A + B + C + D descriptor pool that included all the descriptors displayed the same complexity as the terms in the TPP-2M formula but achieved better performance. The above discussion reveals that using the A + B + C + D descriptor pool in our framework is quite accurate and stable and does not introduce unnecessary complexity.

Figure 4. Comparison of average percentage RMSDs and variation of RMSDs for different TPP-LASSO-S formulae produced using our framework with different sizes of descriptor pool. The x-axis represents the size of the descriptor pool. The red columns indicate average RMSDs and blue columns show the variation of RMSDs. The TPP-2M formula is included in the last group of columns to allow direct comparison.

In addition, we also validated the robustness of LASSO by using different sizes of training dataset randomly extracted from the set A + B + C + D. It turns out that LASSO can pick out the five principal descriptors in β and γ appearing in our simplified TPP-LASSO formula when given reasonably different contents and sizes of training sets. This also reflects the stability of the principal descriptors picked up by LASSO.

#### 3.6. Comparison of our new formula and other formulae

As previously mentioned, many empirical formulae stand side by side in the field of describing IMFPs, especially formulae applicable in similar energy ranges. Formulae for high-energy electrons are now considered for comparison.

Gries proposed the so-called G1 formula [29] using an atomistic model:

$$\lambda = k_1\,\frac{V_a}{Z^*}\,\frac{E}{\log E - k_2}\ (\text{nm}), \qquad (8)$$

where Va = M/ρ is the atomic volume, Z* is the nominal effective number of interaction-prone electrons per atom, which was found to equal Z^0.5, and average values per atom of M and Z* are used for compounds. k1 and k2 are fitting parameters; Tanuma et al. [35] summarized their best values on the basis of Gries’ work. The most inconvenient point is that the k1 and k2 values are given separately for each group of materials relating to the periodic table. The G1 formula has better performance for several compounds, but there can be substantial deviations (approximately 50%) for some materials.

Another empirical expression, designated the S1 formula, was proposed by Seah [30] to estimate IMFP values for materials:

$$\lambda = \frac{\left(4 + 0.44\,Z^{0.5} + 0.104\,E^{0.872}\right)a^{1.7}}{Z^{0.3}\,(1 - W)}\ (\text{nm}), \qquad (9a)$$

$$a^3 = \frac{10^{21}\,M}{\rho\,N_A\,(g + h)}, \qquad (9b)$$

where W = 0.02Eg (W = 0 for an elemental solid) and NA is the Avogadro constant. The terms g and h in Equation (9b) represent stoichiometry coefficients for an assumed binary compound GgHh; for an elemental material, g = 1 and h = 0. The S1 formula does not follow the form of the Bethe equation and thus loses some of the physical picture.
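For readers who want to evaluate Equation (9) numerically, a minimal sketch follows. It assumes M in g/mol and ρ in g cm⁻³, so that the factor 10^21 converts cm³ to nm³ and a comes out in nm. The function names and the `elemental` flag are choices made for this sketch, and the silicon inputs are ordinary handbook-style values used only for illustration. How Z should be averaged for compounds is not specified in the text above, so the caller supplies it.

```python
N_A = 6.02214076e23  # Avogadro constant in 1/mol

def s1_lattice_parameter_nm(M, rho, g=1, h=0):
    """Equation (9b): a (nm) from a^3 = 1e21 * M / (rho * N_A * (g + h))."""
    return (1e21 * M / (rho * N_A * (g + h))) ** (1.0 / 3.0)

def s1_imfp_nm(E, Z, M, rho, Eg=0.0, g=1, h=0, elemental=True):
    """Equation (9a): S1 estimate of the IMFP in nm for electron energy E in eV."""
    a = s1_lattice_parameter_nm(M, rho, g, h)
    W = 0.0 if elemental else 0.02 * Eg   # W = 0 for an elemental solid
    numerator = (4.0 + 0.44 * Z ** 0.5 + 0.104 * E ** 0.872) * a ** 1.7
    return numerator / (Z ** 0.3 * (1.0 - W))

# Illustrative evaluation for elemental silicon at 1 keV and 2 keV.
a_si = s1_lattice_parameter_nm(M=28.0855, rho=2.33)
imfp_1kev = s1_imfp_nm(E=1000.0, Z=14, M=28.0855, rho=2.33)
imfp_2kev = s1_imfp_nm(E=2000.0, Z=14, M=28.0855, rho=2.33)
```

Note that Equation (9) takes no plasmon energy or valence-electron count as input, which is one way to see the departure from the Bethe form mentioned above.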
As a result of this departure from the Bethe equation, the S1 formula is relatively accurate for most elemental materials, but its adjustment for compounds is clearly insufficient, leading to a poor description of them. Furthermore, the S1 formula cannot be extended to a multi-element compound such as Y3Al5O12.

Figure 5 compares our TPP-LASSO formulae with the other empirical formulae mentioned above, showing the RMSD and its variance. The S1, G1, and TPP-2M formulae are not optimized for the newly calculated IMFP database, namely the FPA results. Some of these formulae are therefore not applicable to certain materials, and those materials are excluded from the statistics of the corresponding formula. We multiplied the electron energy by α(E) in the comparison because the S1 and G1 formulae do not include a relativistic correction in the high-energy region.

In a horizontal comparison (across formulae), Figure 5(a) focuses on the accuracy of the formulae. The RMSDs of the new formulae are lower than 10%, while the other formulae cannot achieve such accuracy for both elemental materials and compounds, even if unsuitable materials are ignored. Beyond accuracy, Figure 5(b) shows the RMSD variance for each formula. This panel reflects the stability of the IMFP description, or the generalization ability in ML terms. Put simply, owing to the contribution of CV there are barely any obviously poorly described materials, and the variances are lower than 0.005 for our formulae.

In contrast, some of the other formulae give extremely high RMSDs for many materials. For example, the RMSD of diamond is as high as 71% according to the TPP-2M formula. In a vertical comparison (between material classes), the S1 and G1 formulae provide relatively accurate and stable descriptions of elemental materials but poor descriptions of compounds; the TPP-2M formula has the same level of accuracy for elemental materials and compounds, but it has poor stability because of outliers such as the carbon allotropes mentioned above.
So far, our formulae are seen to be not only accurate but also stable and all-round.

To make a uniform comparison, a recently proposed ML method, namely the Gaussian process regressor (GPR) [36], was used to predict the IMFP for elemental materials (details of the prediction of IMFPs using GPR will be presented elsewhere). The accuracy of our TPP-LASSO formula lies between that of the GPR and those of the other formulae, while our TPP-LASSO formula and the GPR have similar stabilities. This reveals the advantage that the ML element gives our new formula over other empirical formulae. We note that the formula proposed by Nguyen-Truong [37] has a decisive weakness: it does not use simple material parameters and is thus extremely reliant on the ELF. Additionally, his formula is derived from an (infinitely) high-energy approximation of the FPA, which makes it inapplicable at energies below 500 eV. Although his formula fits well in the high-energy region, such an analytical formula is not appropriate for comparison here.

Figure 5. Comparison of (a) average percentage RMSDs and (b) variation in RMSDs for all 83 materials when using the S1, G1, TPP-2M, TPP-LASSO, and TPP-LASSO-S formulae and a machine learning method (i.e. the GPR). The red line is for elemental materials and the blue line for compounds.

For most materials, our formulae describe the IMFPs better than the TPP-2M formula does; i.e. our formulae have lower average RMSDs than the other empirical formulae considered. Table 2 compares the RMSDs in detail. Numerically, the average RMSDs over all materials are 7.2% and 8.0% for the non-simplified TPP-LASSO and the TPP-LASSO-S formulae, respectively, and 10.8% for the TPP-2M formula, i.e. reductions of about one-third and one-quarter. In fact, 50 of the 83 materials for the TPP-LASSO formula and 52 of the 83 materials for the TPP-LASSO-S formula are described more accurately than with the original TPP-2M formula. Furthermore, materials poorly described by the TPP-2M formula, such as the three forms of carbon and the two types of boron nitride shown in Table 2, are accurately described by our new formula. Detailed comparisons are presented in Figure 6. Our TPP-LASSO formula does not describe the five materials in the figure perfectly, but it does a better job than the TPP-2M formula. We note that the carbon allotropes have similar RMSDs according to our TPP-LASSO-S formula.

Table 2. Comparison of percentage RMSDs calculated from Equation (5) for the TPP-2M, TPP-LASSO, and TPP-LASSO-S formulae.

| Elemental solid | TPP-2M | TPP-LASSO | TPP-LASSO-S | Compound | TPP-2M | TPP-LASSO | TPP-LASSO-S |
|---|---|---|---|---|---|---|---|
| Li | 16.1% | 7.5% | 8.8% | AgBr | 9.4% | 6.5% | 7.0% |
| Be | 21.0% | 2.7% | 19.5% | AgCl | 8.0% | 7.0% | 6.4% |
| C-graphite | 45.2% | 5.0% | 18.9% | h-AgI | 9.0% | 5.5% | 8.1% |
| C-diamond | 71.2% | 5.1% | 24.8% | Al2O3 | 18.1% | 3.4% | 4.0% |
| C-glassy | 2.1% | 24.2% | 16.4% | AlAs | 0.8% | 3.9% | 2.2% |
| Na | 3.8% | 6.3% | 6.1% | h-AlN | 13.9% | 2.2% | 4.1% |
| Mg | 8.8% | 9.3% | 13.5% | AlSb | 3.9% | 8.0% | 2.9% |
| Al | 8.7% | 4.7% | 12.6% | c-BN | 66.0% | 10.8% | 19.2% |
| Si | 4.1% | 6.1% | 2.2% | h-BN | 33.3% | 2.1% | 4.5% |
| K | 2.4% | 0.6% | 4.9% | h-CdS | 10.4% | 9.4% | 9.2% |
| Sc | 25.4% | 20.6% | 26.1% | h-CdSe | 12.2% | 9.5% | 10.1% |
| Ti | 19.7% | 9.3% | 19.6% | CdTe | 7.5% | 3.2% | 5.7% |
| V | 7.6% | 4.0% | 8.9% | GaAs | 4.1% | 7.0% | 3.3% |
| Cr | 3.8% | 5.8% | 6.2% | h-GaN | 3.0% | 3.4% | 6.0% |
| Fe | 4.0% | 9.9% | 2.2% | GaP | 3.0% | 6.3% | 4.4% |
| Co | 4.6% | 4.0% | 12.4% | GaSb | 9.1% | 11.3% | 4.5% |
| Ni | 3.1% | 3.9% | 7.9% | h-GaSe | 0.9% | 6.1% | 1.9% |
| Cu | 8.9% | 10.7% | 4.3% | InAs | 9.2% | 9.0% | 3.4% |
| Ge | 3.5% | 2.1% | 4.5% | InP | 6.8% | 9.7% | 5.3% |
| Y | 13.3% | 1.4% | 3.8% | InSb | 14.0% | 13.2% | 4.6% |
| Nb | 1.8% | 14.6% | 7.4% | KBr | 5.7% | 11.7% | 20.3% |
| Mo | 5.0% | 5.3% | 3.2% | KCl | 4.5% | 11.7% | 18.7% |
| Ru | 2.9% | 1.9% | 7.5% | MgF2 | 20.9% | 9.0% | 13.5% |
| Rh | 5.0% | 3.8% | 11.9% | MgO | 10.0% | 7.7% | 5.4% |
| Pd | 2.8% | 2.1% | 10.0% | NaCl | 16.5% | 27.5% | 32.1% |
| Ag | 3.1% | 5.3% | 10.8% | NbC0.712 | 2.1% | 4.5% | 0.9% |
| In | 20.5% | 4.1% | 2.8% | NbC0.844 | 2.4% | 4.5% | 1.0% |
| Sn | 1.7% | 13.4% | 10.8% | NbC0.93 | 2.6% | 4.6% | 1.1% |
| Cs | 32.3% | 4.3% | 2.8% | PbS | 6.2% | 3.4% | 1.2% |
| Gd | 7.6% | 16.1% | 10.7% | PbSe | 9.0% | 3.8% | 2.1% |
| Tb | 7.7% | 1.8% | 4.1% | PbTe | 15.4% | 7.7% | 1.7% |
| Dy | 2.7% | 8.4% | 3.1% | SiC | 15.1% | 2.6% | 8.3% |
| Hf | 12.6% | 10.7% | 9.1% | SiO2 | 2.8% | 19.4% | 22.9% |
| Ta | 15.0% | 5.2% | 2.3% | SnTe | 11.9% | 15.9% | 7.1% |
| W | 6.8% | 5.5% | 2.4% | TiC0.7 | 14.0% | 4.9% | 13.1% |
| Re | 4.4% | 2.7% | 3.2% | TiC0.95 | 17.1% | 6.9% | 15.4% |
| Os | 7.8% | 7.8% | 4.7% | VC0.76 | 3.4% | 4.3% | 3.8% |
| Ir | 8.2% | 4.4% | 3.1% | VC0.86 | 5.2% | 2.7% | 5.5% |
| Pt | 10.9% | 4.8% | 3.3% | Y3Al5O12 | 1.4% | 5.6% | 5.1% |
| Au | 10.8% | 5.2% | 3.5% | ZnS | 4.9% | 10.4% | 9.5% |
| Bi | 12.9% | 4.5% | 1.8% | ZnSe | 11.4% | 9.5% | 9.6% |
|  |  |  |  | ZnTe | 8.3% | 3.4% | 5.3% |

Figure 6. Comparison between the FPA-calculated IMFP values (open black circles), the experimental IMFP values for graphite carbon (open black triangles), and the IMFPs described by the GPR (green dotted line), S1 formula (purple dash-dot line), G1 formula (brown dash-dot-dot line), TPP-2M formula (indigo short-dashed line), TPP-LASSO formula (red solid line), and TPP-LASSO-S formula (blue dashed line). The results for three typical carbon allotropes are shown in (a) and the results for c-BN and h-BN are shown in (b).

Besides comparing the IMFPs from the different formulae with the FPA results, we also compare the IMFPs from the TPP-2M formula and from our TPP-LASSO-S formula with the experimental IMFPs of Tanuma et al. [38]. The comparison for graphite at electron energies above 200 eV is shown in Figure 6(a). The experimental result is closer to the IMFPs from our TPP-LASSO-S formula than to those from the TPP-2M formula, except for the data point at 200 eV. The RMSDs (Equation (5)) for electron energies above 200 eV are listed below for different materials; the value before the slash is the RMSD between the TPP-2M formula and the experimental IMFPs, and the value after the slash is the RMSD between our TPP-LASSO-S formula and the experimental IMFPs: graphite carbon (14.0%/15.2%), Si (11.2%/9.4%), Cr (14.3%/15.5%), Fe (9.3%/12.3%), Cu (5.2%/8.2%), Mo (18.8%/16.1%), Ag (11.26%/15.5%), Ta (29.3%/11.0%), W (29.7%/21.6%), Pt (11.7%/16.7%), Au (7.6%/8.7%), average (14.8%/13.7%). On average, the accuracy of our TPP-LASSO-S formula is slightly superior to that of the TPP-2M formula.
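
The averages quoted above can be checked directly from the per-material values. The short sketch below only re-enters the eleven RMSD pairs listed in the text; the variable names are illustrative.

```python
from statistics import mean

# (TPP-2M, TPP-LASSO-S) percentage RMSDs against experimental IMFPs,
# copied from the list in the text (electron energies above 200 eV)
rmsd_vs_experiment = {
    "graphite": (14.0, 15.2),
    "Si": (11.2, 9.4),
    "Cr": (14.3, 15.5),
    "Fe": (9.3, 12.3),
    "Cu": (5.2, 8.2),
    "Mo": (18.8, 16.1),
    "Ag": (11.26, 15.5),
    "Ta": (29.3, 11.0),
    "W": (29.7, 21.6),
    "Pt": (11.7, 16.7),
    "Au": (7.6, 8.7),
}

tpp2m_avg = mean(pair[0] for pair in rmsd_vs_experiment.values())
lasso_s_avg = mean(pair[1] for pair in rmsd_vs_experiment.values())

# Materials for which TPP-LASSO-S is closer to experiment than TPP-2M
improved = sorted(
    name for name, (tpp2m, lasso_s) in rmsd_vs_experiment.items() if lasso_s < tpp2m
)

print(f"TPP-2M average: {tpp2m_avg:.1f}%")         # 14.8%
print(f"TPP-LASSO-S average: {lasso_s_avg:.1f}%")  # 13.7%
print("Lower RMSD with TPP-LASSO-S:", improved)    # ['Mo', 'Si', 'Ta', 'W']
```

With these inputs the script reproduces the quoted averages and shows that TPP-LASSO-S has the lower RMSD for four of the eleven materials (Si, Mo, Ta, and W).
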
The RMSDs of Ta and W, however, decrease substantially with our formula. This is evidence that our formula can improve the accuracy for materials that are poorly described by the TPP-2M formula without any large sacrifice of accuracy for other materials.

3.7. Physics picture behind the principle terms

Further information can be drawn from Figures 5 and 6. First, it is possible that our TPP-LASSO formula outperforms the TPP-2M formula simply because it has many more terms. However, the RMSD difference between the TPP-LASSO and TPP-2M formulae, together with the difference between the TPP-LASSO and TPP-LASSO-S formulae, reveals that the number of terms is not an important factor; i.e. the TPP-LASSO-S formula does not lose its advantage when its number of descriptors is similar to that of the TPP-2M formula. Second, if the RMSD fluctuates strongly among different materials, the generalization capability of the selected descriptors is probably poor. The extremely large RMSDs of the five typical materials for the original TPP-2M formula are examples of such poor generalization capability, which is considered the greatest weakness of the TPP-2M formula. For our TPP-LASSO formula, the generalization capability is so good that the IMFP is described with relatively uniform accuracy regardless of how far the number of descriptors is limited. The similar RMSDs among the carbon allotropes for our TPP-LASSO-S formula demonstrate this generalization capability in different cases.

On the basis of the reliability of our new formula and the selected descriptors, we finally turn to the physical meaning behind the descriptors that we found. Figure 3 shows that the most important descriptors are $(M/\rho N_v)^{0.5}$ and $(M/\rho N_v)^{0.4}$. Notably, the free-electron plasmon energy Ep is defined as

$$E_p = 28.8\left(\frac{\rho N_v}{M}\right)^{0.5} \ \text{(eV)}. \tag{10}$$

In other words, the main descriptors we found for β are actually $E_p^{-1}$ and $E_p^{-0.8}$ (up to constant factors). So far, one of the most effective descriptors found manually to describe the IMFP is Ep.
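
The correspondence between the selected descriptors and Ep can be verified numerically. The following sketch assumes ρ in g/cm³ and M as the atomic weight, and uses aluminium with assumed handbook values (ρ = 2.70 g/cm³, Nv = 3, M = 26.98) purely as an illustration; the function name `plasmon_energy_ev` is hypothetical.

```python
import math


def plasmon_energy_ev(density, n_valence, molar_mass):
    """Equation (10): Ep = 28.8 * (rho * Nv / M)**0.5, in eV."""
    return 28.8 * math.sqrt(density * n_valence / molar_mass)


# Illustrative inputs for aluminium (assumed values, not taken from the paper)
rho, nv, m = 2.70, 3, 26.98
ep = plasmon_energy_ev(rho, nv, m)

# The two leading descriptors for beta reported from Figure 3
d_05 = (m / (rho * nv)) ** 0.5
d_04 = (m / (rho * nv)) ** 0.4

# Each descriptor equals a constant times a power of 1/Ep
assert math.isclose(d_05, 28.8 / ep)
assert math.isclose(d_04, (28.8 / ep) ** 0.8)

print(f"Ep(Al) = {ep:.2f} eV")
```

Because $(M/\rho N_v)^{0.5} = 28.8/E_p$, the two descriptors differ from $E_p^{-1}$ and $E_p^{-0.8}$ only by the constant factors 28.8 and 28.8^0.8, which a linear fit absorbs into its coefficients.
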
Ep has been used in the TPP-2M formula since the initial work of Tanuma et al. [17], because valence electrons make the main contribution to electron scattering in the bulk and Ep contains Nv. We believe that this interesting fact is not just a coincidence and that there is a physical meaning behind it. Indeed, Tanuma et al. once attempted to explain the magnitude of the IMFP for elemental materials in Ref. [39]. They compared the theoretically calculated IMFP formula with the TPP-2M formula and concluded that

$$\beta \cong \frac{k}{\Delta E_a}, \tag{11}$$

where $k$ is a constant and $\Delta E_a$ is the average excitation energy. In the discussion in Ref. [39], Ep is hypothesized to be a candidate for $\Delta E_a$, and it is thus concluded that β ~ 1/Ep.

In our work, however, 1/Ep is selected as the most important descriptor out of ~17,000 descriptors. This is evidence that, in a very large descriptor space, 1/Ep is the most suitable descriptor of the IMFP. In Figure 7(a), β has a clear linear relationship with $E_p^{-1}$, similar to that with $E_p^{-0.8}$.

Figure 7. Relationships of the principle descriptors with β: (a) $E_p^{-1}$ and (b) $Z/N_v$. Open circles represent the deviating data for alkali metals and the corresponding linear fit.

Besides Ep, another key term for β is Z/Nv. In the most basic Bethe equation [28], Z plays an important role in the definition of the electron density. However, Z is not included in the latest TPP-2M formula because the relationship between the material-dependent parameters and the formula had not been discovered. Herein, through the powerful LASSO method, Z/Nv has been brought into the new formula as a major breakthrough. Figure 7(b) shows the linear relationship between β and Z/Nv for all materials except the alkali metals. The important point is that Z has not yet been included in the TPP-2M formula, whereas it was introduced into our TPP-LASSO formula by LASSO.
Considering that Z is the total number of electrons, Nv/Z can be regarded as the fraction of electrons that are valence electrons, and Z/Nv is its reciprocal. The alkali metals are seemingly 'self-contained' because they follow a different linear relationship. Although the alkali metals are separated from the other materials, they still show an obvious linear distribution.

Among the terms for γ, the most common term is $[(E_i + E_g)\rho]^{-0.2}$. This principle descriptor has an approximate physical meaning. For metals, the term reduces to one associated with $E_F\rho$ and has a relatively clear physical meaning related to the normalized Fermi energy, which mainly affects secondary electron (SE) excitation in metals. For semiconductors and insulators, the principle descriptor $[(E_i + E_g)\rho]^{-0.2}$ to some extent reveals the physical picture of SE excitation. Figure 8(a) shows a schematic diagram of the energy-band structure of a semiconductor or insulator with band gap energy Eg. When energetic electrons move inside an insulator, they may transfer all or part of their energy to electrons in the valence band, which can then cross the band gap into the conduction band; this is a typical SE excitation process in a semiconductor or insulator. Such SE excitation can occur only if the energy of the primary electron is above Ei + Eg (i.e. Ev + 2Eg for insulators, measured from the bottom of the valence band): in the limiting case, an electron at the top of the valence band is excited across the band gap into the conduction band while the primary electron is still within the conduction band after losing energy.
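
The threshold can be written out explicitly. As a sketch, measure all energies from the bottom of the valence band, so that the top of the valence band lies at $E_v$ and the bottom of the conduction band at $E_i = E_v + E_g$. The symbols $E$ (energy of the primary electron), $\varepsilon$ (initial energy of the valence electron), and $\Delta E$ (energy transferred) are introduced here for illustration only.

$$
\begin{aligned}
\varepsilon + \Delta E \ge E_i,\quad \varepsilon \le E_v \quad &\Rightarrow\quad \Delta E \ge E_i - E_v = E_g,\\
E - \Delta E \ge E_i \quad &\Rightarrow\quad E \ge E_i + \Delta E \ge E_i + E_g = E_v + 2E_g.
\end{aligned}
$$

The first line is the condition for the valence electron to reach the conduction band; the second requires the primary electron to remain in the conduction band rather than end up in the band gap.
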
Correspondingly, the term $(E_i + E_g)\rho$ reflects the possibility of SE excitation per unit volume in a semiconductor or insulator. The fact that LASSO selects this descriptor therefore suggests that, for metals and insulators alike, SE excitations are strongly correlated with the inelastic scattering behavior of electrons.

Figure 8(b) shows the linear relationship between γ and $[(E_i + E_g)\rho]^{-0.2}$. In Figure 8(b), the red points represent materials with Eg ≠ 0, in other words insulators or semiconductors, and the black points indicate metals. The red points share a common linear relationship with the black points, which means that the descriptor $[(E_i + E_g)\rho]^{-0.2}$ holds for all kinds of materials when describing IMFPs. This is also evidence that this principle descriptor generalizes well to the IMFPs of all materials.

Figure 8. (a) Typical band structure and electron excitation process for insulators. Here Eg is the band gap energy and Ei is the starting-point energy, which is the valence-band width plus the band gap energy for insulators. Suppose that there are two electrons: electron 1 at the energy of the valence band edge (Ev) and electron 2 at an energy Eg above the conduction band bottom (Ei + Eg). Electron 2 gives energy to electron 1, and electron 1 is excited to the conduction band. There is therefore an energy restriction: the energy of electron 2 must be higher than Ei + Eg, or else electron 2 would fall into the band gap after giving up its energy, which is impossible. (b) Relationship between the principle descriptor $[(E_i + E_g)\rho]^{-0.2}$ and γ. The red (black) dots represent materials with Eg ≠ 0 (Eg = 0).

Beyond the isolated explanation of each descriptor, a more important physical picture is obtained by putting the physical meanings of the descriptors together. As is known to physicists in the surface analysis field, the IMFP is a fundamental parameter describing the process of electron scattering when an energetic electron moves inside or near a material. In inelastic scattering, there are two main contributing excitations: single-electron excitation and plasmon excitation. Reviewing the principle descriptors mentioned above, the following relationships can be summarized:

(1) As the principle descriptor of γ, $[(E_i + E_g)\rho]^{-0.2}$ reflects the SE excitation contributed by single-electron excitations in various materials, which differ because of their different band structures.

(2) As the principle descriptor of β, Ep, as its name suggests, occupies a very important position in the description of plasmon excitation in inelastic scattering [17]. In fact, plasmon excitation can be seen as a collective oscillation of valence electrons.

Together, the two main inelastic scattering behaviors of electrons, caused by single-electron excitation and plasmon excitation, are both included in the principle descriptors chosen by LASSO: Ep in β for the collective behavior of valence electrons and $[(E_i + E_g)\rho]^{-0.2}$ in γ for the individual behavior of valence electrons. Although the principle descriptors were produced purely numerically, they turn out to describe a meaningful physical picture.

4. Conclusions

On the basis of an existing database, we developed a new framework that uses ML to enhance the accuracy of an empirical formula, and we gave a formula for the IMFP as an example. The parameters in the TPP-LASSO formula were thoroughly discussed using a Fano plot, and the LASSO algorithm was then employed to select the combination of terms for these parameters. The LASSO algorithm demonstrated a superior ability to reduce the number of terms while keeping the loss of descriptive ability within an acceptable range. With the introduction of a system that analyzes importance, the balance between accuracy and convenience can also be adjusted easily. Besides improved accuracy, another important advantage of the framework is its ability to guide application or exploration.
Herein, we provided data-driven evidence for the long-established parameter Ep and introduced Z into the TPP-2M formula, which Tanuma et al. did not manage to do using the Bethe equation. A reasonable hypothesis for the connection between the band structure and the IMFP is revealed by the framework's selection of a major term. These contributions are strong evidence of the all-round ability of the new framework. Importantly, the framework is not limited to the IMFP example given here; it can easily be applied to other fields to determine a reliable empirical formula from a specified database while providing key descriptors for the field. The analysis is superior to traditional least-squares and remainder analyses in terms of accuracy, time taken, and convenience.

Author's contributions

X.L. wrote the program, performed the analysis of results, and wrote the initial manuscript. Z.F.H. and Y.S. gave crucial suggestions on the ML program and the initial manuscript. D.B.L. helped with the code programming. B.D. and Z.J.D. supervised the research. H.Y. and S.T. provided the physical picture and suggestions. All authors developed the concepts together, participated in the discussions of the work, and discussed and commented on the manuscript.

Acknowledgments

This work was supported by the "Materials research by Information Integration" Initiative (MI2I) Project of the Support Program for Starting Up Innovation Hub from the Japan Science and Technology Agency (JST) and the National Natural Science Foundation of China (No. 11574289). The calculations in this study were performed on the Numerical Materials Simulator at NIMS. We thank Dr.
Nagata Kenji from the National Institute for Materials Science for the suggestions about machine learning.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This work was supported by the National Natural Science Foundation of China [11574289].

ORCID

Bo Da http://orcid.org/0000-0002-0785-8662
Yang Sun http://orcid.org/0000-0002-4344-2920

Data availability

All data generated and/or analyzed during this study are included in this article.

References

[1] International Organization for Standardization (ISO). Surface chemical analysis-vocabulary-part 1: general terms and terms used in spectroscopy. Geneva: ISO; 2013. ISO 18115-1:2013.
[2] Powell CJ, Jablonski A. Surface sensitivity of X-ray photoelectron spectroscopy. Nucl Instrum Meth Phys Res A. 2009;601:54–65.
[3] Zou YB, Mao SF, Da B, et al. Surface sensitivity of secondary electrons emitted from amorphous solids: calculation of mean escape depth by a Monte Carlo method. J Appl Phys. 2016;120:235102.
[4] Bourke JD, Chantler CT. Momentum-dependent lifetime broadening of electron energy loss spectra: A self-consistent coupled-plasmon model. J Phys Chem Lett. 2015;6:314–319.
[5] Chantler CT, Bourke JD. X-ray spectroscopic measurement of photoelectron inelastic mean free paths in molybdenum. J Phys Chem Lett. 2010;1:2422–2427.
[6] Werner WS, Smekal W, Störi H, et al. Emission-depth-selective Auger photoelectron coincidence spectroscopy. Phys Rev Lett. 2005;94:038302.
[7] Ding ZJ, Shimizu R. A Monte Carlo modeling of electron interaction with solids including cascade secondary electron production. Scanning. 1996;18:92–113.
[8] Penn DR. Electron mean-free-path calculations using a model dielectric function. Phys Rev B. 1987;35:482.
[9] Mao SF, Li YG, Zeng RG, et al. Electron inelastic scattering and secondary electron emission calculated without the single-pole approximation. J Appl Phys. 2008;104:114907.
[10] Mermin ND. Lindhard dielectric function in the relaxation-time approximation. Phys Rev B. 1970;1:2362.
[11] Da B, Shinotsuka H, Yoshikawa H, et al. Comparison of the Mermin and Penn models for inelastic mean-free path calculations for electrons based on a model using optical energy-loss functions. Surf Interface Anal. 2019;51:627–640.
[12] Nguyen-Truong HT. Low-energy electron inelastic mean free paths for liquid water. J Phys Condens Matter. 2018;30:155101.
[13] Garcia-Molina R, Abril I, Kyriakou I, et al. Inelastic scattering and energy loss of swift electron beams in biologically relevant materials. Surf Interface Anal. 2016;49:11.
[14] Nguyen-Truong HT. Penn algorithm including damping for calculating the electron inelastic mean free path. J Phys Chem C. 2015;119:7883.
[15] Nguyen-Truong HT. Energy-loss function including damping and prediction of plasmon lifetime. J Electron Spectros Relat Phenom. 2014;193:79.
[16] Da B, Shinotsuka H, Yoshikawa H, et al. Extended Mermin method for calculating the electron inelastic mean free path. Phys Rev Lett. 2014;113:063201.
[17] Tanuma S, Powell CJ, Penn DR. Calculations of electron inelastic mean free paths for 31 materials. Surf Interface Anal. 1988;11:577–589.
[18] Tanuma S, Powell CJ, Penn DR. Calculations of electron inelastic mean free paths. II. Data for 27 elements over the 50-2000 eV range. Surf Interface Anal. 1991;17:911–926.
[19] Tanuma S, Powell CJ, Penn DR. Calculations of electron inelastic mean free paths. III. Data for 15 inorganic compounds over the 50-2000 eV range. Surf Interface Anal. 1991;17:927–939.
[20] Tanuma S, Powell CJ, Penn DR. Calculations of electron inelastic mean free paths. V. Data for 14 organic compounds over the 50-2000 eV range. Surf Interface Anal. 1994;21:165–176.
[21] Shinotsuka H, Tanuma S, Powell CJ, et al. Calculations of electron inelastic mean free paths. X. Data for 41 elemental solids over the 50 eV to 200 keV range with the relativistic full Penn algorithm. Surf Interface Anal. 2015;47:871–888. ibid, Surf Interface Anal. 2015;47:1132.
[22] Shinotsuka H, Tanuma S, Powell CJ, et al. Calculations of electron inelastic mean free paths. XII. Data for 42 inorganic compounds over the 50 eV to 200 keV range with the full Penn algorithm. Surf Interface Anal. 2019;51:427–457.
[23] Sun Y, Xu H, Da B, et al. Calculations of energy-loss function for 26 materials. Chin J Chem Phys. 2016;29:663.
[24] Tougaard S, Chorkendorff I. Differential inelastic electron scattering cross sections from experimental reflection electron-energy-loss spectra: application to background removal in electron spectroscopy. Phys Rev B. 1987;35:6570.
[25] Werner WS, Hayek M. Influence of the elastic scattering cross-section on angle-resolved reflection electron energy loss spectra of polycrystalline Al, Ni, Pt and Au. Surf Interface Anal. 1994;22:79–83.
[26] Da B, Sun Y, Mao SF, et al. A reverse Monte Carlo method for deriving optical constants of solids from reflection electron energy-loss spectroscopy spectra. J Appl Phys. 2013;113:214303.
[27] Xu H, Da B, Tóth J, et al. Absolute determination of optical constants by reflection electron energy loss spectroscopy. Phys Rev B. 2017;95:195417.
[28] Bethe HA. Zur theorie des Durchgangs schneller Korpuskularstrahlen durch Materie. Ann Phys. 1930;5:325–400.
[29] Gries WH. A universal predictive equation for the inelastic mean free pathlengths of X-ray photoelectrons and Auger electrons. Surf Interface Anal. 1996;24:38–50.
[30] Seah MP. An accurate and simple universal curve for the energy-dependent electron inelastic mean free path. Surf Interface Anal. 2012;44:497–503.
[31] Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B. 1996;58:267–288.
[32] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825.
[33] Tanuma S, Powell CJ, Penn DR. Proposed formula for electron inelastic mean free paths based on calculations for 31 materials. Surf Sci. 1987;192:L849–L857.
[34] Ghiringhelli LM, Vybiral J, Levchenko SV, et al. Big data of materials science: critical role of the descriptor. Phys Rev Lett. 2015;114:105503.
[35] Tanuma S, Powell CJ, Penn DR. Calculations of electron inelastic mean free paths (IMFPs) VI analysis of the Gries inelastic scattering model and predictive IMFP equation. Surf Interface Anal. 1997;25:25–35.
[36] Rasmussen CE, Williams CKI. Gaussian processes for machine learning. Cambridge (MA): MIT Press; 2006.
[37] Nguyen-Truong HT. Analytical formula for high-energy electron inelastic mean free path. J Phys Chem C. 2015;119:23627–23631.
[38] Tanuma S, Shiratori T, Kimura T, et al. Experimental determination of electron inelastic mean free paths in 13 elemental solids in the 50 to 5000 eV energy range by elastic-peak electron spectroscopy. Surf Interface Anal. 2005;37:833.
[39] Tanuma S. IMFP の定性的な理解ついて [On the qualitative understanding of the IMFP]. J Surf Anal. 1996;2:89–90.