# Fileset

[Dam_evidence_based_5.0134999.pdf](https://mdr.nims.go.jp/filesets/77a8d79a-5b34-4a05-bf57-04d84c78b7e0/download)

## Creator

[Minh-Quyet Ha](https://orcid.org/0000-0003-4617-0059), [Duong-Nguyen Nguyen](https://orcid.org/0000-0003-0980-8754), [Viet-Cuong Nguyen](https://orcid.org/0000-0002-8008-582X), [Hiori Kino](https://orcid.org/0000-0002-8912-686X), [Yasunobu Ando](https://orcid.org/0000-0003-3702-034X), [Takashi Miyake](https://orcid.org/0000-0003-2658-3470), [Thierry Denœux](https://orcid.org/0000-0002-0660-5436), [Van-Nam Huynh](https://orcid.org/0000-0002-3860-7815), [Hieu-Chi Dam](https://orcid.org/0000-0001-8252-7719)

## Rights

[Creative Commons BY Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)

## Other metadata

[Evidence-based data mining method to reveal similarities between materials based on physical mechanisms](https://mdr.nims.go.jp/datasets/e467d9bf-e291-4e48-90d8-033de57d52ee)

## Fulltext

Evidence-based data mining method to reveal similarities between materials based on physical mechanisms

Cite as: J. Appl. Phys. 133, 053904 (2023); https://doi.org/10.1063/5.0134999

Submitted: 15 November 2022 · Accepted: 17 January 2023 · Published Online: 6 February 2023

Minh-Quyet Ha,<sup>1</sup> Duong-Nguyen Nguyen,<sup>1</sup> Viet-Cuong Nguyen,<sup>2</sup> Hiori Kino,<sup>3</sup> Yasunobu Ando,<sup>4</sup> Takashi Miyake,<sup>4</sup> Thierry Denœux,<sup>5</sup> Van-Nam Huynh,<sup>1</sup> and Hieu-Chi Dam<sup>1,a)</sup>

AFFILIATIONS

1. Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
2. HPC SYSTEMS, Inc., 3-9-15 Kaigan, Minato, Tokyo 108-0022, Japan
3. Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0044, Japan
4. Research Center for Computational Design of Advanced Functional Materials, National Institute of Advanced Industrial Science and Technology, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
5. Heudiasyc–UMR CNRS 7253, Université de Technologie de Compiègne, Compiègne, France

a) Author to whom correspondence should be addressed: dam@jaist.ac.jp

Note: This paper is part of the Special Topic on: Multi-Principal Element Materials: Structure, Property, and Processing. This paper was selected as Featured.

### ABSTRACT

Measuring the similarity between materials is essential for estimating their properties and revealing the associated physical mechanisms. However, current methods for measuring the similarity between materials rely on theoretically derived descriptors and parameters fitted from experimental or computational data, which are often insufficient and biased. Furthermore, outliers and data generated by multiple mechanisms are usually included in the dataset, making the data-driven approach challenging and mathematically complicated. To overcome such issues, we apply the Dempster–Shafer theory to develop an evidential regression-based similarity measurement (eRSM) method, which can rationally transform data into evidence. It then combines such evidence to conclude the similarities between materials, considering their physical properties.
To evaluate the eRSM, we used two material datasets, 3d transition metal–4f rare-earth binary alloys and quaternary high-entropy alloys, with Curie temperature and magnetization as target properties. Based on the information obtained on the similarities between the materials, a clustering technique is applied to learn the cluster structures of the materials that facilitate the interpretation of the mechanism. The unsupervised learning experiments demonstrate that the obtained similarities can be used to detect anomalies and to appropriately identify groups of materials whose properties correlate differently with their compositions. Furthermore, introducing the similarities significantly improves the accuracy of the predictions for the Curie temperature and magnetization of the quaternary alloys, reducing the mean absolute errors by 36% and 18%, respectively. The results show that the eRSM can adequately measure the similarities and dissimilarities between materials in these datasets with respect to the mechanisms of the target properties.

© 2023 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### I. INTRODUCTION

The concept of machine learning has great potential for application in several areas of materials science, especially for discovering new materials. In materials science, a number of the problems addressed by data-driven approaches require the effective utilization of existing material data for predicting the properties of new materials and understanding the underlying physicochemical mechanisms [1].

From an engineering point of view, developing a data-driven model that quickly and accurately predicts the physical properties of possible materials from accumulated data can reduce the time required for material development. By applying a data-driven model to screen materials in silico, we narrow down the candidates that require expensive calculations and experiments to verify. If there are sufficient independent supervised data from the distribution of the target material data, a model with high prediction accuracy can be built using state-of-the-art data-driven techniques. However, because materials research and development aim to develop materials that are superior to existing ones, the distribution of the target prediction data may be completely different from the distribution of the original training data. Therefore, there are concerns about whether data-driven models can accurately predict the physical properties of new materials.

On the other hand, considering the history of materials science, researchers have discovered various materials through a loop of hypothesis and verification based on their knowledge, experience, and serendipity.
In particular, hypothesizing relies heavily on describing, interpreting, and understanding the physicochemical mechanisms underlying the observed physical phenomena of materials. From a scientific point of view, applying a data-driven approach to extract knowledge from existing complicated material data can accelerate this process of describing, interpreting, and understanding, which in turn reduces the time required for material development. Hence, to be effectively applied to materials science, data-driven approaches that are interpretable and understandable to humans must be developed.

One of the most intuitive and interpretable data-driven approaches for humans is analogy-based inductive reasoning, which infers the properties of a new instance using the information of the observed instances that are most similar to it [2–5]. By applying analogy-based models, we can easily explain the reasoning process behind the predictions and reveal the physicochemical mechanisms rationalizing the observations [6,7]. Materials scientists have resolved different problems in materials science by systematizing information about analogies in composition or structure between materials that exhibit similar physicochemical properties [8–11].

In a discipline based on fundamental principles, such as condensed matter physics, it is especially important to elucidate the physical mechanisms and to identify the materials in which each mechanism is manifested. However, although many new materials and superior properties have been discovered, it is still difficult to quantify the similarities between materials appropriately enough to elucidate the physicochemical mechanisms underlying these properties.
This difficulty arises from the fact that the mechanisms of materials' properties are typically interpreted in terms of physicochemical concepts based on relative criteria.

The phenomenon of superconductivity in materials, which originates from the instability of metals, is a well-known example of this difficulty. One of the most successful theories describing the microscopic mechanism is the Bardeen–Cooper–Schrieffer (BCS) theory of superconductivity [12], in which the origin is electron–phonon interactions. However, other mechanisms also exist. For example, one of the most plausible origins of superconductivity in the high-$T_C$ cuprates is electron–electron interactions. Nevertheless, researchers do not easily reach a consensus when classifying materials by the origin of their superconductivity. Although the emergence of superconductivity is basically due to an instability in the metallic phase, consensus is hard to achieve because the mechanisms mentioned above and others can contribute cooperatively, for example in increasing the $T_C$ value. Although it is challenging to classify individual materials for phenomena that give rise to such a situation, the underlying physical mechanisms can be expected to emerge if we inductively quantify the similarities between the materials of interest and group similar materials using all the observation data.

Incidentally, inductive reasoning with inefficient similarity assessment can lead to the misidentification of outliers [13] and to difficulty in explaining the underlying physicochemical mechanisms of datasets using single models. Therefore, for a given set of predefined material descriptors, an exhaustive examination of all possible hypotheses about the unknown physicochemical mechanisms is necessary to assess the similarity between the materials. Furthermore, similarity measures are usually context-dependent.
When the context changes, the similarity measure must be modified to adequately capture the phenomena under study [14,15]. Thus, a quantitative measure of similarity needs to consider the uncertainty arising from the context or from the measurement itself, especially when material data are insufficient and heavily biased. Moreover, similarities obtained in different contexts may not be directly comparable when they are integrated to draw conclusions about the similarity between materials. These reasons make it challenging to apply data-driven approaches to materials science.

To overcome such issues and efficiently extract knowledge from the data, we propose a new approach that shifts from measuring the similarity between materials to quantitatively measuring the confidence in their similarities. We adopt the Dempster–Shafer theory [16–18], also referred to as the evidence theory, to develop an evidential regression-based similarity measurement (eRSM) for detecting subgroups of materials such that the models learned from the subgroups show high correlations between the descriptors and the target property of the constituent materials. Further analysis of the models describing the subgroups provides valuable information for extracting, interpreting, and understanding physical mechanisms. The Dempster–Shafer theory can be regarded as a generalization of the Bayesian approach for handling incomplete and insufficient information, and it is suitable for solving material data problems [19,20]. The measure of similarity here refers to whether the observed physical properties of the materials under study are described by the same hidden mechanism, which has not yet been revealed. In other words, we consider any pair of materials in the dataset as similar if their physical properties can be described by the same hidden mechanism; otherwise, the pair of materials is considered dissimilar.
We first generate numerous hypothetical mechanisms by randomly choosing subsets of data instances and constructing regression models for each subset. Each regression model is considered a source of evidence of the similarities between materials. Thereafter, the Dempster–Shafer theory [16–18], which provides a foundation for modeling and combining the uncertainty of evidence, is applied to integrate the collected pieces of evidence and draw conclusions about the similarities between materials. The eRSM consists of three main steps:

1. Collect sources of evidence: Hypothetical mechanisms are collected from a dataset by applying regression analysis with single or mixture models and are used as sources of evidence to rationalize the similarity states of materials.
2. Model similarity evidence: An appropriate mass function is designed to model the obtained evidence within the framework of the evidence theory.
3. Combine pieces of evidence: Dempster's rule of combination is used to integrate the pieces of evidence.

The steps of the eRSM are explained in detail in Sec. II. Within the framework of the evidence theory, the essential contributions of the eRSM are collecting sources of evidence about the similarities between materials from datasets and designing suitable mass functions to model the pieces of evidence rationally. The effectiveness of the similarities obtained using the eRSM for subdividing alloys from datasets into homogeneous subgroups is supported by experiments on (1) a dataset of binary alloys with their Curie temperature as the target property (Sec. III B) and (2) two datasets of quaternary alloys with their magnetization (Sec. III C) and Curie temperature (Sec. III D) as the target properties.
Further analysis of the detected subgroups to interpret the underlying physical mechanisms is shown in Sec. III E.

### II. METHODOLOGY

We consider a dataset $D$ consisting of $p$ data instances. We assume that a data instance with index $i$ in $D$ is described by $n$ predefined descriptors and is represented by an $n$-dimensional numerical vector, $x_i = (x_i^1, x_i^2, \ldots, x_i^n) \in \mathbb{R}^n$. The target property of the data instance $x_i$ is $y_i \in \mathbb{R}$. The dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$ is then represented using a $p \times (n+1)$ matrix. In this study, we consider that $D$ may contain pairs of data instances $x_i$ and $x_j$ for which $x_i \approx x_j$ while the value of $y_i$ is far from $y_j$.

#### A. Collecting sources of similarity evidence

We perform random subset sampling of the data instances without replacement to collect a large amount of evidence of the similarity between pairs of data instances in $D$. For each sample, we obtain two datasets: the reference dataset, $D_{\mathrm{ref}}$, and the evaluation dataset, $D_{\mathrm{eval}}$ ($D_{\mathrm{ref}} \cap D_{\mathrm{eval}} = \emptyset$ and $D_{\mathrm{ref}} \cup D_{\mathrm{eval}} = D$). From $D_{\mathrm{ref}}$, we can generate a single reference function or multiple reference functions $f_r : \mathbb{R}^n \to \mathbb{R}$ using a Gaussian process (GP) [21] or a mixture of Gaussian processes (MGP) [22], respectively. This study applies GP- or MGP-based models instead of other nonlinear regression models, such as kernel ridge regression [23], random forest regression [24], or artificial neural networks [25], because a GP or MGP can quantify the uncertainty of its prediction without introducing any other statistical validation. The sampling ratios of $D_{\mathrm{ref}}$ from $D$ are fixed at 0.3 and 0.7 for the experiments with GP and MGP, respectively. Each reference function $f_r$ is considered a source that provides pieces of evidence for the similarity between $(x_i, y_i)$ and $(x_j, y_j)$ in $D_{\mathrm{eval}}$. The function $f_r$ is not used to provide any information about the similarities between the data instances in $D_{\mathrm{ref}}$ or between a data instance in $D_{\mathrm{ref}}$ and a data instance in $D_{\mathrm{eval}}$.
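
The sampling step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the helper names `split_reference_evaluation` and `evaluation_pairs`, the toy size of 10 instances, and the fixed seed are assumptions made here, and fitting the GP or MGP on the reference subset is deliberately left out.

```python
import random
from itertools import combinations


def split_reference_evaluation(n_instances, ref_ratio, rng):
    """Randomly split instance indices into disjoint D_ref and D_eval.

    Hypothetical helper: ref_ratio plays the role of the sampling ratio
    (0.3 for GP and 0.7 for MGP in the text).
    """
    indices = list(range(n_instances))
    n_ref = max(1, round(ref_ratio * n_instances))
    ref = sorted(rng.sample(indices, n_ref))  # sampling without replacement
    ref_set = set(ref)
    evaluation = [i for i in indices if i not in ref_set]
    return ref, evaluation


def evaluation_pairs(eval_indices):
    """Pairs (i, j) that a reference function fitted on D_ref may judge."""
    return list(combinations(eval_indices, 2))


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
d_ref, d_eval = split_reference_evaluation(10, 0.3, rng)
pairs = evaluation_pairs(d_eval)
print(len(d_ref), len(d_eval), len(pairs))
```

With a ratio of 0.3, three of the ten indices go to the reference set and the remaining seven yield 21 candidate pairs. Only pairs drawn entirely from `d_eval` are enumerated, so a reference function fitted on `d_ref` is never asked about the instances it was trained on.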
This restriction excludes self-evaluation and ensures the objectivity of the evidence. With respect to a reference function $f_r$, we consider the state of the similarity between $(x_i, y_i)$ and $(x_j, y_j)$ as

- Similar: Both data instances can be considered to have been generated by the function $f_r$ [Fig. 1(a)].
- Dissimilar: Only one of the data instances can be considered to have been generated by the function $f_r$ [Fig. 1(b)].
- Uncertain: Neither of the data instances can be considered to have been generated by the function $f_r$ [Fig. 1(c)]. The uncertain state indicates that $f_r$ does not provide any information about the similarity between $(x_i, y_i)$ and $(x_j, y_j)$.

FIG. 1. Illustrative figures of the three possible similarity states between two data instances (blue circles), including similar (a), dissimilar (b), and uncertain (c), considering a referential regression model $f_r$ (black line). The gray region is the interval that determines whether a data instance can be considered to have been generated by regression model $f_r$.

To quantitatively evaluate whether $(x_i, y_i)$ can be considered to have been generated by the regression function $f_r$, we use the likelihood $p(O_i \mid f_r)$, the probability of the event $O_i$ that a data instance $(x_i, y_i)$ is observed, given $f_r$. The likelihood $p(O_i \mid f_r)$ is modeled using a normal distribution whose mean and standard deviation depend, respectively, on the predicted target value $\hat{y}_i = f_r(x_i)$ and the corresponding predictive standard error $\sigma_{x_i}$ given by $f_r$. This is expressed as

$$
p(O_i \mid f_r) =
\begin{cases}
1 & \text{if } \Delta_i \le 3\bar{\sigma}, \\
2 \times \int_{\Delta_i - 3\bar{\sigma}}^{+\infty} \mathcal{N}\left(u \mid 0, \alpha\,\sigma_{x_i}\right) du & \text{otherwise},
\end{cases}
\tag{1}
$$

where $\Delta_i = |y_i - \hat{y}_i| = |y_i - f_r(x_i)|$ is the deviation between the true and predicted target values of data instance $i$ using $f_r$, and $\bar{\sigma}$ is the average of the predictive standard errors of all the data instances in $D_{\mathrm{ref}}$.
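
Equation (1) has a closed form, because twice the upper tail of a zero-mean normal distribution is the complementary error function. The sketch below assumes, as the text states, that $\alpha\,\sigma_{x_i}$ is the standard deviation of that distribution. The function name `likelihood` and its argument names are hypothetical, and the default `alpha=2.0` is the value the authors report using in their experiments.

```python
import math


def likelihood(y_true, y_pred, sigma_x, sigma_bar, alpha=2.0):
    """Sketch of Eq. (1): p(O_i | f_r) for one evaluation instance.

    y_pred and sigma_x stand for the prediction and predictive standard
    error of a reference function; sigma_bar is the average predictive
    standard error over the reference set.
    """
    delta = abs(y_true - y_pred)
    if delta <= 3.0 * sigma_bar:
        return 1.0
    # 2 * integral from (delta - 3*sigma_bar) to +inf of N(u | 0, alpha*sigma_x)
    # equals erfc(z / sqrt(2)) with z = (delta - 3*sigma_bar) / (alpha*sigma_x).
    z = (delta - 3.0 * sigma_bar) / (alpha * sigma_x)
    return math.erfc(z / math.sqrt(2.0))


close = likelihood(100.0, 101.0, sigma_x=1.0, sigma_bar=1.0)  # inside 3*sigma_bar
far = likelihood(100.0, 105.0, sigma_x=1.0, sigma_bar=1.0)    # one scaled std beyond
print(close, round(far, 4))
```

The two branches meet continuously at $\Delta_i = 3\bar{\sigma}$, where the integral covers exactly half of the distribution and the factor of 2 restores a value of 1.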
$\alpha$ is the hyperparameter used to adjust the condition that restricts the data instances belonging to the function $f_r$. In other words, the interval that determines the probability that a data instance $(x_i, y_i)$ belongs to $f_r$ is $\alpha\,\sigma_{x_i}$, and if the data instance falls outside this interval, it is determined that it does not belong to $f_r$. By increasing or decreasing the value of the parameter $\alpha$, the condition for determining whether a data instance $(x_i, y_i)$ belongs to $f_r$ is relaxed or tightened, making $p(O_i \mid f_r)$ larger or smaller, respectively. Optimal values of $\alpha$ can be chosen using statistical criteria and appropriate validation methods; however, we set $\alpha = 2$ for all experiments in this work to reduce model complexity. We consider $p(O_i \mid f_r)$ as the probability that $(x_i, y_i)$ is generated by $f_r$, and $p(\bar{O}_i \mid f_r) = 1 - p(O_i \mid f_r)$ is the probability that $(x_i, y_i)$ is not generated by $f_r$. Figure 1 in the supplementary material illustrates the process of modeling the probability $p(O_i \mid f_r)$.

The events that $(x_i, y_i)$ or $(x_j, y_j)$ is generated by the function $f_r$ are independent. Therefore, given the function $f_r$, we can evaluate the joint probabilities of observing

- both data instances:

$$
p(O_i, O_j \mid f_r) = p(O_i \mid f_r) \times p(O_j \mid f_r); \tag{2}
$$

- only one of the data instances:

$$
p(O_i, \bar{O}_j \mid f_r) + p(\bar{O}_i, O_j \mid f_r) = p(O_i \mid f_r) \times p(\bar{O}_j \mid f_r) + p(\bar{O}_i \mid f_r) \times p(O_j \mid f_r); \tag{3}
$$

- neither of the data instances:

$$
p(\bar{O}_i, \bar{O}_j \mid f_r) = p(\bar{O}_i \mid f_r) \times p(\bar{O}_j \mid f_r) = 1 - p(O_i, O_j \mid f_r) - p(O_i, \bar{O}_j \mid f_r) - p(\bar{O}_i, O_j \mid f_r). \tag{4}
$$

#### B. Modeling evidence by mass functions

Within the Dempster–Shafer theory framework [16], we begin by defining the frame of discernment $\Omega$. Let $\Omega = \{s, ds\}$ be the universal set representing the similarity states of any two data instances $(x_i, y_i)$ and $(x_j, y_j)$.
$s$ and $ds$ denote the similarity and dissimilarity states between the two data instances, respectively. According to the Dempster–Shafer theory, the evidence of the similarity states between these two data instances is represented by a mass function $m^{i,j}$ (or a basic probability assignment) [16]. This assigns probability masses to all the nonempty subsets of $\Omega$ ($X = \{\{s\}, \{ds\}, \{s, ds\}\}$). It is defined as follows:

$$
m^{i,j} : X \to [0, 1] \quad \text{with} \quad \sum_{E \in X} m^{i,j}(E) = 1. \tag{5}
$$

The masses assigned to $\{s\}$ and $\{ds\}$ reflect the degrees of belief exactly committed by the evidence to supporting the similarity and dissimilarity between $(x_i, y_i)$ and $(x_j, y_j)$, respectively. The mass assigned to $\{s, ds\}$ expresses the degree of belief that the evidence provides no information about the similarity (or dissimilarity) between $(x_i, y_i)$ and $(x_j, y_j)$.

Therefore, the mass function $m^{i,j}_{f_r}$, which models a piece of evidence of the similarity between $(x_i, y_i)$ and $(x_j, y_j)$ collected from $f_r$, is defined as follows:

$$
m^{i,j}_{f_r}(\{s\}) = \frac{p(O_i, O_j \mid f_r)}{\gamma_{i,j}}, \tag{6}
$$

$$
m^{i,j}_{f_r}(\{ds\}) = \frac{p(O_i, \bar{O}_j \mid f_r) + p(\bar{O}_i, O_j \mid f_r)}{\gamma_{i,j}}, \tag{7}
$$

$$
m^{i,j}_{f_r}(\{s, ds\}) = 1 - \frac{1}{\gamma_{i,j}} + \frac{p(\bar{O}_i, \bar{O}_j \mid f_r)}{\gamma_{i,j}}, \tag{8}
$$

where $\gamma_{i,j} = \left(e^{\bar{\sigma}/\Delta_y} + 1\right) \times \left(\frac{\sigma_{x_i}}{\bar{\sigma}} + 1\right) \times \left(\frac{\sigma_{x_j}}{\bar{\sigma}} + 1\right)$ is a discounting factor [16,26], which describes the unreliability of the evidence about the similarity between $(x_i, y_i)$ and $(x_j, y_j)$ collected from a source of evidence $f_r$. $\Delta_y$ is the variation range of the target variable $y$ in the dataset $D$. The smaller $\bar{\sigma}$ is relative to $\Delta_y$, the more reliable the learned regression function $f_r$. Also, when $\sigma_{x_i}$ and $\sigma_{x_j}$ are smaller than $\bar{\sigma}$, $f_r$ can provide reliable evidence for the relationship between $(x_i, y_i)$ and $(x_j, y_j)$. By contrast, when $\sigma_{x_i}$ and $\sigma_{x_j}$ are large compared to $\bar{\sigma}$, $f_r$ cannot provide reliable evidence for the relationship between $(x_i, y_i)$ and $(x_j, y_j)$. A detailed explanation of each component in $\gamma_{i,j}$ is provided in Sec. I of the supplementary material.

#### C. Dempster's rule in combining evidence

Assume that we collect $q$ pieces of evidence from $\mathcal{F}_r = \{f_r^1, \ldots, f_r^q\}$, a set of $q$ reference functions generated from $D$, to evaluate the similarity between a pair of data instances with indices $i$ and $j$. According to the Dempster–Shafer theory framework, any two pieces of evidence collected from the reference functions $f_r^l$ and $f_r^k$, which are modeled by the corresponding mass functions $m^{i,j}_{f_r^l}$ and $m^{i,j}_{f_r^k}$, respectively, can be combined using Dempster's rule of combination to assign the joint mass $m^{i,j}_{\{f_r^l, f_r^k\}}$ to each nonempty subset $E$ of $\Omega$ as follows:

$$
m^{i,j}_{\{f_r^l, f_r^k\}}(E) = \left(m^{i,j}_{f_r^l} \oplus m^{i,j}_{f_r^k}\right)(E) = \frac{\sum_{E_t \cap E_v = E} m^{i,j}_{f_r^l}(E_t) \times m^{i,j}_{f_r^k}(E_v)}{1 - \sum_{E_t \cap E_v = \emptyset} m^{i,j}_{f_r^l}(E_t) \times m^{i,j}_{f_r^k}(E_v)}, \tag{9}
$$

where $E$, $E_t$, and $E_v$ are nonempty subsets of $\Omega$. Dempster's rule is commutative and associative.

Based on Dempster's rule, the obtained mass functions corresponding to the $q$ pieces of evidence are combined to assign the final mass $m^{i,j}_{\mathcal{F}_r}$ as follows:

$$
m^{i,j}_{\mathcal{F}_r}(E) = \left(m^{i,j}_{f_r^1} \oplus m^{i,j}_{f_r^2} \oplus \cdots \oplus m^{i,j}_{f_r^q}\right)(E). \tag{10}
$$

We perform similar analyses for all pairs of data instances in $D$ to construct a symmetric matrix $M$ comprising the similarities ($M[i, j] = M[j, i] = m^{i,j}_{\mathcal{F}_r}(\{s\})$) between them. Thereafter, the obtained matrix is applied for further unsupervised data mining analysis, such as clustering or data visualization.

### III. EXPERIMENTS AND RESULTS

In this section, we perform three experiments to demonstrate the application of our similarity measurement in dealing with outliers and data generated by multiple mechanisms when designing material descriptors.
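
Before turning to the individual experiments, the evidence-modeling and combination steps of Sec. II [Eqs. (2)–(10)] can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the dictionary keys, and the example likelihoods are hypothetical, and the discounting factor $\gamma_{i,j}$ is passed in as a plain number rather than recomputed from predictive standard errors.

```python
from functools import reduce

# Keys for the three nonempty subsets of the frame {s, ds}.
S, DS, UNKNOWN = "s", "ds", "s_or_ds"


def mass_from_likelihoods(p_i, p_j, gamma):
    """Eqs. (2)-(8): discounted mass function from one reference function.

    p_i and p_j are the likelihoods p(O_i | f_r) and p(O_j | f_r);
    gamma >= 1 is the discounting factor (supplied by the caller here).
    """
    both = p_i * p_j                               # Eq. (2)
    one = p_i * (1.0 - p_j) + (1.0 - p_i) * p_j    # Eq. (3)
    neither = (1.0 - p_i) * (1.0 - p_j)            # Eq. (4)
    return {
        S: both / gamma,                              # Eq. (6)
        DS: one / gamma,                              # Eq. (7)
        UNKNOWN: 1.0 - 1.0 / gamma + neither / gamma, # Eq. (8)
    }


def dempster_combine(m1, m2):
    """Eq. (9) written out for the two-element frame {s, ds}."""
    conflict = m1[S] * m2[DS] + m1[DS] * m2[S]  # pairs with empty intersection
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    return {
        S: (m1[S] * m2[S] + m1[S] * m2[UNKNOWN] + m1[UNKNOWN] * m2[S]) / norm,
        DS: (m1[DS] * m2[DS] + m1[DS] * m2[UNKNOWN] + m1[UNKNOWN] * m2[DS]) / norm,
        UNKNOWN: m1[UNKNOWN] * m2[UNKNOWN] / norm,
    }


def combine_all(masses):
    """Eq. (10): fold Dempster's rule over all pieces of evidence."""
    return reduce(dempster_combine, masses)


evidence = [
    mass_from_likelihoods(0.9, 0.8, gamma=2.0),
    mass_from_likelihoods(1.0, 0.7, gamma=2.5),
    mass_from_likelihoods(0.2, 0.1, gamma=4.0),
]
final = combine_all(evidence)
similarity = final[S]  # the entry M[i, j] for this pair
print(final)
```

Because the rule is commutative and associative, the order in which `combine_all` folds the evidence does not matter. Two agreeing but individually weak pieces of evidence reinforce each other: combining two mass functions that each assign 0.5 to $\{s\}$ and 0.5 to $\{s, ds\}$ yields 0.75 for $\{s\}$.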
We apply the eRSM to measure similarities between magnetic alloys in three datasets for detecting subgroups of materials: (1) the experimentally observed Curie temperature dataset ($D_{\mathrm{binary}}$) of binary alloys of transition metals and rare-earth metals, (2) the dataset of calculated magnetization of quaternary high-entropy alloys ($D^{\mathrm{Mag}}_{\mathrm{quaternary}}$), and (3) the dataset of calculated Curie temperature of quaternary high-entropy alloys ($D^{T_C}_{\mathrm{quaternary}}$). Note that the datasets $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ and $D^{T_C}_{\mathrm{quaternary}}$ contain similar alloys and differ only in the target properties.

#### A. Datasets

The details of the datasets investigated in this study are as follows.

- Binary alloys dataset $D_{\mathrm{binary}}$ [27]: A material dataset containing 100 binary alloys of transition metals and rare-earth metals, comprising nickel (Ni), manganese (Mn), cobalt (Co), or iron (Fe), and the corresponding Curie temperatures ($T_C$). This dataset was collected from the Atomwork database of the National Institute for Materials Science [28,29]. Each binary alloy in $D_{\mathrm{binary}}$ is represented using seven descriptors: (1) and (2) the atomic numbers of the transition metal ($Z_T$) and rare-earth ($Z_R$) constituents, (3) the projection of the spin magnetic moment onto the total angular momentum of the 4f electrons, $J_{4f}(1 - g_j)$, (4) and (5) the covalent radius ($r_{\mathrm{cov}T}$) and first ionization potential ($IP_T$) of the transition metal, and (6) and (7) the concentrations of the transition metal ($C_T$) and rare-earth metal ($C_R$). The selection of these seven descriptors has been discussed in detail in previous studies [10,30].
- Quaternary high-entropy alloys datasets $D_{\mathrm{quaternary}}$ [27]: A material dataset containing 990 equiatomic quaternary high-entropy alloys, which comprise 14 transition metals (Ag, Cd, Co, Cr, Cu, Fe, Mn, Mo, Ni, Pd, Rh, Ru, Tc, and Zn), and the corresponding calculated magnetizations and Curie temperatures in the BCC phase.
  The dataset was collected from an original dataset of 147 630 equiatomic quaternary high-entropy alloys calculated using the Korringa–Kohn–Rostoker coherent potential approximation method [31]. Each alloy in $D_{\mathrm{quaternary}}$ is represented using 135 compositional descriptors, including the means, standard deviations, and covariance of the atomic representations of their constituent elements [13] and four categorical features indicating the elements comprising the quaternary alloy. The feature selection process applied to this dataset is discussed in detail in Sec. III of the supplementary material.

#### B. Assessment of the similarity between transition rare-earth metal binary alloys based on mechanisms of Curie temperature

FIG. 2. (a) Observed and predicted Curie temperature of alloys in the dataset $D_{\mathrm{binary}}$ using the model generated for nickel (Ni)-, iron (Fe)-, and manganese (Mn)-based alloys. The blue and gray points indicate cobalt (Co)-based alloys and alloys of other transition metals (Ni, Fe, Mn), respectively. (b) Prediction error of Co-based alloys when excluding (top) or including (bottom) data of other Co-based alloys in the training dataset.

In the first experiment, we show the versatility of the eRSM for detecting outliers and identifying a mixture of mechanisms. We apply the eRSM to assess the similarities between 100 binary alloys of transition metals and rare-earth metals comprising nickel (Ni), manganese (Mn), cobalt (Co), or iron (Fe) in the dataset $D_{\mathrm{binary}}$ based on their Curie temperatures. We can construct a regression model using a Gaussian process by considering the data instances in $D_{\mathrm{binary}}$.
This model shows a high prediction accuracy with an $R^2$ score of 0.963 and a mean absolute error (MAE) of 40 K in tenfold cross-validation. However, such a nonparametric regression model does not guarantee the reliability of the model in subsequent exploratory predictions, because the number of observable alloys is relatively small compared to the number of possible alloys.

Figure 2(a) shows the results of the exploratory prediction of the Curie temperature of the Co-based binary alloys in $D_{\mathrm{binary}}$ using a Gaussian process regression model constructed from the data of binary alloys of Ni, Mn, and Fe. This regression model shows a high prediction accuracy in tenfold cross-validation ($R^2 = 0.946$ and MAE = 35 K). However, the model tends to underestimate the Co-based alloys with high Curie temperatures and often overestimates the other Co-based alloys. The prediction error for the Co-based alloys is markedly reduced when some data of the other Co-based alloys are included [Fig. 2(b)]. This observation supports the hypothesis that the underlying mechanisms differ between the Co-based alloys and the alloys of other transition metals, which motivates the use of the eRSM to clarify the mixture of mechanisms in this dataset.

By applying the eRSM to the dataset $D_{\mathrm{binary}}$, we obtain a similarity matrix $M_{\mathrm{binary}}$ with moderately high similarity values among the data instances [Fig. 3(a)]. Thus, nearly all the data instances can be regressed by a relatively smooth function. This is consistent with the high prediction accuracy of tenfold cross-validation for all the alloys in the dataset. In the exploratory data analysis, to avoid false intuition or misunderstanding, the alloys in $D_{\mathrm{binary}}$ are grouped such that the similarities between the alloys in each group are high. Moreover, one alloy can belong to more than one group simultaneously, or to none of the groups. We apply a graph-based clustering method [32] to the extracted similarity matrix to detect overlapping subgroups of materials. As a result, we observe four groups of alloys, denoted as $G_1$, $G_2$, $G_3$, and $G_4$, which show high intragroup similarities, exceeding 0.7 [Fig. 3(a)]. Nevertheless, the alloys in group $G_1$ are markedly dissimilar to those in $G_2$, $G_3$, and $G_4$. In addition, a small group of alloys [Fig. 3(a), gray region] is dissimilar to almost all the others and can be considered outliers. The remaining alloys are left unassigned to any group so that the clustering results remain reliable.

FIG. 3. (a) Heatmap illustrating the similarity matrix $M_{\mathrm{binary}}$ extracted for all the data instances in $D_{\mathrm{binary}}$. (b) Confusion matrices measuring the regression-based similarities between alloys in the four groups $G_1$–$G_4$ and the dissimilarities between the models generated for alloys in different groups.

FIG. 4. Dependence of $T_C$ on the concentration of the transition metal ($C_T$) in alloys. Red, blue, green, and yellow markers indicate alloys containing cobalt (Co), iron (Fe), manganese (Mn), and nickel (Ni), respectively. Alloys in $G_1$ are highlighted by triangles.

To evaluate the validity of the analysis process quantitatively, we trained regression models for $T_C$ using data from each of the four groups $G_1$, $G_2$, $G_3$, and $G_4$ and monitored their prediction accuracy on these groups. The confusion matrix summarizing the correlation between the observed $T_C$ and the $T_C$ predicted by the four learned regression models is shown in Fig. 3(b). The diagonal plots illustrate the cross-validation results of the models learned from the four groups of alloys.
The off-diagonal plots show the correlation between the observed $T_C$ and the predictions made by the models learned from the alloys of the other groups. The obtained results confirm the intragroup similarity of the alloys in each of the groups $G_1$, $G_2$, $G_3$, and $G_4$, the dissimilarity between the groups, and the intragroup dissimilarity of the alloys considered outliers. These results suggest that the physical mechanisms of the alloys in $G_1$ may differ from those of the alloys in $G_2$, $G_3$, and $G_4$. Nonetheless, it is difficult to determine the differences between the mechanisms of the $T_C$ of alloys in $G_2$, $G_3$, and $G_4$.

FIG. 5. (a) and (d) Heatmaps illustrating the similarity matrices $M^{\mathrm{Mag}}_{\mathrm{quaternary}}$ (a) and $M^{T_C}_{\mathrm{quaternary}}$ (d) extracted from datasets $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ and $D^{T_C}_{\mathrm{quaternary}}$, focusing on mechanisms of magnetization and $T_C$, respectively. (b) and (e) Confusion matrices summarizing the differences between the magnetization (b) or $T_C$ (e) mechanisms of alloys in the extracted groups. (c) and (f) Visualization of quaternary alloys in the two-dimensional embedding spaces constructed by applying t-distributed stochastic neighbor embedding (t-SNE) to $M^{\mathrm{Mag}}_{\mathrm{quaternary}}$ (c) and $M^{T_C}_{\mathrm{quaternary}}$ (f). Red, blue, and gray contours indicate the Gaussian models $\hat{G}^{\mathrm{Mag}}_1$ ($\hat{G}^{T_C}_1$), $\hat{G}^{\mathrm{Mag}}_2$ ($\hat{G}^{T_C}_2$), and $\hat{G}^{\mathrm{Mag}}_3$ ($\hat{G}^{T_C}_3$), respectively, learned by using Gaussian mixture models (Ref. 33) in the embedding space focusing on mechanisms of magnetization ($T_C$). In addition, red and blue points in subfigures (b) and (c) [(e) and (f)] indicate the alloys in $G^{\mathrm{Mag}}_1$ ($G^{T_C}_1$) and $G^{\mathrm{Mag}}_2$ ($G^{T_C}_2$), respectively.

Moreover, considering the alloys in $G_1$, there is a strong linear correlation between $T_C$ and the concentration of transition metals in the alloys, with a Pearson correlation coefficient of 0.95 (Fig. 4, triangular markers). This result is consistent with previous research (Ref. 30): when considering all binary alloys of transition metals and rare-earth metals in $D_{\mathrm{binary}}$, the range of $T_C$ was found to be correlated with the composition ratio of the transition metals. Furthermore, 13 of the 17 alloys in $G_1$ are Co-based alloys with high Curie temperatures ($T_C > 600$ K). By contrast, most of the other Co-based alloys in $D_{\mathrm{binary}}$ have lower Curie temperatures ($T_C < 500$ K) and are assigned to $G_2$, $G_3$, and $G_4$. These results are consistent with the observation that the regression model for Fe-, Mn-, and Ni-based alloys tends to underestimate the $T_C$ of the Co-based alloys with high $T_C$ and to overestimate the $T_C$ of the remaining Co-based alloys [Fig. 2(a)].

In addition, we examine the behavior of the eRSM on toy datasets synthesized with outliers or multiple mechanisms to assess the efficiency of this similarity measure. Detailed results of these experiments are summarized in Sec. II of the supplementary material. Briefly, the eRSM can effectively assess the similarity between the data instances and use the similarity to detect outliers and mixtures of mechanisms.

FIG. 6. Prediction accuracies for magnetization (a) and (b) and Curie temperature (c) and (d) of the alloys with tenfold cross-validation. Prediction validation results with single Gaussian process regression models for magnetization and Curie temperature are shown in subfigures (a) and (c), respectively. Prediction validation results with mixture-of-experts models for magnetization and Curie temperature are shown in subfigures (b) and (d), respectively. Blue and white circles indicate magnetic alloys (finite magnetization) and non-magnetic alloys (zero magnetization), respectively.

C. Assessment of the similarity between quaternary high-entropy alloys based on mechanisms of magnetization

The effectiveness of the eRSM in detecting outliers and identifying mixtures of mechanisms in a material dataset has been shown in the previous experiment. In the next two experiments, we show the potential of applying the measured similarity to design descriptors for materials.

In this experiment, we apply the eRSM to assess the similarities between 990 quaternary high-entropy alloys comprising 14 transition metals in the dataset $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ based on their magnetization. To predict the magnetization of these alloys, we attempted to construct an optimal Gaussian process regression model using the designed descriptors. The Gaussian process regresses the magnetization poorly, with an $R^2$ score of 0.75 and an MAE of 0.13 T in tenfold cross-validation. The obtained results suggest that the magnetization of these alloys may not be described by a single model in the designed descriptor space. The existence of outliers or of a mixture of models for the magnetization of these alloys in the descriptor space should therefore be considered in the analysis of this dataset.

Applying the eRSM, we obtain a similarity matrix $M^{\mathrm{Mag}}_{\mathrm{quaternary}}$ with two core groups of alloys, denoted by $G^{\mathrm{Mag}}_1$ and $G^{\mathrm{Mag}}_2$, showing high intragroup similarities exceeding 0.5 [Fig. 5(a)]. Some of the alloys in $G^{\mathrm{Mag}}_1$ are similar to those in $G^{\mathrm{Mag}}_2$; nonetheless, the rest show apparent dissimilarities. Furthermore, one small group of alloys [Fig. 5(a), yellow region] shows dissimilarities with the others and can be considered outliers.
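
The way such a similarity matrix is read (groups with high intragroup similarity, plus instances that are dissimilar to everything else) can be illustrated with a toy example. This is a deliberately simplified sketch and not the graph-based clustering method used in this work: the 5 × 5 matrix is invented, the group memberships are given by hand, and the 0.5 threshold only mirrors the value quoted for the intragroup similarities. The function names are hypothetical.

```python
def mean_intragroup_similarity(sim, group):
    """Average similarity over all distinct pairs of members of one group."""
    pairs = [(i, j) for i in group for j in group if i < j]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def flag_outliers(sim, threshold=0.5):
    """Indices whose similarity to every other instance is below the threshold."""
    n = len(sim)
    return [
        i
        for i in range(n)
        if all(sim[i][j] < threshold for j in range(n) if j != i)
    ]


# Invented symmetric similarity matrix: instances 0-1 and 2-3 form two groups,
# instance 4 is not similar to anything.
toy_similarity = [
    [1.0, 0.8, 0.2, 0.1, 0.1],
    [0.8, 1.0, 0.3, 0.2, 0.2],
    [0.2, 0.3, 1.0, 0.7, 0.1],
    [0.1, 0.2, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.1, 0.3, 1.0],
]
group_scores = {
    "first": mean_intragroup_similarity(toy_similarity, [0, 1]),
    "second": mean_intragroup_similarity(toy_similarity, [2, 3]),
}
outliers = flag_outliers(toy_similarity)
print(group_scores, outliers)
```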
The remaining alloys in $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ do not exhibit apparent similarities with the alloys in groups $G^{\mathrm{Mag}}_1$ and $G^{\mathrm{Mag}}_2$; therefore, they are not assigned to any group.

To validate the obtained results quantitatively, we trained three regression models using data from each of the groups $G^{\mathrm{Mag}}_1$, $G^{\mathrm{Mag}}_2$, and the outliers. We monitored the prediction accuracy of the three learned regression models for data in all the groups. The confusion matrix summarizing the correlations between the observed and predicted values of the target variable using the learned regression models is shown in Fig. 5(b). The diagonal plots illustrate the tenfold cross-validation results of the models learned from these three groups of alloys. The off-diagonal plots show the correlation between the observed magnetization and the predictions made by the models learned from the alloys of the other groups.

The obtained results confirm the intragroup similarity of the alloys in each of the groups $G^{\mathrm{Mag}}_1$ and $G^{\mathrm{Mag}}_2$, the dissimilarity between the two groups, and the intragroup dissimilarity of the alloys considered outliers. Specifically, we observe that group $G^{\mathrm{Mag}}_2$ consists of ferrimagnetic alloys or alloys whose magnetization is relatively smaller (magnetization $< 0.1$ T) than that of the alloys in group $G^{\mathrm{Mag}}_1$. In contrast, using the data in $G^{\mathrm{Mag}}_1$, we can construct a Gaussian process regression model with a high prediction accuracy, with an $R^2$ score of 0.992 and an MAE of 0.016 T in tenfold cross-validation.

Therefore, we can use the information on the constituent elements of each alloy to predict in advance which group it belongs to (Ref. 20) and apply an appropriate regression model to improve the prediction accuracy for the alloys. We combine the similarity measured by the eRSM with the Jaccard similarity coefficient (Ref. 34) and apply t-distributed stochastic neighbor embedding (t-SNE) (Ref. 35) to construct a two-dimensional embedding map [Fig. 5(c)]. Details of the combination method are given in Sec. IV of the supplementary material.
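
For reference, the Jaccard similarity coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below assumes that the coefficient is taken over the sets of constituent elements of two alloys, which is one natural reading of the text; the compositions are invented, the function name is hypothetical, and the way this value is combined with the eRSM similarity (described in the supplementary material) is not reproduced here.

```python
def jaccard_similarity(elements_a, elements_b):
    """|A ∩ B| / |A ∪ B| for two collections of element symbols."""
    set_a, set_b = set(elements_a), set(elements_b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Invented quaternary compositions sharing two of their four elements.
alloy_1 = ("Fe", "Co", "Ni", "Mn")
alloy_2 = ("Fe", "Co", "Cr", "Cu")
shared_two = jaccard_similarity(alloy_1, alloy_2)   # 2 shared / 6 distinct
identical = jaccard_similarity(alloy_1, alloy_1)
print(shared_two, identical)
```

Two quaternary alloys can share zero to four elements, so under this assumption the coefficient takes only the values 0, 1/7, 1/3, 3/5, and 1.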
As a result, the alloys in groups $G^{\mathrm{Mag}}_1$ (red) and $G^{\mathrm{Mag}}_2$ (blue) are easily distinguished because they form two separate high-density regions in the embedding space. We apply a Gaussian mixture model (GMM) (Ref. 33) to the embedding space to identify groups and calculate the probability of an alloy belonging to a particular identified group. Alloys in different groups are treated differently by using a mixture-of-experts (MoE) approach (Ref. 36). Figures 6(a) and 6(b) show that the proposed mixture of experts reduces the MAE by 18% compared with the single model, from 0.13 T to 0.11 T. Further analysis shows that applying the obtained similarities in the MoE improves the prediction accuracy for magnetic alloys [Fig. 7(a) in the supplementary material].

FIG. 7. Proportions of quaternary alloys containing Fe or Co in groups $G^{\mathrm{Mag}}_1$ (a) and $G^{T_C}_1$ (b).

D. Assessment of the similarity between the quaternary high-entropy alloys based on mechanisms of Curie temperature

In this experiment, the target data are the same as in Sec. III C ($D_{\mathrm{quaternary}}$); however, the physical property of interest is $T_C$. A regression model can be constructed using a Gaussian process; it shows a rather high prediction accuracy in tenfold cross-validation, with an $R^2$ score of 0.85 and an MAE of 67 K. We also observe two distinguishable groups of quaternary alloys in the dataset $D^{T_C}_{\mathrm{quaternary}}$ when applying the eRSM. Figure 5(d) illustrates the similarity matrix $M^{T_C}_{\mathrm{quaternary}}$ with two groups of alloys, denoted as $G^{T_C}_1$ and $G^{T_C}_2$, showing high intragroup similarities exceeding 0.5. Some of the alloys in $G^{T_C}_1$ are similar to those in $G^{T_C}_2$.
Nonetheless, the others exhibit apparent dissimilarities, which is consistent with the observation of two high-density regions (red) in the embedding map of $M^{T_C}_{\mathrm{quaternary}}$ [Fig. 5(f)]. Furthermore, a small group of alloys [Fig. 5(d), yellow region] shows dissimilarities with all the others and can be considered outliers. The remaining alloys do not show apparent similarities with the alloys in groups $G^{T_C}_1$ and $G^{T_C}_2$; thus, they are not assigned to any group.

Following the same analysis procedure as in Sec. III C, we trained regression models for the Curie temperature using data from each of the three groups $G^{T_C}_1$, $G^{T_C}_2$, and the outliers and monitored their prediction accuracy on these groups. Figure 5(e) shows the confusion matrix that summarizes the obtained results. The diagonal plots illustrate the tenfold cross-validation results of the models learned from these three groups of alloys. The off-diagonal plots show the correlation between the observed Curie temperature and the predictions made by the regression models learned from the alloys of the other groups. We can also confirm the intragroup similarity of the alloys in each of the groups $G^{T_C}_1$ and $G^{T_C}_2$, the dissimilarity between the two groups, and the intragroup dissimilarity of the alloys considered outliers.

Specifically, we observe that nearly all the alloys in group $G^{T_C}_2$ have a low $T_C$, which is either 0 K or relatively smaller than that of the other alloys. Furthermore, using the data in $G^{T_C}_1$, we can construct a Gaussian process regression model with a high prediction accuracy, with an $R^2$ score of 0.985 and an MAE of 19 K in tenfold cross-validation.

FIG. 8. Effect of coexistence of the 14 transition metals on magnetization and Curie temperature mechanisms. Each pie chart results from the quaternary alloys containing the respective element pair. The charts show the percentages of alloys that follow the magnetization mechanisms (lower-left triangle) and Curie temperature mechanisms (upper-right triangle), as extracted by the eRSM. Red and blue areas indicate the percentages of alloys whose magnetization and $T_C$ are finite ($G^{\mathrm{Mag}}_1$ and $G^{T_C}_1$) and zero ($G^{\mathrm{Mag}}_2$ and $G^{T_C}_2$), respectively. Yellow areas indicate the percentages of alloys that are detected as outliers. By contrast, gray regions indicate the fractions of alloys not assigned to the extracted groups.

Therefore, because the similarity information is effective for detecting the mixture of multiple mechanisms in the dataset, we use it to design descriptors for the quaternary alloys. We apply methods similar to those in the previous experiment to construct a two-dimensional embedding map [Fig. 5(f)] and then learn a mixture of experts to predict the Curie temperature of the quaternary alloys in the dataset $D^{T_C}_{\mathrm{quaternary}}$. The proposed mixture of experts exhibits higher prediction accuracy than the single model in tenfold cross-validation [Figs. 6(c) and 6(d)]. The MAE of the proposed mixture of experts is reduced by approximately 36%, from 67 K to 49 K.

E. Discussion of the obtained similarities between materials and the associated physical mechanisms

Regarding the experiments with the datasets $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ and $D^{T_C}_{\mathrm{quaternary}}$ focusing on magnetization or $T_C$, the datasets appear to be a self-evident example in which the behavior depends on whether the magnetization and $T_C$ are finite or zero. As we can see from the results described above (Secs. III C and III D and Sec. VI in the supplementary material), the prediction accuracy is low when considering a single regression model for the entire dataset.
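
The mixture-of-experts idea used in Secs. III C and III D, in which group probabilities from a Gaussian mixture model decide how much each group-specific regression model contributes, can be sketched as follows. This is an illustration under several assumptions that are not taken from the paper: the embedding is reduced to one coordinate `z` (the paper uses a two-dimensional map), the gating is a soft probability-weighted sum, and the mixture parameters and the two expert functions are invented. All names are hypothetical.

```python
import math


def gaussian_density(z, mean, std):
    """Density of a one-dimensional normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def group_probabilities(z, components):
    """Posterior probability of each mixture component (weight, mean, std) at z."""
    weighted = [w * gaussian_density(z, m, s) for w, m, s in components]
    total = sum(weighted)
    return [value / total for value in weighted]


def mixture_of_experts_predict(z, x, components, experts):
    """Probability-weighted sum of the expert predictions for descriptor x."""
    probs = group_probabilities(z, components)
    return sum(p * expert(x) for p, expert in zip(probs, experts))


# Invented two-component mixture in a 1-D embedding and two toy experts:
# one for a "finite property" group, one for a "zero property" group.
toy_components = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
toy_experts = [lambda x: 1.0 + 0.5 * x, lambda x: 0.0]

midpoint = mixture_of_experts_predict(0.0, 1.0, toy_components, toy_experts)
near_first = mixture_of_experts_predict(-2.0, 1.0, toy_components, toy_experts)
print(midpoint, near_first)
```

An instance located between the two components receives an even blend of both experts, whereas an instance at the center of the first component is predicted almost entirely by the first expert.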
In this section, we analyze the extracted alloy groups $G^{\mathrm{Mag}}_1$, $G^{\mathrm{Mag}}_2$, $G^{T_C}_1$, and $G^{T_C}_2$ to identify underlying patterns.

Figure 7 shows that Fe and Co, which have large spin moments and ferromagnetic interactions with many elements, resulting in high magnetization or $T_C$, are the dominant elements in the alloys of the two groups $G^{\mathrm{Mag}}_1$ (a) and $G^{T_C}_1$ (b). Furthermore, when we analyze the proportions of the quaternary alloys that contain a fixed pair of their four constituent elements with respect to the four extracted groups $G^{\mathrm{Mag}}_1$, $G^{\mathrm{Mag}}_2$, $G^{T_C}_1$, and $G^{T_C}_2$, we observe that the proportion of Fe-containing and Co-containing alloys in the two groups $G^{\mathrm{Mag}}_1$ and $G^{T_C}_1$ is significantly larger than in the other groups (Fig. 8). Thus, the prediction models constructed from the data of the alloys in $G^{\mathrm{Mag}}_1$ or $G^{T_C}_1$ are more suitable for predicting the magnetization or $T_C$, respectively, of alloys containing these elements. The remaining Fe–X and Co–X alloys (X denotes the other transition metals in the alloys) are considered outliers of the extracted mechanisms or unassigned HEAs, that is, alloys not assigned to any of these mechanisms. The Mn–X alloys exhibit behavior similar to that of the Fe–X and Co–X alloys when focusing on the magnetization mechanisms. However, for the Curie temperature, the Mn–X alloys are categorized in the low-$T_C$ group $G^{T_C}_2$ in addition to the other groups. In particular, among the Fe–X and Co–X alloys, the percentages of Fe–Mn and Co–Mn alloys considered outliers of the mechanisms extracted from $G^{T_C}_1$ are relatively high, 55% and 43%, respectively (Fig. 8).

For further investigation, we organized the raw data of the quaternary alloys by focusing on the presence or absence of Mn. Figure 9 shows the correlation between the magnetization and Curie temperature of the 556 (56%) alloys with non-zero properties.
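
One common way to quantify a correlation of this kind is the Pearson coefficient, which is also the measure quoted for the alloys of $G_1$ in Sec. III B. The sketch below first keeps only the instances whose two properties are both non-zero, as in Fig. 9, and then computes the coefficient. The five data points are invented for illustration and the function name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented (magnetization in T, Curie temperature in K) pairs.
toy_alloys = [
    (0.0, 0.0),      # both properties zero: excluded
    (0.2, 150.0),
    (0.4, 300.0),
    (0.6, 450.0),
    (0.3, 0.0),      # zero Curie temperature, finite magnetization: excluded
]
nonzero = [(m, t) for m, t in toy_alloys if m != 0.0 and t != 0.0]
correlation = pearson_correlation(
    [m for m, _ in nonzero], [t for _, t in nonzero]
)
print(len(nonzero), round(correlation, 6))
```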
The total number of data instances is 990; both $T_C$ and magnetization are zero for 413 (42%) of them, while 21 (2%) alloys have zero $T_C$ but finite magnetization. We found that the alloys containing all three elements, Mn, Fe, and Co, show high Curie temperatures ($T_C > 900$ K). The alloys containing either the Mn–Fe or the Mn–Co pair show moderate Curie temperatures. By contrast, the Mn-containing alloys without Fe or Co have low Curie temperatures ($T_C < 250$ K). Furthermore, these three alloy groups show no significant correlation between magnetization and Curie temperature. However, an apparent positive correlation between magnetization and Curie temperature can be observed for the group of Mn-free alloys.

FIG. 9. Correlation between magnetization (T) and Curie temperature (K) of quaternary alloys with non-zero magnetization and non-zero Curie temperature in datasets $D^{\mathrm{Mag}}_{\mathrm{quaternary}}$ and $D^{T_C}_{\mathrm{quaternary}}$. Marginal plots show histograms of the properties of the alloys.

To interpret the results obtained, we considered a hypothesis on the origin of the observed data. The estimated magnetization is the sum of all the local magnetic moments per unit volume. The local magnetic moments are determined by the spin configurations of the atomic sites that stabilize the structure of the alloys. In turn, given a particular structure and spin configuration, the $T_C$ can be estimated from the spin–spin exchange energy. First-principles calculations show that early transition metals and late transition metals often have antiferromagnetic interactions (Ref. 37). This interaction has also been confirmed in high-entropy alloys by using automatic exhaustive calculations (Ref. 31). Mn lies between the early and late transition metals; thus, the estimation of the spin configuration (ferromagnetic or antiferromagnetic) in Mn-containing alloys should be treated cautiously in different situations, especially in high-entropy alloys, whose elements can stochastically occupy the same atomic site. From this consideration, we can hypothesize that the alloys containing Mn follow a different rule for magnetization than those grouped into $G^{\mathrm{Mag}}_2$. On the other hand, the alloys containing Mn may follow the same rules for $T_C$ as the alloys grouped into $G^{T_C}_2$, albeit with a spin configuration that provides magnetization. The details are beyond the scope of this paper and will not be discussed here, but further analysis is promising.

IV. CONCLUSIONS

In this study, we developed a method that can be used to rationally transform material data from multiple sources into evidence of similarities between materials and to combine the evidence to draw conclusions about these similarities. The extracted similarity–dissimilarity information has significant potential for applications in the subgroup discovery of materials. The effectiveness of the eRSM in detecting homogeneous subgroups of materials has been demonstrated by using two experiments on two datasets of magnetic materials. In addition, further analysis of the detected subgroups improves the existing knowledge of problems related to the applied datasets of magnetic materials. For example, we reveal the differences in the mechanisms of the Curie temperature of Co-based binary alloys when applying our method to a dataset of 100 transition rare-earth metal binary alloys comprising Ni, Mn, Co, and Fe. Moreover, we explored the mechanisms of ferrimagnetic and low Curie temperature alloys from the magnetic dataset of calculated quaternary alloys.
By measuring the similarity between materials with uncertainty, the method described herein is expected to extract valuable information for describing and interpreting the underlying physical mechanisms in material datasets.

SUPPLEMENTARY MATERIAL

See the supplementary material for the following additional information: (1) explanation of the formulation modeling uncertainty, (2) evaluation of the eRSM using the toy datasets, and (3) feature selection and pre-analysis in the dataset of quaternary high-entropy alloys.

ACKNOWLEDGMENTS

This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan (MEXT) with the Program for Promoting Research on the Supercomputer Fugaku (DPMSD), JSPS KAKENHI grants 20K05301, JP19H05815 (Grants-in-Aid for Scientific Research on Innovative Areas Interface Ionics), 21K14396 (Grant-in-Aid for Early-Career Scientists), and 20K05068, Japan.

AUTHOR DECLARATIONS

Conflict of Interest

The authors have no conflicts to disclose.

Author Contributions

Minh-Quyet Ha: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Duong-Nguyen Nguyen: Conceptualization (equal); Formal analysis (supporting); Investigation (supporting); Methodology (supporting); Writing – original draft (supporting). Viet Cuong Nguyen: Funding acquisition (credit); Resources (credit); Software (credit). Hiori Kino: Data curation (lead); Formal analysis (supporting); Investigation (supporting); Methodology (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Yasunobu Ando: Formal analysis (supporting); Methodology (supporting); Writing – review & editing (supporting). Takashi Miyake: Formal analysis (supporting); Methodology (supporting); Validation (supporting); Writing – original draft (supporting); Writing – review & editing (supporting).
Thierry Denœux: Formal analysis (credit); Methodology (credit); Writing – review & editing (credit). Van-Nam Huynh: Conceptualization (supporting); Formal analysis (supporting); Investigation (supporting); Methodology (supporting); Writing – review & editing (supporting). Hieu-Chi Dam: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).

DATA AVAILABILITY

The data that support the findings of this study are openly available in Zenodo at http://doi.org/10.5281/zenodo.7540840, Ref. 27.

REFERENCES

1. B. Kailkhura, B. Gallagher, S. Kim, A. Hiszpanski, and T. Y.-J. Han, “Reliable and explainable machine-learning methods for accelerated material discovery,” npj Comput. Mater. 5, 108 (2019).
2. J. Tenenbaum, “Learning the structure of similarity,” Adv. Neural Inf. Process. Syst. 8, 3–9 (1995).
3. J. Tenenbaum, V. Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science 290, 2319–2323 (2000).
4. Y. Yang, F. Liang, S. Yan, Z. Wang, and T. S. Huang, “On a theory of nonparametric pairwise similarity for clustering: Connecting clustering to classification,” Adv. Neural Inf. Process. Syst. 27, 145–153 (2014).
5. C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” in Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019), Vol. 32.
6. B. Letham, C. Rudin, T. H. McCormick, and D. Madigan, “Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model,” Ann. Appl. Stat. 9, 1350–1371 (2015).
7. C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nat. Mach. Intell. 1, 206–215 (2019).
8. B. R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, and L. M. Ghiringhelli, “Uncovering structure-property relationships of materials by subgroup discovery,” New J. Phys. 19, 013031 (2017).
9. R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, “Machine learning in materials informatics: Recent applications and prospects,” npj Comput. Mater. 3, 54 (2017).
10. D.-N. Nguyen, T.-L. Pham, V.-C. Nguyen, T.-D. Ho, T. Tran, K. Takahashi, and H.-C. Dam, “Committee machine that votes for similarity between materials,” IUCrJ 5, 830–840 (2018).
11. D.-N. Nguyen, T.-L. Pham, V.-C. Nguyen, H. Kino, T. Miyake, and H.-C. Dam, “Ensemble learning reveals dissimilarity between rare-earth transition binary alloys with respect to the Curie temperature,” J. Phys.: Mater. 2, 034009 (2019).
12. J. Bardeen, L. N. Cooper, and J. R. Schrieffer, “Theory of superconductivity,” Phys. Rev. 108, 1175–1204 (1957).
13. A. Seko, A. Togo, and I. Tanaka, “Descriptors for machine learning of materials data,” in Nanoinformatics, edited by I. Tanaka (Springer Singapore, Singapore, 2018), pp. 3–23.
14. A. Tversky, “Features of similarity,” Psychol. Rev. 84, 327–352 (1977).
15. R. L. Goldstone, D. L. Medin, and J. Halberstadt, “Similarity in context,” Mem. Cognit. 25, 237–255 (1997).
16. G. Shafer, A Mathematical Theory of Evidence (Princeton University Press, 1976).
17. T. Denœux, D. Dubois, and H. Prade, “Representations of uncertainty in artificial intelligence: Beyond probability and possibility,” in A Guided Tour of Artificial Intelligence Research, edited by P. Marquis, O. Papini, and H. Prade (Springer-Verlag, 2020), Vol. 1, Chap. 4, pp. 119–150.
18. A. P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” Ann. Math. Stat. 38, 325–339 (1967).
19. N. Nu Thanh Ton, M.-Q. Ha, T. Ikenaga, A. Thakur, H.-C. Dam, and T. Taniike, “Solvent screening for efficient chemical exfoliation of graphite,” 2D Mater. 8, 015019 (2020).
20. M.-Q. Ha, D.-N. Nguyen, V.-C. Nguyen, T. Nagata, T. Chikyow, H. Kino, T. Miyake, T. Denœux, V.-N. Huynh, and H.-C. Dam, “Evidence-based recommender system for high-entropy alloys,” Nat. Comput. Sci. 1, 470–478 (2021).
21. C. Williams and C. Rasmussen, “Gaussian processes for regression,” in Advances in Neural Information Processing Systems 8, Max-Planck-Gesellschaft (MIT Press, Cambridge, MA, 1996), pp. 514–520.
22. M. Lázaro-Gredilla, S. Van Vaerenbergh, and N. D. Lawrence, “Overlapping mixtures of Gaussian processes for the data association problem,” Pattern Recognit. 45, 1386–1395 (2012).
23. V. Vovk, “Kernel ridge regression,” in Empirical Inference (Springer, 2013), pp. 105–116.
24. L. Breiman, “Random forests,” Mach. Learn. 45, 5–32 (2001).
25. A. Jain, J. Mao, and K. Mohiuddin, “Artificial neural networks: A tutorial,” Computer 29, 31–44 (1996).
26. P. Smets, “Belief functions: The disjunctive rule of combination and the generalized Bayesian theorem,” Int. J. Approx. Reason. 9, 1–35 (1993).
27. H.-C. Dam (2023). “Datasets of binary and quaternary alloys with Curie temperature and magnetization for the eRSM,” Zenodo. http://doi.org/10.5281/zenodo.7540840.
28. P. Villars, M. Berndt, K. Brandenburg, K. Cenzual, J. Daams, F. Hulliger, T. Massalski, H. Okamoto, K. Osaki, A. Prince, H. Putz, and S. Iwata, “The Pauling File, binaries edition,” J. Alloys Compd. 367, 293–297 (2004).
29. Y. Xu, M. Yamazaki, and P. Villars, “Inorganic materials database for exploring the nature of material,” Jpn. J. Appl. Phys. 50, 11RH02 (2011).
30. H. C. Dam, V. C. Nguyen, T. L. Pham, A. T. Nguyen, K. Terakura, T. Miyake, and H. Kino, “Important descriptors and descriptor groups of Curie temperatures of rare-earth transition-metal binary alloys,” J. Phys. Soc. Jpn. 87, 113801 (2018).
31. T. Fukushima, H. Akai, T. Chikyow, and H. Kino, “Automatic exhaustive calculations of large material space by Korringa-Kohn-Rostoker coherent potential approximation method applied to equiatomic quaternary high entropy alloys,” Phys. Rev. Mater. 6, 023802 (2022).
32. Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multiscale complexity in networks,” Nature 466, 761–764 (2010).
33. D. Reynolds, “Gaussian Mixture Models,” in Encyclopedia of Biometrics, edited by S. Z. Li and A. K. Jain (Springer, Boston, MA, 2015).
34. A. H. Murphy, “The Finley affair: A signal event in the history of forecast verification,” Weather Forecast. 11, 3–20 (1996).
35. L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res. 9, 2579–2605 (2008).
36. T. L. Pham, H. Kino, K. Terakura, T. Miyake, and H. C. Dam, “Novel mixture model for the representation of potential energy surfaces,” J. Chem. Phys. 145, 154103 (2016).
37. H. Akai, M. Akai, S. Blügel, B. Drittler, H. Ebert, K. Terakura, R. Zeller, and P. H. Dederichs, “Theory of hyperfine interactions in metals,” Prog. Theor. Phys. Suppl. 101, 11–77 (1990).