# Fileset

[s41524-026-01966-6 (2).pdf](https://mdr.nims.go.jp/filesets/d861db92-69d5-4099-b174-d5c56eb100e6/download)

## Creator

Yuna Oikawa, Guillaume Deffrennes, Rintaro Shimayoshi, [Taichi Abe](https://orcid.org/0000-0002-5065-0939), [Ryo Tamura](https://orcid.org/0000-0002-0349-358X), Koji Tsuda

## Rights

[Creative Commons BY-NC-ND Attribution-NonCommercial-NoDerivs 4.0 International](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## Other metadata

[aLLoyM: a large language model for alloy phase diagram prediction](https://mdr.nims.go.jp/datasets/2bc9a70d-646f-41a2-952d-ccec287f5d6f)

## Fulltext

aLLoyM: a large language model for alloy phase diagram prediction

npj Computational Materials, Article. Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences. https://doi.org/10.1038/s41524-026-01966-6

Yuna Oikawa¹, Guillaume Deffrennes², Rintaro Shimayoshi¹, Taichi Abe³, Ryo Tamura¹,⁴ & Koji Tsuda¹,⁴,⁵

Large language models (LLMs) are general-purpose tools with wide-ranging applications, including in materials science. In this work, we introduce aLLoyM, a fine-tuned LLM specifically trained on alloy compositions, temperatures, and their corresponding phase information. To develop aLLoyM, we curated question-and-answer (Q&A) pairs for binary and ternary phase diagrams using the open-source Computational Phase Diagram Database (CPDDB) and assessments based on CALPHAD (CALculation of PHAse Diagrams). We fine-tuned Mistral, an open-source pre-trained LLM, for two distinct Q&A formats: multiple-choice and short-answer. Benchmark evaluations demonstrate that fine-tuning substantially enhances performance on multiple-choice phase diagram questions. Moreover, the short-answer model of aLLoyM can generate novel phase diagrams from its components alone, suggesting that it may aid the discovery of new materials systems. To promote further research and adoption, we have publicly released the short-answer fine-tuned version of aLLoyM, along with the complete benchmarking Q&A dataset, on Hugging Face.

Phase diagrams serve as fundamental roadmaps in materials science, providing critical insights into material behavior across varying thermodynamic conditions. The ability to accurately predict and interpret these diagrams represents a cornerstone of efficient materials design, with experienced practitioners often relying on accumulated expertise to anticipate phase relationships.
While large experimental databases [1–4] and computational repositories [5–7] have established valuable reference collections, the experimental determination of phase diagrams remains resource-intensive and prohibitively time-consuming for comprehensive materials exploration.

Recent advances in machine learning methodologies have demonstrated promising capabilities for phase diagram prediction, with conventional approaches including neural networks, support vector machines, random forests, and label propagation algorithms showing measurable success [8–15]. Concurrently, the emergence of large language models (LLMs) such as GPT-4, LLaMA, and Mistral has opened novel avenues for materials science applications [16–20]. Unlike specialized machine learning models that operate on isolated datasets, LLMs represent general-purpose architectures capable of bringing broader scientific knowledge, such as thermodynamic principles and elemental properties encoded during pre-training, to bear on phase diagram predictions. Preliminary investigations have explored LLM applications in phase diagram analysis, including system-specific training on Mg-Al-Zn data [21] and experimental diagram annotation [22], suggesting substantial potential for phase diagram analysis.

In this study, we introduce aLLoyM, an LLM fine-tuned for phase diagram generation (Fig. 1). Due to computational resource constraints, we adopted low-rank adaptation (LoRA) instead of full fine-tuning. The efficacy of LoRA in domain-specific fine-tuning scenarios has been substantiated in prior studies [23–26]. Our approach leverages the Computational Phase Diagram Database (CPDDB) [5], a comprehensive open-source repository published by the National Institute for Materials Science (NIMS), as the primary training corpus. From the CPDDB, thermodynamic database (TDB) files for 389 binary and 38 ternary phase diagrams were obtained, with the distribution of constituent elements illustrated in Figs.
S1 and S2. Each TDB file contains Gibbs free energy functions for individual phases, enabling the construction of phase diagrams through CALPHAD assessments.

Author affiliations: ¹ Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan. ² University Grenoble Alpes, CNRS, Grenoble INP, SIMaP, Grenoble, France. ³ Research Center for Structural Materials, National Institute for Materials Science, Tsukuba, Ibaraki, Japan. ⁴ Center for Basic Research on Materials, National Institute for Materials Science, Tsukuba, Ibaraki, Japan. ⁵ RIKEN Center for Advanced Intelligence Project, Tokyo, Japan. e-mail: tamura.ryo@nims.go.jp; tsuda@k.u-tokyo.ac.jp. npj Computational Materials (2026) 12:97.

Phase diagram calculations were performed across systematic compositional and temperature grids using Pandat software [27]. For compositional variables, elemental fractions were sampled from 0% to 100% in 2% increments. For binaries, temperature was varied from 200 K to 5000 K in 50 K intervals, while for ternaries, the temperature was fixed at 800 K due to the computational cost. There are two main reasons why the temperature was fixed at 800 K for the ternary systems: (1) it is close to typical annealing temperatures for high-entropy alloys, and (2) it is comparable to annealing temperatures for steel as well as the aging treatment of Ni superalloys. Consequently, phase diagrams at this temperature are of particular interest from the perspective of phase diagram determination. This systematic sampling approach generated 837,475 data points, each defining the relationship between elemental composition, temperature, and corresponding phase names.
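The sampling grids described above can be illustrated with a short sketch. It only enumerates the (composition, temperature) points; the phase equilibria themselves were computed with Pandat, which is not reproduced here. The function names are hypothetical, and the inclusion of both end points (0% and 100%, 200 K and 5000 K) is an assumption. The raw grid sizes are not expected to reproduce the reported total of 837,475 data points, since the source does not state how individual systems or failed calculations were handled.

```python
from itertools import product


def binary_grid(step_pct=2, t_min=200, t_max=5000, t_step=50):
    """Return (x_B in at.%, T in K) points for a binary A-B system.

    Assumes both end points of each range are included.
    """
    compositions = range(0, 101, step_pct)
    temperatures = range(t_min, t_max + 1, t_step)
    return list(product(compositions, temperatures))


def ternary_grid(step_pct=2, temperature=800):
    """Return (x_A, x_B, x_C in %, T in K) points for an isothermal section.

    The three fractions always sum to 100%.
    """
    points = []
    for a in range(0, 101, step_pct):
        for b in range(0, 101 - a, step_pct):
            points.append((a, b, 100 - a - b, temperature))
    return points


if __name__ == "__main__":
    print(len(binary_grid()), "grid points per binary system")
    print(len(ternary_grid()), "grid points per ternary system at 800 K")
```

Under these assumptions, each binary system has 51 compositions and 97 temperatures, and each ternary section has 1326 compositions.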
From these data points, we constructed question-and-answer (Q&A) pairs. For example, a question might include information about the composition and temperature, and the answer would be the associated phase name. One of the important features of LLMs is their ability to handle multiple tasks within a single model. Thus, in this study, we developed a model capable of performing three different Q&A tasks using a unified architecture. We then fine-tuned Mistral, an open-source pre-trained LLM, on these Q&As to incorporate domain-specific knowledge through selective parameter updates.

The aLLoyM model was comprehensively benchmarked using two distinct Q&A formats: multiple-choice and short-answer. The multiple-choice Q&As facilitated direct comparative analysis between baseline and fine-tuned model performance, with results demonstrating that fine-tuning yielded substantial improvements in predictive accuracy relative to the baseline LLM. In contrast, the short-answer Q&As operate independently of multiple-choice constraints, rendering them particularly suitable for predicting previously unexplored phase diagrams without requiring additional domain knowledge. Short-answer Q&As with aLLoyM can be employed to generate novel phase diagrams, as exemplified in Fig. 1, and we used them to generate illustrative examples. The aLLoyM model optimized for short-answer applications is publicly accessible through the Hugging Face platform (https://huggingface.co/Playingyoyo/aLLoyM).

Fig.
1 | Schematic of fine-tuned LLM for phase diagram generation: aLLoyM. Q&As were generated from CPDDB using CALPHAD assessments, and Mistral was fine-tuned on these pairs.

### Results

#### Multiple choice Q&As

We conducted a benchmark evaluation using multiple-choice questions to compare the performance of aLLoyM against a baseline LLM. Each question required the model to choose the correct answer from four options, where three distractors were randomly selected from answers related to the same systems (Fig. 2). To assess the model's generalization capability, the dataset, comprising binary and ternary systems, was split into training and test sets using an 8:2 ratio. Two distinct data splitting strategies were implemented to evaluate model performance under different generalization scenarios (see Fig. 2).

Interpolation split: data points were randomly distributed across all available systems, allowing assessment of model performance on familiar systems with varying compositional and thermal conditions.

Extrapolation split: systems in the test set were completely excluded from the training set, enabling evaluation of the model's ability to generalize to previously unseen systems.

We considered three types of Q&A tasks. Full phase information: given the input composition and temperature, the model predicts the complete phase information, including phase names and their corresponding fractions and compositions. An example of this Q&A task is presented in Fig. 1. Phase name: the model predicts only the phase names based on the input composition and temperature. The output is the phase domain, without the phase fractions or compositions given in the full phase information task. Experimental condition: given the constitutive elements and a specific phase domain, the model predicts a possible composition and temperature.
This task serves as the inverse of the phase name prediction. Examples of each Q&A are summarized in Table S1. These three Q&A tasks were trained within a single LLM.

Fig. 2 | Accuracies of the baseline model (Mistral) and the fine-tuned model (aLLoyM) on multiple-choice Q&As. Results are reported separately for interpolation and extrapolation settings, and cover all three Q&A task types: full phase information inference, phase name prediction, and experimental condition inference. Performance is also distinguished between binary and ternary systems.

The accuracies of all Q&A tasks in the multiple-choice format are shown in Fig. 2, with performance evaluated separately for binary and ternary systems across the three task types. As a baseline, we employed the Mistral-Nemo-Instruct-2407-bnb-4bit model using Hugging Face's causal language modeling interface [28]. The baseline model's performance remained close to random guessing in both interpolation and extrapolation settings, with accuracy only slightly above the level expected by chance. These findings indicate that the baseline language model struggled to produce correct answers to phase diagram questions. Ideally, predictions obtained from conventional machine learning methods should also have been considered as an additional baseline. However, constructing prediction models with such approaches presents a major challenge in numerically encoding phase labels. The phase names appearing in different phase diagrams vary considerably, making it difficult to standardize them or to convert them into numerical representations. Consequently, applying conventional machine learning methods to the present dataset is not straightforward.
In contrast, a key advantage of LLMs lies in their ability to handle phase names directly.

In contrast to the baseline, the fine-tuned models exhibited substantial performance improvements across all tasks. For each of the interpolation and extrapolation settings, a separate model was fine-tuned on all three Q&A tasks together. In all cases, the fine-tuned models outperformed the baseline. As anticipated, performance was generally higher on interpolation tasks compared to extrapolation tasks. Furthermore, predictions for ternary systems proved more challenging than those for binary systems, while performance differences among the three Q&A tasks were relatively minor. These results demonstrate that, when provided with suitable training data, LLMs are capable of accurately predicting phase diagram information. Notably, the model's success in the extrapolation setting suggests an ability to generalize knowledge from known systems to make informed predictions for previously unseen combinations.

We evaluated the average accuracy for ternary systems as a function of the number of constituent binary pairs included in the training dataset. The results of phase name prediction for the extrapolation split are shown in Fig. S3. Although the average accuracy generally improved with an increasing number of constituent binary pairs, the variability remained substantial. The training was performed using a mixture of binary and ternary data, albeit with an imbalanced distribution. To assess the impact of this imbalance, we constructed a model trained exclusively on ternary data and compared its accuracy (see Fig. S4). The results demonstrated that excluding binary data reduced the prediction accuracy for ternary phase diagrams.
These findings suggest that, in the present case, distributional imbalance does not inherently impair the prediction performance of LLMs.

#### Short answer Q&As

Adopting short-answer questions allows the model to generate responses without relying on predefined multiple-choice options. Consistent with the multiple-choice Q&As, the fine-tuned models were trained using the full data corresponding to all three Q&A tasks. To evaluate the alignment between the ground-truth answers and those generated by aLLoyM, we introduced a task-dependent scoring metric, described in the "Methods" section. The score ranges from 0 to 100%, with higher values indicating greater agreement between the generated and ground-truth answers.

Fig. 3 | Average scores of the fine-tuned models (aLLoyM) for short answer Q&As. Results are presented for interpolation and extrapolation configurations, with individual tasks (full phase information, phase name, and experimental condition). Binary and ternary systems were evaluated separately.

Figure 3 presents the average scores for each task. As anticipated, performance was superior on interpolation settings relative to extrapolation settings. Among the three Q&A task categories, predicting complete phase information proved most challenging. Nevertheless, the model demonstrated robust performance in predicting phase names, even under extrapolation conditions.
Furthermore, it successfully generated appropriate experimental conditions from specified phase information in extrapolation settings, suggesting that when a target phase is designated, aLLoyM possesses the capacity to reliably propose suitable experimental parameters. Across all tasks, predictions for ternary systems were consistently more challenging than those for binary systems. Supplementary Note A analyzes the sources of prediction errors, such as missing phases or inaccurate temperatures. In addition, the effect of jointly fine-tuning all three Q&A tasks, as opposed to fine-tuning them separately, is of particular relevance for evaluating the capabilities of LLMs. In the binary prediction task, we compared these two training strategies and observed that the resulting accuracies were nearly identical (see Fig. S5). These results demonstrate that LLMs are capable of effectively learning multiple tasks within a single model.

Based on the phase names predicted by aLLoyM, we reconstructed the phase diagrams for the element sets in the extrapolated test set. Figures 4 and 5 present representative binary and ternary phase diagrams exhibiting varying levels of predictive performance. The scores represent averages across each complete phase diagram, with corresponding ground-truth phase diagrams provided for comparison. Across all cases, predictive performance remains consistently higher in regions proximate to pure elements and diminishes progressively as compositions approach intermediate regions. When the intermediate compositional range exhibits relatively simple phase behavior, the generated phase diagrams demonstrate greater accuracy, yielding elevated scores as observed in the Co-Th and Mg-Si-Cu systems. Conversely, systems characterized by more complex intermediate phase behavior frequently produce lower scores, as exemplified by the Co-Ti and Cr-Ni-Al systems.
These findings indicate that the inherent complexity of intermediate compositional regions is a key factor contributing to the difficulty of phase diagram generation for aLLoyM.

To check how the accuracy depends on the training-testing data division, five-fold cross-validation on the full phase information task for the short answer Q&As was performed (see Fig. S6). The accuracy fluctuated significantly across folds, suggesting that the accuracy of the answers varies depending on the chemical distance reflected in the data split.

Fig. 4 | Representative binary phase diagrams exhibiting varying predictive performance, as generated by aLLoyM for the phase name prediction task. The ground-truth phase diagrams are also shown. Lower scores denote greater discrepancies between generated and ground-truth phase diagrams.

Fig. 5 | Representative 800 K ternary isothermal sections exhibiting varying predictive performance, as generated by aLLoyM for the phase name prediction task. The ground-truth phase diagrams are also shown. Lower scores denote greater discrepancies between generated and ground-truth phase diagrams.

#### Novel phase diagram generation

aLLoyM enables the generation of entirely novel phase diagrams, including those that are currently unknown or extremely difficult to construct experimentally. Figure 6 presents examples of such phase diagrams for both binary and ternary systems, generated using aLLoyM with the short-answer Q&A format. We first examine the results for binary systems. Phase diagrams were generated for the Th-Ac (thorium-actinium) and U-Nh (uranium-nihonium) systems. In the case of Th-Ac, pure thorium was incorporated within the training dataset, whereas actinium was omitted owing to its short half-life. aLLoyM predicted the melting point of actinium to be approximately 1400 °C, which is consistent with the experimental value of approximately 1050 °C [29]. However, while the stable crystal structure of actinium is known to be face-centered cubic (FCC) [30], the model incorrectly predicted it as hexagonal close-packed (HCP). For the U-Nh system, neither uranium nor nihonium was included in the training data, making this an entirely extrapolative prediction. The predicted melting point for uranium was approximately 900 °C, compared to the known value of 1135 °C [31], indicating only a moderate deviation. However, aLLoyM erroneously predicted HCP as the stable structure, while uranium's actual stable structure is body-centered cubic (BCC) [32]. For nihonium, no experimental data on melting point or crystal structure are currently available. Nevertheless, the model was able to generate phase diagram outputs, illustrating its potential to make predictions in domains where experimental data are scarce or nonexistent. It should be noted that the pre-trained Mistral model correctly predicted the stable low-temperature structures of actinium and uranium. However, after fine-tuning, aLLoyM generated incorrect crystal structures, indicating that catastrophic forgetting occurred even when LoRA fine-tuning was applied.

Fig. 6 | Examples of unknown phase diagram generations. Binary phase diagrams and 800 K ternary isothermal sections were inferred by aLLoyM in the phase name prediction task.

We subsequently examine the results for ternary systems. Tungsten (W), tantalum (Ta), and osmium (Os) are all elements characterized by exceptionally high melting points, rendering experimental investigation of their ternary phase diagram particularly challenging.
To date, no ternary phase diagrams have been established for this system, although all three constituent elements are present in the training data for binary systems. Using aLLoyM, we reconstructed the ternary phase diagram for this system at 800 K. In the intermediate compositional region, the model predicts the emergence of three-phase coexistence. Notably, aLLoyM also predicts the existence of phases designated with "WOLF" nomenclature that are absent from the training data. These may reflect latent knowledge embedded within the pre-trained Mistral model. Finally, we generated a ternary phase diagram for nihonium, uranium, and actinium at 800 K, representing an entirely hypothetical system that cannot be experimentally realized. Here, as in the W-Ta-Os system, the model predicts three-phase coexistence in the intermediate compositional region.

Since the phase diagrams generated above remain beyond experimental validation, they should be regarded only as illustrative examples. As the next step of this research, it is important to predict realistic novel phase diagrams that can be experimentally tested and to carry out their experimental evaluation.

#### Reliability assessment

Evaluating the reliability of generated answers is crucial for demonstrating the accuracy of the predictions. Here, we address methods for assessing the reliability of aLLoyM's output using both temperature-based evaluation and confidence-based evaluation.

Temperature-based evaluation: aLLoyM occasionally produces novel phase-name predictions in response to short-answer Q&As, such as a phase name containing "WOLF." To assess the reliability of such predictions, we perform a temperature-based evaluation. For each question where a phase name containing "WOLF" was generated at sampling temperature T = 0, we generated 100 responses per temperature setting, increasing the temperature in increments of 0.3. The proportion of generated phase names that contained the keyword "WOLF" at each temperature is shown in Fig. 7a.
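The temperature-based evaluation described above amounts to counting keyword hits among repeated samples at each sampling temperature. A minimal sketch follows, assuming a caller-supplied `generate(question, temperature)` function that returns one model answer as a string. That function, the stub used in the demonstration, and the upper end of the temperature range are all hypothetical, since the source states only the 0.3 increment and the 100 samples per setting.

```python
from typing import Callable, Dict, List


def keyword_proportion(responses: List[str], keyword: str) -> float:
    """Fraction of responses containing the keyword (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(keyword.upper() in r.upper() for r in responses)
    return hits / len(responses)


def temperature_sweep(
    generate: Callable[[str, float], str],
    question: str,
    keyword: str,
    temperatures: List[float],
    n_samples: int = 100,
) -> Dict[float, float]:
    """Sample n_samples answers per temperature and record the keyword proportion."""
    result = {}
    for t in temperatures:
        responses = [generate(question, t) for _ in range(n_samples)]
        result[t] = keyword_proportion(responses, keyword)
    return result


def stub_generate(question: str, temperature: float) -> str:
    """Hypothetical stand-in for the model: the keyword appears only at T = 0."""
    return "WOLF_PHASE" if temperature == 0 else "BCC_A2"


if __name__ == "__main__":
    # Upper bound of the sweep is an assumption; only the 0.3 step is from the text.
    temps = [round(0.3 * i, 1) for i in range(4)]
    print(temperature_sweep(stub_generate, "hypothetical question", "WOLF", temps))
```

In practice, `stub_generate` would be replaced by a call to the fine-tuned model with sampling enabled at the given temperature.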
For comparison, we conducted the same analysis for a question whose ground-truth answer is a phase name containing "FCC." We observed that the proportion of FCC-containing predictions remained stable even as the temperature increased, whereas WOLF-containing predictions disappeared rapidly. This indicates that "WOLF" corresponds to a low-reliability prediction. Such analyses are therefore essential whenever a novel phase-name prediction is obtained, to evaluate the robustness of the model's output.

Confidence-based evaluation: we next examined a confidence-based method for assessing prediction reliability. Because confidence can be evaluated at the token level, we focused on multiple-choice Q&As in which all answers share a uniform length. For each question, we estimated confidence by computing the model's log-likelihood of generating each candidate option label (a, b, c, or d) as the next token, applying a softmax over these scores, and taking the probability assigned to the most likely label. We then compared the resulting confidence distributions for correct versus incorrect predictions in the full phase prediction task, as shown in Fig. 7b. The confidence associated with incorrect predictions is clearly lower than that of correct predictions. This indicates that, for multiple-choice Q&As, inspecting the model's confidence provides an effective means of evaluating the reliability of its answers.

### Discussion

In this work, we developed aLLoyM, a fine-tuned Large Language Model specialized for relations between alloy compositions, temperatures, and phase information. The model was fine-tuned on Q&As for binary and ternary phase diagrams constructed from the open-source Computational Phase Diagram Database (CPDDB) using CALculation of PHAse Diagrams (CALPHAD) assessments.
Our benchmark results demonstrated that fine-tuning significantly improves the accuracy of the model in selecting the correct responses to multiple-choice questions concerning phase diagrams. Furthermore, the short-answer model of aLLoyM can be used to generate phase diagrams for previously unreported systems. These results indicate that aLLoyM provides a potentially useful framework for phase diagram prediction, with some capacity for extrapolation to novel systems. Its ability to infer phase behavior in previously unexplored compositional spaces could facilitate the design and discovery of new materials.

Fig. 7 | Reliability analysis of phase diagram generations. a Proportion of generated answers that included the specific keyword depending on the sampling temperature. 100 responses are generated for the questions where the answers include "WOLF" (novel prediction) and FCC (correct). b Confidence distributions between correct and incorrect predictions for the full phase prediction task of multiple choice Q&As.

The consistently stronger performance on binary systems compared to ternary systems across all evaluations can be attributed to the relatively limited availability of ternary training data. Moreover, the absence of temperature-dependent training data for ternary systems prevents aLLoyM from making reliable predictions across different temperatures, particularly below 800 K. Future work should therefore prioritize expanding the training data for ternary and higher-order systems with explicit temperature dependence to enable more robust predictions of multi-component phase diagrams.
In addition, the integration of experimental phase diagrams using LLMs represents an important future perspective, where phase diagram annotation techniques based on LLMs can be leveraged.

A key advantage of aLLoyM's natural language framework lies in its ability to utilize the virtually unlimited vocabulary of elements and phase names acquired during pretraining, thus supporting broad generalization to diverse chemical systems. While the current implementation tends to generate phase names seen during training, this limitation opens promising opportunities for improvement through advanced prompt engineering. In particular, incorporating thermodynamics-aware prompts may help guide the model toward applying physically meaningful reasoning during inference, thereby enhancing prediction accuracy. aLLoyM's training utilized a standardized prompt template, which means its prediction quality may be sensitive to variations in how prompts are phrased or input formats are changed. We encourage users to experiment with various prompting approaches and share successful strategies, as the field of prompt engineering is constantly evolving to optimize LLM performance through input design [33,34]. To advance phase diagram prediction, integrating thermodynamic information, particularly Gibbs energy, will be essential. As Gibbs energy data are available in TDB files, future models should be trained to incorporate this information directly. Another current limitation of aLLoyM is the absence of uncertainty quantification, which restricts it to deterministic outputs. Embedding mechanisms for uncertainty estimation will be critical for enabling reliable predictions in materials design. In this study, we examined the generation of phase diagrams through discrete Q&A as a proof of concept. Nevertheless, we recognize the importance of developing approaches that can handle phase boundaries in a continuous manner.
One promising direction is the use of Q&A grounded in graph-based representations, along with the development of strategies for constructing Q&A datasets that are better tailored to phase diagram generation. Collectively, these directions represent important pathways for future research toward developing more capable LLMs specifically tailored to phase diagram prediction and materials discovery.

### Methods

#### Fine-tuning

We fine-tuned the Mistral-Nemo-Instruct-2407 model using LoRA (Low-Rank Adaptation) with rank 16 and alpha 16, targeting attention and feed-forward projections. These hyperparameter values correspond to the default LoRA adapters in the pretrained model's official demo (https://colab.research.google.com/github/unslothai/studio/blob/main/colabs/mistral_nemo_12b.ipynb). We confirmed that modifying these values does not substantially affect the accuracies (see Fig. S7). Training data was formatted using a structured prompt template with Instruction, Input, and Output sections (see Table S1). The model was trained for 15,000 steps with a learning rate of 2 × 10⁻⁴, batch size of 16 per device, and 4 gradient accumulation steps using the AdamW optimizer with bfloat16 precision. Training of the full aLLoyM required 32 h on an NVIDIA A100 GPU (80 GB PCIe). The training was conducted using Python 3.10.10 in a Linux environment (6.8.0-55-generic). On the same GPU, generation required approximately 1 s per question. We confirmed that general-purpose language tasks can indeed be performed even after LoRA-based fine-tuning, indicating that knowledge retention is maintained.

#### Scoring criteria for generated answers

For the short answer Q&As, the scoring criteria for generated answers depend on the specific Q&A task, as both the answer format and the target subject vary across tasks. The definition of the scores for each task is shown below. All of the following scores are defined with a maximum value of 100%. The details are summarized in Supplementary Note B.
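Two of the scores in this subsection, the exact match for full phase information and the Jaccard similarity of exactly matching phase names, can be sketched as follows. This is only an illustration: the exact definitions are those of Supplementary Note B, which is not reproduced here. The function names are hypothetical, answers are assumed to be already parsed into lists of phase-name strings, the whitespace stripping in the exact match is an assumption, and the convention that two empty sets score 100% is an assumption as well.

```python
from typing import Iterable


def exact_match_score(generated: str, truth: str) -> float:
    """Return 100.0 if the two answers are identical after stripping whitespace, else 0.0."""
    return 100.0 if generated.strip() == truth.strip() else 0.0


def jaccard_score(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Jaccard similarity (in %) between two sets of phase names.

    Only exactly matching names count as shared.
    """
    p, t = set(predicted), set(truth)
    if not p and not t:
        return 100.0  # assumed convention for two empty answers
    return 100.0 * len(p & t) / len(p | t)


if __name__ == "__main__":
    # One shared name out of three distinct names in total.
    print(jaccard_score(["LIQUID", "FCC_A1"], ["LIQUID", "BCC_A2"]))
```

For example, a prediction sharing one of three distinct phase names with the ground truth scores one third of the maximum.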
Full phaseinformation: the exactmatchbetween the generatedanswerand the ground-truth answer was used. Phase name: the score was evaluated using theJaccard similarity of perfectly matching phase names. Experimental condi-tion: the scoring of experimental conditions evaluates how well the elementcompositions and temperature match one of the ground-truths by com-paring composition accuracy and temperature accuracy.Data availabilityAll Q&A data used in this study are publicly available at: https://huggingface.co/datasets/Playingyoyo/aLLoyM-dataset. The short-answerversion of aLLoyM,fine-tuned on the full dataset, can be accessed at: https://huggingface.co/Playingyoyo/aLLoyM.Code availabilityCode for aLLoyM is available at https://github.com/tsudalab/aLLoyM/tree/main.Received: 29 July 2025; Accepted: 11 January 2026;References1. Schlesinger, M. E. & Mueller, E. M. (eds) ASMHandbook, Vol. 3 (ASMInternational, 1983).2. Massalski, T. B. & Okamoto, H. (eds) Binary Alloy Phase Diagrams(ASM International, 1990).3. Villars, P., Prince, A. & Okamoto, H.Handbook of Ternary Alloy PhaseDiagrams (ASM International, 1995).4. Okamoto, H. Desk Handbook 2nd edn. ASM Handbooks (ASMInternational, 2010).5. Computational phase diagram database (CPDDB). https://cpddb.nims.go.jp/.6. Jung, I.-H. & Van Ende, M.-A. Computational thermodynamiccalculations: FactSage from CALPHAD thermodynamic database tovirtual process simulation.Metall. Mater. Trans. B 51, 1851–1874(2020).7. Hallstedt, B., Noori, M., Kies, F., Oppermann, F. & Haase, C.Thermodynamicdatabase formulti-principal element alloyswithin thesystem Al-Co-Cr-Fe-Mn-Ni-C. Calphad 83, 102644 (2023).8. Terayama, K. et al. Efficient construction method for phase diagramsusing uncertainty sampling. Phys. Rev. Mater. 3, https://doi.org/10.1103/PhysRevMaterials.3.033802 (2019).9. Aghaaminiha, M., Ghanadian, S. A., Ahmadi, E. & Farnoud, A. M. Amachine learning approach to estimation of phasediagrams for three-component lipid mixtures. Biochim. 
Biophys. Acta Biomembr. 1862,183350 (2020).10. Dai, C. & Glotzer, S. C. Efficient phase diagram sampling by activelearning. J. Phys. Chem. B 124, 1275–1284 (2020).11. Lund, J., Wang, H., Braatz, R. D. & García, R. E. Machine learning ofphase diagrams.Mater. Adv. 3, 8485–8497 (2022).12. Zipoli, F., Viterbo, V., Schilter, O., Kahle, L. & Laino, T. Prediction ofphase diagrams and associated phase structural properties. Ind. Eng.Chem. Res. 61, 8378–8389 (2022).13. Tamura, R. et al. Machine-learning-based phase diagramconstruction for high-throughput batch experiments. Sci. Technol.Adv. Mater. Methods 2, 153–161 (2022).14. Deffrennes, G., Terayama, K., Abe, T. & Tamura, R. A machinelearning-based classification approach for phase diagram prediction.Mater. Des. 215, 110497 (2022).15. Tamura, R. et al. AIPHAD, an active learningweb application for visualunderstanding of phase diagrams. Commun. Mater. 5, 139 (2024).16. Jablonka, K. M. et al. 14 examples of how LLMs can transformmaterials science and chemistry: a reflection on a large languagemodel hackathon. Digital Discov. 2, 1233–1250 (2023).17. Liu, Y. et al. Generative artificial intelligence and its applications inmaterials science: current situation and future perspectives. J.Materiomics 9, 798–816 (2023).18. Lei, G., Docherty, R. & Cooper, S. J. Materials science in the era oflarge language models: a perspective. Digital Discov. 
3, 1257–1272 (2024).
19. Deb, J., Saikia, L., Dihingia, K. D. & Sastry, G. N. ChatGPT in the material design: selected case studies to assess the potential of ChatGPT. J. Chem. Inf. Model. 64, 799–811 (2024).
20. Jiang, X. et al. Applications of natural language processing and large language models in materials discovery. npj Comput. Mater. 11, 79 (2025).
21. Yan, Z. et al. PDGPT: a large language model for acquiring phase diagram information in magnesium alloys. Mater. Genome Eng. Adv. 2, e77 (2024).
22. Zha, Y., Li, Y. & Lu, X.-G. Enhancing large language model comprehension of material phase diagrams through prompt engineering and benchmark datasets. Mathematics 12, 3141 (2024).
23. Hu, E. J. et al. Lora: low-rank adaptation of large language models. https://arxiv.org/abs/2106.09685 (2021).
24. Gruver, N. et al. Fine-tuned language models generate stable inorganic materials as text. https://arxiv.org/abs/2402.04379 (2025).
25. Harod, H. et al. Effichem: efficient adaptation of chemical language models for molecular property prediction.
https://doi.org/10.26434/chemrxiv-2025-2lljt (2025).
26. Gao, B. et al. Lora-chem: modular machine learning for multitask prediction in organic reactions. https://doi.org/10.26434/chemrxiv-2025-p7sxn (2025).
27. Pandat software. https://computherm.com/.
28. Hugging face, mistral-nemo-instruct-2407-bnb-4bit. https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit.
29. Periodic table, actinium. https://periodic-table.rsc.org/element/89/actinium.
30. Farr, J., Giorgi, A., Bowman, M. & Money, R. The crystal structure of actinium metal and actinium hydride. J. Inorg. Nucl. Chem. 18, 42–47 (1961).
31. Periodic table, uranium. https://periodic-table.rsc.org/element/92/uranium.
32. Grenthe, I. et al. Uranium, 253–698 (Springer, 2008).
33. Sahoo, P. et al. A systematic survey of prompt engineering in large language models: techniques and applications. https://arxiv.org/abs/2402.07927 (2025).
34. Rodriguez, A. D., Dearstyne, K. R. & Cleland-Huang, J. Prompts matter: insights and strategies for prompt engineering in automated software traceability. https://doi.org/10.1109/REW57809.2023.00087 (2023).

Acknowledgements

The authors would like to thank Etsuko Ogamino for data collection. This study was supported by a project subsidized by JSPS KAKENHI (25K01492 and 25KJ0870), JST-CREST (JPMJCR21O2), and MEXT Program: Data Creation and Utilization Type Material Research and Development Project (JPMXP1122715503 and JPMXP1122712807).

Author contributions

All the authors conceived the original idea. Y.O. and R.S. prepared Q&As from CALPHAD assessment data and developed aLLoyM. G.D., T.A., and R.T. prepared the CALPHAD assessment data of phase diagrams. Y.O., R.T., and K.T. wrote the original manuscript.
All the authors discussed the results, commented on the manuscript, and approved the final version of the manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information: The online version contains supplementary material available at https://doi.org/10.1038/s41524-026-01966-6.

Correspondence and requests for materials should be addressed to Ryo Tamura or Koji Tsuda.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access: This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

© The Author(s) 2026

npj Computational Materials (2026) 12:97, https://doi.org/10.1038/s41524-026-01966-6