# Fileset

[d4dd00319e.pdf](https://mdr.nims.go.jp/filesets/a89028d2-af55-4305-9b93-016bd3f9daa7/download)

## Creator

[Christophe Bajan](https://orcid.org/0009-0008-1433-9618), [Guillaume Lambard](https://orcid.org/0000-0003-0275-4079)

## Rights

[Creative Commons BY Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)

## Other metadata

[Exploring the Expertise of Large Language Models in Materials Science and Metallurgical Engineering](https://mdr.nims.go.jp/datasets/25df19d4-bd2e-4da3-84af-d095a35fdc8f)

## Fulltext

Exploring the expertise of large language models in materials science and metallurgical engineering†

Christophe Bajan‡ and Guillaume Lambard‡*

Digital Discovery, Paper. Open Access Article. Published on 20 January 2025. Downloaded on 1/20/2025 7:55:58 AM. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

Data-Driven Material Design Group, National Institute for Materials Science, Tsukuba, Japan. E-mail: BAJAN.Christophe@nims.go.jp; LAMBARD.Guillaume@nims.go.jp

† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00319e

‡ These authors contributed equally to this work.

Cite this: DOI: 10.1039/d4dd00319e. Received 2nd October 2024. Accepted 7th January 2025. rsc.li/digitaldiscovery

© 2025 The Author(s). Published by the Royal Society of Chemistry.

The integration of artificial intelligence into various domains is rapidly increasing, with Large Language Models (LLMs) becoming more prevalent in numerous applications. This work is included in an overall project which aims to train an LLM specifically in the field of materials science. To assess the impact of this specialized training, it is essential to establish the baseline performance of existing LLMs in materials science. In this study, we evaluated 15 different LLMs using the MaScQA question answering (Q&A) benchmark. This benchmark comprises questions from the Graduate Aptitude Test in Engineering (GATE), tailored to test models' capabilities in answering questions related to materials science and metallurgical engineering. Our results indicate that closed-source LLMs, such as Claude-3.5-Sonnet and GPT-4o, perform the best with an overall accuracy of ∼84%, while open-source models, such as Llama3-70b and Phi3-14b, top at ∼56% and ∼43%, respectively. These findings provide a baseline for the raw capabilities of LLMs on Q&A tasks applied to materials science, and emphasise the substantial improvement that could be brought to open-source models via prompt engineering and fine-tuning strategies. We anticipate that this work could push the adoption of LLMs as valuable assistants in materials science, demonstrating their utilities in this specialised domain and related sub-domains.

### 1 Introduction

Large Language Models (LLMs) represent a significant advancement in artificial intelligence (AI), demonstrating exceptional proficiency in natural language processing (NLP). These models are designed to generate human-like text based on the patterns extracted from large pre-training data. LLMs have shown notable progress in a range of NLP tasks, including text generation, translation, summarization, and question answering on various benchmarks.

However, LLMs' capabilities often degrade when addressing domain-specific requests, such as those in materials science [1]. This limitation arises because pre-training data typically come from diverse web sources, encompassing a wide range of domains. While this approach effectively compresses general knowledge into the LLM's parameters, it can lead to the merging of unrelated contexts during inference, potentially resulting in incorrect assertions.

To overcome this challenge and effectively utilize LLMs for domain-specific tasks, two primary strategies can be employed:

(i) Train a dedicated LLM from scratch with a smaller parameter count, specifically tailored to encapsulate the desired domain knowledge.

(ii) Fine-tune a pre-trained LLM to a specific domain [2].

In this study, we adopt the second strategy, leveraging the instruction-following capabilities and general NLP proficiency of pre-existing models.
Our final objective is to fine-tune an existing LLM and integrate it into a retrieval-augmented generation (RAG) system for materials science applications. To guide this future fine-tuning process and establish a baseline for evaluation, we first assess in the present study the capabilities of available LLMs in materials science. This evaluation aims to:

- Establish a comprehensive baseline performance on materials science tasks.
- Identify LLMs that balance high capabilities with modest parameter counts, crucial for efficient fine-tuning and deployment.
- Discover potential areas for improvement in the evaluation process itself.

#### 1.1 LLMs in materials science

Recent years have witnessed significant advancements in leveraging LLMs for materials science and engineering. Domain-specific models and tools have emerged to address the challenges of applying NLP techniques to scientific research. Notable examples include:
- MatBERT [3]: a BERT-based model fine-tuned on materials science literature, enabling tasks such as information extraction and text classification.
- Mat2Vec [4]: provides word embeddings tailored for materials science, facilitating semantic analysis and knowledge representation.
- KGQA4MAT [5]: a knowledge-based system demonstrating the utility of knowledge graph question answering for structured scientific reasoning, particularly in applications like metal–organic frameworks.
- HoneyComb [6]: highlights the adaptability of LLMs to specialized agent-based systems that can assist in materials research workflows.

Furthermore, frameworks like SciQAG [7] have been developed to automatically generate question-answer (Q&A) pairs from scientific literature, addressing the need for domain-specific Q&A datasets. These efforts complement existing benchmarks such as ChemLLMBench [8] (for chemistry), MultiMedQA [9] (for medicine), and SciEval [10] (for STEM domains).

Despite these advancements, there remains a need for tailored benchmarks that specifically evaluate LLMs' understanding of materials science concepts. The MaScQA benchmark [1] addresses this gap by providing a curated dataset of 650 questions covering diverse sub-fields within materials science, including thermodynamics, atomic structure, mechanical behavior, and materials characterization.
It allows for evaluating fundamental comprehension, conceptual reasoning, and numerical problem-solving, capabilities essential for real-world materials science tasks.

#### 1.2 The MaScQA benchmark

While MaScQA is the most comprehensive benchmark tailored specifically to materials science and metallurgical engineering, alternative Q&A datasets focus on related scientific domains:

- SciQ [11]: a general science dataset with 13 679 questions across physics, chemistry, and biology, useful for evaluating broader scientific reasoning.
- ChemData700k and ChemBench4k [12]: benchmarks designed for chemistry competency, focusing on tasks related to chemical properties, reactions, and structures.
- MoleculeQA [13]: a dataset for molecular-level reasoning, particularly useful for tasks involving molecular properties and design.

These alternatives offer valuable insights but either lack the specificity of MaScQA or focus on narrower aspects of chemistry and molecular properties. MaScQA remains unique in its ability to test both conceptual understanding and numerical reasoning across diverse materials science sub-fields, making it the most suitable benchmark for this study.

Originally consisting of 650 questions derived from the Graduate Aptitude Test in Engineering (GATE), the MaScQA benchmark was refined for this study by manually removing 6 Q&A samples due to issues such as duplication or missing information (see Table 1 in the ESI† for details), leaving 644 questions. This minor reduction does not significantly bias the evaluation outcomes.

The MaScQA benchmark is categorized into four types of questions:

- 283 Multiple Choice Questions (MCQs)
- 70 Matching Type Questions (MATCH)
- 67 Numerical Questions with Multiple Choices (MCQN)
- 224 Numerical Questions (NUM)

These question types test various aspects of materials science knowledge, from conceptual understanding to numerical problem-solving. The questions span 14 distinct sub-fields within materials science, as shown in Fig. 1.

We selected this benchmark due to its comprehensive coverage of various domains within materials science, the substantial number of questions with answers curated by hand by the MaScQA authors, and the diversity of question types that necessitate both broad knowledge and computational abilities. By establishing a baseline of LLM performance on the MaScQA benchmark, we can better understand their current limitations and potential areas for improvement in materials science applications.

#### 1.3 LLM selection

The selection of LLMs for this study encompasses a variety of closed- and open-source models listed in Table 1. This diversity ensures a comprehensive evaluation across different architectures, accessibility, and fine-tunability [14,15]. The models were sourced from leading AI research organizations and companies, including Anthropic, OpenAI, Meta, Mistral AI, and Microsoft. By evaluating models from these varied sources, we aim to capture a broad spectrum of performance characteristics, enabling a more thorough understanding of the current state of LLMs applied to materials science. This approach allows us to assess not only the raw performance of these models in answering materials science questions but also to capture the trade-off between their accessibility, affordability, and customization potential for further domain-specific fine-tuning [16,17].

The choice of LLMs reflects models that were widely used and publicly available at the time of experimentation. Including both older and newer versions of the same models (e.g., GPT-3.5-turbo and GPT-4) enables us to track progress and evaluate incremental improvements in reasoning and performance for domain-specific tasks.
While newer models, such as Llama3.1, were released after our experiments, the results presented here provide a valuable baseline for future comparisons. Notably, improvements observed for Llama 3.1:70b on benchmarks like MATH [18] suggest that further evaluation on MaScQA could yield insightful comparisons.

### 2 Methodology

#### 2.1 LLM preparation

Our study diverges from the original work of Zaki et al. [1] in several key aspects. We expanded our evaluation to 15 different LLMs instead of only 3 (Llama2-70b, GPT-4, and GPT-3.5-turbo) to gain a broader understanding of LLM capabilities in materials science. Additionally, we chose not to include the chain-of-thought prompting method, as preliminary results in ref. 1 indicated that it did not significantly influence the performance of LLMs in answering materials science related questions.

Fig. 1 Distribution of the number of questions per sub-field. On the top-right hand, the number of questions per type is also reported. Figure updated from Zaki et al. [1] after removal of 6 Q&A samples from the original MaScQA dataset.

Another important difference came from the temperature parameter that regulates the stochasticity of the LLM response. Zaki et al. used a temperature of 1 during the LLMs' evaluations, which allows for more randomness in the model's responses. However, we opted to use a temperature of 0 to ensure maximum determinism and consistency in the answers. A temperature of 0 ensures that a model chooses the most probable answer and provides a fairer assessment of the models' knowledge integration and usage abilities.
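To make the role of the temperature concrete, the sketch below assumes the usual formulation in which next-token logits are divided by the temperature before a softmax. This is an illustrative assumption, not a description of how any of the evaluated APIs is implemented. The function name `softmax_with_temperature` and the example logits are hypothetical, and a temperature of exactly 0 is treated here as the greedy (argmax) limit.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return token probabilities for the given logits and sampling temperature.

    Assumes the common logits / temperature formulation; temperature == 0 is
    handled as the greedy limit (all probability mass on the highest logit).
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical logits for four candidate tokens (e.g. options A to D).
example_logits = [2.0, 1.0, 0.5, 0.1]
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a temperature of 1, some probability mass remains on the less likely tokens, so repeated sampling can return different answers. Lowering the temperature concentrates the mass on the top token, and in the limit of 0 the same token is always selected.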
Indeed, since the shape of the posterior distribution of tokens for a given input sequence is unknown for every LLM, a fair evaluation leaves two strategies: (i) fix the temperature as we did, or (ii) find the best temperature for each LLM. As the second strategy is costly and time-prohibitive, we opted for the first one such that the most probable output from each LLM is compared. To also ensure the reliability of our results, we submitted each question to the models three times to assess the repeatability of their answers. Indeed, even though a temperature of 0 was fixed to maximize determinism in answers, uncontrollable features leading to stochasticity still remain, such as floating-point precision [19], expert selection in mixture of experts (MoE) models like GPT-4 and Mixtral-8x7B [20], multi-threaded operations, random number generator state differences between runs, etc. [21]

Table 1 List of the LLMs and their characteristics selected for this study

| Models | Developer | Open-source | Fine-tuning | Number of parameters |
|---|---|---|---|---|
| Claude-3-Haiku | Anthropic | ✗ | ✗ | — |
| Claude-3-Opus | Anthropic | ✗ | ✗ | — |
| Claude-3.5-Sonnet | Anthropic | ✗ | ✗ | — |
| GPT-3.5-turbo | OpenAI | ✗ | ✓ | — |
| GPT-4 | OpenAI | ✗ | ✓ | — |
| GPT-4-turbo | OpenAI | ✗ | ✗ | — |
| GPT-4o | OpenAI | ✗ | ✓ | — |
| GPT-4o-mini | OpenAI | ✗ | ✓ | — |
| Llama2-7b | Meta | ✓ | ✓ | 7B |
| Llama2-70b | Meta | ✓ | ✓ | 70B |
| Llama3-8b | Meta | ✓ | ✓ | 8B |
| Llama3-70b | Meta | ✓ | ✓ | 70B |
| Mistral-7b | Mistral AI | ✓ | ✓ | 7B |
| Phi3-3.8b | Microsoft | ✓ | ✓ | 3.8B |
| Phi3-14b | Microsoft | ✓ | ✓ | 14B |

Finally, we maintained consistency with the original study by using the same assistant prompt preceding every question and instructing the LLM's desired behaviour: “Solve the following question. Write the correct answer inside a list at the end”. This approach allowed for direct comparison of our results to those of Zaki et al. [1]

We used the OpenAI, Anthropic and Ollama APIs to access the models [22–24]. The models used in this study are GPT-4-turbo, GPT-4o, GPT-4o-mini, GPT-4, GPT-3.5-turbo, Claude-3-Opus, Claude-3-Haiku, Claude-3.5-Sonnet, Llama2-7b, Llama2-70b, Llama3-8b, Llama3-70b, Mistral-7b, Phi3-3.8b and Phi3-14b. The tokenization process for all LLMs was handled automatically by the respective Python libraries, Ollama and OpenAI, which provide built-in tokenization as part of their APIs. No custom tokenization was applied in this study. Readers interested in the specifics of tokenization can refer to the official documentation of these libraries. The results were saved in `*.txt` files and are available on GitHub: https://github.com/Lambard-ML-Team/LLM_comparison_4MS.

Fig. 2 Pipeline for generating and evaluating responses from LLMs to the MaScQA benchmark.

The LLMs were tested on two different machines: a MacBook Pro M1 (2020, 8 GB RAM) and a GPU server (8× A100 40 GB PCIe NVIDIA GPUs). To assess the impact of hardware on performance, only GPT-3.5-turbo, GPT-4, Llama2-7b, and Llama3-8b were tested on both machines. For models such as GPT-3.5-turbo and GPT-4, which rely only on OpenAI's servers, the results remained consistent across both machines. However, for models like Llama2-7b and Llama3-8b, which run locally and are directly impacted by the host machine's specifications, performance variations were observed.
Llama2-7b performed similarly on both machines, while Llama3-8b exhibited a 16% performance improvement on the GPU server. To ensure optimal testing conditions, we divided the models based on their computational requirements and on the machines' availability. The distribution of models is as follows:

- MacBook Pro M1: GPT-4-turbo, GPT-4o, GPT-4, GPT-3.5-turbo, Claude-3-Opus, Claude-3-Haiku, Claude-3.5-Sonnet, Llama2-7b, and Llama3-8b.
- GPU server: GPT-4, GPT-4o-mini, GPT-3.5-turbo, Llama2-7b, Llama2-70b, Llama3-8b, Llama3-70b, Mistral-7b, Phi3-3.8b, and Phi3-14b.

This distribution ensures that local models benefit from the GPU server's superior computational resources, providing a more accurate assessment of LLMs' capabilities under optimal conditions. In ref. 1, the evaluation of the LLMs' responses was performed manually. However, our study involves a significantly larger number of LLM responses to evaluate: 19 LLM runs (15 unique models, 4 of which were assessed on both machines) across three iterations for each of the 644 questions, resulting in a total of ∼37 000 answers. Given the large scale of this dataset, manual evaluation would be impractical. Therefore, we applied an LLM-as-a-judge strategy [25] assisted by GPT-4o to handle this extensive volume efficiently and ensure accuracy. Fig. 2 summarises the entire pipeline for generating answers and evaluating them.

#### 2.2 Autonomous answer analysis

To estimate the accuracy of GPT-4o in autonomously analysing LLM responses, we manually checked the results for four different LLMs. The manual analysis was not straightforward, as certain models, mainly Llama2 and Llama3, provided ambiguous answers as shown in Fig. 3.
Our approach for determining the correctness of these answers involved adopting the perspective of an examiner and evaluating whether the LLM's response matched the correct answer, focusing solely on the correctness of the selected option rather than the accompanying reasoning or explanatory text.

As shown in Fig. 3, there are several types of ambiguous answers from the Llama2-7b model. Fig. 3(a) illustrates a case where the reasoning and calculation are incorrect, but the correct letter is selected with an incorrect value association. Fig. 3(b) shows the model selecting the correct answer while providing contradictory reasoning. Fig. 3(c) demonstrates a situation where the reasoning and calculation are incorrect, yet the correct answer is chosen. Finally, Fig. 3(d) depicts the correct answer being selected despite incorrect reasoning and associated text.

In the case of MATCH, MCQ, and MCQN questions, responses are assessed solely based on the selected letter (A, B, C, or D) rather than the accompanying reasoning, calculations, or explanatory text. Consequently, for such questions, the answers depicted in Fig. 3 should be considered correct if they align with the expected answer's letter, regardless of any associated reasoning or textual explanations.

Finally, to validate GPT-4o's role as an evaluator, we performed a manual comparison of its judgments against human-assigned scores, as shown in Tables 2 and 3. This analysis demonstrates GPT-4o's accuracy as a judge while also identifying areas where discrepancies arise, particularly for questions requiring nuanced reasoning.

Fig. 3 Example of ambiguous answers from the Llama2-7b model analysed by GPT-4o.
(a) Wrong reasoning and calculation, selected the correct letter but associated the wrong value with it, (b) selected the correct answer but the reasoning says the opposite, (c) reasoning and calculation are incorrect but selected the correct answer, and (d) selected the correct answer but the reasoning and the text associated with the letter C are incorrect.

Table 2 Number of misclassifications (over 644 questions) and estimated accuracy of the evaluating model GPT-4o with the first approach

| Models | Errors GPT-4o | Accuracy GPT-4o |
|---|---|---|
| Claude-3.5-Sonnet | 10 | 98.4% |
| GPT-4-turbo | 17 | 97.4% |
| Llama3-8b (MAC) | 40 | 93.8% |
| Llama2-7b (GPU server) | 48 | 92.5% |
| Overall accuracy | — | 95.5% |

Table 3 Number of misclassifications (out of 644 questions) and the corresponding estimated accuracy of the evaluating model GPT-4o when applying the second approach. The table also includes a comparative analysis with GPT-4o-mini

| Models | Errors GPT-4o | Accuracy GPT-4o | Errors GPT-4o-mini |
|---|---|---|---|
| Llama2-7b (GPU server) | 15 | 97.7% | 28 |
| Llama3-8b (GPU server) | 11 | 98.3% | — |
| Mistral-7b (GPU server) | 16 | 97.5% | — |
| GPT-4 (GPU server) | 11 | 98.3% | 41 |
| Overall accuracy | — | 97.9% | 94.6% |

##### 2.2.1 First approach

Initially, we selected GPT-4o for this task, using a straightforward prompt: “Based on the question and the correct answer, You must tell if the other answer is correct or not by answering only with Correct or Incorrect”, as shown in Fig. 4a, and then submitted the question in the format: “The question is” + ⟨QUESTION⟩ + “, the correct answer is” + ⟨CORRECT ANSWER⟩ + “and the other answer is:” + ⟨MODEL ANSWER⟩. With this approach, the accuracy of GPT-4o in properly evaluating LLMs' answers was estimated to be ∼95.5% overall, which is a strong performance, as shown in Table 2. We define here by “misclassification” the correct answers labelled as incorrect, and incorrect answers labelled as correct. However, we observed significant variation depending on the specific model being evaluated. Models with generally lower performance, such as Llama2 and Llama3, were more susceptible to errors in the evaluation process. Notably, for these models, correct answers were misclassified as incorrect more often than incorrect answers were misclassified as correct. For instance, Llama2-7b was initially credited with 85/644 correct answers; however, we observed that 48 correct answers had been misclassified by GPT-4o. Although the judge still reached an accuracy of ∼92.5% on this model, correcting these misclassifications raised Llama2-7b's total to 133 correct answers, a relative difference of ∼56.5%.

Fig. 4 Comparison of the prompt used for the evaluation of GPT-4o: (a) corresponds to the first approach with a straightforward prompt, while (b) corresponds to the second approach with a step-by-step protocol and detailed explanation required.

##### 2.2.2 Second approach

In an attempt to resolve the issue of misclassification with the first approach, we decided to update the prompt for the evaluation to a more sophisticated one. In this new approach, the questions were formatted differently and the prompt described the task that GPT-4o had to perform more precisely.
Specifically, the prompt instructed the model to evaluate not only the accuracy of the predicted answer but also the validity of the reasoning behind it, if provided. The model was required to ensure that the predicted answer matched the correct option, contained the correct set of matched entities, or was numerically accurate within an acceptable range. Furthermore, the model was tasked with providing a clear and concise explanation of its judgment, focusing on the key factors that influenced its decision. This refined prompt, shown in Fig. 4b, enhanced the model's ability to interpret and evaluate answers more effectively, ultimately improving the accuracy and reliability of the evaluation process.

As shown in Table 3, the accuracy of the evaluation reached ∼97.9%, demonstrating greater stability across different models. Notably, Llama2-7b's misclassifications decreased from 48 in the initial approach to 15, and Llama3-8b's misclassifications dropped from 40 to 11. This significant decrease in misclassifications highlights the effectiveness of the revised evaluation prompt. However, when the revised prompt was applied with GPT-4o-mini as the judge, the results were less conclusive than those of GPT-4o, with 28 misclassifications observed for Llama2-7b and 41 for GPT-4. GPT-4o-mini was made publicly available by OpenAI while the evaluation of the LLMs' answers was under way, and its more attractive price prompted us to try it out on the benchmark.

A key issue with GPT-4o-mini was its failure to recognize some correct answers when the evaluated LLM neglected to include the corresponding letter in its responses. This suggests that while the new prompt greatly enhances evaluation accuracy for higher-performing models, it may still be prone to errors with LLMs with lower reasoning capabilities or when critical elements, such as the letter designation in answers, are omitted.
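The judge accuracies quoted in this section follow directly from the misclassification counts. As a minimal sketch, assuming that accuracy is defined as one minus the fraction of misclassified answers among the 644 questions (the source reports the figures but not the formula), the values can be recomputed as follows. The helper names `judge_accuracy` and `overall_accuracy` and the dictionary layout are illustrative choices, not part of the original pipeline.

```python
N_QUESTIONS = 644  # questions retained from MaScQA in this study

# Misclassification counts of the GPT-4o judge, as quoted in Tables 2 and 3.
errors_first_approach = {
    "Claude-3.5-Sonnet": 10,
    "GPT-4-turbo": 17,
    "Llama3-8b": 40,
    "Llama2-7b": 48,
}
errors_second_approach = {
    "Llama2-7b": 15,
    "Llama3-8b": 11,
    "Mistral-7b": 16,
    "GPT-4": 11,
}


def judge_accuracy(n_errors, n_answers=N_QUESTIONS):
    """Percentage of answers the judge labelled correctly (assumed definition)."""
    return 100.0 * (1.0 - n_errors / n_answers)


def overall_accuracy(errors_by_model, n_answers=N_QUESTIONS):
    """Accuracy pooled over all manually checked models."""
    total_errors = sum(errors_by_model.values())
    return judge_accuracy(total_errors, n_answers * len(errors_by_model))


for label, errors in (("first", errors_first_approach),
                      ("second", errors_second_approach)):
    for model, n_errors in errors.items():
        print(f"{label} approach, {model}: {judge_accuracy(n_errors):.1f}%")
    print(f"{label} approach, overall: {overall_accuracy(errors):.1f}%")
```

Under this assumed definition, the recomputed values agree with those reported in Tables 2 and 3 to one decimal place.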
Future work could explore rening the prompt furtherto handle such cases more effectively or developing additional© 2025 The Author(s). Published by the Royal Society of Chemistryhttp://creativecommons.org/licenses/by/3.0/http://creativecommons.org/licenses/by/3.0/https://doi.org/10.1039/d4dd00319ePaper Digital DiscoveryOpen Access Article. Published on 20 January 2025. Downloaded on 1/20/2025 7:55:58 AM.  This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.View Article Onlinelayers of validation to ensure even greater accuracy andconsistency across all model types.2.3 Random baseline calculationsTo assess the extent to which LLMs outperform chance-levelguessing, we compute a random baseline for each of the ques-tion types in the MaScQA benchmark. Knowing that each of theMATCH, MCQ, andMCQN questions has four options, with onecorrect answer, we derive the mean, m, and standard deviation,s, of the expected number of correct answers from the proper-ties of the binomial distribution, which models the number ofsuccesses (correct answers) in a xed number of independenttrials (questions), each with a xed probability of success p =0.25. Specically,m ¼ n� p; s ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffin� p� ð1� pÞp;where n is the number of questions for a given category, and p isthe probability of guessing correctly.Therefore, and as reported in Table 4, we have:� For MATCH questions (70 total):m = 70 × 0.25 z 17.5, s = (70 × 0.25 × 0.75)0.5 z 3.6.� For MCQ questions (283 total):m = 283 × 0.25 z 70.7, s = (283 × 0.25 × 0.75)0.5 z 7.3.� For MCQN questions (67 total):m = 67 × 0.25 z 16.7, s = (67 × 0.25 × 0.75)0.5 z 3.5.Table 4 Number of correct answers achieved by 19 Large Languagbenchmark. Each model was evaluated through three submissions perwere tested on two different machines to assess potential variations inquestions within each category. 
For comparison, we also incorporate aMachine used LLM MATCH (70) MCQ (2Mac Pro M1 GPT-4-turbo 65.0 � 1.0 236.8 �GPT-4o 67.9 � 0.9 260.1 �GPT-4 60.4 � 1.4 214.8 �GPT-3.5-turbo 25.1 � 2.7 157.8 �Claude-3-Opus 68.7 � 0.6 240.3 �Claude-3-Haiku 40.3 � 0.6 205.1 �Claude-3.5-Sonnet 69.0 � 0.0 248.8 �Llama2-7b 9.3 � 2.4 99.2 �Llama3-8b 22.5 � 0.7 132.9 �GPU server GPT-4 61.4 � 0.5 212.4 �GPT-4o-mini 59.2 � 0.4 226.9 �GPT-3.5-turbo 24.0 � 3.6 158.3 �Llama2-7b 9.1 � 3.7 98.9 �Llama2-70b 18.9 � 3.6 129.3 �Llama3-8b 21.5 � 4.2 153.8 �Llama3-70b 51.8 � 0.9 199.2 �Mistral-7b 19.4 � 2.9 129.2 �Phi3-3.8b 32.9 � 1.4 146.8 �Phi3-14b 38.5 � 3.5 170.5 �Random baseline — 17.5 � 3.6 70.7 �© 2025 The Author(s). Published by the Royal Society of ChemistryFor NUM questions (224 total), a precise numericalreasoning is required, and the answers aren't multiple-choice.Thus, the probability of guessing correctly by chance is effec-tively close to zero. This stems from the nature of the problem:without predened options, the likelihood of randomly select-ing the correct answer in a continuous or large discrete range(e.g., all real numbers or integers) is negligible. Consequently,we x the mean baseline accuracy for NUM questions at 0%with equivalently 0% in standard deviation, acknowledging theunlikelihood of nding the correct answer randomly ona continuous range of real numbers.Finally, the combined m= 105.0 and sz 8.9 for the entire setof MATCH, MCQ, MCQN, and NUM questions are derived fromthe sum of the means and variances (s2) of each questioncategory, respectively.Thus, we can compare the performance of each LLM againstthis random baseline to highlight their ability for knowledgeretrieval, logical reasoning, and numerical computationeffectively.3 ResultsAer establishing the accuracy of the methodology for theautonomous evaluation pipeline, the entire list of LLMs fromTable 1 were evaluated on the MaScQA benchmark with theresults presented in Tables 4 and 5. 
Table 4 summarizes theaverage correctness of each LLM across three iterations on the644 benchmark questions. Additionally, to assess the impact ofhardware on model performance, four LLMs (GPT-4, GPT-3.5-turbo, Llama2-7b and Llama3-8b) were tested on a MAC anda GPU server. This comparative evaluation offers valuablee Models (LLMs) (representing 15 unique models) on the MaScQA1question to ensure robustness and consistency of results. Some LLMsperformance. The numbers in parentheses indicate the number ofrandom baseline as computed in Section 2.383) MCQN (67) NUM (224) Total correct answer (644)2.8 48.8 � 2.7 141.2 � 3.5 491.8 � 4.52.2 50.7 � 2.0 161.0 � 5.9 539.7 � 8.22.4 34.4 � 0.2 80.4 � 6.9 390.1 � 3.52.2 29.1 � 1.3 47.8 � 3.7 259.8 � 8.40.6 49.2 � 0.2 143.6 � 3.8 501.8 � 3.70.2 33.0 � 0.3 77.0 � 0.3 355.4 � 0.50.7 55.1 � 2.0 167.1 � 0.2 540.0 � 1.31.6 14.7 � 4.8 5.8 � 1.7 129.0 � 4.91.1 15.1 � 0.8 18.2 � 1.1 188.8 � 1.22.7 33.9 � 1.7 85.7 � 2.3 393.4 � 3.61.1 47.1 � 0.9 120.8 � 3.3 454.0 � 4.61.5 30.0 � 3.2 49.9 � 0.5 262.2 � 0.910.1 12.3 � 2.8 5.0 � 2.9 125.3 � 10.44.1 20.7 � 3.0 11.8 � 0.7 180.7 � 8.71.1 22.8 � 4.1 21.1 � 0.9 219.1 � 5.12.5 36.5 � 2.0 73.0 � 3.6 360.6 � 1.95.2 10.0 � 2.9 14.4 � 5.1 173.1 � 6.83.9 18.8 � 1.0 36.8 � 6.1 235.2 � 9.65.0 23.9 � 3.6 43.0 � 5.4 275.8 � 7.47.3 16.7 � 3.5 0.0 � 0.0 105.0 � 8.9Digital Discoveryhttp://creativecommons.org/licenses/by/3.0/http://creativecommons.org/licenses/by/3.0/https://doi.org/10.1039/d4dd00319eTable 5 Performance (accuracy (%)) for 15 different LLMs evaluated on the MaScQA1 benchmark. Each LLM was assessed through threesubmissions for each question to ensure robustness and consistency of results. 
For comparison, we also incorporate a random baseline ascomputed in Section 2.3Models MATCH (%) MCQ (%) MCQN (%) NUM (%) Overall accuracy(%)Claude-3-Haiku 57.6 � 0.8 72.5 � 0.1 49.3 � 0.4 34.4 � 0.1 55.2 � 0.1Claude-3-Opus 98.1 � 0.8 84.9 � 0.2 73.4 � 0.3 64.1 � 1.7 77.9 � 0.6Claude-3.5-Sonnet 98.6 � 0.0 87.9 � 0.2 82.2 � 3.0 74.6 � 0.1 83.9 � 0.2GPT-3.5-turbo 35.1 � 4.2 55.9 � 0.6 44.1 � 3.3 21.8 � 1.2 40.5 � 0.9GPT-4 87.0 � 1.6 75.5 � 0.9 51.0 � 1.7 37.1 � 2.4 60.8 � 0.6GPT-4-turbo 92.9 � 1.4 83.7 � 1.0 72.8 � 4.1 63.0 � 1.6 76.4 � 0.7GPT-4o 97.0 � 1.2 91.9 � 0.8 75.6 � 3.0 71.9 � 2.6 83.8 � 1.3GPT-4o-mini 84.6 � 0.6 80.2 � 0.4 70.3 � 1.3 53.9 � 1.5 70.5 � 0.7Llama2-7b 13.2 � 4.0 35.0 � 2.3 20.1 � 5.6 2.4 � 1.0 19.7 � 1.2Llama2-70b 27.0 � 5.2 45.7 � 1.4 30.8 � 4.4 5.3 � 0.3 28.1 � 1.4Llama3-8b 31.4 � 3.9 50.6 � 4.1 28.3 � 7.4 8.8 � 0.8 31.7 � 2.6Llama3-70b 74.0 � 1.2 70.4 � 0.9 54.5 � 2.9 32.6 � 1.6 56.0 � 0.3Mistral-7b 27.8 � 4.1 45.7 � 1.8 14.9 � 4.3 6.4 � 2.3 26.9 � 1.0Phi3-3.8b 47.0 � 2.0 51.9 � 1.4 28.1 � 1.5 16.4 � 2.7 36.5 � 1.5Phi3-14b 55.0 � 5.0 60.2 � 1.8 35.7 � 5.3 19.2 � 2.4 42.8 � 1.1Random baseline 25.0 � 5.2 25.0 � 2.6 25.0 � 5.3 0.0 � 0.0 16.3 � 1.4Fig. 5 Comparison of the number of average correct answers,including the 15 unique LLMs tested, to the total number of questionsper category, i.e., MATCH, MCQ, MCQN, and NUM, as well as for thewhole set of questions. A random baseline per category is indicated asDigital Discovery PaperOpen Access Article. Published on 20 January 2025. Downloaded on 1/20/2025 7:55:58 AM.  This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.View Article Onlineinsights into how computational resources can inuence theperformance and accuracy of LLMs' responses. 

For GPT-4 and GPT-3.5-turbo, no performance differences were observed, as these models rely on the server infrastructure provided by OpenAI, thereby rendering the local hardware inconsequential. However, a notable performance increase of ∼16% was observed for Llama3-8b when run on the GPU server in comparison to the Mac M1. Conversely, Llama2-7b showed no significant performance difference between the two machines, likely due to the Mac M1's sufficient capability to handle the model effectively.

This disparity in performance, particularly with Llama3-8b, can be attributed to the computational demands exceeding the Mac M1's capacity, whereas the GPU server, with superior hardware capabilities, could manage the workload without compromise. Additionally, when running Llama2-7b and Llama3-8b on the Mac M1, the system resources were fully utilized, leaving the machine unable to perform other tasks until completion. This was not the case on the GPU server, where system performance remained stable, underscoring the importance of hardware resources in managing complex models like Llama3-8b.

Fig. 5 illustrates that, in general, LLMs tend to demonstrate higher accuracy when responding to questions that provide a set of possible answers (MATCH, MCQ and MCQN). This can be explained by the fact that, for questions with multiple choices available, the model only has to select from a predefined list of options. Similar to a student guessing the correct answer, the model may choose the correct option even if the underlying reasoning or calculations are flawed. This tendency is further demonstrated in Fig. 3, where models exhibited correct selections despite incorrect reasoning.

An important aspect of our analysis is the evaluation of the LLMs on NUM questions, which present a unique challenge as they do not provide potential answers. This type of question requires models to rely solely on their internal knowledge, reasoning, and computational abilities.
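
To make concrete what a NUM question demands of a model's output, the Python sketch below extracts the last number from a free-form response and compares it with a reference value. It is not the evaluation procedure used in this study; the regular expression, the 1% relative tolerance and the function names are assumptions for illustration only.

```python
import math
import re

# Matches integers, decimals and scientific notation, e.g. "42", "3.61", "1.2e3".
NUMBER_RE = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_last_number(text):
    """Return the last number found in `text` as a float, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    return float(matches[-1]) if matches else None


def is_numeric_match(model_output, reference, rel_tol=0.01):
    """Check whether the last number in the output is within `rel_tol` of the reference."""
    value = extract_last_number(model_output)
    if value is None:
        # With no options to fall back on, a response without a number cannot be correct.
        return False
    return math.isclose(value, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    print(is_numeric_match("The lattice parameter is approximately 3.61", 3.615))
    print(is_numeric_match("I cannot determine the value", 3.615))
```

Unlike the multiple-choice categories, there is no way to land on the right answer by selecting among options: the response has to contain a number, and that number has to be close enough to the reference.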
The results for NUM, shown in Table 5, offer a clear picture of the LLMs' capabilities in these areas. Notably, the performance of the models on NUM questions reveals distinct groups. The difficulties observed in MaScQA's NUM and MCQN categories align with challenges reported in benchmarks such as MATH [18] and ChemBench4k [12]. These tasks often require multi-step computations, reasoning under constraints, and precision in numerical outputs, areas where current LLMs frequently fall short.

Models like Llama2-7b and Mistral-7b, which performed worse than random in MCQN, highlight a persistent issue of shallow numerical reasoning and tokenization inefficiencies. Addressing these limitations may require targeted fine-tuning with domain-specific datasets or improved model architectures better suited for handling numerical reasoning tasks.

As shown in Tables 4, 5 and Fig. 5, most of the tested LLMs outperform the random baseline on average in all question categories, except for Llama2-7b in the MATCH and MCQN categories, as well as Mistral-7b in the MCQN category. For these two LLMs, the results in the MCQN category seem to be hindered by their poor capability on numerical computations, as their performance on the MCQ category alone exceeds the random baseline. However, the behavior of Llama2-7b in the MATCH category could imply that it follows systematically flawed reasoning patterns learned from its training data that are not fitted to materials science and engineering. Additionally, the lack of domain-specific knowledge is hypothesized to also be a culprit.
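
The random-baseline figures used in this comparison can be reproduced with a short binomial calculation. The Python sketch below assumes uniform guessing among four options for MATCH, MCQ and MCQN, and category sizes of 70, 283 and 67 questions (the first two sizes are assumptions). Section 2.3 is not reproduced here, so this may differ from the authors' exact procedure. The sketch also flags which Table 5 entries fall below chance level.

```python
import math

# Assumed number of questions in each category that offers answer options.
OPTION_CATEGORY_SIZES = {"MATCH": 70, "MCQ": 283, "MCQN": 67}
P_GUESS = 0.25  # assumption: four options per question, one of them correct


def random_baseline(n_questions, p=P_GUESS):
    """Mean and standard deviation of the accuracy (%) when guessing n questions uniformly."""
    mean_pct = 100.0 * p
    std_pct = 100.0 * math.sqrt(p * (1.0 - p) / n_questions)
    return mean_pct, std_pct


def categories_below_chance(per_category_pct):
    """List the option-based categories in which a model scores below the chance level."""
    return [
        name
        for name, size in OPTION_CATEGORY_SIZES.items()
        if per_category_pct[name] < random_baseline(size)[0]
    ]


# Accuracies (%) copied from Table 5.
llama2_7b = {"MATCH": 13.2, "MCQ": 35.0, "MCQN": 20.1}
mistral_7b = {"MATCH": 27.8, "MCQ": 45.7, "MCQN": 14.9}

if __name__ == "__main__":
    for name, size in OPTION_CATEGORY_SIZES.items():
        mean_pct, std_pct = random_baseline(size)
        print(f"{name}: {mean_pct:.1f} +/- {std_pct:.1f}")
    print("Llama2-7b below chance:", categories_below_chance(llama2_7b))
    print("Mistral-7b below chance:", categories_below_chance(mistral_7b))
```

Under these assumptions the standard deviations round to the values in the random-baseline row of Table 5, which suggests, without confirming, that the baseline was derived from a similar binomial model.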
This emphasizes the need for domain-targeted fine-tuning or retraining to align LLMs with materials science tasks. Importantly, such behaviors underscore the value of rigorous benchmarking across diverse question types to identify and address weaknesses in model reasoning capabilities. Also, the issues observed in the MATCH and MCQ categories are not unique to MaScQA. Similar limitations have been identified in benchmarks like SciQ [11] and MoleculeQA [13]. For MATCH tasks, LLMs struggle to establish logical relationships between entities, often defaulting to heuristic-based reasoning. MCQ tasks, while simpler, can be impacted by pattern exploitation, where models rely on superficial cues rather than true conceptual understanding.

These trends underscore the importance of prompt optimization and domain-specific fine-tuning to improve structured reasoning and conceptual alignment in materials science tasks. Future work could explore methods to guide models more effectively through MATCH-type reasoning frameworks and numerical computations.

On NUM questions, Claude-3.5-Sonnet emerges as the top performer, closely followed by GPT-4o, both achieving an accuracy exceeding ∼70%. This level of accuracy is considered acceptable given the complexity of the task. Claude-3-Opus and GPT-4-turbo closely follow with ∼64% and ∼63%, respectively, both models demonstrating strong effectiveness at handling numerical computations relative to the average across the pool of LLMs of ∼30.6% (see Fig. 5). Notably, the best studied open-source model, Llama3-70b, achieves ∼32.6%, closely aligned with GPT-4 and Claude-3-Haiku, underscoring its competitiveness with closed-source models.

Furthermore, the performance comparison between Phi3-3.8b, Phi3-14b, and GPT-3.5-turbo reveals minimal differences, suggesting that the parameter count may not be the sole determinant of an LLM's effectiveness. Interestingly, Phi3-3.8b outperforms several models with double its parameter count, including Llama3-8b, Mistral-7b, and Llama2-7b.
The relatively poor performance of these larger models highlights the complexity of balancing model size with other factors such as architecture and training data quality, which can significantly impact overall performance.

The models utilized in the study by Zaki et al. [1] show comparable performance to those in our current study. Notably, Llama2-70b exhibited slightly improved performance in our evaluation, with an accuracy of 28.1 ± 1.4% compared to the 24.0% reported by Zaki et al. This difference could be attributed to the application of the chain-of-thought (CoT) technique on Llama2-70b in their study, as well as the systematic variation in computational resources and machines used.

In contrast, GPT-4 and GPT-3.5-turbo demonstrated consistent performance across both studies. Specifically, GPT-4 achieved an accuracy of 60.8 ± 0.6% in our work, closely aligning with the 61.38% reported by Zaki et al. Similarly, GPT-3.5-turbo performed at 40.5 ± 0.9%, which is consistent with the 38.31% observed in their study. These results suggest that the performance of these models is robust across different experimental setups and conditions. The slight variations in accuracy can likely be attributed to the difference in temperature settings used during evaluation.

The evaluation of the LLMs, shown in Table 5 and Fig. 6, demonstrates that Claude-3.5-Sonnet and GPT-4o are among the top performers, achieving overall accuracies of approximately 84% (see Fig. 1 in the ESI† for details concerning the LLMs' average accuracy on each category MATCH, MCQ, MCQN, and NUM). Claude-3.5-Sonnet emerges as the highest performer, with an overall accuracy of 83.9% and high stability.
Its exceptional performance across the MATCH and NUM categories underscores its proficiency in pattern recognition and numerical reasoning, suggesting that it excels in tasks requiring both structured matching and complex calculations.

GPT-4o closely follows with an overall accuracy of 83.8%. It demonstrates particular strength in the MCQ category, attaining the highest accuracy of 91.9%. This indicates that GPT-4o is highly effective at handling multiple-choice questions where options are provided. Additionally, GPT-4o's performance in NUM at 71.9% suggests a solid capability in numerical reasoning, although it slightly lags behind Claude-3.5-Sonnet in this area.

Claude-3-Opus and GPT-4-turbo also exhibit commendable performance, with overall accuracies of 77.9% and 76.4%, respectively. These models show a balanced capability across different question types, reflecting their robustness and versatility in handling diverse tasks. Their relatively high performance across the MATCH and MCQ categories indicates that they are reliable choices for a range of question types, though they do not quite reach the top levels achieved by Claude-3.5-Sonnet and GPT-4o.

Fig. 6 Average overall performance for the studied 15 unique LLMs with their standard deviation obtained from three runs over the whole set of 644 MATCH, MCQ, MCQN, and NUM questions.

GPT-4 and GPT-4o-mini achieved overall accuracies of 60.8% and 70.5%, respectively. While GPT-4 had lower performance in the NUM category, it was relatively strong in the MATCH and MCQ categories. Llama3-70b also falls into the mid-tier category with an overall accuracy of 56.0%. Although it did not outperform the leading models, it showed decent performance in the MATCH and MCQ categories. This model's performance highlights its capability in handling structured questions, although it still lags behind the top performers. Llama2-7b, Llama2-70b, Llama3-8b, and Mistral-7b exhibited poor performance across all categories, with overall accuracies below 32%. These models struggled particularly in the NUM category, where their accuracies were very low (ranging from 2.4% to 8.8%). This significant shortfall in numerical reasoning capabilities underscores major limitations in these models' ability to handle complex quantitative tasks, which might be due to their training data or architectural constraints. Also, several factors may explain the observed limitations of open-source models on numerical reasoning tasks:

- Training data limitations: open-source models are often trained on publicly available datasets, which may lack sufficient examples of numerical reasoning, particularly in scientific domains like materials science.
- Tokenization inefficiencies: numbers are tokenized as sequences rather than atomic units, leading to errors in operations involving precision or formatting.
- Smaller model capacity: models with fewer parameters have limited ability to perform complex, multi-step computations compared to their larger closed-source counterparts.
- Reasoning biases: open-source models prioritize fluency during pretraining, resulting in outputs that appear plausible but lack numerical accuracy.

Then, Phi3-3.8b and Phi3-14b performed better than the models discussed above, with overall accuracies of 36.5% and 42.8%, respectively. Despite these improvements, their performance still fell short of the top-tier models, particularly in complex tasks such as MCQN and NUM.
This suggests that while these models have some capabilities, they are not yet competitive with the leading models in handling more challenging question types.

Addressing these gaps requires a combination of strategies. For example, fine-tuning open-source models on curated datasets with extensive numerical tasks could significantly improve their reasoning capabilities. Additionally, advancements in tokenization strategies and enhanced pretraining methods could help smaller models better handle numerical precision, rounding, and formatting, which are critical elements for scientific applications like materials discovery.

Such targeted improvements are particularly relevant for tasks like calculating material properties or designing experiments, where numerical accuracy is essential. By bridging these gaps, open-source models can evolve into robust tools for domain-specific applications in materials science.

From the perspective of the categories of questions:

- MATCH: Claude-3.5-Sonnet achieved the highest accuracy (98.6%), closely followed by Claude-3-Opus (98.1%) and GPT-4o (97.0%). This suggests that these models are particularly adept at tasks requiring pattern recognition and matching.
- MCQ: GPT-4o led in this category with an accuracy of 91.9%, indicating its strength in handling multiple-choice questions with provided options and its ability to navigate through choices efficiently.
- MCQN: Claude-3.5-Sonnet achieved the highest accuracy (82.2%), reflecting its capability to integrate numerical reasoning within the context of multiple-choice questions. The model's strong performance in this category suggests that it can effectively handle questions that require both choice selection and numerical computation.
- NUM: the NUM category, which requires open-ended numerical answers without provided options, was the most challenging. Claude-3.5-Sonnet performed the best with an accuracy of 74.6%, and its advanced numerical reasoning abilities suggest that it is particularly adept at generating accurate numerical responses when no options are provided.

The results in Fig. 6 highlight that while different models exhibit strengths in specific areas, Claude-3.5-Sonnet's performance across both pattern recognition and numerical reasoning tasks positions it as a particularly versatile model. The challenges observed in the NUM category across all models underscore the need for continued advancements in handling open-ended numerical reasoning tasks.

### 4 Discussion

The results of this study underscore the current superiority of closed-source models, such as the GPT and Claude families of models, over their open-source counterparts like Llama, Mistral, and Phi3. Closed-source models consistently demonstrated higher accuracy across various question categories, indicating their advanced architecture, extensive training, and optimization for a broad range of tasks, including the fields of materials science and engineering. However, the potential of open-source models should not be overlooked. Despite their lower performance in this benchmark, open-source models offer opportunities for optimization through methods like prompt engineering and fine-tuning.
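
As one illustration of the prompt-engineering route, the Python sketch below assembles a chain-of-thought style prompt for a benchmark question. The template wording, the function name `build_cot_prompt`, the final-answer format and the example question are hypothetical choices for illustration; they are not the prompts or questions used in this study.

```python
def build_cot_prompt(question, options=None):
    """Assemble a step-by-step prompt; pass options=None for open-ended NUM questions."""
    lines = [
        "You are answering a materials science exam question.",
        f"Question: {question}",
    ]
    if options:
        # Multiple-choice style (MATCH, MCQ, MCQN): list the labelled options.
        lines.append("Options:")
        lines.extend(f"({label}) {text}" for label, text in options.items())
        answer_format = "Final answer: <option letter>"
    else:
        # Open-ended numerical style (NUM): ask for a single number with its unit.
        answer_format = "Final answer: <number with unit>"
    lines.append("Reason step by step, stating each formula and intermediate value.")
    lines.append(f"Finish with a single line of the form '{answer_format}'.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example question, not taken from MaScQA.
    print(build_cot_prompt(
        "Which crystal structure does alpha-iron have at room temperature?",
        {"A": "FCC", "B": "BCC", "C": "HCP", "D": "Simple cubic"},
    ))
```

Fixing the format of the final line is a design choice that makes the response easier to score automatically, whichever evaluator is used downstream.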
Fine-tuning, in particular, is a powerful tool that allows these models to be adapted to specific tasks or datasets, potentially enhancing their performance in specialized domains such as materials science and chemistry.

Overall, the inclusion of a random baseline for the MATCH, MCQ, MCQN, and NUM categories highlights the significant advantage provided by LLMs in answering materials science questions. Most of the tested LLMs, except Llama2-7b and Mistral-7b, achieve accuracies demonstrating their ability to display reasoning, i.e., a consistent arrangement of their fragments of memorized knowledge, and to retrieve information far beyond chance-level guessing. Notably, the NUM category, which lacks predefined options, showcases the models' numerical reasoning capabilities, a critical skill for tasks such as calculating material properties or experimental parameters.

Phi3-3.8b stands out as a particularly promising candidate for such optimization. Despite having a relatively low number of parameters, it achieved an overall accuracy of 36.5%, which is commendable given its smaller scale. This suggests that with targeted fine-tuning and prompt optimization, Phi3-3.8b could potentially improve its performance significantly without demanding an expensive hardware load.

An interesting direction for future work could involve systematically fine-tuning Phi3-3.8b and other open-source models on domain-specific datasets, such as materials science or other technical fields. The MaScQA benchmark results directly inform the development of a RAG system tailored for materials science applications.
Such a system will enable AI tools to assist researchers in tasks like synthesizing knowledge from massive literature corpora, proposing experimental designs, and predicting material properties with minimal human input.

For example, strong performance on NUM and MCQ questions demonstrates an LLM's capability to accurately calculate material parameters or resolve conceptual queries, skills essential for automating computational tasks or pre-experimental analyses. Fine-tuning open-source models like Phi3-3.8b using curated materials science datasets will ensure that these tools become domain-optimized, democratizing access to AI-powered solutions in materials research. Additionally, prompt engineering strategies could be explored to better leverage the model's existing capabilities, potentially boosting its performance in specific tasks. By carefully crafting prompts that guide the model's reasoning process, we can help it generate more accurate and contextually appropriate responses. This approach is particularly useful for numerical reasoning tasks, where precise wording can influence the model's output. These approaches not only aim to bridge the performance gap between open- and closed-source models but also promote the democratization of AI by enhancing the utility of models that are freely accessible to the community.

While closed-source models currently lead in performance, the flexibility and accessibility of open-source models present a valuable opportunity for ongoing research and development. By focusing on fine-tuning and prompt optimization, it is possible to enhance the performance of open-source models, making them viable alternatives for specialized applications and contributing to the advancement of open AI technologies for diverse domains, materials science included.

While GPT-4o provides a creative and scalable approach for automating performance evaluation, it is not without limitations.
Discrepancies between GPT-4o's assessments and human-assigned scores highlight challenges such as potential biases in LLM judgments, inconsistencies in reasoning, and difficulties with questions requiring deeper conceptual understanding. For this reason, we have complemented GPT-4o-based evaluations with traditional accuracy metrics, ensuring that the results remain quantitatively robust and reliable. Future work could explore hybrid evaluation frameworks that combine automated LLM-based scoring with rigorous manual validation.

The discrepancy observed in evaluation errors for lower-performing models suggests that outputs from these models are more challenging for automated evaluators like GPT-4o to assess accurately. Also, several factors could contribute to the higher susceptibility of lower-performing models to evaluation errors:

- Ambiguity in outputs: lower-performing models often produce ambiguous or incomplete answers, which are inherently harder to evaluate. Outputs may include partially correct information or lack the precision required, particularly for numerical and structured tasks.
- Hallucinations and shallow reasoning: these models are more prone to hallucinations (confident but incorrect outputs) and rely on superficial reasoning, especially when confronted with multi-step or complex questions. Such outputs can mislead evaluators like GPT-4o.
- Tokenization and numerical precision issues: numerical reasoning tasks (e.g., NUM) require strict handling of tokenization and precision. Lower-quality models frequently generate outputs with formatting errors or rounding inconsistencies, increasing evaluation discrepancies.
- Evaluator bias: automated evaluators like GPT-4o may exhibit biases toward linguistic fluency and coherence. Outputs from lower-performing models, which tend to lack these qualities, can be disproportionately misclassified.

These observations offer a preliminary explanation for the observed phenomenon. A more detailed investigation involving model-level diagnostics or deeper access to closed-source architectures would be required to fully analyze this behavior. Future work could focus on developing error analysis frameworks and improving evaluator calibration to better handle outputs from lower-performing models.

This study represents a critical first step in identifying the best-performing LLMs as candidates for fine-tuning and integration into a materials science RAG system. To further advance the applicability of LLMs in materials science, several directions for future work are identified:

- Fine-tuning open-source models: while models like Phi3-3.8b show promise, fine-tuning on curated, domain-specific datasets rich in materials science literature and numerical reasoning tasks will be essential for improving their capabilities.
- Exploring temperature effects: adjusting temperature settings could dynamically optimize model outputs for tasks requiring both creativity and precision, particularly in numerical and reasoning-heavy questions.
- Advanced error correction strategies: implementing techniques such as CoT prompting, in-context learning (ICL), and post-hoc validation methods will address hallucinations, ambiguity, and shallow reasoning in lower-performing models.
- Improved tokenization for numerical tasks: enhancing tokenization strategies to treat numerical inputs as atomic units rather than sequences will reduce errors in numerical reasoning and precision.

The end goal is to create an AI system capable of comprehensively reasoning over materials science knowledge, accelerating discoveries and reducing the time between hypothesis generation and experimental validation.

### 5 Conclusions

This study used the MaScQA benchmark, developed by Zaki et al. [1], to assess the performance of 15 different LLMs across a diverse set of tasks. The MaScQA dataset is notable for its inclusion of questions from various sub-fields of materials science and engineering and for its range of question types (MATCH, MCQ, MCQN, and NUM), each of which evaluates different aspects of model capability, such as reasoning, pattern recognition, numerical computation, and decision-making. Among the models tested, two demonstrated exceptional performance: Claude-3.5-Sonnet and GPT-4o. Claude-3.5-Sonnet achieved an overall accuracy of 83.9 ± 0.2%, while GPT-4o closely followed with an accuracy of 83.8 ± 1.3%. These results highlight the advanced capabilities of these models in handling a wide array of tasks, particularly in domains requiring robust pattern recognition and complex numerical reasoning.

The variety of question types in the MaScQA benchmark allowed for a comprehensive evaluation of the LLMs, revealing not only the strengths of the top-performing models but also the specific areas where other models struggled. For instance, the NUM category, which involves open-ended numerical questions, proved to be particularly challenging for most models, underscoring the ongoing difficulties in developing LLMs with strong numerical computation abilities.

Overall, the findings from this study emphasize the potential of using benchmarks like MaScQA to push the boundaries of LLM capabilities for specific domains like materials science and engineering. The high performance of Claude-3.5-Sonnet and GPT-4o suggests that while state-of-the-art models continue to improve, there remains significant potential for further improvements, particularly for open-source models that can be fine-tuned and optimized for specific tasks.
Future work in thisarea will focus on enhancing the capabilities of open-sourcemodels through targeted ne-tuning and prompt engineering,potentially narrowing the gap between open- and closed-sourcemodels and contributing to the broader development of acces-sible and high-performing AI systems for science.Data availabilityThis study was carried out using publicly available data fromMaScQA benchmark at https://github.com/M3RG-IITD/MaScQArelated to the article: https://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00188a. The code used for thisstudy can be found in the repository LLM_comparison_4MS athttps://github.com/Lambard-ML-Team/LLM_comparison_4MS.In this repository you can nd the code used to submit thequestions to the different models (https://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/questions_to_models) and the code used for the GPT-4oanalysis of the response (https://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/GPT_4o_Analysis).Data for this paper, including the answer for each model andthe result of the analysis by GPT-4o are also available athttps://github.com/Lambard-ML-Team/LLM_comparison_4MS.Author contributionsChristophe Bajan: conceptualization, methodology, soware,data analysis, writing—original dra. Guillaume Lambard:conceptualization, methodology, soware, validation,resources, supervision, funding acquisition, project adminis-tration, writing—nal dra.© 2025 The Author(s). 
Published by the Royal Society of Chemistryhttps://github.com/M3RG-IITD/MaScQAhttps://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00188ahttps://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00188ahttps://github.com/Lambard-ML-Team/LLM_comparison_4MShttps://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/questions_to_modelshttps://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/questions_to_modelshttps://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/questions_to_modelshttps://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/GPT_4o_Analysishttps://github.com/Lambard-ML-Team/LLM_comparison_4MS/tree/main/GPT_4o_Analysishttps://github.com/Lambard-ML-Team/LLM_comparison_4MShttp://creativecommons.org/licenses/by/3.0/http://creativecommons.org/licenses/by/3.0/https://doi.org/10.1039/d4dd00319ePaper Digital DiscoveryOpen Access Article. Published on 20 January 2025. Downloaded on 1/20/2025 7:55:58 AM.  This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.View Article OnlineConflicts of interestThere are no conicts to declare.Notes and references1 M. Zaki, N. A. Krishnan, et al., Digital Discovery, 2024, 3, 313–327.2 W. Lu, R. K. Luu and M. J. Buehler, arXiv, 2024, preprint,arXiv:2409.03444, pp. 1–56, DOI: 10.48550/arXiv.2409.03444.3 A. Trewartha, N. Walker, H. Huo, S. Lee, K. Cruse,J. Dagdelen, A. Dunn, K. A. Persson, G. Ceder and A. Jain,Patterns, 2022, 3, 100488.4 V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong,O. Kononova, K. A. Persson, G. Ceder and A. Jain, Nature,2019, 571, 95–98.5 Y. An, J. Greenberg, A. Kalinowski, X. Zhao, X. Hu, F. J. Uribe-Romo, K. Langlois, J. Furst and D. A. Gómez-Gualdrón, arXiv,2024, preprint, arXiv:2309.11361, pp. 1–14, DOI: 10.48550/arXiv.2309.11361.6 H. Zhang, Y. Song, Z. Hou, S. Miret and B. Liu, Findings of theAssociation for Computational Linguistics: EMNLP 2024,Miami, Florida, USA, 2024, pp. 3369–3382.7 Y. Wan, Y. Liu, A. Ajith, C. Grazian, B. Hoex, W. 
Zhang, C. Kit,T. Xie and I. Foster, arXiv, 2024, preprint, arXiv:2405.09939,pp. 1–22, DOI: 10.48550/arXiv.2405.09939.8 T. Guo, K. Guo, B. Nan, Z. Liang, Z. Guo, N. V. Chawla,O. Wiest and X. Zhang, arXiv, 2023, preprint,arXiv:2305.18365, pp. 1–27, DOI: 10.48550/arXiv.2305.18365.9 K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung,N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne,M. Seneviratne, P. Gamble, C. Kelly, A. Babiker, N. Schärli,A. Chowdhery, P. Manseld, D. Demner-Fushman,B. Agüera y Arcas, D. Webster, G. S. Corrado, Y. Matias,K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar,J. Barral, C. Semturs, A. Karthikesalingam andV. Natarajan, Nature, 2023, 620, 172–180.10 L. Sun, Y. Han, Z. Zhao, D. Ma, Z. Shen, B. Chen, L. Chen andK. Yu, arXiv, 2023, preprint, arXiv:2308.13149, pp. 1–9, DOI:10.48550/arXiv.2308.13149.11 J. Welbl, N. F. Liu and M. Gardner, arXiv, 2017, preprint,arXiv:1707.06209, pp. 1–13, DOI: 10.48550/arXiv.1707.06209.12 D. Zhang, W. Liu, Q. Tan, J. Chen, H. Yan, Y. Yan, J. Li,W. Huang, X. Yue, D. Zhou, S. Zhang, M. Su, H.-S. Zhong,© 2025 The Author(s). Published by the Royal Society of ChemistryY. Li and W. Ouyang, arXiv, 2024, preprint,arXiv:2402.06852, pp. 1–26, DOI: 10.48550/arXiv.2402.06852.13 X. Lu, H. Cao, Z. Liu, S. Bai, L. Chen, Y. Yao, H.-T. Zheng andY. Li, arXiv, 2024, preprint, arXiv:2403.08192, pp. 1–19, DOI:10.48550/arXiv.2403.08192.14 Z.-Y. Chen, F.-K. Xie, M. Wan, Y. Yuan, M. Liu, Z.-G. Wang,S. Meng and Y.-G. Wang, Chin. Phys. B, 2023, 32, 118104.15 T. Xie, Y. Wan, W. Huang, Z. Yin, Y. Liu, S. Wang, Q. Linghu,C. Kit, C. Grazian, W. Zhang, I. Razzak and B. Hoex, arXiv,2023, preprint, arXiv:2308.13565, pp. 1–19, DOI: 10.48550/arXiv.2308.13565.16 Y. Chiang, C.-H. Chou and J. Riebesell, arXiv, 2024, preprint,arXiv:2401.17244, pp. 1–32, DOI: 10.48550/arXiv.2401.17244.17 Y. Song, S. Miret and B. Liu, arXiv, 2023, preprint,arXiv:2305.08264, pp. 1–17, DOI: 10.48550/arXiv.2305.08264.18 D. Hendrycks, C. Burns, S. 
Kadavath, A. Arora, S. Basart,E. Tang, D. Song and J. Steinhardt, arXiv, 2021, preprint,arXiv:2103.03874, pp. 1–22, DOI: 10.48550/arXiv.2103.03874.19 M. Courbariaux, Y. Bengio and J.-P. David, ICLR (Workshop),2015.20 A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary,C. Bamford, D. S. Chaplot, D. de las Casas, E. Bou Hanna,F. Bressand, G. Lengyel, G. Bour, G. Lample, L. RenardLavaud, L. Saulnier, M.-A. Lachaux, P. Stock,S. Subramanian, S. Yang, S. Antoniak, T. Le Scao,T. Gervet, T. Lavril, T. Wang, T. Lacroix and W. El Sayed,arXiv, 2024, preprint, arXiv:2401.04088, pp. 1–13, DOI:10.48550/arXiv.2401.04088.21 C. Brugger, S. Weithoffer, C. De Schryver, U. Wasenmüllerand N. Wehn, Adv. Radio Sci., 2014, 12, 75–81.22 OpenAI, OpenAI API, 2023, https://openai.com/api/,accessed: May-Aug 2024.23 Anthropic, Anthropic API, 2023, https://www.anthropic.com,accessed: May-Aug 2024.24 Ollama, Ollama API, 2023, https://ollama.ai, accessed: May-Aug 2024.25 L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu,Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang,J. E. Gonzalez and I. 
Stoica, Thirty-seventh Conference onNeural Information Processing Systems Datasets andBenchmarks Track, 2023.Digital Discoveryhttps://doi.org/10.48550/arXiv.2409.03444https://doi.org/10.48550/arXiv.2309.11361https://doi.org/10.48550/arXiv.2309.11361https://doi.org/10.48550/arXiv.2405.09939https://doi.org/10.48550/arXiv.2305.18365https://doi.org/10.48550/arXiv.2308.13149https://doi.org/10.48550/arXiv.1707.06209https://doi.org/10.48550/arXiv.2402.06852https://doi.org/10.48550/arXiv.2403.08192https://doi.org/10.48550/arXiv.2308.13565https://doi.org/10.48550/arXiv.2308.13565https://doi.org/10.48550/arXiv.2401.17244https://doi.org/10.48550/arXiv.2305.08264https://doi.org/10.48550/arXiv.2103.03874https://doi.org/10.48550/arXiv.2401.04088https://openai.com/api/https://www.anthropic.comhttps://ollama.aihttp://creativecommons.org/licenses/by/3.0/http://creativecommons.org/licenses/by/3.0/https://doi.org/10.1039/d4dd00319e Exploring the expertise of large language models in materials science and metallurgical engineeringElectronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00319e Exploring the expertise of large language models in materials science and metallurgical engineeringElectronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00319e Exploring the expertise of large language models in materials science and metallurgical engineeringElectronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00319e Exploring the expertise of large language models in materials science and metallurgical engineeringElectronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00319e Exploring the expertise of large language models in materials science and metallurgical engineeringElectronic supplementary information (ESI) available. 