# Fileset

[Jxiv_article.zip](https://mdr.nims.go.jp/filesets/36fa86e1-dc48-45b2-aa9d-2e59835dfe17/download)

## Creator

KAWANO, Hiroyuki, SATO, Fumitaka, [YOSHITAKE, Michiko](https://orcid.org/0000-0002-0973-5666), MOTEKI, Fuma, TERAOKA, Hiroshi

## Rights

[Creative Commons BY-ND Attribution-NoDerivatives 4.0 International](https://creativecommons.org/licenses/by-nd/4.0/)

## Other metadata

[MaterialBERT for Natural Language Processing of Materials Science Texts](https://mdr.nims.go.jp/datasets/936f7bdd-e1eb-4100-929b-db71a00f8df9)

## Fulltext

Jxiv_article/journal_list.xlsx (Sheet1)  publisher  journal ISSN  journal name  year  (journal names were found via Yahoo search for "scientific journal ISSN #")  acs  0001-4842  Accounts of Chemical Research  2017-2018    0002-7863  Journal of the American Chemical Society  2016-2018    0003-2700  Analytical Chemistry  2017-2018    0006-2960  Biochemistry  2017-2018    0009-2665  Chemical Reviews  2017-2018    0013-936x  Environmental Science & Technology  2017-2018    0020-1669  Inorganic Chemistry  2017-2018    0021-8561  Journal of Agricultural and Food Chemistry  2016-2018    0021-9568  Journal of Chemical & Engineering Data  2017-2018    0021-9584  Journal of Chemical Education  2017-2018    0022-2623  Journal of Medicinal Chemistry  2017-2018    0022-3263  The Journal of Organic Chemistry  2017-2018    0024-9297  Macromolecules  2016-2018    0163-3864  Journal of Natural Products  2017-2018    0276-7333  Organometallics  2017-2018    0743-7463  Langmuir  2016-2018    0887-0624  Energy & Fuels  2017-2018    0888-5885  Industrial & Engineering Chemistry Research  2016-2018    0893-228x  Chemical Research in Toxicology  2017-2018    0897-4756  Chemistry of Materials  2016-2018    1043-1802  Bioconjugate Chemistry  2017-2018    1083-6160  Organic Process Research & Development  2017-2018    1089-5639  The Journal of Physical Chemistry A  2017-2018    1520-6106  The Journal of Physical Chemistry B  2016-2018    1523-7060  Organic Letters  2017-2018    1525-7797  Biomacromolecules  2016-2018    1528-7483  Crystal Growth & Design  2017-2018    1530-6984  Nano Letters  2017-2018    1535-3893  Journal of Proteome Research  2017-2018    1543-8384  Molecular Pharmaceutics  2017-2018    1549-9596  Journal of Chemical Information and Modeling  2017-2018    1549-9618  Journal of Chemical Theory and Computation  2017-2018    1554-8929  ACS Chemical Biology  2017-2018    1932-7447  The Journal of Physical Chemistry C  2017-2018    1936-0851  ACS Nano  2017-2018    1944-8244  ACS Applied Materials &
Interfaces  2017-2018    1948-5875  ACS Medicinal Chemistry Letters  2017-2018    1948-7185  The Journal of Physical Chemistry Letters  2017-2018    1948-7193  ACS Chemical Neuroscience  2017-2018    2155-5435  ACS Catalysis  2017-2018    2156-8952  ACS Combinatorial Science  2017-2018    2161-1653  ACS Macro Letters  2017-2018    2161-5063  ACS Synthetic Biology  2017-2018    2168-0485  ACS Sustainable Chemistry & Engineering  2017-2018    2328-8930  Environmental Science & Technology Letters  2017-2018    2330-4022  ACS Photonics  2017-2018    2373-8227  ACS Infectious Diseases  2017-2018    2373-9878  ACS Biomaterials Science & Engineering  2017-2018    2374-7943  ACS Central Science  2017-2018    2379-3694  ACS Sensors  2017-2018    2380-8195  ACS Energy Letters  2017-2018    2470-1343  ACS Omega  2018    2472-3452  ACS Earth and Space Chemistry  2017-2018    2574-0962  ACS Applied Energy Materials  2017-2018    2574-0970  ACS Applied Nano Materials  2017-2018    2575-9108  ACS Pharmacology & Translational Science  2018    2576-6422  ACS Applied Bio Materials  2018  aip  0003-6951  Applied Physics Letters  2005-2019    0021-8979  Journal of Applied Physics  2005-2019    0021-9606  JOURNAL OF CHEMICAL PHYSICS  2005-2019    0022-2488  Journal of mathematical physics  2005-2019    0034-6748  REVIEW OF SCIENTIFIC INSTRUMENTS  2005-2019    0047-2689  Journal of Physical and Chemical Reference Data  2008-2018    0734-2101  Journal of Vacuum Science and Technology A  2005-2019    1042-346x  Journal of Laser Applications  2011-2015, 2017-2019    1054-1500  An Interdisciplinary Journal of Nonlinear Science  2005-2019    1055-5269  Surface Science Spectra  2008-2018    1070-664x  Physics of Plasmas  2006-2019    1070-6631  Physics of Fluids  2005-2019    1071-1023  Journal of Vacuum Science and Technology B  2004-2009    1932-1058  Biomicrofluidics  2007-2019    1934-8630  Biointerphases  2015-2019    1938-1387  Journal of Laser Applications  2014-2016  onlineISSN    
1941-7012  Journal of Renewable and Sustainable Energy  2009-2019    2166-532x  APL Materials  2018    2166-2746  Journal of Vacuum Science & Technology B  2010-2019  aps  0031-9007  Physical Review Letters  2015-2019    1098-0121  Physical Review B  2014-2015    2469-9940  Physical Review B  2016    2469-9950  Physical Review B  2016-2017  elsevier  0008-6223  CARBON  2015-2017    0009-2614  Chemical Physics Letters  2015-2017    0010-938x  Corrosion Science  2015-2017    0013-4686  Electrochimica Acta  2015-2017    0014-3057  European Polymer Journal  2017-2018    0021-9517  Journal of Catalysis  2015-2018    0021-9797  Journal of Colloid and Interface Science  2015-2017    0022-0248  Journal of Crystal Growth  2015-2017    0022-2313  Journal of Luminescence  2015-2017    0022-3093  Journal of Non-Crystalline Solids  2015-2019    0022-3697  Journal of Physics and Chemistry of Solids  2015-2017    0022-4596  Journal of Solid State Chemistry  2015-2017    0025-5408  Materials Research Bulletin  2015-2017    0032-3861  Polymer  2015-2017    0038-1098  Solid State Communications  2015-2017    0038-1101  Solid-State Electronics  2017-2018    0039-6028  Surface Science  2015-2017    0040-6090  Thin Solid Films  2015-2017    0079-6425  Progress in Materials Science  2015-2017    0079-6786  Progress in Solid State Chemistry  2017-2018    0142-1123  International Journal of Fatigue  2015-2017    0142-9612  Biomaterials  2015-2017    0167-577x  Materials Letters  2015-2017    0167-2738  Solid State Ionics  2015-2017    0167-5729  Surface Science Reports  2015-2017    0169-4332  Applied Surface Science  2015-2018    0254-0584  Materials Chemistry and Physics  2015-2017    0266-3538  Composites Science and Technology  2017-2018    0272-8842  Ceramics International  2015-2017    0304-3991  Ultramicroscopy  2015-2017    0304-8853  Journal of Magnetism and Magnetic Materials  2015-2018    0364-5916  CALPHAD: Computer Coupling of Phase Diagrams and Thermochemistry  2015-2017
0376-7388  Journal of Membrane Science  2017-2018    0378-7753  Journal of Power Sources  2015-2017    0379-6779  Synthetic Metals  2017-2018    0921-4534  Physica C: Superconductivity and its applications  2015-2017    0921-5093  Materials Science & Engineering A  2015-2017    0921-5107  Materials Science and Engineering B  2017-2018    0925-3467  Optical Materials  2015-2017    0925-4005  Sensors and Actuators B: Chemical  2015-2017    0925-8388  Journal of Alloys and Compounds  2015-2017    0925-9635  Diamond & Related Materials  2015-2017    0926-860x  Applied Catalysis A: General  2015-2016    0926-3373  Applied Catalysis B: Environmental  2015-2016    0927-0248  Solar Energy Materials & Solar Cells  2015-2017    0927-796x  Materials Science and Engineering R  2015-2017    0927-7765  Colloids and Surfaces B: Biointerfaces  2015-2017    0955-2219  Journal of the European Ceramic Society  2015-2017    0966-9795  Intermetallics  2015-2017    1005-0302  Journal of Materials Science & Technology  2017-2018    1293-2558  Solid State Sciences  2017-2018    1359-6454  Acta Materialia  2015-2017    1359-6462  Scripta Materialia  2015-2018    1369-7021  Materials Today  2017-2018    1387-7003  Inorganic Chemistry Communications  2017-2019    1388-2481  Electrochemistry Communications  2015-2017    1566-1199  Organic Electronics  2015-2016    1742-7061  Acta Biomaterialia  2015-2017    2211-2855  Nano Energy  2017-2018  iop  0004-637x  The Astrophysical Journal  2017    0004-6256  The Astronomical Journal  2017-2018    0004-6280  Publications of the Astronomical Society of the Pacific  2015-2018    0021-4922  Japanese Journal of Applied Physics  2015-2018    0022-3727  Journal of Physics D: Applied Physics  2015-2018    0026-1394  Metrologia  2015-2018    0029-5515  Nuclear Fusion  2015-2018    0031-8949  Physica Scripta  2015-2018    0031-9120  Physics Education  2017-2018    0031-9155  Physics in Medicine & Biology  2015-2018    0034-4885  Reports on Progress in 
Physics  2015-2018    0036-021x  Russian Chemical Reviews  2015-2018    0036-0279  Russian Mathematical Surveys  2015-2018    0067-0049  The Astrophysical Journal Supplement Series  2017-2018    0143-0807  European Journal of Physics  2015-2018    0169-5983  The Japan Society of Fluid Mechanics  2017-2018    0253-6102  Communications in Theoretical Physics  2018    0256-307x  Chinese Physics Letters  2017-2018    0264-9381  Classical and Quantum Gravity  2015-2018    0266-5611  Inverse Problems  2015-2018    0268-1242  Semiconductor Science and Technology  2015-2018    0295-5075  Europhysics Letters  2015-2018    0741-3335  Plasma Physics and Controlled Fusion  2015-2018    0951-7715  Nonlinearity  2015-2018    0952-4746  Journal of Radiological Protection  2015-2018    0953-2048  Superconductor Science and Technology  2015-2018    0953-4075  Journal of Physics B: Atomic, Molecular and Optical Physics  2015-2018    0953-8984  Journal of Physics: Condensed Matter  2015-2018    0954-3899  Journal of Physics G: Nuclear and Particle Physics  2015-2018    0957-0233  Measurement Science and Technology  2015-2018    0957-4484  Nanotechnology  2015-2018    0960-1317  Journal of Micromechanics and Microengineering  2015-2018    0963-0252  Plasma Sources Science and Technology  2015-2018    0964-1726  Smart Materials and Structures  2015-2018    0965-0393  Modelling and Simulation in Materials Science and Engineering  2015-2018    0967-3334  Physiological Measurement  2015-2018    1009-0630  Plasma Science and Technology  2017-2018    1054-660x  Laser Physics  2015-2018    1063-7818  Quantum Electronics  2015-2018    1063-7869  Physics-Uspekhi  2015-2018    1064-5616  Sbornik Mathematics  2015-2018    1064-5632  Izvestiya Mathematics  2015-2018    1367-2630  New Journal of Physics  2015-2018    1468-6996  Science and Technology of Advanced Materials  2015    1475-7516  Journal of Cosmology and Astroparticle Physics  2015-2016    1478-3975  Physical Biology  2015-2018    
1612-2011  Laser Physics Letters  2015-2018    1674-1056  Chinese Physics B  2015-2018    1674-1137  Chinese Physics C  2015-2018    1674-4527  Research in Astronomy and Astrophysics  2017-2018    1674-4926  Journal of Semiconductors  2017-2018    1741-2560  Journal of Neural Engineering  2015-2018    1742-2132  Journal of Geophysics and Engineering  2015-2018    1742-5468  Journal of Statistical Mechanics: Theory and Experiment  2015-2018    1748-605x  Biomedical Materials  2017-2018  onlineISSN    1748-3182  Bioinspiration & Biomimetics  2015    1748-3190  Bioinspiration & Biomimetics  2015-2018  onlineISSN    1748-6041  Biomedical Materials  2015-2018    1748-9326  Environmental Research Letters  2015-2018    1749-4680  Computational Science & Discovery  2015    1749-4699  Computational Science & Discovery  2015  onlineISSN    1751-8113  Journal of Physics A: Mathematical and Theoretical  2015-2018    1752-7155  Journal of Breath Research  2015-2017    1752-7163  Journal of Breath Research  2017-2018  onlineISSN    1758-5082  Biofabrication  2015    1758-5090  Biofabrication  2015-2018  onlineISSN    1882-0778  Applied Physics Express  2015-2018    2040-8978  Journal of Optics  2015-2018    2041-8205  The Astrophysical Journal Letters  2017-2018    2043-6262  Advances in Natural Sciences: Nanoscience and Nanotechnology  2015-2018    2050-6120  Methods and Applications in Fluorescence  2015-2018    2051-672x  Surface Topography: Metrology and Properties  2015-2018    2053-1583  2D Materials  2015-2018    2053-1591  Materials Research Express  2015-2018    2053-1613  Translational Materials Research  2015-2018    2057-1739  Convergent Science Physical Oncology  2015-2018    2057-1976  Biomedical Physics & Engineering Express  2015-2018    2058-8585  Flexible and Printed Electronics  2016-2018    2058-9565  Quantum Science and Technology  2016-2018    2399-1984  Nano Futures  2018    2399-6528  Journal of Physics Communications  2018    2399-7532  Multifunctional 
Materials  2018    2515-5172  Research Notes of the AAS  2018    2515-7639  Journal of Physics: Materials  2018  JJAP  0021-4922  Japanese Journal of Applied Physics  2014-2015    1882-0778  Applied Physics Express  2014-2015  rsc  1364-548x  Chemical Communications  2015-2018    1364-5498  Faraday Discussions  2016-2018    1364-5528  Analyst  2015-2018    1364-5544  Journal of Analytical Atomic Spectrometry  2015-2018  onlineISSN    1369-9261  New Journal of Chemistry  2016-2018  onlineISSN    1460-4744  Chemical Society Reviews  2015-2018  onlineISSN    1460-4752  Natural Product Reports  2015-2018  onlineISSN    1463-9084  Physical Chemistry Chemical Physics  2015-2018  onlineISSN    1463-9270  Green Chemistry  2015-2018  onlineISSN    1466-8033  CrystEngComm  2015-2018    1473-0197  Lab on a Chip - Miniaturisation for Chemistry and Biology  2015-2018    1474-9092  Photochemical & Photobiological Sciences  2016-2018    1477-0539  Organic & Biomolecular Chemistry  2015-2018    1477-9234  Dalton Transactions  2015-2018    1742-2051  Molecular BioSystems  2015-2017    1744-6848  Soft Matter  2015-2018    1754-5706  Energy & Environmental Science  2016-2018    1756-591x  Metallomics  2016-2018    1756-1108  Chemistry Education Research and Practice  2016-2018    1757-9708  Integrative Biology  2016-2018    1759-9679  Analytical Methods  2015-2018    1759-9962  Polymer Chemistry  2015-2018    2040-2511  MedChemComm  2016-2018  onlineISSN    2040-3372  Nanoscale  2015-2018    2041-6539  Chemical Science  2015-2018    2042-650x  Food & Function  2016-2018    2044-4761  Catalysis Science & Technology  2015-2018    2045-4538  Toxicology Research  2015-2018    2046-2069  RSC Advances  2015-2018    2047-4849  Biomaterials Science  2015-2018    2050-7496  Journal of Materials Chemistry A  2015-2018    2050-7518  Journal of Materials Chemistry B  2015-2018    2050-7534  Journal of Materials Chemistry C  2015-2018    2050-7895  Environmental Science Processes & Impacts  
2016-2018    2051-6355  Materials Horizons  2015-2018    2051-8161  Environmental Science Nano  2016-2018    2052-1537  Materials Chemistry Frontiers  2016-2018  onlineISSN    2052-1553  Inorganic Chemistry Frontiers  2016-2018  onlineISSN    2052-4129  Organic Chemistry Frontiers  2015-2018  onlineISSN    2053-1419  Environmental Science Water Research & Technology  2016-2018    2055-6764  Nanoscale Horizons  2015-2018    2058-9689  Molecular Systems Design & Engineering  2016-2018    2058-9883  Reaction Chemistry & Engineering  2016-2018    2398-4902  Sustainable Energy & Fuels  2017-2018    2515-4184  Molecular Omics  2018  springer  0022-2461  Journal of Materials Science  2015-2019    0170-0839  Polymer Bulletin  2015-2018    0732-8818  Experimental Techniques  2017-2018    0970-4140  Journal of the Indian Institute of Science  2017-2018    1022-9760  Journal of Polymer Research  2015-2019    1547-0091  Journal of Coatings Technology and Research  2015-2018    1556-276x  Nanoscale Research Letters  2015    1868-6958  Cancer Nanotechnology  2015-2018    1931-7573  Nanoscale Research Letters  2015-2018    1939-5981  International Journal of Metalcasting  2016-2018    2008-9295  International Nano Letters  2015-2018    2050-7445  Heritage Science  2015-2018    2055-7124  Biomaterials Research  2015-2018    2190-5509  Applied Nanoscience  2015-2018    2192-9262  Metallography, Microstructure, and Analysis  2015-2018    2193-9764  Integrating Materials and Manufacturing Innovation  2015-2018    2194-0509  Progress in Biomaterials  2015-2018    2194-1459  Materials for Renewable and Sustainable Energy  2015-2018    2196-050x  In Silico Cell and Tissue Science  2015    2196-1107  Journal of Solid State Lighting  2015-2016    2196-2936  Metallurgical and Materials Transactions E  2015-2017    2196-4351  Applied Adhesion Science  2015-2018    2196-5404  Nano Convergence  2015-2018    2198-0926  Advanced Structural and Chemical Imaging  2015, 2017-2018    2198-4220  
Journal of Bio- and Tribo-Corrosion  2015-2019    2199-384x  Shape Memory and Superelasticity  2015-2018    2199-7446  Journal of Dynamic Behavior of Materials  2015-2018    2364-821x  Gold Bulletin  2015-2018    2364-4133  Regenerative Engineering and Translational Medicine  2015-2018    2365-6301  Graphene Technology  2016-2018    2510-1560  Journal of the Australian Ceramic Society  2017-2018  wiley  0002-7820  Journal of the American Ceramic Society  2005-2019    0021-8995  Journal of Applied Polymer Science  1999-2019    0032-3888  Polymer Engineering & Science  2011-2019    0049-8246  X‐Ray Spectrometry  2011-2019    0887-624x  Journal of Polymer Science Part A: Polymer Chemistry  2004-2019    0887-6266  Journal of Polymer Science Part B: Polymer Physics  2000-2019    0935-9648  Advanced Materials  2011-2019    0947-6539  Chemistry – A European Journal  2003-2019    0959-8103  Polymer International  2000-2019    1433-7851  Angewandte Chemie International Edition  2003-2019    1613-6810  Small  2005-2019    1616-301x  Advanced Functional Materials  2011-2019    1861-4728  Chemistry – An Asian Journal  2006-2019    1864-5631  ChemSusChem  2008-2019

Jxiv_article/MaterialBERT_Jxiv_complete.pdf

MaterialBERT for Natural Language Processing of Materials Science Texts

Michiko Yoshitake^a,*, Fumitaka Sato^a,b, Hiroyuki Kawano^a,b and Hiroshi Teraoka^a,b

^a MaDIS, National Institute for Materials Science, Tsukuba, Japan; ^b Ridgelinez, Tokyo, Japan

1-1, Namiki, Tsukuba, Ibaraki, Japan 305-0044, yoshitake.michiko@nims.go.jp

A BERT (Bidirectional Encoder Representations from Transformers) model, which we named “MaterialBERT,” has been generated using scientific papers from a wide area of materials science as a corpus. A new vocabulary list for the tokenizer was generated using the materials science corpus.
Two BERT models with different vocabulary lists for the tokenizer were generated, one with the original list made by Google and the other with a list newly made by the authors. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives, for not only inorganic materials but also organic materials and organometallic compounds. Fine-tuning with CoLA (The Corpus of Linguistic Acceptability) using the pre-trained MaterialBERT showed a higher score than the original BERT.

Keywords: word embedding; pre-training; BERT; literal information

Subject classification codes: Databases, data structure, ontology

### 1. Introduction

Informatics techniques have been extensively utilized in the business and industrial fields [1-3]. In materials science, machine learning of numerical data such as composition, electrical conductivity, refractive index, solubility, and friction coefficient, and of processing data such as process temperature and pressure, has increasingly attracted attention [4-6]. In addition to numerical data, literature data, such as comments on SNS (Social Networking Service) and customer claims, have been vigorously analysed with informatics techniques in business fields [7-10]. Informatics techniques for such literature data given in natural languages are called natural language processing (NLP) techniques; they have developed explosively and are applied in social business fields because of the huge amount of data available from web sites and SNS. To apply machine learning techniques to natural language, characters or words are converted to numerical data, usually high-dimensional vectors; this is called embedding. Among the many ways of conversion, Word2Vec [11] attracted sensational attention because it demonstrated that the embedding reflects the meaning of a word.
Word2Vec is a simple 1-layer neural network, which does not require many computer resources. Many embeddings have been made by the Word2Vec method using corpora from different fields, such as Japanese language, materials science, and bioscience. Embeddings made from a corpus of materials science papers, focused especially on inorganic materials, were named Mat2Vec [12]. Among scientific abstracts in materials science taken from Elsevier’s Scopus, the Science Direct API, and the Springer Nature API, abstracts relevant to inorganic materials science were selected and used as the corpus for Mat2Vec. The successful embedding of meanings from a materials science viewpoint was demonstrated [12].

Natural language is sequential data, and the sequence of words is highly important. Therefore, NLP techniques basically use recurrent neural networks (RNNs) with embedded words. Word2Vec is an embedding technique that uses the words surrounding a target word, so the context is taken into consideration to some extent, but the sequence of words is not considered. Advanced RNN techniques suitable for NLP, such as bidirectional LSTM (Long Short Term Memory) [13], have been developed; however, complicated RNN-based methods require excessive computational resources. Epoch-making methods that simplify the RNN network, namely transformer, attention, and BERT, have been developed [14]. The BERT model is revolutionary because, after pre-training (predicting a randomly masked word in two sequential sentences), it can be fine-tuned with a small dataset for many tasks, such as those given in the General Language Understanding Evaluation (GLUE) [15]. Examples of tasks in GLUE are Q&A, paraphrasing, implicational relation between two sentences, grammatical correctness (CoLA), and sentiment judgment. Because of this feature, BERT can be used in various applications.
The original BERT used a dictionary with a vocabulary of about 30K tokens, and its pre-training corpus consisted of the BooksCorpus (800M words) [16] and English Wikipedia (2,500M words). That corpus contained general words that are not specific to a certain area. Therefore, many models using the BERT algorithm with a corpus from a specific field have been constructed, such as BioBERT (bio-medical) [17], MedBERT [18], SciBERT (bio science 82% + computer science 18%) [19], Japanese BERT [20,21], FinBERT (financial) [22], and LegalBERT [23].

A BERT model specific to a wide area of materials science (inorganic, organic, composite, metal-organic, etc.) was desired for our work to produce a kind of knowledge graph on material property relationships [24, 25]. Therefore, we started generating a BERT model specific to a ‘wide area of materials science’ (MaterialBERT) and reported it at a conference [26]. At that time, we pre-trained with the original BERT settings except for the corpus, which consisted of scientific articles in materials science journals. However, despite the huge number of technical terms specific to the materials science field, the vocabulary list released with the original BERT (the “vocab.txt” file) contains only very general terms because it was made from the corpus used to pre-train the original BERT. Therefore, we built a vocabulary list specific to materials science from scientific articles in materials science journals and started generating another MaterialBERT using the newly made vocabulary list. Meanwhile, MatSciBERT [27] was posted, which is a kind of transfer learning of SciBERT using scientific papers in the inorganic materials field (inorganic glasses and ceramics, bulk metallic glasses, alloys, and cement and concrete). Then, MatBERT [28] was posted, which is a BERT variant pre-trained in the inorganic materials field (its solid state dataset and doping dataset were taken from inorganic materials science, along with a gold nanoparticle dataset).
Both MatSciBERT and MatBERT are considered domain-specific to “inorganic materials science”. It was reported [29] that there were no significant differences among BioBERT, SciBERT and MatSciBERT in a sentence classification task on polymer science texts, which lie outside inorganic materials science. Therefore, it would be useful to generate models specific to materials science in general, not limited to inorganic materials science. Moreover, materials that cannot be classified by traditional material classes such as inorganic or organic materials have recently emerged (composite materials, perovskite solar cell materials, metal organic frameworks, etc.). In this situation, a BERT model that is domain-specific to “wide materials science” could be useful not only for our work on the knowledge graph but also for work that crosses material classes. If one focuses on phenomena such as fracture and refraction, the scientific principles of the phenomena are common to all classes of materials. In much materials R&D, researchers search for materials that satisfy a specific functional characteristic based on the corresponding phenomenon. Especially in the era of the SDGs (Sustainable Development Goals), current functional materials need to be replaced with ones that better fit the SDGs. Such replacement often crosses the traditional material classes. Furthermore, our MaterialBERT could be used as a starting point for generating a narrower domain-specific BERT model in the materials science field by transfer learning.

### 2. Method

We downloaded and used the original BERT code to train MaterialBERT on our corpus with the same configuration and size as BERT-Base-uncased (12 layers, hidden layer dimension = 768, total parameters = 110M) [14]. Sentence lengths up to 512 tokens were used for pre-training. In addition to using a corpus different from that of the original BERT, we varied the vocabulary list.
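The BERT-Base figures quoted above, and the effect of changing only the vocabulary size, can be checked with a rough parameter count. The sketch below is an illustration and not the authors' code. It counts the encoder and pooler weights only (no pre-training heads). It assumes the released BERT-Base-uncased values that are not stated in the text: a feed-forward width of 3072, two segment types, and a vocabulary of 30,522 entries. The value 140,000 is a round stand-in for the "140 K" vocabulary, whose exact size is not given.

```python
def bert_encoder_params(vocab_size, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder plus pooler.

    Pre-training heads are not counted. Defaults other than `hidden`,
    `layers` and `max_positions` are assumptions taken from the released
    BERT-Base-uncased model, not from the text.
    """
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then a LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


original_vocab_params = bert_encoder_params(30_522)   # released vocab.txt size
sentence_vocab_params = bert_encoder_params(140_000)  # hypothetical round value

print(f"{original_vocab_params:,}")
print(f"{sentence_vocab_params:,}")
```

Under these assumptions the first count is close to the 110M quoted for BERT-Base, and the whole difference between the two counts comes from the word-embedding table, since every other part of the network is independent of the vocabulary size.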
One vocabulary list is the same as the one the original BERT used (the “vocab.txt” file on GitHub [30]; we refer to it as Original Vocab). The other vocabulary list was made in the following way. First, a vocabulary list was made in the same way as by the authors of SciBERT [19], except for the vocabulary size: the list was produced while training a tokenizer with SentencePiece [31] on our materials science corpus. Then, this vocabulary list was added to the original BERT vocabulary list (vocab.txt) and used as a second vocabulary list (we refer to it as Sentence Vocab). Sentence Vocab contains material-specific words such as bond‐containing, radiation‐absorbed, isothermal, mesoporosity, chromatography, amide‐, acetate‐methanol, alkaline‐metal, α‐methyl‐α‐phenyl, etc. Two MaterialBERT models were generated, one with Original Vocab and the other with Sentence Vocab, both with the same architecture as the original BERT and with our materials science corpus. The Original Vocab contains about 30K words and Sentence Vocab contains 140K words. The embedded word vectors had 200 dimensions.

The corpus was taken from scientific articles that our institute (NIMS) purchased in XML format from nine publishers (ACS, AIP, APS, ELSEVIER, IOP, JJAP, RSC, SPRINGER, WILEY); most of them were published between 2005 and 2019. Our corpus contains scientific articles not only on inorganic materials but also on organic materials and composite materials. It also includes articles from journals that provide the physical and/or chemical basis of phenomena in materials science (and are often cited in materials papers). The list of journal names, ISSNs and publication years used is provided in the appendix. Materials science is a very broad field and is expanding further year by year. Therefore, the authors did not consider it reasonable to use established criteria for choosing articles.
Rather, the authors relied on the decision of each journal (manuscripts that do not meet a journal's criteria are not accepted). We confirmed that the journals listed in the appendix are related to materials science and used all articles published in the specified journals, since BERT needs a huge corpus. We excluded articles that contained only abstracts (without the main body). Approximately 750,000 articles were included in this study. As a cleansing process, only the abstract and body sections were extracted from the article texts, because parts such as affiliations, acknowledgements, and references become noise for NLP in our case. Chemical formulae and mathematical expressions (which are not natural language) were eliminated from the article texts for pre-training. The estimated number of words for the approximately 750,000 articles was roughly 3,000M, which is comparable to the original BERT. Each model was trained on two NVIDIA Tesla V100 GPUs and took about three months to complete.

### 3. Results and Discussion

#### 3.1. Pre-training

##### 3.1.1. Learning curves

Figure 1 shows the learning curves during pre-training. Learning with the original vocabulary list (Original Vocab) for the tokenizer is shown in (a), and learning with the vocabulary list made from our corpus (Sentence Vocab) is shown in (b). Because the Sentence Vocab (140K words) is more than four times larger than the Original Vocab (30K words), one iteration takes much longer for (b), and training was stopped after far fewer iterations: 143,000 for (b) versus 410,000 for (a). Because of the smaller number of iterations, the final loss was larger for (b). If the iterations for (b) were continued to a number similar to (a), the final loss would be expected to approach that of (a).

##### 3.1.2. Embedding of meaning

The results of the evaluation of word embeddings are presented below.
The 200-dimensional word vectors of material names were subjected to principal component analysis and projected onto the plane of the first two principal components. The results for the two sets of word vectors embedded with the two different dictionaries were compared.

###### 3.1.2.1. Clustering of materials

Names of materials such as iron, aluminum, silicon, zinc selenide, zinc oxide, boron nitride, polystyrene, and polyvinyl chloride were used for the analysis. Material names such as micelle and supramolecule, which do not belong to the usual material classes, were also included as “others”. The words used are listed in Table 1 with the class assigned by clustering. The clustering of word vectors of different types of material names is shown in Fig. 2. The word vectors form well-separated clusters according to well-established material classes, such as metals, semiconductors, and polymers [32, 33, 34]. The positions of the clusters themselves have no meaning and depend on the vocabulary list used for the tokenizer. This shows that words are well embedded in both MaterialBERT models, constructed using the Original Vocab (Fig. 2a) and the Sentence Vocab (Fig. 2b).

###### 3.1.2.2. Inorganic materials

Word vectors for four typical elements and their oxides, carbides, and chlorides were subjected to principal component analysis, and the vectors were projected onto the plane of the first two principal components. The results are shown in Fig. 3. For both models using different dictionaries, elements, oxides, carbides, and chlorides formed clusters. Accordingly, the vectors of oxide formation (oxide of), carbide formation (carbide of), and chloride formation (chloride of) are similar for all four elements. There is a slight difference in the oxide formation vectors between (a) and (b). However, as the vectors are well separated, the difference is not meaningful.
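The projection and the comparison of "oxide of" vectors described above can be sketched as follows. This is a minimal illustration using NumPy and synthetic vectors, not the MaterialBERT embeddings: the four element names are placeholders (the text does not name the four elements), and each oxide vector is constructed as its element vector plus a shared shift and a little noise, which is exactly the structure the analysis is meant to detect.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their leading principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim = 200  # word-vector dimension quoted in the text
elements = ["aluminum", "iron", "silicon", "zinc"]  # placeholder names

# Synthetic stand-ins for word vectors (hypothetical data).
element_vecs = rng.normal(scale=0.1, size=(len(elements), dim))
oxide_shift = rng.normal(size=dim)
oxide_shift *= 3.0 / np.linalg.norm(oxide_shift)
oxide_vecs = (element_vecs + oxide_shift
              + rng.normal(scale=0.01, size=(len(elements), dim)))

# Project elements and oxides together, then take element -> oxide offsets.
points = pca_project(np.vstack([element_vecs, oxide_vecs]))
offsets = points[len(elements):] - points[:len(elements)]
similarities = [cosine(offsets[0], other) for other in offsets[1:]]

for name, sim in zip(elements[1:], similarities):
    print(f"oxide-formation offset, {elements[0]} vs {name}: cosine {sim:.3f}")
```

With real embeddings, cosine similarities near 1 between the offsets would correspond to the nearly parallel "oxide of" arrows reported for Fig. 3. The same comparison can be repeated for carbides and chlorides.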
To examine more elements, word vectors for aluminum, calcium, iron, lithium, magnesium, molybdenum, nickel, silicon, sodium, tantalum, titanium, zinc, and zirconium and their oxides, carbides, and chlorides were also analysed in the same way as described above and shown in Fig. 4. For both MaterialBERT models, elements, oxides, carbides, and chlorides formed clusters, as shown in Fig. 3.  3.1.2.3. Organic materials Word vectors of names of organic compounds were analysed using the principal component analysis method. The vectors of organic compounds with different functional groups, alkanes, carboxylic acids, and amines are plotted in Fig. 5. The vectors of decane, ethane, heptane, hexane, octane, pentane, and propane, as well as their carboxylic acid derivatives and amine derivatives are plotted. Similar to inorganic compounds, different functional groups form a cluster with each other, and changes in the functional groups for the above seven alkanes can be represented as similar vectors, although the variance is larger than with inorganic materials, possibly because of a large number of similar names in organic compounds in various papers used as a corpus.  3.1.1..3 Organometallics In Fig. 
6, word vectors of organometallics are plotted after principal component analysis for R-metal-carbonyl (acetylcobalt tetracarbonyl, acetylmanganese pentacarbonyl, benzene chromium tricarbonyl, butadiene iron tricarbonyl, dicobalt octacarbonyl, dimanganese decacarbonyl, ethyl cobalt tetracarbonyl, hexamethyl benzene chromium tricarbonyl, hexamethylborazine chromium tricarbonyl, methyl manganese pentacarbonyl), alkyl-metal (diethylmagnesium, diethylzinc, dimethyl cadmium, dimethyl mercury, dimethyl zinc, methylcopper, tetramethyltin, trimethylgallium, triphenylgallium), and R-lithium (benzyl-lithium, butyl-lithium, ethyl-lithium, methyl-lithium, phenyl-lithium, vinyl-lithium), where R stands for any group in which a hydrocarbon chain is attached to the rest of the molecule. Here, for alkyl-metal, the “metal” is not lithium but magnesium, cadmium, mercury, zinc, copper, tin, or gallium. The scatter of the vectors is similar to that of the organic materials in Fig. 5, suggesting that the word embeddings of inorganic-organic complex compounds carry meanings as reasonable as those of organic materials. Because organometallics cover a vast variety of materials (various R groups and various metals are possible), listing the names of the organometallics that appear in the scientific papers of the corpus is difficult. Therefore, only a limited number of organometallic compounds were used for the evaluation.

3.2. Fine-tuning

Among the GLUE tasks, only CoLA [35] (grammatical correctness of sentences) can be used to evaluate the fine-tuning of MaterialBERT, because grammar does not depend on a specific field, whereas the other tasks depend on the field of the texts used for the evaluation. Therefore, fine-tuning was performed using CoLA.
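For context on how a CoLA score is obtained: the GLUE benchmark evaluates CoLA with the Matthews correlation coefficient (MCC), and scores are commonly quoted as 100 times the MCC. The sketch below is written from the standard definition of the MCC and assumes that this convention applies to the CoLA scores discussed in this section. The function name `matthews_corrcoef_binary` and the toy labels are hypothetical and are not taken from the article.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = acceptable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: the MCC is taken as 0 when a row or column of the
        # confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy gold labels and predictions (illustrative only).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef_binary(gold, pred)
print(f"{100 * mcc:.1f}")  # prints 50.0
```

Unlike plain accuracy, the MCC stays near 0 for a classifier that always predicts the majority class, which matters for CoLA because acceptable sentences outnumber unacceptable ones.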
The score of the MaterialBERT model with the original vocabulary list (Original Vocab) was 62.5%, and that of the model with the vocabulary list newly made from our corpus (Sentence Vocab) was 66.2%. Both are much higher than the 52.1% of the original BERTBASE (the model corresponding to ours) [14]. The score of the original BERTLARGE (which uses a deeper neural network) was reported as 60.5% [14], still lower than that of both MaterialBERT models. It is not known why the MaterialBERT models scored higher on CoLA, which has nothing to do with materials science. One speculation concerns the quality of the corpus used for the pre-training: in our corpus, the scientific articles were collected from selected scientific journals, which means that they have been language-edited and peer-reviewed, so the grammatical correctness of the sentences is high. However, there is no method to characterize a corpus and an evaluation dataset and to measure a kind of distance between them, so it is difficult to pin down the reason for the higher score.

Various domain-specific BERT models have been generated, since fine-tuning results are supposedly related to the overlap between the domain of the corpus used for pre-training and that of the evaluation dataset. Fine-tuning results on datasets and tasks picked by the authors are often given as examples, but they do not logically imply that users would obtain similar scores for their own tasks with their own datasets. Possibly for this reason, FinBERT does not report fine-tuning scores for its tasks but offers web-based fine-tuning for sentiment prediction on text uploaded by users [38].

In the materials science domain, MatSciBERT and MatBERT, both pre-trained on corpora specific to the materials domain (on close examination, materials other than inorganic materials are not included), used inorganic materials datasets for evaluation [28, 36, 37]. MatSciBERT [27] reported approximately 8% better results on glass vs. 
non-glass topic classification task using an in-house dataset (not disclosed) with MatSciBERT than with SciBERT. On the other hand, for sentence classification tasks on polymer science texts, no difference among BioBERT, SciBERT, and MatSciBERT was reported [29], although MatSciBERT, having materials texts as its corpus, would be expected to have some advantage over BioBERT and SciBERT. With the development of tools such as HuggingFace Transformers [39], pre-trained models are beginning to be used by people who want to carry out text-mining tasks of their own interest but are familiar with neither NLP nor machine learning. In these new circumstances, there is a risk that high scores in the authors’ fine-tuning examples mislead users into expecting that the model will achieve similarly high scores on their own tasks with their own datasets, which is not guaranteed. For the above reasons, the authors intend to let users assess the fine-tuning effects for their specific tasks by making the present MaterialBERT models publicly available upon the publication of this article. MaterialBERT should be useful for materials science domains beyond inorganic materials, and especially for NLP tasks that handle items regardless of material type, such as inorganic, organic, or composite. Furthermore, MaterialBERT could be used as a starting point for transfer learning to generate a BERT model specific to a narrower domain within materials science, such as “phase diagram”, “fracture”, “liquid crystal”, or “plasma”.

4. Conclusions

Pre-trained BERT models based on a corpus covering a wide range of materials science have been successfully developed using the architecture of the original BERT. A new vocabulary list has been made from the materials science corpus. Two MaterialBERT models were generated: one with the vocabulary list that the original BERT used and the other with the newly made vocabulary list.
It was shown for both MaterialBERT models that the word vectors embedded during the pre-training reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives, not only for inorganic materials but also for organic materials and organometallic compounds. Fine-tuning using CoLA (sentence classification by grammatical correctness) gave a score much higher than that of the original BERT, which may reflect the grammatical quality of the corpus used for the MaterialBERT models.

The developed MaterialBERT models cover a wide range of materials science, not only inorganic materials. Because of this breadth, an appropriate evaluation of fine-tuning from the viewpoint of materials science is not possible for lack of suitable evaluation datasets. However, there is no comparable pre-trained BERT model that covers materials science so widely. Furthermore, MaterialBERT models can be used as a starting point for transfer learning to generate a BERT model specific to a narrower domain within materials science, such as “phase diagram”, “resin”, or “liquid crystal”. Because fine-tuning results depend strongly on the similarity between the corpus used for pre-training and that used for fine-tuning, the authors intend to let users assess the fine-tuning effects for their specific tasks by making the present MaterialBERT models publicly available upon the publication of this article. The models and the newly developed vocabulary list will be uploaded to the material data repository at NIMS [40] upon the publication of this article so that all users can use them freely.

Acknowledgements

The authors thank the NIMS TDM platform for supplying well-organized XML files from the publishers.

Author contributions

M. Y. was involved in conceptualization, corpus selection and cleaning, writing, and project management. F. S. was involved in management of program coding. H. K. 
was involved in management of program coding. H. T. was involved in program coding and calculations.

Conflict of interest

The authors declare no potential conflict of interest.

References

[1] List of universities offering degrees in “business informatics”. Available from: https://en.everybodywiki.com/List_of_universities_offering_degrees_in_business_informatics
[2] BizNews. What is business informatics? Available from: https://biznewske.com/what-is-business-informatics/
[3] A journal with title “industrial informatics”. IEEE Transactions on Industrial Informatics. ISSN: 1551-3203.
[4] Ramprasad R, Batra R, Pilania G, et al. Machine learning in materials informatics: recent applications and prospects. npj Comput Mater. 2017;3:54 (1-13). https://doi.org/10.1038/s41524-017-0056-5
[5] Agrawal A, Choudhary A. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. APL Materials. 2016;4:053208 (1-10). https://doi.org/10.1063/1.4946894
[6] Tanaka F, Sato H, Yoshii N, et al. Materials Informatics for Process and Material Co-Optimization. IEEE Transactions on Semiconductor Manufacturing. 2019;32:444-449.
[7] Hassan AUl, Hussain J, Hussain M, et al. Sentiment analysis of social networking sites (SNS) data using machine learning approach for the measurement of depression. Proceedings of 2017 International Conference on Information and Communication Technology Convergence (ICTC); 2017 Oct 18-20; Jeju, South Korea. IEEE; 2017.
[8] Yoshida S, Kitazono J, Ozawa S, et al. Sentiment analysis for various SNS media using Naïve Bayes classifier and its application to flaming detection. Proceedings of 2014 IEEE Symposium on Computational Intelligence in Big Data (CIBD); 2014 Dec 9-12; Orlando, FL, USA. IEEE; 2015.
[9] Ahn H, Lee S. An Analytic Study on Private SNS for Bonding Social Networking. In: Meiselwitz G, editor. Social Computing and Social Media. SCSM 2015. Lecture Notes in Computer Science, vol 9182. 
Springer, Cham. https://doi.org/10.1007/978-3-319-20367-6_12
[10] Khairi SSM, Ghani RAM. Analysis of social networking sites on academic performance among university students: A PLS-SEM approach. AIP Conference Proceedings. 2019;2138:050015. Available from: https://doi.org/10.1063/1.5121120
[11] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013). Available from: https://papers.nips.cc/paper/2013
[12] Tshitoyan V, Dagdelen J, Weston L, et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature. 2019;571:95–98.
[13] Long short-term memory. Available from: https://en.wikipedia.org/wiki/Long_short-term_memory
[14] Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Available from: https://arxiv.org/pdf/1810.04805.pdf
[15] Wang A, Singh A, Michael J, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; 2018. p. 353–355. Available from: https://aclanthology.org/W18-5446, https://gluebenchmark.com/
[16] Zhu Y, Kiros R, Zemel R, et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7-13; Santiago, Chile. IEEE; 2015. p. 19–27.
[17] Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining; 2019 Oct 31. arXiv:1901.08746. Available from: https://arxiv.org/abs/1901.08746
[18] Rasmy L, Xiang Y, Xie Z, Tao C, et al. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction; 2020 May 22. arXiv:2005.12833v1. 
Available from: https://doi.org/10.48550/arXiv.2005.12833
[19] Beltagy I, Lo K, Cohan A. SCIBERT: A pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019. p. 3615–3620. DOI:10.18653/v1/D19-1371. Available from: https://arxiv.org/abs/1903.10676
[20] BERT Japanese Pretrained Model. Available from: https://nlp.ist.i.kyoto-u.ac.jp/?ku_bert_japanese, https://laboro.ai/activity/column/engineer/laboro-bert/
[21] Pretrained Japanese BERT models. Available from: https://github.com/cl-tohoku/bert-japanese
[22] Yang Y, Christopher M, Siy UY, et al. FinBERT: A pretrained language model for financial communications. Available from: https://arxiv.org/abs/2006.08097
[23] Chalkidis I, Fergadiotis M, Malakasiotis P, et al. LEGAL-BERT: The Muppets straight out of Law School. arXiv:2010.02559v1, 2020 Oct 6. Available from: https://doi.org/10.48550/arXiv.2010.02559
[24] Yoshitake M, Kuwajima I, Yagyu S, et al. System for Searching Relationship among Physical Properties for Materials Curation™. Vac. Surf. Sci. 2018;61:200–205.
[25] Yoshitake M. Tool for Designing Breakthrough Discovery in Materials Science. Materials. 2021;14:6946 (1-15). Available from: https://doi.org/10.3390/ma14226946
[26] Yoshitake M, Sato F, Kawano H, et al. MaterialBERT for Natural Language Processing of Materials Science Texts. Paper presented at: 68th JSAP Spring Meeting; 2021 Mar 16-19; Online.
[27] Gupta T, Zaki M, Krishnan ANM, et al. MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction. arXiv:2109.15290v1, 2021 Sep 30. Available from: https://doi.org/10.48550/arXiv.2109.15290
[28] Walker N, Trewartha A, Huo H, et al. The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. 
Available from SSRN: https://ssrn.com/abstract=3950755 or http://dx.doi.org/10.2139/ssrn.3950755
[29] Oka H, Ishii M. Sentence classification for polymer data extraction from scientific articles. Poster session presented at: 69th JSAP Spring Meeting; 2022 Mar 22-26; Sagamihara, Kanagawa.
[30] When BERT-base is downloaded from https://github.com/google-research/bert, a vocab.txt file is included in the zip file.
[31] Kudo T, Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2018 Oct 31-Nov 4; Brussels, Belgium. Association for Computational Linguistics. p. 66-71. Available from: https://github.com/google/sentencepiece
[32] Classes of Materials, University of Cambridge, https://www.doitpoms.ac.uk/tlplib/artefact/classes.php
[33] Introduction to Materials Science and Engineering, University of Washington USA, Prof. Christine Luscombe, http://courses.washington.edu/mse170/powerpoint/luscombe/Week1complete.pdf
[34] “Semiconductor” is a relatively new class of materials, as mentioned in Materials science, https://en.wikipedia.org/wiki/Materials_science and in Materials science and engineering, https://en.wikiversity.org/wiki/Portal:Materials_science_and_engineering
[35] Warstadt A, Singh A, Bowman SR. Neural network acceptability judgments. Available from: https://arxiv.org/abs/1805.12471, https://nyu-mll.github.io/CoLA/
[36] Weston L, Tshitoyan V, Dagdelen J, et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019;59:3692–3702.
[37] Friedrich A, Adel H, Tomazic F, et al. The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain. 
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2020, pp. 1255–1268.
[38] ProsusAI / finBERT, https://github.com/ProsusAI/finBERT
[39] huggingface / transformers, https://github.com/huggingface/transformers
[40] https://doi.org/10.48505/nims.3705

Table 1. List of words used for material class clustering with a class assigned by clustering.
Figure 1. Learning curve for pre-training with the original dictionary (a) and the newly made dictionary from our corpus (b).
Figure 2. Material class captured by word embeddings: two-dimensional projection of the word vectors in the plane with the first and second principal components for 79 materials from different material classes using the original dictionary (a) and the newly made dictionary from our corpus (b). Others are materials such as metal-organic framework and composite material.
Figure 3. Word embeddings for magnesium, aluminum, silicon, iron, their principal oxides, carbides and chlorides projected onto two dimensions using principal component analysis and represented as points in space. (a) is obtained using the original dictionary and (b) using the newly made dictionary from our corpus. The projected space between (a) and (b) is slightly different but in both spaces the relative positioning of the words encodes materials science relationships, such that there exist consistent vector operations between words that represent concepts such as ‘oxide of’, ‘carbide of’ and ‘chloride of’.
Figure 4. Word embeddings for 13 elements (lithium, sodium, magnesium, aluminium, silicon, calcium, titanium, iron, nickel, zinc, zirconium, molybdenum, tantalum) and their principal oxides, carbides and chlorides projected onto two dimensions using principal component analysis and represented as points in space. (a) is obtained using the original dictionary and (b) using the newly made dictionary from our corpus.
Figure 5. 
Word embeddings for 7 alkanes, and their carboxylic acid, and amine derivatives projected onto two dimensions using principal component analysis and represented as points in space. (a) is obtained using the original dictionary and (b) using the newly made dictionary from our corpus.
Figure 6. Word embeddings for organometallics (R-metal-carbonyl, alkyl-metal, and R-lithium, where R means an abbreviation for any group in which a hydrocarbon chain is attached to the rest of the molecule) projected onto two dimensions using principal component analysis and represented as points in space. (a) is obtained using the original dictionary and (b) using the newly made dictionary from our corpus.

Table 1. List of words used for material class clustering with a class assigned by clustering.
metals: iron, aluminum, copper, titanium, gold, platinum, chromium, nickel, cobalt, tungsten, palladium, steel, high-speed steel, superalloys, inconel, duralumin, bronze, amalgam, alumel, chromel, intermetallics, intermetallic compound, metallic glass
ceramics: aluminum oxide, silicon carbide, tungsten carbide, Yttria-stabilized zirconia, zinc oxide, zirconia, boron nitride, Sialon, silicon nitride, titanium carbide, glass, barium titanate, hydroxyapatite, ferrite, calcium fluoride
semiconductors: silicon, germanium, gallium arsenide, gallium phosphide, indium phosphide, silicon carbide, zinc selenide, cadmium sulfide, gallium nitride, gallium oxide, diamond, black phosphorus, fullerene, carbon nanotube
polymers: polyethylene, polypropylene, polystyrene, polyvinyl chloride, synthetic rubber, phenol formaldehyde resin, neoprene, nylon, polyacrylonitrile, PVB, cellulose, starch, chitin, protein, lignin, silicone, celluloid
others: metal complex, metal organic framework, composite material, clathrate, methane hydrate, supramolecule, crown ether, cyclodextrin, liposome, micelle

[Figure 1 image: panels (a) and (b); axes: Iteration counts, Loss.]
[Figure 2 image: panels (a) and (b); axes: First principal component, Second principal component; legend: metal, ceramics, semiconductor, polymer, others.]
[Figure 3 image: panels (a) and (b); axes: First principal component, Second principal component; legend: aluminium, iron, magnesium, silicon; oxide of, carbide of, chloride of.]
[Figure 4 image: panels (a) and (b); axes: First principal component, Second principal component; legend: element, carbide, oxide, chloride.]
[Figure 5 image: panels (a) and (b); axes: First principal component, Second principal component; legend: -amine, -acid.]
[Figure 6 image: panels (a) and (b); axes: First principal component, Second principal component; legend: R-metal-carbonyl, alkyl-metal, R-lithium.]

MaterialBERT_Jxiv.pdf  Figs_Jxiv