MaterialBERT for Natural Language Processing of Materials Science Texts

YOSHITAKE, Michiko; SATO, Fumitaka; KAWANO, Hiroyuki; TERAOKA, Hiroshi; MOTEKI, Fuma

Publication

MDR Open Deposited

A BERT (Bidirectional Encoder Representations from Transformers) model, which we named “MaterialBERT,” was generated using scientific papers from a wide area of materials science as a corpus. A new vocabulary list for the tokenizer was generated from this materials science corpus. Two BERT models were generated with different tokenizer vocabulary lists: one with the original list made by Google and the other with the list newly made by the authors. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives, for inorganic materials as well as organic materials and organometallic compounds. Fine-tuning the pre-trained MaterialBERT with CoLA (The Corpus of Linguistic Acceptability) gave a higher score than the original BERT.
MaterialBERT could be used as a starting point for generating a narrower domain-specific BERT model in the materials science field by transfer learning.
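
As a rough illustration of what it means for word vectors to reflect the relationship between a base material and its compounds, the sketch below compares vectors by cosine similarity. It is not the authors' procedure: the abstract does not say which similarity measure was used, the helper names `cosine_similarity` and `nearest` are hypothetical, and the three-dimensional vectors are invented for illustration rather than taken from MaterialBERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
# Real BERT embeddings have hundreds of dimensions and would be read
# from the pre-trained model instead of being typed in by hand.
toy_vectors = {
    "silicon": [0.9, 0.1, 0.0],
    "silica": [0.8, 0.3, 0.1],
    "polyethylene": [0.1, 0.2, 0.9],
}


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))
```

With these toy values, `nearest("silicon", toy_vectors)` picks "silica" over "polyethylene", which is the kind of base-material-to-compound proximity the abstract describes. Whether the real embeddings behave this way for a given pair of names would have to be checked against the distributed model files.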

DOI

https://doi.org/10.48505/nims.3705

First published at

https://doi.org/10.51094/jxiv.119

Creator

Name

YOSHITAKE, Michiko

ORCID

https://orcid.org/0000-0002-0973-5666

Organization

National Institute for Materials Science

Sub organization

MaDIS

Role

author

Name

SATO, Fumitaka

Organization

Ridgelinez Limited

Role

author

Name

KAWANO, Hiroyuki

Organization

Ridgelinez Limited

Role

author

Name

TERAOKA, Hiroshi

Organization

Ridgelinez Limited

Role

author

Name

MOTEKI, Fuma

Organization

Ridgelinez Limited

Role

operator

Keyword

Resource type

Date published

08/08/2022

Rights statement

Creative Commons BY-ND Attribution-NoDerivatives 4.0 International

Licensed Date

08/08/2022

Manuscript type

Author's original (Preprint)

Last modified

10/08/2022

Items

Title	Date Uploaded	Size	Visibility	Actions
MaterialBERT_README__20220808.md	08/08/2022	5.7 KB	MDR Open	Download
MaterialBERT_Dict_Pre-trained_Model.zip		1.14 GB	MDR Open	Download
MaterialBERT_Pre-trained_Model.zip		1020 MB	MDR Open	Download
Jxiv_article.zip	09/08/2022	1.66 MB	MDR Open	Download