Machine extraction of polymer data from tables using XML versions of scientific articles

YOSHIZAWA, Atsushi; OKA, Hiroyuki; MATSUMOTO, Yuji; ISHII, Masashi; SHINDO, Hiroyuki

MDR Open Deposited

In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
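
Two of the steps described above, unfolding merged cells and identifying property specifiers by partial string matching, can be illustrated with a minimal sketch. This is not the authors' implementation. The `row`/`cell` markup with `rowspan`/`colspan` attributes is a hypothetical simplification rather than the publisher's XML schema, the keyword lists in `SPECIFIERS` are assumptions, the table values are made up for illustration, and the first column is simply assumed to hold the polymer name in place of the neural named entity recognizer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table markup with made-up values
# (not the publisher's schema and not data from this dataset).
TABLE_XML = """
<table>
  <row><cell rowspan="2">Polymer</cell><cell colspan="2">Thermal properties</cell></row>
  <row><cell>Tg (°C)</cell><cell>Tm (°C)</cell></row>
  <row><cell>polymer A</cell><cell>101</cell><cell>230</cell></row>
  <row><cell>polymer B</cell><cell>76</cell><cell>250</cell></row>
</table>
"""

# Assumed keyword lists for partial string matching of property specifiers.
SPECIFIERS = {
    "Tg": ("tg", "glass transition"),
    "Tm": ("tm", "melting"),
    "Td": ("td", "decomposition"),
}


def expand_table(table):
    """Copy each merged cell into every grid position it spans."""
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(table.findall("row")):
        c = 0
        for cell in row.findall("cell"):
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            text = (cell.text or "").strip()
            rowspan = int(cell.get("rowspan", "1"))
            colspan = int(cell.get("colspan", "1"))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def find_columns(grid, n_header_rows):
    """Map each property to the column whose stacked header contains a keyword."""
    cols = {}
    for c in range(len(grid[0])):
        header = " ".join(grid[r][c] for r in range(n_header_rows)).lower()
        for prop, keys in SPECIFIERS.items():
            if any(k in header for k in keys):
                cols[prop] = c
    return cols


def extract(grid, n_header_rows=2):
    """Return (name, property, value) triples; names are taken from column 0."""
    cols = find_columns(grid, n_header_rows)
    return [(row[0], prop, row[c])
            for row in grid[n_header_rows:]
            for prop, c in cols.items()]


grid = expand_table(ET.fromstring(TABLE_XML.strip()))
for record in extract(grid):
    print(record)
```

Substring matching is deliberately loose, so short keys such as "tg" can also match unrelated header text; a real system would need additional checks on units and cell contents.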

[Revision history]
2020-05-26: Initial upload under the title "Automatic extraction of polymer data from tables in XML documents of scientific articles"
2021-02-25: Updated version under the title "Machine extraction of polymer data from tables using XML versions of scientific articles"

DOI: https://doi.org/10.11503/nims.1190
First published at: https://doi.org/10.1080/27660400.2021.1899456
Creator:
Name	ORCID	Organization	Role
YOSHIZAWA, Atsushi		National Institute for Materials Science	author
OKA, Hiroyuki	https://orcid.org/0000-0002-1768-2429	National Institute for Materials Science	author
MATSUMOTO, Yuji	https://orcid.org/0000-0003-4946-9574	RIKEN	author
ISHII, Masashi	https://orcid.org/0000-0003-0357-2832	National Institute for Materials Science	author
SHINDO, Hiroyuki	https://orcid.org/0000-0003-1081-9194	Nara Institute of Science and Technology	author
Keyword: XML, informatics, machine extraction, polymer data, table
Resource type: Dataset
Data origin: informatics and data science
Rights statement: Creative Commons BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International
Licensed Date: 13/01/2021
Source: PubMan
Language: English
Last modified: 25/02/2021
Other Date (Created): 13/01/2021

Items

Title	Date Uploaded	Size	Visibility
supplemental-data_210113.zip	25/02/2021	91.8 KB	MDR Open
