# Machine extraction of polymer data from tables using XML versions of scientific articles

https://mdr.nims.go.jp/datasets/c1b3c5e9-1aae-420b-8962-c415ebbdf38b

## File

- [supplemental-data_210113.zip](https://mdr.nims.go.jp/filesets/b52d100f-2af1-49a6-b421-1d4bc6e9f95f/download) ([Detail](https://mdr.nims.go.jp/filesets/b52d100f-2af1-49a6-b421-1d4bc6e9f95f.md))

## Id

c1b3c5e9-1aae-420b-8962-c415ebbdf38b

## Visibility

open_to_public

## State

published

## Created at

2021-08-19T06:10:53.235580Z

## Updated at

2024-06-21T06:45:49.841011Z

## Published at

2021-08-19T13:30:04.575551Z

## Doi

https://doi.org/10.11503/nims.1190

## First published url

https://doi.org/10.1080/27660400.2021.1899456

## Date published

2021-01-01

## Recorded date published

2021-1-1

## Resource type

dataset

## Manuscript type

authors_original

## Title

- title: Machine extraction of polymer data from tables using XML versions of scientific
    articles
  title_type: original
  lang: en

## Description

- description: |-
    In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.

    [Revision history]
    2020-05-26: Initial upload under the title "Automatic extraction of polymer data from tables in XML documents of scientific articles"
    2021-02-25: Updated version under the title "Machine extraction of polymer data from tables using XML versions of scientific articles"
  description_type: abstract
  lang: en
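
The abstract states that property specifiers were identified using partial string matching. The following is a minimal sketch of that idea, not the authors' implementation: the pattern lists in `SPECIFIERS`, the function names `identify_property` and `extract_points`, and the assumption that the first column holds the polymer name are all hypothetical (the study uses a named entity recognizer for polymer names). The sample table values are illustrative and are not taken from the dataset.

```python
from typing import List, Optional, Tuple

# Hypothetical specifier patterns; the matching strings used in the study
# are not listed in this record.
SPECIFIERS = {
    "Tg": ("tg", "glass transition"),
    "Tm": ("tm", "melting"),
    "Td": ("td", "decomposition"),
}


def identify_property(header: str) -> Optional[str]:
    """Return the first property whose pattern is a substring of the header."""
    text = header.lower()
    for prop, patterns in SPECIFIERS.items():
        if any(pattern in text for pattern in patterns):
            return prop
    return None


def extract_points(
    header: List[str], rows: List[List[str]]
) -> List[Tuple[str, str, float]]:
    """Collect (polymer, property, value) triples from a simple table.

    Assumption: column 0 holds the polymer name; other columns hold values.
    """
    props = [identify_property(cell) for cell in header]
    points = []
    for row in rows:
        name = row[0]
        for prop, cell in zip(props[1:], row[1:]):
            if prop is None:
                continue  # column is not a target property
            try:
                value = float(cell)
            except ValueError:
                continue  # skip non-numeric cells such as "-"
            points.append((name, prop, value))
    return points


# Illustrative table (made-up values, not from the dataset).
header = ["Polymer", "Tg (°C)", "Tm (°C)", "Yield (%)"]
rows = [["PET", "76", "250", "91"], ["PS", "100", "-", "88"]]
points = extract_points(header, rows)
```

Plain substring matching is deliberately naive: a short pattern such as `"tm"` would also match an unrelated header like "Treatment", so a practical system would need stricter rules than this sketch.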

## Creator

- name: YOSHIZAWA, Atsushi
  role: author
- name: ISHII, Masashi
  role: author
  orcid: https://orcid.org/0000-0003-0357-2832
- name: SHINDO, Hiroyuki
  role: author
  orcid: https://orcid.org/0000-0003-1081-9194
- name: OKA, Hiroyuki
  role: author
  orcid: https://orcid.org/0000-0002-1768-2429
- name: MATSUMOTO, Yuji
  role: author
  orcid: https://orcid.org/0000-0003-4946-9574

## Keyword

- subject: polymer data
  schema: not_defined
- subject: table
  schema: not_defined
- subject: machine extraction
  schema: not_defined
- subject: informatics
  schema: not_defined
- subject: XML
  schema: not_defined

## Fileset

- id: b52d100f-2af1-49a6-b421-1d4bc6e9f95f
  filename: supplemental-data_210113.zip
  content_type: application/zip
  size: 93966
  md5: 5a90114b657b483867cdc6b439d8979c
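
The `size` and `md5` fields above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `matches_record` are hypothetical, and it assumes that `size` is given in bytes.

```python
import hashlib
from pathlib import Path

# Values copied from the Fileset entry of this record.
EXPECTED_MD5 = "5a90114b657b483867cdc6b439d8979c"
EXPECTED_SIZE = 93966  # assumed to be bytes


def md5_of(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's size and MD5 both match the record."""
    file_path = Path(path)
    return (
        file_path.stat().st_size == EXPECTED_SIZE
        and md5_of(file_path) == EXPECTED_MD5
    )
```

For example, `matches_record("supplemental-data_210113.zip")` should return `True` for an intact download saved under that name in the current directory.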

## Thumbnail

fileset_id: b52d100f-2af1-49a6-b421-1d4bc6e9f95f
filename: supplemental-data_210113.zip