# Semi-automatic staging area for high-quality structured data extraction from scientific literature

https://mdr.nims.go.jp/datasets/cb4a04ae-5822-424e-bea4-89a167a4918e

## File

- [Semi-automatic staging area for high-quality structured data extraction from scientific literature.pdf](https://mdr.nims.go.jp/filesets/fa8a1426-6392-4706-9f5f-36b8212d846a/download) ([Detail](https://mdr.nims.go.jp/filesets/fa8a1426-6392-4706-9f5f-36b8212d846a.md))

## Id

cb4a04ae-5822-424e-bea4-89a167a4918e

## Local identifier



## Visibility

open_to_public

## State

published

## Created at

2024-07-11T04:41:30.232864Z

## Updated at

2024-07-11T07:30:24.792218Z

## Published at

2024-07-11T07:30:24.851936Z

## Doi



## First published url

https://doi.org/10.1080/27660400.2023.2286219

## Date published

2023-12-31

## Recorded date published

2023-12-31

## Resource type

journal_article

## Manuscript type

vor

## Collection



## Title

- title: Semi-automatic staging area for high-quality structured data extraction from
    scientific literature
  title_type: original
  lang: en

## Description

- description: "We propose a semi-automatic staging area for efficiently building
    an accurate database of experimental physical properties of superconductors from
    literature, called SuperCon2, to enrich the existing manually-built superconductor
    database SuperCon. Here we report our curation interface (SuperCon2 Interface)
    and a workflow managing the state transitions of each examined record, to validate
    the dataset of superconductors from PDF documents collected using Grobid-superconductors
    in a previous work. This curation workflow allows both automatic and manual operations.
    The former include ‘anomaly detection’, which scans new data to identify outliers,
    and a ‘training data collector’ mechanism that collects training data examples
    based on manual corrections. Such a training data collection policy is effective
    in improving the machine-learning models with a reduced number of examples. For
    manual operations, the interface (SuperCon2 Interface) is developed to increase
    efficiency during manual correction by providing a smart interface and an enhanced
    PDF document viewer. We show that our interface significantly improves the curation
    quality by boosting precision and recall as compared with the traditional ‘manual
    correction’. Our semi-automatic approach would provide a solution for achieving
    a reliable database with text-data mining of scientific documents."
  description_type: abstract
  lang: en

## Creator

- name: Luca Foppiano
  role: author
  orcid: https://orcid.org/0000-0002-6114-6164
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Tomoya Mato
  role: author
  orcid: https://orcid.org/0000-0002-0918-6468
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Kensei Terashima
  role: author
  orcid: https://orcid.org/0000-0003-0375-3043
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Pedro Ortiz Suarez
  role: author
- name: Taku Tou
  role: author
- name: Chikako Sakai
  role: author
  orcid: https://orcid.org/0000-0002-0597-6825
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Wei-Sheng Wang
  role: author
  orcid: https://orcid.org/0009-0001-3572-5736
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Toshiyuki Amagasa
  role: author
- name: Yoshihiko Takano
  role: author
  orcid: https://orcid.org/0000-0002-1541-6928
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26
- name: Masashi Ishii
  role: author
  orcid: https://orcid.org/0000-0003-0357-2832
  organization: National Institute for Materials Science
  ror: https://ror.org/026v1ze26

## Contact agent



## Publisher

organization: Informa UK Limited

## Managing organization



## Keyword

- subject: materials informatics
  schema: not_defined
- subject: superconductors
  schema: not_defined
- subject: machine learning
  schema: not_defined
- subject: database
  schema: not_defined
- subject: tdm
  schema: not_defined

## Rights

- identifier: https://creativecommons.org/licenses/by/4.0/

## Other identifier(s)



## Data origin

- data_origin_type: other

## Embargo



## Journal

- title: 'Science and Technology of Advanced Materials: Methods'
  issn: '27660400'
  article_number: '2286219'

## Conference



## Related item



## Funding

- funder_name: Research and Development

## Instrument



## Instrument operator



## Instrument managing organization



## Measurement method



## Specimen



## Chemical composition



## Structure for specimen



## Structural feature for specimen



## Specific property for specimen



## Process for specimen treatment



## Computational method



## Energy level/transition state



## Software



## Custom property



## Fileset

- id: fa8a1426-6392-4706-9f5f-36b8212d846a
  filename: Semi-automatic staging area for high-quality structured data extraction
    from scientific literature.pdf
  content_type: application/pdf
  size: 7701419
  md5: 6b9a1c9159990de2a79b51dfeea820f2

## Thumbnail

fileset_id: fa8a1426-6392-4706-9f5f-36b8212d846a
filename: Semi-automatic staging area for high-quality structured data extraction
  from scientific literature.pdf