Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

FOPPIANO, Luca; ROMARY, Laurent; ISHII, Masashi; TANIFUJI, Mikiko

MDR Open Deposited

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to make unstructured information understandable and accessible, are the building blocks of large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture in which each model is specialised for a specific task. The models are trained with the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (for example, of length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and their meaning in Earth Science [10]. At the National Institute for Materials Science (NIMS) in Japan, it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
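
To make the normalisation step concrete, the following is a minimal sketch of a two-stage cascade: the first stage interprets a value written in numeric, alphabetic or scientific notation, and the second converts the result to an SI unit. It is an illustration only and not the Grobid-quantities implementation: it replaces the CRF models with hand-written rules, and the names `SI_TABLE`, `WORDS`, `parse_value`, `normalise` and `Measurement`, as well as the small unit table, are assumptions introduced for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical lookup table (not from Grobid-quantities):
# unit symbol -> (SI unit, multiplicative factor, additive offset).
SI_TABLE = {
    "m": ("m", 1.0, 0.0),
    "km": ("m", 1000.0, 0.0),
    "g": ("kg", 0.001, 0.0),
    "kg": ("kg", 1.0, 0.0),
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
    "Pa": ("Pa", 1.0, 0.0),
    "GPa": ("Pa", 1e9, 0.0),
}

# Hypothetical, deliberately tiny vocabulary for alphabetic values.
WORDS = {
    "one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0, "five": 5.0,
    "six": 6.0, "seven": 7.0, "eight": 8.0, "nine": 9.0, "ten": 10.0,
}

# Matches scientific notation written as "3 x 10^2" or "3 × 10^-2".
SCI = re.compile(r"^([+-]?\d+(?:\.\d+)?)\s*[x×]\s*10\^?([+-]?\d+)$")


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str


def parse_value(raw: str) -> float:
    """Stage 1: turn a numeric, alphabetic or scientific value into a float."""
    text = raw.strip().lower()
    if text in WORDS:
        return WORDS[text]
    match = SCI.match(text)
    if match:
        return float(match.group(1)) * 10 ** int(match.group(2))
    # Plain numbers, including the "1e3" form, are handled by float().
    return float(text)


def normalise(raw_value: str, raw_unit: str) -> Measurement:
    """Stage 2: convert the parsed value to the corresponding SI unit."""
    si_unit, factor, offset = SI_TABLE[raw_unit]
    return Measurement(parse_value(raw_value) * factor + offset, si_unit)
```

With these assumptions, `normalise("two", "km")` yields a measurement of 2000 m, and `normalise("0", "°C")` yields 273.15 K. A lookup table keeps the sketch short; a real system would need a far larger unit inventory and would have to handle compound units, intervals and lists.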

DOI

https://doi.org/10.48505/nims.3039

First published at

https://doi.org/10.1145/3342558.3345411

Creator

Name

FOPPIANO, Luca

ORCID

https://orcid.org/0000-0002-6114-6164

Organization

National Institute for Materials Science

Sub organization

MaDIS

Role

author

Name

ROMARY, Laurent

ORCID

https://orcid.org/0000-0002-0756-0508

Organization

Inria

Sub organization

ALMAnaCH

Role

author

Name

ISHII, Masashi

ORCID

https://orcid.org/0000-0003-0357-2832

Organization

National Institute for Materials Science

Sub organization

MaDIS

Role

author

Name

TANIFUJI, Mikiko

ORCID

https://orcid.org/0000-0001-5284-6364

Organization

National Institute for Materials Science

Sub organization

MaDIS

Role

author

Resource type

Conference Proceeding

Publisher

Association for Computing Machinery

Date published

23/09/2019

Rights statement

In Copyright

Journal

Title

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019

Sequence number

978-1-4503-6887-2

Start page

1

End page

4

Total number of pages

4

Manuscript type

Accepted manuscript

Event

Title

DocEng 2019

Location

Berlin

Start date

23/09/2019

End date

26/09/2019

Language

English

Last modified

01/07/2021

Items

Title	Date Uploaded	Size	Visibility
main.pdf	15/01/2021	506 KB	MDR Open
