Article Leveraging Segmentation of Physical Units through a Newly Open Source Corpus

ISHII, Masashi ; TANIFUJI, Mikiko ; FOPPIANO, Luca ; DIEB M. Thear ; SUZUKI, Akira

Citation
ISHII, Masashi, TANIFUJI, Mikiko, FOPPIANO, Luca, DIEB M. Thear, SUZUKI, Akira. Leveraging Segmentation of Physical Units through a Newly Open Source Corpus. https://doi.org/10.34968/nims.1360

Description:

(abstract)

The identification of physical measurements is a recurrent task in materials informatics (MI). Quantities and units appear in unstructured text and follow ad hoc conventions to convey their meaning: it is common to encounter new variations of existing units, or new complex composite units, that are not present in any unit dictionary. As a consequence, string matching and dictionary lookup fail or generate false positives as unit complexity grows. A generic unit segmentation system is therefore necessary. In this submission, we present our approach to unit representation, which can be used to design generic unit normalisation systems. We announce the public availability of a corpus of segmented physical units. This dataset, comprising about 2000 entries in XML format, can be used to train or evaluate sequence labelling models for unit segmentation. Through this contribution, we provide an Open Source (CC-BY licensed) corpus for segmenting physical units, a resource that can be used to evaluate and compare physical measurement segmentation systems.
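As a rough illustration of how XML-annotated units could feed a sequence labelling model, the sketch below converts one annotated unit into a character sequence with one label per character. The element names (`unit`, `prefix`, `base`, `pow`), the sample entries, and the helper `char_labels` are hypothetical and chosen only for this example; the actual schema of the corpus is not described in this record.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries: the tag names are assumptions, not the corpus schema.
SAMPLE = """<units>
  <unit><prefix>k</prefix><base>g</base>/<base>m</base><pow>3</pow></unit>
  <unit><prefix>m</prefix><base>A</base>/<prefix>c</prefix><base>m</base><pow>2</pow></unit>
</units>"""


def char_labels(unit):
    """Return (raw_text, labels) with one label per character of the unit.

    Characters inside a child element take that element's tag as label;
    characters outside any child element (e.g. "/") are labelled "O".
    """
    chars, labels = [], []

    def emit(text, label):
        # ElementTree uses None for missing text, hence the `or ""`.
        for ch in text or "":
            chars.append(ch)
            labels.append(label)

    emit(unit.text, "O")
    for child in unit:
        emit(child.text, child.tag)
        emit(child.tail, "O")
    return "".join(chars), labels


if __name__ == "__main__":
    for unit in ET.fromstring(SAMPLE):
        text, labels = char_labels(unit)
        print(text, labels)
```

Pairs of this kind (raw string plus aligned labels) are the usual input for training or scoring a sequence labeller; a dictionary lookup on the raw string `kg/m3` would only succeed if that exact composite form had been listed in advance.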

Keyword: segmentation, quantities, measurements, physical quantities

Date published: 2019-10-18

Publisher: JSAP

Manuscript type: Author's original (Submitted manuscript)

MDR DOI: https://doi.org/10.34968/nims.1360

Updated at: 2022-10-03 01:53:43 +0900

Published on MDR: 2021-08-13 01:20:03 +0900

Filename: presentation-jsap-2019.pdf (application/pdf)
Size: 1.2 MB