ISHII, Masashi; TANIFUJI, Mikiko; FOPPIANO, Luca; DIEB M. Thear; SUZUKI, Akira
Description:
(abstract) The identification of physical measurements is a recurrent task in materials informatics (MI). Quantities and units appear in unstructured text, written with ad-hoc conventions to convey their meaning: it is common to encounter new variations of existing units, or new complex composed units, that are not present in any unit dictionary. As a consequence, string matching and dictionary lookup fail or generate false positives as unit complexity grows. A generic unit segmentation system is therefore necessary. In this submission, we present our approach to unit representation, which can be used to design generic unit normalisation systems. We announce the public availability of a corpus of segmented physical units. This dataset, comprising about 2000 entries in XML format, can be used to train or evaluate sequence labelling models for unit segmentation. Through this contribution, we provide an Open Source (CC-BY licensed) corpus for segmenting physical units, a resource that can be used to evaluate and compare physical measurement segmentation systems.
Rights:
Keyword: segmentation, quantities, measurements, physical quantities
Date published: 2019-10-18
Publisher: JSAP
Journal:
Funding:
Manuscript type: Author's original (Submitted manuscript)
MDR DOI: https://doi.org/10.34968/nims.1360
First published URL:
Related item:
Other identifier(s):
Contact agent:
Updated at: 2022-10-03 01:53:43 +0900
Published on MDR: 2021-08-13 01:20:03 +0900
Filename | Type | Size
---|---|---
presentation-jsap-2019.pdf | application/pdf | 1.2 MB