Leveraging Segmentation of Physical Units through a Newly Open Source Corpus

The identification of physical measurements is a recurrent task in materials informatics (MI). Quantities and units appear in unstructured text under ad-hoc conventions: it is common to encounter new variations of existing units, or new complex compound units, that are not present in any unit dictionary. As a consequence, string matching and dictionary lookup fail or generate false positives as unit complexity grows. A generic unit segmentation system is therefore necessary. In this submission, we present an approach to unit representation that can be used to design generic unit normalisation systems. We announce the public release of a corpus of segmented physical units. This dataset, comprising about 2000 entries in XML format, can be used to train or evaluate sequence labelling models for unit segmentation. Through this contribution, we provide an open-source (CC-BY licensed) corpus for segmenting physical units, a resource that can be used to evaluate and compare physical measurement segmentation systems.
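To make the idea of unit segmentation concrete, the following is a minimal rule-based sketch in Python. It is not the system or the annotation scheme described in this submission: the (prefix, base, power) representation, the tiny `PREFIXES` and `BASES` vocabularies, and the names `UnitPart` and `segment_unit` are all assumptions introduced for illustration. It only shows how a compound unit that is absent from a dictionary can still be decomposed into known pieces.

```python
import re
from typing import List, NamedTuple, Tuple


class UnitPart(NamedTuple):
    """One segment of a unit: SI prefix, base symbol, integer power."""
    prefix: str
    base: str
    power: int


# Hypothetical, deliberately tiny vocabularies (illustration only).
PREFIXES = ("k", "m", "c", "µ", "n")
BASES = ("m", "g", "s", "K", "J", "A", "V", "T", "Pa", "mol")

# A factor is a run of letters, optionally followed by "^" and a signed integer.
_FACTOR = re.compile(r"([A-Za-zµ]+)\^?([+-]?\d+)?")


def _split_symbol(symbol: str) -> Tuple[str, str]:
    """Split a symbol into (prefix, base), preferring a whole-symbol base."""
    if symbol in BASES:
        return "", symbol
    if symbol[0] in PREFIXES and symbol[1:] in BASES:
        return symbol[0], symbol[1:]
    raise ValueError(f"unknown unit symbol: {symbol!r}")


def segment_unit(text: str) -> List[UnitPart]:
    """Segment a unit string such as 'mJ/cm2' into UnitPart triples."""
    parts: List[UnitPart] = []
    # Everything after a "/" is treated as a denominator (negated power).
    for index, side in enumerate(text.split("/")):
        sign = 1 if index == 0 else -1
        for factor in re.split(r"[·*.\s]+", side.strip()):
            if not factor:
                continue
            match = _FACTOR.fullmatch(factor)
            if match is None:
                raise ValueError(f"cannot parse factor: {factor!r}")
            prefix, base = _split_symbol(match.group(1))
            power = int(match.group(2) or 1) * sign
            parts.append(UnitPart(prefix, base, power))
    return parts


if __name__ == "__main__":
    print(segment_unit("mJ/cm2"))
    print(segment_unit("kg·m^-2"))
```

Hand-written rules like these break down quickly on the ad-hoc notations found in real text, which is the motivation for training sequence labelling models on an annotated corpus instead.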

  • 18/10/2019
  • 15/06/2020
