%0 Dataset
%T Leveraging Segmentation of Physical Units through a Newly Open Source Corpus
%A FOPPIANO, Luca
%8 25/10/2019
%U https://mdr.nims.go.jp/concern/datasets/bg257f85d
%R https://doi.org/10.34968/nims.1220
%X The identification of physical measurements is a recurrent need in materials informatics (MI). For example, extracting superconductor materials and their properties requires identifying and understanding temperature, pressure and magnetisation. When designing automatic systems for information extraction from scientific literature, identifying the raw measurement alone is not sufficient. Quantity transformations, such as normalisation, require an understanding of values and units, which appear in unstructured text with ad hoc conventions. String matching and lookups fail as unit complexity and variability grow. A generic unit segmentation system is therefore necessary. This contribution is part of a larger project called Grobid-quantities, a machine learning (ML) based, open source system for extracting and normalising physical measurements from scientific and patent literature. In this submission, we present a general approach to unit representation and announce the public availability (under a Creative Commons licence) of a corpus of segmented physical units. Currently, there are no comparable results in the scientific literature because no public datasets are available for this task. Our approach to unit representation follows the International System of Units (SI): each unit is represented as a product of triples (prefix, base and power). This straightforward approach offers the flexibility to support any combination of units from any system of measurement. Figure 1 illustrates an example in which kV²/cm is tokenised and segmented as a product of triples. We used the Grobid-quantities ML-based unit segmentation implementation to create a new corpus.
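The product-of-triples representation described above can be illustrated with a minimal sketch. It is rule-based and is not the Grobid-quantities implementation, which is ML-based; the names `UnitTriple`, `parse_factor` and `segment`, the tiny lookup tables, and the `^` and `*` input notation are all assumptions made for illustration. The example unit is written as `kV^2/cm`, assuming the 2 in the abstract's example is an exponent.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, deliberately tiny lookup tables; real coverage would be far larger.
PREFIXES = {"k": 3, "M": 6, "c": -2, "m": -3, "n": -9}  # symbol -> power of ten
BASES = {"V", "m", "g", "s", "A", "K"}


@dataclass(frozen=True)
class UnitTriple:
    prefix: str  # "" when the factor has no prefix
    base: str
    power: int   # negative for factors in the denominator


def parse_factor(token: str, sign: int) -> UnitTriple:
    """Turn one factor such as 'kV^2' into a (prefix, base, power) triple."""
    symbol, _, exponent = token.partition("^")
    power = sign * (int(exponent) if exponent else 1)
    # Read the whole symbol as a base first, so "m" is metre rather than milli.
    if symbol in BASES:
        return UnitTriple("", symbol, power)
    if len(symbol) > 1 and symbol[0] in PREFIXES and symbol[1:] in BASES:
        return UnitTriple(symbol[0], symbol[1:], power)
    raise ValueError(f"cannot segment unit token: {token!r}")


def segment(unit: str) -> List[UnitTriple]:
    """Split 'numerator/denominator' into a product of triples."""
    numerator, _, denominator = unit.partition("/")
    triples = [parse_factor(t, +1) for t in numerator.split("*") if t]
    triples += [parse_factor(t, -1) for t in denominator.split("*") if t]
    return triples


if __name__ == "__main__":
    for triple in segment("kV^2/cm"):
        print(triple)
```

Under these assumptions, `kV^2/cm` yields two triples: prefix `k`, base `V`, power 2, and prefix `c`, base `m`, power -1. Fixed lookup rules of this kind are the approach the abstract reports as failing when units become more complex and variable, which is the motivation for a learned segmenter.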
We used data provided by previous work of some of the authors, in which about 2000 units were extracted from 3490 papers from the Journal of Applied Physics. The data was pre-annotated and manually corrected. The resulting corpus contains approximately 700 simple and 1300 complex units and is available in XML format at the Grobid-quantities repository. It is suitable for evaluating new or existing systems for unit segmentation. We plan to increase coverage by adding data from other domains.
%[ 15/06/2020
%9 Dataset
%~ MDR: NIMS Materials Data Repository
%W National Institute for Materials Science