# Fileset

[Leveraging_Segmentation_of_Physical_Units_through_a_Newly_Open_Source_Corpus.pdf](https://mdr.nims.go.jp/filesets/75eff41c-300f-48ae-891b-16084e13be1a/download)

## Creator

[FOPPIANO, Luca](https://orcid.org/0000-0002-6114-6164), [SUZUKI, Akira](https://orcid.org/0000-0002-8167-0414), [DIEB, Thaer M.](https://orcid.org/0000-0002-8111-2009), [ISHII, Masashi](https://orcid.org/0000-0003-0357-2832), [TANIFUJI, Mikiko](https://orcid.org/0000-0001-5284-6364)

## Rights

[Creative Commons BY Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/)

## Other metadata

[Leveraging Segmentation of Physical Units through a Newly Open Source Corpus](https://mdr.nims.go.jp/datasets/0a3ba3d0-f072-4dc5-aadc-f31fc0636fa8)

## Fulltext

Leveraging Segmentation of Physical Units through a Newly Open Source Corpus

Luca Foppiano, Akira Suzuki, Thaer M. Dieb, Masashi Ishii [1] and Mikiko Tanifuji

MaDIS, National Institute for Materials Science (NIMS)

E-mail: FOPPIANO.Luca@nims.go.jp

The identification of physical measurements is a recurrent need in materials informatics (MI). For example, the extraction of superconductor materials and their properties [2] requires identifying and understanding temperature, pressure and magnetisation. When designing automatic systems for information extraction from scientific literature, identifying the raw measurement alone is not sufficient. Quantity transformations, such as normalisation, require an understanding of values and units, which appear in unstructured text with ad-hoc conventions. String matching and lookups fail as unit complexity and variability grow. A generic unit segmentation system is therefore necessary.

This contribution is part of a larger project called Grobid-quantities [3], a machine learning (ML) based, Open Source system for extracting and normalising physical measurements from scientific and patent literature. In this submission, we present a general approach for unit representation, and we announce the public availability (Creative Commons licence) of a corpus of segmented physical units. Currently, there are no comparable results in the scientific literature because no public datasets are available for this task.

Our approach to unit representation follows the International System of Units (SI), where each unit is represented as a product of triples: prefix, base and power. This straightforward approach offers the flexibility to support any combination of units from any system of measurement. Figure 1 illustrates an example where kV2/cm is tokenised and segmented as a product of triples.

Figure 1: The process of parsing a raw unit into the product of triples. Notice that the label pow is used to identify both exponent and division marks (needed to correctly set the second triple's exponent, in this case negative).

We used the Grobid-quantities ML-based unit segmentation implementation to create a new corpus. The data come from previous work by some of the authors [4], in which about 2000 units were extracted from 3490 papers in the Journal of Applied Physics. The data was pre-annotated and manually corrected.

The resulting corpus contains approximately 700 simple and 1300 complex units, and it is available in XML format at the Grobid-quantities repository [3]. It is suitable for evaluating new or existing systems for unit segmentation. We plan to increase the coverage by adding new data from other domains.

1. Corresponding author: ISHII.Masashi@nims.go.jp
2. Luca Foppiano et al., "Proposal for Automatic Extraction Framework of Superconductors related Information from Scientific literature," THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, 2019.
3. grobid-quantities, https://github.com/kermitt2/grobid-quantities, [Online; accessed 18-April-2019], 2016.
4. Suzuki Akira and Ishii Masashi, "Constructing a 'Unit dictionary' from scientific articles," in Third International Workshop on SCIentific DOCument Analysis (JSAI International Symposia on AI) (Springer, 2018).
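
The triple representation described in the abstract (prefix, base, power) can be illustrated with a small Python sketch. This is a hand-written, rule-based toy and not the ML-based segmentation used in Grobid-quantities. The names `UnitTriple`, `parse_unit` and `scale_exponent`, and the tiny prefix and base tables, are hypothetical. The convention that a division mark negates only the exponent of the next triple is an assumption modelled on the kV2/cm example.

```python
import re
from dataclasses import dataclass

# Illustrative tables only (assumption): a real system needs far larger inventories.
PREFIXES = {"G": 9, "M": 6, "k": 3, "c": -2, "m": -3, "n": -9}  # prefix -> power of ten
BASES = {"V", "m", "s", "A", "K", "T", "Pa"}

# One token is a unit symbol with an optional exponent, a division mark,
# or a multiplication mark (including whitespace).
TOKEN = re.compile(
    r"(?P<symbol>[A-Za-z]+)(?:\^?(?P<power>-?\d+))?|(?P<div>/)|(?P<mul>[*.·\s])"
)


@dataclass(frozen=True)
class UnitTriple:
    prefix: str  # empty string when the unit carries no prefix
    base: str
    power: int


def split_prefix(symbol):
    """Split a symbol such as 'kV' into ('k', 'V'); plain bases win over prefixes."""
    if symbol in BASES:
        return "", symbol
    if symbol[:1] in PREFIXES and symbol[1:] in BASES:
        return symbol[0], symbol[1:]
    raise ValueError(f"cannot segment unit symbol {symbol!r}")


def parse_unit(raw):
    """Segment a raw unit string into a list of (prefix, base, power) triples."""
    triples = []
    sign = 1
    pos = 0
    while pos < len(raw):
        match = TOKEN.match(raw, pos)
        if match is None:
            raise ValueError(f"unexpected character {raw[pos]!r} in {raw!r}")
        pos = match.end()
        if match.group("div"):
            sign = -1  # assumption: '/' negates the exponent of the next triple only
        elif match.group("symbol"):
            prefix, base = split_prefix(match.group("symbol"))
            power = int(match.group("power") or 1)
            triples.append(UnitTriple(prefix, base, sign * power))
            sign = 1
        # multiplication marks carry no information of their own
    return triples


def scale_exponent(triples):
    """Power of ten that converts the prefixed unit to its unprefixed form."""
    return sum(PREFIXES.get(t.prefix, 0) * t.power for t in triples)


print(parse_unit("kV2/cm"))
# [UnitTriple(prefix='k', base='V', power=2), UnitTriple(prefix='c', base='m', power=-1)]
print(scale_exponent(parse_unit("kV2/cm")))  # 8, i.e. 1 kV2/cm = 10**8 V2/m
```

A lookup-based splitter like this fails on ambiguous or unseen symbols, which is the limitation that motivates a learned segmentation model.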