# Leveraging Segmentation of Physical Units through a Newly Open Source Corpus

https://mdr.nims.go.jp/datasets/0a3ba3d0-f072-4dc5-aadc-f31fc0636fa8

## File

- [Leveraging_Segmentation_of_Physical_Units_through_a_Newly_Open_Source_Corpus.pdf](https://mdr.nims.go.jp/filesets/75eff41c-300f-48ae-891b-16084e13be1a/download) ([Detail](https://mdr.nims.go.jp/filesets/75eff41c-300f-48ae-891b-16084e13be1a.md))
- [unit-evaluation-corpus.tei.xml](https://mdr.nims.go.jp/filesets/e6659d87-3dc9-40f7-baea-66d3bf39e0b4/download) ([Detail](https://mdr.nims.go.jp/filesets/e6659d87-3dc9-40f7-baea-66d3bf39e0b4.md))

## Id

0a3ba3d0-f072-4dc5-aadc-f31fc0636fa8

## Local identifier



## Visibility

open_to_public

## State

published

## Created at

2021-08-19T06:08:34.471395Z

## Updated at

2022-10-02T16:25:01.342421Z

## Published at

2021-08-20T03:26:47.026159Z

## Doi

https://doi.org/10.34968/nims.1220

## First published url



## Date published

2019-10-25

## Recorded date published

25/10/2019

## Resource type

dataset

## Manuscript type

authors_original

## Collection



## Title

- title: Leveraging Segmentation of Physical Units through a Newly Open Source Corpus
  title_type: original
  lang: en

## Description

- description: "The identification of physical measurements is a recurrent need in
    materials informatics (MI). For example, extracting superconductor materials
    and their properties requires identifying and understanding temperature, pressure,
    and magnetisation. When designing automatic systems for information extraction from
    scientific literature, identifying the raw measurement alone is not
    sufficient. Quantity transformations, such as normalisation, require an understanding
    of values and units, which appear in unstructured text with ad hoc conventions.
    String matching and lookups fail as unit complexity and variability grow.
    Therefore, a generic unit segmentation system is necessary.\r\nThis contribution
    is part of a larger project called Grobid-quantities, a machine learning (ML)
    based, Open Source system for extracting and normalising physical measurements
    from scientific and patent literature. In this submission, we present a general
    approach for unit representation, and we announce the public release, under a Creative
    Commons licence, of a corpus of segmented physical units. Currently, there are
    no comparable results in the scientific literature because no public datasets are
    available for this task. Our approach to unit representation follows the
    International System of Units (SI), where each unit is represented as a
    product of triples: prefix, base and power. This straightforward approach offers
    the flexibility to support any combination of units from any system of measurement.
    Figure 1 illustrates an example where kV2/cm is tokenised and segmented as a product
    of triples.\r\nWe used the Grobid-quantities ML-based unit segmentation implementation
    to create a new corpus. We used data provided by previous work of some of the
    authors, in which about 2000 units were extracted from 3490 papers from the Journal of
    Applied Physics. The data was pre-annotated and manually corrected.\r\nThe resulting
    corpus contains approximately 700 simple and 1300 complex units, and it is available
    in XML format at the Grobid-quantities repository. It is suitable for evaluating
    new or existing systems for unit segmentation. We plan to increase the coverage
    by adding new data from other domains."
  description_type: abstract
  lang: en
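
The prefix, base and power representation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration only, not the Grobid-quantities implementation: the `UnitTriple` class, the `PREFIX_EXPONENT` table, and the `render` and `decimal_exponent` helpers are hypothetical names, and the segmentation of kV2/cm shown here is one plausible reading of the example rather than a reproduction of Figure 1.

```python
from dataclasses import dataclass
from typing import List

# Decimal exponents for a few SI prefixes.
# Illustrative subset only (an assumption, not the corpus vocabulary).
PREFIX_EXPONENT = {"": 0, "k": 3, "c": -2, "m": -3}


@dataclass(frozen=True)
class UnitTriple:
    prefix: str  # SI prefix symbol, "" when absent
    base: str    # base unit symbol, e.g. "V" or "m"
    power: int   # signed exponent; negative for denominator terms


def render(triples: List[UnitTriple]) -> str:
    """Write a product of triples as a flat string with explicit exponents."""
    parts = []
    for t in triples:
        symbol = f"{t.prefix}{t.base}"
        parts.append(symbol if t.power == 1 else f"{symbol}^{t.power}")
    return "·".join(parts)


def decimal_exponent(triples: List[UnitTriple]) -> int:
    """Power of ten contributed by the prefixes alone."""
    return sum(PREFIX_EXPONENT[t.prefix] * t.power for t in triples)


# Assumed segmentation of kV2/cm: (k, V, 2) times (c, m, -1).
KV2_PER_CM = [UnitTriple("k", "V", 2), UnitTriple("c", "m", -1)]

print(render(KV2_PER_CM))
print(decimal_exponent(KV2_PER_CM))
```

Storing the power as a signed integer means a denominator needs no separate structure, which is why a single flat list of triples can cover compound units.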

## Creator

- name: FOPPIANO, Luca
  role: author
  orcid: https://orcid.org/0000-0002-6114-6164
- name: SUZUKI, Akira
  role: author
  orcid: https://orcid.org/0000-0002-8167-0414
- name: DIEB, Thaer M.
  role: author
  orcid: https://orcid.org/0000-0002-8111-2009
- name: ISHII, Masashi
  role: author
  orcid: https://orcid.org/0000-0003-0357-2832
- name: TANIFUJI, Mikiko
  role: author
  orcid: https://orcid.org/0000-0001-5284-6364

## Contact agent



## Publisher



## Managing organization



## Keyword



## Rights

- description: Creative Commons BY Attribution 4.0 International
  identifier: https://creativecommons.org/licenses/by/4.0/

## Other identifier(s)



## Data origin



## Embargo



## Journal



## Conference



## Related item



## Funding



## Instrument



## Instrument operator



## Instrument managing organization



## Measurement method



## Specimen



## Chemical composition



## Structure for specimen



## Structural feature for specimen



## Specific property for specimen



## Process for specimen treatment



## Computational method



## Energy level/transition state



## Software



## Custom property



## Fileset

- id: 75eff41c-300f-48ae-891b-16084e13be1a
  filename: Leveraging_Segmentation_of_Physical_Units_through_a_Newly_Open_Source_Corpus.pdf
  content_type: application/pdf
  size: 59855
  md5: 77e69d1f27bc7e5707a44cdb0a51dd56
- id: e6659d87-3dc9-40f7-baea-66d3bf39e0b4
  filename: unit-evaluation-corpus.tei.xml
  content_type: application/xml
  size: 127204
  md5: 6b59f06fee5f4a2fc5a9094b13c13a18
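
The sizes and MD5 digests listed above can be used to check a downloaded copy of either file. The sketch below uses only the Python standard library; the `EXPECTED` table and the `md5_hex` and `matches_record` helpers are hypothetical names introduced for illustration, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Size in bytes and MD5 digest, copied from the fileset entries of this record.
EXPECTED = {
    "Leveraging_Segmentation_of_Physical_Units_through_a_Newly_Open_Source_Corpus.pdf":
        (59855, "77e69d1f27bc7e5707a44cdb0a51dd56"),
    "unit-evaluation-corpus.tei.xml":
        (127204, "6b59f06fee5f4a2fc5a9094b13c13a18"),
}


def md5_hex(data: bytes) -> str:
    """Hex-encoded MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def matches_record(data: bytes, filename: str) -> bool:
    """True when both the size and the digest agree with the record."""
    size, digest = EXPECTED[filename]
    return len(data) == size and md5_hex(data) == digest


def check_file(path: Path) -> bool:
    """Read a local file fully (both files are small) and compare it."""
    return matches_record(path.read_bytes(), path.name)
```

For example, `check_file(Path("unit-evaluation-corpus.tei.xml"))` would return `True` for an intact download saved under its original filename in the current directory.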

## Thumbnail

fileset_id: e6659d87-3dc9-40f7-baea-66d3bf39e0b4
filename: unit-evaluation-corpus.tei.xml