FOPPIANO, Luca; SUZUKI, Akira; DIEB M. Thear; ISHII, Masashi; TANIFUJI, Mikiko
Description:
(abstract) The identification of physical measurements is a recurrent need in materials informatics (MI). For example, extracting superconductor materials and their properties [2] requires identifying and understanding quantities such as temperature, pressure and magnetisation. When designing automatic systems for information extraction from scientific literature, identifying the raw measurement alone is not sufficient. Quantity transformations, such as normalisation, require an understanding of values and units, which appear in unstructured text under ad-hoc conventions. String matching and lookups fail as unit complexity and variability grow. A generic unit segmentation system is therefore necessary.
This contribution is part of a larger project called Grobid-quantities [3], an open-source, machine learning (ML) based system for extracting and normalising physical measurements from scientific and patent literature. In this submission, we present a general approach to unit representation, and we announce the public release, under a Creative Commons licence, of a corpus of segmented physical units. Currently, there are no comparable results in the scientific literature because no public datasets are available for this task. Our approach to unit representation follows the International System of Units (SI): each unit is represented as a product of triples, each consisting of a prefix, a base and a power. This straightforward approach offers the flexibility to support any combination of units from any system of measurement. Figure 1 illustrates an example in which kV²/cm is tokenised and segmented as a product of triples.
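The triple representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Grobid-quantities implementation: the names `UnitTriple`, `PREFIX_EXPONENT`, `scale_exponent` and `render` are hypothetical, the prefix table is a small assumed subset of the SI prefixes, and the example assumes that the unit in Figure 1 reads as kilovolt squared per centimetre.

```python
from dataclasses import dataclass
from typing import List

# Assumed subset of SI prefixes, mapped to their power-of-ten exponents.
PREFIX_EXPONENT = {"": 0, "k": 3, "M": 6, "c": -2, "m": -3}


@dataclass(frozen=True)
class UnitTriple:
    """One factor of a unit: prefix, base and power (hypothetical type)."""
    prefix: str  # e.g. "k" for kilo, "" for no prefix
    base: str    # e.g. "V" for volt
    power: int   # negative when the factor sits in the denominator


def scale_exponent(unit: List[UnitTriple]) -> int:
    """Power of ten that converts the unit to its unprefixed form."""
    return sum(PREFIX_EXPONENT[t.prefix] * t.power for t in unit)


def render(unit: List[UnitTriple]) -> str:
    """Write the product of triples as a flat string."""
    return "·".join(f"{t.prefix}{t.base}^{t.power}" for t in unit)


# kV²/cm as a product of two triples: (k, V, 2) and (c, m, -1).
kv2_per_cm = [UnitTriple("k", "V", 2), UnitTriple("c", "m", -1)]

print(render(kv2_per_cm))          # kV^2·cm^-1
print(scale_exponent(kv2_per_cm))  # 8, i.e. 1 kV²/cm = 10^8 V²/m
```

Keeping the prefix separate from the base is what makes normalisation a simple sum of exponents; a string lookup would instead need one entry per prefixed variant.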
We used the ML-based unit segmentation implementation of Grobid-quantities to create a new corpus. The data come from previous work by some of the authors [4], in which about 2000 units were extracted from 3490 papers published in the Journal of Applied Physics. The data were pre-annotated automatically and then manually corrected.
The resulting corpus contains approximately 700 simple and 1300 complex units, and it is available in XML format in the Grobid-quantities repository [3]. It is suitable for evaluating new or existing unit segmentation systems. We plan to increase its coverage by adding data from other domains.
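One way such a corpus could be used for evaluation is exact-match accuracy over segmented units, sketched below. This is an assumption about a possible metric, not the evaluation protocol of Grobid-quantities; the function name `exact_match_accuracy` is hypothetical, the two units are invented toy data rather than corpus entries, and reading the XML file is left out because its schema is not described here.

```python
from typing import List, Tuple

# A segmented unit is a list of (prefix, base, power) triples.
Triple = Tuple[str, str, int]
Unit = List[Triple]


def exact_match_accuracy(gold: List[Unit], predicted: List[Unit]) -> float:
    """Fraction of units whose predicted triples equal the gold triples.

    The comparison is order-sensitive: the same triples in a different
    order count as a mismatch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Invented toy data: the second prediction gets prefix and power wrong.
gold = [
    [("k", "V", 2), ("c", "m", -1)],
    [("m", "m", 1)],
]
predicted = [
    [("k", "V", 2), ("c", "m", -1)],
    [("", "m", 2)],
]

print(exact_match_accuracy(gold, predicted))  # 0.5
```

A stricter or looser metric (for example per-triple precision and recall) may suit complex units better, since a single wrong triple fails the whole unit under exact match.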
Data characteristics:
Rights:
Creative Commons BY Attribution 4.0 International
Keywords:
Publication date: 2019-10-25
Publisher:
Journal:
Funding:
Manuscript type: Author's original (pre-peer-review manuscript)
MDR DOI: https://doi.org/10.34968/nims.1220
Public URL:
Related materials:
Other identifiers:
Contact:
Updated at: 2022-10-03 01:25:01 +0900
Published in MDR at: 2021-08-20 12:26:47 +0900
Description :
Category :
Category description :
Analysis field :
Analysis field description :
Measurement environment :
Standardized procedure :
Measured at :
| File name | Type | Size |
|---|---|---|
| Leveraging_Segmentation_of_Physical_Units_through_a_Newly_Open_Source_Corpus.pdf | application/pdf | 58.5KB |
| unit-evaluation-corpus.tei.xml | application/xml | 124KB |