Article: Leveraging Segmentation of Physical Units through a Newly Open Source Corpus

ISHII, Masashi ; TANIFUJI, Mikiko ; FOPPIANO, Luca ; DIEB M. Thear ; SUZUKI, Akira

Collection

Citation
ISHII, Masashi, TANIFUJI, Mikiko, FOPPIANO, Luca, DIEB M. Thear, SUZUKI, Akira. Leveraging Segmentation of Physical Units through a Newly Open Source Corpus. https://doi.org/10.34968/nims.1360

Description:

(abstract)

The identification of physical measurements is a recurrent task in materials informatics (MI). Quantities and units appear in unstructured text under ad-hoc conventions: it is common to encounter new variations of existing units, or new complex composed units, that are not present in any unit dictionary. As a consequence, string matching and dictionary lookup fail or generate false positives as unit complexity grows. A generic unit segmentation system is therefore necessary. In this submission, we present our approach to unit representation, which can be used to design generic unit normalisation systems. We announce the public availability of a corpus of segmented physical units. This dataset, comprising about 2000 entries in XML format, can be used to train or evaluate sequence labelling models for unit segmentation. Through this contribution, we provide an Open Source (CC-BY licensed) corpus for segmenting physical units, a resource that can be used to evaluate and compare physical measurement segmentation systems.
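To make the contrast in the abstract concrete, the following is a minimal sketch of why exact dictionary lookup breaks on composed units while a segmentation step still recovers their parts. It is rule-based and purely illustrative: it is not the authors' system, nor the sequence labelling approach the corpus is meant to train. The lexicons, the function names (`dictionary_lookup`, `split_prefix`, `segment`), the `(prefix, base, power)` output shape, and the rule that `/` inverts only the factor immediately after it are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny lexicons (assumptions for illustration only).
KNOWN_UNITS = {"m", "s", "kg", "K", "W", "m/s"}
PREFIXES = {"k", "m", "c", "n"}
BASES = {"m", "s", "g", "K", "W", "J", "mol"}

# One factor: optional operator, a run of letters, optional integer exponent.
TOKEN = re.compile(r"([/·*]?)([A-Za-z]+)(?:\^?(-?\d+))?")


def dictionary_lookup(unit):
    """Exact string matching against a closed list of unit strings."""
    return unit in KNOWN_UNITS


def split_prefix(symbol):
    """Split a symbol into (prefix, base), preferring the whole symbol as a base."""
    if symbol in BASES:
        return "", symbol
    if len(symbol) > 1 and symbol[0] in PREFIXES and symbol[1:] in BASES:
        return symbol[0], symbol[1:]
    raise ValueError(f"cannot segment symbol {symbol!r}")


def segment(unit):
    """Segment a composed unit string into (prefix, base, power) triples."""
    triples = []
    pos = 0
    for match in TOKEN.finditer(unit):
        if match.start() != pos:
            # Some characters were skipped, so the string is not fully covered.
            raise ValueError(f"unexpected text at offset {pos} in {unit!r}")
        operator, symbol, exponent = match.groups()
        power = int(exponent) if exponent else 1
        if operator == "/":
            # Assumption: "/" inverts only the factor that directly follows it.
            power = -power
        prefix, base = split_prefix(symbol)
        triples.append((prefix, base, power))
        pos = match.end()
    if pos != len(unit) or not triples:
        raise ValueError(f"cannot segment unit {unit!r}")
    return triples


if __name__ == "__main__":
    for text in ["m/s", "mW/cm^2", "J·kg^-1·K^-1"]:
        print(text, dictionary_lookup(text), segment(text))
```

Here `mW/cm^2` is absent from the closed list, so the lookup rejects it, whereas the segmenter still decomposes it into milli-watt and centi-metre to the power of -2. Hand-written rules like these do not scale to the ad-hoc conventions found in real text, which is the gap a learned segmentation model trained on an annotated corpus is intended to fill.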

Rights:

Keywords: segmentation, quantities, measurements, physical quantities

Publication date: 2019-10-18

Publisher: JSAP

Journal:

Funding:

Manuscript type: Author's original (pre-review manuscript)

MDR DOI: https://doi.org/10.34968/nims.1360

Public URL:

Related materials:

Other identifiers:

Contact:

Updated: 2022-10-03 01:53:43 +0900

Published on MDR: 2021-08-13 01:20:03 +0900

File name: presentation-jsap-2019.pdf
Type: application/pdf
Size: 1.2MB