# Fileset

[main.pdf](https://mdr.nims.go.jp/filesets/35ff6ef5-6c4c-42fd-bf5c-618189b691ad/download)

## Creator

[FOPPIANO, Luca](https://orcid.org/0000-0002-6114-6164), DIEB, Sae, SUZUKI, Akira, BAPTISTA DE CASTRO, Pedro, IWASAKI, Suguru, UZUKI, Azusa, ESPARZA ECHEVARRIA, Miren Garbine, MENG, Yan, TERASHIMA, Kensei, TAKANO, Yoshihiko, ISHII, Masashi

## Rights


## Other metadata

[SuperMat: Construction of a linked annotated dataset from superconductors-related publications](https://mdr.nims.go.jp/datasets/8aa85f51-da22-4b24-97f9-2799a35b3936)

## Fulltext

SuperMat: Construction of a linked annotated dataset from superconductors-related publications

Luca Foppiano¹*, Sae Dieb¹, Akira Suzuki¹, Pedro Baptista de Castro², Suguru Iwasaki², Azusa Uzuki², Miren Garbine Esparza Echevarria², Yan Meng², Kensei Terashima², Laurent Romary³, Yoshihiko Takano², and Masashi Ishii¹*

¹ Material Database Group, MaDIS, NIMS, Tsukuba, 305-0044, Japan

² Nano Frontier Superconducting Materials Group, MANA, NIMS, Tsukuba, 305-0047, Japan

³ ALMAnaCH, Inria, Paris, 75012, France

Corresponding authors (*): Luca Foppiano (FOPPIANO.Luca@nims.go.jp), Masashi Ishii (ISHII.Masashi@nims.go.jp)

January 7, 2021

### Abstract

A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts.
SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.

### Background & summary

The vast majority of scientific knowledge exists as published articles [8, 15, 34, 1]. These publications are presented mainly as text, which is difficult to use as a machine-readable structure. Meanwhile, as a part of the text and data mining (TDM) discipline, computer-assisted information collection from the literature has become a supportive asset for scientific research [33]. In the past decades, new TDM processes were developed for several natural science disciplines to achieve automatic document processing such as information retrieval, entity extraction, and clustering. TDM has been applied in biology for identifying interactions between agents (e.g. bacteria, viruses, genes, and proteins) [11, 23, 22] to support the research on serious diseases including cancer [25]. In chemistry, it was used for the disambiguation of chemical compound names, synthesis extraction, and retrieval [10]. In both domains, the application of TDM was based on manually curated datasets (corpora) that functioned as infrastructure. Examples are the BioCreative IV CHEMDNER corpus [24] in chemistry, and Genia [16] and GENETAG [37, 32] in biology. Such datasets are crucial for developing, training, and evaluating TDM systems.

In comparison, such resources in the materials science domain are rather limited. Reported cases include NaDev [3] on nanocrystal devices research, SC-CoMIcs [38] in the superconductors domain, and a corpus for extracting synthesis recipes [20]. To address this shortage of infrastructure, experimental data is extracted manually [7], or ab-initio calculations are used [13], although such calculations might not accurately describe the real system.
Several challenges still hinder the data-driven exploration of materials (also called Materials Informatics (MI)), namely: the lack of data standards, the infant stage of the data-driven culture, a wide variety of conflicting stakeholders, and missing incentives for researchers to contribute to large collaborative initiatives [12]. To bridge these gaps, it is necessary to create infrastructural resources to support TDM processes in materials science through the automatic construction of databases for materials and their properties. Such applications can minimise the need for humans to read new papers and extract the key information therein. Equally importantly, they enable scientists to focus and leverage computing power and human resources to find deeper relationships between superficially unrelated information. Other applications include providing semantically enriched search engines that accept fine-grained queries [28] to reduce the time needed to access specific information. These processes cannot be established without essential resources such as dictionaries, lexicons, and datasets.

Research on superconducting materials has been growing rapidly towards both fundamental science and practical applications. Superconductors display many intriguing phenomena including zero resistivity, the ability to host a high magnetic field, quantisation of the magnetic flux, and vortex pinning. Current applications of superconductors include medical instruments, high-speed trains, quantum computers, and the Large Hadron Collider (LHC) [29, 17, 2]. However, discovering a new superconductor is a challenging task, as only 3% of candidate materials were found to be superconductors [19].
The National Institute for Materials Science (NIMS) in Japan has been manually constructing databases to support material research, and SuperCon (http://supercon.nims.go.jp) is a manually curated data source for the superconductor domain. These databases would help researchers design new superconducting materials with a higher superconducting critical temperature (Tc) (ideally up to room temperature) [9, 36]. However, the current resources are very limited and not dynamic enough to incorporate the information from new publications in a timely manner. In this paper, we present SuperMat (Superconductors Materials), an annotated linked corpus for superconducting material information. This dataset contains 142 documents with 16052 (7166 unique) entities, and 1398 links that can serve as infrastructural data for TDM processes in the domain of superconducting materials. SuperMat differs from SC-CoMIcs in the following respects: (a) it provides full papers instead of abstracts, and full papers contain more detailed information about the research on superconducting materials; (b) it contains linked entities; and (c) it is publicly available. We also describe the construction guidelines for SuperMat, in the hope of supporting researchers to systematically create annotated data. Furthermore, the unique feature of links between entities in SuperMat will allow the development of more precise methodologies to associate a particular material with its properties.

### Methods

#### Content acquisition

SuperMat originates from PDF documents of scientific articles related to superconductor research. The PDF format is the most widely used format for scientific publications [14].
The original documents were collected from the following sources: (a) the Open Access (OA) version of articles referenced in the SuperCon database records; (b) articles provided by domain experts containing suitable items and potential links of material names, Tc values, measurement methods, and pressures; (c) articles from the "condensed matter" category of arXiv (https://arxiv.org/archive/cond-mat) selected using the search terms "superconductor", "critical temperature", and "superconductivity".

OA versions of articles were obtained using a lookup service for bibliographic data called biblio-glutton (https://github.com/kermitt2/biblio-glutton) that aggregates data from various sources: the Crossref (https://www.crossref.org/) bibliographic database, the Unpaywall (http://unpaywall.org) service, the PubMed Central repository (https://pubmed.ncbi.nlm.nih.gov/), and mappings to other databases. We queried biblio-glutton using the bibliographic data of each article referenced in SuperCon; subsequently, we downloaded the OA article associated with the retrieved record, if available. A record in Unpaywall does not guarantee that the downloaded article can be reused to create derivative works. Therefore, for each article that was not downloaded from arXiv, we manually verified its reusability by checking for an explicit statement of the Creative Commons licence in the PDF document or on the original publisher's page.

#### Preliminary annotation study

A preliminary annotation study was carried out to assess the effort required from the annotators to reach an acceptable Inter-Annotator Agreement (IAA > 0.7). We annotated two randomly selected OA papers, using a preliminary version of the guidelines with a limited tag-set of four labels: `<material>`, `<tc>` (expression describing the presence or absence of superconductivity), `<tcValue>` (value of Tc), and `<doping>` (amount of substitution, such as stoichiometric values, usually expressed as functions of x or y).
The process was iterated multiple times. Each iteration ended with computing the IAA using Krippendorff's alpha coefficient [26, 40], after which the annotators discussed the disagreements and updated the guidelines.

Based on the results in Table 1, IAA reached a satisfactory level (approximately 0.9) after the third iteration. In the second iteration, although the IAA reached 0.7 on three of the four labels, the average agreement was not satisfactory. When analysing the disagreement, we noticed that the low score in the `<doping>` label was caused by a heavy overlap with the `<material>` label, which required a more precise definition in the guidelines.

Based on this preliminary study, the following changes were implemented. (a) The label `<doping>` was merged into `<material>` because, even with detailed documentation, it was too difficult for humans to annotate them consistently. (b) Three more labels were added: measurement methods and pressure (described as parametric conditions in relation to Tc), and class of materials.

#### Tag-set design

The tag-set (also referred to as labels) represents the classes of entities and the types of links between them, which were designed to be extracted from the text (Figure 1).

##### Entities

Entities (also referred to as Named Entities, mentions, or surface forms) are chunks of text that represent information of interest, as follows:

- Class (tag: `<class>`) represents a group of materials defined by certain characteristics. Superconducting materials can be classified according to different criteria such as the composition and magnetic properties. Among the publications collected for this study, the domain experts identified three types of classes based on: (a) the composition and crystal structure, (b) material phenomena (e.g. "I-type" and "II-type superconductivity", "BCS superconductors", "nematic", and "conventional/unconventional superconductivity"), and (c) high/low Tc value (e.g. "high-tc" superconductors).

  In this work, we only considered the (a) classes, mainly because the material composition and crystal structure do not change with time. For example, a cuprate from 1998 is still called a cuprate today. In comparison, many material phenomena used for (b) are not robust enough, and can be biased by the viewpoint of the author(s) or research group, or the measurement methods. Finally, the definition of "high-tc" superconductors (c) is completely relative; i.e., with the progress of research, materials once considered "high-tc" might not be so anymore.

- Material (tag: `<material>`) identifies the name of one or more materials. This label is used to collect the following types of information:
  - Chemical formula indicating the material by its general or stoichiometric formula (e.g. LaFe1-xO7, WB2),
  - Compositional name (e.g. magnesium diboride) or abbreviations (e.g. YBCO),
  - The material's shape (e.g. wire, powder, thin film) or form of material (e.g. single/poly crystal),
  - Modification by a dopant (Zn-doped, Si-doped) or by percentage of doping (2%-doped). We also considered qualitative expressions such as overdoped, lightly doped, and pure as valid information,
  - Substrate information (e.g. grown on MgO(100) film) when it was adjacent to the material name or formula in the text,
  - Additional information about the sample (e.g. as-grown, untwinned, single-layer) when it was adjacent to the material name or formula in the text.

- Superconducting critical temperature (tag: `<tc>`) identifies expressions related to the phenomenon of superconductivity. Not every temperature mentioned in the text is the Tc; it could instead refer to the temperature of other processes or events such as annealing/sintering, specific measurements, or structural changes. This label identifies the presence or absence of superconductivity at a given temperature (showing/not showing superconductivity at this Tc).
  In addition, modifiers of this information (increasing/decreasing Tc) are also retained.

- Superconducting critical temperature value (tag: `<tcValue>`) represents the temperature at which the superconducting phenomenon occurs. It can be defined by different experimental criteria, such as the onset, the mid-point of the resistivity drop, or zero resistivity. This value also considers boundary conditions, such as the onset of superconductivity or zero resistance.
- Applied pressure (tag: `<pressure>`) indicates the applied pressure corresponding to a measured Tc.
- The measurement method (tag: `<me_method>`) indicates the method used to measure or calculate the presence of superconductivity. Here, we considered the following categories: resistivity, magnetic susceptibility, specific heat, and theoretical calculations.

##### Links

The links connect entities of materials or samples to their corresponding properties, conditions, and results. The links are non-directional, and there are no restrictions on the number of links for each entity. We defined three types of links:

- `material-tc`: linking materials to their Tc values.
- `tc-pressure`: connecting Tc and the applied pressure under which it was obtained.
- `tc-me_method`: linking Tc and the corresponding measurement method.

#### Annotation guidelines

Annotation guidelines include the principles and the rules that describe what constitutes desired information for the SuperMat dataset and how to annotate it. They include a detailed description of the specific rules that have been defined for each type of information to be annotated, with one or more definitions and examples illustrating what to annotate in different cases, exceptions, and references. We used an online system to track the discussions and decisions when a question or a comment was raised, and provided a link to such issues in the respective description or example. In addition, the guidelines include linking rules that provide information on how to correctly connect the entities in a relationship.
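
The three link types defined above can be expressed as a small lookup from a pair of entity labels to a link name. The sketch below is illustrative only and is not taken from the SuperMat scripts: the `Entity` class and the `link_type` function are hypothetical names, and anchoring `tc-pressure` and `tc-me_method` on `<tcValue>` entities is an assumption (the paper states the endpoint convention explicitly only for `material-tc`).

```python
from dataclasses import dataclass
from typing import Optional

# Allowed label pairs for each link type. Links are non-directional, so each
# pair is stored as an unordered frozenset.
# Assumption: tc-pressure and tc-me_method attach to <tcValue> entities.
LINK_TYPES = {
    frozenset({"material", "tcValue"}): "material-tc",
    frozenset({"tcValue", "pressure"}): "tc-pressure",
    frozenset({"tcValue", "me_method"}): "tc-me_method",
}


@dataclass(frozen=True)
class Entity:
    """One annotated span (hypothetical container for this sketch)."""
    id: str
    label: str
    text: str


def link_type(a: Entity, b: Entity) -> Optional[str]:
    """Return the link type for two entities, or None if the pair is not allowed."""
    return LINK_TYPES.get(frozenset({a.label, b.label}))


material = Entity("m6", "material", "LaFeAsO1-xHx")
tc_value = Entity("m7", "tcValue", "exceeds 45K")
pressure = Entity("p1", "pressure", "3.0 GPa")

print(link_type(material, tc_value))  # order of the arguments does not matter
print(link_type(material, pressure))  # no direct material-pressure link is defined
```

Storing each pair as an unordered set mirrors the statement that links are non-directional; a check of this kind could flag pairs that the linking rules do not cover.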
The guidelines were built using a dynamic markup language (called reStructuredText) and stored in a git (https://git-scm.com/) version control system repository. We deployed them as HTML files via the web, which were updated automatically after each modification.

#### Annotation support tools

The task of annotating documents is tedious and requires both attention and subject knowledge from the annotators. Annotation support tools aim to maximise the efficiency of annotators and minimise human mistakes. They are composed of a web-based collaborative annotation tool, automatic annotation suggestions, and automatic corpus analysis.

##### Web-based collaborative annotation tool: INCEpTION

The annotation tool is the platform used for creating, correcting and linking annotations. After evaluating several tools, we selected INCEpTION [18, 4], a web-based multi-user platform for machine-assisted rapid dataset annotation construction. INCEpTION provides supportive functionalities that include:

- Multi-layer annotation sheets that allow different annotation schemas over the same documents,
- Two annotation steps: annotation consists of manually correcting pre-imported documents, while curation allows another user to validate the annotations (Figure 5),
- On-the-fly automatic suggestions based on active learning and string matching (Figure 5),
- Bulk annotation corrections, and
- Being open-source (Apache 2.0 license), and under active development at the time of this paper (https://inception-project.github.io/).

##### Annotation suggestions

Previous works have demonstrated that annotation suggestions improve the quality of the output [6, 31, 27]. We provide two types of annotation suggestions. (i) Machine-based annotated data that were assigned to the documents before loading into the annotation tool. Here, we use a machine learning (ML)-based system from a previously implemented prototype [5] to support our tag-set. (ii)
Active learning recommendations provided by INCEpTION are assigned on-the-fly based on previous annotations. The active-learning recommendations are less precise since they aim to increase the recall, and therefore they need to be explicitly accepted by the annotator.

##### Automatic corpus analysis

Automatic corpus analysis is a set of scripts designed to run after the validation step. These scripts automatically find inconsistencies in the links and entities, while extracting the statistics of the corpus. We calculated the inconsistencies by examining every annotated entity and computing the frequency of the same text being annotated with different labels. The script outputs a summary table showing each annotation value, as well as its labels and frequencies. We visually inspected this table, because the reported inconsistencies can be either obvious mistakes (Table 2) or arise from ambiguities (Table 3); therefore their context should be verified.

Although the links are conceptually non-directed, we have defined a practical convention to maintain their consistency. For example, material-tc is always represented as a link between `<tcValue>` and `<material>` entities. The script also computes the statistics (Table 4) for the number of entities (total, unique, by class), the number of links (total, intra- and inter-paragraph, between paragraphs), and other statistical information.

#### Annotation process

The annotation workflow (Figure 2) was designed following the MATTER (Model, Annotate, Train, Test, Evaluate, and Revise) schema [35] and other related work [3, 24]. The workflow is composed of five steps (Figure 2): data preparation, correction, validation, testing and evaluation, and revision. This workflow involves three main actors: the automatic process, computer scientists, and the domain experts.

The first step of the annotation process involves preparing the machine-based annotated data from the source PDF documents.
The PDF files are converted to an XML-based format, and annotation is automatically applied. This is followed by four more steps:

- Annotation: The human annotator can select a document and manually add, remove, or modify each entity based on rules defined in the guidelines. Once the annotation is complete, the document is marked "ready" for the validation.
- Validation/Curation by domain experts: Annotations from different users are validated and merged into a final document (Figure 5). The domain expert ("curator") can compare the different annotated versions, and select the best combination of annotations, or add new ones. This step ensures that the annotations are cross-checked and that the document is validated by domain experts.
- Automatic consistency checks and statistical analysis: This step aims to discover obvious mistakes such as mislabelling or incorrect linking. A sequence labelling model is trained and evaluated using 10-fold cross-validation. The evaluation provides precision, recall, and f-score metrics for all the labels. The resulting model is used for producing machine-based annotated data in the following iteration.
- Review: Retrospective analysis of the past iteration, where unclear cases are discussed and documented in the annotation guidelines.

#### Data transformation

There are two processes of data transformation (Figure 3): (a) from the source document (PDF) to the dataset format representation (XML-based), and (b) from the dataset format representation to the annotation tool exchange formats (https://inception-project.github.io/releases/0.16.1/docs/user-guide.html#sect_formats) and vice-versa.

- PDF to XML-based: This step converts the PDF source document to the dataset format representation in XML following the Text Encoding Initiative (TEI, https://tei-c.org/) format guidelines.
  Such transformation is performed by leveraging the functionalities provided by GROBID (https://github.com/kermitt2/grobid).

  We developed a customised process for collecting a subset of information from the source PDF document. The process extracts the title, keywords, and abstract from the header; and paragraphs, sections, and figure and table captions from the body. All the callouts to references, tables, and figures are ignored. The resulting structured document is then encoded in XML as described below.
- XML to the annotation tool exchange formats: We transform our XML-formatted data into an INCEpTION-compatible import format, such as the WebAnno TSV 3.2 (https://inception-project.github.io/releases/0.17.0/docs/user-guide.html#sect_formats_webannotsv3), and vice-versa using a set of Python scripts. The WebAnno TSV 3.2 format is an extension of the CoNLL (https://www.signll.org/conll/) format, with additions of the header and column representation.

### Data Record

The dataset is composed of 142 PDF documents, of which 92% (130) are OA (Figure 4a). To comply with copyright restrictions, a few articles from our dataset are not publicly available in our repository. The top three publishers represented in the corpus are American Physical Society (APS), Elsevier, and IOP Publishing (Figure 4b). Figure 4c illustrates the distribution by publication date. We summarise SuperMat's content in Table 4, with the statistics of documents, entities, and links given separately. In particular, this dataset contains 16052 (7166 unique) entities spread over six labels and 1398 links.

Each document is encoded according to the XML TEI guidelines, which is a rich format for document representation. We have carried out no specific customisation, in order to remain fully compliant with the general TEI schema. A TEI document has two main parts: the header (within the `<teiHeader>` tags) containing all the document metadata, and the body (within the section delimited by the `<text>` tag).
The transformed data has the following structure:

    <TEI xml:lang="en" xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>[...]</title>
          </titleStmt>
          <publicationStmt>
            <publisher>[...]</publisher>
          </publicationStmt>
        </fileDesc>
        <encodingDesc/>
        <abstract>
          <p>[...]</p>
          <ab type="keywords">[...]</ab>
        </abstract>
        <profileDesc></profileDesc>
      </teiHeader>
      <text>
        <body>
          <p>[...]</p>
          <ab type="tableCaption"> [...] </ab>
          <p> [...] </p>
          <ab type="figureCaption"> [...] </ab>
        </body>
      </text>
    </TEI>

We transformed the source documents into these TEI-compliant structures using a simplified representation for specific content types. The general objective is to flatten the content into a generic structure where priority is given to the annotations. For instance, the keywords section, which groups together the key terms defined by the author(s) of the paper, is encoded using the generic tag `<ab type="keywords">` as free text, instead of the dedicated `<keywords>` element that would typically be part of the header. For both the abstract and the article body, the text is segmented into paragraphs (by means of the `<p>` element). The text is annotated with the generic `<rs>` (referencing string) element adorned with three attributes: `@type` (the entity type), `@corresp` (to provide a link to another annotation such as from material to Tc), and `@xml:id` (to uniquely identify the annotation for referencing or linking purposes).

Because only the captions of tables and figures are retained from the original source, a simplified encoding was defined by means of the `<ab>` element characterised by a `@type` attribute; that is, `<ab type="figureCaption">` for figure captions and `<ab type="tableCaption">` for table captions.
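
As a minimal sketch of how annotations encoded this way could be read back, the following uses Python's standard `xml.etree.ElementTree` module. It is not the project's own code (the released scripts rely on BeautifulSoup); the function name `read_annotations` and the tiny `SAMPLE` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document following the structure described above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><rs type="material" xml:id="m1">MgB2</rs> shows a Tc of
<rs type="tcValue" xml:id="t1" corresp="#m1">39 K</rs>.</p>
</body></text></TEI>"""


def read_annotations(tei_xml):
    """Return (entities, links) found in the <rs> elements of a TEI string.

    entities: list of dicts with the keys "id", "type" and "text"
    links:    list of (source_id, target_id) pairs taken from @corresp
    """
    root = ET.fromstring(tei_xml)
    entities, links = [], []
    for rs in root.iter(f"{{{TEI_NS}}}rs"):
        entity_id = rs.get(XML_ID)  # may be None: not every <rs> has an xml:id
        entities.append({
            "id": entity_id,
            "type": rs.get("type"),
            "text": "".join(rs.itertext()),
        })
        # @corresp holds comma-separated references such as "#m6,#9".
        for ref in rs.get("corresp", "").split(","):
            target = ref.strip().lstrip("#")
            if target:
                links.append((entity_id, target))
    return entities, links


entities, links = read_annotations(SAMPLE)
for entity in entities:
    print(entity)
print(links)
```

A target identifier may point outside the fragment being parsed, so the sketch keeps the raw identifiers instead of resolving them to entities.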
Here is an example:

    <p>The electron-doped high-<rs type="tc">transition-temperature</rs>
    (<rs type="tc">Tc</rs>) <rs type="class">iron-based pnictide</rs>
    superconductor <rs type="material" xml:id="m6">LaFeAsO1-xHx</rs> has a unique
    phase diagram: Superconducting (SC) double domes are
    sandwiched by antiferromagnetic phases at ambient
    pressure and they turn into a single dome with
    a maximum <rs type="tc">Tc</rs> that
    <rs type="tcValue" xml:id="m7" corresp="#m6,#9">exceeds 45K</rs>
    at a pressure of <rs type="pressure" corresp="#m7">3.0 GPa</rs>.
    [...]</p>

In the above snippet, the entities "3.0 GPa", "exceeds 45K" and "LaFeAsO1-xHx" are linked together via the `@corresp` and `@xml:id` pairs. This schema supports multiple annotations to any part of the document. For example, the entity "exceeds 45K" has a second link with the corresponding identifier ("#9") to an annotation outside this paragraph.

### Applications

SuperMat is constructed as a resource for TDM applications in superconducting materials. It can be used as a data source in several complementary tasks: (1) creation of an automatic information extraction system for dataset creation, (2) article classification, (3) named entity extraction (for example, automatic dictionary construction), (4) clustering and document synthesis, (5) training of machine learning (ML) algorithms, (6) evaluation of rule-based or ML-based algorithms, and (7) development of downstream processes, such as a material name parser or quantity normalisation.

#### Reusability

The data structure employed in this study (classes of materials, material names, and related properties) is similar to that used in other domains in materials science.
Therefore, SuperMat can be reused to facilitate or bootstrap the creation of new TDM processes in areas of materials research besides superconductors. SuperMat could be used as a feature for a machine learning model for named entity recognition (NER) or entity linking (EL) systems in materials science such as magnetocaloric, piezoelectric, and thermoelectric domains.

#### Practical applications

Such a dataset may benefit several types of possible applications:

- Evaluation tasks: This corpus can be used for evaluation tasks on automatic extraction. In particular, we can envision two popular tasks in superconducting materials science, namely: (a) NER and (b) EL methods. EL techniques have been mainly designed and studied using text from Wikipedia and newswire services, which represent most of the available data. To the best of our knowledge, however, there is no application within materials science.
- Automatic information extraction for superconducting materials: This dataset can be used as training data for such a purpose. Automatic information extraction using ML and text mining techniques can accelerate the construction of databases for superconducting materials.
- Document retrieval: Information retrieval is a key application helping researchers overcome information overload. One way is through query expansion to cover multiple expressions of the same term. By collecting and clustering all expressions under the same concept, it would be possible to retrieve documents when, for example, the resistivity measurement is described by a phrase other than "resistivity". Furthermore, the assigned labels can be used to boost documents where a certain term belongs to a specific label.
  For example, cobalt oxide can appear as either `<material>` or `<class>` depending on the context, while a user would like to obtain documents where cobalt oxide appears as `<material>`.
- Weighted-clustering: Scientific document clustering has recently gained growing attention because of its potential capacity for finding additional relevant documents of interest. For example, clustering can help locate similar experimental settings in a large collection of documents. However, clustering documents based on their general content might not be optimal for finding such detailed similarities. Annotation can be leveraged to tilt the clustering algorithm toward entity similarity, which may provide a more focused clustering towards a specific type of information.

### Technical Validation

The following measures were employed to ensure the creation of a high-quality dataset:

- Each document was revised and validated by domain experts.
- The workflow begins by assigning machine-based annotated data. This has been demonstrated to improve the annotation task in several aspects, namely: time consumption, error rate, and annotation agreement [6, 31, 27].
- On-the-fly automatic annotation recommendations provide fresh suggestions based on online decisions made by the annotators.
- The annotators have rapid access to changes in the annotation guidelines.
- The discussions were documented and linked in the guidelines.
- Reviews are discussed and approved collaboratively between domain experts and other annotators.

These guidelines are a vital piece of this work since they contain knowledge accumulated from these activities. However, measuring the completeness of the guidelines is challenging. Assuming that the documents validated by domain experts represent the ground truth, we conducted IAA analysis between different annotators against the ground truth, using the Krippendorff's alpha metric [26]. Table 5 shows the average IAA, which is satisfactory with a value of approximately 0.9.
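
For readers unfamiliar with the metric, the sketch below computes Krippendorff's alpha for nominal labels from a coincidence matrix. It is an illustration only: the values reported in this paper were computed with the Java library DKPro Statistics, and both the function name and the input layout (one list of labels per annotated unit, with `None` for a missing annotation) are assumptions made here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: iterable of lists; each inner list holds the labels that the
           annotators assigned to one unit (None marks a missing label).
    Returns NaN when alpha is undefined (fewer than two pairable labels,
    or only one category in use).
    """
    # Coincidence matrix: every ordered pair of labels given to the same unit
    # by different annotators, weighted by 1 / (m - 1) for a unit with m labels.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # a unit labelled by a single annotator is not pairable
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        return float("nan")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


# Two annotators labelling four tokens; they disagree on the last one.
example = [
    ["material", "material"],
    ["tcValue", "tcValue"],
    ["material", "material"],
    ["tcValue", "tc"],
]
print(krippendorff_alpha_nominal(example))
```

How spans are turned into units (for example token by token) affects the score; that choice is not specified here and is left to the caller.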
The highest score is obtained in the `<material>` entities, while the lowest one is obtained in `<pressure>`, which appears less frequently in the papers. The agreement in `<tcValue>` can appear to be too low as compared with other labels such as `<class>`, which is, at first look, more ambiguous. We analysed the different cases and identified three reasons why this happens. First, `<tcValue>` may depend heavily on the context, which requires more human attention, and it is therefore more prone to errors. Second, our suggestion system is challenged in its ability to disambiguate critical temperatures from other temperature data, leading to incorrect or invalid suggestions. Finally, the presence of mathematical symbols (e.g. `~`, `<`, and `>`) or other modifiers ("up to", "exceeds", etc.) before the `<tcValue>` could generate small disagreements that accumulate in the average score.

To more precisely isolate the impact of the guidelines, we grouped the IAA results by level of domain experience. Table 6 displays the IAA between the validated data and the data corrected by (a) domain experts (researchers who conduct superconducting development experiments), (b) non-domain-experts (researchers with no experience with superconducting materials), and (c) novices (students in materials science with limited domain experience). As expected, the domain experts have the highest agreement, and the IAA value (around 0.95) is 0.06 higher on average than that of non-domain experts. Thus, superconducting materials is a complex domain that requires knowledge in materials science to produce high-quality data, while crowdsourcing initiatives such as the Amazon Mechanical Turk might not work well.

Furthermore, we measured the reliability of the guidelines by observing how quickly novices could reach a satisfactory agreement with the validation of the domain experts, without any previous training on the guidelines.
From Table 6, the novices can attain high IAA results by only using the guidelines and our annotation support tools. The average difference in agreement with domain experts (around 0.05) indicates that the guidelines are precise and complete, and that the annotation tools offer sufficient support.

### Usage Notes

SuperMat is provided as a set of XML files, whose format is described above. The dataset is maintained at the GitHub repository https://github.com/lfoppiano/SuperMat and will be updated regularly. We will roll out releases on a regular basis. Releases will be accessible at GitHub and at the National Institute for Materials Science (NIMS)'s Material Data Repository (MDR) platform https://mdr.nims.go.jp/, which permits DOIs to be automatically assigned to datasets. The current repository includes 130 OA articles, and their respective bibliographic data as reference. The copyrighted articles will not be re-distributed due to their restrictive licenses.

### Code Availability

The developed code is available from the SuperMat GitHub repository. The data transformation scripts were written in Python and can be run from the command line. They require BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/), an open-source library for parsing XML and HTML formats. The data analysis scripts were developed as Jupyter notebooks (https://jupyter.org/) which can easily output results and graphs in the browser. The open source annotation tool is INCEpTION (https://inception-project.github.io/). The content was acquired using biblio-glutton (https://www.github.com/kermitt2/biblio-glutton) and GROBID (https://www.github.com/kermitt2/grobid). We computed the IAA using the Java library DKPro Statistics (https://dkpro.github.io/dkpro-statistics/) [30].

### References

[1] Bo-Christer Björk, Annikki Roos, and Mari Lauri. Scientific journal publishing: yearly volume and open access availability. Inf. Res., 14, 2009.

[2] Laura Cardani, Francesco Bellini, Nicola Casali, M. G.
Casellano, Ivan Colantoni, Alessandro Coppolecchia, Carlo Cosmelli, Angelo Cruciani, A. D’Addabbo, Sergio Di Domizio, Mario Martinez, Carlos Tomei, and M Vignati. New application of superconductors: High sensitivity cryogenic light detectors. Nuclear Instruments & Methods in Physics Research Section A-accelerators Spectrometers Detectors and Associated Equipment, 845:338–341, 2017.

[3] Thaer M. Dieb, Masaharu Yoshioka, and Shinjiro Hara. Nadev: An annotated corpus to support information extraction from research papers on nanocrystal devices. Journal of Information Processing, 24:554–564, jan 2016.

[4] Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 76–84, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

[5] Luca Foppiano, Thaer M. Dieb, Akira Suzuki, and Ishii Masashi. Proposal for automatic extraction framework of superconductors related information from scientific literature. THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, 2019.

[6] Karën Fort and Benoît Sagot. Influence of pre-annotation on POS-tagged corpus development. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 56–63, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[7] Michael W. Gaultois, Taylor D. Sparks, Christopher K. H. Borg, Ram Seshadri, William D. Bonificio, and David R. Clarke. Data-driven review of thermoelectric materials: Performance and resource considerations. Chemistry of Materials, 25(15):2911–2920, 2013.

[8] Vincas Grigas, Simona Juzeniene, and Jone Velickaite. ‘just google it’ - the scope of freely available information sources for doctoral thesis writing. Inf. Res., 22, 2017.

[9] James J. Hamlin. Superconductivity near room temperature.
Nature, 569:491–492, 2019.

[10] Lezan Hawizy, David M. Jessop, Nico Adams, and Peter Murray-Rust. Chemicaltagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics, 3:17–17, 2011.

[11] Min He, Yi Wang, and Wei Li. Ppi finder: A mining tool for human protein-protein interactions. PLOS ONE, 4(2):1–6, 02 2009.

[12] J. Hill, Gregory J. Mulholland, K. Persson, R. Seshadri, C. Wolverton, and B. Meredig. Materials science with large-scale data and informatics: Unlocking new opportunities. Mrs Bulletin, 41:399–409, 2016.

[13] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013.

[14] Duff Johnson. Pdf statistics – the universe of electronic documents, 2018-05-14.

[15] Madian Khabsa and C. Lee Giles. The number of scholarly documents on the public web. PLoS ONE, 9, 2014.

[16] Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. Genia corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1:i180–2, 2003.

[17] Kaname Kizu, Katsuhiko Tsuchiya, Yozo Kashiwa, Haruyuki Murakami, and Kôichi Yoshida. Construction of the jacketing facility and first production results of superconductor for jt-60sa. IEEE Transactions on Applied Superconductivity, 20:538–542, 2010.

[18] Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pages 5–9. Association for Computational Linguistics, June 2018.

[19] Tomohiko Konno, H. Kurokawa, F. Nabeshima, Y. Sakishita, Ryo Ogawa, I. Hosako, and A. Maeda. Deep learning model for finding new superconductors.
ArXiv, abs/1812.01995, 2018.

[20] Olga Kononova, Haoyan Huo, Tanjin He, Ziqin Rong, Tiago Botari, Wenhao Sun, Vahe Tshitoyan, and Gerbrand Ceder. Text-mined dataset of inorganic materials synthesis recipes. Scientific Data, 6(1):203, October 2019.

[21] H. Kotegawa, T. Kawazoe, H. Tou, K. Murata, H. Ogino, K. Kishio, and J. Shimoyama. Contrasting pressure effects in sr2vfeaso3 and sr2scfepo3. arXiv: Superconductivity, 2009.

[22] Martin Krallinger, José M. G. Izarzugaza, Carlos Rodríguez Penagos, and Alfonso Valencia. Extraction of human kinase mutations from literature, databases and genotyping studies. BMC Bioinformatics, 10:S1–S1, 2009.

[23] Martin Krallinger, Florian Leitner, and Alfonso Valencia. Analysis of Biological Processes and Diseases Using Text Mining Approaches, pages 341–382. Humana Press, Totowa, NJ, 2010.

[24] Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Dong-Hong Ji, Daniel M. Lowe, Roger A. Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktäschel, Sérgio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S. V. Ramanan, P. Senthil Nathan, Slavko Zitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A. Akhondi, Jan A. Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M. Dieb, Miji Choi, Karin M. Verspoor, Madian Khabsa, C. Lee Giles, Hongfang Liu, K. E. Ravikumar, Andre Lamurias, Francisco M. Couto, Hong-Jie Dai, Richard Tzong-Han Tsai, Caglar Ata, Tolga Can, Anabel Usie, Rui Alves, Isabel Segura-Bedmar, Paloma Martínez, Julen Oyarzábal, and Alfonso Valencia. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7:S2–S2, 2015.

[25] Alexander Krasnitz. Cancer bioinformatics. In Methods in Molecular Biology, 2019.

[26] Klaus Krippendorff. Reliability in Content Analysis: Some Common Misconceptions and Recommendations.
Human Communication Research, 30(3):411–433, 01 2006.

[27] Todd Lingren, Louise Deleger, Katalin Molnar, Haijun Zhai, Jareen Meinzen-Derr, Megan Kaiser, Laura Stoutenborough, Qi Li, and Imre Solti. Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements. Journal of the American Medical Informatics Association: JAMIA, 21(3):406–413, 2014.

[28] Huaping Liu, Feng Wang, Fuchun Sun, and Bin Fang. Surface material retrieval using weakly paired cross-modal learning. IEEE Transactions on Automation Science and Engineering, 16:781–791, 2019.

[29] Philippe Mangin and Rémi Kahn. Superconductivity: An introduction. Springer, 2016.

[30] Christian M. Meyer, Margot Mieskes, Christian Stab, and Iryna Gurevych. DKPro agreement: An open-source Java library for measuring inter-rater agreement. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 105–109, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.

[31] Aurélie Névéol, Rezarta Islamaj Dogan, and Zhiyong Lu. Semi-automatic semantic annotation of pubmed queries: A study on quality, efficiency, satisfaction. Journal of biomedical informatics, 44 2:310–8, 2011.

[32] Tomoko Ohta, Jin-Dong Kim, Sampo Pyysalo, Yue Wang, and Jun’ichi Tsujii. Incorporating genetag-style annotation to genia corpus. In BioNLP@HLT-NAACL, 2009.

[33] Elsa A. Olivetti, Jacqueline M. Cole, Edward Kim, Olga Kononova, Gerbrand Ceder, Thomas Yong-Jin Han, and Anna M. Hiszpanski. Data-driven materials research enabled by natural language processing and information extraction. Applied Physics Reviews, 7(4):041317, 2020.

[34] Enrique Orduña-Malea, Juan Manuel Ayllon, Alberto Martín-Martín, and Emilio Delgado López-Cózar. Methods for estimating the size of google scholar.
Scientometrics, 104:931–949, 2015.

[35] James Pustejovsky and Amber Stubbs. Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O’Reilly Media, Inc., 2012.

[36] Valentin Stanev, Corey Oses, A. Kusne, Efrain Rodriguez, Johnpierre Paglione, Stefano Curtarolo, and I. Takeuchi. Machine learning modeling of superconducting critical temperature. npj Computational Materials, 4, 09 2017.

[37] Lorraine K. Tanabe, Natalie Xie, Lynne H. Thom, Wayne Matten, and W. John Wilbur. Genetag: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6:S3–S3, 2005.

[38] Kyosuke Yamaguchi, Ryoji Asahi, and Yutaka Sasaki. SC-CoMIcs: A superconductivity corpus for materials informatics. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6753–6760, Marseille, France, May 2020. European Language Resources Association.

[39] Shigeki Yonezawa, Y. Muraoka, and Z. Hiroi. New β-pyrochlore oxide superconductor csos2o6. Journal of the Physical Society of Japan, 73:1655–1656, 2004.

[40] Antonia Zapf, Stefanie Castell, Lars Morawietz, and André Karch. Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology, 16(1):93, August 2016.

### Acknowledgements

We would like to thank Tanifuji Mikiko for her continuous support, as well as the enthusiasm and openness with which she led the Data PlatForm Data Center (DPFC, https://www.nims.go.jp/eng/research/materials-data-pf/index.html) at NIMS. Our warmest thanks to Patrice Lopez, the author of Grobid (https://github.com/kermitt2/grobid) and other TDM open-source projects.

### Author contributions statement

L.F. designed and developed the work (data preparation, annotation tools, IAA experiments, automatic annotations). M.I. and Y.T. supervised the project. L.R. defined the standardised dataset TEI format. L.F., S.D., A.S., A.U., M.G.E.E., P.B.C., Y.M., S.I., and K.T.
performed the dataset annotation and validation. M.G.E.E., P.B.C., Y.M., S.I., K.T., and Y.T. validated the corpus. L.F. wrote the manuscript with assistance in editing from S.D., M.I., K.T., P.B.C., Y.M., and M.G.E.E. All authors reviewed and approved the final manuscript.

### Competing interests

The authors declare no competing interests.

### Figures & Tables

| Iteration # | IAA | IAA <material> | IAA <tc> | IAA <tcValue> | IAA <doping> |
|---|---|---|---|---|---|
| 1 | 0.45 | 0.45 | 0.56 | 0.50 | 0.21 |
| 2 | 0.65 | 0.75 | 0.85 | 0.85 | 0.39 |
| 3 | 0.89 | 0.89 | 0.91 | 0.88 | 0.94 |

Table 1: Summary of the IAA for each annotation iteration.

Figure 1: Example in the annotated corpus. The excerpt was taken from [21].

| Text | Label 1 | # | Label 2 | # |
|---|---|---|---|---|
| LiFeAs | <material> | 89 | <class> | 1 |
| Bi-2212 | <material> | 34 | <class> | 1 |
| cobalt oxide | <material> | 89 | <class> | 1 |
| RE-123 | <material> | 34 | <class> | 1 |

Table 2: Inconsistencies resulting from human mistakes.

| Text | Label 1 | # | Label 2 | # |
|---|---|---|---|---|
| superconducting transition | <material> | 1 | <tc> | 61 |
| NCCO | <material> | 14 | <tc> | 1 |
| superconducting transition temperatures | <material> | 1 | <tc> | 11 |
| occurrence of superconductivity | <material> | 1 | <tc> | 1 |

Table 3: Inconsistencies resulting from the overlapping of <material> and <class> labels.

Figure 2: Annotation workflow. Different colours illustrate the involvement of each group at each step of the workflow.

Figure 3: Summary of the data transformation flows.

| Category | Measure | Count |
|---|---|---|
| Documents | Files | 142 |
| Documents | Paragraphs | 2800 |
| Documents | Sentences | 18344 |
| Documents | Tokens | 1118432 |
| Entities | Entities | 16040 |
| Entities | Unique entities | 7151 |
| Entities | Labels | 6 |
| Links | Links | 1399 |
| Links | Links_ip | 1286 |
| Links | Links_ep | 113 |

Table 4: Statistical overview of the dataset. Links_ip indicates the number of links within the same paragraph (intra-paragraph). Links_ep indicates the number of links between different paragraphs (extra-paragraph).

Figure 4: Distribution of papers in the dataset by (a) license (Open Access (CC-BY) vs copyrighted), (b) publisher, and (c) year of publication.

Figure 5: INCEpTION curation interface.
The example is taken from [39].

| Label | Average |
|---|---|
| <material> | 0.956 |
| <me_method> | 0.887 |
| <pressure> | 0.723 |
| <class> | 0.925 |
| <tcValue> | 0.863 |
| <tc> | 0.831 |
| Micro average | 0.911 |

Table 5: Average IAA between the annotated and validated documents.

| Label | Domain experts | Non-domain experts | Novices |
|---|---|---|---|
| <material> | 0.969 | 0.950 | 0.924 |
| <me_method> | 0.890 | 0.862 | 0.901 |
| <pressure> | 0.836 | 0.741 | 0.746 |
| <class> | 0.990 | 0.836 | 0.899 |
| <tcValue> | 0.895 | 0.734 | 0.841 |
| <tc> | 0.874 | 0.776 | 0.830 |
| All labels | 0.940 | 0.882 | 0.896 |
| # paragraphs | 1066 | 1648 | 325 |

Table 6: Calculated IAA for annotations produced by domain experts, non-domain experts, and novices compared to the validated version. Annotations from domain experts are cross-validated.
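
The group comparison discussed in the text can be reproduced directly from the Table 6 values. The following is a minimal Python sketch, not part of the SuperMat code base: it copies the Table 6 scores into a dictionary and computes how far non-domain experts and novices fall below the domain experts. The names `IAA_TABLE6`, `GROUPS`, `gap_to_experts`, and `largest_gap` are hypothetical and introduced here only for illustration.

```python
# Illustrative sketch only: the values are copied from Table 6; the variable and
# function names are hypothetical and not taken from the SuperMat repository.

# IAA per label as (domain experts, non-domain experts, novices).
IAA_TABLE6 = {
    "material": (0.969, 0.950, 0.924),
    "me_method": (0.890, 0.862, 0.901),
    "pressure": (0.836, 0.741, 0.746),
    "class": (0.990, 0.836, 0.899),
    "tcValue": (0.895, 0.734, 0.841),
    "tc": (0.874, 0.776, 0.830),
    "All labels": (0.940, 0.882, 0.896),
}
GROUPS = ("domain experts", "non-domain experts", "novices")


def gap_to_experts(label, group):
    """Return the domain-expert IAA minus the IAA of `group` for `label`."""
    scores = IAA_TABLE6[label]
    # Round to three decimals, the precision reported in Table 6.
    return round(scores[0] - scores[GROUPS.index(group)], 3)


def largest_gap(group):
    """Return the individual label on which `group` trails the experts most."""
    labels = [name for name in IAA_TABLE6 if name != "All labels"]
    return max(labels, key=lambda name: gap_to_experts(name, group))


if __name__ == "__main__":
    for group in GROUPS[1:]:
        overall = gap_to_experts("All labels", group)
        worst = largest_gap(group)
        print(f"{group}: overall gap {overall:.3f}, "
              f"largest on <{worst}> ({gap_to_experts(worst, group):.3f})")
```

With the Table 6 values, the overall gaps come out at 0.058 for non-domain experts and 0.044 for novices, consistent with the differences of roughly 0.06 and 0.05 quoted in the text. The largest per-label gap is on <tcValue> for non-domain experts and on <class> for novices.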