# Fileset

[ml5c00375_si_002.pdf](https://mdr.nims.go.jp/filesets/113a3bf7-4f62-43c1-840a-fcb08be15b5f/download)

## Creator

Yuna Oikawa, [Takanori Uzawa](https://orcid.org/0000-0001-6042-513X), Francois Berenger, Noriko Minagawa, Akiko Yumoto, Hideaki Takaku, [Ryo Tamura](https://orcid.org/0000-0002-0349-358X), [Yoshihiro Ito](https://orcid.org/0000-0002-1154-253X), [Koji Tsuda](https://orcid.org/0000-0002-4288-1606)

## Rights

[Creative Commons BY-NC-ND Attribution-NonCommercial-NoDerivs 4.0 International](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## Other metadata

[GPepT: A Foundation Language Model for Peptidomimetics Incorporating Noncanonical Amino Acids](https://mdr.nims.go.jp/datasets/9d81f75d-cc04-410f-ae94-4f7edd013889)

## Fulltext

Supporting information: GPepT: A foundation language model for peptidomimetics incorporating non-canonical amino acids

Yuna Oikawa¹, Takanori Uzawa²,⁵, Francois Berenger¹, Noriko Minagawa², Akiko Yumoto², Hideaki Takaku⁵, Ryo Tamura¹,³,⁴, Yoshihiro Ito⁵, and Koji Tsuda¹,³,⁴,*

¹ Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba 277-8561, Japan.

² Emergent Bioengineering Materials Research Team, RIKEN Center for Emergent Matter Science, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan.

³ Center for Basic Research on Materials, National Institute for Materials Science (NIMS), Tsukuba 305-0044, Japan.

⁴ RIKEN Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo 103-0027, Japan.

⁵ RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan.

E-mail: tsuda@k.u-tokyo.ac.jp

### Section S1: Algorithmic details of Monomerizer

1. Translation to a Molecular Representation: The structural formula is converted into a molecular representation. If the molecular representation contains multiple substructures, all smaller parts (e.g., ions) are removed.
2. Template Matching: The molecular representation is compared to template molecules of the 20 canonical amino acids. If a match is found, the corresponding atoms in the input molecule are labeled as such. To avoid incorrect partial matches, the template dictionary is organized from the largest to the smallest amino acids.
3. Peptide Bond Identification: Atoms that match the peptide bond template structures (shown below) are labeled as such. If the number of bonds is smaller than the minimum set by the user's preference, the procedure for that molecular representation is terminated. In this study, the minimum was set to 3.
4. Non-Canonical Fragments Identification: Atoms that remain unlabeled are identified as non-canonical fragments. A breadth-first search is performed to group these atoms by peptide bonds.
   If a labeled peptide bond is found within a ring structure, the process is terminated, as cyclic sequences are not the focus.
5. Fragment Isolation: All atoms labeled with canonical amino acids and peptide bonds are temporarily removed. The bonding sites of these fragments to neighboring amino acids are recorded for future use in the program.
6. Fragment Classification: Each fragment is classified as either an ncAA or a terminal modification based on whether it matches any valid backbone template. Canonical SMILES of these monomers and the labeled peptide molecule are saved for later processing. After processing each input structural formula, the list of obtained non-canonical fragments is deduplicated based on their tautomer hash.
7. Template Re-matching: A final round of labeling is performed on each output molecule, now including the obtained non-canonical fragments as templates. Groups of labeled atoms are checked for connections to peptide bonds; groups with only peptide bonds connected are identified as the termini of the sequence. Of these, the monomer with an N atom at the end is identified as the N-terminus (the start of the sequence). A breadth-first search is conducted starting from the N-terminus to determine the sequence of monomers.
8. Output Results: The algorithm produces detailed output for both monomeric units and complete peptides. For each peptide and each monomer, we provide a structural illustration and a sequence representation. To maintain data integrity, Monomerizer removes invalid sequences containing misplaced terminal modifications. In this study, we also filtered out any monomers that cannot be found on PubChem, as well as the sequences containing them.

Figure S1: Comparison of non-canonical amino acids (ncAAs), terminal modifications and canonical amino acids (cAAs) mined from ChEMBL. (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.
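
The breadth-first grouping described in step 4 of Section S1 can be illustrated with a minimal sketch. This is not the Monomerizer implementation: the toy adjacency dictionary, the label strings, and the function name `group_unlabeled_atoms` are all assumptions made for illustration, and a real molecule would come from a cheminformatics toolkit rather than a hand-written graph.

```python
from collections import deque


def group_unlabeled_atoms(adjacency, labels):
    """Group unlabeled atoms into fragments with breadth-first search.

    adjacency: dict mapping atom index -> list of bonded atom indices
               (hypothetical toy format, not Monomerizer's data model).
    labels:    dict mapping atom index -> label string; unlabeled atoms
               are simply absent from the dict.
    Returns a list of (atom indices, attachment-site indices) tuples.
    """
    visited = set()
    fragments = []
    for start in sorted(adjacency):
        # Only unlabeled, not-yet-visited atoms can seed a new fragment.
        if labels.get(start) is not None or start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        atoms, attachment_sites = [], set()
        while queue:
            atom = queue.popleft()
            atoms.append(atom)
            for neighbor in adjacency[atom]:
                if labels.get(neighbor) is not None:
                    # Labeled neighbor: do not cross it, but record where
                    # the fragment attaches to the rest of the molecule.
                    attachment_sites.add(neighbor)
                elif neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        fragments.append((sorted(atoms), sorted(attachment_sites)))
    return fragments


# Toy linear chain 0-1-2-3-4-5-6 (assumed example):
# atoms 2 and 4 belong to peptide bonds, atom 3 to a canonical residue.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
labels = {2: "peptide_bond", 3: "canonical", 4: "peptide_bond"}

fragments = group_unlabeled_atoms(adjacency, labels)
print(fragments)
```

In this toy chain the search yields two fragments, atoms 0 and 1 attached at atom 2, and atoms 5 and 6 attached at atom 4. Recording the attachment sites alongside each fragment mirrors the bookkeeping that step 5 describes for bonding sites.
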

Figure S2: Comparison of peptidomimetics and peptides mined from ChEMBL (Dataset P). (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.

Table S1: Valid peptidomimetics chosen for antimicrobial activity test.

Figure S3: Bacterial growth (OD600) after 24 hours against peptide concentration (Pep3 and Pep5).
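
Returning to step 7 of Section S1, the sequence determination from the N-terminus can be sketched as a breadth-first walk over a monomer graph. The sketch below is a simplified illustration under stated assumptions, not the authors' code: the dictionaries `bonds` and `has_free_amine`, the monomer names, and the rule that a terminal monomer has at most one peptide-bonded neighbor are all hypothetical.

```python
from collections import deque


def order_monomers(bonds, has_free_amine):
    """Order monomers of a linear peptide starting at the N-terminus.

    bonds:          dict mapping monomer id -> list of monomer ids linked
                    to it through a peptide bond (hypothetical toy format).
    has_free_amine: dict mapping monomer id -> True if the monomer has an
                    N atom at the chain end (assumed to be precomputed).
    """
    # Assumption: terminal monomers have at most one bonded neighbor.
    terminals = [m for m, neighbors in bonds.items() if len(neighbors) <= 1]
    n_terminals = [m for m in terminals if has_free_amine.get(m, False)]
    if len(n_terminals) != 1:
        raise ValueError("expected exactly one N-terminal monomer")

    start = n_terminals[0]
    sequence, visited, queue = [], {start}, deque([start])
    while queue:
        monomer = queue.popleft()
        sequence.append(monomer)
        for neighbor in bonds[monomer]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return sequence


# Toy peptide Ala-X1-Gly-Lys, where "X1" stands in for an ncAA.
bonds = {
    "Ala": ["X1"],
    "X1": ["Ala", "Gly"],
    "Gly": ["X1", "Lys"],
    "Lys": ["Gly"],
}
has_free_amine = {"Ala": True}

sequence = order_monomers(bonds, has_free_amine)
print(sequence)
```

For a linear chain each monomer has at most one unvisited neighbor, so the breadth-first order coincides with the N-to-C reading order. The sketch raises an error when no unique N-terminal monomer is found, which is one possible way to reject cyclic or otherwise unsupported inputs.
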