# Fileset

[oikawa-et-al-2025-gpept-a-foundation-language-model-for-peptidomimetics-incorporating-noncanonical-amino-acids.pdf](https://mdr.nims.go.jp/filesets/4d4fc3cf-62ba-4bdb-8f9d-312509beb24d/download)

## Creator

Yuna Oikawa, [Takanori Uzawa](https://orcid.org/0000-0001-6042-513X), Francois Berenger, Noriko Minagawa, Akiko Yumoto, Hideaki Takaku, [Ryo Tamura](https://orcid.org/0000-0002-0349-358X), [Yoshihiro Ito](https://orcid.org/0000-0002-1154-253X), [Koji Tsuda](https://orcid.org/0000-0002-4288-1606)

## Rights

[Creative Commons BY-NC-ND Attribution-NonCommercial-NoDerivs 4.0 International](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## Other metadata

[GPepT: A Foundation Language Model for Peptidomimetics Incorporating Noncanonical Amino Acids](https://mdr.nims.go.jp/datasets/9d81f75d-cc04-410f-ae94-4f7edd013889)

## Fulltext

GPepT: A Foundation Language Model for Peptidomimetics Incorporating Noncanonical Amino Acids

Yuna Oikawa, Takanori Uzawa, Francois Berenger, Noriko Minagawa, Akiko Yumoto, Hideaki Takaku, Ryo Tamura, Yoshihiro Ito, and Koji Tsuda*

Cite This: ACS Med. Chem. Lett. 2025, 16, 1670–1675

ABSTRACT: Language models have become increasingly popular in therapeutic peptide generation, but molecular diversity remains limited due to reliance on the 20 canonical amino acids. We propose a language model that generates peptidomimetics incorporating noncanonical elements such as noncanonical amino acids and terminal modifications. To accomplish this, we created a vocabulary of over 17,000 noncanonical elements by extracting them from chemical formulas stored in the ChEMBL database. Our pretrained language model, GPepT, showed improved diversity in molecular structures and chemical properties. To demonstrate its real-world application, we fine-tuned the model for antimicrobial peptides. Experimental validation revealed that one of the generated peptidomimetics exhibited effective antimicrobial activity, marking a successful case of AI-driven peptide development. GPepT is fully accessible on HuggingFace: https://huggingface.co/Playingyoyo/GPepT.

KEYWORDS: Noncanonical Amino Acids, Amino Acids, Peptidomimetics, Protein, Peptide, Antimicrobial Peptides, RDKit, SMILES, GPT, AI, Language Model

Peptides play a pivotal role in various biological functions, including antimicrobial, anticancer, and anti-inflammatory activities.
Recent advancements in peptide engineering have sparked interest in developing novel peptides as therapeutic agents, with over 53 peptides (accounting for 10% of the 509 drugs approved by the FDA between 1999 and 2019) emerging in clinical applications [1]. To enable a broader range of functions and binding properties, a diverse library of therapeutic peptides is crucially important [2]. There are at least three approaches to increasing diversity: 1) collecting peptide sequences from a large pool of organisms [3]; 2) de novo sequence generation via machine learning [4–8]; and 3) incorporation of noncanonical elements [9]. In the first approach, Santos-Junior et al. successfully predicted nearly 1 million antimicrobial peptides from the global microbiome [3]. In the second approach, a variety of deep learning models have been developed. Earlier, separate generative models were developed for distinct purposes. Examples include variational autoencoders [4,6,10], recurrent neural networks [5,8], generative adversarial networks [7], and transformers [11]. More recently, it has become customary to build a foundation model pretrained with unlabeled data, which can then be fine-tuned to serve a specific purpose, e.g., PeptideBERT [11] and ProtGPT2 [12]. The above-mentioned approaches create peptides only with the 20 canonical amino acids; hence the chemical diversity of generated peptides is intrinsically limited.

In the third approach, the chemical space is expanded by introducing noncanonical elements. These elements, comprising noncanonical amino acids (ncAAs) and terminal modifications, give rise to "peptidomimetics", peptide-like molecules that transcend the limitations of conventional amino acid sequences [2]. Murakami et al. [9] extended their language model with a few ncAAs, but the impact on diversity was insubstantial. Although subsets of common noncanonical elements exist in the literature (see, e.g., Goettig et al. [13]), there is no comprehensive collection of noncanonical elements, let alone tokens representing them.
This deficiency of tokens presents a significant barrier to generating chemically diverse amino acid sequences using contemporary language models [14].

Chemical compound databases, particularly ChEMBL [15], contain peptidomimetics with previously underutilized noncanonical elements that may prove valuable in therapeutics (Figure 1a). CHEMBL2407177, synthesized by Murugan et al. [16], demonstrates the utility of a histidine-derived ncAA in creating antimicrobial therapeutics with enhanced proteolytic stability and negligible hemolytic activity. Marine-derived amino acid substituents, specifically brominated variants, are featured in a synthetic antifungal peptidomimetic, CHEMBL5270768 [17]. Another ChEMBL entry is reported to be a potent inhibitor of SIRT1, a protein linked to type 2 diabetes and heart disease, as well as SIRT2, which may be involved in glioma tumorigenesis and Parkinson's disease. CHEMBL3780549, a thyrotropin-releasing hormone analogue, incorporates novel amino acids proposed by Meena et al. [18]. Notably, these ncAAs remain absent from major chemical databases including PubChem [19], limiting their accessibility to the broader scientific community.

Received: June 13, 2025. Revised: July 8, 2025. Accepted: July 17, 2025. Published: July 22, 2025. Letter, pubs.acs.org/acsmedchemlett. © 2025 The Authors. Published by American Chemical Society. https://doi.org/10.1021/acsmedchemlett.5c00375. This article is licensed under CC-BY-NC-ND 4.0.

In this paper, we construct a comprehensive vocabulary of noncanonical elements by detecting them in a large number of chemical formulas and use it to build a foundation model for peptidomimetics. The vocabulary was created with Python-based software we named Monomerizer, which decomposes peptides and peptidomimetics represented as chemical formulas into amino acids and tokenizes the noncanonical elements (Figure 1b). By applying Monomerizer to ChEMBL-registered molecules, we obtained noncanonical elements and assigned unique tokens to them. Next, the chemical formulas were converted to sequences via the vocabulary and used to pretrain a standard transformer language model [12]. The pretrained model is called Generative pretrained Peptidomimetics Transformer (GPepT). Using GPepT without any fine-tuning, tens of thousands of peptidomimetics were generated. The chemical diversity of generated sequences was found to be greatly enhanced in terms of both structural features and chemical properties (Figure S2).

Figure 1. (a) Examples of uncommon amino acids found in the ChEMBL database. (b) Function of Monomerizer: it decomposes peptides and peptidomimetics, represented as chemical formulas, into canonical and noncanonical amino acids, and tokenizes the newly identified noncanonical elements.

To demonstrate the relevance of our study to real-world peptide development, we conducted a case study on designing antimicrobial peptides. Our model was fine-tuned with the sequences that showed antimicrobial activity against Escherichia coli. Among the generated sequences, five peptides proceeded to synthesis after synthesizability assessment by experts. One peptide, containing D-tryptophan, exhibited antimicrobial activity against E. coli, making it one of the first successful cases of AI-generated peptidomimetics.

Monomerizer accepts a molecule in SMILES (Simplified Molecular Input Line Entry System) format [20], a textual representation of chemical structures that encodes atoms and bonds as a string.
Any molecule with fewer than two peptide bonds is rejected. Then, canonical amino acids are removed from the molecule by SMARTS (SMILES Arbitrary Target Specification)-based template matching [21], which allows for specifying substructural patterns to identify and manipulate specific parts of a molecule. Each of the remaining parts is classified into one of two categories, 1) ncAAs or 2) terminal modifications, according to the presence or absence of a backbone. Once all molecules are processed, identical fragments are merged and assigned unique tokens. Tokens X1, ..., Xn are assigned to all ncAAs, and tokens Z1, ..., Zn to all terminal modifications. See Section S1 for algorithmic details.

By applying Monomerizer to all 2,409,270 bioactivity-labeled molecules in ChEMBL [22], we identified 11,243 ncAAs and 6,465 terminal modifications. Of these, 7,157 (63.7%) of the ncAAs and 2,811 (43.5%) of the terminal modifications were not registered in ChEMBL or PubChem as individual molecules. Using these tokens, 42,743 molecules were successfully converted into sequences, 38,138 (89.2%) of which were peptidomimetics containing at least one noncanonical element. This collection of sequences, which we designate as Data set P, serves as the foundation for our subsequent analyses.

We developed GPepT by adapting the GPT-2 large transformer decoder from HuggingFace, consisting of 36 layers and a dimensionality of 1280, for sequence design of peptidomimetics. To handle our elements, we reinitialized the pretrained weights and built a custom tokenizer specifically designed to tokenize each element, canonical or noncanonical, rather than words or subwords as in traditional natural language processing. We then used HuggingFace's run_clm.py script [23], which facilitates next-token prediction training with a single command, to train GPepT on Data set P.
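The element-level tokenization described above can be illustrated with a minimal sketch. It assumes, hypothetically, that a sequence is written as a plain string in which canonical amino acids are single uppercase letters and noncanonical elements are `X` or `Z` followed by an integer index (as in "WWWWWKZ0"). The regular expression, the function names `tokenize` and `build_vocab`, and the way ids are assigned are illustrative choices and are not taken from the authors' tokenizer.

```python
import re
from typing import Dict, Iterable, List

# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed token shapes: X<number> (ncAA), Z<number> (terminal modification),
# or a single uppercase letter (canonical amino acid).
TOKEN_RE = re.compile(r"[XZ]\d+|[A-Z]")


def tokenize(sequence: str) -> List[str]:
    """Split a sequence string into one token per element."""
    tokens = TOKEN_RE.findall(sequence)
    # findall silently skips unmatched characters, so check nothing was lost.
    if "".join(tokens) != sequence:
        raise ValueError(f"unrecognized characters in {sequence!r}")
    for token in tokens:
        if len(token) == 1 and token not in CANONICAL:
            raise ValueError(f"unknown element {token!r} in {sequence!r}")
    return tokens


def build_vocab(sequences: Iterable[str]) -> Dict[str, int]:
    """Assign an integer id to every element, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sequence in sequences:
        for token in tokenize(sequence):
            vocab.setdefault(token, len(vocab))
    return vocab


if __name__ == "__main__":
    print(tokenize("WWWWWKZ0"))
    print(build_vocab(["WKZ0", "X556W"]))
```

Treating each element as a single token keeps the sequence length equal to the number of residues and modifications, which is the property the custom tokenizer is described as providing.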
Training adhered to standard configurations, including cross-entropy loss for autoregressive generation, a base learning rate of 1 × 10⁻⁵, and Adam optimization, a widely used stochastic gradient descent method that adapts learning rates for each parameter using estimates of first and second moments of the gradients [24], with β1 = 0.9 and β2 = 0.999. To assess the influence of noncanonical elements, we trained two versions of the model: GPepT, trained on the full data set, and GPepT-canonical, trained only on sequences composed of canonical amino acids.

After training, novel sequences were sampled with a repetition penalty of 1.5. Occasionally, invalid sequences were generated in which a terminal modification appeared in the middle. Sampling continued until 10,000 valid sequences were obtained. We converted the sequences back to chemical formulas and represented them as Morgan fingerprints, fixed-length binary vectors encoding molecular substructures that are widely used for chemical similarity and machine learning tasks. Using t-SNE (t-distributed Stochastic Neighbor Embedding) [25], we projected the high-dimensional Morgan fingerprints onto two dimensions for the sequences generated by GPepT and GPepT-canonical (Figure 2a). The sequences generated by GPepT are more dispersed than those from GPepT-canonical, highlighting the superior chemical diversity provided by noncanonical elements. Figure 2b shows the individual and joint distributions of five physicochemical properties: fraction of aromatic bonds, fraction of rotatable bonds, fraction of hydrogen bond acceptors, fraction of hydrogen bond donors, and fraction of carbon atoms that are sp3 hybridized. The increased diversity is evident in these properties as well. Figures S1 and S2 present similar results for individual noncanonical elements and Data set P, respectively.

Figure 2. Comparison of amino acid sequences generated by GPepT and GPepT-canonical. (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties of the amino acid sequences.

An estimated 1.27 million deaths in 2019 were attributed to antimicrobial resistance [26]. Antimicrobial peptides (AMPs), key components of innate immunity, have emerged as candidates in the fight against resistant pathogens [27]. Bacterial resistance to AMPs, though rare, has nevertheless been documented [28]. Nature has addressed this challenge through the incorporation of ncAAs in AMPs, expanding their functional diversity and potentially reducing resistance development [29]. To demonstrate how our language model could contribute to solving this real-world problem, we present a case study on developing antimicrobial peptides.

In Data set P, 205 sequences are labeled as antimicrobial against E. coli. The pretrained GPepT weights were fine-tuned with these 205 sequences to generate antimicrobial peptidomimetics. We then generated 500 sequences using the fine-tuned model.
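Sampling from either the pretrained or the fine-tuned model has to discard sequences in which a terminal modification appears in the middle. The following is a minimal sketch of such a validity filter and of sampling until a target number of valid sequences is reached. It assumes, hypothetically, that sequences are lists of element tokens and that terminal-modification tokens are exactly those starting with `Z`; `sample` is a stand-in for whatever function draws one sequence from the model, and `max_draws` is an added safeguard that is not described in the text.

```python
from typing import Callable, List


def is_valid(tokens: List[str]) -> bool:
    """Accept a sequence only if Z tokens sit at the first or last position."""
    # Reject empty sequences and sequences made only of terminal modifications.
    if not any(not token.startswith("Z") for token in tokens):
        return False
    # Any terminal modification strictly inside the sequence is invalid.
    return not any(token.startswith("Z") for token in tokens[1:-1])


def collect_valid(
    sample: Callable[[], List[str]],
    target: int,
    max_draws: int = 100_000,
) -> List[List[str]]:
    """Draw sequences until `target` valid ones are collected (or draws run out)."""
    valid: List[List[str]] = []
    for _ in range(max_draws):
        if len(valid) >= target:
            break
        candidate = sample()
        if is_valid(candidate):
            valid.append(candidate)
    return valid


if __name__ == "__main__":
    draws = iter([["W", "Z0", "K"], ["W", "K", "Z0"], ["K", "K"]])
    print(collect_valid(lambda: next(draws), target=2))
```

In practice `sample` would wrap the model's generation call and the tokenizer's decoding step; the filter itself is independent of how sequences are produced.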
First, we ranked all noncanonical elements according to the enrichment score, i.e., the fraction of generated sequences that include the element divided by the fraction of sequences in Data set P that include it. A high score implies that the element is preferred more strongly after fine-tuning. The top five elements are shown in Figure 3. The top element has an enrichment score of 33, suggesting that it is strongly related to antimicrobial activity.

For experimental validation, we selected five sequences of length 3–50 (Pep1–5; Table S1) based on the commercial availability of the included noncanonical elements and on synthesizability. Pep1, Pep3, and Pep5 were successfully synthesized and purified to a degree suitable for antimicrobial testing. No unexpected or unusually high safety hazards were encountered. As shown in Figure 4a, Pep1 demonstrated potent antibacterial activity against E. coli with a minimum inhibitory concentration (MIC) of 50 μg/mL, outperforming its canonical counterpart "WWWWWKZ0" (Z0 = amide; MIC > 100 μg/mL). The circular dichroism (CD) spectrum of Pep1, particularly the appearance of a distinct negative peak at 225 nm and a positive peak at approximately 235 nm, is atypical for canonical secondary structures and may suggest unique conformational features induced by the ncAA X556 (D-tryptophan). These peaks may arise from π–π interactions involving the D-Trp residue or from exciton coupling effects, which may contribute to increased local rigidity or to a functionally relevant structural orientation of the peptide. Such CD signatures in the 225–235 nm region have been associated with intramolecular exciton interactions involving aromatic side chains, as reported by Zsila [30]. While detailed structural elucidation would require further spectroscopic or computational studies, the observed features may indicate that ncAA incorporation stabilizes local conformations in a functionally relevant manner.
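The enrichment score defined above (fraction of generated sequences containing an element, divided by the fraction of Data set P sequences containing it) can be computed as in the following sketch. It assumes, hypothetically, that sequences are lists of element tokens and that noncanonical elements are the tokens of the form `X<number>` or `Z<number>`; the function names are illustrative, and skipping elements that never occur in the reference set is a choice made here to avoid division by zero.

```python
from typing import Dict, List, Sequence


def fraction_containing(sequences: Sequence[List[str]], element: str) -> float:
    """Fraction of sequences that contain `element` at least once."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(element in seq for seq in sequences) / len(sequences)


def is_noncanonical(token: str) -> bool:
    """Assumed naming scheme: X<number> for ncAAs, Z<number> for terminal modifications."""
    return len(token) > 1 and token[0] in "XZ" and token[1:].isdigit()


def enrichment_scores(
    generated: Sequence[List[str]],
    reference: Sequence[List[str]],
) -> Dict[str, float]:
    """Enrichment score per noncanonical element, sorted from highest to lowest."""
    elements = {t for seq in generated for t in seq if is_noncanonical(t)}
    scores: Dict[str, float] = {}
    for element in elements:
        ref_fraction = fraction_containing(reference, element)
        if ref_fraction == 0.0:
            continue  # ratio undefined when the element is absent from the reference
        scores[element] = fraction_containing(generated, element) / ref_fraction
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    generated = [["W", "X1"], ["X1", "K", "Z0"]]
    reference = [["W", "X1"], ["W", "K", "Z0"], ["K", "K", "Z0"], ["W", "W"]]
    print(enrichment_scores(generated, reference))
```

In this toy input, `X1` occurs in every generated sequence but in only a quarter of the reference sequences, so its score is 4; a score of 1 means the element is as frequent after fine-tuning as before.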
Antimicrobial testing results for Pep3 and Pep5 are presented in Figure S3.

Our work has revealed the untapped potential of noncanonical elements in peptide drug discovery. The identification of 11,243 ncAAs represents a significant expansion of the peptide building block repertoire. This comprehensive tokenization of noncanonical elements addresses a critical gap in the field, enabling the systematic exploration of expanded peptide chemical space. Existing models such as PeptideBERT [11] can be expanded using our tokens of noncanonical elements.

We demonstrated a fine-tuning strategy to generate peptidomimetics for specific purposes. Pep1 outperformed its canonical counterpart in antimicrobial activity, but there is room for improvement. In fact, nearly 80% of our ncAAs do not possess a primary amine, making them incompatible with peptidomimetic synthesis using standard methods. This limitation underscores the gap between computational predictions and practical synthesis in peptide engineering. Nature enhances peptide diversity through post-translational modifications, suggesting that synthetic approaches could adopt similar strategies. For example, proline analogues have shown promise in modifying peptide backbones [31], while in vitro ribosomal translation systems could allow the integration of D-, β-, or γ-amino acids, broadening access to our ncAA discoveries. Developing new synthesis techniques will be key to leveraging these computational findings in real-world peptide engineering.

In conclusion, our work bridges computational peptide design with practical therapeutic development through the systematic exploration of noncanonical elements. While challenges in synthesis methods remain, our successful demonstration of a biologically active, language model-generated antimicrobial peptidomimetic validates this approach. As synthesis capabilities evolve, this expanded chemical space promises to accelerate the development of peptide-based therapeutics with improved drug-like properties, offering new possibilities to address challenging therapeutic needs.

Figure 3. Top 5 enriched noncanonical elements. They correspond to the following tokens: X4857, X10616, X8507, X9886, X8517. Their enrichment scores are 33, 33, 30, 29, and 27. Only X4857 is found in PubChem.

Figure 4. Experimental validation of the GPepT-generated peptidomimetic Pep1. Canonical Pep1 refers to the peptide in which the noncanonical elements of Pep1 are replaced with canonical ones. (a) Bacterial growth (OD600) after 24 h against peptide concentration. (b) Circular dichroism spectra.

■ ASSOCIATED CONTENT

Data Availability Statement

The code of Monomerizer is available at https://github.com/tsudalab/Monomerizer. The noncanonical elements and the sequences of peptidomimetics are available at https://zenodo.org/records/14175750.
GPepT is fully accessible on HuggingFace: https://huggingface.co/Playingyoyo/GPepT.

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsmedchemlett.5c00375.

Algorithmic details of Monomerizer; comparison of noncanonical amino acids (ncAAs), terminal modifications, and canonical amino acids (cAAs) mined from ChEMBL; comparison of peptidomimetics and peptides mined from ChEMBL (Data set P); experimental details about peptide synthesis and measurement; valid peptidomimetics chosen for antimicrobial activity test; and bacteria growth (OD600) after 24 h against peptide concentration (Pep3 and Pep5) (PDF)

■ AUTHOR INFORMATION

Corresponding Author

Koji Tsuda - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan; Center for Basic Research on Materials, National Institute for Materials Science (NIMS), Tsukuba 305-0044, Japan; RIKEN Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan; orcid.org/0000-0002-4288-1606; Email: tsuda@k.u-tokyo.ac.jp

Authors

Yuna Oikawa - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan

Takanori Uzawa - Emergent Bioengineering Materials Research Team, RIKEN Center for Emergent Matter Science, Wako, Saitama 351-0198, Japan; RIKEN Cluster for Pioneering Research, Wako, Saitama 351-0198, Japan; orcid.org/0000-0001-6042-513X

Francois Berenger - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan

Noriko Minagawa - Emergent Bioengineering Materials Research Team, RIKEN Center for Emergent Matter Science, Wako, Saitama 351-0198, Japan

Akiko Yumoto - Emergent Bioengineering Materials Research Team, RIKEN Center for Emergent Matter Science, Wako, Saitama 351-0198, Japan

Hideaki Takaku - RIKEN Cluster for Pioneering Research, Wako, Saitama 351-0198, Japan

Ryo Tamura - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan; Center for Basic Research on Materials, National Institute for Materials Science (NIMS), Tsukuba 305-0044, Japan; RIKEN Center for Advanced Intelligence Project, RIKEN, Tokyo 103-0027, Japan; orcid.org/0000-0002-0349-358X

Yoshihiro Ito - RIKEN Cluster for Pioneering Research, Wako, Saitama 351-0198, Japan; orcid.org/0000-0002-1154-253X

Complete contact information is available at: https://pubs.acs.org/10.1021/acsmedchemlett.5c00375

Author Contributions

Y.O., T.U., and K.T. conceived the idea and designed the research. Y.O., F.B., R.T., and K.T. developed the computational methods. Y.O., T.U., N.M., A.Y., and H.T. performed biological experiments. Y.I., R.W., and K.T. planned and supervised the study. All authors contributed to the preparation of the manuscript.

Notes

The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

This work was supported by JST ERATO JPMJER1903, JST CREST JPMJCR21O2 and MEXT JPMXP1122712807. We thank Yuya Takeda and Yuta Tomokiyo for their invaluable technical assistance.

■ ABBREVIATIONS USED

ncAA, noncanonical amino acid; GPepT, Generative pretrained Peptidomimetics Transformer; SMILES, Simplified Molecular Input Line Entry System; SMARTS, SMILES Arbitrary Target Specification; CD, circular dichroism; β1, exponential decay rate for the first moment estimates in Adam optimizer; β2, exponential decay rate for the second moment estimates in Adam optimizer; AMP, antimicrobial peptide; t-SNE, t-distributed Stochastic Neighbor Embedding

■ REFERENCES

(1) Chen, C. H.; Lu, T. K. Development and Challenges of Antimicrobial Peptides for Therapeutic Applications. Antibiotics 2020, 9 (1), 24.

(2) Wang, L.; Wang, N.; Zhang, W.; Cheng, X.; Yan, Z.; Shao, G.; Wang, X.; Wang, R.; Fu, C. Therapeutic peptides: current applications and future directions. Signal Transduct. Target. Ther. 2022, 7 (1), 48.

(3) Santos-Júnior, C. D.; Torres, M. D. T.; Duan, Y.; Rodríguez del Río, Á.; Schmidt, T. S. B.; Chong, H.; Fullam, A.; Kuhn, M.; Zhu, C.; Houseman, A.; et al. Discovery of antimicrobial peptides in the global microbiome with machine learning. Cell 2024, 187 (14), 3761–3778.

(4) Das, P.; Sercu, T.; Wadhawan, K.; Padhi, I.; Gehrmann, S.; Cipcigan, F.; Chenthamarakshan, V.; Strobelt, H.; dos Santos, C.; Chen, P.-Y.; et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 2021, 5 (6), 613–623.

(5) Tran, D. P.; Tada, S.; Yumoto, A.; Kitao, A.; Ito, Y.; Uzawa, T.; Tsuda, K. Using molecular dynamics simulations to prioritize and understand AI-generated cell penetrating peptides. Sci. Rep. 2021, 11 (1), 10630.

(6) Tučs, A.; Berenger, F.; Yumoto, A.; Tamura, R.; Uzawa, T.; Tsuda, K. Quantum Annealing Designs Nonhemolytic Antimicrobial Peptides in a Discrete Latent Space. ACS Med. Chem. Lett. 2023, 14 (5), 577–582.

(7) Tucs, A.; Tran, D. P.; Yumoto, A.; Ito, Y.; Uzawa, T.; Tsuda, K. Generating Ampicillin-Level Antimicrobial Peptides with Activity-Aware Generative Adversarial Networks. ACS Omega 2020, 5 (36), 22847–22851.

(8) Capecchi, A.; Cai, X.; Personne, H.; Köhler, T.; van Delden, C.; Reymond, J.-L. Machine learning designs non-hemolytic antimicrobial peptides. Chem. Sci. 2021, 12 (26), 9221–9232.

(9) Murakami, Y.; Ishida, S.; Demizu, Y.; Terayama, K. Design of antimicrobial peptides containing non-proteinogenic amino acids using multi-objective Bayesian optimization. Digit. Discovery 2023, 2 (5), 1347–1353.

(10) Szymczak, P.; Możejko, M.; Grzegorzek, T.; Jurczak, R.; Bauer, M.; Neubauer, D.; Sikora, K.; Michalski, M.; Sroka, J.; Setny, P.; et al. Discovering highly potent antimicrobial peptides with deep generative model HydrAMP. Nat. Commun. 2023, 14 (1), 1453.

(11) Guntuboina, C.; Das, A.; Mollaei, P.; Kim, S.; Barati Farimani, A. PeptideBERT: A Language Model Based on Transformers for Peptide Property Prediction. J. Phys. Chem. Lett. 2023, 14 (46), 10427–10434.

(12) Ferruz, N.; Schmidt, S.; Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 2022, 13 (1), 4348.

(13) Goettig, P.; Koch, N. G.; Budisa, N. Non-Canonical Amino Acids in Analyses of Protease Structure and Function. Int. J. Mol. Sci. 2023, 24 (18), 14035.

(14) Simon, E.; Swanson, K.; Zou, J. Language models for biological research: a primer. Nat. Methods 2024, 21 (8), 1422–1429.

(15) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40 (D1), D1100.

(16) Murugan, R. N.; Jacob, B.; Kim, E.-H.; Ahn, M.; Sohn, H.; Seo, J.-H.; Cheong, C.; Hyun, J.-K.; Lee, K. S.; Shin, S. Y.; et al. Non hemolytic short peptidomimetics as a new class of potent and broad-spectrum antimicrobial agents. Bioorg. Med. Chem. Lett. 2013, 23 (16), 4633–4636.

(17) Craig, A. J.; Ermolovich, Y.; Cameron, A.; Rodler, A.; Wang, H.; Hawkes, J. A.; Hubert, M.; Björkling, F.; Molchanova, N.; Brimble, M. A.; et al. Antimicrobial Peptides Incorporating Halogenated Marine-Derived Amino Acid Substituents. ACS Med. Chem. Lett. 2023, 14 (6), 802–809.

(18) Meena, C. L.; Thakur, A.; Nandekar, P. P.; Sharma, S. S.; Sangamwar, A. T.; Jain, R. Synthesis and biology of ring-modified L-Histidine containing thyrotropin-releasing hormone (TRH) analogues. Eur. J. Med. Chem. 2016, 111, 72–83.

(19) Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; et al. PubChem 2023 update. Nucleic Acids Res. 2023, 51 (D1), D1373–D1380.

(20) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31–36.

(21) Daylight Chemical Information Systems, Inc. A Language for Describing Molecular Patterns. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed June 1, 2025).

(22) Chembl activities. ChEMBL Database. European Bioinformatics Institute (EMBL-EBI). https://www.ebi.ac.uk/chembl/web_components/explore/activities/ (accessed Dec 15, 2023).

(23) run_clm.py; 2021. https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling/run_clm.py (accessed May 1, 2025).

(24) Kingma, D. P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.

(25) van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

(26) Murray, C. J. L.; Ikuta, K. S.; Sharara, F.; Swetschinski, L.; Aguilar, G. R.; Gray, A.; Han, C.; Bisignano, C.; Rao, P.; Wool, E.; et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 2022, 399 (10325), 629–655.

(27) Fjell, C. D.; Hiss, J. A.; Hancock, R. E.; Schneider, G. Designing antimicrobial peptides: form follows function. Nat. Rev. Drug Discovery 2012, 11 (1), 37–51.

(28) Nizet, V. Antimicrobial Peptide Resistance Mechanisms of Human Bacterial Pathogens. Curr. Issues Mol. Biol. 2006, 8 (1), 11–26.

(29) Garg, N.; Oman, T. J.; Andrew Wang, T.-S.; De Gonzalo, C. V. G.; Walker, S.; van der Donk, W. A. Mode of action and structure-activity relationship studies of geobacillin I. J. Antibiot. 2014, 67 (1), 133–136.

(30) Zsila, F. Far-UV circular dichroism signatures indicate fluorophore labeling induced conformational changes of penetratin. Amino Acids 2022, 54 (7), 1109–1113.

(31) Kubyshkin, V.; Davis, R.; Budisa, N. Biochemistry of fluoroprolines: the prospect of making fluorine a bioelement. Beilstein J. Org. Chem. 2021, 17, 439–460.