IIBMP2020 Poster
Generating annotation texts of HLA sequences with
antigen classes by a T5 (Text-to-Text Transfer
Transformer) model using International Nucleotide
Sequence Database
Eli Kaminuma1,2,3, Takatomo Fujisawa2, Osamu Ogasawara2, Masanori Arita2,3,
Yasukazu Nakamura2,4
(1. Tokyo Medical and Dental University 2. National Institute of Genetics, 3. RIKEN CSRS,
4. Kazusa DNA Research Institute)
■ International Nucleotide Sequence
Database (INSDC)
DNA Data Bank of Japan (DDBJ) collects
nucleotide sequences as a member of INSDC.
The problem of high labor costs for manual sequence annotation
at the data submission stage to INSDC
■ Problem: the sequence annotations required by INSDC
are labor-intensive to prepare
DDBJ ANNOTATION HELP
DNASmartTagger :
A proposed machine learning tool for DNA sequence annotations
Accacactggtactgagacacggaccaga
ctcctacgggaggcagcagtgaggaatatt
ggacaatggagggaactctgatccagcca
tgccgcgtgcaggaagactgccctatgggt
tgtaaactgcttttatacaagaagaataag
agatacgtgtatcttgatgacggtattgtaa
gaataagcaccggctaactccgtgccagc
agccgcggtaatacggagggtgcaagcgt
tatccggaatcattgggtttaaagggtccgt
aggcggattaataagtcagtggtgaaagtc
tgcagcttaactgtagaattgccattgatac
tgttagtcttgaattattatgaagtagttag
aatatgtagtgtagcggtgaaatgcataga
tattaca
Input: DNA sequence
e.g. INSDC FlatFile Format
Output: Annotation Tags
DNASmartTagger
Data resources: BioSample (452 attribute tags), INSDC (132 attribute tags)
Machine learning models
Other annotations (132 attribute tags)
■ Retrieving INSDC Training Data ■ Building Deep Learning Models
- Evaluating machine learning models to infer
attribute values (evaluation metric: accuracy)
Input feature     SVM (10-fold CV)   CNN (10-fold CV)
k-mer frequency   0.77               0.80
5' end fragment   0.72               0.73
Deep learning model (CNN)+
Input parameter with k-mer frequency
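A minimal sketch of how a k-mer frequency input could be computed from a DNA sequence. The function name `kmer_frequencies`, the choice of k, and the handling of ambiguous bases are assumptions made for illustration; the poster does not state the settings actually used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies as a list in lexicographic k-mer order."""
    seq = sequence.lower()
    # Enumerate all 4**k possible k-mers so every sequence maps to the same vector layout.
    vocabulary = ["".join(p) for p in product("acgt", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows containing characters other than a/c/g/t (e.g. 'n') are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Example with the first line of the sample input sequence and k = 2 (16 features).
vector = kmer_frequencies("Accacactggtactgagacacggaccaga", k=2)
```

A fixed-length vector of this kind can be passed to either an SVM or a CNN, which is why it is a convenient representation for sequences of varying length.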
■ Extracting the attribute tag “/altitude”
5,431 Sequences with Annotation for PLN Division,
Keyword Fungi (Retrieved from DDBJ ARSA)
- Categorizing attribute values (/altitude)
Zone           Attribute value (altitude)   Zone code
ALPINE ZONE    1500 m and above             Z3
MONTANE ZONE   800 m to 1500 m              Z2
LOWLAND ZONE   0 m to 800 m                 Z1
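The categorization of /altitude values into zone codes can be sketched as below. The function names are hypothetical, the value format ("330.12 m") is assumed, and the poster does not say which zone a boundary value belongs to: this sketch assigns 800 m to Z2 and 1500 m to Z3, and rejects values below 0 m because they fall outside the listed ranges.

```python
import re


def parse_altitude(value):
    """Parse an /altitude value such as "330.12 m" into metres (format is an assumption)."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*m\.?\s*", value)
    if match is None:
        raise ValueError(f"unrecognized altitude value: {value!r}")
    return float(match.group(1))


def altitude_zone(altitude_m):
    """Map an altitude in metres to the zone codes Z1, Z2, Z3."""
    if altitude_m < 0:
        # Assumption: the listed zones start at 0 m, so negative values are rejected.
        raise ValueError("altitude below 0 m is outside the listed zones")
    if altitude_m < 800:
        return "Z1"  # LOWLAND ZONE
    if altitude_m < 1500:
        return "Z2"  # MONTANE ZONE
    return "Z3"  # ALPINE ZONE


zone = altitude_zone(parse_altitude("330.12 m"))
```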
Our previous study with DNASmartTagger (2018): predicting ecological
values of BioSample attribute tags from DNA sequences using deep learning
■ An example annotation of an INSDC sequence for a human leukocyte antigen (HLA) allele.
LOCUS MG021788 3079 bp DNA linear HUM 04-SEP-2018
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds.
:
FEATURES Location/Qualifiers
source 1..3079
/organism="Homo sapiens"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
gene <1..>3079
/gene="HLA-F"
/allele="HLA-F*01:01:02var"
:
:
BASE COUNT 601 a 866 c 951 g 661 t
ORIGIN
1 gtgtcgccgc agttcccagg ttctaaagtc ccacgcaccc cgcgggactc atatttttcc
61 cagacgcgga ggttggggtc atggcgcccc gaagcctcct cctgctgctc tcaggggccc
121 tggccctgac cgatacttgg gcaggtgagt gcggggtcca gagagaaacg gcctctgtgg
181 ggaggagtga ggggcccgcc cggtgggggc gcaggactca gggagccgcg cccggaggag
241 ggtctggcgg gtctcagccc ctcctcgccc ccaggctccc actccttgag gtatttcagc
301 accgctgtgt cgcggcccgg ccgcggggag ccccgctaca tcgccgtgga gtacgtagac
A proposed sequence-to-text generation model to annotate
INSDC DEFINITION attributes from input sequences
HLA nomenclature (Xie et al, PMID: 21172045)
The T5 (Text-To-Text Transfer Transformer) model is one of
the available NLP deep learning models
Reference:
https://github.com/google-research/text-to-text-transfer-transformer
■ T5 model (Raffel et al.; arXiv:1910.10683) = a text-to-text deep learning model
from Google AI for handling a wide variety of NLP tasks.
*SuperGLUE LeaderBoard(2020/8/27)
■ T5 model structure = a type of
encoder-decoder transformer model.
■ C4 TensorFlow dataset
745GB (cf. Wikipedia 16GB)
■ T5 characteristic = large model size
T5-Small (60 million params)
T5-Base (220 million params)
T5-Large (770 million params)
T5-3B (3 billion params)
T5-11B (11 billion params)
■ Preparing reference texts of HLA sequences from INSDC database
- 1,100 (train 1,000 + test 100) sequence annotations.
- Selecting the top 6 gene names (HLA-B, -A, -C, -DQB1, -G, -DQA1) in INSDC HLA data.
- Deleting HLA allelic descriptions from DEFINITION texts.
- Preparing amino acid sequences, not nucleotide sequences, as model inputs.
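The step of deleting the allelic description from a DEFINITION text might look like the following sketch. The regular expression and the function name `strip_allele` are assumptions for illustration, written against the HLA-F example record shown earlier; the actual cleaning rules used for the poster are not given.

```python
import re

# Hypothetical pattern: an optional leading comma, an allele designation such as
# "HLA-F*01:01:02var", and the trailing word "allele".
ALLELE_PATTERN = re.compile(r",?\s*HLA-[A-Za-z0-9]+\*[0-9A-Za-z:]+\s+allele\b")


def strip_allele(definition):
    """Remove the allelic description from an INSDC DEFINITION text."""
    cleaned = ALLELE_PATTERN.sub("", definition)
    # Collapse any doubled whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


example = ("Homo sapiens MHC class I antigen (HLA-F) gene, "
           "HLA-F*01:01:02var allele, complete cds.")
cleaned_example = strip_allele(example)
print(cleaned_example)
```

The gene name in parentheses, "(HLA-F)", is left untouched because it is not followed by an asterisk and allele fields.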
■ Fine-tuning methods
- Fine-tuning a pre-trained T5-Small (60 million params) model
provided through Hugging Face to generate annotation
texts from amino acid (AA) sequences.
■ Hardware and software
- NVIDIA Tesla K80 GPU (12GB) assigned on Google Colaboratory
- Python with the PyTorch 1.5.1 deep learning library.
■Evaluation metrics
① BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002)
② Accuracy for classifying key labels.
Experimental conditions to build T5 transfer learning models
https://github.com/huggingface/transformers
Pn: modified n-gram precisions, using n-grams up to length N
wn: positive weights summing to one
c: length of the candidate translation
r: effective reference corpus length
https://www.aclweb.org/anthology/P02-1040.pdf
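For reference, the terms listed above enter the BLEU score as defined in Papineni et al. (2002). The formula is restated here from that paper, writing the precisions as P_n to match the legend; the baseline in the paper uses N = 4 and uniform weights w_n = 1/N. Whether the poster used exactly these settings is not stated.

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log P_n \right)
```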
Result(1):
Output texts generated by the proposed sequence-to-text T5 model
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var
allele, complete cds.
Basic Format: [organism name] [gene name] gene, [allelic information], complete/partial cds.
Input sequence -> T5 -> Output annotation texts
(Input sequences are shown as their first 10 and last 10 AA codes.)
Example 1
Input sequence: DFVFQFKGMC ... VAFRGILQRR
MT output: Homo sapiens clone DQB1_111313_00
Reference: Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, exon 2 and partial cds
Example 2
Input sequence: MAVMAPRTLV ... SDMSLTACKV
MT output: MHC class I antigen HLA-A gene, complete cds1 gene
Reference: Homo sapiens MHC class I protein HLA-A gene, complete cds
Result(2): Evaluating the proposed model using generated texts
and references
■ BLEU: A popular metric for text generation
BLEU score=0.28 (100 test sequences)
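A minimal sketch of a corpus-level BLEU computation over generated texts and references. It assumes whitespace tokenization, one reference per generated text, uniform weights over 1- to 4-grams, and no smoothing; the poster does not state which BLEU implementation or settings produced the reported score, so this is not expected to reproduce it.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate and uniform weights."""
    matched = [0] * max_n
    total = [0] * max_n
    c = r = 0  # candidate length and reference length, summed over the corpus
    for cand, ref in zip(candidates, references):
        cand_tokens, ref_tokens = cand.split(), ref.split()
        c += len(cand_tokens)
        r += len(ref_tokens)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched[n - 1] += sum(min(count, ref_counts[gram])
                                  for gram, count in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if c == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["a b c d"], ["a b c d e"])
```

In the usage line every n-gram precision is 1, so the score is determined by the brevity penalty alone, exp(1 - 5/4).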
■ Accuracy for classifying key
labels (gene names and
complete/partial cds types )
https://cloud.google.com/translate/automl/docs/evaluate
We will collect more suitable reference datasets
and investigate training conditions of T5 models.
key labels                      accuracy (only test data including labels)
gene name, 6 classes※1          0.42 (0.95)
cds completeness, 2 classes※2   0.35 (0.83)
※1: HLA-B, -A, -C, -DQB1, -G, -DQA1
※2: complete, partial (exon etc.)
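A sketch of how key-label accuracy could be computed from generated texts and references. The extraction rules and function names are hypothetical. The values in parentheses are read here as accuracy restricted to generated texts that contain a label at all; that reading is an assumption, since the poster does not define it further.

```python
import re

# The six gene-name classes from footnote ※1.
GENES = ["HLA-DQB1", "HLA-DQA1", "HLA-A", "HLA-B", "HLA-C", "HLA-G"]


def gene_label(text):
    """Return a gene name from GENES that occurs in the text, or None."""
    for gene in GENES:
        if re.search(re.escape(gene) + r"(?![A-Za-z0-9])", text):
            return gene
    return None


def cds_label(text):
    """Return 'complete' or 'partial' when the text states a cds type, else None."""
    if "complete cds" in text:
        return "complete"
    if "partial cds" in text:
        return "partial"
    return None


def label_accuracy(outputs, references, extract):
    """Return (accuracy over all pairs, accuracy over outputs that contain a label)."""
    pairs = [(extract(o), extract(r)) for o, r in zip(outputs, references)]
    correct = sum(1 for out, ref in pairs if out is not None and out == ref)
    labelled = [pair for pair in pairs if pair[0] is not None]
    overall = correct / len(pairs) if pairs else 0.0
    labelled_only = correct / len(labelled) if labelled else 0.0
    return overall, labelled_only


# The two example outputs and references shown in Result(1).
outputs = ["Homo sapiens clone DQB1_111313_00",
           "MHC class I antigen HLA-A gene, complete cds1 gene"]
references = ["Homo sapiens clone DQB1_110918_01032 MHC class II antigen "
              "HLA-DQB1 gene, exon 2 and partial cds",
              "Homo sapiens MHC class I protein HLA-A gene, complete cds"]
gene_scores = label_accuracy(outputs, references, gene_label)
cds_scores = label_accuracy(outputs, references, cds_label)
```

On these two examples the first output contains neither label, which illustrates why the two accuracy figures can differ widely.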
Acknowledgements
■ We are thankful to the following members for their support.
・ TMDU : Hiroshi Tanaka, Kazuki Hashimoto
・ DDBJ : Jun Mashima, Yuichi Kodama, Kosuge Takehide
・ DBCLS : Yasunori Yamamoto
・ AIST AIRC : Jun Sese, Motoko Tsuji, Yukiko Ochi
■ This work is partially supported by the following grants.
・NIG Research Collaboration Grants 3A2019/ 55A2020
・JST CREST Grant Number JPMJCR1501
Future work
Issues for future work
- Clean reference datasets.
- Text generation for HLA allelic code (Large reference datasets).
- Multi-modal integration (k-mer frequency vs sequence fragments).
NLP+Computer Vision NLP

 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

[2020-09-01] IIBMP2020 Generating annotation texts of HLA sequences with antigen classes by a T5 model using INSDC

  • 1. IIBMP2020 Poster Generating annotation texts of HLA sequences with antigen classes by a T5 (Text-to-Text Transfer Transformer) model using International Nucleotide Sequence Database Eli Kaminuma1,2,3, Takatomo Fujisawa2, Osamu Ogasawara2, Masanori Arita2,3, Yasukazu Nakamura2,4 (1. Tokyo Medical and Dental University 2. National Institute of Genetics, 3. RIKEN CSRS, 4. Kazusa DNA Research Institute)
  • 2. ■ International Nucleotide Sequence Database Collaboration (INSDC): the DNA Data Bank of Japan (DDBJ) collects nucleotide sequences as a member of INSDC. ■ Problem: the sequence annotations that INSDC requires at the data submission stage are written manually, which incurs high labor costs. (Slide figure: DDBJ ANNOTATION HELP)
  • 3. DNASmartTagger: a proposed machine learning tool for DNA sequence annotation. Input: a DNA sequence (e.g. in INSDC flat-file format), for example: Accacactggtactgagacacggaccaga ctcctacgggaggcagcagtgaggaatatt ggacaatggagggaactctgatccagcca tgccgcgtgcaggaagactgccctatgggt tgtaaactgcttttatacaagaagaataag agatacgtgtatcttgatgacggtattgtaa gaataagcaccggctaactccgtgccagc agccgcggtaatacggagggtgcaagcgt tatccggaatcattgggtttaaagggtccgt aggcggattaataagtcagtggtgaaagtc tgcagcttaactgtagaattgccattgatac tgttagtcttgaattattatgaagtagttag aatatgtagtgtagcggtgaaatgcataga tattaca. Output: annotation tags. Data resources: BioSample (452 attribute tags), INSDC (132 attribute tags), and others. Machine learning models infer the annotations (132 attribute tags).
  • 4. Our previous study of DNASmartTagger (2018): predicting ecological values of BioSample attribute tags from DNA sequences using deep learning. ■ Retrieving INSDC training data: 5,431 annotated sequences from the PLN division with the keyword "Fungi" (retrieved from DDBJ ARSA). ■ Extracting the attribute tag "/altitude" and categorizing its values by altitude zone: alpine zone, 1500 m and above, zone code Z3; montane zone, 800 m to 1500 m, zone code Z2; lowland zone, 0 m to 800 m, zone code Z1. ■ Building deep learning models: evaluating machine learning models that infer attribute values (evaluation metric: accuracy; 10-fold cross-validation). With k-mer frequency input: SVM 0.77, CNN 0.80. With 5' end fragment input: SVM 0.72, CNN 0.73. Highest accuracy: deep learning model (CNN) with k-mer frequency input.
  • 5. ■ An example annotation of an INSDC sequence for a human leukocyte antigen (HLA) allele: LOCUS MG021788 3079 bp DNA linear HUM 04-SEP-2018 DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds. : FEATURES Location/Qualifiers source 1..3079 /organism="Homo sapiens" /mol_type="genomic DNA" /db_xref="taxon:9606" gene <1..>3079 /gene="HLA-F" /allele="HLA-F*01:01:02var" : : BASE COUNT 601 a 866 c 951 g 661 t ORIGIN 1 gtgtcgccgc agttcccagg ttctaaagtc ccacgcaccc cgcgggactc atatttttcc 61 cagacgcgga ggttggggtc atggcgcccc gaagcctcct cctgctgctc tcaggggccc 121 tggccctgac cgatacttgg gcaggtgagt gcggggtcca gagagaaacg gcctctgtgg 181 ggaggagtga ggggcccgcc cggtgggggc gcaggactca gggagccgcg cccggaggag 241 ggtctggcgg gtctcagccc ctcctcgccc ccaggctccc actccttgag gtatttcagc 301 accgctgtgt cgcggcccgg ccgcggggag ccccgctaca tcgccgtgga gtacgtagac ■ Proposal: a sequence-to-text generation model that annotates INSDC DEFINITION attributes from input sequences. HLA nomenclature: Xie et al. (PMID: 21172045).
  • 6. The T5 (Text-To-Text Transfer Transformer) model is one of the available NLP deep learning models. ■ T5 model (Raffel et al.; arXiv:1910.10683): a text-to-text deep learning model from Google AI for handling a wide variety of NLP tasks (cf. SuperGLUE leaderboard, 2020/8/27). ■ T5 model structure: an encoder-decoder transformer. ■ Pre-training data: the C4 TensorFlow dataset, 745 GB (cf. Wikipedia, 16 GB). ■ T5 characteristic: large model sizes: T5-Small (60 million params), T5-Base (220 million params), T5-Large (770 million params), T5-3B (3 billion params), T5-11B (11 billion params). Reference: https://github.com/google-research/text-to-text-transfer-transformer
  • 7. Experimental conditions to build T5 transfer learning models. ■ Preparing reference texts of HLA sequences from the INSDC database: - 1,100 sequence annotations (train 1,000 + test 100). - Selecting the top 6 gene names (HLA-B, -A, -C, -DQB1, -G, -DQA1) in the INSDC HLA data. - Deleting HLA allelic descriptions from DEFINITION texts. - Preparing amino acid sequences, not nucleotide sequences, as model inputs. ■ Fine-tuning method: - A pre-trained T5-Small (60 million params) model provided by Hugging Face (https://github.com/huggingface/transformers) is fine-tuned to generate annotation texts from amino acid (AA) sequences. ■ Hardware and software: - NVIDIA Tesla K80 GPU (12 GB) assigned on Google Colaboratory. - Python with the PyTorch 1.5.1 deep learning library. ■ Evaluation metrics: ① BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002; https://www.aclweb.org/anthology/P02-1040.pdf): BLEU = BP · exp(Σ_{n=1..N} w_n log p_n), with brevity penalty BP = 1 if c > r and exp(1 − r/c) otherwise, where p_n: n-gram precisions up to length N; w_n: positive weights; c: length of the candidate translation; r: effective reference corpus length. ② Accuracy for classifying key labels.
  • 8. Result (1): Output texts generated by the proposed sequence-to-text T5 model. Basic format of a DEFINITION line: [organism name] [gene name] gene, [allelic information], complete/partial cds. Example: DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds. Input sequences are shown by their first 10 and last 10 AA codes; the T5 model generates the output annotation text (MT output). Example 1: input sequence DFVFQFKGMC...VAFRGILQRR; MT output: Homo sapiens clone DQB1_111313_00; reference: Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, exon 2 and partial cds. Example 2: input sequence MAVMAPRTLV...SDMSLTACKV; MT output: MHC class I antigen HLA-A gene, complete cds1 gene; reference: Homo sapiens MHC class I protein HLA-A gene, complete cds.
  • 9. Result (2): Evaluation of the proposed model using generated texts and references. ■ BLEU, a popular metric for text generation (https://cloud.google.com/translate/automl/docs/evaluate): BLEU score = 0.28 (100 test sequences). ■ Accuracy for classifying key labels (gene names and complete/partial cds types); values in parentheses are computed only on test data including labels: gene name (6 classes ※1): 0.42 (0.95); cds completeness (2 classes ※2): 0.35 (0.83). ※1: HLA-B, -A, -C, -DQB1, -G, -DQA1. ※2: complete, partial (exon etc.). We will collect more suitable reference datasets and investigate training conditions of T5 models.
  • 10. Future work: - Clean reference datasets. - Text generation for HLA allelic codes (large reference datasets). - Multi-modal integration (k-mer frequency vs. sequence fragments) (figure labels: NLP + Computer Vision, NLP). Acknowledgements ■ We are thankful to the following members for their support. ・TMDU: Hiroshi Tanaka, Kazuki Hashimoto ・DDBJ: Jun Mashima, Yuichi Kodama, Kosuge Takehide ・DBCLS: Yasunori Yamamoto ・AIST AIRC: Jun Sese, Motoko Tsuji, Yukiko Ochi ■ This work is partially supported by the following grants. ・NIG Research Collaboration Grants 3A2019/55A2020 ・JST CREST Grant Number JPMJCR1501
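
The two preprocessing ideas on slide 4 (k-mer frequency input and altitude zone codes) could be sketched roughly as below. This is only an illustration, not the authors' implementation: the function names, the choice of k = 3, and the handling of values exactly on a zone boundary are assumptions, since the poster does not state them.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Return relative k-mer frequencies in a fixed lexicographic order.

    k = 3 is an assumed default; the poster does not give the value of k.
    """
    seq = sequence.lower()
    kmers = ["".join(p) for p in product("acgt", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of a/c/g/t are counted; ambiguous bases are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def altitude_zone(altitude_m: float) -> str:
    """Map an altitude in metres to the zone codes listed on slide 4.

    Assigning a boundary value to the higher zone is an assumption.
    """
    if altitude_m >= 1500:
        return "Z3"  # alpine zone
    if altitude_m >= 800:
        return "Z2"  # montane zone
    return "Z1"      # lowland zone


if __name__ == "__main__":
    features = kmer_frequencies("accacactggtactgagacacggaccaga")
    print(len(features), altitude_zone(1200))
```

A fixed k-mer ordering keeps the feature vector the same length (4^k) for every sequence, which is what a classifier such as an SVM or CNN needs as input.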
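
The two evaluation metrics named on slides 7 and 9 (BLEU and key-label accuracy) can be illustrated with the minimal sketch below. It assumes whitespace tokenization, one reference per generated text, and uniform weights w_n = 1/N with N = 4; the label-extraction patterns and the `only_labeled` option (one possible reading of "only test data including labels") are hypothetical and not taken from the poster.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with clipped n-gram precision and brevity penalty."""
    matched, total = [0] * max_n, [0] * max_n
    c = r = 0  # candidate length and reference length over the corpus
    for cand, ref in zip(candidates, references):
        cand_tok, ref_tok = cand.split(), ref.split()
        c += len(cand_tok)
        r += len(ref_tok)
        for n in range(1, max_n + 1):
            cand_ng, ref_ng = ngrams(cand_tok, n), ngrams(ref_tok, n)
            matched[n - 1] += sum(min(v, ref_ng[g]) for g, v in cand_ng.items())
            total[n - 1] += sum(cand_ng.values())
    if c == 0 or any(m == 0 for m in matched):
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_p = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_p)


def gene_label(text):
    """Extract one of the six gene names, or None (hypothetical pattern)."""
    m = re.search(r"HLA-(DQB1|DQA1|A|B|C|G)\b", text)
    return "HLA-" + m.group(1) if m else None


def cds_label(text):
    """Extract 'complete' or 'partial', or None (hypothetical pattern)."""
    m = re.search(r"\b(complete|partial) cds", text)
    return m.group(1) if m else None


def label_accuracy(outputs, references, labeler, only_labeled=False):
    """Fraction of outputs whose extracted label matches the reference label."""
    pairs = [(labeler(o), labeler(ref)) for o, ref in zip(outputs, references)]
    if only_labeled:  # assumed reading: keep outputs that contain a label
        pairs = [(o, ref) for o, ref in pairs if o is not None]
    if not pairs:
        return 0.0
    return sum(o is not None and o == ref for o, ref in pairs) / len(pairs)
```

As a usage example, passing the two generated texts and references from slide 8 to `label_accuracy(outputs, references, gene_label)` scores the output that lacks a gene name as wrong, whereas `only_labeled=True` excludes it, which is one way the two accuracy columns on slide 9 could differ.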