Analysing Exome Data with KNIME
Pierre Lindenbaum PhD, UMR915 – Institut du thorax, Nantes, France
@yokofakun
http://plindenbaum.blogspot.com
2 exomes sequenced
1st case: for a given mutation we expect [m/m] in the mutated sample and not( [m/m] ) in the wild sample
Files
One record, 42 columns:
$1 Position.hg19 : 142653
$2 chrom : chr10
$3 sample.ID : sample1
$4 rs.name : rsXXXX
$5 hapmap_ref_other :
$6 X1000Genome.obs :
$7 X1000Genome.desc :
$8 Freq.HTZ.ExomesV1 : 0
$9 Freq.Hom.ExomesV1 : 0
$10 A : 0
$11 C : 5
$12 G : 0
$13 T : 3
$14 modified_call : CT
$15 total : 9
$16 used : 8
$17 score : 18.30:12.00
$18 reference : C
$19 type : SNP_het1
$20 Gene.name : ROXAN
$21 Gene.start : 143652
$22 Gene.end : 293700
$23 strand : -
$24 nbre.exon : 11
$25 refseq : NR_0090
$26 typeannot : 3-UTR
$27 type.pos :
$28 index.cdna :
$29 index.prot :
$30 Taille.cdna : 1769
$31 Intron.start :
$32 Intron.end :
$33 codon.wild :
$34 aa.wild :
$35 codon.mut :
$36 aa.mut :
$37 cds.wild :
$38 cds.mut :
$39 prot.wild :
$40 prot.mut :
$41 mirna : no
$42 region.splice : !N/A
Fields highlighted on the following slides:
Genomic Position ($1 Position.hg19, $2 chrom)
Sample Name ($3 sample.ID)
RS## number ($4 rs.name)
Ref. & Alt. alleles ($18 reference, $14 modified_call)
Gene ($20 Gene.name)
Prediction ($26 typeannot)
Homo/Hetero zygote ($19 type)
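The vertical "$N name : value" view used on these slides is handy for finding the column numbers needed by awk and cut. A minimal sketch of how such a view could be produced is given here; it assumes the annotation table is tab-delimited and starts with a header row, neither of which is stated on the slides, and the function name show_record is made up for this example.

```shell
# Print one tab-delimited record vertically as "$N name : value".
# Assumption: the first input line is a header row holding the column names.
# Usage: show_record [record_number] < table.tsv   (record_number defaults to 1)
show_record() {
    awk -F '\t' -v want="${1:-1}" '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        NR == want + 1 {
            for (i = 1; i <= NF; i++) printf "$%d %s : %s\n", i, name[i], $i
            exit
        }'
}
```

For example, `gunzip -c some.annotation.gz | show_record 1` would print the first record in the layout shown above (the file name is a placeholder).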
http://www.knime.org
 
Our workflow:
 
Read the data
Rename both “Sample” columns
Remove the sequences (save memory/speed)
Expect “not (snp_diff.*)” for the wild sample
Expect “snp_diff.*” for the mutated sample
Merge data. Two columns: “SAMPLE_WILD” & “SAMPLE_MUTATED”
Highlight low quality
Remove low quality
Must be located in a gene
Remove if known rs#
Remove if synonymous mutation
Remove wild allele from Alt. (cleanup)
Group by Gene
 
Keep mutations carried by both samples
Group by Gene Name & Visualize
 
Retrieve the SNPs for each Gene.
 
bash version...
# mutated exome: remove rs, keep SNPs in a gene, remove the low qualities, keep SNP_diff,
# keep only the non-synonymous or stop, remove DNA & prot sequences, order by gene
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if($20!="") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_diff")!=0) print;}' |
awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
cut -d $'\t' -f 1-27 |
sort -t $'\t' -k20,20 > _jeter1.txt
# wild exome: remove rs, remove the low qualities, remove SNP_diff, keep SNPs in a gene, order by gene
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_diff")==0) print;}' |
awk -F '\t' '{if($20!="") print;}' |
cut -d $'\t' -f 1-27 |
sort -t $'\t' -k20,20 > _jeter3.txt
# join wild & mutated data by gene
# check wild sample has no mutation in the pair of mutated snps
# remove wild data
join -t $'\t' -1 20 -2 20 _jeter1.txt _jeter3.txt |
awk -F '\t' '{if($3==$29 && int($2) == int($28) ) print;}' |
cut -d $'\t' -f 1 |
sort | uniq
rm _jeter*.txt
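The five chained awk filters applied to the mutated exome can also be written as a single awk pattern, which reads the data once and keeps all the criteria in one place. The sketch below is only an illustration: the function name filter_candidates is invented, and it assumes tab-delimited input with the column numbers of the 42-column layout ($4 rs.name, $19 type, $20 Gene.name, $26 typeannot).

```shell
# One-pass version of the mutated-sample filter.
# Keeps a record when all of these hold:
#   - column 4 does not start with "rs" (no known rs number)
#   - column 20 is not empty (the SNP is located in a gene)
#   - column 19 does not contain "douteux" (not flagged as low quality)
#   - column 19 contains "_diff"
#   - column 26 contains "nonsense" or "missense"
# Assumption: the input is tab-delimited.
filter_candidates() {
    awk -F '\t' '
        substr($4, 1, 2) != "rs" &&
        $20 != "" &&
        index($19, "douteux") == 0 &&
        index($19, "_diff") != 0 &&
        (index($26, "nonsense") != 0 || index($26, "missense") != 0)
    '
}
```

It would slot into the pipeline as `gunzip -c ... | filter_candidates | cut -f 1-27 | sort ...`. A pattern without an action prints the matching record, so no explicit print is needed.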
2nd case: composite (compound) heterozygous. In one gene: SNP1: [m/+], SNP2: [m/+]
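As a quick first screen for this second case, one can list the genes that carry at least two heterozygous SNPs in the mutated sample. This is a rough sketch and not the workflow described next: it does not compare against the wild sample, and it says nothing about whether the two SNPs sit on different alleles. The function name genes_with_two_hets is invented; the input is assumed to be tab-delimited with the type in column 19 and the gene name in column 20.

```shell
# List the genes that carry at least two heterozygous SNPs.
# Assumptions: tab-delimited input, column 19 holds the type (heterozygous
# calls contain "_het"), column 20 holds the gene name.
genes_with_two_hets() {
    awk -F '\t' '
        index($19, "_het") != 0 && $20 != "" { count[$20]++ }
        END { for (gene in count) if (count[gene] >= 2) print gene }
    ' | sort
}
```

The final sort is there because awk does not guarantee any order when iterating over an array.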
The workflow:
Read the [m] (mutated sample) & [+] (wild sample) files
Remove cDNA & protein sequences
Remove the SNPs having a rs#
Keep the heterozygous mutations
Remove poor quality
Keep the non-synonymous mutations
Create a new column: chrom_col = chrom + "_" + position
Rename the 'sample-id' columns (this will generate two distinct columns after joining)
Left join on the column 'chrom_col'
Keep the mutations that were NOT part of the wild sample.
Cleanup, remove some columns.
Duplicate the table to create two lists of SNPs (5' & 3').
Join both tables on gene name.
Keep the SNPs having: pos(snp1) < pos(snp2)
Display the results
bash version...
# mutated exome: remove rs, remove the low qualities, only keep the 'SNP_het',
# keep only the non-synonymous or stop, remove DNA & prot sequences,
# add chrom_position flag, sort
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_het")!=0) print;}' |
awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
cut -d $'\t' -f 1-27 |
awk -F '\t' '{printf("%s_%s\t%s\n",$2,$1,$0);}' |
sort -t $'\t' -k1,1 > _jeter1.txt
# get all distinct chrom_pos in file
cut -d $'\t' -f 1 _jeter1.txt | sort -t $'\t' -k1,1 | uniq > _jeter2.txt
# wild exome: keep chrom,position, add chrom_position flag, sort
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
cut -d $'\t' -f 1,2 |
awk -F '\t' '{printf("%s_%s\n",$2,$1);}' |
sort -t $'\t' -k1,1 | uniq > _jeter3.txt
# get [m] chrom_pos not in [+] chrom_pos set
comm -2 -3 _jeter2.txt _jeter3.txt > _jeter4.txt
# join uniq [m] chrom_pos & mutated data, remove chrom_pos, order by gene
join -t $'\t' --check-order -1 1 -2 1 _jeter1.txt _jeter4.txt |
cut -d $'\t' -f 2- |
sort -t $'\t' -k20,20 > _jeter5.txt
# join to self using key = "gene name"
# only keep if first mutation in same gene/chromosome and pos1 < pos2
# keep some columns
join -t $'\t' -j 20 _jeter5.txt _jeter5.txt |
awk -F '\t' '{if($3==$29 && int($2) < int($28) ) print;}' |
cut -d $'\t' -f 1,2,3,20,26,28,46,52 > _jeter6.txt
# extract gene names
cut -d $'\t' -f 1 _jeter6.txt | sort | uniq
rm _jeter[12345].txt
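The step "keep the mutations that were NOT part of the wild sample" is a set difference on chrom_position keys, done with comm on two sorted lists. Here is a small self-contained sketch of that single step. The function name keys_only_in_first is invented, the two file arguments are whatever tables the caller supplies, and the input is assumed to be tab-delimited with the position in column 1 and the chromosome in column 2.

```shell
# Print the chrom_position keys found in the first table but not in the second.
# Assumptions: both files are tab-delimited, position in column 1,
# chromosome in column 2. LC_ALL=C keeps sort and comm on the same ordering.
keys_only_in_first() {
    _first_keys=$(mktemp) || return 1
    _second_keys=$(mktemp) || return 1
    awk -F '\t' '{ printf "%s_%s\n", $2, $1 }' "$1" | LC_ALL=C sort -u > "$_first_keys"
    awk -F '\t' '{ printf "%s_%s\n", $2, $1 }' "$2" | LC_ALL=C sort -u > "$_second_keys"
    # -2 suppresses lines unique to the second list, -3 suppresses common lines
    LC_ALL=C comm -2 -3 "$_first_keys" "$_second_keys"
    rm -f "$_first_keys" "$_second_keys"
}
```

Called as `keys_only_in_first mutated.tsv wild.tsv` (placeholder file names), it prints one key per line, for example chr10_142653.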
Last step... http://en.wikipedia.org/wiki/File:Nobel_Prize.png
Thanks. Remember: you should learn how to use the Unix command line...

Analyzing Exome Data with KNIME

  • 1. Pierre Lindenbaum PhD UMR915 – Institut du thorax Nantes, France @yokofakun http://plindenbaum.blogspot.com [email_address] Analysing Exome Data with KNIME
  • 3. [m/m] 1 st case: for a given mutation we expect... not( [m/m] )
  • 5. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A 42 columns
  • 6. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_009 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Genomic Position
  • 7. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Sample Name
  • 8. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A RS## number
  • 9. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Ref. & Alt. alleles
  • 10. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Gene
  • 11. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Prediction
  • 12. $1 Position.hg19 : 142653 $2 chrom : chr10 $3 sample.ID : sample1 $4 rs.name : rsXXXX $5 hapmap_ref_other : $6 X1000Genome.obs : $7 X1000Genome.desc : $8 Freq.HTZ.ExomesV1 : 0 $9 Freq.Hom.ExomesV1 : 0 $10 A : 0 $11 C : 5 $12 G : 0 $13 T : 3 $14 modified_call : CT $15 total : 9 $16 used : 8 $17 score : 18.30:12.00 $18 reference : C $19 type : SNP_het1 $20 Gene.name : ROXAN $21 Gene.start : 143652 $22 Gene.end : 293700 $23 strand : - $24 nbre.exon : 11 $25 refseq : NR_0090 $26 typeannot : 3-UTR $27 type.pos : $28 index.cdna : $29 index.prot : $30 Taille.cdna : 1769 $31 Intron.start : $32 Intron.end : $33 codon.wild : $34 aa.wild : $35 codon.mut : $36 aa.mut : $37 cds.wild : $38 cds.mut : $39 prot.wild : $40 prot.mut : $41 mirna : no $42 region.splice : !N/A Homo/Hetero zygote
  • 18. Rename both “Sample” columns
  • 19. Remove the sequences (save memory/speed)
  • 22. Merge data. Two columns: “SAMPLE_WILD” & “SAMPLE_MUTATED”
  • 25. Must be located in a gene
  • 28. Remove wild allele from Alt. (cleanup)
  • 31. Keep mutations carried by both samples
  • 32. Group by Gene Name & Visualize
  • 34. Retrieve the SNPs for each Gene.
  • 36. bash version...
        # assumes tab-separated columns
        # remove rs
        # in gene
        # remove the low qualities
        # keep SNP_diff
        # only the non-synonymous or stop
        # remove DNA & prot sequences
        # order by GENE
        gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
          awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
          awk -F '\t' '{if($20!="") print;}' |
          awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
          awk -F '\t' '{if(index($19,"_diff")!=0) print;}' |
          awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
          cut -d $'\t' -f 1-27 |
          sort -t $'\t' -k20,20 > _jeter1.txt

        # extract wild exome
        # remove rs
        # remove SNP_diff
        # in gene
        # order by gene
        gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
          awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
          awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
          awk -F '\t' '{if(index($19,"_diff")==0) print;}' |
          awk -F '\t' '{if($20!="") print;}' |
          cut -d $'\t' -f 1-27 |
          sort -t $'\t' -k20,20 > _jeter3.txt

        # join wild & mutated data by gene
        # check wild sample has no mutation in the pair of mutated snps
        # remove wild data
        join -t $'\t' -1 20 -2 20 _jeter1.txt _jeter3.txt |
          awk -F '\t' '{if($3==$29 && int($2) == int($28) ) print;}' |
          cut -d $'\t' -f 1 |
          sort | uniq

        rm _jeter*.txt
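The filters applied to the mutated sample can also be folded into a single awk pattern. The sketch below tries this on a toy table. The helpers `mkrow`, `toy_table` and `filter_case1`, the gene names, the positions and the full type labels (`SNP_diff`, `SNP_diff_douteux`) are all hypothetical; only the column numbers and the tested substrings come from the slides.

```shell
#!/bin/sh
# mkrow POSITION CHROM RS TYPE GENE TYPEANNOT
# Hypothetical helper: prints one tab-separated row of 26 columns and fills
# only the columns the filters look at ($1, $2, $4, $19, $20, $26).
mkrow() {
  awk -v pos="$1" -v chrom="$2" -v rs="$3" -v kind="$4" -v gene="$5" -v annot="$6" \
    'BEGIN { OFS = "\t"; $26 = annot; $1 = pos; $2 = chrom; $4 = rs; $19 = kind; $20 = gene; print }'
}

# Toy input: every value below is made up for illustration.
toy_table() {
  mkrow 100 chr1 ""    SNP_diff         GENE_A missense   # passes every filter
  mkrow 200 chr1 rs123 SNP_diff         GENE_A missense   # has an rs name
  mkrow 300 chr2 ""    SNP_diff         ""     missense   # not in a gene
  mkrow 400 chr2 ""    SNP_diff_douteux GENE_B nonsense   # low quality
  mkrow 500 chr3 ""    SNP_het1         GENE_C missense   # not a SNP_diff
  mkrow 600 chr3 ""    SNP_diff         GENE_D 3-UTR      # neither missense nor nonsense
  mkrow 700 chr4 ""    SNP_diff         GENE_E nonsense   # passes every filter
}

# Same tests as the chained awk calls, combined with && in one pattern.
# A pattern without an action prints the matching line.
filter_case1() {
  awk -F '\t' '
    substr($4, 1, 2) != "rs" &&
    $20 != "" &&
    index($19, "douteux") == 0 &&
    index($19, "_diff") != 0 &&
    (index($26, "nonsense") != 0 || index($26, "missense") != 0)
  '
}

# Show position, chromosome and gene of the rows that survive.
toy_table | filter_case1 | cut -f 1,2,20
```

Only the rows for GENE_A (position 100) and GENE_E (position 700) survive. A single awk process reads the file once, while the chain of one-test filters on the slide is easier to edit step by step.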
  • 37. 2nd case: compound heterozygous. In one gene: SNP1: [m/+], SNP2: [m/+]
  • 39. Read the [m] & [+] files (mutated sample, wild sample)
  • 40. Remove cDNA & protein sequences
  • 41. Remove the SNPs having a rs#
  • 45. Create a new column: chrom + "_" + position
  • 46. Rename the 'sample-id' columns (so that the join produces two distinct columns)
  • 47. Left join on the column 'chrom_col'
  • 48. Keep the mutations that were NOT part of the wild sample.
  • 50. Duplicate the table to create two lists of SNPs (5' & 3').
  • 51. Join both tables on gene name.
  • 52. Keep the SNPs having: pos(snp1) < pos(snp2)
  • 54. bash version...
        # assumes tab-separated columns
        # remove rs
        # only keep the 'SNP_het'
        # remove the low qualities
        # remove SNP_het*
        # only the non-synonymous or stop
        # remove DNA & prot sequences
        # add chrom_position flag
        # sort
        gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
          awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
          awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
          awk -F '\t' '{if(index($19,"_het")!=0) print;}' |
          awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
          cut -d $'\t' -f 1-27 |
          awk -F '\t' '{printf("%s_%s\t%s\n",$2,$1,$0);}' |
          sort -t $'\t' -k1,1 > _jeter1.txt

        # get all distinct chrom_pos in file
        cut -d $'\t' -f 1 _jeter1.txt | sort -t $'\t' -k1,1 | uniq > _jeter2.txt

        # extract wild exome
        # keep chrom,position
        # add chrom_position flag
        # sort
        gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
          cut -d $'\t' -f 1,2 |
          awk -F '\t' '{printf("%s_%s\n",$2,$1);}' |
          sort -t $'\t' -k 1,1 | uniq > _jeter3.txt

        # get [m] chrom_pos not in [+] chrom_pos set
        comm -2 -3 _jeter2.txt _jeter3.txt > _jeter4.txt

        # join uniq [m] chrom_pos & mutated data
        # remove chrom_pos
        # order by gene
        join -t $'\t' --check-order -1 1 -2 1 _jeter1.txt _jeter4.txt |
          cut -d $'\t' -f 2- |
          sort -t $'\t' -k 20 > _jeter5.txt

        # join to self using key = "gene name"
        # only keep if first mutation in same gene/chromosome and pos1 < pos2
        # keep some columns
        join -t $'\t' -j 20 _jeter5.txt _jeter5.txt |
          awk -F '\t' '{if($3==$29 && int($2) < int($28) ) print;}' |
          cut -d $'\t' -f 1,2,3,20,26,28,46,52 > _jeter6.txt

        # extract gene names
        cut -d $'\t' -f 1 _jeter6.txt | sort | uniq

        rm _jeter[12345].txt
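To see the compound heterozygous logic in isolation, here is a minimal sketch on a reduced three-column table (gene, chrom, position) instead of the 42-column annotation file. The function `compound_het_genes`, the file names, the gene names and the positions are hypothetical. It uses the same tools as above: `comm` to drop the positions also seen in the wild sample, then a self-join on the gene name that keeps pairs with pos1 < pos2.

```shell
#!/bin/sh
export LC_ALL=C            # byte-wise ordering, so sort, comm and join agree
TAB=$(printf '\t')

# compound_het_genes MUTATED WILD
#   MUTATED: gene<TAB>chrom<TAB>position   (hypothetical reduced layout)
#   WILD:    chrom<TAB>position
# Prints the genes carrying at least two distinct mutated-only SNPs.
compound_het_genes() {
  work=$(mktemp -d)
  # prefix each mutated row with a chrom_position key and sort on that key
  awk -F '\t' '{printf("%s_%s\t%s\n", $2, $3, $0);}' "$1" |
    sort -t "$TAB" -k1,1 > "$work/mut.keyed"
  # distinct keys of each sample
  cut -f 1 "$work/mut.keyed" | sort -u > "$work/mut.keys"
  awk -F '\t' '{printf("%s_%s\n", $1, $2);}' "$2" | sort -u > "$work/wild.keys"
  # keys present in the mutated sample only
  comm -2 -3 "$work/mut.keys" "$work/wild.keys" > "$work/only.keys"
  # keep the matching rows, drop the key, order by gene
  join -t "$TAB" -1 1 -2 1 "$work/mut.keyed" "$work/only.keys" |
    cut -f 2- |
    sort -t "$TAB" -k1,1 > "$work/by_gene"
  # self-join on gene: fields are gene, chrom1, pos1, chrom2, pos2
  join -t "$TAB" -1 1 -2 1 "$work/by_gene" "$work/by_gene" |
    awk -F '\t' '$2 == $4 && int($3) < int($5)' |
    cut -f 1 |
    sort -u
  rm -rf "$work"
}

# Toy data: GENE_B has two SNPs, but one of them is also in the wild sample.
demo=$(mktemp -d)
printf 'GENE_A\tchr1\t100\nGENE_A\tchr1\t250\nGENE_B\tchr2\t500\nGENE_B\tchr2\t900\nGENE_C\tchr3\t700\n' > "$demo/mutated.tsv"
printf 'chr2\t900\nchr9\t42\n' > "$demo/wild.tsv"
compound_het_genes "$demo/mutated.tsv" "$demo/wild.tsv"
rm -rf "$demo"
```

On this toy input only GENE_A is reported: GENE_B loses its second SNP to the wild sample, and GENE_C has a single SNP. The condition pos1 < pos2 drops the pairing of a SNP with itself and keeps each pair once.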
  • 56. Thanks. Remember: you should learn how to use the Unix command line...