Interpreting MS/MS Proteomics Results. Brian C. Searle, Proteome Software Inc., Portland, Oregon USA, Brian.Searle@ProteomeSoftware.com. NPC Progress Meeting (February 2nd, 2006). Illustrated by Toni Boudreault. The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product Scaffold. With that caveat aside…
Organization (Identify, SEQUEST, X! Tandem/Mascot, Differ, Combine). This is foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place. Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot. We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.
Start with a protein: A A I E P A T H K K Q I G L R L K N V I T I D D C G V R T A. So, this is proteomics, so we’re going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly.
Cut with an enzyme: A A I E P A T H K K Q I G L R L K N V I T I D D C G V R T A. And to use this technique you generally have to cleave the protein into peptides about 8 to 20 amino acids in length and…
Select a peptide: A A I E P A T H K K Q I G L R L K N V I T I D D C G V R T A. …look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer.
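As a rough sketch of the cutting step, here is what an in-silico digest of the slide's protein could look like. It assumes a trypsin-like enzyme that cleaves after every K or R, which is a simplification: the slide does not name the enzyme, and the function name `digest` is made up for illustration.

```python
def digest(protein, cleave_after="KR"):
    """Split a protein after every K or R (simplified trypsin-like rule)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in cleave_after:
            peptides.append(current)
            current = ""
    if current:  # trailing piece that does not end in a cleavage site
        peptides.append(current)
    return peptides


protein = "AAIEPATHKKQIGLRLKNVITIDDCGVRTA"  # sequence shown on the slide
peptides = digest(protein)
print(peptides)
```

Real digests also produce missed cleavages and very short pieces such as the single K here, which is one reason only some peptides end up being useful for identification.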
Impart energy in collision cell (A E P T I R, H2O). The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids.
Measure mass of daughter ions. The masses of these fragment ions are recorded using the second mass spectrometer. [Figure: intensity vs. m/z, with peaks at 72.0 (A), 201.1 (AE), 298.1 (AEP) and 399.2 (AEPT).]
B-type Ions (A E P T I R, H2O). These ions are commonly called B ions, based on nomenclature you don’t really want to know about… But the mass difference between the peaks corresponds directly to the amino acid sequence. [Figure: intensity vs. m/z; the spacings between successive peaks are 72.0, 129.0, 97.0, 101.0, 113.1 and 174.1.]
B-type Ions (A E P T I R, H2O). For example, the AE peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide. This is pretty neat, and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides. [Figure: intensity vs. m/z; the spacings A - 0, AE - A, AEP - AE, AEPT - AEP, AEPTI - AEPT and AEPTIR - AEPTI are 72.0, 129.0, 97.0, 101.0, 113.1 and 174.1.]
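The mass-difference idea can be sketched in a few lines. This is a minimal illustration, not how any real sequencing tool works: it takes the four peak positions shown on the earlier slide (72.0, 201.1, 298.1, 399.2), assumes they are all singly charged B ions, and looks each spacing up in a table of standard monoisotopic residue masses. The function names and the 0.2 Da tolerance are assumptions.

```python
PROTON = 1.00728  # mass of the charge-carrying proton

# Monoisotopic residue masses in daltons. I has the same mass as L,
# so only L is listed; the two cannot be told apart by mass alone.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for(mass, tol=0.2):
    """Return the residue(s) whose mass lies within tol of the given mass."""
    matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - mass) <= tol]
    return "/".join(matches) if matches else "?"


def sequence_from_b_ions(peaks, tol=0.2):
    """Read a sequence off the spacings between successive B-ion peaks."""
    previous = PROTON  # an empty B ion would sit at the proton mass
    residues = []
    for mz in sorted(peaks):
        residues.append(residue_for(mz - previous, tol))
        previous = mz
    return "".join(residues)


sequence = sequence_from_b_ions([72.0, 201.1, 298.1, 399.2])
print(sequence)
```

With the slide's four peaks this reads off AEPT. It only works because every peak is present, clean, and known to be a B ion, which is exactly what stops being true on the next few slides.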
So, it seems pretty easy, doesn’t it? But there are a couple of confounding factors. For example…
B-type Ions (A E P T I R, H2O). B ions have a tendency to degrade and lose carbon monoxide (CO), producing…
A-type Ions (A E P T I R, H2O). …A ions. Furthermore…
Y-type Ions (R I T P E A, H2O). …the second half of each fragmented peptide is represented as Y ions, which sequence backwards. And, unfortunately, this is the real world, so…
Y-type Ions (R I T P E A, H2O). …all the peaks have different measured heights, and many peaks can often be missing.
B-type, A-type, Y-type Ions (R I T P E A, H2O). All these peaks are seen together simultaneously, and we don’t even know…
…what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques,
…there’s noise, producing a final spectrum that looks like…
…this, on a good day. And so it’s actually fairly difficult to…
…compute the mass differences to sequence the peptide, certainly in a computer-automated way. [Figure: the B-ion spectrum of A E P T I R again, intensity vs. m/z, with spacings 72.0, 129.0, 97.0, 101.0, 113.1 and 174.1.]
So the community needed a new technique. Now, it wasn’t all without hope…
Known Ion Types: B-type ions, A-type ions, Y-type ions. We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…
Known Ion Types: B-type ions, A-type ions, Y-type ions, B- or Y-type +2H ions, B- or Y-type -NH3 ions, B- or Y-type -H2O ions. …we also know a couple of other variations on those ions that come up. We even know something about the…
Known Ion Types: B-type ions (100%), A-type ions (20%), Y-type ions (100%), B- or Y-type +2H ions (50%), B- or Y-type -NH3 ions (20%), B- or Y-type -H2O ions (20%). …likelihood of seeing each type of ion, where generally B and Y ions are most prominent.
If we know the amino acid sequence of a peptide, we can guess what the spectra should look like! So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.
Model Spectrum: ELVISLIVESK (courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/). So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.
Model Spectrum We can create a hypothetical spectrum based on our rules
Model Spectrum: B/Y type ions (100%); B/Y +2H type ions (50%); A type ions and B/Y -NH3/-H2O (20%). Where B and Y ions are estimated at 100%, +2H ions are estimated at 50%, and other stragglers are at 20%.
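These rules are simple enough to sketch as code. The version below is only an illustration under stated assumptions: it reads "+2H" as the doubly charged form of each B or Y ion, uses standard monoisotopic masses, and the function name `model_spectrum` and the peak labels are invented here rather than taken from SEQUEST or any other tool.

```python
PROTON = 1.00728
WATER = 18.01056
AMMONIA = 17.02655
CO = 27.99491

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity, label) tuples for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(peptide)):
        b = sum(masses[:i]) + PROTON          # N-terminal fragment
        y = sum(masses[i:]) + WATER + PROTON  # C-terminal fragment
        for label, mz in ((f"b{i}", b), (f"y{len(peptide) - i}", y)):
            peaks.append((mz, 100, label))
            peaks.append(((mz + PROTON) / 2, 50, label + "(2+)"))
            peaks.append((mz - AMMONIA, 20, label + "-NH3"))
            peaks.append((mz - WATER, 20, label + "-H2O"))
        peaks.append((b - CO, 20, f"a{i}"))   # A ion: B ion minus CO
    return sorted(peaks)


spectrum = model_spectrum("ELVISLIVESK")
for mz, intensity, label in spectrum[:5]:
    print(f"{mz:9.4f}  {intensity:3d}  {label}")
```

For an 11-residue peptide this gives ten B and ten Y ions plus their variants, which is the kind of stick spectrum the next slides overlay on the measured one.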
Model Spectrum So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…
Model Spectrum We can find where the overlap is between the hypothetical and the actual spectra…
Model Spectrum And say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.
But who cares? The more important question is “what about situations where we don’t know the sequence?”
We guess!
PepSeq. And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible (AAAAAAAAAA, AAAAAAAAAC, AAAAAAAACC, AAAAAAACCC, …, ELVISLIVESK, …, WYYYYYYYYY, YYYYYYYYYY), build a hypothetical spectrum, and find the best matching hypothetical. J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.
PepSeq. Impossibly hard after 7 or 8 amino acids! High false positive rate because you consider so many options. This was a start, but it’s clearly impossibly hard with larger peptides, and there’s a lot of room to overfit the data.
PepSeq. Impossibly hard after 7 or 8 amino acids! High false positive rate because you consider so many options. Another strategy is needed! So obviously this isn’t going to work in the long run.
Sequencing Explosion. 1977: shotgun sequencing invented, bacteriophage φX174 sequenced; 1989: Yeast Genome project announced; 1990: Human Genome project announced; 1992: first chromosome (Yeast) sequenced; 1995: H. influenzae sequenced; 1996: Yeast Genome sequenced; 2000: Human Genome draft. We needed a new invention to come around, and that was shotgun Sanger sequencing. In ’89 and ’90 the Yeast and Human Genome projects were announced, followed by the first chromosome in ’92, et cetera, et cetera.
Sequencing Explosion. In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry (Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989). And the idea was…
SEQUEST. …instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.
SEQUEST. 2 × 10^14: all possible 11mers (ELVISLIVESK); 2 × 10^10: all possible peptides in NR; 1 × 10^8: all tryptic peptides in NR; 4 × 10^6: all Human tryptic peptides in NR. So that was huge. In terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database from the NCBI, and roughly a 50 million fold difference for searching Human tryptic peptides. It made hypothetical spectrum matching feasible.
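The arithmetic behind those fold differences is quick to check. The sketch below takes the database counts straight from the slide (they are rounded figures for the NR database of the time, not something computed here) and compares them with the 20^11 possible 11mers.

```python
AMINO_ACIDS = 20
all_11mers = AMINO_ACIDS ** 11  # every possible 11-residue sequence

# Rounded counts as quoted on the slide.
slide_counts = {
    "all peptides in NR": 2e10,
    "tryptic peptides in NR": 1e8,
    "Human tryptic peptides in NR": 4e6,
}

fold = {name: all_11mers / count for name, count in slide_counts.items()}

print(f"all possible 11mers: {all_11mers:.2e}")
for name, ratio in fold.items():
    print(f"{name}: about {ratio:.1e} fold smaller search space")
```

With these rounded counts the first ratio is about 10^4, and the Human tryptic ratio comes out near 5 × 10^7.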
SEQUEST Model Spectrum. SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discontinuity between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization…
SEQUEST Model Spectrum. …like so. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.
SEQUEST Model Spectrum. And then they shifted the spectra back and…
SEQUEST Model Spectrum. …forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.
SEQUEST XCorr. This is another representation of the Cross Correlation and the Auto Correlation. [Figure: correlation score vs. offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610.]
SEQUEST XCorr. The XCorr score is the Cross Correlation divided by the average of the Auto Correlation over a 150 AMU range: XCorr = Cross Correlation / mean(Auto Correlation over a 150 AMU range). The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification. [Figure: correlation score vs. offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610.]
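A toy version of this score, following the description on the slide, might look like the sketch below. Several things are assumptions made for illustration: spectra are binned at 1 AMU with unit intensities, the background is the mean correlation over offsets from -75 to +75 AMU (excluding zero), and the function names are invented. Published descriptions of SEQUEST generally subtract the mean background from the direct correlation rather than dividing by it, so treat the ratio here as following the talk's wording rather than the program's exact formula.

```python
def binned(peak_bins, size=200):
    """Build a unit-intensity spectrum with 1 AMU bins."""
    spectrum = [0.0] * size
    for b in peak_bins:
        spectrum[b] = 1.0
    return spectrum


def correlation(observed, model, offset):
    """Sum of products of the two spectra with the model shifted by offset."""
    total = 0.0
    for i, intensity in enumerate(observed):
        j = i + offset
        if 0 <= j < len(model):
            total += intensity * model[j]
    return total


def xcorr_as_described(observed, model, window=75):
    """Direct comparison divided by the mean shifted (background) correlation."""
    direct = correlation(observed, model, 0)
    background = [correlation(observed, model, t)
                  for t in range(-window, window + 1) if t != 0]
    mean_background = sum(background) / len(background)
    return direct / mean_background if mean_background else float("inf")


observed = binned([10, 50, 70, 90, 130])   # four real peaks plus one noise peak
good_model = binned([10, 50, 90, 130])     # hypothetical spectrum that fits
bad_model = binned([15, 55, 95, 135])      # hypothetical spectrum that does not

good_score = xcorr_as_described(observed, good_model)
bad_score = xcorr_as_described(observed, bad_model)
print(good_score, bad_score)
```

The matching model scores far above its own background, while the mismatched one has no direct overlap at all and scores zero.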
SEQUEST DeltaCn And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
SEQUEST: accuracy score strong (XCorr), relative score weak (DeltaCn). Here’s another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness.
SEQUEST: accuracy score strong (XCorr), relative score weak (DeltaCn). Alternate method: accuracy score weak, relative score strong. Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill.
X! Tandem Scoring. by-Score = sum of intensities of peaks matching B-type or Y-type ions. HyperScore = by-Score × Nb! × Ny!, where Nb and Ny are the numbers of matched B-type and Y-type ions. Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand-waving argument. Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774.
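To make the factorial terms concrete, here is a small sketch of a hyperscore-style calculation. It assumes the published form of the score (the by-score multiplied by the factorials of the numbers of matched B and Y ions); the matching tolerance, the function name, and the toy peak lists are all made up for illustration and are not X! Tandem's actual code or defaults.

```python
from math import factorial


def hyperscore(observed, b_ions, y_ions, tol=0.5):
    """observed: list of (m/z, intensity); b_ions, y_ions: theoretical m/z values."""
    def matched_intensities(theoretical):
        hits = []
        for mz in theoretical:
            near = [inten for obs_mz, inten in observed if abs(obs_mz - mz) <= tol]
            if near:
                hits.append(max(near))  # take the strongest peak within tolerance
        return hits

    b_hits = matched_intensities(b_ions)
    y_hits = matched_intensities(y_ions)
    by_score = sum(b_hits) + sum(y_hits)
    return by_score * factorial(len(b_hits)) * factorial(len(y_hits))


# Hypothetical observed peaks and theoretical ion positions.
observed = [(130.0, 10.0), (147.1, 8.0), (243.1, 5.0), (500.0, 3.0)]
b_ions = [130.05, 243.13, 342.20]
y_ions = [147.11, 234.15]

score = hyperscore(observed, b_ions, y_ions)
print(score)
```

Here two B ions and one Y ion match, so the by-score of 23 is multiplied by 2! and 1!. The factorials grow quickly, which is why every additional matched ion counts for a lot.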
Distribution of “Incorrect” Hits. But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm. Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially… [Figure: histogram of # of Matches vs. Hyper Score, with the second best and best hits marked.]
Estimate Likelihood (E-Value). …and the log of the distribution is relatively linear because of the exponential decay. [Figure: Log(# of Matches) vs. Hyper Score, with the best hit marked.]
Estimate Likelihood (E-Value). If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches. [Figure: Log(# of Matches) vs. Hyper Score, with the linear fit labeled as the expected number of random matches and the best hit marked.]
Estimate Likelihood (E-Value). Score of 60 has 1/10 chance of occurring at random. And from this, you can calculate the likelihood that the best match is random. This is called an E-Value, or Expected-Value. In this case, a score of 60 corresponds to a log (base 10) number of matches of -1, which means the estimated number of random matches for that score is 0.1.
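The extrapolation can be sketched with an ordinary least-squares line through the log of the histogram. This is a simplified illustration of the idea rather than X! Tandem's actual procedure; the histogram values below are invented so that they decay exactly exponentially, and the function names are hypothetical.

```python
from math import log10


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def e_value(histogram, best_score):
    """Extrapolate log10(# of matches) vs. score out to the best hit's score."""
    points = [(score, log10(count)) for score, count in histogram.items() if count > 0]
    slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
    return 10 ** (slope * best_score + intercept)


# Hypothetical histogram of lower-scoring (assumed incorrect) hits:
# hyper score -> number of matches.
histogram = {10: 10000, 20: 1000, 30: 100, 40: 10, 50: 1}

estimate = e_value(histogram, best_score=60)
print(f"expected random matches at score 60: {estimate:.3f}")
```

With this made-up histogram the fitted line drops one log unit per ten points of score, so a best hit at 60 lands at about 0.1 expected random matches, as in the slide's example.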
X! Tandem and Mascot Now, X! Tandem calculates this E-Value empirically. Another search engine, Mascot, tries to get at the same kind of number using theoretical calculations, most likely based on the number of identified peaks and the likelihood of finding certain amino acids in the genome database. They’ve never explicitly published their algorithm, so we’ll never really know, but I suspect it’s something smart. I just want to bring up a point that we’ll touch on a little later…
X! Tandem and Mascot. …the E-Value that X! Tandem calculates and the P-Value that Mascot calculates are probabilistically based, but they can only estimate the likelihood that the match is wrong. This is realistically not nearly as useful as knowing the probability that a peptide identification is right, which is NOT 1 minus the P-Value.
SEQUEST: accuracy score XCorr, relative score DeltaCn. X! Tandem: accuracy score HyperScore, relative score E-Value. Now, let’s go back and fill in the X! Tandem part of our accuracy/relativity scoring grid.
To reiterate, the XCorr is an excellent measure of accuracy…
…whereas the E-Value is an excellent measure of how good the best score is relative to the rest. If we assume that accuracy and relativity scores are independent measures of goodness, could we use both SEQUEST’s XCorr and X! Tandem’s E-Value together?
10 Protein Control Sample. And the answer is a resounding yes. Each point on this graph is a spectrum, where correct identifications are marked in red, while incorrect identifications are marked in blue. We know what’s correct and incorrect because this is a control sample. Although in general the spectra SEQUEST scores well are spectra X! Tandem also scores well, there is considerable scatter between the search engines. [Figure: X! Tandem -log(E-Value) vs. SEQUEST Discriminant Score.]
10 Protein Control Sample. One might wonder whether X! Tandem and Mascot, which use similar scoring approaches, would benefit as much, but the answer is surprisingly still yes! Now, why are the scores so different? [Figure: X! Tandem -log(E-Value) vs. Mascot Ion Score - Identity Score.]
Why So Different? Well, here are a couple of possible reasons. SEQUEST is the only method to consider relative intensities.
SEQUEST
Considers relative intensities
X! Tandem
Considers semi-tryptic peptides
Considers only B/Y-type Ions
Mascot
Considers theoretical P-Value relative to search space
Why So Different? X! Tandem is the only method to consider peptides outside the standard search space by default, such as semi-tryptic peptides. However, it’s the only score that considers only B and Y ions, as opposed to a complete model.
Why So Different? And Mascot is the only search engine to compute a completely theoretical P-Value.
Consider Multiple Algorithms? So we clearly want to consider multiple search engines simultaneously, but how? [Figure: X! Tandem -log(E-Value) vs. Mascot Ion Score - Identity Score.]
How To Compare Search Engines? SEQUEST: XCorr > 2.5, DeltaCn > 0.1; Mascot: Ion Score - Identity Score > 0; X! Tandem: E-Value < 0.01. You can’t use a thresholding system because it’s impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn’t mean the same thing as an X! Tandem match with an E-Value of 0.01.

  • 1. Interpreting MS/MS Proteomics Results. Brian C. Searle, Proteome Software Inc., Portland, Oregon USA, Brian.Searle@ProteomeSoftware.com. NPC Progress Meeting (February 2nd, 2006). Illustrated by Toni Boudreault. The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product Scaffold. With that caveat aside…
  • 2. Organization: Identify, SEQUEST, X! Tandem/Mascot, Differ, Combine. This is foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place. Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot. We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.
  • 3. Start with a protein. So, this is proteomics, so we’re going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly. [Figure: protein sequence AAIEPATHKKQIGLRLKNVITIDDCGVRTA]
  • 4. Cut with an enzyme. And to use this technique you generally have to digest the protein into peptides about 8 to 20 amino acids in length and… [Figure: the protein sequence cut into peptides]
  • 5. Select a peptide. Look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer. [Figure: one peptide picked out of the digest]
  • 6. Impart energy in collision cell. The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids. [Figure: peptide A E P T I R (+H2O) breaking apart]
  • 7. Measure mass of daughter ions. The masses of these fragment ions are recorded using the second mass spectrometer. [Figure: spectrum, intensity vs. m/z, with peaks for A (72.0), AE (201.1), AEP (298.1), and AEPT (399.2)]
  • 8. B-type Ions. These ions are commonly called B ions, based on nomenclature you don’t really want to know about… But the mass difference between the peaks corresponds directly to the amino acid sequence. [Figure: B-type ion spectrum of A E P T I R (+H2O), intensity vs. m/z, with peak spacings 72.0, 129.0, 97.0, 101.0, 113.1, 174.1]
  • 9. B-type Ions. For example, the A-E peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide. This is pretty neat, and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides. [Figure: the same spectrum with the spacings labeled A-0 = 72.0, AE-A = 129.0, AEP-AE = 97.0, AEPT-AEP = 101.0, AEPTI-AEPT = 113.1, AEPTIR-AEPTI = 174.1]
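The mass-difference idea on this slide can be sketched in a few lines of Python. This is only an illustration for an ideal, complete ladder of B-type peaks: the residue masses are standard monoisotopic values, the function name `sequence_from_ladder` is made up for this example, and the last two peaks (512.3 and 686.4) are not on the slides but were computed from those residue masses. I and L have the same mass, so a real tool could not tell them apart this way.

```python
# Monoisotopic residue masses (Da) for the six residues in the slide example.
RESIDUE_MASS = {
    "A": 71.037, "E": 129.043, "P": 97.053,
    "T": 101.048, "I": 113.084, "R": 156.101,
}
WATER = 18.011   # the intact peptide carries one extra H2O (the last spacing)
PROTON = 1.007   # singly charged ions carry one extra proton (the first spacing)


def sequence_from_ladder(peaks, tolerance=0.1):
    """Read a sequence from an ideal, complete ladder of B-type peaks.

    Assumes the final peak is the intact peptide, so its spacing includes water.
    """
    sequence = []
    previous = PROTON
    for index, peak in enumerate(peaks):
        spacing = peak - previous
        if index == len(peaks) - 1:
            spacing -= WATER
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - spacing) <= tolerance]
        sequence.append(matches[0] if matches else "?")
        previous = peak
    return "".join(sequence)


# First four peaks are from the slides; the last two are computed (assumption).
peaks = [72.0, 201.1, 298.1, 399.2, 512.3, 686.4]
print(sequence_from_ladder(peaks))  # AEPTIR
```

The sketch only works because every peak is present and every peak is a B ion, which is exactly the assumption the next slides take apart.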
  • 10. So, it seems pretty easy, doesn’t it? But there are a couple of confounding factors. For example…
  • 11. B-type Ions. B ions have a tendency to degrade and lose carbon monoxide, producing… [Figure: B-type ions of A E P T I R (+H2O), each losing CO; intensity vs. m/z]
  • 12. A-type Ions. … A ions. Furthermore… [Figure: A-type ion spectrum of A E P T I R after CO loss]
  • 13. Y-type Ions. … the second half of each fragmented peptide shows up as Y ions, which sequence backwards. And, unfortunately, this is the real world, so… [Figure: Y-type ion spectrum reading R I T P E A (+H2O); intensity vs. m/z]
  • 14. Y-type Ions. … all the peaks have different measured heights, and many peaks can often be missing. [Figure: Y-type ion spectrum with uneven and missing peaks]
  • 15. B-type, A-type, Y-type Ions. All these peaks are seen together simultaneously, and we don’t even know… [Figure: combined spectrum; intensity vs. m/z]
  • 16. … what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques… [Figure: unlabeled spectrum; intensity vs. m/z]
  • 17. … there’s noise, producing a final spectrum that looks like… [Figure: spectrum with noise; intensity vs. m/z]
  • 18. … this, on a good day. And so it’s actually fairly difficult to… [Figure: realistic noisy spectrum; intensity vs. m/z]
  • 19. … compute the mass differences to sequence the peptide, certainly in a computer automated way. [Figure: the ideal A E P T I R ladder with spacings 72.0, 129.0, 97.0, 101.0, 113.1, 174.1]
  • 20. So the community needed a new technique. Now, it wasn’t all without hope…
  • 21. Known Ion Types: B-type ions, A-type ions, Y-type ions. We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…
  • 22. Known Ion Types: B-type ions, A-type ions, Y-type ions, B- or Y-type +2H ions, B- or Y-type -NH3 ions, B- or Y-type -H2O ions. … we also know a couple of other variations on those ions that come up. We even know something about the…
  • 23. Known Ion Types: B-type ions (100%), A-type ions (20%), Y-type ions (100%), B- or Y-type +2H ions (50%), B- or Y-type -NH3 ions (20%), B- or Y-type -H2O ions (20%). … likelihood of seeing each type of ion, where generally B and Y ions are most prominent.
  • 24. If we know the amino acid sequence of a peptide, we can guess what the spectrum should look like! So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.
  • 25. Model Spectrum: ELVISLIVESK. So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle. (Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/)
  • 26. Model Spectrum. We can create a hypothetical spectrum based on our rules…
  • 27. … where B and Y ions are estimated at 100%, +2H ions are estimated at 50%, and other stragglers are at 20%. [Legend: B/Y type ions (100%); B/Y +2H type ions (50%); A type ions and B/Y -NH3/-H2O (20%)]
  • 28. Model Spectrum. So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…
  • 29. Model Spectrum. … we can find where the overlap is between the hypothetical and the actual spectra…
  • 30. Model Spectrum. … and say conclusively, based on the evidence, that the spectrum does belong to the ELVIS LIVES K peptide.
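As a rough illustration of the rules above, the following Python sketch builds a hypothetical spectrum for ELVISLIVESK and sums the model peaks that overlap an observed peak list. Several things here are assumptions rather than anything stated in the talk: the function names, the 0.5 m/z matching tolerance, the reading of "+2H" as doubly charged ions, and the use of standard monoisotopic masses. The "observed" peaks in the usage lines are simulated from the model itself, not real data.

```python
# Monoisotopic residue masses (Da) for the residues in ELVISLIVESK.
RESIDUE_MASS = {"E": 129.04259, "L": 113.08406, "V": 99.06841,
                "I": 113.08406, "S": 87.03203, "K": 128.09496}
PROTON, WATER, AMMONIA, CO = 1.00728, 18.01056, 17.02655, 27.99491


def model_spectrum(peptide):
    """Return (m/z, relative intensity) pairs following the 100/50/20 rules."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for cut in range(1, len(peptide)):
        b = sum(masses[:cut]) + PROTON            # singly charged B ion
        y = sum(masses[cut:]) + WATER + PROTON    # singly charged Y ion
        peaks.append((b - CO, 20.0))              # A ion: B ion minus CO
        for ion in (b, y):
            peaks.append((ion, 100.0))                 # B/Y ions
            peaks.append(((ion + PROTON) / 2, 50.0))   # assumed meaning of +2H
            peaks.append((ion - AMMONIA, 20.0))        # loss of NH3
            peaks.append((ion - WATER, 20.0))          # loss of H2O
    return peaks


def overlap_score(model, observed_mz, tolerance=0.5):
    """Sum the model intensities of peaks that have an observed peak nearby."""
    return sum(intensity for mz, intensity in model
               if any(abs(mz - obs) <= tolerance for obs in observed_mz))


model = model_spectrum("ELVISLIVESK")
# Simulated observation: only the B and Y ions of the true peptide show up.
observed = [mz for mz, intensity in model if intensity == 100.0]
print(overlap_score(model, observed))
print(overlap_score(model_spectrum("KSEVILSIVLE"), observed))
```

The first score should come out well above the second, which is the "find the overlap" step from the slides reduced to a single number.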
  • 31. But who cares? The more important question is “what about situations where we don’t know the sequence?”
  • 33. PepSeq. And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical. [Candidate list: AAAAAAAAAA, AAAAAAAAAC, AAAAAAAACC, AAAAAAACCC, …, ELVISLIVESK, …, WYYYYYYYYY, YYYYYYYYYY] (J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.)
  • 34. PepSeq. This was a start, but it’s clearly impossibly hard with larger peptides. Impossibly hard after 7 or 8 amino acids! There is also a high false positive rate because you consider so many options, and there’s a lot of room to overfit the data.
  • 35. PepSeq. Impossibly hard after 7 or 8 amino acids! High false positive rate because you consider so many options. So obviously this isn’t going to work in the long run. Another strategy is needed!
  • 36. Sequencing Explosion. We needed a new invention to come around, and that was shotgun Sanger sequencing. In ’89 and ’90 the Yeast and Human Genome projects were announced, followed by the first chromosome in ’92, et cetera, et cetera. Timeline: 1977 shotgun sequencing invented, bacteriophage φX174 sequenced; 1989 Yeast Genome project announced; 1990 Human Genome project announced; 1992 first chromosome (Yeast) sequenced; 1995 H. influenzae sequenced; 1996 Yeast Genome sequenced; 2000 Human Genome draft; …
  • 37. Sequencing Explosion (same timeline as the previous slide). In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was… (Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.)
  • 38. SEQUEST. … instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.
  • 39. SEQUEST. 2×10^14: all possible 11mers (ELVISLIVESK); 2×10^10: all possible peptides in NR; 1×10^8: all tryptic peptides in NR; 4×10^6: all Human tryptic peptides in NR. So that was huge. In terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database from the NCBI, and roughly a 50 million fold difference for searching human tryptic peptides. It made hypothetical spectrum matching feasible.
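The fold differences follow directly from the counts on the slide. A quick check in Python, using 20 amino acids and the slide's own round numbers for the database sizes (the variable and function names are just for this example):

```python
AMINO_ACIDS = 20
PEPTIDE_LENGTH = 11  # ELVISLIVESK

all_11mers = AMINO_ACIDS ** PEPTIDE_LENGTH   # every possible 11mer, about 2e14

# Candidate counts as quoted on the slide (rounded figures).
peptides_in_nr = 2e10
tryptic_in_nr = 1e8
human_tryptic_in_nr = 4e6


def fold_reduction(candidates):
    """How many times smaller the search space is than all possible 11mers."""
    return all_11mers / candidates


for label, count in [("all peptides in NR", peptides_in_nr),
                     ("tryptic peptides in NR", tryptic_in_nr),
                     ("human tryptic peptides in NR", human_tryptic_in_nr)]:
    print(f"{label}: {fold_reduction(count):.1e} fold fewer candidates")
```

With these inputs the quoted counts work out to about ten thousand fold for all of NR and about fifty million fold for human tryptic peptides.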
  • 40. SEQUEST. SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discontinuity between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization… [Figure: model spectrum]
  • 41. SEQUEST. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra. Like so. [Figure: model spectrum overlaid on the actual spectrum]
  • 42. SEQUEST. And then they shifted the spectra back and… [Figure: model spectrum shifted against the actual spectrum]
  • 43. SEQUEST. … forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background. [Figure: model spectrum shifted against the actual spectrum]
  • 44. SEQUEST XCorr. This is another representation of the Cross Correlation and the Auto Correlation. [Figure: correlation score vs. offset (AMU), marking the cross correlation (direct comparison) and the auto correlation (background)] (Gentzel M. et al., Proteomics 3 (2003) 1597-1610)
  • 45. SEQUEST XCorr. The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range: XCorr = (cross correlation) / (average auto correlation over the 150 AMU offset range). The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification. [Figure: correlation score vs. offset (AMU), marking the cross correlation (direct comparison) and the auto correlation (background)] (Gentzel M. et al., Proteomics 3 (2003) 1597-1610)
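To make the description concrete, here is a minimal NumPy sketch of the score as the talk words it: the correlation at zero offset divided by the average correlation over offsets of -75 to +75 AMU. The 1 AMU binning, the exclusion of the zero offset from the background, the function names, and the toy spectra are all assumptions for illustration. Published descriptions of SEQUEST often express the background correction as a subtraction rather than a ratio, so treat this as a reading of the slide, not as SEQUEST itself.

```python
import numpy as np


def correlation_at_offset(model, observed, offset):
    """Sum of model[i] * observed[i + offset] over the bins where both exist."""
    n = len(model)
    if offset >= 0:
        return float(np.dot(model[: n - offset], observed[offset:]))
    return float(np.dot(model[-offset:], observed[: n + offset]))


def xcorr(model, observed, max_offset=75):
    """Direct comparison divided by the mean background over a 150 AMU range."""
    direct = correlation_at_offset(model, observed, 0)
    background = np.mean([
        correlation_at_offset(model, observed, offset)
        for offset in range(-max_offset, max_offset + 1)
        if offset != 0
    ])
    return direct / background


# Toy spectra on 1 AMU bins (made-up numbers, not real data).
observed = np.full(1000, 0.01)            # low uniform noise floor
observed[[100, 200, 300, 400]] += 1.0     # four real peaks
matching_model = np.zeros(1000)
matching_model[[100, 200, 300, 400]] = 1.0
shifted_model = np.zeros(1000)
shifted_model[[150, 250, 350, 450]] = 1.0  # peaks in the wrong places

print(xcorr(matching_model, observed))
print(xcorr(shifted_model, observed))
```

The matching model scores far above its background, while the shifted model scores below it, which is the behavior the slide describes.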
  • 46. SEQUEST DeltaCn And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
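The formula shown on this slide did not survive in the transcript. The commonly quoted definition of DeltaCn is the gap between the best and second-best XCorr, normalized by the best XCorr; the sketch below assumes that definition, and the function name and example numbers are made up.

```python
def delta_cn(xcorrs):
    """Relative gap between the best and second-best XCorr for one spectrum.

    Assumes the commonly quoted form (best - second) / best.
    """
    ranked = sorted(xcorrs, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return (best - runner_up) / best


# Hypothetical XCorr values for the candidate peptides of one spectrum.
print(delta_cn([3.0, 2.7, 1.5]))
```

Only the top two candidates enter the calculation, which is one way to see why the talk calls it crude.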
  • 47. SEQUEST. Here’s another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness. [Grid: SEQUEST has a strong accuracy score (XCorr) and a weak relative score (DeltaCn)]
  • 48. Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill. [Grid: SEQUEST has a strong accuracy score (XCorr) and a weak relative score (DeltaCn); the alternate method has a weak accuracy score and a strong relative score]
  • 49. X! Tandem Scoring. by-Score = sum of intensities of peaks matching B-type or Y-type ions. HyperScore = [formula image not captured in this transcript]. Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand waving argument. (Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774)
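The HyperScore formula itself is missing from the transcript. As I understand the cited Fenyo and Beavis paper, it multiplies the by-Score by the factorials of the number of matched B ions and matched Y ions; the sketch below assumes that form, with a made-up function name and made-up intensities.

```python
from math import factorial


def hyperscore(matched_b_intensities, matched_y_intensities):
    """by-Score times N_b! times N_y! (assumed form of the missing formula)."""
    by_score = sum(matched_b_intensities) + sum(matched_y_intensities)
    return (by_score
            * factorial(len(matched_b_intensities))
            * factorial(len(matched_y_intensities)))


# Two matched B ions and three matched Y ions (hypothetical intensities).
print(hyperscore([10.0, 20.0], [5.0, 5.0, 5.0]))  # 540.0
```

The factorial terms reward matching many ions far more steeply than the summed intensity does.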
  • 50. Distribution of “Incorrect” Hits. But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm. Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially… [Figure: histogram of # of matches vs. hyper score, with the second best and best hits marked]
  • 51. Estimate Likelihood (E-Value). … and the log of the distribution is relatively linear because of the exponential decay. [Figure: log(# of matches) vs. hyper score, with the best hit marked]
  • 52. Estimate Likelihood (E-Value). If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches. [Figure: log(# of matches) vs. hyper score with a linear fit labeled “expected number of random matches”; best hit marked]
  • 53. Estimate Likelihood (E-Value). And from this, you can calculate the likelihood that the best match is random. This is called an E-Value, or Expected-Value. In this case, a score of 60 corresponds with a log number of matches of -1, which means the estimated number of random matches for that score is 0.1. [Figure: log(# of matches) vs. hyper score; a score of 60 has a 1/10 chance of occurring at random]
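The extrapolation on these slides can be sketched with a straight-line fit to the log of the score histogram. This is a simplification that follows the slides rather than X! Tandem's actual implementation; the function name and the histogram counts are invented so that the fitted line passes through -1 at a score of 60, matching the example in the talk.

```python
import numpy as np


def e_value(scores, counts, best_score):
    """Fit log10(count) vs. score with a line and extrapolate to best_score."""
    scores = np.asarray(scores, dtype=float)
    counts = np.asarray(counts, dtype=float)
    keep = counts > 0                      # log10 is undefined for empty bins
    slope, intercept = np.polyfit(scores[keep], np.log10(counts[keep]), 1)
    return 10 ** (slope * best_score + intercept)


# Hypothetical histogram of lower-scoring (assumed incorrect) matches.
scores = [10, 20, 30, 40]
counts = [10000, 1000, 100, 10]
print(e_value(scores, counts, best_score=60))  # about 0.1
```

A best hit further out along the score axis gives a smaller E-Value, meaning fewer random matches are expected at that score.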
  • 54. X! Tandem and Mascot. Now, X! Tandem calculates this E-Value empirically. Another search engine, Mascot, tries to get at the same kind of number using theoretical calculations, most likely based on the number of identified peaks and the likelihood of finding certain amino acids in the genome database. They’ve never explicitly published their algorithm, so we’ll never really know, but I suspect it’s something smart. I just want to bring up a point that we’ll touch on a little later…
  • 55. X! Tandem and Mascot. … the E-Value that X! Tandem calculates and the P-Value that Mascot calculates are probabilistically based, but they can only estimate the likelihood that the match is wrong. This is realistically not nearly as useful as knowing the probability that a peptide identification is right, which is NOT 1 minus the P-Value.
  • 56. Now, let’s go back and fill in the X! Tandem part of our accuracy/relativity scoring grid. [Grid: SEQUEST has accuracy score XCorr and relative score DeltaCn; X! Tandem has accuracy score HyperScore and relative score E-Value]
  • 57. To reiterate, the XCorr is an excellent measure of accuracy… [Same grid]
  • 58. … whereas the E-Value is an excellent measure of how good the best score is relative to the rest. If we assume that accuracy and relativity scores are independent measures of goodness, could we use both SEQUEST’s XCorr and X! Tandem’s E-Value together? [Same grid]
  • 59. 10 Protein Control Sample. And the answer is a resounding yes. Each point on this graph is a spectrum, where correct identifications are marked in red, while incorrect identifications are marked in blue. We know what’s correct and incorrect because this is a control sample. Although in general the spectra SEQUEST scores well are spectra X! Tandem also scores well, there is considerable scatter between the search engines. [Figure: scatter plot of X! Tandem -log(E-Value) vs. SEQUEST discriminant score]
  • 60. 10 Protein Control Sample. One might wonder whether X! Tandem and Mascot, which use similar scoring approaches, would benefit as much, but the answer is surprisingly still yes! Now, why are the scores so different? [Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
  • 67. SEQUEST is the only method to consider relative intensities. [Slide text: “Considers theoretical P-Value relative to search space”]
  • 74. … such as semi-tryptic peptides. However, X! Tandem’s is the only score that considers only B and Y ions, as opposed to a complete model.
  • 81. And Mascot is the only search engine to compute a completely theoretical P-Value.
  • 82. Consider Multiple Algorithms? So we clearly want to consider multiple search engines simultaneously, but how? [Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
  • 83. How To Compare Search Engines? SEQUEST: XCorr>2.5, DeltaCn>0.1. Mascot: Ion Score-Identity Score>0. X! Tandem: E-Value<0.01. You can’t use a thresholding system because it’s impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn’t mean the same thing as an X! Tandem match with an E-Value of 0.01.
  • 86. Need to convert scores to probabilities! The simplest way would be to convert the scores into probabilities and compare those. We advocate for Andrew Keller and Alexey Nesvizhskii’s Peptide Prophet approach because it actually calculates a true probability, not just a p-value.
  • 87. 10 Protein Control Sample (Q-ToF): X! Tandem approach. So if you remember, X! Tandem considers the best peptide match for a spectrum against a distribution of incorrect matches. [Figure: histogram of # of matches vs. Mascot Ion-Identity Score, showing the other incorrect IDs for the spectrum and one “possibly correct?” match]
  • 88. 10 Protein Control Sample (Q-ToF): Peptide Prophet approach. Well, Peptide Prophet looks across the entire sample, and not at just one spectrum at a time. It compares the best match against all of the other best matches in the sample, a distribution which is clearly bimodal. [Figure: histogram of ALL other “best” matches, # of matches vs. Mascot Ion-Identity Score, with one “possibly correct?” match] (Keller, A. et al., Anal. Chem. 74, 5383-5392)
  • 89. 10 Protein Control Sample (Q-ToF): Peptide Prophet approach. The low mode represents matches that are most likely wrong, while the high mode represents matches that are probably right. [Same figure] (Keller, A. et al., Anal. Chem. 74, 5383-5392)
  • 90. 10 Protein Control Sample (Q-ToF): Peptide Prophet approach. Peptide Prophet curve fits two distributions to the modes, following the assumption that the low scoring distribution is “incorrect” and that the higher scoring distribution is “correct”. [Figure: the histogram with fitted “Incorrect” and “Correct” curves]
  • 91. 10 Protein Control Sample (Q-ToF). These two distributions can be analyzed using Bayesian statistics with this formula. Now that formula looks pretty complex, but… [Figure: fitted “Incorrect” and “Correct” curves; the formula image was not captured in this transcript]
  • 92. 10 Protein Control Sample (Q-ToF). … it just calculates the height of the correct distribution at a particular score, divided by the combined height of both distributions at that score. [Figure: fitted “Incorrect” and “Correct” curves]
  • 93. 10 Protein Control Sample (Q-ToF). This is essentially the probability of having that score and being correct, divided by the probability of just having that score. [Figure: fitted “Incorrect” and “Correct” curves]
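The "height of the correct curve over the height of both curves" description is Bayes' rule applied to a two-component mixture. The formula image is missing from the transcript, so the sketch below is a reconstruction of the idea, not Peptide Prophet's actual model: both distributions are assumed to be normal curves for simplicity, and the means, widths, mixing fraction, and function names are made up.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Height of a normal curve at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def probability_correct(score, correct, incorrect, fraction_correct):
    """P(correct | score) for a mixture of two fitted curves.

    `correct` and `incorrect` are (mean, sd) pairs; `fraction_correct` scales
    the height of the correct curve relative to the incorrect one.
    """
    height_correct = fraction_correct * normal_pdf(score, *correct)
    height_incorrect = (1 - fraction_correct) * normal_pdf(score, *incorrect)
    return height_correct / (height_correct + height_incorrect)


# Hypothetical fitted curves on a Mascot-like score axis.
correct, incorrect = (10.0, 5.0), (-10.0, 5.0)
for score in (-10.0, 0.0, 10.0):
    print(score, probability_correct(score, correct, incorrect, 0.5))
```

A score midway between two equally tall curves comes out at 0.5, and the probability climbs toward 1 as the score moves into the correct curve.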
  • 94. This is a neat method because it actually considers the likelihood of being correct, unlike X! Tandem and Mascot, which only calculate the probability of being incorrect. It’s because of this that Peptide Prophet can produce a true probability, which is important when the sample characteristics change. [Figure: fitted “Incorrect” and “Correct” curves, # of matches vs. Mascot Ion-Identity Score]
  • 95. Q-ToF. For example, the control sample we’ve been looking at was derived from Q-ToF data, which produces pretty high quality results. [Figure: Q-ToF “Incorrect” and “Correct” curves]
  • 96. Q-ToF vs. Ion Trap. If you compare that to the same sample run on an Ion Trap, the probability of being correct is greatly diminished. If you’ll note, the Incorrect distribution doesn’t change very much between the two analyses; however, the likelihood that the identification is right changes dramatically! [Figure: “Incorrect” and “Correct” curves for Q-ToF and for Ion Trap, # of matches vs. Mascot Ion-Identity Score]
  • 97. Ion Trap. As Peptide Prophet considers the correct distribution, it is immune to fluctuations between samples. P-Values and E-Values don’t consider this information, so they can’t be compared across multiple samples or across different examinations of the same sample, which is why we need to use Peptide Prophet for comparing two different search engines. [Figure: Ion Trap “Incorrect” and “Correct” curves]
  • 98. Consider Multiple Algorithms? So going back to the scatter plot between X! Tandem and Mascot, we can use Peptide Prophet to compute the score threshold that represents a 95% cut-off… [Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
  • 99. Consider Multiple Algorithms? Like so. This allows you to fairly consider the answers from both search engines simultaneously. The important thing to note is that if you looked at a different sample, these thresholds should change depending on the height of the correct distributions. [Figure: the scatter plot with 95% thresholds marked: Mascot -2.5, X! Tandem 2.6]
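To see why the 95% threshold moves with the height of the correct distribution, the following sketch searches for the score at which the probability of being correct reaches 0.95. Everything numeric here is invented: the two curves are assumed normal with equal widths (so the probability rises steadily with score and a simple bisection works), and the "good sample" and "poor sample" fractions are only stand-ins for the Q-ToF and Ion Trap cases.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Height of a normal curve at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def probability_correct(score, correct, incorrect, fraction_correct):
    """Correct-curve height divided by the combined height of both curves."""
    c = fraction_correct * normal_pdf(score, *correct)
    i = (1 - fraction_correct) * normal_pdf(score, *incorrect)
    return c / (c + i)


def score_threshold(target, correct, incorrect, fraction_correct,
                    low=-50.0, high=50.0):
    """Bisect for the score where probability_correct first reaches `target`.

    Assumes the probability increases with score between `low` and `high`.
    """
    for _ in range(60):
        middle = (low + high) / 2
        if probability_correct(middle, correct, incorrect,
                               fraction_correct) < target:
            low = middle
        else:
            high = middle
    return (low + high) / 2


correct, incorrect = (10.0, 5.0), (-10.0, 5.0)   # hypothetical fitted curves
print(score_threshold(0.95, correct, incorrect, fraction_correct=0.5))  # good sample
print(score_threshold(0.95, correct, incorrect, fraction_correct=0.1))  # poor sample
```

The incorrect curve is identical in both calls, yet the sample with fewer correct matches needs a higher score to reach the same 95% level, which is the point made about thresholds changing between samples.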
  • 101. Using multiple search engines simultaneously yields better results.
  • 102. All of the search engines look at different criteria. Peptide Prophet can normalize search engine results.