Abstract – Pancreatic cancer is associated with an
incredibly high mortality rate as over 80% of patients are
initially diagnosed after the cancer has metastasized. This
type of cancer is often asymptomatic when still localized to
the pancreas and a lack of understanding of specific
biomarkers and tumor precursors continues to hinder early
detection. This paper describes methods for integration of
multi-omic data into a prediction model for the
classification of pancreatic cancer patients. The goal of this
study is to uncover potentially novel genomic pathways and
relationships between miRNA and protein data, testing our
hypotheses that multi-modal data integration can provide
better classification than analyzing data from single
modalities, and striving towards identification of biomarkers
to advance early detection, genomic profiling, and even
targeted therapy for pancreatic cancer patients.
Keywords – pancreatic cancer; TCGA;
bioinformatics; biomarkers; multi-omic; multimodal; SVM;
leave-one-out cross validation
I. INTRODUCTION
A. Background and Motivation
Pancreatic cancer (PC) is the twelfth most common
cancer and the seventh most common cause of death from
cancer in the world. With nearly 350,000 new cases
worldwide each year, the most recent studies estimate that in
2015, 48,960 people will be diagnosed in the U.S. alone. PC
has the highest overall mortality rate, with 94% of all
diagnosed patients deceased within five years of their
diagnoses. Nearly 99% of all PC cases originate from
exocrine cells, with about 85% of all PC cases belonging to a
group known as pancreatic adenocarcinoma. Despite the
relative homogeneity of PC diagnoses, effective early
detection has yet to be achieved. Poor prognosis of PC can be
attributed mainly to the majority of patients being diagnosed
at an advanced stage when the cancer is resistant to treatment
and may have already metastasized [1].
There are several reasons why early detection is difficult
for this population. PC patients are often asymptomatic until
the cancer has already spread. Additionally, routine physical
exams cannot be used to detect PC as the tumors will not be
visible or easily palpated as can be the case with cancers of
the skin, breast, or colon. Therefore, the first step of early
detection consists of identifying factors that may predispose
different people to PC. The ability to integrate multiple
modalities of patient data is necessary to advance our
understanding of PC precursors and enhance early detection
methods.
B. Diagnosis and Treatment
Methods of diagnosis currently in use include imaging
tests, bio-fluid analysis, and tissue biopsy. Blood tests are
often used to evaluate organ functioning, notably liver
function for patients with jaundice, which is one of the first
noticeable signs of pancreatic cancer. Biofluid testing can
facilitate the identification of proteins that act as tumor
markers, and preferably even precursors to these conditions.
Advanced exocrine pancreatic cancer may result in elevated
levels of tumor markers such as CA 19-9 and CEA in the
bloodstream, but this is not always reliable. Similarly, the
levels of several hormones in the blood can be measured for
neuroendocrine PC. Detection and measuring of these
markers may be more useful in evaluating the effectiveness of
treatment for patients already known to have pancreatic
cancer [2].
The availability of high throughput omics has enabled
identification of PC biomarkers not only in the blood, but also
in urine and even saliva. Lau et al. describe a method of
identifying several salivary transcriptomic biomarkers of
pancreatic cancer via RNA extraction of murine saliva [3].
As omics technologies continue their advancement, further
studies such as this one will continue to expand our
understanding of what and how we measure the body’s
signals and ultimately contribute to advances in early
detection and diagnosis. Biofluid testing, fueled by
bioinformatics, holds promise for identifying the genes, RNA,
proteins, lipids, carbohydrates and metabolites that may act as
precursors to pancreatic cancers.

Integration of Multi-Modal –Omic Data for
Prediction of Pancreatic Cancer Survival
Vikram Babu
Wallace H. Coulter Department of Biomedical Engineering
Georgia Institute of Technology
Atlanta, GA
Jacob Upperco
Wallace H. Coulter Department of Biomedical Engineering
Georgia Institute of Technology
Atlanta, GA

Several imaging tests currently available generally rely
on a contrast agent/dye to allow identification of strictures or
abnormal masses. These include different forms of
tomography, MRI, ultrasound, cholangiopancreatography,
scintigraphy, and angiography. Somatostatin receptor
scintigraphy (SRS) is an example of an imaging test that
highlights the potential of omics research. This technique
consists of the injection of a hormone-like substance, called
octreotide, bound to a radioactive substance for visualization.
Octreotide attaches directly to specific proteins on the tumor
cells of many neuroendocrine cancers [1]. While this is only
effective for a tiny portion of the overall PC cases, it acts as a
precedent for diagnosis methods that may be specific to
distinct PC subtypes. Further analysis of gene and protein
expression of different tumor types is required for more
advanced tests.
Tissue biopsies are generally considered the only
surefire test of identifying pancreatic cancer in an individual.
Biopsies rely on imaging procedures to locate possible
tumors, and so endoscopic imaging techniques are
advantageous in respect to being able to immediately gather a
tissue sample during the same procedure. Complete tissue
resection only has potentially curative effects when a cancer
is still confined to its original tissue. In metastatic cases,
systemic treatments are sought, but every cancer subtype
reacts differently to different medications. In conjunction
with omics data, biopsies can be utilized to help identify
specific cancer subtypes in resected tumors, facilitating
optimal treatment regimes for different patients.
Many different treatments are currently in use and they
may be chosen depending on subtypes and stages of the
pancreatic cancers. Early stage cancers may be treated
through surgery and removal if still localized, although more
than 80% of pancreatic cancers have metastasized by the
diagnosis [2]. Until early detection is made feasible through
biomarker and precursor identification, physical removal or
destruction of tumors will continue to be uncommon and we
must rely on integrative bioinformatics to enhance accuracy
in cancer subtyping to guide towards the most effective
treatment option.
Radiation therapy is utilized more often for exocrine
PCs than for neuroendocrine PCs. Chemotherapy is the use
of anti-cancer drugs to destroy tumors that have spread to
other parts of the body. Targeted therapy is a more recent
development, in which new drugs are developed that attack
specific targets in cancer cells, as well as therapies that can
boost a patient’s immune system. This represents another
challenge that must be answered through analysis and
integration of bioinformatics. Personalized therapies along
these lines can only be possible through identification of
patient group/subtype-specific mutations.
This highlights an important route for future research,
and one of the most promising directions for the application
of analysis and integration of multi-omics data. Genetic
predispositions, as discussed above, represent one method of
personalized medicine in our increasing ability to predict
patient risks for certain diseases. This ties in with increasing
our understanding of and capabilities for patient-specific
therapy as well. Trastuzumab, for example, is considered a
very effective drug in breast cancer treatment, targeting
human epidermal growth factor receptor 2 (HER2). This drug
is only beneficial for the 10-20% of breast cancer patients
with amplification of this receptor, though, and so different
treatment regimes must be selected for different patient
groups [4].
The true challenges for these informatics approaches
lie in making sense of the large amounts of data we collect
from patients and laboratory studies. These data are collected
using different modalities and sources, each generated at its
own rate. Data in the clinical space may consist of
hand-written notes taken by doctors and transcribed by nurses
into electronic format, so it is easy to understand how
incorrect and missing data arise. Similar challenges appear
when analyzing patient groups in which the available data
vary widely because each patient received different tests.
Proteomic, genomic, transcriptomic and even imaging methods
all let us collect information, but their outputs need to
complement each other rather than be analyzed solely in
parallel. This last point highlights a serious challenge in
integrating all of these data towards identification of
exploitable targets in different cancer subtypes.
II. LITERATURE SURVEY
Shen et al. analyzed DNA copy number and mRNA
expression from two sources: breast cancer cell lines obtained
from the American Type Culture Collection and lung
adenocarcinomas from Memorial Sloan-Kettering Cancer
Center. The methodology consists of a Gaussian latent
variable model representation of eigengene K-means
clustering, which can be extended to multiple data modalities.
High dimensionality is accounted for through derivation of a
sparse approximation that penalizes the complete-data log-
likelihood and reduces dimensionality. From this point,
models are selected based upon cluster separability through
calculation of proportion of deviance where “perfect
separability” would yield a proportion of deviance of 0. This
study is strong in its ability to pinpoint “important” genes
through lasso-type regularization, a method that can be
equated to placing a Laplacian prior probability distribution
centered on zero on the parameter vector. Overall, this study
is novel in its approach to integrative clustering, replacing
separate clustering and manual integration with a method for
integrative clustering that incorporates all data types in its
assignment.
Yeoman et al. implemented a multi-omic, systems
biology approach through analysis of rRNA sequencing reads
(454 FLX-titanium) and metabolomics (GC-MS system
consisting of Agilent 7890A gas chromatograph, Agilent
5975C & Agilent 763B) from samples they collected from 36
bacterial vaginosis patients. A Bray-Curtis
dissimilarity matrix was created from genus-level taxonomic
classifications normalized across the dataset of 16S rRNA
genes, and was then subjected to non-metric
multidimensional scaling (nMDS). Analysis of similarities
was used to support the separation found from nMDS. The same
methods were used to analyze the 176 distinct metabolites
found across the 36 samples. Network analysis was
performed by calculating pairwise Pearson’s product
moment correlation coefficients for parametric metadata and
pairwise Spearman’s correlation coefficients
for non-parametric metadata, with the shortest path method used
to calculate distances between variables. Critiques of
this study are that the sample population was very small with
no controls, and that only positive weights were considered
during network analysis.
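As a brief illustration of the dissimilarity measure named above, the following Python sketch computes the standard Bray-Curtis dissimilarity, sum(|u - v|) / sum(u + v), for pairs of abundance vectors. The sample counts are hypothetical and are not data from Yeoman et al.; the function name is chosen here for illustration only.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical genus-level counts for three samples (not data from the study).
samples = {
    "s1": [10, 0, 5],
    "s2": [10, 0, 5],
    "s3": [0, 8, 2],
}
names = sorted(samples)

# Symmetric dissimilarity matrix: 0 means identical composition, 1 means no overlap.
matrix = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

A matrix of this kind is the usual input to ordination methods such as nMDS.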
Daemen et al. implemented kernel-based integration of
genome-wide data with clinical data for analysis of rectal and
prostate cancer. Samples were split into binary groupings
based on three tumor-grading models. Missing gene
expression values were imputed using the k-nearest neighbors
method, and the features with variance in the bottom 50%
were eliminated. A weighted least squares support vector
machine (LS-SVM) was used, in which different weights were
given to positive and negative samples. A Wilcoxon rank sum
test was used for rectal cancer (only ~90 cancer-related
proteins), and multiple univariate test statistics were
integrated to find differential expression among the large
number of prostate cancer proteins. Leave-one-out
cross-validation was used to determine the optimal number of
features as well as the parameters for the support vector
machine. Finally, features were selected by ranking them
according to area under the receiver operating
characteristic curve, with ties won by the features
with the lowest balanced error rate and highest sum of
sensitivity and specificity. LS-SVMs for each data type were
integrated by manually calculating the change in levels over
the time period. The researchers acknowledge that this
multiple time point data collection model is very expensive.
Kernel matrices for each data source are summed and a
weighted LS-SVM is trained on this heterogeneous kernel
matrix to provide a multi-omics integrative approach. A
critique of this method is that the authors assigned equal
weights across studies, which may not produce optimal
results.
Mosca and Milanesi describe a network-based analysis
of breast cancer tumor data from GEO under the ID
GSE25835 using multi-objective optimization. Their
methodology can be divided into three basic steps: defining a
multiple-weighted network containing multi-omic data sets,
identifying significant networks with multi-objective
optimization and calculation of optimization quality
parameters. Analyses of interaction data between cell types
(two tumor types and two epithelial cell types), differential
gene expression and overexpression of basal markers were
combined to identify differentially expressed networks of
protein-protein interactions. P-values were calculated using
the “Parametric Analysis of Gene Enrichment” (PAGE)
method and the log10 of this p-value was taken as the
objective function to indicate statistical significance of
differential gene expression compared to all other genes. This
methodology was extended to ductal carcinomas of the breast
(GEO ID GSE22544), colorectal tumor cells (GEO ID
GSE4107) and pancreatic ductal adenocarcinomas (GEO ID
GSE15471), with optimization problems formulated that
compared differential expression of same networks between
the three tumor types. Drawbacks to this methodology lie in
the potential variability of results due to differences in chosen
objective functions.
Kim et al. integrated gene expression, miRNA and
methylation data from normalized ovarian cancer datasets
downloaded from the TCGA portal for clinical outcome
prediction. This methodology utilized a graph-based
semi-supervised learning classification algorithm. This is an
attractive method due to the sparseness properties of the input
matrix and its inherent visualization. An additional graph is
created to compare the relationships between individual
graphs, with high correlation increasing the prediction
accuracy for the integration of the datasets. A weighted matrix
is created by summing the product of values of the data types
being compared, with a value of 0 representing no
relationship between a given gene and miRNA, for example.
A Gaussian function of Euclidean distance is calculated for
the final weight matrix, with larger weights being assigned to
closer patients. This study is limited by its reliance on prior
knowledge of the interactions between, for example, specific
miRNAs and their target genes. Therefore this model makes it
difficult to discover novel pathways and relationships.
Madhavan et al. integrated multi-omic data collected
from colorectal cancer patients and identified genes, miRNA
and methylation levels correlated with relapse. This study
utilized a t-test to filter data for significance before using a
support vector machine with recursive feature elimination,
followed by leave-one-out cross validation. While the SVM
was a strong methodology for optimization, this study overall
had limited potential to discover novel pathways or
biomarkers due to the manual filtering performed. The authors
removed data that was not previously known to have specific
correlations with colorectal cancer and relapse.
Based on the literature survey and the scope of our data, we
will split our patients into groups based on survival time and
then utilize a t-test to reduce features according to significance
within a five-fold cross validation, and a support
vector machine to classify data and obtain prediction scores.
Unlike some of the studies we reviewed, we will be
analyzing accuracy rather than specificity and sensitivity,
because this will give a better overall indication of the success
of our classification.
III. METHODS
A. Data Acquisition and Pre-Processing
Prior to any prediction modeling, data needed to be
downloaded and linked to all patients in the clinical database.
Figure 1 depicts this process.
Fig. 1
As Figure 1 outlines, there are three databases provided
by TCGA. The first is the clinical database, which includes
patient data such as patient ID, survival time, and cancer
stage/type. Specifically, the patient ID and survival time
post diagnosis were extracted from the clinical database. The
protein and miRNA expression databases included
hyperlinks associated with a patient ID. Patient IDs were linked
from the clinical database to their corresponding modality
data. Once this link was made, the data was downloaded and
stored in a matrix.
Once data was acquired from the TCGA site, patients
that were missing modality data needed to be filtered out. Once
the patients were filtered, they were randomly stratified into
three groups: training 1, training 2, and validation. Table 1 and
Figure 2 outline patient filtration and stratification.
TABLE 1
Total TCGA Patients              171
Patients Missing Protein Data     71
Patients Missing miRNA Data        7
Total Patients Used               93
Fig. 2
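The filtering and stratification steps described above can be sketched in Python as follows. This is a minimal illustration and not the authors' MATLAB code: the patient IDs, the function names, and the 35/35/30 split fractions passed in are assumptions made here, chosen so that the counts agree with Tables 1 and 3.

```python
import random


def filter_complete(clinical_ids, modality_ids):
    """Keep patients that have data in every modality, preserving clinical order."""
    return [p for p in clinical_ids
            if all(p in ids for ids in modality_ids.values())]


def stratified_split(labels, fractions, seed=0):
    """Randomly assign patients to groups while keeping class proportions similar."""
    rng = random.Random(seed)
    names = list(fractions)
    groups = {name: [] for name in names}
    for cls in sorted(set(labels.values())):
        members = sorted(p for p, y in labels.items() if y == cls)
        rng.shuffle(members)
        start = 0
        for k, name in enumerate(names):
            if k == len(names) - 1:
                end = len(members)  # last group takes the remainder
            else:
                end = start + round(fractions[name] * len(members))
            groups[name].extend(members[start:end])
            start = end
    return groups


# Hypothetical patient IDs arranged to reproduce the counts in Table 1.
clinical = [f"P{i:03d}" for i in range(171)]
modalities = {
    "protein": set(clinical[71:]),                # 71 patients lack protein data
    "miRNA": set(clinical[:71] + clinical[78:]),  # 7 further patients lack miRNA data
}
kept = filter_complete(clinical, modalities)

# Hypothetical labels: 64 patients with <1 year survival, 29 with >=1 year.
labels = {p: ("<1y" if i < 64 else ">=1y") for i, p in enumerate(kept)}
split = stratified_split(
    labels, {"training 1": 0.35, "training 2": 0.35, "validation": 0.30})
sizes = {name: len(members) for name, members in split.items()}
print(len(kept), sizes)
```

Splitting each survival class separately is what keeps the ratio of short to long survivors roughly equal across the three groups.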
Furthermore, the rationale for using 1 year as the critical
survival time becomes more obvious from the survival times
acquired for each of the 93 filtered
patients, as seen in Table 2.
TABLE 2
Patient Survival Time    Number of Patients
<1 Year                  64
1 - 2 Years              20
>2 Years                  9
Total                    93
As Table 2 outlines, few patients survived longer than
two years (9 of 93), which led to the decision to use 1 year as
the separator between groups. The final group sizes are shown
in Tables 3 and 4.
TABLE 3
                    Training 1    Training 2    Validation    Total
                    Population    Population    Population
<1 Year Survival        22            22            20          64
>=1 Year Survival       10            10             9          29
Total                   32            32            29          93
TABLE 4
                          Training 1    Training 2    Validation
                          Population    Population    Population
% of Total Reduced
Patient Population            35            35            30
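As a quick arithmetic check on the group sizes, the Python sketch below recomputes each group's share of the 93 patients and the fraction of short-survival patients within each group from the counts in Table 3. The variable names are chosen here for illustration. The exact shares come out near 34.4%, 34.4%, and 31.2%, which Table 4 reports as the nominal 35/35/30 split.

```python
# Counts taken from Table 3.
short_survival = {"Training 1": 22, "Training 2": 22, "Validation": 20}
long_survival = {"Training 1": 10, "Training 2": 10, "Validation": 9}

group_sizes = {g: short_survival[g] + long_survival[g] for g in short_survival}
total = sum(group_sizes.values())

# Share of all patients that falls into each group, in percent.
share_of_total = {g: 100 * n / total for g, n in group_sizes.items()}

# Fraction of <1 year patients inside each group, in percent.
short_fraction = {g: 100 * short_survival[g] / group_sizes[g] for g in group_sizes}

for g in group_sizes:
    print(f"{g}: {share_of_total[g]:.1f}% of patients, "
          f"{short_fraction[g]:.1f}% with <1 year survival")
```

The short-survival fraction is close to 69% in every group, which is what a stratified split is meant to achieve.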
B. Equations
There are two main equations used as part of our study.
The first is the equation of a Support Vector Hyperplane:

$$ f(x) = \sum_{i=1}^{N} \alpha_i \, k(s_i, x) + b \tag{1} $$

where $N$ equals the number of support vectors used to
generate the hyperplane, $s_i$ represents the values associated
with the support vector indices, and $\alpha_i$ represents the
weights of the support vectors; negative values are associated
with the first group, positive values with the second group.
The kernel function $k$ and the bias $b$ complete the standard
form of this equation. In this case, the first group represents
patients, who are support vectors, that survived less than one
year post diagnosis. The second group represents patients, also
support vectors, who survived greater than or equal to one year
post diagnosis. Details about the patients and their grouping
are given in Section III.A.
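To make the hyperplane equation concrete, the following Python sketch evaluates a decision value from a set of support vectors and weights, assuming the standard SVM form f(x) = sum_i alpha_i * k(s_i, x) + b. The function names, the choice of a linear kernel, and the toy numbers are illustrative assumptions and are not taken from the authors' MATLAB implementation.

```python
def linear_kernel(a, b):
    """Dot product of two equal-length vectors (assumed kernel for this sketch)."""
    return sum(x * y for x, y in zip(a, b))


def decision_value(x, support_vectors, alphas, bias, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * k(s_i, x) + b for one sample x."""
    if len(support_vectors) != len(alphas):
        raise ValueError("need one weight per support vector")
    return sum(a * kernel(s, x) for s, a in zip(support_vectors, alphas)) + bias


# Toy values, chosen only to make the arithmetic easy to follow.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
alphas = [-1.0, 2.0]  # signed weights, one per support vector
bias = 0.5

value = decision_value([3.0, 4.0], support_vectors, alphas, bias)
print(value)  # -1*3 + 2*4 + 0.5
```

The sign of the returned value indicates the side of the hyperplane on which the sample falls, and its magnitude grows with distance from the boundary.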
Furthermore, another important equation used describes
accuracy:

$$ \text{Accuracy} = \frac{\text{number of correctly classified patients}}{\text{total number of classified patients}} \tag{2} $$

This equation is one evaluation metric used to determine the
success of the prediction algorithm. Since the goal is to develop
a model that properly categorizes patients into either <1 or >=1
year survival, accuracy was used over specificity and
sensitivity.
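A minimal Python sketch of the accuracy metric is given below, with accuracy taken to be the fraction of patients whose predicted group equals the true group. The example labels are hypothetical; they only reuse the validation group sizes from Table 3 (20 patients with <1 year survival and 9 with >=1 year) to show what a classifier that always predicts the larger group would score.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical validation labels: group 1 is <1 year, group 2 is >=1 year.
true_groups = [1] * 20 + [2] * 9

# A trivial predictor that always answers with the larger group.
always_group_1 = [1] * len(true_groups)

baseline = accuracy(true_groups, always_group_1)
print(round(baseline, 4))
```

With uneven group sizes, a baseline of this kind is a useful reference point when reading accuracy values.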
C. Hypotheses
Our hypotheses are:
1) Multimodal prediction yields higher accuracy than
individual modality prediction
2) Multimodal pancreatic cancer prediction using predicted
decision values from individual modality hyperplane equation
yields higher accuracy than multimodal prediction using
individual-modality predicted group values
D. Prediction Modeling
The methodology used is divided into two sections: 1)
Methodology for multimodal pancreatic cancer prediction
using individual-modality predicted group values 2)
Methodology for multimodal pancreatic cancer prediction
using predicted decision values from individual modality
hyperplane equation.
Figure 3 outlines the methodology used to obtain the
predicted grouping of patients, from the individual data
modalities, used as classifier training for the multimodality
prediction.
Fig. 3
As Figure 3 demonstrates, before any actual predictions
can be made on the training 2 and validation groups, cross
validation is performed on the training 1 data. The purpose of
this is to determine the optimal feature size that produces the
highest potential accuracy, while also predicting the
evaluation accuracies (accuracies of training 2 and validation
group prediction). Figure 4 outlines the entire cross validation
process.
Fig. 4
Training 1 data is randomly stratified into 5 different
folds. The four training folds are then sorted so that the
patients in the <1 year and >=1 year groups are
separated. Since both groups of patients contain the same
number of features, a two-sample t-test was run for each
feature. The resulting p-values for each feature were sorted,
starting from the highest. The five-fold, five-iteration cross
validation was repeated for each feature size from 1 to 100.
Therefore, the top f features were passed to the classifier
trainer, f being the feature size the cross validation was
testing. The test fold was then used to test the trained
classifier. The final result was the cross validation accuracy.
Overall, a 5x100 matrix of cross validation accuracies was
evaluated. The accuracies of the folds were averaged for each
feature size and the maximum average accuracy was found; thus,
the optimal feature size yielding the highest cross validation
accuracy was determined. Following this, the optimal feature
size was used to reduce training 1 data. The reduced training 1
data was used to train the classifier, and the training 2 and
validation data were used to test the classifier. Since the
true labels of training 2 and validation data are obtained from
the clinical database, the accuracies of the Phase I group
predictions can be calculated.
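The cross validation loop above can be sketched in Python using only the standard library. Several assumptions are made here and are not taken from the authors' MATLAB code: features are ranked so that the most significant ones (largest |t|, which for a pooled-variance t-test with fixed group sizes is the same as the smallest p-value) come first; a nearest-centroid rule stands in for the SVM to keep the example dependency-free; and the data are synthetic.

```python
import random
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance form)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    denom = (pooled * (1 / na + 1 / nb)) ** 0.5
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def rank_features(X, y):
    """Feature indices ordered by decreasing |t|, i.e. increasing p-value."""
    def score(j):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        return abs(pooled_t(a, b))
    return sorted(range(len(X[0])), key=score, reverse=True)


def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    return {lab: [mean(col) for col in zip(*[r for r, l in zip(X, y) if l == lab])]
            for lab in set(y)}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda lab: sum((v - c) ** 2 for v, c in zip(row, centroids[lab])))


def stratified_folds(y, k, rng):
    """Deal the shuffled indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for lab in sorted(set(y)):
        idx = [i for i, l in enumerate(y) if l == lab]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def select_feature_count(X, y, max_features, k=5, seed=0):
    """Return (best feature count, mean CV accuracy per feature count)."""
    folds = stratified_folds(y, k, random.Random(seed))
    acc = [[0.0] * max_features for _ in range(k)]  # k x max_features accuracies
    for f, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        order = rank_features(X_train, y_train)  # ranked on training folds only
        for n in range(1, max_features + 1):
            keep = order[:n]
            model = fit_centroids([[r[j] for j in keep] for r in X_train], y_train)
            hits = sum(predict(model, [X[i][j] for j in keep]) == y[i]
                       for i in test_idx)
            acc[f][n - 1] = hits / len(test_idx)
    avg = [mean(acc[f][n] for f in range(k)) for n in range(max_features)]
    best = max(range(max_features), key=lambda n: avg[n]) + 1
    return best, avg


# Synthetic data shaped like training 1 (22 vs 10 patients); feature 0 is informative.
rng = random.Random(1)
y = [0] * 22 + [1] * 10
X = [[8 * lab + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(5)] for lab in y]
best_n, mean_acc = select_feature_count(X, y, max_features=6)
print(best_n, [round(a, 3) for a in mean_acc])
```

Ranking the features inside each training split, rather than once on all of training 1, keeps the held-out fold from influencing which features are selected.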
Fig. 5
Figure 5 outlines the methodology for the Phase II,
multimodal prediction.
Training 2 data is used to train the classifier. It is also
important to highlight that before the classification is made on
the validation data, a five fold cross validation is performed
on the training 2 data.
2) Methodology for multimodal pancreatic cancer prediction
using predicted decision values from individual modality
hyperplane equation.
An overview model of the multimodal prediction can be seen
in Figure 6.
Fig. 6
The methodology for the implementation of the phase I
hyperplane equation is very similar to that of the previous
methodology. Figure 7 outlines
the key difference.
Fig. 7
The main difference between Phase I and Phase II
methodology is the usage of the training 1 hyperplane
equation to calculate decision values to be used in the Phase II
prediction. The overview model and Phase II flow chart can
be reviewed in Figures 5 and 6.
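The difference between the two kinds of Phase II input can be illustrated with a short Python sketch. It assumes that each patient has one Phase I decision value per modality, and that a positive decision value maps to group 1 (<1 year) and any other value to group 2 (>=1 year); the handling of a value of exactly zero, the function names, and the numbers are assumptions made here for illustration.

```python
def to_group(decision_value):
    """Map a decision value to a group label: positive -> 1, otherwise -> 2."""
    return 1 if decision_value > 0 else 2


def phase2_features(mirna_values, protein_values, use_decision_values):
    """Build one (miRNA, protein) feature pair per patient for the Phase II classifier."""
    rows = []
    for m, p in zip(mirna_values, protein_values):
        if use_decision_values:
            rows.append((m, p))                      # methodology 2: continuous inputs
        else:
            rows.append((to_group(m), to_group(p)))  # methodology 1: discrete inputs
    return rows


# Hypothetical Phase I decision values for three patients.
mirna = [0.8, -0.1, 1.7]
protein = [-0.3, -2.0, 0.2]

group_inputs = phase2_features(mirna, protein, use_decision_values=False)
value_inputs = phase2_features(mirna, protein, use_decision_values=True)
print(group_inputs)
print(value_inputs)
```

With predicted groups, every patient lands on one of at most four distinct points, while decision values retain how far each patient lies from each modality's hyperplane.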
IV. RESULTS
Tables 5 and 6 show the values and average accuracies
from running both methodologies a total of three times.
Included as well are the SVM plots for both multimodal
predictions and a graph of external validation vs. cross
validation for the methodology 2 as Figure 7.
TABLE 5
                                       Run 1     Run 2     Run 3     Average
miRNA CV Accuracy                      0.525     0.5       0.525     0.5167
miRNA Training 2 Accuracy              0.3125    0.375     0.4063    0.3646
miRNA Validation Accuracy              0.4828    0.5517    0.5172    0.5172
Protein CV Accuracy                    0.6       0.625     0.675     0.6333
Protein Training 2 Accuracy            0.5938    0.5313    0.6563    0.5938
Protein Validation Accuracy            0.3103    0.4828    0.6207    0.4713
Multiple Modality CV Accuracy          0.5       0.6167    0.6       0.5722
Multiple Modality Validation Accuracy  0.3793    0.6552    0.5172    0.5172
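The Average column of Table 5 is the arithmetic mean of the three runs, which the following Python sketch recomputes from the per-run values. The dictionary keys are shortened row names chosen here for illustration.

```python
# Per-run accuracies copied from Table 5 (Run 1, Run 2, Run 3).
runs = {
    "miRNA CV": [0.525, 0.5, 0.525],
    "miRNA Training 2": [0.3125, 0.375, 0.4063],
    "miRNA Validation": [0.4828, 0.5517, 0.5172],
    "Protein CV": [0.6, 0.625, 0.675],
    "Protein Training 2": [0.5938, 0.5313, 0.6563],
    "Protein Validation": [0.3103, 0.4828, 0.6207],
    "Multiple Modality CV": [0.5, 0.6167, 0.6],
    "Multiple Modality Validation": [0.3793, 0.6552, 0.5172],
}

# Mean over the three runs, rounded to four decimals as in the table.
averages = {name: round(sum(values) / len(values), 4) for name, values in runs.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The run-to-run spread is also visible here: protein validation accuracy, for example, ranges from 0.3103 to 0.6207 across the three runs.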
TABLE 6
                                       Hyperplane        Predicted Group
                                       Decision Value    Decision Value
miRNA CV Accuracy                      0.5167            0.5083
miRNA Training 2 Accuracy              0.3646            0.6563
miRNA Validation Accuracy              0.5172            0.5862
Protein CV Accuracy                    0.6333            0.55
Protein Training 2 Accuracy            0.5938            0.5
Protein Validation Accuracy            0.4713            0.5172
Multiple Modality CV Accuracy          0.5722            0.6556
Multiple Modality Validation Accuracy  0.5172            0.5402
Fig. 7 a, b, c, d (top to bottom)
In Figure 7a (top), the x-axis represents the miRNA
prediction data and the y-axis represents the protein
prediction data from methodology 1. In Figure 7b, the x-axis
represents the miRNA prediction data and the y-axis
represents the protein prediction data from methodology 2.
Comparing Figures 7a and 7b, it is expected that methodology 2
would yield a higher score, since the individual modality data
input into Phase II is continuous in methodology 2 but discrete
in methodology 1.
In Figure 7c, the x-axis is cross validation values and the
y-axis is external validation values. Figure 7d shows the output
from running our MATLAB code for prediction modeling in
addition to the graphs above.
V. CONCLUSION
The results do not support either hypothesis. The
hyperplane decision values did not prove more accurate than the
predicted group values, and multiple modality prediction
was not more accurate overall than individual modality
prediction.
There are several areas for improvement. First, more
modalities could be included (methylation data, genomics,
etc.). Furthermore, algorithm efficiency could be reevaluated.
Improving efficiency would decrease the overall run time,
allowing the usage of larger sets of data. A GUI could be
implemented to improve the ease of use for third-party
testing.
Another area that could be explored in the future is
greater use of clinical data. For example, only survival time
post diagnosis was used for prediction. Other clinical data
such as cancer stage or tumor type could be used for
similar prediction. Furthermore, the miRNA and
protein IDs and expression values used for each prediction
were saved; therefore, if improved accuracies could be
achieved, the biomarkers used for prediction could be studied.
This could lead to the discovery of novel biomarkers.
In addition to accuracy, the area under the curve (AUC) was
determined as well:

$$ \mathrm{AUC} = \frac{1}{N_+ N_-} \sum_{i=1}^{N_+} \sum_{j=1}^{N_-} \left[ I(x_i > y_j) + 0.5 \, I(x_i = y_j) \right] \tag{3} $$

where $x_i$ and $y_j$ are classifier decision values for group 1
(<1 year survival) and group 2 (>= 1 year survival) samples,
respectively. $N_+$ and $N_-$ represent the number of samples in
groups 1 and 2. Samples classified into group 1 should have
positive decision values and samples classified into group 2
should have negative decision values. $I(x)$ evaluates to 1 if x
is true and 0 otherwise. Note that in the case of ties, the
summation is weighted by 0.5. The motivation to calculate
this value, in addition to accuracy, is to check the validity
of the accuracy values. Because results may be skewed by
uneven sample sizes (the <1 year survival group is more than
double the size of the >= 1 year survival group), AUC serves as
another evaluation metric.
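The pairwise definition of AUC described above can be written as a short Python sketch: every group 1 decision value is compared with every group 2 decision value, a correctly ordered pair counts 1, and a tie counts 0.5. The function name and the example decision values are hypothetical.

```python
def auc(group1_values, group2_values):
    """Pairwise AUC: share of (group 1, group 2) pairs ranked correctly, ties count 0.5."""
    if not group1_values or not group2_values:
        raise ValueError("both groups need at least one decision value")
    total = 0.0
    for x in group1_values:
        for y in group2_values:
            if x > y:
                total += 1.0   # group 1 sample scored higher, as expected
            elif x == y:
                total += 0.5   # tie between the two samples
    return total / (len(group1_values) * len(group2_values))


# Hypothetical decision values: group 1 (<1 year) should score higher than group 2.
group1 = [2.0, 1.0]
group2 = [0.0, 1.0]

print(auc(group1, group2))  # three ordered pairs plus one tie, out of four pairs
```

Because each pair contributes equally regardless of how many samples each group has, this measure is less sensitive to uneven group sizes than plain accuracy.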
The low accuracy could stem from the skewed group sizes
mentioned above. Since more data was
available on patients surviving less than one year,
predicting patients who survived over a year after
diagnosis would be difficult. Increasing the sample size, using
patient groups with less skewed survival times, and
running the simulation multiple times could all lead to more
reasonable results.
VI. REFERENCES
[1] "What's New in Pancreatic Cancer Research and
Treatment?" What's New in Pancreatic Cancer
Research and Treatment? American Cancer Society,
11 June 2014. Web.
<http://www.cancer.org/cancer/pancreaticcancer/detailedguide/pancreatic-cancer-new-research>.
[2] Ryan, David P., Theodore S. Hong, and Nabeel
Bardeesy. "Pancreatic Adenocarcinoma." The New
England Journal of Medicine 371.11 (2014):
1039-1049. Web.
[3] Lau et al. "Role of Pancreatic Cancer-derived
Exosomes in Salivary Biomarker Development."
Journal of Biological Chemistry 288.37 (2013):
26888-26897. Web.
[4] Chouchane, Lotfi, Ravinder Mamtani, Ashraf Dallol,
and Javaid I. Sheikh. "Personalized Medicine: A
Patient - Centered Paradigm." Journal of
Translational Medicine 9.1 (2011): 206. Web.
[5] Shen, R., A. B. Olshen, and M. Ladanyi. "Integrative
Clustering of Multiple Genomic Data Types Using a
Joint Latent Variable Model with Application to
Breast and Lung Cancer Subtype Analysis."
Bioinformatics 26.2 (2010): 292-93. Web.
[6] Yeoman et al. "A Multi-Omic Systems-Based
Approach Reveals Metabolic Markers of Bacterial
Vaginosis and Insight into the Disease." Ed. Adam J.
Ratner. PLoS ONE 8.2 (2013): E56111. Web.
[7] Daemen et al. "A Kernel-based Integration of
Genome-wide Data for Clinical Decision Support."
Genome Medicine 1.4 (2009): 39. Web.
[8] Mosca, Ettore, and Luciano Milanesi. "Network-
based Analysis of Omics with Multi-objective
Optimization." Molecular BioSystems 9.12 (2013):
2971. Web.
[9] Kim et al. "Incorporating Inter-relationships between
Different Levels of Genomic Data into Cancer
Clinical Outcome Prediction." Systems Biology with
Omics Data 67.3 (2014): 344-53. Web.
[10] Madhavan et al. "Genome-wide Multi-omics
Profiling of Colorectal Cancer Identifies Immune
Determinants Strongly Associated with Relapse."
Frontiers in Genetics 4 (2013): n. pag. Web.

Rig compressor-water-well-domesticJASON KEMBOI
 
SocialMedia_Hough
SocialMedia_Hough SocialMedia_Hough
SocialMedia_Hough Rachel Hough
 
power point
power pointpower point
power pointdela1311
 
Kristys presentation
Kristys presentationKristys presentation
Kristys presentationkristym111
 
Assure Method
Assure Method Assure Method
Assure Method Vernie13
 
SMKN 49 JAKARTA UTARA
SMKN 49 JAKARTA UTARASMKN 49 JAKARTA UTARA
SMKN 49 JAKARTA UTARAriyahchoyah
 
Eating in the Middle IFBC presentation
Eating in the Middle IFBC presentationEating in the Middle IFBC presentation
Eating in the Middle IFBC presentationSheri Wetherell
 
Could IOT Provide Gun Control?
Could IOT Provide Gun Control?Could IOT Provide Gun Control?
Could IOT Provide Gun Control?Bofan
 
lgbt center newsletter
lgbt center newsletterlgbt center newsletter
lgbt center newsletterVictor Ortiz
 
MMSS Senior Thesis 2000
MMSS Senior Thesis 2000MMSS Senior Thesis 2000
MMSS Senior Thesis 2000Eric Morel
 
Permanent hosting at Joomla.com
Permanent hosting at Joomla.comPermanent hosting at Joomla.com
Permanent hosting at Joomla.comDouglasPickett
 
28 sept to 4 oct 2015
28 sept  to 4 oct 201528 sept  to 4 oct 2015
28 sept to 4 oct 2015snehalcnp
 
Radio PSA (Advanced PR)
Radio PSA (Advanced PR)Radio PSA (Advanced PR)
Radio PSA (Advanced PR)Victor Ortiz
 
INTRO SLIDE
INTRO SLIDEINTRO SLIDE
INTRO SLIDEkraucpa
 

En vedette (20)

SDC11 G1 class 1 Sep 22nd
SDC11 G1 class 1 Sep 22ndSDC11 G1 class 1 Sep 22nd
SDC11 G1 class 1 Sep 22nd
 
Rig compressor-water-well-domestic
Rig compressor-water-well-domesticRig compressor-water-well-domestic
Rig compressor-water-well-domestic
 
SocialMedia_Hough
SocialMedia_Hough SocialMedia_Hough
SocialMedia_Hough
 
power point
power pointpower point
power point
 
Kristys presentation
Kristys presentationKristys presentation
Kristys presentation
 
Assure Method
Assure Method Assure Method
Assure Method
 
BA certificate
BA certificateBA certificate
BA certificate
 
Fredlaw - SURGE: Intellecual Property 101
Fredlaw - SURGE:  Intellecual Property 101Fredlaw - SURGE:  Intellecual Property 101
Fredlaw - SURGE: Intellecual Property 101
 
SMKN 49 JAKARTA UTARA
SMKN 49 JAKARTA UTARASMKN 49 JAKARTA UTARA
SMKN 49 JAKARTA UTARA
 
Eating in the Middle IFBC presentation
Eating in the Middle IFBC presentationEating in the Middle IFBC presentation
Eating in the Middle IFBC presentation
 
Could IOT Provide Gun Control?
Could IOT Provide Gun Control?Could IOT Provide Gun Control?
Could IOT Provide Gun Control?
 
lgbt center newsletter
lgbt center newsletterlgbt center newsletter
lgbt center newsletter
 
INTERNATIONAL CV-WDS
INTERNATIONAL CV-WDSINTERNATIONAL CV-WDS
INTERNATIONAL CV-WDS
 
MMSS Senior Thesis 2000
MMSS Senior Thesis 2000MMSS Senior Thesis 2000
MMSS Senior Thesis 2000
 
Permanent hosting at Joomla.com
Permanent hosting at Joomla.comPermanent hosting at Joomla.com
Permanent hosting at Joomla.com
 
28 sept to 4 oct 2015
28 sept  to 4 oct 201528 sept  to 4 oct 2015
28 sept to 4 oct 2015
 
numeros en ingles
numeros en inglesnumeros en ingles
numeros en ingles
 
Radio PSA (Advanced PR)
Radio PSA (Advanced PR)Radio PSA (Advanced PR)
Radio PSA (Advanced PR)
 
Cv
CvCv
Cv
 
INTRO SLIDE
INTRO SLIDEINTRO SLIDE
INTRO SLIDE
 

Similaire à PancreaticCancerFinalPaper

Breast N C C Nguidlinesms1
Breast N C C Nguidlinesms1Breast N C C Nguidlinesms1
Breast N C C Nguidlinesms1guest108e832
 
A comprehensive guide to the liver cancer
A comprehensive guide to the liver cancerA comprehensive guide to the liver cancer
A comprehensive guide to the liver cancerEchoHan4
 
Precision Medicine in Oncology Informatics
Precision Medicine in Oncology InformaticsPrecision Medicine in Oncology Informatics
Precision Medicine in Oncology InformaticsWarren Kibbe
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...daranisaha
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...AnonIshanvi
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...JohnJulie1
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...EditorSara
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...EditorSara
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...semualkaira
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...semualkaira
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...NainaAnon
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...semualkaira
 
Introduction to cancer bioinformatics
Introduction to cancer bioinformaticsIntroduction to cancer bioinformatics
Introduction to cancer bioinformaticscreativebiolabs11
 
Liquid Biopsy Cancer Prevention and Monitoring
Liquid Biopsy Cancer Prevention and MonitoringLiquid Biopsy Cancer Prevention and Monitoring
Liquid Biopsy Cancer Prevention and MonitoringDavid Tjahjono,MD,MBA(UK)
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screeningdaranisaha
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screeningsemualkaira
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screeningsemualkaira
 

Similaire à PancreaticCancerFinalPaper (20)

Breast N C C Nguidlinesms1
Breast N C C Nguidlinesms1Breast N C C Nguidlinesms1
Breast N C C Nguidlinesms1
 
A comprehensive guide to the liver cancer
A comprehensive guide to the liver cancerA comprehensive guide to the liver cancer
A comprehensive guide to the liver cancer
 
2034 5713-1-pb
2034 5713-1-pb2034 5713-1-pb
2034 5713-1-pb
 
Precision Medicine in Oncology Informatics
Precision Medicine in Oncology InformaticsPrecision Medicine in Oncology Informatics
Precision Medicine in Oncology Informatics
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
Circulating Tumor Cells and Cell-Free Nucleic Acids as Predictor Factors for ...
 
Introduction to cancer bioinformatics
Introduction to cancer bioinformaticsIntroduction to cancer bioinformatics
Introduction to cancer bioinformatics
 
Biotech2012spring 1-overview 0
Biotech2012spring 1-overview 0Biotech2012spring 1-overview 0
Biotech2012spring 1-overview 0
 
Seminar mol biol_1_spring_2013
Seminar mol biol_1_spring_2013Seminar mol biol_1_spring_2013
Seminar mol biol_1_spring_2013
 
Liquid Biopsy Cancer Prevention and Monitoring
Liquid Biopsy Cancer Prevention and MonitoringLiquid Biopsy Cancer Prevention and Monitoring
Liquid Biopsy Cancer Prevention and Monitoring
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screening
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screening
 
Principles of Cancer Screening
Principles of Cancer ScreeningPrinciples of Cancer Screening
Principles of Cancer Screening
 

PancreaticCancerFinalPaper

  • 1. Abstract – Pancreatic cancer is associated with an incredibly high mortality rate, as over 80% of patients are initially diagnosed after the cancer has metastasized. This type of cancer is often asymptomatic when still localized to the pancreas, and a lack of understanding of specific biomarkers and tumor precursors continues to hinder early detection. This paper describes methods for integration of multi-omic data into a prediction model for the classification of pancreatic cancer patients. The goal of this study is to uncover potentially novel genomic pathways and relationships between miRNA and protein data, testing our hypotheses that multi-modal data integration can provide better classification than analyzing data from single modalities, and striving towards identification of biomarkers to advance early detection, genomic profiling, and even targeted therapy for pancreatic cancer patients.

Keywords – pancreatic cancer; TCGA; bioinformatics; biomarkers; multi-omic; multimodal; SVM; leave-one-out cross validation

I. INTRODUCTION

A. Background and Motivation

Pancreatic cancer (PC) is the twelfth most common cancer and the seventh most common cause of death from cancer in the world. With nearly 350,000 new cases worldwide each year, the most recent studies estimate that in 2015, 48,960 people will be diagnosed in the U.S. alone. PC has the highest overall mortality rate, with 94% of all diagnosed patients deceased within five years of their diagnoses. Nearly 99% of all PC cases originate from exocrine cells, with about 85% of all PC cases belonging to a group known as pancreatic adenocarcinoma. Despite the relative homogeneity of PC diagnoses, effective early detection has yet to be achieved. Poor prognosis of PC can be attributed mainly to the majority of patients being diagnosed at an advanced stage when the cancer is resistant to treatment and may have already metastasized [1].
There are several reasons why early detection is difficult for this population. PC patients are often asymptomatic until the cancer has already spread. Additionally, routine physical exams cannot be used to detect PC because the tumors are not visible or easily palpated, as can be the case with cancers of the skin, breast, or colon. Therefore, the first step of early detection consists of identifying factors that may predispose different people to PC. The ability to integrate multiple modalities of patient data is necessary to advance our understanding of PC precursors and enhance early detection methods.

B. Diagnosis and Treatment

Methods of diagnosis currently in use include imaging tests, biofluid analysis, and tissue biopsy. Blood tests are often used to evaluate organ function, notably liver function for patients with jaundice, which is one of the first noticeable signs of pancreatic cancer. Biofluid testing can facilitate the identification of proteins that act as tumor markers, and preferably even precursors to these conditions. Advanced exocrine pancreatic cancer may result in elevated levels of tumor markers such as CA 19-9 and CEA in the bloodstream, but this is not always reliable. Similarly, the levels of several hormones in the blood can be measured for neuroendocrine PC. Detection and measurement of these markers may be more useful in evaluating the effectiveness of treatment for patients already known to have pancreatic cancer [2]. The availability of high-throughput omics has enabled identification of PC biomarkers not only in the blood, but also in urine and even saliva. Lau et al. describe a method of identifying several salivary transcriptomic biomarkers of pancreatic cancer via RNA extraction of murine saliva [3].
As omics technologies continue to advance, further studies such as this one will expand our understanding of what we measure among the body's signals and how we measure it, and will ultimately contribute to advances in early detection and diagnosis.

Integration of Multi-Modal -Omic Data for Prediction of Pancreatic Cancer Survival
Vikram Babu, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA
Jacob Upperco, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA

  • 2. Biofluid testing, fueled by bioinformatics, holds promise for identifying the genes, RNA, proteins, lipids, carbohydrates and metabolites that may act as precursors to pancreatic cancers.

Several imaging tests currently available generally rely on a contrast agent/dye to allow identification of strictures or abnormal masses. These include different forms of tomography, MRI, ultrasound, cholangiopancreatography, scintigraphy, and angiography. Somatostatin receptor scintigraphy (SRS) is an example of an imaging test that highlights the potential of omics research. This technique consists of the injection of a hormone-like substance, called octreotide, bound to a radioactive substance for visualization. Octreotide attaches directly to specific proteins on the tumor cells of many neuroendocrine cancers [1]. While this is only effective for a tiny portion of overall PC cases, it acts as a precedent for diagnosis methods that may be specific to distinct PC subtypes. Further analysis of gene and protein expression of different tumor types is required for more advanced tests.

Tissue biopsies are generally considered the only definitive test for identifying pancreatic cancer in an individual. Biopsies rely on imaging procedures to locate possible tumors, so endoscopic imaging techniques are advantageous because a tissue sample can be gathered during the same procedure. Complete tissue resection only has potentially curative effects when a cancer is still confined to its original tissue. In metastatic cases, systemic treatments are sought, but every cancer subtype reacts differently to different medications. In conjunction with omics data, biopsies can be utilized to help identify specific cancer subtypes in resected tumors, facilitating optimal treatment regimes for different patients. Many different treatments are currently in use, and they may be chosen depending on the subtype and stage of the pancreatic cancer.
Early-stage cancers may be treated through surgery and removal if still localized, although more than 80% of pancreatic cancers have metastasized by the time of diagnosis [2]. Until early detection is made feasible through biomarker and precursor identification, physical removal or destruction of tumors will continue to be uncommon, and we must rely on integrative bioinformatics to enhance accuracy in cancer subtyping and guide the choice of the most effective treatment option. Radiation therapy is utilized more often for exocrine PCs than for neuroendocrine PCs. Chemotherapy is the use of anti-cancer drugs to destroy tumors that have spread to other parts of the body. Targeted therapy is a more recent development, in which new drugs are developed that attack specific targets in cancer cells, alongside therapies that can boost a patient's immune system. This represents another challenge that must be answered through analysis and integration of bioinformatics. Personalized therapies along these lines are only possible through identification of patient group/subtype-specific mutations. This highlights an important route for future research, and one of the most promising directions for the application of analysis and integration of multi-omics data.

Genetic predispositions, as discussed above, represent one method of personalized medicine in our increasing ability to predict patient risks for certain diseases. This ties in with increasing our understanding of and capabilities for patient-specific therapy as well. Trastuzumab, for example, is considered a very effective drug in breast cancer treatment, targeting human epidermal growth factor receptor 2 (HER2). This drug is only beneficial for the 10-20% of breast cancer patients with amplification of this receptor, though, and so different treatment regimes must be selected for different patient groups [4].
The true challenges for these informatics approaches lie in making sense of the massive amounts of data we collect from patients and laboratory studies. These data are collected using different modalities and sources, each with distinct inherent velocities. Data in the clinical space may consist of hand-written notes taken by doctors and translated by nurses into electronic format. It is easy to understand how incorrect and missing data may be produced in such formats, but these challenges also present themselves when analyzing patient groups in which patient data vary widely due to differences in the tests each patient received. Different methods for proteomic, genomic, transcriptomic and even imaging data, while beneficial in allowing us to collect information, need to complement each other and not be analyzed solely in parallel. This last point highlights a serious challenge in the integration of all these data towards identification of exploitable targets in different cancer subtypes.

II. LITERATURE SURVEY

Shen et al. analyzed DNA copy number and mRNA expression from two sources: breast cancer cell lines obtained from the American Type Culture Collection and lung adenocarcinomas from Memorial Sloan-Kettering Cancer Center. The methodology consists of a Gaussian latent variable model representation of eigengene K-means clustering, which can be extended to multiple data modalities. High dimensionality is accounted for through derivation of a sparse approximation that penalizes the complete-data log-likelihood and reduces dimensionality. From this point, models are selected based upon cluster separability through calculation of the proportion of deviance, where "perfect separability" would yield a proportion of deviance of 0. This study is strong in its ability to pinpoint "important" genes through lasso-type regularization, a method that can be equated to placing a Laplacian prior probability distribution centered on zero on the parameter vector.
  • 3. Overall, this study is novel in its approach to integrative clustering, replacing separate clustering and manual integration with a method for integrative clustering that incorporates all data types in its assignment.

Yeoman et al. implemented a multi-omic, systems biology approach through analysis of rRNA sequencing reads (454 FLX-titanium) and metabolomics (GC-MS system consisting of Agilent 7890A, gas chromatograph, Agilent 5975C & Agilent 763B) on samples they collected from 36 bacterial vaginosis patients. A Bray-Curtis dissimilarity matrix was created from genus-level taxonomic classifications normalized across the dataset of 16S rRNA genes, which was then subjected to non-metric multidimensional scaling (nMDS). Analysis of similarities was used to support the separation found from nMDS. The same methods were used to analyze the 176 distinct metabolites found across the 36 samples. Network analysis was performed through calculation of pairwise Pearson's product moment correlation coefficients for parametric metadata and pairwise Spearman's correlation coefficients for non-parametric metadata, with the shortest path method used to calculate distances between variables. Some critiques of this study are that the sample population was very small with no controls, and that only positive weights were considered during network analysis.

Daemen et al. implemented kernel-based integration of genome-wide data with clinical data for analysis of rectal and prostate cancer. Samples were split into binary groupings based on three tumor-grading models. Missing gene expression values were imputed using the k-nearest neighbors method, and the features with variance in the bottom 50% were eliminated. A weighted least squares support vector machine (LS-SVM) was used, in which different weights were given to positive and negative samples.
The Wilcoxon rank sum test was used for rectal cancer (only ~90 cancer-related proteins), and multiple univariate test statistics were integrated to find differential expression among the large number of prostate cancer proteins. Leave-one-out cross-validation was used to determine the optimal number of features as well as the parameters for the support vector machine. Finally, features were selected according to their ranking by area under the receiver operating characteristic curve, with ties won by the features with the lowest balanced error rate and the highest sum of sensitivity and specificity. LS-SVMs for each data type were integrated by manually calculating the change in levels over the time period. The researchers acknowledge that this multiple time point data collection model is very expensive. Kernel matrices for each data source are summed, and a weighted LS-SVM is trained on this heterogeneous kernel matrix to provide a multi-omics integrative approach. A critique of this method is that the authors assigned equal weights across studies, which will not produce optimal results.

Mosca and Milanesi describe a network-based analysis of breast cancer tumor data from GEO under the ID GSE25835 using multi-objective optimization. Their methodology can be divided into three basic steps: defining a multiple-weighted network containing multi-omic data sets, identifying significant networks with multi-objective optimization, and calculating optimization quality parameters. Analyses of interaction data between cell types (two tumor types and two epithelial cell types), differential gene expression and overexpression of basal markers were combined to identify differentially expressed networks of protein-protein interactions. P-values were calculated using the "Parametric Analysis of Gene Enrichment" (PAGE) method, and the log10 of this p-value was taken as the objective function to indicate statistical significance of differential gene expression compared to all other genes.
This methodology was extended to ductal carcinomas of the breast (GEO ID GSE22544), colorectal tumor cells (GEO ID GSE4107) and pancreatic ductal adenocarcinomas (GEO ID GSE15471), with optimization problems formulated to compare differential expression of the same networks between the three tumor types. Drawbacks to this methodology lie in the potential variability of results due to differences in the chosen objective functions.

Kim et al. integrated gene expression, miRNA and methylation data from normalized ovarian cancer datasets downloaded from the TCGA portal for clinical outcome prediction. This methodology utilized a graph-based semi-supervised learning classification algorithm. This is an attractive method due to the sparseness properties of the input matrix and its inherent visualization. An additional graph is created to compare the relationships between individual graphs, with high correlation increasing the prediction accuracy for the integration of the datasets. A weighted matrix is created by summing the product of values of the data types being compared, with a value of 0 representing no relationship between, for example, a given gene and miRNA. A Gaussian function of Euclidean distance is calculated for the final weight matrix, with larger weights assigned to closer patients. This study is limited by prior knowledge of the interactions between, for example, specific miRNAs and their target genes. Therefore this model makes it difficult to discover novel pathways and relationships.

Madhavan et al. integrated multi-omic data collected from colorectal cancer patients and identified genes, miRNA and methylation levels correlated with relapse. This study utilized a t-test to filter data for significance before using a support vector machine with recursive feature elimination, followed by leave-one-out cross validation.
While the SVM was a strong methodology for optimization, this study overall had limited potential to discover novel pathways or biomarkers due to the manual filtering performed. The authors removed data that were not previously known to have specific correlations with colorectal cancer and relapse.
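Several of the surveyed studies (for example, Madhavan et al.) pair t-test filtering with leave-one-out cross validation. The sketch below illustrates that combination in plain Python, with the feature ranking repeated inside each fold so the held-out sample never influences selection. It is an illustration only, not any author's implementation: a nearest-centroid rule stands in for the support vector machine, Welch's form of the t statistic is assumed, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(X, y, k):
    """Rank features by |t| between the two classes and keep the top k."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(g0, g1)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X_train, y_train, x, feats):
    """Stand-in classifier (NOT an SVM): pick the class with the closer centroid."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[j] for r in rows) for j in feats]
        dist = sum((x[j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out CV; feature selection is redone inside every fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_train, y_train, k)
        correct += nearest_centroid_predict(X_train, y_train, X[i], feats) == y[i]
    return correct / len(X)


def make_toy_data(n=20, n_features=20):
    """Hypothetical deterministic data: only feature 0 separates the classes."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [10.0 * label + (i % 3) * 0.1]
        row += [((i * (j + 3)) % 7) * 0.5 for j in range(1, n_features)]
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("selected feature:", top_features(X, y, 1))
    print("LOOCV accuracy:", loocv_accuracy(X, y, 2))
```

Selecting features on the full dataset before cross validation would leak information from the held-out sample, which is why the ranking sits inside the loop.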
  • 4. Based on the literature survey and the scope of our data, we will split our patients into groups based on survival time, use a t-test to reduce features according to significance within a five-fold leave-one-out cross validation, and use a support vector machine to classify the data and obtain prediction scores. Unlike some of the studies we reviewed, we will analyze accuracy rather than specificity and sensitivity, because this will give a better overall indication of the success of our classification.

III. METHODS

A. Data Acquisition and Pre-Processing

Prior to any prediction modeling, data needed to be downloaded and linked to all patients in the clinical database. Figure 1 depicts this process.

Fig. 1

As Figure 1 outlines, there are three databases provided by TCGA. The first is the clinical database, which includes patient data such as patient ID, survival time, and cancer stage/type. Specifically, the patient ID and survival time post diagnosis were extracted from the clinical database. The protein and miRNA expression databases included various hyperlinks, each linked with a patient ID. Patient IDs were linked from the clinical database to their corresponding modality data. Once this link was made, the data were downloaded and stored in a matrix.

Once data were acquired from the TCGA site, patients who were missing modality data needed to be filtered out. The remaining patients were then randomly stratified into three groups: training 1, training 2, and validation. Table 1 and Figure 2 outline patient filtration and stratification.

TABLE 1
Total TCGA Patients              171
Patients Missing Protein Data     71
Patients Missing miRNA Data        7
Total Patients Used               93

Fig. 2

Furthermore, the rationale for using 1 year as the critical survival time becomes clearer once the survival times for each of the 93 filtered patients are examined, as seen in Table 2.

TABLE 2
Patient Survival Time    Number of Patients
<1 Year                  64
1 - 2 Years              20
>2 Years                  9
Total                    93

As Table 2 outlines, few patients survived longer than two years (9 of 93), which led to the decision to use 1 year as the separator between groups. The final group sizes are shown in Tables 3 and 4.

TABLE 3
                     Training 1    Training 2    Validation
                     Population    Population    Population    Total
<1 Year Survival     22            22            20            64
>=1 Year Survival    10            10             9            29
Total                32            32            29            93

TABLE 4
                                         Training 1    Training 2    Validation
                                         Population    Population    Population
% of Total Reduced Patient Population    35            35            30
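The stratification into the three groups can be illustrated with a short sketch. This is not the authors' code (the paper reports using MATLAB); the function name `stratified_split`, the fixed seed, and the rounding rule are assumptions chosen for illustration. It applies the 35/35/30 proportions of Table 4 within each survival group and, with the group sizes of Table 2, gives the same counts as Table 3.

```python
import random


def stratified_split(labels, fractions=(0.35, 0.35), seed=0):
    """Split patient indices into training 1, training 2, and validation.

    labels:    one survival-group label per patient.
    fractions: share of each group sent to training 1 and training 2;
               whatever remains goes to validation (assumed rounding rule).
    """
    rng = random.Random(seed)
    splits = {"training1": [], "training2": [], "validation": []}
    for group in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        rng.shuffle(idx)  # random assignment within the survival group
        n1 = round(fractions[0] * len(idx))
        n2 = round(fractions[1] * len(idx))
        splits["training1"] += idx[:n1]
        splits["training2"] += idx[n1:n1 + n2]
        splits["validation"] += idx[n1 + n2:]
    return splits


# Group sizes from Table 2: 64 patients with <1 year survival, 29 with >=1 year.
labels = ["<1"] * 64 + [">=1"] * 29
splits = stratified_split(labels)
sizes = {name: len(idx) for name, idx in splits.items()}
short_survivors = {name: sum(labels[i] == "<1" for i in idx)
                   for name, idx in splits.items()}
```

Splitting each survival group separately keeps the ratio of short to long survivors roughly equal across the three populations, which a single shuffle of all 93 patients would not guarantee.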
B. Equations
There are two main equations used in our study. The first is the equation of a support vector hyperplane:

$$ f(x) = \sum_{i=1}^{N} \alpha_i \, k(s_i, x) + b \tag{1} $$

where N is the number of support vectors used to generate the hyperplane, s_i represents the values associated with the support vector indices, and \alpha_i represents the weights of the support vectors; negative values are associated with the first group and positive values with the second group. k is the kernel function and b is the bias term. In this case, the first group consists of the support-vector patients who survived less than one year post diagnosis, and the second group consists of the support-vector patients who survived one year or more post diagnosis. More details about the patients and their grouping are given in the “Methods” section.
The second equation describes accuracy:

$$ \text{Accuracy} = \frac{\text{number of correctly classified patients}}{\text{total number of patients classified}} \tag{2} $$

This equation is one evaluation metric used to determine the success of the prediction algorithm. Since the goal is to develop a model that properly categorizes patients into either <1 or >=1 year survival, accuracy was used over specificity and sensitivity.
C. Hypotheses
Our hypotheses are:
1) Multimodal prediction yields higher accuracy than individual-modality prediction.
2) Multimodal pancreatic cancer prediction using predicted decision values from the individual-modality hyperplane equation yields higher accuracy than multimodal prediction using individual-modality predicted group values.
D. Prediction Modeling
The methodology used is divided into two sections:
1) Methodology for multimodal pancreatic cancer prediction using individual-modality predicted group values
2) Methodology for multimodal pancreatic cancer prediction using predicted decision values from the individual-modality hyperplane equation
Figure 3 outlines the methodology used to obtain the predicted grouping of patients from the individual data modalities, which is used as classifier training for the multimodal prediction.
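The two methodologies differ in what Phase I passes on: a predicted group (a discrete label) or a decision value (a continuous score from the hyperplane equation). The sketch below illustrates that distinction together with the accuracy metric of equation (2). It is a minimal illustration, not the authors' MATLAB implementation: the linear kernel, the sign convention in `predicted_group`, and all numbers are assumptions made up for the example.

```python
def linear_kernel(s, x):
    """Dot product, used here as an assumed kernel."""
    return sum(si * xi for si, xi in zip(s, x))


def decision_value(x, support_vectors, weights, bias, kernel=linear_kernel):
    """Equation (1): weighted sum of kernel values over support vectors, plus bias."""
    return sum(w * kernel(s, x) for w, s in zip(weights, support_vectors)) + bias


def predicted_group(value):
    """Assumed sign convention: positive value -> group 1, otherwise group 2."""
    return 1 if value > 0 else 2


def accuracy(predicted, actual):
    """Equation (2): fraction of patients assigned to their true group."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical two-feature hyperplane and four made-up patients.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
weights = [1.0, -1.0]
bias = 0.0
patients = [(2.0, 0.5), (0.0, 3.0), (1.0, 2.0), (3.0, 1.0)]
true_groups = [1, 2, 1, 1]

values = [decision_value(p, support_vectors, weights, bias) for p in patients]
groups = [predicted_group(v) for v in values]
acc = accuracy(groups, true_groups)
```

Here `values` keeps how far each patient lies from the hyperplane, while `groups` keeps only the side; methodology 2 feeds the former into Phase II and methodology 1 the latter.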
Fig. 3
As Figure 3 demonstrates, before any predictions can be made on the training 2 and validation groups, cross validation is performed on the training 1 data. The purpose of this is to determine the optimal feature size that produces the highest potential accuracy, while also estimating the evaluation accuracies (the accuracies of the training 2 and validation group predictions). Figure 4 outlines the entire cross validation process.
Fig. 4
Training 1 data is randomly stratified into 5 different folds. The four training folds are then sorted so that the patients in the <1 year and >=1 year groups are separated. Since both groups of patients contain the same number of features, a two-sample t-test was run for each feature. The resulting p-values were sorted by significance, starting from the most significant (smallest p-value). The five-fold, five-iteration cross validation was repeated for each feature size from 1 to 100: for each feature size f, the top f features were used to train the classifier, and the test fold was then used to test the trained classifier. The result of each test was a cross validation accuracy. Overall, a 5x100 matrix of cross validation accuracies was evaluated. The accuracies were averaged across the folds and the maximum average accuracy was found; thus, the optimal feature size yielding the highest cross validation accuracy was determined.
Following this, the optimal feature size was used to reduce the training 1 data. The reduced training 1 data was used to train the classifier, and the training 2 and validation data were used to test it. Since the true labels of the training 2 and validation data are obtained from the clinical database, the accuracies of the Phase I group predictions can be calculated.
Fig. 5
Figure 5 outlines the methodology for the Phase II multimodal prediction. Training 2 data is used to train the classifier. It is also important to highlight that, before the classification is made on the validation data, a five-fold cross validation is performed on the training 2 data.
2) Methodology for multimodal pancreatic cancer prediction using predicted decision values from the individual-modality hyperplane equation
An overview model of the multimodal prediction can be seen in Figure 6.
Fig. 6
The methodology for the implementation of the Phase I hyperplane equation is very similar to that of the previous methodology. A graphical description, given as Figure 7, outlines the key difference.
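The Phase I cross validation that picks the feature size can be sketched as follows. This is an illustration under stated assumptions rather than the authors' MATLAB code: a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library, features are ranked by the absolute pooled t statistic (with a pooled test every feature has the same degrees of freedom, so this gives the same order as sorting p-values from smallest to largest), and the toy data, function names, and seed are made up.

```python
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled two-sample t statistic for one feature."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / (pooled * (1 / na + 1 / nb)) ** 0.5


def rank_features(X, y):
    """Feature indices ordered from most to least significant."""
    scores = []
    for j in range(len(X[0])):
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        g2 = [row[j] for row, lab in zip(X, y) if lab == 2]
        scores.append(abs(t_statistic(g1, g2)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def centroid_classifier(X, y, features):
    """Stand-in classifier (not an SVM): assign to the nearest group centroid."""
    centroids = {}
    for lab in (1, 2):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [mean(r[j] for r in rows) for j in features]

    def predict(row):
        dist = {lab: sum((row[j] - c) ** 2 for j, c in zip(features, cen))
                for lab, cen in centroids.items()}
        return min(dist, key=dist.get)

    return predict


def select_feature_size(X, y, sizes, n_folds=5, seed=0):
    """Return (best feature size, mean cross validation accuracy per size)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for lab in (1, 2):  # stratify: deal each group evenly across the folds
        idx = [i for i, l in enumerate(y) if l == lab]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    acc = {f: [] for f in sizes}
    for test in folds:
        train = [i for i in range(len(y)) if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        ranked = rank_features(X_tr, y_tr)  # ranking uses training folds only
        for f in sizes:
            predict = centroid_classifier(X_tr, y_tr, ranked[:f])
            hits = sum(predict(X[i]) == y[i] for i in test)
            acc[f].append(hits / len(test))
    mean_acc = {f: mean(a) for f, a in acc.items()}
    best = max(sizes, key=lambda f: mean_acc[f])  # ties go to the smallest size
    return best, mean_acc


# Toy data: feature 0 separates the groups; features 1 and 2 are uninformative.
X, y = [], []
for lab, base in ((1, 0.0), (2, 10.0)):
    for k in range(10):
        X.append([base + 0.01 * k, k % 2, k % 3])
        y.append(lab)

best, mean_acc = select_feature_size(X, y, sizes=[1, 2, 3])
```

Ranking the features inside each fold, using only the training folds, keeps the held-out fold from influencing which features are selected.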
Fig. 7
The main difference between the Phase I and Phase II methodology is the use of the training 1 hyperplane equation to calculate decision values to be used in the Phase II prediction. The overview model and the Phase II flow chart can be reviewed as Figures 5 and 6.
IV. RESULTS
Tables 5 and 6 show the values and average accuracies from running both methodologies a total of three times. Also included are the SVM plots for both multimodal predictions and a graph of external validation vs. cross validation for methodology 2, as Figure 7.

TABLE 5
                                         Run 1     Run 2     Run 3     Average
miRNA CV Accuracy                        0.525     0.5       0.525     0.5167
miRNA Training 2 Accuracy                0.3125    0.375     0.4063    0.3646
miRNA Validation Accuracy                0.4828    0.5517    0.5172    0.5172
Protein CV Accuracy                      0.6       0.625     0.675     0.6333
Protein Training 2 Accuracy              0.5938    0.5313    0.6563    0.5938
Protein Validation Accuracy              0.3103    0.4828    0.6207    0.4713
Multiple Modality CV Accuracy            0.5       0.6167    0.6       0.5722
Multiple Modality Validation Accuracy    0.3793    0.6552    0.5172    0.5172

TABLE 6
                                         Hyperplane        Predicted Group
                                         Decision Value    Decision Value
miRNA CV Accuracy                        0.5167            0.5083
miRNA Training 2 Accuracy                0.3646            0.6563
miRNA Validation Accuracy                0.5172            0.5862
Protein CV Accuracy                      0.6333            0.55
Protein Training 2 Accuracy              0.5938            0.5
Protein Validation Accuracy              0.4713            0.5172
Multiple Modality CV Accuracy            0.5722            0.6556
Multiple Modality Validation Accuracy    0.5172            0.5402

Fig. 7 a, b, c, d (top to bottom)
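As a quick arithmetic check, the Average column of Table 5 can be recomputed from the three runs, and a validation accuracy can be converted back into a count of correctly classified patients using the validation population of 29 from Table 3. The dictionary keys and variable names below are illustrative; the numbers are transcribed from the tables.

```python
from statistics import mean

# Per-run accuracies transcribed from Table 5 (Run 1, Run 2, Run 3).
runs = {
    "miRNA CV": (0.525, 0.5, 0.525),
    "miRNA Training 2": (0.3125, 0.375, 0.4063),
    "miRNA Validation": (0.4828, 0.5517, 0.5172),
    "Protein CV": (0.6, 0.625, 0.675),
    "Protein Training 2": (0.5938, 0.5313, 0.6563),
    "Protein Validation": (0.3103, 0.4828, 0.6207),
    "Multiple Modality CV": (0.5, 0.6167, 0.6),
    "Multiple Modality Validation": (0.3793, 0.6552, 0.5172),
}

# Average over the three runs, rounded to four decimals as in the table.
averages = {name: round(mean(vals), 4) for name, vals in runs.items()}

# Accuracy times group size recovers the number of correct predictions.
validation_size = 29
mirna_validation_correct = [round(a * validation_size)
                            for a in runs["miRNA Validation"]]
```

With only 29 validation patients, one additional correct prediction moves the accuracy by about 0.034, which helps put the run-to-run spread in Table 5 in context.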
In Figure 7a (top left), the x-axis represents the miRNA prediction data and the y-axis represents the protein prediction data from methodology 1. In Figure 7b, the x-axis represents the miRNA prediction data and the y-axis represents the protein prediction data from methodology 2. As Figure 7a suggests, methodology 2 was expected to yield a higher score, since the individual-modality data passed into Phase II are more continuous than in methodology 1. In Figure 7c, the x-axis shows cross validation values and the y-axis shows external validation values. Figure 7d shows the output from running our MATLAB code for prediction modeling, in addition to the graphs above.
V. CONCLUSION
As can be deduced from the results, neither hypothesis appears to hold: the hyperplane decision values were not more accurate than the predicted group values, and multiple-modality prediction was not more accurate overall than individual-modality prediction. There are clear areas for improvement. First, more modalities could be included (methylation data, genomics, etc.). Furthermore, algorithm efficiency could be reevaluated: improving efficiency would decrease the overall run time, allowing the use of larger data sets. A GUI could be implemented to improve ease of use for third-party testing. Future work could also make more use of clinical data. For example, only survival time post diagnosis was used for prediction; other clinical data such as cancer stage or tumor type could be used for similar prediction. Furthermore, the miRNA and protein IDs and expression values used for each prediction were saved; therefore, if improved accuracies could be achieved, the biomarkers used for prediction could be studied. This could lead to the discovery of novel biomarkers.
In addition to accuracy, the area under the curve (AUC) was determined as well:

$$ \mathrm{AUC} = \frac{1}{N_+ N_-} \sum_{i=1}^{N_+} \sum_{j=1}^{N_-} \left[ I(x_i > y_j) + 0.5 \, I(x_i = y_j) \right] \tag{3} $$

where x_i and y_j are classifier decision values for group 1 (<1 year survival) and group 2 (>=1 year survival) samples, respectively. N+ and N- represent the number of samples in groups 1 and 2. Samples classified into group 1 should have positive decision values and samples classified into group 2 should have negative decision values. I(x) evaluates to 1 if x is true and 0 otherwise. Note that in the case of ties, the summation is weighted by 0.5. The motivation for calculating this value, in addition to accuracy, is to measure the validity of the accuracy values. Because uneven sample sizes can skew the results (the <1 year survival group is more than double the size of the >=1 year survival group), AUC serves as an additional evaluation metric.
The low accuracy could stem from the skewed group sizes mentioned above. Since more data were available on patients surviving less than one year, accurately predicting patients who survived over a year after diagnosis is difficult. Increasing the sample size, using patient groups with less skew in survival time, and running the simulation more times could all lead to more reasonable results.
VI. REFERENCES
[1] "What's New in Pancreatic Cancer Research and Treatment?" What's New in Pancreatic Cancer Research and Treatment? American Cancer Society, 11 June 2014. Web. <http://www.cancer.org/cancer/pancreaticcancer/detailedguide/pancreatic-cancer-new-research>.
[2] Ryan, David P., Theodore S. Hong, and Nabeel Bardeesy. "Pancreatic Adenocarcinoma." The New England Journal of Medicine 371.11 (2014): 1039-49. Web.
[3] Lau et al. "Role of Pancreatic Cancer-derived Exosomes in Salivary Biomarker Development." Journal of Biological Chemistry 288.37 (2013): 26888-6897. Web.
[4] Chouchane, Lotfi, Ravinder Mamtani, Ashraf Dallol, and Javaid I. Sheikh. "Personalized Medicine: A Patient-Centered Paradigm." Journal of Translational Medicine 9.1 (2011): 206. Web.
[5] Shen, R., A. B. Olshen, and M. Ladanyi. "Integrative Clustering of Multiple Genomic Data Types Using a Joint Latent Variable Model with Application to Breast and Lung Cancer Subtype Analysis." Bioinformatics 26.2 (2010): 292-93. Web.
[6] Yeoman et al. "A Multi-Omic Systems-Based Approach Reveals Metabolic Markers of Bacterial Vaginosis and Insight into the Disease." Ed. Adam J. Ratner. PLoS ONE 8.2 (2013): E56111. Web.
[7] Daemen et al. "A Kernel-based Integration of Genome-wide Data for Clinical Decision Support." Genome Medicine 1.4 (2009): 39. Web.
[8] Mosca, Ettore, and Luciano Milanesi. "Network-based Analysis of Omics with Multi-objective Optimization." Molecular BioSystems 9.12 (2013): 2971. Web.
[9] Kim et al. "Incorporating Inter-relationships between Different Levels of Genomic Data into Cancer Clinical Outcome Prediction." Systems Biology with Omics Data 67.3 (2014): 344-53. Web.
[10] Madhavan et al. "Genome-wide Multi-omics Profiling of Colorectal Cancer Identifies Immune Determinants Strongly Associated with Relapse." Frontiers in Genetics 4 (2013): n. pag. Web.
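As a closing illustration, the area under the curve in equation (3) can be computed directly from its pairwise definition. The sketch below is a minimal Python version with made-up decision values; it is not the authors' MATLAB code, and the function and variable names are assumptions.

```python
def auc(group1_values, group2_values):
    """Equation (3): share of (group 1, group 2) pairs ranked correctly.

    A pair counts 1 when the group 1 decision value is larger,
    0.5 when the two values tie, and 0 otherwise.
    """
    total = 0.0
    for x in group1_values:
        for y in group2_values:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(group1_values) * len(group2_values))


# Made-up decision values: group 1 (<1 year) should score higher than group 2.
group1 = [1.2, 0.4, -0.1]
group2 = [-0.8, 0.4]
score = auc(group1, group2)
```

Because every group 1 sample is compared against every group 2 sample, the value does not depend on how many patients each group contains, which is why it complements accuracy when the groups are as unbalanced as 64 versus 29.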