Pattern-Matching in DNA sequences
using WEKA
C. Ainali¹, G. Kontogounis², N. Samaras¹
¹ Dept. Applied Informatics, University of Macedonia, P.O. BOX 1591, 54006
Thessaloniki, Greece, Tel: +30.2310891866, Email: {it0352,samaras}@uom.gr
² Faculty of Medicine, Aristotle University, 541 24 Thessaloniki, Greece,
Email: kongougr@yahoo.gr
Abstract
Weka (Waikato Environment for Knowledge Analysis) can be used at several
different levels. It is an environment for automatic classification, clustering,
regression and feature selection which contains an extensive collection of
machine learning algorithms and data pre-processing methods. In classification
or regression, evaluation tools are used for analyzing and predicting the outcome
associated with a particular individual. In clustering, individuals which share
certain properties are grouped together. However, while data mining methods are
widely used to find patterns within large amounts of data, Weka does not provide
any special data mining framework for biological data. The BioWeka suite, on the
other hand, adds this functionality and supports knowledge discovery
applications on biological data. In this paper, we show how BioWeka can be used
to extract useful information by comparing local gene clusters in human
and other mammalian chromosomes. A wide range of filter tools and clustering
algorithms is used to group together mammalian chromosomes that
share similarities and to open up opportunities for
determining chromosomal similarities between related species.
Keywords: Data mining, WEKA, clustering, local gene cluster, chromosomes.
1. Introduction
In recent years, the volume of data stored worldwide has grown, so the need to
extract knowledge from very large databases is increasing. Knowledge Discovery
in Databases (KDD) or just Knowledge Discovery (KD) has been defined as the
non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. Data mining research has led to the
development of numerous efficient and scalable methods for mining interesting
patterns and knowledge in large databases, and data mining has been defined as one
particular step of the KDD process. It uses different algorithms for classification,
regression, clustering or association rule mining. These advances in data mining
have already been applied to bio-data analysis, as they are widely expected to
provide the necessary tools for a better understanding of gene expression, drug
design, and other emerging problems in genomics and proteomics [4].
At the same time, numerous computational methods and algorithms, particularly
in data mining, have been developed to extract the hidden knowledge in these
data that is relevant to our understanding of the biological systems. One of the
most important search problems in bio-data analysis is similarity search and
comparison among bio-sequences and structures. For example, base-by-base
comparisons of the chromosomes of related species are made and
computational approaches have been developed for comparative analysis. In
biology, therefore, various tools have been developed for searching for similarity
among biosequences. A well-known tool is BLAST [1]. The idea is to align
sequences so that similarity can be revealed in the presence of small variations
in position. Other rapid and sensitive algorithms that have been developed to
perform homology searches in sequence databases are FASTA [5], BLAT [6] and
PatternHunter [7]. All those methods follow a seed-based approach for detecting
alignments of interest and they are based on a common principle. At the first
stage, they search for small patterns (called “seeds”) shared by input sequences
and then extend some of those (“hits”) into larger alignments of similarity regions.
However, bio-data often consists of many features which form a high-dimensional
space, and it is crucial to discover pair-wise frequent patterns and cluster bio-data
based on such frequent patterns [2].
In this paper, we undertake a comparative analysis of human and chimpanzee
chromosomes using the WEKA project (Waikato Environment for Knowledge
Analysis) which provides a large collection of machine learning algorithms written
in Java for data pre-processing, classification, clustering, association rules, and
visualization. It is a framework that is distributed as open source software,
running on almost any computer platform, and it can be invoked through a
common Graphical User Interface (GUI). In Weka, the overall data mining
process takes place on a single machine, since the algorithms can be executed
only locally [3]. However, while data mining methods are widely used to find
patterns within large amounts of data and to derive general laws from the data,
Weka does not provide any special data mining framework for biological data,
especially sequence and gene expression data. The BioWeka workbench
(implemented in Java and distributed under the GNU open source license) adds
this functionality to the Weka software and it provides a general foundation for
the rapid development of new data mining applications to biological data.
In BioWeka, there are components that are frequently used within biological
research projects and were proven to yield desirable results in KDD experiments.
To be more specific, BioWeka supports different sequence formats as well as
XML based formats, it provides sequence-specific filters and it offers alignment-
based classification. It also supports the analysis of gene expression data, by
loading files with gene expression data into Weka and normalizing the numeric
data. The BioWeka project was initiated to minimize the effort, and thus the time,
to implement and apply KDD applications to biological data. It helps researchers
in the area of computational biology and here, it will help us to estimate the
number and types of rearrangements that have occurred and also to determine
when they occurred. In a given organism or species, genes are found in a given
order that is maintained on the chromosomes from one generation to the next.
Genetic analysis has revealed that genes with a related function are frequently
found to be clustered at one chromosomal location. Since chimpanzees are commonly
believed to be the species most closely related to humans, to the exclusion of
other primates, and since clustering of related genes presumably
provides an evolutionary advantage to species, we focus on comparing local
gene clusters in human (Homo sapiens) and chimpanzee (Pan troglodytes)
chromosomes. Specifically, we applied several conventional clustering algorithms
to the above dataset and we then used the results for identifying biologically
relevant groups of genes and samples.
The remainder of this paper is organized as follows. Section 2 reviews the materials and
methods that we use to tackle the problem. Section 3 presents the
algorithms that we apply to our data. Section 4 presents our results and
Section 5 concludes the paper.
2. Materials and Methods
2.1 Data Collection
The DNA dataset used in this study was obtained from the website of the
Ensembl Project (Ensembl Genome Browser). Ensembl [9] provides a software
system to store, analyse, use and display genomic information. The genomes of
14 chordates are currently available through Ensembl, from mammals such as
Human and Mouse through to the ‘primitive’ chordate Ciona intestinalis (the
smallest of any experimentally manipulable chordate). The genomes of three key
model eukaryotes, yeast, fly and worm, are also imported from their respective
databases to provide easy integration of information from these organisms with
chordates. Finally, a limited number of insect genomes are also available through
Ensembl owing to its participation in the VectorBase consortium. The data used
were taken from two mammalian genomes (Homo sapiens and Pan troglodytes)
to take into consideration possible similarities that may exist among the
genes. This dataset comprises 793 DNA sequences, of which 435 are from
the genomic sequence of human chromosomes (Homo sapiens) and 358 are
from the chimpanzee chromosomes (Pan troglodytes). The sequences were
stored in two separate “flat” FASTA files called chrom1.txt and chimp_chrom1.txt.
To merge both files into one single ARFF file, both files are first converted into
ARFF. Then, one file (for this scenario, the Homo sapiens set) is loaded into the
Preprocess panel of Weka’s Explorer workbench. Finally, BioWeka’s
MergeSets filter is applied, specifying the Pan troglodytes set as an additional
input file. This results in a single data set (called results.arff) that contains an
additional attribute (relationName) recording the origin of each sequence.
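The effect of this merge step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BioWeka's MergeSets implementation: the function name `merge_to_arff` and the attribute name `sequence` are hypothetical, while `relationName` and the two file names come from the text.

```python
def merge_to_arff(sets, relation="results"):
    """Merge several sequence sets into one ARFF document (illustrative only).

    `sets` maps an origin label (e.g. a source file name) to a list of
    sequence strings. Each output row carries the sequence plus a nominal
    `relationName` value that records where the sequence came from.
    """
    origins = list(sets)
    lines = [
        f"@relation {relation}",
        "",
        "@attribute sequence string",  # hypothetical attribute name
        "@attribute relationName {" + ",".join(f"'{o}'" for o in origins) + "}",
        "",
        "@data",
    ]
    for origin in origins:
        for seq in sets[origin]:
            # Values are single-quoted so that labels such as file names stay intact.
            lines.append(f"'{seq}','{origin}'")
    return "\n".join(lines)


if __name__ == "__main__":
    merged = merge_to_arff({
        "chrom1.txt": ["TAACCCTAACCC"],
        "chimp_chrom1.txt": ["TAACCCTAACCT"],
    })
    print(merged)
```

Keeping the origin as a nominal attribute is what later allows a classifier or clusterer to relate each instance back to its species.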
2.2 Machine Learning Approach
The main interface in Weka [8] is the Explorer, shown in Figure 1. It has a set of
panels, each of which can be used to perform a certain task.
• In the Preprocess panel, one can load datasets, browse the
characteristics of attributes and apply any combination of Weka’s
supervised and unsupervised filters to the data. The Preprocess panel also
shows a histogram of the attribute that is currently selected and some
statistics about it. Once a dataset has been loaded, one of the other
panels in the Explorer can be used to perform further analysis.
• If the data entail a classification or regression problem, they can be
processed in the Classify panel. This allows configuring and executing
any of the Weka classifiers on the current dataset, as it provides an
interface to learning algorithms for classification and regression models
and evaluation tools for analyzing the outcome of the learning process.
• In the Cluster panel, one can access Weka’s clustering
algorithms. These include k-means, mixtures of normal distributions with
diagonal co-variance matrices estimated using EM, and a heuristic
incremental hierarchical clustering scheme. They can be visualized in a
pop-up data visualization tool and with them one can find groups of similar
instances in a dataset.
• The Explorer’s Associate panel provides algorithms for generating
association rules, such as the Apriori algorithm, that can be used to identify statistical
dependencies between groups of attributes.
• The fifth panel (Select Attributes panel) offers methods for identifying
those subsets of attributes that are predictive of another (target) attribute
in the data. It is a panel that allows configuring and applying any
combination of Weka attribute evaluator and search method to select the
most pertinent attributes in the dataset. Search methods include best-first
search, forward selection, genetic algorithms and a simple ranking of
attributes. Evaluation measures include correlation and entropy-based
criteria as well as the performance of a selected learning scheme (e.g. a
decision tree learner) for a particular subset of attributes. Different search
and evaluation methods can be combined, making the system very
flexible.
• The last panel in the Explorer, Visualization, displays a scatter plot matrix
for all pairs of attributes in the data. It allows visualizing the current
dataset in one and two dimensions.
Figure 1: Weka’s Explorer Window
In the case of biological sequences (RNA, DNA and amino acid sequences)
there is a crucial point: how to derive a set of numerical or nominal attributes for
a given sequence, as there are no special data mining methods to operate on
strings. The BioWeka framework [10] adds this functionality to the Weka software, as
it provides filters that extract numeric or nominal features from sequences. It is
an environment for the analysis of biological data and for knowledge discovery in
biological data. While it is based on the Weka workbench, it is a framework in
its own right, as it can be extended in the same way Weka can.
In particular, the functionality that BioWeka offers includes data converters (e.g.
for FASTA [12], GenBank [14], EMBL [13], Swiss-Prot [15] sequence files, XML
files, tab-delimited microarray data). The FASTA format was developed for the
Fast Alignment Search Tool for Proteins (FASTP, [11]) and it was later
generalized to FAST-ALL. A file of this format can contain as many sequences as
necessary and each sequence may consist of nucleotides or amino acids. The
entry of a new sequence is indicated by the greater-than sign “>” followed by an
identifier (“1 dna”) and a description that characterizes the sequence, e.g.:
>1 dna:chromosome chromosome:NCBI35:1:1:245522847:1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCT
………………………………………………………………………….
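Reading such a file can be sketched as follows. This is a minimal illustration, not the BioWeka loader: the function name `parse_fasta` is hypothetical, and the header line is kept whole rather than split into identifier and description, since conventions for that split vary.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs.

    A line starting with '>' opens a new record; all following lines up to
    the next '>' are concatenated to form that record's sequence.
    """
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = (
        ">1 dna:chromosome chromosome:NCBI35:1:1:245522847:1\n"
        "TAACCCTAACCC\n"
        "TAACCCTAACCC\n"
    )
    for head, seq in parse_fasta(sample):
        print(head, len(seq))
```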
The EMBL (European Molecular Biology Laboratory) format [13] is used exclusively for
nucleotide sequences, as it is the format of the EMBL Nucleotide Sequence
Database. A file of this format may contain one or more sequence entries.
However, in contrast to the FASTA format, its structure is much more complex
and structured, so that an entry can hold not only the sequence of a gene but also
further information, e.g. a description of the protein the gene codes for, or
annotations that refer only to a specific location within the sequence (a single nucleotide or a range of
nucleotides). Another flat file format for nucleotide sequences is GenBank [14],
which is equivalent to the EMBL format: both have the same types of features, but they differ
in the tags. Finally, the Swiss-Prot database [15] uses the same format as
EMBL, but for amino acid sequences.
There are also classification methods (e.g. BLAST or PSI-BLAST, local, global
and secondary structure element alignments, and a reimplementation of the Eclat method),
filter tools and other components to merge ARFF files, save ARFF files in a
sequence format and create filter pipelines. In the bioinformatics arena, the BioWeka
data mining suite has also been used to translate sequences (DNA to RNA to
protein, DNA/RNA to its reverse complement), generate the open reading frames
of DNA/RNA sequences, analyze amino acid properties based on the Amino Acid
Index database, calculate codon frequencies or amino acid composition, and
normalize numeric feature vectors.
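Two of the sequence translations mentioned above, DNA to RNA and DNA to its reverse complement, are simple enough to sketch directly. The function names are hypothetical and this is not BioWeka code; only the four unambiguous DNA symbols are handled.

```python
# Translation table mapping each DNA base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G, reversed)."""
    return dna.translate(_COMPLEMENT)[::-1]


def transcribe(dna):
    """Return the RNA string for a DNA coding strand (T replaced by U)."""
    return dna.replace("T", "U").replace("t", "u")


if __name__ == "__main__":
    print(reverse_complement("TAACCC"))
    print(transcribe("TAACCC"))
```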
In our study we use the Eclat filter in order to discriminate between DNA sequences
from different species and the Expectation Maximization algorithm (EM) for
maximum likelihood estimation of mixture model parameters.
3. Algorithms
3.1 Eclat Implementation
In the first step, we apply the Eclat [16] filter to the Homo sapiens - Pan
troglodytes dataset, which we loaded into the Weka workbench using the
FastaSequenceLoader component of BioWeka. This filter accepts a set of
sequences and returns a set of codon frequencies. It has two properties. The
sequenceAttribute property specifies the attribute holding the sequences and the
alphabet property specifies the sequences’ alphabet (here, DNA).
The implementation of the EclatFilter component is very simple. Internally, it uses
a MultipleFilter instance to form a filter pipeline consisting of the following
components:
1. Translate: Cuts the sequences after the first stop codon. Therefore, the
translator’s property is filled with a single Terminator translator.
2. SymbolCounter: It is configured, so that it calculates the codon frequencies
and does not take into account ambiguous symbols.
3. Normalize: The normalizer property is set to a MinMaxNormalizer instance,
which normalizes the codon frequencies, so that all values are in the interval
[−1.0, 1.0].
4. Copy: Copies the class attribute to the end of the attribute list.
5. Remove: Removes the original sequence attributes and the class attribute.
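The first three pipeline steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the BioWeka components themselves: the function names are hypothetical, the reading frame is assumed to start at the first base, the stop codon itself is assumed to be excluded, and a constant column is mapped to the lower bound.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # the 64 DNA codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cut_at_first_stop(seq):
    """Return the in-frame codons that precede the first stop codon."""
    codons = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


def count_codons(codons):
    """Count each of the 64 codons, skipping codons with ambiguous symbols."""
    counts = dict.fromkeys(CODONS, 0)
    for codon in codons:
        if codon in counts:  # e.g. "ANT" is not a key and is ignored
            counts[codon] += 1
    return counts


def min_max_normalize(rows, low=-1.0, high=1.0):
    """Rescale every column of `rows` linearly into the interval [low, high]."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        if hi == lo:
            # Assumption: a constant column carries no information; map it to `low`.
            scaled_columns.append([low] * len(column))
        else:
            scaled_columns.append(
                [low + (v - lo) * (high - low) / (hi - lo) for v in column]
            )
    return [list(row) for row in zip(*scaled_columns)]


if __name__ == "__main__":
    codons = cut_at_first_stop("ATGAAATAGCCC")
    print(codons, count_codons(codons)["ATG"])
```

Normalizing per column rather than per sequence keeps each codon's feature on the same scale across all instances, which is what a margin-based learner such as an SVM needs.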
Figure 2: Applying Eclat Filter
To be more specific, for each sequence in the dataset codon frequencies are
computed from its beginning up to the first stop codon. To account for some
codons missing randomly, pseudocounts are used. The frequency for a given
codon c is computed as
$$F(c) = \frac{n_c + 1}{64 + \sum_{c' \in \mathrm{Codons}} n_{c'}}$$
where $n_i$ denotes the absolute number of occurrences of a codon $i$. Therefore, for
each sequence 64 attributes are derived for training and classification. Finally, an
SVM [9] is trained on the feature set to discriminate between sequences from
Homo sapiens and Pan troglodytes.
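As a worked illustration of the pseudocount frequency, consider a hypothetical sequence whose region before the first stop codon contains 30 codons, of which 3 are the codon $c$. The counts are invented for the example and are not taken from the dataset.

```latex
% Pseudocount frequency: one pseudocount per codon, 64 codons in total.
\[
  F(c) = \frac{n_c + 1}{64 + \sum_{c' \in \mathrm{Codons}} n_{c'}}
       = \frac{3 + 1}{64 + 30} = \frac{4}{94} \approx 0.043
\]
% A codon that never occurs still receives a small non-zero frequency:
\[
  F(c_{\text{unseen}}) = \frac{0 + 1}{64 + 30} = \frac{1}{94} \approx 0.011
\]
% The 64 frequencies sum to one:
\[
  \sum_{c \in \mathrm{Codons}} F(c) = \frac{30 + 64}{94} = 1
\]
```

The pseudocount therefore prevents zero-valued attributes for codons that happen to be missing from a short sequence, while keeping the 64 values a proper frequency distribution.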
3.2 EM
Secondly, we load the reduced Homo sapiens - Pan troglodytes dataset
into the Weka workbench using the CSV loader and apply the EM
clustering algorithm to it. We get cluster assignments from the estimated mixture model
by assigning each instance $x_j$ to the cluster of highest a posteriori probability,
$\arg\max_i P(c_i \mid x_j)$.
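The assignment rule can be sketched as follows. This is a minimal illustration of the maximum a posteriori step only, not Weka's EM implementation; the function names are hypothetical, and the example uses just the two non-empty clusters and the relationName estimator from Figure 3, ignoring the sequence attribute.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule: P(c_i | x_j) is proportional to P(c_i) * P(x_j | c_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    if total == 0:
        raise ValueError("instance has zero probability under every cluster")
    return [j / total for j in joint]


def assign_cluster(priors, likelihoods):
    """Return the index of the cluster with the highest posterior probability."""
    post = posteriors(priors, likelihoods)
    return max(range(len(post)), key=post.__getitem__)


if __name__ == "__main__":
    # Priors of clusters 0 and 3 as reported in Figure 3.
    priors = [0.433, 0.567]
    # Likelihood of the first relationName value under each cluster,
    # taken from the discrete estimator counts in Figure 3.
    first_value = [22.84 / 24.52, 5.16 / 31.48]
    second_value = [1.68 / 24.52, 26.32 / 31.48]
    print(assign_cluster(priors, first_value), assign_cluster(priors, second_value))
```

The returned indices refer to positions in the `priors` list, so index 1 here corresponds to cluster 3 of the Weka output.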
Codons
EM
==
Number of clusters selected by cross validation: 6
Cluster: 0 Prior probability: 0.433
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1.37 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.89 1.89 1.89 1.89
1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 (Total = 48.52)
Attribute: relationName
Discrete Estimator. Counts = 22.84 1.68 (Total = 24.52)
Cluster: 1 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)
Cluster: 2 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)
Cluster: 3 Prior probability: 0.567
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 17.63 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 1.11 1.11 1.11 1.11
1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 (Total = 55.48)
Attribute: relationName
Discrete Estimator. Counts = 5.16 26.32 (Total = 31.48)
Cluster: 4 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)
Cluster: 5 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)
Clustered Instances
0 25 ( 48%)
3 27 ( 52%)
Log likelihood: -3.32727
Figure 3: The number of clusters that correspond to our reduced dataset
4. Results
The SMO component solves the SVM problem without any
extra matrix storage and without invoking an iterative numerical routine for each
sub-problem. It globally replaces all missing values and transforms nominal
attributes into binary ones. It also normalizes all attributes by default. Tables 1, 2
and 3 summarize the results of the SVM application, whose estimated accuracy is
approximately 61.7%.
Correctly Classified Instances         148        18.6633 %
Incorrectly Classified Instances       645        81.3367 %
Kappa statistic                         -0.6566
Mean absolute error                      0.8134
Root mean squared error                  0.9019
Relative absolute error                164.2156 %
Root relative squared error            181.2281 %
Total Number of Instances              793
Table 1: Stratified cross-validation
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.299     0.95      0.277       0.299    0.287       chrom1.txt
0.05      0.701     0.056       0.05     0.053       chimp_chrom1.txt
Table 2: Detailed Accuracy By Class
   a    b   <-- classified as
 130  305 |  a = chrom1.txt
 340   18 |  b = chimp_chrom1.txt
Table 3: Confusion Matrix
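The summary statistics in Tables 1 and 2 follow directly from the confusion matrix in Table 3. The sketch below recomputes them; the function name `evaluate` is hypothetical and the code is not part of Weka. Run on the matrix of Table 3, it should reproduce the share of correctly classified instances, the kappa statistic and the per-class rates reported above.

```python
def evaluate(matrix):
    """Compute accuracy, Cohen's kappa, per-class recall and precision.

    `matrix[i][j]` is the number of instances of true class i that were
    classified as class j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    correct = sum(matrix[i][i] for i in range(k))
    accuracy = correct / n

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)

    recall = [matrix[i][i] / row_totals[i] for i in range(k)]      # TP rate
    precision = [matrix[i][i] / col_totals[i] for i in range(k)]
    return accuracy, kappa, recall, precision


if __name__ == "__main__":
    table3 = [[130, 305],   # a = chrom1.txt
              [340, 18]]    # b = chimp_chrom1.txt
    accuracy, kappa, recall, precision = evaluate(table3)
    print(f"accuracy={accuracy:.4%} kappa={kappa:.4f}")
    print("recall   ", [round(r, 3) for r in recall])
    print("precision", [round(p, 3) for p in precision])
```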
Furthermore, by applying the EM algorithm, we obtain a number of
clusters which indicate that there are common sequences in the human and
chimpanzee chromosomes. In a given organism or species, genes are found in a
given order that is maintained on the chromosomes from one generation to the
next. Genetic analysis has revealed that genes with a related function are
frequently found to be clustered at one chromosomal location. A high
resemblance at the sequence level is a clue to a common
evolutionary origin and an identical or similar function. From the cluster
visualization, we can estimate the similarity and see that the same
sequences from both chromosomes belong to the same cluster. Using this
information and applying the algorithm to all the chromosomes of the two
mammals, one could test the commonly cited figure that human DNA is 98% identical to that of chimpanzees.
5. Conclusion
BioWeka is a useful software package that makes it easy to develop new data
mining methods for biological data. There are many different components that
can be combined to form a complete KD application without writing a single line
of code, and existing components can be reused for custom solutions. For
instance, the BioWeka implementation of the Eclat method required 647 lines of code, compared to the original
implementation, which consists of 1260 lines of code. Moreover, the BioWeka
implementation provides an easy-to-use GUI based on the Weka software.
However, a comparison of the performance demonstrated that the original Eclat
implementation is faster than BioWeka’s implementation.
References:
[1] BLAST, http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
[2] H. Wang, J. Yang, W. Wang, and P. S. Yu. Clustering by pattern similarity in large data sets.
In SIGMOD’02, pp. 418–427, Madison, WI, June 2002.
[3] Witten, I. H., Frank, E.: Data Mining: Practical machine learning tools and techniques with Java
implementations. Morgan Kaufmann (2000).
[4] Zaki, M., Wang, J., Toivonen, H.: BIOKDD 2002: Recent Advances in Data Mining for
Bioinformatics. In SIGKDD Explorations, Vol. 4, Issue 2 (2002) 112-114
[5] Pearson, W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA.
Meth. Enzymol., 183, 63–98.
[6] Kent, W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656–664
[7] Ma, B., Tromp, J., Li, M. (2002) PatternHunter: faster and more sensitive homology search.
Bioinformatics, 18, 440–445
[8] Frank, E., Hall, M., Trigg, L., Holmes, G., and Witten, I. H. (2004) Data mining in bioinformatics
using Weka. Bioinformatics 20, 2479 –2481
[9] E. Birney, D. Andrews, M. Caccamo, Y. Chen, L. Clarke, G. Coates, T. Cox, F. Cunningham,
V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, P. Flicek, S. Gräf, M.
Hammond, J. Herrero, K. Howe, V. Iyer, K. Jekosch, A. Kähäri, A. Kasprzyk, D. Keefe, F.
Kokocinski, E. Kulesha, D. London, I. Longden, C. Melsopp, P. Meidl, B. Overduin, A. Parker, G.
Proctor, A. Prlic, M. Rae, D. Rios, S. Redmond, M. Schuster, I. Sealy, S. Searle, J. Severin, G.
Slater, D. Smedley, J. Smith, A. Stabenau, J. Stalker, S. Trevanion, A. Ureta-Vidal, J. Vogel, S.
White, C. Woodwark, and T. J. P. Hubbard. Ensembl 2006. Nucleic Acids Res. 2006 Jan 1; 34
Database issue: D556-D561.
[10] Martin Szugat. BioWeka: Extending the Weka framework for Bioinformatics. Bachelor thesis,
2005 September 17.
[11] D J Lipman and W R Pearson. Rapid and sensitive protein similarity searches. Science,
227(4693):1435–1441, Mar 1985.
[12] W R Pearson and D J Lipman. Improved tools for biological sequence comparison. Proc Natl
Acad Sci U S A, 85(8):2444–2448, Apr 1988.
[13] Tamara Kulikova, Philippe Aldebert, Nicola Althorpe, Wendy Baker, Kirsty Bates, Paul
Browne, Alexandra van den Broek, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem
Faruque, Maria Garcia-Pastor, Nicola Harte, Carola Kanz, Rasko Leinonen, Quan Lin, Vincent
Lombard, Rodrigo Lopez, Renato Mancuso, Michelle McHale, Francesco Nardone, Ville
Silventoinen, Peter Stoehr, Guenter Stoesser, Mary Ann Tuli, Katerina Tzouvara, Robert
Vaughan, Dan Wu, Weimin Zhu, and Rolf Apweiler. The EMBL Nucleotide Sequence Database.
Nucleic Acids Res, 32(Database issue):27–30, Jan 2004.
[14] Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic
Acids Res. 2004;32:D23–D26.
[15] A Bairoch and B Boeckmann. The SWISS-PROT protein sequence data bank. Nucleic Acids
Res, 19 Suppl:2247–2249, Apr 1991.
[16] Caroline C Friedel, Katharina H V Jahn, Selina Sommer, Stephen Rudd, Hans W Mewes,
and Igor V Tetko. Support vector machines for separation of mixed plant-pathogen EST
collections based on codon usage. Bioinformatics, 21(8):1383–1388, Apr 2005.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 
Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...
 
A Survey on Bioinformatics Tools
A Survey on Bioinformatics ToolsA Survey on Bioinformatics Tools
A Survey on Bioinformatics Tools
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
 
G44083642
G44083642G44083642
G44083642
 
woot2
woot2woot2
woot2
 
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
 
D1803012022
D1803012022D1803012022
D1803012022
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
Investigating plant systems using data integration and network analysis
Investigating plant systems using data integration and network analysisInvestigating plant systems using data integration and network analysis
Investigating plant systems using data integration and network analysis
 
V1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.docV1_I1_2012_Paper5.doc
V1_I1_2012_Paper5.doc
 
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics caseData integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
 
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics caseData integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 

Dernier

Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONrouseeyyy
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
 

Dernier (20)

Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 

B.3.5

Pattern-Matching in DNA sequences using WEKA

C. Ainali (1), G. Kontogounis (2), N. Samaras (1)

(1) Dept. Applied Informatics, University of Macedonia, P.O. BOX 1591, 54006 Thessaloniki, Greece, Tel: +30.2310891866, Email: {it0352,samaras}@uom.gr
(2) Faculty of Medicine, Aristotle University, 541 24 Thessaloniki, Greece, Email: kongougr@yahoo.gr

Abstract

Weka (Waikato Environment for Knowledge Analysis) can be used at several different levels. It is an environment for automatic classification, clustering, regression and feature selection which contains an extensive collection of machine learning algorithms and data pre-processing methods. In classification or regression, evaluation tools are used for analyzing and predicting the outcome associated with a particular individual. In clustering, individuals which share certain properties are grouped together. However, while data mining methods are widely used to find patterns within large amounts of data, Weka does not provide any special data mining framework for biological data. The BioWeka suite, on the other hand, adds further functionality to it and applies knowledge discovery applications to biological data. In this paper, we present how BioWeka can be used to extract useful information by comparing local gene clusters in human and in other mammal chromosomes. A wide range of filter tools and clustering algorithms are used to group together chromosomes of mammals that share similar relationships and to open up opportunities for determining chromosomal similarities between related species.

Keywords: Data mining, WEKA, clustering, local gene cluster, chromosomes.

1. Introduction

In recent years, the volume of data stored worldwide has grown, so that the need to extract knowledge from very large databases is increasing.
Knowledge Discovery in Databases (KDD), or simply Knowledge Discovery (KD), has been defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining research has led to the development of numerous efficient and scalable methods for mining interesting patterns and knowledge in large databases, and data mining has been defined as one particular step of the KDD process. It uses different algorithms for classification, regression, clustering or association rules. These advances in data mining have already been applied to bio-data analysis, as it is widely believed that they will
provide the necessary tools for a better understanding of gene expression, drug design, and other emerging problems in genomics and proteomics [4]. At the same time, numerous computational methods and algorithms, particularly in data mining, have been developed to extract the hidden knowledge in these data that is relevant to our understanding of biological systems.

One of the most important search problems in bio-data analysis is similarity search and comparison among bio-sequences and structures. For example, base-by-base comparisons of the chromosomes of related species are made, and computational approaches have been developed for comparative analysis. In biology, therefore, various tools have been developed for searching for similarity among bio-sequences. A well-known tool is BLAST [1]. The idea is to align sequences so that similarity can be revealed in the presence of small variations in position. Other rapid and sensitive algorithms that have been developed to perform homology searches in sequence databases are FASTA [5], BLAT [6] and PatternHunter [7]. All of these methods follow a seed-based approach for detecting alignments of interest and are based on a common principle: in the first stage, they search for small patterns (called "seeds") shared by the input sequences, and then they extend some of those ("hits") into larger alignments of similarity regions. However, bio-data often consists of many features that form a high-dimensional space, and it is crucial to discover pair-wise frequent patterns and to cluster bio-data based on such frequent patterns [2].

In this paper, we undertake a comparative analysis of human and chimpanzee chromosomes using the WEKA project (Waikato Environment for Knowledge Analysis), which provides a large collection of machine learning algorithms written in Java for data pre-processing, classification, clustering, association rules, and visualization.
It is a framework that is distributed as open source software, runs on almost any computer platform, and can be invoked through a common Graphical User Interface (GUI). In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally [3]. However, while data mining methods are widely used to find patterns within large amounts of data and to derive general laws from the data, Weka does not provide any special data mining framework for biological data, especially sequence and gene expression data.

The BioWeka workbench (implemented in Java and distributed under the GNU open source license) adds this functionality to the Weka software and provides a general foundation for the rapid development of new data mining applications for biological data. BioWeka contains components that are frequently used within biological research projects and have been proven to yield desirable results in KDD experiments. More specifically, BioWeka supports different sequence formats as well as XML-based formats, provides sequence-specific filters, and offers alignment-based classification. It also supports the analysis of gene expression data, by loading files with gene expression data into Weka and normalizing the numeric data. The BioWeka project was initiated to minimize the effort, and thus the time,
to implement and apply KDD applications to biological data. It helps researchers in the area of computational biology, and here it will help us to estimate the number and types of rearrangements that have occurred and also to determine when they occurred.

In a given organism or species, genes are found in a given order that is maintained on the chromosomes from one generation to the next. Genetic analysis has revealed that genes with a related function are frequently found to be clustered at one chromosomal location. Since it is commonly believed that chimpanzees, to the exclusion of other primates, are the species most closely related to humans, and that the clustering of related genes presumably provides an evolutionary advantage to a species, we focus on comparing local gene clusters in human (Homo sapiens) and chimpanzee (Pan troglodytes) chromosomes. Specifically, we applied several conventional clustering algorithms to the above dataset and then used the results to identify biologically relevant groups of genes and samples.

The rest of this paper is organized as follows. Section 2 reviews the materials and methods that we use to tackle the problem. Section 3 presents the algorithms that we apply to our data. Section 4 presents our results and Section 5 concludes the paper.

2. Materials and Methods

2.1 Data Collection

The DNA dataset used in this study was obtained from the website of the Ensembl Project (Ensembl Genome Browser). Ensembl [9] provides a software system to store, analyse, use and display genomic information. The genomes of 14 chordates are currently available through Ensembl, from mammals such as Human and Mouse through to the 'primitive' chordate Ciona intestinalis (the smallest of any experimentally manipulable chordate). The genomes of three key model eukaryotes (yeast, fly and worm) are also imported from their respective databases to provide easy integration of information from these organisms with chordates.
Finally, a limited number of insect genomes are also available through Ensembl owing to its participation in the VectorBase consortium.

The data used were taken from two mammalian genomes (Homo sapiens and Pan troglodytes) in order to take into consideration possible similarity among the genes. The dataset comprises 793 DNA sequences, of which 435 are from the genomic sequence of human chromosomes (Homo sapiens) and 358 are from chimpanzee chromosomes (Pan troglodytes). The sequences were stored in two separate "flat" FASTA files called chrom1.txt and chimp_chrom1.txt. To merge the two files into one single ARFF file, both are first converted into ARFF. Then one file (in this scenario, the Homo sapiens set) is loaded into the Preprocess panel of Weka's Explorer workbench. Finally, BioWeka's MergeSets filter is applied, specifying the Pan troglodytes set as an additional
input file. This results in a single data set (called results.arff), which contains an additional attribute (relationName) that records the origin of each sequence.

2.2 Machine Learning Approach

The main interface in Weka [8] is the Explorer, shown in Figure 1. It has a set of panels, each of which can be used to perform a certain task.

• In the Preprocess panel, one can load datasets, browse the characteristics of attributes and apply any combination of Weka's supervised and unsupervised filters to the data. The Preprocess panel also shows a histogram of the currently selected attribute and some statistics about it. Once a dataset has been loaded, one of the other panels in the Explorer can be used to perform further analysis.
• If the data entail a classification or regression problem, they can be processed in the Classify panel. This panel allows any of the Weka classifiers to be configured and executed on the current dataset: it provides an interface to learning algorithms for classification and regression models, and evaluation tools for analyzing the outcome of the learning process.
• The Cluster panel gives access to Weka's clustering algorithms. These include k-means, mixtures of normal distributions with diagonal covariance matrices estimated using EM, and a heuristic incremental hierarchical clustering scheme. The resulting clusters can be visualized in a pop-up data visualization tool, and the algorithms can be used to find groups of similar instances in a dataset.
• The Associate panel offers algorithms for generating association rules, such as the Apriori algorithm, which can be used to identify statistical dependencies between groups of attributes.
• The fifth panel (Select Attributes) offers methods for identifying those subsets of attributes that are predictive of another (target) attribute in the data.
It is a panel that allows configuring and applying any combination of Weka attribute evaluator and search method to select the most pertinent attributes in the dataset. Search methods include best-first search, forward selection, genetic algorithms and a simple ranking of attributes. Evaluation measures include correlation- and entropy-based criteria as well as the performance of a selected learning scheme (e.g. a decision tree learner) for a particular subset of attributes. Different search and evaluation methods can be combined, making the system very flexible.
• The last panel in the Explorer, Visualization, displays a scatter plot matrix for all pairs of attributes in the data. It allows the current dataset to be visualized in one and two dimensions.
Figure 1: Weka's Explorer Window

In the case of biological sequences (RNA, DNA and amino acid sequences) there is a crucial question: how to derive a set of numerical or nominal attributes for a given sequence, as there are no special data mining methods that operate on strings. The BioWeka framework [10] adds this functionality to the Weka software, as it provides filters that extract numeric or nominal features from sequences. It is an environment for the analysis of biological data and for knowledge discovery in biological data. While it is based on the Weka workbench, it is a framework in its own right, as it can be extended in the same way Weka can.

In particular, the functionality that BioWeka offers includes data converters (e.g. for FASTA [12], GenBank [14], EMBL [13] and Swiss-Prot [15] sequence files, XML files, and tab-delimited microarray data). The FASTA format was developed for the Fast Alignment Search Tool for Proteins (FASTP, [11]) and was later generalized to FAST-ALL. A file in this format can contain as many sequences as necessary, and each sequence may consist of nucleotides or amino acids. The start of a new sequence is indicated by the greater-than sign ">" followed by an identifier ("1 dna") and a description that characterizes the sequence, e.g.:

>1 dna:chromosome chromosome:NCBI35:1:1:245522847:1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCT
...

The EMBL (European Molecular Biology Laboratory) format [13] is used exclusively for nucleotide sequences, as it is the format of the EMBL Nucleotide Sequence Database. A file in this format may contain one or more sequence entries.
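The FASTA layout described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not the BioWeka converter itself: the function name `parse_fasta` and the shortened example record are assumptions made for illustration only.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (the text after '>') to its concatenated sequence.

    Hypothetical helper for illustration; it is not part of Weka or BioWeka.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # a new record starts here
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()  # sequence may span many lines
    return records


# Shortened, illustrative excerpt in the layout shown above.
example = """>1 dna:chromosome chromosome:NCBI35:1:1:245522847:1
TAACCCTAACCC
TAACCCTAACCC
""".splitlines()

records = parse_fasta(example)
```

Each record keeps the full header line as its key, so the identifier and the description can be split off later if needed.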
In contrast to the FASTA format, however, the EMBL structure is much more complex and unified, so that one can store not only the sequence of a gene but also further information, e.g. a description of the protein the gene codes for, or annotations that refer only to a specific location within the sequence (a single nucleotide or a range of nucleotides). Another flat file format for nucleotide sequences is GenBank [14], which is equivalent to the EMBL format: the two share the same types of features but differ in their tags. Finally, the Swiss-Prot database [15] uses the same format as EMBL, but for amino acid sequences.

BioWeka also offers classification methods (e.g. BLAST or PSI-BLAST; local, global and secondary structure element alignments; a reimplementation of the Eclat method), filter tools, and other components to merge ARFF files, save ARFF files in a sequence format and create filter pipelines. In the bioinformatics arena, the BioWeka data mining suite has also been used to translate sequences (DNA to RNA to protein, DNA/RNA to its reverse complement), generate the open reading frames of DNA/RNA sequences, analyze amino acid properties based on the Amino Acid Index database, calculate codon frequencies or amino acid composition, and normalize numeric feature vectors.

In our study we use the Eclat filter to discriminate between DNA sequences from different species, and the Expectation Maximization (EM) algorithm for maximum likelihood estimation of mixture model parameters.

3. Algorithms

3.1 Eclat Implementation

In the first step, we apply the Eclat [16] filter to the Homo sapiens - Pan troglodytes dataset, which we loaded into the Weka workbench using the FastaSequenceLoader component of BioWeka. This filter accepts a set of sequences and returns a set of codon frequencies. It has two properties: the sequenceAttribute property specifies the attribute holding the sequences, and the alphabet property specifies the sequences' alphabet (here, DNA).
The implementation of the EclatFilter component is very simple. Internally, it uses a MultipleFilter instance to form a filter pipeline consisting of the following components:

1. Translate: Cuts the sequences after the first stop codon. To this end, the translator property is filled with a single Terminator translator.
2. SymbolCounter: It is configured so that it calculates the codon frequencies and does not take ambiguous symbols into account.
3. Normalize: The normalizer property is set to a MinMaxNormalizer instance, which normalizes the codon frequencies so that all values lie in the interval [−1.0, 1.0].
4. Copy: Copies the class attribute to the end of the attribute list.
5. Remove: Removes the original sequence attributes and the class attribute.
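The first three steps of this pipeline can be sketched in plain Python. This is an illustration under stated assumptions, not BioWeka's code: the function names are hypothetical, the reading frame is assumed to start at the first base, the stop codon itself is kept in the truncated sequence, and min-max scaling is assumed to be applied per codon across all sequences. Steps 4 and 5 only rearrange attributes and are omitted.

```python
from itertools import product
from typing import Dict, List

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # the 64 DNA codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cut_after_first_stop(seq: str) -> str:
    """Step 1 (assumed frame 0): keep everything up to and including the first stop codon."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[:i + 3]
    return seq  # no stop codon found: keep the whole sequence


def count_codons(seq: str) -> Dict[str, int]:
    """Step 2: count codons, skipping any triplet with an ambiguous symbol (e.g. N)."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    return counts


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Step 3 (assumed per attribute): scale every column into [-1.0, 1.0]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            scaled_cols.append([0.0] * len(col))  # constant column: arbitrary midpoint
        else:
            scaled_cols.append([2.0 * (v - lo) / (hi - lo) - 1.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


# Two made-up sequences, for illustration only.
sequences = ["ATGAAATAAGGG", "ATGATGNNNTGA"]
truncated = [cut_after_first_stop(s) for s in sequences]
rows = [[float(count_codons(s)[c]) for c in CODONS] for s in truncated]
features = min_max_normalize(rows)
```

Each sequence ends up as a vector of 64 values in [−1.0, 1.0], one per codon, which is the shape of input that the later classification step expects.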
Figure 2: Applying Eclat Filter

More specifically, for each sequence in the dataset, codon frequencies are computed from its beginning up to the first stop codon. To account for codons that happen to be missing, pseudocounts are used. The frequency of a given codon c is computed as

$$F(c) = \frac{n_c + 1}{64 + \sum_{c' \in \mathrm{Codons}} n_{c'}}$$

where n_i denotes the absolute number of occurrences of codon i. Therefore, 64 attributes are derived for each sequence for training and classification. Finally, an SVM [9] is trained on the feature set to discriminate between sequences from Homo sapiens and Pan troglodytes.

3.2 EM

Secondly, we load the reduced Homo sapiens - Pan troglodytes dataset into the Weka workbench using the CSV loader and apply the EM clustering algorithm to it. We obtain cluster assignments from the estimated mixture model by assigning each instance x_j to the cluster with the highest a posteriori probability, $\arg\max_i P(c_i \mid x_j)$.
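The assignment rule can be illustrated with a small sketch. It assumes that the mixture model has already been fitted, so that the prior P(c_i) of each cluster and the likelihood P(x_j | c_i) of the instance under each cluster are known. The function names and the numbers are hypothetical and are not taken from the experiment.

```python
from typing import List, Sequence


def posterior(priors: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Bayes' rule: P(c_i | x_j) is proportional to P(c_i) * P(x_j | c_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("instance has zero probability under every cluster")
    return [j / total for j in joint]


def assign_cluster(priors: Sequence[float], likelihoods: Sequence[float]) -> int:
    """Return the index i that maximizes the posterior P(c_i | x_j)."""
    post = posterior(priors, likelihoods)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical two-cluster model and one instance.
example_priors = [0.4, 0.6]
example_likelihoods = [0.9, 0.2]  # P(x_j | c_0), P(x_j | c_1)
example_posterior = posterior(example_priors, example_likelihoods)
example_cluster = assign_cluster(example_priors, example_likelihoods)
```

In this example the instance goes to cluster 0 even though cluster 1 has the larger prior, because the likelihood under cluster 0 outweighs the difference in priors.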
  • 8.
EM
==
Number of clusters selected by cross validation: 6

Cluster: 0 Prior probability: 0.433
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1.37 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.87 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 1.89 (Total = 48.52)
Attribute: relationName
Discrete Estimator. Counts = 22.84 1.68 (Total = 24.52)

Cluster: 1 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)

Cluster: 2 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)

Cluster: 3 Prior probability: 0.567
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 17.63 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 2.13 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 1.11 (Total = 55.48)
Attribute: relationName
Discrete Estimator. Counts = 5.16 26.32 (Total = 31.48)

Cluster: 4 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)

Cluster: 5 Prior probability: 0
Attribute: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
Discrete Estimator. Counts = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (Total = 26)
Attribute: relationName
Discrete Estimator. Counts = 1 1 (Total = 2)

Clustered Instances
0    25 ( 48%)
3    27 ( 52%)

Log likelihood: -3.32727

Figure 3: The number of clusters that correspond to our reduced dataset
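The assignment rule of Section 3.2, which picks the cluster with the highest posterior probability, can be illustrated with the estimates printed in Figure 3. The sketch below is deliberately simplified: it uses only the `relationName` attribute, treats each discrete estimator as count divided by total, and refers to the two `relationName` values by index because the output does not say which value is which. The names `assign_cluster` and `relation_likelihoods` are hypothetical helpers, not part of Weka.

```python
def assign_cluster(priors, likelihoods):
    """Return (best_cluster_index, posteriors) using Bayes' rule.

    priors[i]      = P(c_i)
    likelihoods[i] = P(x_j | c_i)
    posterior[i]   = P(c_i | x_j), proportional to priors[i] * likelihoods[i]
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors


# Prior probabilities of clusters 0..5, copied from Figure 3.
PRIORS = [0.433, 0.0, 0.0, 0.567, 0.0, 0.0]

# Discrete estimator counts for relationName in clusters 0..5 (Figure 3).
REL_COUNTS = [(22.84, 1.68), (1, 1), (1, 1), (5.16, 26.32), (1, 1), (1, 1)]


def relation_likelihoods(value_index):
    """P(relationName = value | cluster) for every cluster, as count / total."""
    return [counts[value_index] / sum(counts) for counts in REL_COUNTS]


if __name__ == "__main__":
    for value_index in (0, 1):
        best, post = assign_cluster(PRIORS, relation_likelihoods(value_index))
        print(value_index, best, [round(p, 3) for p in post])
```

Under these assumptions, an instance carrying the first `relationName` value is assigned to cluster 0 and one carrying the second value to cluster 3, while the four clusters with zero prior never win. The full EM model would multiply in the likelihood of every attribute, not just this one.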
  • 9. 4. Results
The SMO component solves the SVM training problem without any extra matrix storage and without invoking an iterative numerical routine for each sub-problem. It globally replaces all missing values, transforms nominal attributes into binary ones, and normalizes all attributes by default. Tables 1, 2 and 3 summarize the results of the SVM application, whose estimated accuracy is approximately 61.7%.

Correctly Classified Instances      148     18.6633 %
Incorrectly Classified Instances    645     81.3367 %
Kappa statistic                      -0.6566
Mean absolute error                   0.8134
Root mean squared error               0.9019
Relative absolute error             164.2156 %
Root relative squared error         181.2281 %
Total Number of Instances           793
Table 1: Stratified cross-validation

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.299     0.95      0.277       0.299    0.287       chrom1.txt
0.05      0.701     0.056       0.05     0.053       chimp_chrom1.txt
Table 2: Detailed Accuracy By Class

  a     b    <-- classified as
130   305  |  a = chrom1.txt
340    18  |  b = chimp_chrom1.txt
Table 3: Confusion Matrix

Furthermore, applying the EM algorithm in our study yields a number of clusters which reveal that the human and chimpanzee chromosomes share common sequences. In a given organism or species, genes are found in a given order that is maintained on the chromosomes from one generation to the next. Genetic analysis has revealed that genes with a related function are frequently clustered at one chromosomal location. High resemblance at the sequence level is a clue to a common evolutionary origin and to identical or similar function. From the visualization of the clusters, we can estimate the similarity and see that identical sequences from both chromosomes belong to the same cluster. Using this information and applying the algorithm to all the chromosomes of the two mammals, we can confirm that human DNA is about 98% identical to chimpanzee DNA.
5. Conclusion
BioWeka is a useful software package that makes it easy to develop new data mining methods for biological data. Its many components can be combined into a complete KD application without writing a single line of code, and users can reuse existing components in their own custom solutions. For instance, the Eclat method required 647 lines of code, compared to the original
  • 10. implementation, which consists of 1260 lines of code. Moreover, the BioWeka implementation provides an easy-to-use GUI based on the Weka software. However, a performance comparison showed that the original Eclat implementation is faster than BioWeka's implementation.
References:
[1] BLAST, http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
[2] Wang, H., Yang, J., Wang, W., Yu, P. S.: Clustering by pattern similarity in large data sets. In SIGMOD'02, pp. 418–427, Madison, WI, June 2002.
[3] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000).
[4] Zaki, M., Wang, J., Toivonen, H.: BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics. SIGKDD Explorations, Vol. 4, Issue 2 (2002) 112–114.
[5] Pearson, W. R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol., 183, 63–98.
[6] Kent, W. J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656–664.
[7] Ma, B., Tromp, J., Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.
[8] Frank, E., Hall, M., Trigg, L., Holmes, G., Witten, I. H. (2004) Data mining in bioinformatics using Weka. Bioinformatics, 20, 2479–2481.
[9] E. Birney, D. Andrews, M. Caccamo, Y. Chen, L. Clarke, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, P. Flicek, S. Gräf, M. Hammond, J. Herrero, K. Howe, V. Iyer, K. Jekosch, A. Kähäri, A. Kasprzyk, D. Keefe, F. Kokocinski, E. Kulesha, D. London, I. Longden, C. Melsopp, P. Meidl, B. Overduin, A. Parker, G. Proctor, A. Prlic, M. Rae, D. Rios, S. Redmond, M. Schuster, I. Sealy, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, A. Stabenau, J. Stalker, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark, and T. J. P. Hubbard. Ensembl 2006. Nucleic Acids Res., 34 (Database issue): D556–D561, Jan 2006.
[10] Martin Szugat. BioWeka: Extending the Weka framework for Bioinformatics. Bachelor thesis, 17 September 2005.
[11] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435–1441, Mar 1985.
[12] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, 85(8):2444–2448, Apr 1988.
[13] Tamara Kulikova, Philippe Aldebert, Nicola Althorpe, Wendy Baker, Kirsty Bates, Paul Browne, Alexandra van den Broek, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem Faruque, Maria Garcia-Pastor, Nicola Harte, Carola Kanz, Rasko Leinonen, Quan Lin, Vincent Lombard, Rodrigo Lopez, Renato Mancuso, Michelle McHale, Francesco Nardone, Ville Silventoinen, Peter Stoehr, Guenter Stoesser, Mary Ann Tuli, Katerina Tzouvara, Robert Vaughan, Dan Wu, Weimin Zhu, and Rolf Apweiler. The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 32 (Database issue): 27–30, Jan 2004.
[14] Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L. GenBank: update. Nucleic Acids Res., 32: D23–D26, 2004.
[15] A. Bairoch and B. Boeckmann. The SWISS-PROT protein sequence data bank. Nucleic Acids Res., 19 Suppl: 2247–2249, Apr 1991.
[16] Caroline C. Friedel, Katharina H. V. Jahn, Selina Sommer, Stephen Rudd, Hans W. Mewes, and Igor V. Tetko. Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics, 21(8):1383–1388, Apr 2005.
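As a cross-check of the evaluation figures in Section 4, the summary statistics of Tables 1 and 2 can be recomputed from the confusion matrix in Table 3. The sketch below assumes the usual definitions (accuracy as the diagonal share, Cohen's kappa from observed and chance agreement, TP rate per actual class, precision per predicted class); the function name `evaluate` is hypothetical and not part of Weka.

```python
def evaluate(matrix):
    """Compute accuracy, Cohen's kappa, per-class TP rate and precision.

    matrix[i][j] = number of instances of actual class i classified as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    accuracy = correct / total

    row_totals = [sum(row) for row in matrix]        # actual class sizes
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted class sizes

    # Agreement expected by chance, from the marginal distributions.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    tp_rate = [matrix[i][i] / row_totals[i] for i in range(n)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n)]
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "tp_rate": tp_rate,
        "precision": precision,
    }


# Confusion matrix from Table 3: rows are chrom1.txt and chimp_chrom1.txt.
TABLE_3 = [[130, 305],
           [340, 18]]

if __name__ == "__main__":
    for name, value in evaluate(TABLE_3).items():
        print(name, value)
```

With the Table 3 counts, this reproduces the 18.66% of correctly classified instances and the kappa of about -0.6566 reported in Table 1, as well as the TP rates and precisions listed in Table 2.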