2012 12 12_adam_v_final

Bioinformatics

Prof. Wim Van Criekinge
18th December 2012, VUmc, Amsterdam

Outline

• Scripting
Perl (Bioperl/Python)
examples spiders/bots

• Databases
Genome Browser
examples biomart, galaxy

• AI
Classification and clustering
examples WEKA (R, Rapidminer)

Bioinformatics, a life science discipline …

[Diagram, built up over three slides: Math, Informatics and (Molecular) Biology as the three corner disciplines, with the labels Computer Science, Theoretical Biology and Computational Biology between them and Bioinformatics in the centre.]

Bioinformatics, a life science discipline … management of expectations

[Diagram: Math, Informatics and (Molecular) Biology around Bioinformatics, annotated with application areas: NP, Datamining, AI, Image Analysis, structure prediction (HTX), Interface Design, Sequence Analysis, Expert Annotation, and Discovery Informatics – Computational Genomics.]

What is Perl ?

• Perl is a high-level scripting language
• Larry Wall created Perl in 1987
  - Practical Extraction (a)nd Reporting Language
  - (or Pathologically Eclectic Rubbish Lister)
• Born from a system administration tool
• Faster than sh or csh
• Slower than C
• No need for sed, awk, tr, wc, cut, …
• Perl is open and free
• http://conferences.oreillynet.com/eurooscon/

What is Perl ?

• Perl is available for most computing
platforms: all flavors of UNIX (Linux), MS-DOS/Win32,
Macintosh, VMS, OS/2, Amiga, AS/400, Atari
• Perl is a computer language that is:
  - Interpreted, compiled at run-time (you need
the perl interpreter, e.g. perl.exe!)
  - Loosely "typed"
  - String/text oriented
  - Capable of using multiple syntax formats
• In Perl, "there's more than one way to do it"

Why use Perl for bioinformatics ?

• Ease of use by novice programmers
• Flexible language: Fast software prototyping (quick
and dirty creation of small analysis programs)
• Expressiveness. Compact code, Perl Poetry:
@{$_[$#_]||[]}
• Glutility: Read disparate files and parse the relevant
data into a new format
• Powerful pattern matching via “regular expressions”
(Best Regular Expressions on Earth)
• With the advent of the WWW, Perl has become the
language of choice to create Common Gateway
Interface (CGI) scripts to handle form submissions
and create compute servers on the WWW.
• Open Source – Free. Availability of Perl modules for
Bioinformatics and Internet.

Why NOT use Perl for bioinformatics ?

• Some tasks are still better done with other
languages (heavy computations / graphics)
  - C(++), C#, Fortran, Java (Pascal, Visual Basic)

• With Perl you can write simple programs
fast, and it also scales to larger and more complex
programs (yet it is not adequate for very large projects)
  - Python

• Larry Wall: "For programmers, laziness is a
virtue"

What bioinformatics tasks are suited to Perl ?

• Sequence manipulation and analysis
• Parsing results of sequence analysis programs
(Blast, Genscan, Hmmer etc)
• Parsing database (e.g. GenBank) files
• Obtaining multiple database entries over the
internet
• …

Perl installation

• Perl
  - Perl is available for various operating systems. To
download Perl and install it on your computer, have a
look at the following resources:
  - www.perl.com (O'Reilly): Downloading Perl Software
  - ActiveState: ActivePerl for Windows, as well as for
Linux and Solaris (ActivePerl binary packages)
  - CPAN

• PHPTriad:
  - contains Apache/PHP and MySQL:
http://sourceforge.net/projects/phptriad

Check installation

• Command-line flags for perl
  - perl -v
Gives the current version of Perl

  - perl -e
Executes Perl statements from the command line.
perl -e "print 42;"
perl -e "print \"Two\nlines\n\";"

  - perl -we
Executes and prints warnings
perl -we "print 'hello'; x++;"

TextPad

• Syntax highlighting

• Run program (prompt for parameters)

• Show line numbers

• Clip-ons for web with perl syntax

• ….

Customize textpad part 1: Create Document Class


• Document classes


Customize textpad part 2: Add Perl to “Tools Menu”


Unzip to textpad samples directory


General Remarks

• Perl is mostly a free format language: add
spaces, tabs or new lines wherever you want.
• For clarity, it is recommended to write each
statement in a separate line, and use
indentation in nested structures.
• Comments: Anything from the # sign to the
end of the line is a comment. (There are no
multi-line comments).
• A perl program consists of all of the Perl
statements of the file taken collectively as
one big routine to execute.


Three Basic Data Types

• Scalars - $
• Arrays of scalars - @
• Associative arrays of scalars, or hashes - %

2+2 = ?

$a = 2;
$b = 2;
$c = $a + $b;

$ - indicates a variable
; - ends every command
= - assigns a value to a variable

or $c = 2 + 2;
or $c = 2 * 2;
or $c = 2 / 2;
or $c = 2 ** 4; # exponentiation: 2 to the power 4 = 16 (in Perl, ^ is bitwise XOR, not a power operator)
or $c = 1.35 * 2 - 3 / (0.12 + 1);

Ok, $c is 4. How do we know it?

$c = 4;
print "$c";

print command:

" " - quotes enclose the output expression

print "Hello \n";

\n - prints an end-of-line character
(equivalent to pressing 'Enter')

String concatenation:
print "Hello everyone\n";
print "Hello" . " everyone" . "\n";

Expressions and strings together:
print "2 + 2 = " . (2+2) . "\n"; # prints: 2 + 2 = 4
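A quick way to check the arithmetic operators is to run them. The following is a minimal sketch; the variable names are chosen here for illustration. It also shows why the power operator matters: in Perl `**` is exponentiation, while `^` is bitwise XOR.

```perl
use strict;
use warnings;

my $power = 2 ** 4;   # exponentiation
my $xor   = 2 ^ 4;    # bitwise XOR: binary 010 xor 100 = 110

print "2 ** 4 = $power\n";   # 16
print "2 ^ 4  = $xor\n";     # 6
print "1.35 * 2 - 3 / (0.12 + 1) = ", 1.35 * 2 - 3 / (0.12 + 1), "\n";
```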

Loops and cycles (for statement):

# Output all the numbers from 1 to 100
for ($n=1; $n<=100; $n+=1) {
    print "$n\n";
}

1. Initialization:
for ( $n=1 ; ; ) { … }

2. Increment:
for ( ; ; $n+=1 ) { … }

3. Termination test (the loop keeps running while the condition is true):
for ( ; $n<=100 ; ) { … }

4. Body of the loop - commands inside curly brackets:
for ( ; ; ) { … }

FOR & IF -- all the even numbers from 1 to 100:

for ($n=1; $n<=100; $n+=1) {
    if (($n % 2) == 0) {
        print "$n\n";   # one even number per line
    }
}

Note: $a % $b -- Modulus
-- Remainder when $a is divided by $b

Two brief diversions (warnings & strict)
• use warnings;
• strict - forces you to 'declare' a variable the
first time you use it.
  - usage: use strict; (somewhere near the top of your
script)
• declare variables with 'my'
  - usage: my $variable;
  - or: my $variable = 'value';
• my sets the 'scope' of the variable. The variable
exists only within the current block of code
• use strict and my both help you to find
errors and prevent mistakes.
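The scoping rule for `my` is easiest to see in a small script. This is a minimal sketch with made-up variable names: `$inner` exists only inside its block, and `use strict` refuses to compile any later use of it.

```perl
use strict;
use warnings;

my $sequence = 'ATGC';          # declared with my: visible in the rest of the file

{
    my $inner = 'only here';    # exists only inside this block
    print "inside: $inner, $sequence\n";
}

# Uncommenting the next line makes "use strict" stop the script at
# compile time, because $inner no longer exists here:
# print $inner;

print "outside: $sequence\n";
```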

Text Processing Functions

The substr function
• Definition
• The substr function extracts a substring out of a string
and returns it. The function receives 3 arguments: a
string value, a position on the string (starting to count
from 0) and a length.
Example:
• $a = "university";
• $k = substr ($a, 3, 5);
• $k is now "versi"; $a remains unchanged.
• If length is omitted, everything to the end of the string
is returned.
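As a bioinformatics-flavoured illustration of substr, the sketch below cuts a DNA string into codons. The helper name `codons` and the example sequence are assumptions made for this example, not part of the course material.

```perl
use strict;
use warnings;

# Return the codons of a DNA string, reading from position 0.
sub codons {
    my ($dna) = @_;
    my @codons;
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        push @codons, substr($dna, $pos, 3);   # string, start position, length
    }
    return @codons;
}

my $dna = 'ATGGCCTAA';                 # hypothetical example sequence
print join(' ', codons($dna)), "\n";   # ATG GCC TAA
print substr($dna, 3), "\n";           # length omitted: GCCTAA
```

An incomplete trailing codon is simply dropped, which is one possible design choice; a real tool might report it instead.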

Random

$x = rand(1);

• srand
  - The default seed for srand, which used to be time, has
been changed. Now it's a heady mix of difficult-to-predict
system-dependent values, which should be
sufficient for most everyday purposes. Previous to
version 5.004, calling rand without first calling srand
would yield the same sequence of random numbers on
most or all machines. Now, when perl sees that you're
calling rand and haven't yet called srand, it calls srand
with the default seed. You should still call srand
manually if your code might ever be run on a pre-5.004
system, of course, or if you want a seed other
than the default.

Demo/Example

• Exercise: how good are the random
numbers?

• If they are good, you can use them to
calculate Pi …

• A good random number generator is important
for good random sequences, which we
can later use in simulations

Calculate Pi using two random
numbers

[Figure: x and y axes with the value 1 marked.]
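One common way to do this exercise is sketched below. It assumes the intended method is the quarter-circle one: draw x and y uniformly from [0,1) and count how often the point falls within distance 1 of the origin; that fraction approaches Pi/4. The function name `estimate_pi` and the seed value are choices made for this sketch.

```perl
use strict;
use warnings;

# Estimate Pi from $n random points in the unit square.
sub estimate_pi {
    my ($n) = @_;
    my $inside = 0;
    for (1 .. $n) {
        my $x = rand(1);
        my $y = rand(1);
        $inside++ if $x * $x + $y * $y <= 1;   # point lies inside the quarter circle
    }
    return 4 * $inside / $n;   # the quarter circle has area Pi/4
}

srand(42);   # fixed seed (arbitrary) so that a run can be repeated
printf "Pi is approximately %.4f\n", estimate_pi(1_000_000);
```

The estimate converges slowly (the error shrinks roughly with the square root of the number of points), so a poor estimate after many points is a hint that the random numbers are not uniform.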

Introduction

Buffon's Needle is one of the oldest problems in the
field of geometrical probability. It was first stated in
1777. It involves dropping a needle on a lined sheet of
paper and determining the probability of the needle
crossing one of the lines on the page. The remarkable
result is that the probability is directly related to the
value of pi.

http://www.angelfire.com/wa/hurben/buff.html

In PostScript you send it to the printer … PS has no
variables but "stacks"; you can mimic this in Perl by
recursively loading and rewriting a subroutine

– http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html

Programming

• Variables
• Flow control (if, regex …)
• Loops

• input/output
• Subroutines/object


What is a regular expression?

• A regular expression (regex) is simply a
way of describing text.
• Regular expressions are built up of small
units (atoms) which can represent the type
and number of characters in the text
• Regular expressions can be very broad
(describing everything), or very narrow
(describing only one pattern).


Regular Expression Review

• A regular expression (regex) is a way of
describing text.
• Regular expressions are built up of small units
(atoms) which can represent the type and
number of characters in the text
• You can group or quantify atoms to describe
your pattern
• Always use the bind operator (=~) to apply your
regular expression to a variable


Why would you use a regex?

• Often you wish to test a string for the
presence of a specific character, word, or
phrase

Examples

“Are there any letter characters in my string?”
“Is this a valid accession number?”
“Does my sequence contain a start codon (ATG)?”


Regular Expressions

Match to a sequence of characters

The EcoRI restriction enzyme cuts at the consensus
sequence GAATTC.
To find out whether a sequence contains a restriction site for
EcoRI, write:

if ($sequence =~ /GAATTC/) {
    ...
}

Regex-style

[m]/PATTERN/[g][i][o]
s/PATTERN/PATTERN/[g][i][e][o]
tr/PATTERNLIST/PATTERNLIST/[c][d][s]


Regular Expressions

Match to a character class
• Example
• The BstYI restriction enzyme cuts at the consensus sequence
rGATCy, namely A or G in the first position, then GATC, and then T or C.
To find out whether a sequence contains a restriction site for BstYI, write:
• if ($sequence =~ /[AG]GATC[TC]/) {...} # This will match all of
  # AGATCT, GGATCT, AGATCC, GGATCC.
Definition
• When a list of characters is enclosed in square brackets [], one and only
one of these characters must be present at the corresponding position of
the string in order for the pattern to match. You may specify a range of
characters using a hyphen -.
• A caret ^ at the front of the list negates the character class.
Examples
• if ($string =~ /[AGTC]/) {...} # matches any nucleotide
• if ($string =~ /[a-z]/) {...} # matches any lowercase letter
• if ($string =~ /chromosome[1-6]/) {...} # matches chromosome1, chromosome2 ... chromosome6
• if ($string =~ /[^xyzXYZ]/) {...} # matches any character except x, X, y, Y, z, Z
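The character-class pattern can also report where the sites are, not just whether one exists. The sketch below is an illustration only: the helper name `bstyi_sites` and the test sequence are made up, and overlapping sites are not reported.

```perl
use strict;
use warnings;

# Return the 1-based start positions of all BstYI sites (RGATCY) in $seq.
sub bstyi_sites {
    my ($seq) = @_;
    my @starts;
    while ($seq =~ /[AG]GATC[TC]/gi) {
        push @starts, $-[0] + 1;    # $-[0] is the 0-based start of the match
    }
    return @starts;
}

my $sequence = 'TTAGATCTCCGGATCCAAGAATTC';   # made-up test sequence
print "EcoRI site found\n" if $sequence =~ /GAATTC/;
print "BstYI sites start at: ", join(', ', bstyi_sites($sequence)), "\n";   # 3, 11
```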

Constructing a Regex

• Pattern starts and ends with a / : /pattern/
  - if you want to match a /, you need to escape it:
\/ (backslash, forward slash)

  - you can change the delimiter to some other
character, but you probably won't need to:
m|pattern|

• any 'modifiers' to the pattern go after the last /
i : case insensitive /[a-z]/i
o : compile once
g : global, find all matches
m or s : match over multiple lines

Looking for a pattern

• By default, a regular expression is applied to $_
(the default variable)
  - if (/a+/) {die}
looks for one or more 'a' in $_

• If you want to look for the pattern in any other
variable, you must use the bind operator
  - if ($value =~ /a+/) {die}
looks for one or more 'a' in $value

• The bind operator is in no way similar to the
'=' sign!! = is assignment, =~ is bind.
  - if ($value = /[a-z]/) {die}
looks for a lowercase letter in $_, not $value, and
assigns the result of the match to $value!!!

Regular Expression Atoms

• An 'atom' is the smallest unit of a
regular expression.
• Character atoms
0-9, a-z, A-Z match themselves
. (dot) matches any single character (except a newline)
[atgcATGC] : A character class (group)
[a-z] : another character class, a through z

Quantifiers

• You can specify the number of times you want
to see an atom. Examples (\d matches any digit):
• \d* : Zero or more times
• \d+ : One or more times
• \d{3} : Exactly three times
• \d{4,7} : At least four, and not more than seven
• \d{3,} : Three or more times
We could rewrite /\d\d\d-\d\d\d\d/ as:
/\d{3}-\d{4}/

Anchors

• Anchors force a pattern match to a certain
location
• ^ : match at beginning of string
• $ : match at end of string
• \b : match at word boundary (between \w and \W)
• Example:
• /^\d\d\d-\d\d\d\d$/ : matches only strings that look like a phone number (ddd-dddd)
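Quantifiers and anchors together turn a loose search into a format check. A minimal sketch, with a hypothetical helper name `looks_like_phone`:

```perl
use strict;
use warnings;

# True if $text has exactly the form ddd-dddd (three digits, a dash, four digits).
sub looks_like_phone {
    my ($text) = @_;
    return $text =~ /^\d{3}-\d{4}$/ ? 1 : 0;
}

for my $candidate ('353-7236', 'phone: 353-7236', '3537-236') {
    printf "%-16s %s\n", $candidate, looks_like_phone($candidate) ? 'match' : 'no match';
}
```

Without the ^ and $ anchors the second candidate would also match, because the pattern would be free to start anywhere in the string.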

Remembering Stuff

• Being able to match patterns is good, but
limited.
• We want to be able to keep portions of the
matched string for later.
  - Example: $string = 'phone: 353-7236'
We want to keep the phone number only
Just figuring out that the string contains a phone number is
insufficient, we need to keep the number as well.

Memory Parentheses (pattern memory)

• Since we almost always want to keep portions
of the string we have matched, there is a
mechanism built into perl.
• Anything in parentheses within the regular
expression is kept in memory.
  - 'phone:353-7236' =~ /^phone:(.+)$/;
Perl knows we want to keep everything that matches '.+' in the above
pattern

Getting at pattern memory

• Perl stores the matches in a series of default
variables. The first parentheses set goes into
$1, second into $2, etc.
  - This is why we can't name variables ${digit}
  - Memory variables are created only in the amounts
needed. If you have three sets of parentheses, you
have ($1,$2,$3).
  - Memory variables are created for each matched set of
parentheses. If you have one set contained within
another set, you get two variables (sets are numbered by
their opening parenthesis, so the outer set gets the
lower number)
  - Memory variables are only valid in the current scope
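The numbering of nested parentheses is easy to verify on the phone example. This is a small sketch; the pattern and variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $string = 'phone: 353-7236';

# Outer group = whole number, inner groups = its two parts.
if ($string =~ /^phone:\s*((\d{3})-(\d{4}))$/) {
    print "number:   $1\n";   # 353-7236
    print "exchange: $2\n";   # 353
    print "line:     $3\n";   # 7236
}

# In list context a match returns the captured groups directly.
my ($number) = $string =~ /(\d{3}-\d{4})/;
print "again:    $number\n";
```

Copying the captures into named variables straight away, as in the last two lines, avoids surprises when a later match overwrites $1.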

Finding all instances of a match

• Use the 'g' modifier to the regular expression
  - @sites = $sequence =~ /(TATTA)/g;
  - think g for global
  - Returns a list of all the matches (in order), and
stores them in the array
  - If you have more than one pair of parentheses, your
array gets values in sets
($1,$2,$3,$1,$2,$3...)

Perl is Greedy

• In addition to taking all your time, perl regular
expressions also try to match the largest
possible string which fits your pattern
  - /ga+t/ matches gat, gaat, gaaat
  - 'Doh! No doughnuts left!' =~ /(d.+t)/
$1 contains 'doughnuts left'

• If this is not what you wanted to do, use the '?'
modifier
  - /(d.+?t)/ # match as few '.'s as you can and still
make the pattern work
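Running both patterns side by side shows the difference. A minimal sketch using the sentence from the slide:

```perl
use strict;
use warnings;

my $text = 'Doh! No doughnuts left!';

my ($greedy) = $text =~ /(d.+t)/;    # .+ takes as much as it can
my ($lazy)   = $text =~ /(d.+?t)/;   # .+? stops at the first possible 't'

print "greedy: $greedy\n";   # doughnuts left
print "lazy:   $lazy\n";     # doughnut
```

Both matches start at the lowercase 'd' of "doughnuts", because the capital 'D' of "Doh" does not match without the i modifier.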

Substitute function

• s/pattern1/pattern2/;
• Looks kind of like a regular expression
  - Patterns constructed the same way
• Inherited from previous languages, so it can be
a bit different.
  - Changes the variable it is bound to!

tr function

• translate or transliterate
• tr/characterlist1/characterlist2/;
• Even less like a regular expression than s
• substitutes characters in the first list with
characters in the second list
$string =~ tr/a/A/; # changes every 'a' to an 'A'
  - No need for the g modifier when using tr.

Using tr

• Creating complementary DNA sequence
  - $sequence =~ tr/atgc/TACG/;
• Sneaky Perl trick for the day
  - tr does two things.
1. Changes characters in the bound variable
2. Counts the number of times it does this

  - Super-fast character counter™
$a_count = $sequence =~ tr/a/a/;
replaces an 'a' with an 'a' (no net change), and assigns the result
(number of substitutions) to $a_count
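Both uses of tr can be combined into a small reverse-complement helper and a G+C counter. This is a sketch under the assumption that the input contains only A, C, G and T; the function name and example sequence are made up.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (upper or lower case input).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # swap each base for its partner
    return $rc;
}

my $sequence = 'ATGGCCTAA';                 # hypothetical example sequence
my $gc_count = ($sequence =~ tr/GCgc//);    # empty replacement list: count only, no change
print reverse_complement($sequence), "\n";  # TTAGGCCAT
print "G+C: $gc_count of ", length($sequence), "\n";
```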

Regex-Related Special Variables

• Perl has a host of special variables that get filled after every m// or s///
regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last
(highest-numbered) backreference. $& (dollar ampersand) holds the
entire regex match.
• @- is an array of match-start indices into the string. $-[0] holds the start
of the entire regex match, $-[1] the start of the first backreference, etc.
Likewise, @+ holds match-end indices (ends, not lengths).
• $' (dollar followed by an apostrophe or single quote) holds the part of
the string after (to the right of) the regex match. $` (dollar backtick)
holds the part of the string before (to the left of) the regex match. Using
these variables is not recommended in scripts when performance
matters, as it causes Perl to slow down all regex matches in your entire
script.
• All these variables are read-only, and persist until the next regex match
is attempted. They are dynamically scoped, as if they had an implicit
'local' at the start of the enclosing scope. Thus if you do a regex
match, and call a sub that does a regex match, when that sub
returns, your variables are still set as they were for the first match.

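The offset arrays @- and @+ give the same information as $&, $` and $' without needing those variables. A minimal sketch on a made-up sequence; the pattern (a start codon, whole codons, then TAG) is only an illustration.

```perl
use strict;
use warnings;

my $sequence = 'CCATGAAATAG';   # made-up test sequence
my ($start, $end, $group_start);

if ($sequence =~ /ATG((?:[ACGT]{3})*?)TAG/) {
    # Copy the offsets at once: the next match would overwrite them.
    ($start, $end, $group_start) = ($-[0], $+[0], $-[1]);
    my $match  = substr($sequence, $start, $end - $start);   # same text as $&
    my $before = substr($sequence, 0, $start);               # same text as $`
    print "match '$match' spans offsets $start to $end\n";
    print "text before the match: '$before'\n";
    print "first group '$1' starts at offset $group_start\n";
}
```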

Example

Which of the following 4 sequences (seq1/2/3/4)

a) contains a "Galactokinase signature"
http://us.expasy.org/prosite/

b) How many of them?
c) Where (hints: pos and $&)?

>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS
LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK
SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM
MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI
ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK
VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN
VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL
LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS
TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA
LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY
LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP
QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK
HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE
GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL
DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL
WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE
KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ
LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI
NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC
LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR
FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA
ALCALQAVKE KKGLA
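One possible approach to the exercise is sketched below. It assumes the PROSITE galactokinase signature is G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y; that pattern is written from memory of entry PS00106 and should be checked against PROSITE before relying on it. The helper name `find_signatures` is hypothetical, and the two test strings are short fragments copied from SEQ1 and SEQ2 above, not the full sequences.

```perl
use strict;
use warnings;

# Assumed PROSITE pattern G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y as a Perl regex.
my $signature = qr/GR.N[LIV]IG[DE]H.DY/;

# Return [position, matched text] for every signature hit in $protein.
sub find_signatures {
    my ($protein) = @_;
    $protein =~ s/\s+//g;                 # drop spaces and line breaks from the FASTA body
    my @hits;
    while ($protein =~ /$signature/g) {
        push @hits, [ $-[0] + 1, $& ];    # 1-based start position, matched text
    }
    return @hits;
}

# Fragments copied from SEQ1 and SEQ2, just to exercise the code.
my %fragment = (
    SEQ1 => 'SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQ',
    SEQ2 => 'MLDDRARMEA AKKEKVEQIL AEFQLQEEDL',
);
for my $name (sort keys %fragment) {
    my @hits = find_signatures($fragment{$name});
    print "$name: ", scalar(@hits), " hit(s)\n";
    print "  position $_->[0]: $_->[1]\n" for @hits;
}
```

To answer the exercise itself, read each full sequence from the FASTA text and pass it to the same function; positions are then counted after whitespace has been removed.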


Arrays

Definitions
• A scalar variable contains a scalar value: one number or one string.
A string might contain many words, but Perl regards it as one unit.
• An array variable contains a list of scalar data: a list of numbers or a
list of strings or a mixed list of numbers and strings. The order of
elements in the list matters.
Syntax
• Array variable names start with an @ sign.
• You may use in the same program a variable named $var and
another variable named @var, and they will mean two different,
unrelated things.
Example
• Assume we have a list of numbers which were obtained as a result
of some measurement. We can store this list in an array variable as
the following:
• @msr = (3, 2, 5, 9, 7, 13, 16);


The foreach construct

The foreach construct iterates over a list of scalar values
(e.g. that are contained in an array) and executes a block
of code for each of the values.
• Example:
foreach $i (@some_array) {
    statement_1;
    statement_2;
    statement_3;
}
  - Each element in @some_array is aliased to the variable $i in
turn, and the block of code inside the curly brackets {} is
executed once for each element.
• The variable $i (or give it any other name you wish) is
local to the foreach loop and regains its former value
upon exiting of the loop.
• Remark: if you leave out the loop variable, foreach uses $_

Examples for using the foreach construct - cont.

• Calculate sum of all array elements:
#!/usr/local/bin/perl
@msr = (3, 2, 5, 9, 7, 13, 16);
$sum = 0;
foreach $i (@msr) {
    $sum += $i;
}
print "sum is: $sum\n";

Accessing individual array elements

Individual array elements may be accessed by
indicating their position in the list (their index).
Example:
@msr = (3, 2, 5, 9, 7, 13, 16);
index:  0  1  2  3  4  5   6
value:  3  2  5  9  7  13  16
First element: $msr[0] (here has the value of 3),
Third element: $msr[2] (here has the value of 5),
and so on.

The sort function

The sort function receives a list of variables (or an array) and returns the sorted list.

@array2 = sort (@array1);

@countries = ("Israel", "Norway", "France", "Argentina");
@sorted_countries = sort (@countries);
print "ORIG: @countries\n", "SORTED: @sorted_countries\n";
Output:
ORIG: Israel Norway France Argentina
SORTED: Argentina France Israel Norway

@numbers = (1, 2, 4, 16, 18, 32, 64);
@sorted_num = sort (@numbers);
print "ORIG: @numbers \n", "SORTED: @sorted_num \n";
Output:
ORIG: 1 2 4 16 18 32 64
SORTED: 1 16 18 2 32 4 64
Note that by default sort does not sort numbers numerically, but by the string value of each
number.
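To sort numerically, give sort a comparison block that uses the numeric comparison operator `<=>`. A minimal sketch with an arbitrary list of numbers:

```perl
use strict;
use warnings;

my @numbers = (64, 4, 18, 1, 32, 2, 16);

my @as_strings = sort @numbers;                  # default: string comparison
my @ascending  = sort { $a <=> $b } @numbers;    # numeric, small to large
my @descending = sort { $b <=> $a } @numbers;    # numeric, large to small

print "strings:    @as_strings\n";    # 1 16 18 2 32 4 64
print "ascending:  @ascending\n";     # 1 2 4 16 18 32 64
print "descending: @descending\n";    # 64 32 18 16 4 2 1
```

Inside the block, $a and $b are special variables set by sort, which is one reason to avoid $a and $b as ordinary variable names in larger scripts.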

The push and shift functions

The push function adds a variable or a list of variables to the end of a given
array.
Example:
$a = 5;
$b = 7;
@array = ("David", "John", "Gadi");
push (@array, $a, $b);
# @array is now ("David", "John", "Gadi", 5, 7)

The shift function removes the first element of a given array and returns this
element.
Example:
@array = ("David", "John", "Gadi");
$k = shift (@array);
# @array is now ("John", "Gadi"); # $k is now "David"

Note that after both the push and shift operations the given array @array is
changed!

Perl Array review

• An array is designated with the '@' sign
• An array is a list of individual elements
• Arrays are ordered
  - Your list stays in the same order that you created it, although
you can add or remove elements at the front or back of the list
• You access array elements by number, using the
special syntax:
  - $array[1] returns the element at index 1, i.e. the second
element of the array (remember perl starts counting at zero)
• You can do anything with an array element that you
can do with a scalar variable (addition, subtraction,
printing … whatever)

Generate random sequence string

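@a = ("A","C","G","T");    # the alphabet, defined once outside the loop
$r = "";
for ($n=1; $n<=50; $n++) {
    $b = $a[rand(@a)];     # rand(@a) is in [0,4); used as an index it is truncated to 0..3
    $r .= $b;
}

print $r;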


The split function

• The split function splits a string into a list of substrings
according to the positions of a given delimiter. The
delimiter is written as a pattern enclosed by slashes:
/PATTERN/. Examples:
• $string = "programming::course::for::bioinformatics";
• @list = split (/::/, $string);
• # @list is now ("programming", "course", "for", "bioinformatics")
  # $string remains unchanged.
• $string = "protein kinase C\t450 Kilodaltons\t120 Kilobases";
• @list = split (/\t/, $string); # \t indicates tab
• # @list is now ("protein kinase C", "450 Kilodaltons", "120 Kilobases")


The join function
• The join function does the opposite of split. It
receives a delimiter and a list of strings, and joins the
strings into a single string, such that they are
separated by the delimiter.
• Note that the delimiter is written inside quotes.
• Examples:
• @list = ("programming", "course", "for", "bioinformatics");
• $string = join ("::", @list);
• # $string is now "programming::course::for::bioinformatics"
• $name = "protein kinase C"; $mol_weight = "450 Kilodaltons"; $seq_length = "120 Kilobases";
• $string = join ("\t", $name, $mol_weight, $seq_length);
• # $string is now: "protein kinase C\t450 Kilodaltons\t120 Kilobases"

When is an array not good enough?

• Sometimes you want to associate a given value
with another value. (name/value pairs)
(Rob => 353-7236, Matt => 353-7122,
Joe_anonymous => 555-1212)
(Acc#1 => sequence1, Acc#2 => sequence2, Acc#n =>
sequence-n)
• You could put this information into an array, but
it would be difficult to keep your names and
values together (what happens when you sort?
Yuck)


Problem solved: The associative array

• As the name suggests, an associative array allows
you to link a name with a value
• In perl-speak: associative array = hash
  - 'hash' is the preferred term, for various arcane
reasons, including that it is easier to say.
• Consider an array: The elements (values) are
each associated with a name – the index position.
These index positions are
numerical, sequential, and start at zero.
• A hash is similar to an array, but we get to name the
index positions anything we want

The 'structure' of a Hash

• An array looks something like this:
            0       1       2       Index
@array =  'val1'  'val2'  'val3'    Value

• A hash looks something like this:
            Rob       Matt      Joe_A      Key (name)
%phone =  353-7236  353-7122  555-1212    Value

Creating a hash

• There are several methods for creating a hash.
The simplest way – assign a list to a hash.
  - %hash = ('rob', 56, 'joe', 17, 'jeff', 'green');
• Perl is smart enough to know that since you are
assigning a list to a hash, you meant to alternate
keys and values.
  - %hash = ('rob' => 56, 'joe' => 17, 'jeff' => 'green');
• The arrow ('=>') notation helps some people, and
clarifies which keys go with which values. The
perl interpreter sees '=>' as a comma.

Getting at values

• You should expect by now that there is some
way to get at a value, given a key.
• You access a hash value through its key like this:
  - $hash{'key'}
• This should look somewhat familiar
  - $array[21] : refer to a value associated with a
specific index position in an array
  - $hash{key} : refer to a value associated with a
specific key in a hash
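Putting creation, lookup and iteration together, here is a minimal sketch using the phone-book example from the earlier slides. The extra entry for "Wim" and its number are made up for the illustration.

```perl
use strict;
use warnings;

my %phone = (
    Rob           => '353-7236',
    Matt          => '353-7122',
    Joe_anonymous => '555-1212',
);

$phone{Wim} = '555-0000';            # add a pair (made-up number)
print "Rob: $phone{Rob}\n";          # look up one value by its key

# Hashes have no fixed order, so sort the keys for a stable listing.
for my $name (sort keys %phone) {
    print "$name => $phone{$name}\n";
}

print "No entry for Larry\n" unless exists $phone{Larry};
```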

Programming in general and Perl in particular

• There is more than one right way to do it. Unfortunately, there are also
many wrong ways.
  1. Always check and make sure the output is correct and logical.
Consider what errors might occur, and take steps to ensure that you are
accounting for them.
  2. Check to make sure you are using every variable you declare.
use strict!
  3. Always go back to a script once it is working and see if you can
eliminate unnecessary steps.
Concise code is good code.
You will learn more if you optimize your code.
Concise does not mean comment free. Please use as many comments as
you think are necessary.
Sometimes you want to leave easy to understand code in, rather than short
but difficult to understand tricks. Use your judgment.
Remember that in the future, you may wish to use or alter the code you
wrote today. If you don't understand it today, you won't tomorrow.

Programming in general and Perl in particular

Develop your program in stages. Once part of it works, save
the working version to another file (or use a source code
control system like RCS) before continuing to improve it.
When running interactively, show the user signs of activity.
There is no need to dump everything to the screen (unless
requested to), but a few words or a number change every
few minutes will show that your program is doing
something.
Comment your script. Any information on what it is doing or
why might be useful to you a few months later.
Decide on a coding convention and stick to it. For example,
  - for variable names, begin globals with a capital letter and privates
(my) with a lower case letter
  - indent new control structures with (say) 2 spaces
  - line up closing braces, as in: if (....) { ... ... }

CPAN

• CPAN: The Comprehensive Perl Archive
Network is available at www.cpan.org
and is a very large repository of Perl
modules for all kinds of tasks (including
BioPerl)

What is BioPerl?

• An 'open source' project
  - http://bio.perl.org or http://www.cpan.org
• A loose international collaboration of
biologist/programmers
  - Nobody (that I know of) gets paid for this
• A collection of Perl modules and methods for
doing a number of bioinformatics tasks
  - Think of it as subroutines to do biology
• Consider it a 'tool-box'
  - There are a lot of nice tools in there, and (usually)
somebody else takes care of fixing parsers when they
break
• BioPerl code is portable - if you give somebody a
script, it will probably work on their system

Multi-line parsing

use strict;
use Bio::SeqIO;

my $filename = "sw.txt";    # Swiss-Prot flat file with one or more entries
my $sequence_object;

my $seqio = Bio::SeqIO->new(
    '-format' => 'swiss',
    '-file'   => $filename
);

# next_seq returns one entry per call, and undef when the file is exhausted
while ($sequence_object = $seqio->next_seq) {
    my $sequentie = $sequence_object->seq();
    print $sequentie . "\n";
}

Live.pl

#!e:\Perl\bin\perl.exe -w
# script for fetching a GenBank entry, printing out its name and sequence
use Bio::DB::GenBank;
use Data::Dumper;

$gb = Bio::DB::GenBank->new();

$sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');
print Dumper ($sequence_object);

$seq1_id = $sequence_object->display_id();
$seq1_s = $sequence_object->seq();
print "seq1 display id is $seq1_id \n";
print "seq1 sequence is $seq1_s \n";

Bioperl 101: 2 ESSENTIAL TOOLS

Data::Dumper to find out what
class you're in

Perl bptutorial (100 Bio::Seq) to
find the available methods for that
class

Outline

• Scripting

• Databases
Genome Browser

• AI

Overview

• Bots and Spiders
  - The web
  - Bots
  - Spiders
  - Real world examples
  - Bioinformatics applications
  - Perl – LWP libraries
  - Google hacks
  - Advanced APIs
  - Fetch data from NCBI / Ensembl /

The web

• The WWW-part of the
Internet is based on
hyperlinks
• So if one started to
follow all hyperlinks, it
would be possible to
map almost the entire
WWW
• Everything you can do
as a human (clicking,
filling in forms,…) can be
done by machines

Bots

• Webbots (web robots, WWW robots, bots): software applications
that run automated tasks over the Internet
• Bots perform tasks that:
  - Are simple
  - Are structurally repetitive
  - Run at a much higher rate than would be possible for a human
• An automated script fetches, analyses and files information from
web servers at many times the speed of a human
• Other uses:
  - Chatbots
  - IM / Skype / Wiki bots
  - Malicious bots and bot networks (Zombies)

Spiders

• Webspiders / crawlers are programs
or automated scripts which browse
the World Wide Web in a
methodical, automated manner. They are
one type of bot
• The spider starts with a list of URLs
to visit, called the seeds
  - As the crawler visits these
URLs, it identifies all the
hyperlinks in the page
  - It adds them to the list of URLs to
visit, called the crawl frontier
  - URLs from the frontier are
recursively visited according to a
set of policies
• This process is called web crawling
or spidering: in most cases a mean …
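The seed / frontier loop can be sketched without touching the network. In the example below the "web" is a hypothetical in-memory hash of page names and their links, and the only policy is "never queue a URL twice"; a real crawler would fetch each page and extract its links instead.

```perl
use strict;
use warnings;

# Toy "web": each hypothetical page name maps to the links found on it.
my %links_on = (
    'seed.html' => [ 'a.html', 'b.html' ],
    'a.html'    => [ 'b.html', 'c.html' ],
    'b.html'    => [ 'seed.html' ],
    'c.html'    => [],
);

# Visit pages breadth-first from the seeds; return them in visiting order.
sub crawl {
    my ($web, @seeds) = @_;
    my @frontier = @seeds;     # URLs still to visit
    my %seen     = map { $_ => 1 } @seeds;
    my @visited;
    while (@frontier) {
        my $url = shift @frontier;
        push @visited, $url;
        for my $link (@{ $web->{$url} || [] }) {
            next if $seen{$link}++;    # policy: never queue a URL twice
            push @frontier, $link;
        }
    }
    return @visited;
}

print join(' -> ', crawl(\%links_on, 'seed.html')), "\n";
```

Using shift on the frontier gives a breadth-first crawl; replacing it with pop would make the same code depth-first.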

Spiders

Use of webcrawlers:
  - Mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded
pages to provide fast searches
  - Automating maintenance tasks on a website, such as checking
links or validating HTML code
  - Can be used to gather specific types of information from Web
pages, such as harvesting e-mail addresses
  - The most commonly used crawler is probably the GoogleBot crawler
  - Crawls
  - Indexes (content + key content tags and attributes, such as Title
tags and ALT attributes)
  - Serves results: PageRank Technology

Perl - LWP

LWP (also known as libwww-perl)
The World-Wide Web library for Perl
Set of Perl modules which provides a simple and
consistent application programming interface (API) to
the World-Wide Web
Free book: http://lwp.interglacial.com/
 LWP for newbies
LWP::Simple (demo1)
Go to a URL and fetch the data, which is then ready to parse
Caution: take care when matching HTML tags with regular expressions
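As a hedged illustration of the parsing step, the sketch below pulls hyperlinks out of a page with Python's standard `html.parser` instead of a regular expression. It is a Python analogue, not the LWP demo itself; the class name `LinkCollector` and the sample page are made up for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Hypothetical page content; a real script would first fetch it from a URL.
page = ('<p><A HREF="http://genome.ucsc.edu/">UCSC</A> and '
        "<a class='x' href='http://www.ensembl.org/'>Ensembl</a></p>")
links = extract_links(page)
print(links)
```

A parser copes with upper-case tags, extra attributes and either quote style, all of which a simple regular expression tends to miss.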

91

Perl - LWP

 Some more advanced features
 LWP::UserAgent (demo2 – show server access logs)
 Fill in forms and parse results
 Depending on content: follow hyperlinks to other pages
and parse these again,…
 Bioinformatics examples
 Use genome browser data (demo3) and sequences
 Get gene aliases and symbols from GeneCards (demo4)

92

Google hacks

 Why not make use of the crawling, indexing and
serving technologies of others (e.g. Google)?
Google allows automated queries: 1000 queries
a day per account
Google uses snippets: the short pieces of text you get
in the main search results
These are the result of its indexing and parsing algorithms
Demo5: LWP and Google combined, and parsing the
results

93

Advanced APIs

 An application programming interface (API) is a source
code interface that an operating system, library or service
provides to support requests made by computer programs
 Language-dependent APIs
 Language-independent APIs are written in a way they can
be called from several programming languages. This is a
desired feature for service style API which is not bound to
a particular process or system and is available as a
remote procedure call

94

Advanced APIs

 Google example used Google API / SOAP
 NCBI API
 The NCBI Web service is a web program that enables
developers to access Entrez Utilities via the Simple Object
Access Protocol (SOAP)
 Programmers may write software applications that access
the E-Utilities using any SOAP development tool
 Main tools (demo6):
E-Search: Searches and retrieves primary IDs and term
translations and optionally retains results for future use in
the user's environment
E-Fetch: Retrieves records in the requested format from a
list of one or more primary IDs
 Ensembl API (demo7)
95

Fetch data from NCBI

 A frequently used NCBI database is PubMed
PubMed can be queried using E-Utils
Uses the same query syntax as the regular PubMed website
Returns the data in the same formats as the website
(XML, Plain Text)
Parse the XML results and apply more advanced text-mining
techniques
Demo8
Parse results and present them in an interface
(http://matrix.ugent.be/mate/methylome/result1.html)
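A minimal sketch of the E-Search / E-Fetch round trip, under stated assumptions: it uses the URL-based interface of the E-utilities rather than SOAP, it only builds the request URLs, and the XML it parses is a made-up sample with invented IDs, not a real PubMed reply. The function names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(term, db="pubmed", retmax=20):
    """URL of an E-Search request: returns primary IDs for a query term."""
    return EUTILS + "esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})

def efetch_url(ids, db="pubmed", retmode="xml"):
    """URL of an E-Fetch request: returns the records for a list of IDs."""
    return EUTILS + "efetch.fcgi?" + urlencode({"db": db, "id": ",".join(ids), "retmode": retmode})

def parse_esearch_ids(xml_text):
    """Extract the primary IDs from an E-Search XML result."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]

# Made-up sample reply (the IDs are invented for this example).
SAMPLE = ("<eSearchResult><Count>2</Count>"
          "<IdList><Id>11111111</Id><Id>22222222</Id></IdList></eSearchResult>")

search = esearch_url("BRCA1 AND methylation", retmax=5)
ids = parse_esearch_ids(SAMPLE)
fetch = efetch_url(ids)
print(search)
print(fetch)
```

In a real script the two URLs would be fetched with an HTTP client and the E-Fetch XML passed on to the text-mining step.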

96

Fetch data from NCBI

 Example: PubMeth
Get data from NCBI PubMed
Get all genes and all aliases for human genes and their
annotations from Ensembl & GeneCards
Get all cancer types from cancer thesaurus
Parse PubMed results: find genes and aliases;
keywords
Keep variants in mind (Regexes are very useful)
Sort the PubMed abstracts and store found genes and
keywords in database; apply scoring scheme
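The "keep variants in mind" step could look roughly like this in Python. The alias table is a tiny hypothetical stand-in for the Ensembl / GeneCards lists, and allowing an optional hyphen or space between letter and digit blocks is just one possible way to catch spelling variants.

```python
import re

# Hypothetical alias table: official symbol -> aliases seen in abstracts.
ALIASES = {
    "CDKN2A": ["CDKN2A", "p16", "p16INK4a", "INK4A"],
    "MGMT": ["MGMT"],
}

def alias_pattern(alias):
    """Regex for an alias, tolerating '-' or a space between letter and digit blocks."""
    parts = re.findall(r"[A-Za-z]+|\d+", alias)
    body = r"[-\s]?".join(re.escape(part) for part in parts)
    # The lookarounds stop "p16" from matching inside "p160".
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE)

PATTERNS = {symbol: [alias_pattern(a) for a in aliases]
            for symbol, aliases in ALIASES.items()}

def find_genes(text):
    """Return the official symbols of all genes mentioned in the text."""
    return sorted(symbol for symbol, patterns in PATTERNS.items()
                  if any(p.search(text) for p in patterns))

abstract = ("Hypermethylation of the p16-INK4a promoter and of MGMT "
            "was observed; p160 was not tested.")
print(find_genes(abstract))
```

Case-insensitive matching increases recall but can also produce false hits for short aliases, which is one reason a scoring scheme is applied afterwards.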

97

Outline

• Scripting

• Databases
Genome Browser

• AI
98

The three genome browsers

• There are three main browsers:
 Ensembl
 NCBI MapViewer
 UCSC
• At first glance their main distinguishing features are:
 MapViewer is arranged vertically.
 Ensembl has multiple (22) different “Views”.
 UCSC has a single “View” for (almost) everything.

99

MapViewer
Home

http://www.ncbi.nlm.nih.gov/mapview/ 100

MapViewer Master Map

101

Selecting tracks on MapViewer

102

MapViewer strengths

• Good coverage of plant and fungal genomes.
• Close integration with other NCBI tools and
databases, such as Model Maker, trace archives
or Celera assemblies.
• Vertical view enables convenient overview of
regional gene descriptions.
• Discontiguous MEGABLAST is probably the most
sensitive tool available for cross-species sequence
queries.
• Ability to view multiple assemblies (e.g. Celera
and reference) simultaneously.

103

MapViewer limitations

• Little cross-species conservation or alignment
data.
• Inability to upload custom annotations and data.
• Limited capability for batch data access.
• Limited support for automated database querying.
• Vertical view makes base-pair level annotation
cumbersome.

104

UCSC Genome Browser

105

http://genome.ucsc.edu/

106

UCSC Genome Browser

107

Strengths of the UCSC Browser (I)

For this course I will be focusing primarily on the
UCSC Browser for several reasons:
• Strong comparative genomics capabilities.
• Fast response
 sequence searches performed with BLAT.
 code is written in speed-optimized C.
 Multiple indexing and non-normalized tables for fast
database retrieval.
• (Essentially) single “view” from single base-pair to
entire chromosome.
• Easiest interface for loading custom annotations.
108

UCSC Browser Strengths (II)

• Well suited for batch and automated querying of both
gene and intergenic regions.
• Comprehensive: tends to have the most species,
genes and annotations.
• Annotations frequently updated (Genbank/Refseq
daily / ESTs weekly).
• Able to find “similar” genes easily with GeneSorter.
• Rapid access to in situ images with VisiGene.

109

UCSC browser limitations

• Lack of “overview” mode can make it harder to see
genomic context.
• Syntenic regions cannot be viewed simultaneously.
• Cross species sequence queries with BLAT are
often insensitive.
• Comprehensiveness of database can make user
interface intimidating.
• Code access for commercial users requires
licensing.

110

Human, mouse, rat synteny in MapViewer

111

Browser/Database Batch
Querying

112

Batch querying overview

• Introduction / motivation
• UCSC table browser
• Custom tracks and frames
• Galaxy and direct SQL database
querying
• A batch query example
• UCSC Database “gotchas”
• Batch querying on Ensembl

113

Why batch querying

• Interactive querying is difficult if you want to study
numerous “interesting” genomic regions.

• Querying each region interactively is:
 Tedious
 Time-consuming
 Error prone

114

Batch querying examples

• As an example, say you have found one hundred candidate
polymorphisms and you want to know:
 Are they in dbSNP?
 Do they occur in any known ESTs?
 Are the sites conserved in other vertebrates?
 Are they near any ”LINE” repeat sequences?

Of course you could repeat the procedures described in
Part II one hundred times but that would get “old” very fast…

115

Other examples

• Other examples include characterizing multiple:
 Non-coding RNA candidates
 ultra-conserved regions
 introns hosting snoRNA genes

116

Browsers and databases

• Each of the genome browsers is built on top of
multiple relational databases.

• Typically data for each genome assembly are stored
in a separate database and auxiliary data, e.g. gene
ontology (GO) data, are stored in yet other
databases.

• These databases may have hundreds of tables,
many with millions of entries.

117

The UCSC Table Browser

• For batch queries, you need to query the
browser databases.

• The conventional way of querying a relational
database is via “Structured Query Language”
(SQL).

• However with the Table Browser, you can
query the database without using SQL.

118

Browser Database Formats

Nevertheless, even with the Table Browser, you need
some understanding of the underlying track, table and
file formats.
 Table formats describe how data is stored in the (relational)
databases.
 Track formats describe how the data is presented on the
browser.
 File formats describe how the data is stored in “flat files” in
conventional computer files.
 Finally, to understand the underlying computer code
(as we will do in the last part of this tutorial) you will need to
learn about the “C” structures which hold the data in the
source code.
119

Main UCSC Data Formats

• GFF/GTF
• BED (Browser Extensible Data)
 lists of genomic blocks
• PSL
 RNA/DNA alignments
• .chain
 pair-wise cross species alignments
• .maf
 multiple genome alignments
• .wig
 numerical data

120

Custom Tracks

• Custom tracks are essentially BED, PSL or GTF files
with formatting lines so they can be displayed on the
browser.
• A custom track file can contain multiple tracks, which
may be in different formats.
• Custom tracks are useful for:
 Display of regions of interest on the browser.
 Sharing custom data with others.
 Input of multiple, arbitrary regions for annotation by the Table
Browser.
• Custom tracks can be made by the Table Browser, or
you can make them easily yourself.

121

Selecting custom track output

122

Sending custom track to browser

123

Adding a custom track

124

Adding a custom track (II)

125

Custom track example

browser position chr22:10000000-10020000
browser hide all
track name=clones description="Clones" visibility=3 color=0,128,0 useScore=1
chr22 10000000 10004000 cloneA 960
chr22 10002000 10006000 cloneB 200
chr22 10005000 10009000 cloneC 700
chr22 10006000 10010000 cloneD 600
chr22 10011000 10015000 cloneE 300
chr22 10012000 10017000 cloneF 100
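Making such a file yourself is mostly string formatting. The sketch below writes a custom track like the one above from a list of regions; the function name `make_custom_track` and its fixed display settings are assumptions for the example.

```python
def make_custom_track(name, description, regions, position=None):
    """Build custom track text: optional browser line, one track line, then BED rows."""
    lines = []
    if position:
        lines.append(f"browser position {position}")
    lines.append(f'track name={name} description="{description}" '
                 "visibility=3 color=0,128,0 useScore=1")
    for chrom, start, end, label, score in regions:
        # BED columns: chromosome, start, end, name, score
        lines.append(f"{chrom}\t{start}\t{end}\t{label}\t{score}")
    return "\n".join(lines) + "\n"

regions = [
    ("chr22", 10000000, 10004000, "cloneA", 960),
    ("chr22", 10002000, 10006000, "cloneB", 200),
]
track = make_custom_track("clones", "Clones", regions,
                          position="chr22:10000000-10020000")
print(track)
```

The resulting text can be pasted into the browser's custom track form or saved to a file and uploaded.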

126

Limitations of the table browser

• Can be difficult to create more complex queries.
• With hundreds of tables, finding the one(s) you
want can be confusing.
• Getting intersections or unions of genomic regions
is often a multi-step process and can be tedious or
error prone.
• May be slower than direct SQL query.
• Not designed for fully automated operation.
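For comparison, an intersection of two region lists takes only a few lines when done in a script. This is a naive sketch assuming BED-style half-open coordinates; the function name `intersect` and the regions are illustrative only.

```python
def intersect(regions_a, regions_b):
    """Return the overlapping parts of two lists of (chrom, start, end) regions."""
    hits = []
    for chrom_a, start_a, end_a in regions_a:
        for chrom_b, start_b, end_b in regions_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:   # half-open intervals: touching ends do not overlap
                hits.append((chrom_a, start, end))
    return sorted(hits)

clone_a = [("chr22", 10000000, 10004000)]
others = [("chr22", 10002000, 10006000), ("chr21", 10002000, 10006000)]
print(intersect(clone_a, others))
```

The double loop is quadratic; for genome-scale lists one would sort the regions first or hand the job to a dedicated interval tool.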

127

Ensembl Home http://www.ensembl.org/

129

Ensembl ContigView

130

Ensembl ContigView

131

Detail and Basepair view

132

Changing tracks in Ensembl

133

Ensembl strengths (I)

• Multiple view levels show genomic context.

• Some annotations are more complete and/or are
more clearly presented (e.g. snpView of multiple
mouse strain data.)

• Possible to create query over more than one genome
database at a time (with BioMart).

134

Ensembl snpView

135

Ensembl strengths (II)

• Batch and automated querying well supported and
documented (especially for perl and java).
• API (programmer interface) is designed to be
identical for all databases in a release.
• Ensembl tends to be more “community oriented” -
using standard, widely used tools and data formats.
• All data and code are completely free to all.

136

Ensembl is “community oriented”

• Close alliances with Wormbase, Flybase, SGD
• “support for easy integration with third party data and/or
programs” – BioMart
• Close integration with R/ Bioconductor software
• More use of community standard formats and
programs, e.g. DAS, GFF/GTF, Bioperl

( Note: UCSC also supports GFF/GTF and is
compatible with R/Bioconductor and DAS, but UCSC
tends to use more “homegrown” formats, e.g.
BED, PSL, and tools.)

137

Ensembl limitations

• Limited data quantifying cross-species
sequence conservation.
• Batch queries for intergenic regions with
BioMart are difficult.
• BioMart offers less complete access to
database than UCSC Table Browser.
(However, the user interface to BioMart
is easier.)

138

BioMart

• BioMart - the Ensembl “Table browser”
• Similar to the Table Browser and Galaxy tools.
• Previous version was called EnsMart.
• Fewer tables can be accessed with BioMart than
with UCSC Table Browser. In particular, non-gene
oriented queries may be difficult.
• However, the user interface is simpler.
• Tight interface with Bioconductor project for
annotation of microarray genes.

139

The Galaxy Website

• Galaxy website: http://g2.bx.psu.edu

• Galaxy objective: Provide sequence and data
manipulation tools (a la SRS or the UCSD Biology
Workbench) that are capable of being applied to genomic
data.

• The intent is to provide an easy interface to numerous
analysis tools with varied output formats that can work on
data from multiple browsers / databases.

140

Demo: Galaxy Genomics Toolkit

• Galaxy is a web interface to bioinformatics tools that
deal with genome-scale data
• There is a public server with many pre-installed tools
• Many tools work with genomic intervals
• Other tools work with various types of tab delimited
data formats, and some directly on DNA sequences
• It has excellent tools to access public data
• It can be installed on a local computer or set up as an
institutional server
• Can access a standard or custom build on Amazon
“Cloud”
• Any command line tool or web service can easily be
wrapped into the Galaxy interface.
142

Genome-Scale Data

• Bioinformatics work is challenging on
very large “genomics” data sets
 sequencing, gene expression, variants,
ChIPseq
• Complex command line programs
• Genome Browsers
• New tools

143

The Galaxy Interface has 3 parts
History =
List of Tools Central work panel data & results

144

Load Data from UCSC

Or upload from your computer 145

Demo: Galaxy Genomics Toolkit

• A Galaxy instance is running at http://athos.ugent.be:8080
• Log in (as admin: new@new.be, password: newnew)
• The cleanfq history contains 2 pairs of FASTQ files, a reference FASTA and a reference GTF

146

Workflows

• Galaxy saves your data and results in the
History
• The exact commands and parameters used with
each operation with each tool are also saved.
• These operations can be saved as a
“Workflow”, which can be reused, and shared
with other users.

147

• Galaxy has many public
data sets and public
workflows, which can be
easily used in your projects
(or a tutorial)

148

NGS tools

• Galaxy has recently been expanded with tools to
analyze Next-Gen Sequence data
• File format conversions
• Analysis methods specific to different sequencing
platforms (454, Illumina, SOLID)
• Analysis methods specific to different applications
(RNA-seq, ChIP-seq, mutation finding,
metagenomics, etc).

149

• NGS tools include file
format conversion, mapping
to reference genome,
ChIPseq peak calling, RNA-
seq gene expression, etc.

• NGS data analysis uses
large files – slow to upload
and slow to process on a
public server

A number of groups have set up custom Galaxy
servers with special tools

151

The SPARQLing future

152

Outline

• Scripting

• Databases
Genome Browser

• AI
153

What is “intelligent”?

• Intelligence = the ability to
learn and understand, to solve
problems and to make
decisions
Machine learning …

154

Turing test for intelligence

THE IMITATION GAME

Woman
Man/Machine
Interrogator: which of the
two is the woman?

155

What is “artificial”?

• Artificial = man-made,
not of natural
origin
• In the context of A.I.: machines, usually
a digital computer
• H. Simon: analogy between humans and the digital
computer
 memory
 execution unit
 control unit

156

Data mining

• WHAT? Extracting knowledge from data
• Divide the data into three groups:
 training set
 validation set
 test set
• Clustering/Classification

157

Clustering

• WHAT? “Unsupervised learning”: the
answer for the training data is not
known
• Result usually shown as a tree structure
• Important method: hierarchical
clustering (building a distance matrix)

158

Cluster Analysis

• Unsupervised methods
• Descriptive modeling
 Grouping of genes with “similar” expression
profiles
 Grouping of disease tissues, cell lines, or
toxicants with “similar” effects on gene
expression
• Clustering algorithms
 Self-organizing maps
 Hierarchical clustering
 K-means clustering
 SVD
159

Linkage in Hierarchical Clustering

• Single linkage:
$S(A,B) = \min_{a \in A} \min_{b \in B} d(a,b)$
• Average linkage:
$A(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$
• Complete linkage:
$C(A,B) = \max_{a \in A} \max_{b \in B} d(a,b)$
• Centroid linkage:
$M(A,B) = d(\mathrm{mean}(A), \mathrm{mean}(B))$
• Hausdorff linkage:
$h(A,B) = \max_{a \in A} \min_{b \in B} d(a,b)$
$H(A,B) = \max(h(A,B), h(B,A))$
• Ward linkage:
$W(A,B) = \frac{|A|\,|B|\,M(A,B)^2}{|A| + |B|}$
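The first four linkage definitions translate almost literally into code. A small sketch, assuming Euclidean distance for d and clusters given as lists of coordinate tuples (both are choices made for this example):

```python
from itertools import product
from math import dist   # Euclidean distance between two points (Python 3.8+)

def single_linkage(A, B):
    return min(dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    return max(dist(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def centroid_linkage(A, B):
    return dist(mean(A), mean(B))

A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]
for linkage in (single_linkage, average_linkage, complete_linkage, centroid_linkage):
    print(linkage.__name__, round(linkage(A, B), 3))
```

For any pair of clusters the single linkage is never larger than the average linkage, which in turn is never larger than the complete linkage.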

160

Hierarchical Clustering

3 clusters?
2 clusters?

161

Classification

• WHAT? “Supervised learning”: the answer
for the training data is known
• Various classification methods:
 decision tree
 neural networks
 support vector machines

162

Decision tree

Example: tennis

163

Neural networks

STRUCTURE: neurons and connections
TASK:
processing input data
machine learning

164

Support Vector Machines

Impose a linear separation on the data
by transforming its dimensions

165

Bioinformatics applications

• Decision tree: searching for DNA sequences
homologous to a given DNA sequence
• Neural networks: modelling and analysing
gene expression data, predicting
protease cleavage sites
• Support Vector Machines: identifying
genes involved in anti-cancer mechanisms,
detecting homology between proteins,
gene expression analysis

166

Bioinformatics applications

• Hierarchical clustering: building phylogenetic
trees from DNA sequences
• Genetic algorithms: molecular recognition,
elucidating the relation between structure and function,
Multiple Sequence Alignment
• Expert systems: detecting injuries,
early detection of heart valve abnormalities
• Fuzzy logic: primer design, predicting the
function of an unknown gene, expression
analysis

167

Classification

[Figure: a mixed group of samples passes through the OMS classifier, which separates the cancer (C) samples from the normal (N) samples]

169
C: cancer, N: normal

Classification

[Figure: a mixed group of samples passes through the OMS classifier, which separates the responders (R) from the non-responders (N)]

R: responder
N: non-responder
170

OMS Classifier using “Methylation”

Patient

Sample

Measuring Methylation

Gene Gen 1 Gen 2 Gen 3 … Gen n
Methylated + - - … +

OMS
classifier

Cancer Normal
172

Why use methylation as a biomarker ?

• What is feature/biomarker ?
 A characteristic that is objectively
measured and evaluated as an indicator
of normal biological processes,
pathogenic processes, or pharmacologic
responses to a therapeutic intervention

• Business/biological feature
selection/reduction
 Of all possible (molecular and clinical)
features oncomethylome measures
methylation (in cancer/onco)
173

Data preparation and modelling

• Data preparation
 Construct binary features « Methylated » from
PCR data (Ct and Temp)

• Modelling
 Construct classifier (cancer vs normal) from
features « Methylated »

175

Data Preparation: Feature Construction

Sample

Methylation Specific
Quantitative PCR

Temp 78 81 69 … 72
Ct 25 38 24 … 27

Feature construction: “gene Methylated in sample”

Methylated + - - … +
Compute « methylated » as function of Temp and Ct
176

Construction of features « Methylated »

• Per gene: find boolean function
 Methylated IFF:
Ct below upperbound AND
Temp above lowerbound

• Taking into account
 All Ct and Temp measurements from
Methylation Specific Quantitative PCR (QMSP) for
normals and cancers

 Noise in QMSP measurements
As observed per gene during Quality Control
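The boolean function itself is a one-liner. In the sketch below the Ct and Temp values are the ones shown on the feature construction slide, while the two cut-offs (35 and 70) are hypothetical and shared by all genes purely for illustration; in practice each gene gets its own cut-offs.

```python
def is_methylated(ct, temp, ct_upper, temp_lower):
    """Methylated IFF Ct is below its upper bound AND Temp is above its lower bound."""
    return ct < ct_upper and temp > temp_lower

# (Ct, Temp) per gene for one sample, taken from the feature construction slide.
measurements = {"Gen 1": (25, 78), "Gen 2": (38, 81), "Gen 3": (24, 69), "Gen n": (27, 72)}

# Hypothetical cut-offs, identical for every gene in this toy example.
CT_UPPER, TEMP_LOWER = 35, 70

features = {gene: is_methylated(ct, temp, CT_UPPER, TEMP_LOWER)
            for gene, (ct, temp) in measurements.items()}
print(features)
```

With these assumed cut-offs the sample comes out as + - - +, the same pattern as on the slide.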

177


Plot of all Ct and Temp measurements for a given gene

Temp

Ct

What about noise?
178

Noise

 Noise: random error or variance in a
measured variable
 Incorrect attribute values may be due to
 Quantity not correctly compared to calibration
(e.g., ruler slips)
 Inaccurate calibration device (e.g., ruler > 1m)
 Precision (e.g., truncated to nearest mile or Ångstrom unit)
 Data entry problems
 Data transmission problems
 Inconsistency in naming convention

179

Construction of features «Methylated»
Taking into account noise

QC: StdDev of Ct and Tm in IVM
StDev 1.6 StDev 0.3

StDev 0.02
StDev 3.5

Cancer

Inrobust assay Cut-off Robust assay
Normal

180


Good Reproducibility Bad Reproducibility

Methylated
Methylated

Blunt cut-off

Methylated Methylated

Sharp cut-off

181


Find most robust cut-off for each gene
Compute quality with increasing noise levels (0-2 times StdDev)

[Figure: quality (0 to 1) plotted against noise level (0 to 2 times StdDev) for a non-robust assay and a robust assay]

Quality score based on binomial test

46 or more successes with 58 trials unlikely
16 or more successes with 44 trials likely
When probability success = 80/179
when probability success = 77/175
Expected nr successes = 21
Expected nr successes = 19
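The "k or more successes in n trials" probability behind such a score is the upper tail of a binomial distribution. A sketch using only the standard library; pairing the two examples in the order the numbers appear above is an assumption, and how the tail probability is turned into the final quality score is not specified here.

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed pairing of the numbers on the slide.
unlikely = binomial_tail(46, 58, 80 / 179)
likely = binomial_tail(16, 44, 77 / 175)
print(unlikely, likely)
```

A very small tail probability means the observed number of successes is hard to explain by chance alone.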
182


Methylated: inside red box

183


Methylated Unmethylated Ranked Genes
Cancer
Normal

184

Data preparation and modelling

• Data preparation
 Construct binary features « Methylated » from
PCR data (Ct and Temp)

• Modelling
 Construct classifier (cancer vs normal) from
« Methylated » features

185

Selection of modelling technique

• In theory, many techniques applicable
 Data type: boolean methylation table, discrete
classes
 See other talks today
• But, additional requirements follow from
business understanding (more details below)
 Feature selection
Final test should be based on at most ~5 genes

 Understandability
 Both provide a direct competitive advantage
• Example of acceptable technique: decision
trees 186

Decision trees
The Weka tool
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/
187

Decision trees
Attribute selection
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no

(legend: play = yes, don't play = no)

 maximal gain of information
 maximal reduction of entropy
$\mathrm{Entropy} = - p_{yes} \log_2 p_{yes} - p_{no} \log_2 p_{no}$, with $p_{yes} = 9/14$ and $p_{no} = 5/14$
$= - \frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14}$
$= 0.94$ bits
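The entropy calculation can be checked with a few lines of Python. The function name `entropy` and the choice to pass class counts are assumptions of this sketch.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# 9 examples of "play" and 5 of "don't play"
print(round(entropy([9, 5]), 2))
```

A pure node such as [4, 0] has entropy 0 and an evenly split node such as [3, 3] has entropy 1 bit.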
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
188

Decision trees
Attribute selection

Entropy before splitting: 0.94 bits

attribute    value     play  don't play  entropy    weight
outlook      sunny     2     3           0.97 bits  5/14
outlook      overcast  4     0           0.0 bits   4/14
outlook      rainy     3     2           0.97 bits  5/14
humidity     high      3     4           0.98 bits  7/14
humidity     normal    6     1           0.59 bits  7/14
temperature  hot       2     2           1.0 bits   4/14
temperature  mild      4     2           0.92 bits  6/14
temperature  cool      3     1           0.81 bits  4/14
windy        false     6     2           0.81 bits  8/14
windy        true      3     3           1.0 bits   6/14

entropy = amount of information required to specify class of an example given that it reaches node

attribute    weighted sum  gain
outlook      0.69 bits     0.25 bits
humidity     0.79 bits     0.15 bits
temperature  0.91 bits     0.03 bits
windy        0.89 bits     0.05 bits
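The gains can be recomputed from the weather data with a short script. The data rows are the ones from the Weka example; the function names are made up for this sketch.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [line.split(",") for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])   # last column is the class "play"
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f} bits")
```

The attribute with the largest gain, outlook, becomes the root of the tree.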

Decision trees
Attribute selection: branch outlook = sunny (0.97 bits)

outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes

attribute    value   entropy    weight
humidity     high    0.0 bits   3/5
humidity     normal  0.0 bits   2/5
temperature  hot     0.0 bits   2/5
temperature  mild    1.0 bits   2/5
temperature  cool    0.0 bits   1/5
windy        false   0.92 bits  3/5
windy        true    1.0 bits   2/5

attribute    weighted sum  gain
humidity     0.0 bits      0.97 bits
temperature  0.40 bits     0.57 bits
windy        0.95 bits     0.02 bits

Decision trees
Attribute selection: branch outlook = rainy (0.97 bits); the sunny branch has already been split on humidity

outlook temperature humidity windy play
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
rainy mild normal FALSE yes
rainy mild high TRUE no

attribute    value   entropy    weight
humidity     high    1.0 bits   2/5
humidity     normal  0.92 bits  3/5
temperature  mild    0.92 bits  3/5
temperature  cool    1.0 bits   2/5
windy        false   0.0 bits   3/5
windy        true    0.0 bits   2/5

attribute    weighted sum  gain
humidity     0.95 bits     0.02 bits
temperature  0.95 bits     0.02 bits
windy        0.0 bits      0.97 bits

Decision trees
final tree

outlook
  sunny: humidity
    high: don't play
    normal: play
  overcast: play
  rainy: windy
    false: play
    true: don't play

192

Decision trees
Basic algorithm

• Initialize top node to all examples
• While impure leaves available
 select next impure leaf L
 find splitting attribute A with maximal information gain
 for each value of A add child to L
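A compact sketch of this algorithm on the weather data. It is written recursively rather than with an explicit loop over impure leaves, which visits the same nodes; the function names and the nested-tuple tree representation are choices made for this example.

```python
from collections import Counter
from math import log2

NAMES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [line.split(",") for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    weighted = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy([row[-1] for row in rows]) - weighted

def build_tree(rows, columns):
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:            # pure leaf: nothing left to split
        return labels[0]
    if not columns:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(columns, key=lambda c: information_gain(rows, c))
    rest = [c for c in columns if c != best]
    children = {}
    for value in sorted({row[best] for row in rows}):   # one child per value of A
        children[value] = build_tree([r for r in rows if r[best] == value], rest)
    return (NAMES[best], children)

tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

On this data the sketch splits on outlook first, then on humidity in the sunny branch and on windy in the rainy branch.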

193

Decision tree built from methylation table

Leave-one-out experiment
To avoid overfitting

Decision tree:
Test based on 12 genes

Sensitivity: 80%

Specificity: 88%
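A leave-one-out experiment and the two quality measures can be sketched as follows. The methylation table, the labels and the deliberately simple one-gene "classifier" are all hypothetical; the point is only that every step that learns from the data runs inside the loop, on the training part alone.

```python
def leave_one_out(samples, labels, train):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(model(samples[i]))
    return predictions

def sensitivity_specificity(labels, predictions, positive="cancer"):
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

def train_best_gene(samples, labels):
    """Toy learner: pick the gene whose methylation status agrees best with 'cancer'."""
    def agreement(gene):
        return sum((s[gene] == 1) == (y == "cancer") for s, y in zip(samples, labels))
    best = max(range(len(samples[0])), key=agreement)
    return lambda sample: "cancer" if sample[best] == 1 else "normal"

# Hypothetical boolean methylation table: one row per sample, one column per gene.
samples = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (0, 0)]
labels = ["cancer"] * 4 + ["normal"] * 4

predictions = leave_one_out(samples, labels, train_best_gene)
sensitivity, specificity = sensitivity_specificity(labels, predictions)
print(sensitivity, specificity)
```

Sensitivity is the fraction of cancers called cancer; specificity is the fraction of normals called normal.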

194

Evaluation and deployment

• Decide whether to use Classification results
 Can we use 12 gene decision tree for classifying
new patients?
• Verification of all steps
 Exercise. The above modelling procedure contains
a classical mistake: the test-sets used for
cross-validation (see leave-one-out) have actually been
used for training the model. How? (Weka is not to blame)
And how can we fix this?
• Check whether business goals have been met
 No: test based on 12 genes not useful (max ~5)
 Iteration required
196

Attempt to rebuild decision tree
with at most ~5 genes

Minimal leaf size
Increased to 12

New Decision tree:
Test based on 4 genes

Sensitivity decreased from 80% to 64%

Specificity increased from 88% to 90%
197

The impact of « cost »

• Market conditions, cost of goods &
royalty structure can limit the number
of genes that can be tested

198

The importance of « understandability »

199

The importance of « understandability »

Pre and postmarket requirements imposed for IVDMIA (510k etc)

Understandability (NO black boxes) is becoming an important asset

200

Outline

• Scripting

• Databases
Genome Browser

• AI
201

WEKA:: Introduction

• A collection of open source ML
algorithms
 pre-processing
 classifiers
 clustering
 association rule
• Created by researchers at the
University of Waikato in New Zealand
• Java based

202

WEKA:: Installation

• Download software from
http://www.cs.waikato.ac.nz/ml/weka/
 If you are interested in
modifying/extending weka there is a
developer version that includes the
source code
• Set the weka environment variable for
java
 setenv WEKAHOME /usr/local/weka/weka-3-0-2
 setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
• Download some ML data from
http://mlearn.ics.uci.edu/MLRepository.html
203

Main GUI

• Three graphical user interfaces
 “The Explorer” (exploratory data
analysis)
 “The Experimenter” (experimental
environment)
 “The KnowledgeFlow” (new process
model inspired interface)

205

Explorer: pre-processing the data

• Data can be imported from a file in
various formats: ARFF, CSV, C4.5,
binary
• Data can also be read from a URL or
from an SQL database (using JDBC)
• Pre-processing tools in WEKA are
called “filters”
• WEKA contains filters for:
 Discretization, normalization, resampling,
attribute selection, transforming and
combining attributes, …
206

WEKA only deals with “flat” files

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
38,female,non_anginal,?,no,not_present
...
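Because ARFF is a flat text format, a small reader is easy to write. The sketch below handles only the constructs shown above (no quoted names, sparse rows or date attributes) and is not how WEKA itself parses files; `parse_arff` is a name made up for the example.

```python
def parse_arff(text):
    """Return (relation name, [(attribute name, type)], data rows) from ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        keyword = line.lower()
        if in_data:
            # "?" marks a missing value
            rows.append([None if v.strip() == "?" else v.strip() for v in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows

ARFF = """@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(ARFF)
print(relation, len(attributes), rows[2])
```

Each data row lists the values in the order of the @attribute lines, with the class as the last attribute.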

207


Explorer: building “classifiers”

• Classifiers in WEKA are models for
predicting nominal or numeric
quantities
• Implemented learning schemes
include:
 Decision trees and lists, instance-based
classifiers, support vector machines,
multi-layer perceptrons, logistic
regression, Bayes' nets, …

230

Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis)

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

231

Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

232

Explorer: finding associations

• WEKA contains an implementation of
the Apriori algorithm for learning
association rules
 Works only with discrete data
• Can identify statistical dependencies
between groups of attributes:
 milk, butter ⇒ bread, eggs (with
confidence 0.9 and support 2000)
• Apriori can compute all rules that
have a given minimum support and
exceed a given confidence
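Support and confidence are simple counts. The sketch below computes them for one rule on a made-up list of shopping baskets; it is not the Apriori search itself, only the two measures that Apriori thresholds on.

```python
# Hypothetical transactions (shopping baskets).
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"eggs"},
]

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

lhs, rhs = {"milk", "butter"}, {"bread", "eggs"}
print(support(lhs | rhs), confidence(lhs, rhs))
```

Apriori keeps only itemsets whose support reaches the minimum, and from those only the rules whose confidence exceeds the chosen threshold.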
258

Explorer: data visualization

• Visualization very useful in practice:
e.g. helps to determine difficulty of the
learning problem
• WEKA can visualize single attributes
(1-d) and pairs of attributes (2-d)
 To do: rotating 3-d visualizations (Xgobi-
style)
• Color-coded class values
• “Jitter” option to deal with nominal
attributes (and to detect “hidden” data
points)
• “Zoom-in” function
259
