Bioinformatics



        Prof. Wim Van Criekinge
        18th december 2012, VUmc, Amsterdam
Outline

• Scripting
  Perl (Bioperl/Python)
       examples spiders/bots

• Databases
  Genome Browser
       examples biomart, galaxy

• AI
  Classification and clustering
       examples WEKA (R, Rapidminer)
Bioinformatics, a life science discipline … management of expectations

[Diagram, built up over five slides: Bioinformatics at the centre, surrounded by Math, Computer Science, Theoretical Biology, (Molecular) Biology, Computational Biology and Informatics. Topics placed between them: NP, Datamining, AI, Image Analysis, structure prediction (HTX), Interface Design, Expert Annotation, Sequence Analysis, Discovery Informatics, Computational Genomics.]
Bioinformatics




What is Perl ?

            • Perl is a High-level Scripting language
            • Larry Wall created Perl in 1987
                  - Practical Extraction and Report Language
                  - (or Pathologically Eclectic Rubbish Lister)
            • Born from a system administration tool
            • Faster than sh or csh
            • Slower than C
            • No need for sed, awk, tr, wc, cut, …
            • Perl is open and free
            • http://conferences.oreillynet.com/eurooscon/
What is Perl ?

             • Perl is available for most computing
               platforms: all flavors of UNIX (Linux),
               MS-DOS/Win32, Macintosh, VMS, OS/2, Amiga,
               AS/400, Atari
             • Perl is a computer language that is:
                  - Interpreted, compiled at run time (you need the
                    perl interpreter, e.g. perl.exe)
                  - Loosely "typed"
                  - String/text oriented
                  - Capable of using multiple syntax formats
             • In Perl, "there's more than one way to do it"
Why use Perl for bioinformatics ?

                • Ease of use by novice programmers
                • Flexible language: Fast software prototyping (quick
                  and dirty creation of small analysis programs)
                • Expressiveness. Compact code, Perl Poetry:
                  @{$_[$#_]||[]}
                • Glutility: Read disparate files and parse the relevant
                  data into a new format
                • Powerful pattern matching via “regular expressions”
                  (Best Regular Expressions on Earth)
                • With the advent of the WWW, Perl has become the
                  language of choice to create Common Gateway
                  Interface (CGI) scripts to handle form submissions
                  and create compute servers on the WWW.
                • Open Source – Free. Availability of Perl modules for
                  Bioinformatics and Internet.
Why NOT use Perl for bioinformatics ?


                • Some tasks are still better done with other
                  languages (heavy computations / graphics)
                    C(++),C#, Fortran, Java (Pascal,Visual Basic)

                • With perl you can write simple programs
                  fast, but on the other hand it is also suitable
                  for large and complex programs. (yet, it is
                  not adequate for very large projects)
                    Python

                • Larry Wall: “For programmers, laziness is a
                  virtue”

What bioinformatics tasks are suited to Perl ?


               • Sequence manipulation and analysis
               • Parsing results of sequence analysis programs
                 (Blast, Genscan, Hmmer etc)
               • Parsing database (eg Genbank) files
               • Obtaining multiple database entries over the
                 internet
               •…




Perl installation

               • Perl
                     Perl is available for various operating systems. To
                      download Perl and install it on your computer, have a
                      look at the following resources:
                     www.perl.com (O'Reilly).
                        Downloading Perl Software
                     ActiveState. ActivePerl for Windows, as well as for
                      Linux and Solaris.
                        ActivePerl binary packages.
                     CPAN



               • PHPTriad:
                     - contains Apache/PHP and MySQL:
                       http://sourceforge.net/projects/phptriad
Check installation


          • Command-line flags for perl
             perl -v
                 Gives the current version of Perl

             perl -e
                 Executes Perl statements from the command line.
            perl -e "print 42;"
            perl -e "print \"Two\nlines\n\";"

             perl -we
                 Executes and prints warnings
            perl -we "print 'hello'; x++;"
TextPad


          • Syntax highlighting

          • Run program (prompt for parameters)

          • Show line numbers

          • Clip-ons for web with perl syntax

          • ….
Customize TextPad part 1: Create Document Class

• Document classes

Customize TextPad part 2: Add Perl to "Tools Menu"

Unzip to TextPad samples directory
General Remarks

          • Perl is mostly a free format language: add
            spaces, tabs or new lines wherever you want.
          • For clarity, it is recommended to write each
            statement in a separate line, and use
            indentation in nested structures.
          • Comments: Anything from the # sign to the
            end of the line is a comment. (There are no
            multi-line comments).
          • A perl program consists of all of the Perl
            statements of the file taken collectively as
            one big routine to execute.


Three Basic Data Types



           •Scalars - $
           •Arrays of scalars - @
           •Associative arrays of
            scalars or Hashes - %
2+2 = ?
       $a = 2;
       $b = 2;
       $c = $a + $b;

   $   - indicates a scalar variable
   =   - assigns a value to a variable
   ;   - ends every statement

                         or     $c = 2 + 2;
                         or     $c = 2 * 2;
                         or     $c = 2 / 2;
                         or     $c = 2 ** 4;         # 2 to the power 4 = 16 (in Perl, ^ is bitwise XOR, so 2 ^ 4 is 6)
                         or     $c = 1.35 * 2 - 3 / (0.12 + 1);
Ok, $c is 4. How do we know it?

              $c = 4;
              print "$c";

  print command:

       " "   - double quotes enclose the output expression

       print "Hello \n";

       \n    - prints an end-of-line character
               (equivalent to pressing 'Enter')
Strings concatenation:
         print "Hello everyone\n";
         print "Hello" . " everyone" . "\n";
Expressions and strings together:
         print "2 + 2 = " . (2+2) . "\n";          # (2+2) is an expression; prints: 2 + 2 = 4
Loops and cycles (for statement):

      # Output all the numbers from 1 to 100
      for ($n=1; $n<=100; $n+=1) {
                print "$n \n";
      }
1. Initialization:
          for ( $n=1 ; ; ) { … }

2. Increment:
         for ( ; ; $n+=1 ) { … }

3. Termination (the loop runs as long as the condition is true):
        for ( ; $n<=100 ; ) { … }
4. Body of the loop - commands inside curly brackets:
         for ( ; ; ) { … }
FOR & IF -- all the even numbers from 1 to 100:

       for ($n=1; $n<=100; $n+=1) {
                   if (($n % 2) == 0) {
                            print "$n\n";
                   }
       }


           Note: $a % $b -- Modulus
                         -- Remainder when $a is divided by $b
Two brief diversions (warnings & strict)
           • use warnings; makes Perl report suspicious constructs
             (for example, use of an undefined value).
           • strict: forces you to 'declare' a variable the
             first time you use it.
              - usage: use strict; (somewhere near the top of your
                script)
           • declare variables with 'my'
              - usage: my $variable;
              - or: my $variable = 'value';
           • my sets the 'scope' of the variable. The variable
             exists only within the current block of code
           • use strict and my both help you to debug
             errors, and help prevent mistakes.
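
A minimal sketch of how use strict, use warnings and my work together. The variable names are illustrative only and do not come from the slides.

```perl
use strict;
use warnings;

my $count = 3;            # declared with 'my': visible in the rest of this file

{
    my $count = 10;       # a separate variable, visible only inside this block
    print "inside block: $count\n";
}

print "outside block: $count\n";

# With 'use strict', using an undeclared variable is a compile-time error:
# $total = $count + 1;    # would fail: Global symbol "$total" requires explicit package name
```

The inner $count disappears at the closing curly bracket, so the outer value is untouched.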
Text Processing Functions

          The substr function
          • Definition
          • The substr function extracts a substring out of a string
            and returns it. The function receives 3 arguments: a
            string value, a position on the string (starting to count
            from 0) and a length.
          Example:
          • $a = "university";
          • $k = substr ($a, 3, 5);
          • $k is now "versi" $a remains unchanged.
          • If length is omitted, everything to the end of the string
            is returned.
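
As a small illustration of substr on sequence data, the sketch below cuts codons out of a DNA string. The helper name codon_at and the example sequence are made up for this example.

```perl
use strict;
use warnings;

# Return the codon (3 letters) that starts at a 0-based position.
sub codon_at {
    my ($dna, $pos) = @_;
    return substr($dna, $pos, 3);
}

my $dna = "ATGGCCTAA";
print codon_at($dna, 0), "\n";   # first codon
print codon_at($dna, 3), "\n";   # second codon
print substr($dna, 6), "\n";     # length omitted: everything from position 6 to the end
```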



Random

         $x = rand(1);


         • srand
            The default seed for srand, which used to be time, has
             been changed. Now it's a heady mix of difficult-to-
             predict system-dependent values, which should be
             sufficient for most everyday purposes. Previous to
             version 5.004, calling rand without first calling srand
             would yield the same sequence of random numbers on
             most or all machines. Now, when perl sees that you're
             calling rand and haven't yet called srand, it calls srand
             with the default seed. You should still call srand
             manually if your code might ever be run on a pre-
             5.004 system, of course, or if you want a seed other
             than the default
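
One reason to pick your own seed is reproducibility: the same seed gives the same numbers within one perl installation. A minimal sketch, where the helper name draw and the seed 42 are arbitrary choices for illustration:

```perl
use strict;
use warnings;

# Draw $n random numbers in [0, 1) after seeding the generator with $seed.
sub draw {
    my ($seed, $n) = @_;
    srand($seed);
    my @numbers;
    for (1 .. $n) {
        push @numbers, rand(1);
    }
    return @numbers;
}

my @first  = draw(42, 3);
my @second = draw(42, 3);   # same seed, so the same three numbers
print "run 1: @first\n";
print "run 2: @second\n";
```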


Demo/Example

       • Exercise: how good are the random
         numbers?

       • If they are good, you can use them to
         compute Pi …

       • A good random generator is important
         for good random sequences that we
         can later use in simulations
Compute Pi using two random numbers

[Figure: a point with coordinates x and y, and a radius of length 1]
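
The exercise can be sketched as a Monte Carlo estimate. This assumes the figure shows a quarter circle of radius 1 inside the unit square; the function name estimate_pi and the number of points are choices made for this example.

```perl
use strict;
use warnings;

# Estimate Pi from $n random points (x, y) in the unit square.
# The fraction that falls inside the quarter circle x^2 + y^2 <= 1
# approaches Pi/4.
sub estimate_pi {
    my ($n) = @_;
    my $inside = 0;
    for (1 .. $n) {
        my $x = rand(1);
        my $y = rand(1);
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return 4 * $inside / $n;
}

printf "Pi is approximately %.4f\n", estimate_pi(100_000);
```

The estimate changes from run to run and improves only slowly as the number of points grows.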
Introduction

Buffon's Needle is one of the oldest problems in the
 field of geometrical probability. It was first stated in
 1777. It involves dropping a needle on a lined sheet of
 paper and determining the probability of the needle
 crossing one of the lines on the page. The remarkable
 result is that the probability is directly related to the
 value of pi.


http://www.angelfire.com/wa/hurben/buff.html

In PostScript you send it to the printer … PS has no
  variables but “stacks”; you can mimic this in Perl by
  recursively loading and rewriting a subroutine
–http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html
Programming


       • Variables
       • Flow control (if, regex …)
       • Loops

       • input/output
       • Subroutines/object

What is a regular expression?


 • A regular expression (regex) is simply a
   way of describing text.
 • Regular expressions are built up of small
   units (atoms) which can represent the type
   and number of characters in the text
 • Regular expressions can be very broad
   (describing everything), or very narrow
   (describing only one pattern).


Regular Expression Review

• A regular expression (regex) is a way of
  describing text.
• Regular expressions are built up of small units
  (atoms) which can represent the type and
  number of characters in the text
• You can group or quantify atoms to describe
  your pattern
• Always use the bind operator (=~) to apply your
  regular expression to a variable


Why would you use a regex?


• Often you wish to test a string for the
  presence of a specific character, word, or
  phrase

  Examples

    “Are there any letter characters in my string?”
    “Is this a valid accession number?”
    “Does my sequence contain a start codon (ATG)?”



Regular Expressions


Match to a sequence of characters

The EcoRI restriction enzyme cuts at the consensus
 sequence GAATTC.
To find out whether a sequence contains a restriction site for
 EcoRI, write:

if ($sequence =~ /GAATTC/) {
    ...
}
Regex-style




              [m]/PATTERN/[g][i][o]
         s/PATTERN/PATTERN/[g][i][e][o]
     tr/PATTERNLIST/PATTERNLIST/[c][d][s]




Regular Expressions

Match to a character class
• Example
• The BstYI restriction enzyme cuts at the consensus sequence
  rGATCy, namely A or G in the first position, then GATC, and then T or C.
  To find out whether a sequence contains a restriction site for BstYI, write;
• if ($sequence =~ /[AG]GATC[TC]/) {...}; # This will match all of
  AGATCT, GGATCT, AGATCC, GGATCC.
Definition
• When a list of characters is enclosed in square brackets [], one and only
  one of these characters must be present at the corresponding position of
  the string in order for the pattern to match. You may specify a range of
  characters using a hyphen -.
• A caret ^ at the front of the list negates the character class.
Examples
• if ($string =~ /[AGTC]/) {...}; # matches any nucleotide
• if ($string =~ /[a-z]/) {...}; # matches any lowercase letter
• if ($string =~ /chromosome[1-6]/) {...}; # matches
  chromosome1, chromosome2 ... chromosome6
• if ($string =~ /[^xyzXYZ]/) {...}; # matches any character except
  x, X, y, Y, z, Z
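
The BstYI character-class pattern can also report where the sites are. A minimal sketch: the function name bstyi_sites and the test sequence are invented for illustration, and $-[0] (start offset of the last match) is a built-in Perl variable.

```perl
use strict;
use warnings;

# Return the 0-based start positions of all BstYI sites (rGATCy) in a sequence.
sub bstyi_sites {
    my ($sequence) = @_;
    my @starts;
    while ($sequence =~ /[AG]GATC[TC]/g) {
        push @starts, $-[0];      # $-[0] is the start of the current match
    }
    return @starts;
}

my $sequence = "TTAGATCTCCGGATCCAA";
my @sites = bstyi_sites($sequence);
print "BstYI sites start at: @sites\n";
```

Matching with /g in a while loop finds non-overlapping sites one at a time.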
Constructing a Regex

 • Pattern starts and ends with a /          /pattern/
    - if you want to match a /, you need to escape it
       \/ (backslash, forward slash)

    - you can change the delimiter to some other
      character, but you probably won't need to
       m|pattern|

 • any 'modifiers' to the pattern go after the last /
       i : case insensitive /[a-z]/i
       o : compile once
       g : global, find all matches (returned as a list in list context)
       m : ^ and $ also match at the start and end of each line
       s : . also matches a newline
Looking for a pattern

• By default, a regular expression is applied to $_
  (the default variable)
   if (/a+/) {die}
     - looks for one or more 'a' in $_

• If you want to look for the pattern in any other
  variable, you must use the binding operator
   if ($value =~ /a+/) {die}
     - looks for one or more 'a' in $value

• The binding operator is in no way similar to the
  '=' sign!! = is assignment, =~ is bind.
   if ($value = /a+/) {die}
     - looks for one or more 'a' in $_, not $value, and
       assigns the result of the match to $value!!!
Regular Expression Atoms


• An 'atom' is the smallest unit of a
  regular expression.
• Character atoms
     - 0-9, a-z, A-Z match themselves
     - . (dot) matches any single character (except a newline, by default)
     - [atgcATGC] : A character class (group)
     - [a-z] : another character class, a through z
Quantifiers


• You can specify the number of times you want
  to see an atom. Examples
  •   \d* : Zero or more times
  •   \d+ : One or more times
  •   \d{3} : Exactly three times
  •   \d{4,7} : At least four, and not more than seven
  •   \d{3,} : Three or more times
      We could rewrite /\d\d\d-\d\d\d\d/ as:
  /\d{3}-\d{4}/
Anchors


• Anchors force a pattern match to a certain
  location
  • ^ : match at the beginning of the string
  • $ : match at the end of the string
  • \b : match at a word boundary (between \w and \W)
• Example:
  • /^\d\d\d-\d\d\d\d$/ : matches only strings that consist of a phone number such as 353-7236
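
Quantifiers and anchors together give a simple format check. A minimal sketch, where the function name looks_like_phone and the test strings are made up for illustration:

```perl
use strict;
use warnings;

# True if the whole string looks like a 7-digit phone number, e.g. 353-7236.
sub looks_like_phone {
    my ($text) = @_;
    return $text =~ /^\d{3}-\d{4}$/ ? 1 : 0;
}

for my $candidate ("353-7236", "phone: 353-7236", "3537-236") {
    printf "%-18s %s\n", $candidate, looks_like_phone($candidate) ? "match" : "no match";
}
```

Without ^ and $ the second string would also match, because the pattern would be allowed to start anywhere in the text.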




Remembering Stuff


• Being able to match patterns is good, but
  limited.
• We want to be able to keep portions of the
  matched text for later.
   - Example: $string = 'phone: 353-7236'
     We want to keep the phone number only
     Just figuring out that the string contains a phone number is
       insufficient, we need to keep the number as well.
Memory Parentheses (pattern memory)



• Since we almost always want to keep portions
  of the string we have matched, there is a
  mechanism built into perl.
• Anything in parentheses within the regular
  expression is kept in memory.
   - 'phone:353-7236' =~ /^phone:(.+)$/;
     Perl knows we want to keep everything that matches '.+' in the above
       pattern
Getting at pattern memory


• Perl stores the matches in a series of default
  variables. The first parentheses set goes into
  $1, second into $2, etc.
   - This is why we can't name variables ${digit}
   - Memory variables are created only in the amounts
     needed. If you have three sets of parentheses, you
     have ($1,$2,$3).
   - Memory variables are created for each matched set of
     parentheses. If you have one set contained within
     another set, you get two variables (sets are numbered by
     their opening parenthesis, so the outer set gets the
     lower number)
   - Memory variables are only valid in the current scope
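
A short sketch of pattern memory with nested parentheses. Perl numbers the groups by the position of their opening parenthesis; the function name parse_phone is invented for this example.

```perl
use strict;
use warnings;

# Return (whole number, prefix) from a string such as "phone: 353-7236",
# or an empty list if the string does not match.
sub parse_phone {
    my ($string) = @_;
    # Two nested sets of parentheses: the outer set opens first, so it is $1.
    if ($string =~ /^phone:\s*((\d{3})-\d{4})$/) {
        return ($1, $2);
    }
    return;
}

my ($number, $prefix) = parse_phone("phone: 353-7236");
print "whole number: $number\n";
print "prefix:       $prefix\n";
```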
Finding all instances of a match



 • Use the 'g' modifier to the regular expression
     - @sites = $sequence =~ /(TATTA)/g;
     - think g for global
     - Returns a list of all the matches (in order), and
       stores them in the array
     - If you have more than one pair of parentheses, your
       array gets values in sets
        ($1,$2,$3,$1,$2,$3...)
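
A minimal sketch of the g modifier in list context. The sequences and the second pattern are made up for illustration.

```perl
use strict;
use warnings;

my $sequence = "GGTATTACCTATTAGG";

# One pair of parentheses: the array holds every match, in order.
my @sites = $sequence =~ /(TATTA)/g;
print scalar(@sites), " sites found\n";

# Two pairs of parentheses: values come back in sets ($1, $2, $1, $2, ...).
my @pairs = "A10 C20 G30" =~ /([ACGT])(\d+)/g;
print "@pairs\n";
```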




Perl is Greedy

• In addition to taking all your time, perl regular
  expressions also try to match the largest
  possible string which fits your pattern
   - /ga+t/ matches gat, gaat, gaaat
   - 'Doh! No doughnuts left!' =~ /(d.+t)/
     $1 contains 'doughnuts left'

• If this is not what you wanted to do, use the '?'
  modifier
   - /(d.+?t)/ # match as few '.'s as you can and still
     make the pattern work
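
The two behaviours side by side, as a small sketch using the doughnut string from the slide (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = 'Doh! No doughnuts left!';

my ($greedy) = $text =~ /(d.+t)/;    # as many characters as possible
my ($lazy)   = $text =~ /(d.+?t)/;   # as few characters as possible

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both matches start at the lowercase d of "doughnuts", because the pattern is case sensitive and the capital D of "Doh" does not count.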


Substitute function


• s/pattern/replacement/;
• Looks kind of like a regular expression
   - The pattern is constructed the same way; the second
     part is the replacement text
• Inherited from previous languages, so it can be
  a bit different.
   - Changes the variable it is bound to!
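
A minimal sketch of the substitute operator on a DNA string. The sequence and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGTTTGGT";

# Work on a copy, because s/// changes the variable it is bound to.
(my $rna = $dna) =~ s/T/U/g;      # replace every T with U

print "DNA: $dna\n";
print "RNA: $rna\n";

# In scalar context s/// returns the number of substitutions made.
my $changed = ($dna =~ s/GGT/GGC/);
print "changed $changed codon; DNA is now $dna\n";
```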
tr function

• translate or transliterate
• tr/characterlist1/characterlist2/;
• Even less like a regular expression than s
• substitutes characters in the first list with
  characters in the second list
  $string =~ tr/a/A/; # changes every 'a' to an 'A'
   - No need for the g modifier when using tr.
Translations




Using tr

• Creating complementary DNA sequence
   - $sequence =~ tr/atgc/TACG/;
• Sneaky Perl trick for the day
   - tr does two things.
     1. changes characters in the bound variable
     2. Counts the number of times it does this

   - Super-fast character counter™
     $a_count = $sequence =~ tr/a/a/;
     replaces an 'a' with an 'a' (no net change), and assigns the result
       (number of substitutions) to $a_count
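
Both uses of tr can be sketched on a short sequence. The function names reverse_complement and gc_fraction are invented for this example; reversing the string first is an addition to the slide's plain complement.

```perl
use strict;
use warnings;

# Reverse complement: reverse the string, then swap each base with its partner.
sub reverse_complement {
    my ($sequence) = @_;
    my $rc = reverse $sequence;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# GC content: tr with an empty replacement list only counts, it changes nothing.
sub gc_fraction {
    my ($sequence) = @_;
    my $gc = ($sequence =~ tr/GCgc//);
    return length($sequence) ? $gc / length($sequence) : 0;
}

my $sequence = "ATGCGC";
print reverse_complement($sequence), "\n";
printf "GC fraction: %.2f\n", gc_fraction($sequence);
```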




Regex-Related Special Variables

• Perl has a host of special variables that get filled after every m// or s///
  regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last
  (highest-numbered) backreference. $& (dollar ampersand) holds the
  entire regex match.
• @- is an array of match-start indices into the string. $-[0] holds the start
  of the entire regex match, $-[1] the start of the first backreference, etc.
  Likewise, @+ holds match-end indices (ends, not lengths).
• $' (dollar followed by an apostrophe or single quote) holds the part of
  the string after (to the right of) the regex match. $` (dollar backtick)
  holds the part of the string before (to the left of) the regex match. Using
  these variables is not recommended in scripts when performance
  matters, as it causes Perl to slow down all regex matches in your entire
  script.
• All these variables are read-only, and persist until the next regex match
  is attempted. They are dynamically scoped, as if they had an implicit
  'local' at the start of the enclosing scope. Thus if you do a regex
  match, and call a sub that does a regex match, when that sub
  returns, your variables are still set as they were for the first match.

Example

Which of the following 4 sequences (seq1/2/3/4)

a) contains a “Galactokinase signature”?
                http://us.expasy.org/prosite/

b) How many of them?
c) Where (hints: pos and $&)?
>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
  YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
  LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS
  LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
  NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK
  SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM
  MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI
  ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
  SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK
   VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN
   VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL
   LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS
   TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA
   LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
  SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY
  LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP
  QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
  CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK
  HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE
  GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL
  DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL
  WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE
  KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ
  LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI
  NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC
  LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR
  FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA
  ALCALQAVKE KKGLA
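
One possible approach to the exercise, sketched on short fragments copied from SEQ1, SEQ2 and SEQ3 rather than on the full sequences. The regular expression is an assumed translation of the PROSITE galactokinase signature, believed to be G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y; check the current PROSITE entry before relying on it. The function name find_signature and the fragment names are hypothetical.

```perl
use strict;
use warnings;

# Assumed regex form of the PROSITE pattern G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y.
my $signature = qr/GR.N[LIV]IG[DE]H.DY/;

# Return [start position (1-based), matched text] for every hit in a sequence.
sub find_signature {
    my ($protein) = @_;
    my @hits;
    while ($protein =~ /($signature)/g) {
        # pos() is the offset just after the match
        push @hits, [ pos($protein) - length($1) + 1, $1 ];
    }
    return @hits;
}

# Short illustrative fragments, not the complete sequences from the slide.
my %fragment = (
    SEQ1_part => "SRKQYIFGRANIIGEHNDYVVVRLSAAIL",
    SEQ2_part => "MLDDRARMEAAKKEKVEQILAEFQLQEED",
    SEQ3_part => "SRVGRSGRDNVIGDHVDYVTDVYLLNLAL",
);

for my $name (sort keys %fragment) {
    my @hits = find_signature($fragment{$name});
    print "$name: ", scalar(@hits), " hit(s)\n";
    print "  position $_->[0]: $_->[1]\n" for @hits;
}
```

To answer the exercise itself, read each full sequence into one string with the spaces and line breaks removed, then call the same function on it.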


Arrays


Definitions
• A scalar variable contains a scalar value: one number or one string.
  A string might contain many words, but Perl regards it as one unit.
• An array variable contains a list of scalar data: a list of numbers or a
  list of strings or a mixed list of numbers and strings. The order of
  elements in the list matters.
Syntax
• Array variable names start with an @ sign.
• You may use in the same program a variable named $var and
  another variable named @var, and they will mean two different,
  unrelated things.
Example
• Assume we have a list of numbers which were obtained as a result
  of some measurement. We can store this list in an array variable as
  the following:
• @msr = (3, 2, 5, 9, 7, 13, 16);

The foreach construct

     The foreach construct iterates over a list of scalar values
     (e.g. that are contained in an array) and executes a block
     of code for each of the values.
   • Example:
        foreach $i (@some_array) {
        statement_1;
        statement_2;
        statement_3; }
        Each element in @some_array is aliased to the variable $i in
         turn, and the block of code inside the curly brackets {} is
         executed once for each element.
   • The variable $i (or give it any other name you wish) is
     local to the foreach loop and regains its former value
     upon exiting of the loop.
   • Remark: if no loop variable is given, each element is
     aliased to $_ instead.
Examples for using the foreach construct - cont.


• Calculate sum of all array elements:
  #!/usr/local/bin/perl
  @msr = (3, 2, 5, 9, 7, 13, 16);
  $sum = 0;
  foreach $i (@msr) {
  $sum += $i; }
  print "sum is: $sum\n";
Accessing individual array elements



Individual array elements may be accessed by
  indicating their position in the list (their index).
Example:
@msr = (3, 2, 5, 9, 7, 13, 16);
index: 0  1  2  3  4  5   6
value: 3  2  5  9  7  13  16
First element: $msr[0] (here has the value of 3),
Third element: $msr[2] (here has the value of 5),
and so on.
The sort function

The sort function receives a list of variables (or an array) and returns the sorted list.

@array2 = sort (@array1);

#!/usr/local/bin/perl
@countries = ("Israel", "Norway", "France", "Argentina");
@sorted_countries = sort ( @countries);
print "ORIG: @countries\n", "SORTED: @sorted_countries\n";
Output:
ORIG: Israel Norway France Argentina
SORTED: Argentina France Israel Norway

#!/usr/local/bin/perl
@numbers = (1 ,2, 4, 16, 18, 32, 64);
@sorted_num = sort (@numbers);
print "ORIG: @numbers \n", "SORTED: @sorted_num \n";
Output:
ORIG: 1 2 4 16 18 32 64
SORTED: 1 16 18 2 32 4 64
Note that the default sort does not order numbers numerically, but by the string value of each
   number.
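
To sort numerically, give sort a comparison block that uses the numeric comparison operator <=>. A minimal sketch with made-up variable names:

```perl
use strict;
use warnings;

my @numbers = (64, 2, 18, 1, 32, 16, 4);

my @as_strings = sort @numbers;                 # default: string comparison
my @as_numbers = sort { $a <=> $b } @numbers;   # numeric comparison

print "STRING SORT:  @as_strings\n";
print "NUMERIC SORT: @as_numbers\n";
```

Inside the block, $a and $b are the two elements being compared; swapping them ($b <=> $a) sorts in descending order.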
The push and shift functions


The push function adds a variable or a list of variables to the end of a given
  array.
Example:
$a = 5;
$b = 7;
@array = ("David", "John", "Gadi");
push (@array, $a, $b);
# @array is now ("David", "John", "Gadi", 5, 7)

The shift function removes the first element of a given array and returns this
  element.
Example:
@array = ("David", "John", "Gadi");
$k = shift (@array);
# @array is now ("John", "Gadi"); # $k is now "David"

Note that after both the push and shift operations the given array @array is
  changed!
Perl Array review


• An array is designated with the '@' sign
• An array is a list of individual elements
• Arrays are ordered
   - Your list stays in the same order that you created it, although
     you can add or remove elements at the front or back of the list
• You access array elements by number, using the
  special syntax:
   - $array[1]     returns the second element of the array (remember
     perl starts counting at zero)
• You can do anything with an array element that you
  can do with a scalar variable (addition, subtraction,
  printing … whatever)
Generate random sequence string


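@a = ("A","C","G","T");      # the alphabet to draw from

for($n=1;$n<=50;$n++)
{
    $b=$a[rand(@a)];         # rand(@a) gives a number below 4; used as an index it becomes 0 to 3
    $r.=$b;                  # append the letter to the sequence
}

print $r;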
Text Processing Functions


The split function

• The split function splits a string into a list of substrings
  at each occurrence of a given delimiter. The
  delimiter is written as a pattern enclosed by slashes:
  /PATTERN/. Examples:
• $string = "programming::course::for::bioinformatics";
• @list = split (/::/, $string);
• # @list is now
  # ("programming", "course", "for", "bioinformatics")
  # $string remains unchanged.
• $string = "protein kinase C\t450 Kilodaltons\t120 Kilobases";
• @list = split (/\t/, $string); # \t indicates tab
• # @list is now ("protein kinase C", "450
  # Kilodaltons", "120 Kilobases")

                                                               69
Text Processing Functions


The join function
• The join function does the opposite of split. It
  receives a delimiter and a list of strings, and joins the
  strings into a single string, such that they are
  separated by the delimiter.
• Note that the delimiter is written inside quotes.
• Examples:
• @list = ("programming", "course", "for",
  "bioinformatics");
• $string = join ("::", @list);
• # $string is now
  "programming::course::for::bioinformatics"
• $name = "protein kinase C"; $mol_weight = "450 Kilodaltons";
  $seq_length = "120 Kilobases";
• $string = join ("\t", $name, $mol_weight,
  $seq_length);
• # $string is now:
  # "protein kinase C\t450 Kilodaltons\t120 Kilobases"                70
When is an array not good enough?


• Sometimes you want to associate a given value
  with another value. (name/value pairs)
  (Rob => 353-7236, Matt => 353-7122,
    Joe_anonymous => 555-1212)
  (Acc#1 => sequence1, Acc#2 => sequence2, Acc#n =>
 sequence-n)
• You could put this information into an array, but
  it would be difficult to keep your names and
  values together (what happens when you sort?
  Yuck)

                                                      71
Problem solved: The associative array


• As the name suggests, an associative array allows
  you to link a name with a value
• In perl-speak: associative array = hash
   'hash' is the preferred term, for various arcane
    reasons, including that it is easier to say.
• Consider an array: The elements (values) are
  each associated with a name – the index position.
  These index positions are
  numerical, sequential, and start at zero.
• A hash is similar to an array, but we get to name the
  index positions anything we want

                                                       72
The 'structure' of a Hash


 • An array looks something like this:
      @array =     Index:   0        1        2
                   Value:   'val1'   'val2'   'val3'


 • A hash looks something like this:
      %phone =     Key (name):   Rob        Matt       Joe_A
                   Value:        353-7236   353-7122   555-1212


                                                           74
Creating a hash

• There are several methods for creating a hash.
  The simplest way is to assign a list to a hash.
   %hash = ('rob', 56, 'joe', 17, 'jeff', 'green');
• Perl is smart enough to know that since you are
  assigning a list to a hash, you meant to alternate
  keys and values.
   %hash = ('rob' => 56, 'joe' => 17, 'jeff' => 'green');
• The arrow ('=>') notation helps some people, and
  clarifies which keys go with which values. The
  perl interpreter treats '=>' as a comma.


                                                              75
Getting at values


• You should expect by now that there is some
  way to get at a value, given a key.
• You access a hash value like this:
   $hash{'key'}
• This should look somewhat familiar
   $array[21] : refers to the value associated with a
    specific index position in an array
   $hash{key} : refers to the value associated with a
    specific key in a hash
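A small sketch that puts creation and lookup together, reusing the phone-book example from the earlier slide. The extra entry and its number are made up for illustration.

```perl
use strict;
use warnings;

my %phone = (
    Rob           => '353-7236',
    Matt          => '353-7122',
    Joe_anonymous => '555-1212',
);

# Look up one value by its key.
my $matt = $phone{Matt};

# Add a new pair, then test whether keys exist.
$phone{Wim} = '555-0000';           # made-up number, for illustration only
my $has_rob  = exists $phone{Rob}  ? 1 : 0;
my $has_anna = exists $phone{Anna} ? 1 : 0;

# Hashes have no inherent order, so sort the keys for stable output.
my @names = sort keys %phone;
foreach my $name (@names) {
    print "$name => $phone{$name}\n";
}
```

Sorting the keys keeps each name together with its value, which is exactly what was hard to guarantee with two parallel arrays.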



                                                      76
Programming in general and Perl in particular

• There is more than one right way to do it. Unfortunately, there are also
  many wrong ways.
    1. Always check and make sure the output is correct and logical
       Consider what errors might occur, and take steps to ensure that you are
        accounting for them.
    2. Check to make sure you are using every variable you declare.
       Use the strict pragma ("use strict;")!
    3. Always go back to a script once it is working and see if you can
     eliminate unnecessary steps.
       Concise code is good code.
       You will learn more if you optimize your code.
       Concise does not mean comment free. Please use as many comments as
        you think are necessary.
       Sometimes you want to leave easy to understand code in, rather than short
        but difficult to understand tricks. Use your judgment.
       Remember that in the future, you may wish to use or alter the code you
        wrote today. If you don't understand it today, you won't tomorrow.
                                                                                77
Programming in general and Perl in particular


 Develop your program in stages. Once part of it works, save
  the working version to another file (or use a source code
  control system like RCS) before continuing to improve it.
 When running interactively, show the user signs of activity.
  There is no need to dump everything to the screen (unless
  requested to), but a few words or a number change every
  few minutes will show that your program is doing
  something.
 Comment your script. Any information on what it is doing or
  why might be useful to you a few months later.
 Decide on a coding convention and stick to it. For example,
     for variable names, begin globals with a capital letter and privates
      (my) with a lower case letter
     indent new control structures with (say) 2 spaces
     line up closing braces, as in: if (....) { ... ... }                   78
CPAN


       • CPAN: The Comprehensive Perl Archive
         Network is available at www.cpan.org
         and is a very large repository of Perl
         modules for all kinds of tasks (including
         BioPerl)




                                                   79
What is BioPerl?

• An 'open source' project
   http://bio.perl.org or http://www.cpan.org
• A loose international collaboration of
  biologist/programmers
   Nobody (that I know of) gets paid for this
• A collection of Perl modules and methods for
  doing a number of bioinformatics tasks
   Think of it as subroutines to do biology
• Consider it a 'tool-box'
   There are a lot of nice tools in there, and (usually)
    somebody else takes care of fixing parsers when they
    break
• BioPerl code is portable - if you give somebody a
  script, it will probably work on their system
                                                        80
Multi-line parsing

            use strict;
            use warnings;
            use Bio::SeqIO;

            my $filename = "sw.txt";   # Swiss-Prot flat file with one or more entries

            my $seqio = Bio::SeqIO->new(
                               -format => 'swiss',
                               -file   => $filename,
                               );

            # next_seq returns one sequence object per entry, undef at end of file
            while (my $sequence_object = $seqio->next_seq) {
                my $sequentie = $sequence_object->seq();
                print $sequentie, "\n";
            }

                                                              81
Live.pl


          #!e:\Perl\bin\perl.exe -w
          # script for fetching a GenBank entry and printing its id and sequence
          use Bio::DB::GenBank;
          use Data::Dumper;

          $gb = Bio::DB::GenBank->new();

          # requires network access to NCBI
          $sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');
          print Dumper ($sequence_object);

          $seq1_id = $sequence_object->display_id();
          $seq1_s = $sequence_object->seq();
          print "seq1 display id is $seq1_id \n";
          print "seq1 sequence is $seq1_s \n";


                                                                         82
Bioperl 101: 2 ESSENTIAL TOOLS


            Data::Dumper to find out what
            class you're in

            Perl bptutorial (100 Bio::Seq) to
            find the available methods for that
            class
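A hedged sketch of the first tool in action. To keep it runnable without BioPerl, it uses a hypothetical stand-in class (My::Seq) rather than a real Bio::Seq object; with BioPerl installed, the same ref(), Dumper() and can() calls apply to whatever object a BioPerl method hands back.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in class so the sketch runs without BioPerl installed.
package My::Seq;
sub new        { my ($class, %args) = @_; return bless { %args }, $class; }
sub display_id { return $_[0]->{id}  }
sub seq        { return $_[0]->{seq} }

package main;

my $obj = My::Seq->new(id => 'demo1', seq => 'ACGT');

# ref() reports the class an object was blessed into.
my $class = ref($obj);
print "class: $class\n";

# Dumper shows the class name and the internal data structure.
$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
my $dump = Dumper($obj);
print $dump;

# can() tests whether a method is available before you call it.
my $has_seq    = $obj->can('seq')    ? 1 : 0;
my $has_revcom = $obj->can('revcom') ? 1 : 0;
print "has seq: $has_seq, has revcom: $has_revcom\n";
```

Once you know the class name, you can look up its documented methods instead of guessing.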




                                                  83
Outline

• Scripting
  Perl (Bioperl/Python)
       examples spiders/bots

• Databases
  Genome Browser
       examples biomart, galaxy

• AI
  Classification and clustering
       examples WEKA (R, Rapidminer)
                                       84
Overview


• Bots and Spiders
   The web
   Bots
   Spiders
   Real world examples
     Bioinformatics applications
   Perl – LWP libraries
   Google hacks
   Advanced APIs
   Fetch data from NCBI / Ensembl /
                                       85
The web



• The WWW-part of the
  Internet is based on
  hyperlinks
• So if one started to
  follow all hyperlinks, it
  would be possible to
  map almost the entire
  WWW
• Everything you can do
  as a human (clicking,
  filling in forms,…) can be
  done by machines
                               86
Bots



• Webbots (web robots, WWW robots, bots): software applications
  that run automated tasks over the Internet
• Bots perform tasks that:
    Are simple
    Structurally repetitive
    At a much higher rate than would be possible for a human
• Automated script fetches, analyses and files information from
  web servers at many times the speed of a human
• Other uses:
    Chatbots
    IM / Skype / Wiki bots
    Malicious bots and bot networks (Zombies)                 87
Spiders


• Webspiders / Crawlers are programs
  or automated scripts which browse
  the World Wide Web in a
  methodical, automated manner. They are
  one type of bot
• The spider starts with a list of URLs
  to visit, called the seeds
    As the crawler visits these
     URLs, it identifies all the
     hyperlinks in the page
    It adds them to the list of URLs to
     visit, called the crawl frontier
    URLs from the frontier are
     recursively visited according to a
     set of policies
• This process is called web crawling
  or spidering: in most cases a mean       88
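The seed / frontier loop described above can be sketched without any network access. The "web" here is a made-up in-memory table of pages and their links (the example.* URLs are placeholders), and the two policies shown (visit each URL once, stop after a page limit) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Toy "web": each URL maps to the hyperlinks found on that page.
my %links_on = (
    'http://a.example/' => ['http://b.example/', 'http://c.example/'],
    'http://b.example/' => ['http://c.example/', 'http://d.example/'],
    'http://c.example/' => ['http://a.example/'],
    'http://d.example/' => [],
);

my @frontier = ('http://a.example/');   # the seeds
my %seen     = map { $_ => 1 } @frontier;
my @visited;
my $max_pages = 10;                     # policy: stop after this many pages

while (@frontier && @visited < $max_pages) {
    my $url = shift @frontier;          # take the oldest URL (breadth-first)
    push @visited, $url;

    # "Fetch" the page and identify its hyperlinks.
    foreach my $link (@{ $links_on{$url} || [] }) {
        next if $seen{$link}++;         # policy: never queue a URL twice
        push @frontier, $link;          # add to the crawl frontier
    }
}

print "$_\n" for @visited;
```

A real spider would replace the table lookup with an HTTP fetch plus link extraction, and would normally also wait between requests and respect robots.txt.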
Spiders


Use of webcrawlers:
  Mainly used to create a copy of all the visited pages for later
   processing by a search engine that will index the downloaded
   pages to provide fast searches
  Automating maintenance tasks on a website, such as checking
   links or validating HTML code
  Can be used to gather specific types of information from Web
   pages, such as harvesting e-mail addresses
   The most commonly used crawler is probably the Googlebot crawler
     Crawls
     Indexes (content + key content tags and attributes, such as Title
      tags and ALT attributes)
     Serves results: PageRank technology                             89
Spiders




          90
Perl - LWP



LWP (also known as libwww-perl)
    The World-Wide Web library for Perl
    Set of Perl modules which provides a simple and
     consistent application programming interface (API) to
     the World-Wide Web
    Free book: http://lwp.interglacial.com/
   LWP for newbies
    LWP::Simple (demo1)
    Go to a URL, fetch data, ready to parse
    Attention: HTML tags and regular expression

                                                             91
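The "HTML tags and regular expression" point can be illustrated with a short sketch. To keep it self-contained, the page is a hard-coded string; with LWP::Simple installed it would instead come from get($url), which returns the page content as a string. The regular expression is a simplification that works on tidy markup only; a real HTML parser is more robust.

```perl
use strict;
use warnings;

# Stand-in for a fetched page. With LWP::Simple installed, this string would
# come from:  my $html = get($url);
my $html = <<'HTML';
<html><body>
<a href="http://www.ensembl.org/">Ensembl</a>
<A HREF='http://genome.ucsc.edu/'>UCSC</A>
<a class="x" href="/mapview/">MapViewer</a>
</body></html>
HTML

# Capture the href value of every anchor tag: case-insensitive (i),
# global (g), accepting single or double quotes around the URL.
my @hrefs;
while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gi) {
    push @hrefs, $2;
}

print "$_\n" for @hrefs;
```

Note that the third link is relative; a spider has to resolve it against the page URL before fetching it.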
Perl - LWP



   Some more advanced features
 LWP::UserAgent (demo2 – show server access logs)
 Fill in forms and parse results
 Depending on content: follow hyperlinks to other pages
  and parse these again,…
   Bioinformatics examples
 Use genome browser data (demo3) and sequences
 Get gene aliases and symbols from GeneCards (demo4)



                                                           92
Google hacks



   Why not make use of crawls, indexing and
    serving technologies of others (e.g. Google)
    Google allows automated queries: per account 1000
     queries a day
    Google uses snippets: the short pieces of text you get
     in the main search results
    These are the result of its indexing and parsing algorithms
    Demo5: LWP and Google combined and parsing the
     results



                                                                93
Advanced APIs


   An application programming interface (API) is a source
    code interface that an operating system, library or service
    provides to support requests made by computer programs
   Language-dependent APIs
   Language-independent APIs are written in a way they can
    be called from several programming languages. This is a
    desired feature for service style API which is not bound to
    a particular process or system and is available as a
    remote procedure call




                                                             94
Advanced APIs


   Google example used Google API / SOAP
   NCBI API
     The NCBI Web service is a web program that enables
      developers to access Entrez Utilities via the Simple Object
      Access Protocol (SOAP)
     Programmers may write software applications that access
      the E-Utilities using any SOAP development tool
     Main tools (demo6):
       E-Search Searches and retrieves primary IDs and term
         translations and optionally retains results for future use in
         the user's environment
       E-Fetch: Retrieves records in the requested format from a
        list of one or more primary IDs
                                                                    95
   Ensembl API (demo7)
Fetch data from NCBI


   A NCBI database, frequently used is PubMed
    PubMed can be queried using E-Utils
    Uses syntax as regular PubMed website
    Get the data back in data formats as on the website
      (XML, Plain Text)
    Parse XML results and more advanced Text-mining
      techniques
    Demo8
    Parse results and present them in an interface
      (http://matrix.ugent.be/mate/methylome/result1.html)


                                                             96
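As a rough sketch of querying PubMed through E-utilities, the code below only builds the request URLs and does not contact NCBI. It assumes the URL-based (REST-style) E-utilities interface rather than the SOAP service mentioned earlier; the search term and the PubMed IDs are made up, and url_escape is a small helper written for this example.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Percent-encode everything except unreserved URL characters (ASCII input assumed).
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# ESearch: find the PubMed IDs that match a query (same syntax as the website).
sub esearch_url {
    my ($term, $retmax) = @_;
    return "$base/esearch.fcgi?db=pubmed&retmax=$retmax&term=" . url_escape($term);
}

# EFetch: retrieve the records for a list of IDs, here as XML.
sub efetch_url {
    my (@ids) = @_;
    return "$base/efetch.fcgi?db=pubmed&retmode=xml&id=" . join(',', @ids);
}

my $search = esearch_url('DNA methylation[Title] AND cancer', 20);
my $fetch  = efetch_url(12345678, 23456789);   # made-up PMIDs, for illustration

print "$search\n$fetch\n";
```

The resulting URLs can be fetched with LWP, after which the XML that comes back is parsed for IDs or abstracts.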
Fetch data from NCBI



   Example: PubMeth
    Get data from NCBI PubMed
    Get all genes and all aliases for human genes and their
      annotations from Ensembl & GeneCards
    Get all cancer types from a cancer thesaurus
    Parse PubMed results: find genes and aliases;
      keywords
    Keep variants in mind (Regexes are very useful)
    Sort the PubMed abstracts and store found genes and
      keywords in database; apply scoring scheme

                                                          97
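One possible way to "keep variants in mind" with regexes is sketched below. This is not the PubMeth implementation: the alias table is a tiny hand-made example, and the variant rule (an optional hyphen or space wherever letters meet digits) is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Illustrative symbol/alias table; a real one would come from Ensembl or GeneCards.
my %aliases = (
    CDKN2A => ['CDKN2A', 'p16', 'p16INK4a', 'INK4A'],
    MLH1   => ['MLH1', 'hMLH1'],
);

# Build one regex per gene. A hyphen or space is allowed wherever letters
# meet digits, so "p16-INK4a" and "MLH-1" are also found.
my %pattern;
for my $gene (keys %aliases) {
    my @alts;
    for my $alias (@{ $aliases{$gene} }) {
        my @parts = split /(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])/, $alias;
        push @alts, join '[-\s]?', map { quotemeta } @parts;
    }
    my $alternation = join '|', @alts;
    $pattern{$gene} = qr/\b(?:$alternation)\b/i;
}

# Return the official symbols of all genes mentioned in a text.
sub genes_in {
    my ($text) = @_;
    my @hits = grep { $text =~ $pattern{$_} } keys %pattern;
    return sort @hits;
}

my $abstract = 'Promoter hypermethylation of p16-INK4a and hMLH1 was frequent in colorectal tumours.';
my @found = genes_in($abstract);
print "genes found: @found\n";
```

Case-insensitive matching increases recall but also false positives for short aliases, which is one reason a scoring scheme is applied afterwards.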
Outline

• Scripting
  Perl (Bioperl/Python)
       examples spiders/bots

• Databases
  Genome Browser
       examples biomart, galaxy

• AI
  Classification and clustering
       examples WEKA (R, Rapidminer)
                                       98
The three genome browsers

• There are three main browsers:
   Ensembl
   NCBI MapViewer
   UCSC
• At first glance their main distinguishing features are:
   MapViewer is arranged vertically.
   Ensembl has multiple (22) different “Views”.
   UCSC has a single “View” for (almost) everything.




                                                        99
MapViewer
                  Home




http://www.ncbi.nlm.nih.gov/mapview/   100
MapViewer Master Map




                       101
Selecting tracks on MapViewer




                                102
MapViewer strengths

• Good coverage of plant and fungal genomes.
• Close integration with other NCBI tools and
  databases, such as Model Maker, trace archives
  or Celera assemblies.
• Vertical view enables convenient overview of
  regional gene descriptions.
• Discontiguous MEGABLAST is probably the most
  sensitive tool available for cross-species sequence
  queries.
• Ability to view multiple assemblies (e.g. Celera
  and reference) simultaneously.

                                                   103
MapViewer limitations


• Little cross-species conservation or alignment
  data.
• Inability to upload custom annotations and data.
• Limited capability for batch data access.
• Limited support for automated database querying.
• Vertical view makes base-pair level annotation
  cumbersome.




                                                 104
UCSC Genome Browser




                      105
http://genome.ucsc.edu/




                          106
UCSC Genome Browser




                      107
Strengths of the UCSC Browser (I)

  For this course I will be focusing primarily on the
  UCSC Browser for several reasons:
• Strong comparative genomics capabilities.
• Fast response
   sequence searches performed with BLAT.
   code is written in speed-optimized C.
   Multiple indexing and non-normalized tables for fast
    database retrieval.
• (Essentially) single “view” from single base-pair to
  entire chromosome.
• Easiest interface for loading custom annotations.
                                                           108
UCSC Browser Strengths (II)

• Well suited for batch and automated querying of both
  gene and intergenic regions.
• Comprehensive: tends to have the most species,
  genes and annotations.
• Annotations frequently updated (Genbank/Refseq
  daily / ESTs weekly).
• Able to find “similar” genes easily with GeneSorter.
• Rapid access to in situ images with VisiGene.




                                                   109
UCSC browser limitations


• Lack of “overview” mode can make it harder to see
  genomic context.
• Syntenic regions cannot be viewed simultaneously.
• Cross species sequence queries with BLAT are
  often insensitive.
• Comprehensiveness of database can make user
  interface intimidating.
• Code access for commercial users requires
  licensing.


                                                 110
Human, mouse, rat synteny in MapViewer




                                        111
Browser/Database Batch
Querying




                         112
Batch querying overview


• Introduction / motivation
• UCSC table browser
• Custom tracks and frames
• Galaxy and direct SQL database
  querying
• A batch query example
• UCSC Database “gotchas”
• Batch querying on Ensembl



                                   113
Why batch querying

• Interactive querying is difficult if you want to study
  numerous “interesting” genomic regions.

• Querying each region interactively is:
   Tedious
   Time-consuming
   Error prone




                                                           114
Batch querying examples

• As an example, say you have found one hundred candidate
  polymorphisms and you want to know:
    Are they in dbSNP?
    Do they occur in any known ESTs?
    Are the sites conserved in other vertebrates?
    Are they near any ”LINE” repeat sequences?

  Of course you could repeat the procedures described in
 Part II one hundred times but that would get “old” very fast…




                                                            115
Other examples



• Other examples include characterizing multiple:
   Non-coding RNA candidates
   ultra-conserved regions
   introns hosting snoRNA genes




                                                    116
Browsers and databases

• Each of the genome browsers is built on top of
  multiple relational databases.

• Typically data for each genome assembly are stored
  in a separate database and auxiliary data, e.g. gene
  ontology (GO) data, are stored in yet other
  databases.

• These databases may have hundreds of tables,
  many with millions of entries.


                                                   117
The UCSC Table Browser

• For batch queries, you need to query the
  browser databases.

• The conventional way of querying a relational
  database is via “Structured Query Language”
  (SQL).

• However with the Table Browser, you can
  query the database without using SQL.



                                                  118
Browser Database Formats

 Nevertheless, even with the Table Browser, you need
some understanding of the underlying track, table and
file formats.
 Table formats describe how data is stored in the (relational)
  databases.
 Track formats describe how the data is presented on the
  browser.
 File formats describe how the data is stored in “flat files”,
  i.e. conventional computer files.
 Finally, for understanding the underlying computer code
  (as we will do in the last part of this tutorial) you will need to
  learn about the “C” structures which hold the data in the
  source code.
                                                                  119
Main UCSC Data Formats


 • GFF/GTF
 • BED (Browser Extensible Data)
    lists of genomic blocks
 • PSL
    RNA/DNA alignments
 • .chain
    pair-wise cross species alignments
 • .maf
    multiple genome alignments
 • .wig
    numerical data


                                          120
Custom Tracks

• Custom tracks are essentially BED, PSL or GTF files
  with formatting lines so they can be displayed on the
  browser.
• A custom track file can contain multiple tracks, which
  may be in different formats.
• Custom tracks are useful for:
   Display of regions of interest on the browser.
   Sharing custom data with others.
   Input of multiple, arbitrary regions for annotation by the Table
    Browser.
• Custom tracks can be made by the Table Browser, or
  you can make them easily yourself.

                                                                       121
Selecting custom track output




                                122
Sending custom track to browser




                                  123
Adding a custom track




                        124
Adding a custom track (II)




                             125
Custom track example

 browser position chr22:10000000-10020000
 browser hide all
 track name=clones description="Clones" visibility=3 color=0,128,0 useScore=1
 chr22 10000000 10004000 cloneA 960
 chr22 10002000 10006000 cloneB 200
 chr22 10005000 10009000 cloneC 700
 chr22 10006000 10010000 cloneD 600
 chr22 10011000 10015000 cloneE 300
 chr22 10012000 10017000 cloneF 100
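Making such a track yourself can be scripted. The sketch below writes a track line followed by BED rows from an in-memory list; the subroutine name and the choice of track attributes are illustrative, and BED coordinates are taken to be 0-based with an exclusive end.

```perl
use strict;
use warnings;

# Regions of interest: chromosome, start (0-based), end (exclusive), name, score.
my @regions = (
    ['chr22', 10000000, 10004000, 'cloneA', 960],
    ['chr22', 10002000, 10006000, 'cloneB', 200],
    ['chr22', 10005000, 10009000, 'cloneC', 700],
);

# Build the text of a custom track: one track line, then one BED row per region.
sub custom_track {
    my ($name, $description, @rows) = @_;
    my @lines = (
        qq{track name=$name description="$description" visibility=3 useScore=1},
    );
    for my $row (@rows) {
        my ($chrom, $start, $end) = @$row;
        die "bad interval for $row->[3]\n" unless $start >= 0 && $start < $end;
        push @lines, join(' ', @$row);   # fields separated by whitespace
    }
    return join("\n", @lines) . "\n";
}

my $track = custom_track('clones', 'Clones', @regions);
print $track;
```

The output can be saved to a file and uploaded, or pasted into the browser's custom track form.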




                                                       126
Limitations of the table browser


• Can be difficult to create more complex queries.
• With hundreds of tables, finding the one(s) you
  want can be confusing.
• Getting intersections or unions of genomic regions
  is often a multi-step process and can be tedious or
  error prone.
• May be slower than direct SQL query.
• Not designed for fully automated operation.



                                                    127
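The intersection step that is tedious by hand is easy to script once the regions are downloaded. A minimal sketch, assuming BED-style [start, end) intervals on a single chromosome; the coordinates are invented and the all-against-all comparison is only suitable for small lists.

```perl
use strict;
use warnings;

# Intervals are [start, end) pairs on one chromosome, BED-style.
sub intersect {
    my ($set_a, $set_b) = @_;
    my @out;
    for my $x (@$set_a) {
        for my $y (@$set_b) {
            my $start = $x->[0] > $y->[0] ? $x->[0] : $y->[0];   # larger start
            my $end   = $x->[1] < $y->[1] ? $x->[1] : $y->[1];   # smaller end
            push @out, [$start, $end] if $start < $end;          # non-empty overlap
        }
    }
    return @out;
}

my @clones = ([10000000, 10004000], [10005000, 10009000]);
my @snps   = ([10003500, 10003501], [10004500, 10004501], [10008000, 10010000]);

my @shared = intersect(\@clones, \@snps);
my $summary = join(' ', map { "$_->[0]-$_->[1]" } @shared);
print "$summary\n";
```

For genome-scale lists, sorting both sets and sweeping through them once is much faster than comparing every pair.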
Ensembl




          128
Ensembl Home   http://www.ensembl.org/




                                         129
Ensembl ContigView




                     130
Ensembl ContigView




                     131
Detail and Basepair view




                           132
Changing tracks in Ensembl




                             133
Ensembl strengths (I)

• Multiple view levels show genomic context.

• Some annotations are more complete and/or are
  more clearly presented (e.g. snpView of multiple
  mouse strain data.)

• Possible to create query over more than one genome
  database at a time (with BioMart).




                                                     134
Ensembl snpView




                  135
Ensembl strengths (II)


• Batch and automated querying well supported and
  documented (especially for perl and java).
• API (programmer interface) is designed to be
  identical for all databases in a release.
• Ensembl tends to be more “community oriented” -
  using standard, widely used tools and data formats.
• All data and code are completely free to all.




                                                  136
Ensembl is “community oriented”

 • Close alliances with Wormbase, Flybase, SGD
 • “support for easy integration with third party data and/or
   programs” – BioMart
 • Close integration with R/ Bioconductor software
 • More use of community standard formats and
   programs, e.g. DAS, GFF/GTF, Bioperl

  ( Note: UCSC also supports GFF/GTF and is
  compatible with R/Bioconductor and DAS, but UCSC
  tends to use more “homegrown” formats, e.g.
  BED, PSL, and tools.)



                                                                137
Ensembl limitations

• Limited data quantifying cross-species
  sequence conservation.
• Batch queries for intergenic regions with
  BioMart are difficult.
• BioMart offers less complete access to
  database than UCSC Table Browser.
  (However, the user interface to BioMart
  is easier.)

                                         138
BioMart

• BioMart - the Ensembl “Table browser”
• Similar to the Table Browser and Galaxy tools.
• Previous version was called EnsMart.
• Fewer tables can be accessed with BioMart than
  with UCSC Table Browser. In particular, non-gene
  oriented queries may be difficult.
• However, the user interface is simpler.
• Tight interface with Bioconductor project for
  annotation of microarray genes.


                                                 139
The Galaxy Website

• Galaxy website: http://g2.bx.psu.edu

• Galaxy objective: Provide sequence and data
  manipulation tools (a la SRS or the UCSD Biology
  Workbench) that are capable of being applied to genomic
  data.

• The intent is to provide an easy interface to numerous
  analysis tools with varied output formats that can work on
  data from multiple browsers / databases.




                                                               140
141
Demo: Galaxy Genomics Toolkit


• Galaxy is a web interface to bioinformatics tools that
  deal with genome-scale data
• There is a public server with many pre-installed tools
• Many tools work with genomic intervals
• Other tools work with various types of tab delimited
  data formats, and some directly on DNA sequences
• It has excellent tools to access public data
• It can be installed on a local computer or set up as an
  institutional server
• Can access a standard or custom build on Amazon
  “Cloud”
• Any command line tool or web service can easily be
  wrapped into the Galaxy interface.
                                                        142
Genome-Scale Data


• Bioinformatics work is challenging on
  very large “genomics” data sets
   sequencing, gene expression, variants,
    ChIPseq
• Complex command line programs
• Genome Browsers
• New tools




                                             143
The Galaxy Interface has 3 parts
                                       History =
List of Tools    Central work panel   data & results




                                                144
Load Data from UCSC




             Or upload from your computer   145
Demo: Galaxy Genomics Toolkit

• http://athos.ugent.be:8080: a Galaxy instance is running here.
• log in (as admin: new@new.be, password: newnew)
• the cleanfq history has 2 pairs of fastq files, a reference fa (FASTA) and a reference gtf




                                                                             146
Workflows


• Galaxy saves your data and results in the
  History
• The exact commands and parameters used with
  each operation with each tool are also saved.
• These operations can be saved as a
  “Workflow”, which can be reused, and shared
  with other users.




                                                  147
• Galaxy has many public
  data sets and public
  workflows, which can be
  easily used in your projects
  (or a tutorial)




                                 148
NGS tools


• Galaxy has recently been expanded with tools to
  analyze Next-Gen Sequence data
• File format conversions
• Analysis methods specific to different sequencing
  platforms (454, Illumina, SOLID)
• Analysis methods specific to different applications
  (RNA-seq, ChIP-seq, mutation finding,
  metagenomics, etc).



                                                    149
• NGS tools include file
format conversion, mapping
to reference genome,
ChIPseq peak calling, RNA-
seq gene expression, etc.

 • NGS data analysis uses
large files – slow to upload
 and slow to process on a
        public server
A number of Groups have set up custom Galaxy
servers with special tools




                                               151
The SPARQLing future




                       152
Outline

• Scripting
  Perl (Bioperl/Python)
       examples spiders/bots

• Databases
  Genome Browser
       examples biomart, galaxy

• AI
  Classification and clustering
       examples WEKA (R, Rapidminer)
                                       153
What is 'intelligent'?


• Intelligence = the ability to
  learn and understand, to solve
  problems, and to make
  decisions
  Machine learning …



                                         154
Turing test for intelligence




THE IMITATION GAME


Woman
Man/Machine
Interrogator: Which of
the two is the woman?


                                 155
What is 'artificial'?


• Artificial = synthetic = made by
  humans, not of natural
  origin
• in the context of A.I.: machines, usually
  a digital computer
• H. Simon: analogy between humans and the digital
  computer
    memory
    execution unit
    control unit


                                              156
Data mining


• WHAT? Extracting knowledge from data
• Divide the data into three groups:
   training set
   validation set
   test set
• Clustering/Classification
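A minimal sketch of the three-way split. The 60/20/20 proportions and the sample names are assumptions for illustration; the slides do not prescribe particular sizes.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split a list of samples into training, validation and test sets.
# The 60/20/20 proportions are an assumption, not taken from the slides.
sub split_data {
    my ($samples, $train_frac, $valid_frac) = @_;
    my @shuffled = shuffle @$samples;            # random order avoids ordering bias
    my $n_train  = int(@shuffled * $train_frac);
    my $n_valid  = int(@shuffled * $valid_frac);
    my @train = splice(@shuffled, 0, $n_train);
    my @valid = splice(@shuffled, 0, $n_valid);
    return (\@train, \@valid, \@shuffled);       # the remainder is the test set
}

my @samples = map { "sample$_" } 1 .. 20;
my ($train, $valid, $test) = split_data(\@samples, 0.6, 0.2);

printf "train=%d validation=%d test=%d\n",
    scalar @$train, scalar @$valid, scalar @$test;
```

The model is fitted on the training set, tuned on the validation set, and the test set is touched only once, for the final estimate of performance.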




                                        157
Clustering


• WHAT? 'unsupervised learning': the
  answer for the training data is not
  known
• The result is usually a tree structure
• Important method: hierarchical
  clustering, which starts by setting up a distance matrix




                                            158
Cluster Analysis


• Unsupervised methods
• Descriptive modeling
   Grouping of genes with “similar” expression
    profiles
   Grouping of disease tissues, cell lines, or
    toxicants with “similar” effects on gene
    expression
• Clustering algorithms
   Self-organizing maps
   Hierarchical clustering
   K-means clustering
   SVD
                                                  159
Linkage in Hierarchical Clustering

• Single linkage:
  $S(A,B) = \min_{a \in A} \min_{b \in B} d(a,b)$
• Average linkage:
  $A(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$
• Complete linkage:
  $C(A,B) = \max_{a \in A} \max_{b \in B} d(a,b)$
• Centroid linkage:
  $M(A,B) = d(\mathrm{mean}(A), \mathrm{mean}(B))$
• Hausdorff linkage:
  $h(A,B) = \max_{a \in A} \min_{b \in B} d(a,b)$
  $H(A,B) = \max(h(A,B), h(B,A))$
• Ward linkage:
  $W(A,B) = \frac{|A|\,|B|\,M(A,B)^2}{|A| + |B|}$

                                                 160
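To make the first three linkage definitions concrete, here is a small sketch that computes them for two toy clusters. Euclidean distance is assumed for d(a,b), and the points are invented.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Euclidean distance between two points given as array references.
sub dist {
    my ($p, $q) = @_;
    return sqrt(sum(map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p));
}

# All pairwise distances d(a,b) with a in cluster A and b in cluster B.
sub pairwise {
    my ($A, $B) = @_;
    return map { my $a_pt = $_; map { dist($a_pt, $_) } @$B } @$A;
}

sub single_linkage   { return min(pairwise(@_)) }                # closest pair
sub complete_linkage { return max(pairwise(@_)) }                # farthest pair
sub average_linkage  { my @d = pairwise(@_); return sum(@d) / @d }

my @A = ([0, 0], [0, 1]);
my @B = ([3, 0], [4, 0]);

printf "single=%.3f complete=%.3f average=%.3f\n",
    single_linkage(\@A, \@B), complete_linkage(\@A, \@B), average_linkage(\@A, \@B);
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage value, so the choice of linkage directly shapes the resulting tree.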
Hierarchical Clustering




                          3 clusters?
                          2 clusters?

                                        161
Classification


• WHAT? 'supervised learning': the answer
  for the training data is known
• Several classification methods:
    decision tree
    neural networks
    support vector machines




                                          162
Decision tree

          Example: tennis




                              163
Neural networks


STRUCTURE: neurons and connections
TASK:
processing input data
machine learning




                                 164
Support Vector Machines


Achieving a linear separation of the data
by transforming its dimensions




                                                   165
Bioinformatics applications


• Decision tree: searching for DNA sequences
  homologous to a given DNA sequence
• Neural networks: modelling and analysing
  gene expression data, predicting
  protease cleavage sites
• Support Vector Machines: identifying
  genes involved in anti-cancer mechanisms,
  detecting homology between proteins,
  gene expression analysis


                                                 166
Bioinformatics applications


• Hierarchical clustering: building phylogenetic
  trees from DNA sequences
• Genetic algorithms: molecular recognition,
  elucidating the relation between structure and function,
  Multiple Sequence Alignment
• Expert systems: detecting injuries,
  early detection of heart valve abnormalities
• Fuzzy logic: primer design, predicting the
  function of an unknown gene, expression
  analysis


                                                    167
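Classification

[Figure: a mixed set of samples (C: cancer, N: normal) is passed to the OMS classifier, which separates them into a group of cancer samples and a group of normal samples.]

                                        169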
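Classification

[Figure: a mixed set of samples (R: responder, N: non-responder) is passed to the OMS classifier, which separates them into a group of responders and a group of non-responders.]

                                        170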
OMS Classifier using “Methylation”

               Patient




              Sample


         Measuring Methylation

       Gene              Gen 1 Gen 2 Gen 3         …   Gen n
       Methylated          +     -     -           …     +


                               OMS
                             classifier


                  Cancer                  Normal
                                                               172
Why use methylation as a biomarker?


• What is a feature/biomarker?
   - A characteristic that is objectively
     measured and evaluated as an indicator
     of normal biological processes,
     pathogenic processes, or pharmacologic
     responses to a therapeutic intervention


• Business/biological feature
  selection/reduction
   - Of all possible (molecular and clinical)
     features, OncoMethylome measures
     methylation (in cancer/onco)
                                               173
Data preparation and modelling



• Data preparation
   - Construct binary features « Methylated » from
     PCR data (Ct and Temp)



• Modelling
   - Construct classifier (cancer vs normal) from
     features « Methylated »




                                                    175
Data Preparation: Feature Construction

                           Sample


                             Methylation Specific
                              Quantitative PCR

             Gene           Gen 1 Gen 2 Gen 3        …     Gen n
             Temp            78    81    69          …      72
             Ct              25    38    24          …      27

               Feature construction: “gene Methylated in sample”

             Gene           Gen 1 Gen 2 Gen 3         …    Gen n
             Methylated       +     -     -           …      +
              Compute « methylated » as function of Temp and Ct
                                                                   176
Construction of features « Methylated »


• Per gene: find boolean function
   - Methylated if and only if:
     Ct below upper bound AND
     Temp above lower bound


• Taking into account
   - All Ct and Temp measurements
       - Methylation Specific Quantitative PCR (QMSP) for
         normals and cancers

   - Noise in QMSP measurements
       - As observed per gene during Quality Control


                                                        177
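The per-gene boolean function can be sketched as follows. The Ct and Temp values are the ones shown on the feature construction slide; the cut-offs (`CT_UPPER`, `TEMP_LOWER`) are hypothetical placeholders chosen only for illustration, since in practice they are determined per gene from the measurements and the observed noise.

```python
def is_methylated(ct, temp, ct_upper, temp_lower):
    """Methylated if and only if Ct is below its upper bound
    AND Temp is above its lower bound."""
    return ct < ct_upper and temp > temp_lower


# (Ct, Temp) per gene, as listed on the feature construction slide
measurements = {
    "Gen 1": (25, 78),
    "Gen 2": (38, 81),
    "Gen 3": (24, 69),
    "Gen n": (27, 72),
}

# Hypothetical cut-offs, shared by all genes only to keep the sketch short
CT_UPPER = 35
TEMP_LOWER = 70

features = {
    gene: is_methylated(ct, temp, CT_UPPER, TEMP_LOWER)
    for gene, (ct, temp) in measurements.items()
}
for gene, methylated in features.items():
    print(gene, "+" if methylated else "-")
```

With these placeholder cut-offs the sketch reproduces the +, -, -, + pattern of the slide: Gen 2 fails on Ct and Gen 3 fails on Temp.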
Construction of features « Methylated »

Plot of all Ct and Temp measurements for a given gene

[Figure: scatter plot with Ct on the horizontal axis and Temp on the vertical axis]

                                                 What about noise?
                                                                     178
Noise


   - Noise: random error or variance in a
     measured variable
   - Incorrect attribute values may be due to
       - Quantity not correctly compared to calibration
         (e.g., ruler slips)
       - Inaccurate calibration device (e.g., ruler > 1m)
       - Precision (e.g., truncated to nearest mile or Ångstrom unit)
       - Data entry problems
       - Data transmission problems
       - Inconsistency in naming convention

                                                                       179
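Construction of features « Methylated »
        Taking into account noise

QC: StdDev of Ct and Tm in IVM

[Figure: two assays, each with cancer samples on one side of the cut-off and normal samples on the other. Non-robust assay: StDev 3.5 and StDev 1.6. Robust assay: StDev 0.3 and StDev 0.02.]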
                                                                                          180
Construction of features « Methylated »
  Taking into account noise

[Figure: four panels, each marking the « Methylated » region. Columns: good reproducibility, bad reproducibility. Rows: blunt cut-off, sharp cut-off.]
                                                                          181
Construction of features « Methylated »
 Taking into account noise

Find most robust cut-off for each gene
  Compute quality with increasing noise levels (0-2 times StdDev)

[Figure: two plots of Quality (0 to 1) against noise level (0 to 2 StdDev), one for a non-robust cut-off and one for a robust cut-off]

Quality score based on binomial test
  - 16 or more successes with 44 trials: likely
    when probability of success = 77/175
    (expected nr of successes = 19)
  - 46 or more successes with 58 trials: unlikely
    when probability of success = 80/179
    (expected nr of successes = 21)
                                                                                                       182
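The two binomial statements can be checked with a one-sided tail probability. This is a sketch of the test itself, using the trial counts and success probabilities quoted on the slide; how the slides turn this probability into a quality score is not stated, so nothing beyond the tail probability is computed here.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Numbers quoted on the slide
likely = binomial_tail(16, 44, 77 / 175)
unlikely = binomial_tail(46, 58, 80 / 179)

print(f"P(X >= 16 | n=44, p=77/175) = {likely:.3f}")
print(f"P(X >= 46 | n=58, p=80/179) = {unlikely:.2e}")
```

A large tail probability means the observed count is compatible with chance; a very small one means the count is hard to explain by chance alone.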
Construction of features « Methylated »

  Methylated: inside red box




                                          183
Construction of features « Methylated »

[Figure: methylation table with cancer and normal samples against ranked genes; each cell is marked methylated or unmethylated]




                                               184
Data preparation and modelling


• Data preparation
   - Construct binary features « Methylated » from
     PCR data (Ct and Temp)



• Modelling
   - Construct classifier (cancer vs normal) from
     « Methylated » features




                                                    185
Selection of modelling technique


• In theory, many techniques applicable
   - Data type: boolean methylation table, discrete
     classes
   - See other talks today
• But, additional requirements follow from
  business understanding (more details below)
   - Feature selection
       - Final test should be based on at most ~5 genes
   - Understandability
   - Both provide a direct competitive advantage
• Example of acceptable technique: decision
  trees
                                                      186
Decision trees
  The Weka tool
@relation weather.symbolic

@attribute   outlook {sunny, overcast, rainy}
@attribute   temperature {hot, mild, cool}
@attribute   humidity {high, normal}
@attribute   windy {TRUE, FALSE}
@attribute   play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no



  http://www.cs.waikato.ac.nz/ml/weka/
                                                187
Decision trees
  Attribute selection
      outlook    temperature   humidity   windy       play
      sunny      hot           high       FALSE       no
      sunny      hot           high       TRUE        no
      overcast   hot           high       FALSE       yes
      rainy      mild          high       FALSE       yes
      rainy      cool          normal     FALSE       yes
      rainy      cool          normal     TRUE        no
      overcast   cool          normal     TRUE        yes
      sunny      mild          high       FALSE       no
      sunny      cool          normal     FALSE       yes
      rainy      mild          normal     FALSE       yes
      sunny      mild          normal     TRUE        yes
      overcast   mild          high       TRUE        yes
      overcast   hot           normal     FALSE       yes
      rainy      mild          high       TRUE        no

  Select the attribute with maximal gain of information, i.e. maximal reduction of entropy.
  With $p_{yes} = 9/14$ (play) and $p_{no} = 5/14$ (don't play), the entropy of the full table is

  $$H = -p_{yes}\log_2 p_{yes} - p_{no}\log_2 p_{no} = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.94 \text{ bits}$$

http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
                                                                                                          188
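The entropy calculation above can be reproduced in a few lines. This is a minimal sketch; the function name `entropy` and its interface (a list of class counts) are choices made for this example.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing (0 * log 0 is taken as 0)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


root = entropy([9, 5])  # 9 x play, 5 x don't play
print(f"{root:.2f} bits")
```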
Decision trees
   Attribute selection

   Entropy of the full table: 0.94 bits

   Amount of information required to specify the class of an example given that it reaches a node:

   attribute     value      play   don't play   entropy      weight
   outlook       sunny      2      3            0.97 bits    5/14
   outlook       overcast   4      0            0.0 bits     4/14
   outlook       rainy      3      2            0.97 bits    5/14
   humidity      high       3      4            0.98 bits    7/14
   humidity      normal     6      1            0.59 bits    7/14
   temperature   hot        2      2            1.0 bits     4/14
   temperature   mild       4      2            0.92 bits    6/14
   temperature   cool       3      1            0.81 bits    4/14
   windy         FALSE      6      2            0.81 bits    8/14
   windy         TRUE       3      3            1.0 bits     6/14

   attribute     weighted entropy   gain
   outlook       0.69 bits          0.25 bits
   humidity      0.79 bits          0.15 bits
   temperature   0.91 bits          0.03 bits
   windy         0.89 bits          0.05 bits
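The gains above follow from the class counts per attribute value. A minimal sketch, assuming the counts (play, don't play) are entered by hand from the slide; the names `information_gain` and `splits` are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder


# (play, don't play) counts per attribute value, taken from the slide
splits = {
    "outlook": [(2, 3), (4, 0), (3, 2)],      # sunny, overcast, rainy
    "humidity": [(3, 4), (6, 1)],             # high, normal
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
    "windy": [(6, 2), (3, 3)],                # FALSE, TRUE
}

gains = {name: information_gain((9, 5), children) for name, children in splits.items()}
best = max(gains, key=gains.get)

for name, g in gains.items():
    print(f"{name:12s} gain: {g:.2f} bits")
print("split on:", best)
```

The attribute with the largest gain, outlook, becomes the root of the tree.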
Decision trees
Attribute selection: the "sunny" branch of outlook (entropy 0.97 bits)

   outlook    temperature   humidity   windy   play
   sunny      hot           high       FALSE   no
   sunny      hot           high       TRUE    no
   sunny      mild          high       FALSE   no
   sunny      cool          normal     FALSE   yes
   sunny      mild          normal     TRUE    yes

   attribute     value    entropy      weight
   humidity      high     0.0 bits     3/5
   humidity      normal   0.0 bits     2/5
   temperature   hot      0.0 bits     2/5
   temperature   mild     1.0 bits     2/5
   temperature   cool     0.0 bits     1/5
   windy         false    0.92 bits    3/5
   windy         true     1.0 bits     2/5

   attribute     weighted entropy   gain
   humidity      0.0 bits           0.97 bits
   temperature   0.40 bits          0.57 bits
   windy         0.95 bits          0.02 bits
Decision trees
Attribute selection: the "rainy" branch of outlook (entropy 0.97 bits)

   outlook   temperature    humidity   windy    play
   rainy     mild           high       FALSE    yes
   rainy     cool           normal     FALSE    yes
   rainy     cool           normal     TRUE     no
   rainy     mild           normal     FALSE    yes
   rainy     mild           high       TRUE     no

   attribute     value    entropy      weight
   humidity      high     1.0 bits     2/5
   humidity      normal   0.92 bits    3/5
   temperature   mild     0.92 bits    3/5
   temperature   cool     1.0 bits     2/5
   windy         false    0.0 bits     3/5
   windy         true     0.0 bits     2/5

   attribute     weighted entropy   gain
   humidity      0.95 bits          0.02 bits
   temperature   0.95 bits          0.02 bits
   windy         0.0 bits           0.97 bits
Decision trees
final tree

outlook
  sunny    -> humidity
                high   -> don't play
                normal -> play
  overcast -> play
  rainy    -> windy
                false  -> play
                true   -> don't play

                                                                 192
Decision trees
 Basic algorithm



• Initialize top node to all examples
• While impure leaves available
   - select next impure leaf L
   - find splitting attribute A with maximal information gain
   - for each value of A add child to L




                                                               193
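The basic algorithm can be sketched on the weather data from the ARFF file shown earlier. This is an ID3-style sketch written recursively rather than with an explicit list of impure leaves; it is not Weka's implementation, and the function names are chosen for this example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".strip().splitlines()
DATA = [dict(zip(ATTRIBUTES + ["play"], line.split(","))) for line in ROWS]


def entropy(rows):
    counts = Counter(r["play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


def id3(rows, attributes):
    classes = set(r["play"] for r in rows)
    if len(classes) == 1:            # pure leaf: return its class
        return classes.pop()
    if not attributes:               # nothing left to split on: majority class
        return Counter(r["play"] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest)
                   for value in sorted(set(r[best] for r in rows))}}


tree = id3(DATA, ATTRIBUTES)
print(tree)
```

On this data the sketch yields the tree of the previous slide: outlook at the root, humidity under sunny, and windy under rainy.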
Decision tree built from methylation table



                                             Leave-one-out experiment
                                                To avoid overfitting




                                                       Decision tree:
                                                  Test based on 12 genes




          Sensitivity: 80%

          Specificity: 88%
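A leave-one-out experiment holds out each sample once, trains on the rest, and predicts the held-out sample; sensitivity and specificity are then computed over all held-out predictions. The sketch below uses a deliberately simple one-gene learner (`train_stump`) and a made-up methylation table; neither is the decision tree or the data behind the figures above.

```python
def train_stump(samples, labels):
    """Pick the gene whose methylation status best matches 'cancer'
    on the training fold (ties go to the lowest gene index)."""
    def hits(g):
        return sum((s[g] == 1) == (y == "cancer") for s, y in zip(samples, labels))
    return max(range(len(samples[0])), key=hits)


def predict(gene, sample):
    return "cancer" if sample[gene] == 1 else "normal"


def leave_one_out(samples, labels):
    predictions = []
    for i in range(len(samples)):
        # Train on everything except sample i, then predict sample i
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        gene = train_stump(train_x, train_y)
        predictions.append(predict(gene, samples[i]))
    return predictions


def sensitivity_specificity(labels, predictions):
    pairs = list(zip(labels, predictions))
    tp = sum(y == "cancer" and p == "cancer" for y, p in pairs)
    fn = sum(y == "cancer" and p == "normal" for y, p in pairs)
    tn = sum(y == "normal" and p == "normal" for y, p in pairs)
    fp = sum(y == "normal" and p == "cancer" for y, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical methylation table: one row per sample, one column per gene
samples = [(1, 0), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]

predictions = leave_one_out(samples, labels)
sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy table every fold selects the first gene, giving a sensitivity of 1.0 and a specificity of 2/3.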

                                                                        194
Evaluation and deployment


• Decide whether to use classification results
   - Can we use the 12-gene decision tree for classifying
     new patients?
• Verification of all steps
   - Exercise. The above modelling procedure contains
     a classical mistake: the test sets used for
     cross-validation (see leave-one-out) have actually been
     used for training the model. How? (Weka is not to blame)
     And how can we fix this?
• Check whether business goals have been met
   - No: test based on 12 genes not useful (max ~5)
   - Iteration required
                                                              196
Attempt to rebuild decision tree
  with at most ~5 genes



                                                 Minimal leaf size
                                                 Increased to 12




                                          New Decision tree:
                                        Test based on 4 genes




Sensitivity decreased from 80% to 64%

Specificity increased from 88% to 90%
                                                           197
Evaluation and deployment
The impact of « cost »


• Market conditions, cost of goods &
  royalty structure can limit the number
  of genes that can be tested




                                           198
Evaluation and deployment
The importance of « understandability »




Pre and postmarket requirements imposed for IVDMIA (510k etc)

Understandability (NO black boxes) is becoming an important asset



                                                                    200
Outline

• Scripting
  Perl (Bioperl/Python)
       examples spiders/bots

• Databases
  Genome Browser
       examples biomart, galaxy

• AI
  Classification and clustering
       examples WEKA (R, Rapidminer)
                                       201
WEKA:: Introduction


• A collection of open source ML
  algorithms
   - pre-processing
   - classifiers
   - clustering
   - association rules
• Created by researchers at the
  University of Waikato in New Zealand
• Java based


                                         202
WEKA:: Installation


• Download software from
  http://www.cs.waikato.ac.nz/ml/weka/
   - If you are interested in
     modifying/extending weka there is a
     developer version that includes the
     source code
• Set the weka environment variable for
  java
   - setenv WEKAHOME /usr/local/weka/weka-3-0-2
   - setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
• Download some ML data from
  http://mlearn.ics.uci.edu/MLRepository.html
                                                203
204
Main GUI


• Three graphical user interfaces
   - "The Explorer" (exploratory data
     analysis)
   - "The Experimenter" (experimental
     environment)
   - "The KnowledgeFlow" (new process
     model inspired interface)




                                       205
Explorer: pre-processing the data


• Data can be imported from a file in
  various formats: ARFF, CSV, C4.5,
  binary
• Data can also be read from a URL or
  from an SQL database (using JDBC)
• Pre-processing tools in WEKA are
  called “filters”
• WEKA contains filters for:
   - Discretization, normalization, resampling,
     attribute selection, transforming and
     combining attributes, …
                                                 206
WEKA only deals with “flat” files


@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
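To make the flat-file structure concrete, the following sketch reads such a file into Python. It is a simplified reader written for this example, not Weka's own parser: it ignores quoting, attribute types and the sparse ARFF format, and keeps every value as a string, with `?` mapped to `None`.

```python
def parse_arff(text):
    """Return (relation name, attribute names, rows as dicts)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        keyword = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append({name: (None if v == "?" else v)
                         for name, v in zip(attributes, values)})
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])  # keep only the name
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


ARFF = """
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(ARFF)
print(relation, attributes)
print(rows[3])
```

Each data line is one instance and each column one attribute, which is what "flat" means here: no nested or relational structure.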

                                                                             207
Explorer: building “classifiers”


• Classifiers in WEKA are models for
  predicting nominal or numeric
  quantities
• Implemented learning schemes
  include:
   - Decision trees and lists, instance-based
     classifiers, support vector machines,
     multi-layer perceptrons, logistic
     regression, Bayes' nets, …



                                               230
Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis)

   age     income   student   credit_rating   buys_computer
   <=30    high     no        fair            no
   <=30    high     no        excellent       no
   31…40   high     no        fair            yes
   >40     medium   no        fair            yes
   >40     low      yes       fair            yes
   >40     low      yes       excellent       no
   31…40   low      yes       excellent       yes
   <=30    medium   no        fair            no
   <=30    low      yes       fair            yes
   >40     medium   yes       fair            yes
   <=30    medium   yes       excellent       yes
   31…40   medium   no        excellent       yes
   31…40   high     yes       fair            yes
   >40     medium   no        excellent       no
                                                                 231
Output: A Decision Tree for "buys_computer"


age?
  <=30   -> student?
              no        -> no
              yes       -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes

                                                                 232
Explorer: finding associations


• WEKA contains an implementation of
  the Apriori algorithm for learning
  association rules
   - Works only with discrete data
• Can identify statistical dependencies
  between groups of attributes:
   - milk, butter => bread, eggs (with
     confidence 0.9 and support 2000)
• Apriori can compute all rules that
  have a given minimum support and
  exceed a given confidence
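Support and confidence of a single rule can be computed directly from a list of transactions. The sketch below shows only these two measures, not the Apriori search over candidate rules; support is counted as the number of matching transactions, and the transactions are made up for the example.

```python
def support_count(transactions, items):
    """Number of transactions that contain all the given items."""
    return sum(1 for t in transactions if items <= t)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent
    that also contain the consequent."""
    both = support_count(transactions, antecedent | consequent)
    return both / support_count(transactions, antecedent)


# Hypothetical shopping baskets
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"eggs"},
]

antecedent = {"milk", "butter"}
consequent = {"bread", "eggs"}

rule_support = support_count(transactions, antecedent | consequent)
rule_confidence = confidence(transactions, antecedent, consequent)
print(f"milk, butter => bread, eggs: support {rule_support}, confidence {rule_confidence:.2f}")
```

Apriori keeps only the rules whose support and confidence exceed user-chosen minimum values, and avoids enumerating item sets that cannot reach the minimum support.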
                                          258
Explorer: data visualization


 • Visualization very useful in practice:
   e.g. helps to determine difficulty of the
   learning problem
 • WEKA can visualize single attributes
   (1-d) and pairs of attributes (2-d)
    - To do: rotating 3-d visualizations (Xgobi-style)
 • Color-coded class values
 • "Jitter" option to deal with nominal
   attributes (and to detect "hidden" data
   points)
 • "Zoom-in" function
                                                               259
2018 03 27_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 

Plus de Prof. Wim Van Criekinge (20)

2020 02 11_biological_databases_part1
2020 02 11_biological_databases_part12020 02 11_biological_databases_part1
2020 02 11_biological_databases_part1
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload
 
P7 2018 biopython3
P7 2018 biopython3P7 2018 biopython3
P7 2018 biopython3
 
P6 2018 biopython2b
P6 2018 biopython2bP6 2018 biopython2b
P6 2018 biopython2b
 
P4 2018 io_functions
P4 2018 io_functionsP4 2018 io_functions
P4 2018 io_functions
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 
T1 2018 bioinformatics
T1 2018 bioinformaticsT1 2018 bioinformatics
T1 2018 bioinformatics
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
 
2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload
 
2018 03 20_biological_databases_part3
2018 03 20_biological_databases_part32018 03 20_biological_databases_part3
2018 03 20_biological_databases_part3
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
 
P7 2017 biopython3
P7 2017 biopython3P7 2017 biopython3
P7 2017 biopython3
 
P6 2017 biopython2
P6 2017 biopython2P6 2017 biopython2
P6 2017 biopython2
 
Van criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotechVan criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotech
 

Dernier

Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 

Dernier (20)

Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 

2012 12 12_adam_v_final

  • 1. Bioinformatics Prof. Wim Van Criekinge 18th december 2012, VUmc, Amsterdam
  • 2. Outline • Scripting: Perl (Bioperl/Python), examples spiders/bots • Databases: Genome Browser, examples biomart, galaxy • AI: Classification and clustering, examples WEKA (R, Rapidminer)
  • 3. Bioinformatics, a life science discipline … Math (Molecular) Informatics Biology
  • 4. Bioinformatics, a life science discipline … Math Computer Science Theoretical Biology (Molecular) Informatics Biology Computational Biology
  • 5. Bioinformatics, a life science discipline … Math Computer Science Theoretical Biology Bioinformatics (Molecular) Informatics Biology Computational Biology
  • 6. Bioinformatics, a life science discipline … management of expectations Math Computer Science Theoretical Biology NP AI, Image Analysis Datamining structure prediction (HTX) Bioinformatics Interface Design Expert Annotation Sequence Analysis (Molecular) Informatics Biology Computational Biology
  • 7. Bioinformatics, a life science discipline … management of expectations Math Computer Science Theoretical Biology NP AI, Image Analysis Datamining structure prediction (HTX) Bioinformatics Discovery Informatics – Computational Genomics Interface Design Expert Annotation Sequence Analysis (Molecular) Informatics Biology Computational Biology
  • 10. What is Perl? • Perl is a high-level scripting language • Larry Wall created Perl in 1987: Practical Extraction and Report Language (or Pathologically Eclectic Rubbish Lister) • Born from a system administration tool • Faster than sh or csh • Slower than C • No need for sed, awk, tr, wc, cut, … • Perl is open and free • http://conferences.oreillynet.com/eurooscon/
  • 11. What is Perl? • Perl is available for most computing platforms: all flavors of UNIX (Linux), MS-DOS/Win32, Macintosh, VMS, OS/2, Amiga, AS/400, Atari • Perl is a computer language that is: interpreted, compiled at run-time (needs perl.exe!); loosely "typed"; string/text oriented; capable of using multiple syntax formats • In Perl, "there's more than one way to do it"
  • 12. Why use Perl for bioinformatics? • Ease of use by novice programmers • Flexible language: fast software prototyping (quick and dirty creation of small analysis programs) • Expressiveness: compact code, Perl poetry: @{$_[$#_]||[]} • Glutility: read disparate files and parse the relevant data into a new format • Powerful pattern matching via "regular expressions" (Best Regular Expressions on Earth) • With the advent of the WWW, Perl has become the language of choice to create Common Gateway Interface (CGI) scripts that handle form submissions and create compute servers on the WWW • Open source, free; availability of Perl modules for bioinformatics and the Internet
  • 13. Why NOT use Perl for bioinformatics? • Some tasks are still better done with other languages (heavy computations / graphics) → C(++), C#, Fortran, Java (Pascal, Visual Basic) • With Perl you can write simple programs fast, and it is also suitable for large and complex programs; yet it is not adequate for very large projects → Python • Larry Wall: "For programmers, laziness is a virtue"
  • 14. What bioinformatics tasks are suited to Perl? • Sequence manipulation and analysis • Parsing results of sequence analysis programs (Blast, Genscan, Hmmer etc.) • Parsing database (e.g. GenBank) files • Obtaining multiple database entries over the internet • …
  • 15. Perl installation • Perl is available for various operating systems. To download Perl and install it on your computer, have a look at the following resources: www.perl.com (O'Reilly), downloading Perl software; ActiveState, ActivePerl for Windows as well as for Linux and Solaris (ActivePerl binary packages); CPAN • PHPTriad: contains Apache/PHP and MySQL: http://sourceforge.net/projects/phptriad
  • 16. Check installation • Command-line flags for perl:
      perl -v     gives the current version of Perl
      perl -e     executes Perl statements from the command line
        perl -e "print 42;"
        perl -e "print \"Two\nlines\n\";"
      perl -we    executes and prints warnings
        perl -we "print 'hello';x++;"
  • 17. TextPad • Syntax highlighting • Run program (prompt for parameters) • Show line numbers • Clip-ons for web with Perl syntax • …
  • 18. Customize TextPad part 1: Create Document Class
  • 20. Customize TextPad part 2: Add Perl to "Tools Menu"
  • 21. Unzip to TextPad samples directory
  • 22. General Remarks • Perl is mostly a free-format language: add spaces, tabs or new lines wherever you want • For clarity, it is recommended to write each statement on a separate line and to use indentation in nested structures • Comments: anything from the # sign to the end of the line is a comment (there are no multi-line comments) • A Perl program consists of all of the Perl statements of the file taken collectively as one big routine to execute
  • 23. Three Basic Data Types • Scalars - $ • Arrays of scalars - @ • Associative arrays of scalars, or hashes - %
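The three data types listed above can be sketched in a few lines. This is only an illustration; the variable names and values are made up for the example and do not come from the slides.

```perl
use strict;
use warnings;

# Scalar: a single value (number or string)
my $gene = "BRCA1";

# Array: an ordered list of scalars
my @bases = ("A", "C", "G", "T");

# Hash: key/value pairs (here: two illustrative codon-to-amino-acid entries)
my %codon = ("ATG" => "Met", "TGG" => "Trp");

print "gene: $gene\n";
print "second base: $bases[1]\n";      # a single element is a scalar, so it uses $; index starts at 0
print "ATG encodes: $codon{ATG}\n";    # a single hash value also uses $, with {} around the key
print "number of bases: ", scalar(@bases), "\n";
```

Note that the sigil follows the thing being accessed: `@bases` is the whole array, `$bases[1]` is one scalar element of it.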
  • 24. 2+2 = ?
      $a = 2;
      $b = 2;
      $c = $a + $b;
    $ indicates a scalar variable; ; ends every statement; = assigns a value to a variable.
    Other expressions:
      $c = 2 + 2;
      $c = 2 * 2;
      $c = 2 / 2;
      $c = 2 ** 4;    # exponentiation: 2 to the power 4 = 16 (in Perl, ^ is bitwise XOR, so 2 ^ 4 gives 6)
      $c = 1.35 * 2 - 3 / (0.12 + 1);
  • 25. Ok, $c is 4. How do we know it?
      $c = 4;
      print "$c";
    The print command: double quotes " " enclose the output expression.
      print "Hello \n";
    \n prints an end-of-line character (equivalent to pressing 'Enter').
    String concatenation:
      print "Hello everyone\n";
      print "Hello" . " everyone" . "\n";
    Expressions and strings together:
      print "2 + 2 = " . (2+2) . "\n";    # prints: 2 + 2 = 4
  • 26. Loops and cycles (for statement):
      # Output all the numbers from 1 to 100
      for ($n=1; $n<=100; $n+=1) {
          print "$n \n";
      }
    1. Initialization: for ( $n=1 ; ; ) { … }
    2. Increment: for ( ; ; $n+=1 ) { … }
    3. Termination (the loop repeats while the condition is true): for ( ; $n<=100 ; ) { … }
    4. Body of the loop, the commands inside curly brackets: for ( ; ; ) { … }
  • 27. FOR & IF: all the even numbers from 1 to 100:
      for ($n=1; $n<=100; $n+=1) {
          if (($n % 2) == 0) {
              print "$n\n";
          }
      }
    Note: $a % $b is the modulus, the remainder when $a is divided by $b.
  • 28. Two brief diversions (warnings & strict) • use warnings; makes Perl report questionable constructs • strict forces you to 'declare' a variable the first time you use it; usage: use strict; (somewhere near the top of your script) • Declare variables with 'my'; usage: my $variable; or: my $variable = 'value'; • my sets the 'scope' of the variable: the variable exists only within the current block of code • use strict and my both help you to find errors and prevent mistakes
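The scoping rule for `my` described above can be illustrated with a minimal sketch. The variable name `$count` is an assumption made for the example.

```perl
use strict;
use warnings;

my $count = 1;            # declared at file level

{
    my $count = 10;       # a new, separate variable that exists only inside this block
    $count++;
    print "inside block: $count\n";
}

print "outside block: $count\n";   # the outer variable was never touched
```

Without `use strict`, a misspelled variable name would silently create a new global variable; with it, Perl refuses to compile the script.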
  • 29. Text Processing Functions: the substr function • Definition: the substr function extracts a substring out of a string and returns it. The function receives 3 arguments: a string value, a position in the string (counting from 0) and a length • Example: $a = "university"; $k = substr($a, 3, 5); $k is now "versi"; $a remains unchanged • If the length is omitted, everything to the end of the string is returned
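As a sketch of how substr might be used on sequence data, the snippet below extracts the first codon of a short DNA string. The sequence is a made-up example, not taken from the slides.

```perl
use strict;
use warnings;

my $dna = "ATGGCCTAA";                  # hypothetical example sequence

my $first_codon = substr($dna, 0, 3);   # 3 characters starting at position 0
my $rest        = substr($dna, 3);      # length omitted: everything to the end

print "first codon: $first_codon\n";
print "rest: $rest\n";
print "original is unchanged: $dna\n";
```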
  • 30. Random • $x = rand(1); • srand: The default seed for srand, which used to be time, has been changed. Now it's a heady mix of difficult-to-predict system-dependent values, which should be sufficient for most everyday purposes. Previous to version 5.004, calling rand without first calling srand would yield the same sequence of random numbers on most or all machines. Now, when Perl sees that you're calling rand and haven't yet called srand, it calls srand with the default seed. You should still call srand manually if your code might ever be run on a pre-5.004 system, of course, or if you want a seed other than the default
  • 31. Demo/Example • Exercise: how good are the random numbers? • If they are good, you can use them to calculate Pi … • A good random generator is important for good random sequences that we can later use in simulations
  • 32. Calculate Pi using two random numbers (figure: x and y axes, scaled to 1)
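One common way to do this exercise is a Monte Carlo estimate: draw two random numbers x and y between 0 and 1, and count how often the point (x, y) falls inside the quarter circle of radius 1. The fraction of such points approaches Pi/4. The sketch below assumes this is the intended method; the function name, sample size and seed are arbitrary choices for the example.

```perl
use strict;
use warnings;

# Estimate Pi by sampling random points in the unit square and counting
# how many fall inside the quarter circle of radius 1.
sub estimate_pi {
    my ($n) = @_;
    my $inside = 0;
    for (1 .. $n) {
        my $x = rand(1);
        my $y = rand(1);
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return 4 * $inside / $n;
}

srand(42);                                 # fixed, arbitrary seed so runs are repeatable
my $pi_estimate = estimate_pi(100_000);
print "Pi is approximately $pi_estimate\n";
```

The estimate converges slowly (the error shrinks with the square root of the number of samples), so a poor random generator or a small sample shows up as a visibly wrong value.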
  • 33. Introduction: Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi. http://www.angelfire.com/wa/hurben/buff.html In PostScript you send it to the printer … PostScript has no variables but "stacks"; you can mimic this in Perl by recursively loading and rewriting a subroutine
  • 35. Programming • Variables • Flow control (if, regex …) • Loops • Input/output • Subroutines/objects
  • 36. What is a regular expression? • A regular expression (regex) is simply a way of describing text • Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text • Regular expressions can be very broad (describing everything) or very narrow (describing only one pattern)
  • 38. Regular Expression Review • A regular expression (regex) is a way of describing text • Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text • You can group or quantify atoms to describe your pattern • Always use the bind operator (=~) to apply your regular expression to a variable
  • 39. Why would you use a regex? • Often you wish to test a string for the presence of a specific character, word, or phrase • Examples: "Are there any letter characters in my string?" "Is this a valid accession number?" "Does my sequence contain a start codon (ATG)?"
  • 40. Regular Expressions: match to a sequence of characters • The EcoRI restriction enzyme cuts at the consensus sequence GAATTC. To find out whether a sequence contains a restriction site for EcoRI, write: if ($sequence =~ /GAATTC/) { ... };
  • 41. Regex-style
      [m]/PATTERN/[g][i][o]
      s/PATTERN/PATTERN/[g][i][e][o]
      tr/PATTERNLIST/PATTERNLIST/[c][d][s]
  • 42. Regular Expressions: match to a character class
    Example
      • The BstYI restriction enzyme cuts at the consensus sequence rGATCy, namely A or G in the first position, then GATC, and then T or C. To find out whether a sequence contains a restriction site for BstYI, write:
        if ($sequence =~ /[AG]GATC[TC]/) {...};      # matches all of AGATCT, GGATCT, AGATCC, GGATCC
    Definition
      • When a list of characters is enclosed in square brackets [], one and only one of these characters must be present at the corresponding position of the string in order for the pattern to match. You may specify a range of characters using a hyphen -.
      • A caret ^ at the front of the list negates the character class.
    Examples
        if ($string =~ /[AGTC]/) {...};              # matches any nucleotide
        if ($string =~ /[a-z]/) {...};               # matches any lowercase letter
        if ($string =~ /chromosome[1-6]/) {...};     # matches chromosome1, chromosome2 ... chromosome6
        if ($string =~ /[^xyzXYZ]/) {...};           # matches any character except x, X, y, Y, z, Z
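The BstYI character-class pattern above can be tried on the four sites it is meant to match. In this sketch the fifth candidate, CGATCT, is a made-up non-site added only for contrast.

```perl
use strict;
use warnings;

# The first four hexamers are the BstYI sites listed above;
# the last one is a hypothetical non-site added for contrast.
my @candidates = qw(AGATCT GGATCT AGATCC GGATCC CGATCT);

my @hits;
foreach my $site (@candidates) {
    if ($site =~ /[AG]GATC[TC]/) {
        push @hits, $site;
        print "$site matches\n";
    } else {
        print "$site does not match\n";
    }
}
```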
  • 43. Constructing a Regex • Pattern starts and ends with a /: /pattern/; if you want to match a /, you need to escape it: \/ (backslash, forward slash); you can change the delimiter to some other character, but you probably won't need to: m|pattern| • Any 'modifiers' to the pattern go after the last /: i: case insensitive, e.g. /[a-z]/i; o: compile once; g: global, find all matches (returned as a list in list context); m: ^ and $ also match at line boundaries within the string; s: . also matches a newline
  • 44. Looking for a pattern • By default, a regular expression is applied to $_ (the default variable): if (/a+/) {die} looks for one or more 'a' in $_ • If you want to look for the pattern in any other variable, you must use the bind operator: if ($value =~ /a+/) {die} looks for one or more 'a' in $value • The bind operator is in no way similar to the '=' sign! = is assignment, =~ is bind: if ($value = /[a-z]/) {die} looks for a lowercase letter in $_, not $value, and overwrites $value with the result of the match!
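The difference between a bound and an unbound match can be seen in a small sketch. The strings "gaat" and "cccc" are arbitrary values chosen for the example.

```perl
use strict;
use warnings;

my $value = "gaat";

# Bound match: the pattern is applied to $value
my $bound = ($value =~ /a+/) ? 1 : 0;

# Unbound match: the pattern is applied to $_ (set here with local)
local $_ = "cccc";
my $unbound = (/a+/) ? 1 : 0;

print "bound match on \$value: $bound\n";
print "unbound match on \$_: $unbound\n";
```

Here the bound match succeeds because `$value` contains 'a', while the unbound match fails because `$_` does not, even though the pattern is identical.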
  • 45. Regular Expression Atoms • An 'atom' is the smallest unit of a regular expression • Character atoms: 0-9, a-z, A-Z match themselves; . (dot) matches any character except a newline; [atgcATGC]: a character class (group); [a-z]: another character class, a through z
  • 46. Quantifiers • You can specify the number of times you want to see an atom. Examples • \d*: zero or more times • \d+: one or more times • \d{3}: exactly three times • \d{4,7}: at least four, and not more than seven • \d{3,}: three or more times • We could rewrite /\d\d\d-\d\d\d\d/ as /\d{3}-\d{4}/
  • 47. Anchors • Anchors force a pattern match to a certain location • ^: match at the beginning of the string • $: match at the end of the string • \b: match at a word boundary (between \w and \W) • Example: /^\d\d\d-\d\d\d\d$/ matches only strings that consist entirely of a phone number of the form ddd-dddd
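The effect of quantifiers and anchors together can be compared on a few strings. The test strings below are assumptions made for the example; only the ddd-dddd phone pattern comes from the slides.

```perl
use strict;
use warnings;

# Hypothetical test strings
my @inputs = ("353-7236", "phone: 353-7236", "3537-236");

my %result;
foreach my $s (@inputs) {
    my $anchored   = ($s =~ /^\d{3}-\d{4}$/) ? "yes" : "no";   # whole string must be ddd-dddd
    my $unanchored = ($s =~ /\d{3}-\d{4}/)   ? "yes" : "no";   # ddd-dddd anywhere in the string
    $result{$s} = [$anchored, $unanchored];
    print "'$s': anchored=$anchored, unanchored=$unanchored\n";
}
```

The second string contains a phone number but is not itself one, so only the unanchored pattern accepts it.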
  • 48. Remembering Stuff • Being able to match patterns is good, but limited • We want to be able to keep portions of the matched text for later. Example: $string = 'phone: 353-7236'; we want to keep the phone number only. Just figuring out that the string contains a phone number is insufficient; we need to keep the number as well
  • 49. Memory Parentheses (pattern memory) • Since we almost always want to keep portions of the string we have matched, there is a mechanism built into Perl • Anything in parentheses within the regular expression is kept in memory: 'phone:353-7236' =~ /^phone:(.+)$/; Perl knows we want to keep everything that matches '.+' in the above pattern
  • 50. Getting at pattern memory • Perl stores the matches in a series of default variables. The first parentheses set goes into $1, the second into $2, etc. • This is why we can't name our own variables ${digit} • Memory variables are created only in the amounts needed. If you have three sets of parentheses, you have ($1,$2,$3) • Memory variables are created for each matched set of parentheses. If you have one set contained within another set, you get two variables; sets are numbered by the position of their opening parenthesis, so the outer set gets the lower number • Memory variables are only valid in the current scope
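A minimal sketch of pattern memory with nested parentheses, using the phone string from the earlier slide. The variable names and the exact pattern are assumptions made for the example.

```perl
use strict;
use warnings;

my $string = "phone: 353-7236";

my ($number, $prefix, $line);
if ($string =~ /^phone:\s*((\d{3})-(\d{4}))$/) {
    $number = $1;    # outer group: opens first, so it is $1
    $prefix = $2;    # first inner group
    $line   = $3;    # second inner group
    print "number=$number prefix=$prefix line=$line\n";
}
```

Copying `$1`, `$2`, … into named variables right after the match is a common habit, since the next successful match overwrites them.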
  • 51. Finding all instances of a match • Use the 'g' modifier to the regular expression: @sites = $sequence =~ /(TATTA)/g; • Think g for global • Returns a list of all the matches (in order) and stores them in the array • If you have more than one pair of parentheses, your array gets values in sets ($1,$2,$3,$1,$2,$3...)
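The global match in list context can be sketched as follows. Both input strings are made up for the example.

```perl
use strict;
use warnings;

my $sequence = "GGTATTACCTATTAGG";     # hypothetical sequence with two TATTA sites

my @sites = $sequence =~ /(TATTA)/g;   # list context plus /g: all matches
print "found ", scalar(@sites), " sites\n";

# Two groups per match: the list is filled in sets ($1, $2, $1, $2, ...)
my @pairs = "a1b2c3" =~ /([a-z])(\d)/g;
print "@pairs\n";
```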
  • 52. Perl is Greedy • In addition to taking all your time, Perl regular expressions also try to match the largest possible string which fits your pattern: /ga+t/ matches gat, gaat, gaaat; 'Doh! No doughnuts left!' =~ /(d.+t)/ leaves 'doughnuts left' in $1 • If this is not what you wanted, add '?' after the quantifier: /(d.+?t)/ # match as few '.'s as you can and still make the pattern work
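Greedy and non-greedy matching can be compared side by side on the doughnut string from the slide. The variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "Doh! No doughnuts left!";

my ($greedy) = $text =~ /(d.+t)/;     # .+ grabs as much as possible
my ($lazy)   = $text =~ /(d.+?t)/;    # .+? grabs as little as possible

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both matches start at the first lowercase 'd'; they differ only in which 't' ends the match.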
  • 53. Substitute function • s/pattern/replacement/; • Looks kind of like a regular expression: the pattern is constructed the same way • Inherited from previous languages, so it can be a bit different • Changes the variable it is bound to!
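As an illustration of substitution changing its target, the sketch below turns a DNA string into RNA by replacing T with U. The sequence is hypothetical, and working on a copy is a design choice made here to keep the original intact.

```perl
use strict;
use warnings;

my $dna = "ATGTTT";                   # hypothetical coding-strand fragment

my $rna = $dna;                       # work on a copy: s/// changes its target
my $replaced = ($rna =~ s/T/U/g);     # returns the number of substitutions made

print "DNA: $dna\n";
print "RNA: $rna\n";
print "substitutions: $replaced\n";
```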
  • 55. tr function • translate or transliterate • tr/characterlist1/characterlist2/; • Even less like a regular expression than s • Substitutes characters in the first list with characters in the second list: $string =~ tr/a/A/; # changes every 'a' to an 'A' • No need for the g modifier when using tr
  • 57. Using tr • Creating a complementary DNA sequence: $sequence =~ tr/atgc/TACG/; • Sneaky Perl trick for the day: tr does two things: 1. changes characters in the bound variable 2. counts the number of times it does this • Super-fast character counter™: $a_count = $sequence =~ tr/a/a/; replaces an 'a' with an 'a' (no net change), and assigns the result (number of substitutions) to $a_count
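Both uses of tr can be combined in a short sketch that builds a reverse complement and counts a base. The sequence is made up, and keeping the result in lowercase (tr/atgc/tacg/ rather than the slide's uppercase target) is a choice made for the example.

```perl
use strict;
use warnings;

my $sequence = "atgcaa";                          # hypothetical sequence

(my $complement = $sequence) =~ tr/atgc/tacg/;    # copy, then transliterate the copy
my $revcomp = scalar reverse $complement;         # reverse complement

my $a_count = ($sequence =~ tr/a/a/);             # counts 'a' without changing anything

print "complement: $complement\n";
print "reverse complement: $revcomp\n";
print "number of a: $a_count\n";
```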
  • 58. Regex-Related Special Variables • Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match • @- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths) • $' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script • All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match
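A short sketch of the special variables in use, locating an EcoRI site in a made-up sequence. The parts before and after the match are taken with substr and the @- / @+ indices, which yields the same text as $` and $'.

```perl
use strict;
use warnings;

my $seq = "CCGAATTCGG";                    # hypothetical sequence with one EcoRI site

my ($match, $start, $end, $before, $after);
if ($seq =~ /GAATTC/) {
    $match  = $&;                          # the entire match
    $start  = $-[0];                       # 0-based start of the match
    $end    = $+[0];                       # 0-based position just after the match
    $before = substr($seq, 0, $start);     # same text as $`
    $after  = substr($seq, $end);          # same text as $'
    print "match '$match' spans $start..$end, before='$before', after='$after'\n";
}
```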
  • 59. Example: Which of the following 4 sequences (seq1/2/3/4) a) contain a "Galactokinase signature" (http://us.expasy.org/prosite/)? b) How many signatures are there? c) Where are they (hints: pos and $&)?
  • 60. >SEQ1 MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR >SEQ2 MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ >SEQ3 MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL >SEQ4 MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA 
SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
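The hints in the exercise (pos and $&) point at a global regex match inside a while loop. A minimal sketch of that idea follows. Two things in it are assumptions: the signature is written from memory of the PROSITE galactokinase entry and should be checked against http://us.expasy.org/prosite/ before use, and the test sequence is a short made-up string, not one of SEQ1-4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: PROSITE-style signature G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y
# rewritten as a Perl regex; verify it against the PROSITE entry.
my $signature = qr/GR.N[LIV]IG[DE]H.DY/;

# Hypothetical toy sequence containing two hits (not one of SEQ1-4).
my $seq = "MKTAGRANIIGEHNDYVVLLSAGRDNVIGDHVDYKK";

my @hits;
while ($seq =~ /$signature/g) {
    my $match = $&;                          # text of the last successful match
    my $end   = pos($seq);                   # offset just past the match
    my $start = $end - length($match) + 1;   # 1-based start position
    push @hits, [$start, $match];
    print "hit '$match' at position $start\n";
}
print "number of hits: ", scalar(@hits), "\n";
```

The /g modifier makes each pass of the loop continue from where the previous match ended, which is why pos() can be used to locate every occurrence in turn.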
  • 61. Arrays Definitions • A scalar variable contains a scalar value: one number or one string. A string might contain many words, but Perl regards it as one unit. • An array variable contains a list of scalar data: a list of numbers or a list of strings or a mixed list of numbers and strings. The order of elements in the list matters. Syntax • Array variable names start with an @ sign. • You may use in the same program a variable named $var and another variable named @var, and they will mean two different, unrelated things. Example • Assume we have a list of numbers which were obtained as a result of some measurement. We can store this list in an array variable as the following: • @msr = (3, 2, 5, 9, 7, 13, 16); 61
  • 62. The foreach construct
    The foreach construct iterates over a list of scalar values (e.g. the contents of an array) and executes a block of code for each of the values.
    • Example:
        foreach $i (@some_array) {
            statement_1;
            statement_2;
            statement_3;
        }
      Each element in @some_array is aliased to the variable $i in turn, and the block of code inside the curly brackets {} is executed once for each element.
    • The variable $i (or any other name you choose) is local to the foreach loop and regains its former value when the loop exits.
    • Remark: if you omit the loop variable, foreach uses the default variable $_.
  • 63. Examples for using the foreach construct - cont.
    • Calculate sum of all array elements:
        #!/usr/local/bin/perl
        @msr = (3, 2, 5, 9, 7, 13, 16);
        $sum = 0;
        foreach $i (@msr) {
            $sum += $i;
        }
        print "sum is: $sum\n";
  • 64. Accessing individual array elements
    Individual array elements may be accessed by indicating their position in the list (their index).
    Example: @msr = (3, 2, 5, 9, 7, 13, 16);
        index   value
        0       3
        1       2
        2       5
        3       9
        4       7
        5       13
        6       16
    First element: $msr[0] (here has the value of 3), third element: $msr[2] (here has the value of 5), and so on.
  • 65. The sort function
    The sort function receives a list of values (or an array) and returns the sorted list.
        @array2 = sort (@array1);
    Example with strings:
        #!/usr/local/bin/perl
        @countries = ("Israel", "Norway", "France", "Argentina");
        @sorted_countries = sort (@countries);
        print "ORIG: @countries\n", "SORTED: @sorted_countries\n";
    Output:
        ORIG: Israel Norway France Argentina
        SORTED: Argentina France Israel Norway
    Example with numbers:
        #!/usr/local/bin/perl
        @numbers = (1, 2, 4, 16, 18, 32, 64);
        @sorted_num = sort (@numbers);
        print "ORIG: @numbers \n", "SORTED: @sorted_num \n";
    Output:
        ORIG: 1 2 4 16 18 32 64
        SORTED: 1 16 18 2 32 4 64
    Note that by default the numbers are not sorted numerically, but by the string value of each number.
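The string-wise result above can be avoided by giving sort a comparison block. A short sketch, reusing the numbers from the slide; inside the block, $a and $b are the two elements Perl is currently comparing.

```perl
use strict;
use warnings;

my @numbers = (1, 2, 4, 16, 18, 32, 64);

# Default sort compares elements as strings.
my @as_strings = sort @numbers;

# A comparison block with <=> compares them as numbers instead.
my @ascending  = sort { $a <=> $b } @numbers;
my @descending = sort { $b <=> $a } @numbers;

print "STRING:     @as_strings\n";
print "ASCENDING:  @ascending\n";
print "DESCENDING: @descending\n";
```

For strings the matching operator is cmp, so sort { $a cmp $b } @list is the explicit form of the default behaviour.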
  • 66. The push and shift functions The push function adds a variable or a list of variables to the end of a given array. Example: $a = 5; $b = 7; @array = ("David", "John", "Gadi"); push (@array, $a, $b); # @array is now ("David", "John", "Gadi", 5, 7) The shift function removes the first element of a given array and returns this element. Example: @array = ("David", "John", "Gadi"); $k = shift (@array); # @array is now ("John", "Gadi"); # $k is now "David" Note that after both the push and shift operations the given array @array is changed! 66
  • 67. Perl Array review • An array is designated with the '@' sign • An array is a list of individual elements • Arrays are ordered  Your list stays in the same order that you created it, although you can add or remove elements at the front or back of the list • You access array elements by number, using the special syntax:  $array[1] returns the element at index 1, i.e. the second element (remember Perl starts counting at zero) • You can do anything with an array element that you can do with a scalar variable (addition, subtraction, printing … whatever)
  • 68. Generate random sequence string
        @bases = ("A", "C", "G", "T");    # alphabet to draw from
        $seq = "";
        for ($n = 1; $n <= 50; $n++) {
            # rand(@bases) returns a number in [0, 4); as an index it is truncated to 0..3
            $seq .= $bases[rand(@bases)];
        }
        print "$seq\n";
  • 69. Text Processing Functions: the split function
    • The split function splits a string into a list of substrings according to the positions of a given delimiter. The delimiter is written as a pattern enclosed by slashes: /PATTERN/.
    Examples:
        $string = "programming::course::for::bioinformatics";
        @list = split (/::/, $string);
        # @list is now ("programming", "course", "for", "bioinformatics")
        # $string remains unchanged.

        $string = "protein kinase C\t450 Kilodaltons\t120 Kilobases";
        @list = split (/\t/, $string);    # \t indicates tab
        # @list is now ("protein kinase C", "450 Kilodaltons", "120 Kilobases")
  • 70. Text Processing Functions: the join function
    • The join function does the opposite of split. It receives a delimiter and a list of strings, and joins the strings into a single string, such that they are separated by the delimiter.
    • Note that the delimiter is written inside quotes.
    Examples:
        @list = ("programming", "course", "for", "bioinformatics");
        $string = join ("::", @list);
        # $string is now "programming::course::for::bioinformatics"

        $name = "protein kinase C";
        $mol_weight = "450 Kilodaltons";
        $seq_length = "120 Kilobases";
        $string = join ("\t", $name, $mol_weight, $seq_length);
        # $string is now: "protein kinase C\t450 Kilodaltons\t120 Kilobases"
  • 71. When is an array not good enough? • Sometimes you want to associate a given value with another value. (name/value pairs) (Rob => 353-7236, Matt => 353-7122, Joe_anonymous => 555-1212) (Acc#1 => sequence1, Acc#2 => sequence2, Acc#n => sequence-n) • You could put this information into an array, but it would be difficult to keep your names and values together (what happens when you sort? Yuck) 71
  • 72. Problem solved: The associative array • As the name suggests, an associative array allows you to link a name with a value • In perl-speak: associative array = hash  'hash' is the preferred term, for various arcane reasons, including that it is easier to say. • Consider an array: The elements (values) are each associated with a name: the index position. These index positions are numerical, sequential, and start at zero. • A hash is similar to an array, but we get to name the index positions anything we want
  • 73. The 'structure' of a Hash
    • An array looks something like this:
                    0        1        2        Index
        @array =    'val1'   'val2'   'val3'   Value
  • 74. The 'structure' of a Hash
    • An array looks something like this:
                    0        1        2        Index
        @array =    'val1'   'val2'   'val3'   Value
    • A hash looks something like this:
                    Rob        Matt       Joe_A      Key (name)
        %phone =    353-7236   353-7122   555-1212   Value
  • 75. Creating a hash
    • There are several methods for creating a hash. The simplest way is to assign a list to a hash.
         %hash = ('rob', 56, 'joe', 17, 'jeff', 'green');
    • Perl is smart enough to know that since you are assigning a list to a hash, you meant to alternate keys and values.
         %hash = ('rob' => 56, 'joe' => 17, 'jeff' => 'green');
    • The arrow ('=>') notation helps some people, and clarifies which keys go with which values. The perl interpreter sees '=>' as a comma.
  • 76. Getting at values • You should expect by now that there is some way to get at a value, given a key. • You access a hash value through its key like this:  $hash{'key'} • This should look somewhat familiar  $array[21] : refers to the value associated with a specific index position in an array  $hash{key} : refers to the value associated with a specific key in a hash
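Putting the hash slides together, here is a small sketch using the phone-book example from earlier. The entry for Ann is a made-up addition for illustration. The numbers are quoted because an unquoted 353-7236 would be evaluated by Perl as a subtraction.

```perl
use strict;
use warnings;

# Phone numbers are stored as strings so the dash is kept.
my %phone = (
    Rob           => '353-7236',
    Matt          => '353-7122',
    Joe_anonymous => '555-1212',
);

# Look up one value by its key.
print "Rob: $phone{Rob}\n";

# Add a new key/value pair (hypothetical entry for illustration).
$phone{Ann} = '555-0000';

# Test whether a key is present before using it.
print "No entry for Zed\n" unless exists $phone{Zed};

# Keys come back in no particular order, so sort them for stable output.
my @names = sort keys %phone;
foreach my $name (@names) {
    print "$name => $phone{$name}\n";
}
```

Sorting the keys rather than the values is what keeps names and numbers together, which was the problem with the plain array on the earlier slide.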
  • 77. Programming in general and Perl in particular • There is more than one right way to do it. Unfortunately, there are also many wrong ways.  1. Always check that the output is correct and logical. Consider what errors might occur, and take steps to ensure that you are accounting for them.  2. Check that you are using every variable you declare. Use strict (put "use strict;" at the top of your script)!  3. Always go back to a script once it is working and see if you can eliminate unnecessary steps. Concise code is good code. You will learn more if you optimize your code. Concise does not mean comment-free. Please use as many comments as you think are necessary. Sometimes you want to leave easy-to-understand code in, rather than short but difficult-to-understand tricks. Use your judgment. Remember that in the future, you may wish to use or alter the code you wrote today. If you don't understand it today, you won't tomorrow.
  • 78. Programming in general and Perl in particular Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it. When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested to), but a few words or a number change every few minutes will show that your program is doing something. Comment your script. Any information on what it is doing or why might be useful to you a few months later. Decide on a coding convention and stick to it. For example,  for variable names, begin globals with a capital letter and privates (my) with a lower case letter  indent new control structures with (say) 2 spaces  line up closing braces, as in: if (....) { ... ... } 78
  • 79. CPAN • CPAN: The Comprehensive Perl Archive Network is available at www.cpan.org and is a very large repository of Perl modules for all kinds of tasks (including BioPerl)
  • 80. What is BioPerl? • An 'open source' project  http://bio.perl.org or http://www.cpan.org • A loose international collaboration of biologist/programmers  Nobody (that I know of) gets paid for this • A collection of Perl modules and methods for doing a number of bioinformatics tasks  Think of it as subroutines to do biology • Consider it a 'tool-box'  There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break • BioPerl code is portable: if you give somebody a script, it will probably work on their system
  • 81. Multi-line parsing
        use strict;
        use Bio::SeqIO;

        my $filename = "sw.txt";
        my $sequence_object;
        # open the Swiss-Prot flat file as a stream of sequence objects
        my $seqio = Bio::SeqIO->new(
            '-format' => 'swiss',
            '-file'   => $filename
        );
        # next_seq returns one entry at a time until the file is exhausted
        while ($sequence_object = $seqio->next_seq) {
            my $sequentie = $sequence_object->seq();
            print $sequentie."\n";
        }
  • 82. Live.pl
        #!e:\Perl\bin\perl.exe -w
        # script for fetching a GenBank entry and printing out its name and sequence
        use Bio::DB::GenBank;
        use Data::Dumper;

        $gb = Bio::DB::GenBank->new();
        $sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');
        print Dumper ($sequence_object);    # show the internal structure of the object
        $seq1_id = $sequence_object->display_id();
        $seq1_s = $sequence_object->seq();
        print "seq1 display id is $seq1_id \n";
        print "seq1 sequence is $seq1_s \n";
  • 83. Bioperl 101: 2 ESSENTIAL TOOLS Data::Dumper to find out what class you are in Perl bptutorial (100 Bio::Seq) to find the available methods for that class
  • 85. Overview • Bots and Spiders  The web  Bots  Spiders  Real world examples  Bioinformatics applications  Perl – LWP libraries  Google hacks  Advanced APIs  Fetch data from NCBI / Ensembl / 85
  • 86. The web • The WWW-part of the Internet is based on hyperlinks • So if one started to follow all hyperlinks, it would be possible to map almost the entire WWW • Everything you can do as a human (clicking, filling in forms,…) can be done by machines 86
  • 87. Bots • Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet • Bots perform tasks that are:  Simple  Structurally repetitive  Performed at a much higher rate than would be possible for a human • An automated script fetches, analyses and files information from web servers at many times the speed of a human • Other uses:  Chatbots  IM / Skype / Wiki bots  Malicious bots and bot networks (Zombies)
  • 88. Spiders • Webspiders / crawlers are programs or automated scripts that browse the World Wide Web in a methodical, automated manner. A spider is one type of bot • The spider starts with a list of URLs to visit, called the seeds  As the crawler visits these URLs, it identifies all the hyperlinks in the page  It adds them to the list of URLs to visit, called the crawl frontier  URLs from the frontier are recursively visited according to a set of policies • This process is called web crawling or spidering: in most cases a mean
  • 89. Spiders Use of webcrawlers:  Mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches  Automating maintenance tasks on a website, such as checking links or validating HTML code  Can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses  The most commonly used crawler is probably the GoogleBot crawler  Crawls  Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)  Serves results: PageRank Technology
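The seed / frontier mechanism can be sketched without touching the network. In the sketch below the "web" is a hypothetical in-memory table of pages and their links, the only policy is a page limit, and the frontier is a simple first-in, first-out queue instead of a recursive call. A real spider would fetch each page (for instance with LWP) and extract the links from the HTML.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for the web: page => links found on it.
my %links_on = (
    'a.html' => ['b.html', 'c.html'],
    'b.html' => ['c.html', 'd.html'],
    'c.html' => ['a.html'],
    'd.html' => [],
);

sub crawl {
    my ($max_pages, @seeds) = @_;
    my @frontier = @seeds;                    # URLs still to visit
    my %seen     = map { $_ => 1 } @seeds;    # URLs already queued
    my @visited;

    while (@frontier && @visited < $max_pages) {
        my $url = shift @frontier;            # take the oldest URL first
        push @visited, $url;
        foreach my $link (@{ $links_on{$url} || [] }) {
            next if $seen{$link}++;           # skip URLs that were queued before
            push @frontier, $link;
        }
    }
    return @visited;
}

my @order = crawl(10, 'a.html');
print "visited: @order\n";
```

The %seen hash is what stops the crawler from looping forever on pages that link back to each other, such as c.html and a.html here.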
  • 90. Spiders 90
  • 91. Perl - LWP LWP (also known as libwww-perl) The World-Wide Web library for Perl Set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web Free book: http://lwp.interglacial.com/  LWP for newbies LWP::Simple (demo1) Go to a URL, fetch data, ready to parse Attention: HTML tags and regular expression 91
  • 92. Perl - LWP  Some more advanced features  LWP::UserAgent (demo2 – show server access logs)  Fill in forms and parse results  Depending on content: follow hyperlinks to other pages and parse these again,…  Bioinformatics examples  Use genome browser data (demo3) and sequences  Get gene aliases and symbols from GeneCards (demo4) 92
  • 93. Google hacks  Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)? Google allows automated queries: 1000 queries a day per account. Google uses snippets: the short pieces of text you get in the main search results. These are the result of its indexing and parsing algorithms. Demo5: LWP and Google combined, parsing the results
  • 94. Advanced APIs  An application programming interface (API) is a source code interface that an operating system, library or service provides to support requests made by computer programs  Language-dependent APIs  Language-independent APIs are written in a way they can be called from several programming languages. This is a desired feature for service style API which is not bound to a particular process or system and is available as a remote procedure call 94
  • 95. Advanced APIs  Google example used Google API / SOAP  NCBI API  The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP)  Programmers may write software applications that access the E-Utilities using any SOAP development tool  Main tools (demo6): E-Search: Searches and retrieves primary IDs and term translations and optionally retains results for future use in the user's environment E-Fetch: Retrieves records in the requested format from a list of one or more primary IDs  Ensembl API (demo7)
  • 96. Fetch data from NCBI  PubMed is a frequently used NCBI database. PubMed can be queried using E-Utils. Queries use the same syntax as the regular PubMed website. Results come back in the same data formats as on the website (XML, plain text). Parse the XML results and apply more advanced text-mining techniques. Demo8: parse results and present them in an interface (http://matrix.ugent.be/mate/methylome/result1.html)
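As a rough illustration of an E-Search query: the sketch below uses the URL-based E-utilities interface rather than the SOAP service mentioned on the previous slide. It builds the request URL and extracts the IDs from a reply. The reply shown is a hand-written fragment with made-up IDs, the helper names are hypothetical, and no request is sent; fetching the URL (e.g. with get from LWP::Simple) is left out.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi';

# Percent-encode everything except unreserved URL characters.
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Build an ESearch URL from database name, query term and result limit.
sub esearch_url {
    my ($db, $term, $retmax) = @_;
    return $base . "?db=$db&term=" . url_escape($term) . "&retmax=$retmax";
}

# Pull the <Id> values out of an ESearch XML reply with a simple regex.
sub parse_ids {
    my ($xml) = @_;
    my @ids = $xml =~ m{<Id>(\d+)</Id>}g;
    return @ids;
}

my $url = esearch_url('pubmed', 'DNA methylation AND cancer', 5);
print "$url\n";

# Hypothetical, hand-written reply fragment (the IDs are made up).
my $reply = '<eSearchResult><IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>';
my @ids = parse_ids($reply);
print "ids: @ids\n";
```

A regex is enough for this flat ID list; for the nested records returned by E-Fetch a proper XML parser is the safer choice.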
  • 97. Fetch data from NCBI  Example: PubMeth Get data from NCBI PubMed Get all genes and all aliases for human genes and their annotations from Ensembl & GeneCards Get all cancer types from cancer thesaurus Parse PubMed results: find genes and aliases; keywords Keep variants in mind (Regexes are very useful) Sort the PubMed abstracts and store found genes and keywords in database; apply scoring scheme
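The "keep variants in mind" step can be sketched with one regex per gene. This is not the PubMeth code; the alias table and the example sentence are made up, and the helper name genes_in is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical alias table: official symbol => spelling variants to look for.
my %aliases = (
    BRCA1 => ['BRCA1', 'BRCA-1', 'BRCA 1'],
    MGMT  => ['MGMT'],
);

# Build one case-insensitive regex per gene. \b keeps a symbol from matching
# inside a longer word, and quotemeta protects characters such as "-".
my %regex_for;
foreach my $gene (keys %aliases) {
    my $alternatives = join '|', map { quotemeta($_) } @{ $aliases{$gene} };
    $regex_for{$gene} = qr/\b(?:$alternatives)\b/i;
}

# Return the official symbols of all genes mentioned in a text.
sub genes_in {
    my ($text) = @_;
    my @found = grep { $text =~ $regex_for{$_} } keys %regex_for;
    return sort @found;
}

# Made-up sentence standing in for a PubMed abstract.
my $abstract = 'Promoter methylation of brca-1 and MGMT was frequent in tumours.';
my @found = genes_in($abstract);
print "genes found: @found\n";
```

Mapping every variant back to the official symbol is what lets abstracts that spell a gene differently be counted together in the scoring step.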
  • 99. The three genome browsers • There are three main browsers:  Ensembl  NCBI MapViewer  UCSC • At first glance their main distinguishing features are:  MapViewer is arranged vertically.  Ensembl has multiple (22) different “Views”.  UCSC has a single “View” for (almost) everything. 99
  • 100. MapViewer Home http://www.ncbi.nlm.nih.gov/mapview/ 100
  • 102. Selecting tracks on MapViewer 102
  • 103. MapViewer strengths • Good coverage of plant and fungal genomes. • Close integration with other NCBI tools and databases, such as Model Maker, trace archives or Celera assemblies. • Vertical view enables convenient overview of regional gene descriptions. • Discontiguous MEGABLAST is probably the most sensitive tool available for cross-species sequence queries. • Ability to view multiple assemblies (e.g. Celera and reference) simultaneously. 103
  • 104. MapViewer limitations • Little cross-species conservation or alignment data. • Inability to upload custom annotations and data. • Limited capability for batch data access. • Limited support for automated database querying. • Vertical view makes base-pair level annotation cumbersome. 104
  • 108. Strengths of the UCSC Browser (I) For this course I will be focusing primarily on the UCSC Browser for several reasons: • Strong comparative genomics capabilities. • Fast response  sequence searches performed with BLAT.  code is written in speed-optimized C.  Multiple indexing and non-normalized tables for fast database retrieval. • (Essentially) single “view” from single base-pair to entire chromosome. • Easiest interface for loading custom annotations. 108
  • 109. UCSC Browser Strengths (II) • Well suited for batch and automated querying of both gene and intergenic regions. • Comprehensive: tends to have the most species, genes and annotations. • Annotations frequently updated (Genbank/Refseq daily / ESTs weekly). • Able to find “similar” genes easily with GeneSorter. • Rapid access to in situ images with VisiGene. 109
  • 110. UCSC browser limitations • Lack of “overview” mode can make it harder to see genomic context. • Syntenic regions cannot be viewed simultaneously. • Cross species sequence queries with BLAT are often insensitive. • Comprehensiveness of database can make user interface intimidating. • Code access for commercial users requires licensing. 110
  • 111. Human, mouse,rat synteny in MapViewer 111
  • 113. Batch querying overview • Introduction / motivation • UCSC table browser • Custom tracks and frames • Galaxy and direct SQL database querying • A batch query example • UCSC Database “gotchas” • Batch querying on Ensembl 113
  • 114. Why batch querying • Interactive querying is difficult if you want to study numerous “interesting” genomic regions. • Querying each region interactively is:  Tedious  Time-consuming  Error prone 114
  • 115. Batch querying examples • As an example, say you have found one hundred candidate polymorphisms and you want to know:  Are they in dbSNP?  Do they occur in any known ESTs?  Are the sites conserved in other vertebrates?  Are they near any ”LINE” repeat sequences? Of course you could repeat the procedures described in Part II one hundred times but that would get “old” very fast… 115
  • 116. Other examples • Other examples include characterizing multiple:  Non-coding RNA candidates  ultra-conserved regions  introns hosting snoRNA genes 116
  • 117. Browsers and databases • Each of the genome browsers is built on top of multiple relational databases. • Typically data for each genome assembly are stored in a separate database and auxiliary data, e.g. gene ontology (GO) data, are stored in yet other databases. • These databases may have hundreds of tables, many with millions of entries. 117
  • 118. The UCSC Table Browser • For batch queries, you need to query the browser databases. • The conventional way of querying a relational database is via “Structured Query Language” (SQL). • However with the Table Browser, you can query the database without using SQL. 118
  • 119. Browser Database Formats Nevertheless, even with the Table Browser, you need some understanding of the underlying track, table and file formats.  Table formats describe how data is stored in the (relational) databases.  Track formats describe how the data is presented on the browser.  File formats describe how the data is stored in “flat files” in conventional computer files.  Finally, for understanding the underlying computer code (as we will do in the last part of this tutorial) you will need to learn about the “C” structures which hold the data in the source code.
  • 120. Main UCSC Data Formats • GFF/GTF • BED (Browser Extensible Data)  lists of genomic blocks • PSL  RNA/DNA alignments • .chain  pair-wise cross species alignments • .maf  multiple genome alignments • .wig  numerical data 120
  • 121. Custom Tracks • Custom tracks are essentially BED, PSL or GTF files with formatting lines so they can be displayed on the browser. • A custom track file can contain multiple tracks, which may be in different formats. • Custom tracks are useful for:  Display of regions of interest on the browser.  Sharing custom data with others.  Input of multiple, arbitrary regions for annotation by the Table Browser. • Custom tracks can be made by the Table Browser, or you can make them easily yourself. 121
  • 122. Selecting custom track output 122
  • 123. Sending custom track to browser 123 123
  • 124. Adding a custom track 124 124
  • 125. Adding a custom track (II) 125
  • 126. Custom track example
        browser position chr22:10000000-10020000
        browser hide all
        track name=clones description="Clones" visibility=3 color=0,128,0 useScore=1
        chr22 10000000 10004000 cloneA 960
        chr22 10002000 10006000 cloneB 200
        chr22 10005000 10009000 cloneC 700
        chr22 10006000 10010000 cloneD 600
        chr22 10011000 10015000 cloneE 300
        chr22 10012000 10017000 cloneF 100
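Making such a file yourself takes only a few lines of Perl. The sketch below writes the browser and track lines followed by one BED line per feature, using the first three clones from the slide. The sub name custom_track is hypothetical, and only some of the track attributes shown on the slide are written.

```perl
use strict;
use warnings;

# Clone intervals from the slide: [chrom, start, end, name, score].
my @clones = (
    ['chr22', 10000000, 10004000, 'cloneA', 960],
    ['chr22', 10002000, 10006000, 'cloneB', 200],
    ['chr22', 10005000, 10009000, 'cloneC', 700],
);

# Return the text of a custom track: header lines plus one line per feature.
sub custom_track {
    my ($name, $description, @features) = @_;
    my @lines = (
        'browser position chr22:10000000-10020000',
        qq{track name=$name description="$description" visibility=3 useScore=1},
    );
    # BED columns are separated by whitespace; tabs are used here.
    push @lines, join("\t", @$_) foreach @features;
    return join("\n", @lines) . "\n";
}

my $track = custom_track('clones', 'Clones', @clones);
print $track;
```

Redirecting the output to a file gives something that can be pasted or uploaded on the custom track page of the browser.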
  • 127. Limitations of the table browser • Can be difficult to create more complex queries. • With hundreds of tables, finding the one(s) you want can be confusing. • Getting intersections or unions of genomic regions is often a multi-step process and can be tedious or error prone. • May be slower than direct SQL query. • Not designed for fully automated operation. 127
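The intersection step mentioned above is simple to script once the regions are in BED-like form. A minimal sketch, assuming half-open intervals (start included, end excluded) and a brute-force comparison of every region against every feature. The region names are made up; the clone coordinates are taken from the custom track example.

```perl
use strict;
use warnings;

# Two half-open intervals [start, end) on the same chromosome overlap
# when each one starts before the other ends.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{chrom} eq $y->{chrom}
        && $x->{start} < $y->{end}
        && $y->{start} < $x->{end};
}

# Hypothetical regions of interest.
my @regions = (
    { chrom => 'chr22', start => 10001000, end => 10003000, name => 'region_1' },
    { chrom => 'chr22', start => 10020000, end => 10021000, name => 'region_2' },
);
# Annotation features (clones from the custom track example).
my @features = (
    { chrom => 'chr22', start => 10000000, end => 10004000, name => 'cloneA' },
    { chrom => 'chr22', start => 10002000, end => 10006000, name => 'cloneB' },
    { chrom => 'chr22', start => 10011000, end => 10015000, name => 'cloneE' },
);

my %hits;   # region name => names of overlapping features
foreach my $region (@regions) {
    my @names = map { $_->{name} } grep { overlaps($region, $_) } @features;
    $hits{ $region->{name} } = \@names;
    print "$region->{name}: @names\n";
}
```

Comparing all pairs is fine for a hundred regions; for genome-scale data, sorted intervals or a dedicated tool would be the usual choice.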
  • 128. Ensembl 128
  • 129. Ensembl Home http://www.ensembl.org/ 129
  • 132. Detail and Basepair view 132
  • 133. Changing tracks in Ensembl 133
  • 134. Ensembl strengths (I) • Multiple view levels show genomic context. • Some annotations are more complete and/or are more clearly presented (e.g. snpView of multiple mouse strain data.) • Possible to create query over more than one genome database at a time (with BioMart).
  • 136. Ensembl strengths (II) • Batch and automated querying well supported and documented (especially for perl and java). • API (programmer interface) is designed to be identical for all databases in a release. • Ensembl tends to be more “community oriented” - using standard, widely used tools and data formats. • All data and code are completely free to all. 136
  • 137. Ensembl is “community oriented” • Close alliances with Wormbase, Flybase, SGD • “support for easy integration with third party data and/or programs” – BioMart • Close integration with R/ Bioconductor software • More use of community standard formats and programs, e.g. DAS, GFF/GTF, Bioperl ( Note: UCSC also supports GFF/GTF and is compatible with R/Bioconductor and DAS, but UCSC tends to use more “homegrown” formats, e.g. BED, PSL, and tools.) 137
  • 138. Ensembl limitations • Limited data quantifying cross-species sequence conservation. • Batch queries for intergenic regions with BioMart are difficult. • BioMart offers less complete access to database than UCSC Table Browser. (However, the user interface to BioMart is easier.) 138
  • 139. BioMart • BioMart - the Ensembl “Table browser” • Similar to the Table Browser and Galaxy tools. • Previous version was called EnsMart. • Fewer tables can be accessed with BioMart than with UCSC Table Browser. In particular, non-gene oriented queries may be difficult. • However, the user interface is simpler. • Tight interface with Bioconductor project for annotation of microarray genes. 139
  • 140. The Galaxy Website • Galaxy website: http://g2.bx.psu.edu • Galaxy objective: Provide sequence and data manipulation tools (a la SRS or the UCSD Biology Workbench) that are capable of being applied to genomic data. • The intent is to provide an easy interface to numerous analysis tools with varied output formats that can work on data from multiple browsers / databases. 140
  • 142. Demo: Galaxy Genomics Toolkit • Galaxy is a web interface to bioinformatics tools that deal with genome-scale data • There is a public server with many pre-installed tools • Many tools work with genomic intervals • Other tools work with various types of tab delimited data formats, and some directly on DNA sequences • It has excellent tools to access public data • It can be installed on a local computer or set up as an institutional server • Can access a standard or custom build on Amazon “Cloud” • Any command line tool or web service can easily be wrapped into the Galaxy interface. 142
  • 143. Genome-Scale Data • Bioinformatics work is challenging on very large “genomics” data sets  sequencing, gene expression, variants, ChIPseq • Complex command line programs • Genome Browsers • New tools 143
  • 144. The Galaxy Interface has 3 parts • List of Tools • Central work panel • History = data & results
  • 145. Load Data from UCSC Or upload from your computer 145
  • 146. Demo: Galaxy Genomics Toolkit • A Galaxy instance is running at http://athos.ugent.be:8080 • Log in (as admin: new@new.be, password: newnew) • The cleanfq history contains 2 pairs of fastq files, a reference fa file and a reference gtf file
  • 147. Workflows • Galaxy saves your data and results in the History • The exact commands and parameters used with each operation with each tool are also saved. • These operations can be saved as a “Workflow”, which can be reused, and shared with other users.
  • 148. • Galaxy has many public data sets and public workflows, which can be easily used in your projects (or a tutorial) 148
  • 149. NGS tools • Galaxy has recently been expanded with tools to analyze Next-Gen Sequence data • File format conversions • Analysis methods specific to different sequencing platforms (454, Illumina, SOLID) • Analysis methods specific to different applications (RNA-seq, ChIP-seq, mutation finding, metagenomics, etc). 149
  • 150. • NGS tools include file format conversion, mapping to reference genome, ChIPseq peak calling, RNA-seq gene expression, etc. • NGS data analysis uses large files: slow to upload and slow to process on a public server
  • 151. A number of Groups have set up custom Galaxy servers with special tools 151
  • 154. What is 'intelligent'? • Intelligence = the ability to learn and understand, to solve problems, and to make decisions Machine learning …
  • 155. Turing test for intelligence THE IMITATION GAME Woman Man/Machine Interrogator: Which of the two is the woman?
  • 156. What is 'artificial'? • Artificial = man-made, not of natural origin • in the context of A.I.: machines, usually a digital computer • H. Simon: analogy between human and digital computer  memory  execution unit  control unit
  • 157. Data mining • WHAT? Extracting knowledge from data • Divide the data into three groups:  training set  validation set  test set • Clustering/Classification
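Dividing the data into the three groups is usually done at random. A small sketch; the 60/20/20 proportions and the sample names are assumptions for illustration, not values from the slides.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split a list of samples into training, validation and test sets.
# The fractions are rounded to whole samples; the remainder is the test set.
sub split_data {
    my ($train_frac, $valid_frac, @samples) = @_;
    my @shuffled = shuffle @samples;
    my $n        = scalar @shuffled;
    my $n_train  = int($train_frac * $n + 0.5);
    my $n_valid  = int($valid_frac * $n + 0.5);
    my @train = splice(@shuffled, 0, $n_train);
    my @valid = splice(@shuffled, 0, $n_valid);
    return (\@train, \@valid, \@shuffled);
}

# Ten hypothetical sample names: sample1 .. sample10.
my @samples = map { "sample$_" } 1 .. 10;
my ($train, $valid, $test) = split_data(0.6, 0.2, @samples);
printf "train=%d valid=%d test=%d\n",
    scalar @$train, scalar @$valid, scalar @$test;
```

The model is fitted on the training set, tuned on the validation set, and the test set is kept aside for a final, unbiased estimate of performance.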
  • 158. Clustering • WHAT? 'unsupervised learning': the answer for the training data is not known • The result is usually a tree structure • Important method: hierarchical clustering, which starts from a distance matrix
  • 159. Cluster Analysis • Unsupervised methods • Descriptive modeling  Grouping of genes with “similar” expression profiles  Grouping of disease tissues, cell lines, or toxicants with “similar” effects on gene expression • Clustering algorithms  Self-organizing maps  Hierarchical clustering  K-means clustering  SVD 159
  • 160. Linkage in Hierarchical Clustering
    • Single linkage: $S(A,B) = \min_{a \in A} \min_{b \in B} d(a,b)$
    • Average linkage: $A(A,B) = \frac{\sum_{a \in A} \sum_{b \in B} d(a,b)}{|A|\,|B|}$
    • Complete linkage: $C(A,B) = \max_{a \in A} \max_{b \in B} d(a,b)$
    • Centroid linkage: $M(A,B) = d(\mathrm{mean}(A), \mathrm{mean}(B))$
    • Hausdorff linkage: $h(A,B) = \max_{a \in A} \min_{b \in B} d(a,b)$, $H(A,B) = \max(h(A,B), h(B,A))$
    • Ward linkage: $W(A,B) = \frac{|A|\,|B|\,M(A,B)^2}{|A| + |B|}$
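Three of these linkage criteria written out in Perl, as a sketch. It assumes that d is the Euclidean distance and uses two small made-up clusters of points on a line so the values are easy to check by hand.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Euclidean distance between two points given as array references.
sub dist {
    my ($p, $q) = @_;
    return sqrt(sum(map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p));
}

# All pairwise distances d(a,b) with a in cluster A and b in cluster B.
sub pairwise {
    my ($cluster_a, $cluster_b) = @_;
    my @distances;
    foreach my $pa (@$cluster_a) {
        foreach my $pb (@$cluster_b) {
            push @distances, dist($pa, $pb);
        }
    }
    return @distances;
}

sub single_linkage   { return min(pairwise(@_)) }
sub complete_linkage { return max(pairwise(@_)) }
sub average_linkage  { my @d = pairwise(@_); return sum(@d) / scalar(@d) }

# Two made-up clusters on the x-axis.
my @cluster_a = ([0, 0], [1, 0]);
my @cluster_b = ([3, 0], [5, 0]);

printf "single=%.1f complete=%.1f average=%.1f\n",
    single_linkage(\@cluster_a, \@cluster_b),
    complete_linkage(\@cluster_a, \@cluster_b),
    average_linkage(\@cluster_a, \@cluster_b);
```

The four pairwise distances here are 3, 5, 2 and 4, so single linkage gives 2, complete linkage 5 and average linkage 3.5. Hierarchical clustering repeatedly merges the two clusters for which the chosen linkage value is smallest.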
  • 161. Hierarchical Clustering 3 clusters? 2 clusters? 161
  • 162. Classification • WHAT? 'supervised learning': the answer for the training data is known • Several classification methods:  decision tree  neural networks  support vector machines
  • 163. Decision tree Example: tennis
  • 164. Neural networks STRUCTURE: neurons and connections TASK: processing input data, machine learning
  • 165. Support Vector Machines Achieving a linear separation of the data by changing the dimensions
  • 166. Bioinformatics applications • Decision tree: searching for DNA sequences homologous to a given DNA sequence • Neural networks: modelling and analysing gene expression data, predicting protease cleavage sites • Support Vector Machines: identifying genes involved in anti-cancer mechanisms, detecting homology between proteins, gene expression analysis
  • 167. Bioinformatics applications • Hierarchical clustering: constructing phylogenetic trees based on DNA sequences • Genetic algorithms: molecular recognition, elucidating the relationship between structure and function, Multiple Sequence Alignment • Expert systems: detecting injuries, early detection of heart valve abnormalities • Fuzzy logic: primer design, predicting the function of an unknown gene, expression analysis
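  • 169. Classification [Figure: a mixed set of samples labelled C and N is passed through the OMS classifier, which separates them into a C group and an N group. C: cancer, N: normal]
  • 170. Classification [Figure: a mixed set of samples labelled R and N is passed through the OMS classifier, which separates them into an R group and an N group. R: responder, N: non-responder]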
  • 172. OMS Classifier using “Methylation”
    Patient Sample → Measuring Methylation:
        Gene         Gen 1   Gen 2   Gen 3   …   Gen n
        Methylated   +       -       -       …   +
    → OMS classifier → Cancer / Normal
  • 173. Why use methylation as a biomarker? • What is a feature/biomarker?  A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention • Business/biological feature selection/reduction  Of all possible (molecular and clinical) features oncomethylome measures methylation (in cancer/onco)
  • 175. Data preparation and modelling • Data preparation  Construct binary features « Methylated » from PCR data (Ct and Temp) • Modelling  Construct classifier (cancer vs normal) from features « Methylated » 175
  • 176. Data Preparation: Feature Construction
    Sample → Methylation Specific Quantitative PCR:
        Gene   Gen 1   Gen 2   Gen 3   …   Gen n
        Temp   78      81      69      …   72
        Ct     25      38      24      …   27
    Feature construction: “gene Methylated in sample”
        Gene         Gen 1   Gen 2   Gen 3   …   Gen n
        Methylated   +       -       -       …   +
    Compute « methylated » as function of Temp and Ct
  • 177. Construction of features « Methylated » • Per gene: find a boolean function - Methylated IFF: Ct below upper bound AND Temp above lower bound • Taking into account - All Ct and Temp measurements from Methylation Specific Quantitative PCR (QMSP) for normals and cancers - Noise in QMSP measurements, as observed per gene during Quality Control 177
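The boolean rule on this slide can be sketched in Python. This is only an illustration: the `Cutoff` class, the `is_methylated` helper and the numeric cut-offs (Ct below 35, Temp above 70) are assumptions chosen so that the example values from the feature construction slide give the same +/- pattern. The slides do not state the real per-gene bounds.

```python
from dataclasses import dataclass


@dataclass
class Cutoff:
    """Hypothetical per-gene bounds; the real values are not given in the slides."""
    ct_upper: float    # Ct must be below this bound
    temp_lower: float  # Temp must be above this bound


def is_methylated(ct: float, temp: float, cutoff: Cutoff) -> bool:
    # Methylated IFF: Ct below upper bound AND Temp above lower bound
    return ct < cutoff.ct_upper and temp > cutoff.temp_lower


# (Ct, Temp) pairs taken from the feature construction slide
measurements = {
    "Gen 1": (25, 78),
    "Gen 2": (38, 81),
    "Gen 3": (24, 69),
    "Gen n": (27, 72),
}

# Assumption: one illustrative cut-off reused for every gene
cutoffs = {gene: Cutoff(ct_upper=35.0, temp_lower=70.0) for gene in measurements}

features = {
    gene: is_methylated(ct, temp, cutoffs[gene])
    for gene, (ct, temp) in measurements.items()
}

for gene, methylated in features.items():
    print(gene, "+" if methylated else "-")
```

In practice each gene would get its own bounds, chosen with the noise analysis described on the following slides.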
  • 178. Construction of features « Methylated » Plot of all Ct and Temp measurements for a given gene Temp Ct What about noise? 178
  • 179. Noise - Noise: random error or variance in a measured variable - Incorrect attribute values may be due to: quantity not correctly compared to calibration (e.g., ruler slips); inaccurate calibration device (e.g., ruler > 1m); precision (e.g., truncated to nearest mile or Ångstrom unit); data entry problems; data transmission problems; inconsistency in naming convention 179
  • 180. Construction of features «Methylated» Taking into account noise QC: StdDev of Ct and Tm in IVM StDev 1.6 StDev 0.3 StDev 0.02 StDev 3.5 Cancer Non-robust assay Cut-off Robust assay Normal 180
  • 181. Construction of features « Methylated » Taking into account noise Good Reproducibility Bad Reproducibility Methylated Methylated Blunt cut-off Methylated Methylated Sharp cut-off 181
  • 182. Construction of features « Methylated » Taking into account noise • Find the most robust cut-off for each gene: compute quality with increasing noise levels (0-2 times StdDev) [Figure: two panels plotting Quality (0 to 1) against noise level (0 to 2 StdDev), labelled Non-robust and Robust] • Quality score based on binomial test - 46 or more successes with 58 trials: unlikely when probability of success = 80/179 (expected nr successes = 21) - 16 or more successes with 44 trials: likely when probability of success = 77/175 (expected nr successes = 19) 182
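The "likely" and "unlikely" verdicts of the binomial test can be checked with an upper-tail probability. The sketch below uses only the Python standard library. The function name `binomial_tail` is an assumption, and the slides do not say how this probability is turned into the final quality score, so only the tail probability itself is computed.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Counts and success probabilities as quoted on the slide
p_first = binomial_tail(46, 58, 80 / 179)
p_second = binomial_tail(16, 44, 77 / 175)

print(f"P(46 or more successes in 58 trials) = {p_first:.3g}")
print(f"P(16 or more successes in 44 trials) = {p_second:.3g}")
```

A very small tail probability means the observed number of successes is hard to explain by chance at the given success rate.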
  • 183. Construction of features « Methylated » Methylated: inside red box 183
  • 184. Construction of features « Methylated » Methylated Unmethylated Ranked Genes Cancer Normal 184
  • 185. Data preparation and modelling • Data preparation  Construct binary features « Methylated » from PCR data (Ct and Temp) • Modelling  Construct classifier (cancer vs normal) from « Methylated » features 185
  • 186. Selection of modelling technique • In theory, many techniques applicable  Data type: boolean methylation table, discrete classes  See other talks today • But, additional requirements follow from business understanding (more details below)  Feature selection Final test should be based on at most ~5 genes  Understandability  Both provide a direct competitive advantage • Example of acceptable technique: decision trees 186
  • 187. Decision trees The Weka tool @relation weather.symbolic @attribute outlook {sunny, overcast, rainy} @attribute temperature {hot, mild, cool} @attribute humidity {high, normal} @attribute windy {TRUE, FALSE} @attribute play {yes, no} @data sunny,hot,high,FALSE,no sunny,hot,high,TRUE,no overcast,hot,high,FALSE,yes rainy,mild,high,FALSE,yes rainy,cool,normal,FALSE,yes rainy,cool,normal,TRUE,no overcast,cool,normal,TRUE,yes sunny,mild,high,FALSE,no sunny,cool,normal,FALSE,yes rainy,mild,normal,FALSE,yes sunny,mild,normal,TRUE,yes overcast,mild,high,TRUE,yes overcast,hot,normal,FALSE,yes rainy,mild,high,TRUE,no http://www.cs.waikato.ac.nz/ml/weka/ 187
  • 188. Decision trees: attribute selection • Columns: outlook temperature humidity windy play • Rows: sunny hot high FALSE no; sunny hot high TRUE no; overcast hot high FALSE yes; rainy mild high FALSE yes; rainy cool normal FALSE yes; rainy cool normal TRUE no; overcast cool normal TRUE yes; sunny mild high FALSE no; sunny cool normal FALSE yes; rainy mild normal FALSE yes; sunny mild normal TRUE yes; overcast mild high TRUE yes; overcast hot normal FALSE yes; rainy mild high TRUE no • Select the attribute with maximal gain of information, i.e. maximal reduction of entropy • Entropy = - p_yes log2 p_yes - p_no log2 p_no, with p_yes = 9/14 (play) and p_no = 5/14 (don't play): Entropy = - 9/14 log2 9/14 - 5/14 log2 5/14 = 0.94 bits http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/ 188
  • 189. Decision trees: attribute selection at the root (class entropy 0.94 bits). Counts are (play, don't play); the entropy per value is the amount of information required to specify the class of an example given that it reaches the node • outlook: sunny (2, 3) 0.97 bits, overcast (4, 0) 0.0 bits, rainy (3, 2) 0.97 bits; 0.97 * 5/14 + 0.0 * 4/14 + 0.97 * 5/14 = 0.69 bits; gain: 0.25 bits • humidity: high (3, 4) 0.98 bits, normal (6, 1) 0.59 bits; 0.98 * 7/14 + 0.59 * 7/14 = 0.79 bits; gain: 0.15 bits • temperature: hot (2, 2) 1.0 bits, mild (4, 2) 0.92 bits, cool (3, 1) 0.81 bits; 1.0 * 4/14 + 0.92 * 6/14 + 0.81 * 4/14 = 0.91 bits; gain: 0.03 bits • windy: false (6, 2) 0.81 bits, true (3, 3) 1.0 bits; 0.81 * 8/14 + 1.0 * 6/14 = 0.89 bits; gain: 0.05 bits
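The entropy and gain figures on this slide can be recomputed with a few lines of Python. This is a minimal sketch using the 14 rows of the weather.symbolic data shown earlier. The helper names `entropy` and `information_gain` are illustrative and are not part of Weka.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# The 14 examples of the weather.symbolic relation; last field is the class "play"
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [tuple(line.split(",")) for line in DATA.splitlines()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Class entropy minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


root_entropy = entropy([row[-1] for row in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}

print(f"class entropy: {root_entropy:.2f} bits")
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: gain {gain:.2f} bits")
```

The attribute with the largest gain, outlook, becomes the root of the tree.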
  • 190. Decision trees: attribute selection in the outlook = sunny branch (0.97 bits) • Rows (outlook temperature humidity windy play): sunny hot high FALSE no; sunny hot high TRUE no; sunny mild high FALSE no; sunny cool normal FALSE yes; sunny mild normal TRUE yes • humidity: high 0.0 bits, normal 0.0 bits; 0.0 * 3/5 + 0.0 * 2/5 = 0.0 bits; gain: 0.97 bits • temperature: hot 0.0 bits, mild 1.0 bits, cool 0.0 bits; 0.0 * 2/5 + 1.0 * 2/5 + 0.0 * 1/5 = 0.40 bits; gain: 0.57 bits • windy: false 0.92 bits, true 1.0 bits; 0.92 * 3/5 + 1.0 * 2/5 = 0.95 bits; gain: 0.02 bits
  • 191. Decision trees: attribute selection in the outlook = rainy branch (0.97 bits) • Rows (outlook temperature humidity windy play): rainy mild high FALSE yes; rainy cool normal FALSE yes; rainy cool normal TRUE no; rainy mild normal FALSE yes; rainy mild high TRUE no • humidity: high 1.0 bits, normal 0.92 bits; 1.0 * 2/5 + 0.92 * 3/5 = 0.95 bits; gain: 0.02 bits • temperature: mild 0.92 bits, cool 1.0 bits (no hot examples); 0.92 * 3/5 + 1.0 * 2/5 = 0.95 bits; gain: 0.02 bits • windy: false 0.0 bits, true 0.0 bits; 0.0 * 3/5 + 0.0 * 2/5 = 0.0 bits; gain: 0.97 bits
  • 192. Decision trees: final tree • outlook? - sunny: humidity? (high: don't play; normal: play) - overcast: play - rainy: windy? (false: play; true: don't play) 192
  • 193. Decision trees Basic algorithm • Initialize top node to all examples • While impure leaves available - select next impure leaf L - find splitting attribute A with maximal information gain - for each value of A add child to L 193
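The basic algorithm can be sketched recursively in Python. This is an ID3-style illustration on the weather data, not Weka's own implementation. It handles each impure leaf by recursion instead of an explicit loop, and the nested-dict tree representation is an assumption made for readability.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [tuple(line.split(",")) for line in DATA.splitlines()]


def entropy(rows):
    """Class entropy in bits; the class label is the last field of each row."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def gain(rows, column):
    """Information gain of splitting the rows on the given column."""
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


def build_tree(rows, columns):
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:      # pure leaf: nothing left to split
        return labels[0]
    if not columns:                # no attributes left: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(columns, key=lambda c: gain(rows, c))  # maximal information gain
    remaining = [c for c in columns if c != best]
    children = {}
    for value in sorted({row[best] for row in rows}):  # one child per value of A
        subset = [row for row in rows if row[best] == value]
        children[value] = build_tree(subset, remaining)
    return {ATTRIBUTES[best]: children}


tree = build_tree(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```

The resulting tree splits on outlook first, then on humidity in the sunny branch and on windy in the rainy branch.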
  • 194. Decision tree built from methylation table Leave-one-out experiment To avoid overfitting Decision tree: Test based on 12 genes Sensitivity: 80% Specificity: 88% 194
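A leave-one-out experiment and the two reported measures can be illustrated with a toy example. Everything below is hypothetical: the eight samples, the three genes and the one-gene "classifier" are made up for the sketch and stand in for the real methylation table and decision tree. Note that the gene is chosen inside each fold, using only the training samples.

```python
# Hypothetical methylation table: one tuple of 0/1 gene features per sample
SAMPLES = [
    (1, 0, 1), (1, 1, 0), (1, 0, 0), (0, 1, 1),   # cancers
    (0, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0),   # normals
]
LABELS = ["cancer"] * 4 + ["normal"] * 4


def train_one_rule(samples, labels):
    """Pick the single gene whose methylation status best matches 'cancer'."""
    def n_correct(gene):
        return sum((s[gene] == 1) == (y == "cancer") for s, y in zip(samples, labels))
    return max(range(len(samples[0])), key=n_correct)


def leave_one_out(samples, labels):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        gene = train_one_rule(train_x, train_y)
        predictions.append("cancer" if samples[i][gene] == 1 else "normal")
    return predictions


def sensitivity_specificity(labels, predictions):
    pairs = list(zip(labels, predictions))
    tp = sum(y == "cancer" and p == "cancer" for y, p in pairs)
    fn = sum(y == "cancer" and p == "normal" for y, p in pairs)
    tn = sum(y == "normal" and p == "normal" for y, p in pairs)
    fp = sum(y == "normal" and p == "cancer" for y, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


predictions = leave_one_out(SAMPLES, LABELS)
sens, spec = sensitivity_specificity(LABELS, predictions)
print(f"sensitivity: {sens:.0%}, specificity: {spec:.0%}")
```

Sensitivity is the fraction of cancers called cancer; specificity is the fraction of normals called normal.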
  • 195. Outline 195
  • 196. Evaluation and deployment • Decide whether to use classification results - Can we use the 12-gene decision tree for classifying new patients? • Verification of all steps - Exercise. The above modelling procedure contains a classical mistake: the test sets used for cross-validation (see leave-one-out) have actually been used for training the model. How? (Weka is not to blame) And how can we fix this? • Check whether business goals have been met - No: test based on 12 genes not useful (max ~5) - Iteration required 196
  • 197. Attempt to rebuild decision tree with at most ~5 genes Minimal leaf size Increased to 12 New Decision tree: Test based on 4 genes Sensitivity decreased from 80% to 64% Specificity increased from 88% to 90% 197
  • 198. Evaluation and deployment The impact of « cost » • Market conditions, cost of goods & royalty structure can limit the number of genes that can be tested 198
  • 199. Evaluation and deployment The importance of « understandability » 199
  • 200. Evaluation and deployment The importance of « understandability » Pre and postmarket requirements imposed for IVDMIA (510k etc) Understandability (NO black boxes) is becoming an important asset 200
  • 201. Outline • Scripting Perl (Bioperl/Python) examples spiders/bots • Databases Genome Browser examples biomart, galaxy • AI Classification and clustering examples WEKA (R, Rapidminer) 201
  • 202. WEKA:: Introduction • A collection of open source ML algorithms  pre-processing  classifiers  clustering  association rule • Created by researchers at the University of Waikato in New Zealand • Java based 202
  • 203. WEKA:: Installation • Download software from http://www.cs.waikato.ac.nz/ml/weka/ - If you are interested in modifying/extending weka there is a developer version that includes the source code • Set the weka environment variable for java - setenv WEKAHOME /usr/local/weka/weka-3-0-2 - setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH • Download some ML data from http://mlearn.ics.uci.edu/MLRepository.html 203
  • 205. Main GUI • Three graphical user interfaces  “The Explorer” (exploratory data analysis)  “The Experimenter” (experimental environment)  “The KnowledgeFlow” (new process model inspired interface) 205
  • 206. Explorer: pre-processing the data • Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary • Data can also be read from a URL or from an SQL database (using JDBC) • Pre-processing tools in WEKA are called “filters” • WEKA contains filters for:  Discretization, normalization, resampling, attribute selection, transforming and combining attributes, … 12/18/2012 206
  • 207. WEKA only deals with “flat” files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ... 207
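To make the "flat" file layout concrete, here is a small Python sketch that reads the subset of ARFF used on this slide: `@relation`, numeric and nominal `@attribute` lines, comma-separated `@data` rows, and `?` for a missing value. The `parse_arff` function is a hypothetical helper, not a Weka API. It ignores quoted values, sparse rows, dates and the other features of the full format.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):       # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":                   # missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:                              # nominal: keep the label
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):               # nominal: list of allowed labels
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


ARFF = """\
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(ARFF)
print(relation, len(attributes), "attributes,", len(rows), "rows")
print(rows[3])
```

Each row is one example with one value per declared attribute, which is what "flat" means here: no nested or relational structure.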
  • 209. 2 University of Waikato 12/18/2012 209 0
  • 230. Explorer: building “classifiers” • Classifiers in WEKA are models for predicting nominal or numeric quantities • Implemented learning schemes include: - Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes' nets, … 230
  • 231. Decision Tree Induction: Training Dataset. This follows an example of Quinlan's ID3 (Playing Tennis) • Columns: age income student credit_rating buys_computer • Rows: <=30 high no fair no; <=30 high no excellent no; 31…40 high no fair yes; >40 medium no fair yes; >40 low yes fair yes; >40 low yes excellent no; 31…40 low yes excellent yes; <=30 medium no fair no; <=30 low yes fair yes; >40 medium yes fair yes; <=30 medium yes excellent yes; 31…40 medium no excellent yes; 31…40 high yes fair yes; >40 medium no excellent no December 18, 2012 231
  • 232. Output: A Decision Tree for “buys_computer” • age? - <=30: student? (no: no; yes: yes) - 31..40: yes - >40: credit rating? (excellent: no; fair: yes) 232
  • 255. Explorer: finding associations • WEKA contains an implementation of the Apriori algorithm for learning association rules - Works only with discrete data • Can identify statistical dependencies between groups of attributes: - milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000) • Apriori can compute all rules that have a given minimum support and exceed a given confidence 258
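Support and confidence, the two quantities Apriori filters on, can be illustrated on a handful of made-up shopping baskets. The transactions below are hypothetical, and the sketch only evaluates a single rule. It does not implement Apriori's candidate generation and pruning. Support is counted as an absolute number of transactions, matching the "support 2000" style used on the slide.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies; report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical baskets, for illustration only
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"butter", "eggs"},
]

rule_support = support(transactions, {"milk", "butter", "bread", "eggs"})
rule_confidence = confidence(transactions, {"milk", "butter"}, {"bread", "eggs"})
print(f"milk, butter => bread, eggs: support {rule_support}, confidence {rule_confidence:.2f}")
```

A rule is kept only if both numbers reach the thresholds chosen by the user.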
  • 256. Explorer: data visualization • Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem • WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) - To do: rotating 3-d visualizations (XGobi-style) • Color-coded class values • “Jitter” option to deal with nominal attributes (and to detect “hidden” data points) • “Zoom-in” function 12/18/2012 259