30. Case Study 2
• Create ssh private/public key-pair
• Log in to the head node using ssh key
• Connect to sm11 via ssh
– Hint: use agent forwarding: -A
• Create ssh shortcuts (on your local machine)
• Connect to sm11 via ssh proxy
• Start sample application (top, or xclock)
• Move it between foreground/background
– Hint: use bg and fg
• Start the application in the background
– Hint: use & (ampersand)
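The ssh steps above can be sketched as a config file plus a few commands. This is a minimal sketch, not the course's own setup: the host name headnode.example.org, the alias head, the user name your_username and the key type are placeholders, and ProxyJump assumes a reasonably recent OpenSSH client. The script only writes a sample file into a temporary directory; nothing in ~/.ssh is touched.

```shell
# Commands to run by hand (placeholders, kept as comments so nothing connects):
#   ssh-keygen -t ed25519                             # create a private/public key pair
#   ssh-copy-id your_username@headnode.example.org    # install the public key on the head node
#   ssh -A your_username@headnode.example.org         # log in with agent forwarding

sample_config="$(mktemp -d)/ssh_config_sample"
cat > "$sample_config" <<'EOF'
# Shortcut for the head node (host name is a placeholder)
Host head
    HostName headnode.example.org
    User your_username
    ForwardAgent yes

# Reach sm11 by jumping through the head node
Host sm11
    User your_username
    ProxyJump head
EOF

n_hosts=$(grep -c '^Host ' "$sample_config")
echo "wrote $n_hosts host entries to $sample_config"
```

Once entries like these are merged into ~/.ssh/config on the local machine, `ssh sm11` should connect through the head node in one step.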
31. Case Study 2 (2)
• Monitor processes
– Hint: use top, ps, pstree
• Experiment with top parameters (man top)
• Stop selected process
– Hint: use kill
• Start multiple processes of the same application
• Check if they are running
– Hint: use ps and grep, pstree and top
• Stop all of them
– Hint: use killall
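The process exercises above can be tried safely with a harmless command. This is a sketch under assumptions: sleep stands in for the sample application, and the processes are stopped by PID rather than with killall, because `killall sleep` would also stop any other sleep processes you own.

```shell
# Start three copies of the same command in the background with '&'.
sleep 300 &
pid1=$!
sleep 300 &
pid2=$!
sleep 300 &
pid3=$!

# 'jobs' lists this shell's background jobs; 'ps' shows the process itself
# (skipped quietly if ps is not installed).
jobs
ps -p "$pid1" -o pid,comm 2>/dev/null || true

# Stop one selected process by PID (the 'kill' hint).
kill "$pid1"
wait "$pid1" 2>/dev/null || true

# kill -0 sends no signal; it only reports whether the process still exists.
if kill -0 "$pid1" 2>/dev/null; then first_state=running; else first_state=stopped; fi

# Stop the remaining copies.
kill "$pid2" "$pid3"
wait "$pid2" "$pid3" 2>/dev/null || true
remaining=0
for pid in "$pid2" "$pid3"; do
    if kill -0 "$pid" 2>/dev/null; then remaining=$((remaining + 1)); fi
done
echo "first: $first_state, still running: $remaining"
```

In an interactive session, `ps aux | grep sleep`, `pstree` and `top` give the same information in different views.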
32. Case Study 3
• Log in to sm11 via ssh
• Start virtual terminal
– Hint: use screen
• Start a process in it (use top)
• Detach from the screen session
– Hint: use Ctrl+a d
• Log out and back in
• List current screen sessions
– Hint: screen -ls
33. Case Study 3 (2)
• Attach to screen session
– Hint: screen -r [sessionID]
• Destroy screen session
– Hint: Ctrl+a k
• Check if the screen session is destroyed
• Experiment with nohup and disown
– Hint: use nohup application_name
– Hint: use disown -h jobID
• Check if they are running
34. Case Study 4
• Log in to the head node via ssh
• Obtain the test dataset file from
http://www.antgenomes.org/~yannickwurm/tmp/Si_gnF.454scaffolds.fasta
• Compress it using gzip
• Compare the sizes
– Hint: use ls -la[h], or du [-s]
• Analyse the contents of the gzipped archive
– Hint: use zcat
• Search for a pattern in the gzipped archive
– Hint: use zgrep
• Extract the contents of the archive
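The gzip steps above can be rehearsed on a tiny file before touching the real dataset. This is a minimal sketch: sample.fasta and its three records are made-up stand-in data, and zcat/zgrep are assumed to be the versions shipped with gzip on Linux.

```shell
workdir=$(mktemp -d)
fasta="$workdir/sample.fasta"

# Made-up stand-in data: three tiny FASTA records.
printf '>scaffold1\nACGTACGTACGT\n>scaffold2\nGGGGCCCCAAAA\n>scaffold3\nTTTTACGTTTTT\n' > "$fasta"

# -c writes the compressed data to stdout, so the original file is kept.
gzip -c "$fasta" > "$fasta.gz"

# Compare the sizes (a file this small may not shrink at all).
ls -lh "$fasta" "$fasta.gz"

# Look inside the archive without unpacking it.
zcat "$fasta.gz" | head -n 2

# Search the archive: count the header lines.
n_headers=$(zgrep -c '^>' "$fasta.gz")

# Extract to a new file and confirm it matches the original.
gunzip -c "$fasta.gz" > "$workdir/restored.fasta"
cmp "$fasta" "$workdir/restored.fasta" && echo "round trip OK, $n_headers records"
```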
35. Case Study 4 (2)
• Compress multiple text files into single archive
– Hint: use tar
• Compress a directory containing multiple files
– Hint: use tar or zip
• List the contents of the tar archive
– Hint: use tar -tvf
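The tar hints above can be sketched as follows. The directory name notes and the two text files are hypothetical examples created in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/notes"
printf 'first file\n'  > "$workdir/notes/a.txt"
printf 'second file\n' > "$workdir/notes/b.txt"

# -c create, -z compress with gzip, -f archive name.
# -C changes directory first, so the stored paths are relative (notes/...).
tar -czf "$workdir/notes.tar.gz" -C "$workdir" notes

# -t lists the contents, -v adds sizes and dates.
tar -tzvf "$workdir/notes.tar.gz"
n_entries=$(tar -tzf "$workdir/notes.tar.gz" | wc -l | tr -d ' ')

# -x extracts; here into a separate directory.
mkdir "$workdir/unpacked"
tar -xzf "$workdir/notes.tar.gz" -C "$workdir/unpacked"
echo "archive holds $n_entries entries"
```

To archive selected files rather than a whole directory, list them after the archive name, for example `tar -czf texts.tar.gz a.txt b.txt`.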
36. Case Study 5
• Log in to sm11
• Download code from
https://github.com/lh3/seqtk/archive/master.zip
• Extract the contents
• Compile
• Analyse the results of the compilation
• Add the binaries to the PATH
37. Case Study 5 (2)
• Run it on the Si_gnF.scaffold.fasta data
– Use it to extract the following sequences
Si_gnF.scaffold05788
Si_gnF.scaffold05760
Si_gnF.scaffold01035
Si_gnF.scaffold07345
Si_gnF.scaffold07801
Si_gnF.scaffold07087
Si_gnF.scaffold05362
Si_gnF.scaffold08533
Si_gnF.scaffold02116
Si_gnF.scaffold08406
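One way to approach the extraction is seqtk's subseq command, which takes a FASTA file and a list of sequence names. This is a sketch under assumptions: seqtk has already been compiled and added to the PATH, the FASTA file from Case Study 4 sits in the current directory under its downloaded name, and names.lst is a hypothetical filename. If either prerequisite is missing, the script only writes the ID list.

```shell
workdir=$(mktemp -d)
fasta="Si_gnF.454scaffolds.fasta"   # file downloaded in Case Study 4

# The ten sequence IDs from the slide, one per line.
cat > "$workdir/names.lst" <<'EOF'
Si_gnF.scaffold05788
Si_gnF.scaffold05760
Si_gnF.scaffold01035
Si_gnF.scaffold07345
Si_gnF.scaffold07801
Si_gnF.scaffold07087
Si_gnF.scaffold05362
Si_gnF.scaffold08533
Si_gnF.scaffold02116
Si_gnF.scaffold08406
EOF
n_ids=$(wc -l < "$workdir/names.lst" | tr -d ' ')

if command -v seqtk >/dev/null 2>&1 && [ -f "$fasta" ]; then
    # subseq prints the records whose names appear in the list.
    seqtk subseq "$fasta" "$workdir/names.lst" > "$workdir/selected.fasta"
    echo "extracted records: $(grep -c '^>' "$workdir/selected.fasta")"
else
    echo "seqtk or $fasta not found; wrote $n_ids IDs to $workdir/names.lst"
fi
```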
38. Case Study 5 (3)
• Can it run on the gzipped version?
• If so, how?
• Repeat for the code from
https://github.com/stamatak/standardRAxML/archive/master.zip
42. Regular expressions:
Text search on steroids.
Regular expression                      Finds
David                                   David
Dav(e|id)                               David, Dave
Dav(e|id|ide|o)                         David, Dave, Davide, Davo
At{1,2}enborough                        Attenborough, Atenborough
Atte[nm]borough                         Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}      Atimbro, Attenbrough, etc.
Easy counting, replacing all with “Sir David Attenborough”
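The counting and replacing mentioned above can be tried from the shell with grep -E and sed -E. This is a minimal sketch: the list of misspelled names is made up for illustration, and the pattern simply joins two expressions from the table.

```shell
workdir=$(mktemp -d)
names="$workdir/names.txt"

# Made-up test data: four spellings that should match, one that should not.
cat > "$names" <<'EOF'
David Attenborough
Dave Atenborough
Davide Attemborough
Davo Atimbro
Richard Dawkins
EOF

pattern='Dav(e|id|ide|o) At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}'

# Counting: -c prints the number of matching lines.
n_matches=$(grep -Ec "$pattern" "$names")

# Replacing: rewrite every variant to one canonical spelling.
sed -E "s/$pattern/Sir David Attenborough/" "$names" > "$workdir/fixed.txt"
n_fixed=$(grep -c '^Sir David Attenborough$' "$workdir/fixed.txt")

echo "matched $n_matches lines, normalised $n_fixed"
```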
43. Regular expressions
Pattern        Synonymous with / meaning
\d             [:digit:], [0-9]
[A-z]          roughly [A-Za-z]; in ASCII order it also matches [ \ ] ^ _ `, so [A-Za-z] is safer
\s             whitespace
.              any single character
.+             one to many of anything
b*             between 0 and infinity letter ‘b’
[^abc]         any character other than a, b or c
\(             (
[:punct:]      any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
• Google “Regular expression cheat sheet”
• ?regexp
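A few of these building blocks can be checked quickly with grep -E. This is a sketch with made-up test strings. Two assumptions worth flagging: POSIX classes such as [:digit:] have to sit inside a bracket expression, hence [[:digit:]], and \d is not part of POSIX extended regular expressions, so the sketch uses the class form instead.

```shell
# [[:digit:]] is the POSIX spelling of [0-9]; note the double brackets.
with_digits=$(printf 'scaffold05788\nno digits here\nreads_2\n' | grep -Ec '[[:digit:]]')

# b* means zero or more 'b': matches ac, abc and abbbc, but not adc.
b_star=$(printf 'ac\nabc\nabbbc\nadc\n' | grep -Ec '^ab*c$')

# [^abc] needs at least one character that is not a, b or c.
not_abc=$(printf 'abc\nabcd\n' | grep -Ec '[^abc]')

# \( matches a literal opening parenthesis.
with_paren=$(printf 'plot()\nplot\n' | grep -Ec '\(')

echo "$with_digits $b_star $not_abc $with_paren"
```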
45. Functions
• R has many, e.g.: plot(), t.test()
• Making your own:
tree_age_estimate <- function(diameter, species) {
  # ...do the magic...
  # maybe something like this (growth.rates is assumed to be a named
  # vector of growth rates, indexed by species, defined elsewhere):
  growth.rate <- growth.rates[species]
  age.estimate <- diameter / growth.rate
  return(age.estimate)
}
> tree_age_estimate(25, "White Oak")
66
> tree_age_estimate(60, "Carya ovata")
190
46. “for” Loop
> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue',
+     'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue',
+     'prussian blue', 'indigo', 'baby blue', 'electric blue')
> possible_colours
 [1] "blue"          "cyan"          "sky-blue"      "navy blue"
 [5] "steel blue"    "royal blue"    "slate blue"    "light blue"
 [9] "dark blue"     "prussian blue" "indigo"        "baby blue"
[13] "electric blue"
> for (colour in possible_colours) {
+     print(paste("The sky is oh so, so", colour))
+ }
[1] "The sky is oh so, so blue"
[1] "The sky is oh so, so cyan"
[1] "The sky is oh so, so sky-blue"
[1] "The sky is oh so, so navy blue"
[1] "The sky is oh so, so steel blue"
[1] "The sky is oh so, so royal blue"
[1] "The sky is oh so, so slate blue"
[1] "The sky is oh so, so light blue"
[1] "The sky is oh so, so dark blue"
[1] "The sky is oh so, so prussian blue"
[1] "The sky is oh so, so indigo"
[1] "The sky is oh so, so baby blue"
[1] "The sky is oh so, so electric blue"
55. Best Practices for Scientific Computing
Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. [...]
56. R style guide
• http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
57.
58. Education
A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble
Department of Genome Sciences, School of Medicine, and Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
Introduction
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.
The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]
[...] understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]
[...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.
Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.
doi:10.1371/journal.pcbi.1000424.g001
[...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]
The Lab Notebook
In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]
[...] These types of entries provide a complete picture of the development of the project over time.
In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote [...]
In each results folder:
• script: getResults.rb or WHATIDID.txt or MyAnalysis.Rnw
• intermediates
• output
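The layout described above can be created with a few commands. This is a minimal sketch, not the article's own script: the project is built in a temporary directory, notebook.txt is a hypothetical name for the lab notebook, and the runall driver is left as an empty placeholder.

```shell
project="$(mktemp -d)/msms"

# Logical organization at the top level.
mkdir -p "$project/data" "$project/results" "$project/doc" "$project/src" "$project/bin"

# Chronological organization below results: <year>-<month>-<day> sorts correctly.
today=$(date +%Y-%m-%d)
mkdir -p "$project/results/$today"

# A dated entry in the lab notebook kept at the root of results.
printf '%s: created results directory, nothing run yet\n' "$today" >> "$project/results/notebook.txt"

# Placeholder driver script for this day's experiment.
: > "$project/results/$today/runall"
chmod +x "$project/results/$today/runall"

ls -R "$project"
```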
60. knitr (Sweave): Analyzing & Reporting in a single file.
Also works with Markdown instead of LaTeX!
MyFile.Rnw
\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{url}
\begin{document}
<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@
\title{A Minimal Demo of knitr}
\author{Yihui Xie}
\maketitle
You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:
<<boring-random,echo=TRUE,cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x);var(x)
@
The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:
<<boring-plots,cache=TRUE,echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
boxplot(x)
hist(x,main='')
@
Do the above chunks work? You should be able to compile the \TeX{} [...]
### in R:
library(knitr)
knit("MyFile.Rnw")
# --> creates MyFile.tex
### in shell:
pdflatex MyFile.tex
# --> creates MyFile.pdf
Rendered PDF (excerpt):
A Minimal Demo of knitr
Yihui Xie
February 26, 2012
You can test if knitr works with this minimal demo. OK, let’s get started with some boring random numbers:
set.seed(1121)
(x <- rnorm(20))
## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 [...]
## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 [...]
## [19] 0.13272 -0.15594
mean(x)
## [1] 0.3217
var(x)
[...]
61. Choosing a programming language
Language                                     Good for                          Bad:
Excel                                        quick & dirty                     easy to make mistakes
R                                            numbers, stats, genomics          programming
Unix commandline (i.e., shell, i.e., bash)   Can’t escape it. Quick & Dirty    programming, complicated things
Java                                         User interfaces in the 1990s.     overcomplicated.
Perl                                         1980s.                            Everything.
Python                                       scripting, text
Ruby                                         scripting, text
Javascript                                   web apps
62. Ruby.
“Friends don’t let friends do Perl” - reddit user
example: reverse the contents of each line in a file
### in PERL:
open INFILE, "my_file.txt";
while (defined ($line = <INFILE>)) {
    chomp($line);
    @letters = split(//, $line);
    @reverse_letters = reverse(@letters);
    $reverse_string = join("", @reverse_letters);
    print $reverse_string, "\n";
}
### in Ruby:
File.open("my_file.txt").each do |line|
  puts line.chomp.reverse
end
63. More ruby examples.
5.times do
puts "Hello world"
end
# Sorting people
people_sorted_by_age = people.sort_by{ |person| person.age}
64. Getting help.
• In real life: Make friends with people. Talk to them.
• Online:
• Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
• Programming: http://stackoverflow.com
• Bioinformatics: http://www.biostars.org
• Sequencing-related: http://seqanswers.com
• Stats: http://stats.stackexchange.com