This is a brief introduction to Linux, with an emphasis on the command-line interface. This presentation was given to participants of the H3ABioNet Introductory Bioinformatics workshop held in Accra, Ghana on 26 March, 2014.
2. Outline
1. What is Linux?
2. Command-line Interface, Shell & BASH
3. Popular commands
4. File Permissions and Owners
5. Installing programs
6. Piping and Scripting
7. Variables
8. Common applications in bioinformatics
9. Conclusion
13/05/2014 H3ABioNet Workshop 1: Day 4 2
3. What is Linux?
• Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution.
• UNIX is a multitasking, multi-user computer OS originally developed in 1969.
Linus Torvalds – Former Chief
architect of Linux Kernel and
current project Coordinator
4. What is Linux?
• Operating system (OS):
Set of programs that manage
computer hardware resources
and provide common services for
application software.
• Kernel
5. What is Linux?
• The Linux kernel (v0.01) was first released in 1991. The current stable version is 3.13, released in January 2014.
• The underlying source code of the Linux kernel may be used, modified, and distributed, commercially or non-commercially, by anyone under licenses such as the GNU General Public License.
• Therefore, different varieties of Linux have arisen to serve different needs and tastes. These are called Linux distributions (or distros).
• All Linux distros have the Linux kernel in common.
6. What is Linux?
[Diagram: a Linux distribution combines the Linux kernel, supporting packages, and free, open-source, or proprietary software]
7. What is Linux?
• There are over 600 Linux distributions, over 300 of which are in
active development.
8. What is Linux?
• Linux distributions share core components but may look different
and include different programs and files.
• For example:
9. What is Linux?
Commercially-backed distros
• Fedora (Red Hat)
• OpenSUSE (Novell)
• Ubuntu (Canonical Ltd.)
• Mandriva Linux (Mandriva)
Community-driven distros
• Debian
• Gentoo
• Slackware
• Arch Linux
Ubuntu is the most popular desktop Linux distribution with 20 million daily users worldwide, according to ubuntu.com.
10. Shell, Command-line Interface & BASH
The shell provides an interface for users of an operating system.
[Figure: Command-line interface (CLI) vs. Graphical User Interface (GUI)]
11. Shell, Command-line Interface & BASH
Topic | CLI | GUI
Ease of use | Generally more difficult to successfully navigate and operate a CLI. | Much easier when compared to a CLI.
Control | Greater control of file system and operating system in a CLI. | More advanced tasks may still need a CLI.
Resources | Uses fewer resources. | Requires more resources to load icons etc.
Scripting | Easily script a sequence of commands to perform a task or execute a program. | Limited ability to create and execute tasks, compared to CLI.
12. Shell, Command-line Interface & BASH
• A command is a directive to a computer program, acting as an interpreter of some kind, to perform a specific task.
• BASH is the primary shell for GNU/Linux and Mac OS X.
Shell → CLI → BASH (Bourne-Again SHell)
13. Shell, Command-line Interface & BASH
• A Linux command typically consists of a program name, followed by options and arguments.
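As a small illustration of the program name / options / arguments structure, the sketch below runs ls on a scratch directory. The directory and the two file names are made up for the example.

```shell
# Create a scratch directory holding two hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/reads.fastq" "$workdir/genome.fasta"

# program name: ls
# option:       -1   (print one entry per line)
# argument:     the directory to list
listing=$(ls -1 "$workdir")
echo "$listing"

rm -rf "$workdir"
```

Dropping the option (ls "$workdir") or the argument (ls -1) still gives a valid command; only the program name is required.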
14. Shell, Command-line Interface & BASH
Useful BASH shortcuts…
Shortcut Meaning
15. Popular commands
• Directory structure
Default working
directory after user
login
Complete directory path: /home/user/Documents/LinuxClass
16. Popular commands
• Changing working directories
Command: cd
Default working
directory after user
login
Move to parent
directory
Move to child
directory
Move using complete path: cd /home/user/Documents/LinuxClass
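The three kinds of move described on this slide can be tried safely in a scratch directory. The sketch below assumes a made-up tree that mirrors Documents/LinuxClass; it is not the workshop's actual file layout.

```shell
# Remember where we started so the session is left unchanged.
start=$(pwd)

# Hypothetical tree mirroring .../Documents/LinuxClass, built in a scratch directory.
base=$(mktemp -d)
mkdir -p "$base/Documents/LinuxClass"

cd "$base/Documents/LinuxClass"    # move using a complete (absolute) path
cd ..                              # move to the parent directory
parent=$(basename "$(pwd)")
cd LinuxClass                      # move to a child directory (relative path)
child=$(basename "$(pwd)")
echo "parent=$parent child=$child"

cd "$start"                        # go back; a bare `cd` would go to the home directory
rm -rf "$base"
```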
19. Popular Commands
• Monitoring & managing resources
Task | Command
Hard disk usage | df -lh
RAM memory usage | free -m
What processes are running in real-time? | top
Snapshot of current processes | ps aux
Stop a process running in the terminal | CTRL + C
Stop a process that is running outside the terminal | kill <PID>
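The kill <PID> workflow can be rehearsed without touching any real job. In this minimal sketch, a background sleep stands in for a long-running process; the variable names are illustrative only.

```shell
sleep 300 &                      # stand-in for a long-running job started in the background
pid=$!                           # $! holds the PID of the most recent background job

# kill -0 sends no signal; it only checks whether the process exists.
if kill -0 "$pid" 2>/dev/null; then running_before=yes; else running_before=no; fi

kill "$pid"                      # ask the process to terminate (default signal: TERM)
wait "$pid" 2>/dev/null || true  # collect its exit status so it is fully gone

if kill -0 "$pid" 2>/dev/null; then running_after=yes; else running_after=no; fi
echo "before kill: $running_before, after kill: $running_after"
```

For a process you did not start from the current shell, the PID would typically be looked up first with ps aux or top.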
20. Popular Commands
• Monitoring Network Connections
– Do I have an internet connection?
ping <web address>
– The ping command reports how long a message takes to travel to the given server and back.
21. Popular Commands
• Downloading files
– wget <url of file>
– curl <url of file>
• wget is a free software package for retrieving files using
HTTP, HTTPS and FTP, the most widely-used Internet protocols.
• curl is a tool to transfer data from or to a server, using one of
several supported protocols
(DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, etc).
22. Popular Commands
• Remote Connections
– How can I get access to a remote computer?
ssh user@hostname
– The ssh (secure shell) command securely logs you into a
remote computer where you already have an account.
– X11 connections are possible using -X option.
– Example:
ssh -X user1@cbsuwrkst1.tc.cornell.edu
– scp, sftp commands allow users to securely copy files to
or from remote computers
23. Command-line help
Getting help (offline)
• More information about a command can be found from manual
pages
COMMAND: man
Example: man ls
• ARGUMENTS: -h or --help
Example: blastall --help
24. Command-line help
Getting help (online)
• Go to explainshell.com
• Type in a command line to see the help text that matches each argument.
25. Command-line help
• Output from explainshell.com, for:
– grep '>' fasta | sed 's/>//' > id.txt
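To see what that command line does in practice, here is a minimal sketch that runs it on a tiny made-up FASTA file. The two records are invented for the example; the file name "fasta" follows the slide.

```shell
workdir=$(mktemp -d)

# A tiny, made-up FASTA file with two records.
printf '>seq1 sample A\nACGT\n>seq2 sample B\nGGCC\n' > "$workdir/fasta"

# grep keeps the header lines, sed strips the first '>' on each,
# and '>' (outside the quotes) redirects the result into id.txt.
grep '>' "$workdir/fasta" | sed 's/>//' > "$workdir/id.txt"

ids=$(cat "$workdir/id.txt")
echo "$ids"

rm -rf "$workdir"
```

Note the two roles of '>': quoted, it is the pattern being searched for; unquoted, it is the shell's redirection operator.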
26. File Permissions and Owners
• Linux is a multi-user OS. Therefore, different users can create, modify, or delete the same files.
• To control access and modification of user files, Linux has a file
permission and ownership system.
• This system consists of two parts:
– Who is the owner of the file or directory?
– What type of access does each user have?
27. File Permissions and Owners
• Each file and directory has three user-based permission groups:
1. Owner (u) - The Owner permissions apply only to the owner of the file or directory.
2. Group (g) - The Group permissions apply only to the group that has been assigned to the file or directory.
3. All Other Users (o) - These permissions apply to all other users on the system. In chmod, 'a' (all) is shorthand for u, g, and o together.
• Each file or directory has three basic permission types:
1. Read (r) - The Read permission refers to a user's capability to read the contents of the file, or to list the contents of a directory.
2. Write (w) - The Write permission refers to a user's capability to write or modify a file or directory.
3. Execute (x) - The Execute permission affects a user's capability to execute a file or to enter a directory and access its contents.
28. File Permissions and Owners
Information about a file's permissions: ls -l <file_name>
[me@linuxbox me]$ ls -l some_file
-rw-rw-r-- 1 me me 1097374 Sep 26 18:48 some_file
29. File Permissions and Owners
• The chmod command is used to modify file and directory permissions. Typical permissions are read (r), write (w), and execute (x).
syntax: chmod [options] permissions files
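A minimal sketch of chmod in both its octal and symbolic forms, run on a throwaway file. The file name myscript.sh and the chosen modes are assumptions made for the illustration.

```shell
workdir=$(mktemp -d)
f="$workdir/myscript.sh"
printf '#!/bin/sh\necho hello\n' > "$f"

chmod 644 "$f"                         # octal form: rw- for owner, r-- for group and others
if [ -x "$f" ]; then before=yes; else before=no; fi

chmod u+x "$f"                         # symbolic form: add execute (x) for the owner (u)
if [ -x "$f" ]; then after=yes; else after=no; fi

perms=$(ls -l "$f" | cut -c1-10)       # first 10 characters of ls -l: file type + 9 permission bits
echo "executable before: $before, after: $after, mode: $perms"

rm -rf "$workdir"
```

The ten characters read as one file-type character followed by three rwx triplets for owner, group, and others.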
30. File Permissions and Owners
• sudo
– is a command for Unix-like computer operating systems that
allows users to run programs with the security privileges of
another user (normally the superuser, or root). Its name is a
concatenation of the su command (which grants the user a shell
for the superuser) and "do", or take action.
– Example: sudo cp ./myscript.pl /usr/local/bin/
31. Installing Programs
1. Using package managers
1.1 Graphical package manager, example Synaptic for Ubuntu
1.2 High-level command-line package manager, example apt for Debian
1.3 Low-level command-line package manager, example dpkg for Debian
2. Copy executable file of program to PATH*
2.1 Pre-compiled
2.2 Build from source
* PATH is a colon-separated list of directories, such as /usr/local/bin, where BASH looks for commands
32. Installing Programs
1.1 Using graphical package manager (Synaptic on Ubuntu)
33. Installing Programs
• Search and install programs using Synaptic on Ubuntu
35. Piping and Scripting
• Piping: Run different programs sequentially where the output
of one program becomes the input for the next one.
• Bash uses the “|” sign (pipe) to pipe the output of one
program as the input of another program.
• For example:
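One possible pipeline, sketched on made-up input: it finds the most frequent entry in a list of organism names. The names and the variable top are illustrative only; any text with one item per line would work the same way.

```shell
# sort       groups identical lines together
# uniq -c    collapses repeats and prefixes each line with its count
# sort -rn   orders by count, largest first
# head -n 1  keeps only the most frequent entry
# awk        prints "name count" without the padding that uniq adds
top=$(printf 'human\nmouse\nhuman\nyeast\nmouse\nhuman\n' |
  sort | uniq -c | sort -rn | head -n 1 | awk '{print $2, $1}')
echo "$top"
```

Each program only sees the output of the one before it, so stages can be added or removed one at a time while building up the pipeline.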
36. Piping and Scripting
• Another common operation is redirecting standard output (stdout) to a file using '>' (create the file, or overwrite it if it exists) or '>>' (append).
• Example:
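A short sketch of the difference between '>' and '>>', using a scratch log file whose name and contents are invented for the example.

```shell
workdir=$(mktemp -d)
log="$workdir/log.txt"

echo "first run"  >  "$log"    # '>' creates the file
echo "second run" >  "$log"    # '>' again: the file is overwritten, "first run" is lost
echo "third run"  >> "$log"    # '>>' appends to the existing content

contents=$(cat "$log")
lines=$(wc -l < "$log" | tr -d ' ')
echo "$contents"
echo "lines in log: $lines"

rm -rf "$workdir"
```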
37. Piping and Scripting
• A shell program, called a script, is a tool for building applications by
"gluing together" system calls, tools, utilities, and compiled
binaries.
• For example: fasta_seq_count.sh
#!/bin/bash
# Count sequences in a FASTA file (1st argument)
grep -c '>' "$1"
• To run this script:
1. Give the script execute permission:
chmod u+x fasta_seq_count.sh
2. Run it:
./fasta_seq_count.sh <fasta_file>
(or: bash fasta_seq_count.sh <fasta_file>)
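The whole cycle (write the script, make it executable, run it) can be rehearsed in a scratch directory. This sketch assumes a three-record test file invented for the purpose; test.fasta is a hypothetical name.

```shell
workdir=$(mktemp -d)

# Write the script; the quoted 'EOF' keeps "$1" from being expanded here.
cat > "$workdir/fasta_seq_count.sh" <<'EOF'
#!/bin/bash
# Count sequences in a FASTA file (1st argument)
grep -c '>' "$1"
EOF

# A made-up FASTA file with three records.
printf '>a\nACGT\n>b\nTTGA\n>c\nCCGG\n' > "$workdir/test.fasta"

chmod u+x "$workdir/fasta_seq_count.sh"
count=$(bash "$workdir/fasta_seq_count.sh" "$workdir/test.fasta")
echo "sequences: $count"

rm -rf "$workdir"
```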
38. Variables
• A variable is a name assigned to a location or set of locations
in computer memory, holding an item of data.
• Variables in BASH can be put into two categories:
1. System variables: Variables defined by system, such as PATH and
HOME
2. User-defined variables: Variables defined by a user during shell
session.
Example:
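A minimal sketch of user-defined variables next to a system variable. The names project and sample_count, and their values, are invented for the illustration.

```shell
# User-defined variables: no spaces around '=', and no '$' when assigning.
project="malaria_rnaseq"
sample_count=12

# Use '$' to read a variable back; double quotes still allow expansion.
message="Project $project has $sample_count samples"
echo "$message"

# System variable, already defined by the shell when the session starts.
echo "Command search path: $PATH"
```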
40. Variables
• Commands to interact with variables
• Example: Add a program executable directory to your PATH.
export PATH=/home/user/shscripts:$PATH
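The effect of prepending a directory to PATH can be checked with a throwaway script. In this sketch the shscripts directory lives in a scratch location and hello_h3a is a hypothetical command name, chosen so that it does not clash with anything installed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/shscripts"

# A one-line script placed in the new directory.
printf '#!/bin/sh\necho "hello from shscripts"\n' > "$workdir/shscripts/hello_h3a"
chmod u+x "$workdir/shscripts/hello_h3a"

# Prepend the directory; the existing PATH is kept after the colon.
export PATH="$workdir/shscripts:$PATH"

found=$(command -v hello_h3a)   # where the shell now finds the command
output=$(hello_h3a)             # run it by name alone, without a path
echo "$found -> $output"
```

The change lasts only for the current shell session; putting the export line in .bashrc would make it apply to every new session.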
41. Common applications in bioinformatics
• FASTA file manipulation
– FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
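Two typical manipulations, sketched on a made-up file: counting records and reporting the length of each sequence. The file example.fasta and its contents are assumptions; the awk program is one possible approach, not a standard tool.

```shell
workdir=$(mktemp -d)
fasta="$workdir/example.fasta"

# Two made-up records; the first sequence is wrapped over two lines.
printf '>seq1\nACGTAC\nGT\n>seq2\nGGCC\n' > "$fasta"

# Count records: every record starts with a '>' header line.
n_seqs=$(grep -c '^>' "$fasta")

# Sequence lengths: on each header, print the previous record and reset the counter;
# on every other line, add the line length to the running total.
lengths=$(awk '/^>/ { if (name) print name, len; name = substr($0, 2); len = 0; next }
               { len += length($0) }
               END { if (name) print name, len }' "$fasta")

echo "records: $n_seqs"
echo "$lengths"

rm -rf "$workdir"
```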
43. Common applications in bioinformatics
• BLAST output manipulation
– The BLAST tabular format is one of the most common and useful formats for presenting BLAST output. It has 12 columns:
query_id, subject_id, %identity, align_length, mismatches, gap_openings, q_start, q_end, s_start, s_end, e_value, bit_score
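Because the columns are tab-separated and always in the same order, standard tools can filter and rank hits by column number. The sketch below uses three invented hits; the thresholds (90% identity, alignment length 100) are arbitrary choices for the illustration.

```shell
workdir=$(mktemp -d)
blast="$workdir/hits.tsv"

# Three made-up hits in the 12-column order listed above.
printf 'q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360\n'  > "$blast"
printf 'q1\ts2\t85.00\t150\t22\t1\t10\t159\t30\t179\t1e-30\t120\n' >> "$blast"
printf 'q2\ts3\t99.00\t300\t3\t0\t1\t300\t1\t300\t0.0\t550\n'      >> "$blast"

# Keep hits with %identity (column 3) >= 90 and align_length (column 4) >= 100.
strong=$(awk -F'\t' '$3 >= 90 && $4 >= 100 { print $1, $2, $3 }' "$blast")

# Best hit per query: sort by query, then by bit_score (column 12) descending,
# and keep the first line seen for each query_id.
tab=$(printf '\t')
best=$(sort -t "$tab" -k1,1 -k12,12nr "$blast" | awk -F'\t' '!seen[$1]++ { print $1, $2 }')

echo "$strong"
echo "$best"

rm -rf "$workdir"
```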
45. Common applications in bioinformatics
• High throughput sequencing software
– Create a report on the quality of a read set: fastqc
– Assemble reads into contigs: velvet, SPAdes, etc.
– Align reads to a known reference sequence: SHRiMP, Bowtie2,
BWA etc.
– Many other tools: samtools, picard, GATK, etc.
46. Conclusion
• Linux is a free and open source OS with
powerful and flexible command-line tools to
advance your bioinformatics research
projects.
• While learning to use these tools may be challenging at first, the rewards of UNIX/Linux command-line proficiency are worth the effort.
47. References
• Basic Linux by Aureliano Bombarely Gomez, Boyce Thompson Institute for
Plant Research
• Bash Scripting Guide by Mendel Cooper
• Introduction to Linux for Bioinformatics by Joachim Jacob, Bioinformatics
Training and Service facility (BITS)
• http://www.gnu.org/software/
• Linux commands, with detailed examples and
explanations: http://www.linuxconfig.org/linux-commands
• The Unix Shell (Software Carpentry): http://software-carpentry.org/v4/shell/index.html
• Bioinformatics on the Command line by Paul Harrison, Victorian
Bioinformatics Consortium
Introduction – What is GNU/Linux?, GNU/Linux distributions
Terminal & Virtual Consoles – What is a console; Commands, stdin, stdout, stderr; Typing shortcuts for BASH
Popular commands – Directories, Files, File Compressions; Manual and Help; Networking and Monitoring Resources
The Unix philosophy emphasizes building short, simple, clear, modular, and extendable code that can be easily maintained and repurposed by developers other than its creators. In the words of Doug McIlroy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. The most common certified version of Unix is Apple's OS X, while Linux is the most popular non-certified workalike. Linus Torvalds is a Finnish software engineer best known as the chief architect of the Linux kernel. More on the shell later. Linux is also considered a variant of the GNU operating system, initiated in 1983 by Richard Stallman. Therefore, the Free Software Foundation prefers the name GNU/Linux when referring to the operating system as a whole (see GNU/Linux naming controversy). Most operating systems can be grouped into two different families. Aside from Microsoft's Windows NT-based operating systems, nearly everything else traces its heritage back to Unix. Linux, Mac OS X, Android, iOS, Chrome OS, Orbis OS used on the PlayStation 4, and whatever firmware is running on your router are all often called "Unix-like" operating systems.
In computing, the kernel is a computer program that manages input/output requests from software and translates them into data processing instructions for the central processing unit and other electronic components of a computer. The kernel is a fundamental part of a modern computer's operating system.
The operating system will consist of the Linux kernel and, usually, a set of libraries and utilities from the GNU Project, with graphics support from the X Window System. In software, a package management system, also called a package manager, is a collection of software tools to automate the process of installing, upgrading, configuring, and removing software packages for a computer's operating system in a consistent manner. Although all Linux distros have the Linux kernel in common, the graphical user interface, system, file structure, and desktop and server applications vary significantly. Unlike the authors of most operating systems resembling Unix, Torvalds did not use any of the original Unix source code and chose to release his code under the GNU (GNU's Not Unix) General Public License. To this day, the GNU license allows the free distribution of Linux and its derivatives as long as copies are released under the same license and include the source code.
Supporting packages include libraries and tools to automate the process of installing, upgrading, configuring, and removing software packages for a computer's operating system in a consistent manner. The kernel of UNIX is the hub of the operating system: it allocates time and memory to programs, handles the filestore and communications in response to system calls, interacts with hardware, etc. Linux distributions include the Linux kernel, supporting utilities and libraries, and usually a large amount of application software to fulfil the distribution's intended use. Proprietary software, or closed source software, is computer software licensed under exclusive legal right of the copyright holder with the intent that the licensee is given the right to use the software only under certain conditions, and restricted from other uses, such as modification, sharing, studying, redistribution, or reverse engineering.
The large number of distributions available, especially those that are still in active development, is testament to the diversity of appearance and purpose that can be obtained when software is free and open-source. The Linux kernel has benefited from the contributions of thousands of programmers over the years. The philosophy is to have the choice of several exchangeable components to customize your experience. Linux distros differ in desktop environment, file managers, etc.
All the so-called "Linux" distributions are really distributions of GNU/Linux. GNU is usually the first layer of user interaction. Some distributions, notably Debian, use GNU/Linux when referring to the operating system as a whole. The naming issue remains controversial. As of May 2011, about 8% of a modern Linux distribution is made of GNU components, as determined by counting lines of source code making up Ubuntu's "Natty" release; meanwhile, about 9% is taken by the Linux kernel. GNU stands for "GNU's Not Unix". A gnu is also a large dark antelope with a long head.
In Ubuntu Linux, the default web browser is Firefox. In Debian the default web browser is Iceweasel (a rebranding of Mozilla Firefox). Although all Linux distros have the Linux kernel in common, the graphical user interface, system, file structure, and desktop and server applications vary significantly. Distributions (often called distros for short) are operating systems including a large collection of software applications such as word processors, spreadsheets, media players, and database applications.
Ubuntu is a Nguni Bantu term from Southern Africa (South Africa and Zimbabwe), literally "human-ness", roughly translating to "human kindness". Ubuntu Linux is a Debian-based Linux operating system, with Unity as its default desktop environment. The goal of Linux is to be as invisible as possible, doing the heavy lifting in the background. This GNU/Linux operating system is a solid core for a lot of computers and devices.
Quite often, people new to an operating system other than Microsoft Windows are confronted with the terms CLI (Command Line Interface) and GUI (Graphical User Interface). Pretty soon they get a notion of what those two are, but at this stage they are still far from being able to tell which is the "better" one. In fact, neither is better: it depends on the tasks that need to be done, how experienced a user is, and their personal preferences.
The interaction between a computer operating system or application software and the user is facilitated by the shell. The shell includes both command-line and graphical elements for interacting with the OS and applications. Graphical user interfaces (GUIs) are helpful for many tasks, but they are not good for all tasks.
Standard streams are preconnected input and output channels between a computer program and its environment (typically a text terminal) when it begins execution. The three I/O connections are called standard input (stdin), standard output (stdout), and standard error (stderr). By default, standard input is read from the keyboard, while standard output and standard error are printed to the screen. BASH is the default shell for most Linux distros and Mac OS X. The Bourne-Again shell is a clone of the Bourne shell developed by the Free Software Foundation. BASH is the Bourne shell, born again.
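The three standard streams can be told apart by sending each one somewhere different. This is a minimal sketch: the function report and the file names out.txt and err.txt are made up for the example.

```shell
workdir=$(mktemp -d)

# A small function that writes one line to each output stream.
report() {
  echo "result line"          # goes to standard output (stdout, file descriptor 1)
  echo "warning line" >&2     # goes to standard error (stderr, file descriptor 2)
}

# Send stdout and stderr to separate files instead of the screen.
report > "$workdir/out.txt" 2> "$workdir/err.txt"

out=$(cat "$workdir/out.txt")
err=$(cat "$workdir/err.txt")

# Feed a file to a program on standard input (stdin) instead of the keyboard.
n=$(wc -l < "$workdir/out.txt" | tr -d ' ')

echo "stdout captured: $out"
echo "stderr captured: $err"
echo "lines read from stdin: $n"

rm -rf "$workdir"
```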
There may be several options, or none at all.
The easiest way to check from the Unix command line whether the internet connection works is to send a request to a known server (e.g. www.google.com) using the ping <web address> command. The command reports how long a message takes to travel to the given server and back. It sends an ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway.
Pipes: curl follows the traditional Unix style more closely; it sends more to stdout and reads more from stdin, in an "everything is a pipe" manner. Recursive downloads: wget's major strength compared to curl is its ability to download recursively, or even just download everything that is referred to from a remote resource, be it an HTML page or an FTP directory listing. Curl vs wget: http://daniel.haxx.se/docs/curl-vs-wget.html
ssh encrypts all data that travels across its connection, including your username and password (which you'll use to log in to the remote machine). When the node's server arrives, users will be able to log in remotely to run jobs on the server. This is common for bioinformatics facilities, and enables users to use the greater resources of the server, such as storage and processing capacity. Other commands such as sftp and scp allow users to securely copy files to and from remote hosts.
Some commands have additional arguments
To extract the ID lines from a FASTA file, remove the leading '>' from each, and redirect the output to an id.txt file.
Linux has inherited from UNIX the concept of ownerships and permissions for files. This is basically because it was conceived as a networked system where different people would be using a variety of programs, files, etc. Linux is also multi-tasking, meaning one user can use the same computer to do multiple jobs. In UNIX, everything is a file.
The Unix operating system (and likewise, Linux) differs from other computing environments in that it is not only a multitasking system but also a multi-user system. Each file is assigned 9 permissions: 3 for the owning user (u), 3 for the owning group (g), and 3 for everyone else (o). ls -al shows something like this for each file/dir: drwxrwxrwx
The chmod (change mode) command protects files and directories from unauthorized users on the same system, by setting access permissions.
Does Linux need antivirus software? All computer systems can suffer from malware and viruses, including Linux. Very few viruses exist for Linux, so users typically do not install antivirus software. It is nevertheless recommended that Linux users have antivirus software installed. Some users may argue that antivirus software uses too many resources, but low-footprint software exists for Linux.
PATH ($PATH) is a colon-separated list of directories that tells the shell where to look for programs.
Standard users, by default, cannot install applications on a Linux machine. To install an application successfully, you need superuser privileges, so you prefix the installation command with "sudo", for example: sudo dpkg -i software.deb. To add a user to the list of sudoers: # adduser foo sudo
PHYLIP : Phylogeny Inference Package, computer programs for inferring phylogenies
A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for wanting to transform source code is to create an executable program. Check the program author's website for detailed instructions on how to build the program from source.
In a pipeline such as command | command2, the standard output of command is connected via a pipe to the standard input of command2. This connection is performed before any redirections specified by the command. If |& is used, the standard error of command, in addition to its standard output, is connected to command2's standard input through the pipe; it is shorthand for 2>&1 |. This implicit redirection of the standard error is performed after any redirections specified by the command.
A pipeline is a sequence of one or more commands separated by one of the control operators: | or |&.
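The difference between piping standard output alone and piping standard output together with standard error can be seen by counting the lines that reach the next command. In this sketch the function report is made up; the portable 2>&1 | spelling is used, for which |& is the BASH shorthand.

```shell
# A small function that writes one line to stdout and one to stderr.
report() {
  echo "result line"
  echo "warning line" >&2
}

# Plain pipe: only stdout travels through the pipe; stderr still goes to the screen.
plain=$(report | wc -l | tr -d ' ')

# 2>&1 | : stderr is merged into stdout first, so both lines reach wc.
merged=$(report 2>&1 | wc -l | tr -d ' ')

echo "lines through plain pipe: $plain"
echo "lines through merged pipe: $merged"
```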
A shell program, called a script, is an easy-to-use tool for building applications by "gluing together" system calls, tools, utilities, and compiled binaries. Scripts are programs written for a special run-time environment that can interpret (rather than compile) and automate the execution of tasks which could alternatively be executed one-by-one by a human operator. The shell is more than just the insulating layer between the operating system kernel and the user; it is also a fairly powerful programming language. Bash has become a de facto standard for shell scripting on most flavors of UNIX. Bourne shell compliant scripts are conventionally given a .sh extension.
A variable is nothing more than a label, a name assigned to a location or set of locations in computer memory holding an item of data. PATH – Your shell search path: directories separated by colons. HOME – Your home directory, such as /home/Smith.
The BASH variable holds the full path name used to invoke the current instance of BASH.
You can use env, or printenv <variable_name>, to print shell variables. The scope of a variable (i.e., which programs know about it) is, by default, the shell in which it is defined. To make a variable and its value available to other programs your shell invokes (i.e., subshells), use the export command. This variable then becomes an environment variable, since it is available to other programs in your shell's "environment".
The configuration of your bash shell is found in the hidden file .bashrc, which is in your home directory. The usual syntax for setting an environment variable in your .bashrc is: export VARIABLE=value
NOTE: There should be no space between the variable name, the equals sign ("="), and the value. If your value has spaces, the whole value should be put in quotes.
For the environment variable PATH, it is good practice to prepend additional paths to it, separated by colons (":"), since it is a system-defined environment variable. For example: export PATH=$HOME/bin/perl:$PATH
NOTE: Environment variables are accessed by prepending the dollar sign ("$"), but when they are defined the dollar sign is omitted.
PATH is a variable that contains the directories in which your shell (BASH) looks for commands. These directories are separated by colons.
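The scope rule for export can be checked by asking a child process what it sees. This is a minimal sketch; H3A_LOCAL and H3A_EXPORTED are hypothetical variable names invented for the example.

```shell
# Start from a clean slate in case the names already exist.
unset H3A_LOCAL H3A_EXPORTED

H3A_LOCAL="only in this shell"                      # plain shell variable
export H3A_EXPORTED="visible to child processes"    # environment variable

# sh -c starts a child process; single quotes stop the parent from expanding the variables.
child_local=$(sh -c 'echo "${H3A_LOCAL:-<unset>}"')
child_exported=$(sh -c 'echo "$H3A_EXPORTED"')

echo "child sees H3A_LOCAL as: $child_local"
echo "child sees H3A_EXPORTED as: $child_exported"
```

The ${NAME:-<unset>} form prints the placeholder text when the variable is not defined, which makes the missing value visible in the output.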
The pattern matching and data extraction capabilities of Linux command-line tools such as grep, sort, and cut enable handling of large text files that would otherwise consume large amounts of memory in GUI applications, whether on Windows or on Linux.
The BLAST programs are widely used tools for searching DNA and protein databases for sequence similarity to identify homologs to a query sequence. The BLAST command-line interface offers additional features such as querying a custom database, e.g. a chromosome of your organism of interest. More advanced usage generally involves taking the output of BLAST as a first step in some kind of script, which can best be achieved from the command line. For example, Torsten's "prokka" tool uses BLAST (amongst other things) to automatically annotate a sequence.
A number of cutting-edge programs (Bowtie, Velvet, Trinity, Stampy, etc.) do not come with a web interface, because the developers have neither the time nor the computing resources to provide web services for everyone. As a rule of thumb, the easier a website is to use, the more difficult it is to develop. Furthermore, it costs a lot of money to maintain data-intensive web services.
The freedoms of the OS encourage the development of more software to add to the already large body of command-line bioinformatics tools. Therefore, it will become common for primarily wet-lab biomedical researchers to have some command-line knowledge and skill. While learning to use these tools may be challenging at first, the rewards of UNIX/Linux command-line proficiency are worth the effort. In the tutorial to follow later today, we are going to guide you through using some commands, and we hope that you have fun doing it.