2. Introduction
6/3/2023 5:43 PM 2
Introduction Sample Preparation Data Analysis
Considerations References
Figure 1: DNA Organization.
Adapted from Henrik's Lab. "ChIP seq - Chromatin Immunoprecipitation sequencing", YouTube, 12 May 2021.
ChIP-Seq is a method used to identify genomic regions bound by specific proteins or
protein modifications, providing insights into gene regulation and
chromatin structure.
3. Sample Preparation
Chemical treatment (Formaldehyde)
TF, Modified Histone, RNA pol
Figure 2: Sample preparation for ChIP-Seq
Adapted from Henrik's Lab. "ChIP seq - Chromatin Immunoprecipitation sequencing", YouTube, 12 May 2021.
4. Sample Preparation
Cell disruption and DNA fragmentation (fragments of 100-300 bp)
Figure 2 (contd..): Sample preparation for ChIP-Seq
Adapted from Henrik's Lab. "ChIP seq - Chromatin Immunoprecipitation sequencing", YouTube, 12 May 2021.
5. Target Enrichment
Immunoprecipitation
Figure 2 (contd..): Sample preparation for ChIP-Seq
Adapted from Henrik's Lab. "ChIP seq - Chromatin Immunoprecipitation sequencing", YouTube, 12 May 2021.
6. Sequencing
Cross-link reversal and library preparation
Sequencing of target DNA fragments
NovaSeq 6000 System
Figure 2 (contd..): Sample preparation for ChIP-Seq
Adapted from Henrik's Lab. "ChIP seq - Chromatin Immunoprecipitation sequencing", YouTube, 12 May 2021.
7. Experimental Design Considerations
1. Antibody selection
2. Chromatin fragmentation
3. Cross-linking conditions
4. Sufficient amount of starting material - 2 x 10^6 cells per immunoprecipitation.
5. Control libraries
6. Reducing artifacts - normalization
7. Biological replicates ≥ 3.
8. Sequencing Considerations
Parameter          Value
Read length        50-150 bp
Sequencing mode    SE, PE
Sequencing depth   20-40 M total reads (for TF); ≥ 40 M for histone marks
Table 1: Sequencing considerations for ChIP-Seq
9. Sequencing Considerations
Figure 3: No. of peaks called vs. sequencing depth
Adapted from Landt et al., "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia,"
Genome Research, 22(9), 1813-1831, 2012.
10. Data Analysis
Figure 4: ChIP-Seq data analysis pipeline
Pipeline stages: Quality control, Read mapping, Peak calling, Data visualization, Functional analysis, Motif analysis, Differential analysis, Integration with other data types, Reproducibility
11. 1. Preprocessing
• Quality Control (QC)
• Read trimming and filtering
• PCR duplicate removal
Important Quality Metrics
a) Per-base sequence quality
b) GC content
c) Overrepresented sequences
Tools: FastQC, MultiQC
12. 2. Alignment
Preprocessed reads are mapped to the reference genome using aligners such as BWA or Bowtie; SAMtools is then used to sort and index the alignments.
Input = FASTQ
Output = SAM, BAM
Aligners: BWA, Bowtie, STAR, NovoAlign
Figure 5: Alignment results from BWA
Adapted from Zymo Research, https://github.com/Zymo-Research/service-pipeline-documentation, Accessed May
6, 2023.
13. 3. Peak Calling
Identification of enriched loci in the genome.
Output = BED format
Tools: MACS, SICER, BayesPeak
Figure 6: Peak-calling summary statistics using MACS2
Adapted from Zymo Research, https://github.com/Zymo-Research/service-pipeline-documentation, Accessed May
6, 2023.
14. 4. Visualization
Figure 7: Peak visualization by DROMPAplus
Adapted from Nakato et al., "Methods for ChIP-seq analysis: A practical workflow and advanced applications," Journal of Biochemistry, 159(4), 335-345, 2016, doi: 10.1093/jb/mvv124.
15. 4. Visualization
Figure 8: Peak visualization by DROMPAplus
Adapted from Nakato et al., "Methods for ChIP-seq analysis: A practical workflow and advanced applications," Journal of Biochemistry, 159(4), 335-345, 2016, doi: 10.1093/jb/mvv124.
16. 4. Visualization
Peaks can be viewed directly in a genome browser, e.g. the UCSC Genome Browser.
Tools: ChIPseeker, IGV
Figure 9: Peaks visualization by UCSC Genome Browser
Adapted from Zymo Research, https://github.com/Zymo-Research/service-pipeline-documentation, Accessed May 6, 2023.
17. 5. Peak Annotation
Tools: ReMap, MGA, RSAT, rGADEM
Figure 10: Peaks annotation by HOMER
Adapted from Zymo Research, https://github.com/Zymo-Research/service-pipeline-documentation, Accessed May 6, 2023.
18. 5. Peak Annotation
Figure 11: Peaks annotation by HOMER
Adapted from Zymo Research, https://github.com/Zymo-Research/service-pipeline-documentation, Accessed May 6, 2023.
19. References
1. Nakato, R., Shirahige, K., & Takahata, S. (2021). Methods for ChIP-seq
analysis: A practical workflow and advanced applications. Genes to Cells,
26(6), 371-382. doi: 10.1111/gtc.12863.
2. Landt, S.G., Marinov, G.K., Kundaje, A. et al. (2012). ChIP-seq guidelines
and practices of the ENCODE and modENCODE consortia. Genome Res.
22(9), 1813-1831. doi: 10.1101/gr.136184.111.
3. Zymo Research. (n.d.). Service Pipeline Documentation. GitHub.
https://github.com/Zymo-Research/service-pipeline-documentation
Editor's notes
Cross-linking between proteins and DNA in ChIP-seq samples is typically reversed with heat and/or a chemical agent, which breaks the cross-links and releases the DNA from the protein-DNA complexes.
The library preparation step in ChIP-seq (chromatin immunoprecipitation sequencing) involves converting the fragmented DNA (or chromatin) obtained from the ChIP-seq sample into a sequencing library, which can be used for high-throughput sequencing.
The library preparation step typically includes the following key steps:
End repair: The fragmented DNA ends are repaired to generate blunt ends, suitable for ligation to sequencing adapters.
Adaptor ligation: DNA sequencing adapters are ligated to the repaired DNA fragments. These adapters contain sequences that are required for the subsequent steps of the sequencing process.
Size selection: The adapter-ligated DNA fragments are size-selected to remove any unligated adapters or fragments that are too small or too large for sequencing.
PCR amplification: The size-selected DNA fragments are amplified by PCR (polymerase chain reaction) to generate sufficient material for sequencing. PCR primers specific to the adapter sequences are used to selectively amplify only the adapter-ligated fragments.
Quality control: The resulting library is evaluated for quality and quantity using various methods, such as gel electrophoresis, qPCR (quantitative PCR), or fluorometry.
The type of library for ChIP-seq (chromatin immunoprecipitation sequencing) can be either single-end or paired-end, depending on the sequencing platform and experimental design.
In single-end sequencing, only one end of the DNA fragment is sequenced, while in paired-end sequencing, both ends of the DNA fragment are sequenced. Paired-end sequencing generates more information per fragment and allows for more accurate mapping of reads to the reference genome.
Most commonly, ChIP-seq libraries are prepared as paired-end libraries, as this allows for more accurate identification of the precise binding location of the protein of interest. However, single-end sequencing may be used in some cases where cost or experimental constraints prohibit the use of paired-end sequencing.
1. An ideal antibody for ChIP-seq should have high specificity, sensitivity, and affinity for the protein of interest. It should be able to recognize the native conformation of the protein and not cross-react with other proteins in the sample. Additionally, the antibody should be able to capture the protein-DNA complexes in a highly efficient and reproducible manner.
2. If the chromatin is over-fragmented, then the DNA fragments may become too short, leading to decreased specificity and accuracy of the ChIP-seq assay. On the other hand, if the chromatin is under-fragmented, then the DNA fragments may become too large, leading to lower resolution of the assay and decreased ability to identify binding sites.
3. Crosslinking is a critical step in ChIP-seq (chromatin immunoprecipitation sequencing) as it plays a crucial role in preserving the protein-DNA interactions within the chromatin and ensuring accurate and reliable results. The conditions used for crosslinking, such as the concentration of formaldehyde, duration of crosslinking, and temperature, are all critical factors that can significantly affect the quality and specificity of the ChIP-seq data.
4. Ensure that you have a sufficient amount of starting material, because the ChIP enriches only a small proportion of it. For a standard protocol, you want approximately 2 x 10^6 cells per immunoprecipitation. If it is difficult to obtain that many cells from your experiment, consider using low-input methods. Ultimately, higher amounts of starting material yield more consistent and reproducible protein-DNA enrichments.
5. A ChIP-Seq peak should be compared with the same region of the genome in a matched control sample because only a fraction of the DNA in our ChIP sample corresponds to actual signal amidst background noise. Control libraries are an essential component of ChIP-seq (chromatin immunoprecipitation sequencing) experiments. In a ChIP-seq experiment, the goal is to identify the genomic regions bound by a specific protein of interest. However, this cannot be accomplished without taking into account the background noise and non-specific binding events that can occur during the experiment.
Control libraries provide a baseline for comparison with the experimental libraries, allowing the identification of regions that are specifically enriched for the protein of interest versus regions that are non-specifically bound or enriched due to experimental noise. Two controls are commonly used: input DNA, which is cross-linked and fragmented chromatin that has not been immunoprecipitated, and a "mock IP" or "IgG" control, which involves performing the entire ChIP-seq protocol using an antibody that does not recognize any of the proteins of interest in the sample.
6. A number of artifacts tend to generate pileups of reads that could be interpreted as false-positive peaks. These include:
Open chromatin regions, which are fragmented more easily than closed regions because the DNA is more accessible
The presence of repetitive sequences
An uneven distribution of sequence reads across the genome due to DNA composition
‘Hyper-ChIPable’ regions: loci that are commonly enriched in ChIP datasets. Certain genomic regions are more susceptible to immunoprecipitation and therefore show increased ChIP signals for unrelated DNA-binding and chromatin-binding proteins.
Single-end reads are sufficient in most cases. Paired-end is good (and necessary) for allele-specific chromatin events, and investigations of transposable elements. Sequence the input controls to equal or higher depth than your ChIP samples.
A minimum of 40M total read depth; more is better for detecting some histone marks
During the PCR amplification step of library preparation, some DNA fragments may be over-amplified, resulting in multiple identical copies of the same fragment. These PCR duplicates can bias the estimation of the true fragment frequency and affect the accuracy of peak calling and differential binding analysis.
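The idea of duplicate removal can be illustrated with a minimal sketch. It assumes single-end reads and treats reads that share chromosome, strand, and 5' position as PCR duplicates, keeping the first one seen. The `AlignedRead` type and the function names are hypothetical and exist only for this illustration; dedicated tools handle paired-end reads, base qualities, and optical duplicates as well.

```python
from typing import Iterable, List, NamedTuple


class AlignedRead(NamedTuple):
    """Minimal stand-in for an aligned single-end read (hypothetical type)."""
    name: str
    chrom: str
    start: int    # 0-based leftmost aligned position
    strand: str   # "+" or "-"
    length: int   # aligned length in bases


def five_prime(read: AlignedRead) -> int:
    """Return the 5' coordinate of the read, which depends on the strand."""
    return read.start if read.strand == "+" else read.start + read.length


def remove_duplicates(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Keep the first read seen for each (chrom, 5' position, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, five_prime(read), read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    AlignedRead("r1", "chr1", 100, "+", 50),
    AlignedRead("r2", "chr1", 100, "+", 50),  # same key as r1, so a duplicate
    AlignedRead("r3", "chr1", 100, "-", 50),  # opposite strand, so kept
    AlignedRead("r4", "chr2", 100, "+", 50),  # different chromosome, so kept
]
unique_reads = remove_duplicates(reads)
print([r.name for r in unique_reads])
```

Keeping one read per position is a conservative choice: it can also discard genuine independent fragments at very strongly enriched sites.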
Overrepresented sequences are sequences that are found in high abundance in a ChIP-seq dataset. These sequences can arise from a variety of sources, such as sequencing adapters, PCR duplicates, or genomic regions with high GC content.
Overrepresented sequences are an important quality metric in ChIP-seq preprocessing because they can indicate potential issues with the sequencing library, such as poor sequencing quality, contamination, or bias. High levels of overrepresented sequences can lead to reduced sequencing depth, false positive peaks, and decreased sensitivity and specificity of peak calling algorithms.
Identifying and removing overrepresented sequences is an important step in ChIP-seq preprocessing to ensure the accuracy and reliability of downstream analysis. This can be done using bioinformatics tools that detect and filter out sequences that exceed a certain threshold of frequency or similarity to known contaminants or artifacts.
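As a rough sketch of frequency-based detection, the snippet below counts identical read sequences and reports those that exceed a chosen fraction of all reads. The function name and the default threshold of 0.1% are assumptions made for this example, and the toy data uses a much higher threshold so that the result is easy to follow. The sequence AGATCGGAAGAGC is the start of a common Illumina adapter and stands in for a contaminant.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(sequences: Iterable[str],
                    min_fraction: float = 0.001) -> Dict[str, float]:
    """Return sequences whose share of all reads exceeds min_fraction."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        seq: n / total
        for seq, n in counts.items()
        if n / total > min_fraction
    }


# Toy data: 3 of 10 reads are the same adapter-like sequence.
toy_reads = ["AGATCGGAAGAGC"] * 3 + [
    "ACGTACGT", "TTGACCAT", "GGCATTCA", "CCATGGTA",
    "TAGCTAGC", "GATTACAG", "CTGAATCC",
]
flagged = overrepresented(toy_reads, min_fraction=0.2)
print(flagged)
```

Flagged sequences still need interpretation: an adapter points to a trimming problem, whereas a repeated genomic sequence may reflect duplication or a repetitive region.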
1. The alignment step in ChIP-seq (chromatin immunoprecipitation sequencing) is the process of mapping the sequencing reads generated from the ChIP and control libraries to a reference genome or transcriptome. The goal of the alignment step is to assign each read to its original genomic location with high accuracy and specificity, so that the genomic regions with significant binding enrichment can be identified and analyzed.
The alignment step typically involves several sub-steps, including read quality control, adapter trimming, sequence alignment, and read sorting and indexing. Different software tools and algorithms can be used for these sub-steps, depending on the type and quality of the sequencing data, the genome or transcriptome of interest, and the specific research questions.
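The alignment, sorting, and indexing sub-steps might be chained as in the sketch below, which only builds the command strings and does not run them. It assumes single-end reads, BWA-MEM as the aligner, and SAMtools for sorting and indexing; the file names and the helper function are hypothetical.

```python
import shlex
from typing import List


def alignment_commands(reference: str, fastq: str, prefix: str,
                       threads: int = 4) -> List[str]:
    """Build shell commands for indexing, aligning, sorting, and BAM indexing."""
    ref = shlex.quote(reference)
    reads = shlex.quote(fastq)
    bam = shlex.quote(f"{prefix}.sorted.bam")
    return [
        # Build the BWA index for the reference (needed once per genome).
        f"bwa index {ref}",
        # Align reads and pipe the SAM output straight into a sorted BAM file.
        f"bwa mem -t {threads} {ref} {reads} | samtools sort -o {bam} -",
        # Index the sorted BAM so that regions can be accessed quickly.
        f"samtools index {bam}",
    ]


commands = alignment_commands("genome.fa", "chip.fastq.gz", "chip")
for command in commands:
    print(command)
```

The same commands would be generated for the control library, so that ChIP and control reads are processed identically before peak calling.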
Peak calling is a key step in the analysis of ChIP-seq data. It aims to identify genomic regions with significant enrichment of ChIP-seq signal over the control or background signal. These enriched regions, also called peaks, represent putative binding sites of the protein or factor of interest on the chromatin. Some common peak calling algorithms include MACS (Model-based Analysis of ChIP-Seq), SICER (Spatial Clustering for Identification of ChIP-Enriched Regions), and BayesPeak. Their output typically feeds downstream analysis steps such as peak annotation, motif analysis, and gene ontology enrichment analysis.
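The notion of "significant enrichment over control" can be sketched with a simple Poisson test on fixed windows. This is not the algorithm of MACS, SICER, or any other tool named here; it is a simplified illustration that assumes read counts per window are already available for the ChIP and control libraries. The function names, the significance cutoff, and the pseudocount are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(chip: Sequence[int], control: Sequence[int],
                          alpha: float = 1e-5,
                          pseudocount: float = 1.0) -> List[Tuple[int, float]]:
    """Return (window index, p-value) for windows enriched over control."""
    # Scale control counts to the ChIP library size.
    scale = sum(chip) / max(sum(control), 1)
    hits = []
    for i, (c, b) in enumerate(zip(chip, control)):
        # Expected background count; the pseudocount avoids a zero rate.
        lam = max(b * scale, pseudocount)
        p = poisson_sf(c, lam)
        if p < alpha:
            hits.append((i, p))
    return hits


chip_counts = [5, 4, 40, 6, 5]
control_counts = [5, 6, 4, 5, 5]
hits = call_enriched_windows(chip_counts, control_counts)
print([index for index, _ in hits])
```

Real peak callers add refinements such as local background estimates, fragment-size modeling, and multiple-testing correction.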
BED (Browser Extensible Data) format is a commonly used file format for representing genomic intervals, such as the genomic coordinates of ChIP-seq peaks, gene exons, or genomic variants. The BED file format is tab-delimited, and each line in the file represents a single genomic interval.
A BED file typically contains at least three columns, representing the chromosome name, start position, and end position of the interval. Optionally, additional columns can be included to represent the name of the interval, the strand orientation, and additional metadata such as score, p-value, or functional annotations.
The basic BED format has the following three mandatory columns:
Chromosome: The name of the chromosome or contig where the interval is located.
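Start: The starting position of the interval on the chromosome, using 0-based coordinates (the first base of a chromosome is position 0).
End: The ending position of the interval on the chromosome. The end is not included in the interval, so it equals the 1-based position of the last base.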
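Because the start is 0-based and the end marks the position just past the interval (equivalently, the 1-based position of the last base), the interval length is simply end minus start. The sketch below parses one BED line under that convention; the `BedInterval` type and the peak name are hypothetical.

```python
from typing import NamedTuple, Optional


class BedInterval(NamedTuple):
    chrom: str
    start: int                  # 0-based, inclusive
    end: int                    # exclusive
    name: Optional[str] = None  # optional fourth column

    @property
    def length(self) -> int:
        """Number of bases covered by the interval."""
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three mandatory columns and the optional name column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three tab-separated columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    name = fields[3] if len(fields) > 3 else None
    return BedInterval(chrom, start, end, name)


interval = parse_bed_line("chr1\t999\t1200\tpeak_1\n")
# This interval covers the 1-based positions 1000 through 1200.
print(interval.chrom, interval.start, interval.end, interval.length)
```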
FRiP (Fraction of Reads in Peaks) is a commonly used quality metric for ChIP-seq data analysis. It measures the fraction of aligned reads that fall within peaks, which are genomic regions with a high density of ChIP-seq signal.
The FRiP score is calculated by dividing the number of reads that fall within called peaks by the total number of aligned reads. A high FRiP score indicates that a large proportion of aligned reads are in peaks, suggesting high enrichment of the target protein or histone modification.
FRiP scores are often used to compare the quality of different ChIP-seq experiments. A FRiP score of 20% or more indicates strong enrichment, but many valid experiments score lower: the ENCODE guidelines (Landt et al., 2012) use 1% as the level below which an experiment is examined more closely. The appropriate cutoff depends on the specific biological question, the target, and the type of sample being analyzed.
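The FRiP calculation can be sketched as follows. The example assumes each aligned read is reduced to a single 0-based position, that peaks are given as half-open (start, end) intervals, and that peaks on the same chromosome do not overlap. The function name and input layout are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Iterable, Tuple


def frip(reads: Iterable[Tuple[str, int]],
         peaks: Iterable[Tuple[str, int, int]]) -> float:
    """Fraction of reads whose position falls inside any peak."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    starts = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts[chrom] = [s for s, _ in intervals]

    total = 0
    in_peaks = 0
    for chrom, pos in reads:
        total += 1
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak that starts at or before the read position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [
    ("chr1", 150), ("chr1", 199), ("chr1", 550), ("chr1", 599),  # inside peaks
    ("chr1", 200),  # the end coordinate is exclusive, so this is outside
    ("chr1", 50), ("chr1", 300), ("chr2", 150),                  # outside
]
score = frip(reads, peaks)
print(f"FRiP = {score:.2f}")
```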
In ChIP-seq data analysis, two types of peaks are commonly observed: sharp peaks and broad peaks.
Sharp peaks are typically narrow and well-defined, indicating the precise location of a protein-DNA interaction, such as a transcription factor binding site or a histone modification. Sharp peaks are characterized by a high peak summit and a steep drop-off on either side of the peak summit.
Broad peaks, on the other hand, are wider and more diffuse than sharp peaks, indicating a more extended region of protein-DNA interaction, such as a histone modification that spans a large genomic region. Broad peaks are characterized by a lower peak summit and a more gradual drop-off on either side of the peak summit.
The distinction between sharp and broad peaks is important because different peak calling algorithms may be better suited to identify one type of peak versus the other, and different downstream analyses may be required depending on the type of peak. For example, motif discovery algorithms may be more effective at identifying transcription factor binding motifs within sharp peaks, while functional annotation tools may be better suited to identifying biological pathways associated with broad peaks.
Peak annotation identifies the genomic region and feature that a peak overlaps, such as an exon, an intron, the promoter of a specific gene, or an intergenic region. It also identifies the nearest TSS to the peak, including the distance and the gene.
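Finding the nearest TSS can be sketched with a binary search over sorted TSS positions on one chromosome. This is an illustration of the idea and not HOMER's implementation: it ignores strand, so the sign of the distance only indicates whether the peak lies at a higher or lower coordinate than the TSS. The gene names and the function are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_tss(peak_center: int,
                tss_list: List[Tuple[int, str]]) -> Tuple[str, int]:
    """Return (gene, signed distance) for the TSS closest to peak_center.

    tss_list holds (position, gene) pairs for one chromosome, sorted by position.
    """
    if not tss_list:
        raise ValueError("tss_list must not be empty")
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_center)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda item: abs(peak_center - item[0]))
    return gene, peak_center - pos


tss_chr1 = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
print(nearest_tss(4600, tss_chr1))
print(nearest_tss(9500, tss_chr1))
```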
The peak score in the peak annotation file generated by HOMER is a score assigned to each peak based on the strength of the signal in that region. HOMER uses a statistical model to calculate the peak score, which takes into account the distribution of signal intensity across the genome and the size of the peak.
The peak score is a useful metric for ranking peaks by their strength and for comparing the strength of peaks across different samples. In HOMER, peaks with higher scores are considered to have stronger signals and are more likely to be biologically meaningful.
TSS = transcription start site