ChIP-seq analysis notes from Ming Tang
Data downloaded from GEO usually are raw fastq files. One needs to do quality control (QC) on them.
Be careful with the peaks you get:
Active promoters give rise to false positive ‘Phantom Peaks’ in ChIP-seq experiments
It is good to have controls for your ChIP-seq experiments. A DNA input control (no antibody is applied) is prefered. The IgG control is also fine, but because so little DNA is there, you might get many duplicated reads due to PCR artifact.
For cancer cells, an input control can be used to correct for copy-number bias.
A quote from Tao Liu: who develped MACS1/2
I remember in a PloS One paper last year by Elizabeth G. Wilbanks et al., authors pointed out the best way to sort results in MACS is by -10*log10(pvalue) then fold enrichment. I agree with them. You don't have to worry about FDR too much if your input data are far more than ChIP data. MACS1.4 calculates FDR by swapping samples, so if your input signal has some strong bias somewhere in the genome, your FDR result would be bad. Bad FDR may mean something but it's just secondary.
The most popular peak caller by Tao Liu: MACS2. Now
--broadflag supports broad peaks calling as well.
SICER for broad histone modification ChIP-seq
HOMER can also used to call Transcription factor ChIP-seq peaks and histone modification ChIP-seq peaks.
Ritornello: High fidelity control-free chip-seq peak calling. No input is required!
Tumor samples are heterogeneous containing different cell types. MixChIP: a probabilistic method for cell type specific protein-DNA binding analysis
epic: diffuse domain ChIP-Seq caller based on SICER. It is a re-writen of SICER for faster processing using more CPUs. (Will try it for broad peak for sure). epic2 paper is out https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz232/5421513?redirectedFrom=fulltext
Cistrome: The best place for wet lab scientist to check the binding sites. Developed by Shierly Liu lab in Harvard.
ChIP-Atlas is an integrative and comprehensive database for visualizing and making use of public ChIP-seq data. ChIP-Atlas covers almost all public ChIP-seq data submitted to the SRA (Sequence Read Archives) in NCBI, DDBJ, or ENA, and is based on over 78,000 experiments.
A map of direct TF-DNA interactions in the human genome UniBind is a comprehensive map of direct interactions between transcription factor (TFs) and DNA. High confidence TF binding site predictions were obtained from uniform processing of thousands of ChIP-seq data sets using the ChIP-eat software.
SUPERmerge:ChIP-seq coverage island analysis algorithm for broad histone marks
PeakRanger heard that it is good for broad peaks of H3K9me3 and H3K27me3.
Different parameters using the same program can produce drastic different sets of peaks especially for histone modifications with variable enrichment length and gaps between peaks. One needs to make a valid argument for parameters he uses
An example of different parameters for homer
A significant proportion of transcription-factor binding sites may be nonfunctional A post from Judge Starling
Several papers have shown that changes of adjacent TF binding poorly correlates with gene expression change:
Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression
Transcription Factors Bind Thousands of Active and Inactive Regions in theDrosophila Blastoderm
" On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. "
We analyzed the dependence of the ChIP signal on the duration of formaldehyde cross-linking time for two proteins: DNA topoisomerase 1 (Top1) that is functionally associated with the double helix in vivo, especially with active chromatin, and green fluorescent protein (GFP) that has no known bona fide interactions with DNA. With short time of formaldehyde fixation, only Top1 immunoprecipation efficiently recovered DNA from active promoters, whereas prolonged fixation augmented non-specific recovery of GFP dramatizing the need to optimize ChIP protocols to minimize the time of cross-linking, especially for abundant nuclear proteins. Thus, ChIP is a powerful approach to study the localization of protein on the genome when care is taken to manage potential artifacts.
The Gene Ontology Handbook Read it for basics for GO.
ChromHMM from Manolis Kellis in MIT.
In ChromHMM the raw reads are assigned to non-overlapping bins of 200 bps and a sample-specific threshold is used to transform the count data to binary values
Segway from Hoffman lab. Base pair resolution. Takes longer time to run.
epicseg published 2015 in genome biology. Similiar speed with ChromHMM.
Spectacle: fast chromatin state annotation using spectral learning. Also published 2015 in genome biology.
chromstaR: Tracking combinatorial chromatin state dynamics in space and time
epilogos visualization and analysis of chromatin state model data.
StatePaintR StateHub-StatePaintR: rules-based chromatin state annotations.
[IDEAS(https://github.com/yuzhang123/IDEAS/): an integrative and discriminative epigenome annotation system http://sites.stat.psu.edu/~yzz2/IDEAS/
Most of the software for ChIP annotation doesn't considered this issue when annotating peak (0-based) to transcript (1-based). To my knowledge, only HOMER consider this issue. After I figure this out, I have updated ChIPseeker (version >= 1.4.3) to fix the issue.
Bioconductor package ChIPpeakAnno.
annotatr Annotation of Genomic Regions to Genomic Annotations.
geneXtendeR computes optimal gene extensions tailored to the broadness of the specific epigenetic mark (e.g., H3K9me1, H3K27me3), as determined by a user-supplied ChIP-seq peak input file. As such, geneXtender maximizes the signal-to-noise ratio of locating genes closest to and directly under peaks
Look at a post and here describing different tools. A review paper A comprehensive comparison of tools for differential ChIP-seq analysis
PePr. It can also call peaks.
diffbind bioconductor package. Internally uses RNA-seq tools: EdgR or DESeq. Most likely, I will use this tool.
ChIPComp. Very little tutorial. Now it is on bioconductor.
chromDiff. Also from from Manolis Kellis in MIT. Similar with ChromHMM, documentation is not that detailed. Will have a try on this.
For TF ChIP-seq, one can usually find the summit of the peak (macs14 will report the summit), and extend the summit to both sides to 100bp-500bp. One can then use those 100bp-500 bp small regions to do motif analysis. Usually, oen should find the motif for the ChIPed TF in the ChIP-seq experiment if it is a DNA binding protein.
It is trickier to do motif analysis using histone modification ChIP-seq. For example, the average peak size of H3K27ac is 2~3 kb. If one wants to find TF binding motifs from H3K27ac ChIP-seq data, it is good to narrow down the region a bit. MEME and many other motif finding tools require that the DNA sequence length to be small (~500bp). One way is to use
findPeaksin homer turning on
-nfr(nucleosome free region) flag, and then do motif analysis in those regions.
suggestions for finding motifs from histone modification ChIP-seq data from HOMER page:
Since you are looking at a region, you do not necessarily want to center the peak on the specific position with the highest tag density, which may be at the edge of the region. Besides, in the case of histone modifications at enhancers, the highest signal will usually be found on nucleosomes surrounding the center of the enhancer, which is where the functional sequences and transcription factor binding sites reside. Consider H3K4me marks surrounding distal PU.1 transcription factor peaks. Typically, adding the -center >option moves peaks further away from the functional sequence in these scenarios.
Other strategy similar to
-nfrwas developed in this paper: Dissecting neural differentiation regulatory networks through epigenetic footprinting. In the method part of the paper, the authors computed a depletion score within the peaks, and use the footprinted regions to do motif analysis. (Thanks kadir for pointing out the paper)
Region Size ("-size <#>", "-size <#>,<#>", "-size given", default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea (i.e. H3K4me or H3/H4 acetylated regions). In theory, HOMER can work with very large regions (i.e. 10kb), but with the larger the regions comes more sequence and longer execution time. These regions will be based off the center of the peaks. If you prefer an offset, you can specify "-size -300,100" to search a region of size 400 that is centered 100 bp upstream of the peak center (useful if doing motif finding on putative TSS regions). If you have variable length regions, use the option "-size given" and HOMER will use the exact regions that were used as input.
I just found PARE. PARE is a computational method to Predict Active Regulatory Elements, specifically enhancers and promoters. H3K27ac and H3K4me can be used to define active enhancers.
The fancy "supper-enhancer" term was first introduced by Richard Young in Whitehead Institute. Basically, super-enhancers are enhancers that span large genomic regions(~12.5kb). The concept of super-enhancer is not new. One of the most famous example is the Locus Control Region (LCR) that controls the globin gene expression, and this has been known for decades.
A review in Nature Genetics What are super-enhancers?
By generating a series of mouse models, deleting each of the five regulatory elements of the α-globin super-enhancer individually and in informative combinations, we demonstrate that each constituent enhancer seems to act independently and in an additive fashion with respect to hematological phenotype, gene expression, chromatin structure and chromosome conformation, without clear evidence of synergistic or higher-order effects.
paper: Hierarchy within the mammary STAT5-driven Wap super-enhancer
paper: Enhancers and super-enhancers have an equivalent regulatory role in embryonic stem cells through regulation of single or multiple genes
From the HOMER page How finding super enhancers works:
Super enhancer discovery in HOMER emulates the original strategy used by the Young lab. First, peaks are found just like any other ChIP-Seq data set. Then, peaks found within a given distance are 'stitched' together into larger regions (by default this is set at 12.5 kb). The super enhancer signal of each of these regions is then determined by the total normalized number reads minus the number of normalized reads in the input. These regions are then sorted by their score, normalized to the highest score and the number of putative enhancer regions, and then super enhancers are identified as regions past the point where the slope is greater than 1.
Example of a super enhancer plot:
In the plot above, all of the peaks past 0.95 or so would be considered "super enhancers", while the one's below would be "typical" enhancers. If the slope threshold of 1 seems arbitrary to you, well... it is! This part is probably the 'weakest link' in the super enhancer definition. However, the concept is still very useful. Please keep in mind that most enhancers probably fall on a continuum between typical and super enhancer status, so don't bother fighting over the precise number of super enhancers in a given sample and instead look for useful trends in the data.
Using ROSE from Young lab
ROSE: RANK ORDERING OF SUPER-ENHANCERS
**imPROSE - Integrated Methods for Prediction of Super-Enhancers
CREAM (Clustering of Functional Regions Analysis Method) is a new method for identification of clusters of functional regions (COREs) within chromosomes. published in Genome Research by Mathieu Lupien group. paper: Identifying clusters of cis-regulatory elements underpinning TAD structures and lineage-specific regulatory networks.
bedtools my all-time favorite tool from Araon Quinlan' lab. Great documentation! pyBedGraph: a Python package for fast operations on 1-dimensional genomic signal tracks. pyBigwig Hosting bigWig for UCSC visualization
My first play with GRO-seq data, from sam to bedgraph for visualization
convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB). megadept is pretty fast, can access bigWig files from the web, works on macOS, Linux & Windows, plus is also available via @Bioconductor http://www.bioconductor.org/packages/release/bioc/html/megadepth.html which makes easy to use it in #rstats. For example, for quantifying expression of custom regions from recount3 data
The genomic association tester (GAT)
poverlap from Brent Pedersen. Now he is working with Aaron Quinlan at university of Utah.
Genometric Correlation (GenometriCorr): an R package for spatial correlation of genome-wide interval datasets
Location overlap analysis for enrichment of genomic ranges bioconductor package.
regioneR Association analysis of genomic regions based on permutation tests similaRpeak: Metrics to estimate a level of similarity between two ChIP-Seq profiles
Beta from Shirley Liu's lab in Harvard. Tao Liu's previous lab.
Many papers draw meta-plot and heatmap on certain genomic regions (2kb around TSS, genebody etc) using ChIP-seq data.
See an example from the ngs.plot:
you can also draw heatmaps using R. just count (using either Homer or bedtools) the ChIP-seq reads in each bin and draw with heatmap.2 function. here and here. Those are my pretty old blog posts, I now have a much better idea on how to make those graphs from scratch.
You can also use bioconductor Genomation. It is very versatile.
EnrichedHeatmaps from Zuguang Gu based on his own package
ComplexHeatmaps. This is now my default go-to because of the flexiability of the package and the great user support. Thx!
Fluff is a Python package that contains several scripts to produce pretty, publication-quality figures for next-generation sequencing experiments I just found it 09/01/2016. looks promising especially for identifying the dynamic change.
One cavet is that the meta-plot (on the left) is an average view of ChIP-seq tag enrichment and may not reflect the real biological meaning for individual cases.
See a post from Lior Patcher How to average genome-wide data
I replied the post:
for ChIP-seq, in addition to the average plot, a heatmap that with each region in each row should make it more clear to compare (although not quantitatively). a box-plot (or a histogram) is better in this case . I am really uncomfortable averaging the signal, as a single value (mean) is not a good description of the distribution.
By Meromit Singer:
thanks for the paper ref! Indeed, an additional important issue with averaging is that one could be looking at the aggregation of several (possibly very distinct) clusters. Another thing we should all keep in mind if we choose to make such plots..
A paper from Genome Research Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements
Copy number information from targeted sequencing using off-target reads bioconductor CopywriteR package.
3CPET: Finding Co-factor Complexes in Chia-PET experiment using a Hierarchical Dirichlet Process
We describe Cleavage Under Targets and Release Using Nuclease (CUT&RUN), a chromatin profiling strategy in which antibody-targeted controlled cleavage by micrococcal nuclease releases specific protein-DNA complexes into the supernatant for paired-end DNA sequencing another cut&run method. maybe useful for scChIP-seq?
Calling Cards enable multiplexed identification of the genomic targets of DNA-binding proteins this is potentially can work with single cells.
Ultra-parallel ChIP-seq by barcoding of intact nuclei as low as 1000 cells.
CUT&Tag Data Processing and Analysis Tutorial protocols.io link https://www.protocols.io/view/cut-amp-tag-data-processing-and-analysis-tutorial-bjk2kkye
CoBATCH for high-throughput single-cell epigenomic profiling Protein A in fusion to Tn5 transposase is enriched through specific antibodies to genomic regions and Tn5 generates indexed chromatin fragments ready for the library preparation and sequencing.
Some may notice that the peaks produced look both like peaks produced from the TF ChIP-seq pipeline as well as the histone ChIP-seq pipeline. This is intentional, as ATAC-seq data looks both like TF data (narrow peaks of signal) as well as histone data (broader regions of openness).
From Caleb, the author of hichipper https://twitter.com/CalebLareau/status/1098312702651523077 thx!
In HiChIP data analyses, there are two primary problems that we are trying to solve. A) Which anchors (i.e. genomic loci) should be used as a feature set and B) which loops (i.e. interactions between pairs of loci) are important in the data. 2/n
Depending on what you are hoping to use your data for, there are a variety of ways to think about anchors and loops. Two uses of HiChIP that come to mind are "which gene is this enhancer talking to" and "which loops are differential between my celltype/condition of interest" 3/n
When Martin and I wrote hichipper, we envisioned the second question being more used (i.e. building out a framework for differential loop calling), so we wanted a pre-processing pipeline that was as inclusive of potential loops as possible that could be subsetted downstream 4/n
To these ends, we reported an improved version of anchor detection from HiChIP data by modeling the restriction enzyme cut bias explicitly, which helped identify high-quality anchors from the data itself 5/n
(we achieve this by re-parametrizing MACS2 peak calling by essentially fitting a loess curve to the data in the previous picture) 6/n
Unfortunately, based on user feedback, this modified background winds up with a very, very conservative peak calling if the library preparations are sub-par. Thus, the safest way to approach HiChIP data analyses is often to use a pre-defined anchor set 7/n
These can be from either a complementary ATAC-seq or ChIP-seq dataset for the conditions that you are interested in. From what I've seen, you can supply a bed file to hichipper or other tools directly. Hichipper does some other modifications by default to this bed file FYI 8/n
In terms of the second problem of identifying loops, hichipper didn't make any revolutionary progress. We recommend some level of CPM-based filtering + mango FDR calculation (implemented in hichipper) for identifying single-library significant loops. 9/n
Where I've personally done the most is getting multiple libraries from multiple conditions and using some sort of between-replicate logic to filter to a reasonable (~10,000-20,000) number of loops ( see e.g. https://github.com/caleblareau/k562-hichip …) 10/n
Other tools (that I admittedly have not tried) use a variety of statistical techniques to (probably more intelligently from what I can tell) merge anchors or filter loops for analyses. A brief run down of those that I'm aware of (not exhaustive)-- 11/n
MAPS (https://www.biorxiv.org/content/biorxiv/early/2018/09/08/411835.full.pdf …) uses a measure of reproducibility with ChIP-seq to define a normalization and significance basis for loop calling. Given HiChIP-specific restriction enzyme bias, this seems sensible 12/n
FitHiChIP (https://www.biorxiv.org/content/early/2018/10/29/376194.full.pdf …) provides automatic merging of nearby anchors to solve the "hairball" problem, which is clearly shown Fig. 1. When I compared hichipper to FitHiC, the bias regression seemed to perform well, but I ran into memory issues which high... 13/n
resolution (i.e. ~2.5kb) HiChIP data, which the authors have apparently solved in FitHiChIP. 14/n
Additionally, there is CID, which uses a density-based method to further collapse anchors to solve the "hairball" problem. 15/n
There are certainly other tools out there, but from my experience, any of these four (hichipper, MAPS, FitHiChIP, and CID) will probably give you something sensible (again acknowledging that I myself haven't actually run these other 3 tools) 16/n
And if you're still reading this, I'll be a bit more specific about how I view hichipper pros/cons from both my own use and others in the community: hichipper provides the most "vanilla" functionality to given sensible yet exhaustive anchors and loops. 17/n
I prefer it this way because I find that for each data set, I have to apply variable downstream threshold and cutoffs because the assay is so variable depending on which experimentalist performs the protocol and the biological question often varies so much 18/n
This may be a negative for individuals new to bioinformatics or HiChIP data but seemingly a positive for someone more experienced in working with related data. It's not obvious to me which other tools may be more applicable to a novice 19/n
Hope this helps paint a picture-- do let me know what you find if you compare tools! I think that it would be useful for the community. 20/20