Protein-DNA binding: data, tools & models

Below is an annotated list of databases containing TF binding parameters (position-specific weight matrices, binding energies, cooperativity parameters, etc.) and of tools that transform bioinformatic parameters such as weight matrices into biophysical parameters such as binding energies. Our software for calculating TF binding to chromatin is described on a separate page. Also have a look at the separate lists of online tools for nucleosome positioning and epigenetic modifications (less frequently updated). Please feel free to contact me with suggestions/corrections.

Last updated: February 27, 2021

 | TF-binding databases | TFBS prediction (under construction) | Calculating TF affinity | Analyzing ChIP-seq TF data | Other useful numbers 

Protein-DNA binding databases*

*Entries are added in the order “newest first”; there is no ranking.

The redundancy and overlap between various motif model databases complicate downstream analysis and interpretation. Here Jeff Vierstra computed the pairwise similarity for >2,000 motif models determined for both human and mouse TFs and clustered them into 286 distinct motif clusters. 

PAXdb contains whole-genome protein abundance information across organisms and tissues. For example, for human it predicts a proteome size of 20,457 proteins, of which it covers 98%. Described in Wang et al., Proteomics, 2015.

Combines information about DNA-binding proteins (DBPs), RNA-binding proteins (RBPs), and DNA- and RNA-binding proteins (DRBPs). In total, the database records 2.8 million nucleic acid-binding proteins (NBPs) and their binding motifs from 662 NBP families and 2423 species. Described in Leung et al., NAR, 2019.

ChIPSummitDB contains ~4,000 uniformly processed human ChIP-seq data sets and determines the cistrome for 292 TFs together with the distances between the TF binding site (TFBS) centers and the ChIP-seq peak summits. In addition to providing a comprehensive human TFBS collection, the ChIPSummitDB database and web interface allow users to examine the topological arrangements of TF complexes on the DNA. Described in Czipa et al., bioRxiv, 2019.
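
A minimal sketch of the kind of quantity described above: the signed distance between each TFBS center and its nearest ChIP-seq peak summit. The function name and all coordinates are hypothetical illustrations, not part of the ChIPSummitDB pipeline.

```python
def summit_distances(site_intervals, summit_positions):
    """For each TFBS given as (start, end), return the signed distance from
    the site center to the nearest peak summit (site center minus summit)."""
    distances = []
    for start, end in site_intervals:
        center = (start + end) / 2
        nearest = min(summit_positions, key=lambda s: abs(center - s))
        distances.append(center - nearest)
    return distances


# Hypothetical coordinates on a single chromosome.
sites = [(100, 110), (480, 492), (905, 915)]
summits = [103, 500, 900]
print(summit_distances(sites, summits))
```

A narrow distribution of such distances around zero would suggest that the TF sits directly under the summit, while a consistent offset may hint at indirect binding through a partner.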

modERN is an offshoot of the former modENCODE project. This site organizes and provides all the ChIP-seq data files generated for transcription factors in worm and fly for both modENCODE and modERN projects. Currently includes 262 TFs identifying 1.23M sites in the fly genome and 217 TFs identifying 0.67M sites in the worm genome. Described in Kudron et al., Genetics, 2017.

ReMap currently consists of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs). The atlas is available to browse or download either for a given TF or cell line, or for the entire dataset.

JASPAR is an open-access database of curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups.
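
Since PFMs store raw base counts, they are usually converted to log-odds weight matrices before scanning sequences. The following is a minimal sketch of that conversion; the toy matrix, the pseudocount value and the uniform background are assumptions chosen for illustration, not JASPAR defaults.

```python
import math


def pfm_to_pwm(pfm, pseudocount=0.8, background=None):
    """Convert a position frequency matrix (base counts per position)
    into a log2-odds position weight matrix."""
    bases = "ACGT"
    background = background or {b: 0.25 for b in bases}
    length = len(pfm["A"])
    pwm = {b: [] for b in bases}
    for i in range(length):
        total = sum(pfm[b][i] for b in bases)
        for b in bases:
            # Pseudocounts are distributed according to the background,
            # which avoids log(0) for bases never observed at a position.
            freq = (pfm[b][i] + pseudocount * background[b]) / (total + pseudocount)
            pwm[b].append(math.log2(freq / background[b]))
    return pwm


def score(pwm, site):
    """Sum the per-position weights for a site of the same length as the PWM."""
    return sum(pwm[base][i] for i, base in enumerate(site))


# Hypothetical 4-bp motif with 10 observed sites per position.
toy_pfm = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}
toy_pwm = pfm_to_pwm(toy_pfm)
print(score(toy_pwm, "AGCT"), score(toy_pwm, "TTTA"))
```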

The transcription factors in TFcheckpoint are manually checked for experimental evidence supporting their role in 1) regulation of RNA polymerase II and 2) specific DNA binding activity. Described in Tripathi et al, Database, 2016.

A collection of 320,370 TFs from 165 plant species, integrated with plant cis-regulation database PlantRegMap for regulation data, binding site prediction, regulation prediction and functional enrichment analysis. Described in Jin et al., Nucleic Acids Res, 2016.

ePOSSUM uses a Bayes classifier to assess the impact of genetic alterations on TF binding in user-defined sequences. Additionally, ePOSSUM provides information on the reliability of the prediction using the authors' test set of experimentally confirmed binding sites. Described in Hombach et al., BMC Genomics, 2016.

The authors use DNA affinity purification sequencing (DAP-seq), a high-throughput TF binding site discovery method that interrogates genomic DNA with in-vitro-expressed TFs. Using DAP-seq, they defined the Arabidopsis cistrome by resolving motifs and peaks for 529 TFs. Because genomic DNA used in DAP-seq retains 5-methylcytosines, these data suggest that >75% (248/327) of Arabidopsis TFs surveyed were methylation sensitive, a property that strongly impacts the epicistrome landscape. Described in O’Malley et al., Cell, 2016.

TEC provides transcription factor (TF) binding sites and intensities determined for nearly 200 TFs of Escherichia coli. Users can search either for TFs that may regulate specific genes or for target genes regulated by TFs of specific interest, filter the results by binding intensity and/or location, view both a bar chart and a heat map of TF binding, analyze the consensus sequence and download raw data. Described in Ishihama et al., NAR, 2016.

Contains quantitative measurements of combinatorial roles of 812 Drosophila TFs and cofactors in the context of 24 enhancers. Described in Stampfel et al., Nature, 2015.

The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE.

CollecTF compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features.

footprintDB is a database with 2422 unique DNA-binding proteins (mostly transcription factors, TFs), 3662 Position Weight Matrices (PWMs) and 10112 DNA Binding Sites extracted from the literature and other repositories. The binding interfaces of (most) proteins in the database are inferred from the collection of protein-DNA complexes described in 3D-footprint.

AthaMap provides a genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana.

Cistrome is a web portal for exploring ChIP-seq and DNase-seq data. Currently contains human and mouse datasets.

A database of CTCF-binding sites, CTCFBSDB, now contains almost 15 million CTCF-binding sequences in 10 species. It includes integrated CTCF-binding sites with genomic topological domains defined using Hi-C data. Additionally, the updated database includes new features enabled by new CTCF-binding site data, including binding site occupancy and the ability to visualize overlapping CTCF-binding sites determined in separate experiments.

HOCOMOCO contains non-redundant curated binding models for 601 human and 396 mouse TFs. DNA sequences of TF binding regions obtained by both pregenomic and high-throughput methods were collected from existing databases and other public data. The ChIPMunk software was used to construct positional weight matrices. Four motif discovery strategies were tested based on different motif shape priors including flat and periodic priors associated with DNA helix pitch. A quality rating was manually assigned to each model based on known binding preferences. An appropriate TFBS model was selected for each TF, with similar models selected for related TFs.

Factorbook is described in a recent publication: Wang et al. (2012). Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22: 1798–1812.

TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria.

A comprehensive database of 1226 motifs from 11 different sources. The site allows users to search the database with a regulatory site or matrix to identify the TFs most likely to bind the input sequence.

[to be checked later]

FlyTF currently contains 129 proteins for which PWMs are available.

TRANSFAC consists of free and paid sections. The binding sites provided are experimentally validated. Human TF weight matrices may be viewed through the web interface of the UCSC Genome Browser.

KDBI is a collection of experimentally determined kinetic data of protein-protein, protein-RNA, protein-DNA, protein-ligand, RNA-ligand, DNA-ligand binding events described in the literature.

ProNIT currently contains more than 4900 entries. Each entry has the protein and nucleic acid information, experimental conditions and the following binding thermodynamic data: dissociation constant Kd, energies, stoichiometry of binding and activity (Km and kcat).
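
Dissociation constants and binding free energies are related by the standard thermodynamic expression ΔG° = RT ln(Kd / c°), with c° = 1 M. The sketch below applies it; the function names and the example Kd of 1 nM are illustrative assumptions, not values taken from ProNIT.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation
    constant in mol/L, assuming a 1 M standard state."""
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_g_to_kd(delta_g_kcal, temperature_k=298.15):
    """Inverse conversion: dissociation constant (mol/L) from ΔG° (kcal/mol)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


# A hypothetical 1 nM binder at 25 °C.
print(round(kd_to_delta_g(1e-9), 2))
```

At 25 °C each tenfold change in Kd corresponds to roughly 1.36 kcal/mol, which is a convenient rule of thumb when comparing entries measured under similar conditions.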

UniPROBE contains data on the preferences of proteins for all possible sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database currently hosts DNA binding data for 391 nonredundant proteins (individual proteins or in some cases heterodimers) from a diverse collection of organisms.
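
To illustrate what "all possible k-mers" means in practice, the sketch below enumerates every DNA word of length k. It assumes, as is common for double-stranded DNA, that a word and its reverse complement are counted as one; this convention and the function names are assumptions for illustration rather than a description of UniPROBE's file formats.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(k):
    """All k-mers, keeping one representative (the alphabetically smaller)
    of each k-mer / reverse-complement pair."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        seen.add(min(kmer, reverse_complement(kmer)))
    return sorted(seen)


print(len(canonical_kmers(8)))
```

For k = 8 this yields 32,896 words instead of 4^8 = 65,536, because every non-palindromic word is merged with its reverse complement.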

This is a personal collection. Currently contains ~50 matrices (Last checked: 06.10.2010).


TF binding site prediction (*section under construction)

Predicting TF affinities for DNA binding

SemanticBI is a convolutional neural network (CNN)-recurrent neural network (RNN) architecture model that was trained on an ensemble of protein binding microarray data sets that covered multiple TFs (trained on DREAM5 PBM data sets). Described in Quan et al., 2021.

TFaffinity is MATLAB code that calculates TF-DNA binding affinities using the TRAP algorithm. It is described in the article by Wiehle et al., 2019.

ChIPanalyser is an R package that calculates ChIP-seq-like profiles based on a statistical thermodynamic framework. The model relies on four considerations: TF binding sites can be scored using a position weight matrix (PWM); DNA accessibility plays a role in TF binding; binding profiles depend on the number of TF molecules bound to DNA; and binding energy (another way of describing PWMs), or binding specificity, should be modulated (hence the introduction of a binding specificity modulator). ChIPanalyser produces profiles that simulate real ChIP-seq profiles and provides accuracy measurements for these predicted profiles by comparing them to real ChIP-seq data. Described in Martin and Zabet, 2019.

The DeepBind algorithm is based on convolutional neural networks and can discover new patterns even when the locations of patterns within sequences are unknown. For training, DeepBind uses a set of sequences and, for each sequence, an experimentally determined binding score. Sequences can have varying lengths, and binding scores can be real-valued measurements or binary class labels. The authors (Alipanahi et al., 2015) claimed that this algorithm outperforms all 26 existing methods for protein-DNA specificity prediction previously compared by Weirauch et al., 2013. This is a standalone application, available for Windows and Linux.
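
Why a convolutional model can handle sequences of varying length and patterns at unknown positions can be seen from a generic convolution, rectification and max-pooling step, sketched below. This is not DeepBind's code: it uses a single hand-set filter with made-up weights and no training, purely to illustrate the idea.

```python
def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1.0 if index[base] == j else 0.0 for j in range(4)] for base in seq]


def conv_relu_maxpool(seq, motif_filter, bias=0.0):
    """Slide one filter along the sequence, rectify each response and
    return the maximum, so the result is independent of motif position.
    The sequence must be at least as long as the filter."""
    x = one_hot(seq)
    width = len(motif_filter)
    activations = []
    for start in range(len(x) - width + 1):
        response = bias + sum(
            motif_filter[i][j] * x[start + i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(0.0, response))  # rectification (ReLU)
    return max(activations)  # max pooling over all positions


# Hypothetical filter that rewards the word "TGA" (columns: A, C, G, T).
tga_filter = [
    [-1.0, -1.0, -1.0, 1.0],
    [-1.0, -1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0, -1.0],
]
print(conv_relu_maxpool("CCTGACC", tga_filter))
print(conv_relu_maxpool("TGACCCCCCCC", tga_filter))
```

Both calls give the same score although the sequences differ in length and in where the word occurs, because pooling discards the position of the best match.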

BayesPI-BAR (Bayesian method for Protein-DNA Interaction with Binding Affinity Ranking) uses biophysical modeling of protein-DNA interaction to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). It includes TF chemical potentials or protein concentrations, and direct TF binding targets as input. The authors claimed that the method compares favorably to existing programs such as sTRAP and is-rSNP, when evaluated on the same SNPs. The method is described here.

A web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct (the sequence) and several indirect readout features of protein-DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). According to the authors, this algorithm significantly outperforms state-of-the-art approaches for in silico identification of TFBS. Users can submit FASTA sequences for analysis.

TRAP calculates binding affinity based on the matrix description of a given TF and a set of DNA sequences to be annotated (input). It requires the specification of two biophysically-motivated parameters. The freely available program code is written in C. Further details are available in the paper by Roider et al., 2007.
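
The general idea of turning a count matrix into a sequence-level affinity can be sketched as follows. This is a simplified illustration in the spirit of such biophysical models, not the TRAP code: the mismatch-energy definition, the parameter names `lam` and `r0`, their values, and the restriction to the forward strand are all assumptions made for this example.

```python
import math


def mismatch_energy(counts, site, lam, pseudocount=1.0):
    """Energy of a site relative to the consensus, in units of kT:
    sum over positions of ln(max count / count of observed base), scaled by 1/lam."""
    energy = 0.0
    for i, base in enumerate(site):
        column = {b: counts[b][i] + pseudocount for b in "ACGT"}
        energy += math.log(max(column.values()) / column[base])
    return energy / lam


def expected_occupancy(counts, sequence, lam=0.7, r0=1.0):
    """Sum of per-site binding probabilities along the forward strand.
    Each site is bound with probability w / (1 + w), w = r0 * exp(-energy)."""
    width = len(counts["A"])
    total = 0.0
    for start in range(len(sequence) - width + 1):
        energy = mismatch_energy(counts, sequence[start:start + width], lam)
        weight = r0 * math.exp(-energy)
        total += weight / (1.0 + weight)
    return total


# Hypothetical 4-bp count matrix with consensus AGCT.
toy_counts = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}
print(expected_occupancy(toy_counts, "TTAGCTTT"))
print(expected_occupancy(toy_counts, "TTTTTTTT"))
```

Unlike a thresholded PWM scan, this kind of score lets many weak sites contribute to the total, which is the main motivation for affinity-based annotation of regulatory regions.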

STAP uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-chip or ChIP-seq data. The program assumes that the measured affinity of a sequence to a TF (TF_exp) in some ChIP-chip or ChIP-seq experiment is determined by: 1) the number and strength of binding sites of TF_exp in this sequence; 2) the presence of other sites that may interact cooperatively with the sites of TF_exp in the neighborhood. Specifically, it takes as input a set of DNA sequences, their binding affinities to some TF as measured by experiments (TF_exp), and the position weight matrices (PWMs) of a set of TFs, including TF_exp. It then learns the relevant parameters of the biophysical model, including those of TF-DNA interaction and those of TF-TF cooperative interactions.

The input to MatrixREDUCE is a sequence file in FASTA format and an expression data file in tab-delimited text format (missing values are allowed). Output data include PSAMs in numeric and graphical format, parameters of the fitted model, and an HTML summary page.

BayesPI integrates Bayesian model regularization with biophysical modeling of protein-DNA interactions and nucleosome positioning to study protein-DNA interactions, using a high-throughput dataset.

The scoring function calibrated against crystallographic data on protein-DNA contacts can recover PWMs, sometimes outperforming experimental PWMs.



ChIP-seq TF binding analysis

(*for histone ChIP-seq, see here)

PscanChIP is a web application that, given a set of genomic regions derived from a genome-wide ChIP-seq experiment, scans them for overrepresented sequence motifs, using motif descriptors from the TRANSFAC and JASPAR databases or descriptors uploaded by users. The overrepresented motifs thus correspond to transcription factor binding sites that are enriched in the regions themselves. The general idea is to assess which motif is most likely to represent the binding specificity of the TF investigated, but also to identify “secondary” motifs which might correspond to other TFs interacting with the one for which the ChIP experiment was performed.

Whole-Genome rVISTA enables users to query databases containing pre-computed genome coordinates of evolutionarily conserved transcription factor binding sites in the proximal promoters (from 100 bp up to 5 kb upstream) of the human, mouse and Drosophila genomes. TF binding sites are based on position weight matrices from the TRANSFAC Professional database. Results are exported in .bed format for rapid visualization in the UCSC Genome Browser. Flat files of mapped conserved sites and their genomic coordinates are also available for analysis with stand-alone software.
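
For readers who want to produce comparable .bed tracks from their own site predictions, here is a minimal sketch of writing one BED line. BED uses 0-based start and exclusive end coordinates; the function name, the 1-based input convention and the example site are assumptions for illustration, not rVISTA output.

```python
def to_bed_line(chrom, start_1based, end_1based, name, score=0, strand="+"):
    """Format a site given in 1-based inclusive coordinates as a
    tab-separated 6-column BED line (0-based start, exclusive end)."""
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


# Hypothetical conserved site spanning bases 101 to 110 of chr1.
print(to_bed_line("chr1", 101, 110, "example_site"))
```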
