Protein-DNA binding: data, tools & models

Below is an annotated list of databases containing experimental TF binding sites and TF binding parameters (position-specific weight matrices, binding energies, cooperativity parameters, etc.), together with tools for converting bioinformatic parameters such as weight matrices into biophysical parameters such as binding energies. Please feel free to contact me with suggestions or corrections.

Last updated: 11/11/2018

Protein-DNA binding databases*

*Entries are listed newest first; the order does not imply any ranking.

| TF-binding databases | Calculate TF affinity | Analyze ChIP-seq TF data | Other useful numbers |

ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. Contains ChIP-seq and DNase-seq data (n > 70,000) derived from six representative model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast). ChIP-Atlas is able to show alignment and peak-call results for public ChIP-seq and DNase-seq data archived in SRA. The integrated data can be further analyzed to show TR-gene and TR-TR interactions, as well as to examine enrichment of protein binding for given multiple genomic coordinates or gene names. Described in Oki et al., EMBO Reports, 2018.

MeDReaders: a database of transcription factors that bind to methylated DNA. A manually curated database, currently consisting of 731 TFs that can bind to methylated DNA sequences in human and mouse, based on ChIP-seq studies reported in the literature. Users can download BED files separating low-methylation and high-methylation binding sites. Described in Wang et al., Nucleic Acids Res., 2018.

TFClass: classification of human transcription factors and their mammalian orthologs. More than 39 000 TFs from up to 41 mammalian species are assigned to the Superclasses, Classes, Families and Subfamilies of TFClass. The database provides the corresponding sequence collection in FASTA format, sequence logos and phylogenetic trees at different classification levels, predicted TF binding sites for the human, mouse, dog and cow genomes, and links to external databases. In particular, all TFs that are also documented in the TRANSFAC® database (FACTOR table) have been linked and can be freely accessed. Described in Wingender et al., Nucleic Acids Res., 2018.

modERN — model organism Encyclopedia of Regulatory Networks (worm and fly). modERN is an offshoot of the former modENCODE project. This site organizes and provides all the ChIP-seq data files generated for transcription factors in worm and fly for both modENCODE and modERN projects. Currently includes 262 TFs identifying 1.23M sites in the fly genome and 217 TFs identifying 0.67M sites in the worm genome. Described in Kudron et al., Genetics, 2017.

ReMap – a database of ChIP-seq peaks from all publicly available datasets in human. ReMap currently consists of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs). The atlas is available to browse or download either for a given TF or cell line, or for the entire dataset.

JASPAR – a database of transcription factor binding profiles (updated version, JASPAR 2018). JASPAR is an open-access database of curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups.
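A position frequency matrix of the kind stored in JASPAR is usually converted to log-odds scores before it is used to scan sequences. The following is a minimal Python sketch of that conversion, not JASPAR's own code: the function names, the toy count matrix, the uniform background and the pseudocount of 1 are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    pfm: dict mapping each base to an equal-length list of counts.
    background: dict of background base frequencies (assumed uniform if None).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount
        for b in BASES:
            # The pseudocount is shared out in proportion to the background
            # so that zero counts do not produce log(0).
            freq = (pfm[b][i] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(freq / background[b]))
    return pwm

def score_site(pwm, site):
    """Sum the log-odds scores of a site with the same length as the matrix."""
    return sum(pwm[base][i] for i, base in enumerate(site))

# Hypothetical 4-bp count matrix: three informative positions, one uninformative.
toy_pfm = {
    "A": [8, 0, 0, 2],
    "C": [0, 8, 0, 2],
    "G": [0, 0, 8, 2],
    "T": [0, 0, 0, 2],
}
toy_pwm = pfm_to_pwm(toy_pfm)
```

With this toy matrix, `score_site(toy_pwm, "ACGA")` is larger than `score_site(toy_pwm, "TTTA")`, and the last column contributes a score of zero for every base because its counts match the background.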

TFcheckpoint – manually curated TF database for human, mouse and rat. The transcription factors in TFcheckpoint are manually checked for experimental evidence supporting their role in 1) regulation of RNA polymerase II and 2) specific DNA binding activity. Described in Tripathi et al., Database, 2016.

PlantTFDB: Plant Transcription Factor Database. A collection of 320 370 TFs from 165 plant species, integrated with plant cis-regulation database PlantRegMap for regulation data, binding site prediction, regulation prediction and functional enrichment analysis. Described in Jin et al., Nucleic Acids Res, 2016.

ePOSSUM: analyzes human variants to find known TF binding sites. ePOSSUM uses a Bayes classifier to assess the impact of genetic alterations on TF binding in user-defined sequences. Additionally, ePOSSUM provides information on the reliability of the prediction, based on the authors' test set of experimentally confirmed binding sites. Described in Hombach et al., BMC Genomics, 2016.

DAP-seq: Base-pair resolution atlases of the plant cistrome and epicistrome. The authors use DNA affinity purification sequencing (DAP-seq), a high-throughput TF binding site discovery method that interrogates genomic DNA with in-vitro-expressed TFs. Using DAP-seq, they defined the Arabidopsis cistrome by resolving motifs and peaks for 529 TFs. Because genomic DNA used in DAP-seq retains 5-methylcytosines, these data suggest that >75% (248/327) of Arabidopsis TFs surveyed were methylation sensitive, a property that strongly impacts the epicistrome landscape. Described in O’Malley et al., Cell, 2016.

TEC (Transcription Profiling of Escherichia coli). TEC provides transcription factor (TF) binding sites and intensities determined for nearly 200 TFs in Escherichia coli. Users can search for TFs that may regulate specific genes or for target genes regulated by TFs of interest, filter the results by binding intensity and/or location, view both a bar chart and a heat map of TF binding, analyze consensus sequences and download raw data. Described in Ishihama et al., NAR, 2016.

LEGO Factors: A database of TF roles as activators/repressors at enhancers. Contains quantitative measurements of combinatorial roles of 812 Drosophila TFs and cofactors in the context of 24 enhancers. Described in Stampfel et al., Nature, 2015.

TFBSshape: a motif database for DNA shape features of transcription factor binding sites. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE.

CollecTF: a database of experimentally validated transcription factor-binding sites in Bacteria. CollecTF compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features.

footprintDB: This is a database with 2422 unique DNA-binding proteins (mostly transcription factors, TFs), 3662 Position Weight Matrices (PWMs) and 10112 DNA Binding Sites extracted from the literature and other repositories. The binding interfaces of (most) proteins in the database are inferred from the collection of protein-DNA complexes described in 3D-footprint.

AthaMap: a genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana.

Cistrome: a web portal for exploring ChIP-seq and DNase-seq data. Currently contains human and mouse datasets.

CTCFBSDB 2.0: a database for CTCF-binding sites and genome organization. A database of CTCF-binding sites, CTCFBSDB, now contains almost 15 million CTCF-binding sequences in 10 species. It includes integrated CTCF-binding sites with genomic topological domains defined using Hi-C data. Additionally, the updated database includes new features enabled by new CTCF-binding site data, including binding site occupancy and the ability to visualize overlapping CTCF-binding sites determined in separate experiments.

HOCOMOCO: a comprehensive collection of human and mouse transcription factor binding site models. HOCOMOCO contains non-redundant curated binding models for 601 human and 396 mouse TFs. DNA sequences of TF binding regions obtained by both pregenomic and high-throughput methods were collected from existing databases and other public data. The ChIPMunk software was used to construct positional weight matrices. Four motif discovery strategies were tested based on different motif shape priors, including flat and periodic priors associated with DNA helix pitch. A quality rating was manually assigned to each model based on known binding preferences. An appropriate TFBS model was selected for each TF, with similar models selected for related TFs.

Factorbook: a TF-centric web repository of the ENCODE data. Factorbook is described in a recent publication: Wang et al. (2012). Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22: 1798–1812.

TFinDit: Transcription Factor-DNA Interaction Data Depository. TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria.

ScerTF: a database of benchmarked position weight matrices for Saccharomyces species. A comprehensive database of 1226 motifs from 11 different sources. The site allows users to search the database with a regulatory site or matrix to identify the TFs most likely to bind the input sequence.

A curated collection of yeast transcription factor DNA binding specificity data from the Bulyk Lab. [To be checked later]

FlyTF: Drosophila transcription factor database. FlyTF currently contains 129 proteins for which PWMs are available.

TRANSFAC – a commercial database of TFs, their binding sites, regulated genes and PWMs. TRANSFAC consists of free and paid sections. The binding sites provided are experimentally verified. Human TF weight matrices may be viewed through the web interface of the UCSC Genome Browser.

KDBI: Kinetic Data of Biomolecular Interactions. KDBI is a collection of experimentally determined kinetic data for protein-protein, protein-RNA, protein-DNA, protein-ligand, RNA-ligand and DNA-ligand binding events described in the literature.

ProNIT – a database of experimental thermodynamic protein-DNA interaction data. ProNIT currently contains more than 4900 entries. Each entry has the protein and nucleic acid information, experimental conditions and the following binding thermodynamic data: dissociation constant Kd, energies, stoichiometry of binding and activity (Km and kcat).
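A dissociation constant of the kind listed in ProNIT can be converted to a standard binding free energy with ΔG° = RT ln(Kd / c°). The short Python sketch below illustrates the arithmetic; it assumes a 1 M standard state and a default temperature of 25 °C, and the function and variable names are illustrative rather than anything provided by ProNIT.

```python
import math

R_KCAL = 1.987204259e-3  # gas constant in kcal/(mol*K)

def kd_to_delta_g(kd_molar, temperature_k=298.15, standard_conc=1.0):
    """Return the standard binding free energy in kcal/mol.

    kd_molar: dissociation constant in mol/L (must be positive).
    standard_conc: standard-state concentration in mol/L (assumed 1 M).
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar / standard_conc)

dg_1nM = kd_to_delta_g(1e-9)    # roughly -12.3 kcal/mol at 25 degrees C
dg_10nM = kd_to_delta_g(1e-8)   # a tenfold weaker binder
```

At this temperature a tenfold change in Kd corresponds to about 1.36 kcal/mol, which is a convenient rule of thumb when comparing entries measured under similar conditions.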

UniPROBE – an online database of protein binding microarray data on protein-DNA interactions. UniPROBE contains data on the preferences of proteins for all possible sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database currently hosts DNA binding data for 391 nonredundant proteins (individual proteins or in some cases heterodimers) from a diverse collection of organisms.
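To make the k-mer representation concrete, the Python sketch below enumerates all DNA words of length k. It additionally merges each word with its reverse complement, which is a common convention for double-stranded DNA but is an assumption here, not something stated in the description above; the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_kmers(k):
    """List all DNA k-mers, keeping one representative per
    word / reverse-complement pair (the lexicographically smaller one)."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        seen.add(min(word, reverse_complement(word)))
    return sorted(seen)

eight_mers = canonical_kmers(8)
```

There are 4^8 = 65,536 possible 8-mers; after merging reverse complements, `eight_mers` contains 32,896 entries, because the 256 palindromic words pair with themselves.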

Drosophila transcription factor weight matrices collected by Daniel Pollard. Contains ~50 matrices (Last checked: 06.10.2010).

BindingDB – a public database of measured protein-small ligand binding affinities.

DPInteract: DNA-protein interactions for E. coli. (Last updated in 1998).

Calculate TF affinity or related parameters from weight matrices and directly from experiments (sorted “newest first”):

DeepBind: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. The DeepBind algorithm is based on convolutional neural networks and can discover new patterns even when the locations of patterns within sequences are unknown. For training, DeepBind uses a set of sequences and, for each sequence, an experimentally determined binding score. Sequences can have varying lengths, and binding scores can be real-valued measurements or binary class labels. The authors (Alipanahi et al., 2015) report that this algorithm outperforms all 26 methods for protein-DNA specificity prediction previously compared by Weirauch et al., 2013. This is a standalone application, available for Windows and Linux.
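The reason a convolutional model can find a pattern regardless of its position can be illustrated with a toy example: a motif detector is slid along a one-hot encoded sequence, negative responses are clipped to zero, and the maximum over positions is kept. The Python sketch below is not DeepBind's code; the hand-set kernel for the word TGA, the bias and all function names are assumptions for illustration, whereas a real model learns its kernels from data.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator tuples (A, C, G, T)."""
    return [ONE_HOT[base] for base in seq]

def conv_scan(encoded, kernel, bias=0.0):
    """Slide the kernel over the sequence and rectify each response."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        response = bias
        for j in range(width):
            response += sum(x * w for x, w in zip(encoded[start + j], kernel[j]))
        activations.append(max(0.0, response))  # rectification
    return activations

def max_pool(activations):
    """Keep the strongest response, discarding where it occurred."""
    return max(activations)

# Hypothetical hand-set detector: +1 for the matching base of T-G-A, -1 otherwise.
tga_kernel = [
    (-1, -1, -1, 1),   # T
    (-1, -1, 1, -1),   # G
    (1, -1, -1, -1),   # A
]

middle = max_pool(conv_scan(one_hot("CCTGACC"), tga_kernel, bias=-2.0))
start = max_pool(conv_scan(one_hot("TGACCCC"), tga_kernel, bias=-2.0))
absent = max_pool(conv_scan(one_hot("CCCCCCC"), tga_kernel, bias=-2.0))
```

Here `middle` and `start` are identical because pooling discards the position of the match, while `absent` is zero.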

BayesPI-BAR: a biophysical model for characterization of regulatory sequence variations. BayesPI-BAR (Bayesian method for Protein-DNA Interaction with Binding Affinity Ranking) uses biophysical modeling of protein-DNA interaction to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). It includes TF chemical potentials or protein concentrations, and direct TF binding targets as input. The authors claimed that the method compares favorably to existing programs such as sTRAP and is-rSNP, when evaluated on the same SNPs. The method is described here.

PhysBinder: improving the prediction of transcription factor binding sites by flexible inclusion of biophysical properties. A web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct readout (the sequence) and several indirect readout features of protein-DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). According to its authors, the algorithm significantly outperforms state-of-the-art approaches for in silico identification of TFBS. Users can submit FASTA sequences for analysis.

TRAP – TRanscription factor Affinity Prediction. TRAP calculates binding affinity based on the matrix description of a given TF and a set of DNA sequences to be annotated (input). It requires the specification of two biophysically-motivated parameters. The freely available program code is written in C. Further details are available in the paper by Roider et al., 2007.
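The general idea of going from a count matrix to a biophysical affinity can be sketched as follows: each site receives a mismatch energy relative to the consensus, the energy is turned into a binding probability, and the probabilities are summed over the sequence. This Python sketch is not TRAP's code and may differ from its model in detail. The parameter names `lam` (energy scale) and `r0` (a concentration-like factor), their default values, the pseudocount and the forward-strand-only scan are all assumptions for illustration.

```python
import math

def mismatch_energy(counts, site, lam=0.7, pseudocount=1.0):
    """Energy of a site relative to the consensus, in units of kT.

    counts: dict mapping each base to a list of per-position counts.
    The consensus base at each position contributes zero energy.
    """
    energy = 0.0
    for i, base in enumerate(site):
        column = {b: counts[b][i] + pseudocount for b in "ACGT"}
        energy += math.log(max(column.values()) / column[base])
    return energy / lam

def expected_bound(counts, sequence, r0=100.0, lam=0.7):
    """Sum of site occupancies over all windows on the forward strand."""
    width = len(counts["A"])
    total = 0.0
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        weight = r0 * math.exp(-mismatch_energy(counts, site, lam))
        total += weight / (1.0 + weight)  # occupancy between 0 and 1
    return total

# Hypothetical 3-bp count matrix with consensus ACG.
toy_counts = {
    "A": [9, 0, 0],
    "C": [0, 9, 0],
    "G": [0, 0, 9],
    "T": [0, 0, 0],
}
with_site = expected_bound(toy_counts, "TTACGTT")
without_site = expected_bound(toy_counts, "TTTTTTT")
```

A sequence containing the consensus gives a clearly higher value than one without it, but weak sites still contribute, which is the main difference from counting hits above a score cutoff.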

STAP – Sequence To Affinity Prediction. STAP uses a biophysical model to analyze transcription factor (TF)-DNA binding data, such as ChIP-chip or ChIP-seq data. The program assumes that the measured affinity of a sequence to a TF (TF_exp) in some ChIP-chip or ChIP-seq experiment is determined by: 1) the number and strength of binding sites of TF_exp in this sequence; 2) the presence of other sites in the neighborhood that may interact cooperatively with the sites of TF_exp. Specifically, it takes as input a set of DNA sequences, their binding affinities to some TF as measured by experiments (TF_exp), and the position weight matrices (PWMs) of a set of TFs, including TF_exp. It then learns the relevant parameters of the biophysical model, including those of TF-DNA interaction and those of TF-TF cooperative interactions.

MatrixREDUCE – Predicting TF binding through alignment-free and affinity-based analysis of orthologous promoter sequences. The input to MatrixREDUCE is a sequence file in FASTA format and an expression data file in tab-delimited text format (missing values are allowed). Output data include PSAMs in numeric and graphical format, parameters of the fitted model, and an HTML summary page.

BayesPI – estimation of TF binding energy matrices, binding affinity and chemical potential from ChIP-chip experiments. BayesPI integrates Bayesian model regularization with biophysical modeling of protein-DNA interactions and nucleosome positioning to study protein-DNA interactions using high-throughput datasets.

Creating PWMs of transcription factors using 3D structure-based computation of protein-DNA binding free energies. A scoring function calibrated against crystallographic data on protein-DNA contacts can recover PWMs, sometimes outperforming experimental PWMs.

PscanChIP is a web application that, given a set of genomic regions derived from a genome-wide ChIP-seq experiment, scans them for overrepresented sequence motifs, using motif descriptors from the TRANSFAC and JASPAR databases or descriptors uploaded by users. The overrepresented motifs thus correspond to transcription factor binding sites that are enriched in the regions themselves. The general idea is to assess which motif is most likely to represent the binding specificity of the TF investigated, but also to identify “secondary” motifs that might correspond to other TFs interacting with the one for which the ChIP experiment was performed.

Whole-Genome rVISTA. Whole-Genome rVISTA enables users to query databases containing pre-computed genome coordinates of evolutionarily conserved transcription factor binding sites in the proximal promoters (from 100 bp up to 5 kb upstream) of the human, mouse and Drosophila genomes. TF binding sites are based on position weight matrices from the TRANSFAC Professional database. Results are exported in BED format for rapid visualization in the UCSC Genome Browser. Flat files of mapped conserved sites and their genomic coordinates are also available for analysis with stand-alone software.