Interpreting Protein Structure Prediction Data
Ginzu
Ginzu is a protocol that attempts to determine the regions of a protein chain that will fold into
globular units, called "domains". It scans the protein chain sequence with successively less
confident methods of detection to determine any homologs with experimentally determined
structures, starting with PDB-Blast (PSI-BLAST against the PDB), and followed by the more
remote fold-detection method FFAS03 (ORFeus and Pcons were used previously). After any
homologs are identified, a search of remaining regions is done with HMMER against the
Pfam-A protein family database. Lastly, the PSI-BLAST multiple sequence alignment is used to assign
regions of increased likelihood of possessing a contiguous domain based on sequence clusters.
The final step uses the PSI-BLAST MSA to select cut points between the domains. For any remaining
long stretches of the sequence that have not already matched a homolog with a structure or a Pfam-A
family, new domains may also be defined based on the strongest cut points.
Each domain method produces a score which corresponds to the confidence in the prediction as
well as the confidence in the method to produce accurate domain predictions.
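The cascade described above can be sketched in Python. This is a simplified illustration and not the Ginzu implementation: the function name, the representation of hits as inclusive residue ranges, and the rule that a hit is rejected if it overlaps an already assigned region are all assumptions made for this example. The MSA and deduced steps, which handle the regions left over, are omitted.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int]  # inclusive (start, end) residue range; hypothetical representation

# Homology-based methods in decreasing order of confidence, as listed in the text.
METHOD_ORDER = ["PDB-Blast", "FFAS03", "Pfam"]


def assign_in_priority_order(
    hits_by_method: Dict[str, List[Hit]]
) -> List[Tuple[int, int, str]]:
    """Accept hits method by method, skipping any hit that overlaps
    a region already claimed by a more confident method (assumed rule)."""
    assigned: List[Tuple[int, int, str]] = []
    for method in METHOD_ORDER:
        for start, end in hits_by_method.get(method, []):
            overlaps = any(
                start <= a_end and end >= a_start for a_start, a_end, _ in assigned
            )
            if not overlaps:
                assigned.append((start, end, method))
    return sorted(assigned)


# Made-up hits for a hypothetical 400-residue chain.
example_hits = {
    "PDB-Blast": [(1, 100)],
    "FFAS03": [(80, 180), (200, 300)],
    "Pfam": [(310, 400), (250, 320)],
}
domains = assign_in_priority_order(example_hits)
```

In this example the FFAS03 hit at 80-180 is discarded because it overlaps the PDB-Blast region, which mirrors the idea that less confident methods only annotate what more confident methods left unassigned.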
PDB-BLAST
The PSI-BLAST method is used for detecting sequence homologs of a given protein in the
Protein Data Bank (PDB). It uses the concept of searching with a position-specific residue
substitution profile appropriate to the family to which the query belongs. This allows for more
sensitive detection of remote homologous sequences. The confidence displayed is -log10(e-value),
where the e-value is the value returned by a PSI-BLAST search against the PDB. A confidence
of 3.0 (e = 0.001) is considered to be a strong detection threshold, and PDB-Blast is considered
the highest-confidence method.
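As a minimal illustration, the conversion from an e-value to the displayed confidence can be sketched in Python. The logarithm is assumed to be base 10, which is consistent with a confidence of 3.0 corresponding to e = 0.001. The function and constant names are hypothetical and are not part of Ginzu.

```python
import math

STRONG_DETECTION = 3.0  # threshold quoted in the text (e = 0.001)


def evalue_to_confidence(evalue: float) -> float:
    """Return -log10(e-value). Base 10 is an assumption consistent
    with 3.0 corresponding to e = 0.001."""
    if evalue <= 0.0:
        raise ValueError("e-value must be positive")
    return -math.log10(evalue)


# A smaller e-value gives a higher confidence.
strong = evalue_to_confidence(1e-5) >= STRONG_DETECTION  # well above the threshold
weak = evalue_to_confidence(0.1) >= STRONG_DETECTION     # well below the threshold
```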
FFAS03
FFAS03 is a fold recognition method based on profile-profile comparisons. During Ginzu
execution, profiles of regions of the protein not annotated by PDB-Blast are built and compared
to profiles built for sequences in the PDB. Scores are divided by -10 to allow comparison to
other scaled e-value-based scores (e.g. PDB-Blast). Scaled scores >= 0.95 have been shown to have less
than 3% false positives, and therefore hits are thresholded at this level.
(Jaroszewski, L., Rychlewski, L., Li, Z., Li, W. & Godzik, A. (2005) FFAS03: a server for profile-profile sequence alignments. Nucl. Acids Res. 33, W284-W288)
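The scaling and thresholding described for FFAS03 can be sketched as follows. The sketch assumes that raw FFAS03 scores are negative, with more negative values indicating stronger matches, so that dividing by -10 gives a positive scaled score. The function names are hypothetical.

```python
FFAS_THRESHOLD = 0.95  # scaled-score threshold quoted in the text


def scale_ffas_score(raw_score: float) -> float:
    """Divide the raw FFAS03 score by -10, as described in the text."""
    return raw_score / -10.0


def passes_ffas_threshold(raw_score: float) -> bool:
    """True if the scaled score reaches the threshold."""
    return scale_ffas_score(raw_score) >= FFAS_THRESHOLD


# Made-up raw scores for illustration only.
accepted = passes_ffas_threshold(-20.0)  # scaled score 2.0
rejected = passes_ffas_threshold(-5.0)   # scaled score 0.5
```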
Pfam
Pfam is a set of families of protein sequences
that are represented as hidden Markov models, and may be searched with HMMER.
The confidence of Pfam-matched domains is given by -log10(e-value),
where the e-value is returned by the search of the Pfam database using HMMER. Values of 3.0
(e = 0.001) or higher are considered significant.
MSA
Multiple Sequence Alignment (MSA) is used in the final and lowest-confidence sequence-based step of the Ginzu domain prediction algorithm.
Analysis of the PSI-BLAST MSA is employed to predict domain cut points, based on the density of regions in the sequence alignments. Only confident
cut predictions are shown. Confidence scores represent the number of sequence
clusters in the MSA that overlap the domain.
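The MSA confidence score, as a count of sequence clusters overlapping a domain, can be illustrated with a short sketch. Representing each cluster and each domain as an inclusive range of residue positions is an assumption made for this example, and the function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) residue positions; assumed representation


def count_overlapping_clusters(domain: Interval, clusters: List[Interval]) -> int:
    """Count clusters that share at least one residue position with the domain."""
    d_start, d_end = domain
    return sum(
        1 for c_start, c_end in clusters if c_start <= d_end and c_end >= d_start
    )


# Made-up cluster ranges for illustration only.
clusters = [(1, 40), (30, 60), (100, 200), (151, 300)]
msa_confidence = count_overlapping_clusters((50, 150), clusters)
```

Here two of the four clusters, (30, 60) and (100, 200), overlap the domain at residues 50-150, so the sketch reports a score of 2.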
Deduced Domains
After the Ginzu domain prediction algorithm has exhausted its analysis of the protein
sequence to predict protein domains, remaining stretches of the sequence may be designated as individual domains, with
longer stretches being cut into separate domains based on length. This is the least confident of the domain prediction
steps.
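One way to cut a long unassigned stretch by length is sketched below. The text does not state the actual rule or length limit, so both the even-split strategy and the `max_len` value of 200 residues are assumptions chosen purely for illustration.

```python
import math
from typing import List, Tuple


def split_by_length(start: int, end: int, max_len: int = 200) -> List[Tuple[int, int]]:
    """Split the inclusive range [start, end] into the fewest near-equal
    segments that are each at most max_len residues long.
    max_len is a hypothetical limit, not a documented Ginzu parameter."""
    if end < start or max_len < 1:
        raise ValueError("need start <= end and max_len >= 1")
    length = end - start + 1
    n_parts = math.ceil(length / max_len)
    base, extra = divmod(length, n_parts)
    segments = []
    pos = start
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first segments
        segments.append((pos, pos + size - 1))
        pos += size
    return segments


short_stretch = split_by_length(1, 100)  # stays as a single domain
long_stretch = split_by_length(1, 450)   # cut into three segments of 150 residues
```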
MCM
Mammoth Confidence Metric (MCM) is the probability that the structure of the domain is
classified in a specific SCOP superfamily. Rosetta structures are predicted for the domain on
IBM's World Community Grid and compared to PDB representatives classified in SCOP using the
Mammoth structure comparison method. Structures are predicted and classified only for domains
identified by the Pfam, MSA or deduced steps (PDB-Blast and FFAS03 are run first and produce
more confident fold predictions). Additionally, structures are only predicted for domains that
meet thresholds of length (roughly under 150 residues), low predicted disorder (<25%) and no predicted
transmembrane helices. The MCM score is a logistic regression of the Mammoth z-score,
Rosetta convergence score, contact order and length ratio of domain sequence and matched
structure sequence. Scores >= 0.9 are correct more than 75% of the time and scores >= 0.8 are correct more
than 66% of the time.
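The general form of a logistic regression over the four features named above can be sketched as follows. The fitted MCM coefficients are not given in this document, so the weights and intercept below are placeholders invented for illustration, including their signs. They should not be read as the real model, and the function names are hypothetical.

```python
import math
from typing import Sequence


def logistic(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mcm_like_score(
    features: Sequence[float], weights: Sequence[float], intercept: float
) -> float:
    """Logistic regression: logistic(intercept + sum of weight * feature)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    linear = intercept + sum(w * f for w, f in zip(weights, features))
    return logistic(linear)


# Placeholder coefficients, invented for illustration (NOT the fitted MCM values).
# Feature order: Mammoth z-score, Rosetta convergence, contact order, length ratio.
PLACEHOLDER_WEIGHTS = (0.5, 0.3, -0.1, 0.2)
PLACEHOLDER_INTERCEPT = -4.0

score = mcm_like_score((8.0, 2.0, 1.0, 1.0), PLACEHOLDER_WEIGHTS, PLACEHOLDER_INTERCEPT)
```

Because the output of the logistic function lies strictly between 0 and 1, it can be read as a probability, which is how the MCM score is described above.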
Gene Ontology (GO) Prediction
The Gene Ontology (GO) is a collection of terms that allows for the annotation of proteins across
many genomes. GO function terms are predicted for domains with predicted structures and
integrated with known GO biological process and cellular component terms. Function
predictions are meant to be hypotheses testable in the lab. Each prediction is given a log-likelihood
ratio (LLR) score to estimate its confidence; a prediction with a score greater than zero is more likely to be
true than false. Structure evidence used to predict functions is scaled based on the confidence in
the structure classification. For example, function predictions using structure evidence from
Rosetta de novo models (i.e. MCM) are scaled by the MCM score. Finally, only predictions with
an LLR greater than -3.0 are shown.
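The display cutoff and the interpretation of the sign of the LLR can be sketched as follows. The record type, function names and term labels are hypothetical, the LLR values are made up, and how the LLR itself is computed and scaled is not covered here.

```python
from typing import List, NamedTuple


class GoPrediction(NamedTuple):
    term: str   # placeholder label standing in for a GO term
    llr: float  # log-likelihood ratio score


DISPLAY_CUTOFF = -3.0  # only predictions with an LLR above this are shown (from the text)


def displayed(predictions: List[GoPrediction]) -> List[GoPrediction]:
    """Keep predictions above the display cutoff, most confident first."""
    kept = [p for p in predictions if p.llr > DISPLAY_CUTOFF]
    return sorted(kept, key=lambda p: p.llr, reverse=True)


def more_likely_true(prediction: GoPrediction) -> bool:
    """An LLR greater than zero means more likely true than false."""
    return prediction.llr > 0.0


# Made-up predictions for illustration only.
predictions = [
    GoPrediction("term_a", 2.5),
    GoPrediction("term_b", -1.0),
    GoPrediction("term_c", -4.2),
]
shown = displayed(predictions)
```

In this example term_c falls below the cutoff and is hidden, while term_b is shown even though its negative LLR means it is more likely false than true.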
PDB
The Protein Data Bank (PDB) is the worldwide repository of protein structure data. We also use "PDB" to refer to the file format that the Protein Data Bank uses to describe a protein structure.
ORFEUS (previous method)
ORFeus is a method for matching protein sequences to likely protein folds based on very remote sequence similarities, and was
previously employed in the fold recognition step of the Ginzu domain prediction algorithm. The sequence profile
and predicted secondary structures are searched against a database of sequence profiles and predicted secondary
structures for proteins of known structure.
Results in LiveBench, a test of fold recognition methods, suggest that scores of 7.5 or greater are almost always correct matches.
Pcons (previous method)
Pcons was the first consensus server for fold recognition and was previously used
in the fold recognition step of the Ginzu domain prediction algorithm. It selects the best prediction out
of several predictions. For each query sequence, predictions from several fold recognition servers are collected. For each of
these models, a measure that relates to the quality of the model is calculated. This measure is predicted
from structural comparisons between the models and from the server score for each model. Pcons makes at
least 10% more correct predictions than the best single method, and its specificity is significantly better.
Any Pcons score higher than 1.5 should be significant.