Summary:
Publisher Summary 1
Zvelebil (cancer informatics, The Breakthrough Toby Robins Breast Cancer Research Centre, UK) and Baum (crystallography, Birkbeck College, UK) present a text for advanced undergraduate and graduate students on the developing field of bioinformatics. The material is organized into seven sections--background basics, sequence alignments, evolutionary processes, genome characteristics, secondary structures, tertiary structures, cells and organisms--the majority of which contain both applications chapters, focusing on the main concepts, and theory chapters providing greater detail and the underlying mathematics. Illustrated throughout with charts, diagrams, and tables. Annotation ©2008 Book News, Inc., Portland, OR (booknews.com)
Table Of Contents:
Preface v
A Note to the Reader vii
List of Reviewers xii
Contents in Brief xiii
Part 1 Background Basics
The Nucleic Acid World
The Structure of DNA and RNA 5(5)
DNA is a linear polymer of only four different bases 5(2)
Two complementary DNA strands interact by base pairing to form a double helix 7(2)
RNA molecules are mostly single stranded but can also have base-pair structures 9(1)
DNA, RNA, and Protein: The Central Dogma 10(4)
DNA is the information store, but RNA is the messenger 11(1)
Messenger RNA is translated into protein according to the genetic code 12(1)
Translation involves transfer RNAs and RNA-containing ribosomes 13(1)
Gene Structure and Control 14(6)
RNA polymerase binds to specific sequences that position it and identify where to begin transcription 15(2)
The signals initiating transcription in eukaryotes are generally more complex than those in bacteria 17(1)
Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation 18(1)
The control of translation 19(1)
The Tree of Life and Evolution 20(5)
A brief survey of the basic characteristics of the major forms of life 21(1)
Nucleic acid sequences can change as a result of mutation 22(1)
Summary 23(1)
Further Reading 24(1)
Protein Structure
Primary and Secondary Structure 25(12)
Protein structure can be considered on several different levels 26(1)
Amino acids are the building blocks of proteins 27(1)
The differing chemical and physical properties of amino acids are due to their side chains 28(1)
Amino acids are covalently linked together in the protein chain by peptide bonds 29(4)
Secondary structure of proteins is made up of α-helices and β-strands 33(2)
Several different types of β-sheet are found in protein structures 35(1)
Turns, hairpins and loops connect helices and strands 36(1)
Implication for Bioinformatics 37(3)
Certain amino acids prefer a particular structural unit 37(1)
Evolution has aided sequence analysis 38(1)
Visualization and computer manipulation of protein structures 38(2)
Proteins Fold to Form Compact Structures 40(6)
The tertiary structure of a protein is defined by the path of the polypeptide chain 41(1)
The stable folded state of a protein represents a state of low energy 41(1)
Many proteins are formed of multiple subunits 42(1)
Summary 43(1)
Further Reading 44(2)
Dealing with Databases
The Structure of Databases 46(6)
Flat-file databases store data as text files 48(1)
Relational databases are widely used for storing biological information 49(1)
XML has the flexibility to define bespoke data classifications 50(1)
Many other database structures are used for biological data 51(1)
Databases can be accessed locally or online and often link to each other 52(1)
Types of Database 52(3)
There's more to databases than just data 53(1)
Primary and derived data 53(1)
How we define and connect things is very important: Ontologies 54(1)
Looking for Databases 55(6)
Sequence databases 55(3)
Microarray databases 58(1)
Protein interaction databases 58(1)
Structural databases 59(2)
Data Quality 61(11)
Nonredundancy is especially important for some applications of sequence databases 62(1)
Automated methods can be used to check for data consistency 63(1)
Initial analysis and annotation is usually automated 64(1)
Human intervention is often required to produce the highest quality annotation 65(1)
The importance of updating databases and entry identifier and version numbers 65(1)
Summary 66(1)
Further Reading 67(5)
Part 2 Sequence Alignments
Applications Chapter
Producing and Analyzing Sequence Alignments
Principles of Sequence Alignment 72(4)
Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity 73(1)
Alignment can reveal homology between sequences 74(1)
It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences 75(1)
Scoring Alignments 76(5)
The quality of an alignment is measured by giving it a quantitative score 76(1)
The simplest way of quantifying similarity between two sequences is percentage identity 76(1)
The dot-plot gives a visual assessment of similarity based on identity 77(2)
Genuine matches do not have to be identical 79(2)
There is a minimum percentage identity that can be accepted as significant 81(1)
There are many different ways of scoring an alignment 81(1)
Substitution Matrices 81(4)
Substitution matrices are used to assign individual scores to aligned sequence positions 81(1)
The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences 82(2)
The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence 84(1)
The choice of substitution matrix depends on the problem to be solved 84(1)
Inserting Gaps 85(2)
Gaps inserted in a sequence to maximize similarity require a scoring penalty 85(1)
Dynamic programming algorithms can determine the optimal introduction of gaps 86(1)
Types of Alignment 87(6)
Different kinds of alignments are useful in different circumstances 87(3)
Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences 90(1)
Multiple alignments can be constructed by several different techniques 90(1)
Multiple alignments can improve the accuracy of alignment for sequences of low similarity 91(1)
ClustalW can make global multiple alignments of both DNA and protein sequences 92(1)
Multiple alignments can be made by combining a series of local alignments 92(1)
Alignment can be improved by incorporating additional information 93(1)
Searching Databases 93(4)
Fast yet accurate search algorithms have been developed 94(1)
FASTA is a fast database-search method based on matching short identical segments 95(1)
BLAST is based on finding very similar short segments 95(1)
Different versions of BLAST and FASTA are used for different problems 95(1)
PSI-BLAST enables profile-based database searches 96(1)
SSEARCH is a rigorous alignment method 97(1)
Searching with Nucleic Acid or Protein Sequences 97(6)
DNA or RNA sequences can be used either directly or after translation 97(1)
The quality of a database match has to be tested to ensure that it could not have arisen by chance 97(1)
Choosing an appropriate E-value threshold helps to limit a database search 98(2)
Low-complexity regions can complicate homology searches 100(2)
Different databases can be used to solve particular problems 102(1)
Protein Sequence Motifs or Patterns 103(4)
Creation of pattern databases requires expert knowledge 104(1)
The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences 105(2)
Searching Using Motifs and Patterns 107(2)
The PROSITE database can be searched for protein motifs and patterns 107(1)
The pattern-based program PHI-BLAST searches for both homology and matching motifs 108(1)
Patterns can be generated from multiple sequences using PRATT 108(1)
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family 109(1)
The Pfam database defines profiles of protein families 109(1)
Patterns and Protein Function 109(8)
Searches can be made for particular functional sites in proteins 109(1)
Sequence comparison is not the only way of analyzing protein sequences 110(1)
Summary 111(1)
Further Reading 112(5)
Theory Chapter
Pairwise Sequence Alignment and Database Searching
Substitution Matrices and Scoring 117(10)
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor 117(2)
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins 119(3)
The BLOSUM matrices were designed to find conserved regions of proteins 122(3)
Scoring matrices for nucleotide sequence alignment can be derived in similar ways 125(1)
The substitution scoring matrix used must be appropriate to the specific alignment problem 126(1)
Gaps are scored in a much more heuristic way than substitutions 126(1)
Dynamic Programming Algorithms 127(14)
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm 129(6)
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm 135(4)
Time can be saved with a loss of rigor by not calculating the whole matrix 139(2)
Indexing Techniques and Algorithmic Approximations 141(12)
Suffix trees locate the positions of repeats and unique sequences 141(2)
Hashing is an indexing technique that lists the starting positions of all k-tuples 143(1)
The FASTA algorithm uses hashing and chaining for fast database searching 144(3)
The BLAST algorithm makes use of finite-state automata 147(3)
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms 150(3)
Alignment Score Significance 153(3)
The statistics of gapped local alignments can be approximated by the same theory 156(1)
Aligning Complete Genome Sequences 156(11)
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms 157(2)
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms 159(1)
Summary 159(2)
Further Reading 161(6)
Theory Chapter
Patterns, Profiles, and Multiple Alignments
Profiles and Sequence Logos 167(12)
Position-specific scoring matrices are an extension of substitution scoring matrices 168(3)
Methods for overcoming a lack of data in deriving the values for a PSSM 171(5)
PSI-BLAST is a sequence database searching program 176(1)
Representing a profile as a logo 177(2)
Profile Hidden Markov Models 179(14)
The basic structure of HMMs used in sequence alignment to profiles 180(5)
Estimating HMM parameters using aligned sequences 185(2)
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths 187(3)
Estimating HMM parameters using unaligned sequences 190(3)
Aligning Profiles 193(3)
Comparing two PSSMs by alignment 193(2)
Aligning profile HMMs 195(1)
Multiple Sequence Alignments by Gradual Sequence Addition 196(11)
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment 198(2)
Many different scoring schemes have been used in constructing multiple alignments 200(4)
The multiple alignment is built using the guide tree and profile methods and may be further refined 204(3)
Other Ways of Obtaining Multiple Alignments 207(4)
The multiple sequence alignment program DIALIGN aligns ungapped blocks 207(2)
The SAGA method of multiple alignment uses a genetic algorithm 209(2)
Sequence Pattern Discovery 211(14)
Discovering patterns in a multiple alignment: eMOTIF and AACC 213(2)
Probabilistic searching for common patterns in sequences: Gibbs and MEME 215(2)
Searching for more general sequence patterns 217(1)
Summary 218(1)
Further Reading 219(6)
Part 3 Evolutionary Processes
Applications Chapter
Recovering Evolutionary History
The Structure and Interpretation of Phylogenetic Trees 225(10)
Phylogenetic trees reconstruct evolutionary relationships 225(5)
Tree topology can be described in several ways 230(2)
Consensus and condensed trees report the results of comparing tree topologies 232(3)
Molecular Evolution and its Consequences 235(13)
Most related sequences have many positions that have mutated several times 236(1)
The rate of accepted mutation is usually not the same for all types of base substitution 236(2)
Different codon positions have different mutation rates 238(1)
Only orthologous genes should be used to construct species phylogenetic trees 239(8)
Major changes affecting large regions of the genome are surprisingly common 247(1)
Phylogenetic Tree Reconstruction 248(20)
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249(1)
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset 249(2)
A model of evolution must be chosen to use with the method 251(4)
All phylogenetic analyses must start with an accurate multiple alignment 255(1)
Phylogenetic analyses of a small dataset of 16S RNA sequence data 255(4)
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259(5)
Summary 264(1)
Further Reading 265(3)
Theory Chapter
Building Phylogenetic Trees
Evolutionary Models and the Calculation of Evolutionary Distance 268(8)
A simple but inaccurate measure of evolutionary distance is the p-distance 268(2)
The Poisson distance correction takes account of multiple mutations at the same site 270(1)
The Gamma distance correction takes account of mutation rate variation at different sequence positions 270(1)
The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences 271(1)
More complex models distinguish between the relative frequencies of different types of mutation 272(3)
There is a nucleotide bias in DNA sequences 275(1)
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment 276(1)
Generating Single Phylogenetic Trees 276(10)
Clustering methods produce a phylogenetic tree based on evolutionary distances 276(2)
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree 278(1)
The Fitch-Margoliash method produces an unrooted additive tree 279(3)
The neighbor-joining method is related to the concept of minimum evolution 282(3)
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree 285(1)
Generating Multiple Tree Topologies 286(7)
The branch-and-bound method greatly improves the efficiency of exploring tree topology 288(1)
Optimization of tree topology can be achieved by making a series of small changes to an existing tree 288(3)
Finding the root gives a phylogenetic tree a direction in time 291(2)
Evaluating Tree Topologies 293(14)
Functions based on evolutionary distances can be used to evaluate trees 293(4)
Unweighted parsimony methods look for the trees with the smallest number of mutations 297(3)
Mutations can be weighted in different ways in the parsimony method 300(2)
Trees can be evaluated using the maximum likelihood method 302(3)
The quartet-puzzling method also involves maximum likelihood in the standard implementation 305(1)
Bayesian methods can also be used to reconstruct phylogenetic trees 306(1)
Assessing the Reliability of Tree Features and Comparing Trees 307(11)
The long-branch attraction problem can arise even with perfect data and methodology 308(1)
Tree topology can be tested by examining the interior branches 309(1)
Tests have been proposed for comparing two or more alternative trees 310(1)
Summary 311(1)
Further Reading 312(6)
Part 4 Genome Characteristics
Applications Chapter
Revealing Genome Features
Preliminary Examination of Genome Sequence 318(4)
Whole genome sequences can be split up to simplify gene searches 319(1)
Structural RNA genes and repeat sequences can be excluded from further analysis 319(3)
Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322(1)
Gene Prediction in Prokaryotic Genomes 322(1)
Gene Prediction in Eukaryotic Genomes 323(14)
Programs for predicting exons and introns use a variety of approaches 323(1)
Gene predictions must preserve the correct reading frame 324(3)
Some programs search for exons using only the query sequence and a model for exons 327(5)
Some programs search for genes using only the query sequence and a gene model 332(2)
Genes can be predicted using a gene model and sequence similarity 334(2)
Genomes of related organisms can be used to improve gene prediction 336(1)
Splice Site Detection 337(1)
Splice sites can be detected independently by specialized programs 338(1)
Prediction of Promoter Regions 338(4)
Prokaryotic promoter regions contain relatively well-defined motifs 339(1)
Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340(1)
A variety of promoter-prediction methods are available online 340(1)
Promoter prediction results are not very clear-cut 341(1)
Confirming Predictions 342(4)
There are various methods for calculating the accuracy of gene-prediction programs 342(1)
Translating predicted exons can confirm the correctness of the prediction 343(1)
Constructing the protein and identifying homologs 343(3)
Genome Annotation 346(7)
Genome annotation is the final step in genome analysis 347(1)
Gene ontology provides a standard vocabulary for gene annotation 348(5)
Large Genome Comparisons 353(8)
Summary 354(1)
Further Reading 355(6)
Theory Chapter
Gene Detection and Genome Annotation
Detection of Functional RNA Molecules Using Decision Trees 361(3)
Detection of tRNA genes using the tRNAscan algorithm 361(1)
Detection of tRNA genes in eukaryotic genomes 362(2)
Features Useful for Gene Detection in Prokaryotes 364(4)
Algorithms for Gene Detection in Prokaryotes 368(9)
GeneMark uses inhomogeneous Markov chains and dicodon statistics 368(3)
GLIMMER uses interpolated Markov models of coding potential 371(1)
ORPHEUS uses homology, codon statistics, and ribosome-binding sites 372(1)
GeneMark.hmm uses explicit state duration hidden Markov models 373(3)
EcoParse is an HMM gene model 376(1)
Features Used in Eukaryotic Gene Detection 377(4)
Differences between prokaryotic and eukaryotic genes 377(2)
Introns, exons, and splice sites 379(2)
Promoter sequences and binding sites for transcription factors 381(1)
Predicting Eukaryotic Gene Signals 381(8)
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods 381(2)
A set of models has been designed to locate the site of core promoter sequence signals 383(4)
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results 387(2)
Predicting eukaryotic transcription and translation start sites 389(1)
Translation and transcription stop signals complete the gene definition 389(1)
Predicting Exon/Intron Structure 389(8)
Exons can be identified using general sequence properties 390(2)
Splice-site prediction 392(1)
Splice sites can be predicted by sequence patterns combined with base statistics 393(1)
GenScan uses a combination of weight matrices and decision trees to locate splice sites 394(1)
GeneSplicer predicts splice sites using first-order Markov chains 394(1)
NetPlantGene uses neural networks with intron and exon predictions to predict splice sites 395(1)
Other splicing features may yet be exploited for splice-site prediction 396(1)
Specific methods exist to identify initial and terminal exons 396(1)
Exons can be defined by searching databases for homologous regions 397(1)
Complete Eukaryotic Gene Models 397(2)
Beyond the Prediction of Individual Genes 399(14)
Functional annotation 400(3)
Comparison of related genomes can help resolve uncertain predictions 403(2)
Evaluation and reevaluation of gene-detection methods 405(1)
Summary 405(1)
Further Reading 406(7)
Part 5 Secondary Structures
Applications Chapter
Obtaining Secondary Structure from Sequence
Types of Prediction Methods 413(3)
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure 414(1)
Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure 414(1)
Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods 415(1)
Training and Test Databases 416(1)
There are several ways to define protein secondary structures 417(1)
Assessing the Accuracy of Prediction Programs 417(4)
Q3 measures the accuracy of individual residue assignments 417(1)
Secondary structure predictions should not be expected to reach 100% residue accuracy 418(1)
The Sov value measures the prediction accuracy for whole elements 419(1)
CAFASP/CASP: Unbiased and readily available protein prediction assessments 419(2)
Statistical and Knowledge-Based Methods 421(9)
The GOR method uses an information theory approach 422(3)
The program Zpred includes multiple alignment of homologous sequences and residue conservation information 425(1)
There is an overall increase in prediction accuracy using multiple sequence information 426(2)
The nearest-neighbor method: The use of multiple nonhomologous sequences 428(1)
PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 428(2)
Neural Network Methods of Secondary Structure Prediction 430(5)
Assessing the reliability of neural net predictions 432(1)
Several examples of Web-based neural network secondary structure prediction programs 432(2)
PROF: Protein forecasting 434(1)
PSIPRED 434(1)
Jnet: Using several alternative representations of the sequence alignment 434(1)
Some Secondary Structures Require Specialized Prediction Methods 435(3)
Transmembrane proteins 436(1)
Quantifying the preference for a membrane environment 437(1)
Prediction of Transmembrane Protein Structure 438(13)
Multi-helix membrane proteins 439(2)
A selection of prediction programs to predict transmembrane helices 441(2)
Statistical methods 443(1)
Knowledge-based prediction 443(1)
Evolutionary information from protein families improves the prediction 444(1)
Neural nets in transmembrane prediction 445(1)
Predicting transmembrane helices with hidden Markov models 446(1)
Comparing the results: What to choose 447(1)
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448(1)
Prediction of transmembrane structure containing β-strands 448(3)
Coiled-coil Structures 451(4)
The COILS prediction program 452(1)
PAIRCOIL and MULTICOIL are an extension of the COILS algorithm 453(1)
Zipping the Leucine zipper: A specialized coiled coil 453(2)
RNA Secondary Structure Prediction 455(8)
Summary 458(1)
Further Reading 459(4)
Theory Chapter
Predicting Secondary Structures
Defining Secondary Structure and Prediction Accuracy 463(9)
The definitions used for automatic protein secondary structure assignment do not give identical results 464(5)
There are several different measures of the accuracy of secondary structure prediction 469(3)
Secondary Structure Prediction Based on Residue Propensities 472(13)
Each structural state has an amino acid preference which can be assigned as a residue propensity 473(3)
The simplest prediction methods are based on the average residue propensity over a sequence window 476(3)
Residue propensities are modulated by nearby sequence 479(5)
Predictions can be significantly improved by including information from homologous sequences 484(1)
The Nearest-Neighbor Methods are Based on Sequence Segment Similarity 485(7)
Short segments of similar sequence are found to have similar structure 487(1)
Several sequence similarity measures have been used to identify nearest-neighbor segments 488(2)
A weighted average of the nearest-neighbor segment structures is used to make the prediction 490(1)
A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491(1)
Neural Networks Have Been Employed Successfully for Secondary Structure Prediction 492(12)
Layered feed-forward neural networks can transform a sequence into a structural prediction 494(8)
Inclusion of information on homologous sequences improves neural network accuracy 502(1)
More complex neural nets have been applied to predict secondary and other structural features 503(1)
Hidden Markov Models Have Been Applied to Structure Prediction 504(6)
HMM methods have been found especially effective for transmembrane proteins 506(3)
Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509(1)
General Data Classification Techniques Can Predict Structural Features 510(14)
Support vector machines have been successfully used for protein structure prediction 511(1)
Discriminants, SOMs, and other methods have also been used 512(2)
Summary 514(1)
Further Reading 515(9)
Part 6 Tertiary Structures
Applications Chapter
Modeling Protein Structure
Potential Energy Functions and Force Fields 524(5)
The conformation of a protein can be visualized in terms of a potential energy surface 525(1)
Conformational energies can be described by simple mathematical functions 525(1)
Similar force fields can be used to represent conformational energies in the presence of averaged environments 526(1)
Potential energy functions can be used to assess a modeled structure 527(1)
Energy minimization can be used to refine a modeled structure and identify local energy minima 527(1)
Molecular dynamics and simulated annealing are used to find global energy minima 528(1)
Obtaining a Structure by Threading 529(8)
The prediction of protein folds in the absence of known structural homologs 531(1)
Libraries or databases of nonredundant protein folds are used in threading 531(1)
Two distinct types of scoring schemes have been used in threading methods 531(2)
Dynamic programming methods can identify optimal alignments of target sequences and structural folds 533(1)
Several methods are available to assess the confidence to be put on the fold prediction 534(1)
The C2-like domain from the Dictyostelia: A practical example of threading 535(2)
Principles of Homology Modeling 537(5)
Closely related target and template sequences give better models 539(1)
Significant sequence identity depends on the length of the sequence 540(1)
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541(1)
Model building is based on a number of assumptions 541(1)
Steps in Homology Modeling 542(10)
Structural homologs to the target protein are found in the PDB 543(1)
Accurate alignment of target and template sequences is essential for successful modeling 543(1)
The structurally conserved regions of a protein are modeled first 544(1)
The modeled core is checked for misfits before proceeding to the next stage 545(1)
Sequence realignment and remodeling may improve the structure 545(1)
Insertions and deletions are usually modeled as loops 545(2)
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547(1)
Energy minimization is used to relieve structural errors 548(1)
Molecular dynamics can be used to explore possible conformations for mobile loops 548(1)
Models need to be checked for accuracy 549(2)
How far can homology models be trusted? 551(1)
Automated Homology Modeling 552(5)
The program MODELLER models by satisfying protein structure constraints 553(1)
COMPOSER uses fragment-based modeling to automatically generate a model 553(1)
Automated methods available on the Web for comparative modeling 554(1)
Assessment of structure prediction 554(3)
Homology Modeling of PI3 Kinase p110α 557(11)
Swiss-Pdb Viewer can be used for manual or semi-manual modeling 557(1)
Alignment, core modeling, and side-chain modeling are carried out all in one 558(1)
The loops are modeled from a database of possible structures 559(1)
Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer 559(1)
MolIDE is a downloadable semi-automatic modeling package 560(1)
Automated modeling on the Web illustrated with p110α kinase 561(2)
Modeling a functionally related but sequentially dissimilar protein: mTOR 563(1)
Generating a multidomain three-dimensional structure from sequence 564(1)
Summary 564(1)
Further Reading 565(3)
Applications Chapter
Analyzing Structure-Function Relationships
Functional Conservation 568(6)
Functional regions are usually structurally conserved 569(1)
Similar biochemical function can be found in proteins with different folds 570(1)
Fold libraries identify structurally similar proteins regardless of function 571(3)
Structure Comparison Methods 574(6)
Finding domains in proteins aids structure comparison 574(2)
Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison 576(1)
The CE method builds up a structural alignment from pairs of aligned protein segments 576(1)
The Vector Alignment Search Tool (VAST) aligns secondary structural elements 577(1)
DALI identifies structure superposition without maintaining segment order 578(1)
FATCAT introduces rotations between rigid segments 579(1)
Finding Binding Sites 580(7)
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582(2)
Searching for protein-protein interactions using surface properties 584(1)
Surface calculations highlight clefts or holes in a protein that may serve as binding sites 585(1)
Looking at residue conservation can identify binding sites 586(1)
Docking Methods and Programs 587(14)
Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known 588(1)
Specialized docking programs will automatically dock a ligand to a structure 588(2)
Scoring functions are used to identify the most likely docked ligand 590(1)
The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site 590(1)
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area 591(1)
GOLD is a flexible docking program, which utilizes a genetic algorithm 591(1)
The water molecules in binding sites should also be considered 592(1)
Summary 593(1)
Further Reading 594(7)
Part 7 Cells and Organisms
Proteome and Gene Expression Analysis
Analysis of Large-scale Gene Expression 601(11)
The expression of large numbers of different genes can be measured simultaneously by DNA microarrays 602(1)
Gene expression microarrays are mainly used to detect differences in gene expression in different conditions 602(2)
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604(1)
Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues 605(1)
Facilitating the integration of data from different places and experiments 606(1)
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606(2)
Techniques based on self-organizing maps can be used for analyzing microarray data 608(2)
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters 610(1)
Clustered gene expression data can be used as a tool for further research 610(2)
Analysis of Large-scale Protein Expression 612(14)
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613(1)
Measuring the expression levels shown in 2D gels 614(1)
Differences in protein expression levels between different samples can be detected by 2D gels 615(1)
Clustering methods are used to identify protein spots with similar expression patterns 615(3)
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data 618(1)
The changes in a set of protein spots can be tracked over a number of different samples 618(2)
Databases and online tools are available to aid the interpretation of 2D gel data 620(1)
Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins 621(1)
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means 621(1)
Protein-identification programs for mass spectrometry are freely available on the Web 622(1)
Mass spectrometry can be used to measure protein concentration 623(1)
Summary 623(1)
Further Reading 624(2)
Clustering Methods and Statistics
Expression Data Require Preparation Prior to Analysis 626(7)
Data normalization is designed to remove systematic experimental errors 627(1)
Expression levels are often analyzed as ratios and are usually transformed by taking logarithms 628(2)
Sometimes further normalization is useful after the data transformation 630(1)
Principal component analysis is a method for combining the properties of an object 631(2)
Cluster Analysis Requires Distances to be Defined Between all Data Points 633(4)
Euclidean distance is the measure used in everyday life 634(1)
The Pearson correlation coefficient measures distance in terms of the shape of the expression response 635(1)
The Mahalanobis distance takes account of the variation and correlation of expression responses 636(1)
Clustering Methods Identify Similar and Distinct Expression Patterns 637(14)
Hierarchical clustering produces a related set of alternative partitions of the data 639(2)
k-means clustering groups data into several clusters but does not determine a relationship between clusters 641(3)
Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters 644(2)
Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem 646(2)
The self-organizing tree algorithm (SOTA) determines the number of clusters required 648(1)
Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples 649(1)
The validity of clusters is determined by independent methods 650(1)
Statistical Analysis can Quantify the Significance of Observed Differential Expression 651(8)
t-tests can be used to estimate the significance of the difference between two expression levels 654(2)
Nonparametric tests are used to avoid making assumptions about the data sampling 656(1)
Multiple testing of differential expression requires special techniques to control error rates 657(2)
Gene and Protein Expression Data Can be Used to Classify Samples 659(10)
Many alternative methods have been proposed that can classify samples 660(1)
Support vector machines are another form of supervised learning algorithms that can produce classifiers 661(1)
Summary 662(2)
Further Reading 664(5)
Systems Biology
What is a System? 669(10)
A system is more than the sum of its parts 669(1)
A biological system is a living network 670(1)
Databases are useful starting points in constructing a network 671(1)
To construct a model more information is needed than a network 672(2)
There are three possible approaches to constructing a model 674(4)
Kinetic models are not the only way in systems biology 678(1)
Structure of the Model 679(4)
Control circuits are an essential part of any biological system 680(1)
The interactions in networks can be represented as simple differential equations 680(3)
Robustness of Biological Systems 683(6)
Robustness is a distinct feature of complexity in biology 684(1)
Modularity plays an important part in robustness 685(1)
Redundancy in the system can provide robustness 686(2)
Living systems can switch from one state to another by means of bistable switches 688(1)
Storing and Running System Models 689(6)
Specialized programs make simulating systems easier 691(1)
Standardized system descriptions aid their storage and reuse 692(1)
Summary 692(1)
Further Reading 693(2)
APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Probability Theory, Entropy, and Information 695(2)
Mutually exclusive events 695(1)
Occurrence of two events 696(1)
Occurrence of two random variables 696(1)
Bayesian Analysis 697(4)
Bayes' theorem 697(1)
Inference of parameter values 698(1)
Further Reading 699(2)
Appendix B: Molecular Energy Functions
Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701(5)
Bonding terms 702(2)
Nonbonding terms 704(2)
Potentials used in Threading 706(4)
Potentials of mean force 706(1)
Potential terms relating to solvent effects 707(1)
Further Reading 708(2)
Appendix C: Function Optimization
Full Search Methods 710(1)
Dynamic programming and branch-and-bound 710(1)
Local Optimization 710(5)
The downhill simplex method 711(1)
The steepest descent method 711(3)
The conjugate gradient method 714(1)
Methods using second derivatives 714(1)
Thermodynamic Simulation and Global Optimization 715(6)
Monte Carlo and genetic algorithms 716(2)
Molecular dynamics 718(1)
Simulated annealing 719(1)
Summary 719(1)
Further Reading 719(2)
List of Symbols 721(13)
Glossary 734(17)
Index 751
Preface v
A Note to the Reader vii
List of Reviewers xii
Contents in Brief xiii
Part 1 Background Basics
The Nucleic Acid World
The Structure of DNA and RNA 5(5)
DNA is a linear polymer of only four different bases 5(2)
Two complementary DNA strands interact by base pairing to form a double helix 7(2)
RNA molecules are mostly single stranded but can also have base-pair structures 9(1)
DNA, RNA, and Protein: The Central Dogma 10(4)
DNA is the information store, but RNA is the messenger 11(1)
Messenger RNA is translated into protein according to the genetic code 12(1)
Translation involves transfer RNAs and RNA-containing ribosomes 13(1)
Gene Structure and Control 14(6)
RNA polymerase binds to specific sequences that position it and identify where to begin transcription 15(2)
The signals initiating transcription in eukaryotes are generally more complex than those in bacteria 17(1)
Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation 18(1)
The control of translation 19(1)
The Tree of Life and Evolution 20(5)
A brief survey of the basic characteristics of the major forms of life 21(1)
Nucleic acid sequences can change as a result of mutation 22(1)
Summary 23(1)
Further Reading 24(1)
Protein Structure
Primary and Secondary Structure 25(12)
Protein structure can be considered on several different levels 26(1)
Amino acids are the building blocks of proteins 27(1)
The differing chemical and physical properties of amino acids are due to their side chains 28(1)
Amino acids are covalently linked together in the protein chain by peptide bonds 29(4)
Secondary structure of proteins is made up of α-helices and β-strands 33(2)
Several different types of β-sheet are found in protein structures 35(1)
Turns, hairpins and loops connect helices and strands 36(1)
Implication for Bioinformatics 37(3)
Certain amino acids prefer a particular structural unit 37(1)
Evolution has aided sequence analysis 38(1)
Visualization and computer manipulation of protein structures 38(2)
Proteins Fold to Form Compact Structures 40(6)
The tertiary structure of a protein is defined by the path of the polypeptide chain 41(1)
The stable folded state of a protein represents a state of low energy 41(1)
Many proteins are formed of multiple subunits 42(1)
Summary 43(1)
Further Reading 44(2)
Dealing with Databases
The Structure of Databases 46(6)
Flat-file databases store data as text files 48(1)
Relational databases are widely used for storing biological information 49(1)
XML has the flexibility to define bespoke data classifications 50(1)
Many other database structures are used for biological data 51(1)
Databases can be accessed locally or online and often link to each other 52(1)
Types of Database 52(3)
There's more to databases than just data 53(1)
Primary and derived data 53(1)
How we define and connect things is very important: Ontologies 54(1)
Looking for Databases 55(6)
Sequence databases 55(3)
Microarray databases 58(1)
Protein interaction databases 58(1)
Structural databases 59(2)
Data Quality 61(11)
Nonredundancy is especially important for some applications of sequence databases 62(1)
Automated methods can be used to check for data consistency 63(1)
Initial analysis and annotation is usually automated 64(1)
Human intervention is often required to produce the highest quality annotation 65(1)
The importance of updating databases and entry identifier and version numbers 65(1)
Summary 66(1)
Further Reading 67(5)
Part 2 Sequence Alignments
Applications Chapter
Producing and Analyzing Sequence Alignments
Principles of Sequence Alignment 72(4)
Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity 73(1)
Alignment can reveal homology between sequences 74(1)
It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences 75(1)
Scoring Alignments 76(5)
The quality of an alignment is measured by giving it a quantitative score 76(1)
The simplest way of quantifying similarity between two sequences is percentage identity 76(1)
The dot-plot gives a visual assessment of similarity based on identity 77(2)
Genuine matches do not have to be identical 79(2)
There is a minimum percentage identity that can be accepted as significant 81(1)
There are many different ways of scoring an alignment 81(1)
Substitution Matrices 81(4)
Substitution matrices are used to assign individual scores to aligned sequence positions 81(1)
The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences 82(2)
The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence 84(1)
The choice of substitution matrix depends on the problem to be solved 84(1)
Inserting Gaps 85(2)
Gaps inserted in a sequence to maximize similarity require a scoring penalty 85(1)
Dynamic programming algorithms can determine the optimal introduction of gaps 86(1)
Types of Alignment 87(6)
Different kinds of alignments are useful in different circumstances 87(3)
Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences 90(1)
Multiple alignments can be constructed by several different techniques 90(1)
Multiple alignments can improve the accuracy of alignment for sequences of low similarity 91(1)
ClustalW can make global multiple alignments of both DNA and protein sequences 92(1)
Multiple alignments can be made by combining a series of local alignments 92(1)
Alignment can be improved by incorporating additional information 93(1)
Searching Databases 93(4)
Fast yet accurate search algorithms have been developed 94(1)
FASTA is a fast database-search method based on matching short identical segments 95(1)
BLAST is based on finding very similar short segments 95(1)
Different versions of BLAST and FASTA are used for different problems 95(1)
PSI-BLAST enables profile-based database searches 96(1)
SSEARCH is a rigorous alignment method 97(1)
Searching with Nucleic Acid or Protein Sequences 97(6)
DNA or RNA sequences can be used either directly or after translation 97(1)
The quality of a database match has to be tested to ensure that it could not have arisen by chance 97(1)
Choosing an appropriate E-value threshold helps to limit a database search 98(2)
Low-complexity regions can complicate homology searches 100(2)
Different databases can be used to solve particular problems 102(1)
Protein Sequence Motifs or Patterns 103(4)
Creation of pattern databases requires expert knowledge 104(1)
The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences 105(2)
Searching Using Motifs and Patterns 107(2)
The PROSITE database can be searched for protein motifs and patterns 107(1)
The pattern-based program PHI-BLAST searches for both homology and matching motifs 108(1)
Patterns can be generated from multiple sequences using PRATT 108(1)
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family 109(1)
The Pfam database defines profiles of protein families 109(1)
Patterns and Protein Function 109(8)
Searches can be made for particular functional sites in proteins 109(1)
Sequence comparison is not the only way of analyzing protein sequences 110(1)
Summary 111(1)
Further Reading 112(5)
Theory Chapter
Pairwise Sequence Alignment and Database Searching
Substitution Matrices and Scoring 117(10)
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor 117(2)
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins 119(3)
The BLOSUM matrices were designed to find conserved regions of proteins 122(3)
Scoring matrices for nucleotide sequence alignment can be derived in similar ways 125(1)
The substitution scoring matrix used must be appropriate to the specific alignment problem 126(1)
Gaps are scored in a much more heuristic way than substitutions 126(1)
Dynamic Programming Algorithms 127(14)
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm 129(6)
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm 135(4)
Time can be saved with a loss of rigor by not calculating the whole matrix 139(2)
Indexing Techniques and Algorithmic Approximations 141(12)
Suffix trees locate the positions of repeats and unique sequences 141(2)
Hashing is an indexing technique that lists the starting positions of all k-tuples 143(1)
The FASTA algorithm uses hashing and chaining for fast database searching 144(3)
The BLAST algorithm makes use of finite-state automata 147(3)
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms 150(3)
Alignment Score Significance 153(3)
The statistics of gapped local alignments can be approximated by the same theory 156(1)
Aligning Complete Genome Sequences 156(11)
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms 157(2)
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms 159(1)
Summary 159(2)
Further Reading 161(6)
Theory Chapter
Patterns, Profiles, and Multiple Alignments
Profiles and Sequence Logos 167(12)
Position-specific scoring matrices are an extension of substitution scoring matrices 168(3)
Methods for overcoming a lack of data in deriving the values for a PSSM 171(5)
PSI-BLAST is a sequence database searching program 176(1)
Representing a profile as a logo 177(2)
Profile Hidden Markov Models 179(14)
The basic structure of HMMs used in sequence alignment to profiles 180(5)
Estimating HMM parameters using aligned sequences 185(2)
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths 187(3)
Estimating HMM parameters using unaligned sequences 190(3)
Aligning Profiles 193(3)
Comparing two PSSMs by alignment 193(2)
Aligning profile HMMs 195(1)
Multiple Sequence Alignments by Gradual Sequence Addition 196(11)
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment 198(2)
Many different scoring schemes have been used in constructing multiple alignments 200(4)
The multiple alignment is built using the guide tree and profile methods and may be further refined 204(3)
Other Ways of Obtaining Multiple Alignments 207(4)
The multiple sequence alignment program DIALIGN aligns ungapped blocks 207(2)
The SAGA method of multiple alignment uses a genetic algorithm 209(2)
Sequence Pattern Discovery 211(14)
Discovering patterns in a multiple alignment: eMOTIF and AACC 213(2)
Probabilistic searching for common patterns in sequences: Gibbs and MEME 215(2)
Searching for more general sequence patterns 217(1)
Summary 218(1)
Further Reading 219(6)
Part 3 Evolutionary Processes
Applications Chapter
Recovering Evolutionary History
The Structure and Interpretation of Phylogenetic Trees 225(10)
Phylogenetic trees reconstruct evolutionary relationships 225(5)
Tree topology can be described in several ways 230(2)
Consensus and condensed trees report the results of comparing tree topologies 232(3)
Molecular Evolution and its Consequences 235(13)
Most related sequences have many positions that have mutated several times 236(1)
The rate of accepted mutation is usually not the same for all types of base substitution 236(2)
Different codon positions have different mutation rates 238(1)
Only orthologous genes should be used to construct species phylogenetic trees 239(8)
Major changes affecting large regions of the genome are surprisingly common 247(1)
Phylogenetic Tree Reconstruction 248(20)
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249(1)
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset 249(2)
A model of evolution must be chosen to use with the method 251(4)
All phylogenetic analyses must start with an accurate multiple alignment 255(1)
Phylogenetic analyses of a small dataset of 16S RNA sequence data 255(4)
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259(5)
Summary 264(1)
Further Reading 265(3)
Theory Chapter
Building Phylogenetic Trees
Evolutionary Models and the Calculation of Evolutionary Distance 268(8)
A simple but inaccurate measure of evolutionary distance is the p-distance 268(2)
The Poisson distance correction takes account of multiple mutations at the same site 270(1)
The Gamma distance correction takes account of mutation rate variation at different sequence positions 270(1)
The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences 271(1)
More complex models distinguish between the relative frequencies of different types of mutation 272(3)
There is a nucleotide bias in DNA sequences 275(1)
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment 276(1)
Generating Single Phylogenetic Trees 276(10)
Clustering methods produce a phylogenetic tree based on evolutionary distances 276(2)
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree 278(1)
The Fitch-Margoliash method produces an unrooted additive tree 279(3)
The neighbor-joining method is related to the concept of minimum evolution 282(3)
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree 285(1)
Generating Multiple Tree Topologies 286(7)
The branch-and-bound method greatly improves the efficiency of exploring tree topology 288(1)
Optimization of tree topology can be achieved by making a series of small changes to an existing tree 288(3)
Finding the root gives a phylogenetic tree a direction in time 291(2)
Evaluating Tree Topologies 293(14)
Functions based on evolutionary distances can be used to evaluate trees 293(4)
Unweighted parsimony methods look for the trees with the smallest number of mutations 297(3)
Mutations can be weighted in different ways in the parsimony method 300(2)
Trees can be evaluated using the maximum likelihood method 302(3)
The quartet-puzzling method also involves maximum likelihood in the standard implementation 305(1)
Bayesian methods can also be used to reconstruct phylogenetic trees 306(1)
Assessing the Reliability of Tree Features and Comparing Trees 307(11)
The long-branch attraction problem can arise even with perfect data and methodology 308(1)
Tree topology can be tested by examining the interior branches 309(1)
Tests have been proposed for comparing two or more alternative trees 310(1)
Summary 311(1)
Further Reading 312(6)
Part 4 Genome Characteristics
Applications Chapter
Revealing Genome Features
Preliminary Examination of Genome Sequence 318(4)
Whole genome sequences can be split up to simplify gene searches 319(1)
Structural RNA genes and repeat sequences can be excluded from further analysis 319(3)
Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322(1)
Gene Prediction in Prokaryotic Genomes 322(1)
Gene Prediction in Eukaryotic Genomes 323(14)
Programs for predicting exons and introns use a variety of approaches 323(1)
Gene predictions must preserve the correct reading frame 324(3)
Some programs search for exons using only the query sequence and a model for exons 327(5)
Some programs search for genes using only the query sequence and a gene model 332(2)
Genes can be predicted using a gene model and sequence similarity 334(2)
Genomes of related organisms can be used to improve gene prediction 336(1)
Splice Site Detection 337(1)
Splice sites can be detected independently by specialized programs 338(1)
Prediction of Promoter Regions 338(4)
Prokaryotic promoter regions contain relatively well-defined motifs 339(1)
Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340(1)
A variety of promoter-prediction methods are available online 340(1)
Promoter prediction results are not very clear-cut 341(1)
Confirming Predictions 342(4)
There are various methods for calculating the accuracy of gene-prediction programs 342(1)
Translating predicted exons can confirm the correctness of the prediction 343(1)
Constructing the protein and identifying homologs 343(3)
Genome Annotation 346(7)
Genome annotation is the final step in genome analysis 347(1)
Gene ontology provides a standard vocabulary for gene annotation 348(5)
Large Genome Comparisons 353(8)
Summary 354(1)
Further Reading 355(6)
Theory Chapter
Gene Detection and Genome Annotation
Detection of Functional RNA Molecules Using Decision Trees 361(3)
Detection of tRNA genes using the tRNAscan algorithm 361(1)
Detection of tRNA genes in eukaryotic genomes 362(2)
Features Useful for Gene Detection in Prokaryotes 364(4)
Algorithms for Gene Detection in Prokaryotes 368(9)
GeneMark uses inhomogeneous Markov chains and dicodon statistics 368(3)
GLIMMER uses interpolated Markov models of coding potential 371(1)
ORPHEUS uses homology, codon statistics, and ribosome-binding sites 372(1)
GeneMark.hmm uses explicit state duration hidden Markov models 373(3)
EcoParse is an HMM gene model 376(1)
Features Used in Eukaryotic Gene Detection 377(4)
Differences between prokaryotic and eukaryotic genes 377(2)
Introns, exons, and splice sites 379(2)
Promoter sequences and binding sites for transcription factors 381(1)
Predicting Eukaryotic Gene Signals 381(8)
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods 381(2)
A set of models has been designed to locate the site of core promoter sequence signals 383(4)
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results 387(2)
Predicting eukaryotic transcription and translation start sites 389(1)
Translation and transcription stop signals complete the gene definition 389(1)
Predicting Exon/Intron Structure 389(8)
Exons can be identified using general sequence properties 390(2)
Splice-site prediction 392(1)
Splice sites can be predicted by sequence patterns combined with base statistics 393(1)
GenScan uses a combination of weight matrices and decision trees to locate splice sites 394(1)
GeneSplicer predicts splice sites using first-order Markov chains 394(1)
NetPlantGene uses neural networks with intron and exon predictions to predict splice sites 395(1)
Other splicing features may yet be exploited for splice-site prediction 396(1)
Specific methods exist to identify initial and terminal exons 396(1)
Exons can be defined by searching databases for homologous regions 397(1)
Complete Eukaryotic Gene Models 397(2)
Beyond the Prediction of Individual Genes 399(14)
Functional annotation 400(3)
Comparison of related genomes can help resolve uncertain predictions 403(2)
Evaluation and reevaluation of gene-detection methods 405(1)
Summary 405(1)
Further Reading 406(7)
Part 5 Secondary Structures
Applications Chapter
Obtaining Secondary Structure from Sequence
Types of Prediction Methods 413(3)
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure 414(1)
Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure 414(1)
Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods 415(1)
Training and Test Databases 416(1)
There are several ways to define protein secondary structures 417(1)
Assessing the Accuracy of Prediction Programs 417(4)
Q3 measures the accuracy of individual residue assignments 417(1)
Secondary structure predictions should not be expected to reach 100% residue accuracy 418(1)
The Sov value measures the prediction accuracy for whole elements 419(1)
CAFASP/CASP: Unbiased and readily available protein prediction assessments 419(2)
Statistical and Knowledge-Based Methods 421(9)
The GOR method uses an information theory approach 422(3)
The program Zpred includes multiple alignment of homologous sequences and residue conservation information 425(1)
There is an overall increase in prediction accuracy using multiple sequence information 426(2)
The nearest-neighbor method: The use of multiple nonhomologous sequences 428(1)
PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 428(2)
Neural Network Methods of Secondary Structure Prediction 430(5)
Assessing the reliability of neural net predictions 432(1)
Several examples of Web-based neural network secondary structure prediction programs 432(2)
PROF: Protein forecasting 434(1)
PSIPRED 434(1)
Jnet: Using several alternative representations of the sequence alignment 434(1)
Some Secondary Structures Require Specialized Prediction Methods 435(3)
Transmembrane proteins 436(1)
Quantifying the preference for a membrane environment 437(1)
Prediction of Transmembrane Protein Structure 438(13)
Multi-helix membrane proteins 439(2)
A selection of prediction programs to predict transmembrane helices 441(2)
Statistical methods 443(1)
Knowledge-based prediction 443(1)
Evolutionary information from protein families improves the prediction 444(1)
Neural nets in transmembrane prediction 445(1)
Predicting transmembrane helices with hidden Markov models 446(1)
Comparing the results: What to choose 447(1)
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448(1)
Prediction of transmembrane structure containing β-strands 448(3)
Coiled-coil Structures 451(4)
The COILS prediction program 452(1)
PAIRCOIL and MULTICOIL are an extension of the COILS algorithm 453(1)
Zipping the Leucine zipper: A specialized coiled coil 453(2)
RNA Secondary Structure Prediction 455(8)
Summary 458(1)
Further Reading 459(4)
Theory Chapter
Predicting Secondary Structures
Defining Secondary Structure and Prediction Accuracy 463(9)
The definitions used for automatic protein secondary structure assignment do not give identical results 464(5)
There are several different measures of the accuracy of secondary structure prediction 469(3)
Secondary Structure Prediction Based on Residue Propensities 472(13)
Each structural state has an amino acid preference which can be assigned as a residue propensity 473(3)
The simplest prediction methods are based on the average residue propensity over a sequence window 476(3)
Residue propensities are modulated by nearby sequence 479(5)
Predictions can be significantly improved by including information from homologous sequences 484(1)
The Nearest-Neighbor Methods are Based on Sequence Segment Similarity 485(7)
Short segments of similar sequence are found to have similar structure 487(1)
Several sequence similarity measures have been used to identify nearest-neighbor segments 488(2)
A weighted average of the nearest-neighbor segment structures is used to make the prediction 490(1)
A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491(1)
Neural Networks Have Been Employed Successfully for Secondary Structure Prediction 492(12)
Layered feed-forward neural networks can transform a sequence into a structural prediction 494(8)
Inclusion of information on homologous sequences improves neural network accuracy 502(1)
More complex neural nets have been applied to predict secondary and other structural features 503(1)
Hidden Markov Models Have Been Applied to Structure Prediction 504(6)
HMM methods have been found especially effective for transmembrane proteins 506(3)
Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509(1)
General Data Classification Techniques Can Predict Structural Features 510(14)
Support vector machines have been successfully used for protein structure prediction 511(1)
Discriminants, SOMs, and other methods have also been used 512(2)
Summary 514(1)
Further Reading 515(9)
Part 6 Tertiary Structures
Applications Chapter
Modeling Protein Structure
Potential Energy Functions and Force Fields 524(5)
The conformation of a protein can be visualized in terms of a potential energy surface 525(1)
Conformational energies can be described by simple mathematical functions 525(1)
Similar force fields can be used to represent conformational energies in the presence of averaged environments 526(1)
Potential energy functions can be used to assess a modeled structure 527(1)
Energy minimization can be used to refine a modeled structure and identify local energy minima 527(1)
Molecular dynamics and simulated annealing are used to find global energy minima 528(1)
Obtaining a Structure by Threading 529(8)
The prediction of protein folds in the absence of known structural homologs 531(1)
Libraries or databases of nonredundant protein folds are used in threading 531(1)
Two distinct types of scoring schemes have been used in threading methods 531(2)
Dynamic programming methods can identify optimal alignments of target sequences and structural folds 533(1)
Several methods are available to assess the confidence to be put on the fold prediction 534(1)
The C2-like domain from the Dictyostelia: A practical example of threading 535(2)
Principles of Homology Modeling 537(5)
Closely related target and template sequences give better models 539(1)
Significant sequence identity depends on the length of the sequence 540(1)
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541(1)
Model building is based on a number of assumptions 541(1)
Steps in Homology Modeling 542(10)
Structural homologs to the target protein are found in the PDB 543(1)
Accurate alignment of target and template sequences is essential for successful modeling 543(1)
The structurally conserved regions of a protein are modeled first 544(1)
The modeled core is checked for misfits before proceeding to the next stage 545(1)
Sequence realignment and remodeling may improve the structure 545(1)
Insertions and deletions are usually modeled as loops 545(2)
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547(1)
Energy minimization is used to relieve structural errors 548(1)
Molecular dynamics can be used to explore possible conformations for mobile loops 548(1)
Models need to be checked for accuracy 549(2)
How far can homology models be trusted? 551(1)
Automated Homology Modeling 552(5)
The program MODELLER models by satisfying protein structure constraints 553(1)
COMPOSER uses fragment-based modeling to automatically generate a model 553(1)
Automated methods available on the Web for comparative modeling 554(1)
Assessment of structure prediction 554(3)
Homology Modeling of PI3 Kinase p110α 557(11)
Swiss-Pdb Viewer can be used for manual or semi-manual modeling 557(1)
Alignment, core modeling, and side-chain modeling are carried out all in one 558(1)
The loops are modeled from a database of possible structures 559(1)
Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer 559(1)
MolIDE is a downloadable semi-automatic modeling package 560(1)
Automated modeling on the Web illustrated with p110α kinase 561(2)
Modeling a functionally related but sequentially dissimilar protein: mTOR 563(1)
Generating a multidomain three-dimensional structure from sequence 564(1)
Summary 564(1)
Further Reading 565(3)
Applications Chapter
Analyzing Structure-Function Relationships
Functional Conservation 568(6)
Functional regions are usually structurally conserved 569(1)
Similar biochemical function can be found in proteins with different folds 570(1)
Fold libraries identify structurally similar proteins regardless of function 571(3)
Structure Comparison Methods 574(6)
Finding domains in proteins aids structure comparison 574(2)
Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison 576(1)
The CE method builds up a structural alignment from pairs of aligned protein segments 576(1)
The Vector Alignment Search Tool (VAST) aligns secondary structural elements 577(1)
DALI identifies structure superposition without maintaining segment order 578(1)
FATCAT introduces rotations between rigid segments 579(1)
Finding Binding Sites 580(7)
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582(2)
Searching for protein-protein interactions using surface properties 584(1)
Surface calculations highlight clefts or holes in a protein that may serve as binding sites 585(1)
Looking at residue conservation can identify binding sites 586(1)
Docking Methods and Programs 587(14)
Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known 588(1)
Specialized docking programs will automatically dock a ligand to a structure 588(2)
Scoring functions are used to identify the most likely docked ligand 590(1)
The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site 590(1)
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area 591(1)
GOLD is a flexible docking program, which utilizes a genetic algorithm 591(1)
The water molecules in binding sites should also be considered 592(1)
Summary 593(1)
Further Reading 594(7)
Part 7 Cells and Organisms
Proteome and Gene Expression Analysis
Analysis of Large-scale Gene Expression 601(11)
The expression of large numbers of different genes can be measured simultaneously by DNA microarrays 602(1)
Gene expression microarrays are mainly used to detect differences in gene expression in different conditions 602(2)
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604(1)
Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues 605(1)
Facilitating the integration of data from different places and experiments 606(1)
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606(2)
Techniques based on self-organizing maps can be used for analyzing microarray data 608(2)
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters 610(1)
Clustered gene expression data can be used as a tool for further research 610(2)
Analysis of Large-scale Protein Expression 612(14)
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613(1)
Measuring the expression levels shown in 2D gels 614(1)
Differences in protein expression levels between different samples can be detected by 2D gels 615(1)
Clustering methods are used to identify protein spots with similar expression patterns 615(3)
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data 618(1)
The changes in a set of protein spots can be tracked over a number of different samples 618(2)
Databases and online tools are available to aid the interpretation of 2D gel data 620(1)
Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins 621(1)
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means 621(1)
Protein-identification programs for mass spectrometry are freely available on the Web 622(1)
Mass spectrometry can be used to measure protein concentration 623(1)
Summary 623(1)
Further Reading 624(2)
Clustering Methods and Statistics
Expression Data Require Preparation Prior to Analysis 626(7)
Data normalization is designed to remove systematic experimental errors 627(1)
Expression levels are often analyzed as ratios and are usually transformed by taking logarithms 628(2)
Sometimes further normalization is useful after the data transformation 630(1)
Principal component analysis is a method for combining the properties of an object 631(2)
Cluster Analysis Requires Distances to be Defined Between all Data Points 633(4)
Euclidean distance is the measure used in everyday life 634(1)
The Pearson correlation coefficient measures distance in terms of the shape of the expression response 635(1)
The Mahalanobis distance takes account of the variation and correlation of expression responses 636(1)
Clustering Methods Identify Similar and Distinct Expression Patterns 637(14)
Hierarchical clustering produces a related set of alternative partitions of the data 639(2)
k-means clustering groups data into several clusters but does not determine a relationship between clusters 641(3)
Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters 644(2)
Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem 646(2)
The self-organizing tree algorithm (SOTA) determines the number of clusters required 648(1)
Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples 649(1)
The validity of clusters is determined by independent methods 650(1)
Statistical Analysis can Quantify the Significance of Observed Differential Expression 651(8)
t-tests can be used to estimate the significance of the difference between two expression levels 654(2)
Nonparametric tests are used to avoid making assumptions about the data sampling 656(1)
Multiple testing of differential expression requires special techniques to control error rates 657(2)
Gene and Protein Expression Data Can be Used to Classify Samples 659(10)
Many alternative methods have been proposed that can classify samples 660(1)
Support vector machines are another form of supervised learning algorithms that can produce classifiers 661(1)
Summary 662(2)
Further Reading 664(5)
Systems Biology
What is a System? 669(10)
A system is more than the sum of its parts 669(1)
A biological system is a living network 670(1)
Databases are useful starting points in constructing a network 671(1)
To construct a model more information is needed than a network 672(2)
There are three possible approaches to constructing a model 674(4)
Kinetic models are not the only way in systems biology 678(1)
Structure of the Model 679(4)
Control circuits are an essential part of any biological system 680(1)
The interactions in networks can be represented as simple differential equations 680(3)
Robustness of Biological Systems 683(6)
Robustness is a distinct feature of complexity in biology 684(1)
Modularity plays an important part in robustness 685(1)
Redundancy in the system can provide robustness 686(2)
Living systems can switch from one state to another by means of bistable switches 688(1)
Storing and Running System Models 689(6)
Specialized programs make simulating systems easier 691(1)
Standardized system descriptions aid their storage and reuse 692(1)
Summary 692(1)
Further Reading 693(2)
APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Probability Theory, Entropy, and Information 695(2)
Mutually exclusive events 695(1)
Occurrence of two events 696(1)
Occurrence of two random variables 696(1)
Bayesian Analysis 697(4)
Bayes' theorem 697(1)
Inference of parameter values 698(1)
Further Reading 699(2)
Appendix B: Molecular Energy Functions
Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701(5)
Bonding terms 702(2)
Nonbonding terms 704(2)
Potentials used in Threading 706(4)
Potentials of mean force 706(1)
Potential terms relating to solvent effects 707(1)
Further Reading 708(2)
Appendix C: Function Optimization
Full Search Methods 710(1)
Dynamic programming and branch-and-bound 710(1)
Local Optimization 710(5)
The downhill simplex method 711(1)
The steepest descent method 711(3)
The conjugate gradient method 714(1)
Methods using second derivatives 714(1)
Thermodynamic Simulation and Global Optimization 715(6)
Monte Carlo and genetic algorithms 716(2)
Molecular dynamics 718(1)
Simulated annealing 719(1)
Summary 719(1)
Further Reading 719(2)
List of Symbols 721(13)
Glossary 734(17)
Index 751
Understanding Bioinformatics