Introduction
Bioconductor software has become a standard tool for the analysis and comprehension of data from high-throughput genomics experiments. Its application spans a broad field of technologies used in contemporary molecular biology. In this volume, the authors present a collection of case studies that apply Bioconductor tools to the analysis of microarray gene expression data. Topics covered include:
* import and preprocessing of data from various sources
* statistical modeling of differential gene expression
* biological metadata
* application of graphs and graph rendering
* machine learning for clustering and classification problems
* gene set enrichment analysis

Each chapter of this book describes an analysis of real data using a hands-on, example-driven approach. Short exercises help in the learning process and invite more advanced consideration of key topics. The book is a dynamic document: all the code shown can be executed on a local computer, and readers are able to reproduce every computation, figure, and table.

The authors of this book have long experience in teaching introductory and advanced courses on the application of Bioconductor software. Florian Hahne is a postdoc at the Fred Hutchinson Cancer Research Center in Seattle, developing novel methodologies for the analysis of high-throughput cell-biological data. Wolfgang Huber is a research group leader in the European Molecular Biology Laboratory at the European Bioinformatics Institute in Cambridge. He has wide-ranging experience in the development of methods for the analysis of functional genomics experiments. Robert Gentleman is Head of the Program in Computational Biology at the Fred Hutchinson Cancer Research Center in Seattle, and he is one of the two authors of the original R system. Seth Falcon is a member of the R core team and former project manager and developer for the Bioconductor project.
Table of Contents
Preface 5
Contents 7
List of Contributors 11
1 The ALL Dataset 12
1.1 Introduction 12
1.2 The ALL data 12
1.3 Data subsetting 13
1.4 Nonspecific filtering 14
1.5 BCR/ABL ALL1/AF4 subset 15
2 R and Bioconductor Introduction 16
2.1 Finding help in R 16
2.2 Working with packages 18
2.3 Some basic R 19
2.3.1 Functions 20
2.3.2 The apply family of functions 20
2.3.3 Environments 21
2.4 Structures for genomic data 22
2.4.1 Building an ExpressionSet from .CEL and other files 23
2.4.2 Building an ExpressionSet from scratch 23
2.4.3 ExpressionSet basics 29
2.5 Graphics 31
3 Processing Affymetrix Expression Data 36
3.1 The input data: CEL files 36
3.1.1 The sample annotation 37
3.2 Quality assessment 39
3.3 Preprocessing 43
3.4 Ranking and filtering probe sets 44
3.4.1 Summary statistics and tests for ranking 45
3.4.2 Visualization of differential expression 46
3.4.3 Highlighting interesting genes 47
3.4.4 Selecting hit lists and the multiple testing problem 49
3.4.5 Annotation 49
3.5 Advanced preprocessing 51
3.5.1 PM and MM probes 51
3.5.2 Background-correction 52
3.5.3 Summarization 54
4 Two-Color Arrays 57
4.1 Introduction 57
4.2 Data import 58
4.3 Image plots 60
4.4 Normalization 60
4.5 Differential expression 67
5 Fold-Changes, Log-Ratios, Background Correction, Shrinkage Estimation, and Variance Stabilization 72
5.1 Fold-changes and (log-)ratios 72
5.2 Background-correction and generalized logarithm 74
5.3 Calling VSN 79
5.4 How does VSN work? 81
5.5 Robust fitting and the “most genes not differentially expressed” assumption 83
5.6 Single-color normalization 87
5.7 The interpretation of glog-ratios 88
5.8 Reference normalization 90
6 Easy Differential Expression 92
6.1 Example data 92
6.2 Nonspecific filtering 93
6.3 Differential expression 94
6.4 Multiple testing correction 96
7 Differential Expression 98
7.1 Motivation 98
7.1.1 The gene-by-gene approach 98
7.1.2 Nonspecific filtering 98
7.1.3 Fold-change versus t-test 99
7.2 Nonspecific filtering 99
7.3 Differential expression 101
7.4 Multiple testing 103
7.5 Moderated test statistics and the limma package 104
7.5.1 Small sample sizes 105
7.6 Gene selection by Receiver Operator Characteristic (ROC) 108
7.7 When power increases 110
8 Annotation and Metadata 112
8.1 Our data 112
8.2 Multiple probe sets per gene 115
8.3 Categories and overrepresentation 116
8.3.1 Chromosomal location 118
8.4 Working with GO 118
8.4.1 Functional analyses 119
8.5 Other annotations available 121
8.6 biomaRt 122
8.7 Database versions of annotation packages 124
8.7.1 Mapping Symbols 126
8.7.2 Other capabilities 128
9 Supervised Machine Learning 129
9.1 Introduction 129
9.1.1 Supervised machine learning check list 130
9.2 The example dataset 131
9.2.1 Nonspecific filtering of features 131
9.3 Feature selection and standardization 132
9.4 Selecting a distance 132
9.5 Machine learning 134
9.6 Cross-validation 137
9.7 Random forests 140
9.7.1 Feature selection 141
9.7.2 More exercises 142
9.8 Multigroup classification 143
10 Unsupervised Machine Learning 145
10.1 Preliminaries 145
10.1.1 Data 146
10.2 Distances 147
10.3 How many clusters? 150
10.4 Hierarchical clustering 152
10.5 Partitioning methods 154
10.5.1 PAM 155
10.6 Self-organizing maps 156
10.7 Hopach 159
10.8 Silhouette plots 160
10.9 Exploring transformations 162
10.10 Remarks 165
11 Using Graphs for Interactome Data 166
11.1 Introduction 166
11.2 Exploring the protein interaction graph 167
11.3 The co-expression graph 169
11.4 Testing the association between physical interaction and coexpression 171
11.5 Some harder problems 172
11.6 Reading PSI-25 XML files from IntAct with the Rintact package 172
11.6.1 Introduction 172
11.6.2 Loading R Packages 173
11.6.3 Obtaining the interaction information 173
11.6.4 Obtaining protein complex composition information 176
11.6.5 Creating graph objects with Rintact 177
12 Graph Layout 180
12.1 Introduction 180
12.2 Layout and rendering using Rgraphviz 182
12.2.1 Rendering parameters 182
12.2.2 Layout parameters 186
12.3 Directed graphs 187
12.3.1 Reciprocated edges 191
12.4 Subgraphs 192
12.5 Tooltips and hyperlinks on graphs 194
13 Gene Set Enrichment Analysis 199
13.1 Introduction 199
13.1.1 Simple GSEA 200
13.1.2 Visualization 201
13.1.3 Data representation 201
13.2 Data analysis 202
13.2.1 Preprocessing 202
13.2.2 Using KEGG 203
13.2.3 Permutation testing 206
13.2.4 Chromosome bands 207
13.3 Identifying and assessing the effects of overlapping gene sets 209
14 Hypergeometric Testing Used for Gene Set Enrichment Analysis 212
14.1 Introduction 212
14.2 The basic problem 213
14.3 Preprocessing and inputs 214
14.3.1 Nonspecific filtering 215
14.3.2 Gene selection via t-test 217
14.3.3 Inputs 218
14.4 Outputs and result summarization 220
14.4.1 Calling the hyperGTest function 220
14.4.2 Summarizing a GOHyperGResult object 220
14.4.3 Generating an HTML report of test results 221
14.4.4 Results in detail 221
14.5 The conditional hypergeometric test 223
14.6 Other collections of gene sets 224
14.6.1 Chromosome bands 225
14.6.2 KEGG 225
14.6.3 PFAM 225
15 Solutions to Exercises 226
2 R and Bioconductor Introduction 226
3 Processing Affymetrix Expression Data 231
4 Two-Color Arrays 235
5 Fold-Changes, Log-Ratios, Background Correction, Shrinkage Estimation, and Variance Stabilization 236
6 Easy Differential Expression 238
7 Differential Expression 238
8 Annotation and Metadata 239
9 Supervised Machine Learning 246
10 Unsupervised Machine Learning 254
11 Using Graphs for Interactome Data 261
12 Graph Layout 264
13 Gene Set Enrichment Analysis 266
14 Hypergeometric Testing Used for Gene Set Enrichment Analysis 270
References 275
Index 280