Subtitle: None

Author: Boris Mirkin

Classification:

ISBN: 9780857292872

Description

Core Concepts in Data Analysis: Summarization, Correlation and Visualization provides in-depth descriptions of those data analysis approaches that either summarize data (principal component analysis and clustering, including hierarchical and network clustering) or correlate different aspects of data (decision trees, linear rules, neural networks, and Bayes rule). Boris Mirkin takes an unconventional approach, introducing the concept of multivariate data summarization as a counterpart to conventional machine learning prediction schemes and drawing on techniques from statistics, data analysis, data mining, machine learning, computational intelligence, and information retrieval. Innovations that follow from his in-depth analysis of the models underlying summarization techniques are introduced and applied to challenging issues such as the number of clusters, mixed scale data standardization, and interpretation of the solutions, as well as to relations between seemingly unrelated concepts: goodness-of-fit functions for classification trees and data standardization, spectral clustering and additive clustering, correlation and visualization of contingency data.

The mathematical detail is encapsulated in the so-called "formulation" parts, whereas most of the material is delivered through "presentation" parts that explain the methods by applying them to small real-world data sets; concise "computation" parts address the algorithmic and coding issues. Four layers of active learning and self-study exercises are provided: worked examples, case studies, projects, and questions.

Table of Contents

Preface 5
Acknowledgments 7
Contents 8
List of Projects 13
List of Case Studies 14
List of Worked Examples 15
1 Introduction: What Is Core 17
1.1 Summarization and Correlation: Two Main Goals of Data Analysis 17
1.2 Case Study Problems 25
Case 1.2.1: Company 25
Case 1.2.2: Iris 27
Case 1.2.3: Market Towns 29
Case 1.2.4: Student 31
Case 1.2.5: Intrusion 31
Case 1.2.6: Confusion 33
Case 1.2.7: Amino Acid Substitution Rates 35
1.3 An Account of Data Visualization 37
1.3.1 General 37
1.3.2 Highlighting 38
1.3.3 Integrating Different Aspects 41
1.3.4 Narrating a Story 44
1.4 Summary 44
References 45
2 1D Analysis: Summarization and Visualization of a Single Feature 47
2.1 Quantitative Feature: Distribution and Histogram 47
P2.1.1 Presentation 47
F2.1.2 Formulation 49
C2.1.3 Computation 51
2.2 Further Summarization: Centers and Spreads 52
P2.2.1 Centers and Spreads: Presentation 52
Worked example 2.1. Mean 52
Worked example 2.2. Median 52
Worked example 2.3. P-quantile (percentile) 54
Worked example 2.4. Mode 54
F2.2.2 Centers and Spreads: Formulation 55
F2.2.2.1 Data Analysis Perspective 55
F2.2.2.2 Probabilistic Statistics Perspective 58
C2.2.3 Centers and Spreads: Computation 59
2.3 Binary and Categorical Features 59
P2.3.1 Presentation 59
Worked example 2.5. Entropy and Gini index of a distribution 62
F2.3.2 Formulation 62
C2.3.3 Computation 65
2.4 Modeling Uncertainty: Intervals and Fuzzy Sets 65
2.4.1 Individual Membership Functions 65
2.4.2 Central Fuzzy Set 68
Project 2.1. Computing Minkowski metric's center 68
Project 2.2. Analysis of a multimodal distribution 71
Project 2.3. Computational validation of the mean by bootstrapping 73
Project 2.4. K-fold cross validation 76
2.5 Summary 80
References 81
3 2D Analysis: Correlation and Visualization of Two Features 82
3.1 General 82
3.2 Two Quantitative Features Case 83
P3.2.1 Scatter-Plot, Linear Regression and Correlation Coefficients 83
P3.2.2 Validity of the Regression 85
Worked example 3.1. Determination coefficient 85
Worked example 3.2. Bootstrap validity testing 86
Worked example 3.3. Prediction error of the regression equation 88
F3.2.3 Linear Regression: Formulation 89
F3.2.3.1 Fitting Linear Regression 89
F3.2.3.2 Correlation Coefficient and Its Properties 90
F3.2.3.3 Linearization of Non-linear Regression 92
C3.2.4 Linear Regression: Computation 93
Project 3.1. 2D analysis, linear regression and bootstrapping 93
Project 3.2. Non-linear and linearized regression: a nature-inspired algorithm 99
Case-study 3.1. Growth of Investment 101
Case-study 3.2. Correlation Between Iris Sepal Length and Width 103
3.3 Mixed Scale Case: Nominal Feature Versus a Quantitative One 104
P3.3.1 Box-Plot, Tabular Regression and Correlation Ratio 104
Worked example 3.4. Tabular regression of Age (quantitative target) over Occupation (categorical predictor) in Students data 106
Worked example 3.5. Box-plot of Age at Occupation categories in Students data 107
Worked example 3.6. Correlation ratio 108
F3.3.2 Tabular Regression: Formulation 108
3.3.3 Nominal Target 110
3.3.3.1 Nearest Neighbor Classifier 110
Worked example 3.7. Nearest neighbor classifier 112
3.3.3.2 Interval Predicate Classifier 113
Worked example 3.8. Category contributions for interval predicate productions 114
3.4 Two Nominal Features Case 115
P3.4.1 Analysis of Contingency Tables: Presentation 115
P3.4.1.1 Deriving Conceptual Relations from Statistics 115
Worked example 3.9. Contingency table on Market towns data 115
Worked example 3.10. Equivalence and implication from a contingency table 116
Case study 3.3. Trimming Contingency Data: A Bad Option 117
P3.4.1.2 Capturing Relationships with Quetelet Indexes 117
Worked example 3.11. Quetelet index in a contingency table 118
Case-study 3.4. Has There Been a Bias in S'n'S Policy? 119
P3.4.1.3 Chi-Square Contingency Coefficient As a Summary Correlation Index 120
Worked example 3.12. Visualization of contingency table using weighted Quetelet coefficients 120
Worked example 3.13. A conventional decomposition of chi-square coefficient 121
F3.4.2 Analysis of Contingency Tables: Formulation 122
3.5 Summary 126
References 127
4 Learning Multivariate Correlations in Data 128
4.1 General: Decision Rules, Fitting Criteria, and Learning Protocols 128
4.2 Naïve Bayes Approach 133
4.2.1 Bayes Decision Rule 133
4.2.2 Naïve Bayes Classifier 135
4.2.3 Metrics of Accuracy 138
4.2.3.1 Accuracy and Related Measures: Presentation 138
Case study 4.1. Prevalence and Quetelet coefficients 140
4.2.3.2 Accuracy and Related Measures: Formulation 140
4.3 Linear Regression 143
P4.3.1 Linear Regression: Presentation 143
Case study 4.2. Linear regression for Market town data 143
Case study 4.3. Using feature weights standardized 145
F4.3.2 Linear Regression: Formulation 146
4.4 Linear Discrimination and SVM 148
P4.4.1 Linear Discrimination and SVM: Presentation 148
Worked example 4.1. A failure of Fisher discrimination criterion 149
Worked example 4.2. SVM for Iris dataset 151
F4.4.2 Linear Discrimination and SVM: Formulation 152
F4.4.2.1 Linear Discrimination 152
F4.4.2.2 Support Vector Machine (SVM) Criterion 153
F4.4.2.3 Kernels 155
4.5 Decision Trees 156
P4.5.1 General: Presentation 156
F4.5.2 General: Formulation 157
4.5.3 Measuring Correlation for Classification Trees 160
P4.5.3.1 Three Approaches to Scoring the Split-to-Target Correlation: Presentation 160
F4.5.3.2 Scoring Functions for Classification Trees: Formulation 162
C4.5.3.3 Computing Scoring Functions with MatLab: Computation 165
4.5.4 Building Classification Trees 167
Worked example 4.3. Classification tree for Iris dataset 167
Project 4.1. Prediction of learning outcome at Student data 168
C4.5.5 Building Classification Trees: Computation 172
4.5.5.1 Finding the Best Split Over a Feature: Computation 173
4.5.5.2 Organizing a Recursive Split Computation and Storage 174
4.6 Learning Correlation with Neural Networks 174
4.6.1 General 174
P4.6.1.1 Artificial Neuron and Neural Network: Presentation 174
F4.6.1.2 Activation Functions and Network Function: Formulation 177
4.6.2 Learning a Multi-layer Network 178
Worked example 4.4. Learning Iris petal sizes 179
Worked example 4.5. Predicting marks at Student dataset 180
F4.6.2.1 Fitting Neural Networks and Gradient Optimization: Formulation 180
C4.6.2.2 Error Back Propagation: Computation 183
4.7 Summary 186
References 186
5 Principal Component Analysis and SVD 188
5.1 Decoder Based Data Summarization 188
5.1.1 Structure of a Summarization Problem with Decoder 188
P5.1.2 Data Recovery Criterion: Presentation 189
F5.1.3 Data Recovery Criterion: Formulation 191
5.1.4 Data Standardization 192
Worked example 5.1. Standardizing Iris dataset 194
C5.1.5 Data Standardization: Computation 197
Project 5.1. Standardization of mixed scale data and its effect 197
Pr5.1.A Data table and its quantification 197
Pr5.1.B Visualization of the data unnormalized 199
Pr5.1.C Standardization by z-scoring 200
Pr5.1.D Range normalization and rescaling of dummy features 201
5.2 Principal Component Analysis: Model, Method, Usage 203
P5.2.1 SVD Based PCA and Its Usage: Presentation 203
P5.2.1.1 Scoring a Hidden Factor 203
Worked example 5.2. Explained proportion of data scatter in Equation (5.8) 206
Worked example 5.3. Principal components after feature centering 208
Worked example 5.4. Rescaling the talent score from Worked example 5.3 209
P5.2.1.2 Data Visualization 210
Worked example 5.5. Visualization of a fragment of Students dataset 210
P5.2.1.3 Feature Space Reduction: Criteria of Contribution and Interpretability 211
Worked example 5.6. Interpretation of principal components at the standardized Student data 212
F5.2.2 Mathematical Model of PCA-SVD and Its Properties: Formulation 213
F5.2.2.1 A Multiplicative Decoder 213
F5.2.2.2 Extension of the PC Decoder to the Case of Many Factors 215
F5.2.2.3 Conventional Formulation of PCA Using Covariance Matrix 216
F5.2.2.4 Geometric Interpretation of Principal Components 218
C5.2.3 Computing Principal Components 220
Worked example 5.7. SVD for Six Students dataset 220
Worked example 5.8. Standardized Student data visualized 220
Worked example 5.9. Evaluation of the quality of visualization of the standardized Student data 221
5.3 Application: Latent Semantic Analysis 222
P5.3.1 Latent Semantic Analysis: Presentation 222
Worked example 5.10. Latent semantic space for article-to-term data 224
F5.3.2 Latent Semantic Analysis: Formulation 225
C5.3.3 Latent Semantic Analysis: Computation 226
Worked example 5.11. Drawing Figure 5.11 226
5.4 Application: Correspondence Analysis 227
P5.4.1 Correspondence Analysis: Presentation 227
Worked example 5.12. Correspondence analysis of Protocol/Attack contingency table 228
F5.4.2 Correspondence Analysis: Formulation 228
C5.4.3 Correspondence Analysis: Computation 231
5.5 Summary 233
References 234
6 K-Means and Related Clustering Methods 235
6.1 General 235
6.2 K-Means Clustering 236
P6.2.1 Batch K-Means Partitioning 236
Worked example 6.1. K-Means clustering of Company data 238
Case Study 6.1. Dependence of K-Means on Initialization: A Drawback and Advantage 240
Case Study 6.2. Uniform Clusters Can Be Too Costly 242
Case Study 6.3. Robustness of K-Means Criterion with Data Normalization 242
F6.2.2 Batch K-Means and Its Criterion: Formulation 243
F6.2.2.1 Batch K-Means as Alternating Minimization 243
F6.2.2.2 Various Formulations of K-Means Criterion 245
C6.2.3 A Pseudo-Code for Batch K-Means: Computation 249
6.2.4 Incremental K-Means 251
P6.2.4.1 Incremental K-Means: Presentation 251
F6.2.4.2 Incremental K-Means: Formulation 252
6.2.5 Nature Inspired Algorithms for K-Means 252
P6.2.5.1 Nature Inspired Algorithms: Presentation 252
6.2.6 Partition Around Medoids (PAM) 259
Worked example 6.2. PAM applied to Company data 259
6.2.7 Initialization of K-Means 260
Case Study 6.4. Hartigan's Index for Choosing the Number of Clusters 261
Worked example 6.3. Selection of initial medoids in Company data 263
Worked example 6.4. Anomalous pattern in Market towns 265
6.2.8 Anomalous Pattern and Intelligent K-Means 267
P6.2.8.1 Anomalous Pattern and iK-Means: Presentation 267
Worked example 6.5. Iterated Anomalous patterns in Market towns 268
FC6.2.8.2 Anomalous Pattern and iK-Means: Formulation and Computation 268
Case Study 6.5. iK-Means Clustering of a Normally Distributed 1D Dataset 270
Project 6.1. Using contributions to determine the number of clusters 271
Project 6.2. Does PCA Indeed Clean the Data Structure? K-Means after PCA 272
6.3 Cluster Interpretation Aids 274
P6.3.1 Cluster Interpretation Aids: Presentation 274
Worked example 6.6. Centroids of Market town clusters 275
Worked example 6.7. Representatives of Company clusters 276
Worked example 6.8. Contributions of features to Market town clusters 277
Worked example 6.9. Contributions and relative contributions of features at Company clusters 278
Case-Study 6.6. 2D Analysis of Most Contributing Features 279
Worked example 6.10. Describing Market town clusters conceptually 281
Worked example 6.11. Describing Company clusters conceptually 281
F6.3.2 Cluster Interpretation Aids: Formulation 282
6.4 Extension of K-Means to Different Cluster Structures 285
6.4.1 Fuzzy K-Means Clustering 285
6.4.2 Mixture of Distributions and EM Algorithm 289
6.4.3 Kohonen's Self-Organizing Maps (SOM) 292
6.5 Summary 294
References 294
7 Hierarchical Clustering 296
7.1 General 296
7.2 Agglomerative Clustering and Ward's Criterion 298
P7.2.1 Agglomerative Clustering: Presentation 298
Worked example 7.1. Agglomerative clustering of Company dataset 298
7.2.1.1 Ward's Criterion 299
Worked example 7.2. Ward algorithm with distances only 300
F7.2.2 Square-Error Criterion and Ward Distance: Formulation 302
C7.2.3 Agglomerative Clustering: Computation 304
7.2.3.1 Agglomerative Clustering 304
7.3 Divisive and Conceptual Clustering 305
P7.3.1 Divisive Clustering: Presentation 305
Case Study 7.1. Divisive Clustering of Companies with Two-Splitting 307
Case Study 7.2. Anomalous Cluster Versus Two-Split Cluster 308
Case Study 7.3. Conceptual Clustering of Digit Data as Related to Ward Clustering 309
F7.3.2 Divisive and Conceptual Clustering: Formulation 312
C7.3.3 Divisive and Conceptual Clustering: Computation 313
7.3.3.1 Ward-Like Divisive Clustering 313
7.3.3.2 Two-Splitting (2-Means Splitting, Bisecting K-Means) 314
7.3.3.3 C-Splitting (Conceptual Clustering with Binary Splits) 314
7.4 Single Linkage Clustering, Connected Components and Maximum Spanning Tree 315
P7.4.1 Maximum Spanning Tree and Clusters: Presentation 315
Worked example 7.3. Concept of MST 317
Worked example 7.4. Building an MST on Confusion data 317
Worked example 7.5. MST and connected components 318
Worked example 7.6. Single link hierarchy corresponding to an MST 319
Case-study 7.4. Difference Between K-Means and Single Link Clustering 320
Worked example 7.7. MST and single linkage clusters for Company dataset 321
F7.4.2 MST, Connected Components and Single Link Clustering: Formulation 322
F7.4.2.1 MST and Connected Components 322
F7.4.2.2 MST and Single Link Clustering 323
C7.4.3 Building a Maximum Spanning Tree: Computation 324
7.4.3.1 Prim's Algorithm 324
7.5 Summary 325
References 325
8 Approximate and Spectral Clustering for Network and Affinity Data 327
8.1 One Cluster Summary Similarity with Background Subtracted 328
P8.1.1 Summary Similarity and Two Types of Background: Presentation 328
Worked example 8.1. Summary similarity clusters at a genuine similarity dataset 329
Case Study 8.1. Repeated One-Cluster Clustering with Repeated Removal of Background 331
Case Study 8.2. Summary Clusters at Ordinary Network Data 333
Worked example 8.2. Similarity clusters at affinity data 335
F8.1.2 One Cluster Summary Criterion and Its Properties: Formulation 336
C8.1.3 Local Algorithms for One Cluster Summary Criterion: Computation 340
8.1.3.1 AddRem(i) Algorithm 340
8.2 Two Cluster Case: Cut, Normalized Cut and Spectral Clustering 341
8.2.1 Minimum Cut and Spectral Clustering 341
P8.2.1.1 Minimum Cut and Spectral Clustering: Presentation 341
Worked example 8.3. Spectral clusters for Confusion dataset 342
Worked example 8.4. Spectral clusters for Cockroach network 342
Worked example 8.5. Spectral clustering of affinity data 344
F8.2.1.2 Minimum Cut and Spectral Clustering: Formulation 344
C8.2.1.3 Spectral Clustering for the Minimum Cut Problem: Computation 345
8.2.2 Normalized Cut and Laplace Transformation 346
P8.2.2.1 Normalized Cut: Presentation 346
Worked example 8.6. Normalized cut for Company data: Laplacian and Lapin matrices 347
Worked example 8.8. Failure of spectral clustering at Cockroach network 348
Case Study 8.3. Circular Cluster Exposed by Lapin Transformation 349
F8.2.2.2 Partition Criteria and Spectral Clustering: Formulation 350
C8.2.2.3 Pseudo-Inverse Laplacian: Computation 353
8.3 Additive Clusters 353
P8.3.1 Decomposing a Similarity Matrix over Clusters: Presentation 353
Worked example 8.9. Additive clusters at Confusion dataset 356
Project 8.1. Analysis of structure of amino acid substitution rates 357
F8.3.2 Additive Clusters One-by-One: Formulation 360
C8.3.3 Finding (Sub)Optimal Additive Clusters: Computation 365
8.3.3.1 AddRemAdd(j) Algorithm 366
8.3.3.2 ADN Algorithm 366
8.3.3.3 ADO Algorithm 367
8.4 Summary 368
References 368
Appendix 369
A1 Basic Linear Algebra 369
A1.1 Inner Product and Distance 369
A1.2 Matrix Algebra 372
A2 Basic Optimization 374
A3 Basic MatLab 376
A3.1 Introduction 376
A3.2 Loading and Storing Files 377
A3.3 Using Subsets of Entities and Features 380
A4 MatLab Program Codes 382
A4.1 Minkowski's Center: Evolutionary Algorithm 382
A4.2 Fitting Power Law: Non-linear Evolutionary and Linearization 384
A4.3 Training a Neural Network with One Hidden Layer 390
A4.4 Building Classification Trees 392
A5 Two Random Samples 395
A5.1 Short.dat 395
A5.2 A Sample of 280 N(0,10) Values 396
Index 398
