Statistical Pattern Recognition
Description
Contents
Preface xv
Notation xvii
1 Introduction to statistical pattern recognition 1
1.1 Statistical pattern recognition 1
1.1.1 Introduction 1
1.1.2 The basic model 2
1.2 Stages in a pattern recognition problem 3
1.3 Issues 4
1.4 Supervised versus unsupervised 5
1.5 Approaches to statistical pattern recognition 6
1.5.1 Elementary decision theory 6
1.5.2 Discriminant functions 19
1.6 Multiple regression 25
1.7 Outline of book 27
1.8 Notes and references 28
Exercises 30
2 Density estimation – parametric 33
2.1 Introduction 33
2.2 Normal-based models 34
2.2.1 Linear and quadratic discriminant functions 34
2.2.2 Regularised discriminant analysis 37
2.2.3 Example application study 38
2.2.4 Further developments 40
2.2.5 Summary 40
2.3 Normal mixture models 41
2.3.1 Maximum likelihood estimation via EM 41
2.3.2 Mixture models for discrimination 45
2.3.3 How many components? 46
2.3.4 Example application study 47
2.3.5 Further developments 49
2.3.6 Summary 49
2.4 Bayesian estimates 50
2.4.1 Bayesian learning methods 50
2.4.2 Markov chain Monte Carlo 55
2.4.3 Bayesian approaches to discrimination 70
2.4.4 Example application study 72
2.4.5 Further developments 75
2.4.6 Summary 75
2.5 Application studies 75
2.6 Summary and discussion 77
2.7 Recommendations 77
2.8 Notes and references 77
Exercises 78
3 Density estimation – nonparametric 81
3.1 Introduction 81
3.2 Histogram method 82
3.2.1 Data-adaptive histograms 83
3.2.2 Independence assumption 84
3.2.3 Lancaster models 85
3.2.4 Maximum weight dependence trees 85
3.2.5 Bayesian networks 88
3.2.6 Example application study 91
3.2.7 Further developments 91
3.2.8 Summary 92
3.3 k-nearest-neighbour method 93
3.3.1 k-nearest-neighbour decision rule 93
3.3.2 Properties of the nearest-neighbour rule 95
3.3.3 Algorithms 95
3.3.4 Editing techniques 98
3.3.5 Choice of distance metric 101
3.3.6 Example application study 102
3.3.7 Further developments 103
3.3.8 Summary 104
3.4 Expansion by basis functions 105
3.5 Kernel methods 106
3.5.1 Choice of smoothing parameter 111
3.5.2 Choice of kernel 113
3.5.3 Example application study 114
3.5.4 Further developments 115
3.5.5 Summary 115
3.6 Application studies 116
3.7 Summary and discussion 119
3.8 Recommendations 120
3.9 Notes and references 120
Exercises 121
4 Linear discriminant analysis 123
4.1 Introduction 123
4.2 Two-class algorithms 124
4.2.1 General ideas 124
4.2.2 Perceptron criterion 124
4.2.3 Fisher’s criterion 128
4.2.4 Least mean squared error procedures 130
4.2.5 Support vector machines 134
4.2.6 Example application study 141
4.2.7 Further developments 142
4.2.8 Summary 142
4.3 Multiclass algorithms 144
4.3.1 General ideas 144
4.3.2 Error-correction procedure 145
4.3.3 Fisher’s criterion – linear discriminant analysis 145
4.3.4 Least mean squared error procedures 148
4.3.5 Optimal scaling 152
4.3.6 Regularisation 155
4.3.7 Multiclass support vector machines 155
4.3.8 Example application study 156
4.3.9 Further developments 156
4.3.10 Summary 158
4.4 Logistic discrimination 158
4.4.1 Two-group case 158
4.4.2 Maximum likelihood estimation 159
4.4.3 Multiclass logistic discrimination 161
4.4.4 Example application study 162
4.4.5 Further developments 163
4.4.6 Summary 163
4.5 Application studies 163
4.6 Summary and discussion 164
4.7 Recommendations 165
4.8 Notes and references 165
Exercises 165
5 Nonlinear discriminant analysis – kernel methods 169
5.1 Introduction 169
5.2 Optimisation criteria 171
5.2.1 Least squares error measure 171
5.2.2 Maximum likelihood 175
5.2.3 Entropy 176
5.3 Radial basis functions 177
5.3.1 Introduction 177
5.3.2 Motivation 178
5.3.3 Specifying the model 181
5.3.4 Radial basis function properties 187
5.3.5 Simple radial basis function 187
5.3.6 Example application study 187
5.3.7 Further developments 189
5.3.8 Summary 189
5.4 Nonlinear support vector machines 190
5.4.1 Types of kernel 191
5.4.2 Model selection 192
5.4.3 Support vector machines for regression 192
5.4.4 Example application study 195
5.4.5 Further developments 196
5.4.6 Summary 197
5.5 Application studies 197
5.6 Summary and discussion 199
5.7 Recommendations 199
5.8 Notes and references 200
Exercises 200
6 Nonlinear discriminant analysis – projection methods 203
6.1 Introduction 203
6.2 The multilayer perceptron 204
6.2.1 Introduction 204
6.2.2 Specifying the multilayer perceptron structure 205
6.2.3 Determining the multilayer perceptron weights 205
6.2.4 Properties 212
6.2.5 Example application study 213
6.2.6 Further developments 214
6.2.7 Summary 216
6.3 Projection pursuit 216
6.3.1 Introduction 216
6.3.2 Projection pursuit for discrimination 218
6.3.3 Example application study 219
6.3.4 Further developments 220
6.3.5 Summary 220
6.4 Application studies 221
6.5 Summary and discussion 221
6.6 Recommendations 222
6.7 Notes and references 223
Exercises 223
7 Tree-based methods 225
7.1 Introduction 225
7.2 Classification trees 225
7.2.1 Introduction 225
7.2.2 Classifier tree construction 228
7.2.3 Other issues 237
7.2.4 Example application study 239
7.2.5 Further developments 239
7.2.6 Summary 240
7.3 Multivariate adaptive regression splines 241
7.3.1 Introduction 241
7.3.2 Recursive partitioning model 241
7.3.3 Example application study 244
7.3.4 Further developments 245
7.3.5 Summary 245
7.4 Application studies 245
7.5 Summary and discussion 247
7.6 Recommendations 247
7.7 Notes and references 248
Exercises 248
8 Performance 251
8.1 Introduction 251
8.2 Performance assessment 252
8.2.1 Discriminability 252
8.2.2 Reliability 258
8.2.3 ROC curves for two-class rules 260
8.2.4 Example application study 263
8.2.5 Further developments 264
8.2.6 Summary 265
8.3 Comparing classifier performance 266
8.3.1 Which technique is best? 266
8.3.2 Statistical tests 267
8.3.3 Comparing rules when misclassification costs are uncertain 267
8.3.4 Example application study 269
8.3.5 Further developments 270
8.3.6 Summary 271
8.4 Combining classifiers 271
8.4.1 Introduction 271
8.4.2 Motivation 272
8.4.3 Characteristics of a combination scheme 275
8.4.4 Data fusion 278
8.4.5 Classifier combination methods 284
8.4.6 Example application study 297
8.4.7 Further developments 298
8.4.8 Summary 298
8.5 Application studies 299
8.6 Summary and discussion 299
8.7 Recommendations 300
8.8 Notes and references 300
Exercises 301
9 Feature selection and extraction 305
9.1 Introduction 305
9.2 Feature selection 307
9.2.1 Feature selection criteria 308
9.2.2 Search algorithms for feature selection 311
9.2.3 Suboptimal search algorithms 314
9.2.4 Example application study 317
9.2.5 Further developments 317
9.2.6 Summary 318
9.3 Linear feature extraction 318
9.3.1 Principal components analysis 319
9.3.2 Karhunen–Loève transformation 329
9.3.3 Factor analysis 335
9.3.4 Example application study 342
9.3.5 Further developments 343
9.3.6 Summary 344
9.4 Multidimensional scaling 344
9.4.1 Classical scaling 345
9.4.2 Metric multidimensional scaling 346
9.4.3 Ordinal scaling 347
9.4.4 Algorithms 350
9.4.5 Multidimensional scaling for feature extraction 351
9.4.6 Example application study 352
9.4.7 Further developments 353
9.4.8 Summary 353
9.5 Application studies 354
9.6 Summary and discussion 355
9.7 Recommendations 355
9.8 Notes and references 356
Exercises 357
10 Clustering 361
10.1 Introduction 361
10.2 Hierarchical methods 362
10.2.1 Single-link method 364
10.2.2 Complete-link method 367
10.2.3 Sum-of-squares method 368
10.2.4 General agglomerative algorithm 368
10.2.5 Properties of a hierarchical classification 369
10.2.6 Example application study 370
10.2.7 Summary 370
10.3 Quick partitions 371
10.4 Mixture models 372
10.4.1 Model description 372
10.4.2 Example application study 374
10.5 Sum-of-squares methods 374
10.5.1 Clustering criteria 375
10.5.2 Clustering algorithms 376
10.5.3 Vector quantisation 382
10.5.4 Example application study 394
10.5.5 Further developments 395
10.5.6 Summary 395
10.6 Cluster validity 396
10.6.1 Introduction 396
10.6.2 Distortion measures 397
10.6.3 Choosing the number of clusters 397
10.6.4 Identifying genuine clusters 399
10.7 Application studies 400
10.8 Summary and discussion 402
10.9 Recommendations 404
10.10 Notes and references 405
Exercises 406
11 Additional topics 409
11.1 Model selection 409
11.1.1 Separate training and test sets 410
11.1.2 Cross-validation 410
11.1.3 The Bayesian viewpoint 411
11.1.4 Akaike’s information criterion 411
11.2 Learning with unreliable classification 412
11.3 Missing data 413
11.4 Outlier detection and robust procedures 414
11.5 Mixed continuous and discrete variables 415
11.6 Structural risk minimisation and the Vapnik–Chervonenkis dimension 416
11.6.1 Bounds on the expected risk 416
11.6.2 The Vapnik–Chervonenkis dimension 417
A Measures of dissimilarity 419
A.1 Measures of dissimilarity 419
A.1.1 Numeric variables 419
A.1.2 Nominal and ordinal variables 423
A.1.3 Binary variables 423
A.1.4 Summary 424
A.2 Distances between distributions 425
A.2.1 Methods based on prototype vectors 425
A.2.2 Methods based on probabilistic distance 425
A.2.3 Probabilistic dependence 428
A.3 Discussion 429
B Parameter estimation 431
B.1 Parameter estimation 431
B.1.1 Properties of estimators 431
B.1.2 Maximum likelihood 433
B.1.3 Problems with maximum likelihood 434
B.1.4 Bayesian estimates 434
C Linear algebra 437
C.1 Basic properties and definitions 437
C.2 Notes and references 441
D Data 443
D.1 Introduction 443
D.2 Formulating the problem 443
D.3 Data collection 444
D.4 Initial examination of data 446
D.5 Data sets 448
D.6 Notes and references 448
E Probability theory 449
E.1 Definitions and terminology 449
E.2 Normal distribution 454
E.3 Probability distributions 455
References 459
Index 491