
The Elements of Statistical Learning
Authors: Trevor Hastie / Robert Tibshirani / Jerome Friedman
Subtitle: Data Mining, Inference, and Prediction, Second Edition
Publisher: Springer
Publication date: 2009-10
ISBN: 9780387848570
Field: Other

About the Book

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.

About the Authors

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

Table of Contents

1 Introduction

2 Overview of Supervised Learning

2.1 Introduction

2.2 Variable Types and Terminology

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

2.3.1 Linear Models and Least Squares

2.3.2 Nearest-Neighbor Methods

2.3.3 From Least Squares to Nearest Neighbors

2.4 Statistical Decision Theory

2.5 Local Methods in High Dimensions

2.6 Statistical Models, Supervised Learning and Function Approximation

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

2.6.2 Supervised Learning

2.6.3 Function Approximation

2.7 Structured Regression Models

2.7.1 Difficulty of the Problem

2.8 Classes of Restricted Estimators

2.8.1 Roughness Penalty and Bayesian Methods

2.8.2 Kernel Methods and Local Regression

2.8.3 Basis Functions and Dictionary Methods

2.9 Model Selection and the Bias–Variance Tradeoff

Bibliographic Notes

Exercises

3 Linear Methods for Regression

3.1 Introduction

3.2 Linear Regression Models and Least Squares

3.2.1 Example: Prostate Cancer

3.2.2 The Gauss–Markov Theorem

3.2.3 Multiple Regression from Simple Univariate Regression

3.2.4 Multiple Outputs

3.3 Subset Selection

3.3.1 Best-Subset Selection

3.3.2 Forward- and Backward-Stepwise Selection

3.3.3 Forward-Stagewise Regression

3.3.4 Prostate Cancer Data Example (Continued)

3.4 Shrinkage Methods

3.4.1 Ridge Regression

3.4.2 The Lasso

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

3.4.4 Least Angle Regression

3.5 Methods Using Derived Input Directions

3.5.1 Principal Components Regression

3.5.2 Partial Least Squares

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

3.7 Multiple Outcome Shrinkage and Selection

3.8 More on the Lasso and Related Path Algorithms

3.8.1 Incremental Forward Stagewise Regression

3.8.2 Piecewise-Linear Path Algorithms

3.8.3 The Dantzig Selector

3.8.4 The Grouped Lasso

3.8.5 Further Properties of the Lasso

3.8.6 Pathwise Coordinate Optimization

3.9 Computational Considerations

Bibliographic Notes

Exercises

4 Linear Methods for Classification

4.1 Introduction

4.2 Linear Regression of an Indicator Matrix

4.3 Linear Discriminant Analysis

4.3.1 Regularized Discriminant Analysis

4.3.2 Computations for LDA

4.3.3 Reduced-Rank Linear Discriminant Analysis

4.4 Logistic Regression

4.4.1 Fitting Logistic Regression Models

4.4.2 Example: South African Heart Disease

4.4.3 Quadratic Approximations and Inference

4.4.4 L1 Regularized Logistic Regression

4.4.5 Logistic Regression or LDA?

4.5 Separating Hyperplanes

4.5.1 Rosenblatt’s Perceptron Learning Algorithm

4.5.2 Optimal Separating Hyperplanes

Bibliographic Notes

Exercises

5 Basis Expansions and Regularization

5.1 Introduction

5.2 Piecewise Polynomials and Splines

5.2.1 Natural Cubic Splines

5.2.2 Example: South African Heart Disease (Continued)

5.2.3 Example: Phoneme Recognition

5.3 Filtering and Feature Extraction

5.4 Smoothing Splines

5.4.1 Degrees of Freedom and Smoother Matrices

5.5 Automatic Selection of the Smoothing Parameters

5.5.1 Fixing the Degrees of Freedom

5.5.2 The Bias–Variance Tradeoff

5.6 Nonparametric Logistic Regression

5.7 Multidimensional Splines

5.8 Regularization and Reproducing Kernel Hilbert Spaces

5.8.1 Spaces of Functions Generated by Kernels

5.8.2 Examples of RKHS

5.9 Wavelet Smoothing

5.9.1 Wavelet Bases and the Wavelet Transform

5.9.2 Adaptive Wavelet Filtering

Bibliographic Notes

Exercises

Appendix: Computational Considerations for Splines

Appendix: B-splines

Appendix: Computations for Smoothing Splines

6 Kernel Smoothing Methods

6.1 One-Dimensional Kernel Smoothers

6.1.1 Local Linear Regression

6.1.2 Local Polynomial Regression

6.2 Selecting the Width of the Kernel

6.3 Local Regression in ℝ^p

6.4 Structured Local Regression Models in ℝ^p

6.4.1 Structured Kernels

6.4.2 Structured Regression Functions

6.5 Local Likelihood and Other Models

6.6 Kernel Density Estimation and Classification

6.6.1 Kernel Density Estimation

6.6.2 Kernel Density Classification

6.6.3 The Naive Bayes Classifier

6.7 Radial Basis Functions and Kernels

6.8 Mixture Models for Density Estimation and Classification

6.9 Computational Considerations

Bibliographic Notes

Exercises

7 Model Assessment and Selection

7.1 Introduction

7.2 Bias, Variance and Model Complexity

7.3 The Bias–Variance Decomposition

7.3.1 Example: Bias–Variance Tradeoff

7.4 Optimism of the Training Error Rate

7.5 Estimates of In-Sample Prediction Error

7.6 The Effective Number of Parameters

7.7 The Bayesian Approach and BIC

7.8 Minimum Description Length

7.9 Vapnik–Chervonenkis Dimension

7.9.1 Example (Continued)

7.10 Cross-Validation

7.10.1 K-Fold Cross-Validation

7.10.2 The Wrong and Right Way to Do Cross-validation

7.10.3 Does Cross-Validation Really Work?

7.11 Bootstrap Methods

7.11.1 Example (Continued)

7.12 Conditional or Expected Test Error?

Bibliographic Notes

Exercises

8 Model Inference and Averaging

8.1 Introduction

8.2 The Bootstrap and Maximum Likelihood Methods

8.2.1 A Smoothing Example

8.2.2 Maximum Likelihood Inference

8.2.3 Bootstrap versus Maximum Likelihood

8.3 Bayesian Methods

8.4 Relationship Between the Bootstrap and Bayesian Inference

8.5 The EM Algorithm

8.5.1 Two-Component Mixture Model

8.5.2 The EM Algorithm in General

8.5.3 EM as a Maximization–Maximization Procedure

8.6 MCMC for Sampling from the Posterior

8.7 Bagging

8.7.1 Example: Trees with Simulated Data

8.8 Model Averaging and Stacking

8.9 Stochastic Search: Bumping

Bibliographic Notes

Exercises

9 Additive Models, Trees, and Related Methods

9.1 Generalized Additive Models

9.1.1 Fitting Additive Models

9.1.2 Example: Additive Logistic Regression

9.1.3 Summary

9.2 Tree-Based Methods

9.2.1 Background

9.2.2 Regression Trees

9.2.3 Classification Trees

9.2.4 Other Issues

9.2.5 Spam Example (Continued)

9.3 PRIM: Bump Hunting

9.3.1 Spam Example (Continued)

9.4 MARS: Multivariate Adaptive Regression Splines

9.4.1 Spam Example (Continued)

9.4.2 Example (Simulated Data)

9.4.3 Other Issues

9.5 Hierarchical Mixtures of Experts

9.6 Missing Data

9.7 Computational Considerations

Bibliographic Notes

Exercises

10 Boosting and Additive Trees

10.1 Boosting Methods

10.1.1 Outline of This Chapter

10.2 Boosting Fits an Additive Model

10.3 Forward Stagewise Additive Modeling

10.4 Exponential Loss and AdaBoost

10.5 Why Exponential Loss?

10.6 Loss Functions and Robustness

10.7 “Off-the-Shelf” Procedures for Data Mining

10.8 Example: Spam Data

10.9 Boosting Trees

10.10 Numerical Optimization via Gradient Boosting

10.10.1 Steepest Descent

10.10.2 Gradient Boosting

10.10.3 Implementations of Gradient Boosting

10.11 Right-Sized Trees for Boosting

10.12 Regularization

10.12.1 Shrinkage

10.12.2 Subsampling

10.13 Interpretation

10.13.1 Relative Importance of Predictor Variables

10.13.2 Partial Dependence Plots

10.14 Illustrations

10.14.1 California Housing

10.14.2 New Zealand Fish

10.14.3 Demographics Data

Bibliographic Notes

Exercises

11 Neural Networks

11.1 Introduction

11.2 Projection Pursuit Regression

11.3 Neural Networks

11.4 Fitting Neural Networks

11.5 Some Issues in Training Neural Networks

11.5.1 Starting Values

11.5.2 Overfitting

11.5.3 Scaling of the Inputs

11.5.4 Number of Hidden Units and Layers

11.5.5 Multiple Minima

11.6 Example: Simulated Data

11.7 Example: ZIP Code Data

11.8 Discussion

11.9 Bayesian Neural Nets and the NIPS 2003 Challenge

11.9.1 Bayes, Boosting and Bagging

11.9.2 Performance Comparisons

11.10 Computational Considerations

Bibliographic Notes

Exercises

12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

12.2 The Support Vector Classifier

12.2.1 Computing the Support Vector Classifier

12.2.2 Mixture Example (Continued)

12.3 Support Vector Machines and Kernels

12.3.1 Computing the SVM for Classification

12.3.2 The SVM as a Penalization Method

12.3.3 Function Estimation and Reproducing Kernels

12.3.4 SVMs and the Curse of Dimensionality

12.3.5 A Path Algorithm for the SVM Classifier

12.3.6 Support Vector Machines for Regression

12.3.7 Regression and Kernels

12.3.8 Discussion

12.4 Generalizing Linear Discriminant Analysis

12.5 Flexible Discriminant Analysis

12.5.1 Computing the FDA Estimates

12.6 Penalized Discriminant Analysis

12.7 Mixture Discriminant Analysis

12.7.1 Example: Waveform Data

Bibliographic Notes

Exercises

13 Prototype Methods and Nearest-Neighbors

13.1 Introduction

13.2 Prototype Methods

13.2.1 K-means Clustering

13.2.2 Learning Vector Quantization

13.2.3 Gaussian Mixtures

13.3 k-Nearest-Neighbor Classifiers

13.3.1 Example: A Comparative Study

13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification

13.3.3 Invariant Metrics and Tangent Distance

13.4 Adaptive Nearest-Neighbor Methods

13.4.1 Example

13.4.2 Global Dimension Reduction for Nearest-Neighbors

13.5 Computational Considerations

Bibliographic Notes

Exercises

14 Unsupervised Learning

14.1 Introduction

14.2 Association Rules

14.2.1 Market Basket Analysis

14.2.2 The Apriori Algorithm

14.2.3 Example: Market Basket Analysis

14.2.4 Unsupervised as Supervised Learning

14.2.5 Generalized Association Rules

14.2.6 Choice of Supervised Learning Method

14.2.7 Example: Market Basket Analysis (Continued)

14.3 Cluster Analysis

14.3.1 Proximity Matrices

14.3.2 Dissimilarities Based on Attributes

14.3.3 Object Dissimilarity

14.3.4 Clustering Algorithms

14.3.5 Combinatorial Algorithms

14.3.6 K-means

14.3.7 Gaussian Mixtures as Soft K-means Clustering

14.3.8 Example: Human Tumor Microarray Data

14.3.9 Vector Quantization

14.3.10 K-medoids

14.3.11 Practical Issues

14.3.12 Hierarchical Clustering

14.4 Self-Organizing Maps

14.5 Principal Components, Curves and Surfaces

14.5.1 Principal Components

14.5.2 Principal Curves and Surfaces

14.5.3 Spectral Clustering

14.5.4 Kernel Principal Components

14.5.5 Sparse Principal Components

14.6 Non-negative Matrix Factorization

14.6.1 Archetypal Analysis

14.7 Independent Component Analysis and Exploratory Projection Pursuit

14.7.1 Latent Variables and Factor Analysis

14.7.2 Independent Component Analysis

14.7.3 Exploratory Projection Pursuit

14.7.4 A Direct Approach to ICA

14.8 Multidimensional Scaling

14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling

14.10 The Google PageRank Algorithm

Bibliographic Notes

Exercises

15 Random Forests

15.1 Introduction

15.2 Definition of Random Forests

15.3 Details of Random Forests

15.3.1 Out of Bag Samples

15.3.2 Variable Importance

15.3.3 Proximity Plots

15.3.4 Random Forests and Overfitting

15.4 Analysis of Random Forests

15.4.1 Variance and the De-Correlation Effect

15.4.2 Bias

15.4.3 Adaptive Nearest Neighbors

Bibliographic Notes

Exercises

16 Ensemble Learning

16.1 Introduction

16.2 Boosting and Regularization Paths

16.2.1 Penalized Regression

16.2.2 The “Bet on Sparsity” Principle

16.2.3 Regularization Paths, Over-fitting and Margins

16.3 Learning Ensembles

16.3.1 Learning a Good Ensemble

16.3.2 Rule Ensembles

Bibliographic Notes

Exercises

17 Undirected Graphical Models

17.1 Introduction

17.2 Markov Graphs and Their Properties

17.3 Undirected Graphical Models for Continuous Variables

17.3.1 Estimation of the Parameters when the Graph Structure is Known

17.3.2 Estimation of the Graph Structure

17.4 Undirected Graphical Models for Discrete Variables

17.4.1 Estimation of the Parameters when the Graph Structure is Known

17.4.2 Hidden Nodes

17.4.3 Estimation of the Graph Structure

17.4.4 Restricted Boltzmann Machines

Exercises

18 High-Dimensional Problems: p ≫ N

18.1 When p is Much Bigger than N

18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids

18.3 Linear Classifiers with Quadratic Regularization

18.3.1 Regularized Discriminant Analysis

18.3.2 Logistic Regression with Quadratic Regularization

18.3.3 The Support Vector Classifier

18.3.4 Feature Selection

18.3.5 Computational Shortcuts When p ≫ N

18.4 Linear Classifiers with L1 Regularization

18.4.1 Application of Lasso to Protein Mass Spectroscopy

18.4.2 The Fused Lasso for Functional Data

18.5 Classification When Features are Unavailable

18.5.1 Example: String Kernels and Protein Classification

18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances

18.5.3 Example: Abstracts Classification

18.6 High-Dimensional Regression: Supervised Principal Components

18.6.1 Connection to Latent-Variable Modeling

18.6.2 Relationship with Partial Least Squares

18.6.3 Pre-Conditioning for Feature Selection

18.7 Feature Assessment and the Multiple-Testing Problem

18.7.1 The False Discovery Rate

18.7.2 Asymmetric Cutpoints and the SAM Procedure

18.7.3 A Bayesian Interpretation of the FDR

18.8 Bibliographic Notes

Exercises

Excerpts

the bias of the 1-nearest-neighbor estimate is often low, but the variance is high.
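
The excerpt summarizes the bias–variance behavior of nearest-neighbor fits. As a rough illustration only (this code is not from the book), the Python sketch below estimates the squared bias and variance of a 1-NN versus a 15-NN regression fit at a single query point by refitting on repeated simulated training sets; the target function sin(4x), the noise level, the sample size, and the choice k = 15 are hypothetical settings picked for the demonstration.

```python
# Minimal simulation sketch (not from the book): at a fixed query point x0,
# a 1-NN fit has low bias but high variance, while averaging k = 15 neighbors
# reduces variance at the cost of some bias. All settings here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(4 * x)      # assumed "true" regression function
x0, n, reps = 0.5, 100, 2000     # query point, training size, replications

def knn_predict(x_train, y_train, x_query, k):
    """Average the responses of the k nearest training points (1-D inputs)."""
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

for k in (1, 15):
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)             # fresh training inputs
        y = f(x) + rng.normal(0, 0.5, n)     # noisy responses
        preds[r] = knn_predict(x, y, x0, k)
    bias = preds.mean() - f(x0)              # estimated bias at x0
    var = preds.var()                        # estimated variance at x0
    print(f"k={k:2d}  bias^2={bias**2:.4f}  variance={var:.4f}")
```

With these settings one would expect the 1-NN variance to be close to the noise variance (0.25), while averaging 15 neighbors should cut the variance by roughly a factor of k and introduce a somewhat larger bias, which is the tradeoff the excerpt points to.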
