4.1 Introduction
In this chapter, we revisit the classification problem and focus on linear methods for classification.
What does "linear" mean here?
Depending on the prediction function, the decision boundaries that divide the input space into regions can be rough or smooth. For an important class of methods, these boundaries are linear.
While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand our variable set $X_1, \dots, X_p$ by including their squares and cross-products $X_1^2, X_2^2, \dots, X_1 X_2, \dots$, thereby adding $p(p+1)/2$ additional variables. Linear functions in the augmented space map down to quadratic functions in the original space --- hence linear decision boundaries become quadratic decision boundaries. This approach can be used with any basis transformation $h(X)$, where $h : \mathbb{R}^p \mapsto \mathbb{R}^q$ with $q > p$, and will be explored in later chapters.
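As a concrete illustration, the sketch below (a toy example of ours, not from the text; the helper name `expand_quadratic` and the circular two-class data are made up) expands a two-dimensional input with squares and a cross-product, so that a linear fit in the augmented space produces a quadratic decision boundary in the original space.

```python
import numpy as np

def expand_quadratic(X):
    """Augment (x1, x2) with their squares and cross-product."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Toy two-class data: class 1 lies inside the unit circle, class 0 outside,
# so no linear boundary in the original coordinates separates them well.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(float)

# Least-squares fit of a linear function of the augmented features;
# thresholding the fit at 1/2 gives a quadratic boundary in (x1, x2).
H = np.column_stack([np.ones(len(X)), expand_quadratic(X)])
beta = np.linalg.lstsq(H, y, rcond=None)[0]
yhat = (H @ beta > 0.5).astype(float)
print("training accuracy with quadratic expansion:", (yhat == y).mean())
```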
4.2 Linear Regression of an Indicator Matrix
In this section we introduce a basic method, linear regression of an indicator matrix. It has shortcomings (most notably the masking problem described below) that are addressed, at least in part, by the methods of later sections.
Here each of the response categories is coded via an indicator variable. If $\mathcal{G}$ has $K$ classes, there will be $K$ such indicators $Y_k$, $k = 1, \dots, K$, with $Y_k = 1$ if $G = k$, else $0$. These are collected together in a vector $Y = (Y_1, \dots, Y_K)$, and the $N$ training instances of these form an $N \times K$ indicator response matrix $\mathbf{Y}$. We fit a linear regression model to each of the columns of $\mathbf{Y}$ simultaneously; the fit is given by
$$\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y},$$
and the coefficient matrix is $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$.
To classify a new observation with input $x$, we first compute the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$-vector, and then identify the largest component and classify accordingly: $\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x)$.
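A minimal sketch of this fit and classification rule (the function names and the integer class coding $g \in \{0, \dots, K-1\}$ are our own choices):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of the K indicator columns on (1, X)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                         # indicator response matrix
    X1 = np.column_stack([np.ones(N), X])
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # (p + 1) x K coefficients
    return B_hat

def classify(B_hat, X):
    """Classify to the class with the largest fitted component."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    return (X1 @ B_hat).argmax(axis=1)
```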
Why does this method work? What is the rationale?
One justification is to view the regression as an estimate of conditional expectation.
For the random variable $Y_k$ we have $E(Y_k \mid X = x) = \Pr(G = k \mid X = x)$, so estimating the conditional expectation of each $Y_k$ seems a sensible goal. The real issue is: how good an approximation to the conditional expectation is the rather rigid linear regression model? Alternatively, are the $\hat{f}_k(x)$ reasonable estimates of the posterior probabilities $\Pr(G = k \mid X = x)$, and more importantly, does this matter? Although we can verify that $\sum_{k \in \mathcal{G}} \hat{f}_k(x) = 1$ for any $x$ (provided the model contains an intercept), the $\hat{f}_k(x)$ can be negative or greater than $1$.
Another, more simplistic viewpoint is to construct a target $t_k$ for each class, where $t_k$ is the $k$th column of the $K \times K$ identity matrix, and to set $y_i = t_k$ if $g_i = k$. Then we can fit the linear model by least squares:
$$\min_{\mathbf{B}} \sum_{i=1}^N \left\| y_i - [(1, x_i^T)\mathbf{B}]^T \right\|^2.$$
A new observation is classified by computing its fitted vector $\hat{f}(x)$ and classifying to the closest target:
$$\hat{G}(x) = \arg\min_k \left\| \hat{f}(x) - t_k \right\|^2.$$
This is exactly the same as the previous approach.
A serious problem with this method: when the number of classes is $K \geq 3$, the model can work badly (the problem is especially prevalent when $K$ is large). Because of the rigid nature of the regression model, some classes can be masked by others.
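A quick numerical illustration of masking (our own toy example, not from the text): three well-separated classes along a line, yet the fitted values for the middle class hover near $1/3$ and are rarely the largest, so indicator regression almost never predicts that class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three clearly separated classes along one input dimension.
x = np.concatenate([rng.normal(-4, 0.5, 50),
                    rng.normal(0, 0.5, 50),
                    rng.normal(4, 0.5, 50)])
g = np.repeat([0, 1, 2], 50)

# Indicator response matrix and least-squares fit on (1, x).
Y = np.eye(3)[g]
X1 = np.column_stack([np.ones_like(x), x])
B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)

pred = (X1 @ B_hat).argmax(axis=1)
print("predicted-class counts:", np.bincount(pred, minlength=3))
# The middle class is largely (often entirely) masked by the two outer classes.
```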
4.3 Linear Discriminant Analysis
Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors $\Pr(G \mid X)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of $X$ in class $G = k$, and let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^K \pi_k = 1$. A simple application of Bayes theorem gives us
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{\ell=1}^K f_\ell(x)\pi_\ell}.$$
We see that in terms of ability to classify, having the $f_k(x)$ is almost equivalent to having the quantities $\Pr(G = k \mid X = x)$.
Suppose that we model each class density as multivariate Gaussian,
$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_k|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^T \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) \right\}.$$
Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$. In comparing two classes $k$ and $\ell$, it is sufficient to look at the log-ratio
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log\frac{f_k(x)}{f_\ell(x)} + \log\frac{\pi_k}{\pi_\ell} = \log\frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k + \mu_\ell)^T \boldsymbol{\Sigma}^{-1} (\mu_k - \mu_\ell) + x^T \boldsymbol{\Sigma}^{-1} (\mu_k - \mu_\ell),$$
an equation linear in $x$. This implies that the decision boundary between classes $k$ and $\ell$ --- the set where $\Pr(G = k \mid X = x) = \Pr(G = \ell \mid X = x)$ --- is linear in $x$: in $p$ dimensions, a hyperplane.
We can also see that the linear discriminant functions
$$\delta_k(x) = x^T \boldsymbol{\Sigma}^{-1} \mu_k - \frac{1}{2}\mu_k^T \boldsymbol{\Sigma}^{-1} \mu_k + \log \pi_k$$
are an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$.
In practice, we do not know the parameters of the Gaussian distributions and will need to estimate them using our training data:
$\hat{\pi}_k = N_k / N$, where $N_k$ is the number of class-$k$ observations;
$\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$;
$\hat{\boldsymbol{\Sigma}} = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$.
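A minimal numpy sketch of these plug-in estimates and the resulting discriminant functions $\delta_k(x)$ (the function names and integer class coding are our own choices):

```python
import numpy as np

def fit_lda(X, g, K):
    """Plug-in estimates for LDA: priors, class means, pooled covariance."""
    N, p = X.shape
    pi_hat = np.array([(g == k).mean() for k in range(K)])
    mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # K x p
    Sigma_hat = np.zeros((p, p))
    for k in range(K):
        R = X[g == k] - mu_hat[k]
        Sigma_hat += R.T @ R
    Sigma_hat /= (N - K)
    return pi_hat, mu_hat, Sigma_hat

def lda_discriminants(X, pi_hat, mu_hat, Sigma_hat):
    """delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log pi_k."""
    Sinv_mu = np.linalg.solve(Sigma_hat, mu_hat.T)                  # p x K
    lin = X @ Sinv_mu                                               # N x K
    const = -0.5 * np.sum(mu_hat.T * Sinv_mu, axis=0) + np.log(pi_hat)
    return lin + const                                              # classify by argmax
```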
What if the $\boldsymbol{\Sigma}_k$ are not assumed to be equal? Then the pieces quadratic in $x$ remain, and we get the quadratic discriminant functions (QDA),
$$\delta_k(x) = -\frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(x - \mu_k)^T \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) + \log\pi_k.$$
The decision boundary between each pair of classes $k$ and $\ell$ is described by a quadratic equation $\{x : \delta_k(x) = \delta_\ell(x)\}$.
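A sketch of the QDA discriminant computed directly from this formula (the function name and array layout are ours; it pairs with the LDA sketch above, with per-class covariance estimates replacing the pooled one):

```python
import numpy as np

def qda_discriminants(X, pi_hat, mu_hat, Sigma_hat_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log pi_k."""
    N, K = X.shape[0], len(pi_hat)
    delta = np.empty((N, K))
    for k in range(K):
        _, logdet = np.linalg.slogdet(Sigma_hat_k[k])
        R = X - mu_hat[k]                                            # N x p residuals
        maha = np.sum(R * np.linalg.solve(Sigma_hat_k[k], R.T).T, axis=1)
        delta[:, k] = -0.5 * logdet - 0.5 * maha + np.log(pi_hat[k])
    return delta                                                     # classify by argmax
```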
4.3.1 Regularized Discriminant Analysis
In this subsection a method is introduced that is a compromise between LDA and QDA. The regularized covariance matrices have the form
$$\hat{\boldsymbol{\Sigma}}_k(\alpha) = \alpha\hat{\boldsymbol{\Sigma}}_k + (1 - \alpha)\hat{\boldsymbol{\Sigma}},$$
where $\hat{\boldsymbol{\Sigma}}$ is the pooled covariance matrix as used in LDA. Here $\alpha \in [0, 1]$ allows a continuum of models between LDA and QDA, and needs to be specified (by cross-validation, for example).
Similar modifications allow $\hat{\boldsymbol{\Sigma}}$ itself to be shrunk toward the scalar covariance,
$$\hat{\boldsymbol{\Sigma}}(\gamma) = \gamma\hat{\boldsymbol{\Sigma}} + (1 - \gamma)\hat{\sigma}^2\mathbf{I}$$
for $\gamma \in [0, 1]$. Replacing $\hat{\boldsymbol{\Sigma}}$ in $\hat{\boldsymbol{\Sigma}}_k(\alpha)$ by $\hat{\boldsymbol{\Sigma}}(\gamma)$ leads to a more general family of covariances $\hat{\boldsymbol{\Sigma}}(\alpha, \gamma)$ indexed by a pair of parameters.
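A small helper illustrating the two levels of shrinkage (a sketch only; taking $\hat{\sigma}^2$ to be the average diagonal element of the pooled covariance is our assumption, one common convention):

```python
import numpy as np

def regularized_covariances(Sigma_hat_k, Sigma_hat, alpha, gamma=1.0):
    """Sigma_k(alpha, gamma) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma),
    where Sigma(gamma) = gamma * Sigma + (1 - gamma) * sigma^2 * I."""
    p = Sigma_hat.shape[0]
    sigma2 = np.trace(Sigma_hat) / p                    # scalar variance (one convention)
    Sigma_gamma = gamma * Sigma_hat + (1 - gamma) * sigma2 * np.eye(p)
    return [alpha * S_k + (1 - alpha) * Sigma_gamma for S_k in Sigma_hat_k]
```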
4.3.2 Computation for LDA
The computations for LDA, and especially QDA, are simplified by diagonalizing $\hat{\boldsymbol{\Sigma}}$ or the $\hat{\boldsymbol{\Sigma}}_k$.
For the latter, suppose we compute the eigen-decomposition $\hat{\boldsymbol{\Sigma}}_k = \mathbf{U}_k \mathbf{D}_k \mathbf{U}_k^T$ for each $k$, where $\mathbf{U}_k$ is $p \times p$ orthonormal and $\mathbf{D}_k$ is a diagonal matrix of positive eigenvalues $d_{k\ell}$. Then the ingredients for $\delta_k(x)$ are
$(x - \hat{\mu}_k)^T \hat{\boldsymbol{\Sigma}}_k^{-1} (x - \hat{\mu}_k) = [\mathbf{U}_k^T (x - \hat{\mu}_k)]^T \mathbf{D}_k^{-1} [\mathbf{U}_k^T (x - \hat{\mu}_k)]$;
$\log |\hat{\boldsymbol{\Sigma}}_k| = \sum_\ell \log d_{k\ell}$.
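A short numpy sketch of these two ingredients (our own illustration, using the symmetric eigen-solver):

```python
import numpy as np

def qda_ingredients(x, mu_k, Sigma_k):
    """Mahalanobis term and log-determinant via the eigen-decomposition of Sigma_k."""
    d, U = np.linalg.eigh(Sigma_k)       # Sigma_k = U diag(d) U^T, with d > 0
    z = U.T @ (x - mu_k)                 # rotate the centered point into the eigenbasis
    mahalanobis = np.sum(z**2 / d)       # (x - mu_k)' Sigma_k^{-1} (x - mu_k)
    log_det = np.sum(np.log(d))          # log |Sigma_k|
    return mahalanobis, log_det
```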
4.3.3 Reduced-Rank Linear Discriminant Analysis
Motivations:
Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to the common covariance estimate $\hat{\boldsymbol{\Sigma}}$, and classifying to the closest centroid (modulo the effect of the class priors $\pi_k$) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
For a detailed discussion, please refer to the textbook; a small sketch of the sphering-and-closest-centroid step is given below.
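This sketch is our own illustration of the first motivation point only, ignoring the prior correction and the further rank reduction:

```python
import numpy as np

def sphere_and_classify(X, mu_hat, Sigma_hat):
    """Sphere with Sigma^{-1/2}, then assign each point to its nearest class centroid."""
    d, U = np.linalg.eigh(Sigma_hat)
    W = U @ np.diag(1.0 / np.sqrt(d)) @ U.T          # symmetric inverse square root
    X_star = X @ W                                   # sphered data
    M_star = mu_hat @ W                              # sphered centroids, K x p
    dist2 = ((X_star[:, None, :] - M_star[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)
```

With equal class priors, classifying to the closest sphered centroid reproduces the LDA rule.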
4.4 Logistic Regression
The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, while at the same time ensuring that they sum to one and remain in $[0, 1]$.
The model has the form
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \dots, K - 1.$$
The model is specified in terms of $K - 1$ log-odds or logit transformations. Although the model uses the last class as the denominator in the odds-ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice. A simple calculation shows that
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)}, \quad k = 1, \dots, K - 1, \qquad
\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)},$$
and the probabilities clearly sum to one. To emphasize the dependence on the entire parameter set $\theta = \{\beta_{10}, \beta_1^T, \dots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, we denote the probabilities $\Pr(G = k \mid X = x) = p_k(x; \theta)$.
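A small numpy sketch of these probabilities (the matrix layout and function name are our own choices):

```python
import numpy as np

def multinomial_logit_probs(X, B):
    """Posterior probabilities for the K-class logit model.
    B is (p + 1) x (K - 1): one column (intercept, slopes) per non-reference class."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    eta = X1 @ B                                          # N x (K - 1) linear predictors
    expo = np.exp(eta)
    denom = 1.0 + expo.sum(axis=1, keepdims=True)
    probs = np.column_stack([expo / denom, 1.0 / denom])  # last column is class K
    return probs                                          # each row sums to one
```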
4.4.1 Fitting Logistic Regression Models
Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of $G$ given $X$. Since $\Pr(G \mid X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate. The log-likelihood for $N$ observations is
$$\ell(\theta) = \sum_{i=1}^N \log p_{g_i}(x_i; \theta),$$
where $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$.
To maximize it, we set the derivatives to zero, $\partial\ell(\theta)/\partial\theta = 0$, and solve these score equations with the Newton-Raphson algorithm. The whole procedure is referred to as iteratively reweighted least squares (IRLS). For a detailed discussion, please refer to the textbook.
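A minimal IRLS sketch for the two-class case with $y_i \in \{0, 1\}$ (our own illustration, using the standard update $\beta^{\text{new}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$; the weight floor and simple convergence check are our choices):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression via iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])    # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))          # fitted probabilities
        W = np.clip(p * (1.0 - p), 1e-10, None)       # weights (diagonal of W)
        z = X1 @ beta + (y - p) / W                   # adjusted response
        beta_new = np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```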
Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.
4.4.2 Example: South African Heart Disease
4.4.3 Quadratic Approximations and Inference
The maximum-likelihood parameter estimates $\hat{\beta}$ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are $z_i = x_i^T\hat{\beta} + (y_i - \hat{p}_i)/(\hat{p}_i(1 - \hat{p}_i))$ and the weights are $w_i = \hat{p}_i(1 - \hat{p}_i)$, both depending on $\hat{\beta}$ itself. Apart from providing a convenient algorithm, this connection with least squares has more to offer:
The weighted residual sum-of-squares is the familiar Pearson chi-square statistic
$$\sum_{i=1}^N \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i(1 - \hat{p}_i)},$$
a quadratic approximation to the deviance;
Asymptotic likelihood theory says that if the model is correct, then $\hat{\beta}$ is consistent (i.e., converges to the true $\beta$);
A central limit theorem then shows that the distribution of $\hat{\beta}$ converges to $N(\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1})$. This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.
Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test which tests for inclusion of a term, and the Wald test which can be used to test for exclusion of a term. Neither of these require iterative fitting, and are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.
4.4.4 Regularized Logistic Regression
The $L_1$ penalty used in the lasso can be used for variable selection and shrinkage with any linear regression model. For logistic regression (in the two-class case, with $y_i \in \{0, 1\}$), we would maximize a penalized version of the log-likelihood:
$$\max_{\beta_0, \beta} \left\{ \sum_{i=1}^N \left[ y_i(\beta_0 + \beta^T x_i) - \log\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^p |\beta_j| \right\}.$$
As with the lasso, we typically do not penalize the intercept term, and we standardize the predictors for the penalty to be meaningful. A solution can be found using nonlinear programming methods.
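One simple way to approximate this maximizer is proximal gradient descent with soft-thresholding on the per-observation average of the criterion (so the $\lambda$ below is on a different scale from the display above). This is our own sketch, not the method the text refers to, and the fixed step size assumes standardized predictors:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient for L1-penalized logistic regression (intercept unpenalized)."""
    N, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        pr = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # fitted probabilities
        g0 = np.sum(pr - y) / N                          # gradient w.r.t. intercept
        g = X.T @ (pr - y) / N                           # gradient w.r.t. beta
        beta0 -= step * g0
        beta = soft_threshold(beta - step * g, step * lam)
    return beta0, beta
```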
4.4.5 Logistic Regression or LDA?
4.5 Separating Hyperplanes
The main idea of this class of methods is to construct hyperplane classifiers: linear decision boundaries that explicitly try to separate the data into different classes as well as possible.
The equation $f(x) = \beta_0 + \beta^T x = 0$ defines a hyperplane (or affine set) $L$. We have some properties:
$\beta^* = \beta / \|\beta\|$ is the unit vector normal to the hyperplane $L$, since for any two points $x_1$ and $x_2$ lying in $L$, $\beta^T(x_1 - x_2) = 0$;
For any point $x_0$ in $L$, $\beta^T x_0 = -\beta_0$;
The signed distance of any point $x$ to $L$ is given by $\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}(\beta^T x + \beta_0)$. Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x) = 0$.
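A tiny numerical check of the signed-distance property (our own illustration with an arbitrary hyperplane):

```python
import numpy as np

beta0, beta = -1.0, np.array([3.0, 4.0])         # hyperplane: beta0 + beta . x = 0

def signed_distance(x):
    """Signed distance from x to the hyperplane, f(x) / ||beta||."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)

x0 = np.array([1.0, -0.5])                       # beta . x0 = 1, so beta0 + beta . x0 = 0
print(signed_distance(x0))                       # 0: x0 lies on the hyperplane
print(signed_distance(x0 + 2 * beta / np.linalg.norm(beta)))  # 2: moved 2 units along the normal
```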
4.5.1 Rosenblatt's Perceptron Learning Algorithm
The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.
The goal is to minimize
$$D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i (x_i^T \beta + \beta_0),$$
where $\mathcal{M}$ indexes the set of misclassified points and the responses are coded $y_i \in \{-1, +1\}$. The algorithm in fact uses stochastic gradient descent to minimize this piecewise linear criterion, visiting the misclassified observations in sequence and updating $\beta$ and $\beta_0$ after each one.
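A compact sketch of the perceptron updates (the learning rate is fixed at 1 and the epoch-based stopping rule and data layout are our choices):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt's perceptron; y must be coded as -1/+1."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        misclassified = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:     # point is misclassified
                beta += yi * xi                   # gradient step on the criterion
                beta0 += yi
                misclassified += 1
        if misclassified == 0:                    # a separating hyperplane was found
            break
    return beta, beta0
```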
There are a number of problems with this algorithm, summarized in Ripley (1996):
When the data are separable, there are many solutions, and which one is found depends on the starting values.
The "finite" number of steps can be very large. The smaller the gap, the longer the time to find it.
When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
4.5.2 Optimal Separating Hyperplanes
To be continued