Linear Classifiers

Ahsan Ijaz

Classification Examples

  • Given the symptoms, identify the medical condition of the patient.
  • Detect online transaction fraud using the IP address, past transaction history, etc.
  • Classify DNA mutations as deleterious using DNA sequences from patients with and without the disease.

Webpage categorization

Spam detection

Mind reading

Personalized healthcare

Default of Payment example

  • Given the income and balance, will the person default on payment?

Linear Classification

  • Separation of blue and red data points
    • The feature selection task is the same as in regression
  • Two features are used in the plot below
(Figure: scatter plot of the two classes with a linear decision boundary)

Sentiment Analysis Example

Linear Classifier

Use training data to learn a coefficient (weight) for each word:

Word                                  Coefficient
good                                   0.5
great                                  1.0
exceptional                            1.5
awesome                                2.3
bad                                   -1.0
terrible                              -1.5
hideous                               -2.5
awful                                 -3.0
we, are, the, where, how, restaurant   0.0

Linear Classifier

  • Input: a sentence
  • Replace each word with its learned coefficient
  • Compute Score(x) as the weighted count of the words in the sentence
  • If Score(x) > 0, declare the sentence positive; otherwise declare it negative (a sketch of this procedure follows)
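
Below is a minimal sketch of this procedure in Python, using the coefficient table above; the lowercasing and whitespace tokenization are simplifying assumptions for illustration.

```python
# Minimal sketch of the word-count linear classifier described above.
# Coefficients come from the table; unlisted words implicitly get weight 0.
coefficients = {
    "good": 0.5, "great": 1.0, "exceptional": 1.5, "awesome": 2.3,
    "bad": -1.0, "terrible": -1.5, "hideous": -2.5, "awful": -3.0,
}

def score(sentence):
    # Score(x): sum of coefficients over the words of the sentence
    return sum(coefficients.get(w, 0.0) for w in sentence.lower().split())

def predict(sentence):
    # sign(Score(x)): +1 => positive sentence, -1 => negative sentence
    return +1 if score(sentence) > 0 else -1

print(predict("The food was great and the service was awesome"))  # +1
print(predict("The service was terrible"))                        # -1
```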

Decision Boundaries

For a linear classifier, the decision boundary (Score(x) = 0) is:

  • When 2 coefficients are non-zero:
    • A line
  • When 3 coefficients are non-zero:
    • A plane
  • When many coefficients are non-zero:
    • A hyperplane

Simple example of a linear classifier

Simple hyperplane

  • Model: \(\hat{y}_i = sign(\color{blue}{Score(x_i)})\)
  • \(\color{blue}{Score(x_i)} = w_0 + w_1x_i[1] + \cdots + w_dx_i[d] = \mathbf{w}^T\mathbf{x}_i\) (a numeric sketch follows this list)
  • feature 1 = 1 (constant)
  • feature 2 = #awesome
  • feature 3 = #awful
  • ...
  • feature d = #for
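
The same score can be computed as the dot product \(\mathbf{w}^T\mathbf{x}_i\); a short numpy sketch with illustrative weights and word counts:

```python
import numpy as np

# Illustrative values: w = (w_0, w_awesome, w_awful) and
# h(x) = (1, #awesome, #awful) for one review.
w = np.array([0.0, 2.3, -3.0])
h_x = np.array([1.0, 2.0, 1.0])   # constant 1, two "awesome", one "awful"

score = w @ h_x                    # Score(x) = w^T h(x) = 1.6
y_hat = np.sign(score)             # sign(Score(x)) = +1
print(score, y_hat)
```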

Probability of each class (confidence of each prediction)

  • The food was ok but the service was good. (+1 with prob 0.5)
  • The food and service both were amazing. (+1 with prob 0.9)
  • The food was awful, so was the service. (-1 with prob 0.9)

Generalized linear models

Logistic Regression

  • \(\text{sigmoid}(Score(x)) = \frac{1}{1+e^{-\mathbf{w}^Th(x)}}\)
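
As a quick numerical check of this formula, a sketch of the sigmoid mapping scores to probabilities (the sample scores are illustrative):

```python
import numpy as np

def sigmoid(score):
    # Squashes Score(x) from (-inf, +inf) into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-score))

for s in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"Score = {s:+.1f}  ->  P(y=+1|x) = {sigmoid(s):.3f}")
# Score 0 maps to 0.5: on the decision boundary the model is least confident.
```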

Multiclass classification: 1 versus all

  • \(\hat{P}_c(y=+1|x)\) = estimate of the 1-versus-all model for each class c

  • max_prob = 0; \(\hat{y} = 0\)

  • For c = 1, ..., C:

    • if \(\hat{P}_c(y=+1|x)\) > max_prob:
      • \(\hat{y}\) = c
      • max_prob = \(\hat{P}_c(y=+1|x)\)
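
A Python sketch of this prediction rule, where `models` is a hypothetical list whose c-th entry returns \(\hat{P}_c(y=+1|x)\):

```python
def predict_multiclass(models, x):
    # Mirrors the pseudocode above: pick the class whose 1-versus-all
    # model assigns the highest probability to y = +1.
    max_prob, y_hat = 0.0, 0
    for c, model in enumerate(models, start=1):  # c = 1, ..., C
        p = model(x)                             # P_c(y = +1 | x)
        if p > max_prob:
            y_hat, max_prob = c, p
    return y_hat
```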

Logistic regression parameter estimation

  • Maximize the likelihood function over all possible \(w_0, w_1, w_2\) values

\[ l(w_0,w_1,w_2) = \prod_{i = 1}^{n}P(y_i|x_i,\mathbf{w}) \]

  • No closed-form solution
  • Gradient ascent algorithm to find the solution
  • While not converged: \[ w^{(t+1)} = w^{(t)} + \zeta\,\frac{\partial l}{\partial w}\]
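
A minimal gradient-ascent sketch for logistic regression; the toy data, fixed step size, and iteration cap are illustrative assumptions (labels are coded 0/1 so the gradient takes the standard form \(\sum_i h(x_i)(y_i - \text{sigmoid}(\mathbf{w}^Th(x_i)))\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: each row is h(x) with a constant first feature; labels in {0, 1}.
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, 0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(X.shape[1])
zeta = 0.1                                # fixed step size
for t in range(1000):                     # "while not converged", capped
    grad = X.T @ (y - sigmoid(X @ w))     # dl/dw of the log-likelihood
    w = w + zeta * grad                   # w^{t+1} = w^t + zeta * dl/dw
print(w)
```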

Learning rate

  • Try several values (exponentially spaced)
    • find one \(\zeta\) that is too small
    • and another that is too large
  • Search between those two values.
  • Alternatively, use a step size that decreases with iterations: \(\zeta_t = \frac{\zeta_0}{t}\)
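
A sketch of the decaying schedule \(\zeta_t = \frac{\zeta_0}{t}\) plugged into the same gradient-ascent loop (data and constants are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0, 0.0], [1.0, 1.0, 0.0],
              [1.0, 0.0, 2.0], [1.0, 0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

zeta0, w = 1.0, np.zeros(X.shape[1])
for t in range(1, 1001):                  # start at t = 1 so zeta_0 / t is defined
    zeta_t = zeta0 / t                    # step size shrinks with iterations
    w += zeta_t * (X.T @ (y - sigmoid(X @ w)))
print(w)
```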