
#3: Linear Regression, Regularization, Bias-Variance

    Sep 9, 2021 06:36 PM

    Linear Regression

    Model Composition

    d-dimensional input vector x
    Learnable parameter vector θ
    Target function f: assumed linear
    Hypothesis function h(x) = θᵀx: any linear function of the parameters

    Loss Function (Error Metric)

    Measures the discrepancy between the model prediction and the actual value on the training set

    Residual Sum of Squares

    RSS(θ) = Σᵢ (yᵢ − θᵀxᵢ)²

    Adjust θ to minimize the loss
    1. Matrix representation of the hypothesis
      1. Subsume the intercept/bias into the parameter vector θ by augmenting the input vector with a 1: x = (1, x₁, …, x_d)ᵀ, so that h(X) = Xθ
    2. Compute the derivative of RSS(θ) with respect to θ and set it to zero
    3. Solve for the closed-form solution θ̂ = (XᵀX)⁻¹Xᵀy

    Making Prediction

    For a new observation x₀ (augmented with a leading 1), its prediction can be computed as ŷ₀ = θ̂ᵀx₀
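The closed-form fit and prediction above can be sketched in numpy. This is a minimal illustration with synthetic data; all variable names and the toy coefficients are made up for the example.

```python
import numpy as np

# Toy data: N samples, d features (values are illustrative only)
rng = np.random.default_rng(0)
N, d = 50, 3
X_raw = rng.normal(size=(N, d))
true_theta = np.array([2.0, -1.0, 0.5, 3.0])      # [bias, w1, w2, w3]
X = np.hstack([np.ones((N, 1)), X_raw])           # augment with a column of 1s
y = X @ true_theta + 0.01 * rng.normal(size=N)    # nearly noiseless targets

# Closed-form solution: theta_hat = (X^T X)^{-1} X^T y
# (solve the normal equations rather than inverting explicitly)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new observation x0, augmented the same way
x0 = np.array([1.0, 0.2, -0.3, 1.5])
y0_hat = x0 @ theta_hat
```

Using `np.linalg.solve` on the normal equations is numerically preferable to forming the inverse of XᵀX directly.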

    Model Analysis

    Significance of a Single Parameter

    σ² = variance of the observations, assumed uncorrelated with constant variance
    For any variable θⱼ, define the Z-score as zⱼ = θ̂ⱼ / (σ̂ √vⱼ)
    • vⱼ is the j-th diagonal element of (XᵀX)⁻¹
    Result: smaller |zⱼ| → less important variable
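The Z-score computation can be sketched as follows on synthetic data, where one feature is genuinely predictive and one is pure noise; the data-generating coefficients are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
theta_true = np.array([1.0, 2.0, 0.0])            # second feature is pure noise
y = X @ theta_true + rng.normal(size=N)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ theta_hat
p = X.shape[1] - 1                                # non-intercept features
sigma2_hat = residuals @ residuals / (N - p - 1)  # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))               # v_j: diagonal of (X^T X)^{-1}
z = theta_hat / np.sqrt(sigma2_hat * v)           # Z-score for each coefficient
```

The noise feature should come out with a Z-score near zero, flagging it as unimportant.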

    Significance of a Group of Coefficients

    F-statistic: measures the change in residual sum of squares between two nested models

    F = ((RSS₀ − RSS₁) / (p₁ − p₀)) / (RSS₁ / (N − p₁ − 1))

    • RSS₁ = residual sum of squares of the bigger model with p₁ + 1 parameters
    • RSS₀ = residual sum of squares for the nested smaller model with p₀ + 1 parameters
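A sketch of the F-statistic for two nested models, on synthetic data where the extra features of the bigger model carry no signal; the setup and dimensions are assumptions of the example.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ theta
    return r @ r

rng = np.random.default_rng(2)
N = 100
X_big = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 4))])  # p1 = 4 features
y = X_big[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)
X_small = X_big[:, :3]                                         # nested: p0 = 2 features

p1, p0 = 4, 2
rss1, rss0 = rss(X_big, y), rss(X_small, y)
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
```

Since the extra two features are noise here, F should be small; a large F would indicate the bigger model explains significantly more variance.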

    Regularization (Shrinkage Methods)

    Machine learning = a minimization problem on the training set ⇒ prone to overfitting
    To combat overfitting, add to the loss function a regularization term that describes the model's complexity
    Regularized Loss Function = Loss Function + Penalty Term


    • λ = regularization parameter
    • R(θ) = penalty weight associated with the variables
      • generally taken to be an Lq-norm of the coefficients


    • Help adjust complexity of hypothesis space
    • Balance fitness and generalizability

    Ridge Regression

    Implementation of Regularization
    • Penalty = squared magnitude (L2-norm) of the coefficients: λ Σⱼ θⱼ²


    • Shrinks the coefficients toward zero, but never exactly to zero
    • Confines the hypothesis space ⇒ makes it smaller than the space of all linear functions
    • Output is non-sparse


    • Minimize Σᵢ (yᵢ − θᵀxᵢ)²
    • Subject to the constraint Σⱼ θⱼ² ≤ t
    Rewritten via a Lagrange multiplier as an unconstrained penalized loss

    Loss Function

    J(θ) = ‖y − Xθ‖² + λ‖θ‖₂²

    Closed Form Solution

    θ̂ = (XᵀX + λI)⁻¹Xᵀy
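The ridge closed-form solution can be sketched directly in numpy; the data here are synthetic, and for simplicity the intercept is omitted (in practice the intercept is usually left unpenalized).

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

lam = 1.0
# Ridge closed form: theta = (X^T X + lam*I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# Ordinary least squares for comparison
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Adding λI makes the system invertible even when XᵀX is singular, and the resulting coefficient vector always has L2-norm no larger than the OLS solution, illustrating the shrinkage effect.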

    LASSO (Least Absolute Shrinkage and Selection Operator) Regression

    Implementation of Regularization
    • Penalty = absolute value (L1-norm) of the coefficients: λ Σⱼ |θⱼ|


    • Penalizes insignificant coefficients to exactly zero
      • ⇒ a feature-selection method that removes useless coefficients
    • Prefers sparsity: fewer terms ⇒ better
      • With a lot of parameters, only some of them have predictive power
    • Output is sparse: some coefficients are left out


    • Shrinkage factor t: the bound in the constraint Σⱼ |θⱼ| ≤ t; smaller t ⇒ stronger shrinkage
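LASSO has no closed form, but a standard way to solve it is coordinate descent with soft-thresholding. This is a minimal sketch (not a production solver); the data and the choice λ = 0.5 are assumptions of the example.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Minimize (1/2N)||y - X theta||^2 + lam*||theta||_1 by coordinate descent."""
    N, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / N
    for _ in range(n_iters):
        for j in range(d):
            r = y - X @ theta + X[:, j] * theta[j]   # partial residual excluding j
            rho = X[:, j] @ r / N
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

rng = np.random.default_rng(4)
N, d = 100, 6
X = rng.normal(size=(N, d))
# Only the first two features matter; LASSO should zero out the rest
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)
theta = lasso_cd(X, y, lam=0.5)
```

The soft-threshold step is what produces exact zeros, which is why LASSO output is sparse while ridge output is not.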

    Bias-Variance Analysis


    Express the out-of-sample error E_out in a way that helps choosing the hypothesis space H
    • How well can H approximate the target f (bias)
    • How well we can zoom in on a good hypothesis in H (variance)
    • A theoretical process: f is not accessible in practice


    • Dependence of the final hypothesis g⁽ᴰ⁾ on the data set D
    • Take the expectation over D, based on the distribution of data sets
    • ḡ(x) = E_D[g⁽ᴰ⁾(x)]: the average hypothesis over multiple draws of D


    🚫 : high bias | low variance
    ✅ : low bias | high variance
    Which one is better?
    Match the model complexity to the data resources, NOT to the target complexity ⇒ better generalizability. Stick to the simplest answer (Occam's Razor)!
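The bias-variance trade-off above can be simulated with the classic example of fitting f(x) = sin(πx) from data sets of only two points, comparing a constant hypothesis set against a linear one. This is a Monte Carlo sketch; the trial count and grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)
x_grid = np.linspace(-1, 1, 200)
trials = 2000

g_const = np.empty((trials, x_grid.size))
g_line = np.empty((trials, x_grid.size))
for t in range(trials):
    x = rng.uniform(-1, 1, size=2)       # a fresh data set D of just 2 points
    y = f(x)
    g_const[t] = np.mean(y)              # H0: h(x) = b (constant)
    a, b = np.polyfit(x, y, 1)           # H1: h(x) = ax + b (line through both)
    g_line[t] = a * x_grid + b

def bias_var(g):
    g_bar = g.mean(axis=0)               # the average hypothesis g-bar
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean((g - g_bar) ** 2)
    return bias, var

bias0, var0 = bias_var(g_const)
bias1, var1 = bias_var(g_line)
```

With only two data points, the richer linear model has lower bias but much higher variance, and its total expected error comes out worse: a concrete case of matching complexity to data resources.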

    Multi-Class Classification

    Generalize logistic regression to classify K classes

    Likelihood for a Single Sample (Softmax Function)

    P(y = k | x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)

    The softmax function turns a vector of K real values into a vector of K probabilities that sum to 1.

    Likelihood of the Data

    L(θ) = Πᵢ P(yᵢ | xᵢ), the product over training samples of the softmax probability of the true class

    Loss Function (Cross-Entropy Loss)

    J(θ) = −Σᵢ Σⱼ yᵢⱼ log ŷᵢⱼ

    • ŷᵢⱼ = the j-th component of the softmax output for sample i; yᵢⱼ = 1 if sample i belongs to class j, else 0
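The softmax and cross-entropy formulas above can be sketched in numpy; the logits and labels are made-up example values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_onehot, p):
    # J = -sum_j y_ij * log(p_ij), averaged over samples; small epsilon avoids log(0)
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=-1))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.0]])
p = softmax(logits)                        # each row sums to 1
y = np.array([[1, 0, 0],                   # one-hot labels: class 0, then class 1
              [0, 1, 0]], dtype=float)
loss = cross_entropy(y, p)
```

Subtracting the row maximum before exponentiating leaves the softmax output unchanged but prevents overflow for large logits.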