
#4: Logistic Regression & Gradient Descent

    Created: Sep 14, 2021 04:39 PM

    โ“ Why Not Linear Regression

    Poor performance w/ classification problems
    • Need to set an arbitrary threshold
    • Outliers significantly reduce power of model
    • Hypothesis' output value can be > 1 or < 0

    📈 Logistic Regression Model

    Usage: Classification
    • predict the probability that an observation belongs to one of two possible classes
    💡
    Similar to linear regression, but outputs a probability (or a True/False classification)

    Properties

    Classification algorithm: outputs are bounded, $0 \le h_\theta(x) \le 1$

    Hypothesis Representation

    💡
    Maps the linear regression output into $(0, 1)$ by nesting it inside a sigmoid function: $h_\theta(x) = g(\theta^T x)$

    Logistic (Sigmoid) Function

    $g(z) = \dfrac{1}{1 + e^{-z}}$
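    A minimal NumPy sketch of the sigmoid and the hypothesis $h_\theta(x) = g(\theta^T x)$; the function names are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for each row x of X (X: n_samples x n_features)."""
    return sigmoid(X @ theta)
```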

    Soft Threshold (converting the signal into a probability)

    $h_\theta(x) = g(s)$, where the signal $s = \theta^T x$ is softly thresholded into a value in $(0, 1)$

    Why Sigmoid

    • Smooth
      • Easy to compute the derivative / gradient (see the quick check below)
    • Non-linear: can model more complex relations
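    The "easy derivative" point can be made concrete: $g'(z) = g(z)\,(1 - g(z))$. A small sketch checking this identity against a finite-difference approximation (the test point is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7                                                    # arbitrary test point
analytic = sigmoid(z) * (1.0 - sigmoid(z))                 # g'(z) = g(z) * (1 - g(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central finite difference
print(abs(analytic - numeric) < 1e-8)                      # True
```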

    Interpretation of Hypothesis Output

    $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$

    Target Function

    $h_\theta(x) = P(y = 1 \mid x; \theta)$: the probability that $y = 1$ given $x$, parametrized by $\theta$
    • The data does not give us explicit probabilities
    • It only provides samples generated from that distribution
    • Pick $h_\theta(x) = g(\theta^T x)$ to approximate this target
    As a binary classification problem (probabilities sum to 1): $P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$

    Decision Boundary

    A property of the hypothesis function & parameters
    • Predict $y = 1$ if $h_\theta(x) \ge 0.5$
    • Predict $y = 0$ if $h_\theta(x) < 0.5$
    Logistic function: input ≥ 0 ⇒ output ≥ 0.5
    Therefore, predict $y = 1$ exactly when $\theta^T x \ge 0$; the decision boundary is the set where $\theta^T x = 0$
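    A small sketch of this decision rule; since $g(s) \ge 0.5$ exactly when $s \ge 0$, thresholding the probability at 0.5 is the same as thresholding the signal $\theta^T x$ at 0. The helper name `predict` is illustrative.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold.

    For threshold 0.5 this is equivalent to checking theta^T x >= 0.
    """
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs >= threshold).astype(int)
```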

    Non-Linear Decision Boundaries

    💡
    Use higher-order polynomials to classify data with complex geometric shapes

    Example from Intro2ML

    [Figure: data that is not linearly separable in the original input space]
    Apply a feature transformation
    [Figure: the same data after the transformation]
    Then we can create a hyperplane in the transformed space to separate the data for classification (a sketch of one possible transformation follows below)
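    A hedged sketch of one such transformation, assuming two input features and a quadratic mapping; the exact transformation used in the lecture figures is not recoverable here, so this mapping is only illustrative.

```python
import numpy as np

def quadratic_features(X):
    """Map (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1 * x2).

    A boundary that is linear in the new features can be a non-linear
    (e.g. elliptical) boundary in the original (x1, x2) space.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
```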

    Example from Andrew Ng

    [Figure: non-linear decision boundary example]
    ⚠️
    Difficulty: need to come up w/ the transformation before inspecting the data

    Method to Find Best-Fit Line

    Calculates maximum likelihood
    1. Pick candidate parameters and compute the probability the model assigns to each observation
    2. Calculate the likelihood of the data under those parameters, and keep the parameters that maximize it

    Loss Function

    Probability that the predicted value is correct: $P(y \mid x; \theta) = h_\theta(x)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1 - y}$
    Compute the likelihood of the i.i.d. training data: $L(\theta) = \prod_{i=1}^{n} P\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr)$

    Goal: Maximum Likelihood Estimation

    Adjust the parameter $\theta$ to maximize the likelihood $L(\theta)$

    In Practice

    Take the negative log of the likelihood and minimize it instead
    • Equivalent, since $-\ln(\cdot)$ is monotonically decreasing

    Cross-Entropy Loss

    For labels $y \in \{0, 1\}$: $E(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\bigl[\, y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\bigl(1 - h_\theta(x^{(i)})\bigr) \bigr]$
    Equivalently, with labels $y_n \in \{-1, +1\}$: $E(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ln\bigl(1 + e^{-y_n \theta^T x_n}\bigr)$
    This error measure is small when $y_n \theta^T x_n$ is large and positive
    • pushes $\theta$ to classify each $x_n$ correctly
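    A minimal NumPy sketch of the $\{0, 1\}$-label cross-entropy above; the small `eps` clamp is an added numerical-stability detail, not part of the notes.

```python
import numpy as np

def cross_entropy_loss(theta, X, y, eps=1e-12):
    """E(theta) = -(1/n) * sum_i [ y_i * ln h(x_i) + (1 - y_i) * ln(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)  # keep the logs finite
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```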

    Logistic Regression Algorithm w/ Gradient Descent

    [Figure: the logistic regression training loop with gradient descent]
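    Since the algorithm figure did not survive the export, here is a hedged sketch of logistic regression trained with batch gradient descent, using the standard cross-entropy gradient $\nabla E(\theta) = \frac{1}{n} X^T\bigl(h_\theta(X) - y\bigr)$; the function name, zero initialization, and fixed iteration count are illustrative choices, not the lecture's.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy loss, labels y in {0, 1}."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)                 # start from an initial parameter value
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for every training sample
        grad = X.T @ (h - y) / n_samples         # gradient of the cross-entropy loss
        theta -= lr * grad                       # step in the direction of steepest descent
    return theta
```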

    ⌛ Iterative Optimization

    Minimize the loss function → make the model more useful

    📦 Batch Gradient Descent

    An optimization technique used to train logistic regression & other learning algorithms
    💡
    Sliding downhill: progressively modifies the parameters in a way that decreases the error
    • Solving for the minimum directly (e.g., using 2nd derivatives) is almost impossible in most cases
    • Gradient descent provides a close estimate iteratively

    Procedure

    1. Start with an initial value of the parameters $\theta^{(0)}$
    2. Compute the direction in which the error decreases fastest (the negative gradient)
    3. Update the parameters by a step in that direction (see the small sketch after this list)
        • $\eta$: learning rate (how big a step we take)
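    To make the procedure concrete independent of logistic regression, a tiny sketch that minimizes a simple quadratic; the objective, starting point, and learning rate are arbitrary illustrative choices.

```python
def gradient_descent(grad, theta0, lr=0.1, n_iters=100):
    """Generic update rule: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - lr * grad(theta)
    return theta

# Example: minimize E(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0))  # approaches 3.0
```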

    Formula

    $\theta \leftarrow \theta - \eta\, \nabla E(\theta)$
    Insights

    • Take a step in the direction of steepest descent to gain the biggest decrease of $E$
    Using a first-order Taylor expansion, the change in error for a unit step $\hat{v}$ of size $\eta$ is $\Delta E \approx \eta\, \nabla E(\theta)^{T} \hat{v} \ge -\eta\, \lVert \nabla E(\theta) \rVert$
    Since $\hat{v}$ is a unit vector, equality holds iff $\hat{v} = -\dfrac{\nabla E(\theta)}{\lVert \nabla E(\theta) \rVert}$

    Choosing Step Size

    [Figure: effect of the step size on convergence]
    💡
    Large step size when far away from a local minimum; small step size when close to it
    Simple heuristic: scale the step size with the gradient norm, $\eta_t = \eta\, \lVert \nabla E(\theta) \rVert$
    ⇒ the fixed learning rate algorithm, $\Delta\theta = -\eta\, \nabla E(\theta)$ (a quick numerical check of this equivalence follows below)
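    A tiny check of that equivalence: stepping along the unit steepest-descent direction with a step size proportional to the gradient norm gives exactly the fixed-learning-rate update. The example gradient and learning rate are arbitrary.

```python
import numpy as np

grad = np.array([0.8, -0.6])   # example gradient at the current parameters
eta = 0.1                      # base learning rate

# Unit steepest-descent direction, with step size eta * ||grad||
v_hat = -grad / np.linalg.norm(grad)
step_scaled = (eta * np.linalg.norm(grad)) * v_hat

# The familiar fixed-learning-rate update
step_fixed = -eta * grad

print(np.allclose(step_scaled, step_fixed))  # True: the two updates coincide
```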

    🥚 Stochastic Gradient Descent

    💡
    Perform each gradient descent update with a single randomly chosen sample, not the entire training set
    1. Randomly pick ONE training sample
    2. Compute the gradient of the loss function w/ this sample
    3. Update the weights (see the sketch after this list)
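    A hedged sketch of one stochastic update, assuming the same cross-entropy loss as above; in practice this single-sample step runs inside a loop over many epochs, and the function name is illustrative.

```python
import numpy as np

def sgd_step(theta, X, y, lr=0.01, rng=None):
    """One stochastic gradient descent step (single random sample)."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(y))                      # 1. randomly pick ONE training sample
    h_i = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))   # 2. its predicted probability
    grad = (h_i - y[i]) * X[i]                    #    gradient of that sample's loss
    return theta - lr * grad                      # 3. update the weights
```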

    🧺 Mini-Batch Gradient Descent

    💡
    In between batch and stochastic gradient descent
    1. Randomly pick $m$ training samples (a mini-batch)
    2. Compute the gradient of the loss associated with this mini-batch
    3. Update the weights (see the sketch after this list)
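    A hedged sketch of one mini-batch update; averaging the per-sample gradients over the mini-batch and the batch size of 32 are conventional choices assumed here, not taken from the lecture.

```python
import numpy as np

def minibatch_step(theta, X, y, m=32, lr=0.01, rng=None):
    """One mini-batch gradient descent step using m randomly chosen samples."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(y), size=m, replace=False)  # 1. randomly pick m training samples
    Xb, yb = X[idx], y[idx]
    h = 1.0 / (1.0 + np.exp(-(Xb @ theta)))          # predicted probabilities for the batch
    grad = Xb.T @ (h - yb) / m                       # 2. gradient of the mini-batch loss
    return theta - lr * grad                         # 3. update the weights
```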

    Comparing Gradient Descent Variants

    [Figure: comparison of the gradient descent variants]
    Gradient Descent Variants
    Name | Input at Each Step | Pro | Con
    Batch | Entire training set | Good for convex optimization; smooth updates | Expensive per step on a large training set
    Mini-batch | A mini-batch | In between | In between
    Stochastic | One data point | Better at skipping local extrema | Noisy & shaky

    🤔 Questions

    1. How does gradient descent deal with local maxima / saddle points?

    Resources

    🧬
    The Ultimate Guide to Logistic Regression for Machine Learning