
#5-6: Bayesian Machine Learning


    Validation

    For any hypothesis h:
    • E_out(h): estimated by the validation error E_val(h)

    Formula: E_val(h) = (1/K) Σ_{(x_n, y_n) ∈ D_val} e(h(x_n), y_n)

    Dataset D of size N, split into:
    • Training set D_train: N − K samples
    • Validation set D_val: K samples
    h⁻ = the hypothesis selected after training on D_train
    Expected value of validation error: E[E_val(h⁻)] = E_out(h⁻)
    This is because each point-wise error e(h⁻(x_n), y_n) has expectation E_out(h⁻)
    Thus we have:
    • Expected value of validation error closely matches the out-of-sample error E_out(h⁻)
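    A worked version of that step, under the usual assumption that the validation points are drawn independently of D_train (the E_val / E_out notation is assumed, since the original inline equations did not export):

```latex
E_{\text{val}}(h^-) = \frac{1}{K} \sum_{k=1}^{K} e\big(h^-(x_k), y_k\big),
\qquad
\mathbb{E}\big[E_{\text{val}}(h^-)\big]
= \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big[e(h^-(x_k), y_k)\big]
= E_{\text{out}}(h^-)
```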

    Choosing Validation Set Size

    K ⬆️ ⇒ variance of E_val ⬇️
    K too large ⇒ training set size N − K too small
    💡
    Practical Rule: use 20% of N as the validation set size K
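    A minimal sketch of the 80/20 split and the resulting validation-error estimate, assuming numpy arrays and a generic `model` with scikit-learn-style `fit`/`predict` (the names here are illustrative, not from the original notes):

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data as the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    k = int(len(X) * val_fraction)          # K = 20% of N (practical rule)
    val_idx, train_idx = idx[:k], idx[k:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

def validation_error(model, X_tr, y_tr, X_val, y_val):
    """Train on the N-K training samples, estimate E_out on the K held-out samples."""
    model.fit(X_tr, y_tr)
    return np.mean(model.predict(X_val) != y_val)   # 0/1 classification error
```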

    Fold-Back-In Validation


    Procedure: Make the Most of the Data

    1. Separate the training & validation set
    2. Tune the model to find the best hyper-parameters
    3. Put the validation set back in and train one last time on the full dataset
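    A sketch of that procedure under the same assumed `fit`/`predict` interface: tune on the split, then retrain the winning configuration on all of the data.

```python
import numpy as np

def fold_back_in(make_model, hyperparams, X_tr, y_tr, X_val, y_val):
    """Pick the best hyper-parameter on the validation set,
    then retrain on training + validation data."""
    best_hp, best_err = None, float("inf")
    for hp in hyperparams:                       # step 2: tune
        model = make_model(hp)
        model.fit(X_tr, y_tr)
        err = np.mean(model.predict(X_val) != y_val)
        if err < best_err:
            best_hp, best_err = hp, err
    final = make_model(best_hp)                  # step 3: fold the validation set back in
    final.fit(np.concatenate([X_tr, X_val]), np.concatenate([y_tr, y_val]))
    return final, best_hp
```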

    Model Selection

    💡
    Use the same validation set multiple times without losing guarantees
    Assume you have M models (hypothesis sets): H_1, …, H_M
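    A hedged note on why the guarantee survives reusing the validation set for M models: the usual Hoeffding-style union bound only grows logarithmically in M (a standard result, not stated explicitly in these notes):

```latex
E_{\text{out}}(h_{m^*})
\;\le\;
E_{\text{val}}(h_{m^*}) + O\!\left(\sqrt{\frac{\ln M}{K}}\right)
```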

    Cross Validation

    Small K: E_val is a noisy (high-variance) estimate of E_out
    Large K: too few samples (N − K) are left for training

    Leave One Out Analysis

    1. Leave 1 sample out, use the rest to train the model
    2. Compute the model error on the 1 held-out sample
    3. Repeat steps 1 & 2 for all N samples
    4. Compute the Cross Validation Error E_cv = (1/N) Σ_n e_n
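    A minimal sketch of leave-one-out analysis, again assuming numpy arrays and a scikit-learn-style model factory (illustrative names):

```python
import numpy as np

def leave_one_out_error(make_model, X, y):
    """Train N times, each time leaving one sample out, and average the errors."""
    n = len(X)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i             # step 1: leave sample i out
        model = make_model()
        model.fit(X[mask], y[mask])
        errors[i] = model.predict(X[i:i+1])[0] != y[i]   # step 2: error on the held-out sample
    return errors.mean()                     # step 4: cross validation error E_cv
```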

    Pros & Cons

    Pro: far more accurate estimate of E_out
    Con: very expensive
    • The model is trained N times

    Model Selection with Cross Validation

    1. Define M models by choosing different values of the hyper-parameter: H_1, …, H_M
    2. For each model H_m: run cross validation to get an estimate of its cross validation error E_cv(m)
    3. Pick the model m* with the smallest cross validation error
    4. Train the model H_{m*} on the entire training set to obtain the final hypothesis
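    A sketch of that loop, reusing `leave_one_out_error` from the sketch above and assuming each hyper-parameter value defines one candidate model:

```python
def select_model(make_model, hyperparams, X, y):
    """Pick the hyper-parameter with the smallest leave-one-out error,
    then train on the entire training set."""
    cv_errors = {hp: leave_one_out_error(lambda hp=hp: make_model(hp), X, y)
                 for hp in hyperparams}
    best_hp = min(cv_errors, key=cv_errors.get)
    final = make_model(best_hp)
    final.fit(X, y)                          # final hypothesis trained on all data
    return final, best_hp, cv_errors
```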

    Bayesian Decision Theory

    Bayes Rule

    P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)

    Given

    • State of nature ω: the classes ω_1, ω_2, …
    • Prior probabilities P(ω_1), P(ω_2), …
    • Class conditional probability (likelihood): p(x | ω_1), p(x | ω_2), …

    To Find: Posterior Probability P(ω_j | x)

    Error

    P(error | x) = P(ω_1 | x) if we decide on ω_2
    P(error | x) = P(ω_2 | x) if we decide on ω_1
     
    Evidence p(x) is just a scaling factor ⇒ Pick ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), otherwise pick ω_2
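    A tiny numeric sketch of the two-class decision (the prior and likelihood values are made up for illustration):

```python
import numpy as np

priors = np.array([0.7, 0.3])          # P(w1), P(w2)  (illustrative numbers)
likelihoods = np.array([0.2, 0.6])     # p(x | w1), p(x | w2) for some observed x

evidence = np.sum(likelihoods * priors)            # p(x), a scaling factor
posteriors = likelihoods * priors / evidence       # P(w_j | x)

decision = np.argmax(likelihoods * priors) + 1     # evidence not needed for the decision
print(posteriors, "-> decide w%d" % decision)      # [0.4375 0.5625] -> decide w2
```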
     

    Conditional Risk

    Multi-dimensional observations are represented as a feature vector x ∈ R^d

    Bayes Theorem: P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)

    Evidence: p(x) = Σ_j p(x | ω_j) P(ω_j)

    Conditional Risk

    Expected loss associated with taking action α_i when the true state of nature is ω_j: R(α_i | x) = Σ_j λ(α_i | ω_j) P(ω_j | x)

    Bayes Decision Rule

    🎯
    To minimize the overall risk, compute R(α_i | x) for every action α_i and take the action that minimizes the conditional risk for the observed x
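    A sketch of the rule with an explicit loss matrix λ(α_i | ω_j) (the matrix values are illustrative, continuing the numeric example above):

```python
import numpy as np

# loss[i, j] = loss of taking action a_i when the true state is w_j (illustrative values)
loss = np.array([[0.0, 10.0],    # action a1: correct if w1, costly if w2
                 [1.0,  0.0]])   # action a2: small loss if w1, correct if w2

posteriors = np.array([0.4375, 0.5625])          # P(w_j | x) from the example above
cond_risk = loss @ posteriors                    # R(a_i | x) = sum_j loss[i, j] * P(w_j | x)
best_action = np.argmin(cond_risk)               # Bayes decision rule: minimize conditional risk
print(cond_risk, "-> take action a%d" % (best_action + 1))   # [5.625 0.4375] -> a2
```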

    Frequentist vs Bayesian Approach

    Frequentist

    Probability = relative frequency of events over the long run
    Repeat the experiment & count the number of times the event occurs
    Information is incomplete, so there is uncertainty
    • The parameter is fixed but unknown
    Goal: find the parameters that best explain the data
    Implementation: MLE (maximum likelihood estimation)
     

    Bayesian

    Probability = measure of confidence or belief in the occurrence of an event
    Events cannot necessarily be repeated
    Information complete; no uncertainty
    Goal: find a distribution over the parameters to quantify uncertainty

    Bayesian Parameter Estimation

    1. Start with a prior distribution p(θ)
      • p(θ) = prior knowledge about the parameters before looking at the data
    2. Given a data set D = {x_1, …, x_N}
    3. Use Bayes' Theorem to find the posterior distribution of θ given D (formula below)
      • Denominator = distribution of the data (the evidence p(D))
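    A hedged reconstruction of the missing formula (assuming i.i.d. samples, so the likelihood factorizes):

```latex
p(\theta \mid D)
= \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
= \frac{p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)}
       {\int p(\theta') \prod_{i=1}^{N} p(x_i \mid \theta')\, d\theta'}
```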

    Example: Logistic Regression

    ⬆️ The product term in the posterior is the likelihood of the observed labels (sketched below)
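    A sketch of that posterior over the weights w, with σ the sigmoid and a prior p(w) assumed (the exact form in the original slides may differ):

```latex
p(w \mid D) \;\propto\; p(w) \prod_{i=1}^{N}
\sigma(w^\top x_i)^{y_i}\,\big(1 - \sigma(w^\top x_i)\big)^{1-y_i}
```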
     
     
    Maximum A Posteriori Estimation
    Bayesian Linear Regression
    Generative vs Discriminative Models

    How to Choose a Prior

    • Objective: non-informative prior chosen by a formal rule, lets the data dominate
    • Subjective: encodes prior domain knowledge or belief
    • Conjugate: chosen so that the posterior stays in the same family as the prior (see the sketch below)
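    A standard example of conjugacy, added here for illustration (Beta prior with a Bernoulli likelihood; not from the original notes):

```latex
\theta \sim \mathrm{Beta}(a, b), \quad
x_i \sim \mathrm{Bernoulli}(\theta)
\;\Rightarrow\;
p(\theta \mid D) = \mathrm{Beta}\Big(a + \textstyle\sum_i x_i,\; b + N - \textstyle\sum_i x_i\Big)
```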
     

    ?? Learning

    Prior = posterior of the previous iteration
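    A minimal sketch of that idea using the Beta–Bernoulli pair from above: after each batch, the posterior becomes the prior for the next batch.

```python
def sequential_beta_update(batches, a=1.0, b=1.0):
    """Recursive Bayesian updating for a coin bias: prior Beta(a, b),
    each batch of 0/1 observations turns the posterior into the next prior."""
    for batch in batches:
        heads = sum(batch)
        a, b = a + heads, b + len(batch) - heads   # posterior of this step = prior of the next
    return a, b                                    # final posterior Beta(a, b)

# e.g. sequential_beta_update([[1, 0, 1], [1, 1]]) -> (5.0, 2.0)
```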
     

    Naive Bayes Classifier

    Assumes all features are conditionally independent of the other features given the class y
    P(x_i | x_1, …, x_{i−1}, y) = P(x_i | y), i.e. each feature only depends on y
     
    P(x, y) = P(y) ∏_i P(x_i | y)
     
    Reduces the number of parameters from exponential in d to O(d) (for d binary features)
     
    Naive Bayes is a Generative Model
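    A sketch of a Bernoulli naive Bayes classifier for binary features (a generic illustrative implementation, not tied to any particular library):

```python
import numpy as np

class BernoulliNaiveBayes:
    """Generative model: estimates P(y) and P(x_i | y) for binary features,
    classifies with argmax_y P(y) * prod_i P(x_i | y)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.log_prior = np.log([np.mean(y == c) for c in self.classes])
        # Laplace-smoothed per-feature probabilities P(x_i = 1 | y = c)
        self.theta = np.array([(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
                               for c in self.classes])
        return self

    def predict(self, X):
        # log P(y) + sum_i log P(x_i | y), computed for every class
        log_lik = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]
```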

    Discriminative vs Generative Model


    Discriminative

    Models P(y | x) directly
    • Models the decision boundary itself
    • Simpler
    • Cannot derive the generative model by Bayes Theorem
      • Because p(x) is required but impossible to find
      • P(y | x) contains a lot less information than the joint P(x, y)

    Generative

    Models P(x | y) and P(y) (i.e. the joint P(x, y))
    • Harder to model the data itself, but more useful
    Find P(y | x) by Bayes Theorem
    • Since p(x) is the same for every class y, it is just a constant normalizer and can be disregarded in computation
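    A short reconstruction of that point in the standard Bayes-theorem form:

```latex
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} \;\propto\; P(x \mid y)\,P(y),
\qquad
\hat{y} = \arg\max_{y} P(x \mid y)\,P(y)
```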