PRML

Pattern Recognition and Machine Learning

1. Introduction

  • Why generalization is a central goal in pattern recognition
    • Because the training data can comprise only a tiny fraction of all possible input vectors
  • Role of the pre-processing stage (sometimes also called feature extraction)
    • greatly reduces the variability within each class (shrinking the space to be handled)
    • speeds up computation
  • The three main types of learning
    • supervised learning
    • unsupervised learning
    • reinforcement learning
  • Supervised learning can be further divided into
    • classification
      • the aim is to assign each input vector to one of a finite number of discrete categories
    • regression
      • the desired output consists of one or more continuous variables
  • Unsupervised learning can be further divided into
    • clustering
      • discover groups of similar examples within the data
    • density estimation
      • determine the distribution of data within the input space
    • visualization
      • project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization
  • reinforcement learning
    • finding suitable actions to take in a given situation in order to maximize a reward.
    • Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error.
    • A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results.
  • The three main tools used throughout the book
    • probability theory
    • decision theory
    • information theory

1.1. Example: Polynomial Curve Fitting

  • Role of probability theory
    • provides a framework for expressing such uncertainty in a precise and quantitative manner
  • Role of decision theory
    • allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria
  • A training set / validation set split is used to determine the model's complexity settings, i.e. to optimize the model complexity
  • The main line of thought in this section:
    • Use the polynomial curve fitting problem to introduce the idea of model selection
    • The phenomenon of over-fitting appears
    • To compare errors fairly across different data set sizes, the root-mean-square error is used as the error measure
    • To overcome over-fitting, regularization is introduced
    • Finally, the concept of a validation set is introduced, used to optimize the model complexity (a small sketch follows)
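
Below is a minimal NumPy sketch (my own, not from the book) of the ideas above: fit polynomials of different order M to noisy samples of sin(2πx), compare root-mean-square error on held-out data, and see how a small L2 penalty tames the over-fitting of the high-order model. All constants (noise level, data sizes, lambda) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # noisy samples of sin(2*pi*x)
    return x, t

def design_matrix(x, M):
    # Polynomial features 1, x, x^2, ..., x^M
    return np.vander(x, M + 1, increasing=True)

def fit(x, t, M, lam=0.0):
    # (Regularized) least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    Phi = design_matrix(x, M)
    return np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

def rms_error(w, x, t, M):
    # Root-mean-square error, i.e. sqrt(2 E(w) / N) for the sum-of-squares error E
    y = design_matrix(x, M) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in (1, 3, 9):
    w = fit(x_train, t_train, M)
    print(f"M={M}: train RMS={rms_error(w, x_train, t_train, M):.3f}, "
          f"test RMS={rms_error(w, x_test, t_test, M):.3f}")

# A small L2 penalty (value chosen by hand) pulls the M=9 fit back under control
w_reg = fit(x_train, t_train, 9, lam=1e-6)
print("M=9 with regularization: test RMS =", round(rms_error(w_reg, x_test, t_test, 9), 3))
```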

1.2. Probability Theory

  • What this part covers
    • Introduces the concepts of joint, marginal, and conditional probability
    • Gives the sum rule and the product rule
    • Uses the symmetry property of the joint probability to derive Bayes' theorem from the product rule
    • Marginalizing one variable out of Bayes' theorem gives the sum rule expressed with conditional probabilities, i.e. equation (1.13)
    • Introduces the concepts of prior probability and posterior probability
    • Gives the condition for two variables to be independent

1.2.1 Probability densities

  • What this part covers
    • The preceding discussion concerned discrete variables; the focus now shifts to continuous variables
    • Introduces the concept of a probability density
    • For a change of variables applied to a probability density, gives the Jacobian-factor formula
    • Introduces the cumulative distribution function
    • Introduces the joint probability density
    • Gives the sum rule and product rule in probability-density form

1.2.2 Expectations and covariances

  • What this part covers
    • Introduces the expectation, covering the discrete, continuous, and finite-sample (finite number of points) cases
    • Introduces expectations of functions of several variables
    • Introduces the conditional expectation
    • Introduces the variance
    • Note that the concepts above are all defined for some function f(x)
    • Introduces the covariance, in both scalar and vector form; this is defined for the variables x and y themselves rather than for a function

1.2.3 Bayesian probabilities

  • This part is extremely important

  • What does it cover?

    • What is the frequentist interpretation of probability?
    • What is the Bayesian perspective on probability?
    • Gives Bayes' theorem in terms of prior, posterior, and likelihood, as well as Bayes' theorem in words
    • Focuses on the likelihood function, what it means, and how frequentists and Bayesians view it differently
    • Notes that a widely used frequentist estimator is maximum likelihood, which leads to the concept of an error function, and further to the bootstrap for obtaining error bars on parameter estimates
    • Mentions that one advantage of the Bayesian viewpoint is that prior knowledge is incorporated naturally
  • Everything up to this point used the frequentist interpretation of probability; so what is the frequentist interpretation of probability?

    • It assumes the events in question can be repeated numerous times in order to define a notion of probability
  • What is the Bayesian perspective?

    • use the machinery of probability theory to describe the uncertainty in model parameters such as w , or indeed in the choice of model itself
    • convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data
  • Bayes' theorem in terms of prior, posterior, and likelihood

    $p(\mathbf{w} \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$

  • Bayes' theorem in words

    posterior ∝ likelihood × prior

  • likelihood function

    • In both the Bayesian and frequentist paradigms, the likelihood function plays a central role
      • In the frequentist paradigm this is because a widely used frequentist estimator is maximum likelihood
      • In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D .
        • This ‘estimator’ is maximum likelihood
      • By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w .
    • likelihood function expresses how probable the observed data set is for different settings of the parameter vector w
      • Note that the likelihood is not a probability distribution over w , and its integral with respect to w does not (necessarily) equal one.
  • Do not underestimate the quantity in the denominator of equation (1.43)

    $p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\, \mathrm{d}\mathbf{w}$

    • In (1.43), p(D) acts as a normalizing constant, but p(D) is really the distribution of the data; once p(D) is available, the same marginalization machinery can be used, given a new input, to predict a new output, which is exactly what Section 1.2.6 does
    • It is obtained by integrating out all of w, which I still find rather remarkable
  • Repeats that a widely used frequentist estimator is maximum likelihood, which leads to the concept of an error function and further to the bootstrap for obtaining error bars

  • Mentions that one advantage of the Bayesian viewpoint is that prior knowledge is incorporated naturally

    • One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally.
    • An example is also given (a very good one): under frequentist maximum likelihood there is effectively no prior, e.g. a coin that happens to land heads three times in a row gets an ML estimate of 1 for the probability of heads, whereas the Bayesian viewpoint naturally brings in a prior and avoids such an extreme conclusion

1.2.4 The Gaussian distribution

  • What does this part cover?
    • The form of the Gaussian distribution (univariate and multivariate versions)
    • Gives the expectation, second-order moment, and variance of the Gaussian distribution
    • The posterior is the probability of the parameters given the data
    • The likelihood is the probability of the data given the parameters
    • Estimation is again by maximizing the likelihood; because the distribution is Gaussian, it is more convenient to maximize the log of the likelihood function, which yields the corresponding estimates of the mean and variance
    • Points out that this variance estimate is biased, highlighting a significant limitation of the maximum likelihood approach: it systematically underestimates the variance of the distribution (a small numeric check is sketched below)
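
A quick numeric check of that bias (my own sketch): averaged over many data sets of size N, the maximum likelihood variance comes out close to ((N − 1)/N)·σ², while dividing by N − 1 instead removes the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_true, N = 4.0, 5

# Many data sets of size N drawn from N(0, sigma2_true)
X = rng.normal(0.0, np.sqrt(sigma2_true), size=(100_000, N))

mu_ml = X.mean(axis=1, keepdims=True)        # ML estimate of the mean, per data set
var_ml = np.mean((X - mu_ml) ** 2, axis=1)   # ML variance, divides by N (biased)
var_unbiased = var_ml * N / (N - 1)          # corrected estimator, divides by N - 1

print("true variance        :", sigma2_true)
print("mean of ML variance  :", round(var_ml.mean(), 3))        # close to (N-1)/N * 4 = 3.2
print("mean of corrected var:", round(var_unbiased.mean(), 3))  # close to 4.0
```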

1.2.5 Curve fitting re-visited

  • What does this part cover?
    • The target value t is modelled as Gaussian-distributed around the value of the curve y(x, w)
    • Writes out the likelihood as the joint probability density of the observed targets under this Gaussian noise model
    • Using maximum likelihood as the estimator:
      • Take the log to split the joint likelihood into a sum, then optimize
      • The solution has the same form as the sum-of-squares result of the previous section
    • Using the maximum posterior (MAP) as the estimator:
      • Additionally places a Gaussian prior on the polynomial coefficients w, with zero mean and precision alpha (covariance alpha^{-1} * I)
      • Taking the log in the same way shows that MAP estimation amounts to the sum-of-squares error plus a squared L2-norm regularization term

1.3. Model Selection

  • What does this part cover?
    • First, cross-validation for model selection (a small sketch follows)
    • Second, information criteria for model selection: the Akaike information criterion, and the Bayesian information criterion (BIC) that the author treats in more detail later
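
A small sketch (setup of my own choosing, not from the book) of S-fold cross-validation used to pick the polynomial order for the curve-fitting example of Section 1.1:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def cv_rms(x, t, M, S=5):
    # S-fold cross-validation: each fold is held out once, the rest is used for training
    folds = np.array_split(rng.permutation(len(x)), S)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        w = fit(x[train], t[train], M)
        y = np.vander(x[fold], M + 1, increasing=True) @ w
        errs.append(np.mean((y - t[fold]) ** 2))
    return np.sqrt(np.mean(errs))

scores = {M: cv_rms(x, t, M) for M in range(10)}
best_M = min(scores, key=scores.get)
print("cross-validated RMS per order:", {M: round(v, 3) for M, v in scores.items()})
print("selected model order:", best_M)
```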

1.4. The Curse of Dimensionality

  • What does this part cover?
    • Explains what causes the curse of dimensionality
    • Discusses the redeeming properties of real data
      • Data often lie in a low-dimensional subspace or on a manifold
      • Data typically have smoothness properties

1.5. Decision Theory

  • Points out the central role of the joint probability distribution in Bayes’ theorem
    • The joint probability distribution provides a complete summary of the uncertainty associated with these variables
    • any of the quantities appearing in Bayes’ theorem can be obtained from the joint distribution by either marginalizing or conditioning with respect to the appropriate variables
  • Introduces the concept of inference
    • Namely, determination of the joint probability distribution from a set of training data

1.5.1 Minimizing the misclassification rate

  • What does this part cover?
    • If the goal is to minimize the misclassification rate, we simply assign x to the class with the larger joint density $p(x,C_k)$
    • This is in fact the same as comparing posterior probabilities: by the product rule $p(x,C_k) = p(C_k|x)p(x)$, and the data density $p(x)$ is common to all classes, so comparing joint probabilities is equivalent to comparing posteriors

1.5.2 Minimizing the expected loss

  • First, explains why misclassification-rate minimization is not enough and expected-loss minimization is needed: different kinds of mistakes can carry very different costs, which motivates the concept of a loss function, or cost function
  • Then gives equation (1.81) for the expected loss, which is quite intuitive: each possible mistake's cost weighted by the corresponding posterior (a small sketch follows)
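
A tiny sketch of a decision made in the spirit of (1.81): given posterior class probabilities and a loss matrix, pick the class that minimizes the posterior-weighted loss. The numbers below are made up, loosely echoing the book's cancer-screening example.

```python
import numpy as np

# Illustrative loss matrix L[k, j]: cost of deciding class j when the true class is C_k
L = np.array([[0, 1],      # true class: normal
              [100, 0]])   # true class: cancer (missing it is very costly)

def decide(posterior):
    # Expected loss of deciding class j: sum_k L[k, j] * p(C_k | x)
    expected_loss = L.T @ posterior
    return int(np.argmin(expected_loss)), expected_loss

for p_cancer in (0.05, 0.3, 0.7):
    posterior = np.array([1 - p_cancer, p_cancer])
    j, losses = decide(posterior)
    print(f"p(cancer|x)={p_cancer:.2f} -> decide class {j} (expected losses {losses})")
```

Note that with such an asymmetric loss matrix the decision boundary moves far away from the posterior value 0.5, which is exactly the point of moving from misclassification rate to expected loss.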

1.5.3 The reject option

  • Introduces the reject option: if the largest posterior probability falls below some threshold, no automatic decision is made and the case is left to a human expert instead

1.5.4 Inference and decision

  • Breaking the classification problem down into two separate stages

    • inference stage
      • use training data to learn a model for posterior probability $p(C_k|x)$
    • decision stage
      • use these posterior probabilities to make optimal class assignments
  • The discriminant-function approach

    • solve both problems together and simply learn a function that maps inputs x directly into decisions
    • Unlike the two-stage inference-then-decision approach above, this solves everything in a single step
  • There are three distinct approaches to solving decision problems; in decreasing order of complexity they are

    • generative models
    • discriminative models
    • discriminant function
  • generative models

    • First solve the inference problem of determining the class-conditional densities $p(x|C_k)$ for each class $C_k$ individually.

    • Also separately infer the prior class probabilities $p(C_k)$

    • Then use Bayes’ theorem in the form to find the posterior class probabilities $p(C_k|x)$

      $p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$

    • As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator, because

      $p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$

    • Equivalently, we can model the joint distribution $p(x, C_k)$ directly and then normalize to obtain the posterior probabilities.

    • Having found the posterior probabilities, we use decision theory to determine class membership for each new input x . Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

  • discriminative models

    • First solve the inference problem of determining the posterior class probabilities $p(C_k|x)$, and then subsequently use decision theory to assign each new x to one of the classes.
    • Approaches that model the posterior probabilities directly are called discriminative models.
    • There is no need to model the joint distribution; the posterior is obtained directly, so the model captures less than the generative approach above
  • The discriminant-function approach (Ammy says this is essentially regression)

    • Find a function $f(x)$, called a discriminant function, which maps each input x directly onto a class label. For instance, in the case of two-class problems, $f(·)$ might be binary valued and such that f = 0 represents class $C_1$ and f = 1 represents class $C_2$ . In this case, probabilities play no role.
  • the relative merits of generative models

    • Approach (a) is the most demanding because it involves finding the joint distribution over both x and $C_k$. For many applications, x will have high dimensionality, and consequently we may need a large training set in order to be able to determine the class-conditional densities to reasonable accuracy.
    • Note that the class priors $p(C_k)$ can often be estimated simply from the fractions of the training set data points in each of the classes.
    • One advantage of approach (a), however, is that it also allows the marginal density of data $p( x ) $ to be determined from (1.83). This can be useful for detecting new data points that have low probability under the model and for which the predictions may be of low accuracy, which is known as outlier detection or novelty detection
  • the relative merits of discriminative models

    • However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution $p(x, C_k)$ when in fact we only really need the posterior probabilities $p(C_k|x)$, which can be obtained directly through approach (b). Indeed, the class-conditional densities $p(x|C_k)$ may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27.
  • the relative merits of discriminant functions

    • With option (c), however, we no longer have access to the posterior probabilities $p(C_k|x)$. There are many powerful reasons for wanting to compute the posterior probabilities, even if we subsequently use them to make decisions. These include:

    • Minimizing risk

      • the elements of the loss matrix are subjected to revision from time to time
      • With the posteriors available we only need to re-apply equation (1.81) with the revised loss matrix; with the discriminant-function approach we would have to retrain on the data
    • Reject option

      • The posteriors can be used to set the rejection criterion, or to compute the expected loss via equation (1.81)
    • Compensating for class priors

      • This part is really about extremely imbalanced binary classification, such as small-target detection; I did not fully understand it, but it is important and worth revisiting until it is clear.
    • Combining models

      • This part is essentially about probabilistic graphical models

      • For complex applications, we may wish to break the problem into a number of smaller subproblems each of which can be tackled by a separate module.

      • Introduces the concept of conditional independence

        $p(\mathbf{x}_I, \mathbf{x}_B \mid C_k) = p(\mathbf{x}_I \mid C_k)\, p(\mathbf{x}_B \mid C_k)$

      • The complicated posterior probability can then be decomposed into smaller subproblems

        $p(C_k \mid \mathbf{x}_I, \mathbf{x}_B) \propto p(\mathbf{x}_I \mid C_k)\, p(\mathbf{x}_B \mid C_k)\, p(C_k) \propto \dfrac{p(C_k \mid \mathbf{x}_I)\, p(C_k \mid \mathbf{x}_B)}{p(C_k)}$

      • The particular conditional independence assumption (1.84) is an example of the naive Bayes model

      • We shall see in later chapters how to construct models for combining data that do not require the conditional independence assumption

1.5.5 Loss functions for regression

  • First, using the squared loss, shows that the regression solution is just the conditional average of t conditioned on x

  • Equation (1.90) then further reveals the nature of the regression problem

    $\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\}^2\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int \{\mathbb{E}[t \mid \mathbf{x}] - t\}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t$

    • The first term shows that the optimal least squares predictor is given by the conditional mean
    • The second term is the variance of the distribution of t, averaged over x . It represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of y( x ), it represents the irreducible minimum value of the loss function.
  • As with classification, regression can be tackled in three ways

    • First, solve the inference problem of determining the joint density $p(x, t)$, then normalize to find the conditional density $p(t|x)$, and finally marginalize to find the conditional mean given by (1.89)

      $y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, \mathrm{d}t$

    • Second, solve the inference problem of determining the conditional density $p(t|x)$ directly, then use equation (1.89)

    • Third, learn a regression function y(x) directly from the training data

  • Finally, introduces the Minkowski loss; the optimal regression value under the Minkowski loss is (numerically checked in the sketch after this list)

    • conditional mean for q = 2
    • conditional median for q = 1
    • conditional mode for q → 0
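
A brute-force numeric check (my own sketch) that the minimizer of the expected Minkowski loss E[|y − t|^q] sits at the mean for q = 2 and at the median for q = 1, using a skewed sample so that mean and median differ:

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.gamma(shape=2.0, scale=1.0, size=2000)   # a skewed target distribution

grid = np.linspace(t.min(), t.max(), 1001)       # candidate predictions y
for q in (2.0, 1.0):
    # Empirical E[|y - t|^q] for every candidate y on the grid
    loss = np.mean(np.abs(grid[None, :] - t[:, None]) ** q, axis=0)
    y_star = grid[np.argmin(loss)]
    print(f"q={q}: minimizer ~ {y_star:.3f}")

print("sample mean  :", round(t.mean(), 3))      # should match the q=2 minimizer
print("sample median:", round(np.median(t), 3))  # should match the q=1 minimizer
```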

1.6. Information Theory

  • Defines entropy in two ways: from a rather heuristic motivation, and from a deeper interpretation as a measure of disorder through developments in statistical mechanics

  • For continuous distributions, introduces the concept of differential entropy and shows that the distribution that maximizes the differential entropy (for given mean and variance) is the Gaussian

  • Gives the definition of conditional entropy and the relationship it satisfies

    $\mathrm{H}[\mathbf{x}, \mathbf{y}] = \mathrm{H}[\mathbf{y} \mid \mathbf{x}] + \mathrm{H}[\mathbf{x}]$

1.6.1 Relative entropy and mutual information

  • First, from the viewpoint of the average additional amount of information required, gives the definition of the relative entropy, or Kullback-Leibler divergence

  • Gives the definition of a convex function, then extends it to combinations of many points and gives the discrete and continuous forms, i.e. Jensen's inequality

    $f\!\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i), \qquad f\!\left(\mathbb{E}[x]\right) \le \mathbb{E}[f(x)] \quad (\lambda_i \ge 0,\ \textstyle\sum_i \lambda_i = 1)$

    • Uses Jensen's inequality to prove that the KL divergence is always greater than or equal to zero
  • Points out, via relative entropy, the intimate relationship between data compression and density estimation, since at heart both problems are about modelling an unknown probability distribution

  • Equation (1.119) reveals the essence of relative entropy

    $\mathrm{KL}(p\|q) \simeq \dfrac{1}{N}\sum_{n=1}^{N}\left\{-\ln q(\mathbf{x}_n \mid \boldsymbol{\theta}) + \ln p(\mathbf{x}_n)\right\}$

    • Since the second term is independent of the parameters theta being estimated, minimizing the relative entropy is in essence the same as maximizing the likelihood function
  • Gives the definition of mutual information and the properties it satisfies (a small sketch follows below)

    $\mathrm{I}[\mathbf{x}, \mathbf{y}] \equiv \mathrm{KL}\big(p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\,p(\mathbf{y})\big) = -\iint p(\mathbf{x}, \mathbf{y}) \ln\!\left(\dfrac{p(\mathbf{x})\, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\right) \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{y}$

    $\mathrm{I}[\mathbf{x}, \mathbf{y}] = \mathrm{H}[\mathbf{x}] - \mathrm{H}[\mathbf{x} \mid \mathbf{y}] = \mathrm{H}[\mathbf{y}] - \mathrm{H}[\mathbf{y} \mid \mathbf{x}]$

    • From a Bayesian perspective, we can view p(x) as the prior distribution for x and p(x|y) as the posterior distribution after we have observed new data y . The mutual information therefore represents the reduction in uncertainty about x as a consequence of the new observation y .
    • In other words, mutual information is the amount by which the uncertainty about x is reduced once the new observation y is made
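
A small sketch for discrete distributions (the probability tables below are made up): KL(p‖q) is non-negative and zero only when q equals p, and the mutual information of a joint table can be computed either as KL(p(x,y) ‖ p(x)p(y)) or from entropies.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_i p_i * ln(p_i / q_i), with 0*ln(0) treated as 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print("KL(p||q) =", round(kl(p, q), 4), " KL(p||p) =", kl(p, p))   # >= 0, and 0 when q = p

# Mutual information from a made-up joint distribution p(x, y)
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
mi_kl = kl(pxy.ravel(), np.outer(px, py).ravel())           # I[x,y] = KL(p(x,y) || p(x)p(y))
mi_ent = entropy(px) + entropy(py) - entropy(pxy.ravel())   # = H[x] + H[y] - H[x,y]
print("I[x,y] via KL       :", round(mi_kl, 4))
print("I[x,y] via entropies:", round(mi_ent, 4))
```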

2 Probability Distributions

  • This chapter is an exploration of some particular examples of probability distributions and their properties
    • This matters because these distributions can form building blocks for more complex models
    • It also discusses some key statistical concepts, such as Bayesian inference
  • The concept of density estimation
    • model the probability distribution $p(x)$ of a random variable $x$ , given a finite set $x_1, \cdots , x _N$ of observations.
    • The underlying assumption is that the data points are independent and identically distributed
  • The chapter mainly covers
    • For discrete random variables: the binomial distribution and the multinomial distribution
    • For continuous random variables: the Gaussian distribution
    • These are all parametric distributions, so called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian
  • For the problem of density estimation
    • It amounts to determining suitable values for the parameters, given an observed data set
    • In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function.
    • By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes’ theorem to compute the corresponding posterior distribution given the observed data.
  • The concept of conjugate priors is very important; I had not heard of it before and need to study it carefully
    • lead to posterior distributions having the same functional form as the prior, and that therefore lead to a greatly simplified Bayesian analysis
    • For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian
    • All of these distributions are examples of the exponential family of distributions
  • Drawbacks of the parametric approach
    • it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application.
    • One must assume in advance that the data follow some specific distributional form, but in practice such an ideal condition rarely holds
  • An alternative approach is given by nonparametric density estimation methods
    • the form of the distribution typically depends on the size of the data set
    • Such models still contain parameters, but these control the model complexity rather than the form of the distribution.
    • The book introduces several typical nonparametric density estimation methods, such as histograms, nearest-neighbours, and kernels

2.1. Binary Variables

  • What does this part cover?
    • Introduces the Bernoulli distribution for a single binary random variable
      • The probability distribution of the Bernoulli; viewed as a function of the parameter, this distribution is also the likelihood
      • Gives its mean and variance
      • Because the data are i.i.d., the joint likelihood of a data set is simply the product of the individual likelihoods
      • Since the form is exponential, working with the log likelihood makes it easy to find the optimal parameter mu by maximum likelihood
      • Gives the mean and variance for the case of multiple observations (the binomial distribution)

2.1.1 The beta distribution

  • Uses the beta distribution as the prior for the Bernoulli parameter, introducing the conjugacy property

    • Choose a prior with the same functional form as the likelihood; since the posterior is proportional to the prior times the likelihood, the posterior has the same form as the prior, and therefore the posterior distribution can act as the prior if we subsequently observe additional data
    • This lends itself naturally to online learning; this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint
    • In the limit of infinitely many observations, the Bayesian result (i.e. the posterior mean) coincides with the maximum likelihood estimate
    • For a finite data set, the posterior estimate of the parameter mu lies between the prior mean and the maximum likelihood estimate (a small sketch follows this list)
  • Gives the following identity

    $\operatorname{var}_{\theta}[\theta] = \mathbb{E}_{\mathcal{D}}\!\big[\operatorname{var}_{\theta}[\theta \mid \mathcal{D}]\big] + \operatorname{var}_{\mathcal{D}}\!\big[\mathbb{E}_{\theta}[\theta \mid \mathcal{D}]\big]$

    • The left-hand side is the prior variance of theta
    • The first term on the right-hand side is the average posterior variance of theta
    • The second term on the right-hand side is the variance in the posterior mean of theta
    • This shows that, on average, the posterior variance of theta is smaller than the prior variance
      • That the posterior variance is smaller than the prior variance makes intuitive sense: once observations are given, the remaining freedom in the parameter is reduced
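
A minimal sketch of the conjugate update (illustrative hyperparameters of my own): a Beta(a, b) prior combined with m observed ones and l zeros gives a Beta(a + m, b + l) posterior; the posterior mean lies between the prior mean and the ML estimate, and absorbing the observations one at a time gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true = 0.7
x = rng.binomial(1, mu_true, size=20)        # 20 Bernoulli observations
m, l = int(x.sum()), int(len(x) - x.sum())   # counts of ones and zeros

a0, b0 = 2.0, 2.0                            # Beta(a0, b0) prior
a_post, b_post = a0 + m, b0 + l              # conjugate batch update

# Sequential (online) update: absorb one observation at a time
a_seq, b_seq = a0, b0
for xi in x:
    a_seq, b_seq = a_seq + xi, b_seq + (1 - xi)

print("ML estimate          :", m / (m + l))
print("prior mean           :", a0 / (a0 + b0))
print("posterior mean       :", a_post / (a_post + b_post))  # between prior mean and ML estimate
print("sequential == batch ? :", (a_seq, b_seq) == (a_post, b_post))
```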

2.2. Multinomial Variables

  • Introduces the concept of multinomial variables
  • Gives the likelihood function, the expectation, and the maximum likelihood solution
  • Gives the joint distribution of the observation counts, i.e. the multinomial distribution

2.2.1 The Dirichlet distribution

  • Uses the Dirichlet distribution as the conjugate prior of the multinomial distribution and gives the form of the posterior (a small sketch follows)
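
The analogous sketch for the multinomial case (numbers are illustrative): a Dirichlet(α) prior plus observed counts m gives a Dirichlet(α + m) posterior.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over K = 3 outcomes
counts = np.array([12, 3, 5])       # observed counts m_k for each outcome

alpha_post = alpha + counts         # conjugacy: Dirichlet prior -> Dirichlet posterior
print("posterior parameters:", alpha_post)
print("posterior mean      :", alpha_post / alpha_post.sum())  # E[mu_k] = alpha_k / sum_j alpha_j
print("ML estimate         :", counts / counts.sum())
```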

2.3. The Gaussian Distribution

  • First points out that the quadratic form inside the Gaussian is the Mahalanobis distance
  • An eigendecomposition of the covariance matrix gives the geometric interpretation of the Gaussian distribution
    • Overall, the density contours form ellipsoids with arbitrary orientation
    • The square roots of the eigenvalues are the scaling factors in the directions of the axes, i.e. the axis lengths
    • The input vector x is effectively shifted (subtracting the mean vector) and then rotated (projected onto the eigenvector basis)
  • Explains positive definite and positive semidefinite; a semidefinite covariance (some eigenvalues equal to zero) means the distribution is confined to a subspace of lower dimensionality
  • Gives the expectation and the second-order moment matrix of the multivariate Gaussian
  • Points out two major limitations of the Gaussian distribution
    • First, the mean plus the covariance matrix (mainly the covariance matrix) involve too many parameters, a curse of dimensionality
      • One remedy is to restrict the number of free parameters in the covariance matrix
      • With no restriction, the contours are ellipsoids with arbitrary orientation
      • Restricted to a diagonal matrix (off-diagonal elements zero), the contours become ellipsoids whose axes are aligned with the coordinate axes
      • Restricted to an isotropic covariance (proportional to the identity matrix), the contours reduce to circles (spheres)
    • Second, the Gaussian is intrinsically unimodal and cannot approximate multimodal distributions well

2.3.1 Conditional Gaussian distributions

  • Discusses an important property of the multivariate Gaussian distribution
    • if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.
    • Similarly, the marginal distribution of either set is also Gaussian.
  • Derives the form of the conditional Gaussian distributions
  • Equation (2.73) is indeed what is used as L in equation (2.100)
  • The author's purpose in this part is to derive equations (2.73) and (2.75), or equivalently (2.81) and (2.82); the two pairs are just expressed in different terms and are really the same thing; the purpose of deriving them is ...
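
A sketch of the partitioned-covariance form of the result, checked by sampling (the particular mean and covariance are made up): conditioning a joint Gaussian on $x_b$ gives a Gaussian over $x_a$ with mean $\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and covariance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$.

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Partition: x_a = x[0], x_b = x[1]; condition on an observed value of x_b
xb = 0.5
mu_a, mu_b = mu[0], mu[1]
S_aa, S_ab, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

mu_cond = mu_a + S_ab / S_bb * (xb - mu_b)   # conditional mean
S_cond = S_aa - S_ab / S_bb * S_ab           # conditional variance (independent of xb)

# Monte-Carlo check: keep joint samples whose x_b falls near the observed value
samples = rng.multivariate_normal(mu, Sigma, size=400_000)
near = samples[np.abs(samples[:, 1] - xb) < 0.02, 0]
print("analytic  mean/var:", round(mu_cond, 3), round(S_cond, 3))
print("empirical mean/var:", round(near.mean(), 3), round(near.var(), 3))
```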

2.3.2 Marginal Gaussian distributions

  • Gives the form of the marginal Gaussian distributions
  • Equation (2.98) is the marginal Gaussian distribution that this part refers to
  • The author's purpose here is to show, via equation (2.97), that the linear relationship appearing in equation (2.100) of the next subsection holds

2.3.3 Bayes’ theorem for Gaussian variables

  • Equation (2.99) is the probability distribution of the prior, and equation (2.100) is that of the likelihood
  • Using the joint distribution in equation (2.102), with its overall covariance matrix (2.105) and the linear term containing the mean, the covariance matrices and means of both the posterior and the denominator (evidence) are derived
  • With that, the probability distributions of the prior, the likelihood, the posterior, and the denominator in Bayes' theorem are all in hand

2.3.4 Maximum likelihood for the Gaussian

  • Uses likelihood maximization to estimate the optimal parameters of the Gaussian distribution
  • The estimated covariance matrix explains why PCA centres the data first and then forms the product; it also explains why PCA is often said to suit Gaussian noise, since this is exactly what the Gaussian derivation produces
  • Finally, as in the univariate case, the maximum likelihood estimate is biased and is corrected by a constant factor (see the sketch below)
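
A small sketch of the maximum likelihood estimates for the multivariate case (synthetic data): the ML covariance is formed from the centred data, the same centre-then-multiply step used in PCA, and the bias is removed by the N/(N − 1) factor.

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma_true = np.array([[3.0, 1.0],
                       [1.0, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma_true, size=500)

mu_ml = X.mean(axis=0)                      # ML estimate of the mean
Xc = X - mu_ml                              # centre the data (as in PCA)
Sigma_ml = Xc.T @ Xc / len(X)               # ML covariance, divides by N (biased)
Sigma_unbiased = Xc.T @ Xc / (len(X) - 1)   # corrected by the constant factor N/(N-1)

print("ML covariance:\n", np.round(Sigma_ml, 3))
print("bias-corrected covariance:\n", np.round(Sigma_unbiased, 3))
```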

3 Linear Models for Regression

  • linear models for regression
    • Regression means modelling the predictive distribution $p(t|x)$, i.e. the likelihood
      • This is stated explicitly just below equation (3.10)
    • 'Linear' refers to the model being linear in the adjustable parameters; the basis functions themselves can be non-linear

3.1. Linear Basis Function Models

  • What does this part cover?
    • A brief introduction to some commonly used basis functions, such as spline functions, Gaussian basis functions, sigmoidal bases, the tanh function, Fourier bases, and wavelets

3.1.1 Maximum likelihood and least squares

  • Re-derives, via likelihood maximization under the Gaussian-noise assumption, the form of the optimal parameter vector, i.e. equation (3.15)

3.1.2 Geometry of least squares

  • Gives the geometric interpretation of least squares: the projection onto the subspace spanned by the basis functions, as shown in Figure 3.2

3.1.3 Sequential learning

  • Points out that sequential learning algorithms learn via stochastic gradient descent, also called sequential gradient descent, i.e. equation (3.22), with the least-squares case given concretely in (3.23); it is simpler than one might expect (a small sketch follows)
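
A minimal sketch of the sequential (LMS-style) update $w \leftarrow w + \eta\,(t_n - w^\top \phi_n)\,\phi_n$ on synthetic linear-regression data; the learning rate and basis are choices of my own.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic linear-regression data: t = w_true . phi(x) + noise, with phi(x) = [1, x]
w_true = np.array([0.5, -2.0])
x = rng.uniform(-1, 1, 200)
Phi = np.column_stack([np.ones_like(x), x])
t = Phi @ w_true + rng.normal(0, 0.1, len(x))

w = np.zeros(2)
eta = 0.1                                  # learning rate (chosen by hand)
for epoch in range(20):
    for n in rng.permutation(len(x)):      # present one data point at a time
        error = t[n] - w @ Phi[n]
        w = w + eta * error * Phi[n]       # LMS / sequential gradient descent update

print("true w      :", w_true)
print("sequential w:", np.round(w, 3))
```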

3.1.4 Regularized least squares

  • Adds a regularization term and re-derives the optimal estimate of the parameters
  • Notes that different norms can be used for the regularizer: the L2 norm, the L1 norm, or q = 0.5, for example

3.1.5 Multiple outputs

  • The previous derivations assumed a scalar output; when the output is a vector, the coefficient vector w becomes a coefficient matrix W, and the optimal estimate of W is re-derived

3.2. The Bias-Variance Decomposition

  • This section examines the issue of model complexity from a frequentist viewpoint, namely the bias-variance trade-off
  • Equation (3.41) shows that the expected loss can be written as the sum of the squared bias, the variance, and the noise
    • bias represents the extent to which the average prediction over all data sets differs from the desired regression function
    • variance, measures the extent to which the solutions for individual data sets vary around their average, and hence this measures the extent to which the function $y(x;D)$ is sensitive to the particular choice of data set
    • The noise term is independent of the model choice; it is the intrinsic noise on the data
  • a trade-off between bias and variance
    • very flexible models having low bias and high variance
    • relatively rigid models having high bias and low variance
    • The model with the optimal predictive capability is the one that leads to the best balance between bias and variance.
    • Figure 3.6 gives a more intuitive picture (and a small experiment is sketched below)
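
A small experiment in the spirit of this section (setup of my own choosing, loosely modelled on the book's figures): fit many data sets with Gaussian basis functions at a heavy and a light regularization level, then measure the squared bias and the variance of the resulting predictions against the true sin(2πx) curve. Heavy regularization should give low variance and high bias; light regularization the reverse.

```python
import numpy as np

rng = np.random.default_rng(8)

def h(x):                                   # the true underlying function
    return np.sin(2 * np.pi * x)

def gaussian_basis(x, centers, s=0.1):
    # Design matrix: a bias column plus Gaussian basis functions at the given centres
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-(x - c) ** 2 / (2 * s ** 2)) for c in centers])

centers = np.linspace(0, 1, 24)
x_grid = np.linspace(0, 1, 101)
Phi_grid = gaussian_basis(x_grid, centers)

def experiment(lam, n_sets=100, N=25):
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, N)
        t = h(x) + rng.normal(0, 0.3, N)
        Phi = gaussian_basis(x, centers)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(Phi_grid @ w)
    preds = np.array(preds)                                   # (n_sets, len(x_grid))
    bias2 = np.mean((preds.mean(axis=0) - h(x_grid)) ** 2)    # squared bias
    variance = np.mean(preds.var(axis=0))                     # variance across data sets
    return bias2, variance

for lam in (np.exp(2.6), np.exp(-2.4)):   # heavy vs light regularization
    b2, var = experiment(lam)
    print(f"ln(lambda)={np.log(lam):5.1f}: bias^2={b2:.3f}, variance={var:.3f}, sum={b2 + var:.3f}")
```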

3.6. Limitations of Fixed Basis Functions

  • Drawbacks of a linear combination of fixed, nonlinear basis functions
    • the basis functions are fixed before the training data set is observed
    • curse of dimensionality
      • So how do SVMs and neural networks, as later improvements, avoid the curse of dimensionality?
  • Real data have two properties that allow us to sidestep the curse of dimensionality
    • the data vectors typically lie close to a non-linear manifold whose intrinsic dimensionality is smaller than that of the input space as a result of strong correlations between the input variables
      • That is, the data lie in a low-dimensional manifold or subspace
      • If we are using localized basis functions, we can arrange that they are scattered in input space only in regions containing data.
      • This approach is used in radial basis function networks and also in support vector and relevance vector machines.
        • Right, the RBF kernel is essentially measuring distances; suddenly support vectors feel a bit like KNN...
      • Neural network models, which use adaptive basis functions having sigmoidal nonlinearities, can adapt the parameters so that the regions of input space over which the basis functions vary corresponds to the data manifold.
        • I did not fully understand the sentence above
    • The second property is that target variables may have significant dependence on only a small number of possible directions within the data manifold.
      • Neural networks can exploit this property by choosing the directions in input space to which the basis functions respond.
      • I did not fully understand this property

Questions
