If you haven’t seen the previous blog, or aren’t familiar with Hoeffding bounds, I suggest you read it first. This blog dives head first into the continuation of the previous blog here.
A hypothesis can be thought of as a function, though that’s not quite the whole story. A function maps one set to another, like $f: \Re^n \to \Re$; for every input in the domain, there exists exactly one output. However, in our world of unknowns, a single point may appear to map to more than one output, depending on the underlying noise.
Our definition of noise is the unknown consequences of the outside environment that we failed to capture in our data mining. This noise makes the deterministic idea of a function fall apart, so instead of:
\[f(x_n)\]
we now take the stochastic analog:
\[p(y_n|x_n)\]
which gives us a probability distribution over all outputs. Instead of a mapping $f: \Re^n \to \Re$, for example, it is now $X: \Omega \to \Re$, where $\Omega$ is the set of all events.
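To make this concrete, here’s a minimal sketch in Python (the linear trend and the Gaussian noise are my own illustrative assumptions, not anything from the data): the same input $x$ yields a different $y$ on every draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_y(x, noise_std=0.5):
    # Draw y ~ p(y|x): a deterministic trend plus unobserved noise.
    return 2.0 * x + 1.0 + rng.normal(0.0, noise_std)

x = 3.0
print([round(sample_y(x), 3) for _ in range(5)])  # five different y's for the same x
```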
A hypothesis in our case is just a mathematical model that tries to map x’s to y’s with a probability associated with it, or a transformation of a random variable. Hypotheses are motivated by reasonable assumptions: maybe the relationship between x and y is roughly linear, maybe points that are close together tend to share labels.
And the list goes on.
But be careful, you can’t say “linear models” is a hypothesis. It’s too broad - we need to specify the exact weights, i.e. “linear model with coefficients $w_0 = 1$, $w_1 = 0.2$, etc.”. Thus, we call “linear models” a hypothesis set, containing many hypotheses.
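Here’s an illustrative sketch of the distinction (the function names and weights are made up for the example):

```python
def make_linear_hypothesis(w0, w1):
    # One hypothesis: a linear model with fixed, concrete weights.
    return lambda x: w0 + w1 * x

h = make_linear_hypothesis(1.0, 0.2)  # a single hypothesis
print(h(5.0))                         # 1.0 + 0.2 * 5.0 = 2.0

# The hypothesis *set* "linear models" is the whole family
# {make_linear_hypothesis(w0, w1) for all real w0, w1}.
```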
Hoeffding, as we talked about before, bounds the difference between the empirical risk and the expected risk of a hypothesis, $f$.
What we managed to get was:
\[P(|R_{N}(f) - R(f)| > \epsilon) \leq 2e^{-2N\epsilon^2}\]
This gives us confidence about whether our model, trained on our training data, will generalize and express the underlying data distribution, $x, y \sim p_\theta$.
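If you want to see this bound in action, here’s a small simulation (the true risk of 0.3 is an arbitrary choice for illustration): for one fixed hypothesis with 0-1 loss, the observed deviation probability sits comfortably under the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 500, 0.05, 10_000
true_risk = 0.3  # R(f): the probability that our fixed hypothesis errs

# Each trial: draw N i.i.d. 0-1 losses and compute the empirical risk R_N(f).
losses = rng.random((trials, N)) < true_risk
deviations = np.abs(losses.mean(axis=1) - true_risk)

empirical = (deviations > eps).mean()
bound = 2 * np.exp(-2 * N * eps**2)
print(f"P(|R_N - R| > eps) ~ {empirical:.4f}  vs  Hoeffding bound {bound:.4f}")
```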
This is great and all, but this is only one hypothesis. For a general class of hypotheses, like support vector machines, how can we make a similar argument?
The most basic way to plug Hoeffding into a general class of hypotheses, which we will denote $ \mathcal{H} $, is to assume that there is no overlap between the bad events (I will elaborate on what overlap means). A bad hypothesis is one where:
\[|R_{N}(f_{bad}) - R(f_{bad})| \geq \epsilon\]
for some $\epsilon$.
This event can happen to any of the hypotheses in our hypothesis set. So if our hypothesis set has cardinality $|\mathcal{H}| = M$, then we have $M$ of these events to avoid. If we define the event of the $i$-th hypothesis being bad as
\[B_i = \{|R_{N}(f_i) - R(f_i)| \geq \epsilon\}\]
then the event that none of the hypotheses are bad is $ \bigcap_{i=1}^M B_i^c $, which translates to
“The 1st hypothesis is NOT bad AND the 2nd hypothesis is NOT bad AND…”
If we apply De Morgan’s law to this, we get:
\[\bigcap_{i=1}^M B_i^c = \left(\bigcup_{i=1}^M B_i\right)^c\]
which has a much more obvious probability:
\[P\left(\left(\bigcup_{i=1}^M B_i\right)^c\right) = 1 - P\left(\bigcup_{i=1}^M B_i\right) = 1 - \sum_{i=1}^M P(B_i)\]
where the last equality holds due to the assumption that the bad events are disjoint, which is a strong assumption. Assuming they are disjoint means we can add up the probabilities without accounting for overlap. Recall the inclusion-exclusion principle:
\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]
Here, we are assuming that $P(A \cap B) = 0$. If we didn’t use this assumption, then it’d get ugly. Just for the sake of illustration, we would get:
\[(-1)^0\binom{2}{1}P(B_i) + (-1)^1\binom{2}{2}P(B_i \cap B_j)\]
for 2 events, and
\[(-1)^0\binom{3}{1}P(B_i) + (-1)^1\binom{3}{2}P(B_i \cap B_j) + (-1)^2\binom{3}{3}P(B_i \cap B_j \cap B_k)\]
for 3 events…
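To see the direction this pushes in, take a made-up numeric example: if $P(B_1) = P(B_2) = 0.1$ and $P(B_1 \cap B_2) = 0.02$, then
\[P(B_1 \cup B_2) = 0.1 + 0.1 - 0.02 = 0.18 \leq 0.2 = P(B_1) + P(B_2)\]
so simply adding up the probabilities can only over-count, never under-count, and our bound stays valid.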
So we could exploit the fact that there are overlaps to get a tighter bound, but it’s way too resource consuming for now. Thus, we use Hoeffding on each individual disjoint event and get:
\[1 - \sum_{i=1}^M P(B_i) \geq 1 - 2Me^{-2N\epsilon^2}\]
Remember that we had an upper bound on each $P(B_i)$, so the negative sign flips the inequality here.
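Here’s a tiny sketch of how fast this bound loosens as $M$ grows (the numbers are chosen purely for illustration):

```python
import math

def uniform_bound(M, N, eps):
    # Upper bound on P(some hypothesis among M has |R_N - R| > eps).
    return 2 * M * math.exp(-2 * N * eps**2)

N, eps = 1000, 0.05
for M in (1, 10, 1000, 100_000):
    print(f"M = {M:>7}: bound = {uniform_bound(M, N, eps):.4f}")
# For large M the bound exceeds 1 and becomes vacuous.
```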
This means that the greater the cardinality of our hypothesis class, the harder it is to bound the PAC likelihood. You may be thinking:
“But don’t linear models and neural networks and stuff have infinite cardinality?”
And you’re right. We will get results for that, but hold on, because we will set up the machinery for it :)
Now let’s do some analysis about what the above means.
Consider our previous notation, but replace $\sum_{i=1}^M P(B_i)$ with $\delta$.
This yields the following inequality:
$1 - \delta \geq 1 - 2Me^{-2N\epsilon^2}$, or $2Me^{-2N\epsilon^2} \geq \delta$.
We solve the above equation for $N$ to get:
\[N = -\frac{\log(\frac{\delta}{2M})}{2\epsilon^2} = \frac{\log(\frac{2M}{\delta})}{2\epsilon^2}\]
Once again, solve the equation, this time for $\epsilon$, and we get:
\[\epsilon \geq \sqrt{\frac{\log(\frac{2M}{\delta})}{2N}}\]
and we can draw similar observations as above.
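These two rearrangements answer practical questions directly: how many samples do I need, and what precision can I guarantee? A quick sketch, pure arithmetic from the formulas above (the specific values of $M$, $\epsilon$, $\delta$ are just examples):

```python
import math

def samples_needed(M, eps, delta):
    # Smallest N with 2*M*exp(-2*N*eps^2) <= delta.
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

def eps_achievable(M, N, delta):
    # Smallest eps guaranteed with probability at least 1 - delta.
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

print(samples_needed(M=100, eps=0.05, delta=0.05))           # 1659 samples
print(f"{eps_achievable(M=100, N=10_000, delta=0.05):.4f}")  # about 0.02
```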
So far, we’ve used the Hoeffding bound to get a good estimate of how far apart the empirical risk and expected risk are. Let’s now denote by $f$ the hypothesis we choose to minimize the empirical risk, and by $f^*$ the hypothesis in $\mathcal{H}$ that minimizes the expected risk. Can we say something about the expected risk gap between $f$ and $f^*$?
The argument is pretty subtle, so I’ll break it down into 4 parts:
1. Apply the uniform bound to our chosen $f$: with high probability, $R(f) - R_N(f) \leq \epsilon$.
2. Apply the same bound to $f^*$: with high probability, $R_N(f^*) - R(f^*) \leq \epsilon$.
3. Since $f$ minimizes the empirical risk, $R_N(f) \leq R_N(f^*)$, i.e. $R_N(f^*) - R_N(f) \geq 0$.
4. We string these inequalities together:
\[(R(f) - R_N(f)) + (R_N(f^*) - R(f^*)) \leq 2\epsilon\]
\[R(f) - R(f^*) \leq (R(f) - R(f^*)) + (R_N(f^*) - R_N(f)) \leq 2\epsilon\]
since $R_N(f^*) - R_N(f) \geq 0$, and the middle expression rearranges to exactly the sum bounded in the first line.
Thus, we have just bounded $R(f) - R(f^*) \leq 2\epsilon$. I thought this proof was pretty magical when I first saw it, so definitely take a second look if you’re not sure.
Now, it’s time for the biggest result, plugging in everything we’ve seen:
We know that $R(f) - R(f^*) \leq 2\epsilon \leq 2 \sqrt{\frac{\log(\frac{2M}{\delta})}{2N}}$, so:
\[R(f) \leq \left[\inf_{f^* \in \mathcal{H}}R(f^*)\right] + 2 \sqrt{\frac{\log(\frac{2M}{\delta})}{2N}}\]
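Before unpacking it, let’s plug in some made-up numbers to see the scale: with $M = 1000$ hypotheses, $\delta = 0.05$, and $N = 10000$ samples, the complexity term is
\[2\sqrt{\frac{\log(\frac{2 \cdot 1000}{0.05})}{2 \cdot 10000}} \approx 0.046\]
so with probability at least $0.95$, the hypothesis we pick has expected risk within about $0.046$ of the best hypothesis in the class.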
This equation is the holy grail of learning theory. In the next section, when we talk about Vapnik-Chervonenkis dimensions, we will still use this equation, but with a small twist. We can also observe the bias-variance tradeoff in this single equation…
If we pick a richer hypothesis class, say a neural network over a linear model, our $\mathcal{H}$ will be a large set, and thus we can’t get a tight bound. However, the optimal risk will also, hopefully, be lower.
If we increase the complexity of our model, the set size becomes larger [increasing the risk bound], but $\inf R(f^*)$ is smaller [decreasing risk]. Thus, using a neural net by itself won’t help much. Regularization saves the day here: it’s a way to control the complexity of our model [variance] while still decreasing risk. This situation is low bias, high variance.
If we decrease the complexity of our model, the set size becomes smaller [decreasing the risk bound], but $\inf R(f^*)$ is larger [increasing risk]. Thus, using a linear model can’t get us to the best solution. This situation is high bias, low variance.
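The two cases above can be seen numerically. In this sketch the best-in-class risks are made-up numbers, purely to show the two terms of the bound pulling in opposite directions:

```python
import math

def risk_bound(best_risk, M, N=1000, delta=0.05):
    # R(f) <= inf R(f*) + 2*sqrt(log(2M/delta) / (2N)), our grand equation.
    return best_risk + 2 * math.sqrt(math.log(2 * M / delta) / (2 * N))

# Simple model: few hypotheses, but the best one is mediocre (high bias).
print(f"simple : {risk_bound(best_risk=0.20, M=10):.3f}")
# Complex model: many hypotheses, better best-in-class risk (high variance).
print(f"complex: {risk_bound(best_risk=0.05, M=10**6):.3f}")
```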
BTW, this is one way of looking at bias-variance. I learned it by completely expanding the expected risk, so if you’re lost or looking for a more elegant treatment, Wikipedia is for you, or check out my professor’s book:
Foundations of Machine Learning
Although we have derived something concrete here, the result is quite pessimistic for now. Why? Because it cannot be used for even linear models, let alone neural nets! The set of distinct linear-model hypotheses in $\Re^d$ is not just infinite, it’s uncountable! How can we ever hope to use a uniform bound on such a large hypothesis class?
Enter VC Dimensions! They’ll be the main topic next time; together with our grand equation, they give us the single most important theorem in learning theory.