Ray2022-05-09T14:59:39+00:00http://oneraynyday.github.ioRay ZhangPortfolio Risk - Part 22022-04-26T00:00:00+00:00http://oneraynyday.github.io/math/2022/04/26/Portfolio-Risk-Pt2<h1 id="portfolio-risk---part-2-evaluation">Portfolio Risk - Part 2: Evaluation</h1>
<p>This is a continuation of the <a href="https://oneraynyday.github.io/math/2022/01/23/Portfolio-Risk-Pt1/">previous post</a>, so I’ll be skipping some preliminary setup. This blog tries to capture tidbits of learnings from working with <a href="https://www.linkedin.com/in/gappy/">Gappy</a> who is my sole mentor in my dive into modern portfolio theory. Gappy has taught at <a href="https://www.orie.cornell.edu/giuseppe-paleologo">Cornell</a> and wrote a <a href="https://www.amazon.com/Advanced-Portfolio-Management-Fundamental-Investors/dp/1119789796">whole book</a> on this subject (one of the few books I go back to read over and over again).</p>
<p>No one makes a good risk model in one go. It’s important to understand how to evaluate a risk model in order to iteratively improve on it with respect to those performance characteristics. Evaluating a risk model is also more involved than evaluating a typical supervised machine learning model.</p>
<p>Disclaimer: I’m not responsible for any hilarious losses incurred by a portfolio inspired by methodologies listed below.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#portfolio-risk---part-2-evaluation" id="markdown-toc-portfolio-risk---part-2-evaluation">Portfolio Risk - Part 2: Evaluation</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#structural-evaluation" id="markdown-toc-structural-evaluation">Structural evaluation</a> <ul>
<li><a href="#loadings-pervasiveness--degeneracy" id="markdown-toc-loadings-pervasiveness--degeneracy">Loadings pervasiveness & degeneracy</a></li>
<li><a href="#idiosyncratic-covariance" id="markdown-toc-idiosyncratic-covariance">Idiosyncratic covariance</a> <ul>
<li><a href="#correlation-clustering" id="markdown-toc-correlation-clustering">Correlation clustering</a></li>
<li><a href="#microstructure" id="markdown-toc-microstructure">Microstructure</a></li>
<li><a href="#macrostructure" id="markdown-toc-macrostructure">Macrostructure</a> <ul>
<li><a href="#marchenko-pastur" id="markdown-toc-marchenko-pastur">Marchenko-Pastur</a></li>
<li><a href="#johns-sphericity-test" id="markdown-toc-johns-sphericity-test">John’s sphericity test</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><a href="#loss-functions" id="markdown-toc-loss-functions">Loss functions</a> <ul>
<li><a href="#proxies-ex-post" id="markdown-toc-proxies-ex-post">Proxies (ex post)</a></li>
<li><a href="#predictors-ex-ante" id="markdown-toc-predictors-ex-ante">Predictors (ex ante)</a></li>
<li><a href="#mse" id="markdown-toc-mse">MSE</a></li>
<li><a href="#qlike" id="markdown-toc-qlike">QLike</a> <ul>
<li><a href="#what-is-quasi-likelihood" id="markdown-toc-what-is-quasi-likelihood">What is Quasi-Likelihood?</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#test-portfolios" id="markdown-toc-test-portfolios">Test portfolios</a> <ul>
<li><a href="#historical-strategy-portfolios" id="markdown-toc-historical-strategy-portfolios">Historical strategy portfolios</a></li>
<li><a href="#benchmark-portfolios" id="markdown-toc-benchmark-portfolios">Benchmark portfolios</a></li>
<li><a href="#minimum-variance-portfolios" id="markdown-toc-minimum-variance-portfolios">Minimum variance portfolios</a></li>
<li><a href="#factor-mimicking-portfolios" id="markdown-toc-factor-mimicking-portfolios">Factor mimicking portfolios</a></li>
<li><a href="#hedged-portfolios" id="markdown-toc-hedged-portfolios">Hedged portfolios</a></li>
<li><a href="#portfolio-statistic-turnover" id="markdown-toc-portfolio-statistic-turnover">Portfolio Statistic: Turnover</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
</li>
</ul>
<h2 id="structural-evaluation">Structural evaluation</h2>
<p>There are both qualitative and quantitative checks we should do to ensure the risk model is well defined and adequately captures the structure of asset returns. To do this, we’ll zoom into the loadings and the induced idiosyncratic covariance matrix.</p>
<h3 id="loadings-pervasiveness--degeneracy">Loadings pervasiveness & degeneracy</h3>
<p>A factor model’s loadings $B$ can be either observable metrics of a given asset or the result of a statistical model. For example, if we take the market cap of an asset as a loading, we are essentially describing the asset’s exposure to the <em>size</em> factor. These $f$ loadings are the column vectors of $B$:</p>
\[B = \begin{bmatrix}
\vert & \vert & ... & \vert \\
b_1 & b_2 & ... & b_f \\
\vert & \vert & ... & \vert
\end{bmatrix}\]
<p>Loadings for a large factor model should be pervasive. This roughly means $\vert\vert b_i\vert\vert = O(m)$, i.e. the norm of each loading should scale roughly linearly with the number of assets $m$. If there are many zeros in a particular loading, it may be trying to capture small group structure amongst the assets and is not a good candidate as a factor.</p>
<p>Sometimes loadings are linearly dependent (or close enough), which leads to degeneracy during regression. Using ordinary least squares, we need to invert the Gram matrix of $B$. Without a ridge term, that matrix is singular (or badly conditioned) and the estimation fails:</p>
\[(B^TB)^{-1}B^Tr = f\]
<p>The offending factors can be identified using Gram-Schmidt and removed during the estimation phase. The US Barra model, for example, has the linearly dependent factor “country”, which is a linear combination of all the “industry” factors that make up the country’s asset universe.</p>
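<p>As a rough illustration of the Gram-Schmidt check, here is a minimal sketch. The function name <code>dependent_columns</code>, the tolerance and the toy industry/country loadings are all assumptions made for this example, not part of any particular risk model.</p>

```python
import numpy as np

def dependent_columns(B, tol=1e-8):
    """Indices of columns of B that are (numerically) linear combinations
    of the columns to their left, found with modified Gram-Schmidt."""
    basis = []    # orthonormal vectors spanning the columns kept so far
    dropped = []
    for j in range(B.shape[1]):
        v = B[:, j].astype(float)
        scale = np.linalg.norm(v)
        for q in basis:
            v = v - (q @ v) * q  # remove the component along q
        if scale == 0 or np.linalg.norm(v) <= tol * scale:
            dropped.append(j)    # nothing new left: column is redundant
        else:
            basis.append(v / np.linalg.norm(v))
    return dropped

# Toy loadings: two industry dummies plus a "country" column equal to their sum.
industry = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
country = industry.sum(axis=1, keepdims=True)
B = np.hstack([industry, country])
redundant = dependent_columns(B)  # flags the country column
```

<p>Which column gets flagged depends on the column order, so in practice you would place the factors you want to keep first.</p>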
<h3 id="idiosyncratic-covariance">Idiosyncratic covariance</h3>
<p>The idiosyncratic covariance matrix here is an <strong>ex post</strong> (measured from historical data) statistic:
\(\Omega_i := \Omega_r - B\Omega_fB^T\)
where $B \in \mathbb{R}^{m \times f}, \Omega_r \in \mathbb{R}^{m \times m}, \Omega_f \in \mathbb{R}^{f \times f}$. The loadings $B$ are often constructed statistically (e.g. using PCA or other dimensionality reduction techniques) or fundamentally (e.g. using asset characteristics like market cap). The returns and factor covariance matrices can be estimated in a few ways:</p>
<ol>
<li>Use exponentially weighted moving average updates of centered outer products of historical asset/factor returns to form a full rank covariance matrix.</li>
<li>Fit a multivariate <a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">GARCH</a> model over the observed asset/factor returns to generate per-period covariance matrices.</li>
<li>Estimate a <a href="https://en.wikipedia.org/wiki/Wishart_distribution">Wishart distribution</a> from historical asset/factor returns and use the parameter $\Sigma$ as a covariance estimate.</li>
</ol>
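<p>The first option can be sketched in a few lines. The half-life parameterization, the zero initialization and the function name <code>ewma_covariance</code> are assumptions for illustration; a production estimator would also handle warm-up bias and missing data.</p>

```python
import numpy as np

def ewma_covariance(returns, halflife=20.0):
    """EWMA covariance from a (T, m) array of returns, oldest row first."""
    decay = 0.5 ** (1.0 / halflife)      # weight kept from the previous state
    m = returns.shape[1]
    mean = np.zeros(m)
    cov = np.zeros((m, m))
    for r in returns:
        mean = decay * mean + (1 - decay) * r   # running EWMA mean
        centered = r - mean
        cov = decay * cov + (1 - decay) * np.outer(centered, centered)
    return cov

rng = np.random.default_rng(0)
sample_returns = 0.01 * rng.standard_normal((500, 4))  # synthetic daily returns
omega_r = ewma_covariance(sample_returns)
```

<p>Each update adds a rank-one outer product, so the estimate only becomes full rank once enough periods have been seen.</p>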
<p>Regardless of the estimation technique, we would like to get $\Omega_i$ and visualize the highly correlated clusters.</p>
<h4 id="correlation-clustering">Correlation clustering</h4>
<p>The problem of correlation clustering is NP-hard, and can be framed as a graph optimization problem. Given an undirected graph $G(V, E)$, the output of correlation clustering is a set of clusters $C = \{C_1, C_2, \dots, C_k\}$ where $C_i = (V_i, E_i)$, such that $\{V_i\}$ is a partition of the vertices and $E_i$ contains the edges between the nodes of $V_i$. In our case, the graph nodes are the $m$ assets and the $m^2$ edges $e_{i,j}$ are the correlations between assets (sometimes we use the absolute value of correlation here; for simplicity we just take the raw correlation below). The more negative the correlation of assets $i$ and $j$, the less we’d like to include assets $i$ and $j$ in the same correlation cluster. Rigorously, define a $\delta$ function:</p>
\[\delta(i,j) = \begin{cases}
1 \quad \text{if i, j are in same cluster} \\
0 \quad \text{if i, j are in different clusters}
\end{cases}\]
<p>then:</p>
\[f_+(C) = \sum_{i,j: e_{i,j} > 0} -e_{i,j} \delta(i,j) + \sum_{i,j: e_{i,j} > 0} e_{i,j} (1-\delta(i,j))\]
<p>where the left term is the reward for putting correlated nodes in the same cluster, and the right term is the penalty for putting correlated nodes in different clusters. Similarly for anti-correlated nodes:</p>
\[f_-(C) = \sum_{i,j: e_{i,j} < 0} -e_{i,j} \delta(i,j) + \sum_{i,j: e_{i,j} < 0} e_{i,j} (1-\delta(i,j))\]
<p>where the left term is the penalty for putting anticorrelated nodes in the same cluster, and the right term is the reward for putting anticorrelated nodes in different clusters. We would like to minimize the following:</p>
\[f(C) := f_+(C) + f_-(C)\]
<p>and the correct clustering is:</p>
\[C^* = \operatorname{argmin}_C f(C)\]
<p>Unfortunately we cannot phrase this as a linear program, since cluster membership is discrete: the best we can do is reduce it to an integer linear programming problem (which is NP-hard in general). Luckily, we have a randomized 3-approximation algorithm producing $\hat{C}$, where 3-approximation is defined as:</p>
\[f(\hat{C}) \leq 3 f(C^*)\]
<p>The procedure looks like (actual code available in a <a href="https://gist.github.com/OneRaynyDay/8a7c5a2342533f7262e1e9f423aa9971">gist</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
def cluster(vertices, corr):
    """Randomized pivot clustering; corr[i][j] is the correlation of i and j."""
    if not vertices:
        return []  # nothing left to cluster
    v = random.choice(sorted(vertices))  # v := random vertex
    # c := all nodes with positive edges to v (including v)
    c = {u for u in vertices if u == v or corr[u][v] > 0}
    return [c] + cluster(vertices - c, corr)  # [c, c1, ..., cn]
</code></pre></div></div>
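<p>To compare candidate clusterings, it helps to evaluate the objective $f(C)$ directly. The sketch below follows the reward/penalty description above: an edge earns $-\vert e_{i,j} \vert$ when the clustering agrees with its sign and costs $+\vert e_{i,j} \vert$ otherwise. The function name <code>clustering_cost</code> and the 4-asset correlation matrix are made up for illustration, and the sum runs over ordered pairs, so every edge is counted twice.</p>

```python
import numpy as np

def clustering_cost(corr, clusters):
    """f(C) = f_+(C) + f_-(C) for a list of clusters (sets of asset indices)."""
    m = corr.shape[0]
    label = np.empty(m, dtype=int)
    for k, members in enumerate(clusters):
        label[list(members)] = k
    same = label[:, None] == label[None, :]   # delta(i, j)
    agree = (corr > 0) == same                # clustering matches the edge sign
    cost = np.where(agree, -np.abs(corr), np.abs(corr))
    off_diagonal = ~np.eye(m, dtype=bool)     # ignore self-correlations
    return float(cost[off_diagonal].sum())

toy_corr = np.array([
    [1.0, 0.8, -0.5, -0.4],
    [0.8, 1.0, -0.3, -0.6],
    [-0.5, -0.3, 1.0, 0.7],
    [-0.4, -0.6, 0.7, 1.0],
])
two_blocks = clustering_cost(toy_corr, [{0, 1}, {2, 3}])
one_block = clustering_cost(toy_corr, [{0, 1, 2, 3}])
```

<p>Lower is better, so the two-block clustering should score below the single cluster that lumps anticorrelated assets together.</p>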
<p>For illustration purposes, here’s a synthetic covariance matrix of 10 assets:</p>
<p><img src="http://oneraynyday.github.io/assets/clustering/original_corr.png" alt="corr" height="40%" width="40%" /></p>
<p>This is after we re-ordered the row/column axes according to clusters:</p>
<p><img src="http://oneraynyday.github.io/assets/clustering/clustered.png" alt="reordered_corr" height="40%" width="40%" /></p>
<p>As we can see, the clustering did a pretty good job at including positive edges (darker colors are better):</p>
<p><img src="http://oneraynyday.github.io/assets/clustering/cluster_mask.png" alt="cluster_mask" height="40%" width="40%" /></p>
<p>… and excluding negative edges:</p>
<p><img src="http://oneraynyday.github.io/assets/clustering/without_cluster_mask.png" alt="without_cluster_mask" height="40%" width="40%" /></p>
<p>Of course, there are also other approaches to clustering that may be useful, such as hierarchical clustering using dendrograms (example visualization <a href="https://stackoverflow.com/a/3017704/3781180">here</a>). It’s always a good idea to inspect the idiosyncratic correlation matrix clusters to get an intuitive idea of the relationship between assets in your investment universe.</p>
<h4 id="microstructure">Microstructure</h4>
<p>With $f \ll m$, there must be some local structure the factor model cannot accurately capture. For example, there are only a handful of major public bank stocks which have very highly correlated historical returns. Another example is the <code class="language-plaintext highlighter-rouge">INTC</code> vs. <code class="language-plaintext highlighter-rouge">AMD</code> market share battle - when one company gets a major data center contract for a million server-grade CPUs, the other company loses that contract. These two stocks exhibit consistently negative correlation in less volatile market conditions. In both cases, adding a new banking factor or a microprocessor manufacturing factor for a small set of stocks would violate the pervasiveness requirement of loadings.</p>
<p>The astute reader may realize that there is no solution to this qualitative problem. This is inherently a problem with factor models, and the modeler should use the right tool for the job. If the investment universe only consists of tightly correlated assets with significant microstructure, maybe a factor model isn’t the right approach.</p>
<h4 id="macrostructure">Macrostructure</h4>
<p>Sometimes, a factor model <em>is</em> the right approach but we’re lacking some important factors that influence the returns of a large set of assets. This usually means there are macro correlation clusters that should be encapsulated by factors. Quantitatively, this can be found by performing PCA on the idiosyncratic covariance matrix and getting the first few largest eigenvalues. If our factor model is large & optimal, we would expect the eigenvalues to be very small. If the eigenvalues are large we can incorporate the eigenvectors as loadings, or interpret the eigenvectors as some property of the assets (which makes the factors more tradeable).</p>
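<p>A minimal sketch of that PCA check, assuming a symmetric idiosyncratic covariance matrix. The helper <code>top_eigen</code> and the synthetic matrix with one deliberately missed loading are hypothetical, for illustration only.</p>

```python
import numpy as np

def top_eigen(omega_i, k=3):
    """Largest k eigenvalues (descending) and their eigenvectors."""
    vals, vecs = np.linalg.eigh(omega_i)   # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]     # indices of the k largest
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
m = 50
missed = rng.standard_normal(m)            # hypothetical loading the model left out
omega_i = 0.04 * np.eye(m) + 0.09 * np.outer(missed, missed)
vals, vecs = top_eigen(omega_i)
```

<p>Here the first eigenvalue towers over the rest and its eigenvector lines up with the missed loading, which is the signature of uncaptured macrostructure.</p>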
<p>However depending on the purpose of the risk model, this isn’t always a relevant performance characteristic to consider. For example, a risk model with very few but tradeable factors (e.g. index tracking factors) that explain the majority of the variance of a portfolio is a great <em>hedging</em> tool. Having very few tradeable factors means it is feasible to hedge a strategy based on the model (it would be really hard to hedge a portfolio using 100+ instruments). On the other hand, the idiosyncratic covariance matrix is bound to have lots of uncaptured macrostructure. This is often a tradeoff a portfolio manager is willing to make.</p>
<h5 id="marchenko-pastur">Marchenko-Pastur</h5>
<p>Similar to asking “is there macrostructure in our idiosyncratic covariance matrix”, we can ask “does the idiosyncratic covariance matrix match our assumptions?”. We estimated our factor model such that the <strong>idiosyncratic returns are zero-centered gaussian noise</strong>. What’s the likelihood that our assumptions are correct?</p>
<p>In the field of random matrix theory (RMT), we’ll loosely<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> define a random matrix drawn from a Wishart ensemble as a matrix $X$ constructed from the following steps:</p>
<p>First create a matrix $M \in \mathbb{R}^{n \times m}$ such that $M_{i,j} \sim \mathcal{N}(\mu,\sigma^2)$ are independently sampled. This matches our assumptions if for each row in $M$ we have the idiosyncratic returns for that day.</p>
<p>We then symmetrize the matrix by taking $X := \frac{M^TM}{n}$. Assuming idiosyncratic returns have mean 0, this is the sample covariance matrix. $X$ is symmetric, so its eigenvalues are real, and since $v^TXv = \frac{1}{n}\vert\vert Mv \vert\vert^2 \geq 0$ for any $v$ they are also non-negative, i.e. $X$ is positive semidefinite. The joint probability distribution of this matrix is proportional to:</p>
\[p(X) \propto e^{\frac{-1}{2}tr(X)} det(X)^{\frac{1}{2} (n-m-1)}\]
<p>Two random dudes named Marchenko and Pastur were looking at the spectral density of the Wishart ensemble by asking the question “what happens to the distribution of eigenvalues of $X$ asymptotically as $n$ and $m$ increase?”. As $m$ increases, $X$ becomes a larger square matrix with more non-negative eigenvalues. As $n$ increases, we can use the law of large numbers to reach some convergence guarantees (we also need $n$ to increase so the covariance matrix is not degenerate). Under the assumption that $n$ and $m$ grow in some asymptotic ratio $c := m/n$, they were able to find the spectral density function $f(x)$ as:</p>
\[f(x) = \frac{1}{2\pi c x}\sqrt{(b-x)(x-a)}\]
<p>with support $[a,b]$, where $a = (1-\sqrt{c})^2$ and $b = (1+\sqrt{c})^2$ for unit-variance entries (both scale with $\sigma^2$ otherwise). The distribution looks like this with $c < 1$:</p>
<p><img src="http://oneraynyday.github.io/assets/marchenko1.png" alt="marchenko1" height="60%" width="60%" /></p>
<p>For $c > 1$:</p>
<p><img src="http://oneraynyday.github.io/assets/marchenko2.png" alt="marchenko2" height="60%" width="60%" /></p>
<p>If we can derive a reasonable $c$, either by getting a maximum likelihood estimate from our idiosyncratic covariance matrix or by an educated forecast, we can calculate the likelihood that the eigenvalues of our idiosyncratic covariance matrix are sampled from the Marchenko-Pastur distribution. The higher the likelihood, the more likely our idiosyncratic returns are Gaussian random noise.</p>
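<p>A quick way to build intuition is to simulate pure noise and compare its eigenvalues with the Marchenko-Pastur support. The sketch below assumes the standard form of the law, with edges $\sigma^2(1 \mp \sqrt{c})^2$; the helper names <code>mp_support</code> and <code>mp_density</code> and the sizes $n = 2000$, $m = 500$ are arbitrary choices for illustration.</p>

```python
import numpy as np

def mp_support(c, sigma2=1.0):
    """Edges [a, b] of the Marchenko-Pastur bulk for ratio c = m / n."""
    return sigma2 * (1 - np.sqrt(c)) ** 2, sigma2 * (1 + np.sqrt(c)) ** 2

def mp_density(x, c, sigma2=1.0):
    """Marchenko-Pastur density, zero outside the support."""
    a, b = mp_support(c, sigma2)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    density = np.zeros_like(x)
    inside = (x > a) & (x < b)
    xi = x[inside]
    density[inside] = np.sqrt((b - xi) * (xi - a)) / (2 * np.pi * c * sigma2 * xi)
    return density

rng = np.random.default_rng(0)
n, m = 2000, 500                      # n days of returns for m assets
M = rng.standard_normal((n, m))       # pure noise "idiosyncratic returns"
X = M.T @ M / n                       # sample covariance, assuming zero mean
eigs = np.linalg.eigvalsh(X)
a, b = mp_support(m / n)
share_inside = np.mean((eigs >= a) & (eigs <= b))  # should be close to 1
```

<p>For real residuals, eigenvalues sitting well above $b$ are the ones worth investigating as missing factors.</p>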
<h5 id="johns-sphericity-test">John’s sphericity test</h5>
<p>Marchenko and Pastur have given us a useful tool to measure how likely our idiosyncratic covariance matrix belongs to a Wishart ensemble (i.e. the underlying idiosyncratic returns are Gaussian noise). Another trait we’ve assumed about our idiosyncratic covariance matrix is that <strong>the off-diagonal entries should be zero</strong>, as in the noise is uncorrelated across assets. Some random dude named John came up with this statistic in the 1970’s:</p>
\[U = \frac{1}{m}\operatorname{tr}\left(\left(\frac{C}{\operatorname{tr}(C)/m} - I\right)^2\right)\]
<p>Here, $C$ is our <strong>correlation</strong> matrix (which can be directly derived from $\Omega$, our covariance matrix). Since trace is rotation invariant, the quantity inside is roughly equivalent to the squared error between the eigenvalues of a spherical distribution and the elliptical idiosyncratic covariance distribution. The score by itself has little interpretability. Comparing between two different models under the same symbol universe (for same $m$) allows us to understand which model has a better idiosyncratic covariance matrix - a lower score is “more diagonal” and better.</p>
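<p>The statistic is a one-liner in practice. Below is a minimal sketch; the function name <code>johns_u</code> and the equicorrelated test matrix are assumptions for illustration.</p>

```python
import numpy as np

def johns_u(C):
    """John's sphericity statistic U for an (m, m) correlation matrix C."""
    m = C.shape[0]
    diff = C / (np.trace(C) / m) - np.eye(m)   # deviation from the identity
    return float(np.trace(diff @ diff) / m)

m = 4
rho = 0.5
equicorrelated = np.full((m, m), rho) + (1 - rho) * np.eye(m)
u_identity = johns_u(np.eye(m))       # perfectly spherical
u_correlated = johns_u(equicorrelated)
```

<p>For a correlation matrix the diagonal drops out, so $U$ reduces to the average squared off-diagonal mass per asset, here $(m-1)\rho^2$.</p>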
<h2 id="loss-functions">Loss functions</h2>
<p>Loss functions are a straightforward way to quantitatively reason about the performance of a model. Loss functions output a number - if it’s high we’re sad and if it’s low we’re happy. The number usually isn’t interpretable, so it is only useful when there are multiple numbers corresponding to different models to compare with.</p>
<p>Typical supervised learning problems often have simple loss functions. A common choice is to use the negative log-likelihood as your loss function with some regularization terms in the Lagrangian. The intuition is simple - try to select an estimator from a family of functions with relatively low bias and variance.</p>
<p>In the context of backtesting factor models, we can use loss functions to evaluate the accuracy of predicted portfolio variance. Concretely, variance loss functions are mappings of the form:</p>
\[f: \mathbb{R}_+^T \times \mathbb{R}_+^T \to \mathbb{R}\]
<p>This reads “the function takes in two vectors containing positive values indexed by time and outputs a real number”. One of the positive vector arguments is the actual portfolio variance across days, and the other is the predicted portfolio variance across days. We can <em>almost</em> consider the actual portfolio variance as the labels of the training set. <em>Unfortunately, we can’t observe the variance of any assets at a single point in time!</em> Asset variance is a nonstationary, latent quantity so we don’t have any labels for our predictions! What do we do?</p>
<h3 id="proxies-ex-post">Proxies (ex post)</h3>
<p>Proxies are able to give a reasonable volatility estimate to use as labels to compare against predictions. A simple volatility proxy can be computed as:</p>
\[\begin{split}
\sigma^2 = (w^Tr)^2 \\
\sigma = \vert w^Tr \vert
\end{split}\]
<p>The above proxy is <em>very</em> noisy but has less bias than some types of estimates, making this a bias-variance tradeoff problem. We now have some noisy estimates of our labels, which isn’t the most satisfying but we tread onwards.</p>
<h3 id="predictors-ex-ante">Predictors (ex ante)</h3>
<p>Given a factor model, we can calculate the covariance matrix using:</p>
\[\hat{\Omega} = B\Omega_fB^T + \Omega_i\]
<p>where $\Omega_i$ is a diagonal covariance matrix, $B$ is the loadings of symbols to factors, and $\Omega_f$ is the full factor covariance matrix.</p>
<p>Given a portfolio $w \in \mathbb{R}^m$, we can predict its variance:</p>
\[\hat{\sigma}^2 = w^T \hat{\Omega} w\]
<p>Now that we have the two inputs to a loss function, let’s choose our functions!</p>
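<p>Both inputs are cheap to compute. The sketch below assumes a diagonal $\Omega_i$ stored as a vector of idiosyncratic variances; the function names and the random test data are hypothetical.</p>

```python
import numpy as np

def variance_proxy(w, r):
    """Ex post proxy: squared portfolio return (w' r)^2 for one period."""
    return float((w @ r) ** 2)

def predicted_variance(w, B, omega_f, idio_var):
    """Ex ante w' (B Omega_f B' + diag(idio_var)) w, without forming
    the full m x m covariance matrix."""
    exposure = B.T @ w                      # factor exposures of the portfolio
    return float(exposure @ omega_f @ exposure + np.sum(idio_var * w ** 2))

rng = np.random.default_rng(0)
m, f = 20, 3
B = rng.standard_normal((m, f))
A = rng.standard_normal((f, f))
omega_f = A @ A.T + np.eye(f)               # a valid factor covariance
idio_var = rng.uniform(0.01, 0.05, size=m)
w = rng.standard_normal(m)
r = 0.01 * rng.standard_normal(m)
sigma2_hat = predicted_variance(w, B, omega_f, idio_var)
sigma2 = variance_proxy(w, r)
```

<p>Working in factor space keeps the prediction at $O(mf)$ per portfolio, which matters when backtesting many portfolios over many dates.</p>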
<h3 id="mse">MSE</h3>
<p>MSE or mean squared error is the classic loss function everyone should know about. It’s literally just the average squared Euclidean distance between $\sigma^2$ and $\hat{\sigma}^2$ over dates:</p>
\[\text{MSE}(\hat{\sigma}^2, \sigma^2) := \frac{1}{T} \vert \vert \hat{\sigma}^2 - \sigma^2 \vert \vert _2^2 = \frac{1}{T} \sum_t (\hat{\sigma}_t^2 - \sigma_t^2)^2\]
<p>The farther away our predicted portfolio variance, the higher the loss. If our prediction is exactly accurate, the loss would be 0.</p>
<p>One consequence of having a very noisy variance estimate is that the squared distance can differ wildly from one period to the next. Imagine our variance proxy came out to 100 dollars the first day and 1,000,000 dollars the second day. If we were to predict 1 dollar the first day and 999,000 dollars the second day, the second day’s distance would dominate the loss function purely due to the magnitude of the proxy. In reality, we were really close to predicting the second day’s variance - it’s just harder to be exact when the magnitude is larger. To alleviate this problem, we typically use a normalized MSE:</p>
\[\text{MSE}'(\hat{\sigma}^2, \sigma^2) := \frac{1}{T} \vert \vert \frac{\sigma^2}{\hat{\sigma}^2} - 1 \vert \vert _2^2\]
<p>which removes the noise introduced by varying magnitudes.</p>
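<p>Plugging the two-day example above into both losses makes the difference concrete. The function names <code>mse</code> and <code>normalized_mse</code> are just labels for this sketch.</p>

```python
import numpy as np

def mse(pred, proxy):
    """Mean squared error between predicted and proxy variances."""
    return float(np.mean((pred - proxy) ** 2))

def normalized_mse(pred, proxy):
    """MSE of the proxy-to-prediction ratio against 1."""
    return float(np.mean((proxy / pred - 1) ** 2))

proxy = np.array([100.0, 1_000_000.0])
pred = np.array([1.0, 999_000.0])

raw = mse(pred, proxy)                 # dominated by day 2: 1000**2 vs 99**2
scaled = normalized_mse(pred, proxy)   # dominated by day 1: ratio of 100 vs ~1.001
```

<p>The raw loss blames the second day, where we were off by 0.1%, while the normalized loss blames the first day, where we were off by a factor of 100.</p>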
<h3 id="qlike">QLike</h3>
<p>QLike or quasi-likelihood is an improvement on MSE for $\sigma^2$ estimations. It’s defined as follows:</p>
\[\text{QLIKE}(\hat{\sigma}^2, \sigma^2) := \frac{1}{T} \sum_t \left[ -\log\left(\frac{\sigma_t^2}{\hat{\sigma}_t^2}\right) + \frac{\sigma_t^2}{\hat{\sigma}_t^2} - 1 \right]\]
<p>Note the logarithmic term doesn’t allow any negative variance predictions, which is great (MSE doesn’t care in comparison). When the ratio of proxy to prediction is small, the negative log term dominates the loss function. When the ratio of proxy to prediction is large, the linear term dominates the loss function. The sum of these convex functions forms a convex function with a single minimum where the ratio is 1.</p>
<p>If we set $x := \frac{\sigma_t^2}{\hat{\sigma}_t^2}$ and perform a Taylor expansion at $x = 1$, we see the following:</p>
\[\text{QLIKE}(x) = 0 + (-1+1)(x-1) + \frac{(1)(x-1)^2}{2} + O( \vert x-1 \vert ^3 )\approx \frac{1}{2}\text{MSE}'(x)\]
<p>Locally, it seems like quasi likelihood is very similar to a normalized mean squared error! So what is this <strong>quasi</strong>-likelihood and why do we care?</p>
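<p>A small numerical check of that local behaviour, as a sketch: the function name <code>qlike</code> is an assumption, and the proxy must be strictly positive for the logarithm to be finite.</p>

```python
import numpy as np

def qlike(pred, proxy):
    """QLIKE loss: mean of ratio - log(ratio) - 1 with ratio = proxy / pred."""
    ratio = proxy / pred
    return float(np.mean(ratio - np.log(ratio) - 1))

pred = np.array([1.0, 2.0, 4.0])
exact = qlike(pred, pred)                      # perfect forecast
near = qlike(pred, 1.01 * pred)                # proxy 1% above the forecast
quadratic = 0.5 * (1.01 - 1) ** 2              # second-order Taylor term
under = qlike(pred, 4.0 * pred)                # forecast 4x too low
over = qlike(pred, 0.25 * pred)                # forecast 4x too high
```

<p>Away from a ratio of 1 the loss is asymmetric: in this example under-predicting variance by a factor of 4 costs more than over-predicting it by the same factor.</p>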
<h4 id="what-is-quasi-likelihood">What is Quasi-Likelihood?</h4>
<p>Recall that in a typical linear regression, we assume $E(y \vert x)$ is a linear function in $x$, so that:</p>
\[y = \beta^Tx + \epsilon\]
<p>For some $\beta$. The noise around the mean is usually assumed to be i.i.d gaussian:</p>
\[\epsilon \sim \mathcal{N}(0, \sigma^2)\]
<p>These assumptions can be relaxed and generalized to fit other expectation functions with noise from an <a href="https://en.wikipedia.org/wiki/Exponential_family">exponential family distribution</a>. For example, we can perform a gamma regression with (you don’t need to know this):</p>
\[y = e^{\beta^Tx} + \epsilon\]
<p>which is a better modeling choice when $y \geq 0$. Denoting $E(y \vert x) = \mu$, the noise from a gamma regression is assumed to be:</p>
\[\epsilon \sim \Gamma(0, \frac{\mu}{\beta^2})\]
<p>However, when the underlying distribution is unknown (e.g. the variance of our portfolio), how can we assign a likelihood function? One way to get around this is to construct a quasi-likelihood model, which not only assumes $y = h(\beta^Tx) + \epsilon$ for some $h$, but also $\epsilon$ is a zero-mean distribution with variance as a function of the mean of $y \vert x$:</p>
\[\mathbb{V}(\epsilon) = \phi V(\mu)\]
<p>Here $\phi$ is a hyperparameter and $V$ is a function we construct. Note how previously we specified the exact distribution of $\epsilon$, but now we only characterize $\epsilon$ by its first and second moments? This is because we don’t know the underlying distribution, and that’s where the “quasi” part comes from.</p>
<p>Since we don’t have a likelihood(or log-likelihood) function, we’re gonna have to make one. The log-likelihood function is an integral of the score function, which has the following requirements:</p>
\[\begin{split}
E(S(\mu)) = 0 \\
\mathbb{V}(S(\mu)) = -E(\frac{\partial S(\mu)}{\partial \mu})
\end{split}\]
<p>We define a quasi-score function as:</p>
\[S(\mu) = \frac{y - \mu}{\phi V(\mu)}\]
<p>which has a zero mean:</p>
\[E(S(\mu)) = \int \frac{y - \mu}{\phi V(\mu)} p(y)dy = \frac{\int yp(y)dy - \mu}{\phi V(\mu)} = \frac{\mu - \mu}{\phi V(\mu)} = 0\]
<p>The variance is:</p>
\[\begin{split}
\mathbb{V}(S(\mu)) &= E(S(\mu)^2) - E(S(\mu))^2 \\
&= \int (\frac{y-\mu}{\phi V(\mu)})^2 p(y)dy - 0^2 \\
&= \frac{\int y^2 p(y)dy - 2 \mu^2 + \mu^2}{(\phi V(\mu))^2} \\
&= \frac{E(y^2) - E(y)^2}{\mathbb{V}(y)^2} \\
&= \frac{1}{\mathbb{V}(y)}
\end{split}\]
<p>And it satisfies the constraints of a score function:</p>
\[-\mathbb{E}(\frac{\partial S(\mu)}{\partial \mu}) = -E(\frac{(\mu - y)\frac{\partial V(\mu)}{\partial \mu} - V(\mu)}{\phi V(\mu)^2}) = \frac{V(\mu)}{\phi V(\mu)^2} = \frac{1}{\mathbb{V}(y)}\]
<p>Given some choices of $V(\mu)$, we would get appropriate quasi log-likelihoods by integration. For example, if we chose $V(\mu) = 1$, we would get a normal distribution with quasi-likelihood equal to:</p>
\[log\mathcal{L}(\mu, y) = -\frac{(y-\mu)^2}{2}\]
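<p>The QLIKE loss we chose is consistent with the construction above if we set $\phi = 1$ and $V(\mu) = \mu^2$, with $y = \sigma_t^2$ and $\mu = \hat{\sigma}_t^2$: integrating the quasi-score $\frac{y - \mu}{\mu^2}$ with respect to $\mu$ gives $-\frac{y}{\mu} - \log \mu$, which is the negative of the QLIKE summand up to terms that do not depend on $\mu$. We were able to create it without knowing what the underlying distribution is, which may not be a member of the exponential family. For more information on the choice of $V$, refer to the paper <em>“Volatility Forecast Comparison using Imperfect Volatility Proxies”</em> by A.J. Patton.</p>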
<h2 id="test-portfolios">Test portfolios</h2>
<p>Now that we’ve defined some loss functions, we need to choose appropriate test portfolios to estimate the variance. There are <em>tons</em> of different test portfolios that may be useful, but we’ll outline a few here.</p>
<h3 id="historical-strategy-portfolios">Historical strategy portfolios</h3>
<p>If you’re backtesting your factor model, the first thing that comes to mind should be to evaluate it on your historical portfolios! Arguably, there is no cleaner, practical dataset to evaluate on than your own strategies, which you’ve either made or lost tons of money on. You know what the variance of your PnL looks like better than anyone else, and that serves as an additional heuristic to measure the performance of your portfolios aside from the noisy variance proxy.</p>
<h3 id="benchmark-portfolios">Benchmark portfolios</h3>
<p>Benchmark portfolios track the weights of constituents of a large index. For example, we can have a benchmark portfolio tracking the S&P500 by mirroring the holdings of any S&P500 ETF like IVV. A comprehensive factor model should be able to predict the benchmark portfolio’s variance accurately. Conveniently, we have a pretty good proxy for the S&P500 volatility - VIX, an index of implied (forward-looking) volatility! If our predicted volatility is way off from the historical VIX values, there’s a good chance our factor model performs really badly.</p>
<p>Benchmarks in general span a large set of assets so we expect the variance proxies to be less noisy, which helps us interpret the loss function metrics.</p>
<h3 id="minimum-variance-portfolios">Minimum variance portfolios</h3>
<p>Given a risk model, we can construct a portfolio that is the result of an optimization problem:</p>
\[\begin{split}
\text{max}_w & \quad e^Tw \\
\text{s.t.} & \quad w^T \Omega w \leq 1 \\
\end{split}\]
<p>where $e$ is the vector of all 1’s. This is the portfolio corresponding to the largest notional holding possible with unit portfolio variance. If we focused on the set of solutions such that $w^T \Omega w = 1$, this allocation gives us the maximum capital investment(but not returns!) fixed to a variance. This portfolio should stay relatively stable over time with relatively low variance (the model should predict unit variance). If the variance proxy associated with this portfolio is very large, it means the estimated $\Omega$ is not stable for e.g. mean variance optimization.</p>
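<p>When $\Omega$ is positive definite, this problem has a closed-form solution: the Lagrangian conditions give $w \propto \Omega^{-1}e$, rescaled so that $w^T \Omega w = 1$. The sketch below assumes that closed form; the function name and the random covariance matrix are made up for illustration.</p>

```python
import numpy as np

def max_notional_unit_variance(omega):
    """Solve max e'w s.t. w' Omega w <= 1 for positive definite Omega."""
    e = np.ones(omega.shape[0])
    x = np.linalg.solve(omega, e)    # Omega^{-1} e
    return x / np.sqrt(e @ x)        # rescale so that w' Omega w = 1

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
omega = A @ A.T + np.eye(5)          # a positive definite test covariance
w = max_notional_unit_variance(omega)
predicted = w @ omega @ w            # unit variance by construction
```

<p>In a backtest you would rebuild $w$ each period from that period’s $\hat{\Omega}$ and check that the realized variance proxy stays near 1 on average.</p>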
<h3 id="factor-mimicking-portfolios">Factor mimicking portfolios</h3>
<p>A factor mimicking portfolio is a portfolio which gives unit exposure to a single factor and zero exposure to everything else. When we estimate the factor returns (covered in previous post) in the case of weighted least squares:</p>
\[\hat{f} = (B^TWB)^{-1}B^TWr = P^Tr\]
<p>We get an intermediate quantity $P \in \mathbb{R}^{m \times f}$:</p>
\[P = \begin{bmatrix}
\vert & \vert & ... & \vert \\
p_1 & p_2 & ... & p_f \\
\vert & \vert & ... & \vert
\end{bmatrix}\]
<p>When we interpret the individual columns $p_i$ of $P$ as portfolios and calculate their factor exposures:</p>
\[B^TP = B^T[(B^TWB)^{-1}B^TW]^T = (B^TWB)(B^TWB)^{-1} = I\]
<p>We see that for $p_i$ we have unit exposure to the $i$-th factor and zero exposure to everything else! If we are interested in the variance of a particular factor tracking portfolio (e.g. it has a positive expected return), we would use it in the loss function and see how well the model can predict the portfolio’s variance.</p>
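<p>The identity $B^TP = I$ is easy to verify numerically. This sketch assumes a diagonal weight matrix $W$ passed as a vector; the function name and the random loadings are hypothetical.</p>

```python
import numpy as np

def factor_mimicking_portfolios(B, weights):
    """P = W B (B' W B)^{-1} for diagonal W given as a vector of weights.
    Column i of P is the portfolio mimicking factor i."""
    WB = weights[:, None] * B                     # W B
    return np.linalg.solve(B.T @ WB, WB.T).T      # (B' W B) is symmetric

rng = np.random.default_rng(0)
m, f = 30, 3
B = rng.standard_normal((m, f))
weights = rng.uniform(0.5, 2.0, size=m)           # e.g. inverse idio variances
P = factor_mimicking_portfolios(B, weights)
r = 0.01 * rng.standard_normal(m)
f_hat = P.T @ r                                   # estimated factor returns
exposures = B.T @ P                               # should be the identity
```

<p>The factor return estimate is literally the return of these portfolios, which is why their predicted variance is a natural thing to test.</p>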
<h3 id="hedged-portfolios">Hedged portfolios</h3>
<p>Given some arbitrary portfolio, we can decompose the holdings:</p>
\[\begin{split}
w = w_f + w_i \\
w_f \in \text{span}(P) \\
w_i^TB = 0 \\
w_f \perp w_i
\end{split}\]
<p>In other words, a portfolio can be decomposed to one with zero factor exposure, and another that can be constructed by some linear combination of factor mimicking portfolios. A portfolio with zero exposures is one that is <strong>hedged</strong> to the entire factor model. Calculating this is easy if we already have $P$.</p>
<p>It is important to note that although the hedged portfolio has zero factor exposure, the risk model will still predict non-zero variance due to the diagonal $\Omega_i$ idiosyncratic variance term in $\hat{\Omega} = B\Omega_fB^T + \Omega_i$. Regardless, if the variance proxy of a hedged portfolio is large it means the factor model is unable to express all risk factors and cannot hedge a portfolio appropriately.</p>
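<p>Given $P$, the decomposition is two lines. The sketch below uses the unweighted case $W = I$, so $P = B(B^TB)^{-1}$, as a simplifying assumption; the function name <code>hedge</code> and the random inputs are illustrative.</p>

```python
import numpy as np

def hedge(w, B, P):
    """Split w into a factor part (combination of factor mimicking
    portfolios) and a hedged part with zero factor exposure."""
    w_f = P @ (B.T @ w)      # replicate w's exposures with the columns of P
    w_i = w - w_f            # what is left carries no factor exposure
    return w_f, w_i

rng = np.random.default_rng(0)
m, f = 30, 3
B = rng.standard_normal((m, f))
P = B @ np.linalg.inv(B.T @ B)    # factor mimicking portfolios for W = I
w = rng.standard_normal(m)
w_f, w_i = hedge(w, B, P)
residual_exposure = B.T @ w_i     # numerically zero
```

<p>With a non-identity $W$ the hedged part still has zero exposure, since $B^TP = I$, but the two pieces are in general no longer orthogonal in the plain Euclidean sense.</p>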
<h3 id="portfolio-statistic-turnover">Portfolio Statistic: Turnover</h3>
<p>Aside from using the portfolio to compare predicted variance and variance proxies, there are other considerations when it comes to evaluating a risk model. The turnover of a portfolio is defined as:</p>
\[\text{turnover}(w_0,...,w_T) = \sum_{t=1}^T \vert \vert w_{t-1} \circ r_{t-1} - w_t \vert \vert _2^2\]
<p>Intuitively, the above expression gives us a measurement of how much a portfolio “moves” over time. In an ideal world, we’d like a portfolio that moves minimally but still makes a boatload of money. Unfortunately we live in a society, so that means the turnover and alpha potential of the portfolio are often at odds with each other. Large turnover doesn’t only mean more work - the market impact of moving large amounts of capital is nontrivial. If we had to suddenly purchase 1 million shares of GOOG in under a minute, we would crash through the order book and get filled at very suboptimal prices, while also pushing GOOG’s price up significantly.</p>
<p>So suppose we’ve calculated the optimal mean-variance portfolio for a given factor model - we would love to trade with it if it makes us a ton of money with relatively low risk. We should first backtest the turnover statistic of the mean-variance portfolio to see if it’s actually something we can feasibly trade. If the fees for moving from the optimal portfolio at time $T$ to the one at time $T+1$ are too high, we can’t realize any profit.</p>
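<p>As a rough sketch, the turnover statistic can be computed directly from a backtested path of holdings. The function name <code>turnover</code> and the toy numbers are mine, and I am assuming $r_t$ is the growth ratio of each asset, so that $w_{t-1} \circ r_{t-1}$ is what the old holdings drift to before we trade.</p>

```python
import numpy as np

def turnover(weights, returns):
    """Sum over t of ||w_{t-1} * r_{t-1} - w_t||^2, with * taken elementwise.

    weights: array of shape (T + 1, n), holdings w_0 ... w_T
    returns: array of shape (T + 1, n), growth ratios r_0 ... r_T
    """
    weights = np.asarray(weights, dtype=float)
    returns = np.asarray(returns, dtype=float)
    drifted = weights[:-1] * returns[:-1]  # holdings after the market moves
    trades = drifted - weights[1:]         # what we must trade to reach w_t
    return float(np.sum(trades ** 2))

r = np.array([[1.5, 0.5], [1.0, 1.0]])
buy_and_hold = np.array([[100.0, 100.0], [150.0, 50.0]])  # let positions drift
rebalanced = np.array([[100.0, 100.0], [100.0, 100.0]])   # trade back to equal dollars

print(turnover(buy_and_hold, r))
print(turnover(rebalanced, r))
```

<p>The buy-and-hold path never trades, so its turnover is zero, while the rebalanced path pays for pulling the drifted holdings back to equal dollars.</p>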
<h2 id="conclusion">Conclusion</h2>
<p>Model evaluation is a <strong>really</strong> hard problem with no clear answers. It requires the portfolio manager to make active tradeoffs and to have a solid investment thesis for the corresponding evaluation suite to be useful. We looked at the construction of the factor model, and how well behaved the loadings are. We took a deep dive into the idiosyncratic covariance matrix and whether residual structures break our modelling assumptions. We discussed different loss functions with different theoretical motivations, and how we can use a variety of test portfolios to plug backtested variance estimates to comparatively evaluate our models. There are a lot of other considerations not listed in this blog, but I hope this serves as a brief introduction to evaluating your models!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I lied to make this simpler. For a general Wishart ensemble the pdf is parametrized with $\beta$ in the exponentiation, which is called the Dyson index. For the construction we defined, $\beta = 1$ because we’ve constructed a Wishart ensemble using real numbers in $M$. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Portfolio Risk - Part 12022-01-23T00:00:00+00:00http://oneraynyday.github.io/math/2022/01/23/Portfolio-Risk-Pt1<style>
.red {
color:inherit;
}
.red:hover {
color:rgb(129, 17, 18);
}
.collapse:hover {
cursor: pointer;
}
</style>
<h1 id="portfolio-risk---part-1-modeling">Portfolio Risk - Part 1: Modeling</h1>
<p>The first step to making money in the stock market is having an accurate model of where the sources of profits (and losses) are coming from. We’ll go through some basic (but time-tested) modelling assumptions and introduce some ways a portfolio manager might evaluate their positions in the market.</p>
<p>Disclaimer: I’m not responsible for any hilarious losses incurred by a portfolio inspired by methodologies listed below.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#portfolio-risk---part-1-modeling" id="markdown-toc-portfolio-risk---part-1-modeling">Portfolio Risk - Part 1: Modeling</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#returns-and-log-returns" id="markdown-toc-returns-and-log-returns">Returns and Log-returns</a></li>
<li><a href="#modeling-returns" id="markdown-toc-modeling-returns">Modeling Returns</a> <ul>
<li><a href="#when-sigma_epsilon-is-not-diagonal" id="markdown-toc-when-sigma_epsilon-is-not-diagonal">When $\Sigma_\epsilon$ is not diagonal</a></li>
<li><a href="#rank-preserving-linear-transformations" id="markdown-toc-rank-preserving-linear-transformations">Rank-preserving linear transformations</a></li>
<li><a href="#projection" id="markdown-toc-projection">Projection</a></li>
<li><a href="#lifting" id="markdown-toc-lifting">Lifting</a></li>
<li><a href="#log-returns-covariance-transformation" id="markdown-toc-log-returns-covariance-transformation">Log-returns covariance transformation</a></li>
</ul>
</li>
<li><a href="#portfolio-optimization" id="markdown-toc-portfolio-optimization">Portfolio Optimization</a> <ul>
<li><a href="#robust-er-formulation" id="markdown-toc-robust-er-formulation">Robust-er formulation</a></li>
</ul>
</li>
<li><a href="#performance-attribution" id="markdown-toc-performance-attribution">Performance Attribution</a></li>
<li><a href="#stress-tests" id="markdown-toc-stress-tests">Stress Tests</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
</li>
</ul>
<h2 id="returns-and-log-returns">Returns and Log-returns</h2>
<p>Making money is all about getting positive <strong>returns</strong>. One way to define return of holding an asset with price $p_t$ over 1 time-step is by measuring it in a comparative growth ratio:</p>
\[r_t = \frac{p_t}{p_{t-1}}\]
<p>This is great because returns become a unit-less number, one that doesn’t change as the price of the asset becomes larger or smaller, which is handy when we may potentially have many assets to worry about in a portfolio. Also, this equation is obviously a simplification of the real world, so it misses some details like dividends. With this, we can define <strong>cumulative returns</strong> as:</p>
\[R_t = \prod_t r_t\]
<p>If we assume that $r_t$ is normally distributed, which sort of makes sense if you were a <a href="https://www.forbes.com/sites/rickferri/2012/12/20/any-monkey-can-beat-the-market/?sh=2e56b4fe630a">blindfolded monkey throwing darts at stocks</a>, you would find that analyzing $R_t$ is going to be very hard because the product of gaussians is not gaussian. As a solution to this, people tried to tractably model returns by assuming $p_t$ is distributed log-normally, which kind of makes sense because prices must (at least in equities) be greater than zero. If we plug it in, we would see that $log(r_t)$ (log-returns) is actually normally distributed:</p>
\[log(r_t) = log(\frac{p_t}{p_{t-1}}) \sim \mathcal{N}(\mu, \sigma^2) \\
p_t, p_{t-1} \sim log\mathcal{N}(\mu, \sigma^2)\]
<p>This is great because we want to analyze cumulative log-returns, which is now:</p>
\[R_t = log(\prod_t r_t) = \sum_t log(r_t) = log(p_t) - log(p_{t-1}) + log(p_{t-1}) - log(p_{t-2}) + ... - log(p_0) \\
= log(p_t) - log(p_0) \sim \mathcal{N}(t\mu, t\sigma^2)\]
<p>The last step assumes the log-returns are independent across time steps, so summing $t$ of them scales both the mean and the variance by $t$. This makes calculations tractable. There are also a couple other mathematical benefits to making this assumption, like accuracy of approximation on small returns. We won’t dive into that rabbit hole in this blogpost, but you can find some good motivation <a href="https://quantivity.wordpress.com/2011/02/21/why-log-returns/">here</a>.</p>
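<p>The telescoping sum is easy to check numerically. The sketch below simulates a price path from normally distributed log-returns (the drift, volatility and starting price are made-up numbers) and confirms that summing the log-returns gives the log of the final price over the initial price.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, T = 0.001, 0.02, 250        # hypothetical per-step parameters
log_r = rng.normal(mu, sigma, size=T)  # i.i.d. normal log-returns
prices = 100.0 * np.exp(np.concatenate([[0.0], np.cumsum(log_r)]))

r = prices[1:] / prices[:-1]           # growth ratios r_t = p_t / p_{t-1}
cumulative_log_return = np.log(r).sum()
telescoped = np.log(prices[-1]) - np.log(prices[0])
print(cumulative_log_return, telescoped)
```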
<h2 id="modeling-returns">Modeling Returns</h2>
<p>Factor models are linear models that describe the returns (not log-returns) of the assets in a portfolio as a combination of factor exposures and idiosyncratic returns:</p>
\[r = Bf + \epsilon\]
<p>where $r \in \mathbb{R}^n, B \in \mathbb{R}^{n \times m}, f \in \mathbb{R}^m, \epsilon \in \mathbb{R}^n$. $r$ is the vector of returns of the $n$ assets in the portfolio, $B$ is the factor loading matrix from each asset to each factor, $f$ is the factor returns, and $\epsilon$ is the idiosyncratic returns, which should (ideally) be a vector of uncorrelated random variables that capture the uniqueness of each asset in the portfolio. It is also worth noting that $\epsilon$ is uncorrelated with $f$ because it’s the residual left over from a regression on $f$. Here, idiosyncratic simply means returns not explained by the factor exposures. We can interpret this as a graphical, causal-inference model in which underlying factors linearly impact each asset’s return with some weight. One important property we care about is the covariance matrix of returns, which can be expressed in terms of $\Sigma_f, \Sigma_\epsilon$:</p>
\[\Sigma_r = Var(r) = Var(Bf + \epsilon) = B\Sigma_fB^T + \Sigma_\epsilon\]
<p>Here, since the entries of $\epsilon$ can be considered uncorrelated (but potentially dependent!) random variables, the matrix $\Sigma_\epsilon$ can be assumed diagonal. This is not always a good assumption though.</p>
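<p>To make the covariance formula concrete, here is a small sketch that assembles $\Sigma_r$ from made-up loadings, factor covariance and diagonal idiosyncratic variances, and checks that a portfolio’s variance splits into a factor part and an idiosyncratic part. All sizes and numbers are placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2                                    # assets, factors (hypothetical sizes)
B = rng.normal(size=(n, m))                    # factor loadings
A = rng.normal(size=(m, m))
sigma_f = A @ A.T + np.eye(m)                  # a valid factor covariance
sigma_eps = np.diag(rng.uniform(0.1, 0.5, n))  # diagonal idiosyncratic variances

sigma_r = B @ sigma_f @ B.T + sigma_eps        # covariance of r = B f + eps

w = rng.normal(size=n)                         # arbitrary holdings
b = B.T @ w                                    # portfolio factor exposure
factor_var = b @ sigma_f @ b
idio_var = w @ sigma_eps @ w
print(w @ sigma_r @ w, factor_var + idio_var)
```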
<h3 id="when-sigma_epsilon-is-not-diagonal">When $\Sigma_\epsilon$ is not diagonal</h3>
<p>If the idiosyncratic covariance is not diagonal, that means there is some residual structure amongst the assets that factors are not able to fully capture. If you have an ETF like SPY which contains TSLA as an underlier, and you also have TSLA in your portfolio, it’s obvious that some idiosyncratic returns from TSLA are correlated with SPY’s returns. We would fix $\Sigma_\epsilon$ to be diagonal here by <strong>securitizing</strong> our holdings, which means we get rid of SPY as a real holding and turn it into its constituents (which isn’t always accurate, as the net asset value doesn’t always equal the book price of the ETF). Suppose we have a holdings vector $w \in \mathbb{R}^n$ in units of dollars, then securitization of $w$ to $w'$ is essentially a linear transformation:</p>
\[w' = Xw\]
<p>where $w' \in \mathbb{R}^{k}, X \in \mathbb{R}^{k \times n}, k > n$. You can imagine $X$ containing column vectors:</p>
\[X = \begin{bmatrix}
\vert & \vert & ... & \vert \\
x_1 & x_2 & ... & x_n \\
\vert & \vert & ... & \vert
\end{bmatrix}\]
<p>where each element of the column vector $x_i$ represents how much of each underlier (indexed in $k$ dimensions) one dollar of the $i$-th original asset holds. For example, if SPY were the $i$-th asset in our portfolio and TSLA the $j$-th asset in the securitized output, $X_{ji}$ would be 0.0144 (TSLA’s weight in SPY’s holdings at the time of writing).</p>
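<p>A toy version of securitization, to show the shapes involved. Apart from the 0.0144 TSLA weight quoted above, every number here is invented, and the universe is shrunk to three underliers (with everything else in SPY lumped into one bucket) purely for illustration.</p>

```python
import numpy as np

# Hypothetical universe: the portfolio holds SPY and TSLA directly, and the
# securitized output is expressed in TSLA, AAPL and "everything else in SPY".
original = ["SPY", "TSLA"]
underliers = ["TSLA", "AAPL", "OTHER"]

# Column i says how one dollar of original asset i splits across underliers.
# Only the 0.0144 TSLA weight comes from the text; the rest is made up.
X = np.array([
    [0.0144, 1.0],  # TSLA
    [0.0700, 0.0],  # AAPL (hypothetical weight)
    [0.9156, 0.0],  # rest of SPY (hypothetical weight)
])

w = np.array([10_000.0, 500.0])  # dollars held in SPY and TSLA
w_sec = X @ w                    # securitized holdings w' = X w
print(dict(zip(underliers, w_sec)))
```

<p>Each column of <code>X</code> sums to one here, so the total dollar value is preserved and the direct TSLA position is merged with the TSLA held through SPY.</p>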
<p>Let’s think of another scenario, where the relationship between stocks is not so obvious. Suppose we’re holding a few pharmaceutical companies that specialize in heart disease drugs. These companies don’t own each other as an underlier, so securitization won’t help us. It’s obvious that they’re correlated but a factor like “pharmaceutical heart-disease oriented company factor” seems extremely specific and should not be included in $f$ because the residual structure is minuscule in the universe of assets we worry about in a large portfolio. Then where can we describe this relationship? Another way we can visualize a non-diagonal $\Sigma_\epsilon$ is by taking a graphical approach, where the non-zero elements off the diagonal represent the inverse distance of an edge on a graph where each node is an asset. Using a graphical model here makes the risk model much more complicated (and potentially nonlinear), so we omit the details for this blog.</p>
<p>Here are some properties about the factor model, which we’ll discuss first.</p>
<h3 id="rank-preserving-linear-transformations">Rank-preserving linear transformations</h3>
<p>A factor model you get from Barra has semantic meaning for each factor exposure and loading vector, but you can apply a rank preserving linear transformation to the factors such that the returns are still the same. Let $C \in \mathbb{R}^{m \times m}$ be an invertible matrix, then:</p>
\[B' = BC \\
f' = C^{-1}f\]
<p>Then the original portfolio returns is still the same using these new transformed factor loadings/returns:</p>
\[r = B'f' + \epsilon = BCC^{-1}f + \epsilon = Bf + \epsilon\]
<p>The covariance matrix is preserved under the transformation:</p>
\[Var(B'f' + \epsilon) = B'\Sigma_{f'}B'^T + \Sigma_\epsilon = (BC)(C^{-1}\Sigma_fC^{-T})(C^TB^T) + \Sigma_\epsilon = B\Sigma_fB^T + \Sigma_\epsilon\]
<p>If we want to have unit factor covariance and look at the transformed loadings, we can take $\Sigma_f = USU^T$ via SVD on symmetric matrices and take $C^{-1}= S^{\frac{-1}{2}}U^T$ in the above example, so that $\Sigma_{f'} = S^{\frac{-1}{2}}U^T(USU^T)US^{\frac{-1}{2}} = I$. This transformation normalizes the factors to be uncorrelated with each other, each with unit variance.</p>
<p>We can also make the loading matrix $B$ orthonormal by taking the SVD of $B = USV^T$, and taking $C = VS^{-1}$.</p>
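<p>Both transformations can be sanity-checked with a few lines of numpy. This is a sketch on random placeholder matrices: the whitening step uses $C = US^{\frac{1}{2}}$ (so $C^{-1} = S^{\frac{-1}{2}}U^T$), and the orthonormalization step uses the thin SVD of $B$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3
B = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
sigma_f = A @ A.T + np.eye(m)  # hypothetical factor covariance

# Whitening: Sigma_f = U S U^T, and C = U S^{1/2} gives Var(C^{-1} f) = I.
S, U = np.linalg.eigh(sigma_f)
C = U @ np.diag(np.sqrt(S))
C_inv = np.diag(1.0 / np.sqrt(S)) @ U.T
B_new = B @ C
sigma_f_new = C_inv @ sigma_f @ C_inv.T

# Orthonormal loadings: thin SVD B = U_b S_b V_b^T, and C = V_b S_b^{-1}.
U_b, S_b, Vt_b = np.linalg.svd(B, full_matrices=False)
B_orth = B @ (Vt_b.T @ np.diag(1.0 / S_b))

print(np.round(sigma_f_new, 6))        # close to the identity
print(np.round(B_orth.T @ B_orth, 6))  # close to the identity
```

<p>In both cases the product $B\Sigma_fB^T$ is left untouched, which is the whole point of a rank-preserving transformation.</p>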
<h3 id="projection">Projection</h3>
<p>Projection is not a rank-preserving transformation but we still care about it. We want to minimize some norm of the difference of a lower rank approximation of $Bf$ which can be formalized as follows:</p>
\[min_g E(||Bf - Ag||^2)\]
<p>where $A \in \mathbb{R}^{n \times k}, g \in \mathbb{R}^k, k < m$. This is simple OLS which has a closed form solution if we are given the loadings $A$. The solution is:</p>
\[g = (A^TA)^{-1}A^TBf\]
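<p>A quick sketch of the projection for a single realization of $f$, with random placeholder loadings. It solves the normal equations directly and compares against numpy’s generic least-squares routine.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 10, 4, 2
B = rng.normal(size=(n, m))  # original loadings
A = rng.normal(size=(n, k))  # given lower-rank loadings (hypothetical)
f = rng.normal(size=m)       # one realization of factor returns

# Closed form g = (A^T A)^{-1} A^T B f, solved without forming the inverse.
g = np.linalg.solve(A.T @ A, A.T @ (B @ f))

# Same answer from a generic least-squares solver.
g_lstsq, *_ = np.linalg.lstsq(A, B @ f, rcond=None)
print(g, g_lstsq)
```

<p>Using <code>np.linalg.solve</code> or <code>np.linalg.lstsq</code> rather than an explicit inverse is a numerical-stability habit, not something the derivation requires.</p>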
<h3 id="lifting">Lifting</h3>
<p>In contrast to projection, lifting is a transformation that increases the number of factors. This may be useful if $\epsilon$ exhibits some structure that can be reconsolidated as a factor:</p>
\[\epsilon = Ag + \epsilon' \implies r = Bf + Ag + \epsilon'\]
<p>To add the new factors $g$ and loadings $A$, we can just increase the length of the vector $f$ to include $g$ as the last few elements, and append the columns of $A$ to the right of $B$ to get the loading matrix $[B \; A]$.</p>
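<p>In matrix terms, the new loadings sit side by side with the old ones, since both have one row per asset. A minimal sketch with placeholder data:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 2, 1
B, f = rng.normal(size=(n, m)), rng.normal(size=m)  # original factors
A, g = rng.normal(size=(n, k)), rng.normal(size=k)  # structure found in eps
eps_new = rng.normal(scale=0.01, size=n)            # what is left over

B_lifted = np.hstack([B, A])       # n x (m + k): append the columns of A
f_lifted = np.concatenate([f, g])  # length m + k: append g to f

r = B @ f + A @ g + eps_new
print(np.allclose(r, B_lifted @ f_lifted + eps_new))
```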
<h3 id="log-returns-covariance-transformation">Log-returns covariance transformation</h3>
<p>If given our log-returns covariance matrix, then how do we know our actual returns’ covariance matrix? Let the returns be $R \in \mathbb{R}^n, R_i = e^{r_i}, r = log(\frac{p_t}{p_{t-1}}) \sim \mathcal{N}(\mu_r, \Sigma_r)$. Here we want to calculate our actual return covariance matrix $\Sigma_R$ of the random vector $R$, which is returns (not logarithmic returns).</p>
<p>First, let’s denote $\Sigma := \Sigma_r$, as it’s unambiguous when defining $\Sigma_R$. For those uninterested in the derivation, the answer is:</p>
\[(\Sigma_R)_{ij} = exp(\frac{\Sigma_{ii} + \Sigma_{jj}}{2} + \mu_i + \mu_j) [exp(\Sigma_{ij}) - 1]\]
<details class="red">
<summary class="collapse"><strong>For those who are interested in the derivation</strong></summary>
<p>We know that $(\Sigma_R)_{ij}$ is the covariance of the i-th and j-th random variables of $R$. It can be formally expressed as:</p>
\[(\Sigma_R)_{ij} = cov(exp(r_i), exp(r_j)) = E(exp(r_i)exp(r_j)) - E(exp(r_i))E(exp(r_j))\]
<p>So there are two main quantities we want to find here. Let’s tackle the easiest one first: $E(exp(r_i))$. For sake of simplicity, let’s denote $x := r_i$ and $\mu := E(x) = \mu_i, \sigma^2 := Var(x) = \Sigma_{ii}$.</p>
\[E(exp(x)) = \int_{-\infty}^{\infty} exp(x)p(x)dx = \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}exp(x)exp(-\frac{(x-\mu)^2}{2\sigma^2})dx\]
<p>As we can see here, $e^x$ can be thought of as performing a shift and a scaling to the original pdf of the gaussian:</p>
\[\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}exp(\frac{2\sigma^2x - (x-\mu)^2}{2\sigma^2})dx\]
<p>We can expand the quantity and complete the square on top to get:</p>
\[2\sigma^2x - (x-\mu)^2 = -[x^2 - 2x(\mu + \sigma^2) + \mu^2] = -(x-(\sigma^2+\mu))^2 + \sigma^4 + 2\mu\sigma^2\]
<p>Using this quantity on the numerator, we can refactor the equation to look like:</p>
\[\frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}exp(\frac{2\sigma^2x - (x-\mu)^2}{2\sigma^2})dx = \frac{exp(\mu + \frac{\sigma^2}{2})}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}exp(\frac{-(x - (\sigma^2+\mu))^2}{2\sigma^2})dx = exp(\mu + \frac{\sigma^2}{2})\]
<p>So looking back at what we substituted, we get the quantity:</p>
\[E(exp(r_i))E(exp(r_j)) = exp(\mu_i + \mu_j + \frac{\Sigma_{ii} + \Sigma_{jj}}{2})\]
<p>Now let’s look at the slightly harder quantity $E(exp(r_i)exp(r_j))$. We can solve the expectation in a similar way, but this time with matrix version of completion of the square. I didn’t want to reprove completion of the square in matrix form so I just looked up the equation <a href="https://gregorygundersen.com/blog/2019/09/18/completing-the-square/">here</a>:</p>
\[x^TMx - 2b^Tx = (x-M^{-1}b)^TM(x-M^{-1}b) - b^TM^{-1}b \tag{1}\]
<p>The pdf of a multivariate gaussian random variable $r \sim \mathcal{N}(\mu, \Sigma)$ is:</p>
\[f(r) = \frac{1}{\sqrt{(2\pi)^n|\Sigma|}}exp(\frac{-1}{2} (r-\mu)^T\Sigma^{-1}(r-\mu))\]
<p>The expectation we’re trying to solve for can be re-written as:</p>
\[E(exp(r_i)exp(r_j)) = E(exp(r_i+r_j))\]
<p>Here, $r_i$ and $r_j$ are scalar random variables, so in order to incorporate it into equation $(1)$ we make the quantity $r_i + r_j = e_{ij}^Tr$ where $e_{ij}^T = (0 \text{…} 1 \text{…} 1 \text{…} 0)$ - a vector with 1’s on the i-th and j-th indices. We get this:</p>
\[\int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^n|\Sigma|}}exp(\frac{-[(r-\mu)^T\Sigma^{-1}(r-\mu) - 2e^T_{ij}(r-\mu) - 2e^T_{ij}\mu]}{2})\]
<p>Note that we did some extra work to express the last term in the form of $r-\mu$. We get the following completion by setting $x = r-\mu, M = \Sigma^{-1}, b = e_{ij}$:</p>
\[((r-\mu)-\Sigma e_{ij})^T\Sigma^{-1}((r-\mu) - \Sigma e_{ij}) - e_{ij}^T\Sigma e_{ij} \\
= (r-(\mu+\Sigma e_{ij}))^T\Sigma^{-1}(r-(\mu+\Sigma e_{ij})) - e_{ij}^T\Sigma e_{ij}\]
<p>The completed square with the remaining $-2e^T_{ij}\mu$ term can be tucked back into the pdf to be integrated to 1 with $\mu' := \mu + \Sigma e_{ij}$, so we get:</p>
\[\int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^n|\Sigma|}}exp(\frac{-1}{2} [(r-\mu')^T\Sigma^{-1}(r-\mu') - e_{ij}^T\Sigma e_{ij} - 2e^T_{ij}\mu]) \\
= exp(\frac{e^T_{ij}\Sigma e_{ij} + 2e^T_{ij}\mu}{2})\]
<p>We’re almost there. The quadratic form of the above expression can be evaluated to:</p>
\[tr(e^T_{ij}\Sigma e_{ij}) = tr(e_{ij}e^T_{ij}\Sigma) = \Sigma_{ii} + \Sigma_{ij} + \Sigma_{ji} + \Sigma_{jj} = \Sigma_{ii} + 2\Sigma_{ij} + \Sigma_{jj}\]
<p>We can cyclically permute traces, so we rearrange to get the outer product of $e_{ij}e^T_{ij}$ with $\Sigma$. This essentially sums up all elements in that block, and the last equality is due to symmetry. Now the entire expression is:</p>
\[E(exp(r_i)exp(r_j)) - E(exp(r_i))E(exp(r_j)) \\
= [exp(\frac{\Sigma_{ii} + 2\Sigma_{ij} + \Sigma_{jj} + 2e^T_{ij}\mu}{2})] - [exp(\mu_i + \mu_j + \frac{\Sigma_{ii} + \Sigma_{jj}}{2})] \\
= [exp(\frac{\Sigma_{ii} + \Sigma_{jj}}{2} + \mu_i + \mu_j + \Sigma_{ij} )] - [exp(\mu_i + \mu_j + \frac{\Sigma_{ii} + \Sigma_{jj}}{2})] \\
= exp(\frac{\Sigma_{ii} + \Sigma_{jj}}{2} + \mu_i + \mu_j) [exp(\Sigma_{ij}) - 1]\]
</details>
<details class="red">
<summary class="collapse"><strong>Here’s some sanity checking for the cynical</strong></summary>
<p>We first create a randomly generated correlation matrix for $r$ and draw samples of $r$ from it. We then calculate the matrix $\Sigma_R$ from an estimated $\Sigma_r$ as per the equation above, and compare that to the covariance matrix calculated by directly exponentiating each sample of $r$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">random_correlation</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sb</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">num_samples</span> <span class="o">=</span> <span class="mi">100000</span>
<span class="n">dimensions</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">scale</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">()</span>
<span class="n">random_eigs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">dimensions</span><span class="p">)])</span>
<span class="c1"># Normalize the random eigs so it all sums to N
</span><span class="n">random_eigs</span> <span class="o">=</span> <span class="n">random_eigs</span> <span class="o">/</span> <span class="n">random_eigs</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">*</span> <span class="n">dimensions</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">random_correlation</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">random_eigs</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="n">rng</span><span class="p">)</span>
<span class="n">means</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">scale</span><span class="o">=</span><span class="n">scale</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">dimensions</span><span class="p">)</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">means</span><span class="p">,</span> <span class="n">cov</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">num_samples</span><span class="p">)</span>
<span class="n">exp_samples</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">estimated_cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">estimated_mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">from_log</span><span class="p">(</span><span class="n">samples</span><span class="p">):</span>
    <span class="n">estimated_cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">estimated_mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">estimated_cov</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">cov</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">cov</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
            <span class="n">s_ii</span><span class="p">,</span> <span class="n">s_jj</span><span class="p">,</span> <span class="n">s_ij</span> <span class="o">=</span> <span class="n">estimated_cov</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">],</span> <span class="n">estimated_cov</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">j</span><span class="p">],</span> <span class="n">estimated_cov</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
            <span class="n">m_i</span><span class="p">,</span> <span class="n">m_j</span> <span class="o">=</span> <span class="n">estimated_mean</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">estimated_mean</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
            <span class="n">cov</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">((</span><span class="n">s_ii</span> <span class="o">+</span> <span class="n">s_jj</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">m_i</span> <span class="o">+</span> <span class="n">m_j</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">s_ij</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">cov</span>
<span class="c1"># NOTE: These are two separate samples from the same distribution, so slight differences
# are expected. You can also check their frobenius norm via np.linalg.norm(x-y)
</span><span class="n">sb</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">from_log</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"Blues"</span><span class="p">)</span>
<span class="n">sb</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">exp_samples</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"Blues"</span><span class="p">)</span>
</code></pre></div> </div>
<p>Here are the two graphs (which should look identical):</p>
<p><strong>Calculated covariance from $R$</strong>:</p>
<p><img src="http://oneraynyday.github.io/assets/portfolio_calculated_covariance.png" alt="portfolio_calculated_covariance.png" /></p>
<p><strong>Derived covariance from $\Sigma_r$</strong>:</p>
<p><img src="http://oneraynyday.github.io/assets/portfolio_derived_covariance.png" alt="portfolio_derived_covariance.png" /></p>
</details>
<h2 id="portfolio-optimization">Portfolio Optimization</h2>
<p>With a certain risk threshold in mind, we can optimize our portfolio to maximize returns. One way of formalizing this is to maximize expected returns over the holdings $w$:</p>
\[\text{max}_w \alpha^Tw \\
\text{s.t.} w^T\Sigma_r w \leq \sigma^2\]
<p>Where $\alpha := E(r), \Sigma_r := Var(r)$.</p>
<p>Formulated with Lagrange multipliers, we get an objective function:</p>
\[V(w) = \alpha^Tw - \frac{\lambda}{2}w^T\Sigma_r w\]
<p>You can think of $V(w)$ as the “value” of your portfolio given the expected returns and covariance. This is a concave function with a single extremum (the maximum, which we’re trying to find), so we can set the gradient to zero and solve for $w$:</p>
\[\nabla_w V(w^*) = \alpha - \lambda \Sigma_r w^* = 0 \\
\implies w^* = \frac{1}{\lambda}\Sigma_r^{-1}\alpha\]
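<p>Here is a minimal numerical sketch of that closed form. The expected returns, covariance matrix and $\lambda$ are invented placeholders; the code solves the linear system instead of inverting $\Sigma_r$ and checks that the gradient vanishes at $w^*$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
alpha = rng.normal(0.05, 0.02, size=n)    # hypothetical expected returns
A = rng.normal(size=(n, n))
sigma_r = A @ A.T / n + 0.05 * np.eye(n)  # hypothetical return covariance
lam = 2.0                                 # risk-aversion multiplier

def value(w):
    """V(w) = alpha^T w - (lambda / 2) w^T Sigma_r w."""
    return alpha @ w - 0.5 * lam * w @ sigma_r @ w

# w* = (1 / lambda) Sigma_r^{-1} alpha, via a linear solve instead of an inverse.
w_star = np.linalg.solve(sigma_r, alpha) / lam

gradient = alpha - lam * sigma_r @ w_star
print(w_star, value(w_star), np.abs(gradient).max())
```

<p>Nudging <code>w_star</code> in any direction lowers <code>value</code>, which is what concavity promises.</p>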
<p>If we didn’t care about the covariance term, $w^*$ would be an arbitrarily large vector in the span of $\alpha$. However, because we are aware that some assets are more volatile than others and we only have so much room for risk in our portfolio, we “rotate” our holdings so that they maximize our total returns on a particular risk contour (in this case, shaped like an N-dimensional ellipsoid).</p>
<p>Unfortunately, this is actually <em>not</em> a great equation to optimize on because $\alpha$ and $\Sigma$ are estimates of the true expected returns and covariance matrices. I haven’t found a rigorous proof for this (comment below if you have any sources!), but my intuition is that this particular solution of $w^*$ is particularly sensitive to the change in the covariance matrix and expected returns (I’m guessing moreso the covariance matrix as it’s being inverted) which makes the solution hard to generalize.</p>
<h3 id="robust-er-formulation">Robust-er formulation</h3>
<p>Instead of taking the estimated $\alpha$ and $\Sigma$ as constants, we can model $\alpha \sim \mathcal{N}(\mu_\alpha, \Sigma_\alpha)$, and introduce an additional gaussian noise term $\epsilon$ to the optimization:</p>
\[V(w) = \mu_\alpha^Tw - \frac{\lambda}{2}w^T(\Sigma_\alpha+\Sigma_r) w\]
<p>Maximizing this gives us:</p>
\[w^* = \frac{1}{\lambda}(\Sigma_\alpha + \Sigma_r)^{-1}\mu_\alpha\]
<p>A (computationally) efficient assumption of $\alpha$’s distribution is that these random variables are uncorrelated, which means $\Sigma_\alpha$ is actually a diagonal matrix. If we assume the assets have similar volatility, this solution starts to look like <a href="https://en.wikipedia.org/wiki/Ridge_regression">ridge regression</a>.</p>
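<p>To see the ridge-like behaviour, the sketch below takes the simplest diagonal choice, $\Sigma_\alpha = \tau I$ (my simplification, with $\tau$ a made-up uncertainty level), and shows how the optimal holdings shrink as the uncertainty about $\alpha$ grows. All inputs are placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mu_alpha = rng.normal(0.05, 0.02, size=n)  # hypothetical mean of alpha
A = rng.normal(size=(n, n))
sigma_r = A @ A.T / n + 0.05 * np.eye(n)   # hypothetical return covariance
lam = 2.0

def robust_weights(tau):
    """w* with Sigma_alpha = tau * I, i.e. equal, uncorrelated alpha uncertainty."""
    sigma_alpha = tau * np.eye(n)
    return np.linalg.solve(sigma_alpha + sigma_r, mu_alpha) / lam

for tau in (0.0, 0.1, 1.0):
    print(tau, np.linalg.norm(robust_weights(tau)))
```

<p>With <code>tau = 0</code> this reduces to the plain mean-variance solution; larger values act like a ridge penalty and pull the holdings towards zero.</p>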
<hr />
<p>Note that the solutions to the above formulations only take care of optimizing $w$ for a single time interval, and the real problem lies in optimizing:</p>
\[\text{max}_{w_1, w_2, ..., w_T} \sum_{t=1}^T V(w_t)\]
<p>which is a much harder trajectory problem, requiring heavier hammers like the <a href="https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator">linear-quadratic regulator</a>, which is commonly used in control theory and reinforcement learning.</p>
<h2 id="performance-attribution">Performance Attribution</h2>
<p>When we look at PnL and attribution to factor returns or idiosyncratic return, we can decompose it as follows for a given time:</p>
\[\text{pnl} = \sum_t\text{pnl}_t \\
= \sum_t w_t^Tr_t \\
= \sum_t w_t^T(B_tf_t) + \sum_tw_t^T\epsilon_t := \sum_t b_t^Tf_t + \sum_tw_t^T\epsilon_t \\
\approx \text{factor pnl}+\text{idiosyncratic pnl}\]
<p>Where $b_t = B_t^Tw_t \in \mathbb{R}^m$ is the factor exposure of the portfolio. <strong>For a systematic investor, we care about factor pnl, and for a fundamental investor, we care about the idiosyncratic pnl</strong>. This risk attribution equation seems good, but it’s a trap - $f_t$ and $\epsilon_t$ are estimated! If we suppose $f'_t, \epsilon'_t$ are the true factor returns and idiosyncratic returns, then we have accrued $\sum_t b_t^T(f_t - f'_t)$ error in our factor pnl, and similarly for the idiosyncratic term. If we assume for simplicity that our position $w$ doesn’t change, and that the errors $f_t - f'_t \sim \mathcal{N}(\mu, \sigma^2)$ are independent in time, summing gaussians gives us linearly increasing variance, which is catastrophic to our error bounds if we are estimating cumulative returns over long intervals. This is to say that, much like the portfolio optimization problem, performance attribution becomes a harder problem when we’re dealing with a long time horizon.</p>
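<p>The attribution identity itself is just bookkeeping, which the following sketch verifies on simulated data. Every array here is a random placeholder, and the factor and idiosyncratic returns are treated as known, so this deliberately ignores the estimation error discussed above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, m = 20, 6, 2
B = rng.normal(size=(T, n, m))             # loadings B_t (hypothetical)
f = rng.normal(scale=0.01, size=(T, m))    # factor returns f_t
eps = rng.normal(scale=0.02, size=(T, n))  # idiosyncratic returns eps_t
w = rng.normal(size=(T, n))                # holdings w_t

r = np.einsum("tnm,tm->tn", B, f) + eps    # r_t = B_t f_t + eps_t
b = np.einsum("tnm,tn->tm", B, w)          # exposures b_t = B_t^T w_t

total_pnl = np.sum(w * r)
factor_pnl = np.sum(b * f)
idio_pnl = np.sum(w * eps)
print(total_pnl, factor_pnl + idio_pnl)
```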
<h2 id="stress-tests">Stress Tests</h2>
<p>You’ve probably heard of 3-$\sigma$ events. What does that exactly mean? First, we need to define <strong>volatility</strong> of a portfolio:</p>
\[\sigma = \sqrt{w^T\Sigma_rw}\]
<p>which is the standard deviation of returns. If we assume that returns are independent over time T, we get:</p>
\[\sigma_T = \sqrt{\sum_i^Tw_i^T\Sigma_{r_i}w_i}\]
<p>If the volatility is fairly constant (which is usually the case when we’re looking at sub-month intervals), some people just abbreviate this to:</p>
\[\sigma_T \approx \sigma\sqrt{T}\]
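<p>A small sketch of the square-root-of-time scaling, assuming constant holdings and a constant daily covariance matrix. The covariance, weights and the 21-day horizon are placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 21                                       # assets, trading days (hypothetical)
A = rng.normal(size=(n, n))
sigma_r = (A @ A.T / n + 0.1 * np.eye(n)) * 1e-4   # daily return covariance
w = np.array([0.4, 0.3, 0.2, 0.1])                 # constant holdings

sigma_daily = np.sqrt(w @ sigma_r @ w)
# General form: sum the per-period variances, then take the square root.
sigma_T = np.sqrt(sum(w @ sigma_r @ w for _ in range(T)))
print(sigma_T, sigma_daily * np.sqrt(T))
```

<p>When holdings or covariances change from period to period, the two numbers diverge and the general sum is the one to trust.</p>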
<p>A 3-$\sigma$ event is when your observed event (returns) is 3 times the volatility away from the mean. This could mean we made a ton of money or lost a ton of money. You may run into 3-$\sigma$ events more often than expected, and this isn’t only due to the imperfect art of modelling returns. By definition, the $\Sigma_r$ matrix is fixed in the calculation, but this isn’t true in practice. In a catastrophic event (e.g. Elon Musk tweets the world is going to end in 5 days), the correlation across assets suddenly approaches 1 and the condition number of the covariance matrix skyrockets (the higher the condition number, the more sensitive the estimated returns are to small perturbations in $\Sigma_r$ or $w$). A better way to model an extreme event is to calculate the worst-case transformation of $\Sigma_r$ under stress:</p>
\[\max_M \sqrt{w^TMw} \\
\text{s.t.} ||\Sigma_r - M||_F < C\]
<p>Here $M$ is the stressed covariance matrix and $C$ bounds how far it may drift from $\Sigma_r$ in Frobenius norm. Lots of stress evaluation metrics such as <a href="https://en.wikipedia.org/wiki/Value_at_risk">value-at-risk</a> can use the covariance matrix to estimate potential losses.</p>
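<p>If we relax the strict inequality to $\leq$ and put no further constraints on $M$ (both are my assumptions), the stress problem has a closed form. By Cauchy-Schwarz on the Frobenius inner product, $w^TMw = w^T\Sigma_rw + \langle M - \Sigma_r, ww^T \rangle_F \leq w^T\Sigma_rw + C\|w\|^2$, and the bound is attained at $M = \Sigma_r + C\frac{ww^T}{\|w\|^2}$. A small sketch with hypothetical numbers:</p>

```python
import math


def quad_form(w, m):
    """w^T m w for a weight list and a nested-list matrix."""
    n = len(w)
    return sum(w[i] * m[i][j] * w[j] for i in range(n) for j in range(n))


def worst_case_covariance(w, cov, budget):
    """M = Sigma + budget * w w^T / ||w||^2, the maximiser under ||Sigma - M||_F <= budget."""
    norm_sq = sum(x * x for x in w)
    n = len(w)
    return [[cov[i][j] + budget * w[i] * w[j] / norm_sq for j in range(n)] for i in range(n)]


def frobenius_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


w = [0.6, 0.4]                                  # hypothetical positions
cov = [[0.0004, 0.0001], [0.0001, 0.0009]]      # hypothetical covariance estimate
budget = 0.0005                                 # hypothetical stress budget C
stressed = worst_case_covariance(w, cov, budget)
print(math.sqrt(quad_form(w, cov)), math.sqrt(quad_form(w, stressed)))
```

<p>The perturbation $C\frac{ww^T}{\|w\|^2}$ is positive semi-definite, so the stressed matrix is still a valid covariance matrix in this relaxed setting.</p>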
<h2 id="conclusion">Conclusion</h2>
<p>Portfolio modelling’s foundations rest on handcrafted factors, domain knowledge, and lots of linear regression. We first defined what a return is, and moved on to modelling the returns using factor models. Then, we mentioned a few ways to formulate the modelling as an optimization problem so we can find the best positions for our portfolio. Lastly, we looked at how to attribute money gained/lost and the worst case scenarios that may occur. There’s a lot of stuff I didn’t yet mention (e.g. performance attribution) which I might include in a subsequent post, but I hope this survey into the different parts of risk/portfolio modelling helps!</p>
<p>Portfolio Risk ELI5 (2022-01-22, http://oneraynyday.github.io/math/2022/01/22/Portfolio-Risk-ELI5)</p>
<p>It’s been a while since I posted! This ELI5 article is meant for those who are curious about portfolio management but don’t know anything about portfolio construction. Here, I try to cover a breadth of topics rather than going very deep in one. I also explicitly oversimplify some concepts here to make it easier to digest, and we’ll try to dispel some of these lies in a later blog.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#problem-statement---make-some-" id="markdown-toc-problem-statement---make-some-">Problem Statement - Make some $$$</a></li>
<li><a href="#modelling-for-returns" id="markdown-toc-modelling-for-returns">Modelling for Returns</a> <ul>
<li><a href="#the-avocado-alpha" id="markdown-toc-the-avocado-alpha">The avocado alpha</a></li>
<li><a href="#two-tales-of-zero-beta-portfolios" id="markdown-toc-two-tales-of-zero-beta-portfolios">Two tales of zero beta portfolios</a></li>
<li><a href="#hedging" id="markdown-toc-hedging">Hedging</a></li>
<li><a href="#multi-factor-model" id="markdown-toc-multi-factor-model">Multi-factor model</a></li>
<li><a href="#investing-in-high-beta-stocks" id="markdown-toc-investing-in-high-beta-stocks">Investing in high beta stocks</a></li>
</ul>
</li>
<li><a href="#evaluating-performance" id="markdown-toc-evaluating-performance">Evaluating Performance</a></li>
<li><a href="#attributing-performance" id="markdown-toc-attributing-performance">Attributing Performance</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
</li>
</ul>
<h2 id="problem-statement---make-some-">Problem Statement - Make some $$$</h2>
<p>What does it mean to “win” at the stock market? Do you want to become a bitcoin millionaire and retire early, or do you want to gradually size up a diverse portfolio that gives you performance marginally better than the S&P500 with relatively little risk, and ride the exponential growth until you retire?</p>
<p>To a greater extent, portfolio management tries to solve the latter problem. Most portfolio managers accrue large positions with leverage and try to beat the broader market in returns in order to pay themselves and grow their clients’ net worth. How do they do this?</p>
<h2 id="modelling-for-returns">Modelling for Returns</h2>
<p>You might have heard of the terms “alpha” and “beta” used in a financial context before. They’re terms used in a family of models that portfolio managers use, and are the main subjects of research in a typical hedge fund.</p>
<h3 id="the-avocado-alpha">The avocado alpha</h3>
<p>Trading firms have access to a vast amount of data which allows them to identify specific high-performing <strong>alphas</strong> (denoted $\alpha$) which are also called <em>expected returns</em>. You can think of high-performing alphas as a piece of predictive knowledge that allows someone to construct a portfolio with greater returns than the market index. Imagine you discovered an imminent earthquake is about to hit California’s central valley and destroy all the avocado farms. The price of avocados would spike shortly after, but if you beat everyone to the punch, you would buy a ton of avocados<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and sell them when the supply is low (and consequently when the price is high).</p>
<p>If you discovered an avocado alpha, you wouldn’t want to tell others until you’ve bought enough avocados to make a sizeable profit. This is why trading firms are so secretive with how they determine their investments.</p>
<h3 id="two-tales-of-zero-beta-portfolios">Two tales of zero beta portfolios</h3>
<p>Another factor trading firms consider is their exposure to <strong>beta</strong> (denoted $\beta$), which is their relative risk explained by the broader market. However, no creative portfolio moves perfectly in line with the market, so we explain the rest of the risk with <strong>idiosyncratic return</strong> (denoted $\epsilon$), the fluctuations in returns not explained by the broader market.</p>
<p>If your portfolio has a beta greater than 1, it means when the market value has increased by 1%, your portfolio’s value increases by more than 1%. Similarly, if the market value has gone down by 1%, the portfolio value has also gone down by more than 1%. If your portfolio has a positive beta less than 1, it means your portfolio’s returns are less positive and less negative compared to the market’s ups and downs. What happens if your beta is zero?</p>
<p>When the beta is zero, it means one of two things: your portfolio is not correlated with the market at all and/or your stocks have constant return. With /r/wallstreetbets making some crazy moves in the past year, the prices of GME and AMC do not move according to the broader market. When the broader market value has increased by 1%, GME goes up by 69% and AMC decreases by 420% - it’s hard to draw correlations between the performance of the market and that of the meme stocks. This means there is a small beta (we can consider it zero) but a large <em>idiosyncratic return</em> which explains its volatility. In the other case, we really indeed have zero volatility - you can think of this as putting your money in the bank to accrue interest at a rate locked in for the next few years. Your returns are fairly constant and they don’t change depending on the stock market.</p>
<p>Our observations can be formalized in the following way:</p>
<p><strong>Returns for single stock:</strong></p>
\[r = \alpha + \beta m + \epsilon \quad \alpha,\beta,\epsilon \in \mathbb{R}\\\]
<p><strong>Returns for multiple stocks:</strong></p>
\[r = \sum_{i=1}^N \left(\alpha_i + \beta_im + \epsilon_i\right)\]
<p>where $r$ is the return of the portfolio and $m$ is the market return. Since we only have a single correlated factor to our portfolio in this equation (being $m$), we call this a (linear) <strong>single-factor model</strong>.</p>
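<p>One common way to put numbers on $\alpha$ and $\beta$ (not necessarily what any particular firm does) is an ordinary least squares fit of a stock’s returns against the market’s returns. The sketch below uses invented return series in which the stock is exactly $0.001 + 1.5m$, so the fit should recover those values with zero residuals.</p>

```python
import statistics


def fit_single_factor(stock_returns, market_returns):
    """Least-squares fit of r = alpha + beta * m + epsilon."""
    mean_r = statistics.fmean(stock_returns)
    mean_m = statistics.fmean(market_returns)
    cov_rm = sum((r - mean_r) * (m - mean_m) for r, m in zip(stock_returns, market_returns))
    var_m = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov_rm / var_m
    alpha = mean_r - beta * mean_m
    # Whatever the market does not explain is the idiosyncratic return.
    residuals = [r - alpha - beta * m for r, m in zip(stock_returns, market_returns)]
    return alpha, beta, residuals


market = [0.01, -0.02, 0.015, 0.003, -0.007]   # hypothetical market returns
stock = [0.001 + 1.5 * m for m in market]      # hypothetical stock returns
alpha, beta, residuals = fit_single_factor(stock, market)
print(alpha, beta)
```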
<h3 id="hedging">Hedging</h3>
<p>You’ve heard the word in “hedge funds”, “hedge your bets”, etc. Applied to the context of our portfolio, it means we want to remove our exposure to risk in some aspect. In the above model, we can hedge the market. In the case of a catastrophic correlated crash in the market, we want our trading firm to continue existing<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. To do this, we want our portfolio to have a beta of 0.</p>
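<p>A minimal sketch of what a beta of 0 buys us, assuming we can short an instrument that tracks the market one-for-one (my assumption, with made-up numbers): the hedge notional is $-\beta$ times the portfolio value, and the hedged pnl no longer depends on the market return.</p>

```python
def market_hedge_notional(portfolio_value, portfolio_beta):
    """Notional to hold in a beta-1 market instrument so the combined beta is 0."""
    return -portfolio_beta * portfolio_value


def hedged_pnl(portfolio_value, alpha, beta, market_return, idio_return):
    """PnL of the portfolio plus its market hedge under the single-factor model."""
    unhedged = portfolio_value * (alpha + beta * market_return + idio_return)
    hedge = market_hedge_notional(portfolio_value, beta) * market_return
    return unhedged + hedge


crash = hedged_pnl(1_000_000, alpha=0.001, beta=1.2, market_return=-0.20, idio_return=0.002)
rally = hedged_pnl(1_000_000, alpha=0.001, beta=1.2, market_return=0.20, idio_return=0.002)
print(crash, rally)  # both are driven only by alpha and the idiosyncratic return
```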
<h3 id="multi-factor-model">Multi-factor model</h3>
<p>We just considered the single factor model to describe the characteristics of return for our portfolio. We can do better by decomposing the market into a bunch of smaller factors. To be precise, we can model it as follows:</p>
<p><strong>Returns for single stock:</strong></p>
\[r = \alpha + \beta^Tf + \epsilon \quad\alpha,\epsilon \in \mathbb{R}, \beta, f\in \mathbb{R}^m\]
<p><strong>Returns for multiple stocks:</strong></p>
\[r = \sum_{i=1}^N \left(\alpha_i + \beta_i^Tf + \epsilon_i\right)\]
<p>where $\beta$ is the vector of <strong>loadings</strong> of a stock on the vector of factors $f$, which is shared by all stocks. Selecting the factors that best explain the returns of a portfolio is an art, and there are many ways to do it.</p>
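<p>To make the notation concrete, here is a small sketch with two hypothetical factors (the factor names in the comments, the loadings and the returns are all invented). It evaluates the single-stock equation and aggregates the loadings of two stocks into a portfolio-level exposure.</p>

```python
def stock_return(alpha, loadings, factor_returns, idio):
    """r = alpha + beta^T f + epsilon for one stock."""
    return alpha + sum(b * f for b, f in zip(loadings, factor_returns)) + idio


def portfolio_exposure(weights, loadings_by_stock):
    """Weighted sum of each stock's loadings, one entry per factor."""
    num_factors = len(loadings_by_stock[0])
    return [sum(w * l[k] for w, l in zip(weights, loadings_by_stock))
            for k in range(num_factors)]


factor_returns = [0.01, -0.02]             # e.g. "market" and "momentum" (hypothetical)
r = stock_return(0.001, [1.1, 0.3], factor_returns, idio=0.002)
exposure = portfolio_exposure([0.5, 0.5], [[1.0, 0.2], [0.6, -0.4]])
print(r, exposure)
```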
<h3 id="investing-in-high-beta-stocks">Investing in high beta stocks</h3>
<p>You would think that by investing in high beta stocks like TQQQ or some crazy growth stocks, you would have higher expected returns. This is actually shown historically to be incorrect! If we consider beta itself to be a factor in a multi-factor model, we will often times find the expected return of the beta factor to be negative! This means exposure to more volatile stocks has historically led to worse performance. Another reason to dump your money in S&P500 and forget about it, I guess.</p>
<h2 id="evaluating-performance">Evaluating Performance</h2>
<p>There are many ways to think about the performance of a portfolio. One simple way is to just consider the expected returns of a portfolio, but that would fail to account for the potential volatility that may wipe out your firm. Most of the time, people use <strong>Sharpe ratios</strong> to evaluate their performance, which is roughly:</p>
\[S = \frac{E(r) - R_b}{\sigma_r}\]
<p>where $R_b$ is some risk free return (e.g. putting it in a bank account to accrue interest) and $\sigma_r$ is the volatility of our portfolio. If we have high returns but we’re living in <a href="https://en.wikipedia.org/wiki/Hyperinflation_in_Zimbabwe">Zimbabwe</a> and using Zimbabwean currency to measure our performance, maybe it’s because of the hyperinflation, not because we’re good portfolio managers. This corresponds to an extremely high risk free return, potentially making our Sharpe ratio negative. If we’re a US-based hedge fund that only invests in meme stocks, returns may be very high but the volatility would be so high that no institutional investors would like to work with us. This corresponds to a very high volatility, making the Sharpe ratio smaller.</p>
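<p>Both stories show up if we compute the ratio on made-up return series. In this sketch the sample mean and sample standard deviation stand in for $E(r)$ and $\sigma_r$, and no annualisation is applied.</p>

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """(mean return - risk free return) / sample standard deviation of returns."""
    return (statistics.fmean(returns) - risk_free) / statistics.stdev(returns)


steady = [0.010, 0.012, 0.008, 0.011, 0.009]   # hypothetical low-volatility returns
meme = [0.30, -0.25, 0.40, -0.35, 0.05]        # hypothetical meme-stock returns

print(sharpe_ratio(steady))                    # high: small but reliable returns
print(sharpe_ratio(meme))                      # low: higher mean, far higher volatility
print(sharpe_ratio(steady, risk_free=0.02))    # negative: risk free rate beats us
```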
<p>One common way to size our positions for our portfolio after we’ve established a reasonable $\alpha$ (expected returns) and $\sigma^2$ (variance) is by <strong>proportion of alpha</strong>:</p>
\[\text{position} \propto \alpha\]
<p>which seems very simple. The more money we expect to make or lose, the bigger the magnitude we go long or short, scaled linearly. One may think this is myopic and should take into account the volatility of the asset, so some have tried <strong>mean variance</strong>:</p>
\[\text{position} \propto \alpha/\sigma^2\]
<p>However, it has a couple of shortcomings and has a worse simulated Sharpe ratio compared to the dumb proportion strategy. Without getting into the details, the reason mean variance fails is that the estimation error of the expected returns can cause drastic changes to the positions of a portfolio, which is detrimental to performance.</p>
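<p>The two sizing rules can be sketched side by side. Normalising to a gross exposure of 1 is my own choice to make the outputs comparable, and the alphas and variances are invented.</p>

```python
def normalize(raw):
    """Scale raw position sizes so the absolute values sum to 1."""
    gross = sum(abs(x) for x in raw)
    return [x / gross for x in raw]


def proportional_sizing(alphas):
    """position proportional to alpha"""
    return normalize(alphas)


def mean_variance_sizing(alphas, variances):
    """position proportional to alpha / sigma^2"""
    return normalize([a / v for a, v in zip(alphas, variances)])


alphas = [0.02, -0.01, 0.01]        # hypothetical expected returns
variances = [0.04, 0.01, 0.01]      # hypothetical return variances
print(proportional_sizing(alphas))
print(mean_variance_sizing(alphas, variances))
```

<p>Notice how the mean variance rule shrinks the first position because of its high variance, so the positions depend on both estimates rather than on $\alpha$ alone.</p>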
<h2 id="attributing-performance">Attributing Performance</h2>
<p>At the end of the day, we concretely see our PnL. How do we attribute our performance to individual factors?</p>
\[\text{PnL}(T) = \sum_{t=1}^T [\text{idioPnL}(t) + \sum_f \text{factorPnL}_f(t)]\]
<p>Usually, you’d like to plot the cumulative pnl for each factor and idiosyncratic returns to have a better idea which factors are performing well and which ones aren’t. If some factors are doing badly, you would want to hedge more against the factors. If your idiosyncratic returns are doing badly, then that requires some more thought.</p>
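<p>The cumulative curves are just running sums of each term in the equation. A minimal sketch, with hypothetical factor names and per-period pnl numbers:</p>

```python
from itertools import accumulate


def cumulative_attribution(idio_pnl, factor_pnl):
    """Running sums per factor and for the idiosyncratic term, plus the total.

    `factor_pnl` maps a factor name to its list of per-period pnl.
    """
    curves = {name: list(accumulate(series)) for name, series in factor_pnl.items()}
    curves["idio"] = list(accumulate(idio_pnl))
    total = [sum(curve[t] for curve in curves.values()) for t in range(len(idio_pnl))]
    return curves, total


curves, total = cumulative_attribution(
    idio_pnl=[1, -2, 3],
    factor_pnl={"market": [2, 2, -1], "momentum": [0, 1, 1]},
)
print(curves)
print(total)
```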
<p>For idiosyncratic returns, you’re basically betting on the stock market - you’re guessing which stocks will outperform the markets they belong to. Well, how do you quantify how well you’re betting on the stock market? It is roughly a combination of <strong>selection, sizing and timing</strong>, which are intrinsically harder to quantify.</p>
<h2 id="conclusion">Conclusion</h2>
<p>When it comes to building a portfolio that makes you money, there’s a whole field of mathematics powering it. What I’ve illustrated above only scratches the surface - in the subsequent posts we’ll be looking more in depth into the modelling process, performance attribution, and other fun stuff!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You would actually buy avocado <em>futures</em> - which allow cash settlement - so you wouldn’t have to sell them to make your profit. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>But it’s also important to note that if the market does really well, we don’t reap any of the benefits if we hedge! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Complex Analysis - Introduction to complex numbers and holomorphicity (2020-10-07, http://oneraynyday.github.io/math/2020/10/07/Complex-Analysis-1)</p>
<p>WIP.</p>
<p><strong>The complex numbers are defined as \(\mathbb{C} = \{ a+bi : a, b \in \mathbb{R}\}\)</strong>. We look at some basic operations on the field and dive into differentiability.</p>
<h1 id="basic-operations">Basic Operations</h1>
<p>Consider two complex numbers $z = a + bi$ and $w = c + di$ in the following:</p>
<ul>
<li>The <strong>conjugation</strong> of a complex number \(z\) is defined as \(\bar{z} = a - bi\). This is an <a href="https://en.wikipedia.org/wiki/Isomorphism#:~:text=In%20mathematics%2C%20an%20isomorphism%20is,an%20isomorphism%20exists%20between%20them.">isomorphic</a> map of \(\mathbb{C} \to \mathbb{C}\).</li>
<li>A non-zero complex number can be written uniquely in <strong>polar form</strong> as $z = re^{i\theta}$, $r > 0, \theta \in \mathbb{R}$, and $\theta$ is called the <strong>argument</strong>.</li>
<li>The <strong>norm form</strong> of a complex number $z$ is \(N(z) = a^2 + b^2\), which is the squared distance from the origin: \(N(z) = \|z\|^2\). The <strong>bilinear form</strong> of two complex numbers \(z,w\) is defined as \(\langle z, w \rangle = Re(z\bar{w}) = Re((a+bi)(c-di)) = ac + bd\).</li>
<li>The <strong>real part</strong> \(Re : \mathbb{C} \to \mathbb{R}\) of a complex number $z$ is $a$, and can be derived as \(Re(z) = \frac{z+\bar{z}}{2}\). Similarly, the <strong>imaginary part</strong> \(Im : \mathbb{C} \to \mathbb{R}\) of a complex number is \(b\) and can be derived as \(Im(z) = \frac{z - \bar{z}}{2i}\). We have the property $|Re(z)|, |Im(z)| \leq |z| \leq |Re(z)| + |Im(z)|$ (draw a triangle).</li>
<li>The <strong>addition</strong> of two complex numbers $z, w$ is defined as $z+w = (a+c) + (b+d)i$. This is a translation mapping in the real and imaginary axes.</li>
<li>The <strong>subtraction</strong> of two complex numbers $z,w$ is defined as $z-w = (a-c) + (b-d)i$.</li>
<li>The <strong>multiplication</strong> of two complex numbers \(z,w\) is defined as \(zw=(a + bi)(c+di) = ac + adi + cbi + bdi^2 = (ac - bd) + (ad + bc)i\). This is a rotation and scaling operation.</li>
<li>The <strong>division</strong> of two complex numbers $z,w$ (with $w \neq 0$) is defined as \(\frac{z}{w} = \frac{a+bi}{c+di} = \frac{a (c-di)}{(c+di)(c-di)} + \frac{bi(c-di)}{(c+di)(c-di)} = \frac{(ac + bd) + (bc - ad)i}{c^2+d^2}\). It can also be written concisely as \(z \cdot \frac{1}{w} = z \frac{\bar{w}}{N(w)} = \frac{(a+bi)(c-di)}{c^2+d^2} = \frac{(ac + bd) + (bc - ad)i}{c^2+d^2}\).</li>
</ul>
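<p>Python’s built-in <code>complex</code> type makes it easy to sanity-check these identities. The numbers below are arbitrary; the comments map them back to $a, b, c, d$.</p>

```python
import cmath

z = complex(3, 4)    # a = 3, b = 4
w = complex(1, -2)   # c = 1, d = -2

conjugate = z.conjugate()                  # a - bi
norm_form = (z * z.conjugate()).real       # N(z) = a^2 + b^2
bilinear = (z * w.conjugate()).real        # Re(z * conj(w)) = ac + bd
radius, argument = cmath.polar(z)          # z = r * e^{i * theta}
quotient = z / w
# Division written as z * conj(w) / N(w), matching the concise form above.
by_formula = z * w.conjugate() / (w * w.conjugate()).real

print(conjugate, norm_form, bilinear, radius, argument, quotient, by_formula)
```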
<h2 id="complex-exponentials">Complex exponentials</h2>
<p>This deserves its own section. We can define $e^x = \sum_{n=0}^\infty \frac{x^n}{n!}$ for $x\in \mathbb{R}$. The same series defines the exponential in $\mathbb{C}$. Recall the trigonometric functions:</p>
\[\sin(x) = \sum_{n=0}^\infty (-1)^n \frac{x^{2n+1}}{(2n+1)!} \\
\cos(x) = \sum_{n=0}^\infty (-1)^n \frac{x^{2n}}{(2n)!}\]
<p>These two look very similar, and we previously had no way of unifying these concepts together. However, plugging $ix$ into the exponential function we get an expansion:</p>
\[e^{ix} = \sum_{n=0}^\infty \frac{i^nx^n}{n!} = \cos(x) + i\sin(x)\]
<p>This is exactly why $e^{2\pi i} = 1$.</p>
<p>If we look at $e^{-ix} = \cos(-x) + i\sin(-x) = \cos(x) - i\sin(x)$, we can derive that $\sin(x) = \frac{e^{ix} - e^{-ix}}{2i}$ and \(\cos(x) = \frac{e^{ix} + e^{-ix}}{2}\), which look very similar to the <a href="https://en.wikipedia.org/wiki/Hyperbolic_functions">hyperbolic trigonometric functions</a>, which are the $\sin, \cos$ variants with the input rotated by 90 degrees in the complex plane (e.g. $\cosh(x) = \cos(ix)$). These forms are called the <strong>Euler formulas</strong>.</p>
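<p>These identities can be checked numerically with the standard <code>cmath</code> module. The helper names below are mine, and the 30-term truncation of the series is an arbitrary choice that is more than enough for small $x$.</p>

```python
import cmath
import math


def exp_series(x, terms=30):
    """Partial sum of sum_n (ix)^n / n!, which should approach e^{ix}."""
    return sum((1j * x) ** n / math.factorial(n) for n in range(terms))


def sin_from_exponentials(x):
    """sin(x) = (e^{ix} - e^{-ix}) / (2i)"""
    return ((cmath.exp(1j * x) - cmath.exp(-1j * x)) / 2j).real


def cos_from_exponentials(x):
    """cos(x) = (e^{ix} + e^{-ix}) / 2"""
    return ((cmath.exp(1j * x) + cmath.exp(-1j * x)) / 2).real


print(exp_series(1.0), complex(math.cos(1.0), math.sin(1.0)))
print(sin_from_exponentials(0.7), math.sin(0.7))
print(cmath.exp(2j * math.pi))  # 1 up to floating point rounding
```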
<h1 id="complex-differentiation">Complex differentiation</h1>
<p>Given a function $f: \mathbb{C} \to \mathbb{C}$, it is <strong>differentiable</strong> at $z_0$ if:</p>
\[\text{lim}_{h \to 0} \frac{f(z_0 + h) - f(z_0)}{h} = f'(z_0) \in \mathbb{C}\]
<p>Or, in epsilon-delta terms, it is differentiable at $z_0$ with derivative $f'(z_0)$ if $\forall \epsilon > 0\ \exists \delta > 0$ such that for all $z$ with $0 < |z-z_0| < \delta$:</p>
\[|\frac{f(z)-f(z_0)}{z-z_0} - f'(z_0)| < \epsilon\]
<p>If $f$ is differentiable everywhere in its domain, then it’s called <strong>holomorphic</strong> or <strong>complex differentiable</strong>. The reason we don’t simply call it differentiable is that holomorphicity is a <em>stronger</em> property. As opposed to a real-valued function $f: \mathbb{R} \to \mathbb{R}$ which is easy to draw and visualize, $f: \mathbb{C} \to \mathbb{C}$ is very hard to imagine. It’s essentially a mapping from a 2 dimensional space to a 2 dimensional space, requiring 4 dimensions to show the mapping. In multivariable calculus, a function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at a point if it has a <strong>jacobian</strong> matrix $\textbf{J}$ that is $m \times n$ and represents the approximate linear map at a particular point. The matrix is essentially an arbitrary linear mapping that’s comparable to the tangent slope of a scalar real-valued function. Being holomorphic in $\mathbb{C}$ is similar in concept, except it has more restrictions.</p>
<p>If we think about a scalar-valued function, we are essentially finding the <em>slope of the function</em>, which is a number in $\mathbb{R}$. Translating it to a complex-valued function, we are finding the same slope, but in $\mathbb{C}$, i.e. it’s equivalent to multiplying with a complex number. What is multiplying with a complex number? It’s equivalent to <em>a rotation and scaling action</em>. Wait… it’s equivalent to the class of rotations and scaling in $\mathbb{R}^2$, and these are linear maps! <strong>The restriction is thus the \(2\times 2\) jacobian matrix for complex numbers is of the class of rotation matrices multiplied with scaling matrices.</strong> In particular, it’s of the form:</p>
\[\begin{bmatrix}a & b \\-b & a\end{bmatrix} \in \mathbb{R}^{2 \times 2}\]
<p>… in the interpretation of $\mathbb{C}$ as the <a href="https://en.wikipedia.org/wiki/Complex_plane#Argand_diagram">Argand plane</a>.</p>
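<p>To see the correspondence concretely: multiplying $x + yi$ by $c = p + qi$ gives $(px - qy) + (qx + py)i$, which is the matrix $\begin{bmatrix}p & -q \\ q & p\end{bmatrix}$ acting on $(x, y)$. In the notation above that is $a = p$ and $b = -q$. A small sketch with arbitrary numbers:</p>

```python
def as_matrix(c):
    """2x2 real matrix that acts on (x, y) like multiplication by c."""
    return [[c.real, -c.imag],
            [c.imag, c.real]]


def apply(matrix, z):
    """Apply a 2x2 real matrix to z viewed as the vector (Re z, Im z)."""
    x, y = z.real, z.imag
    return complex(matrix[0][0] * x + matrix[0][1] * y,
                   matrix[1][0] * x + matrix[1][1] * y)


c = complex(0.5, 1.5)
z = complex(2.0, -1.0)
m = as_matrix(c)
determinant = m[0][0] * m[1][1] - m[0][1] * m[1][0]
print(apply(m, z), c * z)       # same complex number
print(determinant, abs(c) ** 2)  # the area scaling factor is |c|^2
```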
<p>Because holomorphic functions are essentially a subset of the differentiable maps $\mathbb{R}^2 \to \mathbb{R}^2$, they carry with them the equivalent laws of differentiation. In the statements below, $f, g: \mathbb{C} \to \mathbb{C}$ are complex differentiable at $z_0$.</p>
<ol>
<li>
<p><strong>Linearity</strong>: $h = f+g \implies h'(z_0) = f'(z_0) + g'(z_0)$</p>
</li>
<li>
<p><strong>Product rule</strong>: $h = fg \implies h'(z_0) = f'(z_0)g(z_0) + f(z_0)g'(z_0)$</p>
</li>
<li><strong>Quotient rule</strong>: $h = f/g \implies h'(z_0) = \frac{g(z_0)f'(z_0) - f(z_0)g'(z_0)}{g(z_0)^2}$, provided $g(z_0) \neq 0$</li>
<li><strong>Chain rule</strong>: $h = f \circ g \implies h'(z_0) = f'(g(z_0))g'(z_0)$, where $f$ is complex differentiable at $g(z_0)$</li>
</ol>
<p>One function that might be “smooth” but not holomorphic is $f(z) = \bar{z}$. Restricted to each axis, $f(Re(z)) = Re(z)$ and $f(i\,Im(z)) = -i\,Im(z)$. Both of these are differentiable in their own right, but $f$ is not holomorphic:</p>
\[\text{lim}_{h\to 0} \frac{f(z_0+h) - f(z_0)}{h} = \text{lim}_{h\to 0} \frac{\bar{h}}{h}\]
<p>When $h$ approaches 0 along the real axis, the value is 1. When $h$ approaches 0 along the imaginary axis, the value is $-1$. The limit must be the same value $f'(z_0) \in \mathbb{C}$ from every direction for the function to be considered holomorphic. In the sense that the real component is $x$ and the imaginary component is $y$, if we look at the jacobian of $f(x, y) = (u(x, y), v(x,y)) = (x, -y)$ we also see that:</p>
\[\textbf{J} = \begin{bmatrix}\frac{\partial u(x,y)}{\partial x} & \frac{\partial u(x,y)}{\partial y} \\\frac{\partial v(x,y)}{\partial x} & \frac{\partial v(x,y)}{\partial y}\end{bmatrix}= \begin{bmatrix}1 & 0 \\0 & -1\end{bmatrix}\]
<p>the mapping does not conform to the generic type of rotation matrix with scaling. The <strong>Cauchy-Riemann</strong> equations essentially state this, with the following restrictions:</p>
<ul>
<li>$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}$, which is the $a = a$ portion in the above matrix along the diagonal.</li>
<li>$\frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}$, which is the $b = -(-b)$ portion in the above matrix across from the diagonal.</li>
</ul>
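<p>The Cauchy-Riemann conditions can be checked numerically by approximating the jacobian with central finite differences. This is only a sketch: the step size <code>h</code> and the tolerance are arbitrary choices, and passing the check at one point does not prove a function is holomorphic.</p>

```python
def jacobian(f, z, h=1e-6):
    """Finite-difference jacobian [[u_x, u_y], [v_x, v_y]] of f = u + iv at z."""
    dx = (f(z + h) - f(z - h)) / (2 * h)            # u_x + i v_x
    dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # u_y + i v_y
    return [[dx.real, dy.real],
            [dx.imag, dy.imag]]


def satisfies_cauchy_riemann(f, z, tol=1e-6):
    """Check u_x = v_y and u_y = -v_x at the point z."""
    (ux, uy), (vx, vy) = jacobian(f, z)
    return abs(ux - vy) < tol and abs(uy + vx) < tol


z0 = complex(1.0, 2.0)
print(satisfies_cauchy_riemann(lambda z: z * z, z0))          # z^2 passes
print(satisfies_cauchy_riemann(lambda z: z.conjugate(), z0))  # conjugation fails
print(jacobian(lambda z: z.conjugate(), z0))                  # close to [[1, 0], [0, -1]]
```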
<h2 id="power-series">Power Series</h2>
<p>Some examples of power series are:</p>
<ul>
<li>The complex exponential: $e^z = \sum_{n=0}^\infty \frac{z^n}{n!}$ and the $sin(x)$ and $cos(x)$ functions that we’ve seen above.</li>
<li>The geometric series: $\frac{1}{1-z} = \sum_{n=0}^\infty z^n$ for $|z| < 1$.</li>
</ul>
<p>The general form of power series is in the form of:</p>
\[\sum_{n=0}^\infty a_nz^n \ \text{where } a_n, z \in \mathbb{C}\]
<p><strong>Theorem: For any power series, there exists $0 \leq R \leq \infty$ such that if $|z| < R$ the series converges absolutely and if $|z| > R$ the series diverges. This $R$ is called the radius of convergence, and is defined by $\frac{1}{R} = \text{limsup} |a_n|^{1/n}$.</strong></p>
<p><strong>Proof:</strong> <strong>Let’s take care of edge cases.</strong> Let $L = \frac{1}{R}$. If $L = 0 \implies R = \infty$, that means $|a_n|^{1/n}$ converges to 0. Then for any fixed $z$ we eventually have $|a_n|^{1/n}|z| < \frac{1}{2}$, so $|a_nz^n| < 2^{-n}$ and the series is dominated by a convergent geometric series.</p>
<p>If $L = \infty \implies R = 0$, that means there exists a subsequence of $|a_n|^{1/n}$ that diverges to infinity. For any $z \neq 0$, the terms $|a_nz^n| = (|a_n|^{1/n}|z|)^n$ along that subsequence do not tend to 0, so the series diverges, and it converges only at $z = 0$.</p>
<p><strong>Let’s take care of convergence.</strong> We want to test that the series converges absolutely if $|z| < R$:</p>
\[\sum_{n=0}^\infty |a_nz^n| \leq \sum_{n=0}^\infty |a_n||z|^n < \infty\]
<p>Then we want to bound $|a_n||z^n|$ by some geometrically decaying ratio. This ratio can be obtained in several ways, but fundamentally it’s because of the two below statements:</p>
<ul>
<li>$|z| < R \implies \exists \delta > 0, |z|+\delta < R$</li>
<li>$\forall \epsilon > 0\ \exists N \text{ such that } \forall n \geq N, |a_n|^{1/n} - \frac{1}{R} < \epsilon$ due to $\text{limsup}$ of the sequence (only this upper bound is guaranteed eventually).</li>
</ul>
<p>We start by picking some $\delta > 0$ such that $|z| + \delta R < R$, i.e. $|z| < (1-\delta)R$. For arbitrary $\epsilon$, there is a sufficiently large $N$ where for all $n \geq N, |a_n|^{1/n} - \frac{1}{R} < \frac{\epsilon}{R}$. Then, we throw away all terms with $k < N$ in the above series because finite sums are finite, and we observe:</p>
\[\sum_{n\geq N} |a_n||z|^n < \sum_{n\geq N} (\frac{1+\epsilon}{R})^n|z|^n < \sum_{n\geq N} (\frac{1+\epsilon}{R})^n((1-\delta)R)^n = \sum_{n\geq N} ((1+\epsilon)(1-\delta))^n\]
<p>We’ll choose an $\epsilon$ sufficiently small so that the ratio $(1+\epsilon)(1-\delta) < 1$. Then the series converges geometrically.</p>
<p><strong>Let’s take care of divergence.</strong> We want to show that the series diverges if $|z| > R$. In that case, we’ll modify the first statement to:</p>
<ul>
<li>$|z| > R \implies \exists \delta > 0, |z| > R + \delta$</li>
</ul>
<p>In that case, we can pick some $\delta > 0$ such that $|z| > R + \delta R$. The $\text{limsup}$ only gives us a lower bound along a subsequence: for arbitrary $\epsilon$, there are infinitely many $n$ with $|a_n|^{1/n} > \frac{1}{R} - \frac{\epsilon}{R}$. For each such $n$ we have the following inequality:</p>
\[|a_n||z|^n > (\frac{1-\epsilon}{R})^n|z|^n > (\frac{1-\epsilon}{R})^n((1+\delta)R)^n = ((1-\epsilon)(1+\delta))^n\]
<p>We’ll choose an $\epsilon$ sufficiently small so that the ratio $(1-\epsilon)(1+\delta) > 1$. Then infinitely many terms of the series exceed 1, so the terms do not tend to 0 and the series diverges. $\blacksquare$</p>
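<p>The root formula for $R$ can be explored numerically. The sketch below approximates the $\text{limsup}$ by the largest value of $|a_n|^{1/n}$ over a late, finite window of indices, which is a heuristic of mine rather than a rigorous computation; the window bounds are arbitrary.</p>

```python
def radius_estimate(coeff, start=200, stop=400):
    """Estimate R from 1/R = limsup |a_n|^(1/n) using indices start..stop-1."""
    roots = [abs(coeff(n)) ** (1.0 / n) for n in range(start, stop)]
    # The sup over a late tail is a stand-in for the limsup.
    return 1.0 / max(roots)


print(radius_estimate(lambda n: 2.0 ** n))   # a_n = 2^n, so R = 1/2
print(radius_estimate(lambda n: 1.0))        # geometric series, so R = 1
# Mixed coefficients: the limsup only cares about the largest subsequence.
print(radius_estimate(lambda n: 2.0 ** n if n % 2 == 0 else 1.0))
```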
<hr />
<p>On the disc of convergence boundary, convergence or divergence is not so easily determined, and is an interesting field of study by itself. For now, we can say some strong things about functions defined within the disc of convergence:</p>
<p><strong>Theorem: Within the disc of convergence, a power series function $f(z) = \sum_{n=0}^\infty a_nz^n$ is holomorphic with derivative $f'(z) = \sum_{n=1}^\infty n a_nz^{n-1}$.</strong></p>
<p>This proof is super tedious so… no. However, we can see that the derivative has the same radius of convergence as the original function. This is because $\text{limsup} |a_n|^{1/n} = \text{limsup} |n a_n|^{1/n}$ since the $\text{lim}_{n \to \infty} n^{1/n} = 1$ (applying sandwich theorem).</p>
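<p>We can at least check the theorem on the geometric series: differentiating $\sum z^n = \frac{1}{1-z}$ term by term should give $\sum n z^{n-1} = \frac{1}{(1-z)^2}$ for $|z| < 1$. The sketch below also shows $n^{1/n}$ creeping towards 1; the test point and the truncation at 200 terms are arbitrary choices.</p>

```python
def derivative_partial_sum(z, terms=200):
    """Partial sum of the termwise derivative of the geometric series."""
    return sum(n * z ** (n - 1) for n in range(1, terms))


def nth_root_of_n(n):
    """n^(1/n), which tends to 1 as n grows."""
    return n ** (1.0 / n)


z = complex(0.3, 0.4)                      # |z| = 0.5, inside the unit disc
print(derivative_partial_sum(z), 1 / (1 - z) ** 2)
print([nth_root_of_n(n) for n in (10, 1000, 10 ** 6)])
```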
<p>As a result of this finding, we see that <strong>power series are infinitely complex differentiable!</strong> Since power series are so important for complex analysis, we say that a function is <strong>analytic</strong> if it has a power series expansion, and thus also infinitely complex differentiable.</p>
<p>Complex Analysis - Metric spaces review (2020-10-05, http://oneraynyday.github.io/math/2020/10/05/Complex-Analysis-0)</p>
<p>This is a very informal set of notes that I’ll be maintaining while taking Terence Tao’s complex analysis class.</p>
<h1 id="fields-order-sequences">Fields, order, sequences</h1>
<p>$\mathbb{R}, \mathbb{C}$ are fields, which are sets equipped with addition and multiplication. In addition, they exhibit associativity, commutativity, identity, inverses for both addition and multiplication, as well as distributivity of multiplication over addition. $\mathbb{R}$ is an <strong>ordered field</strong>, which obeys the <a href="https://en.wikipedia.org/wiki/Ordered_field">order axioms</a> with $<$. Because $\mathbb{R}$ is an ordered field, we usually use $sup$ and $inf$ to analyze the boundedness of sequences and series. However, $\mathbb{C}$ is not an ordered field.</p>
<p>We can easily prove it using $i \in \mathbb{C}$. Due to <a href="https://en.wikipedia.org/wiki/Trichotomy_(mathematics)">trichotomy</a>, $i < 0$ or $i = 0$ or $i > 0$. $i = 0$ is trivially absurd. If $i > 0$ then the order axiom $a, b > 0 \implies ab > 0$ is violated (pick $i$ for both $a, b$). If $i < 0$ then we pick $-i > 0$ as both $a, b$ and we violate the same axiom again.</p>
<p>$\mathbb{C}$ has candidate functions to define distance, which makes it a metric space (explained in the next section). A sequence $\{z_n\}$ in $\mathbb{C}$ <strong>converges</strong> to $w \in \mathbb{C}$ if $\lim_{n \to \infty} |z_n - w| = 0$.</p>
<p>Because $\mathbb{C}$ is <strong>complete</strong> (which we’ll prove later), <strong>Cauchy</strong> sequences, which have the property:</p>
\[|z_n-z_m| \to 0 \ \text{as} \ n,m \to \infty\]
<p>must also converge to some value $w \in \mathbb{C}$. Being convergent and being Cauchy are equivalent in a complete metric space.</p>
<hr />
<h1 id="metric-space-properties">Metric space properties</h1>
<p>A metric space is a set $S$ equipped with a function $d: S \times S \to \mathbb{R}$. The function $d$ tells us the “distance” between two points in the set. It has the properties:</p>
<ul>
<li>$d(x, y) = 0 \iff x = y$ (positive definite)</li>
<li>$d(x, y) = d(y, x)$ (symmetric)</li>
<li>$d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality)</li>
</ul>
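<p>These axioms can be spot-checked on sample points. The sketch below assumes the usual modulus distance $d(z, w) = |z - w|$ on $\mathbb{C}$ and, for contrast, the squared distance $|z - w|^2$, which fails the triangle inequality. Passing on a finite sample is of course not a proof.</p>

```python
import random


def modulus_distance(z, w):
    return abs(z - w)


def squared_distance(z, w):
    return abs(z - w) ** 2


def check_metric_axioms(dist, points, tol=1e-12):
    """Check the three metric axioms on every triple of sample points."""
    for x in points:
        if dist(x, x) != 0:
            return False
        for y in points:
            if x != y and dist(x, y) <= 0:
                return False                      # distinct points must be apart
            if abs(dist(x, y) - dist(y, x)) > tol:
                return False                      # symmetry
            for z in points:
                if dist(x, z) > dist(x, y) + dist(y, z) + tol:
                    return False                  # triangle inequality
    return True


rng = random.Random(0)
sample = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(15)]
print(check_metric_axioms(modulus_distance, sample))
print(check_metric_axioms(squared_distance, [0j, 1 + 0j, 2 + 0j]))  # 4 > 1 + 1
```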
<p>In this case, $\mathbb{C}$ is a metric space with the distance function $d(z, w) = |z - w| = \sqrt{N(z - w)}$, where the <strong>norm form</strong> is $N(z) = z \bar{z}$. The norm form comes from the <strong>bilinear form</strong> of $\mathbb{C}$ over $\mathbb{R}$, defined as $\langle z, w \rangle = Re(z \bar{w})$, via $N(z) = \langle z, z \rangle$.</p>
<p>In metric spaces (and more generally topology) we’re concerned with open and closed sets. An <strong>open disc</strong> of radius $r$ centered at a point $z_0 \in \mathbb{C}$ is defined as:</p>
\[D_r(z_0) = \{z \in \mathbb{C} : |z-z_0| < r\}\]
<p>In a set $\Omega$, $z_0$ is an <strong>interior point</strong> if there exists some open disc around it that’s also contained within $\Omega$.</p>
<p>A set is <strong>open</strong> if every point in the set is an interior point. A complement of an open set is a <strong>closed set</strong>. A closed set contains all the limit points of the set, which are defined as the limits of convergent sequences $\{z_n\} \subset \Omega$. The <strong>closure</strong> of a set is the union of $\Omega$ and its limit points, and is denoted by $\bar{\Omega}$. The <strong>boundary</strong> of a set $\Omega$ is the closure minus the interior points, and is denoted $\partial \Omega$.</p>
<p>An open set $\Omega$ is considered <strong>connected</strong> if it’s not possible to find disjoint non-empty open sets $\Omega_1, \Omega_2$ such that $\Omega = \Omega_1 \cup \Omega_2$.</p>
<h2 id="sequential-compactness">Sequential compactness</h2>
<p>A set $\Omega$ is <strong>sequentially compact</strong> if every sequence in $\Omega$ has a subsequence that converges to a point in $\Omega$. An <strong>$\epsilon$-net</strong> is a union of balls with some fixed $\epsilon > 0$ centered around a subset of points in $\Omega$ s.t. $\bigcup B_\epsilon(x_\alpha) = \Omega$. A set is <strong>totally bounded</strong> if it has a finite $\epsilon$-net for every $\epsilon > 0$. An <strong>open covering</strong> of a set $\Omega$ is a collection of open sets $\{U_\alpha \}$ such that $\Omega \subset \bigcup U_\alpha$. The collection doesn’t need to be countable.</p>
<p><strong>Theorem: A metric space $\Omega$ is sequentially compact $\iff$ complete and totally bounded.</strong></p>
<p><strong>“Proof”</strong>: In the $\implies$ direction, since every cauchy sequence is a sequence itself, it must have a convergent subsequence. Cauchy sequences are themselves convergent if they have a convergent subsequence. Therefore, cauchy sequences converge and $\Omega$ is complete. Suppose $\Omega$ is not totally bounded, then there exists some $\epsilon > 0$ such that no finite $\epsilon$-net exists. Then we can create a sequence such that no subsequence can converge, by picking each new point outside of the $\epsilon$ balls of all the previous points (at any point in building the sequence, we have $n$ points, and we know that the union of these $n$ points’ balls is not equal to the entire set $\Omega$, so there’s some point out there for us to choose). We know these points will continue to exist because there is no finite $\epsilon$-net. Every pair of points in this sequence is at least $\epsilon$ apart, so no subsequence is cauchy, which contradicts that $\Omega$ is sequentially compact.</p>
<p>In the $\impliedby$ direction, consider an arbitrary sequence. If we have finitely many balls of radius $\epsilon$ covering the entire set (totally bounded), then by the pigeonhole principle there must be infinitely many points of the sequence in at least one of these balls. This subset is another totally bounded set itself, but restricted to a diameter of at most $2\epsilon$. We then state that it must have a finite $\epsilon/2$-net, and continue the procedure such that we have infinitely many nested subsets of decreasing diameter. By picking a single point of the sequence in each of these subsets (with increasing indices) we’ve created a cauchy subsequence of the original. Cauchy sequences converge in complete metric spaces, and since our sequence is arbitrary we are done. $\blacksquare$</p>
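<p>The “pick a point outside all previous balls” construction from the proof doubles as a greedy way to build an $\epsilon$-net for a finite sample. The sketch below runs it on a grid in the unit square (my choice of example), where it stops after a handful of centers; in a space that is not totally bounded it would keep finding new points forever.</p>

```python
def greedy_epsilon_net(points, eps):
    """Keep a point as a new center only if it is outside every existing ball."""
    centers = []
    for p in points:
        if all(abs(p - c) >= eps for c in centers):
            centers.append(p)
    return centers


# An 11 x 11 grid of complex numbers in the unit square.
grid = [complex(i / 10, j / 10) for i in range(11) for j in range(11)]
centers = greedy_epsilon_net(grid, eps=0.25)
print(len(grid), len(centers))
```

<p>Every skipped point lies within $\epsilon$ of some center by construction, and the centers are pairwise at least $\epsilon$ apart, which is exactly the property the proof uses to rule out convergent subsequences.</p>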
<p><strong>Lemma: (Lebesgue covering lemma) Given an open cover $\{G_\alpha\}$ of a sequentially compact metric space $\Omega$, there exists $\delta > 0$ such that the balls of radius $\delta$ centered around any point $x \in \Omega$ is a subset of one of the open sets $G_\alpha$.</strong></p>
<p><strong>“Proof”:</strong> If this is false, then we can create a sequence $\{x_n\}$ such that the $k$-th element is the center of a ball of radius $1/k$ and it’s not a subset of any $G_\alpha$. If there is no such $\delta$, we should always be able to find such $x_k$ to create an infinite sequence. Since this is a sequentially compact space, there exists a subsequence of $\{x_n\}$ that will converge to a point $x \in \Omega$, which must also belong to some open set $G_\alpha$. Since the point belongs to an open set, it’s an interior point, which means $\exists \epsilon > 0$ such that $B_\epsilon(x) \subset G_\alpha$. We said the ball $B_{1/n}(x_n)$ is not a subset of any $G_\alpha$, but eventually elements belonging to the infinite subsequence will be close enough to $x$ and the ball will be close enough such that $B_{1/n}(x_n) \subset B_\epsilon(x) \subset G_\alpha$, which is a contradiction. $\blacksquare$</p>
<hr />
<h2 id="compactness">Compactness</h2>
<p>We call $\Omega$ <strong>compact</strong> if every open covering of $\Omega$ has a finite subcovering. Additionally, the <strong>Heine-Borel</strong> theorem states that a set in $\mathbb{R}^n$ is compact $\iff$ it’s closed and bounded. An example of a compact set is $\{\frac{1}{n} : {n \in \mathbb{N}}\} \cup \{0\}$. It’s bounded, since it lies in $[0, 1]$, and it’s closed, because its only limit point, $0$, belongs to the set.</p>
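<p>As a quick check of the open-cover definition on this example (a sketch; the names $K$, $U_0$ and $N$ are introduced here only for illustration): let $K = \{\frac{1}{n} : n \in \mathbb{N}\} \cup \{0\}$ and let $\{U_\alpha\}$ be any open cover of $K$. Some set in the cover, call it $U_0$, contains $0$, and since $U_0$ is open there is an $\epsilon > 0$ with $B_\epsilon(0) \subset U_0$. Choosing an integer $N > 1/\epsilon$ gives</p>
\[\frac{1}{n} \leq \frac{1}{N} < \epsilon \ \ \forall n \geq N \implies \{0\} \cup \left\{\tfrac{1}{n} : n \geq N\right\} \subset B_\epsilon(0) \subset U_0\]
<p>At most finitely many points $1, \frac{1}{2}, \dots, \frac{1}{N-1}$ remain, and each lies in some $U_{\alpha_n}$, so $U_0, U_{\alpha_1}, \dots, U_{\alpha_{N-1}}$ is a finite subcovering of $K$.</p>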
<p><strong>Theorem: In metric spaces $\Omega$ is compact $\iff$ every sequence $\{z_n\}$ in $\Omega$ has a subsequence that converges to a point in $\Omega$ (sequentially compact).</strong></p>
<p><strong>“Proof”</strong>: In the $\implies$ direction, if we assume that the set $\Omega$ is not sequentially compact, then there exists a pathological sequence $\{x_n\}$ with no convergent subsequence. If that’s the case, then for any point $x \in \Omega$, there is an adequately small radius $\epsilon_x > 0$ such that $B_{\epsilon_x}(x)$ contains $x_n$ for only finitely many $n$ (otherwise we could extract a subsequence converging to $x$). These open balls form an open covering of the compact set $\Omega$. But no finite subcollection can cover $\Omega$: finitely many balls together contain $x_n$ for only finitely many $n$, while every one of the infinitely many terms of the sequence lies in $\Omega$. This contradicts compactness.</p>
<p>In the $\impliedby$ direction, let $\{G_\alpha\}$ be an arbitrary open covering for a sequentially compact $\Omega$. The Lebesgue covering lemma states that $\exists \delta > 0 \ \forall x \ \exists \alpha$ such that $B_\delta(x) \subset G_\alpha$. Being sequentially compact means $\Omega$ is totally bounded, which means there exist finite $\epsilon$-nets for any $\epsilon$. If we pick $\epsilon$ to be $\delta$, then that means there are finitely many of these $B_\delta(x_i)$’s that cover the entire set. If we look at these balls $\{B_\delta(x_i) : i = 1, 2, \dots, n\}$, which individually satisfy $B_\delta(x_i) \subset G_{\alpha(i)}$, we can just pick the finite subset of $G_\alpha$’s associated with the balls to create a finite covering of $\Omega$. Since $\{G_\alpha\}$ is arbitrary we are done. $\blacksquare$</p>
<p>It’s important to note that sequential compactness is not equivalent to compactness in all scenarios. The equivalence holds for metric spaces but not for general topological spaces.</p>
Interviewing During Covid2020-09-30T00:00:00+00:00http://oneraynyday.github.io/misc/2020/09/30/Interviewing-During-Covid<p><strong>I interviewed for Google’s Tensorflow, Apple’s MLPT (Machine Learning Platform & Technology), Bytedance’s ad infrastructure, Databricks’ ML team, Citadel Securities as a quantitative research analyst, Hudson River Trading (HRT) as an algorithm engineer, and Jane Street’s research desk as SWE. I received offers from all of the companies except for Jane Street. Here’s my experience interviewing during COVID.</strong></p>
<p><em>Disclaimer: I won’t be walking on the edge of leaking confidential information like an idiot (yes, I signed an NDA for all of these companies). Don’t expect to get any hints for your interviews.</em></p>
<p>The structure of this blog is inspired by my friend <a href="https://medium.com/@XiaohanZeng/i-interviewed-at-five-top-companies-in-silicon-valley-in-five-days-and-luckily-got-five-job-offers-25178cf74e0f">Han’s Medium blog post.</a></p>
<p><img src="http://oneraynyday.github.io/assets/interviews.png" alt="interviews" /></p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#preparation" id="markdown-toc-preparation">Preparation</a> <ul>
<li><a href="#algorithms" id="markdown-toc-algorithms">Algorithms</a></li>
<li><a href="#systems-design" id="markdown-toc-systems-design">Systems Design</a></li>
<li><a href="#math-questions" id="markdown-toc-math-questions">Math Questions</a></li>
</ul>
</li>
<li><a href="#the-interview-process" id="markdown-toc-the-interview-process">The interview process</a> <ul>
<li><a href="#more-interview-rounds-during-covid" id="markdown-toc-more-interview-rounds-during-covid">More interview rounds during COVID</a></li>
<li><a href="#dealing-with-time-zones" id="markdown-toc-dealing-with-time-zones">Dealing with time zones</a></li>
<li><a href="#which-ones-were-the-hardest" id="markdown-toc-which-ones-were-the-hardest">Which ones were the hardest?</a></li>
</ul>
</li>
<li><a href="#making-a-decision" id="markdown-toc-making-a-decision">Making a decision</a> <ul>
<li><a href="#the-culture-and-the-small-things-count" id="markdown-toc-the-culture-and-the-small-things-count">The culture and the “small” things count</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<h1 id="preparation">Preparation</h1>
<p><strong>Working on machine learning infrastructure is 99% systems engineering and 1% machine learning.</strong> My experience on machine learning infrastructure teams has taught me this, and preparing for systems engineering topics was the right way to go. I did the following to prepare:</p>
<h2 id="algorithms">Algorithms</h2>
<p><strong>~50 leetcode hard questions</strong>. Some of them are DP, some are graph based, some of them are just NP-hard problems that are a pain to code (which is the point), and some involve devising a clever data structure that supports a very specific access pattern. I gave myself roughly 40 minutes to solve each problem. About 15% of the time (7 questions) I couldn’t figure out the correct solution because I exceeded the time limit, exceeded the memory limit, or was just flat out wrong. In those cases I read the solutions directly and learned the tricks necessary to solve that type of problem moving forward. Don’t bother with medium or easy questions since hard questions often contain medium/easy tasks as subroutines, and these companies probably wouldn’t ask you easy leetcode questions anyways.</p>
<p>I wrote the solutions in either Python or C++ (sometimes both) and went back to polish my code for minor optimizations or readability improvements. For C++, I made sure I wasn’t using raw pointers unless appropriate and that I was using modern features available in C++17 (<code class="language-plaintext highlighter-rouge">constexpr</code> functions, <code class="language-plaintext highlighter-rouge">std::array</code> instead of raw arrays, smart pointers, template type deduction with lambdas, etc). The reason I wasn’t using C++20 was that the online coding platforms (like coderpad) likely use stable distributions of GCC and clang, which means some of the new features are still in their experimental phase. <strong>I didn’t want to encounter a bug with concepts or <code class="language-plaintext highlighter-rouge">std::ranges</code> in the middle of the interview.</strong> (In fact, I found a <a href="https://stackoverflow.com/questions/62398252/why-likely-attribute-in-c20-raises-a-warning-here">bug with attributes</a> recently in a new version of GCC.)</p>
<p>I also spent a few days on problems elsewhere:</p>
<ul>
<li><a href="https://codingcompetitions.withgoogle.com/codejam/archive">Codejam problems</a>. Round 1 and 2 are feasible, but round 3 was very difficult. I’d suggest studying round 1’s if you only care about interviews.</li>
<li><a href="https://codeforces.com/">Codeforces contests</a>. There are 3 tiers (or Divs, as they call them), and for interviews I suggest Div 3 and Div 2. Don’t bother with the D+ questions in Div 2, and definitely don’t bother with Div 1.</li>
</ul>
<h2 id="systems-design">Systems Design</h2>
<p>Working at Airbnb has made me pretty familiar with high level distributed systems design, but of course I worked only with a subset of them. I think Martin Kleppmann’s book <a href="https://www.google.com/books/edition/Designing_Data_Intensive_Applications/p1heDgAAQBAJ?hl=en">Designing Data-Intensive Applications</a> is a great read, but you’ll have to pick and choose which sections you want to go over as it’s a pretty dense book. If you don’t have time, maybe just try understanding how Kubernetes works with Marko Luksa’s <a href="https://www.manning.com/books/kubernetes-in-action">Kubernetes in Action</a>, which is a much easier read. You can then draw parallels between the distributed design of K8s and whatever systems design question the interviewer has for you.</p>
<p>Make sure you know some fundamental ideas about distributed systems like the <strong>map reduce paradigm</strong>, <strong>sharding</strong>, <strong>asynchronous and synchronous follower replicas</strong>, <strong>CAP theorem</strong>, etc. <em>What you don’t want to do is read 3 sentences about each of the terms above and regurgitate it in your interviews. Interviewers have been doing this for a while, they know you don’t actually understand the concepts.</em> Don’t be that guy.</p>
<h2 id="math-questions">Math Questions</h2>
<p><strong>These are only asked in finance firms.</strong> Honestly, these are just all over the place. I read this green book called <a href="http://quantfinanceinterviews.com/">A Practical Guide to Quantitative Finance Interviews</a> by Xinfeng Zhou, but only did a single problem in each section by myself. Hedge funds will quiz you on anything from discrete math to probability theory to geometry to information theory. My advice, if you’re a software engineer interviewing at places that are a hybrid of finance and tech, is to timebox yourself in this category.</p>
<hr />
<p>I have not seen an interview question this cycle that was an exact question I’ve seen online or in books. Your mileage may vary.</p>
<h1 id="the-interview-process">The interview process</h1>
<p>Interviewing and talking with all of these companies was a great experience, even with COVID in place. Obviously, as shelter-in-place continues, these companies are conducting virtual on-site interviews and trying to make this process as smooth as possible. Without getting into the specifics, I’ll outline some common things I’ve noticed during the process in the COVID era.</p>
<ul>
<li>Many companies use Zoom or Google Hangouts for their on-sites.</li>
<li>They give you ~15 minute breaks in between interviews for water breaks.</li>
<li>Some companies give you a longer lunch break (45 mins to an hour).</li>
<li>If you’re interviewing for a company in another time zone, prepare to wake up in the early AM’s or interview in the late afternoon (sometimes after dinner).</li>
<li>Conveying an idea takes slightly longer because you’re not drawing on a whiteboard. Some companies have virtual whiteboard apps and others allow the use of Zoom whiteboards.</li>
<li><strong>Some companies added more interview rounds for virtual on-sites.</strong> Apparently more people are getting into companies with subpar technical skills during COVID and they’re making the process more selective. I think this can also be due to the increase in competition due to unemployment rates increasing.</li>
<li>Feedback and communications with recruiters are generally faster.</li>
</ul>
<h2 id="more-interview-rounds-during-covid">More interview rounds during COVID</h2>
<p>The bolded text might scare you as a potential candidate, but don’t worry too much. The added questions aren’t testing whether you know how to implement a Bloom filter or a Fibonacci heap or something niche. They usually test the <em>coding abilities of the person and how well they’d actually ramp up in a novel, collaborative environment</em>. This can manifest itself in multiple ways - a live debugging session with a new codebase, reading documentation to work with new technology, or a collaborative brainstorming session for a hard(er) problem. If you’re a decent software engineer you shouldn’t worry about these as much.</p>
<h2 id="dealing-with-time-zones">Dealing with time zones</h2>
<p><em>One of the biggest struggles I had during the interview process was adjusting my sleep schedule to wake up at 5-6AM to make sure I’m awake and on time for the interviews in New York/Chicago (I’m in California so this was a 3 hour gap)</em>. Usually, companies would fly you out the day-of or the day before the on-site. I’ve always felt tired after a plane flight and was able to get a good night’s rest before the interviews in the past. With COVID, everything is virtual and the companies expect you to interview at their hours.</p>
<p>Even with slowly adjusting my sleep schedule over a week or two I still had trouble with sleep. Personally, I get pretty nervous before an on-site and I’d need to feel adequately tired to get a good night’s rest instead of tossing and turning in bed. With the clock turned 3 hours back, I suddenly found myself not tired enough to sleep on time the night before the interview(even with a whole week of adjusting). This led to me consistently getting 6-7 hours of sleep instead of the 9 hours of sleep I usually get on game day, which really sucked.</p>
<p>Ultimately, I have no idea how much the sleep problem really affected my performance, but it was enough to shake my confidence going in.</p>
<p><em>NOTE: +1 to Citadel for proactively breaking my on-site over multiple days so I can have a sane sleep schedule for their interviews. This might depend on the specific team you’re interviewing with.</em></p>
<h2 id="which-ones-were-the-hardest">Which ones were the hardest?</h2>
<p>This is subjective, and the question can be broken up into multiple components:</p>
<ul>
<li><strong>Time pressure - Jane Street</strong>. This is probably why I failed their interviews, which were a bit longer than usual. I tend to explain my approach before coding anything to get a confirmation on the interviewer’s side that I’m on the right track. I probably spent too much time explaining and didn’t have enough time to finish the code for some interviews.</li>
<li><strong>Math questions - Citadel</strong>. They asked me some <em>really</em> interesting math problems that aren’t related to finance at all. I don’t think they expect the interviewee to get 100% of the questions since whenever I solved one the interviewer was ready with another. HRT also asked some.</li>
<li><strong>Systems design - HRT</strong>.</li>
<li><strong>Outside-the-box problems - Databricks</strong>. They conduct one of the most unique interviews I’ve ever had.</li>
<li><strong>Language specific questions - Citadel/HRT</strong>. Grilled me a lot on low level C++ stuff.</li>
<li><strong>Length of interview - HRT</strong>. I started at 8AM PST (I requested to move it to 8AM from 7AM) and finished at ~2:30PM. <strong>That is a whopping 6 hours and 30 minutes.</strong> I also did a coding challenge and 2 phone screens before I moved to on-site, totalling almost 10 hours for interviews.</li>
<li><strong>General algorithm questions - Jane Street/HRT</strong>. I think Jane Street was a bit harder given the time pressure. The flavor of algorithm questions is also different between these firms.</li>
</ul>
<p>Once again, this breakdown is <strong>subjective</strong>. I obviously have a lot of experience interviewing with Silicon Valley companies so the novelty of questions from the finance companies added to the difficulty.</p>
<h1 id="making-a-decision">Making a decision</h1>
<p>This was the hardest part for me. I spent two weeks suffering from analysis paralysis. I would wake up wanting to go to company X but wake up another day wanting to go to company Y. <em>The tug of war between different recruiters stressed me out - I couldn’t sleep, I couldn’t eat, I couldn’t do anything during the days and I unknowingly stressed out those around me with the process <strong>(special shoutouts to Ben, Eric, Nishanth, Mickey and Kibeom for dealing with my BS)</strong></em>. I used the following criteria to make my decision:</p>
<ul>
<li><strong>Manager support</strong> - How much support would my manager give me to learn new things and work on impactful projects? Is my manager someone I admire and want to learn from?</li>
<li><strong>Tech debt</strong> - How much tech debt is there and do I have to deal with it?</li>
<li><strong>Project flexibility/impact</strong> - Are the projects assigned to me or do I have autonomy to choose what I think is most impactful/interesting? If the project is assigned to me, is it something that I’ll enjoy doing for a few years at least? Will the acquired skills associated with the project be transferrable?</li>
<li><strong>Ability to learn</strong> - How collaborative is the company? Are my coworkers going to be domain experts? Can I engage in discussions with critical problem solvers?</li>
<li><strong>Stability</strong> - How high is the attrition rate? There isn’t anything bad about the idea of removing low performers to keep a company efficient, but the pressure to deliver often comes to the detriment of learning new things and keeping the infrastructure robust.</li>
<li><strong>Location</strong> - Between the Bay Area, Seattle, Chicago, and New York City, which one do I want to go to the most? This was a complicated decision.</li>
<li><strong>Compensation</strong> - Do I need to worry about money or can I just put my head down and build things? How much risk is associated with the package?</li>
<li><strong>Culture/work-life balance</strong> - Is the company individualized or collaborative? What’s the managerial structure? Are there lots of politics? How long do people usually work? I don’t want to burn out and stop working on my blog and pursuing other hobbies, as that will likely take a toll on my mental health and ultimately cause me to leave the company anyways.</li>
</ul>
<p>I made a weighted linear model consisting of these features and used that arbitrary numerical output to narrow my choice down to Citadel and HRT, which came out exactly equal in numerical value. In the end, I decided using my gut feeling and went with HRT.</p>
<h2 id="the-culture-and-the-small-things-count">The culture and the “small” things count</h2>
<p>My decision was largely driven by the vibe of the people I talked to:</p>
<ul>
<li>My friends at HRT, especially Ben, were big factors for my decision.</li>
<li>HRT set up a virtual dinner with me and three potential new grad coworkers, with delivery app credits. It made me feel valued as a candidate since the three coworkers were so friendly and ready to chat.</li>
<li>The hiring manager understood me very well - he knew what projects I was interested in and what my hesitations were when it came to making a decision, and he addressed them. In the morning before my final deadline, he sent me a heartfelt e-mail recapping many of the topics we’d discussed and supporting me.</li>
<li>The recruiter also corresponded with me every week and was receptive to my feedback.</li>
</ul>
<p>Aside from HRT, I’d like to thank Xinan and Xing for being amazing hiring managers who spent <strong>a lot</strong> of time talking to me and helping me throughout the process with invaluable advice. Their experience, honesty, relatability, and transparency made the decision so much harder to make (in their favor, of course). Because of them, I felt like I wasn’t just another data point in the interview pipeline, and that they were advocating for my success regardless of where I end up.</p>
<hr />
<p>Although I felt like the decision making process was least impacted by COVID, if I were able to meet potential co-workers face-to-face it would’ve been clearer which place I would’ve liked to work at.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Interviewing during COVID is definitely a different experience. It was a lot of stress and I’m glad to be over with it. In the future, for my sanity, I would not go through the process with so many companies in different time zones at the same time. I’d like to thank my family & friends for supporting me through the entire process and cheering me on. I couldn’t have done it without you all :)</p>
What is a Feature Store?2020-09-22T00:00:00+00:00http://oneraynyday.github.io/cs/2020/09/22/Feature-Store<p>During my time at Airbnb, I witnessed the development of the feature store effort on the machine learning infrastructure team. The project’s name is <strong>Zipline</strong>, and it has been presented at many <a href="#conference-talks">conferences</a>. As it’s one of the first open-sourced feature engineering platforms, I made sure to cover its implementation details in the query engine sections of the blog. The feature store problem is one of the most technically exciting problems in the data engineering space and many companies are trying to create their own solutions. I start by discussing the necessity of a feature store for ML applications and move on to talk about the fundamental mathematical structures involved and some methods to solve the problem. The most important concepts are in the section <a href="#query-engine-offline">about offline query engines</a>, but the more novel ideas are in the section <a href="#online-equivalents">about the online query engines</a>.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#what-is-a-feature-store" id="markdown-toc-what-is-a-feature-store">What is a Feature Store?</a></li>
<li><a href="#why-do-we-need-a-feature-store" id="markdown-toc-why-do-we-need-a-feature-store">Why do we need a feature store?</a> <ul>
<li><a href="#scenario-1-daily-exports" id="markdown-toc-scenario-1-daily-exports">Scenario 1: Daily exports</a></li>
<li><a href="#scenario-2-logs" id="markdown-toc-scenario-2-logs">Scenario 2: Logs</a></li>
<li><a href="#requirements-of-a-feature-store" id="markdown-toc-requirements-of-a-feature-store">Requirements of a feature store</a></li>
</ul>
</li>
<li><a href="#aggregation-types" id="markdown-toc-aggregation-types">Aggregation types</a> <ul>
<li><a href="#abelian-groups" id="markdown-toc-abelian-groups">Abelian Groups</a></li>
<li><a href="#commutative-monoids" id="markdown-toc-commutative-monoids">Commutative Monoids</a></li>
</ul>
</li>
<li><a href="#anatomy-of-the-temporal-join" id="markdown-toc-anatomy-of-the-temporal-join">Anatomy of the Temporal Join</a></li>
<li><a href="#query-engine-offline" id="markdown-toc-query-engine-offline">Query Engine (Offline)</a> <ul>
<li><a href="#array-based-algorithms" id="markdown-toc-array-based-algorithms">Array-based Algorithms</a></li>
<li><a href="#tree-based-algorithms" id="markdown-toc-tree-based-algorithms">Tree-based Algorithms</a> <ul>
<li><a href="#alternative-segment-tree-representation" id="markdown-toc-alternative-segment-tree-representation">Alternative Segment Tree Representation</a></li>
</ul>
</li>
<li><a href="#skiplist-based-algorithms" id="markdown-toc-skiplist-based-algorithms">Skiplist-based Algorithms</a> <ul>
<li><a href="#theoretical-optimization" id="markdown-toc-theoretical-optimization">Theoretical Optimization</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#other-considerations" id="markdown-toc-other-considerations">Other considerations</a> <ul>
<li><a href="#relaxing-requirements-for-optimizations" id="markdown-toc-relaxing-requirements-for-optimizations">Relaxing requirements for optimizations</a> <ul>
<li><a href="#windowed-vs-non-windowed" id="markdown-toc-windowed-vs-non-windowed">Windowed vs. Non-Windowed</a></li>
</ul>
</li>
<li><a href="#online-equivalents" id="markdown-toc-online-equivalents">Online equivalents</a> <ul>
<li><a href="#tree-based-algorithms-1" id="markdown-toc-tree-based-algorithms-1">Tree-based Algorithms</a> <ul>
<li><a href="#a-case-study" id="markdown-toc-a-case-study">A Case Study</a></li>
</ul>
</li>
<li><a href="#skiplist-based-algorithms-1" id="markdown-toc-skiplist-based-algorithms-1">Skiplist-based Algorithms</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#conference-talks" id="markdown-toc-conference-talks">Conference Talks</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<style>
.red {
color:inherit;
}
.red:hover {
color:rgb(129, 17, 18);
}
.collapse:hover {
cursor: pointer;
}
.video-container {
position: relative;
padding-bottom: 56.25%; /* 16:9 */
height: 0;
}
.video-container iframe {
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
}
</style>
<h1 id="what-is-a-feature-store">What is a Feature Store?</h1>
<p>A machine learning model does not typically use raw data as its features because of the following:</p>
<ol>
<li>Raw data often doesn’t show the bigger picture - it’s not a statistic representing a large quantity of data. Sometimes we may want to sum, average, etc the raw data for it to be useful.</li>
<li>The raw data firehose and the sheer scale of it may negatively affect the training process (large AWS costs for a single epoch), and require a much more complex online serving architecture to meet latency requirements.</li>
<li>A lot of raw data is not useful - there could be nulls or missing information that require imputing on the fly.</li>
<li>Raw data might not be in a form that the model can readily consume (e.g. categorical variables are sometimes turned into one-hot encoded vectors).</li>
</ol>
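<p>To make points 3 and 4 concrete, here is a minimal sketch of the kind of processing that happens between raw data and model input. The helper names <code class="language-plaintext highlighter-rouge">impute_mean</code> and <code class="language-plaintext highlighter-rouge">one_hot</code> and the toy data are made up for illustration; a real pipeline would more likely lean on a dataframe or ML library.</p>

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Map each category to a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1 if index[v] == i else 0 for i in range(len(categories))]
        for v in values
    ]


# Hypothetical raw columns: dollars spent (with a missing value) and device type.
spend = impute_mean([10.0, None, 20.0])
devices = one_hot(["ios", "web", "ios"])
```

<p>Here the missing spend is filled with the mean of the observed values, and each device type becomes a vector with a single 1 in the position of its category.</p>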
<p>Because of the reliance on processed data, there has been a growing demand for a component that can process and store these features. Feature stores are the solutions - they hold processed data that is used as input to machine learning models for inference, training, etc. Feature stores naturally can extend to support feature versioning, observability/monitoring tools and ad-hoc visualization tools in order to improve and understand the lineage and efficacy of features. It’s an abstract concept that can be implemented in many ways depending on the scale and specificity of your application. Generally, feature stores can deliver features to models in both online and offline settings (for both training and scoring), where in the online case latency is more important and in the offline case throughput is more important.</p>
<p>In the real world, data sources can come from anywhere - logs of applications (think Kibana log ingestion streams), daily database snapshots, pubsub messages (think Kafka messages), database commit logs and even data from other feature stores! It’s the feature store’s job to support and process these different data sources.</p>
<h1 id="why-do-we-need-a-feature-store">Why do we need a feature store?</h1>
<p>Imagine you’re a data scientist or ML engineer for an e-commerce company, and you’re trying to develop a new machine learning model that detects whether the given user will purchase something in the current session. Usually the first step is to ask yourself “what features are causative to a user’s purchase patterns?”. Maybe after putting yourself in the shoes of the customer, you decided that if a person has spent a lot of money on this platform in the last 7 days, they will likely purchase something again in this session. You quickly come up with a couple more features and decide that XGBoost would be a good baseline for this classifier, and you can serve it in an online setting.</p>
<p>Now the first step is to see where in your data warehouse you can possibly find these features.</p>
<p>Let’s go through two example scenarios:</p>
<h2 id="scenario-1-daily-exports">Scenario 1: Daily exports</h2>
<p>After some digging, you found daily hive table exports in S3 with “dollars spent for the day” for every user. Great! Now you need to write a daily job to sum all past 7 days’ exports (excluding today’s since it hasn’t landed) for each user into a single table, so you can use that as your feature store for your model.</p>
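<p>The daily job described above could look roughly like the sketch below. It assumes the exports have already been loaded into a plain dict keyed by date, which is a stand-in for the Hive tables; <code class="language-plaintext highlighter-rouge">trailing_7d_spend</code> and the sample data are hypothetical.</p>

```python
from collections import defaultdict
from datetime import date, timedelta


def trailing_7d_spend(daily_exports, today):
    """Sum each user's spend over the 7 days before `today` (today excluded).

    daily_exports: dict mapping date -> {user_id: dollars spent that day},
    a hypothetical in-memory stand-in for the daily table exports.
    """
    totals = defaultdict(float)
    for offset in range(1, 8):  # yesterday back to 7 days ago
        day = today - timedelta(days=offset)
        for user, dollars in daily_exports.get(day, {}).items():
            totals[user] += dollars
    return dict(totals)


exports = {
    date(2020, 9, 21): {"alice": 5.0, "bob": 2.0},
    date(2020, 9, 15): {"alice": 1.0},
    date(2020, 9, 14): {"alice": 100.0},  # 8 days back, outside the window
    date(2020, 9, 22): {"bob": 50.0},     # today's export, excluded
}
features = trailing_7d_spend(exports, today=date(2020, 9, 22))
```

<p>Note that the job rereads all seven exports every day, even though six of them were already read the day before. That repeated work is one of the things a feature store can optimize away.</p>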
<h2 id="scenario-2-logs">Scenario 2: Logs</h2>
<p>After some digging, you found a raw data stream that contains the number of dollars spent for a user with a timestamp of the transaction. Great! You can create a table containing the total number of dollars spent for the day populated by the raw data stream, and export snapshots of it daily. Now you’re back at scenario 1.</p>
<p>But what if you want the past 7-days window to take into account the data received up until the very present? Then you would need to build something akin to a KV store that is updated every time a transaction occurs. Needless to say, this is a pretty complicated setup that wouldn’t be worth the marginal gain for most cases.</p>
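<p>As a rough idea of what that per-transaction update involves, here is a single-process sketch. The class <code class="language-plaintext highlighter-rouge">RollingSpend</code> is hypothetical and keeps everything in memory; a production version would sit on top of a real KV store and have to deal with out-of-order events, persistence and concurrency, which is where the complexity comes from.</p>

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 7 * 24 * 60 * 60


class RollingSpend:
    """In-memory stand-in for a KV store keyed by user id (hypothetical)."""

    def __init__(self):
        self._events = defaultdict(deque)  # user -> deque of (timestamp, dollars)
        self._totals = defaultdict(float)  # user -> sum of dollars in the deque

    def record(self, user, timestamp, dollars):
        """Called on every transaction; assumes timestamps arrive in order."""
        self._events[user].append((timestamp, dollars))
        self._totals[user] += dollars

    def spend_last_7d(self, user, now):
        """Evict events that fell out of the window, then return the total."""
        events = self._events[user]
        while events and events[0][0] <= now - WINDOW_SECONDS:
            _, dollars = events.popleft()
            self._totals[user] -= dollars
        return self._totals[user]


DAY = 24 * 60 * 60
store = RollingSpend()
store.record("alice", 0 * DAY, 10.0)
store.record("alice", 3 * DAY, 5.0)
on_day_4 = store.spend_last_7d("alice", now=4 * DAY)
on_day_8 = store.spend_last_7d("alice", now=8 * DAY)
```

<p>On day 4 both transactions are inside the window; by day 8 the first one has aged out and is subtracted from the running total.</p>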
<p>After doing this for a while, you quickly realize that collecting and cleaning data to create your features takes a long time and is repetitive. What if you can just declaratively express the data sources and the aggregations for your features and be done with it?</p>
<h2 id="requirements-of-a-feature-store">Requirements of a feature store</h2>
<p>From the example above, you probably already have an idea of some requirements we need the feature store to support:</p>
<ol>
<li><strong>Data recency</strong>: The daily exports scenario gives us features made from 1-day stale data, and that may not be acceptable for a system that’s sensitive to recent data or biases more towards intra-day events.</li>
<li><strong>Throughput</strong>: Table exports can be extremely large, and to enable newest features right after exports land, the feature store must be able to sustain high throughput to compute offline features.</li>
<li><strong>Online-offline consistency</strong>: The offline system that computes daily features and the online system that gives you the most up-to-date features must output the same result given the same inputs. We should expect identical setup if we were to bring an offline model to an online production environment.</li>
<li><strong>Feature repository</strong>: There are many features that can be shared among different models. This allows collaboration of multiple ML researchers and reduces duplication of effort.</li>
<li><strong>Monitoring system</strong>: If a feature recently changed anomalously, the researchers should be alerted in case their model isn’t trained to be robust under novel scenarios.</li>
</ol>
<p>This is by no means an exhaustive list of requirements, but it should give you an idea of the challenges in designing such a distributed system aside from the typical consistency, availability, etc problems.</p>
<h1 id="aggregation-types">Aggregation types</h1>
<p>It is sometimes necessary to aggregate features for them to be more useful. In the above example we used the sum of all purchases in the past 7 days. The key to understanding aggregation types is to focus on the word “sum” and “purchases”, i.e. an operator(in this case, the plus operator) working with a set of elements(in this case, numbers). In essence, we are combining raw data to create <strong>aggregations</strong>. Typical aggregation types belong in two categories, which we can rigorously define in mathematical terms. We’ll explain them briefly below.</p>
<h2 id="abelian-groups">Abelian Groups</h2>
<p>A group is a set of elements $S$ (e.g. integers) with an operator $\cdot$ (e.g. addition) such that the following properties hold:</p>
<ul>
<li><em>Identity</em></li>
</ul>
\[\exists e \in S \ \text{such that}\ e \cdot x = x \cdot e = x \ \forall x \in S\]
<ul>
<li><em>Inverse</em></li>
</ul>
\[\forall x \in S \ \exists x^{-1} \in S \ \text{such that}\ x \cdot x^{-1} = x^{-1} \cdot x = e\]
<ul>
<li><em>Associativity</em></li>
</ul>
\[\forall x,y,z \in S,\ (x \cdot y) \cdot z = x \cdot (y \cdot z)\]
<ul>
<li><em>Closure</em></li>
</ul>
\[\forall x, y \in S, \ x \cdot y \in S\]
<p>An abelian group is one that contains this extra property:</p>
<ul>
<li><em>Commutativity</em> (for Abelian groups only)</li>
</ul>
\[\forall x, y \in S,\ x \cdot y = y \cdot x\]
<p>If we consider the aggregated statistic of “dollars spent in the past 7 days” by summing “dollars spent in a day” datapoints, we can show why each of these properties is important. In this case, we can model the set as all real numbers, and the binary operator as addition. Formally we can denote it as $(\mathbb{R}, +)$.</p>
<p>The identity of the group is 0, and it’s used to initialize a new user who just joined the platform and has spent no money on it. We want closure so that summing the past 7 days’ worth of transactions will still be a dollar amount. Associativity and commutativity are required so that neither the order of transactions nor the way we group them for aggregation changes our total sum.</p>
<p>So what is the inverse used for? Well, it’s not actually <strong>required</strong> in the sense that we can still compute the total sum in the past 7 days without it. Instead, it is a powerful tool we can use to compute our statistics <em>faster.</em> Suppose we keep a cumulative sum from the time the user registered up to every point in time until today. It is then easy to compute the 7-day window by subtracting the cumulative sum up to 7 days ago from the cumulative sum up to today. Subtraction here is essentially adding the inverse element. We discuss this exact algorithm in more detail <a href="#array-based-algorithms">later in the blog</a>. Note that if we chose our set as $\mathbb{R}_{\geq 0}$ then inverses would not exist and we would not have an abelian group. Instead we’d have something called a <strong>commutative monoid.</strong></p>
<p>Some useful examples of abelian groups over real numbers include addition on $\mathbb{R}$ (where subtraction adds the inverse) and multiplication on the nonzero reals (where division multiplies by the inverse; $0$ has to be excluded because it has no inverse). With the set defined as $\mathbb{R} \times \mathbb{N}$, we can define averaging as an operator that behaves like an abelian group as well, as long as we only ever subtract a smaller count from a larger one (strictly speaking, inverses require the counts to live in $\mathbb{Z}$).</p>
<details class="red">
<summary class="collapse"><strong>How can average be represented as an abelian group?</strong>
</summary>
<p>Average can be expressed as an abelian group by using a 2-tuple of the form $(s, n)$ with $s \in \mathbb{R}, n \in \mathbb{N}$, where $s$ is the sum and $n$ is the count. When a value $s_t$ comes in and we want to update the average, we convert the value into a 2-tuple and add the two:</p>
\[(s, n) + (s_t, 1) = (s+s_t, n+1)\]
<p>As in, we increment the count by 1, and we add the value to the running sum. The inverse operation can be used to get an average of a time interval $[a, b]$ by the following:</p>
\[(s_b, n_b) - (s_{a-1}, n_{a-1}) = (s_b-s_{a-1}, n_b-n_{a-1}) = (s_{[a,b]}, n_{[a,b]})\]
<p>The 2-tuple isn’t exactly an “average”, but contains enough information to retrieve an average, simply by $\frac{s}{n} \in \mathbb{R}$. We call the tuple an <strong>intermediate representation, or IR</strong>. Many aggregate types use IRs that are bigger than a typical integer, and can go up to hundreds of bytes in practical applications. For reference, check out <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>, which is an approximate distinct-count aggregate with tunable IR sizes.</p>
<p>NOTE: We used +/- above to make the example less abstract, but in reality we’re adding by the inverse element when - is used above.</p>
</details>
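<p>As a rough sketch of the $(s, n)$ intermediate representation (the names <code>combine</code>, <code>inverse</code> and <code>finalize</code> are made up for illustration and are not Zipline APIs), the window average falls out of two prefix IRs. Note that the inverse produces negative counts, so this sketch quietly lets counts range over the integers:</p>

```python
from typing import List, Tuple

IR = Tuple[float, int]  # (running sum, count)

IDENTITY: IR = (0.0, 0)


def combine(a: IR, b: IR) -> IR:
    """Group operator: add sums and counts component-wise."""
    return (a[0] + b[0], a[1] + b[1])


def inverse(a: IR) -> IR:
    """Inverse element: negate both components."""
    return (-a[0], -a[1])


def finalize(a: IR) -> float:
    """Turn the IR into the actual average."""
    return a[0] / a[1]


values = [4.0, 8.0, 6.0, 2.0]

# prefix[i] is the IR of the first i values, starting from the identity.
prefix: List[IR] = [IDENTITY]
for v in values:
    prefix.append(combine(prefix[-1], (v, 1)))

# IR of values[1:3], i.e. the values 8.0 and 6.0.
window = combine(prefix[3], inverse(prefix[1]))
window_average = finalize(window)
```

<p>Only <code>finalize</code> knows that the IR represents an average; everything before it is plain component-wise addition.</p>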
<h2 id="commutative-monoids">Commutative Monoids</h2>
<p>Essentially, monoids are a superset of groups, in that they don’t have the restriction of inverses. A commutative monoid thus is just our abelian group without the inverse property. An example of a commutative monoid used in aggregation statistics is the max operator. Suppose we want to know the biggest daily transaction total for a user in the past 7 days. It is not always possible to figure out what the max is within the past 7 days from cumulative maxes because we’re dealing with a commutative monoid which doesn’t have an inverse.</p>
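<p>Some useful examples of commutative monoid operators over real numbers include max and min (with $-\infty$ and $+\infty$ adjoined as the respective identity elements). Note that the median does not fit this mold: it is not associative, so an exact median cannot be built by combining the medians of sub-ranges.</p>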
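<p>A tiny illustration of why cumulative maxes are not enough, using made-up daily totals: two users can have identical cumulative maxes and still have different maxes over the last two days, so there is nothing to “subtract” that would recover the window max.</p>

```python
from itertools import accumulate

# Hypothetical daily transaction totals for two users over three days.
user_a = [5, 1, 1]
user_b = [5, 4, 4]

# Running max up to each day; both users produce the same sequence.
cum_a = list(accumulate(user_a, max))
cum_b = list(accumulate(user_b, max))

# Max over the last two days only; these differ between the two users.
window_a = max(user_a[1:])
window_b = max(user_b[1:])
```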
<h1 id="anatomy-of-the-temporal-join">Anatomy of the Temporal Join</h1>
<p>When we’re defining features in the feature store, we typically ask the question “What is the <em><aggregation statistic></em> for <em><key></em> from <em><time A></em> to <em><time B></em>?”. This can be phrased by a join between the query and the raw data to create aggregated features. Semantically, the left side of the join is the query, the right side is the raw data, and the output is the aggregated features.</p>
<p>If we use scenario 2 illustrated above as an example, we can have queries ask “what is the total amount of money spent in some time range?”, with the raw events being purchase events and their corresponding dollar amounts. The queries would look like:</p>
<p><img src="http://oneraynyday.github.io/assets/left_temporal.png" alt="left_temporal" height="25%" width="25%" />
which is the left side of the join. Meanwhile, the raw events would look like:</p>
<p><img src="http://oneraynyday.github.io/assets/right_temporal.png" alt="right_temporal" height="25%" width="25%" /></p>
<p>which is the right side of the join. The join looks like:</p>
<p><img src="http://oneraynyday.github.io/assets/joined_temporal.png" alt="joined_temporal" height="40%" width="40%" /></p>
<p>which is essentially a left join aggregated by name.</p>
<details class="red">
<summary class="collapse"><strong>This looks like it can be done in SQL!</strong>
</summary>
<p>You’re right, we can answer the question above with a SQL setup:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="nv">`queries`</span> <span class="p">(</span>
<span class="nv">`id`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="nb">unsigned</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`name`</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`starttime`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`endtime`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="nv">`id`</span><span class="p">)</span>
<span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="nv">`queries`</span> <span class="p">(</span><span class="nv">`id`</span><span class="p">,</span> <span class="nv">`name`</span><span class="p">,</span> <span class="nv">`starttime`</span><span class="p">,</span> <span class="nv">`endtime`</span><span class="p">)</span> <span class="k">VALUES</span>
<span class="p">(</span><span class="s1">'1'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'2'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="nv">`events`</span> <span class="p">(</span>
<span class="nv">`id`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="nb">unsigned</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`name`</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`time`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`value`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="nv">`id`</span><span class="p">)</span>
<span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="nv">`events`</span> <span class="p">(</span><span class="nv">`id`</span><span class="p">,</span> <span class="nv">`name`</span><span class="p">,</span> <span class="nv">`time`</span><span class="p">,</span> <span class="nv">`value`</span><span class="p">)</span> <span class="k">VALUES</span>
<span class="p">(</span><span class="s1">'1'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'2'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="mi">23</span><span class="p">,</span> <span class="mi">70</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'3'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">30</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'4'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'5'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">20</span><span class="p">);</span>
<span class="k">SELECT</span> <span class="n">queries</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">queries</span><span class="p">.</span><span class="n">starttime</span><span class="p">,</span> <span class="n">queries</span><span class="p">.</span><span class="n">endtime</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">events</span><span class="p">.</span><span class="n">value</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">queries</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">events</span> <span class="k">ON</span> <span class="n">queries</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">events</span><span class="p">.</span><span class="n">name</span>
<span class="k">AND</span> <span class="n">queries</span><span class="p">.</span><span class="n">starttime</span> <span class="o"><=</span> <span class="n">events</span><span class="p">.</span><span class="nb">time</span> <span class="k">AND</span> <span class="n">events</span><span class="p">.</span><span class="nb">time</span> <span class="o"><=</span> <span class="n">queries</span><span class="p">.</span><span class="n">endtime</span>
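<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">queries</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">queries</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">queries</span><span class="p">.</span><span class="n">starttime</span><span class="p">,</span> <span class="n">queries</span><span class="p">.</span><span class="n">endtime</span><span class="p">;</span>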
</code></pre></div> </div>
<p>The problem statement, formulated in SQL, only accounts for the offline application for the feature store and not the online applications. We discuss efficient ways to solve the offline feature store below, and try to unify the solution in the online case.</p>
<p>In the meantime, if you’re comfortable thinking about the problem setup as joining tables in relational databases, it’s a great analogy to use for the rest of the blog.</p>
</details>
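<p>For readers who prefer code over tables, the temporal join can also be written as a brute-force reference implementation. This is only a baseline sketch, reusing the toy rows from the SQL example; the <code>Query</code> and <code>Event</code> types are hypothetical, and the $O(NM)$ double loop is exactly what the algorithms in the following sections try to avoid.</p>

```python
from typing import List, NamedTuple, Optional


class Query(NamedTuple):
    name: str
    start: int
    end: int


class Event(NamedTuple):
    name: str
    time: int
    value: int


def naive_temporal_join(queries: List[Query], events: List[Event]) -> List[Optional[int]]:
    """For each query, sum the values of events with the same name whose
    time falls in [start, end]. Returns None when nothing matches, which
    mirrors SUM over an empty LEFT JOIN in SQL."""
    results: List[Optional[int]] = []
    for q in queries:
        matched = [
            e.value
            for e in events
            if e.name == q.name and q.start <= e.time <= q.end
        ]
        results.append(sum(matched) if matched else None)
    return results


queries = [Query("A", 1, 5), Query("B", 2, 3)]
events = [
    Event("A", 3, 10),
    Event("B", 23, 70),
    Event("A", 15, 30),
    Event("B", 36, 6),
    Event("B", 49, 20),
]
totals = naive_temporal_join(queries, events)
```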
<h1 id="query-engine-offline">Query Engine (Offline)</h1>
<p>As we discussed before, abelian groups and commutative monoids should be considered separately because commutative monoids lack the inverse property. The reason is that the optimal algorithms for ranged aggregate queries are different. Intuitively, the more structure we can assume, the faster the algorithm, and we’ll discuss that below. Before we do that, let’s define some variables in our problem:</p>
<p>$N \in \mathbb{N}$ is the number of events containing raw data.</p>
<p>$M \in \mathbb{N}$ is the number of queries the user makes with arbitrary range.</p>
<p>We are considering an offline algorithm in which the queries and events will not mutate.</p>
<h2 id="array-based-algorithms">Array-based Algorithms</h2>
<p>Suppose we use the abelian group $(\mathbb{R}, +)$ as an example, with $M$ queries and $N$ events. Suppose timestamps are discretized (by minutes, for example), and denote $G$ as the number of bins we can put timestamps into. For each bin, we can store the cumulative sum of the aggregate we’re interested in from some starting time up to that bin. Since inverses exist, we only need to look up two cumulative sums and take their difference to answer a range query:</p>
<p><img src="http://oneraynyday.github.io/assets/array_based_cumsum.png" alt="array_based_cumsum" height="60%" width="60%" /></p>
<p>This algorithm is fast because it uses the fact that inverses exist in abelian groups. If $G$ is small, this array can fit in memory. Populating all events takes $O(\max(N,G))$, since we need to build the cumulative sum array after putting $N$ elements into the array bins. Each query is $O(1)$, or $O(M)$ in total. Thus, the overall runtime is $O(\max(N,G) + M)$. The space complexity is $O(G)$ since there are $G$ elements in the array.</p>
<p>It is important to note that we are sacrificing precision for speed in the tradeoff for $G$. It’s a tunable parameter, which adds to the complexity of this algorithm.</p>
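<p>A minimal sketch of the array-based approach for $(\mathbb{R}, +)$, assuming events arrive as hypothetical <code>(bin_index, value)</code> pairs that have already been discretized into $G$ bins:</p>

```python
from itertools import accumulate
from typing import List, Tuple


def build_prefix_sums(events: List[Tuple[int, float]], num_bins: int) -> List[float]:
    """Drop every event into its bin, then build cumulative sums.

    prefix[i] holds the sum of bins[0:i], so prefix[0] is the identity 0.
    """
    bins = [0.0] * num_bins
    for t, value in events:
        bins[t] += value
    return [0.0] + list(accumulate(bins))


def range_sum(prefix: List[float], start: int, end: int) -> float:
    """Sum over bins start..end (inclusive) using two lookups."""
    return prefix[end + 1] - prefix[start]


events = [(3, 10.0), (5, 2.5), (3, 1.5), (9, 4.0)]
prefix = build_prefix_sums(events, num_bins=10)
```

<p>The subtraction in <code>range_sum</code> is where the inverse is used; swapping in max would leave nothing to subtract.</p>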
<h2 id="tree-based-algorithms">Tree-based Algorithms</h2>
<p>Suppose we have a commutative monoid (note that abelian groups are also commutative monoids), then we can use a segment tree for range queries. Given the number of bins for timestamps $G$ as before, we can construct the segment tree with all events in $O(N\log(G))$ time and query $M$ times in $O(M\log(G))$ time. In total, we have $O((N+M) \log(G))$ for runtime and $O(G)$ for space.</p>
<p><img src="http://oneraynyday.github.io/assets/interval_tree_range_query.png" alt="interval_tree_range_query" height="60%" width="60%" /></p>
<p>In the diagram, N is the number of discretized timestamps - not to be confused with $N$, the number of events. This is a very typical use case of segment trees, so typical optimizations like lazy propagation play important practical roles in preventing redundant updates. The segment tree displayed in the diagram above is binary but does not need to be - increasing the branching factor decreases the depth of the tree but adds overhead to queries. Note that when inverses are available we can also use <a href="https://en.wikipedia.org/wiki/Fenwick_tree">Fenwick trees</a> for the same purpose. To keep this blog fairly self-contained, we only discuss segment trees.</p>
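<p>Below is a compact, array-backed segment tree sketch that works for any commutative monoid. The class and method names are hypothetical; it is the common iterative bottom-up layout rather than anything Zipline-specific, and it relies on commutativity when combining partial results.</p>

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class SegmentTree(Generic[T]):
    def __init__(self, size: int, op: Callable[[T, T], T], identity: T) -> None:
        self.size = size
        self.op = op
        self.identity = identity
        # Leaves live at indices size..2*size-1, node i has children 2i and 2i+1.
        self.tree: List[T] = [identity] * (2 * size)

    def update(self, index: int, value: T) -> None:
        """Combine value into the leaf at index and refresh its ancestors."""
        i = index + self.size
        self.tree[i] = self.op(self.tree[i], value)
        i //= 2
        while i >= 1:
            self.tree[i] = self.op(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, start: int, end: int) -> T:
        """Aggregate over bins start..end (inclusive)."""
        result = self.identity
        lo, hi = start + self.size, end + self.size + 1
        while lo < hi:
            if lo % 2 == 1:
                result = self.op(result, self.tree[lo])
                lo += 1
            if hi % 2 == 1:
                hi -= 1
                result = self.op(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result


# Max of transaction values per time bin, with -inf as the identity.
tree = SegmentTree(8, max, float("-inf"))
for t, value in [(1, 30.0), (4, 70.0), (6, 6.0), (4, 20.0)]:
    tree.update(t, value)
```

<p>Both <code>update</code> and <code>query</code> touch a logarithmic number of nodes, which is where the $\log(G)$ factor comes from.</p>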
<h3 id="alternative-segment-tree-representation">Alternative Segment Tree Representation</h3>
<p>An observation we should make is that we don’t care about the order of elements on the time scale as long as we know which queries should take each element into account (this implies that the queries are fixed). One optimization we can make to the segment tree is to have leaves represent the spaces between endpoints of intervals in the queries rather than the integral timestamps:</p>
<p><img src="http://oneraynyday.github.io/assets/segment_tree_optimization.png" alt="segment_tree_optimization" height="60%" width="60%" /></p>
<p>For any $M$ queries, we have at most $2M$ leaves in this segment tree, so each insertion and each query takes $O(\log(M))$. The total runtime is $O((N+M)\log(M))$, which is independent of $G$. The space complexity is now $O(M)$.</p>
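<p>To make the leaf construction concrete, here is a sketch of the endpoint compression step, assuming half-open query ranges <code>[start, end)</code> and sum as the aggregate. For brevity the leaves are folded with a plain <code>sum</code> over a slice; in the real algorithm they would be the leaves of a segment tree so that each query stays logarithmic.</p>

```python
from bisect import bisect_right
from typing import List, Tuple


def compressed_range_sums(
    queries: List[Tuple[float, float]],
    events: List[Tuple[float, float]],
) -> List[float]:
    # Every distinct query endpoint becomes a leaf boundary.
    boundaries = sorted({t for q in queries for t in q})
    # leaves[i] aggregates events in [boundaries[i], boundaries[i + 1]).
    leaves = [0.0] * (len(boundaries) - 1)
    for t, value in events:
        i = bisect_right(boundaries, t) - 1
        if 0 <= i < len(leaves):  # events outside every query are dropped
            leaves[i] += value
    index = {b: i for i, b in enumerate(boundaries)}
    # A query [s, e) is exactly the run of leaves between its two endpoints.
    return [sum(leaves[index[s]:index[e]]) for s, e in queries]


queries = [(1.5, 7.25), (3.0, 10.0)]
events = [(2.0, 10.0), (3.0, 5.0), (7.25, 1.0), (9.5, 2.0), (0.5, 100.0), (10.0, 50.0)]
sums = compressed_range_sums(queries, events)
```

<p>Note that the timestamps never need to be discretized: only their position relative to the query endpoints matters.</p>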
<h2 id="skiplist-based-algorithms">Skiplist-based Algorithms</h2>
<p><a href="https://en.wikipedia.org/wiki/Skip_list">Skiplists</a> are data structures which allow access to a particular element in $O(\log N)$ time, by adding hopping links to a linked list so that any point in the list can be reached in a logarithmic number of traversals using exponentially increasing hop sizes. In the same way, we can decompose any query into a (roughly) logarithmic number of sub-queries, each of which can be performed in $O(1)$ time. Skiplists are commonly used in the index engines for relational databases and K/V stores, and <strong>Zipline is currently using the skiplist approach for both the online and offline use cases</strong>. As a result, I’ve provided more empirical results and recommendations for this algorithm.</p>
<p>Before we accept events, we tile the timeline into windows of several sizes, each window size geometrically larger than the previous (refer to the below diagram for an example). Each event needs to update a single window at each granularity (there is a logarithmic number of different window sizes). This algorithm works for any commutative monoid. In practice, the precision is limited to some granularity to reduce memory pressure, e.g. using seconds as the smallest window size when events come in at the millisecond scale.</p>
<p><img src="http://oneraynyday.github.io/assets/skiplist_integral.png" alt="skiplist_integral" height="60%" width="60%" /></p>
<p>The number of windows we need to query for any given range is logarithmic as we can see in the example above. If many queries have large overlaps, the windows’ results can be cached for quick re-access.</p>
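<p>Here is a sketch of the tiling with doubling window sizes and integral timestamps. The function names and the <code>max_size</code> cap are assumptions for illustration; Zipline’s actual window sizes and storage layout are not shown in this post.</p>

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Window = Tuple[int, int]  # (start, size)


def dyadic_windows(start: int, end: int, max_size: int) -> List[Window]:
    """Split [start, end) into aligned windows whose sizes are powers of two."""
    windows = []
    while start < end:
        size = 1
        # Grow the window while it stays aligned, fits, and respects the cap.
        while (2 * size <= max_size
               and start % (2 * size) == 0
               and 2 * size <= end - start):
            size *= 2
        windows.append((start, size))
        start += size
    return windows


def add_event(table: DefaultDict[Window, float], t: int, value: float, max_size: int) -> None:
    """Update the one window containing t at every granularity."""
    size = 1
    while size <= max_size:
        table[(t - t % size, size)] += value
        size *= 2


table: DefaultDict[Window, float] = defaultdict(float)
for t, value in [(3, 10.0), (5, 2.0), (9, 4.0), (13, 1.0), (14, 8.0)]:
    add_event(table, t, value, max_size=8)

windows = dyadic_windows(3, 14, max_size=8)
total = sum(table[w] for w in windows)
```

<p>The range <code>[3, 14)</code> is answered with four window lookups instead of eleven per-timestamp lookups, and the event at time 14 is correctly left out.</p>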
<p>We take care of any query requiring more precise windows with a concept in Zipline called <strong>accumulations</strong>. This part isn’t necessary for understanding the skiplist implementation if you don’t care about fine-grained precision, i.e. if we’re only dealing with queries with integral precision in the example above.</p>
<details class="red">
<summary class="collapse"><strong>So what are accumulations?</strong>
</summary>
<p><strong>Accumulations</strong> in Zipline are query-specific intervals between the smallest granularity and the endpoint, which are used to construct the “tails” of intervals that do not fit nicely into the smallest granularity. One good thing about accumulations is that they work well with “infinite precision” ranges whose endpoints can be irrational, transcendental, or repeating decimals.</p>
<p><img src="http://oneraynyday.github.io/assets/skiplist.png" alt="skiplist" height="35%" width="35%" /></p>
<p>As we can see in the diagram above, the accumulations are used to compute the remainders in the start and end of intervals.</p>
<p>Without accumulations, the overall algorithm is $O((M+N) \log(G))$ in time complexity and $O(G)$ in space complexity. With accumulations, we must deal with the case where multiple interval endpoints lie within a single window of the smallest granularity. In the case that all intervals lie within the same accumulation, the problem degenerates into the above segment tree algorithm in the alternative form.</p>
<p><img src="http://oneraynyday.github.io/assets/skiplist_cumulations.png" alt="skiplist_cumulations" height="35%" width="35%" /></p>
<p>So the basic idea of accumulations is to <em>delegate the smaller remainder interval aggregate calculation to a different algorithm, one that can handle arbitrary precision on the start and end times of intervals.</em></p>
</details>
<p>Overall, the worst case is a logarithmic number of skip list queries with a logarithmic range query in the accumulation. This amounts to $O((M+N) (log(G) + log(M)))$ for time complexity and $O(G+M)$ for space complexity.</p>
<p><em>NOTE: The skiplist approach has many similarities with the tree approach, and one can think of the two as equivalent.</em></p>
<h3 id="theoretical-optimization">Theoretical Optimization</h3>
<p>In Chazelle’s paper <a href="https://www.cs.princeton.edu/~chazelle/pubs/ComputPartialSumsMultiArrays.pdf">here</a>, the algorithm aims to solve a generalized version of this problem. Specifically, given a $d$-dimensional rectangular query $q = [a_1,b_1] \times \dots \times [a_d, b_d]$, compute partial sums of data points in the array $A$.</p>
\[\sum_{(k_1,...,k_d)\in q} A[k_1,...,k_d]\]
<p>In our case, we are only querying based on time, which is a one-dimensional query, so we can rephrase it as:</p>
\[\sum_{t \in q} A[t]\]
<p>Where $A[t]$ is the unaggregated raw data associated with the event at time $t$. The result is that with $N$ events and $M$ queries, it is possible to retrieve the aggregates in $O(M*\alpha(M,N))$, where $\alpha$ is the inverse Ackermann function. The partial sum block sizes are dictated by the growth function $R(t,k)$, where $t$ is the runtime calculated in the number of sums required for any range query, and $kM$ is the array space required for allocation. The two-parameter growth function is defined recursively in a format similar to the Ackermann function.</p>
\[\begin{aligned}
R(1,k) &= 1 \quad \forall k \geq 1 \\
R(t, 1) &= 4 \quad \forall t > 1 \\
R(t, k) &= R(t, k-3)R(t-2, R(t, k-3)) \quad \text{otherwise}
\end{aligned}\]
<p>The above is only defined for $k = 1+3n$ and $t = 1 + 2n$ with $n \in \mathbb{N}$, but we use the growth function to upper-bound the scheme required to build the partial sum blocks. The scheme finds the smallest $(t,k)$ pair, with priority given to minimizing $t$, such that $R(t,k)$ is greater than the time range. The range is then divided into partial sum chunks of size $R(t, k-3)$, and the process repeats recursively.</p>
<p>In reality, this is similar to our algorithm but generalized to multidimensional rectangular queries and using a different step size for partial aggregates. We create partial sum windows that grow exponentially, yielding logarithmic runtime complexity. The result above uses windows growing roughly at the rate of the Ackermann function, and thus it yields inverse Ackermann runtime complexity. In practice, no software system has come close to the scale that would require inverse Ackermann, and we could use the first few levels of Ackermann function block sizes to get sub-logarithmic time complexity for our use case. Even then, it may be overkill and introduce more overhead than expected.</p>
<h1 id="other-considerations">Other considerations</h1>
<h2 id="relaxing-requirements-for-optimizations">Relaxing requirements for optimizations</h2>
<h3 id="windowed-vs-non-windowed">Windowed vs. Non-Windowed</h3>
<p>We describe a set of queries to be “windowed” if each query can have an arbitrary beginning time, and “non-windowed” if each query in the set has a fixed beginning time and an arbitrary end time. Non-windowed queries are common - and they often ask questions about the running state of a user, like “how many items did this person purchase on our platform until some date?”. If we fix the beginning time for all queries, there would be small optimizations we can make to the existing algorithms.</p>
<p>One such optimization takes advantage of the fact that we can answer all the queries in a batch setting offline. We can construct an array of partial aggregates separated by the end times of the queries. To insert an event, we perform a binary search on the boundary times, which takes $O(\log(M))$. Finally, we perform an accumulation of the partial sums in-place:</p>
<p><img src="http://oneraynyday.github.io/assets/unwindowed_optimization.png" alt="unwindowed_optimization" height="60%" width="60%" /></p>
<p>To insert a particular event into an optimized segment tree (separated by endpoints of queries, not time), previously we had to search in $O(\log(M))$ time and update in $O(\log(M))$ time. Now, we only need to construct partial sums, which requires a single update per event, with an accumulation afterwards.</p>
<p>Although it’s not an asymptotic upgrade in runtime (both require logarithmic time in search), in systems like Zipline, this has been shown to improve practical performance significantly.</p>
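<p>The non-windowed optimization can be sketched as follows, assuming every query shares the same beginning time, sum is the aggregate, and end times are inclusive. The function name is hypothetical.</p>

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Tuple


def non_windowed_sums(end_times: List[int], events: List[Tuple[int, float]]) -> List[float]:
    """Answer 'total up to end_time' for every query in one batch."""
    ends = sorted(set(end_times))
    partial = [0.0] * len(ends)
    for t, value in events:
        # Index of the first query end time that is >= t.
        i = bisect_left(ends, t)
        if i < len(ends):  # events after the last end time affect no query
            partial[i] += value
    # One pass turns per-bucket partial sums into running totals.
    running = list(accumulate(partial))
    position = {end: i for i, end in enumerate(ends)}
    return [running[position[end]] for end in end_times]


end_times = [10, 4, 7]
events = [(1, 5.0), (4, 1.0), (6, 2.0), (8, 3.0), (11, 9.0)]
totals = non_windowed_sums(end_times, events)
```

<p>Each event costs one binary search and one addition, and the accumulation pass is a single $O(M)$ sweep at the end.</p>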
<h2 id="online-equivalents">Online equivalents</h2>
<p>In an online system, the problem becomes invariably harder. In the offline setting, we mostly cared about <strong>correctness and performance in terms of throughput.</strong> In the online case, we additionally care about <strong>performance in terms of latency.</strong> In the case we want to unify online and offline systems into a single abstraction, we also care about <strong>consistency</strong>, or as we previously called it: <strong>online-offline consistency</strong>. The consistency guarantee is that the online and offline results must be identical given the same inputs. We obviously wish to have all of these properties but under specific load and conditions, we may need to sacrifice one or more of these requirements.</p>
<h3 id="tree-based-algorithms-1">Tree-based Algorithms</h3>
<p>The longer the online system stays up, the bigger the segment tree becomes. Adding leaf nodes to an existing segment tree to increase the range is usually not a good idea. Instead, the strategy is to create a new root node which covers twice the range of the current tree, and attach the current root as its left child. In an online setting, the allocation of a new root and a new subtree denoting events in the future time frame can be done in the background when a time threshold is met.</p>
<p>Because we don’t know when the queries will come, we cannot apply the optimization to the segment tree to only consider the query windows as ranges. Therefore, we must opt for a segment tree with ranges on time intervals. Storing a full binary segment tree would be inefficient, since there are $\sim 3 \times 10^{10}$ milliseconds in a year. Although it may be possible to store the segment tree completely in high-memory instances, we can do better. An observation is that we don’t need the full segment tree to be specified. The tree itself can be relatively sparse, and only contain a leaf at an event’s timestamp. The absence of a tree node signifies $0$ events within that range. Below is an illustration of that:</p>
<p><img src="http://oneraynyday.github.io/assets/distributed_segment_tree.png" alt="distributed_segment_tree" height="60%" width="60%" /></p>
<p>Then, at any point in time the tree will have at most $\min(G, N)$ leaves. If we’re dealing with a 1-year max range at millisecond granularity and a binary segment tree, then every leaf (which is associated with an event) would require on the order of $\sim 35$ traversals down the tree. It is likely that with a high number of events, the tree would not be sparse, and the traversals will lead to cache misses. If ultra-low latency is not a primary concern, this design would be very efficient.</p>
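<p>One way to sketch such a sparse tree is to keep nodes in a hash map and treat missing nodes as the identity. The class below is an illustration under that assumption (sum as the aggregate, millisecond timestamps, hypothetical names), not a description of a production layout.</p>

```python
from typing import Dict


class SparseSegmentTree:
    """Sum segment tree over [0, 2**depth) that only stores touched nodes."""

    def __init__(self, depth: int) -> None:
        self.size = 2 ** depth
        # Heap numbering: the root is 1, node i has children 2i and 2i+1,
        # and the leaf for timestamp t is t + size.
        self.nodes: Dict[int, float] = {}

    def add(self, t: int, value: float) -> None:
        """Add value at timestamp t, creating nodes on the path as needed."""
        i = t + self.size
        while i >= 1:
            self.nodes[i] = self.nodes.get(i, 0.0) + value
            i //= 2

    def range_sum(self, start: int, end: int) -> float:
        """Sum over timestamps start..end (inclusive); absent nodes count as 0."""
        total = 0.0
        lo, hi = start + self.size, end + self.size + 1
        while lo < hi:
            if lo % 2 == 1:
                total += self.nodes.get(lo, 0.0)
                lo += 1
            if hi % 2 == 1:
                hi -= 1
                total += self.nodes.get(hi, 0.0)
            lo //= 2
            hi //= 2
        return total


# 2**35 milliseconds covers a bit more than one year.
tree = SparseSegmentTree(depth=35)
tree.add(1_000, 5.0)
tree.add(86_400_000, 7.0)
tree.add(20_000_000_000, 1.0)
```

<p>Three events create at most $3 \times 36$ nodes, even though the full tree would have $2^{36}$ slots. The hash map lookups are also where the cache misses mentioned above would come from.</p>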
<p>Suppose we have a tree too big to fit in memory, e.g. the time scale expands past a year and the tree needs to double in size. We can assume the left subtrees will rarely be accessed or updated in an online setting. For updates, incoming stream data may be late by at most a few minutes to hours, and for queries, the features are usually computed with recent aggregates. To reduce the memory pressure, we can compress the left subtree (which could be very sparse) into a disk-friendly segment tree which can then be queried later in the offline case.</p>
<p>There are many types of distributed segment trees out there, like ones based on distributed hash tables such as <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2007/03/tr-2007-30.pdf">Shen et al.</a>. These implementations are generic and don’t optimize for the access patterns that Zipline sees in an online setting. We keep the front of the tree in memory to make the majority of queries efficient, whereas distributed segment trees are not specifically built for front-biased queries.</p>
<h4 id="a-case-study">A Case Study</h4>
<p>Suppose we have an online system running with the above design for 1 year, and events are ingested at a rate of 100/sec. Furthermore, we assume each event’s timestamp is unique, with granularity up to the millisecond scale. We’re interested in the time it takes to query for an aggregate from 10 days ($\sim 2^{30}$ milliseconds) ago until now. In total, we have $100 * (60 * 60 * 24 * 365) \approx 3 * 10^9$ events stored, which is also the number of leaves in our segment tree. Suppose we perform the storage optimization so that only 1 month of aggregates is stored in memory, and the rest is on disk in an optimized format. Then we have $\sim 3 * 10^8$ events in memory and $\sim2.7 * 10^9$ on disk. If we represent each node in the segment tree with $128$ bytes (which for most cases is reasonable except for topK, HyperLogLog, etc.), then we’ve only used on the order of $\sim3 * 10^{10}$ bytes, or $\sim 30$GB, which is reasonable for a typical server machine with $\geq 100$ GB of RAM (a full binary segment tree only doubles the number of nodes; ours, being sparse, has a slightly higher constant factor overhead). We also have roughly $\sim 350$GB ($2.7 * 10^9$ events at $128$ bytes each) times a constant factor (to create an on-disk segment tree) worth of events stored in block storage, typically on SSDs.</p>
<p>Given a query of 10 days, the entire query can be done in memory. Since 10 days is equal to $\sim 2^{30}$ milliseconds, the subtree’s depth is at most 30. Suppose in the worst case, we query for $\log_2(2^{30})=30$ nodes in our data structure for the range query, each of which is one level deeper than the last. In this worst case of a binary segment tree, we need to traverse $2\log_2(R)$ nodes, where $R$ is the size of the queried range, which in this case is 60. Equivalently, it is the number of memory locations we need to access, all relatively far away from each other. Suppose we don’t run into any page faults, then the worst case scenario is 60 local DRAM accesses, each of which takes roughly $100ns$. In total, the query would take 6 microseconds just for the tree traversal itself. However, a single network hop even within the same AWS VPC is on the order of milliseconds as discussed <a href="https://stackoverflow.com/questions/54190445/aws-latency-between-zones-within-a-same-region#:~:text=For%20your%20use%20case%2C%20database,the%20same%20zone%20as%20RDS.">here</a>, so the traversal cost is fairly negligible in the typical service-oriented architecture that many companies are converging on. Unless the feature store needs to be an embedded application (in which case we do not need a network hop) and we are dealing with sub-microsecond latency requirements, this approach operates well within requirements.</p>
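<p>The back-of-the-envelope numbers for the in-memory part of this case study can be reproduced with a few lines of arithmetic. The constants simply restate the assumptions above (100 events/sec, 128-byte nodes, one month kept in memory, roughly 100ns per DRAM access); nothing here is measured.</p>

```python
import math

EVENTS_PER_SECOND = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
NODE_BYTES = 128
DRAM_ACCESS_NS = 100

# One year of events, which is also the number of leaves in the tree.
events_per_year = EVENTS_PER_SECOND * SECONDS_PER_YEAR

# Keep roughly one month of leaves in memory.
events_in_memory = events_per_year // 12
memory_bytes = events_in_memory * NODE_BYTES

# A 10-day query range in milliseconds and the depth of the subtree covering it.
ten_days_ms = 10 * 24 * 60 * 60 * 1000
subtree_depth = math.ceil(math.log2(ten_days_ms))

# Worst case: two root-to-leaf style descents, one memory access per node.
nodes_touched = 2 * subtree_depth
traversal_ns = nodes_touched * DRAM_ACCESS_NS
```

<p>This gives about 33.6GB for the in-memory leaves and 6000ns (6 microseconds) for the traversal, in line with the rounded figures used in the text.</p>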
<h3 id="skiplist-based-algorithms-1">Skiplist-based Algorithms</h3>
<p>The reason we’ve mentioned the skiplist-based algorithm above, even though it performs slightly worse than the segment tree approach, is that a segment tree is fixed to a specific interval. The skiplist alternative can be easily fitted for online feature serving because we can simply add new windows for aggregates. The simplest generalization of the skiplist to the online setting rests on the assumption that the immediate features must be taken into account, but the window size could be slightly off. We begin by rounding the start times of queries down to fit the biggest sub-intervals, then add increasingly smaller intervals until we reach the accumulation. Below is a diagram for this:</p>
<p><img src="http://oneraynyday.github.io/assets/skiplist_error.png" alt="skiplist_error" height="50%" width="50%" /></p>
<p>If we define the ratio of error interval with the original interval, we can prove that for any window size that grows exponentially(with integral growth factor > 1), the range of error can be anywhere in $[0, 1)$. Having a near-100% error ratio is bad for various reasons, but there is a way to trade off constant overhead for a lower error margin. We can define the max size window used for ranges of intervals to decrease the error margin. For example, if we have doubling window sizes, and we have an interval of length 9, then by default we’d use 2 windows of size 8, as it’s the biggest interval we used. However, that is almost a 100% error ratio(⅞), and we can decrease this by forcing intervals of length 8-16 to use window sizes that only go up to 4, not 8. In this case, we’d use 3 windows of size 4, leaving the error ratio to be closer to 50%(⅜). In the doubling window case, we can get the error to $1/ 2^k$ if we increase our expected number of windows by $2^k$ times. Similar arguments hold for different exponentially growing window sizes. In Zipline, these restrictions on max window sizes are called “hops”.</p>
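<p>The effect of capping the window size can be sketched with a few lines of arithmetic. This is a deliberately simplified model (cover the interval using only windows of the capped size, rounding up); the names <code class="language-plaintext highlighter-rouge">Cover</code> and <code class="language-plaintext highlighter-rouge">cover_with_cap</code> are hypothetical and this is not Zipline’s actual code.</p>

```cpp
#include <cassert>
#include <cstdint>

struct Cover {
    std::uint64_t windows;    // number of fixed-size windows used
    std::uint64_t overshoot;  // extra length covered beyond the requested interval
};

// Simplified model: cover an interval of `length` using only windows of size
// `max_window`, rounding the number of windows up.
Cover cover_with_cap(std::uint64_t length, std::uint64_t max_window) {
    const std::uint64_t windows = (length + max_window - 1) / max_window;  // ceiling division
    return {windows, windows * max_window - length};
}

// Usage sketch mirroring the length-9 example in the text.
void cover_example() {
    const Cover coarse = cover_with_cap(9, 8);  // two windows of size 8
    const Cover fine = cover_with_cap(9, 4);    // three windows of size 4
    assert(coarse.windows == 2 && coarse.overshoot == 7);
    assert(fine.windows == 3 && fine.overshoot == 3);
}
```

<p>Halving the cap costs one extra window here but shrinks the overshoot from 7 to 3, which is the trade-off between constant overhead and error margin described above.</p>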
<p>Note that the above greedy approach is the same as the coin change greedy algorithm, which <a href="https://graal.ens-lyon.fr/~abenoit/algo09/coins2.pdf">requires the coin system to be canonical</a> according to Pearson et al. A coin system is canonical if the greedy solution for arbitrary change (picking the largest denominations first) is always optimal. To verify that our window sizes are indeed canonical, the algorithm presented by Pearson et al. is $O(N^3)$, where $N$ is the number of different coin denominations.</p>
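<p>For small systems, the canonical property can also be checked by brute force. The sketch below is not Pearson’s algorithm; it simply compares the greedy count against a dynamic-programming optimum for every amount up to a caller-chosen limit. The function names and the choice of limit are assumptions for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Greedy: repeatedly take the largest denomination that fits.
// Assumes `coins` is sorted ascending and contains 1, so every amount is payable.
int greedy_count(const std::vector<int>& coins, int amount) {
    int used = 0;
    for (auto it = coins.rbegin(); it != coins.rend(); ++it) {
        used += amount / *it;
        amount %= *it;
    }
    return used;
}

// Brute-force check: greedy must match the optimal coin count for every
// amount in [1, limit]. Returns false at the first counterexample.
bool greedy_is_optimal_up_to(const std::vector<int>& coins, int limit) {
    const int unreachable = std::numeric_limits<int>::max();
    std::vector<int> best(limit + 1, unreachable);
    best[0] = 0;
    for (int amount = 1; amount <= limit; ++amount) {
        for (int c : coins) {
            if (c <= amount && best[amount - c] != unreachable) {
                best[amount] = std::min(best[amount], best[amount - c] + 1);
            }
        }
        if (greedy_count(coins, amount) != best[amount]) return false;
    }
    return true;
}

// Usage sketch: doubling windows pass, while {1, 3, 4} fails at amount 6
// (greedy picks 4 + 1 + 1, but 3 + 3 is shorter).
void canonical_example() {
    assert(greedy_is_optimal_up_to({1, 2, 4, 8}, 64));
    assert(!greedy_is_optimal_up_to({1, 3, 4}, 10));
}
```

<p>This only certifies the amounts it actually tests, so it is a sanity check for a handful of window sizes rather than a replacement for the polynomial-time test.</p>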
<p>In addition, in the real world, performance may come second to <strong>online-offline consistency</strong>. To be consistent, any range query performed in the offline setting must report the same result in the online setting. This is because model prototyping, training and backtesting are often done in an offline setting. Feeding a model slightly different input could lead to large perturbations in performance metrics. In the world of general software engineering, consistency is often chosen in the tradeoff between consistency and performance. Since Zipline is implemented using the skiplist approach and requires consistency, the offline equivalent used in Zipline returns an approximation of the range query by rounding the start date to ensure the online and offline algorithms are identical.</p>
<p>Although the skiplist approach sacrifices correctness slightly, it should be faster in practice since it incurs fewer cache misses (all of the partial sums of a particular window size are contiguous in memory), and it is simpler to implement. For large aggregates, such as hyperloglog and topK, this architecture can handle large queries better than the segment tree approach due to sequential reads on disk.</p>
<h1 id="conference-talks">Conference Talks</h1>
<p>Of course, this blog focused on the engine portion of the whole process, but did not cover some crucial details such as the DSL (domain specific language) for the feature store queries, the integration of the feature store with existing data sources in a typical company, implementation using Spark, etc. These topics are covered in my coworkers’ talks shown below, specifically for Airbnb’s Zipline project:</p>
<hr />
<p><strong>Nikhil at Strange Loop</strong></p>
<div class="video-container">
<iframe width="1120" height="630" src="https://www.youtube.com/embed/0HttRa2cXig" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
<hr />
<p><strong>Varant and Evgeny at Spark Summit</strong></p>
<div class="video-container">
<iframe width="1120" height="630" src="https://www.youtube.com/embed/iUnO4MLAGDU" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
<h1 id="conclusion">Conclusion</h1>
<p>Feature stores are an upcoming technology that enables accelerated development of robust and powerful machine learning applications. Large parts of boilerplate offline model training can be abstracted away, and data scientists can now use point-in-time correct features in an online setting. For Airbnb, this was a humongous leap in efficacy of machine learning models, especially within the fraud detection organization. As with any adequately complicated piece of infrastructure, there will always be theoretical and practical improvements in the future that this blog fails to cover. I hope to update this entry when I’m notified of future developments!</p>
<p><em>Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer.</em></p>
<script src="https://utteranc.es/client.js" repo="OneRaynyDay/oneraynyday.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async=""> </script>
The Story of Value Categories in C++2020-07-03T00:00:00+00:00http://oneraynyday.github.io/dev/2020/07/03/Value-Categories<p>When you see the C++ standard’s specification on <a href="https://en.cppreference.com/w/cpp/language/value_category">value categories</a>, it’s likely your eyes glaze over and you think to yourself - “why do I ever need to know this?”. At least that’s what I thought when I first took a glance. Over the years, I’ve run into issues assuming a simplistic model of value categories, and I regret not taking a long hard look at the standard in the beginning. This is my attempt to explain value categories in a fun history lesson (at the time of writing, this view is representative of up to C++20).</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#cpl-early-1960s" id="markdown-toc-cpl-early-1960s">CPL (early 1960s)</a></li>
<li><a href="#c-1970s" id="markdown-toc-c-1970s">C (1970s)</a></li>
<li><a href="#c98-1998" id="markdown-toc-c98-1998">C++98 (1998)</a> <ul>
<li><a href="#extensions-to-lvalues" id="markdown-toc-extensions-to-lvalues">Extensions to lvalues</a></li>
<li><a href="#introduction-to-rvalues" id="markdown-toc-introduction-to-rvalues">Introduction to rvalues</a></li>
<li><a href="#references-are-added" id="markdown-toc-references-are-added">References are added</a> <ul>
<li><a href="#const-references---replacing-pass-by-value" id="markdown-toc-const-references---replacing-pass-by-value"><code class="language-plaintext highlighter-rouge">const</code> references - replacing pass-by-value</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#c11-2011" id="markdown-toc-c11-2011">C++11 (2011)</a></li>
<li><a href="#c17-2017" id="markdown-toc-c17-2017">C++17 (2017)</a> <ul>
<li><a href="#more-on-rvo" id="markdown-toc-more-on-rvo">More on RVO</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<style>
.red {
color:inherit;
}
.red:hover {
color:rgb(129, 17, 18);
}
.collapse:hover {
cursor: pointer;
}
</style>
<h2 id="cpl-early-1960s">CPL (early 1960s)</h2>
<p>According to the standards, the archaic language <a href="https://en.wikipedia.org/wiki/CPL_(programming_language)">CPL</a> used the terms <em>“right-hand mode”</em> and <em>“left-hand mode”</em> to describe the semantics of particular expressions. When an expression was evaluated in <em>left-hand mode</em>, it yielded an address, and when it was evaluated in <em>right-hand mode</em>, it yielded a “rule for the computation of a value”. In C/C++, the rule is executed and we retrieve a value on the right hand side.</p>
<h2 id="c-1970s">C (1970s)</h2>
<p>Now comes C, which started using the term “lvalue” in its standards. People debated whether the “l” in “lvalue” stood for “locator” or for “left-hand”, but one thing was for sure - it refers to an <strong>object</strong>. This term, in C, is used to describe an entity that is in storage <em>somewhere</em>, with a location. You can access an object through the name of the variable, or by dereferencing a pointer to its location. Here are some examples of lvalues in C:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span><span class="p">;</span> <span class="c1">// a is an lvalue</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// *b is an lvalue</span>
<span class="kt">int</span> <span class="n">c</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span> <span class="c1">// c[0] is an lvalue</span>
<span class="k">struct</span> <span class="nc">e</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span> <span class="p">}</span> <span class="n">s</span><span class="p">;</span> <span class="c1">// s.x is an lvalue</span>
</code></pre></div></div>
<h2 id="c98-1998">C++98 (1998)</h2>
<h3 id="extensions-to-lvalues">Extensions to lvalues</h3>
<p>When C++98 came out, it adopted the idea of an “lvalue”, and used the term “rvalue” for any expression in C++ that was not an lvalue. It also added functions to the lvalue category:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">f</span><span class="p">()</span> <span class="p">{}</span> <span class="c1">// f is an lvalue</span>
</code></pre></div></div>
<h3 id="introduction-to-rvalues">Introduction to rvalues</h3>
<p>rvalues are basically what C considered non-lvalues. However, there are a few caveats with the concept, like the <strong>lvalue-to-rvalue conversion</strong>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">3</span><span class="p">;</span> <span class="c1">// 3 is an rvalue</span>
<span class="sc">'a'</span><span class="p">;</span> <span class="c1">// 'a' is an rvalue</span>
<span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">;</span> <span class="n">a</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// the expression `b` is converted to an rvalue in `a = b`.</span>
</code></pre></div></div>
<p>Don’t believe me? Here’s the clang AST dump which clearly says that <code class="language-plaintext highlighter-rouge">b</code> is implicitly cast to an rvalue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
`-DeclStmt 0x559e56a03270 <line:3:5, col:14>
`-VarDecl 0x559e56a031b8 <col:5, col:13> col:9 a 'int' cinit
`-ImplicitCastExpr 0x559e56a03258 <col:13> 'int' <LValueToRValue>
`-DeclRefExpr 0x559e56a03220 <col:13> 'int' lvalue Var 0x559e56a030f8 'b' 'int'
</code></pre></div></div>
<p>The intuition here is that <code class="language-plaintext highlighter-rouge">a</code> wants to be given a value, so we retrieve the value that <code class="language-plaintext highlighter-rouge">b</code> has and load it into a CPU register to then give to <code class="language-plaintext highlighter-rouge">a</code> (in reality it could be done differently). The state of the value being in a register is considered an rvalue, and hence we cast the expression <code class="language-plaintext highlighter-rouge">b</code> into an rvalue.</p>
<h3 id="references-are-added">References are added</h3>
<p>In addition, C++ introduced a nice abstraction called references(which we now know as lvalue references). If you’ve written C++, you’ve likely used it. Here’s an example of a reference:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span><span class="o">&</span> <span class="n">b</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span><span class="o">&</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// invalid! 1 is not an lvalue.</span>
</code></pre></div></div>
<p>Bjarne Stroustrup explains that the reason he added references to C++ was a <a href="https://www.stroustrup.com/bs_faq2.html#pointers-and-references">need for syntactical sugar for operator overloading</a>. However, Ben Saks explains in his <a href="https://www.youtube.com/watch?v=XS2JddPq7GQ">CppCon talk</a> that some operator overloading functions are simply impossible to write without references:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">integer</span> <span class="p">{</span>
<span class="n">integer</span><span class="p">(</span><span class="kt">int</span> <span class="n">v</span><span class="p">)</span> <span class="o">:</span> <span class="n">value</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">int</span> <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// This doesn't actually modify i</span>
<span class="kt">void</span> <span class="k">operator</span><span class="o">++</span><span class="p">(</span><span class="n">integer</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// This is ill-formed and won't compile -</span>
<span class="c1">// There is already an operator++ for integer*</span>
<span class="c1">// which moves the pointer itself.</span>
<span class="kt">void</span> <span class="k">operator</span><span class="o">++</span><span class="p">(</span><span class="n">integer</span><span class="o">*</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="o">-></span><span class="n">value</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// This is our only option.</span>
<span class="kt">void</span> <span class="k">operator</span><span class="o">++</span><span class="p">(</span><span class="n">integer</span><span class="o">&</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="const-references---replacing-pass-by-value"><code class="language-plaintext highlighter-rouge">const</code> references - replacing pass-by-value</h4>
<p>With the introduction of references, C++ programmers now have the option to pass by reference into a function rather than pass by pointer in C. However, this came with its own set of challenges. Suppose we allow addition of two <code class="language-plaintext highlighter-rouge">integer</code>s:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bad! We copied the integer when it wasn't necessary.</span>
<span class="n">integer</span> <span class="k">operator</span><span class="o">+</span><span class="p">(</span><span class="n">integer</span> <span class="n">i1</span><span class="p">,</span> <span class="n">integer</span> <span class="n">i2</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">integer</span><span class="p">(</span><span class="n">i1</span><span class="p">.</span><span class="n">value</span> <span class="o">+</span> <span class="n">i2</span><span class="p">.</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// ill-formed and won't compile.</span>
<span class="n">integer</span> <span class="k">operator</span><span class="o">+</span><span class="p">(</span><span class="n">integer</span><span class="o">*</span> <span class="n">i1</span><span class="p">,</span> <span class="n">integer</span><span class="o">*</span> <span class="n">i2</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">integer</span><span class="p">(</span><span class="n">i1</span><span class="o">-></span><span class="n">value</span> <span class="o">+</span> <span class="n">i2</span><span class="o">-></span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So the C-style operator overloading won’t work here as expected. Let’s try references!</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bad! doesn't work for all situations!</span>
<span class="c1">// It works for:</span>
<span class="c1">// integer i1(2);</span>
<span class="c1">// integer i2(3);</span>
<span class="c1">// i1 + i2;</span>
<span class="c1">// integer(2) + integer(3) should yield integer(5)</span>
<span class="c1">// but these are rvalues, so it won't compile!</span>
<span class="n">integer</span> <span class="k">operator</span><span class="o">+</span><span class="p">(</span><span class="n">integer</span><span class="o">&</span> <span class="n">i1</span><span class="p">,</span> <span class="n">integer</span><span class="o">&</span> <span class="n">i2</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">integer</span><span class="p">(</span><span class="n">i1</span><span class="p">.</span><span class="n">value</span> <span class="o">+</span> <span class="n">i2</span><span class="p">.</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// This is our only option.</span>
<span class="n">integer</span> <span class="k">operator</span><span class="o">+</span><span class="p">(</span><span class="k">const</span> <span class="n">integer</span><span class="o">&</span> <span class="n">i1</span><span class="p">,</span> <span class="k">const</span> <span class="n">integer</span><span class="o">&</span> <span class="n">i2</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">integer</span><span class="p">(</span><span class="n">i1</span><span class="p">.</span><span class="n">value</span> <span class="o">+</span> <span class="n">i2</span><span class="p">.</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The reason that only <code class="language-plaintext highlighter-rouge">const integer&</code> works like this is because of <a href="https://en.cppreference.com/w/cpp/language/implicit_conversion#Temporary_materialization">temporary materialization</a> which binds the temporary <code class="language-plaintext highlighter-rouge">integer(2)</code> and <code class="language-plaintext highlighter-rouge">integer(3)</code> to <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> respectively, and they will exist in memory somewhere.</p>
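<p>As a quick sanity check, the <code class="language-plaintext highlighter-rouge">const integer&</code> overload can be exercised with lvalues, temporaries and an implicit conversion in one place. This is a minimal sketch that repeats the <code class="language-plaintext highlighter-rouge">integer</code> type from above; the helper <code class="language-plaintext highlighter-rouge">add_examples</code> is a hypothetical name added only for illustration.</p>

```cpp
#include <cassert>

struct integer {
    integer(int v) : value(v) {}
    int value;
};

// const references accept both named objects and temporaries.
integer operator+(const integer& i1, const integer& i2) {
    return integer(i1.value + i2.value);
}

int add_examples() {
    integer a(2);
    integer b(3);
    integer from_lvalues = a + b;                        // binds to the named objects a and b
    integer from_temporaries = integer(2) + integer(3);  // binds to two temporaries
    integer from_int = a + 3;                            // 3 is converted to a temporary integer first
    assert(from_lvalues.value == 5);
    assert(from_temporaries.value == 5);
    assert(from_int.value == 5);
    return from_lvalues.value + from_temporaries.value + from_int.value;
}
```

<p>Swapping the parameters to non-<code class="language-plaintext highlighter-rouge">const</code> <code class="language-plaintext highlighter-rouge">integer&</code> makes the second and third additions fail to compile, which is the limitation the previous snippet describes.</p>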
<details class="red">
<summary class="collapse"><strong>This is so confusing, why did they have to make <code class="language-plaintext highlighter-rouge">const</code> and non-<code class="language-plaintext highlighter-rouge">const</code> references behave differently for rvalues?</strong>
</summary>
<p>Consider the case:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">increment</span><span class="p">(</span><span class="n">integer</span><span class="o">&</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="o">++</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">increment</span><span class="p">(</span><span class="n">i</span><span class="p">);</span> <span class="c1">// error! No matching function.</span>
</code></pre></div> </div>
<p>Here, the developer is probably thinking - “I’ll pass in an <code class="language-plaintext highlighter-rouge">int</code> because it’ll get implicitly converted to an <code class="language-plaintext highlighter-rouge">integer</code>, and it’ll get incremented”. If this was allowed, then it would look something like:</p>
<ul>
<li>The expression <code class="language-plaintext highlighter-rouge">i</code> in <code class="language-plaintext highlighter-rouge">increment(i)</code> is cast to an rvalue via lvalue-to-rvalue conversion.</li>
<li>The value of <code class="language-plaintext highlighter-rouge">i</code> is implicitly converted to <code class="language-plaintext highlighter-rouge">integer</code> by constructor conversion.</li>
<li>A temporary <code class="language-plaintext highlighter-rouge">integer</code> is created, which doesn’t necessarily have to have storage (we already fail here!)</li>
<li>In the function, the temporary <code class="language-plaintext highlighter-rouge">integer</code> is incremented, NOT <code class="language-plaintext highlighter-rouge">i</code>.</li>
</ul>
<p>Along with other reasons, the C++ committee thus thought there was no reason for non-<code class="language-plaintext highlighter-rouge">const</code> references to bind to temporaries. Allowing this for <code class="language-plaintext highlighter-rouge">const</code> is totally fine though, because we aren’t expecting the arguments to change after the function call.</p>
</details>
<p>The fact that sometimes rvalues could “materialize” and exist in storage in the limited scope of the function, and sometimes not have storage at all must be confusing. This is why C++ further split up the concept of rvalues and defined the materialized values as <strong>xvalues</strong>, for “expiring values” and no-storage values as <strong>prvalues</strong>, for “pure rvalues”.</p>
<p>Let’s run through an example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">integer</span> <span class="nf">x</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span> <span class="c1">// `x` is an lvalue</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// `4` is an rvalue, and implicitly converted to `integer(4)`, also rvalue.</span>
<span class="n">x</span> <span class="o">+</span> <span class="n">x</span><span class="p">;</span> <span class="c1">// `x`'s are lvalues, and converted to rvalues (prvalues)</span>
<span class="n">integer</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">integer</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span> <span class="c1">// `integer(2)` is a prvalue, and it is materialized as an xvalue</span>
</code></pre></div></div>
<hr />
<p>Life went on for C++ developers who came to live with this set of rules with <em>lvalue</em> and <em>rvalue</em> (which is further broken up into <em>xvalues</em> and <em>prvalues</em>). However, C++11 came and value categories became more complicated.</p>
<h2 id="c11-2011">C++11 (2011)</h2>
<p>One of the biggest optimizations to the C++ language paradigm occurred in C++11, in the form of <strong>move semantics</strong>. Prior to C++11, there was no standardized way to “rip out” contents from one object to be used in another. This is especially important for heavyweight data structures which may allocate large buffers transparently to hold objects, such as <code class="language-plaintext highlighter-rouge">std::vector<T></code>. We can declare an object of type <code class="language-plaintext highlighter-rouge">T</code> “move-able” if it’s an <strong>rvalue reference</strong>, denoted as <code class="language-plaintext highlighter-rouge">T&&</code>. We can also turn <em>lvalues</em> into <em>rvalue references</em> by using <code class="language-plaintext highlighter-rouge">std::move</code>. Let’s run through a simple example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <utility>
</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="nc">T</span><span class="p">></span>
<span class="k">class</span> <span class="nc">data_structure</span> <span class="p">{</span>
<span class="nl">public:</span>
<span class="n">data_structure</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
<span class="n">data_structure</span><span class="p">(</span><span class="k">const</span> <span class="n">data_structure</span><span class="o"><</span><span class="n">T</span><span class="o">>&</span> <span class="n">clref</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Usually in here we would use `new` and `memcpy` or `std::copy(...)`</span>
<span class="c1">// to copy contents large buffers from clref to this object.</span>
<span class="c1">// We simulate that with a long sleep.</span>
<span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">seconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="p">}</span>
<span class="n">data_structure</span><span class="p">(</span><span class="n">data_structure</span><span class="o"><</span><span class="n">T</span><span class="o">>&&</span> <span class="n">rref</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Usually in here we would rip out the contents from rref and leave rref in</span>
<span class="c1">// a valid but unspecified state.</span>
<span class="c1">// This is as simple as a pointer assignment.</span>
<span class="c1">// It's on the order of nanoseconds.</span>
<span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">nanoseconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">data_structure</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">d</span><span class="p">;</span>
<span class="c1">// Copying</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Copying is slow..."</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">data_structure</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">cd</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
<span class="c1">// Moving</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Moving is fast!"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">data_structure</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">md</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">d</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As you can see, <code class="language-plaintext highlighter-rouge">std::move(d)</code> essentially casts the current <code class="language-plaintext highlighter-rouge">data_structure<T>&</code> to <code class="language-plaintext highlighter-rouge">data_structure<T>&&</code>, and we trigger the <strong>move constructor</strong>. I talk about these <em>rvalue references</em> further in <a href="https://oneraynyday.github.io/dev/2017/09/04/C++-Lvalue-Rvalue-Move/#move-semantics">this blogpost</a>.</p>
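<p>The sleeps above only simulate the cost difference. Below is a minimal sketch of what the two constructors might actually do for a type that owns a heap buffer. The class <code class="language-plaintext highlighter-rouge">buffer</code> and its members are hypothetical, and emptying the moved-from object is one reasonable choice of “valid state”, not something the language mandates for user types.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

class buffer {
public:
    explicit buffer(std::size_t n) : size_(n), data_(new int[n]()) {}
    ~buffer() { delete[] data_; }

    // Copy: allocate a fresh block and copy every element, O(n).
    buffer(const buffer& other) : size_(other.size_), data_(new int[other.size_]) {
        std::copy(other.data_, other.data_ + other.size_, data_);
    }

    // Move: steal the pointer, O(1), and leave the source empty but destructible.
    buffer(buffer&& other) noexcept : size_(other.size_), data_(other.data_) {
        other.size_ = 0;
        other.data_ = nullptr;
    }

    // Assignment is omitted to keep the sketch short.
    buffer& operator=(const buffer&) = delete;
    buffer& operator=(buffer&&) = delete;

    std::size_t size() const { return size_; }
    const int* data() const { return data_; }

private:
    std::size_t size_;
    int* data_;
};

void buffer_example() {
    buffer original(1000);
    const int* address = original.data();
    buffer copied(original);            // new allocation, different address
    buffer moved(std::move(original));  // same allocation, ownership transferred
    assert(copied.data() != address);
    assert(moved.data() == address);
    assert(original.size() == 0);
}
```

<p>The move constructor never touches the 1000 elements, which is why moving stays cheap no matter how large the buffer grows.</p>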
<p>So what does the above have to do with value categories? Recall that before C++11, we had the notions of <em>lvalues</em> and <em>rvalues</em>, the latter further divided into <em>xvalues</em> and <em>prvalues</em>. The reason <em>rvalues</em> were split in two was that an rvalue could be temporarily materialized into an expiring value using const references. Similarly, with <code class="language-plaintext highlighter-rouge">std::move</code>, we can consider the moved <em>lvalues</em> as expiring values (xvalues) as well. This is because once we allow an lvalue’s contents to be moved, the object should usually be considered expired, as it’s left in a <em>valid but unspecified state</em>.</p>
<p>Now this is where things become a little confusing. Bear with me here. We have <em>rvalue</em> divided into <em>xvalue</em> and <em>prvalue</em>, so we should ideally divide <em>lvalue</em> into <em>xvalue</em> and <em>plvalue</em> (for “pure” lvalues), right? <strong>No.</strong> The standard committee decided to call what we coined <em>plvalue</em> simply <em>lvalue</em>, and to call what we had been calling <em>lvalue</em> a <em>glvalue</em>, or “generalized lvalue”, which encompasses <em>xvalue</em> and <em>lvalue</em>. The tree looks like the following:</p>
<p><img src="http://oneraynyday.github.io/assets/value_semantics_tree.png" alt="value_semantics" /></p>
<details class="red">
<summary class="collapse"><em>So why didn’t the standard committee use the term plvalue?</em>
</summary>
<p>To be fair, <em>prvalue</em> is “pure” in the sense that it does not have storage semantics(even though it may be stored in memory somewhere if it doesn’t fit in registers), meanwhile <em>xvalues</em> have temporary storage. The concept of <em>plvalue</em> doesn’t really make sense since both itself and <em>xvalues</em> would have storage semantics.
In this sense, <em>lvalue</em> is a bucketing term for objects that don’t expire. Generalized <em>lvalues</em> cover all objects, expiring or not.</p>
</details>
<p>This is basically an accurate picture of the current state of value categories in C++. It’s only confusing because of the names and historical context. If we see the evolution of these terms, it makes the bigger picture a bit easier to understand. <strong>To recap, every expression is exactly one of lvalue, xvalue or prvalue. glvalues are either lvalues (non-expiring objects) or xvalues (expiring objects). rvalues are either prvalues (no storage semantics) or xvalues (including prvalues that temporarily materialize into objects).</strong></p>
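<p>One way to check these categories mechanically is through <code class="language-plaintext highlighter-rouge">decltype((expr))</code>, which yields <code class="language-plaintext highlighter-rouge">T&</code> for lvalues, <code class="language-plaintext highlighter-rouge">T&&</code> for xvalues and plain <code class="language-plaintext highlighter-rouge">T</code> for prvalues. The macros and the <code class="language-plaintext highlighter-rouge">make_integer</code> helper below are hypothetical names for a minimal sketch; everything is verified at compile time.</p>

```cpp
#include <type_traits>
#include <utility>

// Note the double parentheses: decltype((expr)) inspects the expression,
// not the declared type of a name.
#define IS_LVALUE(expr)  (std::is_lvalue_reference<decltype((expr))>::value)
#define IS_XVALUE(expr)  (std::is_rvalue_reference<decltype((expr))>::value)
#define IS_PRVALUE(expr) (!std::is_reference<decltype((expr))>::value)
#define IS_GLVALUE(expr) (IS_LVALUE(expr) || IS_XVALUE(expr))
#define IS_RVALUE(expr)  (IS_XVALUE(expr) || IS_PRVALUE(expr))

struct integer {
    int value;
};

integer make_integer() { return integer{1}; }

void value_category_examples() {
    integer x{3};
    (void)x;  // x is only used in unevaluated contexts below
    static_assert(IS_LVALUE(x), "a named variable is an lvalue");
    static_assert(IS_XVALUE(std::move(x)), "std::move yields an xvalue");
    static_assert(IS_PRVALUE(make_integer()), "a by-value call result is a prvalue");
    static_assert(IS_PRVALUE(integer{2}), "a freshly constructed temporary is a prvalue");
    static_assert(IS_GLVALUE(x) && !IS_RVALUE(x), "lvalues are glvalues only");
    static_assert(IS_GLVALUE(std::move(x)) && IS_RVALUE(std::move(x)), "xvalues are both");
    static_assert(!IS_GLVALUE(make_integer()) && IS_RVALUE(make_integer()), "prvalues are rvalues only");
}
```

<p>The three pairs of assertions at the end line up with the two branches of the tree: xvalues are the only category that sits under both glvalue and rvalue.</p>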
<h2 id="c17-2017">C++17 (2017)</h2>
<p>The definition of what is a prvalue and what isn’t has been changing frequently, but one of the most important and non-obvious things we should mention is the copy elision rules involving returning prvalues. In C++17, copy elision is now <strong>guaranteed</strong> for function calls returning prvalues, as in they never undergo temporary materialization. This is in a class of optimizations called RVO (return value optimization), and it has important implications.</p>
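<p>One observable consequence of the guarantee is that a type with no copy or move constructor at all can still be returned by value. The sketch below assumes a C++17 (or later) compiler; the type <code class="language-plaintext highlighter-rouge">pinned</code> and its helpers are hypothetical names used only for illustration.</p>

```cpp
#include <cassert>

// A type that can be neither copied nor moved.
struct pinned {
    explicit pinned(int v) : value(v) {}
    pinned(const pinned&) = delete;
    pinned(pinned&&) = delete;
    int value;
};

// Returning a prvalue: from C++17 on, the object is constructed directly in
// the caller's storage, so no copy or move constructor is required.
pinned make_pinned(int v) {
    return pinned(v);
}

int use_pinned() {
    pinned p = make_pinned(42);  // rejected before C++17, accepted from C++17 on
    assert(p.value == 42);
    return p.value;
}
```

<p>Compiling the same code under an older standard (for example with <code class="language-plaintext highlighter-rouge">-std=c++14</code>) is expected to fail, because elision was then only a permitted optimization and the deleted constructors still had to be accessible.</p>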
<h3 id="more-on-rvo">More on RVO</h3>
<p>It’s important to note what kinds of RVO there are. There is the type of RVO that works on lvalues, which is also called NRVO, for “named return value optimization”. There is also another RVO that works on prvalues. RVO on prvalues is usually more powerful than NRVO. <strong>In the standard there is no RVO for xvalues.</strong></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <iostream>
</span><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="k">class</span> <span class="nc">obj</span> <span class="p">{</span>
<span class="nl">public:</span>
<span class="n">obj</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"default ctor"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">obj</span><span class="p">(</span><span class="k">const</span> <span class="n">obj</span><span class="o">&</span> <span class="n">o</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"copy ctor"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">obj</span><span class="p">(</span><span class="n">obj</span><span class="o">&&</span> <span class="n">o</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"move ctor"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="n">obj</span> <span class="nf">make</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">obj</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">obj</span> <span class="n">o</span> <span class="o">=</span> <span class="n">make</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Previously, the standard didn’t specify which one would be guaranteed to be faster than the other:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">Foo</span> <span class="nf">make_foo</span><span class="p">()</span> <span class="p">{</span>
<span class="n">Foo</span> <span class="n">f</span><span class="p">;</span>
<span class="c1">// Move semantics</span>
<span class="k">return</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">Foo</span> <span class="nf">make_foo</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// RVO</span>
<span class="k">return</span> <span class="n">Foo</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Before, compiler writers could technically invoke the move in case (2), resulting in identical performance to (1). Now, they <em>must</em> enforce zero-copy pass-by-value semantics on (2) since <code class="language-plaintext highlighter-rouge">Foo()</code> is a prvalue. When compiling as C++14 or earlier with <code class="language-plaintext highlighter-rouge">clang</code> or <code class="language-plaintext highlighter-rouge">gcc</code>, passing <code class="language-plaintext highlighter-rouge">-fno-elide-constructors</code> prevents (2) from being optimized: the object is created and then moved (if move semantics are valid for <code class="language-plaintext highlighter-rouge">Foo</code>). In C++17, the compilers ignore the flag for this case and elide anyway, because <code class="language-plaintext highlighter-rouge">Foo()</code> is a prvalue and must be copy elided. I suggest this <a href="https://jonasdevlieghere.com/guaranteed-copy-elision/">blogpost</a> if you’re interested in the subject.</p>
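<p>To see the difference between cases (1) and (2) without reading assembly, one option is to count constructor calls. This is only a sketch: <code class="language-plaintext highlighter-rouge">Tracker</code>, <code class="language-plaintext highlighter-rouge">make_with_move</code>, <code class="language-plaintext highlighter-rouge">make_prvalue</code> and <code class="language-plaintext highlighter-rouge">count_moves</code> are hypothetical names, and it assumes the code is built without <code class="language-plaintext highlighter-rouge">-fno-elide-constructors</code>.</p>

```cpp
#include <cassert>
#include <utility>

// Hypothetical instrumented type: counts how often each constructor runs.
struct Tracker {
    static int copies;
    static int moves;
    Tracker() = default;
    Tracker(const Tracker&) { ++copies; }
    Tracker(Tracker&&) noexcept { ++moves; }
};
int Tracker::copies = 0;
int Tracker::moves = 0;

// Case (1): std::move turns the local into an xvalue, so the return object
// has to be move-constructed. Compilers may warn that this is pessimizing.
Tracker make_with_move() {
    Tracker t;
    return std::move(t);
}

// Case (2): the returned prvalue initializes the caller's object directly.
Tracker make_prvalue() {
    return Tracker();
}

// Returns {moves seen in case (1), moves seen in case (2)}.
inline std::pair<int, int> count_moves() {
    Tracker::moves = 0;
    Tracker a = make_with_move();
    const int with_move = Tracker::moves;

    Tracker::moves = 0;
    Tracker b = make_prvalue();
    const int with_prvalue = Tracker::moves;

    (void)a;
    (void)b;
    return {with_move, with_prvalue};
}

inline void elision_demo() {
    const std::pair<int, int> result = count_moves();
    assert(result.first == 1);   // the explicit std::move costs one move
    assert(result.second == 0);  // the prvalue is constructed in place
}
```

<p>The design point is that writing <code class="language-plaintext highlighter-rouge">return std::move(f);</code> opts out of elision, since the standard has no RVO for xvalues.</p>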
<h1 id="conclusion">Conclusion</h1>
<p>From loosely used terms in CPL to an almost lawyer-level specification, value categories have come a long way. With a clean and clear set of rules on value categories, it is easier for the standard committee to add optimizations to the language, like move semantics and improvements on return value optimization. This is also one of the few things in the C++ standard that is ubiquitously used in other concepts, so with a basic knowledge of these value categories you’ll have an easier time navigating <a href="https://www.cppreference.com">cppreference.com</a>. Hope you enjoyed!</p>
<script src="https://utteranc.es/client.js" repo="OneRaynyDay/oneraynyday.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async=""> </script>
Analyzing The Simplest C++ Program2020-05-03T00:00:00+00:00http://oneraynyday.github.io/dev/2020/05/03/Analyzing-The-Simplest-C++-Program<style>
.red {
color:inherit;
}
.red:hover {
color:rgb(129, 17, 18);
}
.collapse:hover {
cursor: pointer;
}
</style>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#the-setup-simplest-c-program" id="markdown-toc-the-setup-simplest-c-program">The Setup: Simplest C++ Program</a></li>
<li><a href="#structure-of-the-executable-elf-format" id="markdown-toc-structure-of-the-executable-elf-format">Structure of the Executable: ELF Format</a> <ul>
<li><a href="#elf-headers" id="markdown-toc-elf-headers">ELF Headers</a></li>
<li><a href="#program-headers" id="markdown-toc-program-headers">Program Headers</a> <ul>
<li><a href="#phdr" id="markdown-toc-phdr">PHDR</a></li>
<li><a href="#interp" id="markdown-toc-interp">INTERP</a></li>
<li><a href="#load" id="markdown-toc-load">LOAD</a></li>
<li><a href="#dynamic" id="markdown-toc-dynamic">DYNAMIC</a></li>
<li><a href="#note" id="markdown-toc-note">NOTE</a></li>
<li><a href="#gnu_eh_frame" id="markdown-toc-gnu_eh_frame">GNU_EH_FRAME</a></li>
<li><a href="#gnu_stack" id="markdown-toc-gnu_stack">GNU_STACK</a></li>
<li><a href="#gnu_relro" id="markdown-toc-gnu_relro">GNU_RELRO</a></li>
</ul>
</li>
<li><a href="#recap" id="markdown-toc-recap">Recap</a></li>
</ul>
</li>
<li><a href="#what-does-g-do" id="markdown-toc-what-does-g-do">What does <code class="language-plaintext highlighter-rouge">g++</code> do?</a> <ul>
<li><a href="#preprocessor-cpp" id="markdown-toc-preprocessor-cpp">Preprocessor (<code class="language-plaintext highlighter-rouge">cpp</code>)</a></li>
<li><a href="#compiler-cc1plus" id="markdown-toc-compiler-cc1plus">Compiler (<code class="language-plaintext highlighter-rouge">cc1plus</code>)</a> <ul>
<li><a href="#front-end" id="markdown-toc-front-end">Front End</a></li>
<li><a href="#middle-end" id="markdown-toc-middle-end">Middle End</a></li>
<li><a href="#back-end" id="markdown-toc-back-end">Back End</a></li>
</ul>
</li>
<li><a href="#assembler-as" id="markdown-toc-assembler-as">Assembler (<code class="language-plaintext highlighter-rouge">as</code>)</a></li>
<li><a href="#static-linker-ld" id="markdown-toc-static-linker-ld">Static Linker (<code class="language-plaintext highlighter-rouge">ld</code>)</a> <ul>
<li><a href="#disclaimer-static-linker-and-dynamic-linkers-are-not-the-same-thing" id="markdown-toc-disclaimer-static-linker-and-dynamic-linkers-are-not-the-same-thing">Disclaimer: Static linker and dynamic linkers are NOT the same thing!</a></li>
</ul>
</li>
<li><a href="#recap-1" id="markdown-toc-recap-1">Recap</a></li>
</ul>
</li>
<li><a href="#analyzing-generated-procedures-objdump" id="markdown-toc-analyzing-generated-procedures-objdump">Analyzing Generated Procedures: <code class="language-plaintext highlighter-rouge">Objdump</code></a> <ul>
<li><a href="#main---the-dumb-and-obvious" id="markdown-toc-main---the-dumb-and-obvious"><code class="language-plaintext highlighter-rouge">main</code> - The dumb and obvious</a></li>
<li><a href="#_start---true-start-of-the-program" id="markdown-toc-_start---true-start-of-the-program"><code class="language-plaintext highlighter-rouge">_start</code> - True start of the program</a></li>
<li><a href="#__libc_csu_init-and-__libc_csu_fini---program-level-ctordtor-handlers" id="markdown-toc-__libc_csu_init-and-__libc_csu_fini---program-level-ctordtor-handlers"><code class="language-plaintext highlighter-rouge">__libc_csu_init</code> and <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> - Program level ctor/dtor handlers</a></li>
<li><a href="#_init-and-_fini---obsolete-program-level-ctordtor-with-highest-priority" id="markdown-toc-_init-and-_fini---obsolete-program-level-ctordtor-with-highest-priority"><code class="language-plaintext highlighter-rouge">_init</code> and <code class="language-plaintext highlighter-rouge">_fini</code> - Obsolete program level ctor/dtor with highest priority</a></li>
<li><a href="#register_tm_clones-deregister_tm_clones---mysterious-concurrency-model-functions" id="markdown-toc-register_tm_clones-deregister_tm_clones---mysterious-concurrency-model-functions"><code class="language-plaintext highlighter-rouge">register_tm_clones</code>, <code class="language-plaintext highlighter-rouge">deregister_tm_clones</code> - Mysterious concurrency model functions</a></li>
<li><a href="#recap-2" id="markdown-toc-recap-2">Recap</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#some-notes-about-the-cc1plus-compiler-and-general-parsing-rules" id="markdown-toc-some-notes-about-the-cc1plus-compiler-and-general-parsing-rules">Some notes about the <code class="language-plaintext highlighter-rouge">cc1plus</code> compiler and general parsing rules</a></li>
<li><a href="#transactional-memory-model--clones" id="markdown-toc-transactional-memory-model--clones">Transactional Memory Model & Clones</a> <ul>
<li><a href="#atomic-transactions" id="markdown-toc-atomic-transactions">Atomic transactions</a></li>
<li><a href="#synchronized-transactions" id="markdown-toc-synchronized-transactions">Synchronized transactions</a></li>
<li><a href="#tm-clones" id="markdown-toc-tm-clones"><code class="language-plaintext highlighter-rouge">tm</code> clones</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="the-setup-simplest-c-program">The Setup: Simplest C++ Program</h1>
<p>Here I have the simplest C++ program in existence:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// main.cpp</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){}</span>
</code></pre></div></div>
<p>and here is the corresponding Makefile for this program (with some utilities we’ll use later):</p>
<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Makefile
# We are running g++-9, v9.3.0
</span><span class="nv">CC</span><span class="o">=</span>g++
<span class="c"># Turn off optimizations because we want to be able to follow the assembly.
</span><span class="nv">FLAGS</span><span class="o">=</span><span class="nt">-O0</span> <span class="nt">-fverbose-asm</span> <span class="nt">-no-pie</span>
<span class="nl">main</span><span class="o">:</span> <span class="nf">main.cpp</span>
<span class="nv">$(CC)</span> <span class="nv">$(FLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> <span class="nv">$^</span>
<span class="nl">dump</span><span class="o">:</span> <span class="nf">main</span>
objdump <span class="nt">-drwC</span> <span class="nt">-Mintel</span> main &> main.dump
<span class="nl">linker</span><span class="o">:</span> <span class="nf">main.cpp</span>
<span class="nv">$(CC)</span> <span class="nv">$(FLAGS)</span> <span class="nt">-o</span> /dev/null <span class="nt">-x</span> c <span class="nv">$^</span> <span class="nt">-Wl</span>,--verbose
</code></pre></div></div>
<p>Upon execution, the program simply starts up and returns with exit code 0. However, you might be wondering about a few things:</p>
<ol>
<li><strong>What is the structure of the executable?</strong></li>
<li><strong>What did <code class="language-plaintext highlighter-rouge">g++</code> do to generate this binary file?</strong></li>
<li><strong>What are the generated procedures being run in assembly?</strong></li>
</ol>
<p>As we’ll see, the process is extremely complicated. <em>To make this easier to navigate, I have made collapsible blocks identifiable by the cursor next to questions.</em></p>
<h1 id="structure-of-the-executable-elf-format">Structure of the Executable: ELF Format</h1>
<p><em>Our goal for this section is to understand the format of the <code class="language-plaintext highlighter-rouge">main</code> binary</em>.</p>
<p><strong>ELF, which stands for Executable and Linkable Format</strong>, is the format used for binaries and libraries that we compile with C and C++. It’s in an unreadable binary format that can be analyzed with several GNU tools. To understand what the assembly outputs are, we must first be familiar with the general layout of an ELF file.</p>
<details class="red">
<summary class="collapse"><strong>How can I tell that an executable is ELF?</strong>
</summary>
<p>You can identify an ELF file by using the <code class="language-plaintext highlighter-rouge">file</code> command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ file main
main: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=921d352e49a0e4262aece7e72418290189520782, for GNU/Linux 3.2.0, not stripped
</code></pre></div> </div>
</details>
<details class="red">
<summary class="collapse"><strong>How can I get information about an ELF file?</strong>
</summary>
<p>If it does say <code class="language-plaintext highlighter-rouge">ELF</code>, you can use <code class="language-plaintext highlighter-rouge">readelf</code> to analyze the headers like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ readelf -e main
ELF Header: ...
Section Headers: ...
Key to Flags: ...
Program Headers: ...
Section to Segment mapping:
Segment Sections: ...
...
</code></pre></div> </div>
</details>
<p>In ELF there is a concept of <em>sections</em> and <em>segments</em>. Sections reside within segments, which are contiguous regions of memory at runtime (a segment may be 0 bytes long). A section can appear in more than one segment when those segments overlap, in which case the sections in the intersection belong to both. The exception is <code class="language-plaintext highlighter-rouge">LOAD</code> segments, which do not overlap each other. We’ll be going more in-depth on what each of these do throughout the blog.</p>
<p>If we take a look at the ELF Header and Program Headers, we can get a lot of information about the runtime behavior of the executable, so let’s analyze it.</p>
<h2 id="elf-headers">ELF Headers</h2>
<details class="red">
<summary class="collapse"><strong>What does our ELF header look like?</strong>
</summary>
<p>We see the following relevant details in the <code class="language-plaintext highlighter-rouge">ELF Header</code> section:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">ELF Header</span><span class="pi">:</span>
<span class="na">Magic</span><span class="pi">:</span> <span class="s">7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00</span>
<span class="na">Class</span><span class="pi">:</span> <span class="s">ELF64</span>
<span class="na">Data</span><span class="pi">:</span> <span class="s">2's complement, little endian</span>
<span class="nn">...</span>
<span class="na">Type</span><span class="pi">:</span> <span class="s">DYN (Shared object file)</span>
<span class="na">Number of program headers</span><span class="pi">:</span> <span class="m">11</span>
<span class="na">Number of section headers</span><span class="pi">:</span> <span class="m">27</span>
<span class="nn">...</span>
</code></pre></div> </div>
</details>
<details class="red">
<summary class="collapse"><strong>What does each field mean?</strong>
</summary>
<p>Let’s go through some of these fields:</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">Magic</code> field starts with the byte <code class="language-plaintext highlighter-rouge">7f</code> followed by the letters <code class="language-plaintext highlighter-rouge">ELF</code>, encoded as hexadecimal <code class="language-plaintext highlighter-rouge">45 4c 46</code>. To identify whether a file is in the ELF format, check for this magic number at the start of the header.</li>
<li>The <code class="language-plaintext highlighter-rouge">Class</code> field tells us whether this file should be run on a 32-bit or 64-bit architecture. Modern CPUs are 64-bit, which means the word size is 8 bytes rather than 4 bytes, the addressable memory space is $2^{64}$ bytes (which is practically infinite for any kind of storage) as opposed to $2^{32}$ bytes (which is 4GB, roughly the size of a medium dataset for machine learning), and registers can hold 64-bit data.</li>
<li>The <code class="language-plaintext highlighter-rouge">Data</code> field tells us that the data is stored in <code class="language-plaintext highlighter-rouge">little endian</code>:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Little endian has the most significant byte last:
Consider the 4-byte integer 0x11223344, whose bytes from most to least significant are: 0x11 0x22 0x33 0x44
The little endian representation (in increasing memory order): 0x44 0x33 0x22 0x11.
</code></pre></div> </div>
<p>and <code class="language-plaintext highlighter-rouge">2's complement</code> is the representation of signed numbers. For any arbitrary positive number $N$ represented in binary, the corresponding $-N$ is represented as bit-flipped $N$ plus 1.</p>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">Type</code> field tells us what kind of file it is. You might be surprised in this case to see that it’s a <code class="language-plaintext highlighter-rouge">DYN</code> for shared object file when we’re actually creating an executable. I was also baffled and asked on StackOverflow <a href="https://stackoverflow.com/questions/61567439/why-is-my-simple-main-programs-elf-header-say-its-a-dyn-shared-object-file">here</a>. TL;DR: We’ll be adding the flag <code class="language-plaintext highlighter-rouge">-no-pie</code> to turn it into an <code class="language-plaintext highlighter-rouge">EXEC</code> type file. The rest of the blog will be based off of that assumption.</li>
<li>The number of program headers is the number of segments that will be mapped into memory upon execution.</li>
<li>The number of section headers is the number of sections, most of which will be placed into one or more of the 11 segments.</li>
</ul>
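<p>The byte order and two’s complement rules described above can be sketched in a few lines. The function names <code class="language-plaintext highlighter-rouge">to_little_endian</code> and <code class="language-plaintext highlighter-rouge">twos_complement_negate</code> are made up for this illustration, and shifts are used instead of inspecting memory so that the result does not depend on the machine running it.</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Lay out a 32-bit value in increasing memory order, the way a
// little-endian machine would store it: least significant byte first.
inline std::array<std::uint8_t, 4> to_little_endian(std::uint32_t value) {
    return {{
        static_cast<std::uint8_t>(value & 0xFF),
        static_cast<std::uint8_t>((value >> 8) & 0xFF),
        static_cast<std::uint8_t>((value >> 16) & 0xFF),
        static_cast<std::uint8_t>((value >> 24) & 0xFF),
    }};
}

// Two's complement negation: flip every bit, then add 1.
inline std::uint32_t twos_complement_negate(std::uint32_t n) {
    return ~n + 1u;
}

inline void data_field_demo() {
    const std::array<std::uint8_t, 4> bytes = to_little_endian(0x11223344u);
    assert(bytes[0] == 0x44 && bytes[1] == 0x33 && bytes[2] == 0x22 && bytes[3] == 0x11);
    // The bit pattern of -5 as an unsigned 32-bit value.
    assert(twos_complement_negate(5u) == static_cast<std::uint32_t>(-5));
}
```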
</details>
<p>In general, the ELF header tells us exactly what kind of platform this binary was compiled for, and gives a general summary of the structure of the ELF file.</p>
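<p>As a rough sketch of how a tool might read the first few fields, the snippet below interprets the 16 identification bytes that <code class="language-plaintext highlighter-rouge">readelf</code> prints as <code class="language-plaintext highlighter-rouge">Magic</code>. The struct <code class="language-plaintext highlighter-rouge">ElfIdent</code> and the function <code class="language-plaintext highlighter-rouge">parse_ident</code> are hypothetical names. The byte positions follow the ELF format: byte 4 is the class (2 means 64-bit) and byte 5 is the data encoding (1 means little endian).</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Summary of the identification bytes at the very start of an ELF file.
struct ElfIdent {
    bool is_elf;         // starts with 0x7f 'E' 'L' 'F'
    bool is_64bit;       // class byte == 2
    bool little_endian;  // data-encoding byte == 1
};

inline ElfIdent parse_ident(const std::array<std::uint8_t, 16>& ident) {
    ElfIdent out{};
    out.is_elf = ident[0] == 0x7f && ident[1] == 'E' && ident[2] == 'L' && ident[3] == 'F';
    out.is_64bit = ident[4] == 2;
    out.little_endian = ident[5] == 1;
    return out;
}

inline void parse_ident_demo() {
    // The Magic bytes shown by readelf for our binary; the remaining
    // bytes of the array are zero-initialized padding.
    const std::array<std::uint8_t, 16> magic = {{0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00}};
    const ElfIdent id = parse_ident(magic);
    assert(id.is_elf && id.is_64bit && id.little_endian);
}
```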
<hr />
<h2 id="program-headers">Program Headers</h2>
<details class="red">
<summary class="collapse"><strong>What does the <code class="language-plaintext highlighter-rouge">Program Header</code> section look like?</strong>
</summary>
<p>We see the following relevant details in the <code class="language-plaintext highlighter-rouge">Program Header</code> section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x0000000000000268 0x0000000000000268 R 0x8
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD ... R
LOAD ... R E
LOAD ... R
LOAD ... R W
DYNAMIC ...
NOTE ...
GNU_EH_FRAME ...
GNU_STACK ...
GNU_RELRO ...
</code></pre></div> </div>
</details>
<p>Each program header is mapped to a segment containing zero or more sections. The <code class="language-plaintext highlighter-rouge">VirtAddr</code> field tells us where the segments will be located, <code class="language-plaintext highlighter-rouge">Flags</code> tells us the permission bits of each memory segment, and the <code class="language-plaintext highlighter-rouge">Type</code> field tells us exactly what that segment is used for.</p>
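<p>The <code class="language-plaintext highlighter-rouge">Flags</code> column is a small bitmask. Here is a minimal sketch of decoding it into the three-character form that <code class="language-plaintext highlighter-rouge">readelf</code> shows. The constant and function names are chosen for this example, while the bit values are the ones the ELF format assigns to execute (1), write (2) and read (4).</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Permission bits of a program header's flags field.
constexpr std::uint32_t kExecute = 0x1;  // PF_X
constexpr std::uint32_t kWrite = 0x2;    // PF_W
constexpr std::uint32_t kRead = 0x4;     // PF_R

// Render flags in the column order R, W, E, leaving a blank for unset bits.
inline std::string flags_to_string(std::uint32_t flags) {
    std::string out = "   ";
    if (flags & kRead) out[0] = 'R';
    if (flags & kWrite) out[1] = 'W';
    if (flags & kExecute) out[2] = 'E';
    return out;
}

inline void flags_demo() {
    assert(flags_to_string(kRead | kExecute) == "R E");  // code
    assert(flags_to_string(kRead | kWrite) == "RW ");    // writable data
    assert(flags_to_string(kRead) == "R  ");             // read-only data
}
```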
<p>Isn’t it surprising that there are so many program headers for our simple C++ program? Let’s analyze what types each of these headers point to and why they’re needed.</p>
<h3 id="phdr">PHDR</h3>
<p><em>This segment usually contains no sections</em>.</p>
<p><code class="language-plaintext highlighter-rouge">PHDR</code> stands for “Program HeaDeR”, and is a bit of a strange one. The official Linux documentation <a href="http://man7.org/linux/man-pages/man5/elf.5.html">here</a> says that it:</p>
<blockquote>
<p>… specifies the location and size of the program header table itself, both in the file and in the memory image of the program.</p>
</blockquote>
<details class="red">
<summary class="collapse"><strong>Why do we need to know where the program header table is? Why don’t we just remove this metadata during runtime?</strong>
</summary>
<p>Simply stated - <strong>we want to know where the executable begins</strong>. The program header table, which includes <code class="language-plaintext highlighter-rouge">PHDR</code> itself, could be relocated anywhere in memory if the executable were a PIE (position independent executable). To compute where the executable was loaded, we take the address where the header actually resides and subtract the <code class="language-plaintext highlighter-rouge">VirtAddr</code> it claims to be at. Here’s the source code in libc:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="k">case</span> <span class="n">PT_PHDR</span><span class="p">:</span>
<span class="cm">/* Find out the load address. */</span>
<span class="n">main_map</span><span class="o">-></span><span class="n">l_addr</span> <span class="o">=</span> <span class="p">(</span><span class="n">ElfW</span><span class="p">(</span><span class="n">Addr</span><span class="p">))</span> <span class="n">phdr</span> <span class="o">-</span> <span class="n">ph</span><span class="o">-></span><span class="n">p_vaddr</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">...</span>
</code></pre></div> </div>
</details>
<p>Here, <code class="language-plaintext highlighter-rouge">phdr</code> is the location of the actual header, and <code class="language-plaintext highlighter-rouge">ph->p_vaddr</code> is the field <code class="language-plaintext highlighter-rouge">VirtAddr</code> deserialized from the ELF file. By subtracting, we have the base location of the executable, which we can use to find where <code class="language-plaintext highlighter-rouge">some_segment</code> lives in memory by <code class="language-plaintext highlighter-rouge">main_map->l_addr + some_segment->p_vaddr</code>. Credits to the writer of <a href="https://stackoverflow.com/questions/61568612/is-jumping-over-removing-phdr-program-header-in-elf-file-for-executable-ok-if/61568759#61568759">musl</a>, which is a libc implementation.</p>
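<p>The arithmetic is simple enough to sketch with plain integers. In the snippet below, <code class="language-plaintext highlighter-rouge">load_base</code> and <code class="language-plaintext highlighter-rouge">runtime_address</code> are hypothetical helper names, and the PIE addresses are invented purely for illustration. Only the non-PIE value <code class="language-plaintext highlighter-rouge">0x400040</code> comes from the program headers shown earlier.</p>

```cpp
#include <cassert>
#include <cstdint>

// Base of the executable: where PHDR actually sits in memory minus the
// virtual address the file claims for it.
inline std::uint64_t load_base(std::uint64_t phdr_runtime_addr, std::uint64_t phdr_vaddr) {
    return phdr_runtime_addr - phdr_vaddr;
}

// Runtime location of any other segment, given the base.
inline std::uint64_t runtime_address(std::uint64_t base, std::uint64_t segment_vaddr) {
    return base + segment_vaddr;
}

inline void load_base_demo() {
    // Non-PIE: the file is mapped at exactly the addresses it asks for.
    assert(load_base(0x400040, 0x400040) == 0);

    // PIE (made-up numbers): the file says 0x40, the loader put it elsewhere.
    const std::uint64_t base = load_base(0x555555554040ULL, 0x40);
    assert(base == 0x555555554000ULL);
    assert(runtime_address(base, 0x1000) == 0x555555555000ULL);
}
```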
<h3 id="interp">INTERP</h3>
<p><em>This segment usually contains one section: <code class="language-plaintext highlighter-rouge">.interp</code></em></p>
<p>This specifies where the interpreter is for loading shared library executables, and we even see the metadata tag <code class="language-plaintext highlighter-rouge">[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]</code> in the program header.</p>
<details class="red">
<summary class="collapse"><strong>What does this <code class="language-plaintext highlighter-rouge">ld-linux</code> thing do?</strong>
</summary>
<p>Thankfully, it is an executable with a very helpful help section. Let’s call it and see what it says:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ /lib64/ld-linux-x86-64.so.2
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables... This helper program loads the shared libraries needed by the program executable, prepares the program
to run, and runs it. You may invoke this helper program directly from the command line to load and run an ELF executable file; this is like executing that file itself, but always uses this helper program from the file you specified, instead of the helper program file specified in the executable file you run.
</code></pre></div> </div>
</details>
<p><em>TL;DR: <code class="language-plaintext highlighter-rouge">ld.so</code> is the dynamic loader. Programs that load shared libraries will invoke this dynamic loader to load the shared library executable. You usually don’t call this yourself, but you can. It’s like an <code class="language-plaintext highlighter-rouge">exec</code>.</em></p>
<p>We will be analyzing this in more detail later in the blog.</p>
<h3 id="load">LOAD</h3>
<p><em>This segment can contain many different sections, and there are multiple <code class="language-plaintext highlighter-rouge">LOAD</code>s per program. Some commonly occurring sections include <code class="language-plaintext highlighter-rouge">.interp .init .text .fini .dynamic .got .got.plt .data .bss </code></em></p>
<p><strong>This is the most important segment for a typical C++ program.</strong> It basically tells the loader to map a particular region of memory with particular permissions. In the above, we see that there are 4 <code class="language-plaintext highlighter-rouge">LOAD</code> segments. This only happens for the newer versions of <code class="language-plaintext highlighter-rouge">ld</code> (the <strong>static</strong> linker).</p>
<details class="red">
<summary class="collapse"><strong>Why do we need 4 segments?</strong>
</summary>
<p>We need different segments because two sections may need different permissions, and/or serve completely different purposes. These segments in C++ hold the following sections (roughly):</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">.text</code>, which holds the code to be executed. This should be in the <code class="language-plaintext highlighter-rouge">R E</code> segment.</li>
<li><code class="language-plaintext highlighter-rouge">.rodata</code>, which means <em>read-only data</em>. This usually holds static constants that are used in the program. This is in one of the <code class="language-plaintext highlighter-rouge">R</code> segments.</li>
<li><code class="language-plaintext highlighter-rouge">.data</code>, which is read/write data. This is in the <code class="language-plaintext highlighter-rouge">RW</code> segment. It isn’t executable, because otherwise a buffer overflow vulnerability could lead to execution of code placed in the data section. <code class="language-plaintext highlighter-rouge">.bss</code> is in this segment as well. The name doesn’t mean much now; just think of it as “zero-initialized data”. It contains global and static variables that are zero-initialized. <em>The reason this section exists is to save space in the executable itself</em>. (Imagine a lot of zero buffers adding to the executable’s size.)</li>
<li>The ELF header information (along with the program header table) is in the other <code class="language-plaintext highlighter-rouge">R</code> segment.</li>
</ul>
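<p>To make the list above concrete, here is a small sketch of where a typical <code class="language-plaintext highlighter-rouge">g++</code> build on Linux would be expected to place a few declarations. The placement in the comments is an assumption rather than a guarantee, since it depends on the compiler, flags and linker script. The names are invented for the example, and the result can be checked with <code class="language-plaintext highlighter-rouge">readelf -S</code> or <code class="language-plaintext highlighter-rouge">objdump -t</code>.</p>

```cpp
#include <cassert>

int zero_buffer[1024];                // expected in .bss: zero-initialized, no bytes stored in the file
int counter = 42;                     // expected in .data: initialized and writable
const char greeting[] = "hello";      // expected in .rodata: read-only constant
int add_one(int x) { return x + 1; }  // expected in .text: executable code

inline void sections_demo() {
    // .bss contents are zero at startup even though the file stores no zeros.
    assert(zero_buffer[0] == 0 && zero_buffer[1023] == 0);
    // .data is writable at runtime.
    counter = add_one(counter);
    assert(counter == 43);
    // .rodata can be read but must not be written.
    assert(greeting[0] == 'h');
}
```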
</details>
<p>The kernel is responsible for memory-mapping these segments into our runtime and for setting up our execution environment involving the stack, heap, our code, etc. Without these segments, we would not have executables.</p>
<p>It’s also important to note that the union of all sections within a segment may not be the entire segment itself. For example, the ELF header and the program header table are mapped as part of a <code class="language-plaintext highlighter-rouge">LOAD</code> segment but do not live in any of the sections within <code class="language-plaintext highlighter-rouge">LOAD</code>. Runtime constructs like our stack and heap are not described by any section either; the kernel sets them up separately.</p>
<h3 id="dynamic">DYNAMIC</h3>
<p><em>This segment usually contains one section: <code class="language-plaintext highlighter-rouge">.dynamic</code></em></p>
<p>If this executable requires dynamic linking, this segment tells us exactly what information is required. The <code class="language-plaintext highlighter-rouge">.dynamic</code> section in the ELF file shows you what shared libraries are required.</p>
<details class="red">
<summary class="collapse"><strong>How do we find the required shared libraries in the ELF file?</strong>
</summary>
<p>To view that information, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ readelf -d main
Dynamic section at offset 0x2e20 contains 23 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
...
</code></pre></div> </div>
<p>In reality, you don’t need to do this - instead, use <code class="language-plaintext highlighter-rouge">ldd</code> to find the dependencies yourself:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ldd main
linux-vdso.so.1 (0x00007fff23f7a000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007ff8ae1cc000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007ff8ae086000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007ff8ae06c000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007ff8adea6000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff8ae3d9000)
</code></pre></div> </div>
<p>As a follow-up, why does <code class="language-plaintext highlighter-rouge">ldd</code> tell us we have two more shared libraries than the ELF file?</p>
<p><code class="language-plaintext highlighter-rouge">ldd</code> tells us there are 2 more dynamic dependencies, namely <code class="language-plaintext highlighter-rouge">linux-vdso.so.1</code> and <code class="language-plaintext highlighter-rouge">/lib64/ld-linux-x86-64.so.2</code>. These two are actually shared objects that the kernel automatically maps into the address space: the vDSO for <strong>all user-space applications</strong>, and the dynamic linker for every dynamically linked one. <code class="language-plaintext highlighter-rouge">vdso</code> stands for “Virtual Dynamic Shared Object”, and contains utilities such as getting the current timestamp, which would be expensive if we had to jump to kernel-space to execute them. The other shared library is the (in)famous dynamic linker. It is responsible for loading other shared objects into the main runtime’s memory space.</p>
</details>
<p>Linker issues are the biggest headaches and they often involve <code class="language-plaintext highlighter-rouge">rpath, runpath, LD_LIBRARY_PATH</code>, and other variables that may or may not be baked into the <code class="language-plaintext highlighter-rouge">.dynamic</code> section of the ELF file. Knowing how this segment works is crucial to debugging many of the common linker errors. I highly recommend this <a href="https://amir.rachum.com/blog/2016/09/17/shared-libraries/">blogpost</a> if you’re running into a practical issue with dynamic linking <code class="language-plaintext highlighter-rouge">.so</code> files. It’s out of the scope of this blog.</p>
<h3 id="note">NOTE</h3>
<p><em>This segment sometimes contains the sections: <code class="language-plaintext highlighter-rouge">.note.gnu.build-id .note.ABI-tag</code>, but it varies.</em></p>
<p>This is a very free-style segment used by vendors or engineers to mark an object file with special information. This information is usually used to check for compatibility. A note segment is fairly small and we don’t really need to care much about it.</p>
<h3 id="gnu_eh_frame">GNU_EH_FRAME</h3>
<p><em>This segment usually contains one section: <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code></em></p>
<p>In here, <code class="language-plaintext highlighter-rouge">EH</code> stands for exception handling. This section is a sorted lookup table: it maps code addresses to the unwind information stored in <code class="language-plaintext highlighter-rouge">.eh_frame</code>, so that when an exception is thrown the runtime can quickly figure out how to unwind each frame and find the right handler. The same unwind information helps with those nice backtraces you get with <code class="language-plaintext highlighter-rouge">bt</code> in <code class="language-plaintext highlighter-rouge">gdb</code>.</p>
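<p>To make “frame unwinding” a bit more concrete, here is a small sketch of what it looks like from the C++ side. The names (<code class="language-plaintext highlighter-rouge">Guard</code>, <code class="language-plaintext highlighter-rouge">inner</code>, <code class="language-plaintext highlighter-rouge">outer</code>, <code class="language-plaintext highlighter-rouge">unwind_order</code>) are made up for illustration. It doesn’t touch the ELF tables directly; it only shows the behavior those tables make possible.</p>

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Appends a tag to a log when destroyed, so we can observe unwinding order.
struct Guard {
    std::string& log;
    char tag;
    ~Guard() { log += tag; }
};

void inner(std::string& log) {
    Guard g{log, 'i'};
    throw std::runtime_error("boom");  // unwinding starts here
}

void outer(std::string& log) {
    Guard g{log, 'o'};
    inner(log);
}

// Returns the order in which the frames were unwound and the handler ran.
std::string unwind_order() {
    std::string log;
    try {
        outer(log);
    } catch (const std::runtime_error&) {
        assert(log == "io");  // both frames are already torn down, innermost first
        log += 'c';
    }
    return log;
}
```

<p>The throw site knows nothing about <code class="language-plaintext highlighter-rouge">outer</code> or the <code class="language-plaintext highlighter-rouge">catch</code> block. The runtime has to walk up the stack frame by frame, running destructors along the way, and that walk is what the unwind tables describe.</p>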
<h3 id="gnu_stack">GNU_STACK</h3>
<p><em>This segment usually contains no sections</em></p>
<p>This is a lightweight header that tells us what permissions we gave the stack. The stack in general <em>should not be executable</em> for security reasons, so let’s see whether our binary is safe with <code class="language-plaintext highlighter-rouge">scanelf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ scanelf . -e
TYPE STK/REL/PTL FILE
ET_EXEC RW- R-- RW- ./main
</code></pre></div></div>
<p>We don’t allow execution on the stack - good news!</p>
<h3 id="gnu_relro">GNU_RELRO</h3>
<p><em>This segment usually contains the sections <code class="language-plaintext highlighter-rouge">.dynamic .got</code> and sometimes <code class="language-plaintext highlighter-rouge">.init_array .fini_array</code></em></p>
<p>This particular segment exists purely for protection against security vulnerabilities. ELF binaries have two tables called the <strong>Global Offset Table</strong>, otherwise known as GOT, and the <strong>Procedure Linkage Table</strong>, otherwise known as PLT.</p>
<details class="red">
<summary class="collapse"><strong>What do the GOT and PLT do?</strong>
</summary>
<p>The GOT stores the resolved addresses of global variables, functions and other symbols from dynamically loaded shared libraries. These addresses can change from run to run, depending on where each library gets loaded. A program may reference many library functions that it never actually calls, so to reduce the function resolution overhead, each one gets a small stub in the PLT that is <em>lazily bound</em> through the GOT. The steps to resolve a lazily bound function at runtime are outlined below:</p>
<ul>
<li>User code calls a function from a dynamically loaded library. The call actually targets that function’s PLT stub, whose address is fixed (unlike the address of the actual function).</li>
<li>The PLT stub jumps through the function’s GOT entry. If the function is still unresolved, that entry points back into the PLT, which redirects the call to the dynamic linker’s <code class="language-plaintext highlighter-rouge">_dl_runtime_resolve</code> routine.</li>
<li><code class="language-plaintext highlighter-rouge">_dl_runtime_resolve</code> writes the correct address of the function into the GOT entry.</li>
<li>The resolved function is called.</li>
<li>On later invocations, the jump through the GOT entry lands directly on the function, with no detour through the resolver.</li>
</ul>
</details>
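<p>The lazy resolution dance is easier to follow in code. Below is a toy model of it in plain C++, using a function pointer as the “GOT entry”. To be clear, this is not how the dynamic linker is implemented; every name here (<code class="language-plaintext highlighter-rouge">got_slot</code>, <code class="language-plaintext highlighter-rouge">resolver_stub</code>, <code class="language-plaintext highlighter-rouge">call_square</code>, <code class="language-plaintext highlighter-rouge">real_square</code>) is made up to illustrate the idea of patching a slot on first use.</p>

```cpp
#include <cassert>

using Fn = int (*)(int);

int real_square(int x) { return x * x; }  // stands in for a library function
int resolver_stub(int x);                  // stands in for the slow resolver path

Fn got_slot = resolver_stub;  // toy "GOT entry": starts out pointing at the stub
int resolve_count = 0;        // how many times the slow path ran

int resolver_stub(int x) {
    assert(got_slot == resolver_stub);  // only reachable while still unresolved
    ++resolve_count;
    got_slot = real_square;             // patch the slot with the real address
    return got_slot(x);                 // then finish the original call
}

// Toy "PLT stub": always an indirect call through the slot.
int call_square(int x) { return got_slot(x); }
```

<p>The first call to <code class="language-plaintext highlighter-rouge">call_square</code> pays for the resolution, and every call after that goes straight to <code class="language-plaintext highlighter-rouge">real_square</code>. Notice that <code class="language-plaintext highlighter-rouge">got_slot</code> has to stay writable for this to work, which is exactly the weakness that matters for security.</p>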
<p>The PLT and GOT are super important, and we need to stop malicious users from messing with them. <code class="language-plaintext highlighter-rouge">RELRO</code> in this case stands for <strong>RELocation Read Only</strong>. These in-memory structures are read-write in order to save the resolved addresses of the loaded functions, but that lends itself to security vulnerabilities. What if an attacker can use a buffer overflow to change an entry in the GOT so that it points to arbitrary code?</p>
<p>Well, one way is to resolve every function <em>eagerly</em> at load time, and then turn the section read-only before the user can screw around with the GOT. If the GOT is already fully populated before any user code executes, then turning it read-only with the <a href="http://man7.org/linux/man-pages/man2/mprotect.2.html">system call</a> <code class="language-plaintext highlighter-rouge">mprotect</code> closes that hole. Similar things are done for the dynamic section and the <code class="language-plaintext highlighter-rouge">.init_array</code> and <code class="language-plaintext highlighter-rouge">.fini_array</code> sections which we’ll discuss in the Assembly dump section.</p>
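<p>Here is a minimal sketch of the “fill everything in, then seal it” idea using the same system call. It assumes a POSIX system (I’ve only thought about it on Linux); the function name <code class="language-plaintext highlighter-rouge">fill_then_seal</code> and the toy table contents are made up, and this is of course not what the dynamic linker literally runs.</p>

```cpp
#include <sys/mman.h>  // mmap, mprotect, munmap
#include <unistd.h>    // sysconf
#include <cassert>
#include <cstddef>

// Fills a page with a toy "table", then seals it read-only.
// Returns true if every step succeeded.
bool fill_then_seal() {
    const long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    const std::size_t len = static_cast<std::size_t>(page_size);

    void* page = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return false;

    // "Eager resolution": populate every entry while the page is writable.
    auto* table = static_cast<long*>(page);
    for (int i = 0; i < 4; ++i) table[i] = 100 + i;

    // Seal the page. A later write to table[i] would now fault.
    const bool sealed = (mprotect(page, len, PROT_READ) == 0);

    // Reads still work after sealing.
    const bool readable = sealed && table[3] == 103;

    munmap(page, len);
    return readable;
}
```

<p>The ordering is the whole point: all writes happen before the <code class="language-plaintext highlighter-rouge">mprotect</code> call, so nothing legitimate ever needs write access afterwards.</p>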
<h2 id="recap">Recap</h2>
<p>So now that we’ve seen what each of these types of segments are used for, let’s recap:</p>
<ol>
<li>The ELF header contains metadata about the program.</li>
<li>Each segment is mapped to memory somewhere. Two segments may overlap if they are not both of type <code class="language-plaintext highlighter-rouge">LOAD</code>. This is how sections may live in two segments.</li>
<li><strong><code class="language-plaintext highlighter-rouge">LOAD</code> is the most important part of our program.</strong> It’s directly mapped into memory with the relevant permissions.</li>
<li>The dynamic linker reads the <code class="language-plaintext highlighter-rouge">DYNAMIC</code> segment to find out which shared objects to load.</li>
<li>We have particular segments like <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> and <code class="language-plaintext highlighter-rouge">GNU_STACK</code> for security.</li>
</ol>
<p>Below is a diagram for clarity:</p>
<p><img src="http://oneraynyday.github.io/assets/elf_format.png" alt="elf_format" /></p>
<h1 id="what-does-g-do">What does <code class="language-plaintext highlighter-rouge">g++</code> do?</h1>
<p><em>Our goal for this section is to illustrate the different steps in turning our <code class="language-plaintext highlighter-rouge">main.cpp</code> file into <code class="language-plaintext highlighter-rouge">main</code> the executable.</em></p>
<p><strong>To make this clear, <code class="language-plaintext highlighter-rouge">g++</code> is not a compiler.</strong> A C++ compiler’s job is to read C++ code and generate the proper assembly instructions (or some intermediate representation like LLVM IR), which then get assembled into object files (<code class="language-plaintext highlighter-rouge">.o</code>’s), one per translation unit. If <code class="language-plaintext highlighter-rouge">g++</code> only created <code class="language-plaintext highlighter-rouge">.o</code>’s, we would not be able to execute anything <code class="language-plaintext highlighter-rouge">g++</code> creates for us. <em>Then what is <code class="language-plaintext highlighter-rouge">g++</code>?</em></p>
<p><strong><code class="language-plaintext highlighter-rouge">g++</code> is actually a thin wrapper that dispatches multiple different tools including preprocessors, compilers, and linkers to create whatever you want, whether it be <code class="language-plaintext highlighter-rouge">.o</code>’s, <code class="language-plaintext highlighter-rouge">.so</code>’s, or executables.</strong> If you say “<code class="language-plaintext highlighter-rouge">g++</code> compiler”, people will assume you mean <code class="language-plaintext highlighter-rouge">cc1plus</code>, but don’t think that <code class="language-plaintext highlighter-rouge">g++</code> itself is a compiler!</p>
<hr />
<h2 id="preprocessor-cpp">Preprocessor (<code class="language-plaintext highlighter-rouge">cpp</code>)</h2>
<p>The <strong>preprocessor</strong> in the context of C++ is something that takes your macros and turns them into actual values before feeding the resulting C++ program to a compiler. This also includes the <code class="language-plaintext highlighter-rouge">#include</code> directive for header files. It would usually take something like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Output:</span>
<span class="c1">// ❯ cpp main.cpp (can also invoke with g++ -E main.cpp)</span>
<span class="c1">// ...</span>
<span class="c1">// int x = 1;</span>
<span class="c1">//</span>
<span class="c1">// int main() {}</span>
<span class="cp">#define MACRO
#ifdef MACRO
</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="cp">#else
</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="cp">#endif
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{}</span>
</code></pre></div></div>
<p>This is the simplest part of the workflow for creating an executable.</p>
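<p>Macros that take arguments go through the same kind of text substitution. Here’s a small sketch (the macro and function names are made up for illustration) showing that the preprocessor pastes tokens around without knowing anything about C++ expressions:</p>

```cpp
#include <cassert>
#include <cstring>

#define SQUARE(x) ((x) * (x))  // function-like macro: pure text substitution
#define STRINGIFY(x) #x        // '#' turns the argument's tokens into a string literal

int squared_sum() { return SQUARE(1 + 2); }         // expands to ((1 + 2) * (1 + 2))
const char* spelled() { return STRINGIFY(1 + 2); }  // expands to "1 + 2"

void check_expansions() {
    assert(squared_sum() == 9);
    assert(std::strcmp(spelled(), "1 + 2") == 0);
}
```

<p>The parentheses in <code class="language-plaintext highlighter-rouge">SQUARE</code> matter precisely because the substitution is textual: without them, <code class="language-plaintext highlighter-rouge">1 + 2 * 1 + 2</code> would be what the compiler sees.</p>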
<hr />
<h2 id="compiler-cc1plus">Compiler (<code class="language-plaintext highlighter-rouge">cc1plus</code>)</h2>
<p><em>The compiler is such a complicated beast that I can’t possibly talk about it in detail in this post(nor do I have the expertise to talk about the C++ compiler in detail).</em></p>
<p>The part of <code class="language-plaintext highlighter-rouge">g++</code> that <em>actually compiles things</em> is called <code class="language-plaintext highlighter-rouge">cc1plus</code>, and it generates assembly code from C++ code. This particular compiler is broken up into 3 logical sections - the “front end”, “middle end”, and the “back end”. For a simple program like ours, we can go through each part and see how our simple source code compiles into assembly.</p>
<h3 id="front-end">Front End</h3>
<p>The front end of the <code class="language-plaintext highlighter-rouge">cc1plus</code> compiler takes in C++ code, and transforms it into an <strong>abstract syntax tree, or AST.</strong> The AST is a tree-shaped representation of the structure of our code. The entire program is the root node of the tree; entities like function definitions, global variable declarations and other statements in the global scope are the children of the root; and the tree continues recursively until we reach a primitive token like an identifier or a literal value.</p>
<p>On <code class="language-plaintext highlighter-rouge">clang</code>, we are able to compile C++ programs and obtain a <code class="language-plaintext highlighter-rouge">.dot</code> file detailing the AST it builds during compilation. The <code class="language-plaintext highlighter-rouge">int main() {}</code> program isn’t very exciting…</p>
<p><img src="http://oneraynyday.github.io/assets/main_ast.png" alt="main_ast.png" /></p>
<p>The definition for <code class="language-plaintext highlighter-rouge">main</code> is considered a <code class="language-plaintext highlighter-rouge">CompoundStmt</code>, and we have nothing inside the definition, so that’s it for the AST. So let’s look at a <em>slightly</em> more complicated program to see an example of what the AST looks like in general:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span><span class="p">;</span>
<span class="k">return</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Pretty simple - we return the value 6. However, the AST that includes these symbols tells us a lot about the structure:</p>
<p><img src="http://oneraynyday.github.io/assets/main_ast2.png" alt="main_ast2.png" /></p>
<details class="red">
<summary class="collapse"><strong>What does the above AST tell us?</strong>
</summary>
<p>As you can see, since we have two declarations of <code class="language-plaintext highlighter-rouge">int</code> types, we have two nodes called <code class="language-plaintext highlighter-rouge">DeclStmt</code>, and <code class="language-plaintext highlighter-rouge">3</code> is an <code class="language-plaintext highlighter-rouge">IntegerLiteral</code>, which is a token that cannot be expanded into further symbols. The second branch adds the variable <code class="language-plaintext highlighter-rouge">x</code> and a literal together. Before the addition, <code class="language-plaintext highlighter-rouge">x</code> (which is a <code class="language-plaintext highlighter-rouge">DeclRefExpr</code>) goes through an implicit cast even though it’s already an <code class="language-plaintext highlighter-rouge">int</code>; the cast is what reads the variable’s value (an lvalue-to-rvalue conversion). We also have a return statement which returns the <code class="language-plaintext highlighter-rouge">DeclRefExpr</code> corresponding to <code class="language-plaintext highlighter-rouge">y</code>. The compiler traverses the tree to eventually generate assembly code according to these instructions.</p>
</details>
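<p>If the idea of “traversing the tree” feels abstract, here is a toy version of it. The node names are borrowed from the clang diagram, but the classes themselves are entirely made up for illustration and have nothing to do with clang’s or GCC’s real data structures. The sketch evaluates the tree instead of generating assembly, since the recursive walk is the part worth seeing.</p>

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Toy AST: names mirror the diagram, the implementation is hypothetical.
struct Expr {
    virtual ~Expr() = default;
    virtual int eval(const std::map<std::string, int>& vars) const = 0;
};

struct IntegerLiteral : Expr {
    int value;
    explicit IntegerLiteral(int v) : value(v) {}
    int eval(const std::map<std::string, int>&) const override { return value; }
};

struct DeclRefExpr : Expr {
    std::string name;
    explicit DeclRefExpr(std::string n) : name(std::move(n)) {}
    int eval(const std::map<std::string, int>& vars) const override {
        return vars.at(name);  // throws if the variable was never declared
    }
};

struct BinaryOperator : Expr {  // only '+' in this sketch
    std::unique_ptr<Expr> lhs, rhs;
    BinaryOperator(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs);
    }
    int eval(const std::map<std::string, int>& vars) const override {
        return lhs->eval(vars) + rhs->eval(vars);  // recurse into both children
    }
};

// Builds and walks the tree for:  int x = 3;  int y = 3 + x;  return y;
int run_example() {
    std::map<std::string, int> vars;
    vars["x"] = IntegerLiteral(3).eval(vars);
    BinaryOperator sum(std::make_unique<IntegerLiteral>(3),
                       std::make_unique<DeclRefExpr>("x"));
    vars["y"] = sum.eval(vars);
    return DeclRefExpr("y").eval(vars);
}
```

<p>A real compiler does the same kind of recursive visit, except each node emits lower-level code instead of computing a number.</p>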
<p>As far as I know, you can’t create a viewable AST from <code class="language-plaintext highlighter-rouge">g++</code>. However, we can get the control flow of the program and visualize it. To create the control flow for this program, pass the <code class="language-plaintext highlighter-rouge">-fdump-tree-all-graph</code> flag during compilation and you’ll get a ton of <code class="language-plaintext highlighter-rouge">.dot</code> files which you can visualize with <code class="language-plaintext highlighter-rouge">graphviz</code>. Here’s our flow graph for <code class="language-plaintext highlighter-rouge">int main() {}</code>:</p>
<p><img src="http://oneraynyday.github.io/assets/main_flow.png" alt="main_flow.png" /></p>
<p>What is this language? We’ll see that the code in each block is called <code class="language-plaintext highlighter-rouge">GIMPLE</code>, a language used for the middle end section. This language is fairly simple to read for our application, and we can see it returns 0 even though we put nothing in the body. This is generated by the compiler in the special case of <code class="language-plaintext highlighter-rouge">main</code> according to the C++ standard:</p>
<blockquote>
<p>If control reaches the end of main without encountering a return statement, the effect is that of executing <code class="language-plaintext highlighter-rouge">return 0</code></p>
</blockquote>
<p>Cool! So now we know that the compiler creates a flow graph and an AST. What happens next?</p>
<h3 id="middle-end">Middle End</h3>
<p>The middle end of <code class="language-plaintext highlighter-rouge">cc1plus</code> is arguably the biggest and most important part. It works on an intermediate representation that is neither C++ nor assembly, and this extra step of compilation is what makes powerful optimizations possible. As mentioned, the middle end’s language is called <code class="language-plaintext highlighter-rouge">GIMPLE</code> (a subset of a language called <code class="language-plaintext highlighter-rouge">GENERIC</code>), in which a single expression involves at most 3 operands. Merrill’s paper on GIMPLE explains the transformation from C++ to GIMPLE for some basic operations (Merrill, 2003). Here’s our simple program in GIMPLE:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">main</span> <span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">D</span><span class="mf">.2067</span><span class="p">;</span>
<span class="n">D</span><span class="mf">.2067</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">D</span><span class="mf">.2067</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<details class="red">
<summary class="collapse"><strong>Well… that’s a bit boring, can you give a more interesting example?</strong>
</summary>
<p>Let’s take a look at what GIMPLE can actually do, by using two basic rules in an example:</p>
<p><strong>Any expression involving more than 3 operands is broken up:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = b + c + d;
--- Turns into ---
T1 = b + c;
a = T1 + d;
</code></pre></div> </div>
<p><strong>Conditional (ternary) expressions are explicitly transformed into <code class="language-plaintext highlighter-rouge">if-else</code> statements:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = b ? c : d;
--- Turns into ---
if (b)
T1 = c;
else
T1 = d;
a = T1;
</code></pre></div> </div>
<p>Let’s look at an example of these rules:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">z</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="o">?</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">:</span> <span class="n">x</span> <span class="o">-</span> <span class="n">y</span> <span class="o">-</span> <span class="mi">3</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div> </div>
<p>We’re gonna need at least three temporary variables - one to break up the compound expression <code class="language-plaintext highlighter-rouge">x + y + 3</code>, one to break up the compound expression <code class="language-plaintext highlighter-rouge">x - y - 3</code>, and another to act as a temporary for the ternary expression’s value to assign to <code class="language-plaintext highlighter-rouge">z</code>. If we dump the GIMPLE representation for this (the <code class="language-plaintext highlighter-rouge">.gimple</code> file can be found in the entire dump from <code class="language-plaintext highlighter-rouge">-fdump-tree-all-graph</code>), we see the following:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">main</span> <span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">iftmp</span><span class="mf">.0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">D</span><span class="mf">.2074</span><span class="p">;</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">z</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="n">y</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="k">goto</span> <span class="o"><</span><span class="n">D</span><span class="mf">.2071</span><span class="o">></span><span class="p">;</span> <span class="k">else</span> <span class="k">goto</span> <span class="o"><</span><span class="n">D</span><span class="mf">.2072</span><span class="o">></span><span class="p">;</span>
<span class="o"><</span><span class="n">D</span><span class="mf">.2071</span><span class="o">>:</span>
<span class="n">_1</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="n">iftmp</span><span class="mf">.0</span> <span class="o">=</span> <span class="n">_1</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
<span class="k">goto</span> <span class="o"><</span><span class="n">D</span><span class="mf">.2073</span><span class="o">></span><span class="p">;</span>
<span class="o"><</span><span class="n">D</span><span class="mf">.2072</span><span class="o">>:</span>
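<span class="n">_2</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="p">;</span>
<span class="n">iftmp</span><span class="mf">.0</span> <span class="o">=</span> <span class="n">_2</span> <span class="o">-</span> <span class="mi">3</span><span class="p">;</span>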
<span class="o"><</span><span class="n">D</span><span class="mf">.2073</span><span class="o">>:</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">iftmp</span><span class="mf">.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">D</span><span class="mf">.2074</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">D</span><span class="mf">.2074</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div> </div>
<p>As you can see, the ternary operator decayed into an <code class="language-plaintext highlighter-rouge">if else</code> statement with <code class="language-plaintext highlighter-rouge">goto</code>’s, and we use three temporaries. <code class="language-plaintext highlighter-rouge">_1</code> is used to compute an intermediate value in the expression <code class="language-plaintext highlighter-rouge">x + y + 3</code>, <code class="language-plaintext highlighter-rouge">_2</code> is used to compute an intermediate value in the expression <code class="language-plaintext highlighter-rouge">x - y - 3</code>, and <code class="language-plaintext highlighter-rouge">iftmp.0</code> is a variable used to hold the value of either branch to be assigned to <code class="language-plaintext highlighter-rouge">z</code>. We actually need one more variable, <code class="language-plaintext highlighter-rouge">D.2074</code>, to return a value of 0 for our main function.</p>
</details>
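<p>One way to convince yourself that this lowering doesn’t change anything is to write it by hand. Below is a sketch with the ternary expression from the example next to a hand-lowered version that follows the two GIMPLE rules: at most one operation per statement, and branches spelled as <code class="language-plaintext highlighter-rouge">goto</code>’s. The function and label names are mine, not something GCC emits.</p>

```cpp
#include <cassert>

// The original, high-level form.
int ternary(int x, int y) {
    return (x == y) ? x + y + 3 : x - y - 3;
}

// The same computation, hand-lowered in the style of GIMPLE:
// one operation per statement, explicit temporaries, branches as gotos.
int lowered(int x, int y) {
    int t1, t2, iftmp;  // declared up front so no goto skips an initialization
    if (x == y) goto then_branch; else goto else_branch;
then_branch:
    t1 = x + y;
    iftmp = t1 + 3;
    goto join;
else_branch:
    t2 = x - y;
    iftmp = t2 - 3;
join:
    return iftmp;
}

// Both forms must agree for any input.
void check(int x, int y) { assert(lowered(x, y) == ternary(x, y)); }
```

<p>It’s uglier to read, but every statement is now trivial to analyze on its own, which is the property the optimizer cares about.</p>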
<p>GIMPLE is a simple language which allows for powerful optimizations. The simpler the language, the more optimization rules can be applied in general, and that’s one of the reasons for C++’s blazing fast performance.</p>
<p><em>NOTE: The middle end doesn’t only contain GIMPLE, but rather many other intermediate representations. This is a simplification.</em></p>
<h3 id="back-end">Back End</h3>
<p>The back end for <code class="language-plaintext highlighter-rouge">cc1plus</code> is responsible for taking whatever intermediate representation the middle end gives it and lowering it through a series of representations that follow the rules of the target architecture. This means producing code that starts to look more and more like ARM, MIPS, or x86 assembly, which are the destination languages. In this phase, we need to start caring about the fact that <em>we have a finite number of registers and an arbitrary number of variables</em>. Deciding which variables live in which registers is a whole topic in itself, called <strong>register allocation</strong>. We won’t discuss it in detail here.</p>
<hr />
<h2 id="assembler-as">Assembler (<code class="language-plaintext highlighter-rouge">as</code>)</h2>
<p>What we generate from <code class="language-plaintext highlighter-rouge">cc1plus</code> is not actually code that our computers can run. <em>There’s a difference between assembly language and machine code!</em> Machine code is represented as binary and is practically unreadable to humans. It’s usually the lowest level of language we work with until we play with circuits ourselves, since the CPU can decode and execute it directly. Here’s an example of assembly that’s “readable”:</p>
<pre><code class="language-assembly"># AT&T syntax (immediate values are prefixed with $)
push $0x0 # Push value 0 on the stack
mov $0x0, %eax # Move value 0 into register eax
</code></pre>
<p>And the corresponding machine code(in hex):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6a 00
b8 00 00 00 00
</code></pre></div></div>
<p>The latter is what we’ll have in the final executable.</p>
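<p>To demystify what the assembler does for these two instructions, here is a tiny hand-rolled encoder. It only knows the two x86 encodings used above (opcode <code class="language-plaintext highlighter-rouge">6a</code> for a push of an 8-bit immediate, and <code class="language-plaintext highlighter-rouge">b8</code> for a move of a 32-bit immediate into <code class="language-plaintext highlighter-rouge">eax</code>); the function names are made up, and a real assembler is obviously far more involved.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// PUSH imm8: opcode 0x6a followed by the one-byte immediate.
std::vector<std::uint8_t> encode_push_imm8(std::int8_t imm) {
    return {0x6a, static_cast<std::uint8_t>(imm)};
}

// MOV eax, imm32: opcode 0xb8 followed by the immediate in little-endian order.
std::vector<std::uint8_t> encode_mov_eax_imm32(std::uint32_t imm) {
    std::vector<std::uint8_t> bytes{0xb8};
    for (int shift = 0; shift < 32; shift += 8) {
        bytes.push_back(static_cast<std::uint8_t>((imm >> shift) & 0xff));
    }
    assert(bytes.size() == 5);  // one opcode byte plus four immediate bytes
    return bytes;
}
```

<p>Calling these with an immediate of 0 reproduces the two hex lines above, and passing something like <code class="language-plaintext highlighter-rouge">0x12345678</code> to the second one makes the little-endian byte order visible.</p>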
<details class="red">
<summary class="collapse"><strong>So after we compile our C++ code to assembly language, the assembler should give us machine code right?</strong>
</summary>
<p><strong>Almost but not quite!</strong> There is actually another intermediate form that is a layer above machine code - <strong>object code</strong>. This is what the assembler actually outputs. Object code is basically machine code that lives inside of an object file, with jump addresses left as placeholders for the linker to fill in. In position independent executables (<code class="language-plaintext highlighter-rouge">pie</code>), we don’t know the absolute addresses that the <code class="language-plaintext highlighter-rouge">jmp</code> instructions in individual translation units should target, so it’s the static linker’s job to fill these in and turn the object code into actual machine code.</p>
</details>
<hr />
<h2 id="static-linker-ld">Static Linker (<code class="language-plaintext highlighter-rouge">ld</code>)</h2>
<p><strong>After we generate the object code from <code class="language-plaintext highlighter-rouge">as</code>, organization of the memory space needs to be done.</strong> If we have multiple translation units (think of them as <code class="language-plaintext highlighter-rouge">.o</code>’s or static libraries), how are we going to piece them together into a single executable? Well, that’s the job for <code class="language-plaintext highlighter-rouge">ld</code>!</p>
<hr />
<h3 id="disclaimer-static-linker-and-dynamic-linkers-are-not-the-same-thing">Disclaimer: Static linker and dynamic linkers are NOT the same thing!</h3>
<p>As you might remember from the <code class="language-plaintext highlighter-rouge">DYNAMIC</code> segment of the ELF file, we list a couple of needed dynamic dependencies. It is the job of the <strong>dynamic linker</strong> to load those dependencies somewhere in memory at runtime. The <strong>static linker</strong> is responsible for organizing a ton of object files and static libraries into a single executable. <strong>They are NOT the same!</strong> For the dynamic linker, the final executable ONLY holds the names of the dependencies (literally a few strings), but for the static linker it holds the sections, code, everything from the translation units.</p>
<p>Usually, <code class="language-plaintext highlighter-rouge">ld</code> is considered the static linker. <code class="language-plaintext highlighter-rouge">ld.so</code> is the dynamic linker. Don’t get them mixed up!</p>
<details class="red">
<summary class="collapse"><strong>You may be asking - why do we need two different types of linkers? Can’t we just have the static linker pull in all dependencies and create a standalone executable, or have the dynamic linker load all dependencies at runtime?</strong>
</summary>
<p>Yes - theoretically if every library you ever needed existed in both the static form (<code class="language-plaintext highlighter-rouge">.lib</code>, <code class="language-plaintext highlighter-rouge">.a</code>), and the shared form (<code class="language-plaintext highlighter-rouge">.so</code>), then you could have a 100% statically linked or 100% dynamically linked executable. But what practical issues arise from this?</p>
<p>If we statically link everything, including the standard library, <em>we would be pulling in a lot of code.</em> Your executable will turn out to be orders of magnitude larger than you originally expected, which makes it very difficult to run on a computer with memory/disk constraints. In addition, libraries under active development constantly improve with new features with minimal disruption to the user interface. If you wanted to use the newer version of any library, <em>you’d have to build the entire program again!</em> There are also some things you just cannot do with static linking, like the <a href="http://man7.org/linux/man-pages/man8/ld.so.8.html"><code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> trick</a> to override library functions (for example, the C library’s wrappers around system calls), which requires dynamic linkage.</p>
<p>On the other hand, dynamic linking is a huge nightmare with dependencies. You may have a slim executable with minimal code bloat, but every time you want to run your executable in a brand new environment, you require <em>all of your dynamically linked libraries to exist</em>. On top of that, they need to be in the same discovery path, which is a convoluted set of rules involving environment variables and baked-in paths during compilation. In addition, <strong>a key turn-off for dynamic linking is how slow it is.</strong> In a performance-sensitive system, the fact that calls into dynamically linked libraries require an indirection through the GOT/PLT could be costly. The compiler also doesn’t have the full picture of your procedures (all it has is the name of your shared library), so it can’t make assumptions and propagate optimizations as it would with a static library. For financial firms, this is a dealbreaker.</p>
</details>
<hr />
<p>The static linker reads a set of instructions in the <strong>linker script</strong>, which is a file written in a special language made only for <code class="language-plaintext highlighter-rouge">ld</code>.</p>
<details class="red">
<summary class="collapse"><strong>How do we view the linker script for our simple program? (Example)</strong>
</summary>
<p>If we use the below flags with verbose linkage, we’ll see the <strong>linker script</strong> actually being emitted (major parts redacted) in the <code class="language-plaintext highlighter-rouge">g++</code> driver:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ make linker
g++ -O0 -fverbose-asm -no-pie -o /dev/null -x c main.cpp -Wl,--verbose
...
using internal linker script:
==================================================
...
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SEARCH_DIR("/usr/x86_64-pc-linux-gnu/lib64"); ...
SECTIONS
{
...
.interp : { *(.interp) }
...
.rela.dyn :
{
*(.rela.init)
*(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
...
*(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
*(.rela.ifunc)
}
...
.plt : { *(.plt) *(.iplt) }
.plt.got : { *(.plt.got) }
.plt.sec : { *(.plt.sec) }
...
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
PROVIDE_HIDDEN (__init_array_end = .);
}
.fini_array :
{
...
}
...
}
</code></pre></div> </div>
<p>As expected, the linker script tells us at the beginning that we are generating a 64-bit ELF file for the x86-64 architecture. Upon execution of the program, we start at a symbol called <code class="language-plaintext highlighter-rouge">_start</code>, and our shared objects are found in the <code class="language-plaintext highlighter-rouge">SEARCH_DIR</code> paths (which can be modified by the <code class="language-plaintext highlighter-rouge">rpath</code> or <code class="language-plaintext highlighter-rouge">runpath</code> variables during compilation). Then, the linker script describes exactly how the sections are laid out in our executable.</p>
</details>
<p>Understanding the syntax for the linker script is actually not too hard. The most important part of a linker script is the <code class="language-plaintext highlighter-rouge">SECTIONS</code> block. Each scope explains the organization of a particular section. If we recall from the ELF header section, we need to put our interpreter information somewhere in memory, and each object file may have its own interpreter information. Where are we going to put our <code class="language-plaintext highlighter-rouge">.interp</code> section in the final executable? It usually looks like this in the linker script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> .interp : { *(.interp) }
</code></pre></div></div>
<p>This is saying the <code class="language-plaintext highlighter-rouge">.interp</code> section of the output is built from all the <code class="language-plaintext highlighter-rouge">.interp</code> sections the linker was able to find in its input files, in some sequence. The <code class="language-plaintext highlighter-rouge">*</code> is a wildcard that matches every input file, and <code class="language-plaintext highlighter-rouge">(.interp)</code> selects the section with that name from each of them.</p>
<p>Let’s look at a slightly more complicated example (taken from the above generated linker script):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.rela.dyn :
{
*(.rela.init)
*(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
...
*(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
*(.rela.ifunc)
}
</code></pre></div></div>
<p>Similarly, we see the <code class="language-plaintext highlighter-rouge">.rela.dyn</code> section being defined by all of the <code class="language-plaintext highlighter-rouge">.rela.init</code> sub-sections first, and then the <code class="language-plaintext highlighter-rouge">.rela.text</code> sections laid out after. The reason the linker script does not say <code class="language-plaintext highlighter-rouge">.rela.dyn : { *(.rela) }</code> is that the result would’ve been laid out without the subsections (like <code class="language-plaintext highlighter-rouge">.rela.init</code>) being grouped together.</p>
<p>Then, we see the PLT sections along with the GOT being laid out with <code class="language-plaintext highlighter-rouge">.plt</code>, <code class="language-plaintext highlighter-rouge">.plt.got</code> and <code class="language-plaintext highlighter-rouge">.plt.sec</code> sections defined in the example linker script above.</p>
<p><strong>Then, we see two sections that we weren’t familiar with before - <code class="language-plaintext highlighter-rouge">.init_array</code> and <code class="language-plaintext highlighter-rouge">.fini_array</code>.</strong> These hold pointers to the global constructors for statically initialized objects, which are called before the first line of code in <code class="language-plaintext highlighter-rouge">main</code> is actually executed, and to the destructors for the same objects, which are called upon exit of the <code class="language-plaintext highlighter-rouge">main</code> function. Within each translation unit (by definition, a single <code class="language-plaintext highlighter-rouge">.c/.cpp</code> file with all of the header files included) we will have <code class="language-plaintext highlighter-rouge">.init_array</code> sections containing global constructors.</p>
<p>Recall my <a href="https://oneraynyday.github.io/dev/2017/08/28/Essential-C++-1/#issue-with-static--singleton-design">blogpost from a while back</a>, where I said you can’t have static constructors that depend on each other across translation units. This is because the linker can choose to order the calls of global constructors in whichever way it wants. Calling another static variable’s methods during static construction will give you undefined behavior <em>unless you can go into the linker script and force one translation unit’s constructors to be placed first.</em> (Or, just don’t do something so stupid)</p>
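<p>A common way to sidestep the ordering problem without touching the linker script is the construct-on-first-use idiom. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">Config</code>, <code class="language-plaintext highlighter-rouge">Logger</code> and the accessor functions are hypothetical names, and in a real project the two objects would live in different translation units.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical object that other static objects depend on.
struct Config {
    std::string name = "default";
};

// Construct-on-first-use: a function-local static is initialized the first
// time control passes through its declaration (thread-safe since C++11),
// not at some linker-chosen point before main.
Config& global_config() {
    static Config instance;
    return instance;
}

// Hypothetical dependent object; imagine it defined in another translation unit.
struct Logger {
    std::string prefix;
    // Safe: global_config() constructs the Config on demand if needed.
    Logger() : prefix(global_config().name + ": ") {}
};

Logger& global_logger() {
    static Logger instance;
    return instance;
}
```

<p>Because each object is created when it is first requested, the order of the entries in <code class="language-plaintext highlighter-rouge">.init_array</code> no longer matters for these two objects. The trade-off is a small check on every access, and destruction order still needs care.</p>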
<h2 id="recap-1">Recap</h2>
<p>So what did we learn about the <code class="language-plaintext highlighter-rouge">g++</code> driver?</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">g++</code> is composed of 4 main parts - the <strong>preprocessor, the compiler, the assembler and the linker.</strong></li>
<li>The preprocessor expands the macros in our C++ program into their actual values.</li>
<li>The compiler uses a set of rules to traverse through our source code and generate the assembly code using the semantics of our program. It’s super complicated.</li>
<li>The assembler takes the assembly code and generates <strong>object code</strong> (not machine code).</li>
<li>The linker gets instructions via the <strong>linker script</strong> to group our sections in the ELF object files in a particular organization to create an executable.</li>
</ol>
<p>Below is a diagram for clarity:</p>
<p><img src="http://oneraynyday.github.io/assets/compiler_driver.png" alt="compiler_driver" /></p>
<p>Now that we’ve understood what <code class="language-plaintext highlighter-rouge">g++</code> does roughly, let’s actually look at the emitted assembly code placed in these sections!</p>
<hr />
<h1 id="analyzing-generated-procedures-objdump">Analyzing Generated Procedures: <code class="language-plaintext highlighter-rouge">Objdump</code></h1>
<p><em>Our goal in this section is to understand what procedures are generated from the compiler, what they’re used for, and in what order they are executed.</em></p>
<p>The <code class="language-plaintext highlighter-rouge">objdump</code> command quite literally dumps an object file’s information. The command I used is in the Makefile above and can be invoked by <code class="language-plaintext highlighter-rouge">make dump</code>. For such a small program, we have a surprising number of sections to analyze. However, all of them are important.</p>
<h2 id="main---the-dumb-and-obvious"><code class="language-plaintext highlighter-rouge">main</code> - The dumb and obvious</h2>
<p>Let’s see the most obvious function that we were expecting: the <code class="language-plaintext highlighter-rouge">main</code> function.</p>
<pre><code class="language-assembly">0000000000401106 <main>:
# Save the caller's frame pointer (rbp) on the stack
401106: 55 push rbp
# Put current stack frame into rbp
401107: 48 89 e5 mov rbp,rsp
# Put return code 0 into eax
40110a: b8 00 00 00 00 mov eax,0x0
# Restore the caller's frame pointer
40110f: 5d pop rbp
# Return function
401110: c3 ret
# The next two lines are multi-byte no-ops for padding.
401111: 66 2e 0f 1f 84 00 00 00 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
40111b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
</code></pre>
<p>Nothing too fancy here - the function returns a literal value of 0 and does nothing else. Let’s find the more interesting sections of the assembly dump and analyze them.</p>
<h2 id="_start---true-start-of-the-program"><code class="language-plaintext highlighter-rouge">_start</code> - True start of the program</h2>
<pre><code class="language-assembly">Disassembly of section .text:
0000000000401020 <_start>:
# End branch instruction (explained in the _init section below)
401020: f3 0f 1e fa endbr64
# sets ebp to 0
401024: 31 ed xor ebp,ebp
401026: 49 89 d1 mov r9,rdx
# Takes the top of the stack (argument), which is `argc` for C++.
# Pop also causes rsp to increase by 8 bytes (stack grows from high to low)
401029: 5e pop rsi
# Moves `argv` pointer value into rdx.
40102a: 48 89 e2 mov rdx,rsp
# Align the stack pointer down to a 16-byte boundary.
40102d: 48 83 e4 f0 and rsp,0xfffffffffffffff0
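# argc (rsi) and argv (rdx) are already in registers.
# The two pushes below place on the stack:
# - rax (garbage, keeps the stack 16-byte aligned)
# - rsp (the aligned stack pointer, passed as stack_end)
# ... before calling __libc_start_main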
401031: 50 push rax
401032: 54 push rsp
401033: 49 c7 c0 90 11 40 00 mov r8,0x401190 # addr of __libc_csu_fini
40103a: 48 c7 c1 20 11 40 00 mov rcx,0x401120 # addr of __libc_csu_init
401041: 48 c7 c7 06 11 40 00 mov rdi,0x401106 # addr of main
401048: ff 15 9a 2f 00 00 call QWORD PTR [rip+0x2f9a] # 403fe8 <__libc_start_main@GLIBC_2.2.5>
40104e: f4 hlt
40104f: 90 nop
</code></pre>
<p>This is the first function in <code class="language-plaintext highlighter-rouge">.text</code> and is the first function the program executes. Basically, <code class="language-plaintext highlighter-rouge">_start</code> prepares the system to call the <code class="language-plaintext highlighter-rouge">__libc_start_main</code> function, which takes in the following arguments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int __libc_start_main(int *(main) (int, char * *, char * *),
int argc,
char * * ubp_av,
void (*init) (void),
void (*fini) (void),
void (*rtld_fini) (void),
void (* stack_end));
</code></pre></div></div>
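<p>Which seems to line up with the arguments the assembly code is preparing prior to calling the function. The first six arguments go into registers (<code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>, <code class="language-plaintext highlighter-rouge">r9</code>), and the seventh, <code class="language-plaintext highlighter-rouge">stack_end</code>, overflows onto the stack via <code class="language-plaintext highlighter-rouge">push</code>. The <code class="language-plaintext highlighter-rouge">ubp_av</code> variable in <code class="language-plaintext highlighter-rouge">rdx</code> points at <code class="language-plaintext highlighter-rouge">argv</code> on the stack.</p>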
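<p>To make the register assignment easier to follow, here is a tiny sketch of the System V AMD64 rule for integer and pointer arguments. The helper name <code class="language-plaintext highlighter-rouge">integer_arg_location</code> is made up for this illustration; it only encodes the calling convention and is not part of libc or any toolchain.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns where the System V AMD64 calling convention places the n-th
// (0-based) integer or pointer argument of a function call.
std::string integer_arg_location(std::size_t index) {
    static const char* const kRegisters[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    constexpr std::size_t kRegisterCount = sizeof(kRegisters) / sizeof(kRegisters[0]);
    if (index < kRegisterCount) {
        return kRegisters[index];
    }
    // Arguments beyond the sixth are passed on the stack.
    return "stack";
}
```

<p>Reading the <code class="language-plaintext highlighter-rouge">__libc_start_main</code> signature with this rule: <code class="language-plaintext highlighter-rouge">main</code> is argument 0 (<code class="language-plaintext highlighter-rouge">rdi</code>), <code class="language-plaintext highlighter-rouge">init</code> is argument 3 (<code class="language-plaintext highlighter-rouge">rcx</code>), <code class="language-plaintext highlighter-rouge">fini</code> is argument 4 (<code class="language-plaintext highlighter-rouge">r8</code>), and <code class="language-plaintext highlighter-rouge">stack_end</code> is argument 6, which lands on the stack. This matches the <code class="language-plaintext highlighter-rouge">mov</code> and <code class="language-plaintext highlighter-rouge">push</code> instructions in the dump.</p>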
<details class="red">
<summary class="collapse"><strong>Why is this <code class="language-plaintext highlighter-rouge">_start</code> function in our binary? We never declared it!</strong>
</summary>
<p>You might be curious why this <code class="language-plaintext highlighter-rouge">_start</code> along with many other symbols we’ll inspect is included in our executable. These assembly instructions are created in an object file called <code class="language-plaintext highlighter-rouge">crt1.o</code> or <code class="language-plaintext highlighter-rouge">crt0.o</code>, which stands for “C Run Time”. Depending on your operating system and compiler, you may get either of the two (but not both). These are linked statically with <em>all C and C++ executables</em> as a bootstrapping mechanism to start the program. You can actually bypass the startup code if you pass in the flags <code class="language-plaintext highlighter-rouge">-nostdlib -nostdinc</code>, which removes all standard C libraries (including the runtime library).</p>
</details>
<p>We also see pointers to the functions <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> and <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> being moved into registers as arguments to <code class="language-plaintext highlighter-rouge">__libc_start_main</code>. What are these?</p>
<h2 id="__libc_csu_init-and-__libc_csu_fini---program-level-ctordtor-handlers"><code class="language-plaintext highlighter-rouge">__libc_csu_init</code> and <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> - Program level ctor/dtor handlers</h2>
<p>Since these functions are fairly large, I’m too lazy to analyze them line-by-line. They basically do the construction and destruction handling for the program. We can register a list of constructors to be called by <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> and similarly destructors with <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code>. We can’t actually dump the contents of <code class="language-plaintext highlighter-rouge">__libc_start_main</code> since it lives in the libc shared library, but we can assume the execution order is:</p>
<ol>
<li>Call <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> for the program level constructor handling</li>
<li>Call <code class="language-plaintext highlighter-rouge">main</code></li>
<li>Call <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> for the program level destructor handling</li>
</ol>
<details class="red">
<summary class="collapse"><strong>What is a program destructor? And how do I use it in C++?</strong>
</summary>
<p>Let’s see the program level constructors and destructors in action. Let’s write a simple program with functions with global constructor and destructor attributes:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">__attribute__</span> <span class="p">((</span><span class="n">constructor</span><span class="p">))</span> <span class="n">dumb_constructor</span><span class="p">(){}</span>
<span class="kt">void</span> <span class="nf">__attribute__</span> <span class="p">((</span><span class="n">destructor</span><span class="p">))</span> <span class="n">dumb_destructor</span><span class="p">(){}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{}</span>
</code></pre></div> </div>
<p>And now I will use <code class="language-plaintext highlighter-rouge">gdb</code> to show you that they’re being called here. We see that <code class="language-plaintext highlighter-rouge">dumb_constructor</code> is being called by the init function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Breakpoint 1, dumb_constructor () at main.cpp:1
1 void __attribute__ ((constructor)) dumb_constructor(){}
(gdb) bt
#0 dumb_constructor () at main.cpp:1
#1 0x000000000040116d in __libc_csu_init ()
#2 0x00007ffff7abcfb0 in __libc_start_main () from /usr/lib/libc.so.6
#3 0x000000000040104e in _start ()
</code></pre></div> </div>
<p>… And that <code class="language-plaintext highlighter-rouge">dumb_destructor</code> is being called by the fini function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Breakpoint 1, dumb_destructor () at main.cpp:3
3 void __attribute__ ((destructor)) dumb_destructor(){}
(gdb) bt
#0 dumb_destructor () at main.cpp:3
#1 0x00007ffff7fe242b in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7ad4537 in __run_exit_handlers () from /usr/lib/libc.so.6
#3 0x00007ffff7ad46ee in exit () from /usr/lib/libc.so.6
#4 0x00007ffff7abd02a in __libc_start_main () from /usr/lib/libc.so.6
#5 0x000000000040104e in _start ()
</code></pre></div> </div>
<p><em>… Wait, it’s not?</em> Bewildered, I took to StackOverflow and <a href="https://stackoverflow.com/questions/61649960/why-do-program-level-constructors-get-called-by-libc-csu-init-but-destructor">asked why this is happening</a>. It seems like in the libc code, the <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> function is essentially deprecated with the comment:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* This function should not be used anymore. We run the executable's
destructor now just like any other. We cannot remove the function,
though. */</span>
<span class="kt">void</span>
<span class="nf">__libc_csu_fini</span> <span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span>
</code></pre></div> </div>
<p>To find out why historically we used <code class="language-plaintext highlighter-rouge">__libc_csu_fini</code> but now we delegate to <code class="language-plaintext highlighter-rouge">_dl_fini</code> is another rabbit hole in itself, and I decided to stop my investigations there.</p>
</details>
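<p>The constructor ordering can also be observed without a debugger. The sketch below assumes GCC or Clang (the <code class="language-plaintext highlighter-rouge">constructor</code> attribute with a priority is a compiler extension) and a default linker script like the one shown earlier, which sorts <code class="language-plaintext highlighter-rouge">.init_array.*</code> by priority and places them ahead of the plain <code class="language-plaintext highlighter-rouge">.init_array</code> entries. All function and variable names are made up for the illustration.</p>

```cpp
#include <cassert>

// Plain ints are constant-initialized, so they are already valid when the
// constructor functions below run.
static int next_rank = 0;
static int early_rank = -1;
static int late_rank = -1;
static int default_rank = -1;

// Lower priority numbers run first (priorities 0-100 are reserved).
__attribute__((constructor(102))) void late_constructor() { late_rank = next_rank++; }
__attribute__((constructor(101))) void early_constructor() { early_rank = next_rank++; }

// No explicit priority: runs after the prioritized constructors.
__attribute__((constructor)) void default_constructor() { default_rank = next_rank++; }
```

<p>By the time <code class="language-plaintext highlighter-rouge">main</code> starts, each rank records the position at which its constructor ran, regardless of the order in which the functions appear in the source file.</p>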
<p>One thing I thought was interesting from breakpointing <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> was that it actually called a function <code class="language-plaintext highlighter-rouge">_init</code> first before calling our <code class="language-plaintext highlighter-rouge">dumb_constructor()</code> function. What is this function?</p>
<h2 id="_init-and-_fini---obsolete-program-level-ctordtor-with-highest-priority"><code class="language-plaintext highlighter-rouge">_init</code> and <code class="language-plaintext highlighter-rouge">_fini</code> - Obsolete program level ctor/dtor with highest priority</h2>
<pre><code class="language-assembly">Disassembly of section .init:
0000000000401000 <_init>:
# Stands for End Branch (64 bits).
# When an indirect jump occurs, it must jump to an endbr64 instruction
# or else an exception occurs. This is a part of
# CET(Control-flow Enforcement Tech) to prevent buffer-overflow
# or gadget exploits on return addresses.
401000: f3 0f 1e fa endbr64
401004: 48 83 ec 08 sub rsp,0x8
# Checks to see whether __gmon_start__ exists. This symbol doesn't exist in our
# code, because we don't have gmon profiling enabled(used for gprof)
401008: 48 8b 05 e1 2f 00 00 mov rax,QWORD PTR [rip+0x2fe1] # 403ff0 <__gmon_start__>
# Jumps if %rax is equal to 0. Test does an AND operation.
40100f: 48 85 c0 test rax,rax
401012: 74 02 je 401016 <_init+0x16>
# If we don't jump, then we call the __gmon_start__ function which does
# some intrusive profiling setup.
401014: ff d0 call rax
401016: 48 83 c4 08 add rsp,0x8
40101a: c3 ret
</code></pre>
<p>This initialization function is at the front of our assembly dump. It seems like it really doesn’t do much other than call the <code class="language-plaintext highlighter-rouge">gmon</code> profiling system if it’s defined. Otherwise, it returns.</p>
<p>From looking online, it appears these two functions are actually deprecated, and we shouldn’t use them:</p>
<blockquote>
<p>Historically there have been two special functions, <code class="language-plaintext highlighter-rouge">_init</code> and <code class="language-plaintext highlighter-rouge">_fini</code> that can be used to control constructors and destructors. However, they are obsolete, and their use can lead to unpredictable results. Your libraries should not use these; use the function attributes constructor and destructor instead.</p>
</blockquote>
<p>And this makes sense! We see <code class="language-plaintext highlighter-rouge">_init</code> being called from <code class="language-plaintext highlighter-rouge">__libc_csu_init</code>, and then our own custom program level constructor being called by <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> shortly after. As long as we register our constructors with the attribute, we can feel free to ignore this pair of functions.</p>
<h2 id="register_tm_clones-deregister_tm_clones---mysterious-concurrency-model-functions"><code class="language-plaintext highlighter-rouge">register_tm_clones</code>, <code class="language-plaintext highlighter-rouge">deregister_tm_clones</code> - Mysterious concurrency model functions</h2>
<p>Here’s an abbreviated view of <code class="language-plaintext highlighter-rouge">register_tm_clones</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000004010c0 <register_tm_clones>:
4010c0: be 48 40 40 00 mov esi,0x404048
4010c5: 48 81 ee 38 40 40 00 sub rsi,0x404038
...
4010df: 48 8b 05 02 2f 00 00 mov rax,QWORD PTR [rip+0x2f02] # 403fe8 <_ITM_registerTMCloneTable@LIBITM_1.0>
...
4010f8: c3 ret
</code></pre></div></div>
<p>After going on a scavenger hunt, it appears that <code class="language-plaintext highlighter-rouge">tm</code> stands for “Transactional Memory” which is used in multithreaded applications, and functions with the prefix <code class="language-plaintext highlighter-rouge">_ITM</code> belong to the <code class="language-plaintext highlighter-rouge">libitm</code> component of <code class="language-plaintext highlighter-rouge">gcc</code>. Of course, for other compiler flavors it may be called something else. The code for this can be found in <a href="https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/crtstuff.c#L297">gcc</a> but it lacks comments. The <code class="language-plaintext highlighter-rouge">deregister_tm_clones</code> function appears to be called by <code class="language-plaintext highlighter-rouge">__do_global_dtors_aux</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000004010d0 <__do_global_dtors_aux>:
...
4010e1: e8 7a ff ff ff call 401060 <deregister_tm_clones>
...
4010f0: c3 ret
</code></pre></div></div>
<p>As far as I know, global destructors belong to static objects. But we don’t have any static objects defined as it’s such a barebones C++ program. Where is this <code class="language-plaintext highlighter-rouge">tm_clones</code> thing coming from?</p>
<p><strong>The short answer is: It’s a function being called by <code class="language-plaintext highlighter-rouge">libgcc.so</code> for the transactional memory model in C++.</strong> <strong>The long answer is in the appendix.</strong></p>
<p>We know that <code class="language-plaintext highlighter-rouge">_start</code> is the beginning of our program, <em>but we actually need to run something else first.</em> When the program interpreter field in <code class="language-plaintext highlighter-rouge">INTERP</code> is specified, we actually run the dynamic linker <code class="language-plaintext highlighter-rouge">ld.so</code> to populate our memory with the shared libraries in the <code class="language-plaintext highlighter-rouge">NEEDED</code> section in <code class="language-plaintext highlighter-rouge">DYNAMIC</code>. <code class="language-plaintext highlighter-rouge">libgcc.so</code> is one of these, and so we start by loading it into memory, and then running some initialization code, which then eventually calls <code class="language-plaintext highlighter-rouge">register_tm_clones</code>, and then gives control back to the main executable at the <code class="language-plaintext highlighter-rouge">_start</code> function. <strong>So technically, <code class="language-plaintext highlighter-rouge">register_tm_clones</code> is an example of a function that gets run before the <code class="language-plaintext highlighter-rouge">_start</code> function is even called!</strong></p>
<hr />
<h2 id="recap-2">Recap</h2>
<p>Now that we’ve seen basically all of the important functions generated in assembly, let’s summarize our findings:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">main</code> is boring and expected.</li>
<li>The system starts by calling <code class="language-plaintext highlighter-rouge">_start</code>, which calls <code class="language-plaintext highlighter-rouge">__libc_start_main</code>, which in turn calls <code class="language-plaintext highlighter-rouge">__libc_csu_init</code> and then <code class="language-plaintext highlighter-rouge">main</code>.</li>
<li><code class="language-plaintext highlighter-rouge">__libc_csu_init</code> calls <code class="language-plaintext highlighter-rouge">_init</code> first, an obsolete global initializer, then our own custom ones.</li>
<li><code class="language-plaintext highlighter-rouge">register_tm_clones</code> and <code class="language-plaintext highlighter-rouge">deregister_tm_clones</code> are a part of the experimental and incomplete transactional memory feature for C++. They register clones of functions that are optimized for concurrent access during runtime.</li>
</ol>
<p>Let’s see a flow chart of what this is actually doing.</p>
<p><img src="http://oneraynyday.github.io/assets/main_seq_diagram.png" alt="seq_diagram" /></p>
<h1 id="conclusion">Conclusion</h1>
<p>It was an incredibly deep rabbit hole that I dug myself into, but I’m glad I came out with a wealth of knowledge about:</p>
<ul>
<li>ELF formats (sections & segments)</li>
<li>Dynamic linker executable & script</li>
<li>PLT and GOT (shared objects symbols)</li>
<li>Libc runtime</li>
<li>Program constructors and destructors</li>
<li>Static initialization</li>
<li>Transactional memory models</li>
<li>… and more.</li>
</ul>
<p>I’ve always been interested in diving into these rabbit holes and I’ve learned some of the material in college, through Professor Eggert’s class. If there’s anyone I’d like to thank for sprouting my curiosity in the subjects discussed it would have to be him. Although there are still many questions, I can confidently say that this investigation has made me less afraid of the mysteries of executables, and I’m excited to delve deeper into more rabbit holes in <code class="language-plaintext highlighter-rouge">libc</code> and the GNU packages.</p>
<h1 id="appendix">Appendix</h1>
<h2 id="some-notes-about-the-cc1plus-compiler-and-general-parsing-rules">Some notes about the <code class="language-plaintext highlighter-rouge">cc1plus</code> compiler and general parsing rules</h2>
<p>The C++ language has many ambiguous grammar rules, which make some expressions require arbitrarily long lookaheads of the next expressions to determine the syntactic meaning of the program. For simple languages that are context-free, $LR(1)$ type parsers can be sufficient (in fact, you can implement parsers for context-free languages with a simple stack), but C++ is not context-free, so it requires $LR(\infty)$ parsers to guarantee correctness. In fact, the C++ template system is itself <a href="http://port70.net/~nsz/c/c%2B%2B/turing.pdf">Turing complete</a>, which means the compiler may terminate with no errors and produce a program that is runnable, or terminate with an error upon parsing, or never terminate. (The “correct” definition involving the Church-Turing thesis is covered in <a href="https://oneraynyday.github.io/math/2019/02/06/Computability-Theory-Halting-Problem/">my blog here</a> and <a href="https://oneraynyday.github.io/math/2019/02/18/Recursive-Enumerable-Sets/">here</a>)</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="kt">int</span> <span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">B</span><span class="p">></span>
<span class="kt">int</span> <span class="nf">break_my_compiler</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="k">constexpr</span><span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="n">B</span><span class="p">)</span>
<span class="k">return</span> <span class="n">break_my_compiler</span><span class="o"><</span><span class="n">A</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">B</span><span class="o">+</span><span class="mi">1</span><span class="o">></span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">break_my_compiler</span><span class="o"><</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Try compiling the above program with <code class="language-plaintext highlighter-rouge">-ftemplate-depth=60000</code> and wait for your CPU to catch on fire.</p>
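<p>For contrast, here is a sketch of template recursion that does terminate. The difference is the base case: the explicit specialization for <code class="language-plaintext highlighter-rouge">0</code> stops the chain of instantiations, whereas in the example above <code class="language-plaintext highlighter-rouge">A == B</code> holds at every step. The name <code class="language-plaintext highlighter-rouge">Factorial</code> is just the classic textbook example, not something from the compiler.</p>

```cpp
#include <cassert>

// Recursive case: instantiating Factorial<N> forces Factorial<N - 1>.
template <int N>
struct Factorial {
    static constexpr int value = N * Factorial<N - 1>::value;
};

// Base case: the explicit specialization ends the recursion.
template <>
struct Factorial<0> {
    static constexpr int value = 1;
};

// The whole computation happens inside the compiler.
static_assert(Factorial<5>::value == 120, "evaluated at compile time");
```

<p>Here the compiler acts as an interpreter for a small functional program, which is the intuition behind the Turing completeness claim.</p>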
<p>For a simple C++ program such as ours, involving only <code class="language-plaintext highlighter-rouge">int main() {}</code>, we can assume the grammar rules fit something like:</p>
\[m ::= t\; id\; (t_1\;id_1,\;t_2\;id_2,\;...\;t_n\;id_n) \{ s_1;\;s_2;\;s_3;\;...s_m;\} \quad{\textbf{(Method declaration)}}\\
t ::= \text{int} \;|\; \text{long} \;|\; \text{float} \;|\; ... \quad{\textbf{(Type)}}\\
id ::= \text{<IDENTIFIER>} \quad{\textbf{(Variable)}}\\
s ::= \{ s_1;\;s_2;\;s_3;\;...s_j;\} \;|\; id = expr; \;|\; return \;expr\; | ... \quad{\textbf{(Statement)}} \\
...\]
<p>The semantic definition of our program is that there is a function named <code class="language-plaintext highlighter-rouge">main</code> that returns <code class="language-plaintext highlighter-rouge">int</code> and contains no statements or parameters. The compiler creates some representation of this definition, usually in the form of an <strong><a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax Tree (AST)</a></strong>.</p>
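<p>As a rough picture of what such a tree could hold for our program, here is a toy sketch. The node types and the <code class="language-plaintext highlighter-rouge">make_empty_main_ast</code> helper are hypothetical and only mirror the grammar above; they are not the data structures <code class="language-plaintext highlighter-rouge">cc1plus</code> uses.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node types for illustration only.
struct Parameter {
    std::string type;  // t_i in the grammar
    std::string name;  // id_i in the grammar
};

struct Statement {
    std::string text;  // placeholder for a real statement node
};

// Corresponds to the "Method declaration" rule.
struct FunctionDecl {
    std::string return_type;
    std::string name;
    std::vector<Parameter> parameters;
    std::vector<Statement> body;
};

// Builds by hand the tree a parser might produce for `int main() {}`.
FunctionDecl make_empty_main_ast() {
    return FunctionDecl{"int", "main", {}, {}};
}
```

<p>The later compiler stages walk a structure like this instead of the raw source text, which is why an empty parameter list and an empty body are represented explicitly.</p>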
<p><em>Disclaimer: The <code class="language-plaintext highlighter-rouge">cc1plus</code> compiler implementation is miles more complicated than this. This was an example of a grammar that could fit to parse our simple program. I didn’t state the definition of $expr$ since that will require me to add a lot more rules and compilers aren’t really the focus of this blog.</em></p>
<h2 id="transactional-memory-model--clones">Transactional Memory Model & Clones</h2>
<p><strong>Warning: The below is an experimental part of the C++ standard. The output in this post was generated with <code class="language-plaintext highlighter-rouge">g++-9</code>.</strong></p>
<p>Basically, C++ has its own <strong>Transactional Memory Technical Specification, or TMTS</strong>, which standardizes what a transaction really means in libitm. In the specification, we have two ways of accessing the transactional memory scope:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// (1)</span>
<span class="c1">// Pseudocode. Currently can use with __transaction_atomic</span>
<span class="n">__transaction_atomic</span> <span class="p">{</span>
<span class="n">x</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// (2)</span>
<span class="c1">// Pseudocode. Currently can use with __transaction_relaxed</span>
<span class="n">__transaction_relaxed</span> <span class="p">{</span>
<span class="n">x</span><span class="o">++</span><span class="p">;</span>
<span class="n">read_file</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// (3)</span>
<span class="p">[[</span><span class="n">optimize_for_synchronized</span><span class="p">]]</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">read_file</span><span class="p">()</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span>
</code></pre></div></div>
<ol>
<li>Here, <code class="language-plaintext highlighter-rouge">__transaction_atomic</code> blocks are modeled after optimistic concurrency: when the block tries to commit, it checks whether the original value of <code class="language-plaintext highlighter-rouge">x</code> has been modified in the meantime. If it has, the pending writes to all variables are scrapped and the block is re-run with the new value of <code class="language-plaintext highlighter-rouge">x</code>.</li>
<li><code class="language-plaintext highlighter-rouge">__transaction_relaxed</code> is exactly like how <code class="language-plaintext highlighter-rouge">synchronized</code> is in Java - a global lock is acquired and the code is executed sequentially in total order until the lock is released. In most cases, C++ code is <strong>not transaction safe</strong> like file I/O above, so we can’t use atomic and have to fall back to using <code class="language-plaintext highlighter-rouge">__transaction_relaxed</code>.</li>
<li>The last feature above is the C++ attribute <code class="language-plaintext highlighter-rouge">[[optimize_for_synchronized]]</code> which has the compiler optimize the function <code class="language-plaintext highlighter-rouge">read_file</code> for repeated calls within a <code class="language-plaintext highlighter-rouge">synchronized</code> block (aka, removing the global lock whenever possible).</li>
</ol>
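<p>To get a feel for the optimistic pattern described in the first item, here is a sketch that uses a plain compare-and-swap retry loop on <code class="language-plaintext highlighter-rouge">std::atomic</code>. This is a stand-in technique chosen for illustration, not how libitm implements transactions, and <code class="language-plaintext highlighter-rouge">optimistic_increment</code> is a made-up name.</p>

```cpp
#include <atomic>
#include <cassert>

// Optimistically increments x: compute on a snapshot, then commit only if
// nobody changed x in the meantime. Returns the number of attempts made.
int optimistic_increment(std::atomic<int>& x) {
    int attempts = 0;
    int observed = x.load();
    while (true) {
        ++attempts;
        int desired = observed + 1;  // speculative work on the snapshot
        // Commit succeeds only if x still equals observed. On failure,
        // observed is refreshed with the current value and we retry.
        if (x.compare_exchange_weak(observed, desired)) {
            return attempts;
        }
    }
}
```

<p>The retry loop plays the role of re-running the transaction block. A real transactional memory runtime has to track every read and write in the block, not just a single variable, which is part of why the feature is still experimental.</p>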
<p>This is a very promising and interesting feature we should expect to see in the future releases of compilers, but right now it’s still in development. Here’s what we can do with it though:</p>
<h3 id="atomic-transactions">Atomic transactions</h3>
<p>Here, I compile a program with <code class="language-plaintext highlighter-rouge">-fgnu-tm</code> to enable the atomic transaction calls:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">hello</span><span class="p">(</span><span class="kt">int</span><span class="o">&</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="c1">// Begin</span>
<span class="n">__transaction_atomic</span> <span class="p">{</span>
<span class="n">x</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span> <span class="c1">// End</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">hello</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It took me a while to formulate the minimum code required for the libitm calls to kick in (believe me, for some reason that function call to <code class="language-plaintext highlighter-rouge">hello</code> was required). This compiles to:</p>
<pre><code class="language-assembly">0000000000401187 <main>:
4011aa: b8 00 00 00 00 mov eax,0x0
# Begin
4011af: e8 bc fe ff ff call 401070 <_ITM_beginTransaction@plt>
...
4011c2: e8 99 fe ff ff call 401060 <_ITM_RU4@plt>
...
4011d5: e8 56 fe ff ff call 401030 <_ITM_WU4@plt>
4011da: e8 61 fe ff ff call 401040 <_ITM_commitTransaction@plt>
...
4011ea: e8 51 fe ff ff call 401040 <_ITM_commitTransaction@plt>
# End
...
401218: c3 ret
401219: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
</code></pre>
<p>It compiles and runs fine. As you can see, we begin the transaction right after <code class="language-plaintext highlighter-rouge">mov eax,0x0</code>, which sets <code class="language-plaintext highlighter-rouge">x = 0</code>. The transactional memory block runs and calls a few things involving <code class="language-plaintext highlighter-rouge">ITM</code> in the <code class="language-plaintext highlighter-rouge">plt</code>, which, as we learned before, is the procedure linkage table, pointing to the shared library containing the definitions of those functions.</p>
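<p>To build some intuition for the begin / read / write / commit sequence in the disassembly, here is a toy sketch in standard C++. It is <strong>not</strong> how <code class="language-plaintext highlighter-rouge">libitm</code> is implemented; it only assumes that a transaction can be modelled as "take a snapshot, compute privately, publish if nothing changed, otherwise retry". The function name <code class="language-plaintext highlighter-rouge">transactional_increment</code> is made up for illustration.</p>

```cpp
#include <atomic>

// Toy model only: real transactional-memory runtimes track whole read/write
// sets; this handles a single int to show the shape of the protocol.
int transactional_increment(std::atomic<int>& value) {
    for (;;) {
        // "Begin" plus transactional read: take a snapshot of the shared value.
        int snapshot = value.load();
        // Speculative write: the new value stays private until commit.
        int updated = snapshot + 1;
        // "Commit": publish only if nobody changed the value since the snapshot.
        if (value.compare_exchange_weak(snapshot, updated)) {
            return updated;
        }
        // Conflict (or spurious failure): abort this attempt and retry.
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">transactional_increment(x)</code> on a <code class="language-plaintext highlighter-rouge">std::atomic<int> x{0}</code> plays the role of the <code class="language-plaintext highlighter-rouge">x++</code> inside the transaction block above, and needs no special compiler flag.</p>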
<h3 id="synchronized-transactions">Synchronized transactions</h3>
<p>I wanted to confirm for myself that <code class="language-plaintext highlighter-rouge">__transaction_relaxed</code> is fully working in g++-9 and we can enforce total order with it with the following program:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <thread>
#include <atomic>
#include <iostream>
</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">t1</span><span class="p">(){</span>
<span class="n">__transaction_relaxed</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">100</span><span class="p">));</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"x is equal to... "</span> <span class="o"><<</span> <span class="n">x</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">t2</span><span class="p">(){</span>
<span class="c1">// Comment this scope out if you want to see it deadlock</span>
<span class="n">__transaction_relaxed</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="c1">// This will turn x to 0 and t1 will "deadlock"</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"x is changed by t2!"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">_t1</span><span class="p">(</span><span class="n">t1</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">_t2</span><span class="p">(</span><span class="n">t2</span><span class="p">);</span>
<span class="n">_t1</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="n">_t2</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here, we have two threads. The first sets the atomic int to 1 and, if its block really runs in isolation, will never spin in the while loop. Without any synchronization, however, threads can modify the value of an atomic variable in any order, <strong>and</strong> how the surrounding instructions may be reordered depends on the memory order you use for loading and storing the atomic (long story, we’ll cover that later). <em>If the transaction scope in <code class="language-plaintext highlighter-rouge">t2</code> is commented out, the second thread comes along and sets the value to 0 right before the first thread reaches its while loop nearly every time because of the sleeps, so the first thread spins forever.</em> With the synchronized scope in place, we get the following output and the program exits with no infinite while loop:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ./main
x is equal to... 1
x is changed by t2!
</code></pre></div></div>
<p>This confirms that <code class="language-plaintext highlighter-rouge">libitm</code> is doing its job.</p>
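<p>For comparison, here is a rough standard-C++ analogue of the same experiment. It assumes that a single global <code class="language-plaintext highlighter-rouge">std::mutex</code> is a fair stand-in for the serialization that the relaxed transactions give us; the names <code class="language-plaintext highlighter-rouge">global_lock</code>, <code class="language-plaintext highlighter-rouge">shared_x</code> and <code class="language-plaintext highlighter-rouge">run_serialized</code> are hypothetical and not part of the program above.</p>

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the "global lock" behaviour of a relaxed transaction.
std::mutex global_lock;
std::atomic<int> shared_x{0};

// Mirrors t1: write 1, wait, then read the value back while still holding the lock.
int write_then_read() {
    std::lock_guard<std::mutex> guard(global_lock);
    shared_x = 1;
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return shared_x.load();
}

// Mirrors t2: reset the value, but only once the lock is available.
void reset_value() {
    std::lock_guard<std::mutex> guard(global_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    shared_x = 0;
}

// Runs both threads and returns what the first thread observed.
int run_serialized() {
    int observed = -1;
    std::thread first([&observed] { observed = write_then_read(); });
    std::thread second(reset_value);
    first.join();
    second.join();
    return observed;
}
```

<p>Whichever thread grabs the lock first, <code class="language-plaintext highlighter-rouge">write_then_read</code> always sees its own write, because the reset can only happen entirely before or entirely after it. That is the same guarantee we observed from the transactional version.</p>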
<h3 id="tm-clones"><code class="language-plaintext highlighter-rouge">tm</code> clones</h3>
<p>So now we know that <code class="language-plaintext highlighter-rouge">TM</code> is an experimental transactional memory feature, but what is <code class="language-plaintext highlighter-rouge">tm_clones</code>? Well, I compiled another simple C++ program and found out. Here, we use the C++ attribute explained above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[optimize_for_synchronized]] int hello(int y){
return y;
}
int main () {}
</code></pre></div></div>
<p>What do we get after <code class="language-plaintext highlighter-rouge">objdump</code>‘ing it? We see something very surprising:</p>
<pre><code class="language-assembly">0000000000401144 <hello(int)>:
401144: 55 push rbp
401145: 48 89 e5 mov rbp,rsp
...
0000000000401179 <transaction clone for hello(int)>:
401179: 55 push rbp
40117a: 48 89 e5 mov rbp,rsp
...
</code></pre>
<p><strong>There appear to be cloned versions of the original function once the C++ attribute is applied!</strong> These clone functions must’ve been registered in the clone table (configured in static runtime) that will point to the transaction clones when called from a <code class="language-plaintext highlighter-rouge">synchronized</code> block! It makes sense for the registration to happen before runtime if any functions with such attributes are defined. <em>The functions <code class="language-plaintext highlighter-rouge">de/register_tm_clones</code> are there in case we want to enable this language feature.</em></p>
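<p>To make the idea of a clone table concrete, here is a toy model in plain C++. It assumes the table is simply a list of (original, clone) function pointer pairs that is consulted when a call happens inside a transaction; the real table layout and the runtime's lookup are not shown in this post, so <code class="language-plaintext highlighter-rouge">ToyCloneTable</code>, <code class="language-plaintext highlighter-rouge">hello_transaction_clone</code> and <code class="language-plaintext highlighter-rouge">instrumented_calls</code> are purely illustrative names.</p>

```cpp
#include <utility>
#include <vector>

using IntFn = int (*)(int);

int hello(int y) { return y; }

// Hypothetical stand-in for the compiler-generated "transaction clone for hello(int)".
// The counter only exists so we can see which version was called.
int instrumented_calls = 0;
int hello_transaction_clone(int y) {
    ++instrumented_calls;
    return y;
}

class ToyCloneTable {
public:
    // Analogous in spirit to register_tm_clones: remember the pair.
    void register_clone(IntFn original, IntFn clone) {
        entries_.emplace_back(original, clone);
    }

    // Analogous in spirit to deregister_tm_clones: forget the pair.
    void deregister_clone(IntFn original) {
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            if (it->first == original) {
                entries_.erase(it);
                return;
            }
        }
    }

    // Inside a transaction, calls would be redirected to the clone if one exists.
    IntFn resolve(IntFn original) const {
        for (const auto& entry : entries_) {
            if (entry.first == original) {
                return entry.second;
            }
        }
        return original;
    }

private:
    std::vector<std::pair<IntFn, IntFn>> entries_;
};
```

<p>With this model, <code class="language-plaintext highlighter-rouge">table.resolve(hello)(5)</code> runs the plain function until <code class="language-plaintext highlighter-rouge">register_clone(hello, hello_transaction_clone)</code> has been called, after which it runs the instrumented clone instead.</p>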
My Arch Linux Setup2020-04-26T00:00:00+00:00http://oneraynyday.github.io/dev/2020/04/26/My-Arch-Linux-Setup<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#my-arch-linux-setup" id="markdown-toc-my-arch-linux-setup">My Arch Linux Setup</a> <ul>
<li><a href="#preliminary" id="markdown-toc-preliminary">Preliminary</a></li>
<li><a href="#i3" id="markdown-toc-i3"><code class="language-plaintext highlighter-rouge">i3</code></a> <ul>
<li><a href="#installing-i3" id="markdown-toc-installing-i3">Installing <code class="language-plaintext highlighter-rouge">i3</code></a></li>
<li><a href="#i3-container-manipulation" id="markdown-toc-i3-container-manipulation"><code class="language-plaintext highlighter-rouge">i3</code> Container Manipulation</a></li>
<li><a href="#i3-workspaces" id="markdown-toc-i3-workspaces"><code class="language-plaintext highlighter-rouge">i3</code> Workspaces</a></li>
<li><a href="#i3blocks" id="markdown-toc-i3blocks"><code class="language-plaintext highlighter-rouge">i3blocks</code></a></li>
</ul>
</li>
<li><a href="#setting-backgrounds" id="markdown-toc-setting-backgrounds">Setting Backgrounds</a></li>
<li><a href="#using-a-file-manager" id="markdown-toc-using-a-file-manager">Using a File Manager</a></li>
<li><a href="#reading-pdfs" id="markdown-toc-reading-pdfs">Reading PDF’s</a></li>
<li><a href="#spotify" id="markdown-toc-spotify">Spotify</a></li>
<li><a href="#program-launcher-dmenu-replacement" id="markdown-toc-program-launcher-dmenu-replacement">Program Launcher (<code class="language-plaintext highlighter-rouge">dmenu</code> replacement)</a></li>
<li><a href="#terminal-emulator" id="markdown-toc-terminal-emulator">Terminal Emulator</a> <ul>
<li><a href="#solarized-colorscheme" id="markdown-toc-solarized-colorscheme">Solarized Colorscheme</a></li>
<li><a href="#aesthetics" id="markdown-toc-aesthetics">Aesthetics</a></li>
</ul>
</li>
<li><a href="#login-manager" id="markdown-toc-login-manager">Login Manager</a></li>
</ul>
</li>
</ul>
<h1 id="my-arch-linux-setup">My Arch Linux Setup</h1>
<p>I’ve had my Arch Linux environment set up for over a year now, and I haven’t made significant changes to it because all of my work is done on my company’s MacBooks (with SSH into remote compute resources, of course). Since the next few C++ blogs I’ll be writing are going to be done with the Linux toolchain, I figured I’d start using my workstation again.</p>
<p>This blog is meant to illustrate how I set up my Arch Linux environment <strong>after</strong> I’ve set up the appropriate UEFI, swap, and filesystem partitions. I’m not going to explain how boot loaders work, how to set up Pacman (the package manager for Arch), setting up locale, etc. I also won’t cover my <code class="language-plaintext highlighter-rouge">vim</code> and <code class="language-plaintext highlighter-rouge">zsh</code> setup. With that said, let’s dive in!</p>
<h2 id="preliminary">Preliminary</h2>
<p>Throughout this blog, I’ll be using <code class="language-plaintext highlighter-rouge">pacman</code> to install bare necessities, and <code class="language-plaintext highlighter-rouge">yay</code> otherwise. To install <code class="language-plaintext highlighter-rouge">yay</code>, simply run the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>archlinux% git clone https://aur.archlinux.org/yay.git
archlinux% cd yay
archlinux% makepkg -si
</code></pre></div></div>
<h2 id="i3"><code class="language-plaintext highlighter-rouge">i3</code></h2>
<p>What is <code class="language-plaintext highlighter-rouge">i3</code>? It’s a <strong>tiling window manager</strong> that lets you place windows efficiently on your screen, navigate between them with keystrokes, separate them into workspaces, and much more. Below is a screenshot of my setup:</p>
<p><img src="http://oneraynyday.github.io/assets/arch_screen.png" alt="screenshot1" /></p>
<p>As you can see, I have multiple terminal windows open, as well as Spotify, neatly tiled into bifurcated windows (as I’ll explain later, you can adjust the size of the windows beyond the fixed $\frac{1}{2^n}$ sizes as well). In addition, you can see the status bar above, showing the metrics on the top right and workspace on the left. We’ll explain how to set this up.</p>
<h3 id="installing-i3">Installing <code class="language-plaintext highlighter-rouge">i3</code></h3>
<p>There are two versions of <code class="language-plaintext highlighter-rouge">i3</code> as far as I know. The window manager I’m using has gaps in between the windows and is called <code class="language-plaintext highlighter-rouge">i3-gaps</code>, and the default one is called <code class="language-plaintext highlighter-rouge">i3-wm</code>. Install either of these with <code class="language-plaintext highlighter-rouge">pacman -S <i3 package></code>. <code class="language-plaintext highlighter-rouge">i3</code> prompts you to generate a default config file. If you decide to go this route, it’ll ask you to set the modifier key to Alt or the super key (the key itself says <em>“Win”</em>, <em>“Command”</em>, or some random logo). I set it to Alt. You can customize the <code class="language-plaintext highlighter-rouge">i3</code> config (located at <code class="language-plaintext highlighter-rouge">~/.config/i3/config</code>) to change the aesthetics and map your hotkeys.</p>
<p>If you installed <code class="language-plaintext highlighter-rouge">i3-gaps</code>, you can add gaps to the windows by adding the following into your <code class="language-plaintext highlighter-rouge">i3</code> configurations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##
## i3-gaps
##
# Run i3-gaps instead of i3 by default
# i3-gaps will have black bar on top if you have description borders
for_window [class="^.*"] border pixel 0
gaps inner 10
gaps outer 0
</code></pre></div></div>
<h3 id="i3-container-manipulation"><code class="language-plaintext highlighter-rouge">i3</code> Container Manipulation</h3>
<p>By container, we mean windows of processes here. For starters, <code class="language-plaintext highlighter-rouge">$mod+h/j/k/l</code> will change the focus of your screen to the adjacent container depending on the direction (like vim bindings), and <code class="language-plaintext highlighter-rouge">$mod+shift+h/j/k/l</code> will <em>move</em> the selected window, swapping positions with the window in the respective direction.</p>
<p>When a window is spawned (let’s say from <code class="language-plaintext highlighter-rouge">$mod+enter</code> to create a terminal emulator), the window can spawn horizontally or vertically from your existing focus window. I bound these to <code class="language-plaintext highlighter-rouge">$mod+c</code> for horizontal and <code class="language-plaintext highlighter-rouge">$mod+v</code> for vertical.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------+ +---------+
| | | x |
| x | -> +---------+ (From $mod+v)
| | | y |
+---------+ +---------+
+---------+ +----+----+
| | | | |
| x | -> | x | y | (From $mod+c)
| | | | |
+---------+ +----+----+
</code></pre></div></div>
<p>Sometimes, you want a floating window that you can control with <code class="language-plaintext highlighter-rouge">$mod + mouse</code>. This is achievable by using <code class="language-plaintext highlighter-rouge">$mod + shift + space</code>.</p>
<p>Sometimes, you don’t necessarily want a binary partition of the screen, and you want one window to be a bit smaller. You can put <code class="language-plaintext highlighter-rouge">i3</code> into resize mode with <code class="language-plaintext highlighter-rouge">$mod + r</code>, after which <code class="language-plaintext highlighter-rouge">h/j/k/l</code> will resize the currently focused window.</p>
<p>These are the settings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># change focus
bindsym $mod+h focus left
bindsym $mod+j focus down
bindsym $mod+k focus up
bindsym $mod+l focus right
# move focused window
bindsym $mod+Shift+h move left
bindsym $mod+Shift+j move down
bindsym $mod+Shift+k move up
bindsym $mod+Shift+l move right
# split in horizontal orientation
bindsym $mod+c split h
# split in vertical orientation
bindsym $mod+v split v
# toggle tiling / floating
bindsym $mod+Shift+space floating toggle
mode "resize" {
bindsym h resize shrink width 10 px or 10 ppt
bindsym j resize grow height 10 px or 10 ppt
bindsym k resize shrink height 10 px or 10 ppt
bindsym l resize grow width 10 px or 10 ppt
bindsym Return mode "default"
bindsym Escape mode "default"
bindsym $mod+r mode "default"
}
bindsym $mod+r mode "resize"
</code></pre></div></div>
<h3 id="i3-workspaces"><code class="language-plaintext highlighter-rouge">i3</code> Workspaces</h3>
<p>This is my favorite feature of <code class="language-plaintext highlighter-rouge">i3</code>. On the top left corner of our <code class="language-plaintext highlighter-rouge">i3</code> bar, we have numbers associated with the workspaces we can navigate to. We can designate a few of them for special purposes. In a later example, I show that I always keep the Spotify app running in workspace 10.</p>
<p>To navigate to the $n$-th workspace, use <code class="language-plaintext highlighter-rouge">$mod + n</code> where <code class="language-plaintext highlighter-rouge">n</code> is from 0 to 9. To move the current container to that workspace, use <code class="language-plaintext highlighter-rouge">$mod + shift + n</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Define names for default workspaces for which we configure key bindings later on.
# We use variables to avoid repeating the names in multiple places.
set $ws1 "Terminal (1)"
set $ws2 "2"
...
set $ws9 "Browser (9)"
set $ws10 "Spotify (0)"
# switch to workspace
bindsym $mod+1 workspace $ws1
...
bindsym $mod+0 workspace $ws10
# move focused container to workspace
bindsym $mod+Shift+1 move container to workspace $ws1
...
bindsym $mod+Shift+0 move container to workspace $ws10
</code></pre></div></div>
<p>The configuration is shortened above.</p>
<p>NOTE: Operations like stacking containers (navigable through up focus/down focus) are cool but not covered here. I like using workspaces better.</p>
<h3 id="i3blocks"><code class="language-plaintext highlighter-rouge">i3blocks</code></h3>
<p>To pimp out your bar, we first install <code class="language-plaintext highlighter-rouge">i3blocks</code> with <code class="language-plaintext highlighter-rouge">pacman -S i3blocks</code>, which allows us to display multiple statistics on the right hand side of our bar. It’s precious screen space, so why not put it to good use?</p>
<p>To enable <code class="language-plaintext highlighter-rouge">i3blocks</code> at the top of our screen and have it read its blocks from our config file at <code class="language-plaintext highlighter-rouge">~/.config/i3/i3blocks.conf</code>, add the following to the <code class="language-plaintext highlighter-rouge">i3</code> config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Setting i3blocks
bar {
position top
status_command i3blocks -c /home/ray/.config/i3/i3blocks.conf
font pango:FontAwesome 10
# Other configs here...
}
</code></pre></div></div>
<p>We use the <code class="language-plaintext highlighter-rouge">FontAwesome</code> font to display some crazy icons like the wifi logo, sound logo, temperature logo, etc.</p>
<p>Below is my <code class="language-plaintext highlighter-rouge">i3blocks.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[spotify]
command=python /home/ray/.config/i3/spotify.py
interval=1
separator=true
[wifi]
label=<?>
command=iwgetid -r;[[ -z "${BLOCK_BUTTON}" ]] || urxvt -e sh -c "nmcli d wifi list; printf '\n\n Type the following to connect to a wireless network: \n\n $ nmcli dev wifi connect <SSID>\n\n'; bash --norc"
separator=true
interval=3
[volume]
label=<?>
interval=1
separator=true
command=amixer get Master | egrep -o "[0-9]+%" | sed -n '2 p'
[temperature]
command=T=$(cat /sys/class/thermal/thermal_zone0/temp); echo $(( $T / 1000 ))°C
label=<?>
interval=10
separator=true
[time]
command=date '+%H:%M:%S'
interval=2
label=<?>
separator=true
[day]
command=date '+%a %b %e, %Y';[[ -z "${BLOCK_BUTTON}" ]] || gsimplecal &
interval=2
label=<?>
separator=true
</code></pre></div></div>
<p>I can’t display the characters denoted by <code class="language-plaintext highlighter-rouge"><?></code> on my blog because I’m not using a patched version for fonts that support glyphs. You can just replace them with text if you want! The spotify script is obtained from <a href="https://github.com/firatakandere/i3blocks-spotify">this repo</a>.</p>
<p>It looks like this:</p>
<p><img src="http://oneraynyday.github.io/assets/i3blocks.png" alt="i3blocks" /></p>
<h2 id="setting-backgrounds">Setting Backgrounds</h2>
<p>You can set a background image using <code class="language-plaintext highlighter-rouge">feh</code>. First install this image viewing tool with <code class="language-plaintext highlighter-rouge">sudo pacman -S feh</code>, then add the command below to the <code class="language-plaintext highlighter-rouge">i3</code> config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run feh to set background image
exec_always feh --bg-scale ~/wallpapers/wallpaper5.png
</code></pre></div></div>
<p>I save all of my backgrounds in my github repo <a href="https://github.com/OneRaynyDay/wallpapers">here</a>.</p>
<h2 id="using-a-file-manager">Using a File Manager</h2>
<p>If you want to be a supreme <code class="language-plaintext highlighter-rouge">vim</code> overlord and have bindings on everything, I suggest using <code class="language-plaintext highlighter-rouge">ranger</code>, installable via <code class="language-plaintext highlighter-rouge">sudo pacman -S ranger</code>. If you want to just have a typical gnome-like experience, try <code class="language-plaintext highlighter-rouge">SpaceFM</code> or <code class="language-plaintext highlighter-rouge">PCManFM</code>, which are two great GUI file managers. Add this into your config if you want to trigger <code class="language-plaintext highlighter-rouge">ranger</code> with hotkeys:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Open file explorer (ranger)
bindsym $mod+Shift+f exec urxvt -e ranger
</code></pre></div></div>
<p>We want to have image preview for <code class="language-plaintext highlighter-rouge">ranger</code>, since by itself it’s fairly lightweight. To do this, run <code class="language-plaintext highlighter-rouge">sudo pacman -S w3m terminator</code> and add the following lines to the ranger config file at <code class="language-plaintext highlighter-rouge">~/.config/ranger/rc.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enable previewing images
set preview_images true
</code></pre></div></div>
<p><img src="http://oneraynyday.github.io/assets/ranger.png" alt="screenshot1" /></p>
<h2 id="reading-pdfs">Reading PDF’s</h2>
<p>In the theme of <code class="language-plaintext highlighter-rouge">vim</code> overlord, install <code class="language-plaintext highlighter-rouge">zathura</code> if you want to have a nice pdf viewer with vim-like bindings with <code class="language-plaintext highlighter-rouge">sudo pacman -S zathura</code>. I like to alias it to <code class="language-plaintext highlighter-rouge">pdf</code> because who would remember to write <code class="language-plaintext highlighter-rouge">zathura</code> when they want to read a pdf? (I put this in <code class="language-plaintext highlighter-rouge">.zshrc</code>)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Zathura (pdf viewer)
alias pdf=zathura
</code></pre></div></div>
<p>Some things I found useful: <code class="language-plaintext highlighter-rouge">s</code> for fitting to the width of the document, <code class="language-plaintext highlighter-rouge">r</code> for rotating documents, <code class="language-plaintext highlighter-rouge">d</code> for two-page view.</p>
<p><img src="http://oneraynyday.github.io/assets/zathura.png" alt="screenshot1" /></p>
<h2 id="spotify">Spotify</h2>
<p>Everyone listens to music via Spotify now, so let’s streamline the operation by having the last workspace open Spotify on start-up. To do this, you need the Spotify client: <code class="language-plaintext highlighter-rouge">yay -S spotify</code> (no <code class="language-plaintext highlighter-rouge">sudo</code>, since <code class="language-plaintext highlighter-rouge">yay</code> is meant to be run as a regular user). This is what we add into the <code class="language-plaintext highlighter-rouge">i3</code> config file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># spotify
for_window [class="spotify"] move to workspace $ws10
exec --no-startup-id i3-msg 'workspace $ws10; exec spotify'
bindsym $mod+shift+s exec spotify
</code></pre></div></div>
<h2 id="program-launcher-dmenu-replacement">Program Launcher (<code class="language-plaintext highlighter-rouge">dmenu</code> replacement)</h2>
<p><code class="language-plaintext highlighter-rouge">dmenu</code> is an easy way for users to run applications without having to go into the directory it lives in, and executing it like an uncultured savage. However, it’s really ugly. Let’s use a <code class="language-plaintext highlighter-rouge">dmenu</code> replacement like <code class="language-plaintext highlighter-rouge">rofi</code> by downloading via <code class="language-plaintext highlighter-rouge">sudo pacman -S rofi</code>, and we set the following config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># start rofi (a program launcher, better than dmenu)
bindsym $mod+d exec rofi -show run
</code></pre></div></div>
<p>I haven’t found an <code class="language-plaintext highlighter-rouge">Alfred</code> replacement, but you can use <code class="language-plaintext highlighter-rouge">rofi -show window</code> to switch to whichever window you want. You can also use <code class="language-plaintext highlighter-rouge">rofi -show ssh</code> to ssh to any box you have in your <code class="language-plaintext highlighter-rouge">~/.ssh/config</code> file.</p>
<h2 id="terminal-emulator">Terminal Emulator</h2>
<p>We want to install <code class="language-plaintext highlighter-rouge">urxvt</code>, which is a terminal emulator. It’s aesthetic because we can change the opacity of the terminal itself, has nice color support, and has multiple font types it can support. To do this, run <code class="language-plaintext highlighter-rouge">sudo pacman -S rxvt-unicode</code>. To verify this, simply run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>archlinux% echo $TERM
rxvt-unicode-256color
</code></pre></div></div>
<p>You should not yet have a <code class="language-plaintext highlighter-rouge">~/.Xresources</code> file, and that’s fine. Below are some things I added into mine:</p>
<h3 id="solarized-colorscheme">Solarized Colorscheme</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!!Source http://github.com/altercation/solarized
*background: #002b36
*foreground: #657b83
!!*fading: 40
*fadeColor: #002b36
*cursorColor: #93a1a1
*pointerColorBackground: #586e75
*pointerColorForeground: #93a1a1
!! black dark/light
*color0: #073642
*color8: #002b36
!! red dark/light
*color1: #dc322f
*color9: #cb4b16
!! green dark/light
*color2: #859900
*color10: #586e75
!! yellow dark/light
*color3: #b58900
*color11: #657b83
!! blue dark/light
*color4: #268bd2
*color12: #839496
!! magenta dark/light
*color5: #d33682
*color13: #6c71c4
!! cyan dark/light
*color6: #2aa198
*color14: #93a1a1
!! white dark/light
*color7: #eee8d5
*color15: #fdf6e3
</code></pre></div></div>
<h3 id="aesthetics">Aesthetics</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>! Fonts
URxvt.font: xft:Source Code Pro for Powerline:size=12
URxvt.letterSpace: 0
! Adding transparency
urxvt*transparent: true
urxvt*shading: 30
</code></pre></div></div>
<p>Note that if you want to use the same font (<code class="language-plaintext highlighter-rouge">powerline</code> family fonts), install it using <code class="language-plaintext highlighter-rouge">yay -S powerline-fonts-git</code>. Also, some glyphs are not supported by the powerline font, so it may be better to use a patched monospace font if your use case requires it.</p>
<h2 id="login-manager">Login Manager</h2>
<p>We want to have a sexy login manager to greet us for logins. Let’s use <code class="language-plaintext highlighter-rouge">lightdm</code> along with its greeter in the Aether theme by running <code class="language-plaintext highlighter-rouge">sudo pacman -S lightdm lightdm-webkit2-greeter lightdm-webkit-theme-aether</code>. No need to change anything in <code class="language-plaintext highlighter-rouge">~/.config/i3/config</code>.</p>
<hr />
<p>There are, of course, different offerings for each of these tools. Feel free to use your own! It’s a constant learning experience, so I’ll also be trying to change things up once in a while.</p>
Queueing Theory - Part 12020-04-14T00:00:00+00:00http://oneraynyday.github.io/math/2020/04/14/Queueing-Theory-Pt1<p>Apologies for the long delay in posts! I’ve been caught up with some busy work and only recently did I remember to revisit my blog. This time around we’ll be doing some reading for <strong>queueing theory</strong>, which is one of the most relevant mathematical fields for computer science. For the sake of this blog, I will not get into measure theoretic tools required to answer particularly complicated examples. In addition, please bear with me when there is inevitable mathematical rigor lost by applying real life examples and intuition (without this turning into a 10 page tangent on every small detail).</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#what-is-queueing-theory" id="markdown-toc-what-is-queueing-theory">What is queueing theory?</a> <ul>
<li><a href="#defining-basic-terms" id="markdown-toc-defining-basic-terms">Defining basic terms</a></li>
<li><a href="#probability-review" id="markdown-toc-probability-review">Probability Review</a> <ul>
<li><a href="#basic-definitions" id="markdown-toc-basic-definitions">Basic Definitions</a></li>
<li><a href="#conditional-probability" id="markdown-toc-conditional-probability">Conditional Probability</a></li>
<li><a href="#independence" id="markdown-toc-independence">Independence</a></li>
<li><a href="#convolution" id="markdown-toc-convolution">Convolution</a></li>
</ul>
</li>
<li><a href="#important-discrete-probability-distributions" id="markdown-toc-important-discrete-probability-distributions">Important Discrete Probability Distributions</a> <ul>
<li><a href="#bernoulli" id="markdown-toc-bernoulli">Bernoulli</a></li>
<li><a href="#geometric" id="markdown-toc-geometric">Geometric</a></li>
<li><a href="#binomial" id="markdown-toc-binomial">Binomial</a></li>
<li><a href="#poisson" id="markdown-toc-poisson">Poisson</a></li>
<li><a href="#pascal" id="markdown-toc-pascal">Pascal</a></li>
</ul>
</li>
<li><a href="#important-continuous-probability-distributions" id="markdown-toc-important-continuous-probability-distributions">Important Continuous Probability Distributions</a> <ul>
<li><a href="#uniform" id="markdown-toc-uniform">Uniform</a></li>
<li><a href="#exponential" id="markdown-toc-exponential">Exponential</a></li>
<li><a href="#erlang" id="markdown-toc-erlang">Erlang</a></li>
<li><a href="#gaussian" id="markdown-toc-gaussian">Gaussian</a></li>
<li><a href="#pareto" id="markdown-toc-pareto">Pareto</a></li>
</ul>
</li>
<li><a href="#simple-queueing-system-problem" id="markdown-toc-simple-queueing-system-problem">Simple queueing system problem</a> <ul>
<li><a href="#case-1-increasing-number-of-customers-e_j-to-e_j1" id="markdown-toc-case-1-increasing-number-of-customers-e_j-to-e_j1">Case 1: Increasing number of customers, $E_j \to E_{j+1}$</a></li>
<li><a href="#case-2-decreasing-number-of-customers-e_j1-to-e_j" id="markdown-toc-case-2-decreasing-number-of-customers-e_j1-to-e_j">Case 2: Decreasing number of customers, $E_{j+1} \to E_j$</a></li>
<li><a href="#solving-system-of-equations" id="markdown-toc-solving-system-of-equations">Solving system of equations</a></li>
</ul>
</li>
<li><a href="#single-server-system---how-many-customers-are-blocked" id="markdown-toc-single-server-system---how-many-customers-are-blocked">Single Server System - How many customers are blocked?</a></li>
</ul>
</li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#ball-and-urn-reference" id="markdown-toc-ball-and-urn-reference">Ball-and-Urn Reference</a></li>
</ul>
</li>
</ul>
<h1 id="what-is-queueing-theory">What is queueing theory?</h1>
<p>Instead of giving you a bland introduction, let me paint a picture that’s relevant to the times.</p>
<p>Imagine people lining up outside a supermarket. Due to COVID-19, the store only allows $N$ visitors inside at once. If you’re the manager of this supermarket, you probably wish you knew some queueing theory. It helps you answer questions like:</p>
<ol>
<li>Can we assume the customers will be waiting indefinitely outside, or do they see that there’s a line and leave immediately? Or is it something in between?</li>
<li>How many employees/self-checkout counters do I need to make the waiting time for people outside be within acceptable levels?</li>
<li>How many people should we let in for some fixed period of time? As in, what is the biggest $N$ we can permit (without getting everyone sick)?</li>
</ol>
<h2 id="defining-basic-terms">Defining basic terms</h2>
<p>In the above example, we define the <strong>input process</strong> as the distribution of the lengths of time between consecutive customer arrivals. The <strong>service mechanism</strong> is the number of employees we have, and the number of self-checkout counters. The <strong>queueing discipline</strong> is the behavior of blocked customers - as in whether they would leave immediately, or if they would wait for a period of time before realizing the toilet paper is all gone and they’re ready to go home.</p>
<p>This field of mathematics employs a very important concept, called <strong>conservation of flow</strong>. CS undergrads may recall the <em>max flow/min cut</em> problem, which relies on conservation of flow as well. In this context, we assume that in the long run the supermarket reaches an equilibrium in the number of people inside the store at any point in time. This means that, on average, the rate of customers coming in is the same as the rate of customers going out.</p>
<h2 id="probability-review">Probability Review</h2>
<h3 id="basic-definitions">Basic Definitions</h3>
<p>A <strong>random variable (abbreviated r.v.)</strong> $X$ is a function that maps the sample space to $\mathbb{R}$. We may assign probability to the values of the random variable with a <strong>probability function</strong>, denoted $P_X(x) = P(X=x)$ for discrete cases. For continuous cases, we define the function $f_X(x)$ as the <strong>probability density function</strong>. In the continuous case, it no longer makes sense to ask for the probability of a single value (it is always 0), so we measure by intervals instead:</p>
\[P(a \leq X < b) = \int_a^b f_X(x)dx\]
<p>The <strong>cumulative distribution function (abbreviated cdf)</strong> is defined as $F_X(x) = P(X \leq x)$. It exists for both continuous and discrete cases.</p>
<p>If $X$ is a random variable, then $Y = g(X)$, where $g: \mathbb{R} \to \mathbb{R}$ is also a random variable. By definition, $P_Y(y) = \sum_{x:g(x) = y} P_X(x)$.</p>
<h3 id="conditional-probability">Conditional Probability</h3>
<p>By definition, conditional probability $P(X = x | Y = y)$ is the probability that the event $X=x$ occurs under the sample space restricted to when $Y=y$. It is defined as:</p>
\[P(X=x|Y=y) = \frac{P(X=x,Y=y)}{P(Y=y)} \equiv P_{X|Y}(x|y) = \frac{P_{X,Y}(x,y)}{P_Y(y)}\]
<p>The <strong>law of total probability</strong> states the following:</p>
\[P_Y(y) = \sum_x P_{X,Y}(x,y) = \sum_x P_{Y|X}(y|x)P_X(x)\]
<p>which is pretty self-explanatory.</p>
<h3 id="independence">Independence</h3>
<p>Two random variables $U, V$ are considered independent if \(P_{U,V}(u,v) = P_U(u)P_V(v) \; \forall u \in U, v \in V\). It follows that $P_{U|V}(u|v) = P_U(u)$, since</p>
\[P_{U|V}(u|v) = \frac{P_{U,V}(u,v)}{P_V(v)} = \frac{P_U(u)P_V(v)}{P_V(v)} = P_U(u)\]
<h3 id="convolution">Convolution</h3>
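<p>The probability function of a sum of two r.v.’s (this generalizes to any finite sum) is obtained by summing the joint distribution over every way to split the total; when the r.v.’s are independent, this is the <strong>convolution</strong> of their probability functions. For two r.v.’s $V_1,V_2$, the probability function (discrete case) or density (continuous case) of $V := V_1 + V_2$ is:</p>
\[P(V=v) = \sum_{x=-\infty}^\infty P(V_1 = x, V_2 = v-x) \qquad\text{(Discrete)} \\
f_V(v) = \int_{-\infty}^\infty f_{V_1, V_2}(x, v-x)dx \qquad\text{(Continuous)}\]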
<p>For the case where $V_1 \perp V_2$:</p>
\[P(V=v) = \sum_{x=-\infty}^\infty P(V_1 = x)P(V_2 = v-x) \qquad\text{(Discrete, independent)} \\
f_V(v) = \int_{-\infty}^\infty f_{V_1}(x)f_{V_2}(v-x)dx \qquad\text{(Continuous, independent)}\]
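<p>To make the discrete, independent case concrete, here is a minimal sketch that convolves two probability functions stored as dictionaries. The helper name <code class="language-plaintext highlighter-rouge">convolve_pmf</code> and the fair-die example are my own illustrative assumptions, not something defined elsewhere in this post.</p>

```python
from collections import defaultdict
from fractions import Fraction


def convolve_pmf(pmf_a, pmf_b):
    """Return the pmf of A + B for independent discrete r.v.'s A and B.

    Each pmf is a dict mapping a value to its probability.
    """
    pmf_sum = defaultdict(Fraction)
    for a, p_a in pmf_a.items():
        for b, p_b in pmf_b.items():
            # P(V = a + b) accumulates P(V_1 = a) * P(V_2 = b).
            pmf_sum[a + b] += p_a * p_b
    return dict(pmf_sum)


# A fair six-sided die (hypothetical example).
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmf(die, die)

print(two_dice[7])             # probability that two dice sum to 7
print(sum(two_dice.values()))  # total probability mass
```

<p>Using <code class="language-plaintext highlighter-rouge">Fraction</code> keeps the arithmetic exact, so the total mass can be checked without worrying about floating point error.</p>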
<h2 id="important-discrete-probability-distributions">Important Discrete Probability Distributions</h2>
<h3 id="bernoulli">Bernoulli</h3>
<p>There are two outcomes in the sample space, which map to the set $\{0,1\}$. Parametrized by some number $p \in [0,1]$,</p>
\[P(X=1) = p \\
P(X=0) = 1-p\]
<h3 id="geometric">Geometric</h3>
<p>There are countably many outcomes in the sample space, which map to the set $\mathbb{N}^+ = \{1,2,3,\dots\}$. Parametrized by some number $p \in (0,1]$,</p>
\[P(X=i) = (1-p)^{i-1}p\]
<p>The countable sum of these probabilities should sum to 1:</p>
\[\sum_{i\in\mathbb{N}^+} P_X(i) = p \sum_{i\in\mathbb{N}^+} (1-p)^{i-1} = p \sum_{i\in\mathbb{N}} (1-p)^{i} = p \frac{1}{1-(1-p)} = 1\]
<p>One can visualize the geometric variable as the <em>“number of tries until a Bernoulli trial is successful”</em>. It is important to note that a geometric random variable is <strong>memoryless</strong>. This means</p>
\[P(X > m+n | X > m) = P(X > n) \; \forall m,n \in \mathbb{N}\]
<p>This will be super useful later on for analyzing stochastic processes.</p>
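<p>The memoryless property can be checked numerically. The sketch below uses the fact that $P(X > n) = (1-p)^n$ for a geometric r.v. on $\{1,2,\dots\}$ (the first $n$ trials all fail); the function names and the value of $p$ are illustrative assumptions.</p>

```python
def geometric_tail(p, n):
    """P(X > n) for a geometric r.v. on {1, 2, ...}: the first n trials all fail."""
    return (1 - p) ** n


def conditional_tail(p, m, n):
    """P(X > m + n | X > m), computed from the definition of conditional probability."""
    return geometric_tail(p, m + n) / geometric_tail(p, m)


p = 0.3  # illustrative success probability
for m, n in [(1, 1), (2, 5), (10, 3)]:
    # The last two columns should agree up to floating point rounding.
    print(m, n, conditional_tail(p, m, n), geometric_tail(p, n))
```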
<h3 id="binomial">Binomial</h3>
<p>Parametrized by $p \in [0,1], n \in \mathbb{N}^+$, it is the sum of $n$ independent Bernoulli random variables with parameter $p$. Its distribution is defined for $i \in \{0,1,\dots,n\}$ as:</p>
\[P(X=i) = {n \choose i} p^i (1-p)^{n-i}\]
<h3 id="poisson">Poisson</h3>
<p>A Poisson distribution is parametrized by $\lambda \in \mathbb{R}^+$. It’s an approximation to the binomial when $n \to \infty, p \to 0$ and $\lambda \approx np$:</p>
\[P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}\]
<p>To prove this approximation, let $X_n$ be a binomial random variable with parameters $n, p=\lambda/n$. Start with the equation:</p>
\[\lim_{n\to\infty} P(X_n = k) = \lim_{n\to\infty} {n \choose k}p^k (1-p)^{n-k} \\
= \lim_{n\to\infty} {n \choose k} (\frac{\lambda}{n})^k (1-\frac{\lambda}{n})^{n-k} \\
= \lim_{n\to\infty} {n \choose k} (\frac{\lambda}{n})^k (1-\frac{\lambda}{n})^{n} (1-\frac{\lambda}{n})^{-k}\]
<p>By the definition $e^x = \lim_{n\to\infty} (1+\frac{x}{n})^n$ and the fact that $\lim_{n\to\infty} (1+\frac{x}{n})^y = 1$ for any fixed $y$, we get</p>
\[= \lim_{n\to\infty} {n \choose k} (\frac{\lambda}{n})^k e^{-\lambda} = \lambda^k e^{-\lambda} \lim_{n\to\infty} \frac{n!}{k!(n-k)!n^k}\]
<p>To get $\lim_{n \to \infty}\frac{n!}{k!(n-k)!n^k}$, we see that it expands to:</p>
\[\lim_{n \to \infty} \frac{n(n-1)(n-2)...(n-k+1)}{k! n^k} = \frac{1}{k!}\]
<p>There are exactly $k$ factors in the numerator, so their product expands to a polynomial of the form $n^k + O(n^{k-1})$. As $n\to \infty$, the term $\frac{O(n^{k-1})}{n^k}$ approaches 0, so $\frac{n^k + O(n^{k-1})}{n^k} \to 1$. Putting the pieces together, $\lim_{n\to\infty} P(X_n = k) = e^{-\lambda}\frac{\lambda^k}{k!}$, which is the Poisson probability function.</p>
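<p>A quick numerical sanity check of this limit: the sketch below evaluates the binomial probability function with $p = \lambda/n$ for growing $n$ and compares it with the Poisson value. The values of $\lambda$ and $k$ are arbitrary choices for illustration.</p>

```python
import math


def binomial_pmf(n, p, k):
    """P(X = k) for a binomial r.v. with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P(X = k) for a Poisson r.v. with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam, k = 3.0, 2  # illustrative values
for n in (10, 100, 10_000):
    # As n grows with p = lam / n, the binomial value approaches the Poisson value.
    print(n, binomial_pmf(n, lam / n, k), poisson_pmf(lam, k))
```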
<h3 id="pascal">Pascal</h3>
<p>A Pascal distribution is parametrized by $k \in \mathbb{N}^+, p \in (0,1]$. It’s the sum of $k$ independent geometric variables, or in other words, <em>“number of Bernoulli trials until $k$ successes”</em>. The event $X = i$ can be separated into two steps:</p>
<ol>
<li>On the $i$-th trial, we have a success. This happens with probability $p$.</li>
<li>In the $i-1$ trials prior, we have had $k-1$ successes. This is the event $Y = k-1$ for the binomial distribution parametrized by $i-1, p$.</li>
</ol>
<p>These two events are independent because Bernoulli trials are independent. We can write it out as:</p>
\[P(X = i) = (p)({i-1 \choose k-1}p^{k-1}(1-p)^{(i-1)-(k-1)}) \\
= {i-1 \choose k-1}p^{k}(1-p)^{i-k}\]
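<p>As a sanity check on the formula, the sketch below sums the Pascal probability function over a long (truncated) range and compares its mean against $k/p$, which is what we expect from adding up $k$ geometric variables with mean $1/p$ each. The parameters and the truncation point are assumptions chosen for illustration.</p>

```python
import math


def pascal_pmf(k, p, i):
    """P(X = i): the k-th success lands exactly on trial i."""
    if i < k:
        return 0.0
    return math.comb(i - 1, k - 1) * p**k * (1 - p) ** (i - k)


k, p = 3, 0.4  # illustrative parameters
# Truncate the infinite sum; the tail beyond 200 trials is negligible here.
total = sum(pascal_pmf(k, p, i) for i in range(k, 200))
mean = sum(i * pascal_pmf(k, p, i) for i in range(k, 200))
print(total)         # should be very close to 1
print(mean, k / p)   # should be very close to each other
```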
<h2 id="important-continuous-probability-distributions">Important Continuous Probability Distributions</h2>
<h3 id="uniform">Uniform</h3>
<p>The uniform distribution is parametrized by $a, b \in \mathbb{R}, a < b$, with the density function:</p>
\[f(x) = \frac{1}{b-a} \; \forall x, a \leq x \leq b\]
<p>The probability density is $0$ everywhere else.</p>
<h3 id="exponential">Exponential</h3>
<p>The exponential distribution is parametrized by a single rate parameter $\mu > 0$, with density function:</p>
\[f(x) = \mu e^{-\mu x} \; \forall x \geq 0\]
<p>The cdf is equal to:</p>
\[F(x) = 1-e^{-\mu x} \; \forall x \geq 0\]
<p>And the expected value is equal to $\frac{1}{\mu}$. The exponential distribution is commonly used to model the duration between arrivals in stochastic processes, and is also memoryless: $P(X > s + t | X> t) = P(X > s) = 1 - (1 - e^{-\mu s}) = e^{- \mu s}$.</p>
<p>A common application of exponential distributions is $X = \min(X_1, X_2)$, where $X_1 \sim exp(\mu_1), X_2 \sim exp(\mu_2), X_1 \perp X_2$. The complementary cumulative distribution is equal to:</p>
\[P(X > t) = P(X_1 > t \cap X_2 > t) = P(X_1>t)P(X_2>t) = e^{-\mu_1 t}e^{-\mu_2 t} = e^{-(\mu_1 + \mu_2)t}\]
<p>So the cdf of $X$ is $1 - e^{-(\mu_1 + \mu_2)t}$; taking its derivative yields the density function, and we see that $X \sim exp(\mu_1 + \mu_2)$.</p>
<p>Another common application, using the same $X_1, X_2$, is to calculate the probability $P(X_1 < X_2)$. Conditioning on the value of $X_1$ (the continuous law of total probability) and using independence:</p>
\[P(X_1 < X_2) = \int_0^\infty P(X_2 > x | X_1 = x) f_{X_1}(x)dx \\
= \int_0^\infty f_{X_1}(x) (1-F_{X_2}(x))dx \\
= \int_0^\infty \mu_1 e^{-\mu_1 x} e^{-\mu_2 x} dx \\
= \frac{\mu_1}{\mu_1 + \mu_2} \int_0^\infty (\mu_1 + \mu_2) e^{-(\mu_1 + \mu_2) x} dx \\
= \frac{\mu_1}{\mu_1 + \mu_2}\]
<p>The last step uses the fact that the remaining integrand is the density of an $exp(\mu_1 + \mu_2)$ random variable, which integrates to 1.</p>
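<p>Both results about two competing exponentials are easy to check by simulation. This is a rough Monte Carlo sketch, not a proof; the rates, the number of trials, and the seed are arbitrary assumptions.</p>

```python
import random


def simulate_race(mu_1, mu_2, trials, seed=0):
    """Estimate E[min(X_1, X_2)] and P(X_1 < X_2) by simulation."""
    rng = random.Random(seed)
    total_min = 0.0
    wins = 0
    for _ in range(trials):
        x_1 = rng.expovariate(mu_1)  # expovariate takes the rate, so the mean is 1 / mu_1
        x_2 = rng.expovariate(mu_2)
        total_min += min(x_1, x_2)
        wins += x_1 < x_2
    return total_min / trials, wins / trials


mu_1, mu_2 = 2.0, 3.0  # illustrative rates
mean_min, p_first = simulate_race(mu_1, mu_2, trials=200_000)
print(mean_min, 1 / (mu_1 + mu_2))    # simulated vs. theoretical mean of the minimum
print(p_first, mu_1 / (mu_1 + mu_2))  # simulated vs. theoretical P(X_1 < X_2)
```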
<h3 id="erlang">Erlang</h3>
<p>This distribution will be explained in detail later. It has parameters $\lambda, k$ with the density function:</p>
\[f(x) = \frac{\lambda^k x^{k-1}e^{-\lambda x}}{(k-1)!} \; \forall x \geq 0\]
<p>The Erlang distribution is the sum of $k$ independent exponential random variables each parametrized by $\lambda$.</p>
<h3 id="gaussian">Gaussian</h3>
<p>One of the most common continuous distributions used for applications. Parametrized by the mean $m$ and standard deviation $\sigma$, the density function is:</p>
\[f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-m)^2/2\sigma^2} \; \forall x \in \mathbb{R}\]
<p>It appears in the <strong>central limit theorem</strong>, which states that the average of $N$ i.i.d. random variables with mean $m$ and finite variance $\sigma^2$, $Y_N := \frac{\sum_{i=1}^N X_i}{N}$, is approximately Gaussian with mean $m$ and variance $\sigma^2/N$ (standard deviation $\sigma/\sqrt N$) once $N$ is adequately large. Note how this does not depend on the type of random variable $X_i$, as long as its variance is finite.</p>
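<p>Here is a small simulation sketch of that scaling. It averages $N$ exponential variables with rate 1 (so each has mean 1 and variance 1) and checks that the sample means have variance close to $1/N$. The choice of distribution, sample sizes, and seed are assumptions for illustration only.</p>

```python
import random
import statistics


def sample_means(n, repeats, rate=1.0, seed=0):
    """Draw `repeats` sample means, each averaging n i.i.d. exp(rate) variables."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(rate) for _ in range(n)) / n
        for _ in range(repeats)
    ]


n = 50
means = sample_means(n, repeats=20_000)
# For exp(1): mean = 1 and variance = 1, so the sample mean should have
# mean close to 1 and variance close to 1 / n.
print(statistics.fmean(means))
print(statistics.pvariance(means), 1 / n)
```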
<h3 id="pareto">Pareto</h3>
<p>Characterized by the parameters $\gamma, \delta > 0$, the cdf is equal to:</p>
\[P(X < x) = 1 - (x/\delta)^{-\gamma} \; \forall x \geq \delta\]
<p>Although I’m not familiar with using this distribution, it is commonly associated with the <strong>Pareto principle</strong>, which in one version states that roughly <em>“20% of the population has 80% of the wealth”</em>. In computer science terms, 80% of memory accesses hit 20% of the allocated memory, for example. This observation about general workloads motivates the LRU policy used for caches and virtual page replacement.</p>
<h2 id="simple-queueing-system-problem">Simple queueing system problem</h2>
<p>Let’s say we have a <strong><em>Poisson process</em></strong> occurring with the above supermarket example. By this I mean the following:</p>
<ol>
<li>The average number of customers coming in over some fixed unit time $T$ is $\lambda$.</li>
<li>The store can fit max $s$ people at once.</li>
<li>Each person stays in the store for $\tau$ time on average.</li>
<li>The store lets only one person in at a time.</li>
<li>Each customer is independent of the others (they enter and leave the store with no correlation to other shoppers).</li>
</ol>
<p>Then with this model, <em>we want to know in the long run the proportion of the time the store has $j \in \{0,1,\dots,s\}$ people in it</em>, denoted $P_0, P_1, \dots , P_s$ (you can also interpret this as the probability that the store has $j$ people in it at any point in time). We also denote the state in which the store has $j$ people in it as $E_j$.</p>
<p>Let’s break this into two cases:</p>
<h3 id="case-1-increasing-number-of-customers-e_j-to-e_j1">Case 1: Increasing number of customers, $E_j \to E_{j+1}$</h3>
<p>In this case, for $j \in \{0,1,\dots,s-1\}$, the rate of $E_j \to E_{j+1}$ transitions is $\lambda P_j$ per unit time $T$, because whenever a customer arrives, there is a $P_j$ chance we are at $E_j$. For $j = s$, it is impossible for us to exceed the max capacity, so that rate is always $0$.</p>
<h3 id="case-2-decreasing-number-of-customers-e_j1-to-e_j">Case 2: Decreasing number of customers, $E_{j+1} \to E_j$</h3>
<p>In this case, if we have $j+1$ customers in the store, each leaving after $\tau$ time on average, then on average $j+1$ people leave per $\tau$ time (at this point in time). This means the rate of people leaving at $E_{j+1}$ is $\frac{j+1}{\tau}$. However, note that at any point in time, the probability of being in the state $E_{j+1}$ is $P_{j+1}$, so on average $\frac{j+1}{\tau} P_{j+1}$ is the rate at which $E_{j+1} \to E_j$ when taking into account all of the other states the supermarket could be in.</p>
<h3 id="solving-system-of-equations">Solving system of equations</h3>
<p>We now have a couple of equations we have to solve for $P_0, P_1, \dots , P_s$. The first is the fact that the probabilities of all states must sum to 1 (the normalization axiom of probability):</p>
\[\sum_{i=0}^s P_i = 1\]
<p>Then, we have that by the <strong>conservation of flow</strong>, in the long run the rates of $E_j \to E_{j+1}$ and $E_{j+1} \to E_j$ are the same:</p>
\[\lambda P_j = \frac{j+1}{\tau} P_{j+1}\]
<p>I don’t want to formally prove the induction because I’m lazy, but we can solve the recurrence relation as follows:</p>
\[P_1 = \lambda \tau P_0 \\ P_2 = \frac{\lambda \tau}{2} P_1 = \frac{(\lambda \tau)^2}{2}P_0 \\ P_3 = \frac{\lambda \tau}{3} P_2 = \frac{(\lambda \tau)^3}{3!}P_0 \\ P_j = \frac{(\lambda \tau)^j}{j!}P_0\]
<p>Then, plugging into the second axiom of probability (probability of the whole sample space sums to 1, sort of), we solve for $P_0$:</p>
\[\sum_{i=0}^sP_i = \sum_{i=0}^s \frac{(\lambda \tau)^i}{i!}P_0 = P_0 \sum_{i=0}^s \frac{(\lambda \tau)^i}{i!} = 1 \\ \implies P_0 = (\sum_{i=0}^s \frac{(\lambda \tau)^i}{i!})^{-1}\]
<p>Solving for the general $P_j$, we have</p>
\[P_j = \frac{(\lambda \tau)^j}{j!} (\sum_{i=0}^s \frac{(\lambda \tau)^i}{i!})^{-1}\]
<p>So now we know the long-run proportion of time the store has $j$ people in it. We can then model the congestion of this supermarket and do more cool things with it.</p>
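<p>The closed form for $P_j$ translates directly into code. The sketch below computes $P_0, \dots, P_s$ and then verifies both the normalization and the conservation-of-flow equations numerically. The function name and the values of $\lambda$, $\tau$ and $s$ are made up for illustration.</p>

```python
import math


def stationary_distribution(lam, tau, s):
    """Long-run probabilities P_0..P_s for a store that holds at most s people."""
    a = lam * tau  # arrival rate times mean time spent in the store
    weights = [a**j / math.factorial(j) for j in range(s + 1)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


lam, tau, s = 4.0, 0.5, 5  # illustrative values
probs = stationary_distribution(lam, tau, s)
print(sum(probs))  # normalization: should be 1 up to rounding

# Conservation of flow: lam * P_j should match (j + 1) / tau * P_{j+1}.
for j in range(s):
    print(j, lam * probs[j], (j + 1) / tau * probs[j + 1])
```

<p>The last entry, <code class="language-plaintext highlighter-rouge">probs[s]</code>, is the long-run proportion of time the store is full, which is the quantity a manager would look at when deciding on $s$.</p>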
<h2 id="single-server-system---how-many-customers-are-blocked">Single Server System - How many customers are blocked?</h2>
<p>In queueing theory, the most basic system consists of a <strong>single server</strong> with customer arrivals modeled as an i.i.d. process. You can imagine this as a sequence of independent <strong>cycles</strong>, each made up of a period of inactivity (the store has no one coming in) followed by a period of activity (the store is serving a customer, with possibly more customers arriving in the interim). To be precise, at the end of the period of activity, there should be no customers left in the queue. <em>Intuitively, it makes sense that each cycle should be independent, as we return to 0 customers and the arrivals of customers are independently distributed.</em></p>
<p>Suppose the number of customers that arrive <strong>during the busy period</strong> in an arbitrary cycle is a random variable $N$, then below is an interesting result:</p>
\[P(\text{customer is blocked}) = \frac{E(N)}{1+E(N)}\]
<p>Intuitively, this makes sense. If we consider each cycle, the average number of customers is $E(N) + 1$ since the first customer arrived during the inactive period, and the average number of blocked customers is $E(N)$ because each of them had to wait for a previous customer to finish being served. However, let’s prove this:</p>
<p>First, let’s use the law of total probability here to make this a little easier. We define a <strong>customer to be of type $i$</strong> if they arrived in a cycle in which $i$ customers arrived in total. Note that the first customer, who arrived and was served immediately, is also a customer of type $i$ in this event.</p>
\[P(\text{customer is blocked}) = \sum_{i=1}^\infty P(\text{customer is blocked} | \text{customer is of type i})P(\text{customer is of type i})\]
<p>Well… This is good and all, but how do we get $P(\text{customer is of type i})$ to express $E(N)$? For there to be a customer of type $i$ we must have $N = i-1$, since there must be $i-1$ waiting customers in this cycle. We thus have the following probability:</p>
\[P(\text{customer is of type i}) \\
= \frac{i P(N=i-1)}{\sum_{k=1}^\infty kP(N=k-1)} \\
= \frac{i P(N=i-1)}{\sum_{k=1}^\infty((k-1) + 1)P(N=k-1)}\\
= \frac{i P(N=i-1)}{E(N) + \sum_{k=0}^\infty P(N=k)} \\
= \frac{i P(N=i-1)}{E(N) + 1}\]
<p><em>NOTE: If you’re intimidated by the denominator, this setup comes from the <a href="https://en.wikipedia.org/wiki/Urn_problem">ball-and-urn</a> type model where the i-th urn is associated with</em> $\{N=i\}$, <em>and the probability that a random ball chosen from the population belongs to the i-th urn is equivalent to</em> $P(\text{customer arrived when } N=i)$. <em>The ball-and-urn setup is described in the appendix</em>.</p>
<p>We know that $P(\text{customer is blocked} | \text{customer is of type i}) = \frac{i-1}{i}$ since there is a $\frac{i-1}{i}$ chance that this customer is not the first customer (which is the only one not blocked).</p>
<p>When we substitute these quantities into the above total probability equation, we get:</p>
\[\sum_{i=1}^\infty P(\text{customer is blocked} | \text{customer is of type i})P(\text{customer is of type i}) \\
= \sum_{i=1}^\infty (\frac{i-1}{i})(\frac{i P(N=i-1)}{E(N) + 1}) \\
= \frac{\sum_{i=0}^\infty i P(N=i)}{E(N) + 1} \\ = \frac{E(N)}{E(N)+1}\]
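<p>The result can also be checked by brute force: simulate many independent cycles, count how many customers had to wait, and divide by the total number of customers. In this sketch the distribution of $N$ (uniform on $\{0,\dots,4\}$, so $E(N) = 2$) is a hypothetical choice; any distribution with a finite mean would do.</p>

```python
import random


def blocked_fraction(draw_n, cycles, seed=0):
    """Simulate cycles and return the fraction of all customers who were blocked.

    draw_n(rng) returns N, the number of arrivals during one busy period.
    """
    rng = random.Random(seed)
    blocked = 0
    customers = 0
    for _ in range(cycles):
        n = draw_n(rng)
        blocked += n        # everyone who arrived during the busy period waits
        customers += n + 1  # plus the customer who started the busy period
    return blocked / customers


# Hypothetical distribution for N: uniform on {0, 1, 2, 3, 4}, so E(N) = 2.
fraction = blocked_fraction(lambda rng: rng.randint(0, 4), cycles=100_000)
print(fraction, 2 / (1 + 2))  # simulated fraction vs. E(N) / (1 + E(N))
```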
<h1 id="appendix">Appendix</h1>
<h2 id="ball-and-urn-reference">Ball-and-Urn Reference</h2>
<p>This is a problem from intro to probability classes. Suppose we have finitely or countably many urns, indexed by $1,2,\dots$, where the $i$th urn contains $n_i$ balls. We want to know the probability of getting a ball from the $i$th urn in terms of the probability of picking the $i$th urn. We denote $P(Y=j)$ as the probability of the $j$th urn being chosen from the population of urns, and $P(X=j)$ as the probability that a ball from the $j$th urn is chosen from the population of balls. We have the following result:</p>
\[P(X=j) = \frac{n_j P(Y=j)}{\sum_i n_i P(Y=i)}\]
<p>Since in our queueing theory example, $n_j \approx j$ ($\pm$ some constants), we were able to express the bottom sum as $E(Y) + C$.</p>
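<p>For a tiny worked instance of the ball-and-urn formula, the sketch below evaluates it exactly with fractions. The three-urn setup is hypothetical and only meant to show how bigger urns get over-represented when we sample balls instead of urns.</p>

```python
from fractions import Fraction


def ball_urn_probabilities(balls_per_urn, urn_probs):
    """P(X = j) = n_j P(Y = j) / sum_i n_i P(Y = i), for each urn j."""
    weights = [n * p for n, p in zip(balls_per_urn, urn_probs)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical setup: urn j holds j balls and the three urns are equally likely.
balls_per_urn = [1, 2, 3]
urn_probs = [Fraction(1, 3)] * 3
print(ball_urn_probabilities(balls_per_urn, urn_probs))
```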
Designing Data Intensive Applications - Chapter 12019-07-29T00:00:00+00:00http://oneraynyday.github.io/dev/2019/07/29/Designing-Data-Intensive-Applications-Ch1<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#what-do-we-care-about" id="markdown-toc-what-do-we-care-about">What Do We Care About?</a> <ul>
<li><a href="#reliability" id="markdown-toc-reliability">Reliability</a> <ul>
<li><a href="#aside-fault-vs-failure" id="markdown-toc-aside-fault-vs-failure">Aside: Fault vs. Failure</a></li>
<li><a href="#aside-types-of-faults" id="markdown-toc-aside-types-of-faults">Aside: Types of Faults</a></li>
</ul>
</li>
<li><a href="#scalability" id="markdown-toc-scalability">Scalability</a> <ul>
<li><a href="#aside-statistical-analysis-on-response-time" id="markdown-toc-aside-statistical-analysis-on-response-time">Aside: Statistical Analysis on Response Time</a></li>
<li><a href="#aside-how-to-deal-with-increasing-load" id="markdown-toc-aside-how-to-deal-with-increasing-load">Aside: How to Deal With Increasing Load</a></li>
</ul>
</li>
<li><a href="#maintainability" id="markdown-toc-maintainability">Maintainability</a> <ul>
<li><a href="#aside-what-operation-teams-do" id="markdown-toc-aside-what-operation-teams-do">Aside: What Operation Teams Do</a></li>
<li><a href="#aside-how-to-kiss" id="markdown-toc-aside-how-to-kiss">Aside: How to KISS</a></li>
<li><a href="#aside-how-to-evolve" id="markdown-toc-aside-how-to-evolve">Aside: How to Evolve</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="what-do-we-care-about">What Do We Care About?</h1>
<h2 id="reliability">Reliability</h2>
<p><strong>Definition</strong>: Working correctly under faults and errors.</p>
<p>We need reliability in a system because <em>users can be stupid and malicious and make mistakes</em>. We want the system to be fault-tolerant in these conditions.</p>
<h3 id="aside-fault-vs-failure">Aside: Fault vs. Failure</h3>
<p>A <strong>fault</strong> is defined as a component of a system not behaving as it is supposed to. A <strong>failure</strong> is defined as a system-wide inability to perform a function. Sometimes, faults are triggered deliberately in fault-tolerant systems so that their fault handling is continuously tested.</p>
<p><em>TL;DR:</em> Faults can happen, but we need to respond to them so that no failures happen.</p>
<h3 id="aside-types-of-faults">Aside: Types of Faults</h3>
<p>A <strong>hardware fault</strong> is when a machine dies, and is largely random. This is often solved by redundancy via RAIDs, dual power supplies, etc. You can actually write software to increase hardware fault-tolerance.</p>
<p>A <strong>software fault</strong> is often logical. This is somewhat harder because deciding whether an arbitrary program is free of bugs is undecidable in general (a consequence of Rice’s theorem), so no tool can catch them all automatically. The only thing we can do is design the system so that errors are propagated up to the developer as soon as possible.</p>
<p><em>TL;DR:</em> Hardware faults are random and easier to solve than software faults, which are often due to bugs in code.</p>
<h2 id="scalability">Scalability</h2>
<p><strong>Definition</strong>: How well a system can deal with growing complexity or load.</p>
<p>Load is a very ambiguous concept. It’s simply a measure of the things that often add complexity to a system, for example requests per second, hit rate on a cache, or amount of data in a database.</p>
<p>When we increase load, we either want to keep the system’s resources the same and see how the performance is affected, or ask how much the resources must grow to keep the performance the same. For example, <strong>throughput</strong> is the amount of work done by a system per unit of time, and <strong>response time</strong> is the time required by the system to do a thing. Related to response time, <strong>latency</strong> is the time a request spends waiting to be handled. Latency is factored into response time. Throughput and response time are two sides of the same coin.</p>
<p><em>TL;DR</em>: To analyze load, you want to analyze how fast the system processes one thing, or how many things a system can process in a given time interval.</p>
<h3 id="aside-statistical-analysis-on-response-time">Aside: Statistical Analysis on Response Time</h3>
<p>Once enough response time metrics are collected, we obtain an empirical distribution. Often people report the <strong>average response time</strong>, but it doesn’t tell us how many users are actually experiencing a given delay. We care about this because the users experiencing the highest delay are often the ones using the system the most. They often make us the most money.</p>
<p>Instead, we should use <strong>percentiles</strong>. The 99th percentile is the response time that 99% of requests beat; only the slowest 1% of requests, often the ones hit by a large latency, exceed it. Percentiles are often used in <strong>SLAs (service level agreements)</strong>.</p>
<p><em>TL;DR:</em> Use percentiles, not the average.</p>
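<p>To see why the average hides the tail, here is a minimal sketch using the nearest-rank definition of a percentile (one of several common definitions). The response times are fabricated for illustration, and <code class="language-plaintext highlighter-rouge">percentile</code> is a hypothetical helper, not a library function.</p>

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 98 fast requests and two slow outliers.
response_times_ms = [20] * 98 + [1500, 3000]

print(statistics.fmean(response_times_ms))  # the mean is pulled up by the outliers
print(percentile(response_times_ms, 50))    # median: the typical request
print(percentile(response_times_ms, 99))    # p99: what the slowest requests look like
```

<p>The mean here describes nobody’s experience: it is far above the typical request and far below the slow ones, which is why the median and the high percentiles are reported together.</p>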
<h3 id="aside-how-to-deal-with-increasing-load">Aside: How to Deal With Increasing Load</h3>
<p>In order to deal with increasing load, one can <strong>scale up/vertically</strong> by moving to a more powerful machine, or <strong>scale out/horizontally</strong> by adding more machines. Obviously, the relationship between the power of a machine and its price is often nonlinear, and we want to be cheap. However, not all systems can be scaled out efficiently. There is a tradeoff dependent on the architecture. <strong>Elastic scaling</strong> is automatically scaling when detecting increasing load. This can be vertical or horizontal, but most often it is horizontal.</p>
<p><em>TL;DR</em>: To scale, throw more machines at the problem or buy a beefier one.</p>
<h2 id="maintainability">Maintainability</h2>
<p><strong>Definition</strong>: How well the system is able to be maintained and improved.</p>
<p>No one likes legacy systems because reading other people’s code is hard. Most of the cost of running a system is not during development but during maintenance by an operations team. Let’s save some money:</p>
<ul>
<li><strong>Operability</strong>: You want the operations team to easily maintain the system.</li>
<li><strong>Simplicity</strong>: You want new engineers to not run away when they try to understand the system.</li>
<li><strong>Evolvability</strong>: Make it easy to add changes to the system.</li>
</ul>
<p><em>TL;DR</em>: Design a system with maintenance and extensibility in mind.</p>
<h3 id="aside-what-operation-teams-do">Aside: What Operation Teams Do</h3>
<p>Operation teams write a lot of tooling around monitoring, failure investigations, security patching, etc. They are one of the most important teams, but are often overlooked because they wipe the subpar developers’ asses. They often do migration, maintenance, config management, deployment, documentation, etc. It’s a lot of hard work.</p>
<p><em>TL;DR</em>: Operation teams are unsung heroes who make sure the system is well maintained.</p>
<h3 id="aside-how-to-kiss">Aside: How to KISS</h3>
<p>By KISS, I mean Keep It Simple, Stupid. When a project gets large, uncontrolled complexity grows roughly quadratically (there are \(N(N-1)/2 = O(N^2)\) edges in a complete simple undirected graph with \(N\) nodes). We want to keep the complexity growth as linear as possible when new components are added.</p>
<p>Sometimes it’s the subpar developer’s fault, but sometimes it’s the customer’s fault. Oftentimes, stupid anti-patterns emerge where the users do something they’re not supposed to and future versions of projects need to keep backwards compatibility. This accidental complexity is there to stay, and that sucks.</p>
<p>To remove complexity, we use <strong>abstractions</strong>. Basically hide a lot of implementation details under a simple interface. This is surprisingly hard.</p>
<p><em>TL;DR</em>: Don’t make things too complicated, and prevent users from doing stupid shit by hiding things from them.</p>
<h3 id="aside-how-to-evolve">Aside: How to Evolve</h3>
<p>This is more of an engineering design process kind of issue. The development team needs to be receptive to constant change due to regulations, scalability concerns, etc. <strong>Agile</strong> is a complicated development pattern that focuses on iterative development and frequent introspections in order to change specifications on the fly.</p>
<p><em>TL;DR</em>: Agile is one method to allow for frequently changing specifications and evolution of a project.</p>
Python Fundamentals - Sequences2019-06-27T00:00:00+00:00http://oneraynyday.github.io/dev/2019/06/27/Fluent-Python-1-Sequences<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#overview" id="markdown-toc-overview">Overview</a> <ul>
<li><a href="#cartesian-products" id="markdown-toc-cartesian-products">Cartesian Products</a></li>
<li><a href="#restricted-listcomp" id="markdown-toc-restricted-listcomp">Restricted listcomp</a></li>
</ul>
</li>
<li><a href="#list--tuples--generic-sequences" id="markdown-toc-list--tuples--generic-sequences">List & Tuples & Generic Sequences</a> <ul>
<li><a href="#generator-expressions-genexp" id="markdown-toc-generator-expressions-genexp">Generator Expressions (genexp)</a></li>
<li><a href="#tuple" id="markdown-toc-tuple"><code class="language-plaintext highlighter-rouge">tuple</code></a> <ul>
<li><a href="#namedtuple" id="markdown-toc-namedtuple"><code class="language-plaintext highlighter-rouge">namedtuple</code></a></li>
</ul>
</li>
<li><a href="#advanced-slicing" id="markdown-toc-advanced-slicing">Advanced Slicing</a> <ul>
<li><a href="#slice-assignment" id="markdown-toc-slice-assignment">Slice Assignment</a> <ul>
<li><a href="#add" id="markdown-toc-add">Add</a></li>
<li><a href="#delete" id="markdown-toc-delete">Delete</a></li>
</ul>
</li>
<li><a href="#no-more-magic---readable-slices" id="markdown-toc-no-more-magic---readable-slices">No More Magic - Readable Slices</a></li>
<li><a href="#multi-argument-slices---how" id="markdown-toc-multi-argument-slices---how">Multi-Argument Slices - How?</a></li>
</ul>
</li>
<li><a href="#--operators-with-sequences" id="markdown-toc---operators-with-sequences"><code class="language-plaintext highlighter-rouge">+, *</code> Operators With Sequences</a> <ul>
<li><a href="#reference-issues-with-" id="markdown-toc-reference-issues-with-">Reference Issues with <code class="language-plaintext highlighter-rouge">*</code></a></li>
<li><a href="#__iadd__-and-__imul__-with-immutable-objects" id="markdown-toc-__iadd__-and-__imul__-with-immutable-objects"><code class="language-plaintext highlighter-rouge">__iadd__</code> and <code class="language-plaintext highlighter-rouge">__imul__</code> With Immutable Objects</a></li>
<li><a href="#weirdest-corner-case" id="markdown-toc-weirdest-corner-case">Weirdest Corner Case</a></li>
</ul>
</li>
<li><a href="#bisect-for-sorted-sequences" id="markdown-toc-bisect-for-sorted-sequences"><code class="language-plaintext highlighter-rouge">bisect</code> For Sorted Sequences</a> <ul>
<li><a href="#searching-using-bisect" id="markdown-toc-searching-using-bisect">Searching Using <code class="language-plaintext highlighter-rouge">bisect()</code></a></li>
<li><a href="#inserting-using-insort" id="markdown-toc-inserting-using-insort">Inserting Using <code class="language-plaintext highlighter-rouge">insort()</code></a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#homogenous-array-types" id="markdown-toc-homogenous-array-types">Homogenous Array Types</a> <ul>
<li><a href="#array" id="markdown-toc-array"><code class="language-plaintext highlighter-rouge">array</code></a></li>
<li><a href="#memoryview" id="markdown-toc-memoryview"><code class="language-plaintext highlighter-rouge">memoryview</code></a></li>
</ul>
</li>
<li><a href="#deques--queues" id="markdown-toc-deques--queues">Deques & Queues</a> <ul>
<li><a href="#deque" id="markdown-toc-deque"><code class="language-plaintext highlighter-rouge">deque</code></a></li>
<li><a href="#different-queues" id="markdown-toc-different-queues">Different Queues</a></li>
</ul>
</li>
</ul>
<h1 id="overview">Overview</h1>
<p>Python’s standard library provides the following sequence types:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">list</code>. Heterogenous and mutable.</li>
<li><code class="language-plaintext highlighter-rouge">tuple</code>. Heterogenous and immutable.</li>
<li><code class="language-plaintext highlighter-rouge">collections.deque</code>. Heterogenous and mutable.</li>
<li><code class="language-plaintext highlighter-rouge">str</code>. Flat and immutable.</li>
<li><code class="language-plaintext highlighter-rouge">bytes</code>. Flat and immutable.</li>
<li><code class="language-plaintext highlighter-rouge">bytearray</code>. Flat and mutable.</li>
<li><code class="language-plaintext highlighter-rouge">memoryview</code>. Flat, and mutable when the underlying buffer is writable.</li>
<li><code class="language-plaintext highlighter-rouge">array.array</code>. Flat and mutable.</li>
</ul>
<p>A data structure is <strong>flat</strong> if it can only hold items of a single type, and <strong>heterogenous</strong> if it can hold items of different types. It is <strong>mutable</strong> if its contents can be modified in place, and <strong>immutable</strong> otherwise.</p>
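<p>As a quick sanity check of these two axes, here is a minimal sketch (the variable names are mine, purely for illustration) that pokes at a list, an <code class="language-plaintext highlighter-rouge">array</code> and a tuple:</p>

```python
from array import array

# A list is heterogenous: it happily mixes types.
mixed = [1, 'two', 3.0]

# An array is flat: every item must match the declared type code.
ints = array('i', [1, 2, 3])
try:
    ints.append('four')
    flat_rejected = False
except TypeError:
    flat_rejected = True

# A list is mutable; a tuple is not.
mixed[0] = 'one'
frozen = (1, 2, 3)
try:
    frozen[0] = 'one'
    tuple_rejected = False
except TypeError:
    tuple_rejected = True

print(flat_rejected, tuple_rejected)
```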
<p>List comprehensions (listcomps) are fairly simple, so I will just give a few examples here before we jump to generator expressions (genexps).</p>
<h3 id="cartesian-products">Cartesian Products</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">letters</span> <span class="o">=</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'c'</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">numbers</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">product</span> <span class="o">=</span> <span class="p">[(</span><span class="n">letter</span><span class="p">,</span> <span class="n">number</span><span class="p">)</span> <span class="k">for</span> <span class="n">letter</span> <span class="ow">in</span> <span class="n">letters</span> <span class="k">for</span> <span class="n">number</span> <span class="ow">in</span> <span class="n">numbers</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">product</span>
<span class="p">[(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="mi">3</span><span class="p">)]</span>
</code></pre></div></div>
<h3 id="restricted-listcomp">Restricted listcomp</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">letters</span> <span class="o">=</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'c'</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">filtered_letters</span> <span class="o">=</span> <span class="p">[</span><span class="n">letter</span> <span class="k">for</span> <span class="n">letter</span> <span class="ow">in</span> <span class="n">letters</span> <span class="k">if</span> <span class="n">letter</span> <span class="o">!=</span> <span class="s">'c'</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">filtered_letters</span>
<span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">]</span>
</code></pre></div></div>
<h1 id="list--tuples--generic-sequences">List & Tuples & Generic Sequences</h1>
<h2 id="generator-expressions-genexp">Generator Expressions (genexp)</h2>
<p>Generator expressions save memory when the entire list is not needed, because they produce items one at a time instead of building the whole list up front. We replace the listcomp with a genexp in the Cartesian product example above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">letters</span> <span class="o">=</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'c'</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">numbers</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'%s %s'</span> <span class="o">%</span> <span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">letters</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">numbers</span><span class="p">):</span>
<span class="p">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">element</span><span class="p">)</span>
<span class="p">...</span>
<span class="n">a</span> <span class="mi">1</span>
<span class="n">a</span> <span class="mi">2</span>
<span class="n">a</span> <span class="mi">3</span>
<span class="n">b</span> <span class="mi">1</span>
<span class="n">b</span> <span class="mi">2</span>
<span class="n">b</span> <span class="mi">3</span>
<span class="n">c</span> <span class="mi">1</span>
<span class="n">c</span> <span class="mi">2</span>
<span class="n">c</span> <span class="mi">3</span>
</code></pre></div></div>
<p>Here, we never built the full Cartesian product list just to scan it once for printing. Each string is produced on demand, which is more memory-efficient.</p>
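<p>To make the laziness concrete, here is a small sketch (the names <code class="language-plaintext highlighter-rouge">pairs</code>, <code class="language-plaintext highlighter-rouge">first</code>, <code class="language-plaintext highlighter-rouge">rest</code> and <code class="language-plaintext highlighter-rouge">leftover</code> are just for illustration). It pulls items out of the same genexp by hand, and shows that a genexp can only be consumed once:</p>

```python
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]

# Nothing is computed yet: this only creates a generator object.
pairs = ('%s %s' % (l, n) for l in letters for n in numbers)

first = next(pairs)      # computes only the first item
rest = list(pairs)       # consumes the remaining eight items
leftover = list(pairs)   # the generator is now exhausted

print(first, len(rest), leftover)
```

<p>If you need to iterate more than once, build a list instead, or recreate the genexp.</p>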
<h2 id="tuple"><code class="language-plaintext highlighter-rouge">tuple</code></h2>
<p>Tuples are immutable sequences which can also serve as records with no field names. In that role, the position of each item in the tuple carries its semantic meaning. We unpack a tuple like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">coordinates</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">_</span><span class="p">,</span> <span class="n">longitude</span> <span class="o">=</span> <span class="n">coordinates</span>
<span class="o">>>></span> <span class="n">longitude</span>
<span class="mi">2</span>
</code></pre></div></div>
<p>Another way to unpack is by using the star operator:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">coordinates</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span>
<span class="o">>>></span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">coordinates</span><span class="p">)</span>
<span class="mi">3</span>
</code></pre></div></div>
<p>You can also combine the two:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">long_tup</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="o">*</span><span class="n">z</span> <span class="o">=</span> <span class="n">long_tup</span>
<span class="o">>>></span> <span class="n">x</span>
<span class="mi">1</span>
<span class="o">>>></span> <span class="n">y</span>
<span class="mi">2</span>
<span class="o">>>></span> <span class="n">z</span>
<span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
</code></pre></div></div>
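<p>The starred name does not have to come last. A short sketch, with variable names chosen only for illustration:</p>

```python
long_tup = (1, 2, 3, 4, 5, 6, 7)

# The starred target soaks up whatever the other targets leave behind.
first, *middle, last = long_tup
*head, tail = long_tup

print(first, middle, last)
print(head, tail)
```

<p>Note that the starred target always receives a list, even when unpacking a tuple.</p>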
<h3 id="namedtuple"><code class="language-plaintext highlighter-rouge">namedtuple</code></h3>
<p><code class="language-plaintext highlighter-rouge">namedtuple</code> is part of the <code class="language-plaintext highlighter-rouge">collections</code> module and, as the name implies, gives a name to each field. Just as we use tuples as nameless records, we use <code class="language-plaintext highlighter-rouge">namedtuple</code>s for records with named fields. These should be used more often:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">namedtuple</span>
<span class="o">>>></span> <span class="n">Snek</span> <span class="o">=</span> <span class="n">namedtuple</span><span class="p">(</span><span class="s">'Snake'</span><span class="p">,</span> <span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'length'</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">an</span> <span class="o">=</span> <span class="n">Snek</span><span class="p">(</span><span class="s">'anaconda'</span><span class="p">,</span> <span class="s">'4.511'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">py</span> <span class="o">=</span> <span class="n">Snek</span><span class="p">(</span><span class="s">'python'</span><span class="p">,</span> <span class="s">'3.7'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">py</span>
<span class="n">Snake</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'python'</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="s">'3.7'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">an</span>
<span class="n">Snake</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'anaconda'</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="s">'4.511'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">py</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="s">'3.7'</span>
</code></pre></div></div>
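<p>Indexing by position still works, but the point of a <code class="language-plaintext highlighter-rouge">namedtuple</code> is access by name. A small sketch reusing the <code class="language-plaintext highlighter-rouge">Snek</code> type from above (the <code class="language-plaintext highlighter-rouge">'3.8'</code> value is made up for the example):</p>

```python
from collections import namedtuple

Snek = namedtuple('Snake', ['name', 'length'])
py = Snek('python', '3.7')

name = py.name                      # access a field by name
newer = py._replace(length='3.8')   # returns a new record; py is untouched
as_dict = py._asdict()              # field name -> value mapping
fields = Snek._fields               # tuple of field names

print(name, newer, as_dict, fields)
```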
<h2 id="advanced-slicing">Advanced Slicing</h2>
<h3 id="slice-assignment">Slice Assignment</h3>
<p>You can modify portions of a <strong>mutable</strong> sequence using slices, e.g. to insert or delete elements:</p>
<h4 id="add">Add</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">l</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'c'</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">l</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
</code></pre></div></div>
<h4 id="delete">Delete</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">del</span> <span class="n">l</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="o">>>></span> <span class="n">l</span>
<span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
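<p>The slice being assigned to and the replacement do not need to have the same length, and deletion works with a step as well. A minimal sketch:</p>

```python
l = list(range(10))

# Replace three elements (indices 2, 3, 4) with a single one.
l[2:5] = ['x']
after_replace = list(l)

# Delete every other element, starting from index 0.
del l[::2]
after_delete = list(l)

print(after_replace)
print(after_delete)
```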
<h3 id="no-more-magic---readable-slices">No More Magic - Readable Slices</h3>
<p>When you write <code class="language-plaintext highlighter-rouge">seq[start:stop:step]</code>, Python calls <code class="language-plaintext highlighter-rouge">seq.__getitem__(slice(start, stop, step))</code>, where <code class="language-plaintext highlighter-rouge">slice</code> is an ordinary object. Thus, you can create <strong>named constant slices</strong> for better readability, like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">NAME</span> <span class="o">=</span> <span class="nb">slice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">FIRST</span> <span class="o">=</span> <span class="nb">slice</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">LAST</span> <span class="o">=</span> <span class="nb">slice</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">items</span>
<span class="p">[</span><span class="s">'a   123   456'</span><span class="p">,</span> <span class="s">'b   23    4567'</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span>
<span class="p">...</span>     <span class="k">print</span><span class="p">(</span><span class="s">"Name: {}, First: {}, Last: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="n">NAME</span><span class="p">].</span><span class="n">strip</span><span class="p">(),</span> <span class="n">item</span><span class="p">[</span><span class="n">FIRST</span><span class="p">].</span><span class="n">strip</span><span class="p">(),</span> <span class="n">item</span><span class="p">[</span><span class="n">LAST</span><span class="p">].</span><span class="n">strip</span><span class="p">()))</span>
<span class="p">...</span>
<span class="n">Name</span><span class="p">:</span> <span class="n">a</span><span class="p">,</span> <span class="n">First</span><span class="p">:</span> <span class="mi">123</span><span class="p">,</span> <span class="n">Last</span><span class="p">:</span> <span class="mi">456</span>
<span class="n">Name</span><span class="p">:</span> <span class="n">b</span><span class="p">,</span> <span class="n">First</span><span class="p">:</span> <span class="mi">23</span><span class="p">,</span> <span class="n">Last</span><span class="p">:</span> <span class="mi">4567</span>
</code></pre></div></div>
<p>This makes your functions less “magic” and more readable for future developers.</p>
<h3 id="multi-argument-slices---how">Multi-Argument Slices - How?</h3>
<p>When you use <code class="language-plaintext highlighter-rouge">numpy</code>’s multidimensional array class, <code class="language-plaintext highlighter-rouge">np.ndarray</code>, how does it handle multiple arguments?</p>
<p>When you write <code class="language-plaintext highlighter-rouge">a[i,j]</code>, Python actually calls <code class="language-plaintext highlighter-rouge">a.__getitem__((i, j))</code>, so the arguments inside <code class="language-plaintext highlighter-rouge">[]</code> are packaged as a tuple. <code class="language-plaintext highlighter-rouge">numpy</code> defines a custom <code class="language-plaintext highlighter-rouge">__getitem__</code> that accepts such tuples, unlike the built-in sequence types.</p>
<p>When you use <code class="language-plaintext highlighter-rouge">a[i, ...]</code>, it is another multi-argument slice, equivalent to <code class="language-plaintext highlighter-rouge">a[i, :, :, :]</code> if <code class="language-plaintext highlighter-rouge">a</code> is a 4-dimensional array. The <code class="language-plaintext highlighter-rouge">...</code> token is a literal for the <code class="language-plaintext highlighter-rouge">Ellipsis</code> object, the single instance of the <code class="language-plaintext highlighter-rouge">ellipsis</code> class. Think of <code class="language-plaintext highlighter-rouge">Ellipsis</code> as a constant like <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code>, and of its class <code class="language-plaintext highlighter-rouge">ellipsis</code> as the counterpart of <code class="language-plaintext highlighter-rouge">bool</code>.</p>
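<p>You can see this packaging without <code class="language-plaintext highlighter-rouge">numpy</code> at all. The sketch below uses a hypothetical <code class="language-plaintext highlighter-rouge">Probe</code> class (my own name, not part of any library) whose <code class="language-plaintext highlighter-rouge">__getitem__</code> simply returns whatever it was handed:</p>

```python
class Probe:
    """Hypothetical helper: returns the index object it receives."""

    def __getitem__(self, index):
        return index


p = Probe()
single = p[1]              # a plain int
pair = p[1, 2]             # a tuple of ints
with_slice = p[1:5, ::2]   # a tuple of slice objects
with_dots = p[0, ...]      # a tuple containing Ellipsis

print(single, pair, with_slice, with_dots)
```

<p>A class that wants to support multi-dimensional indexing only has to inspect this tuple inside its own <code class="language-plaintext highlighter-rouge">__getitem__</code>.</p>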
<h2 id="--operators-with-sequences"><code class="language-plaintext highlighter-rouge">+, *</code> Operators With Sequences</h2>
<h3 id="reference-issues-with-">Reference Issues with <code class="language-plaintext highlighter-rouge">*</code></h3>
<p><strong>Never do this:</strong> <code class="language-plaintext highlighter-rouge">[[]] * 3</code>. When you first evaluate it, you’ll see <code class="language-plaintext highlighter-rouge">[[],[],[]]</code>. It looks innocent enough, but all three elements are references to the same list, so mutating one mutates them all. <strong>Do this instead:</strong> <code class="language-plaintext highlighter-rouge">[[] for _ in range(3)]</code>, so that each object is constructed individually.</p>
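<p>A minimal sketch of what goes wrong (the names <code class="language-plaintext highlighter-rouge">shared</code> and <code class="language-plaintext highlighter-rouge">separate</code> are only for illustration):</p>

```python
# Three references to ONE inner list.
shared = [[]] * 3
shared[0].append('x')

# Three distinct inner lists.
separate = [[] for _ in range(3)]
separate[0].append('x')

print(shared)     # the append shows up in every slot
print(separate)   # the append shows up only in the first slot
```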
<h3 id="__iadd__-and-__imul__-with-immutable-objects"><code class="language-plaintext highlighter-rouge">__iadd__</code> and <code class="language-plaintext highlighter-rouge">__imul__</code> With Immutable Objects</h3>
<p>When you augment a sequence using <code class="language-plaintext highlighter-rouge">+=</code> or <code class="language-plaintext highlighter-rouge">*=</code>, Python invokes the magic methods <code class="language-plaintext highlighter-rouge">__iadd__</code> and <code class="language-plaintext highlighter-rouge">__imul__</code> respectively. If the object does not define them, Python falls back to <code class="language-plaintext highlighter-rouge">__add__</code> and <code class="language-plaintext highlighter-rouge">__mul__</code>. For immutable objects like tuples, this sometimes spells trouble:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">l</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">id</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="mi">4510772608</span>
<span class="o">>>></span> <span class="n">l</span> <span class="o">+=</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">l</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">id</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="mi">4510680840</span>
</code></pre></div></div>
<p><strong>These are not the same objects.</strong> What’s happening underneath the hood is equivalent to <code class="language-plaintext highlighter-rouge">l = l + (4,5,6)</code>, which means we created a new tuple! <strong>To be precise, the tuple container is replaced while the references to objects inside are the same before and after.</strong></p>
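<p>For contrast, here is a short sketch comparing a list, which implements <code class="language-plaintext highlighter-rouge">__iadd__</code>, with a tuple, which does not. I compare with <code class="language-plaintext highlighter-rouge">is</code> against a saved reference rather than printing <code class="language-plaintext highlighter-rouge">id()</code> values, since the actual ids differ from run to run:</p>

```python
nums_list = [1, 2, 3]
list_before = nums_list          # second reference to the same list
nums_list += [4, 5, 6]           # list.__iadd__ extends in place
list_same = nums_list is list_before

nums_tuple = (1, 2, 3)
tuple_before = nums_tuple        # keeps the old tuple alive
nums_tuple += (4, 5, 6)          # falls back to __add__, builds a new tuple
tuple_same = nums_tuple is tuple_before

print(list_same, tuple_same)
```

<p>The list keeps its identity, so every other reference to it sees the new items. The old tuple is left untouched and the name is simply rebound.</p>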
<h3 id="weirdest-corner-case">Weirdest Corner Case</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">t</span> <span class="o">=</span> <span class="p">([],)</span>
<span class="o">>>></span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
<span class="n">File</span> <span class="s">"<stdin>"</span><span class="p">,</span> <span class="n">line</span> <span class="mi">1</span><span class="p">,</span> <span class="ow">in</span> <span class="o"><</span><span class="n">module</span><span class="o">></span>
<span class="nb">TypeError</span><span class="p">:</span> <span class="s">'tuple'</span> <span class="nb">object</span> <span class="n">does</span> <span class="ow">not</span> <span class="n">support</span> <span class="n">item</span> <span class="n">assignment</span>
<span class="o">>>></span> <span class="n">t</span>
<span class="p">([</span><span class="mi">1</span><span class="p">],)</span>
</code></pre></div></div>
<p>Wait… So we modified the list inside the tuple, but it also gave us an error? How can both happen? <strong>Because this is not an atomic operation.</strong> Specifically, it expands to roughly the following steps:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">temp_t</span> <span class="o">=</span> <span class="n">t</span>
<span class="o">>>></span> <span class="n">temp_idx</span> <span class="o">=</span> <span class="mi">0</span>
<span class="o">>>></span> <span class="n">mutable_item</span> <span class="o">=</span> <span class="n">temp_t</span><span class="p">[</span><span class="n">temp_idx</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">mutable_item</span> <span class="o">+=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">temp_t</span><span class="p">[</span><span class="n">temp_idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">mutable_item</span>
<span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
<span class="n">File</span> <span class="s">"<stdin>"</span><span class="p">,</span> <span class="n">line</span> <span class="mi">1</span><span class="p">,</span> <span class="ow">in</span> <span class="o"><</span><span class="n">module</span><span class="o">></span>
<span class="nb">TypeError</span><span class="p">:</span> <span class="s">'tuple'</span> <span class="nb">object</span> <span class="n">does</span> <span class="ow">not</span> <span class="n">support</span> <span class="n">item</span> <span class="n">assignment</span>
</code></pre></div></div>
<p>You can also disassemble the statement with the <code class="language-plaintext highlighter-rouge">dis</code> module and view the bytecode to see a similar process.</p>
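<p>Here is one way to do that, as a rough sketch. The exact opcode names vary between Python versions (for instance, the in-place add is spelled differently in newer releases), so I only collect the names rather than relying on a particular listing:</p>

```python
import dis

# Compile the statement without running it; t does not need to exist.
code = compile("t[0] += [1]", "<example>", "exec")

# Collect the opcode names in execution order.
opnames = [ins.opname for ins in dis.get_instructions(code)]
print(opnames)
```

<p>In the versions I am aware of, the listing ends with a <code class="language-plaintext highlighter-rouge">STORE_SUBSCR</code> that comes after the in-place add. That store is the step that fails on a tuple, after the list has already been mutated.</p>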
<h2 id="bisect-for-sorted-sequences"><code class="language-plaintext highlighter-rouge">bisect</code> For Sorted Sequences</h2>
<h3 id="searching-using-bisect">Searching Using <code class="language-plaintext highlighter-rouge">bisect()</code></h3>
<p>There are two bisect functions, <code class="language-plaintext highlighter-rouge">bisect_left</code> and <code class="language-plaintext highlighter-rouge">bisect_right</code>, and <code class="language-plaintext highlighter-rouge">bisect.bisect</code> is an alias for <code class="language-plaintext highlighter-rouge">bisect.bisect_right</code>. Both are binary searches over sorted sequences of any ordered type. The difference between the two is subtle: when the object you search for is already present in the sequence, <code class="language-plaintext highlighter-rouge">bisect_left</code> returns the index of its first occurrence, and <code class="language-plaintext highlighter-rouge">bisect_right</code> returns the index just past its last occurrence (the index plus one when it appears once):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">bisect</span> <span class="kn">import</span> <span class="n">bisect_right</span><span class="p">,</span> <span class="n">bisect_left</span>
<span class="o">>>></span> <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">bisect_right</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="mi">2</span>
<span class="o">>>></span> <span class="n">bisect_left</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="mi">1</span>
</code></pre></div></div>
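<p>The difference is easier to see with duplicates. A small sketch (the list and the variable names are just for illustration):</p>

```python
from bisect import bisect_left, bisect_right

l = [1, 2, 2, 2, 3]

left = bisect_left(l, 2)     # index of the first 2
right = bisect_right(l, 2)   # index just past the last 2
count = right - left         # number of occurrences of 2

# For a value that is absent, both functions agree.
missing_left = bisect_left(l, 2.5)
missing_right = bisect_right(l, 2.5)

print(left, right, count, missing_left, missing_right)
```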
<p>One application is <strong>discretization</strong>. For example, we can map a numerical exam score to a final grade in the discrete set <code class="language-plaintext highlighter-rouge">F, D, C, B, A</code>, or a waist size to a shirt size:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">shirt_size</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">cutoff</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span><span class="mi">60</span><span class="p">],</span> <span class="n">sizes</span><span class="o">=</span><span class="p">[</span><span class="s">'Small'</span><span class="p">,</span> <span class="s">'Medium'</span><span class="p">,</span> <span class="s">'Large'</span><span class="p">]):</span>
<span class="p">...</span>     <span class="k">return</span> <span class="n">sizes</span><span class="p">[</span><span class="n">bisect_right</span><span class="p">(</span><span class="n">cutoff</span><span class="p">,</span> <span class="n">size</span><span class="p">)]</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">shirt_size</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
<span class="s">'Small'</span>
<span class="o">>>></span> <span class="n">shirt_size</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span>
<span class="s">'Medium'</span>
<span class="o">>>></span> <span class="n">shirt_size</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span>
<span class="s">'Medium'</span>
<span class="o">>>></span> <span class="n">shirt_size</span><span class="p">(</span><span class="mi">60</span><span class="p">)</span>
<span class="s">'Large'</span>
<span class="o">>>></span> <span class="n">shirt_size</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="s">'Large'</span>
</code></pre></div></div>
<p>Here, we use <code class="language-plaintext highlighter-rouge">bisect_right</code> because we would rather have the person fit comfortably in a larger t-shirt when their waist size is exactly at the cutoff. If you want to look bigger than you actually are (like me), then you would use <code class="language-plaintext highlighter-rouge">bisect_left</code>.</p>
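<p>The exam-grade case mentioned earlier works the same way. This is a sketch with breakpoints I picked arbitrarily (60, 70, 80, 90), so treat them as an assumption rather than a real grading scheme:</p>

```python
from bisect import bisect_right


def grade(score, breakpoints=(60, 70, 80, 90), grades='FDCBA'):
    """Map a numeric score to a letter; a score on a breakpoint rounds up."""
    return grades[bisect_right(breakpoints, score)]


results = [grade(s) for s in (33, 60, 77, 89, 90, 100)]
print(results)
```

<p>There is always one more grade than there are breakpoints, because <code class="language-plaintext highlighter-rouge">bisect_right</code> can return any index from 0 up to the number of breakpoints.</p>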
<h3 id="inserting-using-insort">Inserting Using <code class="language-plaintext highlighter-rouge">insort()</code></h3>
<p>We use <code class="language-plaintext highlighter-rouge">bisect.insort</code> to add an element into a sorted sequence:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">bisect</span>
<span class="o">>>></span> <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">7</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">):</span>
<span class="p">...</span>     <span class="n">bisect</span><span class="p">.</span><span class="n">insort</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">i</span><span class="o">*</span><span class="mi">2</span><span class="p">)</span>
<span class="p">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="p">...</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
</code></pre></div></div>
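<p><code class="language-plaintext highlighter-rouge">insort</code> accepts optional <code class="language-plaintext highlighter-rouge">lo</code> and <code class="language-plaintext highlighter-rouge">hi</code> arguments to restrict the insertion to a sorted sub-range, and it comes in <code class="language-plaintext highlighter-rouge">insort_left</code> and <code class="language-plaintext highlighter-rouge">insort_right</code> variants, just like <code class="language-plaintext highlighter-rouge">bisect</code>. Once again, this can be applied to more than just lists. Any sorted mutable sequence with an <code class="language-plaintext highlighter-rouge">insert</code> method will do, like <code class="language-plaintext highlighter-rouge">array</code>s!</p>
<h1 id="homogenous-array-types">Homogeneous Array Types</h1>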
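<p>To make the variants concrete, here is a small sketch of the left/right difference, the <code class="language-plaintext highlighter-rouge">lo</code> argument, and <code class="language-plaintext highlighter-rouge">insort</code> on an <code class="language-plaintext highlighter-rouge">array</code>. The variable names and values are made up for illustration:</p>

```python
import bisect
from array import array

# insort_left puts a new element before equal ones, insort_right after them.
# 2 == 2.0 in Python, so the float lets us see where the new element landed.
left = [1, 2, 3]
bisect.insort_left(left, 2.0)
right = [1, 2, 3]
bisect.insort_right(right, 2.0)
print(left)    # [1, 2.0, 2, 3]
print(right)   # [1, 2, 2.0, 3]

# lo/hi restrict the search to a sorted sub-range: only data[2:] is sorted here.
data = [9, 8, 1, 3, 5]
bisect.insort(data, 4, lo=2)
print(data)    # [9, 8, 1, 3, 4, 5]

# insort also works on an array, since array has an insert() method.
scores = array('i', [10, 20, 30])
bisect.insort(scores, 25)
print(scores)  # array('i', [10, 20, 25, 30])
```

<p>Plain <code class="language-plaintext highlighter-rouge">insort</code> is an alias for <code class="language-plaintext highlighter-rouge">insort_right</code>, which is why the loop above never has to think about ties.</p>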
<h2 id="array"><code class="language-plaintext highlighter-rouge">array</code></h2>
<p><strong>Arrays are underrated and underused. For large sequences of numbers, they are far more compact than <code class="language-plaintext highlighter-rouge">list</code>s and much faster to write to and read from disk. <code class="language-plaintext highlighter-rouge">list</code>s are the default, but don’t be lazy when you need the performance.</strong> An <code class="language-plaintext highlighter-rouge">array</code> stores the raw bytes of a single primitive data type back to back, so it’s basically a C-style array. To create an array and dump it to a file:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">array</span> <span class="kn">import</span> <span class="n">array</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="n">nums</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="s">'d'</span><span class="p">,</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)))</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">nums</span><span class="p">:</span>
<span class="p">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">num</span><span class="p">)</span>
<span class="p">...</span>
<span class="mf">0.2076318634616442</span>
<span class="mf">0.5052559930909137</span>
<span class="mf">0.26556051714640794</span>
<span class="mf">0.3538229563850064</span>
<span class="mf">0.24394891007765362</span>
<span class="mf">0.829697244498978</span>
<span class="mf">0.8050680531932854</span>
<span class="mf">0.7540974416748557</span>
<span class="mf">0.5157377814111441</span>
<span class="mf">0.6025949390048687</span>
<span class="o">>>></span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'temp.bin'</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="p">...</span>     <span class="n">nums</span><span class="p">.</span><span class="n">tofile</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="p">...</span>
</code></pre></div></div>
<p>And then we see the saved binary file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -la temp.bin
-rw-r--r-- 1 dev staff 80 Jun 27 16:18 temp.bin
</code></pre></div></div>
<p>Nice. The file is exactly 80 bytes: 10 doubles at 8 bytes each, with no per-element overhead. <code class="language-plaintext highlighter-rouge">numpy</code> arrays do something similar. <code class="language-plaintext highlighter-rouge">bytes</code> and <code class="language-plaintext highlighter-rouge">bytearray</code> behave like specialized arrays of unsigned bytes and will be discussed in detail later.</p>
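<p>Reading the data back is just as short. The sketch below round-trips a small array through a temporary file (the path and values are placeholders) and checks that the file size is the element count times <code class="language-plaintext highlighter-rouge">itemsize</code>:</p>

```python
import os
import tempfile
from array import array

nums = array('d', [0.5, 1.5, 2.5])

# Write the raw bytes to a scratch file (the location is arbitrary).
path = os.path.join(tempfile.mkdtemp(), 'temp.bin')
with open(path, 'wb') as f:
    nums.tofile(f)

# The file holds nothing but the packed doubles.
size = os.path.getsize(path)
print(size, len(nums) * nums.itemsize)

# fromfile needs to be told how many items to read.
loaded = array('d')
with open(path, 'rb') as f:
    loaded.fromfile(f, len(nums))
print(loaded == nums)
```

<p>Note that the file carries no type information, so the reader has to know it is full of <code class="language-plaintext highlighter-rouge">'d'</code> values.</p>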
<h2 id="memoryview"><code class="language-plaintext highlighter-rouge">memoryview</code></h2>
<p><code class="language-plaintext highlighter-rouge">memoryview</code> is like a slice of an <code class="language-plaintext highlighter-rouge">array</code>, except that nothing is copied: the view references the original buffer, and it is writable whenever the underlying object is. Here we change the content of the first double in an <code class="language-plaintext highlighter-rouge">array</code> by casting the memoryview to unsigned 8-bit ints and modifying the first byte:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">array</span> <span class="kn">import</span> <span class="n">array</span>
<span class="o">>>></span> <span class="n">nums</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="s">'d'</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">mv</span> <span class="o">=</span> <span class="nb">memoryview</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">mvc</span> <span class="o">=</span> <span class="n">mv</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">mvc</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">240</span><span class="p">,</span> <span class="mi">63</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">64</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">mvc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">10</span>
<span class="o">>>></span> <span class="n">nums</span>
<span class="n">array</span><span class="p">(</span><span class="s">'d'</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0000000000000022</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">])</span>
</code></pre></div></div>
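<p>For contrast, here is a small sketch (values made up) of the difference between slicing the <code class="language-plaintext highlighter-rouge">array</code> itself, which copies, and slicing a <code class="language-plaintext highlighter-rouge">memoryview</code> of it, which does not:</p>

```python
from array import array

nums = array('d', [1.0, 2.0, 3.0, 4.0])

copied = nums[1:3]              # slicing an array copies the elements
window = memoryview(nums)[1:3]  # slicing a memoryview shares the buffer

copied[0] = -1.0   # only touches the copy
window[0] = 20.0   # writes through to nums[1]

print(nums)        # array('d', [1.0, 20.0, 3.0, 4.0])
window.release()   # let go of the buffer so nums can be resized again
```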
<p>Most of the time, you probably don’t want to use this though.</p>
<h1 id="deques--queues">Deques & Queues</h1>
<h2 id="deque"><code class="language-plaintext highlighter-rouge">deque</code></h2>
<p><code class="language-plaintext highlighter-rouge">collections</code> supplies us with a <code class="language-plaintext highlighter-rouge">deque</code> container, which is heterogeneous. It has the standard <code class="language-plaintext highlighter-rouge">append</code> and <code class="language-plaintext highlighter-rouge">pop</code> operations. On top of that, it has a couple of cool functions like <code class="language-plaintext highlighter-rouge">rotate</code>, <code class="language-plaintext highlighter-rouge">extend</code> and <code class="language-plaintext highlighter-rouge">extendleft</code>.</p>
<p><code class="language-plaintext highlighter-rouge">rotate()</code> takes a single integer as its argument and rotates the deque to the right by that many steps. Why is there no <code class="language-plaintext highlighter-rouge">rotateleft</code> to mirror <code class="language-plaintext highlighter-rouge">extendleft</code>? Because you can supply a negative number to rotate in the other direction. <code class="language-plaintext highlighter-rouge">rotate(k)</code> moves the i-th element to the <code class="language-plaintext highlighter-rouge">(i+k) % N</code>-th place in the deque, where <code class="language-plaintext highlighter-rouge">N</code> is the size of the container.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="o">>>></span> <span class="n">dq</span> <span class="o">=</span> <span class="n">deque</span><span class="p">(</span><span class="s">"abcdefg"</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span>
<span class="n">deque</span><span class="p">([</span><span class="s">'d'</span><span class="p">,</span> <span class="s">'e'</span><span class="p">,</span> <span class="s">'f'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">],</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span><span class="p">.</span><span class="n">rotate</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span>
<span class="n">deque</span><span class="p">([</span><span class="s">'f'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">,</span> <span class="s">'e'</span><span class="p">],</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span><span class="p">.</span><span class="n">rotate</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span>
<span class="n">deque</span><span class="p">([</span><span class="s">'g'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">,</span> <span class="s">'e'</span><span class="p">,</span> <span class="s">'f'</span><span class="p">],</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="s">"abc"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span>
<span class="n">deque</span><span class="p">([</span><span class="s">'f'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">],</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span><span class="p">.</span><span class="n">extendleft</span><span class="p">(</span><span class="s">"gfe"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dq</span>
<span class="n">deque</span><span class="p">([</span><span class="s">'e'</span><span class="p">,</span> <span class="s">'f'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'f'</span><span class="p">],</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</code></pre></div></div>
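<p>If the index formula for <code class="language-plaintext highlighter-rouge">rotate</code> feels abstract, this sketch rebuilds the rotated order by hand and compares it against the real thing (the letters and <code class="language-plaintext highlighter-rouge">k</code> are arbitrary):</p>

```python
from collections import deque

items = list("abcde")
N = len(items)
k = 2

dq = deque(items)
dq.rotate(k)

# Place the i-th element at index (i + k) % N, exactly as described above.
expected = [None] * N
for i, item in enumerate(items):
    expected[(i + k) % N] = item

print(list(dq))   # ['d', 'e', 'a', 'b', 'c']
print(expected)   # ['d', 'e', 'a', 'b', 'c']
```

<p>Python’s <code class="language-plaintext highlighter-rouge">%</code> always returns a non-negative result for a positive modulus, so the same formula covers negative <code class="language-plaintext highlighter-rouge">k</code> as well.</p>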
<h2 id="different-queues">Different Queues</h2>
<p>There are a lot of different queue containers in Python. You can technically use a <code class="language-plaintext highlighter-rouge">deque</code> as a queue too. However, in concurrent applications, inserting into and removing from a shared queue can be tricky. For those, we use the thread-safe <code class="language-plaintext highlighter-rouge">queue</code> library with its <code class="language-plaintext highlighter-rouge">Queue</code>, <code class="language-plaintext highlighter-rouge">LifoQueue</code> (which is literally a stack), and <code class="language-plaintext highlighter-rouge">PriorityQueue</code>. When you call <code class="language-plaintext highlighter-rouge">get</code> on an empty queue with the default arguments, it blocks until an item has been inserted, rather than raising an error. <code class="language-plaintext highlighter-rouge">multiprocessing</code> and <code class="language-plaintext highlighter-rouge">asyncio</code> implement their own queues as well.</p>
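<p>A minimal sketch of the three <code class="language-plaintext highlighter-rouge">queue</code> classes follows. The producer thread and the payloads are invented for the example:</p>

```python
import queue
import threading

# FIFO queue shared between a producer thread and the main thread.
q = queue.Queue()

def producer():
    for i in range(3):
        q.put(i)

t = threading.Thread(target=producer)
t.start()
received = [q.get() for _ in range(3)]  # each get() blocks until an item arrives
t.join()
print(received)  # [0, 1, 2]

# With a timeout, get() raises queue.Empty instead of waiting forever.
try:
    q.get(timeout=0.1)
    timed_out = False
except queue.Empty:
    timed_out = True
print(timed_out)  # True

# LifoQueue is a stack: last in, first out.
stack = queue.LifoQueue()
for item in "abc":
    stack.put(item)
top = stack.get()
print(top)  # c

# PriorityQueue hands back the smallest entry first.
pq = queue.PriorityQueue()
pq.put((2, "low priority"))
pq.put((1, "high priority"))
first = pq.get()
print(first)  # (1, 'high priority')
```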
Computability Theory - Post's Problem & Priority Argument (Part 2)2019-03-29T00:00:00+00:00http://oneraynyday.github.io/math/2019/03/29/Priority-Argument-Pt2<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#recap" id="markdown-toc-recap">Recap</a></li>
<li><a href="#intuition" id="markdown-toc-intuition">Intuition</a></li>
<li><a href="#construction" id="markdown-toc-construction">Construction</a></li>
</ul>
<p><em>wip</em></p>
<h1 id="recap">Recap</h1>
<p><em>Note: If you haven’t read part 1, this entire blog will make no sense to you.</em></p>
<p>Last time, we set up the notion of Turing reducibility and oracle machines, and asked the question, <em>do there exist recursively enumerable sets \(A,B\) such that \(A \not\leq_T B\) and \(B \not\leq_T A\)?</em>, where \(\leq_T\) is Turing reducibility. Emil Post, one of the leading figures in computability theory, posed this question. He was never able to answer it himself, since he died of a heart attack following electroshock therapy for depression. Two mathematicians, Friedberg and Muchnik, later solved the problem independently using the same method, which we now call the <strong>priority argument</strong>. We set up the intuition for how the priority method may work in part 1, but we failed to show computability of the two sets, since in every stage we relied on the following predicate:</p>
\[\forall x, \phi_e^A(x)\uparrow\]
<p>Which, as we explained, is definitely <strong>not computable</strong>. Then how do we get around this?</p>
<h1 id="intuition">Intuition</h1>
<p>We can never know whether a particular \(\phi_e^A\) will diverge on all inputs, but while we’re waiting for it to halt, we can run the next few stages in parallel, assuming that it’s not going to halt on anything. There are two cases associated with this strategy:</p>
<ol>
<li>\(\phi_e^A(x) \uparrow \forall x\), then what we have assumed for this program is fine.</li>
<li>\(\exists x, \phi_e^A(x) \downarrow\), then what we have assumed is false, but <strong>this will only occur once</strong>. Then, we can tell all programs we’re currently running, i.e. \(\phi_{e+1}^A, \phi_{e+2}^A, ...\), to restart with our new, corrected initial segment \(x_e\) (this was covered in the construction from part 1). Having to restart a program, but only finitely many times, is the concept of <strong>finite injury</strong>, and the ability of the strategy governing \(\phi_e^A\) to restart every program after \(\phi_e^A\) at will is the concept of <strong>priority</strong>.</li>
</ol>
<p>The description above may seem ambiguous, so stick around for the construction.</p>
<h1 id="construction">Construction</h1>
<p>We formally define <strong>strategy</strong> below as some algorithm that tries to meet our requirements. Our requirements in part 1 are exactly the same here (either #1 or #2):</p>
<ol>
<li>\(\phi_e^A(x)\uparrow \forall x\), and \(g_B(x_{e-1}+1)\) can be anything (recall \(g_B\) is a characteristic function so it is total).</li>
<li>\(\exists x_e > x_{e-1}, \phi_e^A(x_e) \downarrow \neq g_B(x_e)\).</li>
</ol>
<p>Then we implement the strategies \(S_{2e}\) for the \(e\)-th step of making \(\phi_e^A(x_e) \neq g_B(x_e)\), and the strategies \(S_{2e+1}\) for \(\phi_e^B(y_e) \neq g_A(y_e)\). We say the strategy \(S_i\) has <strong>higher priority</strong> than \(S_j\) iff \(i < j\). This means that \(S_i\) has the power to restart \(S_j\) whenever it wants to, with a new initial segment (notice how the strategies for \(g_A\) and \(g_B\) can restart each other). We say that an initial segment up to \(x_e\) is <strong>restricted</strong> for strategies \(S_i, i > 2e\), since they are not allowed to change the initial segment, as their priority is too low. However, any strategy \(S_k, k \leq 2e\) is allowed to modify the initial segment up to \(x_e\). A strategy \(S_{2e}\) is <strong>allowed to act</strong> when \(\phi_e^A(x_e)\downarrow\) for some \(x_e\) - this means our initial guess that the program \(\phi_e^A\) loops on all inputs is false, and so we must modify the initial segment \(x_e\). When the strategy acts, it restarts all strategies \(S_j\) with \(j > 2e\), which then obey the newly updated, restricted initial segment.</p>
<p>Our algorithm can be best explained in pseudocode, so let’s describe it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># We denote phi_e^A as e-th A-program here, similarly for B
# We also denote x_e as the initial segment index for e-th A-program,
# and similarly for y_e for B.
</span><span class="k">for</span> <span class="n">I</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,...</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">I</span><span class="o">//</span><span class="mi">2</span>
    <span class="k">if</span> <span class="n">I</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># I is even
</span>        <span class="c1"># run strategy for A
</span>        <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,...,</span><span class="n">e</span><span class="o">-</span><span class="mi">1</span>
            <span class="n">Run</span> <span class="n">i</span><span class="o">-</span><span class="n">th</span> <span class="n">A</span><span class="o">-</span><span class="n">program</span> <span class="n">on</span> <span class="n">x_i</span><span class="p">,</span><span class="n">x_i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,...,</span><span class="n">x_i</span> <span class="o">+</span> <span class="n">e</span> <span class="k">for</span> <span class="n">e</span> <span class="n">steps</span>
        <span class="k">if</span> <span class="n">e</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">x_e</span> <span class="o">=</span> <span class="n">x_</span><span class="p">{</span><span class="n">e</span><span class="o">-</span><span class="mi">1</span><span class="p">}</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">x_e</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">Run</span> <span class="n">e</span><span class="o">-</span><span class="n">th</span> <span class="n">A</span><span class="o">-</span><span class="n">program</span> <span class="n">on</span> <span class="n">x_e</span> <span class="k">for</span> <span class="n">e</span> <span class="n">steps</span><span class="p">.</span>
        <span class="c1"># When strategy 2i acts, x_i' = x_i+j, and all k-th programs currently running
</span>        <span class="c1"># where k > i, are restarted with the indices x_k = x_i' + (k-i).
</span>        <span class="k">if</span> <span class="nb">any</span> <span class="n">highest</span> <span class="n">priority</span> <span class="n">i</span><span class="o">-</span><span class="n">th</span> <span class="n">A</span><span class="o">-</span><span class="n">program</span> <span class="n">terminated</span> <span class="n">on</span> <span class="n">some</span> <span class="n">x_i</span> <span class="o">+</span> <span class="n">j</span><span class="p">,</span> <span class="n">then</span> <span class="n">allow</span> <span class="n">strategy</span> <span class="mi">2</span><span class="n">i</span> <span class="n">to</span> <span class="n">act</span><span class="p">.</span>
    <span class="k">else</span><span class="p">:</span> <span class="c1"># I is odd
</span>        <span class="c1"># run strategy for B
</span>        <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,...,</span><span class="n">e</span><span class="o">-</span><span class="mi">1</span>
            <span class="n">Run</span> <span class="n">i</span><span class="o">-</span><span class="n">th</span> <span class="n">B</span><span class="o">-</span><span class="n">program</span> <span class="n">on</span> <span class="n">y_i</span> <span class="k">for</span> <span class="n">e</span> <span class="n">steps</span>
        <span class="k">if</span> <span class="n">e</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">y_e</span> <span class="o">=</span> <span class="n">y_</span><span class="p">{</span><span class="n">e</span><span class="o">-</span><span class="mi">1</span><span class="p">}</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">y_e</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">Run</span> <span class="n">e</span><span class="o">-</span><span class="n">th</span> <span class="n">B</span><span class="o">-</span><span class="n">program</span> <span class="n">on</span> <span class="n">y_e</span> <span class="k">for</span> <span class="n">e</span> <span class="n">steps</span><span class="p">.</span>
        <span class="k">if</span> <span class="nb">any</span> <span class="n">i</span><span class="o">-</span><span class="n">th</span> <span class="n">B</span><span class="o">-</span><span class="n">program</span> <span class="n">terminated</span><span class="p">,</span> <span class="n">then</span> <span class="n">allow</span> <span class="n">strategy</span> <span class="mi">2</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span> <span class="n">to</span> <span class="n">act</span><span class="p">.</span>
</code></pre></div></div>
<p>What we are doing here is <strong>computably guessing whether the predicate \(\phi_e^A(x)\uparrow \forall x\) is true</strong>. If it fails, we know it will fail finitely many times, and our program will eventually reach the result from part 1.</p>
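<p>The “guess that it diverges until proven otherwise” idea can be played with in plain Python. The sketch below is only a toy model and not the actual construction: there are no oracles or initial segments, “programs” are generators that yield once per step, and <code class="language-plaintext highlighter-rouge">halts_after</code>, <code class="language-plaintext highlighter-rouge">runs_within</code> and <code class="language-plaintext highlighter-rouge">dovetail</code> are hypothetical helper names. At stage <code class="language-plaintext highlighter-rouge">s</code> we re-run programs <code class="language-plaintext highlighter-rouge">0..s</code> with a budget of <code class="language-plaintext highlighter-rouge">s</code> steps, and record the first stage at which our guess for a program turns out to be wrong:</p>

```python
import itertools

def halts_after(n):
    """Toy program: takes n steps and then halts (n=None never halts)."""
    def program():
        ticks = itertools.count() if n is None else range(n)
        for _ in ticks:
            yield  # one computation step
    return program

def runs_within(program, budget):
    """Run a fresh copy of the program; True iff it halts within `budget` resumptions."""
    gen = program()
    for _ in range(budget):
        try:
            next(gen)
        except StopIteration:
            return True
    return False

def dovetail(programs, stages):
    """Map program index -> first stage at which it was observed to halt."""
    observed = {}
    for stage in range(stages):
        # Only programs 0..stage have been started by this stage.
        for idx in range(min(stage + 1, len(programs))):
            if idx not in observed and runs_within(programs[idx], stage):
                observed[idx] = stage  # the "diverges everywhere" guess was wrong
    return observed

programs = [halts_after(None), halts_after(3), halts_after(0)]
result = dovetail(programs, 10)
print(result)  # {2: 2, 1: 4}
```

<p>Program 0 never shows up in the result, so our guess for it is never contradicted. Program 1 is only caught at stage 4, after program 2 was already handled at stage 2. In the real construction, that late discovery is exactly the moment when the higher-priority strategy acts and injures the lower-priority one.</p>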
Computability Theory - Post's Problem & Priority Argument (Part 1)2019-03-23T00:00:00+00:00http://oneraynyday.github.io/math/2019/03/23/Priority-Argument<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#recap" id="markdown-toc-recap">Recap</a></li>
<li><a href="#enumeration--normal-form-for-oracle-machines" id="markdown-toc-enumeration--normal-form-for-oracle-machines">Enumeration & Normal Form for Oracle Machines</a> <ul>
<li><a href="#why-the-setup" id="markdown-toc-why-the-setup">Why The Setup?</a></li>
</ul>
</li>
<li><a href="#basic-construction" id="markdown-toc-basic-construction">Basic Construction</a></li>
<li><a href="#diagonalization" id="markdown-toc-diagonalization">Diagonalization</a></li>
<li><a href="#however" id="markdown-toc-however">However…</a></li>
</ul>
<h1 id="recap">Recap</h1>
<p><strong>Quick recap</strong>: previously, we defined <em>recursively enumerable</em> (r.e.) sets, which are sets that are generated by a computable function. These sets encode a degree of complexity within them. Specifically, the membership \(x \in A \) is a semi-recursive relation, which means that if \(x\) is indeed in \(A\), then we will know eventually, but if \(x \not\in A\), we may have to wait forever to figure that out. An example is the set \(K = \{e \mid \phi_e(e)\downarrow\}\), where \(e \in K \iff\) the \(e\)-th program, run on the input \(e\), will halt. If the program halts, then we can confidently say that \(e \in K\), but otherwise we would need to wait forever. In automata theory, this is equivalent to saying that this set is <strong>Turing-recognizable</strong>.</p>
<p>The process of <strong>reduction</strong> is common in complexity theory and computability theory, as well as many other theoretical applications. Here, we say that \(A \leq_T B\) if we can figure out membership of \(A\) by a recursive program containing calls to membership of \(B\) as an <em>oracle</em>. Along with Turing reducibility, there is also \(m\)-reducibility, \(1\)-reducibility, etc. Turing reducibility, however, makes the most sense to us theoretical computer scientists. This idea allows us to bridge relationships between sets using a computer program, say, written in C, and indeed, Turing reducibility is one of the hotter topics in modern computability due to its interesting structure.</p>
<p>This time, we will make an ambitious attempt to solve the famous <strong>Post’s Problem</strong>, which asks if there are two r.e. sets that are not Turing reducible to each other. We will prove a weaker variant here that does not require these sets to be r.e., and then introduce the concept of the <strong>finite injury priority argument</strong> to solve it under computable conditions.</p>
<h1 id="some-setup">Some Setup</h1>
<p>Turing reducibility allows us to ask an arbitrary number of questions of the form \(x \in B\). What sets are computable, now that we have this oracle? If \(B\) is something boring like \(\mathbb{N}\) or \(\emptyset\), in which case the oracle would always output true or always output false, then intuitively it would not add any power to recursive programs. However, if \(B\) tells us whether \(\phi_e(x)\downarrow\), for any \(e, x\), then it gets a little more interesting.</p>
<h1 id="enumeration--normal-form-for-oracle-machines">Enumeration & Normal Form for Oracle Machines</h1>
<p>Recall our previous theorem from Kleene (<a href="https://oneraynyday.github.io/math/2019/02/06/Computability-Theory-Halting-Problem/#normal-form-and-enumeration-theorem">here</a>), which states the following:</p>
<ol>
<li>The \(T(e,x,y)\) predicate is true when the \(e\)-th program runs on input \(x\) and terminates with the transition state history \(y\). \(T\) is a computable relation.</li>
<li>\(U(y)\) is a computable function that, upon taking the transition state history, picks the final state, and returns the value. For example, if \(T(e,x,y)\), then \(U(y) = \phi_e(x)\).</li>
<li>We can <strong>enumerate all programs</strong>, \(\phi_1(x), \phi_2(x), ...\) and the indices that correspond to ill-formed programs simply diverge immediately on all inputs.</li>
</ol>
<p>We can extend this to our new <strong>oracle machines</strong>, which use not only \(S, Pd\) in their function calls, but also some \(g : \mathbb{N} \to \{0,1\}\), where \(g(x) = 1\) if \(x \in B\), and \(0\) otherwise (this is also called the <em>characteristic function</em> of the relation). Recall from the definition of \(g\) that it is total and its image is contained in the set \(\{0,1\}\). Once again, since we didn’t even specify the coding details for Kleene’s original theorem, I will only provide the intuition:</p>
<p>We originally mapped the descriptions of the computable functions to the natural numbers. A description of a function is essentially like the ASCII source of some Python code, and what we can do is simply add a new ASCII representation for \(g\). Intuitively, without this extension, our Python interpreter sees the following code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">do_something</span><span class="p">():</span>
    <span class="k">return</span> <span class="n">S</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="n">Pd</span><span class="p">(</span><span class="mi">2</span><span class="p">))))</span> <span class="c1"># Should return 4
</span><span class="k">def</span> <span class="nf">do_something_else</span><span class="p">():</span>
    <span class="k">return</span> <span class="n">g</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="mi">0</span><span class="p">)))</span> <span class="c1"># Uh oh! Our interpreter doesn't know what g is
</span></code></pre></div></div>
<p>And it fails with <code class="language-plaintext highlighter-rouge">NameError: name 'g' is not defined</code>. However, assume now that the Python standard library has a definition for <code class="language-plaintext highlighter-rouge">g</code> as an oracle, so we can simply call it.</p>
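<p>To play with this, we can fake the oracle by passing it in as an argument. This is just a sketch under generous assumptions: <code class="language-plaintext highlighter-rouge">make_oracle</code> is a hypothetical helper, and a real oracle set \(B\) need not be computable at all, whereas here it is a finite Python set. <code class="language-plaintext highlighter-rouge">Pd</code> is taken to be the usual predecessor with <code class="language-plaintext highlighter-rouge">Pd(0) = 0</code>.</p>

```python
def S(n):
    """Successor."""
    return n + 1

def Pd(n):
    """Predecessor, with Pd(0) = 0."""
    return max(n - 1, 0)

def make_oracle(B):
    """Characteristic function g of the set B: total, with values in {0, 1}."""
    return lambda x: 1 if x in B else 0

def do_something():
    return S(S(S(Pd(2))))  # 4, no oracle needed

def do_something_else(g):
    return g(S(S(0)))      # asks the oracle whether 2 is in B

print(do_something())                          # 4
print(do_something_else(make_oracle({2, 5})))  # 1
print(do_something_else(make_oracle(set())))   # 0
```

<p>The source text of <code class="language-plaintext highlighter-rouge">do_something_else</code> is the same finite string no matter which \(B\) we plug in. Only the answers change.</p>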
<h2 id="why-the-setup">Why The Setup?</h2>
<p>Now you may ask, “Ray, what does this have to do with trying to solve Post’s problem?”. I know, it seems a little off-topic that I would bring programming into this abstract concept. The point I’m trying to make is that <em>upon having access to \(g\), our programs don’t look much different. They can still be represented by finite length strings</em>. This means that the list of programs that can be created using our new functions is <strong>still countable</strong>. Recall how we used countability to reach a contradiction for the halting problem? <strong>We constructed a function, so it must’ve been some \(n\)-th function in the countable enumeration. But the definition of the function makes it different from every other function, so it can’t be the \(n\)-th function.</strong> Our construction of two sets that are not Turing-reducible to each other is similar, and the below is a quick tl;dr intuition:</p>
<p>We will construct two sets \(A,B\) incrementally. Let their characteristic functions \(g_A, g_B\) denote whether \(x \in A\), \(x \in B\) respectively. We know that programs using the characteristic functions \(g_A, g_B\) can be enumerated, so we enumerate the functions \(\phi^A_1,\phi^A_2,...\) and \(\phi^B_1, \phi^B_2,...\). We want to construct \(g_A\) and \(g_B\) such that for every index \(e \in \mathbb{N}\), either \(\phi^A_e(x) \uparrow \forall x \in \mathbb{N}\), or \(\exists x, \phi_e^A(x) \downarrow \neq g_B(x)\), and similarly for \(\phi_e^B\) and \(g_A\). <strong>This construction guarantees that the characteristic functions of both sets are not computable using the other’s characteristic function.</strong> Recalling what Turing reducibility means, this implies \(A \not\leq_T B \& B \not\leq_T A\). So let’s construct the functions \(g_A\) and \(g_B\)!</p>
<h1 id="basic-construction">Basic Construction</h1>
<p><em>“You will not find a single \(\mu\) operator or \(T\) predicate in modern computability research” - Kirill Gura (114C TA)</em></p>
<p>The way we construct \(g_A\) and \(g_B\) will be explained rigorously but in English and programming terms. We define \(g_A\) and \(g_B\) in stages:</p>
<p><strong>At stage 0 (Base):</strong></p>
<p>Case 1: \(\phi_0^A(x) \uparrow \forall x\), define \(g_B(0) = 0\). Obviously, \(g_B \neq \phi_0^A\), since \(g_B(0) \downarrow \& \phi_0^A(0) \uparrow\). We define \(x_0 = 0\).</p>
<p>Case 2: \(\exists x_0, \phi_0^A(x_0) \downarrow\), then if \(\phi_0^A(x_0) = 0 \implies g_B(x_0) = 1\), and \(\phi_0^A(x_0) \neq 0 \implies g_B(x_0) = 0\). Then similarly, \(g_B \neq \phi_0^A\), since \(g_B(x_0) \neq \phi_0^A(x_0)\). Since we defined \(g_B(x_0)\), we can also arbitrarily set \(g_B(y) = 0 \forall y < x_0\).</p>
<p>We perform the same procedure for \(g_A\) and \(\phi_0^B\), and if case 2 occurs for \(g_A\), we obtain a \(y_0\) similar to the \(x_0\) above. We now have an initial segment of \(g_A\) up to \(y_0\) and \(g_B\) up to \(x_0\) in the first step.</p>
<p><strong>At stage \(e+1\) (Inductive):</strong></p>
<p>Case 1: \(\phi_{e+1}^A(x) \uparrow \forall x > x_e\), then we simply define \(g_B(x_e + 1) = 0\), similar to the base case. \(g_B \neq \phi_{e+1}^A\), since \(g_B(x_e + 1) \downarrow \& \phi_{e+1}^A(x_e + 1) \uparrow\). Define \(x_{e+1} = x_e + 1\).</p>
<p>Case 2: \(\exists x_{e+1} > x_e, \phi_{e+1}^A(x_{e+1}) \downarrow\). Then similarly, \(\phi_{e+1}^A(x_{e+1}) = 0 \implies g_B(x_{e+1}) = 1\), and \(\phi_{e+1}^A(x_{e+1}) \neq 0 \implies g_B(x_{e+1}) = 0\), and so they must not be the same. We then set \(g_B(y) = 0 \forall x_e < y < x_{e+1}\).</p>
<p>We do the same procedure for \(g_A\) and \(\phi_{e+1}^B\). Now that we have defined the procedure above inductively, let us now prove by diagonalization that these two sets \(A,B\) are not Turing reducible to each other.</p>
<h1 id="diagonalization">Diagonalization</h1>
<p>Let us suppose for a moment that \(A \leq_T B\), that is, the membership function of \(A\) can be defined as a recursive program containing calls to membership of \(B\). Then, formally, \(g_A \in \textbf{R}(\mathbb{N}, 0, 1, S, Pd, g_B)\). By Kleene’s theorem, which states that we can enumerate the countably many functions computable by the oracle machine, \(g_A\) must then be \(\phi_e^B\) for some \(e \in \mathbb{N}\). <strong>However, let’s look at the \(e\)-th step of our inductive construction - we explicitly ensured that \(g_A \neq \phi_e^B\)!</strong> This is a contradiction, and the same argument applies to \(g_B\) and \(\phi_e^A\). Thus our constructed sets \(A, B\) are not Turing reducible to each other, so we’re done.</p>
<h1 id="however">However…</h1>
<p>The astute reader, at this point, would be frustrated - “I see you’ve constructed these two sets, but how can you say that they are r.e.?” Actually they’re not. The predicate \(\phi_e^A(x) \uparrow \forall x\) and its variants used during the inductive and base steps are not computable. This is literally asking “does \(\phi_e^A\) fail to halt on every input?”, which we can’t answer in finite time.</p>
<p>So why did I do all this? Because although our construction was not of r.e. sets, we showed that \(\leq_T\) is not a total order defined on all subsets of natural numbers. In addition, the proof we will introduce next time is an incremental improvement on the intuitions introduced this time. So stay tuned for the r.e. version!</p>
Computability Theory - Recursive Enumerable Sets2019-02-18T00:00:00+00:00http://oneraynyday.github.io/math/2019/02/18/Recursive-Enumerable-Sets<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#why-you-care-pmathbbn--aleph_1" id="markdown-toc-why-you-care-pmathbbn--aleph_1">Why You Care: $|P(\mathbb{N})| = 2^{\aleph_0}$</a></li>
</ul>
</li>
<li><a href="#semirecursive-relations" id="markdown-toc-semirecursive-relations">Semirecursive Relations</a></li>
<li><a href="#recursive-sets" id="markdown-toc-recursive-sets">Recursive Sets</a></li>
<li><a href="#recursive-enumerable-sets" id="markdown-toc-recursive-enumerable-sets">Recursive Enumerable Sets</a></li>
<li><a href="#reductions-and-re-completeness" id="markdown-toc-reductions-and-re-completeness">Reductions and r.e. completeness</a></li>
<li><a href="#posts-problem" id="markdown-toc-posts-problem">Post’s Problem</a></li>
</ul>
<p>From the set of natural numbers, \(\mathbb{N}\), we can generate a lot of subsets \(S \subset \mathbb{N}\). In fact, the collection of subsets of \(\mathbb{N}\) is so large that it’s uncountable. The study of subsets of \(\mathbb{N}\) is a complicated branch of mathematics and logic, and it serves to describe and answer questions in computer science on computability.</p>
<p>Note: We will use the following notations interchangeably.</p>
<ul>
<li>
\[\phi_e(x) \equiv \{e\}(x)\]
</li>
</ul>
<h2 id="why-you-care-pmathbbn--aleph_1">Why You Care: $|P(\mathbb{N})| = 2^{\aleph_0}$</h2>
<p>We prove this by stating a more powerful claim: <strong>there does not exist a bijection between a set and its powerset</strong>. For the set \(S = \emptyset\), it is trivial: \(|S| = 0 \neq |P(S)| = |\{\{\}\}| = 1\). So suppose \(S\) is non-empty and, for the sake of contradiction, that there is a bijection \(f : S \to P(S)\). Then, we look at the following set,
\(B = \{s \in S : s \not\in f(s)\} \subset S\)
This is obviously a well-defined set \(B\). It contains all elements \(s\) whose image \(f(s)\) does not contain \(s\) itself, i.e. \(s \not\in f(s)\). There must exist a \(b \in S\) such that \(f(b) = B\), since to be bijective, \(f\) must also be surjective (i.e. \(\forall X \in P(S), \exists x \in S, f(x) = X\)). Then, let’s analyze two cases:</p>
<ol>
<li>\(b \in B\). Then by definition, \(b \in f(b) \implies b \not\in B\). Contradiction.</li>
<li>\(b \not\in B\). Then by definition, \(b \not\in f(b) \implies b \in B\). Contradiction.</li>
</ol>
<p>Both cases give us a contradiction, so no such bijection can exist for any set \(S\). Furthermore, we can take the powerset of any uncountable set and reach a strictly larger uncountable cardinal.</p>
<p><strong>This is important because although \(\mathbb{N}\) may be countable, analyzing the subsets of \(\mathbb{N}\) can prove to be much more complicated, and yields surprising results.</strong> The fact that there are only countably many recursive functions that we can program in C, but uncountably many functions with arbitrary domain and range, means there is some unexplored complexity in computability. We aim to analyze this uncountable set, which turns out to have some interesting results, and we will eventually arrive at theorems of Tarski, Gödel, and Church.</p>
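<p>As a sanity check, the diagonal argument can be tested exhaustively on a small finite set. The following is a minimal sketch and not part of the proof: the helper names <code class="language-plaintext highlighter-rouge">powerset</code> and <code class="language-plaintext highlighter-rouge">diagonal_set</code> are made up for illustration, and a finite check says nothing about infinite sets on its own. It tries every function from \(S = \{0,1,2\}\) to \(P(S)\) and confirms that the diagonal set \(B\) is never in the image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations, product

def powerset(elements):
    """Return every subset of `elements` as a frozenset."""
    items = list(elements)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def diagonal_set(f, domain):
    """B = {s in S : s not in f(s)}, the set used in the proof."""
    return frozenset(s for s in domain if s not in f[s])

S = [0, 1, 2]
subsets = powerset(S)  # 8 subsets

# Try every function f from S to P(S), represented as a dict (512 in total)
missed_every_time = True
for images in product(subsets, repeat=len(S)):
    f = dict(zip(S, images))
    if diagonal_set(f, S) in f.values():
        missed_every_time = False

print(missed_every_time)  # True: no f ever hits its own diagonal set
</code></pre></div></div>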
<h1 id="semirecursive-relations">Semirecursive Relations</h1>
<p>Before we dive in, we should revisit the <em>Halting Problem</em> as explored in the previous post. For a given program, we want to know whether it terminates. Intuitively, we would be able to say “yes” eventually if the program does indeed terminate, but we would never be able to say “no”, since it requires infinite time for us to wait before we answer.</p>
<p>This is a <strong>semirecursive relation</strong>: it asks “would the program with code \(\hat{f}\) halt on input \(x\)?” for a given program. To put it more formally,
\(R(x) \iff f(x)\downarrow, \quad f\text{ is a recursive partial function}\)</p>
<p>Additionally, it is equivalent to asking “does there exist an output from the program?”, which is:
\(R(x) \iff \exists y P(x,y), \quad P\text{ is recursive}\)
Basically, semirecursive relations are ones where a positive answer can always be confirmed in finite time, while a negative answer may never be confirmed. We also call these relations \(\Sigma_1^0\) relations.</p>
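<p>In programming terms, a semirecursive relation is an unbounded search for a witness. Here is a minimal sketch, where <code class="language-plaintext highlighter-rouge">semidecide</code> and the relation <code class="language-plaintext highlighter-rouge">is_square_root</code> are hypothetical names I picked for the demo:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import count

def semidecide(P, x):
    """Search y = 0, 1, 2, ... and return the first witness with P(x, y)."""
    for y in count():
        if P(x, y):
            return y  # we can only ever answer "yes"

# Hypothetical decidable relation: y is a square root of x
is_square_root = lambda x, y: y * y == x

print(semidecide(is_square_root, 49))  # 7
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">semidecide(is_square_root, 50)</code> would never return, which is exactly the missing “no” answer.</p>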
<p>Some important closure properties of semirecursive relations:</p>
<ol>
<li><strong>Disjunction:</strong> \((\exists y) Q(x, y) \lor (\exists y)R(x,y) \iff (\exists u)[Q(x, u) \lor R(x,u)]\)</li>
<li><strong>Conjunction:</strong> \((\exists y)Q(x,y) \wedge (\exists y)R(x,y) \iff (\exists u)[Q(x,(u)_0) \wedge R(x,(u)_1)]\)</li>
<li><strong>Existential Quantifiers:</strong> \((\exists z)(\exists y)Q(x,y,z) \iff (\exists u)[Q(x, (u)_0, (u)_1)]\)</li>
<li><strong>Bounded Universal Quantifiers</strong>: \((\forall i \leq z)(\exists y)Q(x,y,i) \iff (\exists u)[Q(x, (u)_i, i) \forall i \leq z]\)</li>
<li><strong>Recursive Substitution</strong>: \(Q(x) \iff P(f_1(x),...,f_m(x))\), where \(P\) is semirecursive and \(f_1,...,f_m\) are total recursive functions</li>
</ol>
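<p>The conjunction rule is the one that surprised me the most, so here is a sketch of how two existential searches fold into one. This is an illustration under my own assumptions: the generator <code class="language-plaintext highlighter-rouge">pairs</code> stands in for decoding \((u)_0, (u)_1\) from a single number \(u\), and the relations <code class="language-plaintext highlighter-rouge">Q</code> and <code class="language-plaintext highlighter-rouge">R</code> are toy examples.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import count

def pairs():
    """Yield every pair (a, b) of naturals, one diagonal at a time."""
    for total in count():
        for a in range(total + 1):
            yield a, total - a

def search_both(Q, R, x):
    """Single unbounded search for witnesses of Q(x, a) and R(x, b)."""
    for a, b in pairs():
        if Q(x, a) and R(x, b):
            return a, b  # never returns if either witness is missing

# Hypothetical decidable relations used only for this demo
Q = lambda x, y: y * y == x   # y is a square root of x
R = lambda x, y: y + y == x   # y is half of x

print(search_both(Q, R, 16))  # (4, 8)
</code></pre></div></div>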
<p>However, note that it is <strong>not closed under unbounded universal quantifiers</strong>. Intuitively, if you had to make sure for every number, some relation is true, you’d have to run your program infinitely many times, without termination. Meanwhile, for a semirecursive relation, it must give us a positive answer in finite time. Consequently, it must also not be closed under <strong>negation</strong>, i.e.</p>
\[H(e,x) \iff (\exists y)T(e,x,y)\]
<p>is the halting relation, which is semirecursive, but its negation \(\lnot H(e,x) \iff (\forall y)[\lnot T(e,x,y)]\) is not.</p>
<h1 id="recursive-sets">Recursive Sets</h1>
<p>One way to analyze the high complexity of subsets of \(\mathbb{N}\) is to see what kind of sets we can identify with algorithms. Recall the <strong>Church-Turing thesis</strong> states that algorithms and recursive partial functions are equivalent. By definition, a set \(A\) is <strong>recursive</strong> if a single total recursive function can answer both “yes, \(x \in A\)” and “no, \(x \not\in A\)”. Then, we can just ask from 0 onwards whether an element is in the set. For example,</p>
<p>Q: “Is 0 in A?”, A: <em>runs some algorithm</em>… Yes</p>
<p>Q: “Is 1 in A?”, A: <em>runs some algorithm</em>… No</p>
<p>Q: “Is 2 in A?”, A: <em>runs some algorithm</em>… Yes</p>
<p>…</p>
<p>Then \(A = \{0,2,...\}\) for this specific example.</p>
<h1 id="recursive-enumerable-sets">Recursive Enumerable Sets</h1>
<p>A recursively enumerable set is one whose elements can be listed by an algorithm. The definition is stated rigorously as follows:</p>
<p><strong>A set \(A \subset \mathbb{N}\) is recursively enumerable (r.e.) if \(A = \emptyset\) or \(A = \{f(0),f(1),f(2),f(3),...\}\), where \(f : \mathbb{N} \to \mathbb{N}\) is a total recursive function.</strong> Similarly, a <strong>set \(A\) is co-recursively enumerable (co-r.e.) if \(A^c\) is r.e.</strong></p>
<p>We will prove some properties about them below:</p>
<ol>
<li><strong>\(A\) is r.e. \(\implies A = \{x \mid g(x)\downarrow\}\) .</strong> What this means is that \(A\) is the domain of some partial recursive function \(g\). We know \(A\) is enumerated by the total function \(f\), so \(\forall x, \exists y, T(\hat{f}, x, y)\) by Kleene’s Normal Form theorem. Then, we define the partial function \(g(y) := \mu_z [ T(\hat{f}, (z)_0, (z)_1) \& U((z)_1) = y]\). By definition, \(g\) converges on \(y\) only if \(y\) is the result of some computation of \(f\) whose input is \((z)_0\) and whose computation is coded by \((z)_1\). We search for both unknowns at once by using Gödel coding to dovetail.</li>
<li><strong>We can further require \(f\) to be injective if \(A\) is infinite.</strong> We define \(B = \{ z \in \mathbb{N} \mid T(\hat{f}, (z)_0, (z)_1) \& \forall n < z,[T(\hat{f}, (n)_0, (n)_1) \implies U((z)_1) \neq U((n)_1)]\}\). This set \(B\) contains the coded computations of \(f\) whose output has not been produced by any smaller coded computation, so each output of \(f\) is represented exactly once. \(B\) is recursively enumerated by some total recursive function \(g\), since we can write a function that repeatedly applies \(\mu_z\) to that giant predicate describing the set. We then simply define \(\bar{f}(n) = U((g(n))_1)\), and \(\bar{f}\) enumerates the outputs of the original function \(f\) injectively.</li>
<li><strong>The relation \(x \in A\) is semirecursive</strong>. This is true since \(x \in A \iff \exists n [ f(n) = x ]\). Since \(f\) is recursive, this is by definition a semirecursive relation. This has some implications. Given \(A,B\) recursively enumerable,
<ol>
<li>\(A \cup B\) is also r.e.</li>
<li>\(A \cap B\) is also r.e.</li>
<li>\(A^c\) is not always r.e. - it is <strong>co-recursively enumerable</strong>. If \(A\) and \(A^c\) are both recursively enumerable, then \(A\) is recursive.</li>
<li>\(f[A]\) and \(f^{-1}[A]\), i.e. the image and inverse image of \(A\) under a recursive function \(f\), are r.e.</li>
</ol>
</li>
</ol>
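<p>The dovetailing trick deserves a concrete picture. The sketch below runs in the opposite direction of property 1: given a partial function, it lists its domain by trying every (input, step budget) pair. Everything here is an assumption for illustration: programs are modelled as Python generators that yield once per computation step, and <code class="language-plaintext highlighter-rouge">g</code> is a toy partial function that halts only on even inputs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import count

def g(x):
    """Toy partial function: halts after x steps if x is even, else loops."""
    steps = 0
    while x % 2 == 1 or steps != x:
        steps += 1
        yield  # one computation step

def halts_within(program, x, n):
    """Run program(x) for at most n steps and report whether it halted."""
    run = program(x)
    for _ in range(n):
        try:
            next(run)
        except StopIteration:
            return True
    return False

def enumerate_domain(program, how_many):
    """Dovetail over (input, budget) pairs and list the inputs that halt."""
    found = []
    for total in count():
        for x in range(total + 1):
            if x not in found and halts_within(program, x, total - x):
                found.append(x)
                if len(found) == how_many:
                    return found

print(enumerate_domain(g, 3))  # [0, 2, 4]
</code></pre></div></div>
<p>No single run is ever allowed to hog the search, so the odd inputs that loop forever never block the even ones from being listed.</p>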
<p>For sake of brevity, I’ll list out a few properties without proofs (which can be found online):</p>
<ol>
<li>An infinite set is recursive iff it can be enumerated by a total recursive, strictly increasing \(f : \mathbb{N} \to \mathbb{N}\) .</li>
<li><strong>The halting set</strong> defined as \(H' = \{x \mid H((x)_0, (x)_1)\}\) <em>is recursively enumerable, but not recursive</em>. Recall \(H(e,x)\) is true if the partial function coded by \(e\), taking in the input \(x\), converges, i.e. \(\phi_e(x) \downarrow\). (Think about what it would imply if \(H\) was recursive)</li>
<li><strong>Post’s Diagonal</strong> is defined as \(K = \{x \mid \phi_x(x) \downarrow\}\). This is slightly more elegant than the halting set, but it is also recursively enumerable but not recursive.</li>
</ol>
<p>From what we learned so far, we can classify the sets into three classes:</p>
<ol>
<li>Recursively enumerable sets, we denote these \(\Sigma^0_1\).</li>
<li>Co-recursively enumerable sets, we denote these \(\Pi^0_1\).</li>
<li>Recursive sets, we denote these \(\Delta^0_1\).</li>
</ol>
<p>Then we have constructed the first level of a hierarchy of sets, where the intersection of the recursively enumerable sets and the co-recursively enumerable sets is exactly the class of recursive sets. Here’s an illustration of what I mean:</p>
<p><img src="http://oneraynyday.github.io/assets/arithmetic_hierarchy.jpg" alt="arithmetic hierarchy" /></p>
<p>We will explore what the sets are in \(\Sigma^0_2\) and beyond later.</p>
<h1 id="reductions-and-re-completeness">Reductions and r.e. completeness</h1>
<p>In algorithms courses offered in CS, where one briefly visits the NP class of problems, the topic of <strong>NP complete</strong> problems shows up. An NP complete problem is a problem in NP to which every other problem in NP can be reduced. What is a reduction? It means that we can use a solver for one problem in order to solve another. A quick example: we can reduce the Clique problem to the Independent Set problem by complementing the edge set of the input graph; if the complement graph has an independent set of size $K$, then we have a $K$-clique in our original graph. Recall that, for it to be a reduction in that sense, we cannot increase our input size exponentially or call the reduction an exponential number of times.</p>
<p>What do reductions and r.e. completeness mean then? Well, basically the same, but applied to r.e. sets. Formally, a <strong>reduction</strong> from a set \(A\) to \(B\) is such that we can ask questions about membership in \(B\) to determine membership in \(A\). There is a hierarchy of “ease” of reduction by the following classifications:</p>
<ol>
<li>The reduction is \(A \leq_T B \iff \mathcal{X}_A \in \mathcal{R}(\textbf{N}_0, \mathcal{X}_B)\). In other words, the characteristic function of \(x \in A\) can be composed as a recursive (total) function which includes \(\mathcal{X}_B, S, Pd\) on \(\mathbb{N}\). We call this <strong>Turing reducibility</strong>.</li>
<li>The reduction is \(A \leq_m B\) if there is a total recursive function \(f\) such that \(x \in A \iff f(x) \in B\) for all \(x\). We call this <strong>many-one reducibility</strong>.</li>
<li>The reduction is \(A \leq_1 B\) if \(f\) in 2) is injective.</li>
<li>The reduction is \(A \equiv B\) if \(f\) in 2) is bijective.</li>
</ol>
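<p>To make the \(\leq_m\) and \(\leq_1\) cases less abstract, here is a toy reduction between two decidable sets. The sets and function names are my own choices for the example (\(A\) is the even numbers, \(B\) is the multiples of 4), so nothing deep is happening; the point is only the shape of the argument.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def decide_B(n):
    """Stand-in oracle for B = multiples of 4 (assumed decidable here)."""
    return n % 4 == 0

def f(x):
    """Injective total function with: x in A iff f(x) in B."""
    return 2 * x

def decide_A(x):
    """A = even numbers, decided only by asking about membership in B."""
    return decide_B(f(x))

print([x for x in range(8) if decide_A(x)])  # [0, 2, 4, 6]
</code></pre></div></div>
<p>Since <code class="language-plaintext highlighter-rouge">f</code> is injective, this particular example is a 1-reduction as well as a many-one reduction.</p>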
<p>An <strong>r.e. complete set, denoted \(A\), has the property that</strong> every r.e. set is 1-reducible to it, formally:</p>
\[B \text{ is r.e.} \implies B \leq_1 A\]
<p>Intuitively, \(A\) must be at least as big as any r.e. set, otherwise the injection won’t work. Then obviously \(A\) is infinite. For example, Post’s diagonal \(K = \{e \mid \phi_e(e)\downarrow\}\) is r.e. complete. Here’s a proof.</p>
<p>An arbitrary r.e. set \(B\) is the domain of some recursive partial function \(f\), so \(x \in B \iff f(x) \downarrow\). Then, we define \(g(x,y) = f(x)\), so that it is independent of \(y\). Then there exists some code for \(g\), call it \(\hat{g}\), so that \(g(x,y) = \{\hat{g}\}(x,y)\). By the \(S^m_n\) theorem, \(\{S^1_1(\hat{g}, x)\}(y) = f(x) \forall y\). Since any \(y\) works, we may pick \(y = S^1_1(\hat{g}, x)\) and ask whether \(\{S^1_1(\hat{g}, x)\}(S^1_1(\hat{g}, x))\) converges, i.e. whether \(S^1_1(\hat{g}, x)\) is in \(K\). In other words:</p>
\[x \in B \iff f(x)\downarrow \iff g(x,y)\downarrow \iff \{S^1_1(\hat{g},x)\}(S^1_1(\hat{g},x))\downarrow \iff h(x) = S^1_1(\hat{g}, x) \in K\]
<p>The \(S^m_n\) theorem states that \(S^1_1\) is injective, so \(h\) is injective and therefore \(B \leq_1 K\). Since \(B\) is arbitrary, \(K\) must be r.e. complete.</p>
<h1 id="posts-problem">Post’s Problem</h1>
<p>Emil Post, one of the founders of computability theory, asked a question that could not be solved until more than a decade later. The question asks the following: <strong>do there exist r.e. sets $A$ and $B$ that are not Turing reducible to each other?</strong></p>
<p>The answer is yes. However, before we show this, we must dig further to find properties of recursively enumerable sets.</p>
Computability Theory - On the Halting Problem2019-02-06T00:00:00+00:00http://oneraynyday.github.io/math/2019/02/06/Computability-Theory-Halting-Problem<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#recap-of-ackermann-and-mu-recursive" id="markdown-toc-recap-of-ackermann-and-mu-recursive">Recap of Ackermann and $\mu$-recursive</a></li>
<li><a href="#partial-algebras" id="markdown-toc-partial-algebras">Partial Algebras</a> <ul>
<li><a href="#expansion-to-recursion" id="markdown-toc-expansion-to-recursion">Expansion to Recursion</a></li>
<li><a href="#compilers" id="markdown-toc-compilers">Compilers</a></li>
</ul>
</li>
<li><a href="#normal-form-and-enumeration-theorem" id="markdown-toc-normal-form-and-enumeration-theorem">Normal Form and Enumeration Theorem</a> <ul>
<li><a href="#sm_n-theorem" id="markdown-toc-sm_n-theorem">$S^m_n$ Theorem</a></li>
</ul>
</li>
<li><a href="#the-halting-problem" id="markdown-toc-the-halting-problem">The Halting Problem</a></li>
<li><a href="#why-do-you-care" id="markdown-toc-why-do-you-care">Why Do You Care?</a></li>
</ul>
<h1 id="recap-of-ackermann-and-mu-recursive">Recap of Ackermann and $\mu$-recursive</h1>
<p>Recall that our previous result was that $A(n,x)$, the Ackermann function, is not primitive recursive, which means it’s not in $\mathcal{R}_p$, but it is a recursive function. It turns out that it is in the class of $\mu$-recursive functions. This $\mu$ is the “least” operator, and $\mathcal{R}_\mu$ is the set of functions that not only has all the closure properties of the primitive recursive functions but is also closed under the $\mu$ operator, which is defined as the following:
\((\mu i\geq y)R(i,\vec{x}) = \text{least i} \geq y \text{, such that } R(i,\vec{x}) \text{ holds.}\)
where $R(i, \vec{x})$ is a relation. This means that for all $j$ with $y \leq j < i$, $\mathcal{X}_R(j, \vec{x})$, the characteristic function of $R$, is defined and equal to 0. We will later see that $\mu$-recursive is equivalent to the set of all possible recursive functions, and we will finally relate it to the halting problem.</p>
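<p>If the $\mu$ operator looks scary, it is really just a while loop that may never exit. A minimal sketch, where the function name <code class="language-plaintext highlighter-rouge">mu</code> and the divisibility relation are my own picks for the demo:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import count

def mu(R, y, *xs):
    """Least i at or above y with R(i, *xs); loops forever if none exists."""
    for i in count(y):
        if R(i, *xs):
            return i

# Hypothetical relation: i divides x
divides = lambda i, x: x % i == 0

print(mu(divides, 2, 91))  # 7, the least divisor of 91 starting from 2
</code></pre></div></div>
<p>The unbounded loop is exactly what primitive recursion lacks: a primitive recursive function must know its loop bound up front.</p>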
<h1 id="partial-algebras">Partial Algebras</h1>
<p>When I say “algebra”, I don’t mean “solve for $x$ in the equation $3x + 2 = 0$”. <em>It is something much more general - it is the study of mathematical symbols and their corresponding rules</em>. When we talk about the numbers $0,1,2,…$, and the functions $+, -, \times, \div$, we are thinking in a specific algebra. The algebra that we are thinking of is defined by:</p>
\[\textbf{Q}_1 = (\mathbb{Q}, 0, 1, +, -, \times, \div) \qquad \text{(0, 1 can be omitted, but kept traditionally)}\]
<p>By definition, every function in this algebra must map rationals to a rational. The operators $+, -, \times$ can be applied to any numbers in $\mathbb{Q}$. However, $\div$ cannot, since division by zero is undefined. This is an example of a <strong><em>partial function</em></strong>, as in only part of the domain of $\div$ can be evaluated. We denote this $\div : \mathbb{Q}^2 \rightharpoonup \mathbb{Q}$. A normal function that maps every element in its domain to some value is called a <strong>total function</strong> in this context. A <strong>partial algebra</strong> is just an algebra that allows us to have partial functions. We denote undefined-ness as divergence with $\uparrow$, for example $\div(x,0) = \uparrow$. In addition to functions and numbers, we also have <em>variables</em> in algebra. We can have any arbitrary $v_1,v_2,…$ as symbols in our expressions. To make equations, we have $=$ as a valid symbol as well. To allow branching conditionals, we have $\text{if, then, else}$ as symbols. To allow for multiple arguments in a function call, we have $,$ to separate arguments.</p>
<p><strong>Examples</strong>:</p>
<ul>
<li>$\div\text{if}3))$</li>
<li>$3,23\times$</li>
<li>$\text{if }(\frac{3}{2}=0)\text{ then }23 \text{ else }2$</li>
</ul>
<p>Intuitively, the first two don’t really “evaluate” to anything, and the last one evaluates to $2$.</p>
<p>In computer science, we can usually express most of our operations using the simple partial algebra:</p>
<p>$\textbf{N}_0 = (\mathbb{N}, 0, 1, S, Pd), \ S(x) = x + 1, \ Pd(x) = \begin{cases}x - 1 & \text{if } x \geq 1 \\ 0 & \text{else}\end{cases}$</p>
<h2 id="expansion-to-recursion">Expansion to Recursion</h2>
<p>So far, we are not able to perform recursion in our algebra, so it’s not very interesting. We define the notion of <strong>recursive variables</strong>, or <strong>function variables</strong> $p^n_i$, where $p^n_i$ is the $i$-th function variable that takes in $n$ arguments. What does it mean to be a function variable? It’s any variable that satisfies a system of equations, just like normal variables $v_i$ that could take values in $\mathbb{N}$. To make this more concrete, consider the example:</p>
\[p(x) = p(x+1)\]
<p>This is a recursive equation, but there are many different functions that could possibly satisfy this. In particular, the total functions that satisfy it are exactly the constant functions: $f(x) = c \in \mathbb{N}, \forall x \in \mathbb{N}$. In addition, the partial function $\epsilon(x) \uparrow \forall x \in \mathbb{N}$ will also satisfy the equation (vacuously).</p>
<p>Then, our expanded algebra, denoted $R(\textbf{N}_0)$, will contain expressions of the form $p^n_m(x_1,…,x_n)$ for any $m, n \in \mathbb{N}$.</p>
<p>This will allow us to essentially formulate recursive equations, and thus it can compute primitive recursive functions, as well as the Ackermann function:</p>
\[p_0(n,x) =\text{ if }(n = 0)\text{ then }S(x)\text{ else }p_1(n,x) \\
p_1(n,x) =\text{ if }(x = 0)\text{ then }p_0(Pd(n), 1)\text{ else }p_0(Pd(n), p_0(n, Pd(x)))\]
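<p>The two mutually recursive equations for the Ackermann function translate almost symbol for symbol into Python. This is a sketch using only the primitives of $\textbf{N}_0$; the names <code class="language-plaintext highlighter-rouge">p0</code> and <code class="language-plaintext highlighter-rouge">p1</code> mirror $p_0$ and $p_1$, and Python’s recursion limit means it only works for tiny inputs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def S(x):
    """Successor."""
    return x + 1

def Pd(x):
    """Predecessor, with Pd(0) = 0."""
    return max(x - 1, 0)

def p0(n, x):
    return S(x) if n == 0 else p1(n, x)

def p1(n, x):
    if x == 0:
        return p0(Pd(n), 1)
    return p0(Pd(n), p0(n, Pd(x)))

print(p0(1, 2))  # 4
print(p0(2, 3))  # 9
</code></pre></div></div>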
<h2 id="compilers">Compilers</h2>
<p>Now, how can we express this system of equations as a <strong>computation</strong>? How can we actually compute this function when we plug in values into the arguments? We have a model that will illustrate this. A <strong>transition system</strong> is a triple $\mathcal{T} = (S, \to, T)$ where $S$ is the set of states, $\to$ is a binary relation on $S$, and $T \subseteq S$ is the set of terminal states. When thought of as a directed graph, once a path starting from some node $s \in S$ reaches some $t \in T$, we terminate the traversal. Consequently, $t \not\to s \forall t \in T, s \in S$. This is also a <em>deterministic transition system</em>, so that if $t \to s$ then $s$ is unique. Here’s an intuitive view:</p>
<p><img src="http://oneraynyday.github.io/assets/transition_system.jpg" alt="transition system" /></p>
<p>The details of this transition system can be found elsewhere, as it’s not the central point of the blog. The main take-away is that for some program</p>
\[E \equiv \begin{cases}p_0(\vec{x}) = E_0\\p_1(\vec{x}) = E_1\\...\\p_n(\vec{x}) = E_n \end{cases}\]
<p>the program, with $\vec{x}$ substituted with actual values, can be loaded into this transition system, and the last state $t \in T$ will encode the result of the computation, which is some natural number. Sometimes, the program will run into an infinite loop, like when
\(E \equiv p_0(x) = p_0(x)\)
Then we won’t reach a terminal state - we have entered a cycle in our directed graph $\mathcal{T}$.</p>
<p>In fact, just by observation, this looks like a programming language, and underneath the hood it is one. It resembles functional languages like Haskell and OCaml more than C or Python. Since it’s a programming language, our ability to parse $R(\textbf{N}_0)$ in a compiler allows us to express our compilable expressions as 0’s and 1’s. The proof that we are able to do this is quite long and is based on 5-10 pages of Gödel codings, which are essentially injective mappings of a tuple to a number, $(x_1,…,x_m) \mapsto \prod_{i=1}^m p_i^{x_i+1} \in \mathbb{N}$, where $p_i$ is the $i$-th prime. This is the unique prime factorization of some number and thus the tuples’ codings must be unique. We can associate symbols with some unique tuple, which then maps to some natural number. We then nest tuple codings to map programs injectively into $\mathbb{N}$, and these numbers can be expressed in binary.</p>
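<p>A deterministic transition system is easy to simulate. The sketch below is a toy, not the actual system used in the proofs: states are plain Python values, the hypothetical <code class="language-plaintext highlighter-rouge">run</code> helper follows the arrow relation, and a step budget (<code class="language-plaintext highlighter-rouge">fuel</code>) stands in for our patience, since we cannot detect divergence in general.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def run(step, is_terminal, state, fuel):
    """Follow the arrow relation from `state`, giving up after `fuel` steps."""
    for _ in range(fuel):
        if is_terminal(state):
            return state
        state = step(state)
    return None  # no terminal state reached within the budget

# Addition on N_0: move one unit from b to a until b is 0
add_step = lambda s: (s[0] + 1, max(s[1] - 1, 0))
add_done = lambda s: s[1] == 0
print(run(add_step, add_done, (3, 4), 100))  # (7, 0)

# The looping program p_0(x) = p_0(x): every state steps to itself
print(run(lambda s: s, lambda s: False, 5, 100))  # None
</code></pre></div></div>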
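<p>Here is what the prime-power coding of a tuple looks like in practice. This is a small sketch under my own naming (<code class="language-plaintext highlighter-rouge">primes</code>, <code class="language-plaintext highlighter-rouge">encode</code>, <code class="language-plaintext highlighter-rouge">decode</code> are not from any library), with the $i$-th entry stored as the exponent of the $i$-th prime, plus one. The decoder assumes its input really is the code of some tuple.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def primes():
    """Yield 2, 3, 5, ... by trial division (fine for short tuples)."""
    found = []
    candidate = 2
    while True:
        if all(candidate % p != 0 for p in found):
            found.append(candidate)
            yield candidate
        candidate += 1

def encode(xs):
    """Code the tuple xs as a product of prime powers."""
    code = 1
    for p, x in zip(primes(), xs):
        code *= p ** (x + 1)
    return code

def decode(code):
    """Recover the tuple by reading off the exponents."""
    xs = []
    for p in primes():
        if code == 1:
            return tuple(xs)
        exponent = 0
        while code % p == 0:
            code //= p
            exponent += 1
        xs.append(exponent - 1)

print(encode((2, 0, 1)))  # 600, which is 2**3 * 3**1 * 5**2
print(decode(600))        # (2, 0, 1)
</code></pre></div></div>
<p>The plus one in the exponent is what lets us tell $(2, 0)$ apart from $(2)$: every position leaves a visible prime factor.</p>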
<p>The compiler gives us some code, and our machine can take the input and return to us the result of the computation (if it doesn’t run into an infinite loop). Mathematically, we define the function $\phi(e,x)$ where $e$ is the coding for some program that calculates some partial system of equation(s). We will present the results of the code-ability of the program below.</p>
<h1 id="normal-form-and-enumeration-theorem">Normal Form and Enumeration Theorem</h1>
<p>(A part of) Kleene’s normal form theorem states that there exists a primitive recursive function $U(y)$ and a primitive recursive relation $T_n(e,x_1,…,x_n,y)$ such that a partial function $f(x_1,…,x_n)$ is recursive <strong>if and only if there is some number e</strong> (the code of f) such that:</p>
\[f(\vec{x}) = U(\mu y T_n(e,\vec{x},y)) =: \phi_e(\vec{x})\]
<p>One can read $T_n$ as the relation “$e$ is a program, $\vec{x}$ is the input, and $y$ codes the sequence of states the transition system passes through until it hits a terminal state”, and one can read $U(y)$ as “$y$ codes the sequence of states of a finished computation, and $U$ retrieves the numerical value that is the output of the program”.</p>
<p>Recall the $\mu$ operator. This theorem states that, with two primitive recursive functions and a $\mu$ operator, we are able to create any recursive partial function from its coding, feed in the inputs, get the states of computation from the transition system, and recover the output of the function. Then, obviously, all recursive functions must be in $\mathcal{R}_\mu$. One can thus enumerate the recursive partial functions like:</p>
\[\phi_1(x),\phi_2(x),...\]
<p>where if $e$ is not a program, $\phi_e(x) = \uparrow \forall x \in \mathbb{N}$. (It diverges everywhere, i.e. its domain is empty.)</p>
<p>The above result shows that the set of recursive partial functions is countable (note at bottom*).</p>
<h2 id="sm_n-theorem">$S^m_n$ Theorem</h2>
<p>The \(S^m_n\) theorem states that there exist functions \(S^m_n\) that “hardcode” inputs into a function and return a new function code that behaves as if those inputs were hardcoded. Concretely:</p>
<ol>
<li>
\[\forall e, y, \bar{z} = z_1,...,z_m, \bar{x} = x_1,...,x_n\]
</li>
<li>
\[U(\mu y T_{m+n}(e,\bar{z}, \bar{x}, y)) = U(\mu yT_n(S^m_n(e,\bar{z}), \bar{x}, y))\]
</li>
<li>\(S^m_n\) is injective for all \(m, n\)</li>
</ol>
<p>The proofs of these theorems are difficult to explain without a formal construction of the transition system and pages and pages of proofs, so just take them at face value. They’re <em>very</em> powerful, surprisingly. We will use the normal form’s enumerability to prove the halting problem is not recursive.</p>
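<p>One way to build intuition for \(S^m_n\) is to treat Python source text as the “code” of a program. The sketch below is only an analogy: <code class="language-plaintext highlighter-rouge">s11</code> is a hypothetical stand-in for \(S^1_1\) that works on strings instead of Gödel numbers, and it uses <code class="language-plaintext highlighter-rouge">eval</code> purely for the demo.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def s11(source, z):
    """Return the source of a one-argument program with z hardcoded."""
    return "lambda x: (" + source + ")(" + repr(z) + ", x)"

code = "lambda z, x: z * 10 + x"   # a two-argument program, as text
specialised = s11(code, 4)         # a new program text, nothing is run yet

print(specialised)           # lambda x: (lambda z, x: z * 10 + x)(4, x)
print(eval(specialised)(2))  # 42
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">s11</code> never runs the program it is given. It is pure text manipulation, which is why the real \(S^m_n\) can be total even when the programs it transforms diverge.</p>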
<h1 id="the-halting-problem">The Halting Problem</h1>
<p>Suppose we have the halting relation $H(e,x)$ defined as:</p>
\[X_H(e,\vec{x}) = \begin{cases}1 \quad\text{if } \phi_e(\vec{x})\downarrow \\ 0 \quad \text{else}\end{cases}\]
<p>In other words, the relation tells us whether the program coded as $e$, run on the input $\vec{x}$, will return a result, or will diverge from endless looping or because the input was not in the domain of some partial function in the middle of the computation. Then we have the following result: <strong>The halting relation is not recursive</strong>.</p>
<p>Why is this true? Suppose we have an enumeration of our programs $\phi_1(\vec{x}),\phi_2(\vec{x})…$, then if $X_H(e,\vec{x})$ is recursive, then we can make a recursive function that uses it in its composition. Then let’s define one such function:</p>
\[f(x) = \begin{cases}\phi_x(x) + 1 \quad \text{if } H(x,x) \\0 \quad \text{else}\end{cases}\]
<p>This function is obviously total, and assumed to be recursive. Here’s an illustration of what \(f\) would yield on the diagonal of this program/input matrix:</p>
<p><img src="http://oneraynyday.github.io/assets/halting_problem.jpg" alt="halting problem" /></p>
<p>Then, $f$ can be coded up by a recursive program, which has some code $e$. What does this mean? We can feed $f$ ‘s program code, $e$, into $f$ itself, and what do we get?</p>
\[f(e) = \phi_e(e) + 1 = f(e) + 1\]
<p>A contradiction! Therefore $H(e,x)$ cannot be recursive, since everything else was dependent on it. This was all because we could <em>diagonalize</em> on all possible functions, and make this $f(x)$ different from every $g \in \mathcal{R}_\mu$. <strong>If $f$ is recursive, then $f$ would be different from itself.</strong></p>
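<p>The diagonal trick itself can be played with on a finite table. The following toy uses a made-up list of four total programs in place of the real enumeration, so there is no halting oracle involved and no contradiction arises; it only shows why the diagonal function cannot sit anywhere in the list it was built from.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A made-up finite "enumeration" of total programs
programs = [lambda x: 0, lambda x: x, lambda x: x * x, lambda x: x + 7]

def f(x):
    """Diagonal function: differ from program x on input x."""
    return programs[x](x) + 1

diagonal = [programs[e](e) for e in range(len(programs))]
shifted = [f(e) for e in range(len(programs))]

print(diagonal)  # [0, 1, 4, 10]
print(shifted)   # [1, 2, 5, 11]
</code></pre></div></div>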
<h1 id="why-do-you-care">Why Do You Care?</h1>
<p>The <strong>Church-Turing</strong> thesis states that <strong>a function is computable if and only if it is recursive</strong>. Combined with the result above, that means there are specific math problems and/or philosophical problems that no algorithm can solve. The expressivity of our recursive functions is very limited compared to the whole space of functions, and some age old questions are simply beyond the realm of our current model of mathematics to answer.</p>
<hr />
<p>Note: Technically, this is an injection into $\mathbb{N}$, and to prove that the set is countably infinite we would need to show that there exists an injection from $\mathbb{N}$ to $\mathcal{R}_\mu$ and apply the Schröder-Bernstein theorem, or construct a bijective mapping into $\mathbb{N}$ instead. But since the set of recursive functions is obviously infinite, and it has to be at most countable, it’s countable.</p>
<script src="https://utteranc.es/client.js" repo="OneRaynyDay/oneraynyday.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async=""> </script>
Computability Theory - Primitive Recursive Functions2019-02-03T00:00:00+00:00http://oneraynyday.github.io/math/2019/02/03/Computability-Theory-Introduction<p>I haven’t been writing blogposts as often as I’d like to, but mostly for the reason that I don’t have a lot of time to goof around. Now that I have some free time, Tiff reminded me that I should write something. I will write more about this subject when I have the chance.</p>
<h1 id="what-is-computability-theory">What is Computability Theory?</h1>
<p><em>Computability theory</em> is a branch of mathematics that attempts to analyze and categorize problems into ones that are decidable or computable under some given constrained resources. These resources map fairly directly to real-life scenarios, e.g. memory or time. I’ve had the privilege of learning from Yiannis N. Moschovakis and Amit Sahai, who are both very knowledgeable in this field. Here I attempt to condense some of the material into digestible readings for those who want to get a brief introduction. It does require a bit of mathematical background as I will be assuming that you know basic proofs by induction and some notation.</p>
<h1 id="recursion-and-primitive-recursive-functions">Recursion and Primitive Recursive Functions</h1>
<p>It is intuitive enough that in order to find out what our computers can do, we start with induction and recursion. For loops and otherwise complicated programs have some type of implicit recursion. It turns out that the <strong>principle of mathematical induction</strong> and the <strong>basic recursion lemma</strong> are almost synonymous (one can prove induction given recursion, and vice versa).</p>
<p>The <strong>basic recursion lemma</strong> (a result of Kleene’s second recursion theorem) is stated as follows: For all sets $X,W$, and any given functions $g:X\to W, h:W \times \mathbb{N} \times X \to W$, there exists a <strong>unique</strong> function $f:\mathbb{N} \times X \to W$ such that:</p>
\[f(0,x) = g(x), \\ f(n+1,x) = h(f(n,x), n, x)\]
<p>Basically, $f$ can be recursively defined by some “generating function” $h$ and some “base function” $g$. <strong>All of the interesting functions we can compute on our computers are recursive in nature.</strong></p>
<p>There is a specific class of recursive functions, called <strong>primitive recursive</strong>, denoted as $\mathcal{R}_p$. Roughly speaking, it is the set of functions that are defined by:</p>
<ol>
<li>Constant functions are in $\mathcal{R}_p, C^n_q(x_1,…,x_n) = q$</li>
<li>The projection functions are in $\mathcal{R}_p, P^n_i(x_1,…,x_i,…,x_n) = x_i$.</li>
<li>The successor function is in $\mathcal{R}_p, S(x) = x + 1$.</li>
<li>Any composition of functions in $\mathcal{R}_p$ is also in the set.</li>
<li>If $g, h \in \mathcal{R}_p$, and $f$ is defined as: \(f(0,x) = g(x), \\ f(y+1, x) = h(f(y,x),y,x)\), then $f \in \mathcal{R}_p$.</li>
</ol>
<p>We can build addition, multiplication, exponentiation, if-statements, loops, etc from this set of functions. Most of the functions we use in computer science can be expressed as some $f \in \mathcal{R}_p$.</p>
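<p>As a rough sketch of how rule (5) generates familiar arithmetic, the snippet below models primitive recursion as a Python combinator and builds addition, multiplication and exponentiation on top of the successor function. The names <code class="language-plaintext highlighter-rouge">prim_rec</code>, <code class="language-plaintext highlighter-rouge">add</code>, <code class="language-plaintext highlighter-rouge">mul</code> and <code class="language-plaintext highlighter-rouge">power</code> are my own choices for illustration, and the loop is just one way to unroll the recursion.</p>

```python
def succ(x):
    """Successor function S(x) = x + 1."""
    return x + 1


def prim_rec(g, h):
    """Rule (5): return f with f(0, x) = g(x) and f(y + 1, x) = h(f(y, x), y, x)."""
    def f(y, x):
        acc = g(x)                # f(0, x)
        for i in range(y):        # unroll the recursion bottom-up
            acc = h(acc, i, x)    # f(i + 1, x) = h(f(i, x), i, x)
        return acc
    return f


# add(y, x) = x + y: apply the successor y times, starting from x.
add = prim_rec(lambda x: x, lambda prev, y, x: succ(prev))
# mul(y, x) = x * y: add x to the running total y times, starting from 0.
mul = prim_rec(lambda x: 0, lambda prev, y, x: add(x, prev))
# power(y, x) = x ** y: multiply the running product by x, y times, starting from 1.
power = prim_rec(lambda x: 1, lambda prev, y, x: mul(x, prev))
```

<p>Each function only uses the one defined before it, which mirrors how the set is closed under composition and recursion.</p>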
<p>An example of a recursive function that is <strong>not primitively recursive</strong> is the <strong>Ackermann function</strong>, which we will discuss in detail later:</p>
\[A(0,x) = x + 1, \\ A(n+1, 0) = A(n,1), \\ A(n+1, x+1) = A(n, A(n+1, x))\]
<p>It is not clear at first that $A$ has a unique solution; the proof is done via <em>double induction</em>. We first suppose there exists $f, g$ which are solutions to $A$ for some inductively increasing subdomain, and then we show that they must be equal.</p>
<ul>
<li><strong>Sub-induction</strong>: $A(0, 0) = 1$. Then $f(0,0) = g(0,0) = 1$. And it is clear that $\forall x, f(0,x+1) = g(0,x+1) = x+2$, so they are unique.</li>
<li><strong>Base</strong>: Suppose $\forall x, A(n, x)$ has a unique solution $\implies A(n+1, 0)$ unique as well. This is true since $A(n+1, 0) = A(n, 1)$, which we know is unique, so $f(n+1,0) = g(n+1,0) = f(n,1)$.</li>
<li><strong>Inductive</strong>: Suppose $\forall n \geq 1, \forall x, A(n-1,x)$ has a unique solution, as well as $A(n, k)$ by the inductive hypothesis. Then, $A(n, k+1) = A(n-1, A(n, k)) = f(n-1, f(n, k)) = g(n-1, g(n, k))$ must be unique.</li>
</ul>
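<p>For concreteness, here is a minimal sketch of the three defining equations in Python. The memoization and the raised recursion limit are my additions to keep small inputs tractable; they are not part of the definition, and anything beyond tiny arguments will still blow up.</p>

```python
import sys
from functools import lru_cache

sys.setrecursionlimit(100_000)  # the recursion gets deep even for small inputs


@lru_cache(maxsize=None)
def ackermann(n, x):
    if n == 0:
        return x + 1                               # A(0, x) = x + 1
    if x == 0:
        return ackermann(n - 1, 1)                 # A(n+1, 0) = A(n, 1)
    return ackermann(n - 1, ackermann(n, x - 1))   # A(n+1, x+1) = A(n, A(n+1, x))


# The first few values of the section A_2(x) = A(2, x).
section_2 = [ackermann(2, x) for x in range(4)]
```

<p>Fixing the first argument gives the sections used in the next part of the post; for example <code class="language-plaintext highlighter-rouge">section_2</code> follows the pattern $2x + 3$.</p>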
<h1 id="why-ackermann-isnt-primitive-recursive">Why Ackermann isn’t Primitive Recursive</h1>
<p>Intuitively, if you fix Ackermann’s sections, $A_n(x) := A(n,x)$, you can inspect the values to see that they grow extremely quickly. <em>One can think of the growth like the following: the 0-th section is successor, 1st section is addition, 2nd section is multiplication, 3rd section is exponentiation, 4th section is hyperexponentiation, etc.</em> The growth from each section to the next is so fast that no common function gives a useful big $O$ bound.</p>
<p><img src="http://oneraynyday.github.io/assets/ackermann_proof.jpg" alt="ackermann_proof" /></p>
<p>In order to prove $A \not\in \mathcal{R}_p$, we take an arbitrary function $f \in \mathcal{R}_p$, and show that $f < A_n$ for some $n \in \mathbb{N}$. This rough sketch on growth shows that every primitive recursive $f$ is bounded by some section of $A$, so $A$ cannot be in $\mathcal{R}_p$. I opted for this to be easier to digest, so the trivial claims I make without proof are marked with $*$. One claim we will make is the following:</p>
<p><strong>The nested Ackermann call is bounded by another Ackermann section, i.e. $A_n(A_n(x)) < A_{n+2}(x)$</strong>.</p>
<ul>
<li>
<p><strong>Sub-induction:</strong> $A_0(A_0(x)) = x + 2 < A_2(x) =^* 2x+3 \forall x \in \mathbb{N}$. A simple induction proof shows the latter equality.</p>
</li>
<li>
<p><strong>Base:</strong> Suppose $A_n(A_n(x)) < A_{n+2}(x)$. Then:</p>
\[A_{n+3}(0) \\ = A_{n+2}(1) \\ = A_{n+1}(A_{n+2}(0)) \\ = A_{n+1}(A_{n+1}(1)) > A_{n+1}(A_{n+1}(0))\]
</li>
<li>
<p><strong>Induction</strong>: Suppose $A_n(A_n(x)) < A_{n+2}(x) \forall x$ and $A_{n+1}(A_{n+1}(k)) < A_{n+3}(k)$, then $A_{n+1}(A_{n+1}(k+1)) < A_{n+3}(k+1)$ since</p>
\[A_{n+3}(k+1) = A_{n+2}(A_{n+3}(k)) \\ > A_{n+2}(A_{n+1}(A_{n+1}(k))) \\ > A_{n+2}(A_n(A_{n+1}(k))) \\ = A_{n+2}(A_{n+1}(k+1)) >^* A_{n+1}(A_{n+1}(k+1))\]
</li>
</ul>
<p>This result will be useful for later.</p>
<p>To start off, our inductive hypothesis will be that for any $f \in \mathcal{R}_p$, $f(x_1,…,x_m) < A_n(max\{x_1,…,x_m\})$ for some $n \in \mathbb{N}$.</p>
<p>If $f$ is the <strong>successor function</strong>, then $f(x) = A_0(x) <^* A_1(x)$. If $f$ is the <strong>constant function</strong> that returns $q \in \mathbb{N}$, then $f(x_1,…,x_m) = q <^* A_q(0) \leq A_q(max\{x_1,…,x_m\})$. If $f$ is the <strong>projection function</strong>, then $f(x_1,…,x_n) = x_i \leq A_0(max \{x_1,…,x_n\})$.</p>
<p>We have established the base cases. For more interesting functions, suppose $f$ is a <strong>composition of functions</strong>, i.e. \(f(x_1,...,x_n) = h(g_1(x_1,...,x_n),...,g_m(x_1,...,x_n))\). By the inductive hypothesis we can assume $g_1,…,g_m,h$ are each bounded by some $A_k(max\{x_1,…,x_n\})$ (just take the max $k$ over all of the functions). Then, writing $x := max\{x_1,…,x_n\}$, the composition is bounded by $A_k(A_k(x)) < A_{k+2}(x)$, using the claim above.</p>
<p>Finally, for some $f(n,\bar{x})$ defined like (5) in the <strong>primitive recursive</strong> section above, it is slightly trickier. By the inductive hypothesis, we can assume $g, h < A_{k-1}$. Then we claim the following: $f(n,\bar{x}) < A_k(n+max\{\bar{x}\})$. We denote $x := max\{\bar{x}\}$ for the proof.</p>
<ul>
<li><strong>Base:</strong> $f(0,\bar{x}) = g(\bar{x}) < A_{k-1}(x) < A_k(x) = A_k(x+0)$ as given.</li>
<li><strong>Induction</strong>: Suppose $f(n,\bar{x}) < A_k(x+n)$. Then $f(n+1, \bar{x}) = h(f(n,\bar{x}), n, \bar{x}) < A_{k-1}(max\{f(n,\bar{x}), n, x\})$ by the bound on $h$. Since $A_k(x+n) > x+n \ \forall k \in \mathbb{N}$, every argument of $h$ is dominated by $A_k(x+n)$, so $f(n+1, \bar{x}) < A_{k-1}(A_k(x+n)) = A_k(x+n+1)$.</li>
</ul>
<p>Take $z = max\{x, n\} = max\{x_1,…,x_m,n\}$, then $f(n,\bar{x}) < A_k(x+n) \leq A_k(2z) < A_k(2z+1) = A_k(A_2(z-1)) < A_k(A_{k+1}(z-1)) = A_{k+1}(z)$. And so we have that $f(n,\bar{x})$ is bounded by another Ackermann function section.</p>
<p>The above is a sufficient proof to show that $f \in \mathcal{R}_p \implies \exists k \in \mathbb{N}, f < A_k$. <strong>Now, suppose $A$ is primitive recursive. Then $h(n, x) = S(A(n,x)) = A(n,x) + 1$ must also be primitive recursive, so there must exist some $k$ such that $h < A_k$. Evaluating at $n = x = k$ gives $A(k,k) + 1 < A_k(k) = A(k,k)$, which is absurd and concludes our proof.</strong></p>
<p>It seems intuitive that if something grows faster than all of the primitive recursive functions, then it cannot be primitive recursive. It grows so fast that at $A(4,2)$, it returns an integer with 19729 digits.</p>
<h2 id="quick-aside-mu-recursiveness">Quick Aside: \(\mu\)-recursiveness</h2>
<p>Then what can we characterize the Ackermann function as? We can, technically, write a (long) while loop and search for the next value of Ackermann. This (potentially neverending) while loop is technically a \(\mu\) operator, which gives us the <em>least solution that satisfies some conditions</em>. This is vague, and I will expand on it in the next blogpost. We call any function that can be written with arbitrary while loops and primitive recursive building blocks \(\mu\)-recursive, and Ackermann is one such function.</p>
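<p>To make the "while loop as \(\mu\) operator" idea a bit more tangible, here is a small sketch. The helper names <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">isqrt_via_mu</code> are hypothetical, and the integer square root is just a convenient example of an unbounded search that happens to terminate.</p>

```python
def mu(predicate):
    """Return the least n >= 0 with predicate(n) true.

    If no such n exists, this loops forever, which is exactly why
    mu-recursive functions may be partial.
    """
    n = 0
    while not predicate(n):
        n += 1
    return n


def isqrt_via_mu(x):
    # The least n with (n + 1)^2 > x is floor(sqrt(x)).
    return mu(lambda n: (n + 1) * (n + 1) > x)
```

<p>Unlike the bounded loop hidden inside primitive recursion, nothing here tells us in advance how many iterations the search will take.</p>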
<p><img src="http://oneraynyday.github.io/assets/functions_hierarchy.jpg" alt="function hierarchy" /></p>
<h1 id="what-else-is-out-there">What else is out there?</h1>
<p>Though Ackermann is not something we can practically compute for large inputs, it is still, in theory, “computable” by a recursive program. Then, I guess that raises the question - <em>What can’t our computers compute?</em> The results we will explore don’t only influence computer science, but also answer* questions about philosophy, abstract mathematics, and many other things in our world.</p>
<p>*: <em>By answer, I don’t mean that the answer to the universe and to life is 42, but rather, a concrete answer detailing why “there’s no way we will ever find out”.</em></p>
<script src="https://utteranc.es/client.js" repo="OneRaynyDay/oneraynyday.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async=""> </script>
Computer Architecture - Introduction2018-10-02T00:00:00+00:00http://oneraynyday.github.io/dev/2018/10/02/M151B<h1 id="manufacturing-process-of-a-cpu">Manufacturing Process of a CPU</h1>
<ul>
<li>Silicon ingot is extracted to the purest form (any impurities will lead to a defect)</li>
<li>Ingot is sliced up into wafers (thin disks)</li>
<li>Wafers are then patterned and etched, with layers of material deposited on top to form the circuitry of the chip</li>
<li>The wafer is then sliced up into individual CPU dies</li>
<li>Bond the die to the package, and ship to customers</li>
</ul>
<p>Usually, the <strong>yield</strong> of a manufacturing process is the proportion of working dies per wafer (not ingot). This does not necessarily only pertain to a CPU, but also any <em>integrated circuit</em>. This is the reason why ASICs are so expensive.</p>
<h1 id="execution-time-and-throughput">Execution Time and Throughput</h1>
<p><strong>Execution time</strong> is how long it takes to do a task, and <strong>throughput</strong> is how much work is done per unit time. Throughput is roughly the <em>bandwidth</em> of a processor, and response time is specific to a task.</p>
<p>For example, if you have multiple cores, the execution time for a task that’s single threaded or is difficult to parallelize effectively will not change, but the throughput of the multi-core chip will be higher than one that is single-core.</p>
<p>So how do we measure <em>performance</em>? Let us define <em>performance</em> by:</p>
\[P(x) = \frac{1}{T(x)}\]
<p>for some task $x$; that is, performance is inversely proportional to the execution time. For example, if it took 10s to run a task $x$ on some machine $A$ and 15s on $B$, then one can say $A$ is 1.5x faster than $B$, because:</p>
\[\frac{P^A(x)}{P^B(x)} = \frac{T^B(x)}{T^A(x)} = \frac{15}{10} = 1.5\]
<h2 id="measuring-execution-time">Measuring Execution Time</h2>
<ul>
<li><strong>Elapsed time</strong>: total response time that can be measured by a clock</li>
<li><strong>CPU time</strong>: Time spent only on the CPU for the given task.
<ul>
<li>Does not account for I/O and OS context switching for other tasks to run, etc</li>
</ul>
</li>
</ul>
<p>How can we modify <strong>CPU time</strong>? This requires us to be familiar with a couple of terms:</p>
<ul>
<li><strong>Clock frequency</strong>: Cycles of a clock per second (e.g. 4.0GHz)</li>
<li><strong>Clock period</strong>: Duration of a single clock cycle (e.g. 0.25ns), denoted as $P$.</li>
</ul>
<p>Suppose $T$ is the CPU time, and $C$ is number of cycles required for some task:</p>
\[T = CP\]
<p>What can we do to decrease the CPU time? Well, we can either decrease the number of cycles or the duration of each period. Decreasing the number of cycles could be done using vectorized instructions (doing multiple operations at once, though <em>this is not necessarily the whole story</em>) or simply devising a better algorithm. To decrease the period, one can <strong>overclock</strong> a processor to run at a higher frequency (and thus a lower period).</p>
<p>So far, we have assumed that each instruction costs one cycle. We had some $C$ that we needed to lower. In reality, <em>there may be multiple cycles required for a single instruction</em>. Suppose the procedure uses $N$ distinct instructions, where the $k$-th instruction is used $N_k$ times and requires $I_k$ clock cycles each time. Then $C$ can be computed as:</p>
\[C = \sum_{k=1}^N I_k N_k\]
<p>So now our vectorized instruction assumption isn’t necessarily true. If we can vectorize $K$ instructions, we don’t necessarily get a $K$x speedup, since the vectorized instruction may cost on average more clock cycles as well.</p>
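<p>A quick worked example may help here. The sketch below plugs made-up numbers into $T = CP$ with $C = \sum_k I_k N_k$; the function name <code class="language-plaintext highlighter-rouge">cpu_time</code>, the 4GHz clock and the cycle counts are all assumptions for illustration, not measurements.</p>

```python
def cpu_time(instruction_mix, clock_hz):
    """CPU time T = C * P, where instruction_mix is a list of (N_k, I_k) pairs.

    N_k is how many times instruction k runs and I_k is its cost in cycles.
    """
    cycles = sum(n_k * i_k for n_k, i_k in instruction_mix)  # C
    period = 1.0 / clock_hz                                  # P
    return cycles * period


# Hypothetical numbers: 4 scalar adds at 1 cycle each,
# versus a single vector add that handles all 4 lanes but costs 2 cycles.
scalar = cpu_time([(4, 1)], clock_hz=4e9)
vector = cpu_time([(1, 2)], clock_hz=4e9)
speedup = scalar / vector
```

<p>With these assumed costs, vectorizing 4 instructions yields only a 2x speedup rather than 4x, because the vector instruction itself takes more cycles.</p>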
<h2 id="amdahls-law">Amdahl’s Law</h2>
<p>Improving algorithms leads to an issue observed by Amdahl’s law:</p>
\[T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}\]
<p>This is simply saying that whatever you’re trying to optimize, you’ll never get below $T_{unaffected}$, or in other words, $T_{improved} \geq T_{unaffected}$. So the corollary or moral of the story of this law is that <em>if you want to optimize algorithms, make the common case fast.</em></p>
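<p>Here is a tiny numeric sketch of the formula. The function name <code class="language-plaintext highlighter-rouge">amdahl_time</code> and the 80s/20s split are hypothetical values picked only to show the floor at $T_{unaffected}$.</p>

```python
def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """Improved running time according to Amdahl's law."""
    return t_affected / improvement_factor + t_unaffected


# Hypothetical task: 80s of it can be sped up, 20s cannot.
t_5x = amdahl_time(80.0, 20.0, 5.0)     # 80 / 5 + 20
t_huge = amdahl_time(80.0, 20.0, 1e9)   # approaches 20s but never drops below it
```

<p>Even an absurd improvement factor leaves the total just above 20s, so the overall speedup in this example can never reach 5x.</p>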
<h1 id="performance-factors">Performance Factors</h1>
<ul>
<li><strong>Algorithm & compilers</strong>: Will definitely affect instruction counts, so will affect $N_k$.</li>
<li><strong>ISA</strong>: Will affect $I_k$, $N_k$, and $P$.
<ul>
<li>For example, <code class="language-plaintext highlighter-rouge">CISC</code> vs. <code class="language-plaintext highlighter-rouge">RISC</code> ISAs will tend to have higher and lower $P$’s respectively.</li>
</ul>
</li>
</ul>
<h1 id="power-trends">Power Trends</h1>
\[\text{Power} = \text{Capacitive load} * \text{voltage}^2 * \text{frequency}\]
<h2 id="the-power-wall">The “power wall”</h2>
<p>The power wall is a phenomenon where our current chip designs have marginal improvements on <em>voltage reduction and heat reduction</em>. How else can we reduce power? If we have denser and denser chips, it will be harder for us to cool the corresponding chips that are drawing in so much power.</p>
<p>Clock rate and power have historically grown together, but lately we have switched to a multi-core processor design, which means we are keeping the clock rate roughly the same while adding more cores, each of which is optimized for lower power consumption. This is how one can keep the power consumption roughly the same but still increase performance. However, now the definition of performance is slightly different… How can we still evaluate it?</p>
<p>One type of benchmark measures elapsed time over multiple programs, normalizes each runtime against a reference machine, and takes the geometric mean of the resulting ratios. An example is <strong>SPEC CPU2006</strong>.</p>
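<p>As a rough sketch of how such a summary score could be computed, the snippet below takes a geometric mean over per-program ratios. The ratios are invented for illustration and <code class="language-plaintext highlighter-rouge">geometric_mean</code> is a hand-rolled helper, not a benchmark suite's actual tooling.</p>

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-program speed ratios (reference time / measured time).
ratios = [2.0, 8.0]
score = geometric_mean(ratios)  # sqrt(2 * 8)
```

<p>One reason to prefer the geometric mean for ratios is that the ranking of two machines does not depend on which machine is chosen as the reference.</p>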
<h1 id="instruction-set-architecture">Instruction Set Architecture</h1>
<h2 id="factors-to-consider-when-creating-an-isa">Factors to consider when creating an ISA</h2>
<ul>
<li>Operations:
<ul>
<li>How many should we have?</li>
<li>Which operations should we include?</li>
<li>How many bytes does it operate on? (Length)</li>
</ul>
</li>
<li>Operands
<ul>
<li>How many operands do we need? (2 sources, 1 destination = 3)</li>
<li>Are operands in memory, registers, constants?</li>
<li>What data types are allowed? (integer, float, etc)</li>
<li>How many bits are required to describe an operand?</li>
</ul>
</li>
<li>Instruction format
<ul>
<li>How big is the instruction itself? (What’s the least number of bytes we can use to describe an instruction?)</li>
</ul>
</li>
</ul>
<h2 id="risc-vs-cisc">RISC vs. CISC</h2>
<p>The two main ISA classes are <strong>RISC</strong> and <strong>CISC</strong>, which stand for <strong>Reduced Instruction Set Computers</strong> and <strong>Complex Instruction Set Computers</strong> respectively.</p>
<ul>
<li><strong>CISC</strong> (Intel x86)
<ul>
<li>Large # of instructions</li>
<li>Many specialized complex instructions</li>
<li>Programs fit compactly into text memory, since fewer instructions are needed</li>
</ul>
</li>
<li><strong>RISC</strong> (ARM, MIPS, etc)
<ul>
<li>Relatively fewer instructions
<ul>
<li>Simple instruction set means pipelining and parallelism is easier.</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>The line between RISC and CISC is somewhat blurry, since Intel processors actually decode their instructions into “micro-ops”, which are RISC-like.</p>
<h1 id="mips">MIPS</h1>
<p>MIPS is a RISC ISA, and it has the following:</p>
<ul>
<li>32 registers, each of which is 32 bits. (numbered from 0-31)
<ul>
<li>Each 32-bit data in each register is called a “word”</li>
<li><code class="language-plaintext highlighter-rouge">$t0, $t1, ... $t9</code> for temporary values. (regs 8-15, then 24-25)</li>
<li><code class="language-plaintext highlighter-rouge">$s0, $s1, ... $s7</code> for saved registers. (regs 16-23)</li>
</ul>
</li>
<li>Memory is byte addressed (each address = 8-bit byte location)</li>
<li>MIPS is Big Endian (MSB, most significant byte, is at the least address of a word). For illustration purposes:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-------- High address (0xffffffff)
| stack
| |
| v
| ^
| |
| heap
| uninitialized data
| initialized static data
| text
-------- Low address (0x00000000)
</code></pre></div></div>
<p>If the most significant byte is at the lowest address, then an array of two 32-bit unsigned integers (128 and 256) on the stack is laid out byte by byte like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 0x00 0x00 0x00 0x80 = 128,   0x00 0x00 0x01 0x00 = 256 ]
  low            high          low            high
</code></pre></div></div>
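<p>To see the byte order concretely, here is a small sketch using Python's built-in <code class="language-plaintext highlighter-rouge">int.to_bytes</code>. It is not MIPS code, just a way to print how 128 and 256 would be laid out as 32-bit words under each convention.</p>

```python
values = [128, 256]

# Big-endian: the most significant byte sits at the lowest address (leftmost here).
big = [v.to_bytes(4, byteorder="big").hex(" ") for v in values]

# Little-endian, for contrast: the least significant byte comes first.
little = [v.to_bytes(4, byteorder="little").hex(" ") for v in values]
```

<p>Reading <code class="language-plaintext highlighter-rouge">big</code> left to right corresponds to walking from the low address to the high address of each word.</p>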
<hr />
<p>For example, let’s translate the following C code:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">f</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">;</span>
<span class="c1">// Some initialization</span>
<span class="n">f</span> <span class="o">=</span> <span class="p">(</span><span class="n">g</span> <span class="o">+</span> <span class="n">h</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">);</span> <span class="c1">// we focus on this line!</span>
</code></pre></div></div>
<p>Into MIPS instructions:</p>
<pre><code class="language-assembly"># $s0 = f, $s1 = g, $s2 = h, $s3 = i, $s4 = j.
add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1
</code></pre>
<h2 id="immediate-operands">Immediate Operands</h2>
<p>Immediate operands are literal constants. For example:</p>
<pre><code class="language-assembly">addi $s3, $s3, 4 # s3 += 4;
addi $s2, $s1, -1 # No subi, so s2 = s1 - 1;
</code></pre>
<p>Most of the time, <strong>our immediate operands are small constants, so we can truncate the size of a number in our instruction from 32 bits to something smaller</strong>.</p>
<p>In particular, 0 has its own register that always holds its value, the <code class="language-plaintext highlighter-rouge">$zero</code> register. This makes the ISA simpler at the cost of one general-purpose register.</p>
<h2 id="conditional-operations">Conditional Operations</h2>
<p>Conditional operations make up the <code class="language-plaintext highlighter-rouge">if</code> statements in C. Here are some examples of conditional operators:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">beq rs, rt, L1</code> is equivalent to <code class="language-plaintext highlighter-rouge">if(rs == rt) goto L1;</code></li>
<li><code class="language-plaintext highlighter-rouge">bne rs, rt, L1</code> is equivalent to <code class="language-plaintext highlighter-rouge">if(rs != rt) goto L1;</code></li>
<li><code class="language-plaintext highlighter-rouge">j L1</code> is equivalent to <code class="language-plaintext highlighter-rouge">goto L1;</code></li>
<li><code class="language-plaintext highlighter-rouge">slt rd, rs, rt</code> is equivalent to <code class="language-plaintext highlighter-rouge">if(rs < rt) rd = 1; else rd = 0;</code>
<ul>
<li>Use <code class="language-plaintext highlighter-rouge">sltu</code> if you’re comparing unsigned numbers.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">slti rt, rs, constant</code> is equivalent to <code class="language-plaintext highlighter-rouge">if(rs < constant) rt = 1; else rt = 0;</code>
<ul>
<li>Use <code class="language-plaintext highlighter-rouge">sltiu</code> if you’re using unsigned numbers.</li>
</ul>
</li>
</ul>
<p>As expected, branches and jumps change the program counter register, and therefore it is hard to optimize across a block of code that contains them.</p>
<p>A <strong>basic block</strong> is a sequence of instructions with no branching inside it, and the compiler can optimize these blocks much more easily.</p>
<h2 id="instruction-formats">Instruction Formats</h2>
<h3 id="r-format">R-format</h3>
<p>In memory, R-format looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[op (6 bits),
rs (5 bits),
rt (5 bits),
rd (5 bits),
shamt (5 bits),
funct (6 bits)]
</code></pre></div></div>
<ul>
<li>The opcode (<code class="language-plaintext highlighter-rouge">op</code>) indicates which instruction format we are dealing with. (For example, all R-format instructions will have the same op code)</li>
<li><code class="language-plaintext highlighter-rouge">rs, rt, rd</code> are the first source, second source, and destination register numbers respectively.</li>
<li><code class="language-plaintext highlighter-rouge">shamt</code> is how many shifts to perform. This is used for shift instructions only.</li>
<li><code class="language-plaintext highlighter-rouge">funct</code> describes what function to perform (for example, <code class="language-plaintext highlighter-rouge">add</code>).</li>
</ul>
<p>For example:</p>
<pre><code class="language-assembly">add $t0, $s1, $s2
</code></pre>
<p>is translated into:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[0, # add is R-format, and op code for R is 0
17, # s1 is mapped to 17 as register number
18, # s2 is mapped to 18
8, # t0 is mapped to 8
0, # no shift is necessary; this isn't a shift command
32] # 32 = function id for add
= 0x02324020
</code></pre></div></div>
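<p>The packing above can be checked with a few shifts. This is a minimal sketch; <code class="language-plaintext highlighter-rouge">encode_r</code> is a hypothetical helper name, and it does no validation of field widths.</p>

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6/5/5/5/5/6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2  ->  op=0, rs=$s1=17, rt=$s2=18, rd=$t0=8, shamt=0, funct=32
word = encode_r(op=0, rs=17, rt=18, rd=8, shamt=0, funct=32)
as_hex = f"{word:#010x}"
```

<p>The shift amounts are simply the number of bits to the right of each field: 26, 21, 16, 11, 6 and 0.</p>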
<h3 id="i-format">I-format</h3>
<p>I stands for <strong>immediate</strong>, as in immediate operands.</p>
<p>In memory, I-format looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[op (6 bits),
rs (5 bits),
rt (5 bits),
constant/addr (16 bits)]
</code></pre></div></div>
<p>Same idea as R-format, except <code class="language-plaintext highlighter-rouge">constant/addr</code> is the last field, holding 16 bits. But 16 bits isn’t enough to hold all 32 bits of a word. <em>We will learn the details of how to represent 32 bits later, but tldr it requires 2 instructions.</em></p>
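<p>As with R-format, the I-format layout can be sketched with shifts and a mask. The helper <code class="language-plaintext highlighter-rouge">encode_i</code> is hypothetical; the example encodes the two <code class="language-plaintext highlighter-rouge">addi</code> instructions from the immediate operands section, using opcode 8 for <code class="language-plaintext highlighter-rouge">addi</code>.</p>

```python
def encode_i(op, rs, rt, imm):
    """Pack I-format fields (6/5/5/16 bits); imm is stored as 16-bit two's complement."""
    if not -(1 << 15) <= imm < (1 << 15):
        raise ValueError("immediate does not fit in 16 signed bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


ADDI = 8  # MIPS opcode for addi

w1 = encode_i(ADDI, rs=19, rt=19, imm=4)    # addi $s3, $s3, 4
w2 = encode_i(ADDI, rs=17, rt=18, imm=-1)   # addi $s2, $s1, -1
```

<p>The range check makes the 16-bit limitation explicit: a constant outside that range cannot be expressed in a single I-format instruction.</p>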
<script src="https://utteranc.es/client.js" repo="OneRaynyDay/oneraynyday.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async=""> </script>
Reinforcement Learning - Temporal Difference Learning (Q-Learning & SARSA)2018-09-30T00:00:00+00:00http://oneraynyday.github.io/ml/2018/09/30/Reinforcement-Learning-TD<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a> <ul>
<li><a href="#review-of-monte-carlo" id="markdown-toc-review-of-monte-carlo">Review of Monte Carlo</a></li>
</ul>
</li>
<li><a href="#temporal-difference-prediction" id="markdown-toc-temporal-difference-prediction">Temporal Difference Prediction</a> <ul>
<li><a href="#similarity-between-td-and-dp" id="markdown-toc-similarity-between-td-and-dp">Similarity between TD and DP</a></li>
<li><a href="#similarity-between-td-and-monte-carlo" id="markdown-toc-similarity-between-td-and-monte-carlo">Similarity between TD and Monte Carlo</a></li>
</ul>
</li>
<li><a href="#optimality-of-td0" id="markdown-toc-optimality-of-td0">Optimality of TD(0)</a></li>
<li><a href="#on-policy-sarsa" id="markdown-toc-on-policy-sarsa">On-policy: Sarsa</a></li>
<li><a href="#off-policy-q-learning" id="markdown-toc-off-policy-q-learning">Off-policy: Q-learning</a></li>
<li><a href="#example-cliff-walking" id="markdown-toc-example-cliff-walking">Example: Cliff Walking</a> <ul>
<li><a href="#sarsa-model" id="markdown-toc-sarsa-model">Sarsa Model</a></li>
<li><a href="#q-learning-model" id="markdown-toc-q-learning-model">Q-Learning Model</a></li>
<li><a href="#cliffwalking-maps" id="markdown-toc-cliffwalking-maps">Cliffwalking Maps</a></li>
<li><a href="#learning-curves" id="markdown-toc-learning-curves">Learning Curves</a></li>
</ul>
</li>
</ul>
<p>Temporal difference learning is one of the most central concepts in reinforcement learning. It is a combination of Monte Carlo ideas and dynamic programming, both of which we discussed previously.</p>
<h2 id="review-of-monte-carlo">Review of Monte Carlo</h2>
<p>In Monte Carlo, we have an update rule to capture the expected rewards $V$ under a given policy $\pi$. Usually, we simulate entire episodes and get the total reward, and attribute each action in the action sequence to the reward value. Recall we had two different ways of Monte Carlo simulation - <strong>every-visit</strong> and <strong>first-visit</strong>.</p>
<p>Given some state sequence $S_j = (s_0, s_1, …)$, and action sequence $A_j = (a_0, a_1, …)$ belonging to some episode $j$, <strong>first-visit MC</strong> will perform the following:</p>
\[V(s) = \frac{1}{N}\sum_{i:s \in S_i} G^i_{\min\{j \mid s_j = s\}}\]
<p>Here $G^i_t$ is the (possibly discounted) sum of subsequent rewards in episode $i$ after time $t$, and $N$ is the number of episodes in which $s$ appears. In other words, if a state occurs once, or twice, or fifty times throughout an episode, we only count the first occurrence of the state, and average those reward values to get $V(s)$. For <strong>every-visit MC</strong>, we have:</p>
\[\mathcal{G}^i_s = \sum_{j:\, s_j = s,\ s_j \in S_i} G^i_j \\
V(s) = \frac{1}{\#s} \sum_{i=1}^M \mathcal{G}^i_s\]
<p>where $\#s$ is the total number of times we saw state $s$ throughout all $M$ episodes. In other words, if a state occurs more than once, we add all of those occurrences into our value approximation and average all the reward values to get $V(s)$.</p>
<p>In many cases, $V(s)$ is stationary, and both methods (first-visit and every-visit) will converge to the true value under $\pi$. However, in a nonstationary environment, we do not want our “update magnitude” to approach 0. Below is an update rule suitable for nonstationary, every-visit MC (eq. 0):</p>
\[V(S_t) = V(S_t) + \alpha [ G_t - V(S_t) ]\]
<p>for some constant $\alpha > 0$.</p>
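<p>As a minimal sketch of eq. 0, the snippet below computes $G_t$ for a finished episode and applies the constant-$\alpha$ update to every visited state. The function names, the dictionary representation of $V$ and the two-step episode are assumptions made for illustration.</p>

```python
def returns(rewards, gamma):
    """Return G_t for every t, where rewards[t] holds R_{t+1}."""
    g, out = 0.0, []
    for r in reversed(rewards):   # accumulate from the end of the episode
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def mc_update(V, states, rewards, alpha, gamma):
    """Every-visit, constant-alpha Monte Carlo update for one finished episode."""
    for s, g in zip(states, returns(rewards, gamma)):
        v = V.get(s, 0.0)
        V[s] = v + alpha * (g - v)    # eq. 0
    return V


# Hypothetical episode: A -> B -> end, with rewards 0 and then 1.
V = mc_update({}, states=["A", "B"], rewards=[0.0, 1.0], alpha=0.5, gamma=1.0)
```

<p>Note that the update can only run after the episode ends, since every $G_t$ depends on all later rewards.</p>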
<h1 id="temporal-difference-prediction">Temporal Difference Prediction</h1>
<p>In the above equation, we use the quantity $G_t$, which is the rewards for the rest of the episode following time $t$, possibly discounted by some factor $\gamma$. This quantity <em>is only known once the episode ends. For TD, this is no longer necessary!</em></p>
<p>How do TD methods do this? They only need to look one step ahead, at the cost of an approximation that is slightly less precise and relies less on sampled rewards. Instead of $G_t$, they use:</p>
\[R_{t+1} + \gamma V(S_{t+1})\]
<p>Recall that the equation for $G_t$ is:</p>
\[G_t = \sum_{k=t+1} \gamma^{k-t-1}R_k = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\]
<p>And the equation for $V(S_t)$ is:</p>
\[V(S_t) = E[G_t]\]
<p>And so we’re sort of “approximating” the rest of the trajectories:</p>
\[V(S_{t+1}) = E[R_{t+2} + \gamma R_{t+3} + ...] \approx R_{t+2} + \gamma R_{t+3} + ...\]
<p>So yes, it is less <em>precise</em>, but in return, <em>we are able to learn in the middle of an episode</em>. The resulting update equation for $V(S_t)$ in the analog of $\text{eq (0)}$ is:</p>
\[V(S_t) = V(S_t) + \alpha [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) ]\]
<p>In this case, since we’re only looking one step ahead, it is called $TD(0)$. Here’s some pseudocode logic:</p>
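<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pi = init_pi()
V = init_V()  # maps each state to its estimated value
for i in range(NUM_ITER):
    s = init_state()
    episode = init_episode(s)
    while episode:  # until the episode terminates
        prev_s = s
        action = get_action(pi, prev_s)  # act according to pi in the current state
        reward, s = episode.run(action)
        # TD(0) update: bootstrap from the current estimate of the next state
        V[prev_s] += alpha * (reward + gamma * V[s] - V[prev_s])
</code></pre></div></div>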
<p>$TD(0)$ is considered a <em>bootstrapping</em> method because it uses previous approximations, rather than completely using the entire trajectory’s reward.</p>
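<p>For something that actually runs, here is a small sketch of a single TD(0) step applied to a hand-written episode. The helper <code class="language-plaintext highlighter-rouge">td0_update</code>, the <code class="language-plaintext highlighter-rouge">terminal</code> flag and the two-step episode are all assumptions for illustration; there is no environment or policy here, only the update itself.</p>

```python
from collections import defaultdict


def td0_update(V, s, reward, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V[s] toward reward + gamma * V[s_next]."""
    # The value after a terminal transition is defined to be 0.
    target = reward + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])


V = defaultdict(float)

# Hypothetical two-step episode: A -> B (reward 0), then B -> end (reward 1).
episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]

for _ in range(2):  # replay the same episode twice
    for s, r, s_next, done in episode:
        td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=done)
```

<p>On the first pass $V(A)$ stays at 0 because $V(B)$ is still 0 when <code class="language-plaintext highlighter-rouge">A</code> is updated; on the second pass it moves toward the newly learned $V(B)$. That delay is what bootstrapping looks like in practice.</p>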
<h4 id="similarity-between-td-and-dp">Similarity between TD and DP</h4>
<p>Recall that dynamic programming approaches to reinforcement learning problems use the Bellman backup operator. This allows efficient incremental computation of $V(s)$, and we previously proved that $\lim_{k \to \infty} v_k(s) = v_\pi(s)$. In this way, we are effectively getting more and more accurate approximations of $v$.</p>
<p>In a similar nature, <em>our approximation of $V(S_t)$ in the equation above is also iterative, and bootstraps off of the previous estimation</em>, because $V(S_t)$ as a function is not known.</p>
<h4 id="similarity-between-td-and-monte-carlo">Similarity between TD and Monte Carlo</h4>
<p>Recall that Monte Carlo requires us to estimate the expected value, because we’re taking samples of episodes and trying to approximate the underlying expected value from each action. In this way, TD acts like an every-visit MC due to its $V(S_t)$ being an approximation to the expected value of $G_t$.</p>
<p>Thus, TD and Monte Carlo both <em>approximate $V(S_t)$ via sampling because the expectation is not known.</em> However, Monte Carlo does not attempt to use previous estimations of $V$ in its convergence; instead, it takes the complete sample trajectory to estimate.</p>
<p><img src="http://oneraynyday.github.io/assets/tdvsmc.png" alt="tdvsmc" /></p>
<h1 id="optimality-of-td0">Optimality of TD(0)</h1>
<p>One might expect TD(0) to converge to the same value as Monte Carlo methods. However, this is <strong>not true</strong>. Sutton’s pathological example explains this very clearly:</p>
<table>
<thead>
<tr>
<th>Trials</th>
<th>Rewards</th>
</tr>
</thead>
<tbody>
<tr>
<td>A, B</td>
<td>0, 0</td>
</tr>
<tr>
<td>B</td>
<td>0</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Across these 8 episodes, <code class="language-plaintext highlighter-rouge">B</code> resulted in a reward of 1 in 6 of them and a reward of 0 in the other 2. Therefore, we can safely say that $V(B) = \frac{6}{8} = \frac{3}{4}$.</p>
<p>We can make 2 “reasonable” suggestions about <code class="language-plaintext highlighter-rouge">A</code>:</p>
<ol>
<li>When <code class="language-plaintext highlighter-rouge">A</code> occurred, the resulting accumulated reward $G$ is 0. Therefore, $V(A) = 0$.</li>
<li>When <code class="language-plaintext highlighter-rouge">A</code> occurred, the next state was always <code class="language-plaintext highlighter-rouge">B</code>, and the immediate reward in the single sample we have was 0. Therefore, $V(A) = E(R_{t+1} + \gamma R_{t+2} + \dots) \approx 0 + \gamma V(B) = \frac{3 \gamma}{4} \neq 0$. <strong>This makes sense under the Markov assumption: given $S_{t+1}$, the reward $R_{t+2}$ is independent of $S_t$, so the fact that $A$ happened before $B$ should not cause $B$ to yield a reward of $0$.</strong></li>
</ol>
<p>In a sense, <strong>Monte Carlo minimizes the mean squared error of the value estimates on the observed returns</strong>, whereas <strong>TD converges towards the <em>certainty-equivalence estimate</em></strong>. The certainty-equivalence estimate is the value function of the maximum likelihood model of the process, under the assumption that the process is an MDP.</p>
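<p>To make the two answers concrete, here is a minimal sketch that replays the eight episodes from the table under both methods. The function names <code class="language-plaintext highlighter-rouge">batch_mc</code> and <code class="language-plaintext highlighter-rouge">batch_td0</code>, the step size and the number of sweeps are my own choices for illustration, and I assume $\gamma = 1$:</p>

```python
from collections import defaultdict

# The eight episodes from the table, as lists of (state, reward) steps.
episodes = [[("A", 0), ("B", 0)], [("B", 0)]] + [[("B", 1)] for _ in range(6)]
gamma = 1.0  # assumption: no discounting


def batch_mc(episodes, gamma):
    """Average the observed return G_t over every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def batch_td0(episodes, gamma, alpha=0.01, sweeps=5000):
    """Replay the same episodes with the TD(0) update until V settles."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for i, (state, reward) in enumerate(episode):
                # The last step of an episode leads to the terminal state (value 0).
                next_state = episode[i + 1][0] if i + 1 != len(episode) else None
                target = reward + gamma * (V[next_state] if next_state else 0.0)
                V[state] += alpha * (target - V[state])
    return dict(V)


mc_values = batch_mc(episodes, gamma)
td_values = batch_td0(episodes, gamma)
```

<p>With these settings, Monte Carlo reports $V(A) = 0$, while the TD(0) estimate of $V(A)$ settles close to $\frac{3}{4}$ (not exactly on it, since the step size is constant).</p>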
<h1 id="on-policy-sarsa">On-policy: Sarsa</h1>
<p>In an on-policy setting, we learn an action-value function $q(s,a)$ rather than $v(s)$ discussed above. The generalization is quite straightforward:</p>
\[Q(S_t, A_t) = Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]\]
<p>As opposed to:</p>
\[V(S_t) = V(S_t) + \alpha [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\]
<p>We call this algorithm <strong>Sarsa</strong> because each update uses the sequence <em>State, Action, Reward, State, Action</em>. Just like any on-policy RL algorithm, we estimate $q_\pi$ for the policy $\pi$ we are currently following. This $\pi$ maps the current state to a probability distribution over actions.</p>
<p>As always, on-policy algorithms may not explore enough of the search space and converge to a suboptimal answer, or get stuck in their current policy forever and fail to terminate. As a result, we usually use an $\epsilon$-greedy policy $\pi$, which takes the greedy action most of the time and selects among the other actions with a small probability $\epsilon$.</p>
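<p>A minimal sketch of $\epsilon$-greedy action selection over one row of a tabular $Q$ follows. The helper name <code class="language-plaintext highlighter-rouge">epsilon_greedy</code> is hypothetical, and this variant explores uniformly over <em>all</em> actions (including the greedy one), which is a common simplification rather than necessarily what the repository does:</p>

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of Q (a list of action values)."""
    if rng.random() >= epsilon:
        # Exploit: take the action with the highest current estimate.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: take any action uniformly at random.
    return rng.randrange(len(q_row))


# With epsilon=0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.0, 1.0, 0.0], epsilon=0.0)
```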
<h1 id="off-policy-q-learning">Off-policy: Q-learning</h1>
<p>The corresponding off-policy variant of Sarsa is called Q-learning, and it is defined by the update rule:</p>
\[Q(S_t, A_t) = Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]\]
<p>The only difference here is that instead of approximating $G_t$ with $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$, in which $A_{t+1}$ is picked from an $\epsilon$-greedy policy, we bootstrap from the value of the greedy action, $\max_a Q(S_{t+1}, a)$. This is an <em>off-policy</em> TD control algorithm because we use $\epsilon$-greedy for simulation, but we update $Q$ as if the greedy ($\arg\max$) action were taken next.</p>
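<p>The difference between the two targets is easiest to see side by side. This is a small sketch with a made-up $Q$ table; the function names are hypothetical and only illustrate the two update rules above:</p>

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(Q[next_state])


# Toy table: two states, three actions each (values are made up).
Q = [[0.0, 0.0, 0.0], [1.0, 5.0, -2.0]]

# Suppose the epsilon-greedy policy happened to explore and chose action 2.
on_policy = sarsa_target(Q, reward=-1.0, next_state=1, next_action=2, gamma=1.0)
off_policy = q_learning_target(Q, reward=-1.0, next_state=1, gamma=1.0)
```

<p>When the behavior policy explores a bad action, the Sarsa target drops with it, while the Q-learning target is unaffected.</p>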
<h1 id="example-cliff-walking">Example: Cliff Walking</h1>
<p>Once again, let’s take the <a href="https://github.com/OneRaynyDay/RLEngine">previous <strong>Cliff Walking</strong> gym we used for Monte Carlo methods</a>. As we had previously observed, the converged path on the graph was:</p>
<p><img src="http://oneraynyday.github.io/assets/cliffwalking.png" alt="mccliffwalk" /></p>
<p>This was for an <strong>on-policy Monte Carlo method</strong>. The path is so conservative, with the actor moving all the way up to the edge, because with probability $\epsilon$ the actor moves in a random direction, and next to the cliff such a move can kill it. Even as $\epsilon$ approaches 0, the strategy is not optimal, but it is extremely safe.</p>
<p>Now, I have added the TD functionality in the same Monte Carlo repository (now named RLModels), and refactored most of the core functionality. We directly translate the update rules above into the <code class="language-plaintext highlighter-rouge">update_Q()</code> function, and we pass in the tuple <code class="language-plaintext highlighter-rouge">(state, action, reward, state, action)</code> instead of an episode, i.e. <code class="language-plaintext highlighter-rouge">[(state, action, reward), ...]</code>:</p>
<h2 id="sarsa-model">Sarsa Model</h2>
<p>For Sarsa, we use the update rule:</p>
\[Q(S_t, A_t) = Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]\]
<p>per event in an episode. We reflect that in our model below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FiniteSarsaModel</span><span class="p">(</span><span class="n">FiniteModel</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_space</span><span class="p">,</span> <span class="n">action_space</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
<span class="s">"""SarsaModel takes in state_space and action_space (finite)
Arguments
---------
state_space: int OR list[observation], where observation is any hashable type from env's obs.
action_space: int OR list[action], where action is any hashable type from env's actions.
gamma: float, discounting factor.
epsilon: float, epsilon-greedy parameter.
If the parameter is an int, then we generate a list, and otherwise we generate a dictionary.
>>> m = FiniteSarsaModel(2,3,epsilon=0)
>>> m.Q
[[0, 0, 0], [0, 0, 0]]
>>> m.Q[0][1] = 1
>>> m.Q
[[0, 1, 0], [0, 0, 0]]
>>> m.pi(1, 0)
1
>>> m.pi(1, 1)
0
"""</span>
<span class="nb">super</span><span class="p">(</span><span class="n">FiniteSarsaModel</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_space</span><span class="p">,</span> <span class="n">action_space</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span>
<span class="k">def</span> <span class="nf">update_Q</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sarsa</span><span class="p">):</span>
<span class="s">"""Performs a TD(0) action-value update using a single step.
Arguments
---------
sarsa: (state, action, reward, state, action), an event in an episode.
"""</span>
<span class="c1"># Unpack the (state, action, reward, next state, next action) transition
</span> <span class="n">p_state</span><span class="p">,</span> <span class="n">p_action</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">n_state</span><span class="p">,</span> <span class="n">n_action</span> <span class="o">=</span> <span class="n">sarsa</span>
<span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">p_state</span><span class="p">][</span><span class="n">p_action</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">p_state</span><span class="p">][</span><span class="n">p_action</span><span class="p">]</span> <span class="o">=</span> <span class="n">q</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">*</span> \
<span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">gamma</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">n_state</span><span class="p">][</span><span class="n">n_action</span><span class="p">]</span> <span class="o">-</span> <span class="n">q</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">score</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">1000</span><span class="p">):</span>
<span class="s">"""Evaluates a specific policy with regards to the env.
Arguments
---------
env: an openai gym env, or anything that follows the api.
policy: a function, could be self.pi, self.b, etc.
"""</span>
<span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_samples</span><span class="p">):</span>
<span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">cum_rewards</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">action</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">choose_action</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
<span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">cum_rewards</span> <span class="o">+=</span> <span class="n">reward</span>
<span class="k">if</span> <span class="n">done</span><span class="p">:</span>
<span class="n">rewards</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cum_rewards</span><span class="p">)</span>
<span class="k">break</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="q-learning-model">Q-Learning Model</h2>
<p>For Q-learning, we use a max variant of Sarsa’s target, which makes it an off-policy model. The update rule looks like:</p>
\[Q(S_t, A_t) = Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]\]
<p>Here is the model in python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FiniteQLearningModel</span><span class="p">(</span><span class="n">FiniteModel</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_space</span><span class="p">,</span> <span class="n">action_space</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
<span class="s">"""FiniteQLearningModel takes in state_space and action_space (finite)
Arguments
---------
state_space: int OR list[observation], where observation is any hashable type from env's obs.
action_space: int OR list[action], where action is any hashable type from env's actions.
gamma: float, discounting factor.
epsilon: float, epsilon-greedy parameter.
If the parameter is an int, then we generate a list, and otherwise we generate a dictionary.
>>> m = FiniteQLearningModel(2,3,epsilon=0)
>>> m.Q
[[0, 0, 0], [0, 0, 0]]
>>> m.Q[0][1] = 1
>>> m.Q
[[0, 1, 0], [0, 0, 0]]
>>> m.pi(1, 0)
1
>>> m.pi(1, 1)
0
"""</span>
<span class="nb">super</span><span class="p">(</span><span class="n">FiniteQLearningModel</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">(</span><span class="n">state_space</span><span class="p">,</span> <span class="n">action_space</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span>
<span class="k">def</span> <span class="nf">update_Q</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sars</span><span class="p">):</span>
<span class="s">"""Performs a TD(0) action-value update using a single step.
Arguments
---------
sars: (state, action, reward, state, action) or (state, action, reward, state),
an event in an episode.
NOTE: For Q-Learning, we don't actually use the next action, since we argmax.
"""</span>
<span class="c1"># Drop the next action if present; Q-learning only needs (s, a, r, s')
</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sars</span><span class="p">)</span> <span class="o">></span> <span class="mi">4</span><span class="p">:</span>
<span class="n">sars</span> <span class="o">=</span> <span class="n">sars</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span>
<span class="n">p_state</span><span class="p">,</span> <span class="n">p_action</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">n_state</span> <span class="o">=</span> <span class="n">sars</span>
<span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">p_state</span><span class="p">][</span><span class="n">p_action</span><span class="p">]</span>
<span class="n">max_q</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">n_state</span><span class="p">].</span><span class="n">values</span><span class="p">())</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">n_state</span><span class="p">],</span> <span class="nb">dict</span><span class="p">)</span> <span class="k">else</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">n_state</span><span class="p">])</span>
<span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">p_state</span><span class="p">][</span><span class="n">p_action</span><span class="p">]</span> <span class="o">=</span> <span class="n">q</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">alpha</span> <span class="o">*</span> \
<span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">gamma</span> <span class="o">*</span> <span class="n">max_q</span> <span class="o">-</span> <span class="n">q</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">score</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">1000</span><span class="p">):</span>
<span class="s">"""Evaluates a specific policy with regards to the env.
Arguments
---------
env: an openai gym env, or anything that follows the api.
policy: a function, could be self.pi, self.b, etc.
"""</span>
<span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_samples</span><span class="p">):</span>
<span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">cum_rewards</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">action</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">choose_action</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
<span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">cum_rewards</span> <span class="o">+=</span> <span class="n">reward</span>
<span class="k">if</span> <span class="n">done</span><span class="p">:</span>
<span class="n">rewards</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cum_rewards</span><span class="p">)</span>
<span class="k">break</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="kn">import</span> <span class="nn">doctest</span>
<span class="n">doctest</span><span class="p">.</span><span class="n">testmod</span><span class="p">()</span>
</code></pre></div></div>
<h2 id="cliffwalking-maps">Cliffwalking Maps</h2>
<p>By running the two models above, we get different cliffwalking maps:</p>
<p><strong>For SARSA:</strong></p>
<p><img src="http://oneraynyday.github.io/assets/cliffwalking_sarsa.png" alt="sarsacliffwalk" /></p>
<p>As we can see, it is similar to Monte Carlo’s map. The path is conservative because Sarsa, like on-policy Monte Carlo, evaluates the $\epsilon$-greedy policy it actually follows. Its action values therefore account for the random exploratory steps, so the agent moves to the safest corner before proceeding to the right and ultimately to the end.</p>
<p><strong>For Q-learning:</strong></p>
<p><img src="http://oneraynyday.github.io/assets/cliffwalking_qlearning.png" alt="qlearncliffwalk" /></p>
<p><em>This is a very different map compared to SARSA and Monte Carlo.</em> Q-learning bootstraps from $\max_a Q(S_{t+1}, a)$, so its value estimates ignore the stochasticity of the exploratory actions, which is why it picks <strong>the optimal route</strong> (the greedy action is optimal under the Markov property of the MDP). The off-policy approach allows Q-learning to learn a policy that is optimal while its $\epsilon$-greedy simulations allow it to explore.</p>
<p>In my opinion, Q-learning wins this round.</p>
<h2 id="learning-curves">Learning Curves</h2>
<p><img src="http://oneraynyday.github.io/assets/cliffwalking_learning_plot.png" alt="learningcurves" /></p>
<p><em>We run all three models in tandem, and we record the total reward per episode for each of the techniques as epochs increase. The <code class="language-plaintext highlighter-rouge">*_interp</code> are moving averages of the log rewards.</em> It appears that <strong>SARSA</strong>, although producing approximately the same solution as Monte Carlo (recall that it is not exact), converges to a higher reward <strong>much faster</strong>. This is because its value function is updated per step rather than per episode. <strong>This method of bootstrapping allows the model to learn a lot faster than Monte Carlo.</strong></p>
<p>Another observation we can see is that <strong>Q-learning’s average reward is bad</strong>. This is due to the fact that Q-learning tries to take the <strong>optimal action</strong>, but gets screwed over by the $\epsilon$ probability of falling off a cliff due to the stochasticity of the $\epsilon$-greedy policy that it uses to explore.</p>
<p>Another interesting phenomenon observed in the above diagram is that <em>Monte Carlo actually starts to degrade in performance near the end.</em> This… I have no explanation for (and I’d love to discuss it!).</p>
Airbnb Bighead - Speeding Up Inference2018-08-30T00:00:00+00:00http://oneraynyday.github.io/ml/2018/08/30/Bighead-XNORNet<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#integer-8-bit-quantization" id="markdown-toc-integer-8-bit-quantization">Integer 8-bit quantization</a> <ul>
<li><a href="#details" id="markdown-toc-details">Details</a></li>
</ul>
</li>
<li><a href="#bitwise-quantization" id="markdown-toc-bitwise-quantization">Bitwise quantization</a> <ul>
<li><a href="#training-and-backpropagation-of-an-xnornet" id="markdown-toc-training-and-backpropagation-of-an-xnornet">Training and Backpropagation of an XNORNet</a> <ul>
<li><a href="#difficulty-of-training-an-xnornet" id="markdown-toc-difficulty-of-training-an-xnornet">Difficulty of training an XNORNet</a></li>
<li><a href="#training-tips" id="markdown-toc-training-tips">Training Tips</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#approaches-to-implement-xnornet" id="markdown-toc-approaches-to-implement-xnornet">Approaches to Implement XNORNet</a> <ul>
<li><a href="#forward-inference-framework" id="markdown-toc-forward-inference-framework">Forward-inference Framework</a> <ul>
<li><a href="#linear-algebra-backend" id="markdown-toc-linear-algebra-backend">Linear Algebra Backend</a></li>
<li><a href="#python-to-c-transpiler" id="markdown-toc-python-to-c-transpiler">Python to C++ Transpiler</a></li>
</ul>
</li>
<li><a href="#custom-xnor-blas-functionalities" id="markdown-toc-custom-xnor-blas-functionalities">Custom XNOR BLAS functionalities</a> <ul>
<li><a href="#xnormatmul" id="markdown-toc-xnormatmul"><code class="language-plaintext highlighter-rouge">xnormatmul</code></a> <ul>
<li><a href="#floating-point-packing" id="markdown-toc-floating-point-packing">Floating point packing</a></li>
<li><a href="#vectorized-xnors" id="markdown-toc-vectorized-xnors">Vectorized xnors</a></li>
<li><a href="#accumulating-bits" id="markdown-toc-accumulating-bits">Accumulating bits</a></li>
</ul>
</li>
<li><a href="#binconv2d" id="markdown-toc-binconv2d"><code class="language-plaintext highlighter-rouge">binConv2d</code></a></li>
</ul>
</li>
</ul>
</li>
</ul>
<p>This article is WIP.</p>
<p><em>The opinions expressed in this post are my own and not necessarily those of my employer (i.e. don’t fire me pls)</em></p>
<h1 id="introduction">Introduction</h1>
<p>Over the summer, I interned at Airbnb’s machine learning infrastructure team, working on Airbnb’s to-be-opensourced library called <strong><code class="language-plaintext highlighter-rouge">Bighead</code></strong>. Think Uber’s <code class="language-plaintext highlighter-rouge">Michelangelo</code>, in that it’s an end-to-end machine learning framework, meant to wrap around most of your typical machine learning workflow, from data ingestion, to training, to hyperparameter selection, visualization, and finally deployment. When it becomes opensource, you can figure out the details for yourself.</p>
<p>An argument against end-to-end machine learning frameworks is that you need to work exclusively in their environment, and frameworks that are not completely compliant with their API (like <code class="language-plaintext highlighter-rouge">tensorflow</code>, <code class="language-plaintext highlighter-rouge">SpaCy</code>, etc) need wrappers.</p>
<p>However, an argument for end-to-end machine learning frameworks is that because you have a homogeneous wrapper interface, the user <em>shouldn’t really care about which framework they use underneath, just that it works well, is easy to deploy, and easy to visualize and interpret.</em></p>
<p>If we have our own ecosystem, we can build our own data ingestion pipeline, optimize it however we want, and do cool things, like <strong>speed up machine learning inference by swapping out frameworks depending on which stage of the pipeline you are currently executing.</strong></p>
<p>The scoped context of this blog doesn’t do <code class="language-plaintext highlighter-rouge">Bighead</code> enough justice (there are many other features that make it a great library), so check it out when it’s available!</p>
<h1 id="integer-8-bit-quantization">Integer 8-bit quantization</h1>
<p>Nvidia has a library for forward-inference called <code class="language-plaintext highlighter-rouge">TensorRT</code>. It’s neat because it uses underlying GPU intrinsics for optimization (<code class="language-plaintext highlighter-rouge">INT8 GEMM DP4A</code>, etc), and so on Nvidia GPUs, it runs <em>very fast</em>. Nvidia has done some work in quantizing the weights and inputs of the neural network, down from <code class="language-plaintext highlighter-rouge">float32</code> to <code class="language-plaintext highlighter-rouge">float16</code> or <code class="language-plaintext highlighter-rouge">int8</code>. For <code class="language-plaintext highlighter-rouge">float32</code> to <code class="language-plaintext highlighter-rouge">float16</code>, you are essentially performing <code class="language-plaintext highlighter-rouge">static_cast&lt;float16&gt;(float_32_number)</code>. For <code class="language-plaintext highlighter-rouge">int8</code>, it’s a little different:</p>
\[\gamma^* = argmin_\gamma D_{KL}(P||Q_\gamma) \\
I^W_{ij} = cast_{int8}(\frac{W_{ij}}{\gamma^*}) \\
W \approx I^W * \gamma^*\]
<p>where $P$ and $Q_\gamma$ are two distributions (explained later). $\gamma$ is a parameter of $Q_\gamma$, and we minimize the Kullback-Leibler divergence between these two distributions with respect to $\gamma$. The $cast_{int8}$ function is a clamping function:</p>
\[cast_{int8}(x) = \min(\max(-127, 2( \lfloor \frac{(2^8-1)(x+1)}{2(2^8-1)} \rfloor - \frac{1}{2})), 127)\]
<p>Okay, maybe it’s better expressed with code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cast_int8</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="c1"># Truncate toward zero, then clamp to the symmetric int8 range [-127, 127].
</span>    <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="o">-</span><span class="mi">127</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="mi">127</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
</code></pre></div></div>
<p>Now, let’s explain $Q$ and $P$.</p>
<p>For layers of a neural network, we usually generate these things called <em>activation values</em>, from functions called <em>activation functions</em>. The most common one is the sigmoid:</p>
\[\sigma (x) = \frac{1}{1+e^{-x}}\]
<p>… but technically almost anything could be considered an activation function. With some data distribution $D = \{x_i\}_{i=0}^N$, we have some kind of activation function $f$ where $f(D) = P$. Here $f$ is a generic function, and $D$ is some unknown distribution.</p>
<p>Now, if we discretize all elements involved in $f$, including the input and the weights required for any operator, we get back a function $f_{int8}$, and $f_{int8}(D) * \gamma = Q_\gamma$. We minimize the KL divergence between these two distributions, $Q_\gamma$ and $P$.</p>
<p>TL;DR: We find the best approximation of $I^W$ with respect to $\gamma$ and the current activation function $f$.</p>
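<p>As a rough sketch of the round trip $W \to I^W \to I^W * \gamma$, the snippet below quantizes a flat list of weights and maps them back. The names <code class="language-plaintext highlighter-rouge">quantize</code> and <code class="language-plaintext highlighter-rouge">dequantize</code> are hypothetical, the weights are made up, and $\gamma$ is hand-picked here rather than found by minimizing the KL divergence:</p>

```python
def cast_int8(x):
    """Truncate to an integer and clamp to the symmetric int8 range."""
    return max(-127, min(127, int(x)))


def quantize(weights, gamma):
    """I^W = cast_int8(W / gamma), applied element-wise to a flat list."""
    return [cast_int8(w / gamma) for w in weights]


def dequantize(int_weights, gamma):
    """W is approximated by I^W * gamma."""
    return [i * gamma for i in int_weights]


weights = [0.25, -0.5, 0.75, 4.0]  # made-up values
gamma = 0.03125                    # hand-picked for illustration, not KL-optimal
int_weights = quantize(weights, gamma)
restored = dequantize(int_weights, gamma)
```

<p>Weights that fit inside the representable range survive the round trip, while the last weight is too large for this $\gamma$ and gets clamped. That trade-off between resolution and clipping is what the search over $\gamma$ is balancing.</p>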
<h2 id="details">Details</h2>
<p>We refer to <code class="language-plaintext highlighter-rouge">MXNet</code>’s source code, to which I contributed a PR, to explain exactly how the quantization step happens. The link to the file is <a href="https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/quantization.py">here</a>, and the PR is <a href="https://github.com/apache/incubator-mxnet/pull/11833">here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hist</span><span class="p">,</span> <span class="n">hist_edges</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">arr</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">num_bins</span><span class="p">,</span> <span class="nb">range</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="n">th</span><span class="p">,</span> <span class="n">th</span><span class="p">))</span>
<span class="n">zero_bin_idx</span> <span class="o">=</span> <span class="n">num_bins</span> <span class="o">//</span> <span class="mi">2</span>
<span class="n">num_half_quantized_bins</span> <span class="o">=</span> <span class="n">num_quantized_bins</span> <span class="o">//</span> <span class="mi">2</span>
<span class="k">assert</span> <span class="n">np</span><span class="p">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">hist_edges</span><span class="p">[</span><span class="n">zero_bin_idx</span><span class="p">]</span> <span class="o">+</span>
<span class="n">hist_edges</span><span class="p">[</span><span class="n">zero_bin_idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">],</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">rtol</span><span class="o">=</span><span class="mf">1e-5</span><span class="p">,</span> <span class="n">atol</span><span class="o">=</span><span class="mf">1e-7</span><span class="p">)</span>
</code></pre></div></div>
<p>Here, we create a histogram to approximate the distribution of activation values. The KL divergence between two continuous-valued distributions is usually approximated in software by binning. Each entry of the histogram is a bin count. For example, if you have 2 bins, then:</p>
\[D_{KL}([1,1]||[1,1]) = 0\]
<p>Because these 2 distributions are the same.</p>
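<p>For reference, a binned KL divergence can be sketched in a few lines. This is a plain-Python illustration of the definition (the function name is my own), not the routine MXNet uses:</p>

```python
import math


def kl_divergence(p_counts, q_counts):
    """D_KL(P || Q) for two histograms with the same number of bins."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    total = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p, q = p_count / p_total, q_count / q_total
        if p == 0:
            continue  # a bin with no mass in P contributes nothing
        if q == 0:
            return math.inf  # P has mass where Q has none
        total += p * math.log(p / q)
    return total


same = kl_divergence([1, 1], [1, 1])
different = kl_divergence([3, 1], [1, 1])
```

<p>Identical histograms give a divergence of zero, and the value grows as the two histograms disagree.</p>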
<p>We are about to enter the important <code class="language-plaintext highlighter-rouge">for</code> loop that decides which threshold, i.e. $\gamma$, is optimal. We run through all reasonable $\gamma$’s and pick the one that gives the lowest KL divergence. The reasonable $\gamma$’s lie in $[0, \max_i(|x_i|)]$. Any elements outside $[-\gamma, \gamma]$ are absorbed into the two outermost bins of the histogram.</p>
<p>An example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sampled numbers, sorted without loss of generality
[-10, -2, 1, 8, 12]
# If gamma = 2, and 4 bins, then we have histograms of
[|x <= -1|, |-1 < x <= 0|, |0 < x <= 1|, |1 < x|]
= [2, 0, 1, 2]
</code></pre></div></div>
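<p>The toy example above can be reproduced with a short sketch. The function <code class="language-plaintext highlighter-rouge">clipped_histogram</code> is a hypothetical helper, not part of the calibration code, and it assumes right-closed bins $(lo, hi]$ to match the example:</p>

```python
import math


def clipped_histogram(values, gamma, num_bins):
    """Histogram over [-gamma, gamma] where outliers land in the edge bins.

    Bins are right-closed, i.e. (lo, hi], as in the worked example.
    """
    width = 2 * gamma / num_bins
    counts = [0] * num_bins
    for value in values:
        # Saturate anything outside [-gamma, gamma] to the boundary.
        clipped = min(max(value, -gamma), gamma)
        # Right-closed bin index; the left boundary itself goes to bin 0.
        index = math.ceil((clipped + gamma) / width) - 1
        index = min(max(index, 0), num_bins - 1)
        counts[index] += 1
    return counts


counts = clipped_histogram([-10, -2, 1, 8, 12], gamma=2, num_bins=4)
```

<p>Note that <code class="language-plaintext highlighter-rouge">np.histogram</code> uses left-closed bins (except for the last one), so a value sitting exactly on a bin edge may be counted one bin over compared to this sketch.</p>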
<p>So the below is the process:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate reference distribution p
</span><span class="n">p</span> <span class="o">=</span> <span class="n">sliced_nd_hist</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="k">assert</span> <span class="n">p</span><span class="p">.</span><span class="n">size</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">assert</span> <span class="n">p</span><span class="p">.</span><span class="n">size</span> <span class="o">>=</span> <span class="n">num_quantized_bins</span>
<span class="c1"># put left outlier count in p[0]
</span><span class="n">left_outlier_count</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">hist</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">p_bin_idx_start</span><span class="p">])</span>
<span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">left_outlier_count</span>
<span class="c1"># put right outlier count in p[-1]
</span><span class="n">right_outlier_count</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">hist</span><span class="p">[</span><span class="n">p_bin_idx_stop</span><span class="p">:])</span>
<span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">right_outlier_count</span>
<span class="c1"># is_nonzeros[k] indicates whether hist[k] is nonzero
</span><span class="n">is_nonzeros</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span>
</code></pre></div></div>
<p>Here, we discretize our reference distribution $P$, which is retrieved from $f(D)$. In this case, we have generated <code class="language-plaintext highlighter-rouge">arr</code> which is $P$. Recall that in practice, we discretize our distributions into bins so we can do a discrete KL divergence calculation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate how many bins should be merged to generate quantized distribution q
</span><span class="n">num_merged_bins</span> <span class="o">=</span> <span class="n">sliced_nd_hist</span><span class="p">.</span><span class="n">size</span> <span class="o">//</span> <span class="n">num_quantized_bins</span>
<span class="c1"># merge hist into num_quantized_bins bins
</span><span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_quantized_bins</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">j</span> <span class="o">*</span> <span class="n">num_merged_bins</span>
<span class="n">stop</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="n">num_merged_bins</span>
<span class="n">quantized_bins</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">sliced_nd_hist</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">stop</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">quantized_bins</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">sliced_nd_hist</span><span class="p">[</span>
<span class="n">num_quantized_bins</span> <span class="o">*</span> <span class="n">num_merged_bins</span><span class="p">:].</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>
<p>Here, we generate our distribution $Q$, by binning up the current scope of numbers within $[-\gamma, \gamma]$ into the # of bins you can fit in an <code class="language-plaintext highlighter-rouge">int8</code>, which is $|[-127,-126,…,127]| = 2^8-1 = 255$ bins.</p>
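<p>Now that we have retrieved <code class="language-plaintext highlighter-rouge">q</code> and <code class="language-plaintext highlighter-rouge">p</code>, let’s compute the KL divergence between them. Using SciPy’s <code class="language-plaintext highlighter-rouge">stats.entropy</code>, which returns the KL divergence when it is given two distributions:</p>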
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">divergence</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="n">num_half_quantized_bins</span><span class="p">]</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">entropy</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">q</span><span class="p">)</span>
</code></pre></div></div>
<p>And so we find our best threshold by choosing the candidate with the minimum KL divergence from the above expression.</p>
<p>There are some extra details that I left out, such as smoothing the distributions to prevent divide-by-zero errors. Read the code for more information.</p>
<p><em>Note: The <code class="language-plaintext highlighter-rouge">contrib</code> folder for most frameworks is volatile, so the code above may not be an exact 1:1 at the time that you are reading it.</em></p>
<p>On a single GPU, this model usually yields ~1-2x faster inference. On a <code class="language-plaintext highlighter-rouge">p3.8xlarge</code> (with 4 beefy Volta GPUs), we can get to at most 3x faster on <code class="language-plaintext highlighter-rouge">Resnet50</code>. However, can we discretize this even further?</p>
<h1 id="bitwise-quantization">Bitwise quantization</h1>
<p>Bitwise quantization is similar to <code class="language-plaintext highlighter-rouge">int8</code> quantization. The setup is as follows:</p>
\[A, B \in \Re^{N \times M} \\
\mathcal{B}^X \in \{-1, +1\}^{N \times M} \\
\alpha, \beta \in \Re\]
<p>The approximation is as follows:</p>
\[AB \approx \mathcal{B}^{A} \mathcal{B}^{B}\alpha\beta\]
<p>One can find the best binary array and magnitudes by solving the following least squares equation:</p>
\[argmin_{\mathcal{B}^{A},\alpha} ||A - \mathcal{B}^{A}*\alpha||^2\]
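<p>After some calculations, the answer is (intuitively) the sign of each entry, scaled by the mean absolute value over all $NM$ entries (here $||A||_1$ is the entrywise L-1 norm):</p>
\[\mathcal{B}^{A}_{ij} = signum(A)_{ij} \\
\alpha = \frac{||A||_1}{NM}\]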
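<p>To make the least squares claim concrete, here is a small sketch that binarizes a matrix and checks that nudging the scale away from the mean absolute entry only increases the squared error. The names <code class="language-plaintext highlighter-rouge">binarize</code> and <code class="language-plaintext highlighter-rouge">squared_error</code> are illustrative, and the scale is taken to be the entrywise L-1 norm divided by the number of entries:</p>

```python
def binarize(matrix):
    """Return (signs, alpha): entrywise signs and the mean absolute entry."""
    entries = [x for row in matrix for x in row]
    alpha = sum(abs(x) for x in entries) / len(entries)
    # Zero is mapped to -1, matching the modified sign function used later on.
    signs = [[1.0 if x > 0 else -1.0 for x in row] for row in matrix]
    return signs, alpha


def squared_error(matrix, signs, alpha):
    """Squared Frobenius error ||A - alpha * B||^2."""
    return sum((x - s * alpha) ** 2
               for row, sign_row in zip(matrix, signs)
               for x, s in zip(row, sign_row))


A = [[0.5, -1.5], [2.0, -1.0]]
signs, alpha = binarize(A)
best = squared_error(A, signs, alpha)
worse_low = squared_error(A, signs, alpha - 0.1)
worse_high = squared_error(A, signs, alpha + 0.1)
```

<p>For this matrix the scale comes out as the average of $0.5, 1.5, 2.0, 1.0$, and both perturbed scales give a larger error than the optimum.</p>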
<p>Multiplying two of these approximations is simple, and also leads to the optimal approximation of the two original matrices multiplied together:</p>
\[A*B \approx \mathcal{B}^{A}*\mathcal{B}^{B} * \alpha * \beta\]
<p>And if the above were numbers:</p>
\[a*b \approx sign(a)*sign(b) * \alpha * \beta\]
<hr />
<p>The reason I added the number example is because matrix multiplication can be expressed as a series of dot products. If we can optimize the dot product kernel, it means we can optimize the matrix kernel as well.</p>
<p>If we had a number in <code class="language-plaintext highlighter-rouge">float32</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = [sign of a | exponent of a | mantissa of a]
a * b = [sign a XOR sign b | exp a + exp b | mantissa a * mantissa b]
</code></pre></div></div>
<p>If we had a number in $\{-1, +1\}$, and mapped $-1$ to $0$, and $1$ to $1$ in bits:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = [sign of a]
b = [sign of b]
a * b = a XNOR b
</code></pre></div></div>
<p>This means all of our multiplications can be done via <code class="language-plaintext highlighter-rouge">XNOR</code>’s! We can pack our data into bits and do a vectorized xnor on multiple elements at once, which in theory should be much faster than performing floating point multiplication. In addition, a dot product is just a combination of multiplication and sum reduction. If we use <code class="language-plaintext highlighter-rouge">XNOR</code> for multiplication, we need an equivalent for sum reduction. Fortunately, there is a native <code class="language-plaintext highlighter-rouge">popcount</code> instruction that counts the number of <code class="language-plaintext highlighter-rouge">1</code>’s in a word, and wide SIMD registers (32 bytes in AVX2, 64 bytes in AVX512, etc.) let us process many packed elements at once.</p>
<p>A naive approach to the dot product would be:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="o">~</span> <span class="p">(</span><span class="n">a</span> <span class="o">^</span> <span class="n">b</span><span class="p">)</span> <span class="c1"># xnor
</span><span class="n">result</span> <span class="o">=</span> <span class="n">popcount</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># popcount
</span></code></pre></div></div>
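<p>which gives the number of positions where the two vectors agree. For $n$ packed elements, the actual dot product is $2 \cdot popcount(x) - n$, since every match contributes $+1$ and every mismatch contributes $-1$.</p>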
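<p>A runnable sketch of the packed dot product, using Python integers as bit vectors. The helpers <code class="language-plaintext highlighter-rouge">pack_signs</code> and <code class="language-plaintext highlighter-rouge">xnor_dot</code> are hypothetical names. Two details matter in practice: the unused high bits must be masked off after the negation, and the count of matching bits has to be mapped back to a $\pm 1$ sum:</p>

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for position, sign in enumerate(signs):
        if sign > 0:
            word |= 1 << position
    return word


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1  # Python ints are unbounded, so keep only n bits
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # popcount of the xnor
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed_result = xnor_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

<p>Here <code class="language-plaintext highlighter-rouge">packed_result</code> and <code class="language-plaintext highlighter-rouge">reference</code> agree. A real kernel would do the same thing on machine words with a hardware popcount instead of string counting.</p>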
<p>One caveat about this model is that simply doing the same after-the-fact conversion process as Nvidia’s <code class="language-plaintext highlighter-rouge">int8</code> discretization will give us <strong>horrendous performance</strong>, like, random guessing performance. So our training process is going to have to be revamped.</p>
<h2 id="training-and-backpropagation-of-an-xnornet">Training and Backpropagation of an XNORNet</h2>
<p>Where XNORNet shines is inference time. Because of the quantization of the weights and input, the forward pass of the architecture is theoretically extremely fast. However, training it is a different problem altogether.</p>
<p>To train an XNORNet, we actually consider the binarization function as some sort of activation function:</p>
\[bin(x) = sign(x) * \frac{1}{N}||x||_1\]
<p>This means that we are actually <strong>training with floating point numbers, discretizing them in the process, and retrieving real-valued gradients for update.</strong></p>
<p>This means that our backpropagation process is not going to be as fast as inference! In fact, it may be slower than normal backpropagation with floating point weights.</p>
<h3 id="difficulty-of-training-an-xnornet">Difficulty of training an XNORNet</h3>
<p>The difficulty of training an XNORNet arises from the problem of discretization. Naturally, functions that discretize real values are not continuous, and therefore they are not differentiable at every point.</p>
<p>Let’s look at a slightly modified sign function $sign$ or $signum$ (absorbing the case where $x$ is zero):</p>
\[sign(x) =
\begin{cases}
-1 & \text{if $x \leq 0$} \\
1 & \text{if $x > 0$}
\end{cases}\]
<p>This function is not differentiable from the right at the point $x = 0$, and the gradient is 0 everywhere else. How can we properly backpropagate on this function?</p>
<p>Well, the answer is we can’t. However, Hinton’s 2012 lectures (Using noise as a regularizer) introduced the idea of a <strong>straight-through estimator (STE)</strong>, which essentially argues that the approximation below works as a proxy for the gradient.</p>
\[\frac{\partial sign(x)_j}{\partial x_i} \approx
\begin{cases}
0 & \text{if $j \neq i$} \\
1 & \text{if $j = i$}
\end{cases}\]
<p>So therefore, if we’re performing backpropagation:</p>
\[\frac{\partial f(sign(x))}{\partial x_i} = \frac{\partial f(sign(x))}{\partial sign(x)_i} * 1\]
<p>However, for practical purposes, we perform gradient clipping at this stage, and we clip with respect to the input’s magnitude, so really our gradient through the $signum$ function looks like:</p>
\[\frac{\partial sign(x)_i}{\partial x_i} = 1_{|x_i| \leq 1}\]
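<p>As a framework-free sketch of the clipped straight-through estimator, the forward pass applies the sign function and the backward pass forwards the upstream gradient only where the input is small. The function names are illustrative, and the boundary case uses the $|x_i| \leq 1$ form that appears in the final gradient further down:</p>

```python
def sign_forward(x):
    """Modified sign function: zero is absorbed into the -1 branch."""
    return [1.0 if value > 0 else -1.0 for value in x]


def sign_backward(x, grad_output):
    """Clipped STE: pass the gradient through where |x_i| <= 1, else zero it."""
    return [grad if abs(value) <= 1 else 0.0
            for value, grad in zip(x, grad_output)]


x = [0.3, -1.5, 1.0]
upstream = [2.0, 3.0, -4.0]  # dC / d sign(x), assumed given by later layers
outputs = sign_forward(x)
grads = sign_backward(x, upstream)
```

<p>The middle element has magnitude above 1, so its gradient is dropped, while the other two pass through unchanged.</p>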
<p>Suppose we have some cost function $C$, which we are ultimately backpropagating from, then we want to know the gradient of $W$ for updates. Without loss of generality, we assume $W$ is a 1-dimensional tensor. We discretize $W \in \Re^{m}$:</p>
\[C(bin(W)) = C(sign(W) * \frac{1}{m}||W||_1)\]
<p>Our gradient becomes:</p>
\[\frac{\partial C(bin(W))}{\partial W_i} = \\
\sum_j \frac{\partial C(bin(W))}{\partial bin(W)_j} \frac{\partial bin(W)_j}{\partial W_i} = \\
\sum_j \frac{\partial C(bin(W))}{\partial bin(W)_j} \frac{\partial sign(W)_j * \frac{1}{m}||W||_1}{\partial W_i}\]
<p>By the product rule:</p>
\[\frac{\partial sign(W)_j * \frac{1}{m}||W||_1}{\partial W_i} = \\
\frac{\partial sign(W)_j}{\partial W_i} \frac{1}{m}||W||_1 + \frac{\partial \frac{1}{m}||W||_1}{\partial W_i} sign(W)_j\]
<p>Let’s tackle each piece:</p>
\[\frac{\partial sign(W)_j}{\partial W_i} =
\begin{cases}
0 & \text{if $j \neq i$} \\
1 & \text{if $j = i$}
\end{cases}\]
<p>And for the L-1 norm:</p>
\[\frac{\partial \frac{1}{m}||W||_1}{\partial W_i} = \\
\frac{\partial \frac{1}{m} \sum_{j} |W_{j}|}{\partial W_i} = \frac{1}{m} sign(W)_i\]
<p>The final gradient should be:</p>
\[\frac{\partial C(bin(W))}{\partial W_i} = \\
\sum_j \frac{\partial C(bin(W))}{\partial bin(W)_j} \frac{\partial sign(W)_j * \frac{1}{m}||W||_1}{\partial W_i} = \\
\frac{\partial C(bin(W))}{\partial bin(W)_i} \frac{\partial sign(W)_i}{\partial W_i} \frac{1}{m}||W||_1 + \sum_j \frac{\partial C(bin(W))}{\partial bin(W)_j}\frac{\partial \frac{1}{m}||W||_1}{\partial W_i} sign(W)_j = \\
\frac{\partial C(bin(W))}{\partial bin(W)_i} 1_{|W_i| \leq 1} \frac{1}{m}||W||_1 + \sum_j \frac{\partial C(bin(W))}{\partial bin(W)_j}\frac{1}{m} sign(W)_i sign(W)_j\]
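<p>The last line of the derivation translates almost directly into code. The sketch below assumes a 1-dimensional weight tensor and takes the upstream gradient $\partial C / \partial bin(W)$ as an input; the function name <code class="language-plaintext highlighter-rouge">binarized_weight_gradient</code> is made up for illustration:</p>

```python
def binarized_weight_gradient(weights, grad_bin):
    """Gradient of C(bin(W)) w.r.t. W, given grad_bin = dC / d bin(W).

    Implements: grad_bin[i] * 1[|W_i| <= 1] * alpha
                + (1/m) * sign(W_i) * sum_j grad_bin[j] * sign(W_j)
    """
    m = len(weights)
    alpha = sum(abs(w) for w in weights) / m
    signs = [1.0 if w > 0 else -1.0 for w in weights]
    # The sum over j is shared by every W_i, so compute it once.
    sign_dot_grad = sum(g * s for g, s in zip(grad_bin, signs))
    grads = []
    for w, g, s in zip(weights, grad_bin, signs):
        ste_term = g * (1.0 if abs(w) <= 1 else 0.0) * alpha  # clipped STE part
        norm_term = s * sign_dot_grad / m                      # L-1 norm part
        grads.append(ste_term + norm_term)
    return grads


W = [0.5, -2.0, 1.5]
dC_dbin = [1.0, 2.0, -1.0]
grad_W = binarized_weight_gradient(W, dC_dbin)
```

<p>Only the first weight has magnitude at most 1, so it is the only one that receives the straight-through term. All three weights receive the term coming from the L-1 norm.</p>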
<h3 id="training-tips">Training Tips</h3>
<ul>
<li>For input $X \approx \mathcal{B}^X * \alpha$, we can remove $\alpha$ and still yield similar accuracy. There is a roughly 3% accuracy gap.</li>
<li>For the gradient, we can replace the complicated equation derived above with a simple pass-through:</li>
</ul>
\[\frac{\partial C(bin(W))}{\partial W_i} \approx \frac{ \partial C(bin(W))}{ \partial bin(W)_i} 1_{|W_i| \leq 1}\]
<p>and it would work just as well, but it takes longer to converge. We currently use this for inputs, but keep the exact gradient for weights.</p>
<ul>
<li>As the <strong>depth of the neural network increases, it becomes harder to train an XNORNet to convergence</strong>. Because the gradient is zeroed for every $i$ with $|X_i| > 1$, there may be insufficient gradient information arriving at the beginning of the neural network. This is why several studies suggest widening the layers in an XNORNet, and why the original paper’s ResNet accuracy drops by a whopping 20%, as opposed to the XNORNet version of AlexNet, which loses 10%.</li>
<li><strong>Catastrophic forgetting</strong> is a real problem in XNOR-Networks. At times the neural network will drop in training accuracy by more than 50%, and does not easily recover. Intuitively, a small perturbation of a weight near zero flips its binarized value (from $-\alpha$ to $\alpha$, and vice versa), and such flips in the parts of the weights that generate good convolution filters cause a huge degradation in performance.</li>
<li>In reality, although the optimal magnitude for approximating the matrix $W$ follows from the least squares problem as:</li>
</ul>
\[\alpha^* = \frac{1}{m}||W||_1\]
<p>We have found that using $\alpha$ as a parameter for $W$ to backpropagate into also yields a similar accuracy, and reduces catastrophic forgetting (the learned $\alpha$ is usually very small).</p>
<ul>
<li>XNORNets are very <strong>sensitive to hyperparameters</strong>, and optimal performance requires careful hand-tuning or a fairly exhaustive hyperparameter search, for example over the learning rate (<code class="language-plaintext highlighter-rouge">1e-3</code>), batch size (<code class="language-plaintext highlighter-rouge">1024</code>), learning rate decay (<code class="language-plaintext highlighter-rouge">0.95</code>), etc. Attempts to train an XNORNet without proper hyperparameters will fail to converge at all.</li>
<li>One can interpret <strong>discretization as some form of regularization</strong>, and thus no additional dropout layer is required (i.e. training accuracy usually corresponds to cross-validation accuracy).</li>
<li>Clamping the weights to $[-1, 1]$ before a forward pass is a <strong>regularizer and controls gradient explosion in an XNORNet</strong>. This is also known as <strong>BinaryConnect</strong>.</li>
<li>During convolution, the authors suggest averaging up each chunk of the convolution input and doing an element-wise multiplication against the final result. We found that by simply using the total average L-1 norm (a scalar), we were able to reach the same accuracy.</li>
</ul>
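<p>The weight clamping tip from the list above is simple to sketch. The version below assumes a plain SGD update (the actual optimizer is not specified here) and clamps right after the update, which is the same as clamping before the next forward pass. The name <code class="language-plaintext highlighter-rouge">sgd_step_with_clamp</code> is hypothetical:</p>

```python
def sgd_step_with_clamp(weights, grads, learning_rate):
    """One plain SGD step on the real-valued weights, then clamp to [-1, 1]."""
    updated = [w - learning_rate * g for w, g in zip(weights, grads)]
    # Keeping |w| <= 1 also keeps every weight inside the region where the
    # clipped straight-through estimator still passes gradient.
    return [min(max(w, -1.0), 1.0) for w in updated]


weights = sgd_step_with_clamp([0.95, -0.2, -0.99], [-1.0, 0.5, 2.0],
                              learning_rate=0.1)
```

<p>The first and last weights would have left the interval after the update and are pulled back to the boundary, while the middle one is untouched by the clamp.</p>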
<p>Although there are no proofs or bounds for why the above “hacks” yield good results and YMMV, we ended up with 83% test accuracy on CIFAR 10 in roughly 50 epochs with a batch size of 1024.</p>
<h1 id="approaches-to-implement-xnornet">Approaches to Implement XNORNet</h1>
<p><code class="language-plaintext highlighter-rouge">BLAS</code> is a long-established linear algebra interface. It stands for:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">B</code>asic</li>
<li><code class="language-plaintext highlighter-rouge">L</code>inear</li>
<li><code class="language-plaintext highlighter-rouge">A</code>lgebra</li>
<li><code class="language-plaintext highlighter-rouge">S</code>ubprograms</li>
</ul>
<p>and many libraries implement their subroutines based off of <code class="language-plaintext highlighter-rouge">BLAS</code>’s interface. Some of the fastest variants of <code class="language-plaintext highlighter-rouge">BLAS</code> are <code class="language-plaintext highlighter-rouge">MKL</code> from Intel, <code class="language-plaintext highlighter-rouge">ATLAS</code>, and <code class="language-plaintext highlighter-rouge">OpenBLAS</code>. The most common subprograms deep learning uses are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dot</code> (dot product between 2 vectors)</li>
<li><code class="language-plaintext highlighter-rouge">gemv</code>/<code class="language-plaintext highlighter-rouge">gevm</code>, <code class="language-plaintext highlighter-rouge">ge</code>neral <code class="language-plaintext highlighter-rouge">m</code>atrix <code class="language-plaintext highlighter-rouge">v</code>ector</li>
<li><code class="language-plaintext highlighter-rouge">gemm</code>, <code class="language-plaintext highlighter-rouge">ge</code>neral <code class="language-plaintext highlighter-rouge">m</code>atrix <code class="language-plaintext highlighter-rouge">m</code>atrix</li>
</ul>
<p>Because many deep learning frameworks are glued to <code class="language-plaintext highlighter-rouge">BLAS</code> libraries, we are faced with two paths:</p>
<ol>
<li>Implement a BLAS routine for bit-wise matrix multiplication, and convince some 3rd party deep learning library to incorporate our changes.</li>
<li>Make a forward-inference framework ourselves, and have the flexibility to replace <code class="language-plaintext highlighter-rouge">gemm</code> with <code class="language-plaintext highlighter-rouge">xnorgemm</code> anywhere we want.</li>
</ol>
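<p>To give a feel for what replacing <code class="language-plaintext highlighter-rouge">gemm</code> with a bitwise routine means, here is a pure-Python sketch of a binary matrix multiply. The function <code class="language-plaintext highlighter-rouge">xnor_gemm</code> and its signature are assumptions for illustration only (a real kernel would work on packed machine words with SIMD). It packs each row of the left matrix and each column of the right matrix into an integer, then uses xnor and popcount per output entry:</p>

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for position, sign in enumerate(signs):
        if sign > 0:
            word |= 1 << position
    return word


def xnor_gemm(a_signs, b_signs, alpha=1.0, beta=1.0):
    """Multiply an N x K and a K x M matrix of +1/-1 entries via xnor/popcount.

    alpha and beta are the scalar magnitudes of the two binarized matrices.
    """
    k = len(b_signs)
    mask = (1 << k) - 1  # keep only the k valid bits after negation
    a_rows = [pack_signs(row) for row in a_signs]
    b_cols = [pack_signs([b_signs[r][c] for r in range(k)])
              for c in range(len(b_signs[0]))]
    return [[alpha * beta * (2 * bin(~(row ^ col) & mask).count("1") - k)
             for col in b_cols]
            for row in a_rows]


A = [[1, -1], [1, 1]]
B = [[1, 1], [-1, 1]]
C = xnor_gemm(A, B)
```

<p>With both magnitudes left at 1, <code class="language-plaintext highlighter-rouge">C</code> matches the ordinary matrix product of <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>. Packing the right-hand matrix by columns up front is what lets each output entry reduce to a single xnor and popcount.</p>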
<h2 id="forward-inference-framework">Forward-inference Framework</h2>
<h3 id="linear-algebra-backend">Linear Algebra Backend</h3>
<p><img src="http://oneraynyday.github.io/assets/xnornet/xtensor.svg" alt="xtensor" /></p>
<p>We wanted our library to interface directly with <code class="language-plaintext highlighter-rouge">NumPy</code>, which is the standard general numerical computing stack in Python along with <code class="language-plaintext highlighter-rouge">SciPy</code>. However, directly using <code class="language-plaintext highlighter-rouge">NumPy</code> will incur huge copy costs and reduced optimization, as well as the curse of the GIL in Python.</p>
<p>We decided that we needed something performant and expressive, that had good bindings with <code class="language-plaintext highlighter-rouge">NumPy</code>, and so one of the promising candidates was <code class="language-plaintext highlighter-rouge">xtensor</code>, from the QuantStack team. It also had a variety of optimizations that we were looking for:</p>
<ul>
<li>Support for <code class="language-plaintext highlighter-rouge">NumPy</code> bindings</li>
<li>Support for N-dimensional tensors (something that Armadillo and Eigen could not provide out-of-the-box)</li>
<li>Support for lazy operators, thus reducing temporary copies and supporting efficient type checking using templates during compile time</li>
<li>Support for black-box <code class="language-plaintext highlighter-rouge">simd</code> intrinsic operations</li>
<li>Use of modern C++11 for move semantics and compile-time generated fixed-size expressions using type traits built into the standard library.</li>
</ul>
<p>Now that we have decided on our linear algebra library and language of choice, how can we dynamically express the neural network in Python, but get it to work in C++?</p>
<h3 id="python-to-c-transpiler">Python to C++ Transpiler</h3>
<p>As a primary usability goal, we wanted the end-user to not spend too much time learning about our framework, but rather use their familiar Python Keras/MXNet/Tensorflow/Pytorch high-level layers, which are all similar in API. Just like how MXNet compiles its Python graph into NNVM, and Tensorflow compiles its graph into LLVM, we compile our Python graph into C++.</p>
<p>For example, we can transpile a graph in PyTorch that looks like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ConvNet</span><span class="p">(</span>
<span class="p">(</span><span class="n">conv</span><span class="p">):</span> <span class="n">Sequential</span><span class="p">(</span>
<span class="p">(</span><span class="mi">0</span><span class="p">):</span> <span class="n">BinaryConvolution2d</span><span class="p">()</span>
<span class="p">(</span><span class="mi">1</span><span class="p">):</span> <span class="n">ReLU</span><span class="p">()</span>
<span class="p">(</span><span class="mi">2</span><span class="p">):</span> <span class="n">MaxPool2d</span><span class="p">(</span><span class="n">kernel_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">dilation</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ceil_mode</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="p">(</span><span class="mi">3</span><span class="p">):</span> <span class="n">BatchNorm2d</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-05</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">affine</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">track_running_stats</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">(</span><span class="mi">4</span><span class="p">):</span> <span class="n">BinaryConvolution2d</span><span class="p">()</span>
<span class="p">(</span><span class="mi">5</span><span class="p">):</span> <span class="n">ReLU</span><span class="p">()</span>
<span class="p">(</span><span class="mi">6</span><span class="p">):</span> <span class="n">MaxPool2d</span><span class="p">(</span><span class="n">kernel_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">dilation</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ceil_mode</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="p">(</span><span class="mi">7</span><span class="p">):</span> <span class="n">BatchNorm2d</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-05</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">affine</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">track_running_stats</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">(</span><span class="n">fc</span><span class="p">):</span> <span class="n">BinaryLinear</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div></div>
<p>into some C++ code that looks like:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="k">extern</span> <span class="s">"C"</span> <span class="p">{</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__0BinaryConv2d__weight</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__0BinaryConv2d__weight"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__0BinaryConv2d__bias</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__0BinaryConv2d__bias"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__gamma</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__3BatchNorm__gamma"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__beta</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__3BatchNorm__beta"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__running_mean</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__3BatchNorm__running_mean"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__running_var</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__3BatchNorm__running_var"</span><span class="p">);</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">float</span> <span class="n">Block__3BatchNorm__epsilon</span> <span class="o">=</span> <span class="mf">1e-05</span><span class="p">;</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__4BinaryConv2d__weight</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__4BinaryConv2d__weight"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__4BinaryConv2d__bias</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__4BinaryConv2d__bias"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__gamma</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__7BatchNorm__gamma"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__beta</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__7BatchNorm__beta"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__running_mean</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__7BatchNorm__running_mean"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__running_var</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__7BatchNorm__running_var"</span><span class="p">);</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">float</span> <span class="n">Block__7BatchNorm__epsilon</span> <span class="o">=</span> <span class="mf">1e-05</span><span class="p">;</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__9BinaryDense__weight</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__9BinaryDense__weight"</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__9BinaryDense__bias</span> <span class="o">=</span> <span class="n">getParameter</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="s">"Block__9BinaryDense__bias"</span><span class="p">);</span>
<span class="n">xt</span><span class="o">::</span><span class="n">xarray</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">transform</span><span class="p">(</span><span class="k">const</span> <span class="n">xt</span><span class="o">::</span><span class="n">xarray</span><span class="o"><</span><span class="kt">float</span><span class="o">>&</span> <span class="n">batch</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&</span> <span class="n">prefix</span><span class="p">){</span>
<span class="n">xt</span><span class="o">::</span><span class="n">xarray</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">result</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">in_shape</span> <span class="o">=</span> <span class="n">batch</span><span class="p">.</span><span class="n">shape</span><span class="p">()[</span><span class="mi">0</span><span class="p">];</span>
<span class="k">auto</span> <span class="n">rows</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">in_shape</span><span class="p">,</span> <span class="k">decltype</span><span class="p">(</span><span class="n">in_shape</span><span class="p">)(</span><span class="mi">60000</span><span class="p">));</span>
<span class="n">result</span><span class="p">.</span><span class="n">resize</span><span class="p">({</span> <span class="n">rows</span><span class="p">,</span> <span class="mi">10</span> <span class="p">});</span>
<span class="cp">#pragma omp parallel for
</span> <span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">rows</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">){</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">data</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">range</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">));</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__0BinaryConv2d__binConv2d</span> <span class="o">=</span> <span class="n">bighead</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">conv</span><span class="o">::</span><span class="n">binConv2d</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">Block__0BinaryConv2d__weight</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__0BinaryConv2d__binConv2d_bias</span> <span class="o">=</span> <span class="n">Block__0BinaryConv2d__binConv2d</span> <span class="o">+</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__0BinaryConv2d__bias</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__1ReLU__relu</span> <span class="o">=</span> <span class="n">Block__0BinaryConv2d__binConv2d_bias</span> <span class="o">*</span> <span class="n">xt</span><span class="o">::</span><span class="n">cast</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="mi">0</span> <span class="o"><</span> <span class="n">Block__0BinaryConv2d__binConv2d_bias</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__2MaxPool2d__maxPool2d</span> <span class="o">=</span> <span class="n">bighead</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">pool</span><span class="o">::</span><span class="n">maxPool2d</span><span class="p">(</span><span class="n">Block__1ReLU__relu</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__batchNormgamma</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__3BatchNorm__gamma</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__batchNormbeta</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__3BatchNorm__beta</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__batchNormrunning_mean</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__3BatchNorm__running_mean</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__batchNormrunning_var</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__3BatchNorm__running_var</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__3BatchNorm__batchNorm</span> <span class="o">=</span> <span class="n">Block__3BatchNorm__batchNormgamma</span> <span class="o">*</span> <span class="p">(</span><span class="n">Block__2MaxPool2d__maxPool2d</span> <span class="o">-</span> <span class="n">Block__3BatchNorm__batchNormrunning_mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">xt</span><span class="o">::</span><span class="n">sqrt</span><span class="p">(</span><span class="n">Block__3BatchNorm__batchNormrunning_var</span> <span class="o">+</span> <span class="n">Block__3BatchNorm__epsilon</span><span class="p">)</span> <span class="o">+</span> <span class="n">Block__3BatchNorm__batchNormbeta</span><span class="p">;</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__4BinaryConv2d__binConv2d</span> <span class="o">=</span> <span class="n">bighead</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">conv</span><span class="o">::</span><span class="n">binConv2d</span><span class="p">(</span><span class="n">Block__3BatchNorm__batchNorm</span><span class="p">,</span> <span class="n">Block__4BinaryConv2d__weight</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__4BinaryConv2d__binConv2d_bias</span> <span class="o">=</span> <span class="n">Block__4BinaryConv2d__binConv2d</span> <span class="o">+</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__4BinaryConv2d__bias</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__5ReLU__relu</span> <span class="o">=</span> <span class="n">Block__4BinaryConv2d__binConv2d_bias</span> <span class="o">*</span> <span class="n">xt</span><span class="o">::</span><span class="n">cast</span><span class="o"><</span><span class="kt">float</span><span class="o">></span><span class="p">(</span><span class="mi">0</span> <span class="o"><</span> <span class="n">Block__4BinaryConv2d__binConv2d_bias</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__6MaxPool2d__maxPool2d</span> <span class="o">=</span> <span class="n">bighead</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">pool</span><span class="o">::</span><span class="n">maxPool2d</span><span class="p">(</span><span class="n">Block__5ReLU__relu</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__batchNormgamma</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__7BatchNorm__gamma</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__batchNormbeta</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__7BatchNorm__beta</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__batchNormrunning_mean</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__7BatchNorm__running_mean</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__batchNormrunning_var</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__7BatchNorm__running_var</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">());</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__7BatchNorm__batchNorm</span> <span class="o">=</span> <span class="n">Block__7BatchNorm__batchNormgamma</span> <span class="o">*</span> <span class="p">(</span><span class="n">Block__6MaxPool2d__maxPool2d</span> <span class="o">-</span> <span class="n">Block__7BatchNorm__batchNormrunning_mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">xt</span><span class="o">::</span><span class="n">sqrt</span><span class="p">(</span><span class="n">Block__7BatchNorm__batchNormrunning_var</span> <span class="o">+</span> <span class="n">Block__7BatchNorm__epsilon</span><span class="p">)</span> <span class="o">+</span> <span class="n">Block__7BatchNorm__batchNormbeta</span><span class="p">;</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__8Flatten__reshape</span> <span class="o">=</span> <span class="n">xt</span><span class="o">::</span><span class="n">eval</span><span class="p">(</span><span class="n">Block__7BatchNorm__batchNorm</span><span class="p">);</span>
<span class="n">Block__8Flatten__reshape</span><span class="p">.</span><span class="n">reshape</span><span class="p">({</span> <span class="n">Block__8Flatten__reshape</span><span class="p">.</span><span class="n">shape</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">512</span> <span class="p">});</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__9BinaryDense__xnorAffine</span> <span class="o">=</span> <span class="n">bighead</span><span class="o">::</span><span class="n">linalg</span><span class="o">::</span><span class="n">xnor</span><span class="o">::</span><span class="n">xnormatmul</span><span class="p">(</span><span class="n">Block__8Flatten__reshape</span><span class="p">,</span> <span class="n">Block__9BinaryDense__weight</span><span class="p">);</span>
<span class="k">auto</span><span class="o">&&</span> <span class="n">Block__9BinaryDense__xnorAffine_bias</span> <span class="o">=</span> <span class="n">Block__9BinaryDense__xnorAffine</span> <span class="o">+</span> <span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">Block__9BinaryDense__bias</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">newaxis</span><span class="p">(),</span> <span class="n">xt</span><span class="o">::</span><span class="n">all</span><span class="p">());</span>
<span class="n">xt</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">xt</span><span class="o">::</span><span class="n">range</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">forward</span><span class="o"><</span><span class="k">decltype</span><span class="p">(</span><span class="n">Block__9BinaryDense__xnorAffine_bias</span><span class="p">)</span><span class="o">></span><span class="p">(</span><span class="n">Block__9BinaryDense__xnorAffine_bias</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="p">}</span> <span class="c1">// extern "C"</span>
</code></pre></div></div>
<p>One advantage that a C++ transpiler has over something like tensorflow’s LLVM compiler is that we can easily inspect the C++ code for bugs during a model-building iteration cycle.</p>
<p>Because C++ has a lower barrier to entry, others can implement layers that emit C++ code as long as they know how to use the linear algebra backend (e.g. <code class="language-plaintext highlighter-rouge">xtensor</code>). Since a layer can technically emit anything, fellow Python/C++ programmers can extend the backend emitter beyond deep learning applications to other parts of the data processing pipeline. In the future, we hope to compile all parts of the bighead pipeline into C++ wherever possible, so we can offer blazing fast performance after a single <code class="language-plaintext highlighter-rouge">fit()</code> call.</p>
<h2 id="custom-xnor-blas-functionalities">Custom XNOR BLAS functionalities</h2>
<h3 id="xnormatmul"><code class="language-plaintext highlighter-rouge">xnormatmul</code></h3>
<p><code class="language-plaintext highlighter-rouge">bighead::linalg::xnor::xnormatmul</code> is our level 3 BLAS routine that takes in any lazily evaluated <code class="language-plaintext highlighter-rouge">xt::xexpression<T></code> matrix <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">xt::xexpression<O></code> matrix <code class="language-plaintext highlighter-rouge">B</code>. The multiplication between the two is done via special compiler intrinsics. For our purposes, we implemented it with vectorized instructions in mind, specifically <code class="language-plaintext highlighter-rouge">AVX2</code>. It consists of 3 stages:</p>
<ol>
<li><em>quantization stage</em> - taking the floating point numbers inside a matrix and compressing each of them into a single sign bit (<code class="language-plaintext highlighter-rouge">0</code> for -1, <code class="language-plaintext highlighter-rouge">1</code> for 1).</li>
<li><em>multiplication stage</em> - instead of performing floating point multiplications, we use vectorized XNOR on all elements.</li>
<li><em>accumulation stage</em> - perform a popcount (pop stands for population) on the resulting xnor’d vector and write it into the resulting matrix.</li>
</ol>
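<p>The three stages above can be sketched in plain Python for a single pair of vectors. This is only an illustration, assuming entries are quantized by sign (zero counts as +1); the helper names <code class="language-plaintext highlighter-rouge">pack_signs</code> and <code class="language-plaintext highlighter-rouge">xnor_dot</code> are hypothetical and are not part of the Bighead API. Each matching bit contributes +1 and each mismatch contributes -1, so the dot product of two length-$n$ sign vectors is $2 \cdot \text{popcount} - n$.</p>

```python
def pack_signs(values):
    """Quantization: pack signs into an int (bit i is 1 if values[i] >= 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def xnor_dot(a, b):
    """Dot product of sign(a) and sign(b) using XNOR and popcount."""
    n = len(a)
    assert n == len(b)
    mask = (1 << n) - 1  # keep only the n meaningful bits after the bit flip
    # Multiplication: XNOR is 1 wherever the two sign bits agree.
    agree = ~(pack_signs(a) ^ pack_signs(b)) & mask
    # Accumulation: popcount counts agreements; the rest are disagreements.
    matches = bin(agree).count("1")
    return 2 * matches - n


a = [0.5, -1.2, 3.0, -0.1]
b = [1.0, 2.0, -4.0, -0.3]
print(xnor_dot(a, b))  # signs agree at positions 0 and 3, so 2 * 2 - 4 = 0
```

<p>The mask is only needed because Python integers are unbounded; with fixed-width registers the bit flip stays within the register.</p>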
<h4 id="floating-point-packing">Floating point packing</h4>
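<p>A floating point number consists of 3 parts: sign bit, exponent, fraction. In IEEE 754, the sign bit is 1 if the number is negative and 0 if the number is positive. If we can extract this bit from every floating point number, then our life would be very easy. (The raw sign bits are the inverse of the <code class="language-plaintext highlighter-rouge">0</code> for -1, <code class="language-plaintext highlighter-rouge">1</code> for 1 encoding, but XNOR gives the same result when both operands are inverted, so the convention does not change the product.)</p>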
<p>Luckily, we even have a vectorized instruction (available since AVX) for this:</p>
<p><code class="language-plaintext highlighter-rouge">_mm256_movemask_ps</code>, which roughly gets compiled to <code class="language-plaintext highlighter-rouge">vmovmskps <reg>, <ymm reg></code> where <code class="language-plaintext highlighter-rouge">reg</code> means register.</p>
<p>The above instruction takes the most significant (sign) bit of each of the eight single precision floating point numbers in the register and packs them into a single byte. This is exactly what we need in order to pack the data.</p>
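<p>To make the packing concrete, here is a small Python emulation of what the instruction computes. It is a sketch for illustration only: <code class="language-plaintext highlighter-rouge">sign_bit</code> and <code class="language-plaintext highlighter-rouge">movemask_ps</code> are hypothetical helper names, and the real work is done by the hardware instruction. In IEEE 754 the sign bit is 1 for negative numbers, and bit $i$ of the result comes from element $i$.</p>

```python
import struct


def sign_bit(x):
    """Return the IEEE 754 sign bit of x stored as a 32-bit float (1 means negative)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    return raw >> 31


def movemask_ps(floats):
    """Emulate the packing of eight floats: bit i is the sign bit of floats[i]."""
    assert len(floats) == 8
    mask = 0
    for i, x in enumerate(floats):
        mask |= sign_bit(x) << i
    return mask


# Elements 1, 3 and 6 are negative, so bits 1, 3 and 6 are set.
packed = movemask_ps([1.0, -2.0, 3.0, -4.0, 5.0, 6.0, -7.0, 8.0])
print(format(packed, "08b"))  # 01001010
```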
<p>We also make sure that the container we use to pack the data is aligned. We use the <code class="language-plaintext highlighter-rouge">xsimd::aligned_allocator<std::uint8_t, 32></code> to align the pointer of our array so that the condition <code class="language-plaintext highlighter-rouge">reinterpret_cast<std::uintptr_t>(ptr) % 32 == 0</code> always holds for any bitmap we use. We need aligned memory so that we can have <strong>safe loading for the aligned load instructions below</strong>. (Note that aligned vs. unaligned instructions for loading may have very marginal difference depending on the architecture, but we decided to force alignment regardless.)</p>
<h4 id="vectorized-xnors">Vectorized xnors</h4>
<p>Given the bit-packed data, XNOR is a bit-flip operation (<code class="language-plaintext highlighter-rouge">~x</code>) applied to the xor result (<code class="language-plaintext highlighter-rouge">a^b</code>), i.e. <code class="language-plaintext highlighter-rouge">~(a^b)</code>. We use <code class="language-plaintext highlighter-rouge">_mm256_load_si256</code> and <code class="language-plaintext highlighter-rouge">_mm256_store_si256</code> to move 256 bits in and out of registers, so that each of these operators works on 256 bits of data at once.</p>
<h4 id="accumulating-bits">Accumulating bits</h4>
<p>After xnor’ing the two operands, we can scan through the resulting matrix and return the total number of <code class="language-plaintext highlighter-rouge">1</code>’s we see. There is a routine called <code class="language-plaintext highlighter-rouge">popcount</code> that performs exactly what we want. However, a vectorized version only exists in AVX512, specifically <code class="language-plaintext highlighter-rouge">_mm512_popcnt_epi32</code> and its corresponding variants. In this case, we implement the Hamming weight (or Harley-Seal) algorithm, which performs the following to count the set bits in AVX2:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="c1">// For purposes of illustration, we write ints as binary literals (C++14)</span>
<span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mb">0b00100110</span><span class="p">;</span> <span class="c1">// result should be 3</span>
<span class="kt">int</span> <span class="n">temp0</span> <span class="o">=</span> <span class="n">x</span> <span class="o">&</span> <span class="mb">0b01010101</span><span class="p">;</span> <span class="c1">// 00000100</span>
<span class="kt">int</span> <span class="n">temp1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">>></span> <span class="mi">1</span><span class="p">)</span> <span class="o">&</span> <span class="mb">0b01010101</span><span class="p">;</span> <span class="c1">// 00010001</span>
<span class="kt">int</span> <span class="n">res</span> <span class="o">=</span> <span class="n">temp0</span> <span class="o">+</span> <span class="n">temp1</span><span class="p">;</span> <span class="c1">// 00010101 (four 2-bit counts)</span>
<span class="n">temp0</span> <span class="o">=</span> <span class="n">res</span> <span class="o">&</span> <span class="mb">0b00110011</span><span class="p">;</span> <span class="c1">// 00010001</span>
<span class="n">temp1</span> <span class="o">=</span> <span class="p">(</span><span class="n">res</span> <span class="o">>></span> <span class="mi">2</span><span class="p">)</span> <span class="o">&</span> <span class="mb">0b00110011</span><span class="p">;</span> <span class="c1">// 00000001</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">temp0</span> <span class="o">+</span> <span class="n">temp1</span><span class="p">;</span> <span class="c1">// 00010010 (two 4-bit counts)</span>
<span class="n">temp0</span> <span class="o">=</span> <span class="n">res</span> <span class="o">&</span> <span class="mb">0b00001111</span><span class="p">;</span> <span class="c1">// 00000010</span>
<span class="n">temp1</span> <span class="o">=</span> <span class="p">(</span><span class="n">res</span> <span class="o">>></span> <span class="mi">4</span><span class="p">)</span> <span class="o">&</span> <span class="mb">0b00001111</span><span class="p">;</span> <span class="c1">// 00000001</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">temp0</span> <span class="o">+</span> <span class="n">temp1</span><span class="p">;</span> <span class="c1">// 00000011 = 3 bits.</span>
</code></pre></div></div>
<p>The rationale behind the algorithm is that each round produces a container of partial counts: first <em>4 2-bit numbers, then 2 4-bit numbers, and finally 1 8-bit number</em>. Each field holds the number of set bits in its slice of the input, and adjacent fields are added pairwise until the single 8-bit number is the popcount. Each round costs 4 operations (two ANDs, one shift, one addition), so we use a total of 12 operations, all of which are inexpensive. We have vectorized instructions for addition, AND, and shifting, so we are effectively using 12 operations on 256 bits at once.</p>
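<p>The same idea extends to wider words by adding more rounds. Below is a minimal Python sketch for 64-bit words, which needs six rounds since $\log_2 64 = 6$. The function name <code class="language-plaintext highlighter-rouge">popcount64</code> is hypothetical, and this scalar version only mirrors the structure of the 8-bit walk-through above; it is not the vectorized Bighead routine.</p>

```python
def popcount64(x):
    """Bit-parallel popcount of a 64-bit unsigned integer."""
    assert 0 <= x < (1 << 64)
    x = (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555)   # 2-bit counts
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit counts
    x = (x & 0x0F0F0F0F0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F0F0F0F0F)   # 8-bit counts
    x = (x & 0x00FF00FF00FF00FF) + ((x >> 8) & 0x00FF00FF00FF00FF)   # 16-bit counts
    x = (x & 0x0000FFFF0000FFFF) + ((x >> 16) & 0x0000FFFF0000FFFF)  # 32-bit counts
    x = (x & 0x00000000FFFFFFFF) + ((x >> 32) & 0x00000000FFFFFFFF)  # 64-bit count
    return x


print(popcount64(0b00100110))  # 3, matching the walk-through above
```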
<h3 id="binconv2d"><code class="language-plaintext highlighter-rouge">binConv2d</code></h3>
<p>In the above transpilation, for example, we transpiled <code class="language-plaintext highlighter-rouge">BinaryConvolution2d</code> into <code class="language-plaintext highlighter-rouge">bighead::linalg::conv::binConv2d</code> in C++. All functions within the namespace <code class="language-plaintext highlighter-rouge">bighead</code> are implemented by the Bighead team. <code class="language-plaintext highlighter-rouge">binConv2d</code> is a function that performs convolutions with binary weights and inputs, like so:</p>
<p><img src="http://oneraynyday.github.io/assets/xnornet/xnorconv2d.png" alt="binconv" /></p>
<p>In the XNORNet paper, the authors suggested taking the average magnitude of each convolution input and multiplying the resulting filters element-wise against the magnitudes. However, the authors of <code class="language-plaintext highlighter-rouge">DoReFa</code> networks and <code class="language-plaintext highlighter-rouge">ReBNetworks</code> show that a single scalar, the average magnitude of the entire input, multiplied via broadcasting works just as well. We adopt the latter approach in our implementation.</p>
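<p>As a rough illustration of the single-scalar approach, the sketch below uses a 1-D dot product in place of one convolution window. The function name <code class="language-plaintext highlighter-rouge">scaled_binary_dot</code> is hypothetical and this is not the Bighead implementation; it only shows how one scalar, the mean absolute value of the whole input, rescales a purely binary result.</p>

```python
def scaled_binary_dot(inputs, weights):
    """Approximate dot(inputs, sign(weights)) using one scalar scale for the inputs."""
    n = len(inputs)
    assert n == len(weights) and n > 0
    # Single scalar: average magnitude over the entire input.
    scale = sum(abs(x) for x in inputs) / n
    # Binary part: +1 where the signs agree, -1 where they differ.
    binary = sum(1 if (x >= 0) == (w >= 0) else -1 for x, w in zip(inputs, weights))
    return scale * binary


# All inputs have magnitude 2, so the approximation is exact here:
# 2 + 2 - 2 + 2 = 4.
print(scaled_binary_dot([2.0, -2.0, 2.0, -2.0], [1.0, -1.0, -1.0, -1.0]))
```

<p>When the input magnitudes vary, the result is only an approximation of the real-valued product.</p>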
Reinforcement Learning - Monte Carlo Methods2018-05-24T00:00:00+00:00http://oneraynyday.github.io/ml/2018/05/24/Reinforcement-Learning-Monte-Carlo<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#first-visit-monte-carlo" id="markdown-toc-first-visit-monte-carlo">First-visit Monte Carlo</a></li>
</ul>
</li>
<li><a href="#monte-carlo-action-values" id="markdown-toc-monte-carlo-action-values">Monte Carlo Action Values</a></li>
<li><a href="#monte-carlo-control" id="markdown-toc-monte-carlo-control">Monte Carlo Control</a> <ul>
<li><a href="#exploring-starts" id="markdown-toc-exploring-starts">Exploring Starts</a></li>
<li><a href="#on-policy-epsilon-greedy-policies" id="markdown-toc-on-policy-epsilon-greedy-policies">On-Policy: $\epsilon$-Greedy Policies</a> <ul>
<li><a href="#epsilon-greedy-convergence" id="markdown-toc-epsilon-greedy-convergence">$\epsilon$-Greedy Convergence</a></li>
</ul>
</li>
<li><a href="#off-policy-importance-sampling" id="markdown-toc-off-policy-importance-sampling">Off-policy: Importance Sampling</a> <ul>
<li><a href="#off-policy-notations" id="markdown-toc-off-policy-notations">Off-policy Notations</a></li>
<li><a href="#ordinary-importance-sampling" id="markdown-toc-ordinary-importance-sampling">Ordinary Importance Sampling</a></li>
<li><a href="#weighted-importance-sampling" id="markdown-toc-weighted-importance-sampling">Weighted Importance Sampling</a></li>
<li><a href="#incremental-implementation" id="markdown-toc-incremental-implementation">Incremental Implementation</a></li>
<li><a href="#extra-discount-aware-importance-sampling" id="markdown-toc-extra-discount-aware-importance-sampling">Extra: Discount-aware Importance Sampling</a></li>
<li><a href="#extra-per-reward-importance-sampling" id="markdown-toc-extra-per-reward-importance-sampling">Extra: Per-reward Importance Sampling</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#on-policy-model-in-python" id="markdown-toc-on-policy-model-in-python">On-Policy Model in Python</a> <ul>
<li><a href="#example-blackjack" id="markdown-toc-example-blackjack">Example: Blackjack</a></li>
<li><a href="#example-cliff-walking" id="markdown-toc-example-cliff-walking">Example: Cliff Walking</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<p>Previously, we discussed <a href="https://oneraynyday.github.io/ml/2018/05/06/Reinforcement-Learning-MDPs/"><strong>Markov decision processes</strong></a>, and algorithms to find the optimal action-value function $q_*(s, a)$ and state-value function $v_*(s)$. We used <strong>policy iteration and value iteration to solve for the optimal policy.</strong></p>
<p>It’s nice and all to have dynamic programming solutions to reinforcement learning, but it comes with <em>many</em> restrictions. For example, are there a lot of real world problems where you know the state transition probabilities? Can you arbitrarily start at any state at the beginning? Is your MDP finite?</p>
<p>Well, I think you’ll be glad to know that Monte Carlo methods, a classic way to approximate difficult probability distributions, can handle all of your worries associated with dynamic programming solutions!</p>
<p>Once again, we will be following Sutton’s RL book<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, with extra explanation and examples that the book does not offer.</p>
<h1 id="introduction">Introduction</h1>
<blockquote>
<p><em><strong>Monte Carlo</strong> simulations are named after the gambling hot spot in Monaco, since chance and random outcomes are central to the modeling technique, much as they are to games like roulette, dice, and slot machines.</em></p>
</blockquote>
<p>Monte Carlo methods look at the problem in a completely different way from dynamic programming. They ask the question: <em>How many samples do I need to take from our environment to discern a good policy from a bad policy?</em></p>
<p>This time, we’ll reintroduce the idea of <strong>returns</strong>, which is the long-run cumulative reward:</p>
\[G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots\]
<p>If an episode has a non-zero probability of lasting forever, we use a discounting factor $\gamma$ to keep the sum finite:</p>
\[G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}\]
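<p>For a finite episode, the discounted return can be computed in one backward pass using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The snippet below is a minimal sketch; the function name <code class="language-plaintext highlighter-rouge">discounted_returns</code> and the convention that <code class="language-plaintext highlighter-rouge">rewards[t]</code> stores $R_{t+1}$ are assumptions made for this example.</p>

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# G_2 = 2, G_1 = 0 + 0.5 * 2 = 1, G_0 = 1 + 0.5 * 1 = 1.5
print(discounted_returns([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```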
<p>We associate these returns $G_t$ with the states $S_t$ they start from, and average them to estimate the value function:</p>
\[V(s) = E[G_t \mid S_t=s] \approx \frac{1}{N}\sum_{i:\,S^i_t=s} G^i_t\]
<p>Here $i$ indexes the simulations, and $N$ is the number of simulations in which $S^i_t = s$. By the law of large numbers, the average converges to the exact expectation as $N$ approaches $\infty$.</p>
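<p>The sample average above amounts to grouping observed returns by the state they started from and taking the mean of each group. The following sketch assumes the returns have already been collected as <code class="language-plaintext highlighter-rouge">(state, return)</code> pairs; the function name <code class="language-plaintext highlighter-rouge">mc_value_estimate</code> and the toy data are made up for illustration.</p>

```python
from collections import defaultdict


def mc_value_estimate(samples):
    """Average observed returns per state; samples is an iterable of (state, return)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, g in samples:
        totals[state] += g
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


samples = [("a", 1.0), ("b", 0.0), ("a", 3.0), ("b", 4.0), ("a", 2.0)]
values = mc_value_estimate(samples)
print(values)  # {'a': 2.0, 'b': 2.0}
```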
<p>Now if this is an MDP (which 99% of reinforcement learning problems are), then we know it exhibits the <strong>Strong Markov Property</strong>, which means:</p>
\[P(X_{T+k} = j | X_T = y) = P(X_k=j|X_0=y) = p^k(y,j)\]
<p>With this, we can easily derive the fact that $t$ in the expectation is completely irrelevant, and we will be using $G_s$ from now on to denote the return starting from some state $s$ (shifting that state to $t = 0$).</p>
<h2 id="first-visit-monte-carlo">First-visit Monte Carlo</h2>
<p>A classic way to estimate the value function is to sample the return following the first occurrence of $s$ in each episode, called <strong>first-visit MC prediction</strong>. Then a valid algorithm to estimate $V$ under the policy $\pi$ is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># init_pi, generate_episode, NUM_ITER and GAMMA are placeholders you supply.
</span><span class="n">pi</span> <span class="o">=</span> <span class="n">init_pi</span><span class="p">()</span>
<span class="n">returns</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">NUM_ITER</span><span class="p">):</span>
    <span class="n">episode</span> <span class="o">=</span> <span class="n">generate_episode</span><span class="p">(</span><span class="n">pi</span><span class="p">)</span> <span class="c1"># (1)
</span>    <span class="n">G</span> <span class="o">=</span> <span class="p">{}</span> <span class="c1"># return per state visited in this episode
</span>    <span class="n">prev_reward</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="n">episode</span><span class="p">):</span>
        <span class="n">reward</span> <span class="o">+=</span> <span class="n">GAMMA</span> <span class="o">*</span> <span class="n">prev_reward</span>
        <span class="c1"># backing up overwrites later visits to the state,
</span>        <span class="c1"># so we end up with the first-visit return.
</span>        <span class="n">G</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">reward</span>
        <span class="n">prev_reward</span> <span class="o">=</span> <span class="n">reward</span>
    <span class="c1"># only record states that were actually visited.
</span>    <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">G</span><span class="p">:</span>
        <span class="n">returns</span><span class="p">[</span><span class="n">state</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">G</span><span class="p">[</span><span class="n">state</span><span class="p">])</span>
<span class="n">V</span> <span class="o">=</span> <span class="p">{</span> <span class="n">state</span> <span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span> <span class="k">for</span> <span class="n">state</span><span class="p">,</span> <span class="n">ret</span> <span class="ow">in</span> <span class="n">returns</span><span class="p">.</span><span class="n">items</span><span class="p">()</span> <span class="p">}</span>
</code></pre></div></div>
<p>Another method is called <strong>every-visit MC prediction</strong>, where you sample the return of every occurrence of $s$ in every episode. The estimate converges quadratically in both cases to the expectation.</p>
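<p>For comparison, here is a minimal sketch of every-visit prediction. It assumes each episode is given as a list of <code class="language-plaintext highlighter-rouge">(state, reward)</code> pairs in chronological order; the function name and that input format are illustrative assumptions:</p>

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the return that follows every visit to s.

    episodes: list of episodes, each a chronological list of (state, reward)
    pairs, where reward is the reward received after leaving that state.
    """
    totals = defaultdict(float)  # sum of sampled returns per state
    counts = defaultdict(int)    # number of visits per state
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Unlike first-visit, every occurrence contributes a sample.
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


V = every_visit_mc([[("a", 1), ("b", 0), ("a", 2)]])
```

<p>The only difference from the first-visit version is that we accumulate a sample on every occurrence instead of letting the earliest visit overwrite the later ones.</p>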
<h1 id="monte-carlo-action-values">Monte Carlo Action Values</h1>
<p>Sometimes we don’t know the model of the environment, that is, which actions lead to which states and how the environment responds in general. In this case, we can use action values instead of state values, which means we solve for $q_*$.</p>
<p>We want to estimate $q_\pi(s,a)$, instead of $v_\pi(s)$. A simple change of <code class="language-plaintext highlighter-rouge">G[s]</code> to <code class="language-plaintext highlighter-rouge">G[s,a]</code> seems appropriate, and it is. One obvious issue is that we went from the space $\mathcal{S}$ to $\mathcal{S} \times \mathcal{A}$, which is much larger, and we still need to sample it to find the expected return of each state-action pair.</p>
<p>Another issue is that as the search space grows, it becomes increasingly likely that <strong>we might not be able to explore all state-action pairs if we become greedy w.r.t. our policy too quickly.</strong> We need a proper mix of <em>exploration</em> and <em>exploitation</em>. We will explain how we can overcome this problem in the next section.</p>
<h1 id="monte-carlo-control">Monte Carlo Control</h1>
<p>Recall <strong>policy iteration</strong> from MDPs. In this case, it’s not too different. We still fix our $\pi$, find $q_\pi$, then find a new $\pi'$, and so on. The general process looks like:</p>
\[\pi_0 \to q_{\pi_0} \to \pi_1 \to q_{\pi_1} \to \pi_2 \to q_{\pi_2} \to \dots \to \pi_n \to q_{\pi_n}\]
<p>We find the $q_{\pi_i}$’s in similar fashion to how we find our $v$’s above. We can improve our $\pi$’s by the Bellman optimality equation, simply taking:</p>
\[\pi(s) = \arg\max_a q(s,a)\]
<p>For more details on this, please refer to the previous MDP blog.</p>
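<p>The greedy improvement step is tiny in code. This is a sketch that assumes $Q$ is stored as a dict of dicts (state to action to value); the helper name <code class="language-plaintext highlighter-rouge">greedy_policy</code> is hypothetical:</p>

```python
def greedy_policy(Q):
    """Return the policy pi(s) = argmax_a Q[s][a].

    Q: dict mapping state -> dict mapping action -> estimated action value.
    Ties are broken in favor of the first action encountered.
    """
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in Q.items()
    }


Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}
pi = greedy_policy(Q)
```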
<p>Now, the crux of policy iteration in the context of Monte Carlo methods is, as we said, <strong>how do we balance exploration and exploitation?</strong></p>
<h2 id="exploring-starts">Exploring Starts</h2>
<p>One way to handle exploration of a large state space is to start each episode from a specified state and action, cycling round-robin through all possibilities to sample their returns. <strong>This assumes that we can start at any state and take any possible action at the start of an episode</strong>, which is not a reasonable assumption in many situations. However, for problems like Blackjack, this is totally reasonable, which means we can have an easy fix to our problem.</p>
<p>In code, we just need a quick patch to our previous code at <code class="language-plaintext highlighter-rouge">(1)</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Before (Start at some arbitrary s_0, a_0)
</span><span class="n">episode</span> <span class="o">=</span> <span class="n">generate_episode</span><span class="p">(</span><span class="n">pi</span><span class="p">)</span>
<span class="c1"># After (Start at some specific s, a)
</span><span class="n">episode</span> <span class="o">=</span> <span class="n">generate_episode</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span> <span class="c1"># loop through s, a at every iteration.
</span></code></pre></div></div>
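<p>One possible way to produce the round-robin <code class="language-plaintext highlighter-rouge">s, a</code> pairs is sketched below. The helper <code class="language-plaintext highlighter-rouge">exploring_starts</code> and the toy state and action lists are assumptions for illustration only:</p>

```python
from itertools import cycle, product


def exploring_starts(states, actions):
    """Cycle through every (state, action) start pair, round-robin, forever."""
    return cycle(product(states, actions))


# Hypothetical usage: pull the next start pair before generating each episode.
starts = exploring_starts(states=[0, 1], actions=["hit", "stick"])
first_five = [next(starts) for _ in range(5)]
```

<p>Each call to <code class="language-plaintext highlighter-rouge">next(starts)</code> gives the pair to pass into the episode generator, so every state-action pair is visited equally often as a starting point.</p>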
<h2 id="on-policy-epsilon-greedy-policies">On-Policy: $\epsilon$-Greedy Policies</h2>
<p>So what if we <strong>can’t</strong> assume that we can start at any arbitrary state and take arbitrary actions? Well, then we can still guarantee convergence as long as we’re not too greedy and explore all states infinitely many times, right?</p>
<p>The above is essentially one of the main properties of on-policy methods. An <strong>on-policy method tries to improve the policy that is currently running the trials</strong>, whereas an <strong>off-policy method tries to improve a different policy than the one running the trials</strong>.</p>
<p>Now with that said, we need to formalize “not too greedy”. One easy way to do this is to use what we learned in k-armed bandits: $\epsilon$-greedy methods! To recap, with probability $\epsilon$ we pick from a uniform distribution over all actions given the state, and with probability $1-\epsilon$ we pick the $\arg\max_a q(s,a)$ action.</p>
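<p>As a quick sketch of what those probabilities look like for one state, assuming the action values are given as a plain list indexed by action (the function name is hypothetical):</p>

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for every action under an epsilon-greedy policy.

    q_values: list of action values q(s, a) for a single state s.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    # Every action gets the uniform exploration share epsilon / n ...
    probs = [epsilon / n] * n
    # ... and the greedy action additionally gets the remaining 1 - epsilon.
    probs[best] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([0.0, 1.0, 0.5], epsilon=0.3)
```

<p>Note that the greedy action ends up with probability $1-\epsilon+\frac{\epsilon}{|A(s)|}$, since it can also be picked by the uniform draw.</p>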
<p>Now we ask - does this converge to the optimal $\pi_*$ for Monte Carlo methods? The answer is <em>it will converge, but not to that policy</em>.</p>
<h3 id="epsilon-greedy-convergence">$\epsilon$-Greedy Convergence</h3>
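<p>We start off with $q_\pi$ for an $\epsilon$-soft policy $\pi$ (every action has probability at least $\frac{\epsilon}{|A(s)|}$), and let $\pi'$ be the $\epsilon$-greedy policy with respect to $q_\pi$.</p>
\[q_\pi(s,\pi'(s)) = \sum_a \pi'(a|s) E_\pi(G|S=s, A=a) = \sum_a \pi'(a|s)q_\pi(s,a) \quad{\text{(1) (definition)}}\]
\[= \frac{\epsilon}{|A(s)|}\sum_a q_\pi(s, a) + (1-\epsilon) \max_a q_\pi(s,a)\]
\[\geq \frac{\epsilon}{|A(s)|}\sum_a q_\pi(s, a) + (1-\epsilon) \sum_a \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon} q_\pi(s,a)\]
\[= \sum_a \pi(a|s) q_\pi(s,a) = v_\pi(s)\]
<p>The inequality holds because the weights $\frac{\pi(a|s) - \epsilon/|A(s)|}{1-\epsilon}$ are nonnegative and sum to 1, and a maximum is at least as large as any weighted average. Once again, we reach the statement that this $\epsilon$-greedy policy, like the greedy policy, performs monotonic improvements over $v_\pi$. If we back up on all time steps, then we get:</p>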
\[v_{\pi'}(s) \geq v_\pi(s)\]
<p>Which is what we wanted for convergence. $\blacksquare$</p>
<p>However, we need to find out what this policy can actually converge to. Obviously, since our policy is forced to be stochastic even if the optimal policy is deterministic, it’s not guaranteed to converge to $\pi_*$. However, we can reframe our problem:</p>
<p>Suppose instead of having our policy hold the stochasticity of uniformly choosing actions with probability $\epsilon$, it is rather <em>the environment that randomly picks an action regardless of what our policy dictates.</em> Then, we can guarantee an optimal solution. The outline of the proof is to show that if the equality holds in $(1)$, then we have $\pi = \pi'$ and thus $v_\pi = v_{\pi'}$, and the equation is optimal under stochasticity due to the environment.</p>
<h2 id="off-policy-importance-sampling">Off-policy: Importance Sampling</h2>
<h3 id="off-policy-notations">Off-policy Notations</h3>
<p>Let’s introduce some new terms!</p>
<ul>
<li>$\pi$ is our <strong>target policy</strong>. We are trying to optimize this one’s expected return.</li>
<li>$b$ is our <strong>behavior policy</strong>. We are using $b$ to generate the data that $\pi$ will use later.</li>
<li>$\pi(a\vert s) > 0 \implies b(a\vert s) > 0 \forall a\in \mathcal{A}$ . This is the notion of <strong>coverage</strong>.</li>
</ul>
<p>Off-policy methods usually involve two or more policies, one of which generates the data that another tries to optimize upon. We call them the <strong>behavior policy</strong> and the <strong>target policy</strong> respectively. Off-policy methods are “fancier” than on-policy methods, like how neural nets are “fancier” than linear models. Similarly, off-policy methods are often more powerful, at the cost of higher variance and slower convergence.</p>
<hr />
<p>Now, let’s talk about <strong>importance sampling</strong>.</p>
<p>Importance sampling answers the question: <em>“Given $E_b[G]$, what is $E_\pi[G]$?”</em>. In other words, how can you use the information you get from $b$’s sampling to determine the expected result from $\pi$?</p>
<p>One intuitive way you can think about it is: “If $b$ chooses $a$ a lot, and $\pi$ chooses $a$ a lot, then $b$’s behavior should be important to determine $\pi$’s behavior!”, and conversely: “If $b$ chooses $a$ a lot, and $\pi$ does not choose $a$ ever, then $b$’s behavior on $a$ should not have <strong>any</strong> importance towards $\pi$’s behavior on $a$.” Makes sense, right?</p>
<p>So that’s pretty much what the <strong>importance-sampling ratio</strong> is. Given a trajectory $\{(S_i, A_i)\}_{i=t}^{T-1}$ that ends in the terminal state $S_T$, the probability of this exact trajectory happening under policy $\pi$ (given the starting state $S_t$) is:</p>
\[P_\pi(\{(S_i, A_i)\}_{i=t}^{T-1}) = \prod_{i=t}^{T-1} \pi(A_i|S_i)p(S_{i+1}|S_i, A_i)\]
<p>The ratio between $\pi$ and $b$ is simply:</p>
\[\rho_{t:T-1} = \frac{P_\pi(\{(S_i, A_i)\}_{i=t}^{T-1})}{P_b(\{(S_i, A_i)\}_{i=t}^{T-1})}\]
\[= \frac{\prod_{i=t}^{T-1} \pi(A_i|S_i)p(S_{i+1}|S_i, A_i)}{\prod_{i=t}^{T-1} b(A_i|S_i)p(S_{i+1}|S_i, A_i)}\]
\[= \frac{\prod_{i=t}^{T-1} \pi(A_i|S_i)}{\prod_{i=t}^{T-1} b(A_i|S_i)}\]
<p>The transition probabilities $p$ cancel, so the ratio depends only on the two policies and not on the environment’s model.</p>
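<p>Computing the ratio for one trajectory is just a running product. The sketch below assumes the policies are functions of the form <code class="language-plaintext highlighter-rouge">policy(action, state)</code> returning a probability, and that coverage holds so $b$ never returns 0 for an action in the trajectory; the function name is hypothetical:</p>

```python
def importance_ratio(trajectory, pi, b):
    """Compute rho for a trajectory of (state, action) pairs.

    pi, b: functions (action, state) -> probability of that action.
    Assumes coverage: b(action, state) > 0 for every pair in the trajectory.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= pi(action, state) / b(action, state)
    return rho


def target(action, state):
    # Deterministic toy target policy: always picks action 0.
    return 1.0 if action == 0 else 0.0


def behavior(action, state):
    # Uniform toy behavior policy over two actions.
    return 0.5


rho_match = importance_ratio([(0, 0), (1, 0)], target, behavior)
rho_miss = importance_ratio([(0, 0), (1, 1)], target, behavior)
```

<p>A trajectory the target policy would always follow gets a ratio above 1, while a single action the target policy never takes zeroes out the whole trajectory.</p>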
<h3 id="ordinary-importance-sampling">Ordinary Importance Sampling</h3>
<p>Now, there are many ways to utilize this $\rho_{t:T-1}$ to give us a good estimate of $E_\pi[G]$. The most rudimentary way is to use something called <strong>ordinary importance sampling</strong>. Suppose we had sampled $N$ episodes:</p>
\[\{(S^1_i, A^1_i) \}_{i=0}^{T^1}, \{(S^2_i, A^2_i) \}_{i=0}^{T^2}, \dots , \{(S^N_i, A^N_i) \}_{i=0}^{T^N}\]
<p>and denote the first arrival time of $s$ as:</p>
\[\mathcal{T}^k(s) = \min\{i : S_i^k = s\}\]
<p>and we wanted to estimate $v_\pi(s)$, then we can use the empirical mean to estimate the value function using the <strong>first-visit method</strong>:</p>
\[v_\pi(s) \approx \frac{ \sum_{k=1}^N \rho_{\mathcal{T}^k(s) : T^k-1} G_{\mathcal{T}^k(s)}}{N}\]
<p>Of course, this is easily generalizable to an <strong>every-visit method</strong>, but I wanted to present the simplest form to get the gist. What this is saying is we need to weigh the returns of each episode differently, because <em>the trajectories that are more likely to occur under $\pi$ should be weighted more than the ones that are never going to occur.</em></p>
<p>This method of importance sampling is an <em>unbiased estimator</em>, but it suffers from <em>extreme variance problems</em>. Suppose the importance ratio $\rho_{\mathcal{T}^k(s):T^k-1}$ for some $k$-th episode is $1000$. That is huge, but can definitely happen. Does that mean the return will necessarily be $1000$ times larger? If we only have one episode, our estimate will be exactly that. Because the ratio is a product of per-step ratios, over long episodes it may either explode or vanish. This is a little concerning for the purposes of estimation.</p>
<h3 id="weighted-importance-sampling">Weighted Importance Sampling</h3>
<p>To reduce the variance, one easy, intuitive way is to normalize the estimate by dividing by the sum of all the importance ratios instead of by $N$ (kind of like a softmax function):</p>
\[v_\pi(s) \approx \frac{ \sum_{k=1}^N \rho_{\mathcal{T}^k(s) : T^k-1} G_{\mathcal{T}^k(s)}}{ \sum_{k=1}^N \rho_{\mathcal{T}^k(s) : T^k-1} }\]
<p>This is called <strong>weighted importance sampling</strong>. It is a <em>biased estimate</em> (with the bias asymptotically going to 0), but has reduced variance. Before, one could come up with a pathologically unbounded variance for the ordinary estimator, but each return here has a normalized weight of at most 1, so the estimate never leaves the range of the observed returns, which bounds the variance from above when returns are bounded. <strong>Sutton suggests always using weighted importance sampling in practice.</strong></p>
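<p>To see the difference between the two estimators on actual numbers, here is a small sketch. It assumes we have already computed one ratio and one return per episode; both function names are illustrative:</p>

```python
def ordinary_is(ratios, returns):
    """Ordinary importance sampling: divide by the number of episodes."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


def weighted_is(ratios, returns):
    """Weighted importance sampling: divide by the sum of the ratios."""
    total = sum(ratios)
    if total == 0:
        return 0.0  # no episode is possible under the target policy
    return sum(rho * g for rho, g in zip(ratios, returns)) / total


# One episode with a huge ratio: the ordinary estimate is blown up 1000x,
# while the weighted estimate is just the observed return.
single_ordinary = ordinary_is([1000.0], [1.0])
single_weighted = weighted_is([1000.0], [1.0])

# Three episodes, one of which is impossible under the target policy.
multi_weighted = weighted_is([2.0, 0.0, 2.0], [1.0, 5.0, 3.0])
```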
<h3 id="incremental-implementation">Incremental Implementation</h3>
<p>As with many sampling techniques, we can implement it incrementally. Suppose we use the weighted importance sampling method from the last section; then the estimate has the form:</p>
\[V_n = \frac{\sum_{k=1}^{n-1} W_kG_k}{\sum_{k=1}^{n-1} W_k}\]
<p>where $W_k$ is the weight of the $k$-th return $G_k$ (here, its importance-sampling ratio).</p>
<p>We want to form $V_{n+1}$ based off of $V_n$, which is very doable. Denote $C_n$ as $\sum_{k=1}^n W_k$, and we’ll keep this running sum updated as we go:</p>
\[V_{n+1} = \frac{\sum_{k=1}^{n} W_kG_k}{\sum_{k=1}^{n} W_k} = \frac{\sum_{k=1}^{n-1} W_kG_k + W_nG_n}{C_n} = \frac{\sum_{k=1}^{n-1} W_kG_k}{C_{n-1}} \frac{C_{n-1}}{C_n} + \frac{W_nG_n}{C_n} = V_n \frac{C_{n-1}}{C_n} + W_n \frac{G_n}{C_n}\]
\[= V_n \frac{C_n - W_n}{C_n} + \frac{W_nG_n}{C_n} = V_n + \frac{W_nG_n - V_nW_n}{C_n}\]
\[= V_n + \frac{W_n}{C_n}(G_n-V_n)\]
<p>And $C_n$’s update rule is pretty obvious: $C_{n+1} = C_n + W_{n+1}$, with $C_0 = 0$.</p>
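<p>The update rule translates almost directly into code. This is a minimal sketch; the class name <code class="language-plaintext highlighter-rouge">WeightedAverage</code> and its attributes are assumptions for illustration, and zero-weight samples are skipped to avoid dividing by zero:</p>

```python
class WeightedAverage:
    """Incrementally tracks V = sum(W_k * G_k) / sum(W_k)."""

    def __init__(self):
        self.value = 0.0       # current estimate V_n
        self.cum_weight = 0.0  # running sum of weights C_n

    def update(self, g, w):
        """Fold in one return g with weight w and return the new estimate."""
        if w == 0:
            return self.value  # a zero-weight sample changes nothing
        self.cum_weight += w
        # V <- V + (W / C) * (G - V)
        self.value += (w / self.cum_weight) * (g - self.value)
        return self.value


avg = WeightedAverage()
avg.update(g=1.0, w=2.0)
avg.update(g=3.0, w=2.0)
result = avg.update(g=10.0, w=4.0)
```

<p>The result matches the batch formula $(2 \cdot 1 + 2 \cdot 3 + 4 \cdot 10) / 8 = 6$, but we never had to store the individual returns.</p>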
<p>Now, this $V_n$ is our value function, but a <strong>very similar</strong> analog of this can also be applied to $Q_n$, which is our action value.</p>
<p>While we are updating the value function, we could also update our policy $\pi$. We can update our $\pi$ with the good old $\arg\max_a Q_\pi(s,a)$.</p>
<p><strong>Warning: Lots of math ahead. What we have right now is already good enough. This is approaching modern research topics.</strong></p>
<h3 id="extra-discount-aware-importance-sampling">Extra: Discount-aware Importance Sampling</h3>
<p>So far, we have counted returns, and sampled returns to get our estimates. However, we neglected the internal structure of $G$. It really is just a sum of discounted rewards, and we have failed to incorporate that into our ratio $\rho$. <strong>Discount-aware importance sampling</strong> treats $\gamma$ as the probability that the episode continues at each step, so $1-\gamma$ is the probability of termination. The termination time $T$ must then follow a geometric distribution $\sim \text{Geo}(1-\gamma)$:</p>
\[P(T = t) = \gamma^{t-1} (1-\gamma)\]
<p>And the full return can be considered an expectation over a <em>random number of random variables $R_t$</em>:</p>
\[G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma ^{T-t-1} R_{T}\]
<p>One can construct an arbitrary telescoping sum like so:</p>
\[\sum_{k=0}^{n-1} (1-\gamma)\gamma^k + \gamma^{n} = 1\]
<p>More generally, if we start $k$ at $x$ instead of $0$, the sum collapses to $\gamma^x$:</p>
\[\sum_{k=x}^{n-1}(1-\gamma)\gamma^k + \gamma^n = \gamma^x\]
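<p>If you don’t trust the algebra, the identity is easy to check numerically. A tiny sketch (the function name is made up for this check):</p>

```python
def telescoped(gamma, x, n):
    """Evaluate sum_{k=x}^{n-1} (1 - gamma) * gamma^k + gamma^n."""
    return sum((1 - gamma) * gamma**k for k in range(x, n)) + gamma**n


from_zero = telescoped(0.9, x=0, n=10)   # should be 1
from_three = telescoped(0.9, x=3, n=10)  # should be 0.9 ** 3
```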
<p>We substitute this identity into $G$. With $n = T-t-1$, the coefficient $\gamma^j$ of each $R_{t+j+1}$ becomes:</p>
\[G_t = \sum_{j=0}^{T-t-1} \left[\sum_{k=j}^{T-t-2} (1-\gamma)\gamma^k + \gamma^{T-t-1}\right] R_{t+j+1}\]
<p>Swapping the order of summation groups the rewards into undiscounted partial returns $G_{t:k} = R_{t+1} + R_{t+2} + \dots + R_k$:</p>
\[= (1-\gamma)\sum_{k=t+1}^{T-1}\gamma^{k-t-1}G_{t:k} + \gamma^{T-t-1}G_{t:T}\]
<p>This leads to the same coefficients of $1$, $\gamma$, $\gamma^2$, etc. on the $R_t$ terms. This means we can now decompose $G_t$ into parts and apply discounting on the importance sampling ratios.</p>
<p>Now, recall that before we had:</p>
\[v_\pi(s) \approx \frac{ \sum_{k=1}^N \rho_{\mathcal{T}^k(s) : T^k-1} G_{\mathcal{T}^k(s)}}{ \sum_{k=1}^N \rho_{\mathcal{T}^k(s) : T^k-1} } \quad{\text{(Weighted importance sampling)}}\]
<p>If we expanded $G$, we would have, for one of these numerators in the sum:</p>
\[\rho_{t : T-1} [(1-\gamma)\sum_{k=t+1}^{T-1}\gamma^{k-t-1}G_{t:k} + \gamma^{T-t-1}G_{t:T}]\]
<p>Notice how we are applying the <strong>same ratio to all of the returns</strong>. Some of these partial returns, such as $G_{t:t+1}$, are being multiplied by the importance ratio of the entire trajectory, which is not “correct” under the modeling assumption that discounting represents a probability of termination. Intuitively, we want $\rho_{t:t}$ for $G_{t:t+1}$, since only the action $A_t$ influences $R_{t+1}$, and that’s easy enough:</p>
\[(1-\gamma)\sum_{k=t+1}^{T-1}\rho_{t:k-1} \gamma^{k-t-1}G_{t:k} + \rho_{t : T-1} \gamma^{T-t-1}G_{t:T}\]
<p>Ah, much better! This way, each partial return gets its correct ratio. <strong>This helps a lot with the unbounded variance problem.</strong></p>
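<p>Here is a sketch of that discount-aware term for a single episode. It assumes we are given the rewards and the per-step ratios $\pi(A_i\vert S_i)/b(A_i\vert S_i)$ as two equally long lists starting at time $t$; the function name and input format are illustrative assumptions:</p>

```python
def discount_aware_return(rewards, ratios, gamma):
    """Discount-aware importance-sampled return for one episode.

    rewards[j] = R_{t+j+1} and ratios[j] = pi(A_{t+j}|S_{t+j}) / b(A_{t+j}|S_{t+j})
    for j = 0, ..., T-t-1.
    """
    n = len(rewards)  # n = T - t
    total = 0.0
    rho = 1.0   # running product of ratios, rho_{t:t+j}
    flat = 0.0  # running undiscounted partial return, G_{t:t+j+1}
    for j in range(n):
        rho *= ratios[j]
        flat += rewards[j]
        is_last = (j == n - 1)
        # Partial returns are weighted by (1 - gamma) * gamma^j; the full
        # return at the end of the episode is weighted by gamma^(T-t-1).
        weight = gamma**j if is_last else (1 - gamma) * gamma**j
        total += weight * rho * flat
    return total


# With all ratios equal to 1 this reduces to the ordinary discounted return.
on_policy = discount_aware_return([1, 2, 3], [1.0, 1.0, 1.0], gamma=0.5)
# With gamma = 1 only the full-trajectory term survives.
no_discount = discount_aware_return([1, 2, 3], [2.0, 0.5, 3.0], gamma=1.0)
```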
<h3 id="extra-per-reward-importance-sampling">Extra: Per-reward Importance Sampling</h3>
<p>Another way to mitigate the problematic $\rho$ and its variance issues is to decompose the discounted return $G_t$ into its individual rewards and do some analysis. Let’s look into $\rho_{t:T-1}G_t$:</p>
\[\rho_{t:T-1}G_t = \rho_{t:T-1} \left(\sum_{k=0}^{T-t-1} \gamma^kR_{t+k+1}\right)\]
<p>For each term, we have $\rho_{t:T-1}\gamma^kR_{t+k+1}$. Expanding $\rho$, we can see:</p>
\[\rho_{t:T-1}\gamma^kR_{t+k+1} = \frac{\prod_{i=t}^{T-1} \pi(A_i|S_i)}{\prod_{i=t}^{T-1} b(A_i|S_i)} \gamma^k R_{t+k+1}\]
<p>Taking the expectation without the constant $\gamma^k$:</p>
\[E_b (\rho_{t:T-1}R_{t+k+1}) = E_b\left(\frac{\prod_{i=t}^{T-1} \pi(A_i|S_i)}{\prod_{i=t}^{T-1} b(A_i|S_i)} R_{t+k+1}\right)\]
<p>Recall that $E(AB) = E(A)E(B)$ holds when $A$ and $B$ are independent. The reward $R_{t+k+1}$ depends only on what happened up to $S_{t+k}$ and $A_{t+k}$, so by the Markov property the ratios $\frac{\pi(A_i\vert S_i)}{b(A_i\vert S_i)}$ for $i \geq t+k+1$ concern actions taken after the reward was received. Conditioning on the history up to each of those steps, we can take them out one at a time and get:</p>
\[E_b\left(\frac{\prod_{i=t}^{t+k} \pi(A_i|S_i)}{\prod_{i=t}^{t+k} b(A_i|S_i)} R_{t+k+1}\right) \prod_{i=t+k+1}^{T-1} E_b\left(\frac{\pi(A_i|S_i)}{b(A_i|S_i)}\right)\]
<p>This may look extremely ugly, but one can observe that, for any state $S_i$:</p>
\[E_b\left(\frac{\pi(A_i|S_i)}{b(A_i|S_i)}\right) = \sum_a b(a|S_i) \frac{\pi(a|S_i)}{b(a|S_i)} = \sum_a \pi(a|S_i) = 1\]
<p>So we can really just completely ignore the second half:</p>
\[E_b(\rho_{t:T-1}R_{t+k+1}) = E_b\left(\frac{\prod_{i=t}^{t+k} \pi(A_i|S_i)}{\prod_{i=t}^{t+k} b(A_i|S_i)} R_{t+k+1}\right) = E_b(\rho_{t:t+k}R_{t+k+1})\]
<p>What does this mean? We can really express our original sum in expectation:</p>
\[E_b(\rho_{t:T-1}G_t) = E_b\left(\sum_{k=0}^{T-t-1} \rho_{t:t+k} \gamma^kR_{t+k+1}\right)\]
<p>Which will then, once again, <strong>decrease the variance of our estimator</strong>.</p>
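<p>In code, the per-reward version only needs a running product of the ratios seen so far. As before, this is a sketch under the assumption that rewards and per-step ratios come as two equally long lists starting at time $t$; the function name is hypothetical:</p>

```python
def per_reward_return(rewards, ratios, gamma):
    """Sum of gamma^k * rho_{t:t+k} * R_{t+k+1} over one episode.

    rewards[k] = R_{t+k+1} and ratios[k] = pi(A_{t+k}|S_{t+k}) / b(A_{t+k}|S_{t+k}).
    """
    total = 0.0
    rho = 1.0       # rho_{t:t+k}: only the ratios up to the current reward
    discount = 1.0  # gamma^k
    for reward, ratio in zip(rewards, ratios):
        rho *= ratio
        total += discount * rho * reward
        discount *= gamma
    return total


on_policy = per_reward_return([1, 2, 3], [1.0, 1.0, 1.0], gamma=0.5)
off_policy = per_reward_return([1, 1, 1], [2.0, 0.5, 3.0], gamma=1.0)
```

<p>Each reward is scaled only by the ratios of the actions that could have influenced it, so a single extreme ratio late in the episode no longer multiplies the early rewards.</p>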
<h1 id="on-policy-model-in-python">On-Policy Model in Python</h1>
<p>Because Monte Carlo methods share a similar structure, I’ve made a discrete Monte Carlo model class in Python that can be used to plug and play. One can also find the code <a href="https://github.com/OneRaynyDay/MonteCarloEngine">here</a>. It’s doctested.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
General purpose Monte Carlo model for training on-policy methods.
"""</span>
<span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">class</span> <span class="nc">FiniteMCModel</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state_space</span><span class="p">,</span> <span class="n">action_space</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
        <span class="s">"""MCModel takes in state_space and action_space (finite)

        Arguments
        ---------
        state_space: int OR list[observation], where observation is any hashable type from env's obs.
        action_space: int OR list[action], where action is any hashable type from env's actions.
        gamma: float, discounting factor.
        epsilon: float, epsilon-greedy parameter.

        If the parameter is an int, then we generate a list, and otherwise we generate a dictionary.

        >>> m = FiniteMCModel(2,3,epsilon=0)
        >>> m.Q
        [[0, 0, 0], [0, 0, 0]]
        >>> m.Q[0][1] = 1
        >>> m.Q
        [[0, 1, 0], [0, 0, 0]]
        >>> m.pi(1, 0)
        1
        >>> m.pi(1, 1)
        0
        >>> d = m.generate_returns([(0,0,0), (0,1,1), (1,0,1)])
        >>> assert(d == {(1, 0): 1, (0, 1): 2, (0, 0): 2})
        >>> m.choose_action(m.pi, 1)
        0
        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">gamma</span> <span class="o">=</span> <span class="n">gamma</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">epsilon</span> <span class="o">=</span> <span class="n">epsilon</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">Q</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">action_space</span><span class="p">,</span> <span class="nb">int</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">action_space</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">action_space</span><span class="p">)</span>
            <span class="n">actions</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">action_space</span>
            <span class="c1"># Action representation
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">_act_rep</span> <span class="o">=</span> <span class="s">"list"</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">action_space</span> <span class="o">=</span> <span class="n">action_space</span>
            <span class="n">actions</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span><span class="mi">0</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">action_space</span><span class="p">}</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_act_rep</span> <span class="o">=</span> <span class="s">"dict"</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">state_space</span><span class="p">,</span> <span class="nb">int</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">state_space</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">state_space</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">Q</span> <span class="o">=</span> <span class="p">[</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">state_space</span><span class="p">)]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">state_space</span> <span class="o">=</span> <span class="n">state_space</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">Q</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">actions</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">state_space</span><span class="p">}</span>
        <span class="c1"># Frequency of state/action.
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">Ql</span> <span class="o">=</span> <span class="n">deepcopy</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="s">"""pi(a,s) := pi(a|s)
        Greedy target policy: returns 1 if a is the argmax_a of Q(s,a), else 0.
        q[s] = [q(s,0), q(s,1), ...]
        """</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_act_rep</span> <span class="o">==</span> <span class="s">"list"</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">action</span> <span class="o">==</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">state</span><span class="p">]):</span>
                <span class="k">return</span> <span class="mi">1</span>
            <span class="k">return</span> <span class="mi">0</span>
        <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">_act_rep</span> <span class="o">==</span> <span class="s">"dict"</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">action</span> <span class="o">==</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">state</span><span class="p">],</span> <span class="n">key</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">state</span><span class="p">].</span><span class="n">get</span><span class="p">):</span>
                <span class="k">return</span> <span class="mi">1</span>
            <span class="k">return</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">b</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="s">"""b(a,s) := b(a|s)
        Epsilon-greedy policy: returns the probability of picking the action,
        i.e. epsilon spread uniformly over the whole action space plus
        (1 - epsilon) on the greedy action given by pi.
        """</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">epsilon</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">action_space</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">epsilon</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">pi</span><span class="p">(</span><span class="n">action</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">generate_returns</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">ep</span><span class="p">):</span>
        <span class="s">"""Backs up returns through an episode, keeping the first-visit return
        of each (observation, action) pair.
        Arguments
        ---------
        ep: [(observation, action, reward)], an episode trajectory in chronological order.
        """</span>
        <span class="n">G</span> <span class="o">=</span> <span class="p">{}</span> <span class="c1"># first-visit return per (observation, action)
</span>        <span class="n">C</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># discounted return accumulated so far
</span>        <span class="k">for</span> <span class="n">tpl</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="n">ep</span><span class="p">):</span>
            <span class="n">observation</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">reward</span> <span class="o">=</span> <span class="n">tpl</span>
            <span class="n">G</span><span class="p">[(</span><span class="n">observation</span><span class="p">,</span> <span class="n">action</span><span class="p">)]</span> <span class="o">=</span> <span class="n">C</span> <span class="o">=</span> <span class="n">reward</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">gamma</span><span class="o">*</span><span class="n">C</span>
        <span class="k">return</span> <span class="n">G</span>

    <span class="k">def</span> <span class="nf">choose_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="s">"""Uses specified policy to select an action randomly given the state.
        Arguments
        ---------
        policy: function, can be self.pi, or self.b, or another custom policy.
        state: observation of the environment.
        """</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="p">[</span><span class="n">policy</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">action_space</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">action_space</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">probs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">update_Q</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">ep</span><span class="p">):</span>
        <span class="s">"""Performs an action-value update.
        Arguments
        ---------
        ep: [(observation, action, reward)], an episode trajectory in chronological order.
        """</span>
        <span class="c1"># Generate first-visit returns for this episode
</span>        <span class="n">G</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_returns</span><span class="p">(</span><span class="n">ep</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">G</span><span class="p">:</span>
            <span class="n">state</span><span class="p">,</span> <span class="n">action</span> <span class="o">=</span> <span class="n">s</span>
            <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">action</span><span class="p">]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">Ql</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">action</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">N</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">Ql</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">action</span><span class="p">]</span>
            <span class="c1"># Running mean; N already counts the current sample.
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">Q</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">action</span><span class="p">]</span> <span class="o">=</span> <span class="n">q</span> <span class="o">+</span> <span class="p">(</span><span class="n">G</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">-</span> <span class="n">q</span><span class="p">)</span> <span class="o">/</span> <span class="n">N</span>
<span class="k">def</span> <span class="nf">score</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">1000</span><span class="p">):</span>
<span class="s">"""Evaluates a specific policy with regards to the env.
Arguments
---------
env: an openai gym env, or anything that follows the api.
policy: a function, could be self.pi, self.b, etc.
"""</span>
<span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_samples</span><span class="p">):</span>
<span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">cum_rewards</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">action</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">choose_action</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
<span class="n">observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">cum_rewards</span> <span class="o">+=</span> <span class="n">reward</span>
<span class="k">if</span> <span class="n">done</span><span class="p">:</span>
<span class="n">rewards</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cum_rewards</span><span class="p">)</span>
<span class="k">break</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="kn">import</span> <span class="nn">doctest</span>
<span class="n">doctest</span><span class="p">.</span><span class="n">testmod</span><span class="p">()</span>
</code></pre></div></div>
<p>Try it out for yourself if you would like to use it in different gyms.</p>
<h2 id="example-blackjack">Example: Blackjack</h2>
<p>We use OpenAI’s gym in this example. Here, we use a decaying $\epsilon$-greedy policy to solve Blackjack:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The typical imports
</span><span class="kn">import</span> <span class="nn">gym</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">mc</span> <span class="kn">import</span> <span class="n">FiniteMCModel</span> <span class="k">as</span> <span class="n">MC</span>
<span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="s">"Blackjack-v0"</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="mi">1000000</span>
<span class="n">S</span> <span class="o">=</span> <span class="p">[(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">22</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">11</span><span class="p">)</span> <span class="k">for</span> <span class="n">z</span> <span class="ow">in</span> <span class="p">[</span><span class="bp">True</span><span class="p">,</span><span class="bp">False</span><span class="p">]]</span>
<span class="n">A</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">MC</span><span class="p">(</span><span class="n">S</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">eps</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">ep</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">observation</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="c1"># Choosing behavior policy
</span>        <span class="n">action</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">choose_action</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">b</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
        <span class="c1"># Run simulation
</span>        <span class="n">next_observation</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
        <span class="n">ep</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">observation</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">reward</span><span class="p">))</span>
        <span class="n">observation</span> <span class="o">=</span> <span class="n">next_observation</span>
        <span class="k">if</span> <span class="n">done</span><span class="p">:</span>
            <span class="k">break</span>
    <span class="n">m</span><span class="p">.</span><span class="n">update_Q</span><span class="p">(</span><span class="n">ep</span><span class="p">)</span>
    <span class="c1"># Decaying epsilon, reach optimal policy
</span>    <span class="n">m</span><span class="p">.</span><span class="n">epsilon</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">eps</span><span class="o">-</span><span class="n">i</span><span class="p">)</span><span class="o">/</span><span class="n">eps</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Final expected returns : {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">pi</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">10000</span><span class="p">)))</span>
<span class="c1"># plot a 3D wireframe like in the example mplot3d/wire3d_demo
</span><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">21</span><span class="p">)</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">m</span><span class="p">.</span><span class="n">Q</span><span class="p">[(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="bp">False</span><span class="p">)][</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">])</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">Y</span><span class="p">])</span>
<span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">mpl_toolkits.mplot3d.axes3d</span> <span class="kn">import</span> <span class="n">Axes3D</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">projection</span><span class="o">=</span><span class="s">'3d'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot_wireframe</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">Z</span><span class="p">,</span> <span class="n">rstride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cstride</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Player's Hand"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Dealer's Hand"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_zlabel</span><span class="p">(</span><span class="s">"Return"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"blackjackpolicy.png"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p>And we get a pretty nice looking plot for when there is no usable ace (hence the <code class="language-plaintext highlighter-rouge">False</code> in <code class="language-plaintext highlighter-rouge">Z</code> for meshgrid plotting).</p>
<p><img src="http://oneraynyday.github.io/assets/blackjackpolicy.png" alt="blackjackplot" /></p>
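<p>To get a feel for the decay schedule used in the training loop, here is a minimal sketch that evaluates the same expression, <code class="language-plaintext highlighter-rouge">max((eps - i) / eps, 0.1)</code>, at a few episode indices. The helper name <code class="language-plaintext highlighter-rouge">decayed_epsilon</code> and its <code class="language-plaintext highlighter-rouge">floor</code> parameter are illustrative choices of mine, not part of the model class.</p>

```python
def decayed_epsilon(i, eps=1_000_000, floor=0.1):
    """Linearly decay epsilon from 1 towards 0, but never below `floor`."""
    return max((eps - i) / eps, floor)


# Print the exploration rate at a few points of a 1M-episode run
for i in (1, 250_000, 500_000, 900_000, 1_000_000):
    print(i, decayed_epsilon(i))
```

<p>With these numbers the floor is reached after 90% of the episodes, so the last 100k episodes all explore with $\epsilon = 0.1$.</p>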
<p>I also wrote up a quick off-policy version of the model that’s not yet polished, since I wanted to just get a benchmark of performance out. Here are the results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Iterations: 100/1k/10k/100k/1million.
Tested on 10k samples for expected returns.
On-policy : greedy
-0.1636
-0.1063
-0.0648
-0.0458
-0.0312
On-policy : eps-greedy with eps=0.3
-0.2152
-0.1774
-0.1248
-0.1268
-0.1148
Off-policy weighted importance sampling:
-0.2393
-0.1347
-0.1176
-0.0813
-0.072
</code></pre></div></div>
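<p>So it seems that off-policy weighted importance sampling may be slower to converge, but it eventually does better than the $\epsilon$-greedy policy with a fixed $\epsilon = 0.3$ (-0.072 versus -0.1148 after 1 million iterations), while still trailing the first on-policy run (-0.0312).</p>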
<h2 id="example-cliff-walking">Example: Cliff Walking</h2>
<p>The change to the code is actually very small because, as I said, Monte Carlo sampling is pretty environment agnostic. We changed this portion of the code (minus the plotting part):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Before: Blackjack-v0
</span><span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="s">"CliffWalking-v0"</span><span class="p">)</span>
<span class="c1"># Before: [(x, y, z) for x in range(4,22) for y in range(1,11) for z in [True,False]]
</span><span class="n">S</span> <span class="o">=</span> <span class="mi">4</span><span class="o">*</span><span class="mi">12</span>
<span class="c1"># Before: 2
</span><span class="n">A</span> <span class="o">=</span> <span class="mi">4</span>
</code></pre></div></div>
<p>And so we ran the gym and got -17.0 as the $E_\pi(G)$. Not bad! The cliff walking problem is a map where some blocks are cliffs and others are platforms. You get -1 reward for every step on a platform, and -100 reward every time you fall down the cliff. When you land on a cliff, you go back to the beginning. For how big the map is, an expected return of -17.0 per episode corresponds to a near-optimal policy.</p>
<p><img src="http://oneraynyday.github.io/assets/cliffwalking.png" alt="cliffwalking" /></p>
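<p>As a sanity check on that number, here is a small sketch of the arithmetic. It assumes the standard 4x12 layout, with the start in the bottom-left corner, the goal in the bottom-right corner, and the cliff along the bottom row between them. The function <code class="language-plaintext highlighter-rouge">path_return</code> is a hypothetical helper, not part of the gym API.</p>

```python
def path_return(n_steps, n_falls=0, step_reward=-1, fall_reward=-100):
    """Undiscounted return of an episode with `n_steps` platform steps and `n_falls` falls."""
    return n_steps * step_reward + n_falls * fall_reward


edge_path = path_return(13)  # up 1, right 11, down 1: hugs the cliff edge
safe_path = path_return(17)  # up 3, right 11, down 3: stays along the far wall
print(edge_path, safe_path)  # -13 -17
```

<p>Under those assumptions, -17.0 matches a cautious route that never falls, which sits a few steps behind the shortest possible path.</p>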
<h1 id="conclusion">Conclusion</h1>
<p>Monte Carlo methods are surprisingly good techniques for calculating optimal value functions and action values for arbitrary tasks with “weird” probability distributions for action or observation spaces. We will consider better variations of Monte Carlo methods in the future, but this is a great building block for foundational knowledge in reinforcement learning.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Sutton, Richard S., and Andrew G. Barto. <em>Reinforcement Learning: an Introduction</em>. The MIT Press, 2012. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Reinforcement Learning - Markov Decision Process, 2018-05-06T00:00:00+00:00, http://oneraynyday.github.io/ml/2018/05/06/Reinforcement-Learning-MDPs</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#formalizations" id="markdown-toc-formalizations">Formalizations</a> <ul>
<li><a href="#the-agent-environment-interface" id="markdown-toc-the-agent-environment-interface">The Agent-Environment Interface</a></li>
<li><a href="#returns-and-episodes" id="markdown-toc-returns-and-episodes">Returns and Episodes</a></li>
<li><a href="#policies-and-value-functions" id="markdown-toc-policies-and-value-functions">Policies and Value Functions</a></li>
</ul>
</li>
<li><a href="#iterative-policy-evaluation" id="markdown-toc-iterative-policy-evaluation">Iterative Policy Evaluation</a> <ul>
<li><a href="#proof-of-convergence" id="markdown-toc-proof-of-convergence">Proof of Convergence</a></li>
</ul>
</li>
<li><a href="#policy-improvement" id="markdown-toc-policy-improvement">Policy Improvement</a></li>
<li><a href="#policy-iteration" id="markdown-toc-policy-iteration">Policy Iteration</a></li>
<li><a href="#value-iteration" id="markdown-toc-value-iteration">Value Iteration</a> <ul>
<li><a href="#proof-of-convergence-1" id="markdown-toc-proof-of-convergence-1">Proof of Convergence</a></li>
</ul>
</li>
<li><a href="#see-it-in-action" id="markdown-toc-see-it-in-action">See it in action!</a> <ul>
<li><a href="#visual-results" id="markdown-toc-visual-results">Visual Results</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<p>Previously, we discussed <a href="https://oneraynyday.github.io/ml/2018/05/03/Reinforcement-Learning-Bandit/"><strong>$k$-armed bandits</strong></a>, and algorithms to find the optimal action-value function $q_*(a)$.</p>
<p>Once we found $q_*(a)$, we could just choose $\arg\max_a q_*(a)$ all the time to reach the optimal solution.</p>
<p>We now go one step further, and add a <em>context</em> to our reinforcement learning problem. Context, in this case, means that we have a different optimal action-value function for every state:</p>
\[q_*(s,a) = E(R_{t+1}|S_t=s, A_t=a)\]
<p>This situation, where we have different states, and actions associated with the states to yield rewards, is called a <strong>Markov Decision Process (MDP)</strong>.</p>
<p>We will be following the general structure of Sutton and Barto’s RL book<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, but adding extra proofs, intuition, and a coding example at the end! I found some of the notation unnecessarily verbose, so some of it may be different here.</p>
<h1 id="formalizations">Formalizations</h1>
<h2 id="the-agent-environment-interface">The Agent-Environment Interface</h2>
<p>We need to establish some notation and terminology here:</p>
<ol>
<li>The decision maker is called the <strong>agent</strong>.</li>
<li>The agent interacts with the <strong>environment</strong>, which gives rewards.</li>
</ol>
<p>At each time step $t = 0,1,2,3,…$, we have a state and an action, $S_t \in \mathcal{S}, A_t \in \mathcal{A}(s)$. The action set is a function of the state because different states might permit different actions. If the actions are the same in all states, people often abbreviate it as $\mathcal{A}$. After taking action $A_t$, we receive a reward: $R_{t+1} \in \mathcal{R} \subset \Re$.</p>
<p>We thus have a sequence that looks like:</p>
\[S_0 \to A_0 \to R_1 \to S_1 \to A_1 \to R_2 \to …\]
<p>In a finite MDP:</p>
\[|\mathcal{S}| < \infty, |\mathcal{A}| < \infty, |\mathcal{R}| < \infty\]
<p>$(R_t, S_t)_{t\geq 0}$ is a Markov chain here. A <strong>Markov chain</strong>, under the discrete sigma algebra, is defined by:</p>
\[P(X_n = x_n|X_{n-1}=x_{n-1}, X_{n-2}=x_{n-2}, X_{n-3}=x_{n-3},…,X_0=x_0) = P(X_n = x_n|X_{n-1}=x_{n-1})\]
<p>Or in other words, $X_{n+1}$, the future state, depends only on $X_{n}$, the present state. This is often called the <strong>Markov property</strong>.</p>
<p>Thus, with this assumption, we can describe the entire dynamics of a finite MDP given:</p>
\[P(S_t = s', R_t = r | S_{t-1} = s, A_{t-1}=a) \forall s',r,s,a\]
<p>We can marginalize $R_t$ and get:</p>
\[P(S_t=s'|S_{t-1}=s, A_{t-1}=a)\]
<p>which tells us the <strong>state-transition probabilities</strong>.</p>
<p>The expected reward for a state-action pair can be defined as:</p>
\[r(s,a) = E(R_t|S_{t-1}=s, A_{t-1}=a) = \sum_{r \in \mathcal{R}}r\sum_{s'\in\mathcal{S}}p(s',r|s,a)\]
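<p>The two marginalizations above can be sketched with a small array. Everything here is made up for illustration: the joint table <code class="language-plaintext highlighter-rouge">p</code> is random, and its axis layout <code class="language-plaintext highlighter-rouge">p[s, a, s_next, r_idx]</code> is just one convenient convention.</p>

```python
import numpy as np

# Hypothetical joint dynamics: p[s, a, s_next, r_idx] = P(S_t=s', R_t=r | S_{t-1}=s, A_{t-1}=a)
rewards = np.array([0.0, 1.0])            # the finite reward set R
rng = np.random.default_rng(0)
p = rng.random((3, 2, 3, 2))              # 3 states, 2 actions, 2 possible rewards
p /= p.sum(axis=(2, 3), keepdims=True)    # normalize over (s', r) for every (s, a)

# State-transition probabilities P(s' | s, a): sum out the reward axis
transition = p.sum(axis=3)

# Expected reward r(s, a) = sum_r r * sum_s' p(s', r | s, a)
expected_reward = (p.sum(axis=2) * rewards).sum(axis=-1)

print(transition.sum(axis=2))   # every row sums to 1
print(expected_reward)          # shape (3, 2), one value per (s, a)
```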
<h2 id="returns-and-episodes">Returns and Episodes</h2>
<p>In general, if the agent has a finite lifespan, say $T$, then we want to maximize the return $G_t$, the sum of the rewards that follow time $t$:</p>
\[G_t = R_{t+1}+R_{t+2}+R_{t+3}+…+R_T\]
<p>Sometimes, $T$ can be a random variable.</p>
<p>However, many times our agent has an infinite lifespan, or we’re optimizing it to go to $\infty$ (think of maximizing the time a robot is standing up). What do we do for that? We can re-formulate our problem using the idea of <strong>discounting</strong>:</p>
\[G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}\]
<p>Here $\gamma \in [0, 1)$ is the discount factor. This pretty much says, “immediate rewards are worth more than later rewards, and the weight on rewards far in the future approaches 0”. This discounting idea can apply to finite-lifespan episodes as well, because we can add a final absorbing state in which every action yields a reward of 0 and leads back to the same state.</p>
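<p>A minimal sketch of the discounted return, assuming the rewards of one episode are given as a plain list $[R_1, R_2, …, R_T]$. The function name <code class="language-plaintext highlighter-rouge">discounted_return</code> is mine. It accumulates backwards, using $G_t = R_{t+1} + \gamma G_{t+1}$.</p>

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma^k * R_{k+1} for rewards = [R_1, R_2, ..., R_T]."""
    G = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        G = r + gamma * G
    return G


print(discounted_return([1, 1, 1], 0.5))        # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1, 0, 0], 0.5))  # padding with zero rewards changes nothing
```

<p>The second call is the absorbing-state trick in miniature: appending zero rewards after the episode ends leaves the return unchanged.</p>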
<h2 id="policies-and-value-functions">Policies and Value Functions</h2>
<p>We want to know <em>how good it is to be in a specific state</em>, and <em>how good it is to take an action in a given state</em>.</p>
<p>A <strong>policy</strong> is a mapping from states to probabilities of selecting an action. If an agent is following policy $\pi$, then he will take action $a$ at state $s$ with probability $\pi(a\vert s)$.</p>
<p>A <strong>value</strong> of a state $s$, under a policy $\pi$, is the expected return starting from $s$ if we take the policy’s suggested actions. It is denoted by $v_\pi(s)$, and is called a <em>state-value function</em>. We want our state-value function to be maximized at the end.</p>
\[v_{\pi}(s) = E_\pi(G_t|S_t=s) = E_\pi(\sum_{k=0}^\infty\gamma^kR_{t+k+1} | S_t=s)\]
<p>If we fix the action that we take at state $S_t = s$, then we would get an <em>action-value function</em> $q_\pi(s,a)$.</p>
\[q_\pi(s,a) = E_\pi(G_t|S_t=s,A_t=a)\]
<p>If you look closely, you can see a recursive equation in $v_\pi(s)$:</p>
\[v_\pi(s) = E_\pi(G_t|S_t=s) \quad{\text{(Definition)}}\]
\[= E_\pi(R_{t+1} + \gamma G_{t+1} | S_t=s) \quad{\text{(Definition)}}\]
\[= \sum_a \pi(a|s) E_p(R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a) \quad{\text{(Expectation)}}\]
<p>where the subscript $_p$ is for the probability distribution $p(s',r\vert s,a)$.</p>
\[= \sum_a \pi(a|s) \sum_{s'}\sum_r p(s',r|s,a) (r+\gamma E_\pi(G_{t+1}|S_{t+1} = s'))\]
\[= \sum_a \pi(a|s) \sum_{s'}\sum_r p(s',r|s,a) (r+\gamma v_\pi(s'))\]
<p>Do you see the recursion here? The expected value of $G_t$ is dependent on the expected value of $G_{t+1}$. This is also often called the <strong>Bellman equation for $v_\pi$</strong>. There is also a similar one for $q_\pi$, with similar expansions. This is obviously a <em>hard equation to solve</em>, and the result is the holy grail - $v_\pi$. How do we solve it? That’s the subject of a large part of reinforcement learning.</p>
<p>For interpretation’s sake, I’ll marginalize this a bit further and use the definition of expectation:</p>
\[= \sum_{s',r} p_\pi(s',r|s) r + \sum_{s',r} p_\pi(s',r|s) \gamma v_\pi(s')\]
\[= E_\pi(R|s) + \gamma E_\pi(v_\pi(S')|s)\]
<p>To optimize the state-value function for a specific state $s$, we need to change the parameter, which is $\pi$. Under the optimal policy $\pi_*$:</p>
\[v_*(s) = max_{a} q_{\pi_*}(s,a)\]
<p>The above is the <strong>Bellman optimality equation</strong>. It’s pretty much saying: <em>what’s the best possible value we can get out of a state?</em></p>
<h1 id="iterative-policy-evaluation">Iterative Policy Evaluation</h1>
<p>I believe the notation in the book is quite verbose, so I will shorten it here for clarity. Recall our previous Bellman equation:</p>
\[v_\pi(s) = \sum_{s',r} p_\pi(s',r|s)(r + \gamma v_\pi(s')) = E_\pi(R|s) + \sum_{s'} p_\pi(s'|s)\gamma v_\pi(s')\]
<p>We’ll denote $E_\pi(R\vert s)$ as $R_\pi(s)$, and $E_\pi(R \vert s,a)$ as $R_\pi(s,a)$, etc. Then, we can vectorize our system of equations:</p>
\[V_\pi \in \Re^{|\mathcal{S}|}, P_\pi \in \Re^{|\mathcal{S}| \times |\mathcal{S}|}, R_\pi \in \Re^{|\mathcal{S}|}\]
\[V_\pi = R_\pi + \gamma P_\pi V_\pi\]
<p>One way to solve for $V_\pi$ is to use an algorithm called <strong>iterative policy evaluation</strong>. It is the following update:</p>
\[v_{k+1}(s) = R_\pi(s) + \gamma \sum_{s'}p_\pi(s'|s) v_k(s')\]
<p>In vectorized form:</p>
<p>$V_{k+1} = R_\pi + \gamma P_\pi V_{k}$</p>
<p>with $v_0(s)$ being any arbitrary value in $\Re$.</p>
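<p>Here is a minimal NumPy sketch of that update, checked against the closed-form solution $V_\pi = (I - \gamma P_\pi)^{-1} R_\pi$ of the vectorized Bellman equation. The 3-state chain, the tolerance <code class="language-plaintext highlighter-rouge">tol</code>, and the function name are all assumptions for illustration.</p>

```python
import numpy as np


def iterative_policy_evaluation(P_pi, R_pi, gamma, tol=1e-10):
    """Repeat V <- R_pi + gamma * P_pi @ V until the largest update is below tol."""
    V = np.zeros(len(R_pi))  # arbitrary starting point v_0
    while True:
        V_next = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next


# Hypothetical 3-state chain under some fixed policy pi
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

V_iter = iterative_policy_evaluation(P_pi, R_pi, gamma)
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V_iter, V_exact)
```

<p>The direct solve is fine for tiny state spaces like this one. The iterative version only needs matrix-vector products, which is why it tends to be preferred when $|\mathcal{S}|$ is large.</p>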
<h2 id="proof-of-convergence">Proof of Convergence</h2>
<p>We can prove that $\lim_{k \to \infty} v_k(s) = v_\pi(s)$.</p>
<p>Proof:</p>
\[||V_{k+1} - V_\pi|| = ||(R_\pi + \gamma P_\pi V_k) - (R_\pi + \gamma P_\pi V_\pi)||\]
\[= ||\gamma P_\pi(V_k-V_\pi)||\]
<p>Recall that $\gamma < 1$ for infinite time MDPs, and that \(\vert \vert P_\pi\vert \vert=1 \forall \pi\) in the induced $\infty$-norm (the maximum absolute row sum), since this is a stochastic matrix.</p>
<p>By submultiplicativity of the induced norm:</p>
\[\leq \gamma ||P_\pi|| ||V_k-V_\pi|| = \gamma ||V_k-V_\pi||\]
<p>Applying this bound repeatedly, we see that, ultimately:</p>
\[||V_{k+1} - V_\pi|| \leq \gamma^{k+1} ||V_0 - V_\pi|| \to 0 \quad \text{as } k \to \infty \quad \blacksquare\]
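<p>The contraction can also be observed numerically. The sketch below uses a made-up 2-state chain and tracks the $\infty$-norm error of each iterate, which should shrink by a factor of at most $\gamma$ per step.</p>

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, -1.0])
gamma = 0.9
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)  # exact fixed point

V = np.array([5.0, 0.0])  # arbitrary starting guess
errors = []
for _ in range(20):
    errors.append(np.max(np.abs(V - V_pi)))  # infinity-norm distance to V_pi
    V = R_pi + gamma * P_pi @ V               # one evaluation sweep

# Ratio of consecutive errors: the observed contraction factor per sweep
ratios = [b / a for a, b in zip(errors, errors[1:])]
print(max(ratios))
```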
<h1 id="policy-improvement">Policy Improvement</h1>
<p>Now that we have the true value function (or something approaching it), we want to find a better policy. For any existing policy $\pi$, we want to find a <em>better</em> $\pi'$, i.e. one that gives us $V_{\pi'} \succeq V_\pi$.</p>
<p>So how do we find this $\pi'$? It’s actually pretty simple. If you know $v_\pi$, then why not just choose the action that gives you the greatest value on the next step? In formal terms:</p>
\[\pi'(s) = argmax_a q_\pi(s,a)\]
\[= argmax_a E(R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=a)\]
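<p>Why is this better? It’s due to the idea that <strong>the maximum of a set of numbers is always greater than or equal to any convex combination of those numbers.</strong> Since $v_\pi(s) = \sum_a \pi(a\vert s) q_\pi(s,a)$ is a convex combination of the $q_\pi(s,a)$, it can never exceed $max_a q_\pi(s,a)$. We essentially apply that simple idea here.</p>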
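<p>A quick numerical illustration of that inequality, using a random table in place of a real $q_\pi$ and a random stochastic policy (both are placeholders, not derived from any environment):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 3))            # stand-in for q_pi(s, a): 5 states, 3 actions
pi = rng.random((5, 3))
pi /= pi.sum(axis=1, keepdims=True)    # stochastic policy pi(a|s), rows sum to 1

v = (pi * q).sum(axis=1)               # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
greedy_value = q.max(axis=1)           # value of picking argmax_a q_pi(s, a)

print(greedy_value - v)                # non-negative in every state
```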
<p>Well, let’s expand out what it means:</p>
\[v_\pi(s) \leq q_\pi(s, \pi'(s)) = E(R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t = s, A_t = \pi'(s))\]
\[= \sum_{r,s'} (r+\gamma v_\pi(s')) p(r,s'|s,\pi'(s))\]
\[= \sum_{r,s'} (r+\gamma v_\pi(s'))p_{\pi'}(r,s'|s)\]
\[= E_{\pi'}(R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s)\]
\[\leq E_{\pi'}(R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1}))|S_t=s)\]
\[…\]
\[\leq E_{\pi'}(\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s)\]
\[= v_{\pi'}(s) \blacksquare\]
<p>This means that if, at every state $s$ and every time step, we take the action that looks best under our current $v_\pi$, then we have created a new policy $\pi'$ that is at least as good as $\pi$.</p>
<p>What happens when we reach equality of $\pi$ and $\pi'$?</p>
\[v_{\pi'}(s) = max_a E(R_{t+1} + \gamma v_{\pi'}(S_{t+1}) | S_t=s, A_t=a)\]
<p>Recall what the <strong>Bellman optimality equation</strong> says:</p>
\[v_*(s) = max_{a} q_{\pi_*}(s,a)\]
<p>These two are exactly the same, implying that when equality occurs, we have reached the optimal policy.</p>
<h1 id="policy-iteration">Policy Iteration</h1>
<p>So now that we have a way of evaluating $v_\pi(s)$, and a way to improve $\pi$, we can do something along the lines of:</p>
<p><img src="http://oneraynyday.github.io/assets/generalizedpolicy.png" alt="policy" /></p>
<ol>
<li>Initialize with some arbitrary $\pi$, and arbitrary $v_\pi$ estimate.</li>
<li>Find the true $v_\pi$ for the $\pi$.</li>
<li>Find a better $\pi'$.</li>
<li>Repeat steps 2 and 3 until $\pi' = \pi$.</li>
</ol>
<p>This is called the <strong>policy iteration</strong> algorithm. We have presented a possible method to solve MDPs! We’re getting somewhere! Here’s a rough pseudocode:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">policy_evaluation</span><span class="p">(</span><span class="n">V_prev</span><span class="p">):</span>
    <span class="c1"># Repeat the update V = R + gamma * P * V until it stops changing
</span>    <span class="n">V_next</span> <span class="o">=</span> <span class="n">R</span> <span class="o">+</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">P</span> <span class="o">*</span> <span class="n">V_prev</span>
    <span class="k">while</span> <span class="n">V_next</span> <span class="o">!=</span> <span class="n">V_prev</span><span class="p">:</span>
        <span class="n">V_prev</span> <span class="o">=</span> <span class="n">V_next</span>
        <span class="n">V_next</span> <span class="o">=</span> <span class="n">R</span> <span class="o">+</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">P</span> <span class="o">*</span> <span class="n">V_prev</span>
    <span class="k">return</span> <span class="n">V_next</span>
<span class="k">def</span> <span class="nf">policy_improvement</span><span class="p">(</span><span class="n">q</span><span class="p">):</span>
    <span class="c1"># Act greedily with respect to q in every state
</span>    <span class="n">pi</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">S</span><span class="p">:</span>
        <span class="n">pi</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">argmax_a</span><span class="p">(</span><span class="n">q</span><span class="p">(</span><span class="n">s</span><span class="p">,</span><span class="n">a</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">pi</span>
<span class="k">def</span> <span class="nf">policy_iter</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">):</span>
    <span class="c1"># Initialize (find_V can use policy_evaluation, find_q builds q from V)
</span>    <span class="n">V</span> <span class="o">=</span> <span class="n">find_V</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">)</span>
    <span class="n">q</span> <span class="o">=</span> <span class="n">find_q</span><span class="p">(</span><span class="n">V</span><span class="p">)</span>
    <span class="n">pi_next</span> <span class="o">=</span> <span class="n">policy_improvement</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
    <span class="c1"># Policy Iteration
</span>    <span class="k">while</span> <span class="n">pi_next</span> <span class="o">!=</span> <span class="n">pi_prev</span><span class="p">:</span>
        <span class="n">pi_prev</span> <span class="o">=</span> <span class="n">pi_next</span>
        <span class="n">V</span> <span class="o">=</span> <span class="n">find_V</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">)</span>
        <span class="n">q</span> <span class="o">=</span> <span class="n">find_q</span><span class="p">(</span><span class="n">V</span><span class="p">)</span>
        <span class="n">pi_next</span> <span class="o">=</span> <span class="n">policy_improvement</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pi_next</span>
</code></pre></div></div>
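<p>For something you can actually run, here is one possible NumPy sketch of the same loop on a small random MDP. The array layout (<code class="language-plaintext highlighter-rouge">P[a, s, s']</code> for transitions, <code class="language-plaintext highlighter-rouge">R[s, a]</code> for expected rewards), the helper names, and the use of an exact linear solve for the evaluation step are my own assumptions, not the only way to do it.</p>

```python
import numpy as np


def evaluate(P, R, pi, gamma):
    """Exact v_pi for a deterministic policy pi (one action index per state)."""
    S = np.arange(len(pi))
    P_pi, R_pi = P[pi, S], R[S, pi]  # rows of P and entries of R selected by pi
    return np.linalg.solve(np.eye(len(pi)) - gamma * P_pi, R_pi)


def improve(P, R, V, gamma):
    """Greedy policy: argmax_a of q(s, a) = R[s, a] + gamma * sum_s' P[a, s, s'] V[s']."""
    return np.argmax(R + gamma * (P @ V).T, axis=1)


def policy_iteration(P, R, gamma, max_iters=1000):
    pi = np.zeros(R.shape[0], dtype=int)    # step 1: arbitrary initial policy
    for _ in range(max_iters):
        V = evaluate(P, R, pi, gamma)       # step 2: find v_pi
        pi_next = improve(P, R, V, gamma)   # step 3: find a better policy
        if np.array_equal(pi_next, pi):     # step 4: stop once the policy is stable
            return pi, V
        pi = pi_next
    raise RuntimeError("policy iteration did not converge")


# Hypothetical MDP with 4 states and 2 actions
rng = np.random.default_rng(0)
P = rng.random((2, 4, 4))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((4, 2))
gamma = 0.9

pi_star, V_star = policy_iteration(P, R, gamma)
print(pi_star, V_star)
```

<p>At the returned policy, $v(s) = max_a q(s,a)$ holds in every state, which is the Bellman optimality condition from earlier.</p>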
<h1 id="value-iteration">Value Iteration</h1>
<p>Recall policy iteration. Don’t you think it’s kind of slow to run steps 2 and 3 one after the other? Specifically, we’re going through all states in calculating the value function $v_\pi$, AND we’re going through all the states to calculate the next $\pi'$. This is a lot of work that we don’t need to do (roughly $O(N^2)$ run time).</p>
<p>Recall that $v_*(s) = max_a q_{\pi_*}(s,a)$, and we’re trying to find $v_*$. Expanding it, $v_*$ is also equivalent to:</p>
\[v_*(s) = max_a E_{\pi_*}(R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a)\]
<p>So this time, we run a different algorithm called <strong>value iteration</strong>:</p>
\[v_{k+1}(s) = max_a \sum_{s',r}p(s',r|s,a)(r+\gamma v_k(s'))\]
<p>This looks like some combination of policy evaluation and policy improvement in one sweep right?</p>
<p>So why does this converge, i.e. why is $\lim_{k \to \infty} v_k = v_*$?</p>
<h2 id="proof-of-convergence-1">Proof of Convergence</h2>
<p>We use the same contraction operator argument as before, but with a slight twist.</p>
\[||V_{k+1} - V_{\pi*}|| = ||(max_a R_a + \gamma P_a V_k) - (max_a R_a + \gamma P_a V_{\pi*})||\]
\[\leq max_a ||(R_a + \gamma P_a V_k) - (R_a + \gamma P_a V_{\pi*})|| \quad{\text{(1)}}\]
\[\leq max_a \gamma||P_a||||V_k-V_{\pi*}||\]
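<p>Each $P_a$ is a row-stochastic matrix, so $||P_a|| \leq 1$ in the max norm, and the error shrinks by at least a factor of $\gamma \in [0,1)$ on every sweep:</p>
\[||V_{k+1}-V_{\pi*}|| \leq \gamma||V_k-V_{\pi*}||\blacksquare\]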
<p>As an aside, $\text{(1)}$ is true because:</p>
\[f(x) - g(x) \leq ||f(x) - g(x)|| \quad{\text{(absolute value)}}\]
\[f(x) \leq ||f(x)-g(x)|| + g(x)\]
\[max_x f(x) \leq max_x(||f(x) - g(x)|| + g(x)) \leq max_x||f(x) - g(x)|| + max_x g(x) \quad{\text{(max of a sum is at most the sum of maxes)}}\]
\[max_x f(x) - max_x g(x) \leq max_x ||f(x) - g(x)||\]
<p>Swapping the roles of $f$ and $g$ gives the same bound for $max_x g(x) - max_x f(x)$, so:</p>
\[||max_x f(x) - max_x g(x)|| \leq max_x ||f(x) - g(x)||\]
<hr />
<p>So what does this algorithm look like?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">value_iter</span><span class="p">(</span><span class="n">prev_V</span><span class="p">):</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">V</span> <span class="o">=</span> <span class="n">prev_V</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span> <span class="c1"># copy, so that updating V leaves prev_V intact
</span>        <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">S</span><span class="p">:</span>
            <span class="n">max_v_s</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">A</span><span class="p">:</span>
                <span class="c1"># P(s,a) and R(s,a) are vectors over next states s':
</span>                <span class="c1"># p(s'|s,a) and the reward for landing in s'.
</span>                <span class="n">max_v_s</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">P</span><span class="p">(</span><span class="n">s</span><span class="p">,</span><span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">R</span><span class="p">(</span><span class="n">s</span><span class="p">,</span><span class="n">a</span><span class="p">)</span> <span class="o">+</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">prev_V</span><span class="p">)),</span> <span class="n">max_v_s</span><span class="p">)</span>
            <span class="n">V</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_v_s</span>
        <span class="k">if</span> <span class="nb">all</span><span class="p">(</span><span class="n">V</span> <span class="o">==</span> <span class="n">prev_V</span><span class="p">):</span> <span class="c1"># in practice, stop once the change is below a tolerance
</span>            <span class="k">return</span> <span class="n">V</span> <span class="c1"># returns optimal V
</span>        <span class="n">prev_V</span> <span class="o">=</span> <span class="n">V</span>
</code></pre></div></div>
<p>After we get the optimal value, we can easily find the optimal policy.</p>
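<p>As a rough sketch of the whole pipeline (value iteration, then reading the policy off the optimal values), here is a small NumPy version. It is not the code from this post: the array layout <code>P[a, s, s2]</code> / <code>R[a, s]</code>, the helper names <code>value_iteration</code> and <code>greedy_policy</code>, and the two-state MDP are all made up for illustration.</p>

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s2] = p(s2 | s, a); R[a, s] = expected reward for taking a in s."""
    V = np.zeros(P.shape[1])
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_s2 P[a, s, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def greedy_policy(P, R, V, gamma=0.9):
    """Pick, in every state, the action with the best one-step lookahead."""
    return (R + gamma * (P @ V)).argmax(axis=0)

# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
# Reward 1 only for staying in state 1.
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])

V_star = value_iteration(P, R)
pi_star = greedy_policy(P, R, V_star)
```

<p>In this toy MDP the best plan is to move to state 1 and stay there, so with $\gamma = 0.9$ the values should come out near $(9, 10)$.</p>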
<h1 id="see-it-in-action">See it in action!</h1>
<p>To illustrate how this works, we took the same FrozenLake setup, a classic MDP problem, and solved it with <em>value iteration</em>. Here is the code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
Let's use Value Iteration to solve FrozenLake!
Setup
-----
We start off by defining our actions:
A = {move left, move right...} = {(0,1),(0,-1),...}
S = {(i,j) for 0 <= i,j < 4}
Reward for (3,3) = 1, and otherwise 0.
Probability distribution is a 4x(4x4) matrix of exactly the policy.
We have pi(a|s), where a in A, and s in S.
Problem formulation : https://gym.openai.com/envs/FrozenLake-v0/
Algorithm
---------
We start from the value iteration eq:
v <- 0 for all states.
v_{k+1}(s) = max_a (\sum_{s',r} p(s',r|s,a) (r + \gamma * v_k(s')))
... which, because our situation is deterministic for now, reduces to:
v_{k+1}(s) = max_a (1_(end(s')) + \gamma * v_k(s')), where s' = move(s, a).
Both the state transition and the reward are deterministic.
"""</span>
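<span class="c1"># Imports assumed by the code below (norm is taken to be numpy.linalg.norm).</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">numpy.linalg</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">4</span>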
<span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="c1"># Is our value vector.
</span><span class="n">THRESHOLD</span> <span class="o">=</span> <span class="mf">1e-5</span>
<span class="n">A</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">),(</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">),(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">)]</span>
<span class="n">MAP</span> <span class="o">=</span> <span class="p">[</span>
<span class="s">"SFFF"</span><span class="p">,</span>
<span class="s">"FHFH"</span><span class="p">,</span>
<span class="s">"FFFH"</span><span class="p">,</span>
<span class="s">"HFFG"</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nf">proj</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">minn</span><span class="p">,</span> <span class="n">maxn</span><span class="p">):</span>
<span class="s">"""
projects n into the range [minn, maxn).
"""</span>
<span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">maxn</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="p">),</span> <span class="n">minn</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">move</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">tpl</span><span class="p">,</span> <span class="n">stochasticity</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
<span class="s">"""
Set stochasticity to any number in [0,1].
This is equivalent to "slipping on the ground"
in FrozenLake.
"""</span>
<span class="k">if</span> <span class="n">MAP</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]][</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">==</span> <span class="s">'H'</span><span class="p">:</span> <span class="c1"># Go back to the start
</span> <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span>
<span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="n">stochasticity</span><span class="p">:</span>
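        <span class="n">tpl</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="c1"># Slip: take a random action instead of the requested one</span>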
<span class="k">return</span> <span class="p">(</span><span class="n">proj</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">tpl</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">),</span> <span class="n">proj</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">tpl</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">reward</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="k">return</span> <span class="n">MAP</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]][</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">==</span> <span class="s">'G'</span>
<span class="k">def</span> <span class="nf">run_with_value</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.9</span><span class="p">):</span>
<span class="n">old_v</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
<span class="n">best_val</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">A</span><span class="p">:</span>
<span class="n">new_s</span> <span class="o">=</span> <span class="n">move</span><span class="p">((</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">a</span><span class="p">)</span>
<span class="n">best_val</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">best_val</span><span class="p">,</span> <span class="n">reward</span><span class="p">(</span><span class="n">new_s</span><span class="p">)</span> <span class="o">+</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">old_v</span><span class="p">[</span><span class="n">new_s</span><span class="p">])</span>
<span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">best_val</span>
<span class="k">return</span> <span class="n">old_v</span>
<span class="c1"># Performing Value Iteration
</span><span class="n">plt</span><span class="p">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="n">old_v</span> <span class="o">=</span> <span class="n">run_with_value</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="k">while</span> <span class="n">norm</span><span class="p">(</span><span class="n">v</span> <span class="o">-</span> <span class="n">old_v</span><span class="p">)</span> <span class="o">>=</span> <span class="n">THRESHOLD</span><span class="p">:</span>
<span class="n">old_v</span> <span class="o">=</span> <span class="n">run_with_value</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="c1"># Extracting policy from v:
</span><span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">v</span><span class="p">):</span>
<span class="n">cur_best</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">)</span>
<span class="n">cur_a</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">A</span><span class="p">:</span>
<span class="n">val</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="n">move</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">)]</span>
<span class="k">if</span> <span class="n">val</span> <span class="o">></span> <span class="n">cur_best</span><span class="p">:</span>
<span class="n">cur_a</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">cur_best</span> <span class="o">=</span> <span class="n">val</span>
<span class="k">return</span> <span class="n">cur_a</span>
<span class="c1"># Plotting a nice arrow map.
</span><span class="n">action_map</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">[</span><span class="n">pi</span><span class="p">((</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">)]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">)])</span>
<span class="n">Fx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span> <span class="p">[</span><span class="n">col</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">row</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">action_map</span> <span class="p">]),</span><span class="mi">0</span><span class="p">)</span>
<span class="n">Fy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">flip</span><span class="p">([</span> <span class="p">[</span><span class="o">-</span><span class="n">col</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">row</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">action_map</span> <span class="p">],</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">quiver</span><span class="p">(</span><span class="n">Fx</span><span class="p">,</span><span class="n">Fy</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="visual-results">Visual Results</h2>
<p>We plotted the colormap of value functions per state in our 2D world, and saw it converge to a reasonable policy:</p>
<p><strong>Iteration 1:</strong>
<img src="http://oneraynyday.github.io/assets/mdp1.png" alt="mdp1" /></p>
<p><strong>Iteration 2:</strong>
<img src="http://oneraynyday.github.io/assets/mdp2.png" alt="mdp2" /></p>
<p><strong>Iteration 3:</strong>
<img src="http://oneraynyday.github.io/assets/mdp3.png" alt="mdp3" /></p>
<p><strong>Iteration 4:</strong>
<img src="http://oneraynyday.github.io/assets/mdp4.png" alt="mdp4" /></p>
<p><strong>End Result:</strong>
<img src="http://oneraynyday.github.io/assets/mdp5.png" alt="mdpend" /></p>
<p>In the end, our policy looks like:</p>
<p><img src="http://oneraynyday.github.io/assets/mdparrow.png" alt="mpdarrow" /></p>
<p>Pretty cool, huh? You can take a look at the code <a href="https://github.com/OneRaynyDay/FrozenLakeMDP">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>So, our little exploration into MDPs has been nice. We learned how to formulate MDPs and solve them using value iteration and policy iteration. We also made a cool little coding example that our algorithms can solve.</p>
<p>One major downside of these algorithms is that they are not applicable to <strong>continuous-value</strong> domains. This means that even for a problem as simple as <a href="https://github.com/openai/gym/wiki/CartPole-v0">Cart Pole</a>, we won’t have a very smooth way of solving it (discretizing and running our algorithms might work, but it’s real hacky). We will explore ways to solve that issue next time!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Sutton, Richard S., and Andrew G. Barto. <em>Reinforcement Learning: an Introduction</em>. The MIT Press, 2012. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Reinforcement Learning - Bandit Problems
2018-05-03T00:00:00+00:00
http://oneraynyday.github.io/ml/2018/05/03/Reinforcement-Learning-Bandit
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#k-armed-bandit-problem" id="markdown-toc-k-armed-bandit-problem">$k$-armed Bandit Problem</a></li>
<li><a href="#action-value-methods" id="markdown-toc-action-value-methods">Action-value Methods</a> <ul>
<li><a href="#estimating-action-values" id="markdown-toc-estimating-action-values">Estimating Action Values</a></li>
<li><a href="#action-selection-rule--greedy" id="markdown-toc-action-selection-rule--greedy">Action Selection Rule : Greedy</a></li>
<li><a href="#action-selection-rule--epsilon-greedy" id="markdown-toc-action-selection-rule--epsilon-greedy">Action Selection Rule : $\epsilon$-Greedy</a></li>
<li><a href="#crux-nonstationary-action-value" id="markdown-toc-crux-nonstationary-action-value">Crux: Nonstationary Action Value</a> <ul>
<li><a href="#1-transience" id="markdown-toc-1-transience">1. Transience</a></li>
<li><a href="#2-convergence" id="markdown-toc-2-convergence">2. Convergence</a></li>
</ul>
</li>
<li><a href="#action-selection-rule-optimistic-initial-values" id="markdown-toc-action-selection-rule-optimistic-initial-values">Action Selection Rule: Optimistic Initial Values</a></li>
<li><a href="#action-selection-rule-upper-confidence-bound-selection" id="markdown-toc-action-selection-rule-upper-confidence-bound-selection">Action Selection Rule: Upper-Confidence-Bound Selection</a></li>
<li><a href="#gradient-bandit-algorithms" id="markdown-toc-gradient-bandit-algorithms">Gradient Bandit Algorithms</a></li>
</ul>
</li>
<li><a href="#ending-remarks" id="markdown-toc-ending-remarks">Ending Remarks</a></li>
</ul>
<p>To begin, we should note that <strong>bandit problems</strong> are a subset of <strong>tabular solution methods</strong>. They are called tabular because we can fit all the possible states into a table. The table tells us everything we need to know about the state of the problem, so we can often find the exact solution to the problem that’s posed. We are following the book by Richard Sutton and Andrew Barto. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<h1 id="k-armed-bandit-problem">$k$-armed Bandit Problem</h1>
<blockquote>
<p><em>One-armed bandit</em>: a slot machine operated by pulling a long handle at the side.</p>
</blockquote>
<p>We are faced with $k$ different actions. Each action gives you an amount of money, sampled from a distribution conditioned on the action. Each time step, you pick an action. You usually stop after $T$ time steps. <strong>How do you maximize your gain?</strong></p>
<p>For some time step $t$, we have an action, $A_t$, and a reward of the action, $R_t$. We denote the <strong>value* of an arbitrary action $a$</strong> as $q_*(a)$:</p>
\[q_*(a) = E[R_t|A_t = a]\]
<p>What does this mean in English? It means “the value of an action $a$ is the expected value of the reward of that action (at any time).”</p>
<p>After reading that sentence 3-4 times, it makes sense right? If you knew that doing 1 action will give you the greatest expected value, then you abuse the hell out of it to max your gains.</p>
<p>Now we obviously want $q_*(a)$. It’s hard to get, though. $Q_t(a)$ is an estimate of $q_*(a)$ at time $t$. A desirable property is that:</p>
\[\lim_{t\to \infty}Q_t(a) = q_*(a)\]
<p>So how do we make the $Q_t(a)$?</p>
<hr />
<p><em>value*</em>: in this case, it’s different from the concept of reward. Value is the long-run metric, whereas reward is the immediate metric. If you got hooked on heroin, it would be awesome rewards for the first hour, but it would be terrible value. Note that here we wrote $q_*(a)$ as an expectation without sums of future rewards. This is not usually true in MDPs, which we will explore later, but since the bandit problem is stateless, we can get away with this abuse of notation.</p>
<h1 id="action-value-methods">Action-value Methods</h1>
<p>First, we have to estimate the action values, and then, we will decide what actions to do once we have estimates.</p>
<h2 id="estimating-action-values">Estimating Action Values</h2>
<p>One easy way to approximate $q_*(a)$ is to use <strong>sample means</strong>. What does this mean?</p>
\[Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i * 1_{A_i=a}}{\sum_{i=1}^{t-1} 1_{A_i=a}}\]
<p>That might look loaded, but really it’s just saying “for all the times we chose the action $a$, what was the average reward?” And this approximation of $q_*(a)$ is pretty good! Why? Because the expression above is literally the SLLN (Strong Law of Large Numbers) in disguise.</p>
<p>What does that entail? It entails that $Q_t(a)$ converges almost surely to $q_*(a)$, or more formally:</p>
\[P( \lim_{t \to \infty}Q_t(a) = q_*(a)) = 1\]
<p>This is stronger than convergence in probability (which is already enough for most applications).</p>
<p>So this is great, asymptotically speaking we will definitely reach $q_*(a)$… right? (Not really, we will see why later)</p>
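<p>To make the sample-mean formula concrete, here is a minimal sketch. The function name <code>sample_mean_estimates</code> and the five-pull history are made up for illustration; arms that were never pulled are simply left at 0, which is one arbitrary choice among many.</p>

```python
import numpy as np

def sample_mean_estimates(actions, rewards, k):
    """Q[a] = average reward over the steps where action a was chosen."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    Q = np.zeros(k)
    for a in range(k):
        chosen = actions == a
        if chosen.any():          # leave Q[a] at 0 if a was never tried
            Q[a] = rewards[chosen].mean()
    return Q

# Hypothetical history of 5 pulls on a 3-armed bandit.
Q = sample_mean_estimates(actions=[0, 1, 0, 1, 0], rewards=[1, 4, 3, 6, 2], k=3)
```

<p>Notice that arm 2 was never pulled, so its estimate says nothing about its true value. The law of large numbers only helps for actions we actually keep sampling.</p>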
<h2 id="action-selection-rule--greedy">Action Selection Rule : Greedy</h2>
<p>The easiest and probably one of the worst things you can do is to get addicted to heroin. That’s taking the greediest action, the one that gives you the greatest perceived value so far:</p>
\[A_t = argmax_a Q_t(a)\]
<p>We take whatever action has the greatest estimated value so far. If we start as a blank slate, we’ll just keep taking the best-looking action without exploring our other choices at all. This can easily lead to a suboptimal solution.</p>
<h2 id="action-selection-rule--epsilon-greedy">Action Selection Rule : $\epsilon$-Greedy</h2>
<p>What could we do as an alternative? We can try something called an <strong>$\epsilon$-greedy method</strong>. This just means that, with some small probability $\epsilon$ at any time $t$, we choose uniformly from all actions rather than greedily. It’s like tossing a biased coin that lands tails with probability $\epsilon$: on tails, you select uniformly at random from all actions; on heads, you perform the usual greedy action.</p>
<p>Asymptotically speaking, we will take the actual optimal action with probability greater than $1-\epsilon$ (once we have found it, the greedy branch always chooses it, and the uniform branch still picks it with probability $\frac{\epsilon}{k}$, where $k$ is the number of actions).</p>
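<p>Here is a small sketch of the selection rule, assuming the estimates are stored in a NumPy array. The helper name <code>epsilon_greedy</code>, the estimates in <code>Q</code>, and the seed are illustrative choices, not anything prescribed by the book.</p>

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.2, 1.5, 0.7])            # made-up estimates, arm 1 looks best
picks = [epsilon_greedy(Q, epsilon=0.1, rng=rng) for _ in range(10_000)]
frac_best = picks.count(1) / len(picks)  # should be close to 1 - eps + eps/k
```

<p>With $\epsilon = 0.1$ and $k = 3$, the best-looking arm should be chosen roughly $0.9 + 0.1/3 \approx 93\%$ of the time. Setting $\epsilon = 0$ recovers the purely greedy rule.</p>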
<h2 id="crux-nonstationary-action-value">Crux: Nonstationary Action Value</h2>
<p>A lot of the time, our action values are changing with time. This is bad, because $q_*(a)$ is now dependent on time, so it’s more like $q_*(a,t)$.</p>
<p>Before, we had(we simplify the notation here):</p>
\[Q_n = \frac{\sum_{i=1}^{n-1} R_i}{n-1}\]
<p>Which can be expressed as:</p>
\[Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)\]
<p>Which is a very familiar update rule! Looks like SGD by any chance?</p>
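<p>In case the step from the plain average to the incremental form isn’t obvious, here is the algebra written out. It uses the same convention as above: $Q_{n+1}$ is the average of the first $n$ rewards, so $(n-1)Q_n = \sum_{i=1}^{n-1} R_i$.</p>
\[\begin{align*}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i \\
&= \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right) \\
&= \frac{1}{n}\left(R_n + (n-1)Q_n\right) \\
&= Q_n + \frac{1}{n}(R_n - Q_n)
\end{align*}\]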
<p>Now this has the property of converging with probability 1 to $q_*(a)$ if the problem is stationary. Now, what if it’s not? We definitely don’t want to weight $R_1$ the same as $R_{n-1}$, because our most recent rewards better reflect the current action values.</p>
<p>We can replace $\frac{1}{n}$ with a constant $\alpha \in (0, 1]$:</p>
\[Q_{n+1} = Q_n + \alpha(R_n - Q_n)\]
<p>This is an <strong>exponential average</strong>, which geometrically decays the weight of previous rewards.</p>
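<p>To see the geometric decay explicitly, unroll the recursion: $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$. The sketch below checks that the recursive and unrolled forms agree on a made-up reward sequence; the function names are hypothetical.</p>

```python
import numpy as np

def constant_step_estimate(Q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward."""
    Q = Q1
    for r in rewards:
        Q += alpha * (r - Q)
    return Q

def unrolled_estimate(Q1, rewards, alpha):
    """Same quantity, written as an explicit weighted sum of past rewards."""
    n = len(rewards)
    # Each reward is discounted by (1 - alpha) once per later update,
    # so the oldest reward gets exponent n-1 and the newest gets exponent 0.
    weights = alpha * (1 - alpha) ** np.arange(n - 1, -1, -1)
    return (1 - alpha) ** n * Q1 + float(np.dot(weights, rewards))

rewards = [1.0, 0.0, 2.0, 5.0]   # made-up reward sequence
recursive = constant_step_estimate(0.0, rewards, alpha=0.5)
explicit = unrolled_estimate(0.0, rewards, alpha=0.5)
```

<p>With $\alpha = 0.5$ the most recent reward carries weight $0.5$, the one before it $0.25$, and so on, which is exactly the recency bias we want for a nonstationary problem.</p>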
<p>Now let’s abstract it even more: we introduce a function $\alpha_n(a)$ which gives us the weight of a specific reward at time step $n$ (in this case since we’re only concerned about an action, replace $\alpha_n(a)$ with $\alpha_n$):</p>
\[Q_{n+1} = Q_n + \alpha_n(R_n - Q_n)\]
<p>We need some properties of $\alpha_n(a)$ for this update to converge with probability 1:</p>
<h3 id="1-transience">1. Transience</h3>
\[\sum_n \alpha_n(a) = \infty\]
<p>implies that for any starting value $Q_1 \in \Re$, we can reach any arbitrary $q_*(a) \in \Re$.</p>
<h3 id="2-convergence">2. Convergence</h3>
\[\sum_n \alpha_n(a)^2 < \infty\]
<p>implies that the steps will be “small enough to assure convergence to a finite number”. I tried searching for a proof for #2, but it seemed like it required too much <a href="http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s3_v1_article-04.pdf">machinery</a>. Like, 6 pages of machinery.</p>
<hr />
<p>So why did we decide to set $\alpha_n(a) = \alpha \in (0,1]$? Isn’t that a constant? Wouldn’t we lose our guarantees for convergence?</p>
<p><strong>Yes, we do. But it’s with good reason. We don’t want to converge to a specific value. The optimal action-value is nonstationary.</strong></p>
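<p>As a quick numerical sanity check (not a proof), we can compare partial sums for the two step-size schedules discussed above. The cutoff of one million terms and the value $\alpha = 0.1$ are arbitrary choices for illustration.</p>

```python
import numpy as np

n = np.arange(1, 1_000_001, dtype=float)

# alpha_n = 1/n: the sum keeps growing (like log n), the sum of squares levels off.
harmonic_sum = np.sum(1.0 / n)
harmonic_sq_sum = np.sum(1.0 / n**2)

# alpha_n = 0.1: both sums grow linearly in the number of terms,
# so the second condition fails.
alpha = 0.1
const_sum = alpha * len(n)
const_sq_sum = alpha**2 * len(n)
```

<p>The sample-average schedule $\frac{1}{n}$ satisfies both conditions (its sum of squares stays below $\pi^2/6$), while the constant schedule satisfies only the first one.</p>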
<h2 id="action-selection-rule-optimistic-initial-values">Action Selection Rule: Optimistic Initial Values</h2>
<p>So far, we had to set initial values $Q_1(a)$ pretty arbitrarily. This is essentially a set of <strong>hyperparameters</strong> for initialization. One trick is to set the initial values for $Q_1(a) = C \forall a$, where $C > q_*(a) \forall a$.</p>
<p>This way, the learner’s estimate $Q_n(a)$ will be decreasing at first, prioritizing exploration of all the actions. Once the $Q_n(a)$’s have become close to $q_*(a)$, greedy selection takes over.</p>
<p>One con is that this is <em>not good for nonstationary problems</em>. We are, in a way, doing simulated annealing and with enough time, we’re going to converge. We don’t want that!</p>
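<p>A tiny deterministic experiment shows the effect. This is only a sketch under simplifying assumptions: rewards are noise-free, the agent is purely greedy with sample-mean updates, and the payouts and the function name <code>run_greedy</code> are made up.</p>

```python
import numpy as np

def run_greedy(true_rewards, Q1, steps=50):
    """Purely greedy agent with sample-mean updates on deterministic rewards."""
    k = len(true_rewards)
    Q = np.full(k, Q1, dtype=float)
    N = np.zeros(k, dtype=int)
    for _ in range(steps):
        a = int(np.argmax(Q))             # ties go to the lowest index
        N[a] += 1
        Q[a] += (true_rewards[a] - Q[a]) / N[a]
    return Q, N

true_rewards = [0.1, 0.5, 0.9]            # made-up, noise-free payouts
_, pulls_zero_init = run_greedy(true_rewards, Q1=0.0)
_, pulls_optimistic = run_greedy(true_rewards, Q1=5.0)
```

<p>Starting from $Q_1 = 0$, the agent tries the first arm, sees a positive reward, and never looks at anything else. Starting from $Q_1 = 5$, every arm disappoints once, so all of them get tried before the agent settles on the best one.</p>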
<h2 id="action-selection-rule-upper-confidence-bound-selection">Action Selection Rule: Upper-Confidence-Bound Selection</h2>
<p>Remember <a href="https://oneraynyday.github.io/ml/2017/08/08/Bias-Variance-Tradeoff/">this</a>? I wrote this blog a while ago about bias variance tradeoff, with this as the holy grail result:</p>
\[R(f) \leq [\inf_{f^* \in \mathcal{H}}R(f^*)] + 2 \sqrt{\frac{\log(\frac{2M}{\delta})}{2N}}\]
<p>A quick recap of this is:</p>
<ol>
<li>$R(f)$ is the (theoretical)risk of a hypothesis $f$.</li>
<li>$R(f^*)$ is the minimum risk over all hypotheses $f^*$ in the hypothesis set $\mathcal{H}$.</li>
<li>$M$ is the size of our hypothesis set, $\vert \mathcal{H} \vert$.</li>
<li>$N$ is the number of samples.</li>
<li>$\delta$ is a constant. (If you have to know, it’s roughly the probability that the hypothesis that we choose is bad)</li>
</ol>
<p>What this is saying most importantly, is that:</p>
<ol>
<li>At very low number of samples, our bound is very loose. We don’t know whether our current hypothesis is the best hypothesis.</li>
<li>The bigger our hypothesis set, the more loose our bound is for PAC learning.</li>
</ol>
<p>Now, this transfers over to the reinforcement learning domain as well, in the form of <strong>Upper-Confidence-Bound action selection</strong>:</p>
\[A_t = argmax_a \left[ Q_t(a)+c\sqrt{\frac{\log(t)}{N_t(a)}} \right]\]
<p>Doesn’t this look similar to the inequality above? Some notation:</p>
<ol>
<li>$N_t(a)$ is the number of times action $a$ has been selected prior to time $t$.</li>
<li>$c$ is some constant we choose to control degree of exploration.</li>
</ol>
<p>By analogy, we can say that $t$ plays the role of the hypothesis space size $M$ from our previous equation. This is because as $t$ increases, the space of action sequences $(a_n)$ up to $t$ grows, so it is harder to select an $A_t$. However, as $N_t(a)$, the number of times we’ve selected $a$, increases, we also gain information about how this action behaves.</p>
<p>UCB is a very powerful algorithm, and one can interpret it with a different view of the same problem: greedy choosing vs. regret minimization. UCB gives every action enough exploration before committing to the $argmax$, therefore “minimizing regret”.</p>
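<p>The selection rule is short enough to sketch directly. The helper name <code>ucb_action</code> and the numbers below are made up. Picking untried arms first is a common convention for handling $N_t(a) = 0$, where the bonus would otherwise be undefined.</p>

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a Q[a] + c * sqrt(log(t) / N[a]); untried arms go first."""
    N = np.asarray(N)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(np.asarray(Q) + bonus))

# Made-up state after t = 100 steps: arm 0 looks better, arm 1 is barely explored.
choice = ucb_action(Q=[1.0, 0.8], N=[98, 2], t=100)
```

<p>Even though arm 0 has the higher estimate, arm 1 has only been pulled twice, so its exploration bonus dominates and UCB picks it. With $c = 0$ the rule collapses back to plain greedy selection.</p>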
<h2 id="gradient-bandit-algorithms">Gradient Bandit Algorithms</h2>
<p>We’ve been estimating $q_*(a)$, but what if I said there’s a different interpretation? How about we learn a <strong>preference</strong> for an action?</p>
<p>We’ll call the preference for the action $H_t(a)$. It is not itself an estimate of the reward, as we’ll see. We model $A_t$ with a Gibbs distribution (in machine learning land, researchers call this the softmax distribution):</p>
\[P(A_t = a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} = \pi_t(a)\]
<p>Now we’re in business!</p>
<p>So how do we learn the preferences? We do gradient ascent with respect to $H_t(a)$, because that’s what our variable is.</p>
<p>We want to <strong>maximize</strong> $E(R_t)$. Recall that:</p>
\[q_*(a,t) = E(R_t|A_t = a)\]
\[E(R_t) = \sum_a \pi_t(a)E(R_t|A_t=a) \quad{\text{(Total Expectation)}}\]
\[E(R_t) = \sum_a \pi_t(a)q_*(a,t) \quad{\text{(By definition)}}\]
\[\frac{\partial E(R_t)}{\partial H_t(b)} = \sum_a \frac{\partial \pi_t(a)}{\partial H_t(b)} q_*(a,t) \quad(q \text{ is independent})\]
<p>So now that we have this, our general format for updates should be something like:</p>
\[H_{t+1}(b) = H_t(b) + \frac{\partial E(R_t)}{\partial H_t(b)}\]
<p>… for gradient ascent.</p>
<p>Now, let’s differentiate our gibbs distribution:</p>
\[\begin{align*}
&\frac{\partial \pi_t(a)}{\partial H_t(a)} = \\
&\frac{\partial (\frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}})}{\partial H_t(a)} = \\
&\frac{e^{H_t(a)} * (\sum_b e^{H_t(b)} - e^{H_t(a)})}{(\sum_b e^{H_t(b)})^2} = \\
&\frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} * (1 - \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}}) = \\
& \pi_t(a) * (1-\pi_t(a))
\end{align*}\]
<p>This is only one partial derivative of the whole gradient. What about the actions $b$ where $b \neq a$? I don’t really want to write out the $\LaTeX$ again, so here’s the answer:</p>
\[\frac{\partial \pi_t(a)}{\partial H_t(b)} = -\pi_t(a)\pi_t(b) \forall b \neq a\]
<p>We can thus observe a generalization:</p>
\[\frac{\partial \pi_t(a)}{\partial H_t(b)} = \pi_t(a)(1_{a=b}-\pi_t(b)) \forall a,b\]
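<p>If you’d rather trust a computer than the algebra, the general formula can be checked against finite differences. This is just a sanity-check sketch; the function names and the preference vector <code>H</code> are arbitrary.</p>

```python
import numpy as np

def softmax(H):
    z = np.exp(H - np.max(H))   # subtract the max for numerical stability
    return z / z.sum()

def softmax_jacobian(H):
    """J[a, b] = d pi(a) / d H(b) = pi(a) * (1{a == b} - pi(b))."""
    pi = softmax(H)
    return np.diag(pi) - np.outer(pi, pi)

def numerical_jacobian(H, eps=1e-6):
    """Central finite differences, one column per perturbed preference H(b)."""
    k = len(H)
    J = np.zeros((k, k))
    for b in range(k):
        step = np.zeros(k)
        step[b] = eps
        J[:, b] = (softmax(H + step) - softmax(H - step)) / (2 * eps)
    return J

H = np.array([0.5, -1.0, 2.0])   # arbitrary preferences
analytic = softmax_jacobian(H)
numeric = numerical_jacobian(H)
```

<p>The two matrices should agree up to finite-difference error. Each column of the Jacobian also sums to zero, since the probabilities must keep summing to 1 no matter which preference we nudge.</p>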
<p>Okay, so now we can plug that into the equation:</p>
\[\frac{\partial E(R_t)}{\partial H_t(b)} = \sum_a \frac{\partial \pi_t(a)}{\partial H_t(b)} q_*(a,t) = \sum_a \pi_t(a)(1_{a=b}-\pi_t(b)) q_*(a,t)\]
<p>Now, we do some <strong>black magic</strong>. This is legal:</p>
\[\sum_a \pi_t(a)(1_{a=b}-\pi_t(b)) q_*(a,t) = \sum_a \pi_t(a)(1_{a=b}-\pi_t(b)) (q_*(a,t) - X_t) \forall X_t \in \Re\]
<p>Why is this true? Simply because:</p>
\[\sum_a \pi_t(a)(1_{a=b}-\pi_t(b)) = \pi_t(b) - \pi_t(b)\sum_a\pi_t(a) = \pi_t(b) - \pi_t(b) = 0 \quad{\text{(total probability)}}\]
<p>We’ll introduce one more weird addition:</p>
\[\begin{align*}
&\sum_a \frac{1}{\pi_t(a)}\pi_t(a)^2(1_{a=b}-\pi_t(b)) (q_*(a,t) - X_t) = \\
&E_a( (q_*(a,t) - X_t) * \pi_t(a)(1_{a=b}-\pi_t(b))/\pi_t(a)) = \\
&E_a( (q_*(a,t) - X_t) * (1_{a=b}-\pi_t(b)))
\end{align*}\]
<p>Because $q_*(a,t)$ now sits inside an expectation over $a$, and $q_*(a,t) = E(R_t|A_t = a)$, the law of total expectation states that we can replace it with $R_t$.</p>
<p>One last question: what do we choose for $X_t$? Frankly, $X_t$ can be anything you want as long as it does not depend on the action $a$, but one can set it to the sample average of previous rewards, $\bar{R}_t$, because then we would have an interpretable gradient. You don’t really need this to run the algorithm successfully, but this is chosen for stability, interpretability and historical reasons.</p>
<p>We’re all ready to see our new update rule!</p>
\[H_{t+1}(b) = H_t(b) + E_a( (R_t - \bar{R}_t ) * (1_{a=b}-\pi_t(b)))\]
<p>where $a$ is the action taken at time $t$.</p>
<p>Obviously, finding the expectation is hard, so we do a stochastic update and drop the expectation to get:</p>
\[H_{t+1}(b) = H_t(b) + (R_t - \bar{R}_t ) * (1_{a=b}-\pi_t(b))\]
<p>A simple way to choose the action is to pick $\arg\max_a \pi_t(a)$, although sampling $A_t$ from $\pi_t$, as the model above assumes, keeps the algorithm exploring. So we’re done!</p>
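<p>Putting the pieces together, here is a rough sketch of the stochastic update on a toy bandit. Several things are my own assumptions and not part of the derivation: the Gaussian rewards with unit variance, the step size <code class="language-plaintext highlighter-rouge">alpha</code> (the rule above corresponds to a step size of 1), the fixed seed, and the choice to sample the action from $\pi_t$ so the learner keeps exploring.</p>

```python
import math
import random

def softmax(preferences):
    shift = max(preferences)
    exps = [math.exp(h - shift) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit(true_means, steps=5000, alpha=0.1, seed=0):
    """Toy gradient bandit; returns the final policy pi over the arms."""
    rng = random.Random(seed)
    k = len(true_means)
    H = [0.0] * k          # preferences H_t(a)
    baseline = 0.0         # running average of previous rewards, R-bar
    for t in range(1, steps + 1):
        pi = softmax(H)
        a = rng.choices(range(k), weights=pi)[0]   # sample A_t ~ pi_t
        reward = rng.gauss(true_means[a], 1.0)     # assumed reward model
        for b in range(k):
            indicator = 1.0 if b == a else 0.0
            H[b] += alpha * (reward - baseline) * (indicator - pi[b])
        baseline += (reward - baseline) / t        # incremental sample mean
    return softmax(H)

policy = gradient_bandit([0.0, 1.0, 3.0])
best_arm = policy.index(max(policy))
```

<p>With these toy means, most of the probability mass should end up on the arm with the highest expected reward.</p>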
<h1 id="ending-remarks">Ending Remarks</h1>
<p>Here’s a plot of how these algorithms do compared to each other:</p>
<p><img src="http://oneraynyday.github.io/assets/bandit_algos.png" alt="comparison" /></p>
<p>Although some of these methods are considered simple, they do not perform poorly at all. In fact, these are <em>state of the art</em> methods for many reinforcement learning problems, and some of the ones we’ll learn later will be more complicated and more powerful, but more brittle.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Sutton, Richard S., and Andrew G. Barto. <em>Reinforcement Learning: an Introduction</em>. The MIT Press, 2012. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Reinforcement Learning - Introduction2018-05-03T00:00:00+00:00http://oneraynyday.github.io/ml/2018/05/03/Diving-Into-Reinforcement-Learning<p><em>It’s been a while since I wrote anything to the blog</em>, and it’s partially because life has been going real fast that I haven’t had time to really think about jotting down my thoughts. I’ve been chasing a pineapple-pizza-eating, hot-chocolate-drinking someone so that’s another excuse :-).</p>
<p>Some habits die hard, and I always think about writing another blog, to learn something new and really remember it.</p>
<p>This time, it’ll be about <strong>reinforcement learning</strong>! It’s been a topic at the top of my mind, one of those “magical” things that aren’t really so special once you look at them closely. I’ll be closely following the book: <strong>“Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto</strong>.</p>
<p>As with all “introduction to” books, this book runs the whole gamut, and is totally not just an introduction.</p>
<p>We will tackle this from the simplest models and get more complex from there. I’ve been wanting to learn more about the following (in increasing difficulty):</p>
<h2 id="tabular-solution-methods">Tabular Solution Methods</h2>
<p>This is more of the discrete domain space, faster solve, less gradient descent-y, less convex optimization type algorithms:</p>
<ol>
<li>Multi-armed Bandit Problems</li>
<li>Finite MDP (Have some background on this, but review’s nice)</li>
<li>DP methods (Have some introductory background on this)</li>
<li>Monte Carlo Methods (Sounds fun!)</li>
<li>Temporal-Difference Learning (Never heard of it)</li>
<li>n-step Bootstrapping (eh, bootstrapping is in every book)</li>
</ol>
<h2 id="approxixmate-solution-methods">Approximate Solution Methods</h2>
<p>This is more of the “sexy”, gradient-based, eigenvalue, continuous domain, more convex (or even non-convex) optimization type algorithms:</p>
<ol>
<li>On-policy prediction (SGD/linear methods/kernels/ANN’s)</li>
<li>Off-policy prediction (No idea what this is)</li>
<li>Eligibility Traces (No idea what this is)</li>
<li>Policy gradient (very fancy, new, sexy formulation of an old method)</li>
</ol>
<p>And if we’re feeling frisky:</p>
<h2 id="frontiers">Frontiers</h2>
<ol>
<li>General value functions</li>
<li>Designing reward signals</li>
<li>AlphaGo</li>
</ol>
<p>Expect to be seeing more soon! Whoever is reading this should keep me accountable. I’ll try to get one thing done every one or two weeks.</p>
<p>In the end, I’ll try my best to write some basic library like last time, but this time for RL.</p>
Evaluation Strategy2017-12-13T00:00:00+00:00http://oneraynyday.github.io/cs/2017/12/13/Call-By<h1 id="evaluation-strategy">Evaluation Strategy</h1>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#evaluation-strategy" id="markdown-toc-evaluation-strategy">Evaluation Strategy</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#call-by-value" id="markdown-toc-call-by-value">Call By Value</a></li>
<li><a href="#call-by-reference" id="markdown-toc-call-by-reference">Call By Reference</a></li>
<li><a href="#call-by-result" id="markdown-toc-call-by-result">Call By Result</a></li>
<li><a href="#call-by-value-result" id="markdown-toc-call-by-value-result">Call By Value-Result</a></li>
<li><a href="#call-by-name" id="markdown-toc-call-by-name">Call By Name</a></li>
<li><a href="#call-by-need" id="markdown-toc-call-by-need">Call By Need</a></li>
</ul>
<h1 id="call-by-value">Call By Value</h1>
<p>Copies the arguments into the callee.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Immutability via copying.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>If values are large, copying is slow.</li>
<li>If we try to pass by address instead, it leads to unreliable code.</li>
</ul>
<h1 id="call-by-reference">Call By Reference</h1>
<p>Passes arguments as references into callee.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Large objects do not need to be copied.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Causes trouble with shared variables, because
now all objects have state that the callee can change. Especially bad with heap variables.</li>
</ul>
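<p>To make the contrast between the two strategies above concrete, here’s a small sketch. Python itself passes references to objects (often called call by sharing), so this is only an approximation: mutating the argument in place stands in for call by reference, and an explicit <code class="language-plaintext highlighter-rouge">copy.deepcopy</code> stands in for call by value. The function names are hypothetical.</p>

```python
import copy

def append_by_reference(items):
    # The callee mutates the caller's object through the shared reference.
    items.append(99)

def append_by_value(items):
    # Emulate call by value: work on a private copy of the argument.
    items = copy.deepcopy(items)
    items.append(99)
    return items

shared = [1, 2, 3]
append_by_reference(shared)         # the caller sees the mutation

original = [1, 2, 3]
copied = append_by_value(original)  # the caller's list is untouched
```

<p>The copy is what buys the immutability, and it is also exactly the cost that grows with the size of the value.</p>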
<h1 id="call-by-result">Call By Result</h1>
<p>Passes arguments into the function, with 1 of them being a reference to the result that the function fills in.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Useful if you want multiple return values (e.g. answer, error codes, etc.)</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>The caller needs to do extra work, and the API gets more complicated.</li>
</ul>
<h1 id="call-by-value-result">Call By Value-Result</h1>
<p>Same as call by result, but the parameter that will contain the result is also copied in, and is used as an ordinary variable before the answer is filled in.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Saving space, maybe?</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Really complicates the strategy</li>
</ul>
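<p>Where value-result really differs from call by reference is aliasing. The sketch below is an emulation under my own assumptions: one-element lists play the role of memory cells, and the copy-out happens left to right when the call ends. Passing the same cell twice exposes the difference.</p>

```python
def bump_by_reference(cell_a, cell_b):
    # Both parameters alias the caller's storage, so every write lands at once.
    cell_a[0] += 1
    cell_b[0] += 10

def bump_by_value_result(cell_a, cell_b):
    # Copy in: the callee works on private local copies.
    a, b = cell_a[0], cell_b[0]
    a += 1
    b += 10
    # Copy out: results are written back when the call ends, left to right.
    cell_a[0] = a
    cell_b[0] = b

x = [0]
bump_by_reference(x, x)      # the same cell is passed twice
y = [0]
bump_by_value_result(y, y)   # the last copy-out wins
```

<p>By reference, both increments accumulate in the one cell. By value-result, each local copy starts from the old value, and whichever is copied out last overwrites the other.</p>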
<h1 id="call-by-name">Call By Name</h1>
<p>Call by name sends the full, unevaluated expressions into the function as its arguments.</p>
<p>For something like:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">f</span><span class="p">(...</span> <span class="n">a</span><span class="p">,</span> <span class="p">...</span> <span class="n">b</span><span class="p">){</span>
<span class="p">...</span>
<span class="p">}</span>
<span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p>We are legitimately sending an rvalue expression (x+1) that is <em>not yet evaluated</em>, and an lvalue expression (x) that is also not yet evaluated (we don’t know its address yet) into f.</p>
<p>To get the value out, we kind of treat them as lambdas: <code class="language-plaintext highlighter-rouge">a</code> will give us an rvalue, and <code class="language-plaintext highlighter-rouge">b</code> will give us an lvalue, i.e. an address. We will be updating whatever <code class="language-plaintext highlighter-rouge">x</code> we were using, as if it was passed by reference.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Delays evaluation, so control flow can optimize evaluation scheme.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Really inefficient in real life, since the expression is re-evaluated every time the parameter is used.</li>
</ul>
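<p>A rough way to emulate call by name is to pass zero-argument lambdas (thunks) and call them at each use. This is only a sketch: the helper names are hypothetical, and it covers just the rvalue side, not assignment through the parameter.</p>

```python
evaluations = 0

def expensive():
    global evaluations
    evaluations += 1   # count how often the argument expression runs
    return 42

def use_twice(thunk):
    # Call by name: the argument expression is re-evaluated at every use.
    return thunk() + thunk()

def use_never(thunk):
    # The argument is never forced, so its expression never runs.
    return 0

total = use_twice(lambda: expensive())   # runs expensive() twice
skipped = use_never(lambda: 1 // 0)      # no ZeroDivisionError is raised
```

<p>This shows both sides of the trade-off: an unused argument costs nothing, while a reused one pays for every use.</p>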
<h1 id="call-by-need">Call By Need</h1>
<p>The same as call by name, but <strong>memoizes</strong> the results of the computation. <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are bound to the results after their first evaluation.</p>
<p>However, this gives different side-effects:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">f</span><span class="p">(</span><span class="n">by</span><span class="o">-</span><span class="n">need</span> <span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="n">by</span><span class="o">-</span><span class="n">need</span> <span class="kt">int</span> <span class="n">b</span><span class="p">){</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span> <span class="c1">// 'i = i+1' -> gives us &i = 4.</span>
<span class="c1">// a -> 4</span>
<span class="c1">// b -> &i</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span> <span class="c1">// Already cached: &i = 4.</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">g</span><span class="p">(){</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="k">return</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>Pros:</strong></p>
<ul>
<li>Caches, so it is much faster than call-by-name</li>
<li>Doesn’t crash when the crashing parameter isn’t evaluated.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Still hard to implement.</li>
</ul>
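<p>Call by need can be sketched as a thunk with a cache. The <code class="language-plaintext highlighter-rouge">Lazy</code> class below is a hypothetical helper of my own, not a standard library type: it evaluates the wrapped expression at most once and hands back the cached result afterwards.</p>

```python
class Lazy:
    """A memoized thunk: evaluate on first use, reuse the result afterwards."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None   # the closure is no longer needed
        return self._value

calls = []

def noisy():
    calls.append("evaluated")      # side effect lets us count evaluations
    return 7

arg = Lazy(noisy)
result = arg.force() + arg.force()  # the second force hits the cache
```

<p>The side effect in <code class="language-plaintext highlighter-rouge">noisy</code> happens once, which is exactly why call by need and call by name can behave differently when arguments have side effects.</p>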
Memory Management And Garbage Collection2017-12-09T00:00:00+00:00http://oneraynyday.github.io/cs/2017/12/09/Memory-And-GC<h1 id="memory-management-and-garbage-collection">Memory Management And Garbage Collection</h1>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#memory-management-and-garbage-collection" id="markdown-toc-memory-management-and-garbage-collection">Memory Management And Garbage Collection</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#heaps" id="markdown-toc-heaps">Heaps</a> <ul>
<li><a href="#example-first-fit" id="markdown-toc-example-first-fit">Example: First-fit</a></li>
</ul>
</li>
<li><a href="#quick-lists" id="markdown-toc-quick-lists">Quick Lists</a></li>
<li><a href="#heap-links" id="markdown-toc-heap-links">Heap Links</a></li>
<li><a href="#heap-compaction" id="markdown-toc-heap-compaction">Heap Compaction</a></li>
<li><a href="#garbage-collection" id="markdown-toc-garbage-collection">Garbage Collection</a> <ul>
<li><a href="#mark-and-sweep-gc" id="markdown-toc-mark-and-sweep-gc">Mark-and-sweep GC</a></li>
<li><a href="#copying-gc" id="markdown-toc-copying-gc">Copying GC</a></li>
<li><a href="#reference-counting-gc" id="markdown-toc-reference-counting-gc">Reference Counting GC</a></li>
<li><a href="#going-meta-generational-gc" id="markdown-toc-going-meta-generational-gc">Going Meta: Generational GC</a></li>
</ul>
</li>
<li><a href="#comprehensive-comparison-of-gcs" id="markdown-toc-comprehensive-comparison-of-gcs">Comprehensive comparison of GC’s:</a></li>
</ul>
<h1 id="heaps">Heaps</h1>
<p>Many languages have unordered runtime memory allocation, and usually we hear this being called a “heap”, rather than a “stack”.</p>
<p>In C/C++, we hear “heap” and <code class="language-plaintext highlighter-rouge">new</code>, <code class="language-plaintext highlighter-rouge">malloc</code>, etc associated. However, <strong>what exactly is the heap</strong>?</p>
<p>The heap is <strong>a pool of blocks of memory, with an interface for unordered alloc/dealloc</strong>.</p>
<p>Heaps are not what you’d think of in a CS data structures class - those are binary/binomial/fibonacci heaps, etc. This is simply a
blob of memory being used in some policy. What’s the policy? There are a ton out there:</p>
<h2 id="example-first-fit">Example: First-fit</h2>
<p>The pseudocode is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">alloc</span><span class="p">(</span><span class="n">size</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">FREE_BLOCK</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">>=</span> <span class="n">size</span><span class="p">:</span>
            <span class="c1"># Sets these memory blocks unavailable. Splits the block.
</span>            <span class="n">allocate</span><span class="p">(</span><span class="n">FREE_BLOCK</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span>  <span class="c1"># starting pointer
</span>    <span class="k">return</span> <span class="bp">None</span>  <span class="c1"># returns NULL, as in unsuccessful.
</span></code></pre></div></div>
<p>We scan from the beginning towards the end until we see a free block that’s big enough, and then we split it.</p>
<p>Similarly, when we deallocate a block, we need to coalesce it with its neighboring free blocks. This way, we can permit allocation of bigger blocks (if the available
memory regions are like <code class="language-plaintext highlighter-rouge">[4,0,4,4]</code>, we can’t allocate 8, but if we coalesce into <code class="language-plaintext highlighter-rouge">[4,0,8]</code>, we can).</p>
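<p>Here is a small runnable sketch of first-fit with splitting and coalescing. The representation is my own assumption, not something a real allocator would use: the free list is a sorted Python list of <code class="language-plaintext highlighter-rouge">(start, length)</code> pairs, and the class and method names are hypothetical.</p>

```python
class FirstFitHeap:
    def __init__(self, size):
        self.free_list = [(0, size)]   # sorted (start, length) free blocks
        self.used = {}                 # start -> length of allocated blocks

    def alloc(self, size):
        for i, (start, length) in enumerate(self.free_list):
            if length >= size:
                if length == size:
                    del self.free_list[i]          # exact fit, no split
                else:
                    # Split: the tail of the block stays free.
                    self.free_list[i] = (start + size, length - size)
                self.used[start] = size
                return start
        return None                    # no free block is big enough

    def dealloc(self, start):
        self.free_list.append((start, self.used.pop(start)))
        self.free_list.sort()
        merged = []
        for s, l in self.free_list:
            # Coalesce with the previous free block when they touch.
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + l)
            else:
                merged.append((s, l))
        self.free_list = merged

heap = FirstFitHeap(16)
a, b, c, d = (heap.alloc(4) for _ in range(4))   # heap is now full
for block in (a, c, d):                          # free sizes: [4, 0, 4, 4]
    heap.dealloc(block)
big = heap.alloc(8)   # fits only because c and d were coalesced
```

<p>After the three deallocations the free list holds one block of 4 and one coalesced block of 8, which mirrors the <code class="language-plaintext highlighter-rouge">[4,0,8]</code> situation described above.</p>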
<h1 id="quick-lists">Quick Lists</h1>
<p>It’s very common that small blocks get allocated and deallocated more often than larger blocks. Thus, we usually keep a <strong>quick list</strong>, in which all blocks are the same size(1024 for example), and allocate one object in each block.
Although this leads to a lot of waste, since the objects are small this is usually not a big issue.
The most important result of this is that it’s fast. It does not need coalescing, and all blocks are of the same size which is convenient.</p>
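<p>A quick list is simple enough to sketch in a few lines. The class below is a hypothetical illustration: it hands out fixed-size blocks from a stack of free addresses, so both operations are constant time and no splitting or coalescing is ever needed.</p>

```python
class QuickList:
    def __init__(self, block_size, count):
        self.block_size = block_size
        # Every block has the same size, so an address is all we track.
        self.free_blocks = [i * block_size for i in range(count)]

    def alloc(self):
        # Pop any free block; None signals that the list is exhausted.
        return self.free_blocks.pop() if self.free_blocks else None

    def dealloc(self, address):
        # No coalescing: the block simply goes back on the list.
        self.free_blocks.append(address)

quick = QuickList(block_size=1024, count=2)
first = quick.alloc()
second = quick.alloc()
exhausted = quick.alloc()   # both blocks are taken
quick.dealloc(first)
reused = quick.alloc()      # the freed block is handed out again
```

<p>An object smaller than the block size still takes a whole block, which is the waste mentioned above.</p>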
<h1 id="heap-links">Heap Links</h1>
<p>These data structures are used for both <strong>heap compaction</strong> and <strong>garbage collection</strong>.</p>
<p>Specifically, <strong>a heap link is a memory location where a value is stored and the program will use it as a heap address</strong>.</p>
<p>The heap links come from a base set, which consists of all pointers created by the main subroutine.
The rest of the links are found recursively, by following objects that hold their own heap pointers as members, and so on.</p>
<p>However, heap links can be prone to errors (false positives, false negatives, etc.):</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">union</span> <span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">tag</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
<span class="p">}</span> <span class="n">x</span><span class="p">;</span>
</code></pre></div></div>
<p>Is this a heap link if it’s instantiated? It could either be a heap allocated block or a stack allocated array of 4 characters.
To prevent any false negatives, we say that this is in fact a heap link.</p>
<h1 id="heap-compaction">Heap Compaction</h1>
<p><strong>Heap compaction moves all allocated blocks together at one end of the heap, and removes fragmentation in the heap.</strong> To do this, we need to know what are
all the current heap links and then simply copy all the contents into the beginning of the heap, and update the pointer values on the heap links.</p>
<p>However, we can’t guarantee that we can do heap compaction on all languages. For example C cannot avoid inclusion errors(false positives) via the union example,
and thus we could be moving values that are not actually alive, and fundamentally change the value of the variable.</p>
<h1 id="garbage-collection">Garbage Collection</h1>
<p>Garbage collection is a feature that exists in some languages for using heap links to determine the lifetime of a dynamically allocated variable. It deallocates any variables that are not reachable from the root set.</p>
<p>There are 2 problems that garbage collection tackles:</p>
<ol>
<li>Dangling pointer dereferencing - Creating a pointer that eventually points to an invalid address in heap space.</li>
<li>Memory leak - Allocating a resource on the heap and not freeing it after we are done using it.</li>
</ol>
<p>As a result, some languages replace any semantics for freeing the objects with its own GC. There are 3 normal types of garbage collection:</p>
<h2 id="mark-and-sweep-gc">Mark-and-sweep GC</h2>
<p>Mark and sweep carries out the collection in 2 steps:</p>
<ol>
<li>Garbage collector marks heap links that are currently being used.</li>
<li>Garbage collector passes over the heap and deallocates any blocks that are not marked.</li>
</ol>
<p>As one can see, this GC does not reduce fragmentation, because it simply deallocates any blocks that are not marked, and leaves the surviving blocks in their original positions.
However, it also means that we won’t be accidentally changing contents of our values by moving the blocks(like the issue that heap compaction faces).</p>
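<p>The two steps can be sketched on a toy heap. The model here is an assumption for illustration only: the heap is a dictionary from block address to the list of heap links that block holds, and the root set is a plain list of addresses.</p>

```python
def mark(roots, heap):
    """Return the set of addresses reachable from the root set."""
    marked = set()
    stack = list(roots)
    while stack:
        address = stack.pop()
        if address in marked or address not in heap:
            continue
        marked.add(address)
        stack.extend(heap[address])   # follow this block's heap links
    return marked

def sweep(heap, marked):
    """Deallocate every block that was not marked."""
    for address in list(heap):
        if address not in marked:
            del heap[address]

# address -> heap links stored inside that block
heap = {0: [8], 8: [], 16: [24], 24: [16]}
roots = [0]
sweep(heap, mark(roots, heap))
```

<p>Blocks 16 and 24 only point at each other, so they are unreachable and get swept. The survivors keep their addresses, which is why fragmentation is left as it was.</p>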
<h2 id="copying-gc">Copying GC</h2>
<p>Copying collector only uses half of its available memory at a time. It fills up one side of its memory, then it uses heap links to find non-garbage blocks to copy over to the other side of the memory. As a result, it can also perform heap compaction at the same time.</p>
<p>Obviously, this also suffers from the same issue of heap compaction, and needs to do extra work to copy non-garbage blocks.</p>
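<p>Below is a rough sketch of a copying collector on the same kind of toy heap. It is written in the style of a breadth-first scan over the new space; the data layout (a dictionary for the old half, a list for the new half, blocks that contain only links) is my own simplification.</p>

```python
def copy_collect(roots, from_space):
    """Copy live blocks into a fresh space and rewrite links to new addresses."""
    to_space = []      # index in this list is the block's new address
    forwarding = {}    # old address -> new address

    def forward(old):
        if old not in forwarding:
            forwarding[old] = len(to_space)
            to_space.append(list(from_space[old]))  # links still use old addresses
        return forwarding[old]

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):
        # Fix up the links of each copied block, copying targets as needed.
        to_space[scan] = [forward(old) for old in to_space[scan]]
        scan += 1
    return new_roots, to_space

from_space = {3: [7], 5: [5], 7: [3]}   # block 5 only points to itself
new_roots, to_space = copy_collect([7], from_space)
```

<p>The live blocks land next to each other at addresses 0 and 1, so compaction comes for free. The price is that every link had to be rewritten, which is only safe if every link really is a link.</p>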
<h2 id="reference-counting-gc">Reference Counting GC</h2>
<p>All other GC’s need heap links, but this one does not! In a refcount GC, each heap allocated object stores a count of how many references are pointing to it.
When the count goes to 0, it is marked for deletion.</p>
<p>However, one issue with refcount GC’s is that you can have cycles of garbage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A - B
\ /
C
</code></pre></div></div>
<p>In this situation, all 3 objects could have no links except to themselves. In this case, none of these objects are actually GC’d.</p>
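<p>There are ways to get around this. In C++ one can model the refcounted ownership with <code class="language-plaintext highlighter-rouge">shared_ptr</code>, and use <code class="language-plaintext highlighter-rouge">weak_ptr</code> for the links that would otherwise close a loop, since a <code class="language-plaintext highlighter-rouge">weak_ptr</code> does not add to the count.</p>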
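<p>The cycle problem is easy to observe in CPython, which uses reference counting plus a separate cycle detector. The sketch below is specific to CPython (other Python implementations manage memory differently): it switches the cycle detector off, builds a two-object loop, and uses a weak reference as a probe that does not keep the object alive. The <code class="language-plaintext highlighter-rouge">Node</code> class is hypothetical.</p>

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

gc.disable()                   # leave only reference counting active
a, b = Node(), Node()
a.other, b.other = b, a        # a and b now form a cycle
probe = weakref.ref(a)         # observes a without adding to its count
del a, b                       # no outside references remain
leaked = probe() is not None   # the counts never reached zero
gc.collect()                   # the cycle detector reclaims the pair
reclaimed = probe() is None
gc.enable()
```

<p>With only the counts to go on, the pair would stay allocated forever; it takes the tracing pass to notice that nothing outside the loop can reach it.</p>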
<h2 id="going-meta-generational-gc">Going Meta: Generational GC</h2>
<p>Generational GC’s take into account that old objects tend to live long, and young objects tend to die fast.</p>
<p>As a result, they have divided heaps for young, midlife, and old objects, where some arbitrary GC algorithm can be conducted on the young heap more often than that of the older ones.</p>
<p>In addition, when an object has lived in the young heap for long enough, it can be copied over to the older heaps.</p>
<h1 id="comprehensive-comparison-of-gcs">Comprehensive comparison of GC’s:</h1>
<ol>
<li>Mark and sweep:
<ul>
<li>does not support real time(locks entire program to do mark and sweep).</li>
<li>allows false positives.</li>
<li>does not manage fragmentation.</li>
<li>does 2 cycles through to mark and sweep.</li>
</ul>
</li>
<li>Copying:
<ul>
<li>does not support real time(locks entire program to copy content).</li>
<li>requires there to be no false positives.</li>
<li>heap compaction supported.</li>
<li>Uses only half of available memory.</li>
</ul>
</li>
<li>Refcounting:
<ul>
<li>does support real time.</li>
<li>allows false positives.</li>
<li>does not manage fragmentation.</li>
<li>Leads to garbage cycles.</li>
</ul>
</li>
</ol>
What Are Continuations?2017-12-09T00:00:00+00:00http://oneraynyday.github.io/cs/2017/12/09/Continuations<h1 id="what-are-continuations">What Are Continuations</h1>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#what-are-continuations" id="markdown-toc-what-are-continuations">What Are Continuations</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#what-is-a-continuation" id="markdown-toc-what-is-a-continuation">What is a Continuation?</a></li>
<li><a href="#using-continuations-as-currying" id="markdown-toc-using-continuations-as-currying">Using Continuations as Currying</a></li>
<li><a href="#using-continuations-as-returns" id="markdown-toc-using-continuations-as-returns">Using Continuations as Returns</a></li>
<li><a href="#green-threads-and-generators" id="markdown-toc-green-threads-and-generators">Green Threads and Generators</a></li>
</ul>
<p>Continuations in scheme are quite tricky. Here we will try to delve deep into how they work.</p>
<h1 id="what-is-a-continuation">What is a Continuation?</h1>
<p>A continuation is “just about to return from call-with-current-continuation”, and it represents the rest of the computation.</p>
<p>In other words, a continuation is an <strong>arbitrary point in a program where a value is expected</strong>.</p>
<p>Continuations essentially save the state of the current program (registers, ip/ep, etc.), and halt execution at that point.</p>
<h1 id="using-continuations-as-currying">Using Continuations as Currying</h1>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">+</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span> <span class="c1">; we need to fill in a b with values! These are the "arbitrary points"</span>
</code></pre></div></div>
<p>How do we fill in these points? We can either pass in values, or more interestingly, we can kind of curry this function by passing in 2 initially.</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">define</span> <span class="nv">handle</span> <span class="no">#f</span><span class="p">)</span> <span class="c1">; define a variable bound to continuation.</span>
<span class="c1">; pass in 2 as the first argument always</span>
<span class="p">(</span><span class="nb">+</span> <span class="mi">2</span>
<span class="c1">; this call/cc routine substitutes the 2nd </span>
<span class="c1">; argument of the function.</span>
<span class="c1">; We expect the argument to be a procedure that </span>
<span class="c1">; takes 1 argument - the continuation.</span>
<span class="p">(</span><span class="nb">call/cc</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">k</span><span class="p">)</span> <span class="c1">; k is the continuation</span>
<span class="c1">; We bind handle to the continuation point.</span>
<span class="c1">; and then we return the value 2.</span>
<span class="p">(</span><span class="k">set!</span> <span class="nv">handle</span> <span class="nv">k</span><span class="p">)</span> <span class="mi">2</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
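<p>Here, k is the continuation point, and we set handle to k.
The expression itself evaluates to 4 this first time, since the lambda returns 2, but we also get back a handle that, once called with a value <code class="language-plaintext highlighter-rouge">x</code>, resumes the pending addition and gives us <code class="language-plaintext highlighter-rouge">(+ 2 x)</code>.</p>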
<p>We could’ve done something like:</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">define</span> <span class="nv">handle</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">x</span><span class="p">)</span> <span class="p">(</span><span class="nb">+</span> <span class="mi">2</span> <span class="nv">x</span><span class="p">)))</span>
</code></pre></div></div>
<p>but there are more uses for continuations.</p>
<h1 id="using-continuations-as-returns">Using Continuations as Returns</h1>
<p>In Scheme, we don’t have “returns”, and so for example if we wanted to search for an element in a list that matches a criterion,
we would have to recurse down to that element, and then recurse back up the function stack to return the value. Wouldn’t it be nice
if we had a short-circuit?</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; Does not compile!</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">search</span> <span class="nv">want?</span> <span class="nv">lst</span><span class="p">)</span>
<span class="p">(</span><span class="nb">for-each</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">e</span><span class="p">)</span>
<span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nf">want?</span> <span class="nv">e</span><span class="p">)</span> <span class="p">(</span><span class="nf">return</span> <span class="nv">e</span><span class="p">)))</span> <span class="nv">lst</span><span class="p">)</span> <span class="c1">; return the element early!</span>
<span class="no">#f</span> <span class="c1">; no results</span>
<span class="p">)</span>
</code></pre></div></div>
<p>This would be really nice, because without an early return <code class="language-plaintext highlighter-rouge">for-each</code> goes through every single element, even after a match has been found.</p>
<p>We can use <code class="language-plaintext highlighter-rouge">call/cc</code> to emulate the return though!</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; Does compile!</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">search</span> <span class="nv">want?</span> <span class="nv">lst</span><span class="p">)</span>
<span class="p">(</span><span class="nb">call/cc</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">return</span><span class="p">)</span> <span class="c1">; here we define the return value.</span>
<span class="p">(</span><span class="nb">for-each</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">e</span><span class="p">)</span>
<span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nf">want?</span> <span class="nv">e</span><span class="p">)</span> <span class="p">(</span><span class="nf">return</span> <span class="nv">e</span><span class="p">)))</span> <span class="nv">lst</span><span class="p">)</span> <span class="c1">; return the element early!</span>
<span class="no">#f</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>What happens here is that the moment we find the value via <code class="language-plaintext highlighter-rouge">want? e</code>,
we send the <code class="language-plaintext highlighter-rouge">e</code> into <code class="language-plaintext highlighter-rouge">return</code>, which is the continuation itself: a procedure that takes the value to hand back.</p>
<p>The entire subroutine just stops right there, and <code class="language-plaintext highlighter-rouge">call/cc</code> just returns the element.
All the registers and instruction pointers are saved at that snapshot, with the <code class="language-plaintext highlighter-rouge">%rax</code>
register holding the value of <code class="language-plaintext highlighter-rouge">e</code>.</p>
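<p>Python has no <code class="language-plaintext highlighter-rouge">call/cc</code>, but the early-return use case can be imitated with an exception. To be clear, this is only a one-shot, escape-only stand-in that I made up for illustration (the names <code class="language-plaintext highlighter-rouge">call_ec</code> and <code class="language-plaintext highlighter-rouge">search</code> are hypothetical); a real continuation can also be re-entered later, which this cannot do.</p>

```python
def call_ec(body):
    """Call body(escape); calling escape(v) makes call_ec return v at once."""
    class Escape(Exception):
        pass   # a fresh class per call, so nested escapes don't collide

    def escape(value):
        raise Escape(value)

    try:
        return body(escape)
    except Escape as exc:
        return exc.args[0]

def search(want, items):
    visited = []   # records how far the loop got before escaping

    def body(ret):
        for e in items:
            visited.append(e)
            if want(e):
                ret(e)      # jump straight out of the loop
        return False        # no results

    return call_ec(body), visited

found, seen = search(lambda e: e > 2, [1, 2, 3, 4, 5])
missing, seen_all = search(lambda e: e > 10, [1, 2])
```

<p>The loop stops at the first match, so the elements after it are never visited.</p>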
<h1 id="green-threads-and-generators">Green Threads and Generators</h1>
<p>A green thread is a thread that’s on the user-level, and cannot reap the benefits of parallelism.
Its usage is usually for multiplexing between coroutines.</p>
<p>One can implement green threads using continuations! A master clock can call a ton of coroutines using continuations, by aggregating
a list of them, and calling each one in order.</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nf">thread1</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">; Each thread is a call/cc return value.</span>
<span class="p">(</span><span class="nf">thread2</span> <span class="mi">2</span><span class="p">)</span>
<span class="o">...</span>
<span class="p">(</span><span class="nf">threadN</span> <span class="nv">N</span><span class="p">)</span>
<span class="p">(</span><span class="nf">thread1</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">; back to the beginning! Multiplex N threads.</span>
</code></pre></div></div>
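<p>The same round-robin idea can be sketched in Python, using generators as a stand-in for saved continuations: each <code class="language-plaintext highlighter-rouge">yield</code> suspends a task and lets the scheduler resume it later. The scheduler and worker names are hypothetical, and this is a sketch of the multiplexing pattern, not of how Scheme implements it.</p>

```python
from collections import deque

def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i))
        yield                  # hand control back to the scheduler

def run_round_robin(tasks):
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)         # resume the task where it left off
            queue.append(task) # not finished: back of the line
        except StopIteration:
            pass               # finished tasks are dropped

log = []
run_round_robin([worker("t1", 2, log), worker("t2", 3, log)])
```

<p>Everything runs on a single OS thread, so the tasks interleave but never run in parallel.</p>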
<p>Generators are also similar - in fact, in Python, there is very little difference between the asyncio coroutines and
Python generators. You can iterate through a list without having to hold all of it:</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; Nice try! But not there.</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">iter</span> <span class="nv">lst</span><span class="p">)</span>
<span class="p">(</span><span class="nb">call/cc</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">return</span><span class="p">)</span>
<span class="p">(</span><span class="nf">return</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">lst</span><span class="p">))</span>
<span class="p">(</span><span class="nf">iter</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">lst</span><span class="p">)))))</span>
<span class="p">(</span><span class="k">define</span> <span class="nv">x</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">))</span>
<span class="p">(</span><span class="nf">iter</span> <span class="nv">x</span><span class="p">)</span> <span class="c1">; 1</span>
<span class="p">(</span><span class="nf">iter</span> <span class="nv">x</span><span class="p">)</span> <span class="c1">; 1</span>
<span class="p">(</span><span class="nf">iter</span> <span class="nv">x</span><span class="p">)</span> <span class="c1">; 1</span>
</code></pre></div></div>
<p>This gives us a generator with no state: every call restarts from the head of the list and returns 1.</p>
<p>To capture state, we need to have a state variable.</p>
<div class="language-scheme highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; Stolen from http://danielfm.me/posts/why-are-continuations-so-darn-cool.html</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">iter</span> <span class="nv">lst</span><span class="p">)</span>
<span class="c1">;; Defines `state` as being a function that starts the</span>
<span class="c1">;; iteration via `for-each`</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">state</span> <span class="nv">return</span><span class="p">)</span>
<span class="p">(</span><span class="nb">for-each</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nf">item</span><span class="p">)</span>
<span class="c1">;; Here, we capture the continuation that represents the</span>
<span class="c1">;; current state of the iteration</span>
<span class="p">(</span><span class="nf">let/cc</span> <span class="nv">item-cc</span>
<span class="c1">;; Before the item is yielded, we update `state` to</span>
<span class="c1">;; `item-cc` so the computation is resumed the next</span>
<span class="c1">;; time the generator is called</span>
<span class="p">(</span><span class="k">set!</span> <span class="nv">state</span> <span class="nv">item-cc</span><span class="p">)</span>
<span class="c1">;; Yields the current item to the caller</span>
<span class="p">(</span><span class="nf">return</span> <span class="nv">item</span><span class="p">)))</span>
<span class="nv">lst</span><span class="p">)</span>
<span class="c1">;; Yields 'done when the list is exhausted</span>
<span class="p">(</span><span class="nf">return</span> <span class="ss">'done</span><span class="p">))</span>
<span class="c1">;; Returns a function that calls the stored `state` with the</span>
<span class="c1">;; current continuation so we can yield one item at a time</span>
<span class="p">(</span><span class="k">define</span> <span class="p">(</span><span class="nf">generator</span><span class="p">)</span>
<span class="p">(</span><span class="nb">call/cc</span> <span class="nv">state</span><span class="p">))</span>
<span class="nv">generator</span><span class="p">)</span>
<span class="p">(</span><span class="k">define</span> <span class="nv">x</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">))</span>
<span class="p">(</span><span class="k">define</span> <span class="nv">next</span> <span class="p">(</span><span class="nf">iter</span> <span class="nv">x</span><span class="p">))</span>
<span class="p">(</span><span class="nf">next</span><span class="p">)</span> <span class="c1">; 1</span>
<span class="p">(</span><span class="nf">next</span><span class="p">)</span> <span class="c1">; 2</span>
<span class="p">(</span><span class="nf">next</span><span class="p">)</span> <span class="c1">; 3</span>
</code></pre></div></div>
Networking Refresher2017-11-27T00:00:00+00:00http://oneraynyday.github.io/dev/2017/11/27/Networking-Refresher<h1 id="networking-refresher">Networking Refresher</h1>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#networking-refresher" id="markdown-toc-networking-refresher">Networking Refresher</a></li>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#tcpip-protocols" id="markdown-toc-tcpip-protocols">TCP/IP Protocols</a> <ul>
<li><a href="#https" id="markdown-toc-https">HTTP(S)</a></li>
<li><a href="#ftp" id="markdown-toc-ftp">FTP</a></li>
<li><a href="#smtp" id="markdown-toc-smtp">SMTP</a></li>
</ul>
</li>
<li><a href="#udp-protocols" id="markdown-toc-udp-protocols">UDP Protocols</a> <ul>
<li><a href="#dns" id="markdown-toc-dns">DNS</a></li>
<li><a href="#p2p-basics" id="markdown-toc-p2p-basics">P2P Basics</a> <ul>
<li><a href="#comparison-of-p2p--clientserver" id="markdown-toc-comparison-of-p2p--clientserver">Comparison of P2P & Client/Server</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="tcpip-protocols">TCP/IP Protocols</h1>
<h2 id="https">HTTP(S)</h2>
<ul>
<li>Format of web content</li>
<li>Sends content to client without maintaining state information. Considered a “PULL” protocol.</li>
<li>HTTP 1.0 is 1 TCP connection, setup/teardown, per object in web page.</li>
<li>HTTP 1.1 is 1 TCP connection for all objects in web page sequentially.</li>
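<li>HTTP 2.0 multiplexes many concurrent request/response streams over a single TCP connection. (Pipelining, i.e. sending several requests without waiting for each response, was already introduced in HTTP 1.1.)</li>
<li>HTTPS is HTTP running over SSL/TLS. The server presents a certificate, and public-key cryptography is used during the handshake to set up an encrypted channel between client and server.</li>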
</ul>
<h2 id="ftp">FTP</h2>
<ul>
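<li>Uses 2 TCP connections: 1 for control (commands and state keeping), 1 for the data stream. Because control information travels separately from the data, it is said to be sent <strong>out of band</strong>.</li>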
<li>Main file transfer protocol</li>
</ul>
<h2 id="smtp">SMTP</h2>
<ul>
<li>Main mail <strong>sending</strong> protocol
<ul>
<li>Mail stays in sender’s server if receiver’s mail server is down.</li>
</ul>
</li>
<li>Considered a “PUSH” protocol.</li>
</ul>
<h1 id="udp-protocols">UDP Protocols</h1>
<h2 id="dns">DNS</h2>
<ul>
<li>Translates hostname to IP.
<ul>
<li>Single domain name can have many IP’s - distributed service, or aliasing.</li>
</ul>
</li>
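<li>The resolver sends UDP queries down the hierarchy: root servers refer it to top-level-domain servers, which refer it to the authoritative servers covering the specific domain namespace. A lookup can be resolved recursively or iteratively.</li>
<li>Usually a caching DNS server sits in front to reduce the load on authoritative servers.</li>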
</ul>
<h2 id="p2p-basics">P2P Basics</h2>
<ul>
<li>CDN’s, Content delivery networks, minimize distance between origin servers and clients by replicating origin server resources.</li>
</ul>
<h3 id="comparison-of-p2p--clientserver">Comparison of P2P & Client/Server</h3>
<p>Given $N$ clients, a file of size $F$, a server upload rate of $u_s$, an upload rate of $u_i$ for the $i$-th client,
and a minimum download rate of $d_{min}$ over all clients, the distribution time for the client/server architecture is at least:</p>
\[\max\left(\frac{NF}{u_s}, \frac{F}{d_{min}}\right)\]
<p>For P2P, it is at least:</p>
\[\max\left(\frac{NF}{u_s + \sum_i u_i}, \frac{F}{d_{min}}, \frac{F}{u_s}\right)\]
<p>The three terms are, respectively: the time to distribute $N$ copies of the file using the combined upload rate of all machines, the download time of the slowest client,
and the time the server needs to upload at least one full copy of the file (in shards) to the peers.</p>
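<p>As a rough sanity check, the two bounds can be evaluated numerically. This is a minimal sketch: the function names <code class="language-plaintext highlighter-rouge">client_server_time</code> and <code class="language-plaintext highlighter-rouge">p2p_time</code> are made up for illustration, and all rates are assumed to be in the same units.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Lower bound for client/server: the server uploads N copies, and the
// slowest client still has to download one full copy.
double client_server_time(double F, double u_s, double d_min, int N) {
    assert(u_s > 0 && d_min > 0 && N > 0);
    return std::max(N * F / u_s, F / d_min);
}

// Lower bound for P2P: every peer contributes upload capacity, but the
// server must still push at least one full copy (the F / u_s term).
double p2p_time(double F, double u_s, double d_min,
                const std::vector<double>& peer_upload) {
    assert(u_s > 0 && d_min > 0 && !peer_upload.empty());
    const double N = static_cast<double>(peer_upload.size());
    const double total_upload =
        u_s + std::accumulate(peer_upload.begin(), peer_upload.end(), 0.0);
    return std::max({N * F / total_upload, F / d_min, F / u_s});
}
```

<p>With $F = 1000$, $u_s = 100$, $d_{min} = 50$ and 10 peers uploading at 10 each, the client/server bound is 100 time units while the P2P bound is 50, because the peers double the total upload capacity.</p>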
C++ Concurrency - Asynchronous Waiting2017-11-19T00:00:00+00:00http://oneraynyday.github.io/dev/2017/11/19/C++-Threads-Waiting-Events<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#waiting-for-an-eventcondition" id="markdown-toc-waiting-for-an-eventcondition">Waiting for an event/condition</a></li>
<li><a href="#condition_variable-and-condition_variable_any" id="markdown-toc-condition_variable-and-condition_variable_any"><code class="language-plaintext highlighter-rouge">condition_variable</code> and <code class="language-plaintext highlighter-rouge">condition_variable_any</code></a> <ul>
<li><a href="#details-of-wait" id="markdown-toc-details-of-wait">Details of <code class="language-plaintext highlighter-rouge">wait()</code></a></li>
<li><a href="#an-example-of-safe_queue" id="markdown-toc-an-example-of-safe_queue">An example of <code class="language-plaintext highlighter-rouge">safe_queue</code></a></li>
</ul>
</li>
</ul>
<p>Before, we talked about the basics of how C++ threads are
used, and how threads can protect data by using <code class="language-plaintext highlighter-rouge">mutex</code>es,
<code class="language-plaintext highlighter-rouge">lock_guard</code>s, <code class="language-plaintext highlighter-rouge">unique_lock</code>s, <code class="language-plaintext highlighter-rouge">recursive_mutex</code>, and <code class="language-plaintext highlighter-rouge">once_flag</code>s.</p>
<p>Now, we talk about how threads can wait for other threads to complete tasks.</p>
<h1 id="waiting-for-an-eventcondition">Waiting for an event/condition</h1>
<p>There are multiple ways to check when a condition becomes true:</p>
<ol>
<li>Spin-lock checking.</li>
<li>Sleep for small periods of time while spin-lock checking.</li>
<li>Use a <strong>condition variable</strong> which associates with some event, which will notify the thread waiting.</li>
</ol>
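<p>For contrast with the condition variable approach, here is a minimal sketch of option 2 (sleeping between checks). The flag <code class="language-plaintext highlighter-rouge">ready</code> and both function names are assumptions made up for this example.</p>

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

std::atomic<bool> ready{false};  // hypothetical flag, set by another thread

// Option 2: poll the flag, sleeping briefly between checks.
// Returns how many times we slept before the flag flipped.
int wait_by_polling(std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    int sleeps = 0;
    while (!ready.load()) {
        std::this_thread::sleep_for(interval);
        ++sleeps;
    }
    return sleeps;
}

// Hypothetical driver: a worker flips the flag after roughly 20 ms.
bool polling_demo() {
    ready.store(false);
    std::thread worker([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        ready.store(true);
    });
    wait_by_polling(std::chrono::milliseconds(1));
    worker.join();
    return ready.load();
}
```

<p>The trade-off is in the sleep interval: too short and we burn CPU like a spin-lock, too long and we react late. A condition variable avoids having to pick one.</p>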
<h1 id="condition_variable-and-condition_variable_any"><code class="language-plaintext highlighter-rouge">condition_variable</code> and <code class="language-plaintext highlighter-rouge">condition_variable_any</code></h1>
<p>Let’s use a case study:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mutex</span> <span class="n">m</span><span class="p">;</span>
<span class="n">queue</span><span class="o"><</span><span class="n">data</span><span class="o">></span> <span class="n">q</span><span class="p">;</span>
<span class="n">condition_variable</span> <span class="n">c</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">prepare</span><span class="p">(){</span>
<span class="k">while</span><span class="p">(</span><span class="n">more_data</span><span class="p">()){</span>
<span class="n">data</span> <span class="n">d</span> <span class="o">=</span> <span class="n">prepare_data</span><span class="p">();</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">q</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
<span class="n">c</span><span class="p">.</span><span class="n">notify_one</span><span class="p">();</span> <span class="c1">// notify a waiting thread!</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">process</span><span class="p">(){</span>
<span class="k">while</span><span class="p">(</span><span class="nb">true</span><span class="p">){</span>
<span class="n">unique_lock</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">c</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="p">[]{</span><span class="k">return</span> <span class="o">!</span><span class="n">q</span><span class="p">.</span><span class="n">empty</span><span class="p">();});</span> <span class="c1">// notified!</span>
<span class="n">data</span> <span class="n">d</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
<span class="n">q</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">l</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">process_data</span><span class="p">(</span><span class="n">d</span><span class="p">);</span> <span class="c1">// hypothetical handler; runs outside the lock</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>What happens here is that <code class="language-plaintext highlighter-rouge">wait(lock, boolfunc)</code> unlocks the
mutex and puts the thread to sleep whenever the condition is not yet true.</p>
<p>Specifically, what <code class="language-plaintext highlighter-rouge">wait(lock, boolfunc)</code> does is the following:</p>
<ol>
<li>Checks the condition of the <code class="language-plaintext highlighter-rouge">boolfunc</code> (it could be a lambda, as in the example).</li>
<li>If (1) is not true, it unlocks the lock and puts the thread to sleep (blocked state). When the thread is woken, it re-locks the lock and checks (1) again.</li>
<li>If (1) is true, then keeps the lock locked, and proceeds.</li>
</ol>
<p>As you can see, we can’t simply use <code class="language-plaintext highlighter-rouge">lock_guard</code> to perform these operations, since
(2) needs to unlock the lock.</p>
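<p>The steps above amount to a loop around the predicate-less <code class="language-plaintext highlighter-rouge">wait(lock)</code>. Here is a minimal sketch of that equivalence; <code class="language-plaintext highlighter-rouge">wait_until</code> and <code class="language-plaintext highlighter-rouge">wait_until_demo</code> are names made up for this example, not part of the standard library.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Roughly what cv.wait(lock, pred) does: re-check the predicate every
// time the thread wakes up, whether from a notify or a spurious wake.
template <typename Predicate>
void wait_until(std::condition_variable& cv,
                std::unique_lock<std::mutex>& lock,
                Predicate pred) {
    assert(lock.owns_lock());  // caller must hold the lock, as with wait()
    while (!pred()) {          // predicate is checked with the lock held
        cv.wait(lock);         // unlocks, sleeps, re-locks before returning
    }
}

// Hypothetical driver: one thread sets a flag, the caller waits for it.
bool wait_until_demo() {
    std::mutex m;
    std::condition_variable cv;
    bool flag = false;
    std::thread setter([&] {
        {
            std::lock_guard<std::mutex> guard(m);
            flag = true;
        }
        cv.notify_one();
    });
    std::unique_lock<std::mutex> lock(m);
    wait_until(cv, lock, [&] { return flag; });
    const bool ok = flag && lock.owns_lock();
    lock.unlock();
    setter.join();
    return ok;
}
```

<p>Because the predicate is checked before the first sleep, it does not matter whether the setter runs before or after the waiter takes the lock.</p>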
<h2 id="details-of-wait">Details of <code class="language-plaintext highlighter-rouge">wait()</code></h2>
<p>Actually, <code class="language-plaintext highlighter-rouge">wait()</code> is not guaranteed to check the bool function exactly once - it may check it any number of times, so the function should not have side effects.</p>
<p>In addition, a situation called a “spurious wake” can occur. In this case, threads are
woken up even though they were not notified, and they will test the <code class="language-plaintext highlighter-rouge">boolfunc</code> again.</p>
<h2 id="an-example-of-safe_queue">An example of <code class="language-plaintext highlighter-rouge">safe_queue</code></h2>
<p>A <code class="language-plaintext highlighter-rouge">safe_queue</code> should support the following functions:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">safe_queue</span><span class="p">(</span><span class="k">const</span> <span class="n">safe_queue</span><span class="o">&</span> <span class="n">other</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">push</span><span class="p">(</span><span class="n">T</span> <span class="n">new_value</span><span class="p">);</span> <span class="c1">// Adds new value into queue</span>
<span class="kt">void</span> <span class="nf">wait_and_pop</span><span class="p">(</span><span class="n">T</span><span class="o">&</span> <span class="n">value</span><span class="p">);</span> <span class="c1">// blocks until non-empty, then assigns popped front() into value</span>
<span class="kt">bool</span> <span class="nf">empty</span><span class="p">();</span> <span class="c1">// checks whether any element left</span>
</code></pre></div></div>
<p>We will have the following members:</p>
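<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">mutable</span> <span class="n">mutex</span> <span class="n">m</span><span class="p">;</span> <span class="c1">// mutable so the copy constructor can lock other.m</span>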
<span class="n">queue</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">q</span><span class="p">;</span>
<span class="n">condition_variable</span> <span class="n">c</span><span class="p">;</span>
</code></pre></div></div>
<p>For the copy constructor:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">safe_queue</span><span class="p">(</span><span class="k">const</span> <span class="n">safe_queue</span><span class="o">&</span> <span class="n">other</span><span class="p">){</span>
<span class="c1">// lock the other queue so we can copy their values safely</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m</span><span class="p">);</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">other</span><span class="p">.</span><span class="n">q</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For <code class="language-plaintext highlighter-rouge">push</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">push</span><span class="p">(</span><span class="n">T</span> <span class="n">new_value</span><span class="p">){</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">q</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">new_value</span><span class="p">);</span>
<span class="n">c</span><span class="p">.</span><span class="n">notify_one</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For <code class="language-plaintext highlighter-rouge">wait_and_pop</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">wait_and_pop</span><span class="p">(</span><span class="n">T</span><span class="o">&</span> <span class="n">value</span><span class="p">){</span>
<span class="n">unique_lock</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">c</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="p">[</span><span class="k">this</span><span class="p">]{</span> <span class="k">return</span> <span class="o">!</span><span class="n">q</span><span class="p">.</span><span class="n">empty</span><span class="p">();</span> <span class="p">});</span>
<span class="n">value</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
<span class="n">q</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For <code class="language-plaintext highlighter-rouge">empty</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="nf">empty</span><span class="p">(){</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="k">return</span> <span class="n">q</span><span class="p">.</span><span class="n">empty</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
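<p>Putting the pieces together, the class could look roughly like the sketch below. The default constructor, the <code class="language-plaintext highlighter-rouge">const</code> on <code class="language-plaintext highlighter-rouge">empty()</code>, the <code class="language-plaintext highlighter-rouge">std::move</code> calls and the driver <code class="language-plaintext highlighter-rouge">safe_queue_demo</code> are additions assumed for this example.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <typename T>
class safe_queue {
    mutable std::mutex m;  // mutable: locked from const members too
    std::queue<T> q;
    std::condition_variable c;

public:
    safe_queue() = default;
    safe_queue(const safe_queue& other) {
        // Lock the other queue so we can copy its values safely.
        std::lock_guard<std::mutex> l(other.m);
        q = other.q;
    }
    void push(T new_value) {
        std::lock_guard<std::mutex> l(m);
        q.push(std::move(new_value));
        c.notify_one();
    }
    void wait_and_pop(T& value) {
        std::unique_lock<std::mutex> l(m);
        c.wait(l, [this] { return !q.empty(); });
        value = std::move(q.front());
        q.pop();
    }
    bool empty() const {
        std::lock_guard<std::mutex> l(m);
        return q.empty();
    }
};

// Hypothetical driver: one producer pushes 1..n, the caller sums them.
int safe_queue_demo(int n) {
    safe_queue<int> sq;
    std::thread producer([&sq, n] {
        for (int i = 1; i <= n; ++i) sq.push(i);
    });
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int v = 0;
        sq.wait_and_pop(v);  // blocks until the producer has pushed
        sum += v;
    }
    producer.join();
    assert(sq.empty());
    return sum;
}
```

<p>Note that a result from <code class="language-plaintext highlighter-rouge">empty()</code> can be stale by the time the caller acts on it, which is why the blocking <code class="language-plaintext highlighter-rouge">wait_and_pop</code> is the safer way to consume.</p>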
C++ Concurrency - Sharing Data2017-11-19T00:00:00+00:00http://oneraynyday.github.io/dev/2017/11/19/C++-Threads-Sharing-Data<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#inherent-problems" id="markdown-toc-inherent-problems">Inherent Problems</a></li>
<li><a href="#stdmutex" id="markdown-toc-stdmutex"><code class="language-plaintext highlighter-rouge">std::mutex</code></a> <ul>
<li><a href="#hide-your-shared-data-from-user" id="markdown-toc-hide-your-shared-data-from-user">Hide Your Shared Data from User</a></li>
</ul>
</li>
<li><a href="#deadlock" id="markdown-toc-deadlock">Deadlock</a></li>
<li><a href="#stdunique_lock" id="markdown-toc-stdunique_lock"><code class="language-plaintext highlighter-rouge">std::unique_lock</code></a></li>
<li><a href="#stdcall_once" id="markdown-toc-stdcall_once"><code class="language-plaintext highlighter-rouge">std::call_once</code></a></li>
<li><a href="#stdrecursive_mutex-is-cs-reentrantlock" id="markdown-toc-stdrecursive_mutex-is-cs-reentrantlock"><code class="language-plaintext highlighter-rouge">std::recursive_mutex</code> is C++’s <code class="language-plaintext highlighter-rouge">ReentrantLock</code></a></li>
</ul>
<h1 id="inherent-problems">Inherent Problems</h1>
<p>Every thread is considered a <strong>lightweight process</strong>. It has its own stack space,
but will share heap space with other threads of the same process.</p>
<p>When we’re sharing data, the issue arises <strong>when the data is mutable</strong>.</p>
<p>There are 2 ways to deal with these <strong>race conditions</strong>:</p>
<ol>
<li>Use a protective data structure(locks).</li>
<li>Use atomic operations(lock-free programming).</li>
</ol>
<p>1 is usually much easier than 2. So we will cover 1 here.</p>
<h1 id="stdmutex"><code class="language-plaintext highlighter-rouge">std::mutex</code></h1>
<p>A <code class="language-plaintext highlighter-rouge">std::mutex</code> supports <code class="language-plaintext highlighter-rouge">lock()</code> and <code class="language-plaintext highlighter-rouge">unlock()</code>, but we normally should never call these
directly.</p>
<p>RAII comes to the rescue: <code class="language-plaintext highlighter-rouge">std::lock_guard</code> is a class template and allows us to lock
the <code class="language-plaintext highlighter-rouge">mutex</code> during constructor call and unlock during destructor call.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <mutex>
</span><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">my_mutex</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">(){</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">mutex</span><span class="o">></span> <span class="n">guard</span><span class="p">(</span><span class="n">my_mutex</span><span class="p">);</span> <span class="c1">// locks the mutex</span>
<span class="n">doSomethingSynchronized</span><span class="p">();</span>
<span class="p">}</span> <span class="c1">// unlocks when scope exits.</span>
</code></pre></div></div>
<h2 id="hide-your-shared-data-from-user">Hide Your Shared Data from User</h2>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// relatively good code</span>
<span class="k">class</span> <span class="nc">data_wrapper</span>
<span class="p">{</span>
<span class="nl">private:</span>
<span class="n">some_data</span> <span class="n">data</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">m</span><span class="p">;</span>
<span class="nl">public:</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="nc">Function</span><span class="p">></span>
<span class="kt">void</span> <span class="n">process_data</span><span class="p">(</span><span class="n">Function</span> <span class="n">func</span><span class="p">){</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">func</span><span class="p">(</span><span class="n">data</span><span class="p">);</span> <span class="c1">// calls the function</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="p">...</span>
<span class="c1">// evil code</span>
<span class="n">some_data</span><span class="o">*</span> <span class="n">unprotected</span><span class="p">;</span>
<span class="n">data_wrapper</span> <span class="n">x</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">badfunc</span><span class="p">(</span><span class="n">some_data</span><span class="o">&</span> <span class="n">protected_data</span><span class="p">){</span>
<span class="n">unprotected</span><span class="o">=&</span><span class="n">protected_data</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">(){</span>
<span class="n">x</span><span class="p">.</span><span class="n">process_data</span><span class="p">(</span><span class="n">badfunc</span><span class="p">);</span> <span class="c1">// now we have unprotected access!!!</span>
<span class="p">}</span>
</code></pre></div></div>
<p>A key takeaway from this is: <strong>don’t pass pointers/references to protected data outside of the lock!</strong></p>
<h1 id="deadlock">Deadlock</h1>
<p>This happens when two threads each hold one lock (say A and B) and each needs the lock held by the other.
They wait on each other’s locks forever because neither gives up its own lock.</p>
<p>This can be solved if you force an order on the locks, i.e. A must always be acquired before B.</p>
<p>However, sometimes that’s not possible, like when each lock protects a separate instance of the same class and the caller decides the argument order. In that case we can lock both at the same time with <code class="language-plaintext highlighter-rouge">std::lock</code>, which avoids deadlock, and then
hand over ownership to <code class="language-plaintext highlighter-rouge">std::lock_guard</code>s using <code class="language-plaintext highlighter-rouge">std::adopt_lock</code>, an empty struct tag that tells the guard the mutex is already locked.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">X</span><span class="p">{</span>
<span class="p">...</span>
<span class="k">friend</span> <span class="kt">void</span> <span class="n">swap</span><span class="p">(</span><span class="n">X</span><span class="o">&</span> <span class="n">lhs</span><span class="p">,</span> <span class="n">X</span><span class="o">&</span> <span class="n">rhs</span><span class="p">){</span>
<span class="k">if</span><span class="p">(</span><span class="o">&</span><span class="n">lhs</span> <span class="o">==</span> <span class="o">&</span><span class="n">rhs</span><span class="p">)</span>
<span class="k">return</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">m</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">m</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock_a</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">m</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">adopt_lock</span><span class="p">);</span> <span class="c1">// a acquires ownership</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock_b</span><span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">m</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">adopt_lock</span><span class="p">);</span> <span class="c1">// b too!</span>
<span class="n">swap</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">something</span><span class="p">,</span> <span class="n">rhs</span><span class="p">.</span><span class="n">something</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Some tips:</p>
<ol>
<li>Avoid nested locks.</li>
<li>Avoid calling user-supplied code while holding a lock. (The user code could try to acquire a lock itself.)</li>
<li>Acquire locks in fixed order.</li>
</ol>
<h1 id="stdunique_lock"><code class="language-plaintext highlighter-rouge">std::unique_lock</code></h1>
<p><code class="language-plaintext highlighter-rouge">std::unique_lock</code> is more flexible than <code class="language-plaintext highlighter-rouge">std::lock_guard</code>. It doesn’t always own a mutex.</p>
<p>The ability to not own one allows us to construct the <code class="language-plaintext highlighter-rouge">unique_lock</code> without being forced to
lock anything:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">...</span> <span class="c1">// same example as before</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock_a</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">m</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">defer_lock</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock_b</span><span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">m</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">defer_lock</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock</span><span class="p">(</span><span class="n">lock_a</span><span class="p">,</span> <span class="n">lock_b</span><span class="p">);</span> <span class="c1">// finally lock here!</span>
</code></pre></div></div>
<p>You can also call <code class="language-plaintext highlighter-rouge">unlock()</code> and <code class="language-plaintext highlighter-rouge">lock()</code> on the <code class="language-plaintext highlighter-rouge">unique_lock</code> itself. This is useful when whether and when to lock a specific resource depends on branching conditions (i.e. does this thread really need to hold on to this lock for that long?).</p>
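<p>A minimal sketch of releasing a <code class="language-plaintext highlighter-rouge">unique_lock</code> early so the slow part runs unlocked. The globals <code class="language-plaintext highlighter-rouge">data_mutex</code> and <code class="language-plaintext highlighter-rouge">shared_values</code> and the function <code class="language-plaintext highlighter-rouge">take_and_square</code> are hypothetical names for this example.</p>

```cpp
#include <cassert>
#include <mutex>
#include <vector>

std::mutex data_mutex;           // hypothetical mutex guarding shared_values
std::vector<int> shared_values;  // hypothetical shared state

// Hold the lock only while touching shared data; do the rest unlocked.
// Returns -1 if there is nothing to take.
int take_and_square() {
    std::unique_lock<std::mutex> lock(data_mutex);
    if (shared_values.empty()) {
        return -1;  // still locked here; the destructor unlocks
    }
    const int v = shared_values.back();
    shared_values.pop_back();
    lock.unlock();  // done with shared data: release early
    assert(!lock.owns_lock());
    return v * v;   // stand-in for expensive work done outside the lock
}
```

<p>Both exit paths are safe: the destructor only unlocks if the <code class="language-plaintext highlighter-rouge">unique_lock</code> still owns the mutex, which a <code class="language-plaintext highlighter-rouge">lock_guard</code> cannot express.</p>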
<h1 id="stdcall_once"><code class="language-plaintext highlighter-rouge">std::call_once</code></h1>
<p>You can call a function only once by using a <code class="language-plaintext highlighter-rouge">std::once_flag</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">once_flag</span> <span class="n">resource_flag</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">resource</span><span class="o">></span> <span class="n">resource_ptr</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">init_resource</span><span class="p">(){</span>
<span class="n">resource_ptr</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="k">new</span> <span class="n">resource</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">(){</span>
<span class="n">std</span><span class="o">::</span><span class="n">call_once</span><span class="p">(</span><span class="n">resource_flag</span><span class="p">,</span> <span class="n">init_resource</span><span class="p">);</span> <span class="c1">// this will only be called once.</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
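<p>To see the "only once" guarantee under contention, here is a small sketch where several threads race on the same flag. The names <code class="language-plaintext highlighter-rouge">init_flag</code>, <code class="language-plaintext highlighter-rouge">init_count</code> and <code class="language-plaintext highlighter-rouge">call_once_demo</code> are made up for this example.</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::once_flag init_flag;  // hypothetical flag shared by all threads
int init_count = 0;        // how many times the initializer actually ran

void init() { ++init_count; }

// Four threads all try to initialize; only one of them runs init().
int call_once_demo() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([] { std::call_once(init_flag, init); });
    }
    for (auto& t : threads) t.join();
    assert(init_count == 1);
    return init_count;
}
```

<p>The other threads block until the winning call finishes, so after <code class="language-plaintext highlighter-rouge">call_once</code> returns every thread can rely on the initialization being complete.</p>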
<h1 id="stdrecursive_mutex-is-cs-reentrantlock"><code class="language-plaintext highlighter-rouge">std::recursive_mutex</code> is C++’s <code class="language-plaintext highlighter-rouge">ReentrantLock</code></h1>
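<p>Sometimes you may need to lock the same mutex multiple times from the same thread, for example in a recursive function.
Locking a plain <code class="language-plaintext highlighter-rouge">std::mutex</code> that the calling thread already owns is undefined behavior, and in practice it usually deadlocks.
A <code class="language-plaintext highlighter-rouge">std::recursive_mutex</code> allows it, but you must unlock it as many times as you locked it; otherwise it is never released and other threads block on it forever. Needing one is most likely a sign of a bad design decision.</p>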
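<p>A minimal sketch of the N-locks, N-unlocks rule, using a recursive function whose every level takes the same lock. The names <code class="language-plaintext highlighter-rouge">rm</code> and <code class="language-plaintext highlighter-rouge">descend</code> are made up for this example.</p>

```cpp
#include <mutex>

std::recursive_mutex rm;  // hypothetical mutex shared by every recursion level

// Each level locks rm again on the same thread. The lock_guard at each
// level unlocks on the way back up, so locks and unlocks stay balanced.
int descend(int depth) {
    std::lock_guard<std::recursive_mutex> lk(rm);
    if (depth == 0)
        return 0;
    return 1 + descend(depth - 1);
}
```

<p>Using a guard object rather than manual <code class="language-plaintext highlighter-rouge">lock()</code>/<code class="language-plaintext highlighter-rouge">unlock()</code> calls makes it hard to get the count wrong.</p>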
C++ Concurrency - Basics of std::thread2017-11-19T00:00:00+00:00http://oneraynyday.github.io/dev/2017/11/19/C++-Threads-Basics<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#stdthread-syntax-and-functions" id="markdown-toc-stdthread-syntax-and-functions"><code class="language-plaintext highlighter-rouge">std::thread</code> Syntax and Functions</a></li>
<li><a href="#you-must-join-or-detach" id="markdown-toc-you-must-join-or-detach">You must <code class="language-plaintext highlighter-rouge">join</code> or <code class="language-plaintext highlighter-rouge">detach</code></a></li>
<li><a href="#dont-detach-while-using-locals" id="markdown-toc-dont-detach-while-using-locals">Don’t <code class="language-plaintext highlighter-rouge">detach()</code> While Using Locals</a></li>
<li><a href="#transfering-ownership-of-threads" id="markdown-toc-transfering-ownership-of-threads">Transferring ownership of threads</a></li>
</ul>
<h1 id="stdthread-syntax-and-functions"><code class="language-plaintext highlighter-rouge">std::thread</code> Syntax and Functions</h1>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <thread>
#include <iostream>
#include <cassert>
</span><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">hello_world</span><span class="p">(){</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"hello world!"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
<span class="kr">thread</span> <span class="n">t</span><span class="p">(</span><span class="n">hello_world</span><span class="p">);</span>
<span class="n">t</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">joinable</span><span class="p">()</span> <span class="o">==</span> <span class="nb">false</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In this case, we create a thread that runs <code class="language-plaintext highlighter-rouge">hello_world</code>,
and then wait for it to finish using <code class="language-plaintext highlighter-rouge">join()</code>.</p>
<p>What if we called <code class="language-plaintext highlighter-rouge">detach()</code> instead? Then even after the <code class="language-plaintext highlighter-rouge">std::thread</code> object <code class="language-plaintext highlighter-rouge">t</code> is destroyed (after leaving its scope), the underlying thread continues to run.</p>
<p>Once <code class="language-plaintext highlighter-rouge">join()</code> returns, we are guaranteed that
<code class="language-plaintext highlighter-rouge">t</code> is no longer associated with the actual thread, since
the thread’s execution has finished.</p>
<p>This means <code class="language-plaintext highlighter-rouge">joinable()</code> becomes false, because
it is only true while the object is associated with a thread of execution. Note that a thread which has finished running but has not yet been joined or detached still counts as joinable.</p>
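<p>The different <code class="language-plaintext highlighter-rouge">joinable()</code> states can be observed directly. The sketch below records them in a small struct; <code class="language-plaintext highlighter-rouge">joinable_states</code> and <code class="language-plaintext highlighter-rouge">observe</code> are hypothetical names.</p>

```cpp
#include <thread>

// Hypothetical record of what joinable() reports at each stage.
struct joinable_states {
    bool default_constructed;
    bool before_join;
    bool after_join;
};

joinable_states observe() {
    joinable_states s{};
    std::thread empty;                         // not associated with any thread
    s.default_constructed = empty.joinable();  // false
    std::thread t([] {});                      // trivial thread function
    s.before_join = t.joinable();              // true, even if the lambda already returned
    t.join();
    s.after_join = t.joinable();               // false once joined
    return s;
}
```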
<h1 id="you-must-join-or-detach">You must <code class="language-plaintext highlighter-rouge">join</code> or <code class="language-plaintext highlighter-rouge">detach</code></h1>
<p>What happens if we don’t call <code class="language-plaintext highlighter-rouge">join()</code> or <code class="language-plaintext highlighter-rouge">detach()</code>,
and just allow the thread’s destructor to get called?
The destructor of <code class="language-plaintext highlighter-rouge">t</code> checks whether the thread is still <code class="language-plaintext highlighter-rouge">joinable()</code>, and if it is, it calls <code class="language-plaintext highlighter-rouge">std::terminate()</code>.</p>
<p>Now you must be thinking, why so violent? <code class="language-plaintext highlighter-rouge">std::terminate()</code> should only be called in a few special circumstances like double exception propagation.</p>
<p>Say that instead of calling <code class="language-plaintext highlighter-rouge">terminate</code>, the destructor just <code class="language-plaintext highlighter-rouge">detach</code>ed the thread.
What would happen?</p>
<p><strong>We could silently be allowing undefined behavior, as
the detached child thread could still be using references
to locals of a scope that has already been destroyed.</strong>
Thus, the designers of <code class="language-plaintext highlighter-rouge">std::thread</code> thought <code class="language-plaintext highlighter-rouge">terminate</code>
was necessary to avoid difficult UB-debugging.</p>
<p>We also have to make sure the thread gets <code class="language-plaintext highlighter-rouge">join()</code>ed when an exception is thrown, so for example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
<span class="kr">thread</span> <span class="n">t</span><span class="p">(</span><span class="n">my_func</span><span class="p">);</span>
<span class="k">try</span><span class="p">{</span>
        <span class="n">do_something</span><span class="p">();</span> <span class="c1">// this could throw an exception!</span>
<span class="p">}</span>
<span class="k">catch</span><span class="p">(...){</span>
<span class="n">t</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="k">throw</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">t</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>… which looks very ugly.</p>
<p>We can also implement a <code class="language-plaintext highlighter-rouge">thread_guard</code> using RAII:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">thread_guard</span><span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">&</span> <span class="n">t</span><span class="p">;</span>
<span class="nl">public:</span>
<span class="k">explicit</span> <span class="n">thread_guard</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">&</span> <span class="n">t_</span><span class="p">)</span> <span class="o">:</span> <span class="n">t</span><span class="p">(</span><span class="n">t_</span><span class="p">)</span> <span class="p">{}</span>
<span class="o">~</span><span class="n">thread_guard</span><span class="p">(){</span>
<span class="k">if</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">joinable</span><span class="p">())</span>
<span class="n">t</span><span class="p">.</span><span class="n">join</span><span class="p">();</span> <span class="c1">// force join here</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="p">...</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
<span class="kr">thread</span> <span class="n">t</span><span class="p">(</span><span class="n">my_func</span><span class="p">);</span>
<span class="n">thread_guard</span> <span class="n">g</span><span class="p">(</span><span class="n">t</span><span class="p">);</span> <span class="c1">// RAII</span>
    <span class="n">do_something</span><span class="p">();</span> <span class="c1">// this could throw an exception!</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In this case, when <code class="language-plaintext highlighter-rouge">main()</code> exits, the destructor of <code class="language-plaintext highlighter-rouge">g</code> runs before
the destructor of <code class="language-plaintext highlighter-rouge">t</code>, because locals are destroyed in reverse order of construction. Therefore, we <code class="language-plaintext highlighter-rouge">join()</code> on <code class="language-plaintext highlighter-rouge">t</code> before
the destructor of <code class="language-plaintext highlighter-rouge">t</code> runs, which would otherwise
call <code class="language-plaintext highlighter-rouge">std::terminate()</code>.</p>
<p>When you <code class="language-plaintext highlighter-rouge">detach()</code> a thread, the thread is usually called a <strong>daemon thread</strong>: it runs in the background, and the <code class="language-plaintext highlighter-rouge">std::thread</code> object no longer gives you any way to wait for it or communicate with it.</p>
<p>A typical use case is creating a new document window in a GUI:</p>
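<p>To check that the guard really joins when an exception unwinds the scope, here is a self-contained sketch. It repeats <code class="language-plaintext highlighter-rouge">thread_guard</code> from above (with copying disabled, which is an addition of this sketch); <code class="language-plaintext highlighter-rouge">worker_finished</code> and <code class="language-plaintext highlighter-rouge">run_and_throw</code> are hypothetical names.</p>

```cpp
#include <atomic>
#include <stdexcept>
#include <thread>

class thread_guard {
    std::thread& t;
public:
    explicit thread_guard(std::thread& t_) : t(t_) {}
    ~thread_guard() {
        if (t.joinable())
            t.join();  // force join here
    }
    // Copying a guard would make two objects try to join the same thread.
    thread_guard(const thread_guard&) = delete;
    thread_guard& operator=(const thread_guard&) = delete;
};

std::atomic<bool> worker_finished{false};  // hypothetical completion flag

// Returns true if the worker had been joined by the time the exception
// reached the handler.
bool run_and_throw() {
    bool caught = false;
    try {
        std::thread t([] { worker_finished = true; });
        thread_guard g(t);  // RAII
        throw std::runtime_error("simulated failure");
    } catch (const std::runtime_error&) {
        // Stack unwinding already ran ~thread_guard (join), then ~thread.
        caught = true;
    }
    return caught && worker_finished.load();
}
```

<p>Without the guard, the same <code class="language-plaintext highlighter-rouge">throw</code> would destroy a joinable <code class="language-plaintext highlighter-rouge">std::thread</code> and end the program through <code class="language-plaintext highlighter-rouge">std::terminate()</code>.</p>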
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
<span class="k">while</span><span class="p">(</span><span class="nb">true</span><span class="p">){</span>
<span class="kt">int</span> <span class="n">command</span> <span class="o">=</span> <span class="n">get_input</span><span class="p">();</span>
<span class="k">if</span><span class="p">(</span><span class="n">command</span> <span class="o">==</span> <span class="n">OPEN_NEW</span><span class="p">){</span>
<span class="kr">thread</span> <span class="n">t</span><span class="p">(</span><span class="n">open_document</span><span class="p">);</span>
<span class="n">t</span><span class="p">.</span><span class="n">detach</span><span class="p">();</span> <span class="c1">// we create it and let it run.</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="dont-detach-while-using-locals">Don’t <code class="language-plaintext highlighter-rouge">detach()</code> While Using Locals</h1>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="kt">char</span><span class="o">*</span> <span class="n">cp</span><span class="p">){</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">cp</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">(){</span>
<span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">1024</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"hello world!"</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">t</span><span class="p">(</span><span class="n">print</span><span class="p">,</span> <span class="n">buffer</span><span class="p">);</span> <span class="c1">// buffer is an argument.</span>
<span class="n">t</span><span class="p">.</span><span class="n">detach</span><span class="p">();</span> <span class="c1">// uh oh!</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the above situation, we passed in a pointer to a local variable,
<code class="language-plaintext highlighter-rouge">buffer</code>, which is used by the thread in <code class="language-plaintext highlighter-rouge">print()</code>.</p>
<p>If we had called <code class="language-plaintext highlighter-rouge">join()</code>, this would be fine. However, we called <code class="language-plaintext highlighter-rouge">detach()</code>.
This means the following sequence can occur:</p>
<ol>
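<li>The thread is created and will call <code class="language-plaintext highlighter-rouge">print()</code> with a pointer to <code class="language-plaintext highlighter-rouge">buffer</code>.</li>
<li><code class="language-plaintext highlighter-rouge">detach()</code> gets called, <code class="language-plaintext highlighter-rouge">foo()</code> returns, and <code class="language-plaintext highlighter-rouge">buffer</code> goes out of scope.</li>
<li><code class="language-plaintext highlighter-rouge">printf()</code> is called with <code class="language-plaintext highlighter-rouge">cp</code>, which now dangles.</li>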
</ol>
<p>… then we have undefined behavior!</p>
<p>So keep in mind that when we’re detaching, <strong>never let the thread reference
local variables of the scope that launched it.</strong></p>
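<p>One possible way around this, sketched under the assumption that copying the data is acceptable, is to hand the thread its own <code class="language-plaintext highlighter-rouge">std::string</code> built before the launching function returns. The conversion has to be explicit: passing <code class="language-plaintext highlighter-rouge">buffer</code> itself would copy only the pointer, and the conversion to <code class="language-plaintext highlighter-rouge">std::string</code> would then happen on the new thread, possibly after <code class="language-plaintext highlighter-rouge">buffer</code> is gone. The names <code class="language-plaintext highlighter-rouge">shout</code> and <code class="language-plaintext highlighter-rouge">foo_safe</code> are hypothetical, and the <code class="language-plaintext highlighter-rouge">std::promise</code> is there only so the result of the detached thread can be observed.</p>

```cpp
#include <cstdio>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Runs on the detached thread. msg is the thread's own copy, so it stays
// valid no matter when the launching function returns.
void shout(std::string msg, std::promise<std::string> done) {
    done.set_value(msg + "!");
}

std::future<std::string> foo_safe() {
    char buffer[1024];
    std::snprintf(buffer, sizeof buffer, "hello world");
    std::promise<std::string> done;
    std::future<std::string> result = done.get_future();
    // std::string(buffer) copies the characters here, on the calling thread.
    std::thread t(shout, std::string(buffer), std::move(done));
    t.detach();  // safe: the thread holds no pointer into this stack frame
    return result;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">foo_safe().get()</code> blocks until the detached thread has produced its value.</p>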
<h1 id="transfering-ownership-of-threads">Transferring ownership of threads</h1>
<p>The ownership model of <code class="language-plaintext highlighter-rouge">std::thread</code> is very much like
<code class="language-plaintext highlighter-rouge">std::unique_ptr</code>. You should use <code class="language-plaintext highlighter-rouge">std::move</code> to transfer the thread of execution
from one <code class="language-plaintext highlighter-rouge">std::thread</code> object to another.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">thread</span> <span class="nf">t1</span><span class="p">(</span><span class="n">my_func</span><span class="p">);</span>
<span class="kr">thread</span> <span class="n">t2</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t1</span><span class="p">);</span>
</code></pre></div></div>
<p>After this, <code class="language-plaintext highlighter-rouge">t1</code> is no longer <code class="language-plaintext highlighter-rouge">joinable()</code>, as it has no active thread context,
and <code class="language-plaintext highlighter-rouge">t2</code> owns the thread context instead.</p>
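<p>The same move semantics let a function return a thread to its caller. A small sketch, with <code class="language-plaintext highlighter-rouge">spawn</code> and <code class="language-plaintext highlighter-rouge">transfer_demo</code> as made-up names:</p>

```cpp
#include <thread>
#include <utility>

// Hypothetical factory: ownership of the new thread moves to the caller.
std::thread spawn(int& out) {
    return std::thread([&out] { out = 42; });
}

int transfer_demo() {
    int value = 0;
    std::thread t1 = spawn(value);   // ownership moved out of spawn()
    std::thread t2 = std::move(t1);  // ownership moved from t1 to t2
    bool t1_empty = !t1.joinable();  // a moved-from thread owns nothing
    t2.join();                       // value outlives the thread: we join first
    return t1_empty ? value : -1;
}
```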
<p>What if we did this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">thread</span> <span class="nf">t1</span><span class="p">(</span><span class="n">my_func</span><span class="p">);</span>
<span class="kr">thread</span> <span class="nf">t2</span><span class="p">(</span><span class="n">my_func</span><span class="p">);</span>
<span class="n">t2</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t1</span><span class="p">);</span> <span class="c1">// uh oh!</span>
</code></pre></div></div>
<p>Here, <code class="language-plaintext highlighter-rouge">t2</code> still owns a running thread that was never <code class="language-plaintext highlighter-rouge">join()</code>ed or <code class="language-plaintext highlighter-rouge">detach()</code>ed when we move-assign to it.
The move assignment of a joinable <code class="language-plaintext highlighter-rouge">std::thread</code> calls <code class="language-plaintext highlighter-rouge">std::terminate</code>.</p>
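<p>One way to avoid the <code class="language-plaintext highlighter-rouge">std::terminate</code> is to finish the target’s own thread before assigning to it. A minimal sketch, with <code class="language-plaintext highlighter-rouge">reassign_safely</code> as a hypothetical name and empty lambdas standing in for real work:</p>

```cpp
#include <thread>
#include <utility>

bool reassign_safely() {
    std::thread t1([] {});  // stand-ins for my_func
    std::thread t2([] {});
    if (t2.joinable())
        t2.join();          // t2 must give up its own thread first
    t2 = std::move(t1);     // fine: t2 is empty at this point
    bool ok = t2.joinable() && !t1.joinable();
    t2.join();              // t2 now owns what used to be t1's thread
    return ok;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">detach()</code> on <code class="language-plaintext highlighter-rouge">t2</code> instead of <code class="language-plaintext highlighter-rouge">join()</code> would also make the assignment legal, subject to the usual caveats about detached threads.</p>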
Essential C++ - Templates and Generic Programming2017-11-13T00:00:00+00:00http://oneraynyday.github.io/dev/2017/11/13/Essential-C++-7<p>Before we start - why did C++ create templates?</p>
<p>Templates were create