
Commit 6b1cc63 (1 parent: e9032c0)

Commit message: editing...

9 files changed: +38 -52 lines changed

chapters/algodiff.md

Lines changed: 14 additions & 28 deletions
@@ -282,34 +282,20 @@ Step |Primal computation
 You might be wondering, this looks the same as the left side.
 You are right. These two are exactly the same, and we repeat it to make the point that this time you cannot perform the calculation in one pass.
 You must compute the required intermediate results first, and then perform the other "backward pass", which is the key point in reverse mode.
-
----- ---------------------------------------------------------------------------------
-Step Adjoint computation
----- ---------------------------------------------------------------------------------
-10 $$\bar{v_9} = 1$$
-
-11 $$\bar{v_8} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_8} = 1 * \frac{-v_7}{v_8^2} = \frac{-1}{7.30^2} = -0.019$$
-
-12 $$\bar{v_7} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_7} = \frac{1}{v_8} = 0.137$$
-
-13 $$\bar{v_6} = \bar{v_8}\frac{\partial~v_8}{\partial~v_6} = \bar{v_8} * \frac{\partial~(v_6 + v5)}{\partial~v_6} = \bar{v_8}$$
-
-14 $$\bar{v_5} = \bar{v_8}\frac{\partial~v_8}{\partial~v_5} = \bar{v_8} * \frac{\partial~(v_6 + v5)}{\partial~v_5} = \bar{v_8}$$
-
-15 $$\bar{v_4} = \bar{v_6}\frac{\partial~v_6}{\partial~v_4} = \bar{v_8} * \frac{\partial~\exp{(v_4)}}{\partial~v_4} = \bar{v_8} * e^{v_4}$$
-
-16 $$\bar{v_3} = \bar{v_4}\frac{\partial~v_4}{\partial~v_3} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_3} = \bar{v_4}$$
-
-17 $$\bar{v_2} = \bar{v_4}\frac{\partial~v_4}{\partial~v_2} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_2} = \bar{v_4}$$
-
-18 $$\bar{v_1} = \bar{v_3}\frac{\partial~v_3}{\partial~v_1} = \bar{v_3} * \frac{\partial~(v_0*v_1)}{\partial~v_1} = \bar{v_4} * v_0 = \bar{v_4}$$
-
-19 $$\bar{v_{02}} = \bar{v_2}\frac{\partial~v_2}{\partial~v_0} = \bar{v_2} * \frac{\partial~(sin(v_0))}{\partial~v_0} = \bar{v_4} * cos(v_0)$$
-f
-20 $$\bar{v_{03}} = \bar{v_3}\frac{\partial~v_3}{\partial~v_0} = \bar{v_3} * \frac{\partial~(v_0 * v_1)}{\partial~v_0} = \bar{v_4} * v_1$$
-
-21 $$\bar{v_0} = \bar{v_{02}} + \bar{v_{03}} = \bar{v_4}(cos(v_0) + v_1) = \bar{v_8} * e^{v_4}(0.54 + 1) = -0.019 * e^{1.84} * 1.54 = -0.18$$
----- ---------------------------------------------------------------------------------
+Step | Adjoint computation
+---- | ---------------------------------------------------------------------------------
+10 | $$\bar{v_9} = 1$$
+11 | $$\bar{v_8} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_8} = 1 * \frac{-v_7}{v_8^2} = \frac{-1}{7.30^2} = -0.019$$
+12 | $$\bar{v_7} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_7} = \frac{1}{v_8} = 0.137$$
+13 | $$\bar{v_6} = \bar{v_8}\frac{\partial~v_8}{\partial~v_6} = \bar{v_8} * \frac{\partial~(v_6 + v_5)}{\partial~v_6} = \bar{v_8}$$
+14 | $$\bar{v_5} = \bar{v_8}\frac{\partial~v_8}{\partial~v_5} = \bar{v_8} * \frac{\partial~(v_6 + v_5)}{\partial~v_5} = \bar{v_8}$$
+15 | $$\bar{v_4} = \bar{v_6}\frac{\partial~v_6}{\partial~v_4} = \bar{v_8} * \frac{\partial~\exp{(v_4)}}{\partial~v_4} = \bar{v_8} * e^{v_4}$$
+16 | $$\bar{v_3} = \bar{v_4}\frac{\partial~v_4}{\partial~v_3} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_3} = \bar{v_4}$$
+17 | $$\bar{v_2} = \bar{v_4}\frac{\partial~v_4}{\partial~v_2} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_2} = \bar{v_4}$$
+18 | $$\bar{v_1} = \bar{v_3}\frac{\partial~v_3}{\partial~v_1} = \bar{v_3} * \frac{\partial~(v_0*v_1)}{\partial~v_1} = \bar{v_4} * v_0 = \bar{v_4}$$
+19 | $$\bar{v_{02}} = \bar{v_2}\frac{\partial~v_2}{\partial~v_0} = \bar{v_2} * \frac{\partial~(sin(v_0))}{\partial~v_0} = \bar{v_4} * cos(v_0)$$
+20 | $$\bar{v_{03}} = \bar{v_3}\frac{\partial~v_3}{\partial~v_0} = \bar{v_3} * \frac{\partial~(v_0 * v_1)}{\partial~v_0} = \bar{v_4} * v_1$$
+21 | $$\bar{v_0} = \bar{v_{02}} + \bar{v_{03}} = \bar{v_4}(cos(v_0) + v_1) = \bar{v_8} * e^{v_4}(0.54 + 1) = -0.019 * e^{1.84} * 1.54 = -0.18$$
 
 Note that things are a bit different for $$x_0$$. It is used in both intermediate variables $$v_2$$ and $$v_3$$.
 Therefore, we compute the adjoint of $$v_0$$ with regard to $$v_2$$ (step 19) and $$v_3$$ (step 20), and accumulate them together (step 21).
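
As a supplementary sketch (not part of the commit): the backward pass tabulated above is what Owl's `Algodiff` module performs automatically. The following is a hedged illustration assuming the `Algodiff.D` interface; the function `f` is a stand-in, not the exact expression used in the table.

```ocaml
(* A minimal sketch, assuming Owl's Algodiff.D API. diff builds the forward
   computation graph and then propagates adjoints backwards, exactly as in the
   step-by-step table above. The function f is an illustrative stand-in. *)
open Owl
open Algodiff.D

let f x = Maths.(div (F 1.) (F 1. + exp (sin x + x)))

let () =
  let df = diff f (F 1.) in
  Printf.printf "df/dx at x = 1: %g\n" (unpack_flt df)
```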

chapters/diffequation.md

Lines changed: 2 additions & 2 deletions
@@ -394,7 +394,7 @@ Later we will show an example of using the symplectic solver to solve a damped h
 One feature of `owl-ode` is the automatic inference of state dimensionality from the initial state.
 For example, the native solvers take matrices as state.
 Suppose the initial state of the system is a row vector of dimension $$1\times~N$$.
-After $$T$$ time steps, the states are stacked vertically, and thus have dimensions $T\times~N$.
+After $$T$$ time steps, the states are stacked vertically, and thus have dimensions $$T\times~N$$.
 If the initial state is a column vector of shape $$N\times~1$$, then the stacked state after $$T$$ time steps will be inferred as $$N\times~T$$.
 
 The temporal integration of matrices, i.e. cases where the initial state is a matrix instead of a vector, is also supported.

@@ -566,7 +566,7 @@ let custom_solver = Native.D.rk45 ~tol:1E-9 ~dtmax:10.0
 ```
 
 Now, we can solve the ODE system and visualise the results.
-In the plots, we first show how the value of $x$, $y$ and $z$ changes with time; next we show the phase plane plots between each two of them.
+In the plots, we first show how the value of $$x$$, $$y$$ and $$z$$ changes with time; next we show the phase plane plots between each two of them.
 
 ```ocaml
 let _ =
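
As a supplementary sketch (not part of the commit): since the hunk above is truncated, here is a hedged illustration of how a custom `Native.D.rk45` solver is typically passed to `Ode.odeint` in `owl-ode`. The right-hand side `f`, the initial state and the time specification are assumptions for the example, not the chapter's system.

```ocaml
(* A hedged sketch, assuming the owl-ode interface referenced in this hunk.
   The state is a 1 x 3 row vector, so the stacked output ys has shape T x 3,
   matching the dimensionality rule described in the first hunk. *)
open Owl
open Owl_ode

let custom_solver = Native.D.rk45 ~tol:1E-9 ~dtmax:10.0

(* illustrative linear decay dy/dt = -y; replace with the chapter's system *)
let f y _t = Mat.(y *$ (-1.))

let y0 = Mat.of_array [| 1.; 2.; 3. |] 1 3
let tspec = Types.T1 { t0 = 0.; duration = 2.; dt = 1E-2 }

let ts, ys = Ode.odeint custom_solver f y0 tspec ()
```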

chapters/linalg.md

Lines changed: 3 additions & 3 deletions
@@ -843,7 +843,7 @@ R1 9.7 6.6
 val c : float = 1622.99938385646283
 ```
 
-Its condition number for inversion is much larger than one. Therefore, a small change in $A$ should leads to a large change of $$A^{-1}$$.
+Its condition number for inversion is much larger than one. Therefore, a small change in $$A$$ should lead to a large change of $$A^{-1}$$.
 
 ```ocaml
 # let a' = Linalg.D.inv a;;

@@ -1115,7 +1115,7 @@ It's inverse $$A = Q\Lambda~Q^{-1}$$ is called *Eigendecomposition*.
 Analysing A's diagonal similar matrix $$\Lambda$$ instead of A itself can greatly simplify the problem.
 
 Not every matrix can be diagonalised.
-If any two of the $n$ eigenvalues of A are not the same, then its $$n$$ eigenvectors are linear-independent ana thus A can be diagonalised.
+If no two of the $$n$$ eigenvalues of A are the same, then its $$n$$ eigenvectors are linearly independent and thus A can be diagonalised.
 Specifically, every real symmetric matrix can be diagonalised by an orthogonal matrix.
 Or put into the complex space, every Hermitian matrix can be diagonalised by a unitary matrix.
 

@@ -1204,7 +1204,7 @@ $$A=U\Sigma~V^T$$
 
 Here $$U$$ is an $$m\times~m$$ matrix. Its columns are the eigenvectors of $$AA^T$$.
 Similarly, $$V$$ is an $$n\times~n$$ matrix, and the columns of V are eigenvectors of $$A^TA$$.
-The $r$ (rank of A) singular value on the diagonal of the $$m\times~n$$ diagonal matrix $$\Sigma$$ are the square roots of the nonzero eigenvalues of both $$AA^T$$ and $$A^TA$$.
+The $$r$$ (rank of A) singular values on the diagonal of the $$m\times~n$$ diagonal matrix $$\Sigma$$ are the square roots of the nonzero eigenvalues of both $$AA^T$$ and $$A^TA$$.
 It is closely related to the eigenvector factorisation of a positive definite matrix.
 For a positive definite matrix, the SVD factorisation is the same as $$Q\Lambda~Q^T$$.
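
As a supplementary sketch (not part of the commit), the points made in these hunks can be checked numerically in Owl; the matrix below is an assumed example rather than the chapter's matrix `a`.

```ocaml
(* A hedged sketch: a nearly singular matrix has a large condition number,
   so a small perturbation of a produces a large change in inv a. The 2-norm
   condition number is the ratio of the largest to the smallest singular value. *)
open Owl

let a = Mat.of_array [| 1.0; 1.0; 1.0; 1.0001 |] 2 2

let c  = Linalg.D.cond a      (* condition number for inversion *)
let a' = Linalg.D.inv a
let u, s, vt = Linalg.D.svd a (* s holds the singular values of a *)
```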

chapters/maths.md

Lines changed: 7 additions & 7 deletions
@@ -520,13 +520,13 @@ The permutation function returns the number $$n!/(n-k)!$$ of ordered subsets of
 The combination function returns the number $${n\choose k} = n!/(k!(n-k)!)$$ of subsets of $$k$$ elements of a set of $$n$$ elements.
 The table below provides the combinatorics functions you can use in the `Math` module.
 
-Function Explanation
----------------------- -----------------------------------------------------------
-`permutation n k` Permutation number
-`permutation_float n k` Similar to `permutation` but deals with larger range and returns float
-`combination n k` Combination number
-`combination_float n k` Similar to `combination` but deals with larger range and returns float
-`log_combination n k` Returns the logarithm of $${n\choose k}$$
+Function | Explanation
+---------------------- | -----------------------------------------------------------
+`permutation n k` | Permutation number
+`permutation_float n k` | Similar to `permutation` but deals with larger range and returns float
+`combination n k` | Combination number
+`combination_float n k` | Similar to `combination` but deals with larger range and returns float
+`log_combination n k` | Returns the logarithm of $${n\choose k}$$
 
 Let's take a look at a simple example.
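
As a supplementary usage sketch (not part of the commit), and assuming these functions live in Owl's `Maths` module as listed in the table:

```ocaml
(* A small usage sketch of the combinatorics functions tabulated above. *)
open Owl

let () =
  Printf.printf "P(5,2)        = %i\n" (Maths.permutation 5 2);   (* 5!/3! = 20 *)
  Printf.printf "C(5,2)        = %i\n" (Maths.combination 5 2);   (* 10 *)
  Printf.printf "C(100,50)     = %g\n" (Maths.combination_float 100 50);
  Printf.printf "log C(100,50) = %g\n" (Maths.log_combination 100 50)
```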

chapters/ndarray.md

Lines changed: 4 additions & 4 deletions
@@ -575,8 +575,8 @@ Therefore we say that, a tensor can normally be expressed in the form of an ndar
 That's why we keep using the term "ndarray" in this chapter and throughout the book.
 
 The basic idea about a tensor is that, since the object itself stays the same, if we change the coordinates in one way, the components of the vector need to change in the opposite way.
-Considering a single vector $$v$$ in a coordinate system with basis $e$.
-We can change the coordinate base to $$\tilde{e}$$ with linear transformation: $$\tilde{e} = Ae$$ where A is a matrix. For any vector in this space using $e$ as base, its content will be transformed as: $$\tilde{v} = A^{-1}v$$, or we can write it as:
+Consider a single vector $$v$$ in a coordinate system with basis $$e$$.
+We can change the coordinate base to $$\tilde{e}$$ with a linear transformation: $$\tilde{e} = Ae$$ where A is a matrix. For any vector in this space using $$e$$ as base, its components will be transformed as: $$\tilde{v} = A^{-1}v$$, or we can write it as:
 
 $$\tilde{v}^i = \sum_j~B_j^i~v^j.$$
 

@@ -597,12 +597,12 @@ $$\tilde{L_j^i} = \sum_{kl}~B_k^i~L_l^k~A_j^l.$$
 
 Again, note we use both superscript and subscript for the linear map $$L$$, since it contains one covariant component and one contravariant component.
 Furthermore, we can extend this process and define the tensor.
-A tensor $T$ is an object that is invariant under a change of coordinates, and with a change of coordinates its component changes in a special way.
+A tensor $$T$$ is an object that is invariant under a change of coordinates, and with a change of coordinates its components change in a special way.
 The way is that:
 
 $$\tilde{T_{xyz~\ldots}^{abc~\ldots}} = \sum_{ijk\ldots~rst\ldots}~B_i^aB_j^bB_k^c\ldots~T_{rst~\ldots}^{ijk~\ldots}~A_x^rA_y^sA_z^t\ldots$$
 
-Here the $ijk\ldots$ are indices of the contravariant part of the tensor and the $$rst\ldots$$ are that of the covariant part.
+Here the $$ijk\ldots$$ are indices of the contravariant part of the tensor and the $$rst\ldots$$ are those of the covariant part.
 
 One of the important operations on tensors is the *tensor contraction*. We are familiar with the matrix multiplication:
 $$C_j^i = \sum_{k}A_k^iB_j^k.$$
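
As a supplementary sketch (not part of the commit): matrix multiplication is precisely the contraction of one contravariant index with one covariant index, and Owl exposes contraction on ndarrays. The `Arr.contract2` call and its axis-pairing argument below are an assumption about Owl's ndarray API, used here only for illustration.

```ocaml
(* A hedged sketch: the matrix product C^i_j = sum_k A^i_k B^k_j written as a
   tensor contraction that pairs axis 1 of x with axis 0 of y. *)
open Owl

let x = Arr.sequential [| 2; 3 |]
let y = Arr.sequential [| 3; 4 |]

(* z has shape [|2; 4|], the same result as the matrix product of x and y *)
let z = Arr.contract2 [| (1, 0) |] x y
```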

chapters/neural-network.md

Lines changed: 1 addition & 1 deletion
@@ -609,7 +609,7 @@ let loss = Maths.(loss / _f (Mat.row_num yt |> float_of_int))
 To compare how different the inference result `y'` is from the true label `y`, we need the loss function.
 Previously we have used `cross_entropy`; the optimisation module provides other popular loss functions in its `Loss` module:
 
-- `Loss.L1norm`: $$\sum|y - y'|$$
+- `Loss.L1norm`: $$\sum\|y - y'\|$$
 - `Loss.L2norm`: $$\sum\|y - y'\|_2$$
 - `Loss.Quadratic`: $$\sum\|y - y'\|_2^2$$
 - `Loss.Hinge`: $$\sum\textrm{max}(0, 1-y^Ty')$$
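
As a supplementary sketch (not part of the commit), the listed losses can be written out directly with `Algodiff` maths; this is an illustration of what they compute, not the `Loss` module's implementation.

```ocaml
(* A hedged sketch of the loss values listed above, for Algodiff matrices
   y (labels) and y' (predictions). *)
open Owl
open Algodiff.D

let l1norm_loss y y' = Maths.(sum' (abs (y - y')))        (* sum |y - y'|        *)
let l2norm_loss y y' = Maths.(sqrt (sum' (sqr (y - y')))) (* ||y - y'||_2        *)
let quadratic_loss y y' = Maths.(sum' (sqr (y - y')))     (* ||y - y'||_2^2      *)
let hinge_loss y y' =                                     (* sum max(0, 1 - yy') *)
  Maths.(sum' (max2 (F 0.) (F 1. - y * y')))
```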

chapters/optimisation.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ $$\textrm{minimise } f_0(\mathbf{x}),$$
 
 $$\textrm{subject to } f_i(\mathbf{x}) \leq b_i, i = 1, 2, \ldots, m. $$
 
-Here $\mathbf{x}$ is a vector that contains all the *optimisation variable*: $$\mathbf{x} = [x_0, x_1, ... x_n]$. Function $$f_0 : \mathbf{R}^n \rightarrow \mathbf{R}$$ is the optimisation target, and is called an *objective function*, or *cost function*.
+Here $$\mathbf{x}$$ is a vector that contains all the *optimisation variables*: $$\mathbf{x} = [x_0, x_1, ... x_n]$$. Function $$f_0 : \mathbf{R}^n \rightarrow \mathbf{R}$$ is the optimisation target, and is called an *objective function*, or *cost function*.
 An optimisation problem could be bounded by zero or more *constraints*. $$f_i : \mathbf{R}^n \rightarrow \mathbf{R}$$ in a constraint is called a *constraint function*, which is bounded by the $$b_i$$'s.
 The target is to find the optimal variable values $$\mathbf{x}^{*}$$ so that $$f_0$$ can take on a maximum or minimum value.
 

@@ -480,7 +480,7 @@ One example of algorithm: *Simulated Annealing Methods*. A suitable systems to a
 First, it contains a finite set $$S$$, and a cost function $$f$$ that is defined on this set.
 There is also a non-increasing function $$T$$ that projects the set of positive integers to real positive values.
 $$T(t)$$ is called the *temperature* at time $$t$$.
-Suppose at time $t$, the current state is $$i$$ in $$S$$.
+Suppose at time $$t$$, the current state is $$i$$ in $$S$$.
 It chooses one of its neighbours $$j$$ randomly.
 Next, if $$f(j) < f(i)$$ then $$j$$ is used as the next state. If not, then $$j$$ is chosen as the next state with a probability of $$e^{-\frac{f(j)-f(i)}{T(t)}}$$; otherwise the next state stays $$i$$.
 Starting from an initial state $$x_0$$, this process is repeated for a finite number of steps to find the optimum.
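
As a supplementary sketch (not part of the commit), the annealing step described in this hunk can be written down directly; the neighbour function and temperature schedule are left abstract because the text only requires them to exist.

```ocaml
(* A minimal sketch of the simulated annealing step described above: accept a
   better neighbour outright, and a worse one with probability
   exp (-(f j - f i) / T t). *)
let simulated_annealing ~f ~neighbour ~temperature ~steps x0 =
  let rec loop t x =
    if t >= steps then x
    else
      let j = neighbour x in
      let x' =
        if f j < f x then j
        else if Random.float 1.0 < exp (-. (f j -. f x) /. temperature t) then j
        else x
      in
      loop (t + 1) x'
  in
  loop 0 x0

(* Illustrative use: minimise x^2 over integers with a decaying temperature.
   let x_opt =
     simulated_annealing
       ~f:(fun x -> float (x * x))
       ~neighbour:(fun x -> x + (if Random.bool () then 1 else -1))
       ~temperature:(fun t -> 10. /. float (t + 1))
       ~steps:1000 50 *)
```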

chapters/regression.md

Lines changed: 3 additions & 3 deletions
@@ -824,7 +824,7 @@ In other words, for data $$\boldsymbol{x}$$ that contains any number of features
 
 $$\theta_0 + \sum_{i=1}^k\theta_i~d_i,$$
 
-where $d_i$ is the "distance" of the current point to the reference point $$p_k$$.
+where $$d_i$$ is the "distance" of the current point to the reference point $$p_k$$.
 In the inference phase, if this function is larger than zero, then the point is predicted to be positive; otherwise it is negative.
 
 So how exactly is this "distance" calculated? There are many ways to do that, and one of the most widely used is the Gaussian distance.

@@ -892,7 +892,7 @@ The latter is the deviation of the observed value from the unobservable true val
 
 First, let's look at the two most commonly used metrics:
 
-- **Mean absolute error** (MAE): average absolute value fo residuals, represented by: $$\textrm{MAE}=\frac{1}{n}\sum|y - y'|$$.
+- **Mean absolute error** (MAE): average absolute value of residuals, represented by: $$\textrm{MAE}=\frac{1}{n}\sum\|y - y'\|$$.
 - **Mean square error** (MSE): average squared residuals, represented as: $$\textrm{MSE}=\frac{1}{n}\sum(y-y')^2$$. This is the method we have previously used in linear regression in this chapter.
 The part before applying the average is called **Residual Sum of Squares** (RSS): $$\textrm{RSS}=\sum(y-y')^2$$.
 

@@ -905,7 +905,7 @@ Based on these two basic metrics, we can derive the definition of other metrics:
 
 - **Root mean squared error** (RMSE): it is just the square root of MSE. By applying the square root, the unit of error is back to that of the data and thus easier to interpret. Besides, this metric is similar to the standard deviation and denotes how widely the residuals spread out.
 
-- **Mean absolute percentage error** (MAPE): based on MAE, MAPE changes it into percentage representation: $$\textrm{MAPE}=\frac{1}{n}\sum |\frac{y - y'}{y}|$$. It denotes the average distance between a model's predictions and their corresponding outputs in percentage format, for easier interpretation.
+- **Mean absolute percentage error** (MAPE): based on MAE, MAPE changes it into percentage representation: $$\textrm{MAPE}=\frac{1}{n}\sum \|\frac{y - y'}{y}\|$$. It denotes the average distance between a model's predictions and their corresponding outputs in percentage format, for easier interpretation.
 
 - **Mean percentage error** (MPE): similar to MAPE, but does not use the absolute value: $$\textrm{MPE}=\frac{1}{n}\sum\left(\frac{y - y'}{y} \right)$$. Without the absolute value, the metric can indicate whether the predicted value is larger or smaller than the observed value in the data. So unlike MAE and MSE, it is a relative measurement of error.
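
As a supplementary sketch (not part of the commit), the metrics in these hunks are one-liners over Owl matrices; `y` and `y_hat` are assumed to be matrices of observed and predicted values.

```ocaml
(* A hedged sketch of the error metrics discussed above. *)
open Owl

let mae  y y_hat = Mat.(mean' (abs (y - y_hat)))        (* mean |y - y'|       *)
let mse  y y_hat = Mat.(mean' (sqr (y - y_hat)))        (* mean (y - y')^2     *)
let rmse y y_hat = sqrt (mse y y_hat)                   (* square root of MSE  *)
let mape y y_hat = Mat.(mean' (abs ((y - y_hat) / y)))  (* mean |(y - y') / y| *)
let mpe  y y_hat = Mat.(mean' ((y - y_hat) / y))        (* mean (y - y') / y   *)
```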

chapters/stats.md

Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@ Suppose the probability of tossing head is 0.4, and for 10 times.
 ```
 
 The equation is called the *probability density function* (PDF) of this binomial distribution.
-Formally the PDF of random variable X is denoted with $p_X(k)$ and is defined as:
+Formally the PDF of random variable X is denoted with $$p_X(k)$$ and is defined as:
 
 $$p_X(k)=P(\{s \in S | X(s) = k\}),$$
 

@@ -324,7 +324,7 @@ It is expressed by an simple form as shown in the equation, e.g. it provides a w
 $$P(X\|Y) = \frac{P(Y\|X)P(X)}{P(Y)}$$ {#eq:stats:bayes}
 
 One powerful application of this theorem is that it provides a tool to calibrate your knowledge about something ("it has a 10% chance of happening") based on observed evidence.
-For example, a novice hardly tell if a dice is normal or loaded. If I show you a dice and ask you to estimate the probability that this dice a fake one, you would say "hmm, I don't know, perhaps 10%". Define event $$X$$ to be "the dice is loaded", and you just set a **prior** that $$P(X) = 0.1$.
+For example, a novice can hardly tell if a dice is normal or loaded. If I show you a dice and ask you to estimate the probability that this dice is a fake one, you would say "hmm, I don't know, perhaps 10%". Define event $$X$$ to be "the dice is loaded", and you just set a **prior** that $$P(X) = 0.1$$.
 Now I roll it three times, and somehow, I get three 6's. Now I ask you again, *given the evidence you just observed*, estimate again the probability that the dice is loaded.
 Define $$Y$$ as the event "getting 6's on all three rolls".
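
As a supplementary worked step (not part of the commit), and assuming purely for illustration that a loaded dice shows a 6 with probability $$1/2$$ while a fair one shows it with probability $$1/6$$, the calibrated estimate after observing three 6's would be

$$P(X\|Y) = \frac{P(Y\|X)P(X)}{P(Y\|X)P(X) + P(Y\|\neg~X)P(\neg~X)} = \frac{0.5^3 \times 0.1}{0.5^3 \times 0.1 + (1/6)^3 \times 0.9} \approx 0.75,$$

a large jump from the prior of $$0.1$$.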
