
Commit 6b1cc63 (1 parent: e9032c0)

Commit message: editing...

9 files changed: +38 -52 lines changed

chapters/algodiff.md

Lines changed: 14 additions & 28 deletions
@@ -282,34 +282,20 @@ Step |Primal computation
 You might be wondering, this looks the same as the left side.
 You are right. These two are exactly the same, and we repeat it to make the point that this time you cannot perform the calculation in one pass.
 You must compute the required intermediate results first, and then perform the other "backward pass", which is the key point in reverse mode.
-
----- ---------------------------------------------------------------------------------
-Step Adjoint computation
----- ---------------------------------------------------------------------------------
-10 $$\bar{v_9} = 1$$
-
-11 $$\bar{v_8} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_8} = 1 * \frac{-v_7}{v_8^2} = \frac{-1}{7.30^2} = -0.019$$
-
-12 $$\bar{v_7} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_7} = \frac{1}{v_8} = 0.137$$
-
-13 $$\bar{v_6} = \bar{v_8}\frac{\partial~v_8}{\partial~v_6} = \bar{v_8} * \frac{\partial~(v_6 + v5)}{\partial~v_6} = \bar{v_8}$$
-
-14 $$\bar{v_5} = \bar{v_8}\frac{\partial~v_8}{\partial~v_5} = \bar{v_8} * \frac{\partial~(v_6 + v5)}{\partial~v_5} = \bar{v_8}$$
-
-15 $$\bar{v_4} = \bar{v_6}\frac{\partial~v_6}{\partial~v_4} = \bar{v_8} * \frac{\partial~\exp{(v_4)}}{\partial~v_4} = \bar{v_8} * e^{v_4}$$
-
-16 $$\bar{v_3} = \bar{v_4}\frac{\partial~v_4}{\partial~v_3} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_3} = \bar{v_4}$$
-
-17 $$\bar{v_2} = \bar{v_4}\frac{\partial~v_4}{\partial~v_2} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_2} = \bar{v_4}$$
-
-18 $$\bar{v_1} = \bar{v_3}\frac{\partial~v_3}{\partial~v_1} = \bar{v_3} * \frac{\partial~(v_0*v_1)}{\partial~v_1} = \bar{v_4} * v_0 = \bar{v_4}$$
-
-19 $$\bar{v_{02}} = \bar{v_2}\frac{\partial~v_2}{\partial~v_0} = \bar{v_2} * \frac{\partial~(sin(v_0))}{\partial~v_0} = \bar{v_4} * cos(v_0)$$
-f
-20 $$\bar{v_{03}} = \bar{v_3}\frac{\partial~v_3}{\partial~v_0} = \bar{v_3} * \frac{\partial~(v_0 * v_1)}{\partial~v_0} = \bar{v_4} * v_1$$
-
-21 $$\bar{v_0} = \bar{v_{02}} + \bar{v_{03}} = \bar{v_4}(cos(v_0) + v_1) = \bar{v_8} * e^{v_4}(0.54 + 1) = -0.019 * e^{1.84} * 1.54 = -0.18$$
----- ---------------------------------------------------------------------------------
+Step | Adjoint computation
+---- | ---------------------------------------------------------------------------------
+10 | $$\bar{v_9} = 1$$
+11 | $$\bar{v_8} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_8} = 1 * \frac{-v_7}{v_8^2} = \frac{-1}{7.30^2} = -0.019$$
+12 | $$\bar{v_7} = \bar{v_9}\frac{\partial~(v_7/v_8)}{\partial~v_7} = \frac{1}{v_8} = 0.137$$
+13 | $$\bar{v_6} = \bar{v_8}\frac{\partial~v_8}{\partial~v_6} = \bar{v_8} * \frac{\partial~(v_6 + v_5)}{\partial~v_6} = \bar{v_8}$$
+14 | $$\bar{v_5} = \bar{v_8}\frac{\partial~v_8}{\partial~v_5} = \bar{v_8} * \frac{\partial~(v_6 + v_5)}{\partial~v_5} = \bar{v_8}$$
+15 | $$\bar{v_4} = \bar{v_6}\frac{\partial~v_6}{\partial~v_4} = \bar{v_8} * \frac{\partial~\exp{(v_4)}}{\partial~v_4} = \bar{v_8} * e^{v_4}$$
+16 | $$\bar{v_3} = \bar{v_4}\frac{\partial~v_4}{\partial~v_3} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_3} = \bar{v_4}$$
+17 | $$\bar{v_2} = \bar{v_4}\frac{\partial~v_4}{\partial~v_2} = \bar{v_4} * \frac{\partial~(v_2 + v_3)}{\partial~v_2} = \bar{v_4}$$
+18 | $$\bar{v_1} = \bar{v_3}\frac{\partial~v_3}{\partial~v_1} = \bar{v_3} * \frac{\partial~(v_0*v_1)}{\partial~v_1} = \bar{v_4} * v_0 = \bar{v_4}$$
+19 | $$\bar{v_{02}} = \bar{v_2}\frac{\partial~v_2}{\partial~v_0} = \bar{v_2} * \frac{\partial~(sin(v_0))}{\partial~v_0} = \bar{v_4} * cos(v_0)$$
+20 | $$\bar{v_{03}} = \bar{v_3}\frac{\partial~v_3}{\partial~v_0} = \bar{v_3} * \frac{\partial~(v_0 * v_1)}{\partial~v_0} = \bar{v_4} * v_1$$
+21 | $$\bar{v_0} = \bar{v_{02}} + \bar{v_{03}} = \bar{v_4}(cos(v_0) + v_1) = \bar{v_8} * e^{v_4}(0.54 + 1) = -0.019 * e^{1.84} * 1.54 = -0.18$$
 
 Note that things are a bit different for $$x_0$$. It is used in both intermediate variables $$v_2$$ and $$v_3$$.
 Therefore, we compute the adjoint of $$v_0$$ with regard to $$v_2$$ (step 19) and $$v_3$$ (step 20), and accumulate them together (step 21).
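
As a supplementary sketch (not part of the commit): the backward pass tabulated above is what Owl's `Algodiff` module performs automatically. The following is a hedged illustration assuming the `Algodiff.D` interface; the function `f` is a stand-in, not the exact expression used in the table.

```ocaml
(* A minimal sketch, assuming Owl's Algodiff.D API. diff builds the forward
   computation graph and then propagates adjoints backwards, exactly as in the
   step-by-step table above. The function f is an illustrative stand-in. *)
open Owl
open Algodiff.D

let f x = Maths.(div (F 1.) (F 1. + exp (sin x + x)))

let () =
  let df = diff f (F 1.) in
  Printf.printf "df/dx at x = 1: %g\n" (unpack_flt df)
```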

chapters/diffequation.md

Lines changed: 2 additions & 2 deletions
@@ -394,7 +394,7 @@ Later we will show an example of using the symplectic solver to solve a damped h
 One feature of `owl-ode` is the automatic inference of state dimensionality from the initial state.
 For example, the native solvers take matrices as state.
 Suppose the initial state of the system is a row vector of dimension $$1\times~N$$.
-After $$T$$ time steps, the states are stacked vertically, and thus have dimensions $T\times~N$.
+After $$T$$ time steps, the states are stacked vertically, and thus have dimensions $$T\times~N$$.
 If the initial state is a column vector of shape $$N\times~1$$, then the stacked state after $$T$$ time steps will be inferred as $$N\times~T$$.
 
 The temporal integration of matrices, i.e. cases where the initial state is a matrix instead of a vector, is also supported.

@@ -566,7 +566,7 @@ let custom_solver = Native.D.rk45 ~tol:1E-9 ~dtmax:10.0
 ```
 
 Now, we can solve the ODE system and visualise the results.
-In the plots, we first show how the value of $x$, $y$ and $z$ changes with time; next we show the phase plane plots between each two of them.
+In the plots, we first show how the value of $$x$$, $$y$$ and $$z$$ changes with time; next we show the phase plane plots between each two of them.
 
 ```ocaml
 let _ =
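
As a supplementary sketch (not part of the commit): since the hunk above is truncated, here is a hedged illustration of how a custom `Native.D.rk45` solver is typically passed to `Ode.odeint` in `owl-ode`. The right-hand side `f`, the initial state and the time specification are assumptions for the example, not the chapter's system.

```ocaml
(* A hedged sketch, assuming the owl-ode interface referenced in this hunk.
   The state is a 1 x 3 row vector, so the stacked output ys has shape T x 3,
   matching the dimensionality rule described in the first hunk. *)
open Owl
open Owl_ode

let custom_solver = Native.D.rk45 ~tol:1E-9 ~dtmax:10.0

(* illustrative linear decay dy/dt = -y; replace with the chapter's system *)
let f y _t = Mat.(y *$ (-1.))

let y0 = Mat.of_array [| 1.; 2.; 3. |] 1 3
let tspec = Types.T1 { t0 = 0.; duration = 2.; dt = 1E-2 }

let ts, ys = Ode.odeint custom_solver f y0 tspec ()
```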

chapters/linalg.md

Lines changed: 3 additions & 3 deletions
@@ -843,7 +843,7 @@ R1 9.7 6.6
 val c : float = 1622.99938385646283
 ```
 
-Its condition number for inversion is much larger than one. Therefore, a small change in $A$ should leads to a large change of $$A^{-1}$$.
+Its condition number for inversion is much larger than one. Therefore, a small change in $$A$$ should lead to a large change of $$A^{-1}$$.
 
 ```ocaml
 # let a' = Linalg.D.inv a;;

@@ -1115,7 +1115,7 @@ It's inverse $$A = Q\Lambda~Q^{-1}$$ is called *Eigendecomposition*.
 Analysing A's diagonal similar matrix $$\Lambda$$ instead of A itself can greatly simplify the problem.
 
 Not every matrix can be diagonalised.
-If any two of the $n$ eigenvalues of A are not the same, then its $$n$$ eigenvectors are linear-independent ana thus A can be diagonalised.
+If no two of the $$n$$ eigenvalues of A are the same, then its $$n$$ eigenvectors are linearly independent and thus A can be diagonalised.
 Specifically, every real symmetric matrix can be diagonalised by an orthogonal matrix.
 Or put into the complex space, every Hermitian matrix can be diagonalised by a unitary matrix.
 

@@ -1204,7 +1204,7 @@ $$A=U\Sigma~V^T$$
 
 Here $$U$$ is an $$m\times~m$$ matrix. Its columns are the eigenvectors of $$AA^T$$.
 Similarly, $$V$$ is an $$n\times~n$$ matrix, and the columns of V are eigenvectors of $$A^TA$$.
-The $r$ (rank of A) singular value on the diagonal of the $$m\times~n$$ diagonal matrix $$\Sigma$$ are the square roots of the nonzero eigenvalues of both $$AA^T$$ and $$A^TA$$.
+The $$r$$ (rank of A) singular values on the diagonal of the $$m\times~n$$ diagonal matrix $$\Sigma$$ are the square roots of the nonzero eigenvalues of both $$AA^T$$ and $$A^TA$$.
 It is closely related to the eigenvector factorisation of a positive definite matrix.
 For a positive definite matrix, the SVD factorisation is the same as $$Q\Lambda~Q^T$$.
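
As a supplementary sketch (not part of the commit), the points made in these hunks can be checked numerically in Owl; the matrix below is an assumed example rather than the chapter's matrix `a`.

```ocaml
(* A hedged sketch: a nearly singular matrix has a large condition number,
   so a small perturbation of a produces a large change in inv a. The 2-norm
   condition number is the ratio of the largest to the smallest singular value. *)
open Owl

let a = Mat.of_array [| 1.0; 1.0; 1.0; 1.0001 |] 2 2

let c  = Linalg.D.cond a      (* condition number for inversion *)
let a' = Linalg.D.inv a
let u, s, vt = Linalg.D.svd a (* s holds the singular values of a *)
```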

chapters/maths.md

Lines changed: 7 additions & 7 deletions
@@ -520,13 +520,13 @@ The permutation function returns the number $$n!/(n-k)!$$ of ordered subsets of
 The combination function returns the number $${n\choose k} = n!/(k!(n-k)!)$$ of subsets of $$k$$ elements of a set of $$n$$ elements.
 The table below provides the combinatorics functions you can use in the `Math` module.
 
-Function Explanation
----------------------- -----------------------------------------------------------
-`permutation n k` Permutation number
-`permutation_float n k` Similar to `permutation` but deals with larger range and returns float
-`combination n k` Combination number
-`combination_float n k` Similar to `combination` but deals with larger range and returns float
-`log_combination n k` Returns the logarithm of $${n\choose k}$$
+Function | Explanation
+---------------------- | -----------------------------------------------------------
+`permutation n k` | Permutation number
+`permutation_float n k` | Similar to `permutation` but deals with larger range and returns float
+`combination n k` | Combination number
+`combination_float n k` | Similar to `combination` but deals with larger range and returns float
+`log_combination n k` | Returns the logarithm of $${n\choose k}$$
 
 Let's take a look at a simple example.
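
As a supplementary usage sketch (not part of the commit), and assuming these functions live in Owl's `Maths` module as listed in the table:

```ocaml
(* A small usage sketch of the combinatorics functions tabulated above. *)
open Owl

let () =
  Printf.printf "P(5,2)        = %i\n" (Maths.permutation 5 2);   (* 5!/3! = 20 *)
  Printf.printf "C(5,2)        = %i\n" (Maths.combination 5 2);   (* 10 *)
  Printf.printf "C(100,50)     = %g\n" (Maths.combination_float 100 50);
  Printf.printf "log C(100,50) = %g\n" (Maths.log_combination 100 50)
```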

chapters/ndarray.md

Lines changed: 4 additions & 4 deletions
@@ -575,8 +575,8 @@ Therefore we say that, a tensor can normally be expressed in the form of an ndar
 That's why we keep using the term "ndarray" in this chapter and throughout the book.
 
 The basic idea about a tensor is that, since the object itself stays the same, if we change the coordinates in one way, the components of the vector need to change in the opposite way.
-Considering a single vector $$v$$ in a coordinate system with basis $e$.
-We can change the coordinate base to $$\tilde{e}$$ with linear transformation: $$\tilde{e} = Ae$$ where A is a matrix. For any vector in this space using $e$ as base, its content will be transformed as: $$\tilde{v} = A^{-1}v$$, or we can write it as:
+Consider a single vector $$v$$ in a coordinate system with basis $$e$$.
+We can change the coordinate base to $$\tilde{e}$$ with a linear transformation: $$\tilde{e} = Ae$$ where A is a matrix. For any vector in this space using $$e$$ as base, its components will be transformed as: $$\tilde{v} = A^{-1}v$$, or we can write it as:
 
 $$\tilde{v}^i = \sum_j~B_j^i~v^j.$$
 

@@ -597,12 +597,12 @@ $$\tilde{L_j^i} = \sum_{kl}~B_k^i~L_l^k~A_j^l.$$
 
 Again, note we use both superscript and subscript for the linear map $$L$$, since it contains one covariant component and one contravariant component.
 Furthermore, we can extend this process and define the tensor.
-A tensor $T$ is an object that is invariant under a change of coordinates, and with a change of coordinates its component changes in a special way.
+A tensor $$T$$ is an object that is invariant under a change of coordinates, and with a change of coordinates its components change in a special way.
 The way is that:
 
 $$\tilde{T_{xyz~\ldots}^{abc~\ldots}} = \sum_{ijk\ldots~rst\ldots}~B_i^aB_j^bB_k^c\ldots~T_{rst~\ldots}^{ijk~\ldots}~A_x^rA_y^sA_z^t\ldots$$
 
-Here the $ijk\ldots$ are indices of the contravariant part of the tensor and the $$rst\ldots$$ are that of the covariant part.
+Here the $$ijk\ldots$$ are indices of the contravariant part of the tensor and the $$rst\ldots$$ are those of the covariant part.
 
 One of the important operations on tensors is the *tensor contraction*. We are familiar with the matrix multiplication:
 $$C_j^i = \sum_{k}A_k^iB_j^k.$$
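
As a supplementary sketch (not part of the commit): matrix multiplication is precisely the contraction of one contravariant index with one covariant index, and Owl exposes contraction on ndarrays. The `Arr.contract2` call and its axis-pairing argument below are an assumption about Owl's ndarray API, used here only for illustration.

```ocaml
(* A hedged sketch: the matrix product C^i_j = sum_k A^i_k B^k_j written as a
   tensor contraction that pairs axis 1 of x with axis 0 of y. *)
open Owl

let x = Arr.sequential [| 2; 3 |]
let y = Arr.sequential [| 3; 4 |]

(* z has shape [|2; 4|], the same result as the matrix product of x and y *)
let z = Arr.contract2 [| (1, 0) |] x y
```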

chapters/neural-network.md

Lines changed: 1 addition & 1 deletion
@@ -609,7 +609,7 @@ let loss = Maths.(loss / _f (Mat.row_num yt |> float_of_int))
 To compare how different the inference result `y'` is from the true label `y`, we need the loss function.
 Previously we have used `cross_entropy`; the optimisation module provides other popular loss functions in its `Loss` module:
 
-- `Loss.L1norm`: $$\sum|y - y'|$$
+- `Loss.L1norm`: $$\sum\|y - y'\|$$
 - `Loss.L2norm`: $$\sum\|y - y'\|_2$$
 - `Loss.Quadratic`: $$\sum\|y - y'\|_2^2$$
 - `Loss.Hinge`: $$\sum\textrm{max}(0, 1-y^Ty')$$
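
As a supplementary sketch (not part of the commit), the listed losses can be written out directly with `Algodiff` maths; this is an illustration of what they compute, not the `Loss` module's implementation.

```ocaml
(* A hedged sketch of the loss values listed above, for Algodiff matrices
   y (labels) and y' (predictions). *)
open Owl
open Algodiff.D

let l1norm_loss y y' = Maths.(sum' (abs (y - y')))        (* sum |y - y'|        *)
let l2norm_loss y y' = Maths.(sqrt (sum' (sqr (y - y')))) (* ||y - y'||_2        *)
let quadratic_loss y y' = Maths.(sum' (sqr (y - y')))     (* ||y - y'||_2^2      *)
let hinge_loss y y' =                                     (* sum max(0, 1 - yy') *)
  Maths.(sum' (max2 (F 0.) (F 1. - y * y')))
```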

chapters/optimisation.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ $$\textrm{minimise } f_0(\mathbf{x}),$$
 
 $$\textrm{subject to } f_i(\mathbf{x}) \leq b_i, i = 1, 2, \ldots, m. $$
 
-Here $\mathbf{x}$ is a vector that contains all the *optimisation variable*: $$\mathbf{x} = [x_0, x_1, ... x_n]$. Function $$f_0 : \mathbf{R}^n \rightarrow \mathbf{R}$$ is the optimisation target, and is called an *objective function*, or *cost function*.
+Here $$\mathbf{x}$$ is a vector that contains all the *optimisation variables*: $$\mathbf{x} = [x_0, x_1, ... x_n]$$. Function $$f_0 : \mathbf{R}^n \rightarrow \mathbf{R}$$ is the optimisation target, and is called an *objective function*, or *cost function*.
 An optimisation problem could be bounded by zero or more *constraints*. $$f_i : \mathbf{R}^n \rightarrow \mathbf{R}$$ in a constraint is called a *constraint function*, which is bounded by the $$b_i$$'s.
 The target is to find the optimal variable values $$\mathbf{x}^{*}$$ so that $$f_0$$ can take on a maximum or minimum value.
 

@@ -480,7 +480,7 @@ One example of algorithm: *Simulated Annealing Methods*. A suitable systems to a
 First, it contains a finite set $$S$$, and a cost function $$f$$ that is defined on this set.
 There is also a non-increasing function $$T$$ that projects the set of positive integers to real positive values.
 $$T(t)$$ is called the *temperature* at time $$t$$.
-Suppose at time $t$, the current state is $$i$$ in $$S$$.
+Suppose at time $$t$$, the current state is $$i$$ in $$S$$.
 It chooses one of its neighbours $$j$$ randomly.
 Next, if $$f(j) < f(i)$$ then $$j$$ is used as the next state. If not, then $$j$$ is chosen as the next state with a probability of $$e^{-\frac{f(j)-f(i)}{T(t)}}$$; otherwise the next state stays $$i$$.
 Starting from an initial state $$x_0$$, this process is repeated for a finite number of steps to find the optimum.
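
As a supplementary sketch (not part of the commit), the annealing step described in this hunk can be written down directly; the neighbour function and temperature schedule are left abstract because the text only requires them to exist.

```ocaml
(* A minimal sketch of the simulated annealing step described above: accept a
   better neighbour outright, and a worse one with probability
   exp (-(f j - f i) / T t). *)
let simulated_annealing ~f ~neighbour ~temperature ~steps x0 =
  let rec loop t x =
    if t >= steps then x
    else
      let j = neighbour x in
      let x' =
        if f j < f x then j
        else if Random.float 1.0 < exp (-. (f j -. f x) /. temperature t) then j
        else x
      in
      loop (t + 1) x'
  in
  loop 0 x0

(* Illustrative use: minimise x^2 over integers with a decaying temperature.
   let x_opt =
     simulated_annealing
       ~f:(fun x -> float (x * x))
       ~neighbour:(fun x -> x + (if Random.bool () then 1 else -1))
       ~temperature:(fun t -> 10. /. float (t + 1))
       ~steps:1000 50 *)
```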

chapters/regression.md

Lines changed: 3 additions & 3 deletions
@@ -824,7 +824,7 @@ In other words, for data $$\boldsymbol{x}$$ that contains any number of features
 
 $$\theta_0 + \sum_{i=1}^k\theta_i~d_i,$$
 
-where $d_i$ is the "distance" of the current point to the reference point $$p_k$$.
+where $$d_i$$ is the "distance" of the current point to the reference point $$p_k$$.
 In the inference phase, if this function is larger than zero, then the point is predicted to be positive; otherwise it is negative.
 
 So how exactly is this "distance" calculated? There are many ways to do that, and one of the most widely used is the Gaussian distance.

@@ -892,7 +892,7 @@ The latter is the deviation of the observed value from the unobservable true val
 
 First, let's look at the two most commonly used metrics:
 
-- **Mean absolute error** (MAE): average absolute value fo residuals, represented by: $$\textrm{MAE}=\frac{1}{n}\sum|y - y'|$$.
+- **Mean absolute error** (MAE): average absolute value of residuals, represented by: $$\textrm{MAE}=\frac{1}{n}\sum\|y - y'\|$$.
 - **Mean square error** (MSE): average squared residuals, represented as: $$\textrm{MSE}=\frac{1}{n}\sum(y-y')^2$$. This is the method we have previously used in linear regression in this chapter.
 The part before applying the average is called **Residual Sum of Squares** (RSS): $$\textrm{RSS}=\sum(y-y')^2$$.
 

@@ -905,7 +905,7 @@ Based on these two basic metrics, we can derive the definition of other metrics:
 
 - **Root mean squared error** (RMSE): it is just the square root of MSE. By applying the square root, the unit of error is back to that of the data and thus easier to interpret. Besides, this metric is similar to the standard deviation and denotes how widely the residuals spread out.
 
-- **Mean absolute percentage error** (MAPE): based on MAE, MAPE changes it into percentage representation: $$\textrm{MAPE}=\frac{1}{n}\sum |\frac{y - y'}{y}|$$. It denotes the average distance between a model's predictions and their corresponding outputs in percentage format, for easier interpretation.
+- **Mean absolute percentage error** (MAPE): based on MAE, MAPE changes it into percentage representation: $$\textrm{MAPE}=\frac{1}{n}\sum \|\frac{y - y'}{y}\|$$. It denotes the average distance between a model's predictions and their corresponding outputs in percentage format, for easier interpretation.
 
 - **Mean percentage error** (MPE): similar to MAPE, but does not use the absolute value: $$\textrm{MPE}=\frac{1}{n}\sum\left(\frac{y - y'}{y} \right)$$. Without the absolute value, the metric can indicate whether the predicted value is larger or smaller than the observed value in the data. So unlike MAE and MSE, it is a relative measurement of error.
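
As a supplementary sketch (not part of the commit), the metrics in these hunks are one-liners over Owl matrices; `y` and `y_hat` are assumed to be matrices of observed and predicted values.

```ocaml
(* A hedged sketch of the error metrics discussed above. *)
open Owl

let mae  y y_hat = Mat.(mean' (abs (y - y_hat)))        (* mean |y - y'|       *)
let mse  y y_hat = Mat.(mean' (sqr (y - y_hat)))        (* mean (y - y')^2     *)
let rmse y y_hat = sqrt (mse y y_hat)                   (* square root of MSE  *)
let mape y y_hat = Mat.(mean' (abs ((y - y_hat) / y)))  (* mean |(y - y') / y| *)
let mpe  y y_hat = Mat.(mean' ((y - y_hat) / y))        (* mean (y - y') / y   *)
```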

chapters/stats.md

Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@ Suppose the probability of tossing head is 0.4, and for 10 times.
 ```
 
 The equation is called the *probability density function* (PDF) of this binomial distribution.
-Formally the PDF of random variable X is denoted with $p_X(k)$ and is defined as:
+Formally the PDF of random variable X is denoted with $$p_X(k)$$ and is defined as:
 
 $$p_X(k)=P(\{s \in S | X(s) = k\}),$$
 

@@ -324,7 +324,7 @@ It is expressed by an simple form as shown in the equation, e.g. it provides a w
 $$P(X\|Y) = \frac{P(Y\|X)P(X)}{P(Y)}$$ {#eq:stats:bayes}
 
 One powerful application of this theorem is that it provides a tool to calibrate your knowledge about something ("it has a 10% chance of happening") based on observed evidence.
-For example, a novice hardly tell if a dice is normal or loaded. If I show you a dice and ask you to estimate the probability that this dice a fake one, you would say "hmm, I don't know, perhaps 10%". Define event $$X$$ to be "the dice is loaded", and you just set a **prior** that $$P(X) = 0.1$.
+For example, a novice can hardly tell if a dice is normal or loaded. If I show you a dice and ask you to estimate the probability that this dice is a fake one, you would say "hmm, I don't know, perhaps 10%". Define event $$X$$ to be "the dice is loaded", and you just set a **prior** that $$P(X) = 0.1$$.
 Now I roll it three times, and somehow, I get three 6's. Now I ask you again, *given the evidence you just observed*, estimate again the probability that the dice is loaded.
 Define $$Y$$ as the event "getting 6's on all three rolls".
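
As a supplementary worked step (not part of the commit), and assuming purely for illustration that a loaded dice shows a 6 with probability $$1/2$$ while a fair one shows it with probability $$1/6$$, the calibrated estimate after observing three 6's would be

$$P(X\|Y) = \frac{P(Y\|X)P(X)}{P(Y\|X)P(X) + P(Y\|\neg~X)P(\neg~X)} = \frac{0.5^3 \times 0.1}{0.5^3 \times 0.1 + (1/6)^3 \times 0.9} \approx 0.75,$$

a large jump from the prior of $$0.1$$.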
