---
layout: post
title: "Training Sparse Neural Networks with L0 Regularisation"
date: 2023-04-06
mathjax: true
status: [Code samples, Instructional]
tldr: Explores L0 norm regularisation for training sparse neural networks, where weights are encouraged to be entirely 0. It discusses overcoming non-differentiability issues by using a soft form of counting and reparameterisation tricks. The post also delves into Concrete distributions and introduces a method to make the continuous distribution more suitable for regularisation purposes.
categories: [Compression, Machine Learning]
---


$L_0$ norm regularisation[^fn1] is a pretty fascinating technique for neural network pruning, or for training sparse networks where weights are encouraged to be exactly 0. It is easy to implement in only a few lines of code (see below), but getting there conceptually is not so easy.


Several ML tricks are needed to achieve gradient flow through the network, because the default $L_0$ regularisation loss is non-differentiable for *evolving* reasons: while solving one problem we introduce another, and the loss function "evolves".

The form of the loss is $\mathcal{L}(f(x; \tilde{\theta} \odot z), y) + \mathcal{L}\_{\mathrm{reg}}$, where $z$ is a discrete binary mask on the parameters $\tilde{\theta}$. We write $\tilde{\theta}$ because the parameters we ultimately care about are not $\tilde{\theta}$ exactly, but $\theta = \tilde{\theta} \odot z$.


The final solution involves sampling from a *hard-concrete* distribution,
which is obtained by stretching a *binary-concrete* distribution and then transforming the
samples with a *hard-sigmoid*.

<br>


### Preliminaries

#### <u>$L_p$ regularisation</u>

Regularisation adds a term $\mathcal{L}_{\mathrm{reg}}$ to the loss function which penalises the complexity of the solution (the $\theta$ weights), typically to avoid overfitting and reduce generalisation error. The regularised maximum likelihood estimate of the model parameters $\theta$ is given by

$$
\hat{\theta}_{\mathrm{MLE}} = \mathrm{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^N \mathcal{L}(f(x_i; \theta), y_i) + \lambda \mathcal{L}_{\mathrm{reg}}
$$

$L_p$ regularisation is a penalty based on the $p$-norm of the $\theta$
vector, $\mathcal{L}_{\mathrm{reg}} = \mid \mid \theta \mid \mid_p$, where $\mid \mid \theta \mid \mid_p = (\mid\theta_1 \mid^p + \mid\theta_2 \mid^p + \cdots)^{\frac{1}{p}}$. $L_1$ and $L_2$ regularisation are typically used in gradient-based methods, but $L_0$ regularisation involves counting the non-zero weights, and is non-differentiable.

Note: the squared $L_2$ norm (as used in weight decay) is continuously differentiable, but the $L_1$ norm is not continuously differentiable (at $\theta_j=0$).
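
As a quick illustration, here is a minimal PyTorch sketch of these penalties (the tensor and variable names are my own, not from the post's code):

{% highlight python %}
import torch

theta = torch.randn(5, requires_grad=True)

l1 = theta.abs().sum()       # L1: differentiable everywhere except theta_j = 0
l2 = theta.pow(2).sum()      # squared L2 (weight decay form): smooth everywhere
l0 = (theta != 0).sum()      # L0: an integer count -- no useful gradient
{% endhighlight %}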


#### <u>Reparameterisation Trick</u>

The reparameterisation trick is used when we want to sample from a distribution and learn the parameters of that distribution. The "trick" is to reparameterise the distribution such that a sample has
a deterministic (differentiable) component and a noise (non-differentiable) component.[^fn2] This means
re-expressing the sampling function as dependent on trainable parameters and some independent
noise.

For example, a sample from $\mathcal{N}(\mu, \sigma^2)$ can be obtained by sampling $u$ from the standard normal distribution, $u \sim \mathcal{N}(0, 1)$, and then transforming it
using $\mu + \sigma u$. This reparameterisation reduces the problem of
estimating gradients w.r.t. the parameters of a distribution to estimating gradients w.r.t. the parameters
of a deterministic function.
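
A minimal sketch of the Gaussian case in PyTorch (variable names are mine):

{% highlight python %}
import torch

mu = torch.zeros(3, requires_grad=True)         # trainable location
log_sigma = torch.zeros(3, requires_grad=True)  # trainable log-scale

u = torch.randn(3)                  # independent noise, u ~ N(0, 1)
sample = mu + log_sigma.exp() * u   # deterministic, differentiable transform

sample.sum().backward()             # gradients reach mu and log_sigma
{% endhighlight %}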


#### <u>Concrete Distributions</u>

The class of "Concrete" distributions was invented to enable **discrete** distributions to use
the **reparameterisation trick**, by approximating discrete distributions with continuous
ones.[^fn3] The high-level strategy is: first, relax the state of a discrete variable into a probability vector by adding noise; second, use a softmax (or logistic, in the binary case)
instead of an argmax over the probabilities. Sampling from the Concrete distribution
then amounts to taking the softmax of the logits, perturbed by additive noise from a fixed distribution (Gumbel noise).

*Note: Don't overthink the semantics of "Concrete"; it's just a (in my opinion poor) name that stands for a "CONtinuous relaxation of disCRETE random variables".*
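
A minimal sketch of sampling from a (categorical) Concrete distribution, often called Gumbel-softmax; the 4-state setup and names are my own, and PyTorch also ships `torch.nn.functional.gumbel_softmax`, which does essentially this:

{% highlight python %}
import torch
import torch.nn.functional as F

logits = torch.zeros(4, requires_grad=True)  # unnormalised log-probs of 4 discrete states
beta = 0.5                                   # temperature: lower => closer to one-hot

u = torch.rand(4).clamp(1e-9, 1.0)           # avoid log(0)
gumbel = -torch.log(-torch.log(u))           # Gumbel(0, 1) noise
y = F.softmax((logits + gumbel) / beta, dim=-1)  # "soft" one-hot sample, differentiable
{% endhighlight %}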

<br><br>
### Method

> **Problem:** $L_0$ Regularisation Cost is Non-differentiable \\
> **Solution:** Use the *probability*, rather than the count, of the weights being non-zero

Writing out $L_0$ regularisation, the maximum likelihood estimate is given by

$$
\hat{\theta} = \mathrm{argmin}_{\theta} \frac{1}{N}(\sum_{i=1}^N \mathcal{L}(f(x_i; \theta), y_i)) + \lambda \mid \mid \theta \mid \mid_0
\tag{eq 1}\label{eq:1}
$$


where $\mid \mid \theta \mid \mid_0 = \sum_{j=1}^{\mid \theta \mid} \mathbb{1} [\theta_j \neq 0]$. This loss is non-differentiable because the
counting of parameters is non-differentiable.

To work around this, a soft form of counting is required, i.e., the *probability* of the
weights being non-zero. We thus consider $\theta = \tilde{\theta} \odot z$, where $\odot$ is
element-wise multiplication. The variable $z \sim \mathrm{Bernoulli}(\pi)$ can be
viewed as a set of $\\{ 0,1 \\}$ gates, which determine whether each parameter $\theta_j$ is effectively present
or absent. The probability of $z$ being 0 or 1 is controlled by the parameter $\pi$, which we therefore need to learn.

$$
\pi^* = \mathrm{argmin}_{\pi} \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} \frac{1}{N} \sum_{i=1}^N \mathcal{L}
(f(x_i; \tilde{\theta} \odot z), y_i) + \lambda \sum_{j=1}^{\mid \theta \mid} \pi_j
\tag{eq 2}\label{eq:2}
$$


The regularisation cost is now differentiable because instead of the raw count of non-zero $\theta$ in
\eqref{eq:1}, we are
summing the probabilities $\pi_j$ of the gates $z_j$ being non-zero, and thus of the parameters
$\theta=\tilde{\theta} \odot z$ being non-zero. Each $\pi_j$ is the parameter of the Bernoulli
distribution that corresponds to one binary gate.

At this point, we have solved the problem of parameter counting, but still cannot use gradient-based optimisation for $\pi$ because the $z$ we introduced is a discrete stochastic random variable.
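
To make this concrete, a tiny sketch (parameterising $\pi$ through logits is my assumption):

{% highlight python %}
import torch

logits = torch.zeros(10, requires_grad=True)  # one logit per gate (assumed parameterisation)
pi = torch.sigmoid(logits)                    # pi_j = P(z_j = 1)
reg_cost = pi.sum()                           # E[ ||theta||_0 ] = sum_j pi_j
reg_cost.backward()                           # differentiable: gradients flow into the logits

# ...but the data term still needs samples of z, and sampling
# is not differentiable w.r.t. pi -- this is the next problem.
z = torch.bernoulli(pi.detach())              # gradient stops here
{% endhighlight %}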

<br>

> **Problem 2:** The gated parameters $\tilde{\theta}\odot z$ are non-differentiable because the masks $z \in \\{0, 1\\}$ are i) discrete, ii) stochastic \\
> **Solution 2i:** Sample the random variables from a Binary [Concrete Distribution](#concrete-distributions) \\
> **Solution 2ii:** Apply the [Reparameterisation Trick](#reparameterisation-trick)


We have solved the first problem, the regularisation term $\mathcal{L}_{\mathrm{reg}}$ being
non-differentiable, by reformulating $\mid \mid \theta \mid \mid_0 \rightarrow \sum_{j=1}^{\mid \theta \mid} \pi_j$. But
in doing so, we rewrote the term $f(x; \theta) \rightarrow f(x; \tilde{\theta}\odot z)$. Since
$z$ is stochastic, gradient does not flow, and we would like to employ the [reparameterisation
trick](#reparameterisation-trick). However, we cannot reparameterise the discrete distribution directly, due to the
discontinuous nature of discrete states. Therefore, we first approximate the Bernoulli
with a Binary [Concrete distribution](#concrete-distributions).

Next we apply the reparameterisation trick to the Binary Concrete distribution, resulting in a learnable parameter ($\log \alpha$) plus some logistic-distributed noise (the difference of two Gumbel variables). The noise takes the form $\log u - \log(1-u)$, where $u \sim \mathrm{Uniform}(0,1)$.

Let $s$ be a random variable on the $(0, 1)$ interval, sampled from a Binary Concrete
distribution. After applying the reparameterisation trick (details in Louizos et al., 2017), we can sample

$$s = \mathrm{Sigmoid}((\log u - \log (1-u) + \log \alpha) / \beta)$$

where $u \sim \mathrm{Uniform}(0, 1)$. Here $\log\alpha$ is the location parameter and
$\beta$ is the temperature, which controls the degree of approximation: as $\beta
\rightarrow 0$ we recover the original Bernoulli r.v. (but lose the differentiable properties). $\alpha$
and $\beta$ are now trainable parameters, while the stochasticity comes from $u \sim U(0, 1)$.

<br>
> **Problem:** The continuous distribution places no probability mass at exactly 0 and 1, so the gates are never fully closed or open. \\
> **Solution:** "Stretch" the distribution beyond $(0,1)$, then "fold" the samples back.

We can "stretch" the samples from the distribution to the $(\gamma, \zeta)$ interval, where $\gamma
<0$ and $\zeta>1$: $\tilde{s} = s(\zeta - \gamma) + \gamma$. We then apply a *hard-sigmoid* to
fold the samples back to the interval $[0, 1]$: $z=\mathrm{min}(1, \mathrm{max}(0, \tilde{s}))$.


{% highlight python %}
def sample_z(self):
    """Sample hard-concrete gates z in [0, 1]."""
    if self.training:
        # Sample s from the binary concrete distribution,
        # clamping u away from 0 and 1 to avoid log(0).
        u = torch.rand(self.num_heads, device=self.log_alpha.device)
        u = u.clamp(1e-6, 1.0 - 1e-6)
        s_ = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
    else:
        # Test time: use the distribution's location, without noise.
        s_ = torch.sigmoid(self.log_alpha)

    # Stretch the values to (gamma, zeta), then fold them back to [0, 1].
    s_ = s_ * (self.zeta - self.gamma) + self.gamma
    z = torch.clip(s_, min=0, max=1)
    return z
{% endhighlight %}

<br>

> **Problem:** $z$ is no longer drawn from a Bernoulli, so what should the new regularisation term be? \\
> **Solution:** Compute the probability of $z$ being non-zero under the CDF of $s$.


Recall that the regularisation term $\mathcal{L}_{\mathrm{reg}}$ has evolved from the number of non-zero parameters
\eqref{eq:1} to the probability of the gates being active under a Bernoulli distribution \eqref{eq:2}.


We still want to penalise the probability of a gate being non-zero, but since we now have a continuous distribution instead
of a discrete Bernoulli, we need its cumulative distribution function (CDF) $Q(s \mid \alpha,
\beta)$: the gate is non-zero whenever the stretched sample is positive, which happens with probability $1 - Q(s \leq 0 \mid \alpha, \beta)$.

$$
\alpha^*, \beta^* = \mathrm{argmin}_{\alpha, \beta} \mathbb{E}_{u \sim U(0,1)} \frac{1}{N} \sum_{i=1}^N \mathcal{L}
(f(x_i; \tilde{\theta} \odot z), y_i) + \lambda \sum_{j=1}^{\mid \theta \mid} (1-Q(s_j \leq 0
\mid \alpha_j, \beta_j))
\tag{eq 3}\label{eq:3}
$$


The regularisation cost works out to be

$$
\sum_{j=1}^{\mid \theta \mid}(1-Q_{s_j}(0 \mid \alpha_j, \beta)) = \sum_{j=1}^{\mid \theta \mid} \mathrm{sigmoid}(\log \alpha_j - \beta\times \log\frac{-\gamma}{\zeta})
$$

{% highlight python %}
# Precomputed once, e.g. in __init__:
self.log_ratio_ = math.log(-self.gamma / self.zeta)

def get_reg_cost(self):
    # P(gate != 0) = sigmoid(log_alpha - beta * log(-gamma / zeta)), summed over gates
    return torch.sigmoid(self.log_alpha - self.beta * self.log_ratio_).sum()
{% endhighlight %}
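
Putting the pieces together, a hedged sketch of a training step; the `model` wiring, the gating interface, and the value of `lam` are my assumptions, not from the paper:

{% highlight python %}
import torch.nn.functional as F

lam = 0.01                            # assumed regularisation strength
for x, y in loader:                   # `loader`, `model`, `optimizer` assumed to exist
    z = model.sample_z()              # stochastic hard-concrete gates (training mode)
    logits = model(x, gate=z)         # gates mask the relevant parameters / heads
    loss = F.cross_entropy(logits, y) + lam * model.get_reg_cost()
    optimizer.zero_grad()
    loss.backward()                   # gradients reach log_alpha through both terms
    optimizer.step()
{% endhighlight %}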


<br>

#### Concluding Notes (mostly for implementation)

1. When someone writes "Hard Concrete", they mean hard-sigmoid clamping on a continuous relaxation of a Bernoulli (Concrete) distribution.

2. $\alpha$ and $\beta$ are the parameters that we need to train.

3. Start with the gates initialised near 1, not 0 or 0.5. I find that this is the only
   initialisation where the gates can be trained to a reasonable value.

4. Disable early stopping callbacks, or increase the patience level for early stopping.
   Unlike training a model from scratch, where we expect the performance to continuously
increase, here we expect the performance to drop rather than increase; as long as it doesn't drop too
far we're happy.

5. Consider scaling the $L_0$ regularisation loss to be in a similar range as the task objective,
   e.g., normalise by batch size and total number of heads (see the sketch below).

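One possible normalisation for point 5, purely as an assumption on my part:

{% highlight python %}
# Bring the L0 cost into the same ballpark as a per-example task loss
# (batch_size and num_heads are whatever your setup defines).
reg_cost = model.get_reg_cost() / (batch_size * num_heads)
loss = task_loss + lam * reg_cost
{% endhighlight %}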

<br>

#### **References**
[^fn1]: Louizos, Welling and Kingma. (2017). [Learning Sparse Neural Networks Through L0 Regularization](https://arxiv.org/pdf/1712.01312.pdf)
[^fn2]: Kingma and Welling. (2013). [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf). Note: the reparameterisation trick was popularised in ML by this paper, but not invented by these authors.
[^fn3]: Maddison, Mnih and Teh. (2016). [The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables](https://arxiv.org/pdf/1611.00712.pdf)