Embedding Arithmetic with CLIP
💾 Download this Article as a Jupyter Notebook
The company I work for heavily specializes in self-supervised learning for computer vision, which is why I regularly get to work with embedding models. For those unaware, an embedding model is a neural network that maps a sample (e.g. an image or a piece of text) to a high-dimensional vector (I mean really high-dimensional, usually between 512 and 2048 dimensions). While we usually can't make sense of these vectors in isolation, we can still draw some very nice conclusions when we compare them to each other or apply operations to them. In this article, I would like to demonstrate some operations that can help you make sense of embedding spaces, and hopefully build intuition about how neural networks work. It's also really fun to see that your linear algebra classes from school were not in vain. ...
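To make "comparing embeddings" a bit more tangible, here is a minimal sketch of obtaining CLIP embeddings and measuring their similarity, assuming the Hugging Face transformers implementation of CLIP; the model name, image path and captions are placeholders, not taken from the article.

```python
# Minimal sketch: CLIP embeddings + cosine similarity (placeholder inputs).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local image (placeholder path)
texts = ["a photo of a cat", "a photo of a dog"]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=texts, return_tensors="pt", padding=True))

# Normalize so that the dot product becomes cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

print(img_emb @ txt_emb.T)      # similarity of the image to each caption
print(txt_emb[0] - txt_emb[1])  # a "difference direction" in embedding space
```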
Distributed Training in PyTorch – I
This blog post is the first in a series in which I would like to showcase several methods for training your machine learning model with PyTorch on several GPUs/CPUs across multiple nodes in a cluster. In this first part we will look a bit under the hood of how you can launch distributed jobs and how you can ensure proper communication between them. In follow-ups we will create our own PyTorch DistributedDataParallel (DDP) class, and we will also look at popular frameworks such as PyTorch Lightning and resource schedulers like SLURM that can help you get your distributed training running. We will strictly focus on data parallelism, meaning the whole model fits into the memory of a single GPU and we exchange gradients (and potentially batch-norm statistics) across the GPUs, while keeping the whole optimization local on each GPU. ...
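As a rough preview of what "launching distributed jobs and ensuring communication" boils down to, here is a minimal sketch in plain torch.distributed; it is not the series' own code, and it assumes the script is started with torchrun so that rank, world size and rendezvous address are provided via environment variables.

```python
# Minimal sketch: gradient averaging with torch.distributed (not the series' code).
# Launch e.g. with: torchrun --nproc_per_node=2 demo.py   (demo.py is a placeholder name)
import torch
import torch.distributed as dist

def main():
    # "gloo" works on CPU-only machines; use "nccl" for GPU training.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each process holds its own local gradient; data parallelism averages
    # these across processes after every backward pass.
    local_grad = torch.full((3,), float(rank))
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world_size

    print(f"rank {rank}/{world_size}: averaged gradient {local_grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Running it with two processes, each rank prints the same averaged tensor, which is exactly the property we need for keeping the optimizers on all GPUs in sync.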
MNIST, but in Julia!
💾 Download this Article as a Jupyter Notebook
Importing the Relevant Libraries
Make sure to remove CUDA if you don't have an Nvidia GPU.
using Plots, Flux, LinearAlgebra, CUDA, ProgressMeter
using Flux: @functor
using Statistics: mean
using MLDatasets: MNIST
Defining the Neural Network
Creating a Module Instance
Julia does not offer class-based object orientation like Python does, so if you're coming from PyTorch, things look quite a bit different – but also not completely different. The fundamental paradigm of Julia's design is multiple dispatch, so for pretty much everything we rely heavily on function overloading. If those two terms are foreign to you, I recommend prompting your favorite LLM or search engine about them before continuing. ...
Diffusion Models
Preliminaries
All machine learning problems that I can think of can be formulated as learning a mapping $$ \begin{equation} f_\theta: \Omega \rightarrow \Lambda \end{equation} $$ from the sample space $\Omega$ to the label space $\Lambda$. An example is supervised learning, where a problem could be learning $f_\theta(x)\approx p(c|x)$ – the probability of sample $x$ belonging to class $c$. When we do this with neural networks, we usually use several layers, say $N$ of them, $p(c|x)\approx f_{\theta_N}\circ\dots\circ f_{\theta_1}(x)$, which we can see as discretizing the problem, i.e. solving it in $N$ steps instead of a single one. ...
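For concreteness, the composition $f_{\theta_N}\circ\dots\circ f_{\theta_1}$ can be written down as a stack of layers; the sketch below is a toy PyTorch illustration with arbitrary layer sizes, not code from the article.

```python
# Toy sketch of f_N ∘ ... ∘ f_1: N stacked layers approximating p(c|x) in N "steps".
import torch
import torch.nn as nn

N, d_in, d_hidden, n_classes = 4, 784, 256, 10  # arbitrary placeholder sizes

layers = [nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())]
layers += [nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU()) for _ in range(N - 2)]
layers += [nn.Linear(d_hidden, n_classes)]

f = nn.Sequential(*layers)            # the composition f_N ∘ ... ∘ f_1
x = torch.randn(1, d_in)
p_c_given_x = f(x).softmax(dim=-1)    # approximate class probabilities p(c|x)
```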
Hydra and WandB for Machine Learning Experiments
Introduction
Hydra 1 and WandB 2 have become indispensable tools for me when tracking my machine learning experiments. In this post I would like to share how I combine these two tools for maximum reproducibility, debuggability and flexibility in experiment scheduling. This post is very much a personal knowledge resource for me, therefore I will try to keep it up-to-date when my workflow changes. I want to cover the following things:
- build a sensible config hierarchy that never requires you to change multiple files
- use common project names and run names across WandB and Hydra
- debug your code without excessive logging from WandB and Hydra
WandB
At the time of initially writing this post (June 2024) I had been using WandB for about a year, and while its feature set is massive, I use it almost exclusively for logging during training, thinking of it mostly as a TensorBoard on steroids. Especially the automatic logging of hardware usage has significantly improved my ability to squeeze every last FLOP out of my hardware. ...
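The sketch below shows one way these pieces can fit together; it is a minimal illustration rather than the post's actual setup, and the config fields (project, run_name, wandb.mode, steps) are assumptions that would live in conf/config.yaml.

```python
# Minimal sketch: Hydra config driving a WandB run (assumed config fields).
import hydra
import wandb
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Reuse the same project/run names in WandB that the Hydra config defines,
    # and switch WandB off entirely while debugging via mode="disabled".
    wandb.init(
        project=cfg.project,
        name=cfg.run_name,
        mode=cfg.wandb.mode,  # "online", "offline" or "disabled"
        config=OmegaConf.to_container(cfg, resolve=True),
    )
    for step in range(cfg.steps):
        wandb.log({"loss": 1.0 / (step + 1)}, step=step)
    wandb.finish()

if __name__ == "__main__":
    main()
```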
Diffusion Models for Linear Inverse Problems
In the fall of 2023 I worked on inverse problems in medical imaging with diffusion models (download the thesis PDF here). This was part of a semester thesis at ETH Zurich, and the problem at hand was the reconstruction of MRI (magnetic resonance imaging) acquisitions from sparse k-space measurements. For those unfamiliar with how MRI works: imagine a slice through the human body – a pixelized image – on which you now apply the 2D DFT (discrete Fourier transform) over the spatial frequencies $k$. This Fourier representation of the image is exactly what MRI acquires, and we usually call it $k$-space, in reference to those spatial frequencies. Acquisition protocols in MRI usually sample vertical (or horizontal) lines in this $k$-space sequentially, and sampling only a sparse subset of those lines has the potential to significantly speed up acquisitions. ...
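To illustrate the $k$-space picture, here is a small NumPy sketch (not code from the thesis) that transforms an image into $k$-space, keeps only every fourth vertical line, and reconstructs by zero-filling the rest.

```python
# Small illustration of undersampled k-space (not the thesis code).
import numpy as np

image = np.random.rand(128, 128)               # stand-in for one MRI slice

k_space = np.fft.fftshift(np.fft.fft2(image))  # full k-space of the image

mask = np.zeros_like(k_space, dtype=bool)
mask[:, ::4] = True                            # sample only every 4th vertical line

undersampled = np.where(mask, k_space, 0)      # zero out the unsampled lines
zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))

print("fraction of k-space sampled:", mask.mean())
print("zero-filled reconstruction shape:", zero_filled.shape)
```

The zero-filled inverse DFT is the naive reconstruction; the thesis is about doing better than this with a diffusion prior.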
Physics-Informed Neural Networks - A Basic Example
💾 Download this Article as a Jupyter Notebook
Problem Statement
We would like to solve a boundary value problem (BVP) for the 1D heat equation $$ u_t(t,x) = u_{xx}(t, x), \quad t\in [0,6], \quad x\in [-1,1] $$ where the subscripts denote the first- and second-order partial derivatives. Together with this partial differential equation (PDE), we have boundary and initial conditions: $$ u(t,-1) = u(t,1) = 0\\ u(0,x) = -\sin(\pi x) $$ Our goal is now to approximate the function $u(t,x)$ by a neural network, $u(t,x)\approx NN_\theta(t,x)$, which we can do by using auto-differentiation and a suitable optimization target that ensures that the PDE, initial condition and boundary conditions are fulfilled: ...
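As a hedged sketch of such an optimization target, the snippet below builds the PDE residual $u_t - u_{xx}$ with torch.autograd and adds penalty terms for the boundary and initial conditions; the network architecture and sample counts are arbitrary choices, not the article's.

```python
# Sketch of a PINN loss for the 1D heat equation (arbitrary architecture/samples).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

def pde_residual(t, x):
    # u_t - u_xx should vanish wherever the heat equation holds.
    t = t.requires_grad_(True)
    x = x.requires_grad_(True)
    u = net(torch.cat([t, x], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - u_xx

# Random collocation points in the interior, on the boundary, and at t = 0.
t_c, x_c = 6 * torch.rand(256, 1), 2 * torch.rand(256, 1) - 1
t_b = 6 * torch.rand(64, 1)
x_b = torch.where(torch.rand(64, 1) < 0.5, torch.tensor(-1.0), torch.tensor(1.0))
x_0 = 2 * torch.rand(64, 1) - 1

loss = (
    (pde_residual(t_c, x_c) ** 2).mean()                           # PDE residual
    + (net(torch.cat([t_b, x_b], dim=1)) ** 2).mean()              # u(t, ±1) = 0
    + ((net(torch.cat([torch.zeros_like(x_0), x_0], dim=1))
        + torch.sin(torch.pi * x_0)) ** 2).mean()                  # u(0, x) = -sin(pi x)
)
loss.backward()
```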
Statistical Inference 2 - Motivating the commonly used Loss Functions
Warning ...
Statistical Inference 1 - MLE & MAP
Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) estimation are fundamental concepts in statistical inference, and understanding these two is key to understanding the motivation behind the most frequently used loss functions like cross-entropy loss, mean-squared error (MSE or L2 loss) and mean absolute error (L1 loss), which will be the topic of a later post.
Introduction
Assuming a probability distribution $p: \bm{\Omega} \rightarrow \mathbb{R},\ p(\bm{x})$, independent samples $\bm{x}^{(1)}, \dots , \bm{x}^{(N)}$ and a set of parameters $\bm{\theta}$ on the domain $\bm{\Theta}$, we would like to fit a parameterized distribution $p_{\bm{\theta}}(\bm{x})$ to the original data distribution $p(\bm{x})$, possibly even in a way such that we can generate new samples $\bm{x}^{(N+i)}\sim p_{\bm{\theta}}(\bm{x}),\ i>0$ that look as if they came from $p(\bm{x})$. ...
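To anchor the two estimators the post compares, they can be written in the notation above as (standard definitions, not quoted from the post):

$$ \hat{\bm{\theta}}_{\mathrm{MLE}} = \arg\max_{\bm{\theta}\in\bm{\Theta}} \sum_{i=1}^{N} \log p_{\bm{\theta}}\big(\bm{x}^{(i)}\big), \qquad \hat{\bm{\theta}}_{\mathrm{MAP}} = \arg\max_{\bm{\theta}\in\bm{\Theta}} \left[ \log p(\bm{\theta}) + \sum_{i=1}^{N} \log p_{\bm{\theta}}\big(\bm{x}^{(i)}\big) \right] $$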