Distributed Training in PyTorch - II
In Part I of this series, we saw how we could launch different processes with mp.spawn or through multiple terminal windows, group processes into process groups for communication and synchronization, and implement a distributed training loop by manually synchronizing gradients or by hooking the synchronization into the backward pass. In this part, we will follow up on this and implement our own simplified version of DistributedDataParallel (DDP) along with other helpers for distributed training – particularly distributed sampling of data. Initially I meant to also include a custom SyncBatchNorm implementation, but this requires knowledge of how to extend PyTorch’s Autograd engine, which I would first like to cover separately. The helpers and our DDP implementation will allow us to better structure our code and abstract away some of the boilerplate we had to write inside the training loop. We will then verify the correctness of our implementation by training a ResNet18 on the FashionMNIST dataset and comparing the results to training it on a single GPU. ...
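As a minimal sketch of the “manually synchronizing gradients” step from Part I (the helper below is illustrative, not the actual code from the posts), averaging gradients across ranks with torch.distributed looks roughly like this:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks after loss.backward().

    This mirrors what DistributedDataParallel does automatically,
    minus gradient bucketing and overlapping communication with
    the backward pass.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient over all processes, then divide by the
            # number of processes to obtain the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```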
Embedding Arithmetic with CLIP
⬇️ Download this Article as a Jupyter Notebook

The company I work for heavily specializes in self-supervised learning for computer vision, which is why I regularly get to work with embedding models. For those unaware, an embedding model is a neural network that maps a sample (e.g. an image or a piece of text) to a high-dimensional vector (I mean really high-dimensional, usually between 512 and 2048 dimensions). While we usually can’t make sense of these vectors when they appear in isolation, we can still draw some very nice conclusions when we compare them to each other or apply operations to them. In this article, I would like to demonstrate some operations that can help you make sense of embedding spaces, and that can hopefully build intuition about how neural networks work. It’s also really fun to see that your linear algebra classes from school were not in vain. ...
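To give a flavour of what “comparing embeddings and applying operations to them” means, here is a small, self-contained sketch; the vectors are random placeholders rather than actual CLIP outputs:

```python
import torch
import torch.nn.functional as F

# Placeholder vectors standing in for embeddings produced by a model such as
# CLIP; real embeddings would encode the content of an image or a text prompt.
emb_dim = 512
image_emb = F.normalize(torch.randn(emb_dim), dim=0)
text_emb = F.normalize(torch.randn(emb_dim), dim=0)

# Cosine similarity is the standard way to compare two embeddings: with real
# CLIP embeddings, a matching image/text pair scores noticeably higher than an
# unrelated pair (the random vectors here will land near zero).
similarity = F.cosine_similarity(image_emb, text_emb, dim=0)
print(similarity.item())

# "Embedding arithmetic": adding or subtracting embeddings gives a new point
# in the space that can itself be compared against other embeddings.
combined = F.normalize(image_emb + text_emb, dim=0)
```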
Distributed Training in PyTorch – I
This blog post is the first in a series in which I would like to showcase several methods for training your machine learning model with PyTorch on several GPUs/CPUs across multiple nodes in a cluster. In this first part we will look a bit under the hood at how you can launch distributed jobs and how you can ensure proper communication between them. In follow-ups we will create our own PyTorch DistributedDataParallel (DDP) class, and we will also look at popular frameworks such as PyTorch Lightning and resource schedulers like SLURM that can help you get your distributed training running. We will strictly focus on data parallelism, i.e. a setting where the whole model fits into the memory of a single GPU and we exchange gradients (and potentially batch norm statistics) across the GPUs, while keeping the whole optimization local on each GPU. ...
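For a concrete picture of the launching-and-communication pattern this series starts from, here is a minimal, self-contained sketch (backend, address and port are illustrative choices, not the posts’ exact setup):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Every process joins the same process group; the address/port below act
    # as the rendezvous point for all ranks.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # A toy communication step: every rank contributes its rank id and all
    # ranks end up with the sum.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```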
MNIST, but in Julia!
⬇️ Download this Article as a Jupyter Notebook

Importing the Relevant Libraries

Make sure to remove CUDA if you don’t have an Nvidia GPU.

using Plots, Flux, LinearAlgebra, CUDA, ProgressMeter
using Flux: @functor
using Statistics: mean
using MLDatasets: MNIST

Defining the Neural Network

Creating a Module Instance

Julia does not offer class-based object orientation like Python does, so if you’re coming from PyTorch, things look quite a bit different – but also not completely different. The fundamental paradigm of Julia’s design is multiple dispatch, so we rely heavily on function overloading pretty much everywhere. If those two terms are foreign to you, I recommend prompting your favorite LLM or search engine about them before continuing. ...
Hydra and WandB for Machine Learning Experiments
Introduction

Hydra 1 and WandB 2 have become indispensable tools for me when tracking my machine learning experiments. In this post I would like to share how I combine these two tools for maximum reproducibility, debuggability and flexibility in experiment scheduling. This post is very much a personal knowledge resource for me, so I will try to keep it up-to-date as my workflow changes. I want to cover the following things: building a sensible config hierarchy that never requires you to change multiple files, using common project names and run names across WandB and Hydra, and debugging your code without excessive logging from WandB and Hydra.

WandB

At the time of initially writing this post (June 2024) I had been using WandB for about a year, and while its feature set is massive, I use it almost exclusively for logging during training, thinking of it mostly as a TensorBoard on steroids. The automatic logging of hardware usage in particular has significantly improved my ability to squeeze every last FLOP out of my hardware. ...
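The glue between the two tools boils down to something like the following sketch; the config fields project and run_name and the conf/ layout are placeholders, not the actual configuration used in the post:

```python
import hydra
import wandb
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Reuse the names Hydra resolved from the config for the WandB run, so
    # both tools agree on how the experiment is identified; logging the full
    # resolved config to WandB keeps the run reproducible.
    wandb.init(
        project=cfg.project,        # placeholder config field
        name=cfg.run_name,          # placeholder config field
        config=OmegaConf.to_container(cfg, resolve=True),
    )
    # ... training loop with wandb.log(...) calls ...
    wandb.finish()

if __name__ == "__main__":
    main()
```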
Physics-Informed Neural Networks - A Basic Example
⬇️ Download this Article as a Jupyter Notebook

Problem Statement

We would like to solve a boundary value problem (BVP) for the 1D heat equation $$ u_t(t,x) = u_{xx}(t, x), \quad t\in [0,6], \quad x\in [-1,1] $$ where the subscripts denote first- and second-order partial derivatives. Together with this partial differential equation (PDE), we have boundary and initial conditions. $$ u(t,-1) = u(t,1) = 0\\ u(0,x) = -\sin(\pi x) $$ Our goal is now to approximate the function $u(t,x)$ by a neural network, $u(t,x)\approx NN_\theta(t,x)$, which we can do by using auto-differentiation and a suitable optimization target that ensures that the PDE, initial condition and boundary conditions are fulfilled: ...
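As a sketch of such an optimization target (the exact collocation points and weighting are discussed in the full article), the loss combines mean-squared residuals of the PDE, the initial condition and the boundary conditions:

$$
\mathcal{L}(\theta) = \frac{1}{N_f}\sum_{i=1}^{N_f}\Big|\partial_t NN_\theta(t^{(i)},x^{(i)}) - \partial_{xx} NN_\theta(t^{(i)},x^{(i)})\Big|^2
+ \frac{1}{N_0}\sum_{j=1}^{N_0}\Big|NN_\theta(0,x^{(j)}) + \sin(\pi x^{(j)})\Big|^2
+ \frac{1}{N_b}\sum_{k=1}^{N_b}\Big|NN_\theta(t^{(k)},\pm 1)\Big|^2
$$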
Statistical Inference 1 - MLE & MAP
Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) estimation are fundamental concepts in statistical inference, and understanding them is key to understanding the motivation behind the most frequently used loss functions, such as cross-entropy loss, mean squared error (MSE or L2 loss) and mean absolute error (L1 loss), which will be the topic of a later post.

Introduction

Assuming a probability distribution $p: \bm{\Omega} \rightarrow \mathbb{R},\; p(\bm{x})$, independent samples $\bm{x}^{(1)}, \dots , \bm{x}^{(N)}$ and a set of parameters $\bm{\theta}$ from the domain $\bm{\Theta}$, we would like to fit a parameterized distribution $p_{\bm{\theta}}(\bm{x})$ to the original data distribution $p(\bm{x})$, possibly even in a way such that we can generate new samples $\bm{x}^{(N+i)}\sim p_{\bm{\theta}}(\bm{x}),\; i>0$ that look as if they came from $p(\bm{x})$. ...
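In the notation above, the two estimators the post builds towards can be written as follows (the prior $p(\bm{\theta})$ only enters the MAP case):

$$
\hat{\bm{\theta}}_{\text{MLE}} = \arg\max_{\bm{\theta}\in\bm{\Theta}} \sum_{i=1}^{N} \log p_{\bm{\theta}}\big(\bm{x}^{(i)}\big), \qquad
\hat{\bm{\theta}}_{\text{MAP}} = \arg\max_{\bm{\theta}\in\bm{\Theta}} \left(\sum_{i=1}^{N} \log p_{\bm{\theta}}\big(\bm{x}^{(i)}\big) + \log p(\bm{\theta})\right)
$$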