From 839a0c63d3c0f5ea65fc446f5b830edc92d73aea Mon Sep 17 00:00:00 2001 From: Lionel Date: Tue, 4 Feb 2025 09:47:55 +0100 Subject: [PATCH 1/7] update byol example --- docs/source/examples/byol.rst | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index ac044e692..0debfe467 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -3,10 +3,41 @@ BYOL ==== -Example implementation of the BYOL architecture. +BYOL (Bootstrap Your Own Latent) [0]_ is a self-supervised learning framework for visual +representation learning without negative samples. Unlike contrastive learning methods, +such as MoCo [1]_ and SimCLR [2]_ that compare positive and negative pairs, BYOL uses +two neural networks – "online" and a "target" networks – where the online network is +trained to predict the target’s representations of the same image under different +augmentations. The target's weights are updated as the exponential moving average +(EMA) of the online network, and the authors show that this is enough to prevent +collapse to trivial solutions. The authors particularly show that due to the absence +of negative samples, BYOL is less sensitive to the batch size during training and manages +to achieve state-of-the-art on several semi-supervised and transfer learning benchmarks. + +Key Components +-------------- + +- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely + random resized crop, random horizontal flip, color distortions, Gaussian blur and + solarization. The color distortiion consits of a random sequence of brightness, + constrast, saturation, hue adjustments and an optional grayscale conversion. However + the hyperparameters for the augmentations are different from SimCLR [2]_. +- **Backbone**: BYOL [0]_ uses ResNet-type convolutional backbones as the online and + target networks. 
They do not evaluate the performance of other architectures. +- **Projection & Prediction Head**: A projection head is used to map the output of the + backbone to a lower-dimensional space. The target network once again relies on an + EMA of the online network's projection head for the projection head. A notable + architectureal choice is the use of an additional prediction head, a secondary MLP + appended to only the online network's projection head. +- **Loss Function**: BYOL [0]_ uses a negative cosine similarity loss between the + normalized representations of the online's prediction output and the targe's + projection output. + Reference: - `Bootstrap your own latent: A new approach to self-supervised Learning, 2020 `_ + .. [0] `Bootstrap your own latent: A new approach to self-supervised Learning, 2020 `_ + .. [1] `Momentum Contrast for Unsupervised Visual Representation Learning, 2019 `_ + .. [2] `A Simple Framework for Contrastive Learning of Visual Representations, 2020 `_ .. tabs:: From 7f64e31d8b101837148d859e1b7596b1c9af5e7c Mon Sep 17 00:00:00 2001 From: Lionel Date: Tue, 4 Feb 2025 09:52:32 +0100 Subject: [PATCH 2/7] add good to know section --- docs/source/examples/byol.rst | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index 0debfe467..cbbf12dd1 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -17,21 +17,15 @@ to achieve state-of-the-art on several semi-supervised and transfer learning ben Key Components -------------- -- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely - random resized crop, random horizontal flip, color distortions, Gaussian blur and - solarization. The color distortiion consits of a random sequence of brightness, - constrast, saturation, hue adjustments and an optional grayscale conversion. 
However - the hyperparameters for the augmentations are different from SimCLR [2]_. -- **Backbone**: BYOL [0]_ uses ResNet-type convolutional backbones as the online and - target networks. They do not evaluate the performance of other architectures. -- **Projection & Prediction Head**: A projection head is used to map the output of the - backbone to a lower-dimensional space. The target network once again relies on an - EMA of the online network's projection head for the projection head. A notable - architectureal choice is the use of an additional prediction head, a secondary MLP - appended to only the online network's projection head. -- **Loss Function**: BYOL [0]_ uses a negative cosine similarity loss between the - normalized representations of the online's prediction output and the targe's - projection output. +- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely random resized crop, random horizontal flip, color distortions, Gaussian blur and solarization. The color distortiion consits of a random sequence of brightness, constrast, saturation, hue adjustments and an optional grayscale conversion. However the hyperparameters for the augmentations are different from SimCLR [2]_. +- **Backbone**: BYOL [0]_ uses ResNet-type convolutional backbones as the online and target networks. They do not evaluate the performance of other architectures. +- **Projection & Prediction Head**: A projection head is used to map the output of the backbone to a lower-dimensional space. The target network once again relies on an EMA of the online network's projection head for the projection head. A notable architectureal choice is the use of an additional prediction head, a secondary MLP appended to only the online network's projection head. +- **Loss Function**: BYOL [0]_ uses a negative cosine similarity loss between the normalized representations of the online's prediction output and the targe's projection output. 
+ +Good to Know +------------- + +- **Backbone Networks**: BYOL is specifically optimized for convolutional neural networks, with a focus on ResNet architectures. We do not recommend using it with transformer-based models. Reference: From ba1caef062485c7d6bcab63f32ba50882ed432bb Mon Sep 17 00:00:00 2001 From: Lionel Date: Tue, 4 Feb 2025 10:06:08 +0100 Subject: [PATCH 3/7] typos --- docs/source/examples/byol.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index cbbf12dd1..4ddbaddec 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -6,21 +6,22 @@ BYOL BYOL (Bootstrap Your Own Latent) [0]_ is a self-supervised learning framework for visual representation learning without negative samples. Unlike contrastive learning methods, such as MoCo [1]_ and SimCLR [2]_ that compare positive and negative pairs, BYOL uses -two neural networks – "online" and a "target" networks – where the online network is +two neural networks – "online" and "target" – where the online network is trained to predict the target’s representations of the same image under different -augmentations. The target's weights are updated as the exponential moving average +augmentations, yielding an iterative bootstrapping of the latent samples. +The target's weights are updated as the exponential moving average (EMA) of the online network, and the authors show that this is enough to prevent -collapse to trivial solutions. The authors particularly show that due to the absence +collapse to trivial solutions. The authors also show that due to the absence of negative samples, BYOL is less sensitive to the batch size during training and manages to achieve state-of-the-art on several semi-supervised and transfer learning benchmarks. 
Key Components -------------- -- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely random resized crop, random horizontal flip, color distortions, Gaussian blur and solarization. The color distortiion consits of a random sequence of brightness, constrast, saturation, hue adjustments and an optional grayscale conversion. However the hyperparameters for the augmentations are different from SimCLR [2]_. +- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely random resized crop, random horizontal flip, color distortions, Gaussian blur and solarization. The color distortion consists of a random sequence of brightness, contrast, saturation, hue adjustments and an optional grayscale conversion. However, the hyperparameters for the augmentations are different from SimCLR [2]_. - **Backbone**: BYOL [0]_ uses ResNet-type convolutional backbones as the online and target networks. They do not evaluate the performance of other architectures. -- **Projection & Prediction Head**: A projection head is used to map the output of the backbone to a lower-dimensional space. The target network once again relies on an EMA of the online network's projection head for the projection head. A notable architectureal choice is the use of an additional prediction head, a secondary MLP appended to only the online network's projection head. +- **Projection & Prediction Head**: A projection head is used to map the output of the backbone to a lower-dimensional space. For this, the target network once again relies on an EMA of the online network. A notable architectural choice is the use of an additional prediction head, a secondary MLP appended to only the online network's projection head. 
+- **Loss Function**: BYOL [0]_ uses a negative cosine similarity loss between the representations of the online network's prediction output and the target's projection output. Good to Know ------------- From 16dced6abf00d85e45e8a9ff222fe2028eeec6ea Mon Sep 17 00:00:00 2001 From: Lionel Peer Date: Tue, 4 Feb 2025 10:49:25 +0100 Subject: [PATCH 4/7] Update docs/source/examples/byol.rst Co-authored-by: stegmuel <36367013+stegmuel@users.noreply.github.com> --- docs/source/examples/byol.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index 4ddbaddec..89318981c 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -7,7 +7,7 @@ BYOL (Bootstrap Your Own Latent) [0]_ is a self-supervised learning framework fo representation learning without negative samples. Unlike contrastive learning methods, such as MoCo [1]_ and SimCLR [2]_ that compare positive and negative pairs, BYOL uses two neural networks – "online" and "target" – where the online network is -trained to predict the target’s representations of the same image under different +trained to predict the target’s representation of the same image under different augmentations, yielding an iterative bootstrapping of the latent samples. 
The target's weights are updated as the exponential moving average (EMA) of the online network, and the authors show that this is enough to prevent collapse to trivial solutions. The authors also show that due to the absence of negative samples, BYOL is less sensitive to the batch size during training and manages to achieve state-of-the-art on several semi-supervised and transfer learning benchmarks. From b2e83bcfcb486dab7872adf02f26893b8b00d290 Mon Sep 17 00:00:00 2001 From: Lionel Peer Date: Tue, 4 Feb 2025 10:50:53 +0100 Subject: [PATCH 5/7] Update docs/source/examples/byol.rst Co-authored-by: stegmuel <36367013+stegmuel@users.noreply.github.com> --- docs/source/examples/byol.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index 89318981c..80d049774 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -10,7 +10,7 @@ two neural networks – "online" and "target" – where the online network is trained to predict the target’s representation of the same image under different augmentations, yielding an iterative bootstrapping of the latent samples. The target's weights are updated as the exponential moving average -(EMA) of the online network, and the authors show that this is enough to prevent +(EMA) of the online network, and the authors show that this is sufficient to prevent collapse to trivial solutions. The authors also show that due to the absence of negative samples, BYOL is less sensitive to the batch size during training and manages to achieve state-of-the-art on several semi-supervised and transfer learning benchmarks. 
From a2a9e708e60ac20f63b1137aa2f3a4993610f9bb Mon Sep 17 00:00:00 2001 From: Lionel Peer Date: Tue, 4 Feb 2025 10:51:13 +0100 Subject: [PATCH 6/7] Update docs/source/examples/byol.rst Co-authored-by: stegmuel <36367013+stegmuel@users.noreply.github.com> --- docs/source/examples/byol.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index 80d049774..13dbfbba3 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -13,7 +13,7 @@ The target's weights are updated as the exponential moving average (EMA) of the online network, and the authors show that this is sufficient to prevent collapse to trivial solutions. The authors also show that due to the absence of negative samples, BYOL is less sensitive to the batch size during training and manages -to achieve state-of-the-art on several semi-supervised and transfer learning benchmarks. +to achieve state-of-the-art performance on several semi-supervised and transfer learning benchmarks. Key Components -------------- From 0a2f0cc67d8586affa7349fabfacc98aedbdc2be Mon Sep 17 00:00:00 2001 From: Lionel Date: Tue, 4 Feb 2025 11:00:19 +0100 Subject: [PATCH 7/7] add DINO hint --- docs/source/examples/byol.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/examples/byol.rst b/docs/source/examples/byol.rst index 13dbfbba3..8059d39aa 100644 --- a/docs/source/examples/byol.rst +++ b/docs/source/examples/byol.rst @@ -26,13 +26,14 @@ Key Components Good to Know ------------- -- **Backbone Networks**: BYOL is specifically optimized for convolutional neural networks, with a focus on ResNet architectures. We do not recommend using it with transformer-based models. +- **Backbone Networks**: BYOL is specifically optimized for convolutional neural networks, with a focus on ResNet architectures. We do not recommend using it with transformer-based models and instead suggest using :doc:`DINO ` [3]_. 
Reference: .. [0] `Bootstrap your own latent: A new approach to self-supervised Learning, 2020 `_ .. [1] `Momentum Contrast for Unsupervised Visual Representation Learning, 2019 `_ .. [2] `A Simple Framework for Contrastive Learning of Visual Representations, 2020 `_ + .. [3] `Emerging Properties in Self-Supervised Vision Transformers, 2021 `_ .. tabs::
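The two BYOL-specific mechanics described under Key Components, the EMA target update and the negative cosine similarity loss, can be sketched as follows. This is a minimal NumPy illustration with illustrative names (`byol_loss`, `ema_update`), not the library's implementation; `tau=0.996` is the base decay rate reported in the paper [0]_, and single vectors stand in for batched network outputs.

```python
import numpy as np

def byol_loss(p_online, z_target):
    # Negative cosine similarity between the online network's prediction
    # p_online and the target network's projection z_target. Both inputs
    # are L2-normalized first, so the result lies in [-1, 1].
    p = p_online / np.linalg.norm(p_online)
    z = z_target / np.linalg.norm(z_target)
    return -float(np.dot(p, z))  # minimized at -1.0 when perfectly aligned

def ema_update(target_weights, online_weights, tau=0.996):
    # The target's weights are an exponential moving average of the online
    # network's weights; no gradients flow through this update.
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]
```

Because the target network never receives gradients, only the online network (including its prediction head) is trained directly; the EMA update is applied once per training step.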