Lionel lig 5994: Update the BYOL examples #1794

Merged 8 commits into from Feb 4, 2025
docs/source/examples/byol.rst: 29 additions & 2 deletions
BYOL
====

BYOL (Bootstrap Your Own Latent) [0]_ is a self-supervised learning framework for visual
representation learning without negative samples. Unlike contrastive learning methods
such as MoCo [1]_ and SimCLR [2]_, which compare positive and negative pairs, BYOL uses
two neural networks, an "online" and a "target" network, where the online network is
trained to predict the target's representation of the same image under different
augmentations, resulting in an iterative bootstrapping of the latent representations.
The target's weights are updated as an exponential moving average (EMA) of the online
network's weights, and the authors show that this is sufficient to prevent collapse
to trivial solutions. They also show that, because no negative samples are used, BYOL
is less sensitive to the training batch size and achieves state-of-the-art performance
on several semi-supervised and transfer learning benchmarks.
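
To make the EMA update concrete, here is a minimal PyTorch sketch (an illustration,
not this repository's implementation). It assumes ``online`` and ``target`` are
architecturally identical modules; the momentum ``m = 0.996`` is the paper's base
value, which is annealed towards 1.0 over the course of training.

.. code-block:: python

    import torch

    @torch.no_grad()
    def update_target_network(
        online: torch.nn.Module, target: torch.nn.Module, m: float = 0.996
    ) -> None:
        """Update the target network as an EMA of the online network."""
        for online_p, target_p in zip(online.parameters(), target.parameters()):
            # target <- m * target + (1 - m) * online
            target_p.data.mul_(m).add_((1.0 - m) * online_p.data)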

Key Components
--------------

- **Data Augmentations**: BYOL [0]_ uses the same augmentations as SimCLR [2]_, namely random resized crop, random horizontal flip, color distortion, Gaussian blur, and solarization. The color distortion consists of a random sequence of brightness, contrast, saturation, and hue adjustments, plus an optional grayscale conversion. However, the augmentation hyperparameters differ from those of SimCLR [2]_.
- **Backbone**: BYOL [0]_ uses ResNet-type convolutional backbones as the online and target networks. They do not evaluate the performance of other architectures.
- **Projection & Prediction Head**: A projection head maps the output of the backbone to a lower-dimensional space. For this, the target network once again relies on an EMA of the online network. A notable architectural choice is the additional prediction head, a secondary MLP appended only to the online network's projection head.
- **Loss Function**: BYOL [0]_ uses a negative cosine similarity loss between the online network's prediction output and the target network's projection output (see the sketch after this list).
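
The following is a minimal sketch of how these components fit together, not the code
from this repository. It assumes ``backbone`` returns flat features of size
``feature_dim``; the hidden width of 4096 and output dimension of 256 follow the
paper's ResNet-50 setup.

.. code-block:: python

    import copy

    import torch
    import torch.nn.functional as F
    from torch import nn


    class BYOL(nn.Module):
        """Online backbone + projection + prediction; target backbone + projection (EMA)."""

        def __init__(self, backbone: nn.Module, feature_dim: int,
                     hidden_dim: int = 4096, out_dim: int = 256):
            super().__init__()

            def mlp(in_dim: int) -> nn.Sequential:
                return nn.Sequential(
                    nn.Linear(in_dim, hidden_dim),
                    nn.BatchNorm1d(hidden_dim),
                    nn.ReLU(inplace=True),
                    nn.Linear(hidden_dim, out_dim),
                )

            self.online_backbone = backbone
            self.online_projection = mlp(feature_dim)
            self.prediction = mlp(out_dim)  # appended to the online network only
            # Target network: an EMA copy of the online network, never backpropagated.
            self.target_backbone = copy.deepcopy(backbone)
            self.target_projection = copy.deepcopy(self.online_projection)
            for p in self.target_backbone.parameters():
                p.requires_grad = False
            for p in self.target_projection.parameters():
                p.requires_grad = False

        def loss(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
            # Online branch: backbone -> projection -> prediction.
            p_a = self.prediction(self.online_projection(self.online_backbone(view_a)))
            p_b = self.prediction(self.online_projection(self.online_backbone(view_b)))
            # Target branch: backbone -> projection, under stop-gradient.
            with torch.no_grad():
                z_a = self.target_projection(self.target_backbone(view_a))
                z_b = self.target_projection(self.target_backbone(view_b))

            def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
                return -F.cosine_similarity(p, z, dim=-1).mean()

            # Symmetrize the loss: each view's prediction targets the other view.
            return 0.5 * (neg_cosine(p_a, z_b) + neg_cosine(p_b, z_a))

A training step would compute ``loss`` on two augmented views of the same image, step
an optimizer over the online and prediction parameters only, and then apply the EMA
update sketched above.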

Good to Know
-------------

- **Backbone Networks**: BYOL is specifically optimized for convolutional neural networks, with a focus on ResNet architectures. We do not recommend using it with transformer-based models and instead suggest using :doc:`DINO <dino>` [3]_.


.. [0] `Bootstrap your own latent: A new approach to self-supervised Learning, 2020 <https://arxiv.org/abs/2006.07733>`_
.. [1] `Momentum Contrast for Unsupervised Visual Representation Learning, 2019 <https://arxiv.org/abs/1911.05722>`_
.. [2] `A Simple Framework for Contrastive Learning of Visual Representations, 2020 <https://arxiv.org/abs/2002.05709>`_
.. [3] `Emerging Properties in Self-Supervised Vision Transformers, 2021 <https://arxiv.org/abs/2104.14294>`_


.. tabs::