{"title": "1b-sentence-embeddings.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Train a Sentence Embedding Model with 1B Training Pairs\" authors: - user: asi guest: true --- # Train a Sentence Embedding Model with 1 Billion Training Pairs **Sentence embedding** is a method that maps sentences to vectors of real numbers. Ideally, these vectors would capture the semantic of a sentence and be highly generic. Such representations could then be used for many downstream applications such as clustering, text mining, or question answering. We developed state-of-the-art sentence embedding models as part of the project [\"Train the Best Sentence Embedding Model Ever with 1B Training Pairs\"]( This project took place during the [Community week using JAX/Flax for NLP & CV]( organized by Hugging Face. We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as guidance from Google\u2019s Flax, JAX, and Cloud team members about efficient deep learning frameworks! ## Training methodology ### Model Unlike words, we can not define a finite set of sentences. Sentence embedding methods, therefore, compose inner words to compute the final representation. For example, SentenceBert model ([Reimers and Gurevych, 2019]( uses Transformer, the cornerstone of many NLP applications, followed by a pooling operation over the contextualized word vectors. (c.f. Figure below.)  ### Multiple Negative Ranking Loss The parameters from the composition module are usually learned using a self-supervised objective. For the project, we used a contrastive training method illustrated in the figure below. We constitute a dataset with sentence pairs \\\\( (a_i, p_i) \\\\) such that sentences from the pair have a close meaning. For example, we consider pairs such as (query, answer-passage), (question, duplicate_question),(paper title, cited paper title). Our model is then trained to map pairs \\\\( (a_i , p_i) \\\\) to close vectors while assigning unmatched pairs \\\\( (a_i , p_j), i \\neq j \\\\) to distant vectors in the embedding space. This training method is also called training with in-batch negatives, InfoNCE or NTXentLoss.  Formally, given a batch of training samples, the model optimises the following [loss function]( $$-\\frac{1}{n}\\sum_{i=1}^n\\frac{exp(sim(a_i, p_i))}{\\sum_j exp(sim(a_i, p_j))}$$ An illustrative example can be seen below. The model first embeds each sentence from every pair in the batch. Then, we compute a similarity matrix between every possible pair \\\\( (a_i, p_j) \\\\). We then compare the similarity matrix with the ground truth, which indicates the original pairs. Finally, we perform the comparison using the cross entropy loss. Intuitively, the model should assign high similarity to the sentences \u00ab How many people live in Berlin? \u00bb and \u00ab Around 3.5 million people live in Berlin \u00bb and low similarity to other negative answers such as \u00ab The capital of France is Paris \u00bb as detailed in the Figure below.  In the loss equation, `sim` indicates a similarity function between \\\\( (a, p) \\\\). The similarity function could be either the Cosine-Similarity or the Dot-Product operator. Both methods have their pros and cons summarized below ([Thakur et al., 2021]( [Bachrach et al., 2014]( | Cosine-similarity | Dot-product | |---------------------|-------------| | Vector has highest similarity to itself since \\\\( cos(a, a)=1 \\\\). | Other vectors can have higher dot-products \\\\( dot(a, a) < dot (a, b) \\\\). 
| | With normalised vectors, it is equal to the dot product. The max vector length equals 1. | It might be slower with certain approximate nearest neighbour methods since the max vector length is not known. | | With normalised vectors, it is proportional to Euclidean distance. It works with k-means clustering. | It does not work with k-means clustering. | In practice, we used a scaled similarity because score differences tend to be too small, and we apply a scaling factor \\\\( C \\\\) such that \\\\( sim_{scaled}(a, b) = C * sim(a, b) \\\\) with typically \\\\( C = 20 \\\\) ([Henderson et al., 2020]( [Radford et al., 2021]( ### Improving Quality with Better Batches In our method, we build batches of sample pairs \\\\( (a_i , p_i) \\\\). We consider all other samples from the batch, \\\\( (a_i , p_j), i \\neq j \\\\), as negative sample pairs. The batch composition is therefore a key training aspect. Given the literature in the domain, we focused on three main aspects of the batch. #### 1. Size matters In contrastive learning, a larger batch size is synonymous with better performance. As shown in the figure extracted from Qu et al. ([2021]( a larger batch size improves the results.  #### 2. Hard Negatives In the same figure, we observe that including hard negatives also improves performance. Hard negatives are samples \\\\( p_j \\\\) which are hard to distinguish from \\\\( p_i \\\\). In our example, it could be the pairs \u00ab What is the capital of France? \u00bb and \u00ab What is the capital of the US? \u00bb, which have close semantic content and require a precise understanding of the full sentence to be answered correctly. On the contrary, the samples \u00ab What is the capital of France? \u00bb and \u00ab How many Star Wars movies are there? \u00bb are less difficult to distinguish since they do not refer to the same topic. #### 3. Cross dataset batches We concatenated multiple datasets to train our models. We built large batches and gathered samples from the same dataset within a batch to limit the topic distribution and favor hard negatives. However, we also mixed at least two datasets within each batch to learn a global structure between topics and not only a local structure within a topic. ## Training infrastructure and data As mentioned earlier, the quantity of data and the batch size directly impact model performance. As part of the project, we benefited from efficient hardware infrastructure. We trained our models on [TPUs]( which are compute units developed by Google that are extremely efficient for matrix multiplications. TPUs have some [hardware specificities]( which might require specific code implementations. Additionally, we trained models on a large corpus, concatenating multiple datasets for a total of up to 1 billion sentence pairs! All datasets used are detailed for each model in the [model card]( ## Conclusion You can find all models and datasets we created during the challenge in our [HuggingFace repository]( We trained 20 general-purpose Sentence Transformers models such as MiniLM ([Wang et al., 2020]( RoBERTa ([Liu et al., 2019]( DistilBERT ([Sanh et al., 2020]( and MPNet ([Song et al., 2020]( Our models achieve SOTA on multiple general-purpose Sentence Similarity evaluation tasks. We also shared [8 datasets]( specialized for Question Answering, Sentence-Similarity, and Gender Evaluation. General sentence embeddings might be used for many applications. 
We built a [Spaces demo]( to showcase several applications: * The **sentence similarity** module compares the similarity of the main text with other texts of your choice. In the background, the demo extracts the embedding for each text and computes the similarity between the source sentence and the others using cosine similarity. * **Asymmetric QA** compares the answer likelihood of a given query with answer candidates of your choice. * **Search / Cluster** returns nearby answers from a query. For example, if you input \u00ab python \u00bb, it will retrieve the closest sentences using the dot-product distance. * **Gender Bias Evaluation** reports the *inherent gender bias* in the training set via random sampling of the sentences. Given an anchor text that does not mention the gender of the target occupation and two propositions with gendered pronouns, we check whether models assign a higher similarity to a given proposition and thereby evaluate their propensity to favor a specific gender. The [Community week using JAX/Flax for NLP & CV]( was an intense and highly rewarding experience! The guidance and presence of Google\u2019s Flax, JAX, and Cloud team members and of the Hugging Face team helped us all learn a lot. We hope all projects had as much fun as we did in ours. Whenever you have questions or suggestions, don\u2019t hesitate to contact us!"}
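To make the in-batch negatives (InfoNCE) loss described in this post concrete, here is a minimal PyTorch sketch. It is an illustration only, not the JAX/Flax code used for the project; the cosine similarity and the scaling factor C = 20 follow the choices discussed above, and the tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    # anchors, positives: (batch_size, dim) embeddings of the pairs (a_i, p_i)
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    # scaled cosine similarity matrix sim(a_i, p_j)
    sim = scale * (a @ p.T)
    # ground truth: a_i matches p_i, every other p_j in the batch is a negative
    labels = torch.arange(sim.size(0), device=sim.device)
    # cross entropy over the rows implements the loss written above
    return F.cross_entropy(sim, labels)

# toy usage with random placeholder embeddings
loss = in_batch_negatives_loss(torch.randn(8, 768), torch.randn(8, 768))
```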
{"title": "4bit-transformers-bitsandbytes.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA\" thumbnail: /blog/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png authors: - user: ybelkada - user: timdettmers guest: true - user: artidoro guest: true - user: sgugger - user: smangrul --- # Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA LLMs are known to be large, and running or training them in consumer hardware is a huge challenge for users and accessibility. Our [LLM.int8 blogpost]( showed how the techniques in the [LLM.int8 paper]( were integrated in transformers using the `bitsandbytes` library. As we strive to make models even more accessible to anyone, we decided to collaborate with bitsandbytes again to allow users to run models in 4-bit precision. This includes a large majority of HF models, in any modality (text, vision, multi-modal, etc.). Users can also train adapters on top of 4bit models leveraging tools from the Hugging Face ecosystem. This is a new method introduced today in the QLoRA paper by Dettmers et al. The abstract of the paper is as follows: > We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training. 
## Resources This blogpost and release come with several resources to get started with 4bit models and QLoRA: - [Original paper]( - [Basic usage Google Colab notebook]( - This notebook shows how to use 4bit models in inference with all their variants, and how to run GPT-neo-X (a 20B parameter model) on a free Google Colab instance - [Fine tuning Google Colab notebook]( - This notebook shows how to fine-tune a 4bit model on a downstream task using the Hugging Face ecosystem. We show that it is possible to fine tune GPT-neo-X 20B on a Google Colab instance! - [Original repository for replicating the paper's results]( - [Guanaco 33b playground]( - or check the playground section below ## Introduction If you are not familiar with model precisions and the most common data types (float16, float32, bfloat16, int8), we advise you to carefully read the introduction in [our first blogpost]( that goes over the details of these concepts in simple terms with visualizations. For more information we recommend reading the fundamentals of floating point representation through [this wikibook document]( The recent QLoRA paper explores different data types, 4-bit Float and 4-bit NormalFloat. We will discuss here the 4-bit Float data type since it is easier to understand. FP8 and FP4 stand for Floating Point 8-bit and 4-bit precision, respectively. They are part of the minifloats family of floating point values (among other precisions, the minifloats family also includes bfloat16 and float16). Let\u2019s first have a look at how to represent floating point values in FP8 format, then understand how the FP4 format looks like. ### FP8 format As discussed in our previous blogpost, a floating point contains n-bits, with each bit falling into a specific category that is responsible for representing a component of the number (sign, mantissa and exponent). These represent the following. The FP8 (floating point 8) format has been first introduced in the paper [\u201cFP8 for Deep Learning\u201d]( with two different FP8 encodings: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). |  format. Source: Original content from [`sgugger`]( | Although the precision is substantially reduced by reducing the number of bits from 32 to 8, both versions can be used in a variety of situations. Currently one could use [Transformer Engine library]( that is also integrated with HF ecosystem through accelerate. The potential floating points that can be represented in the E4M3 format are in the range -448 to 448, whereas in the E5M2 format, as the number of bits of the exponent increases, the range increases to -57344 to 57344 - but with a loss of precision because the number of possible representations remains constant. It has been empirically proven that the E4M3 is best suited for the forward pass, and the second version is best suited for the backward computation ### FP4 precision in a few words The sign bit represents the sign (+/-), the exponent bits a base two to the power of the integer represented by the bits (e.g. `2^{010} = 2^{2} = 4`), and the fraction or mantissa is the sum of powers of negative two which are \u201cactive\u201d for each bit that is \u201c1\u201d. If a bit is \u201c0\u201d the fraction remains unchanged for that power of `2^-i` where i is the position of the bit in the bit-sequence. For example, for mantissa bits 1010 we have `(0 + 2^-1 + 0 + 2^-3) = (0.5 + 0.125) = 0.625`. 
To get a value, we add *1* to the fraction and multiply all results together. For example, with 2 exponent bits and one mantissa bit, the representation 1101 would be: `-1 * 2^(2) * (1 + 2^-1) = -1 * 4 * 1.5 = -6` For FP4 there is no fixed format, and as such one can try different combinations of mantissa and exponent bits. In general, 3 exponent bits do a bit better in most cases. But sometimes 2 exponent bits and a mantissa bit yield better performance. ## QLoRA paper, a new way of democratizing quantized large transformer models In a few words, QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 48GB GPU. More specifically, QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. The LoRA layers are the only parameters being updated during training. Read more about LoRA in the [original LoRA paper]( QLoRA has one storage data type (usually 4-bit NormalFloat) for the base model weights and a computation data type (16-bit BrainFloat) used to perform computations. QLoRA dequantizes weights from the storage data type to the computation data type to perform the forward and backward passes, but only computes weight gradients for the LoRA parameters, which use 16-bit bfloat. The weights are decompressed only when they are needed, therefore the memory usage stays low during training and inference. QLoRA tuning is shown to match 16-bit finetuning methods in a wide range of experiments. In addition, the Guanaco models, which use QLoRA finetuning for LLaMA models on the [OpenAssistant dataset (OASST1)]( are state-of-the-art chatbot systems and are close to ChatGPT on the Vicuna benchmark. This is an additional demonstration of the power of QLoRA tuning. For more details, we recommend reading the [QLoRA paper]( ## How to use it in transformers? In this section, let us introduce the transformers integration of this method, how to use it and which models can be effectively quantized. ### Getting started As a quickstart, load a model in 4bit by (at the time of this writing) installing accelerate and transformers from source, and making sure you have installed the latest version of the bitsandbytes library (0.39.0). ```bash pip install -q -U bitsandbytes pip install -q -U git+ pip install -q -U git+ pip install -q -U git+ ``` ### Quickstart The basic way to load a model in 4bit is to pass the argument `load_in_4bit=True` when calling the `from_pretrained` method, along with a device map (pass `\"auto\"` to get a device map that will be automatically inferred). ```python from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained(\"facebook/opt-350m\", load_in_4bit=True, device_map=\"auto\") ... ``` That's all you need! As a general rule, we recommend that users not manually set a device once the model has been loaded with `device_map`. So any device assignment call to the model, or to any of the model\u2019s submodules, should be avoided after that line - unless you know what you are doing. Keep in mind that loading a quantized model will automatically cast the model's other submodules into `float16` dtype. 
You can change this behavior (if, for example, you want to have the layer norms in `float32`) by passing `torch_dtype=dtype` to the `from_pretrained` method. ### Advanced usage You can play with different variants of 4bit quantization such as NF4 (normalized float 4, the default) or pure FP4 quantization. Based on theoretical considerations and empirical results from the paper, we recommend using NF4 quantization for better performance. Other options include `bnb_4bit_use_double_quant`, which uses a second quantization after the first one to save an additional 0.4 bits per parameter. And finally, there is the compute dtype. While 4-bit bitsandbytes stores weights in 4 bits, the computation still happens in 16 or 32 bits, and here any dtype can be chosen (float16, bfloat16, float32, etc.). The matrix multiplication and training will be faster if one uses a 16-bit compute dtype (the default is torch.float32). One should leverage the recent `BitsAndBytesConfig` from transformers to change these parameters. Below is an example of loading a model in 4bit with NF4 quantization and double quantization, using the compute dtype bfloat16 for faster training: ```python import torch from transformers import AutoModelForCausalLM, BitsAndBytesConfig nf4_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type=\"nf4\", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16 ) model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config) ``` #### Changing the compute dtype As mentioned above, you can also change the compute dtype of the quantized model by just changing the `bnb_4bit_compute_dtype` argument in `BitsAndBytesConfig`. ```python import torch from transformers import BitsAndBytesConfig quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16 ) ``` #### Nested quantization To enable nested quantization, you can use the `bnb_4bit_use_double_quant` argument in `BitsAndBytesConfig`. This will enable a second quantization after the first one to save an additional 0.4 bits per parameter. We also use this feature in the training Google colab notebook. ```python from transformers import BitsAndBytesConfig double_quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, ) model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config) ``` And of course, as mentioned in the beginning of the section, all of these components are composable. You can combine all these parameters together to find the optimal use case for you. A rule of thumb is: use double quant if you have problems with memory, use NF4 for higher precision, and use a 16-bit dtype for faster finetuning. For instance, in the [inference demo]( we use nested quantization, bfloat16 compute dtype and NF4 quantization to fit gpt-neo-x-20b (40GB) entirely in 4bit on a single 16GB GPU. ### Common questions In this section, we will also address some common questions anyone could have regarding this integration. #### Does FP4 quantization have any hardware requirements? Note that this method is only compatible with GPUs, hence it is not possible to quantize models in 4bit on a CPU. Among GPUs, there are no further hardware requirements for this method; any GPU can be used to run 4bit quantization as long as you have CUDA>=11.2 installed. Also keep in mind that the computation is not done in 4bit: the weights and activations are compressed to that format, while the computation is still kept in the desired or native dtype. 
#### What are the supported models? Similarly to the LLM.int8 integration presented in [this blogpost]( the integration heavily relies on the `accelerate` library. Therefore, any model that supports accelerate loading (i.e. the `device_map` argument when calling `from_pretrained`) should be quantizable in 4bit. Note also that this is totally agnostic to modalities: as long as the models can be loaded with the `device_map` argument, it is possible to quantize them. At this time of writing, this would include the most used architectures such as Llama, OPT, GPT-Neo and GPT-NeoX for text models, Blip2 for multimodal models, and so on. At this time of writing, the models that support accelerate are: ```python [ 'bigbird_pegasus', 'blip_2', 'bloom', 'bridgetower', 'codegen', 'deit', 'esm', 'gpt2', 'gpt_bigcode', 'gpt_neo', 'gpt_neox', 'gpt_neox_japanese', 'gptj', 'gptsan_japanese', 'lilt', 'llama', 'longformer', 'longt5', 'luke', 'm2m_100', 'mbart', 'mega', 'mt5', 'nllb_moe', 'open_llama', 'opt', 'owlvit', 'plbart', 'roberta', 'roberta_prelayernorm', 'rwkv', 'switch_transformers', 't5', 'vilt', 'vit', 'vit_hybrid', 'whisper', 'xglm', 'xlm_roberta' ] ``` Note that if your favorite model is not there, you can open a Pull Request or raise an issue in transformers to add the support of accelerate loading for that architecture. #### Can we train 4bit/8bit models? It is not possible to perform pure 4bit training on these models. However, you can train these models by leveraging parameter efficient fine tuning methods (PEFT) and train, for example, adapters on top of them. That is what is done in the paper and is officially supported by the PEFT library from Hugging Face. We also provide a [training notebook]( and recommend that users check the [QLoRA repository]( if they are interested in replicating the results from the paper. |  pretrained weights (left) are augmented by a low rank adapter comprised of weight matrices A and B (right). | #### What other consequences are there? This integration can bring several positive consequences to the community and AI research as it can affect multiple use cases and possible applications. In RLHF (Reinforcement Learning from Human Feedback) it is possible to load a single base model in 4bit and train multiple adapters on top of it, one for the reward modeling, and another for the value policy training. A more detailed blogpost and announcement will be made soon about this use case. We have also made some benchmarks on the impact of this quantization method on training large models on consumer hardware. We have run several experiments on finetuning 2 different architectures, Llama 7B (15GB in fp16) and Llama 13B (27GB in fp16), on an NVIDIA T4 (16GB), and here are the results: | Model name | Half precision model size (in GB) | Hardware type / total VRAM | quantization method (CD=compute dtype / GC=gradient checkpointing / NQ=nested quantization) | batch_size | gradient accumulation steps | optimizer | seq_len | Result | | ----------------------------------- | --------------------------------- | -------------------------- | ------------------------------------------------------------------------------------------- | ---------- | --------------------------- | ----------------- | ------- | ------ | | | | | | | | | | | | ## Acknowledgements The HF team would like to acknowledge all the people involved in this project from the University of Washington, and thank them for making this work available to the community. 
The authors would also like to thank [Pedro Cuenca]( for kindly reviewing the blogpost, [Olivier Dehaene]( and [Omar Sanseviero]( for their quick and strong support for the integration of the paper's artifacts on the HF Hub."}
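As a companion to the resources above, here is a minimal sketch of the overall recipe described in this post: load a base model in 4-bit with NF4, nested quantization and a bfloat16 compute dtype, then attach LoRA adapters with PEFT and train only those. It assumes a recent `peft` release that provides `prepare_model_for_kbit_training`; the model id, target modules and LoRA hyperparameters are illustrative placeholders, not the exact settings from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = 'facebook/opt-350m'  # small example model, swap in your own

# 4-bit NF4 storage, nested quantization, bfloat16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map='auto'
)

# freeze the quantized base model and make it ready for k-bit training
model = prepare_model_for_kbit_training(model)

# trainable Low-Rank Adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],  # attention projections for OPT-style models
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```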
{"title": "README.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "# The Hugging Face Blog Repository This is the official repository of the [Hugging Face Blog]( ## How to write an article? :keycap_1: Create a branch `YourName/Title` :keycap_2: Create a md (markdown) file, **use a short file name**. For instance, if your title is \"Introduction to Deep Reinforcement Learning\", the md file name could be `intro-rl.md`. This is important because the **file name will be the blogpost's URL**. :keycap_3: Create a new folder in `assets`. Use the same name as the name of the md file. Optionally you may add a numerical prefix to that folder, using the number that hasn't been used yet. But this is no longer required. i.e. the asset folder in this example could be `123_intro-rl` or `intro-rl`. This folder will contain **your thumbnail only**. The folder number is mostly for (rough) ordering purposes, so it's no big deal if two concurrent articles use the same number. For the rest of your files, create a mirrored folder in the HuggingFace Documentation Images [repo]( This is to reduce bloat in the GitHub base repo when cloning and pulling. : In terms of images, **try to have small files** to avoid having a slow loading user experience: - Use compressed images, you can use this website: or :keycap_4: Copy and paste this to your md file and change the elements - title - thumbnail - authors ``` --- title: \"PUT YOUR TITLE HERE\" thumbnail: /blog/assets/101_decision-transformers-train/thumbnail.gif authors: - user: your_hf_user - user: your_coauthor --- # Train your first Decision Transformer Your content here [...] ``` The blog_metadata and authors HTML comments are meant to mark where in the file will be inserted the following UI elements: - \"Published on [date]\" - \"Update on GitHub\" button - avatars of the authors that were listed in authors. Please keep the blog_metadata and authors comments exactly equal to those strings otherwise they won't be replaced. :keycap_5: Then, you can add your content. It's markdown system so if you wrote your text on notion just control shift v to copy/paste as markdown. :keycap_6: Modify `_blog.yml` to add your blogpost. :keycap_7: When your article is ready, **open a pull request**. :keycap_8: To check how your blog will look like before merging it, check out the [CodeSpace instructions]( (internal for HF team) :keycap_9: The article will be **published automatically when you merge your pull request**. ## How to get a nice responsive thumbnail? :keycap_1: Create a `1300x650` image :keycap_2: Use [this template]( and fill the content part. Or select a background you like and follow the instructions in [this Figma template]( ## Using LaTeX Just add: ``` \\\\(your_latex_here\\\\) ``` For instance: ``` \\\\( Q(S_t, A_t) \\\\) ``` $Q(S_t, A_t)$"}
{"title": "accelerate-deepspeed.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Accelerate Large Model Training using DeepSpeed\" thumbnail: /blog/assets/83_accelerate_deepspeed/deepspeed-thumbnail.png authors: - user: smangrul - user: sgugger --- Accelerate Large Model Training using DeepSpeed In this post we will look at how we can leverage the **[Accelerate]( library for training large models which enables users to leverage the ZeRO features of **[DeeSpeed]( # Motivation **Tired of Out of Memory (OOM) errors while trying to train large models? We've got you covered. Large models are very performant [1] but difficult to train with the available hardware. To get the most of the available hardware for training large models one can leverage Data Parallelism using ZeRO - Zero Redundancy Optimizer [2]**. Below is a short description of Data Parallelism using ZeRO with diagram from this [blog post]( ]( have implemented the core ideas of the ZERO paper. These have already been integrated in `transformers` Trainer and `accelerate` accompanied by great blogs [Fit More and Train Faster With ZeRO via DeepSpeed and FairScale]( [4] and [Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel]( [5]. We defer the explanation of what goes behind the scenes to those blogs and mainly focus on leveraging DeepSpeed ZeRO using Accelerate. # Accelerate : Leverage DeepSpeed ZeRO without any code changes **Hardware setup**: 2X24GB NVIDIA Titan RTX GPUs. 60GB RAM. We will look at the task of finetuning encoder-only model for text-classification. We will use pretrained `microsoft/deberta-v2-xlarge-mnli` (900M params) for finetuning on MRPC GLUE dataset. The code is available here [run_cls_no_trainer.py]( It is similar to the official text-classification example [here]( with the addition of logic to measure train and eval time. Let's compare performance between Distributed Data Parallel (DDP) and DeepSpeed ZeRO Stage-2 in a Multi-GPU Setup. To enable DeepSpeed ZeRO Stage-2 without any code changes, please run `accelerate config` and leverage the [Accelerate DeepSpeed Plugin]( **ZeRO Stage-2 DeepSpeed Plugin Example** ```bash compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: none offload_param_device: none zero3_init_flag: false zero_stage: 2 distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false ``` Now, run below command for training: ```bash accelerate launch run_cls_no_trainer.py \\ --model_name_or_path \"microsoft/deberta-v2-xlarge-mnli\" \\ --task_name \"mrpc\" \\ --ignore_mismatched_sizes \\ --max_length 128 \\ --per_device_train_batch_size 40 \\ --learning_rate 2e-5 \\ --num_train_epochs 3 \\ --output_dir \"/tmp/mrpc/deepspeed_stage2/\" \\ --with_tracking \\ --report_to \"wandb\" \\ ``` In our Single-Node Multi-GPU setup, the maximum batch size that DDP supports without OOM error is 8. In contrast, DeepSpeed Zero-Stage 2 enables batch size of 40 without running into OOM errors. Therefore, DeepSpeed enables to fit **5X** more data per GPU when compared to DDP. Below is the snapshot of the plots from wandb [run]( along with benchmarking table comparing DDP vs DeepSpeed.  
--- | Method | Batch Size Max | Train time per epoch (seconds) | Eval time per epoch (seconds) | F1 score | Accuracy | | --- | --- | --- | --- | --- | --- | | DDP (Distributed Data Parallel) | 8 | 103.57 | 2.04 | 0.931 | 0.904 | | DeepSpeed ZeRO Stage 2 | **40** | **28.98** | **1.79** | **0.936** | **0.912** | Table 1: Benchmarking DeepSpeed ZeRO Stage-2 on DeBERTa-XL (900M) model --- With this bigger batch size, we observe a ~**3.5X** speed-up in total training time without any drop in performance metrics, all without changing any code. Yay! To be able to tweak more options, you will need to use a DeepSpeed config file and make minimal code changes. Let's see how to do this. # Accelerate : Leverage a DeepSpeed Config file to tweak more options First, we will look at the task of finetuning a sequence-to-sequence model for training our own chatbot. Specifically, we will finetune `facebook/blenderbot-400M-distill` on the [smangrul/MuDoConv]( (Multi-Domain Conversation) dataset. The dataset contains conversations from 10 different data sources covering personas, grounding in specific emotional contexts, goal-oriented tasks (e.g., restaurant reservation) and general Wikipedia topics (e.g., cricket). The code is available here [run_seq2seq_no_trainer.py]( The current practice to effectively measure the `Engagingness` and `Humanness` of chatbots is via human evaluations, which are expensive [6]. As such, for this example, the metric being tracked is the BLEU score (which isn't ideal but is the conventional metric for such tasks). You can adapt the code to train larger T5 models if you have access to GPUs that support `bfloat16` precision; otherwise you will run into `NaN` loss values. We will run a quick benchmark on `10000` train samples and `1000` eval samples as we are interested in DeepSpeed vs DDP. We will leverage the DeepSpeed Zero Stage-2 config [zero2_config_accelerate.json]( (given below) for training. For detailed information on the various config features, please refer to the [DeepSpeed]( documentation. ```json { \"fp16\": { \"enabled\": \"true\", \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 15, \"hysteresis\": 2, \"min_loss_scale\": 1 }, \"optimizer\": { \"type\": \"AdamW\", \"params\": { \"lr\": \"auto\", \"weight_decay\": \"auto\", \"torch_adam\": true, \"adam_w_mode\": true } }, \"scheduler\": { \"type\": \"WarmupDecayLR\", \"params\": { \"warmup_min_lr\": \"auto\", \"warmup_max_lr\": \"auto\", \"warmup_num_steps\": \"auto\", \"total_num_steps\": \"auto\" } }, \"zero_optimization\": { \"stage\": 2, \"allgather_partitions\": true, \"allgather_bucket_size\": 2e8, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 2e8, \"contiguous_gradients\": true }, \"gradient_accumulation_steps\": 1, \"gradient_clipping\": \"auto\", \"steps_per_print\": 2000, \"train_batch_size\": \"auto\", \"train_micro_batch_size_per_gpu\": \"auto\", \"wall_clock_breakdown\": false } ``` To enable DeepSpeed ZeRO Stage-2 with the above config, please run `accelerate config` and provide the config file path when asked. 
For more details, refer to the official `accelerate` documentation for the [DeepSpeed Config File]( **ZeRO Stage-2 DeepSpeed Config File Example** ```bash compute_environment: LOCAL_MACHINE deepspeed_config: deepspeed_config_file: /path/to/zero2_config_accelerate.json zero3_init_flag: false distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false ``` Now, run the below command for training: ```bash accelerate launch run_seq2seq_no_trainer.py \\ --dataset_name \"smangrul/MuDoConv\" \\ --max_source_length 128 \\ --source_prefix \"chatbot: \" \\ --max_target_length 64 \\ --val_max_target_length 64 \\ --val_min_target_length 20 \\ --n_val_batch_generations 5 \\ --n_train 10000 \\ --n_val 1000 \\ --pad_to_max_length \\ --num_beams 10 \\ --model_name_or_path \"facebook/blenderbot-400M-distill\" \\ --per_device_train_batch_size 200 \\ --per_device_eval_batch_size 100 \\ --learning_rate 1e-6 \\ --weight_decay 0.0 \\ --num_train_epochs 1 \\ --gradient_accumulation_steps 1 \\ --num_warmup_steps 100 \\ --output_dir \"/tmp/deepspeed_zero_stage2_accelerate_test\" \\ --seed 25 \\ --logging_steps 100 \\ --with_tracking \\ --report_to \"wandb\" \\ --report_name \"blenderbot_400M_finetuning\" ``` When using a DeepSpeed config, if the user has specified `optimizer` and `scheduler` in the config, the user will have to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler`. Those are the only minor changes that the user has to make. Below we show an example of the minimal changes required when using a DeepSpeed config: ```diff - optimizer = torch.optim.Adam(optimizer_grouped_parameters, lr=args.learning_rate) + optimizer = accelerate.utils.DummyOptim(optimizer_grouped_parameters, lr=args.learning_rate) - lr_scheduler = get_scheduler( - name=args.lr_scheduler_type, - optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps, - num_training_steps=args.max_train_steps, - ) + lr_scheduler = accelerate.utils.DummyScheduler( + optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps + ) ``` --- | Method | Batch Size Max | Eval Size Max | Train time per epoch (seconds) | Eval time per epoch (seconds) | | --- | --- | --- | --- | --- | | DDP (Distributed Data Parallel) | 100 | 50 | 27.36 | 48.41 | | DeepSpeed ZeRO Stage 2 | **200** | **100** | **19.06** | **39.27** | Table 2: Benchmarking DeepSpeed ZeRO Stage-2 on BlenderBot (400M) model In our Single-Node Multi-GPU setup, the maximum batch size that DDP supports without OOM error is 100. In contrast, DeepSpeed ZeRO Stage-2 enables a batch size of 200 without running into OOM errors. Therefore, DeepSpeed enables fitting **2X** more data per GPU when compared to DDP. We observe a ~**1.44X** speedup in training and a ~**1.23X** speedup in evaluation as we are able to fit more data on the same available hardware. As this model is of medium size, the speedup isn't that exciting, but this will improve with bigger models. You can chat with the chatbot trained on the entire data at the Space [smangrul/Chat-E]( You can give the bot a persona, ground the conversation in a particular emotion, use it in goal-oriented tasks or chat in a free-flow manner. Below is a fun conversation with the chatbot. 
You can find snapshots of more conversations using different contexts [here](  --- ## CPU/Disk Offloading to enable training humongous models that won\u2019t fit the GPU memory On a single 24GB NVIDIA Titan RTX GPU, one cannot train GPT-XL Model (1.5B parameters) even with a batch size of 1. We will look at how we can use DeepSpeed ZeRO Stage-3 with CPU offloading of optimizer states, gradients and parameters to train GPT-XL Model. We will leverage the DeepSpeed Zero Stage-3 CPU offload config [zero3_offload_config_accelerate.json]( (given below) for training. The rest of the process of using the config with `accelerate` is similar to the above experiment. ```json { \"fp16\": { \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 16, \"hysteresis\": 2, \"min_loss_scale\": 1 }, \"optimizer\": { \"type\": \"AdamW\", \"params\": { \"lr\": \"auto\", \"weight_decay\": \"auto\" } }, \"scheduler\": { \"type\": \"WarmupDecayLR\", \"params\": { \"warmup_min_lr\": \"auto\", \"warmup_max_lr\": \"auto\", \"warmup_num_steps\": \"auto\", \"total_num_steps\": \"auto\" } }, \"zero_optimization\": { \"stage\": 3, \"offload_optimizer\": { \"device\": \"cpu\", \"pin_memory\": true }, \"offload_param\": { \"device\": \"cpu\", \"pin_memory\": true }, \"overlap_comm\": true, \"contiguous_gradients\": true, \"reduce_bucket_size\": \"auto\", \"stage3_prefetch_bucket_size\": \"auto\", \"stage3_param_persistence_threshold\": \"auto\", \"sub_group_size\": 1e9, \"stage3_max_live_parameters\": 1e9, \"stage3_max_reuse_distance\": 1e9, \"stage3_gather_16bit_weights_on_model_save\": true }, \"gradient_accumulation_steps\": 1, \"gradient_clipping\": \"auto\", \"steps_per_print\": 2000, \"train_batch_size\": \"auto\", \"train_micro_batch_size_per_gpu\": \"auto\", \"wall_clock_breakdown\": false } ``` **ZeRO Stage-3 CPU Offload DeepSpeed Config File Example** ```bash compute_environment: LOCAL_MACHINE deepspeed_config: deepspeed_config_file: /path/to/zero3_offload_config_accelerate.json zero3_init_flag: true distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false ``` Now, run below command for training: ```bash accelerate launch run_clm_no_trainer.py \\ --config_name \"gpt2-xl\" \\ --tokenizer_name \"gpt2-xl\" \\ --dataset_name \"wikitext\" \\ --dataset_config_name \"wikitext-2-raw-v1\" \\ --block_size 128 \\ --output_dir \"/tmp/clm_deepspeed_stage3_offload__accelerate\" \\ --learning_rate 5e-4 \\ --per_device_train_batch_size 16 \\ --per_device_eval_batch_size 1 \\ --num_train_epochs 1 \\ --with_tracking \\ --report_to \"wandb\"\\ ``` --- | Method | Batch Size Max | Train time per epoch (seconds) | Notes | | --- | --- | --- | --- | | DDP (Distributed Data Parallel) | - | - | OOM Error | DeepSpeed ZeRO Stage 2 | **16** | 6608.35 | | Table 3: Benchmarking DeepSpeed ZeRO Stage-3 CPU Offload on GPT-XL (1.5B) model --- DDP will result in OOM error even with batch size 1. On the other hand, with DeepSpeed ZeRO Stage-3 CPU offload, we can train with a batch size of 16. 
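As a back-of-the-envelope check of why plain DDP runs out of memory here, one can follow the ZeRO paper's accounting for mixed-precision Adam training, which counts roughly 16 bytes of model states per parameter (a rough estimate that ignores activations and framework overhead):

```python
# Rough model-state memory for mixed-precision Adam training (ZeRO paper accounting):
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
# + fp32 Adam momentum (4 B) + fp32 Adam variance (4 B) = 16 bytes per parameter
params = 1.5e9                       # GPT-XL, 1.5B parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4
total_gib = params * bytes_per_param / 1024**3
print(f'{total_gib:.1f} GiB of model states alone')  # ~22.4 GiB, before any activations
```

Even before counting activations, the model states alone nearly fill the 24GB card, which is why offloading optimizer states and parameters to CPU memory with ZeRO Stage-3 makes the difference.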
Finally, please remember that `Accelerate` only integrates DeepSpeed; therefore, if you have any problems or questions regarding DeepSpeed usage, please file an issue on the [DeepSpeed GitHub]( # References [1] [Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers]( [2] [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]( [3] [DeepSpeed: Extreme-scale model training for everyone - Microsoft Research]( [4] [Fit More and Train Faster With ZeRO via DeepSpeed and FairScale]( [5] [Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel]( [6] [Recipes for building an open-domain chatbot]("}
{"title": "accelerate-large-models.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"How Accelerate runs very large models thanks to PyTorch\" thumbnail: /blog/assets/104_accelerate-large-models/thumbnail.png authors: - user: sgugger --- # How Accelerate runs very large models thanks to PyTorch ## Load and run large models Meta AI and BigScience recently open-sourced very large language models which won't fit into memory (RAM or GPU) of most consumer hardware. At Hugging Face, part of our mission is to make even those large models accessible, so we developed tools to allow you to run those models even if you don't own a supercomputer. All the examples picked in this blog post run on a free Colab instance (with limited RAM and disk space) if you have access to more disk space, don't hesitate to pick larger checkpoints. Here is how we can run OPT-6.7B: ```python import torch from transformers import pipeline # This works on a base Colab instance. # Pick a larger checkpoint if you have time to wait and enough disk space! checkpoint = \"facebook/opt-6.7b\" generator = pipeline(\"text-generation\", model=checkpoint, device_map=\"auto\", torch_dtype=torch.float16) # Perform inference generator(\"More and more large language models are opensourced so Hugging Face has\") ``` We'll explain what each of those arguments do in a moment, but first just consider the traditional model loading pipeline in PyTorch: it usually consists of: 1. Create the model 2. Load in memory its weights (in an object usually called `state_dict`) 3. Load those weights in the created model 4. Move the model on the device for inference While that has worked pretty well in the past years, very large models make this approach challenging. Here the model picked has 6.7 *billion* parameters. In the default precision, it means that just step 1 (creating the model) will take roughly **26.8GB** in RAM (1 parameter in float32 takes 4 bytes in memory). This can't even fit in the RAM you get on Colab. Then step 2 will load in memory a second copy of the model (so another 26.8GB in RAM in default precision). If you were trying to load the largest models, for example BLOOM or OPT-176B (which both have 176 billion parameters), like this, you would need 1.4 **terabytes** of CPU RAM. That is a bit excessive! And all of this to just move the model on one (or several) GPU(s) at step 4. Clearly we need something smarter. In this blog post, we'll explain how Accelerate leverages PyTorch features to load and run inference with very large models, even if they don't fit in RAM or one GPU. In a nutshell, it changes the process above like this: 1. Create an empty (e.g. without weights) model 2. Decide where each layer is going to go (when multiple devices are available) 3. Load in memory parts of its weights 4. Load those weights in the empty model 5. Move the weights on the device for inference 6. Repeat from step 3 for the next weights until all the weights are loaded ## Creating an empty model PyTorch 1.9 introduced a new kind of device called the *meta* device. This allows us to create tensor without any data attached to them: a tensor on the meta device only needs a shape. As long as you are on the meta device, you can thus create arbitrarily large tensors without having to worry about CPU (or GPU) RAM. 
For instance, the following code will crash on Colab: ```python import torch large_tensor = torch.randn(100000, 100000) ``` as this large tensor requires `4 * 10**10` bytes (the default precision is FP32, so each element of the tensor takes 4 bytes) thus 40GB of RAM. The same on the meta device works just fine however: ```python import torch large_tensor = torch.randn(100000, 100000, device=\"meta\") ``` If you try to display this tensor, here is what PyTorch will print: ``` tensor(..., device='meta', size=(100000, 100000)) ``` As we said before, there is no data associated with this tensor, just a shape. You can instantiate a model directly on the meta device: ```python large_model = torch.nn.Linear(100000, 100000, device=\"meta\") ``` But for an existing model, this syntax would require you to rewrite all your modeling code so that each submodule accepts and passes along a `device` keyword argument. Since this was impractical for the 150 models of the Transformers library, we developed a context manager that will instantiate an empty model for you. Here is how you can instantiate an empty version of BLOOM: ```python from accelerate import init_empty_weights from transformers import AutoConfig, AutoModelForCausalLM config = AutoConfig.from_pretrained(\"bigscience/bloom\") with init_empty_weights(): model = AutoModelForCausalLM.from_config(config) ``` This works on any model, but you get back a shell you can't use directly: some operations are implemented for the meta device, but not all yet. Here for instance, you can use the `large_model` defined above with an input, but not the BLOOM model. Even when using it, the output will be a tensor of the meta device, so you will get the shape of the result, but nothing more. As further work on this, the PyTorch team is working on a new [class `FakeTensor`]( which is a bit like tensors on the meta device, but with the device information (on top of shape and dtype) Since we know the shape of each weight, we can however know how much memory they will all consume once we load the pretrained tensors fully. Therefore, we can make a decision on how to split our model across CPUs and GPUs. ## Computing a device map Before we start loading the pretrained weights, we will need to know where we want to put them. This way we can free the CPU RAM each time we have put a weight in its right place. This can be done with the empty model on the meta device, since we only need to know the shape of each tensor and its dtype to compute how much space it will take in memory. Accelerate provides a function to automatically determine a *device map* from an empty model. It will try to maximize the use of all available GPUs, then CPU RAM, and finally flag the weights that don't fit for disk offload. Let's have a look using [OPT-13b]( ```python from accelerate import infer_auto_device_map, init_empty_weights from transformers import AutoConfig, AutoModelForCausalLM config = AutoConfig.from_pretrained(\"facebook/opt-13b\") with init_empty_weights(): model = AutoModelForCausalLM.from_config(config) device_map = infer_auto_device_map(model) ``` This will return a dictionary mapping modules or weights to a device. On a machine with one Titan RTX for instance, we get the following: ```python out {'model.decoder.embed_tokens': 0, 'model.decoder.embed_positions': 0, 'model.decoder.final_layer_norm': 0, 'model.decoder.layers.0': 0, 'model.decoder.layers.1': 0, ... 
'model.decoder.layers.9': 0, 'model.decoder.layers.10.self_attn': 0, 'model.decoder.layers.10.activation_fn': 0, 'model.decoder.layers.10.self_attn_layer_norm': 0, 'model.decoder.layers.10.fc1': 'cpu', 'model.decoder.layers.10.fc2': 'cpu', 'model.decoder.layers.10.final_layer_norm': 'cpu', 'model.decoder.layers.11': 'cpu', ... 'model.decoder.layers.17': 'cpu', 'model.decoder.layers.18.self_attn': 'cpu', 'model.decoder.layers.18.activation_fn': 'cpu', 'model.decoder.layers.18.self_attn_layer_norm': 'cpu', 'model.decoder.layers.18.fc1': 'disk', 'model.decoder.layers.18.fc2': 'disk', 'model.decoder.layers.18.final_layer_norm': 'disk', 'model.decoder.layers.19': 'disk', ... 'model.decoder.layers.39': 'disk', 'lm_head': 'disk'} ``` Accelerate evaluated that the embeddings and the decoder up until the 9th block could all fit on the GPU (device 0), then part of the 10th block needs to be on the CPU, as well as the following weights until the 17th layer. Then the 18th layer is split between the CPU and the disk and the following layers must all be offloaded to disk Actually using this device map later on won't work, because the layers composing this model have residual connections (where the input of the block is added to the output of the block) so all of a given layer should be on the same device. We can indicate this to Accelerate by passing a list of module names that shouldn't be split with the `no_split_module_classes` keyword argument: ```python device_map = infer_auto_device_map(model, no_split_module_classes=[\"OPTDecoderLayer\"]) ``` This will then return ```python out 'model.decoder.embed_tokens': 0, 'model.decoder.embed_positions': 0, 'model.decoder.final_layer_norm': 0, 'model.decoder.layers.0': 0, 'model.decoder.layers.1': 0, ... 'model.decoder.layers.9': 0, 'model.decoder.layers.10': 'cpu', 'model.decoder.layers.11': 'cpu', ... 'model.decoder.layers.17': 'cpu', 'model.decoder.layers.18': 'disk', ... 'model.decoder.layers.39': 'disk', 'lm_head': 'disk'} ``` Now, each layer is always on the same device. In Transformers, when using `device_map` in the `from_pretrained()` method or in a `pipeline`, those classes of blocks to leave on the same device are automatically provided, so you don't need to worry about them. Note that you have the following options for `device_map` (only relevant when you have more than one GPU): - `\"auto\"` or `\"balanced\"`: Accelerate will split the weights so that each GPU is used equally; - `\"balanced_low_0\"`: Accelerate will split the weights so that each GPU is used equally except the first one, where it will try to have as little weights as possible (useful when you want to work with the outputs of the model on one GPU, for instance when using the `generate` function); - `\"sequential\"`: Accelerate will fill the GPUs in order (so the last ones might not be used at all). You can also pass your own `device_map` as long as it follows the format we saw before (dictionary layer/module names to device). Finally, note that the results of the `device_map` you receive depend on the selected dtype (as different types of floats take a different amount of space). 
Providing `dtype=\"float16\"` will give us different results: ```python device_map = infer_auto_device_map(model, no_split_module_classes=[\"OPTDecoderLayer\"], dtype=\"float16\") ``` In this precision, we can fit the model up to layer 21 on the GPU: ```python out {'model.decoder.embed_tokens': 0, 'model.decoder.embed_positions': 0, 'model.decoder.final_layer_norm': 0, 'model.decoder.layers.0': 0, 'model.decoder.layers.1': 0, ... 'model.decoder.layers.21': 0, 'model.decoder.layers.22': 'cpu', ... 'model.decoder.layers.37': 'cpu', 'model.decoder.layers.38': 'disk', 'model.decoder.layers.39': 'disk', 'lm_head': 'disk'} ``` Now that we know where each weight is supposed to go, we can progressively load the pretrained weights inside the model. ## Sharding state dicts Traditionally, PyTorch models are saved in a whole file containing a map from parameter name to weight. This map is often called a `state_dict`. Here is an excerpt from the [PyTorch documentation]( on saving on loading: ```python # Save the model weights torch.save(my_model.state_dict(), 'model_weights.pth') # Reload them new_model = ModelClass() new_model.load_state_dict(torch.load('model_weights.pth')) ``` This works pretty well for models with less than 1 billion parameters, but for larger models, this is very taxing in RAM. The BLOOM model has 176 billions parameters; even with the weights saved in bfloat16 to save space, it still represents 352GB as a whole. While the super computer that trained this model might have this amount of memory available, requiring this for inference is unrealistic. This is why large models on the Hugging Face Hub are not saved and shared with one big file containing all the weights, but **several** of them. If you go to the [BLOOM model page]( for instance, you will see there is 72 files named `pytorch_model_xxxxx-of-00072.bin`, which each contain part of the model weights. Using this format, we can load one part of the state dict in memory, put the weights inside the model, move them on the right device, then discard this state dict part before going to the next. Instead of requiring to have enough RAM to accommodate the whole model, we only need enough RAM to get the biggest checkpoint part, which we call a **shard**, so 7.19GB in the case of BLOOM. We call the checkpoints saved in several files like BLOOM *sharded checkpoints*, and we have standardized their format as such: - One file (called `pytorch_model.bin.index.json`) contains some metadata and a map parameter name to file name, indicating where to find each weight - All the other files are standard PyTorch state dicts, they just contain a part of the model instead of the whole one. You can have a look at the content of the index file [here]( To load such a sharded checkpoint into a model, we just need to loop over the various shards. 
Accelerate provides a function called `load_checkpoint_in_model` that will do this for you if you have cloned one of the repos of the Hub, or you can directly use the `from_pretrained` method of Transformers, which will handle the downloading and caching for you: ```python import torch from transformers import AutoModelForCausalLM # Will error checkpoint = \"facebook/opt-13b\" model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=\"auto\", torch_dtype=torch.float16) ``` If the device map computed automatically requires some weights to be offloaded on disk because you don't have enough GPU and CPU RAM, you will get an error indicating you need to pass a folder where the weights that should be stored on disk will be offloaded: ```python out ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. ``` Adding this argument should resolve the error: ```python import torch from transformers import AutoModelForCausalLM # Will go out of RAM on Colab checkpoint = \"facebook/opt-13b\" model = AutoModelForCausalLM.from_pretrained( checkpoint, device_map=\"auto\", offload_folder=\"offload\", torch_dtype=torch.float16 ) ``` Note that if you are trying to load a very large model that requires some disk offload on top of CPU offload, you might run out of RAM when the last shards of the checkpoint are loaded, since the part of the model staying on the CPU takes up space. If that is the case, use the option `offload_state_dict=True` to temporarily offload the part of the model staying on CPU while the weights are all loaded, and reload it in RAM once all the weights have been processed. ```python import torch from transformers import AutoModelForCausalLM checkpoint = \"facebook/opt-13b\" model = AutoModelForCausalLM.from_pretrained( checkpoint, device_map=\"auto\", offload_folder=\"offload\", offload_state_dict = True, torch_dtype=torch.float16 ) ``` This will fit in Colab, but will be so close to using all the RAM available that it will go out of RAM when you try to generate a prediction. To get a model we can use, we need to offload one more layer on the disk. We can do so by taking the `device_map` computed in the previous section, adapting it a bit, then passing it to the `from_pretrained` call: ```python import torch from transformers import AutoModelForCausalLM checkpoint = \"facebook/opt-13b\" device_map[\"model.decoder.layers.37\"] = \"disk\" model = AutoModelForCausalLM.from_pretrained( checkpoint, device_map=device_map, offload_folder=\"offload\", offload_state_dict = True, torch_dtype=torch.float16 ) ``` ## Running a model split on several devices One last part we haven't touched is how Accelerate enables your model to run with its weights spread across several GPUs, CPU RAM, and the disk folder. This is done very simply using hooks. > [hooks]( are a PyTorch API that adds functions executed just before each forward call We couldn't use this directly since they only support models with regular arguments and no keyword arguments in their forward pass, but we took the same idea. Once the model is loaded, the `dispatch_model` function will add hooks to every module and submodule that are executed before and after each forward pass. 
They will: - make sure all the inputs of the module are on the same device as the weights; - if the weights have been offloaded to the CPU, move them to GPU 0 before the forward pass and back to the CPU just after; - if the weights have been offloaded to disk, load them into RAM and then onto GPU 0 before the forward pass, and free this memory just after. The whole process is summarized in the following video: This way, your model can be loaded and run even if you don't have enough GPU RAM and CPU RAM. The only thing you need is disk space (and lots of patience!). While this solution is pretty naive if you have multiple GPUs (there is no clever pipeline parallelism involved, just using the GPUs sequentially), it still yields [pretty decent results for BLOOM]( And it allows you to run the model on smaller setups (albeit more slowly). To learn more about Accelerate big model inference, see the [documentation](
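The snippet below is a simplified illustration of this hook mechanism using the plain PyTorch `register_forward_pre_hook` API. It is not Accelerate's actual implementation (which, as noted above, also has to handle keyword arguments and weight offloading); it only shows the core idea of moving inputs to the device where the weights live:

```python
import torch
from torch import nn

def align_device_hook(module, inputs):
    # Move positional tensor inputs to the device of the module's first parameter.
    param = next(module.parameters(), None)
    if param is None:
        return inputs
    return tuple(
        x.to(param.device) if isinstance(x, torch.Tensor) else x for x in inputs
    )

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for submodule in model.modules():
    submodule.register_forward_pre_hook(align_device_hook)

# Inputs are now moved automatically to wherever each layer's weights live.
output = model(torch.randn(1, 4))
```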
{"title": "accelerate-library.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing Accelerate\" thumbnail: /blog/assets/20_accelerate_library/accelerate_diff.png authors: - user: sgugger --- # Introducing Accelerate ## Accelerate Run your **raw** PyTorch training scripts on any kind of device. Most high-level libraries above PyTorch provide support for distributed training and mixed precision, but the abstraction they introduce require a user to learn a new API if they want to customize the underlying training loop. Accelerate was created for PyTorch users who like to have full control over their training loops but are reluctant to write (and maintain) the boilerplate code needed to use distributed training (for multi-GPU on one or several nodes, TPUs, ...) or mixed precision training. Plans forward include support for fairscale, deepseed, AWS SageMaker specific data-parallelism and model parallelism. It provides two things: a simple and consistent API that abstracts that boilerplate code and a launcher command to easily run those scripts on various setups. ### Easy integration! Let's first have a look at an example: ```diff import torch import torch.nn.functional as F from datasets import load_dataset + from accelerate import Accelerator + accelerator = Accelerator() - device = 'cpu' + device = accelerator.device model = torch.nn.Transformer().to(device) optim = torch.optim.Adam(model.parameters()) dataset = load_dataset('my_dataset') data = torch.utils.data.DataLoader(dataset, shuffle=True) + model, optim, data = accelerator.prepare(model, optim, data) model.train() for epoch in range(10): for source, targets in data: source = source.to(device) targets = targets.to(device) optimizer.zero_grad() output = model(source) loss = F.cross_entropy(output, targets) - loss.backward() + accelerator.backward(loss) optimizer.step() ``` By just adding five lines of code to any standard PyTorch training script, you can now run said script on any kind of distributed setting, as well as with or without mixed precision. 
Accelerate even handles the device placement for you, so you can simplify the training loop above even further: ```diff import torch import torch.nn.functional as F from datasets import load_dataset + from accelerate import Accelerator + accelerator = Accelerator() - device = 'cpu' - model = torch.nn.Transformer().to(device) + model = torch.nn.Transformer() optim = torch.optim.Adam(model.parameters()) dataset = load_dataset('my_dataset') data = torch.utils.data.DataLoader(dataset, shuffle=True) + model, optim, data = accelerator.prepare(model, optim, data) model.train() for epoch in range(10): for source, targets in data: - source = source.to(device) - targets = targets.to(device) optim.zero_grad() output = model(source) loss = F.cross_entropy(output, targets) - loss.backward() + accelerator.backward(loss) optim.step() ``` In contrast, here are the changes needed to have this code run with distributed training: ```diff + import os import torch import torch.nn.functional as F from datasets import load_dataset + from torch.utils.data import DistributedSampler + from torch.nn.parallel import DistributedDataParallel + local_rank = int(os.environ.get(\"LOCAL_RANK\", -1)) - device = 'cpu' + device = torch.device(\"cuda\", local_rank) model = torch.nn.Transformer().to(device) + model = DistributedDataParallel(model) optim = torch.optim.Adam(model.parameters()) dataset = load_dataset('my_dataset') + sampler = DistributedSampler(dataset) - data = torch.utils.data.DataLoader(dataset, shuffle=True) + data = torch.utils.data.DataLoader(dataset, sampler=sampler) model.train() for epoch in range(10): + sampler.set_epoch(epoch) for source, targets in data: source = source.to(device) targets = targets.to(device) optim.zero_grad() output = model(source) loss = F.cross_entropy(output, targets) loss.backward() optim.step() ``` These changes will make your training script work for multiple GPUs, but your script will then stop working on CPU or one GPU (unless you start adding if statements everywhere). Even more annoying, if you wanted to test your script on TPUs you would need to change different lines of code. Same for mixed precision training. The promise of Accelerate is: - to keep the changes to your training loop to the bare minimum so you have to learn as little as possible. - to have the same functions work for any distributed setup, so you only have to learn one API. ### How does it work? To see how the library works in practice, let's have a look at each line of code we need to add to a training loop. ```python accelerator = Accelerator() ``` On top of giving the main object that you will use, this line will analyze the environment to detect the type of distributed training run and perform the necessary initialization. You can force training on CPU or mixed precision training by passing `cpu=True` or `fp16=True` to this init. Both of those options can also be set using the launcher for your script. ```python model, optim, data = accelerator.prepare(model, optim, data) ``` This is the main bulk of the API and will prepare the three main types of objects: models (`torch.nn.Module`), optimizers (`torch.optim.Optimizer`) and dataloaders (`torch.utils.data.DataLoader`). #### Model Model preparation includes wrapping it in the proper container (for instance `DistributedDataParallel`) and putting it on the proper device.
Like with regular distributed training, you will need to unwrap your model for saving, or to access its specific methods, which can be done with `accelerator.unwrap_model(model)`. #### Optimizer The optimizer is also wrapped in a special container that will perform the necessary operations in the step to make mixed precision work. It will also properly handle device placement of the state dict if it's non-empty or loaded from a checkpoint. #### DataLoader This is where most of the magic is hidden. As you have seen in the code example, the library does not rely on a `DistributedSampler`; it will actually work with any sampler you might pass to your dataloader (if you ever had to write a distributed version of your custom sampler, there is no more need for that!). The dataloader is wrapped in a container that will only grab the indices relevant to the current process in the sampler (or skip the batches for the other processes if you use an `IterableDataset`) and put the batches on the proper device. For this to work, Accelerate provides a utility function that will synchronize the random number generators on each of the processes run during distributed training. By default, it only synchronizes the `generator` of your sampler, so your data augmentation will be different on each process, but the random shuffling will be the same. You can of course use this utility to synchronize more RNGs if you need it. ```python accelerator.backward(loss) ``` This last line adds the necessary steps for the backward pass (mostly for mixed precision, but other integrations will require some custom behavior here). ### What about evaluation? Evaluation can either be run normally on all processes, or if you just want it to run on the main process, you can use the handy test: ```python if accelerator.is_main_process(): # Evaluation loop ``` But you can also very easily run a distributed evaluation using Accelerate; here is what you would need to add to your evaluation loop: ```diff + eval_dataloader = accelerator.prepare(eval_dataloader) predictions, labels = [], [] for source, targets in eval_dataloader: with torch.no_grad(): output = model(source) - predictions.append(output.cpu().numpy()) - labels.append(targets.cpu().numpy()) + predictions.append(accelerator.gather(output).cpu().numpy()) + labels.append(accelerator.gather(targets).cpu().numpy()) predictions = np.concatenate(predictions) labels = np.concatenate(labels) + predictions = predictions[:len(eval_dataloader.dataset)] + labels = labels[:len(eval_dataloader.dataset)] metric_compute(predictions, labels) ``` Like for the training, you need to add one line to prepare your evaluation dataloader. Then you can just use `accelerator.gather` to gather across processes the tensors of predictions and labels. The last lines to add truncate the predictions and labels to the number of examples in your dataset because the prepared evaluation dataloader will return a few more elements to make sure batches all have the same size on each process. ### One launcher to rule them all The scripts using Accelerate will be completely compatible with your traditional launchers, such as `torch.distributed.launch`. But remembering all the arguments to them is a bit annoying, and when you've set up your instance with 4 GPUs, you'll run most of your trainings using them all.
Accelerate comes with a handy CLI that works in two steps: ```bash accelerate config ``` This will trigger a little questionnaire about your setup, which will create a config file you can edit with all the defaults for your training commands. Then ```bash accelerate launch path_to_script.py --args_to_the_script ``` will launch your training script using those defaults. The only thing you have to do is provide all the arguments needed by your training script. To make this launcher even more awesome, you can use it to spawn an AWS instance using SageMaker. Look at [this guide]( to discover how! ### How to get involved? To get started, just `pip install accelerate` or see the [documentation]( for more install options. Accelerate is a fully open-source project: you can find it on [GitHub]( have a look at its [documentation]( or skim through our [basic examples]( Please let us know if you have any issue or feature you would like the library to support. For all questions, the [forums]( are the place to check! For more complex examples in real situations, you can look at the official [Transformers examples]( Each folder contains a `run_task_no_trainer.py` that leverages the Accelerate library!"}
{"title": "accelerate-transformers-with-inferentia2.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Accelerating Hugging Face Transformers with AWS Inferentia2\" thumbnail: /blog/assets/140_accelerate_transformers_with_inferentia2/thumbnail.png authors: - user: philschmid - user: juliensimon --- # Accelerating Hugging Face Transformers with AWS Inferentia2 In the last five years, Transformer models [[1]( have become the _de facto_ standard for many machine learning (ML) tasks, such as natural language processing (NLP), computer vision (CV), speech, and more. Today, many data scientists and ML engineers rely on popular transformer architectures like BERT [[2]( RoBERTa [[3]( the Vision Transformer [[4]( or any of the 130,000+ pre-trained models available on the [Hugging Face]( hub to solve complex business problems with state-of-the-art accuracy. However, for all their greatness, Transformers can be challenging to deploy in production. On top of the infrastructure plumbing typically associated with model deployment, which we largely solved with our [Inference Endpoints]( service, Transformers are large models which routinely exceed the multi-gigabyte mark. Large language models (LLMs) like [GPT-J-6B]( [Flan-T5]( or [Opt-30B]( are in the tens of gigabytes, not to mention behemoths like [BLOOM]( our very own LLM, which clocks in at 350 gigabytes. Fitting these models on a single accelerator can be quite difficult, let alone getting the high throughput and low inference latency that applications require, like conversational applications and search. So far, ML experts have designed complex manual techniques to slice large models, distribute them on a cluster of accelerators, and optimize their latency. Unfortunately, this work is extremely difficult, time-consuming, and completely out of reach for many ML practitioners. At Hugging Face, we're democratizing ML and always looking to partner with companies who also believe that every developer and organization should benefit from state-of-the-art models. For this purpose, we're excited to partner with Amazon Web Services to optimize Hugging Face Transformers for AWS [Inferentia 2]( It\u2019s a new purpose-built inference accelerator that delivers unprecedented levels of throughput, latency, performance per watt, and scalability. ## Introducing AWS Inferentia2 AWS Inferentia2 is the next generation to Inferentia1 launched in 2019. Powered by Inferentia1, Amazon EC2 Inf1 instances delivered 25% higher throughput and 70% lower cost than comparable G5 instances based on NVIDIA A10G GPU, and with Inferentia2, AWS is pushing the envelope again. The new Inferentia2 chip delivers a 4x throughput increase and a 10x latency reduction compared to Inferentia. Likewise, the new [Amazon EC2 Inf2]( instances have up to 2.6x better throughput, 8.1x lower latency, and 50% better performance per watt than comparable G5 instances. Inferentia 2 gives you the best of both worlds: cost-per-inference optimization thanks to high throughput and response time for your application thanks to low inference latency. Inf2 instances are available in multiple sizes, which are equipped with between 1 to 12 Inferentia 2 chips. When several chips are present, they are interconnected by a blazing-fast direct Inferentia2 to Inferentia2 connectivity for distributed inference on large models. For example, the largest instance size, inf2.48xlarge, has 12 chips and enough memory to load a 175-billion parameter model like GPT-3 or BLOOM. 
Thankfully none of this comes at the expense of development complexity. With [optimum neuron]( you don't need to slice or modify your model. Because of the native integration in [AWS Neuron SDK]( all it takes is a single line of code to compile your model for Inferentia 2. You can experiment in minutes! Test the performance your model could reach on Inferentia 2 and see for yourself. Speaking of, let\u2019s show you how several Hugging Face models run on Inferentia 2. Benchmarking time! ## Benchmarking Hugging Face Models on AWS Inferentia 2 We evaluated some of the most popular NLP models from the [Hugging Face Hub]( including BERT, RoBERTa, DistilBERT, and vision models like Vision Transformers. The first benchmark compares the performance of Inferentia, Inferentia 2, and GPUs. We ran all experiments on AWS with the following instance types: * Inferentia1 - [inf1.2xlarge]( powered by a single Inferentia chip. * Inferentia2 - [inf2.xlarge]( powered by a single Inferentia2 chip. * GPU - [g5.2xlarge]( powered by a single NVIDIA A10G GPU. _Note: that we did not optimize the model for the GPU environment, the models were evaluated in fp32._ When it comes to benchmarking Transformer models, there are two metrics that are most adopted: * **Latency**: the time it takes for the model to perform a single prediction (pre-process, prediction, post-process). * **Throughput**: the number of executions performed in a fixed amount of time for one benchmark configuration We looked at latency across different setups and models to understand the benefits and tradeoffs of the new Inferentia2 instance. If you want to run the benchmark yourself, we created a [Github repository]( with all the information and scripts to do so. ### Results The benchmark confirms that the performance improvements claimed by AWS can be reproduced and validated by real use-cases and examples. On average, AWS Inferentia2 delivers 4.5x better latency than NVIDIA A10G GPUs and 4x better latency than Inferentia1 instances. We ran 144 experiments on 6 different model architectures: * Accelerators: Inf1, Inf2, NVIDIA A10G * Models: [BERT-base]( [BERT-Large]( [RoBERTa-base]( [DistilBERT]( [ALBERT-base]( [ViT-base]( * Sequence length: 8, 16, 32, 64, 128, 256, 512 * Batch size: 1 In each experiment, we collected numbers for p95 latency. You can find the full details of the benchmark in this spreadsheet: [HuggingFace: Benchmark Inferentia2]( Let\u2019s highlight a few insights of the benchmark. ### BERT-base Here is the latency comparison for running [BERT-base]( on each of the infrastructure setups, with a logarithmic scale for latency. It is remarkable to see how Inferentia2 outperforms all other setups by ~6x for sequence lengths up to 256. Figure 1. BERT-base p95 latency ### Vision Transformer Here is the latency comparison for running [ViT-base]( on the different infrastructure setups. Inferentia2 delivers 2x better latency than the NVIDIA A10G, with the potential to greatly help companies move from traditional architectures, like CNNs, to Transformers for - real-time applications. Figure 1. ViT p95 latency ## Conclusion Transformer models have emerged as the go-to solution for many machine learning tasks. However, deploying them in production has been challenging due to their large size and latency requirements. 
Thanks to AWS Inferentia2 and the collaboration between Hugging Face and AWS, developers and organizations can now leverage the benefits of state-of-the-art models without the prior need for extensive machine learning expertise. You can start testing for as low as $0.76/hour. The initial benchmarking results are promising, and show that Inferentia2 delivers superior latency performance when compared to both Inferentia and NVIDIA A10G GPUs. This latest breakthrough promises that high-quality machine learning models can be made available to a much broader audience, delivering AI accessibility to everyone."}
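For reference, here is a rough, hardware-agnostic sketch of how a p95 latency figure like the ones reported above can be measured. The model name is only an example, and the scripts actually used for the benchmark are in the GitHub repository linked in the post:

```python
import time

import numpy as np
from transformers import pipeline

classifier = pipeline(\"text-classification\", model=\"distilbert-base-uncased-finetuned-sst-2-english\")
payload = \"This is a test sentence for latency benchmarking.\"

latencies = []
for _ in range(300):
    start = time.perf_counter()
    classifier(payload)
    latencies.append(time.perf_counter() - start)

# p95 latency in milliseconds over the measured runs
print(f\"p95 latency: {np.percentile(latencies, 95) * 1000:.1f} ms\")
```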
{"title": "accelerated-inference.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"How we sped up transformer inference 100x for API customers\" thumbnail: /blog/assets/09_accelerated_inference/thumbnail.png --- How we sped up transformer inference 100x for API customers Transformers has become the default library for data scientists all around the world to explore state of the art NLP models and build new NLP features. With over 5,000 pre-trained and fine-tuned models available, in over 250 languages, it is a rich playground, easily accessible whichever framework you are working in. While experimenting with models in Transformers is easy, deploying these large models into production with maximum performance, and managing them into an architecture that scales with usage is a **hard engineering challenge** for any Machine Learning Engineer. This 100x performance gain and built-in scalability is why subscribers of our hosted [Accelerated Inference API]( chose to build their NLP features on top of it. To get to the **last 10x of performance** boost, the optimizations need to be low-level, specific to the model, and to the target hardware. This post shares some of our approaches squeezing every drop of compute juice for our customers. ## Getting to the first 10x speedup The first leg of the optimization journey is the most accessible, all about using the best combination of techniques offered by the [Hugging Face libraries]( independent of the target hardware. We use the most efficient methods built into Hugging Face model [pipelines]( to reduce the amount of computation during each forward pass. These methods are specific to the architecture of the model and the target task, for instance for a text-generation task on a GPT architecture, we reduce the dimensionality of the attention matrices computation by focusing on the new attention of the last token in each pass: -| Naive version | Optimized version | -||| -||| Tokenization is often a bottleneck for efficiency during inference. We use the most efficient methods from the [ Tokenizers]( library, leveraging the Rust implementation of the model tokenizer in combination with smart caching to get up to 10x speedup for the overall latency. Leveraging the latest features of the Hugging Face libraries, we achieve a reliable 10x speed up compared to an out-of-box deployment for a given model/hardware pair. As new releases of Transformers and Tokenizers typically ship every month, our API customers do not need to constantly adapt to new optimization opportunities, their models just keep running faster. ## Compilation FTW: the hard to get 10x Now this is where it gets really tricky. In order to get the best possible performance we will need to modify the model and compile it targeting the specific hardware for inference. The choice of hardware itself will depend on both the model (size in memory) and the demand profile (request batching). Even when serving predictions from the same model, some API customers may benefit more from Accelerated CPU inference, and others from Accelerated GPU inference, each with different optimization techniques and libraries applied. Once the compute platform has been selected for the use case, we can go to work. Here are some CPU-specific techniques that can be applied with a static graph: - Optimizing the graph (Removing unused flow) - Fusing layers (with specific CPU instructions) - Quantizing the operations Using out-of-box functions from open source libraries (e.g. 
Transformers with [ONNX Runtime]( won\u2019t produce the best results, or could result in a significant loss of accuracy, particularly during quantization. There is no silver bullet, and the best path is different for each model architecture. But diving deep into the Transformers code and ONNX Runtime documentation, the stars can be aligned to achieve another 10x speedup. ## Unfair advantage The Transformer architecture was a decisive inflection point for Machine Learning performance, starting with NLP, and over the last 3 years the rate of improvement in Natural Language Understanding and Generation has been steep and accelerating. Another metric which accelerated accordingly, is the average size of the models, from the 110M parameters of BERT to the now 175Bn of GPT-3. This trend has introduced daunting challenges for Machine Learning Engineers when deploying the latest models into production. While 100x speedup is a high bar to reach, that\u2019s what it takes to serve predictions with acceptable latency in real-time consumer applications. To reach that bar, as Machine Learning Engineers at Hugging Face we certainly have an unfair advantage sitting in the same (virtual) offices as the Transformers and Tokenizers maintainers . We are also extremely lucky for the rich partnerships we have developed through open source collaborations with hardware and cloud vendors like Intel, NVIDIA, Qualcomm, Amazon and Microsoft that enable us to tune our models x infrastructure with the latest hardware optimizations techniques. If you want to feel the speed on our infrastructure, start a [free trial]( and we\u2019ll get in touch. If you want to benefit from our experience optimizing inference on your own infrastructure participate in our [ Expert Acceleration Program]("}
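As one concrete instance of the quantization step listed above, here is a hedged sketch using dynamic quantization from ONNX Runtime. It assumes the model has already been exported to ONNX as `model.onnx` (an illustrative file name), and, as noted in the post, accuracy should always be re-validated after quantization:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the weights of an exported ONNX model to int8.
quantize_dynamic(
    \"model.onnx\",            # input: full-precision ONNX model
    \"model-quantized.onnx\",  # output: dynamically quantized model
    weight_type=QuantType.QInt8,
)
```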
{"title": "accelerating-pytorch.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Accelerating PyTorch distributed fine-tuning with Intel technologies\" thumbnail: /blog/assets/36_accelerating_pytorch/04_four_nodes.png authors: - user: juliensimon --- # Accelerating PyTorch distributed fine-tuning with Intel technologies For all their amazing performance, state of the art deep learning models often take a long time to train. In order to speed up training jobs, engineering teams rely on distributed training, a divide-and-conquer technique where clustered servers each keep a copy of the model, train it on a subset of the training set, and exchange results to converge to a final model. Graphical Processing Units (GPUs) have long been the _de facto_ choice to train deep learning models. However, the rise of transfer learning is changing the game. Models are now rarely trained from scratch on humungous datasets. Instead, they are frequently fine-tuned on specific (and smaller) datasets, in order to build specialized models that are more accurate than the base model for particular tasks. As these training jobs are much shorter, using a CPU-based cluster can prove to be an interesting option that keeps both training time and cost under control. ### What this post is about In this post, you will learn how to accelerate [PyTorch]( training jobs by distributing them on a cluster of Intel Xeon Scalable CPU servers, powered by the Ice Lake architecture and running performance-optimized software libraries. We will build the cluster from scratch using virtual machines, and you should be able to easily replicate the demo on your own infrastructure, either in the cloud or on premise. Running a text classification job, we will fine-tune a [BERT]( model on the [MRPC]( dataset (one of the tasks included in the [GLUE]( benchmark). The MRPC dataset contains 5,800 sentence pairs extracted from news sources, with a label telling us whether the two sentences in each pair are semantically equivalent. We picked this dataset for its reasonable training time, and trying other GLUE tasks is just a parameter away. Once the cluster is up and running, we will run a baseline job on a single server. Then, we will scale it to 2 servers and 4 servers and measure the speed-up. Along the way, we will cover the following topics: * Listing the required infrastructure and software building blocks, * Setting up our cluster, * Installing dependencies, * Running a single-node job, * Running a distributed job. Let's get to work! ### Using Intel servers For best performance, we will use Intel servers based on the Ice Lake architecture, which supports hardware features such as Intel AVX-512 and Intel Vector Neural Network Instructions (VNNI). These features accelerate operations typically found in deep learning training and inference. You can learn more about them in this [presentation]( (PDF). All three major cloud providers offer virtual machines powered by Intel Ice Lake CPUs: - Amazon Web Services: Amazon EC2 [M6i]( and [C6i]( instances. - Azure: [Dv5/Dsv5-series]( [Ddv5/Ddsv5-series]( and [Edv5/Edsv5-series]( virtual machines. - Google Cloud Platform: [N2]( Compute Engine virtual machines. Of course, you can also use your own servers. If they are based on the Cascade Lake architecture (Ice Lake's predecessor), they're good to go as Cascade Lake also includes AVX-512 and VNNI. 
### Using Intel performance libraries To leverage AVX-512 and VNNI in PyTorch, Intel has designed the [Intel extension for PyTorch]( This software library provides out of the box speedup for training and inference, so we should definitely install it. When it comes to distributed training, the main performance bottleneck is often networking. Indeed, the different nodes in the cluster need to periodically exchange model state information to stay in sync. As transformers are large models with billions of parameters (sometimes much more), the volume of information is significant, and things only get worse as the number of nodes increase. Thus, it's important to use a communication library optimized for deep learning. In fact, PyTorch includes the [```torch.distributed```]( package, which supports different communication backends. Here, we'll use the Intel oneAPI Collective Communications Library [(oneCCL)]( an efficient implementation of communication patterns used in deep learning ([all-reduce]( etc.). You can learn about the performance of oneCCL versus other backends in this PyTorch [blog post]( Now that we're clear on building blocks, let's talk about the overall setup of our training cluster. ### Setting up our cluster In this demo, I'm using Amazon EC2 instances running Amazon Linux 2 (c6i.16xlarge, 64 vCPUs, 128GB RAM, 25Gbit/s networking). Setup will be different in other environments, but steps should be very similar. Please keep in mind that you will need 4 identical instances, so you may want to plan for some sort of automation to avoid running the same setup 4 times. Here, I will set up one instance manually, create a new Amazon Machine Image [(AMI)]( from this instance, and use this AMI to launch three identical instances. From a networking perspective, we will need the following setup: * Open port 22 for ```ssh``` access on all instances for setup and debugging. * Configure [password-less]( ```ssh``` between the master instance (the one you'll launch training from) and all other instances (__master included__). * Open all TCP ports on all instances for oneCCL communication inside the cluster. __Please make sure NOT to open these ports to the external world__. AWS provides a convenient way to do this by only allowing connections from instances running a particular [security group]( Here's how my setup looks. Now, let's provision the first instance manually. I first create the instance itself, attach the security group above, and add 128GB of storage. To optimize costs, I have launched it as a [spot instance]( Once the instance is up, I connect to it with ```ssh``` in order to install dependencies. ### Installing dependencies Here are the steps we will follow: * Install Intel toolkits, * Install the Anaconda distribution, * Create a new ```conda``` environment, * Install PyTorch and the Intel extension for PyTorch, * Compile and install oneCCL, * Install the ```transformers``` library. It looks like a lot, but there's nothing complicated. Here we go! __Installing Intel toolkits__ First, we download and install the Intel [OneAPI base toolkit]( as well as the [AI toolkit]( You can learn about them on the Intel [website]( ``` wget sudo bash l_BaseKit_p_2021.4.0.3422_offline.sh wget sudo bash l_AIKit_p_2021.4.0.1460_offline.sh ``` __Installing Anaconda__ Then, we [download]( and install the Anaconda distribution. ``` wget sh Anaconda3-2021.05-Linux-x86_64.sh ``` __Creating a new conda environment__ We log out and log in again to refresh paths. 
Then, we create a new ```conda``` environment to keep things neat and tidy. ``` yes | conda create -n transformer python=3.7.9 -c anaconda eval \"$(conda shell.bash hook)\" conda activate transformer yes | conda install pip cmake ``` __Installing PyTorch and the Intel extension for PyTorch__ Next, we install PyTorch 1.9 and the Intel extension toolkit. __Versions must match__. ``` yes | conda install pytorch==1.9.0 cpuonly -c pytorch pip install torch_ipex==1.9.0 -f ``` __Compiling and installing oneCCL__ Then, we install some native dependencies required to compile oneCCL. ``` sudo yum -y update sudo yum install -y git cmake3 gcc gcc-c++ ``` Next, we clone the oneCCL repository, build the library and install it. __Again, versions must match__. ``` source /opt/intel/oneapi/mkl/latest/env/vars.sh git clone cd torch-ccl git checkout ccl_torch1.9 git submodule sync git submodule update --init --recursive python setup.py install cd .. ``` __Installing the transformers library__ Next, we install the ```transformers``` library and dependencies required to run GLUE tasks. ``` pip install transformers datasets yes | conda install scipy scikit-learn ``` Finally, we clone a fork of the ```transformers```repository containing the example we're going to run. ``` git clone cd transformers git checkout dist-sigopt ``` We're done! Let's run a single-node job. ### Launching a single-node job To get a baseline, let's launch a single-node job running the ```run_glue.py``` script in ```transformers/examples/pytorch/text-classification```. This should work on any of the instances, and it's a good sanity check before proceeding to distributed training. ``` python run_glue.py \\ --model_name_or_path bert-base-cased --task_name mrpc \\ --do_train --do_eval --max_seq_length 128 \\ --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 \\ --output_dir /tmp/mrpc/ --overwrite_output_dir True ``` This job takes __7 minutes and 46 seconds__. Now, let's set up distributed jobs with oneCCL and speed things up! ### Setting up a distributed job with oneCCL Three steps are required to run a distributed training job: * List the nodes of the training cluster, * Define environment variables, * Modify the training script. __Listing the nodes of the training cluster__ On the master instance, in ```transformers/examples/pytorch/text-classification```, we create a text file named ```hostfile```. This file stores the names of the nodes in the cluster (IP addresses would work too). The first line should point to the master instance. Here's my file: ``` ip-172-31-28-17.ec2.internal ip-172-31-30-87.ec2.internal ip-172-31-29-11.ec2.internal ip-172-31-20-77.ec2.internal ``` __Defining environment variables__ Next, we need to set some environment variables on the master node, most notably its IP address. 
You can find more information on oneCCL variables in the [documentation]( ``` for nic in eth0 eib0 hib0 enp94s0f0; do master_addr=$(ifconfig $nic 2>/dev/null | grep netmask | awk '{print $2}'| cut -f2 -d:) if [ \"$master_addr\" ]; then break fi done export MASTER_ADDR=$master_addr source /home/ec2-user/anaconda3/envs/transformer/lib/python3.7/site-packages/torch_ccl-1.3.0+43f48a1-py3.7-linux-x86_64.egg/torch_ccl/env/setvars.sh export LD_LIBRARY_PATH=/home/ec2-user/anaconda3/envs/transformer/lib/python3.7/site-packages/torch_ccl-1.3.0+43f48a1-py3.7-linux-x86_64.egg/:$LD_LIBRARY_PATH export LD_PRELOAD=\"${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libiomp5.so\" export CCL_WORKER_COUNT=4 export CCL_WORKER_AFFINITY=\"0,1,2,3,32,33,34,35\" export CCL_ATL_TRANSPORT=ofi export ATL_PROGRESS_MODE=0 ``` __Modifying the training script__ The following changes have already been applied to our training script (```run_glue.py```) in order to enable distributed training. You would need to apply similar changes when using your own training code. * Import the ```torch_ccl```package. * Receive the address of the master node and the local rank of the node in the cluster. ``` +import torch_ccl + import datasets import numpy as np from datasets import load_dataset, load_metric @@ -47,7 +49,7 @@ from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version(\"4.13.0.dev0\") +# check_min_version(\"4.13.0.dev0\") require_version(\"datasets>=1.8.0\", \"To fix: pip install -r examples/pytorch/text-classification/requirements.txt\") @@ -191,6 +193,17 @@ def main(): # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. + # add local rank for cpu-dist + sys.argv.append(\"--local_rank\") + sys.argv.append(str(os.environ.get(\"PMI_RANK\", -1))) + + # ccl specific environment variables + if \"ccl\" in sys.argv: + os.environ[\"MASTER_ADDR\"] = os.environ.get(\"MASTER_ADDR\", \"127.0.0.1\") + os.environ[\"MASTER_PORT\"] = \"29500\" + os.environ[\"RANK\"] = str(os.environ.get(\"PMI_RANK\", -1)) + os.environ[\"WORLD_SIZE\"] = str(os.environ.get(\"PMI_SIZE\", -1)) + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) if len(sys.argv) == 2 and sys.argv[1].endswith(\".json\"): ``` Setup is now complete. Let's scale our training job to 2 nodes and 4 nodes. ### Running a distributed job with oneCCL On the __master node__, I use ```mpirun```to launch a 2-node job: ```-np``` (number of processes) is set to 2 and ```-ppn``` (process per node) is set to 1. Hence, the first two nodes in ```hostfile``` will be selected. ``` mpirun -f hostfile -np 2 -ppn 1 -genv I_MPI_PIN_DOMAIN=[0xfffffff0] \\ -genv OMP_NUM_THREADS=28 python run_glue.py \\ --model_name_or_path distilbert-base-uncased --task_name mrpc \\ --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 \\ --learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/mrpc/ \\ --overwrite_output_dir True --xpu_backend ccl --no_cuda True ``` Within seconds, a job starts on the first two nodes. The job completes in __4 minutes and 39 seconds__, a __1.7x__ speedup. Setting ```-np``` to 4 and launching a new job, I now see one process running on each node of the cluster. Training completes in __2 minutes and 36 seconds__, a __3x__ speedup. One last thing. 
Changing ```--task_name``` to ```qqp```, I also ran the Quora Question Pairs GLUE task, which is based on a much larger [dataset]( (over 400,000 training samples). The fine-tuning times were: * Single-node: 11 hours 22 minutes, * 2 nodes: 6 hours and 38 minutes (1.71x), * 4 nodes: 3 hours and 51 minutes (2.95x). It looks like the speedup is pretty consistent. Feel free to keep experimenting with different learning rates, batch sizes and oneCCL settings. I'm sure you can go even faster! ### Conclusion In this post, you've learned how to build a distributed training cluster based on Intel CPUs and performance libraries, and how to use this cluster to speed up fine-tuning jobs. Indeed, transfer learning is putting CPU training back into the game, and you should definitely consider it when designing and building your next deep learning workflows. Thanks for reading this long post. I hope you found it informative. Feedback and questions are welcome at _julsimon@huggingface.co_. Until next time, keep learning! Julien"}
{"title": "ai-residency.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Announcing the AI Research Residency Program\" thumbnail: /blog/assets/57_ai_residency/residency-thumbnail.jpg authors: - user: douwekiela --- # Announcing the AI Research Residency Program The Research Residency Program is a 9-month opportunity to launch or advance your career in machine learning research . The goal of the residency is to help you grow into an impactful AI researcher. Residents will work alongside Researchers from our Science Team. Together, you will pick a research problem and then develop new machine learning techniques to solve it in an open & collaborative way, with the hope of ultimately publishing your work and making it visible to a wide audience. Applicants from all backgrounds are welcome! Ideally, you have some research experience and are excited about our mission to democratize responsible machine learning. The progress of our field has the potential to exacerbate existing disparities in ways that disproportionately hurt the most marginalized people in society \u2014 including people of color, people from working-class backgrounds, women, and LGBTQ+ people. These communities must be centered in the work we do as a research community. So we strongly encourage proposals from people whose personal experience reflects these identities.. We encourage applications relating to AI that demonstrate a clear and positive societal impact. ## How to Apply Since the focus of your work will be on developing Machine Learning techniques, your application should show evidence of programming skills and of prerequisite courses, like calculus or linear algebra, or links to an open-source project that demonstrates programming and mathematical ability. More importantly, your application needs to present interest in effecting positive change through AI in any number of creative ways. This can stem from a topic that is of particular interest to you and your proposal would capture concrete ways in which machine learning can contribute. Thinking through the entire pipeline, from understanding where ML tools are needed to gathering data and deploying the resulting approach, can help make your project more impactful. We are actively working to build a culture that values diversity, equity, and inclusivity. We are intentionally building a workplace where people feel respected and supported\u2014regardless of who you are or where you come from. We believe this is foundational to building a great company and community. Hugging Face is an equal opportunity employer and we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. [Submit your application here]( ## FAQs * **Can I complete the program part-time?**No. The Residency is only offered as a full-time position. * **I have been out of school for several years. Can I apply?**Yes. We will consider applications from various backgrounds. * **Can I be enrolled as a student at a university or work for another employer during the residency?**No, the residency can\u2019t be completed simultaneously with any other obligations. * **Will I receive benefits during the Residency?**Yes, residents are eligible for most benefits, including medical (depending on location). * **Will I be required to relocate for this residency?**Absolutely not! We are a distributed team and you are welcome to work from wherever you are currently located. 
* **Is there a deadline?** Applications close on April 3rd, 2022!"}
{"title": "aivsai.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing AI vs. AI a deep reinforcement learning multi-agents competition system\" thumbnail: /blog/assets/128_aivsai/thumbnail.png authors: - user: CarlCochet - user: ThomasSimonini --- # Introducing AI vs. AI a deep reinforcement learning multi-agents competition system We\u2019re excited to introduce a new tool we created: ** AI vs. AI , a deep reinforcement learning multi-agents competition system**. This tool, hosted on [Spaces]( allows us **to create multi-agent competitions**. It is composed of three elements: - A *Space* with a matchmaking algorithm that **runs the model fights using a background task**. - A *Dataset* **containing the results**. - A *Leaderboard* that gets the **match history results and displays the models\u2019 ELO**. Then, when a user pushes a trained model to the Hub, **it gets evaluated and ranked against others**. Thanks to that, we can evaluate your agents against other\u2019s agents in a multi-agent setting. In addition to being a useful tool for hosting multi-agent competitions, we think this tool can also be a **robust evaluation technique in multi-agent settings.** By playing against a lot of policies, your agents are evaluated against a wide range of behaviors. This should give you a good idea of the quality of your policy. Let\u2019s see how it works with our first competition host: SoccerTwos Challenge. ## How does AI vs. AI works? AI vs. AI is an open-source tool developed at Hugging Face **to rank the strength of reinforcement learning models in a multi-agent setting**. The idea is to get a **relative measure of skill rather than an objective one** by making the models play against each other continuously and use the matches results to assess their performance compared to all the other models and consequently get a view of the quality of their policy without requiring classic metrics. The more agents are submitted for a given task or environment, **the more representative the rating becomes**. To generate a rating based on match results in a competitive environment, we decided to base the rankings on the [ELO rating system]( The core concept is that after a match ends, the rating of both players are updated based on the result and the ratings they had before the game. When a user with a high rating beats one with a low ranking, they won't get many points. Likewise, the loser would not lose many points in this case. Conversely, if a low-rated player wins in an upset against a high-rated player, it will cause a more significant effect on both of their ratings. In our context, we **kept the system as simple as possible by not adding any alteration to the quantities gained or lost based on the starting ratings of the player**. As such, gain and loss will always be the perfect opposite (+10 / -10, for instance), and the average ELO rating will stay constant at the starting rating. The choice of a 1200 ELO rating start is entirely arbitrary. If you want to learn more about ELO and see some calculation example, we wrote an explanation in our Deep Reinforcement Learning Course [here]( Using this rating, it is possible **to generate matches between models with comparable strengths automatically**. There are several ways you can go about creating a matchmaking system, but here we decided to keep it fairly simple while guaranteeing a minimum amount of diversity in the matchups and also keeping most matches with fairly close opposing ratings. 
Here's how the algorithm works: 1. Gather all the available models on the Hub. New models get a starting rating of 1200, while others keep the rating they have gained/lost through their previous matches. 2. Create a queue from all these models. 3. Pop the first element (model) from the queue, and then pop another random model in this queue from the n models with the closest ratings to the first model. 4. Simulate this match by loading both models in the environment (a Unity executable, for instance) and gathering the results. For this implementation, we sent the results to a Hugging Face Dataset on the Hub. 5. Compute the new rating of both models based on the received result and the ELO formula. 6. Continue popping models two by two and simulating the matches until only one or zero models are in the queue. 7. Save the resulting ratings and go back to step 1. To run this matchmaking process continuously, we use **free Hugging Face Spaces hardware with a Scheduler** to keep running the matchmaking process as a background task. The Space is also used to fetch the ELO ratings of the models that have already played and, from them, display [a leaderboard]( **from which everyone can check the progress of the models**. The process generally uses several Hugging Face Datasets to provide data persistence (here, match history and model ratings). Since the process also saves the match history, it is possible to see precisely the results of any given model. This can, for instance, allow you to check why your model struggles with another one, most notably using another demo Space to visualize matches like [this one]( For now, **this experiment is running with the MLAgent environment SoccerTwos for the Hugging Face Deep RL Course**; however, the process and implementation, in general, are very much **environment agnostic and could be used to evaluate for free a wide range of adversarial multi-agent settings**. Of course, it is important to remind again that this evaluation is a relative rating between the strengths of the submitted agents, and the ratings by themselves **have no objective meaning contrary to other metrics**. A rating only represents how well a model performs compared to the other models in the pool. Still, given a large and varied enough pool of models (and enough matches played), this evaluation becomes a very solid way to represent the general performance of a model. ## Our first AI vs. AI challenge experimentation: SoccerTwos Challenge This challenge is Unit 7 of our [free Deep Reinforcement Learning Course]( It started on February 1st and will end on April 30th. If you\u2019re interested, **you don\u2019t need to participate in the course to be able to participate in the competition. You can start here** In this Unit, readers learned the basics of multi-agent reinforcement learning (MARL) by training a **2vs2 soccer team.** The environment used was made by the [Unity ML-Agents team]( The goal is simple: your team needs to score a goal. To do that, they need to beat the opponent's team and collaborate with their teammate. In addition to the leaderboard, we created a Space demo where people can choose two teams and visualize them playing. This experimentation is going well since we already have 48 models on the [leaderboard](  - Be enrolled in an accredited college or university - Be a user of the Hugging Face Hub and/or the Hugging Face\u2019s libraries - Acknowledge the [Code of Conduct]( Community is at the center of the Hugging Face ecosystem.
Because of that, we strictly adhere to our [Code of conduct]( If any ambassador infringes it or behaves inadequately, they will be excluded from the Program. **[Apply here]( to become an ambassador!** **Timeline:** - Deadline for the end of the [application]( is June 13. - The Program will start on June 30, 2022. - The Program will end on December 31, 2022."}
{"title": "annotated-diffusion.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: The Annotated Diffusion Model thumbnail: /blog/assets/78_annotated-diffusion/thumbnail.png authors: - user: nielsr - user: kashif --- # The Annotated Diffusion Model In this blog post, we'll take a deeper look into **Denoising Diffusion Probabilistic Models** (also known as DDPMs, diffusion models, score-based generative models or simply [autoencoders]( as researchers have been able to achieve remarkable results with them for (un)conditional image/audio/video generation. Popular examples (at the time of writing) include [GLIDE]( and [DALL-E 2]( by OpenAI, [Latent Diffusion]( by the University of Heidelberg and [ImageGen]( by Google Brain. We'll go over the original DDPM paper by ([Ho et al., 2020]( implementing it step-by-step in PyTorch, based on Phil Wang's [implementation]( - which itself is based on the [original TensorFlow implementation]( Note that the idea of diffusion for generative modeling was actually already introduced in ([Sohl-Dickstein et al., 2015]( However, it took until ([Song et al., 2019]( (at Stanford University), and then ([Ho et al., 2020]( (at Google Brain) who independently improved the approach. Note that there are [several perspectives]( on diffusion models. Here, we employ the discrete-time (latent variable model) perspective, but be sure to check out the other perspectives as well. Alright, let's dive in! ```python from IPython.display import Image Image(filename='assets/78_annotated-diffusion/ddpm_paper.png') ``` We'll install and import the required libraries first (assuming you have [PyTorch]( installed). ```python !pip install -q -U einops datasets matplotlib tqdm import math from inspect import isfunction from functools import partial %matplotlib inline import matplotlib.pyplot as plt from tqdm.auto import tqdm from einops import rearrange, reduce from einops.layers.torch import Rearrange import torch from torch import nn, einsum import torch.nn.functional as F ``` ## What is a diffusion model? A (denoising) diffusion model isn't that complex if you compare it to other generative models such as Normalizing Flows, GANs or VAEs: they all convert noise from some simple distribution to a data sample. This is also the case here where **a neural network learns to gradually denoise data** starting from pure noise. In a bit more detail for images, the set-up consists of 2 processes: * a fixed (or predefined) forward diffusion process \\\\(q\\\\) of our choosing, that gradually adds Gaussian noise to an image, until you end up with pure noise * a learned reverse denoising diffusion process \\\\(p_\\theta\\\\), where a neural network is trained to gradually denoise an image starting from pure noise, until you end up with an actual image. Both the forward and reverse process indexed by \\\\(t\\\\) happen for some number of finite time steps \\\\(T\\\\) (the DDPM authors use \\\\(T=1000\\\\)). You start with \\\\(t=0\\\\) where you sample a real image \\\\(\\mathbf{x}_0\\\\) from your data distribution (let's say an image of a cat from ImageNet), and the forward process samples some noise from a Gaussian distribution at each time step \\\\(t\\\\), which is added to the image of the previous time step. Given a sufficiently large \\\\(T\\\\) and a well behaved schedule for adding noise at each time step, you end up with what is called an [isotropic Gaussian distribution]( at \\\\(t=T\\\\) via a gradual process. 
## In more mathematical form Let's write this down more formally, as ultimately we need a tractable loss function which our neural network needs to optimize. Let \\\\(q(\\mathbf{x}_0)\\\\) be the real data distribution, say of \"real images\". We can sample from this distribution to get an image, \\\\(\\mathbf{x}_0 \\sim q(\\mathbf{x}_0)\\\\). We define the forward diffusion process \\\\(q(\\mathbf{x}_t | \\mathbf{x}_{t-1})\\\\) which adds Gaussian noise at each time step \\\\(t\\\\), according to a known variance schedule \\\\(0 < \\beta_1 < \\beta_2 < ... < \\beta_T < 1\\\\). The reverse (denoising) process \\\\(p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)\\\\) is parameterized as a Gaussian whose mean \\\\(\\mathbf{\\mu}_\\theta\\\\) and variance \\\\(\\Sigma_\\theta\\\\) could both be learned. First, we set \\\\(\\Sigma_\\theta ( \\mathbf{x}_t, t) = \\sigma^2_t \\mathbf{I}\\\\) to untrained time-dependent constants. Experimentally, both \\\\(\\sigma^2_t = \\beta_t\\\\) and \\\\(\\sigma^2_t = \\tilde{\\beta}_t\\\\) (see paper) had similar results. This was then later improved in the [Improved diffusion models]( paper, where a neural network also learns the variance of this backwards process, besides the mean. So we continue, assuming that our neural network only needs to learn/represent the mean of this conditional probability distribution. ## Defining an objective function (by reparametrizing the mean) To derive an objective function to learn the mean of the backward process, the authors observe that the combination of \\\\(q\\\\) and \\\\(p_\\theta\\\\) can be seen as a variational auto-encoder (VAE) [(Kingma et al., 2013)]( Hence, the **variational lower bound** (also called ELBO) can be used to minimize the negative log-likelihood with respect to ground truth data sample \\\\(\\mathbf{x}_0\\\\) (we refer to the VAE paper for details regarding ELBO). It turns out that the ELBO for this process is a sum of losses at each time step \\\\(t\\\\), \\\\(L = L_0 + L_1 + ... + L_T\\\\). By construction of the forward \\\\(q\\\\) process and backward process, each term (except for \\\\(L_0\\\\)) of the loss is actually the **KL divergence between 2 Gaussian distributions** which can be written explicitly as an L2-loss with respect to the means! A direct consequence of the constructed forward process \\\\(q\\\\), as shown by Sohl-Dickstein et al., is that we can sample \\\\(\\mathbf{x}_t\\\\) at any arbitrary noise level conditioned on \\\\(\\mathbf{x}_0\\\\) (since a sum of Gaussians is also Gaussian). This is very convenient: we don't need to apply \\\\(q\\\\) repeatedly in order to sample \\\\(\\mathbf{x}_t\\\\). We have that $$q(\\mathbf{x}_t | \\mathbf{x}_0) = \\mathcal{N}(\\mathbf{x}_t; \\sqrt{\\bar{\\alpha}_t} \\mathbf{x}_0, (1- \\bar{\\alpha}_t) \\mathbf{I})$$ with \\\\(\\alpha_t := 1 - \\beta_t\\\\) and \\\\(\\bar{\\alpha}_t := \\Pi_{s=1}^{t} \\alpha_s\\\\). Let's refer to this equation as the \"nice property\". This means we can sample Gaussian noise and scale it appropriately and add it to \\\\(\\mathbf{x}_0\\\\) to get \\\\(\\mathbf{x}_t\\\\) directly. Note that the \\\\(\\bar{\\alpha}_t\\\\) are functions of the known \\\\(\\beta_t\\\\) variance schedule and thus are also known and can be precomputed. This then allows us, during training, to **optimize random terms of the loss function \\\\(L\\\\)** (or in other words, to randomly sample \\\\(t\\\\) during training and optimize \\\\(L_t\\\\)). Another beauty of this property, as shown in Ho et al.
is that one can (after some math, for which we refer the reader to [this excellent blog post]( instead **reparametrize the mean to make the neural network learn (predict) the added noise (via a network \\\\(\\mathbf{\\epsilon}_\\theta(\\mathbf{x}_t, t)\\\\)) for noise level \\\\(t\\\\)** in the KL terms which constitute the losses. This means that our neural network becomes a noise predictor, rather than a (direct) mean predictor. The mean can be computed as follows: $$ \\mathbf{\\mu}_\\theta(\\mathbf{x}_t, t) = \\frac{1}{\\sqrt{\\alpha_t}} \\left( \\mathbf{x}_t - \\frac{\\beta_t}{\\sqrt{1- \\bar{\\alpha}_t}} \\mathbf{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right)$$ The final objective function \\\\(L_t\\\\) then looks as follows (for a random time step \\\\(t\\\\) given \\\\(\\mathbf{\\epsilon} \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})\\\\) ): $$ \\| \\mathbf{\\epsilon} - \\mathbf{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\|^2 = \\| \\mathbf{\\epsilon} - \\mathbf{\\epsilon}_\\theta( \\sqrt{\\bar{\\alpha}_t} \\mathbf{x}_0 + \\sqrt{(1- \\bar{\\alpha}_t) } \\mathbf{\\epsilon}, t) \\|^2.$$ Here, \\\\(\\mathbf{x}_0\\\\) is the initial (real, uncorrupted) image, and we see the direct noise level \\\\(t\\\\) sample given by the fixed forward process. \\\\(\\mathbf{\\epsilon}\\\\) is the pure noise sampled at time step \\\\(t\\\\), and \\\\(\\mathbf{\\epsilon}_\\theta (\\mathbf{x}_t, t)\\\\) is our neural network. The neural network is optimized using a simple mean squared error (MSE) between the true and the predicted Gaussian noise. The training algorithm now looks as follows: In other words: * we take a random sample \\\\(\\mathbf{x}_0\\\\) from the real unknown and possibly complex data distribution \\\\(q(\\mathbf{x}_0)\\\\) * we sample a noise level \\\\(t\\\\) uniformly between \\\\(1\\\\) and \\\\(T\\\\) (i.e., a random time step) * we sample some noise from a Gaussian distribution and corrupt the input by this noise at level \\\\(t\\\\) (using the nice property defined above) * the neural network is trained to predict this noise based on the corrupted image \\\\(\\mathbf{x}_t\\\\) (i.e. noise applied on \\\\(\\mathbf{x}_0\\\\) based on known schedule \\\\(\\beta_t\\\\)) In reality, all of this is done on batches of data, as one uses stochastic gradient descent to optimize neural networks. ## The neural network The neural network needs to take in a noised image at a particular time step and return the predicted noise. Note that the predicted noise is a tensor that has the same size/resolution as the input image. So technically, the network takes in and outputs tensors of the same shape. What type of neural network can we use for this? What is typically used here is very similar to an [Autoencoder]( which you may remember from typical \"intro to deep learning\" tutorials. Autoencoders have a so-called \"bottleneck\" layer in between the encoder and decoder. The encoder first encodes an image into a smaller hidden representation called the \"bottleneck\", and the decoder then decodes that hidden representation back into an actual image. This forces the network to only keep the most important information in the bottleneck layer. In terms of architecture, the DDPM authors went for a **U-Net**, introduced by ([Ronneberger et al., 2015]( (which, at the time, achieved state-of-the-art results for medical image segmentation). This network, like any autoencoder, consists of a bottleneck in the middle that makes sure the network learns only the most important information.
Importantly, it introduced residual connections between the encoder and decoder, greatly improving gradient flow (inspired by ResNet in [He et al., 2015]( As can be seen, a U-Net model first downsamples the input (i.e. makes the input smaller in terms of spatial resolution), after which upsampling is performed. Below, we implement this network, step-by-step. ### Network helpers First, we define some helper functions and classes which will be used when implementing the neural network. Importantly, we define a `Residual` module, which simply adds the input to the output of a particular function (in other words, adds a residual connection to a particular function). We also define aliases for the up- and downsampling operations. ```python def exists(x): return x is not None def default(val, d): if exists(val): return val return d() if isfunction(d) else d def num_to_groups(num, divisor): groups = num // divisor remainder = num % divisor arr = [divisor] * groups if remainder > 0: arr.append(remainder) return arr class Residual(nn.Module): def __init__(self, fn): super().__init__() self.fn = fn def forward(self, x, *args, **kwargs): return self.fn(x, *args, **kwargs) + x def Upsample(dim, dim_out=None): return nn.Sequential( nn.Upsample(scale_factor=2, mode=\"nearest\"), nn.Conv2d(dim, default(dim_out, dim), 3, padding=1), ) def Downsample(dim, dim_out=None): # No More Strided Convolutions or Pooling return nn.Sequential( Rearrange(\"b c (h p1) (w p2) -> b (c p1 p2) h w\", p1=2, p2=2), nn.Conv2d(dim * 4, default(dim_out, dim), 1), ) ``` ### Position embeddings As the parameters of the neural network are shared across time (noise level), the authors employ sinusoidal position embeddings to encode \\\\(t\\\\), inspired by the Transformer ([Vaswani et al., 2017]( This makes the neural network \"know\" at which particular time step (noise level) it is operating, for every image in a batch. The `SinusoidalPositionEmbeddings` module takes a tensor of shape `(batch_size, 1)` as input (i.e. the noise levels of several noisy images in a batch), and turns this into a tensor of shape `(batch_size, dim)`, with `dim` being the dimensionality of the position embeddings. This is then added to each residual block, as we will see further. ```python class SinusoidalPositionEmbeddings(nn.Module): def __init__(self, dim): super().__init__() self.dim = dim def forward(self, time): device = time.device half_dim = self.dim // 2 embeddings = math.log(10000) / (half_dim - 1) embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings) embeddings = time[:, None] * embeddings[None, :] embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1) return embeddings ``` ### ResNet block Next, we define the core building block of the U-Net model. The DDPM authors employed a Wide ResNet block ([Zagoruyko et al., 2016]( but Phil Wang has replaced the standard convolutional layer by a \"weight standardized\" version, which works better in combination with group normalization (see ([Kolesnikov et al., 2019]( for details). ```python class WeightStandardizedConv2d(nn.Conv2d): \"\"\" weight standardization purportedly works synergistically with group normalization \"\"\" def forward(self, x): eps = 1e-5 if x.dtype == torch.float32 else 1e-3 weight = self.weight mean = reduce(weight, \"o ... -> o 1 1 1\", \"mean\") var = reduce(weight, \"o ... 
-> o 1 1 1\", partial(torch.var, unbiased=False)) normalized_weight = (weight - mean) * (var + eps).rsqrt() return F.conv2d( x, normalized_weight, self.bias, self.stride, self.padding, self.dilation, self.groups, ) class Block(nn.Module): def __init__(self, dim, dim_out, groups=8): super().__init__() self.proj = WeightStandardizedConv2d(dim, dim_out, 3, padding=1) self.norm = nn.GroupNorm(groups, dim_out) self.act = nn.SiLU() def forward(self, x, scale_shift=None): x = self.proj(x) x = self.norm(x) if exists(scale_shift): scale, shift = scale_shift x = x * (scale + 1) + shift x = self.act(x) return x class ResnetBlock(nn.Module): \"\"\"Residual block, optionally conditioned on the time embedding (cf. He et al., 2015).\"\"\" def __init__(self, dim, dim_out, *, time_emb_dim=None, groups=8): super().__init__() self.mlp = ( nn.Sequential(nn.SiLU(), nn.Linear(time_emb_dim, dim_out * 2)) if exists(time_emb_dim) else None ) self.block1 = Block(dim, dim_out, groups=groups) self.block2 = Block(dim_out, dim_out, groups=groups) self.res_conv = nn.Conv2d(dim, dim_out, 1) if dim != dim_out else nn.Identity() def forward(self, x, time_emb=None): scale_shift = None if exists(self.mlp) and exists(time_emb): time_emb = self.mlp(time_emb) time_emb = rearrange(time_emb, \"b c -> b c 1 1\") scale_shift = time_emb.chunk(2, dim=1) h = self.block1(x, scale_shift=scale_shift) h = self.block2(h) return h + self.res_conv(x) ``` ### Attention module Next, we define the attention module, which the DDPM authors added in between the convolutional blocks. Attention is the building block of the famous Transformer architecture ([Vaswani et al., 2017]( which has shown great success in various domains of AI, from NLP and vision to [protein folding]( Phil Wang employs 2 variants of attention: one is regular multi-head self-attention (as used in the Transformer), the other one is a [linear attention variant]( ([Shen et al., 2018]( whose time- and memory requirements scale linearly in the sequence length, as opposed to quadratically for regular attention.
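To see where that difference comes from, here is a rough, shape-level sketch (an illustrative aside using the same einsum patterns as the modules below; the sizes are made up and the softmax/scaling steps are omitted):

```python
# Shape-level sketch of regular vs. linear attention (illustrative sizes only).
import torch

b, heads, d, n = 1, 4, 32, 1024      # batch, heads, head dimension, sequence length (h * w of a feature map)
q = torch.randn(b, heads, d, n)
k = torch.randn(b, heads, d, n)
v = torch.randn(b, heads, d, n)

# regular attention materializes an (n x n) similarity matrix -> O(n^2) in the sequence length
sim = torch.einsum(\"b h d i, b h d j -> b h i j\", q, k)        # (1, 4, 1024, 1024)

# linear attention contracts over the sequence first, giving a small (d x d) context -> O(n)
context = torch.einsum(\"b h d n, b h e n -> b h d e\", k, v)    # (1, 4, 32, 32)
out = torch.einsum(\"b h d e, b h d n -> b h e n\", context, q)  # (1, 4, 32, 1024)

print(sim.shape, context.shape, out.shape)
```

The \\\\((n \\times n)\\\\) matrix is what becomes expensive for large feature maps; in the U-Net below, the linear variant is used in the down- and upsampling stages, while full attention is only applied at the low-resolution bottleneck.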
For an extensive explanation of the attention mechanism, we refer the reader to Jay Allamar's [wonderful blog post]( ```python class Attention(nn.Module): def __init__(self, dim, heads=4, dim_head=32): super().__init__() self.scale = dim_head**-0.5 self.heads = heads hidden_dim = dim_head * heads self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False) self.to_out = nn.Conv2d(hidden_dim, dim, 1) def forward(self, x): b, c, h, w = x.shape qkv = self.to_qkv(x).chunk(3, dim=1) q, k, v = map( lambda t: rearrange(t, \"b (h c) x y -> b h c (x y)\", h=self.heads), qkv ) q = q * self.scale sim = einsum(\"b h d i, b h d j -> b h i j\", q, k) sim = sim - sim.amax(dim=-1, keepdim=True).detach() attn = sim.softmax(dim=-1) out = einsum(\"b h i j, b h d j -> b h i d\", attn, v) out = rearrange(out, \"b h (x y) d -> b (h d) x y\", x=h, y=w) return self.to_out(out) class LinearAttention(nn.Module): def __init__(self, dim, heads=4, dim_head=32): super().__init__() self.scale = dim_head**-0.5 self.heads = heads hidden_dim = dim_head * heads self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False) self.to_out = nn.Sequential(nn.Conv2d(hidden_dim, dim, 1), nn.GroupNorm(1, dim)) def forward(self, x): b, c, h, w = x.shape qkv = self.to_qkv(x).chunk(3, dim=1) q, k, v = map( lambda t: rearrange(t, \"b (h c) x y -> b h c (x y)\", h=self.heads), qkv ) q = q.softmax(dim=-2) k = k.softmax(dim=-1) q = q * self.scale context = torch.einsum(\"b h d n, b h e n -> b h d e\", k, v) out = torch.einsum(\"b h d e, b h d n -> b h e n\", context, q) out = rearrange(out, \"b h c (x y) -> b (h c) x y\", h=self.heads, x=h, y=w) return self.to_out(out) ``` ### Group normalization The DDPM authors interleave the convolutional/attention layers of the U-Net with group normalization ([Wu et al., 2018]( Below, we define a `PreNorm` class, which will be used to apply groupnorm before the attention layer, as we'll see further. Note that there's been a [debate]( about whether to apply normalization before or after attention in Transformers. ```python class PreNorm(nn.Module): def __init__(self, dim, fn): super().__init__() self.fn = fn self.norm = nn.GroupNorm(1, dim) def forward(self, x): x = self.norm(x) return self.fn(x) ``` ### Conditional U-Net Now that we've defined all building blocks (position embeddings, ResNet blocks, attention and group normalization), it's time to define the entire neural network. Recall that the job of the network \\\\(\\mathbf{\\epsilon}_\\theta(\\mathbf{x}_t, t)\\\\) is to take in a batch of noisy images and their respective noise levels, and output the noise added to the input. More formally: - the network takes a batch of noisy images of shape `(batch_size, num_channels, height, width)` and a batch of noise levels of shape `(batch_size, 1)` as input, and returns a tensor of shape `(batch_size, num_channels, height, width)` The network is built up as follows: * first, a convolutional layer is applied on the batch of noisy images, and position embeddings are computed for the noise levels * next, a sequence of downsampling stages are applied. Each downsampling stage consists of 2 ResNet blocks + groupnorm + attention + residual connection + a downsample operation * at the middle of the network, again ResNet blocks are applied, interleaved with attention * next, a sequence of upsampling stages are applied. Each upsampling stage consists of 2 ResNet blocks + groupnorm + attention + residual connection + an upsample operation * finally, a ResNet block followed by a convolutional layer is applied. 
Ultimately, neural networks stack up layers as if they were lego blocks (but it's important to [understand how they work]( ```python class Unet(nn.Module): def __init__( self, dim, init_dim=None, out_dim=None, dim_mults=(1, 2, 4, 8), channels=3, self_condition=False, resnet_block_groups=4, ): super().__init__() # determine dimensions self.channels = channels self.self_condition = self_condition input_channels = channels * (2 if self_condition else 1) init_dim = default(init_dim, dim) self.init_conv = nn.Conv2d(input_channels, init_dim, 1, padding=0) # changed to 1 and 0 from 7,3 dims = [init_dim, *map(lambda m: dim * m, dim_mults)] in_out = list(zip(dims[:-1], dims[1:])) block_klass = partial(ResnetBlock, groups=resnet_block_groups) # time embeddings time_dim = dim * 4 self.time_mlp = nn.Sequential( SinusoidalPositionEmbeddings(dim), nn.Linear(dim, time_dim), nn.GELU(), nn.Linear(time_dim, time_dim), ) # layers self.downs = nn.ModuleList([]) self.ups = nn.ModuleList([]) num_resolutions = len(in_out) for ind, (dim_in, dim_out) in enumerate(in_out): is_last = ind >= (num_resolutions - 1) self.downs.append( nn.ModuleList( [ block_klass(dim_in, dim_in, time_emb_dim=time_dim), block_klass(dim_in, dim_in, time_emb_dim=time_dim), Residual(PreNorm(dim_in, LinearAttention(dim_in))), Downsample(dim_in, dim_out) if not is_last else nn.Conv2d(dim_in, dim_out, 3, padding=1), ] ) ) mid_dim = dims[-1] self.mid_block1 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim) self.mid_attn = Residual(PreNorm(mid_dim, Attention(mid_dim))) self.mid_block2 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim) for ind, (dim_in, dim_out) in enumerate(reversed(in_out)): is_last = ind == (len(in_out) - 1) self.ups.append( nn.ModuleList( [ block_klass(dim_out + dim_in, dim_out, time_emb_dim=time_dim), block_klass(dim_out + dim_in, dim_out, time_emb_dim=time_dim), Residual(PreNorm(dim_out, LinearAttention(dim_out))), Upsample(dim_out, dim_in) if not is_last else nn.Conv2d(dim_out, dim_in, 3, padding=1), ] ) ) self.out_dim = default(out_dim, channels) self.final_res_block = block_klass(dim * 2, dim, time_emb_dim=time_dim) self.final_conv = nn.Conv2d(dim, self.out_dim, 1) def forward(self, x, time, x_self_cond=None): if self.self_condition: x_self_cond = default(x_self_cond, lambda: torch.zeros_like(x)) x = torch.cat((x_self_cond, x), dim=1) x = self.init_conv(x) r = x.clone() t = self.time_mlp(time) h = [] for block1, block2, attn, downsample in self.downs: x = block1(x, t) h.append(x) x = block2(x, t) x = attn(x) h.append(x) x = downsample(x) x = self.mid_block1(x, t) x = self.mid_attn(x) x = self.mid_block2(x, t) for block1, block2, attn, upsample in self.ups: x = torch.cat((x, h.pop()), dim=1) x = block1(x, t) x = torch.cat((x, h.pop()), dim=1) x = block2(x, t) x = attn(x) x = upsample(x) x = torch.cat((x, r), dim=1) x = self.final_res_block(x, t) return self.final_conv(x) ``` ## Defining the forward diffusion process The forward diffusion process gradually adds noise to an image from the real distribution, in a number of time steps \\\\(T\\\\). This happens according to a **variance schedule**. The original DDPM authors employed a linear schedule: > We set the forward process variances to constants increasing linearly from \\\\(\\beta_1 = 10^{\u22124}\\\\) to \\\\(\\beta_T = 0.02\\\\). However, it was shown in ([Nichol et al., 2021]( that better results can be achieved when employing a cosine schedule. Below, we define various schedules for the \\\\(T\\\\) timesteps (we'll choose one later on). 
```python def cosine_beta_schedule(timesteps, s=0.008): \"\"\" cosine schedule as proposed in \"\"\" steps = timesteps + 1 x = torch.linspace(0, timesteps, steps) alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2 alphas_cumprod = alphas_cumprod / alphas_cumprod[0] betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1]) return torch.clip(betas, 0.0001, 0.9999) def linear_beta_schedule(timesteps): beta_start = 0.0001 beta_end = 0.02 return torch.linspace(beta_start, beta_end, timesteps) def quadratic_beta_schedule(timesteps): beta_start = 0.0001 beta_end = 0.02 return torch.linspace(beta_start**0.5, beta_end**0.5, timesteps) ** 2 def sigmoid_beta_schedule(timesteps): beta_start = 0.0001 beta_end = 0.02 betas = torch.linspace(-6, 6, timesteps) return torch.sigmoid(betas) * (beta_end - beta_start) + beta_start ``` To start with, let's use the linear schedule for \\\\(T=300\\\\) time steps and define the various variables from the \\\\(\\beta_t\\\\) which we will need, such as the cumulative product of the variances \\\\(\\bar{\\alpha}_t\\\\). Each of the variables below is just a 1-dimensional tensor, storing values for every time step from \\\\(1\\\\) to \\\\(T\\\\). Importantly, we also define an `extract` function, which will allow us to extract the appropriate \\\\(t\\\\) index for a batch of indices. ```python timesteps = 300 # define beta schedule betas = linear_beta_schedule(timesteps=timesteps) # define alphas alphas = 1. - betas alphas_cumprod = torch.cumprod(alphas, axis=0) alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0) sqrt_recip_alphas = torch.sqrt(1.0 / alphas) # calculations for diffusion q(x_t | x_{t-1}) and others sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod) sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod) # calculations for posterior q(x_{t-1} | x_t, x_0) posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod) def extract(a, t, x_shape): batch_size = t.shape[0] out = a.gather(-1, t.cpu()) return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device) ``` We'll use an image of a cat to illustrate how noise is added at each time step of the diffusion process. ```python from PIL import Image import requests url = ' image = Image.open(requests.get(url, stream=True).raw) # PIL image of shape HWC image ``` Noise is added to PyTorch tensors, rather than Pillow Images. We'll first define image transformations that allow us to go from a PIL image to a PyTorch tensor (on which we can add the noise), and vice versa. These transformations are fairly simple: we first normalize images by dividing by \\\\(255\\\\) (such that they are in the \\\\([0,1]\\\\) range), and then make sure they are in the \\\\([-1, 1]\\\\) range. From the DDPM paper: > We assume that image data consists of integers in \\\\(\\{0, 1, ... , 255\\}\\\\) scaled linearly to \\\\([\u22121, 1]\\\\). This ensures that the neural network reverse process operates on consistently scaled inputs starting from the standard normal prior \\\\(p(\\mathbf{x}_T )\\\\).
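As a quick numerical check of that mapping (a small illustrative aside; the actual transforms are defined in the next cell), integer pixel values land exactly in \\\\([-1, 1]\\\\):

```python
# [0, 255] integers -> [0, 1] (ToTensor-style scaling) -> [-1, 1] (the Lambda used below).
import torch

pixels = torch.tensor([0.0, 127.5, 255.0]) / 255
print(pixels * 2 - 1)   # tensor([-1.,  0.,  1.])
```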
```python from torchvision.transforms import Compose, ToTensor, Lambda, ToPILImage, CenterCrop, Resize image_size = 128 transform = Compose([ Resize(image_size), CenterCrop(image_size), ToTensor(), # turn into torch Tensor of shape CHW, divide by 255 Lambda(lambda t: (t * 2) - 1), ]) x_start = transform(image).unsqueeze(0) x_start.shape ``` Output: ---------------------------------------------------------------------------------------------------- torch.Size([1, 3, 128, 128]) We also define the reverse transform, which takes in a PyTorch tensor containing values in \\\\([-1, 1]\\\\) and turn them back into a PIL image: ```python import numpy as np reverse_transform = Compose([ Lambda(lambda t: (t + 1) / 2), Lambda(lambda t: t.permute(1, 2, 0)), # CHW to HWC Lambda(lambda t: t * 255.), Lambda(lambda t: t.numpy().astype(np.uint8)), ToPILImage(), ]) ``` Let's verify this: ```python reverse_transform(x_start.squeeze()) ``` We can now define the forward diffusion process as in the paper: ```python # forward diffusion (using the nice property) def q_sample(x_start, t, noise=None): if noise is None: noise = torch.randn_like(x_start) sqrt_alphas_cumprod_t = extract(sqrt_alphas_cumprod, t, x_start.shape) sqrt_one_minus_alphas_cumprod_t = extract( sqrt_one_minus_alphas_cumprod, t, x_start.shape ) return sqrt_alphas_cumprod_t * x_start + sqrt_one_minus_alphas_cumprod_t * noise ``` Let's test it on a particular time step: ```python def get_noisy_image(x_start, t): # add noise x_noisy = q_sample(x_start, t=t) # turn back into PIL image noisy_image = reverse_transform(x_noisy.squeeze()) return noisy_image ``` ```python # take time step t = torch.tensor([40]) get_noisy_image(x_start, t) ``` Let's visualize this for various time steps: ```python import matplotlib.pyplot as plt # use seed for reproducability torch.manual_seed(0) # source: def plot(imgs, with_orig=False, row_title=None, **imshow_kwargs): if not isinstance(imgs[0], list): # Make a 2d grid even if there's just 1 row imgs = [imgs] num_rows = len(imgs) num_cols = len(imgs[0]) + with_orig fig, axs = plt.subplots(figsize=(200,200), nrows=num_rows, ncols=num_cols, squeeze=False) for row_idx, row in enumerate(imgs): row = [image] + row if with_orig else row for col_idx, img in enumerate(row): ax = axs[row_idx, col_idx] ax.imshow(np.asarray(img), **imshow_kwargs) ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[]) if with_orig: axs[0, 0].set(title='Original image') axs[0, 0].title.set_size(8) if row_title is not None: for row_idx in range(num_rows): axs[row_idx, 0].set(ylabel=row_title[row_idx]) plt.tight_layout() ``` ```python plot([get_noisy_image(x_start, torch.tensor([t])) for t in [0, 50, 100, 150, 199]]) ``` This means that we can now define the loss function given the model as follows: ```python def p_losses(denoise_model, x_start, t, noise=None, loss_type=\"l1\"): if noise is None: noise = torch.randn_like(x_start) x_noisy = q_sample(x_start=x_start, t=t, noise=noise) predicted_noise = denoise_model(x_noisy, t) if loss_type == 'l1': loss = F.l1_loss(noise, predicted_noise) elif loss_type == 'l2': loss = F.mse_loss(noise, predicted_noise) elif loss_type == \"huber\": loss = F.smooth_l1_loss(noise, predicted_noise) else: raise NotImplementedError() return loss ``` The `denoise_model` will be our U-Net defined above. We'll employ the Huber loss between the true and the predicted noise. 
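Before moving on to real data, we can sanity-check the pieces defined so far with a throwaway model and a batch of random tensors (a minimal sketch; the hyperparameters below are illustrative, it assumes the cells above have been run, and the real model is defined in the training section):

```python
# Minimal sanity check of the network interface and the objective (illustrative values only).
demo_model = Unet(dim=28, channels=1, dim_mults=(1, 2, 4))

x_start = torch.randn(8, 1, 28, 28)              # a fake batch with the shape of our images
t = torch.randint(0, timesteps, (8,)).long()     # a random noise level per image

# the network outputs a noise prediction with the same shape as its input
print(demo_model(torch.randn(8, 1, 28, 28), t).shape)   # torch.Size([8, 1, 28, 28])

# and the loss is a single scalar we can backpropagate through
print(p_losses(demo_model, x_start, t, loss_type=\"huber\"))
```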
## Define a PyTorch Dataset + DataLoader Here we define a regular [PyTorch Dataset]( The dataset simply consists of images from a real dataset, like Fashion-MNIST, CIFAR-10 or ImageNet, scaled linearly to \\\\([\u22121, 1]\\\\). Each image is resized to the same size. It is interesting to note that images are also randomly flipped horizontally. From the paper: > We used random horizontal flips during training for CIFAR10; we tried training both with and without flips, and found flips to improve sample quality slightly. Here we use the [Datasets library]( to easily load the Fashion MNIST dataset from the [hub]( This dataset consists of images which already have the same resolution, namely 28x28. ```python from datasets import load_dataset # load dataset from the hub dataset = load_dataset(\"fashion_mnist\") image_size = 28 channels = 1 batch_size = 128 ``` Next, we define a function which we'll apply on-the-fly on the entire dataset. We use the `with_transform` [functionality]( for that. The function just applies some basic image preprocessing: random horizontal flips, rescaling, and finally making sure the values are in the \\\\([-1,1]\\\\) range. ```python from torchvision import transforms from torch.utils.data import DataLoader # define image transformations (e.g. using torchvision) transform = Compose([ transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Lambda(lambda t: (t * 2) - 1) ]) # define function def transforms(examples): examples[\"pixel_values\"] = [transform(image.convert(\"L\")) for image in examples[\"image\"]] del examples[\"image\"] return examples transformed_dataset = dataset.with_transform(transforms).remove_columns(\"label\") # create dataloader dataloader = DataLoader(transformed_dataset[\"train\"], batch_size=batch_size, shuffle=True) ``` ```python batch = next(iter(dataloader)) print(batch.keys()) ``` Output: ---------------------------------------------------------------------------------------------------- dict_keys(['pixel_values']) ## Sampling As we'll sample from the model during training (in order to track progress), we define the code for that below. Sampling is summarized in the paper as Algorithm 2. Generating new images from a diffusion model happens by reversing the diffusion process: we start from \\\\(T\\\\), where we sample pure noise from a Gaussian distribution, and then use our neural network to gradually denoise it (using the conditional probability it has learned), until we end up at time step \\\\(t = 0\\\\). As shown above, we can derive a slightly less noisy image \\\\(\\mathbf{x}_{t-1}\\\\) by plugging in the reparametrization of the mean, using our noise predictor. Remember that the variance is known ahead of time. Ideally, we end up with an image that looks like it came from the real data distribution. The code below implements this.
```python @torch.no_grad() def p_sample(model, x, t, t_index): betas_t = extract(betas, t, x.shape) sqrt_one_minus_alphas_cumprod_t = extract( sqrt_one_minus_alphas_cumprod, t, x.shape ) sqrt_recip_alphas_t = extract(sqrt_recip_alphas, t, x.shape) # Equation 11 in the paper # Use our model (noise predictor) to predict the mean model_mean = sqrt_recip_alphas_t * ( x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t ) if t_index == 0: return model_mean else: posterior_variance_t = extract(posterior_variance, t, x.shape) noise = torch.randn_like(x) # Algorithm 2 line 4: return model_mean + torch.sqrt(posterior_variance_t) * noise # Algorithm 2 (including returning all images) @torch.no_grad() def p_sample_loop(model, shape): device = next(model.parameters()).device b = shape[0] # start from pure noise (for each example in the batch) img = torch.randn(shape, device=device) imgs = [] for i in tqdm(reversed(range(0, timesteps)), desc='sampling loop time step', total=timesteps): img = p_sample(model, img, torch.full((b,), i, device=device, dtype=torch.long), i) imgs.append(img.cpu().numpy()) return imgs @torch.no_grad() def sample(model, image_size, batch_size=16, channels=3): return p_sample_loop(model, shape=(batch_size, channels, image_size, image_size)) ``` Note that the code above is a simplified version of the original implementation. We found our simplification (which is in line with Algorithm 2 in the paper) to work just as well as the [original, more complex implementation]( which employs [clipping]( ## Train the model Next, we train the model in regular PyTorch fashion. We also define some logic to periodically save generated images, using the `sample` method defined above. ```python from pathlib import Path def num_to_groups(num, divisor): groups = num // divisor remainder = num % divisor arr = [divisor] * groups if remainder > 0: arr.append(remainder) return arr results_folder = Path(\"./results\") results_folder.mkdir(exist_ok = True) save_and_sample_every = 1000 ``` Below, we define the model, and move it to the GPU. We also define a standard optimizer (Adam). ```python from torch.optim import Adam device = \"cuda\" if torch.cuda.is_available() else \"cpu\" model = Unet( dim=image_size, channels=channels, dim_mults=(1, 2, 4,) ) model.to(device) optimizer = Adam(model.parameters(), lr=1e-3) ``` Let's start training! 
```python from torchvision.utils import save_image epochs = 6 for epoch in range(epochs): for step, batch in enumerate(dataloader): optimizer.zero_grad() batch_size = batch[\"pixel_values\"].shape[0] batch = batch[\"pixel_values\"].to(device) # Algorithm 1 line 3: sample t uniformally for every example in the batch t = torch.randint(0, timesteps, (batch_size,), device=device).long() loss = p_losses(model, batch, t, loss_type=\"huber\") if step % 100 == 0: print(\"Loss:\", loss.item()) loss.backward() optimizer.step() # save generated images if step != 0 and step % save_and_sample_every == 0: milestone = step // save_and_sample_every batches = num_to_groups(4, batch_size) all_images_list = list(map(lambda n: sample(model, batch_size=n, channels=channels), batches)) all_images = torch.cat(all_images_list, dim=0) all_images = (all_images + 1) * 0.5 save_image(all_images, str(results_folder / f'sample-{milestone}.png'), nrow = 6) ``` Output: ---------------------------------------------------------------------------------------------------- Loss: 0.46477368474006653 Loss: 0.12143351882696152 Loss: 0.08106148988008499 Loss: 0.0801810547709465 Loss: 0.06122320517897606 Loss: 0.06310459971427917 Loss: 0.05681884288787842 Loss: 0.05729678273200989 Loss: 0.05497899278998375 Loss: 0.04439849033951759 Loss: 0.05415581166744232 Loss: 0.06020551547408104 Loss: 0.046830907464027405 Loss: 0.051029372960329056 Loss: 0.0478244312107563 Loss: 0.046767622232437134 Loss: 0.04305662214756012 Loss: 0.05216279625892639 Loss: 0.04748568311333656 Loss: 0.05107741802930832 Loss: 0.04588869959115982 Loss: 0.043014321476221085 Loss: 0.046371955424547195 Loss: 0.04952816292643547 Loss: 0.04472338408231735 ## Sampling (inference) To sample from the model, we can just use our sample function defined above: ```python # sample 64 images samples = sample(model, image_size=image_size, batch_size=64, channels=channels) # show a random one random_index = 5 plt.imshow(samples[-1][random_index].reshape(image_size, image_size, channels), cmap=\"gray\") ``` Seems like the model is capable of generating a nice T-shirt! Keep in mind that the dataset we trained on is pretty low-resolution (28x28). We can also create a gif of the denoising process: ```python import matplotlib.animation as animation random_index = 53 fig = plt.figure() ims = [] for i in range(timesteps): im = plt.imshow(samples[i][random_index].reshape(image_size, image_size, channels), cmap=\"gray\", animated=True) ims.append([im]) animate = animation.ArtistAnimation(fig, ims, interval=50, blit=True, repeat_delay=1000) animate.save('diffusion.gif') plt.show() ``` # Follow-up reads Note that the DDPM paper showed that diffusion models are a promising direction for (un)conditional image generation. This has since then (immensely) been improved, most notably for text-conditional image generation. 
Below, we list some important (but far from exhaustive) follow-up works: - Improved Denoising Diffusion Probabilistic Models ([Nichol et al., 2021]( finds that learning the variance of the conditional distribution (besides the mean) helps in improving performance - Cascaded Diffusion Models for High Fidelity Image Generation ([Ho et al., 2021]( introduces cascaded diffusion, which comprises a pipeline of multiple diffusion models that generate images of increasing resolution for high-fidelity image synthesis - Diffusion Models Beat GANs on Image Synthesis ([Dhariwal et al., 2021]( show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models by improving the U-Net architecture, as well as introducing classifier guidance - Classifier-Free Diffusion Guidance ([Ho et al., 2021]( shows that you don't need a classifier for guiding a diffusion model by jointly training a conditional and an unconditional diffusion model with a single neural network - Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) ([Ramesh et al., 2022]( uses a prior to turn a text caption into a CLIP image embedding, after which a diffusion model decodes it into an image - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (ImageGen) ([Saharia et al., 2022]( shows that combining a large pre-trained language model (e.g. T5) with cascaded diffusion works well for text-to-image synthesis Note that this list only includes important works until the time of writing, which is June 7th, 2022. For now, it seems that the main (perhaps only) disadvantage of diffusion models is that they require multiple forward passes to generate an image (which is not the case for generative models like GANs). However, there's [research going on]( that enables high-fidelity generation in as few as 10 denoising steps."}
{"title": "arxiv.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Hugging Face Machine Learning Demos on arXiv\" thumbnail: /blog/assets/arxiv/thumbnail.png authors: - user: abidlabs - user: osanseviero - user: pcuenq --- # Hugging Face Machine Learning Demos on arXiv We\u2019re very excited to announce that Hugging Face has collaborated with arXiv to make papers more accessible, discoverable, and fun! Starting today, [Hugging Face Spaces]( is integrated with arXivLabs through a Demo tab that includes links to demos created by the community or the authors themselves. By going to the Demos tab of your favorite paper, you can find links to open-source demos and try them out immediately  Since its launch in October 2021, Hugging Face Spaces has been used to build and share over 12,000 open-source machine learning demos crafted by the community. With Spaces, Hugging Face users can share, explore, discuss models, and build interactive applications that enable anyone with a browser to try them out without having to run any code. These demos are built using open-source tools such as the Gradio and Streamlit Python libraries, and leverage models and datasets available on the Hugging Face Hub. Thanks to the latest arXiv integration, users can now find the most popular demos for a paper on its arXiv abstract page. For example, if you want to try out demos of the BERT language model, you can go to the BERT paper\u2019s [arXiv page]( and navigate to the demo tab. You will see more than 200 demos built by the open-source community -- some demos simply showcase the BERT model, while others showcase related applications that modify or use BERT as part of larger pipelines, such as the demo shown above.  Demos allow a much wider audience to explore machine learning as well as other fields in which computational models are built, such as biology, chemistry, astronomy, and economics. They help increase the awareness and understanding of how models work, amplify the visibility of researchers' work, and allow a more diverse audience to identify and debug biases and other issues. The demos increase the reproducibility of research by enabling others to explore the paper's results without having to write a single line of code! We are thrilled about this integration with arXiv and can\u2019t wait to see how the research community will use it to improve communication, dissemination and interpretability."}
{"title": "asr-chunking.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Making automatic speech recognition work on large files with Wav2Vec2 in Transformers\" thumbnail: /blog/assets/49_asr_chunking/thumbnail.png authors: - user: Narsil --- # Making automatic speech recognition work on large files with Wav2Vec2 in Transformers ``` Tl;dr: This post explains how to use the specificities of the Connectionist Temporal Classification (CTC) architecture in order to achieve very good quality automatic speech recognition (ASR) even on arbitrarily long files or during live inference. ``` **Wav2Vec2** is a popular pre-trained model for speech recognition. Released in [September 2020]( by Meta AI Research, the novel architecture catalyzed progress in self-supervised pretraining for speech recognition, *e.g.* [*G. Ng et al.*, 2021]( [*Chen et al*, 2021]( [*Hsu et al.*, 2021]( and [*Babu et al.*, 2021]( On the Hugging Face Hub, Wav2Vec2's most popular pre-trained checkpoint currently amounts to over [**250,000** monthly downloads]( **Wav2Vec2** is at its core a **transformers** models and one caveat of **transformers** is that it usually has a finite amount of sequence length it can handle. Either because it uses **position encodings** (not the case here) or simply because the cost of attention in transformers is actually O(n\u00b2) in sequence_length, meaning that using very large sequence_length explodes in complexity/memory. So you cannot run with finite hardware (even a very large GPU like A100), simply run Wav2Vec2 on an hour long file. Your program will crash. Let's try it ! ```bash pip install transformers ``` ```python from transformers import pipeline # This will work on any of the thousands of models at # pipe = pipeline(model=\"facebook/wav2vec2-base-960h\") # The Public Domain LibriVox file used for the test #!wget -o very_long_file.mp3 pipe(\"very_long_file.mp3\") # Crash out of memory ! pipe(\"very_long_file.mp3\", chunk_length_s=10) # This works and prints a very long string ! # This whole blogpost will explain how to make things work ``` Simple Chunking --------------- The simplest way to achieve inference on very long files would be to simply chunk the initial audio into shorter samples, let's say 10 seconds each, run inference on those, and end up with a final reconstruction. This is efficient computationally but usually leads to subpar results, the reason being that in order to do good inference, the model needs some context, so around the chunking border, inference tends to be of poor quality. Look at the following diagram:  There are ways to try and work around the problem in a general fashion, but they are never entirely robust. You can try to chunk only when you encounter silence but you may have a non silent audio for a long time (a song, or noisy caf\u00e9 audio). You can also try to cut only when there's no voice but it requires another model and this is not an entirely solved problem. You could also have a continous voice for a very long time. As it turns out, CTC structure, which is used by Wav2Vec2, can be exploited in order to achieve very robust speech recognition even on very long files without falling into those pitfalls. Chunking with stride -------------------- Wav2Vec2 uses the [CTC algorithm]( which means that every frame of audio is mapped to a single letter prediction (logit).  That's the main feature we're going to use in order to add a `stride`. This [link]( explains it in the image context, but it's the same concept for audio. 
Because of this property, we can: - Start doing inference on **overlapping** chunks so that the model actually has proper context in the center. - **Drop** the inferenced logits on the side. - Chain the **logits** without their dropped sides to recover something extremely close to what the model would have predicted on the full length audio.  This is not **technically** 100% the same thing as running the model on the whole file so it is not enabled by default, but as you saw in the earlier example you need only to add `chunk_length_s` to your `pipeline` for it to work. In practice, we observed that most of the bad inference is kept within the strides, which get dropped before inference, leading to a proper inference of the full text. Let's note that you can choose every argument of this technique: ```python from transformers import pipeline pipe = pipeline(model=\"facebook/wav2vec2-base-960h\") # stride_length_s is a tuple of the left and right stride length. # With only 1 number, both sides get the same stride, by default # the stride_length on one side is 1/6th of the chunk_length_s output = pipe(\"very_long_file.mp3\", chunk_length_s=10, stride_length_s=(4, 2)) ``` Chunking with stride on LM augmented models ------------------------------------------- In [transformers]( we also added support for adding LM to Wav2Vec2 in order to boost the WER performance of the models without even finetuning. [See this excellent blogpost explaining how it works]( It turns out, that the LM works directly on the logits themselves, so we can actually apply the exact same technique as before without any modification ! So chunking large files on these LM boosted models still works out of the box. Live inference -------------- A very nice perk of using a CTC model like Wav2vec2, is that it is a single pass model, so it is **very** fast. Especially on GPU. We can exploit that in order to do live inference. The principle is exactly the same as regular striding, but this time we can feed the pipeline data **as it is coming in** and simply use striding on full chunks of length 10s for instance with 1s striding to get proper context. That requires running much more inference steps than simple file chunking, but it can make the live experience much better because the model can print things as you are speaking, without having to wait for X seconds before seeing something displayed."}
{"title": "readme.md", "repo_owner": "huggingface", "repo_name": "blog", "text": ""}
{"title": "assisted-generation.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Assisted Generation: a new direction toward low-latency text generation\" thumbnail: /blog/assets/assisted-generation/thumbnail.png authors: - user: joaogante --- # Assisted Generation: a new direction toward low-latency text generation Large language models are all the rage these days, with many companies investing significant resources to scale them up and unlock new capabilities. However, as humans with ever-decreasing attention spans, we also dislike their slow response times. Latency is critical for a good user experience, and smaller models are often used despite their lower quality (e.g. in [code completion]( Why is text generation so slow? What\u2019s preventing you from deploying low-latency large language models without going bankrupt? In this blog post, we will revisit the bottlenecks for autoregressive text generation and introduce a new decoding method to tackle the latency problem. You\u2019ll see that by using our new method, assisted generation, you can reduce latency up to 10x in commodity hardware! ## Understanding text generation latency The core of modern text generation is straightforward to understand. Let\u2019s look at the central piece, the ML model. Its input contains a text sequence, which includes the text generated so far, and potentially other model-specific components (for instance, Whisper also has an audio input). The model takes the input and runs a forward pass: the input is fed to the model and passed sequentially along its layers until the unnormalized log probabilities for the next token are predicted (also known as logits). A token may consist of entire words, sub-words, or even individual characters, depending on the model. The [illustrated GPT-2]( is a great reference if you\u2019d like to dive deeper into this part of text generation. A model forward pass gets you the logits for the next token, which you can freely manipulate (e.g. set the probability of undesirable words or sequences to 0). The following step in text generation is to select the next token from these logits. Common strategies include picking the most likely token, known as greedy decoding, or sampling from their distribution, also called multinomial sampling. Chaining model forward passes with next token selection iteratively gets you text generation. This explanation is the tip of the iceberg when it comes to decoding methods; please refer to [our blog post on text generation]( for an in-depth exploration. From the description above, the latency bottleneck in text generation is clear: running a model forward pass for large models is slow, and you may need to do hundreds of them in a sequence. But let\u2019s dive deeper: why are forward passes slow? Forward passes are typically dominated by matrix multiplications and, after a quick visit to the [corresponding wikipedia section]( you can tell that memory bandwidth is the limitation in this operation (e.g. from the GPU RAM to the GPU compute cores). In other words, *the bottleneck in the forward pass comes from loading the model layer weights into the computation cores of your device, not from performing the computations themselves*. At the moment, you have three main avenues you can explore to get the most out of text generation, all tackling the performance of the model forward pass. First, you have the hardware-specific model optimizations. 
For instance, your device may be compatible with [Flash Attention]( which speeds up the attention layer through a reorder of the operations, or [INT8 quantization]( which reduces the size of the model weights. Second, when you know you\u2019ll get concurrent text generation requests, you can batch the inputs and massively increase the throughput with a small latency penalty. The model layer weights loaded into the device are now used on several input rows in parallel, which means that you\u2019ll get more tokens out for approximately the same memory bandwidth burden. The catch with batching is that you need additional device memory (or to offload the memory somewhere) \u2013 at the end of this spectrum, you can see projects like [FlexGen]( which optimize throughput at the expense of latency. ```python # Example showcasing the impact of batched generation. Measurement device: RTX3090 from transformers import AutoModelForCausalLM, AutoTokenizer import time tokenizer = AutoTokenizer.from_pretrained(\"distilgpt2\") model = AutoModelForCausalLM.from_pretrained(\"distilgpt2\").to(\"cuda\") inputs = tokenizer([\"Hello world\"], return_tensors=\"pt\").to(\"cuda\") def print_tokens_per_second(batch_size): new_tokens = 100 cumulative_time = 0 # warmup model.generate( **inputs, do_sample=True, max_new_tokens=new_tokens, num_return_sequences=batch_size ) for _ in range(10): start = time.time() model.generate( **inputs, do_sample=True, max_new_tokens=new_tokens, num_return_sequences=batch_size ) cumulative_time += time.time() - start print(f\"Tokens per second: {new_tokens * batch_size * 10 / cumulative_time:.1f}\") print_tokens_per_second(1) # Tokens per second: 418.3 print_tokens_per_second(64) # Tokens per second: 16266.2 (~39x more tokens per second) ``` Finally, if you have multiple devices available to you, you can distribute the workload using [Tensor Parallelism]( and obtain lower latency. With Tensor Parallelism, you split the memory bandwidth burden across multiple devices, but you now have to consider inter-device communication bottlenecks in addition to the monetary cost of running multiple devices. The benefits depend largely on the model size: models that easily fit on a single consumer device see very limited benefits. Taking the results from this [DeepSpeed blog post]( you see that you can spread a 17B parameter model across 4 GPUs to reduce the latency by 1.5x (Figure 7). These three types of improvements can be used in tandem, resulting in [high throughput solutions]( However, after applying hardware-specific optimizations, there are limited options to reduce latency \u2013 and the existing options are expensive. Let\u2019s fix that! ## Language decoder forward pass, revisited You\u2019ve read above that each model forward pass yields the logits for the next token, but that\u2019s actually an incomplete description. During text generation, the typical iteration consists in the model receiving as input the latest generated token, plus cached internal computations for all other previous inputs, returning the next token logits. Caching is used to avoid redundant computations, resulting in faster forward passes, but it\u2019s not mandatory (and can be used partially). When caching is disabled, the input contains the entire sequence of tokens generated so far and the output contains the logits corresponding to the next token for *all positions* in the sequence! 
The logits at position N correspond to the distribution for the next token if the input consisted of the first N tokens, ignoring all subsequent tokens in the sequence. In the particular case of greedy decoding, if you pass the generated sequence as input and apply the argmax operator to the resulting logits, you will obtain the generated sequence back. ```python from transformers import AutoModelForCausalLM, AutoTokenizer tok = AutoTokenizer.from_pretrained(\"distilgpt2\") model = AutoModelForCausalLM.from_pretrained(\"distilgpt2\") inputs = tok([\"The\"], return_tensors=\"pt\") generated = model.generate(**inputs, do_sample=False, max_new_tokens=10) forward_confirmation = model(generated).logits.argmax(-1) # We exclude the opposing tips from each sequence: the forward pass returns # the logits for the next token, so it is shifted by one position. print(generated[0, 1:].tolist() == forward_confirmation[0, :-1].tolist()) # True ``` This means that you can use a model forward pass for a different purpose: in addition to feeding some tokens to predict the next one, you can also pass a sequence to the model and double-check whether the model would generate that same sequence (or part of it). Let's consider for a second that you have access to a magical latency-free oracle model that generates the same sequence as your model, for any given input. For argument's sake, it can't be used directly; it's limited to being an assistant to your generation procedure. Using the property described above, you could use this assistant model to get candidate output tokens followed by a forward pass with your model to confirm that they are indeed correct. In this utopian scenario, the latency of text generation would be reduced from `O(n)` to `O(1)`, with `n` being the number of generated tokens. For long generations, we're talking about several orders of magnitude. Walking a step towards reality, let's assume the assistant model has lost its oracle properties. Now it's a latency-free model that gets some of the candidate tokens wrong, according to your model. Due to the autoregressive nature of the task, as soon as the assistant gets a token wrong, all subsequent candidates must be invalidated. However, that does not prevent you from querying the assistant again, after correcting the wrong token with your model, and repeating this process iteratively. Even if the assistant fails a few tokens, text generation would have an order of magnitude less latency than in its original form. Obviously, there are no latency-free assistant models. Nevertheless, it is relatively easy to find a model that approximates some other model's text generation outputs \u2013 smaller versions of the same architecture trained similarly often fit this property. Moreover, when the difference in model sizes becomes significant, the cost of using the smaller model as an assistant becomes an afterthought after factoring in the benefits of skipping a few forward passes! You now understand the core of _assisted generation_. ## Greedy decoding with assisted generation Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, you get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown.
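To make the trade-off concrete, here is a deliberately rough, back-of-the-envelope cost model (illustrative assumptions only; this is not the heuristic used inside Transformers):

```python
# Rough speedup estimate under simplifying assumptions: the assistant proposes
# `k` candidate tokens per round, a fraction `acc` of them is accepted on
# average, and one assistant forward pass costs `c` times a main-model pass.
def expected_speedup(k=5, acc=0.8, c=0.1):
    accepted = k * acc            # candidate tokens confirmed per main-model pass
    cost = 1 + k * c              # one main-model pass plus k assistant passes
    return (accepted + 1) / cost  # +1: the main model always contributes one token itself

print(expected_speedup(k=5, acc=0.8, c=0.1))  # ~3.3x under these assumptions
print(expected_speedup(k=5, acc=0.2, c=0.5))  # below 1x: a weak, slow assistant hurts
```

The exact numbers do not matter; the point is that both the assistant's accuracy and its relative cost enter the equation, which is what the requirement and heuristic below try to keep under control.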
While we can't automate the selection of the assistant model for you, we\u2019ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check. First, the requirement \u2013 the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up. Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation \u2013 you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can\u2019t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant \u2013 some sections of the output are easier to anticipate than others. Wrapping all up, here\u2019s our original implementation of the assisted generation loop ([code]( 1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing `candidates`. The number of produced candidate tokens is initialized to `5` the first time assisted generation is called. 2. Using our model, do a forward pass with `candidates`, obtaining `logits`. 3. Use the token selection method (`.argmax()` for greedy search or `.multinomial()` for sampling) to get the `next_tokens` from `logits`. 4. Compare `next_tokens` to `candidates` and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated. 5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in `next_tokens`, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence). 6. Adjust the number of candidate tokens to be produced in the next iteration \u2014 our original heuristic increases it by `2` if ALL tokens match and decreases it by `1` otherwise. We\u2019ve designed the API in Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new `assistant_model` keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of `1`. ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch prompt = \"Alice and Bob\" checkpoint = \"EleutherAI/pythia-1.4b-deduped\" assistant_checkpoint = \"EleutherAI/pythia-160m-deduped\" device = \"cuda\" if torch.cuda.is_available() else \"cpu\" tokenizer = AutoTokenizer.from_pretrained(checkpoint) inputs = tokenizer(prompt, return_tensors=\"pt\").to(device) model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device) outputs = model.generate(**inputs, assistant_model=assistant_model) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) # ['Alice and Bob are sitting in a bar. 
Alice is drinking a beer and Bob is drinking a'] ``` Is the additional internal complexity worth it? Let\u2019s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of `1`. These results were pulled directly out of Transformers without any additional optimizations, so you should be able to reproduce them in your setup. Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet \u2013 you should benchmark it before applying it to your use case. We can conclude that assisted generation: 1. Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better); 2. Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory; 3. If you\u2019re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups; 4. Shines in input-grounded tasks, like automatic speech recognition or summarization. ## Sample with assisted generation Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn\u2019t mean that you can\u2019t use assisted generation with multinomial sampling! Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that\u2019s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below. Why don't you see it for yourself, so get a feeling of assisted generation? ## Future directions Assisted generation shows that modern text generation strategies are ripe for optimization. Understanding that it is currently a memory-bound problem, not a compute-bound problem, allows us to apply simple heuristics to get the most out of the available memory bandwidth, alleviating the bottleneck. We believe that further refinement of the use of assistant models will get us even bigger latency reductions - for instance, we may be able to skip a few more forward passes if we request the assistant to generate several candidate continuations. Naturally, releasing high-quality small models to be used as assistants will be critical to realizing and amplifying the benefits. Initially released under our Transformers library, to be used with the `.generate()` function, we expect to offer it throughout the Hugging Face universe. Its implementation is also completely open-source so, if you\u2019re working on text generation and not using our tools, feel free to use it as a reference. Finally, assisted generation resurfaces a crucial question in text generation. 
The field has been evolving with the constraint where all new tokens are the result of a fixed amount of compute, for a given model. One token per homogeneous forward pass, in pure autoregressive fashion. This blog post reinforces the idea that it shouldn\u2019t be the case: large subsections of the generated output can also be equally generated by models that are a fraction of the size. For that, we\u2019ll need new model architectures and decoding methods \u2013 we\u2019re excited to see what the future holds! ## Related Work After the original release of this blog post, it came to my attention that other works have explored the same core principle (use a forward pass to validate longer continuations). In particular, have a look at the following works: - [Blockwise Parallel Decoding]( by Google Brain - [Speculative Sampling]( by DeepMind ## Citation ```bibtex @misc {gante2023assisted, author = { {Joao Gante} }, title = { Assisted Generation: a new direction toward low-latency text generation }, year = 2023, url = { }, doi = { 10.57967/hf/0638 }, publisher = { Hugging Face Blog } } ``` ## Acknowledgements I'd like to thank Sylvain Gugger, Nicolas Patry, and Lewis Tunstall for sharing many valuable suggestions to improve this blog post. Finally, kudos to Chunte Lee for designing the gorgeous cover you can see in our web page."}
{"title": "audio-datasets.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"A Complete Guide to Audio Datasets\" thumbnail: /blog/assets/116_audio_datasets/thumbnail.jpg authors: - user: sanchit-gandhi --- # A Complete Guide to Audio Datasets ## Introduction Datasets is an open-source library for downloading and preparing datasets from all domains. Its minimalistic API allows users to download and prepare datasets in just one line of Python code, with a suite of functions that enable efficient pre-processing. The number of datasets available is unparalleled, with all the most popular machine learning datasets available to download. Not only this, but Datasets comes prepared with multiple audio-specific features that make working with audio datasets easy for researchers and practitioners alike. In this blog, we'll demonstrate these features, showcasing why Datasets is the go-to place for downloading and preparing audio datasets. ## Contents 1. [The Hub](#the-hub) 2. [Load an Audio Dataset](#load-an-audio-dataset) 3. [Easy to Load, Easy to Process](#easy-to-load-easy-to-process) 4. [Streaming Mode: The Silver Bullet](#streaming-mode-the-silver-bullet) 5. [A Tour of Audio Datasets on the Hub](#a-tour-of-audio-datasets-on-the-hub) 6. [Closing Remarks](#closing-remarks) ## The Hub The Hugging Face Hub is a platform for hosting models, datasets and demos, all open source and publicly available. It is home to a growing collection of audio datasets that span a variety of domains, tasks and languages. Through tight integrations with Datasets, all the datasets on the Hub can be downloaded in one line of code. Let's head to the Hub and filter the datasets by task: * [Speech Recognition Datasets on the Hub]( * [Audio Classification Datasets on the Hub]( At the time of writing, there are 77 speech recognition datasets and 28 audio classification datasets on the Hub, with these numbers ever-increasing. You can select any one of these datasets to suit your needs. Let's check out the first speech recognition result. Clicking on [`common_voice`]( brings up the dataset card: Here, we can find additional information about the dataset, see what models are trained on the dataset and, most excitingly, listen to actual audio samples. The Dataset Preview is presented in the middle of the dataset card. It shows us the first 100 samples for each subset and split. What's more, it's loaded up the audio samples ready for us to listen to in real-time. If we hit the play button on the first sample, we can listen to the audio and see the corresponding text. The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start using it. ## Load an Audio Dataset One of the key defining features of Datasets is the ability to download and prepare a dataset in just one line of Python code. This is made possible through the [`load_dataset`]( function. Conventionally, loading a dataset involves: i) downloading the raw data, ii) extracting it from its compressed format, and iii) preparing individual samples and splits. Using `load_dataset`, all of the heavy lifting is done under the hood. Let's take the example of loading the [GigaSpeech]( dataset from Speech Colab. 
GigaSpeech is a relatively recent speech recognition dataset for benchmarking academic speech systems and is one of many audio datasets available on the Hugging Face Hub. To load the GigaSpeech dataset, we simply take the dataset's identifier on the Hub (`speechcolab/gigaspeech`) and specify it to the [`load_dataset`]( function. GigaSpeech comes in five configurations of increasing size, ranging from `xs` (10 hours) to `xl`(10,000 hours). For the purpose of this tutorial, we'll load the smallest of these configurations. The dataset's identifier and the desired configuration are all that we require to download the dataset: ```python from datasets import load_dataset gigaspeech = load_dataset(\"speechcolab/gigaspeech\", \"xs\") print(gigaspeech) ``` **Print Output:** ```python DatasetDict({ train: Dataset({ features: ['segment_id', 'speaker', 'text', 'audio', 'begin_time', 'end_time', 'audio_id', 'title', 'url', 'source', 'category', 'original_full_path'], num_rows: 9389 }) validation: Dataset({ features: ['segment_id', 'speaker', 'text', 'audio', 'begin_time', 'end_time', 'audio_id', 'title', 'url', 'source', 'category', 'original_full_path'], num_rows: 6750 }) test: Dataset({ features: ['segment_id', 'speaker', 'text', 'audio', 'begin_time', 'end_time', 'audio_id', 'title', 'url', 'source', 'category', 'original_full_path'], num_rows: 25619 }) }) ``` And just like that, we have the GigaSpeech dataset ready! There simply is no easier way of loading an audio dataset. We can see that we have the training, validation and test splits pre-partitioned, with the corresponding information for each. The object `gigaspeech` returned by the `load_dataset` function is a [`DatasetDict`]( We can treat it in much the same way as an ordinary Python dictionary. To get the train split, we pass the corresponding key to the `gigaspeech` dictionary: ```python print(gigaspeech[\"train\"]) ``` **Print Output:** ```python Dataset({ features: ['segment_id', 'speaker', 'text', 'audio', 'begin_time', 'end_time', 'audio_id', 'title', 'url', 'source', 'category', 'original_full_path'], num_rows: 9389 }) ``` This returns a [`Dataset`]( object, which contains the data for the training split. We can go one level deeper and get the first item of the split. Again, this is possible through standard Python indexing: ```python print(gigaspeech[\"train\"][0]) ``` **Print Output:** ```python {'segment_id': 'YOU0000000315_S0000660', 'speaker': 'N/A', 'text': \"AS THEY'RE LEAVING CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY \", 'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/7f8541f130925e9b2af7d37256f2f61f9d6ff21bf4a94f7c1a3803ec648d7d79/xs_chunks_0000/YOU0000000315_S0000660.wav', 'array': array([0.0005188 , 0.00085449, 0.00012207, ..., 0.00125122, 0.00076294, 0.00036621], dtype=float32), 'sampling_rate': 16000 }, 'begin_time': 2941.889892578125, 'end_time': 2945.070068359375, 'audio_id': 'YOU0000000315', 'title': 'Return to Vasselheim | Critical Role: VOX MACHINA | Episode 43', 'url': ' 'source': 2, 'category': 24, 'original_full_path': 'audio/youtube/P0004/YOU0000000315.opus', } ``` We can see that there are a number of features returned by the training split, including `segment_id`, `speaker`, `text`, `audio` and more. For speech recognition, we'll be concerned with the `text` and `audio` columns. 
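Before trimming the dataset down to these two columns, it can be handy to listen to a sample directly. Here's a minimal sketch, assuming you're working in a Jupyter or Colab notebook:

```python
from IPython.display import Audio

sample = gigaspeech[\"train\"][0][\"audio\"]

# play the raw waveform at its native sampling rate (requires a notebook environment)
Audio(data=sample[\"array\"], rate=sample[\"sampling_rate\"])
```
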
Using Datasets' [`remove_columns`]( method, we can remove the dataset features not required for speech recognition: ```python COLUMNS_TO_KEEP = [\"text\", \"audio\"] all_columns = gigaspeech[\"train\"].column_names columns_to_remove = set(all_columns) - set(COLUMNS_TO_KEEP) gigaspeech = gigaspeech.remove_columns(columns_to_remove) ``` Let's check that we've successfully retained the `text` and `audio` columns: ```python print(gigaspeech[\"train\"][0]) ``` **Print Output:** ```python {'text': \"AS THEY'RE LEAVING CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY \", 'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/7f8541f130925e9b2af7d37256f2f61f9d6ff21bf4a94f7c1a3803ec648d7d79/xs_chunks_0000/YOU0000000315_S0000660.wav', 'array': array([0.0005188 , 0.00085449, 0.00012207, ..., 0.00125122, 0.00076294, 0.00036621], dtype=float32), 'sampling_rate': 16000}} ``` Great! We can see that we've got the two required columns `text` and `audio`. The `text` is a string with the sample transcription and the `audio` a 1-dimensional array of amplitude values at a sampling rate of 16KHz. That's our dataset loaded! ## Easy to Load, Easy to Process Loading a dataset with Datasets is just half of the fun. We can now use the suite of tools available to efficiently pre-process our data ready for model training or inference. In this Section, we'll perform three stages of data pre-processing: 1. [Resampling the Audio Data](#1-resampling-the-audio-data) 2. [Pre-Processing Function](#2-pre-processing-function) 3. [Filtering Function](#3-filtering-function) ### 1. Resampling the Audio Data The `load_dataset` function prepares audio samples with the sampling rate that they were published with. This is not always the sampling rate expected by our model. In this case, we need to _resample_ the audio to the correct sampling rate. We can set the audio inputs to our desired sampling rate using Datasets' [`cast_column`]( method. This operation does not change the audio in-place, but rather signals to `datasets` to resample the audio samples _on the fly_ when they are loaded. The following code cell will set the sampling rate to 8kHz: ```python from datasets import Audio gigaspeech = gigaspeech.cast_column(\"audio\", Audio(sampling_rate=8000)) ``` Re-loading the first audio sample in the GigaSpeech dataset will resample it to the desired sampling rate: ```python print(gigaspeech[\"train\"][0]) ``` **Print Output:** ```python {'text': \"AS THEY'RE LEAVING CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY \", 'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/7f8541f130925e9b2af7d37256f2f61f9d6ff21bf4a94f7c1a3803ec648d7d79/xs_chunks_0000/YOU0000000315_S0000660.wav', 'array': array([ 0.00046338, 0.00034808, -0.00086153, ..., 0.00099299, 0.00083484, 0.00080221], dtype=float32), 'sampling_rate': 8000} } ``` We can see that the sampling rate has been downsampled to 8kHz. The array values are also different, as we've now only got approximately one amplitude value for every two that we had before. 
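One quick way to convince yourself of this is to check that the clip duration stays the same while the number of samples roughly halves. A small sketch (variable names are just for illustration):

```python
audio_8k = gigaspeech[\"train\"][0][\"audio\"]

num_samples = len(audio_8k[\"array\"])
duration = num_samples / audio_8k[\"sampling_rate\"]

# the duration in seconds is unchanged by resampling,
# but the array now holds roughly half as many amplitude values
print(f\"{num_samples} samples at {audio_8k['sampling_rate']} Hz = {duration:.2f} s\")
```
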
Let's set the dataset sampling rate back to 16kHz, the sampling rate expected by most speech recognition models: ```python gigaspeech = gigaspeech.cast_column(\"audio\", Audio(sampling_rate=16000)) print(gigaspeech[\"train\"][0]) ``` **Print Output:** ```python {'text': \"AS THEY'RE LEAVING CAN KASH PULL ZAHRA ASIDE REALLY QUICKLY \", 'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/7f8541f130925e9b2af7d37256f2f61f9d6ff21bf4a94f7c1a3803ec648d7d79/xs_chunks_0000/YOU0000000315_S0000660.wav', 'array': array([0.0005188 , 0.00085449, 0.00012207, ..., 0.00125122, 0.00076294, 0.00036621], dtype=float32), 'sampling_rate': 16000} } ``` Easy! `cast_column` provides a straightforward mechanism for resampling audio datasets as and when required. ### 2. Pre-Processing Function One of the most challenging aspects of working with audio datasets is preparing the data in the right format for our model. Using Datasets' [`map`]( method, we can write a function to pre-process a single sample of the dataset, and then apply it to every sample without any code changes. First, let's load a processor object from Transformers. This processor pre-processes the audio to input features and tokenises the target text to labels. The `AutoProcessor` class is used to load a processor from a given model checkpoint. In the example, we load the processor from OpenAI's [Whisper medium.en]( checkpoint, but you can change this to any model identifier on the Hugging Face Hub: ```python from transformers import AutoProcessor processor = AutoProcessor.from_pretrained(\"openai/whisper-medium.en\") ``` Great! Now we can write a function that takes a single training sample and passes it through the `processor` to prepare it for our model. We'll also compute the input length of each audio sample, information that we'll need for the next data preparation step: ```python def prepare_dataset(batch): audio = batch[\"audio\"] batch = processor(audio[\"array\"], sampling_rate=audio[\"sampling_rate\"], text=batch[\"text\"]) batch[\"input_length\"] = len(audio[\"array\"]) / audio[\"sampling_rate\"] return batch ``` We can apply the data preparation function to all of our training examples using Datasets' `map` method. Here, we also remove the `text` and `audio` columns, since we have pre-processed the audio to input features and tokenised the text to labels: ```python gigaspeech = gigaspeech.map(prepare_dataset, remove_columns=gigaspeech[\"train\"].column_names) ``` ### 3. Filtering Function Prior to training, we might have a heuristic for filtering our training data. For instance, we might want to filter any audio samples longer than 30s to prevent truncating the audio samples or risking out-of-memory errors. We can do this in much the same way that we prepared the data for our model in the previous step. We start by writing a function that indicates which samples to keep and which to discard. This function, `is_audio_length_in_range`, returns a boolean: samples that are shorter than 30s return True, and those that are longer return False. ```python MAX_DURATION_IN_SECONDS = 30.0 def is_audio_length_in_range(input_length): return input_length < MAX_DURATION_IN_SECONDS ``` This function can then be applied with Datasets' `filter` method, using the `input_length` column we computed in the previous step, to discard any samples longer than 30s. ## Streaming Mode: The Silver Bullet Figure 1: Streaming mode. The dataset is loaded progressively as we iterate over the dataset. Streaming mode has three primary advantages over downloading the entire dataset at once: 1. **Disk space:** samples are loaded to memory one-by-one as we iterate over the dataset. 
Since the data is not downloaded locally, there are no disk space requirements, so you can use datasets of arbitrary size. 2. **Download and processing time:** audio datasets are large and need a significant amount of time to download and process. With streaming, loading and processing is done on the fly, meaning you can start using the dataset as soon as the first sample is ready. 3. **Easy experimentation:** you can experiment on a handful samples to check that your script works without having to download the entire dataset. There is one caveat to streaming mode. When downloading a dataset, both the raw data and processed data are saved locally to disk. If we want to re-use this dataset, we can directly load the processed data from disk, skipping the download and processing steps. Consequently, we only have to perform the downloading and processing operations once, after which we can re-use the prepared data. With streaming mode, the data is not downloaded to disk. Thus, neither the downloaded nor pre-processed data are cached. If we want to re-use the dataset, the streaming steps must be repeated, with the audio files loaded and processed on the fly again. For this reason, it is advised to download datasets that you are likely to use multiple times. How can you enable streaming mode? Easy! Just set `streaming=True` when you load your dataset. The rest will be taken care for you: ```python gigaspeech = load_dataset(\"speechcolab/gigaspeech\", \"xs\", streaming=True) ``` All the steps covered so far in this tutorial can be applied to the streaming dataset without any code changes. The only change is that you can no longer access individual samples using Python indexing (i.e. `gigaspeech[\"train\"][sample_idx]`). Instead, you have to iterate over the dataset, using a `for` loop for example. Streaming mode can take your research to the next level: not only are the biggest datasets accessible to you, but you can easily evaluate systems over multiple datasets in one go without worrying about your disk space. Compared to evaluating on a single dataset, multi-dataset evaluation gives a better metric for the generalisation abilities of a speech recognition system (_c.f._ [End-to-end Speech Benchmark (ESB)]( The accompanying [Google Colab]( provides an example for evaluating the Whisper model on eight English speech recognition datasets in one script using streaming mode. ## A Tour of Audio Datasets on The Hub This Section serves as a reference guide for the most popular speech recognition, speech translation and audio classification datasets on the Hugging Face Hub. We can apply everything that we've covered for the GigaSpeech dataset to any of the datasets on the Hub. All we have to do is switch the dataset identifier in the `load_dataset` function. It's that easy! 1. [English Speech Recognition](#english-speech-recognition) 2. [Multilingual Speech Recognition](#multilingual-speech-recognition) 3. [Speech Translation](#speech-translation) 4. [Audio Classification](#audio-classification) ### English Speech Recognition Speech recognition, or speech-to-text, is the task of mapping from spoken speech to written text, where both the speech and text are in the same language. 
We provide a summary of the most popular English speech recognition datasets on the Hub:

| Dataset | Domain | Speaking Style | Train Hours | Casing | Punctuation | License | Recommended Use |
|---|---|---|---|---|---|---|---|
| [LibriSpeech]( | Audiobook | Narrated | 960 | | | CC-BY-4.0 | Academic benchmarks |
| [Common Voice 11]( | Wikipedia | Narrated | 2300 | | | CC0-1.0 | Non-native speakers |
| [VoxPopuli]( | European Parliament | Oratory | 540 | | | CC0 | Non-native speakers |
| [TED-LIUM]( | TED talks | Oratory | 450 | | | CC-BY-NC-ND 3.0 | Technical topics |
| [GigaSpeech]( | Audiobook, podcast, YouTube | Narrated, spontaneous | 10000 | | | apache-2.0 | Robustness over multiple domains |
| [SPGISpeech]( | Financial meetings | Oratory, spontaneous | 5000 | | | User Agreement | Fully formatted transcriptions |
| [Earnings-22]( | Financial meetings | Oratory, spontaneous | 119 | | | CC-BY-SA-4.0 | Diversity of accents |
| [AMI]( | Meetings | Spontaneous | 100 | | | CC-BY-4.0 | Noisy speech conditions |

Refer to the [Google Colab]( for a guide on evaluating a system on all eight English speech recognition datasets in one script. The following dataset descriptions are largely taken from the [ESB Benchmark]( paper. #### [LibriSpeech ASR]( LibriSpeech is a standard large-scale dataset for evaluating ASR systems. It consists of approximately 1,000 hours of narrated audiobooks collected from the [LibriVox]( project. LibriSpeech has been instrumental in enabling researchers to leverage a large body of pre-existing transcribed speech data. As such, it has become one of the most popular datasets for benchmarking academic speech systems. ```python librispeech = load_dataset(\"librispeech_asr\", \"all\") ``` #### [Common Voice]( Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. Since anyone can contribute recordings, there is significant variation in both audio quality and speakers. The audio conditions are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign words. The transcriptions are both cased and punctuated. The English subset of version 11.0 contains approximately 2,300 hours of validated data. Use of the dataset requires you to agree to the Common Voice terms of use, which can be found on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0]( Once you have agreed to the terms of use, you will be granted access to the dataset. You will then need to provide an [authentication token]( from the Hub when you load the dataset. ```python common_voice = load_dataset(\"mozilla-foundation/common_voice_11_0\", \"en\", use_auth_token=True) ``` #### [VoxPopuli]( VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique domain of oratory, political speech, largely sourced from non-native speakers. The English subset contains approximately 550 hours of labelled speech. ```python voxpopuli = load_dataset(\"facebook/voxpopuli\", \"en\") ``` #### [TED-LIUM]( TED-LIUM is a dataset based on English-language TED Talk conference videos. The speaking style is oratory educational talks. 
The transcribed talks cover a range of different cultural, political, and academic topics, resulting in a technical vocabulary. The Release 3 (latest) edition of the dataset contains approximately 450 hours of training data. The validation and test data are from the legacy set, consistent with earlier releases. ```python tedlium = load_dataset(\"LIUM/tedlium\", \"release3\") ``` #### [GigaSpeech]( GigaSpeech is a multi-domain English speech recognition corpus curated from audiobooks, podcasts and YouTube. It covers both narrated and spontaneous speech over a variety of topics, such as arts, science and sports. It contains training splits varying from 10 hours - 10,000 hours and standardised validation and test splits. ```python gigaspeech = load_dataset(\"speechcolab/gigaspeech\", \"xs\", use_auth_token=True) ``` #### [SPGISpeech]( SPGISpeech is an English speech recognition corpus composed of company earnings calls that have been manually transcribed by S&P Global, Inc. The transcriptions are fully-formatted according to a professional style guide for oratory and spontaneous speech. It contains training splits ranging from 200 hours - 5,000 hours, with canonical validation and test splits. ```python spgispeech = load_dataset(\"kensho/spgispeech\", \"s\", use_auth_token=True) ``` #### [Earnings-22]( Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies. The dataset was developed with the goal of aggregating a broad range of speakers and accents covering a range of real-world financial topics. There is large diversity in the speakers and accents, with speakers taken from seven different language regions. Earnings-22 was published primarily as a test-only dataset. The Hub contains a version of the dataset that has been partitioned into train-validation-test splits. ```python earnings22 = load_dataset(\"revdotcom/earnings22\") ``` #### [AMI]( AMI comprises 100 hours of meeting recordings captured using different recording streams. The corpus contains manually annotated orthographic transcriptions of the meetings aligned at the word level. Individual samples of the AMI dataset contain very large audio files (between 10 and 60 minutes), which are segmented to lengths feasible for training most speech recognition systems. AMI contains two splits: IHM and SDM. IHM (individual headset microphone) contains easier near-field speech, and SDM (single distant microphone) harder far-field speech. ```python ami = load_dataset(\"edinburghcstr/ami\", \"ihm\") ``` ### Multilingual Speech Recognition Multilingual speech recognition refers to speech recognition (speech-to-text) for all languages except English. #### [Multilingual LibriSpeech]( Multilingual LibriSpeech is the multilingual equivalent of the [LibriSpeech ASR]( corpus. It comprises a large corpus of read audiobooks taken from the [LibriVox]( project, making it a suitable dataset for academic research. It contains data split into eight high-resource languages - English, German, Dutch, Spanish, French, Italian, Portuguese and Polish. #### [Common Voice]( Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. Since anyone can contribute recordings, there is significant variation in both audio quality and speakers. The audio conditions are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign words. The transcriptions are both cased and punctuated. 
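Loading a specific language works just as it did for English, only with a different configuration name. For example, a sketch for the French subset (the language code is an illustrative choice; as before, this assumes you have accepted the terms of use and are authenticated with the Hub):

```python
# French subset of Common Voice 11; requires accepting the terms of use on the Hub
common_voice_fr = load_dataset(\"mozilla-foundation/common_voice_11_0\", \"fr\", use_auth_token=True)
```
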
As of version 11, there are over 100 languages available, both low and high-resource. #### [VoxPopuli]( VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique domain of oratory, political speech, largely sourced from non-native speakers. It contains labelled audio-transcription data for 15 European languages. #### [FLEURS]( FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the [FLoRes-101]( dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native language. The recorded audio data is paired with the sentence transcriptions to yield multilingual speech recognition over all 101 languages. The training sets contain approximately 10 hours of supervised audio-transcription data per language. ### Speech Translation Speech translation is the task of mapping from spoken speech to written text, where the speech and text are in different languages (e.g. English speech to French text). #### [CoVoST 2]( CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of crowd-sourced voice recordings. There are 2,900 hours of speech represented in the corpus. #### [FLEURS]( FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the [FLoRes-101]( dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native languages. An \\\\(n\\\\)-way parallel corpus of speech translation data is constructed by pairing the recorded audio data with the sentence transcriptions for each of the 101 languages. The training sets contain approximately 10 hours of supervised audio-transcription data per source-target language combination. ### Audio Classification Audio classification is the task of mapping a raw audio input to a class label output. Practical applications of audio classification include keyword spotting, speaker intent and language identification. #### [SpeechCommands]( SpeechCommands is a dataset comprised of one-second audio files, each containing either a single spoken word in English or background noise. The words are taken from a small set of commands and are spoken by a number of different speakers. The dataset is designed to help train and evaluate small on-device keyword spotting systems. #### [Multilingual Spoken Words]( Multilingual Spoken Words is a large-scale corpus of one-second audio samples, each containing a single spoken word. The dataset consists of 50 languages and more than 340,000 keywords, totalling 23.4 million one-second spoken examples or over 6,000 hours of audio. The audio-transcription data is sourced from the Mozilla Common Voice project. 
Time stamps are generated for every utterance on the word-level and used to extract individual spoken words and their corresponding transcriptions, thus forming a new corpus of single spoken words. The dataset's intended use is academic research and commercial applications in multilingual keyword spotting and spoken term search. #### [FLEURS]( FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the [FLoRes-101]( dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native languages. The recorded audio data is paired with a label for the language in which it is spoken. The dataset can be used as an audio classification dataset for _language identification_: systems are trained to predict the language of each utterance in the corpus. ## Closing Remarks In this blog post, we explored the Hugging Face Hub and experienced the Dataset Preview, an effective means of listening to audio datasets before downloading them. We loaded an audio dataset with one line of Python code and performed a series of generic pre-processing steps to prepare it for a machine learning model. In total, this required just 13 lines of code, relying on simple Python functions to perform the necessary operations. We introduced streaming mode, a method for loading and preparing samples of audio data on the fly. We concluded by summarising the most popular speech recognition, speech translation and audio classification datasets on the Hub. Having read this blog, we hope you agree that Datasets is the number one place for downloading and preparing audio datasets. Datasets is made possible through the work of the community. If you would like to contribute a dataset, refer to the [Guide for Adding a New Dataset]( *Thank you to the following individuals who help contribute to the blog post: Vaibhav Srivastav, Polina Kazakova, Patrick von Platen, Omar Sanseviero and Quentin Lhoest.*"}
{"title": "autoformer.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)\" thumbnail: /blog/assets/150_autoformer/thumbnail.png authors: - user: elisim guest: true - user: kashif - user: nielsr --- # Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer) ## Introduction A few months ago, we introduced the [Informer]( model ([Zhou, Haoyi, et al., 2021]( which is a Time Series Transformer that won the AAAI 2021 best paper award. We also provided an example for multivariate probabilistic forecasting with Informer. In this post, we discuss the question: [Are Transformers Effective for Time Series Forecasting?]( (AAAI 2023). As we will see, they are. Firstly, we will provide empirical evidence that **Transformers are indeed Effective for Time Series Forecasting**. Our comparison shows that the simple linear model, known as _DLinear_, is not better than Transformers as claimed. When compared against equivalent sized models in the same setting as the linear models, the Transformer-based models perform better on the test set metrics we consider. Afterwards, we will introduce the _Autoformer_ model ([Wu, Haixu, et al., 2021]( which was published in NeurIPS 2021 after the Informer model. The Autoformer model is [now available]( in Transformers. Finally, we will discuss the _DLinear_ model, which is a simple feedforward network that uses the decomposition layer from Autoformer. The DLinear model was first introduced in [Are Transformers Effective for Time Series Forecasting?]( and claimed to outperform Transformer-based models in time-series forecasting. Let's go! ## Benchmarking - Transformers vs. DLinear In the paper [Are Transformers Effective for Time Series Forecasting?]( published recently in AAAI 2023, the authors claim that Transformers are not effective for time series forecasting. They compare the Transformer-based models against a simple linear model, which they call _DLinear_. The DLinear model uses the decomposition layer from the Autoformer model, which we will introduce later in this post. The authors claim that the DLinear model outperforms the Transformer-based models in time-series forecasting. Is that so? Let's find out. | Dataset | Autoformer (uni.) MASE | DLinear MASE | |||| | `Traffic` | 0.910 | 0.965 | | `Exchange-Rate` | 1.087 | 1.690 | | `Electricity` | 0.751 | 0.831 | The table above shows the results of the comparison between the Autoformer and DLinear models on the three datasets used in the paper. The results show that the Autoformer model outperforms the DLinear model on all three datasets. Next, we will present the new Autoformer model along with the DLinear model. We will showcase how to compare them on the Traffic dataset from the table above, and provide explanations for the results we obtained. **TL;DR:** A simple linear model, while advantageous in certain cases, has no capacity to incorporate covariates compared to more complex models like transformers in the univariate setting. ## Autoformer - Under The Hood Autoformer builds upon the traditional method of decomposing time series into seasonality and trend-cycle components. This is achieved through the incorporation of a _Decomposition Layer_, which enhances the model's ability to capture these components accurately. Moreover, Autoformer introduces an innovative auto-correlation mechanism that replaces the standard self-attention used in the vanilla transformer. 
This mechanism enables the model to utilize period-based dependencies in the attention, thus improving the overall performance. In the upcoming sections, we will delve into the two key contributions of Autoformer: the _Decomposition Layer_ and the _Attention (Autocorrelation) Mechanism_. We will also provide code examples to illustrate how these components function within the Autoformer architecture. ### Decomposition Layer Decomposition has long been a popular method in time series analysis, but it had not been extensively incorporated into deep learning models until the introduction of the Autoformer paper. Following a brief explanation of the concept, we will demonstrate how the idea is applied in Autoformer using PyTorch code. #### Decomposition of Time Series In time series analysis, [decomposition]( is a method of breaking down a time series into three systematic components: trend-cycle, seasonal variation, and random fluctuations. The trend component represents the long-term direction of the time series, which can be increasing, decreasing, or stable over time. The seasonal component represents the recurring patterns that occur within the time series, such as yearly or quarterly cycles. Finally, the random (sometimes called \"irregular\") component represents the random noise in the data that cannot be explained by the trend or seasonal components. Two main types of decomposition are additive and multiplicative decomposition, which are implemented in the [great statsmodels library]( By decomposing a time series into these components, we can better understand and model the underlying patterns in the data. But how can we incorporate decomposition into the Transformer architecture? Let's see how Autoformer does it. #### Decomposition in Autoformer | ]( highlighting its significance in time series modeling. Now, let's define the decomposition layer formally: For an input series \\\\(\\mathcal{X} \\in \\mathbb{R}^{L \\times d}\\\\) with length \\\\(L\\\\), the decomposition layer returns \\\\(\\mathcal{X}_\\textrm{trend}, \\mathcal{X}_\\textrm{seasonal}\\\\) defined as: $$ \\mathcal{X}_\\textrm{trend} = \\textrm{AvgPool(Padding(} \\mathcal{X} \\textrm{))} \\\\ \\mathcal{X}_\\textrm{seasonal} = \\mathcal{X} - \\mathcal{X}_\\textrm{trend} $$ And the implementation in PyTorch: ```python import torch from torch import nn class DecompositionLayer(nn.Module): \"\"\" Returns the trend and the seasonal parts of the time series. \"\"\" def __init__(self, kernel_size): super().__init__() self.kernel_size = kernel_size self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0) # moving average def forward(self, x): \"\"\"Input shape: Batch x Time x EMBED_DIM\"\"\" # padding on the both ends of time series num_of_pads = (self.kernel_size - 1) // 2 front = x[:, 0:1, :].repeat(1, num_of_pads, 1) end = x[:, -1:, :].repeat(1, num_of_pads, 1) x_padded = torch.cat([front, x, end], dim=1) # calculate the trend and seasonal part of the series x_trend = self.avg(x_padded.permute(0, 2, 1)).permute(0, 2, 1) x_seasonal = x - x_trend return x_seasonal, x_trend ``` As you can see, the implementation is quite simple and can be used in other models, as we will see with DLinear. Now, let's explain the second contribution - _Attention (Autocorrelation) Mechanism_. 
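As a quick sanity check, here is a hypothetical usage of the `DecompositionLayer` above on a random batch; the shapes and kernel size are arbitrary choices for illustration:

```python
import torch

# toy batch: 4 series, 96 time steps, 8 embedding dimensions
x = torch.randn(4, 96, 8)

decomp = DecompositionLayer(kernel_size=25)
seasonal, trend = decomp(x)

print(seasonal.shape, trend.shape)  # both retain the input shape: torch.Size([4, 96, 8])

# by construction, the two components sum back to the original series
assert torch.allclose(seasonal + trend, x, atol=1e-6)
```
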
### Attention (Autocorrelation) Mechanism | , _autocorrelation_ for a single discrete variable \\\\(y\\\\) is used to measure the \"relationship\" (pearson correlation) between the variable's current value at time \\\\(t\\\\) to its past value at time \\\\(t-\\tau\\\\): $$ \\textrm{Autocorrelation}(\\tau) = \\textrm{Corr}(y_t, y_{t-\\tau}) $$ Using autocorrelation, Autoformer extracts frequency-based dependencies from the queries and keys, instead of the standard dot-product between them. You can think about it as a replacement for the \\\\(QK^T\\\\) term in the self-attention. In practice, autocorrelation of the queries and keys for **all lags** is calculated at once by FFT. By doing so, the autocorrelation mechanism achieves \\\\(O(L \\log L)\\\\) time complexity (where \\\\(L\\\\) is the input time length), similar to [Informer's ProbSparse attention]( Note that the theory behind computing autocorrelation using FFT is based on the [Wiener\u2013Khinchin theorem]( which is outside the scope of this blog post. Now, we are ready to see the code in PyTorch: ```python import torch def autocorrelation(query_states, key_states): \"\"\" Computes autocorrelation(Q,K) using `torch.fft`. Think about it as a replacement for the QK^T in the self-attention. Assumption: states are resized to same shape of [batch_size, time_length, embedding_dim]. \"\"\" query_states_fft = torch.fft.rfft(query_states, dim=1) key_states_fft = torch.fft.rfft(key_states, dim=1) attn_weights = query_states_fft * torch.conj(key_states_fft) attn_weights = torch.fft.irfft(attn_weights, dim=1) return attn_weights ``` Quite simple! Please be aware that this is only a partial implementation of `autocorrelation(Q,K)`, and the full implementation can be found in Transformers. Next, we will see how to aggregate our `attn_weights` with the values by time delay, process which is termed as _Time Delay Aggregation_. #### Time Delay Aggregation |  as \\\\(\\mathcal{R_{Q,K}}\\\\). The question arises: how do we aggregate these \\\\(\\mathcal{R_{Q,K}}(\\tau_1), \\mathcal{R_{Q,K}}(\\tau_2), ..., \\mathcal{R_{Q,K}}(\\tau_k)\\\\) with \\\\(\\mathcal{V}\\\\)? In the standard self-attention mechanism, this aggregation is accomplished through dot-product. However, in Autoformer, we employ a different approach. Firstly, we align \\\\(\\mathcal{V}\\\\) by calculating its value for each time delay \\\\(\\tau_1, \\tau_2, ... \\tau_k\\\\), which is also known as _Rolling_. Subsequently, we conduct element-wise multiplication between the aligned \\\\(\\mathcal{V}\\\\) and the autocorrelations. In the provided figure, you can observe the left side showcasing the rolling of \\\\(\\mathcal{V}\\\\) by time delay, while the right side illustrates the element-wise multiplication with the autocorrelations. It can be summarized with the following equations: $$ \\tau_1, \\tau_2, ... \\tau_k = \\textrm{arg Top-k}(\\mathcal{R_{Q,K}}(\\tau)) \\\\ \\hat{\\mathcal{R}}\\mathcal{_{Q,K}}(\\tau _1), \\hat{\\mathcal{R}}\\mathcal{_{Q,K}}(\\tau _2), ..., \\hat{\\mathcal{R}}\\mathcal{_{Q,K}}(\\tau _k) = \\textrm{Softmax}(\\mathcal{R_{Q,K}}(\\tau _1), \\mathcal{R_{Q,K}}(\\tau_2), ..., \\mathcal{R_{Q,K}}(\\tau_k)) \\\\ \\textrm{Autocorrelation-Attention} = \\sum_{i=1}^k \\textrm{Roll}(\\mathcal{V}, \\tau_i) \\cdot \\hat{\\mathcal{R}}\\mathcal{_{Q,K}}(\\tau _i) $$ And that's it! 
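Before moving on to the aggregation code, here is a quick, hypothetical shape check for the `autocorrelation` helper defined above (random tensors, arbitrary sizes):

```python
import torch

batch_size, time_length, embed_dim = 2, 64, 32
queries = torch.randn(batch_size, time_length, embed_dim)
keys = torch.randn(batch_size, time_length, embed_dim)

attn_weights = autocorrelation(queries, keys)

# one autocorrelation value per lag and embedding dimension
print(attn_weights.shape)  # torch.Size([2, 64, 32])
```
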
Note that \\\\(k\\\\) is controlled by a hyperparameter called `autocorrelation_factor` (similar to `sampling_factor` in [Informer]( and softmax is applied to the autocorrelations before the multiplication. Now, we are ready to see the final code: ```python import torch import math def time_delay_aggregation(attn_weights, value_states, autocorrelation_factor=2): \"\"\" Computes aggregation as value_states.roll(delay) * top_k_autocorrelations(delay). The final result is the autocorrelation-attention output. Think about it as a replacement of the dot-product between attn_weights and value states. The autocorrelation_factor is used to find top k autocorrelations delays. Assumption: value_states and attn_weights shape: [batch_size, time_length, embedding_dim] \"\"\" bsz, num_heads, tgt_len, channel = ... time_length = value_states.size(1) autocorrelations = attn_weights.view(bsz, num_heads, tgt_len, channel) # find top k autocorrelations delays top_k = int(autocorrelation_factor * math.log(time_length)) autocorrelations_mean = torch.mean(autocorrelations, dim=(1, -1)) # bsz x tgt_len top_k_autocorrelations, top_k_delays = torch.topk(autocorrelations_mean, top_k, dim=1) # apply softmax on the channel dim top_k_autocorrelations = torch.softmax(top_k_autocorrelations, dim=-1) # bsz x top_k # compute aggregation: value_states.roll(delay) * top_k_autocorrelations(delay) delays_agg = torch.zeros_like(value_states).float() # bsz x time_length x channel for i in range(top_k): value_states_roll_delay = value_states.roll(shifts=-int(top_k_delays[i]), dims=1) top_k_at_delay = top_k_autocorrelations[:, i] # aggregation top_k_resized = top_k_at_delay.view(-1, 1, 1).repeat(num_heads, tgt_len, channel) delays_agg += value_states_roll_delay * top_k_resized attn_output = delays_agg.contiguous() return attn_output ``` We did it! The Autoformer model is [now available]( in the Transformers library, and simply called `AutoformerModel`. Our strategy with this model is to show the performance of the univariate Transformer models in comparison to the DLinear model which is inherently univariate as will shown next. We will also present the results from _two_ multivariate Transformer models trained on the same data. ## DLinear - Under The Hood Actually, DLinear is conceptually simple: it's just a fully connected with the Autoformer's `DecompositionLayer`. It uses the `DecompositionLayer` above to decompose the input time series into the residual (the seasonality) and trend part. In the forward pass each part is passed through its own linear layer, which projects the signal to an appropriate `prediction_length`-sized output. The final output is the sum of the two corresponding outputs in the point-forecasting model: ```python def forward(self, context): seasonal, trend = self.decomposition(context) seasonal_output = self.linear_seasonal(seasonal) trend_output = self.linear_trend(trend) return seasonal_output + trend_output ``` In the probabilistic setting one can project the context length arrays to `prediction-length * hidden` dimensions via the `linear_seasonal` and `linear_trend` layers. The resulting outputs are added and reshaped to `(prediction_length, hidden)`. Finally, a probabilistic head maps the latent representations of size `hidden` to the parameters of some distribution. 
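To make the point-forecasting variant concrete, below is a minimal, self-contained sketch that wires the forward pass above to the `DecompositionLayer` from earlier. The class name, layer names, and the `(batch, time, channels)` layout are assumptions for illustration, not the implementation used in the benchmark:

```python
import torch
from torch import nn

class DLinearPointForecaster(nn.Module):
    \"\"\"Minimal DLinear sketch: decomposition followed by two linear heads.\"\"\"

    def __init__(self, context_length, prediction_length, kernel_size=25):
        super().__init__()
        self.decomposition = DecompositionLayer(kernel_size)
        # each head maps the context window onto the forecast horizon
        self.linear_seasonal = nn.Linear(context_length, prediction_length)
        self.linear_trend = nn.Linear(context_length, prediction_length)

    def forward(self, context):
        # context: (batch, context_length, num_channels)
        seasonal, trend = self.decomposition(context)
        # apply the linear layers along the time dimension
        seasonal_output = self.linear_seasonal(seasonal.transpose(1, 2))
        trend_output = self.linear_trend(trend.transpose(1, 2))
        # (batch, prediction_length, num_channels)
        return (seasonal_output + trend_output).transpose(1, 2)
```

A model like this can then be trained with a simple L1 or L2 loss between its output and the ground-truth forecast window.
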
In our benchmark, we use the implementation of DLinear from [GluonTS]( ## Example: Traffic Dataset We want to show empirically the performance of Transformer-based models in the library, by benchmarking on the `traffic` dataset, a dataset with 862 time series. We will train a shared model on each of the individual time series (i.e. univariate setting). Each time series represents the occupancy value of a sensor and is in the range [0, 1]. We will keep the following hyperparameters fixed for all the models: ```python # Traffic prediction_length is 24. Reference: # prediction_length = 24 context_length = prediction_length*2 batch_size = 128 num_batches_per_epoch = 100 epochs = 50 scaling = \"std\" ``` The transformers models are all relatively small with: ```python encoder_layers=2 decoder_layers=2 d_model=16 ``` Instead of showing how to train a model using `Autoformer`, one can just replace the model in the previous two blog posts ([TimeSeriesTransformer]( and [Informer]( with the new `Autoformer` model and train it on the `traffic` dataset. In order to not repeat ourselves, we have already trained the models and pushed them to the HuggingFace Hub. We will use those models for evaluation. ## Load Dataset Let's first install the necessary libraries: ```python !pip install -q transformers datasets evaluate accelerate \"gluonts[torch]\" ujson tqdm ``` The `traffic` dataset, used by [Lai et al. (2017)]( contains the San Francisco Traffic. It contains 862 hourly time series showing the road occupancy rates in the range \\\\([0, 1]\\\\) on the San Francisco Bay Area freeways from 2015 to 2016. ```python from gluonts.dataset.repository.datasets import get_dataset dataset = get_dataset(\"traffic\") freq = dataset.metadata.freq prediction_length = dataset.metadata.prediction_length ``` Let's visualize a time series in the dataset and plot the train/test split: ```python import matplotlib.pyplot as plt train_example = next(iter(dataset.train)) test_example = next(iter(dataset.test)) num_of_samples = 4*prediction_length figure, axes = plt.subplots() axes.plot(train_example[\"target\"][-num_of_samples:], color=\"blue\") axes.plot( test_example[\"target\"][-num_of_samples - prediction_length :], color=\"red\", alpha=0.5, ) plt.show() ``` . We define a `Chain` of transformations from GluonTS (which is a bit comparable to `torchvision.transforms.Compose` for images). It allows us to combine several transformations into a single pipeline. The transformations below are annotated with comments to explain what they do. 
At a high level, we will iterate over the individual time series of our dataset and add/remove fields or features: ```python from transformers import PretrainedConfig from gluonts.time_feature import time_features_from_frequency_str from gluonts.dataset.field_names import FieldName from gluonts.transform import ( AddAgeFeature, AddObservedValuesIndicator, AddTimeFeatures, AsNumpyArray, Chain, ExpectedNumInstanceSampler, RemoveFields, SelectFields, SetField, TestSplitSampler, Transformation, ValidationSplitSampler, VstackFeatures, RenameFields, ) def create_transformation(freq: str, config: PretrainedConfig) -> Transformation: # create a list of fields to remove later remove_field_names = [] if config.num_static_real_features == 0: remove_field_names.append(FieldName.FEAT_STATIC_REAL) if config.num_dynamic_real_features == 0: remove_field_names.append(FieldName.FEAT_DYNAMIC_REAL) if config.num_static_categorical_features == 0: remove_field_names.append(FieldName.FEAT_STATIC_CAT) return Chain( # step 1: remove static/dynamic fields if not specified [RemoveFields(field_names=remove_field_names)] # step 2: convert the data to NumPy (potentially not needed) + ( [ AsNumpyArray( field=FieldName.FEAT_STATIC_CAT, expected_ndim=1, dtype=int, ) ] if config.num_static_categorical_features > 0 else [] ) + ( [ AsNumpyArray( field=FieldName.FEAT_STATIC_REAL, expected_ndim=1, ) ] if config.num_static_real_features > 0 else [] ) + [ AsNumpyArray( field=FieldName.TARGET, # we expect an extra dim for the multivariate case: expected_ndim=1 if config.input_size == 1 else 2, ), # step 3: handle the NaN's by filling in the target with zero # and return the mask (which is in the observed values) # true for observed values, false for nan's # the decoder uses this mask (no loss is incurred for unobserved values) # see loss_weights inside the xxxForPrediction model AddObservedValuesIndicator( target_field=FieldName.TARGET, output_field=FieldName.OBSERVED_VALUES, ), # step 4: add temporal features based on freq of the dataset # these serve as positional encodings AddTimeFeatures( start_field=FieldName.START, target_field=FieldName.TARGET, output_field=FieldName.FEAT_TIME, time_features=time_features_from_frequency_str(freq), pred_length=config.prediction_length, ), # step 5: add another temporal feature (just a single number) # tells the model where in the life the value of the time series is # sort of running counter AddAgeFeature( target_field=FieldName.TARGET, output_field=FieldName.FEAT_AGE, pred_length=config.prediction_length, log_scale=True, ), # step 6: vertically stack all the temporal features into the key FEAT_TIME VstackFeatures( output_field=FieldName.FEAT_TIME, input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE] + ( [FieldName.FEAT_DYNAMIC_REAL] if config.num_dynamic_real_features > 0 else [] ), ), # step 7: rename to match HuggingFace names RenameFields( mapping={ FieldName.FEAT_STATIC_CAT: \"static_categorical_features\", FieldName.FEAT_STATIC_REAL: \"static_real_features\", FieldName.FEAT_TIME: \"time_features\", FieldName.TARGET: \"values\", FieldName.OBSERVED_VALUES: \"observed_mask\", } ), ] ) ``` ## Define `InstanceSplitter` For training/validation/testing we next create an `InstanceSplitter` which is used to sample windows from the dataset (as, remember, we can't pass the entire history of values to the model due to time and memory constraints). 
The instance splitter samples random `context_length` sized and subsequent `prediction_length` sized windows from the data, and appends a `past_` or `future_` key to any temporal keys for the respective windows. This makes sure that the `values` will be split into `past_values` and subsequent `future_values` keys, which will serve as the encoder and decoder inputs respectively. The same happens for any keys in the `time_series_fields` argument: ```python from gluonts.transform import InstanceSplitter from gluonts.transform.sampler import InstanceSampler from typing import Optional def create_instance_splitter( config: PretrainedConfig, mode: str, train_sampler: Optional[InstanceSampler] = None, validation_sampler: Optional[InstanceSampler] = None, ) -> Transformation: assert mode in [\"train\", \"validation\", \"test\"] instance_sampler = { \"train\": train_sampler or ExpectedNumInstanceSampler( num_instances=1.0, min_future=config.prediction_length ), \"validation\": validation_sampler or ValidationSplitSampler(min_future=config.prediction_length), \"test\": TestSplitSampler(), }[mode] return InstanceSplitter( target_field=\"values\", is_pad_field=FieldName.IS_PAD, start_field=FieldName.START, forecast_start_field=FieldName.FORECAST_START, instance_sampler=instance_sampler, past_length=config.context_length + max(config.lags_sequence), future_length=config.prediction_length, time_series_fields=[\"time_features\", \"observed_mask\"], ) ``` ## Create PyTorch DataLoaders Next, it's time to create PyTorch DataLoaders, which allow us to have batches of (input, output) pairs - or in other words (`past_values`, `future_values`). ```python from typing import Iterable import torch from gluonts.itertools import Cyclic, Cached from gluonts.dataset.loader import as_stacked_batches def create_train_dataloader( config: PretrainedConfig, freq, data, batch_size: int, num_batches_per_epoch: int, shuffle_buffer_length: Optional[int] = None, cache_data: bool = True, **kwargs, ) -> Iterable: PREDICTION_INPUT_NAMES = [ \"past_time_features\", \"past_values\", \"past_observed_mask\", \"future_time_features\", ] if config.num_static_categorical_features > 0: PREDICTION_INPUT_NAMES.append(\"static_categorical_features\") if config.num_static_real_features > 0: PREDICTION_INPUT_NAMES.append(\"static_real_features\") TRAINING_INPUT_NAMES = PREDICTION_INPUT_NAMES + [ \"future_values\", \"future_observed_mask\", ] transformation = create_transformation(freq, config) transformed_data = transformation.apply(data, is_train=True) if cache_data: transformed_data = Cached(transformed_data) # we initialize a Training instance instance_splitter = create_instance_splitter(config, \"train\") # the instance splitter will sample a window of # context length + lags + prediction length (from the 366 possible transformed time series) # randomly from within the target time series and return an iterator. 
stream = Cyclic(transformed_data).stream() training_instances = instance_splitter.apply(stream, is_train=True) return as_stacked_batches( training_instances, batch_size=batch_size, shuffle_buffer_length=shuffle_buffer_length, field_names=TRAINING_INPUT_NAMES, output_type=torch.tensor, num_batches_per_epoch=num_batches_per_epoch, ) def create_test_dataloader( config: PretrainedConfig, freq, data, batch_size: int, **kwargs, ): PREDICTION_INPUT_NAMES = [ \"past_time_features\", \"past_values\", \"past_observed_mask\", \"future_time_features\", ] if config.num_static_categorical_features > 0: PREDICTION_INPUT_NAMES.append(\"static_categorical_features\") if config.num_static_real_features > 0: PREDICTION_INPUT_NAMES.append(\"static_real_features\") transformation = create_transformation(freq, config) transformed_data = transformation.apply(data, is_train=False) # we create a Test Instance splitter which will sample the very last # context window seen during training only for the encoder. instance_sampler = create_instance_splitter(config, \"test\") # we apply the transformations in test mode testing_instances = instance_sampler.apply(transformed_data, is_train=False) return as_stacked_batches( testing_instances, batch_size=batch_size, output_type=torch.tensor, field_names=PREDICTION_INPUT_NAMES, ) ``` ## Evaluate on Autoformer We have already pre-trained an Autoformer model on this dataset, so we can just fetch the model and evaluate it on the test set: ```python from transformers import AutoformerConfig, AutoformerForPrediction config = AutoformerConfig.from_pretrained(\"kashif/autoformer-traffic-hourly\") model = AutoformerForPrediction.from_pretrained(\"kashif/autoformer-traffic-hourly\") test_dataloader = create_test_dataloader( config=config, freq=freq, data=test_dataset, batch_size=64, ) ``` At inference time, we will use the model's `generate()` method for predicting `prediction_length` steps into the future from the very last context window of each time series in the training set. ```python from accelerate import Accelerator accelerator = Accelerator() device = accelerator.device model.to(device) model.eval() forecasts_ = [] for batch in test_dataloader: outputs = model.generate( static_categorical_features=batch[\"static_categorical_features\"].to(device) if config.num_static_categorical_features > 0 else None, static_real_features=batch[\"static_real_features\"].to(device) if config.num_static_real_features > 0 else None, past_time_features=batch[\"past_time_features\"].to(device), past_values=batch[\"past_values\"].to(device), future_time_features=batch[\"future_time_features\"].to(device), past_observed_mask=batch[\"past_observed_mask\"].to(device), ) forecasts_.append(outputs.sequences.cpu().numpy()) ``` The model outputs a tensor of shape (`batch_size`, `number of samples`, `prediction length`, `input_size`). In this case, we get `100` possible values for the next `24` hours for each of the time series in the test dataloader batch which if you recall from above is `64`: ```python forecasts_[0].shape >>> (64, 100, 24) ``` We'll stack them vertically, to get forecasts for all time-series in the test dataset: We have `7` rolling windows in the test set which is why we end up with a total of `7 * 862 = 6034` predictions: ```python import numpy as np forecasts = np.vstack(forecasts_) print(forecasts.shape) >>> (6034, 100, 24) ``` We can evaluate the resulting forecast with respect to the ground truth out of sample values present in the test set. 
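As a quick refresher on the metric we are about to compute: MASE scales the forecast's mean absolute error by the in-sample error of a seasonal naive forecast with seasonal period \\\\(m\\\\) (24 for hourly data), roughly:

$$ \\textrm{MASE} = \\frac{\\frac{1}{H}\\sum_{t=T+1}^{T+H} |y_t - \\hat{y}_t|}{\\frac{1}{T-m}\\sum_{t=m+1}^{T} |y_t - y_{t-m}|} $$

where \\\\(H\\\\) is the prediction length and \\\\(T\\\\) the length of the training series. A value below 1 means the forecast beats the seasonal naive baseline on average; the exact details are handled by the metric implementation.
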
For that, we'll use the [Evaluate]( library, which includes the [MASE]( metrics. We calculate the metric for each time series in the dataset and return the average: ```python from tqdm.autonotebook import tqdm from evaluate import load from gluonts.time_feature import get_seasonality mase_metric = load(\"evaluate-metric/mase\") forecast_median = np.median(forecasts, 1) mase_metrics = [] for item_id, ts in enumerate(tqdm(test_dataset)): training_data = ts[\"target\"][:-prediction_length] ground_truth = ts[\"target\"][-prediction_length:] mase = mase_metric.compute( predictions=forecast_median[item_id], references=np.array(ground_truth), training=np.array(training_data), periodicity=get_seasonality(freq)) mase_metrics.append(mase[\"mase\"]) ``` So the result for the Autoformer model is: ```python print(f\"Autoformer univariate MASE: {np.mean(mase_metrics):.3f}\") >>> Autoformer univariate MASE: 0.910 ``` To plot the prediction for any time series with respect to the ground truth test data, we define the following helper: ```python import matplotlib.dates as mdates import pandas as pd test_ds = list(test_dataset) def plot(ts_index): fig, ax = plt.subplots() index = pd.period_range( start=test_ds[ts_index][FieldName.START], periods=len(test_ds[ts_index][FieldName.TARGET]), freq=test_ds[ts_index][FieldName.START].freq, ).to_timestamp() ax.plot( index[-5*prediction_length:], test_ds[ts_index][\"target\"][-5*prediction_length:], label=\"actual\", ) plt.plot( index[-prediction_length:], np.median(forecasts[ts_index], axis=0), label=\"median\", ) plt.gcf().autofmt_xdate() plt.legend(loc=\"best\") plt.show() ``` For example, for time-series in the test set with index `4`: ```python plot(4) ```  ) ``` Train the model: ```python predictor = estimator.train( training_data=train_dataset, cache_data=True, shuffle_buffer_length=1024 ) >>> INFO:pytorch_lightning.callbacks.model_summary: | Name | Type | Params --------------------------------------- 0 | model | DLinearModel | 4.7 K --------------------------------------- 4.7 K Trainable params 0 Non-trainable params 4.7 K Total params 0.019 Total estimated model params size (MB) Training: 0it [00:00, ?it/s] ... INFO:pytorch_lightning.utilities.rank_zero:Epoch 49, global step 5000: 'train_loss' was not in top 1 INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=50` reached. ``` And evaluate it on the test set: ```python from gluonts.evaluation import make_evaluation_predictions, Evaluator forecast_it, ts_it = make_evaluation_predictions( dataset=dataset.test, predictor=predictor, ) d_linear_forecasts = list(forecast_it) d_linear_tss = list(ts_it) evaluator = Evaluator() agg_metrics, _ = evaluator(iter(d_linear_tss), iter(d_linear_forecasts)) ``` So the result for the DLinear model is: ```python dlinear_mase = agg_metrics[\"MASE\"] print(f\"DLinear MASE: {dlinear_mase:.3f}\") >>> DLinear MASE: 0.965 ``` As before, we plot the predictions from our trained DLinear model via this helper: ```python def plot_gluonts(index): plt.plot(d_linear_tss[index][-4 * dataset.metadata.prediction_length:].to_timestamp(), label=\"target\") d_linear_forecasts[index].plot(show_label=True, color='g') plt.legend() plt.gcf().autofmt_xdate() plt.show() ``` ```python plot_gluonts(4) ```  | Transformer (mv.) | Informer (uni.)| Informer (mv.) | Autoformer (uni.) | DLinear | ||| | | | || |`Traffic` | **0.876** | 1.046 | 0.924 | 1.131 | 0.910 | 0.965 | As one can observe, the [vanilla Transformer]( which we introduced last year gets the best results here. 
Secondly, multivariate models are typically _worse_ than the univariate ones, the reason being the difficulty in estimating the cross-series correlations/relationships. The additional variance added by these estimates often harms the resulting forecasts, or the model ends up learning spurious correlations. Recent papers like [CrossFormer]( (ICLR 23) and [CARD]( try to address this problem in Transformer models. Multivariate models usually perform well when trained on large amounts of data. On smaller open datasets such as this one, however, univariate models tend to provide better metrics. Comparing the linear model against equivalent-sized univariate transformers (or, in fact, any other neural univariate model), the transformer-based models typically come out ahead. To summarize, Transformers are definitely far from being outdated when it comes to time-series forecasting! Yet the availability of large-scale datasets is crucial for maximizing their potential. Unlike in CV and NLP, the field of time series lacks publicly accessible large-scale datasets. Most existing pre-trained models for time series are trained on small sample sizes from archives like [UCR and UEA]( which contain only a few thousand or even a few hundred samples. Although these benchmark datasets have been instrumental in the progress of the time series community, their limited sample sizes and lack of generality pose challenges for pre-training deep learning models. Therefore, the development of large-scale, generic time series datasets (like ImageNet in CV) is of the utmost importance. Creating such datasets will greatly facilitate further research on pre-trained models specifically designed for time series analysis, and it will improve the applicability of pre-trained models in time series forecasting. ## Acknowledgements We express our appreciation to [Lysandre Debut]( and [Pedro Cuenca]( for their insightful comments and help during this project."}
{"title": "autonlp-prodigy.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Active Learning with AutoNLP and Prodigy\" thumbnail: /blog/assets/43_autonlp_prodigy/thumbnail.png authors: - user: abhishek --- Active Learning with AutoNLP and Prodigy Active learning in the context of Machine Learning is a process in which you iteratively add labeled data, retrain a model and serve it to the end user. It is an endless process and requires human interaction for labeling/creating the data. In this article, we will discuss how to use [AutoNLP]( and [Prodigy]( to build an active learning pipeline. ## AutoNLP [AutoNLP]( is a framework created by Hugging Face that helps you to build your own state-of-the-art deep learning models on your own dataset with almost no coding at all. AutoNLP is built on the giant shoulders of Hugging Face's [transformers]( [datasets]( [inference-api]( and many other tools. With AutoNLP, you can train SOTA transformer models on your own custom dataset, fine-tune them (automatically) and serve them to the end-user. All models trained with AutoNLP are state-of-the-art and production-ready. At the time of writing this article, AutoNLP supports tasks like binary classification, regression, multi class classification, token classification (such as named entity recognition or part of speech), question answering, summarization and more. You can find a list of all the supported tasks [here]( AutoNLP supports languages like English, French, German, Spanish, Hindi, Dutch, Swedish and many more. There is also support for custom models with custom tokenizers (in case your language is not supported by AutoNLP). ## Prodigy [Prodigy]( is an annotation tool developed by Explosion (the makers of [spaCy]( It is a web-based tool that allows you to annotate your data in real time. Prodigy supports NLP tasks such as named entity recognition (NER) and text classification, but it's not limited to NLP! It supports Computer Vision tasks and even creating your own tasks! You can try the Prodigy demo: [here]( Note that Prodigy is a commercial tool. You can find out more about it [here]( We chose Prodigy as it is one of the most popular tools for labeling data and is infinitely customizable. It is also very easy to setup and use. ## Dataset Now begins the most interesting part of this article. After looking at a lot of datasets and different types of problems, we stumbled upon BBC News Classification dataset on Kaggle. This dataset was used in an inclass competition and can be accessed [here]( Let's take a look at this dataset: As we can see this is a classification dataset. There is a `Text` column which is the text of the news article and a `Category` column which is the class of the article. Overall, there are 5 different classes: `business`, `entertainment`, `politics`, `sport` & `tech`. Training a multi-class classification model on this dataset using AutoNLP is a piece of cake. Step 1: Download the dataset. Step 2: Open [AutoNLP]( and create a new project. Step 3: Upload the training dataset and choose auto-splitting. Step 4: Accept the pricing and train your models. Please note that in the above example, we are training 15 different multi-class classification models. AutoNLP pricing can be as low as $10 per model. AutoNLP will select the best models and do hyperparameter tuning for you on its own. So, now, all we need to do is sit back, relax and wait for the results. After around 15 minutes, all models finished training and the results are ready. 
It seems like the best model scored 98.67% accuracy! So, we are now able to classify the articles in the dataset with an accuracy of 98.67%! But wait, we were talking about active learning and Prodigy. What happened to those? We did use Prodigy as we will see soon. We used it to label this dataset for the named entity recognition task. Before starting the labeling part, we thought it would be cool to have a project in which we are not only able to detect the entities in news articles but also categorize them. That's why we built this classification model on existing labels. ## Active Learning The dataset we used did have categories but it didn't have labels for entity recognition. So, we decided to use Prodigy to label the dataset for another task: named entity recognition. Once you have Prodigy installed, you can simply run: $ prodigy ner.manual bbc blank:en BBC_News_Train.csv --label PERSON,ORG,PRODUCT,LOCATION Let's look at the different values: * `bbc` is the dataset that will be created by Prodigy. * `blank:en` is the `spaCy` tokenizer being used. * `BBC_News_Train.csv` is the dataset that will be used for labeling. * `PERSON,ORG,PRODUCT,LOCATION` is the list of labels that will be used for labeling. Once you run the above command, you can go to the prodigy web interface (usually at localhost:8080) and start labelling the dataset. Prodigy interface is very simple, intuitive and easy to use. The interface looks like the following: All you have to do is select which entity you want to label (PERSON, ORG, PRODUCT, LOCATION) and then select the text that belongs to the entity. Once you are done with one document, you can click on the green button and Prodigy will automatically provide you with next unlabelled document.  Using Prodigy, we started labelling the dataset. When we had around 20 samples, we trained a model using AutoNLP. Prodigy doesn't export the data in AutoNLP format, so we wrote a quick and dirty script to convert the data into AutoNLP format: ```python import json import spacy from prodigy.components.db import connect db = connect() prodigy_annotations = db.get_dataset(\"bbc\") examples = ((eg[\"text\"], eg) for eg in prodigy_annotations) nlp = spacy.blank(\"en\") dataset = [] for doc, eg in nlp.pipe(examples, as_tuples=True): try: doc.ents = [doc.char_span(s[\"start\"], s[\"end\"], s[\"label\"]) for s in eg[\"spans\"]] iob_tags = [f\"{t.ent_iob_}-{t.ent_type_}\" if t.ent_iob_ else \"O\" for t in doc] iob_tags = [t.strip(\"-\") for t in iob_tags] tokens = [str(t) for t in doc] temp_data = { \"tokens\": tokens, \"tags\": iob_tags } dataset.append(temp_data) except: pass with open('data.jsonl', 'w') as outfile: for entry in dataset: json.dump(entry, outfile) outfile.write('\\n') ``` This will provide us with a `JSONL` file which can be used for training a model using AutoNLP. The steps will be same as before except we will select `Token Classification` task when creating the AutoNLP project. Using the initial data we had, we trained a model using AutoNLP. The best model had an accuracy of around 86% with 0 precision and recall. We knew the model didn't learn anything. It's pretty obvious, we had only around 20 samples. After labelling around 70 samples, we started getting some results. The accuracy went up to 92%, precision was 0.52 and recall around 0.42. We were getting some results, but still not satisfactory. In the following image, we can see how this model performs on an unseen sample. As you can see, the model is struggling. But it's much better than before! 
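For reference, each line of the `data.jsonl` file produced by the conversion script above pairs the tokens of one document with their IOB tags. A hypothetical example line (the tokens and entities below are made up; the labels are the ones we chose earlier) could look like this: ```json {\"tokens\": [\"Bruce\", \"joined\", \"the\", \"company\", \"in\", \"London\"], \"tags\": [\"B-PERSON\", \"O\", \"O\", \"O\", \"O\", \"B-LOCATION\"]} ```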
Previously, the model was not even able to predict anything in the same text. At least now, it's able to figure out that `Bruce` and `David` are names. Thus, we continued. We labelled a few more samples. Please note that, in each iteration, our dataset is getting bigger. All we are doing is uploading the new dataset to AutoNLP and letting it do the rest. After labelling around 150 samples, we started getting some good results. The accuracy went up to 95.7%, precision was 0.64 and recall around 0.76. Let's take a look at how this model performs on the same unseen sample. WOW! This is amazing! As you can see, the model is now performing extremely well! It's able to detect many entities in the same text. The precision and recall were still a bit low and thus we continued labeling even more data. After labeling around 250 samples, we had the best results in terms of precision and recall. The accuracy went up to ~95.9% and precision and recall were 0.73 and 0.79 respectively. At this point, we decided to stop labelling and end the experimentation process. The following graph shows how the accuracy of the best model improved as we added more samples to the dataset: Well, it's a well-known fact that more relevant data will lead to better models and thus better results. With this experimentation, we successfully created models that can not only detect the entities in news articles but also categorize the articles. Using tools like Prodigy and AutoNLP, we invested our time and effort only in labeling the dataset (and even that was made simpler by the interface Prodigy offers). AutoNLP saved us a lot of time and effort: we didn't have to figure out which models to use, how to train them, how to evaluate them, how to tune the parameters, which optimizer and scheduler to use, pre-processing, post-processing etc. We just needed to label the dataset and let AutoNLP do everything else. We believe with tools like AutoNLP and Prodigy it's very easy to create data and state-of-the-art models. And since the whole process requires almost no coding at all, even someone without a coding background can create datasets which are generally not available to the public, train their own models using AutoNLP and share the model with everyone else in the community (or just use them for their own research / business). We have open-sourced the best model created using this process. You can try it [here]( The labelled dataset can also be downloaded [here]( Models are only state-of-the-art because of the data they are trained on."}
{"title": "autotrain-image-classification.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: Image Classification with AutoTrain thumbnail: /blog/assets/105_autotrain-image-classification/thumbnail.png authors: - user: nimaboscarino --- # Image Classification with AutoTrain So you\u2019ve heard all about the cool things that are happening in the machine learning world, and you want to join in. There\u2019s just one problem \u2013 you don\u2019t know how to code! Or maybe you\u2019re a seasoned software engineer who wants to add some ML to your side-project, but you don\u2019t have the time to pick up a whole new tech stack! For many people, the technical barriers to picking up machine learning feel insurmountable. That\u2019s why Hugging Face created [AutoTrain]( and with the latest feature we\u2019ve just added, we\u2019re making \u201cno-code\u201d machine learning better than ever. Best of all, you can create your first project for free! [Hugging Face AutoTrain]( lets you train models with **zero** configuration needed. Just choose your task (translation? how about question answering?), upload your data, and let Hugging Face do the rest of the work! By letting AutoTrain experiment with number of different models, there's even a good chance that you'll end up with a model that performs better than a model that's been hand-trained by an engineer We\u2019ve been expanding the number of tasks that we support, and we\u2019re proud to announce that **you can now use AutoTrain for Computer Vision**! Image Classification is the latest task we\u2019ve added, with more on the way. But what does this mean for you? [Image Classification]( models learn to *categorize* images, meaning that you can train one of these models to label any image. Do you want a model that can recognize signatures? Distinguish bird species? Identify plant diseases? As long as you can find an appropriate dataset, an image classification model has you covered. ## How can you train your own image classifier? If you haven\u2019t [created a Hugging Face account]( yet, now\u2019s the time! Following that, make your way over to the [AutoTrain homepage]( and click on \u201cCreate new project\u201d to get started. You\u2019ll be asked to fill in some basic info about your project. In the screenshot below you\u2019ll see that I created a project named `butterflies-classification`, and I chose the \u201cImage Classification\u201d task. I\u2019ve also chosen the \u201cAutomatic\u201d model option, since I want to let AutoTrain do the work of finding the best model architectures for my project. Once AutoTrain creates your project, you just need to connect your data. If you have the data locally, you can drag and drop the folder into the window. Since we can also use [any of the image classification datasets on the Hugging Face Hub]( in this example I\u2019ve decided to use the [NimaBoscarino/butterflies]( dataset. You can select separate training and validation datasets if available, or you can ask AutoTrain to split the data for you. Once the data has been added, simply choose the number of model candidates that you\u2019d like AutoModel to try out, review the expected training cost (training with 5 candidate models and less than 500 images is free ), and start training! In the screenshots above you can see that my project started 5 different models, which each reached different accuracy scores. 
One of them wasn\u2019t performing very well at all, so AutoTrain went ahead and stopped it so that it wouldn\u2019t waste resources. The very best model hit 84% accuracy, with effectively zero effort on my end. To wrap it all up, you can visit your freshly trained models on the Hub and play around with them through the integrated [inference widget]( For example, check out my butterfly classifier model over at [NimaBoscarino/butterflies]( We\u2019re so excited to see what you build with AutoTrain! Don\u2019t forget to join the community over at [hf.co/join/discord]( and reach out to us if you need any help!"}
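If you prefer code over the widget, such an AutoTrain image classifier can also be loaded like any other Hub model. Here is a small sketch using the transformers `pipeline` API — the model ID follows the name mentioned above and the image path is just a placeholder: ```python
from transformers import pipeline

# Image-classification pipeline backed by the AutoTrain-trained butterfly model
classifier = pipeline('image-classification', model='NimaBoscarino/butterflies')

# 'my_butterfly.jpg' is a placeholder for any local image file (an image URL also works)
for prediction in classifier('my_butterfly.jpg'):
    print(prediction['label'], round(prediction['score'], 3))
``` Each returned prediction is a dict with a `label` and a `score`, so the top entry is the predicted class.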
{"title": "aws-partnership.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Hugging Face and AWS partner to make AI more accessible\" thumbnail: /blog/assets/131_aws-partnership/aws-partnership-thumbnail.png authors: - user: jeffboudier - user: philschmid - user: juliensimon --- # Hugging Face and AWS partner to make AI more accessible It\u2019s time to make AI open and accessible to all. That\u2019s the goal of this expanded long-term strategic partnership between Hugging Face and Amazon Web Services (AWS). Together, the two leaders aim to accelerate the availability of next-generation machine learning models by making them more accessible to the machine learning community and helping developers achieve the highest performance at the lowest cost. ## A new generation of open, accessible AI Machine learning is quickly becoming embedded in all applications. As its impact on every sector of the economy comes into focus, it\u2019s more important than ever to ensure every developer can access and assess the latest models. The partnership with AWS paves the way toward this future by making it faster and easier to build, train, and deploy the latest machine learning models in the cloud using purpose-built tools. There have been significant advances in new Transformer and Diffuser machine learning models that process and generate text, audio, and images. However, most of these popular generative AI models are not publicly available, widening the gap of machine learning capabilities between the largest tech companies and everyone else. To counter this trend, AWS and Hugging Face are partnering to contribute next-generation models to the global AI community and democratize machine learning. Through the strategic partnership, Hugging Face will leverage AWS as a preferred cloud provider so developers in Hugging Face\u2019s community can access AWS\u2019s state-of-the-art tools (e.g., [Amazon SageMaker]( [AWS Trainium]( [AWS Inferentia]( to train, fine-tune, and deploy models on AWS. This will allow developers to further optimize the performance of their models for their specific use cases while lowering costs. Hugging Face will apply the latest in innovative research findings using Amazon SageMaker to build next-generation AI models. Together, Hugging Face and AWS are bridging the gap so the global AI community can benefit from the latest advancements in machine learning to accelerate the creation of generative AI applications. \u201cThe future of AI is here, but it\u2019s not evenly distributed,\u201d said Clement Delangue, CEO of Hugging Face. \u201cAccessibility and transparency are the keys to sharing progress and creating tools to use these new capabilities wisely and responsibly. Amazon SageMaker and AWS-designed chips will enable our team and the larger machine learning community to convert the latest research into openly reproducible models that anyone can build on.\u201d ## Collaborating to scale AI in the cloud This expanded strategic partnership enables Hugging Face and AWS to accelerate machine learning adoption using the latest models hosted on Hugging Face with the industry-leading capabilities of Amazon SageMaker. Customers can now easily fine-tune and deploy state-of-the-art Hugging Face models in just a few clicks on Amazon SageMaker and Amazon Elastic Computing Cloud (EC2), taking advantage of purpose-built machine learning accelerators including AWS Trainium and AWS Inferentia. 
\u201cGenerative AI has the potential to transform entire industries, but its cost and the required expertise puts the technology out of reach for all but a select few companies,\u201d said Adam Selipsky, CEO of AWS. \u201cHugging Face and AWS are making it easier for customers to access popular machine learning models to create their own generative AI applications with the highest performance and lowest costs. This partnership demonstrates how generative AI companies and AWS can work together to put this innovative technology into the hands of more customers.\u201d Hugging Face has become the central hub for machine learning, with more than [100,000 free and accessible machine learning models]( downloaded more than 1 million times daily by researchers, data scientists, and machine learning engineers. AWS is by far the most popular place to run models from the Hugging Face Hub. Since the [start of our collaboration]( [Hugging Face on Amazon SageMaker]( has grown exponentially. We are experiencing an exciting renaissance with generative AI, and we're just getting started. We look forward to what the future holds for Hugging Face, AWS, and the AI community."}
{"title": "bert-101.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"BERT 101 - State Of The Art NLP Model Explained\" thumbnail: /blog/assets/52_bert_101/thumbnail.jpg authors: - user: britneymuller --- BERT 101 State Of The Art NLP Model Explained ## What is BERT? BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine Learning (ML) model for natural language processing. It was developed in 2018 by researchers at Google AI Language and serves as a swiss army knife solution to 11+ of the most common language tasks, such as sentiment analysis and named entity recognition. Language has historically been difficult for computers to \u2018understand\u2019. Sure, computers can collect, store, and read text inputs but they lack basic language _context_. So, along came Natural Language Processing (NLP): the field of artificial intelligence aiming for computers to read, analyze, interpret and derive meaning from text and spoken words. This practice combines linguistics, statistics, and Machine Learning to assist computers in \u2018understanding\u2019 human language. Individual NLP tasks have traditionally been solved by individual models created for each specific task. That is, until\u2014 BERT! BERT revolutionized the NLP space by solving for 11+ of the most common NLP tasks (and better than previous models) making it the jack of all NLP trades. In this guide, you'll learn what BERT is, why it\u2019s different, and how to get started using BERT: 1. [What is BERT used for?](#1-what-is-bert-used-for) 2. [How does BERT work?](#2-how-does-bert-work) 3. [BERT model size & architecture](#3-bert-model-size--architecture) 4. [BERT\u2019s performance on common language tasks](#4-berts-performance-on-common-language-tasks) 5. [Environmental impact of deep learning](#5-enviornmental-impact-of-deep-learning) 6. [The open source power of BERT](#6-the-open-source-power-of-bert) 7. [How to get started using BERT](#7-how-to-get-started-using-bert) 8. [BERT FAQs](#8-bert-faqs) 9. [Conclusion](#9-conclusion) Let's get started! ## 1. What is BERT used for? BERT can be used on a wide variety of language tasks: - Can determine how positive or negative a movie\u2019s reviews are. [(Sentiment Analysis)]( - Helps chatbots answer your questions. [(Question answering)]( - Predicts your text when writing an email (Gmail). [(Text prediction)]( - Can write an article about any topic with just a few sentence inputs. [(Text generation)]( - Can quickly summarize long legal contracts. [(Summarization)]( - Can differentiate words that have multiple meanings (like \u2018bank\u2019) based on the surrounding text. (Polysemy resolution) **There are many more language/NLP tasks + more detail behind each of these.** ***Fun Fact:*** You interact with NLP (and likely BERT) almost every single day! NLP is behind Google Translate, voice assistants (Alexa, Siri, etc.), chatbots, Google searches, voice-operated GPS, and more. --- ### 1.1 Example of BERT BERT helps Google better surface (English) results for nearly all searches since November of 2020. Here\u2019s an example of how BERT helps Google better understand specific searches like: Source Pre-BERT Google surfaced information about getting a prescription filled. Post-BERT Google understands that \u201cfor someone\u201d relates to picking up a prescription for someone else and the search results now help to answer that. --- ## 2. How does BERT Work? 
BERT works by leveraging the following: ### 2.1 Large amounts of training data A massive dataset of 3.3 Billion words has contributed to BERT\u2019s continued success. BERT was specifically trained on Wikipedia (\\~2.5B words) and Google\u2019s BooksCorpus (\\~800M words). These large informational datasets contributed to BERT\u2019s deep knowledge not only of the English language but also of our world! Training on a dataset this large takes a long time. BERT\u2019s training was made possible thanks to the novel Transformer architecture and sped up by using TPUs (Tensor Processing Units - Google\u2019s custom circuit built specifically for large ML models). \u201464 TPUs trained BERT over the course of 4 days. **Note:** Demand for smaller BERT models is increasing in order to use BERT within smaller computational environments (like cell phones and personal computers). [23 smaller BERT models were released in March 2020]( [DistilBERT]( offers a lighter version of BERT; runs 60% faster while maintaining over 95% of BERT\u2019s performance. ### 2.2 What is a Masked Language Model? MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to bidirectionally use the words on either side of the covered word to predict the masked word. This had never been done before! **Fun Fact:** We naturally do this as humans! **Masked Language Model Example:** Imagine your friend calls you while camping in Glacier National Park and their service begins to cut out. The last thing you hear before the call drops is: Friend: \u201cDang! I\u2019m out fishing and a huge trout just [blank] my line!\u201d Can you guess what your friend said?? You\u2019re naturally able to predict the missing word by considering the words bidirectionally before and after the missing word as context clues (in addition to your historical knowledge of how fishing works). Did you guess that your friend said, \u2018broke\u2019? That\u2019s what we predicted as well but even we humans are error-prone to some of these methods. **Note:** This is why you\u2019ll often see a \u201cHuman Performance\u201d comparison to a language model\u2019s performance scores. And yes, newer models like BERT can be more accurate than humans! The bidirectional methodology you did to fill in the [blank] word above is similar to how BERT attains state-of-the-art accuracy. A random 15% of tokenized words are hidden during training and BERT\u2019s job is to correctly predict the hidden words. Thus, directly teaching the model about the English language (and the words we use). Isn\u2019t that neat? Play around with BERT\u2019s masking predictions: Hosted inference API Fill-Mask Examples Mask token: [MASK] Compute This model can be loaded on the Inference API on-demand. JSON Output Maximize **Fun Fact:** Masking has been around a long time - [1953 Paper on Cloze procedure (or \u2018Masking\u2019)]( ### 2.3 What is Next Sentence Prediction? NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by predicting if a given sentence follows the previous sentence or not. **Next Sentence Prediction Example:** 1. Paul went shopping. He bought a new shirt. (correct sentence pair) 2. Ramona made coffee. Vanilla ice cream cones for sale. (incorrect sentence pair) In training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT increase next sentence prediction accuracy. **Fun Fact:** BERT is trained on both MLM (50%) and NSP (50%) at the same time. 
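To make the masking idea from section 2.2 concrete, here is a toy sketch of hiding roughly 15% of the tokens in a sentence with the `bert-base-uncased` tokenizer. This is only an illustration, not BERT's actual pre-training pipeline: ```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = 'Dang, a huge trout just broke my fishing line!'
tokens = tokenizer.tokenize(text)

# Hide roughly 15% of the tokens, mimicking the masked language modeling objective
n_to_mask = max(1, int(0.15 * len(tokens)))
for i in random.sample(range(len(tokens)), n_to_mask):
    tokens[i] = tokenizer.mask_token

print(tokens)
# e.g. ['dang', ',', 'a', 'huge', '[MASK]', 'just', 'broke', 'my', 'fishing', 'line', '!']
``` During pre-training, the model only sees the masked version and has to recover the original tokens from the context on both sides.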
### 2.4 Transformers The Transformer architecture makes it possible to parallelize ML training extremely efficiently. Massive parallelization thus makes it feasible to train BERT on large amounts of data in a relatively short period of time. Transformers use an attention mechanism to observe relationships between words. A concept originally proposed in the popular [2017 Attention Is All You Need]( paper sparked the use of Transformers in NLP models all around the world. >Since their introduction in 2017, Transformers have rapidly become the state-of-the-art approach to tackle tasks in many domains such as natural language processing, speech recognition, and computer vision. In short, if you\u2019re doing deep learning, then you need Transformers! Lewis Tunstall, Hugging Face ML Engineer & Author of Natural Language Processing with Transformers Timeline of popular Transformer model releases: Source #### 2.4.1 How do Transformers work? Transformers work by leveraging attention, a powerful deep-learning algorithm, first seen in computer vision models. \u2014Not all that different from how we humans process information through attention. We are incredibly good at forgetting/ignoring mundane daily inputs that don\u2019t pose a threat or require a response from us. For example, can you remember everything you saw and heard coming home last Tuesday? Of course not! Our brain\u2019s memory is limited and valuable. Our recall is aided by our ability to forget trivial inputs. Similarly, Machine Learning models need to learn how to pay attention only to the things that matter and not waste computational resources processing irrelevant information. Transformers create differential weights signaling which words in a sentence are the most critical to further process. A transformer does this by successively processing an input through a stack of transformer layers, usually called the encoder. If necessary, another stack of transformer layers - the decoder - can be used to predict a target output. \u2014BERT however, doesn\u2019t use a decoder. Transformers are uniquely suited for unsupervised learning because they can efficiently process millions of data points. Fun Fact: Google has been using your reCAPTCHA selections to label training data since 2011. The entire Google Books archive and 13 million articles from the New York Times catalog have been transcribed/digitized via people entering reCAPTCHA text. Now, reCAPTCHA is asking us to label Google Street View images, vehicles, stoplights, airplanes, etc. Would be neat if Google made us aware of our participation in this effort (as the training data likely has future commercial intent) but I digress.. To learn more about Transformers check out our Hugging Face Transformers Course. ## 3. BERT model size & architecture Let\u2019s break down the architecture for the two original BERT models: ML Architecture Glossary: | ML Architecture Parts | Definition | |-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Parameters: | Number of learnable variables/values available for the model. | | Transformer Layers: | Number of Transformer blocks. A transformer block transforms a sequence of word representations to a sequence of contextualized words (numbered representations). | | Hidden Size: | Layers of mathematical functions, located between the input and output, that assign weights (to words) to produce a desired result. 
| | Attention Heads: | The size of a Transformer block. | | Processing: | Type of processing unit used to train the model. | | Length of Training: | Time it took to train the model. Here\u2019s how many of the above ML architecture parts BERTbase and BERTlarge has: | | Transformer Layers | Hidden Size | Attention Heads | Parameters | Processing | Length of Training | |-----------|--------------------|-------------|-----------------|------------|------------|--------------------| | BERTbase | 12 | 768 | 12 | 110M | 4 TPUs | 4 days | | BERTlarge | 24 | 1024 | 16 | 340M | 16 TPUs | 4 days | Let\u2019s take a look at how BERTlarge\u2019s additional layers, attention heads, and parameters have increased its performance across NLP tasks. ## 4. BERT's performance on common language tasks BERT has successfully achieved state-of-the-art accuracy on 11 common NLP tasks, outperforming previous top NLP models, and is the first to outperform humans! But, how are these achievements measured? ### NLP Evaluation Methods: #### 4.1 SQuAD v1.1 & v2.0 [SQuAD]( (Stanford Question Answering Dataset) is a reading comprehension dataset of around 108k questions that can be answered via a corresponding paragraph of Wikipedia text. BERT\u2019s performance on this evaluation method was a big achievement beating previous state-of-the-art models and human-level performance: #### 4.2 SWAG [SWAG]( (Situations With Adversarial Generations) is an interesting evaluation in that it detects a model\u2019s ability to infer commonsense! It does this through a large-scale dataset of 113k multiple choice questions about common sense situations. These questions are transcribed from a video scene/situation and SWAG provides the model with four possible outcomes in the next scene. The model then does its\u2019 best at predicting the correct answer. BERT out outperformed top previous top models including human-level performance: #### 4.3 GLUE Benchmark [GLUE]( (General Language Understanding Evaluation) benchmark is a group of resources for training, measuring, and analyzing language models comparatively to one another. These resources consist of nine \u201cdifficult\u201d tasks designed to test an NLP model\u2019s understanding. Here\u2019s a summary of each of those tasks: While some of these tasks may seem irrelevant and banal, it\u2019s important to note that these evaluation methods are _incredibly_ powerful in indicating which models are best suited for your next NLP application. Attaining performance of this caliber isn\u2019t without consequences. Next up, let\u2019s learn about Machine Learning's impact on the environment. ## 5. Environmental impact of deep learning Large Machine Learning models require massive amounts of data which is expensive in both time and compute resources. These models also have an environmental impact: Source Machine Learning\u2019s environmental impact is one of the many reasons we believe in democratizing the world of Machine Learning through open source! Sharing large pre-trained language models is essential in reducing the overall compute cost and carbon footprint of our community-driven efforts. ## 6. The open source power of BERT Unlike other large learning models like GPT-3, BERT\u2019s source code is publicly accessible ([view BERT\u2019s code on Github]( allowing BERT to be more widely used all around the world. This is a game-changer! Developers are now able to get a state-of-the-art model like BERT up and running quickly without spending large amounts of time and money. 
Developers can instead focus their efforts on fine-tuning BERT to customize the model\u2019s performance to their unique tasks. It\u2019s important to note that [thousands]( of open-source and free, pre-trained BERT models are currently available for specific use cases if you don\u2019t want to fine-tune BERT. BERT models pre-trained for specific tasks: - [Twitter sentiment analysis]( - [Analysis of Japanese text]( - [Emotion categorizer (English - anger, fear, joy, etc.)]( - [Clinical Notes analysis]( - [Speech to text translation]( - [Toxic comment detection]( You can also find [hundreds of pre-trained, open-source Transformer models]( available on the Hugging Face Hub. ## 7. How to get started using BERT We've [created this notebook]( so you can try BERT through this easy tutorial in Google Colab. Open the notebook or add the following code to your own. Pro Tip: Use (Shift + Click) to run a code cell. Note: Hugging Face's [pipeline class]( makes it incredibly easy to pull in open source ML models like transformers with just a single line of code. ### 7.1 Install Transformers First, let's install Transformers via the following code: ```python !pip install transformers ``` ### 7.2 Try out BERT Feel free to swap out the sentence below for one of your own. However, leave [MASK] in somewhere to allow BERT to predict the missing word ```python from transformers import pipeline unmasker = pipeline('fill-mask', model='bert-base-uncased') unmasker(\"Artificial Intelligence [MASK] take over the world.\") ``` When you run the above code you should see an output like this: ``` [{'score': 0.3182411789894104, 'sequence': 'artificial intelligence can take over the world.', 'token': 2064, 'token_str': 'can'}, {'score': 0.18299679458141327, 'sequence': 'artificial intelligence will take over the world.', 'token': 2097, 'token_str': 'will'}, {'score': 0.05600147321820259, 'sequence': 'artificial intelligence to take over the world.', 'token': 2000, 'token_str': 'to'}, {'score': 0.04519503191113472, 'sequence': 'artificial intelligences take over the world.', 'token': 2015, 'token_str': '##s'}, {'score': 0.045153118669986725, 'sequence': 'artificial intelligence would take over the world.', 'token': 2052, 'token_str': 'would'}] ``` Kind of frightening right? 
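If you want to see more than the five candidates shown above, the same pipeline call accepts a `top_k` argument. A quick sketch, reusing the fill-mask pipeline from section 7.2: ```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')

# top_k controls how many candidate tokens are returned (the default is 5)
unmasker('Artificial Intelligence [MASK] take over the world.', top_k=10)
```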
### 7.3 Be aware of model bias Let's see what jobs BERT suggests for a \"man\": ```python unmasker(\"The man worked as a [MASK].\") ``` When you run the above code you should see an output that looks something like: ```python [{'score': 0.09747546911239624, 'sequence': 'the man worked as a carpenter.', 'token': 10533, 'token_str': 'carpenter'}, {'score': 0.052383411675691605, 'sequence': 'the man worked as a waiter.', 'token': 15610, 'token_str': 'waiter'}, {'score': 0.04962698742747307, 'sequence': 'the man worked as a barber.', 'token': 13362, 'token_str': 'barber'}, {'score': 0.037886083126068115, 'sequence': 'the man worked as a mechanic.', 'token': 15893, 'token_str': 'mechanic'}, {'score': 0.037680838257074356, 'sequence': 'the man worked as a salesman.', 'token': 18968, 'token_str': 'salesman'}] ``` BERT predicted the man's job to be a Carpenter, Waiter, Barber, Mechanic, or Salesman Now let's see what jobs BERT suggesst for \"woman\" ```python unmasker(\"The woman worked as a [MASK].\") ``` You should see an output that looks something like: ```python [{'score': 0.21981535851955414, 'sequence': 'the woman worked as a nurse.', 'token': 6821, 'token_str': 'nurse'}, {'score': 0.1597413569688797, 'sequence': 'the woman worked as a waitress.', 'token': 13877, 'token_str': 'waitress'}, {'score': 0.11547300964593887, 'sequence': 'the woman worked as a maid.', 'token': 10850, 'token_str': 'maid'}, {'score': 0.03796879202127457, 'sequence': 'the woman worked as a prostitute.', 'token': 19215, 'token_str': 'prostitute'}, {'score': 0.030423851683735847, 'sequence': 'the woman worked as a cook.', 'token': 5660, 'token_str': 'cook'}] ``` BERT predicted the woman's job to be a Nurse, Waitress, Maid, Prostitute, or Cook displaying a clear gender bias in professional roles. ### 7.4 Some other BERT Notebooks you might enjoy: [A Visual Notebook to BERT for the First Time]( [Train your tokenizer]( +Don't forget to checkout the [Hugging Face Transformers Course]( to learn more ## 8. BERT FAQs Can BERT be used with PyTorch? Yes! Our experts at Hugging Face have open-sourced the PyTorch transformers repository on GitHub. Pro Tip: Lewis Tunstall, Leandro von Werra, and Thomas Wolf also wrote a book to help people build language applications with Hugging Face called, \u2018Natural Language Processing with Transformers\u2019. Can BERT be used with Tensorflow? Yes! You can use Tensorflow as the backend of Transformers. How long does it take to pre-train BERT? The 2 original BERT models were trained on 4(BERTbase) and 16(BERTlarge) Cloud TPUs for 4 days. How long does it take to fine-tune BERT? For common NLP tasks discussed above, BERT takes between 1-25mins on a single Cloud TPU or between 1-130mins on a single GPU. What makes BERT different? BERT was one of the first models in NLP that was trained in a two-step way: 1. BERT was trained on massive amounts of unlabeled data (no human annotation) in an unsupervised fashion. 2. BERT was then trained on small amounts of human-annotated data starting from the previous pre-trained model resulting in state-of-the-art performance. ## 9. Conclusion BERT is a highly complex and advanced language model that helps people automate language understanding. Its ability to accomplish state-of-the-art performance is supported by training on massive amounts of data and leveraging Transformers architecture to revolutionize the field of NLP. 
Thanks to BERT\u2019s open-source library, and the incredible AI community\u2019s continued efforts to improve and share new BERT models, the future of untouched NLP milestones looks bright. What will you create with BERT? Learn how to [fine-tune BERT]( for your particular use case."}
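As a concrete starting point, here is a minimal sketch of what fine-tuning BERT for text classification can look like with the `Trainer` API. The toy dataset and hyperparameters below are purely illustrative, not a recommended recipe: ```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny toy dataset, purely for illustration
data = Dataset.from_dict({
    'text': ['I loved this movie!', 'What a waste of time.', 'Great acting and plot.', 'Terrible from start to finish.'],
    'label': [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

def tokenize(batch):
    # Pad/truncate so every example has the same length
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=64)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir='bert-finetuned-demo', num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()
```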
{"title": "bert-cpu-scaling-part-1.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Scaling-up BERT Inference on CPU (Part 1)\" thumbnail: /blog/assets/21_bert_cpu_scaling_part_1/imgs/numa_set.png authors: - user: mfuntowicz --- .centered { display: block; margin: 0 auto; } figure { text-align: center; display: table; max-width: 85%; /* demo; set some amount (px or %) if you can */ margin: 10px auto; /* not needed unless you want centered */ } # Scaling up BERT-like model Inference on modern CPU - Part 1 ## 1. Context and Motivations Back in October 2019, my colleague Lysandre Debut published a comprehensive _(at the time)_ [inference performance benchmarking blog (1)]( Since then, [ transformers (2)]( welcomed a tremendous number of new architectures and thousands of new models were added to the [ hub (3)]( which now counts more than 9,000 of them as of first quarter of 2021. As the NLP landscape keeps trending towards more and more BERT-like models being used in production, it remains challenging to efficiently deploy and run these architectures at scale. This is why we recently introduced our [ Inference API]( to let you focus on building value for your users and customers, rather than digging into all the highly technical aspects of running such models. This blog post is the first part of a series which will cover most of the hardware and software optimizations to better leverage CPUs for BERT model inference. For this initial blog post, we will cover the hardware part: - Setting up a baseline - Out of the box results - Practical & technical considerations when leveraging modern CPUs for CPU-bound tasks - Core count scaling - Does increasing the number of cores actually give better performance? - Batch size scaling - Increasing throughput with multiple parallel & independent model instances We decided to focus on the most famous Transformer model architecture, [BERT (Delvin & al. 2018) (4)]( While we focus this blog post on BERT-like models to keep the article concise, all the described techniques can be applied to any architecture on the Hugging Face model hub. In this blog post we will not describe in detail the Transformer architecture - to learn about that I can't recommend enough the [Illustrated Transformer blogpost from Jay Alammar (5)]( Today's goals are to give you an idea of where we are from an Open Source perspective using BERT-like models for inference on PyTorch and TensorFlow, and also what you can easily leverage to speedup inference. ## 2. Benchmarking methodology When it comes to leveraging BERT-like models from Hugging Face's model hub, there are many knobs which can be tuned to make things faster. Also, in order to quantify what \"faster\" means, we will rely on widely adopted metrics: - **Latency**: Time it takes for a single execution of the model (i.e. forward call) - **Throughput**: Number of executions performed in a fixed amount of time These two metrics will help us understand the benefits and tradeoffs along this blog post. The benchmarking methodology was reimplemented from scratch in order to integrate the latest features provided by transformers and also to let the community run and share benchmarks in an __hopefully easier__ way. The whole framework is now based on [Facebook AI & Research's Hydra configuration library]( allowing us to easily report and track all the items involved while running the benchmark, hence increasing the overall reproducibility. 
You can find the whole structure of the project [here]( On the 2021 version, we kept the ability to run inference workloads through PyTorch and Tensorflow as in the previous blog [(1)]( along with their traced counterpart [TorchScript (6)]( [Google Accelerated Linear Algebra (XLA) (7)]( Also, we decided to include support for [ONNX Runtime (8)]( as it provides many optimizations specifically targeting transformers based models which makes it a strong candidate to consider when discussing performance. Last but not least, this new unified benchmarking environment will allow us to easily run inference for different scenarios such as [Quantized Models (Zafrir & al.) (9)]( using less precise number representations (`float16`, `int8`, `int4`). This method known as **quantization** has seen an increased adoption among all major hardware providers. In the near future, we would like to integrate additional methods we are actively working on at Hugging Face, namely Distillation, Pruning & Sparsificaton. ## 3. Baselines All the results below were run on [Amazon Web Services (AWS) c5.metal instance]( leveraging an Intel Xeon Platinum 8275 CPU (48 cores/96 threads). The choice of this instance provides all the useful CPU features to speedup Deep Learning workloads such as: - AVX512 instructions set (_which might not be leveraged out-of-the-box by the various frameworks_) - Intel Deep Learning Boost (also known as Vector Neural Network Instruction - VNNI) which provides specialized CPU instructions for running quantized networks (_using int8 data type_) The choice of using _metal_ instance is to avoid any virtualization issue which can arise when using cloud providers. This gives us full control of the hardware, especially while targeting the NUMA (Non-Unified Memory Architecture) controller, which we will cover later in this post. _The operating system was Ubuntu 20.04 (LTS) and all the experiments were conducted using Hugging Face transformers version 4.5.0, PyTorch 1.8.1 & Google TensorFlow 2.4.0_ ## 4. Out of the box results Figure 1. PyTorch (1.8.1) vs Google TensorFlow (2.4.1) out of the box Figure 2. PyTorch (1.8.1) vs Google TensorFlow (2.4.1) out of the box - (Bigger Batch Size) Straigh to the point, out-of-the-box, PyTorch shows better inference results over TensorFlow for all the configurations tested here. It is important to note the results out-of-the-box might not reflect the \"optimal\" setup for both PyTorch and TensorFlow and thus it can look deceiving here. One possible way to explain such difference between the two frameworks might be the underlying technology to execute parallel sections within operators. PyTorch internally uses [OpenMP (10)]( along with [Intel MKL (now oneDNN) (11)]( for efficient linear algebra computations whereas TensorFlow relies on Eigen and its own threading implementation. ## 5. Scaling BERT Inference to increase overall throughput on modern CPU ### 5.1. Introduction There are multiple ways to improve the latency and throughput for tasks such as BERT inference. Improvements and tuning can be performed at various levels from enabling Operating System features, swapping dependent libraries with more performant ones, carefully tuning framework properties and, last but not least, using parallelization logic leveraging all the cores on the CPU(s). For the remainder of this blog post we will focus on the latter, also known as **Multiple Inference Stream**. 
The idea is simple: Allocate **multiple instances** of the same model and assign the execution of each instance to a **dedicated, non-overlapping subset of the CPU cores** in order to have truly parallel instances. ### 5.2. Cores and Threads on Modern CPUs On our way towards optimizing CPU inference for better usage of the CPU cores you might have already seen -_at least for the past 20 years_- modern CPUs specifications report \"cores\" and \"hardware threads\" or \"physical\" and \"logical\" numbers. These notions refer to a mechanism called **Simultaneous Multi-Threading** (SMT) or **Hyper-Threading** on Intel's platforms. To illustrate this, imagine two tasks **A** and **B**, executing in parallel, each on its own software thread. At some point, there is a high probability these two tasks will have to wait for some resources to be fetched from main memory, SSD, HDD or even the network. If the threads are scheduled on different physical cores, with no hyper-threading, during these periods the core executing the task is in an **Idle** state waiting for the resources to arrive, and effectively doing nothing... and hence not getting fully utilized Now, with **SMT**, the **two software threads for task A and B** can be scheduled on the same **physical core**, such that their execution is interleaved on that physical core: Task A and Task B will execute simultaneously on the physical core and when one task is halted, the other task can still continue execution on the core thereby increasing the utilization of that core. Figure 3. Illustration of Intel Hyper Threading technology (SMT) The figure 3. above simplifies the situation by assuming single core setup. If you want some more details on how SMT works on multi-cores CPUs, please refer to these two articles with very deep technical explanations of the behavior: - [Intel Hyper-Threading Technology - Technical User Guide (12)]( - [Introduction to Hyper-Threading Technology (13)]( Back to our model inference workload... If you think about it, in a perfect world with a fully optimized setup, computations take the majority of time. In this context, using the logical cores shouldn't bring us any performance benefit because both logical cores (hardware threads) compete for the core\u2019s execution resources. As a result, the tasks being a majority of general matrix multiplications (_[gemms (14)]( they are inherently CPU bounds and **does not benefits** from SMT. ### 5.3. Leveraging Multi-Socket servers and CPU affinity Nowadays servers bring many cores, some of them even support multi-socket setups (_i.e. multiple CPUs on the motherboard_). On Linux, the command `lscpu` reports all the specifications and topology of the CPUs present on the system: ```shell ubuntu@some-ec2-machine:~$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz Stepping: 7 CPU MHz: 1200.577 CPU max MHz: 3900.0000 CPU min MHz: 1200.0000 BogoMIPS: 6000.00 Virtualization: VT-x L1d cache: 1.5 MiB L1i cache: 1.5 MiB L2 cache: 48 MiB L3 cache: 71.5 MiB NUMA node0 CPU(s): 0-23,48-71 NUMA node1 CPU(s): 24-47,72-95 ``` In our case we have a machine with **2 sockets**, each socket providing **24 physical cores** with **2 threads per cores** (SMT). 
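If you want to double-check which logical CPUs are SMT siblings of the same physical core (useful below, when we pin one software thread per physical core), the Linux kernel exposes this mapping directly. A quick sanity check — on this machine, a CPU in the 0-23 range should be paired with its sibling in the 48-71 range: ```shell
# Logical CPUs sharing a physical core with CPU 0
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# Full logical CPU -> core -> socket -> NUMA node mapping
lscpu --extended=CPU,CORE,SOCKET,NODE
```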
Another interesting characteristic is the notion of **NUMA** node (0, 1) which represents how cores and memory are being mapped on the system. Non-Uniform Memory Access (**NUMA**) is the opposite of Uniform Memory Access (**UMA**) where the whole memory pool is accessible by all the cores through a single unified bus between sockets and the main memory. **NUMA** on the other hand splits the memory pool and each CPU socket is responsible to address a subset of the memory, reducing the congestion on the bus. Figure 5. Difference illustration of UMA and NUMA architectures (source (15)) In order to fully utilize the potential of such a beefy machine, we need to ensure our model instances are correctly dispatched across all the **physical** cores on all sockets along with enforcing memory allocation to be \"NUMA-aware\". On Linux, NUMA's process configuration can be tuned through [`numactl`]( which provides an interface to bind a process to a set of CPU cores (referred as **Thread Affinity**). Also, it allows tuning the memory allocation policy, making sure the memory allocated for the process is as close as possible to the cores' memory pool (referred as **Explicit Memory Allocation Directives**). _Note: Setting both cores and memory affinities is important here. Having computations done on socket 0 and memory allocated on socket 1 would ask the system to go over the sockets shared bus to exchange memory, thus leading to an undesired overhead._ ### 5.4. Tuning Thread Affinity & Memory Allocation Policy Now that we have all the knobs required to control the resources' allocation of our model instances we go further and see how to effectively deploy those and see the impact on latency and throughput. Let's go gradually to get a sense of what is the impact of each command and parameter. First, we start by launching our inference model without any tuning, and we observe how the computations are being dispatched on CPU cores (_Left_). ```shell python3 src/main.py model=bert-base-cased backend.name=pytorch batch_size=1 sequence_length=128 ``` Then we specify the core and memory affinity through `numactl` using all the **physical** cores and only a single thread (thread 0) per core (_Right_): ```shell numactl -C 0-47 -m 0,1 python3 src/main.py model=bert-base-cased backend.name=pytorch batch_size=1 sequence_length=128 ``` Figure 6. Linux htop command side-by-side results without & with Thread Affinity set As you can see, without any specific tuning, PyTorch and TensorFlow dispatch the work on a single socket, using all the logical cores in that socket (both threads on 24 cores). Also, as we highlighted earlier, we do not want to leverage the **SMT** feature in our case, so we set the process' thread affinity to target only 1 hardware thread. _Note, this is specific to this run and can vary depending on individual setups. Hence, it is recommended to check thread affinity settings for each specific use-case._ Let's take sometime from here to highlight what we did with `numactl`: - `-C 0-47` indicates to `numactl` what is the thread affinity (cores 0 to 47). - `-m 0,1` indicates to `numactl` to allocate memory on both CPU sockets If you wonder why we are binding the process to cores [0...47], you need to go back to look at the output of `lscpu`. From there you will find the section `NUMA node0` and `NUMA node1` which has the form `NUMA node ` In our case, each socket is one NUMA node and there are 2 NUMA nodes. 
Each socket or each NUMA node has 24 physical cores and 2 hardware threads per core, so 48 logical cores. For NUMA node 0 (socket 0), logical processors 0-23 are hardware thread 0 and 48-71 are hardware thread 1 on its 24 physical cores. Likewise, for NUMA node 1 (socket 1), 24-47 are hardware thread 0 and 72-95 are hardware thread 1, matching the `lscpu` output above. As we are targeting just 1 thread per physical core, as explained earlier, we pick only thread 0 on each core and hence logical processors 0-47. Since we are using both sockets, we need to also bind the memory allocations accordingly (0,1). _Please note that using both sockets may not always give the best results, particularly for small problem sizes. The benefit of using compute resources across both sockets might be reduced or even negated by cross-socket communication overhead._ ## 6. Core count scaling - Does using more cores actually improve performance? When thinking about possible ways to improve our model inference performance, the first rational solution might be to throw some more resources at the same amount of work. Through the rest of this blog series, we will refer to this setup as **Core Count Scaling**, meaning that only the number of cores used on the system to achieve the task will vary. This is also often referred to as Strong Scaling in the HPC world. At this stage, you may wonder what the point is of allocating only a subset of the cores rather than throwing all the horses at the task to achieve minimum latency. Indeed, depending on the problem size, throwing more resources at the task might give better results, but it is also possible that for small problems putting more CPU cores to work doesn't improve the final latency. In order to illustrate this, the figure below takes different problem sizes (`batch_size = 1, sequence length = {32, 128, 512}`) and reports the latencies with respect to the number of CPU cores used for running computations for both PyTorch and TensorFlow. Limiting the number of resources involved in computation is done by limiting the CPU cores involved in **intra** operations (_**intra** here means inside an operator doing computation, also known as \"kernel\"_). This is achieved through the following APIs: - PyTorch: `torch.set_num_threads(x)` - TensorFlow: `tf.config.threading.set_intra_op_parallelism_threads(x)` Figure 7. Latency measurements As you can see, depending on the problem size, the number of threads involved in the computations has an impact on the latency measurements. For small-sized and medium-sized problems, using only one socket would give the best performance. For large-sized problems, the overhead of the cross-socket communication is covered by the computation cost, thus benefiting from using all the cores available on both sockets. ## 7. Multi-Stream Inference - Using multiple instances in parallel If you're still reading this, you should now be in good shape to set up parallel inference workloads on CPU. Now, we are going to highlight some possibilities offered by the powerful hardware we have, and by tuning the knobs described before, to scale our inference as linearly as possible. In the following section we will explore another possible scaling solution, **Batch Size Scaling**, but before diving into this, let's take a look at how we can leverage Linux tools in order to assign Thread Affinity, allowing effective model instance parallelism. Instead of throwing more cores at the task as you would do in the core count scaling setup, now we will be using more model instances.
Each instance will run independently on its own subset of the hardware resources in a truly parallel fashion on a subset of the CPU cores. ### 7.1. How-to allocate multiple independent instances Let's start simple, if we want to spawn 2 instances, one on each socket with 24 cores assigned: ```shell numactl -C 0-23 -m 0 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=24 numactl -C 24-47 -m 1 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=24 ``` Starting from here, each instance does not share any resource with the other, and everything is operating at maximum efficiency from a hardware perspective. The latency measurements are identical to what a single instance would achieve, but throughput is actually 2x higher as the two instances operate in a truly parallel way. We can further increase the number of instances, lowering the number of cores assigned for each instance. Let's run 4 independent instances, each of them effectively bound to 12 CPU cores. ```shell numactl -C 0-11 -m 0 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=12 numactl -C 12-23 -m 0 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=12 numactl -C 24-35 -m 1 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=12 numactl -C 36-47 -m 1 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=128 backend.name=pytorch backend.num_threads=12 ``` The outcomes remain the same, our 4 instances are effectively running in a truly parallel manner. The latency will be slightly higher than the example before (2x less cores being used), but the throughput will be again 2x higher. ### 7.2. Smart dispatching - Allocating different model instances for different problem sizes One another possibility offered by this setup is to have multiple instances carefully tuned for various problem sizes. With a smart dispatching approach, one can redirect incoming requests to the right configuration giving the best latency depending on the request workload. ```shell # Small-sized problems (sequence length sequence = 384) use the entire CPU (on socket 1 - 24/24 cores used) numactl -C 24-37 -m 1 python3 src/main.py model=bert-base-cased batch_size=1 sequence_length=384 backend.name=pytorch backend.num_threads=24 ``` ## 8. Batch size scaling - Improving throughput and latency with multiple parallel & independent model instances One another very interesting direction for scaling up inference is to actually put some more model instances into the pool along with reducing the actual workload each instance receives proportionally. This method actually changes both the size of the problem (_batch size_), and the resources involved in the computation (_cores_). To illustrate, imagine you have a server with `C` CPU cores, and you want to run a workload containing B samples with S tokens. You can represent this workload as a tensor of shape `[B, S]`, B being the size of the batch and S being the maximum sequence length within the B samples. For all the instances (`N`), each of them executes on `C / N` cores and would receive a subset of the task `[B / N, S]`. Each instance doesn't receive the global batch but instead, they all receive a subset of it `[B / N, S]` thus the name **Batch Size Scaling**. 
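To make this concrete, here is what such a launch could look like on our 2-socket machine for a global batch of 8 split across 4 instances, each pinned to 12 physical cores and receiving a chunk of batch size 2 (the same launcher syntax as the commands above; this mirrors one of the configurations reported below): ```shell
numactl -C 0-11 -m 0 python3 src/main.py model=bert-base-cased batch_size=2 sequence_length=128 backend.name=pytorch backend.num_threads=12
numactl -C 12-23 -m 0 python3 src/main.py model=bert-base-cased batch_size=2 sequence_length=128 backend.name=pytorch backend.num_threads=12
numactl -C 24-35 -m 1 python3 src/main.py model=bert-base-cased batch_size=2 sequence_length=128 backend.name=pytorch backend.num_threads=12
numactl -C 36-47 -m 1 python3 src/main.py model=bert-base-cased batch_size=2 sequence_length=128 backend.name=pytorch backend.num_threads=12
```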
In order to highlight the benefits of such scaling method, the charts below reports both the latencies when scaling up model instances along with the effects on the throughput. When looking at the results, let's focus on the latency and the throughput aspects: On one hand, we are taking the maximum latency over the pool of instances to reflect the time it takes to process all the samples in the batch. Putting it differently, as instances operate in a truly parallel fashion, the time it takes to gather all the batch chunks from all the instances is driven by the longest time it takes for individual instance in the pool to get their chunk done. As you can see below on Figure 7., the actual latency gain when increasing the number of instances is really dependent of the problem size. In all cases, we can find an optimal resource allocation (batch size & number of instances) to minimize our latency but, there is no specific pattern on the number of cores to involve in the computation. Also, it is important to notice the results might look totally different on another system _(i.e. Operating System, Kernel Version, Framework version, etc.)_ Figure 8. sums up the best multi-instance configuration when targeting minimum latency by taking the minimum over the number of instances involved. For instance, for `{batch = 8, sequence length = 128}` using 4 instances (each with `{batch = 2}` and 12 cores) gives the best latency measurements. The Figure 9. reports all the setups minimizing latency for both PyTorch and TensorFlow for various problem-sizes. _**Spoiler**: There are numerous other optimizations we will discuss in a follow-up blog post which will substantially impact this chart._ Figure 8. Max latency evolution with respect to number of instances for a total batch size of 8 Figure 9. Optimal number of instance minimizing overall latency for a total batch size of 8 On a second hand, we observe the throughput as the sum of all the model instance executing in parallel. It allows us to visualize the scalability of the system when adding more and more instances each of them with fewer resources but also proportional workload. Here, the results show almost linear scalability and thus an optimal hardware usage. Figure 10. Sum throughput with respect to number of instances for a total batch size of 8 ## 9. Conclusion Through this blog post, we covered out-of-box BERT inference performance one can expect for PyTorch and TensorFlow, from a simple PyPi install and without further tuning. It is important to highlight results provided here reflects out-of-the-box framework setup hence, they might not provide the absolute best performances. We decided to not include optimizations as part of this blog post to focus on hardware and efficiency. Optimizations will be discussed in the second part! Then, we covered and detailed the impact, and the importance of setting the thread affinity along with the trade-off between the target problem size, and the number of cores required for achieving the task. Also, it is important to define **which criteria** _(i.e. latency vs throughput)_ to use when optimizing your deployment as the resulting setups might be totally different. On a more general note, small problem sizes (_short sequences and/or small batches_) might require much fewer cores to achieve the best possible latency than big problems (_very long sequences and/or big batches_). 
It is interesting to cover all these aspects when thinking about the final deployment platform as it might cut the cost of the infrastructure drastically. For instance, our 48-core machine costs **4.848\\$/h** whereas a smaller instance with only 8 cores lowers the cost to **0.808\\$/h**, leading to a **6x cost reduction**. Last but not least, many of the knobs discussed along this blog post can be automatically tuned through a [launcher script]( heavily inspired by the original script made by Intel and available [here]( The launcher script is able to automatically start your Python process(es) with the correct thread affinity, effectively splitting resources across instances along with many other performance tips! We will detail many of these tips in the second part. In the follow-up blog post, more advanced settings and tuning techniques to decrease model latency even further will be covered, such as: - Launcher script walk-through - Tuning the memory allocation library - Using Linux's Transparent Huge Pages mechanisms - Using vendor-specific Math/Parallel libraries Stay tuned! ## Acknowledgments - [Omry Yadan]( (Facebook FAIR) - Author of [OmegaConf]( & [Hydra]( for all the tips on setting up Hydra correctly. - All Intel & Intel Labs' NLP colleagues - For the ongoing optimizations and research efforts they are putting into transformers and more generally in the NLP field. - Hugging Face colleagues - For all the comments and improvements in the reviewing process. ## References 1. [Benchmarking Transformers: PyTorch and TensorFlow]( 2. [HuggingFace's Transformers: State-of-the-art Natural Language Processing]( 3. [HuggingFace's Model Hub]( 4. [BERT - Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin & al. 2018)]( 5. [Illustrated Transformer blogpost from Jay Alammar]( 6. [PyTorch - TorchScript]( 7. [Google Accelerated Linear Algebra (XLA)]( 8. [ONNX Runtime - Optimize and Accelerate Machine Learning Inferencing and Training]( 9. [Q8BERT - Quantized 8Bit BERT (Zafrir & al. 2019)]( 10. [OpenMP]( 11. [Intel oneDNN]( 12. [Intel Hyper-Threading Technology - Technical User Guide]( 13. [Introduction to Hyper-Threading Technology]( 14. [BLAS (Basic Linear Algebra Subprogram) - Wikipedia]( 15. [Optimizing Applications for NUMA]("}
{"title": "bert-cpu-scaling-part-2.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Scaling up BERT-like model Inference on modern CPU - Part 2\" authors: - user: echarlaix - user: jeffboudier - user: mfuntowicz - user: michaelbenayoun --- # Scaling up BERT-like model Inference on modern CPU - Part 2 ## Introduction: Using Intel Software to Optimize AI Efficiency on CPU As we detailed in our [previous blog post]( Intel Xeon CPUs provide a set of features especially designed for AI workloads such as AVX512 or VNNI (Vector Neural Network Instructions) for efficient inference using integer quantized neural network for inference along with additional system tools to ensure the work is being done in the most efficient way. In this blog post, we will focus on software optimizations and give you a sense of the performances of the new Ice Lake generation of Xeon CPUs from Intel. Our goal is to give you a full picture of what\u2019s available on the software side to make the most out of your Intel hardware. As in the previous blog post, we show the performance with benchmark results and charts, along with new tools to make all these knobs and features easy to use. Back in April, Intel launched its [latest generation of Intel Xeon processors]( codename Ice Lake, targeting more efficient and performant AI workloads. More precisely, Ice Lake Xeon CPUs can achieve up to 75% faster inference on a variety of NLP tasks when comparing against the previous generation of Cascade Lake Xeon processors. This is achieved by a combination of both hardware and software improvements, [such as new instructions]( and PCIe 4.0 featured on the new Sunny Cove architecture to supports Machine Learning and Deep Learning workloads. Last but not least, Intel worked on dedicated optimizations for various frameworks which now come with Intel\u2019s flavors like [Intel\u2019s Extension for Scikit Learn]( [Intel TensorFlow]( and [Intel PyTorch Extension]( All these features are very low-level in the stack of what Data Scientists and Machine Learning Engineers use in their day-to-day toolset. In a vast majority of situations, it is more common to rely on higher level frameworks and libraries to handle multi-dimensional arrays manipulation such as [PyTorch]( and [TensorFlow]( and make use of highly tuned mathematical operators such as [BLAS (Basic Linear Algebra Subroutines)]( for the computational part. In this area, Intel plays an essential role by providing software components under the oneAPI umbrella which makes it very easy to use highly efficient linear algebra routines through Intel [oneMKL (Math Kernel Library)]( higher-level parallelization framework with Intel OpenMP or the [Threading Building Blocks (oneTBB)]( Also, oneAPI provides some domain-specific libraries such as Intel [oneDNN]( for deep neural network primitives (ReLU, fully-connected, etc.) or [oneCCL]( for collective communication especially useful when using distributed setups to access efficient all-reduce operations over multiple hosts. Some of these libraries, especially MKL or oneDNN, are natively included in frameworks such as PyTorch and TensorFlow ([since 2.5.0]( to bring all the performance improvements to the end user out of the box. When one would like to target very specific hardware features, Intel provides custom versions of the most common software, especially optimized for the Intel platform. 
This is for instance the case with TensorFlow, [for which Intel provides custom, highly tuned and optimized versions of the framework]( or with the Intel PyTorch Extension (IPEX) framework which can be considered as a feature laboratory before upstreaming to PyTorch. ## Deep Dive: Leveraging advanced Intel features to improve AI performances ### Performance tuning knobs As highlighted above, we are going to cover a new set of tunable items to improve the performance of our AI application. From a high-level point of view, every machine learning and deep learning framework is made of the same ingredients: 1. A structural way of representing data in memory (vector, matrices, etc.) 2. Implementation of mathematical operators 3. Efficient parallelization of the computations on the target hardware _In addition to the points listed above, deep learning frameworks provide ways to represent data flow and dependencies to compute gradients. This falls out of the scope of this blog post, and it leverages the same components as the ones listed above!_ Figure 1. Intel libraries overview under the oneAPI umbrella ### 1. Memory allocation and management libraries This blog post will deliberately skip the first point about the data representation as it is something rather framework specific. For reference, PyTorch uses its very own implementation, called [ATen]( while TensorFlow relies on the open source library [Eigen]( for this purpose. While it\u2019s very complex to apply generic optimizations to different object structures and layouts, there is one area where we can have an impact: Memory Allocation. As a short reminder, memory allocation here refers to the process of programmatically asking the operating system a dynamic (unknown beforehand) area on the system where we will be able to store items into, such as the malloc and derived in C or the new operator in C++. Memory efficiency, both in terms of speed but also in terms of fragmentation, is a vast scientific and engineering subject with multiple solutions depending on the task and underlying hardware. Over the past years we saw more and more work in this area, with notably: - [jemalloc]( (Facebook - 2005) - [mimalloc]( (Microsoft - 2019) - [tcmalloc]( (Google - 2020) Each pushes forward different approaches to improve aspects of the memory allocation and management on various software. ### 2. Efficient parallelization of computations Now that we have an efficient way to represent our data, we need a way to take the most out of the computational hardware at our disposal. Interestingly, when it comes to inference, CPUs have a potential advantage over GPUs in the sense they are everywhere, and they do not require specific application components and administration staff to operate them. Modern CPUs come with many cores and complex mechanisms to increase the general performances of software. Yet, as we highlighted on [the first blog post]( they also have features which can be tweaked depending on the kind of workload (CPU or I/O bound) you target, to further improve performances for your application. Still, implementing parallel algorithms might not be as simple as throwing more cores to do the work. Many factors, such as data structures used, concurrent data access, CPU caches invalidation - all of which might prevent your algorithm from being effectively faster. As a reference talk, we recommend the talk from [**Scott Meyers: CPU Caches and Why You Care**]( if you are interested in diving more into the subject. 
Thankfully, there are libraries which make the development process of such parallel algorithms easier and less error-prone. Among the most common parallel libraries we can mention OpenMP and TBB (Threading Building Blocks), which work at various levels, from programming API in C/C++ to environment variable tuning and dynamic scheduling. On Intel hardware, it is advised to use the Intel implementation of the OpenMP specification often referred as \"IOMP\" available as part of the [Intel oneAPI toolkit]( Figure 2. Code snippet showing parallel computation done through OpenMP [comment]: () ### 3. Optimized mathematical operators Now that we covered the necessary building blocks for designing efficient data structures and parallel algorithms, the last remaining piece is the one running the computation, the one implementing the variety of mathematical operators and neural network layers to do what we love most, designing neural networks! In every programmer toolkit, there are multiple levels which can bring mathematical operations support, which can then be optimized differently depending on various factors such as the data storage layout being used (Contiguous memory, Chunked, Packed, etc.), the data format representing each scalar element (Float32, Integer, Long, Bfloat16, etc.) and of course the various instructions being supported by your processor. Nowadays, almost all processors support basic mathematical operations on scalar items (one single item at time) or in vectorized mode (meaning they operate on multiple items within the same CPU instructions, referred as SIMD \u201cSingle Instruction Multiple Data\u201d). Famous sets of SIMD instructions are SSE2, AVX, AVX2 and the AVX-512 present on the latest generations of Intel CPUs being able to operate over 16 bytes of content within a single CPU clock. Most of the time, one doesn't have to worry too much about the actual assembly being generated to execute a simple element-wise addition between two vectors, but if you do, again there are some libraries which allow you to go one level higher than writing code calling CPU specific intrinsic to implement efficient mathematical kernels. This is for instance what Intel\u2019s MKL \u201cMath Kernel Library\u201d provides, along with the famous BLAS \u201cBasic Linear Algebra Subroutines\u201d interface to implement all the basic operations for linear algebra. Finally, on top of this, one can find some domain specific libraries such as Intel's oneDNN which brings all the most common and essential building blocks required to implement neural network layers. Intel MKL and oneDNN are natively integrated within the PyTorch framework, where it can enable some performance speedup for certain operations such as Linear + ReLU or Convolution. On the TensorFlow side, oneDNN can be enabled by setting the environment variable `TF_ENABLE_ONEDNN_OPTS=1` (_TensorFlow >= 2.5.0_) to achieve similar machinery under the hood. ## More Efficient AI Processing on latest Intel Ice Lake CPUs In order to report the performances of the Ice Lake product lineup we will closely follow [the methodology we used for the first blog]( post of this series. As a reminder, we will adopt the exact same schema to benchmark the various setups we will highlight through this second blog post. 
More precisely, the results presented in the following sections are based on: - PyTorch: 1.9.0 - TensorFlow: 2.5.0 - Batch Sizes: 1, 4, 8, 16, 32, 128 - Sequence Lengths: 8, 16, 32, 64, 128, 384, 512 We will present the results through metrics accepted by the field to establish the performances of the proposed optimizations: - Latency: Time it takes to execute a single inference request (i.e., \u201cforward call\u201d) through the model, expressed in millisecond. - Throughput: Number of inference requests (i.e., \u201cforward calls\u201d) the system can sustain within a defined period, expressed in call/sec. We will also provide an initial baseline showing out-of-the-box results and a second baseline applying all the different optimizations we highlighted in the first blogpost. Everything was run on an Intel provided cloud instance featuring the [Ice Lake Xeon Platinum 8380]( CPU operating on Ubuntu 20.04.2 LTS. You can find the same processors on the various cloud providers: - [AWS m6i / c6i instances]( - [Azure Ev5 / Dv5 series]( Figure 3. Intel Ice Lake Xeon 8380 Specifications ### Establishing the baseline As mentioned previously, the baselines will be composed of two different setups: - Out-of-the-box: We are running the workloads as-is, without any tuning - Optimized: We apply the various knobs present in [Blog #1]( Also, from the comments we had about the previous blog post, we wanted to change the way we present the framework within the resulting benchmarks. As such, through the rest of this second blog post, we will split framework benchmarking results according to the following: - Frameworks using \u201ceager\u201d mode for computations (PyTorch, TensorFlow) - Frameworks using \u201cgraph\u201d mode for computations (TorchScript, TensorFlow Graph, Intel Tensorflow) #### Baseline: Eager frameworks latencies Frameworks operating in eager mode usually discover the actual graph while executing it. More precisely, the actual computation graph is not known beforehand and you gradually (_eagerly_) execute one operator which will become the input of the next one, etc. until you reach leaf nodes (outputs). These frameworks usually provide more flexibility in the algorithm you implement at the cost of increased runtime overhead and slightly potential more memory usage to keep track of all the required elements for the backward pass. Last but not least, it is usually harder through these frameworks to enable graph optimizations such as operator fusion. For instance, many deep learning libraries such as oneDNN have optimized kernels for Convolution + ReLU but you actually need to know before executing the graph that this pattern will occur within the sequence of operation, which is, by design, not something possible within eager frameworks. Figure 4. PyTorch latencies with respect to the number of cores involved Figure 5. Google's TensorFlow latencies with respect to the number of cores involved Figure 6. Google's TensorFlow with oneDNN enabled latencies with respect to the number of cores involved Figure 7. Intel TensorFlow latencies with respect to the number of cores involved The global trend highlights the positive impact of the number of cores on the observed latencies. In most of the cases, increasing the number of cores reduces the computation time across the different workload sizes. 
Still, putting more cores to the task doesn't result in monotonic latency reductions; there is always a trade-off between the workload\u2019s size and the number of resources you allocate to execute the job. As you can see on the charts above, one very common pattern tends to arise from using all the cores available on systems with more than one CPU (more than one socket). The inter-socket communication introduces a significant latency overhead and results in very little improvement, or even in increased latency overall. Also, this inter-socket communication overhead tends to be less and less perceptible as the workload becomes larger, meaning that larger workloads do benefit from using all the available cores. In this domain, PyTorch (Figure 4.) and Intel TensorFlow (Figure 7.) seem to have slightly better parallelism support, as shown for sequence lengths 384 and 512, for which using all the cores still reduces the observed latency. #### Baseline: Graph frameworks latencies This time we compare performance when using frameworks in \u201cGraph\u201d mode, where the graph is fully known beforehand, and all the allocations and optimizations such as graph pruning and operators fusing can be made. Figure 8. TorchScript latencies with respect to the number of cores involved Figure 9. Google's TensorFlow latencies with respect to the number of cores involved Figure 10. Google's TensorFlow with oneDNN enabled latencies with respect to the number of cores involved Figure 11. Intel TensorFlow latencies with respect to the number of cores involved This is often referred to as \u201ctracing\u201d the graph and, as you can see here, the results are not that different between TorchScript (the graph execution mode from PyTorch) and the TensorFlow flavors. All TensorFlow implementations seem to perform better than TorchScript when the parallelization is limited (low number of cores involved in the intra operation computations) but this does not seem to scale efficiently as we increase the computation resources, whereas TorchScript seems to be able to better leverage the power of modern CPUs. Still, the margin between all these frameworks is in most cases very limited. ### Tuning the Memory Allocator: Can this impact the latencies observed? One crucial component every program dynamically allocating memory relies on is the memory allocator. If you are familiar with C/C++ programming, this component provides the low-level bits behind malloc/free or new/delete. Most of the time you don\u2019t have to worry too much about it and the default one (glibc for instance on most Linux distributions) will provide great performance out of the box. Still, in some situations it might not provide the most efficient performance, as these default allocators are designed to be \u201cgood\u201d most of the time, rather than fine-tuned for specific workloads or levels of parallelism. So, what are the alternatives, and when are they more suitable than the default ones? Well, again, it depends on the context around your software. Possible situations are a heavy number of allocations/de-allocations causing fragmentation over time, specific hardware and/or architecture you\u2019re executing your software on, and finally the level of parallelism of your application. Do you see where this is going?
Deep learning and by extension all the applications doing heavy computations are heavily multi-threaded, that\u2019s also the case for software libraries such as PyTorch, TensorFlow and any other frameworks targeting Machine Learning workloads. The default memory allocator strategies often rely on global memory pools which require the usage of synchronization primitives to operate, increasing the overall pressure on the system, reducing the performance of your application. Some recent works by companies such as Google, Facebook and Microsoft provided alternative memory allocation strategies implemented in custom memory allocator libraries one can easily integrate directly within its software components or use dynamic shared library preload to swap the library being used to achieve the allocation/de-allocation. Among these libraries, we can cite a few of them such as [tcmalloc](), [jemalloc]() and [mimalloc](). Figure 12. Various memory allocators benchmarked on different tasks Through this blog post we will only focus on benchmarking tcmalloc and jemalloc as potential memory allocators drop-in candidates. To be fully transparent, for the scope of the results below we used tcmalloc as part of the gperftools package available on Ubuntu distributions version 2.9 and jemalloc 5.1.0-1. #### Memory allocator benchmarks Again, we first compare performance against frameworks executing in an eager fashion. This is potentially the use case where the allocator can play the biggest role: As the graph is unknown before its execution, each framework must manage the memory required for each operation when it meets the actual execution of the above node, no planning ahead possible. In this context, the allocator is a major component due to all the system calls to allocate and reclaim memory. Figure 13. PyTorch memory allocator and cores scaling latencies Figure 14. Google's TensorFlow memory allocator and cores scaling latencies Figure 15. Google's TensorFlow with oneDNN enabled memory allocator and cores scaling latencies Figure 16. Intel TensorFlow memory allocator and cores scaling latencies As per the graph above, you can notice that the standard library allocator (glibc) is often behind performance-wise but provides reasonable performance. Jemalloc allocator is sometimes the fastest around but in very specific situations, where the concurrency is not that high, this can be explained by the underlying structure jemalloc uses internally which is out of the scope of this blog, but you can read the [Facebook Engineering blog]( if you want to know more about it. Finally, tcmalloc seems to be the one providing generally best performances across all the workloads benchmarked here. Again, tcmalloc has a different approach than Jemalloc in the way it allocates resources, especially tcmalloc maintains a pool of memory segments locally for each thread, which reduces the necessity to have global, exclusive, critical paths. Again, for more details, I invite you to read the full [blog by Google Abseil team]( Now, back to the graph mode where we benchmark framework having an omniscient representation of the overall computation graph. Figure 17. TorchScript memory allocator and cores scaling latencies Figure 18. Google's TensorFlow memory allocator and cores scaling latencies Figure 19. Google's TensorFlow with oneDNN enabled memory allocator and cores scaling latencies Figure 20. 
Intel TensorFlow memory allocator and cores scaling latencies This time, by knowing the underlying structure of the operator flows and matrix shapes involved then the framework can plan and reserve the required resources beforehand. In this context, and as it is shown in the chart above, the difference between framework is very small and there is no clear winner between jemalloc and tcmalloc. Of course, glibc is still slightly behind as a general-purpose memory allocator, but the margin is less significant than in the eager setup. To sum it up, tuning the memory allocator can provide an interesting item to grab the last milliseconds' improvement at the end of the optimization process, especially if you are already using traced computation graphs. ### OpenMP In the previous section we talked about the memory management within machine learning software involving mostly CPU-bound workloads. Such software often relies on intermediary frameworks such as PyTorch or TensorFlow for Deep Learning which commonly abstract away all the underlying, highly parallelized, operator implementations. Writing such highly parallel and optimized algorithms is a real engineering challenge, and it requires a very low-level understanding of all the actual elements coming into play operated by the CPU (synchronization, memory cache, cache validity, etc.). In this context, it is very important to be able to leverage primitives to implement such powerful algorithms, reducing the delivery time and computation time by a large margin compared to implementing everything from scratch. There are many libraries available which provide such higher-level features to accelerate the development of algorithms. Among the most common, one can look at OpenMP, Thread Building Blocks and directly from the C++ when targeting a recent version of the standard. In the following part of this blog post, we will restrict ourselves to OpenMP and especially comparing the GNU, open source and community-based implementation, to the Intel OpenMP one. The latter especially targets Intel CPUs and is optimized to provide best of class performances when used as a drop-in replacement against the GNU OpenMP one. OpenMP exposes [many environment variables]( to automatically configure the underlying resources which will be involved in the computations, such as the number of threads to use to dispatch computation to (intra-op threads), the way the system scheduler should bind each of these threads with respect to the CPU resources (threads, cores, sockets) and some other variables which bring further control to the user. Intel OpenMP exposes [more of these environment variables]( to provide the user even more flexibility to adjust the performance of its software. Figure 21. OpenMP vs Intel OpenMP latencies running PyTorch Figure 22. OpenMP vs Intel OpenMP latencies running PyTorch As stated above, tuning OpenMP is something you can start to tweak when you tried all the other, system related, tuning knobs. It can bring a final speed up to you model with just a single environment variable to set. Also, it is important to note that tuning OpenMP library will only work within software that uses the OpenMP API internally. More specially, now only PyTorch and TorchScript really make usage of OpenMP and thus benefit from OpenMP backend tuning. This also explains why we reported latencies only for these two frameworks. 
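To make the allocator and OpenMP knobs from the last two sections concrete, here is a rough sketch of how one could launch a run with a swapped-in allocator and Intel OpenMP tuning, reusing the `src/main.py` style of invocation from the previous post. The shared library paths are purely illustrative and depend on your distribution and on how tcmalloc, jemalloc and Intel OpenMP were installed; the `KMP_*` variables only take effect when the Intel OpenMP runtime is actually the one loaded.

```python
import os
import shlex
import subprocess

env = os.environ.copy()

# Swap the memory allocator by preloading it (library paths are distribution dependent)
env['LD_PRELOAD'] = '/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4'
# For jemalloc, point LD_PRELOAD at libjemalloc instead
# To use Intel OpenMP rather than GNU OpenMP, additionally preload libiomp5.so

# Intra-op parallelism knobs; the KMP_* ones are Intel OpenMP specific
env['OMP_NUM_THREADS'] = '24'                          # number of intra-op threads
env['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'   # thread-to-core binding policy
env['KMP_BLOCKTIME'] = '1'                             # ms a thread spins before sleeping

cmd = (
    'python3 src/main.py model=bert-base-cased batch_size=1 '
    'sequence_length=128 backend.name=pytorch backend.num_threads=24'
)
subprocess.run(shlex.split(cmd), env=env, check=True)
```

This is exactly the kind of boilerplate that a launcher script can automate.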
## Automatic Performances Tuning: Bayesian Optimization with Intel SigOpt As mentioned above, many knobs can be tweaked to improve latency and throughput on Intel CPUs, but because there are many, tuning all of them to get optimal performance can be cumbersome. For instance, in our experiments, the following knobs were tuned: - The number of cores: although using as many cores as you have is often a good idea, it does not always provide the best performance because it also means more communication between the different threads. On top of that, having better performance with fewer cores can be very useful as it allows to run multiple instances at the same time, resulting in both better latency and throughput. - The memory allocator: which memory allocator out of the default malloc, Google's tcmalloc and Facebook's jemalloc provides the best performance? - The parallelism library: which parallelism library out of GNU OpenMP and Intel OpenMP provides the best performance? - Transparent Huge Pages: does enabling Transparent Huge Pages (THP) on the system provide better performance? - KMP block time parameter: sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping. Of course, the brute force approach, consisting of trying out all the possibilities will provide the best knob values to use to get optimal performance but, the size of the search space being `N x 3 x 2 x 2 x 2 = 24N`, it can take a lot of time: on a machine with 80 physical cores, this means trying out at most `24 x 80 = 1920` different setups! Fortunately, Intel's [SigOpt]( through Bayesian optimization, allows us to make these tuning experiments both faster and more convenient to analyse, while providing similar performance than the brute force approach. When we analyse the relative difference between the absolute best latency and what SigOpt provides, we observe that although it is often not as good as brute force (except for sequence length = 512 in that specific case), it gives very close performance, with **8.6%** being the biggest gap on this figure. Figure 23. Absolute best latency found by SigOpt automatic tuning vs brute force Figure 24. Relative best latency found by SigOpt automatic tuning vs brute force SigOpt is also very useful for analysis: it provides a lot of figures and valuable information. First, it gives the best value it was able to find, the corresponding knobs, and the history of trials and how it improved as trials went, for example, with sequence length = 20: Figure 25. SigOpt best value reporting Figure 26. SigOpt best value reporting In this specific setup, 16 cores along with the other knobs were able to give the best results, that is very important to know, because as mentioned before, that means that multiple instances of the model can be run in parallel while still having the best latency for each. It also shows that it had converged at roughly 20 trials, meaning that maybe 25 trials instead of 40 would have been enough. A wide range of other valuable information is available, such as Parameter Importance: As expected, the number of cores is, by far, the most important parameter, but the others play a part too, and it is very experiment dependent. For instance, for the sequence length = 512 experiment, this was the Parameter Importance: Figure 27. SigOpt best value for Batch Size = 1, Sequence Length = 20 Figure 28. 
SigOpt best value for Batch Size = 1, Sequence Length = 512 Here, not only was the impact of using OpenMP vs Intel OpenMP bigger than the impact of the allocator, but the relative importance of each knob is also more balanced than in the sequence length = 20 experiment. And many more figures, often interactive, are available on SigOpt such as: - 2D experiment history, allowing you to compare knobs vs knobs or knobs vs objectives - 3D experiment history, allowing you to do the same thing as the 2D experiment history with one more knob / objective. ## Conclusion - Accelerating Transformers for Production In this post, we showed how the new Intel Ice Lake Xeon CPUs are suitable for running AI workloads at scale along with the software elements you can swap and tune in order to exploit the full potential of the hardware. All these items are to be considered after setting up the various lower-level knobs detailed in [the previous blog]( to maximize the usage of all the cores and resources. At Hugging Face, we are on a mission to democratize state-of-the-art Machine Learning, and a critical part of our work is to make these state-of-the-art models as efficient as possible, to use less energy and memory at scale, and to be more affordable to run by companies of all sizes. Our collaboration with Intel through the [Hardware Partner Program]( enables us to make advanced efficiency and optimization techniques easily available to the community, through our new [Optimum open source library]( dedicated to production performance. For companies looking to accelerate their Transformer models' inference, our new [Infinity product offers a plug-and-play containerized solution]( achieving down to 1ms latency on GPU and 2ms on Intel Xeon Ice Lake CPUs. If you found this post interesting or useful to your work, please consider giving Optimum a star. And if this post was music to your ears, consider [joining our Machine Learning Optimization team]("}
{"title": "bert-inferentia-sagemaker.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia\" thumbnail: /blog//assets/55_bert_inferentia_sagemaker/thumbnail.png authors: - user: philschmid --- Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia notebook: [sagemaker/18_inferentia_inference]( The adoption of [BERT]( and [Transformers]( continues to grow. Transformer-based models are now not only achieving state-of-the-art performance in Natural Language Processing but also for [Computer Vision]( [Speech]( and [Time-Series]( Companies are now slowly moving from the experimentation and research phase to the production phase in order to use transformer models for large-scale workloads. But by default BERT and its friends are relatively slow, big, and complex models compared to the traditional Machine Learning algorithms. Accelerating Transformers and BERT is and will become an interesting challenge to solve in the future. AWS's take to solve this challenge was to design a custom machine learning chip designed for optimized inference workload called [AWS Inferentia]( AWS says that AWS Inferentia *\u201cdelivers up to 80% lower cost per inference and up to 2.3X higher throughput than comparable current generation GPU-based Amazon EC2 instances.\u201d* The real value of AWS Inferentia instances compared to GPU comes through the multiple Neuron Cores available on each device. A Neuron Core is the custom accelerator inside AWS Inferentia. Each Inferentia chip comes with 4x Neuron Cores. This enables you to either load 1 model on each core (for high throughput) or 1 model across all cores (for lower latency). ## Tutorial In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia. You can find the notebook here: [sagemaker/18_inferentia_inference]( You will learn how to: - [1. Convert your Hugging Face Transformer to AWS Neuron](#1-convert-your-hugging-face-transformer-to-aws-neuron) - [2. Create a custom `inference.py` script for `text-classification`](#2-create-a-custom-inferencepy-script-for-text-classification) - [3. Create and upload the neuron model and inference script to Amazon S3](#3-create-and-upload-the-neuron-model-and-inference-script-to-amazon-s3) - [4. Deploy a Real-time Inference Endpoint on Amazon SageMaker](#4-deploy-a-real-time-inference-endpoint-on-amazon-sagemaker) - [5. Run and evaluate Inference performance of BERT on Inferentia](#5-run-and-evaluate-inference-performance-of-bert-on-inferentia) Let's get started! --- *If you are going to use Sagemaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for Sagemaker. You can find [here]( more about it.* ## 1. Convert your Hugging Face Transformer to AWS Neuron We are going to use the [AWS Neuron SDK for AWS Inferentia]( The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to neuron compatible models, which can be run on [EC2 Inf1 instances]( As a first step, we need to install the [Neuron SDK]( and the required packages. 
*Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the `conda_python3` conda kernel.* ```python # Set Pip repository to point to the Neuron repository !pip config set global.extra-index-url # Install Neuron PyTorch !pip install torch-neuron==1.9.1.* neuron-cc[tensorflow] sagemaker>=2.79.0 transformers==4.12.3 --upgrade ``` After we have installed the Neuron SDK we can load and convert our model. Neuron models are converted using `torch_neuron` with its `trace` method similar to `torchscript`. You can find more information in our [documentation]( To be able to convert our model we first need to select the model we want to use for our text classification pipeline from [hf.co/models]( For this example, let's go with [distilbert-base-uncased-finetuned-sst-2-english]( but this can be easily adjusted with other BERT-like models. ```python model_id = \"distilbert-base-uncased-finetuned-sst-2-english\" ``` At the time of writing, the [AWS Neuron SDK does not support dynamic shapes]( which means that the input size needs to be static for compiling and inference. In simpler terms, this means that when the model is compiled with e.g. an input of batch size 1 and sequence length of 16, the model can only run inference on inputs with that same shape. *When using a `t2.medium` instance the compilation takes around 3 minutes* ```python import os import tensorflow # to workaround a protobuf version conflict issue import torch import torch.neuron from transformers import AutoTokenizer, AutoModelForSequenceClassification # load tokenizer and model tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True) # create dummy input for max length 128 dummy_input = \"dummy input which will be padded later\" max_length = 128 embeddings = tokenizer(dummy_input, max_length=max_length, padding=\"max_length\",return_tensors=\"pt\") neuron_inputs = tuple(embeddings.values()) # compile model with torch.neuron.trace and update config model_neuron = torch.neuron.trace(model, neuron_inputs) model.config.update({\"traced_sequence_length\": max_length}) # save tokenizer, neuron model and config for later use save_dir=\"tmp\" os.makedirs(\"tmp\",exist_ok=True) model_neuron.save(os.path.join(save_dir,\"neuron_model.pt\")) tokenizer.save_pretrained(save_dir) model.config.save_pretrained(save_dir) ``` ## 2. Create a custom `inference.py` script for `text-classification` The [Hugging Face Inference Toolkit]( supports zero-code deployments on top of the [pipeline feature]( from Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example]( Currently, this feature is not supported with AWS Inferentia, which means we need to provide an `inference.py` script for running inference. *If you would be interested in support for zero-code deployments for Inferentia let us know on the [forum]( --- To use the inference script, we need to create an `inference.py` script. In our example, we are going to overwrite the `model_fn` to load our neuron model and the `predict_fn` to create a text-classification pipeline. If you want to know more about the `inference.py` script check out this [example]( It explains amongst other things what `model_fn` and `predict_fn` are. ```python !mkdir code ``` We are using the `NEURON_RT_NUM_CORES=1` to make sure that each HTTP worker uses 1 Neuron core to maximize throughput. 
```python %%writefile code/inference.py import os from transformers import AutoConfig, AutoTokenizer import torch import torch.neuron # To use one neuron core per worker os.environ[\"NEURON_RT_NUM_CORES\"] = \"1\" # saved weights name AWS_NEURON_TRACED_WEIGHTS_NAME = \"neuron_model.pt\" def model_fn(model_dir): # load tokenizer and neuron model from model_dir tokenizer = AutoTokenizer.from_pretrained(model_dir) model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME)) model_config = AutoConfig.from_pretrained(model_dir) return model, tokenizer, model_config def predict_fn(data, model_tokenizer_model_config): # destruct model, tokenizer and model config model, tokenizer, model_config = model_tokenizer_model_config # create embeddings for inputs inputs = data.pop(\"inputs\", data) embeddings = tokenizer( inputs, return_tensors=\"pt\", max_length=model_config.traced_sequence_length, padding=\"max_length\", truncation=True, ) # convert to tuple for neuron model neuron_inputs = tuple(embeddings.values()) # run prediciton with torch.no_grad(): predictions = model(*neuron_inputs)[0] scores = torch.nn.Softmax(dim=1)(predictions) # return dictonary, which will be json serializable return [{\"label\": model_config.id2label[item.argmax().item()], \"score\": item.max().item()} for item in scores] ``` ## 3. Create and upload the neuron model and inference script to Amazon S3 Before we can deploy our neuron model to Amazon SageMaker we need to create a `model.tar.gz` archive with all our model artifacts saved into `tmp/`, e.g. `neuron_model.pt` and upload this to Amazon S3. To do this we need to set up our permissions. ```python import sagemaker import boto3 sess = sagemaker.Session() # sagemaker session bucket -> used for uploading data, models and logs # sagemaker will automatically create this bucket if it not exists sagemaker_session_bucket=None if sagemaker_session_bucket is None and sess is not None: # set to default bucket if a bucket name is not given sagemaker_session_bucket = sess.default_bucket() try: role = sagemaker.get_execution_role() except ValueError: iam = boto3.client('iam') role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn'] sess = sagemaker.Session(default_bucket=sagemaker_session_bucket) print(f\"sagemaker role arn: {role}\") print(f\"sagemaker bucket: {sess.default_bucket()}\") print(f\"sagemaker session region: {sess.boto_region_name}\") ``` Next, we create our `model.tar.gz`. The `inference.py` script will be placed into a `code/` folder. ```python # copy inference.py into the code/ directory of the model directory. !cp -r code/ tmp/code/ # create a model.tar.gz archive with all the model artifacts and the inference.py script. %cd tmp !tar zcvf model.tar.gz * %cd .. ``` Now we can upload our `model.tar.gz` to our session S3 bucket with `sagemaker`. ```python from sagemaker.s3 import S3Uploader # create s3 uri s3_model_path = f\"s3://{sess.default_bucket()}/{model_id}\" # upload model.tar.gz s3_model_uri = S3Uploader.upload(local_path=\"tmp/model.tar.gz\",desired_s3_uri=s3_model_path) print(f\"model artifcats uploaded to {s3_model_uri}\") ``` ## 4. Deploy a Real-time Inference Endpoint on Amazon SageMaker After we have uploaded our `model.tar.gz` to Amazon S3 can we create a custom `HuggingfaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker. 
```python from sagemaker.huggingface.model import HuggingFaceModel # create Hugging Face Model Class huggingface_model = HuggingFaceModel( model_data=s3_model_uri, # path to your model and script role=role, # iam role with permissions to create an Endpoint transformers_version=\"4.12\", # transformers version used pytorch_version=\"1.9\", # pytorch version used py_version='py37', # python version used ) # Let SageMaker know that we've already compiled the model via neuron-cc huggingface_model._is_compiled_model = True # deploy the endpoint endpoint predictor = huggingface_model.deploy( initial_instance_count=1, # number of instances instance_type=\"ml.inf1.xlarge\" # AWS Inferentia Instance ) ``` ## 5. Run and evaluate Inference performance of BERT on Inferentia The `.deploy()` returns an `HuggingFacePredictor` object which can be used to request inference. ```python data = { \"inputs\": \"the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .\", } res = predictor.predict(data=data) res ``` We managed to deploy our neuron compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 10,000 synchronous requests to our endpoint. ```python # send 10000 requests for i in range(10000): resp = predictor.predict( data={\"inputs\": \"it 's a charming and often affecting journey .\"} ) ``` Let's inspect the performance in cloudwatch. ```python print(f\" ``` The average latency for our BERT model is `5-6ms` for a sequence length of 128. Figure 1. Model Latency ### Delete model and endpoint To clean up, we can delete the model and endpoint. ```python predictor.delete_model() predictor.delete_endpoint() ``` ## Conclusion We successfully managed to compile a vanilla Hugging Face Transformers model to an AWS Inferentia compatible Neuron Model. After that we deployed our Neuron model to Amazon SageMaker using the new Hugging Face Inference DLC. We managed to achieve `5-6ms` latency per neuron core, which is faster than CPU in terms of latency, and achieves a higher throughput than GPUs since we ran 4 models in parallel. If you or you company are currently using a BERT-like Transformer for encoder tasks (text-classification, token-classification, question-answering etc.), and the latency meets your requirements you should switch to AWS Inferentia. This will not only save costs, but can also increase efficiency and performance for your models. We are planning to do a more detailed case study on cost-performance of transformers in the future, so stay tuned! Also if you want to learn more about accelerating transformers you should also check out Hugging Face [optimum]( --- Thanks for reading! If you have any questions, feel free to contact me, through [Github]( or on the [forum]( You can also connect with me on [Twitter]( or [LinkedIn]("}
{"title": "bertopic.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing BERTopic Integration with the Hugging Face Hub\" thumbnail: /blog/assets/145_bertopic/logo.png authors: - user: MaartenGr guest: true - user: davanstrien --- Introducing BERTopic Integration with the Hugging Face Hub []( We are thrilled to announce a significant update to the [BERTopic]( Python library, expanding its capabilities and further streamlining the workflow for topic modelling enthusiasts and practitioners. BERTopic now supports pushing and pulling trained topic models directly to and from the Hugging Face Hub. This new integration opens up exciting possibilities for leveraging the power of BERTopic in production use cases with ease. ## What is Topic Modelling? Topic modelling is a method that can help uncover hidden themes or \"topics\" within a group of documents. By analyzing the words in the documents, we can find patterns and connections that reveal these underlying topics. For example, a document about machine learning is more likely to use words like \"gradient\" and \"embedding\" compared to a document about baking bread. Each document usually covers multiple topics in different proportions. By examining the word statistics, we can identify clusters of related words that represent these topics. This allows us to analyze a set of documents and determine the topics they discuss, as well as the balance of topics within each document. More recently, new approaches to topic modelling have moved beyond using words to using more rich representations such as those offered through Transformer based models. ## What is BERTopic? BERTopic is a state-of-the-art Python library that simplifies the topic modelling process using various embedding techniques and [c-TF-IDF]( to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. *An overview of the BERTopic library* Whilst BERTopic is easy to get started with, it supports a range of advanced approaches to topic modelling including [guided]( [supervised]( [semi-supervised]( and [manual]( topic modelling. More recently BERTopic has added support for multi-modal topic models. BERTopic also have a rich set of tools for producing visualizations. BERTopic provides a powerful tool for users to uncover significant topics within text collections, thereby gaining valuable insights. With BERTopic, users can analyze customer reviews, explore research papers, or categorize news articles with ease, making it an essential tool for anyone looking to extract meaningful information from their text data. ## BERTopic Model Management with the Hugging Face Hub With the latest integration, BERTopic users can seamlessly push and pull their trained topic models to and from the Hugging Face Hub. This integration marks a significant milestone in simplifying the deployment and management of BERTopic models across different environments. The process of training and pushing a BERTopic model to the Hub can be done in a few lines ```python from bertopic import BERTopic topic_model = BERTopic(\"english\") topics, probs = topic_model.fit_transform(docs) topic_model.push_to_hf_hub('davanstrien/transformers_issues_topics') ``` You can then load this model in two lines and use it to predict against new data. 
```python from bertopic import BERTopic topic_model = BERTopic.load(\"davanstrien/transformers_issues_topics\") ``` By leveraging the power of the Hugging Face Hub, BERTopic users can effortlessly share, version, and collaborate on their topic models. The Hub acts as a central repository, allowing users to store and organize their models, making it easier to deploy models in production, share them with colleagues, or even showcase them to the broader NLP community. You can use the `libraries` filter on the hub to find BERTopic models. , which is still fast (zero-copy). We\u2019re excited to see more and more libraries leveraging safetensors for safe serialization. You can read more about a recent audit of the library in this [blog post]( ### An example of using BERTopic to explore RLFH datasets To illustrate some of the power of BERTopic let's look at an example of how it can be used to monitor changes in topics in datasets used to train chat models. The last year has seen several datasets for Reinforcement Learning with Human Feedback released. One of these datasets is the [OpenAssistant Conversations dataset]( This dataset was produced via a worldwide crowd-sourcing effort involving over 13,500 volunteers. Whilst this dataset already has some scores for toxicity, quality, humour etc., we may want to get a better understanding of what types of conversations are represented in this dataset. BERTopic offers one way of getting a better understanding of the topics in this dataset. In this case, we train a model on the English assistant responses part of the datasets. Resulting in a [topic model]( with 75 topics. BERTopic gives us various ways of visualizing a dataset. We can see the top 8 topics and their associated words below. We can see that the second most frequent topic consists mainly of \u2018response words\u2019, which we often see frequently from chat models, i.e. responses which aim to be \u2018polite\u2019 and \u2018helpful\u2019. We can also see a large number of topics related to programming or computing topics as well as physics, recipes and pets.  ``` We can predict on a single example text: ```python example = \"Stalemate is a drawn position. It doesn't matter who has captured more pieces or is in a winning position\" topic, prob = topic_model.transform(example) ``` We can get more information about the predicted topic ```python topic_model.get_topic_info(topic) ``` | | Count | Name | Representation | |---:|--------:|:--------------------------------------|:----------------------------------------------------------------------------------------------------| | 0 | 240 | 22_chess_chessboard_practice_strategy | ['chess', 'chessboard', 'practice', 'strategy', 'learn', 'pawn', 'board', 'pawns', 'play', 'decks'] | We can see here the topics predicted seem to make sense. We may want to extend this to compare the topics predicted for the whole dataset. ```python from datasets import load_dataset dataset = load_dataset(\"databricks/databricks-dolly-15k\") dolly_docs = dataset['train']['response'] dolly_topics, dolly_probs = topic_model.transform(dolly_docs) ``` We can then compare the distribution of topics across both datasets. We can see here that there seems to be a broader distribution across topics in the dolly dataset according to our BERTopic model. This might be a result of the different approaches to creating both datasets (we likely want to retrain a BERTopic across both datasets to ensure we are not missing topics to confirm this).  after 1991. 
- [MaartenGr/BERTopic_Wikipedia]( a model trained on 1,000,000 English Wikipedia pages. - [davanstrien/imdb_bertopic]( a model trained on the unsupervised split of the IMDB dataset. You can find a full overview of BERTopic models on the hub using the [libraries filter]( We invite you to explore the possibilities of this new integration and share your trained models on the hub!"}
{"title": "big-bird.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Understanding BigBird's Block Sparse Attention\" thumbnail: /blog/assets/18_big_bird/attn.png authors: - user: vasudevgupta --- # Understanding BigBird's Block Sparse Attention ## Introduction Transformer-based models have shown to be very useful for many NLP tasks. However, a major limitation of transformers-based models is its \\\\(O(n^2)\\\\) time & memory complexity (where \\\\(n\\\\) is sequence length). Hence, it's computationally very expensive to apply transformer-based models on long sequences \\\\(n > 512\\\\). Several recent papers, *e.g.* `Longformer`, `Performer`, `Reformer`, `Clustered attention` try to remedy this problem by approximating the full attention matrix. You can checkout 's recent blog [post]( in case you are unfamiliar with these models. `BigBird` (introduced in [paper]( is one of such recent models to address this issue. `BigBird` relies on **block sparse attention** instead of normal attention (*i.e.* BERT's attention) and can handle sequences up to a length of **4096** at a much lower computational cost compared to BERT. It has achieved SOTA on various tasks involving very long sequences such as long documents summarization, question-answering with long contexts. **BigBird RoBERTa-like** model is now available in Transformers. The goal of this post is to give the reader an **in-depth** understanding of big bird implementation & ease one's life in using BigBird with Transformers. But, before going into more depth, it is important to remember that the `BigBird's` attention is an approximation of `BERT`'s full attention and therefore does not strive to be **better** than `BERT's` full attention, but rather to be more efficient. It simply allows to apply transformer-based models to much longer sequences since BERT's quadratic memory requirement quickly becomes unbearable. Simply put, if we would have \\\\(\\infty\\\\) compute & \\\\(\\infty\\\\) time, BERT's attention would be preferred over block sparse attention (which we are going to discuss in this post). If you wonder why we need more compute when working with longer sequences, this blog post is just right for you! --- Some of the main questions one might have when working with standard `BERT`-like attention include: * Do all tokens really have to attend to all other tokens? * Why not compute attention only over important tokens? * How to decide what tokens are important? * How to attend to just a few tokens in a very efficient way? --- In this blog post, we will try to answer those questions. ### What tokens should be attended to? We will give a practical example of how attention works by considering the sentence \"BigBird is now available in HuggingFace for extractive question answering\". In `BERT`-like attention, every word would simply attend to all other tokens. Put mathematically, this would mean that each queried token \\\\( \\text{query-token} \\in \\{\\text{BigBird},\\text{is},\\text{now},\\text{available},\\text{in},\\text{HuggingFace},\\text{for},\\text{extractive},\\text{question},\\text{answering}\\} \\\\), would attend to the full list of \\\\( \\text{key-tokens} = \\left[\\text{BigBird},\\text{is},\\text{now},\\text{available},\\text{in},\\text{HuggingFace},\\text{for},\\text{extractive},\\text{question},\\text{answering} \\right]\\\\). Let's think about a sensible choice of key tokens that a queried token actually only should attend to by writing some pseudo-code. 
We will assume that the token `available` is queried and build a sensible list of key tokens to attend to. ```python >>> # let's consider the following sentence as an example >>> example = ['BigBird', 'is', 'now', 'available', 'in', 'HuggingFace', 'for', 'extractive', 'question', 'answering'] >>> # further, let's assume we're trying to understand the representation of 'available' i.e. >>> query_token = 'available' >>> # We will initialize an empty `set` and fill up the tokens of our interest as we proceed in this section. >>> key_tokens = set() # => currently the 'available' token doesn't attend to anything ``` Nearby tokens should be important because, in a sentence (sequence of words), the current word is highly dependent on neighboring past & future tokens. This intuition is the idea behind the concept of `sliding attention`. ```python >>> # considering `window_size = 3`, we will consider 1 token to the left & 1 to the right of 'available' >>> # left token: 'now' ; right token: 'in' >>> sliding_tokens = [\"now\", \"available\", \"in\"] >>> # let's update our collection with the above tokens >>> key_tokens.update(sliding_tokens) ``` **Long-range dependencies:** For some tasks, it is crucial to capture long-range relationships between tokens. *E.g.*, in `question-answering` the model needs to compare each token of the context to the whole question to be able to figure out which part of the context is useful for a correct answer. If most of the context tokens just attended to other context tokens, but not to the question, it becomes much harder for the model to filter important context tokens from less important context tokens. Now, `BigBird` proposes two ways of allowing long-term attention dependencies while staying computationally efficient. * **Global tokens:** Introduce some tokens which will attend to every token and which are attended to by every token. E.g.: *\"HuggingFace is building nice libraries for easy NLP\"*. Now, let's say *'building'* is defined as a global token, and the model needs to know the relation between *'NLP'* & *'HuggingFace'* for some task (Note: these 2 tokens are at the two extremes); having *'building'* attend globally to all other tokens will probably help the model to associate *'NLP'* with *'HuggingFace'*. ```python >>> # let's assume the 1st & last tokens to be `global`, then >>> global_tokens = [\"BigBird\", \"answering\"] >>> # fill up global tokens in our key tokens collection >>> key_tokens.update(global_tokens) ``` * **Random tokens:** Select some tokens randomly which will transfer information by passing it to other tokens, which in turn can pass it on to other tokens. This may reduce the cost of information travel from one token to another. ```python >>> # now we can choose `r` tokens randomly from our example sentence >>> # let's choose 'is' assuming `r=1` >>> random_tokens = [\"is\"] # Note: it is chosen completely randomly; so it can be anything else also. >>> # fill random tokens into our collection >>> key_tokens.update(random_tokens) >>> # it's time to see what tokens are in our `key_tokens` set >>> key_tokens {'now', 'is', 'in', 'answering', 'available', 'BigBird'} # Now, 'available' (the query we chose in our 1st step) will attend only to these tokens instead of attending to the complete sequence ``` This way, the query token attends only to a subset of all possible tokens while yielding a good approximation of full attention. The same approach is used for all other queried tokens.
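To make this concrete, here is a small illustrative sketch (our addition, not part of the original post) that computes an attention distribution for the query `available` over just the gathered `key_tokens`; the toy embedding size and the random vectors are arbitrary assumptions used purely for illustration, reusing the variables defined in the snippets above.

```python
>>> # illustrative sketch (not from the original post): attention over the gathered subset only
>>> import torch
>>> torch.manual_seed(0)
>>> d = 8  # toy embedding size, an arbitrary choice
>>> embeddings = {token: torch.randn(d) for token in example}
>>> q = embeddings[query_token]
>>> gathered = sorted(key_tokens)                        # the ~6 tokens collected above
>>> K = torch.stack([embeddings[t] for t in gathered])   # shape (6, d) instead of (10, d)
>>> scores = torch.softmax(q @ K.T / d**0.5, dim=-1)     # softmax over the subset only
>>> dict(zip(gathered, [round(s, 2) for s in scores.tolist()]))
>>> # -> a probability for each of the 6 gathered tokens; the other 4 tokens are never touched
```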
But remember, the whole point here is to approximate `BERT`'s full attention as efficiently as possible. Simply making each queried token attend all key tokens as it's done for BERT can be computed very effectively as a sequence of matrix multiplication on modern hardware, like GPUs. However, a combination of sliding, global & random attention appears to imply sparse matrix multiplication, which is harder to implement efficiently on modern hardware. One of the major contributions of `BigBird` is the proposition of a `block sparse` attention mechanism that allows computing sliding, global & random attention effectively. Let's look into it! ### Understanding the need for global, sliding, random keys with Graphs First, let's get a better understanding of `global`, `sliding` & `random` attention using graphs and try to understand how the combination of these three attention mechanisms yields a very good approximation of standard `Bert-like` attention. *The above figure shows `global` (left), `sliding` (middle) & `random` (right) connections respectively as a graph. Each node corresponds to a token and each line represents an attention score. If no connection is made between 2 tokens, then an attention score is assumed to 0.*  **BigBird block sparse attention** is a combination of sliding, global & random connections (total 10 connections) as shown in `gif` in left. While a graph of **normal attention** (right) will have all 15 connections (note: total 6 nodes are present). You can simply think of normal attention as all the tokens attending globally \\\\( {}^1 \\\\). **Normal attention:** Model can transfer information from one token to another token directly in a single layer since each token is queried over every other token and is attended by every other token. Let's consider an example similar to what is shown in the above figures. If the model needs to associate *'going'* with *'now'*, it can simply do that in a single layer since there is a direct connection joining both the tokens. **Block sparse attention:** If the model needs to share information between two nodes (or tokens), information will have to travel across various other nodes in the path for some of the tokens; since all the nodes are not directly connected in a single layer. *Eg.*, assuming model needs to associate *'going'* with *'now'*, then if only sliding attention is present the flow of information among those 2 tokens, is defined by the path: `going -> am -> i -> now` (i.e. it will have to travel over 2 other tokens). Hence, we may need multiple layers to capture the entire information of the sequence. Normal attention can capture this in a single layer. In an extreme case, this could mean that as many layers as input tokens are needed. If, however, we introduce some global tokens information can travel via the path: `going -> i -> now` (which is shorter). If we in addition introduce random connections it can travel via: `going -> am -> now`. With the help of random connections & global connections, information can travel very rapidly (with just a few layers) from one token to the next. In case, we have many global tokens, then we may not need random connections since there will be multiple short paths through which information can travel. This is the idea behind keeping `num_random_tokens = 0` when working with a variant of BigBird, called ETC (more on this in later sections). 
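To make the path-length argument concrete, here is a small illustrative sketch (our addition, not from the paper or the original post) that counts how many hops information needs to travel between the two end tokens of a toy sequence, first with only sliding (window) connections and then after adding a single global token; the helper function and the 6-token example are assumptions used only for illustration.

```python
# illustrative sketch: shortest information path with sliding-only vs. sliding + one global token
from collections import deque

def shortest_path_length(n, window=1, global_nodes=()):
    # build an undirected attention graph over n tokens with a sliding window
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i != j:
                edges[i].add(j)
    for g in global_nodes:            # a global token connects to every other token
        for i in range(n):
            if i != g:
                edges[g].add(i)
                edges[i].add(g)
    # plain BFS from token 0 to token n-1; each hop roughly corresponds to one attention layer
    seen, queue = {0}, deque([(0, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == n - 1:
            return dist
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(shortest_path_length(6))                     # 5 hops with sliding attention only
print(shortest_path_length(6, global_nodes=(2,)))  # 2 hops once a single token is global
```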
\\\\( {}^1 \\\\) In these graphics, we are assuming that the attention matrix is symmetric **i.e.** \\\\(\\mathbf{A}_{ij} = \\mathbf{A}_{ji}\\\\) since in a graph if some token **A** attends **B**, then **B** will also attend **A**. You can see from the figure of the attention matrix shown in the next section that this assumption holds for most tokens in BigBird | Attention Type | `global_tokens` | `sliding_tokens` | `random_tokens` | |-----------------|-------------------|------------------|------------------------------------| | `original_full` | `n` | 0 | 0 | | `block_sparse` | 2 x `block_size` | 3 x `block_size` | `num_random_blocks` x `block_size` | *`original_full` represents `BERT`'s attention while `block_sparse` represents `BigBird`'s attention. Wondering what the `block_size` is? We will cover that in later sections. For now, consider it to be 1 for simplicity* ## BigBird block sparse attention BigBird block sparse attention is just an efficient implementation of what we discussed above. Each token is attending some **global tokens**, **sliding tokens**, & **random tokens** instead of attending to **all** other tokens. The authors hardcoded the attention matrix for multiple query components separately; and used a cool trick to speed up training/inference on GPU and TPU.  *Note: on the top, we have 2 extra sentences. As you can notice, every token is just switched by one place in both sentences. This is how sliding attention is implemented. When `q[i]` is multiplied with `k[i,0:3]`, we will get a sliding attention score for `q[i]` (where `i` is index of element in sequence).* You can find the actual implementation of `block_sparse` attention [here]( This may look very scary now. But this article will surely ease your life in understanding the code. ### Global Attention For global attention, each query is simply attending to all the other tokens in the sequence & is attended by every other token. Let's assume `Vasudev` (1st token) & `them` (last token) to be global (in the above figure). You can see that these tokens are directly connected to all other tokens (blue boxes). ```python # pseudo code Q -> Query martix (seq_length, head_dim) K -> Key matrix (seq_length, head_dim) # 1st & last token attends all other tokens Q[0] x [K[0], K[1], K[2], ......, K[n-1]] Q[n-1] x [K[0], K[1], K[2], ......, K[n-1]] # 1st & last token getting attended by all other tokens K[0] x [Q[0], Q[1], Q[2], ......, Q[n-1]] K[n-1] x [Q[0], Q[1], Q[2], ......, Q[n-1]] ``` ### Sliding Attention The sequence of key tokens is copied 2 times with each element shifted to the right in one of the copies and to the left in the other copy. Now if we multiply query sequence vectors by these 3 sequence vectors, we will cover all the sliding tokens. Computational complexity is simply `O(3xn) = O(n)`. Referring to the above picture, the orange boxes represent the sliding attention. You can see 3 sequences at the top of the figure with 2 of them shifted by one token (1 to the left, 1 to the right). ```python # what we want to do Q[i] x [K[i-1], K[i], K[i+1]] for i = 1:-1 # efficient implementation in code (assume dot product multiplication ) [Q[0], Q[1], Q[2], ......, Q[n-2], Q[n-1]] x [K[1], K[2], K[3], ......, K[n-1], K[0]] [Q[0], Q[1], Q[2], ......, Q[n-1]] x [K[n-1], K[0], K[1], ......, K[n-2]] [Q[0], Q[1], Q[2], ......, Q[n-1]] x [K[0], K[1], K[2], ......, K[n-1]] # Each sequence is getting multiplied by only 3 sequences to keep `window_size = 3`. # Some computations might be missing; this is just a rough idea. 
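# --- illustrative addition (not from the original post): the same banded scores with torch ---
# assuming Q and K are real tensors of shape (n, head_dim), e.g.
#   import torch; n, head_dim = 10, 8
#   Q, K = torch.randn(n, head_dim), torch.randn(n, head_dim)
# the three shifted products can be written as:
#   scores_left   = (Q * torch.roll(K, shifts=1,  dims=0)).sum(-1)   # Q[i] . K[i-1]
#   scores_center = (Q * K).sum(-1)                                  # Q[i] . K[i]
#   scores_right  = (Q * torch.roll(K, shifts=-1, dims=0)).sum(-1)   # Q[i] . K[i+1]
#   sliding_scores = torch.stack([scores_left, scores_center, scores_right], dim=-1)  # (n, 3)
# this covers the whole band in O(n) instead of building the full (n, n) matrix; the
# wrap-around values at the two ends are discarded (or covered by global attention)
# in the real implementation.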
``` ### Random Attention Random attention ensures that each query token will also attend to a few random tokens. For the actual implementation, this means that the model gathers some tokens randomly and computes their attention score. ```python # r1, r2, ..., r are some random indices; Note: they are different for each row Q[1] x [K[r1], K[r2], ......, K[r]] . . . Q[n-2] x [K[r1], K[r2], ......, K[r]] # leaving the 0th & (n-1)th tokens out since they are already global ``` **Note:** The current implementation further divides the sequence into blocks & each notation is defined w.r.t. blocks instead of tokens. Let's discuss this in more detail in the next section. ### Implementation **Recap:** In regular BERT attention, a sequence of tokens i.e. \\\\( X = x_1, x_2, ...., x_n \\\\) is projected through a dense layer into \\\\( Q,K,V \\\\) and the attention score \\\\( Z \\\\) is calculated as \\\\( Z=Softmax(QK^T) \\\\). In the case of BigBird block sparse attention, the same algorithm is used but only with some selected query & key vectors. Let's have a look at how BigBird block sparse attention is implemented. To begin with, let's assume \\\\(b, r, s, g\\\\) represent `block_size`, `num_random_blocks`, `num_sliding_blocks`, `num_global_blocks`, respectively. Visually, we can illustrate the components of BigBird's block sparse attention with \\\\(b=4, r=1, g=2, s=3, d=5\\\\) as follows: Attention scores for \\\\({q}_{1}, {q}_{2}, {q}_{3:n-2}, {q}_{n-1}, {q}_{n}\\\\) are calculated separately as described below: --- The attention score for \\\\(\\mathbf{q}_{1}\\\\), represented by \\\\(a_1\\\\) where \\\\(a_1=Softmax(q_1 * K^T)\\\\), is nothing but the attention score between all the tokens in the 1st block and all the other tokens in the sequence.  \\\\(q_1\\\\) represents the 1st block, \\\\(g_i\\\\) represents the \\\\(i\\\\)-th block. We are simply performing a normal attention operation between \\\\(q_1\\\\) & \\\\(g\\\\) (i.e. all the keys). --- For calculating the attention score for tokens in the second block, we gather the first three blocks, the last block, and the fifth block. Then we can compute \\\\(a_2 = Softmax(q_2 * concat(k_1, k_2, k_3, k_5, k_7))\\\\).  *I am labeling tokens by \\\\(g, r, s\\\\) just to make their nature explicit (i.e. showing global, random, sliding tokens), otherwise they are just \\\\(k\\\\).* --- For calculating the attention score for \\\\({q}_{3:n-2}\\\\), we will gather global, sliding, and random keys & will compute the normal attention operation over \\\\({q}_{3:n-2}\\\\) and the gathered keys. Note that sliding keys are gathered using the special shifting trick as discussed earlier in the sliding attention section.  --- For calculating the attention score for tokens in the second-to-last block (i.e. \\\\({q}_{n-1}\\\\)), we gather the first block, the last three blocks, and the third block. Then we can apply the formula \\\\({a}_{n-1} = Softmax({q}_{n-1} * concat(k_1, k_3, k_5, k_6, k_7))\\\\). This is very similar to what we did for \\\\(q_2\\\\).  --- The attention score for \\\\(\\mathbf{q}_{n}\\\\) is represented by \\\\(a_n\\\\) where \\\\(a_n=Softmax(q_n * K^T)\\\\), and is nothing but the attention score between all the tokens in the last block and all the other tokens in the sequence. This is very similar to what we did for \\\\( q_1 \\\\).  --- Let's combine the above matrices to get the final attention matrix. This attention matrix can be used to get a representation of all the tokens. 
*`blue -> global blocks`, `red -> random blocks`, `orange -> sliding blocks` This attention matrix is just for illustration. During the forward pass, we aren't storing `white` blocks, but are computing a weighted value matrix (i.e. representation of each token) directly for each separated components as discussed above.* Now, we have covered the hardest part of block sparse attention, i.e. its implementation. Hopefully, you now have a better background to understand the actual code. Feel free to dive into it and to connect each part of the code with one of the components above. ## Time & Memory complexity | Attention Type | Sequence length | Time & Memory Complexity | |-----------------|-----------------|--------------------------| | `original_full` | 512 | `T` | | | 1024 | 4 x `T` | | | 4096 | 64 x `T` | | `block_sparse` | 1024 | 2 x `T` | | | 4096 | 8 x `T` | *Comparison of time & space complexity of BERT attention and BigBird block sparse attention.* Expand this snippet in case you wanna see the calculations ```md BigBird time complexity = O(w x n + r x n + g x n) BERT time complexity = O(n^2) Assumptions: w = 3 x 64 r = 3 x 64 g = 2 x 64 When seqlen = 512 => **time complexity in BERT = 512^2** When seqlen = 1024 => time complexity in BERT = (2 x 512)^2 => **time complexity in BERT = 4 x 512^2** => time complexity in BigBird = (8 x 64) x (2 x 512) => **time complexity in BigBird = 2 x 512^2** When seqlen = 4096 => time complexity in BERT = (8 x 512)^2 => **time complexity in BERT = 64 x 512^2** => compute in BigBird = (8 x 64) x (8 x 512) => compute in BigBird = 8 x (512 x 512) => **time complexity in BigBird = 8 x 512^2** ``` ## ITC vs ETC The BigBird model can be trained using 2 different strategies: **ITC** & **ETC**. ITC (internal transformer construction) is simply what we discussed above. In ETC (extended transformer construction), some additional tokens are made global such that they will attend to / will be attended by all tokens. ITC requires less compute since very few tokens are global while at the same time the model can capture sufficient global information (also with the help of random attention). On the other hand, ETC can be very helpful for tasks in which we need a lot of global tokens such as `question-answering for which the entire question should be attended to globally by the context to be able to relate the context correctly to the question. ***Note:** It is shown in the Big Bird paper that in many ETC experiments, the number of random blocks is set to 0. 
This is reasonable given our discussions above in the graph section.* The table below summarizes ITC & ETC: | | ITC | ETC | |----------------------------------------------|---------------------------------------|--------------------------------------| | Attention Matrix with global attention | \\\\( A = \\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\\\ 1 & & & & & & 1 \\\\ 1 & & & & & & 1 \\\\ 1 & & & & & & 1 \\\\ 1 & & & & & & 1 \\\\ 1 & & & & & & 1 \\\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\end{bmatrix} \\\\) | \\\\( B = \\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\\\ 1 & 1 & 1 & & & & & & 1 \\\\ 1 & 1 & 1 & & & & & & 1 \\\\ 1 & 1 & 1 & & & & & & 1 \\\\ 1 & 1 & 1 & & & & & & 1 \\\\ 1 & 1 & 1 & & & & & & 1 \\\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\end{bmatrix} \\\\) | | `global_tokens` | 2 x `block_size` | `extra_tokens` + 2 x `block_size` | | `random_tokens` | `num_random_blocks` x `block_size` | `num_random_blocks` x `block_size` | | `sliding_tokens` | 3 x `block_size` | 3 x `block_size` | ## Using BigBird with Transformers You can use `BigBirdModel` just like any other model. Let's see some code below: ```python from transformers import BigBirdModel # loading bigbird from its pretrained checkpoint model = BigBirdModel.from_pretrained(\"google/bigbird-roberta-base\") # This will init the model with default configuration i.e. attention_type = \"block_sparse\" num_random_blocks = 3, block_size = 64. # But You can freely change these arguments with any checkpoint. These 3 arguments will just change the number of tokens each query token is going to attend. model = BigBirdModel.from_pretrained(\"google/bigbird-roberta-base\", num_random_blocks=2, block_size=16) # By setting attention_type to `original_full`, BigBird will be relying on the full attention of n^2 complexity. This way BigBird is 99.9 % similar to BERT. model = BigBirdModel.from_pretrained(\"google/bigbird-roberta-base\", attention_type=\"original_full\") ``` There are total **3 checkpoints** available in **Hub** (at the point of writing this article): [`bigbird-roberta-base`]( [`bigbird-roberta-large`]( [`bigbird-base-trivia-itc`]( The first two checkpoints come from pretraining `BigBirdForPretraining` with `masked_lm loss`; while the last one corresponds to the checkpoint after finetuning `BigBirdForQuestionAnswering` on `trivia-qa` dataset. Let's have a look at minimal code you can write (in case you like to use your PyTorch trainer), to use 's BigBird model for fine-tuning your tasks. ```python # let's consider our task to be question-answering as an example from transformers import BigBirdForQuestionAnswering, BigBirdTokenizer import torch device = torch.device(\"cpu\") if torch.cuda.is_available(): device = torch.device(\"cuda\") # lets initialize bigbird model from pretrained weights with randomly initialized head on its top model = BigBirdForQuestionAnswering.from_pretrained(\"google/bigbird-roberta-base\", block_size=64, num_random_blocks=3) tokenizer = BigBirdTokenizer.from_pretrained(\"google/bigbird-roberta-base\") model.to(device) dataset = \"torch.utils.data.DataLoader object\" optimizer = \"torch.optim object\" epochs = ... 
# very minimal training loop for e in range(epochs): for batch in dataset: model.train() batch = {k: batch[k].to(device) for k in batch} # forward pass output = model(**batch) # back-propagation output[\"loss\"].backward() optimizer.step() optimizer.zero_grad() # let's save final weights in a local directory model.save_pretrained(\"\") # let's push our weights to Hub from huggingface_hub import ModelHubMixin ModelHubMixin.push_to_hub(\"\", model_id=\"\") # using finetuned model for inference question = [\"How are you doing?\", \"How is life going?\"] context = [\"\", \"\"] batch = tokenizer(question, context, return_tensors=\"pt\") batch = {k: batch[k].to(device) for k in batch} model = BigBirdForQuestionAnswering.from_pretrained(\"\") model.to(device) with torch.no_grad(): start_logits, end_logits = model(**batch).to_tuple() # now decode start_logits, end_logits with whatever strategy you want. # Note: # This was very minimal code (in case you want to use raw PyTorch) just for showing how BigBird can be used very easily # I would suggest using Trainer to have access to a lot of features ``` It's important to keep the following points in mind while working with BigBird: * Sequence length must be a multiple of block size i.e. `seqlen % block_size == 0`. You need not worry since Transformers will automatically pad the input (to the smallest multiple of block size which is greater than the sequence length) if the batch sequence length is not a multiple of `block_size`. * Currently, the HuggingFace version **doesn't support ETC** and hence only the 1st & last blocks will be global. * The current implementation doesn't support `num_random_blocks = 0`. * It's recommended by the authors to set `attention_type = \"original_full\"` when the sequence length is < `global_tokens + random_tokens + sliding_tokens + buffer_tokens`, where `global_tokens = 2 x block_size`, `sliding_tokens = 3 x block_size`, `random_tokens = num_random_blocks x block_size` & `buffer_tokens = num_random_blocks x block_size`. In case you fail to do that, Transformers will automatically switch `attention_type` to `original_full` with a warning. * When using BigBird as a decoder (or using `BigBirdForCausalLM`), `attention_type` should be `original_full`. But you need not worry, Transformers will automatically switch `attention_type` to `original_full` in case you forget to do that. ## What's next? [@patrickvonplaten]( has made a really cool [notebook]( on how to evaluate `BigBirdForQuestionAnswering` on the `trivia-qa` dataset. Feel free to play with BigBird using that notebook. You will soon find a **BigBird Pegasus-like** model in the library for **long document summarization**. ## End Notes The original implementation of the **block sparse attention matrix** can be found [here]( You can find the Transformers version [here]("}
{"title": "blip-2.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Zero-shot image-to-text generation with BLIP-2\" thumbnail: /blog/assets/blip-2/thumbnail.png authors: - user: MariaK - user: JunnanLi --- # Zero-shot image-to-text generation with BLIP-2 This guide introduces [BLIP-2]( from Salesforce Research that enables a suite of state-of-the-art visual-language models that are now available in [ Transformers]( We'll show you how to use it for image captioning, prompted image captioning, visual question-answering, and chat-based prompting. ## Table of contents 1. [Introduction](#introduction) 2. [What's under the hood in BLIP-2?](#whats-under-the-hood-in-blip-2) 3. [Using BLIP-2 with Hugging Face Transformers](#using-blip-2-with-hugging-face-transformers) 1. [Image Captioning](#image-captioning) 2. [Prompted image captioning](#prompted-image-captioning) 3. [Visual question answering](#visual-question-answering) 4. [Chat-based prompting](#chat-based-prompting) 4. [Conclusion](#conclusion) 5. [Acknowledgments](#acknowledgments) ## Introduction Recent years have seen rapid advancements in computer vision and natural language processing. Still, many real-world problems are inherently multimodal - they involve several distinct forms of data, such as images and text. Visual-language models face the challenge of combining modalities so that they can open the door to a wide range of applications. Some of the image-to-text tasks that visual language models can tackle include image captioning, image-text retrieval, and visual question answering. Image captioning can aid the visually impaired, create useful product descriptions, identify inappropriate content beyond text, and more. Image-text retrieval can be applied in multimodal search, as well as in applications such as autonomous driving. Visual question-answering can aid in education, enable multimodal chatbots, and assist in various domain-specific information retrieval applications. Modern computer vision and natural language models have become more capable; however, they have also significantly grown in size compared to their predecessors. While pre-training a single-modality model is resource-consuming and expensive, the cost of end-to-end vision-and-language pre-training has become increasingly prohibitive. [BLIP-2]( tackles this challenge by introducing a new visual-language pre-training paradigm that can potentially leverage any combination of pre-trained vision encoder and LLM without having to pre-train the whole architecture end to end. This enables achieving state-of-the-art results on multiple visual-language tasks while significantly reducing the number of trainable parameters and pre-training costs. Moreover, this approach paves the way for a multimodal ChatGPT-like model. ## What's under the hood in BLIP-2? BLIP-2 bridges the modality gap between vision and language models by adding a lightweight Querying Transformer (Q-Former) between an off-the-shelf frozen pre-trained image encoder and a frozen large language model. Q-Former is the only trainable part of BLIP-2; both the image encoder and language model remain frozen. 
Q-Former is a transformer model that consists of two submodules that share the same self-attention layers: * an image transformer that interacts with the frozen image encoder for visual feature extraction * a text transformer that can function as both a text encoder and a text decoder The image transformer extracts a fixed number of output features from the image encoder, independent of input image resolution, and receives learnable query embeddings as input. The queries can additionally interact with the text through the same self-attention layers. Q-Former is pre-trained in two stages. In the first stage, the image encoder is frozen, and Q-Former is trained with three losses: * Image-text contrastive loss: pairwise similarity between each query output and text output's CLS token is calculated, and the highest one is picked. Query embeddings and text don't \u201csee\u201d each other. * Image-grounded text generation: queries can attend to each other but not to the text tokens, and text has a causal mask and can attend to all of the queries. * Image-text matching loss: queries and text can see others, and a logit is obtained to indicate whether the text matches the image or not. To obtain negative examples, hard negative mining is used. In the second pre-training stage, the query embeddings now have the relevant visual information to the text as it has passed through an information bottleneck. These embeddings are now used as a visual prefix to the input to the LLM. This pre-training phase effectively involves an image-ground text generation task using the causal LM loss. As a visual encoder, BLIP-2 uses ViT, and for an LLM, the paper authors used OPT and Flan T5 models. You can find pre-trained checkpoints for both OPT and Flan T5 on [Hugging Face Hub]( However, as mentioned before, the introduced pre-training approach allows combining any visual backbone with any LLM. ## Using BLIP-2 with Hugging Face Transformers Using Hugging Face Transformers, you can easily download and run a pre-trained BLIP-2 model on your images. Make sure to use a GPU environment with high RAM if you'd like to follow along with the examples in this blog post. Let's start by installing Transformers. As this model has been added to Transformers very recently, we need to install Transformers from the source: ```bash pip install git+ ``` Next, we'll need an input image. Every week The New Yorker runs a [cartoon captioning contest]( among its readers, so let's take one of these cartoons to put BLIP-2 to the test. ``` import requests from PIL import Image url = ' image = Image.open(requests.get(url, stream=True).raw).convert('RGB') display(image.resize((596, 437))) ``` We have an input image. Now we need a pre-trained BLIP-2 model and corresponding preprocessor to prepare the inputs. You can find the list of all available pre-trained checkpoints on [Hugging Face Hub]( Here, we'll load a BLIP-2 checkpoint that leverages the pre-trained OPT model by Meta AI, which has 2.7 billion parameters. ``` from transformers import AutoProcessor, Blip2ForConditionalGeneration import torch processor = AutoProcessor.from_pretrained(\"Salesforce/blip2-opt-2.7b\") model = Blip2ForConditionalGeneration.from_pretrained(\"Salesforce/blip2-opt-2.7b\", torch_dtype=torch.float16) ``` Notice that BLIP-2 is a rare case where you cannot load the model with Auto API (e.g. AutoModelForXXX), and you need to explicitly use `Blip2ForConditionalGeneration`. 
However, you can use `AutoProcessor` to fetch the appropriate processor class - `Blip2Processor` in this case. Let's use GPU to make text generation faster: ``` device = \"cuda\" if torch.cuda.is_available() else \"cpu\" model.to(device) ``` ### Image Captioning Let's find out if BLIP-2 can caption a New Yorker cartoon in a zero-shot manner. To caption an image, we do not have to provide any text prompt to the model, only the preprocessed input image. Without any text prompt, the model will start generating text from the BOS (beginning-of-sequence) token thus creating a caption. ``` inputs = processor(image, return_tensors=\"pt\").to(device, torch.float16) generated_ids = model.generate(**inputs, max_new_tokens=20) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() print(generated_text) ``` ``` \"two cartoon monsters sitting around a campfire\" ``` This is an impressively accurate description for a model that wasn't trained on New Yorker style cartoons! ### Prompted image captioning We can extend image captioning by providing a text prompt, which the model will continue given the image. ``` prompt = \"this is a cartoon of\" inputs = processor(image, text=prompt, return_tensors=\"pt\").to(device, torch.float16) generated_ids = model.generate(**inputs, max_new_tokens=20) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() print(generated_text) ``` ``` \"two monsters sitting around a campfire\" ``` ``` prompt = \"they look like they are\" inputs = processor(image, text=prompt, return_tensors=\"pt\").to(device, torch.float16) generated_ids = model.generate(**inputs, max_new_tokens=20) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() print(generated_text) ``` ``` \"having a good time\" ``` ### Visual question answering For visual question answering the prompt has to follow a specific format: \"Question: {} Answer:\" ``` prompt = \"Question: What is a dinosaur holding? Answer:\" inputs = processor(image, text=prompt, return_tensors=\"pt\").to(device, torch.float16) generated_ids = model.generate(**inputs, max_new_tokens=10) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() print(generated_text) ``` ``` \"A torch\" ``` ### Chat-based prompting Finally, we can create a ChatGPT-like interface by concatenating each generated response to the conversation. We prompt the model with some text (like \"What is a dinosaur holding?\"), the model generates an answer for it \"a torch\"), which we can concatenate to the conversation. Then we do it again, building up the context. However, make sure that the context does not exceed 512 tokens, as this is the context length of the language models used by BLIP-2 (OPT and T5). ``` context = [ (\"What is a dinosaur holding?\", \"a torch\"), (\"Where are they?\", \"In the woods.\") ] question = \"What for?\" template = \"Question: {} Answer: {}.\" prompt = \" \".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + \" Question: \" + question + \" Answer:\" print(prompt) ``` ``` Question: What is a dinosaur holding? Answer: a torch. Question: Where are they? Answer: In the woods.. Question: What for? 
Answer: ``` ``` inputs = processor(image, text=prompt, return_tensors=\"pt\").to(device, torch.float16) generated_ids = model.generate(**inputs, max_new_tokens=10) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() print(generated_text) ``` ``` To light a fire. ``` ## Conclusion BLIP-2 is a zero-shot visual-language model that can be used for multiple image-to-text tasks with image and text prompts. It is an effective and efficient approach that can be applied to image understanding in numerous scenarios, especially when examples are scarce. The model bridges the gap between vision and natural language modalities by adding a transformer between pre-trained models. The new pre-training paradigm allows this model to keep up with the advances in both individual modalities. If you'd like to learn how to fine-tune BLIP-2 models for various vision-language tasks, check out the [LAVIS library by Salesforce]( that offers comprehensive support for model training. To see BLIP-2 in action, try its demo on [Hugging Face Spaces]( ## Acknowledgments Many thanks to the Salesforce Research team for working on BLIP-2, to Niels Rogge for adding BLIP-2 to Transformers, and to Omar Sanseviero for reviewing this blog post."}
{"title": "bloom-inference-optimization.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Optimization story: Bloom inference\" thumbnail: /blog/assets/bloom-inference-pytorch-scripts/thumbnail.png authors: - user: Narsil --- Optimization story: Bloom inference This article gives you the behind-the-scenes of how we made an efficient inference server that powers bloom. inference server that powers [ We achieved a 5x latency reduction over several weeks (and 50x more throughput). We wanted to share all the struggles and epic wins we went through to achieve such speed improvements. A lot of different people were involved at many stages so not everything will be covered here. And please bear with us, some of the content might be outdated or flat out wrong because we're still learning how to optimize extremely large models and lots of new hardware features and content keep coming out regularly. If your favorite flavor of optimizations is not discussed or improperly represented, we're sorry, please share it with us we're more than happy to try out new stuff and correct our mistakes. # Creating BLOOM This goes without saying but without the large model being accessible in the first place, there would be no real reasons to optimize inference for it. This was an incredible effort led by many different people. To maximize the GPU during training, several solutions were explored and in the end, [Megatron-Deepspeed]( was chosen to train the end model. This meant that the code as-is wasn't necessarily compatible with the `transformers` library. # Porting to transformers Because of the original training code, we set out to do something which we regularly do: port an existing model to `transformers`. The goal was to extract from the training code the relevant parts and implement it within `transformers`. This effort was tackled by [Younes](/ybelkada). This is by no means a small effort as it took almost a month and [200 commits]( to get there. There are several things to note that will come back later: We needed to have smaller models [bigscience/bigscience-small-testing]( and [bigscience/bloom-560m]( This is extremely important because they are smaller, so everything is faster when working with them. First, you have to abandon all hope to have exactly the same logits at the end down to the bytes. PyTorch versions can change the kernels and introduce subtle differences, and different hardware might yield different results because of different architecture (and you probably don't want to develop on a A100 GPU all the time for cost reasons). ***Getting a good strict test suite is really important for all models*** The best test we found was having a fixed set of prompts. You know the prompt, you know the completion that needs to be deterministic so greedy. If two generations are identical, you can basically ignore small logits differences Whenever you see a drift, you need to investigate. It could be that your code is not doing what it should OR that you are actually out of domain for that model and therefore the model is more sensitive to noise. If you have several prompts and long enough prompts, you're less likely to trigger that for all prompts by accident. The more prompts the better, the longer the better. The first model (small-testing) is in `bfloat16` like the big bloom so everything should be very similar, but it wasn't trained a lot or just doesn't perform well, so it highly fluctuates in outputs. That means we had issues with those generation tests. 
The second model is more stable but was trained and saved in `float16` instead of `bfloat16`. That leaves more room for error between the two. To be perfectly fair, the `bfloat16` -> `float16` conversion seemed to be OK in inference mode (`bfloat16` mostly exists to handle large gradients, which do not exist in inference). During that step, one important tradeoff was discovered and implemented. Because BLOOM was trained in a distributed setting, part of the code was doing Tensor Parallelism on a Linear layer, meaning that running the same computation as a single operation on a single GPU was giving [different results]( This took a while to pinpoint, and either we went for 100% compliance and the model was much slower, or we would accept a small difference in generation output but get a model that was much faster to run, with simpler code. We opted for a configurable flag. # First inference (PP + Accelerate) ``` Note: Pipeline Parallelism (PP) means in this context that each GPU will own some layers so each GPU will work on a given chunk of data before handing it off to the next GPU. ``` Now that we have a workable, clean `transformers` version of the model, we can start working on running it. Bloom is a 352GB (176B parameters in bf16) model, so we need at least that much GPU RAM to make it fit. We briefly explored offloading to CPU on smaller machines but the inference speed was orders of magnitude slower so we discarded it. Then we wanted to basically use the [pipeline]( It's dogfooding, and this is what the API uses under the hood all the time. However `pipelines` are not distributed-aware (it's not their goal). After briefly discussing options, we ended up using the newly created `device_map=\"auto\"` from [accelerate]( to manage the sharding of the model. We had to iron out a few bugs, and fix the `transformers` code a bit to help `accelerate` do the right job. It works by splitting the various layers of the transformer and giving part of the model to each GPU. So GPU0 gets to work, then hands over to GPU1, and so on and so forth. In the end, with a small HTTP server on top, we could start serving bloom (the big model)!! # Starting point But we haven't even started discussing optimizations yet! We actually have quite a bit to do, and this whole process is a castle of cards. During optimizations we are going to make modifications to the underlying code; being extra sure you're not killing the model in one way or the other is really important and easier to do than you think. So we are now at the very first step of optimizations and we need to start measuring and keep measuring performance. So we need to consider what we care about. For an open inference server supporting many options, we expect users to send many queries with different parameters, and what we care about is: the number of users we can serve at the same time (throughput), and how long it takes for an average user to be served (latency). We made a testing script in [locust]( which is exactly this: ```python from locust import HttpUser, between, task from random import randrange, random class QuickstartUser(HttpUser): wait_time = between(1, 5) @task def bloom_small(self): sentence = \"Translate to chinese. EN: I like soup. CN: \" self.client.post( \"/generate\", json={ \"inputs\": sentence[: randrange(1, len(sentence))], \"parameters\": {\"max_new_tokens\": 20, \"seed\": random()}, }, ) @task def bloom_small_sampling(self): sentence = \"Translate to chinese. EN: I like soup.
CN: \" self.client.post( \"/generate\", json={ \"inputs\": sentence[: randrange(1, len(sentence))], \"parameters\": { \"max_new_tokens\": 20, \"do_sample\": True, \"top_p\": 0.9, \"seed\": random(), }, }, ) ``` **Note: This is not the best nor the only load testing we used, but it was always the first to be run so that it could compare fairly across approaches. Being the best on this benchmark does NOT mean it is the best solution. Other more complex scenarios had to be used in addition to actual real-world performance. ** We wanted to observe the ramp-up for various implementations and also make sure that underload the server properly circuit breaked. Circuit breaking means that the server can answer (fast) that it will not answer your query because too many people are trying to use it at the same time. It's extremely important to avoid the hug of death. On this benchmark the initial performance was (on 16xA100 40Go on GCP which is the machine used throughout): Requests/s : 0.3 (throughput) Latency: 350ms/token (latency) Those numbers are not that great. Before getting to work let's estimate the best we can imagine achieving. The formula for amount of operations is `24Bsh^2 + 4\ud835\udc35s^2h24Bsh^2 + 4\ud835\udc35s^2h` where `B` is the batch size, `s` the sequence length, and `h` the hidden dimension. Let's do the math and we are getting `17 TFlop` for a single forward pass. Looking at the [specs]( of A100 it claims `312 TFLOPS` for a single card. That means a single GPU could potentially run at `17 / 312 = 54ms/token`. We're using 16 of those so `3ms/token` on the overall machine. Take all these numbers with a big grain of salt, it's never possible to reach those numbers, and real-life performance rarely matches the specs. Also if computation is not your limiting factor then this is not the lowest you can get. It's just good practice to know how far you are from your target. In this case, we're 2 orders of magnitude so pretty far. Also, this estimate puts all the flops at the service of latency which means only a single request can go at a time (it's ok since you're maximizing your machine so there's not much else to be done, but we can have higher latency and get throughput back through batching much more easily). # Exploring many routes ``` Note: Tensor Parallelism (TP) means in this context that each GPU will own part of the weights, so ALL gpus are active all the time and do less work. Usually this comes with a very slight overhead that some work is duplicated and more importantly that the GPUs regularly have to communicate to each other their results to continue the computation ``` Now that we have a good understanding of where we stand it's time to get to work. We tried many different things based on the people and our various knowledge. ALL endeavors deserve their own blog post so I'll just list them, explain the few final learnings and delve into the details of only what went into the current server. Moving from Pipeline Parallelism (PP) to Tensor Parallelism (TP) is one big interesting change for latency. Each GPU will own part of the parameters and all will be working at the same time. So the latency should decrease drastically but the price to pay is the communication overhead since they regularly need to communicate with each other about their results. It is to note that this is a very wide range of approaches and the intent was deliberately to learn more about each tool and how it could fit in later endeavors. 
## Porting the code to JAX/Flax to run on TPUs - Expected to make it easier to choose the type of parallelism, so TP should be easier to test. It's one of the perks of Jax's design. - More constrained on hardware: performance on TPU is likely superior to GPU, but there is less vendor choice for TPUs. - Cons: another port is needed. But it would be welcome anyway in our libs. Results: - Porting was not an easy task as some conditions and kernels were hard to reproduce correctly enough. Still manageable though. - Parallelism was quite easy to get once ported. Kudos to Jax, the claim holds up. - Ray/communicating with TPU workers proved to be a real pain for us. We don't know if it's the tool, the network, or simply our lack of knowledge, but it slowed down experiments and work much more than we anticipated. We would launch an experiment that takes 5mn to run, wait for 5mn, and nothing had happened; 10mn later still nothing; it turned out some worker was down/not responding, so we had to manually get in, figure out what went on, fix it, restart something, and relaunch, and we had just lost half an hour. Repeat that enough times, and lost days add up quickly. Let's emphasize that this is not necessarily a critique of the tools we used, but the subjective experience we had remains. - No control over compilation. Once we had the thing running, we tried several settings to figure out which suited best the inference we had in mind, and it turned out it was really hard to guess from the settings what would happen to latency/throughput. For instance, we had a 0.3 RPS at batch_size=1 (so every request/user is on its own) with a latency of 15ms/token (do not compare this too much with the other numbers in this article; it's on a different machine with a very different profile), which is great, but the overall throughput is not much better than what we had with the old code. So we decided to add batching, and with BS=2 the latency went up 5-fold, with only 2 times the throughput... Upon further investigation, it turned out that up to batch_size=16 every batch_size had the same latency profile. So we could have 16x more throughput at a 5x latency cost. Not bad, but looking at the numbers we really would have preferred a more fine-grained control. The numbers we were aiming for stem from the [100ms, 1s, 10s, 1mn]( rule. ## Using ONNX/TRT or other compiled approaches - They are supposed to handle most of the optimization work. - Con: usually parallelism needs to be handled manually. Results: - Turned out that to be able to trace/jit/export stuff we needed to rework part of the PyTorch code, so it easily fused with the pure PyTorch approach. And overall we figured out that we could have most of the optimizations we desired by staying within the PyTorch world, enabling us to keep flexibility without having to make too much coding effort. Another thing to note: since we're running on GPU and text-generation has many forward passes going on, we need the tensors to stay on the GPU, and it is sometimes hard to send your tensors to some lib, be given back the result, perform the logits computation (like argmax or sampling) and feed it back again. Putting the loop within the external lib means losing flexibility, just like Jax, so it was not envisioned in our use case. ## DeepSpeed - This is the technology that powered training, so it seemed only fair to use it for inference. - Cons: it was never used/prepared for inference before. Results: - We had really impressive results fast, which are roughly the same as the last iteration we are currently running.
- We had to invent a way to put a webserver (so dealing with concurrency) on top of DeepSpeed which also has several processes (one for each GPU). Since there is an excellent library [Mii]( It doesn't fit the extremely flexible goals we had in mind, but we probably would have started working on top of it now. (The current solution is discussed later). - The biggest caveat we encountered with DeepSpeed, was the lack of stability. We had issues when running it on CUDA 11.4 where the code was built for 11.6 And the long-standing issue we could never really fix is that there would be regular kernel crashes (Cuda illegal access, dimensions mismatch, etc..). We fixed a bunch of these but we could never quite achieve stability under stress of our webserver. Despite, that I want to shout out to the Microsoft folks that helped us, we had a really good conversation that improved our understanding of what was happening, and gave us real insights to do some follow-up works. - One of the pain points I feel is that our team is mostly in Europe, while Microsoft is in California, so the collaboration was tricky timewise and we probably lost a big chunk of time because of it. This has nothing to do with the technical part, but it's good to acknowledge that the organizational part of working together is also really important. - Another thing to note, is that DeepSpeed relies on `transformers` to inject its optimization, and since we were updating our code pretty much consistently it made it hard for the DeepSpeed team to keep things working on our `main` branch. We're sorry to have made it hard, I guess this is why it's called bleeding edge. ## Webserver ideas - Given that we are going to run a free server where users are going to send long text, short text, want a few tokens, or a whole recipe each with different parameters, something had to be done here. Results: - We recoded everything in `Rust` with the excellent bindings [tch-rs]( Rust was not aimed at having performance gains but just much more fine-grained control over parallelism (threads/processes) and playing more fine-grained on the webserver concurrency and the PyTorch one. Python is infamously hard to handle low-level details thanks to the [GIL]( - Turned out that most of the pain came from the port, and after that, the experimentation was a breeze. And we figured that with enough control over the loops we could have great performance for everyone even in the context of a very wide array of requests with different properties. [Code]( for the curious, but it doesn't come with any support or nice docs. - It became production for a few weeks because it was more lenient on the parallelism, we could use the GPUs more efficiently (using GPU0 for request 1 while GPU1 is treating request 0). and we went from 0.3 RPS to ~2.5 RPS with the same latency. The optimal case would have been to increase throughput by 16X but the numbers shown here are real workloads measurements so this is not too bad. ## Pure PyTorch - Purely modify the existing code to make it faster by removing operations like `reshape`, using better-optimized kernels so on and so forth. - Con, we have to code TP ourselves and we have a constraint that the code still fits our library (mostly). Results - Next chapter. 
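Before diving into the final route, here is a small self-contained sketch (our own simplification, simulated on a single device rather than with `torch.distributed`) of what Tensor Parallelism on a Linear layer means in practice: each shard owns a slice of the weight matrix, and summing the partial results, which is the role played by `ncclAllReduce` across GPUs, recovers the full output.

```python
# minimal single-process simulation of Tensor Parallelism on a Linear layer
# (our own sketch; the real server shards across GPUs with torch.distributed/NCCL)
import torch

torch.manual_seed(0)
in_features, out_features, n_shards = 16, 8, 4

x = torch.randn(2, in_features)            # a batch of 2 activations
weight = torch.randn(out_features, in_features)

# reference: the full matmul on a single device
full = x @ weight.T

# "row parallel" sharding: each shard owns a slice of the input dimension
x_shards = x.chunk(n_shards, dim=1)
w_shards = weight.chunk(n_shards, dim=1)
partials = [xs @ ws.T for xs, ws in zip(x_shards, w_shards)]

# the all-reduce step: summing the partial results recovers the full output
sharded = torch.stack(partials).sum(dim=0)
print(torch.allclose(full, sharded, atol=1e-5))  # True
```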
# Final route: PyTorch + TP + 1 custom kernel + torch.jit.script ## Writing more efficient PyTorch The first item on the list was removing unnecessary operations in the first implementations Some can be seen by just looking at the code and figuring out obvious flaws: - Alibi is used in Bloom to add position embeddings and it was calculated in too many places, we could only calculate it once and more efficiently. The old code: [link]( The new code: [link]( This is a 10x speedup and the latest version includes padding too! Since this step is only computed once, the actual speed is not important but overall reducing the number of operations and tensor creation is a good direction. Other parts come out more clearly when you start [profiling]( and we used quite extensively the [tensorboard extension]( This provides this sort of image which give insights: Attention takes a lot of time, careful this is a CPU view so the long bars don't mean long, they mean the CPU is awaiting the GPU results of the previous step. We see many `cat` operations before `baddbmm`. Removing a lot of reshape/transpose, for instance, we figured out that: - The attention is the hot path (it's expected but always good to verify). - In the attention, a lot of kernels were actual copies due to the massive amount of reshapes - We **could** remove the reshapes by reworking the weights themselves and the past. This is a breaking change but it did improve performance quite a bit! ## Supporting TP Ok, we have removed most of the low-hanging fruits now we went roughly from 350ms/token latency to 300ms/token in PP. That's a 15% reduction in latency, but it actually provided more than that, but we were not extremely rigorous in our measuring initially so let's stick to that figure. Then we went on to provide a TP implementation. Turned out to be much faster than we anticipated the implementation took half a day of a single (experienced) dev. The result is [here]( We were also able to reuse code from other projects which helped. The latency went directly from 300ms/token to 91ms/token which is a huge improvement in user experience. A simple 20 tokens request went from 6s to 2s which went from a \"slow\" experience to slightly delayed. Also, the throughput went up a lot to 10RPS. The throughput comes from the fact that running a query in batch_size=1 takes the same time as batch_size=32 and throughput becomes essentially *free* in latency cost at this point. ## Low-hanging fruits Now that we had a TP implementation, we could start profiling and optimizing again. It's a significant enough shift that we had to start from scratch again. The first thing that stood out, is that synchronization (ncclAllReduce) starts to become a preponderant part of the load, which is expected, this is the synchronization part and it **is** taking some time. We never tried to look and optimize this as it's already using `nccl` but there might still be some room for improvement there. We assumed it would be hard to do much better. The second thing is that `Gelu` operator was launching many elementwise kernels and overall it was taking a bigger share of compute than we expected. 
We made the change from: ```python def bloom_gelu_forward(x): return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))) ``` to ```python @torch.jit.script def bloom_gelu_forward(x): return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))) ``` This transforms the operations from multiple small element-wise kernels (and hence tensor copies) to a single kernel operation! This provided a 10% latency improvement from 91ms/token to 81ms/token, right there! Be careful though, this is not some magic black box you can just throw everywhere, the kernel fusion will not necessarily happen or the previously used operations are already extremely efficient. Places where we found it worked well: - You have a lot of small/elementwise operations - You have a hotspot with a few hard-to-remove reshape, copies in general - When the fusion happens. ## Epic fail We also had some points, during our testing periods, where we ended up seeing some consistent 25% lower latency for the Rust server compared to the Python one. This was rather odd, but because it was consistently measured, and because removing kernels provided a speed up, we were under the impression that maybe dropping the Python overhead could provide a nice boost. We started a 3-day job to reimplement the necessary parts of `torch.distributed` To get up and running in the Rust world [nccl-rs]( We had the version working but something was off in the generations compared to its Python counterpart. During the investigation of the issues, we figured... **that we had forgotten to remove the profiler in the Pytorch measurements**... That was the epic fail because removing it gave us back the 25% and then both codes ran just as fast. This is what we initially expected, that python mustn't be a performance hit, since it's mostly running torch cpp's code. In the end, 3 days is not the end of the world, and it might become useful sometime in the future but still pretty bad. This is quite common when doing optimizations to do wrong or misrepresentative measurements which end up being disappointing or even detrimental to the overall product. This is why doing it in small steps and having expectations about the outcome as soon as possible helps contain that risk. Another place where we had to be extra careful, was the initial forward pass (without past) and the later forward passes (with past). If you optimize the first one, you're most certainly going to be slowing down the later ones which are much more important and account for most of the runtime. Another pretty common culprit is measuring times which are CPU times, and not actual CUDA times, so you need to `torch.cuda.synchronize()` when doing runs to be sure that the kernels complete. ## Custom kernel So far, we had achieved close to DeepSpeed performance without any custom code outside of PyTorch! Pretty neat. We also didn't have to make any compromise on the flexibility of the run time batch size! But given the DeepSpeed experience, we wanted to try and write a custom kernel to fuse a few operations in the hot path where `torch.jit.script` wasn't able to do it for us. Essentially the following two lines: ```python attn_weights = attention_scores.masked_fill_(attention_mask, torch.finfo(attention_scores.dtype).min) attention_probs = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(input_dtype) ``` The first masked fill is creating a new tensor, which is here only to say to the softmax operator to ignore those values. 
Also, the softmax needs to be calculated in float32 (for stability), but within a custom kernel we could limit the amount of upcasting necessary, so we limited it to the actual sums and accumulations needed. The code can be found [here]( Keep in mind we had a single GPU architecture to target so we could focus on this, and we are not experts (yet) at writing kernels, so there could be better ways to do this. This custom kernel provided yet another 10% latency improvement, moving from 81ms/token down to 71ms/token, all the while keeping our flexibility. After that, we investigated and explored other things like fusing more operators, removing other reshapes, or putting them in other places. But no attempt ever made a significant enough impact to make it into the final versions. ## Webserver part Just like the Rust counterpart, we had to implement the batching of requests with different parameters. Since we were in the `PyTorch` world, we had pretty much full control of what was going on. Since we're in Python, we have the limiting factor that `torch.distributed` needs to run on several processes instead of threads, which means it's slightly harder to communicate between processes. In the end, we opted to communicate raw strings over a Redis pub/sub to distribute the requests to all processes at once. Since we are in different processes, it's easier to do it that way than communicating tensors (which are way bigger), for instance. Then we had to drop the use of [generate]( since this applies the same parameters to all members of the batch, and we actually want to apply a different set of parameters to each one. Thankfully, we can reuse lower-level items like the [LogitsProcessor]( to save us a lot of work, so we reconstructed a `generate` function that takes a list of parameters and applies them to each member of the batch. Another really important aspect of the final UX is latency. Since we have different parameter sets for different requests, we might have 1 request for 20 tokens and another for 250 tokens. At 75ms/token latency, one request takes 1.5s and the other 18s. If we were batching all the way, we would be making the user who asked for 20 tokens wait the full 18s, making it appear to them as if we were running at 900ms/token, which is quite slow! Since we're in a PyTorch world with extreme flexibility, what we can do instead is extract the first request from the batch as soon as we have generated its first 20 tokens, and return it to that user within the expected 1.5s! We also happen to save 230 tokens' worth of computation. So flexibility **is** important to get the best possible latency out there. # Last notes and crazy ideas Optimization is a never-ending job, and like any other project, 20% of work will usually yield 80% of the results. At some point, we started having a small testing strategy to figure out the potential yield of an idea we had, and if the tests didn't show significant results then we discarded the idea. 1 day for a 10% improvement is valuable enough, 2 weeks for 10x is valuable enough. 2 weeks for 10% is not so interesting. ## Have you tried ...? Stuff we know exists and haven't used for various reasons. It could be that it felt like it wasn't adapted to our use case, it was too much work, the yields weren't promising enough, or even simply that we had too many options to try out and discarded some for no particular reason, just lack of time. The following are in no particular order: - [Cuda graphs]( - [nvFuser]( (This is what powers `torch.jit.script` so we did use it.)
- [FasterTransformer]( - [Nvidia's Triton]( - [XLA]( (JAX uses XLA too!) - [torch.fx]( - [TensorRT]( Please feel free to reach out if your favorite tool is missing from here or if you think we missed out on something important that could prove useful! ## [Flash attention]( We briefly looked at integrating flash attention, and while it performs extremely well on the first forward pass (without `past_key_values`), it didn't yield as big an improvement when running with `past_key_values`. Since we would have needed to adapt it to include the `alibi` tensor in the calculation, we decided not to do the work (at least not yet). ## [OpenAI Triton]( [Triton]( is a great framework for building custom kernels in Python. We would like to use it more but we haven't so far. We would be eager to see if it performs better than our CUDA kernel. Writing directly in CUDA seemed like the shortest path for our goal when we considered our options for that part. ## Padding and Reshapes As mentioned throughout this article, every tensor copy has a cost, and another hidden cost of running in production is padding. When two queries come in with very different lengths, you have to pad (use a dummy token) to make them fit a rectangular batch. This potentially leads to a lot of unnecessary calculations. [More information]( Ideally, we would be able to *not* do those calculations at all, and never have reshapes. TensorFlow has the concept of [RaggedTensor]( and PyTorch has [Nested tensors]( Both of these seem less streamlined than regular tensors but might enable us to do less computation, which is always a win. In an ideal world, the entire inference would be written in CUDA or as a pure GPU implementation. Considering the performance improvements we got whenever we could fuse operations, this looks desirable. But to what extent it would deliver, we have no idea. If smarter GPU people have ideas, we are listening! # Acknowledgments All this work is the result of the collaboration of many HF team members. In no particular order, [@ThomasWang]( [@stas]( [@Nouamane]( [@Suraj]( [@Sanchit]( [@Patrick]( [@Younes](/ybelkada) [@Sylvain]( [@Jeff (Microsoft)]( [@Reza]( And all of the [BigScience]( organization."}
{"title": "bloom-inference-pytorch-scripts.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate\" thumbnail: /blog/assets/bloom-inference-pytorch-scripts/thumbnail.png authors: - user: stas - user: sgugger --- Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate This article shows how to get an incredibly fast per token throughput when generating with the 176B parameter [BLOOM model]( As the model needs 352GB in bf16 (bfloat16) weights (`176*2`), the most efficient set-up is 8x80GB A100 GPUs. Also 2x8x40GB A100s or 2x8x48GB A6000 can be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well. For example, 24x32GB V100s can be used. Using a single node will typically deliver a fastest throughput since most of the time intra-node GPU linking hardware is faster than inter-node one, but it's not always the case. If you don't have that much hardware, it's still possible to run BLOOM inference on smaller GPUs, by using CPU or NVMe offload, but of course, the generation time will be much slower. We are also going to cover the [8bit quantized solutions]( which require half the GPU memory at the cost of slightly slower throughput. We will discuss [BitsAndBytes]( and [Deepspeed-Inference]( libraries there. ## Benchmarks Without any further delay let's show some numbers. For the sake of consistency, unless stated differently, the benchmarks in this article were all done on the same 8x80GB A100 node w/ 512GB of CPU memory on [Jean Zay HPC]( The JeanZay HPC users enjoy a very fast IO of about 3GB/s read speed (GPFS). This is important for checkpoint loading time. A slow disc will result in slow loading time. Especially since we are concurrently doing IO in multiple processes. All benchmarks are doing [greedy generation]( of 100 token outputs: ``` Generate args {'max_length': 100, 'do_sample': False} ``` The input prompt is comprised of just a few tokens. The previous token caching is on as well, as it'd be quite slow to recalculate them all the time. First, let's have a quick look at how long did it take to get ready to generate - i.e. how long did it take to load and prepare the model: | project | secs | | :---------------------- | :--- | | accelerate | 121 | | ds-inference shard-int8 | 61 | | ds-inference shard-fp16 | 60 | | ds-inference unsharded | 662 | | ds-zero | 462 | Deepspeed-Inference comes with pre-sharded weight repositories and there the loading takes about 1 minuted. Accelerate's loading time is excellent as well - at just about 2 minutes. The other solutions are much slower here. The loading time may or may not be of importance, since once loaded you can continually generate tokens again and again without an additional loading overhead. Next the most important benchmark of token generation throughput. The throughput metric here is a simple - how long did it take to generate 100 new tokens divided by 100 and the batch size (i.e. divided by the total number of generated tokens). 
Here is the throughput in msecs on 8x80GB GPUs: | project \\ bs | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | | :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | :--- | :--- | | accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | oom | | | | | accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | oom | | | | | ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | oom | | | ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | oom | | ds-zero bf16 | 283 | 34.88 | oom | | | | | | where OOM == Out of Memory condition where the batch size was too big to fit into GPU memory. Getting an under 1 msec throughput with Deepspeed-Inference's Tensor Parallelism (TP) and custom fused CUDA kernels! That's absolutely amazing! Though using this solution for other models that it hasn't been tried on may require some developer time to make it work. Accelerate is super fast as well. It uses a very simple approach of naive Pipeline Parallelism (PP) and because it's very simple it should work out of the box with any model. Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16, depending on whether 8 or 16 GPUs were used during the `generate` call. And, of course, it means that it can process a batch size of 64 in the case of 8x80 A100 (the table above) and thus the throughput is about 4msec - so all 3 solutions are very close to each other. Let's revisit again how these numbers were calculated. To generate 100 new tokens for a batch size of 128 took 8832 msecs in real time when using Deepspeed-Inference in fp16 mode. So now to calculate the throughput we did: walltime/(batch_size*new_tokens) or `8832/(128*100) = 0.69`. Now let's look at the power of quantized int8-based models provided by Deepspeed-Inference and BitsAndBytes, as it requires only half the original GPU memory of inference in bfloat16 or float16. Throughput in msecs 4x80GB A100: | project bs | 1 | 8 | 16 | 32 | 64 | 128 | | :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | | accelerate int8 | 284.15 | 40.14 | 21.97 | oom | | | | ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | oom | To reproduce the benchmark results simply add `--benchmark` to any of these 3 scripts discussed below. ## Solutions First checkout the demo repository: ``` git clone cd transformers-bloom-inference ``` In this article we are going to use 3 scripts located under `bloom-inference-scripts/`. The framework-specific solutions are presented in an alphabetical order: ## HuggingFace Accelerate [Accelerate]( Accelerate handles big models for inference in the following way: 1. Instantiate the model with empty weights. 2. Analyze the size of each layer and the available space on each device (GPUs, CPU) to decide where each layer should go. 3. Load the model checkpoint bit by bit and put each weight on its device It then ensures the model runs properly with hooks that transfer the inputs and outputs on the right device and that the model weights offloaded on the CPU (or even the disk) are loaded on a GPU just before the forward pass, before being offloaded again once the forward pass is finished. In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient but it does produce decent throughput despite the idling of the GPUs. 
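In code, that recipe roughly boils down to the following sketch (the actual `bloom-accelerate-inference.py` script contains more logic, e.g. a tweaked memory map, so treat the arguments here as illustrative):

```python
# Hedged sketch of Accelerate's big-model loading: instantiate on the meta device,
# then let Accelerate decide a device map and stream the weights in.
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = 'bigscience/bloom'
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():                            # step 1: no memory allocated yet
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(                 # steps 2 and 3: place and load weights
    model,
    '/path/to/local/bloom/weights',                   # local folder with the sharded checkpoint
    device_map='auto',                                # fill GPUs first, then CPU, then disk
    no_split_module_classes=['BloomBlock'],           # keep each transformer block on one device
    dtype=torch.bfloat16,
)
```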
It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload on the CPU until the RAM is full, and finally on the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM with no code changes on just 2 A100s with a throughput of 15s per token, as compared to 10 msecs on 8x80 A100s. You can learn more about this solution in the [Accelerate documentation]( ### Setup ``` pip install transformers>=4.21.3 accelerate>=0.12.0 ``` ### Run The simple execution is: ``` python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark ``` To activate the 8bit quantized solution from [BitsAndBytes]( first install `bitsandbytes`: ``` pip install bitsandbytes ``` and then add `--dtype int8` to the previous command line: ``` python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark ``` If you have more than 4 GPUs you can tell it to use only 4 with: ``` CUDA_VISIBLE_DEVICES=0,1,2,3 python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark ``` The highest batch size we were able to run without OOM was 40 in this case. If you look inside the script, we had to tweak the memory allocation map to free the first GPU to handle only activations and the previous tokens' cache. ## DeepSpeed-Inference [DeepSpeed-Inference]( uses Tensor-Parallelism and efficient fused CUDA kernels to deliver a super-fast inference of less than 1 msec per token. ### Setup ``` pip install deepspeed>=0.7.3 ``` ### Run 1. The fastest approach is to use a TP-pre-sharded (TP = Tensor Parallel) checkpoint that takes only ~1min to load, as compared to 10min for the non-pre-sharded BLOOM checkpoint: ``` deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16 ``` 1a. If you want to run the original BLOOM checkpoint, which once loaded will run at the same throughput as the previous solution, but the loading will take 10-20min: ``` deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom ``` 2a. The 8bit quantized version requires you to have only half the GPU memory of the normal half precision version: ``` deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8 ``` Here we used `microsoft/bloom-deepspeed-inference-int8` and also told the script to run in `int8`. And of course, just 4x80GB A100 GPUs is now sufficient: ``` deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8 ``` The highest batch size we were able to run without OOM was 128 in this case. You can see two factors at play leading to better performance here. 1. The throughput here was improved by using Tensor Parallelism (TP) instead of the Pipeline Parallelism (PP) of Accelerate. Because Accelerate is meant to be very generic, it is also unfortunately hard to maximize the GPU usage. All computations are done first on GPU 0, then on GPU 1, etc. until the last GPU, which means 7 GPUs are idle all the time. DeepSpeed-Inference on the other hand uses TP, meaning it will send tensors to all GPUs, compute part of the generation on each GPU and then all GPUs communicate the results to each other, then move on to the next layer. That means all GPUs are active at once but they need to communicate much more. 2. 
DeepSpeed-Inference also uses custom CUDA kernels to avoid allocating too much memory and doing tensor copying to and from GPUs. The effect of this is lesser memory requirements and fewer kernel starts which improves the throughput and allows for bigger batch sizes leading to higher overall throughput. If you are interested in more examples you can take a look at [Accelerate GPT-J inference with DeepSpeed-Inference on GPUs]( or [Accelerate BERT inference with DeepSpeed-Inference on GPUs]( ## Deepspeed ZeRO-Inference [Deepspeed ZeRO]( uses a magical sharding approach which can take almost any model and scale it across a few or hundreds of GPUs and the do training or inference on it. ### Setup ``` pip install deepspeed ``` ### Run Note that the script currently runs the same inputs on all GPUs, but you can run a different stream on each GPU, and get `n_gpu` times faster throughput. You can't do that with Deepspeed-Inference. ``` deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark ``` Please remember that with ZeRO the user can generate multiple unique streams at the same time - and thus the overall performance should be throughput in secs/token divided by number of participating GPUs - so 8x to 16x faster depending on whether 8 or 16 GPUs were used! You can also try the offloading solutions with just one smallish GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets. CPU-Offload (1x GPUs): ``` deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark ``` NVMe-Offload (1x GPUs): ``` deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark ``` make sure to adjust `/path/to/nvme_offload` to somewhere you have ~400GB of free memory on a fast NVMe drive. ## Additional Client and Server Solutions At [transformers-bloom-inference]( you will find more very efficient solutions, including server solutions. Here are some previews. Server solutions: * [Mayank Mishra]( took all the demo scripts discussed in this blog post and turned them into a webserver package, which you can download from [here]( * [Nicolas Patry]( has developed a super-efficient [Rust-based webserver solution](( More client-side solutions: * [Thomas Wang]( is developing a very fast [custom CUDA kernel BLOOM model]( * The JAX team @HuggingFace has developed a [JAX-based solution]( As this blog post is likely to become outdated if you read this months after it was published please use [transformers-bloom-inference]( to find the most up-to-date solutions. ## Blog credits Huge thanks to the following kind folks who asked good questions and helped improve the readability of the article: Olatunji Ruwase and Philipp Schmid."}
{"title": "bloom-megatron-deepspeed.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"The Technology Behind BLOOM Training\" thumbnail: /blog/assets/86_bloom_megatron_deepspeed/thumbnail.png authors: - user: stas --- The Technology Behind BLOOM Training In recent years, training ever larger language models has become the norm. While the issues of those models' not being released for further study is frequently discussed, the hidden knowledge about how to train such models rarely gets any attention. This article aims to change this by shedding some light on the technology and engineering behind training such models both in terms of hardware and software on the example of the 176B parameter language model [BLOOM]( But first we would like to thank the companies and key people and groups that made the amazing feat of training a 176 Billion parameter model by a small group of dedicated people possible. Then the hardware setup and main technological components will be discussed.  Here's a quick summary of project: | | | | :----- | :------------- | | Hardware | 384 80GB A100 GPUs | | Software | Megatron-DeepSpeed | | Architecture | GPT3 w/ extras | | Dataset | 350B tokens of 59 Languages | | Training time | 3.5 months | ## People The project was conceived by Thomas Wolf (co-founder and CSO - Hugging Face), who dared to compete with the huge corporations not only to train one of the largest multilingual models, but also to make the final result accessible to all people, thus making what was but a dream to most people a reality. This article focuses specifically on the engineering side of the training of the model. The most important part of the technology behind BLOOM were the people and companies who shared their expertise and helped us with coding and training. There are 6 main groups of people to thank: 1. The HuggingFace's BigScience team who dedicated more than half a dozen full time employees to figure out and run the training from inception to the finishing line and provided and paid for all the infrastructure beyond the Jean Zay's compute. 2. The Microsoft DeepSpeed team, who developed DeepSpeed and later integrated it with Megatron-LM, and whose developers spent many weeks working on the needs of the project and provided lots of awesome practical experiential advice before and during the training. 3. The NVIDIA Megatron-LM team, who developed Megatron-LM and who were super helpful answering our numerous questions and providing first class experiential advice. 4. The IDRIS / GENCI team managing the Jean Zay supercomputer, who donated to the project an insane amount of compute and great system administration support. 5. The PyTorch team who created a super powerful framework, on which the rest of the software was based, and who were very supportive to us during the preparation for the training, fixing multiple bugs and improving the usability of the PyTorch components we relied on during the training. 6. The volunteers in the BigScience Engineering workgroup It'd be very difficult to name all the amazing people who contributed to the engineering side of the project, so I will just name a few key people outside of Hugging Face who were the engineering foundation of this project for the last 14 months: Olatunji Ruwase, Deepak Narayanan, Jeff Rasley, Jared Casper, Samyam Rajbhandari and R\u00e9mi Lacroix Also we are grateful to all the companies who allowed their employees to contribute to this project. 
## Overview BLOOM's architecture is very similar to [GPT3]( with a few added improvements as will be discussed later in this article. The model was trained on [Jean Zay]( the French government-funded super computer that is managed by GENCI and installed at [IDRIS]( the national computing center for the French National Center for Scientific Research (CNRS). The compute was generously donated to the project by GENCI (grant 2021-A0101012475). The following hardware was used during the training: - GPUs: 384 NVIDIA A100 80GB GPUs (48 nodes) + 32 spare gpus - 8 GPUs per node Using NVLink 4 inter-gpu connects, 4 OmniPath links - CPU: AMD EPYC 7543 32-Core Processor - CPU memory: 512GB per node - GPU memory: 640GB per node - Inter-node connect: Omni-Path Architecture (OPA) w/ non-blocking fat tree - NCCL-communications network: a fully dedicated subnet - Disc IO network: GPFS shared with other nodes and users Checkpoints: - [main checkpoints]( - each checkpoint with fp32 optim states and bf16+fp32 weights is 2.3TB - just the bf16 weights are 329GB. Datasets: - 46 Languages in 1.5TB of deduplicated massively cleaned up text, converted into 350B unique tokens - Vocabulary size of the model is 250,680 tokens - For full details please see [The BigScience Corpus A 1.6TB Composite Multilingual Dataset]( The training of the 176B BLOOM model occurred over Mar-Jul 2022 and took about 3.5 months to complete (approximately 1M compute hours). ## Megatron-DeepSpeed The 176B BLOOM model has been trained using [Megatron-DeepSpeed]( which is a combination of 2 main technologies: * [DeepSpeed]( is a deep learning optimization library that makes distributed training easy, efficient, and effective. * [Megatron-LM]( is a large, powerful transformer model framework developed by the Applied Deep Learning Research team at NVIDIA. The DeepSpeed team developed a 3D parallelism based implementation by combining ZeRO sharding and pipeline parallelism from the DeepSpeed library with Tensor Parallelism from Megatron-LM. More details about each component can be seen in the table below. Please note that the BigScience's [Megatron-DeepSpeed]( is a fork of the original [Megatron-DeepSpeed]( repository, to which we added multiple additions. Here is a table of which components were provided by which framework to train BLOOM: | Component | DeepSpeed | Megatron-LM | | :---- | :---- | :---- | | [ZeRO Data Parallelism](#zero-data-parallelism) | V | | | [Tensor Parallelism](#tensor-parallelism) | | V | | [Pipeline Parallelism](#pipeline-parallelism) | V | | | [BF16Optimizer](#bf16optimizer) | V | | | [Fused CUDA Kernels](#fused-cuda-kernels) | | V | | [DataLoader](#datasets) | | V | Please note that both Megatron-LM and DeepSpeed have Pipeline Parallelism and BF16 Optimizer implementations, but we used the ones from DeepSpeed as they are integrated with ZeRO. Megatron-DeepSpeed implements 3D Parallelism to allow huge models to train in a very efficient way. Let\u2019s briefly discuss the 3D components. 1. **DataParallel (DP)** - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step. 2. **TensorParallel (TP)** - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. 
During processing each shard gets processed separately and in parallel on different GPUs and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on a horizontal level. 3. **PipelineParallel (PP)** - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes in parallel different stages of the pipeline and works on a small chunk of the batch. 4. **Zero Redundancy Optimizer (ZeRO)** - also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory. ## Data Parallelism Most users with just a few GPUs are likely to be familiar with `DistributedDataParallel` (DDP) [PyTorch documentation]( In this method the model is fully replicated to each GPU and then after each iteration all the models synchronize their states with each other. This approach allows training speed up but throwing more resources at the problem, but it only works if the model can fit onto a single GPU. ### ZeRO Data Parallelism ZeRO-powered data parallelism (ZeRO-DP) is described on the following diagram from this [blog post](  each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing. In this section we use concepts and diagrams from the [Megatron-LM]( paper: [Efficient Large-Scale Language Model Training on GPU Clusters]( The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`. Following the Megatron paper's notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix. If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:  is where one spreads groups of model layers across multiple GPUs and simply moves data along from GPU to GPU as if it were one large composite GPU. The mechanism is relatively simple - switch the desired layers `.to()` the desired devices and now whenever the data goes in and out those layers switch the data to the same device as the layer and leave the rest unmodified. This performs a vertical model parallelism, because if you remember how most models are drawn, we slice the layers vertically. For example, if the following diagram shows an 8-layer model: ``` =================== =================== | 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | =================== =================== GPU0 GPU1 ``` we just sliced it in 2 vertically, placing layers 0-3 onto GPU0 and 4-7 to GPU1. Now while data travels from layer 0 to 1, 1 to 2 and 2 to 3 this is just like the forward pass of a normal model on a single GPU. But when data needs to pass from layer 3 to layer 4 it needs to travel from GPU0 to GPU1 which introduces a communication overhead. If the participating GPUs are on the same compute node (e.g. same physical machine) this copying is pretty fast, but if the GPUs are located on different compute nodes (e.g. multiple machines) the communication overhead could be significantly larger. 
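In code, the mechanism just described boils down to something like this two-GPU sketch (purely illustrative, not the Megatron-DeepSpeed implementation):

```python
# Naive vertical model parallelism: half of the layers live on one device, half
# on another, and activations are copied across when they cross the boundary.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
for i, layer in enumerate(layers):
    layer.to('cuda:0' if i < 4 else 'cuda:1')  # layers 0-3 on GPU0, layers 4-7 on GPU1

def forward(x):
    x = x.to('cuda:0')
    for i, layer in enumerate(layers):
        if i == 4:
            x = x.to('cuda:1')  # the communication step: copy activations GPU0 -> GPU1
        x = layer(x)
    return x
```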
Then layers 4 to 5 to 6 to 7 are as a normal model would have and when the 7th layer completes we often need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be computed and the optimizer can do its work. Problems: - the main deficiency and why this one is called \"naive\" PP, is that all but one GPU is idle at any given moment. So if 4 GPUs are used, it's almost identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive PP, except the latter will complete the training faster, since it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states). - shared embeddings may need to get copied back and forth between GPUs. Pipeline Parallelism (PP) is almost identical to a naive PP described above, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. The following illustration from the [GPipe paper]( shows the naive PP on the top, and PP on the bottom:  and then it waits for other GPUs to do their work and only when their work is starting to be complete, does GPU0 start to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0). Note that conceptually this is the same concept as gradient accumulation steps (GAS). PyTorch uses `chunks`, whereas DeepSpeed refers to the same hyper-parameter as GAS. Because of the chunks, PP introduces the concept of micro-batches (MBS). DP splits the global data batch size into mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of 256 each (1024/4). And if the number of `chunks` (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each Pipeline stage works with a single micro-batch at a time. To calculate the global batch size of the DP + PP setup we then do: `mbs*chunks*dp_degree` (`8*32*4=1024`). Let's go back to the diagram. With `chunks=1` you end up with the naive PP, which is very inefficient. With a very large `chunks` value you end up with tiny micro-batch sizes which could be not very efficient either. So one has to experiment to find the value that leads to the highest efficient utilization of the GPUs. While the diagram shows that there is a bubble of \"dead\" time that can't be parallelized because the last `forward` stage has to wait for `backward` to complete the pipeline, the purpose of finding the best value for `chunks` is to enable a high concurrent GPU utilization across all participating GPUs which translates to minimizing the size of the bubble. This scheduling mechanism is known as `all forward all backward`. Some other alternatives are [one forward one backward]( and [interleaved one forward one backward]( While both Megatron-LM and DeepSpeed have their own implementation of the PP protocol, Megatron-DeepSpeed uses the DeepSpeed implementation as it's integrated with other aspects of DeepSpeed. One other important issue here is the size of the word embedding matrix. 
While normally a word embedding matrix consumes less memory than the transformer block, in our case with a huge 250k vocabulary, the embedding layer needed 7.2GB in bf16 weights and the transformer block is just 4.9GB. Therefore, we had to instruct Megatron-Deepspeed to consider the embedding layer as a transformer block. So we had a pipeline of 72 layers, 2 of which were dedicated to the embedding (first and last). This allowed to balance out the GPU memory consumption. If we didn't do it, we would have had the first and the last stages consume most of the GPU memory, and 95% of GPUs would be using much less memory and thus the training would be far from being efficient. ## DP+PP The following diagram from the DeepSpeed [pipeline tutorial]( demonstrates how one combines DP with PP. . Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP. When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1, which shards only optimizer states. ZeRO stage 2 additionally shards gradients, and stage 3 also shards the model weights. While it's theoretically possible to use ZeRO stage 2 with Pipeline Parallelism, it will have bad performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the Pipeline bubble (number of micro-batches). Therefore those communication costs are going to hurt. In addition, there are already fewer layers than normal due to PP and so the memory savings won't be huge. PP already reduces gradient size by ``1/PP``, and so gradient sharding savings on top of that are less significant than pure DP. ZeRO stage 3 can also be used to train models at this scale, however, it requires more communication than the DeepSpeed 3D parallel implementation. After careful evaluation in our environment which happened a year ago we found Megatron-DeepSpeed 3D parallelism performed best. Since then ZeRO stage 3 performance has dramatically improved and if we were to evaluate it today perhaps we would have chosen stage 3 instead. ## BF16Optimizer Training huge LLM models in FP16 is a no-no. We have proved it to ourselves by spending several months [training a 104B model]( which as you can tell from the [tensorboard]( was but a complete failure. We learned a lot of things while fighting the ever diverging lm-loss:  and we also got the same advice from the Megatron-LM and DeepSpeed teams after they trained the [530B model]( The recent release of [OPT-175B]( too reported that they had a very difficult time training in FP16. So back in January as we knew we would be training on A100s which support the BF16 format Olatunji Ruwase developed a `BF16Optimizer` which we used to train BLOOM. If you are not familiar with this data format, please have a look [at the bits layout]( The key to BF16 format is that it has the same exponent as FP32 and thus doesn't suffer from overflow FP16 suffers from a lot! With FP16, which has a max numerical range of 64k, you can only multiply small numbers. e.g. you can do `250*250=62500`, but if you were to try `255*255=65025` you got yourself an overflow, which is what causes the main problems during training. This means your weights have to remain tiny. 
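The difference is easy to see in a few lines of PyTorch (the numbers are just an illustration):

```python
# fp16 tops out at 65504, so a moderately large product already becomes inf,
# while bf16 (same 2 bytes, but an fp32-sized exponent) keeps going at reduced precision.
import torch

print(torch.finfo(torch.float16).max)   # 65504.0

a = torch.tensor(300.0, dtype=torch.float16)
print(a * a)                            # inf: 300*300=90000 overflows fp16

b = torch.tensor(300.0, dtype=torch.bfloat16)
print(b * b)                            # ~90112: imprecise, but no overflow
```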
A technique called loss scaling can help with this problem, but the limited range of FP16 is still an issue when models become very large. BF16 has no such problem, you can easily do `10_000*10_000=100_000_000` and it's no problem. Of course, since BF16 and FP16 have the same size of 2 bytes, one doesn't get a free lunch and one pays with really bad precision when using BF16. However, if you remember the training using stochastic gradient descent and its variations is a sort of stumbling walk, so if you don't get the perfect direction immediately it's no problem, you will correct yourself in the next steps. Regardless of whether one uses BF16 or FP16 there is also a copy of weights which is always in FP32 - this is what gets updated by the optimizer. So the 16-bit formats are only used for the computation, the optimizer updates the FP32 weights with full precision and then casts them into the 16-bit format for the next iteration. All PyTorch components have been updated to ensure that they perform any accumulation in FP32, so no loss happening there. One crucial issue is gradient accumulation, and it's one of the main features of pipeline parallelism as the gradients from each microbatch processing get accumulated. It's crucial to implement gradient accumulation in FP32 to keep the training precise, and this is what `BF16Optimizer` does. Besides other improvements we believe that using BF16 mixed precision training turned a potential nightmare into a relatively smooth process which can be observed from the following lm loss graph:  ## Fused CUDA Kernels The GPU performs two things. It can copy data to/from memory and perform computations on that data. While the GPU is busy copying the GPU's computations units idle. If we want to efficiently utilize the GPU we want to minimize the idle time. A kernel is a set of instructions that implements a specific PyTorch operation. For example, when you call `torch.add`, it goes through a [PyTorch dispatcher]( which looks at the input tensor(s) and various other things and decides which code it should run, and then runs it. A CUDA kernel is a specific implementation that uses the CUDA API library and can only run on NVIDIA GPUs. Now, when instructing the GPU to compute `c = torch.add(a, b); e = torch.max([c,d])`, a naive approach, and what PyTorch will do unless instructed otherwise, is to launch two separate kernels, one to perform the addition of `a` and `b` and another to find the maximum value between `c` and `d`. In this case, the GPU fetches from its memory `a` and `b`, performs the addition, and then copies the result back into the memory. It then fetches `c` and `d` and performs the `max` operation and again copies the result back into the memory. If we were to fuse these two operations, i.e. put them into a single \"fused kernel\", and just launch that one kernel we won't copy the intermediary result `c` to the memory, but leave it in the GPU registers and only need to fetch `d` to complete the last computation. This saves a lot of overhead and prevents GPU idling and makes the whole operation much more efficient. Fused kernels are just that. Primarily they replace multiple discrete computations and data movements to/from memory into fused computations that have very few memory movements. Additionally, some fused kernels rewrite the math so that certain groups of computations can be performed faster. To train BLOOM fast and efficiently it was necessary to use several custom fused CUDA kernels provided by Megatron-LM. 
In particular there is an optimized kernel to perform LayerNorm as well as kernels to fuse various combinations of the scaling, masking, and softmax operations. The addition of a bias term is also fused with the GeLU operation using PyTorch's JIT functionality. These operations are all memory bound, so it is important to fuse them to maximize the amount of computation done once a value has been retrieved from memory. So, for example, adding the bias term while already doing the memory bound GeLU operation adds no additional time. These kernels are all available in the [Megatron-LM repository]( ## Datasets Another important feature from Megatron-LM is the efficient data loader. During start up of the initial training each data set is split into samples of the requested sequence length (2048 for BLOOM) and index is created to number each sample. Based on the training parameters the number of epochs for a dataset is calculated and an ordering for that many epochs is created and then shuffled. For example, if a dataset has 10 samples and should be gone through twice, the system first lays out the samples indices in order `[0, ..., 9, 0, ..., 9]` and then shuffles that order to create the final global order for the dataset. Notice that this means that training will not simply go through the entire dataset and then repeat, it is possible to see the same sample twice before seeing another sample at all, but at the end of training the model will have seen each sample twice. This helps ensure a smooth training curve through the entire training process. These indices, including the offsets into the base dataset of each sample, are saved to a file to avoid recomputing them each time a training process is started. Several of these datasets can then be blended with varying weights into the final data seen by the training process. ## Embedding LayerNorm While we were fighting with trying to stop 104B from diverging we discovered that adding an additional LayerNorm right after the first word embedding made the training much more stable. This insight came from experimenting with [bitsandbytes]( which contains a `StableEmbedding` which is a normal Embedding with layernorm and it uses a uniform xavier initialization. ## Positional Encoding We also replaced the usual positional embedding with an AliBi - based on the paper: [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation]( which allows to extrapolate for longer input sequences than the ones the model was trained on. So even though we train on sequences with length 2048 the model can also deal with much longer sequences during inference. ## Training Difficulties With the architecture, hardware and software in place we were able to start training in early March 2022. However, it was not just smooth sailing from there. In this section we discuss some of the main hurdles we encountered. There were a lot of issues to figure out before the training started. In particular we found several issues that manifested themselves only once we started training on 48 nodes, and won't appear at small scale. E.g., `CUDA_LAUNCH_BLOCKING=1` was needed to prevent the framework from hanging, and we needed to split the optimizer groups into smaller groups, otherwise the framework would again hang. You can read about those in detail in the [training prequel chronicles]( The main type of issue encountered during training were hardware failures. As this was a new cluster with about 400 GPUs, on average we were getting 1-2 GPU failures a week. 
We were saving a checkpoint every 3h (100 iterations) so on average we would lose 1.5h of training on hardware crash. The Jean Zay sysadmins would then replace the faulty GPUs and bring the node back up. Meanwhile we had backup nodes to use instead. We have run into a variety of other problems that led to 5-10h downtime several times, some related to a deadlock bug in PyTorch, others due to running out of disk space. If you are curious about specific details please see [training chronicles]( We were planning for all these downtimes when deciding on the feasibility of training this model - we chose the size of the model to match that feasibility and the amount of data we wanted the model to consume. With all the downtimes we managed to finish the training in our estimated time. As mentioned earlier it took about 1M compute hours to complete. One other issue was that SLURM wasn't designed to be used by a team of people. A SLURM job is owned by a single user and if they aren't around, the other members of the group can't do anything to the running job. We developed a kill-switch workaround that allowed other users in the group to kill the current process without requiring the user who started the process to be present. This worked well in 90% of the issues. If SLURM designers read this - please add a concept of Unix groups, so that a SLURM job can be owned by a group. As the training was happening 24/7 we needed someone to be on call - but since we had people both in Europe and West Coast Canada overall there was no need for someone to carry a pager, we would just overlap nicely. Of course, someone had to watch the training on the weekends as well. We automated most things, including recovery from hardware crashes, but sometimes a human intervention was needed as well. ## Conclusion The most difficult and intense part of the training was the 2 months leading to the start of the training. We were under a lot of pressure to start training ASAP, since the resources allocation was limited in time and we didn't have access to A100s until the very last moment. So it was a very difficult time, considering that the `BF16Optimizer` was written in the last moment and we needed to debug it and fix various bugs. And as explained in the previous section we discovered new problems that manifested themselves only once we started training on 48 nodes, and won't appear at small scale. But once we sorted those out, the training itself was surprisingly smooth and without major problems. Most of the time we had one person monitoring the training and only a few times several people were involved to troubleshoot. We enjoyed great support from Jean Zay's administration who quickly addressed most needs that emerged during the training. Overall it was a super-intense but very rewarding experience. Training large language models is still a challenging task, but we hope by building and sharing this technology in the open others can build on top of our experience. 
## Resources ### Important links - [main training document]( - [tensorboard]( - [training slurm script]( - [training chronicles]( ### Papers and Articles We couldn't have possibly explained everything in detail in this article, so if the technology presented here piqued your curiosity and you'd like to know more here are the papers to read: Megatron-LM: - [Efficient Large-Scale Language Model Training on GPU Clusters]( - [Reducing Activation Recomputation in Large Transformer Models]( DeepSpeed: - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models]( - [ZeRO-Offload: Democratizing Billion-Scale Model Training]( - [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning]( - [DeepSpeed: Extreme-scale model training for everyone]( Joint Megatron-LM and Deepspeeed: - [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]( ALiBi: - [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation]( - [What Language Model to Train if You Have One Million GPU Hours?]( - there you will find the experiments that lead to us choosing ALiBi. BitsNBytes: - [8-bit Optimizers via Block-wise Quantization]( (in the context of Embedding LayerNorm but the rest of the paper and the technology is amazing - the only reason were weren't using the 8-bit optimizer is because we were already saving the optimizer memory with DeepSpeed-ZeRO). ## Blog credits Huge thanks to the following kind folks who asked good questions and helped improve the readability of the article (listed in alphabetical order): Britney Muller, Douwe Kiela, Jared Casper, Jeff Rasley, Julien Launay, Leandro von Werra, Omar Sanseviero, Stefan Schweter and Thomas Wang. The main graphics was created by Chunte Lee."}
{"title": "bloom.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing The World's Largest Open Multilingual Language Model: BLOOM\" thumbnail: /blog/assets/86_bloom/thumbnail.png authors: - user: bigscience --- Introducing The World's Largest Open Multilingual Language Model: BLOOM Large language models (LLMs) have made a significant impact on AI research. These powerful, general models can take on a wide variety of new language tasks from a user\u2019s instructions. However, academia, nonprofits and smaller companies' research labs find it difficult to create, study, or even use LLMs as only a few industrial labs with the necessary resources and exclusive rights can fully access them. Today, we release [BLOOM]( the first multilingual LLM trained in complete transparency, to change this status quo \u2014 the result of the largest collaboration of AI researchers ever involved in a single research project. With its 176 billion parameters, BLOOM is able to generate text in 46 natural languages and 13 programming languages. For almost all of them, such as Spanish, French and Arabic, BLOOM will be the first language model with over 100B parameters ever created. This is the culmination of a year of work involving over 1000 researchers from 70+ countries and 250+ institutions, leading to a final run of 117 days (March 11 - July 6) training the BLOOM model on the [Jean Zay supercomputer]( in the south of Paris, France thanks to a compute grant worth an estimated \u20ac3M from French research agencies CNRS and GENCI. Researchers can [now download, run and study BLOOM]( to investigate the performance and behavior of recently developed large language models down to their deepest internal operations. More generally, any individual or institution who agrees to the terms of the model\u2019s [Responsible AI License]( (developed during the BigScience project itself) can use and build upon the model on a local machine or on a cloud provider. In this spirit of collaboration and continuous improvement, we\u2019re also releasing, for the first time, the intermediary checkpoints and optimizer states of the training. Don\u2019t have 8 A100s to play with? An inference API, currently backed by Google\u2019s TPU cloud and a FLAX version of the model, also allows quick tests, prototyping, and lower-scale use. You can already play with it on the Hugging Face Hub. This is only the beginning. BLOOM\u2019s capabilities will continue to improve as the workshop continues to experiment and tinker with the model. We\u2019ve started work to make it instructable as our earlier effort T0++ was and are slated to add more languages, compress the model into a more usable version with the same level of performance, and use it as a starting point for more complex architectures\u2026 All of the experiments researchers and practitioners have always wanted to run, starting with the power of a 100+ billion parameter model, are now possible. BLOOM is the seed of a living family of models that we intend to grow, not just a one-and-done model, and we\u2019re ready to support community efforts to expand it."}
{"title": "bridgetower.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2\" thumbnail: /blog/assets/bridgetower/thumbnail.png authors: - user: regisss - user: anahita-b guest: true --- # Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2 [Optimum Habana v1.6]( on Habana Gaudi2 achieves **more than x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation. *These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models.* This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models. ## BridgeTower In the recent past, [Vision-Language (VL) models]( have gained tremendous importance and shown dominance in a variety of VL tasks. Most common approaches leverage uni-modal encoders to extract representations from their respective modalities. Then those representations are either fused together, or fed into a cross-modal encoder. To efficiently handle some of the performance limitations and restrictions in VL representation learning, [BridgeTower]( introduces multiple _bridge layers_ that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations at different semantic levels in the cross-modal encoder. Pre-trained with only 4M images (see the detail [below](#benchmark)), BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, BridgeTower achieves an accuracy of 78.73% on the VQAv2 test-std set, outperforming the previous state-of-the-art model (METER) by 1.09% using the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. ## Hardware [Nvidia A100 Tensor Core GPU]( includes the 3rd generation of the [Tensor Core technology]( Although a newer generation got released recently (H100), this is still the fastest GPU that you will find at most cloud providers. We use here the 80GB-memory variant which also offers faster memory bandwidth than the 40GB one. [Habana Gaudi2]( is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices called HPUs with 96GB of memory each. Check out [our previous blog post]( for a more in-depth introduction and a guide showing how to access it through the [Intel Developer Cloud]( Unlike many AI accelerators in the market, advanced features are very easy to apply to make the most of Gaudi2 with [Optimum Habana]( which enables users to port Transformers-compatible scripts to Gaudi with just a 2-line change. ## Benchmark To benchmark training, we are going to fine-tune a [BridgeTower Large checkpoint]( consisting of 866M parameters. 
This checkpoint was pretrained on English language using masked language modeling, image-text matching and image-text contrastive loss on [Conceptual Captions]( [SBU Captions]( [MSCOCO Captions]( and [Visual Genome]( We will further fine-tune this checkpoint on the [New Yorker Caption Contest dataset]( which consists of cartoons from The New Yorker and the most voted captions. Hyperparameters are all the same for both accelerators, except the batch size: we managed to fit 40 samples on Gaudi2 against 32 on A100. You can check them out [here]( for Gaudi2 and [there]( for A100. **When dealing with datasets involving images, data loading is frequently a bottleneck** because many costly operations are computed on CPU (image decoding, image augmentations) and then full images are sent to the training devices. Ideally, *we would like to send only raw bytes to devices and then perform decoding and various image transformations on device*. But let's see first how to *easily* allocate more resources to data loading for accelerating your runs. ### Making use of `dataloader_num_workers` When image loading is done on CPU, a quick way to speed it up would be to allocate more subprocesses for data loading. This is very easy to do with Transformers' `TrainingArguments` (or its Optimum Habana counterpart `GaudiTrainingArguments`): you can use the `dataloader_num_workers=N` argument to set the number of subprocesses (`N`) allocated on CPU for data loading. The default is 0, which means that data is loaded in the main process. This may not be optimal as the main process has many things to manage. We can set it to 1 to have one fully dedicated subprocess for data loading. When several subprocesses are allocated, each one of them will be responsible for preparing a batch. This means that RAM consumption will increase with the number of workers. One recommendation would be to set it to the number of CPU cores, but those cores may not be fully free so you will have to try it out to find the best configuration. Let's run the two following experiments: - a mixed-precision (*bfloat16*/*float*) run distributed across 8 devices where data loading is performed by the same process as everything else (i.e. `dataloader_num_workers=0`) - a mixed-precision (*bfloat16*/*float*) run distributed across 8 devices with 1 dedicated subprocess for data loading (i.e. `dataloader_num_workers=1`) Here are the throughputs we got on Gaudi2 and A100: | Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | |||| | Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | | A100 GPU | 188.6 samples/s | 254.7 samples/s | We first see that **Gaudi2 is x2.51 faster than A100** with `dataloader_num_workers=1` and x2.82 faster with `dataloader_num_workers=0`, which is even better than [the speedups we previously reported]( Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.35 on A100. We also ran experiments with several dedicated subprocesses for data loading but performance was not better than with `dataloader_num_workers=1` for both Gaudi2 and A100. Thus, **using `dataloader_num_workers=1` is usually a good first way of accelerating your runs involving images!** Tensorboard logs can be visualized [here]( for Gaudi2 and [there]( for A100. ### Optimum Habana's fast DDP Before delving into how to perform hardware-accelerated data loading, let's look at another very easy way of speeding up your distributed runs on Gaudi. 
The new release of Optimum Habana, version 1.6.0, introduced a new feature that allows users to choose the distribution strategy to use: - `distribution_strategy=\"ddp\"` to use PyTorch [`DistributedDataParallel`]( (DDP) - `distribution_strategy=\"fast_ddp\"` to use a lighter and usually faster implementation Optimum Habana's fast DDP does not split parameter gradients into buckets as [DDP does]( It also uses [HPU graphs]( to collect gradients in all processes and then update them (after the [all_reduce]( operation is performed) with minimal host overhead. You can check this implementation [here]( Simply using `distribution_strategy=\"fast_ddp\"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.77 faster than A100!** So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy=\"fast_ddp\"`) led to a x1.33 speedup on Gaudi2 and to a x2.77 speedup compared to A100 with `dataloader_num_workers=1`. ### Hardware-accelerated data loading with Optimum Habana For even larger speedups, we are now going to move as many data loading operations as possible from the CPU to the accelerator devices (i.e. HPUs on Gaudi2 or GPUs on A100). This can be done on Gaudi2 using Habana's [media pipeline]( Given a dataset, most dataloaders follow this recipe: 1. Fetch data (e.g. where your JPEG images are stored on disk) 2. The CPU reads encoded images 3. The CPU decodes images 4. The CPU applies image transformations to augment images 5. Finally, images are sent to devices (although this is usually not done by the dataloader itself) Instead of doing the whole process on CPU and sending ready-to-train data to devices, a more efficient workflow would be to send encoded images to devices first and then perform image decoding and augmentations: 1. Same as before 2. Same as before 3. Encoded images are sent to devices 4. Devices decode images 5. Devices apply image transformations to augment images That way we can benefit from the computing power of our devices to speed up image decoding and transformations. Note that there are two caveats to be aware of when doing this: - Device memory consumption will increase, so you may have to reduce your batch size if there is not enough free memory. This may mitigate the speedup brought by this approach. - If devices are intensively used (100% or close to it) when doing data loading on CPU, don't expect any speedup when doing it on devices as they already have their hands full. To implement this on Gaudi2, we have got you covered: the [contrastive image-text example]( in Optimum Habana now provides a ready-to-use media pipeline that you can use with COCO-like datasets that contain text and images! You will just have to add `--mediapipe_dataloader` to your command to use it. 
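To make these options concrete, here is a hedged sketch of how they would appear in an Optimum Habana training script. The `dataloader_num_workers` and `distribution_strategy` arguments are the two discussed above; the remaining values mirror the benchmark command shown later, and `--mediapipe_dataloader` stays a flag of the example script rather than a `GaudiTrainingArguments` field:

```python
# Sketch only (not the exact benchmark script): the two training arguments
# discussed above, plus the Gaudi-specific settings used in the benchmark command.
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="/tmp/bridgetower-test",
    use_habana=True,                    # run on HPUs
    use_lazy_mode=True,                 # lazy execution mode
    gaudi_config_name="Habana/clip",    # Gaudi configuration for the run
    dataloader_num_workers=1,           # one dedicated subprocess for data loading
    distribution_strategy="fast_ddp",   # lighter, usually faster DDP implementation
    per_device_train_batch_size=40,
)

# model, gaudi_config and train_dataset come from the rest of the script, e.g.:
# trainer = GaudiTrainer(model=model, gaudi_config=gaudi_config, args=training_args, train_dataset=train_dataset)
# trainer.train()
```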
For interested readers, a lower-level overview is given in the documentation of Gaudi [here]( and the list of all supported operators is available [there]( We are now going to benchmark a run with `dataloader_num_workers=1`, `distribution_strategy=\"fast_ddp\"` and `mediapipe_dataloader` since all these optimizations are compatible with each other: | Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=1` + `distribution_strategy=\"fast_ddp\"` | `dataloader_num_workers=1` + `distribution_strategy=\"fast_ddp\"` + `mediapipe_dataloader` | |||||| | Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | 705.9 samples/s | 802.1 samples/s | | A100 GPU | 188.6 samples/s | 254.7 samples/s | / | / | We got an additional x1.14 speedup compared to the previous run with `dataloader_num_workers=1` and `distribution_strategy=\"fast_ddp\"`. This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x3.15 faster than A100 with `dataloader_num_workers=1`!** ### Reproducing this benchmark To reproduce this benchmark, you first need to get access to Gaudi2 through the [Intel Developer Cloud]( (see [this guide]( for more information). Then, you need to install the latest version of Optimum Habana and to run `run_bridgetower.py` that you can find [here]( Here is how to do it: ```bash pip install optimum[habana] git clone cd optimum-habana/examples/contrastive-image-text pip install -r requirements.txt ``` The base command line to run the script is: ```bash python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \\ --output_dir /tmp/bridgetower-test \\ --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \\ --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \\ --image_column image --caption_column image_description \\ --remove_unused_columns=False \\ --do_train --do_eval --do_predict \\ --per_device_train_batch_size=\"40\" --per_device_eval_batch_size=\"16\" \\ --num_train_epochs 5 \\ --learning_rate=\"1e-5\" \\ --push_to_hub --report_to tensorboard --hub_model_id bridgetower\\ --overwrite_output_dir \\ --use_habana --use_lazy_mode --use_hpu_graphs_for_inference --gaudi_config_name Habana/clip \\ --throughput_warmup_steps 3 \\ --logging_steps 10 ``` which corresponds to the case `--dataloader_num_workers 0`. You can then add `--dataloader_num_workers 1`, `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` to test other configurations. To push your model and Tensorboard logs to the Hugging Face Hub, you will have to log in to your account beforehand with: ```bash huggingface-cli login ``` For A100, you can use the same `run_bridgetower.py` script with a couple of small changes: - Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers - Remove references to `GaudiConfig` and `gaudi_config` - Import `set_seed` directly from Transformers: `from transformers import set_seed` The results displayed in this benchmark were obtained with a Nvidia A100 80GB GCP instance with 8 GPUS. Note that `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` are compatible with Gaudi2 only and will not work with A100. ## Conclusion When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU. 
We showed that these optimizations lead to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is more than 3x faster than Nvidia A100 80GB with Transformers!** And this is super easy to use, as you just need to provide a few additional training arguments. To go further, we are looking forward to using HPU graphs for training models even faster and to presenting how to use DeepSpeed ZeRO-3 on Gaudi2 to accelerate the training of your LLMs. Stay tuned! If you are interested in accelerating your Machine Learning training and inference workflows using the latest AI hardware accelerators and software libraries, check out our [Expert Acceleration Program]( To learn more about Habana solutions, [read about our partnership and contact them here]( To learn more about Hugging Face efforts to make AI hardware accelerators easy to use, check out our [Hardware Partner Program]( ### Related Topics - [Faster Training and Inference: Habana Gaudi-2 vs Nvidia A100 80GB]( - [Fast Inference on Large Language Models: BLOOMZ on Habana Gaudi2 Accelerator]("}
{"title": "carbon-emissions-on-the-hub.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"CO2 Emissions and the Hub: Leading the Charge\" thumbnail: /blog/assets/60_carbon_emissions_on_the_hub/thumbnail.jpg authors: - user: sasha - user: muellerzr - user: nateraw --- CO2 Emissions and the Hub: Leading the Charge ## What are CO2 Emissions and why are they important? Climate change is one of the greatest challenges that we are facing and reducing emissions of greenhouse gases such as carbon dioxide (CO2) is an important part of tackling this problem. Training and deploying machine learning models will emit CO2 due to the energy usage of the computing infrastructures that are used: from GPUs to storage, it all needs energy to function and emits CO2 in the process.  > Pictured: Recent Transformer models and their carbon footprints The amount of CO2 emitted depends on different factors such as runtime, hardware used, and carbon intensity of the energy source. Using the tools described below will help you both track and report your own emissions (which is important to improve the transparency of our field as a whole!) and choose models based on their carbon footprint. ## How to calculate your own CO2 Emissions automatically with Transformers Before we begin, if you do not have the latest version of the `huggingface_hub` library on your system, please run the following: ``` pip install huggingface_hub -U ``` ## How to find low-emission models using the Hugging Face Hub With the model now uploaded to the Hub, how can you search for models on the Hub while trying to be eco-friendly? Well, the `huggingface_hub` library has a new special parameter to perform this search: `emissions_threshold`. All you need to do is specify a minimum or maximum number of grams, and all models that fall within that range. For example, we can search for all models that took a maximum of 100 grams to make: ```python from huggingface_hub import HfApi api = HfApi() models = api.list_models(emissions_thresholds=(None, 100), cardData=True) len(models) >>> 191 ``` There were quite a few! This also helps to find smaller models, given they typically did not release as much carbon during training. We can look at one up close to see it does fit our threshold: ```python model = models[0] print(f'Model Name: {model.modelId}\\nCO2 Emitted during training: {model.cardData[\"co2_eq_emissions\"]}') >>> Model Name: esiebomajeremiah/autonlp-email-classification-657119381 CO2 Emitted during training: 3.516233232503715 ``` Similarly, we can search for a minimum value to find very large models that emitted a lot of CO2 during training: ```python models = api.list_models(emissions_thresholds=(500, None), cardData=True) len(models) >>> 10 ``` Now let's see exactly how much CO2 one of these emitted: ```python model = models[0] print(f'Model Name: {model.modelId}\\nCO2 Emitted during training: {model.cardData[\"co2_eq_emissions\"]}') >>> Model Name: Maltehb/aelaectra-danish-electra-small-cased CO2 Emitted during training: 4009.5 ``` That's a lot of CO2! As you can see, in just a few lines of code we can quickly vet models we may want to use to make sure we're being environmentally cognizant! ## How to Report Your Carbon Emissions with `transformers` If you're using `transformers`, you can automatically track and report carbon emissions thanks to the `codecarbon` integration. 
If you've installed `codecarbon` on your machine, the `Trainer` object will automatically add the `CodeCarbonCallback` while training, which will store carbon emissions data for you as you train. So, if you run something like this... ```python from datasets import load_dataset from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments ds = load_dataset(\"imdb\") model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=2) tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\") def tokenize_function(examples): return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True) small_train_dataset = ds[\"train\"].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True) small_eval_dataset = ds[\"test\"].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True) training_args = TrainingArguments( \"codecarbon-text-classification\", num_train_epochs=4, push_to_hub=True ) trainer = Trainer( model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, ) trainer.train() ``` ...you'll be left with a file within the `codecarbon-text-classification` directory called `emissions.csv`. This file will keep track of the carbon emissions across different training runs. Then, when you're ready, you can take the emissions from the run you used to train your final model and include that in its model card. An example of this data being included at the top of the model card is shown below: For more references on the metadata format for `co2_eq_emissions` see [the hub docs]( ### Further readings - Rolnick et al. (2019) - [Tackling Climate Change with Machine Learning]( - Strubell et al. (2019) - [Energy and Policy Considerations for Deep Learning in NLP]( - Schwartz et al. (2020) - [Green AI]("}
{"title": "chatbot-amd-gpu.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Run a Chatgpt-like Chatbot on a Single GPU with ROCm\" thumbnail: /blog/assets/chatbot-amd-gpu/thumbnail.png authors: - user: andyll7772 guest: true --- Run a Chatgpt-like Chatbot on a Single GPU with ROCm ## Introduction ChatGPT, OpenAI's groundbreaking language model, has become an influential force in the realm of artificial intelligence, paving the way for a multitude of AI applications across diverse sectors. With its staggering ability to comprehend and generate human-like text, ChatGPT has transformed industries, from customer support to creative writing, and has even served as an invaluable research tool. Various efforts have been made to provide open-source large language models which demonstrate great capabilities but in smaller sizes, such as [OPT]( [LLAMA]( [Alpaca]( and [Vicuna]( In this blog, we will delve into the world of Vicuna, and explain how to run the Vicuna 13B model on a single AMD GPU with ROCm. **What is Vicuna?** Vicuna is an open-source chatbot with 13 billion parameters, developed by a team from UC Berkeley, CMU, Stanford, and UC San Diego. To create Vicuna, a LLAMA base model was fine-tuned using about 70K user-shared conversations collected from ShareGPT.com via public APIs. According to initial assessments where GPT-4 is used as a reference, Vicuna-13B has achieved over 90%\\* quality compared to OpenAI ChatGPT. It was released on [Github]( on Apr 11, just a few weeks ago. It is worth mentioning that the data set, training code, evaluation metrics, training cost are known for Vicuna. Its total training cost was just around \\$300, making it a cost-effective solution for the general public. For more details about Vicuna, please check out . **Why do we need a quantized GPT model?** Running Vicuna-13B model in fp16 requires around 28GB GPU RAM. To further reduce the memory footprint, optimization techniques are required. There is a recent research paper GPTQ published, which proposed accurate post-training quantization for GPT models with lower bit precision. As illustrated below, for models with parameters larger than 10B, the 4-bit or 3-bit GPTQ can achieve comparable accuracy with fp16. Moreover, large parameters of these models also have a severely negative effect on GPT latency because GPT token generation is more limited by memory bandwidth (GB/s) than computation (TFLOPs or TOPs) itself. For this reason, a quantized model does not degrade token generation latency when the GPU is under a memory bound situation. Refer to [the GPTQ quantization papers]() and [github repo](). By leveraging this technique, several 4-bit quantized Vicuna models are available from Hugging Face as follows, ## Running Vicuna 13B Model on AMD GPU with ROCm To run the Vicuna 13B model on an AMD GPU, we need to leverage the power of ROCm (Radeon Open Compute), an open-source software platform that provides AMD GPU acceleration for deep learning and high-performance computing applications. Here's a step-by-step guide on how to set up and run the Vicuna 13B model on an AMD GPU with ROCm: **System Requirements** Before diving into the installation process, ensure that your system meets the following requirements: - An AMD GPU that supports ROCm (check the compatibility list on docs.amd.com page) - A Linux-based operating system, preferably Ubuntu 18.04 or 20.04 - Conda or Docker environment - Python 3.6 or higher For more information, please check out . 
This example has been tested on [**Instinct MI210**]( and [**Radeon RX6900XT**]( GPUs with ROCm5.4.3 and Pytorch2.0. **Quick Start** **1 ROCm installation and Docker container setup (Host machine)** **1.1 ROCm installation** The following is for ROCm5.4.3 and Ubuntu 22.04. Please modify according to your target ROCm and Ubuntu version from: ``` sudo apt update && sudo apt upgrade -y wget sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms sudo amdgpu-install --list-usecase sudo reboot ``` **1.2 ROCm installation verification** ``` rocm-smi sudo rocminfo ``` **1.3 Docker image pull and run a Docker container** The following uses Pytorch2.0 on ROCm5.4.2. Please use the appropriate docker image according to your target ROCm and Pytorch version: ``` docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview sudo docker run --device=/dev/kfd --device=/dev/dri --group-add video \\ --shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \\ --ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER} \\ rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview ``` **2 Model quantization and model inference (Inside the docker)** You can either download the quantized Vicuna-13b model from Huggingface or quantize the floating-point model yourself. Please check out **Appendix - GPTQ model quantization** if you want to quantize the floating-point model. **2.1 Download the quantized Vicuna-13b model** Use the download-model.py script from the following git repo. ``` git clone cd text-generation-webui python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g ``` **2.2 Running the Vicuna 13B GPTQ Model on AMD GPU** ``` git clone -b cuda cd GPTQ-for-LLaMa python setup_cuda.py install ``` These commands will compile and link HIPIFIED CUDA-equivalent kernel binaries to python as C extensions. The kernels of this implementation are composed of dequantization + FP32 Matmul. If you want to use dequantization + FP16 Matmul for additional speed-up, please check out **Appendix - GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs** ``` git clone -b cuda cd GPTQ-for-LLaMa/ python setup_cuda.py install # model inference python llama_inference.py ../../models/vicuna-13b --wbits 4 --load \\ ../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text \"Your input text here\" ``` Now that you have everything set up, it's time to run the Vicuna 13B model on your AMD GPU. Use the commands above to run the model. Replace *\"Your input text here\"* with the text you want to use as input for the model. If everything is set up correctly, you should see the model generating output text based on your input. **3 Expose the quantized Vicuna model to the Web API server** Change the path of the GPTQ python modules (GPTQ-for-LLaMa) in the following line: To launch the Web UI from the gradio library, you need to set up the controller, the worker (the Vicuna model worker), and the web_server by running them as background jobs. ``` nohup python -W ignore::UserWarning -m fastchat.serve.controller & nohup python -W ignore::UserWarning -m fastchat.serve.model_worker --model-path /path/to/quantized_vicuna_weights \\ --model-name vicuna-13b-quantization --wbits 4 --groupsize 128 & nohup python -W ignore::UserWarning -m fastchat.serve.gradio_web_server & ``` Now the 4-bit quantized Vicuna-13B model can be fitted in RX6900XT GPU DDR memory, which has 16GB DDR. 
Only 7.52GB of DDR (46% of 16GB) is needed to run 13B models, whereas the model needs more than 28GB of DDR space in the fp16 datatype. The latency penalty and accuracy penalty are also very minimal and the related metrics are provided at the end of this article. **Test the quantized Vicuna model in the Web API server** Let us give it a try. First, let us use the fp16 Vicuna model for language translation. It does a better job than me. Next, let us ask something about soccer. The answer looks good to me. When we switch to the 4-bit model, for the same question, the answer is a bit different. There is a duplicated \u201cLionel Messi\u201d in it. **Vicuna fp16 and 4bit quantized model comparison** Test environment: \\- GPU: Instinct MI210, RX6900XT \\- python: 3.10 \\- pytorch: 2.1.0a0+gitfa08e54 \\- rocm: 5.4.3 **Metrics \u2013 Model size (GB)** - Model parameter size. When the models are preloaded to GPU DDR, the actual DDR size consumption is larger than the model itself due to caching for input and output token spaces. **Metrics \u2013 Accuracy (PPL: Perplexity)** - Measured on 2048 examples of the C4 () dataset - Vicuna 13b \u2013 baseline: fp16 datatype parameter, fp16 Matmul - Vicuna 13b \u2013 quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul - Vicuna 13b \u2013 quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul **Metrics \u2013 Latency (Token generation latency, ms)** - Measured during token generation phases. - Vicuna 13b \u2013 baseline: fp16 datatype parameter, fp16 Matmul - Vicuna 13b \u2013 quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul - Vicuna 13b \u2013 quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul ## Conclusion Large language models (LLMs) have made significant advancements in chatbot systems, as seen in OpenAI\u2019s ChatGPT. Vicuna-13B, an open-source LLM, has been developed and has demonstrated excellent capability and quality. By following this guide, you should now have a better understanding of how to set up and run the Vicuna 13B model on an AMD GPU with ROCm. This will enable you to unlock the full potential of this cutting-edge language model for your research and personal projects. Thanks for reading! ## Appendix - GPTQ model quantization **Building Vicuna quantized model from the floating-point LLaMA model** **a. Download LLaMA and Vicuna delta models from Huggingface** The developers of Vicuna (lmsys) provide only delta-models that can be applied to the LLaMA model. Download LLaMA in huggingface format and Vicuna delta parameters from Huggingface individually. Currently, 7b and 13b delta models of Vicuna are available. **b. Convert LLaMA to Vicuna by using Vicuna-delta model** ``` git clone cd FastChat ``` Convert the LLaMA parameters by using this command: (Note: do not use vicuna-{7b, 13b}-\\*delta-v0 because its vocab_size is different from that of LLaMA and the model cannot be converted) ``` python -m fastchat.model.apply_delta --base /path/to/llama-13b --delta lmsys/vicuna-13b-delta-v1.1 \\ --target ./vicuna-13b ``` Now the Vicuna-13b model is ready. **c. Quantize Vicuna to 2/3/4 bits** To apply GPTQ to LLaMA and Vicuna, ``` git clone -b cuda cd GPTQ-for-LLaMa ``` (Note: do not use for now, because the 2/3/4-bit quantization + MatMul kernels implemented in this repo do not parallelize the dequant+matmul and hence show lower token generation performance.) Quantize the Vicuna-13b model with this command. Quantization is done based on the c4 dataset, but you can also use other datasets, such as wikitext2. 
(Note: experimenting with different combinations of wbits and group size may be worthwhile, since some combinations can increase model accuracy significantly.) ``` python llama.py ./Vicuna-13b c4 --wbits 4 --true-sequential --act-order \\ --save_safetensors Vicuna-13b-4bit-act-order.safetensors ``` Now the model is ready and saved as **Vicuna-13b-4bit-act-order.safetensors**. **GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs** The more optimized kernel implementation targets the A100 GPU and is not compatible with the ROCm5.4.3 HIPIFY toolkits. It needs to be modified as follows. The same applies to the VecQuant2MatMulKernelFaster, VecQuant3MatMulKernelFaster, and VecQuant4MatMulKernelFaster kernels. For convenience, all the modified code is available in [Github Gist]("}
{"title": "chinese-language-blog.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing HuggingFace blog for Chinese speakers: Fostering Collaboration with the Chinese AI community\" thumbnail: /blog/assets/chinese-language-blog/thumbnail.png forceLinkToOtherLanguage: on authors: - user: xianbao - user: adinayakefu - user: chenglu guest: true --- Introducing HuggingFace blog for Chinese speakers: Fostering Collaboration with the Chinese AI community ## Welcome to our blog for Chinese speakers! We are delighted to introduce Hugging Face\u2019s new blog for Chinese speakers: [hf.co/blog/zh]( A committed group of volunteers has made this possible by translating our invaluable resources, including blog posts and comprehensive courses on transformers, diffusion, and reinforcement learning. This step aims to make our content accessible to the ever-growing Chinese AI community, fostering mutual learning and collaboration. ## Recognizing the Chinese AI Community\u2019s Accomplishments We want to highlight the remarkable achievements and contributions of the Chinese AI community, which has demonstrated exceptional talent and innovation. Groundbreaking advancements like [HuggingGPT]( [ChatGLM]( [RWKV]( [ChatYuan]( [ModelScope text-to-video models]( as well as [IDEA CCNL]( and [BAAI]( contributions underscore the incredible potential within the community. In addition, the Chinese AI community has been actively engaged in creating trendy Spaces, such as [Chuanhu GPT]( and [GPT Academy]( further demonstrating its enthusiasm and creativity. We have been collaborating with organizations such as [PaddlePaddle]( to ensure seamless integration with Hugging Face, empowering more collaborative efforts in the realm of Machine Learning. ## Strengthening Collaborative Ties and Future Events We are proud of our collaborative history with our Chinese collaborators, having worked together on various events that have enabled knowledge exchange and collaboration, propelling the AI community forward. Some of our collaborative efforts include: - [Online ChatGPT course, in collaboration with DataWhale (ongoing)]( - [First offline meetup in Beijing for JAX/Diffusers community sprint]( - [Organizing a Prompt engineering hackathon alongside Baixing AI]( - [Fine-tuning Lora models in collaboration with PaddlePaddle]( - [Fine-tuning stable diffusion models in an event with HeyWhale]( We are excited to announce that we will continue to strengthen our ties with the Chinese AI community by fostering more collaborations and joint efforts. These initiatives will create opportunities for knowledge sharing and expertise exchange, promoting collaborative open-source machine learning across our communities, and tackling the challenges and opportunities in the field of cooperative OS ML. ## Beyond Boundaries: Embracing a Diverse AI Community As we embark on this new chapter, our collaboration with the Chinese AI community will serve as a platform to bridge cultural and linguistic barriers, fostering innovation and cooperation in the AI domain. At Hugging Face, we value diverse perspectives and voices, aiming to create a welcoming and inclusive community that promotes ethical and equitable AI development. Join us on this exciting journey, and stay tuned for more updates on our blog about Chinese community advancements and future collaborative endeavors! You may also find us here: [BAAI]( [Bilibili]( [CNBlogs]( [CSDN]( [Juejin]( [OS China]( [SegmentFault]( [Zhihu]("}
{"title": "classification-use-cases.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Leveraging Hugging Face for complex text classification use cases\" thumbnail: /blog/assets/78_ml_director_insights/blogthumbnail.png authors: - user: juliensimon - user: Violette - user: florentgbelidji - user: oknerazan guest: true - user: lsmith guest: true --- # Leveraging Hugging Face for complex text classification use cases ## The Success Story of Witty Works with the Hugging Face Expert Acceleration Program. _If you're interested in building ML solutions faster, visit the [Expert Acceleration Program]( landing page and contact us [here]( ### Business Context As IT continues to evolve and reshape our world, creating a more diverse and inclusive environment within the industry is imperative. [Witty Works]( was built in 2018 to address this challenge. Starting as a consulting company advising organizations on becoming more diverse, Witty Works first helped them write job ads using inclusive language. To scale this effort, in 2019, they built a web app to assist users in writing inclusive job ads in English, French and German. They enlarged the scope rapidly with a writing assistant working as a browser extension that automatically fixes and explains potential bias in emails, Linkedin posts, job ads, etc. The aim was to offer a solution for internal and external communication that fosters a cultural change by providing micro-learning bites that explain the underlying bias of highlighted words and phrases. Example of suggestions by the writing assistant ### First experiments Witty Works first chose a basic machine learning approach to build their assistant from scratch. Using transfer learning with pre-trained spaCy models, the assistant was able to: - Analyze text and transform words into lemmas, - Perform a linguistic analysis, - Extract the linguistic features from the text (plural and singular forms, gender), part-of-speech tags (pronouns, verbs, nouns, adjectives, etc.), word dependencies labels, named entity recognition, etc. By detecting and filtering words according to a specific knowledge base using linguistic features, the assistant could highlight non-inclusive words and suggest alternatives in real-time. ### Challenge The vocabulary had around 2300 non-inclusive words and idioms in German and English correspondingly. And the above described basic approach worked well for 85% of the vocabulary but failed for context-dependent words. Therefore the task was to build a context-dependent classifier of non-inclusive words. Such a challenge (understanding the context rather than recognizing linguistic features) led to using Hugging Face transformers. ```diff Example of context dependent non-inclusive words: Fossil fuels are not renewable resources. Vs He is an old fossil You will have a flexible schedule. Vs You should keep your schedule flexible. ``` ### Solutions provided by the [Hugging Face Experts]( - #### **Get guidance for deciding on the right ML approach.** The initial chosen approach was vanilla transformers (used to extract token embeddings of specific non-inclusive words). The Hugging Face Expert recommended switching from contextualized word embeddings to contextualized sentence embeddings. In this approach, the representation of each word in a sentence depends on its surrounding context. Hugging Face Experts suggested the use of a [Sentence Transformers]( architecture. This architecture generates embeddings for sentences as a whole. 
The distance between semantically similar sentences is minimized and maximized for distant sentences. In this approach, Sentence Transformers use Siamese networks and triplet network structures to modify the pre-trained transformer models to generate \u201csemantically meaningful\u201d sentence embeddings. The resulting sentence embedding serves as input for a classical classifier based on KNN or logistic regression to build a context-dependent classifier of non-inclusive words. ```diff Elena Nazarenko, Lead Data Scientist at Witty Works: \u201cWe generate contextualized embedding vectors for every word depending on its sentence (BERT embedding). Then, we keep only the embedding for the \u201cproblem\u201d word\u2019s token, and calculate the smallest angle (cosine similarity)\u201d ``` To fine-tune a vanilla transformers-based classifier, such as a simple BERT model, Witty Works would have needed a substantial amount of annotated data. Hundreds of samples for each category of flagged words would have been necessary. However, such an annotation process would have been costly and time-consuming, which Witty Works couldn\u2019t afford. - #### **Get guidance on selecting the right ML library.** The Hugging Face Expert suggested using the Sentence Transformers Fine-tuning library (aka [SetFit]( an efficient framework for few-shot fine-tuning of Sentence Transformers models. Combining contrastive learning and semantic sentence similarity, SetFit achieves high accuracy on text classification tasks with very little labeled data. ```diff Julien Simon, Chief Evangelist at Hugging Face: \u201cSetFit for text classification tasks is a great tool to add to the ML toolbox\u201d ``` The Witty Works team found the performance was adequate with as little as 15-20 labeled sentences per specific word. ```diff Elena Nazarenko, Lead Data Scientist at Witty Works: \u201cAt the end of the day, we saved time and money by not creating this large data set\u201d ``` Reducing the number of sentences was essential to ensure that model training remained fast and that running the model was efficient. However, it was also necessary for another reason: Witty explicitly takes a highly supervised/rule-based approach to [actively manage bias]( Reducing the number of sentences is very important to reduce the effort in manually reviewing the training sentences. - #### **Get guidance on selecting the right ML models.** One major challenge for Witty Works was deploying a model with low latency. No one expects to wait 3 minutes to get suggestions to improve one\u2019s text! Both Hugging Face and Witty Works experimented with a few sentence transformers models and settled for [mpnet-base-v2]( combined with logistic regression and KNN. After a first test on Google Colab, the Hugging Face experts guided Witty Works on deploying the model on Azure. No optimization was necessary as the model was fast enough. ```diff Elena Nazarenko, Lead Data Scientist at Witty Works: \u201cWorking with Hugging Face saved us a lot of time and money. One can feel lost when implementing complex text classification use cases. As it is one of the most popular tasks, there are a lot of models on the Hub. The Hugging Face experts guided me through the massive amount of transformer-based models to choose the best possible approach. Plus, I felt very well supported during the model deployment\u201d ``` ### **Results and conclusion** The number of training sentences dropped from 100-200 per word to 15-20 per word. 
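To make this few-shot setup concrete, here is a hedged sketch of what training such a classifier with SetFit can look like. The sentences, labels, and base checkpoint are illustrative placeholders, and the code assumes the pre-1.0 `setfit` API:

```python
# Sketch: few-shot training of a context-dependent classifier with SetFit (v0.x API).
# Sentences, labels and the base checkpoint are illustrative placeholders.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny, illustrative training set: 0 = harmless usage, 1 = non-inclusive usage.
train_ds = Dataset.from_dict({
    "text": [
        "Fossil fuels are not renewable resources.",
        "He is an old fossil.",
        "Dinosaur fossils were found near the lake.",
        "The board is full of old fossils.",
        # ... in practice, around 15-20 labeled sentences per flagged word ...
    ],
    "label": [0, 1, 0, 1],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds, batch_size=4, num_iterations=20)
trainer.train()

print(model(["My grandfather collects fossils."]))  # expected to come out as the harmless class
```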
Witty Works achieved an accuracy of 0.92 and successfully deployed a custom model on Azure with minimal DevOps effort! ```diff Lukas Kahwe Smith CTO & Co-founder of Witty Works: \u201cWorking on an IT project by oneself can be challenging and even if the EAP is a significant investment for a startup, it is the cheaper and most meaningful way to get a sparring partner\u201c ``` With the guidance of the Hugging Face experts, Witty Works saved time and money by implementing a new ML workflow in the Hugging Face way. ```diff Julien Simon, Chief Evangelist at Hugging Face: \u201cThe Hugging way to build workflows: find open-source pre-trained models, evaluate them right away, see what works, see what does not. By iterating, you start learning things immediately\u201d ``` --- If you or your team are interested in accelerating your ML roadmap with Hugging Face Experts, please visit [hf.co/support]( to learn more."}
{"title": "clipseg-zero-shot.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: Zero-shot image segmentation with CLIPSeg thumbnail: /blog/assets/123_clipseg-zero-shot/thumb.png authors: - user: segments-tobias guest: true - user: nielsr --- # Zero-shot image segmentation with CLIPSeg **This guide shows how you can use [CLIPSeg]( a zero-shot image segmentation model, using [` transformers`]( CLIPSeg creates rough segmentation masks that can be used for robot perception, image inpainting, and many other tasks. If you need more precise segmentation masks, we\u2019ll show how you can refine the results of CLIPSeg on [Segments.ai]( Image segmentation is a well-known task within the field of computer vision. It allows a computer to not only know what is in an image (classification), where objects are in the image (detection), but also what the outlines of those objects are. Knowing the outlines of objects is essential in fields such as robotics and autonomous driving. For example, a robot has to know the shape of an object to grab it correctly. Segmentation can also be combined with [image inpainting]( to allow users to describe which part of the image they want to replace. One limitation of most image segmentation models is that they only work with a fixed list of categories. For example, you cannot simply use a segmentation model trained on oranges to segment apples. To teach the segmentation model an additional category, you have to label data of the new category and train a new model, which can be costly and time-consuming. But what if there was a model that can already segment almost any kind of object, without any further training? That\u2019s exactly what [CLIPSeg]( a zero-shot segmentation model, achieves. Currently, CLIPSeg still has its limitations. For example, the model uses images of 352 x 352 pixels, so the output is quite low-resolution. This means we cannot expect pixel-perfect results when we work with images from modern cameras. If we want more precise segmentations, we can fine-tune a state-of-the-art segmentation model, as shown in [our previous blog post]( In that case, we can still use CLIPSeg to generate some rough labels, and then refine them in a labeling tool such as [Segments.ai]( Before we describe how to do that, let\u2019s first take a look at how CLIPSeg works. ## CLIP: the magic model behind CLIPSeg [CLIP]( which stands for **C**ontrastive **L**anguage\u2013**I**mage **P**re-training, is a model developed by OpenAI in 2021. You can give CLIP an image or a piece of text, and CLIP will output an abstract *representation* of your input. This abstract representation, also called an *embedding*, is really just a vector (a list of numbers). You can think of this vector as a point in high-dimensional space. CLIP is trained so that the representations of similar pictures and texts are similar as well. This means that if we input an image and a text description that fits that image, the representations of the image and the text will be similar (i.e., the high-dimensional points will be close together). At first, this might not seem very useful, but it is actually very powerful. As an example, let\u2019s take a quick look at how CLIP can be used to classify images without ever having been trained on that task. To classify an image, we input the image and the different categories we want to choose from to CLIP (e.g. we input an image and the words \u201capple\u201d, \u201corange\u201d, \u2026). 
CLIP then gives us back an embedding of the image and of each category. Now, we simply have to check which category embedding is closest to the embedding of the image, et voil\u00e0! Feels like magic, doesn\u2019t it? Example of image classification using CLIP (source). What\u2019s more, CLIP is not only useful for classification, but it can also be used for [image search]( (can you see how this is similar to classification?), [text-to-image models]( ([DALL-E 2]( is powered by CLIP), [object detection]( ([OWL-ViT]( and most importantly for us: image segmentation. Now you see why CLIP was truly a breakthrough in machine learning. The reason why CLIP works so well is that the model was trained on a huge dataset of images with text captions. The dataset contained a whopping 400 million image-text pairs taken from the internet. These images contain a wide variety of objects and concepts, and CLIP is great at creating a representation for each of them. ## CLIPSeg: image segmentation with CLIP [CLIPSeg]( is a model that uses CLIP representations to create image segmentation masks. It was published by Timo L\u00fcddecke and Alexander Ecker. They achieved zero-shot image segmentation by training a Transformer-based decoder on top of the CLIP model, which is kept frozen. The decoder takes in the CLIP representation of an image, and the CLIP representation of the thing you want to segment. Using these two inputs, the CLIPSeg decoder creates a binary segmentation mask. To be more precise, the decoder doesn\u2019t only use the final CLIP representation of the image we want to segment, but it also uses the outputs of some of the layers of CLIP. Source The decoder is trained on the [PhraseCut dataset]( which contains over 340,000 phrases with corresponding image segmentation masks. The authors also experimented with various augmentations to expand the size of the dataset. The goal here is not only to be able to segment the categories that are present in the dataset, but also to segment unseen categories. Experiments indeed show that the decoder can generalize to unseen categories. One interesting feature of CLIPSeg is that both the query (the image we want to segment) and the prompt (the thing we want to segment in the image) are input as CLIP embeddings. The CLIP embedding for the prompt can either come from a piece of text (the category name), **or from another image**. This means you can segment oranges in a photo by giving CLIPSeg an example image of an orange. This technique, which is called \"visual prompting\", is really helpful when the thing you want to segment is hard to describe. For example, if you want to segment a logo in a picture of a t-shirt, it's not easy to describe the shape of the logo, but CLIPSeg allows you to simply use the image of the logo as the prompt. The CLIPSeg paper contains some tips on improving the effectiveness of visual prompting. They find that cropping the query image (so that it only contains the object you want to segment) helps a lot. Blurring and darkening the background of the query image also helps a little bit. In the next section, we'll show how you can try out visual prompting yourself using [` transformers`]( ## Using CLIPSeg with Hugging Face Transformers Using Hugging Face Transformers, you can easily download and run a pre-trained CLIPSeg model on your images. Let's start by installing transformers. ```python !pip install -q transformers ``` To download the model, simply instantiate it. 
```python from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation processor = CLIPSegProcessor.from_pretrained(\"CIDAS/clipseg-rd64-refined\") model = CLIPSegForImageSegmentation.from_pretrained(\"CIDAS/clipseg-rd64-refined\") ``` Now we can load an image to try out the segmentation. We\\'ll choose a picture of a delicious breakfast taken by [Calum Lewis]( ```python from PIL import Image import requests url = \" image = Image.open(requests.get(url, stream=True).raw) image ``` ### Text prompting Let's start by defining some text categories we want to segment. ```python prompts = [\"cutlery\", \"pancakes\", \"blueberries\", \"orange juice\"] ``` Now that we have our inputs, we can process them and input them to the model. ```python import torch inputs = processor(text=prompts, images=[image] * len(prompts), padding=\"max_length\", return_tensors=\"pt\") # predict with torch.no_grad(): outputs = model(**inputs) preds = outputs.logits.unsqueeze(1) ``` Finally, let's visualize the output. ```python import matplotlib.pyplot as plt _, ax = plt.subplots(1, len(prompts) + 1, figsize=(3*(len(prompts) + 1), 4)) [a.axis('off') for a in ax.flatten()] ax[0].imshow(image) [ax[i+1].imshow(torch.sigmoid(preds[i][0])) for i in range(len(prompts))]; [ax[i+1].text(0, -15, prompt) for i, prompt in enumerate(prompts)]; ``` ### Visual prompting As mentioned before, we can also use images as the input prompts (i.e. in place of the category names). This can be especially useful if it\\'s not easy to describe the thing you want to segment. For this example, we\\'ll use a picture of a coffee cup taken by [Daniel Hooper]( ```python url = \" prompt = Image.open(requests.get(url, stream=True).raw) prompt ``` We can now process the input image and prompt image and input them to the model. ```python encoded_image = processor(images=[image], return_tensors=\"pt\") encoded_prompt = processor(images=[prompt], return_tensors=\"pt\") # predict with torch.no_grad(): outputs = model(**encoded_image, conditional_pixel_values=encoded_prompt.pixel_values) preds = outputs.logits.unsqueeze(1) preds = torch.transpose(preds, 0, 1) ``` Then, we can visualize the results as before. ```python _, ax = plt.subplots(1, 2, figsize=(6, 4)) [a.axis('off') for a in ax.flatten()] ax[0].imshow(image) ax[1].imshow(torch.sigmoid(preds[0])) ``` Let's try one last time by using the visual prompting tips described in the paper, i.e. cropping the image and darkening the background. ```python url = \" alternative_prompt = Image.open(requests.get(url, stream=True).raw) alternative_prompt ``` ```python encoded_alternative_prompt = processor(images=[alternative_prompt], return_tensors=\"pt\") # predict with torch.no_grad(): outputs = model(**encoded_image, conditional_pixel_values=encoded_alternative_prompt.pixel_values) preds = outputs.logits.unsqueeze(1) preds = torch.transpose(preds, 0, 1) ``` ```python _, ax = plt.subplots(1, 2, figsize=(6, 4)) [a.axis('off') for a in ax.flatten()] ax[0].imshow(image) ax[1].imshow(torch.sigmoid(preds[0])) ``` In this case, the result is pretty much the same. This is probably because the coffee cup was already separated well from the background in the original image. ## Using CLIPSeg to pre-label images on Segments.ai As you can see, the results from CLIPSeg are a little fuzzy and very low-res. If we want to obtain better results, you can fine-tune a state-of-the-art segmentation model, as explained in [our previous blogpost]( To finetune the model, we\\'ll need labeled data. 
In this section, we\\'ll show you how you can use CLIPSeg to create some rough segmentation masks and then refine them on [Segments.ai]( a labeling platform with smart labeling tools for image segmentation. First, create an account at [ and install the Segments Python SDK. Then you can initialize the Segments.ai Python client using an API key. This key can be found on [the account page]( ```python !pip install -q segments-ai ``` ```python from segments import SegmentsClient from getpass import getpass api_key = getpass('Enter your API key: ') segments_client = SegmentsClient(api_key) ``` Next, let\\'s load an image from a dataset using the Segments client. We\\'ll use the [a2d2 self-driving dataset]( You can also create your own dataset by following [these instructions]( ```python samples = segments_client.get_samples(\"admin-tobias/clipseg\") # Use the last image as an example sample = samples[1] image = Image.open(requests.get(sample.attributes.image.url, stream=True).raw) image ``` We also need to get the category names from the dataset attributes. ```python dataset = segments_client.get_dataset(\"admin-tobias/clipseg\") category_names = [category.name for category in dataset.task_attributes.categories] ``` Now we can use CLIPSeg on the image as before. This time, we\\'ll also scale up the outputs so that they match the input image\\'s size. ```python from torch import nn inputs = processor(text=category_names, images=[image] * len(category_names), padding=\"max_length\", return_tensors=\"pt\") # predict with torch.no_grad(): outputs = model(**inputs) # resize the outputs preds = nn.functional.interpolate( outputs.logits.unsqueeze(1), size=(image.size[1], image.size[0]), mode=\"bilinear\" ) ``` And we can visualize the results again. ```python len_cats = len(category_names) _, ax = plt.subplots(1, len_cats + 1, figsize=(3*(len_cats + 1), 4)) [a.axis('off') for a in ax.flatten()] ax[0].imshow(image) [ax[i+1].imshow(torch.sigmoid(preds[i][0])) for i in range(len_cats)]; [ax[i+1].text(0, -15, category_name) for i, category_name in enumerate(category_names)]; ``` Now we have to combine the predictions to a single segmented image. We\\'ll simply do this by taking the category with the greatest sigmoid value for each patch. We\\'ll also make sure that all the values under a certain threshold do not count. ```python threshold = 0.1 flat_preds = torch.sigmoid(preds.squeeze()).reshape((preds.shape[0], -1)) # Initialize a dummy \"unlabeled\" mask with the threshold flat_preds_with_treshold = torch.full((preds.shape[0] + 1, flat_preds.shape[-1]), threshold) flat_preds_with_treshold[1:preds.shape[0]+1,:] = flat_preds # Get the top mask index for each pixel inds = torch.topk(flat_preds_with_treshold, 1, dim=0).indices.reshape((preds.shape[-2], preds.shape[-1])) ``` Let\\'s quickly visualize the result. ```python plt.imshow(inds) ``` Lastly, we can upload the prediction to Segments.ai. To do that, we\\'ll first convert the bitmap to a png file, then we\\'ll upload this file to the Segments, and finally we\\'ll add the label to the sample. 
```python from segments.utils import bitmap2file import numpy as np inds_np = inds.numpy().astype(np.uint32) unique_inds = np.unique(inds_np).tolist() f = bitmap2file(inds_np, is_segmentation_bitmap=True) asset = segments_client.upload_asset(f, \"clipseg_prediction.png\") attributes = { 'format_version': '0.1', 'annotations': [{\"id\": i, \"category_id\": i} for i in unique_inds if i != 0], 'segmentation_bitmap': { 'url': asset.url }, } segments_client.add_label(sample.uuid, 'ground-truth', attributes) ``` If you take a look at the [uploaded prediction on Segments.ai]( you can see that it\\'s not perfect. However, you can manually correct the biggest mistakes, and then you can use the corrected dataset to train a better model than CLIPSeg. ## Conclusion CLIPSeg is a zero-shot segmentation model that works with both text and image prompts. The model adds a decoder to CLIP and can segment almost anything. However, the output segmentation masks are still very low-res for now, so you\u2019ll probably still want to fine-tune a different segmentation model if accuracy is important. Note that there's more research on zero-shot segmentation currently being conducted, so you can expect more models to be added in the near future. One example is [GroupViT]( which is already available in Transformers. To stay up to date with the latest news in segmentation research, you can follow us on Twitter: [@TobiasCornille]( [@NielsRogge]( and [@huggingface]( If you\u2019re interested in learning how to fine-tune a state-of-the-art segmentation model, check out our previous blog post: ["}
{"title": "cnil.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Hugging Face Selected for the French Data Protection Agency Enhanced Support Program\" thumbnail: /blog/assets/146_cnil-accompaniment/logo.png authors: - user: yjernite - user: julien-c - user: annatrdj - user: Ima1 --- Hugging Face Selected for the French Data Protection Agency Enhanced Support Program *This blog post was originally published on [LinkedIn on 05/15/2023]( We are happy to announce that Hugging Face has been selected by the [CNIL]( (French Data Protection Authority) to benefit from its [Enhanced Support program]( This new program picked three companies with \u201cstrong potential for economic development\u201d out of over 40 candidates, who will receive support in understanding and implementing their duties with respect to data protection - a daunting and necessary endeavor in the context of the rapidly evolving field of Artificial Intelligence. When it comes to respecting people\u2019s privacy rights, the recent developments in ML and AI pose new questions, and engender new challenges. We have been particularly sensitive to these challenges in our own work at Hugging Face and in our collaborations. The [BigScience Workshop]( that we hosted in collaboration with hundreds of researchers from many different countries and institutions was the first Large Language Model training effort to [visibly put privacy front and center]( through a multi-pronged approach covering [data selection and governance, data processing, and model sharing]( The more recent [BigCode project]( co-hosted with [ServiceNow]( also dedicated significant resources to [addressing privacy risks]( creating [new tools to support pseudonymization]( that will benefit other projects. These efforts help us better understand what is technically necessary and feasible at various levels of the AI development process so we can better address legal requirements and risks tied to personal data. The accompaniment program from the CNIL, benefiting from its expertise and role as France\u2019s Data Protection Agency, will play an instrumental role in supporting our broader efforts to push GDPR compliance forward and provide clarity for our community of users on questions of privacy and data protection. We look forward to working together on addressing these questions with more foresight, and helping develop amazing new ML technology that does respect people\u2019s data rights!"}
{"title": "codeparrot.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: Training CodeParrot from Scratch thumbnail: /blog/assets/40_codeparrot/thumbnail.png authors: - user: leandro --- Training CodeParrot from Scratch In this blog post we'll take a look at what it takes to build the technology behind [GitHub CoPilot]( an application that provides suggestions to programmers as they code. In this step by step guide, we'll learn how to train a large GPT-2 model called CodeParrot , entirely from scratch. CodeParrot can auto-complete your Python code - give it a spin [here]( Let's get to building it from scratch!  ## Creating a Large Dataset of Source Code The first thing we need is a large training dataset. With the goal to train a Python code generation model, we accessed the GitHub dump available on Google's BigQuery and filtered for all Python files. The result is a 180 GB dataset with 20 million files (available [here]( After initial training experiments, we found that the duplicates in the dataset severely impacted the model performance. Further investigating the dataset we found that: - 0.1% of the unique files make up 15% of all files - 1% of the unique files make up 35% of all files - 10% of the unique files make up 66% of all files You can learn more about our findings in [this Twitter thread]( We removed the duplicates and applied the same cleaning heuristics found in the [Codex paper]( Codex is the model behind CoPilot and is a GPT-3 model fine-tuned on GitHub code. The cleaned dataset is still 50GB big and available on the Hugging Face Hub: [codeparrot-clean]( With that we can setup a new tokenizer and train a model. ## Initializing the Tokenizer and Model First we need a tokenizer. Let's train one specifically on code so it splits code tokens well. We can take an existing tokenizer (e.g. GPT-2) and directly train it on our own dataset with the `train_new_from_iterator()` method. We then push it to the Hub. Note that we omit imports, arguments parsing and logging from the code examples to keep the code blocks compact. But you'll find the full code including preprocessing and downstream task evaluation [here]( ```Python # Iterator for Training def batch_iterator(batch_size=10): for _ in tqdm(range(0, args.n_examples, batch_size)): yield [next(iter_dataset)[\"content\"] for _ in range(batch_size)] # Base tokenizer tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\") base_vocab = list(bytes_to_unicode().values()) # Load dataset dataset = load_dataset(\"lvwerra/codeparrot-clean\", split=\"train\", streaming=True) iter_dataset = iter(dataset) # Training and saving new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=args.vocab_size, initial_alphabet=base_vocab) new_tokenizer.save_pretrained(args.tokenizer_name, push_to_hub=args.push_to_hub) ``` Learn more about tokenizers and how to build them in the [Hugging Face course]( See that inconspicuous `streaming=True` argument? This small change has a big impact: instead of downloading the full (50GB) dataset this will stream individual samples as needed saving a lot of disk space! Checkout the [Hugging Face course]( ) for more information on streaming. Now, we initialize a new model. We\u2019ll use the same hyperparameters as GPT-2 large (1.5B parameters) and adjust the embedding layer to fit our new tokenizer also adding some stability tweaks. 
The `scale_attn_by_layer_idx` flag makes sure we scale the attention by the layer id and `reorder_and_upcast_attn` mainly makes sure that we compute the attention in full precision to avoid numerical issues. We push the freshly initialized model to the same repo as the tokenizer. ```Python # Load codeparrot tokenizer trained for Python code tokenization tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name) # Configuration config_kwargs = {\"vocab_size\": len(tokenizer), \"scale_attn_by_layer_idx\": True, \"reorder_and_upcast_attn\": True} # Load model with config and push to hub config = AutoConfig.from_pretrained('gpt2-large', **config_kwargs) model = AutoModelForCausalLM.from_config(config) model.save_pretrained(args.model_name, push_to_hub=args.push_to_hub) ``` Now that we have an efficient tokenizer and a freshly initialized model we can start with the actual training loop. ## Implementing the Training Loop We train with the [ Accelerate]( library which allows us to scale the training from our laptop to a multi-GPU machine without changing a single line of code. We just create an accelerator and do some argument housekeeping: ```Python accelerator = Accelerator() acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()} parser = HfArgumentParser(TrainingArguments) args = parser.parse_args() args = Namespace(**vars(args), **acc_state) samples_per_step = accelerator.state.num_processes * args.train_batch_size set_seed(args.seed) ``` We are now ready to train! Let's use the `huggingface_hub` client library to clone the repository with the new tokenizer and model. We will checkout to a new branch for this experiment. With that setup, we can run many experiments in parallel and in the end we just merge the best one into the main branch. ```Python # Clone model repository if accelerator.is_main_process: hf_repo = Repository(args.save_dir, clone_from=args.model_ckpt) # Checkout new branch on repo if accelerator.is_main_process: hf_repo.git_checkout(run_name, create_branch_ok=True) ``` We can directly load the tokenizer and model from the local repository. Since we are dealing with big models we might want to turn on [gradient checkpointing]( to decrease the GPU memory footprint during training. ```Python # Load model and tokenizer model = AutoModelForCausalLM.from_pretrained(args.save_dir) if args.gradient_checkpointing: model.gradient_checkpointing_enable() tokenizer = AutoTokenizer.from_pretrained(args.save_dir) ``` Next up is the dataset. We make training simpler with a dataset that yields examples with a fixed context size. To not waste too much data (some samples are too short or too long) we can concatenate many examples with an EOS token and then chunk them.  The more sequences we prepare together, the smaller the fraction of tokens we discard (the grey ones in the previous figure). Since we want to stream the dataset instead of preparing everything in advance we use an `IterableDataset`. 
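To see why packing more examples into the buffer wastes fewer tokens, here is a tiny, illustrative calculation (it ignores the EOS separators and is not part of the actual training script):

```python
# Illustration of the packing argument above: only the last, incomplete chunk of a
# buffer is discarded, so larger buffers waste a smaller fraction of their tokens.
seq_length = 1024
example_len = 700  # assume examples of ~700 tokens each

def discarded_fraction(n_examples):
    total = n_examples * example_len
    return (total % seq_length) / total

print(f"{discarded_fraction(2):.1%}")     # ~26.9% of tokens thrown away with a tiny buffer
print(f"{discarded_fraction(1000):.2%}")  # ~0.09% with a large buffer
```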
The full dataset class looks as follows: ```Python class ConstantLengthDataset(IterableDataset): def __init__( self, tokenizer, dataset, infinite=False, seq_length=1024, num_of_sequences=1024, chars_per_token=3.6 ): self.tokenizer = tokenizer self.concat_token_id = tokenizer.bos_token_id self.dataset = dataset self.seq_length = seq_length self.input_characters = seq_length * chars_per_token * num_of_sequences self.epoch = 0 self.infinite = infinite def __iter__(self): iterator = iter(self.dataset) more_examples = True while more_examples: buffer, buffer_len = [], 0 while True: if buffer_len >= self.input_characters: break try: buffer.append(next(iterator)[\"content\"]) buffer_len += len(buffer[-1]) except StopIteration: if self.infinite: iterator = iter(self.dataset) self.epoch += 1 logger.info(f\"Dataset epoch: {self.epoch}\") else: more_examples = False break tokenized_inputs = self.tokenizer(buffer, truncation=False)[\"input_ids\"] all_token_ids = [] for tokenized_input in tokenized_inputs: all_token_ids.extend(tokenized_input + [self.concat_token_id]) for i in range(0, len(all_token_ids), self.seq_length): input_ids = all_token_ids[i : i + self.seq_length] if len(input_ids) == self.seq_length: yield torch.tensor(input_ids) ``` Texts in the buffer are tokenized in parallel and then concatenated. Chunked samples are then yielded until the buffer is empty and the process starts again. If we set `infinite=True` the dataset iterator restarts at its end. ```Python def create_dataloaders(args): ds_kwargs = {\"streaming\": True} train_data = load_dataset(args.dataset_name_train, split=\"train\", streaming=True) train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=args.seed) valid_data = load_dataset(args.dataset_name_valid, split=\"train\", streaming=True) train_dataset = ConstantLengthDataset(tokenizer, train_data, infinite=True, seq_length=args.seq_length) valid_dataset = ConstantLengthDataset(tokenizer, valid_data, infinite=False, seq_length=args.seq_length) train_dataloader = DataLoader(train_dataset, batch_size=args.train_batch_size) eval_dataloader = DataLoader(valid_dataset, batch_size=args.valid_batch_size) return train_dataloader, eval_dataloader train_dataloader, eval_dataloader = create_dataloaders(args) ``` Before we start training we need to set up the optimizer and learning rate schedule. We don\u2019t want to apply weight decay to biases and LayerNorm weights so we use a helper function to exclude those. ```Python def get_grouped_params(model, args, no_decay=[\"bias\", \"LayerNorm.weight\"]): params_with_wd, params_without_wd = [], [] for n, p in model.named_parameters(): if any(nd in n for nd in no_decay): params_without_wd.append(p) else: params_with_wd.append(p) return [{\"params\": params_with_wd, \"weight_decay\": args.weight_decay}, {\"params\": params_without_wd, \"weight_decay\": 0.0},] optimizer = AdamW(get_grouped_params(model, args), lr=args.learning_rate) lr_scheduler = get_scheduler(name=args.lr_scheduler_type, optimizer=optimizer, num_warmup_steps=args.num_warmup_steps, num_training_steps=args.max_train_steps,) ``` A big question that remains is how all the data and models will be distributed across several GPUs. This sounds like a complex task but actually only requires a single line of code with Accelerate. 
```Python model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader) ``` Under the hood it'll use DistributedDataParallel, which means a batch is sent to each GPU worker which has its own copy of the model. There the gradients are computed and then aggregated to update the model on each worker.  We also want to evaluate the model from time to time on the validation set so let\u2019s write a function to do just that. This is done automatically in a distributed fashion and we just need to gather all the losses from the workers. We also want to report the [perplexity]( ```Python def evaluate(args): model.eval() losses = [] for step, batch in enumerate(eval_dataloader): with torch.no_grad(): outputs = model(batch, labels=batch) loss = outputs.loss.repeat(args.valid_batch_size) losses.append(accelerator.gather(loss)) if args.max_eval_steps > 0 and step >= args.max_eval_steps: break loss = torch.mean(torch.cat(losses)) try: perplexity = torch.exp(loss) except OverflowError: perplexity = float(\"inf\") return loss.item(), perplexity.item() ``` We are now ready to write the main training loop. It will look pretty much like a normal PyTorch training loop. Here and there you can see that we use the accelerator functions rather than native PyTorch. Also, we push the model to the branch after each evaluation. ```Python # Train model model.train() completed_steps = 0 for step, batch in enumerate(train_dataloader, start=1): loss = model(batch, labels=batch, use_cache=False).loss loss = loss / args.gradient_accumulation_steps accelerator.backward(loss) if step % args.gradient_accumulation_steps == 0: accelerator.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() lr_scheduler.step() optimizer.zero_grad() completed_steps += 1 if step % args.save_checkpoint_steps == 0: eval_loss, perplexity = evaluate(args) accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained(args.save_dir, save_function=accelerator.save) if accelerator.is_main_process: hf_repo.push_to_hub(commit_message=f\"step {step}\") model.train() if completed_steps >= args.max_train_steps: break ``` When we call `wait_for_everyone()` and `unwrap_model()` we make sure that all workers are ready and any model layers that have been added by `prepare()` earlier are removed. We also use gradient accumulation and gradient clipping that are easily implemented. Lastly, after training is complete we run a last evaluation and save the final model and push it to the hub. ```Python # Evaluate and save the last checkpoint logger.info(\"Evaluating and saving model after training\") eval_loss, perplexity = evaluate(args) log_metrics(step, {\"loss/eval\": eval_loss, \"perplexity\": perplexity}) accelerator.wait_for_everyone() unwrapped_model = accelerator.unwrap_model(model) unwrapped_model.save_pretrained(args.save_dir, save_function=accelerator.save) if accelerator.is_main_process: hf_repo.push_to_hub(commit_message=\"final model\") ``` Done! That's all the code to train a full GPT-2 model from scratch with as little as 150 lines. We did not show the imports and logs of the scripts to make the code a little bit more compact. Now let's actually train it! With this code we trained models for our upcoming [book on Transformers and NLP]( a [110M]( and [1.5B]( parameter GPT-2 model. We used a 16 x A100 GPU machine to train these models for 1 day and 1 week, respectively. Enough time to get a coffee and read a book or two! 
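If you want to reproduce such a run yourself, the script is started through the Accelerate CLI. `accelerate config` and `accelerate launch` are the actual CLI entry points, but the script name and flag values below are placeholders that you would adapt to your own setup. ```bash
# Configure Accelerate once for your machine (number of GPUs, mixed precision, etc.)
accelerate config

# Launch the training script on all configured GPUs (script name and flags are illustrative)
accelerate launch codeparrot_training.py --train_batch_size 12 --learning_rate 5e-4
```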
## Evaluation This is still a relatively short training time for pretraining, but we can already observe good downstream performance compared to similar models. We evaluated the models on OpenAI's HumanEval benchmark, which was introduced in the Codex paper. It measures the performance of code generation models on almost 200 coding challenges.  Note that we trained CodeParrot on roughly 25-30B tokens, whereas GPT-Neo was trained on 300B tokens and Codex on 300B (GPT-3 checkpoint) + 100B (code) tokens. So you definitely get a good bang for your buck here. But what is this _pass@k_ metric exactly? Simply put, it measures the probability of at least one program passing the unit tests of a programming problem, given _k_ candidate generations from the model. Naturally, the metric increases with k as more candidates can potentially solve the coding challenge. You can use the metric with `datasets`: ```Python from datasets import load_metric code_eval = load_metric(\"code_eval\") test_cases = [\"assert add(2,3)==5\"] candidates = [[\"def add(a,b): return a*b\", \"def add(a, b): return a+b\"]] pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2]) print(pass_at_k) # -> {'pass@1': 0.5, 'pass@2': 1.0} ``` You can also load OpenAI's HumanEval dataset with `datasets`: ```Python from datasets import load_dataset dataset = load_dataset(\"openai_humaneval\") ``` Now let's have a look at what we get after anxiously waiting for so long! ## Results Let's start slow and check if we can prompt the model to write a function to return the size of a file given its path: **Prompt:** ```Python def get_files_size(filename): ``` **Completion:** ```Python \"\"\" Returns the file size of a file. :param filepath: The path to the file. :type filepath: str The file size. int \"\"\" return os.path.getsize(filepath) ``` So far so good, but let's take this a step further and try something a bit more interesting. One of the less exciting aspects of building production-ready code is writing unit tests. Let's see if we can use CodeParrot to do this for us! **Prompt:** ```Python def is_even(value): \"\"\"Returns True if value is an even number.\"\"\" return value % 2 == 0 # setup unit tests for is_even import unittest ``` **Completion:** ```Python class TestIsEven(unittest.TestCase): \"\"\"Tests for is_even function.\"\"\" def test_is_even(self): \"\"\"Test is_even function.\"\"\" self.assertTrue(is_even(0)) self.assertTrue(is_even(1)) self.assertTrue ``` Although the test condition itself does not look quite right, the model has built all the boilerplate code for us. This alone could already be quite useful and save us a couple of minutes! But what we really want to do is build better transformer models. Let's use CodeParrot to help us do that: **Prompt:** ```Python from transformers import AutoTokenizer, AutoModelForSequenceClassification # build a BERT classifier ``` **Completion:** ```Python tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased') ``` Looking at a few examples is nice, but by now you probably want to try it yourself. We prepared a few online demos just for that! ## Demos You can play with the models in two demos we added to [Hugging Face Spaces]( With the first you can quickly generate code with the model, and with the second you can highlight your code with the model to spot bugs! 
- [Code Generation]( - [Code Highlighting]( You can also directly use the models from the `transformers` library: ```Python from transformers import pipeline pipe = pipeline('text-generation', model='lvwerra/codeparrot') pipe('def hello_world():') ``` ## Summary In this short blog post we walked through all the steps involved in training a large GPT-2 model called CodeParrot for code generation. Using Accelerate, we built a training script of less than 200 lines of code that we can effortlessly scale across many GPUs. With that, you can now train your own GPT-2 model! This post gives a brief overview of CodeParrot, but if you are interested in diving deeper into how to pretrain these models, we recommend reading its dedicated chapter in the upcoming [book on Transformers and NLP]( This chapter provides many more details on building custom datasets, design considerations when training a new tokenizer, and architecture choices."}
{"title": "collaborative-training.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Deep Learning over the Internet: Training Language Models Collaboratively\" thumbnail: /blog/assets/24_sahajBERT/thumbnail.png authors: - user: mryab guest: true - user: SaulLu --- # Deep Learning over the Internet: Training Language Models Collaboratively With the additional help of Quentin Lhoest and Sylvain Lesage. Modern language models often require a significant amount of compute for pretraining, making it impossible to obtain them without access to tens and hundreds of GPUs or TPUs. Though in theory it might be possible to combine the resources of multiple individuals, in practice, such distributed training methods have previously seen limited success because connection speeds over the Internet are way slower than in high-performance GPU supercomputers. In this blog post, we describe [DeDLOC]( \u2014 a new method for collaborative distributed training that can adapt itself to the network and hardware constraints of participants. We show that it can be successfully applied in real-world scenarios by pretraining [sahajBERT]( a model for the Bengali language, with 40 volunteers. On downstream tasks in Bengali, this model achieves nearly state-of-the-art quality with results comparable to much larger models that used hundreds of high-tier accelerators. ## Distributed Deep Learning in Open Collaborations ### Why should we do it? These days, many highest-quality NLP systems are based on large pretrained Transformers. In general, their quality improves with size: you can achieve unparalleled results in natural language understanding and generation by scaling up the parameter count and leveraging the abundance of unlabeled text data. Unfortunately, we use these pretrained models not only because it's convenient. The hardware resources for training Transformers on large datasets often exceed anything affordable to a single person and even most commercial or research organizations. Take, for example, BERT: its training was estimated to cost about $7,000, and for the largest models like GPT-3, this number can be as high as $12 million! This resource limitation might seem obvious and inevitable, but is there really no alternative to using pretrained models for the broader ML community? However, there might be a way out of this situation: to come up with a solution, we only need to take a look around. It might be the case that the computational resources we're looking for are already there; for example, many of us have powerful computers with gaming or workstation GPUs at home. You might've already guessed that we're going to join their power similarly to [Folding@home]( [Rosetta@home]( [Leela Chess Zero]( or different [BOINC]( projects that leverage volunteer computing, but the approach is even more general. For instance, several laboratories can join their smaller clusters to utilize all the available resources, and some might want to join the experiment using inexpensive cloud instances. To a skeptical mind, it might seem that we're missing a key factor here: data transfer in distributed DL is often a bottleneck, since we need to aggregate the gradients from multiple workers. Indeed, any na\u00efve approach to distributed training over the Internet is bound to fail, as most participants don't have gigabit connections and might disconnect from the network at any time. So how on Earth can you train anything with a household data plan? 
:) As a solution to this problem, we propose a new training algorithm, called Distributed Deep Learning in Open Collaborations (or **DeDLOC**), which is described in detail in our recently released [preprint]( Now, let\u2019s find out what are the core ideas behind this algorithm! ### Training with volunteers In its most frequently used version, distributed training with multiple GPUs is pretty straightforward. Recall that when doing deep learning, you usually compute gradients of your loss function averaged across many examples in a batch of training data. In case of _data-parallel_ distributed DL, you simply split the data across multiple workers, compute gradients separately, and then average them once the local batches are processed. When the average gradient is computed on all workers, we adjust the model weights with the optimizer and continue training our model. You can see an illustration of different tasks that are executed below.  Typical machine learning tasks executed by peers in distributed training, possibly with a separation of roles Often, to reduce the amount of synchronization and to stabilize the learning process, we can accumulate the gradients for N batches before averaging, which is equivalent to increasing the actual batch size N times. This approach, combined with the observation that most state-of-the-art language models use large batches, led us to a simple idea: let's accumulate one _very_ large batch across all volunteer devices before each optimizer step! Along with complete equivalence to regular distributed training and easy scalability, this method also has the benefit of built-in fault tolerance, which we illustrate below. Let's consider a couple of potential failure cases that we might encounter throughout a collaborative experiment. By far, the most frequent scenario is that one or several peers disconnect from the training procedure: they might have an unstable connection or simply want to use their GPUs for something else. In this case, we only suffer a minor setback of training: the contribution of these peers gets deducted from the currently accumulated batch size, but other participants will compensate for that with their gradients. Also, if more peers join, the target batch size will simply be reached faster, and our training procedure will naturally speed up. You can see a demonstration of this in the video: ### Adaptive averaging Now that we have discussed the overall training procedure, there remains one more question: how do we actually aggregate the gradients of participants? Most home computers cannot easily accept incoming connections, and the download speed might also become a constraint. Since we rely on volunteer hardware for experiments, a central server is not really a viable option, as it will quickly face overload when scaling to tens of clients and hundreds of millions of parameters. Most data-parallel training runs today don't use this strategy anyway; instead, they rely on All-Reduce \u2014 an efficient all-to-all communication primitive. Thanks to clever algorithmic optimizations, each node can compute the global average without sending the entire local gradient to every peer. Because All-Reduce is decentralized, it seems like a good choice; however, we still need to take the diversity of hardware and network setups into account. 
For example, some volunteers might join from computers that have slow network but powerful GPUs, some might have better connectivity only to a subset of other peers, and some may be firewalled from incoming connections. It turns out we can actually come up with an optimal data transfer strategy on the fly by leveraging this information about performance! On a high level, we split the entire gradient vector into parts depending on the Internet speed of each peer: those with the fastest connection aggregate the largest parts. Also, if some nodes do not accept incoming connections, they simply send their data for aggregation but do not compute the average themselves. Depending on the conditions, this adaptive algorithm can recover well-known distributed DL algorithms and improve on them with a hybrid strategy, as demonstrated below.  Examples of different averaging strategies with the adaptive algorithm. The core techniques for decentralized training are available in Hivemind. Check out the repo and learn how to use this library in your own projects! ## sahajBERT As always, having a well-designed algorithmic framework doesn't mean that it will work as intended in practice, because some assumptions may not hold true in actual training runs. To verify the competitive performance of this technology and to showcase its potential, we organized a special collaborative event to pretrain a masked language model for the Bengali language. Even though it is the fifth most spoken native language in the world, it has [very few]( masked language models openly available, which emphasizes the importance of tools that can empower the community, unlocking a plethora of opportunities in the field. We conducted this experiment with real volunteers from the Neuropark community and used openly available datasets (OSCAR and Wikipedia), because we wanted to have a fully reproducible example that might serve as an inspiration for other groups. Below, we describe the detailed setup of our training run and demonstrate its results. ### Architecture For our experiment, we chose ALBERT _(A Lite BERT)_ \u2014 a model for language representations that is pretrained with Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) as objectives. We use this architecture because weight sharing makes it very parameter-efficient: for example, ALBERT-large has ~18M trainable parameters and performs comparably to BERT-base with ~108M weights on the GLUE benchmark. It means that there is less data to exchange between the peers, which is crucial in our setup, as it significantly speeds up each training iteration. Want to know more about ALBERT? Paper Transformers doc ### Tokenizer The first brick of our model is called a _tokenizer_ and takes care of transforming raw text into vocabulary indices. Because we are training a model for Bengali, which is not very similar to English, we need to implement language-specific preprocessing as a part of our tokenizer. We can view it as a sequence of operations: 1. **Normalization:** includes all preprocessing operations on raw text data. This was the step at which we have made the most changes, because removing certain details can either change the meaning of the text or leave it the same, depending on the language. For example, the standard ALBERT normalizer removes the accents, while for the Bengali language, we need to keep them, because they contain information about the vowels. 
As a result, we use the following operations: NMT normalization, NFKC normalization, removal of multiple spaces, homogenization of recurring Unicode characters in the Bengali language, and lowercasing. 2. **Pretokenization** describes rules for splitting the input (for example, by whitespace) to enforce specific token boundaries. As in the original work, we have chosen to keep the whitespace out of the tokens. Therefore, to distinguish the words from each other and not to have multiple single-space tokens, each token corresponding to the beginning of a word starts with a special character \u201c\\_\u201d (U+2581). In addition, we isolated all punctuation and digits from other characters to condense our vocabulary. 3. **Tokenizer modeling:** It is at this level that the text is mapped into a sequence of elements of a vocabulary. There are several algorithms for this, such as Byte-Pair Encoding (BPE) or Unigram, and most of them need to build the vocabulary from a text corpus. Following the setup of ALBERT, we used the **Unigram Language Model** approach, training a vocabulary of 32k tokens on the deduplicated Bengali part of the OSCAR dataset. 4. **Post-processing:** After tokenization, we might want to add several special tokens required by the architecture, such as starting the sequence with a special token `[CLS]` or separating two segments with a special token `[SEP]`. Since our main architecture is the same as the original ALBERT, we keep the same post-processing: specifically, we add a `[CLS]` token at the beginning of each example and a `[SEP]` token both between two segments and at the end. Read more information about each component in Tokenizers doc You can reuse our tokenizer by running the following code: ```python from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(\"neuropark/sahajBERT\") ``` ### Dataset The last thing we need to cover is the training dataset. As you probably know, the great strength of pretrained models like BERT or ALBERT is that you don't need an annotated dataset, but just a lot of texts. To train sahajBERT, we used the [Bengali Wikipedia dump from 03/20/2021]( and the Bengali subset of [OSCAR]( (600MB + 6GB of text). These two datasets can easily be downloaded from the HF Hub. However, loading an entire dataset requires time and storage \u2014 two things that our peers do not necessarily have. To make the most of the resources provided by the participants, we have implemented **dataset streaming**, which allows them to train the model nearly as soon as they join the network. Specifically, the examples in the dataset are downloaded and transformed in parallel to the training. We can also shuffle the dataset so that our peers have little chance to process the same examples at the same time. As the dataset is not downloaded and preprocessed in advance, the transformations needed to go from plain text to a training example (shown in the figure below) are done on the fly.  From a raw sample to a training sample The dataset streaming mode is available from version v1.9 of the datasets library, so you can use it right now as follows: ```python from datasets import load_dataset oscar_dataset = load_dataset(\"oscar\", name=\"unshuffled_deduplicated_bn\", streaming=True) ``` Learn more about loading datasets in streaming mode in the documentation ### Collaborative event The sahajBERT collaborative training event took place from May 12 to May 21. 
The event brought together 40 participants, 30 of whom were Bengali-speaking volunteers and 10 of whom were volunteers from one of the authors' organizations. These 40 volunteers joined the [Neuropark]( Discord channel to receive all information regarding the event and participate in discussions. To join the experiment, volunteers were asked to: 1. Send their username to the moderators to be allowlisted; 2. Open the provided notebook locally, on Google Colaboratory, or on Kaggle; 3. Run one code cell and fill in their Hugging Face credentials when requested; 4. Watch the training loss decrease on the shared dashboards! For security purposes, we set up an authorization system so that only members of the Neuropark community could train the model. Sparing you the technical details, our authorization protocol allows us to guarantee that every participant is in the allowlist and to acknowledge the individual contribution of each peer. In the following figure, you can see the activity of each volunteer. Over the experiment, the volunteers logged in 600 different sessions. Participants regularly launched multiple runs in parallel, and many of them spread out the runs they launched over time. The runs of individual participants lasted 4 hours on average, and the maximum length was 21 hours. You can read more about the participation statistics in the paper. Chart showing participants of the sahajBERT experiment. Circle radius is relative to the total number of processed batches; the circle is greyed out if the participant is not active. Every purple square represents an active device, and a darker color corresponds to higher performance Along with the resources provided by participants, we also used 16 preemptible (cheap but frequently interrupted) single-GPU T4 cloud instances to ensure the stability of the run. The cumulative runtime for the experiment was 234 days, and in the figure below you can see parts of the loss curve that each peer contributed to! The final model was uploaded to the Model Hub, so you can download and play with it if you want to: [ ### Evaluation To evaluate the performance of sahajBERT, we finetuned it on two downstream tasks in Bengali: - Named entity recognition (NER) on the Bengali split of [WikiANN]( The goal of this task is to classify each token in the input text into one of the following categories: person, organization, location, or none of them. - News Category Classification (NCC) on the Soham articles dataset from [IndicGLUE]( The goal of this task is to predict the category to which the input text belongs. We evaluated it during training on the NER task to check that everything was going well; as you can see in the following plot, this was indeed the case! Evaluation metrics of fine-tuned models on the NER task from different checkpoints of pre-trained models. At the end of training, we compared sahajBERT with three other pretrained language models: [XLM-R Large]( [IndicBert]( and [bnRoBERTa]( In the table below, you can see that our model has results comparable to the best Bengali language models available on HF Hub, even though our model has only ~18M trained parameters, while, for instance, XLM-R (a strong multilingual baseline) has ~559M parameters and was trained on several hundred V100 GPUs. 
| Model | NER F1 (mean \u00b1 std) | NCC Accuracy (mean \u00b1 std) | |---|---|---| |[sahajBERT]( | 95.45 \u00b1 0.53| 91.97 \u00b1 0.47| |[XLM-R-large]( | 96.48 \u00b1 0.22| 90.05 \u00b1 0.38| |[IndicBert]( | 92.52 \u00b1 0.45| 74.46 \u00b1 1.91| |[bnRoBERTa]( |82.32 \u00b1 0.67|80.94 \u00b1 0.45| These models are available on the Hub as well. You can test them directly by playing with the Hosted Inference API widget on their Model Cards or by loading them directly in your Python code. #### sahajBERT-NER Model card: [ ```python from transformers import ( AlbertForTokenClassification, TokenClassificationPipeline, PreTrainedTokenizerFast, ) # Initialize tokenizer tokenizer = PreTrainedTokenizerFast.from_pretrained(\"neuropark/sahajBERT-NER\") # Initialize model model = AlbertForTokenClassification.from_pretrained(\"neuropark/sahajBERT-NER\") # Initialize pipeline pipeline = TokenClassificationPipeline(tokenizer=tokenizer, model=model) raw_text = \"\u098f\u0987 \u0987\u0989\u09a8\u09bf\u09af\u09bc\u09a8\u09c7 \u09e9 \u099f\u09bf \u09ae\u09cc\u099c\u09be \u0993 \u09e7\u09e6 \u099f\u09bf \u0997\u09cd\u09b0\u09be\u09ae \u0986\u099b\u09c7 \u0964\" # Change me output = pipeline(raw_text) ``` #### sahajBERT-NCC Model card: [ ```python from transformers import ( AlbertForSequenceClassification, TextClassificationPipeline, PreTrainedTokenizerFast, ) # Initialize tokenizer tokenizer = PreTrainedTokenizerFast.from_pretrained(\"neuropark/sahajBERT-NCC\") # Initialize model model = AlbertForSequenceClassification.from_pretrained(\"neuropark/sahajBERT-NCC\") # Initialize pipeline pipeline = TextClassificationPipeline(tokenizer=tokenizer, model=model) raw_text = \"\u098f\u0987 \u0987\u0989\u09a8\u09bf\u09af\u09bc\u09a8\u09c7 \u09e9 \u099f\u09bf \u09ae\u09cc\u099c\u09be \u0993 \u09e7\u09e6 \u099f\u09bf \u0997\u09cd\u09b0\u09be\u09ae \u0986\u099b\u09c7 \u0964\" # Change me output = pipeline(raw_text) ``` ## Conclusion In this blog post, we have discussed the method that can enable collaborative pretraining of neural networks with sahajBERT as the first truly successful example of applying it to a real-world problem. What does this all mean for the broader ML community? First, it is now possible to run large-scale distributed pretraining with your friends, and we hope to see a lot of cool new models that were previously less feasible to obtain. Also, our result might be important for multilingual NLP, since now the community for any language can train their own models without the need for significant computational resources concentrated in one place. ## Acknowledgements The DeDLOC paper and sahajBERT training experiment were created by Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, and Gennady Pekhimenko. This project is the result of a collaboration between [Hugging Face]( [Yandex Research]( [HSE University]( [MIPT]( [University of Toronto]( and [Vector Institute]( In addition, we would like to thank Stas Bekman, Dmitry Abulkhanov, Roman Zhytar, Alexander Ploshkin, Vsevolod Plokhotnyuk and Roman Kail for their invaluable help with building the training infrastructure. Also, we thank Abhishek Thakur for helping with downstream evaluation and Tanmoy Sarkar with Omar Sanseviero, who helped us organize the collaborative experiment and gave regular status updates to the participants over the course of the training run. 
Below, you can see all participants of the collaborative experiment: ## References \"Distributed Deep Learning in Open Collaborations\", [ArXiv]( Code for [sahajBERT experiments]( in the DeDLOC repository."}
{"title": "community-update.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: Introducing Pull Requests and Discussions thumbnail: /blog/assets/76_community_update/thumbnail.png --- # Introducing Pull Requests and Discussions  We are thrilled to announce the release of our latest collaborative features: pull requests and discussions on the Hugging Face Hub! Pull requests and discussions are available today under the [community tab]( for all repository types: models, datasets, and Spaces. Any member of the community can create and participate in discussions and pull requests, facilitating collaborations not only within teams, but also with everyone else in the community! It's the biggest update ever done to the Hub, and we can't wait to see the community members start collaborating with it . The new \"Community\" tab also aligns with proposals in ethical ML throughout the years. Feedback and iterations have a central place in the development of ethical machine learning software. We really believe having it in the community's toolset will unlock new kinds of positive patterns in ML, collaborations, and progress. Some example use cases for discussions and pull requests: - Propose suggestions in model cards to improve disclosures of ethical biases. - Let users flag concerning generations of a given Space demo. - Provide a venue through which model and dataset authors can have a direct discussion with community members. - Allow others to improve your repositories! For example, users might want to provide TensorFlow weights! ## Discussions  [Discussions]( allow community members ask and answer questions as well as share their ideas and suggestions directly with the repository owners and the community. Anyone can create and participate in discussions in the community tab of a repository. ## Pull requests  [Pull requests]( allow community members open, comment, merge, or close pull requests directly from the website. The easiest way to open a pull request is to use the \"Collaborate\" button in the \"Files and versions\" tab. It will let you do single file contributions very easily. Under the hood, our Pull requests do not use forks and branches, but instead, custom \"branches\" called `refs` that are stored directly on the source repo. This approach to avoids the need to create a forks for each new version of the model/dataset. ## How is this different from other git hosts At a high level, we aim to build a simpler version of other git hosts' (like GitHub's) PRs and Issues: - no forks are involved: contributors push to a special `ref` branch directly on the source repo - no hard distinction between issues and PRs: they are essentially the same so we display them in the same lists - streamlined for ML (i.e. models/datasets/Spaces repos), not arbitrary repos ## What's next Of course, it's only the beginning. We will listen to the community feedback to add new features and improve the community tab in the future. If you have any feedback, you can [join the discussion here]( Today is the best time to join your first discussion and open a PR!"}
{"title": "constrained-beam-search.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: Guiding Text Generation with Constrained Beam Search in Transformers thumbnail: /blog/assets/53_constrained_beam_search/thumbnail.png authors: - user: cwkeam guest: true --- # Guiding Text Generation with Constrained Beam Search in Transformers ## **Introduction** This blog post assumes that the reader is familiar with text generation methods using the different variants of beam search, as explained in the blog post: [\"How to generate text: using different decoding methods for language generation with Transformers\"]( Unlike ordinary beam search, **constrained** beam search allows us to exert control over the output of text generation. This is useful because we sometimes know exactly what we want inside the output. For example, in a Neural Machine Translation task, we might know which words must be included in the final translation with a dictionary lookup. Sometimes, generation outputs that are almost equally possible to a language model might not be equally desirable for the end-user due to the particular context. Both of these situations could be solved by allowing the users to tell the model which words must be included in the end output. ### **Why It's Difficult** However, this is actually a very non-trivial problem. This is because the task requires us to force the generation of certain subsequences *somewhere* in the final output, at *some point* during the generation. Let's say that we're want to generate a sentence `S` that has to include the phrase \\\\( p_1=\\{ t_1, t_2 \\} \\\\) with tokens \\\\( t_1, t_2 \\\\) in order. Let's define the expected sentence \\\\( S \\\\) as: $$ S_{expected} = \\{ s_1, s_2, ..., s_k, t_1, t_2, s_{k+1}, ..., s_n \\} $$ The problem is that beam search generates the sequence *token-by-token*. Though not entirely accurate, one can think of beam search as the function \\\\( B(\\mathbf{s}_{0:i}) = s_{i+1} \\\\), where it looks at the currently generated sequence of tokens from \\\\( 0 \\\\) to \\\\( i \\\\) then predicts the next token at \\\\( i+1 \\\\) . But how can this function know, at an arbitrary step \\\\( i k \\\\) ?  *and* also the phrase \\\\( p_2=\\{ t_3, t_4, t_5, t_6\\} \\\\) ? What if you want the model to **choose between** the two phrases? What if we want to force the phrase \\\\( p_1 \\\\) and force just one phrase among the list of phrases \\\\( \\{p_{21}, p_{22}, p_{23}\\} \\\\) ? The above examples are actually very reasonable use-cases, as it will be shown below, and the new constrained beam search feature allows for all of them! This post will quickly go over what the new ***constrained beam search*** feature can do for you and then go into deeper details about how it works under the hood. ## **Example 1: Forcing a Word** Let's say we're trying to translate `\"How old are you?\"` to German. `\"Wie alt bist du?\"` is what you'd say in an informal setting, and `\"Wie alt sind Sie?\"` is what you'd say in a formal setting. And depending on the context, we might want one form of formality over the other, but how do we tell the model that? 
### **Traditional Beam Search** Here's how we would do text translation in the ***traditional beam search setting.*** ``` !pip install -q git+ ``` ```python from transformers import AutoTokenizer, AutoModelForSeq2SeqLM tokenizer = AutoTokenizer.from_pretrained(\"t5-base\") model = AutoModelForSeq2SeqLM.from_pretrained(\"t5-base\") encoder_input_str = \"translate English to German: How old are you?\" input_ids = tokenizer(encoder_input_str, return_tensors=\"pt\").input_ids outputs = model.generate( input_ids, num_beams=10, num_return_sequences=1, no_repeat_ngram_size=1, remove_invalid_values=True, ) print(\"Output:\\n\" + 100 * '-') print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` Output: ---------------------------------------------------------------------------------------------------- Wie alt bist du? ### **With Constrained Beam Search** But what if we knew that we wanted a formal output instead of the informal one? What if we knew from prior knowledge what the generation must include, and we could *inject it* into the generation? The following is what is possible now with the `force_words_ids` keyword argument to `model.generate()`: ```python tokenizer = AutoTokenizer.from_pretrained(\"t5-base\") model = AutoModelForSeq2SeqLM.from_pretrained(\"t5-base\") encoder_input_str = \"translate English to German: How old are you?\" force_words = [\"Sie\"] input_ids = tokenizer(encoder_input_str, return_tensors=\"pt\").input_ids force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids outputs = model.generate( input_ids, force_words_ids=force_words_ids, num_beams=5, num_return_sequences=1, no_repeat_ngram_size=1, remove_invalid_values=True, ) print(\"Output:\\n\" + 100 * '-') print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` Output: ---------------------------------------------------------------------------------------------------- Wie alt sind Sie? As you can see, we were able to guide the generation with prior knowledge about our desired output. Previously we would've had to generate a bunch of possible outputs, then filter the ones that fit our requirement. Now we can do that at the generation stage. ## **Example 2: Disjunctive Constraints** We mentioned above a use-case where we know which words we want to be included in the final output. An example of this might be using a dictionary lookup during neural machine translation. But what if we don't know which *word forms* to use, where we'd want outputs like `[\"raining\", \"rained\", \"rains\", ...]` to be equally possible? In a more general sense, there are always cases when we don't want the *exact word verbatim*, letter by letter, and might be open to other related possibilities too. Constraints that allow for this behavior are ***Disjunctive Constraints***, which allow the user to input a list of words, whose purpose is to guide the generation such that the final output must contain just *at least one* among the list of words. 
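As a minimal sketch of a purely disjunctive constraint, we can pass a nested list to `force_words_ids`, exactly like the mixed example that follows; the prompt and word forms here are just an illustration, and the generated text will vary. ```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained(\"gpt2\")
tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")

# A single disjunctive constraint: the output must contain at least one of these word forms
force_flexible = [\"raining\", \"rained\", \"rains\"]
force_words_ids = [
    tokenizer(force_flexible, add_prefix_space=True, add_special_tokens=False).input_ids,
]

input_ids = tokenizer(\"The weather forecast says\", return_tensors=\"pt\").input_ids

outputs = model.generate(
    input_ids,
    force_words_ids=force_words_ids,
    num_beams=5,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```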
Here's an example that uses a mix of the above two types of constraints: ```python from transformers import GPT2LMHeadModel, GPT2Tokenizer model = GPT2LMHeadModel.from_pretrained(\"gpt2\") tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\") force_word = \"scared\" force_flexible = [\"scream\", \"screams\", \"screaming\", \"screamed\"] force_words_ids = [ tokenizer([force_word], add_prefix_space=True, add_special_tokens=False).input_ids, tokenizer(force_flexible, add_prefix_space=True, add_special_tokens=False).input_ids, ] starting_text = [\"The soldiers\", \"The child\"] input_ids = tokenizer(starting_text, return_tensors=\"pt\").input_ids outputs = model.generate( input_ids, force_words_ids=force_words_ids, num_beams=10, num_return_sequences=1, no_repeat_ngram_size=1, remove_invalid_values=True, ) print(\"Output:\\n\" + 100 * '-') print(tokenizer.decode(outputs[0], skip_special_tokens=True)) print(tokenizer.decode(outputs[1], skip_special_tokens=True)) ``` Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation. Output: ---------------------------------------------------------------------------------------------------- The soldiers, who were all scared and screaming at each other as they tried to get out of the The child was taken to a local hospital where she screamed and scared for her life, police said. As you can see, the first output used `\"screaming\"`, the second output used `\"screamed\"`, and both used `\"scared\"` verbatim. The list to choose from `[\"screaming\", \"screamed\", ...]` doesn't have to be word forms; this can satisfy any use-case where we need just one from a list of words. ## **Traditional Beam Search** The following is an example of traditional **beam search**, taken from a previous [blog post]( At every step of the generation, beam search branches out by considering the most probable next tokens for each current hypothesis, ranks the resulting candidate sequences, and keeps only the `num_beams` most probable ones. We can't simply keep every branch, since we would otherwise need to track \\\\( \\text{beams}^n \\\\) beams for \\\\( n \\\\) steps, which becomes very large very quickly ( \\\\( 10 \\\\) beams after \\\\( 10 \\\\) steps is \\\\( 10,000,000,000 \\\\) beams!). For the rest of the generation, we repeat the above step until the ending criterion has been met, like generating the end-of-sequence token or reaching `max_length`, for example. Branch out, rank, reduce, and repeat. ## **Constrained Beam Search** Constrained beam search attempts to fulfill the constraints by *injecting* the desired tokens at every step of the generation. Let's say that we're trying to force the phrase `\"is fast\"` in the generated output. In the traditional beam search setting, we find the top `k` most probable next tokens at each branch and append them for consideration. In the constrained setting, we do the same but also append the tokens that will take us *closer to fulfilling our constraints*. Here's a demonstration: on top of the usual high-probability next tokens, we also append the token that moves each beam closer to fulfilling its constraint, and we then sort the resulting beams into *banks*, where Bank \\\\( n \\\\) refers to the ***list of beams that have made \\\\( n \\\\) steps progress in fulfilling the constraints***. After sorting all the possible beams into their respective banks, we do a round-robin selection. With the above example, we'd select the most probable output from Bank 2, then most probable from Bank 1, one from Bank 0, the second most probable from Bank 2, the second most probable from Bank 1, and so forth. Since we're using `num_beams=3`, we just do the above process three times to end up with `[\"The is fast\", \"The dog is\", \"The dog and\"]`. This way, even though we're *forcing* the model to consider the branch where we've manually appended the desired token, we still keep track of other high-probability sequences that probably make more sense. Even though `\"The is fast\"` fulfills our constraint completely, it's not a very sensible phrase. 
Luckily, we have `\"The dog is\"` and `\"The dog and\"` to work with in future steps, which hopefully will lead to more sensible outputs later on. This behavior is demonstrated in the third step of the above example: . Also, notice how beams like `\"The dog is slow\"` or `\"The dog is mad\"` are actually in Bank 0, since, although it includes the token `\"is\"`, it must restart from the beginning to generate `\"is fast\"`. By appending something like `\"slow\"` after `\"is\"`, it has effectively *reset its progress*. And finally notice how we ended up at a sensible output that contains our constraint phrase: `\"The dog is fast\"`! We were worried initially because blindly appending the desired tokens led to nonsensical phrases like `\"The is fast\"`. However, using round-robin selection from banks, we implicitly ended up getting rid of nonsensical outputs in preference for the more sensible outputs. ## **More About `Constraint` Classes and Custom Constraints** The main takeaway from the explanation can be summarized as the following. At every step, we keep pestering the model to consider the tokens that fulfill our constraints, all the while keeping track of beams that don't, until we end up with reasonably high probability sequences that contain our desired phrases. So a principled way to design this implementation was to represent each constraint as a `Constraint` object, whose purpose was to keep track of its progress and tell the beam search which tokens to generate next. Although we have provided the keyword argument `force_words_ids` for `model.generate()`, the following is what actually happens in the back-end: ```python from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, PhrasalConstraint tokenizer = AutoTokenizer.from_pretrained(\"t5-base\") model = AutoModelForSeq2SeqLM.from_pretrained(\"t5-base\") encoder_input_str = \"translate English to German: How old are you?\" constraints = [ PhrasalConstraint( tokenizer(\"Sie\", add_special_tokens=False).input_ids ) ] input_ids = tokenizer(encoder_input_str, return_tensors=\"pt\").input_ids outputs = model.generate( input_ids, constraints=constraints, num_beams=10, num_return_sequences=1, no_repeat_ngram_size=1, remove_invalid_values=True, ) print(\"Output:\\n\" + 100 * '-') print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` Output: ---------------------------------------------------------------------------------------------------- Wie alt sind Sie? You can define one yourself and input it into the `constraints` keyword argument to design your unique constraints. You just have to create a sub-class of the `Constraint` abstract interface class and follow its requirements. You can find more information in the definition of `Constraint` found [here]( Some unique ideas (not yet implemented; maybe you can give it a try!) include constraints like `OrderedConstraints`, `TemplateConstraints` that may be added further down the line. Currently, the generation is fulfilled by including the sequences, wherever in the output. For example, a previous example had one sequence with scared -> screaming and the other with screamed -> scared. `OrderedConstraints` could allow the user to specify the order in which these constraints are fulfilled. 
`TemplateConstraints` could allow for a more niche use of the feature, where the objective can be something like: ```python starting_text = \"The woman\" template = [\"the\", \"\", \"School of\", \"\", \"in\"] possible_outputs == [ \"The woman attended the Ross School of Business in Michigan.\", \"The woman was the administrator for the Harvard School of Business in MA.\" ] ``` or: ```python starting_text = \"The woman\" template = [\"the\", \"\", \"\", \"University\", \"\", \"in\"] possible_outputs == [ \"The woman attended the Carnegie Mellon University in Pittsburgh.\", ] impossible_outputs == [ \"The woman attended the Harvard University in MA.\" ] ``` or if the user does not care about the number of tokens that can go in between two words, then one can just use `OrderedConstraint`. ## **Conclusion** Constrained beam search gives us a flexible means to inject external knowledge and requirements into text generation. Previously, there was no easy way to tell the model to 1. include a list of sequences where 2. some of which are optional and some are not, such that 3. they're generated *somewhere* in the sequence at respective reasonable positions. Now, we can have full control over our generation with a mix of different subclasses of `Constraint` objects! This new feature is based mainly on the following papers: - [Guided Open Vocabulary Image Captioning with Constrained Beam Search]( - [Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation]( - [Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting]( - [Guided Generation of Cause and Effect]( Like the ones above, many new research papers are exploring ways of using external knowledge (e.g., KGs, KBs) to guide the outputs of large deep learning models. Hopefully, this constrained beam search feature becomes another effective way to achieve this purpose. Thanks to everybody that gave guidance for this feature contribution: Patrick von Platen for being involved from the [initial issue]( to the [final PR]( and Narsil Patry, for providing detailed feedback on the code. *Thumbnail of this post uses an icon with the attribution: Shorthand icons created by Freepik - Flaticon*"}
{"title": "content-guidelines-update.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Announcing our new Content Guidelines and Policy\" thumbnail: /blog/assets/content-guidelines-blogpost/thumbnail.png authors: - user: giadap --- # Announcing our new Community Policy As a community-driven platform that aims to advance Open, Collaborative, and Responsible Machine Learning, we are thrilled to support and maintain a welcoming space for our entire community! In support of this goal, we've updated our [Content Policy]( We encourage you to familiarize yourself with the complete document to fully understand what it entails. Meanwhile, this blog post serves to provide an overview, outline the rationale, and highlight the values driving the update of our Content Policy. By delving into both resources, you'll gain a comprehensive understanding of the expectations and goals for content on our platform. ## Moderating Machine Learning Content Moderating Machine Learning artifacts introduces new challenges. Even more than static content, the risks associated with developing and deploying artificial intelligence systems and/or models require in-depth analysis and a wide-ranging approach to foresee possible harms. That is why the efforts to draft this new Content Policy come from different members and expertise in our cross-company teams, all of which are indispensable to have both a general and a detailed picture to provide clarity on how we enable responsible development and deployment on our platform. Furthermore, as the field of AI and machine learning continues to expand, the variety of use cases and applications proliferates. This makes it essential for us to stay up-to-date with the latest research, ethical considerations, and best practices. For this reason, promoting user collaboration is also vital to the sustainability of our platform. Namely, through our community features, such as the Community Tab, we encourage and foster collaborative solutions between repository authors, users, organizations, and our team. ## Consent as a Core Value As we prioritize respecting people's rights throughout the development and use of Machine Learning systems, we take a forward-looking view to account for developments in the technology and law affecting those rights. New ways of processing information enabled by Machine Learning are posing entirely new questions, both in the field of AI and in regulatory circles, about people's agency and rights with respect to their work, their image, and their privacy. Central to these discussions are how people's rights should be operationalized -- and we offer one avenue for addressing this here. In this evolving legal landscape, it becomes increasingly important to emphasize the intrinsic value of \"consent\" to avoid enabling harm. By doing so, we focus on the individual's agency and subjective experiences. This approach not only supports forethought and a more empathetic understanding of consent but also encourages proactive measures to address cultural and contextual factors. In particular, our Content Policy aims to address consent related to what users see, and to how people's identities and expressions are represented. This consideration for people's consent and experiences on the platform extends to Community Content and people's behaviors on the Hub. To maintain a safe and welcoming environment, we do not allow aggressive or harassing language directed at our users and/or the Hugging Face staff. 
We focus on fostering collaborative resolutions for any potential conflicts between users and repository authors, intervening only when necessary. To promote transparency, we encourage open discussions to occur within our Community tab. Our approach is a reflection of our ongoing efforts to adapt and progress, which is made possible by the invaluable input of our users who actively collaborate and share their feedback. We are committed to being receptive to comments and constantly striving for improvement. We encourage you to reach out to [feedback@huggingface.co](mailto:feedback@huggingface.co) with any questions or concerns. Let's join forces to build a friendly and supportive community that encourages open AI and ML collaboration! Together, we can make great strides forward in fostering a welcoming environment for everyone."}
{"title": "controlnet.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"ControlNet in Diffusers\" thumbnail: /blog/assets/controlnet/thumbnail.png authors: - user: sayakpaul - user: yiyixu - user: patrickvonplaten --- # Ultra fast ControlNet with Diffusers Ever since Stable Diffusion took the world by storm, people have been looking for ways to have more control over the results of the generation process. ControlNet provides a minimal interface allowing users to customize the generation process up to a great extent. With [ControlNet]( users can easily condition the generation with different spatial contexts such as a depth map, a segmentation map, a scribble, keypoints, and so on! We can turn a cartoon drawing into a realistic photo with incredible coherence. Realistic Lofi Girl Or even use it as your interior designer. Before After You can turn your sketch scribble into an artistic drawing. Before After Also, make some of the famous logos coming to life. Before After With ControlNet, the sky is the limit In this blog post, we first introduce the [`StableDiffusionControlNetPipeline`]( and then show how it can be applied for various control conditionings. Let\u2019s get controlling! ## ControlNet: TL;DR ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models]( by Lvmin Zhang and Maneesh Agrawala. It introduces a framework that allows for supporting various spatial contexts that can serve as additional conditionings to Diffusion models such as Stable Diffusion. The diffusers implementation is adapted from the original [source code]( Training ControlNet is comprised of the following steps: 1. Cloning the pre-trained parameters of a Diffusion model, such as Stable Diffusion's latent UNet, (referred to as \u201ctrainable copy\u201d) while also maintaining the pre-trained parameters separately (\u201dlocked copy\u201d). It is done so that the locked parameter copy can preserve the vast knowledge learned from a large dataset, whereas the trainable copy is employed to learn task-specific aspects. 2. The trainable and locked copies of the parameters are connected via \u201czero convolution\u201d layers (see [here]( for more information) which are optimized as a part of the ControlNet framework. This is a training trick to preserve the semantics already learned by frozen model as the new conditions are trained. Pictorially, training a ControlNet looks like so: The diagram is taken from here. A sample from the training set for ControlNet-like training looks like this (additional conditioning is via edge maps): Prompt Original Image Conditioning \"bird\" Similarly, if we were to condition ControlNet with semantic segmentation maps, a training sample would be like so: Prompt Original Image Conditioning \"big house\" Every new type of conditioning requires training a new copy of ControlNet weights. The paper proposed 8 different conditioning models that are all [supported]( in Diffusers! For inference, both the pre-trained diffusion models weights as well as the trained ControlNet weights are needed. For example, using [Stable Diffusion v1-5]( with a ControlNet checkpoint require roughly 700 million more parameters compared to just using the original Stable Diffusion model, which makes ControlNet a bit more memory-expensive for inference. Because the pre-trained diffusion models are locked during training, one only needs to switch out the ControlNet parameters when using a different conditioning. 
This makes it fairly simple to deploy multiple ControlNet weights in one application as we will see below. ## The `StableDiffusionControlNetPipeline` Before we begin, we want to give a huge shout-out to the community contributor [Takuma Mori]( for having led the integration of ControlNet into Diffusers. To experiment with ControlNet, Diffusers exposes the [`StableDiffusionControlNetPipeline`]( similar to the [other Diffusers pipelines]( Central to the `StableDiffusionControlNetPipeline` is the `controlnet` argument which lets us provide a particular trained [`ControlNetModel`]( instance while keeping the pre-trained diffusion model weights the same. We will explore different use cases with the `StableDiffusionControlNetPipeline` in this blog post. The first ControlNet model we are going to walk through is the [Canny model]( - this is one of the most popular models that generated some of the amazing images you are likely seeing on the internet. We welcome you to run the code snippets shown in the sections below with [this Colab Notebook]( Before we begin, let's make sure we have all the necessary libraries installed: ```bash pip install diffusers==0.14.0 transformers xformers git+ ``` To process different conditionings depending on the chosen ControlNet, we also need to install some additional dependencies: - [OpenCV]( - [controlnet-aux]( - a simple collection of pre-processing models for ControlNet ```bash pip install opencv-contrib-python pip install controlnet_aux ``` We will use the famous painting [\"Girl with a Pearl Earring\"]( for this example. So, let's download the image and take a look: ```python from diffusers.utils import load_image image = load_image( \" ) image ``` Next, we will put the image through the canny pre-processor: ```python import cv2 from PIL import Image import numpy as np image = np.array(image) low_threshold = 100 high_threshold = 200 image = cv2.Canny(image, low_threshold, high_threshold) image = image[:, :, None] image = np.concatenate([image, image, image], axis=2) canny_image = Image.fromarray(image) canny_image ``` As we can see, it is essentially edge detection: Now, we load [runwayml/stable-diffusion-v1-5]( as well as the [ControlNet model for canny edges]( The models are loaded in half-precision (`torch.float16`) to allow for fast and memory-efficient inference. ```python from diffusers import StableDiffusionControlNetPipeline, ControlNetModel import torch controlnet = ControlNetModel.from_pretrained(\"lllyasviel/sd-controlnet-canny\", torch_dtype=torch.float16) pipe = StableDiffusionControlNetPipeline.from_pretrained( \"runwayml/stable-diffusion-v1-5\", controlnet=controlnet, torch_dtype=torch.float16 ) ``` Instead of using Stable Diffusion's default [PNDMScheduler]( we use one of the currently fastest diffusion model schedulers, called [UniPCMultistepScheduler]( Choosing an improved scheduler can drastically reduce inference time - in our case we are able to reduce the number of inference steps from 50 to 20 while more or less keeping the same image generation quality. 
More information regarding schedulers can be found [here]( ```python from diffusers import UniPCMultistepScheduler pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) ``` Instead of loading our pipeline directly to the GPU, we enable smart CPU offloading, which can be achieved with the [`enable_model_cpu_offload` function]( Remember that, during inference, diffusion models such as Stable Diffusion require not just one but multiple model components that are run sequentially. In the case of Stable Diffusion with ControlNet, we first use the CLIP text encoder, then the diffusion model UNet and ControlNet, then the VAE decoder and finally run a safety checker. Most components are only run once during the diffusion process and are thus not required to occupy GPU memory all the time. By enabling smart model offloading, we make sure that each component is only loaded into the GPU when it's needed, so that we can significantly reduce memory consumption without significantly slowing down inference. **Note**: When running `enable_model_cpu_offload`, do not manually move the pipeline to GPU with `.to(\"cuda\")` - once CPU offloading is enabled, the pipeline automatically takes care of GPU memory management. ```py pipe.enable_model_cpu_offload() ``` Finally, we want to take full advantage of the amazing [FlashAttention/xformers]( attention layer acceleration, so let's enable this! If this command does not work for you, you might not have `xformers` correctly installed. In this case, you can just skip the following line of code. ```py pipe.enable_xformers_memory_efficient_attention() ``` Now we are ready to run the ControlNet pipeline! We still provide a prompt to guide the image generation process, just like what we would normally do with a Stable Diffusion image-to-image pipeline. However, ControlNet will allow a lot more control over the generated image because we will be able to control the exact composition of the generated image with the canny edge image we just created. It will be fun to see some images where contemporary celebrities are posing for this exact same painting from the 17th century. And it's really easy to do that with ControlNet, all we have to do is include the names of these celebrities in the prompt! Let's first create a simple helper function to display images as a grid. ```python def image_grid(imgs, rows, cols): assert len(imgs) == rows * cols w, h = imgs[0].size grid = Image.new(\"RGB\", size=(cols * w, rows * h)) grid_w, grid_h = grid.size for i, img in enumerate(imgs): grid.paste(img, box=(i % cols * w, i // cols * h)) return grid ``` Next, we define the input prompts and set a seed for reproducibility. ```py prompt = \", best quality, extremely detailed\" prompt = [t + prompt for t in [\"Sandra Oh\", \"Kim Kardashian\", \"rihanna\", \"taylor swift\"]] generator = [torch.Generator(device=\"cpu\").manual_seed(2) for i in range(len(prompt))] ``` Finally, we can run the pipeline and display the image! ```py output = pipe( prompt, canny_image, negative_prompt=[\"monochrome, lowres, bad anatomy, worst quality, low quality\"] * 4, num_inference_steps=20, generator=generator, ) image_grid(output.images, 2, 2) ``` We can effortlessly combine ControlNet with fine-tuning too! For example, we can fine-tune a model with [DreamBooth]( and use it to render ourselves into different scenes. In this post, we are going to use our beloved Mr Potato Head as an example to show how to use ControlNet with DreamBooth. We can use the same ControlNet. 
However, instead of using Stable Diffusion 1.5, we are going to load the [Mr Potato Head model]( into our pipeline - Mr Potato Head is a Stable Diffusion model fine-tuned on the Mr Potato Head concept using DreamBooth. Let's run the above commands again, keeping the same controlnet though! ```python model_id = \"sd-dreambooth-library/mr-potato-head\" pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16, ) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() pipe.enable_xformers_memory_efficient_attention() ``` Now let's make Mr Potato Head pose for [Johannes Vermeer]( ```python generator = torch.manual_seed(2) prompt = \"a photo of sks mr potato head, best quality, extremely detailed\" output = pipe( prompt, canny_image, negative_prompt=\"monochrome, lowres, bad anatomy, worst quality, low quality\", num_inference_steps=20, generator=generator, ) output.images[0] ``` It is noticeable that Mr Potato Head is not the best candidate, but he tried his best and did a pretty good job of capturing some of the essence. Another exclusive application of ControlNet is that we can take a pose from one image and reuse it to generate a different image with the exact same pose. So in this next example, we are going to teach superheroes how to do yoga using [Open Pose ControlNet]( First, we will need to get some images of people doing yoga: ```python urls = \"yoga1.jpeg\", \"yoga2.jpeg\", \"yoga3.jpeg\", \"yoga4.jpeg\" imgs = [ load_image(\" + url) for url in urls ] image_grid(imgs, 2, 2) ``` Now let's extract yoga poses using the OpenPose pre-processors that are handily available via `controlnet_aux`. ```python from controlnet_aux import OpenposeDetector model = OpenposeDetector.from_pretrained(\"lllyasviel/ControlNet\") poses = [model(img) for img in imgs] image_grid(poses, 2, 2) ``` To use these yoga poses to generate new images, let's create an [Open Pose ControlNet]( We will generate some super-hero images but in the yoga poses shown above. Let's go! ```python controlnet = ControlNetModel.from_pretrained( \"fusing/stable-diffusion-v1-5-controlnet-openpose\", torch_dtype=torch.float16 ) model_id = \"runwayml/stable-diffusion-v1-5\" pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16, ) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() ``` Now it's yoga time! ```python generator = [torch.Generator(device=\"cpu\").manual_seed(2) for i in range(4)] prompt = \"super-hero character, best quality, extremely detailed\" output = pipe( [prompt] * 4, poses, negative_prompt=[\"monochrome, lowres, bad anatomy, worst quality, low quality\"] * 4, generator=generator, num_inference_steps=20, ) image_grid(output.images, 2, 2) ``` ### Combining multiple conditionings Multiple ControlNet conditionings can be combined for a single image generation. Pass a list of ControlNets to the pipeline's constructor and a corresponding list of conditionings to `__call__`. When combining conditionings, it is helpful to mask conditionings such that they do not overlap. In the example, we mask the middle of the canny map where the pose conditioning is located. It can also be helpful to vary the `controlnet_conditioning_scale`s to emphasize one conditioning over the other. 
#### Canny conditioning The original image Prepare the conditioning ```python from diffusers.utils import load_image from PIL import Image import cv2 import numpy as np canny_image = load_image( \" ) canny_image = np.array(canny_image) low_threshold = 100 high_threshold = 200 canny_image = cv2.Canny(canny_image, low_threshold, high_threshold) # zero out middle columns of image where pose will be overlaid zero_start = canny_image.shape[1] // 4 zero_end = zero_start + canny_image.shape[1] // 2 canny_image[:, zero_start:zero_end] = 0 canny_image = canny_image[:, :, None] canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2) canny_image = Image.fromarray(canny_image) ``` #### Openpose conditioning The original image Prepare the conditioning ```python from controlnet_aux import OpenposeDetector from diffusers.utils import load_image openpose = OpenposeDetector.from_pretrained(\"lllyasviel/ControlNet\") openpose_image = load_image( \" ) openpose_image = openpose(openpose_image) ``` #### Running ControlNet with multiple conditionings ```python from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler import torch controlnet = [ ControlNetModel.from_pretrained(\"lllyasviel/sd-controlnet-openpose\", torch_dtype=torch.float16), ControlNetModel.from_pretrained(\"lllyasviel/sd-controlnet-canny\", torch_dtype=torch.float16), ] pipe = StableDiffusionControlNetPipeline.from_pretrained( \"runwayml/stable-diffusion-v1-5\", controlnet=controlnet, torch_dtype=torch.float16 ) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_xformers_memory_efficient_attention() pipe.enable_model_cpu_offload() prompt = \"a giant standing in a fantasy landscape, best quality\" negative_prompt = \"monochrome, lowres, bad anatomy, worst quality, low quality\" generator = torch.Generator(device=\"cpu\").manual_seed(1) images = [openpose_image, canny_image] image = pipe( prompt, images, num_inference_steps=20, generator=generator, negative_prompt=negative_prompt, controlnet_conditioning_scale=[1.0, 0.8], ).images[0] image.save(\"./multi_controlnet_output.png\") ``` Throughout the examples, we explored multiple facets of the [`StableDiffusionControlNetPipeline`]( to show how easy and intuitive it is to play around with ControlNet via Diffusers. However, we didn't cover all types of conditionings supported by ControlNet. To know more about those, we encourage you to check out the respective model documentation pages: * [lllyasviel/sd-controlnet-depth]( * [lllyasviel/sd-controlnet-hed]( * [lllyasviel/sd-controlnet-normal]( * [lllyasviel/sd-controlnet-scribble]( * [lllyasviel/sd-controlnet-seg]( * [lllyasviel/sd-controlnet-openpose]( * [lllyasviel/sd-controlnet-mlsd]( * [lllyasviel/sd-controlnet-canny]( We welcome you to combine these different elements and share your results with [@diffuserslib]( Be sure to check out [the Colab Notebook]( to take some of the above examples for a spin! We also showed some techniques to make the generation process faster and memory-friendly by using a fast scheduler, smart model offloading and `xformers`. With these techniques combined, the generation process takes only ~3 seconds on a V100 GPU and consumes just ~4 GB of VRAM for a single image. On free services like Google Colab, generation takes about 5s on the default GPU (T4), whereas the original implementation requires 17s to create the same result! 
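If you want to check those numbers on your own hardware, a minimal timing sketch along these lines should do; it assumes a pipeline `pipe` and a conditioning image `cond_image` prepared as in the examples above (both are placeholders here), and a CUDA GPU: ```python import time import torch # measure latency and peak GPU memory for a single 20-step generation torch.cuda.reset_peak_memory_stats() start = time.time() result = pipe(\"a photo, best quality\", cond_image, num_inference_steps=20).images[0] print(f\"latency: {time.time() - start:.1f}s\") print(f\"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB\") ``` 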
Combining all the pieces in the `diffusers` toolbox is a real superpower ## Conclusion We have been playing a lot with [`StableDiffusionControlNetPipeline`]( and our experience has been fun so far! We\u2019re excited to see what the community builds on top of this pipeline. If you want to check out other pipelines and techniques supported in Diffusers that allow for controlled generation, check out our [official documentation]( If you cannot wait to try out ControlNet directly, we got you covered as well! Simply click on one of the following spaces to play around with ControlNet: - [ 2. [What is Hugging Face Optimum?](#2-what-is-hugging-face-optimum) 3. [What Transformers architectures are supported?](#3-what-transformers-architectures-are-supported) 4. [How can I convert a Transformers model (BERT) to ONNX?](#4-how-can-i-convert-a-transformers-model-bert-to-onnx) 5. [What's next?](#5-whats-next) Let's get started! --- If you are interested in optimizing your models to run with maximum efficiency, check out the [ Optimum library]( 1. What is ONNX? The [ONNX or Open Neural Network eXchange]( is an open standard and format to represent machine learning models. ONNX defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. pseudo ONNX graph, visualized with NETRON When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an `intermediate representation`) which represents the flow of data through the neural network. > **Important:** ONNX Is not a Runtime ONNX is only the representation that can be used with runtimes like ONNX Runtime. You can find a list of supported accelerators [here]( > [Learn more about ONNX.]( 2. What is Hugging Face Optimum? [Hugging Face Optimum]( is an open-source library and an extension of [Hugging Face Transformers]( that provides a unified API of performance optimization tools to achieve maximum efficiency to train and run models on accelerated hardware, including toolkits for optimized performance on [Graphcore IPU]( and [Habana Gaudi]( Optimum can be used for converting, quantization, graph optimization, accelerated training & inference with support for [transformers pipelines]( Below you can see a typical developer journey of how you can leverage Optimum with ONNX. [ Learn more about Optimum]( 3. What Transformers architectures are supported? A list of all supported Transformers architectures can be found in the [ONNX section of the Transformers documentation]( Below is an excerpt of the most commonly used architectures which can be converted to ONNX and optimized with [Hugging Face Optimum]( - ALBERT - BART - BERT - DistilBERT - ELECTRA - GPT Neo - GPT-J - GPT-2 - RoBERTa - T5 - ViT - XLM - \u2026 [ All supported architectures]( 4. How can I convert a Transformers model (BERT) to ONNX? There are currently three ways to convert your Hugging Face Transformers models to ONNX. In this section, you will learn how to export [distilbert-base-uncased-finetuned-sst-2-english]( for `text-classification` using all three methods going from the low-level `torch` API to the most user-friendly high-level API of `optimum`. Each method will do exactly the same ### Export with `torch.onnx` (low-level) [torch.onnx]( enables you to convert model checkpoints to an ONNX graph by the `export` method. But you have to provide a lot of values like `input_names`, `dynamic_axes`, etc. 
You\u2019ll first need to install some dependencies: ```bash pip install transformers torch ``` Exporting our checkpoint with `export`: ```python import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer # load model and tokenizer model_id = \"distilbert-base-uncased-finetuned-sst-2-english\" model = AutoModelForSequenceClassification.from_pretrained(model_id) tokenizer = AutoTokenizer.from_pretrained(model_id) dummy_model_input = tokenizer(\"This is a sample\", return_tensors=\"pt\") # export torch.onnx.export( model, tuple(dummy_model_input.values()), f=\"torch-model.onnx\", input_names=['input_ids', 'attention_mask'], output_names=['logits'], dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'}, 'attention_mask': {0: 'batch_size', 1: 'sequence'}, 'logits': {0: 'batch_size', 1: 'sequence'}}, do_constant_folding=True, opset_version=13, ) ``` ### Export with `transformers.onnx` (mid-level) [transformers.onnx]( enables you to convert model checkpoints to an ONNX graph by leveraging configuration objects. That way you don\u2019t have to provide the complex configuration for `dynamic_axes` etc. You\u2019ll first need to install some dependencies: ```bash pip install transformers[onnx] torch ``` Exporting our checkpoint with `transformers.onnx`: ```python from pathlib import Path import transformers from transformers.onnx import FeaturesManager from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification # load model and tokenizer model_id = \"distilbert-base-uncased-finetuned-sst-2-english\" feature = \"sequence-classification\" model = AutoModelForSequenceClassification.from_pretrained(model_id) tokenizer = AutoTokenizer.from_pretrained(model_id) # load config model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature) onnx_config = model_onnx_config(model.config) # export onnx_inputs, onnx_outputs = transformers.onnx.export( preprocessor=tokenizer, model=model, config=onnx_config, opset=13, output=Path(\"trfs-model.onnx\") ) ``` ### Export with Optimum (high-level) [Optimum]( Inference includes methods to convert vanilla Transformers models to ONNX using the `ORTModelForXxx` classes. To convert your Transformers model to ONNX, you simply have to pass `from_transformers=True` to the `from_pretrained()` method and your model will be loaded and converted to ONNX leveraging the [transformers.onnx]( package under the hood. You\u2019ll first need to install some dependencies: ```bash pip install optimum[onnxruntime] ``` Exporting our checkpoint with `ORTModelForSequenceClassification`: ```python from optimum.onnxruntime import ORTModelForSequenceClassification model = ORTModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased-finetuned-sst-2-english\", from_transformers=True) ``` The best part about the conversion with Optimum is that you can immediately use the `model` to run predictions or load it [inside a pipeline.]( ## 5. What's next? Now that you have successfully converted your Transformers model to ONNX, the whole set of optimization and quantization tools is open to use. 
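For example, one thing that is immediately possible is loading the exported graph with ONNX Runtime and running a prediction as a quick sanity check. This is a minimal sketch, assuming `onnxruntime` is installed and that the `trfs-model.onnx` file from the `transformers.onnx` export above exists: ```python import onnxruntime as ort from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased-finetuned-sst-2-english\") session = ort.InferenceSession(\"trfs-model.onnx\") # tokenize with NumPy tensors so the arrays can be fed to ONNX Runtime directly inputs = tokenizer(\"This is a sample\", return_tensors=\"np\") logits = session.run(None, {\"input_ids\": inputs[\"input_ids\"], \"attention_mask\": inputs[\"attention_mask\"]})[0] print(logits.argmax(axis=-1)) # 0 = negative, 1 = positive ``` 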
Potential next steps can be: - Use the ONNX model for [Accelerated Inference with Optimum and Transformers Pipelines]( - Apply [static quantization to your model]( for ~3x latency improvements - Use ONNX Runtime for [training]( - Convert your ONNX model to [TensorRT]( to improve GPU performance - \u2026 If you are interested in optimizing your models to run with maximum efficiency, check out the [ Optimum library]( --- Thanks for reading! If you have any questions, feel free to contact me through [GitHub]( or on the [forum]( You can also connect with me on [Twitter]( or [LinkedIn]("}
{"title": "course-launch-event.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Course Launch Community Event\" thumbnail: /blog/assets/34_course_launch/speakers_day1_thumb.png authors: - user: sgugger --- # Course Launch Community Event We are excited to share that after a lot of work from the Hugging Face team, part 2 of the [Hugging Face Course]( will be released on November 15th! Part 1 focused on teaching you how to use a pretrained model, fine-tune it on a text classification task then upload the result to the [Model Hub]( Part 2 will focus on all the other common NLP tasks: token classification, language modeling (causal and masked), translation, summarization and question answering. It will also take a deeper dive in the whole Hugging Face ecosystem, in particular [ Datasets]( and [ Tokenizers]( To go with this release, we are organizing a large community event to which you are invited! The program includes two days of talks, then team projects focused on fine-tuning a model on any NLP task ending with live demos like [this one]( Those demos will go nicely in your portfolio if you are looking for a new job in Machine Learning. We will also deliver a certificate of completion to all the participants that achieve building one of them. AWS is sponsoring this event by offering free compute to participants via [Amazon SageMaker]( To register, please fill out [this form]( You will find below more details on the two days of talks. ## Day 1 (November 15th): A high-level view of Transformers and how to train them The first day of talks will focus on a high-level presentation of Transformers models and the tools we can use to train or fine-tune them. Thomas Wolf: Transfer Learning and the birth of the Transformers library Thomas Wolf is co-founder and Chief Science Officer of HuggingFace. The tools created by Thomas Wolf and the Hugging Face team are used across more than 5,000 research organisations including Facebook Artificial Intelligence Research, Google Research, DeepMind, Amazon Research, Apple, the Allen Institute for Artificial Intelligence as well as most university departments. Thomas Wolf is the initiator and senior chair of the largest research collaboration that has ever existed in Artificial Intelligence: \u201cBigScience\u201d, as well as a set of widely used libraries and tools. Thomas Wolf is also a prolific educator and a thought leader in the field of Artificial Intelligence and Natural Language Processing, a regular invited speaker to conferences all around the world ( Margaret Mitchell: On Values in ML Development Margaret Mitchell is a researcher working on Ethical AI, currently focused on the ins and outs of ethics-informed AI development in tech. She has published over 50 papers on natural language generation, assistive technology, computer vision, and AI ethics, and holds multiple patents in the areas of conversation generation and sentiment classification. She previously worked at Google AI as a Staff Research Scientist, where she founded and co-led Google's Ethical AI group, focused on foundational AI ethics research and operationalizing AI ethics Google-internally. Before joining Google, she was a researcher at Microsoft Research, focused on computer vision-to-language generation; and was a postdoc at Johns Hopkins, focused on Bayesian modeling and information extraction. She holds a PhD in Computer Science from the University of Aberdeen and a Master's in computational linguistics from the University of Washington. 
While earning her degrees, she also worked from 2005-2012 on machine learning, neurological disorders, and assistive technology at Oregon Health and Science University. She has spearheaded a number of workshops and initiatives at the intersections of diversity, inclusion, computer science, and ethics. Her work has received awards from Secretary of Defense Ash Carter and the American Foundation for the Blind, and has been implemented by multiple technology companies. She likes gardening, dogs, and cats. Jakob Uszkoreit: It Ain't Broke So Don't Fix Let's Break It Jakob Uszkoreit is the co-founder of Inceptive. Inceptive designs RNA molecules for vaccines and therapeutics using large-scale deep learning in a tight loop with high throughput experiments with the goal of making RNA-based medicines more accessible, more effective and more broadly applicable. Previously, Jakob worked at Google for more than a decade, leading research and development teams in Google Brain, Research and Search working on deep learning fundamentals, computer vision, language understanding and machine translation. Jay Alammar: A gentle visual intro to Transformers models Jay Alammar, Cohere. Through his popular ML blog, Jay has helped millions of researchers and engineers visually understand machine learning tools and concepts from the basic (ending up in numPy, pandas docs) to the cutting-edge (Transformers, BERT, GPT-3). Matthew Watson: NLP workflows with Keras Matthew Watson is a machine learning engineer on the Keras team, with a focus on high-level modeling APIs. He studied Computer Graphics during undergrad and a Masters at Stanford University. An almost English major who turned towards computer science, he is passionate about working across disciplines and making NLP accessible to a wider audience. Chen Qian: NLP workflows with Keras Chen Qian is a software engineer from Keras team, with a focus on high-level modeling APIs. Chen got a Master degree of Electrical Engineering from Stanford University, and he is especially interested in simplifying code implementations of ML tasks and large-scale ML. Mark Saroufim: How to Train a Model with Pytorch Mark Saroufim is a Partner Engineer at Pytorch working on OSS production tools including TorchServe and Pytorch Enterprise. In his past lives, Mark was an Applied Scientist and Product Manager at Graphcore, yuri.ai, Microsoft and NASA's JPL. His primary passion is to make programming more fun. ## Day 2 (November 16th): The tools you will use Day 2 will be focused on talks by the Hugging Face, [Gradio]( and [AWS]( teams, showing you the tools you will use. Lewis Tunstall: Simple Training with the Transformers Trainer Lewis is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of an upcoming O\u2019Reilly book on Transformers and you can follow him on Twitter (@_lewtun) for NLP tips and tricks! Matthew Carrigan: New TensorFlow Features for Transformers and Datasets Matt is responsible for TensorFlow maintenance at Transformers, and will eventually lead a coup against the incumbent PyTorch faction which will likely be co-ordinated via his Twitter account @carrigmat. Lysandre Debut: The Hugging Face Hub as a means to collaborate on and share Machine Learning projects Lysandre is a Machine Learning Engineer at Hugging Face where he is involved in many open source projects. 
His aim is to make Machine Learning accessible to everyone by developing powerful tools with a very simple API. Sylvain Gugger: Supercharge your PyTorch training loop with Accelerate Sylvain is a Research Engineer at Hugging Face, one of the core maintainers of Transformers, and the developer behind Accelerate. He likes making model training more accessible. Lucile Saulnier: Get your own tokenizer with Transformers & Tokenizers Lucile is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience. Merve Noyan: Showcase your model demos with Spaces Merve is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone. Abubakar Abid: Building Machine Learning Applications Fast Abubakar Abid is the CEO of Gradio. He received his Bachelor of Science in Electrical Engineering and Computer Science from MIT in 2015, and his PhD in Applied Machine Learning from Stanford in 2021. In his role as the CEO of Gradio, Abubakar works on making machine learning models easier to demo, debug, and deploy. Mathieu Desv\u00e9: AWS ML Vision: Making Machine Learning Accessible to all Customers Technology enthusiast and maker in my free time. I like challenges, solving the problems of clients and users, and working with talented people to learn every day. Since 2004, I have worked in multiple positions, switching between frontend, backend, infrastructure, operations, and management, trying to solve common technical and managerial issues in an agile manner. Philipp Schmid: Managed Training with Amazon SageMaker and Transformers Philipp Schmid is a Machine Learning Engineer and Tech Lead at Hugging Face, where he leads the collaboration with the Amazon SageMaker team. He is passionate about democratizing and productionizing cutting-edge NLP models and improving the ease of use for Deep Learning."}
{"title": "cv_state.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: The State of Computer Vision at Hugging Face thumbnail: /blog/assets/cv_state/thumbnail.png authors: - user: sayakpaul --- # The State of Computer Vision at Hugging Face At Hugging Face, we pride ourselves on democratizing the field of artificial intelligence together with the community. As a part of that mission, we began focusing our efforts on computer vision over the last year. What started as a [PR for having Vision Transformers (ViT) in Transformers]( has now grown into something much bigger \u2013 8 core vision tasks, over 3000 models, and over 100 datasets on the Hugging Face Hub. A lot of exciting things have happened since ViTs joined the Hub. In this blog post, we\u2019ll summarize what went down and what\u2019s coming to support the continuous progress of Computer Vision from the ecosystem. Here is a list of things we\u2019ll cover: - [Supported vision tasks and Pipelines](#support-for-pipelines) - [Training your own vision models](#training-your-own-models) - [Integration with `timm`](#--timm) - [Diffusers](#-diffusers) - [Support for third-party libraries](#support-for-third-party-libraries) - [Deployment](#deployment) - and much more! ## Enabling the community: One task at a time The Hugging Face Hub is home to over 100,000 public models for different tasks such as next-word prediction, mask filling, token classification, sequence classification, and so on. As of today, we support [8 core vision tasks]( providing many model checkpoints: - Image classification - Image segmentation - (Zero-shot) object detection - Video classification - Depth estimation - Image-to-image synthesis - Unconditional image generation - Zero-shot image classification Each of these tasks comes with at least 10 model checkpoints on the Hub for you to explore. Furthermore, we support [tasks]( that lie at the intersection of vision and language such as: - Image-to-text (image captioning, OCR) - Text-to-image - Document question-answering - Visual question-answering These tasks entail not only state-of-the-art Transformer-based architectures such as [ViT]( [Swin]( [DETR]( but also *pure convolutional* architectures like [ConvNeXt]( [ResNet]( [RegNet]( and more! Architectures like ResNets are still very much relevant for a myriad of industrial use cases and hence the support of these non-Transformer architectures in Transformers. It\u2019s also important to note that the models on the Hub are not just from the Transformers library but also from other third-party libraries. For example, even though we support tasks like unconditional image generation on the Hub, we don\u2019t have any models supporting that task in Transformers yet (such as [this]( Supporting all ML tasks, whether they are solved with Transformers or a third-party library is a part of our mission to foster a collaborative open-source Machine Learning ecosystem. ## Support for Pipelines We developed [Pipelines]( to equip practitioners with the tools they need to easily incorporate machine learning into their toolbox. They provide an easy way to perform inference on a given input with respect to a task. We have support for [seven vision tasks]( in Pipelines. 
Here is an example of using Pipelines for depth estimation: ```py from transformers import pipeline depth_estimator = pipeline(task=\"depth-estimation\", model=\"Intel/dpt-large\") output = depth_estimator(\" # This is a tensor with the values being the depth expressed # in meters for each pixel output[\"depth\"] ``` The interface remains the same even for tasks like visual question-answering: ```py from transformers import pipeline oracle = pipeline(model=\"dandelin/vilt-b32-finetuned-vqa\") image_url = \" oracle(question=\"What's the animal doing?\", image=image_url, top_k=1) # [{'score': 0.778620, 'answer': 'laying down'}] ``` ## Training your own models While being able to use a model for off-the-shelf inference is a great way to get started, fine-tuning is where the community gets the most benefits. This is especially true when your datasets are custom, and you\u2019re not getting good performance out of the pre-trained models. Transformers provides a [Trainer API]( for everything related to training. Currently, `Trainer` seamlessly supports the following tasks: image classification, image segmentation, video classification, object detection, and depth estimation. Fine-tuning models for other vision tasks is also supported, just not by `Trainer`. As long as a model from Transformers computes the loss for a given task, it should be eligible for fine-tuning for that task. If you find issues, please [report]( them on GitHub. **Where do I find the code?** - [Model documentation]( - [Hugging Face notebooks]( - [Hugging Face example scripts]( - [Task pages]( [Hugging Face example scripts]( include different [self-supervised pre-training strategies]( like [MAE]( and [contrastive image-text pre-training strategies]( like [CLIP]( These scripts are valuable resources for the research community as well as for practitioners willing to run pre-training from scratch on custom data corpora. Some tasks are not inherently meant for fine-tuning, though. Examples include zero-shot image classification (such as [CLIP]( zero-shot object detection (such as [OWL-ViT]( and zero-shot segmentation (such as [CLIPSeg]( We\u2019ll revisit these models in this post. ## Integrations with Datasets [Datasets]( provides easy access to thousands of datasets of different modalities. As mentioned earlier, the Hub has over 100 datasets for computer vision. Some examples worth noting here: [ImageNet-1k]( [Scene Parsing]( [NYU Depth V2]( [COYO-700M]( and [LAION-400M]( With these datasets being on the Hub, one can easily load them with just two lines of code: ```py from datasets import load_dataset dataset = load_dataset(\"scene_parse_150\") ``` Besides these datasets, we provide integration support with augmentation libraries like [albumentations]( and [Kornia]( The community can take advantage of the flexibility and performance of Datasets and powerful augmentation transformations provided by these libraries. In addition to these, we also provide [dedicated data-loading guides]( for core vision tasks: image classification, image segmentation, object detection, and depth estimation. ## timm `timm`, also known as [pytorch-image-models]( is an open-source collection of state-of-the-art PyTorch image models, pre-trained weights, and utility scripts for training, inference, and validation. We have over 200 models from `timm` on the Hub and more are on the way. Check out the [documentation]( to know more about this integration. 
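As a small illustration of what the integration can look like in practice, recent `timm` releases can pull checkpoints straight from the Hub with the `hf_hub:` prefix; the exact model id below is only an example: ```py import timm # load a Hub-hosted timm checkpoint; any timm model repo id should work here model = timm.create_model(\"hf_hub:timm/resnet50.a1_in1k\", pretrained=True) model.eval() ``` 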
## Diffusers [Diffusers]( provides pre-trained vision and audio diffusion models, and serves as a modular toolbox for inference and training. With this library, you can generate plausible images from natural language inputs amongst other creative use cases. Here is an example: ```py from diffusers import DiffusionPipeline generator = DiffusionPipeline.from_pretrained(\"CompVis/stable-diffusion-v1-4\") generator.to(\"cuda\") image = generator(\"An image of a squirrel in Picasso style\").images[0] ``` This type of technology can empower a new generation of creative applications and also aid artists coming from different backgrounds. To know more about Diffusers and the different use cases, check out the [official documentation]( The literature on Diffusion-based models is developing at a rapid pace, which is why we partnered with [Jonathan Whitaker]( to develop a course on it. The course is free, and you can check it out [here]( ## Support for third-party libraries Central to the Hugging Face ecosystem is the [Hugging Face Hub]( which lets people collaborate effectively on Machine Learning. As mentioned earlier, we not only support models from Transformers on the Hub but also models from other third-party libraries. To this end, we provide [several utilities]( so that you can integrate your own library with the Hub. One of the primary advantages of doing this is that it becomes very easy to share artifacts (such as models and datasets) with the community, thereby making it easier for your users to try out your models. When you have your models hosted on the Hub, you can also [add custom inference widgets]( for them. Inference widgets allow users to quickly check out the models. This helps with improving user engagement. ## Spaces for computer vision demos With [Spaces]( one can easily demonstrate their Machine Learning models. Spaces support direct integrations with [Gradio]( [Streamlit]( and [Docker]( empowering practitioners to have a great amount of flexibility while showcasing their models. You can bring in your own Machine Learning framework to build a demo with Spaces. The Gradio library provides several components for building Computer Vision applications on Spaces such as [Video]( [Gallery]( and [Model3D]( The community has been hard at work building some amazing Computer Vision applications that are powered by Spaces: - [Generate 3D voxels from a predicted depth map of an input image]( - [Open vocabulary semantic segmentation]( - [Narrate videos by generating captions]( - [Classify videos from YouTube]( - [Zero-shot video classification]( - [Visual question-answering]( - [Use zero-shot image classification to find best captions for an image to generate similar images]( ## AutoTrain [AutoTrain]( provides a \u201cno-code\u201d solution to train state-of-the-art Machine Learning models for tasks like text classification, text summarization, named entity recognition, and more. For Computer Vision, we currently support [image classification]( but one can expect more task coverage. AutoTrain also enables [automatic model evaluation]( This application allows you to evaluate Transformers [models]( across a wide variety of [datasets]( on the Hub. The results of your evaluation will be displayed on the [public leaderboards]( You can check [this blog post]( for more details. ## The technical philosophy In this section, we wanted to share our philosophy behind adding support for Computer Vision in Transformers so that the community is aware of the design choices specific to this area. 
Even though Transformers started with NLP, we support multiple modalities today, for example \u2013 vision, audio, vision-language, and Reinforcement Learning. For all of these modalities, all the corresponding models from Transformers enjoy some common benefits: - Easy model download with a single line of code with `from_pretrained()` - Easy model upload with `push_to_hub()` - Support for loading huge checkpoints with efficient checkpoint sharding techniques - Optimization support (with tools like [Optimum]( - Initialization from model configurations - Support for both PyTorch and TensorFlow (non-exhaustive) - and many more Unlike tokenizers, we have preprocessors (such as [this]( that take care of preparing data for the vision models. We have worked hard to ensure the user experience of using a vision model still feels easy and similar: ```py from transformers import ViTImageProcessor, ViTForImageClassification import torch from datasets import load_dataset dataset = load_dataset(\"huggingface/cats-image\") image = dataset[\"test\"][\"image\"][0] image_processor = ViTImageProcessor.from_pretrained(\"google/vit-base-patch16-224\") model = ViTForImageClassification.from_pretrained(\"google/vit-base-patch16-224\") inputs = image_processor(image, return_tensors=\"pt\") with torch.no_grad(): logits = model(**inputs).logits # model predicts one of the 1000 ImageNet classes predicted_label = logits.argmax(-1).item() print(model.config.id2label[predicted_label]) # Egyptian cat ``` Even for a difficult task like object detection, the user experience doesn\u2019t change very much: ```py from transformers import AutoImageProcessor, AutoModelForObjectDetection from PIL import Image import requests import torch url = \" image = Image.open(requests.get(url, stream=True).raw) image_processor = AutoImageProcessor.from_pretrained(\"microsoft/conditional-detr-resnet-50\") model = AutoModelForObjectDetection.from_pretrained(\"microsoft/conditional-detr-resnet-50\") inputs = image_processor(images=image, return_tensors=\"pt\") outputs = model(**inputs) # convert outputs (bounding boxes and class logits) to COCO API target_sizes = torch.tensor([image.size[::-1]]) results = image_processor.post_process_object_detection( outputs, threshold=0.5, target_sizes=target_sizes )[0] for score, label, box in zip(results[\"scores\"], results[\"labels\"], results[\"boxes\"]): box = [round(i, 2) for i in box.tolist()] print( f\"Detected {model.config.id2label[label.item()]} with confidence \" f\"{round(score.item(), 3)} at location {box}\" ) ``` Leads to: ```bash Detected remote with confidence 0.833 at location [38.31, 72.1, 177.63, 118.45] Detected cat with confidence 0.831 at location [9.2, 51.38, 321.13, 469.0] Detected cat with confidence 0.804 at location [340.3, 16.85, 642.93, 370.95] Detected remote with confidence 0.683 at location [334.48, 73.49, 366.37, 190.01] Detected couch with confidence 0.535 at location [0.52, 1.19, 640.35, 475.1] ``` ## Zero-shot models for vision There\u2019s been a surge of models that reformulate core vision tasks like segmentation and detection in interesting ways and introduce even more flexibility. We support a few of those from Transformers: - [CLIP]( that enables zero-shot image classification with prompts. Given an image, you\u2019d prompt the CLIP model with a natural language query like \u201can image of {}\u201d. The hope is to get the class label as the answer. - [OWL-ViT]( that allows for language-conditioned zero-shot object detection and image-conditioned one-shot object detection. 
This means you can detect objects in an image even if the underlying model didn\u2019t learn to detect them during training! You can refer to [this notebook]( to know more. - [CLIPSeg]( that supports language-conditioned zero-shot image segmentation and image-conditioned one-shot image segmentation. This means you can segment objects in an image even if the underlying model didn\u2019t learn to segment them during training! You can refer to [this blog post]( that illustrates this idea. [GroupViT]( also supports the task of zero-shot segmentation. - [X-CLIP]( that showcases zero-shot generalization to videos. More precisely, it supports zero-shot video classification. Check out [this notebook]( for more details. The community can expect to see more zero-shot models for computer vision being supported in Transformers in the coming days. ## Deployment As our CTO Julien says - \u201creal artists ship\u201d. We support the deployment of these vision models through [Inference Endpoints]( Inference Endpoints integrates directly with compatible models pertaining to image classification, object detection, and image segmentation. For other tasks, you can use the [custom handlers]( Since we also provide many vision models in TensorFlow through Transformers, for their deployment we recommend either using the custom handlers or following these resources: - [Deploying TensorFlow Vision Models in Hugging Face with TF Serving]( - [Deploying ViT on Kubernetes with TF Serving]( - [Deploying ViT on Vertex AI]( - [Deploying ViT with TFX and Vertex AI]( ## Conclusion In this post, we gave you a rundown of the things currently supported in the Hugging Face ecosystem to empower the next generation of Computer Vision applications. We hope you\u2019ll enjoy using these offerings to build reliably and responsibly. There is a lot to be done, though. Here are some things you can expect to see: - Direct support of videos from Datasets - Supporting more industry-relevant tasks like image similarity - Interoperability of the image datasets with TensorFlow - A course on Computer Vision from the community As always, we welcome your patches, PRs, model checkpoints, datasets, and other contributions! *Acknowledgements: Thanks to Omar Sanseviero, Nate Raw, Niels Rogge, Alara Dirik, Amy Roberts, Maria Khalusova, and Lysandre Debut for their rigorous and timely reviews on the blog draft. Thanks to Chunte Lee for creating the blog thumbnail.*"}
{"title": "data-measurements-tool.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing the Data Measurements Tool: an Interactive Tool for Looking at Datasets\" thumbnail: /blog/assets/37_data-measurements-tool/datametrics.png authors: - user: sasha - user: yjernite - user: meg --- # Introducing the Data Measurements Tool: an Interactive Tool for Looking at Datasets ***tl;dr:*** We made a tool you can use online to build, measure, and compare datasets. [Click to access the Data Measurements Tool here.]( ----- As developers of a fast-growing unified repository for Machine Learning datasets ([Lhoest et al. 2021]( the Hugging Face [team]( has been working on supporting good practices for dataset documentation ([McMillan-Major et al., 2021]( While static (if evolving) documentation represents a necessary first step in this direction, getting a good sense of what is actually in a dataset requires well-motivated measurements and the ability to interact with it, dynamically visualizing different aspects of interest. To this end, we introduce an open-source Python library and no-code interface called the [ Data Measurements Tool]( using our [Dataset]( and [Spaces]( Hubs paired with the great [Streamlit tool]( This can be used to help understand, build, curate, and compare datasets. ## What is the Data Measurements Tool? The [Data Measurements Tool (DMT)]( is an interactive interface and open-source library that lets dataset creators and users automatically calculate metrics that are meaningful and useful for responsible data development. ## Why have we created this tool? Thoughtful curation and analysis of Machine Learning datasets is often overlooked in AI development. Current norms for \u201cbig data\u201d in AI ([Luccioni et al., 2021]( [Dodge et al., 2021]( include using data scraped from various websites, with little or no attention paid to concrete measurements of what the different data sources represent, nor the nitty-gritty details of how they may influence what a model learns. Although dataset annotation approaches can help to curate datasets that are more in line with a developer\u2019s goals, the methods for \u201cmeasuring\u201d different aspects of these datasets are fairly limited ([Sambasivan et al., 2021]( A new wave of research in AI has called for a fundamental paradigm shift in how the field approaches ML datasets ([Paullada et al., 2020]( [Denton et al., 2021]( This includes defining fine-grained requirements for dataset creation from the start ([Hutchinson et al., 2021]( curating datasets in light of problematic content and bias concerns ([Yang et al., 2020]( [Prabhu and Birhane, 2020]( and making explicit the values inherent in dataset construction and maintenance ([Scheuerman et al., 2021]( [Birhane et al., 2021]( Although there is general agreement that dataset development is a task that people from many different disciplines should be able to inform, in practice there is often a bottleneck in interfacing with the raw data itself, which tends to require complex coding skills in order to analyze and query the dataset. Despite this, there are few tools openly available to the public to enable people from different disciplines to measure, interrogate, and compare datasets. We aim to help fill this gap. 
We learn and build from recent tools such as [Know Your Data]( and [Data Quality for AI]( as well as research proposals for dataset documentation such as [Vision and Language Datasets (Ferraro et al., 2015)]( [Datasheets for Datasets (Gebru et al, 2018)]( and [Data Statements (Bender & Friedman 2019)]( The result is an open-source library for dataset measurements, and an accompanying no-code interface for detailed dataset analysis. ## When can I use the Data Measurements Tool? The Data Measurements Tool can be used iteratively for exploring one or more existing NLP datasets, and will soon support iterative development of datasets from scratch. It provides actionable insights informed by research on datasets and responsible dataset development, allowing users to home in on both high-level information and specific items. ## What can I learn using the Data Measurements Tool? ### Dataset Basics **For a high-level overview of the dataset** *This begins to answer questions like \u201cWhat is this dataset? Does it have missing items?\u201d. You can use this as \u201csanity checks\u201d that the dataset you\u2019re working with is as you expect it to be.* - A description of the dataset (from the Hugging Face Hub) - Number of missing values or NaNs ### Descriptive Statistics **To look at the surface characteristics of the dataset** *This begins to answer questions like \u201cWhat kind of language is in this dataset? How diverse is it?\u201d* - The dataset vocabulary size and word distribution, for both [open- and closed-class words]( - The dataset label distribution and information about class (im)balance. - How closely the dataset\u2019s word distribution follows Zipf\u2019s law. A large deviation from the expected Zipfian distribution means that your distribution is relatively unnatural for natural language. This can be a sign of mixed artefacts in the dataset, such as HTML markup. You can use this information to clean up your dataset or to guide you in determining how further language you add to the dataset should be distributed. ### Comparison statistics *This begins to answer questions like \u201cWhat kinds of topics, biases, and associations are in this dataset?\u201d* - Embedding clusters to pinpoint any clusters of similar language in the dataset. Taking in the diversity of text represented in a dataset can be challenging when it is made up of hundreds to hundreds of thousands of sentences. Grouping these text items based on a measure of similarity can help users gain some insights into their distribution. We show a hierarchical clustering of the text fields in the dataset based on a [Sentence-Transformer]( model and a maximum dot product [single-linkage criterion]( To explore the clusters, you can: - hover over a node to see the 5 most representative examples (deduplicated) - enter an example in the text box to see which leaf clusters it is most similar to - select a cluster by ID to show all of its examples - The [normalized pointwise mutual information (nPMI)]( between word pairs in the dataset, which may be used to identify problematic stereotypes. You can use this as a tool in dealing with dataset \u201cbias\u201d, where here the term \u201cbias\u201d refers to stereotypes and prejudices for identity groups along the axes of gender and sexual orientation. We will add further terms in the near future. We have released a demo of the tool, demonstrating its usefulness on a handful of popular English-language datasets (e.g. SQuAD, imdb, C4, ...) available on the [Dataset Hub](
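To get a feel for the kind of surface statistics the tool reports, here is a minimal sketch that computes a label distribution and a rough vocabulary size by hand with the `datasets` library; this is only an illustration of the measurements, not the Data Measurements Tool API itself: ```python from collections import Counter from datasets import load_dataset # label distribution and rough vocabulary size for the imdb training split dataset = load_dataset(\"imdb\", split=\"train\") label_counts = Counter(dataset[\"label\"]) # class (im)balance vocab = Counter(token for text in dataset[\"text\"][:1000] for token in text.lower().split()) print(label_counts) print(\"approximate vocabulary size (first 1,000 examples):\", len(vocab)) ``` 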
The words that we selected for nPMI visualization are a subset of identity terms that came up frequently in the datasets that we were working with. In coming weeks and months, we will be extending the tool to: - Cover more languages and datasets present in the Datasets library. - Provide support for user-provided datasets and iterative dataset building. - Add more features and functionalities to the tool itself. For example, we will make it possible to add your own terms for the nPMI visualization so you can pick the words that matter most to you. ### Acknowledgements Thank you to Thomas Wolf for initiating this work, as well as other members of the team (Quentin, Lewis, Sylvain, Nate, Julien C., Julien S., Cl\u00e9ment, Omar, and many others!) for their help and support."}
{"title": "datasets-docs-update.md", "repo_owner": "huggingface", "repo_name": "blog", "text": "--- title: \"Introducing new audio and vision documentation in Datasets\" thumbnail: /blog/assets/87_datasets-docs-update/thumbnail.gif authors: - user: stevhliu --- # Introducing new audio and vision documentation in Datasets Open and reproducible datasets are essential for advancing good machine learning. At the same time, datasets have grown tremendously in size as rocket fuel for large language models. In 2020, Hugging Face launched Datasets, a library dedicated to: 1. Providing access to standardized datasets with a single line of code. 2. Tools for rapidly and efficiently processing large-scale datasets. Thanks to the community, we added hundreds of NLP datasets in many languages and dialects during the [Datasets Sprint]( But text datasets are just the beginning. Data is represented in richer formats like audio, images, and even a combination of audio and text or image and text. Models trained on these datasets enable awesome applications like describing what is in an image or answering questions about an image. The Datasets team has been building tools and features to make working with these dataset types as simple as possible for the best developer experience. We added new documentation along the way to help you learn more about loading and processing audio and image datasets. ## Quickstart The [Quickstart]( is one of the first places new users visit for a TLDR about a library\u2019s features. That\u2019s why we updated the Quickstart to include how you can use Datasets to work with audio and image datasets. Choose a dataset modality you want to work with and see an end-to-end example of how to load and process the dataset to get it ready for training with either PyTorch or TensorFlow. Also new in the Quickstart is the `to_tf_dataset` function which takes care of converting a dataset into a `tf.data.Dataset` like a mama bear taking care of her cubs. This means you don\u2019t have to write any code to shuffle and load batches from your dataset to get it to play nicely with TensorFlow. Once you\u2019ve converted your dataset into a `tf.data.Dataset`, you can train your model with the usual TensorFlow or Keras methods. Check out the [Quickstart]( today to learn how to work with different dataset modalities and try out the new `to_tf_dataset` function! Choose your dataset adventure! ## Dedicated guides Each dataset modality has specific nuances on how to load and process them. For example, when you load an audio dataset, the audio signal is automatically decoded and resampled on-the-fly by the `Audio` feature. This is quite different from loading a text dataset! To make all of the modality-specific documentation more discoverable, there are new dedicated sections with guides focused on showing you how to load and process each modality. If you\u2019re looking for specific information about working with a dataset modality, take a look at these dedicated sections first. Meanwhile, functions that are non-specific and can be used broadly are documented in the General Usage section. Reorganizing the documentation in this way will allow us to better scale to other dataset types we plan to support in the future. The guides are organized into sections that cover the most essential aspects of Datasets. Check out the [dedicated guides]( to learn more about loading and processing datasets for different modalities. 
## ImageFolder Typically, Datasets users [write a dataset loading script]( to download and generate a dataset with the appropriate `train` and `test` splits. With the `ImageFolder` dataset builder, you don\u2019t need to write any code to download and generate an image dataset. Loading an image dataset for image classification is as simple as ensuring your dataset is organized in a folder like: ```py folder/train/dog/golden_retriever.png folder/train/dog/german_shepherd.png folder/train/dog/chihuahua.png folder/train/cat/maine_coon.png folder/train/cat/bengal.png folder/train/cat/birman.png ``` Your dataset should look something like this once you've uploaded it to the Hub and preview it. Image labels are generated in a `label` column based on the directory name. `ImageFolder` allows you to get started instantly with an image dataset, eliminating the time and effort required to write a dataset loading script. But wait, it gets even better! If you have a file containing some metadata about your image dataset, `ImageFolder` can be used for other image tasks like image captioning and object detection. For example, object detection datasets commonly have *bounding boxes*, coordinates in an image that identify where an object is. `ImageFolder` can use this file to link the metadata about the bounding box and category for each image to the corresponding images in the folder: ```py {\"file_name\": \"0001.png\", \"objects\": {\"bbox\": [[302.0, 109.0, 73.0, 52.0]], \"categories\": [0]}} {\"file_name\": \"0002.png\", \"objects\": {\"bbox\": [[810.0, 100.0, 57.0, 28.0]], \"categories\": [1]}} {\"file_name\": \"0003.png\", \"objects\": {\"bbox\": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], \"categories\": [2, 2]}} dataset = load_dataset(\"imagefolder\", data_dir=\"/path/to/folder\", split=\"train\") dataset[0][\"objects\"] {\"bbox\": [[302.0, 109.0, 73.0, 52.0]], \"categories\": [0]} ``` You can use `ImageFolder` to load an image dataset for nearly any type of image task if you have a metadata file with the required information. Check out the [ImageFolder]( guide to learn more. ## What\u2019s next? Similar to how the first iteration of the Datasets library standardized text datasets and made them super easy to download and process, we are very excited to bring this same level of user-friendliness to audio and image datasets. In doing so, we hope it\u2019ll be easier for users to train, build, and evaluate models and applications across all different modalities. In the coming months, we\u2019ll continue to add new features and tools to support working with audio and image datasets. Word on the Hugging Face street is that there\u2019ll be something called `AudioFolder` coming soon! While you wait, feel free to take a look at the [audio processing guide]( and then get hands-on with an audio dataset like [GigaSpeech]( --- Join the [forum]( for any questions and feedback about working with audio and image datasets. If you discover any bugs, please open a [GitHub Issue]( so we can take care of it. Feeling a little more adventurous? Contribute to the growing community-driven collection of audio and image datasets on the [Hub]( [Create a dataset repository]( on the Hub and upload your dataset. If you need a hand, open a discussion on your repository\u2019s **Community tab** and ping one of the Datasets team members to help you cross the finish line!"}