In this glossary, I briefly list and explain terms used in the machine learning literature, with links to more information and important publications. Whenever I come across a term that I do not know or am unsure about, and then do some research to truly understand it, I will add it to this glossary. So expect this glossary to grow over time 🙂
Content
- Attention
- BERT (Bidirectional Encoder Representation from Transformers)
- CLIP (Contrastive Language–Image Pre-training)
- DALL-E
- Deep Autoregressive Models
- GPT (Generative Pre-trained Transformer)
- Label Smoothing
- Transformer
- VAE (Variational Autoencoders)
Attention
Attention is the key concept in transformers. It enables the network to model dependencies between different tokens or features in a sequence. The attention mechanism was first introduced for image recognition by Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu at NIPS 2014: “Recurrent models of visual attention”.
Attention was first used for language processing (machine translation) by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in the 2015 paper “Neural Machine Translation by Jointly Learning to Align and Translate”, and a bit later by Minh-Thang Luong, Hieu Pham, and Christopher D. Manning in the 2015 paper “Effective Approaches to Attention-based Neural Machine Translation”.
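To make the idea of "modeling dependencies between tokens" a bit more concrete, here is a minimal NumPy sketch of scaled dot-product attention. This is the variant used in the Transformer by Vaswani et al.; the papers mentioned above use different scoring functions. The function name, the shapes, and the random toy inputs are my own assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Illustrative sketch: queries (n, d), keys (m, d), values (m, d_v)."""
    d = queries.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d).
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax over the keys (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ values, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 toy tokens with 8 features each
# Self-attention: queries, keys, and values all come from the same tokens.
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how strongly one token attends to every other token. A real transformer would first apply learned linear projections to obtain queries, keys, and values, which this sketch leaves out.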
(written March 2021)
Further reading:
- Great video by Yannic Kilcher explaining attention
- Blog post by Sebastian Ruder explaining different kinds of attention
BERT (Bidirectional Encoder Representation from Transformers)
BERT is a language model based on the Transformer, introduced by Google in 2018 (the paper was published in 2019). It uses only the encoder side of the original transformer. In contrast to GPT, BERT is trained bidirectionally on sentences: words are masked in the training set, and BERT is trained to predict these masked words from the context on both sides. This kind of training is called auto-encoding or masked generation. The BERT model achieved state-of-the-art results in many NLP tasks.
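The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing: the BERT paper selects about 15% of the tokens and sometimes substitutes a random token or keeps the original instead of always inserting `[MASK]`, and it works on WordPiece tokens rather than whole words. The function name and the example sentence are my own assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token  # the model is trained to predict this
        else:
            masked.append(token)
    return masked, targets

sentence = "the cat sat on the mat".split()
# A higher mask_prob than 0.15 so that this short toy sentence is likely to get masks.
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

During training, the model sees `masked_sentence` as input and is penalized only on the positions stored in `targets`.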
(written March 2021)
Further reading:
- Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google, 2019
- Jay Alammar: The Illustrated BERT
- A great, detailed technical explanation of BERT in the form of a notebook: BERT Inner Workings by George Mihaila
CLIP (Contrastive Language–Image Pre-training)
CLIP is a system, introduced by OpenAI in 2021, to learn associations between images and captions describing the image. It uses contrastive learning.
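As a rough idea of what contrastive learning means here, the following NumPy sketch computes a symmetric contrastive loss over a batch of image and caption embeddings: matching pairs should be more similar than all mismatched pairs in the batch. It is only a sketch under my own assumptions. The embeddings are random toy vectors instead of encoder outputs, the temperature is a fixed value chosen for illustration (CLIP learns it), and all names are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric cross-entropy over an (n, n) image-caption similarity matrix."""
    # L2-normalise so that the dot product is the cosine similarity.
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Matching pairs sit on the diagonal: image i belongs to caption i.
    diag = np.arange(len(logits))
    loss_images = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> caption
    loss_texts = -log_softmax(logits, axis=0)[diag, diag].mean()   # caption -> image
    return (loss_images + loss_texts) / 2

rng = np.random.default_rng(0)
image_batch = rng.normal(size=(5, 16))
# Toy check: identical embeddings stand in for perfectly aligned pairs,
# reversed rows stand in for (mostly) mismatched pairs.
aligned_loss = contrastive_loss(image_batch, image_batch)
shuffled_loss = contrastive_loss(image_batch, image_batch[::-1])
```

The loss is low when each image is most similar to its own caption and high when the pairs are mixed up, which is the signal that drives training.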
(written March 2021)
Further reading:
DALL-E
DALL-E is a transformer-based architecture that is able to generate images from text descriptions. DALL-E was developed by OpenAI and published in January 2021. It is based on the GPT-3 architecture and consists of 12 billion parameters. During training, DALL-E is presented with text-image pairs.
(written March 2021)
Further reading:
- Blog about DALL-E on the OpenAI website
- Original paper from Aditya Ramesh et al.: Zero-Shot Text-to-Image Generation, 2021
Deep Autoregressive Models
Autoregressive models have been developed in statistics, economics, and signal processing. They describe a time-varying process whose current value depends linearly on its own previous values and on a stochastic term (like random noise). Deep autoregressive models use deep neural networks to model the current value based on the previous values. As the model generates an output based on previous values, it is a generative model. More formally, a generative model captures the joint distribution P(X, Y) of the observation X and the target output Y. In contrast, discriminative models capture the conditional distribution P(Y|X).
In NLP, GPT is a transformer based on autoregression (in contrast to BERT, which is a transformer based on autoencoding or masked generation).
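The classical linear case is easy to sketch in plain Python: each new value is a weighted sum of the previous values plus Gaussian noise. The function name, the coefficients, and the zero-valued initial history are my own assumptions for illustration.

```python
import random

def simulate_ar(coefficients, steps, noise_std=1.0, seed=0):
    """Simulate x_t = sum_i coefficients[i] * x_{t-1-i} + noise_t."""
    rng = random.Random(seed)
    p = len(coefficients)
    series = [0.0] * p  # start from zeros as the initial history
    for _ in range(steps):
        # Most recent value first, so it pairs with coefficients[0].
        history = series[-1 : -p - 1 : -1]
        mean = sum(c * x for c, x in zip(coefficients, history))
        series.append(mean + rng.gauss(0.0, noise_std))
    return series[p:]

# An AR(2) process with hypothetical coefficients.
samples = simulate_ar(coefficients=[0.6, -0.2], steps=100)
```

A deep autoregressive model keeps this generation loop but replaces the linear weighted sum with a neural network that predicts (the distribution of) the next value from the history.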
(written March 2021)
Further reading:
- Blog post from George Ho on Deep Autoregressive Models
- Notes from the Stanford course on deep generative models
GPT (Generative Pre-trained Transformer)
GPT is a transformer working with generative autoregression. In particular, GPT-2 and GPT-3 attracted a lot of attention as they scaled the number of parameters to new limits and achieved impressive results. GPT uses only the decoder blocks of the original transformer model proposed by Vaswani et al., which consists of an encoder and a decoder.
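In decoder blocks, the autoregressive behaviour is enforced by a causal mask in the attention layer: each position may attend only to itself and to earlier positions, never to future tokens. Here is a minimal NumPy sketch of that masking step. The function name and the random toy scores are my own assumptions, and a real model would compute the scores from learned queries and keys.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores (n, n) where position i may only see positions <= i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    # Future positions get -inf, so they receive zero weight after the softmax.
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
```

The resulting matrix is lower triangular: the first token can only attend to itself, the second to the first two tokens, and so on. This is what allows the model to be trained on next-token prediction for all positions of a sequence at once.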
(written March 2021)
Further reading:
- Alec Radford: Improving Language Understanding with Unsupervised Learning, 2018
- Alec Radford et al.: Better Language Models and Their Implications (GPT-2), 2019
- Tom B Brown et al.: Language Models are Few-Shot Learners, 2020. The original GPT-3 paper on arxiv
- Jay Alammar: The Illustrated GPT-2
Label Smoothing
Label Smoothing, or Label Smoothing Regularization (LSR), is a technique to regularize a classifier during training. It was introduced by Christian Szegedy et al. in the 2015 paper “Rethinking the Inception Architecture for Computer Vision”.
In a multilayer neural network classifier, the activation of the output layer is usually calculated with the softmax function, and the network is trained with the negative log-likelihood (NLL) against the one-hot labels of the training set. This encourages the network to push the output of the correct unit towards 1 and the outputs of all other units towards zero. This behavior can lead to over-fitting and weaker generalization, as the network becomes too confident about its predictions.
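Szegedy et al. propose to smooth the hard one-hot distribution of the labels by mixing in an additional distribution. For a training example $x$ with label $y$, the original desired output of the network is $q(y|x) = 1$ and $q(k|x) = 0$ for all $k \neq y$. We can say that $q(k|x) = \delta_{k,y}$, where $\delta_{k,y}$ is the Dirac delta with $\delta_{k,y} = 1$ if $k = y$ and 0 for all $k \neq y$.
Szegedy et al. propose a new label distribution $q'(k|x)$ with:
$$q'(k|x) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\,u(k)$$
which is a mixture of the original distribution $q(k|x)$ and a fixed distribution $u(k)$. The paper proposes a simple uniform distribution $u(k) = 1/K$, where $K$ is the number of different labels. The paper describes ImageNet experiments with $K = 1000$ classes and $\epsilon = 0.1$. They reported a consistent improvement of about 0.2% (absolute, for both top-1 and top-5 error) by Label Smoothing Regularization.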
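With a uniform fixed distribution, the smoothed target for class k is (1 - epsilon) if k is the correct label, else 0, plus epsilon / K in both cases. Here is a minimal sketch in plain Python that builds such targets and computes the cross-entropy against a softmax output. The function names, the toy logits, and the choice of 4 classes are my own assumptions for illustration.

```python
import math

def smooth_labels(label, num_classes, epsilon=0.1):
    """Return q'(k) = (1 - epsilon) * delta(k, label) + epsilon / num_classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) * (1.0 if k == label else 0.0) + uniform
            for k in range(num_classes)]

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(q * (z - log_norm) for q, z in zip(target, logits))

targets = smooth_labels(label=2, num_classes=4, epsilon=0.1)
loss = cross_entropy(targets, logits=[0.5, -1.0, 3.0, 0.2])
```

In this toy case the correct class gets a target of roughly 0.925 and each other class roughly 0.025, so the network is no longer rewarded for pushing the correct output all the way to 1. In practice you would use the label smoothing option of your framework rather than writing this by hand.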
Further reading:
- Original paper from Christian Szegedy et al.: “Rethinking the Inception Architecture for Computer Vision”, 2015
- How to implement Label Smoothing in TensorFlow or PyTorch (StackOverflow)
Transformer
The Transformer is a model architecture initially developed for natural language tasks by Google in 2017. It consists of feed-forward networks and attention layers and does not use convolutional or recurrent networks. In particular, the transformers BERT (from Google) and GPT (from OpenAI) have been hugely successful and achieved state-of-the-art results on many NLP tasks. Another big advantage of transformers is their suitability for transfer learning: transformers can be pre-trained on large data sets, and the pre-trained language model can then be fine-tuned for specific downstream tasks with much less training effort.
The original transformer model by Vaswani et al. consists of an encoder and a decoder (like previous Seq-to-Seq models for machine translation based on LSTMs). The subsequent model BERT uses only the encoder side of the transformer, whereas GPT uses only the decoder side.
Recently, transformers have also been applied successfully to other tasks like visual pattern recognition and multi-modal modeling.
(Updated May 2021)
Further reading:
- Blog post from George Ho on Transformers in NLP
- Original paper from Vaswani (Google) et al. “Attention Is All You Need”, 2017
- Jay Alammar: The illustrated Transformer
- The Annotated Transformer: The paper from Vaswani et al. annotated with Python code
- Pretrained Transformers Models in PyTorch Using Hugging Face Transformers
- Fine-tune Transformers in PyTorch Using Hugging Face Transformers
VAE (Variational Autoencoders)
Deep generative models are able to produce highly realistic content of various kinds, such as images, texts, or sounds. Variational Autoencoders (VAEs) are one type of deep generative model; another is Generative Adversarial Networks (GANs).
(written March 2021)
Further reading: