Machine Learning Terms Explained

In this glossary, I briefly list and explain terms used in the machine learning literature, with links to more information and important publications. Whenever I come across a term I do not know or am unsure about, I do some research to truly understand it and then add it to this glossary. So expect this glossary to grow over time 🙂



Attention

Attention is the key concept in transformers. It enables the network to model dependencies between different tokens or features in a sequence. An early neural attention mechanism was introduced for image recognition by Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu at NIPS 2014: “Recurrent Models of Visual Attention”.
Attention was first used for language processing (machine translation) by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in 2015: “Neural Machine Translation by Jointly Learning to Align and Translate”, and a bit later by Minh-Thang Luong, Hieu Pham, and Christopher D. Manning in 2015: “Effective Approaches to Attention-based Neural Machine Translation”.
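
To make the idea of "modeling dependencies between tokens" more concrete, here is a minimal sketch of one common variant, the scaled dot-product attention used in the transformer by Vaswani et al. The papers cited above use different scoring functions, so this is an illustration and not their exact method. The function names and the toy vectors are my own assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    Each query is compared with every key; the resulting weights say
    how strongly each position attends to every other position.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy "sequence" of three 2-dimensional token vectors (made up).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1. In this toy input the first token attends more to the third token than to the second, because its vector overlaps with the third but is orthogonal to the second.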
(written March 2021)

Further reading:

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a language model based on the Transformer, introduced by Google in 2018 (the paper was published at NAACL 2019). It uses only the encoder side of the original transformer. In contrast to the GPT model, BERT is trained bidirectionally on sentences: words in the training set are masked, and BERT is trained to predict these masked words from the context on both sides. Its training is based on autoencoding, also called masked generation. The BERT model achieved state-of-the-art results on many NLP tasks.
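
The masking step can be sketched in a few lines. This is a simplified illustration, not the exact BERT procedure: the BERT paper masks about 15% of the tokens and sometimes substitutes a random token or keeps the original one, while this sketch only replaces tokens with a mask symbol. The function name, the `[MASK]` string, and the parameters are assumptions for the example.

```python
import random

MASK = "[MASK]"  # placeholder symbol, chosen for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens and remember what was hidden.

    Returns the masked sequence (the model input) and a dict that maps
    each masked position to its original token (the training target).
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


sentence = "the cat sat on the mat".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

The model sees `masked` and is trained to recover the entries of `targets`. It may use the words on both sides of each gap, which is what "bidirectional" refers to.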
(written March 2021)

Further reading:

CLIP (Contrastive Language–Image Pre-training)

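CLIP is a system introduced by OpenAI in 2021 that learns associations between images and the captions describing them. It uses contrastive learning: matching image-caption pairs are pulled together in a shared embedding space, while mismatched pairs are pushed apart.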
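
A rough sketch of a contrastive objective of this kind follows: a symmetric cross-entropy over the similarity matrix of a batch of image and caption embeddings. The embeddings here are hand-made toy vectors, and the function names and the temperature value are assumptions for illustration. In the real system the embeddings come from trained image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to `target`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled similarity between image i and caption j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in txts] for img in imgs]
    # Each image should pick its own caption, and each caption its own image.
    loss_images = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    loss_texts = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (loss_images + loss_texts) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is most similar to its own caption (`aligned`) and large when the captions are swapped (`swapped`).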
(written March 2021)

Further reading:


DALL-E

DALL-E is a transformer-based architecture that can generate images from text descriptions. It was developed by OpenAI and published in January 2021. It is based on the GPT-3 architecture and has 12 billion parameters. During training, DALL-E is presented with text-image pairs.
(written March 2021)

Further reading:

Deep Autoregressive Models

Autoregressive models were developed in statistics, economics, and signal processing. They describe a time-varying process whose current value depends linearly on its own previous values plus a stochastic term (such as random noise). Deep autoregressive models use deep neural networks to model the current value based on the previous values. Because the model generates an output based on previous values, it is a generative model. More formally, a generative model describes the joint distribution P(X, Y) of the observation X and the target output Y. In contrast, discriminative models describe the conditional distribution P(Y|X).
In NLP, GPT is a transformer based on autoregression (in contrast to BERT, which is a transformer based on autoencoding or masked generation).
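
The classical linear case can be sketched as follows. This is a minimal illustration with made-up coefficients, Gaussian noise as the stochastic term, and a hypothetical function name. A deep autoregressive model would replace the weighted sum with a neural network.

```python
import random


def simulate_ar(coeffs, n_steps, noise_std=1.0, seed=0):
    """Generate a series where each value is a linear combination of the
    previous len(coeffs) values plus Gaussian noise.

    coeffs[0] weights the most recent value, coeffs[1] the one before, etc.
    The series starts from zeros, which is an arbitrary choice.
    """
    rng = random.Random(seed)
    p = len(coeffs)
    series = [0.0] * p  # initial history
    for _ in range(n_steps):
        recent = series[-p:][::-1]  # most recent value first
        mean = sum(c * x for c, x in zip(coeffs, recent))
        series.append(mean + rng.gauss(0.0, noise_std))
    return series[p:]  # drop the artificial initial history


# Each new value depends on the two values before it.
samples = simulate_ar([0.6, 0.3], n_steps=50, noise_std=0.5, seed=42)
```

Generation proceeds one value at a time, and every new value is fed back as input for the next step. GPT generates text in the same way, token by token.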
(written March 2021)

Further reading:

GPT (Generative Pre-trained Transformer)

GPT is a transformer that works by generative autoregression. In particular, GPT-2 and GPT-3 attracted a lot of attention as they scaled the number of parameters to new limits and achieved impressive results. GPT uses the decoder blocks of the original transformer model proposed by Vaswani et al., which consists of an encoder and a decoder.
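
Decoder blocks enforce the autoregressive property with masked (causal) self-attention: each position may only attend to itself and to earlier positions. The text above does not spell this out, so take the following as a small sketch of that one detail. The function name and the toy scores are assumptions.

```python
import math


def causal_attention_weights(scores):
    """Turn a square matrix of attention scores into causal weights.

    Row i is a softmax over positions 0..i only; later positions get
    weight 0, so a token never looks at tokens that come after it.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # current and earlier positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# With identical scores, each token spreads its attention evenly
# over itself and its predecessors.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

The first row of `weights` puts all weight on the first token, the second row splits it evenly over the first two tokens, and so on. The upper triangle of the matrix is zero.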
(written March 2021)

Further reading:

Label Smoothing

Label Smoothing, or Label Smoothing Regularization (LSR), is a technique to regularize a classifier during training. It was introduced by Christian Szegedy et al. in the 2015 paper “Rethinking the Inception Architecture for Computer Vision”.

In a multilayer neural network classifier, the activation of the output layer is usually calculated with the softmax function, and the network is trained by minimizing the negative log-likelihood (NLL) against the one-hot labels of the training set. This encourages the network to push the output of the correct unit towards 1 and the outputs of all other units towards zero. This behavior can lead to overfitting and weaker generalization, as the network becomes too confident about its predictions.
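
A small sketch of softmax and the NLL loss shows why one-hot targets reward ever larger logits. The logit values and function names are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(probs, label):
    """Negative log-likelihood of the correct class under one-hot labels."""
    return -math.log(probs[label])


label = 0
moderate = nll(softmax([2.0, 1.0, 0.1]), label)    # fairly sure of class 0
confident = nll(softmax([20.0, 1.0, 0.1]), label)  # extremely sure of class 0
```

`confident` is much smaller than `moderate`, and the loss only reaches zero when the correct logit grows without bound relative to the others. With hard one-hot targets, nothing stops the network from becoming more and more confident.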

Szegedy et al. propose to smooth the hard one-hot distribution of the labels by mixing in an additional distribution. For a training example x with label y, the original desired output of the network is q(y|x) = 1 and q(k|x) = 0 for all k \neq y. We can write this as q(k|x) = \delta_{k,y}, where \delta_{k,y} is the Kronecker delta (called the Dirac delta in the paper), with \delta_{k,y} = 1 if k = y and \delta_{k,y} = 0 for all k \neq y.

Szegedy et al. propose a new label distribution q'(k|x) with:

    \[q'(k|x) = (1 - \epsilon) \delta_{k,y} + \epsilon u(k)\]

which is a mixture of the original distribution q(k|x) and a fixed distribution u(k), weighted by 1 - \epsilon and \epsilon. The paper proposes a simple uniform distribution u(k) = 1/K, where K is the number of different labels. The paper describes ImageNet experiments with K = 1000 classes and \epsilon = 0.1 and reports a consistent improvement of about 0.2\% absolute from Label Smoothing Regularization.
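
The formula for q'(k|x) with the uniform u(k) = 1/K translates directly into code. The following is a minimal sketch; the function names and the example prediction are my own and not taken from the paper.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """q'(k|x) = (1 - epsilon) * delta_{k,y} + epsilon * u(k), with u(k) = 1/K."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) * (1.0 if k == label else 0.0) + uniform
            for k in range(num_classes)]


def cross_entropy(target_dist, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(q * math.log(p) for q, p in zip(target_dist, probs) if q > 0)


# Small example with K = 4 classes and the correct label y = 2.
hard = [0.0, 0.0, 1.0, 0.0]
soft = smooth_labels(2, num_classes=4, epsilon=0.1)

# A very confident (made-up) prediction for class 2.
prediction = [0.01, 0.01, 0.97, 0.01]
hard_loss = cross_entropy(hard, prediction)
soft_loss = cross_entropy(soft, prediction)
```

Here the correct class receives 0.9 + 0.1/4 = 0.925 and every other class 0.1/4 = 0.025. With the values from the paper (K = 1000, \epsilon = 0.1), the correct class would receive 0.9001 and every other class 0.0001. Against the smoothed target, the overconfident prediction is penalized more than against the one-hot target, which is the regularizing effect.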

Further reading:


Transformers

Transformers are a class of models initially developed for natural language tasks by Google in 2017. They consist of feed-forward networks and attention layers and do not use convolutional or recurrent layers. In particular, the transformers BERT (from Google) and GPT (from OpenAI) have been hugely successful and achieved state-of-the-art results on many NLP tasks. Another big advantage of transformers is their suitability for transfer learning: they can be pre-trained on large data sets, and the pre-trained language model can then be fine-tuned for specific downstream tasks with much less training effort.

The original transformer model by Vaswani et al. consists of an encoder and a decoder (like the earlier LSTM-based Seq-to-Seq models for machine translation). The subsequent model BERT uses only the encoder side of the transformer, while GPT uses only the decoder side.

More recently, transformers have also been applied successfully to other tasks such as visual pattern recognition and multi-modal modeling.
(Updated May 2021)

Further reading:

VAE (Variational Autoencoders)

Deep generative models are able to produce highly realistic content of various kinds, such as images, texts, or sounds. Variational Autoencoders (VAEs) are one type of deep generative model; another prominent type is Generative Adversarial Networks (GANs).
(written March 2021)

Further reading:
