
What Are Transformer Models In Machine Learning?

Language modelling and machine translation systems have traditionally used approaches like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, but architectural limitations of these approaches led to the need for an alternative. This is where the transformer model comes in.

A deep learning model introduced in 2017, the transformer is used in Natural Language Processing (NLP) tasks. It was introduced and described in a paper titled ‘Attention Is All You Need’ by Ashish Vaswani et al. In its abstract, the authors explain that dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely,” the abstract reads, adding that testing was carried out on two machine translation tasks to show that the transformer model is superior in quality.

Since its introduction, the transformer model has seen widespread use in machine learning, and several AI service providers use the technology in their services. Despite this, many have difficulty understanding how the model works.

While the paper that introduced the transformer model may be of interest to industry professionals, it is highly technical and may need to be read several times before it is clear what exactly the transformer model is.

The approach is based on an encoder-decoder architecture and can be visualized as two blocks side by side. On the left is the encoder, which has two sub-layers, and on the right is the decoder, which has three sub-layers.

Both have multi-head attention and feed-forward sub-layers, with an additional sub-layer, masked multi-head attention, in the decoder.

Multiple encoders and decoders are stacked, with the output of one layer connecting to the input of the next.
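To make this concrete, here is a rough sketch of one encoder layer in PyTorch. The default sizes mirror the paper’s (model width 512, 8 attention heads), but the layer is simplified and omits details such as dropout and positional encoding; treat it as an illustration rather than a faithful implementation.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: a multi-head self-attention sub-layer
    followed by a position-wise feed-forward sub-layer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: self-attention (queries, keys and values all come from x)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)  # residual connection + layer norm, as in the paper
        # Sub-layer 2: position-wise feed-forward network
        return self.norm2(x + self.feed_forward(x))

# Stacking layers: the output of one layer feeds the input of the next
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])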

Looking at the architecture of the model, one could argue that not much has changed from the classic encoder-decoder architecture, and this is true. What makes this approach different is that it eliminates the recurrent layers traditionally used and instead relies on self-attention.

This leads us to the question: what is self-attention? According to the 2017 paper by Vaswani et al., self-attention, which is also referred to as intra-attention, is “an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.”

The authors explain that the mechanism has been used successfully in tasks like reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.

However, the authors state that this is, to their knowledge, the first transduction model that relies entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

There is a formula, and a step-by-step visualization of the calculation gives a better understanding of how this works, but a machine learning consultancy would explain it in words using a simple example of a phrase like “hot coffee.”
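For reference, the scaled dot-product attention formula from the paper is:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

where Q, K and V are matrices packing together the query, key and value vectors, and dk is the dimension of the key vectors.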

Using the transformer model, the self-attention for the word “hot” in the sequence is calculated in a few steps. The input vector x1, representing “hot”, is multiplied by three different weight matrices, Wq, Wk, and Wv. This gives three different vectors: q1, k1, and v1. The same process applies to input vector x2, which represents “coffee”.

Next, the dot product of q1 with k1 and with k2 is taken to calculate two scores, which tell you how much focus should be placed on each of the two words when the transformer model processes the word “hot.”

The scores are then divided by the square root of the dimension of the key vector, and a softmax is applied. The weighted value vector of each word can now be calculated by multiplying the softmax scores by the corresponding value vectors.

These two weighted vectors are then summed to get the final vector z1, which is the output of the self-attention layer for the first input.
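Here is a minimal NumPy sketch of the four steps just described. The dimensions and random values are purely illustrative assumptions, not values from the paper.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 4, 3                        # toy dimensions, chosen for illustration
x1 = rng.normal(size=d_model)              # embedding for "hot"
x2 = rng.normal(size=d_model)              # embedding for "coffee"
Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]

# Step 1: project each input into query, key and value vectors
q1, k1, v1 = x1 @ Wq, x1 @ Wk, x1 @ Wv
q2, k2, v2 = x2 @ Wq, x2 @ Wk, x2 @ Wv

# Step 2: score "hot" against every word in the sequence
scores = np.array([q1 @ k1, q1 @ k2])

# Step 3: scale by the square root of d_k and normalise with softmax
weights = softmax(scores / np.sqrt(d_k))

# Step 4: weight the value vectors and sum them to get z1
z1 = weights[0] * v1 + weights[1] * v2
print(z1)                                  # output of the self-attention layer for "hot"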

Since the model uses multi-head self-attention, this process is carried out multiple times in parallel with different weight matrices. This is a basic description of the transformer model, which many an AI app development company now relies on in machine learning thanks to its comparatively simple architecture.
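A multi-head version of the toy example might look like the following in NumPy. The head count and dimensions are illustrative assumptions, and the paper additionally multiplies the concatenated heads by a final projection matrix, which is omitted here.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention over all positions at once
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(1)
d_model, d_k, n_heads = 8, 2, 4      # illustrative sizes, not the paper's
X = rng.normal(size=(2, d_model))    # two token embeddings: "hot", "coffee"

heads = []
for _ in range(n_heads):             # each head has its own weight matrices
    Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
    heads.append(attention(X, Wq, Wk, Wv))

Z = np.concatenate(heads, axis=-1)   # heads are concatenated into one output
print(Z.shape)                       # (2, 8): one output vector per word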