Today, virtually every cutting-edge AI product and model uses a transformer architecture. Large language models (LLMs) such as GPT-4o, Llama, Gemini and Claude are all transformer-based, and other AI applications such as text-to-speech, automatic speech recognition, image generation and text-to-video models have transformers as their underlying technology.
With the hype around AI unlikely to slow down anytime soon, it is time to give transformers their due, which is why I would like to explain a little about how they work, why they are so important for the growth of scalable solutions and why they are the backbone of LLMs.
Transformers are more than meets the eye
In short, a transformer is a neural network architecture designed to model sequences of data, making it ideal for tasks such as language translation, sentence completion, automatic speech recognition and more. Transformers have become the dominant architecture for many of these sequence modeling tasks because the underlying attention mechanism can be easily parallelized, allowing for massive scale during training and inference.
Originally introduced in a 2017 paper, “Attention Is All You Need,” from Google researchers, the transformer was introduced as an encoder-decoder architecture specifically designed for language translation. The following year, Google released bidirectional encoder representations from transformers (BERT), which could be considered one of the first LLMs, although it is now considered small by today’s standards.
Since then, and especially accelerated with the advent of OpenAI’s GPT models, the trend has been to train bigger and bigger models with more data, more parameters and longer context windows.
To facilitate this evolution, there have been many innovations, such as: more advanced GPU hardware and better software for multi-GPU training; techniques such as quantization and mixture of experts (MoE) to reduce memory consumption; new optimizers for training, such as Shampoo and AdamW; and techniques for efficiently computing attention, such as FlashAttention and KV caching. The trend will likely continue for the foreseeable future.
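To give a flavor of one of these optimizations, below is a minimal sketch of the idea behind KV caching during autoregressive generation: the keys and values already computed for earlier tokens are stored and reused, so each new token only adds one entry to the cache instead of recomputing attention over the whole sequence. The dimensions, random projection matrices and function name here are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

d = 8                                    # toy model/head dimension (illustrative)
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d, d))        # toy projection matrices, not learned weights
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

k_cache, v_cache = [], []                # grows by one entry per generated token

def attend_to_past(x_new):
    """Attention output for the newest token, reusing cached keys/values."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)          # only the new token's key and value are computed
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)                # (t, d): every key seen so far
    V = np.stack(v_cache)                # (t, d)
    scores = K @ q / np.sqrt(d)          # (t,): new query against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over past positions
    return weights @ V                   # weighted sum of cached values

# Simulate decoding five tokens one at a time.
for step in range(5):
    context = attend_to_past(rng.standard_normal(d))
    print(step, context.shape)           # each new step reuses the cache instead of recomputing all pairs
```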
The importance of self-attention in transformers
Depending on the application, a transformer model follows an encoder-decoder architecture. The encoder component learns a vector representation of data that can then be used for downstream tasks such as classification and sentiment analysis. The decoder component takes a vector or latent representation of the text or image and uses it to generate new text, making it useful for tasks such as sentence completion and summarization. For this reason, many familiar state-of-the-art models, such as the GPT family, are decoder-only.
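As a simplified illustration of these two roles, here is a short sketch assuming the Hugging Face transformers library and the publicly available bert-base-uncased and gpt2 checkpoints: an encoder-only model produces a vector representation of a sentence, while a decoder-only model generates a continuation of a prompt.

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# Encoder-only (BERT): turn a sentence into a vector for downstream tasks.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tok("Transformers changed NLP.", return_tensors="pt")
sentence_vector = encoder(**inputs).last_hidden_state.mean(dim=1)  # one vector per sentence

# Decoder-only (GPT-2): continue a prompt token by token.
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tok("Transformers changed NLP because", return_tensors="pt")
generated = decoder.generate(**prompt, max_new_tokens=20)
print(dec_tok.decode(generated[0], skip_special_tokens=True))
```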
Encoder-decoder models combine both components, making them useful for translation and other sequence-to-sequence tasks. For both encoder and decoder architectures, the core component is the attention layer, as this is what allows a model to retain context from words that appear much earlier in the text.
Attention comes in two flavors: self-attention and cross-attention. Self-attention is used to capture relationships between words within the same sequence, while cross-attention is used to capture relationships between words across two different sequences. Cross-attention connects the encoder and decoder components in a model during translation. For example, it allows the English word “strawberries” to relate to the French word “fraises.” Mathematically, both self-attention and cross-attention are different forms of matrix multiplication, which can be done extremely efficiently using a GPU.
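To make the matrix-multiplication point concrete, here is a minimal numpy sketch of scaled dot-product attention. In this simplified formulation, the only difference between self-attention and cross-attention is whose hidden states supply the queries versus the keys and values; the dimensions and random inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V — all plain matrix products."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (m, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (m, d) context vectors

rng = np.random.default_rng(0)
english = rng.standard_normal((6, 16))   # toy hidden states for 6 English tokens
french = rng.standard_normal((7, 16))    # toy hidden states for 7 French tokens

# Self-attention: queries, keys and values all come from the same sequence.
self_out = scaled_dot_product_attention(english, english, english)

# Cross-attention: the decoder's (French) queries attend to the encoder's (English) keys/values.
cross_out = scaled_dot_product_attention(french, english, english)
print(self_out.shape, cross_out.shape)   # (6, 16) (7, 16)
```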
Because of the attention layer, transformers can better capture relationships between words separated by long stretches of text, whereas earlier models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) models lose track of the context of words from earlier in the text.
The future of models
Currently, transformers are the dominant architecture for many use cases that require LLMs, and they benefit from the most research and development. Although this does not seem likely to change anytime soon, a different class of model that has recently gained interest is state-space models (SSMs) such as Mamba. These highly efficient algorithms can handle very long sequences of data, whereas transformers are limited by a context window.
For me, the most exciting applications of transformer models are multimodal models. OpenAI’s GPT-4o, for example, is able to handle text, audio and images, and other providers are starting to follow. Multimodal applications are very diverse, ranging from video captioning to voice cloning to image segmentation (and more). They also present an opportunity to make AI more accessible to people with disabilities. For example, a blind person could be greatly served by the ability to interact through the voice and audio components of a multimodal application.
It is an exciting space with plenty of potential to uncover new use cases. But remember that, at least for the foreseeable future, these applications are largely underpinned by the transformer architecture.
Terrence Alsup is a senior data scientist at Finastra.