
How to Make an AI Model: The Math Behind GPT

By Juraj S. · 12 min read · May 20, 2024 · Technology

Contents:

1. What are generative AI systems?

2. Model architecture

3. Training data: The backbone of generative AI models

4. Open-source models

5. What are the benefits of developing your own AI model?

6. Wrapping up

If you stumbled upon this blog, chances are you're looking to build your own AI model from scratch—or at least understand what that means. To get there, we will need to cover a lot of math and theory. But don't worry, you can do it!

Most of us are acquainted with generative AI, more specifically, GPTs, which have gained significant attention in recent years. This blog will focus on this type of AI.

1. What are generative AI systems?

Generative AI refers to the subset of artificial intelligence focused on building AI systems that create new data samples resembling those in the training dataset. This capability includes generating images, text, audio, and more.

Generative AI dates back to the 1960s and has since undergone many innovations and architectural changes, evolving from simple pattern recognition programs to complex systems capable of generating high-fidelity outputs. The architecture behind today's state-of-the-art models was introduced in 2017 with the seminal paper “Attention is All You Need”. This paper marked the debut of the Transformer, the foundation of modern generative AI models like GPT. The Transformer is the T in GPT, which stands for Generative Pre-trained Transformer.

1.2. Attention is all you need

The paper built its architecture around a mechanism called self-attention, which models how each token depends on every other token in the sequence.

On its own, the embedding for each token is a high-dimensional vector that encodes the meaning of that token alone, without any context. The Transformer computes attention scores between all pairs of tokens in the sequence and uses them to mix information across tokens. This is called self-attention, a process that enhances the model's ability to manage context and meaning in text-based tasks.

1.3. The transformer: Revolutionizing NLP with self-attention

Transformers revolutionized natural language processing (NLP). The Transformer was the first transduction model relying entirely on self-attention to compute representations of its input and output, eliminating the need for recurrent layers.

Because self-attention looks at all tokens at once, a Transformer handles sentence structure more flexibly and efficiently, which means we don’t need to use RNNs (Recurrent Neural Networks) or CNNs (Convolutional Neural Networks).

2. Model architecture

In this section, we will go into building sequence transduction models, which are the cornerstone of text-to-text generative AI. These models, including variants like text-to-speech or image-text-to-json, are designed to generate coherent and contextually relevant outputs based on various forms of input data.

2.1. Encoder and decoder stacks

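The original Transformer uses an encoder-decoder architecture, which is the one we describe here. GPT-style models keep only the decoder stack, but the building blocks are the same.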

The encoder maps an input sequence of symbols (x1,...,xn) into a series of vector embeddings z = (z1,...,zn).

Given z, the decoder generates an output sequence (y1,...,ym) of symbols, one token at a time. The model is auto-regressive: at each step, it takes all previously generated tokens as additional input when producing the next one, which improves the coherence of the output.
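The auto-regressive loop can be sketched in a few lines. This is a toy illustration, not a real decoder: `toy_next_token`, `TOY_TABLE`, and the token strings are hypothetical stand-ins for a trained model and its vocabulary.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int,
             eos: str = "<eos>") -> List[str]:
    """Generate tokens one at a time, feeding every previous token back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # the "model" sees all tokens produced so far
        tokens.append(tok)
        if tok == eos:             # stop once the end-of-sequence token appears
            break
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(tokens: List[str]) -> str:
    return TOY_TABLE.get(tokens[-1], "<eos>")

result = generate(toy_next_token, ["the"], max_new_tokens=10)
print(result)  # ['the', 'cat', 'sat', '<eos>']
```

A real model would look at the whole list of tokens rather than just the last one, but the control flow is the same: predict, append, repeat.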

2.2. Advanced attention mechanisms

The attention function in our model can be described as a way to dynamically prioritize different parts of the input data. It works by mapping a set of key-value pairs to an output, where the query, keys, values, and output are all represented as vectors.

So, in this system, each token is represented not by a single vector but by three: the Query (Q), Key (K), and Value (V). The weight for each Value is determined by how compatible the Query is with its corresponding Key.

Finally, the output is calculated by taking a weighted sum of the Values for each token.

I have described them as vectors for each token, but each Query, Key, and Value vector is part of a larger Query, Key, and Value matrix. These matrices contain the values of the entire input sequence.
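As a rough illustration of how the per-token vectors stack into matrices, the sketch below projects a small matrix of token embeddings into Q, K, and V. The sizes and the names `X`, `W_q`, `W_k`, and `W_v` are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8   # toy sizes, chosen only for illustration

X = rng.normal(size=(seq_len, d_model))   # one embedding row per token

# Projection matrices (random here; training would learn them).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # row i is the Query vector of token i
K = X @ W_k   # row i is the Key vector of token i
V = X @ W_v   # row i is the Value vector of token i

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```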

[Figure: How Transformers work in deep learning and NLP]

2.2.1. A closer look at scaled dot-product attention

The compatibility between queries and keys is computed using the dot product of the two vectors. The dot product measures the similarity between the vectors; if they point in similar directions in high-dimensional space, their dot product will be larger, indicating a higher degree of similarity.

Scaled dot-product attention additionally divides the dot product by the square root of the head size; said differently, it is scaled by 1/√d_k. If the components of Q and K are independent with zero mean and unit variance, their raw dot product has variance d_k, so dividing by √d_k brings it back to unit variance. This keeps the softmax diffuse instead of letting it saturate.
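A quick numerical check of the scaling argument, assuming query and key components drawn independently from a standard normal distribution (the sample size and `d_k = 64` are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64        # head size (assumed for this example)
n = 10_000      # number of random query/key pairs

q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

raw = (q * k).sum(axis=1)        # one dot product per pair
scaled = raw / np.sqrt(d_k)      # the same dot products, scaled by 1/sqrt(d_k)

# The raw variance comes out close to d_k, the scaled one close to 1.
print(raw.var(), scaled.var())
```

Without the scaling, scores with a variance of 64 would push the softmax toward a near one-hot distribution, which is the saturation the text refers to.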

Since we want the attention weights to behave like probabilities, we normalize the scaled dot products. This is done using a softmax function to calculate a probability distribution over the keys. In other words, softmax converts each row of the score matrix into weights between 0 and 1 that sum to 1, representing the relevance of each Key relative to the Query.

The attention mechanism can mathematically be written as:

Attention(Q, K, V) = softmax( QK^T / √d_k ) V
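The formula translates almost line for line into NumPy. The following is a minimal single-head sketch without masking; the function names and the random test inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # compatibility of every query with every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over the keys
    return weights @ V, weights           # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = attention(Q, K, V)
print(out.shape)              # (3, 4)
print(weights.sum(axis=-1))   # every row sums to 1 (up to rounding)
```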

2.2.2 Scaled self attention in Python

Using PyTorch in Python, attention is computed in the following steps:

The first operation computes the scaled dot product between the queries and keys in order to get their compatibility scores.

The softmax function normalizes the scores along the last dimension (dim=-1), which indexes the keys, that is, the sequence positions being attended to. This ensures that the attention weights for each query sum up to 1, providing a probability distribution over the sequence elements.

matmul(scores, values) - the scores are multiplied by the values matrix. This step computes the weighted sum of values based on the attention weights, effectively capturing the context or information relevant to each query.

transpose(1, 2) - The output is transposed to swap the heads and seqlen dimensions, so that seqlen comes right after the batch dimension again, matching the layout of the original input.

contiguous() - Ensures that the matrix is stored in a contiguous block of memory.

view - Reshapes the matrix to the desired output shape (bsz, seqlen, -1), where -1 infers the size of the last dimension based on the input and other specified dimensions.

self.wo - The output matrix is passed through a linear layer which maps the aggregated information to the desired output space.
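The steps above can be put together in a small PyTorch module. This is a minimal sketch rather than any particular model's implementation: the names `scores`, `values`, `bsz`, `seqlen`, and `self.wo` follow the text, while the class name, the `wq`/`wk`/`wv` projections, the tensor sizes, and the absence of a causal mask are assumptions made to keep the example short.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no masking, no caching)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (bsz, seqlen, dim) -> (bsz, n_heads, seqlen, head_dim)
            return t.view(bsz, seqlen, self.n_heads, self.head_dim).transpose(1, 2)

        xq, keys, values = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))

        # Scaled dot product between queries and keys.
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        scores = F.softmax(scores, dim=-1)       # normalize over the keys
        output = torch.matmul(scores, values)    # (bsz, n_heads, seqlen, head_dim)
        # Put seqlen back next to the batch dimension and merge the heads.
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)

torch.manual_seed(0)
attn = SelfAttention(dim=16, n_heads=4)
x = torch.randn(2, 5, 16)
y = attn(x)
print(y.shape)  # torch.Size([2, 5, 16])
```

A GPT-style decoder would additionally add a causal mask to `scores` before the softmax so that a token cannot attend to later positions.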

2.3. Position-wise feed-forward networks (FFN)

Each layer, both in the encoder and decoder, incorporates a position-wise feed-forward network (FFN). It is built from fully connected layers, also known as dense layers, in which each neuron is connected to every neuron in the previous layer. The FFN introduces non-linearity into the model, which is crucial because it allows the model to extract increasingly abstract and higher-level features from the raw input.

Following the self-attention mechanism in the processing sequence, the FFN is applied at each position in the input sequence. While the self-attention mechanism effectively captures global dependencies and relationships between different words in the sequence, the FFN focuses on enhancing the model's ability to capture local patterns and features specific to each position. This double focus on global and local data attributes allows for a more detailed understanding and processing of the input.

2.3.1 Step-by-step feed-forward processing

The FFN consists of two linear transformations with a non-linear activation function in between, applied independently at each position in the sequence. Mathematically, the operation of the FFN can be represented as:

FFN(x) = ReLU( x ⋅ W₁ + b₁ ) ⋅ W₂ + b₂

Where:

x represents the input at a specific position
W₁ and b₁ are the weights and biases of the first linear transformation
W₂ and b₂ are the weights and biases of the second linear transformation
ReLU is the activation function ReLU(z) = max(0, z), which decides which neurons of the neural net should fire

*The weight matrices W₁ and W₂ and the biases b₁ and b₂ are parameters of the model, which are learned during training.

The operation of the FFN can be broken down mathematically into three steps:

1. The first linear transformation

Initially, the first linear transformation is applied to the input using the formula x ⋅ W₁ + b₁. In the original Transformer, this step expands the dimensionality of the data (from a model dimension of 512 to an inner dimension of 2048), giving the network a wider space in which to represent features.

2. ReLU activation function

Following the first transformation, the ReLU (Rectified Linear Unit) activation function is applied. This function is critical because it introduces non-linearity into the network by setting all negative values in the transformed representation to zero while leaving positive values unchanged. It is computationally efficient, simple, and widely adopted in deep learning architectures.

3. The second linear transformation

The second linear transformation then takes place, projecting the representation back down to the original input dimensionality. This step is necessary because the output must have the same shape as the input so that it can be passed on to the next stages of processing.
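The three steps can be sketched in NumPy as follows. The sizes are toy values chosen for readability (the original paper uses a model dimension of 512 and an inner dimension of 2048), and the weights are random placeholders for what training would learn.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to every row (position) of x."""
    hidden = relu(x @ W1 + b1)   # steps 1 and 2: (seq_len, d_ff)
    return hidden @ W2 + b2      # step 3: back to (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32   # toy sizes (assumed for this example)

W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)

# "Position-wise": processing one position alone gives the same result
# as processing it together with the rest of the sequence.
print(np.allclose(y[2], ffn(x[2:3], W1, b1, W2, b2)[0]))  # True
```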

By integrating these transformations, the model can capture complex patterns and relationships within the data and extract meaningful features from the input sequence. This capability significantly enhances the model’s performance in tasks involving natural language processing, allowing it to generate responses or text with high accuracy and contextual relevance.

2.4 Embeddings and Softmax

To predict the likelihood of each token being the appropriate next piece in the sequence, the transformer model first converts the output from the decoder into a vector of raw scores (logits), one per token in the vocabulary, using a learned linear transformation. These scores are then processed through a softmax function to generate the final predictions.

2.4.1 Generating embeddings from token data

Each token in the input sequence is initially represented as a one-hot vector, the dimensionality of which corresponds to the size of the vocabulary.

These one-hot token representations are then mapped to continuous vector embeddings through a learned embedding matrix. The matrix is an important component of the model, trained alongside other parameters during the model training process.

The embedding matrix projects the one-hot vectors into a continuous embedding space of dimension d_model. This transition from sparse to dense representations allows the model to capture and learn meaningful relationships and semantic connections between tokens, improving its ability to understand and generate language.
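Multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix, which is why implementations usually perform a lookup instead of a full matrix product. A small sketch with made-up sizes and a random (untrained) matrix `E`:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 4                    # toy sizes (assumed)
E = rng.normal(size=(vocab_size, d_model))    # embedding matrix, learned in practice

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E    # sparse one-hot vector times the matrix
via_lookup = E[token_id]    # direct row lookup, same result

print(np.allclose(via_matmul, via_lookup))  # True
```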

2.4.2 Calculating token probabilities with Softmax

Once the decoder generates output token representations, a learned linear layer maps them to one raw score (logit) per vocabulary token, and a softmax function, used here once again, turns those scores into probabilities over the vocabulary. This transformation enables the model to predict the likelihood of each token in the vocabulary being the next token in the output sequence.

The token with the highest probability is typically selected as the predicted next token (greedy decoding), although sampling from the distribution is also common. This step is important for the model to make informed predictions, ensuring that each generated token aligns closely with the contextual flow of the text.
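As a small worked example, the sketch below turns a vector of raw scores into probabilities and picks the most likely token. The vocabulary and the score values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["cat", "dog", "sat", "mat"]        # hypothetical toy vocabulary
logits = np.array([1.0, 0.5, 3.0, 0.2])     # made-up raw scores for the next token

probs = softmax(logits)
next_token = vocab[int(np.argmax(probs))]   # greedy choice: highest probability

print(next_token)  # sat
```

Sampling from `probs` instead of taking the maximum is a common alternative when more varied output is wanted.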

2.4.3 The role of positional encoding

Given that the transformer model operates without recurrent or convolutional mechanisms, it relies on positional encodings added to the input embeddings. These encodings are derived from sine and cosine functions at various frequencies, which provide a unique signature for each position in the sequence. This enables the model to discern the relative or absolute positions of tokens, a critical factor in maintaining the integrity and context of the sequence.

Positional encodings ensure that the model recognizes not only the content of the input but also the order in which it appears. This is essential for tasks involving structured text or series data.
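The sinusoidal encoding can be sketched as below, using the formulation from "Attention is All You Need": even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the sizes are assumptions for the example.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    assert d_model % 2 == 0
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)    # (10, 16)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 -> [0. 1. 0. 1.]
```

The resulting matrix is added element-wise to the token embeddings, so each position receives a distinct offset.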

2.5 Architecture Summary

The Transformer model architecture employs a combination of self-attention mechanisms, feed-forward networks, and positional encodings. Together, these components enable the model to process and generate output sequences with high accuracy and efficiency.

3. Training data: The backbone of generative AI models

Just like the human brain, a neural network needs to be trained. Although the two are fundamentally different, both need large amounts of data to form an understanding of the world. Thus, high-quality training data is paramount for developing AI models.

The quality of the input data profoundly impacts the efficacy of the output or predictions of the AI model.

OpenAI, for example, has three main sources for its data:

Publicly available information: This is data accessible on the internet.
Licensed data: OpenAI licenses datasets from third-party sources. These may include proprietary datasets whose details are not publicly disclosed.
User-provided data: This may come in the form of labeled examples or other forms of human input.

This is where the expertise of data scientists, well-versed in computer science and various programming languages, becomes key.

3.1 Understanding the challenges of training

Overfitting

This occurs when the model learns to capture noise or random fluctuations in the training data rather than underlying patterns. As a result, the model performs well on the training data but fails to generalize and, therefore, performs poorly with real-world data.

To address overfitting, regularization methods penalize large model weights, dropout randomly removes units during training to enhance robustness, and data augmentation introduces variations in the training data, promoting exposure to diverse scenarios.
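Two of these techniques are simple enough to sketch directly. The code below shows inverted dropout (a common formulation that rescales the surviving units) and an L2 weight penalty; the function names and the rate `p = 0.2` are assumptions for the example.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                          # dropout is disabled at inference time
    mask = rng.random(x.shape) >= p       # True for the units that are kept
    return x * mask / (1.0 - p)           # rescale so the expected value is unchanged

def l2_penalty(weights, lam):
    """Regularization term added to the loss: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, p=0.2, rng=rng)

print((dropped == 0).mean())   # fraction of removed units, close to 0.2
print(dropped.mean())          # mean stays close to 1 thanks to the rescaling
print(l2_penalty(np.array([1.0, -2.0]), lam=0.5))  # 0.5 * (1 + 4) = 2.5
```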

Underfitting

In contrast, underfitting occurs when the model is too simplistic to capture the underlying patterns in the data. This results in poor performance both on the training data and unseen data. Underfitting may arise from using overly simple models or insufficient training data.

Balancing between these two extremes, overfitting and underfitting, requires careful experimentation and tuning of the model's size, training setup, and hyperparameters.

Insufficient data

When the amount of data available for training an AI model is limited, the model may not capture the full complexity of the underlying data distribution. As a result, the model may generalize poorly to unseen data.

Insufficient training data is a common challenge across various fields, particularly in emerging or specialized domains where data collection is limited or expensive. In industries like healthcare, where niche datasets are crucial for specialized applications such as medical imaging or personalized treatments, obtaining sufficient and diverse data can be especially challenging due to legal and moral concerns.

Developers have found creative solutions to this issue, collectively known as data augmentation: mirroring, rotating, or adding noise to existing data artificially increases the size of the training dataset.
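For image-like data, these three augmentations are one-liners in NumPy. The tiny 3×4 array below stands in for an image, and the noise scale is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
image = np.arange(12, dtype=float).reshape(3, 4)   # stand-in for a real image

mirrored = np.fliplr(image)     # horizontal mirror
rotated = np.rot90(image)       # 90 degrees counter-clockwise, shape becomes (4, 3)
noisy = image + rng.normal(scale=0.1, size=image.shape)   # small Gaussian noise

augmented = [image, mirrored, rotated, noisy]
print(len(augmented))   # 4 training examples from 1 original
print(mirrored[0])      # [3. 2. 1. 0.]
```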

Biased data

Bias in the training data occurs when certain subsets or categories of data are overrepresented or underrepresented compared to others. This can lead to skewed predictions and unfair treatment of different groups or individuals.

For example, if a facial recognition system is trained primarily on images of individuals from one demographic group, it may perform poorly on individuals from underrepresented groups. To address bias in training data, it's essential to carefully curate and balance the dataset to ensure diverse representation across different groups. Additionally, techniques such as data preprocessing, stratified sampling, and bias correction algorithms can help mitigate bias in AI models.
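One simple variant of stratified sampling draws the same number of examples from every group (equal allocation). The sketch below uses only the standard library; the function name, the group labels, and the 90/10 imbalance are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, group_of, per_group, seed=0):
    """Draw up to `per_group` examples from every group."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[group_of(item)].append(item)   # split the data by group
    sample = []
    for group in sorted(buckets):
        members = buckets[group]
        k = min(per_group, len(members))       # small groups contribute all they have
        sample.extend(rng.sample(members, k))
    return sample

# Imbalanced toy dataset: 90 examples of group "a", 10 of group "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

balanced = stratified_sample(data, group_of=lambda x: x[0], per_group=10)
counts = {g: sum(1 for x in balanced if x[0] == g) for g in ("a", "b")}
print(counts)  # {'a': 10, 'b': 10}
```

Proportional allocation, where each group keeps its share of the full dataset, is the other common variant; which one fits depends on whether the goal is balance or faithfulness to the original distribution.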

Hyperparameter tuning

Hyperparameters are parameters that are set before the training process begins and control aspects of the model's architecture, learning process, and optimization strategy.

Tuning typically requires training and evaluating multiple versions of the model with different hyperparameter settings. This process can be computationally intensive, while high dimensionality makes it difficult to explore the entire hyperparameter space efficiently.
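Grid search is the most basic form of this process: try every combination and keep the best one. In the sketch below, `validation_loss` is a hypothetical stand-in for training and evaluating a model, and the grid values are arbitrary.

```python
import itertools

def validation_loss(lr: float, dropout: float) -> float:
    """Hypothetical stand-in for 'train a model, then evaluate it'."""
    return (lr - 0.01) ** 2 * 1e4 + (dropout - 0.1) ** 2

grid = {
    "lr": [0.001, 0.01, 0.1],
    "dropout": [0.0, 0.1, 0.3],
}

best_params, best_loss = None, float("inf")
for lr, dropout in itertools.product(grid["lr"], grid["dropout"]):
    loss = validation_loss(lr, dropout)    # one full training run per combination
    if loss < best_loss:
        best_params, best_loss = {"lr": lr, "dropout": dropout}, loss

n_trials = len(grid["lr"]) * len(grid["dropout"])
print(best_params, n_trials)  # {'lr': 0.01, 'dropout': 0.1} 9
```

The number of trials is the product of the grid sizes, so adding a third hyperparameter with three values would already require 27 training runs. This is the dimensionality problem the text describes.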

4. Open-source models

AI systems are very complex. By exploring the transformer architecture and its components, we gained a deeper understanding of generative AI, but fully implementing this architecture yourself would be very cumbersome. Thankfully, open source is here to help us.

By offering pre-trained models, architectures, and tools, open-source projects enable researchers and developers to build upon existing work. Open-source models can significantly accelerate the development process and reduce the need for extensive training data, as these models are often trained on large, diverse datasets and can then be fine-tuned for specific tasks.

The availability of open-source models democratizes AI development, making state-of-the-art algorithms and techniques accessible to a broader audience.

I encourage you to take a look at the source code and see the implementation of the architecture we described:

Llama 3 - by Meta
nanoGPT - by Andrej Karpathy
Mistral - by Mistral AI

5. What are the benefits of developing your own AI model?

Competitive edge

Building your own AI model can grant you a competitive advantage in various fields. Whether it's finance, healthcare, or retail, AI models can analyze customer behavior, optimize operations, and streamline decision-making processes. All this helps set your business apart from the competition.

Data security

By developing your own AI system, you gain better control over data security and privacy. This is especially important in industries where data is sensitive, as it allows you to implement security measures tailored to your specific risk profile and compliance requirements.

Full control over the custom model

Creating your own AI model allows you to tailor it to your specific needs and objectives. Whether it's a model trained to recognize unique data patterns or one that integrates with your existing tech stack, having complete control means you can fine-tune every aspect to ensure it meets your exact requirements.

6. Wrapping up

Building your own AI model from scratch is challenging and requires a deep dive into mathematical concepts. However, don’t be discouraged; you can master this by yourself.

From exploring self-attention mechanisms to understanding position-wise feed-forward networks, each component is important in determining the performance and behavior of your AI model. Diving deep into these aspects enhances your understanding of how AI systems work.

By developing your own AI model, you gain a competitive edge, ensure data security, and have full control over customizing the model to meet your specific needs and objectives.

My advice is to build your own AI for learning purposes; this whole blog is written to help you learn as much as possible. If you do not have extensive resources, use open-source models. They are valuable resources for researchers and developers, providing pre-trained models, architectures, and tools to accelerate the development process.

If you have any questions about building an AI or if you have suggestions and recommendations for additional resources beyond "Attention is All You Need" and Andrej Karpathy's YouTube channel, please feel free to send them our way.
