
Building LLM Powered Applications
By :

In the preceding sections, we saw how choosing an LLM architecture is a pivotal step in determining its functioning. However, the quality and diversity of the output text depend largely on two factors: the training dataset and the evaluation metric.
The training dataset determines what kind of data the LLM learns from and how well it can generalize to new domains and languages. The evaluation metric measures how well the LLM performs on specific tasks and benchmarks, and how it compares to other models and human writers. Therefore, choosing an appropriate training dataset and evaluation metric is crucial for developing and assessing LLMs.
In this section, we will discuss some of the challenges and trade-offs involved in selecting and using different training datasets and evaluation metrics for LLMs, as well as some of the recent developments and future directions in this area.
By definition, LLMs are huge, from a double point of view:
Figure 1.11: GPT-3 knowledge base
Considering the assumption:
We can conclude that GPT-3 has been trained on around 374 billion words.
So generally speaking, LLMs are trained using unsupervised learning on massive datasets, which often consist of billions of sentences collected from diverse sources on the internet. The transformer architecture, with its self-attention mechanism, allows the model to efficiently process long sequences of text and capture intricate dependencies between words. Training such models necessitates vast computational resources, typically employing distributed systems with multiple graphics processing units (GPUs) or tensor processing units (TPUs).
Definition
A tensor is a multi-dimensional array used in mathematics and computer science. It holds numerical data and is fundamental in fields like machine learning.
A TPU is a specialized hardware accelerator created by Google for deep learning tasks. TPUs are optimized for tensor operations, making them highly efficient for training and running neural networks. They offer fast processing while consuming less power, enabling faster model training and inference in data centers.
The training process involves numerous iterations over the dataset, fine-tuning the model’s parameters using optimization algorithms backpropagation. Through this process, transformer-based language models acquire a deep understanding of language patterns, semantics, and context, enabling them to excel in a wide range of NLP tasks, from text generation to sentiment analysis and machine translation.
The following are the main steps involved in the training process of an LLM:
Definition
In the context of neural networks, the optimization algorithm during training is the method used to find the best set of weights for the model that minimizes the prediction error or maximizes the accuracy of the training data. The most common optimization algorithm for neural networks is stochastic gradient descent (SGD), which updates the weights in small steps based on the gradient of the error function and the current input-output pair. SGD is often combined with backpropagation, which we defined earlier in this chapter.
The output of the pre-training phase is the so-called base model.
Definition
Reinforcement learning (RL) is a branch of machine learning that focuses on training computers to make optimal decisions by interacting with their environment. Instead of being given explicit instructions, the computer learns through trial and error: by exploring the environment and receiving rewards or penalties for its actions. The goal of reinforcement learning is to find the optimal behavior or policy that maximizes the expected reward or value of a given model. To do so, the RL process involves a reward model (RM) that is able to provide a “preferability score” to the computer. In the context of RLHF, the RM is trained to incorporate human preferences.
Note that RLHF is a pivotal milestone in achieving human alignment with AI systems. Due to the rapid achievements in the field of generative AI, it is pivotal to keep endowing those powerful LLMs and, more generally, LFMs with those preferences and values that are typical of human beings.
Once we have a trained model, the next and final step is evaluating its performance.
Evaluating traditional AI models was, in some ways, pretty intuitive. For example, let’s think about an image classification model that has to determine whether the input image represents a dog or a cat. So we train our model on a training dataset with a set of labeled images and, once the model is trained, we test it on unlabeled images. The evaluation metric is simply the percentage of correctly classified images over the total number of images within the test set.
When it comes to LLMs, the story is a bit different. As those models are trained on unlabeled text and are not task-specific, but rather generic and adaptable given a user’s prompt, traditional evaluation metrics were not suitable anymore. Evaluating an LLM means, among other things, measuring its language fluency, coherence, and ability to emulate different styles depending on the user’s request.
Hence, a new set of evaluation frameworks needed to be introduced. The following are the most popular frameworks used to evaluate LLMs:
It recently evolved into a new benchmark styled after GLUE and called SuperGLUE, which comes with more difficult tasks. It consists of eight challenging tasks that require more advanced reasoning skills than GLUE, such as natural language inference, question answering, coreference resolution, etc., a broad coverage diagnostic set that tests models on various linguistic capabilities and failure modes, and a leaderboard that ranks models based on their average score across all tasks.
The difference between the GLUE and the SuperGLUE benchmark is that the SuperGLUE benchmark is more challenging and realistic than the GLUE benchmark, as it covers more complex tasks and phenomena, requires models to handle multiple domains and formats, and has higher human performance baselines. The SuperGLUE benchmark is designed to drive research in the development of more general and robust NLU systems.
Definition
The concept of zero-shot evaluation is a method of evaluating a language model without any labeled data or fine-tuning. It measures how well the language model can perform a new task by using natural language instructions or examples as prompts and computing the likelihood of the correct output given the input. It is the probability that a trained model will produce a particular set of tokens without needing any labeled training data.
This design adds complexity to the benchmark and aligns it more closely with the way we assess human performance. The benchmark comprises 14,000 multiple-choice questions categorized into 57 groups, spanning STEM, humanities, social sciences, and other fields. It covers a spectrum of difficulty levels, ranging from basic to advanced professional, assessing both general knowledge and problem-solving skills. The subjects encompass various areas, including traditional ones like mathematics and history, as well as specialized domains like law and ethics. The extensive range of subjects and depth of coverage make this benchmark valuable for uncovering any gaps in a model’s knowledge. Scoring is based on subject-specific accuracy and the average accuracy across all subjects.
It is important to note that each evaluation framework has a focus on a specific feature. Namely, the GLUE benchmark focuses on grammar, paraphrasing, and text similarity, while MMLU focuses on generalized language understanding among various domains and tasks. Hence, while evaluating an LLM, it is important to have a clear understanding of the final goal, so that the most relevant evaluation framework can be used. Alternatively, if the goal is that of having the best of the breed in any task, it is key not to use only one evaluation framework, but rather an average of multiple frameworks.
In addition to that, in case no existing LLM is able to tackle your specific use cases, you still have a margin to customize those models and make them more tailored toward your application scenarios. In the next section, we are indeed going to cover the existing techniques of LLM customization, from the lightest ones (such as prompt engineering) up to the whole training of an LLM from scratch.
Change the font size
Change margin width
Change background colour