Published on: May 9, 2024
9 min read
Our blog series debuts with a behind-the-scenes look at how we evaluate LLMs, match them to use cases, and fine-tune them to produce better responses for users.

Generative AI marks a monumental shift in the software development industry, making it easier to develop, secure, and operate software. Our new blog series, written by our product and engineering teams, gives you an inside look at how we create, test, and deploy the AI features you need integrated throughout the enterprise. Get to know new capabilities within GitLab Duo and how they will help DevSecOps teams deliver better results for customers.
GitLab values the trust our customers place in us. Part of maintaining that trust is transparency in how we build, evaluate, and ensure the high-quality functionality of our GitLab Duo AI features. GitLab Duo features are powered by a diverse set of models, which allows us to support a broad set of use cases and gives our customers flexibility. GitLab is not tied to a single model provider by design. We currently use foundation models from Google and Anthropic. However, we continuously assess what models are the right matches for GitLab Duo’s use cases. In this article, we give you an inside look at our AI model validation process.
Discover the future of AI-driven software development with our GitLab 17 virtual launch event. Watch today!
Large language models (LLMs) are generative AI models that power many AI features across the platform. Trained on vast datasets, LLMs predict the next word in a sequence based on preceding context. Given an input prompt, they generate human-like text by sampling from the probability distribution of words conditioned on the prompt.
LLMs enable intelligent code suggestions, conversational chatbots, code explanations, vulnerability analysis, and more. Their ability to produce diverse outputs for a given prompt makes standardized quality evaluation challenging. LLMs can be optimized for different characteristics, which is why there are so many AI models actively being developed.
Unlike traditional software systems where inputs and outputs can be more easily defined and tested, LLMs produce outputs that are often nuanced, diverse, and context-dependent. Testing these models requires comprehensive strategies that account for subjective and variable interpretations of quality, as well as the stochastic nature of their outputs. We, therefore, cannot judge the quality of an LLM’s output in an individual or anecdotal fashion; instead, we need to be able to examine the overall pattern of an LLM's behavior. To get a sense of those patterns, we need to test at scale. Testing at scale refers to the process of evaluating the performance, reliability, and robustness of a system or application across a large and diverse array of datasets and use cases. Our Centralized Evaluation Framework (CEF) utilizes thousands of prompts tied to dozens of use cases to allow us to identify significant patterns and assess the overall behavior of our foundational LLMs and the GitLab Duo features in which they are integrated.
Testing at scale helps us:
Testing LLMs at scale is imperative for ensuring their reliability and robustness for deployment within the GitLab platform. By investing in comprehensive testing strategies that encompass diverse datasets, use cases, and scenarios, GitLab is working to unlock the full potential of AI-powered workflows while mitigating potential risks.
These are the steps we take to test LLMs at scale.
While other companies view and use customer data to train their AI features, GitLab currently does not. As a result, we needed to develop a comprehensive prompt library that is a proxy for both the scale and activity of production.
This prompt library is composed of questions and answers. The questions represent the kinds of queries or inputs that we would expect to see in production, while the answers represent a ground truth of what our ideal answer would be. This ground truth answer could also be mentally framed as a target answer. Both the question and the answer may be human generated, but are not necessarily so. These question/answer pairs give us a basis for comparison and a reference frame that allow us to tease out differences between models and features. When multiple models are asked the same question and generate different responses, we can use our ground truth answer to determine which model has provided an answer that is most closely aligned to our target and score them accordingly.
Again, a key element of a comprehensive prompt library is ensuring that it is representative of the inputs that we expect to see in production. We want to know how well foundational models fit to our specific use case, and how well our features are performing. There are numerous benchmark prompt datasets, but those datasets may not be reflective of the use cases that we see for features at GitLab. Our prompt library is designed to be specific to GitLab features and use cases.
Once we have crafted a prompt library that accurately reflects production activity, we feed those questions into various models to test how well they serve our customer’s needs. We compare each response to our ground truth and provide it a ranking based on a series of metrics including: Cosine Similarity Score, Cross Similarity Score, LLM Judge, and Consensus Filtering with an LLM Judge. This first iteration provides us a baseline for how well each model is performing, and guides our selection of a foundational model for our features. For brevity, we won’t go into the details here, but we encourage you to learn more about more about the metrics here. It is important to note this isn’t a solved problem; the wider AI industry is actively researching and developing new techniques. GitLab’s model validation team keeps a pulse on the industry and is continuously iterating on how we measure and score the LLMs GitLab Duo uses.
Now that we have a baseline for our selected model's performance, we can start developing our features with confidence. While prompt engineering gets a lot of buzz, focusing entirely on changing the behavior of a model via prompting (or any other technique) without validation means that you are operating in the dark and very possibly overfitting your prompting. You may solve one problem, but be causing a dozen more. You would never know. Creating a baseline for a model's performance allows us to track how we are changing behavior over time for all our necessary use cases. At GitLab, we re-validate the performance of our features on a daily basis during active development to help ensure that all changes improve the overall functionality.
Here is how our experimental iterations work. Each cycle, we examine the scores from our tests at scale to identify patterns:
Only when we test at scale do these kinds of patterns begin to emerge and allow us to focus our experiments. Based on these patterns, we propose a variety of experiments or approaches to try to improve performance in a specific area and on a specific metric.
However, testing at scale is both expensive and time-consuming. To enable faster and less expensive iteration, we craft a smaller scale dataset to act as a mini-proxy. The focused subset will be weighted to include question/answer pairs that we know we want to improve upon, and the broader subset will also include sampling of all the other use cases and scores to ensure that our changes aren't adversely affecting the feature broadly. Make your change and run it against the focused subset of data. How does the new response compare to the baseline? How does it compare to the ground truth?
Once we have found a prompt that addresses the specific use case we are working on with the focused subset, we validate that prompt against a broader subset of data to help ensure that it won’t adversely affect other areas of the feature. Only when we believe that the new prompt improves our performance in our target area through validation metrics AND doesn’t degrade performance elsewhere, do we push that change to production.
The entire Centralized Evaluation Framework is then run against the new prompt and we validate that it has increased the performance of the entire feature against the baseline from the day before. In this way, GitLab is constantly iterating to help ensure that you are getting the latest and greatest performance of AI-powered features across the GitLab ecosystem. This allows us to ensure that we keep working faster, together.
Hopefully this gives you insight into how we’re responsibly developing GitLab Duo features. This process has been developed as we’ve brought GitLab Duo Code Suggestions and GitLab Duo Chat to general availability. We’ve also integrated this validation process into our development process as we iterate on GitLab Duo features. It’s a lot of trial and error, and many times fixing one thing breaks three others. But we have data-driven insights into those impacts, which helps us ensure that GitLab Duo is always getting better.
Start a free trial of GitLab Duo today!