How to Analyze a Dataset for LLM Fine Tuning

Say you want to teach an LLM some new behavior, and your plan is to fine-tune a model that is already close and good enough. You found a dataset or two, and now want to see how training the LLM on them would influence its behavior and knowledge.

Define the Objective

What behavior or knowledge do you want to instill in the LLM? Is it domain-specific knowledge, conversational style, task-specific capabilities, or adherence to specific ethical guidelines?

Dataset Exploration

Check if the dataset’s content aligns with your domain of interest. Where does the dataset come from? Ensure it is reliable and unbiased for your use case.

Evaluate the dataset size: it should be large enough to fine-tune without overfitting, but not so large that training becomes computationally prohibitive. Check the dataset format (e.g., JSON, CSV, text) and its fields (e.g., prompt-response pairs, paragraphs, structured annotations).
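
A quick way to get a first impression of size and fields is to read the file and count. This is only a sketch: it assumes a JSON Lines file with one record per line, and the function name summarize_records and the sample field names are made up for illustration.

```python
import json
from collections import Counter


def summarize_records(lines):
    """Count records and how often each field appears (assumes JSON Lines input)."""
    field_counts = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        field_counts.update(record.keys())
    return total, dict(field_counts)


# Hypothetical sample; in practice pass an open file object instead.
sample = [
    '{"prompt": "Reverse a string", "response": "s[::-1]"}',
    '{"prompt": "Add two numbers", "response": "a + b"}',
    '{"prompt": "Orphan prompt"}',
]
total, fields = summarize_records(sample)
print(total, fields)
```

A field that shows up less often than the record count, like response here, is a hint that some records are incomplete.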

Content

Quality: Ensure the text is grammatically correct and coherent, and that any code actually works. Check for logical structure and factual accuracy.
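
For Python code samples, a cheap first filter is to check whether each snippet at least parses. The sketch below uses the standard ast module; the helper name is_valid_python is my own, and a syntax check is a much weaker test than actually running the code, so treat it as a starting point.

```python
import ast


def is_valid_python(source):
    """Return True if the snippet parses as Python; the code is never executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Hypothetical samples: one valid function, one with a syntax error
samples = ["def add(a, b):\n    return a + b", "def broken(:\n    pass"]
print([is_valid_python(s) for s in samples])
```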

Diversity: Analyze the range of topics, styles, and formats in the dataset. Ensure the dataset covers edge cases and diverse scenarios relevant to your objectives.

Look for harmful, biased, or inappropriate content. Assess the dataset for compliance with ethical and legal standards.

Behavior

Use a small subset of the dataset to run experiments and assess how the model’s behavior shifts. Compare the outputs before and after fine-tuning on metrics like relevance, correctness, and alignment with desired behaviors.
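
A before/after comparison can start very simple. The sketch below assumes you already collected outputs from the base model and the fine-tuned model for the same prompts; exact match is only one possible metric, and the sample data is invented.

```python
def exact_match_rate(outputs, references):
    """Fraction of outputs that equal the reference after trimming whitespace."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not references:
        return 0.0
    hits = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return hits / len(references)


# Hypothetical outputs collected from the base and the fine-tuned model
references = ["4", "Paris", "blue"]
base_outputs = ["4", "London", "green"]
tuned_outputs = ["4", "Paris ", "green"]

base_score = exact_match_rate(base_outputs, references)
tuned_score = exact_match_rate(tuned_outputs, references)
print(f"base={base_score:.2f} tuned={tuned_score:.2f} delta={tuned_score - base_score:+.2f}")
```

For free-form answers you would likely swap exact match for a softer score or a manual review, but the structure of the comparison stays the same.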

Compare the dataset’s content with the base model’s knowledge and capabilities. Focus on gaps or areas where the dataset adds value.

TL;DR: Train with a small subset and observe how it changes behavior.

Data Cleaning

Normalize text (e.g., casing, punctuation) and remove irrelevant characters. Tokenize or prepare the dataset in a format compatible with the model.

Remove low-quality, irrelevant, or harmful samples. In fact, many of the datasets used to train large LLMs are not very clean. Address bias and ethical issues by balancing or augmenting content as needed. Add labels or annotations if the dataset lacks sufficient structure for fine-tuning.
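
A minimal cleaning pass could look like the sketch below. It assumes plain-text samples; the length threshold is arbitrary, and collapsing whitespace is not appropriate for code fields, where indentation matters.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean(samples, min_chars=10):
    """Normalize, drop very short samples, and remove exact duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for sample in samples:
        text = normalize(sample)
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Hypothetical raw samples: a near-duplicate pair, a too-short entry, and messy whitespace
raw = [
    "Hello   world, this is a test.",
    "Hello world, this is a test.",
    "ok",
    "  Another\tuseful sample  ",
]
print(clean(raw))
```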

Resource Estimate

Determine the compute power required for fine-tuning with this dataset. If the dataset is too large, consider selecting a high-quality, representative subset.
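
For a rough first number, a common rule of thumb puts full fine-tuning at about 6 FLOPs per parameter per training token. The sketch below applies that rule; the model size, token count, GPU throughput, and 30% utilization are all made-up assumptions, and parameter-efficient methods will shift the result.

```python
def estimate_training_flops(num_params, num_tokens, epochs=1):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens * epochs


def estimate_gpu_hours(flops, peak_flops_per_sec, utilization=0.3):
    """Convert FLOPs to GPU hours, given peak throughput and an assumed utilization."""
    return flops / (peak_flops_per_sec * utilization) / 3600


# Hypothetical numbers: 7B parameters, 10M training tokens, 3 epochs,
# and a GPU with 100 TFLOP/s peak throughput at 30% utilization.
flops = estimate_training_flops(7e9, 10e6, epochs=3)
hours = estimate_gpu_hours(flops, 100e12)
print(f"{flops:.2e} FLOPs, about {hours:.1f} GPU hours")
```

The point is the order of magnitude, not the exact figure: it tells you whether you are looking at an afternoon on one GPU or a multi-day job.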

Alternative Approaches: Evaluate whether fine-tuning is necessary. Explore alternatives like prompt engineering or few-shot learning.

Ethical and Practical Validation

Use tools or frameworks to check for potential biases in the dataset. Ensure the dataset complies with copyright, privacy, and data protection regulations.

Add Notes

Document findings about dataset quality, limitations, and potential biases. Record the preprocessing steps and justification for changes made to the dataset.

By following this structured analysis, you can determine how fine-tuning with a particular dataset will influence an LLM and decide on the most effective approach for your objectives.

Note that knowledge from training and fine-tuning can be blurry, so consider augmenting the model with a RAG system to get sharper responses. I’ll show how to do that in another blog post.

How to build an AI Agent with a memory

How to Build an Agent with a Local LLM and RAG, Complete with Local Memory

If you want to build an agent with a local LLM that can remember things and retrieve them on demand, you’ll need a few components: the LLM itself, a Retrieval-Augmented Generation (RAG) system, and a memory mechanism. Here’s how you can piece it all together, with examples using LangChain and Python. (and here is why a small LLM is a good idea)

Step 1: Set Up Your Local LLM

First, you need a local LLM. This could be a smaller pre-trained model like LLaMA or GPT-based open-source options running on your machine. The key is that it’s not connected to the cloud—it’s local, private, and under your control. Make sure the LLM is accessible via an API or similar interface so that you can integrate it into your system. A good choice would be Ollama with an LLM such as Google’s Gemma. I also wrote easy-to-follow instructions on how to set up a T5 LLM from Salesforce locally, but it is also perfectly fine to use a cloud-based LLM.

In case the agent you want to build is about source code, here is an example of how to use CodeT5 with LangChain.

Step 2: Add Retrieval-Augmented Generation (RAG)

TL;DR: Gist on Github

Next comes the RAG. A RAG system works by combining your LLM with an external knowledge base. The idea is simple: when the LLM encounters a query, the RAG fetches relevant information from your knowledge base (documents, notes, or even structured data) and feeds it into the LLM as context.

To set up RAG, you’ll need:

  1. A Vector Database: This is where your knowledge will live. Tools like Pinecone, Weaviate, or even local implementations like FAISS can store your data as embeddings.
  2. A Way to Query the Vector Database: Use similarity search to find the most relevant pieces of information for any given query.
  3. Integration with the LLM: Once the RAG fetches data, format it and pass it as input to the LLM.
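
To see how the three pieces fit together without installing anything, here is a toy version in plain Python. It is a sketch, not a real vector database: the bag-of-words embed function stands in for an embedding model, and all names and sample documents are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Similarity search: return the k documents closest to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, context):
    """Integration step: format retrieved context and the question into one prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


docs = [
    "The cat sleeps on the sofa",
    "Python is a programming language",
    "Paris is the capital of France",
]
top = retrieve("what is the capital of france", docs)
print(build_prompt("what is the capital of france", top))
```

A real setup replaces embed with an embedding model and the sorted list with a vector store, but the flow of embed, search, and format stays the same.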

I have good experience with LangChain and Chroma:

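# Imports assume the langchain-community, langchain-chroma, and langchain-ollama packages
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_text_splitters import CharacterTextSplitter

documents = TextLoader("my_data.txt").load()
texts = CharacterTextSplitter(chunk_size=300, chunk_overlap=100).split_documents(documents)
retriever = Chroma.from_documents(texts, OllamaEmbeddings(model="gemma:latest")).as_retriever()

model_name = "gemma:latest"  # assumption: the same Ollama model is used for generation
llm = OllamaLLM(model=model_name)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

print(qa_chain.invoke("What is the main topic of my document?"))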

Step 3: Introduce Local Memory

Now for the fun part: giving your agent memory. Memory is what allows the agent to recall past interactions or store information for future use. There are a few ways to do this:

  • Short-Term Memory: Store conversation context temporarily. This can simply be a rolling buffer of recent interactions that gets passed back into the LLM each time.
  • Long-Term Memory: Save important facts or interactions for retrieval later. For this, you can extend your RAG system by saving interactions as embeddings in your vector database.

For example:

  1. After each interaction, decide if it’s worth remembering.
  2. If yes, convert it into an embedding and store it in your vector database.
  3. When needed, retrieve it alongside other RAG data to give the agent a sense of history.
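
The three steps above can be sketched as a tiny class. This is an illustration only: word overlap stands in for real embeddings, and the rule for what is worth remembering is a made-up heuristic you would replace with your own.

```python
def tokens(text):
    """Lowercased word set, standing in for a real embedding."""
    return set(text.lower().split())


class LongTermMemory:
    """Toy long-term memory: stores selected interactions and recalls by word overlap."""

    def __init__(self, is_worth_remembering):
        self.is_worth_remembering = is_worth_remembering
        self.entries = []  # list of (token set, original text)

    def maybe_store(self, interaction):
        """Steps 1 and 2: keep the interaction only if the predicate approves."""
        if self.is_worth_remembering(interaction):
            self.entries.append((tokens(interaction), interaction))
            return True
        return False

    def recall(self, query, k=1):
        """Step 3: return up to k stored interactions that share words with the query."""
        q = tokens(query)
        scored = [(len(q & t) / len(q | t), text) for t, text in self.entries if q & t]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


# Hypothetical rule: remember statements the user makes about themselves.
memory = LongTermMemory(lambda text: text.lower().startswith("my "))
memory.maybe_store("My favorite language is Python")
memory.maybe_store("Hello there")
print(memory.recall("which language is my favorite"))
```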

Langchain Example

from langchain.memory import ConversationBufferMemory

# Initialize memory
memory = ConversationBufferMemory()

# Save some conversation turns
memory.save_context({"input": "Hello"}, {"output": "Hi there!"})
memory.save_context({"input": "How are you?"}, {"output": "I'm doing great, thanks!"})

# Retrieve stored memory
print(memory.load_memory_variables({}))
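
The rolling buffer for short-term memory can also be written without any framework. Here is a minimal sketch using a deque with a fixed size; the class name RollingBuffer and the prompt layout are my own choices.

```python
from collections import deque


class RollingBuffer:
    """Keeps only the most recent conversation turns."""

    def __init__(self, max_turns=3):
        self.turns = deque(maxlen=max_turns)  # old turns drop off automatically

    def add(self, user, assistant):
        self.turns.append((user, assistant))

    def as_context(self):
        """Render the buffered turns as text to prepend to the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


buffer = RollingBuffer(max_turns=2)
buffer.add("Hello", "Hi there!")
buffer.add("How are you?", "I'm doing great, thanks!")
buffer.add("What's the weather?", "I can't check that.")
print(buffer.as_context())
```

With max_turns=2 the first exchange is already gone by the third turn, which keeps the prompt size bounded.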

Step 4: Put It All Together

Now you can combine these elements:

  • The user sends a query.
  • The system retrieves relevant data via RAG.
  • The memory module checks for related interactions or facts.
  • The LLM generates a response based on the query, retrieved context, and memory.
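
As a sketch, one turn of that flow fits in a single function. The three callables are placeholders for whatever retriever, memory, and LLM you use; the stand-ins below exist only so the example runs on its own.

```python
def answer(query, retrieve, recall, generate):
    """One agent turn: gather context, build a prompt, and call the LLM."""
    documents = retrieve(query)  # RAG lookup
    memories = recall(query)     # related past interactions or facts
    prompt = "\n\n".join([
        "Known facts:\n" + "\n".join(memories),
        "Context:\n" + "\n".join(documents),
        f"Question: {query}\nAnswer:",
    ])
    return generate(prompt)


# Stand-in components so the flow can be run without a model or database.
def fake_retrieve(query):
    return ["Document chunk about embeddings."]


def fake_recall(query):
    return ["The user prefers short answers."]


def fake_generate(prompt):
    return f"(LLM reply to a prompt of {len(prompt)} characters)"


print(answer("Where are embeddings stored?", fake_retrieve, fake_recall, fake_generate))
```

Passing the components in as functions keeps the loop independent of any particular library, so you can swap the retriever or the model later without touching it.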

This setup is powerful because it blends the LLM’s generative abilities with a custom memory tailored to your needs. It’s also entirely local, so your data stays private and secure.

Final Thoughts

Building an agent like this might sound complex, but it’s mostly about connecting the dots between well-known tools. Once you’ve got it running, you can tweak and fine-tune it to handle specific tasks or remember things better. Start small, iterate, and soon you’ll have an agent that feels less like software and more like a real assistant.

CodeGen2.5 LLM not working with latest Huggingface Transformers

Tried to install Salesforce/codegen25-7b-multi_P on my Macbook with Huggingface transformers 4.45, which failed with the following error:

.env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 1590, in __init__
    raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
AttributeError: add_special_tokens conflicts with the method add_special_tokens in CodeGen25Tokenizer

Going back a few transformer versions gives this error:

codegen25-7b-multi/0bdf3f45a09e4f53b333393205db1388634a0e2e/tokenization_codegen25.py", line 149, in vocab_size
    return self.encoder.n_vocab
           ^^^^^^^^^^^^
AttributeError: 'CodeGen25Tokenizer' object has no attribute 'encoder'. Did you mean: 'encode'?

After zipping through older Transformers versions I found a note in their release saying it requires transformers 4.29.2. That version didn’t want to compile on my current Mac setup anymore because of Rust, with this error:

error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted
      
      Caused by:
        process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib 
...      
error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib -- -C 'link-args=-undefined dynamic_lookup -Wl,-install_name,@rpath/tokenizers.cpython-312-darwin.so'` failed with code 101
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers

The Solution

Here is a solution that worked for me:

RUSTFLAGS="-A invalid_reference_casting" pip install transformers==4.33.2

Transformers 4.29.2 works as well. Then install torch.

And here everything together:

virtualenv .env
source .env/bin/activate
RUSTFLAGS="-A invalid_reference_casting" HF_HOME=.cache pip install tiktoken==0.4.0 torch transformers==4.33.2
python test.py

where test.py would be the following:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-instruct")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

9 AI trends in Late 2024

From smaller, more efficient language models to groundbreaking multi-modal capabilities, AI is becoming increasingly accessible and versatile. In this blog post, we’ll look at nine key AI trends that are shaping the landscape of artificial intelligence.

Reality Check

The initial hype surrounding AI in 2023 has given way to a more realistic assessment of its capabilities and limitations. While AI has made significant progress, challenges such as bias and ethical concerns remain. As researchers continue to refine AI models and address these issues, we can expect further advancements in various fields. However, it’s crucial to maintain a balanced perspective and recognize that AI is a tool that, when used responsibly, can benefit society.

Multi-Modal AI

AI is making significant strides with multi-modal models. These innovative systems can process and understand information from various sources, including text, images, audio, and video. This opens up a world of possibilities, such as generating captions for images or creating videos based on audio. Multi-modal AI is revolutionizing fields like medical image analysis and customer service, making machines more intelligent and versatile than ever before.

Smaller Models

Forget “bigger is better”: 2024 is seeing a major shift in AI with the rise of smaller language models. This move away from giant, complex models offers a wave of benefits. Smaller models are more energy-efficient and cost-effective, making them accessible for businesses and individuals alike. They can even run on personal computers and smartphones, putting AI power in everyone’s hands. Plus, smaller models often focus on specific tasks, leading to more focused and efficient solutions, whether it’s crafting compelling content or streamlining customer service. This trend towards smaller models reflects a future where AI becomes more accessible, efficient, and specialized than ever before.

Custom Local Models

Custom local models are gaining prominence in AI. These models, trained on private datasets within an organization’s infrastructure, offer a tailored approach to AI applications. By maintaining control over sensitive information, organizations can enhance data privacy and comply with regulations. Custom models can also be fine-tuned for specific tasks, improving accuracy and relevance.

While training custom models can be computationally expensive, fine-tuning is cheaper and the benefits often outweigh the costs. For industries like healthcare, finance, and manufacturing, these models offer significant advantages. By operating offline, they enable applications in environments with limited or no internet connectivity. Additionally, reduced latency improves user experience and response times. To address the challenges of training and maintenance, organizations can leverage transfer learning techniques and invest in appropriate hardware resources.

GPU + Cloud Costs

GPU and cloud costs continue to be major factors in AI development. As AI models, especially LLMs, grow more complex, the demand for powerful hardware and scalable cloud infrastructure increases, driving up costs. High-performance GPUs remain expensive, while cloud providers offer various GPU options at different prices. To manage costs, AI practitioners use techniques like model compression and distributed training. Organizations must carefully weigh the costs of GPUs against the benefits of cloud computing, considering factors like project scale and desired performance. By effectively managing these costs, businesses can maximize the value of their AI investments.

Model Optimization

In 2024, model optimization is a key trend in AI, focused on making models smaller, faster, and more adaptable without sacrificing performance. Techniques like LoRA (Low-Rank Adaptation) allow for efficient fine-tuning of large models by reducing the number of parameters to be updated, while quantization lowers model precision to reduce size and improve speed, especially on edge devices. Pruning further trims down models by removing unnecessary weights, and knowledge distillation transfers the capabilities of large models to smaller ones, maintaining high performance. These methods, alongside sparse models and memory-efficient attention mechanisms, are helping make AI more scalable and accessible. With hardware-aware optimization and dynamic scaling, AI models are increasingly efficient, capable of running on a wider range of devices, from cloud servers to mobile platforms.
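
To make the LoRA saving concrete, here is a back-of-the-envelope calculation. It assumes the usual LoRA setup, where a frozen weight matrix of shape d_out x d_in gets two small trainable matrices of rank r; the layer size and the rank are made-up numbers.

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA adds two matrices, B (d_out x rank) and A (rank x d_in), per adapted weight."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the original dense weight matrix."""
    return d_out * d_in


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
full = full_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full={full:,} lora={lora:,} ratio={lora / full:.4%}")
```

For this layer the adapter trains well under one percent of the parameters of the full matrix, which is where the memory and compute savings come from.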

Virtual Agents

Virtual agents are becoming increasingly sophisticated, automating tasks and providing personalized assistance across various industries. From scheduling meetings to managing personal finances, these AI-powered assistants are streamlining workflows and saving time for users. As AI technology continues to advance, we can expect virtual agents to play an even more significant role in our daily lives.

Shadow AI

Shadow AI, a growing concern in 2024, refers to the unauthorized use of AI technologies within an organization. It’s like the shadowy underbelly of the digital world, where employees secretly adopt AI tools without IT’s blessing.

This trend is fueled by the increasing accessibility and power of AI. Employees often turn to Shadow AI for quick solutions, cost-effectiveness, or personal preference. However, this seemingly harmless behavior can lead to serious consequences.

Regulation

The world of AI regulation is undergoing a major shift. 2024 has seen a surge in legislation, with the EU’s AI Act setting a framework for risk-based regulation. Countries are also introducing their own rules focused on data privacy, transparency, and fairness in AI development. Regulatory bodies are demanding accountability for potential bias in AI systems, leading to an increase in bias audits by organizations. Meanwhile, research into AI safety is growing, with international cooperation aimed at developing responsible AI standards. The key challenge lies in balancing innovation with ethical development, while striving for global harmonization to avoid a patchwork of regulations.

ADHD Hyperfocus Engineers: A Double-Edged Sword

I recently came across an interesting topic: ADHD, often associated with inattentiveness, can paradoxically lead to intense hyperfocus in certain individuals. For engineers, this can be a double-edged sword. On one hand, hyperfocus can fuel extraordinary productivity and deep problem-solving. When engineers become engrossed in a challenging task, they can achieve remarkable feats.

However, hyperfocus can also lead to difficulties in managing time and prioritizing tasks. Engineers may become so absorbed in a project that they neglect other responsibilities or deadlines. Additionally, the intensity of hyperfocus can sometimes be emotionally draining, making it difficult to transition to other activities.

Engineers may find themselves working for hours without noticing the passage of time, completely immersed in their projects. This intense focus can lead to breakthroughs and innovative solutions, allowing them to make significant contributions to their teams and projects.

How to use CodeT5 with LangChain

Here is how to create a CodeT5 wrapper for LangChain, which can be used to embed code generation, translation, or analysis tasks into your LangChain application.

Create a CodeT5 Custom LLM class for LangChain

from typing import Any

from langchain.llms.base import LLM
from langchain.prompts import PromptTemplate

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

class CodeT5LLM(LLM):
    model_name: str
    tokenizer: Any = None
    model: Any = None  # from_pretrained returns a concrete model class, so keep the type loose
    
    def __init__(self, model_name):
        super().__init__(model_name=model_name)
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def _call(self, prompt: str, stop=None) -> str:
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        
        with torch.no_grad():
            generated_ids = self.model.generate(input_ids, max_length=128)
        
        return self.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    
    @property
    def _identifying_params(self):
        return {"model_name": self.model_name}

    @property
    def _llm_type(self):
        # LangChain declares _llm_type as a property on the base class
        return "custom"

Usage of the CodeT5 LLM in your LangChain app

Then, you can use the CodeT5 LLM in your code like this:

codet5_llm = CodeT5LLM(model_name="Salesforce/codet5-base-multi-sum")

prompt = PromptTemplate(
    input_variables=["code"],
    template="{code}"
)

chain = prompt | codet5_llm
result = chain.invoke("def hello_world():\n    print('Hello, World!')")

print(result)

You may want to adjust the implementation based on your specific use case and the CodeT5 variant you’re using.

Explain like I am five: lvalue and rvalue in C++

Imagine you have a toy box.

  • Lvalue: Think of the toy box itself. It’s a place where you can keep your toys. You can put a toy in the box, take one out, or replace one toy with another. In C++, an lvalue is like this box — it’s a location in memory where a value is stored, and you can change what’s stored there.
    • Example: int my_number = 5;
    • Here, my_number is an lvalue because it’s like the toy box where the number 5 is stored. You can change my_number later, like putting a different toy in the box (e.g., my_number = 10;).
  • Rvalue: Now, think of a toy you’re holding in your hand. It’s a specific toy that you can use to play with, but you can’t change that toy into something else on the spot. In C++, an rvalue is like this toy: it’s a value that exists temporarily and can’t be directly modified.
    • Example: 5 + 3
    • The result of this (which is 8) is an rvalue. It’s like the toy you just created by combining parts (5 and 3), but you can’t change this 8 directly; you can’t say “8 = 10”. It’s just a value you use.

Key Points to Remember:

  • Lvalues can appear on the left side of an assignment (=). They represent things that can be modified or assigned to.
  • Rvalues typically appear on the right side of an assignment. They represent temporary values that can’t be assigned to directly.

Simplified Summary:

  • Lvalues are like toy boxes: You can store and change what’s inside.
  • Rvalues are like toys: You can use them, but you can’t change them directly.

std::vector

std::vector is a dynamic array container. Unlike static arrays, it can grow and shrink in size at runtime, and you can navigate its elements using either indices (offsets) or iterators.

Some key points about std::vector

  • Dynamic resizing: When you add elements to a std::vector, it can automatically resize itself to accommodate the new elements.
  • Random access: You can access any element of a std::vector using its index, just like a regular array.
  • Efficient insertion and deletion: Inserting or deleting elements at the end of a std::vector is very efficient.
  • Memory management: std::vector handles memory management for you, so you don’t have to worry about allocating and deallocating memory manually.
  • Iterators: std::vector provides iterators that allow you to iterate over its elements in a convenient way.

Example of std::vector in C++

#include <iostream>
#include <vector>

int main() {
    std::vector<int> numbers;

    // Add elements to the vector
    numbers.push_back(1);
    numbers.push_back(2);
    numbers.push_back(3);

    // Access elements using their indices
    std::cout << numbers[0] << std::endl;
    std::cout << numbers[1] << std::endl;
    std::cout << numbers[2] << std::endl;

    // Iterate over the elements using a for loop
    for (int number : numbers) {
        std::cout << number << std::endl;
    }

    return 0;
}

It will output 1, 2, 3 twice, one number per line: first from the index access and then from the loop.

Differences between std::vector and std::list

Feature                                     | std::vector              | std::list
Underlying data structure                   | Dynamic array            | Doubly linked list
Random access                               | Yes                      | No
Insertion/deletion at the beginning/middle  | Inefficient (O(n))       | Efficient (O(1))
Insertion/deletion at the end               | Efficient (O(1))         | Efficient (O(1))
Iterators                                   | Random access iterators  | Bidirectional iterators
Memory usage                                | Generally more compact   | Generally less compact (due to pointers)

std::vector is often preferred due to random access and memory efficiency, while std::list is the choice if you often need to insert at the beginning or in the middle.

5 Python Datasets to Supercharge Your LLM

LLMs have shown remarkable potential, but fine-tuning with high-quality datasets is crucial for optimal performance. This blog post explores five exceptional Python datasets to elevate your LLM fine-tuning process.

flytech/python-codes-25k

The dataset provides clear instruction-output pairs, which is ideal for supervised fine-tuning of models to generate code based on text prompts. The dataset contains a substantial number of examples (25,000), which is sufficient for effective fine-tuning.

This dataset offers a versatile resource for various code-related tasks. It provides a rich collection of Python code examples paired with detailed instructions, enabling training for code generation, natural language understanding of code, and behavioral analysis of coding patterns. Additionally, it serves as a valuable educational tool for exploring coding styles and problem-solving approaches.

While it could potentially be used for benchmarking, its primary strength lies in providing training data for improving existing models rather than evaluating their performance against a standardized set of challenges. You can find it here on Huggingface.

Tested-22k-Python-Alpaca

Tested-22k-Python-Alpaca is a high-quality dataset comprising 22,600 verified working Python code examples. It was meticulously curated by filtering and testing code snippets extracted from various open-source datasets. The primary goal of this dataset is to provide a reliable resource for training and evaluating code-generating AI models. By offering a collection of functional Python code, it addresses the common challenge of models producing incorrect or incomplete code.

The dataset’s standout feature is the rigorous testing process ensuring all code examples are executable. It incorporates code from multiple open-source datasets for diversity.

It is well-suited to fine-tune your LLM to improve its code generation capabilities, as this dataset offers a valuable foundation for developing robust code-generating models by providing a large collection of accurate and diverse Python code examples. Download here.

notional-python

Notional-Python is a quality dataset containing Python code files extracted from 100 popular GitHub repositories. It is primarily designed for evaluating existing language models. While it’s not ideal for training a model from scratch due to its relatively small size, it can be effective for improving an already trained model’s ability to generate Python code.

By fine-tuning on this dataset, you can potentially increase code quality, improve code accuracy, and enhance code style consistency.

Remember that fine-tuning could introduce bias, so it’s crucial to evaluate the model’s performance carefully after fine-tuning. You can find it on Huggingface.

code_contest_python3_alpaca

The Code Contest Processed dataset is well-suited for fine-tuning. The availability of problem descriptions, correct code solutions, and test cases makes it an ideal dataset for training models to generate code based on problem statements. Additionally, the inclusion of Alpaca-style prompts can facilitate fine-tuning for tasks like code generation from natural language instructions.

The dataset comprises coding contest problems and their corresponding Python3 solutions, derived from Deepmind’s code_contest dataset. It offers structured data including problem descriptions, correct code, test cases, and problem sources. Additionally, it provides Alpaca-style prompts for text generation tasks related to code. This dataset is specifically tailored for Python-based machine learning models. You can find it on Huggingface.
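
If a dataset only ships raw instruction and output fields, you can render Alpaca-style training text yourself. The sketch below uses the widely used no-input variant of the Alpaca template; the field names instruction and output are assumptions, so check the dataset card for the actual ones.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def to_alpaca_text(record):
    """Render one record as a single training string (field names are assumed)."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        output=record["output"].strip(),
    )


# Hypothetical record; check the dataset card for the actual field names.
example = {
    "instruction": "Write a function that returns the square of a number.",
    "output": "def square(x):\n    return x * x",
}
text = to_alpaca_text(example)
print(text)
```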

stackoverflow_python_preprocessed

This dataset contains questions and answers that were filtered to only include questions with more than 100 votes and answers with more than 5 votes.
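
If you want to apply the same kind of vote filter to your own Q&A data, it is a short list comprehension. This is a sketch with invented records; the field names question_votes and answer_votes are hypothetical and not taken from the dataset.

```python
def filter_pairs(pairs, min_question_votes=100, min_answer_votes=5):
    """Keep question/answer pairs whose vote counts exceed both thresholds."""
    return [
        p for p in pairs
        if p["question_votes"] > min_question_votes and p["answer_votes"] > min_answer_votes
    ]


# Hypothetical records; the field names are illustrative.
pairs = [
    {"question": "How do I reverse a list?", "question_votes": 250, "answer_votes": 40},
    {"question": "Why is my loop slow?", "question_votes": 80, "answer_votes": 12},
    {"question": "What does yield do?", "question_votes": 900, "answer_votes": 3},
]
kept = filter_pairs(pairs)
print([p["question"] for p in kept])
```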

While the dataset doesn’t directly provide code snippets, it offers information about Python concepts, problems, and solutions. This textual data can enhance a model’s ability to understand and respond to Python-related queries.

For instance, a model trained on this dataset would be better equipped to:

  • Identify the core of a Python-related problem.
  • Understand the context of a Python question.
  • Provide relevant information or potential solutions.

You can find it here.

Conclusion

Incorporating these LLM datasets into your fine-tuning pipeline is a strategic step towards developing a more sophisticated and capable LLM.

Analyzing an LLM dataset for non-data scientists

Large Language Models (LLMs) have become increasingly important for tasks involving natural language processing (NLP). However, their effectiveness hinges on the quality of the datasets used for training and evaluation. While data scientists typically handle the intricacies of these datasets, there are several reasons why non-data scientists, such as developers, project managers, or domain experts, might also need to engage in this process.

Why Analyze an LLM Dataset?

Understanding and analyzing an LLM dataset is essential for several reasons:

  1. Ensuring Model Quality: The performance of an LLM is directly tied to the quality of its training data. By analyzing the dataset, you can identify any potential issues, such as imbalances, biases, or irrelevant data that might negatively impact the model’s output.
  2. Bias Detection and Ethical Considerations: Datasets can inadvertently contain biases that lead to unfair or unethical outcomes. For example, if the training data over-represents certain demographic groups, the LLM might produce biased results. Analyzing the dataset allows you to spot these issues early and address them before the model is deployed.
  3. Customizing for Specific Needs: Not all datasets are created equal. Depending on your application, you might need to fine-tune the LLM on data that is more relevant to your domain. Analyzing the dataset helps you understand its strengths and weaknesses, guiding the fine-tuning process.
  4. Compliance and Documentation: In regulated industries, it’s crucial to ensure that your data practices are compliant with laws and regulations, such as GDPR. Analyzing the dataset is a necessary step in auditing and documenting the data to meet these requirements.

What to Look for in a Dataset

When you’re tasked with analyzing an LLM dataset, focus on these key aspects:

  • Data Distribution: Check if the data covers all relevant categories and is evenly distributed across them. Imbalances can lead to biased models.
  • Quality and Relevance: Assess the quality of the data—look for noise, duplicates, or irrelevant entries that could skew results.
  • Representation of Sensitive Attributes: Pay attention to how sensitive attributes (e.g., race, gender) are represented to avoid introducing bias.
  • Coverage of Domain-Specific Content: Ensure that the dataset contains sufficient examples related to the specific language, terminology, or context relevant to your application.
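
Checking the distribution does not need special tooling. The sketch below assumes each entry already carries a category label; the labels and counts are invented for illustration.

```python
from collections import Counter


def category_shares(labels):
    """Share of each category, sorted from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def imbalance_ratio(labels):
    """Most frequent count divided by least frequent count (1.0 means balanced)."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical topic labels attached to dataset entries.
labels = ["billing"] * 6 + ["shipping"] * 3 + ["returns"] * 1
print(category_shares(labels))
print(imbalance_ratio(labels))
```

A high ratio is not automatically a problem, but it tells you which categories to look at first.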

Practical Steps

  1. Data Profiling: Start with basic profiling to understand the dataset’s structure, including the distribution of data points, missing values, and outliers.
  2. Bias Auditing: Use statistical methods to detect any biases. Simple checks like comparing distributions across different demographic groups can reveal potential issues.
  3. Domain Relevance Check: Evaluate whether the dataset includes enough examples relevant to your specific use case, and consider augmenting it with additional data if necessary.
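
Basic profiling from step 1 can be done with the standard library alone. The sketch below counts missing values and flags unusually long entries with the common interquartile-range rule; the field name text and the sample records are assumptions.

```python
import statistics


def profile(records, text_field="text"):
    """Report missing values and unusually long entries for one text field."""
    missing = sum(1 for r in records if not r.get(text_field))
    lengths = [len(r[text_field]) for r in records if r.get(text_field)]
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    upper = q3 + 1.5 * (q3 - q1)  # common interquartile-range rule for outliers
    return {
        "records": len(records),
        "missing": missing,
        "median_length": statistics.median(lengths),
        "long_outliers": [n for n in lengths if n > upper],
    }


# Hypothetical records: nine texts of varying length, one empty, one without the field
records = [{"text": "x" * n} for n in (10, 11, 12, 12, 13, 13, 14, 15, 200)]
records += [{"text": ""}, {"id": 42}]
report = profile(records)
print(report)
```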

Conclusion

While data scientists usually handle the heavy lifting of dataset analysis, non-data scientists can play a crucial role in ensuring that an LLM performs well and behaves ethically. By engaging in dataset analysis, you not only improve the model’s quality but also help safeguard against potential biases and compliance issues. This approach ensures that the AI systems you contribute to are both effective and responsible.