How to use CodeT5 with LangChain

Here is how to create a CodeT5 wrapper for LangChain, which can be used to embed code generation, translation, or analysis tasks into your LangChain application.

Create a CodeT5 Custom LLM class for LangChain

from typing import Any, List, Optional

from langchain.llms.base import LLM
from langchain.prompts import PromptTemplate

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

class CodeT5LLM(LLM):
    model_name: str
    tokenizer: Any = None
    model: Any = None

    def __init__(self, model_name: str, **kwargs: Any):
        super().__init__(model_name=model_name, **kwargs)
        # Load the tokenizer and the weights once, when the wrapper is created.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Any = None,
        **kwargs: Any,
    ) -> str:
        # Note: stop sequences are accepted for interface compatibility but not applied.
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids

        with torch.no_grad():
            generated_ids = self.model.generate(input_ids, max_length=128)

        return self.tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    @property
    def _identifying_params(self):
        return {"model_name": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"

Usage of the CodeT5 LLM in your LangChain app

Then, you can use the CodeT5 LLM in your code like this:

codet5_llm = CodeT5LLM(model_name="Salesforce/codet5-base-multi-sum")

prompt = PromptTemplate(
    input_variables=["code"],
    template="{code}"
)

chain = prompt | codet5_llm
result = chain.invoke({"code": "def hello_world():\n    print('Hello, World!')"})

print(result)

You may want to adjust the implementation based on your specific use case and the CodeT5 variant you’re using.
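
One such adjustment concerns stop sequences: the _call method above accepts a stop argument but never applies it. A minimal sketch of how stop sequences could be enforced on the decoded text follows. The helper name truncate_at_stop is hypothetical and not part of LangChain or Transformers.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    # Collect the start position of every stop sequence that occurs in the text.
    positions = [text.find(s) for s in stop if s]
    positions = [p for p in positions if p != -1]
    # Keep everything before the earliest match, or the whole text if none matched.
    return text[: min(positions)] if positions else text
```

Inside _call, the decoded string could be passed through truncate_at_stop(decoded, stop) before it is returned.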

5 Python Datasets to Supercharge Your LLM

LLMs have shown remarkable potential, but fine-tuning with high-quality datasets is crucial for optimal performance. This blog post explores five exceptional Python datasets to elevate your LLM fine-tuning process.

flytech/python-codes-25k

The dataset provides clear instruction-output pairs, which is ideal for supervised fine-tuning of models to generate code based on text prompts. The dataset contains a substantial number of examples (25,000), which is sufficient for effective fine-tuning.

This dataset offers a versatile resource for various code-related tasks. It provides a rich collection of Python code examples paired with detailed instructions, enabling training for code generation, natural language understanding of code, and behavioral analysis of coding patterns. Additionally, it serves as a valuable educational tool for exploring coding styles and problem-solving approaches.

While it could potentially be used for benchmarking, its primary strength lies in providing training data for improving existing models rather than evaluating their performance against a standardized set of challenges. You can find it here on Huggingface.

Tested-22k-Python-Alpaca

Tested-22k-Python-Alpaca is a high-quality dataset comprising 22,600 verified working Python code examples. It was meticulously curated by filtering and testing code snippets extracted from various open-source datasets. The primary goal of this dataset is to provide a reliable resource for training and evaluating code-generating AI models. By offering a collection of functional Python code, it addresses the common challenge of models producing incorrect or incomplete code.

The dataset’s standout feature is the rigorous testing process ensuring all code examples are executable. It incorporates code from multiple open-source datasets for diversity.

It is well suited for fine-tuning your LLM to improve its code generation capabilities: its large collection of accurate and diverse Python code examples offers a solid foundation for developing robust code-generating models. Download here.

notional-python

Notional-Python is a quality dataset containing Python code files extracted from 100 popular GitHub repositories. It is primarily designed for evaluating existing language models. While it’s not ideal for training a model from scratch due to its relatively small size, it can be effective for improving an already trained model’s ability to generate Python code.

By fine-tuning on this dataset, you can potentially increase code quality, improve code accuracy, and enhance code style consistency.

Remember that fine-tuning could introduce bias, so it’s crucial to evaluate the model’s performance carefully after fine-tuning. You can find it on Huggingface.

code_contest_python3_alpaca

The Code Contest Processed dataset is well-suited for fine-tuning. The availability of problem descriptions, correct code solutions, and test cases makes it an ideal dataset for training models to generate code based on problem statements. Additionally, the inclusion of Alpaca-style prompts can facilitate fine-tuning for tasks like code generation from natural language instructions.

The dataset comprises coding contest problems and their corresponding Python3 solutions, derived from DeepMind’s code_contest dataset. It offers structured data including problem descriptions, correct code, test cases, and problem sources. Additionally, it provides Alpaca-style prompts for text generation tasks related to code. This dataset is specifically tailored for Python-based machine learning models. You can find it on Huggingface.

stackoverflow_python_preprocessed

This dataset contains questions and answers that were filtered to only include questions with more than 100 votes and answers with more than 5 votes.

While the dataset doesn’t directly provide code snippets, it offers information about Python concepts, problems, and solutions. This textual data can enhance a model’s ability to understand and respond to Python-related queries.

For instance, a model trained on this dataset would be better equipped to:

  • Identify the core of a Python-related problem.
  • Understand the context of a Python question.
  • Provide relevant information or potential solutions.

You can find it here.

Conclusion

Incorporating these LLM datasets into your fine-tuning pipeline is a strategic step towards developing a more sophisticated and capable LLM.

Analyzing an LLM dataset for non-data scientists

Large Language Models (LLMs) have become increasingly important for tasks involving natural language processing (NLP). However, their effectiveness hinges on the quality of the datasets used for training and evaluation. While data scientists typically handle the intricacies of these datasets, there are several reasons why non-data scientists, such as developers, project managers, or domain experts, might also need to engage in this process.

Why Analyze an LLM Dataset?

Understanding and analyzing an LLM dataset is essential for several reasons:

  1. Ensuring Model Quality: The performance of an LLM is directly tied to the quality of its training data. By analyzing the dataset, you can identify any potential issues, such as imbalances, biases, or irrelevant data that might negatively impact the model’s output.
  2. Bias Detection and Ethical Considerations: Datasets can inadvertently contain biases that lead to unfair or unethical outcomes. For example, if the training data over-represents certain demographic groups, the LLM might produce biased results. Analyzing the dataset allows you to spot these issues early and address them before the model is deployed.
  3. Customizing for Specific Needs: Not all datasets are created equal. Depending on your application, you might need to fine-tune the LLM on data that is more relevant to your domain. Analyzing the dataset helps you understand its strengths and weaknesses, guiding the fine-tuning process.
  4. Compliance and Documentation: In regulated industries, it’s crucial to ensure that your data practices are compliant with laws and regulations, such as GDPR. Analyzing the dataset is a necessary step in auditing and documenting the data to meet these requirements.

What to Look for in a Dataset

When you’re tasked with analyzing an LLM dataset, focus on these key aspects:

  • Data Distribution: Check if the data covers all relevant categories and is evenly distributed across them. Imbalances can lead to biased models.
  • Quality and Relevance: Assess the quality of the data—look for noise, duplicates, or irrelevant entries that could skew results.
  • Representation of Sensitive Attributes: Pay attention to how sensitive attributes (e.g., race, gender) are represented to avoid introducing bias.
  • Coverage of Domain-Specific Content: Ensure that the dataset contains sufficient examples related to the specific language, terminology, or context relevant to your application.

Practical Steps

  1. Data Profiling: Start with basic profiling to understand the dataset’s structure, including the distribution of data points, missing values, and outliers.
  2. Bias Auditing: Use statistical methods to detect any biases. Simple checks like comparing distributions across different demographic groups can reveal potential issues.
  3. Domain Relevance Check: Evaluate whether the dataset includes enough examples relevant to your specific use case, and consider augmenting it with additional data if necessary.
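
The first two steps can be sketched with nothing but the Python standard library. The example below is a minimal sketch under assumptions: it treats the dataset as a list of records with a text field and a label field (both field names, and the sample rows, are made up for illustration) and reports missing values, duplicates, text length, and how the labels are distributed.

```python
from collections import Counter
from statistics import mean, median


def profile_dataset(records, text_key="text", label_key="label"):
    """Return basic statistics for a list of dict records."""
    texts = [r.get(text_key) for r in records]
    present = [t for t in texts if t]            # non-empty texts only
    lengths = [len(t.split()) for t in present]  # length in whitespace-separated words
    labels = Counter(
        r.get(label_key) for r in records if r.get(label_key) is not None
    )
    total_labels = sum(labels.values())
    return {
        "rows": len(records),
        "missing_text": len(texts) - len(present),
        "duplicate_texts": len(present) - len(set(present)),
        "mean_words": mean(lengths) if lengths else 0,
        "median_words": median(lengths) if lengths else 0,
        # Share of each label; a strongly skewed share hints at an imbalance.
        "label_share": {k: v / total_labels for k, v in labels.items()},
    }


sample = [
    {"text": "How do I reverse a list in Python?", "label": "python"},
    {"text": "How do I reverse a list in Python?", "label": "python"},
    {"text": "Explain Rust lifetimes.", "label": "rust"},
    {"text": "", "label": "python"},
]
report = profile_dataset(sample)
print(report)
```

In this tiny sample the report flags one missing text, one duplicate, and a label share of 75% for one category, which are exactly the kinds of findings worth raising with the data science team.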

Conclusion

While data scientists usually handle the heavy lifting of dataset analysis, non-data scientists can play a crucial role in ensuring that an LLM performs well and behaves ethically. By engaging in dataset analysis, you not only improve the model’s quality but also help safeguard against potential biases and compliance issues. This approach ensures that the AI systems you contribute to are both effective and responsible.

LLM vs SLM: Why Small Language Models?

Large Language Models (LLMs) and Small Language Models (SLMs) represent distinct approaches to natural language processing. LLMs are massive models trained on vast amounts of text data, enabling them to generate human-quality text, translate languages, write different kinds of creative content, and answer your questions in an informative way. However, their size necessitates substantial computational resources.   

In contrast, SLMs are smaller models trained on more focused datasets. This makes them computationally efficient and suitable for specific tasks. They often excel in particular domains or applications.

Use Cases for LLMs

  • Content generation: Creating various text formats, from articles to code.
  • Machine translation: Translating between different languages.
  • Chatbots and virtual assistants: Providing interactive and informative conversations.
  • Summarization: Condensing long texts into shorter summaries.

LLMs excel at generating diverse text formats, from marketing copy and social media content to scripts, poems, and translations. They can also provide informative answers to a wide range of questions.

Use Cases for SLMs

  • Domain-specific tasks: Excelling in tasks requiring specialized knowledge, such as medical or legal text processing, as well as code-specific tasks.
  • Resource-constrained environments: Operating efficiently on devices with limited computational power.
  • Faster training and deployment: Shorter development cycles compared to LLMs.

SLMs demonstrate strengths in specific text-based applications, excelling in tasks such as sentiment analysis, text classification, and named entity recognition. They can also be tailored for specialized domains like healthcare or finance. Additionally, SLMs can be adapted to support niche programming languages, providing solutions for specific development challenges.

Trade-offs and Considerations

LLMs demand substantial computational resources for training and deployment, reflecting their complexity and size. In contrast, SLMs are more efficient due to their smaller scale. While LLMs often excel in diverse language tasks, SLMs can be specialized for specific domains. Data requirements also differ significantly, with LLMs needing vast datasets and SLMs operating on smaller, focused collections. Ultimately, the choice between an LLM and an SLM hinges on factors such as computational budget, performance, and the nature of the target app.

Hybrid Approaches

Hybrid approaches to language models combine the strengths of large language models (LLMs) and smaller, more specialized language models (SLMs). Transfer learning involves utilizing a pre-trained LLM as a foundation and adapting it to specific tasks through fine-tuning on domain-specific data. This approach benefits from the knowledge captured in the base LLM while tailoring the model to the target domain. Model distillation compresses a large LLM into a smaller, more efficient SLM while preserving key functionalities. This technique enables deployment in resource-constrained environments without significant performance degradation. By strategically combining LLMs and SLMs, organizations can develop robust and adaptable language models capable of handling a wide range of tasks.
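
To make the distillation idea more concrete, here is a minimal sketch of a soft-target loss that is commonly used for it: the student is penalized by the KL divergence between the teacher's and its own temperature-softened output distributions. The logits and the temperature are made-up values for illustration, and a real setup would use a deep learning framework rather than plain Python.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for one token position over a three-word vocabulary.
teacher = [4.0, 1.0, 0.2]
student = [2.5, 1.5, 0.5]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. A higher temperature exposes more of the teacher's preferences among the less likely outputs.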

Run Code LLMs Locally Without the Cloud: 4 User-Friendly Tools (Infographic)

Learn how to run Code LLMs locally with ease using these 4 tools designed for experimentation and exploration of AI technology on your local computer or on-premise server.

LM Studio: Discover a beginner-friendly interface with drag-and-drop functionality for basic code generation tasks.

Ollama: Immerse yourself in an interactive environment tailored for exploring diverse models and functionalities with ease.

Transformers Pipeline from Huggingface: Harness the power of a high-level Python API suited for advanced users seeking customization options.

Transformers Models from Huggingface: Enjoy maximum flexibility by directly loading and utilizing Code LLM models from local or remote storage, empowering developers with seamless integration.
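
As a rough illustration of the last option, the sketch below loads a model directly with the Transformers library and generates a summary for a piece of code. It assumes the Salesforce/codet5-base-multi-sum checkpoint used in the LangChain example on this blog. The function name and the length limits are illustrative choices, not recommendations.

```python
MODEL_NAME = "Salesforce/codet5-base-multi-sum"  # assumption: same checkpoint as the LangChain example


def summarize_code(code: str, model_name: str = MODEL_NAME, max_length: int = 64) -> str:
    """Load a seq2seq code model and return its generated summary for `code`."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    # Truncate very long inputs so they fit the model's context window.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling summarize_code("def add(a, b):\n    return a + b") downloads the weights on first use and caches them locally, so later calls no longer need to fetch the model. For repeated calls you would load the tokenizer and model once and reuse them.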

If you’re more into the technical side, in this blog post I explain how to set up and run a CodeT5 LLM in Python, which you can easily run locally on your laptop or on premise.

6 Python Libs to make a Powerful AI Training Stack

PyTorch

PyTorch is a deep learning framework that is based on the Torch library. It is a powerful and flexible framework for building and training neural networks, and it is particularly well-suited for GPU-accelerated training.

Strength: Powerful and flexible deep learning framework.

Website pytorch.org, PyTorch on GitHub

Keras

Keras is a high-level neural network API that runs on top of backends such as TensorFlow, JAX, and PyTorch. It provides a user-friendly interface for defining neural network architectures and training models, making it a popular choice for beginners and experienced practitioners alike.

Strength: User-friendly interface for defining neural network architectures and training models.

Website keras.io, Keras on GitHub

Pandas

Pandas is a Python library for data manipulation and analysis. It provides a powerful set of tools for data cleaning, transformation, and analysis, making it an essential tool for working with structured data in machine learning.

Strength: Powerful data manipulation and analysis for structured data.

Website pandas.pydata.org, Pandas on GitHub

NumPy

NumPy is a numerical Python library that provides efficient data structures and operations for numerical computing. It is essential for working with large datasets, especially in scientific computing and machine learning.

Strength: Fast numerical data manipulation and operations.

Website numpy.org, NumPy on GitHub

Matplotlib

Matplotlib is a Python library for creating 2D plots and visualizations. It is a versatile tool for visualizing data, and it is widely used in machine learning for tasks such as data exploration, model evaluation, and communication of results.

Strength: Versatile tool for creating 2D plots and visualizations.

Website matplotlib.org, Matplotlib on GitHub

Huggingface Transformers

Transformers is a library for constructing, utilizing, and adapting transformer-based NLP models, enabling NLP practitioners to build effective NLP applications tailored to their specific needs.

Website huggingface.co, Transformers on GitHub

Together, these libraries form a powerful and versatile AI training stack and are widely used in machine learning and LLM development.

Fix: No GPU support in Tensorflow

I came across a problem where my Tensorflow installation did not recognize the installed GPU, despite CUDA and the Nvidia drivers being installed properly.

Test:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

returned an empty list. Furthermore, Tensorflow reports that it cannot find the CUDA drivers:

2024-01-30 14:57:42.015454: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

Output of the Nvidia tool is correct and shows Cuda is installed:

nvidia-smi

ubuntu@ip-bla-foo:~/build-nb$  nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Which tells us it is version 12. Ahhh!💡

Now, CUDA 12.1 is a release from 2023, and my guess was that Tensorflow 2.13 might not support this version yet, see https://blog.tensorflow.org/2023/11/whats-new-in-tensorflow-2-15.html

Ok, the latest version pip offered was TF 2.13 on Python 3.8. Here is the fix:

  1. upgrade Python: sudo apt install python3.9
  2. a new venv: virtualenv --python /usr/bin/python3.9 ~/.env-python3.9
  3. source ~/.env-python3.9/bin/activate
  4. pip install --upgrade pip
  5. python3 -m pip install tensorflow[and-cuda]==2.15.0.post1

Test: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-01-30 15:27:04.458720: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-30 15:27:04.458772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-30 15:27:04.459601: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-30 15:27:04.465334: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 15:27:05.115551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-30 15:27:05.560865: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.585883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.586100: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Now we see the GPU in Tensorflow.
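
If you want to check this kind of mismatch from Python rather than from log output, a small diagnostic along the following lines may help. It is a sketch, not part of the fix above: the function name is hypothetical, and it relies on tf.sysconfig.get_build_info(), which reports the CUDA and cuDNN versions the installed Tensorflow wheel was built against (on CPU-only builds these entries are absent, so the sketch returns None for them).

```python
def tensorflow_gpu_report():
    """Summarize the installed Tensorflow build and the GPUs it can see.

    Returns None when Tensorflow is not installed in the current environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    build = tf.sysconfig.get_build_info()
    return {
        "tf_version": tf.__version__,
        # CUDA/cuDNN versions the wheel was built against (None on CPU-only builds).
        "cuda_build_version": build.get("cuda_version"),
        "cudnn_build_version": build.get("cudnn_version"),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


print(tensorflow_gpu_report())
```

Comparing cuda_build_version with the release shown by nvcc --version makes it easier to spot when the installed wheel and the local CUDA toolkit belong to different major versions.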

Large vs Small LLMs – Thoughts

If you are working on a task that is very specific, a smaller LLM may be able to learn the task-specific patterns more quickly than a larger LLM. Additionally, if you are working on a resource-constrained device, a smaller LLM may be the only option. Read in this blog post how to prepare an LLM for a specific task.

Benefits of large LLMs, such as 70B

Large language models (LLMs) with more parameters are typically trained on larger datasets. The more parameters an LLM has, the more complex the patterns it can represent, and the more training data it can make use of. This is because the parameters are the weights of the connections between the neurons in the LLM’s neural network: the more parameters there are, the more connections there are, and the more complex the network can be.

Benefits of smaller LLMs, such as 6B or 770M

If I have a task that requires Python, I don’t need a model trained on Haskell, Go and Rust. A model that is trained on many programming languages has to spread its capacity across all of them, which can make it less effective at generating code in one specific language.

An LLM that is trained on a large dataset of Python, Haskell, Go, and Rust code may be able to generate code in all of these languages. However, it may not be as good at generating idiomatic Python code as an LLM that is specifically trained on Python code.

If you have a task that requires Python, it is generally best to use an LLM that is specifically trained on Python code. This will give you the best chance of generating code that is syntactically correct, semantically meaningful, and idiomatic.

A 6B model is significantly more convenient for many purposes: it is less expensive to operate, runs on your laptop, and may be more accurate on a specific language if the training data is good.
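
To put rough numbers on that convenience, the sketch below estimates how much memory the weights alone need for the model sizes mentioned in this post. It is a back-of-the-envelope calculation: it assumes 2 bytes per parameter for fp16 (4 for fp32, 1 for int8) and ignores activations and any other runtime overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("70B", 70e9), ("6B", 6e9), ("770M", 770e6)]:
    print(f"{name}: {weight_memory_gb(params):.2f} GB in fp16")
# 70B: 140.00 GB in fp16
# 6B: 12.00 GB in fp16
# 770M: 1.54 GB in fp16
```

By this estimate a 770M model fits comfortably into ordinary laptop memory, a 6B model is feasible on a well-equipped machine or with lower precision, and a 70B model needs server-class hardware.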

A good way to decide whether to use an LLM that is trained on multiple programming languages or an LLM that is specifically trained on one programming language is to experiment with both and see which one works better for your task.

Prepare Data for Code LLM Training

If you want to teach your LLM some tricks, you need to prepare training data and run a training (or fine-tuning) on the LLM. For more complex knowledge, this would be a set of a few dozen or even hundreds of data pairs, each consisting of an input and the output the model should produce for it. This is called Supervised Learning.

For example: a piece of code, and a description of what the code does. If you write about 100 of these pairs, the LLM will start understanding and be able to explain code it hasn’t seen before. It can also be a piece of code and an instruction: the instruction describes how the given code should be built. As a result, the LLM will be able to write code out of text instructions.

Example:
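
As an illustration, the two kinds of pairs described above could look like the following sketch. The snippets, the field names (code, description, instruction) and the wording are hypothetical and only show the shape of the data.

```python
# Hypothetical pair for teaching the model to explain code.
explain_pair = {
    "code": "def area(r):\n    return 3.14159 * r ** 2",
    "description": "Computes the area of a circle with radius r.",
}

# Hypothetical pair for teaching the model to write code from an instruction.
instruct_pair = {
    "instruction": "Write a function that returns the area of a circle with radius r.",
    "code": "def area(r):\n    return 3.14159 * r ** 2",
}

# Sanity check: the snippet in each pair should at least be valid Python.
for pair in (explain_pair, instruct_pair):
    compile(pair["code"], "<pair>", "exec")
```

Checking that every snippet compiles is a cheap first filter before such pairs go into a training set.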

How much should I write?

You can start seeing results with as few as 100 pairs. But the actual number you will need depends on various factors such as model complexity, data quality, diversity, the complexity of the task or the available training resources.

More complex models might require more data to learn effectively. Higher-quality data can lead to better performance, and it might compensate for a smaller dataset to some extent. A diverse dataset covering various programming languages, problem domains, and styles can enhance the model’s generalization. If the task requires highly nuanced or specialized descriptions, more data might be needed to capture these nuances effectively. The computational resources available for training play a role too; larger datasets might require more computational power and time.

How to Start

Begin with a reasonably sized dataset and monitor the model’s performance. You can then incrementally add more data, observing how the model improves with additional training examples.

As a general rule of thumb, having several thousand pairs of code and descriptions is a good starting point for training a language model effectively. However, this can vary significantly based on the factors mentioned above.

Tools that Help

First, you need to get a larger set of code snippets from your code base, or from something you find on the internet or on GitHub. A useful tool for that is Tree-sitter. It supports a lot of languages (parsers), from JS, Python, C++ and the like to more esoteric languages such as Erlang, Haskell, Fennel (a Lisp that compiles to Lua). For a base dataset, you need it to be somewhat diverse and to cover each topic about equally, such as language datatypes, conditional constructs, I/O, etc. When it gets to your specific use cases, identify what is essential and make sure you cover everything.

When you have your list of snippets, you can import them into a tool such as OpenDocString, which helps you write the descriptions, balance the topics of your dataset and gives insights on data quality and diversity. The tool is in its early stage, but already looks very promising and makes life much easier.

Once done, you have a larger list of code and descriptions, which you can then feed to your model for training, either using an online service or training it locally on your machine or cloud instance.
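
To give an idea of what snippet extraction looks like, here is a minimal sketch. Note that it uses Python's built-in ast module rather than the parser tool mentioned above, so it only handles Python source. The sample code and the function name extract_functions are made up for illustration.

```python
import ast


def extract_functions(source: str):
    """Return (name, code, docstring) for every function defined in `source`."""
    tree = ast.parse(source)
    snippets = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            snippets.append(
                (node.name, ast.get_source_segment(source, node), ast.get_docstring(node))
            )
    return snippets


example = '''
def greet(name):
    """Return a greeting."""
    return f"Hello, {name}!"

class Stack:
    def push(self, item):
        self.items.append(item)
'''

for name, code, doc in extract_functions(example):
    print(name, "->", doc)
```

Functions that already carry a docstring give you a first draft of the description for free. The ones without a docstring are the ones you still need to describe by hand.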
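
One simple way to hand such a list to a training script is a CSV file. The sketch below is only an assumption about how this could be done: the column names code and description are illustrative, and the training setup you use may expect a different layout. Quoting every field keeps multi-line code snippets intact.

```python
import csv
import io


def pairs_to_csv(pairs, fieldnames=("code", "description")):
    """Serialize (code, description) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames), quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for pair in pairs:
        # Map the two values of each pair onto the two column names.
        writer.writerow(dict(zip(fieldnames, pair)))
    return buffer.getvalue()


pairs = [
    ("def add(a, b):\n    return a + b", "Adds two numbers."),
    ("print('hi')", "Prints a greeting."),
]
csv_text = pairs_to_csv(pairs)

# Reading the text back shows that multi-line snippets survive the round trip.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["code"])
```

To write a real file instead of a string, open it with newline="" as the csv module documentation recommends, and pass the file object to csv.DictWriter.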

How to Use CodeT5 on Your Laptop: A Step-by-Step Guide

CodeT5+ (or CodeT5) is an advanced LLM designed for developers. It can generate source code and explain what your code does. It performs well while being lightweight enough to run on a laptop for both inference and fine-tuning, and it can be trained with additional knowledge data. I’ll show how to set up and use CodeT5+ on your laptop in minutes.

💡 Try the inference web demo. If you need the LLM to better understand your code or toolchain, you can fine-tune it. It is really not very difficult to set up, and doesn’t require expensive hardware. Read here how to do the training, and here how to use CodeT5 with LangChain.

CodeT5 vs CodeT5+?

CodeT5 is an advanced Transformer-based model designed for both understanding and generating code. It stands out by effectively handling code identifiers, leveraging user-written comments, and excelling in various code-related tasks, surpassing previous methods.

CodeT5, provided by Salesforce, comes pre-trained in small, base and large versions, which differ in their number of parameters. The newer versions are called CodeT5+, which I will use here.

Installation on Your Laptop

CodeT5+ delivers impressive performance while still running smoothly on a laptop for inference. Fine-tuning can also be done locally, making CodeT5 an ideal choice for developers who want to experiment with large language models (LLMs).

Setting Up Your Environment

I’ll show how to set it up and use CodeT5+ fine-tuned with some KDE code.

For an easy installation and for demonstrating fine-tuning, this GitHub repo is well suited and uses the new CodeT5+. It can run as a local server and comes with a simple demo html page, a great choice if you want to make a quick REST call from your app and go from there. And for reference, here is the original Salesforce GitHub repo.

The model is capable of describing code accurately, and you can fine-tune this model with your own code snippets to improve it or to teach it your latest API changes.

To get started, create a dedicated folder for your project and set up the necessary Python packages. Here are the steps:

Create a virtual environment in your project folder:

Activate the virtual environment:

Install the required Python packages from the provided “requirements.txt” file:

Once you’ve set up the environment, you’ll need to obtain and configure the model weights before running the CodeT5+ model. Here’s how you can do that:

Download the model weights from the specified URL:

Unpack the downloaded model weights. You can do this manually or by using the provided script:

Manually:

Or by using the script:

After completing these steps, you’ll have the model weights available in the “api/saved-pretrained-kde-…” directory, and your environment will be set up to run the CodeT5 model effectively.

Using CodeT5 for Inference

To test your model you can run the inference.py script in the folder.

However, the model becomes more useful when you start it as a local server and make requests to it. You can do:

Now, there is a demo.html in your folder which you can open and make some queries. Try to paste some code and see what it does.

Prompt example: Write a replace method for a string class which replaces the given string with a given set of characters. Try it out on this running demo.

If you find that it gives a wrong description for your code, you might want to fine-tune your model. Keep in mind that the model is trained on certain code and programming languages, but maybe hasn’t seen any code similar to the one you tried.

Fine-tuning CodeT5 (Optional)

If you want to add capabilities to the model, such as another programming language, toolkit usage or other usage patterns, you can fine-tune this model.

Why fine-tuning?

Fine-tuning builds on a pre-trained model and simply adds knowledge on top. This way, you don’t need to go through the full, lengthy and expensive training process. Fine-tuning is much faster and cheaper.

How to Fine-tune CodeT5

In order to do that, you will need to create a dataset which is then fed to the model training. To get some results, a dataset of about 100 entries is already sufficient. The entries should show some variety and cover different, but similar topics.

For example, if you want to improve your model’s capacity to understand Python Django code, you can devote some entries to Django views, some to Django models, and so on. They are close enough to have things in common (Django) and different enough for diversification (models, views, etc).

The data is organized in simple csv files of the following format:

The training process will then read this file and feed it to the model. Read more about preparing training data in this blog post.

In an upcoming blog post, I will show how to set up such a fine-tune training on a MacBook M1.

Let me know if you have any questions about running and fine-tuning CodeT5, either in the comment section below or by reaching out directly.