Prepare Data for Code LLM Training

If you want to teach your LLM new tricks, you need to prepare training data and run a training (or fine-tuning) pass on it. For more complex knowledge, this means a set of a few dozen or even a few hundred data pairs: an input and the output it should produce. This is called supervised learning.

For example: a piece of code paired with a description of what the code does. If you write about 100 of these pairs, the LLM will start to understand and be able to explain code it hasn't seen before. A pair can also be a piece of code and an instruction, where the instruction describes how the given code should be built. As a result, the LLM learns to write code from text instructions.

Example:
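One common way to store such pairs is JSON Lines, one object per line. This is a minimal sketch; the field names ("code", "description") are an illustrative assumption, not a standard:

```python
import json

# Two illustrative training pairs: code plus a natural-language description.
# The field names "code" and "description" are an assumption, not a standard.
pairs = [
    {
        "code": "def is_even(n):\n    return n % 2 == 0",
        "description": "Returns True if the integer n is divisible by 2.",
    },
    {
        "code": "total = sum(x * x for x in range(10))",
        "description": "Sums the squares of the numbers 0 through 9.",
    },
]

# Serialize to JSON Lines, a format many training pipelines accept.
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(jsonl.splitlines()[0])
```

Each line is an independent JSON object, which makes it easy to append new pairs as your dataset grows.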

How much should I write?

You can start seeing results with as few as 100 pairs. But the actual number you need depends on several factors: model complexity, data quality and diversity, the complexity of the task, and the available training resources.

More complex models might require more data to learn effectively. Higher-quality data can lead to better performance, and it can partly compensate for a smaller dataset. A diverse dataset covering various programming languages, problem domains, and styles improves the model's generalization. If the task requires highly nuanced or specialized descriptions, more data may be needed to capture those nuances. The computational resources available for training play a role too: larger datasets require more computational power and time.

How to Start

Begin with a reasonably sized dataset and monitor the model’s performance. You can then incrementally add more data, observing how the model improves with additional training examples.

As a general rule of thumb, having several thousand pairs of code and descriptions is a good starting point for training a language model effectively. However, this can vary significantly based on the factors mentioned above.
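The incremental approach above can be sketched as a simple loop: hold out a validation split, train on a growing subset, and watch the score at each size. Here train_and_score is a hypothetical stand-in for your real fine-tuning run, and the dataset is a toy placeholder:

```python
import random

def train_and_score(train_pairs, val_pairs):
    """Hypothetical stand-in for a real fine-tuning run that
    returns a quality score measured on the validation pairs."""
    # Placeholder logic: pretend the score grows with training-set size.
    return min(1.0, len(train_pairs) / 5000)

# Toy dataset of (code, description) pairs standing in for real data.
dataset = [(f"code_{i}", f"description_{i}") for i in range(3000)]
random.seed(0)
random.shuffle(dataset)

# Hold out 10% for validation so improvement is measured fairly.
cut = len(dataset) // 10
val, train = dataset[:cut], dataset[cut:]

# Grow the training set in steps and record the score at each size.
scores = {}
for size in (100, 500, 1000, len(train)):
    scores[size] = train_and_score(train[:size], val)

for size, score in scores.items():
    print(f"{size:>5} pairs -> score {score:.2f}")
```

If the score curve flattens out, additional pairs of the same kind are unlikely to help, and adding diversity instead is usually the better investment.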

Tools that Help

First, you need a larger set of code snippets, from your own code base or from sources you find on the internet or on GitHub. A useful tool for this is Tree-sitter. It supports many languages (parsers), from JS, Python, and C++ to more esoteric languages such as Erlang, Haskell, and Fennel (a Lisp that compiles to Lua). For a base dataset, the snippets should be reasonably diverse and cover each topic roughly equally: language datatypes, conditional constructs, I/O, and so on. For your specific use cases, identify what is essential and make sure everything is covered.
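As a sketch of the extraction step, Python's built-in ast module can pull function-level snippets out of a single language; Tree-sitter does the same job across many languages through its per-language parsers, but this stdlib version keeps the example self-contained:

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Return the source text of each top-level function in a Python file.
    A single-language sketch; Tree-sitter generalizes this idea to many
    languages via per-language grammars."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

sample = '''
def greet(name):
    return f"Hello, {name}!"

VERSION = "1.0"

def add(a, b):
    return a + b
'''

snippets = extract_functions(sample)
print(len(snippets))  # 2 functions found, the module-level constant is skipped
```

Extracting at the function level tends to give snippets that are self-contained enough to describe in one or two sentences.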

When you have your list of snippets, you can import them into a tool such as OpenDocString, which helps you write the descriptions, balance the topics in your dataset, and gives insights into data quality and diversity. The tool is at an early stage, but it already looks very promising and makes life much easier.

Once done, you have a large list of code/description pairs, which you can feed to your model for training, either through an online service or locally on your own machine or a cloud instance.
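Before handing the pairs over, they usually have to be converted into the layout the trainer expects. A prompt/completion layout is one common convention; the exact field names ("prompt", "completion") and the file name here are assumptions that depend on the service you use:

```python
import json

# Raw pairs produced in the previous steps (toy example data).
pairs = [
    ("def square(x):\n    return x * x", "Returns the square of x."),
]

# Convert to a prompt/completion layout; the field names are an
# assumption and vary between training services.
records = [
    {"prompt": f"Explain this code:\n{code}", "completion": desc}
    for code, desc in pairs
]

# Write one JSON object per line, ready for upload or local training.
with open("training_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Check your training service's documentation for the exact schema it expects before uploading.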