Python-Alpaca Dataset

I came across this dataset recently, a collection of 22k Python code examples, tested and verified to work. What really caught my attention is how this was put together—they used a custom script to extract Python code from Alpaca-formatted datasets, tested each snippet locally, and only kept the functional ones. Non-functional examples were separated into their own file.
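Their exact script isn't published in the post, so the following is only a sketch of the core idea (the "output" field name and both helpers are my assumptions, not their code): run each extracted snippet in a fresh interpreter and keep only the ones that exit cleanly.

```python
import os
import subprocess
import sys
import tempfile

def snippet_works(code: str, timeout: int = 10) -> bool:
    """Run a code snippet in a fresh interpreter; True if it exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # Hanging snippets count as non-functional
        return False
    finally:
        os.unlink(path)

def split_dataset(examples):
    """Sort Alpaca-style records into functional and broken snippets,
    assuming the code lives in each record's "output" field."""
    good, bad = [], []
    for ex in examples:
        (good if snippet_works(ex["output"]) else bad).append(ex)
    return good, bad
```

Running each snippet in a subprocess (rather than `exec`) keeps crashes and infinite loops from taking down the filtering script itself.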

The dataset pulls from a mix of open-source projects like Wizard-LM’s Evol datasets, CodeUp’s 19k, and a bunch of others, plus some hand-prompted GPT-4 examples. Everything’s been deduplicated, so you’re not stuck with repeats.

It’s especially cool if you’re working on training AI models for coding tasks because it sidesteps one of the biggest issues with open datasets: non-functional or broken code. They even hinted at adapting the script for other languages like C++ or SQL.

If you use the dataset or their script, they ask for attribution: "Filtered Using Vezora's CodeTester". Oh, and they're working on releasing an even bigger dataset with 220,000+ examples; definitely one to keep an eye on!

On Hugging Face: Tested-22k-Python-Alpaca

Read also how to analyze a dataset.

Can't install PyTorch on my MacBook

To my surprise, I wasn't able to install PyTorch for a project on my MacBook Pro M1 today (macOS Sequoia 15.2). I kept getting this error when running pip3 install -r requirements.txt:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9
ERROR: Could not find a version that satisfies the requirement torch==2.7.0.dev20250116 (from versions: none)
ERROR: No matching distribution found for torch==2.7.0.dev20250116

I tried installing it manually with pip3 install torch, but no luck:

pip install torch
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch

Solution

This is what I came up with and it works fine:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
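A quick sanity check after the install (assuming it succeeded) is to import torch and print the version; nightly builds carry a ".dev" suffix. On Apple Silicon the Metal (MPS) backend should also report as available, since the macOS arm64 wheels ship with MPS support even when installed from the CPU index:

```python
import torch

# Nightly builds have a ".dev" suffix, e.g. "2.7.0.dev20250116"
print(torch.__version__)

# On an M1/M2 Mac the Metal (MPS) backend should be available
print(torch.backends.mps.is_available())
```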

Fix: No GPU support in TensorFlow

I ran into a problem where my TensorFlow installation did not recognize the installed GPU, despite CUDA and the NVIDIA drivers being installed properly.

Test:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

This returned an empty list. Furthermore, TensorFlow logs that it cannot find the CUDA drivers:

2024-01-30 14:57:42.015454: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

The output of the NVIDIA tools (nvidia-smi, nvcc) is correct and shows CUDA is installed:

ubuntu@ip-bla-foo:~/build-nb$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

This tells us CUDA is version 12.1. Ahhh! 💡

Now, CUDA 12 is a 2023-era release, and my hunch was that TensorFlow 2.13 might not support this version yet; see https://blog.tensorflow.org/2023/11/whats-new-in-tensorflow-2-15.html

OK, the latest version pip offered me was TF 2.13 on Python 3.8. Here is the fix:

  1. Upgrade Python: sudo apt install python3.9
  2. Create a new venv: virtualenv --python /usr/bin/python3.9 ~/.env-python3.9
  3. source ~/.env-python3.9/bin/activate
  4. pip install --upgrade pip
  5. python3 -m pip install tensorflow[and-cuda]==2.15.0.post1

Test: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-01-30 15:27:04.458720: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-30 15:27:04.458772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-30 15:27:04.459601: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-30 15:27:04.465334: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 15:27:05.115551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-30 15:27:05.560865: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.585883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.586100: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Now we see the GPU in TensorFlow.

Large vs Small LLMs – Thoughts

If you are working on a task that is very specific, a smaller LLM may be able to learn the task-specific patterns more quickly than a larger LLM. Additionally, if you are working on a resource-constrained device, a smaller LLM may be the only option. Read in this blog post how to prepare an LLM for a specific task.

Benefits of large LLMs, such as 70B

Large language models (LLMs) with more parameters are typically trained on larger datasets. More parameters mean more connections between the neurons of the network, which lets the model represent more complex functions and absorb more patterns from its training data.
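As a rough back-of-the-envelope check on how parameter counts arise, a standard decoder-only transformer with a 4x feed-forward block has about 12·d² parameters per layer (attention projections plus the MLP), ignoring details like SwiGLU or grouped-query attention. A sketch, using a hypothetical 70B-class shape (80 layers, d_model 8192, 32k vocabulary):

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab: int) -> int:
    """Rough decoder-only parameter count: per layer ~4*d^2 for the
    attention projections plus ~8*d^2 for a 4x feed-forward block,
    plus the token-embedding matrix. Real architectures deviate."""
    per_layer = 12 * d_model ** 2
    embeddings = vocab * d_model
    return n_layers * per_layer + embeddings

# A 70B-class shape: 80 layers, d_model 8192, 32k vocabulary
print(approx_transformer_params(80, 8192, 32_000))  # ~6.5e10
```

Real 70B-class models land in the same ballpark; the remaining gap comes from the architectural details the formula ignores.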

Benefits of smaller LLMs, such as 6B or 770M

If I have a task that requires Python, I don't need a model trained on Haskell, Go, and Rust. A model trained on a variety of programming languages spreads its capacity across all of them, which can make it less effective at generating code in any single language.

An LLM that is trained on a large dataset of Python, Haskell, Go, and Rust code may be able to generate code in all of these languages. However, it may not be as good at generating idiomatic Python code as an LLM that is specifically trained on Python code.

If you have a task that requires Python, it is generally best to use an LLM that is specifically trained on Python code. This will give you the best chance of generating code that is syntactically correct, semantically meaningful, and idiomatic.

A 6B model is significantly more convenient for many purposes: it is cheaper to operate, runs on your laptop, and may even be more accurate on that specific language if the training data is good.

A good way to decide whether to use an LLM that is trained on multiple programming languages or an LLM that is specifically trained on one programming language is to experiment with both and see which one works better for your task.
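Such an experiment can be as simple as a tiny harness that measures pass rate over a handful of prompt/test pairs; `generate` below is a hypothetical stand-in for whichever model you are comparing, not a real API:

```python
def passes(code: str, test: str) -> bool:
    """True if the generated code plus its test executes without error.
    Runs in-process via exec(), which is fine for a trusted demo but
    should be sandboxed (subprocess, container) for real model output."""
    try:
        exec(code + "\n" + test, {})
        return True
    except Exception:
        return False

def pass_rate(generate, tasks) -> float:
    """Fraction of tasks whose generated solution passes its test.
    `generate` is any callable mapping a prompt string to Python source."""
    hits = sum(passes(generate(t["prompt"]), t["test"]) for t in tasks)
    return hits / len(tasks)
```

Run both candidate models over the same task list and compare the two pass rates; with a small, task-specific test set this gives a much more honest signal than eyeballing a few completions.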