I came across this dataset recently, a collection of 22k Python code examples, tested and verified to work. What really caught my attention is how this was put together—they used a custom script to extract Python code from Alpaca-formatted datasets, tested each snippet locally, and only kept the functional ones. Non-functional examples were separated into their own file.
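The filtering step described above can be sketched roughly as follows. This is not the authors' script, only a minimal illustration under stated assumptions: each record is an Alpaca-style dict with the code in an "output" field, and "works" means the snippet exits with status 0 within a short timeout. The names extract_code, runs_cleanly and split_records are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the listing readable


def extract_code(record):
    """Pull the code out of an Alpaca-style record (assumed "output" field)."""
    text = record.get("output", "")
    if FENCE not in text:
        return text.strip()
    # Take the body of the first fenced block and drop an optional language tag.
    body = text.split(FENCE)[1]
    first_line, _, rest = body.partition("\n")
    if first_line.strip().lower() in ("python", "py", ""):
        return rest.strip()
    return body.strip()


def runs_cleanly(code, timeout=5.0):
    """Run the snippet in a fresh interpreter; True if it exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,
                capture_output=True,
                stdin=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def split_records(records):
    """Separate records into (working, broken) lists."""
    working, broken = [], []
    for record in records:
        code = extract_code(record)
        ok = bool(code) and runs_cleanly(code)
        (working if ok else broken).append(record)
    return working, broken
```

Snippets from public datasets are untrusted code, so a real run of something like this belongs in a container or a throwaway VM rather than on your main machine.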
The dataset pulls from a mix of open-source projects like Wizard-LM’s Evol datasets, CodeUp’s 19k, and a bunch of others, plus some hand-prompted GPT-4 examples. Everything’s been deduplicated, so you’re not stuck with repeats.
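The post does not say how the deduplication was done. One simple approach, shown here purely as an assumption, is to hash a lightly normalized version of each snippet and keep the first occurrence. The normalization (trailing whitespace and blank lines removed) is my own choice for the sketch.

```python
import hashlib


def normalize(code):
    """Drop trailing whitespace and blank lines so trivial variants compare equal."""
    lines = [line.rstrip() for line in code.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def deduplicate(snippets):
    """Keep the first occurrence of each snippet, compared after normalization."""
    seen = set()
    unique = []
    for code in snippets:
        digest = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique
```

Exact-match hashing like this only catches near-identical copies; snippets that differ in variable names or comments would still count as distinct.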
It’s especially cool if you’re working on training AI models for coding tasks because it sidesteps one of the biggest issues with open datasets: non-functional or broken code. They even hinted at adapting the script for other languages like C++ or SQL.
If you use the dataset or their script, they ask for attribution: Filtered Using Vezora’s CodeTester. Oh, and they’re working on releasing an even bigger dataset with 220,000+ examples, definitely one to keep an eye on!
To my surprise, I wasn’t able to install PyTorch for a project on my MacBook Pro M1 today (macOS Sequoia 15.2). I kept getting this error when running pip3 install -r requirements.txt:
ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9
ERROR: Could not find a version that satisfies the requirement torch==2.7.0.dev20250116 (from versions: none)
ERROR: No matching distribution found for torch==2.7.0.dev20250116
I tried it manually: pip3 install torch, no luck:
pip install torch
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch
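The first error line says that some package versions were ignored because their Requires-Python range does not match the running interpreter, so one thing worth checking is which Python version pip3 is actually using. Below is a small sketch that tests a version against specifier strings like the ones in the log. It is a simplified parser of my own (the satisfies helper is hypothetical and handles only plain comparison clauses), not pip's real resolver logic, and it does not by itself explain the torch failure.

```python
import operator
import sys

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def to_tuple(text, width=3):
    """Turn "3.11" into (3, 11, 0), padding with zeros up to `width` parts."""
    parts = [int(part) for part in text.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(specifier, version):
    """Check a (major, minor, micro) tuple against e.g. ">=3.7,<3.11"."""
    version = (tuple(version) + (0, 0, 0))[:3]
    for clause in specifier.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS.items():
            if clause.startswith(symbol):
                if not compare(version, to_tuple(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


if __name__ == "__main__":
    current = sys.version_info[:3]
    for spec in (">=3.7,<3.11", "<3.13,>=3.9"):
        print(spec, "->", satisfies(spec, current))
```

If both ranges come back False, the interpreter is newer or older than those releases support.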
I came across a problem where my TensorFlow installation did not recognize the installed GPU, even though CUDA and the NVIDIA drivers were installed properly.
Test:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
This returned an empty list. Furthermore, TensorFlow reported that it could not find the CUDA drivers:
2024-01-30 14:57:42.015454: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
The output of the NVIDIA tools is correct and shows that CUDA is installed:
nvidia-smi
ubuntu@ip-bla-foo:~/build-nb$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
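nvidia-smi and nvcc only show that the driver and toolkit exist on the machine. A complementary check is whether a Python process can locate and load the CUDA shared libraries at all. The sketch below uses only the standard library. The library names ("cuda", "cudart", "cudnn") are my assumption about what is relevant here, and ctypes.util.find_library does not necessarily search the same paths TensorFlow does, so treat the result as a hint rather than a diagnosis.

```python
import ctypes
import ctypes.util


def probe_library(name):
    """Return True if the shared library can be located and loaded."""
    path = ctypes.util.find_library(name)
    if path is None:
        return False
    try:
        ctypes.CDLL(path)
    except OSError:
        return False
    return True


def report(names=("cuda", "cudart", "cudnn")):
    """Map each library name to whether it could be loaded."""
    return {name: probe_library(name) for name in names}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"lib{name}: {'found' if found else 'not found'}")
```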
Test:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-01-30 15:27:04.458720: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-30 15:27:04.458772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-30 15:27:04.459601: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-30 15:27:04.465334: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 15:27:05.115551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-30 15:27:05.560865: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.585883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.586100: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
If you are working on a task that is very specific, a smaller LLM may be able to learn the task-specific patterns more quickly than a larger LLM. Additionally, if you are working on a resource-constrained device, a smaller LLM may be the only option. Read in this blog post how to prepare an LLM for a specific task.
Benefits of large LLMs, such as 70B
Large language models (LLMs) with more parameters are typically trained on larger datasets. The more parameters an LLM has, the more capacity it has to represent complex patterns in that data. This is because the parameters are the weights on the connections between the neurons in the LLM’s neural network: the more parameters there are, the more connections there are, and the more complex the functions the network can represent.
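To make the link between connections and parameters concrete, here is a small sketch that counts the parameters of a plain fully connected network. Real LLMs are transformers with a different layout, so this is a simplified illustration rather than how a 70B model is actually counted, and the layer sizes are made up.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    (one per connection) plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


if __name__ == "__main__":
    print(dense_parameter_count([784, 128, 10]))  # 101770
    print(dense_parameter_count([784, 256, 10]))  # 203530
```

Doubling the width of the hidden layer roughly doubles the number of connections, and with it the parameter count.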
Benefits of smaller LLMs, such as 6B or 770M
If I have a task that requires Python, I don’t need a model trained on Haskell, Go, and Rust. A model of a given size has to spread its capacity across every language in its training data, which can make it less effective at generating code in one specific language.
An LLM that is trained on a large dataset of Python, Haskell, Go, and Rust code may be able to generate code in all of these languages. However, it may not be as good at generating idiomatic Python code as an LLM that is specifically trained on Python code.
If you have a task that requires Python, it is generally best to use an LLM that is specifically trained on Python code. This will give you the best chance of generating code that is syntactically correct, semantically meaningful, and idiomatic.
A 6B model is significantly more convenient for many purposes: it is less expensive to operate, it can run on a laptop, and it may be more accurate on that specific language if the training data is good.
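A rough back-of-the-envelope calculation shows why the smaller model fits on a laptop. The sketch below estimates the memory needed for the weights alone, assuming 2 bytes per parameter (16-bit weights) or 0.5 bytes (4-bit quantization). Both figures are my assumptions, and activations and other runtime overhead are ignored.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2.0):
    """Approximate memory for the model weights only, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    models = {"770M": 770_000_000, "6B": 6_000_000_000, "70B": 70_000_000_000}
    for name, params in models.items():
        fp16 = weight_memory_gb(params)
        int4 = weight_memory_gb(params, bytes_per_parameter=0.5)
        print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

Under these assumptions a 6B model needs about 12 GB at 16-bit and about 3 GB at 4-bit, while a 70B model needs about 140 GB and 35 GB respectively.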
A good way to decide whether to use an LLM that is trained on multiple programming languages or an LLM that is specifically trained on one programming language is to experiment with both and see which one works better for your task.