How to run Ollama in CircleCI

Yes, it’s absolutely possible! You can run a small LLM such as Gemma 3 4B with Ollama in a basic CircleCI pipeline and integrate AI capabilities directly into your CI/CD workflows. Its capabilities are limited, of course, but they are enough for agents or semantic unit tests.

Here is an example CircleCI config that uses Ollama and runs on the free plan (large resource class). It demonstrates how to use the Ollama Docker image in a CI pipeline, assuming you want to pull a model and run a basic script against the Ollama service.

version: 2

jobs:
  ollama-example:
    docker:
      - image: cimg/python:3.9
      - image: ollama/ollama:latest
        name: ollama
    resource_class: large
    steps:
      - checkout
      - run:
          name: Wait for Ollama to start
          command: |
            until curl -s http://ollama:11434/; do
              echo "Waiting for Ollama to start..."
              sleep 5
            done
      - run:
          name: Pull Gemma3 Model Using Web API
          command: |
            curl -X POST http://ollama:11434/api/pull \
              -H "Content-Type: application/json" \
              -d '{"model": "gemma3:4b"}'
      - run:
          name: Run a Python script using Ollama
          command: |
            pip install requests
            python script.py

workflows:
  version: 2
  ollama-workflow:
    jobs:
      - ollama-example

And the Python script:

import requests
from pprint import pprint

# /api/generate is Ollama's completion endpoint; stream=False makes it
# return a single JSON object instead of a stream of JSON lines
response = requests.post(
    'http://ollama:11434/api/generate',
    json={'model': 'gemma3:4b', 'prompt': 'Hello, Ollama!', 'stream': False}
)
pprint(response.json())

This configuration is simple and can be used as a starting point to work on integrating Ollama into a CI pipeline.

Semantic Unittests

Unit tests traditionally focus on verifying exact outputs, but how do we test output that may vary slightly from run to run, such as an LLM’s answer to the same question?

Luckily, using a SemanticTestCase we can test semantic correctness rather than rigid string matches in Python. This is useful for applications like text validation, classification, or summarization, where there’s more than one “correct” answer.

Traditional vs. Semantic Testing

  • Traditional Unit Test

A standard test might look like this:

import unittest
from text_validator import validate_text

class TestTextValidator(unittest.TestCase):
    def test_profane_text(self):
        self.assertFalse(validate_text("This is some bad language!"))

    def test_clean_text(self):
        self.assertTrue(validate_text("Hello, how are you?"))

Here, validate_text() returns True or False, but it assumes there’s a strict set of phrases that are “bad” or “good.” Edge cases like paraphrased profanity might be missed.

  • Semantic Unit Test

Instead of rigid assertions, we can use SemanticTestCase to evaluate the meaning of the response:

self.assertSemanticallyEqual("Blue is the sky.", "The sky is blue.")

A test case:

class TestTextValidator(SemanticTestCase):
    """
    We're testing the SemanticTestCase here
    """

    def test_semantic(self):
        self.assertSemanticallyCorrect(longer_text, "It is a public holiday in Ireland")
        self.assertSemanticallyIncorrect(longer_text, "It is a public holiday in Italy")
        self.assertSemanticallyEqual("Blue is the sky.", "The sky is blue.")

Here, assertSemanticallyCorrect() and its siblings use an LLM to classify the input and return a judgment. Instead of exact matches, we test whether the response aligns with our expectation.
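SemanticTestCase is not a standard-library class, so here is a minimal sketch of how such a base class could be built on top of unittest. The judge attribute, the prompt wording, and the internal helper are assumptions for illustration, not a published API; in practice, judge would wrap an LLM call (for example a POST to Ollama’s /api/generate).

```python
import unittest


class SemanticTestCase(unittest.TestCase):
    """Minimal sketch of LLM-backed semantic assertions (hypothetical API)."""

    # Callable taking a yes/no question and returning the model's raw answer.
    # Replace this placeholder with a real LLM call, e.g. via Ollama.
    judge = staticmethod(lambda prompt: "no")

    def _yes(self, prompt):
        """True if the judge's answer starts with 'yes' (case-insensitive)."""
        return self.judge(prompt).strip().lower().startswith("yes")

    def assertSemanticallyCorrect(self, text, claim):
        q = f"Answer yes or no: does this text support the claim '{claim}'?\n{text}"
        self.assertTrue(self._yes(q), f"Claim not supported: {claim}")

    def assertSemanticallyIncorrect(self, text, claim):
        q = f"Answer yes or no: does this text support the claim '{claim}'?\n{text}"
        self.assertFalse(self._yes(q), f"Claim unexpectedly supported: {claim}")

    def assertSemanticallyEqual(self, a, b):
        q = f"Answer yes or no: do these two sentences mean the same?\n1: {a}\n2: {b}"
        self.assertTrue(self._yes(q), f"Not semantically equal: {a!r} vs {b!r}")
```

To wire it to Ollama, set judge to a function that posts the question to the local server and returns the model’s response text; the class itself stays unchanged.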

Why This Matters

• AI systems often output slightly different versions of the same sentence when asked repeatedly. This makes traditional unittest asserts very hard to use, but SemanticTestCase allows us to compare such outputs as well.

• Handles paraphrased inputs: Profanity, toxicity, or policy violations don’t always follow exact patterns.

• More flexible testing: Works for tasks like summarization or classification, where exact matches aren’t realistic.

Some words on speed and privacy

Execution speed: Running an LLM for each test could be slower than traditional unit tests, but it is surprisingly fast on my Mac M1 with a local Ollama and a laptop-sized LLM such as Gemma.

The speed depends on the size of the prompt (or context); comparing just a few sentences is fast. Furthermore, the LLM stays loaded between assertions, which also contributes to the speed.

Data protection: If handling sensitive data is a concern, use a local LLM, e.g. via Ollama. It is still quite fast.

Working with Copilot

Just something I wanted to share:

Me: when working with Copilot, many things get easier, but the core problems remain the same, and fixing hairy problems such as concurrency issues, test edge cases, and the like is still manual work. It helps mainly with boilerplate code, which isn’t difficult to write anyway, but it saves time

ChatGPT said:

Yeah, exactly. Copilot and similar AI tools are great for reducing grunt work—boilerplate, repetitive patterns, and even some tedious debugging—but they don’t fundamentally change the nature of software development. The hard parts, like designing good abstractions, handling tricky concurrency issues, and ensuring correctness in edge cases, still require human insight.

It’s like having an assistant that speeds up the easy stuff, but when you hit a truly gnarly problem, you’re still on your own. Have you found any areas where it actually helps with the deeper problems, or is it mostly just a time-saver for you?

“Conclusion”

It is good that the AI agrees with me here.

How to Analyze a Dataset for LLM Fine-Tuning

Say you want to teach an LLM some behavior, so your plan is to fine-tune a model that is already close and good enough. You have found a dataset or two, and now you want to see how training the LLM on this dataset would influence its behavior and knowledge.

Define the Objective

What behavior or knowledge do you want to instill in the LLM? Is it domain-specific knowledge, a conversational style, task-specific capabilities, or adherence to specific ethical guidelines?

Dataset Exploration

Check if the dataset’s content aligns with your domain of interest. Where does the dataset come from? Ensure it is reliable and unbiased for your use case.

Evaluate whether the dataset is large enough for fine-tuning, yet not so large that training becomes computationally prohibitive; a dataset that is too small also risks overfitting. Check the dataset format (e.g., JSON, CSV, text) and its fields (e.g., prompt-response pairs, paragraphs, structured annotations).
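As a quick first pass over size and fields, a few lines of Python can summarize a JSONL dataset. The file layout and the prompt/response field names here are assumptions for illustration:

```python
import json
from collections import Counter


def summarize_dataset(path):
    """Print basic size and field statistics for a JSONL dataset."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    # How often each field appears across all records
    field_counts = Counter(key for record in records for key in record)
    # Rough per-sample length, assuming prompt/response fields
    lengths = [len(r.get("prompt", "")) + len(r.get("response", "")) for r in records]

    print(f"samples: {len(records)}")
    print(f"fields:  {dict(field_counts)}")
    if lengths:
        print(f"avg chars/sample: {sum(lengths) / len(lengths):.0f}")
    return records
```

Running this before anything else tells you whether the dataset even has the structure you expect, before you invest compute in training.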

Content

Quality: Ensure the text is grammatically correct and coherent, and that any code samples work. Check for logical structure and factual accuracy.

Diversity: Analyze the range of topics, styles, and formats in the dataset. Ensure the dataset covers edge cases and diverse scenarios relevant to your objectives.

Look for harmful, biased, or inappropriate content. Assess the dataset for compliance with ethical and legal standards.

Behavior

Use a small subset of the dataset to run experiments and assess how the model’s behavior shifts. Compare the outputs before and after fine-tuning on metrics like relevance, correctness, and alignment with desired behaviors.

Compare the dataset’s content with the base model’s knowledge and capabilities. Focus on gaps or areas where the dataset adds value.

TL;DR: Train with a small subset and observe how it changes behavior.
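The before/after comparison described above can be sketched as a small harness. The model wrappers and the scoring function are placeholders you would supply yourself, e.g. an LLM call and a relevance or correctness metric:

```python
def compare_models(prompts, generate_base, generate_tuned, score):
    """Run the same prompts through both models and report the score shift.

    generate_base / generate_tuned: callables mapping prompt -> output text.
    score: callable (prompt, output) -> float, higher meaning better.
    """
    results = []
    for prompt in prompts:
        before = score(prompt, generate_base(prompt))
        after = score(prompt, generate_tuned(prompt))
        results.append((prompt, before, after))

    # Average improvement across the probe set
    avg_shift = sum(after - before for _, before, after in results) / len(results)
    return results, avg_shift
```

Even with a crude scoring function, a consistently positive shift on a representative probe set is a useful early signal that the dataset moves the model in the intended direction.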

Data Cleaning

Normalize text (e.g., casing, punctuation) and remove irrelevant characters. Tokenize or prepare the dataset in a format compatible with the model.

Remove low-quality, irrelevant, or harmful samples. In fact, many of the datasets used to train large LLMs are not very clean. Address bias and ethical issues by balancing or augmenting content as needed. Add labels or annotations if the dataset lacks sufficient structure for fine-tuning.
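A minimal sketch of the normalization and filtering steps described above, assuming plain-text samples; the length threshold and banned-word list are illustrative choices, not fixed recommendations:

```python
import re


def clean_sample(text):
    """Normalize one text sample: strip control characters, collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # drop control characters
    text = re.sub(r"\s+", " ", text).strip()               # collapse runs of whitespace
    return text


def filter_dataset(samples, min_chars=20, banned_words=("lorem",)):
    """Keep samples that are long enough and contain no banned words."""
    kept = []
    for sample in samples:
        cleaned = clean_sample(sample)
        if len(cleaned) < min_chars:
            continue  # too short to be a useful training sample
        if any(word in cleaned.lower() for word in banned_words):
            continue  # filler or unwanted content
        kept.append(cleaned)
    return kept
```

Real pipelines add more checks (deduplication, language detection, toxicity filters), but even this much removes a surprising amount of noise from raw datasets.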

Resource Estimate

Determine the compute power required for fine-tuning with this dataset. If the dataset is too large, consider selecting a high-quality, representative subset.

Alternative Approaches: Evaluate whether fine-tuning is necessary. Explore alternatives like prompt engineering or few-shot learning.

Ethical and Practical Validation

Use tools or frameworks to check for potential biases in the dataset. Ensure the dataset complies with copyright, privacy, and data protection regulations.

Add Notes

Document findings about dataset quality, limitations, and potential biases. Record the preprocessing steps and justification for changes made to the dataset.

By following this structured analysis, you can determine how fine-tuning with a particular dataset will influence an LLM and decide on the most effective approach for your objectives.

Note that knowledge from training and fine-tuning can be blurry, so make sure you augment it with RAG (retrieval-augmented generation) to get sharper responses. I’ll show how to do that in another blog post.

What Does ‘nit’ Mean in Code Reviews?

What does nit mean in a code review?

Occasionally we come across comments like this in code reviews:

auto result = std::make_pair<uint64_t, std::string>(64, "Hallihallo");;

nit: double semicolon

In a review, a “nit” refers to a small inaccuracy or mistake that does not significantly affect the functionality of the code but should still be corrected, for example a typo in a comment, an extra semicolon, or a superfluous blank line. The reviewer points out the issue but probably does not want you to delay the pull request over this minor detail.

And this is how we should handle it: if you are still working on the PR, you can fix it in one of the next commits. But do not delay the integration of the feature or bugfix because of such a small thing. We all know that waiting for CI can take time, and if you block the CI over this trifle, some people will probably not be very happy.

To learn more about the slang used in code reviews, I have compiled a list in this blog post (in English), where you can find explanations of abbreviations such as +1, WIP, lgtm, and others.

You can find the English version of this post here.

Code Review, what to look for

Why Code Reviews

Code review is an essential part of the software development process that helps ensure code quality and catch potential issues before they become a problem. By having other developers review the code, it helps identify areas for improvement, promote best practices, and ensure that code is maintainable, scalable, and secure.

Code reviews can be conducted using a variety of tools, such as code review platforms, linting tools, automated code review tools, code comparison tools, and code coverage tools. The goal of code review is to improve the quality of code and make the development process more efficient and effective. Regular code reviews can also promote a culture of collaboration and teamwork within the development team, leading to better code and a more successful project.

What to look for

A code review is an important part of the software development process and developers should look for the following aspects when conducting a code review:

  1. Code Quality: Check if the code is clean, readable, and adheres to established coding standards. Ensure that the code is optimized and free of bugs.
  2. Functionality: Ensure that the code meets the requirements and that it functions as expected.
  3. Security: Check for potential security vulnerabilities and ensure that the code follows best practices for security.
  4. Test Coverage: Ensure that the code is covered by adequate test cases and that the tests are thorough.
  5. Performance: Review the code for performance bottlenecks and ensure that it is optimized for speed and efficiency.
  6. Scalability: Ensure that the code can scale to meet the needs of the users as the system grows.
  7. Maintainability: Check that the code is easy to maintain and can be easily updated and extended in the future.
  8. Documentation: Check if the code is properly documented, including comments and inline documentation, to help other developers understand it.

A successful code review is a collaborative effort that enhances the quality of the code and aligns it with the requirements of both the users and the development team. It should be a constructive process that helps to identify areas for improvement and ensures the code is optimized for maintenance, scalability, and security.

Tools

There are several tools available to help enhance code quality during code reviews:

  1. Linting Tools: These tools scan code for potential issues such as syntax errors, style violations, and semantic problems. Examples include ESLint and JSLint for JavaScript and Pylint for Python.
  2. Code Review Platforms: These platforms provide a centralized place for code review, allowing teams to review, discuss, and track changes to code. Examples include GitHub, GitLab, and Bitbucket.
  3. Automated Code Review Tools: These tools can automatically identify potential issues in code, such as security vulnerabilities, performance bottlenecks, and missing test coverage. Examples include SonarQube, CodeClimate, and Crucible.
  4. Code Comparison Tools: These tools allow developers to compare and merge changes to code. They can highlight differences between code versions and help to identify potential conflicts. Examples include Meld and Beyond Compare.
  5. Code Coverage Tools: These tools measure how much of the code is covered by tests and can identify areas where additional tests are needed. Examples include Cobertura and Istanbul.

Using these tools in combination with manual code review can help ensure that code quality is maintained and improved throughout the development process.

5 reasons why it makes sense to work with branches and tests in a single developer project

Working with branches and automated tests can bring a host of benefits to a single developer project, even if the project isn’t being worked on by multiple people. Here are some of the reasons why:

  1. Enhanced efficiency: When working with branches, a solo developer can tackle multiple features or bug fixes simultaneously, without having to worry about disrupting the main codebase. Additionally, by utilizing automated tests, the developer can validate that changes made in a branch don’t break existing functionality in a fast and efficient manner.
  2. Superior code quality: Automated tests can help catch bugs and issues early in the development process, long before they become problematic and harder to resolve. This leads to a more stable codebase and better code quality overall.
  3. Optimal version control: Branches allow a single developer to switch between different versions of code easily, as well as revert back to a previous version if necessary. This also makes it easier for the developer to manage code reviews and collaborate with other developers if the need arises in the future.
  4. Increased confidence: Automated tests provide a safety net for changes made in the code, which can give the developer more confidence when making modifications. If issues arise, the tests will quickly detect them, allowing the developer to fix them promptly.
  5. Support for experimentation: Branches make it possible for a developer to experiment with new ideas or approaches without affecting the main codebase. This can be especially valuable when exploring new technologies or finding new solutions to problems.

In conclusion, working with branches and automated tests can lead to improved efficiency, better code quality, optimal version control, increased confidence, and support for experimentation even in single developer projects. Whether you’re a beginner or an experienced developer, utilizing these tools can help streamline your development process and lead to better results.

10 shorthands commonly used in Code Reviews

There are several shorthands and abbreviations commonly used in code reviews:

  1. nit – nitpicking. Refers to minor and cosmetic changes that the reviewer suggests to the code (typos, formatting etc)
  2. N/A – Not Applicable, used to indicate that a particular comment or suggestion does not apply to the code being reviewed.
  3. +1 – Indicates agreement or support for a particular change or suggestion.
  4. -1 – Indicates opposition or disapproval of a particular change or suggestion.
  5. ACK – Acknowledge, used to indicate that the reviewer has seen the comment or suggestion and will address it.
  6. WIP – Work In Progress, used to indicate that the code being reviewed is still a work in progress and may not be complete or ready for review.
  7. RTFC – Read The F***ing Code, used to suggest that the reviewer should go back and read the relevant code before making a comment or suggestion.
  8. FIXME – A placeholder used to indicate that a particular piece of code needs to be fixed in the future.
  9. TODO – A placeholder used to indicate that a particular task needs to be completed in the future.
  10. LGTM – Looks Good To Me, used to indicate that the reviewer approves the change as it is.

These shorthands and abbreviations are commonly used in code reviews to speed up the review process and make it more efficient. However, it’s important for all participants in the review to understand and agree on their meanings to avoid confusion and ensure effective communication.

If you know or use other shorthands or abbreviations, please let me know.

The 6 drawbacks of linter tools

While linter tools are widely used and can be incredibly helpful in detecting issues and improving code quality, they do have some disadvantages as well. Some of the common disadvantages of using linter tools include:

  1. False positives: Linters may produce false positive warnings or errors, which can be frustrating and lead to wasted time trying to resolve non-issues.
  2. Configuration complexity: Setting up a linter can be challenging, especially for large projects with multiple contributors and a complex codebase. It can be difficult to configure the linter to meet the specific needs of the project and the development team.
  3. Learning curve: Using a linter can require a learning curve for developers, as they need to understand how to use and configure the tool effectively. This can be especially challenging for developers who are new to the tool or the programming language.
  4. Inconsistent enforcement: Linters may not always be enforced consistently, leading to situations where some developers may not adhere to the linter’s recommendations. This can lead to inconsistent code quality and undermine the value of the linter.
  5. Limited scope: Linters are typically limited in scope and can only detect issues related to code syntax, style, and formatting. They may not be able to detect more complex issues such as performance bottlenecks or security vulnerabilities.
  6. Unfamiliar codebase: If a linter is being applied to an unfamiliar codebase, it may produce a large number of warnings and errors that can be overwhelming for the developer to resolve. This can lead to frustration and a sense that the tool is not effective.

In conclusion, while linter tools can be incredibly helpful in detecting issues and improving code quality, they also have some disadvantages that need to be taken into consideration. It is important to weigh the benefits and drawbacks of using a linter and determine if it is the right tool for your specific project and development team.

Code Review – A Critical Component of Software Development

Code reviews are an essential aspect of software development that can significantly enhance the quality and reliability of your code. They provide an opportunity for developers to learn from one another, share their expertise, and collaborate on creating better code.

One of the key benefits of code reviews is improved code quality. Through code reviews, developers can identify and resolve potential bugs, performance issues, and security vulnerabilities before the code is released to production. This proactive approach can save time and resources in the long run, as it is more cost-effective to catch and fix problems early in the development process.

In addition to improving code quality, code reviews also facilitate knowledge sharing and best practices. Reviewing the code of others can help developers understand the codebase and learn new techniques for writing high-quality code. This sharing of knowledge and expertise can lead to increased efficiency and better collaboration among team members, as everyone works towards a common goal.

Code reviews also play a crucial role in enhancing team communication. By working together to review and improve code, developers can build a sense of teamwork and collaboration. This can result in better communication and higher-quality code, as everyone works together to ensure the code meets the necessary standards and specifications.

Consistency in code is another important aspect that can be maintained through code reviews. By ensuring that code follows established coding standards, code reviews make it easier to maintain and enhance the code over time. This can greatly reduce the time and effort required for code maintenance and updates, as everyone on the team follows the same standards and best practices.

Finally, code reviews can also improve documentation, making it easier for others to understand and work with the code in the future. By reviewing the documentation and ensuring its completeness and accuracy, code reviews can help ensure that the code is well-documented and easy to understand.

In conclusion, code reviews are a valuable tool that can provide numerous benefits for both individual developers and development teams. Incorporating code reviews into your development process can help you write better code, share knowledge, communicate effectively, maintain consistency, and enhance documentation. Don’t overlook the importance of code reviews – make them a part of your workflow for the best results.