Unit tests traditionally focus on verifying exact outputs, but how do we test output that can vary slightly from run to run, such as an LLM's answers to the same question?
Luckily, with SemanticTestCase we can test for semantic correctness in Python rather than relying on rigid string matches. This is useful for applications like text validation, classification, or summarization, where there is more than one “correct” answer.
Traditional vs. Semantic Testing
- Traditional Unit Test
A standard test might look like this:
import unittest
from text_validator import validate_text

class TestTextValidator(unittest.TestCase):
    def test_profane_text(self):
        self.assertFalse(validate_text("This is some bad language!"))

    def test_clean_text(self):
        self.assertTrue(validate_text("Hello, how are you?"))
Here, validate_text() returns True or False, but it assumes there’s a strict set of phrases that are “bad” or “good.” Edge cases like paraphrased profanity might be missed.
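To see why, here is a sketch of what such a strict validate_text() might look like. The keyword list and this implementation are assumptions for illustration, not the actual text_validator module:

```python
# Hypothetical sketch of a keyword-based validate_text(); the real
# text_validator module is not shown here.
BANNED_PHRASES = {"bad language"}

def validate_text(text: str) -> bool:
    """Return True when the text contains none of the banned phrases."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)

# A paraphrase of the same sentiment slips straight through the check:
validate_text("That remark was rather uncouth")  # True, although it shouldn't be
```

Any paraphrase that avoids the listed phrases passes, which is exactly the gap semantic testing aims to close.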
- Semantic Unit Test
Instead of rigid assertions, we can use SemanticTestCase to evaluate the meaning of the response:
self.assertSemanticallyEqual("Blue is the sky.", "The sky is blue.")
A test case:
class TestTextValidator(SemanticTestCase):
    """
    We're testing the SemanticTestCase here.
    """
    def test_semantic(self):
        self.assertSemanticallyCorrect(longer_text, "It is a public holiday in Ireland")
        self.assertSemanticallyIncorrect(longer_text, "It is a public holiday in Italy")
        self.assertSemanticallyEqual("Blue is the sky.", "The sky is blue.")
Here, assertSemanticallyCorrect() and its siblings use an LLM to classify the input and return a judgment. Instead of exact matches, we test whether the response aligns with our expectation.
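To make the mechanism concrete, here is a minimal sketch of how such an assertion could be built on top of unittest. The real SemanticTestCase presumably prompts an LLM inside judge(); the judge below is stubbed with trivial word-set normalization so the example runs without a model, and all method names except the assertion shown above are assumptions:

```python
import unittest

class SemanticTestCase(unittest.TestCase):
    def judge(self, text: str, claim: str) -> bool:
        """In the real library this would ask an LLM whether `claim`
        matches `text`; stubbed here by comparing normalized word sets."""
        normalize = lambda s: sorted(s.lower().strip(".!?").split())
        return normalize(text) == normalize(claim)

    def assertSemanticallyEqual(self, a: str, b: str) -> None:
        """Fail the test unless the judge considers a and b equivalent."""
        if not self.judge(a, b):
            self.fail(f"{a!r} and {b!r} are not semantically equal")
```

Because the judgment is delegated to a single method, swapping the stub for a real LLM call changes nothing in the test code itself.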
Why This Matters
• AI systems often produce slightly different versions of the same sentence when a prompt is repeated. This makes traditional unittest asserts impractical, but SemanticTestCase lets us compare such outputs as well.
• Handles paraphrased inputs: Profanity, toxicity, or policy violations don’t always follow exact patterns.
• More flexible testing: Works for tasks like summarization or classification, where exact matches aren’t realistic.
Some words on practical concerns:
• Execution speed: Running an LLM for each test is slower than a traditional unit test, but it is surprisingly fast on my Mac M1 with a local Ollama and a laptop-sized LLM such as Gemma.
The speed depends mainly on the size of the prompt (or context): comparing just a few sentences is quick. Furthermore, the LLM stays loaded between assertions, which also helps.
• Data protection: if handling sensitive data is a concern, run a local LLM, e.g. via Ollama. It is still quite fast.
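Assuming Ollama is installed, fetching a laptop-sized model for local judging is a one-liner; gemma is the model mentioned above, and the sanity-check prompt is just an example:

```shell
# Download a small local model once; later test runs reuse the local copy.
ollama pull gemma
# Quick sanity check from the command line before wiring it into tests:
ollama run gemma "Answer yes or no: does 'Blue is the sky.' mean the same as 'The sky is blue.'?"
```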
