Build Your Own Local AI Agent: A Step By Step Guide

Local AI agents run on your machine. No cloud. No external APIs. Just you, your hardware, and the model. This post walks through the essentials: choosing a model, wiring it up with an agent framework, and running it locally. If you want privacy, speed, or control, this is how you get it.

What Can Local Agents Do?

Local agents can handle a wide range of tasks: summarizing documents, answering questions, automating workflows, scraping websites, or even acting as coding assistants.

In this post, we’ll focus on a simple task: scraping news headlines from a website and summarizing them. It’s fast, useful, and shows the core pieces in action.

Tools We’ll Use

  • Ollama – run language models locally with one command. Gemma or Mistral work fine on a laptop.
  • LangChain – structure reasoning, tools, and memory
  • Python – glue everything together

Basic Structure of a Local Agent

  1. Model – the LLM doing the “thinking”
  2. Tools – code the agent can use (like a scraper or file reader)
  3. Prompt – instructions for what the agent should do
  4. Loop – let the agent think and act step-by-step

That’s it. The rest is just wiring.
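Before wiring in real tools, it helps to see that loop in its smallest possible form. Below is a framework-free sketch where `call_model` is a stub standing in for the local LLM; everything here is illustrative, not LangChain's actual internals:

```python
# Minimal ReAct-style agent loop, framework-free.
# call_model is a placeholder for your LLM; it is stubbed here
# so the control flow is visible and runnable offline.

def call_model(prompt: str) -> str:
    # Stub: a real implementation would query the local LLM.
    if "Observation:" in prompt:
        return "Final Answer: summary of the headlines"
    return "Action: scrape_headlines"

TOOLS = {"scrape_headlines": lambda: "Headline 1\nHeadline 2"}

def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = task
    for _ in range(max_steps):
        reply = call_model(prompt)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        if reply.startswith("Action:"):
            tool_name = reply.removeprefix("Action:").strip()
            observation = TOOLS[tool_name]()  # run the tool, feed result back
            prompt += f"\n{reply}\nObservation: {observation}"
    return "No answer within step budget"

print(run_agent("Summarize today's headlines"))
```

The model proposes an action, the loop executes the matching tool, and the observation goes back into the prompt until the model emits a final answer. That is the whole trick; frameworks just add plumbing around it.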

Getting Started

  1. Install Ollama
    https://ollama.com
    brew install ollama on macOS, or grab the installer for your OS.
  2. Pull a model: ollama run mistral
  3. Set up a LangChain agent
    Load the model via LangChain, define a tool, and pass it to the agent. You’ll see how in the example below.

The Code

pip install langchain beautifulsoup4 requests

ollama run mistral

Now create a Python script, e.g. run.py:

from langchain.llms import Ollama  # in newer LangChain versions: from langchain_community.llms import Ollama

llm = Ollama(model="mistral")

The scraper:

import requests
from bs4 import BeautifulSoup

def get_headlines(url="https://www.bbc.com"):
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(res.text, "html.parser")
    headlines = [h.get_text(strip=True) for h in soup.find_all("h3")]
    return "\n".join(headlines[:10])  # just take the top 10

Wrap it as a LangChain tool:

from langchain.agents import tool

@tool
def scrape_headlines() -> str:
    """Scrapes top headlines from BBC."""
    return get_headlines()

Build the agent:

from langchain.agents import initialize_agent, AgentType

tools = [scrape_headlines]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

Run the agent:

agent.run("Get the top news headlines and summarize them in a few bullet points.")

That’s it: you now have a local agent scraping, thinking, and summarizing, all on your machine.

Working with Copilot

Just something I wanted to share:

Me: when working with Copilot, many things get easier, but the core problems remain the same. Fixing hairy problems such as concurrency issues, or test edge cases, is still manual work. It helps mainly with boilerplate code, which isn’t difficult to write anyway, but it saves time

ChatGPT said:

Yeah, exactly. Copilot and similar AI tools are great for reducing grunt work—boilerplate, repetitive patterns, and even some tedious debugging—but they don’t fundamentally change the nature of software development. The hard parts, like designing good abstractions, handling tricky concurrency issues, and ensuring correctness in edge cases, still require human insight.

It’s like having an assistant that speeds up the easy stuff, but when you hit a truly gnarly problem, you’re still on your own. Have you found any areas where it actually helps with the deeper problems, or is it mostly just a time-saver for you?

“Conclusion”

It is good that the AI agrees with me here.

How to Install and Use Salesforce’s CodeGen LLM

CodeGen is an AI (LLM) from Salesforce that can generate source code, as well as describe what a piece of code does. It is released under the Apache license and performs well while being lightweight enough to run on a laptop for both inference and fine-tuning. Here is how to set it up and how to use it.

Installation with HuggingFace

This blog post provides instructions on how to use the CodeGen LLM via the Hugging Face Transformers library. It assumes you have a development environment set up and are familiar with Hugging Face.

You’ll need to install the `transformers` and `torch` libraries:

pip install transformers torch

If you intend to use a GPU, ensure you have the correct CUDA drivers and PyTorch/TensorFlow builds for GPU support.

Model Loading

CodeGen models are typically available on the Hugging Face Model Hub. You can load a model and its tokenizer using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM  # CodeGen is a causal (decoder-only) model

model_name = "Salesforce/codegen-350M-mono"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# For GPU usage (recommended):
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Replace "Salesforce/codegen-350M-mono" with the specific CodeGen model name you intend to use. Check the Hugging Face Model Hub for available models.

Code Generation

Here’s how to generate code using the loaded model:

prompt = "Write a Python function to calculate the factorial of a number."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)  # Move input to device

outputs = model.generate(
    input_ids,
    max_length=200,                       # adjust as needed
    do_sample=True,                       # required for temperature/top_k/top_p to take effect
    temperature=0.7,                      # higher = more creative
    top_k=40,                             # adjust for sampling
    top_p=0.95,                           # adjust for sampling
    pad_token_id=tokenizer.eos_token_id,  # important for some models
)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)


# Example with infilling (code completion):
prompt = "def my_function(x):\n    # TODO: Calculate the square of x\n    return"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=100, num_beams=5, pad_token_id=tokenizer.eos_token_id)
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)

Considerations

  • Model selection – different CodeGen models have different strengths; choose the one that best suits your needs.
  • Prompt engineering – clear, specific prompts are essential for good results.
  • Parameter tuning – experiment with the generation parameters to find the optimal settings for your use case.
  • Resource management – large language models are resource-intensive; use a GPU if available.
  • Output validation – review and test the generated code carefully; it may require debugging.
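For output validation, a cheap first gate before any human review is a syntax check: `ast.parse` rejects anything that does not even compile. A minimal sketch:

```python
import ast

def is_valid_python(code: str) -> bool:
    """Return True if the snippet at least parses as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x):\n    return x * x"))  # True
print(is_valid_python("def f(x) return x"))            # False
```

Parsing says nothing about correctness, of course; it only filters out the code that is not worth reading in the first place.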

Python-Alpaca Dataset

I came across this dataset recently, a collection of 22k Python code examples, tested and verified to work. What really caught my attention is how this was put together—they used a custom script to extract Python code from Alpaca-formatted datasets, tested each snippet locally, and only kept the functional ones. Non-functional examples were separated into their own file.

The dataset pulls from a mix of open-source projects like Wizard-LM’s Evol datasets, CodeUp’s 19k, and a bunch of others, plus some hand-prompted GPT-4 examples. Everything’s been deduplicated, so you’re not stuck with repeats.

It’s especially cool if you’re working on training AI models for coding tasks because it sidesteps one of the biggest issues with open datasets: non-functional or broken code. They even hinted at adapting the script for other languages like C++ or SQL.
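Their filtering idea is easy to reproduce in miniature: run each snippet in a fresh interpreter and keep only those that exit cleanly. A rough sketch (the snippet list and timeout are illustrative):

```python
import subprocess
import sys

def snippet_works(code: str, timeout: int = 10) -> bool:
    """Execute a snippet in a fresh interpreter; True if it exits cleanly."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0

snippets = ["print(sum(range(5)))", "print(undefined_name)"]
functional = [s for s in snippets if snippet_works(s)]
print(len(functional))  # 1: only the first snippet survives
```

Running snippets in a subprocess rather than `exec` keeps a broken or malicious snippet from taking down (or tampering with) the filtering script itself.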

If you use the dataset or their script, they ask for attribution: Filtered Using Vezora’s CodeTester. Oh, and they’re working on releasing an even bigger dataset with 220,000+ examples, definitely one to keep an eye on!

On Hugging Face: Tested-22k-Python-Alpaca

Read also how to analyze a dataset.

Role Assignment in Multi-Agent Systems

When working with multi-agent systems, one of the most powerful concepts you can leverage is role assignment. In a multi-agent setup, you can define distinct roles for each agent to create different behaviors, allowing them to collaborate, interact, and solve problems in a simulated environment.

Imagine you’re managing a software development project. You have a project manager, a developer, and a tester, each with a unique perspective and responsibilities. By assigning these roles to different agents in a conversation, you can simulate their interactions to observe how they work together toward a common goal, like completing a feature or identifying a bug.

Why Use Role Assignment?

Role assignment is essential in multi-agent systems because it allows you to create more realistic, diverse behaviors in the simulation. Each agent has specific tasks, which means they’ll react differently based on their role. For example:

  • The project manager might focus on project timelines, priorities, and coordinating tasks.
  • The developer could be focused on writing code, debugging, and creating new features.
  • The tester would be identifying bugs, running test cases, and ensuring the quality of the product.

By assigning different roles, you give each agent context and a purpose, which leads to more meaningful interactions.

How to Assign Roles in the OpenAI Chat API

As the OpenAI API Documentation shows, assigning roles is simple. You use system messages to define the specific behavior of each agent; these messages guide each agent’s responses and ensure they act within their role.

Here’s how you can structure it:

import openai  # this example uses the pre-1.0 openai client interface

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are the project manager for a software development team. Your role is to coordinate tasks, set deadlines, and ensure the project stays on track. Focus on the big picture and team collaboration."
        },
        {
            "role": "system",
            "content": "You are a developer working on new features and fixing bugs. Focus on writing clean code, debugging, and offering technical solutions to problems."
        },
        {
            "role": "system",
            "content": "You are a tester responsible for finding bugs and ensuring that the software is stable. Run tests, identify issues, and communicate them clearly for the team to address."
        },
        {
            "role": "user",
            "content": "Let's start the project. The first task is to build the user authentication feature."
        }
    ]
)
Note: Don’t be confused by the API role and the role you define

Don’t be confused by the “role” in the API message (e.g., system, user, assistant) and the “role” you define for each agent (e.g., project manager, developer, tester). In the API context, “role” refers to the message sender (system, user, assistant), while in the agent context, “role” refers to the specific persona or responsibility the agent has within the conversation.

In this example:

  • The project manager agent is given a message to manage the project, prioritizing tasks and deadlines.
  • The developer agent is tasked with coding and troubleshooting technical challenges.
  • The tester agent focuses on testing and identifying bugs to ensure a stable product.

Each agent’s system message helps them understand their role and contributes accordingly to the conversation, creating a collaborative environment that mirrors real-world project dynamics.

Why It Works

The power of multi-agent systems comes from the interaction between agents with different roles. When agents understand their role and objectives, they can communicate more effectively, mimic real-world collaborations, and help identify solutions more efficiently. You can also test various scenarios to see how different roles react to challenges or changes in the system, all without human intervention.
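A simple way to drive such agents is a round-robin loop in which each agent keeps its own system message while the transcript is shared. In the sketch below, `call_llm` is a stub so the loop runs offline; in practice it would wrap the Chat API call shown earlier, and the role descriptions are shortened placeholders:

```python
# Round-robin multi-agent loop with per-agent system messages.
# call_llm is a stand-in for a real Chat API request.

AGENT_ROLES = {
    "project_manager": "You coordinate tasks and deadlines.",
    "developer": "You write code and fix bugs.",
    "tester": "You find bugs and run test cases.",
}

def call_llm(name: str, system: str, transcript: list[str]) -> str:
    # Stub reply so the loop is runnable offline; a real version
    # would send [system message] + transcript to the API.
    return f"[{name}] responding to: {transcript[-1]}"

def simulate(task: str, rounds: int = 1) -> list[str]:
    transcript = [task]
    for _ in range(rounds):
        for name, system in AGENT_ROLES.items():
            reply = call_llm(name, system, transcript)
            transcript.append(reply)  # every agent sees what came before
    return transcript

for line in simulate("Build the user authentication feature."):
    print(line)
```

Because each agent only differs by its system message, adding a new persona to the simulation is a one-line change to the dictionary.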

Wrapping Up

Role assignment in multi-agent systems is a powerful way to simulate complex scenarios with diverse behaviors. By using system messages to define roles, you can create agents that act like real-life colleagues, each contributing in their own way to achieve the common goal. Whether you’re simulating a team of developers or testing a new feature, this approach brings both flexibility and realism to the table.

Next time you’re working with multi-agent systems, try assigning different roles to your agents. You might be surprised at how dynamic and engaging the conversation becomes!

For more information on how to implement these concepts, be sure to check out the OpenAI API Documentation, where you can explore further examples, code snippets, and more to help you make the most of the Chat API in your projects.

How to Analyze a Dataset for LLM Fine Tuning

Say you have an LLM and want to teach it some behavior, so your plan is to fine-tune a model that is already close to good enough. You have found a dataset or two, and now want to see how training the LLM on this dataset would influence its behavior and knowledge.

Define the Objective

What behavior or knowledge do you want to instill in the LLM? Is it domain-specific knowledge, conversational style, task-specific capabilities, or adherence to specific ethical guidelines?

Dataset Exploration

Check if the dataset’s content aligns with your domain of interest. Where does the dataset come from? Ensure it is reliable and unbiased for your use case.

Evaluate the dataset size: it should be large enough for fine-tuning, but not so small that the model overfits, nor so large that training becomes computationally prohibitive. Check the dataset format (e.g., JSON, CSV, text) and its fields (e.g., prompt-response pairs, paragraphs, structured annotations).
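A few lines of Python cover the size and length checks. This sketch assumes a JSONL file of prompt-response pairs; the field names are illustrative and depend on your dataset:

```python
import json

def dataset_stats(path: str) -> dict:
    """Count pairs and measure prompt/response lengths in a JSONL file."""
    prompt_lens, response_lens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair = json.loads(line)
            prompt_lens.append(len(pair["prompt"]))
            response_lens.append(len(pair["response"]))
    n = len(prompt_lens)
    return {
        "pairs": n,
        "avg_prompt_chars": sum(prompt_lens) / n,
        "avg_response_chars": sum(response_lens) / n,
    }
```

Character counts are a crude proxy for token counts, but they are enough to spot empty fields, truncated samples, or a dataset that is an order of magnitude smaller than advertised.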

Content

Quality: Ensure the text is grammatically correct and coherent, and that code samples actually run. Check for logical structure and factual accuracy.

Diversity: Analyze the range of topics, styles, and formats in the dataset. Ensure the dataset covers edge cases and diverse scenarios relevant to your objectives.

Look for harmful, biased, or inappropriate content. Assess the dataset for compliance with ethical and legal standards.

Behavior

Use a small subset of the dataset to run experiments and assess how the model’s behavior shifts. Compare the outputs before and after fine-tuning on metrics like relevance, correctness, and alignment with desired behaviors.

Compare the dataset’s content with the base model’s knowledge and capabilities. Focus on gaps or areas where the dataset adds value.

TL;DR: Train with a small subset and observe how it changes behavior.

Data Cleaning

Normalize text (e.g., casing, punctuation) and remove irrelevant characters. Tokenize or prepare the dataset in a format compatible with the model.

Remove low-quality, irrelevant, or harmful samples. In fact, many of the datasets used to train large LLMs are not very clean. Address bias and ethical issues by balancing or augmenting content as needed. Add labels or annotations if the dataset lacks sufficient structure for fine-tuning.
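The normalization and deduplication steps can be sketched like this; the exact rules are placeholders you would adapt to your data:

```python
import re

def normalize(text: str) -> str:
    """Strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(samples: list[str]) -> list[str]:
    """Drop exact duplicates after normalization, keeping order."""
    seen, kept = set(), []
    for s in samples:
        key = normalize(s).lower()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

print(dedupe(["Hello  world", "hello world", "Goodbye"]))  # 2 samples left
```

Exact-match deduplication is only the first pass; near-duplicate detection (e.g., via embeddings or MinHash) catches the rephrasings that exact matching misses.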

Resource Estimate

Determine the compute power required for fine-tuning with this dataset. If the dataset is too large, consider selecting a high-quality, representative subset.

Alternative Approaches: Evaluate whether fine-tuning is necessary. Explore alternatives like prompt engineering or few-shot learning.

Ethical and Practical Validation

Use tools or frameworks to check for potential biases in the dataset. Ensure the dataset complies with copyright, privacy, and data protection regulations.

Add Notes

Document findings about dataset quality, limitations, and potential biases. Record the preprocessing steps and justification for changes made to the dataset.

By following this structured analysis, you can determine how fine-tuning with a particular dataset will influence an LLM and decide on the most effective approach for your objectives.

Note that knowledge from training and fine-tuning can be blurry, so consider augmenting it with RAG to get sharper responses. I’ll show how to do that in another blog post.

How to build an AI Agent with a memory

How to Build an Agent with a Local LLM and RAG, Complete with Local Memory

If you want to build an agent with a local LLM that can remember things and retrieve them on demand, you’ll need a few components: the LLM itself, a Retrieval-Augmented Generation (RAG) system, and a memory mechanism. Here’s how you can piece it all together, with examples using LangChain and Python. (and here is why a small LLM is a good idea)

Step 1: Set Up Your Local LLM

First, you need a local LLM. This could be a smaller pre-trained model like LLaMA or a GPT-based open-source option running on your machine. The key is that it’s not connected to the cloud: it’s local, private, and under your control. Make sure the LLM is accessible via an API or similar interface so that you can integrate it into your system. A good choice is Ollama with an LLM such as Google’s Gemma. I also wrote easy-to-follow instructions on how to set up a T5 LLM from Salesforce locally, but it is also perfectly fine to use a cloud-based LLM.

In case the agent you want to build is about source code, here is an example of how to use CodeT5 with LangChain.

Step 2: Add Retrieval-Augmented Generation (RAG)

TL;DR: Gist on Github

Next comes the RAG. A RAG system works by combining your LLM with an external knowledge base. The idea is simple: when the LLM encounters a query, the RAG fetches relevant information from your knowledge base (documents, notes, or even structured data) and feeds it into the LLM as context.

To set up RAG, you’ll need:

  1. A Vector Database: This is where your knowledge will live. Tools like Pinecone, Weaviate, or even local implementations like FAISS can store your data as embeddings.
  2. A Way to Query the Vector Database: Use similarity search to find the most relevant pieces of information for any given query.
  3. Integration with the LLM: Once the RAG fetches data, format it and pass it as input to the LLM.

I have good experience with LangChain and Chroma:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM

model_name = "gemma:latest"

documents = TextLoader("my_data.txt").load()
texts = CharacterTextSplitter(chunk_size=300, chunk_overlap=100).split_documents(documents)
vectorstore = Chroma.from_documents(texts, OllamaEmbeddings(model=model_name)).as_retriever()

llm = OllamaLLM(model=model_name)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore)

qa_chain.invoke("What is the main topic of my document?")

Step 3: Introduce Local Memory

Now for the fun part: giving your agent memory. Memory is what allows the agent to recall past interactions or store information for future use. There are a few ways to do this:

  • Short-Term Memory: Store conversation context temporarily. This can simply be a rolling buffer of recent interactions that gets passed back into the LLM each time.
  • Long-Term Memory: Save important facts or interactions for retrieval later. For this, you can extend your RAG system by saving interactions as embeddings in your vector database.

For example:

  1. After each interaction, decide if it’s worth remembering.
  2. If yes, convert it into an embedding and store it in your vector database.
  3. When needed, retrieve it alongside other RAG data to give the agent a sense of history.
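The three steps above can be sketched without any vector database at all. This toy store uses a bag-of-words “embedding” and cosine similarity just to show the mechanics; a real setup would use proper embeddings and Chroma or FAISS:

```python
import math
import re
from collections import Counter

def _tokens(text: str) -> Counter:
    """Very rough bag-of-words 'embedding'."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyMemory:
    """Toy long-term memory: store texts, retrieve the most similar one."""

    def __init__(self):
        self.entries = []

    def remember(self, text: str) -> None:
        self.entries.append((text, _tokens(text)))

    def recall(self, query: str) -> str:
        q = _tokens(query)
        return max(self.entries, key=lambda e: _cosine(q, e[1]))[0]

memory = TinyMemory()
memory.remember("The user prefers short bullet-point summaries")
memory.remember("The project deadline is Friday")
print(memory.recall("when is the deadline?"))  # -> "The project deadline is Friday"
```

Swap `_tokens` for a real embedding model and the list for a vector store, and you have the long-term memory described above.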

Langchain Example

from langchain.memory import ConversationBufferMemory

# Initialize memory
memory = ConversationBufferMemory()

# Save some conversation turns
memory.save_context({"input": "Hello"}, {"output": "Hi there!"})
memory.save_context({"input": "How are you?"}, {"output": "I'm doing great, thanks!"})

# Retrieve stored memory
print(memory.load_memory_variables({}))

Step 4: Put It All Together

Now you can combine these elements:

  • The user sends a query.
  • The system retrieves relevant data via RAG.
  • The memory module checks for related interactions or facts.
  • The LLM generates a response based on the query, retrieved context, and memory.

This setup is powerful because it blends the LLM’s generative abilities with a custom memory tailored to your needs. It’s also entirely local, so your data stays private and secure.

Final Thoughts

Building an agent like this might sound complex, but it’s mostly about connecting the dots between well-known tools. Once you’ve got it running, you can tweak and fine-tune it to handle specific tasks or remember things better. Start small, iterate, and soon you’ll have an agent that feels less like software and more like a real assistant.

Large vs Small LLMs – Thoughts

If you are working on a task that is very specific, a smaller LLM may be able to learn the task-specific patterns more quickly than a larger LLM. Additionally, if you are working on a resource-constrained device, a smaller LLM may be the only option. Read in this blog post how to prepare an LLM for a specific task.

Benefits of large LLMs, such as 70B

Large language models (LLMs) with more parameters are typically trained on larger datasets and can capture more complex patterns. The parameters are the weights on the connections in the model’s neural network: the more parameters there are, the more connections there are, and the more expressive the network can be.
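Parameter count also translates directly into memory. At 16-bit precision each parameter takes two bytes, so a back-of-the-envelope calculation (ignoring activations, KV cache, and optimizer state) shows why 70B models don’t fit on a laptop:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough weight footprint in GB at the given precision (fp16 = 2 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(70))  # 140.0 GB: multi-GPU territory
print(weight_memory_gb(7))   # 14.0 GB: fits a single large GPU
```

Quantization shrinks these numbers (4-bit roughly quarters them), which is exactly why smaller and quantized models are the practical choice for local setups.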

Benefits of smaller LLMs, such as 6B or 770M

If I have a task that requires Python, I don’t need a model trained on Haskell, Go, and Rust. It is not necessary to use a model that is trained on other programming languages: an LLM trained on many languages spreads its capacity across all of them, which can make it less effective at generating code in one specific language.

An LLM that is trained on a large dataset of Python, Haskell, Go, and Rust code may be able to generate code in all of these languages. However, it may not be as good at generating idiomatic Python code as an LLM that is specifically trained on Python code.

If you have a task that requires Python, it is generally best to use an LLM that is specifically trained on Python code. This will give you the best chance of generating code that is syntactically correct, semantically meaningful, and idiomatic.

A 6B model is significantly more convenient for many purposes: it is less expensive to operate, runs on your laptop, and may even be more accurate on that specific language if the training data is good.

A good way to decide whether to use an LLM that is trained on multiple programming languages or an LLM that is specifically trained on one programming language is to experiment with both and see which one works better for your task.

Prepare Data for Code LLM Training

If you want to teach your LLM some tricks, you need to prepare training data and run a training (or fine-tuning) pass on the LLM. For more complex knowledge, this means a set of a few dozen or even a few hundred data pairs: what it is, and what it should be. This is called supervised learning.

For example: a piece of code, and a description of what the code does. If you write about 100 of these pairs, the LLM will start understanding and be able to explain code it hasn’t seen before. It can also be a piece of code plus an instruction describing how the given code should be built. As a result, the LLM will be able to write code from text instructions.

Example:
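A single pair might look like the following. The instruction/input/output layout is the common Alpaca-style schema; the field names and content here are just illustrative:

```python
# One hypothetical supervised pair in the common Alpaca-style layout.
pair = {
    "instruction": "Explain what the following Python code does.",
    "input": "def dedupe(xs):\n    return list(dict.fromkeys(xs))",
    "output": "Removes duplicate elements from a list while preserving "
              "the order of first occurrence.",
}
```

A training run then consumes thousands of such dictionaries, usually serialized as JSON or JSONL.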

How much should I write?

You can start seeing results with as little as 100 pairs. But the actual number you will need depends on various factors such as model complexity, data quality, diversity, the complexity of the task or the available training resources.

More complex models might require more data to learn effectively. Higher-quality data can lead to better performance, but it might compensate for a smaller dataset to some extent. A diverse dataset covering various programming languages, problem domains, and styles can enhance the model’s generalization. If the task requires highly nuanced or specialized descriptions, more data might be needed to capture these nuances effectively. The computational resources available for training play a role too; larger datasets might require more computational power and time.

How to Start

Begin with a reasonably sized dataset and monitor the model’s performance. You can then incrementally add more data, observing how the model improves with additional training examples.

As a general rule of thumb, having several thousand pairs of code and descriptions is a good starting point for training a language model effectively. However, this can vary significantly based on the factors mentioned above.

Tools that Help

First, you need a larger set of code snippets from your code base, or from something you find on the internet or on GitHub. A useful tool for that is Tree-sitter. It supports a lot of languages (parsers), from JS, Python, C++ and the like to more esoteric languages such as Erlang, Haskell, and Fennel (a Lisp that compiles to Lua). Your dataset should be somewhat diverse and cover each topic roughly equally (language datatypes, conditional constructs, I/O, etc.) as far as the base dataset is concerned. When it comes to your specific use cases, identify what is essential and make sure you cover everything.
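If your snippets are Python, the standard library’s `ast` module is a lighter-weight alternative to a full Tree-sitter setup for extracting function-level snippets; a minimal sketch:

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Return the source text of every top-level function in a module."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

module = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''

for snippet in extract_functions(module):
    print(snippet)
```

For other languages, or for finer-grained units like classes and methods, Tree-sitter’s parsers are the more general route.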

When you have your list of snippets, you can import them into a tool such as OpenDocString, which helps you write the descriptions, balance the topics of your dataset, and gives insights into data quality and diversity. The tool is in its early stages but already looks very promising and makes life much easier.

Once done, you have a larger list of code and descriptions, which you can then feed to your model for training, either using an online service or training locally on your machine or a cloud instance.