How to Build an Agent with a Local LLM and RAG, Complete with Local Memory
If you want to build an agent with a local LLM that can remember things and retrieve them on demand, you’ll need a few components: the LLM itself, a Retrieval-Augmented Generation (RAG) system, and a memory mechanism. Here’s how you can piece it all together, with examples using LangChain and Python. (And here is why a small LLM is a good idea.)
Step 1: Set Up Your Local LLM
First, you need a local LLM. This could be a smaller pre-trained model such as LLaMA or a GPT-style open-source option running on your machine. The key is that it’s not connected to the cloud: it’s local, private, and under your control. Make sure the LLM is accessible via an API or similar interface so that you can integrate it into your system. A good choice is Ollama together with an LLM such as Google’s Gemma. I also wrote easy-to-follow instructions on how to set up a T5 LLM from Salesforce locally, but it is also perfectly fine to use a cloud-based LLM.
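To check that the model is reachable from Python, here is a minimal sketch using LangChain’s Ollama integration; it assumes Ollama is already running locally and that the Gemma model has been pulled with `ollama pull gemma`:
from langchain_ollama import OllamaLLM

# Assumes Ollama is serving on its default port (11434) and `ollama pull gemma` has been run.
llm = OllamaLLM(model="gemma:latest")
print(llm.invoke("Explain in one sentence what a local LLM is."))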
In case the agent you want to build is about source code, here is an example of how to use CodeT5 with LangChain.
Step 2: Add Retrieval-Augmented Generation (RAG)
TL;DR: Gist on GitHub
Next comes the RAG. A RAG system works by combining your LLM with an external knowledge base. The idea is simple: when a query comes in, the RAG system fetches relevant information from your knowledge base (documents, notes, or even structured data) and feeds it into the LLM as context.
To set up RAG, you’ll need:
- A Vector Database: This is where your knowledge will live. Tools like Pinecone, Weaviate, or even local implementations like FAISS can store your data as embeddings.
- A Way to Query the Vector Database: Use similarity search to find the most relevant pieces of information for any given query.
- Integration with the LLM: Once the RAG fetches data, format it and pass it as input to the LLM.
I have good experience with LangChain and Chroma:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain.chains import RetrievalQA

model_name = "gemma:latest"
documents = TextLoader("my_data.txt").load()
texts = CharacterTextSplitter(chunk_size=300, chunk_overlap=100).split_documents(documents)
vectorstore = Chroma.from_documents(texts, OllamaEmbeddings(model=model_name)).as_retriever()
llm = OllamaLLM(model=model_name)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore)
qa_chain.invoke("What is the main topic of my document?")
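To see what the retriever actually hands to the LLM, you can also call it directly. This small sketch reuses the vectorstore retriever from the snippet above:
# The retriever is a Runnable, so invoke() returns the top matching chunks as Documents.
docs = vectorstore.invoke("What is the main topic of my document?")
for doc in docs:
    print(doc.page_content[:200])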
Step 3: Introduce Local Memory
Now for the fun part: giving your agent memory. Memory is what allows the agent to recall past interactions or store information for future use. There are a few ways to do this:
- Short-Term Memory: Store conversation context temporarily. This can simply be a rolling buffer of recent interactions that gets passed back into the LLM each time.
- Long-Term Memory: Save important facts or interactions for later retrieval. For this, you can extend your RAG system by saving interactions as embeddings in your vector database (see the sketch after the LangChain example below).
For example:
- After each interaction, decide if it’s worth remembering.
- If yes, convert it into an embedding and store it in your vector database.
- When needed, retrieve it alongside other RAG data to give the agent a sense of history.
LangChain Example
from langchain.memory import ConversationBufferMemory
# Initialize memory
memory = ConversationBufferMemory()
# Save some conversation turns
memory.save_context({"input": "Hello"}, {"output": "Hi there!"})
memory.save_context({"input": "How are you?"}, {"output": "I'm doing great, thanks!"})
# Retrieve stored memory
print(memory.load_memory_variables({}))
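ConversationBufferMemory covers the short-term side. For the long-term memory described above, the idea is to write interactions worth keeping back into a vector store and search them again later. Here is a minimal sketch using Chroma and Ollama embeddings; the collection name, persist directory, and helper names are just illustrative:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

# A separate, persistent collection for the agent's long-term memory.
memory_store = Chroma(
    collection_name="agent_memory",
    embedding_function=OllamaEmbeddings(model="gemma:latest"),
    persist_directory="./agent_memory",  # illustrative path, keeps memory across runs
)

def remember(user_input, agent_output):
    # Embed and store one interaction that was judged worth keeping.
    memory_store.add_documents([Document(page_content=f"User: {user_input}\nAgent: {agent_output}")])

def recall(query, k=3):
    # Fetch past interactions related to the current query.
    return memory_store.similarity_search(query, k=k)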
Step 4: Put It All Together
Now you can combine these elements (a rough end-to-end sketch follows the list):
- The user sends a query.
- The system retrieves relevant data via RAG.
- The memory module checks for related interactions or facts.
- The LLM generates a response based on the query, retrieved context, and memory.
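Here is one way a single turn of such an agent could look. It stitches together the pieces from the earlier snippets (llm, vectorstore, memory, remember, and recall) into one prompt; a real agent would use proper prompt templates and a policy for deciding what is worth remembering:
def answer(query):
    # 1. RAG: fetch relevant document chunks.
    rag_context = "\n".join(d.page_content for d in vectorstore.invoke(query))
    # 2. Memory: related past interactions plus the recent conversation buffer.
    past = "\n".join(d.page_content for d in recall(query))
    history = memory.load_memory_variables({}).get("history", "")
    # 3. Let the LLM answer with all of that as context.
    prompt = (
        f"Conversation so far:\n{history}\n\n"
        f"Relevant documents:\n{rag_context}\n\n"
        f"Related past interactions:\n{past}\n\n"
        f"Question: {query}"
    )
    response = llm.invoke(prompt)
    # 4. Update both memories (here we simply keep everything).
    memory.save_context({"input": query}, {"output": response})
    remember(query, response)
    return response

print(answer("What did we discuss about my document last time?"))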
This setup is powerful because it blends the LLM’s generative abilities with a custom memory tailored to your needs. It’s also entirely local, so your data stays private and secure.
Final Thoughts
Building an agent like this might sound complex, but it’s mostly about connecting the dots between well-known tools. Once you’ve got it running, you can tweak and fine-tune it to handle specific tasks or remember things better. Start small, iterate, and soon you’ll have an agent that feels less like software and more like a real assistant.