Last year I built Search for Intelligence to experiment with evaluating LLMs without all the columns of data clutter many of the eval benchmarks use. I wanted a simple, easy-to-use system for running the same prompt against two different LLMs, then scoring the generated output.
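To make that concrete, here’s a minimal sketch of the kind of head-to-head comparison I mean, assuming the OpenAI and Anthropic Python SDKs; the model names and the toy scorer are placeholders, not what Search for Intelligence actually uses.

```python
# Minimal sketch: run one prompt through two LLMs, then score each output.
# Assumes API keys in the environment; models and scorer are illustrative only.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def run_openai(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_anthropic(prompt: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def score(output: str) -> float:
    # Placeholder scorer: fraction of "reasoning" keywords present, purely illustrative.
    keywords = {"because", "therefore", "for example"}
    return sum(1 for k in keywords if k in output.lower()) / len(keywords)

prompt = "Explain why the sky is blue in two sentences."
for name, runner in [("openai", run_openai), ("anthropic", run_anthropic)]:
    print(name, round(score(runner(prompt)), 2))
```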
But along the way, setting up the cloud infrastructure for Search for Intelligence - a backend to host the LLMs and a chat frontend to test prompts - became such a massive hassle that I started building a separate system to host my own LLM stack. Necessity begets invention!
Meanwhile, LLM-powered IDEs were getting really good at guessing the next step to take when coding an application. I tried VS Code with Copilot; it sucked. I tried multiple command-line tools, but they were too primitive for building frontends. I eventually landed on Cursor and committed to making it my daily IDE, while still trying different LLMs to test output quality and performance. After trying multiple OpenAI models I ended up settling on Anthropic’s Claude 3.7 Sonnet, which is still my daily-use LLM.
With an LLM-powered IDE settled, I went back to focusing on building an LLM stack and testing whether I could vibe-code it into existence. The first few iterations were rough because I had to deal with a significant amount of cloud infrastructure engineering, which Cursor didn’t have the search capabilities to get context on. But a few tricks within Cursor - and the Claude.ai interface - solved this fresh-context problem. As of April 2025 there are now several solutions for giving an LLM-powered IDE fresh context via web search. That might become a future post on what I learned.
Back to the LLM stack: it’s now live at https://www.evalbox.ai/.
My pals at various companies, startups, and orgs are starting to use it, which is a great feedback loop for refining a product you built for yourself.
But I have to decide where the LLM stack experiment goes from here. Two areas are tickling my brain: 1] memory, and 2] agent transition.

The first is a limitation that became obvious during the early experiments with Search for Intelligence: the system needs to remember stuff about what you’re doing with it - and about you. This was magnified with the LLM stack, which is for the moment an ephemeral system: you deploy a stack, run experiments, then delete the stack. As a colleague of mine asked: “Shouldn’t we have a ceremony before we kill the thing we created; don’t we need to think about what’s happening before we turn it off?” That’s getting too esoteric for my purposes, but it does highlight the need to remember what the system did while you were running experiments with it.

The second brain tickle is much more difficult: if I build agent[s] into the LLM stack, can those agent[s] transition between simultaneously running LLM stacks? And can those agent transitions be stored - dormant, say - and revived later, after the LLM stack they were born in has been shut down and a new one deployed?
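To give the “dormant and revived” idea some shape, here’s a rough sketch of what snapshotting an agent’s state outside the stack could look like. Everything in it - the AgentState fields, the hibernate/revive helpers - is hypothetical, not something EvalBox does today.

```python
# Hedged sketch of a "dormant agent": persist an agent's memory and transcript
# before the stack is torn down, then revive it in a freshly deployed stack.
# All names here are hypothetical; this is not an existing EvalBox API.
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class AgentState:
    agent_id: str
    model: str
    memory: list[str] = field(default_factory=list)       # facts the agent accumulated
    transcript: list[dict] = field(default_factory=list)  # prior messages

def hibernate(state: AgentState, path: Path) -> None:
    # Write the snapshot somewhere that outlives the stack (disk, object storage, ...).
    path.write_text(json.dumps(asdict(state)))

def revive(path: Path) -> AgentState:
    # Load the snapshot into whatever stack is running now.
    return AgentState(**json.loads(path.read_text()))

state = AgentState(agent_id="exp-42", model="claude-3-7-sonnet",
                   memory=["user prefers terse answers"])
hibernate(state, Path("exp-42.json"))
print(revive(Path("exp-42.json")).memory)
```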
Like many of us running deep experiments with AI, I have a lot more questions than easy answers. I’m learning every single day what’s possible, while trying to stay grounded in the practical, I-need-to-get-stuff-done reality.