The Myth of the “Naked” LLM
A Conversation with Dimitri, Felix Research Founding Engineer
In the current AI arms race, there is a persistent misunderstanding that the model is the entire product. Investors and users alike often ask why bespoke solutions are necessary when frontier models like Claude already exist as “standalone powerhouses”.
However, treating an LLM as a complete solution is like admiring a high-performance engine while forgetting it requires a surrounding structure to actually move.
I sat down with Felix Research’s Founding Engineer, Dimitri, to address the myth of the “naked” LLM. We discussed why the future of specialised research doesn’t lie in the brilliance of a single model, but in the “glue”: the programmatic tools, the guardrails, the architecture and, crucially, the domain expertise.
Sav: So, we already discussed context windows for the recent piece. How it’s not just an issue of bigger and bigger because of the costs... they increase quadratically.
Dimitri: Yeah. They skyrocket.
Sav: Exactly. But what I’m curious about is your answer, as the engineer, when investors are like, “Why can’t I just use Claude? Why do I need you guys?” Is it frustrating? Or do they have a point? Articulate the reasoning!
Dimitri: Okay. Well. It’s because everyone is looking at the software and attributing all of it to “the LLM” because they are “AI companies.” But what they’re forgetting is that these companies hire engineers to build around the LLM. That’s the final product you interact with. Not just a lone LLM.
Dimitri’s frustration is backed by recent industry research: the field is moving away from evaluating LLMs as standalone “brains” and toward evaluating them as composite systems. A recent arXiv paper, entitled Precision Proactivity: Measuring Cognitive Load in Real-World AI-Assisted Work, asserts that “Agents are systems, not models,” and that single-turn accuracy (the “naked” output) is no longer a viable metric for real-world enterprise utility.
D: What LLMs are really good at is this: you give one text, it understands it [to varying degrees, depending on what “understand” means to you], and it gives you output, predominantly in text format. Now, that doesn’t sound very powerful, but it is in the context of a programmatic system that can do something with that output.
Ours [Felix Intelligence] is a bit different because it’s using a visual LLM. It’s taking in an image and producing text. But the system we built - the parts that show the tables, the RAG (Retrieval-Augmented Generation), the bit where you ask a question and it finds the relevant source - those are tools built using normal programming languages. It’s engineering glue.
S: So there’s an architectural misunderstanding going on. People think the entire product is the “naked” LLM. They don’t realise that Claude and similar are (often) a blend of engineering and models?
D: Yeah, so if you go to GPT and press “Deep Research” or “Web Search”, they are providing a programmatic tool for the LLM to use if it wants to - a tool where the LLM inputs text and the tool returns text. To be clear for a second, this isn’t prompt engineering or fine-tuning. This is giving it access to tools, regardless of prompt engineering.
S: What do you actually mean by “tools”?
D: Okay, so, you have the LLM in a box. You have the prompt going in, and the output being spat out. But the output doesn’t have to be just raw text. It can be a specific structured format.
Then you have “LLM tools.” These are mechanisms the LLM can use, for example to query our database. These tools are programmatic. It’s basically like giving the LLM an API so it can pull information from the external world. Importantly, we have to build each of those tools programmatically before the LLM can touch them.
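The dispatch pattern Dimitri describes can be sketched in a few lines. This is a hypothetical illustration, not Felix’s actual code: the `query_database` function, the `FAKE_DB` records, and the tool registry are all invented for the example.

```python
import json

# Stand-in for a proprietary database; purely illustrative.
FAKE_DB = {
    "ACME": {"revenue": 1200, "currency": "GBP"},
    "GLOBEX": {"revenue": 450, "currency": "USD"},
}

def query_database(ticker: str) -> str:
    """A programmatic tool: text in, text out."""
    record = FAKE_DB.get(ticker.upper())
    if record is None:
        return json.dumps({"error": f"unknown ticker {ticker!r}"})
    return json.dumps(record)

# Each tool is described to the model by name, so the LLM can request it
# in its structured output. Engineers must build every handler first.
TOOLS = {
    "query_database": {
        "description": "Look up a company record by ticker.",
        "handler": query_database,
    }
}

def dispatch(tool_call: dict) -> str:
    """The engineering 'glue': route the model's structured output to real code."""
    handler = TOOLS[tool_call["name"]]["handler"]
    return handler(**tool_call["arguments"])

# What the LLM might emit instead of raw prose, and what the system does with it:
result = dispatch({"name": "query_database", "arguments": {"ticker": "acme"}})
```

In a production loop, the tool’s text result is fed back into the model’s context so it can reason over the retrieved data - the model never touches the database directly.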
S: So the efficiency and power that the industry is speaking of at the moment doesn’t come from the lone LLM.
D: It comes from a system. Multiple LLMs plus programmatic engineering to put it all together and tell it what to do. That’s what people forget. Take Claude Code; the reason it’s so powerful is that Anthropic built engineering tools around the LLM, to enable it.
The rise of “Claude Code” and OpenAI’s “Deep Research” proves that the frontier labs are no longer just selling intelligence; they are selling orchestration. However, as the arXiv paper notes, these general tools often fail in production because they lack the “operational constraints” and “domain-specific safety” that specialised systems like Felix provide.
D: That’s the problem with investors [with love! We love you]. They go, “Why should I invest in this when Anthropic is going to replace you?” Well, because they’re not. Because if they want to go into a specific domain and create purpose-built tools at our level of granularity, they would need a team like ours.
S: If you were on an investor call and you could speak freely, what would you actually say in response to that line of questioning?
D: I mean [he laughs to himself], I’ve done this many times. I tell them: an LLM is a tool, and what you’re investing in is the system around it that makes it powerful and specialised. There’s a reason the frontier labs are still hiring engineers like crazy and paying them hundreds of thousands to millions a year. It’s because the LLM is useful, but the real power comes from the enabling system.
It’s like... using power tools-
S: -A jackhammer is powerful and effective but you can’t just hand it to anyone and expect perfect outcomes with no mistakes or destruction.
D: Exactly.
S: So when someone says, “Why can’t I just use Claude?”, they’re ignoring that you still have to build all these domain-specific tools.
D: Yeah. Look at what Anthropic did with Claude Code. They provided a tool for the LLM to write raw Bash. Bash being a system-level language.
S: Bash is the programming language that... what? Controls the computer?
D: Yeah, well at least Mac and Linux. Anthropic couldn’t - and wouldn’t - write a specific tool for every single thing. So instead they said, “Here’s Bash. Bash can do everything. If we give the LLM the ability to use Bash, we don’t need to write a tool for every little task.”
S: [Likely looking perturbed] Um…subject to…permissions…?
D: Yeah. Because of the potential for danger, the LLM will ask you for permission every single time it wants to use Bash - rather than it being like an on switch that stays on. Like if you tell it to do something it shouldn’t, it can technically send an email to your boss or the government.
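That permission gate can be sketched with a simple approval callback. This is an invented illustration of the pattern, not Anthropic’s implementation: the `run_bash` name and the `approve` hook are assumptions for the example.

```python
import subprocess

def run_bash(command: str, approve) -> str:
    """Execute a shell command only if the approval callback says yes.

    In a real agent loop, `approve` would prompt the human every single
    time - the gate never stays 'on' between invocations.
    """
    if not approve(command):
        return "DENIED: user rejected the command"
    completed = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True
    )
    return completed.stdout.strip()

# Approved command runs; rejected command never reaches the shell.
out = run_bash("echo hello", approve=lambda cmd: True)
blocked = run_bash("rm -rf /tmp/scratch", approve=lambda cmd: False)
```

The design choice here is that safety lives in the glue code, not the model: the LLM can propose any command it likes, but only the programmatic layer decides whether it executes.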
Lepine, Kim, Mishkin and Beane highlight that “reliability is more valuable than brilliance.” In financial research, an agent that can write Bash is “brilliant,” but an agent that can reliably navigate a proprietary financial database without violating PII (Personally Identifiable Information) boundaries is “valuable.” Felix’s value lies in this bridge between raw power and operational safety.
D: The important thing to get across is that we’re not reinventing the wheel. We’re adding value. We’re building a powerful system around various LLMs, as well as blending models and proprietary RAG guardrails.
S: They think it’s a brain; it’s actually an engine. And we’re building the car.
D: Yeah. You can’t generalise expertise. You have to build the specific tools for the domain.
References: https://arxiv.org/pdf/2505.10742