Field Notes on Startups
Fine-tuning is not a company anymore
A lot of AI startup pitches still sound like they were written in 2024.
The company has proprietary data. It fine-tunes a model. The tuned model performs better on a narrow task. Therefore, the model is the moat.
That argument has narrowed. Fine-tuning is still useful, but it is less often a company.
The market is becoming a barbell. On one side are frontier labs with the capital, compute, research talent, distribution, and brand to sell model capability directly. On the other side are vertical workflow companies that use models as components inside a system they own. The weak position is the middle: a startup whose main product is a fine-tuned model, sold as if the tuning itself creates durable defensibility.
The pressure is visible in the data. Stanford’s 2025 AI Index reports that the cost of querying a model with GPT-3.5-level performance fell more than 280-fold between November 2022 and October 2024. The same report says the gap between leading open-weight and closed-weight models on Chatbot Arena narrowed from about 8% to 1.7% in a year. Good-enough model capability is spreading faster than most fine-tuning moats can harden (Stanford HAI AI Index 2025).
The frontier is still a model business
The anti-model argument often goes too far. Frontier models are not commodities. The best models still pull revenue, set developer defaults, and create platform gravity.
OpenAI’s annualized revenue reportedly passed $20 billion in 2025, up from $6 billion in 2024. Anthropic has said Claude Code passed $2.5 billion in run-rate revenue, with usage more than doubling since the beginning of 2026. Those are not “models do not matter” data points. They show that the very top of the market is still a model business.
But that is not the same game most startups are playing.
A frontier lab is defensible because it can keep pushing the capability curve, serve massive inference demand, win developer adoption, pass enterprise procurement, absorb safety and regulatory cost, and finance the next training run. That is a full-stack capital machine. It is not a seed-stage moat with a training set.
OpenAI and Anthropic do not prove that every model-centric startup should sell a model. They prove that at the absolute top, model quality can still be the product. Below that line, the product usually has to become something else.
Some modalities are not yet text
There is another real exception: modalities that have not yet compressed.
Text has moved fastest toward substitutability. In many enterprise text workflows, the difference between model A and model B matters less than retrieval quality, permissioning, evaluation, latency, cost controls, and integration into the system of record.
Models are still moats when the model is clearly better, the buyer can see the difference, and the difference lasts long enough to monetize. That is true at the frontier. It is still true in parts of video, audio, 3D, robotics, and other less-settled modalities.
Today Google positions Veo around production-grade generation features such as prompt adherence, native audio, and higher-quality output. Runway positions Gen-4.5 around realistic motion, subject consistency, style consistency, and world understanding. These are not small backend details. They change the output a buyer sees, and they put video roughly where coding models stood before their 2025 convergence: visibly differentiated, not yet commoditized.
Fine-tuning is becoming an implementation detail
Fine-tuning used to be an easy story to tell. Prompting was fragile. Retrieval was immature. Frontier APIs were expensive. Open models were weaker. If a startup had proprietary examples and could tune an open model into a narrow task, the result looked like a product.
DeepSeek-R1 was one of the clearest resets. DeepSeek described R1 as performing on par with OpenAI-o1 on math, code, and reasoning tasks, released it with open weights, and priced API access aggressively. Whatever one thinks of the benchmark claims, the market signal was obvious: capable reasoning models were not going to remain scarce in the way many 2024 decks assumed (DeepSeek-R1 release).
At the same time, providers have made the prompt-and-retrieval route cheaper and more capable. Anthropic’s prompt caching reduces the cost and latency of repeated long-context prompts. Long-context models, structured outputs, tool calling, and managed retrieval have removed many of the reasons teams used to fine-tune by default (Anthropic prompt caching).
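For a sense of how small that route has become, here is a minimal sketch using Anthropic’s Python SDK. The model name and the LONG_REFERENCE_DOCUMENT placeholder are assumptions for illustration, not recommendations; the point is that the domain context a team might once have tuned into weights can instead ride along as cached prompt tokens.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Placeholder: e.g. a policy manual the team would otherwise fine-tune on.
LONG_REFERENCE_DOCUMENT = "..."

# Mark the large, reusable context as cacheable so repeated requests
# pay the full input-token price only on the first call.
response = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: use whatever current model you evaluate
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOCUMENT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the refund policy."}],
)
print(response.content[0].text)
```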
Fine-tuning also has operational risk. Qi et al.’s ICLR 2024 paper showed that fine-tuning aligned models can compromise safety properties even when users do not intend to. Other work on catastrophic forgetting shows that tuning can improve one behavior while degrading others (Qi et al., ICLR 2024).
Fine-tuning still makes sense when it gives a customer-visible advantage in one of four places: output determinism, latency, cost at high volume, or domain performance that prompting and retrieval cannot reach. The practical test is simple. Take the tuned model out. Replace it with the current best frontier model, a good prompt, retrieval, caching, and evals. Does the customer notice on quality, latency, cost, compliance, or reliability?
If the answer is no, the tuning is not the business. And even if the answer is yes, the time it takes the frontier to absorb a specialized model’s advantage keeps shrinking, so the window to monetize a fine-tune is narrowing and the ROI conversation gets harder at the planning stage.
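As a concrete version of that swap test, a minimal sketch follows. Everything in it is hypothetical scaffolding: call_tuned, call_baseline, score_answer, and EVAL_SET stand in for whatever harness a team already has. The only point is that both candidates run the same cases and get compared on the metrics a customer can actually see.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

# Stand-ins for the real adapters. In practice call_tuned hits the
# fine-tuned model; call_baseline hits a frontier model with your best
# prompt, retrieval context, and caching. Both return (answer, usd_cost).
def call_tuned(prompt: str) -> tuple[str, float]:
    return "tuned answer: 30 days", 0.0004      # placeholder

def call_baseline(prompt: str) -> tuple[str, float]:
    return "baseline answer: 30 days", 0.0011   # placeholder

def score_answer(answer: str, expected: str) -> float:
    return float(expected.lower() in answer.lower())  # placeholder scorer

EVAL_SET = [EvalCase("What is our refund window?", "30 days")]  # use real cases

def run(candidate: Callable[[str], tuple[str, float]]) -> dict[str, float]:
    scores, latencies, costs = [], [], []
    for case in EVAL_SET:
        start = time.perf_counter()
        answer, cost = candidate(case.prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(score_answer(answer, case.expected))
        costs.append(cost)
    return {
        "quality": statistics.mean(scores),
        "p50_latency_s": statistics.median(latencies),
        "cost_per_case_usd": statistics.mean(costs),
    }

for name, result in [("tuned", run(call_tuned)), ("baseline", run(call_baseline))]:
    print(name, result)
```

If the tuned column does not beat the baseline column on a metric the customer notices, the fine-tune is an implementation detail, not a moat.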
The harness is the product
Especially in enterprise, where patterns exist at scale, the model is rarely the hard part for long.
The hard part is getting access to the right data without breaking permissions. It is normalizing messy inputs. It is deciding what the model is allowed to do. It is logging enough to debug failures without creating a compliance problem. It is running evals that match the customer’s real workflow. It is handling version changes. It is giving procurement a price they can understand. It is supporting the customer when the answer is wrong and nobody can reproduce the exact path that led to it.
This is where vertical AI companies can become defensible. Not because they own a magical model, but because they own the operating context: claims processing, contract review, prior authorization, security triage, field-service dispatch, revenue-cycle management, procurement intake, customer-support resolution.
Fine-tuning does not answer most of those questions; the right harness does.
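To make “harness” concrete, here is a sketch of the shape such a layer takes. All names and interfaces are illustrative assumptions: retriever, policy, and model stand for the vertical company’s own components, and the model is just one swappable part.

```python
import logging
import uuid

log = logging.getLogger("harness")

# Hypothetical harness layer: permissions, policy, and an audit trail
# wrap the model call. None of this comes from fine-tuning.
def answer(user, question, *, retriever, policy, model):
    request_id = uuid.uuid4().hex
    docs = retriever.search(question, allowed_for=user)   # permission-aware retrieval
    if not policy.allows(user, question, docs):           # what is the model allowed to do?
        log.info("request %s refused for user %s", request_id, user.id)
        return "This request is outside the allowed scope."
    reply = model.generate(question, context=docs)
    # Log enough to reproduce the path that led to an answer,
    # without turning the log itself into a compliance problem.
    log.info("request %s: docs=%s model=%s", request_id,
             [d.id for d in docs], model.version)
    return reply
```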
Training a model is not the same as selling one
The last two years gave the market enough evidence.
Microsoft hired Mustafa Suleyman and Karén Simonyan from Inflection to lead Microsoft AI. Adept’s co-founders and some of the team joined Amazon, while Amazon licensed Adept’s agent technology, models, and datasets. Character.AI signed a licensing deal with Google, and its co-founders returned to Google with other research staff (Microsoft announcement).
These were not simple technical failures. They were reminders that training capability and company capability are different things.
A model company has to price stochastic output with variable inference cost. It has to make latency and availability commitments. It has to survive customer-side evals, or worse, customers with no evals at all. It has to support hallucination tickets without turning the research team into customer support. It has to version models without breaking integrations. It has to pass security review. It has to explain why the buyer should not just use the frontier API directly.
MosaicML is the cleaner positive case. Databricks acquired it in 2023 and positioned the combination around helping customers build and secure models with their own data. The model capability mattered, but it became more valuable inside an existing data platform with existing enterprise distribution (Databricks MosaicML acquisition).
That pattern keeps showing up. Model capability is valuable. It is just often more valuable as part of a larger platform than as a standalone company.
The diligence question
If the company is trying to be a frontier lab, the discussion is about capital, compute, research velocity, distribution, safety process, and whether it can stay near the top of the capability curve long enough to monetize.
If the company is not trying to be a frontier lab, the discussion should move away from model language quickly. What workflow does it own? What system does it integrate into? What data rights does it have? What eval loop improves with usage? What policy layer would be painful to replace? What does the customer buy that is not just tokens passed through a margin stack?
The bad answer is the middle answer: “we fine-tune on proprietary data.” That can be part of the architecture. It is not enough to be the architecture of the company.
What is left of the model moat
The model-as-moat thesis did not die. It moved to the extremes.
At the top, frontier labs can still sell model capability directly. In less-commoditized modalities like video, the model can still be visibly differentiated. But in the broad middle of text-centric enterprise AI, fine-tuning is being absorbed into the stack. It is becoming one technique among many, next to prompting, retrieval, caching, evals, routing, guardrails, and workflow design.
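One way to read “one technique among many”: the model sits behind an interface the system owns, so routing and swapping are policy decisions rather than rewrites. A minimal sketch, with illustrative names only:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    call: Callable[[str], str]     # prompt -> completion
    cost_per_1k_tokens: float

# Illustrative policy: a cheap model for routine tasks, a frontier model
# where quality is customer-visible, a tuned model only where the swap
# test above showed a real delta.
def pick_route(task_kind: str, routes: dict[str, Route], default: Route) -> Route:
    return routes.get(task_kind, default)
```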
The startup opportunity is still large. It is just not where many 2024 decks said it was.
The durable company is less likely to be “a fine-tuned model for X.” It is more likely to be “the system for X workflow,” with models underneath that can be swapped, routed, evaluated, and constrained, riding the economies of scale of frontier labs’ advancing capability without sunk training costs anchoring the product to a fixed point in time.