Notes — Kevin Weil on OpenAI and the Future of AI Products

Four questions [Adler frame]

Q1 — What is it about?
A CPO’s-eye view of how building on LLMs differs from traditional software product development, and the key philosophical stances that follow: model maximalism, iterative deployment, evals-first, fine-tuning everywhere. Grounded in Kevin’s direct experience shipping ChatGPT products (deep research, image generation, tasks, Canvas via Karina).

Q2 — How is it argued?
Named philosophies with practical grounding: model maximalism (don’t compensate for limitations), iterative deployment (ship and co-evolve), ‘worst model you’ll ever use’ (empirical rate-of-improvement argument). Also specific product case studies: deep research evals; customer support with fine-tuned models; reasoning model UX problem.

Q3 — Is it true?
The model maximalism argument depends on the rate of model improvement continuing. The 10x/year claim has been roughly true for the past 3–4 years but may not hold indefinitely. For near-term product strategy, the argument is well-supported. The evals-determine-product-design point is well-established across multiple wiki sources.

Q4 — What of it?
Model maximalism is the most actionable insight: stop investing in scaffolding around model limitations; invest in evals and fine-tuning instead. ‘Poor man’s fine-tuning’ (few-shot examples in prompt) is an immediately usable tactic. The reliability threshold product framework (60% vs 95% vs 99.5%) is a practical decision tool.

Glossary

Model maximalism — the philosophy of building at the edge of model capability without compensating for current limitations; trust models to improve fast enough to render today’s limitations irrelevant. See Model Maximalism.

Iterative deployment — OpenAI’s product philosophy: ship early, learn in public, co-evolve with society rather than holding back breakthroughs for a ‘perfect’ launch.

‘Worst model you’ll ever use’ — Kevin’s framing of the model improvement rate: today’s models are the least capable you will experience; every future model is better.

Model ensemble — using multiple fine-tuned models in combination, each specialised for a specific subtask, with an integrating model combining outputs. OpenAI uses this extensively internally.

Poor man’s fine-tuning — including several problem/answer examples in a prompt (few-shot) to steer model behaviour toward a specific style or domain, without full fine-tuning.

Reliability threshold — the percentage of the time a model correctly handles a specific use case. Different thresholds warrant different product designs: 60% requires very different choices than 99.5%.
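
A minimal sketch of the 'poor man's fine-tuning' entry above, assuming a plain-text prompt format. The function name build_few_shot_prompt and the 'Problem:' / 'Answer:' labels are illustrative choices, not anything Kevin or OpenAI specified. The sketch only assembles the prompt string; sending it to a model is left out.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    new_problem: str,
) -> str:
    """Assemble a few-shot prompt from (problem, answer) example pairs.

    The labels and layout are hypothetical; any consistent format
    that shows the model worked examples would serve the same purpose.
    """
    parts = [instruction.strip(), ""]
    for problem, answer in examples:
        parts.append(f"Problem: {problem}")
        parts.append(f"Answer: {answer}")
        parts.append("")
    # Leave the final answer open so the model continues in the same style.
    parts.append(f"Problem: {new_problem}")
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Reply to customer questions in one short, friendly sentence.",
    [
        ("Where is my order?", "It shipped yesterday and should arrive soon."),
        ("Can I change my address?", "Yes, you can update it in your settings."),
    ],
    "How do I reset my password?",
)
print(prompt)
```

The examples do the steering here: the more closely they match the target style or domain, the less the instruction line has to carry.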

Reliability thresholds and product design [§ Deterministic vs. probabilistic]

The key question for any AI product: what is the model’s reliability on your specific use case? This drives the entire product architecture:

60% reliable → probably need human-in-the-loop; narrow, forgiving use cases only; extensive error surfacing.
95% reliable → can build most consumer products; failures are occasional and recoverable.
99.5% reliable → can build mission-critical products; users trust the output.

Traditional software behaves deterministically: the same input yields the same output every time. LLM outputs follow a distribution, so product design must be calibrated to where in that distribution your specific use case falls. [?] Kevin did not explicitly state the 99.5% threshold; this is an extrapolation from his framework.
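
The threshold framework can be sketched as a small lookup, assuming reliability is measured as the share of eval cases the model handles correctly. The names measured_reliability and design_tier are hypothetical, and treating each threshold as a floor (so 80% falls in the 60% tier) is an assumption of this sketch, not something stated in the talk.

```python
from typing import List, Tuple

# Thresholds from the notes, highest first. The 99.5% tier is the
# note-taker's extrapolation, not a figure Kevin stated.
TIERS: Tuple[Tuple[float, str], ...] = (
    (0.995, "mission-critical: users can trust the output"),
    (0.95, "consumer product: failures are occasional and recoverable"),
    (0.60, "human-in-the-loop: narrow, forgiving use cases; surface errors"),
)


def measured_reliability(outcomes: List[bool]) -> float:
    """Fraction of cases handled correctly on one specific use case."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def design_tier(reliability: float) -> str:
    """Map a measured reliability to the design guidance in the notes.

    Assumption: each threshold is a floor, so values between two
    thresholds get the guidance of the lower one.
    """
    for floor, guidance in TIERS:
        if reliability >= floor:
            return guidance
    return "below 60%: not covered by the notes"


# Example: 19 of 20 eval cases passed.
rate = measured_reliability([True] * 19 + [False])
print(rate, "->", design_tier(rate))
```

The point of the sketch is that the input is a measurement on your own use case, not a general benchmark score.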

Model maximalism in action [§ Model maximalism]

Kevin’s specific recommendation: if you’re building a product and it’s right on the edge of what the model can do, keep going. Don’t try to compensate around the limitation. In two months, the model will get significantly better.

The analogy: nobody broke 4 minutes in the mile for a long time; once one person did, 12 more did it the following year. Once a capability breakthrough occurs, it propagates quickly. Build as if you expect the breakthrough; it often comes sooner than expected.

Internal application: OpenAI uses small models (4o-mini) for quick/cheap checks; o-series for reasoning-intensive tasks; fine-tuned models for specific use cases. Don’t use a general model for a specific problem when a fine-tuned one would be 10x better.
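
A rough sketch of the routing idea above, under stated assumptions. The strings "4o-mini" and "o-series" are the labels used in these notes, not verified API model identifiers. The pick_model function, the task kinds, and the fine-tuned model name are hypothetical. Checking for a fine-tuned model first is this sketch's reading of the 'don't use a general model for a specific problem' advice.

```python
from typing import Dict, Optional

# Labels as written in the notes; placeholders, not real API model IDs.
SMALL_MODEL = "4o-mini"
REASONING_MODEL = "o-series"


def pick_model(
    task_kind: str,
    domain: Optional[str] = None,
    fine_tuned: Optional[Dict[str, str]] = None,
) -> str:
    """Choose a model label for a task.

    Assumption: a fine-tuned model for the task's domain, when one
    exists, takes priority over any general model.
    """
    fine_tuned = fine_tuned or {}
    if domain is not None and domain in fine_tuned:
        return fine_tuned[domain]
    if task_kind == "quick_check":
        return SMALL_MODEL
    if task_kind == "reasoning":
        return REASONING_MODEL
    raise ValueError(f"unknown task kind: {task_kind!r}")


# Hypothetical registry of fine-tuned models keyed by domain.
registry = {"customer_support": "ft-support-model"}

print(pick_model("quick_check"))
print(pick_model("reasoning"))
print(pick_model("reasoning", domain="customer_support", fine_tuned=registry))
```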

Evals as fine-tuning targets [§ Evals as the core product skill]

The deep research case study is the most concrete description in the wiki of how evals drive product development:

Identify hero use cases (complex research questions users want answered).
Define what an amazing answer looks like for each.
Encode those as evals.
Fine-tune the model against those evals.
Track eval performance as the primary signal that the product is working.

This reframes evals from ‘quality gate’ to ‘training target.’ Evals are not just a test; they are the specification that the model learns from. This extends Edwin Chen’s post-training ladder with the product-level application.
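
The steps above can be sketched as a tiny eval harness. This is an illustration under assumptions, not OpenAI's tooling: EvalCase, mentions_all, and run_evals are hypothetical names, the keyword grader is a crude stand-in for a real definition of an amazing answer, and the model is any callable from question to answer. The fine-tuning step itself is out of scope; the sketch covers defining cases, encoding them as graders, and tracking the aggregate score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One hero use case plus a grader that scores an answer in [0, 1]."""
    question: str
    grader: Callable[[str], float]


def mentions_all(required: List[str]) -> Callable[[str], float]:
    """Toy grader: fraction of required terms that appear in the answer."""
    def grade(answer: str) -> float:
        text = answer.lower()
        hits = sum(term.lower() in text for term in required)
        return hits / len(required)
    return grade


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Mean score across all cases: the signal to track over time."""
    if not cases:
        raise ValueError("need at least one eval case")
    scores = [case.grader(model(case.question)) for case in cases]
    return sum(scores) / len(scores)


cases = [
    EvalCase(
        "Compare two research approaches and cite evidence.",
        mentions_all(["sources", "summary"]),
    ),
]


def stub_model(question: str) -> str:
    # Stand-in for a real model call.
    return "A summary with sources."


print(run_evals(stub_model, cases))
```

Re-running the same cases after each model or fine-tuning change gives the before/after comparison that the notes describe as the primary product signal.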

The human analogy heuristic [§ Human analogy]

Kevin’s finding: to design good AI product experiences, model the equivalent human behaviour. Examples:

Wait time UX → what would a human do while thinking? → give periodic updates.
Ensemble of models → brainstorming → humans generate better ideas in groups.
Prompt framing → ‘You are the world’s best X’ → same as asking a colleague to adopt a specific perspective.

The heuristic: when stuck on an AI product design question, ask ‘what would an equivalent human do in this situation?’ It works because there is prior art for every human communication situation. Use this as a source of design intuition, not a rule.

OpenAI PM culture [§ OpenAI operating model]

~25 PMs for 400M+ weekly active users and 3M+ API developers. PM-light is intentional.

Kevin’s model of good PM behaviour: decisiveness in ambiguity. Not making every decision (that’s micromanagement), and not making no decisions (that’s abdication). The PM’s job is to ensure that when no one else will make a call, the call gets made.

High agency + comfortable with ambiguity + leads through influence = the PM profile that works in an AI lab context. Junior PMs struggle because there is no one telling them ‘here’s your area, go do this.’