Itamar Gilad on Evidence-Guided Product
Source: Lenny’s Podcast
Speaker: Itamar Gilad
Source URL: https://www.lennysnewsletter.com/p/itamar-gilad
Key ideas
- Opinion-based vs. evidence-guided: Google+ (1,000 people, multi-year, total failure) and the Gmail tabbed inbox (no code to start, Wizard of Oz tests, now used by 1.8B users) came out of the same organisation operating in two completely different modes.
- GIST framework: Goals → Ideas → Steps → Tasks. A meta-framework integrating lean startup, design thinking, product discovery, and agile into one system. Each layer has a characteristic failure mode and a prescriptive fix.
- Confidence meter: ICE’s confidence score is almost always inflated by self-conviction. A tiered model calibrates confidence from 0 (gut feeling) to 10 (AB test result). Investment should scale with confidence level.
- GIST board: a team-level tool showing goals (key results), active ideas (ICE scores), and next steps (learning milestones). The “middle layer” between strategic roadmaps and task-level agile is the layer most organisations skip.
- Outcome roadmaps: commit to goals by a date, not features. Release roadmaps pre-commit to solutions with low confidence and actively suppress evidence-guided behaviour.
Google+ vs. Gmail tabbed inbox
Google+ (opinion-based failure)
~1,000 people at peak; entire divisions restructured; separate buildings. Hypothesis: Google needs a Facebook clone. Nobody tested whether users wanted one before the massive investment. Result: shut down in 2019 with no measurable impact on Google’s advertising revenue or Facebook’s growth.
The operating mode: plan-and-execute. Leaders believed in the idea; the organisation built it. Itamar calls this opinion-based development: “You come up with an idea, you believe in it, all the indications show it’s good… then you just go all in.”
Gmail tabbed inbox (evidence-guided success)
Starting point: no one believed in the idea. A colleague challenged Itamar: “We’ve tried inbox organisation features before, users don’t use them. Why is your idea different?” This pushed the team into genuine problem research.
Validation before a single line of code: researchers manually pre-sorted users’ top 50 inbox messages into tabs (a Wizard of Oz test). Distracted participants, then showed them their reorganised inbox. Response: “Wow, this is actually very cool.”
Key insight: ~85–88% of Gmail users are passive inbox managers who would love tabs. The team was composed almost entirely of power users, who manage their own inboxes and found the idea pointless. Evidence-guided methods caught a bias that internal opinion would have missed.
GIST framework
G — Goals
Failure mode: goals as a planning exercise — “what should we build by when?” Siloed functional goals that pull teams in different directions.
Fix: two organisation-wide metrics:
- North Star metric — value delivered to users: WhatsApp → messages sent; Airbnb → nights booked; Amplitude → weekly active learning users.
- Top business KPI — value captured: revenue or profit.
Build a metrics tree from each. The trees overlap in the middle, revealing sub-metrics that move both user value and business value. These are the highest-leverage areas for team investment.
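One way to picture the overlap is as two trees whose shared nodes are the sub-metrics worth prioritising. The sketch below is illustrative only: the tree shapes and every metric name other than the two roots' roles are invented for the example, not taken from the episode.

```python
# Hypothetical metrics trees: each maps a metric to the sub-metrics that drive it.
# All metric names and relationships below are invented for illustration.
north_star_tree = {
    "nights_booked": ["searches", "search_to_booking_rate"],
    "search_to_booking_rate": ["listing_quality", "checkout_completion"],
}
business_tree = {
    "revenue": ["bookings", "average_fee"],
    "bookings": ["checkout_completion", "repeat_guests"],
}


def all_metrics(tree, root):
    """Collect the root and every sub-metric reachable from it."""
    seen = set()
    stack = [root]
    while stack:
        metric = stack.pop()
        if metric in seen:
            continue
        seen.add(metric)
        stack.extend(tree.get(metric, []))
    return seen


def shared_sub_metrics(tree_a, root_a, tree_b, root_b):
    """Sub-metrics present in both trees, i.e. ones that move both top-level metrics."""
    return sorted(all_metrics(tree_a, root_a) & all_metrics(tree_b, root_b))


overlap = shared_sub_metrics(north_star_tree, "nights_booked", business_tree, "revenue")
print(overlap)  # ['checkout_completion']
```

In this toy example only `checkout_completion` sits in both trees, so it would be the candidate area for team investment.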
OKR integration: metrics trees + mission populate OKRs. Max four key results per team. “If you have more than four, you can’t actually deliver on them.”
I — Ideas
Failure mode: HiPPO (highest-paid person’s opinion) wins. Ideas validated by strategic theme (“it’s about AI”) rather than evidence.
Fix: ICE scoring + confidence meter.
ICE: Impact on goals × Confidence × Ease (inverse of effort). Created by Sean Ellis.
Confidence meter: a tiered calibration for ICE’s hardest dimension:
| Tier | Evidence type | Confidence (0–10) |
|---|---|---|
| Opinion | Self-conviction, pitch decks, strategic themes | 0–1 |
| Social | Stakeholder review, colleague feedback | 1–2 |
| Estimates | Back-of-envelope modelling | 2–3 |
| Anecdotal data | A few interviews, competitor has the feature | 3–4 |
| Market data | Surveys, competitive analysis, field research | 4–5 |
| Low-fidelity tests | Fake door, Wizard of Oz, usability studies | 5–7 |
| Rough builds | Early adopter programme, fish food (team testing) | 6–8 |
| Experiments | AB tests, multivariate, staged rollout | 8–10 |
Investment should scale with confidence level. A high-ease idea can skip straight to shipping; a high-stakes, low-confidence idea must climb the ladder before the team commits to it.
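The scoring can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Itamar's own tooling: impact and ease are assumed to be on a 0–10 scale, the midpoint of each tier's range is used as the confidence value, and the two example ideas and their numbers are hypothetical.

```python
from dataclasses import dataclass

# Confidence ranges copied from the table above: tier -> (low, high) on a 0-10 scale.
CONFIDENCE_TIERS = {
    "opinion": (0, 1),
    "social": (1, 2),
    "estimates": (2, 3),
    "anecdotal_data": (3, 4),
    "market_data": (4, 5),
    "low_fidelity_tests": (5, 7),
    "rough_builds": (6, 8),
    "experiments": (8, 10),
}


@dataclass
class Idea:
    name: str
    impact: float  # assumed 0-10: expected impact on the team's goals
    ease: float    # assumed 0-10: inverse of effort
    evidence: str  # strongest evidence tier gathered so far


def confidence(idea: Idea) -> float:
    """Assumption: take the midpoint of the tier's range as the confidence score."""
    low, high = CONFIDENCE_TIERS[idea.evidence]
    return (low + high) / 2


def ice_score(idea: Idea) -> float:
    """ICE = Impact x Confidence x Ease."""
    return idea.impact * confidence(idea) * idea.ease


# Hypothetical ideas with made-up impact and ease estimates.
ideas = [
    Idea("pitched in a strategy deck", impact=9, ease=3, evidence="opinion"),
    Idea("validated with a fake door", impact=6, ease=4, evidence="low_fidelity_tests"),
]
ranked = sorted(ideas, key=ice_score, reverse=True)
for idea in ranked:
    print(f"{idea.name}: ICE = {ice_score(idea):.1f}")
```

With these numbers the tested idea scores 144.0 and the pitched idea 13.5, even though the pitched idea claims the higher impact: tying confidence to evidence type is what keeps self-conviction from inflating the score.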
S — Steps
Failure mode: teams equate discovery with “build an MVP” (in practice a beta) and only then measure. This is both expensive and too late.
Fix: a spectrum of validation methods ordered by cost and confidence gained. Start at the cheapest level that can falsify your key assumption:
- Assessment: ICE, assumption mapping, stakeholder 1:1s, business modelling. No code, no research budget.
- Data: user interviews, surveys, competitive analysis, field observation.
- Low-fidelity tests: fake door, smoke test, Wizard of Oz, concierge (manual service behind a product façade).
- Rough builds: fish food (own team), longitudinal study, early adopter programme.
- More complete builds: dog fooding, preview, beta, labs.
- Experiments: AB test, multivariate test, staged rollout, hold-backs.
Each step is a learning milestone. After each: continue, pivot, or kill and move to the next ICE idea.
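The selection rule and the continue/pivot/kill loop can be sketched as follows. The function names and the queue convention are hypothetical, and deciding which categories can actually falsify an assumption remains a human judgment that the code simply takes as input.

```python
# Validation categories from the list above, ordered cheapest first.
VALIDATION_LADDER = [
    "assessment",
    "data",
    "low_fidelity_tests",
    "rough_builds",
    "more_complete_builds",
    "experiments",
]


def cheapest_step(can_falsify):
    """Return the cheapest category able to falsify the key assumption.

    `can_falsify` is the set of categories the team judges capable of
    disproving the assumption; returns None if none of them is on the ladder.
    """
    for category in VALIDATION_LADDER:
        if category in can_falsify:
            return category
    return None


def after_step(decision, idea_queue):
    """Apply a continue / pivot / kill decision to an ICE-ranked queue of ideas.

    Hypothetical convention: idea_queue[0] is the idea currently being tested.
    """
    if decision == "kill":
        return idea_queue[1:]  # drop it and move to the next ICE idea
    if decision in ("continue", "pivot"):
        return idea_queue      # keep working on it, possibly in revised form
    raise ValueError(f"unknown decision: {decision}")


print(cheapest_step({"experiments", "low_fidelity_tests"}))  # low_fidelity_tests
print(after_step("kill", ["idea A", "idea B"]))              # ['idea B']
```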
T — Tasks
Failure mode: two disconnected worlds — managers in roadmap/planning mode; developers in Jira/story-points mode; PMs exhausted trying to translate between them. Developers disengaged from outcomes.
Fix: GIST board.
GIST board contents (per team):
- Goals (max 4 key results, set at quarter start)
- Ideas under consideration (ICE scores)
- Next steps for each idea (learning milestones, not engineering milestones)
Team reviews the board every two weeks. Discussion: Are we working on the right ideas? How are we doing against goals? What’s blocking the most important steps?
“This middle layer discussion is not happening today. Most discussion happens at the roadmap level and then at the task level. The middle layer doesn’t exist.”
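As a rough illustration, a GIST board can be modelled as a small data structure. The class and field names below are hypothetical, and the example key results, ideas, and scores are invented; only the three columns and the four-key-result limit come from the notes above.

```python
from dataclasses import dataclass, field

MAX_KEY_RESULTS = 4  # limit stated in the notes


@dataclass
class BoardIdea:
    name: str
    ice_score: float
    next_step: str  # a learning milestone, not an engineering milestone


@dataclass
class GistBoard:
    key_results: list
    ideas: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.key_results) > MAX_KEY_RESULTS:
            raise ValueError("a team should hold at most four key results")

    def review_agenda(self):
        """Ideas ordered by ICE score, highest first, paired with their next steps."""
        ranked = sorted(self.ideas, key=lambda idea: idea.ice_score, reverse=True)
        return [(idea.name, idea.next_step) for idea in ranked]


# Invented example content for a single team's board.
board = GistBoard(
    key_results=["Reduce average onboarding time below 2 days"],
    ideas=[
        BoardIdea("Onboarding wizard", ice_score=40.0, next_step="usability study"),
        BoardIdea("Guided import", ice_score=96.0, next_step="fake door test"),
    ],
)
for name, step in board.review_agenda():
    print(f"{name}: next step is {step}")
```

Sorting by ICE score gives the fortnightly review a natural order: the highest-scoring idea and its next learning milestone come up first.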
Outcome vs. release roadmaps
Release roadmap: features + dates. “Launch onboarding wizard by October.” Pre-commits to a solution before confidence is established. Actively suppresses evidence-guided behaviour — teams race to ship, not to learn.
Outcome roadmap: goals + dates. “Reduce average onboarding time below 2 days by Q3.” Solution left open until confidence is gained. Once a high-confidence idea is ready to ship, promote it to a dated release milestone.
Signs you are not actually evidence-guided
- Goals are unclear, vague, or absent.
- No user-facing metrics — only revenue and business KPIs.
- Heavy roadmapping processes consuming most senior PM and leadership time.
- Little or no experimentation; when there is, no systematic learning.
- Developers are disengaged — focused on output, not outcomes.
Stage calibration
Pre-PMF startup: don’t build full metrics trees or heavy OKR frameworks. Goal is to find PMF; iterate fast; the North Star metric may not yet be known.
Series A–B: start building metrics. Lightweight GIST board and ICE scoring are directly useful.
Scale-up: full GIST is warranted. The cost of opinion-based development is highest here: more people, more wasted capacity, more cultural inertia.