Product Metrics and What They Measure

Introduction

Every metric is a proxy. Revenue is a proxy for value delivered. Churn is a proxy for dissatisfaction. Page load time is a proxy for user experience. The quality of a measurement system depends on how well its proxies track the underlying reality they are supposed to represent.

Technical decisions affect product metrics, but the relationship is indirect. A latency improvement changes page load time, which may change activation rate, which may eventually change retention. The chain is real, but it is also long, noisy, and full of confounds. Technical leaders who understand this chain make better decisions about where to invest.

The problem is not that metrics are proxies. It is that teams forget they are proxies and optimise the proxy directly instead of optimising the underlying thing the proxy represents.

Thread Activation

You have seen feedback loops before in their infrastructure form: health checks that trigger service restarts, auto-scaling that adjusts capacity to observed load, circuit breakers that halt requests to failing dependencies. In each case, a measured signal drives a corrective action. This chapter examines the feedback loop at the product layer, where the measured signals are product metrics and the corrective actions are prioritisation decisions. The engineering challenge is identical: choose the right signal. A feedback loop driven by the wrong metric will optimise the wrong thing as reliably as a loop driven by the right metric will optimise the right thing.

The Concept

Metrics fall into two categories, depending on whether they measure an outcome after it has happened or the activity that precedes it.

Lagging indicators measure what has already happened. Revenue, churn, and customer lifetime value are lagging. By the time a lagging metric changes, the cause is weeks or months in the past. Lagging metrics are accurate but slow. They confirm outcomes; they do not help steer toward them.

Leading indicators measure current activity that predicts future outcomes. Activation rate — the proportion of new users who complete a meaningful first action — predicts retention. Feature engagement — which features users return to — predicts long-term value. These metrics are faster but noisier; they can be misleading if the team misunderstands what they are actually measuring.

Technical decisions primarily affect leading indicators. A deployment that improves time-to-first-meaningful-interaction by 20% will show up in activation rate within days. It may show up in revenue six months later, buried in many other causal factors. Measuring technical investments against lagging indicators introduces an attribution gap that makes technical work invisible to business decision-makers.

The Technical Grounding

Goodhart’s Law (L9) states: when a measure becomes a target, it ceases to be a good measure. This principle was originally applied to economic policy, but it applies with equal force to product engineering.

A team measured on bugs closed will close bugs without fixing the underlying problems, or will reclassify open bugs to reduce counts, or will write code that produces easily-closable small bugs rather than addressing root causes. The metric became a target. The metric stopped measuring what it was supposed to measure.

A team measured on features shipped will ship small, low-complexity features to maximise count. The metric rewards shipping over impact. A single feature that triples retention counts as one feature. Thirty minor UI tweaks that produce no measurable change count as thirty.

The design principle that resists Goodhart corruption is this: choose metrics where the only viable path to improvement is to improve the actual outcome. Retention is difficult to game because the only way to retain users is to deliver continuing value. Users who find no value leave. No amount of notification-spam or dark-pattern friction holds them indefinitely.

A Goodhart-resistant metric for a search product is query reformulation rate: the percentage of searches where the user modifies their query and searches again within the same session. A high reformulation rate means users are not finding what they need on the first attempt. The metric is resistant to gaming because the only way to reduce reformulation is to return better results. A team cannot artificially suppress reformulations without degrading the search experience in ways that show up immediately in other metrics. The metric is also hard to Goodhart in the opposite direction — returning fewer results does not reduce reformulation, it increases it. The incentive gradient points toward genuine improvement (AT9 — Speed vs Quality: investing in result relevance is slower than optimising click-through, but reformulation rate rewards relevance directly).
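As a rough illustration of how reformulation rate might be computed, the sketch below assumes a search log of (session_id, timestamp, query) tuples. The log shape, the function name, and the rule that any change to the normalised query counts as a reformulation are assumptions made for this example, not part of any particular analytics stack.

```python
from collections import defaultdict


def reformulation_rate(searches):
    """Fraction of searches followed, in the same session, by a different query.

    searches: iterable of (session_id, timestamp, query) tuples (assumed shape).
    """
    by_session = defaultdict(list)
    for session_id, ts, query in searches:
        # Normalise so that case and stray whitespace do not count as a change.
        by_session[session_id].append((ts, query.strip().lower()))

    total = 0
    reformulated = 0
    for events in by_session.values():
        events.sort()  # order each session's searches by timestamp
        for i, (_, query) in enumerate(events):
            total += 1
            if i + 1 < len(events) and events[i + 1][1] != query:
                reformulated += 1
    return reformulated / total if total else 0.0


# Hypothetical toy log: session s1 refines its query twice, s2 searches once.
log = [
    ("s1", 1, "red shoes"),
    ("s1", 2, "red running shoes"),
    ("s1", 3, "red running shoes size 10"),
    ("s2", 1, "laptop stand"),
]
rate = reformulation_rate(log)
```

In this toy log, two of the four searches are followed by a modified query in the same session, so the rate is 0.5. A production version would likely also need a session timeout and some way to tell reformulations apart from unrelated follow-up searches.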

A Goodhart-resistant metric for a payments product is the successful-transaction-to-dispute ratio: the number of successful transactions for every dispute filed. A team measured on transaction volume alone will approve marginal transactions that generate disputes. A team measured on dispute rate alone will reject legitimate transactions to keep disputes down. The ratio forces both sides of the equation to improve together. Good: above 500:1 for a consumer payments platform. Bad: below 100:1, which indicates either fraud or a broken merchant experience. The ratio is hard to game because reducing disputes requires genuinely preventing fraud and ensuring merchant quality, while increasing successful transactions requires genuinely reducing checkout friction (FM11 — Observability Blindness: measuring only volume misses the dispute signal until it becomes a regulatory problem).
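The ratio and its two bands can be expressed in a few lines. The sketch below uses the 500:1 and 100:1 figures from the text; the function names and the "watch" label for values between the two bands are assumptions added for illustration.

```python
def transaction_to_dispute_ratio(successful, disputes):
    """Successful transactions per dispute filed."""
    if disputes == 0:
        return float("inf")  # no disputes in the period
    return successful / disputes


def classify_ratio(ratio, good=500.0, bad=100.0):
    """Map a ratio to a band. The 'watch' band is a label assumed for this sketch."""
    if ratio > good:
        return "good"
    if ratio < bad:
        return "bad"
    return "watch"


# Hypothetical monthly figures.
healthy = classify_ratio(transaction_to_dispute_ratio(600_000, 1_000))   # 600:1
middling = classify_ratio(transaction_to_dispute_ratio(120_000, 300))    # 400:1
troubled = classify_ratio(transaction_to_dispute_ratio(50_000, 600))     # about 83:1
```

Because the ratio divides one count by the other, approving marginal transactions raises disputes and lowers it, while rejecting legitimate transactions lowers the numerator and lowers it as well.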

A third example: a SaaS product measured on “active users” can inflate the number with email-triggered logins that produce no meaningful activity. The Goodhart-resistant alternative is weekly core-action users — users who performed at least one action central to the product’s value proposition (creating a document, sending an invoice, deploying a build) within the last seven days. The metric cannot be inflated by notification-driven opens because opens without core actions do not count.
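A minimal sketch of the weekly core-action count follows. It assumes events arrive as (user_id, event_type, timestamp) tuples and that the set of core actions is configured per product; the event names below are hypothetical stand-ins for the examples in the text.

```python
from datetime import datetime, timedelta

# Hypothetical event names standing in for the product's core actions.
CORE_ACTIONS = {"create_document", "send_invoice", "deploy_build"}


def weekly_core_action_users(events, as_of, core_actions=CORE_ACTIONS):
    """Users with at least one core action in the seven days ending at as_of."""
    window_start = as_of - timedelta(days=7)
    return {
        user_id
        for user_id, event_type, ts in events
        if event_type in core_actions and window_start < ts <= as_of
    }


as_of = datetime(2024, 3, 15, 12, 0)
events = [
    ("u1", "login", as_of - timedelta(days=1)),             # email-triggered open only
    ("u2", "create_document", as_of - timedelta(days=2)),   # core action inside the window
    ("u3", "send_invoice", as_of - timedelta(days=10)),     # core action, but too old
]
active = weekly_core_action_users(events, as_of)
```

Only u2 counts here: u1 logged in without performing a core action, and u3's core action falls outside the seven-day window.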

The HEART framework provides a structured approach to this. HEART stands for Happiness, Engagement, Adoption, Retention, and Task success. Each dimension measures a different aspect of user value:

Happiness measures user satisfaction, typically through surveys or sentiment analysis. It is the hardest to measure automatically and the most qualitative. Technically, it requires infrastructure for collecting and analysing survey responses at the right moment in the user flow. Instrument happiness with in-app NPS or CSAT prompts triggered after task completion, not on page load. Good: NPS above 50 for a B2B product, CSAT above 4.2/5 for a consumer product. Bad: NPS below 0, or a declining 90-day trend regardless of absolute number. Common misinterpretation: a high NPS driven entirely by power users while casual users never see the survey. The metric looks healthy. The product is losing everyone except the hardcore segment.

Engagement measures the depth of interaction: how often users return, how much of the product they use per session, how many of the core features they activate. High engagement is a proxy for the user finding continuing value. Technically, it requires event tracking at the feature level. Instrument engagement by emitting a named event for each core feature activation, not for each page view. Track the L7/L30 ratio: users active on 7 of the last 30 days, divided by users active at least once in that window. Good: L7/L30 above 0.25 for a consumer app, above 0.6 for a daily-use tool. Bad: L7/L30 below 0.1, which means users try the product and do not come back within the week. Common misinterpretation: high session count driven by users repeatedly failing to complete a task. Each failure is an “engagement” event. The metric rises while the experience degrades (AT9 — Speed vs Quality: high-frequency instrumentation costs engineering time but catches this trap).

Adoption measures the proportion of users reaching a defined milestone, either for a new feature or for first activation. A high adoption metric on a new feature is only meaningful if the users who adopted it retained. Technically, adoption requires funnel tracking and cohort analysis. Instrument adoption with a funnel: exposure → activation → repeated use. Track activation rate as users-who-completed-milestone divided by users-who-were-exposed. Good: 30%+ activation rate for a feature promoted in the UI. Bad: below 5%, which means the feature is either invisible or irrelevant. Common misinterpretation: measuring adoption as “users who clicked the button” instead of “users who completed the action.” A 60% click rate with a 4% completion rate means the feature is broken, not adopted (FM11 — Observability Blindness: measuring the wrong event produces false confidence).

Retention measures the proportion of users returning after a defined interval. Early-interval retention is the strongest leading indicator of long-term value. Technically, it requires time-series analysis of user activity and cohort comparisons. Instrument retention with cohort tables: group users by signup week, measure what percentage are active at day 1, day 7, day 30, day 90. Good: day-30 retention above 20% for a consumer app, above 40% for a B2B SaaS product. Bad: day-7 retention below 10%, which means the product delivers no reason to return. Common misinterpretation: measuring retention as “any activity” when the product has a notification system that forces opens. A user who opens the app to dismiss a notification and immediately closes it is “retained” by the metric and lost by any meaningful standard.

Task success measures whether users who attempt a specific action complete it. A high task success rate means the system is doing its job. A low one means friction exists somewhere between intent and completion. Technically, it requires instrumentation at the task level, not just at the page level. Instrument task success by defining start and end events for each critical flow — not just the final success event. Track completion rate and median time-to-completion. Good: above 90% for a checkout flow, above 80% for a multi-step form. Bad: below 60% for any flow the product depends on. Common misinterpretation: a 95% task success rate that excludes users who abandoned before reaching the first step. The funnel starts at intent, not at step one. If 40% of users who navigate to the task page never start it, the real success rate is closer to 57%.
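Two of the dimensions above reduce to short computations. The sketch below shows one possible reading of the L7/L30 ratio and of the intent-adjusted task success rate. The input shapes, the function names, and the choice to treat "active 7 of the last 30 days" as at least 7 distinct days are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def l7_l30(activity, as_of):
    """Users active on >= 7 distinct days in the last 30, over users active at all.

    activity: iterable of (user_id, date) pairs, one per active day (assumed shape).
    """
    window_start = as_of - timedelta(days=29)  # 30-day window including as_of
    days = defaultdict(set)
    for user_id, day in activity:
        if window_start <= day <= as_of:
            days[user_id].add(day)
    if not days:
        return 0.0
    frequent = sum(1 for active_days in days.values() if len(active_days) >= 7)
    return frequent / len(days)


def task_success(visited, started, completed):
    """Return (rate among starters, rate among everyone who showed intent)."""
    step_rate = completed / started if started else 0.0
    intent_rate = completed / visited if visited else 0.0
    return step_rate, intent_rate


today = date(2024, 3, 30)
activity = [("u1", today - timedelta(days=i)) for i in range(8)] + [("u2", today)]
engagement = l7_l30(activity, today)

# The example from the text: 40% of visitors never start, 95% of starters finish.
step_rate, intent_rate = task_success(visited=1000, started=600, completed=570)
```

With these inputs the reported 95% completion rate becomes 57% once the funnel starts at intent, which matches the figure in the text.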

Instrumentation is not a measurement afterthought. It is an architectural requirement. A system that does not emit events at the right granularity cannot produce HEART metrics. The observability of user behaviour must be designed in from the start.

Consider building a HEART measurement system for a B2B SaaS product with 100K monthly active users. The event schema is the foundation: every user action emits a structured event with fields user_id (string), event_type (enum: page_view, feature_activation, task_start, task_complete, task_abandon, survey_response), timestamp (ISO 8601), and properties (JSON map of event-specific data such as feature name, task name, survey score, and duration). The top level of the schema is intentionally flat, with event-specific detail confined to properties, because nested structures complicate querying downstream.
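One way to pin the schema down in code is sketched below. The field names and event types come from the description above; the class names, the parse_event helper, and the rule that any malformed payload is rejected by returning None are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict, Optional


class EventType(str, Enum):
    PAGE_VIEW = "page_view"
    FEATURE_ACTIVATION = "feature_activation"
    TASK_START = "task_start"
    TASK_COMPLETE = "task_complete"
    TASK_ABANDON = "task_abandon"
    SURVEY_RESPONSE = "survey_response"


@dataclass
class Event:
    user_id: str
    event_type: EventType
    timestamp: datetime
    properties: Dict[str, Any] = field(default_factory=dict)


def parse_event(raw: Dict[str, Any]) -> Optional[Event]:
    """Return an Event, or None when the payload is malformed and should be dropped."""
    try:
        user_id = raw["user_id"]
        properties = raw.get("properties", {})
        if not isinstance(user_id, str) or not user_id:
            return None
        if not isinstance(properties, dict):
            return None
        return Event(
            user_id=user_id,
            event_type=EventType(raw["event_type"]),  # ValueError on unknown types
            # Expects an explicit offset such as +00:00; older Python versions
            # do not accept a trailing "Z" in fromisoformat.
            timestamp=datetime.fromisoformat(raw["timestamp"]),
            properties=properties,
        )
    except (KeyError, ValueError, TypeError):
        return None


good = parse_event({
    "user_id": "u1",
    "event_type": "task_complete",
    "timestamp": "2024-03-15T12:00:00+00:00",
    "properties": {"task_name": "export", "duration_ms": 4200},
})
bad = parse_event({"user_id": "u1", "event_type": "button_mash"})
```

Keeping the enum closed means a new event type has to be added to the schema deliberately, which is one way to limit the over-instrumentation risk of emitting events nobody planned to query.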

The pipeline is four stages. The client (web or mobile) emits events to a lightweight ingestion endpoint that validates the schema and drops malformed events. The ingestion endpoint publishes to an event bus (Kafka or a managed equivalent). A stream processor consumes from the bus, enriches events with user metadata (signup cohort, plan tier, geography), and writes to the analytical warehouse (BigQuery, Snowflake, or ClickHouse). A scheduled job runs nightly, computing the HEART dimensions from the enriched events and writing the results to a dashboard table.
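The validation and enrichment stages can be sketched in memory to show the shape of the data as it moves through the pipeline. Everything here is a stand-in: a real system would put a bus and a stream processor between these functions, and the function names, the required-field check, and the metadata lookup table are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator

REQUIRED_FIELDS = ("user_id", "event_type", "timestamp")
METADATA_FIELDS = ("signup_cohort", "plan_tier", "geography")


def ingest(raw_events: Iterable[dict]) -> Iterator[dict]:
    """Stand-in for the ingestion endpoint: drop events missing required fields."""
    for event in raw_events:
        if all(key in event for key in REQUIRED_FIELDS):
            yield event


def enrich(events: Iterable[dict], user_metadata: Dict[str, dict]) -> Iterator[dict]:
    """Stand-in for the stream processor: attach user metadata to each event."""
    for event in events:
        meta = user_metadata.get(event["user_id"], {})
        enriched = dict(event)  # copy so the original event is left untouched
        for key in METADATA_FIELDS:
            enriched[key] = meta.get(key)  # None when metadata is unknown
        yield enriched


raw_events = [
    {"user_id": "u1", "event_type": "task_start", "timestamp": "2024-03-15T12:00:00+00:00"},
    {"event_type": "page_view"},  # malformed: no user_id or timestamp
]
user_metadata = {
    "u1": {"signup_cohort": "2024-W05", "plan_tier": "team", "geography": "DE"},
}
warehouse_rows = list(enrich(ingest(raw_events), user_metadata))
```

Enriching before the warehouse write means the nightly job can group by cohort or plan tier without joining against the user table on every run.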

The HEART dashboard for this product tracks five numbers, updated daily. Happiness: trailing 30-day NPS, computed from survey_response events triggered after task completion — target above 40, alert below 25. Engagement: L7/L30 ratio of core-action users — target above 0.5 for a daily-use B2B tool, alert below 0.3. Adoption: 14-day activation rate for each feature released in the last quarter — target above 20%, alert below 8%. Retention: week-over-week cohort retention at week 4 and week 12 — target above 60% at week 4 and 45% at week 12, alert if any weekly cohort drops 10 points below the trailing average. Task success: completion rate for the top five user flows (onboarding, core action, billing, settings change, export) — target above 85%, alert below 70%.
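The targets and alert levels above could be encoded as a small table that the nightly job checks. In this sketch the numbers come from the text, while the metric keys, the "watch" status for values between alert and target, and the function names are assumptions for illustration.

```python
# metric key -> (target, alert); values are taken from the dashboard description.
THRESHOLDS = {
    "happiness_nps": (40, 25),
    "engagement_l7_l30": (0.5, 0.3),
    "adoption_14d_activation": (0.20, 0.08),
    "task_success_completion": (0.85, 0.70),
}


def status(metric, value):
    """Classify a daily value. 'watch' is a label assumed for this sketch."""
    target, alert = THRESHOLDS[metric]
    if value < alert:
        return "alert"
    if value > target:
        return "on_target"
    return "watch"


def retention_alert(cohort_retention, trailing_average, drop_points=10.0):
    """True when a weekly cohort sits 10 or more percentage points below the average."""
    return trailing_average - cohort_retention >= drop_points


nps_status = status("happiness_nps", 42)
engagement_status = status("engagement_l7_l30", 0.28)
adoption_status = status("adoption_14d_activation", 0.12)
week4_alert = retention_alert(cohort_retention=51.0, trailing_average=62.0)
```

Retention gets its own function because its alert is relative to a trailing average rather than a fixed level.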

The engineering cost is real. The event schema, ingestion endpoint, bus integration, enrichment job, and dashboard amount to roughly two engineer-weeks of initial build and one engineer-week per quarter of maintenance. The event volume at 100K MAU with an average of 50 events per user per day is at most about 150M events per month (100K users × 50 events × 30 days, assuming every user is active daily), which is modest for any modern warehouse. The cost of not building it is operating blind: making product decisions based on anecdote, support tickets, and lagging revenue numbers that arrive too late to steer (AT9 — Speed vs Quality: the instrumentation investment slows initial delivery but produces the feedback signal that makes every subsequent decision more accurate).
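The volume estimate is simple arithmetic, reproduced below. The 30-day month and the assumption that every monthly active user is active every day make it an upper bound; the 500-byte average event size is a placeholder assumed purely for illustration, not a measured figure.

```python
mau = 100_000
events_per_user_per_day = 50
days_per_month = 30  # assumes every monthly active user is active every day

events_per_month = mau * events_per_user_per_day * days_per_month

# Assumed placeholder: 500 bytes per stored event, before compression.
assumed_bytes_per_event = 500
raw_gb_per_month = events_per_month * assumed_bytes_per_event / 1e9
```

Under these assumptions the pipeline handles 150 million events and roughly 75 GB of raw data per month. Measuring real event sizes and daily activity would tighten both numbers.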

Real-World Examples

Facebook’s early growth team used a specific activation metric: did a new user add seven friends within ten days? This was not an arbitrary number. Data showed that users who crossed this threshold retained at dramatically higher rates. The metric predicted retention. Everything in the onboarding flow was optimised toward it.

The discovery process was retention cohort analysis, not intuition. The growth team pulled every user who signed up in a given month and measured their 30-day retention rate. They then segmented those users by the number of friends they had added by day 10. Plotting 30-day retention on the y-axis against friend count at day 10 on the x-axis produced a curve with a clear inflection point: retention climbed steeply from zero to seven friends, then flattened. A user with seven friends retained at roughly the same rate as a user with twenty. A user with three friends retained at half that rate. The inflection point was the threshold. Below it, the product had not delivered enough value to stick. Above it, the network effect was self-sustaining.

This is the general method for finding activation metrics. Pick a retention window (30 days is standard). Pick a candidate behaviour (friend adds, documents created, messages sent). Plot retention against behaviour count at a fixed early window (day 7 or day 10). Look for the inflection point. The inflection point is the activation threshold. If there is no inflection, and retention simply rises linearly with the behaviour, the behaviour offers no natural threshold to optimise toward, and it may merely accompany retention rather than drive it. The threshold test identifies behaviours worth testing as predictors of retention; it does not by itself establish that they cause it.
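The method can be sketched as two steps: build the retention curve, then look for the point where it stops climbing. The flatness rule below (every later step adds less than two percentage points) is only one possible definition of an inflection point, and the curve values are invented for illustration; they are not Facebook's data.

```python
from collections import defaultdict


def retention_by_count(users):
    """users: iterable of (behaviour_count_at_early_window, retained: bool) pairs."""
    totals = defaultdict(int)
    retained = defaultdict(int)
    for count, kept in users:
        totals[count] += 1
        retained[count] += int(kept)
    return {count: retained[count] / totals[count] for count in sorted(totals)}


def find_threshold(curve, flat_gain=0.02):
    """Smallest count the curve climbs into and is flat after, or None if there is none."""
    counts = sorted(curve)
    # gains[i] is the retention gained moving from counts[i] to counts[i + 1].
    gains = [curve[b] - curve[a] for a, b in zip(counts, counts[1:])]
    for i in range(1, len(counts) - 1):
        climbed_into = gains[i - 1] >= flat_gain
        flat_after = all(gain < flat_gain for gain in gains[i:])
        if climbed_into and flat_after:
            return counts[i]
    return None


# Invented curve: 30-day retention by number of actions taken by day 10.
synthetic_curve = {0: 0.10, 1: 0.16, 2: 0.22, 3: 0.28, 4: 0.34, 5: 0.40,
                   6: 0.46, 7: 0.55, 8: 0.555, 9: 0.56, 10: 0.56}
threshold = find_threshold(synthetic_curve)

linear_curve = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
no_threshold = find_threshold(linear_curve)

tiny_curve = retention_by_count([(0, False), (0, True), (1, True)])
```

On the invented curve the function returns 7, and on the linear curve it returns None. Real cohort data is noisier, so the gain cut-off would need tuning and the buckets would need enough users each to be trusted.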

The method has a critical limitation. Correlation is not causation. Users who add seven friends may be inherently more social and would have retained regardless. Facebook validated the causal direction by running experiments: they changed the onboarding flow to make friend-adding easier and measured whether the users who were nudged across the threshold retained at the same rate as users who crossed it organically. They did. The threshold was causal, not just correlational.

Amazon’s page load time research produced an often-cited finding: every 100 milliseconds of added latency reduced sales by 1%. This is a direct relationship between a technical metric and a lagging business outcome. The finding is real, but it is also specific to Amazon’s scale, user base, and competitive environment. Applying it verbatim to a different product without validation is an example of treating a measurement as a universal law rather than a context-specific proxy.

Google’s HEART framework itself emerged from the recognition that measuring web products only by page views rewarded low-quality, high-volume content. The metric had been gamed. Replacing it with user-centric measures required changing both the measurement infrastructure and the product development process.

The Tradeoffs

Precise metrics require instrumentation investment. Instrumentation adds complexity to the codebase, increases event volume, and creates data storage requirements. Teams that want high-resolution product metrics must pay an engineering cost. Teams that want to avoid that cost operate with lower resolution visibility into product behaviour.

The inverse tradeoff is over-instrumentation. A system that emits events for every user action produces so much data that the signal is lost in the noise. Useful measurement requires selecting the right instrumentation points, which requires understanding the user flows that matter before building the measurement infrastructure.

What Goes Wrong

Metric misalignment is the dominant failure. The team optimises for a metric that was a good proxy when chosen and has since drifted from the underlying reality. Daily active users was a useful metric for social products until teams discovered they could inflate it with push notifications. Users found the notifications annoying and eventually deleted the app to escape them. The metric went up. Retention went down. The proxy had failed.

A second failure is metric isolation. Teams measure their subsystem’s metric without measuring its impact on downstream metrics. Infrastructure teams optimise for deployment frequency without asking whether the deployments change user outcomes. Product teams optimise for feature adoption without asking whether adopted features retain users. Each team’s metric looks good. The product stagnates.

Concept: Every metric is a proxy for user value. The quality of a measurement system depends on how well its proxies resist Goodhart corruption and how well they predict the outcomes they are supposed to represent.

Thread: T11 (Feedback) ← metrics create feedback loops that shape engineering behaviour → choose metrics where improvement requires improving the actual outcome

Core Idea: Leading indicators predict outcomes; lagging indicators confirm them. Technical decisions affect leading indicators. Goodhart’s Law corrupts metrics that become targets. Instrumentation is a first-class architectural requirement.

Tradeoff: AT9 (Speed vs Quality) — high-resolution metrics require instrumentation investment; teams that skip that investment operate faster but with lower visibility into whether speed is producing value

Failure Mode: FM11 (Observability Blindness) — without instrumentation at the task level, teams cannot distinguish between a product that is working and a product that looks like it is working

Signal: When a metric improves but the user outcome it is supposed to predict does not, the proxy relationship has broken down and the metric must be replaced

Maps to: Book 0, Framework 9 (Laws)

Reflection Questions

List three metrics used in your current product. For each, identify what underlying outcome the metric is a proxy for. Under what conditions would the metric improve while the underlying outcome degrades?
Apply Goodhart’s Law to two metrics your team currently uses. Can you construct a scenario where a team could improve the metric without improving the underlying outcome? How would you redesign the metric to resist that exploit?
Map your product’s key user flows to the HEART framework. Which dimensions do you currently instrument? Which dimensions are you measuring with insufficient granularity? What would it cost to fix the gaps?
The chapter argues that instrumentation is a first-class architectural requirement, not an afterthought. How does your current system’s architecture support or hinder the ability to add instrumentation? What would need to change to instrument at the task level rather than the page level?
