Beyond launch metrics: Two case studies in crafting A/B tests

In this article, Saee Pansare, Principal Product Manager at DoorDash, explores common themes encountered in her daily work with A/B testing, highlighting key pitfalls and best practices for effective experiment design.

As a product manager in the e-commerce customer experience (CX) space, I have not only launched experiments myself but have also had the opportunity to serve as an A/B test design reviewer for other product managers’ work. Through this experience, one pattern became increasingly clear: one’s experiment can fail even before the first user sees the implemented change.

In my role, I have had a unique vantage point, seeing not just what worked but, more importantly, what failed due to common design mistakes. From simple UI changes to backend fulfillment process changes, I have seen product managers sail the choppy waters of experimentation, and I have observed both what sets tests up for success and what leads to wasted time, resources, and credibility.

Drawing from that experience, I would like to share my learnings through two case studies. My hope is to help product managers design experiments with confidence and avoid adding to the pile of tests that end with inconclusive results.

Case #1: The pilot

Context: At Amazon, I was working on a 0-to-1 product where I owned the customer-facing app experience and a partner team managed the fulfillment work. The fulfillment technology was novel and required significant upfront resource investment, restricting us to launching only one site at a time, with each site expansion taking over six months. On the CX side, our objective was to launch a new version of the brand for this program, and we were keen to ensure that any potential customer misunderstanding wouldn't hurt Amazon's or its partners' reputation. Additionally, since the program's format was new to Amazon's customers, I was managing several opinion-based debates between stakeholders to nail down the MLP (Minimum Lovable Product) CX.

Challenges: Typically, at Amazon, there is a strong bias towards launching and then continuously A/B testing to home in on the right CX. This approach works because we have a large customer base to experiment with and because many decisions are reversible 'two-way-door' decisions that are difficult to resolve without actual customer feedback.

i) However, in this case, because we were live at only one site, we could not A/B test: our experiment would never reach sufficient statistical power. Power is the likelihood that a statistical test detects an effect when there is a real, non-zero relationship between variables in the population.

ii) Additionally, even with sufficient statistical power, an A/B test did not seem like the right method for testing brand assets, since exposing customers to two different brand identities at once could confuse them about what the product is.

Realizing A/B testing would not be an applicable solution for our project, we implemented the following validation strategies: 

i) We conducted in-depth qualitative testing to align on the launch CX decisions, such as the UI components and the user flow. This involved putting our mocks in front of a recruited group of customers and asking them questions about clarity of the program, thoughts about the change, the actions they would perform if they saw this program live, etc. While we performed qualitative testing on all projects, we specifically went a level deeper in this case and recruited a larger group for feedback, knowing that we would not have the opportunity to A/B test until site expansion reached a critical mass of customers. 

ii) We defined clear launch metrics for the pilot and conducted a pre- and post-launch analysis to determine success. Pre- and post- analyses are imperfect because they do not control for seasonality, macroeconomic conditions, changes in customer mindset, etc. However, they can be a close proxy for directional validation; a simple version is sketched below.
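
For illustration, here is a minimal sketch of such a directional pre/post check, assuming daily values of a launch metric for comparable windows before and after launch. The data, window lengths, and the choice of a simple t-test are illustrative assumptions, not the exact analysis we ran.

```python
# Minimal pre/post directional check (data and window lengths are illustrative).
from scipy import stats

pre_launch  = [0.041, 0.039, 0.043, 0.040, 0.042, 0.038, 0.044]  # daily metric, week before launch
post_launch = [0.046, 0.044, 0.047, 0.045, 0.043, 0.048, 0.046]  # daily metric, week after launch

lift = (sum(post_launch) / len(post_launch)) / (sum(pre_launch) / len(pre_launch)) - 1
t_stat, p_value = stats.ttest_ind(post_launch, pre_launch)

print(f"Directional lift: {lift:+.1%} (p = {p_value:.3f})")
# Caveat: nothing here controls for seasonality or traffic mix, so treat the
# result as directional validation rather than causal evidence.
```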

Key takeaways:

i) A/B testing may not be appropriate when launching strategic changes: If you are implementing a brand refresh that includes a major UI and logo redesign for your product, A/B testing or a phased rollout may not be suitable because of the impact on brand consistency and customer expectations.

ii) Design for power: A/B testing is not useful for products and companies that lack a sufficient sample size for meaningful results. While there are a few A/B testing sample size calculators out there, the rule of thumb is to not think about A/B testing until you have at least 30,000 visitors per variant (a pre-test power calculation is sketched after this list). A few ways to increase sample size are broadening the customer segments in the test, testing in multiple geographies at once, and leveraging marketing to drive more traffic to your test. Note, too, that increasing the number of variants dilutes traffic allocation and reduces the power of the experiment.

iii) There are alternate validation methods: If A/B testing is not an option, qualitative testing, user experience research, and pre- and post- analyses can provide directional feedback on customer sentiment. 
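
As a companion to takeaway (ii), here is a minimal sketch of a pre-test power calculation using statsmodels. The 2% baseline conversion rate, the 10% relative minimum detectable effect, and the 80% power / 5% significance settings are illustrative assumptions, not figures from the case study.

```python
# Minimal pre-test power calculation (all rates and thresholds are assumptions).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.020            # assumed control conversion rate
minimum_detectable_rate = 0.022  # smallest treatment rate worth detecting (10% relative lift)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(minimum_detectable_rate, baseline_rate)

# Visitors required per variant at 80% power and 5% significance
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    power=0.80,
    alpha=0.05,
    ratio=1.0,                   # equal traffic split between control and treatment
    alternative="two-sided",
)
print(f"Required visitors per variant: {n_per_variant:,.0f}")
```

If the required sample is far beyond the traffic you can realistically reach, that is a signal to fall back on the qualitative and pre/post methods described above.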

Case #2: The overlap

Context: In this project, we were focused on enhancing the grocery-scoped search functionality in the Amazon app. The feature surfaced additional items to customers and resulted in a new checkout flow. To reduce engineering lift and keep the experience consistent for customers, we reused many existing components (the call-to-action button and individual checkout pages such as slot selection, payment selection, and pre-checkout upsell) but in a different context.

Challenges:

i) Unexpected CX: During testing, we encountered user flows that were completely different from the ones we had designed. They arose because the component-owning teams were also running experiments in their domains that were not adapted to the grocery user's flow and were degrading the customer experience. While we caught the experiment interactions before launch, our last-minute approach to the component teams to find a resolution created friction: the teams were caught off guard by our request to coordinate experiment timelines and customer impact.

ii) Performance variation by brand: We ran the experiment across two retailer brands and saw mixed results. The overall Amazon impact (a key launch criterion) was positive, but there was a significant performance disparity between the two brands at the retailer level. Despite directional metrics showing stronger engagement in one brand, our experiment design did not allow us to launch selectively in the successful brand/retailer because of limitations in calculating the overall Amazon impact by brand. Our primary concern was that the poor performance at one retailer could be due to cannibalization from the rest of Amazon. Ultimately, to get conclusive results, we split the experiment into two (one per brand) and ran them again.

Key Takeaways: 

i) Proactively check for experiment interactions: In large organizations with a strong culture of experimentation, such as Amazon, tens of experiments are running at any given time. As a product manager, I find it wise to investigate whether other experiments may impact the flow you are testing. Broadly, the changes that cannot be tested concurrently are those that compete for the same code and page space you are testing in, or that would combine into a confusing customer experience; a simple way to surface such overlaps is sketched after this list.

ii) Determine success metrics upfront and adhere to them: As product managers, we tend to consider our products our babies, and we want them to succeed. That attachment unfortunately introduces bias into decision-making, which is why we must determine success criteria before launching the experiment. Post-launch, of course, we must follow those evaluation criteria. If you must deviate, I recommend deep diving into why the metrics didn't behave as expected and following an exception-style process: review the reasons for and against the original criteria and get feedback from multiple unbiased stakeholders before making a decision. Talking to other people provides a second perspective and can motivate us to do the right thing for the customer, even if it is inconvenient for us in the short term.
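
One lightweight way to make interactions visible is a shared registry in which each team records the surfaces (pages or components) its experiment touches, plus an automated overlap check. The sketch below assumes such a registry exists; the experiment names and surfaces are hypothetical.

```python
# Minimal experiment-interaction check (experiment names and surfaces are hypothetical).
from itertools import combinations

experiment_surfaces = {
    "grocery_scoped_search_v2": {"search_results", "checkout_cta", "slot_selection"},
    "checkout_cta_copy_test":   {"checkout_cta"},
    "homepage_banner_refresh":  {"homepage_hero"},
}

# Flag any pair of concurrent experiments that touch the same surface.
for (exp_a, surfaces_a), (exp_b, surfaces_b) in combinations(experiment_surfaces.items(), 2):
    overlap = surfaces_a & surfaces_b
    if overlap:
        print(f"Potential interaction: {exp_a} and {exp_b} both touch {sorted(overlap)}")
```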

Common pitfall: Incorrect triggers

A common yet overlooked element of experiment design is trigger accuracy. A trigger is a data log recording that a customer was assigned to treatment A or B at time t. Over-triggering occurs when we log triggers for more users than are actually exposed to the treatment; under-triggering occurs when a user experiences the treatment but we don't capture it in our logs. Inaccurate trigger counts add noise to the data, which can bias your results or invalidate the experiment altogether. To navigate this issue, I've found it rewarding to work with my engineering partner on a triggering diagram so we are 100% aligned on trigger logging and code expectations; a simplified version of the exposure-time logging pattern is sketched below.
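
To make the idea concrete, here is a minimal sketch of exposure-time trigger logging. assign_variant() and log_trigger() are hypothetical stand-ins for a real bucketing service and logging pipeline, not any company's actual tooling.

```python
# Minimal sketch: log the trigger where the user is actually exposed to the
# treatment, not at bucketing time, to avoid over-triggering users who never
# reach the change.
import hashlib
import time

def assign_variant(user_id: str, experiment: str) -> str:
    # Hypothetical stand-in for a deterministic bucketing service (50/50 split).
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 2
    return "treatment" if bucket == 1 else "control"

def log_trigger(**event) -> None:
    # Hypothetical stand-in for the experiment logging pipeline.
    print("trigger", event)

def render_checkout(user_id: str) -> str:
    variant = assign_variant(user_id, experiment="new_checkout_flow")
    # The trigger is logged here, at the moment the customer sees the variant.
    log_trigger(user_id=user_id, experiment="new_checkout_flow",
                variant=variant, timestamp=time.time())
    return "new_checkout_page" if variant == "treatment" else "existing_checkout_page"

print(render_checkout("customer-123"))
```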

Conclusion

Designing an A/B test correctly is important to avoid invalidating your results, wasting time, or, worse, making decisions based on incorrect information. Product managers are often exhausted by the time development is complete and don't bring the same level of detail or objectivity to this last stretch of the design process. To avoid this pitfall, it is valuable to build your own A/B testing checklist of best practices that you can rely on. Additionally, reviewing experiment design with broader stakeholders can provide insightful feedback and build accountability.
