Building a culture of experimentation and enabling a smooth process for building, running, and analyzing experiments across all product lines. The platform, named EXP, was later to be extended to merchants so they could independently run A/B tests on their online stores and measure the impact of their business decisions based on buyer behavior.

My Role

The Team

Time Frame

Product Design Lead
UX Research
UI Design
User Testing & Analysis
Prototyping
Workshop Facilitation

3 Data Developers
1 Front End Developer
1 Product Manager
2 Data Scientists
1 Product Designer

Jan 2020 – May 2020

Led to a 1000% increase in experiments created within the first 6 months of launch.

Overview

Project Background

Experimentation (or online A/B testing) helps make informed product decisions based on key metrics of interest. For a growing organisation such as Shopify, it becomes increasingly important to measure the impact of decisions made across the product offerings to merchants. To support the fast-shipping culture and make data-backed decisions, the data infrastructure needed to be revamped so that teams iterating quickly on their product lines could conduct A/B tests and learn from user behavior.

This project aimed to build a culture of experimentation and enable a smooth process for building, running, and analyzing experiments across all product lines.

The platform was later to be extended to merchants so they could conduct A/B tests on their online stores and measure the impact of their business decisions based on buyer behavior.

Earlier Stats

400

Total experiments conducted since 2014

92

Experiments conducted in the past year

15%

Product lines conducting regular experiments

Comparison with other Tech Firms

Number of Experiments Conducted Annually

Objective

Incrementally improve the existing experimentation system so that Data Scientists, Developers, and Product Managers can easily build an experiment end to end, from defining a hypothesis to serving treatments and analyzing results, with minimal complexity. And as a result,

Make Fast and Informed Decisions at Scale

Success Criteria

10x

The number of experiments conducted annually

Primary KPIs

# Experiments Completed
# Product Users

Secondary KPIs

Experiment Build Time
Error Rate

My Role

This was the first time the developers and data scientists on the team were working with a designer, and my first time working within the Data org as a designer. As the only designer on the team, my role covered:

User Research

Objective

User Interviews

12

Data Scientists

Build the metrics and configure the experiment correctly: determining who the right subjects for the experiment are and estimating how long it might take to obtain statistically significant results (a rough sizing sketch follows this section).

06

Product Managers

Responsible for decision-making based on the results obtained.

06

Data Developers

Ensure the proper functioning of the assignment mechanism.
That is, ensure the right candidates (for example, visitors to Shopify.com from Canada) are included in the experiment and retained for the right duration.
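For context on the sizing step mentioned above, estimating how long an experiment must run usually comes down to a standard sample-size calculation. Below is a minimal sketch of that calculation (my own illustration, not Shopify's tooling); it assumes a two-sided two-proportion test at 95% confidence and 80% power, and the baseline and lift values are made up:

```python
from math import ceil, sqrt

def required_sample_size(baseline, mde):
    """Rough per-group sample size to detect an absolute lift (mde) over a
    baseline conversion rate, at 95% confidence and 80% power (two-sided)."""
    z_alpha, z_beta = 1.96, 0.84          # normal quantiles for alpha=0.05, power=0.8
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# e.g. detecting a 1-point lift over a 5% baseline needs roughly 8,000+ subjects
# per group; dividing by daily eligible traffic gives a rough run time.
print(required_sample_size(baseline=0.05, mde=0.01))
```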

Findings

General

Lack of Knowledge

A lack of understanding of experimentation concepts, value, practice, and statistics, along with the system's technical limitations, led to a popular opinion that experimentation is HARD.

Delays Product Development

People believed that running experiments is costly, delays product development, and is not worth the investment.

Not for New Features

People saw value in building experiments to optimize existing features; however, many did not see much value in experimenting when shipping brand-new features or products.

Fear of Failure

The perception that experimentation does not generate positive ROI, along with a lack of clarity on when to experiment, was the primary reason for the weak experimentation culture.

System Limitations

Too many PR Reviews

Many reviews are required on the development side before an experiment can actually be launched.

Not Centralized

There is no single place in the system to see the full picture of an experiment: Data Scientists and Data Developers define experiments with two sets of metadata in two different systems.

Lack of Flexibility


UX Limitations

Misleading experiment UI and terminologies

The dashboard catered to data scientists, who have the technical understanding of the statistical terms used, and not to other experiment observers. Terms like ‘confidence level’ and the ‘inconclusiveness’ of an experiment were misunderstood.

“Ongoing experiments give a false interpretation which stakeholders don't necessarily understand”

“I don't know what 90% confidence really means!”

Discoverability of experiments is hard

Discovering current and past experiments and understanding the impact of similar previous experiments was complex. It was hard to identify ‘overlapping’ or ‘conflicting’ experiments and their effects.

“Its hard to find previous examples of people tackling the same problem”

Poor management of experiments

The lack of a standardised way to create, track, debug, conclude, or ‘kill’ (immediately end) an experiment led teams to build customized workarounds that took more time to implement.

“We ran a lot of experiments in parallel and it was hard to track them”

User Profiles

Product Manager

Role
Decision maker; defines the hypothesis

Motivation
Make quick and informed decisions to ship as fast as possible; may prefer speed over accuracy at times

Values
Clear understanding of ROI (the value of the experiment needs to be higher than the cost of experimenting)
Accuracy and simplicity of results
Communication to stakeholders
Risks involved, possible delays, and the implications of not experimenting

Data Scientist

Role
Configure the experiment, analyze and share results with the decision maker

Motivation
Use the right metrics and analysis to derive accurate results with ease

Values
Experiment design: hypothesis, metrics, audience, and end impact
Accuracy of results
Simplicity of workflow to minimize effort

Data Developer

Role
Ensure qualification logic and group assignments are implemented correctly

Motivation
Set up experiments in the application seamlessly with minimal complexity

Values
Time and effort involved in setting up the experiment back-end in apps
Seamless integration with their current development workflow (e.g. code clean-up)

Steps to Build an Experiment

Designing the New Experimentation Platform

Objectives for First Release

Based on the research insights and the engineering time and effort involved in building the data infrastructure to support the new experiments system, we narrowed our objectives to the following:

Design Considerations & Iterations

Since this was a very technical product, I consistently iterated and collaborated with the technical experts (both engineering and statistics) to gain further clarity on the domain and functionality of the product. This involved understanding how the data infrastructure works, checking technical feasibility, and aligning on business requirements and UI consistency.

I worked closely with Data Scientists, Data Developers, the Product Manager, and domain experts (Statistics PhDs) to iterate, and reviewed the work with other UX folks and Content Designers for feedback.

This required multiple concepts and iterations.

Based on the prior research and the above objectives, we prioritized improving the following experiences:

Experiment Discovery


Quickly find experiments of interest.

Clarity on owners, applications, subjects, and experiment status.

Creating New Experiments


Minimize complexity in building experiments.

Automate complex tasks like group assignment, i.e. managing and tracking which visitor will be exposed to which variant (A or B, also referred to as the ‘Test’ and ‘Control’ groups).

Analyzing Results


Make interpretation of results simpler (even for non-data folks).

Clear indicators of result direction, with significance.

Initial Concepts

Users needed an easier way to find the experiments of interest to them, such as the experiments run by a particular team or a test running on a particular application (also called a ‘surface’). For this, a simple search function with the right filters made for a smoother interaction.

Building an experiment requires a lot of work on the part of the data scientists and data developers. For example, deciding the subjects of the experiment (who is being tested), determining the percentage split (sometimes 70-30 or 50-50) depending on the type of experiment, and then associating these settings with the right experiment are all crucial.

The team developed a group assignment mechanism that automatically managed the splitting of the audience without the developer having to code the details. This could now be done simply through the UI of the experiments platform.
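To make that concrete, here is a minimal sketch of how such deterministic group assignment typically works (my illustration of the general technique, not the platform's actual code; the experiment key, visitor ID, and split value are hypothetical):

```python
import hashlib

def assign_variant(visitor_id: str, experiment_key: str, control_share: float = 0.5) -> str:
    """Deterministically bucket a visitor into 'control' or 'test'.

    Hashing the visitor ID together with the experiment key keeps a visitor
    in the same group for the whole experiment, and keeps assignments
    independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment_key}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map the hash to [0, 1)
    return "control" if bucket < control_share else "test"

# Example: a 70-30 split for a hypothetical experiment
print(assign_variant("visitor-123", "checkout-redesign", control_share=0.7))
```

Moving this kind of logic behind the platform UI means a team only declares the split, while assignment and tracking happen automatically.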

The results page was the most crucial because of the differing expectations and requirements of the different users observing an experiment.

Data Scientist

Needs all the details of the data captured for each metric, and then analyzes it to check whether the outcome is statistically significant.

Product Manager & Stakeholders

Typically take a quick glance to learn how the experiment is performing.
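As a rough illustration of the statistical check behind those indicators, here is a minimal two-proportion z-test sketch (my own simplification, not the platform's analysis engine; the conversion counts are made up):

```python
from math import erf, sqrt

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Compare conversion rates of control (A) and test (B).

    Returns the observed absolute lift and a two-sided p-value from a
    pooled two-proportion z-test.
    """
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return p_b - p_a, p_value

lift, p = two_proportion_ztest(conversions_a=480, n_a=10_000, conversions_b=560, n_b=10_000)
print(f"lift: {lift:+.2%}, significant at 95% confidence: {p < 0.05}")
```

The results page then only needs to translate this kind of output into plain language: the direction of the lift, and whether the experiment has reached significance.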

Design Decisions for Results Page

Based on the above and prior research, we made some design decisions for the results page:

Rough Iterations for Results Page

Final Designs

Impact

With the successful release of the new experiments platform in May 2020, the number of experiments conducted had grown tremendously by December 2020.

Experiment Stats

1621

Total experiments conducted since 2014

92

Total experiments conducted before launch

1221

Experiments conducted since launch

+1000%

Increase in number of experiments since launch

Adoption

32%

Experiments conducted on applications experimenting for the first time; these applications had not registered any experiment before.

+500%

Experiments concluded since launch

100%

Product lines conducted at least one experiment since launch

We wanted to learn how these experiments contributed to Shopify's primary metrics, such as Gross Merchandise Volume or Net New Merchants added to the platform. But due to the complexity of the metrics infrastructure involved, it was hard to confidently establish that link. Overall, though, the platform was well appreciated and saw greater adoption across all teams and application verticals in the organisation. To introduce new members to the culture of experimentation, the tool became part of the onboarding program for all new joiners, including UX designers.

Next

The success of the platform led to plans to extend it to Shopify merchants, so they could set up experiments on their online stores and measure the impact of the business decisions they make based on buyer behavior. This project was put on hold due to budget and infrastructure constraints.

My Learnings

Working on a complex technical tool was quite a learning experience for me. With no prior background in data and as the only designer on the team, I initially found it overwhelming, but regular discussions with my team helped me learn the domain.

I wrote a blog post, published by Shopify UX, where I shared my learnings. I have summarized them here:

Feel free to reach out for a chat!