Building a culture of experimentation and enabling a smooth process for building, running, and analyzing experiments across all product lines. The platform, named EXP, was later to be extended to merchants so they could independently run A/B tests on their online stores and measure the impact of their business decisions based on buyer behavior.

My Role

The Team

Time Frame

Product Design Lead
UX Research
UI Design
User Testing & Analysis
Prototyping
Workshop Facilitation

3 Data Developers
1 Front End Developer
1 Product Manager
2 Data Scientists
1 Product Designer

Jan 2020 – May 2020

Led to a 1000% increase in experiments created within the first 6 months of launch.

Overview

Project Background

Experimentation (or online A/B testing) helps make informed product decisions based on key metrics of interest. For a growing organisation such as Shopify, it becomes increasingly important to measure the impact of decisions made across the product offerings to merchants. To support the fast-shipping culture and make data-backed decisions, the data infrastructure needed to be revamped so that teams iterating quickly on their product lines could conduct A/B tests and learn from user behavior.

This project aimed to build a culture of experimentation and enable a smooth process for building, running, and analyzing experiments across all product lines.

The platform was later to be extended to merchants so they could conduct A/B tests on their online stores and measure the impact of their business decisions based on buyer behavior.

Earlier Stats

400

Total experiments conducted since 2014

92

Experiments conducted in the past year

15%

Product lines conducting regular experiments

Comparison with other Tech Firms

Number of Experiments Conducted Annually

Objective

Incrementally improve the existing experimentation system so that Data Scientists, Developers, and Product Managers can easily build an experiment end to end, from defining a hypothesis to serving treatments and analyzing results, with minimal complexity. And as a result,

Make Fast and Informed Decisions at Scale

Success Criteria

10x

The number of experiments conducted annually

Primary KPIs

# Experiments Completed
# Product Users

Secondary KPIs

Experiment Build Time
Error Rate

My Role

This was the first time the developers and data scientists on the team were working with a designer, and my first time working within the Data org as a designer. As the only designer on the team, my role covered:

User Research

Objective

User Interviews

12

Data Scientists

Build the metrics and configure the experiment correctly: determining who the right subjects for the experiment are and estimating how long it might take to obtain statistically significant results (a rough sizing sketch follows this section).

06

Product Managers

Responsible for decision-making based on the results obtained.

06

Data Developers

Ensure the proper functioning of the assignment mechanism.
That is, ensure the right candidates (for example, visitors to Shopify.com from Canada) are included in the experiment and retained for the right duration.
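For context on the sizing step mentioned above, estimating how long an experiment must run usually comes down to a standard sample-size calculation. Below is a minimal sketch of that calculation (my own illustration, not Shopify's tooling); it assumes a two-sided two-proportion test at 95% confidence and 80% power, and the baseline and lift values are made up:

```python
from math import ceil, sqrt

def required_sample_size(baseline, mde):
    """Rough per-group sample size to detect an absolute lift (mde) over a
    baseline conversion rate, at 95% confidence and 80% power (two-sided)."""
    z_alpha, z_beta = 1.96, 0.84          # normal quantiles for alpha=0.05, power=0.8
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# e.g. detecting a 1-point lift over a 5% baseline needs roughly 8,000+ subjects
# per group; dividing by daily eligible traffic gives a rough run time.
print(required_sample_size(baseline=0.05, mde=0.01))
```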

Findings

General

Lack of Knowledge

A lack of understanding of experimentation concepts, value, practice, and statistics, along with the system's technical limitations, led to a popular opinion that experimentation is HARD.

Delays Product Development

People believed that running experiments is costly, delays product development, and is not worth the investment.

Not for New Features

People saw value in building experiments to optimize existing features; however, many did not see much value in experimenting when shipping brand-new features or products.

Fear of Failure

The perception that experimentation does not generate positive ROI, along with a lack of clarity on when to experiment, was the primary reason for the weak experimentation culture.

System Limitations

Too many PR Reviews

Many reviews are required on the development side before an experiment can actually be launched.

Not Centralized

There is no single place in the system to see the full picture of an experiment: Data Scientists and Data Developers define experiments with two sets of metadata in two different systems.

Lack of Flexibility


UX Limitations

Misleading experiment UI and terminologies

The dashboard catered to data scientists, who have the technical understanding of the statistical terms used, and not to other experiment observers. Terms like ‘confidence level’ and the ‘inconclusiveness’ of an experiment were misunderstood.

“Ongoing experiments give a false interpretation which stakeholders don't necessarily understand”

“I don't know what 90% confidence really means!”

Discoverability of experiments is hard

Discovering current and past experiments and understanding the impact of similar previous experiments was complex. It was hard to identify ‘overlapping’ or ‘conflicting’ experiments and their effects.

“Its hard to find previous examples of people tackling the same problem”

Poor management of experiments

The lack of a standardised way to create, track, debug, conclude, or ‘kill’ (immediately end) an experiment led teams to build customized workarounds that took more time to implement.

“We ran a lot of experiments in parallel and it was hard to track them”

User Profiles

Product Manager

Role
Decision maker; defines the hypothesis

Motivation
Make quick and informed decisions to ship as fast as possible; may prefer speed over accuracy at times

Values
Clear understanding of ROI (the value of the experiment needs to be higher than the cost of experimenting)
Accuracy and simplicity of results
Communication to stakeholders
Risks involved, possible delays, and the implications of not experimenting

Data Scientist

Role
Configure the experiment, analyze and share results with the decision maker

Motivation
Use the right metrics and analysis to derive accurate results with ease

Values
Experiment design: hypothesis, metrics, audience, and end impact
Accuracy of results
Simplicity of workflow to minimize effort

Data Developer

Role
Ensure qualification logic and group assignments are implemented correctly

Motivation
Set up experiments in the application seamlessly with minimal complexity

Values
Time and effort involved in setting up the experiment back-end in apps
Seamless integration with their current development workflow (e.g. code clean-up)

Steps to Build an Experiment

Designing the New Experimentation Platform

Objectives for First Release

Based on the research insights and the engineering time and effort involved in building the data infrastructure to support the new experiments system, we narrowed our objectives to the following:

Design Considerations & Iterations

Since this was a very technical product, I consistently iterated and collaborated with the technical experts (both engineering and statistics) to gain further clarity on the domain and functionality of the product. This involved understanding how the data infrastructure works, checking technical feasibility, and aligning on business requirements and UI consistency.

I worked closely with Data Scientists, Data Developers, the Product Manager, and domain experts (Statistics PhDs) to iterate, and reviewed the work with other UX folks and Content Designers for feedback.

This required multiple concepts and iterations.

Based on the prior research and the above objectives, we prioritized improving the following experiences:

Experiment Discovery


Quickly find experiments of interest.

Clarity on owners, applications, subjects, and experiment status.

Creating New Experiments


Minimize complexity in building experiments.

Automate complex tasks like group assignment, i.e. managing and tracking which visitor will be exposed to which variant (A or B, also referred to as the ‘Test’ and ‘Control’ groups).

Analyzing Results


Make interpretation of results simpler (even for non-data folks).

Clear indicators of result direction, with significance.

Initial Concepts

Users needed an easier way to find the experiments of interest to them, such as the experiments run by a particular team or a test running on a particular application (also called a ‘surface’). For this, a simple search function with the right filters made for a smoother interaction.

Building an experiment requires a lot of work on the part of the data scientists and data developers. For example, deciding the subjects of the experiment (who is being tested), determining the percentage split (sometimes 70-30 or 50-50) depending on the type of experiment, and then associating these settings with the right experiment are all crucial.

The team developed a group assignment mechanism that automatically managed the splitting of the audience without the developer having to code the details. This could now be done simply through the UI of the experiments platform.
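To make that concrete, here is a minimal sketch of how such deterministic group assignment typically works (my illustration of the general technique, not the platform's actual code; the experiment key, visitor ID, and split value are hypothetical):

```python
import hashlib

def assign_variant(visitor_id: str, experiment_key: str, control_share: float = 0.5) -> str:
    """Deterministically bucket a visitor into 'control' or 'test'.

    Hashing the visitor ID together with the experiment key keeps a visitor
    in the same group for the whole experiment, and keeps assignments
    independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment_key}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map the hash to [0, 1)
    return "control" if bucket < control_share else "test"

# Example: a 70-30 split for a hypothetical experiment
print(assign_variant("visitor-123", "checkout-redesign", control_share=0.7))
```

Moving this kind of logic behind the platform UI means a team only declares the split, while assignment and tracking happen automatically.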

The results page was the most crucial because of the differing expectations and requirements of the different users observing an experiment.

Data Scientist

Needs all the details of the data captured for each metric, and then analyzes it to check whether the outcome is statistically significant.

Product Manager & Stakeholders

Typically take a quick glance to learn how the experiment is performing.
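As a rough illustration of the statistical check behind those indicators, here is a minimal two-proportion z-test sketch (my own simplification, not the platform's analysis engine; the conversion counts are made up):

```python
from math import erf, sqrt

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Compare conversion rates of control (A) and test (B).

    Returns the observed absolute lift and a two-sided p-value from a
    pooled two-proportion z-test.
    """
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return p_b - p_a, p_value

lift, p = two_proportion_ztest(conversions_a=480, n_a=10_000, conversions_b=560, n_b=10_000)
print(f"lift: {lift:+.2%}, significant at 95% confidence: {p < 0.05}")
```

The results page then only needs to translate this kind of output into plain language: the direction of the lift, and whether the experiment has reached significance.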

Design Decisions for Results Page

Based on the above and prior research, we made some design decisions for the results page:

Rough Iterations for Results Page

Final Designs

Impact

With the successful release of the new experiments platform in May 2020, the number of experiments conducted had grown tremendously by December 2020.

Experiment Stats

1621

Total experiments conducted since 2014

92

Total experiments conducted before launch

1221

Experiments conducted since launch

+1000%

Increase in number of experiments since launch

Adoption

32%

Experiments conducted on applications experimenting for the first time; these applications had not registered any experiment before.

+500%

Experiments concluded since launch

100%

Product lines conducted at least one experiment since launch

We wanted to learn how these experiments contributed to Shopify's primary metrics, such as Gross Merchandise Volume or Net New Merchants added to the platform. But due to the complexity of the metrics infrastructure involved, it was hard to confidently establish that link. Overall, though, the platform was well appreciated and saw greater adoption across all teams and application verticals in the organisation. To introduce new members to the culture of experimentation, the tool became part of the onboarding program for all new joiners, including UX designers.

Next

The success of the platform led to plans to extend it to Shopify merchants, so they could set up experiments on their online stores and measure the impact of the business decisions they make based on buyer behavior. This project was put on hold due to budget and infrastructure constraints.

My Learnings

Working on a complex technical tool was quite a learning experience for me. With no prior background in data and as the only designer on the team, I initially found it overwhelming, but regular discussions with my team helped me learn the domain.

I wrote a blog post, published by Shopify UX, where I shared my learnings. I have summarized them here:

Feel free to reach out for a chat!