In the last decade of data analysis, A/B testing and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding players and making decisions, but it can just as easily lead the industry astray, or worse, narrow the way the industry thinks. When should you be driven by data, and when should you let your imagination roam free? This session will expose common mistakes and pitfalls, both technical and emotional, and provide practical guidance on how to improve the rigor of your tests and the quality of your data, and how to make sure you don't lose the forest for the trees.
Slide 7
Don’t Be Intimidated
You don’t need an advanced degree in statistics to get data analysis right.
You can easily get it wrong even if you do have one.
(I’m not saying it doesn’t help, just that it’s not essential.)
Slide 11
A Tale of Two Games
Two games on Kongregate.com:
● Same genre
● Similar Day 1 and Day 7 retention (Game 1 slightly higher)
● Similar lifetime buyer % (Game 2 slightly higher)
● Similar ARPPU* (chart)
ARPU** = Buyer % x ARPPU
So they’ll have similar ARPUs, too, right?
*ARPPU = Average Revenue per Paying User **ARPU = Average Revenue per User
Slide 12
A Different Tale of Two Games
Game 1 ARPU = $2.27
Game 2 ARPU = $0.84
Game 1 has much higher:
● D30
● Transactions/Buyer
● Lifetime ARPPU (chart)
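The reveal above can be sketched numerically. The figures below are invented to mirror the shape of the example (near-identical early snapshots, diverging lifetime outcomes); they are not the games' actual data:

```python
# ARPU = Buyer % x ARPPU -- the identity holds at any time horizon,
# but both inputs change with the horizon you measure over.
def arpu(buyer_pct: float, arppu: float) -> float:
    """Average revenue per user from buyer share and revenue per buyer."""
    return buyer_pct * arppu

# Day-7 snapshots look nearly identical (hypothetical figures):
game1_d7 = arpu(buyer_pct=0.020, arppu=25.0)    # $0.50
game2_d7 = arpu(buyer_pct=0.022, arppu=25.0)    # $0.55

# Lifetime, Game 1's better retention and repeat purchases compound:
game1_life = arpu(buyer_pct=0.025, arppu=90.0)  # $2.25 -- many transactions/buyer
game2_life = arpu(buyer_pct=0.028, arppu=30.0)  # $0.84 -- buyers mostly stop

print(f"D7:       ${game1_d7:.2f} vs ${game2_d7:.2f}")
print(f"Lifetime: ${game1_life:.2f} vs ${game2_life:.2f}")
```

The daily snapshot hides the gap entirely; only the lifetime view shows one game earning nearly three times the other per user.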
Slide 16
Building Worlds
Map of the Universe
● Limited view of enormous, changing systems
● Deploy every method we can invent
● Compare observations from different times and angles
Slide 36
(Huge Miss) Assignment
Your best players ALWAYS show up first. Any test that doesn’t take that into account is inherently flawed.
Even if you’re testing new players only, the people who start playing a game on Saturday are not the same people who start on Tuesday.
Slide 45
Extreme Description Testing
Variant A (accolades only):
"...a design masterpiece." 5/5 - TouchArcade
2016 Game of the Year - TouchArcade, Gamezebo
2016 Action Game of the Year - Pocket Tactics
DICE 2016 Mobile Game of the Year Nominee
Time's Top 10 Games / Top 50 Apps of 2016

Variant B (accolades + full description):
"...a design masterpiece." 5/5 - TouchArcade
2016 Game of the Year - TouchArcade, Gamezebo
2016 Action Game of the Year - Pocket Tactics
DICE 2016 Mobile Game of the Year Nominee
Time's Top 10 Games / Top 50 Apps of 2016
Craft, battle, and quest your way through Crashlands, an outlandish story overflowing with sass!
Become Flux Dabes, a galactic trucker whose latest shipment gets derailed by a chin-strapped alien menace named Hewgodooko, leaving you stranded on an alien
planet. As you hustle to retrieve your packages you’ll become enmeshed in a nefarious plot of world domination, which will require all of your wits and both of your
glutes to overcome. Learn recipes from the local sentient life, make new friends, uncover ancient secrets and deadly bosses, tame everything and build yourself a
home-away-from-home as you learn to thrive on planet Woanope.
▼▼ Key Features ▼▼
● Expansive Crafting System ●
Unlock over 500 craftable items as you explore the world and learn its secrets!
● Self-managing, Infinite Inventory ●
In Crashlands, your inventory is infinite, manages itself, and retrieves your tools when you need them, so you can focus on adventuring, questing, and building. You'll
never have to dig through your bag or return to your base to free up inventory space!
● RPG-Style Character Progression ●
Become more powerful through creating ever-more-amazing items! As you grow in power, you can venture to new regions of the world, meet strange characters,
discover new stories, and encounter new and interesting enemies.
● Skill-Based Combat ●
Learn the attacks of the enemies you encounter, and use your skill, agility, and wits to defeat them! You can even augment your fighting prowess with the power of
the dozens of gadgets you can craft. Set your enemies on fire, stun them, slow down time, and more!
● Intuitive Base Building ●
Building a base in Crashlands is so simple it feels like fingerpainting. You can create beautiful, sprawling bases in minutes!
● Tameable Creatures ●
Every creature in Crashlands can become a trusty combat sidekick. Find an egg, incubate it, and hatch your very own adorable or hideous bundle of joy. You can even
craft special items to grow and empower them!
● Huge World... with Huge Problems ●
Four sentient races, three continents, an epic bid for the future of the planet, and you - trapped in the middle, trying to deliver your freakin' packages. Take your time
to dive into the sidestories of the characters you meet or just rush headlong into making that special delivery. With hundreds upon hundreds of quests, there's a lot to
do and discover on planet Woanope!
● Effortless Cloud Saving ●
Just because your battery died or you accidentally dropped your device into a bottomless chasm, doesn't mean your save has to die with it. With BscotchID, you can
easily store and retrieve your save from the cloud, and move it between your devices!
● Controller Support ●
Tired of rubbing your sweaty hands all over your beautiful touchscreen? No problem! We've got support for most mobile-compatible controllers, so you can rub your
sweaty hands on some joysticks instead!
----------------------------------------
Recommended Hardware & OS:
● Android 4.1 or newer
● At least 1GB RAM
● At least 960x540px screen resolution
Slide 47
A/B/C(ontext) Testing
On Google Play, Helicopter beat
Girl with Gun by 92%
...but we were using Girl with Gun
because it beat Helicopter by 47% on
Kongregate.com
Slide 49
Hierarchy of Testing (from critical to difficult):
● Advertising: Test everything, all the time, everywhere. Tools abound.
● Conversion: As much testing of visual assets as possible. Tools more limited.
● Initial Experience: Significant testing possible, but tests will often have only minor effects.
● First Weeks: Still possible, especially around store, offers, feature unlocks.
● Late Game: Tread carefully. Sample size, audience expectations and player fairness become a challenge.
Slide 50
Game Data Lifecycle
? Concepting: What could we make?
? Pre-Production: What will we make?
✕ Production: Make it!
✓ Testing/Beta: What’s working? What’s not working? Is this viable? What can we make better?
✓ Launch: What’s breaking? What’s changing? What can we make better?
✓ Live Ops: How can we keep players engaged? What can we make better? Did we break something?
(Axis labels from the slide: Creation → Optimization; All Games → Games-as-a-Service)
Slide 54
Live Disappointment
Source                            Impressions    Clicks    CTR      Conversion
Castaway Cove Art Test Round 1         53,929     1,157    2.15%    n/a
Castaway Cove Art Test Round 2         40,450     1,068    2.64%    n/a
Castaway Cove Art Test Round 3        175,762     3,323    1.89%    n/a
Castaway Cove Test Markets          3,912,062    43,765    1.11%    22.39% (Target = 30%)
CPIs for live version of Castaway Cove are okay, but much higher than we’d been targeting
Slide 61
Ask Me Anything
Special thanks to Tammy Levy, Drew Levin, Zebulon
Reynolds, Heather Gainer and Butterscotch
Shenanigans for help with data examples!
More great data & talks from the whole team on our blog
https://blog.kongregate.com
Or follow us on Twitter:
@EmilyG
@KongregateDevs
And finally, a good explanation of the Wilcoxon rank-sum test can be found here:
https://www.slideshare.net/KrysselMaeCabili/wilcoxon-ranksum-mann-whitney-u-kolmogorovsmirnov-12
We’re hiring for analytics!
Open Roles: Director of Analytics, Product Manager, Data Analyst
Locations: Portland, OR; San Francisco, CA; Chicago, IL; Montreal, Canada
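Since the deck points at an explanation of the Wilcoxon rank-sum test, here is a minimal pure-Python sketch of the statistic, using the normal approximation and no tie-variance or continuity correction; for real analyses, scipy.stats.mannwhitneyu is the standard tool:

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided rank-sum (Mann-Whitney U) test via the normal approximation."""
    combined = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for tied values (1-based)
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])               # rank sum of the first sample
    u = r1 - n1 * (n1 + 1) / 2         # Mann-Whitney U statistic
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p
```

For example, rank_sum_test([1, 2, 3], [4, 5, 6]) returns U = 0 for the first sample and a two-sided p-value near 0.05.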
Speaker Notes
It’s the cornerstone of many of the biggest businesses in the US, including Google & Amazon, and the backbone of most scientific undertakings.
But data is just a tool, and like almost every tool it has both uses and abuses, not to mention just straight up errors. How many conflicting health studies have you seen?
As a company Kongregate uses a lot of data, and some of you have probably seen talks I’ve given before where I share a lot of that data. But a lot of the time I’ve been unsure whether we’re this ship, charting a clean course to treasure, or this ship, headed towards disaster. Both have happened! And since I think that’s a pretty common phenomenon, I thought it would be a good talk for GDC.
I love using numbers & testing to understand the world. I still probably spend at least an hour a day poking around dashboards and spreadsheets because it’s so much more fun for me than meetings.
I’m mostly self-taught, majored in Eastern European Studies, not math or econ. Stumbled into direct marketing, specifically catalogs, after college, and fell in love with data. Taught myself SQL because I hated to wait for IT to pull my data, took math & econ classes to understand more theory. After 10 years in catalogs & e-commerce and a near-miss with econ grad school I co-founded Kongregate partly to do something completely different. But it hasn’t turned out to be that different after all. User acquisition in particular is fundamentally similar between catalogs & games.
Part of the reason I’m telling you this is to make my first point:
And for an organization to do data right you can’t toss analysis back and forth over a wall to quants. It takes intimate knowledge of a game (and the development) to do good analysis and multiple perspectives and theories are good.
Sometimes it’s immediately obvious. One of the first games we launched on mobile was an endless runner. It wasn’t filtering purchases from jailbroken phones and was showing an average revenue per player of $500. That’s not very plausible and easily caught. But most issues are much more subtle – tracking pixels not firing correctly for a particular game on a particular browser, tutorial steps being completed twice by some players but not by others, clients reporting strange timestamps, etc. For this reason I recommend never relying on any analytic system where you can’t go in and inspect individual records. If you can’t check the detail there are some problems you’ll never find and fix.
Even when your data is accurate it can still be deceiving. This looks like 4 separate pictures photoshopped together to create an appealing color grid, right?
Wrong.
So much of data is like these pictures – a set-up that appears straightforwardly to be one thing from one angle, turns out to be completely different from another.
Except of course you know I’m setting you up
People are playing game 1 longer than game 2, and buying repeatedly. But if you just concentrated on daily monetization stats you could miss that entirely.
The witnesses may be lying or confused. The crime scene may have been tampered with. You can’t trust any one piece of evidence but by cross-checking them against each other you can figure out what’s true and false.
Cross-check sources: client data (our SDK, Adjust) vs. server data; app stores; benchmarking against other games; benchmarking deltas.
Your goal should be to create a 3-dimensional view of your players and your game. How people move through and interact with different parts. It’s a living, changing system and flat views are not enough.
We tend to think of playerbases as monolithic but really they are aggregations of all sorts of subgroups created by time in game, platform, device, browser, demographics, source – and these subgroups are shifting around. Changes in key KPIs are more often the result of changes in the audience than they are of changes in the game.
These examples show dramatic changes, but more subtle audience changes are happening all the time. Tracking cohorts by date of install/registration is a good way to track metrics independent of certain types of mix issues, but then it’s easy to lose track of events and changes in the game. So as ever, it’s about building a true picture across multiple sources.
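The audience-mix point can be made concrete with a toy example (all numbers invented): hold each segment's ARPDAU fixed and only shift the mix, and the blended KPI still drops sharply.

```python
# Hypothetical two-segment audience; per-segment ARPDAU never changes.
SEG_ARPDAU = {"veterans": 0.40, "new_installs": 0.05}

def blended_arpdau(dau_by_segment):
    """Revenue-weighted blend across segments -- the number on the dashboard."""
    total_dau = sum(dau_by_segment.values())
    revenue = sum(dau * SEG_ARPDAU[seg] for seg, dau in dau_by_segment.items())
    return revenue / total_dau

before = blended_arpdau({"veterans": 8_000, "new_installs": 2_000})
after = blended_arpdau({"veterans": 8_000, "new_installs": 12_000})
print(f"before UA push: ${before:.2f}/DAU, after: ${after:.2f}/DAU")
# Blended ARPDAU falls ~42% even though neither segment changed at all.
```

Nothing about the game got worse here; the dashboard moved because the audience did.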
75% ARPDAU decline, then a modest recovery to ~50% of the previous high.
When you break out ARPDAU by player age you can see that the decline isn’t nearly as dramatic. There’s some decline after a big holiday sale, and then again some as we expanded UA aggressively. But most of it is from fewer el
This is for a collectible card game where the player who goes first has a substantial advantage.
On this chart of player win rates for Tyrant it looks like Mission 24 is very difficult (50% win rate) and mission 25 is easy (95% win rate). It’s sort of true: Mission 25 is relatively easy for those who attempt it. But by deck strength it’s harder than 22, which has a 70% win rate. Mission 25 is easy for the players who are strong enough & skilled enough to beat Mission 24, a selected subgroup of those who attempted 24.
So for the last 10 minutes I’ve been ranting about how important it is to look at audience mix splits
The most important metrics (revenue, sessions, battles, etc.) in games all follow power-law distributions. Your business (especially in free-to-play games) is driven by outliers, and their presence or absence distorts almost any data you look at.
Your outliers are your best players, so it’s a good idea to do individual analysis on them to understand who they are, what drives them, and what they’re most likely to distort. Binary “yes/no” metrics like % buyer, D7 retention, and tutorial completion are a lot more stable than averages involving revenue and engagement, like ARPPU, $/DAU, and Avg Sessions, and can be looked at in much smaller samples.
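That stability claim can be sanity-checked with a quick simulation (a sketch with a made-up, heavy-tailed spend distribution): across repeated small cohorts, the conversion rate is far less noisy, in relative terms, than average revenue.

```python
import random
from statistics import mean, pstdev

random.seed(7)  # deterministic illustration

def simulate_cohort(n=500, buyer_rate=0.05):
    """Revenue per user: 95% spend nothing; buyers' spend is heavy-tailed."""
    return [random.paretovariate(1.3) * 5 if random.random() < buyer_rate else 0.0
            for _ in range(n)]

rev_means, conv_rates = [], []
for _ in range(300):
    cohort = simulate_cohort()
    rev_means.append(mean(cohort))                            # average revenue/user
    conv_rates.append(sum(v > 0 for v in cohort) / len(cohort))  # % buyer

def cv(xs):
    """Coefficient of variation: relative noise of a metric across cohorts."""
    return pstdev(xs) / mean(xs)

print(f"relative noise: avg revenue {cv(rev_means):.2f} vs conversion {cv(conv_rates):.2f}")
```

The heavy tail makes the revenue average jump around from cohort to cohort, while the binary buyer rate barely moves.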
Sometimes we do it consciously, but more often it’s unconscious. I’ll look at a group of cohorts, and the best one is ALWAYS the most memorable. If you’re in test market and hoping to hit 50%, the days you hit that number will imprint on your brain that your game has 50% D1 retention, even if the average is 45%.
Cherry Picking’s great and good friend!
Part of building a mental model of your game is having theories about behavior, and if you have a theory you should test it. But it’s really easy to look for the data that supports your theory and miss the data that contradicts it, or even just muddies the picture.
How you visualize data has a big impact on how you perceive it.
Ice cream consumption and drowning are correlated, because they’re both more likely to happen in hot weather. But “ice cream kills” would be a terrible conclusion. We’ve all heard this a thousand times, but we need to keep hearing it like a mantra every day, because we all make this same mistake over and over and over. We’re human; we’re wired to search for causation. It’s our superpower and our curse.
Almost every metric you look at will be positively correlated with engagement because the most engaged users do everything more. Maybe Facebook is increasing engagement. Maybe only engaged players were willing to hit the button and potentially spam their friends.
This is the real way to separate correlation from causation and understand what’s really going on. But it’s not a magic bullet, because nothing is that easy. Testing has real costs in engineering time and overhead, complexity, and division/confusion for the players, and the more tests you’re running, the worse it gets.
There’s also a lot of ways to screw up A/B testing even though it seems so foolproof. Most A/B test traps are variations on themes I’ve mentioned but some are new, particularly issues around how people get assigned to tests
For example, if you’re A/B testing your store, don’t assign people to the test unless they interact with the store. It’s often easier to split people as they arrive in your game, or at some other point, but a) there’s a chance you would end up with a non-equal distribution of interaction with the tested feature, and b) any signal from the test group would get lost in the noise of a larger sample.
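A minimal sketch of that assignment rule, assuming a deterministic hash-based bucketing scheme (the experiment name, function names, and commented-out logger are hypothetical):

```python
import hashlib

def variant_for(player_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministic bucket from a hash of experiment name + player id."""
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def on_store_opened(player_id: str) -> str:
    # Assign (and log exposure) only here, at the first store interaction:
    # players who never open the store never enter the test sample.
    variant = variant_for(player_id, "store_layout_test")
    # log_exposure(player_id, "store_layout_test", variant)  # hypothetical logger
    return variant
```

Hashing the id instead of storing an assignment means repeat visits always land in the same bucket, with no state to keep in sync.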
Tests can have unintended consequences, you should look at additional metrics beyond the one being tested to make sure that you get the full picture. Commercial A/B products often make you choose one metric for a test to prevent you from fishing for the good result to decide the test on. I think it’s more important to understand the full effects of the change that you made (though fishing is bad, too.)
Early results tend to be both volatile and fascinating – differences are exaggerated or totally change direction. People tend to remember the early, interesting results rather than the actual results. People also often want to end the test early if they see a big swing, which is a bad idea. So I recommend that you don’t look at early test results except to make sure the test isn’t totally broken. How big should your test sample be? In my opinion the bigger the better.
When people talk about A/B tests you’ll often hear things like “we’ve got a statistically significant 5% lift”! And most people hear that and think that means that the lift is definitely 5%. But that’s not how statistical significance tests work.
Statistical significance tests assume that there is some true difference in lift, and that if you run the same test repeatedly there will be a bell-curve distribution of results, with the true lift as the average. Your 5% result could be right on the mean, or it could be an outlier on either end. If it’s statistically significant, then the chance is low (usually 5% or less) that there’s no lift at all. But the true lift could be 1% or 10%. Conversely, if you do a test that doesn’t show a lift, or doesn’t pass the significance test for a small lift, that doesn’t mean there ISN’T a lift.
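The point is easier to see with a confidence interval. A stdlib sketch with invented conversion numbers: the test below is "significant", yet the interval around the lift is wide.

```python
from statistics import NormalDist

def lift_ci(conversions_a, n_a, conversions_b, n_b, confidence=0.95):
    """Normal-approximation CI for the absolute difference in conversion rate."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical test: 5.00% vs 5.25% conversion, i.e. "a 5% relative lift!"
lo, hi = lift_ci(10_000, 200_000, 10_500, 200_000)
print(f"95% CI for the absolute lift: {lo:+.4f} to {hi:+.4f}")
# The interval excludes zero, but spans roughly 2% to 8% in relative terms.
```

So "statistically significant 5% lift" here really means "probably not zero, and plausibly anywhere from under half to over one and a half times the headline number."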
This is why I like to run A/B tests with larger sample sizes. It’s like running the test again and averaging the results. It’s possible you’d get two outlier results in the same direction, but becomes less and less likely, and more likely that your test results represent the true mean.
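"The bigger the better" can be made concrete with the standard power calculation (a stdlib sketch; the 5% base conversion rate is illustrative): the required sample per arm grows roughly with the inverse square of the lift you want to detect.

```python
from statistics import NormalDist

def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate users per arm to detect a relative lift in a binary rate."""
    p_test = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return int((z_alpha + z_power) ** 2 * variance / (p_test - p_base) ** 2) + 1

# Illustrative: detecting relative lifts on a 5% conversion rate
for lift in (0.20, 0.10, 0.05):
    print(f"{lift:.0%} relative lift needs ~{sample_size_per_arm(0.05, lift):,} users per arm")
```

Halving the lift you care about roughly quadruples the sample you need, which is why subtle changes are so expensive to test.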
Often 70-80% of a free-to-play game’s revenue will come from a small % of buyers who spend more than $500.
Large sample sizes help here, too.
This can be really frustrating, even demoralizing for a team. When you’re going through the effort to make and test changes, you want them to mean something! You want to make progress. And then you get another non-result on a test. But finding out what doesn’t matter can actually be really powerful.
Here’s an extreme example of this from the team at Butterscotch Shenanigans, who made the game Crashlands. They had written up an elaborate, detailed description and decided to test how much impact it had using Google’s store testing system on Android against the most extreme possible variant, no description at all. Just the accolades the game has received.
They were kind enough to share the results, and after 4 full months the test shows absolutely no difference. That actually tells you a lot: specifically, that the description has very little impact, and this is consistent with the testing we’ve done on our own games as well. Time and resources are a constraint for virtually everybody, and knowing what is not important allows you to concentrate more on things that do matter. We used to argue endlessly over game names, but after doing test after test and not seeing much difference we’re all much more relaxed about it.
But it’s important not to extrapolate too much. Just because you get a particular
Late-game content, specifically, is often very difficult to test, as is any testing on late-game players.
Daniel Cook from Spryfox tweeted this recently. He was talking about YouTube and algorithms, but I think it helps frame some of the limitations of testing. As a player plays a game, the game is shaping their expectations and experience, and training them to behave in certain ways. So the same player might react very differently based on how long they had been playing the game. And when engaged players start talking to each other in chat and forums they affect each other, too. Plus you run into small sample sizes with lots of outliers and other fun problems I’ve already talked about.
Tyrant was successful with a small core audience, but difficult to market.
CPIs for live version of Castaway Cove are okay, but much higher than we’d been targeting. Lots of ways we probably went wrong
So far data has helped us iterate on existing games, pointing us in the direction that helped get us from Tyrant to Animation Throwdown. But in
But what we don’t know is as important as what we do know
Data is always going to tell you to make an existing successful game, but better. It’s not going to tell you to make a game unlike anything people have played before.