EDHREC @ Data Science MD

EDHREC, Magic: TG
Recommendation Engine
(and data science on games)
Donald Miner @donaldpminer dminer@minerkasch.com
September 21st, 2015 - Data Science MD Meetup
Games & Stuff in Glen Burnie, MD

Talk agenda
• Background
• EDHREC Overview
• EDHREC Data Analysis
• EDHREC Architecture
• Data Science Application UX Lessons Learned
• Related Work in Magic and Other Domains
• Virtues of Data Science on Games

Magic: The Gathering
• Trading card game
• First published in 1993
• 20 million players in 2015 (World of Warcraft has 7.1 million subscribers)
• Organized tournaments
• Secondary market
1993
$27,000

Elder Dragon Highlander / Commander
• One of the Magic “formats”
• Started independently of Wizards of the Coast (WOTC) in the late 2000s
• Officially supported starting in 2011
• Typically multiplayer
• 100-card singleton deck
(instead of 60-card, up to 4x copies)
• Each deck has a single “commander”
(unique to this format)

Data Science
• Term coined around 2008
• Represents a shift in data analysis in industry
• A mix of computer science, machine learning, statistics, programming, visualization, and domain knowledge

EDHREC Algorithm 1.0
User-based Collaborative Filtering
Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
Analogy:
Deck -> User
Card -> Item
Pros:
Better at picking up bigger themes in decks
Easy to implement
Cons:
Had issues discovering subtle deck themes
Had issues pointing out combos
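
A minimal sketch of the deck -> user, card -> item analogy, not the actual EDHREC 1.0 code. The slides do not say which similarity measure was used; Jaccard similarity over card sets is an assumption here, and the function names, the `k` and `top_n` parameters, and the toy decks are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two sets of card names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, decks, k=2, top_n=3):
    """Recommend cards for `target` from its k most similar decks."""
    # Rank the other decks ("users") by similarity to the target deck.
    neighbors = sorted(
        ((jaccard(target, d), d) for d in decks if d != target),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Each neighbor votes for the cards ("items") the target lacks,
    # weighted by how similar that neighbor is.
    scores = defaultdict(float)
    for sim, deck in neighbors:
        for card in deck - target:
            scores[card] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [card for card, _ in ranked[:top_n]]


# Hypothetical toy decks, far smaller than real 100-card lists.
decks = [
    {"Sol Ring", "Sanguine Bond", "Exquisite Blood", "Swamp"},
    {"Sol Ring", "Sanguine Bond", "Vampire Nighthawk", "Swamp"},
    {"Sol Ring", "Lightning Bolt", "Mountain"},
]
target = {"Sol Ring", "Sanguine Bond", "Swamp"}
print(recommend_user_based(target, decks))
```

Because whole decks vote, broad themes shared by the nearest decks dominate the result, which fits the pros and cons listed above.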

Recommendation Engine 2.0 Algorithm
Step 1: Card Affinity Matrix
Data set: 31,000 decks
Jaccard / Tanimoto coefficient for a pair of cards, e.g.:
(Decks that contain Sanguine Bond AND Exquisite Blood)
÷
(Decks that contain Sanguine Bond OR Exquisite Blood)
Repeat for every card combination
(15,000 cards)
This is the basis of the Card Analysis page
This matrix is built offline in batch
Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
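
Step 1 can be sketched roughly as follows. This is an illustration under assumptions, not the EDHREC batch job: `card_affinity` and the toy decks are hypothetical, and the result is stored sparsely (pairs that never share a deck are left out, which is the same as a coefficient of 0).

```python
from collections import Counter
from itertools import combinations


def card_affinity(decks):
    """Return {(card_a, card_b): Tanimoto coefficient} for co-occurring pairs."""
    card_counts = Counter()  # number of decks containing each card
    pair_counts = Counter()  # number of decks containing both cards of a pair
    for deck in decks:
        cards = sorted(set(deck))  # sorted so each pair has one canonical key
        card_counts.update(cards)
        pair_counts.update(combinations(cards, 2))
    affinity = {}
    for (a, b), both in pair_counts.items():
        # |A OR B| = |A| + |B| - |A AND B|
        either = card_counts[a] + card_counts[b] - both
        affinity[(a, b)] = both / either
    return affinity


# Hypothetical toy data set.
decks = [
    {"Sanguine Bond", "Exquisite Blood", "Sol Ring"},
    {"Sanguine Bond", "Exquisite Blood"},
    {"Sanguine Bond", "Sol Ring"},
    {"Sol Ring", "Lightning Bolt"},
]
affinity = card_affinity(decks)
print(affinity[("Exquisite Blood", "Sanguine Bond")])  # 2 decks with both / 3 with either
```

Counting pairs deck by deck avoids looping over every possible card combination up front, but the work still grows quadratically with deck size, which is one reason to build the matrix offline.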

Recommendation Engine 2.0 Algorithm
Step 2: Calculate Scores
Data set: 31,000 decks
1. Select each row of the Tanimoto matrix corresponding to cards in Deck D
2. Sum the columns
3. Sort by score, display results
This gives you a sum of the Tanimoto coefficients
I really have no idea what this algorithm is called… I’m not sure if it’s novel or not
This is performed in real time
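
The three steps can be sketched in plain Python as below. The matrix values are made up for illustration, `score_cards` is a hypothetical name, and dropping cards already in the deck from the output is an assumption (the slides only say "display results").

```python
def score_cards(matrix, cards, deck):
    """Sum the Tanimoto rows for the cards in `deck`, then rank the columns."""
    index = {card: i for i, card in enumerate(cards)}
    rows = [matrix[index[c]] for c in deck]      # 1. select the deck's rows
    totals = [sum(col) for col in zip(*rows)]    # 2. sum the columns
    ranked = sorted(zip(cards, totals), key=lambda ct: (-ct[1], ct[0]))  # 3. sort
    # Assumption: cards the deck already contains are not worth recommending.
    return [(card, score) for card, score in ranked if card not in deck]


# Hypothetical 4-card Tanimoto matrix (symmetric, 1.0 on the diagonal).
cards = ["Sanguine Bond", "Exquisite Blood", "Sol Ring", "Lightning Bolt"]
matrix = [
    [1.0, 0.5, 0.25, 0.0],
    [0.5, 1.0, 0.25, 0.0],
    [0.25, 0.25, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
print(score_cards(matrix, cards, ["Sanguine Bond", "Sol Ring"]))
```

Only the rows for the cards in Deck D are touched, so the per-request work is small compared with building the matrix, which is consistent with running this step in real time.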

Lessons learned:
Taking out the garbage
• A lot of garbage gets submitted to EDHREC
• Decks with <20 cards
• Decks with invalid commanders
• Decks with illegal cards
• The algorithms handle this well, and problem cards rarely show up
• However, pruning “worthless” decks significantly improves performance due to all the O(N^2) algorithms going on
General advice: Think about which pieces of data are worthless in your data set
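
A pruning pass along these lines might look like the sketch below. The deck layout (`commander` and `cards` keys), the lookup sets, and the function names are hypothetical; the 20-card threshold comes from the slide. Dropping a whole deck because of one illegal card is an assumption, since the slides do not say whether the deck or only the card is discarded.

```python
MIN_CARDS = 20  # from the "<20 cards" bullet above


def is_usable(deck, legal_commanders, legal_cards):
    """Return True if the deck passes all three garbage checks."""
    if len(deck["cards"]) < MIN_CARDS:
        return False
    if deck["commander"] not in legal_commanders:
        return False
    return all(card in legal_cards for card in deck["cards"])


def prune(decks, legal_commanders, legal_cards):
    """Keep only decks worth feeding into the O(N^2) batch jobs."""
    return [d for d in decks if is_usable(d, legal_commanders, legal_cards)]


# Hypothetical toy data.
legal_cards = {f"Card {i}" for i in range(30)}
legal_commanders = {"Commander A"}
good_cards = [f"Card {i}" for i in range(25)]
decks = [
    {"commander": "Commander A", "cards": good_cards},                        # kept
    {"commander": "Commander A", "cards": good_cards[:5]},                    # too small
    {"commander": "Not A Commander", "cards": good_cards},                    # invalid commander
    {"commander": "Commander A", "cards": good_cards + ["Unknown Card"]},  # illegal card
]
print(len(prune(decks, legal_commanders, legal_cards)))
```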

Lessons learned:
Partitioning (too much or too little)
• Partitioning the user/deck space into subgroups is a great way to speed things up in recommendation engines
• The 31,000 EDHREC decks are split into 27 partitions
(one per possible color combination)
• Algorithms are typically run on a single partition
(e.g., Red/Blue deck recommendations only come from other Red/Blue decks)
• However, themes that span color combinations get worse recommendations
• Partitioning too deep also causes problems
• I tried partitioning by commander, and that was awful:
new commanders and themes that span commanders suffer
General advice: There is no good way to figure out a partition scheme in advance; just try it out
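
A sketch of partitioning by color combination, with hypothetical names (`color_key`, `partition_decks`, and a `colors` field on each deck). The slides do not explain where 27 comes from; one plausible reading, treated here as an assumption, is all 32 subsets of the five colors minus the five four-color combinations.

```python
from collections import defaultdict
from itertools import combinations

COLORS = "WUBRG"  # white, blue, black, red, green


def color_key(colors):
    """Canonical partition key, e.g. {'R', 'U'} -> 'UR'; colorless -> ''."""
    return "".join(c for c in COLORS if c in colors)


def partition_decks(decks):
    """Group decks by color combination so algorithms can run per partition."""
    partitions = defaultdict(list)
    for deck in decks:
        partitions[color_key(deck["colors"])].append(deck)
    return partitions


# Count the possible keys: every subset of five colors, then drop four-color ones.
all_keys = ["".join(combo) for n in range(6) for combo in combinations(COLORS, n)]
keys_without_four_color = [k for k in all_keys if len(k) != 4]
print(len(all_keys), len(keys_without_four_color))  # 32 27

# Hypothetical toy decks.
decks = [
    {"name": "deck 1", "colors": {"R", "U"}},
    {"name": "deck 2", "colors": {"U", "R"}},
    {"name": "deck 3", "colors": {"B"}},
]
partitions = partition_decks(decks)
print({key: len(group) for key, group in partitions.items()})
```

Recommendations for a Red/Blue deck would then be computed only from `partitions["UR"]`, which shrinks N for the quadratic steps at the cost of missing themes that cross partitions.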

EDHREC Architecture
Batch Processes
(cron)
Reddit Bot
(praw)

Redis
• In-memory key/value data store
• Stores website state
• Utilized as a cache
• Stores all of the decks
• Stores all of the pre-computed stats
• Stores all metadata about Magic cards
• EDHREC serializes most things to common
internal JSON data formats
• Very fast
• Very easy to use
• Good support with Python
• Getting harder to do “analysis”
• Going to move to Redshift SQL database
for analytical things
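
Storing decks as JSON strings under keys could look like the sketch below. The `deck:<id>` key scheme and the helper names are hypothetical. The helpers only need an object with `set(key, value)` and `get(key)` methods, which a redis-py `redis.Redis` client provides; a small in-memory stand-in is used here so the example runs without a Redis server.

```python
import json


class InMemoryStore:
    """Stand-in exposing the get/set subset of a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)  # None when the key is missing, like Redis


def save_deck(store, deck_id, deck):
    """Serialize the deck to JSON and store it under a namespaced key."""
    store.set(f"deck:{deck_id}", json.dumps(deck))


def load_deck(store, deck_id):
    """Fetch and deserialize a deck, or return None if it is not stored."""
    raw = store.get(f"deck:{deck_id}")
    return None if raw is None else json.loads(raw)


store = InMemoryStore()
save_deck(store, "abc123", {"commander": "Commander A", "cards": ["Sol Ring", "Swamp"]})
print(load_deck(store, "abc123"))
```

With a real client, `get` returns bytes by default; `json.loads` accepts bytes, so `load_deck` should work unchanged.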

CherryPy
• “A Minimalist Python Web Framework”
• Runs the website
• Pulls data from Redis and then renders the
results as HTML
• Most of the data from Redis is cached in
memory objects (IPC to Redis too slow)
• EDHREC runs 6 of these in parallel behind
an NGINX round robin proxy
• Very easy to use, doesn’t get in your way
• Very easy to expose Python data science
• Running into problems with
maintainability due to my own sloppiness
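
The idea of keeping Redis data in in-memory objects inside each web process can be sketched as a small read-through cache. This is an illustration only: the slides do not say how the cached objects are refreshed, so the time-to-live policy, the `CachedStore` name, and its parameters are assumptions. Any object with a `get(key)` method works as the backing store; a plain dict stands in for Redis here.

```python
import time


class CachedStore:
    """Keeps recently read values in process memory in front of a slower store."""

    def __init__(self, store, ttl_seconds=300.0, clock=time.monotonic):
        self._store = store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (expires_at, value)
        self.misses = 0   # number of reads that went to the backing store

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh cached copy, no round trip
        self.misses += 1
        value = self._store.get(key)
        self._cache[key] = (now + self._ttl, value)
        return value


backing = {"stats:top_cards": ["Sol Ring"]}  # a dict stands in for Redis
cached = CachedStore(backing, ttl_seconds=60.0)
cached.get("stats:top_cards")
cached.get("stats:top_cards")
print(cached.misses)  # only the first read reaches the backing store
```

With several web processes running in parallel, each process would hold its own copy, so a short TTL (or a restart after the batch jobs) bounds how stale a page can get.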

Python
• Programming language
• Plenty of good libraries for data analysis:
numpy, pandas in this case
• Can handle the “full stack” well
(from data analysis to web front end)
• PRAW is a great framework for building
Reddit bots
• Most things run every few hours

Amazon Web Services
• Infrastructure as a Service
• Easily spin up new servers with
pre-built operating system
• EDHREC runs on one m4.2xlarge
8 CPUs, 32GB RAM, Better network
10 cents per hour ($72/month)
• Great for recovering from failures
• Easy to upgrade machine
• Very good uptime so far
• Easy to back up to S3

Some observations about
User Experience and AI applications

LOL! Look at the dumb bot!
Lesson learned:
Humans LOVE pointing out when something the AI is doing is strange or wrong,
even if it gets it right 90% of the time.
Therefore, I am very conservative about what I end up publishing, as
I’ve gotten burned a few times, which can be a shame sometimes.
(just a couple examples)

The apocalypse is near
• “EDHREC is ruining EDH/Commander”
• “EDHREC is taking the fun out of deck construction”
• “EDHREC kills conversation”
MapQuest takes the fun out of planning trips!
• Mostly these are taken as compliments
• AI is going to meet resistance from people who liked the manual labor
• I don’t think the commentary is entirely off base… but...

Sometimes too much is too much
• Over-engineering and doing too much is an easy trap
• You want to make it better and provide more “intelligence”
• Give users the ability to discover and find things
• Increases user engagement
• Better results
• Philosophy: EDHREC is a tool, not a solution
• I’m starting to see my other data science projects this way
Lesson learned:
Spend more time on interactive “discovery tools”
than intelligent do-everything algorithms

Interesting related things to look at

RoboRosewater
• Rosewater is the name of Magic’s lead designer (Mark Rosewater)
• RoboRosewater is a “backwards” neural network, trained on Magic cards

MTG Finance
Lots of analysis around Magic finance!
mtgstocks.com

Virtues of this whole thing
Community
• Most hobbies are defined by communities
• Technology can bring communities together
Self-Development
• Data has value and getting data of value is hard
• Hobby-based data is relatively easy to acquire (compared to, say, data used by health care companies)
• A great way to do real data science on real data (as opposed to synthetic data on a more valuable data set)
Profit!
• Hobbyists are passionate about their hobby and willing to spend money on it
• They will pay for and support services they like
