4. 4
Rakuten Group in numbers
Rakuten in Japan
• > 12.000 employees
• > 48 billion euros of GMS (gross merchandise sales)
• > 100.000.000 users
• > 250.000.000 items
• > 40.000 merchants
https://global.rakuten.com/corp/ 2018/11/05
Rakuten Group
• Kobo 18.000.000 users
• Viki 28.000.000 users
• Viber 345.000.000 users
5. 5
Rakuten Ecosystem
Rakuten global ecosystem :
• Member-based business model that connects Rakuten services
• Rakuten ID common to various Rakuten services
• Online shopping and services;
• Main business areas: E-commerce, Internet finance, Digital content
https://global.rakuten.com/corp/about/index.html#strengths 2018/11/05
https://global.rakuten.com/corp/about/history.html 2018/11/05
Recommendation challenges
• Cross-services
• Aggregated data
• Complex users features
6. 6
Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
• Main profit sources
• Fixed fees from merchants
• Fees based on each transaction and other service
Recommendation challenges
• Many shops
• Items references
• Global catalog
7. 7
Big Data @ Rakuten
Mission: Development and operations of internal systems for:
• Recommendations
• Search
• Targeting
• User behavior tracking
Average traffic:
• > 100.000.000 events / day
• > 40.000.000 item views / day
• > 50.000.000 searches / day
• > 750.000 purchases / day
Technology stack:
• Java / Python / Ruby
• Solr / Lucene
• Cassandra / Couchbase
• Hadoop / Hive / Pig
• Redis / Kafka
8. 8
Short Bio
Education
ESPCI: engineer in Physics / Biology
ENS Cachan: MVA Master Mathematics Vision and Learning
INRIA Parietal team: PhD in Computer Science
Understanding the visual cortex by using classification techniques
Experience
Logilab – Development and data science consulting
Data.bnf.fr (French National Library open-data platform)
Brainomics (platform for heterogeneous medical data)
Rakuten PriceMinister – Senior Developer and data scientist
Data engineer and data science consulting
Rakuten – Recommendations & Personalization team lead
Lead a team of engineers, data scientists and project managers
10. 10
Do not redo it yourself !
Lots of interesting open-source libraries for all your needs
• Test first on a small POC, then contribute/develop
• Scikit-learn, pandas, Caffe, Scikit-image, opencv, ….
• Be careful: it is easy to do something wrong !
Open-data
• More and more open-data for catalogs, …
• E.g. data.bnf.fr: ~ 2.000.000 authors, ~ 200.000 works, ~ 200.000 topics
Contribute to open-source
• Unless you are doing some kind of super magical algorithm
• Is there a need / pool of potential developers ?
• Do it well (documentation / test)
• May bring you help, bug fixes, and engineers ! But it takes time and energy
11. 11
Quality in data science software engineering
Never underestimate integration cost
• Easy to write 20 lines of Python code doing some fancy Random Forests…
• …that could be hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin
Make it clean from the start (> 2 days of dev or > 100 lines of code)
• Tests, tests, tests, tests, tests, tests, tests, …
• Packaging / supervision / monitoring
• Release early, release often
• Documentation, Agile development, Pull request, code versioning
Choose the right tool
• Do you really need this super fancy NoSQL database to store your transactions?
12. 12
Monitoring and alerting: building datascience product
Hardware
(CPU, IO, …)
Software
(Errors, requests, …)
Datascience
(KPIs, …)
14. 14
Defining yourself as a data scientist
Do not try to sell yourself as a unicorn!
(and unicorns no longer exist…)
Define your skills
15. 15
Few remarks on hiring – my personal opinion
Be careful of CVs with buzzwords!
• E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …”
• It is like someone saying “IT skills: Python (for loop, if/else pattern, …)”
Hungry for data?
• Loving data is the most important thing to show
• Open data? Personal project? Curious about data? (Hackathon?)
• Multidisciplinary == knowing how to handle various datasets
Improve your IT skills
• Should be able to install/develop new libraries/algorithms
• A huge part of the job could be to format / cleanup the data
• Experience VS education -> Autonomy
17. 17
What is the GDPR?
Adopted in April 2016 and applicable as of May 25th, 2018
Replaces all the national legislation about the handling of personal data in Europe
Until now: the 1995 Directive, which had been transposed differently among the EU countries
In France, the law « Informatique et libertés » is going to be modified
There will still be other, differing sources: national case law and the doctrines of national data protection authorities (CNIL)
Still pending: adoption of the E-privacy Regulation (about cookies and OTT)
18. 18
Why GDPR ?
Why was the GDPR passed?
• Harmonisation of the European rules
• To directly target non-European companies making business with EU data
• To empower citizens and give them control over their data
Why is the GDPR important?
• Fines of up to 4% of the global annual turnover or 20 million euros. Now in France: max fine = 3 million euros
• Loss of reputation and future customers
• NGOs can bring claims on behalf of individuals
• Burden of proof is on the company
19. 19
What will the GDPR really change?
Accountability principle: fewer formalities with the DPA but more internal preparatory work (DPIA) and possibly higher fines in case of a control (on-site or online)
Mandatory Data Protection Officer (records of processing)
New obligations for data processors
Security breach notification to the DPA (and even to the users in some cases) at most 72 hours after a security incident
New user right: data portability
20. 20
When is the GDPR applicable?
Does your business collect, use or process personal data?
• No: the GDPR is not applicable (but other privacy laws may apply)
• Yes: is an office of your business in the EU?
  • Yes: the GDPR is applicable
  • No: does your business offer services to the EU?
    (Do you provide your service in any European languages? Does your service use/accept any European currency? Are EU customers specifically addressed, e.g. delivery?)
    • Yes: the GDPR is applicable
    • No: do you monitor individuals in the EU?
      (Profiling; tracking by cookies or otherwise; analysis of personal preferences / behavior)
      • Yes: the GDPR is applicable
      • No: the GDPR is not applicable (but other privacy laws may apply)
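The decision flow of this slide can be written as a small function. This is only a sketch of the slide's simplified flowchart, assuming the four questions are answered as booleans; the function and parameter names are hypothetical and it is not a substitute for a legal analysis.

```python
def gdpr_applies(processes_personal_data, has_eu_office,
                 offers_services_to_eu, monitors_eu_individuals):
    """Return True when the simplified flowchart concludes the GDPR applies."""
    # No personal data at all: the GDPR is not applicable
    # (other privacy laws may still apply).
    if not processes_personal_data:
        return False
    # Otherwise any one of the three remaining criteria is enough.
    return has_eu_office or offers_services_to_eu or monitors_eu_individuals
```

For example, a non-EU shop that processes personal data and delivers to EU customers falls under the "offers services to the EU" branch.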
23. 23
What are recommendations ?
https://www.rinapiccolo.com/piccolo-cartoons/
A recommender system seeks to
predict the "rating" or "preference" a user
would give to an item (wikipedia)
25. 25
What is personalization?
“Personalization, consists of
tailoring a service or a product
to accommodate specific
individuals, sometimes tied to
groups or segments of
individuals” (wikipedia)
https://www.rakuten.co.jp 2018/11/07
26. 26
Personalization usecases
Left column links, main widgets, top header links
• Push less but relevant content to the customer
• Push dynamic content to fit the context to the customer
• Push the most attractive content to the customer first
[Figure: page layouts built from blocks A/B, α/β/δ and 1/2, arranged differently from one layout to the next]
27. 27
What are industrial companies doing?
“Netflix member loses interest after
perhaps 60 to 90 seconds of choosing”
[source]
“Netflix recommender system is used on most
screens of the Netflix product beyond the
homepage, and in total influences choice for
about 80% of hours streamed at Netflix.”
[source]
“Already, 35 percent of what consumers purchase on Amazon
and 75 percent of what they watch on Netflix come from
product recommendations based on such algorithms.”
[source]
28. 28
Recommendations: different usages for different contexts
Best offers, Faster navigation, Serendipity, Complementary / substitute items…
https://www.rakuten.fr 2018/11/05
33. 33
Recommendation datatypes
Ratings
Numerical feedback from the users
Sources: stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain
Beware of scaling and normalization!
[Figure: sparse Users × Items matrix filled with ratings from 1 to 5]
Unitary data
Only 0/1, without any quality feedback
Sources: clicks, purchases, …
✔ Easy to obtain (e.g. tracker)
✖ No direct rating
[Figure: sparse Users × Items matrix filled only with 1s]
34. 34
Items Catalogues
Use different levels of aggregation to improve recommendations
Category-level
(e.g. food, soda, clothes, …)
Product-level
(manufactured items)
Item-in-shop level
(specific product sold by a specific shop)
Towards the category level: increased statistical power in co-events computation
Towards the item-in-shop level: easier business handling (picking the right item)
36. 36
Co-counts for binary / unitary data
Only occurrences of item views/purchases/…
Jaccard distance
Cosine similarity
Conditional probability
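The three measures listed above can all be derived from the same co-occurrence counts. The following is a minimal sketch, assuming each session is a list of item ids and that items are counted at most once per session; the function names are illustrative, and the Jaccard value is returned as a similarity (the distance is 1 minus that value).

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_counts(sessions):
    """Count in how many sessions each item, and each pair of items, appears."""
    item_counts, pair_counts = Counter(), Counter()
    for session in sessions:
        items = sorted(set(session))          # one count per session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def similarities(a, b, item_counts, pair_counts):
    """Similarity measures between items a and b built on the co-counts."""
    n_ab = pair_counts[tuple(sorted((a, b)))]
    n_a, n_b = item_counts[a], item_counts[b]
    return {
        "jaccard": n_ab / (n_a + n_b - n_ab),  # |A and B| / |A or B|
        "cosine": n_ab / sqrt(n_a * n_b),
        "p_b_given_a": n_ab / n_a,             # asymmetric: P(b | a)
    }
```

Note that the conditional probability is the only asymmetric measure of the three, which matters when "A then B" should not be treated like "B then A".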
37. 37
Co-occurrences and Similarities Computation
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing item similarities
Multiple possible parameters:
• Size of the time window to be considered:
Do browsing and purchase data reflect similar behavior?
• Threshold on co-occurrences
Is one co-occurrence significant enough to be used? Two? Three?
• Symmetric or asymmetric
Is the order important in the co-occurrence? A then B == B then A?
• Similarity metrics
Which similarity metric should be computed from the co-occurrences?
42. 42
Algorithm 2 - Matrix factorization
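[Figure: the sparse Users × Items rating matrix is approximated (~) by the product (X) of a Users matrix of latent vectors (e.g. -0.7 1 0.4 … 2.3 0.2 -0.3) and an Items matrix of latent vectors (e.g. 0.5 0.3 … 1.2 … 1.2 -0.2 … -3.2)]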
• Choose a number of latent variables to decompose the data
• Predict new rating using the product of latent vectors
• Use gradient descent techniques (e.g. SGD)
• Add some regularization
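The four steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the code used in production: ratings are assumed to be (user index, item index, rating) triples, and the hyper-parameters (`n_factors`, `lr`, `reg`, `n_epochs`) are arbitrary values chosen for the example.

```python
import numpy as np


def rmse(ratings, P, Q):
    """Root mean squared error of the predictions P[u] . Q[i]."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.05, reg=0.02, n_epochs=500, seed=0):
    """Learn user and item latent vectors with regularized SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user vectors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item vectors
    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - P[u] @ Q[i]           # error on one known rating
            p_u = P[u].copy()               # keep the old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
```

A missing rating for user `u` and item `i` is then predicted with `P[u] @ Q[i]`.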
43. 43
Matrix factorization – MovieLens example
Read files
import csv
# movie id -> title (the header row is read too, which is harmless here)
movies_fname = '/path/ml-latest/movies.csv'
with open(movies_fname, newline='') as fobj:
    movies = dict((r[0], r[1]) for r in csv.reader(fobj))
# (user id, movie title, rating) triples
ratings_fname = '/path/ml-latest/ratings.csv'
with open(ratings_fname, newline='') as fobj:
    header = next(fobj)  # skip the header line
    ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]
Build sparse matrix
import scipy.sparse as sp
user_idx, item_idx = {}, {}
data, rows, cols = [], [], []
for u, i, s in ratings:
    # setdefault gives each new user / item the next free row / column index
    rows.append(user_idx.setdefault(u, len(user_idx)))
    cols.append(item_idx.setdefault(i, len(item_idx)))
    data.append(s)
ratings = sp.csr_matrix((data, (rows, cols)))
reverse_item_idx = dict((v, k) for k, v in item_idx.items())
reverse_user_idx = dict((v, k) for k, v in user_idx.items())
44. 44
Matrix factorization – MovieLens example
Fit Non-negative Matrix Factorization
from sklearn.decomposition import NMF
nmf = NMF(n_components=50)
user_mat = nmf.fit_transform(ratings)  # users x components
item_mat = nmf.components_             # components x items
Plot results
component_ind = 3
# Movies with a non-zero weight in the chosen component
component = [(reverse_item_idx[i], s)
             for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
    print(movie, round(score, 0))
Terminator 2: Judgment Day (1991) 24.0
Terminator, The (1984) 23.0
Die Hard (198 19.0
Aliens (1986) 17.0
Alien (1979) 16.0
Exorcist, The (1973) 8.0
Halloween (197 7.0
Nightmare on Elm Street, A (1984) 7.0
Shining, The (1980) 7.0
Carrie (1976) 7.0
Star Trek II: The Wrath of Khan (1982) 10.0
Star Trek: First Contact (1996) 10.0
Star Trek IV: The Voyage Home (1986) 9.0
Contact (1997) 8.0
Star Trek VI: The Undiscovered Country (1991) 8.0
Blade Runner (1982) 8.0
46. 46
Content-based: what should we use ?
Attribute-based
• Encoded features (e.g. one-hot encoding)
• Represent documents in the feature space
• Find similar documents (k-NN, k-d tree, …)
Content-based
• Encoded textual content of documents
• Represent textual content in an embedding space
• Find similar documents (k-NN, k-d tree, …)
Can be linked to Search Engine
https://www.rakuten.fr 2018/11/05
47. 47
Example of feature: Named entities in product description
Sample of code with Polyglot
from polyglot.text import Text
# blob: raw text of the product description
text = Text(blob)
for sent in text.sentences:
    print(sent, "\n")
    for entity in sent.entities:
        print(entity.tag, entity)
(Sentence("A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle, Marcus Goldman, jeune écrivain à succès, est dans..."), '\n')
(u'I-LOC', I-LOC([u'New', u'York']))
(u'I-PER', I-PER([u'Marcus', u'Goldman']))
(Sentence("Lire la suite la tourmente : il est incapable d'écrire le nouveau roman qu'il doit remettre à son éditeur d'ici quelques mois."), '\n')
(Sentence("Le délai est près d'expirer quand soudain tout bascule pour lui : son ami et ancien professeur d'université, Harry Quebert, l'un des écrivains les plus respectés du pays, est rattrapé par son passé et se retrouve accusé d'avoir assassiné, en 1975, Nola Kellergan, une jeune fille de 15 ans, avec qui il aurait eu une liaison."), '\n')
(u'I-PER', I-PER([u'Harry', u'Quebert']))
(u'I-PER', I-PER([u'Nola', u'Kellergan']))
https://fr.shopping.rakuten.com/mfp/3011174/la-verite-sur-l-affaire-harry-quebert-joel-dicker-livre?pid=171972011 2018/11/05
48. 48
Word2vec: two-layer neural network
Distributed representation of words:
• Continuous bag-of-words: predict current word from surrounding words only
• Skip-gram: use current word to predict surrounding words
https://skymind.ai/wiki/word2vec
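The difference between the two architectures is easiest to see in the training examples they are fed. The sketch below only builds those examples from a tokenized sentence (it does not train a network); the function names and the `window` parameter are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (current word, one surrounding word) pairs."""
    pairs = []
    for i, current in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Continuous bag-of-words: (surrounding words, current word) examples."""
    examples = []
    for i, current in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, current))
    return examples
```

Skip-gram produces one training example per (word, neighbor) pair, whereas CBOW produces one example per word with the whole context as input.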
50. 50
Word2vec: Code sample
import string
import unidecode
import requests
from bs4 import BeautifulSoup
# References of the texts to download from abu.cnam.fr
refs = ["annee1", "artgrdp1", "ballades1", "bugjarg1", "contemplA2",
        "contemplB2", "feuilles1", "hugoshak1", "legend1", "legendet21",
        "nddp1", "oriental1", "quatrevt1", "rayons1", "ruesboi1", "satan1"]
textes = ""
for ref in refs:
    print(ref)
    res = requests.get("http://abu.cnam.fr/cgi-bin/donner_html?%s" % ref)
    text = BeautifulSoup(res.content, 'lxml').text
    # Keep only the body of the text, between the two file markers
    textes += text.split("DEBUT DU FICHIER")[1].split("FIN DU FICHIER")[0]
# Split into sentences on "." and ";", dropping very short fragments
sentences = [t for t in textes.replace("\n", " ").replace("\r", " ").split(".") if len(t) > 20]
sentences = [t.lower().strip() for sentence in sentences for t in sentence.split(";")]
# Tokenize, strip punctuation, remove accents
words = [[t.strip().strip(string.punctuation) for t in sentence.split()] for sentence in sentences]
words = [[unidecode.unidecode(t) for t in word] for word in words]
Model learning with Gensim
import gensim
# gensim >= 4.0 names this parameter vector_size; older releases called it size
model = gensim.models.Word2Vec(words, min_count=5, vector_size=500, sg=1)
52. 52
Convolutional neural network (CNN)
https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html
• Feature-sharing
• Not hand-designed
• It boosted the performance of algorithms in many tasks!
• Data driven (but you need data!)
54. 54
Prod2vec: purchases session as a sentence
E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.
[Figure: a sequence of purchases dated 08/09/2015, 10/09/2015, 08/11/2015, 08/11/2015 and 24/11/2015 ≈ a sequence of words: “Tut, I have lost myself; I am not here; This is not Romeo, he's some other where.”]
Apply Word2Vec on a sentence of “purchases”
55. 55
Prod2vec: Theory
Prod2vec learns a low-dimensional embedding representation
of products using the skip-gram model
E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.
Objective function
(S is the set of sessions)
Probability of seeing the neighboring product p_{i+j} given product p_i
v and v’ are the input and output vector representations (that should be learned).
Similar products should be close in the vector space.
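The formulas of this slide were images and are not part of the text. For reference, the skip-gram objective as formulated in the cited paper can be written as below; the symbols c (length of the context window around a product) and P (number of unique products) come from the paper and are not defined on the slide.

```latex
\mathcal{L} = \sum_{s \in S} \; \sum_{p_i \in s} \; \sum_{-c \le j \le c,\ j \ne 0} \log P(p_{i+j} \mid p_i)

P(p_{i+j} \mid p_i) = \frac{\exp\left(v_{p_i}^{\top} v'_{p_{i+j}}\right)}{\sum_{p=1}^{P} \exp\left(v_{p_i}^{\top} v'_{p}\right)}
```

The objective is maximized, so products that often appear in the same purchase sessions end up with similar vectors.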
59. 59
LTR in a Nutshell
Re-orders the recommendations
based on the features
Store features
Offline Machine
Learning
Store model
Real time ML
Recommend
API delivery
https://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/45530.pdf
61. 61
Problem setting
ts  item  ritem  features               click
1   A     B      .1 .3 .2 .9 .4 .0 .2   0
1   A     C      .9 .7 .6 .0 .0 .3 .6   0
1   A     D      .4 .8 .6 .3 .2 .1 .0   1
2   B     A      .3 .1 .5 .7 .1 .9 .1   0
2   B     E      .1 .5 .2 .3 .2 .8 .7   1
Learn any model (linear, non-linear, …)
62. 62
Learning to Rank approaches
• Pointwise approach: consider each (document, target) pair separately (target: clicked, purchased, …).
• Pairwise approach: consider the order of two documents and minimize inversion errors
• Listwise approach: consider all documents, and try to optimize the overall/average score
[Figure: ranked list of six recommendations: 1 not clicked, 2 clicked, 3 not clicked, 4 clicked, 5 clicked, 6 not clicked]
https://www.rakuten.fr 2018/11/05
63. 63
Learning to Rank approaches algorithm
Example: three documents, 1 (not clicked), 2 (clicked), 3 (not clicked)
• Pointwise approach: regression problem, predict the document score
  document 1 == (feature11, feature21, …, 0)
  document 2 == (feature12, feature22, …, 1)
  document 3 == (feature13, feature23, …, 0)
• Pairwise approach: classification problem, which document is better
  documents 1 and 2 == (feature11, feature21, feature12, feature22…, 0)
  documents 1 and 3 == (feature11, feature21, feature13, feature23…, 1)
  documents 2 and 3 == (feature12, feature22, feature13, feature23…, 2)
• Listwise approach: more complex problem… optimize the value of one of the evaluation measures
  documents 1, 2 and 3 == (feature11, feature21, feature12, feature22…, (2, 1, 3))
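How the same click log turns into a pointwise or a pairwise training set can be sketched as follows. This is a simplified illustration, assuming each document is a (feature list, clicked flag) tuple; unlike the three-class encoding of the slide, the pairwise helper below skips ties and uses a binary label, which is a common simplification rather than the approach shown above.

```python
from itertools import combinations


def pointwise_samples(docs):
    """One sample per document: (features, target)."""
    return [(features, clicked) for features, clicked in docs]


def pairwise_samples(docs):
    """One sample per pair of documents with different targets.

    The features of both documents are concatenated; the label is 1 when
    the first document of the pair is the better one, 0 otherwise.
    """
    samples = []
    for (feat_a, click_a), (feat_b, click_b) in combinations(docs, 2):
        if click_a == click_b:
            continue  # ties carry no ordering information in this sketch
        samples.append((feat_a + feat_b, 1 if click_a > click_b else 0))
    return samples
```

Any regressor can be trained on the pointwise samples and any binary classifier on the pairwise ones.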
66. 66
Recommendation Quality Challenges
Recommendations categories
• Cold start issue
• External data ?
• Cross-services ?
• Hot products (A)
• Top-N items ?
• Short tail (B)
• Long tail (C + D)
[Figure: products split into four quadrants, (A) to (D), along two axes: minor vs. major (popular) product, and new vs. old product]
67. 67
Offline Evaluation
Pros/Cons
• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPI
Approaches
• Rescoring
• Prediction game
• Business simulator
68. 68
Public Initiative – Viki Recommendation Challenge
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
567 submissions from 132 participants
69. 69
A/B Testing
Track users’ interaction
with the AB-test variants
Compute statistical tests
Choose which version
to put in production
A B
control variation
70. 70
A/B Statistical test
Do not peek at A/B tests or stop them before the end
Sample size should be fixed in advance
[Figure ✗: A/B tests A, B and C are checked at 500, 1000 and 1500 samples; two are stopped early because not significant, one is kept because significant]
[Figure ✓: A/B tests A, B and C all run up to 1500 samples; two are kept because significant]
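Once the sample size fixed in advance has been reached, the comparison of two variants can be done with a single statistical test. The sketch below uses a two-sided z-test on two conversion rates, which is one possible choice among others; the function name and inputs are illustrative.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing the conversion rates of A and B.

    conv_a / conv_b: number of conversions, n_a / n_b: number of samples.
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)
```

The test is meant to be computed once, at the planned sample size: running it repeatedly while the A/B test is live and stopping at the first significant result inflates the false-positive rate.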
72. 72
Continuous A/B test process
Short ABtest
2 or 3 days
L1 (control): CVRia 0.21
L2: CVRia 0.03
L3: CVRia 0.7
Short ABtest
2 or 3 days
L3 (control): CVRia 0.065
L1 (old control): CVRia 0.03
L4: CVRia 0.09
Short ABtest
2 or 3 days
L4 (control): CVRia 0.07
L3 (old control): CVRia 0.72
L2: CVRia 0.03
Expected output: step-by-step increase of CVRia / orders
73. 73
Multi-armed bandit
Testing different models without knowing their outcome
- Explore the different models to estimate their rewards
- Exploit the best model known so far
It costs to test hypotheses ( == explore)
[Figure: six successive trials among Model A, Model B and Model C; the model played at each trial returns a reward: 1.1, 0.5, 0.8, 0.2, 0.9, 1.8]
74. 74
Epsilon-greedy strategy
Many other strategies (e.g. see Wikipedia)…
• Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase.
• Epsilon-decreasing strategy: The value of epsilon decreases over time.
…and different bandits:
• Contextual bandit: at each iteration, we have access to a vector of contextual features.
• Constrained contextual bandit: a total budget is associated with the bandit.
[Figure: with probability 1-ε the best model is served; with probability ε one of the models 1 … n is served, each with probability 1/n]
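The epsilon-greedy strategy fits in a few lines. The class below is a minimal sketch, assuming rewards are plain numbers (e.g. 1 for a click, 0 otherwise) and that the estimated value of a model is the running mean of its rewards; the class and method names are illustrative.

```python
import random


class EpsilonGreedy:
    """Serve the best model so far, except with probability epsilon."""

    def __init__(self, n_models, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_models      # times each model was served
        self.values = [0.0] * n_models    # mean reward of each model
        self.rng = random.Random(seed)

    def select(self):
        # Explore: pick any model uniformly at random (1/n each).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Exploit: pick the model with the best mean reward so far.
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, model, reward):
        # Incremental update of the running mean.
        self.counts[model] += 1
        self.values[model] += (reward - self.values[model]) / self.counts[model]
```

An epsilon-decreasing variant only needs `self.epsilon` to be lowered over time, for instance inside `update`.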
76. 76
What is Record Linkage?
“Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases)” (Wikipedia)
Usage for Recommendations
• Global catalog
• Items aggregation
• Helps with cold start issues
• Improved navigation
[Figure: Marketplace 1, Marketplace 2 and a reference dataset linked together]
77. 77
Linked Open Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
78. 78
Semantic-web and RDF format
Triples: <subject> <relation> <object>
URI: unique identifier
http://dbpedia.org/page/Terminator_2:_Judgment_Day
79. 79
Record linkage for global recommendations
• Linking products together in a service
• Feature Engineering - Generate hierarchies of products
• ✔ Improve statistical power of co-events computation by aggregation
• ✔ If based on text, may be used for content-based recommendations
• Based on record linkage techniques (e.g. MinHashing)
• Linking products together between services
• Having a unique product id across services
• Use recommendations from one service, for another new service
• ✔ Avoid cold start issue
• Cross-services recommendations
• ✔ Show items from a service on another service pages
• Linking products to an external database (Wikidata)
• More info for item enrichment, UI, content-based recommendations
80. 80
Record linkage – The big picture
1. Inputs: Dataset 1 and Dataset 2 (e.g. title, categories, price)
2. Blocking (e.g. minhashing on the titles): Block 1, Block 2, …, Block n, each made of a subset of Dataset 1 and a subset of Dataset 2
3. For each block: comparisons (distances computation) between the subset of Dataset 1 and the subset of Dataset 2, based on attributes with specific distances
4. Links creation:
Data 1_1 == Data 2_n
Data 1_2 == Data 2_p
Data 1_3 == Data 2_1
…
81. 81
Record linkage – Naive approach
Match items from one dataset to the other using distances
Levenshtein
Jaccard
Python difflib
Based on an algorithm published in the late 1980’s by
Ratcliff and Obershelp under the hyperbolic name
“gestalt pattern matching.” (see doc)
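The naive approach can be tried directly with the standard library. The sketch below relies on `difflib.SequenceMatcher`; the helper names, the lower-casing and the 0.6 threshold are arbitrary choices made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratcliff/Obershelp similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(title, candidates, threshold=0.6):
    """Return the candidate most similar to title, or None below threshold.

    candidates must not be empty.
    """
    score, candidate = max((similarity(title, c), c) for c in candidates)
    return candidate if score >= threshold else None
```

Every title of one dataset has to be compared with every title of the other one, which is what makes this approach naive on large catalogs.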
82. 82
Shingles and documents representation
Shingles ~ word n-grams
Split documents into a list of groups of words
A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle, Marcus Goldman, jeune écrivain à succès, est dans…
E.g. word 3-grams
• A New York,
• New York, au
• …
• à succès, est
• succès, est dans
Use the Jaccard distance between sets of shingles
But…
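Shingling and the Jaccard comparison can be sketched as follows. The tokenization (lower-casing, keeping only word characters) is an assumption made for the example; the function returns the Jaccard similarity, the distance being 1 minus that value.

```python
import re


def shingles(text, n=3):
    """Set of word n-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two product descriptions that share most of their shingles get a similarity close to 1, whatever their length.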
83. 83
Record linkage – Problem
Combinatorial explosion
10^6 items × 10^6 items = 10^12 comparisons
• Use blocking, i.e. divide-and-conquer approaches: n-gram indexes, clustering, minhashing…
• Apply more computationally expensive approaches on each block
84. 84
Minhashing - Theory
How to compare sets of shingles of different sizes? MINHASHING!
1. Compute the hash of every shingle of a document
2. Keep the minimum hash value
3. Repeat for 200 different hash functions
Each document is now represented by 200 values, whatever its number of shingles
Why does it work?
If two documents have the same minimum hash value for a given hash function, they share the shingle that produced it (barring hash collisions).
[Figure: Document 1 and Document 2 → shingles (Shingle 1, Shingle 2, …) → hash values (Value1, Value2, … / Value1’, Value2’, …) → minimum hash value of each document]
85. 85
Minhashing - Theory
We just have to keep the minimum hash values for each document
Huge computational and storage boost!
Randomly picking 200 shingles and comparing them between 2 documents
≈
Storing and comparing the minimum values of 200 different hash functions
Jaccard(doc1, doc2) ≈ #{hash functions h where Minhash_h(S1) == Minhash_h(S2)} / nbhash
But we still have to compare all documents together
(even if the comparison is way faster)
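A minimal minhashing sketch is given below. The 200 hash functions are simulated by salting MD5 with a seed, which is a convenient assumption for the example and not necessarily what a production system would use; shingle sets are assumed to be non-empty.

```python
import hashlib


def _hash(shingle, seed):
    """Hash number `seed` of a shingle, as a 64-bit integer."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, n_hashes=200):
    """For each hash function, keep the minimum value over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(n_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Share of hash functions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The signatures have a fixed length, so comparing two documents costs 200 integer comparisons regardless of how long the documents are.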
86. 86
Locality-Sensitive Hashing
What is locality-sensitive hashing (LSH)?
It is a hash such that similar vectors tend to get similar hash values
It generates ‘bands’ of documents, where documents within a band are more or less similar and should be compared with minhashing
[Figure: pipeline Shingles → Minhashing → Locality-Sensitive Hashing: each document is split into shingles (Shingle 1, Shingle 2, …), then summarized by Minhash 1 … Minhash 200, then grouped with similar documents (Documents 1, 4, 3, 5, …); a minhashing comparison is run inside each group]
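One common way to build such groups is the banding technique: each minhash signature is cut into bands, and two documents become candidates as soon as they agree on a whole band. The sketch below assumes signatures are plain lists of integers keyed by a document id; the function name and the `n_bands` parameter are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, n_bands):
    """Pairs of documents sharing at least one identical signature band."""
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        rows = len(signature) // n_bands      # values per band
        for band in range(n_bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs
```

Only the candidate pairs then go through the full minhashing comparison, which avoids comparing all documents together.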
87. 87
Minhashing + LSH - Results
Marketplace 1 Marketplace 2
https://www.wikidata.org/wiki/Q170564
bg Терминатор 2: Денят на страшния съд
el Εξολοθρευτής 2: Μέρα Κρίσης
en Terminator 2: Judgment Day
es Terminator 2: el juicio final
fr Terminator 2 : Le Jugement dernier
ja ターミネーター2
ka ტერმინატორი 2: განკითხვის დღე
Director: James Cameron
Cast member: Arnold Schwarzenegger
Cast member: Edward Furlong
Follows: The Terminator
Genre: action film
Main subject: time travel, android
Narrative location: Los Angeles
https://item.rakuten.co.jp/auc-tecc/10016937/ 2018/11/07
https://fr.shopping.rakuten.com/mfp/5705594/terminator-2-2 2018/11/07
90. 90
Challenge – Task 1
Predict if a review is useful for other users or not
May be used to boost interesting reviews on the website
Classification (#useful / #total > 0.5) or Regression task on textual features
https://fr.shopping.rakuten.com 2016/10/10
91. 91
Challenge – Task 2
Predict the number of stars given by a user from his/her review
May be used to detect fraud and help improve the quality of the website
Regression task (6 discrete values) on textual features
https://fr.shopping.rakuten.com 2016/10/10
92. 92
Data samples
product: b57c06ed94773c4d08bcefcdf8cbedd846bbdcba8d669a15d511b9acb92efeb43
review_title: “why not !”
review_content: “Yess! est un jeu de communication réussi, intuitif, malin assez rapide avec peu de temps mort. Voilà un jeu
au rapport plaisir/prix qui est bien placé.”
review_note: 4
feedback_positive_count: 0
feedback_negative_count: 0
product: 12d1407836441fc39805916ecb705604bcb539bd70b477c82d57f5043977102
review_title: “Lave linge parfait”
review_content: “Seul point négatif : la mise en réseau de la machine...Mais ce n'est pas ce que l'on recherche le plus dans
un lave linge”
review_note: 4
feedback_positive_count: 0
feedback_negative_count: 0
93. 93
Expected results
The expected results are probabilities of being useful (Class 1)
Proba: 0.99
Je vien d'accerir ses enceintes d'une qualité de son incroyable . Des basses profondes et puissantes . Pour
amoureux de son !
Proba: 0.0611228262615
Parfait PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT
PARFAIT
94. 94
Baseline code example
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import KFold
STOPWORDS = ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec']  # …
fname = '/path/to/reviews_test_clean 2.csv'
df = pd.read_csv(fname)
# Share of positive feedback; NaN when a review received no feedback at all
df['feedback_ratio'] = df['feedback_positive_count'].astype(float) / (
    df['feedback_positive_count'] + df['feedback_negative_count'])
df['feedback_class'] = df['feedback_ratio'] >= 0.5
X = df['review_content'].values
Y = df['feedback_class'].values
kf = KFold(n_splits=5)
all_pred, all_test = [], []
for train, test in kf.split(X):
    Xtrain, Xtest = X[train], X[test]
    Ytrain, Ytest = Y[train], Y[test]
    # Fit the vocabulary on the training fold only
    vect = TfidfVectorizer(stop_words=STOPWORDS)
    Xtrain = vect.fit_transform(Xtrain)
    Xtest = vect.transform(Xtest)
    clf = linear_model.SGDClassifier()
    clf.fit(Xtrain, Ytrain)
    pred = clf.predict(Xtest)
    all_pred.extend(pred)
    all_test.extend(Ytest)
96. 96
Claims predictions in E-commerce
Claims have a huge impact in terms of user experience + cost
Claims can be dealt with differently following the different cases (broken, fake, …)
Predict the probability that a transaction leads to a claim
-> focus on risky transactions.
Possibly a huge impact in the whole E-commerce field!
97. 97
Dataset
• ID: identifier of the sample
• SHIPPING_MODE: mode of shipping of the product
• (RECOMMANDE, NORMAL, …)
• SHIPPING_PRICE: cost of shipping, if existing
• (<1, 1<5, 5<10, 10<20, >20)
• WARRANTIES_FLG: True if a warranty has been taken by the buyer
• WARRANTIES_PRICE: Price of warranty, if existing
• (<5, 5<20, 20<50, 50<100, 100<500, >500)
• CARD_PAYEMENT: transactions paid by card
• COUPON_PAYEMENT: transactions paid with a discount coupon
• RSP_PAYEMENT: transactions paid with Rakuten Super Points
• WALLET_PAYMENT: transactions paid with PriceMinister-Rakuten wallet
• PRICECLUB_STATUS: status of the buyer
• (UNSUBSCRIBED, PLATINUM, …)
• REGISTRATION_DATE: year of registration of the buyer
• PURCHASE_COUNT: binning of the buyer's previous purchase count
• (<5, 5<20, 20<50, 50<100, 100<500, >500)
• BUYER_BIRTHDAY_DATE: year of birth of the buyer
• BUYER_DEPARTMENT: department of the buyer or -1
• BUYING_DATE: year and month of the purchase
• SELLER_SCORE_COUNT: binning of the seller's previous sales count
• (<100, 100<10^3, 10^3<10^4, 10^4<10^5, 10^5<10^6, >10^6)
• SELLER_SCORE_AVERAGE: score of the seller on PriceMinister-Rakuten
• SELLER_COUNTRY: country of the seller
• (FRANCE METROPOLITAN, CHINA, …)
• SELLER_DEPARTMENT: department of the seller or -1
• PRODUCT_TYPE: type of the purchased product
• (TOYS, CELLPHONE_ACCESSORY, …)
• PRODUCT_FAMILY: family of the purchased product
• (ELECTRONICS, BABY, …)
• ITEM_PRICE: binning of the price of the purchased product
• (<10, 10<20, 20<50, 50<100, 100<500, 500<1000, 1000<5000, >5000)
98. 98
Challenges
Complex interactions that involve multiple factors (probably not all contained in the features) and subjective information (the same shop does not always send broken products…)
Unbalanced classes!
Categorical features + numerical features (beware of ranges!)
Find some socio-demographics/behavioral features (e.g. based on country)
99. 99
Baseline
Metric: AUC weighted metric (from sklearn)
Calculate metrics for each label, and find their average, weighted by support (the number of true instances
for each label).
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
Algorithm used for benchmarks (naive and classic!)
• Random forests classifier (from sklearn), with 200 estimators
• Classical preprocessors (from sklearn): OneHotEncoder, LabelEncoder
Result obtained: 0.574 AUC weighted metric.
100. 100
Baseline code example
from collections import defaultdict
import pandas as pd
import numpy as np
import sklearn.preprocessing
from scipy.sparse import csr_matrix, hstack
xtrain_df = pd.read_csv('training_X.tsv', delimiter='\t')
xtest_df = pd.read_csv('test_X.tsv', delimiter='\t')
ytrain_df = pd.read_csv('training_Y.tsv', delimiter='\t')
ytest_df = pd.read_csv('test_Y.tsv', delimiter='\t')
CATS = ['WARRANTIES_FLG', 'SHIPPING_MODE']  # …
# Map every categorical value seen in train or test to an integer code
LABELS = defaultdict(set)
for cat in CATS:
    d = set(xtrain_df[cat].unique()).union(set(xtest_df[cat].unique()))
    LABELS[cat] = dict((v, i) for i, v in enumerate(d))
NUMERICAL = ['CARD_PAYMENT', 'COUPON_PAYMENT']  # …
ENCODERS = dict()
COLUMNS = []
def create_matrix(df, COLUMNS=None):
    print(df.shape)
    # DataFrame.to_sparse() was removed from pandas: build the sparse block directly
    X = csr_matrix(df[NUMERICAL].values.astype(float))
    if COLUMNS is not None:
        COLUMNS.extend(NUMERICAL)
    for cat in CATS:
        if COLUMNS is not None:
            for v in sorted(LABELS[cat].items(), key=lambda x: x[1]):
                COLUMNS.append('%s (%s)' % (cat, v[0]))
        data = np.ravel([LABELS[cat][v] for v in df[cat]])
        data = np.reshape(data, [data.size, 1])
        if cat in ENCODERS:
            # Encoder already fitted (on the training set): reuse it
            data = ENCODERS[cat].transform(data)
        else:
            oenc = sklearn.preprocessing.OneHotEncoder()
            data = oenc.fit_transform(data)
            ENCODERS[cat] = oenc
        X = hstack((X, data))
    return X
# Called after the definition: the training set fits the encoders, the test set reuses them
Xtrain = create_matrix(xtrain_df, COLUMNS)
Xtest = create_matrix(xtest_df)
101. 101
Baseline code example
Binary Classifier
from sklearn.ensemble import RandomForestClassifier
# Label 0 is kept as "no claim"; every other claim type becomes 1
lenc = sklearn.preprocessing.LabelEncoder()
Ytrain = lenc.fit_transform(ytrain_df['CLAIM_TYPE'])
Ytrain[Ytrain != 0] = 1
lenc = sklearn.preprocessing.LabelEncoder()
Ytest = lenc.fit_transform(ytest_df['CLAIM_TYPE'])
Ytest[Ytest != 0] = 1
clf = RandomForestClassifier(n_estimators=50)
clf.fit(Xtrain.toarray(), Ytrain)
pred = clf.predict(Xtest.toarray())
Multiclass Classifier
from sklearn.linear_model import SGDClassifier
ytrain_df.loc[ytrain_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'
ytest_df.loc[ytest_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'
# One integer code per claim type seen in train or test
STATUS = set(ytrain_df['CLAIM_TYPE']).union(set(ytest_df['CLAIM_TYPE']))
STATUS = dict((v, i) for i, v in enumerate(STATUS))
Ytrain = np.array([STATUS[v] for v in ytrain_df['CLAIM_TYPE']])
Ytest = np.array([STATUS[v] for v in ytest_df['CLAIM_TYPE']])
# loss='log' in older scikit-learn releases; renamed 'log_loss' since 1.1
clf = SGDClassifier(n_jobs=4, loss='log_loss')  # , class_weight='auto')  # weights)
clf.fit(Xtrain, Ytrain)
pred = clf.predict(Xtest)
proba = clf.predict_proba(Xtest)
103. 103
Search principles
The goal of search is to help users efficiently find the most relevant documents for a given query.
• Documents
• Depend on how the data is modeled
• Marketplace: product, offer (product sold by a merchant), SKU (variation of a product), …
• Video streaming: movie, tv series, tv episode, …
• Query
• Terms: what goes in the search box
• Filters: navigation items
• Relevancy
• Based on the data: by price, by freshness, …
• Based on user behavior: clicks, purchases, …
• Based on text semantics: entity extraction, …
• Based on corpus statistics: term frequencies, …
• Efficiency
• Low response time
• Assistance: spellcheck, autocomplete, …
104. 104
Search principles
• Most existing search systems are based on indices
• Same as the indices found at the end of books
• Lucene is the best-known library to handle these
• Consider the query “the best search engines”
• First, break up the query into terms, remove common ones, and normalize
• Yields [“best”, “search”, “engine”]
• Look up each term in the index dictionary
• Yields a list of documents per term (called an inverted list)
• Find the documents common to all lists
• Sort the results in the desired order
(Image from https://stackoverflow.com/questions/17272050/book-index-page-layout-using-html5-and-css)
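The query walk-through of this slide can be reproduced with a toy inverted index. This is a deliberately naive sketch: the stop-word list and the plural-stripping rule are simplistic stand-ins for what a library such as Lucene does with analyzers, and all names are illustrative.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of"}


def normalize(text):
    """Break a text into terms, remove common ones, and normalize."""
    terms = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]        # crude plural removal
        terms.append(token)
    return terms


def build_index(documents):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Ids of the documents containing every term of the query, sorted."""
    lists = [index.get(term, set()) for term in normalize(query)]
    if not lists:
        return []
    return sorted(set.intersection(*lists))
```

Here the results are sorted by document id; a real engine would sort them by relevancy (price, freshness, user behavior, corpus statistics, …).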
105. 105
Search, big data and e-commerce
• Indexing challenges
• Volume of data: large marketplaces contain a lot of documents
• Update rate: information can change fast; price, inventory
• Search challenges
• Query rate: lots of end-users querying at the same time
• Features: not only document retrieval, but also navigation, statistics, …
• Relevancy challenges
• Documents do not usually contain natural language
• More susceptible to spam and merchants trying to game the system
• Need to balance well-selling items with discovery, especially for newer releases of popular products
• Multi-language support for global marketplaces
• Operation challenges
• Very large scale systems (2000+ nodes) need robust deployment and monitoring
• Resources distribution (models, linguistic resources)
107. 107
Datascience everywhere !
Rakuten provides marketplaces worldwide
Specific challenges for recommendations
Items catalogue: reinforce statistical power of co-occurrences across shops and services;
Items similarities: find the right parameters for the different use-cases;
Recommendation models: what are the best models for in-shop, all-shops, personalization?
Evaluation: how to handle the long tail? How to compare different models?
108. 108
We are Hiring!
Positions
http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to business
• Python, Java, Hadoop, Couchbase, Cassandra…
Also hiring: search engine developers, big data system administrators, etc.
109. 109
Thanks !
Questions ?
More on Rakuten tech initiatives
http://www.slideshare.net/rakutentech
http://rit.rakuten.co.jp/oss.html
http://rit.rakuten.co.jp/opendata.html