Datascience in E-commerce industry
Vincent MICHEL, Big Data EU,
Rakuten.Inc
vincent.michel@mail.rakuten.com
3
Rakuten Group Worldwide
https://global.rakuten.com/corp/about/index.html#strengths 2018/11/05
Recommendation challenges
• Different languages
• User behavior
• Business areas
4
Rakuten Group in numbers
Rakuten in Japan
• > 12.000 employees
• > 48 billion euros of GMS
• > 100.000.000 users
• > 250.000.000 items
• > 40.000 merchants
https://global.rakuten.com/corp/ 2018/11/05
Rakuten Group
• Kobo 18.000.000 users
• Viki 28.000.000 users
• Viber 345.000.000 users
5
Rakuten Ecosystem
Rakuten global ecosystem :
• Member-based business model that connects Rakuten services
• Rakuten ID common to various Rakuten services
• Online shopping and services;
• Main business areas: E-commerce, Internet finance, Digital content
https://global.rakuten.com/corp/about/index.html#strengths 2018/11/05
https://global.rakuten.com/corp/about/history.html 2018/11/05
Recommendation challenges
• Cross-services
• Aggregated data
• Complex user features
6
Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
• Main profit sources
• Fixed fees from merchants
• Fees based on each transaction and other service
Recommendation challenges
• Many shops
• Item references
• Global catalog
7
Big Data @ Rakuten
Mission: Development and operations of internal systems for:
• Recommendations
• Search
• Targeting
• User behavior tracking
Average traffic:
• > 100.000.000 events / day
• > 40.000.000 item views / day
• > 50.000.000 searches / day
• > 750.000 purchases / day
Technology stack:
• Java / Python / Ruby
• Solr / Lucene
• Cassandra / Couchbase
• Hadoop / Hive / Pig
• Redis / Kafka
8
Short Bio
Education
ESPCI: engineer in Physics / Biology
ENS Cachan: MVA Master Mathematics Vision and Learning
INRIA Parietal team: PhD in Computer Science
Understanding the visual cortex by using classification techniques
Experience
Logilab – Development and data science consulting
Data.bnf.fr (French National Library open-data platform)
Brainomics (platform for heterogeneous medical data)
Rakuten PriceMinister – Senior Developer and data scientist
Data engineer and data science consulting
Rakuten – Recommendations & Personalization team lead
Lead a team of engineers, data scientists and project managers
9
Software engineering
Lessons learned from (painful) experiences
10
Do not redo it yourself !
Lots of interesting open-source libraries for all your needs
• Test first on a small POC, then contribute/develop
• Scikit-learn, pandas, Caffe, Scikit-image, opencv, ….
• Be careful: it is easy to do something wrong !
Open-data
• More and more open-data for catalogs, …
• E.g. data.bnf.fr: ~ 2.000.000 authors, ~ 200.000 works, ~ 200.000 topics
Contribute to open-source
• Unless you are doing some kind of super magical algorithm
• Is there a need / pool of potential developers ?
• Do it well (documentation / test)
• May bring you help, bug fixes, and engineers ! But it takes time and energy
11
Quality in data science software engineering
Never underestimate integration cost
• Easy to write 20 lines of Python code doing some fancy Random Forests…
• …that could be hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin
Make it clean from the start (> 2 days of dev or > 100 lines of code)
• Tests, tests, tests, tests, tests, tests, tests, …
• Packaging / supervision / monitoring
• Release early, release often
• Documentation, Agile development, Pull request, code versioning
Choose the right tool
• Do you really need this super fancy NoSQL database to store your transactions?
12
Monitoring and alerting: building datascience product
Hardware
(CPU, IO, …)
Software
(Errors, requests, …)
Datascience
(KPIs, …)
13
Hiring remarks
Selling yourself as a (good) data scientist
14
Defining yourself as a data scientist
Do not try to sell yourself
as a unicorn!
Define your skills
(and unicorns no longer exist…)
15
A few remarks on hiring – my personal opinion
Be careful with CVs full of buzzwords!
• E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …”
• It is like someone writing “IT skills: Python (for loop, if/else pattern, …)”
Hungry for data?
• Loving data is the most important thing to show
• Open data? Personal project? Curious about data? (Hackathon?)
• Multidisciplinary == knowing how to handle various datasets
Improve your IT skills
• Should be able to install/develop new libraries/algorithms
• A huge part of the job could be to format / cleanup the data
• Experience VS education -> Autonomy
16
Knowing the general context
A few remarks about GDPR
17
What is the GDPR?
Adopted in April 2016 and applicable as of May 25th, 2018
Replaces all the national legislation about the handling of personal data in Europe
Until now, the 1995 Directive applied, which had been transposed differently among the EU countries
In France, the law « Informatique et libertés » is going to be modified
There will still be other differing sources: national case law and the doctrines of national data protection authorities (CNIL)
Still pending: adoption of the E-privacy Regulation (about cookies and OTT)
18
Why GDPR ?
Why was the GDPR passed?
• Harmonisation of the European rules
• To directly target non-European companies making business with EU data
• To empower citizens and give them control over their data
Why is the GDPR important?
• Fines of up to 4% of the global annual turnover or 20 million euros. Now in France: max fine = 3 million euros
• Loss of reputation and future customers
• NGOs can bring claims on behalf of individuals
• Burden of proof is on the company
19
What will the GDPR really change?
Accountability principle: fewer formalities with the DPA but more internal preparatory work (DPIA) and possibly higher fines in case of an inspection (on-site or online)
Mandatory Data Protection Officer (records of processing)
New obligations for data processors
Security breach notification to the DPA (and even to the users in some cases) at most 72 hours after a security incident
New user right: data portability
20
When is the GDPR applicable?
1. Does your business collect, use or process personal data?
• No: the GDPR is not applicable
• Yes: go to question 2
2. Is an office of your business in the EU?
• Yes: the GDPR is applicable
• No: go to question 3
3. Does your business offer services to the EU?
(Do you provide your service in any European languages? Does your service use/accept any European currency? Are EU customers specifically addressed, e.g. delivery?)
• Yes: the GDPR is applicable
• No: go to question 4
4. Do you monitor individuals in the EU?
(Profiling, tracking by cookies or otherwise, analysis of personal preferences / behavior)
• Yes: the GDPR is applicable
• No: the GDPR is not applicable, but other privacy laws may apply
21
Datascience usecase
Recommendations @Rakuten
22
Recommendations & Personalization
The Big Picture
23
What are recommendations ?
https://www.rinapiccolo.com/piccolo-cartoons/
A recommender system seeks to
predict the "rating" or "preference" a user
would give to an item (wikipedia)
24
Recommendations are generic
Contextual features
Recommendations
engine
Input entities
Items / Products
Categories
Users
Shops
Widgets/UI sections
Output entities
Items / Products
Categories
Users
Shops
Widgets/UI sections
25
What is personalization?
“Personalization, consists of
tailoring a service or a product
to accommodate specific
individuals, sometimes tied to
groups or segments of
individuals” (wikipedia)
https://www.rakuten.co.jp 2018/11/07
26
Personalization usecases
Left column links Main widgets Top header links
Push less but relevant
content to the customer
Push dynamic content to
fit the context to the
customer
Push the most attractive
content to the customer
first
[Figure: page layouts showing blocks A, B, α, β, δ, 1 and 2 selected and ordered differently for each use case]
27
What are industrial companies doing?
“Netflix member loses interest after
perhaps 60 to 90 seconds of choosing”
[source]
“Netflix recommender system is used on most
screens of the Netflix product beyond the
homepage, and in total influences choice for
about 80% of hours streamed at Netflix.”
[source]
“Already, 35 percent of what consumers purchase on Amazon
and 75 percent of what they watch on Netflix come from
product recommendations based on such algorithms.”
[source]
28
Recommendations: different usages for different contexts
Best offers, Faster navigation, Serendipity, Complementary / substitute items…
https://www.rakuten.fr 2018/11/05
29
Recommendations & Personalization
How to do it?
30
Recommender System overview
Datasources
(BI, catalog…)
Delivery
API
Users
data
Tracker
Realtime context
31
Challenges in Recommendations
Items catalogues
• Catalogue for multiple shops with different items references ?
Items similarity / distances
• Cross services aggregation ?
• Lots of parameters ?
Recommendations engine
• Best / optimal recommendations logic ?
Evaluation process
• Offline / online evaluation ?
• Long-tail ? KPI ?
Items
Catalogue
Items
Similarity
Recommendations
engine
Evaluation
Process
32
Recommendations – Two axis strategy
Recommendations
candidates
Ranked
candidates
Candidates
retrieval
Realtime
ranking
Batch
logic 1
Batch
logic 2
Batch
logic 3
Batch
logic 4
AXIS 1
Candidates generation strategy
(Cocounts, Prod2vec, Word2vec,
Top, Content-based)
AXIS 2
Candidates ranking strategy / Learning-to-rank
(User context, Item context,
Page context, External context)
ML / AI algorithm
+Context & features
e.g. history, item, time, …
33
Recommendation datatypes
Ratings
Numerical feedback from the users
Sources: Stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain
Scaling and normalization !
Unitary data
Only 0/1 without any quality feedback
Sources: Click, purchase…
✔ Easy to obtain (e.g. tracker)
✖ No direct rating
Users
Items
1 3 2
5 2
2 4 1
3 1 5
4 4 1 3
Users
Items
1 1 1
1 1
1 1 1
1 1 1
1 1 1 1
34
Items Catalogues
Use different levels of aggregation to improve recommendations
Category-level
(e.g. food, soda, clothes, …)
Product-level
(manufactured items)
Item in shop-level
(specific product sold by a specific shop)
Increased statistical power in
co-events computation
Easier business handling
(picking the right item)
35
Recommendations & Personalization
How to work with unitary data?
36
Cocounts for binary / Unitary data
Only occurrences of item views/purchases/…
Jaccard distance
Cosine similarity
Conditional probability
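The three measures listed above can be sketched from plain counts. This is a minimal illustration, assuming we already know how many sessions contain item A, item B, and both; the function and argument names are hypothetical, not taken from the slides. The Jaccard distance is simply 1 minus the Jaccard similarity computed here.

```python
import math

def cocount_similarities(n_a, n_b, n_ab):
    """Similarities between items A and B computed from unitary (0/1) data.

    n_a, n_b: number of sessions containing A (resp. B)
    n_ab:     number of sessions containing both A and B (the co-count)
    """
    union = n_a + n_b - n_ab
    return {
        # |A ∩ B| / |A ∪ B|
        "jaccard": n_ab / union if union else 0.0,
        # |A ∩ B| / sqrt(|A| * |B|), the cosine of two binary vectors
        "cosine": n_ab / math.sqrt(n_a * n_b) if n_a and n_b else 0.0,
        # P(B | A) = |A ∩ B| / |A|, asymmetric by construction
        "p_b_given_a": n_ab / n_a if n_a else 0.0,
    }

sims = cocount_similarities(n_a=100, n_b=50, n_ab=20)
```

Note that the conditional probability is the only asymmetric measure of the three, which matters when "A then B" should not imply "B then A".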
37
Co-occurrences and Similarities Computation
Multiple possible parameters:
• Size of time window to be considered:
Does browsing and purchase data reflect similar behavior ?
• Threshold on co-occurrences
Is one co-occurrence significant enough to be used ? Two ? Three ?
• Symmetric or asymmetric
Is the order important in the co-occurrence ? A then B == B then A ?
• Similarity metrics
Which similarity metrics to be used based on the co-occurrences ?
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing items similarity
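As a rough sketch of how the time window and threshold parameters interact, the function below counts symmetric co-occurrences per user. The event format `(user, timestamp, item)` and the names `window` and `min_count` are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(events, window, min_count=2):
    """Count symmetric item co-occurrences.

    events:    iterable of (user, timestamp, item)
    window:    maximum time gap between two events of the same user
    min_count: threshold below which a co-occurrence is discarded
    """
    by_user = {}
    for user, ts, item in events:
        by_user.setdefault(user, []).append((ts, item))

    counts = Counter()
    for history in by_user.values():
        history.sort()  # chronological order
        pairs = set()   # each user contributes at most once per pair
        for (t1, a), (t2, b) in combinations(history, 2):
            if a != b and t2 - t1 <= window:
                # symmetric: (A, B) and (B, A) are the same pair
                pairs.add((a, b) if a < b else (b, a))
        counts.update(pairs)
    return {pair: c for pair, c in counts.items() if c >= min_count}

events = [
    ("u1", 1, "A"), ("u1", 2, "B"),
    ("u2", 1, "A"), ("u2", 3, "B"), ("u2", 30, "C"),
    ("u3", 1, "A"), ("u3", 2, "C"),
]
result = cooccurrences(events, window=5, min_count=2)
```

An asymmetric variant would keep the pair as `(a, b)` in chronological order instead of sorting it.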
38
Co-occurrences Example
[Figure: a user's browsing and purchase events on a timeline (08/09/2015 to 24/11/2015), grouped into co-occurrences either by session or by time window (time window 1, time window 2)]
39
Co-occurrences Example
Co-purchases
Co-browsing
Classical co-occurrences
Complementary
items
Substitute
items
Other possible co-occurrences
Items browsed and
bought together
Items browsed and not
bought together
“You may also want…”
“Similar items…”
40
Recommendations & Personalization
How to do it for ratings data?
41
Algorithm 1 - Collaborative filtering
User-user
#items < #users
Items are changing quickly
Item-item
#items >> #users
Users
Items
1 3 2
5 2
2 4 1
3 1 5
4 4 1 3
?
1 – Compute users similarities
(cosine-similarity, Pearson)
2 – Weighted average of ratings
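The two steps above can be sketched with NumPy as follows. This is a minimal illustration of user-user collaborative filtering, assuming a small dense matrix where 0 marks a missing rating; the function name is hypothetical.

```python
import numpy as np

def predict_rating(R, user, item):
    """Predict R[user, item] from the ratings of similar users.

    R: (n_users, n_items) array, 0 meaning "no rating".
    """
    # 1 - cosine similarity between `user` and every other user
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-12)
    sims[user] = 0.0  # do not use the user's own row

    # 2 - weighted average of the ratings given to `item`
    rated = R[:, item] > 0
    weights = sims * rated
    if weights.sum() == 0:
        return None  # nobody comparable rated this item
    return float(weights @ R[:, item] / weights.sum())

R = np.array([[5, 3, 0],
              [5, 3, 4],
              [1, 5, 2]])
pred = predict_rating(R, user=0, item=2)
```

In this toy matrix, user 0 is closer to user 1 than to user 2, so the prediction leans toward user 1's rating of 4.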
42
Algorithm 2 - Matrix factorization
Users
Items
1 3 2
5 2
2 4 1
3 1 5
4 4 1 3
-0.7 1 0.4
…
…
…
…
…
2.3 0.2 -0.3
Items
0.5 0.3 … 1.2
…
1.2 -0.2 … -3.2
Users
~
X
• Choose a number of latent variables to decompose the data
• Predict new rating using the product of latent vectors
• Use gradient descent techniques (e.g. SGD)
• Add some regularization
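The four bullets above can be sketched in a few lines of NumPy. This is an illustrative SGD loop under assumed hyperparameters (`k`, `lr`, `reg`, `epochs` are arbitrary values chosen for the example), not a tuned implementation.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """ratings: list of (user_index, item_index, rating)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))  # user latent vectors
    V = rng.normal(scale=0.1, size=(n_items, k))  # item latent vectors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]  # prediction = product of latent vectors
            u_old = U[u].copy()
            # gradient step with L2 regularization
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

ratings = [(0, 0, 5.0), (0, 1, 3.0), (0, 2, 1.0),
           (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0)]
U, V = factorize(ratings, n_users=3, n_items=3)
mse = sum((r - U[u] @ V[i]) ** 2 for u, i, r in ratings) / len(ratings)
baseline = sum(r ** 2 for _, _, r in ratings) / len(ratings)  # error when predicting 0
```

A missing rating is then predicted as `U[u] @ V[i]` for any (user, item) pair absent from the training list.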
43
Matrix factorization – MovieLens example
Read files
import csv

movies_fname = '/path/ml-latest/movies.csv'
with open(movies_fname, newline='') as fobj:
    # movieId -> title
    movies = dict((r[0], r[1]) for r in csv.reader(fobj))

ratings_fname = '/path/ml-latest/ratings.csv'
with open(ratings_fname, newline='') as fobj:
    header = next(fobj)  # skip the header line
    # (userId, movie title, rating)
    ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]
Build sparse matrix
import scipy.sparse as sp

user_idx, item_idx = {}, {}
data, rows, cols = [], [], []
for u, i, s in ratings:
    # Assign a new row/column index the first time a user/item is seen
    rows.append(user_idx.setdefault(u, len(user_idx)))
    cols.append(item_idx.setdefault(i, len(item_idx)))
    data.append(s)
ratings = sp.csr_matrix((data, (rows, cols)))
# Reverse lookups: matrix index -> movie title / user id
reverse_item_idx = dict((v, k) for k, v in item_idx.items())
reverse_user_idx = dict((v, k) for k, v in user_idx.items())
44
Matrix factorization – MovieLens example
Fit Non-negative Matrix Factorization
from sklearn.decomposition import NMF

nmf = NMF(n_components=50)
user_mat = nmf.fit_transform(ratings)
item_mat = nmf.components_
Plot results
component_ind = 3
# Movies with a non-zero weight in the chosen latent component
component = [(reverse_item_idx[i], s)
             for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
# Print the 10 movies with the highest weight
for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
    print(movie, round(score, 0))
Terminator 2: Judgment Day (1991) 24.0
Terminator, The (1984) 23.0
Die Hard (1988) 19.0
Aliens (1986) 17.0
Alien (1979) 16.0
Exorcist, The (1973) 8.0
Halloween (1978) 7.0
Nightmare on Elm Street, A (1984) 7.0
Shining, The (1980) 7.0
Carrie (1976) 7.0
Star Trek II: The Wrath of Khan (1982) 10.0
Star Trek: First Contact (1996) 10.0
Star Trek IV: The Voyage Home (1986) 9.0
Contact (1997) 8.0
Star Trek VI: The Undiscovered Country (1991) 8.0
Blade Runner (1982) 8.0
45
Recommendations & Personalization
How to do content-based recommendations?
46
Content-based: what should we use ?
Attribute-based
• Encoded features (e.g. one-hot-encoding)
• Represent documents in the features space
• Find similar documents (Knn, Kd-tree, …)
Content-based
• Encoded textual content of documents
• Represent textual content in an embeddings space
• Find similar documents (Knn, Kd-tree, …)
Can be linked to Search Engine
https://www.rakuten.fr 2018/11/05
47
Example of feature: Named entities in product description
Sample of code with Polyglot
from polyglot.text import Text

# blob: the raw product description (a string)
text = Text(blob)
for sent in text.sentences:
    print(sent, "\n")
    # Named entities detected in this sentence
    for entity in sent.entities:
        print(entity.tag, entity)
(Sentence("A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle, Marcus Goldman, jeune écrivain à succès, est dans..."), '\n')
(u'I-LOC', I-LOC([u'New', u'York']))
(u'I-PER', I-PER([u'Marcus', u'Goldman']))
(Sentence("Lire la suite la tourmente : il est incapable d'écrire le nouveau roman qu'il doit remettre à son éditeur d'ici quelques mois."), '\n')
(Sentence("Le délai est près d'expirer quand soudain tout bascule pour lui : son ami et ancien professeur d'université, Harry Quebert, l'un des écrivains les plus respectés du pays, est rattrapé par son passé et se retrouve accusé d'avoir assassiné, en 1975, Nola Kellergan, une jeune fille de 15 ans, avec qui il aurait eu une liaison."), '\n')
(u'I-PER', I-PER([u'Harry', u'Quebert']))
(u'I-PER', I-PER([u'Nola', u'Kellergan']))
https://fr.shopping.rakuten.com/mfp/3011174/la-verite-sur-l-affaire-harry-quebert-joel-dicker-livre?pid=171972011 2018/11/05
48
Word2vec: two-layer neural network
Distributed representation of words:
• Continuous bag-of-words: predict current word from surrounding words only
• Skip-gram: use current word to predict surrounding words
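To make the skip-gram setting concrete, here is a small sketch of how (current word, surrounding word) training pairs can be generated from a tokenized sentence. The function name and the `window` parameter are illustrative assumptions; real implementations add subsampling and negative sampling on top.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # every word within `window` positions of the center is a context word
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
```

Continuous bag-of-words uses the same windows the other way round: the context words are the input and the center word is the target.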
https://skymind.ai/wiki/word2vec
49
Word2vec: results
https://www.tensorflow.org/tutorials/representation/word2vec
50
Word2vec: Code sample
import string
import unidecode
import requests
from bs4 import BeautifulSoup

# Reference ids of the texts to download
refs = ["annee1", "artgrdp1", "ballades1", "bugjarg1", "contemplA2",
        "contemplB2", "feuilles1", "hugoshak1", "legend1", "legendet21",
        "nddp1", "oriental1", "quatrevt1", "rayons1", "ruesboi1", "satan1"]
textes = ""
for ref in refs:
    print(ref)
    res = requests.get("http://abu.cnam.fr/cgi-bin/donner_html?%s" % ref)
    text = BeautifulSoup(res.content, 'lxml').text
    # Keep only the body between the start and end markers
    textes += text.split("DEBUT DU FICHIER")[1].split("FIN DU FICHIER")[0]

# Split into sentences on "." and ";", dropping very short fragments
sentences = [t for t in textes.replace("\n", " ").replace("\r", " ").split(".") if len(t) > 20]
sentences = [t.lower().strip() for sentence in sentences for t in sentence.split(";")]
# Tokenize, strip punctuation and remove accents
words = [[t.strip().strip(string.punctuation) for t in sentence.split()] for sentence in sentences]
words = [[unidecode.unidecode(t) for t in word] for word in words]
Model learning with Gensim
import gensim

# sg=1 selects skip-gram; `vector_size` is called `size` in gensim < 4.0
model = gensim.models.Word2Vec(words, min_count=5, vector_size=500, sg=1)
51
Word2vec: two-layer neural network
We can use + and – operations on embedding vectors to find logical relationships
model.wv.most_similar("femme")
[('fille', 0.8448578715324402),
('danseuse', 0.8413887023925781),
('heureuse', 0.8404737114906311),
('fee', 0.8399657011032104),
('jolie', 0.8360834717750549),
('epouse', 0.8242064714431763),
('malheureuse', 0.823990523815155),
('jeune', 0.8126694560050964),
('creature', 0.8117369413375854),
('bohemienne', 0.8117284178733826)]
model.wv.most_similar(positive=["quasimodo", "femme"], negative=["homme"])
[('bohemienne', 0.7697243094444275),
("l'egyptienne", 0.7591158747673035),
('chevre', 0.7411096096038818),
('recluse', 0.7373903393745422),
('esmeralda', 0.733445405960083),
('rene-jean', 0.7310534715652466),
('parole', 0.7230771780014038),
('tourna', 0.715079665184021),
('vivement', 0.7149174213409424),
('condamnee', 0.7142783403396606)]
model.wv.most_similar("philosophe")
[('ecolier', 0.9446097016334534),
('soulier', 0.9436287879943848),
('bandit', 0.9423332810401917),
('bonhomme', 0.9420523643493652),
('ennemi', 0.938879132270813),
('gentilhomme', 0.9362510442733765),
('savant', 0.9335594177246094),
('officier', 0.9284510612487793),
('traitre', 0.9282448887825012),
('damne', 0.926963210105896)]
52
Convolutional neural network (CNN)
https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html
• Feature-sharing
• Not hand-designed
• It boosted algorithm performance in many tasks!
• Data driven (but you need data!)
53
Recommendations & Personalization
Deep-learning for E-commerce ?
54
Prod2vec: purchases session as a sentence
E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.
[Figure: a sequence of purchases over time (08/09/2015, 10/09/2015, 08/11/2015, 24/11/2015) ≈ a sentence made of words ("Tut, I have lost myself; I am not here; This is not Romeo, he's some other where.")]
Apply Word2Vec on a sentence of “purchases”
55
Prod2vec: Theory
Prod2vec learns a low-dimensional embedding representation
of products using the skip-gram model
E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.
Objective function
(S is the set of sessions)
Probability of seeing
the neighboring product
pi+j given product pi
v and v’ are the input and output vector representation (that should be learned).
Similar products should be close in the vector space.
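For reference, the skip-gram objective and the softmax probability described on this slide can be written as below, following the formulation in Grbovic et al. The symbols c (size of the context window around a product) and P (number of distinct products) are notation assumed here; S, p_i, p_{i+j}, v and v' are the quantities named on the slide.

```latex
\mathcal{L} = \sum_{s \in S} \; \sum_{p_i \in s} \; \sum_{-c \le j \le c,\; j \ne 0} \log \mathbb{P}(p_{i+j} \mid p_i)

\mathbb{P}(p_{i+j} \mid p_i) =
  \frac{\exp\left(v_{p_i}^{\top} v'_{p_{i+j}}\right)}
       {\sum_{p=1}^{P} \exp\left(v_{p_i}^{\top} v'_{p}\right)}
```

Maximizing this objective pushes products that appear in similar purchase contexts toward nearby vectors.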
56
Prod2vec: embedding of products with similar neighborhood
Purchase
Purchase
57
Prod2vec: Python example with Gensim
Model learning with Gensim
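import json
from gensim.models import Word2Vec

# Assumption: purchases.json holds one JSON list of product ids per line
# (one purchase session per line)
with open('purchases.json') as fobj:
    sessions = [json.loads(line) for line in fobj]
# `vector_size` is called `size` in gensim < 4.0
model = Word2Vec(sessions, vector_size=100, window=5, min_count=1, workers=4)
embeddings = model.wv.vectors    # `syn0` in older gensim versions
indices = model.wv.index_to_key  # `index2word` in older gensim versions
KNN computation with Falconn
import falconn

knn = 10  # number of neighbours to retrieve (illustrative value)
params_cp = falconn.get_default_parameters(embeddings.shape[0], embeddings.shape[1])
params_cp.lsh_family = falconn.LSHFamily.CrossPolytope
params_cp.distance_function = falconn.DistanceFunction.EuclideanSquared
lsh = falconn.LSHIndex(params_cp)
lsh.setup(embeddings)
query = lsh.construct_query_object()
neighbors = {}
for i in range(embeddings.shape[0]):
    res = query.find_k_nearest_neighbors(embeddings[i], knn)
    # Map matrix row indices back to product ids
    neighbors[indices[i]] = [indices[j] for j in res]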
58
Recommendations & Personalization
Learning to Rank
How to adapt to each specific context?
59
LTR in a Nutshell
Re-orders the recommendations
based on the features
Store features
Offline Machine
Learning
Store model
Real time ML
Recommend
API delivery
https://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/45530.pdf
60
LTR features
About the user
About the item
External factors
61
Problem setting
ts   item   ritem   features               click
1    A      B       .1 .3 .2 .9 .4 .0 .2   0
1    A      C       .9 .7 .6 .0 .0 .3 .6   0
1    A      D       .4 .8 .6 .3 .2 .1 .0   1
2    B      A       .3 .1 .5 .7 .1 .9 .1   0
2    B      E       .1 .5 .2 .3 .2 .8 .7   1
Learn any model (linear, non-linear, …) to predict the click column
62
Learning to Rank approaches
• Pointwise approach: Consider each pair (document, target) separately (clicked, purchased).
• Pairwise approach: Consider order of two documents and minimize inversion errors
• Listwise approach: Consider all documents, and try to optimize the overall/average score
1
Not Clicked
2
Clicked
3
Not Clicked
4
Clicked
5
Clicked
6
Not Clicked
https://www.rakuten.fr 2018/11/05
63
Learning to Rank approaches algorithm
• Pointwise approach: regression problem, predict document score
Doc 1 (Not Clicked) == (feature11, feature21, …, 0)
Doc 2 (Clicked) == (feature12, feature22, …, 1)
Doc 3 (Not Clicked) == (feature13, feature23, …, 0)
• Pairwise approach: classification problem, which document is better
Doc 1 (Not Clicked), Doc 2 (Clicked) == (feature11, feature21, feature12, feature22…, 0)
Doc 1 (Not Clicked), Doc 3 (Not Clicked) == (feature11, feature21, feature13, feature23…, 1)
Doc 2 (Clicked), Doc 3 (Not Clicked) == (feature12, feature22, feature13, feature23…, 2)
• Listwise approach: more complex problem… optimize the value of one of the evaluation measures
Doc 1 (Not Clicked), Doc 2 (Clicked), Doc 3 (Not Clicked) == (feature11, feature21, feature12, feature22…, (2, 1, 3))
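As an illustration of the pairwise representation, the sketch below turns a list of (features, clicked) documents into concatenated-feature pairs that any binary classifier could consume. The label encoding used here (1 when the first document is preferred, 0 when the second is, ties skipped) is an assumption made for the example and differs from the 0/1/2 labels shown on the slide.

```python
def to_pairwise(docs):
    """docs: list of (features, clicked) for one result list.

    Returns a list of (features_a + features_b, label) training examples.
    """
    pairs = []
    for i, (fa, ca) in enumerate(docs):
        for fb, cb in docs[i + 1:]:
            if ca == cb:
                continue  # no preference between these two documents
            # label 1: first document is better, label 0: second is better
            pairs.append((fa + fb, 1 if ca > cb else 0))
    return pairs

docs = [([0.1, 0.3], 0),   # Doc 1, not clicked
        ([0.9, 0.7], 1),   # Doc 2, clicked
        ([0.4, 0.8], 0)]   # Doc 3, not clicked
pairs = to_pairwise(docs)
```

The pointwise setting needs no transformation at all: each `(features, clicked)` tuple is already a training example.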
64
Learning to Rank approaches solutions
Regression/classification problem:
• SVM
• SGD
• Random Forests
Listwise measures:
• nDCG: normalized Discounted Cumulative Gain
Ranking 1: Doc 1 (Rel = 5), Doc 2 (Rel = 3), Doc 3 (Rel = 1), Doc 4 (Rel = 4), Doc 5 (Rel = 2)
CG5 = 15; DCG5 = 9.89; IDCG5 = 10.27; nDCG5 = 0.96
Ranking 2: Doc 1 (Rel = 5), Doc 2 (Rel = 3), Doc 3 (Rel = 4), Doc 4 (Rel = 1), Doc 5 (Rel = 2)
CG5 = 15; DCG5 = 10.1; IDCG5 = 10.27; nDCG5 = 0.98
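The figures above are consistent with the common definition DCG = Σ rel_i / log2(i + 1), with positions i starting at 1. A minimal sketch, assuming that definition:

```python
import math

def dcg(rels):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

def ndcg(rels):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

first = ndcg([5, 3, 1, 4, 2])
second = ndcg([5, 3, 4, 1, 2])
```

Swapping the documents at positions 3 and 4 moves a relevant document up, which is why the second ranking scores higher.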
65
Recommendations & Personalization
Recommendations Quality
How to evaluate recommendations ?
66
Recommendation Quality Challenges
Recommendations categories
• Cold start issue
• External data ?
• Cross-services ?
• Hot products (A)
• Top-N items ?
• Short tail (B)
• Long tail (C + D)
Minor
Product
Major
Product
(Popular)
New
Product
Old
Product
(A)
(B)
(D)
(C)
67
Offline Evaluation
Pros/Cons
• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPI
Approaches
• Rescoring
• Prediction game
• Business simulator
68
Public Initiative – Viki Recommendation Challenge
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
567 submissions from 132 participants
69
A/B Testing
Track users’ interaction
with the AB-test variants
Compute statistical tests
Choose which version
to put in production
A B
control variation
70
A/B Statistical test
Do not peek at A/B tests or stop them before the end
Abtest A
Abtest B
Abtest C
Sample size should be fixed in advance
500 samples 1000 samples 1500 samples
Stopped because
Not significant
Stopped because
Not significant
Kept because
significant
✗
Abtest A
Abtest B
Abtest C
500 samples 1000 samples 1500 samples
Kept because
significant
✓
Kept because
significant
71
A/B Statistical test
https://www.evanmiller.org/how-not-to-run-an-ab-test.html
What sample size to use?
Sample variance you expect
the minimum effect you wish to detect
https://www.evanmiller.org/ab-testing/sample-size.html
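As a sketch of how the two inputs above translate into a sample size, the function below uses the standard normal-approximation formula for comparing two means, n = 2 (z_{1-α/2} + z_{power})² σ² / δ² per variant. The significance level `alpha` and the `power` are additional inputs assumed here with conventional defaults; they are not given on the slide.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(sigma, min_effect, alpha=0.05, power=0.8):
    """Samples needed in each variant of a two-sided A/B test.

    sigma:      expected standard deviation of the metric
    min_effect: minimum absolute difference we wish to detect
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / min_effect ** 2)

n = sample_size_per_variant(sigma=1.0, min_effect=0.5)
```

Halving the minimum detectable effect multiplies the required sample size by four, which is why it should be fixed before the test starts.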
72
Continuous A/B test process
Short ABtest
2 or 3 days
L1 (control): CVRia 0.21
L2: CVRia 0.03
L3: CVRia 0.7
Short ABtest
2 or 3 days
L3 (control): CVRia 0.065
L1 (old control): CVRia 0.03
L4: CVRia 0.09
Short ABtest
2 or 3 days
L4 (control): CVRia 0.07
L3 (old control): CVRia 0.72
L2: CVRia 0.03
Expected output: step-by-step increase of CVRia / orders
73
Multi-armed bandit
Testing different models without knowing their outcome
- Explore the different models to estimate their rewards
- Exploit the best model known so far
It costs to test hypotheses ( == explore)
[Figure: six successive rounds over Models A, B and C, with observed rewards 1.1, 0.5, 0.8, 0.2, 0.9 and 1.8]
74
Epsilon-greedy strategy
Many other strategies (e.g. see Wikipedia)…
• Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase.
• Epsilon-decreasing strategy: The value of epsilon decreases over time.
…and different bandits:
• Contextual bandit: at each iteration, we have access to a vector of contextual features.
• Constrained contextual bandit: a total budget is associated with the bandit.
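Model selection at each iteration:
• With probability 1-ε: select the best model
• With probability ε: select one of the other models (Model 1, Model 2, …, Model n), each with probability 1/n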
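A minimal sketch of the epsilon-greedy strategy follows. The class and method names are hypothetical, and as a simplification the exploration step draws uniformly among all models, including the current best one.

```python
import random

class EpsilonGreedy:
    def __init__(self, n_models, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_models   # times each model was played
        self.means = [0.0] * n_models  # running mean reward per model
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            # explore: pick a model uniformly at random
            return self.rng.randrange(len(self.means))
        # exploit: pick the model with the best mean reward so far
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, model, reward):
        self.counts[model] += 1
        # incremental update of the mean reward
        self.means[model] += (reward - self.means[model]) / self.counts[model]

bandit = EpsilonGreedy(n_models=3, epsilon=0.0)
bandit.update(0, 0.5)
bandit.update(0, 1.5)
bandit.update(1, 1.8)
chosen = bandit.select()
```

An epsilon-decreasing variant would simply lower `self.epsilon` after each call to `update`.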
75
Cleaning and improving datasets
Record-linkage
76
What is Record Linkage?
Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases) (Wikipedia)
Usage for Recommendations
• Global catalog
• Items aggregation
• Helps with cold start issues
• Improved navigation
[Figure: Marketplace 1, Marketplace 2, Reference dataset]
77
Linked Open Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer,
Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
78
Semantic-web and RDF format
Triples: <subject> <relation> <object>
URI: unique identifier
http://dbpedia.org/page/Terminator_2:_Judgment_Day
79
Record linkage for global recommendations
• Linking products together in a service
• Feature Engineering - Generate hierarchies of products
• ✔ Improve statistical power of co-events computation by aggregation
• ✔ If based on text, may be used for content-based recommendations
• Based on Record Linkage techniques (e.g. MinHashing)
• Linking products together between services
• Having a unique product id across services
• Use recommendations from one service, for another new service
• ✔ Avoid cold start issue
• Cross-services recommendations
• ✔ Show items from a service on another service pages
• Linking products to an external database (Wikidata)
• More info for item enrichment, UI, content-based recommendations
80
Record linkage – The big picture
Dataset 1
e.g. title, categories, price
Blocking
e.g. minhashing on the titles
Dataset 2
e.g. title, categories, price
Subset Dataset 1 Subset Dataset 2
Block 1
Subset Dataset 1 Subset Dataset 2
Block 2
Subset Dataset 1 Subset Dataset 2
Block n
Subset Dataset 1 Subset Dataset 2X
Comparisons based on attributes with specific distances
For each
block
Links
creation
Data 1_1 == Data 2_n
Data 1_2 == Data 2_p
Data 1_3 == Data 2_1
….
Comparisons
(distances computation)
81
Record linkage – Naive approach
Match items from one dataset to the other using distances
Levenshtein
Jaccard
Python difflib
Based on an algorithm published in the late 1980’s by
Ratcliff and Obershelp under the hyperbolic name
“gestalt pattern matching.” (see doc)
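Two of the distances above can be sketched with the standard library alone: `difflib.SequenceMatcher` gives the gestalt pattern matching ratio, and a short dynamic program gives the Levenshtein distance. The helper names are chosen for this example.

```python
from difflib import SequenceMatcher

def gestalt_ratio(a, b):
    """Similarity in [0, 1] from Python's difflib."""
    return SequenceMatcher(None, a, b).ratio()

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

ratio = gestalt_ratio("abcd", "bcde")
distance = levenshtein("kitten", "sitting")
```

Both run on a single pair of strings, so matching two full catalogs this way means comparing every item with every other item.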
82
Shingles and documents representation
Shingles ~ Word-n-grams
Split documents in a list of words groups
A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle,
Marcus Goldman, jeune écrivain à succès, est dans…
E.g. word-3-grams
• A New York, au
• New York au printemps
• …
• à succès, est
• succès, est dans
Use jaccard distance between sets of shingles
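A small sketch of word shingles and the Jaccard similarity between two shingle sets (the Jaccard distance is 1 minus this value). Tokenization here is a plain whitespace split, so punctuation stays attached to words; that is a simplification for the example.

```python
def shingles(text, n=3):
    """Set of word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingles("a new york au printemps")
s2 = shingles("new york au printemps 2008")
similarity = jaccard_similarity(s1, s2)
```

Here the two titles share 2 of their 4 distinct shingles, giving a similarity of 0.5.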
But…
83
Record linkage – Problematic
Combinatorial explosion
10^6 items X 10^6 items = 10^12 comparisons
• Use blocking: divide and conquer approaches: n-gram indexes, clustering,
minhashing…
• Apply more computationally expensive approaches on each block
84
Minhashing - Theory
How to compare sets of shingles of different sizes? MINHASHING!
1. Compute hash of shingles
2. Keep minimum hash value
3. Repeat for 200 different hash functions
We now have a 200-dimensional representation of each document's shingles
Why does it work?
If two documents have the same minimum hash value for a given hash function, they share the shingle that produced it, and this happens with probability equal to their Jaccard similarity.
Document 1
Document 2
Shingle 1 Shingle 2 …
Shingle 1 Shingle 2 …
Hash
Value1, Value2, …
Value1’, Value2’, …
Find minimum
hash value
85
Minhashing - Theory
We just have to keep the minimum hash value for each document
Huge computational and storage boost !
Randomly picking 200 shingles and comparing them between 2 documents
≈
Storing and comparing minimum values for 200 different hash functions
Jaccard(doc1, doc2) ≈ #{i : Minhash_i(S1) == Minhash_i(S2)} / nbhash
But we still have to compare all documents together
(even if the comparison is way faster)
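The estimate above can be sketched as follows. The family of hash functions used here, (a·x + b) mod p applied to a CRC32 of each shingle, is one simple choice assumed for the example; the slides do not say which hash functions were used.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # large prime modulus

def make_hash_params(n_hashes=200, seed=0):
    """Draw (a, b) coefficients for n_hashes functions x -> (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n_hashes)]

def minhash_signature(shingle_set, params):
    """One minimum hash value per hash function (shingle_set must be non-empty)."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    return sum(h1 == h2 for h1, h2 in zip(sig1, sig2)) / len(sig1)

params = make_hash_params()
sig_a = minhash_signature({"a new york", "new york au", "york au printemps"}, params)
sig_b = minhash_signature({"x y z", "y z w"}, params)
```

Each document is reduced to 200 integers whatever its length, and two signatures are compared position by position.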
86
Locality-Sensitive Hashing
What is locality-sensitive hashing (LSH)?
It is a hash such that similar vectors tend to get similar hash values
It generates "bands" of documents, where documents within a band are more or less similar and should be compared with Minhashing
Document 1
Document 2
Shingle 1 Shingle 2 …
Shingle 1 Shingle 2 …
Shingle 1 Shingle 2 …
Shingle 1 Shingle 2 …
Document 1
Document 2
Minhash 1 … Minhash 200
Minhash 1 … Minhash 200
Minhash 1 … Minhash 200
Minhash 1 … Minhash 200
Document 1
Document 4
Document 3
Document 5
…
Minhashing
comparison
Minhashing
comparison
Minhashing
comparison
Shingles, Minhashing, Locality-Sensitive Hashing
87
Minhashing + LSH - Results
Marketplace 1 Marketplace 2
https://www.wikidata.org/wiki/Q170564
bg Терминатор 2: Денят на страшния съд
el Εξολοθρευτής 2: Μέρα Κρίσης
en Terminator 2: Judgment Day
es Terminator 2: el juicio final
fr Terminator 2 : Le Jugement dernier
ja ターミネーター2
ka ტერმინატორი 2: განკითხვის დღე
Director: James Cameron
Cast member: Arnold Schwarzenegger
Cast member: Edward Furlong
Follows: The Terminator
Genre: action film
Main subject: time travel, android
Narrative location: Los Angeles
https://item.rakuten.co.jp/auc-tecc/10016937/ 2018/11/07
https://fr.shopping.rakuten.com/mfp/5705594/terminator-2-2 2018/11/07
88
Rakuten.fr datachallenge 2017
Using user reviews
https://challengedata.ens.fr/fr/challenge/26/prediction_de_linteret_des_avis_utilisateurs.html
89
Users reviews on Rakuten.fr
https://fr.shopping.rakuten.com 2016/10/10
90
Challenge – Task 1
Predict if a review is useful for other users or not
May be used to boost interesting reviews on the website
Classification (#useful / #total > 0.5) or Regression task on textual features
https://fr.shopping.rakuten.com 2016/10/10
91
Challenge – Task 2
Predict the number of stars given by a user based on his/her review
May be used to detect fraud and help improve the quality of the website
Regression task (6 discrete values) on textual features
https://fr.shopping.rakuten.com 2016/10/10
92
Data samples
product: b57c06ed94773c4d08bcefcdf8cbedd846bbdcba8d669a15d511b9acb92efeb43
review_title: “why not !”
review_content: “Yess! est un jeu de communication réussi, intuitif, malin assez rapide avec peu de temps mort. Voilà un jeu
au rapport plaisir/prix qui est bien placé.”
review_note: 4
feedback_positive_count: 0
feedback_negative_count: 0
product: 12d1407836441fc39805916ecb705604bcb539bd70b477c82d57f5043977102
review_title: “Lave linge parfait”
review_content: “Seul point négatif : la mise en réseau de la machine...Mais ce n'est pas ce que l'on recherche le plus dans
un lave linge”
review_note: 4
feedback_positive_count: 0
feedback_negative_count: 0
93
Expected results
The expected results are probabilities of being useful (Class 1)
Proba: 0.99
Je vien d'accerir ses enceintes d'une qualité de son incroyable . Des basses profondes et puissantes . Pour
amoureux de son !
Proba: 0.0611228262615
Parfait PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT
PARFAIT
94
Baseline code example
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
# KFold lived in `sklearn.cross_validation` in old scikit-learn versions
from sklearn.model_selection import KFold

STOPWORDS = ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec']  # …
fname = '/path/to/reviews_test_clean 2.csv'
df = pd.read_csv(fname)
# Share of positive feedback; NaN when a review received no feedback at all
df['feedback_ratio'] = df['feedback_positive_count'].astype(float) / (
    df['feedback_positive_count'] + df['feedback_negative_count'])
df['feedback_class'] = df['feedback_ratio'] >= 0.5
X = df['review_content'].values
Y = df['feedback_class'].values
kf = KFold(n_splits=5)
all_pred, all_test = [], []
for train, test in kf.split(X):
    Xtrain, Xtest = X[train], X[test]
    Ytrain, Ytest = Y[train], Y[test]
    # Fit the TF-IDF vocabulary on the training fold only
    vect = TfidfVectorizer(stop_words=STOPWORDS)
    Xtrain = vect.fit_transform(Xtrain)
    Xtest = vect.transform(Xtest)
    clf = linear_model.SGDClassifier()
    clf.fit(Xtrain, Ytrain)
    pred = clf.predict(Xtest)
    all_pred.extend(pred)
    all_test.extend(Ytest)
95
Rakuten.fr datachallenge 2018
Prediction of transaction claims status
https://challengedata.ens.fr/en/challenge/39/prediction_of_transaction_claims_status.html
96
Claims predictions in E-commerce
Claims have a huge impact on both user experience and cost
Claims can be handled differently depending on the case (broken, fake, …)
Predict whether a transaction is likely to lead to a claim
-> focus on risky transactions.
Possibly a huge impact across the whole e-commerce field!
97
Dataset
• ID: identifier of the sample
• SHIPPING_MODE: mode of shipping of the product
• (RECOMMANDE, NORMAL, …)
• SHIPPING_PRICE: cost of shipping, if existing
• (<1, 1<5, 5<10, 10<20, >20)
• WARRANTIES_FLG: True if a warranty has been taken by the buyer
• WARRANTIES_PRICE: Price of warranty, if existing
• (<5, 5<20, 20<50, 50<100, 100<500, >500)
• CARD_PAYMENT: transactions paid by card
• COUPON_PAYMENT: transactions paid with a discount coupon
• RSP_PAYMENT: transactions paid with Rakuten Super Points
• WALLET_PAYMENT: transactions paid with PriceMinister-Rakuten wallet
• PRICECLUB_STATUS: status of the buyer
• (UNSUBSCRIBED, PLATINUM, …)
• REGISTRATION_DATE: year of registration of the buyer
• PURCHASE_COUNT: binned count of the buyer's previous purchases
• (<5, 5<20, 20<50, 50<100, 100<500, >500)
• BUYER_BIRTHDAY_DATE: year of birth of the buyer
• BUYER_DEPARTMENT: department of the buyer or -1
• BUYING_DATE: year and month of the purchase
• SELLER_SCORE_COUNT: binned count of the seller's previous sales
• (<100, 100<10^3, 10^3<10^4, 10^4<10^5, 10^5<10^6, >10^6)
• SELLER_SCORE_AVERAGE: score of the seller on PriceMinister-Rakuten
• SELLER_COUNTRY: country of the seller
• (FRANCE METROPOLITAN, CHINA, …)
• SELLER_DEPARTMENT: department of the seller or -1
• PRODUCT_TYPE: type of the purchased product
• (TOYS, CELLPHONE_ACCESSORY, …)
• PRODUCT_FAMILY: family of the purchased product
• (ELECTRONICS, BABY, …)
• ITEM_PRICE: binned price of the purchased product
• (<10, 10<20, 20<50, 50<100, 100<500, 500<1000, 1000<5000, >5000)
98
Challenges
Complex interactions that involve multiple factors (probably not all captured by the features) and subjective information (the same shop does not always send broken products…)
Unbalanced classes!
Categorical features + numerical features (beware of ranges!)
Find some socio-demographic/behavioral features (e.g. based on country)
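A minimal sketch of two of the points above, under stated assumptions. Binned features such as ITEM_PRICE are ordered, so they can be mapped to integer ranks instead of unordered one-hot columns, and class imbalance can be countered with class weights. The tiny data frame and the CLAIM target column are invented for illustration; the bin labels are the ones listed on the Dataset slide.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Bin labels as listed on the Dataset slide, in increasing order.
ITEM_PRICE_BINS = ['<10', '10<20', '20<50', '50<100', '100<500',
                   '500<1000', '1000<5000', '>5000']
PRICE_RANK = {label: rank for rank, label in enumerate(ITEM_PRICE_BINS)}

# Toy transactions, invented for illustration (CLAIM is a hypothetical binary target).
df = pd.DataFrame({
    'ITEM_PRICE': ['<10', '10<20', '>5000', '100<500',
                   '<10', '20<50', '500<1000', '10<20'],
    'WARRANTIES_FLG': [False, False, True, True, False, False, True, False],
    'CLAIM': [0, 0, 1, 0, 0, 0, 1, 0],
})

X = pd.DataFrame({
    # Ordinal encoding keeps the ordering of the price ranges.
    'ITEM_PRICE_RANK': df['ITEM_PRICE'].map(PRICE_RANK),
    'WARRANTIES_FLG': df['WARRANTIES_FLG'].astype(int),
})
y = df['CLAIM']

# class_weight='balanced' reweights samples inversely to their class frequency.
clf = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=0)
clf.fit(X, y)
```

Whether ordinal ranks or one-hot columns work better for these ranges is an empirical question; tree-based models can usually exploit the ordering directly.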
99
Baseline
Metric: weighted AUC (from sklearn)
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
Algorithm used for benchmarks (naive and classic!)
• Random forests classifier (from sklearn), with 200 estimators
• Classical preprocessors (from sklearn): OneHotEncoder, LabelEncoder
Result obtained: 0.574 AUC weighted metric.
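A minimal sketch of what the weighted AUC computes, assuming a multiclass target scored one-vs-rest (the slide does not state the exact call used for the benchmark). The labels and scores are invented; the second half recomputes the same value by hand, as one AUC per label averaged with the label supports.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multiclass labels and scores, invented for illustration.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
rng = np.random.RandomState(0)
scores = rng.rand(len(y_true), 3)
scores[np.arange(len(y_true)), y_true] += 0.5        # favour the true class a little
proba = scores / scores.sum(axis=1, keepdims=True)   # each row must sum to 1

# Weighted one-vs-rest AUC as computed by scikit-learn.
auc_weighted = roc_auc_score(y_true, proba, multi_class='ovr', average='weighted')

# Same quantity by hand: one binary AUC per label, averaged with the label supports.
classes = np.unique(y_true)
per_class = [roc_auc_score(y_true == c, proba[:, k]) for k, c in enumerate(classes)]
support = [np.sum(y_true == c) for c in classes]
auc_manual = np.average(per_class, weights=support)
```

Because the average is weighted by support, frequent labels dominate the score, which matters with unbalanced classes.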
100
Baseline code example
from collections import defaultdict
import pandas as pd
import numpy as np
import sklearn.preprocessing
from scipy.sparse import csr_matrix, hstack

xtrain_df = pd.read_csv('training_X.tsv', delimiter='\t')
xtest_df = pd.read_csv('test_X.tsv', delimiter='\t')
ytrain_df = pd.read_csv('training_Y.tsv', delimiter='\t')
ytest_df = pd.read_csv('test_Y.tsv', delimiter='\t')

CATS = ['WARRANTIES_FLG', 'SHIPPING_MODE']  # … (list truncated on the slide)
# Map every categorical value to an integer index, using train and test values together
LABELS = defaultdict(set)
for cat in CATS:
    d = set(xtrain_df[cat].unique()).union(set(xtest_df[cat].unique()))
    LABELS[cat] = dict((v, i) for i, v in enumerate(d))
NUMERICAL = ['CARD_PAYMENT', 'COUPON_PAYMENT']  # … (list truncated on the slide)
ENCODERS = dict()
COLUMNS = []

def create_matrix(df, COLUMNS=None):
    """Stack the numerical columns and the one-hot encoded categorical columns."""
    print(df.shape)
    # DataFrame.to_sparse() no longer exists in pandas; build the sparse matrix directly
    X = csr_matrix(df[NUMERICAL].values.astype(float))
    if COLUMNS is not None:
        COLUMNS.extend(NUMERICAL)
    for cat in CATS:
        if COLUMNS is not None:
            # Record one column name per category value, in index order
            for v in sorted(LABELS[cat].items(), key=lambda x: x[1]):
                COLUMNS.append('%s (%s)' % (cat, v[0]))
        data = df[cat]
        data = np.ravel([LABELS[cat][v] for v in data])
        data = np.reshape(data, [data.size, 1])
        if cat in ENCODERS:
            data = ENCODERS[cat].transform(data)
        else:
            # Declare all known indices so values seen only in the test set still get a column
            oenc = sklearn.preprocessing.OneHotEncoder(
                categories=[np.arange(len(LABELS[cat]))])
            data = oenc.fit_transform(data)
            ENCODERS[cat] = oenc
        X = hstack((X, data))
    return X

# The training call fits the encoders and fills COLUMNS; the test call reuses the encoders
Xtrain = create_matrix(xtrain_df, COLUMNS)
Xtest = create_matrix(xtest_df)
101
Baseline code example
Binary Classifier
# Assumes '-' (no claim) sorts before every other claim type, so LabelEncoder maps it to 0;
# every actual claim type is then collapsed to 1
lenc = sklearn.preprocessing.LabelEncoder()
Ytrain = lenc.fit_transform(ytrain_df['CLAIM_TYPE'])
Ytrain[Ytrain != 0] = 1
lenc = sklearn.preprocessing.LabelEncoder()
Ytest = lenc.fit_transform(ytest_df['CLAIM_TYPE'])
Ytest[Ytest != 0] = 1
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)
clf.fit(Xtrain.toarray(), Ytrain)
pred = clf.predict(Xtest.toarray())
Multiclass Classifier
from sklearn.linear_model import SGDClassifier
# Rename the "no claim" marker (.loc avoids pandas chained assignment)
ytrain_df.loc[ytrain_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'
ytest_df.loc[ytest_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'
# Map each claim type to an integer index, using train and test values together
STATUS = set(ytrain_df['CLAIM_TYPE']).union(set(ytest_df['CLAIM_TYPE']))
STATUS = dict((v, i) for i, v in enumerate(STATUS))
Ytrain = np.array([STATUS[v] for v in ytrain_df['CLAIM_TYPE']])
Ytest = np.array([STATUS[v] for v in ytest_df['CLAIM_TYPE']])
# The slide uses loss='log', which recent scikit-learn releases rename to 'log_loss';
# a class_weight argument was left commented out on the slide
clf = SGDClassifier(n_jobs=4, loss='log_loss')
clf.fit(Xtrain, Ytrain)
pred = clf.predict(Xtest)
proba = clf.predict_proba(Xtest)
102
Big data at scale
Search engine
103
Search principles
The goal of search is to help users efficiently find the most relevant documents for a given query.
• Documents
• Depend on how the data is modeled
• Marketplace: product, offer (product sold by a merchant), SKU (variation of a product), …
• Video streaming: movie, tv series, tv episode, …
• Query
• Terms: what goes in the search box
• Filters: navigation items
• Relevancy
• Based on the data: by price, by freshness, …
• Based on user behavior: clicks, purchases, …
• Based on text semantics: entity extraction, …
• Based on corpus statistics: term frequencies, …
• Efficiency
• Low response time
• Assistance: spellcheck, autocomplete, …
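A minimal sketch of relevancy based on corpus statistics, assuming a plain TF-IDF score (the slide does not name a formula, and production engines use more refined ones). The product titles are invented for illustration.

```python
import math
from collections import Counter

# Toy product titles, invented for illustration.
DOCS = {
    "d1": "red running shoes",
    "d2": "blue running shoes for trail running",
    "d3": "red leather wallet",
}

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

TOKENS = {doc_id: tokenize(text) for doc_id, text in DOCS.items()}
N = len(DOCS)
# Document frequency: in how many documents each term appears.
DF = Counter(term for tokens in TOKENS.values() for term in set(tokens))

def tfidf_score(query, doc_id):
    """Sum, over query terms, of term frequency times inverse document frequency."""
    counts = Counter(TOKENS[doc_id])
    score = 0.0
    for term in tokenize(query):
        if term in counts:
            # +1 keeps terms present in every document from scoring zero
            idf = math.log(N / DF[term]) + 1.0
            score += counts[term] * idf
    return score

def rank(query):
    """Return document ids by decreasing score, dropping documents that match no term."""
    scored = [(tfidf_score(query, doc_id), doc_id) for doc_id in DOCS]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]
```

For example, `rank("red wallet")` places `d3` first because "wallet" is rarer in the corpus than "red", so it carries more weight.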
104
Search principles
• Most existing search systems are based on indices
• Same as the indices found at the end of books
• Lucene is the best-known library to handle these
• Consider the query “the best search engines”
• First, break up the query into terms, remove common ones, and normalize
• Yields [“best”, “search”, “engine”]
• Look up each term in the index dictionary
• Yields a list of documents per term (called an inverted list)
• Find the documents common to all lists
• Sort the results in the desired order
(Image from https://stackoverflow.com/questions/17272050/book-index-page-layout-using-html5-and-css)
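The lookup steps on this slide can be sketched with a small in-memory inverted index. This is an illustration under stated assumptions, not how Lucene is implemented: the stop-word list, the naive plural stripping used as normalization, and the sample documents are all invented here.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "is"}  # assumed list

def analyze(text):
    """Break text into terms, remove common ones, and normalize."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    terms = [t for t in terms if t not in STOP_WORDS]
    # Naive normalization: strip a trailing "s" from words longer than 3 letters.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in terms]

def build_index(docs):
    """Map each term to the set of ids of the documents containing it (inverted lists)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term, sorted by id."""
    terms = analyze(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Sample documents, invented for illustration.
DOCS = {
    1: "The best search engines of the year",
    2: "Search engine optimization tips",
    3: "Best pizza in town",
    4: "Comparing the best engines for search",
}
INDEX = build_index(DOCS)
results = search(INDEX, "the best search engines")
```

Here the final sort is by document id; a real engine would sort by a relevancy score or by a field such as price.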
105
Search, big data and e-commerce
• Indexing challenges
• Volume of data: large marketplaces contain a lot of documents
• Update rate: information can change fast; price, inventory
• Search challenges
• Query rate: lots of end-users querying at the same time
• Features: not only document retrieval, but also navigation, statistics, …
• Relevancy challenges
• Documents do not usually contain natural language
• More susceptible to spam and merchants trying to game the system
• Need to balance well-selling items with discovery, especially for newer releases of popular products
• Multi-language support for global marketplaces
• Operational challenges
• Very large scale systems (2000+ nodes) need robust deployment and monitoring
• Resources distribution (models, linguistic resources)
106
Conclusion
107
Datascience everywhere!
Rakuten provides marketplaces worldwide
Specific challenges for recommendations
Items catalogue: reinforce the statistical power of co-occurrences across shops and services;
Item similarities: find the right parameters for the different use cases;
Recommendation models: which models are best for in-shop, all-shops, personalization?
Evaluation: how to handle the long tail? How to compare different models?
108
We are Hiring!
Positions
http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to business
• Python, Java, Hadoop, Couchbase, Cassandra…
Also hiring: search engine developers, big data system administrators, etc.
109
Thanks!
Questions?
More on Rakuten tech initiatives
http://www.slideshare.net/rakutentech
http://rit.rakuten.co.jp/oss.html
http://rit.rakuten.co.jp/opendata.html
Positions
http://www.priceminister.com/recrutement/?p=197

Contenu connexe

Tendances

第52回SWO研究会チュートリアル資料
第52回SWO研究会チュートリアル資料第52回SWO研究会チュートリアル資料
第52回SWO研究会チュートリアル資料Takanori Ugai
 
[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報Deep Learning JP
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割Rakuten Group, Inc.
 
みんなが知らない pytorch-pfn-extras
みんなが知らない pytorch-pfn-extrasみんなが知らない pytorch-pfn-extras
みんなが知らない pytorch-pfn-extrasTakuji Tahara
 
シリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのかシリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのかAtsushi Nakada
 
Graph Attention Network
Graph Attention NetworkGraph Attention Network
Graph Attention NetworkTakahiro Kubo
 
Active Learning の基礎と最近の研究
Active Learning の基礎と最近の研究Active Learning の基礎と最近の研究
Active Learning の基礎と最近の研究Fumihiko Takahashi
 
差分プライバシーによる時系列データの扱い方
差分プライバシーによる時系列データの扱い方差分プライバシーによる時系列データの扱い方
差分プライバシーによる時系列データの扱い方Hiroshi Nakagawa
 
Linux女子部 iptables復習編
Linux女子部 iptables復習編Linux女子部 iptables復習編
Linux女子部 iptables復習編Etsuji Nakai
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic DatasetsDeep Learning JP
 
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCHDeep Learning JP
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
Deep Learningによる超解像の進歩
Deep Learningによる超解像の進歩Deep Learningによる超解像の進歩
Deep Learningによる超解像の進歩Hiroto Honda
 
15. Transformerを用いた言語処理技術の発展.pdf
15. Transformerを用いた言語処理技術の発展.pdf15. Transformerを用いた言語処理技術の発展.pdf
15. Transformerを用いた言語処理技術の発展.pdf幸太朗 岩澤
 
コピュラと金融工学の新展開(?)
コピュラと金融工学の新展開(?)コピュラと金融工学の新展開(?)
コピュラと金融工学の新展開(?)Nagi Teramo
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Mounia Lalmas-Roelleke
 
GANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズムGANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズムHirosaji
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類Shintaro Fukushima
 

Tendances (20)

第52回SWO研究会チュートリアル資料
第52回SWO研究会チュートリアル資料第52回SWO研究会チュートリアル資料
第52回SWO研究会チュートリアル資料
 
[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割
 
みんなが知らない pytorch-pfn-extras
みんなが知らない pytorch-pfn-extrasみんなが知らない pytorch-pfn-extras
みんなが知らない pytorch-pfn-extras
 
シリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのかシリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのか
 
Graph Attention Network
Graph Attention NetworkGraph Attention Network
Graph Attention Network
 
Active Learning の基礎と最近の研究
Active Learning の基礎と最近の研究Active Learning の基礎と最近の研究
Active Learning の基礎と最近の研究
 
差分プライバシーによる時系列データの扱い方
差分プライバシーによる時系列データの扱い方差分プライバシーによる時系列データの扱い方
差分プライバシーによる時系列データの扱い方
 
Linux女子部 iptables復習編
Linux女子部 iptables復習編Linux女子部 iptables復習編
Linux女子部 iptables復習編
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
[DL輪読会]Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
 
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
【DL輪読会】AUTOGT: AUTOMATED GRAPH TRANSFORMER ARCHITECTURE SEARCH
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
Deep Learningによる超解像の進歩
Deep Learningによる超解像の進歩Deep Learningによる超解像の進歩
Deep Learningによる超解像の進歩
 
15. Transformerを用いた言語処理技術の発展.pdf
15. Transformerを用いた言語処理技術の発展.pdf15. Transformerを用いた言語処理技術の発展.pdf
15. Transformerを用いた言語処理技術の発展.pdf
 
コピュラと金融工学の新展開(?)
コピュラと金融工学の新展開(?)コピュラと金融工学の新展開(?)
コピュラと金融工学の新展開(?)
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
 
GANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズムGANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズム
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類
 

Similaire à How Data Science Drives Personalization and Recommendations at Rakuten

Data Science in E-commerce
Data Science in E-commerceData Science in E-commerce
Data Science in E-commerceVincent Michel
 
Openbar Kontich // How to create intelligent & personal conversational AI - W...
Openbar Kontich // How to create intelligent & personal conversational AI - W...Openbar Kontich // How to create intelligent & personal conversational AI - W...
Openbar Kontich // How to create intelligent & personal conversational AI - W...Openbar
 
Tech Job Conference: Software Engineer @Criteo
Tech Job Conference: Software Engineer @CriteoTech Job Conference: Software Engineer @Criteo
Tech Job Conference: Software Engineer @CriteoGilles Legoux
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_publicVincent Michel
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Diego Oppenheimer
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwarePanorama Software
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigManish Chopra
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
Privacy for tech startups
Privacy for tech startups Privacy for tech startups
Privacy for tech startups Marc Gallardo
 
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? INACAP
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Sironta at OpenOffice.org Conference 2010
Sironta at OpenOffice.org Conference  2010Sironta at OpenOffice.org Conference  2010
Sironta at OpenOffice.org Conference 2010Manu Arjó
 
Big data in Private Banking
Big data in Private BankingBig data in Private Banking
Big data in Private BankingJérôme Kehrli
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data ScienceInfoFarm
 
Industrial Internet of Things (IIoT) for Automotive Paint Shop Operations
Industrial Internet of Things (IIoT) for Automotive Paint Shop OperationsIndustrial Internet of Things (IIoT) for Automotive Paint Shop Operations
Industrial Internet of Things (IIoT) for Automotive Paint Shop OperationsRam Shetty
 
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...Nelson Petracek
 

Similaire à How Data Science Drives Personalization and Recommendations at Rakuten (20)

Data Science in E-commerce
Data Science in E-commerceData Science in E-commerce
Data Science in E-commerce
 
Openbar Kontich // How to create intelligent & personal conversational AI - W...
Openbar Kontich // How to create intelligent & personal conversational AI - W...Openbar Kontich // How to create intelligent & personal conversational AI - W...
Openbar Kontich // How to create intelligent & personal conversational AI - W...
 
Tech Job Conference: Software Engineer @Criteo
Tech Job Conference: Software Engineer @CriteoTech Job Conference: Software Engineer @Criteo
Tech Job Conference: Software Engineer @Criteo
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_public
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama Software
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-Koenig
 
Big Data
Big DataBig Data
Big Data
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
Privacy for tech startups
Privacy for tech startups Privacy for tech startups
Privacy for tech startups
 
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Sironta at OpenOffice.org Conference 2010
Sironta at OpenOffice.org Conference  2010Sironta at OpenOffice.org Conference  2010
Sironta at OpenOffice.org Conference 2010
 
Big data in Private Banking
Big data in Private BankingBig data in Private Banking
Big data in Private Banking
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data Science
 
Industrial Internet of Things (IIoT) for Automotive Paint Shop Operations
Industrial Internet of Things (IIoT) for Automotive Paint Shop OperationsIndustrial Internet of Things (IIoT) for Automotive Paint Shop Operations
Industrial Internet of Things (IIoT) for Automotive Paint Shop Operations
 
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
 
Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...
 

Plus de Karthik Murugesan

Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesKarthik Murugesan
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach Karthik Murugesan
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionKarthik Murugesan
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2Karthik Murugesan
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019Karthik Murugesan
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019Karthik Murugesan
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019Karthik Murugesan
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at UberKarthik Murugesan
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF EstimatorKarthik Murugesan
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Karthik Murugesan
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018Karthik Murugesan
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Karthik Murugesan
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Karthik Murugesan
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoveryKarthik Murugesan
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform Karthik Murugesan
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Karthik Murugesan
 

Plus de Karthik Murugesan (20)

Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
 


How Data Science Drives Personalization and Recommendations at Rakuten

  • 1. Datascience in E-commerce industry Vincent MICHEL, Big Data EU, Rakuten.Inc vincent.michel@mail.rakuten.com
  • 2.
  • 3. Rakuten Group Worldwide https://global.rakuten.com/corp/about/index.html#strengths 2018/11/05 Recommendation challenges: • Different languages • User behavior • Business areas
  • 4. Rakuten Group in numbers Rakuten in Japan: • > 12.000 employees • > 48 billion euros of GMS • > 100.000.000 users • > 250.000.000 items • > 40.000 merchants https://global.rakuten.com/corp/ 2018/11/05 Rakuten Group: • Kobo 18.000.000 users • Viki 28.000.000 users • Viber 345.000.000 users
  • 5. Rakuten Ecosystem Rakuten global ecosystem: • Member-based business model that connects Rakuten services • Rakuten ID common to various Rakuten services • Online shopping and services • Main business areas: E-commerce, Internet finance, Digital content https://global.rakuten.com/corp/about/index.html#strengths 2018/11/05 https://global.rakuten.com/corp/about/history.html 2018/11/05 Recommendation challenges: • Cross-services • Aggregated data • Complex user features
  • 6. Rakuten’s e-commerce: B2B2C Business Model Business to Business to Consumer: • Merchants located in different regions / online virtual shopping mall • Main profit sources: fixed fees from merchants; fees based on each transaction and other services Recommendation challenges: • Many shops • Item references • Global catalog
  • 7. Big Data @ Rakuten Mission: Development and operations of internal systems for: • Recommendations • Search • Targeting • User behavior tracking Average traffic: • > 100.000.000 events / day • > 40.000.000 item views / day • > 50.000.000 searches / day • > 750.000 purchases / day Technology stack: • Java / Python / Ruby • Solr / Lucene • Cassandra / Couchbase • Hadoop / Hive / Pig • Redis / Kafka
  • 8. Short Bio Education: • ESPCI: engineer in Physics / Biology • ENS Cachan: MVA Master (Mathematics, Vision and Learning) • INRIA Parietal team: PhD in Computer Science (Understanding the visual cortex by using classification techniques) Experience: • Logilab – Development and data science consulting: Data.bnf.fr (French National Library open-data platform), Brainomics (platform for heterogeneous medical data) • Rakuten PriceMinister – Senior developer and data scientist: data engineering and data science consulting • Rakuten – Recommendations & Personalization team lead: leading a team of engineers, data scientists and project managers
  • 9. Software engineering: Lessons learned from (painful) experiences
  • 10. Do not redo it yourself! Lots of interesting open-source libraries for all your needs • Test first on a small POC, then contribute/develop • Scikit-learn, pandas, Caffe, Scikit-image, opencv, … • Be careful: it is easy to do something wrong! Open data • More and more open data for catalogs, … • E.g. data.bnf.fr: ~ 2.000.000 authors, ~ 200.000 works, ~ 200.000 topics Contribute to open source • Unless you are doing some kind of super magical algorithm • Is there a need / pool of potential developers? • Do it well (documentation / tests) • May bring you help, bug fixes, and engineers! But it takes time and energy
  • 11. Quality in data science software engineering Never underestimate integration cost • It is easy to write 20 lines of Python doing some fancy Random Forests… • …that could be hard to deploy (data pipeline, packaging, monitoring) • Developer != DevOps != Sys admin Make it clean from the start (> 2 days of dev or > 100 lines of code) • Tests, tests, tests, tests, tests, tests, tests, … • Packaging / supervision / monitoring • Release early, release often • Documentation, Agile development, pull requests, code versioning Choose the right tool • Do you really need this super fancy NoSQL database to store your transactions?
  • 12. Monitoring and alerting: building a data science product • Hardware (CPU, IO, …) • Software (errors, requests, …) • Data science (KPIs, …)
  • 13. Hiring remarks: Selling yourself as a (good) data scientist
  • 14. Defining yourself as a data scientist Do not try to sell yourself as a unicorn! Define your skills (and unicorns no longer exist…)
  • 15. A few remarks on hiring – my personal opinion Be careful of CVs with buzzwords! • E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …” • It is like someone saying “IT skills: Python (for loop, if/else pattern, …)” Hungry for data? • Loving data is the most important thing to show • Open data? Personal project? Curious about data? (Hackathon?) • Multidisciplinary == knowing how to handle various datasets Improve your IT skills • You should be able to install/develop new libraries/algorithms • A huge part of the job could be to format / clean up the data • Experience vs. education -> autonomy
  • 16. Knowing the general context: A few remarks about GDPR
  • 17. What is the GDPR? • Adopted in April 2016 and applicable as of May 25th, 2018 • Replaces all the national legislation about the handling of personal data in Europe • Until now: the 1995 Directive, which had been transposed differently among the EU countries • In France, the law « Informatique et libertés » is going to be modified • There will still be other differing sources: national case law and national data protection authorities' (CNIL) doctrines • Still pending: adoption of the E-privacy Regulation (about cookies and OTT)
  • 18. Why GDPR? Why was the GDPR passed? • Harmonisation of the European rules • To directly target non-European companies doing business with EU data • To empower citizens and give them control over their data Why is the GDPR important? • Fines of up to 4% of the global annual turnover or 20 million euros. Now in France: max fine = 3 million euros • Loss of reputation and future customers • NGOs can bring claims on behalf of individuals • Burden of proof is on the company
  • 19. What will the GDPR really change? • Accountability principle: fewer formalities with the DPA but more internal preparatory work (DPIA) and possibly higher fines in case of a control (on-site or online) • Mandatory Data Protection Officer (records of processing) • New obligations for data processors • Security breach notification to the DPA (and even to the users in some cases), 72 hours max after a security incident • New user right: data portability
  • 20. When is the GDPR applicable? (decision flowchart) 1. Does your business collect, use or process personal data? If no: the GDPR is not applicable (but other privacy laws may apply). 2. Is an office of your business in the EU? 3. Does your business offer services to the EU? (Do you provide your service in any European languages? Does your service use/accept any European currency? Are EU customers specifically addressed, e.g. delivery?) 4. Do you monitor individuals in the EU? (Profiling; tracking by cookies or otherwise; analysis of personal preferences / behavior) Yes to 2, 3 or 4: the GDPR is applicable. No to all three: the GDPR is not applicable.
  • 23. What are recommendations? https://www.rinapiccolo.com/piccolo-cartoons/ A recommender system seeks to predict the "rating" or "preference" a user would give to an item (Wikipedia)
  • 24. Recommendations are generic Input entities (items / products, categories, users, shops, widgets / UI sections) + contextual features → Recommendations engine → Output entities (items / products, categories, users, shops, widgets / UI sections)
  • 25. What is personalization? “Personalization consists of tailoring a service or a product to accommodate specific individuals, sometimes tied to groups or segments of individuals” (Wikipedia) https://www.rakuten.co.jp 2018/11/07
  • 26. Personalization use cases • Push less but relevant content to the customer • Push dynamic content that fits the customer's context • Push the most attractive content to the customer first Page areas shown: left column links, main widgets, top header links (figure: page layouts whose blocks are removed, swapped or reordered per customer)
  • 27. What are industrial companies doing? “Netflix member loses interest after perhaps 60 to 90 seconds of choosing” [source] “Netflix recommender system is used on most screens of the Netflix product beyond the homepage, and in total influences choice for about 80% of hours streamed at Netflix.” [source] “Already, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix come from product recommendations based on such algorithms.” [source]
  • 28. Recommendations: different usages for different contexts Best offers, faster navigation, serendipity, complementary / substitute items… https://www.rakuten.fr 2018/11/05
  • 30. Recommender System overview (diagram components) • Datasources (BI, catalog…) • Users data • Tracker • Realtime context • Delivery API
  • 31. Challenges in Recommendations Items catalogue: • Catalogue for multiple shops with different item references? Items similarity / distances: • Cross-services aggregation? • Lots of parameters? Recommendations engine: • Best / optimal recommendations logic? Evaluation process: • Offline / online evaluation? • Long tail? KPI?
  • 32. Recommendations – Two-axis strategy AXIS 1: Candidates generation strategy (Cocounts, Prod2vec, Word2vec, Top, Content-based) • Batch logics 1 to 4 produce the recommendation candidates AXIS 2: Candidates ranking strategy / Learning-to-rank (user context, item context, page context, external context) • Candidates retrieval, then realtime ranking by an ML / AI algorithm + context & features (e.g. history, item, time, …) produces the ranked candidates
  • 33. Recommendation datatypes Ratings: numerical feedback from the users • Sources: stars, reviews, … • ✔ Qualitative and valuable data • ✖ Hard to obtain • Needs scaling and normalization! Unitary data: only 0/1 without any quality feedback • Sources: click, purchase… • ✔ Easy to obtain (e.g. tracker) • ✖ No direct rating (figure: two sparse users × items matrices, one filled with ratings from 1 to 5, one filled with 1s only)
  • 34. Items Catalogues Use different levels of aggregation to improve recommendations: • Category level (e.g. food, soda, clothes, …) • Product level (manufactured items) • Item-in-shop level (a specific product sold by a specific shop) Benefits: • Increased statistical power in co-events computation • Easier business handling (picking the right item)
  • 35. Recommendations & Personalization: How to work with unitary data?
  • 36. Cocounts for binary / unitary data Only occurrences of item views/purchases/… Possible metrics: • Jaccard distance • Cosine similarity • Conditional probability
  • 37. Co-occurrences and Similarities Computation Only access to unitary data (purchase / browsing): use co-occurrences to compute item similarity. Multiple possible parameters: • Size of the time window to be considered: do browsing and purchase data reflect similar behavior? • Threshold on co-occurrences: is one co-occurrence significant enough to be used? Two? Three? • Symmetric or asymmetric: is the order important in the co-occurrence? A then B == B then A? • Similarity metric: which similarity metric should be computed from the co-occurrences?
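The three metrics and the co-occurrence threshold listed above can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: each session is a list of item ids, the function name `cooccurrence_similarities` and its `min_cooc` parameter are hypothetical, and the Jaccard value is computed as a similarity (the Jaccard distance is one minus it).

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence_similarities(sessions, min_cooc=1):
    """Score item pairs from unitary data (presence only, no rating)."""
    item_counts = Counter()   # number of sessions containing each item
    pair_counts = Counter()   # number of sessions containing both items
    for session in sessions:
        items = sorted(set(session))            # ignore repeats within a session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_cooc:                     # threshold on co-occurrences
            continue
        n_a, n_b = item_counts[a], item_counts[b]
        scores[(a, b)] = {
            "jaccard": n_ab / (n_a + n_b - n_ab),   # similarity; distance = 1 - value
            "cosine": n_ab / sqrt(n_a * n_b),
            "p_b_given_a": n_ab / n_a,              # asymmetric: P(b | a)
            "p_a_given_b": n_ab / n_b,              # asymmetric: P(a | b)
        }
    return scores

# Hypothetical sessions, for illustration only
sessions = [["tv", "hdmi"], ["tv", "hdmi", "stand"], ["tv"], ["hdmi", "stand"]]
scores = cooccurrence_similarities(sessions)
print(scores[("hdmi", "tv")])
```

Pairs are keyed in sorted order, so the symmetric metrics are stored once, while the two conditional probabilities keep the asymmetric view.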
  • 38. Co-occurrences Example (figure: timeline of browsing and purchase events dated from 08/09/2015 to 24/11/2015, grouped either by session or by time window 1 / time window 2)
  • 39. Co-occurrences Example Classical co-occurrences: • Co-purchases: complementary items (“You may also want…”) • Co-browsing: substitute items (“Similar items…”) Other possible co-occurrences: • Items browsed and bought together • Items browsed and not bought together (figure: example events dated from 08/09/2015 to 08/11/2015)
  • 40. Recommendations & Personalization: How to do it for ratings data?
  • 41. Algorithm 1 - Collaborative filtering User-user #items < #users Items are changing quickly Item-item #items >> #users Steps: 1 – Compute user similarities (cosine similarity, Pearson) 2 – Weighted average of ratings (figure: sparse users × items rating matrix with one unknown rating "?" to predict)
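The two steps above (user similarities, then a weighted average of ratings) can be sketched as follows for the user-user case. This is a minimal illustration, not the deck's implementation: the function name `predict_rating`, the use of 0 to mark a missing rating, and the toy matrix are assumptions.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """User-user collaborative filtering on a dense users x items array.

    A value of 0 means "not rated" (assumption made for this sketch).
    Returns None when no other user rated the item.
    """
    rated = ratings > 0
    sims = np.zeros(ratings.shape[0])
    for other in range(ratings.shape[0]):
        if other == user or not rated[other, item]:
            continue
        common = rated[user] & rated[other]       # items rated by both users
        if not common.any():
            continue
        u, v = ratings[user, common], ratings[other, common]
        # Step 1: cosine similarity on the co-rated items
        sims[other] = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    if sims.sum() == 0:
        return None
    # Step 2: similarity-weighted average of the neighbours' ratings
    return float(sims @ ratings[:, item] / sims.sum())

# Hypothetical 3 users x 3 items matrix; user 0 has not rated item 2
ratings = np.array([[5, 3, 0],
                    [4, 3, 4],
                    [1, 5, 2]], dtype=float)
pred = predict_rating(ratings, user=0, item=2)
print(pred)
```

User 1 rates the shared items much like user 0, so the prediction is pulled towards user 1's rating of 4 rather than user 2's rating of 2.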
  • 42. Algorithm 2 - Matrix factorization • Choose a number of latent variables to decompose the data • Predict a new rating using the product of latent vectors • Use gradient descent techniques (e.g. SGD) • Add some regularization (figure: the sparse users × items rating matrix is approximated by the product of a users × latent-variables matrix and a latent-variables × items matrix)
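The four bullets above map directly onto a small SGD loop. The sketch below is illustrative only: the function names `factorize` and `rmse`, the hyperparameter values and the toy ratings are assumptions, and no bias terms are used.

```python
import numpy as np

def factorize(triples, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
              n_epochs=500, seed=0):
    """Learn user and item latent vectors from (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user latent vectors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item latent vectors
    for _ in range(n_epochs):
        for idx in rng.permutation(len(triples)):           # stochastic order
            u, i, r = triples[idx]
            err = r - P[u] @ Q[i]                            # prediction error
            p_u = P[u].copy()
            # Gradient step on the squared error with L2 regularization
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def rmse(triples, P, Q):
    """Root mean squared error of the predictions on the given triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples])))

# Hypothetical observed ratings (user index, item index, rating)
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
P, Q = factorize(triples, n_users=4, n_items=3)
print(rmse(triples, P, Q))      # training error
print(P[0] @ Q[2])              # predicted rating for an unseen (user, item) pair
```

A rating that was never observed is predicted as the dot product of the corresponding user and item vectors, which is the "product of latent vectors" of the second bullet.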
  • 43. Matrix factorization – MovieLens example
    Read files:
      import csv

      movies_fname = '/path/ml-latest/movies.csv'
      with open(movies_fname) as fobj:
          # Map movieId -> title
          movies = dict((r[0], r[1]) for r in csv.reader(fobj))
      ratings_fname = '/path/ml-latest/ratings.csv'
      with open(ratings_fname) as fobj:
          header = next(fobj)  # skip the header row
          # (userId, movie title, rating) triples
          ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]
    Build sparse matrix:
      import scipy.sparse as sp

      user_idx, item_idx = {}, {}
      data, rows, cols = [], [], []
      for u, i, s in ratings:
          # Assign consecutive integer indices to users and items
          rows.append(user_idx.setdefault(u, len(user_idx)))
          cols.append(item_idx.setdefault(i, len(item_idx)))
          data.append(s)
      ratings = sp.csr_matrix((data, (rows, cols)))
      # Reverse lookups: matrix index -> original id / title
      reverse_item_idx = dict((v, k) for k, v in item_idx.items())
      reverse_user_idx = dict((v, k) for k, v in user_idx.items())
  • 44. Matrix factorization – MovieLens example
    Fit Non-negative Matrix Factorization:
      from sklearn.decomposition import NMF

      nmf = NMF(n_components=50)
      user_mat = nmf.fit_transform(ratings)
      item_mat = nmf.components_
    Plot results:
      component_ind = 3
      # Movies with a non-zero weight in the chosen latent component
      component = [(reverse_item_idx[i], s) for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
      for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
          print(movie, round(score))
    Sample results:
      Terminator 2: Judgment Day (1991) 24.0 • Terminator, The (1984) 23.0 • Die Hard (1988) 19.0 • Aliens (1986) 17.0 • Alien (1979) 16.0
      Exorcist, The (1973) 8.0 • Halloween (1978) 7.0 • Nightmare on Elm Street, A (1984) 7.0 • Shining, The (1980) 7.0 • Carrie (1976) 7.0
      Star Trek II: The Wrath of Khan (1982) 10.0 • Star Trek: First Contact (1996) 10.0 • Star Trek IV: The Voyage Home (1986) 9.0 • Contact (1997) 8.0 • Star Trek VI: The Undiscovered Country (1991) 8.0 • Blade Runner (1982) 8.0
  • 46. Content-based: what should we use? Attribute-based: • Encoded features (e.g. one-hot encoding) • Represent documents in the feature space • Find similar documents (k-NN, k-d tree, …) Content-based: • Encoded textual content of documents • Represent textual content in an embedding space • Find similar documents (k-NN, k-d tree, …) Can be linked to the search engine https://www.rakuten.fr 2018/11/05
  • 47. Example of feature: Named entities in product description
    Sample of code with Polyglot:
      from polyglot.text import Text

      # blob holds the raw text of the product description
      text = Text(blob)
      for sent in text.sentences:
          print(sent, "\n")
          for entity in sent.entities:
              print(entity.tag, entity)
    Output:
      (Sentence("A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle, Marcus Goldman, jeune écrivain à succès, est dans..."), '\n')
      (u'I-LOC', I-LOC([u'New', u'York']))
      (u'I-PER', I-PER([u'Marcus', u'Goldman']))
      (Sentence("Lire la suite la tourmente : il est incapable d'écrire le nouveau roman qu'il doit remettre à son éditeur d'ici quelques mois."), '\n')
      (Sentence("Le délai est près d'expirer quand soudain tout bascule pour lui : son ami et ancien professeur d'université, Harry Quebert, l'un des écrivains les plus respectés du pays, est rattrapé par son passé et se retrouve accusé d'avoir assassiné, en 1975, Nola Kellergan, une jeune fille de 15 ans, avec qui il aurait eu une liaison."), '\n')
      (u'I-PER', I-PER([u'Harry', u'Quebert']))
      (u'I-PER', I-PER([u'Nola', u'Kellergan']))
    https://fr.shopping.rakuten.com/mfp/3011174/la-verite-sur-l-affaire-harry-quebert-joel-dicker-livre?pid=171972011 2018/11/05
  • 48. Word2vec: two-layer neural network Distributed representation of words: • Continuous bag-of-words: predict current word from surrounding words only • Skip-gram: use current word to predict surrounding words https://skymind.ai/wiki/word2vec
  • 50. Word2vec: Code sample
      import string
      import unidecode
      import requests
      from bs4 import BeautifulSoup

      refs = ["annee1", "artgrdp1", "ballades1", "bugjarg1", "contemplA2", "contemplB2",
              "feuilles1", "hugoshak1", "legend1", "legendet21", "nddp1", "oriental1",
              "quatrevt1", "rayons1", "ruesboi1", "satan1"]
      textes = ""
      for ref in refs:
          print(ref)
          res = requests.get("http://abu.cnam.fr/cgi-bin/donner_html?%s" % ref)
          text = BeautifulSoup(res.content, 'lxml').text
          # Keep only the body between the file start / end markers
          textes += text.split("DEBUT DU FICHIER")[1].split("FIN DU FICHIER")[0]
      # Split into sentences on "." and ";", then tokenize and strip accents
      sentences = [t for t in textes.replace("\n", " ").replace("\r", " ").split(".") if len(t) > 20]
      sentences = [t.lower().strip() for sentence in sentences for t in sentence.split(";")]
      words = [[t.strip().strip(string.punctuation) for t in sentence.split()] for sentence in sentences]
      words = [[unidecode.unidecode(t) for t in word] for word in words]
    Model learning with Gensim:
      import gensim

      # gensim < 4.0 signature; in gensim >= 4.0 the size parameter is named vector_size
      model = gensim.models.Word2Vec(words, min_count=5, size=500, sg=1)
  • 51. Word2vec: two-layer neural network We can use + and – operations on embedding vectors to find logical relationships
      model.wv.most_similar("femme")
      [('fille', 0.8448578715324402), ('danseuse', 0.8413887023925781), ('heureuse', 0.8404737114906311), ('fee', 0.8399657011032104), ('jolie', 0.8360834717750549), ('epouse', 0.8242064714431763), ('malheureuse', 0.823990523815155), ('jeune', 0.8126694560050964), ('creature', 0.8117369413375854), ('bohemienne', 0.8117284178733826)]
      model.wv.most_similar(positive=["quasimodo", "femme"], negative=["homme"])
      [('bohemienne', 0.7697243094444275), ("l'egyptienne", 0.7591158747673035), ('chevre', 0.7411096096038818), ('recluse', 0.7373903393745422), ('esmeralda', 0.733445405960083), ('rene-jean', 0.7310534715652466), ('parole', 0.7230771780014038), ('tourna', 0.715079665184021), ('vivement', 0.7149174213409424), ('condamnee', 0.7142783403396606)]
      model.wv.most_similar("philosophe")
      [('ecolier', 0.9446097016334534), ('soulier', 0.9436287879943848), ('bandit', 0.9423332810401917), ('bonhomme', 0.9420523643493652), ('ennemi', 0.938879132270813), ('gentilhomme', 0.9362510442733765), ('savant', 0.9335594177246094), ('officier', 0.9284510612487793), ('traitre', 0.9282448887825012), ('damne', 0.926963210105896)]
  • 52. Convolutional neural network (CNN) https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html • Feature sharing • Features are not hand-designed • Boosted algorithm performance on many tasks! • Data-driven (but you need data!)
  • 54. Prod2vec: a purchase session as a sentence (E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.) A sequence of purchases (08/09/2015, 10/09/2015, 08/11/2015, 24/11/2015) ≈ a sentence (“Tut, I have lost myself; I am not here; This is not Romeo, he's some other where.”) Apply Word2Vec on a “sentence” of purchases
  • 55. Prod2vec: Theory (E-commerce in Your Inbox: Product Recommendations at Scale, Grbovic et al.) Prod2vec learns a low-dimensional embedding representation of products using the skip-gram model. • Objective function: summed over S, the set of sessions • Probability of seeing the neighboring product p_{i+j} given product p_i • v and v’ are the input and output vector representations (to be learned) • Similar products should be close in the vector space
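The formulas of this slide did not survive in the transcript. For reference, the skip-gram objective and softmax probability it describes can be written as follows, in the form used for prod2vec in the cited Grbovic et al. paper up to notation. The symbols c (context window size) and P (number of distinct products) are introduced here and do not appear in the transcript.

```latex
\mathcal{L} = \sum_{s \in S} \; \sum_{p_i \in s} \; \sum_{-c \le j \le c,\; j \ne 0} \log \mathbb{P}(p_{i+j} \mid p_i)

\mathbb{P}(p_{i+j} \mid p_i) =
  \frac{\exp\left(\mathbf{v}_{p_i}^{\top} \mathbf{v}'_{p_{i+j}}\right)}
       {\sum_{p=1}^{P} \exp\left(\mathbf{v}_{p_i}^{\top} \mathbf{v}'_{p}\right)}
```

Maximizing the objective pushes the input vector of a product towards the output vectors of the products purchased around it, which is why products with similar neighborhoods end up close in the vector space.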
  • 56. Prod2vec: embedding of products with similar neighborhood (figure: two purchase sequences sharing the same neighboring products)
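  • 57. Prod2vec: Python example with Gensim
    Model learning with Gensim:
      import json
      from gensim.models import Word2Vec

      # Assumption: purchases.json holds one JSON list of product ids per line (one session per line)
      with open('purchases.json') as fobj:
          sessions = [json.loads(line) for line in fobj]
      # gensim < 4.0 names; gensim >= 4.0 uses vector_size, wv.vectors and wv.index_to_key
      model = Word2Vec(sessions, size=100, window=5, min_count=1, workers=4)
      embeddings = model.wv.syn0
      indices = model.wv.index2word
    KNN computation with Falconn:
      import falconn

      knn = 10  # number of neighbors to retrieve (value chosen for illustration)
      params_cp = falconn.get_default_parameters(embeddings.shape[0], embeddings.shape[1])
      params_cp.lsh_family = falconn.LSHFamily.CrossPolytope
      params_cp.distance_function = falconn.DistanceFunction.EuclideanSquared
      lsh = falconn.LSHIndex(params_cp)
      lsh.setup(embeddings)
      query = lsh.construct_query_object()
      for i in range(embeddings.shape[0]):
          # Approximate nearest neighbors of product i in the embedding space
          res = query.find_k_nearest_neighbors(embeddings[i], knn)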
  • 58. Recommendations & Personalization: Learning to Rank. How to adapt to each specific context?
  • 59. LTR in a Nutshell Re-orders the recommendations based on the features. Pipeline: store features → offline machine learning → store model → real-time ML → recommend → API delivery https://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/45530.pdf
  • 60. LTR features • About the user • About the item • External factors
  • 61. Problem setting
      ts  item  ritem  features                    click
      1   A     B      .1  .3  .2  .9  .4  .0  .2  0
      1   A     C      .9  .7  .6  .0  .0  .3  .6  0
      1   A     D      .4  .8  .6  .3  .2  .1  .0  1
      2   B     A      .3  .1  .5  .7  .1  .9  .1  0
      2   B     E      .1  .5  .2  .3  .2  .8  .7  1
    Learn any model (linear, non-linear, …)
  • 62. Learning to Rank approaches • Pointwise approach: consider each pair (document, target) separately (clicked, purchased) • Pairwise approach: consider the order of two documents and minimize inversion errors • Listwise approach: consider all documents, and try to optimize the overall/average score Example list: 1 Not Clicked • 2 Clicked • 3 Not Clicked • 4 Clicked • 5 Clicked • 6 Not Clicked https://www.rakuten.fr 2018/11/05
  • 63. Learning to Rank approaches: algorithms
    Pointwise approach (regression problem: predict the document score)
      1 Not Clicked == (feature11, feature21, …, 0)
      2 Clicked == (feature12, feature22, …, 1)
      3 Not Clicked == (feature13, feature23, …, 0)
    Pairwise approach (classification problem: which document is better)
      1 Not Clicked / 2 Clicked == (feature11, feature21, feature12, feature22…, 0)
      1 Not Clicked / 3 Not Clicked == (feature11, feature21, feature13, feature23…, 1)
      3 Not Clicked / 2 Clicked == (feature12, feature22, feature13, feature23…, 2)
    Listwise approach (more complex problem: optimize the value of one of the evaluation measures)
      1 Not Clicked / 2 Clicked / 3 Not Clicked == (feature11, feature21, feature12, feature22…, (2, 1, 3))
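The pointwise and pairwise encodings above can be sketched as two small dataset builders. This is an illustration under assumptions: the function names and toy feature values are hypothetical, pairs with identical labels are skipped, and the pairwise label is a binary "first document should rank higher" flag, which is one common convention rather than the exact labelling shown on the slide.

```python
from itertools import combinations

def pointwise_examples(docs):
    """One (features, label) training example per document."""
    return [(features, clicked) for features, clicked in docs]

def pairwise_examples(docs):
    """One training example per pair of documents with different labels.

    The features of both documents are concatenated; label 1 means the
    first document of the pair should be ranked above the second.
    """
    examples = []
    for (f_a, y_a), (f_b, y_b) in combinations(docs, 2):
        if y_a == y_b:
            continue                      # a tie carries no ordering information
        examples.append((f_a + f_b, int(y_a > y_b)))
    return examples

# Hypothetical list of 3 recommended items: (features, clicked)
docs = [([0.1, 0.3], 0), ([0.9, 0.7], 1), ([0.4, 0.8], 0)]
print(pointwise_examples(docs))
print(pairwise_examples(docs))
```

Either output can then be fed to any regressor or classifier, such as the models listed on the next slide.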
  • 64. Learning to Rank approaches: solutions Regression/classification problem: • SVM • SGD • Random Forests Listwise measures: • nDCG: normalized Discounted Cumulative Gain
    Ranking 1: Doc 1 Rel = 5, Doc 2 Rel = 3, Doc 3 Rel = 1, Doc 4 Rel = 4, Doc 5 Rel = 2: CG5 = 15; DCG5 = 9.89; IDCG5 = 10.27; nDCG5 = 0.96
    Ranking 2: Doc 1 Rel = 5, Doc 2 Rel = 3, Doc 3 Rel = 4, Doc 4 Rel = 1, Doc 5 Rel = 2: CG5 = 15; DCG5 = 10.1; IDCG5 = 10.27; nDCG5 = 0.98
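The figures of the nDCG example above can be reproduced with a few lines. The sketch assumes the DCG variant that divides each relevance by log2(rank + 1), which is the variant consistent with the values on the slide; the function names are hypothetical.

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank k is divided by log2(k + 1)."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking_1 = [5, 3, 1, 4, 2]
ranking_2 = [5, 3, 4, 1, 2]
for ranking in (ranking_1, ranking_2):
    print(sum(ranking), round(dcg(ranking), 2), round(ndcg(ranking), 2))
```

Both rankings have the same cumulative gain (15); the second one scores higher only because it places the document with relevance 4 earlier in the list.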
  • 65. Recommendations & Personalization: Recommendations Quality. How to evaluate recommendations?
  • 66. Recommendation Quality Challenges Recommendation categories: • Cold start issue: external data? cross-services? • Hot products (A): top-N items? • Short tail (B) • Long tail (C + D) (figure: quadrants (A) to (D) along two axes, new product vs. old product and minor product vs. major (popular) product)
  • 67. Offline Evaluation Pros/Cons: • Convenient way to try new ideas • Fast and cheap • But hard to align with online KPIs Approaches: • Rescoring • Prediction game • Business simulator
  • 68. Public Initiative – Viki Recommendation Challenge http://www.dextra.sg/challenges/rakuten-viki-video-challenge 567 submissions from 132 participants
  • 69. A/B Testing • Track users’ interactions with the A/B test variants (A: control, B: variation) • Compute statistical tests • Choose which version to put in production
  • 70. A/B Statistical test Do not peek at A/B tests or stop them before the end: the sample size should be fixed in advance. (figure: A/B tests A, B and C checked at 500, 1000 and 1500 samples; ✗ tests stopped early because not significant, or kept because significant at an intermediate check; ✓ tests kept because significant at the planned sample size)
  • 71. A/B Statistical test https://www.evanmiller.org/how-not-to-run-an-ab-test.html What sample size to use? It depends on the sample variance you expect and the minimum effect you wish to detect. https://www.evanmiller.org/ab-testing/sample-size.html
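To make the dependence on variance and minimum effect concrete, the sketch below uses the standard normal-approximation formula for comparing two proportions. It is a textbook approximation and not necessarily the exact computation behind the linked calculator; the function name and default values for alpha and power are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, power=0.8):
    """Samples needed in EACH variant to detect an absolute lift of
    min_detectable_effect over baseline_rate with a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / min_detectable_effect ** 2)

# Hypothetical case: 10% baseline conversion, detect an absolute +2 points.
n = sample_size_per_variant(0.10, 0.02)
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why very small improvements are expensive to validate.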
  • 72. Continuous A/B test process. Expected output: step-by-step increase of CVRia / orders
      • Short A/B test (2 or 3 days): L1 (control): CVRia 0.21; L2: CVRia 0.03; L3: CVRia 0.7
      • Short A/B test (2 or 3 days): L3 (control): CVRia 0.065; L1 (old control): CVRia 0.03; L4: CVRia 0.09
      • Short A/B test (2 or 3 days): L4 (control): CVRia 0.07; L3 (old control): CVRia 0.72; L2: CVRia 0.03
  • 73. Multi-armed bandit: testing different models without knowing their outcome
      • Explore the different models to estimate their rewards
      • Exploit the best model known so far
      • It costs to test hypotheses (== explore)
      • (Diagram: successive draws among Model A, Model B and Model C, with observed rewards 1.1, 0.5, 0.8, 0.2, 0.9 and 1.8.)
  • 74. Epsilon-greedy strategy
      • (Diagram: with probability 1-ε, use the best model; with probability ε, pick among Model 1, Model 2, …, Model n, each with probability 1/n.)
      • Many other strategies (e.g. see Wikipedia)…
          • Epsilon-first strategy: a pure exploration phase is followed by a pure exploitation phase.
          • Epsilon-decreasing strategy: the value of epsilon decreases over time.
      • …and different bandits:
          • Contextual bandit: at each iteration, we have access to a vector of contextual features.
          • Constrained contextual bandit: a total budget is associated with the bandit.
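A minimal sketch of the epsilon-greedy strategy described on this slide follows. The class and method names are hypothetical, and the reward estimate is a plain running mean, which is one simple choice among several.

```python
import random

class EpsilonGreedy:
    """With probability epsilon pick a model uniformly at random (explore),
    otherwise pick the model with the best mean reward so far (exploit)."""

    def __init__(self, n_models, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_models    # number of times each model was used
        self.values = [0.0] * n_models  # running mean reward of each model
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, model, reward):
        self.counts[model] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / count
        self.values[model] += (reward - self.values[model]) / self.counts[model]

bandit = EpsilonGreedy(n_models=3, epsilon=0.1, seed=0)
chosen = bandit.select()
bandit.update(chosen, reward=1.1)
```

An epsilon-decreasing variant would only need `self.epsilon` to shrink after each call to `update`.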
  • 75. Cleaning and improving datasets: Record linkage
  • 76. What is Record Linkage? "Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases)" (Wikipedia)
      • Usage for recommendations: global catalog; items aggregation; helps with cold start issues; improved navigation
      • (Diagram: Marketplace 1 and Marketplace 2 linked through a reference dataset.)
  • 77. Linked Open Data: Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
  • 78. Semantic web and RDF format. Triples: <subject> <relation> <object>. URI: unique identifier, e.g. http://dbpedia.org/page/Terminator_2:_Judgment_Day
  • 79. Record linkage for global recommendations
      • Linking products together in a service
          • Feature engineering: generate hierarchies of products
          • ✔ Improve statistical power of co-events computation by aggregation
          • ✔ If based on text, may be used for content-based recommendations
          • Based on record linkage techniques (e.g. minhashing)
      • Linking products together between services
          • Having a unique product id across services
          • Use recommendations from one service for another, new service
          • ✔ Avoid cold start issue
          • Cross-services recommendations
          • ✔ Show items from one service on another service's pages
      • Linking products to an external database (Wikidata)
          • More info for item enrichment, UI, content-based recommendations
  • 80. Record linkage – The big picture
      • Inputs: Dataset 1 and Dataset 2 (e.g. title, categories, price)
      • Blocking (e.g. minhashing on the titles): Block 1, Block 2, …, Block n, each holding a subset of Dataset 1 and a subset of Dataset 2
      • Comparisons (distances computation): for each block, comparisons based on attributes with specific distances
      • Links creation: Data 1_1 == Data 2_n, Data 1_2 == Data 2_p, Data 1_3 == Data 2_1, …
  • 81. Record linkage – Naive approach: match items from one dataset to the other using distances
      • Levenshtein
      • Jaccard
      • Python difflib: based on an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching" (see doc)
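The naive approach can be sketched with `difflib.SequenceMatcher` from the standard library. The helper names and the sample titles below are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratcliff/Obershelp-style similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(title, candidates):
    """Naive linkage: compare the title with every candidate, keep the closest."""
    return max(candidates, key=lambda candidate: similarity(title, candidate))

catalog = ["Terminator 2 - Judgment Day (DVD)", "The Terminator", "Titanic"]
match = best_match("Terminator 2: Judgment Day", catalog)
```

Every query is compared with every candidate, so matching two catalogs this way is quadratic in the number of items.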
  • 82. Shingles and documents representation
      • Shingles ~ word n-grams: split documents into lists of word groups
      • Example text: "A New York, au printemps 2008, alors que l'Amérique bruisse des prémices de l'élection présidentielle, Marcus Goldman, jeune écrivain à succès, est dans…"
      • E.g. word 3-grams: "A New York, au"; "New York au printemps"; …; "à succès, est"; "succès, est dans"
      • Use the Jaccard distance between sets of shingles
      • But…
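Below is a small sketch of word shingles and the Jaccard measure. The tokenization (lowercased alphanumeric words) is an assumption and may differ from the one used for the slide's example; the Jaccard distance is simply 1 minus the similarity computed here.

```python
import re

def word_shingles(text, n=3):
    """Set of word n-grams of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(set_a, set_b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

shingles = word_shingles("A New York, au printemps 2008")
# {'a new york', 'new york au', 'york au printemps', 'au printemps 2008'}
```

Two documents of very different lengths produce shingle sets of very different sizes, which is the issue minhashing addresses later in the deck.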
  • 83. Record linkage – The problem: combinatorial explosion. 10^6 items × 10^6 items = 10^12 comparisons
      • Use blocking, i.e. divide-and-conquer approaches: n-gram indexes, clustering, minhashing…
      • Apply more computationally expensive approaches on each block
  • 84. Minhashing - Theory. How to compare sets of shingles of different sizes? MINHASHING!
      • 1. Compute the hash of each shingle
      • 2. Keep the minimum hash value
      • 3. Repeat for 200 different hash functions
      • We now have a 200-dimensional representation of each document's set of shingles
      • Why does it work? If two documents have the same minimum hash value for a given hash function, they share the shingle that produced it.
      • (Diagram: Document 1 and Document 2, each split into Shingle 1, Shingle 2, …; hashing gives Value1, Value2, … and Value1', Value2', …; find the minimum hash value for each document.)
  • 85. Minhashing - Theory
      • We just have to keep the minimum hash value for each document: huge computational and storage boost!
      • Randomly picking 200 shingles and comparing them between 2 documents ≈ storing and comparing minimum values for 200 different hash functions
      • Jaccard(doc1, doc2) ≈ #(Minhash(S1) == Minhash(S2)) / nbhash
      • But we still have to compare all documents together (even if the comparison is way faster)
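The minhash procedure from the last two slides can be sketched as follows. Seeding a BLAKE2b hash with the function index is just one convenient way to get 200 different hash functions and is an assumption of this sketch, as are the function names.

```python
import hashlib

def seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingles, num_hashes=200):
    """For each hash function keep only the minimum value over all shingles.
    Expects a non-empty set of shingles."""
    return [min(seeded_hash(s, seed) for s in shingles)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minimum value is the same in both."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Two hypothetical shingle sets sharing 100 of 200 distinct shingles (Jaccard 0.5).
doc_a = {f"shingle {i}" for i in range(0, 150)}
doc_b = {f"shingle {i}" for i in range(50, 200)}
estimate = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

Each document is reduced to 200 integers whatever its length, so comparing two documents costs 200 equality checks instead of a set intersection.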
  • 86. Locality-Sensitive Hashing
      • What is locality-sensitive hashing (LSH)? It is a hash such that similar vectors tend to get similar hash values
      • It generates 'bands' of documents, where documents within a band are more or less similar and should be compared with minhashing
      • (Diagram: pipeline Shingle → Minhashing → Locality-Sensitive Hashing. Each document is split into shingles, then reduced to Minhash 1 … Minhash 200; minhashing comparisons are run only within each resulting group of documents.)
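One common way to apply LSH to minhash signatures is the banding scheme sketched below: each signature is cut into bands, and two documents become candidates for comparison as soon as one band is identical. Whether the system described in these slides uses exactly this scheme is an assumption; the function name and the toy signatures are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands):
    """signatures: dict doc_id -> minhash signature (equal-length lists).
    Returns the set of document pairs that share at least one identical band."""
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        rows = len(signature) // bands
        for band in range(bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs

# Toy signatures of length 4, cut into 2 bands of 2 values.
signatures = {"doc1": [1, 2, 3, 4], "doc2": [1, 2, 9, 9], "doc3": [7, 7, 7, 7]}
candidates = lsh_candidate_pairs(signatures, bands=2)  # {('doc1', 'doc2')}
```

More bands with fewer rows each make the filter more permissive; fewer, longer bands make it stricter. Only the candidate pairs then go through the full minhash comparison.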
  • 87. Minhashing + LSH - Results: Marketplace 1 and Marketplace 2 items linked to https://www.wikidata.org/wiki/Q170564
      • Labels: bg Терминатор 2: Денят на страшния съд; el Εξολοθρευτής 2: Μέρα Κρίσης; en Terminator 2: Judgment Day; es Terminator 2: el juicio final; fr Terminator 2 : Le Jugement dernier; ja ターミネーター2; ka ტერმინატორი 2: განკითხვის დღე
      • Properties: Director: James Cameron; Cast member: Arnold Schwarzenegger; Cast member: Edward Furlong; Follows: The Terminator; Genre: action film; Main subject: time travel, android; Narrative location: Los Angeles
      • Sources: https://item.rakuten.co.jp/auc-tecc/10016937/ 2018/11/07; https://fr.shopping.rakuten.com/mfp/5705594/terminator-2-2 2018/11/07
  • 88. Rakuten.fr datachallenge 2017: Using user reviews https://challengedata.ens.fr/fr/challenge/26/prediction_de_linteret_des_avis_utilisateurs.html
  • 89. User reviews on Rakuten.fr https://fr.shopping.rakuten.com 2016/10/10
  • 90. Challenge – Task 1: predict whether a review is useful for other users or not. May be used to boost interesting reviews on the website. Classification (#useful / #total > 0.5) or regression task on textual features. https://fr.shopping.rakuten.com 2016/10/10
  • 91. Challenge – Task 2: predict the number of stars given by the user based on his/her review. May be used to detect fraud and help improve quality on the website. Regression task (6 discrete values) on textual features. https://fr.shopping.rakuten.com 2016/10/10
  • 92. Data samples
      • Sample 1
          • product: b57c06ed94773c4d08bcefcdf8cbedd846bbdcba8d669a15d511b9acb92efeb43
          • review_title: "why not !"
          • review_content: "Yess! est un jeu de communication réussi, intuitif, malin assez rapide avec peu de temps mort. Voilà un jeu au rapport plaisir/prix qui est bien placé."
          • review_note: 4
          • feedback_positive_count: 0
          • feedback_negative_count: 0
      • Sample 2
          • product: 12d1407836441fc39805916ecb705604bcb539bd70b477c82d57f5043977102
          • review_title: "Lave linge parfait"
          • review_content: "Seul point négatif : la mise en réseau de la machine...Mais ce n'est pas ce que l'on recherche le plus dans un lave linge"
          • review_note: 4
          • feedback_positive_count: 0
          • feedback_negative_count: 0
  • 93. Expected results: the expected results are probabilities of being useful (Class 1)
      • Proba: 0.99. "Je vien d'accerir ses enceintes d'une qualité de son incroyable . Des basses profondes et puissantes . Pour amoureux de son !"
      • Proba: 0.0611228262615. "Parfait PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT PARFAIT"
  • 94. Baseline code example

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn import linear_model
        # KFold now lives in sklearn.model_selection (formerly sklearn.cross_validation)
        from sklearn.model_selection import KFold

        # French stopwords; the list is truncated on the slide
        STOPWORDS = ['alors', 'au', 'aucuns', 'aussi', 'autre', 'avant', 'avec']  # …

        fname = '/path/to/reviews_test_clean 2.csv'
        df = pd.read_csv(fname)

        # Share of positive feedback; reviews without any feedback give NaN,
        # which compares as False below
        df['feedback_ratio'] = df['feedback_positive_count'].astype(float) / (
            df['feedback_positive_count'] + df['feedback_negative_count'])
        df['feedback_class'] = df['feedback_ratio'] >= 0.5

        X = df['review_content'].values
        Y = df['feedback_class'].values

        # 5-fold cross-validation: TF-IDF features + linear classifier trained with SGD
        kf = KFold(n_splits=5)
        all_pred, all_test = [], []
        for train, test in kf.split(X):
            Xtrain, Xtest = X[train], X[test]
            Ytrain, Ytest = Y[train], Y[test]
            vect = TfidfVectorizer(stop_words=STOPWORDS)
            Xtrain = vect.fit_transform(Xtrain)
            Xtest = vect.transform(Xtest)
            clf = linear_model.SGDClassifier()
            clf.fit(Xtrain, Ytrain)
            pred = clf.predict(Xtest)
            all_pred.extend(pred)
            all_test.extend(Ytest)
  • 95. Rakuten.fr datachallenge 2018: Prediction of transaction claims status https://challengedata.ens.fr/en/challenge/39/prediction_of_transaction_claims_status.html
  • 96. Claims predictions in E-commerce
      • Claims have a huge impact in terms of user experience and cost
      • Claims can be handled differently depending on the case (broken, fake, …)
      • Predict whether a transaction is likely to lead to a claim, in order to focus on risky transactions
      • Possibly a huge impact in the whole E-commerce field!
  • 97. Dataset
      • ID: identifier of the sample
      • SHIPPING_MODE: mode of shipping of the product (RECOMMANDE, NORMAL, …)
      • SHIPPING_PRICE: cost of shipping, if existing (<1, 1<5, 5<10, 10<20, >20)
      • WARRANTIES_FLG: True if a warranty has been taken by the buyer
      • WARRANTIES_PRICE: price of warranty, if existing (<5, 5<20, 20<50, 50<100, 100<500, >500)
      • CARD_PAYEMENT: transactions paid by card
      • COUPON_PAYEMENT: transactions paid with a discount coupon
      • RSP_PAYEMENT: transactions paid with Rakuten Super Points
      • WALLET_PAYMENT: transactions paid with PriceMinister-Rakuten wallet
      • PRICECLUB_STATUS: status of the buyer (UNSUBSCRIBED, PLATINUM, …)
      • REGISTRATION_DATE: year of registration of the buyer
      • PURCHASE_COUNT: binned count of the buyer's previous purchases (<5, 5<20, 20<50, 50<100, 100<500, >500)
      • BUYER_BIRTHDAY_DATE: year of birth of the buyer
      • BUYER_DEPARTMENT: department of the buyer or -1
      • BUYING_DATE: year and month of the purchase
      • SELLER_SCORE_COUNT: binned count of the seller's previous sales (<100, 100<10^3, 10^3<10^4, 10^4<10^5, 10^5<10^6, >10^6)
      • SELLER_SCORE_AVERAGE: score of the seller on PriceMinister-Rakuten
      • SELLER_COUNTRY: country of the seller (FRANCE METROPOLITAN, CHINA, …)
      • SELLER_DEPARTMENT: department of the seller or -1
      • PRODUCT_TYPE: type of the purchased product (TOYS, CELLPHONE_ACCESSORY, …)
      • PRODUCT_FAMILY: family of the purchased product (ELECTRONICS, BABY, …)
      • ITEM_PRICE: binned price of the purchased product (<10, 10<20, 20<50, 50<100, 100<500, 500<1000, 1000<5000, >5000)
  • 98. Challenges
      • Complex interactions that involve multiple factors (probably not all contained in the features) and subjective information (the same shop does not always send broken products…)
      • Unbalanced classes!
      • Categorical features + numerical features (beware of ranges!)
      • Find some socio-demographic/behavioral features (e.g. based on country)
  • 99. Baseline
      • Metric: AUC weighted metric (from sklearn). Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
      • Algorithm used for benchmarks (naive and classic!)
          • Random forests classifier (from sklearn), with 200 estimators
          • Classical preprocessors (from sklearn): OneHotEncoder, LabelEncoder
      • Result obtained: 0.574 AUC weighted metric
  • 100. Baseline code example

        from collections import defaultdict
        import pandas as pd
        import numpy as np
        import sklearn.preprocessing
        from scipy.sparse import csr_matrix, hstack

        # Tab-separated input files
        xtrain_df = pd.read_csv('training_X.tsv', delimiter='\t')
        xtest_df = pd.read_csv('test_X.tsv', delimiter='\t')
        ytrain_df = pd.read_csv('training_Y.tsv', delimiter='\t')
        ytest_df = pd.read_csv('test_Y.tsv', delimiter='\t')

        CATS = ['WARRANTIES_FLG', 'SHIPPING_MODE']  # … (list truncated on the slide)
        NUMERICAL = ['CARD_PAYMENT', 'COUPON_PAYMENT']  # … (list truncated on the slide)

        # Map every category value seen in train or test to an integer code
        LABELS = defaultdict(set)
        for cat in CATS:
            d = set(xtrain_df[cat].unique()).union(set(xtest_df[cat].unique()))
            LABELS[cat] = dict((v, i) for i, v in enumerate(d))

        ENCODERS = dict()
        COLUMNS = []

        def create_matrix(df, COLUMNS=None):
            """Sparse matrix: numerical columns followed by one-hot encoded categories."""
            print(df.shape)
            X = csr_matrix(df[NUMERICAL].values.astype(float))
            if COLUMNS is not None:
                COLUMNS.extend(NUMERICAL)
            for cat in CATS:
                if COLUMNS is not None:
                    for v in sorted(LABELS[cat].items(), key=lambda x: x[1]):
                        COLUMNS.append('%s (%s)' % (cat, v[0]))
                data = np.array([LABELS[cat][v] for v in df[cat]]).reshape(-1, 1)
                if cat in ENCODERS:
                    # Reuse the encoder fitted on the training set
                    data = ENCODERS[cat].transform(data)
                else:
                    # One column per known code, so train and test share the same layout
                    oenc = sklearn.preprocessing.OneHotEncoder(
                        categories=[np.arange(len(LABELS[cat]))])
                    data = oenc.fit_transform(data)
                    ENCODERS[cat] = oenc
                X = hstack((X, data))
            return X.tocsr()

        # The training matrix must be built first: it fits the encoders
        Xtrain = create_matrix(xtrain_df, COLUMNS)
        Xtest = create_matrix(xtest_df)
  • 101. Baseline code example (continues from the previous slide)

      Binary classifier:

        import sklearn.preprocessing
        from sklearn.ensemble import RandomForestClassifier

        # Relies on '-' (no claim) being encoded as 0; every claim type becomes 1
        lenc = sklearn.preprocessing.LabelEncoder()
        Ytrain = lenc.fit_transform(ytrain_df['CLAIM_TYPE'])
        Ytrain[Ytrain != 0] = 1
        lenc = sklearn.preprocessing.LabelEncoder()
        Ytest = lenc.fit_transform(ytest_df['CLAIM_TYPE'])
        Ytest[Ytest != 0] = 1

        clf = RandomForestClassifier(n_estimators=50)
        clf.fit(Xtrain.toarray(), Ytrain)
        pred = clf.predict(Xtest.toarray())

      Multiclass classifier:

        import numpy as np
        from sklearn.linear_model import SGDClassifier

        # Give the "no claim" marker an explicit name (.loc avoids chained assignment)
        ytrain_df.loc[ytrain_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'
        ytest_df.loc[ytest_df['CLAIM_TYPE'] == '-', 'CLAIM_TYPE'] = 'NO COMPLAIN'

        # One integer code per claim type seen in train or test
        STATUS = set(ytrain_df['CLAIM_TYPE']).union(set(ytest_df['CLAIM_TYPE']))
        STATUS = dict((v, i) for i, v in enumerate(STATUS))
        Ytrain = np.array([STATUS[v] for v in ytrain_df['CLAIM_TYPE']])
        Ytest = np.array([STATUS[v] for v in ytest_df['CLAIM_TYPE']])

        # Logistic loss gives access to predict_proba; it is named 'log_loss'
        # in recent scikit-learn versions ('log' on the slide)
        clf = SGDClassifier(n_jobs=4, loss='log_loss')  # , class_weight='auto')#weights)
        clf.fit(Xtrain, Ytrain)
        pred = clf.predict(Xtest)
        proba = clf.predict_proba(Xtest)
  • 102. Big data at scale: Search engine
  • 103. Search principles: the goal of search is to help users efficiently find the most relevant documents for a given query.
      • Documents
          • Depend on how the data is modeled
          • Marketplace: product, offer (product sold by a merchant), SKU (variation of a product), …
          • Video streaming: movie, TV series, TV episode, …
      • Query
          • Terms: what goes in the search box
          • Filters: navigation items
      • Relevancy
          • Based on the data: by price, by freshness, …
          • Based on user behavior: clicks, purchases, …
          • Based on text semantics: entity extraction, …
          • Based on corpus statistics: term frequencies, …
      • Efficiency
          • Low response time
          • Assistance: spellcheck, autocomplete, …
  • 104. Search principles
      • Most existing search systems are based on indices
          • Same as the indices found at the end of books
          • Lucene is the best-known library to handle these
      • Consider the query "the best search engines"
          • First, break up the query into terms, remove common ones, and normalize
          • Yields ["best", "search", "engine"]
          • Look up each term in the index dictionary
          • Yields a list of documents per term (called an inverted list)
          • Find the documents common to all lists
          • Sort the results in the desired order
      • (Image from https://stackoverflow.com/questions/17272050/book-index-page-layout-using-html5-and-css)
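The lookup steps on this slide can be illustrated with a toy inverted index. This is a deliberately simplified sketch and not how Lucene is implemented: the stopword list, the crude plural stripping (standing in for a real stemmer) and all function names are assumptions.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "for"}  # hypothetical list of common terms

def normalize(text):
    """Split into terms, drop common ones, lowercase and strip a plural 's'."""
    terms = []
    for word in text.lower().split():
        word = word.strip(".,!?\"'")
        if not word or word in STOPWORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        terms.append(word)
    return terms

def build_index(docs):
    """docs: dict doc_id -> text. Returns term -> set of doc ids (inverted lists)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Documents containing every query term, sorted by id."""
    terms = normalize(query)
    if not terms:
        return []
    inverted_lists = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*inverted_lists))

docs = {
    1: "Best search engine comparison",
    2: "Search engines for shopping",
    3: "The best coffee",
}
index = build_index(docs)
results = search(index, "the best search engines")  # [1]
```

Sorting by document id is a placeholder for the "desired order"; a real engine would sort by a relevance score, price, freshness and so on.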
  • 105. Search, big data and e-commerce
      • Indexing challenges
          • Volume of data: large marketplaces contain a lot of documents
          • Update rate: information can change fast (price, inventory)
      • Search challenges
          • Query rate: lots of end-users querying at the same time
          • Features: not only document retrieval, but also navigation, statistics, …
      • Relevancy challenges
          • Documents do not usually contain natural language
          • More susceptible to spam and merchants trying to game the system
          • Need to balance well-selling items with discovery, especially for newer releases of popular products
          • Multi-language support for global marketplaces
      • Operation challenges
          • Very large scale systems (2000+ nodes) need robust deployment and monitoring
          • Resources distribution (models, linguistic resources)
  • 107. Datascience everywhere! Rakuten provides marketplaces worldwide. Specific challenges for recommendations:
      • Items catalogue: reinforce statistical power of co-occurrences across shops and services
      • Items similarities: find the right parameters for the different use-cases
      • Recommendations models: what are the best models for in-shop, all-shops, personalization?
      • Evaluation: handling the long tail? Comparing different models?
  • 108. We are Hiring! Positions: http://www.priceminister.com/recrutement/?p=197
      • Data Scientist / Software Developer
          • Build algorithms for recommendations, search, targeting
          • Predictive modeling, machine learning, natural language processing
          • Working close to business
          • Python, Java, Hadoop, Couchbase, Cassandra…
      • Also hiring: search engine developers, big data system administrators, etc.
  • 109. Thanks! Questions? More on Rakuten tech initiatives: http://www.slideshare.net/rakutentech http://rit.rakuten.co.jp/oss.html http://rit.rakuten.co.jp/opendata.html Positions: http://www.priceminister.com/recrutement/?p=197