SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
SPARK SUMMIT
EUROPE2016
MENTOR AND MENTEE RELATIONS
BASED ON AUTHORSHIP GRAPHS
Reza Karimi
Elsevier
Outline
• Introduction:
– Elsevier and Scopus Data
• Mentorship Model
• Training and Validation
• (Big) Graph Visualization (in Spark)
Who are we?
• Elsevier: the #1 provider of scientific, technical, and medical
information, and a technology company originally established
in 1880 in Netherlands. Now part of RELX group.
• Privileged to publish about 25% of cited (what matters)
scientific publications
• Publishing 400K articles per year in about 2500 journals such
as Lancet, Cell, Trends, Current Opinions, Artificial Intelligence
(generally accessed via ScienceDirect platform (about 900m
full text download per year)
• Curating Scientific publication data in products such as
Scopus, SciVal, Pure
Full Text Access in ScienceDirect
Scopus is the largest abstract and citation database of
peer-reviewed literature, and features smart tools that allow you to
track, analyze and visualize scholarly research.
What is Scopus?
Scopus includes content from more than 5,000
publishers and 105 different countries
21,568 peer-reviewed
journals
361 trade journals
• Full metadata, abstracts
and cited references
(ref’s post-1995 only)
• Funding data from
acknowledgements
• Citations back to 1970
Physical
Sciences
7,443
Health
Sciences
6,795
Social
Sciences
8,086
Life
Sciences
4,492
90K conference
events
7.3M conference
papers
Mainly Engineering
and Computer
Sciences
531 book series
30K Volumes /
1.2M items
119,882 stand-
alone books
974K items
Focus on Social
Sciences and A&H
65M records from 22K serials, 90K conferences and 120K books
• Updated daily
• Records back to 1823
• “Articles in Press” from > 3,750 titles
• 40 different languages covered
• 3,715 active Gold Open Access journals indexed
BOOKSCONFERENCES
Source: November 2015 title list at https://www.elsevier.com/solutions/scopus/content * Available late 2016
JOURNALS
27M patents
From 5 major
patent offices
- WIPO
- EPO
- USPTO
- JPO
- UK IPO
PATENTS*
Scopus Data Model
By employing the state of the art disambiguation and deduplication algorithms, entities such as authors, institutes and
cited document are disambiguated. This enables us to to analyze trends and to track researchers and institutes.
author
affiliation
Article
(65M)
Author
(12M)
Affiliation
(10M)
Author Disambiguation:
AI algorithms enhanced with multi-level feedbacks
Scopus use a combination of automated and curated data to automatically build robust author profiles,
which power the Elsevier Research Intelligence portfolio.
Scopus
author
profile
Scopus.com: Ongoing
algorithmic updates
Manual feedback via
Scopus.com Author
Feedback Wizard
Pure Profile Refinement
Service
Scopus2SciVal manual &
automated updates
Manual & Automated
feedback via ORCiD
One-off updates via
consortia/ governmental
agencies
Who uses Scopus Data? Examples
Co-authorship Graphs
• Based on disambiguated records, we can track all the publication, co-
authors, affiliation, corresponding authors, citing authors, cited authors …
for each author
• Authors will be modeled as vertices and each document co-published
represents an edge
• Current Analysis is based on:
– 65.2M documents (208M Authorships)
– 33.8M Authors (9.1M published within Elsevier)
– 4.6B co-authorships based on 382M unique author pairs
– 123M unique correspondence ordered author pairs
Author i
Author j
Author k
co-authorship graph
Problem Statement and Modeling
• Problem
– For each author, find the most likely mentor among co-authors (explicit
clues exist)
• Solution
– most likely: ranking
– being mentor: classification
• CLASSIFICATION INPUT:
– Approximate into a pairwise model as opposed to a model based on full-set
– Approximate with an aggregated graph
• TRAIN THE CLASSIFIER: Even for a simple pairwise model we need lots of
training data to account for variations in different fields, years, countries,…
– Start from simple heuristic common-sense models (cannot overfit) and improve it gradually
» Can predict accurately based on given clues/features, but cannot properly balance
the interaction and proper weight for each clue
– When satisfied, verify the early predictions via crowd sourcing and collect precise data for
ML training
Heuristic Mode I
Heuristic Mode II
(baseline)
Email Campaign
(Crowd Sourcing
Training Data)
ML Model I
ML Model II (Extra
Features)
Spark Pipeline in AWS: ETL and Aggregate
feature
tables
Graph
Data
analytical
format
temporary
format for
compute
optimized
nodes
raw data
in S3
xml files
Selected fields
into json
sorted and
repartitioned into
columnar parquet
Node Aggregation
N Feature I
N Feature II
N Feature ...
Edge Aggregation
E Feature A
E Feature B
E Feature ...
• Feature are primarily built on dual
summary statistics (around nodes for
nodal features or around all edges
between two nodes for edge features)
• Normalizing data is superbly important.
For an article with n authors, following
weights are adopted:
– 1/n : authorship
– 1/(n*(n-1)) : co-authorship
– 1/(n-1) : for corresponding authorship
• Non-numeric features are aggregated
primarily into
– most common value (and its frequency) for nodes
– overlap-share/cosine-similarity for edges
Mixing co-authorship and corresponding authorship
Author i
Author j
Author k
corresponding authorship graph
Author i
Author j
Author k
Final graph
co-authorship graph
Author i
Author j
Author k
cor-i->j
Feature Examples
• Number/weighted-sum of
– authorships
– co-authorships
– corresponding authorships
• First/Last/AVG year
• Publication Milestone Years (to simplify time-series):
– year reaching 3,5,10 publications (robustness against disambiguation and variations in academic fields)
• PageRank of Author in correspondence graph or citation graphs
• Number of co-authors/corresponding authors
• HIndex, Citation metrics
• Continent/Country, current affiliation(s)
• Email Domain(s)
• AVG author position in authorship sequences (from 0-1)
• Node Content representation by:
– Journal frequency vector
– Subject-area frequency vector
1. Apply mentorship to each edge
2. Rank co-authors based on their mentorship score and pick up the best score as the
candidate mentor
3. Create the mentorship graph (batch process):
1. Look for loops of length L (L=2,3,4,…)
2. Break loops by the weakest link and replace that link with the next mentor candidate
3. Go back to step 2 and if no loop found increase L
4. Create academic family-trees (for select authors):
can locally break loops of bigger lengths (such as L=7)
graph=GraphFrame(vertices, edges)
paths =graph.find("(a1)-[e]->(a2)")
paths.rdd.map(mentorship).toDF().write.format('parquet').mode('overwrite').saveAsTable(…)
Applying Pairwise Model
Validations via Crowd Sourcing
• Emails sent to opt-in
(Mendeley/Scopus) User
to evaluate our
predictions and collect
data for supervised ML
model
• A/B/C testing for Email
templates to optimize
open/click rate
• Clicks lead to a submit
page to avoid random
clicks
Email Clicked Submit Page Thank you email
• Funnel:
– 10000 randomly selected recipients
– 4315 recipients opened their email
– 1742 recipients clicked on provided choices
– 413 recipients submitted a choice
• Click vs Submit
– 23% of those who clicked, submitted
– 94% of submission and click choices were identical, the rest is primarily a switch to a more conservative choice
Prediction Accuracy
• Academic family-trees are important:
– Recommendations such as article/people in Mendeley
– Be aware of conflict of interests: for example in reviewer
selection or funding panelist
• Special features of academic family-trees:
– Low connectivity (every one has only (one) mentor)
– A structure growing with time like snow crystals
• Given the simplicity of the mentorship graph when it
is cached, Spark can act as a back-end for
instantaneous subgraph creation or other non-batch
analysis
Academic Family-Tree
Academic progeny of Prof. … in MIT
(intentionally blurred screenshot to minimize
personal impact on names displayed)
• To visualize (big) graphs in Spark application we need two components:
1. Sub-Graph creation: it would not be possible to visualize all edge/nodes, especially for highly
connected graphs such as:
1. co-authorships
2. journal to journal citation
3. institute to institute citation
To select a limited set of node/edges, filter or GraphFrame queries can help. They work great, if the
sub-graph can be obtained by some global filters (such as n-top strongest edges). However, we
had to make substantial development to cover sub-graphs centered around a given node. The
local sub-graph creation was extended by our library via customized development in line with data-frame
operations.
2. Visualization library: Adopted D3.js and displayHTML to visualize (Edges, Vertices) Dataframes
interactively. The library recognizes following elements:
– Directed Graphs (such as citation) including self referral edges or non-directed graphs (such as co-authorship)
displayed with or without arrows
– size, colors based on continuous/categorical data, line type for edges and nodes
– Line types
(Big) Graph Visualization (by Spark)
Visualization Example I: Global filter
Top 250 Global non-self institute citations
Visualization Example II:
Top Institute Citations in Computer Science
Visualization Example III: lsub-graph of
top citations around MIT
Top 50 citations to/from MIT
Top 150 citations for MIT or top MIT neighbors (distance_max=2)
SPARK SUMMIT
EUROPE2016
THANK YOU.
Feel free to reach me for further information or if
interested to join our team:
r.karimi@elsevier.com
Acknowledgment:
Bob Schijvenaars, Antonio Gulli, Darin McBeath, Daul Trevor,
and Elsevier members in BOS/Labs/Scopus teams

Contenu connexe

Tendances

Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Databricks
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Download It
Download ItDownload It
Download Itbutest
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)Spark Summit
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 

Tendances (20)

Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Download It
Download ItDownload It
Download It
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 

En vedette

The Next AMPLab: Real-Time, Intelligent, and Secure Computing
The Next AMPLab: Real-Time, Intelligent, and Secure ComputingThe Next AMPLab: Real-Time, Intelligent, and Secure Computing
The Next AMPLab: Real-Time, Intelligent, and Secure ComputingSpark Summit
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit
 
The Spark (R)evolution in The Netherlands
The Spark (R)evolution in The NetherlandsThe Spark (R)evolution in The Netherlands
The Spark (R)evolution in The NetherlandsSpark Summit
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
Democratizing AI with Apache Spark
Democratizing AI with Apache SparkDemocratizing AI with Apache Spark
Democratizing AI with Apache SparkSpark Summit
 
Spark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish FaentonSpark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish FaentonSpark Summit
 
Spark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc BourlierSpark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc BourlierSpark Summit
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
MmmooOgle: From Big Data to Decisions for Dairy Cows
MmmooOgle: From Big Data to Decisions for Dairy CowsMmmooOgle: From Big Data to Decisions for Dairy Cows
MmmooOgle: From Big Data to Decisions for Dairy CowsSpark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Case study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANCase study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANDataWorks Summit/Hadoop Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 

En vedette (20)

The Next AMPLab: Real-Time, Intelligent, and Secure Computing
The Next AMPLab: Real-Time, Intelligent, and Secure ComputingThe Next AMPLab: Real-Time, Intelligent, and Secure Computing
The Next AMPLab: Real-Time, Intelligent, and Secure Computing
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
 
The Spark (R)evolution in The Netherlands
The Spark (R)evolution in The NetherlandsThe Spark (R)evolution in The Netherlands
The Spark (R)evolution in The Netherlands
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Democratizing AI with Apache Spark
Democratizing AI with Apache SparkDemocratizing AI with Apache Spark
Democratizing AI with Apache Spark
 
Spark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish FaentonSpark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish Faenton
 
Spark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc BourlierSpark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc Bourlier
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
MmmooOgle: From Big Data to Decisions for Dairy Cows
MmmooOgle: From Big Data to Decisions for Dairy CowsMmmooOgle: From Big Data to Decisions for Dairy Cows
MmmooOgle: From Big Data to Decisions for Dairy Cows
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Combined analysis of Watson and Spark
Combined analysis of Watson and SparkCombined analysis of Watson and Spark
Combined analysis of Watson and Spark
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
Case study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANCase study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPAN
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 

Similaire à Spark Summit EU talk by Reza Karimi

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014James Powell
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsJiaheng Lu
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Linked Data Experiences at Springer Nature
Linked Data Experiences at Springer NatureLinked Data Experiences at Springer Nature
Linked Data Experiences at Springer NatureMichele Pasin
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Arches Getty Brownbag Talk
Arches Getty Brownbag TalkArches Getty Brownbag Talk
Arches Getty Brownbag Talkbenosteen
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with SparkKhalid Salama
 
Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Zhang Bo
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.Shyjal Raazi
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC council
 
CS6502 OOAD - Question Bank and Answer
CS6502 OOAD - Question Bank and AnswerCS6502 OOAD - Question Bank and Answer
CS6502 OOAD - Question Bank and AnswerGobinath Subramaniam
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesTony Hammond
 
01 nosql and multi model database
01   nosql and multi model database01   nosql and multi model database
01 nosql and multi model databaseMahdi Atawneh
 
Graph Database and Neo4j
Graph Database and Neo4jGraph Database and Neo4j
Graph Database and Neo4jSina Khorami
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 

Similaire à Spark Summit EU talk by Reza Karimi (20)

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Linked Data Experiences at Springer Nature
Linked Data Experiences at Springer NatureLinked Data Experiences at Springer Nature
Linked Data Experiences at Springer Nature
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Arches Getty Brownbag Talk
Arches Getty Brownbag TalkArches Getty Brownbag Talk
Arches Getty Brownbag Talk
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)Data massage! databases scaled from one to one million nodes (ulf wendel)
Data massage! databases scaled from one to one million nodes (ulf wendel)
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
CS6502 OOAD - Question Bank and Answer
CS6502 OOAD - Question Bank and AnswerCS6502 OOAD - Question Bank and Answer
CS6502 OOAD - Question Bank and Answer
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
01 nosql and multi model database
01   nosql and multi model database01   nosql and multi model database
01 nosql and multi model database
 
Graph Database and Neo4j
Graph Database and Neo4jGraph Database and Neo4j
Graph Database and Neo4j
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Dernier

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 

Dernier (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 

Spark Summit EU talk by Reza Karimi

  • 1. SPARK SUMMIT EUROPE2016 MENTOR AND MENTEE RELATIONS BASED ON AUTHORSHIP GRAPHS Reza Karimi Elsevier
  • 2. Outline • Introduction: – Elsevier and Scopus Data • Mentorship Model • Training and Validation • (Big) Graph Visualization (in Spark)
  • 3. Who are we? • Elsevier: the #1 provider of scientific, technical, and medical information, and a technology company originally established in 1880 in Netherlands. Now part of RELX group. • Privileged to publish about 25% of cited (what matters) scientific publications • Publishing 400K articles per year in about 2500 journals such as Lancet, Cell, Trends, Current Opinions, Artificial Intelligence (generally accessed via ScienceDirect platform (about 900m full text download per year) • Curating Scientific publication data in products such as Scopus, SciVal, Pure
  • 4. Full Text Access in ScienceDirect
  • 5. Scopus is the largest abstract and citation database of peer-reviewed literature, and features smart tools that allow you to track, analyze and visualize scholarly research. What is Scopus?
  • 6. Scopus includes content from more than 5,000 publishers and 105 different countries 21,568 peer-reviewed journals 361 trade journals • Full metadata, abstracts and cited references (ref’s post-1995 only) • Funding data from acknowledgements • Citations back to 1970 Physical Sciences 7,443 Health Sciences 6,795 Social Sciences 8,086 Life Sciences 4,492 90K conference events 7.3M conference papers Mainly Engineering and Computer Sciences 531 book series 30K Volumes / 1.2M items 119,882 stand- alone books 974K items Focus on Social Sciences and A&H 65M records from 22K serials, 90K conferences and 120K books • Updated daily • Records back to 1823 • “Articles in Press” from > 3,750 titles • 40 different languages covered • 3,715 active Gold Open Access journals indexed BOOKSCONFERENCES Source: November 2015 title list at https://www.elsevier.com/solutions/scopus/content * Available late 2016 JOURNALS 27M patents From 5 major patent offices - WIPO - EPO - USPTO - JPO - UK IPO PATENTS*
  • 7. Scopus Data Model By employing the state of the art disambiguation and deduplication algorithms, entities such as authors, institutes and cited document are disambiguated. This enables us to to analyze trends and to track researchers and institutes. author affiliation Article (65M) Author (12M) Affiliation (10M)
  • 8. Author Disambiguation: AI algorithms enhanced with multi-level feedbacks Scopus use a combination of automated and curated data to automatically build robust author profiles, which power the Elsevier Research Intelligence portfolio. Scopus author profile Scopus.com: Ongoing algorithmic updates Manual feedback via Scopus.com Author Feedback Wizard Pure Profile Refinement Service Scopus2SciVal manual & automated updates Manual & Automated feedback via ORCiD One-off updates via consortia/ governmental agencies
  • 9. Who uses Scopus Data? Examples
  • 10. Co-authorship Graphs • Based on disambiguated records, we can track all the publication, co- authors, affiliation, corresponding authors, citing authors, cited authors … for each author • Authors will be modeled as vertices and each document co-published represents an edge • Current Analysis is based on: – 65.2M documents (208M Authorships) – 33.8M Authors (9.1M published within Elsevier) – 4.6B co-authorships based on 382M unique author pairs – 123M unique correspondence ordered author pairs Author i Author j Author k co-authorship graph
  • 11. Problem Statement and Modeling • Problem – For each author, find the most likely mentor among co-authors (explicit clues exist) • Solution – most likely: ranking – being mentor: classification • CLASSIFICATION INPUT: – Approximate into a pairwise model as opposed to a model based on full-set – Approximate with an aggregated graph • TRAIN THE CLASSIFIER: Even for a simple pairwise model we need lots of training data to account for variations in different fields, years, countries,… – Start from simple heuristic common-sense models (cannot overfit) and improve it gradually » Can predict accurately based on given clues/features, but cannot properly balance the interaction and proper weight for each clue – When satisfied, verify the early predictions via crowd sourcing and collect precise data for ML training Heuristic Mode I Heuristic Mode II (baseline) Email Campaign (Crowd Sourcing Training Data) ML Model I ML Model II (Extra Features)
  • 12. Spark Pipeline in AWS: ETL and Aggregate feature tables Graph Data analytical format temporary format for compute optimized nodes raw data in S3 xml files Selected fields into json sorted and repartitioned into columnar parquet Node Aggregation N Feature I N Feature II N Feature ... Edge Aggregation E Feature A E Feature B E Feature ... • Feature are primarily built on dual summary statistics (around nodes for nodal features or around all edges between two nodes for edge features) • Normalizing data is superbly important. For an article with n authors, following weights are adopted: – 1/n : authorship – 1/(n*(n-1)) : co-authorship – 1/(n-1) : for corresponding authorship • Non-numeric features are aggregated primarily into – most common value (and its frequency) for nodes – overlap-share/cosine-similarity for edges
  • 13. Mixing co-authorship and corresponding authorship Author i Author j Author k corresponding authorship graph Author i Author j Author k Final graph co-authorship graph Author i Author j Author k cor-i->j
  • 14. Feature Examples • Number/weighted-sum of – authorships – co-authorships – corresponding authorships • First/Last/AVG year • Publication Milestone Years (to simplify time-series): – year reaching 3,5,10 publications (robustness against disambiguation and variations in academic fields) • PageRank of Author in correspondence graph or citation graphs • Number of co-authors/corresponding authors • HIndex, Citation metrics • Continent/Country, current affiliation(s) • Email Domain(s) • AVG author position in authorship sequences (from 0-1) • Node Content representation by: – Journal frequency vector – Subject-area frequency vector
  • 15. 1. Apply mentorship to each edge 2. Rank co-authors based on their mentorship score and pick up the best score as the candidate mentor 3. Create the mentorship graph (batch process): 1. Look for loops of length L (L=2,3,4,…) 2. Break loops by the weakest link and replace that link with the next mentor candidate 3. Go back to step 2 and if no loop found increase L 4. Create academic family-trees (for select authors): can locally break loops of bigger lengths (such as L=7) graph=GraphFrame(vertices, edges) paths =graph.find("(a1)-[e]->(a2)") paths.rdd.map(mentorship).toDF().write.format('parquet').mode('overwrite').saveAsTable(…) Applying Pairwise Model
  • 16. Validations via Crowd Sourcing • Emails sent to opt-in (Mendeley/Scopus) User to evaluate our predictions and collect data for supervised ML model • A/B/C testing for Email templates to optimize open/click rate • Clicks lead to a submit page to avoid random clicks Email Clicked Submit Page Thank you email
  • 17. • Funnel: – 10000 randomly selected recipients – 4315 recipients opened their email – 1742 recipients clicked on provided choices – 413 recipients submitted a choice • Click vs Submit – 23% of those who clicked, submitted – 94% of submission and click choices were identical, the rest is primarily a switch to a more conservative choice Prediction Accuracy
  • 18. • Academic family-trees are important: – Recommendations such as article/people in Mendeley – Be aware of conflict of interests: for example in reviewer selection or funding panelist • Special features of academic family-trees: – Low connectivity (every one has only (one) mentor) – A structure growing with time like snow crystals • Given the simplicity of the mentorship graph when it is cached, Spark can act as a back-end for instantaneous subgraph creation or other non-batch analysis Academic Family-Tree Academic progeny of Prof. … in MIT (intentionally blurred screenshot to minimize personal impact on names displayed)
  • 19. • To visualize (big) graphs in Spark application we need two components: 1. Sub-Graph creation: it would not be possible to visualize all edge/nodes, especially for highly connected graphs such as: 1. co-authorships 2. journal to journal citation 3. institute to institute citation To select a limited set of node/edges, filter or GraphFrame queries can help. They work great, if the sub-graph can be obtained by some global filters (such as n-top strongest edges). However, we had to make substantial development to cover sub-graphs centered around a given node. The local sub-graph creation was extended by our library via customized development in line with data-frame operations. 2. Visualization library: Adopted D3.js and displayHTML to visualize (Edges, Vertices) Dataframes interactively. The library recognizes following elements: – Directed Graphs (such as citation) including self referral edges or non-directed graphs (such as co-authorship) displayed with or without arrows – size, colors based on continuous/categorical data, line type for edges and nodes – Line types (Big) Graph Visualization (by Spark)
  • 20. Visualization Example I: Global filter Top 250 Global non-self institute citations
  • 21. Visualization Example II: Top Institute Citations in Computer Science
  • 22. Visualization Example III: lsub-graph of top citations around MIT Top 50 citations to/from MIT Top 150 citations for MIT or top MIT neighbors (distance_max=2)
  • 23. SPARK SUMMIT EUROPE2016 THANK YOU. Feel free to reach me for further information or if interested to join our team: r.karimi@elsevier.com Acknowledgment: Bob Schijvenaars, Antonio Gulli, Darin McBeath, Daul Trevor, and Elsevier members in BOS/Labs/Scopus teams