UCREL Summer School |
Big Data NLP
Presented by: Daniel Kershaw
Date: 27/06/2017

About
Daniel Kershaw
Recommender System
Senior Data Scientist
@danjamker
www.danjamker.com

Outline
• Part 1 – 30 minutes
  • Big Data (what is it?)
  • MapReduce
  • Spark
  • Document similarity
• Part 2 – 1 hour
  • Downloading Zeppelin on Docker
  • Read document set, extract data with
  • Tokenize
  • Implement document similarity
  • Cosine similarity between documents

First
Set up Docker:
sudo docker pull epahomov/docker-zeppelin
Download the Zeppelin notebook:
https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0

Part 1 - Big Data and NLP
Presented by: Daniel Kershaw
Date: 20th June 2017

"640K ought to be enough for anyone"
Attributed to Bill Gates, Microsoft, 1981

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days”
Eric Schmidt, Google, 2010

How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year

Google Big Data Trend

What is Big Data
"Too big to fit in an Excel spreadsheet"
Professor Steven Weber, UC Berkeley School of Information

"Big data means data that cannot fit easily into a standard relational database"
Hal Varian, Chief Economist, Google

"The term ‘Big Data’ applies to information that can't be processed or analysed using traditional processes or tools"
Professor Steven Weber, UC Berkeley School of Information

The Big V’s
• Volume
• Velocity
• Variety
• Exhaustive
• Veracity
• Relational & Indexical
• Relational
• Flexible

Examples of Big / Large Data (NLP)
• Wikipedia
• Hansard
• Enron Email Corpus
• Reddit Data Release
• Twitter Data Set

• Science Direct Corpus
• Mendeley User Catalogs
• Engineering Village
• User interaction logs
• Funding data
• EVISE

Scaling up Computation
• Servers
  • CPUs (Xeon)
  • RAM (32 GB)
  • Disks (2 x 1 TB)
• Rack
  • 40-80 servers
  • Networked together
  • UPS (power supply)

Google Data Center image

Programming at Scale
• How do we split work across nodes?
• Network and data locality
• How do we deal with failures?
  • If 1 server fails every 3 years, 10k nodes would see about 10 failures a day
• How do we deal with slow machines?
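The failure estimate above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch that assumes failures are independent and spread evenly over time; the function name is illustrative, and only the two input figures come from the slide.

```python
# Back-of-envelope failure rate for a cluster, using the slide's figures.
YEARS_BETWEEN_FAILURES = 3   # one failure per server every 3 years (from the slide)
NODES = 10_000               # cluster size (from the slide)
DAYS_PER_YEAR = 365


def expected_failures_per_day(nodes, years_between_failures):
    """Expected failures per day, assuming independent, evenly spread failures."""
    return nodes / (years_between_failures * DAYS_PER_YEAR)


failures = expected_failures_per_day(NODES, YEARS_BETWEEN_FAILURES)
print(f"{failures:.1f} failures per day")
```

Under these assumptions the result is a little over nine failures a day, which the slide rounds to "about 10".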

Hadoop
Google File System paper published 2003
Google MapReduce paper published 2004

MapReduce
Mapper
Reducer

MapReduce - Mapper
Takes a series of <key, value> tuples
Processes each tuple
Outputs 0 or more <key, value> tuples

MapReduce - Reducer
Called once for each unique key, with its list of values: <key, [value]>
Iterates through each value
Outputs 0 or more results as <key, value>

Example Code – Word Count
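The code shown on this slide did not survive extraction. As a stand-in, here is a minimal single-process sketch of word count in the mapper/reducer style described earlier. It is plain Python, not the Hadoop API; `mapper`, `reducer`, and `run_job` are illustrative names, and the grouping step inside `run_job` plays the role of the framework's shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Emit <word, 1> for every word in one input record."""
    for word in value.lower().split():
        yield word, 1


def reducer(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Called once per unique word with all of its counts."""
    yield key, sum(values)


def run_job(records: List[Tuple[str, str]]) -> Dict[str, int]:
    # Map phase, followed by a shuffle that groups values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per unique key.
    results: Dict[str, int] = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


documents = [("doc1", "apple mac computer"), ("doc2", "apple computer")]
counts = run_job(documents)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the control flow is otherwise the same.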

MapReduce - Overview

Issues with Hadoop - Complexity
• Applications need more than one step
  • Google pipeline was 22 steps
  • Analytic queries, e.g. k-means: 2-5 steps
  • Iterative queries, e.g. PageRank: 10-20 steps
• Problems with performance and ease of development

Issues with Hadoop - Usability
• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly

Issues with Hadoop - Performance
• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Code must be hand-optimized to combine steps

Big Data Processing

Spark

Spark Model
• Resilient distributed datasets (RDDs)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • Count, reduce, collect, save, …
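The split between transformations and actions can be illustrated without Spark. The sketch below is a plain-Python analogue, not the Spark API: `LazyDataset` is a hypothetical toy class in which `map` and `filter` only record what to do, while `count` and `collect` trigger the work, and `cache` keeps the result for later actions.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe, not the data."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterable
        self._use_cache = False
        self._cached = None       # filled by the first action if cache() was called

    def _items(self):
        if self._cached is not None:
            return self._cached
        items = list(self._compute())
        if self._use_cache:
            self._cached = items
        return items

    # Transformations: return a new dataset, run nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._items()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._items() if pred(x)))

    def cache(self):
        self._use_cache = True
        return self

    # Actions: force the computation.
    def count(self):
        return len(self._items())

    def collect(self):
        return list(self._items())


reads = []  # records each time the "stable storage" is actually read


def load():
    reads.append(1)
    return ["Apple Mac", "big data", "Apple computer"]


lines = LazyDataset(load)
apple_lines = lines.filter(lambda s: "Apple" in s).map(str.lower).cache()
reads_before_action = len(reads)   # still 0: transformations are lazy
n = apple_lines.count()            # first action reads the source
result = apple_lines.collect()     # second action reuses the cached result
print(n, result, len(reads))
```

The second action does not touch the source again, which is the data-reuse benefit the slide contrasts with writing to HDFS between Hadoop jobs.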

Spark vs Hadoop – Data Sharing (diagram panels: Spark, Hadoop)
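
SparkML
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Training data: an RDD of Vector (the slide leaves its source unspecified)
val train_data: RDD[Vector] = ???
// KMeans.train also needs maxIterations; 20 is an assumed value, not from the slide
val model = KMeans.train(train_data, k = 10, maxIterations = 20)

// Evaluate the model
val test_data: RDD[Vector] = ??? // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)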

Spark DataFrames
• Interact with data like a table
• Built-in functions to:
  • Tokenize
  • Remove stop words
  • Apply TF-IDF transformation
Example columns: Name | Age | Gender | Abstract

Title | abstract | keywords | ASJC
Title | abstract | keywords | ASJC | Title_tok
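The three text-processing steps named on the DataFrames slide (tokenize, remove stop words, TF-IDF) can be sketched in plain Python to show what each one computes. This is not the Spark DataFrame API: the stop-word list, the regular expression, and the smoothed IDF formula log((N + 1) / (df + 1)) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and"}  # tiny illustrative list


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs_tokens):
    """Return one {token: weight} dict per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))   # count each token once per document
    weights = []
    for tokens in docs_tokens:
        term_freq = Counter(tokens)
        weights.append({
            t: term_freq[t] * math.log((n_docs + 1) / (doc_freq[t] + 1))
            for t in term_freq
        })
    return weights


abstracts = ["The apple is a computer", "The mac is a computer", "Big data and NLP"]
tokens_per_doc = [remove_stop_words(tokenize(a)) for a in abstracts]
weights = tf_idf(tokens_per_doc)
print(tokens_per_doc[0], weights[0])
```

A token that appears in fewer documents ("apple") gets a higher weight than one shared across documents ("computer").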

Part 2 – Document Similarity
Technical Workshop
Presented by: Daniel Kershaw
Date: 29th June 2017

Outline
• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stop words
• Read word vectors

Install Apache Zeppelin
• Pull the Docker image
  • docker pull epahomov/docker-zeppelin
• Run the Docker image
  • docker run -d -p 8080:8080 -p 7077:7077 -p 4040:4040 epahomov/docker-zeppelin
• Go to
  • localhost:8080

Document Embedding Similarity
Word represented as dense vector:
Apple      [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
Document represented as sum (mean) of dense vectors:
  Apple    [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Mac      [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Computer [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
= Document [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
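A small sketch of the "sum (mean) of dense vectors" idea follows. The three-dimensional vectors are made-up values, not real embeddings, and the helper names are illustrative. It computes both the sum and the mean; the mean is the sum scaled by 1/n, so both point in the same direction and give the same cosine similarity.

```python
def sum_vectors(vectors):
    """Component-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def mean_vectors(vectors):
    """Component-wise mean; raises on an empty list."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [total / len(vectors) for total in sum_vectors(vectors)]


# Hypothetical word vectors, chosen so the arithmetic is easy to follow.
word_vectors = {
    "apple": [1.0, 2.0, 3.0],
    "mac": [2.0, 2.0, 2.0],
    "computer": [3.0, 2.0, 1.0],
}

document = ["apple", "mac", "computer"]
vectors = [word_vectors[word] for word in document]
doc_sum = sum_vectors(vectors)
doc_mean = mean_vectors(vectors)
print(doc_sum, doc_mean)
```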

Download Spark Dependencies

Download Sample Science Direct Corpus

Science Direct Open Access Corpus
Contains all content seen on the SD frontend
Available on GitHub
Extract PII (document ID)
Extract abstract
Use Elsevier open-source XML parser
Extract fields with XPath and XQuery
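To illustrate XPath-based field extraction, here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset (and no XQuery). It does not use the Elsevier parser mentioned above. The XML layout, the tag names `pii`, `abstract`, and `para`, and the PII value are hypothetical placeholders, not the real Science Direct schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout, for illustration only.
SAMPLE_XML = """
<article>
  <meta><pii>S0000000000000000</pii></meta>
  <abstract>
    <para>Big data for NLP.</para>
    <para>A second paragraph.</para>
  </abstract>
</article>
"""


def extract_fields(xml_text):
    """Return the document ID and abstract text from one XML document."""
    root = ET.fromstring(xml_text)
    pii = root.findtext(".//pii")   # first <pii> anywhere below the root
    paragraphs = [
        p.text.strip() for p in root.findall(".//abstract/para") if p.text
    ]
    return {"pii": pii, "abstract": " ".join(paragraphs)}


fields = extract_fields(SAMPLE_XML)
print(fields)
```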

Read Documents

Extract Title and Document Abstract

Tokenize and Remove Stop Words

Download Word Vectors

Load Word Vectors
word      vector
apple     [0.2,0.4,0.8]
computer  [0.2,0.4,0.8]
mac       [0.2,0.4,0.8]
Google    [0.2,0.4,0.8]

Explode the tokens
Doc ID  Tokens
1       [apple, computer, mac]
2       [apple, computer, mac]
3       [apple, computer, mac]
4       [apple, computer, mac]
5       [apple, computer, mac]

Doc ID  Tokens
1       apple
1       computer
1       mac
2       apple
2       computer
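The explode step turns one row per document into one row per (document, token) pair. A minimal plain-Python sketch follows; `explode` here is an illustrative helper, not the Spark function of the same name.

```python
def explode(rows):
    """Turn (doc_id, [tokens]) rows into one (doc_id, token) row per token."""
    return [(doc_id, token) for doc_id, tokens in rows for token in tokens]


docs = [
    (1, ["apple", "computer", "mac"]),
    (2, ["apple", "computer"]),
]
exploded = explode(docs)
for row in exploded:
    print(row)
```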

Join on words
Doc ID  word
1       apple
1       computer
1       mac
2       apple
2       computer

word      vector
apple     [0.2,0.4,0.8]
computer  [0.2,0.4,0.8]
mac       [0.2,0.4,0.8]
Google    [0.2,0.4,0.8]
this      [0.2,0.4,0.8]

Doc ID  word      vector
1       apple     [0.2,0.4,0.8]
1       computer  [0.2,0.4,0.8]
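The join attaches a vector to every exploded (document, word) row. The sketch below assumes an inner join, so words with no vector are dropped and vectors for words that never occur are ignored; that choice, the helper name, and the sample values are assumptions for illustration.

```python
def join_on_word(exploded_rows, word_vectors):
    """Inner join of (doc_id, word) rows with a {word: vector} lookup."""
    return [
        (doc_id, word, word_vectors[word])
        for doc_id, word in exploded_rows
        if word in word_vectors
    ]


exploded_rows = [(1, "apple"), (1, "computer"), (1, "zeppelin"), (2, "apple")]
vectors = {
    "apple": [0.2, 0.4, 0.8],
    "computer": [0.1, 0.3, 0.5],
    "google": [0.9, 0.1, 0.2],   # never used: no document contains it
}
joined = join_on_word(exploded_rows, vectors)
for row in joined:
    print(row)
```

In this example the row for "zeppelin" disappears because it has no vector, which is one way out-of-vocabulary words can be handled.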

Group by document ID, average the vectors
Doc ID  word      vector
1       apple     [0.2,0.4,0.8]
1       computer  [0.2,0.4,0.8]

Doc ID  vector
1       [0.2,0.4,0.8]
2       [0.2,0.4,0.8]
3       [0.2,0.4,0.8]
4       [0.2,0.4,0.8]
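Grouping the joined rows by document ID and averaging gives one vector per document. A plain-Python sketch, with made-up two-dimensional vectors and an illustrative helper name:

```python
from collections import defaultdict


def mean_by_document(joined_rows):
    """Group (doc_id, word, vector) rows by doc_id and average the vectors."""
    grouped = defaultdict(list)
    for doc_id, _word, vector in joined_rows:
        grouped[doc_id].append(vector)
    return {
        doc_id: [sum(components) / len(vecs) for components in zip(*vecs)]
        for doc_id, vecs in grouped.items()
    }


joined_rows = [
    (1, "apple", [0.0, 1.0]),
    (1, "computer", [1.0, 0.0]),
    (2, "mac", [0.5, 0.5]),
]
doc_vectors = mean_by_document(joined_rows)
print(doc_vectors)
```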

Join word vectors to document

Identify similar documents
• Cartesian join of documents
• Compute cosine similarity between each pair of documents

Join to self (the table below is joined with a copy of itself)
Doc ID  vector
1       [0.2,0.4,0.8]
2       [0.2,0.4,0.8]
3       [0.2,0.4,0.8]
4       [0.2,0.4,0.8]
5       [0.2,0.4,0.8]

     1    2    3
1    0.4  0.6  0.6
2    0.5  0.4  0.7
3    0.6  0.1  0.3
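The self-join and cosine similarity steps can be sketched in plain Python. The document vectors are made-up values, and `itertools.product` stands in for the Cartesian join; in Spark this would be a join of the DataFrame with itself. Treating a zero vector as having similarity 0.0 is an assumption of this sketch.

```python
import math
from itertools import product


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document vectors.
doc_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}

# Cartesian join of the documents with themselves: every (i, j) pair.
similarities = {
    (i, j): cosine_similarity(doc_vectors[i], doc_vectors[j])
    for i, j in product(doc_vectors, repeat=2)
}

for (i, j), score in sorted(similarities.items()):
    print(i, j, round(score, 3))
```

The number of pairs grows with the square of the number of documents, which is why this step benefits from a distributed engine on a large corpus.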

Thank you
Any questions?

Lancaster UCREL Summer School 2017 - Big Data and NLP
