Lancaster UCREL Summer School 2017 - Big Data and NLP

UCREL Summer School
Big Data NLP
Presented by: Daniel Kershaw
Date: 27/06/2017

About
Daniel Kershaw
Recommender System
Senior Data Scientist
@danjamker
www.danjamker.com

Outline
• Part 1 – 30 minutes
  • Big Data (what is it?)
  • Map Reduce
  • Spark
  • Document similarity
• Part 2 – 1 hour
  • Downloading Zeppelin with Docker
  • Read document set and extract data
  • Tokenize
  • Implement document similarity
  • Cosine similarity between documents

First
Set up Docker:
sudo docker pull epahomov/docker-zeppelin
Download Zeppelin Notebook:
https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0

Part 1 - Big Data and NLP
Presented by: Daniel Kershaw
Date: 20th June 2017

“640K ought to be enough for anyone”
Attributed to Bill Gates, Microsoft, 1981

“There were 5 exabytes of
information created between the
dawn of civilization through 2003,
but that much information is now
created every 2 days”
Eric Schmidt, Google, 2010

How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year

Google Big Data Trend

What is Big Data?
“Too big to fit in an Excel spreadsheet”
Professor Steven Weber, UC Berkeley School of Information

What is Big Data?
“Big data means data that cannot fit easily into a standard relational database”
Hal Varian, Chief Economist, Google

What is Big Data?
“The term ‘Big Data’ applies to information that can't be processed or analysed using traditional processes or tools”
Professor Steven Weber, UC Berkeley School of Information

The Big V’s
Volume
Velocity
Variety
Exhaustive
Veracity
Relational & Indexical
Relational
Flexible

Examples of Big / Large Data (NLP)
Wikipedia
Hansard
Enron Email Corpus
Reddit Data Release
Twitter Data Set
Science Direct Corpus
Mendeley User Catalogs
Engineering Village
User interaction logs
Funding data
EVISE

Scaling up Computation
Servers
• CPUs (Xeon)
• RAM (32 GB)
• Disks (2 x 1 TB)
Rack
• 40 - 80 servers
• Networked together
• UPS (uninterruptible power supply)

Google Data Center image

Programming at Scale
• How do we split work across nodes?
• Network and data locality
• How do we deal with failures?
  • If each server fails once every 3 years, 10k nodes would see about 10 failures a day
• How do we deal with slow machines?
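The failure estimate is simple arithmetic. A minimal sketch, assuming failures are spread evenly over time and a year has 365 days (both simplifications; the function name is only illustrative):

```python
def expected_failures_per_day(num_servers: int, years_between_failures: float) -> float:
    """Expected server failures per day, assuming each server fails once
    every `years_between_failures` years and failures are spread evenly."""
    days_between_failures = years_between_failures * 365
    return num_servers / days_between_failures


# Figures from the slide: 10k nodes, one failure per server every 3 years.
rate = expected_failures_per_day(10_000, 3)
print(f"{rate:.1f} failures per day")
```

Under these assumptions the result is a little over 9 failures a day, which the slide rounds to about 10.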

Hadoop
Google MapReduce paper published 2004
Google File System paper published 2003

Mapper
Reducer
Map Reduce

Map Reduce - Mapper
Takes a series of <key, value> tuples
Processes each tuple
Outputs 0 or more <key, value> tuples

Map Reduce - Reducer
Called once for each unique key, as <key, [values]>
Iterates through each value
Outputs 0 or more results as <key, value>

Example Code – Word Count
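A minimal single-process sketch of word count in the mapper/reducer style described on the previous slides. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework's shuffle step, and the two input lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map step: take one <line number, line text> tuple, emit <word, 1> per word."""
    for word in value.lower().split():
        yield word, 1


def reducer(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce step: called once per unique word with all of its counts."""
    yield key, sum(values)


def run_job(lines: List[str]) -> Dict[str, int]:
    """Stand-in for the framework: map every line, group by key, then reduce."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word, count in mapper(line_no, line):
            shuffled[word].append(count)
    result: Dict[str, int] = {}
    for word, counts in shuffled.items():
        for out_key, out_value in reducer(word, counts):
            result[out_key] = out_value
    return result


counts = run_job(["big data and nlp", "big data at scale"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes holding the data, and the grouping by key happens over the network between the two phases.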

Map Reduce

MapReduce - Overview

Issues with Hadoop - Complexity
• Applications need more than one step
  • Google pipeline was 22 steps
  • Analytic queries, e.g. k-means: 2-5 steps
  • Iterative queries, e.g. PageRank: 10-20 steps
• Problems with performance and ease of development

Issues with Hadoop - Usability
• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly

Issues with Hadoop - Performance
• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Code must be hand-optimized to combine steps

Big Data Processing

Spark Model
• Resilient distributed datasets (RDD)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • Count, reduce, collect, save, …
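The split between transformations and actions can be illustrated without Spark. The sketch below is a toy, single-machine model and not the Spark API: `ToyRDD` is a hypothetical class, its partitions are plain tuples, and caching only helps repeated actions on the same object.

```python
import functools
from typing import Any, Callable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable partitions plus a list of
    pending transformations that only run when an action is called."""

    def __init__(self, partitions, steps=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._steps = tuple(steps)  # recorded, not yet executed
        self._use_cache = False
        self._cached = None

    # Transformations: return a new ToyRDD, do no work.
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(self._partitions, self._steps + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(self._partitions, self._steps + (("filter", predicate),))

    def cache(self) -> "ToyRDD":
        self._use_cache = True
        return self

    def _compute(self) -> List[List[Any]]:
        if self._cached is not None:
            return self._cached
        result = []
        for partition in self._partitions:  # each partition is processed independently
            items = list(partition)
            for kind, fn in self._steps:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            result.append(items)
        if self._use_cache:
            self._cached = result
        return result

    # Actions: trigger the computation.
    def collect(self) -> List[Any]:
        return [x for part in self._compute() for x in part]

    def count(self) -> int:
        return sum(len(part) for part in self._compute())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return functools.reduce(fn, self.collect())


calls = []


def square(x):
    calls.append(x)  # record every invocation to show when work happens
    return x * x


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0).cache()
calls_before_action = len(calls)  # transformations alone ran nothing

total = even_squares.reduce(lambda a, b: a + b)  # first action computes and caches
n = even_squares.count()  # second action reuses the cached partitions
print(total, n, len(calls))
```

Because the result is cached after the first action, `square` is called once per element even though two actions are run.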

Spark vs Hadoop – Data Sharing
Spark
Hadoop

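SparkML
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val trainData: RDD[Vector] = ??? // RDD of Vector (placeholder: supply your training vectors)
// maxIterations = 20 is an assumed value; the slide gives only k
val model = KMeans.train(trainData, k = 10, maxIterations = 20)
// evaluate the model
val testData: RDD[Vector] = ??? // RDD of Vector (placeholder)
testData.map(t => model.predict(t)).collect().foreach(println)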

Spark Dataframes
• Interact with data like a table
• Built-in functions to:
  • Tokenize
  • Remove stop words
  • Apply TF-IDF transformation
Example columns: Name, Age, Gender, Abstract
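To make the TF-IDF step concrete, here is a plain-Python sketch that does not use Spark. It assumes raw term counts for TF and one common smoothed form of IDF, log((N + 1) / (df + 1)); the function name and the two toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency is the raw count of the term within the document.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


docs = [["apple", "mac", "computer"], ["apple", "fruit", "apple"]]
weights = tf_idf(docs)
print(weights)
```

A term that appears in every document, such as "apple" here, receives a weight of zero, while terms specific to one document are weighted up.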

Title abstract keywords ASJC
Title abstract keywords ASJC Title_tok

Part 2 – Document Similarity
Technical Workshop
Presented by: Daniel Kershaw
Date: 29th June 2017

Outline
• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stop words
• Read word vectors

Install Apache Zeppelin
• Pull the Docker image
  • docker pull epahomov/docker-zeppelin
• Run the Docker image
  • docker run -d -p 8080:8080 -p 7077:7077 -p 4040:4040 epahomov/docker-zeppelin
• Go to
  • localhost:8080

Document Embedding Similarity
Word represented as dense vector:
Apple [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
Document represented as sum (mean) of dense vectors:
  Apple    [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Mac      [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Computer [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
= Document [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
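A minimal sketch of the averaging idea, assuming the mean (rather than the sum) is used and that words without a vector are skipped. The three-dimensional vectors and the function name are made up for illustration.

```python
from typing import Dict, List


def document_vector(tokens: List[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Mean of the vectors of all tokens that have a vector; [] if none do."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return []
    # zip(*vectors) walks the vectors dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy vectors, not real embeddings.
word_vectors = {
    "apple": [1.0, 0.0, 2.0],
    "mac": [0.0, 1.0, 2.0],
    "computer": [2.0, 2.0, 2.0],
}
doc = document_vector(["apple", "mac", "computer", "unknown"], word_vectors)
print(doc)
```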

Download Spark Dependencies

Download Sample Science Direct Corpus

Science Direct Open Access Corpus
Contains all content seen on the SD frontend
Available on GitHub
Extract PII (document ID)
Extract abstract
Use Elsevier open-source XML parser
Extract fields with XPath & XQuery

Read Documents

Extract Title and Document Abstract

Tokenize and Remove Stop words
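A plain-Python sketch of this step, independent of Spark. The regular expression and the tiny stop-word list are assumptions chosen for illustration; a real run would use a fuller list.

```python
import re
from typing import FrozenSet, List

# Tiny illustrative stop-word list, not a standard one.
STOP_WORDS: FrozenSet[str] = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: List[str], stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


tokens = remove_stop_words(tokenize("The Apple Mac is a computer."))
print(tokens)
```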

Download Word Vectors

Load Word Vectors
word      vector
apple     [0.2,0.4,0.8]
computer  [0.2,0.4,0.8]
mac       [0.2,0.4,0.8]
Google    [0.2,0.4,0.8]
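The slides do not state the file format of the word vectors. The sketch below assumes a plain-text layout with one word per line followed by its space-separated values, which is a common convention but an assumption here. The function name and sample lines are illustrative.

```python
from typing import Dict, Iterable, List


def parse_word_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the assumed form 'word v1 v2 ... vn' into a lookup table."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


sample_lines = ["apple 0.2 0.4 0.8", "computer 0.2 0.4 0.8", ""]
word_vectors = parse_word_vectors(sample_lines)
print(word_vectors)
```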

Explode the tokens
Before:
Doc ID  Tokens
1       [apple, computer, mac]
5       [apple, computer, mac]
After:
Doc ID  Tokens
1       apple
1       computer
1       mac
2       apple
2       computer

Join on words
Tokens:
Doc ID  word
1       apple
1       computer
1       mac
2       apple
2       computer
Word vectors:
word      vector
apple     [0.2,0.4,0.8]
computer  [0.2,0.4,0.8]
mac       [0.2,0.4,0.8]
Google    [0.2,0.4,0.8]
this      [0.2,0.4,0.8]
Result:
Doc ID  word      vector
1       apple     [0.2,0.4,0.8]
1       computer  [0.2,0.4,0.8]

Group by document ID, mean the vectors
Before:
Doc ID  word      vector
1       apple     [0.2,0.4,0.8]
1       computer  [0.2,0.4,0.8]
After:
Doc ID  vector
1       [0.2,0.4,0.8]
2       [0.2,0.4,0.8]
3       [0.2,0.4,0.8]
4       [0.2,0.4,0.8]

Join word vectors to document
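The three table operations from the preceding slides (explode, join on word, group by document ID and take the mean) can be sketched in plain Python. This is an illustration with lists and dicts, not Spark DataFrame code; the function names, documents and two-dimensional vectors are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def explode(docs: Dict[int, List[str]]) -> List[Tuple[int, str]]:
    """One (doc_id, word) row per token."""
    return [(doc_id, word) for doc_id, tokens in docs.items() for word in tokens]


def join_on_word(rows: List[Tuple[int, str]],
                 word_vectors: Dict[str, Vector]) -> List[Tuple[int, str, Vector]]:
    """Inner join: rows whose word has no vector are dropped."""
    return [(doc_id, word, word_vectors[word]) for doc_id, word in rows if word in word_vectors]


def mean_by_doc(joined: List[Tuple[int, str, Vector]]) -> Dict[int, Vector]:
    """Group rows by doc_id and average the vectors dimension by dimension."""
    grouped: Dict[int, List[Vector]] = defaultdict(list)
    for doc_id, _word, vector in joined:
        grouped[doc_id].append(vector)
    return {
        doc_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for doc_id, vectors in grouped.items()
    }


# Toy inputs, not real data.
docs = {1: ["apple", "mac"], 2: ["apple", "computer", "zebra"]}
word_vectors = {"apple": [1.0, 0.0], "computer": [0.0, 1.0], "mac": [1.0, 1.0]}

rows = explode(docs)
joined = join_on_word(rows, word_vectors)
doc_vectors = mean_by_doc(joined)
print(doc_vectors)
```

Using an inner join means out-of-vocabulary words ("zebra" here) simply do not contribute to the document vector.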

Identify similar documents
• Cartesian join of documents
• Compute cosine similarity between each pair of documents
Join to self
Documents (left side of the join):
Doc ID  vector
1       [0.2,0.4,0.8]
2       [0.2,0.4,0.8]
3       [0.2,0.4,0.8]
4       [0.2,0.4,0.8]
5       [0.2,0.4,0.8]
Documents (right side of the join):
Doc ID  vector
1       [0.2,0.4,0.8]
2       [0.2,0.4,0.8]
3       [0.2,0.4,0.8]
4       [0.2,0.4,0.8]
5       [0.2,0.4,0.8]
Similarity matrix (rows and columns are Doc IDs):
     1    2    3
1    0.4  0.6  0.6
2    0.5  0.4  0.7
3    0.6  0.1  0.3
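A minimal sketch of the self-join plus cosine similarity in plain Python, using `itertools.product` as a stand-in for the Cartesian join. The document vectors are toy values and the zero-vector convention (similarity 0.0) is an assumption.

```python
import math
from itertools import product
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy document vectors keyed by Doc ID.
doc_vectors: Dict[int, List[float]] = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}

# Cartesian join of the table with itself: every (doc, doc) pair.
similarity: Dict[Tuple[int, int], float] = {
    (i, j): cosine_similarity(u, v)
    for (i, u), (j, v) in product(doc_vectors.items(), repeat=2)
}

for (i, j), score in sorted(similarity.items()):
    print(i, j, round(score, 3))
```

The full Cartesian join produces n² pairs, so for large corpora it is the expensive part of the pipeline; since cosine similarity is symmetric, only pairs with i < j strictly need to be computed.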

Thank you
Any questions?
