Big data has allowed for an explosion in NLP research. This presentation introduces big data and includes a tutorial on using Spark to compute document similarity.
Lancaster UCREL Summer School 2017 - Big Data and NLP
1. UCREL Summer School |
Big Data NLP
Presented By: Daniel Kershaw
Date: 27/06/2017
2. UCREL Summer School |
Daniel Kershaw
Recommender System
Senior Data Scientist
@danjamker
www.danjamker.com
About
3. UCREL Summer School |
• Part 1 – 30 Minutes
• Big Data (What is it?)
• Map Reduce
• Spark
• Document Similarity
• Part 2 – 1 hour
• Downloading Zeppelin on Docker
• Read document set, extract data with
• Tokenize
• Implement Document Similarity
• Cosine Similarity between documents
Outline
4. UCREL Summer School |
Set up docker:
sudo docker pull epahomov/docker-zeppelin
Download Zeppelin Notebook:
https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0
First
5. UCREL Summer School |
Part 1 - Big Data and NLP
Presented By: Daniel Kershaw
Date: 20th June 2017
6. UCREL Summer School |
640K ought to be enough for anyone
Bill Gates, Microsoft, 1981
7. UCREL Summer School |
“There were 5 exabytes of
information created between the
dawn of civilization through 2003,
but that much information is now
created every 2 days”
Eric Schmidt, Google, 2010
8. UCREL Summer School |
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
How much data?
10. UCREL Summer School |
What is Big Data
Too big to fit in an Excel
spreadsheet
Professor Steven Weber, UC Berkeley School of Information
11. UCREL Summer School |
What is Big Data
Big data means data that cannot
fit easily into a standard relational
database
Hal Varian, Chief Economist, Google
12. UCREL Summer School |
What is Big Data
The term ‘Big Data’ applies to
information that can't be
processed or analysed using
traditional processes or tools
Professor Steven Weber, UC Berkeley School of Information
13. UCREL Summer School |
Volume
Velocity
Variety
Exhaustive
Veracity
Relational & Indexical
Relational
Flexible
The Big V’s
14. UCREL Summer School |
Wikipedia
Hansard
Enron Email Corpus
Reddit Data Release
Twitter Data Set
Examples of Big / Large Data (NLP )
Science Direct Corpus
Mendeley User Catalogs
Engineering Village
User interaction logs
Funding data
EVISE
15. UCREL Summer School |
Scaling up Computation
Servers
CPUs (Xeon)
RAM (32 GB)
Disks (2 x 1 TB)
Rack
40 - 80 servers
Networked Together
UPS (Power Supply)
17. UCREL Summer School |
• How do we split across nodes
• Network and data locality
• How do we deal with failures
• 1 server fails every 3 years => 10k nodes would
see about 10 failures a day
• How do we deal with slow machines
Programming at Scale
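The failure estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes failures are spread evenly over time; the object and method names are illustrative only.

```scala
object FailureRate {
  // Figure from the slide: one server fails, on average, once every 3 years.
  val meanDaysBetweenFailures: Double = 3 * 365.0

  // Expected failures per day across a cluster, assuming failures
  // are independent and evenly spread over time.
  def expectedFailuresPerDay(nodes: Int): Double =
    nodes / meanDaysBetweenFailures

  // 10,000 nodes: roughly 9 failures a day, which the slide rounds to 10.
  val forTenThousandNodes: Double = expectedFailuresPerDay(10000)
}
```

At this rate a hardware failure is a routine daily event, which is why the framework, not the programmer, has to handle it.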
18. UCREL Summer School |
Hadoop
Google MapReduce published 2004
Google File System published 2003
20. UCREL Summer School |
Mapper
Reducer
Map Reduce - Mapper
Takes a series of <key, value>
Processes each tuple
Outputs 0 or more <key, value> tuples
21. UCREL Summer School |
Mapper
Reducer
Map Reduce - Reducer
Called once for each unique <key, [value]>
Iterates through each value
Outputs 0 or more results as <key, value>
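The mapper and reducer contracts above can be sketched as a word count in plain Scala. This is not Hadoop code: the `run` function is a hypothetical stand-in for the framework's shuffle step, and all names are illustrative.

```scala
object WordCountSketch {
  // Mapper: takes one <key, value> pair (document ID, text) and
  // emits zero or more <word, 1> pairs.
  def mapper(docId: Int, text: String): Seq[(String, Int)] =
    text.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Reducer: called once per unique key with all values for that key.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Stand-in for the framework: map every input, group by key (the shuffle),
  // then reduce each group.
  def run(docs: Seq[(Int, String)]): Map[String, Int] = {
    val mapped = docs.flatMap { case (id, text) => mapper(id, text) }
    val shuffled = mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }
    shuffled.map { case (w, counts) => reducer(w, counts) }
  }
}
```

For example, `WordCountSketch.run(Seq((1, "big data big"), (2, "data")))` counts "big" twice and "data" twice.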
26. UCREL Summer School |
• Applications need more than one step
• Google pipeline was 22 steps
• Analytic queries e.g. k-means 2-5 steps
• Iterative queries e.g. PageRank 10-20 steps
• Problems with performance and ease of
development
Issues with Hadoop - Complexity
27. UCREL Summer School |
• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly
Issues with Hadoop - Usability
28. UCREL Summer School |
• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Hand optimize code to combine steps
Issues with Hadoop - Performance
31. UCREL Summer School |
• Resilient distributed datasets (RDD)
• Immutable, partitioned collections of objects
• Created through parallel transformations (map, filter, groupBy,
join, …) on data in stable storage
• Can be cached for efficient reuse
• Actions on RDDs
• Count, reduce, collect, save, …
Spark Model
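The split between lazy transformations and eager actions can be illustrated, by analogy only, with a plain Scala view. This is not Spark code and involves no RDD; it is a sketch of the evaluation model, with a hypothetical counter added to make the laziness visible.

```scala
object LazySketch {
  // Counts how many elements the map step has actually processed.
  var evaluated = 0

  // "Transformations": building the pipeline runs nothing yet.
  val pipeline = (1 to 5).view
    .map { x => evaluated += 1; x * x }
    .filter(_ % 2 == 1)

  // "Action": forces the pipeline and returns a value (1 + 9 + 25).
  def total: Int = pipeline.foldLeft(0)(_ + _)
}
```

Each call to `total` re-runs the map step because nothing is stored between calls. That recomputation is the cost that caching an RDD is meant to avoid.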
35. UCREL Summer School |
SparkML
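import org.apache.spark.mllib.clustering.KMeans
val train_data = // RDD of Vector
// k = 10 clusters; the third argument (max iterations) is required, 20 is illustrative
val model = KMeans.train(train_data, 10, 20)
// evaluate the model
val test_data = // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)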
36. UCREL Summer School |
• Interact with data like a table
• Inbuilt function to:
• Tokenize
• Stop-word removal
• TFIDF transformation
Spark Dataframes
Name Age Gender Abstract
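What the three built-in steps compute can be sketched in plain Scala without Spark. The stop-word list is a tiny hypothetical one, and the smoothed IDF formula is an assumed choice, so Spark's own transformers may differ in detail.

```scala
object TextPrep {
  // Hypothetical, deliberately tiny stop-word list; real lists are much longer.
  val stopWords: Set[String] = Set("the", "a", "of", "and", "is")

  // Lower-case the text and split on runs of non-letter characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(_.nonEmpty)

  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // TF-IDF using raw term counts and a smoothed IDF, log((N + 1) / (df + 1)),
  // where N is the number of documents and df the number containing the word.
  def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => (w, ws.size) }
    docs.map { tokens =>
      tokens.groupBy(identity).map { case (w, ws) =>
        (w, ws.size * math.log((n + 1.0) / (df(w) + 1.0)))
      }
    }
  }
}
```

A word that occurs in every document gets weight 0 under this formula, so only words that distinguish documents contribute.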
37. UCREL Summer School |
Title abstract keywords ASJC
Title abstract keywords ASJC Title_tok
38. UCREL Summer School |
Part 2 – Document Similarity
Technical Workshop
Presented By: Daniel Kershaw
Date: 29th June 2017
39. UCREL Summer School |
• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stopwords
• Read word vectors
Outline
45. UCREL Summer School |
Science Direct Open Access Corpus
Contains all content seen on SD frontend
Available on Github
Extract PII (document ID)
Extract Abstract
Use Elsevier open-source XML parser
Extract fields with xpath & xquery
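Field extraction with XPath might look like the sketch below, which uses the JDK's built-in XML classes rather than the Elsevier parser. The element names (`article`, `pii`, `abstract`) and the sample record are hypothetical placeholders, not the real corpus schema.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

object XPathSketch {
  // Hypothetical record; the real corpus schema and IDs will differ.
  val xml: String =
    "<article><pii>PII-PLACEHOLDER</pii>" +
    "<abstract>Spark computes document similarity.</abstract></article>"

  private val doc = DocumentBuilderFactory.newInstance()
    .newDocumentBuilder()
    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

  private val xpath = XPathFactory.newInstance().newXPath()

  // Each expression selects one element and returns its text content.
  val pii: String = xpath.evaluate("/article/pii", doc)
  val abstractText: String = xpath.evaluate("/article/abstract", doc)
}
```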
50. UCREL Summer School |
Load Word Vectors
word vector
apple [0.2,0.4,0.8]
computer [0.2,0.4,0.8]
mac [0.2,0.4,0.8]
Google [0.2,0.4,0.8]
51. UCREL Summer School |
Doc ID Tokens
1 [apple, computer, mac]
2 [apple, computer, mac]
3 [apple, computer, mac]
4 [apple, computer, mac]
5 [apple, computer, mac]
Doc ID Tokens
1 apple
1 computer
1 mac
2 apple
2 computer
Explode the tokens
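Exploding turns one row per document into one row per (document, token) pair. In plain Scala this is a `flatMap`; the sketch below uses in-memory sequences in place of a Spark DataFrame, with illustrative names.

```scala
object ExplodeSketch {
  // One row per document: (doc ID, tokens).
  val docs: Seq[(Int, Seq[String])] = Seq(
    (1, Seq("apple", "computer", "mac")),
    (2, Seq("apple", "computer"))
  )

  // One output row per token, each keeping its document ID.
  def explode(rows: Seq[(Int, Seq[String])]): Seq[(Int, String)] =
    rows.flatMap { case (id, tokens) => tokens.map(token => (id, token)) }

  val exploded: Seq[(Int, String)] = explode(docs)
}
```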
52. UCREL Summer School |
Doc ID word
1 apple
1 computer
1 mac
2 apple
2 computer
word vector
apple [0.2,0.4,0.8]
computer [0.2,0.4,0.8]
mac [0.2,0.4,0.8]
Google [0.2,0.4,0.8]
this [0.2,0.4,0.8]
Join on words
Doc ID word vector
1 apple [0.2,0.4,0.8]
1 computer [0.2,0.4,0.8]
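The join on words can be sketched as a lookup against the word-vector table. This assumes inner-join behaviour, so tokens with no vector are dropped; a plain Scala `Map` stands in for the vector DataFrame.

```scala
object JoinSketch {
  type Vec = Seq[Double]

  // Inner join: keep a (doc ID, word) row only if the word has a vector.
  def joinOnWord(rows: Seq[(Int, String)],
                 vectors: Map[String, Vec]): Seq[(Int, String, Vec)] =
    rows.flatMap { case (id, word) =>
      vectors.get(word).map(v => (id, word, v))
    }
}
```

Words in the vector table that never occur in a document, such as "Google" and "this" above, do not appear in the result.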
53. UCREL Summer School |
Doc ID word vector
1 apple [0.2,0.4,0.8]
1 computer [0.2,0.4,0.8]
Group by document ID, average the vectors
Doc ID vector
1 [0.2,0.4,0.8]
2 [0.2,0.4,0.8]
3 [0.2,0.4,0.8]
4 [0.2,0.4,0.8]
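The grouping step, followed by the cosine similarity that the outline lists as the final step, can be sketched in plain Scala. It assumes all word vectors have the same length, and the names are illustrative.

```scala
object DocSimilarity {
  type Vec = Seq[Double]

  // Element-wise mean of equally sized vectors.
  def mean(vs: Seq[Vec]): Vec =
    vs.transpose.map(column => column.sum / vs.size)

  // Group (doc ID, word vector) rows by document and average each group.
  def docVectors(rows: Seq[(Int, Vec)]): Map[Int, Vec] =
    rows.groupBy(_._1).map { case (id, rs) => (id, mean(rs.map(_._2))) }

  // Cosine similarity: dot product divided by the product of the norms.
  // Returns 0.0 when either vector is all zeros.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }
}
```

Documents whose averaged vectors point in the same direction score close to 1, and orthogonal vectors score 0.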