How can we use text data to draw conclusions about the users of our website or forum?
This talk describes a solution to a particular problem using machine learning and statistics. Based on the provided forum data, we will create a program that learns the structure of posts using Natural Language Processing techniques. Once the machine learning models are trained, the program can answer, with a probability, which of the forum's users wrote a particular post.
We will go through all the steps required to create machine learning models for text. How do we use Natural Language Processing and Bag-of-Words techniques to analyse text? How do we prepare input data for further processing by machine learning models? I will answer those questions. The implementation is written in Apache Spark, so we will get to know that technology along with some important libraries, such as Spark MLlib and the DataFrame API. From MLlib we will use the Gaussian Mixture Model and Logistic Regression.
I. Logistic Regression
• Supervised learning
• The data we want to analyze is labeled binary (1 or 0)
• The input can be a vector of numbers (text transformed using Word2Vec) with a binary label
• A vector (text) is either written by an author (1) or not (0)
II. Gaussian Mixture Model
• Unsupervised learning
• Used to draw conclusions from time-based data
• Answers the question: what is the probability that some event occurred at a given time?
Next steps to build the model
• What do we want to achieve?
• Find the author of a given post, with some probability, based on the text of the post
Input data for our algorithms
• Word2Vec
• Example sentence: “It is very important to plan for a future but also being in the moment”
• The resulting vector may look like:
We will be solving the problem with Apache Spark, using machine learning techniques.
We need to analyze the text of a post to be able to find its author, so we will be using Natural Language Processing techniques. We have only the text of the post, and we want to find which author, from the group of authors, wrote it.
Let's say the forum has around 10,000 posts. Each author can have one or many posts: some authors have around 1,000 posts, while some have only one.
We are mainly interested in the post body; we will use it for further processing.
Before we get into the data, we need to prepare and clean the text in the body of the posts.
We need to split the whole body into tokens. We want to see our text as a vector, and tokenization is the first step towards that.
We need to remove stop words from the text. They carry no semantic value for analyzing the text.
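A minimal sketch of these two cleaning steps in plain Python (the stop-word list is a tiny illustrative sample; Spark's Tokenizer and StopWordsRemover, used later, do the same at scale):

```python
STOP_WORDS = {"it", "is", "to", "for", "a", "but", "also", "in", "the", "very"}

def tokenize(body):
    """Split the post body into lowercase tokens."""
    return body.lower().split()

def remove_stop_words(tokens):
    """Drop tokens that carry no semantic value."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("It is very important to plan for a future but also being in the moment")
print(remove_stop_words(tokens))  # ['important', 'plan', 'future', 'being', 'moment']
```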
A word in text can be written in multiple forms; we want those forms to be unified, so that the next processing step perceives them as the same word. Otherwise each form will be represented in the vector by a different number, and our algorithms will not work effectively. In the given example, the word mum could be written as mummy, or appear in plural form. We want all those variants to be represented as one word, so we transform them to one common form.
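A minimal sketch of that normalization, using a tiny hand-made lemma dictionary; a real implementation would use a stemmer or lemmatizer for the analyzed language:

```python
# Hypothetical, hand-made mapping of word forms to one common form.
LEMMAS = {"mummy": "mum", "mums": "mum", "mummies": "mum"}

def normalize(tokens):
    """Replace each token with its common form, if one is known."""
    return [LEMMAS.get(t, t) for t in tokens]

print(normalize(["my", "mummy", "and", "other", "mums"]))
# ['my', 'mum', 'and', 'other', 'mum']
```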
We will be using the DataFrame API. We want to load all the data from the MySQL database that was imported into our local MySQL instance.
Having a DataFrame with the data loaded, and an sqlContext, we create a Spark Tokenizer. It handles splitting the “body” column into an array of “words”. Then a StopWordsRemover removes all stop words from the “words” array and puts the result in a without-stop-words column of the DataFrame. We need to setStopWords with the stop words specific to the language we analyze; we have all of them in the StopWords.allStopWords array.
To be able to run machine learning algorithms on our data, we need to transform the input arrays of words into vectors of numbers. Algorithms like Logistic Regression or GMM, which we will use later, operate on numbers.
There are algorithms that will do this for us; we just need to configure them properly. In reality, the vector for a set of words will be much longer and will have more dimensions.
A simple method to represent text as a vector is Bag-of-Words. It is commonly used in document classification problems, where text classification assigns a document to the proper category.
In this method, text is represented as a set of words together with the frequency of each word's occurrence in the text.
We have two documents. Each document will be represented as a vector. The index points to a word in the token array, and the number at a given index is the number of occurrences of that word in the document.
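The two-document example can be sketched as follows: a shared vocabulary assigns each word an index, and each document becomes a vector of occurrence counts (the example documents are invented):

```python
def bag_of_words(documents):
    """Build one count vector per document over a shared vocabulary."""
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for word in doc:
            vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors

docs = [["spark", "is", "fast", "spark"], ["spark", "is", "fun"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['fast', 'fun', 'is', 'spark']
print(vectors)  # [[1, 0, 1, 2], [0, 1, 1, 1]]
```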
Word2Vec is another, more complex algorithm that makes a vector of numbers from text. First it constructs a dictionary from the text data, then it represents that data as vectors.
It can be used to find the closest (most similar) word. For example, given the word "France" as input to the algorithm, the words with the closest cosine distance are:
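"Closest" here is measured by cosine similarity between word vectors. A minimal sketch with made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors; only their relative directions matter here.
vectors = {
    "France": [0.9, 0.1, 0.2],
    "Spain": [0.85, 0.15, 0.25],
    "banana": [0.1, 0.9, 0.3],
}
query = vectors["France"]
closest = max((w for w in vectors if w != "France"),
              key=lambda w: cosine(query, vectors[w]))
print(closest)  # Spain
```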
Word2Vec uses one of two algorithms to represent text as vectors: Continuous Bag-of-Words and Skip-gram. Skip-gram is related to n-grams. Using Skip-gram preserves more of the text's semantics, such as where a word occurred in the text, because that information is not lost when the text is transformed into vectors.
Here we have a function that takes a DataFrame with a words column in it. We create a new Word2Vec instance. There are two important config params: vectorSize, which should be at least 100 to be able to distinctively represent a document as a vector, and minCount.
There are two groups of ML algorithms. Consider, e.g., an algorithm for finding faces in photos. Supervised: the input data is a photo labeled with the information whether there is a face in it or not (1 or 0).
Now we will look at the techniques that will later be used to analyze the forum data, using some simple examples.
The input data is the number of hours spent studying and the information whether the student passed the exam (1) or not (0).
The logistic regression result can be depicted by such a chart. We see that the more hours are spent on learning, the higher the probability of passing the exam.
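With assumed, purely illustrative coefficients, the study-hours example looks like this: the fitted curve is p(pass) = 1 / (1 + e^-(b0 + b1·hours)), and the probability grows with the hours studied:

```python
import math

# Invented coefficients standing in for a fitted logistic regression model.
b0, b1 = -4.0, 1.5

def pass_probability(hours):
    """Probability of passing the exam after studying for `hours` hours."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

for hours in (1, 2, 3, 4):
    print(hours, round(pass_probability(hours), 3))
# 1 0.076
# 2 0.269
# 3 0.622
# 4 0.881
```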
Using logistic regression we can draw such conclusions. This algorithm will be used later to analyze our forum data.
As input we have the information when an event occurred.
Having an author and many of his posts, each with a timestamp of when it was written, we can create a very effective GMM model.
The most important parameter for a GMM is the number of clusters. In this example there are 3 clusters, so we are seeing three Gaussian distributions combined. On the X axis there is the hour when a post was written.
So we see that the author most often wrote posts in the evening, less often in the morning, and seldom in the middle of the day. We can use that to analyze our forum data and find the author of a given post, because we know when each author prefers to write posts.
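A sketch of that three-cluster density in plain Python, with invented weights, means, and standard deviations for the morning, midday, and evening clusters; Spark's GaussianMixture learns such parameters from the data:

```python
import math

def gaussian(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Invented mixture parameters: (weight, mean hour, std dev) per cluster.
CLUSTERS = [(0.2, 8.0, 1.5),   # morning
            (0.1, 13.0, 1.5),  # midday
            (0.7, 20.0, 2.0)]  # evening

def hour_density(hour):
    """Combined density: how likely this author is to post at a given hour."""
    return sum(w * gaussian(hour, m, s) for w, m, s in CLUSTERS)

# Evening is the most likely, morning less so, midday the least.
print(hour_density(20) > hour_density(8) > hour_density(13))  # True
```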
We have our input data transformed into vectors of numbers using the Word2Vec algorithm.
We want to build one logistic regression model per author, so having N authors there will be N models. The first model we create for author A: we mark that the first post (represented as a vector) was written by author A, the second was not, and so on. The second model we build for author B. Now, when we have a post represented as a vector, we iterate over the models, asking whether the given post was written by author A, then author B. Each model returns some probability that the post was written by its author. The model that returns the highest probability tells us that the author for which it was built wrote the post.
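The one-model-per-author scheme can be sketched like this; the per-author probabilities are stubbed out, while in the real program each comes from that author's trained logistic regression model:

```python
def score_post(models, post_vector):
    """Ask every author's model for its probability and pick the best."""
    scores = {author: model(post_vector) for author, model in models.items()}
    return max(scores.items(), key=lambda kv: kv[1])

# Stub models returning fixed probabilities -- purely illustrative.
models = {
    "author_A": lambda v: 0.31,
    "author_B": lambda v: 0.87,
    "author_C": lambda v: 0.12,
}
best_author, best_p = score_post(models, post_vector=[0.4, 0.1])
print(best_author, best_p)  # author_B 0.87
```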
We want to measure the accuracy of the models. The measure used for that is called ROC; it tells us how well a model fits.
0.50 means that you might as well toss a coin.
We prepare the input data for logistic regression: a vector with a label (1 or 0) depending on whether the post belongs to the given author or not.
We split the data into a training set and a test set. The training set is used to build the model; the test set is used to validate it. From validation we get measures (e.g. the area under the ROC curve).
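The area under the ROC curve can be computed from the test set's scores and labels; a minimal rank-based sketch (the scores and labels below are invented):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 5 of 6 positive/negative pairs ranked correctly, ~0.833
```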
We want to add the information when a post was written to our model. We are interested only in the hour of the post. Then we can see an author's tendency, e.g. that he writes posts mainly at midnight. Or we may have some authors who live in the USA; they will write posts at a totally different time of day than authors from Poland.
Example distribution for an author: we see that this author most often writes posts from 3 to 12 and from 16 to 19 GMT.
Creating a GMM model for an author, the model will answer whether the author could have written a post at a given hour. It will be some probability. That one value is not enough to find the author of a given post, but it can be used as an additional dimension (value) in the logistic regression model's input.
Normalizing the time basically takes the timestamp, creates a date in GMT format from it, and takes the hour of day from that date.
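A sketch of that normalization step, assuming the timestamp is a Unix timestamp in seconds:

```python
from datetime import datetime, timezone

def hour_of_post(timestamp):
    """Unix timestamp (seconds) -> hour of day in GMT/UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).hour

print(hour_of_post(1456772400))  # 19 (19:00 GMT)
print(hour_of_post(0))           # 0  (epoch: 1970-01-01 00:00 GMT)
```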
The input for the GMM needs to be a vector. Then we create a GMM with three clusters.
For the logistic regression model trained on the Word2Vec input plus the GMM result probability, our model gave the following results. We will see that the two top models are excellent!
We have the input to our model: the post body and the hour when the post was written. We transform the text into a vector using Word2Vec, and the hour is passed to the GMM predict method. We append the two vectors and evaluate the logistic regression model. It returns the probability that the post, written at 18 o'clock, was written by the analyzed author. We do this for each author and sort by probability descending. The result at the top of the result set tells us which author wrote the post with the highest probability.
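The combination step can be sketched like this: the GMM's hour probability is appended to the Word2Vec vector before the author's logistic regression model is evaluated (all numbers below are invented placeholders):

```python
import math

def predict_author_probability(word2vec_vector, hour_probability, weights, bias):
    """Append the GMM hour probability to the text vector and apply
    the author's logistic regression model."""
    features = list(word2vec_vector) + [hour_probability]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

# Invented values: a 3-dim text vector, a GMM density for 18:00, and model weights.
p = predict_author_probability([0.2, -0.4, 0.7], 0.14,
                               weights=[1.0, 0.5, 2.0, 6.0], bias=-1.0)
print(round(p, 3))  # 0.776
```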
There is a probability of 87% that the author wild wrote that post.
How can this be used? If we have a forum, we could find that one person has two accounts, because they write in exactly the same way. If someone writes anonymously, but previously wrote posts as a logged-in user, we can identify that user.