tomekl007 @tomekl007
Tomasz Lelek
MACHINE LEARNING WITH APACHE SPARK
What are we trying to achieve?
Find the author of a given post, based on the text of the post
Input data
Forum with the following post structure:
Preparing data
Tokenization
• Input: Swimmer like to swim, so he swims.
• Output: swimmer, like, to, swim, so, he, swims
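A minimal sketch of the tokenization step in plain Python. It is not the Spark Tokenizer used in the talk; the function name `tokenize` and the regular expression are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("Swimmer like to swim, so he swims."))
```

Lowercasing first means "Swimmer" and "swimmer" end up as the same token.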
Remove Stop Words
• Each language has stop words, e.g.:
to, as, a, the, …
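Stop-word removal can be sketched as a simple filter. The four-word set below is only the sample from the slide, not a complete list for any language, and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny illustrative stop-word set taken from the slide; a real list is much longer.
STOP_WORDS = {"to", "as", "a", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["swimmer", "like", "to", "swim", "so", "he", "swims"]))
```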
Lemmatization - Morphological Analysis
• mum:
mums
mummies
mummy
Load forum data
Tokenize and Stop Words
Transforming text into vector of numbers
Bag-of-Words
1. Jon likes watching movies. Mary likes movies too.
2. Jon also likes watching football games.
[“Jon”, “likes”, “watching”, “movies”, “also”, “football”, “games”, “Mary”, “too”]
1. [1, 2, 1, 2, 0, 0, 0, 1, 1]
2. [1, 1, 1, 0, 1, 1, 1, 0, 0]
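The counting behind Bag-of-Words can be sketched in a few lines of Python. The vocabulary order is copied from the slide; the helper name `bag_of_words` and the pre-tokenized input lists are assumptions for illustration.

```python
from collections import Counter

# Vocabulary in the same order as on the slide.
VOCABULARY = ["Jon", "likes", "watching", "movies", "also",
              "football", "games", "Mary", "too"]


def bag_of_words(tokens, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


doc1 = ["Jon", "likes", "watching", "movies", "Mary", "likes", "movies", "too"]
doc2 = ["Jon", "also", "likes", "watching", "football", "games"]

print(bag_of_words(doc1))
print(bag_of_words(doc2))
```

Each position in the resulting list corresponds to one vocabulary word, so both documents map to vectors of the same length.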
Word2Vec
FRANCE closest words:
Skip-Gram
• Input:
In Poland rain mainly in September.
• Output:
In rain, Poland mainly, rain in, mainly September
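The pairs on this slide are bigrams that skip one word. A minimal sketch of that pairing follows; the function name `skip_bigrams` and its `skip` parameter are hypothetical, and this illustrates the n-gram idea only, not how Word2Vec trains its skip-gram model internally.

```python
def skip_bigrams(tokens, skip=1):
    """Pair each token with the token that follows it after `skip` skipped words."""
    step = skip + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]


tokens = ["In", "Poland", "rain", "mainly", "in", "September"]
for left, right in skip_bigrams(tokens):
    print(left, right)
```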
Spark Word2Vec
Machine Learning
• Supervised Learning – input data needs to be labeled
• Unsupervised Learning – input data is not labeled, e.g. clustering.
Used techniques
• Logistic Regression
• Gaussian Mixture Model
I. Logistic Regression
• Supervised Learning
• Data that we want to analyze has binary labels (1 or 0)
• Input could be a vector of numbers (text transformed using Word2Vec) with a binary label
• The vector (text) was written by the author (1) or not (0)
Logistic Regression example
input
Hours of study vs. passing the exam (1 or 0)
Chart
Example result
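The hours-of-study example can be sketched with a one-feature logistic regression trained by plain gradient descent. The data points below are made up for illustration and are not the data from the talk; the learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for h in (1.0, 3.0, 5.0):
    print(f"{h} hours -> probability of passing {sigmoid(w * h + b):.2f}")
```

The fitted weight is positive, so the predicted probability of passing rises with the number of hours.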
II. Gaussian Mixture Model
• Unsupervised learning
• Used to draw conclusions from time data
• Answers the question: what is the probability that some event occurred at a given time?
Graphic representation
hour
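A Gaussian mixture is a weighted sum of Gaussian densities. The sketch below only evaluates such a mixture for an hour of the day; it does not fit one. The three (weight, mean, standard deviation) components are hypothetical values chosen to mimic an author who writes mostly in the evening.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Weighted sum of Gaussian densities; the weights are expected to sum to 1."""
    return sum(weight * gaussian_pdf(x, mean, std) for weight, mean, std in components)


# Hypothetical (weight, mean hour, std) triples: morning, midday, evening.
AUTHOR_X = [(0.3, 8.0, 1.5), (0.1, 13.0, 1.0), (0.6, 20.0, 2.0)]

for hour in (8, 13, 20):
    print(hour, round(mixture_pdf(hour, AUTHOR_X), 4))
```

A plain Gaussian treats hour 23 and hour 0 as far apart, so this sketch ignores the wrap-around at midnight.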
Next steps to build model
• What do we want to achieve?
• Find the author of a given post with some probability, based on the text of the post
Input data for our algorithms
• Word2Vec
• Example sentence: “It is very important to plan for a future but also being in the moment”
• The resulting vector may look like:
Logistic Regression model per author
Area under ROC
Interpreting measures
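The area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch of that definition follows; the function name `area_under_roc` is an assumption, and the pairwise loop is meant for small illustrative inputs only.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


print(area_under_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print(area_under_roc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

A model that scores every post the same cannot separate authors at all, which is the coin-toss case.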
Prepare labeled data
Build model
Model validation
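Preparing labeled data for one author and splitting it for validation can be sketched as follows. The helper names, the 30% test fraction, and the fixed seed are assumptions for illustration; the talk does not state which split it used.

```python
import random


def label_for_author(posts, author):
    """Turn (author_name, vector) pairs into (label, vector) pairs: 1 for `author`, else 0."""
    return [(1 if name == author else 0, vector) for name, vector in posts]


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical posts, each already transformed into a (very short) vector.
posts = [("author_a", [0.1, 0.4]), ("author_b", [0.7, 0.2])] * 5
rows = label_for_author(posts, "author_a")
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

The same posts are relabeled once per author, so each per-author model sees every post as either a positive or a negative example.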
Add time when post was written to model
Time of day distribution for author X
Preparing data for GMM
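Reducing a timestamp to its hour of day in GMT can be sketched with the standard library. This assumes the timestamps are Unix epoch seconds, which the slides do not confirm; `hour_of_day_gmt` is a hypothetical name.

```python
from datetime import datetime, timezone


def hour_of_day_gmt(epoch_seconds):
    """Return the hour (0-23) of a Unix timestamp, interpreted in GMT/UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).hour


print(hour_of_day_gmt(18 * 3600 + 59))
```

Using one fixed time zone keeps authors from different countries comparable on the same hour axis.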
Creating GMM
Evaluating Logistic Regression with GMM model.
Find author for post:
• “Given that somebody could take that as a granted, I think we should”
• The post was written at 18:00.
Test run
Result
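The final lookup can be sketched as a loop over per-author models: append the hour probability to the text vector, score it, and sort. Everything below is a stand-in. The author names, weights, hour functions, and the two-number text vector are hypothetical and replace the trained Word2Vec, GMM, and Logistic Regression models from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_scorer(weights, bias):
    """Build a logistic scoring function from fixed (hypothetical) weights."""
    def score(features):
        return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return score


def rank_authors(text_vector, hour, models):
    """Score the post against every author's model, most probable author first."""
    results = []
    for author, (hour_probability, score) in models.items():
        features = list(text_vector) + [hour_probability(hour)]
        results.append((author, score(features)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-in models: (hour -> probability, features -> probability) per author.
models = {
    "author_a": (lambda hour: 0.8 if 17 <= hour <= 21 else 0.1,
                 make_scorer([1.0, -0.5, 2.0], -1.0)),
    "author_b": (lambda hour: 0.2,
                 make_scorer([-1.0, 0.5, 2.0], -1.0)),
}

ranking = rank_authors([0.4, -0.2], 18, models)
for author, probability in ranking:
    print(author, round(probability, 2))
```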
How could it be used?
Thank you. Questions?

JDD 2016 - Tomasz Lelek - Machine Learning With Apache Spark

Editor's Notes

  1. We will solve problems with Apache Spark, using machine learning techniques.
  2. We need to analyze the text of a post to find its author, so we will use Natural Language Processing techniques. We have only the text of the post, and we want to find which author, from a group of authors, wrote it.
  3. Let's say the forum has around 10,000 posts. Each author can have one or many posts. Some authors have around 1,000 posts, while others have just 1. We are mainly interested in the post body, which we will use for further processing.
  4. Before we get into the data, we need to prepare and clean the text in the body of the posts.
  5. We need to split the whole body into tokens. We want to see our text as a vector, and this is the first step towards that.
  6. We need to remove stop words from the text. They carry no semantic value for text analysis.
  7. A word can be written in multiple forms. We want those words to be in the same form, because then the next processing step will treat them as the same word. Otherwise the forms will be represented in the vector as different numbers, and our algorithms will not work effectively. In the given example the word "mum" could be written as "mummy" and could also appear in the plural. We want all those words to be represented as one word, so we transform them to one common form.
  8. We will use the DataFrame API. We want to load all data from a MySQL database that was imported into our local MySQL instance.
  9. With the DataFrame loaded and a sqlContext available, we create a Spark Tokenizer. It handles splitting the "body" column into an array of "words". Then StopWordsRemover removes all stop words from the "words" array and puts the result in the without-stop-word column of the data frame. We need to call setStopWords with the stop words specific to the language we analyze. We keep all stop words in the StopWords.allStopWords array.
  10. To run machine learning algorithms on our data, we need to transform the input arrays of words into vectors of numbers. Algorithms like Logistic Regression or GMM, which we will use later, operate on numbers. There are algorithms that will do this for us; we need to configure them properly. In reality the vector for a set of words will be much longer and will have more dimensions.
  11. A simple method of representing text as a vector is Bag-of-Words. It is commonly used in document classification problems, where text classification lets us assign a document to the proper category. In this method text is represented as a set of words together with the frequency of each word's occurrence in the text. We have two documents. Each document is represented as a vector. The index is the position of a word in the token array, and the number at a given index is the number of occurrences of that word in the document.
  12. Word2Vec is another, more complex algorithm that makes a vector of numbers from text. First it constructs a dictionary from the text data, then it represents that data as a vector. It can be used to find the closest (most similar) words. For example, given the word "France" as input, the words with the closest cosine distance are:
  13. Word2Vec uses two algorithms to represent text as a vector: Continuous Bag-of-Words and Skip-gram. Skip-gram is a kind of N-gram. Using skip-gram better preserves the semantics of the text, because information about where a word occurred in the text is not lost when the text is transformed into a vector.
  14. Here we have a function that takes a DataFrame with a words column in it. We create a new Word2Vec algorithm. There are two important config params: vectorSize, which should be at least 100 to represent a document distinctively as a vector, and minCount.
  15. There are two groups of ML algorithms. Take, e.g., an algorithm for finding faces in a photo. Supervised: the input data is a photo labeled with information on whether there is a face in the photo or not (1 or 0).
  16. Now we will see the techniques that will later be used to analyze the forum data. For now we will look at them using some simple examples.
  17. The input data is the number of hours and whether the student passed the exam (1) or not (0).
  18. The logistic regression result can be depicted by a chart like this. We see that as more hours are spent on learning, the probability of passing the exam increases.
  19. Using logistic regression we can draw such conclusions. This algorithm will be used later to analyze our forum data.
  20. As input we have information about when an event occurred.
  21. Having an author and many of his posts, each with a timestamp of when it was written, we can create a very effective GMM model. The most important parameter for GMM is the number of clusters. In this example there are 3 clusters. We see three Gaussian distributions combined. On the X axis is the hour when the post was written. So we see that the author most often wrote posts in the evening, less often in the morning, and seldom in the middle of the day. We can use that to analyze our forum data and find the author of a given post, because we know when each author prefers to write posts.
  22. We have our input data transformed into a vector of numbers using the Word2Vec algorithm.
  23. We want to build one logistic regression model per author, so with N authors there will be N models. The first model is created for author A. We see that the first post (represented as a vector) was written by author A, the second was not, and so on. The second model is built for author B. Now, when we have a post represented as a vector, we iterate over the models and ask whether the given post was written by author A, then B. Each model returns some probability that the post was written by its author. The model that returns the highest probability tells us that the author for whom that model was built wrote the post.
  24. We want to measure the accuracy of the models. The measure used for that is the area under the ROC curve. It is a measure of how well the model fits.
  25. 0.50 means that you might as well toss a coin.
  26. We prepare the input data for logistic regression: a vector with a label (1 or 0) indicating whether the post belongs to the given author or not.
  27. We split the data into a training set and a test set. The training set is used to build the model; the test set is used to validate it. From validation we get measures (e.g. area under ROC).
  28. We want to add information about when the post was written to our model. We are interested only in the hour of the post. Then we can see the tendency of an author, e.g. that he writes posts mainly at midnight. Or we have some authors who live in the USA; they will write posts at a totally different time of day than authors from Poland.
  29. Example distribution for an author: we see that the author most often writes posts from 3 to 12 and from 16 to 19 GMT. Once a GMM model is created for an author, it answers the question of whether the author could write posts at a given hour. The answer is some probability. That one value is not enough to find the author of a given post, but it can be used as an additional dimension (value) in the Logistic Regression model input.
  30. Normalize time basically takes a timestamp, creates a date in GMT from it, and takes the hour of day from that date.
  31. The input for GMM needs to be a vector. Then we create a GMM with three clusters.
  32. For the Logistic Regression model trained on the Word2Vec input and the GMM result probability, our model gave the following results. We see that the two top models are excellent!
  33. We have the input to our model: the post body and the hour when the post was written. We transform the text to a vector using Word2Vec, and the hour is passed to the GMM predict method. We append the two vectors and evaluate the logistic regression model. It returns the probability that the post written at 18 o'clock was written by the analyzed author. We do this for each author and sort by probability in descending order. The result at the top of the result set tells us which author wrote the post with the highest probability.
  34. There is a probability of 87% that the author "wild" wrote that post.
  35. How could this be used? If we have a forum, we could find that one person has two accounts, because they write in exactly the same way. If someone writes anonymously but previously wrote posts as a logged-in user, we could identify that user.