How can we use text data to draw conclusions about the users of our website or forum?
This talk describes a solution to a particular problem using machine learning and statistics. Based on the provided forum data, we will create a program that learns the structure of posts using Natural Language Processing techniques. Once the machine learning models are trained, the program can answer, with a probability, which of the forum's users wrote a particular post.
We will go through all the steps required to create machine learning models for text. How do we use Natural Language Processing and Bag-of-Words techniques to analyse text? How do we prepare input data for further processing by machine learning models? I will answer those questions. The implementation is written in Apache Spark, so we will get to know that technology along with some important libraries, such as Spark MLlib and the DataFrame API. From MLlib we will use the Gaussian Mixture Model and Logistic Regression.
I. Logistic Regression
• Supervised learning
• The data we want to analyze is labeled binary (1 or 0)
• The input can be a vector of numbers (text transformed using Word2Vec) with a binary label
• A vector (text) is either written by an author (1) or not (0)
II. Gaussian Mixture Model
• Unsupervised learning
• Used to draw conclusions from time-based data
• Answers the question: what is the probability that some event occurred at a given time?
Next steps to build the model
• What do we want to achieve?
• Find the author of a given post, with some probability, based on the text of the post
Input data for our algorithms
• Word2Vec
• Example sentence: “It is very important to plan for a future but also being in the moment”
• The resulting vector may look like:
We will be solving the problem with Apache Spark, using machine learning techniques.
We need to analyze the text of a post to be able to find its author, so we will be using Natural Language Processing techniques. We have only the text of the post, and we want to find which author, from the group of authors, wrote it.
Let's say the forum has around 10,000 posts. Each author can have one or many posts: some authors have around 1,000 posts, while some have only one.
We are mainly interested in the post body; we will use it for further processing.
Before we get into the data, we need to prepare and clean the text in the body of the posts.
We need to split the whole body into tokens. We want to see our text as a vector, and tokenization is the first step towards that.
We need to remove stop words from the text. They carry no semantic value for analyzing the text.
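A minimal sketch of these two cleaning steps in plain Python (the stop-word list is a tiny illustrative sample; Spark's Tokenizer and StopWordsRemover, used later, do the same at scale):

```python
STOP_WORDS = {"it", "is", "to", "for", "a", "but", "also", "in", "the", "very"}

def tokenize(body):
    """Split the post body into lowercase tokens."""
    return body.lower().split()

def remove_stop_words(tokens):
    """Drop tokens that carry no semantic value."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("It is very important to plan for a future but also being in the moment")
print(remove_stop_words(tokens))  # ['important', 'plan', 'future', 'being', 'moment']
```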
A word in text can be written in multiple forms; we want those forms to be unified, so that the next processing step perceives them as the same word. Otherwise each form will be represented in the vector by a different number, and our algorithms will not work effectively. In the given example, the word mum could be written as mummy, or appear in plural form. We want all those variants to be represented as one word, so we transform them to one common form.
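A minimal sketch of that normalization, using a tiny hand-made lemma dictionary; a real implementation would use a stemmer or lemmatizer for the analyzed language:

```python
# Hypothetical, hand-made mapping of word forms to one common form.
LEMMAS = {"mummy": "mum", "mums": "mum", "mummies": "mum"}

def normalize(tokens):
    """Replace each token with its common form, if one is known."""
    return [LEMMAS.get(t, t) for t in tokens]

print(normalize(["my", "mummy", "and", "other", "mums"]))
# ['my', 'mum', 'and', 'other', 'mum']
```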
We will be using the DataFrame API. We want to load all the data from the MySQL database that was imported into our local MySQL instance.
Having a DataFrame with the data loaded, and an sqlContext, we create a Spark Tokenizer. It handles splitting the “body” column into an array of “words”. Then a StopWordsRemover removes all stop words from the “words” array and puts the result in a without-stop-words column of the DataFrame. We need to setStopWords with the stop words specific to the language we analyze; we have all of them in the StopWords.allStopWords array.
To be able to run machine learning algorithms on our data, we need to transform the input arrays of words into vectors of numbers. Algorithms like Logistic Regression or GMM, which we will use later, operate on numbers.
There are algorithms that will do this for us; we just need to configure them properly. In reality, the vector for a set of words will be much longer and will have more dimensions.
A simple method to represent text as a vector is Bag-of-Words. It is commonly used in document classification problems, where text classification assigns a document to the proper category.
In this method, text is represented as a set of words together with the frequency of each word's occurrence in the text.
We have two documents. Each document will be represented as a vector. The index points to a word in the token array, and the number at a given index is the number of occurrences of that word in the document.
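The two-document example can be sketched as follows: a shared vocabulary assigns each word an index, and each document becomes a vector of occurrence counts (the example documents are invented):

```python
def bag_of_words(documents):
    """Build one count vector per document over a shared vocabulary."""
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for word in doc:
            vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors

docs = [["spark", "is", "fast", "spark"], ["spark", "is", "fun"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['fast', 'fun', 'is', 'spark']
print(vectors)  # [[1, 0, 1, 2], [0, 1, 1, 1]]
```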
Word2Vec is another, more complex algorithm that makes a vector of numbers from text. First it constructs a dictionary from the text data, then it represents that data as vectors.
It can be used to find the closest (most similar) word. For example, given the word "France" as input to the algorithm, the words with the closest cosine distance are:
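"Closest" here is measured by cosine similarity between word vectors. A minimal sketch with made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors; only their relative directions matter here.
vectors = {
    "France": [0.9, 0.1, 0.2],
    "Spain": [0.85, 0.15, 0.25],
    "banana": [0.1, 0.9, 0.3],
}
query = vectors["France"]
closest = max((w for w in vectors if w != "France"),
              key=lambda w: cosine(query, vectors[w]))
print(closest)  # Spain
```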
Word2Vec uses one of two algorithms to represent text as vectors: Continuous Bag-of-Words and Skip-gram. Skip-gram is related to n-grams. Using Skip-gram preserves more of the text's semantics, such as where a word occurred in the text, because that information is not lost when the text is transformed into vectors.
Here we have a function that takes a DataFrame with a words column in it. We create a new Word2Vec instance. There are two important config params: vectorSize, which should be at least 100 to be able to distinctively represent a document as a vector, and minCount.
There are two groups of ML algorithms. Consider, e.g., an algorithm for finding faces in photos. Supervised: the input data is a photo labeled with the information whether there is a face in it or not (1 or 0).
Now we will look at the techniques that will later be used to analyze the forum data, using some simple examples.
The input data is the number of hours spent studying and the information whether the student passed the exam (1) or not (0).
The logistic regression result can be depicted by such a chart. We see that the more hours are spent on learning, the higher the probability of passing the exam.
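With assumed, purely illustrative coefficients, the study-hours example looks like this: the fitted curve is p(pass) = 1 / (1 + e^-(b0 + b1·hours)), and the probability grows with the hours studied:

```python
import math

# Invented coefficients standing in for a fitted logistic regression model.
b0, b1 = -4.0, 1.5

def pass_probability(hours):
    """Probability of passing the exam after studying for `hours` hours."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

for hours in (1, 2, 3, 4):
    print(hours, round(pass_probability(hours), 3))
# 1 0.076
# 2 0.269
# 3 0.622
# 4 0.881
```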
Using logistic regression we can draw such conclusions. This algorithm will be used later to analyze our forum data.
As input we have the information when an event occurred.
Having an author and many of his posts, each with a timestamp of when it was written, we can create a very effective GMM model.
The most important parameter for a GMM is the number of clusters. In this example there are 3 clusters, so we are seeing three Gaussian distributions combined. On the X axis there is the hour when a post was written.
So we see that the author most often wrote posts in the evening, less often in the morning, and seldom in the middle of the day. We can use that to analyze our forum data and find the author of a given post, because we know when each author prefers to write posts.
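A sketch of that three-cluster density in plain Python, with invented weights, means, and standard deviations for the morning, midday, and evening clusters; Spark's GaussianMixture learns such parameters from the data:

```python
import math

def gaussian(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Invented mixture parameters: (weight, mean hour, std dev) per cluster.
CLUSTERS = [(0.2, 8.0, 1.5),   # morning
            (0.1, 13.0, 1.5),  # midday
            (0.7, 20.0, 2.0)]  # evening

def hour_density(hour):
    """Combined density: how likely this author is to post at a given hour."""
    return sum(w * gaussian(hour, m, s) for w, m, s in CLUSTERS)

# Evening is the most likely, morning less so, midday the least.
print(hour_density(20) > hour_density(8) > hour_density(13))  # True
```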
We have our input data transformed into vectors of numbers using the Word2Vec algorithm.
We want to build one logistic regression model per author, so having N authors there will be N models. The first model we create for author A: we mark that the first post (represented as a vector) was written by author A, the second was not, and so on. The second model we build for author B. Now, when we have a post represented as a vector, we iterate over the models, asking whether the given post was written by author A, then author B. Each model returns some probability that the post was written by its author. The model that returns the highest probability tells us that the author for which it was built wrote the post.
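The one-model-per-author scheme can be sketched like this; the per-author probabilities are stubbed out, while in the real program each comes from that author's trained logistic regression model:

```python
def score_post(models, post_vector):
    """Ask every author's model for its probability and pick the best."""
    scores = {author: model(post_vector) for author, model in models.items()}
    return max(scores.items(), key=lambda kv: kv[1])

# Stub models returning fixed probabilities -- purely illustrative.
models = {
    "author_A": lambda v: 0.31,
    "author_B": lambda v: 0.87,
    "author_C": lambda v: 0.12,
}
best_author, best_p = score_post(models, post_vector=[0.4, 0.1])
print(best_author, best_p)  # author_B 0.87
```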
We want to measure the accuracy of the models. The measure used for that is called ROC; it tells us how well a model fits.
0.50 means that you might as well toss a coin.
We prepare the input data for logistic regression: a vector with a label (1 or 0) depending on whether the post belongs to the given author or not.
We split the data into a training set and a test set. The training set is used to build the model; the test set is used to validate it. From validation we get measures (e.g. the area under the ROC curve).
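The area under the ROC curve can be computed from the test set's scores and labels; a minimal rank-based sketch (the scores and labels below are invented):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 5 of 6 positive/negative pairs ranked correctly, ~0.833
```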
We want to add the information when a post was written to our model. We are interested only in the hour of the post. Then we can see an author's tendency, e.g. that he writes posts mainly at midnight. Or we may have some authors who live in the USA; they will write posts at a totally different time of day than authors from Poland.
Example distribution for an author: we see that this author most often writes posts from 3 to 12 and from 16 to 19 GMT.
Creating a GMM model for an author, the model will answer whether the author could have written a post at a given hour. It will be some probability. That one value is not enough to find the author of a given post, but it can be used as an additional dimension (value) in the logistic regression model's input.
Normalizing the time basically takes the timestamp, creates a date in GMT format from it, and takes the hour of day from that date.
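A sketch of that normalization step, assuming the timestamp is a Unix timestamp in seconds:

```python
from datetime import datetime, timezone

def hour_of_post(timestamp):
    """Unix timestamp (seconds) -> hour of day in GMT/UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).hour

print(hour_of_post(1456772400))  # 19 (19:00 GMT)
print(hour_of_post(0))           # 0  (epoch: 1970-01-01 00:00 GMT)
```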
The input for the GMM needs to be a vector. Then we create a GMM with three clusters.
For the logistic regression model trained on the Word2Vec input plus the GMM result probability, our model gave the following results. We will see that the two top models are excellent!
We have the input to our model: the post body and the hour when the post was written. We transform the text into a vector using Word2Vec, and the hour is passed to the GMM predict method. We append the two vectors and evaluate the logistic regression model. It returns the probability that the post, written at 18 o'clock, was written by the analyzed author. We do this for each author and sort by probability descending. The result at the top of the result set tells us which author wrote the post with the highest probability.
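The combination step can be sketched like this: the GMM's hour probability is appended to the Word2Vec vector before the author's logistic regression model is evaluated (all numbers below are invented placeholders):

```python
import math

def predict_author_probability(word2vec_vector, hour_probability, weights, bias):
    """Append the GMM hour probability to the text vector and apply
    the author's logistic regression model."""
    features = list(word2vec_vector) + [hour_probability]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

# Invented values: a 3-dim text vector, a GMM density for 18:00, and model weights.
p = predict_author_probability([0.2, -0.4, 0.7], 0.14,
                               weights=[1.0, 0.5, 2.0, 6.0], bias=-1.0)
print(round(p, 3))  # 0.776
```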
There is a probability of 87% that the author wild wrote that post.
How can this be used? If we have a forum, we could find that one person has two accounts, because they write in exactly the same way. If someone writes anonymously, but previously wrote posts as a logged-in user, we can identify that user.