Higher Diploma in Data Analytics
Programming for Big Data Project
Alexandros Papageorgiou
Student ID: 15019004
Analysis I:
The major factors determining the interest rate for Lending Club loan requests
Analysis II:
Prediction of activity based on mobile phone spatial measurements
Analysis III:
Analysis of user interaction with online ads on a major news website with Spark
The major factors determining the interest rate for Lending Club loan requests
Objectives of the analysis
Loans are commonplace nowadays, and advances in the finance industry have made requesting a loan a highly automated process. A major component of this process is setting the level of the interest rate.
The rate is determined based on a number of factors drawn both from the applicant’s credit history and from the application data submitted with the request, such as employment history, credit history and creditworthiness scores (lendingclub.com, 2015).
Determining the interest rate can be a complex task that requires advanced data analysis.
The purpose of this analysis is to identify associations between the interest rate and a number of other factors, based on the loan application data (such as employment history, credit history and creditworthiness scores) as well as data provided by external sources, in order to better understand how the interest rate is determined and to attempt to quantify these relationships.
In particular, this study investigates which factors beyond FICO (the main measure of the creditworthiness of the applicant) can have an impact. Using exploratory analysis and standard multiple regression techniques, it is demonstrated that there is a significant relationship between the interest rate and the FICO score, as well as two other variables (amount requested and length of the loan).
Dataset Description
For this analysis, a sample dataset of 2,500 observations (rows) and 14 variables (columns) from the Lending Club website was used, downloaded using the R programming language (R-Core-Team, 2015).
The Lending Club data used in this analysis contains the variables listed below, under their code names, measuring the following:
 Amount.Requested: The amount (in dollars) requested in the loan application
 Amount.Funded.By.Investors: The amount (in dollars) loaned to the individual
 Interest.rate: The lending interest rate
 Loan.length: The length of time (in months) of the loan
 Loan.Purpose: The purpose of the loan as stated by the applicant
 Debt.to.Income.Ratio: The percentage of the consumer’s gross income that goes towards paying debts
 State: The abbreviation for the U.S. state of residence of the loan applicant
 Home.Ownership: A variable indicating whether the applicant owns, rents, or has a mortgage on their home
 Monthly.income: The monthly income of the applicant (in dollars)
 FICO.range: A range indicating the applicant’s FICO score, a measure of the applicant’s creditworthiness
 Open.Credit.Lines: The number of open lines of credit the applicant had at the time of application
 Revolving.Credit.Balance: The total amount outstanding across all lines of credit
 Inquiries.in.the.Last.6.Months: The number of authorized credit inquiries in the 6 months before the loan was issued
 Employment.Length: Length of time employed at the current job
Challenge: Data not in a tidy form
Exploratory analysis was the method used, constructing plots and relevant tables to examine the quality of the data provided and to explore possible associations between the interest rate and the independent variables. This was done after handling the 7 missing values found, ensuring that the analysis is performed on complete cases, under the assumption that such a small number of missing values has no significant effect on the analysis.
Other data type transformations:
 Several factor or character variables converted to numeric
 Removal of the % symbol from the interest rate
 FICO range converted from a range into a single figure
 Renaming of variables where appropriate
Rationale for the transformations:
These transformations were made to enable more flexible handling of the data in R, especially by converting variables to numeric form.
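As an illustration of these cleaning steps, a minimal R sketch follows; the file name and the exact column names are assumptions based on the variable list above, not the report’s actual code.

```r
# Sketch of the cleaning steps described above (assumed names).
loans <- read.csv("loansData.csv", stringsAsFactors = FALSE)

# Strip the "%" symbol and convert the interest rate to numeric
loans$Interest.Rate <- as.numeric(sub("%", "", loans$Interest.Rate))

# Convert the FICO range (e.g. "720-724") into a single numeric figure,
# here the lower bound of the range
loans$FICO <- as.numeric(sub("-.*", "", loans$FICO.Range))

# Keep complete cases only (the 7 rows with missing values are dropped)
loans <- loans[complete.cases(loans), ]
```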
To relate the interest rate to its major components, a standard linear regression model was deployed. Model selection was performed on the basis of the exploratory analysis and prior knowledge of the relationship between the interest rate and the factors considered critical to its determination.
Data processing activities
As noted above, a minimal number of missing values were identified and, where appropriate, removed. Beyond that, the data was found to be within normal and acceptable ranges, without extreme values in the interest rate or in the other independent variables.
The final dataset was in line with the tidy data principles (Wickham, 2015).
As a first step in the exploratory analysis, a correlation analysis of all the numeric variables was performed in order to identify possible associations among them, and particularly the variables that correlate well with the interest rate.
The results of this first analysis reveal quite a strong negative correlation between the interest rate and the FICO score (r = -0.7); there is also some correlation with the amount requested and the amount funded, at the level of r = 0.33.
The correlation of the other variables with the interest rate was relatively low.
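A brief sketch of such a correlation screen, assuming the cleaned `loans` data frame from the sketch above:

```r
# Correlation of every numeric variable with the interest rate
num_vars <- loans[sapply(loans, is.numeric)]
round(cor(num_vars, use = "complete.obs")[, "Interest.Rate"], 2)
```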
To carry the analysis forward, the information provided on the club’s website was considered; it mentions the following credit risk indicators as factors in the interest rate model:
 Requested loan amount
 Loan maturity (36 or 60 months)
 Debt to income ratio
 Length of credit history
 Number of other accounts opened
 Payment history
 Number of other credit inquiries initiated over the past six months.
The FICO score is also quite explicitly mentioned as a decisive factor, so in this context it will unavoidably be one of the variables that define the model.
A number of experiments with box plots were performed in order to identify possible relationships between the interest rate and the categorical variables. It turned out that the categorical variable that seems to have an impact on the interest rate is the length of the loan.
Obviously, overloading the model by including all the variables is not the optimal strategy (Cohen, 2009), so a selection has to be made based on the results of the correlation analysis, the box plots for the categorical variables and the information provided on the website.
After testing a number of models, the one found to best fit this analysis is the following:
Interest Rate = b0 + b1(FICO Score) + b2(Requested Amount) + b3(Length of Loan) + e
where b0 is an intercept term and b1 represents the (negative) change in the interest rate for a one-unit change in the FICO score; similarly, b2 represents the impact on the interest rate of a one-dollar increase in the requested amount. Length of loan is a two-level categorical variable whose coefficient b3 represents the change in the interest rate when moving from a 36-month to a 60-month loan period, at average levels of the other two independent variables.
The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015).
For the length of loan, a dummy variable was created so that the R function can interpret the data more effectively; “36 months” was selected as the reference level.
In the case of the amount variables, due to confounding concerns (the two variables correlate not just with the interest rate but, obviously, also with each other), only the amount requested was included in the regression model, since the funded amount directly depends on the originally requested amount.
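A minimal sketch of this model fit, assuming the `loans` data frame from the earlier sketches (the variable names are assumptions, not the report’s actual code):

```r
# "36 months" as the reference level for the loan-length dummy
loans$Loan.Length <- relevel(factor(loans$Loan.Length), ref = "36 months")

# Multiple regression of interest rate on FICO, amount requested and loan length
fit <- lm(Interest.Rate ~ FICO + Amount.Requested + Loan.Length, data = loans)
summary(fit)   # coefficients, p-values, adjusted R-squared
confint(fit)   # 95% confidence intervals, as reported below
```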
We observed a highly statistically significant (p < 2e-16) association between the interest rate and the FICO score: a one-unit increase in FICO corresponded to a change of approximately b1 = -0.087 in the interest rate (95% confidence interval: -8.98e-02 to -8.51e-02).
There was also a significant association between the interest rate and the amount requested (p < 2e-16): a one-dollar increase in the amount requested corresponded to a change of b2 = 1.446e-04 in the interest rate (95% confidence interval: 1.32e-04 to 1.57e-04).
Last, the intercept of 7.245e+01 is the interest rate projected when all predictors are set to zero; with the loan-length dummy at zero this corresponds to the 36-month loan period, while a dummy value of one corresponds to the 60-month length (p < 2e-16).
The model has an adjusted R-squared of 0.7454, which corresponds to the proportion of variation explained by the model.
A limited-in-scope analysis of residuals, comparing the effectiveness of the multiple regression against a simple linear regression model, shows that the non-random residual variance left by the simple model is better accounted for by the multiple regression.
Conclusions:
The analysis suggests that there is a significant (negative) association between the interest rate and the FICO score, as well as significant associations with factors such as loan length and the amount requested. The analysis estimated these relationships using a linear model.
This analysis provides some insights into the way a loan institution like Lending Club determines the cost of money for its customers. It therefore makes sense for borrowers to be aware of the major factors that determine the interest rate they will be asked to pay, and possibly to act on this knowledge in ways that could contribute to more favourable terms (for example, asking for a lower amount and repaying the loan sooner rather than later).
It is important to keep in mind that this study is based on a limited dataset from just one institution and might therefore be subject to bias. As time goes on, and depending also on other parameters of the national and international economy, other factors might come to play critical roles too. In any case, an informed customer who is aware of this type of analysis is likely to make better decisions in his or her loan purchase.
Works Cited
Cohen, Y., 2009. Statistics and Data with R. s.l.:Wiley.
lendingclub.com, 2015. Interest rates and how we set them. [Online]
Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action
[Accessed 25 11 2015].
R-Core-Team, 2015. R: A language and environment for statistical computing. [Online]
Available at: http://www.R-project.org
Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online]
Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Wickham, H., 2015. Tidy Data. [Online]
Available at: http://vita.had.co.nz/papers/tidy-data.pdf
Title: Prediction of activity based on mobile phone spatial measurements
Introduction and Objectives:
Advances in mobile phone technology and the proliferation of smart devices have enabled the collection of spatial data from smartphone users, with the intention of studying the relation between the measurements registered by the devices and the corresponding synchronous activity of the subjects.
Data analysis methodology is used here to build a prediction model that determines user activity based on a wide range of signals related to body motion.
In particular, the analysis is based on the records of the Activity Recognition database, which was built from the recordings of 30 subjects performing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors, including an accelerometer and a gyroscope. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones.
Data Description
A group of 30 volunteers was selected for this task by the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted smartphone. The six selected ADLs were standing, sitting, laying down, walking, walking downstairs and walking upstairs.
The respective data set was downloaded from the URL https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis) to facilitate its use within the R environment.
The data consists of 7352 entities (rows), each corresponding to a time-indexed activity of one of the 21 subjects included in this file, and 563 variables (columns) corresponding to measurements from the two sensors.
Specifically, for each record the data provides:
- Acceleration from the accelerometer (total acceleration) and the estimated body acceleration (X, Y, Z axes)
- Angular velocity from the gyroscope (X, Y, Z axes)
- Various descriptive statistics based on the above measurements
Also, 2 additional pieces of information are included as variables:
- Corresponding activity
- Subject identifier
Data processing
The different activities were relatively evenly distributed, and the same applies to the observations per subject, so no extreme imbalances were found in this respect.
All the columns referring to measurements are numeric. The subject identifier is an integer and the activity is of type character; the latter is transformed to a factor to assist with processing in R, given that activity is the dependent variable of this dataset.
Prior to the analysis, a number of additional data transformations needed to take place. There were some issues; for example, a number of variables appear to have the same names but different values. Specifically, the bandsEnergy-related variables are repeated in sets of 3: columns 303-316, 317-330 and 331-344 have the same column names.
To fix this, the variables were renamed in such a way as to avoid duplication and possible problems in the analysis of the data. These transformations were made to enable more flexible handling of the data in R.
Moreover, the variable names were cleaned by removing punctuation such as “()” and “-” characters, to make them syntactically valid.
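A minimal sketch of these cleaning steps, assuming the loaded data frame is called `samsungData` (the object name in the source .rda file):

```r
load("samsungData.rda")

# Make duplicated column names unique and strip punctuation such as "()"
# and "-", producing syntactically valid R names
names(samsungData) <- make.names(names(samsungData), unique = TRUE)

# The dependent variable: activity as a factor
samsungData$activity <- as.factor(samsungData$activity)
```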
An unusual observation is that all the numeric data appear to lie within the range -1 to 1, but this turns out to be because the data is normalised.
There were no missing values (only complete cases were observed) and, apart from the transformations mentioned above, no other data type transformations were found to be necessary.
The dataset was split into two sets for training and testing, based on subject ids. The training set includes the subject ids 1, 3, 5 and 6, a total of 4 subjects corresponding to 328 observations. The test set includes ids 27, 28, 29 and 30, a total of 4 subjects corresponding to 371 observations.
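A sketch of this subject-based split, using the ids stated above:

```r
# Train/test split by subject id
train <- samsungData[samsungData$subject %in% c(1, 3, 5, 6), ]
test  <- samsungData[samsungData$subject %in% c(27, 28, 29, 30), ]
```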
Results:
The selected method for the analysis is classification trees. Trees are particularly useful when there are many explanatory variables. If the “twigs” of the tree are categorical, a classification tree is recommended in order to partition the data into groups that are as homogeneous as possible (Pekelis, 2013).
The next decision is the selection of the variables to be integrated into the model. Given the very high number of columns in the dataset, it was decided that it would not be meaningful to examine each one individually. The first exploratory attempt was to fit a tree on the first variable set, related to body acceleration (tBodyAcc), which comprises the first 15 variables of the data set. A classification tree was grown on the training set, and the predictive model was tested on the test data. The misclassification rate was as high as 41%, which led to the straightforward abandonment of this first set of variables.
As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen: include all the variables (with the obvious exception of the subject id) and let the classification tree algorithm choose the critical nodes, as sketched below.
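A minimal sketch of this comprehensive fit, using the tree package as one plausible implementation (an assumption; the report does not name the package) and the train/test split defined earlier:

```r
library(tree)

# Grow a classification tree on all predictors except the subject id
fit <- tree(activity ~ . - subject, data = train)
summary(fit)   # variables used as nodes, training misclassification rate

# Evaluate on the held-out subjects
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$activity)   # test misclassification rate
```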
The following variables were selected as nodes:
"tBodyAcc.std.X" "tGravityAcc.mean.X" "tGravityAcc.max...Y"
"tGravityAcc.mean.Y" "fBodyGyro.meanFreq.X" "tGravityAcc.arCoeff.X.1"
"tBodyAccJerk.max.X" "tBodyGyroMag.arCoeff..1" "fBodyAcc.bandsEnergy.1.8"
"tBodyAcc.arCoeff.X.3"
These variables correspond to various summary-statistic measurements from the two sensors.
The misclassification error rate was ~3% on the training set, failing to classify the activity correctly just nine times out of 328. However, the value of the model is not proven unless it is applied to the test set. When evaluated against the test set, the prediction error reaches 19.6%, so the model is successful in predicting over 80% of the cases.
The most significant nodes at the top levels of the tree are the body acceleration standard deviation on X, the gravity acceleration mean on X and the gravity acceleration AR coefficient on X.
To check whether the tree has any potential to improve its performance, a cross-validation experiment was run to relate the deviance and the number of misclassifications to the size of the model. From the resulting graph, it appears that the model has the fewest misclassifications and the lowest deviance at a size of 8. Given this result, the next step is to prune the tree by limiting the number of terminal nodes to 8.
Fitting the 8-node model to the test data produces a 20.2% error rate, marginally higher than that of the unpruned model, in exchange for a simpler tree.
Conclusions:
Based on the results of the data analysis, the classification tree method proved effective in handling a large number of variables which, due to their number and nature, could not have been analysed one by one. That said, there is definitely room for improvement, especially if an analysis can look deeper into the meaning of the variables and identify patterns and relationships between them that could guide the selection of variables for the model, instead of opting for the comprehensive approach taken here.
In the relatively straightforward approach adopted, no potential confounders were identified, as the model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. It is also possible to use techniques that are likely to improve on the initial classification tree results, such as random forests (Chen, 2009).
Works Cited
Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online]
Available at: stat.fsu.edu/~fchen/prediction-tool.pdf
[Accessed 05 12 2015].
Pekelis, L., 2013. Classification And Regression Trees : A Practical Guide for Describing a Dataset.
[Online]
Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf
[Accessed 04 12 2015].
Theodoridis, Y., 1996. A model for the prediction of R-tree performance. ACM Digital Library. [Online]
Available at: dl.acm.org/citation.cfm?id=237705
[Accessed 07 12 2015].
Title: Analysis of user interaction with online ads on a major news website
The dataset
The dataset is part of a sequence of files containing daily click-through data for online ads, based on user characteristics, as recorded on the New York Times website in May 2012. The datasets are available in the “Doing Data Science” book’s (Schutt, 2013) GitHub repo; in this analysis only the first day of available data is considered. It contains over 458,000 observations and 5 variables:
 Age of User – numerical variable
 Gender – binary variable
 Signed_In – binary variable representing whether the user was logged in or not
 Impressions – the number of ad impressions during the session
 Clicks – the number of click-throughs to one or more ads on the website
Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nevertheless be used to conduct a basic analysis of user behaviour on the website in relation to interaction with the ad content on the site.
Configurations: Setting up the Environment
The platform used for this analysis was IBM Bluemix, which, via the integrated notebook interface, provides access to Apache Spark, an open source, in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework.
The dataset, in CSV format, is first uploaded as a new data source on Bluemix using the Object Storage service, which is associated with the required credentials.
The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter.
Next, using the “insert code” function of the data source, a dictionary with the credentials associated with the data source is created and then passed as an argument to the Hadoop configuration function in order to activate the service.
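A minimal sketch of such a configuration function, in the style of the Bluemix/Object Storage notebooks of the time; the exact credential keys are assumptions and would need to match the dictionary produced by the “insert code” function:

```python
def set_hadoop_config(credentials):
    """Register Object Storage credentials so Spark can read swift:// paths."""
    prefix = "fs.swift.service." + credentials["name"]
    hconf = sc._jsc.hadoopConfiguration()  # sc: the notebook's SparkContext
    hconf.set(prefix + ".auth.url", credentials["auth_url"] + "/v2.0/tokens")
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)  # credentials: the inserted dictionary
```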
For the data processing activities that follow, PySpark, the Python API to Spark, is used. The data structure used throughout the analysis is the Spark DataFrame.
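Loading the CSV into a DataFrame might then look as follows; the swift:// path, the container name and the spark-csv package (standard in Spark 1.x notebooks) are assumptions:

```python
path = "swift://notebooks.spark/nyt1.csv"  # hypothetical container/object name

df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")        # first row holds the column names
      .option("inferSchema", "true")   # detect numeric columns automatically
      .load(path))

df.printSchema()  # Age, Gender, Impressions, Clicks, Signed_In
```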
Algorithms, results, challenges:
The algorithms used for the analysis are for the most part of the ‘split-apply-combine’ category, whereby the data are grouped based on an attribute and a function is then applied to the grouped data, summarising the values within each group into one value. This is in line with the MapReduce principle of creating key-value pairs, grouping by key (so that each key is associated with the sequence of values of same-key entries), and then reducing that sequence to one aggregate value representing all the values under the common key.
Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap provide equivalent functionality.
In Spark the above procedure is a transformation, which is lazily evaluated only when an action is performed, i.e. when an answer from the system is explicitly requested. Examples follow, with a short code sketch after the list.
 For example, the observations are grouped by gender and a function is applied to the groups that outputs the mean age by gender (22.9 for males and 40.8 for females).
 Other algorithms transform existing variables into new ones; for example, the numbers of clicks and impressions are combined as a ratio to produce the click-through rate.
 In other cases, SQL-type analysis is deployed, for example to filter the observations, keeping only a subset that belongs to, for instance, the 25-35 age group, and then focusing the analysis on that specific segment.
 There are also implementations of summary statistics, including the count of observations (458,441).
 The mean values of the key variables: for example, the average age is 29.4 years, the average number of impressions is 5, and the average number of clicks is just below 0.1.
 The minimum reported age is 0 and the maximum 99; the maximum number of ad impressions is 9 and the maximum number of clicks is 4.
 The Pearson correlation between impressions and clicks is 0.13, which is positive but relatively weak, implying that more ad impressions do not always lead to more clicks.
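A hedged sketch of these operations with the DataFrame API; the column names follow the dataset description, and the exact calls in the attached notebook may differ:

```python
from pyspark.sql.functions import avg, col

# split-apply-combine: mean age by gender
df.groupBy("Gender").agg(avg("Age")).show()

# derived variable: click-through rate from clicks and impressions
ctr = df.withColumn("CTR", col("Clicks") / col("Impressions"))

# SQL-style filter: the 25-35 age segment
segment = df.filter((col("Age") >= 25) & (col("Age") <= 35))

# summary statistics (count, mean, min, max) and Pearson correlation
df.describe("Age", "Impressions", "Clicks").show()
print(df.stat.corr("Impressions", "Clicks"))
```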
The algorithms used along with the full results are presented in detail in the attached Jupyter
notebook.
One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need for suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster.
Data manipulation with RDDs, however, is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for this analysis: thanks to the named-columns attribute it supports, the data analysis tasks become more intuitive, as well as more efficient in terms of computational speed, compared with RDDs.
Works Cited
Anon., 2015. Spark Programming Guide. [Online]
Available at: spark.apache.org/docs/latest/programming-guide.html
Owen, S., 2014. How-to: Translate from MapReduce to Apache Spark. [Online]
Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Schutt, R., 2013. Doing Data Science. [Online]
Available at: https://github.com/oreillymedia/doing_data_science
Programming for big data

Contenu connexe

Similaire à Programming for big data

The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...
RyanMHolcomb
 
Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616
Heather Lamoureux
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Heather Lamoureux
 
presentation
presentationpresentation
presentation
hmagrissy
 

Similaire à Programming for big data (20)

The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...
 
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
 
fast publication journals
fast publication journalsfast publication journals
fast publication journals
 
Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616
 
Reduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryReduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage Industry
 
Mortgage Banking: A Holistic Approach to Managing Compliance Risk
Mortgage Banking: A Holistic Approach to Managing Compliance RiskMortgage Banking: A Holistic Approach to Managing Compliance Risk
Mortgage Banking: A Holistic Approach to Managing Compliance Risk
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
 
03_AJMS_298_21.pdf
03_AJMS_298_21.pdf03_AJMS_298_21.pdf
03_AJMS_298_21.pdf
 
Mortgage Insurance Data Organization Havlicek Mrotek
Mortgage Insurance Data Organization Havlicek MrotekMortgage Insurance Data Organization Havlicek Mrotek
Mortgage Insurance Data Organization Havlicek Mrotek
 
Lendit PostShow SlideShare
Lendit PostShow SlideShareLendit PostShow SlideShare
Lendit PostShow SlideShare
 
loan.docx
loan.docxloan.docx
loan.docx
 
How to Justify a Change in Your ALLL
How to Justify a Change in Your ALLLHow to Justify a Change in Your ALLL
How to Justify a Change in Your ALLL
 
Busting Credit Score Myths
Busting Credit Score MythsBusting Credit Score Myths
Busting Credit Score Myths
 
AI-based credit scoring - An Overview.pdf
AI-based credit scoring - An Overview.pdfAI-based credit scoring - An Overview.pdf
AI-based credit scoring - An Overview.pdf
 
Business research report proposal customer delight in banking
Business research report proposal customer delight in bankingBusiness research report proposal customer delight in banking
Business research report proposal customer delight in banking
 
Essay On Stamford International Inc
Essay On Stamford International IncEssay On Stamford International Inc
Essay On Stamford International Inc
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Real matters march 2018 presentation
Real matters   march 2018 presentationReal matters   march 2018 presentation
Real matters march 2018 presentation
 
presentation
presentationpresentation
presentation
 
Remaking IT for New U.S. Mortgage Rule Compliance
Remaking IT for New U.S. Mortgage Rule ComplianceRemaking IT for New U.S. Mortgage Rule Compliance
Remaking IT for New U.S. Mortgage Rule Compliance
 

Plus de Alex Papageorgiou

Plus de Alex Papageorgiou (15)

Webinar Advanced marketing analytics
Webinar Advanced marketing analyticsWebinar Advanced marketing analytics
Webinar Advanced marketing analytics
 
Kaggle for digital analysts
Kaggle for digital analystsKaggle for digital analysts
Kaggle for digital analysts
 
Kaggle for Analysts - MeasureCamp London 2019
Kaggle for Analysts - MeasureCamp London 2019Kaggle for Analysts - MeasureCamp London 2019
Kaggle for Analysts - MeasureCamp London 2019
 
Travel information search: the presence of social media
Travel information search: the presence of social mediaTravel information search: the presence of social media
Travel information search: the presence of social media
 
The Kaggle Experience from a Digital Analysts' Perspective
The Kaggle Experience from a Digital Analysts' PerspectiveThe Kaggle Experience from a Digital Analysts' Perspective
The Kaggle Experience from a Digital Analysts' Perspective
 
Clickstream analytics with Markov Chains
Clickstream analytics with Markov ChainsClickstream analytics with Markov Chains
Clickstream analytics with Markov Chains
 
Growth Analytics: Evolution, Community and Tools
Growth Analytics: Evolution, Community and ToolsGrowth Analytics: Evolution, Community and Tools
Growth Analytics: Evolution, Community and Tools
 
Clickstream Analytics with Markov Chains
Clickstream Analytics with Markov Chains Clickstream Analytics with Markov Chains
Clickstream Analytics with Markov Chains
 
The impact of search ads on organic search traffic
The impact of search ads on organic search trafficThe impact of search ads on organic search traffic
The impact of search ads on organic search traffic
 
Prediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey dataPrediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey data
 
E com conversion prediction and optimisation
E com conversion prediction and optimisationE com conversion prediction and optimisation
E com conversion prediction and optimisation
 
Web analytics with R
Web analytics with RWeb analytics with R
Web analytics with R
 
Data science with Google Analytics @MeasureCamp
Data science with Google Analytics @MeasureCampData science with Google Analytics @MeasureCamp
Data science with Google Analytics @MeasureCamp
 
Intro to AdWords eMTI
Intro to AdWords eMTIIntro to AdWords eMTI
Intro to AdWords eMTI
 
Social Media And Civil Society
Social Media And Civil SocietySocial Media And Civil Society
Social Media And Civil Society
 

Dernier

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Dernier (20)

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Programming for big data

  • 1. Higher Diploma in Data Analytics Programming for Big Data Project Alexandros Papageorgiou Student ID: 15019004 Analysis I: The major factors determining the prediction of interest rate for Lending Club Loan request. Analysis II: Prediction of activity based on mobile phone spatial measurements Analysis III: Analysis of user interaction with online ads on a major news website with Spark
  • 2. The major factors determining the prediction of interest rate for Lending Club Loan request. Objectives of the analysis Loans are common place in nowadays and advances in the finance industry have made the process of requesting a loan a highly automated process. A major component in this process is the level of interest rate. This is determined based on a number of factors both from the applicant’s credit history as well as the application data submitted with the request like their employment history, credit history, and creditworthiness scores (lendingclub.com, 2015). Determining the interest rate can be a complex task that requires advanced data analysis. The purpose of this analysis is to spot the association between interest rates and a number of other factors based on the loan application data (such as their employment history, credit history, and creditworthiness scores) as well as data provided by external sources in order to get a better understanding of how the interest rate is determined and attempt to quantify these relationships. Particularly this study investigates beyond FICO (the main measure of the credit worthiness of the applicant) which are the other factors that can have an impact. Using exploratory analysis and standard multiple regression techniques it is demonstrated that there is significant relationship between Interest rate and FICO as well as 2 other variables (amount requested and length of the loan) Dataset Description For this analysis a 2500 sample observations dataset was used containing 2500 observations (rows) and 14 variables (columns) from the lending club website downloaded using the R programming language (R-Core-Team, 2015). The lending club data used in this analysis contains observations in code names as seen below, measuring the following  Amount.Requested: The amount (in dollars) requested in the loan application  Amount.Funded.By.Investors: The amount (in dollars) loaned to the individual  Interest.rate: The lending interest rate.  Loan.length: The length of time (in months) of the loan  Loan.Purpose: The purpose of the loan as stated by the applicant  Debt.to.Income.Ratio: The percentage of consumer’s gross income that goes towards paying debts  State: The abbreviation for the U.S. state of residence of the loan applicant  Home.ownsership: A variable indicating whether the applicant owns, rents, or has a mortgage on their home.  Monthly.income: The monthly income of the applicant (in dollars).  FICO.range: A range indicating the applicants FICO score. This is a measure of the credit worthiness of the applicant  Open.Credit.Lines: The number of open lines of credit the applicant had at the time of application.  Revolving.Credit.Balance: The total amount outstanding all lines of credit  Inquiries.in.the.Last.6.Months: The number of authorized queries in the 6 months before the loan was issued.  Employment.Length: Length of time employee at current job.
  • 3. Challenge: Data not in a tidy form Exploratory analysis was the method used via constructing plots and relevant tables to examine the quality of the data provided and explore possible associations between interest rate and the independent variables. This was after handling the 7 missing values found, making sure that analysis is performed based on complete cases. This was based on the assumption of no significant effect on the analysis due to low size of the missing values. Other data type transformations:  Several factor or character variables converted to numerical  Removal of % symbol from interest rate,  FICO range converted from a range in to a single figure  Renaming of variables where appropriate. Rationale for the transformations: Those transformations were made in order to enable a more flexible handling of the data through R especially by transforming them in numerical forms. To relate interest rate with its major components a standard linear regression model was deployed. The model selection was performed on the basis of the exploratory analysis and prior knowledge of the relationship between interest rate and the factors that are considered critical to its determination. Data processing activities As noted above a minimal number of missing values were identified and where appropriate removed, beyond that the data was found to be within normal and acceptable ranges without any extremities in interest rates and the other independent variables either. The final dataset was in line with the tidy data rule (Wickham, 2015). As a first step in the exploratory analysis a correlation analysis of all the numeric variables was introduced in order to identify possible associations among them and particularly the ones that correlate well with the interest rate. The results of this first analysis reveal quite a high negative type correlation among the interest rate and the FICO score (r=-0.7) and there is also some correlation with amount requested and amount funded on the level of r=0.33. The correlation among other variables with interest rate was relatively low. To carry on with the analysis the information provided on the club’s website was considered, which mentions as credit risk indicators factored into the model for the interest rate, the following:  Requested amount loan  Loan maturity (36 or 60 months)  Debt to income ratio  Length of credit history  Number of other accounts opened  Payment history  Number of other credit inquiries initiated over the past six months.
  • 4. The FICO score is also quite explicitly mentioned as a decisive factor so in this context this will unavoidably be one of the variables that will define the model. A number of experiments through box plots were performed in order to identify possible relationships of the interest rate with categorical variable. It turned out that what seems to have an impact on the interest rate is the length of the loan. Obviously overloading the model by including all the variables is not the optimal strategy (Cohen, 2009) and therefore as selection has to be made based on the results of the correlation analysis, the box plots for the categorical variables and the information provided on the web site. After testing with a number of models the one that was found to be best fit for this analysis is the following: Interest Rate= b0 + b1(FICO Score) + b2 (Requested Amount) + b3(Length of the Loan) + e where b0 is an intercept term and b1 represents the change of the (negative) interest rate for a given change of one unit of FICO score, similarly b2 represents the impact on interest for a one dollar increase of the requested amount. The term length of loan is a categorical two level variable that represents the change of the interest rate with a change from 36 months to 60 months of loan period, at average levels of the other two independent variables. The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015) For the length of loan a set of dummy variables were implemented so that the R function can interpret the data more effectively. As term of reference was selected “36 months”. In the case of amount obviously due to confounder concerns (the two variables correlate not just with the interest rate but most obviously between themselves as well) just the amount requested was included in the regression model (the funded amount obviously directly depends on the originally requested amount) We observed: highly statistically significant (P =2e-16) association between interest rate and FICO score. A change of one unit FICO corresponded to a negative change of b1 = 8.9 on Interest rate (95% Confidence Interval: -8.984321e-02 -8.507495e-02). Association between interest rate and amount requested (P =2e-16). A change of one unit amount requested corresponded to a change of b2 = 1.446e-04 on Interest rate (95% Confidence Interval: 1.319564e-04 1.573394e-04). Last, with an intercept of 7.245e+01 it is observed that this is the amount of interest rate that corresponds when all the coefficients are set to zero which corresponds to the projected value for the 36 month period of loan, while when the coefficient takes the value of one, this corresponds to the value for the 60 month length. (P =2e-16) The model has an Adjusted R-squared: 0.7454 which corresponds to the amount of variation that is explained by the model. A -limited in scope analysis- of residuals to compare the effectiveness of a multiple regression against the simple linear regression model shows that non-random residual variance is better fitted with the second one.
  • 5. Conclusions: The analysis suggests that there is a significant, positive association between Interest rate and FICO score as well as factors such as loan length and amount of loan requested. The analysis estimated the relationship using a linear model. This analysis provides some insights with regard to the ways a loan institution like lending club determines the cost of money for its customers, it therefore makes sense for the borrowers to be aware of the major factors that determine the interest rate they will be asked to pay and possibly based on this knowledge take action that could contribute in more favourable terms (for example ask for a lower amount and return the loan sooner rather than later. It is important to keep in mind that this study is a result of a limited dataset of just one institution and therefore it might be subject to bias. As time goes on and depending also on other parameters of the national and international economy other factors might come to play critical roles too. In any case an informed customer who is aware of this type of analysis is likely to make better decisions in his or her loan purchase. Works Cited Cohen, Y., 2009. Statistics and Data with R. s.l.:Wiley. lendingclub.com, 2015. Interest rates and how we set them. [Online] Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action [Accessed 25 11 2015]. R-Core-Team, 2015. R: A language and environment for statistical computing. [Online] Available at: http://www.R-project.org Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online] Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html Wickham, H., 2015. Tidy Data. [Online] Available at: http://vita.had.co.nz/papers/tidy-data.pdf
  • 6. Title: Prediction of activity based on mobile phone spatial measurements Introduction and Objectives: Advances in technology of mobile phones and the proliferation of smart devices have enabled the collection of spatial data of smart phone users with the intention of studying the relation between the measurements registered with the devices and the corresponding synchronous activity of the subjects. Data analysis methodology will be used with a prediction model to determine user activity based on a wide range of signals related to body motion. In particular the above analysis is based on the records of the Activity Recognition database which was built from the recordings of 30 subjects doing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors including accelerometers and gyroscopes. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones. Data Description A group of 30 volunteers were selected for this task from the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted Smartphone. The six selected ADLs were standing, sitting, laying down, walking, walking downstairs and upstairs. The respective. data set was downloaded from the URL https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed to facilitate its use within the R environment by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis). The data consists of 7352 entities (rows), each of which corresponds to a time indexed activity of each of the 21 subjects and 563 variables (columns) corresponding to measurements of two sensors. Specifically for each record, the data provided: - Acceleration from the accelerometer (total acceleration) and the estimated body acceleration. ( X, Y, Z axis) - Angular velocity from the gyroscope. (X, Y, Z axis) -Various descriptive statistics based on the above measurements Also, 2 additional pieces of information included as variables -Corresponding activity -Subject identifier Data processing The different activities were relatively evenly distributed and the same applies to the observations for every subject, so no extremes found in this context.
  • 7. All the columns referring to measurement are numeric. The subject is integer and the activity has the type of character. This type is transformed to factor to assist with R processing given that activity is the actual dependent variable of this dataset. Prior to the analysis, a number of additional data transformations needed to take place. There are some issues, for example a number of variables appear to have the same names but different values. In specific, the bandsEnergy-related variables, are repeated in sets of 3. For example, columns 303- 316,317-330, and 331-344 have the same column names. To fix this the variables were renamed in such a way to avoid duplication and possible problems with the analysis of the data. Those transformations were made in order to enable a more flexible handling of the data through R. Moreover the variable names are cleaned by removing some punctuation like “( )” and “-“ characters in names to make it make syntactically valid. An unusual fact observed is that all the numeric data appear to be within a range of -1 and 1 but this turns out to be because the data is normalized. There were no missing values found just complete cases observed and except the above mentioned no other data type transformations were found to be necessary. The dataset was split in two sets for training and testing. On a random split training set include the subject ids 1,3,5 and 6- Total of 4 samples corresponding to 328 observations. The test set includes ids 27, 28, 29 and 30. Therefore a total of 4 samples corresponding to 371 observations. Results: The selected method for the analysis is classification trees. Trees are particularly useful when there are have many explanatory variables. If the “twigs” of the tree are categorical a classification tree is recommended in order to partition the data ultimately into groups that are as homogeneous as possible (Pekelis, 2013) The next call to make is the selection of the variables that will be integrated into the model. It was decided that with the very high number of columns in the dataset, it might not be meaningful to examine each one individually. The first exploratory attempt to fit the tree model was with test tree prediction on the first variable set, related to Body acceleration (tBodyAcc) which includes the first 15 variables of the data set. A classification tree was grown through the training set, and the predictive model was tested on the test data. The misclassification rate was as high as 41%. This led to the straightforward abandonment of this first set of variables. As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen, that would include all the variables (with the obvious exception of subject id) and let the classification tree algorithm choose the critical nodes. There were 11 variables selected as nodes namely: "tBodyAcc.std.X" "tGravityAcc.mean.X" "tGravityAcc.max...Y" "tGravityAcc.mean.Y" "fBodyGyro.meanFreq.X" "tGravityAcc.arCoeff.X.1" "tBodyAccJerk.max.X" "tBodyGyroMag.arCoeff..1" "fBodyAcc.bandsEnergy.1.8" "tBodyAcc.arCoeff.X.3"
  • 8. These variables correspond to various summary statistics measurements from the two sensors The misclassification error rate was ~3 % on the training set, practically failing to classify correctly the activity just nine time out of 328. The value of the model will not be proven unless applied on to the test set. Once it performs versus the test set the error in prediction reaches 19.6% therefore being successful in predicting over 80% of the cases. The most significant nodes at the higher level of the tree are the Body acceleration standard deviation of X the Gravity acceleration mean of X and the Gravity acceleration coefficient of X. To check if the tree has any potential to improve its performance across validation experiment was made to find the deviance and number of misclassifications as related to the size of our model. Given the graph, it appears to be the case that the model has the least amount of misclassifications and deviance for a model size of 8. Given this result the next step is to prune the tree by predefining the number of nodes to be used to 8. Fitting the model of 8 nodes to the test data produces 20.2 % error rate so it is marginally lower than the previous model. Conclusions: Based on the results of the data analysis it turns out that the classification tree method was effective into handling a large number variables, which due to size and the nature of the variables, they would not have been able to be analyzed one by one. That said, there is definitely room for improvement especially if an analysis has the possibility to look deeper into the meaning of the variables and identifies patterns and relationships between them that could help regarding the selection of the variables for the model, instead of opting for the comprehensive approach as it was the case with analysis. In this relatively straightforward approach adopted, no potential confounders were identified, as this model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. Also possible to use are techniques that are likely to improve the initial classification tree results, such as random forests (Chen, 2009) Works Cited Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online] Available at: stat.fsu.edu/~fchen/prediction-tool.pdf [Accessed 05 12 2015]. Pekelis, L., 2013. Classification And Regression Trees : A Practical Guide for Describing a Dataset. [Online] Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf [Accessed 04 12 2015]. Y.Theodoridis, 1996. ACM Digital Library-A model for the prediction of R-tree performance. [Online] Available at: dl.acm.org/citation.cfm?id=237705 [Accessed 07 12 2015].
Title: Analysis of user interaction with online ads on a major news website

The dataset

The dataset is part of a sequence of files that record daily click-throughs to online ads, based on user characteristics as recorded on the New York Times website in May 2012. The datasets are available on the Github repo of the "Doing Data Science" book (Schutt, 2013); in this analysis only the first day of available data is considered. It contains over 458,000 observations and 5 variables:

 Age of User – numerical variable
 Gender – binary variable
 Signed_In – binary variable indicating whether the user was logged in or not
 Impressions – the number of ad impressions during the session
 Clicks – the number of click-throughs to one or more ads on the website

Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nevertheless be used to conduct a basic analysis of user behaviour on the website in relation to interaction with the ad content on the site.

Configurations: Setting up the environment

The platform used for this analysis was IBM Bluemix, which, via the integrated notebook interface, provides access to Apache Spark, an open source, in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework.

The dataset in csv format is first uploaded as a new data source on Bluemix with the Object Storage service, which is associated with the required credentials. The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter. Next, using the "insert code" function of the data sources panel, a dictionary with the credentials associated with the data source is created; it is then passed as an argument to the set-Hadoop-configuration function in order to activate the service (a sketch of such a function is given below). For the data processing activities that follow, PySpark, the Python API to Spark, is used. The data structure used throughout the analysis is the Spark DataFrame.

Algorithms, results, challenges:

The algorithms used for the analysis mostly belong to the 'split-apply-combine' category: the data are grouped on an attribute, and a function is then applied to the grouped data, summarising the values within each group into one value. This is in line with the MapReduce principles of creating key-value pairs, grouping by key so that the values of same-key entries form a sequence associated with that key, and then reducing that sequence to one aggregate value representing all the values under the common key.
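A minimal sketch of such a configuration function, following the Swift-based Object Storage pattern used in Bluemix notebooks at the time; the exact configuration keys and credential field names are assumptions and may vary with the service version.

# Sketch only: point Spark's Hadoop layer at Bluemix Object Storage (Swift).
# The key names and credential fields below are assumptions based on the
# Bluemix notebook pattern; `sc` is the SparkContext provided by the notebook.
def set_hadoop_config(credentials):
    prefix = "fs.swift.service." + credentials["name"]
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials["auth_url"] + "/v2.0/tokens")
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])
    hconf.setBoolean(prefix + ".public", True)

# The credentials dictionary generated by the "insert code" function is then
# passed in to activate the service, e.g. set_hadoop_config(credentials_1).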
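The split-apply-combine operations described above, whose results are reported on the next page, might look as follows in PySpark. The file path, and the use of a generic SparkSession to keep the sketch self-contained, are assumptions; on Bluemix the data would be read from the Object Storage container configured above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load the first day of click data; the path is an assumption, and the header
# names follow the variable list above (Age, Gender, Signed_In, Impressions, Clicks).
df = spark.read.csv("nyt_day1.csv", header=True, inferSchema=True)

# Split-apply-combine: group by gender, then reduce each group to a mean age.
df.groupBy("Gender").agg(F.avg("Age").alias("mean_age")).show()

# Transform existing variables into a new one: the click-through rate.
df = df.withColumn("CTR", F.col("Clicks") / F.col("Impressions"))

# SQL-type filtering: keep only the 25-35 age segment for focused analysis.
segment = df.filter((F.col("Age") >= 25) & (F.col("Age") <= 35))

# Summary statistics and the impressions/clicks Pearson correlation.
print(df.count())
df.describe("Age", "Impressions", "Clicks").show()
print(df.stat.corr("Impressions", "Clicks"))

# The same grouping expressed at the RDD level with key-value pairs and
# reduceByKey; the DataFrame version above is more expressive and faster.
pair_rdd = df.rdd.map(lambda row: (row["Gender"], (row["Age"], 1)))
sums = pair_rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
mean_age_by_gender = sums.mapValues(lambda t: t[0] / t[1]).collect()

The closing lines illustrate why DataFrames were preferred for this analysis: the same aggregation that takes three RDD operations reduces to a single named-column groupBy.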
Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap can deliver equivalent functionality. In Spark the above procedure represents a transformation, which is lazily evaluated: it only runs when an action is performed, i.e. when an answer is explicitly requested from the system.

 For example, the observations are grouped by gender and a function is then applied to the groups that outputs the mean age by gender (22.9 for males and 40.8 for females).
 Other algorithms transform existing variables into new ones; for example, the numbers of clicks and impressions are combined as a ratio to produce the click-through rate.
 In other cases, SQL-type analysis is deployed, for example to filter the observations down to a subset, such as the 25-35 age group, and then focus the analysis on that specific segment.
 There are also implementations of summary statistics, including the count of observations (458,441).
 The mean values of the key variables: the average age is 29.4 years, the average number of impressions is 5, and the average number of clicks just below 0.1.
 The minimum reported age is 0 and the maximum 99; the maximum number of ad impressions is 9 and the maximum number of clicks is 4.
 The Pearson correlation between impressions and clicks is 0.13, a positive but relatively weak relationship, implying that more ad impressions do not always lead to more clicks.

The algorithms used, along with the full results, are presented in detail in the attached Jupyter notebook.

One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need for suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster. Data manipulation with RDDs is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for this analysis: thanks to the named columns they support, the data analysis tasks performed become more intuitive, as well as more efficient in terms of computational speed compared with RDDs.

Works Cited

Anon., 2015. Spark Programming Guide. [Online] Available at: spark.apache.org/docs/latest/programming-guide.html

Owen, S., 2014. How-to: Translate from MapReduce to Apache Spark. [Online] Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

Schutt, R., 2013. Doing Data Science. [Online] Available at: https://github.com/oreillymedia/doing_data_science