Higher Diploma in Data Analytics
Programming for Big Data Project
Alexandros Papageorgiou
Student ID: 15019004
Analysis I:
The major factors determining the interest rate for Lending Club loan requests
Analysis II:
Prediction of activity based on mobile phone spatial measurements
Analysis III:
Analysis of user interaction with online ads on a major news website with Spark
The major factors determining the interest rate for Lending Club loan requests
Objectives of the analysis
Loans are commonplace nowadays, and advances in the finance industry have made requesting a loan a highly automated process. A major component of this process is setting the level of the interest rate.
The rate is determined based on a number of factors drawn both from the applicant’s credit history and from the application data submitted with the request, such as employment history, credit history and creditworthiness scores (lendingclub.com, 2015).
Determining the interest rate can be a complex task that requires advanced data analysis.
The purpose of this analysis is to identify associations between the interest rate and a number of other factors, based on the loan application data (such as employment history, credit history and creditworthiness scores) as well as data provided by external sources, in order to better understand how the interest rate is determined and to attempt to quantify these relationships.
In particular, this study investigates which factors beyond FICO (the main measure of the creditworthiness of the applicant) can have an impact. Using exploratory analysis and standard multiple regression techniques, it is demonstrated that there is a significant relationship between the interest rate and the FICO score, as well as two other variables (amount requested and length of the loan).
Dataset Description
For this analysis, a sample dataset of 2,500 observations (rows) and 14 variables (columns) from the Lending Club website was used, downloaded using the R programming language (R-Core-Team, 2015).
The Lending Club data used in this analysis contains the variables listed below, under their code names, measuring the following:
 Amount.Requested: The amount (in dollars) requested in the loan application
 Amount.Funded.By.Investors: The amount (in dollars) loaned to the individual
 Interest.rate: The lending interest rate
 Loan.length: The length of time (in months) of the loan
 Loan.Purpose: The purpose of the loan as stated by the applicant
 Debt.to.Income.Ratio: The percentage of the consumer’s gross income that goes towards paying debts
 State: The abbreviation for the U.S. state of residence of the loan applicant
 Home.Ownership: A variable indicating whether the applicant owns, rents, or has a mortgage on their home
 Monthly.income: The monthly income of the applicant (in dollars)
 FICO.range: A range indicating the applicant’s FICO score, a measure of the applicant’s creditworthiness
 Open.Credit.Lines: The number of open lines of credit the applicant had at the time of application
 Revolving.Credit.Balance: The total amount outstanding across all lines of credit
 Inquiries.in.the.Last.6.Months: The number of authorized credit inquiries in the 6 months before the loan was issued
 Employment.Length: Length of time employed at the current job
Challenge: Data not in a tidy form
Exploratory analysis was the method used, constructing plots and relevant tables to examine the quality of the data provided and to explore possible associations between the interest rate and the independent variables. This was done after handling the 7 missing values found, ensuring that the analysis is performed on complete cases, under the assumption that such a small number of missing values has no significant effect on the analysis.
Other data type transformations:
 Several factor or character variables converted to numeric
 Removal of the % symbol from the interest rate
 FICO range converted from a range into a single figure
 Renaming of variables where appropriate
Rationale for the transformations:
These transformations were made to enable more flexible handling of the data in R, especially by converting variables to numeric form.
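As an illustration of these cleaning steps, a minimal R sketch follows; the file name and the exact column names are assumptions based on the variable list above, not the report’s actual code.

```r
# Sketch of the cleaning steps described above (assumed names).
loans <- read.csv("loansData.csv", stringsAsFactors = FALSE)

# Strip the "%" symbol and convert the interest rate to numeric
loans$Interest.Rate <- as.numeric(sub("%", "", loans$Interest.Rate))

# Convert the FICO range (e.g. "720-724") into a single numeric figure,
# here the lower bound of the range
loans$FICO <- as.numeric(sub("-.*", "", loans$FICO.Range))

# Keep complete cases only (the 7 rows with missing values are dropped)
loans <- loans[complete.cases(loans), ]
```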
To relate the interest rate to its major components, a standard linear regression model was deployed. Model selection was performed on the basis of the exploratory analysis and prior knowledge of the relationship between the interest rate and the factors considered critical to its determination.
Data processing activities
As noted above, a minimal number of missing values were identified and, where appropriate, removed. Beyond that, the data was found to be within normal and acceptable ranges, without extreme values in the interest rate or in the other independent variables.
The final dataset was in line with the tidy data principles (Wickham, 2015).
As a first step in the exploratory analysis, a correlation analysis of all the numeric variables was performed in order to identify possible associations among them, and particularly the variables that correlate well with the interest rate.
The results of this first analysis reveal quite a strong negative correlation between the interest rate and the FICO score (r = -0.7); there is also some correlation with the amount requested and the amount funded, at the level of r = 0.33.
The correlation of the other variables with the interest rate was relatively low.
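A brief sketch of such a correlation screen, assuming the cleaned `loans` data frame from the sketch above:

```r
# Correlation of every numeric variable with the interest rate
num_vars <- loans[sapply(loans, is.numeric)]
round(cor(num_vars, use = "complete.obs")[, "Interest.Rate"], 2)
```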
To carry the analysis forward, the information provided on the club’s website was considered; it mentions the following credit risk indicators as factors in the interest rate model:
 Requested loan amount
 Loan maturity (36 or 60 months)
 Debt to income ratio
 Length of credit history
 Number of other accounts opened
 Payment history
 Number of other credit inquiries initiated over the past six months.
The FICO score is also quite explicitly mentioned as a decisive factor, so in this context it will unavoidably be one of the variables that define the model.
A number of experiments with box plots were performed in order to identify possible relationships between the interest rate and the categorical variables. It turned out that the categorical variable that seems to have an impact on the interest rate is the length of the loan.
Obviously, overloading the model by including all the variables is not the optimal strategy (Cohen, 2009), so a selection has to be made based on the results of the correlation analysis, the box plots for the categorical variables and the information provided on the website.
After testing a number of models, the one found to best fit this analysis is the following:
Interest Rate = b0 + b1(FICO Score) + b2(Requested Amount) + b3(Length of Loan) + e
where b0 is an intercept term and b1 represents the (negative) change in the interest rate for a one-unit change in the FICO score; similarly, b2 represents the impact on the interest rate of a one-dollar increase in the requested amount. Length of loan is a two-level categorical variable whose coefficient b3 represents the change in the interest rate when moving from a 36-month to a 60-month loan period, at average levels of the other two independent variables.
The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015).
For the length of loan, a dummy variable was created so that the R function can interpret the data more effectively; “36 months” was selected as the reference level.
In the case of the amount variables, due to confounding concerns (the two variables correlate not just with the interest rate but, obviously, also with each other), only the amount requested was included in the regression model, since the funded amount directly depends on the originally requested amount.
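A minimal sketch of this model fit, assuming the `loans` data frame from the earlier sketches (the variable names are assumptions, not the report’s actual code):

```r
# "36 months" as the reference level for the loan-length dummy
loans$Loan.Length <- relevel(factor(loans$Loan.Length), ref = "36 months")

# Multiple regression of interest rate on FICO, amount requested and loan length
fit <- lm(Interest.Rate ~ FICO + Amount.Requested + Loan.Length, data = loans)
summary(fit)   # coefficients, p-values, adjusted R-squared
confint(fit)   # 95% confidence intervals, as reported below
```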
We observed a highly statistically significant (p < 2e-16) association between the interest rate and the FICO score: a one-unit increase in FICO corresponded to a change of approximately b1 = -0.087 in the interest rate (95% confidence interval: -8.98e-02 to -8.51e-02).
There was also a significant association between the interest rate and the amount requested (p < 2e-16): a one-dollar increase in the amount requested corresponded to a change of b2 = 1.446e-04 in the interest rate (95% confidence interval: 1.32e-04 to 1.57e-04).
Last, the intercept of 7.245e+01 is the interest rate projected when all predictors are set to zero; with the loan-length dummy at zero this corresponds to the 36-month loan period, while a dummy value of one corresponds to the 60-month length (p < 2e-16).
The model has an adjusted R-squared of 0.7454, which corresponds to the proportion of variation explained by the model.
A limited-in-scope analysis of residuals, comparing the effectiveness of the multiple regression against a simple linear regression model, shows that the non-random residual variance left by the simple model is better accounted for by the multiple regression.
Conclusions:
The analysis suggests that there is a significant (negative) association between the interest rate and the FICO score, as well as significant associations with factors such as loan length and the amount requested. The analysis estimated these relationships using a linear model.
This analysis provides some insights into the way a loan institution like Lending Club determines the cost of money for its customers. It therefore makes sense for borrowers to be aware of the major factors that determine the interest rate they will be asked to pay, and possibly to act on this knowledge in ways that could contribute to more favourable terms (for example, asking for a lower amount and repaying the loan sooner rather than later).
It is important to keep in mind that this study is based on a limited dataset from just one institution and might therefore be subject to bias. As time goes on, and depending also on other parameters of the national and international economy, other factors might come to play critical roles too. In any case, an informed customer who is aware of this type of analysis is likely to make better decisions in his or her loan purchase.
Works Cited
Cohen, Y., 2009. Statistics and Data with R. s.l.:Wiley.
lendingclub.com, 2015. Interest rates and how we set them. [Online]
Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action
[Accessed 25 11 2015].
R-Core-Team, 2015. R: A language and environment for statistical computing. [Online]
Available at: http://www.R-project.org
Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online]
Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Wickham, H., 2015. Tidy Data. [Online]
Available at: http://vita.had.co.nz/papers/tidy-data.pdf
Title: Prediction of activity based on mobile phone spatial measurements
Introduction and Objectives:
Advances in mobile phone technology and the proliferation of smart devices have enabled the collection of spatial data from smartphone users, with the intention of studying the relation between the measurements registered by the devices and the corresponding synchronous activity of the subjects.
Data analysis methodology is used here to build a prediction model that determines user activity based on a wide range of signals related to body motion.
In particular, the analysis is based on the records of the Activity Recognition database, which was built from the recordings of 30 subjects performing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors, including an accelerometer and a gyroscope. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones.
Data Description
A group of 30 volunteers was selected for this task by the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted smartphone. The six selected ADLs were standing, sitting, laying down, walking, walking downstairs and walking upstairs.
The respective data set was downloaded from the URL https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis) to facilitate its use within the R environment.
The data consists of 7352 entities (rows), each corresponding to a time-indexed activity of one of the 21 subjects included in this file, and 563 variables (columns) corresponding to measurements from the two sensors.
Specifically, for each record the data provides:
- Acceleration from the accelerometer (total acceleration) and the estimated body acceleration (X, Y, Z axes)
- Angular velocity from the gyroscope (X, Y, Z axes)
- Various descriptive statistics based on the above measurements
Also, 2 additional pieces of information are included as variables:
- Corresponding activity
- Subject identifier
Data processing
The different activities were relatively evenly distributed, and the same applies to the observations per subject, so no extreme imbalances were found in this respect.
All the columns referring to measurements are numeric. The subject identifier is an integer and the activity is of type character; the latter is transformed to a factor to assist with processing in R, given that activity is the dependent variable of this dataset.
Prior to the analysis, a number of additional data transformations needed to take place. There were some issues; for example, a number of variables appear to have the same names but different values. Specifically, the bandsEnergy-related variables are repeated in sets of 3: columns 303-316, 317-330 and 331-344 have the same column names.
To fix this, the variables were renamed in such a way as to avoid duplication and possible problems in the analysis of the data. These transformations were made to enable more flexible handling of the data in R.
Moreover, the variable names were cleaned by removing punctuation such as “()” and “-” characters, to make them syntactically valid.
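A minimal sketch of these cleaning steps, assuming the loaded data frame is called `samsungData` (the object name in the source .rda file):

```r
load("samsungData.rda")

# Make duplicated column names unique and strip punctuation such as "()"
# and "-", producing syntactically valid R names
names(samsungData) <- make.names(names(samsungData), unique = TRUE)

# The dependent variable: activity as a factor
samsungData$activity <- as.factor(samsungData$activity)
```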
An unusual observation is that all the numeric data appear to lie within the range -1 to 1, but this turns out to be because the data is normalised.
There were no missing values (only complete cases were observed) and, apart from the transformations mentioned above, no other data type transformations were found to be necessary.
The dataset was split into two sets for training and testing, based on subject ids. The training set includes the subject ids 1, 3, 5 and 6, a total of 4 subjects corresponding to 328 observations. The test set includes ids 27, 28, 29 and 30, a total of 4 subjects corresponding to 371 observations.
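A sketch of this subject-based split, using the ids stated above:

```r
# Train/test split by subject id
train <- samsungData[samsungData$subject %in% c(1, 3, 5, 6), ]
test  <- samsungData[samsungData$subject %in% c(27, 28, 29, 30), ]
```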
Results:
The selected method for the analysis is classification trees. Trees are particularly useful when there are many explanatory variables. If the “twigs” of the tree are categorical, a classification tree is recommended in order to partition the data into groups that are as homogeneous as possible (Pekelis, 2013).
The next decision is the selection of the variables to be integrated into the model. Given the very high number of columns in the dataset, it was decided that it would not be meaningful to examine each one individually. The first exploratory attempt was to fit a tree on the first variable set, related to body acceleration (tBodyAcc), which comprises the first 15 variables of the data set. A classification tree was grown on the training set, and the predictive model was tested on the test data. The misclassification rate was as high as 41%, which led to the straightforward abandonment of this first set of variables.
As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen: include all the variables (with the obvious exception of the subject id) and let the classification tree algorithm choose the critical nodes, as sketched below.
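A minimal sketch of this comprehensive fit, using the tree package as one plausible implementation (an assumption; the report does not name the package) and the train/test split defined earlier:

```r
library(tree)

# Grow a classification tree on all predictors except the subject id
fit <- tree(activity ~ . - subject, data = train)
summary(fit)   # variables used as nodes, training misclassification rate

# Evaluate on the held-out subjects
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$activity)   # test misclassification rate
```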
The following variables were selected as nodes:
"tBodyAcc.std.X" "tGravityAcc.mean.X" "tGravityAcc.max...Y"
"tGravityAcc.mean.Y" "fBodyGyro.meanFreq.X" "tGravityAcc.arCoeff.X.1"
"tBodyAccJerk.max.X" "tBodyGyroMag.arCoeff..1" "fBodyAcc.bandsEnergy.1.8"
"tBodyAcc.arCoeff.X.3"
These variables correspond to various summary-statistic measurements from the two sensors.
The misclassification error rate was ~3% on the training set, failing to classify the activity correctly just nine times out of 328. However, the value of the model is not proven unless it is applied to the test set. When evaluated against the test set, the prediction error reaches 19.6%, so the model is successful in predicting over 80% of the cases.
The most significant nodes at the top levels of the tree are the body acceleration standard deviation on X, the gravity acceleration mean on X and the gravity acceleration AR coefficient on X.
To check whether the tree has any potential to improve its performance, a cross-validation experiment was run to relate the deviance and the number of misclassifications to the size of the model. From the resulting graph, it appears that the model has the fewest misclassifications and the lowest deviance at a size of 8. Given this result, the next step is to prune the tree by limiting the number of terminal nodes to 8.
Fitting the 8-node model to the test data produces a 20.2% error rate, marginally higher than that of the unpruned model, in exchange for a simpler tree.
Conclusions:
Based on the results of the data analysis, the classification tree method proved effective in handling a large number of variables which, due to their number and nature, could not have been analysed one by one. That said, there is definitely room for improvement, especially if an analysis can look deeper into the meaning of the variables and identify patterns and relationships between them that could guide the selection of variables for the model, instead of opting for the comprehensive approach taken here.
In the relatively straightforward approach adopted, no potential confounders were identified, as the model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. It is also possible to use techniques that are likely to improve on the initial classification tree results, such as random forests (Chen, 2009).
Works Cited
Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online]
Available at: stat.fsu.edu/~fchen/prediction-tool.pdf
[Accessed 05 12 2015].
Pekelis, L., 2013. Classification And Regression Trees : A Practical Guide for Describing a Dataset.
[Online]
Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf
[Accessed 04 12 2015].
Theodoridis, Y., 1996. A model for the prediction of R-tree performance. ACM Digital Library. [Online]
Available at: dl.acm.org/citation.cfm?id=237705
[Accessed 07 12 2015].
Title: Analysis of user interaction with online ads on a major news website
The dataset
The dataset is part of a sequence of files containing daily click-through data for online ads, based on user characteristics, as recorded on the New York Times website in May 2012. The datasets are available in the “Doing Data Science” book’s (Schutt, 2013) GitHub repo; in this analysis only the first day of available data is considered. It contains over 458,000 observations and 5 variables:
 Age of User – numerical variable
 Gender – binary variable
 Signed_In – binary variable representing whether the user was logged in or not
 Impressions – the number of ad impressions during the session
 Clicks – the number of click-throughs to one or more ads on the website
Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nevertheless be used to conduct a basic analysis of user behaviour on the website in relation to interaction with the ad content on the site.
Configurations: Setting up the Environment
The platform used for this analysis was IBM Bluemix, which, via the integrated notebook interface, provides access to Apache Spark, an open source, in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework.
The dataset, in CSV format, is first uploaded as a new data source on Bluemix using the Object Storage service, which is associated with the required credentials.
The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter.
Next, using the “insert code” function of the data source, a dictionary with the credentials associated with the data source is created and then passed as an argument to the Hadoop configuration function in order to activate the service.
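A minimal sketch of such a configuration function, in the style of the Bluemix/Object Storage notebooks of the time; the exact credential keys are assumptions and would need to match the dictionary produced by the “insert code” function:

```python
def set_hadoop_config(credentials):
    """Register Object Storage credentials so Spark can read swift:// paths."""
    prefix = "fs.swift.service." + credentials["name"]
    hconf = sc._jsc.hadoopConfiguration()  # sc: the notebook's SparkContext
    hconf.set(prefix + ".auth.url", credentials["auth_url"] + "/v2.0/tokens")
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)  # credentials: the inserted dictionary
```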
For the data processing activities that follow, PySpark, the Python API to Spark, is used. The data structure used throughout the analysis is the Spark DataFrame.
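Loading the CSV into a DataFrame might then look as follows; the swift:// path, the container name and the spark-csv package (standard in Spark 1.x notebooks) are assumptions:

```python
path = "swift://notebooks.spark/nyt1.csv"  # hypothetical container/object name

df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")        # first row holds the column names
      .option("inferSchema", "true")   # detect numeric columns automatically
      .load(path))

df.printSchema()  # Age, Gender, Impressions, Clicks, Signed_In
```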
Algorithms, results, challenges:
The algorithms used for the analysis are for the most part of the ‘split-apply-combine’ category, whereby the data are grouped based on an attribute and a function is then applied to the grouped data, summarising the values within each group into one value. This is in line with the MapReduce principle of creating key-value pairs, grouping by key (so that each key is associated with the sequence of values of same-key entries), and then reducing that sequence to one aggregate value representing all the values under the common key.
Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap provide equivalent functionality.
In Spark the above procedure is a transformation, which is lazily evaluated only when an action is performed, i.e. when an answer from the system is explicitly requested. Examples follow, with a short code sketch after the list.
 For example, the observations are grouped by gender and a function is applied to the groups that outputs the mean age by gender (22.9 for males and 40.8 for females).
 Other algorithms transform existing variables into new ones; for example, the numbers of clicks and impressions are combined as a ratio to produce the click-through rate.
 In other cases, SQL-type analysis is deployed, for example to filter the observations, keeping only a subset that belongs to, for instance, the 25-35 age group, and then focusing the analysis on that specific segment.
 There are also implementations of summary statistics, including the count of observations (458,441).
 The mean values of the key variables: for example, the average age is 29.4 years, the average number of impressions is 5, and the average number of clicks is just below 0.1.
 The minimum reported age is 0 and the maximum 99; the maximum number of ad impressions is 9 and the maximum number of clicks is 4.
 The Pearson correlation between impressions and clicks is 0.13, which is positive but relatively weak, implying that more ad impressions do not always lead to more clicks.
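A hedged sketch of these operations with the DataFrame API; the column names follow the dataset description, and the exact calls in the attached notebook may differ:

```python
from pyspark.sql.functions import avg, col

# split-apply-combine: mean age by gender
df.groupBy("Gender").agg(avg("Age")).show()

# derived variable: click-through rate from clicks and impressions
ctr = df.withColumn("CTR", col("Clicks") / col("Impressions"))

# SQL-style filter: the 25-35 age segment
segment = df.filter((col("Age") >= 25) & (col("Age") <= 35))

# summary statistics (count, mean, min, max) and Pearson correlation
df.describe("Age", "Impressions", "Clicks").show()
print(df.stat.corr("Impressions", "Clicks"))
```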
The algorithms used along with the full results are presented in detail in the attached Jupyter
notebook.
One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need for suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster.
Data manipulation with RDDs, however, is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for this analysis: thanks to the named-columns attribute it supports, the data analysis tasks become more intuitive, as well as more efficient in terms of computational speed, compared with RDDs.
Works Cited
Anon., 2015. Spark Programming Guide. [Online]
Available at: spark.apache.org/docs/latest/programming-guide.html
Owen, S., 2014. How-to: Translate from MapReduce to Apache Spark. [Online]
Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Schutt, R., 2013. Doing Data Science. [Online]
Available at: https://github.com/oreillymedia/doing_data_science
Programming for big data

Contenu connexe

Similaire à Programming for big data

The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...
RyanMHolcomb
 
Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616
Heather Lamoureux
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Heather Lamoureux
 
presentation
presentationpresentation
presentation
hmagrissy
 

Similaire à Programming for big data (20)

The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...
 
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey...
 
fast publication journals
fast publication journalsfast publication journals
fast publication journals
 
Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616Report-11-Whats-36-Got-to-Do-With-It_100616
Report-11-Whats-36-Got-to-Do-With-It_100616
 
Reduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryReduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage Industry
 
Mortgage Banking: A Holistic Approach to Managing Compliance Risk
Mortgage Banking: A Holistic Approach to Managing Compliance RiskMortgage Banking: A Holistic Approach to Managing Compliance Risk
Mortgage Banking: A Holistic Approach to Managing Compliance Risk
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
 
03_AJMS_298_21.pdf
03_AJMS_298_21.pdf03_AJMS_298_21.pdf
03_AJMS_298_21.pdf
 
Mortgage Insurance Data Organization Havlicek Mrotek
Mortgage Insurance Data Organization Havlicek MrotekMortgage Insurance Data Organization Havlicek Mrotek
Mortgage Insurance Data Organization Havlicek Mrotek
 
Lendit PostShow SlideShare
Lendit PostShow SlideShareLendit PostShow SlideShare
Lendit PostShow SlideShare
 
loan.docx
loan.docxloan.docx
loan.docx
 
How to Justify a Change in Your ALLL
How to Justify a Change in Your ALLLHow to Justify a Change in Your ALLL
How to Justify a Change in Your ALLL
 
Busting Credit Score Myths
Busting Credit Score MythsBusting Credit Score Myths
Busting Credit Score Myths
 
AI-based credit scoring - An Overview.pdf
AI-based credit scoring - An Overview.pdfAI-based credit scoring - An Overview.pdf
AI-based credit scoring - An Overview.pdf
 
Business research report proposal customer delight in banking
Business research report proposal customer delight in bankingBusiness research report proposal customer delight in banking
Business research report proposal customer delight in banking
 
Essay On Stamford International Inc
Essay On Stamford International IncEssay On Stamford International Inc
Essay On Stamford International Inc
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Real matters march 2018 presentation
Real matters   march 2018 presentationReal matters   march 2018 presentation
Real matters march 2018 presentation
 
presentation
presentationpresentation
presentation
 
Remaking IT for New U.S. Mortgage Rule Compliance
Remaking IT for New U.S. Mortgage Rule ComplianceRemaking IT for New U.S. Mortgage Rule Compliance
Remaking IT for New U.S. Mortgage Rule Compliance
 

Plus de Alex Papageorgiou

Plus de Alex Papageorgiou (15)

Webinar Advanced marketing analytics
Webinar Advanced marketing analyticsWebinar Advanced marketing analytics
Webinar Advanced marketing analytics
 
Kaggle for digital analysts
Kaggle for digital analystsKaggle for digital analysts
Kaggle for digital analysts
 
Kaggle for Analysts - MeasureCamp London 2019
Kaggle for Analysts - MeasureCamp London 2019Kaggle for Analysts - MeasureCamp London 2019
Kaggle for Analysts - MeasureCamp London 2019
 
Travel information search: the presence of social media
Travel information search: the presence of social mediaTravel information search: the presence of social media
Travel information search: the presence of social media
 
The Kaggle Experience from a Digital Analysts' Perspective
The Kaggle Experience from a Digital Analysts' PerspectiveThe Kaggle Experience from a Digital Analysts' Perspective
The Kaggle Experience from a Digital Analysts' Perspective
 
Clickstream analytics with Markov Chains
Clickstream analytics with Markov ChainsClickstream analytics with Markov Chains
Clickstream analytics with Markov Chains
 
Growth Analytics: Evolution, Community and Tools
Growth Analytics: Evolution, Community and ToolsGrowth Analytics: Evolution, Community and Tools
Growth Analytics: Evolution, Community and Tools
 
Clickstream Analytics with Markov Chains
Clickstream Analytics with Markov Chains Clickstream Analytics with Markov Chains
Clickstream Analytics with Markov Chains
 
The impact of search ads on organic search traffic
The impact of search ads on organic search trafficThe impact of search ads on organic search traffic
The impact of search ads on organic search traffic
 
Prediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey dataPrediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey data
 
E com conversion prediction and optimisation
E com conversion prediction and optimisationE com conversion prediction and optimisation
E com conversion prediction and optimisation
 
Web analytics with R
Web analytics with RWeb analytics with R
Web analytics with R
 
Data science with Google Analytics @MeasureCamp
Data science with Google Analytics @MeasureCampData science with Google Analytics @MeasureCamp
Data science with Google Analytics @MeasureCamp
 
Intro to AdWords eMTI
Intro to AdWords eMTIIntro to AdWords eMTI
Intro to AdWords eMTI
 
Social Media And Civil Society
Social Media And Civil SocietySocial Media And Civil Society
Social Media And Civil Society
 

Dernier

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Dernier (20)

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Programming for big data

  • 1. Higher Diploma in Data Analytics Programming for Big Data Project Alexandros Papageorgiou Student ID: 15019004 Analysis I: The major factors determining the prediction of interest rate for Lending Club Loan request. Analysis II: Prediction of activity based on mobile phone spatial measurements Analysis III: Analysis of user interaction with online ads on a major news website with Spark
  • 2. The major factors determining the prediction of interest rate for Lending Club Loan request. Objectives of the analysis Loans are common place in nowadays and advances in the finance industry have made the process of requesting a loan a highly automated process. A major component in this process is the level of interest rate. This is determined based on a number of factors both from the applicant’s credit history as well as the application data submitted with the request like their employment history, credit history, and creditworthiness scores (lendingclub.com, 2015). Determining the interest rate can be a complex task that requires advanced data analysis. The purpose of this analysis is to spot the association between interest rates and a number of other factors based on the loan application data (such as their employment history, credit history, and creditworthiness scores) as well as data provided by external sources in order to get a better understanding of how the interest rate is determined and attempt to quantify these relationships. Particularly this study investigates beyond FICO (the main measure of the credit worthiness of the applicant) which are the other factors that can have an impact. Using exploratory analysis and standard multiple regression techniques it is demonstrated that there is significant relationship between Interest rate and FICO as well as 2 other variables (amount requested and length of the loan) Dataset Description For this analysis a 2500 sample observations dataset was used containing 2500 observations (rows) and 14 variables (columns) from the lending club website downloaded using the R programming language (R-Core-Team, 2015). The lending club data used in this analysis contains observations in code names as seen below, measuring the following  Amount.Requested: The amount (in dollars) requested in the loan application  Amount.Funded.By.Investors: The amount (in dollars) loaned to the individual  Interest.rate: The lending interest rate.  Loan.length: The length of time (in months) of the loan  Loan.Purpose: The purpose of the loan as stated by the applicant  Debt.to.Income.Ratio: The percentage of consumer’s gross income that goes towards paying debts  State: The abbreviation for the U.S. state of residence of the loan applicant  Home.ownsership: A variable indicating whether the applicant owns, rents, or has a mortgage on their home.  Monthly.income: The monthly income of the applicant (in dollars).  FICO.range: A range indicating the applicants FICO score. This is a measure of the credit worthiness of the applicant  Open.Credit.Lines: The number of open lines of credit the applicant had at the time of application.  Revolving.Credit.Balance: The total amount outstanding all lines of credit  Inquiries.in.the.Last.6.Months: The number of authorized queries in the 6 months before the loan was issued.  Employment.Length: Length of time employee at current job.
  • 3. Challenge: Data not in a tidy form Exploratory analysis was the method used via constructing plots and relevant tables to examine the quality of the data provided and explore possible associations between interest rate and the independent variables. This was after handling the 7 missing values found, making sure that analysis is performed based on complete cases. This was based on the assumption of no significant effect on the analysis due to low size of the missing values. Other data type transformations:  Several factor or character variables converted to numerical  Removal of % symbol from interest rate,  FICO range converted from a range in to a single figure  Renaming of variables where appropriate. Rationale for the transformations: Those transformations were made in order to enable a more flexible handling of the data through R especially by transforming them in numerical forms. To relate interest rate with its major components a standard linear regression model was deployed. The model selection was performed on the basis of the exploratory analysis and prior knowledge of the relationship between interest rate and the factors that are considered critical to its determination. Data processing activities As noted above a minimal number of missing values were identified and where appropriate removed, beyond that the data was found to be within normal and acceptable ranges without any extremities in interest rates and the other independent variables either. The final dataset was in line with the tidy data rule (Wickham, 2015). As a first step in the exploratory analysis a correlation analysis of all the numeric variables was introduced in order to identify possible associations among them and particularly the ones that correlate well with the interest rate. The results of this first analysis reveal quite a high negative type correlation among the interest rate and the FICO score (r=-0.7) and there is also some correlation with amount requested and amount funded on the level of r=0.33. The correlation among other variables with interest rate was relatively low. To carry on with the analysis the information provided on the club’s website was considered, which mentions as credit risk indicators factored into the model for the interest rate, the following:  Requested amount loan  Loan maturity (36 or 60 months)  Debt to income ratio  Length of credit history  Number of other accounts opened  Payment history  Number of other credit inquiries initiated over the past six months.
  • 4. The FICO score is also quite explicitly mentioned as a decisive factor so in this context this will unavoidably be one of the variables that will define the model. A number of experiments through box plots were performed in order to identify possible relationships of the interest rate with categorical variable. It turned out that what seems to have an impact on the interest rate is the length of the loan. Obviously overloading the model by including all the variables is not the optimal strategy (Cohen, 2009) and therefore as selection has to be made based on the results of the correlation analysis, the box plots for the categorical variables and the information provided on the web site. After testing with a number of models the one that was found to be best fit for this analysis is the following: Interest Rate= b0 + b1(FICO Score) + b2 (Requested Amount) + b3(Length of the Loan) + e where b0 is an intercept term and b1 represents the change of the (negative) interest rate for a given change of one unit of FICO score, similarly b2 represents the impact on interest for a one dollar increase of the requested amount. The term length of loan is a categorical two level variable that represents the change of the interest rate with a change from 36 months to 60 months of loan period, at average levels of the other two independent variables. The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015) For the length of loan a set of dummy variables were implemented so that the R function can interpret the data more effectively. As term of reference was selected “36 months”. In the case of amount obviously due to confounder concerns (the two variables correlate not just with the interest rate but most obviously between themselves as well) just the amount requested was included in the regression model (the funded amount obviously directly depends on the originally requested amount) We observed: highly statistically significant (P =2e-16) association between interest rate and FICO score. A change of one unit FICO corresponded to a negative change of b1 = 8.9 on Interest rate (95% Confidence Interval: -8.984321e-02 -8.507495e-02). Association between interest rate and amount requested (P =2e-16). A change of one unit amount requested corresponded to a change of b2 = 1.446e-04 on Interest rate (95% Confidence Interval: 1.319564e-04 1.573394e-04). Last, with an intercept of 7.245e+01 it is observed that this is the amount of interest rate that corresponds when all the coefficients are set to zero which corresponds to the projected value for the 36 month period of loan, while when the coefficient takes the value of one, this corresponds to the value for the 60 month length. (P =2e-16) The model has an Adjusted R-squared: 0.7454 which corresponds to the amount of variation that is explained by the model. A -limited in scope analysis- of residuals to compare the effectiveness of a multiple regression against the simple linear regression model shows that non-random residual variance is better fitted with the second one.
  • 5. Conclusions: The analysis suggests that there is a significant, positive association between Interest rate and FICO score as well as factors such as loan length and amount of loan requested. The analysis estimated the relationship using a linear model. This analysis provides some insights with regard to the ways a loan institution like lending club determines the cost of money for its customers, it therefore makes sense for the borrowers to be aware of the major factors that determine the interest rate they will be asked to pay and possibly based on this knowledge take action that could contribute in more favourable terms (for example ask for a lower amount and return the loan sooner rather than later. It is important to keep in mind that this study is a result of a limited dataset of just one institution and therefore it might be subject to bias. As time goes on and depending also on other parameters of the national and international economy other factors might come to play critical roles too. In any case an informed customer who is aware of this type of analysis is likely to make better decisions in his or her loan purchase. Works Cited Cohen, Y., 2009. Statistics and Data with R. s.l.:Wiley. lendingclub.com, 2015. Interest rates and how we set them. [Online] Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action [Accessed 25 11 2015]. R-Core-Team, 2015. R: A language and environment for statistical computing. [Online] Available at: http://www.R-project.org Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online] Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html Wickham, H., 2015. Tidy Data. [Online] Available at: http://vita.had.co.nz/papers/tidy-data.pdf
  • 6. Title: Prediction of activity based on mobile phone spatial measurements Introduction and Objectives: Advances in technology of mobile phones and the proliferation of smart devices have enabled the collection of spatial data of smart phone users with the intention of studying the relation between the measurements registered with the devices and the corresponding synchronous activity of the subjects. Data analysis methodology will be used with a prediction model to determine user activity based on a wide range of signals related to body motion. In particular the above analysis is based on the records of the Activity Recognition database which was built from the recordings of 30 subjects doing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors including accelerometers and gyroscopes. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones. Data Description A group of 30 volunteers were selected for this task from the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted Smartphone. The six selected ADLs were standing, sitting, laying down, walking, walking downstairs and upstairs. The respective. data set was downloaded from the URL https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed to facilitate its use within the R environment by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis). The data consists of 7352 entities (rows), each of which corresponds to a time indexed activity of each of the 21 subjects and 563 variables (columns) corresponding to measurements of two sensors. Specifically for each record, the data provided: - Acceleration from the accelerometer (total acceleration) and the estimated body acceleration. ( X, Y, Z axis) - Angular velocity from the gyroscope. (X, Y, Z axis) -Various descriptive statistics based on the above measurements Also, 2 additional pieces of information included as variables -Corresponding activity -Subject identifier Data processing The different activities were relatively evenly distributed and the same applies to the observations for every subject, so no extremes found in this context.
  • 7. All the columns referring to measurement are numeric. The subject is integer and the activity has the type of character. This type is transformed to factor to assist with R processing given that activity is the actual dependent variable of this dataset. Prior to the analysis, a number of additional data transformations needed to take place. There are some issues, for example a number of variables appear to have the same names but different values. In specific, the bandsEnergy-related variables, are repeated in sets of 3. For example, columns 303- 316,317-330, and 331-344 have the same column names. To fix this the variables were renamed in such a way to avoid duplication and possible problems with the analysis of the data. Those transformations were made in order to enable a more flexible handling of the data through R. Moreover the variable names are cleaned by removing some punctuation like “( )” and “-“ characters in names to make it make syntactically valid. An unusual fact observed is that all the numeric data appear to be within a range of -1 and 1 but this turns out to be because the data is normalized. There were no missing values found just complete cases observed and except the above mentioned no other data type transformations were found to be necessary. The dataset was split in two sets for training and testing. On a random split training set include the subject ids 1,3,5 and 6- Total of 4 samples corresponding to 328 observations. The test set includes ids 27, 28, 29 and 30. Therefore a total of 4 samples corresponding to 371 observations. Results: The selected method for the analysis is classification trees. Trees are particularly useful when there are have many explanatory variables. If the “twigs” of the tree are categorical a classification tree is recommended in order to partition the data ultimately into groups that are as homogeneous as possible (Pekelis, 2013) The next call to make is the selection of the variables that will be integrated into the model. It was decided that with the very high number of columns in the dataset, it might not be meaningful to examine each one individually. The first exploratory attempt to fit the tree model was with test tree prediction on the first variable set, related to Body acceleration (tBodyAcc) which includes the first 15 variables of the data set. A classification tree was grown through the training set, and the predictive model was tested on the test data. The misclassification rate was as high as 41%. This led to the straightforward abandonment of this first set of variables. As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen, that would include all the variables (with the obvious exception of subject id) and let the classification tree algorithm choose the critical nodes. There were 11 variables selected as nodes namely: "tBodyAcc.std.X" "tGravityAcc.mean.X" "tGravityAcc.max...Y" "tGravityAcc.mean.Y" "fBodyGyro.meanFreq.X" "tGravityAcc.arCoeff.X.1" "tBodyAccJerk.max.X" "tBodyGyroMag.arCoeff..1" "fBodyAcc.bandsEnergy.1.8" "tBodyAcc.arCoeff.X.3"
  • 8. These variables correspond to various summary statistics measurements from the two sensors The misclassification error rate was ~3 % on the training set, practically failing to classify correctly the activity just nine time out of 328. The value of the model will not be proven unless applied on to the test set. Once it performs versus the test set the error in prediction reaches 19.6% therefore being successful in predicting over 80% of the cases. The most significant nodes at the higher level of the tree are the Body acceleration standard deviation of X the Gravity acceleration mean of X and the Gravity acceleration coefficient of X. To check if the tree has any potential to improve its performance across validation experiment was made to find the deviance and number of misclassifications as related to the size of our model. Given the graph, it appears to be the case that the model has the least amount of misclassifications and deviance for a model size of 8. Given this result the next step is to prune the tree by predefining the number of nodes to be used to 8. Fitting the model of 8 nodes to the test data produces 20.2 % error rate so it is marginally lower than the previous model. Conclusions: Based on the results of the data analysis it turns out that the classification tree method was effective into handling a large number variables, which due to size and the nature of the variables, they would not have been able to be analyzed one by one. That said, there is definitely room for improvement especially if an analysis has the possibility to look deeper into the meaning of the variables and identifies patterns and relationships between them that could help regarding the selection of the variables for the model, instead of opting for the comprehensive approach as it was the case with analysis. In this relatively straightforward approach adopted, no potential confounders were identified, as this model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. Also possible to use are techniques that are likely to improve the initial classification tree results, such as random forests (Chen, 2009) Works Cited Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online] Available at: stat.fsu.edu/~fchen/prediction-tool.pdf [Accessed 05 12 2015]. Pekelis, L., 2013. Classification And Regression Trees : A Practical Guide for Describing a Dataset. [Online] Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf [Accessed 04 12 2015]. Y.Theodoridis, 1996. ACM Digital Library-A model for the prediction of R-tree performance. [Online] Available at: dl.acm.org/citation.cfm?id=237705 [Accessed 07 12 2015].
Title: Analysis of user interaction with online ads on a major news website

The dataset

The dataset is part of a sequence of files that record daily click-throughs to online ads, based on user characteristics as recorded on the New York Times website in May 2012. The datasets are available on the Github repo of the "Doing Data Science" book (Schutt, 2013); in this analysis only the first day of available data is considered. It contains over 458,000 observations and 5 variables:

 Age of User – numerical variable
 Gender – binary variable
 Signed_In – binary variable indicating whether the user was logged in or not
 Impressions – the number of ad impressions during the session
 Clicks – the number of click-throughs to one or more ads on the website

Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nevertheless be used to conduct a basic analysis of user behaviour on the website in relation to interaction with the ad content on the site.

Configurations: Setting up the environment

The platform used for this analysis was IBM Bluemix, which, via the integrated notebook interface, provides access to Apache Spark, an open source, in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework.

The dataset in csv format is first uploaded as a new data source on Bluemix with the Object Storage service, which is associated with the required credentials. The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter. Next, using the "insert code" function of the data sources panel, a dictionary with the credentials associated with the data source is created; it is then passed as an argument to the set-Hadoop-configuration function in order to activate the service (a sketch of such a function is given below). For the data processing activities that follow, PySpark, the Python API to Spark, is used. The data structure used throughout the analysis is the Spark DataFrame.

Algorithms, results, challenges:

The algorithms used for the analysis mostly belong to the 'split-apply-combine' category: the data are grouped on an attribute, and a function is then applied to the grouped data, summarising the values within each group into one value. This is in line with the MapReduce principles of creating key-value pairs, grouping by key so that the values of same-key entries form a sequence associated with that key, and then reducing that sequence to one aggregate value representing all the values under the common key.
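A minimal sketch of such a configuration function, following the Swift-based Object Storage pattern used in Bluemix notebooks at the time; the exact configuration keys and credential field names are assumptions and may vary with the service version.

# Sketch only: point Spark's Hadoop layer at Bluemix Object Storage (Swift).
# The key names and credential fields below are assumptions based on the
# Bluemix notebook pattern; `sc` is the SparkContext provided by the notebook.
def set_hadoop_config(credentials):
    prefix = "fs.swift.service." + credentials["name"]
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials["auth_url"] + "/v2.0/tokens")
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])
    hconf.setBoolean(prefix + ".public", True)

# The credentials dictionary generated by the "insert code" function is then
# passed in to activate the service, e.g. set_hadoop_config(credentials_1).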
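The split-apply-combine operations described above, whose results are reported on the next page, might look as follows in PySpark. The file path, and the use of a generic SparkSession to keep the sketch self-contained, are assumptions; on Bluemix the data would be read from the Object Storage container configured above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load the first day of click data; the path is an assumption, and the header
# names follow the variable list above (Age, Gender, Signed_In, Impressions, Clicks).
df = spark.read.csv("nyt_day1.csv", header=True, inferSchema=True)

# Split-apply-combine: group by gender, then reduce each group to a mean age.
df.groupBy("Gender").agg(F.avg("Age").alias("mean_age")).show()

# Transform existing variables into a new one: the click-through rate.
df = df.withColumn("CTR", F.col("Clicks") / F.col("Impressions"))

# SQL-type filtering: keep only the 25-35 age segment for focused analysis.
segment = df.filter((F.col("Age") >= 25) & (F.col("Age") <= 35))

# Summary statistics and the impressions/clicks Pearson correlation.
print(df.count())
df.describe("Age", "Impressions", "Clicks").show()
print(df.stat.corr("Impressions", "Clicks"))

# The same grouping expressed at the RDD level with key-value pairs and
# reduceByKey; the DataFrame version above is more expressive and faster.
pair_rdd = df.rdd.map(lambda row: (row["Gender"], (row["Age"], 1)))
sums = pair_rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
mean_age_by_gender = sums.mapValues(lambda t: t[0] / t[1]).collect()

The closing lines illustrate why DataFrames were preferred for this analysis: the same aggregation that takes three RDD operations reduces to a single named-column groupBy.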
Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap can deliver equivalent functionality. In Spark the above procedure represents a transformation, which is lazily evaluated: it only runs when an action is performed, i.e. when an answer is explicitly requested from the system.

 For example, the observations are grouped by gender and a function is then applied to the groups that outputs the mean age by gender (22.9 for males and 40.8 for females).
 Other algorithms transform existing variables into new ones; for example, the numbers of clicks and impressions are combined as a ratio to produce the click-through rate.
 In other cases, SQL-type analysis is deployed, for example to filter the observations down to a subset, such as the 25-35 age group, and then focus the analysis on that specific segment.
 There are also implementations of summary statistics, including the count of observations (458,441).
 The mean values of the key variables: the average age is 29.4 years, the average number of impressions is 5, and the average number of clicks just below 0.1.
 The minimum reported age is 0 and the maximum 99; the maximum number of ad impressions is 9 and the maximum number of clicks is 4.
 The Pearson correlation between impressions and clicks is 0.13, a positive but relatively weak relationship, implying that more ad impressions do not always lead to more clicks.

The algorithms used, along with the full results, are presented in detail in the attached Jupyter notebook.

One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need for suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster. Data manipulation with RDDs is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for this analysis: thanks to the named columns they support, the data analysis tasks performed become more intuitive, as well as more efficient in terms of computational speed compared with RDDs.

Works Cited

Anon., 2015. Spark Programming Guide. [Online] Available at: spark.apache.org/docs/latest/programming-guide.html

Owen, S., 2014. How-to: Translate from MapReduce to Apache Spark. [Online] Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

Schutt, R., 2013. Doing Data Science. [Online] Available at: https://github.com/oreillymedia/doing_data_science