AN INTRODUCTION TO BIG DATA ANALYTICS AND CLOUD COMPUTING
A talk on decision making in the Big Data and Cloud Computing era
May 10, 2014 (1400-1600 hrs)
Room no. 511, Fifth Floor, Department of Management Studies,
Vishwakarma Bhawan, IIT Delhi
Your speaker
Ajay Ohri
R for Business Analytics http://www.springer.com/statistics/book/978-1-4614-4342-1
My requirements
What are the key challenges in data analytics? How can unstructured data be captured and processed so that it can be used for analysis? Which methodology handles Big Data most efficiently? What are the key technologies that can help to process Big Data?
What skill sets are required to become a data scientist? What are the possible key areas for research in Big Data analytics? What level of programming skill is required to work in this area? Which packages and algorithms are useful? Does R have built-in functionality for making efficient use of multi-core processors?
How can R be used for data mining on social network data? Can it help HR staff use analytics to hire the right set of people (HR analytics)?
How can R be used to perform regression, classification, clustering, structural equation modeling and data envelopment analysis? Illustrate with a real-life example.
How can cloud computing help in processing or analyzing data efficiently? What are the risks associated with a cloud-based model?
My requirements - let's sort this out
What are the key challenges in data analytics? How can unstructured data be captured and processed so that it can be used for analysis?
How can cloud computing help in processing or analyzing data efficiently? What are the risks associated with a cloud-based model?
Which methodology handles Big Data most efficiently? What are the key technologies that can help to process Big Data?
What skill sets are required to become a data scientist? What are the possible key areas for research in Big Data analytics? What level of programming skill is required to work in this area? Can it help HR staff use analytics to hire the right set of people (HR analytics)?
How can R be used for data mining on social network data?
How can R be used to perform regression, classification, clustering, structural equation modeling and data envelopment analysis? Illustrate with a real-life example. Which packages and algorithms are useful? Does R have built-in functionality for making efficient use of multi-core processors?
My requirements- let’s tag this down
Data Analytics and Cloud Computing
Big Data
R
R (Data Science Careers)
My requirements- let’s check this again
Incorrect Classification?
Topics to be covered
Business Analytics
Data Science
Big Data
Cloud Computing
R
Sub-topics to be covered
Business Analytics - methodologies, challenges, structured/unstructured data, HR analytics
Data Science - careers, skills
Big Data - technology, skills
Cloud Computing - technology, risks
R - ???
Sub-topics that won't be covered
R - Data Envelopment Analysis
http://professorjf.webs.com/DEA%202013.pdf
http://www.uri.edu/artsci/ecn/burkett/DEAnotes.pdf
R - Structural Equation Modeling
http://socserv.socsci.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf
http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf
and if time permits
HR Analytics
http://www.slideshare.net/ajayohri/hr-analytics-34080636
Business Analytics
Definition
Business analytics (BA) is the exploration and investigation of data generated by businesses.
Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics, both past and current, for decision support in businesses.
Data Mining (DM) is the process of discovering new patterns in large data sets using algorithms and statistical methods.
To differentiate between the three: BI is mostly current reports, BA is models to predict and strategize, and DM matches patterns in big data.
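To make the difference between predictive models and pattern discovery concrete, here is a minimal sketch in base R of three of the techniques raised in the requirements: regression, classification and clustering. It assumes R's bundled mtcars and iris datasets as stand-ins for business data; the variables chosen are purely illustrative and are not taken from the talk.

```r
# Regression: predict fuel economy (mpg) from car weight (wt)
reg_fit <- lm(mpg ~ wt, data = mtcars)
print(coef(reg_fit))

# Classification: logistic regression for transmission type
# (am: 0 = automatic, 1 = manual), again using weight as the predictor
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)
pred_class <- as.integer(predict(clf_fit, type = "response") > 0.5)
accuracy <- mean(pred_class == mtcars$am)  # in-sample accuracy only

# Clustering: k-means with 3 groups on the four iris measurements
set.seed(42)  # k-means starts from random centres
km_fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
cluster_vs_species <- table(km_fit$cluster, iris$Species)
print(cluster_vs_species)
```

The accuracy here is measured on the training data, so it overstates how the classifier would do on new records; a real project would hold out a test set.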
Business Analytics
Definition
-Information Ladder
-CRISP DM
-KDD
-SEMMA
Business Analytics
-Information Ladder
Data → Information → Knowledge → Understanding → Insight → Wisdom
While the first two steps can be defined with scientific precision, the upper steps belong to the domain of psychology and philosophy.
Also known as DIKW (Data, Information, Knowledge, Wisdom)
CRISP DM
KDD
SEMMA
Data Mining - a good map http://www.saedsayad.com/
Data Science
https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
What is a Data Scientist
a data scientist is simply a person who can
write code
understand statistics
derive insights from data
Oh really, is this a Data Scientist?
A data scientist is simply a person who can
write code
= in R, Python, Java, SQL, Hadoop (Pig, HQL, MR), etc.
= for data storage, querying, summarization, visualization
= how: efficiently and in time (fast results?)
= where: on databases, on the cloud, on servers
and understand enough statistics
to derive insights from data
so that the business can make decisions
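As a small illustration of the "querying" and "summarization" items above, the following base R sketch filters and aggregates a table. It assumes the bundled mtcars dataset in place of a real business table, and the SQL comparisons in the comments are only analogies.

```r
# Querying: keep only the heavier cars, similar to a SQL WHERE clause
heavy_cars <- subset(mtcars, wt > 3, select = c(mpg, cyl, wt))

# Summarization: average mpg per cylinder count, similar to GROUP BY
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)
```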
Some tools
Linux
+
Java /Python/Pig
+
R
+
SQL
Cheat Sheets for Data Scientists
http://www.slideshare.net/ajayohri/cheat-sheets-for-data-scientists
Data Scientist Programming Skills
Java http://www.learnjavaonline.org/
Python http://www.codecademy.com/tracks/python
SQL http://www.w3schools.com/sql/
R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/
http://www.statmethods.net/
Hadoop http://hortonworks.com/hadoop-training/
Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
Other places to learn
MOOCs
1. https://www.edx.org/
2. https://www.coursera.org/
3. https://www.udacity.com/
4. https://www.udemy.com/
Books
Courses
Workshops
Big Data
Statistics on Facebook
https://newsroom.fb.com/company-info/
● 802 million daily active users on average in March 2014
● 609 million mobile daily active users on average in March 2014
● 1.28 billion monthly active users as of March 31, 2014
● 1.01 billion mobile monthly active users as of March 31, 2014
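A quick worked calculation shows what these four figures imply. The sketch below uses only the numbers quoted above (in millions); the variable names and the choice of ratios are illustrative assumptions, not something the Facebook newsroom page reports.

```r
# Facebook figures quoted above, in millions (March 2014)
fb <- c(dau = 802, mobile_dau = 609, mau = 1280, mobile_mau = 1010)

# Share of monthly users who come back daily (DAU / MAU)
daily_over_monthly <- fb[["dau"]] / fb[["mau"]]

# Share of daily and monthly users who are on mobile
mobile_share_daily <- fb[["mobile_dau"]] / fb[["dau"]]
mobile_share_monthly <- fb[["mobile_mau"]] / fb[["mau"]]

ratios <- round(100 * c(daily_over_monthly = daily_over_monthly,
                        mobile_share_daily = mobile_share_daily,
                        mobile_share_monthly = mobile_share_monthly), 1)
print(ratios)
```

In other words, roughly 63% of monthly users were active daily, and about 76% of daily users and 79% of monthly users were on mobile.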
Statistics on Twitter
https://about.twitter.com/company
● 255 million monthly active users
● 500 million Tweets are sent per day
● 78% of Twitter active users are on mobile
● 77% of accounts are outside the U.S.
Big Data
is changing marketing
is changing marketing models
is making marketing much more quantitative than before
Hadoop - infrastructure for Big Data
http://hadoop.apache.org/
Hadoop- evolving ecosystem
Cloud Computing -HW to the SW
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
http://www.silverlighthack.com/post/2011/02/27/iaas-paas-and-saas-terms-explained-and-defined.aspx
Cloud Computing-Google
https://cloud.google.com/products/
Compute Engine is Google’s Infrastructure-as-a-Service (IaaS).
App Engine is Google’s Platform-as-a-Service (PaaS).
Storage
Cloud SQL -a fully-managed, relational MySQL database.
Cloud Storage -a simple API that allows you to manage your data programmatically
Cloud Datastore provides a managed, NoSQL, schemaless database for storing non-relational data
Big Data
BigQuery. Run fast, SQL-like queries against multi-terabyte datasets in seconds
https://github.com/GoogleCloudPlatform
Cloud Computing-Amazon
http://aws.amazon.com/products/
More on Cloud Computing
Challenges and Opportunities for India (from http://chennai.vit.ac.in/isbcc/)
http://www.slideshare.net/ajayohri/data-analytics-using-the-cloud-challenges-and-opportunities-for-india
Big Data Big Analytics (from http://krishnarajpm.com/bigdata/abstract.pdf, Workshop on Statistical Machine Learning and Game Theory Approaches for Large Scale Data Analysis)
http://www.slideshare.net/ajayohri/big-data-big-analytics
R
http://www.r-project.org/
Open Source
Free
5000+ Packages
Growing Faster
>2 million users
RAM constraints??
R
http://www.r-project.org/
Object Oriented
has GUI and IDE
has Commercial offerings
R - Rattle- Data Mining GUI
R - R Commander
R -R Studio
R -Revolution Analytics
Free for Academics
World Wide !!
RevoScaleR package
for Big Data
Recommended Install -
http://info.revolutionanalytics.com/free-academic.html
My favorite places to learn R
http://www.swirlstats.com/
http://www.twotorials.com/
http://tryr.codeschool.com/
https://www.coursera.org/course/rprog
Also see http://blog.datacamp.com/complete-list-of-coursera-courses-using-r-ranked-by-popularity/
R Case Study
Who are my Facebook friends?
Step 1
http://thinktostart.wordpress.com/2013/11/19/analyzing-facebook-with-r/
Step 2
https://gist.github.com/decisionstats/f18126aea544be324169
Twitter Analysis
http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais
http://www.rdatamining.com/examples/social-network-analysis
Big Data Social Network Analysis
Analyzing a big social network using R and distributed graph engines
http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
Big Data Social Media Analysis
Can be used for customers (and also for latent influencers) - http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/
Big Data Social Media Analysis
The R package twitteR (http://cran.r-project.org/web/packages/twitteR/index.html) can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so for larger volumes we can use the DataSift API (http://datasift.com/pricing#costs).
Big Data Social Media Analysis
How does information propagate through a
social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
Big Data Social Network Analysis
Can be applied to terrorist networks (and also to potential protestors) - Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf
Primary focus is on three aspects of network analysis
1. Identifying leadership and key actors
2. Revealing underlying structure and intra-network community structure
3. Evolution and decay of social networks
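The first aspect, identifying key actors, can be sketched with degree centrality (the number of direct connections each person has). The example below uses base R only and a made-up six-edge network; the names and edges are hypothetical, and in practice a package such as igraph, mentioned earlier, would typically be used instead of a hand-built adjacency matrix.

```r
# Hypothetical undirected network given as an edge list (all names are made up)
edges <- data.frame(
  from = c("A", "A", "A", "A", "B", "D"),
  to   = c("B", "C", "D", "E", "C", "E"),
  stringsAsFactors = FALSE
)

# Build a symmetric adjacency matrix indexed by person name
people <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0L, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))
adj[cbind(edges$from, edges$to)] <- 1L
adj[cbind(edges$to, edges$from)] <- 1L

# Degree centrality: number of direct connections per person
degree <- rowSums(adj)
print(sort(degree, decreasing = TRUE))

# The best-connected person is a first candidate for a "key actor"
key_actor <- names(which.max(degree))
```

Degree is only the simplest centrality measure; measures such as betweenness capture brokers who connect otherwise separate groups.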
R -Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.
● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.
● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.
● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.
● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of histograms such as those created by Mappers in a MapReduce application.
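On the earlier question of whether R has built-in support for multi-core processors: the parallel package ships with R itself. The sketch below is a minimal example under stated assumptions: heavy_task is a hypothetical stand-in for real work, at most two workers are used, and forking via mclapply() is only available on Unix-like systems, so the code falls back to a single worker on Windows.

```r
library(parallel)  # part of the standard R distribution

n_cores <- detectCores()  # may return NA on some platforms

# mclapply() relies on forking, which Windows does not support,
# so use one worker there (it then behaves like lapply)
workers <- if (.Platform$OS.type == "windows" || is.na(n_cores)) {
  1L
} else {
  min(2L, n_cores)
}

# Hypothetical CPU-bound task standing in for real analysis work
heavy_task <- function(n) sum(sqrt(seq_len(n)))
inputs <- c(1e5, 2e5, 3e5, 4e5)

parallel_result <- mclapply(inputs, heavy_task, mc.cores = workers)
sequential_result <- lapply(inputs, heavy_task)

# Both approaches should give the same answers
print(all.equal(parallel_result, sequential_result))
```

For code that must also run in parallel on Windows, the same package offers makeCluster() with parLapply(), at the cost of starting separate R worker processes.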
R -Hadoop Packages
https://github.com/RevolutionAnalytics/RHadoop/wiki
● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
● rmr - functions providing Hadoop MapReduce functionality in R
● rhdfs - functions providing file management of the HDFS from within R
● rhbase - functions providing database management for the HBase distributed database from within R
http://amplab-extras.github.io/SparkR-pkg/
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
https://github.com/nexr/RHive
RHive is an R extension facilitating distributed computing via HIVE query. RHive allows easy usage of HQL(Hive SQL) in R, and
allows easy usage of R objects and R functions in Hive.
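The MapReduce pattern that these packages expose can be illustrated without Hadoop. The following is a sketch of the idea in plain base R, a word count over three made-up sentences; it does not use the rmr, RHIPE or SparkR APIs, and the three phases run in a single R process rather than across a cluster.

```r
# Hypothetical input "documents"
docs <- c("big data needs big tools", "r handles data", "big r")

# Map: each document emits (word, 1) key-value pairs
mapped <- lapply(docs, function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
})

# Shuffle: bring together all values that share a key
pairs <- do.call(rbind, mapped)
grouped <- split(pairs$value, pairs$key)

# Reduce: combine the values for each key, here by summing
counts <- vapply(grouped, sum, integer(1))
print(sort(counts, decreasing = TRUE))
```

On a real cluster the map and reduce functions run on different machines and the shuffle moves data across the network, which is why the functions must not depend on shared state.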
R - Cloud Computing
http://cran.r-project.org/web/views/WebTechnologies.html
R -Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Large memory and out-of-memory data
● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored
outside of R's main memory.
● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with
a number of higher-level functions.
● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via
files) and uses external pointer objects to refer to them. .
● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table
● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also
facilitates operating on data in a streaming fashion which does not require Hadoop.
● The speedglm package fits (generalised) linear models to large data.
● The biglars package by Seligman et al. can use the ff package to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.
● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.
● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
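The idea of incremental computation for data that does not fit in memory can be sketched in base R. The example below fits a linear regression by reading the data in chunks and accumulating only small summary matrices. This is a simplified stand-in that accumulates the normal equations; it is not necessarily the algorithm biglm itself uses, and the four in-memory chunks of mtcars are an assumption standing in for pieces of a file too large to load.

```r
# Pretend mtcars arrives in 4 chunks, as if read piece by piece from disk
chunk_ids <- split(seq_len(nrow(mtcars)), rep(1:4, length.out = nrow(mtcars)))

# Running totals of X'X and X'y; these stay small regardless of row count
xtx <- matrix(0, nrow = 3, ncol = 3)
xty <- matrix(0, nrow = 3, ncol = 1)

for (idx in chunk_ids) {
  chunk <- mtcars[idx, ]
  X <- cbind(1, chunk$wt, chunk$hp)  # intercept, weight, horsepower
  y <- chunk$mpg
  xtx <- xtx + crossprod(X)      # adds this chunk's X'X
  xty <- xty + crossprod(X, y)   # adds this chunk's X'y
}

# Solve the accumulated normal equations for the coefficients
beta <- solve(xtx, xty)

# Check against a fit on the full data held in memory
full_fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(chunked = as.numeric(beta), full = unname(coef(full_fit))))
```

Only a 3-by-3 matrix and a 3-by-1 vector are kept between chunks, so memory use does not grow with the number of rows.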
R in Finance
http://www.rinfinance.com/
R in Finance
http://www.quantmod.com/
C’est la vie
IN INDUSTRY - an R expert is one who knows which package to use
IN RESEARCH - an R expert is one who creates a new, popular and improved package
CRAN Views help experts
http://cran.r-project.org/web/views/
SAP with R
Departure of Aeroplanes-SAP Hana 200m
http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html
R using SAP Hana
SAP Hana DB uses R
http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
Oracle R Enterprise
Case Studies and Examples
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
Additional
http://www.slideshare.net/ajayohri/open-source-analytics
Open Source in Analytics (OSSCamp 2014)
http://osscamp.in/
http://osscamp.in/events/6/open-source-analytics-overview-r-python-and-others
How does this affect
decision making
Lots of Data
IT is not a support function
Analytical Organizations with cross functional domains
and
Employees as first line of analysis
is education and research keeping up?
Let's do a revision
Requirements and People
# Requirements: met vs. unmet
a <- data.frame(req = c("Met", "Unmet"), counts = c(50, 50))
a
pie(a$counts, labels = a$req)

# People: audience reaction, coloured with an RColorBrewer palette
library(RColorBrewer)  # provides brewer.pal()
p <- data.frame(req = c("Satisfied", "Unsatisfied", "Busy Sleeping"),
                counts = c(50, 40, 10))
pie(p$counts, labels = p$req, col = brewer.pal(3, "Set1"))
Thanks for listening
Contact - ohri2007@gmail.com
LinkedIn - http://linkedin.com/in/ajayohri
Questions please?
One more thing
A movie about a murdered IIM batchmate of mine, who fought against corruption, was released yesterday
http://www.imdb.com/title/tt3056632/
Dedicated to

Contenu connexe

En vedette

エクセルで統計分析 統計プログラムHADについて
エクセルで統計分析 統計プログラムHADについてエクセルで統計分析 統計プログラムHADについて
エクセルで統計分析 統計プログラムHADについてHiroshi Shimizu
 
KH Coder 2 チュートリアル(スライド版)
KH Coder 2 チュートリアル(スライド版)KH Coder 2 チュートリアル(スライド版)
KH Coder 2 チュートリアル(スライド版)khcoder
 
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Revolution Analytics
 
ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定Akira Masuda
 

En vedette (6)

はじめての「R」
はじめての「R」はじめての「R」
はじめての「R」
 
エクセルで統計分析 統計プログラムHADについて
エクセルで統計分析 統計プログラムHADについてエクセルで統計分析 統計プログラムHADについて
エクセルで統計分析 統計プログラムHADについて
 
KH Coder 2 チュートリアル(スライド版)
KH Coder 2 チュートリアル(スライド版)KH Coder 2 チュートリアル(スライド版)
KH Coder 2 チュートリアル(スライド版)
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
 
ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定
 

Plus de Ajay Ohri

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay OhriAjay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RAjay Ohri
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionAjay Ohri
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for freeAjay Ohri
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10Ajay Ohri
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri ResumeAjay Ohri
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...Ajay Ohri
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data ScientistsAjay Ohri
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in PythonAjay Ohri
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen OomsAjay Ohri
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsAjay Ohri
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha Ajay Ohri
 
Analyze this
Analyze thisAnalyze this
Analyze thisAjay Ohri
 

Plus de Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 

Dernier

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 


Decision making in the era of cloud computing and big data

  • 1. AN INTRODUCTION TO BIG DATA ANALYTICS AND CLOUD COMPUTING a talk on Decision Making in Big Data and Cloud Computing era May 10, 2014 (1400-1600 Hrs) in Room no. 511, Fifth floor, Department of Management Studies, Vishwakarma Bhawan, IIT Delhi
  • 2. Your speaker Ajay Ohri R for Business Analytics http://www.springer.com/statistics/book/978-1-4614-4342-1
  • 3. My requirements What are the key challenges in Data Analytics? How can unstructured data be captured and then processed, so that it can be used for analysis? Which methodology is more efficient for handling Big Data? What are the key technologies that can help to process Big Data? What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skill is required to work in this area? Which packages/algorithms are useful? Does R have built-in functionality to make efficient use of multi-core processors? How can R be used to do data mining on Social Network Data? Can it help HR staff do analytics to hire the right set of people (HR Analytics)? How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model?
  • 4. My requirements - let’s break this down What are the key challenges in Data Analytics? How can unstructured data be captured and then processed, so that it can be used for analysis? Which methodology is more efficient for handling Big Data? What are the key technologies that can help to process Big Data? What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skill is required to work in this area? Which packages/algorithms are useful? Does R have built-in functionality to make efficient use of multi-core processors? How can R be used to do data mining on Social Network Data? Can it help HR staff do analytics to hire the right set of people (HR Analytics)? How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model?
  • 5. My requirements - let’s sort this out What are the key challenges in Data Analytics? How can unstructured data be captured and then processed, so that it can be used for analysis? How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model? Which methodology is more efficient for handling Big Data? What are the key technologies that can help to process Big Data? What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skill is required to work in this area? Can it help HR staff do analytics to hire the right set of people (HR Analytics)? How can R be used to do data mining on Social Network Data? How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. Which packages/algorithms are useful? Does R have built-in functionality to make efficient use of multi-core processors?
  • 6. My requirements - let’s tag this What are the key challenges in Data Analytics? How can unstructured data be captured and then processed, so that it can be used for analysis? How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model? Which methodology is more efficient for handling Big Data? What are the key technologies that can help to process Big Data? What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skill is required to work in this area? Can it help HR staff do analytics to hire the right set of people (HR Analytics)? How can R be used to do data mining on Social Network Data? How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. Which packages/algorithms are useful? Does R have built-in functionality to make efficient use of multi-core processors? Data Analytics and Cloud Computing Big Data R R (Data Science Careers)
  • 7. My requirements - let’s check this again What are the key challenges in Data Analytics? How can unstructured data be captured and then processed, so that it can be used for analysis? How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model? Which methodology is more efficient for handling Big Data? What are the key technologies that can help to process Big Data? What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skill is required to work in this area? Can it help HR staff do analytics to hire the right set of people (HR Analytics)? How can R be used to do data mining on Social Network Data? How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. Which packages/algorithms are useful? Does R have built-in functionality to make efficient use of multi-core processors? Data Analytics and Cloud Computing Big Data R R (Data Science Careers) Incorrect Classification?
  • 8. Topics to be covered Business Analytics Data Science Big Data Cloud Computing R
  • 9. Sub-topics to be covered Business Analytics - methodologies, challenges, structured/unstructured data Data Science Big Data Cloud Computing R
  • 10. Sub-topics to be covered Business Analytics - methodologies, challenges, structured/unstructured data, HR analytics Data Science - careers, skills Big Data - technology, skills Cloud Computing R
  • 11. Sub-topics to be covered Business Analytics - methodologies, challenges, structured/unstructured data, HR analytics Data Science - careers, skills Big Data - technology, skills Cloud Computing - technology, risks R -
  • 12. Sub-topics to be covered Business Analytics - methodologies, challenges, structured/unstructured data, HR analytics Data Science - careers, skills Big Data - technology, skills Cloud Computing - technology, risks R - ???
  • 13. Sub-topics that won’t be covered R - Data Envelopment Analysis (http://professorjf.webs.com/DEA%202013.pdf, http://www.uri.edu/artsci/ecn/burkett/DEAnotes.pdf) Structural Equation Modeling (http://socserv.socsci.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf) and, if time permits, HR Analytics http://www.slideshare.net/ajayohri/hr-analytics-34080636
  • 14. Business Analytics - Definition Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics, both past and current, used for decision support in businesses. Data Mining (DM) is the process of discovering new patterns in large data sets using algorithms and statistical methods. To differentiate between the three: BI is mostly current reports, BA is models to predict and strategize, and DM matches patterns in big data.
  • 16. Business Analytics - Information Ladder Data → Information → Knowledge → Understanding → Insight → Wisdom. Whereas the first two steps can be defined with scientific exactness, the upper steps belong to the domain of psychology and philosophy. See also DIKW (Data, Information, Knowledge, Wisdom).
  • 18. KDD (Knowledge Discovery in Databases)
  • 19. SEMMA (Sample, Explore, Modify, Model, Assess)
  • 20. Data Mining - a good map http://www.saedsayad.com/
  • 22. What is a Data Scientist? A data scientist is simply a person who can: write code; understand statistics; derive insights from data
  • 23. Oh really, is this a Data Scientist? A data scientist is simply a person who can write code (in what: R, Python, Java, SQL, Hadoop (Pig, HQL, MR) etc.; for what: data storage, querying, summarization, visualization; how: efficiently, and in time (fast results?); where: on databases, on cloud, on servers) and understand enough statistics to derive insights from data so that the business can make decisions
  • 25. Cheat Sheets for Data Scientists http://www.slideshare.net/ajayohri/cheat-sheets-for-data-scientists
  • 26. Data Scientist Programming Skills Java http://www.learnjavaonline.org/ Python http://www.codecademy.com/tracks/python SQL http://www.w3schools.com/sql/ R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/ http://www.statmethods.net/ Hadoop http://hortonworks.com/hadoop-training/ Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
  • 27. Other places to learn MOOCs: 1 https://www.edx.org/ 2 https://www.coursera.org/ 3 https://www.udacity.com/ 4 https://www.udemy.com/ Books Courses Workshops
  • 29. Statistics on Facebook https://newsroom.fb.com/company-info/ ● 802 million daily active users on average in March 2014 ● 609 million mobile daily active users on average in March 2014 ● 1.28 billion monthly active users as of March 31, 2014 ● 1.01 billion mobile monthly active users as of March 31, 2014
  • 30. Statistics on Twitter https://about.twitter.com/company ● 255 million monthly active users ● 500 million Tweets are sent per day ● 78% of Twitter active users are on mobile ● 77% of accounts are outside the U.S.
  • 31. Big Data: is changing marketing; is changing marketing models; is much more quantitative compared to earlier marketing
  • 32. Hadoop - infrastructure for Big Data http://hadoop.apache.org/
  • 37. Cloud Computing -HW to the SW http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf http://www.silverlighthack.com/post/2011/02/27/iaas-paas-and-saas-terms-explained-and-defined.aspx
  • 39. Cloud Computing - Google https://cloud.google.com/products/ Compute Engine is Google’s Infrastructure-as-a-Service (IaaS). App Engine is Google’s Platform-as-a-Service (PaaS). Storage: Cloud SQL - a fully managed, relational MySQL database. Cloud Storage - a simple API that allows you to manage your data programmatically. Cloud Datastore - a managed, NoSQL, schemaless database for storing non-relational data. Big Data: BigQuery - run fast, SQL-like queries against multi-terabyte datasets in seconds. https://github.com/GoogleCloudPlatform
  • 42. More on Cloud Computing Challenges and Opportunities for India (from http://chennai.vit.ac.in/isbcc/) http://www.slideshare.net/ajayohri/data-analytics-using-the-cloud-challenges-and-opportunities-for-india Big Data Big Analytics (http://krishnarajpm.com/bigdata/abstract.pdf Workshop on Statistical Machine Learning and Game Theory Approaches for Large Scale Data Analysis) http://www.slideshare.net/ajayohri/big-data-big-analytics
  • 43. R http://www.r-project.org/ Open source; free; 5000+ packages; growing fast; >2 million users; RAM constraints??
  • 44. R http://www.r-project.org/ Object oriented; has GUIs and IDEs; has commercial offerings
  • 46. R - Rattle- Data Mining GUI http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  • 47. R - R Commander http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  • 49. R -Revolution Analytics Free for Academics World Wide !! RevoScaleR package for Big Data Recommended Install - http://info.revolutionanalytics.com/free-academic.html
  • 51. My favorite places to learn R http://www.swirlstats.com/
  • 52. My favorite places to learn R http://www.twotorials.com/
  • 53. My favorite places to learn R http://tryr.codeschool.com/
  • 54. My favorite places to learn R https://www.coursera.org/course/rprog also see http://blog.datacamp.com/complete-list-of-coursera-courses-using-r-ranked-by-popularity/
  • 55. R Case Study Who are my Facebook friends? Step 1 http://thinktostart.wordpress.com/2013/11/19/analyzing-facebook-with-r/ Step 2 https://gist.github.com/decisionstats/f18126aea544be324169
  • 58. Big Data Social Network Analysis Analyzing a big social network using R and distributed graph engines http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
  • 59. Big Data Social Media Analysis Can be used for customers (and also for latent influencers) - http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/
  • 60. Big Data Social Media Analysis The R package twitteR http://cran.r-project.org/web/packages/twitteR/index.html can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so we can use the Datasift API http://datasift.com/pricing#costs
  • 61. Big Data Social Media Analysis How does information propagate through a social network? http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
  • 62. Big Data Social Network Analysis Can be used for terrorists (and also for potential protestors) - Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf The primary focus is on three aspects of network analysis: 1. Identifying leadership and key actors 2. Revealing underlying structure and intra-network community structure 3. Evolution and decay of social networks
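The first aspect on this slide, identifying key actors, can be illustrated with degree centrality (the number of direct connections each person has). The following is a minimal base-R sketch, not the method of the cited paper; the names and edges are invented for illustration, and a package such as igraph would normally be used for real networks.

```r
# Hypothetical friendship network (names and ties are made up).
edges <- data.frame(
  from = c("Asha", "Asha", "Asha", "Bala", "Chitra"),
  to   = c("Bala", "Chitra", "Dev", "Chitra", "Esha"),
  stringsAsFactors = FALSE
)

people <- sort(unique(c(edges$from, edges$to)))

# Build a symmetric adjacency matrix for the undirected network.
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))
for (i in seq_len(nrow(edges))) {
  adj[edges$from[i], edges$to[i]] <- 1
  adj[edges$to[i], edges$from[i]] <- 1
}

# Degree centrality: number of direct connections per person.
degree <- sort(rowSums(adj), decreasing = TRUE)
print(degree)
```

People at the top of the sorted vector are candidates for "key actors"; other centrality measures (betweenness, closeness) may rank them differently.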
  • 63. R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html ● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data. ● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework. ● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon. ● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks. ● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of histograms such as those created by Mappers in a MapReduce application.
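On the earlier question of whether R has built-in support for multi-core processors: the parallel package, which ships with R, is one option. The sketch below runs an embarrassingly parallel task with mclapply and checks it against a sequential run. The function boot_mean, the data and the choice of two cores are assumptions made for illustration; mclapply relies on forking, so the sketch falls back to one core on Windows.

```r
library(parallel)  # bundled with R, no installation needed

# Hypothetical task: mean of one bootstrap resample, seeded for reproducibility.
boot_mean <- function(seed, values) {
  set.seed(seed)
  mean(sample(values, replace = TRUE))
}

values <- c(2, 4, 6, 8, 10)
seeds <- 1:8

# Forking is unavailable on Windows, so use a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_par <- mclapply(seeds, boot_mean, values = values, mc.cores = cores)
res_seq <- lapply(seeds, boot_mean, values = values)

# Both runs should agree because each task sets its own seed.
print(identical(res_par, res_seq))
```

For a task this small the parallel version will not be faster; the pattern pays off when each call does substantial work.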
  • 64. R -Hadoop Packages https://github.com/RevolutionAnalytics/RHadoop/wiki ● plyrmr - higher level plyr-like data processing for structured data, powered by rmr ● rmr - functions providing Hadoop MapReduce functionality in R ● rhdfs - functions providing file management of the HDFS from within R ● rhbase - functions providing database management for the HBase distributed database from within R http://amplab-extras.github.io/SparkR-pkg/ SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. https://github.com/nexr/RHive RHive is an R extension facilitating distributed computing via HIVE query. RHive allows easy usage of HQL(Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.
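The Map/Reduce pattern that these packages expose can be tried on a single machine with base R alone. The sketch below counts words across chunks of text: each chunk is mapped to partial counts, and the partial counts are then reduced into one result. It does not use the rmr or RHIPE APIs; the chunks and the helper names map_counts and reduce_counts are hypothetical.

```r
# Hypothetical input: each element stands in for one chunk of a large file.
chunks <- list(
  "big data needs big tools",
  "r is a tool for data",
  "data data everywhere"
)

# Map step: each chunk independently emits a table of word counts.
map_counts <- function(text) {
  table(strsplit(text, " ", fixed = TRUE)[[1]])
}

# Reduce step: merge two partial count vectors by summing matching words.
reduce_counts <- function(a, b) {
  words <- union(names(a), names(b))
  merged <- setNames(numeric(length(words)), words)
  merged[names(a)] <- merged[names(a)] + as.numeric(a)
  merged[names(b)] <- merged[names(b)] + as.numeric(b)
  merged
}

partial <- lapply(chunks, map_counts)
word_counts <- Reduce(reduce_counts, partial)
print(sort(word_counts, decreasing = TRUE))
```

On Hadoop the map calls would run on different nodes and the framework would handle the shuffling between steps; here lapply and Reduce play both roles in one process.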
  • 65. R - Cloud Computing http://cran.r-project.org/web/views/WebTechnologies.html
  • 66. R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html Large memory and out-of-memory data ● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory. ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions. ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. ● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table). ● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop. ● The speedglm package fits (generalised) linear models to large data. ● The biglars package by Seligman et al. can use ff to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression. ● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory. ● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
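The idea of "incremental computations" for regression can be sketched in base R: process the data one chunk at a time, accumulate the cross-products X'X and X'y, and solve for the coefficients at the end, so the full data set never has to be in memory at once. This illustrates the general idea only and is not biglm's actual algorithm (biglm uses its own, numerically more careful updating scheme). The data are simulated and the split into ten chunks is an assumption.

```r
set.seed(42)
n <- 1000

# Simulated data standing in for a file too large to load at once.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n, sd = 0.1)

chunk_id <- rep(1:10, each = n / 10)  # pretend the data arrive in 10 chunks

xtx <- matrix(0, 3, 3)  # running sum of X'X
xty <- matrix(0, 3, 1)  # running sum of X'y
for (k in unique(chunk_id)) {
  chunk <- d[chunk_id == k, ]
  X <- cbind(1, chunk$x1, chunk$x2)  # design matrix with intercept column
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}

# Solve the normal equations once all chunks have been seen.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data for comparison.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = d)))
print(cbind(beta_chunked, beta_full))
```

The two columns should agree to within floating-point error, since the sum of per-chunk cross-products equals the cross-product of the full design matrix.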
  • 69. C’est la vie IN INDUSTRY - an R expert is one who knows which package to use from IN RESEARCH - an R expert is one who creates a new, popular and improved package
  • 70. CRAN Views help experts http://cran.r-project.org/web/views/
  • 71. SAP with R Departure of Aeroplanes-SAP Hana 200m http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html R using SAP Hana
  • 72. SAP Hana DB uses R http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
  • 73. Oracle R Enterprise Case Studies and Examples http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
  • 75. Additional http://www.slideshare.net/ajayohri/open-source-analytics Open Source in Analytics (OSSCamp 2014) http://osscamp.in/ http://osscamp.in/events/6/open-source-analytics-overview-r-python-and-others
  • 76. How does this affect decision making? Lots of data. IT is not a support function. Analytical organizations with cross-functional domains and employees as the first line of analysis. Are education and research keeping up?
  • 77. Let's do a revision: Requirements and People
      # Requirements: met vs. unmet
      a=NULL
      a$req=c("Met","Unmet")
      a$counts=c(50,50)
      a=as.data.frame(a)
      a
      pie(a$counts,labels=a$req)
      # People: audience status, coloured with an RColorBrewer palette
      library(RColorBrewer)
      p=NULL
      p$req=c("Satisfied","Unsatisfied","Busy Sleeping")
      p$counts=c(50,40,10)
      p=as.data.frame(p)
      pie(p$counts,labels=p$req,col=brewer.pal(3, "Set1"))
  • 78. Thanks for listening Contact - ohri2007@gmail.com LinkedIn - http://linkedin.com/in/ajayohri Questions please?
  • 79. One more thing: a movie about a murdered IIM batchmate of mine, who fought against corruption, was released just yesterday http://www.imdb.com/title/tt3056632/