SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
University of Milano - Bicocca
Viale dell’Innovazione 10
Building U9, 2nd floor
20126 Milan, Italy
Tel: (+39) 02 6448 2180
Fax: (+39) 02 70056 9114
e-mail: crisp@crisp-org.it
web: www.crisp-org.it
1
R role in Business Intelligence
Software Architecture
CRISP - Inter-university Research Centre
on public Services
Ettore Colombo
Gaithersburg, Maryland
July, 22 2010
2
Introduction to C.R.I.S.P.
Crisp’s main areas of concern:
1. public service development and
demand analysis;
2. analysis of economic system dynamics;
3. unbiased methodologies for quality
estimation of services;
4. technology innovation
• Training and the Labour Market
• Public Health
• Environment and the Quality of Life
• Education and Learning
• Public Utilites
CRISP “Public Services”:
The on-going collaboration and mutual exchange between
several centres of study was rendered official in 1997 by
the creation of a centre of study proposing high-profile
research on public services.
LABOR Project
3
Outcome:
a Statistical Information System (SIS)
integrated in the BI process
statistical models integrated in in BI system
a community of users crossing the province
boundaries
Project Goal:
provide the provinces of Lombardy with a Business
Intelligence (BI) System to analyse their labour
markets.
Open Source Projects
Technological Platform
features
SIS Technological Platform Design
4
Statistical models
Complex, innovative
and domain-dependent
models coming from
Research
Community Feedback
Suggestions and hints
coming from the
experience of the user
community
BI analysis tools
OLAP
Reporting
Dashboard
Extendibility
Adaptability
Flexibility
Integration and
interoperability
Innovative
Communities
No licences to pay
SIS Software Layers
SIS
Data
Presentation
Data Preparation &
Transformation
Data
Storage
DBMS
Data Warehouse
Data Mart
Extraction
Transformation &
Loading - ETL
Data Mining
Data Profiling
Reporting OLAP
DashboardMaps R project
BIRT
R project
R and the Data Transformation &
Preparation Layer: the actors
6
An Open Source Platform for ETL and Data Profiling.
Talend OS is a visual suite (based on Java & Perl) to develop ETL processes
MySQL … the well-known DBMS used at CRISP to develop Data
Warehouses and Data Marts
R project
RMySQL
R scripts
R and its packages …
RMySQL is used to get data from MySQL
A set of R scripts with the algorithms developed at CRISP
Need to run complex data analysis
methods not supported by common
ETL tools - e.g. Clustering method
to classify workers’ careers
Need to run these methods directly
in the ETL processes
R and the Data Transformation &
Preparation Layer: the process
7
R project
1
2
3
1
2
3
During the execution of an ETL process, TALEND launches R via command line
R runs the script on the data from the DBs
R stores the outcome data in dedicated DB tables
R is used to elaborate data with innovative models defined by CRISP researchers
during ETL in a 3-step process
Light but effective (no need to give back data to TALEND)
R and the Data Presentation Layer: the
actors
8
RComponent
RoSuDa RServe
RoSuDa REngine
R project
RMySQL
RgraphViz
 R and its packages …
 RMySQL is used to get data from MySQL
 RgraphViz (Bioconductor) is used to generate graphs
 Rserve is used for TCP/IP communication over the
internet
The Open Source BI platform that is the backbone of the Presentation Layer
An ah-hoc extension of Pentaho to manage the interactions with R (via
Rengine) and preparation of the elements to be shown in Pentaho dashboards
A set of script templates containing placeholders for DB connection and model
parameters
MySQL … the well-known DBMS used at CRISP to develop Data
Warehouses and Data Marts
Rscript templates
Need to graphically represent the
results of the run of complex data
analysis methods - e.g. Markov’s
Chains on workers’ contract type
Need to show these representation
in SIS dashboards
R and the Data Transformation &
Preparation Layer: the front-end process
9
Dashboard
framework
Parameter Input
Form … gender
(Male), nationality
(Italian) and
algorithm params
R and the Data Transformation &
Preparation Layer: the front-end process
10
Probability of
changing
contract type
in 12 months
Parameter Input
Form … gender
(Male), nationality
(Italian) and
algorithm params
R and the Data Transformation &
Preparation Layer: the front-end process
11
Probability of
having a
contract type
in 15 months
Parameter Input
Form … gender
(Male), nationality
(Italian) and
algorithm params
R and the Data Transformation &
Preparation Layer: the front-end process
12
We can change
the inputs and see
what happens …
Parameter Input
Form … gender
(All), nationality
(Italian) and
algorithm params
R and the Data Presentation Layer: the
back-end process
13
1
1
2
3
4
5
6
Pentaho invokes RComponent for a
specific script template and data source
RComponent parses the script template
and generates a new in-memory script and
connects to Rserve
RComponent remotely launches the
execution of the script to Rserve
R runs the script on the data from the DBs
and generates a set of JPGs via Rgraphviz
Rserve takes these pictures and returns
them to RComponent
RComponent prepare an HTML fragment to
be shown in the Pentaho framework
RComponent R project
2 3
4
5
6
R is used to elaborate data and generate graphs to show the outcome of the
execution of algorithms defined by CRISP researchers in a 6-step process
Physical and logical Separation of
concerns
Integration “limited” to visualization
issues
Conclusions
14
NOW
NEXT
FUTURE
Beyond …
R plays an active role in ETL
processes to run complex
statistical analysis - Clustering
on Workers’ careers
R plays an active role the SIS
generating visualization strictly
related to statistical analysis -
Markov’s Chains
Extend the use to other analysis
and models - Clustering on
Workers’ Skills
Extend the use to other models -
Time Series and Geospatial
Analysis
… the Visualization
Use the developed
communication infrastructure
between Pentaho and R to run
different kind of script (e.g.
What-If scenario analysis) giving
back data, not only images
… the Light Integration
Change the paradigm of
communication between Talend
and R in order to enable R to
give back data useful for ETL
processes
R and the Data
Presentation Layer
R and the Data Preparation
& Transformation Layer
15
Web: www.crisp-org.it
E-mail: ettore.colombo@crisp-org.it
Tel: (+39) 02 6448 2172
Fax: (+39) 02 70056 9114
15
FURTHER INFORMATION

Contenu connexe

Tendances

resume-8.1-software
resume-8.1-softwareresume-8.1-software
resume-8.1-software
Tianbo Zhang
 

Tendances (8)

Matlab IEEE Projects
Matlab IEEE ProjectsMatlab IEEE Projects
Matlab IEEE Projects
 
Master Thesis LTE Projects for Students
Master Thesis LTE Projects for StudentsMaster Thesis LTE Projects for Students
Master Thesis LTE Projects for Students
 
Latest Thesis Topics in Wireless Communication
Latest Thesis Topics in Wireless CommunicationLatest Thesis Topics in Wireless Communication
Latest Thesis Topics in Wireless Communication
 
resume-8.1-software
resume-8.1-softwareresume-8.1-software
resume-8.1-software
 
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream ReasoningA Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
 
Matlab Projects UK
Matlab Projects UKMatlab Projects UK
Matlab Projects UK
 
Matlab Projects for MCA Students
Matlab Projects for MCA StudentsMatlab Projects for MCA Students
Matlab Projects for MCA Students
 
Matlab Projects for Electrical Students
Matlab Projects for Electrical StudentsMatlab Projects for Electrical Students
Matlab Projects for Electrical Students
 

Similaire à Colombo+ronzoni+fontana

Prakash_Profile(279074)
Prakash_Profile(279074)Prakash_Profile(279074)
Prakash_Profile(279074)
Prakash s
 
Using R for Cyber Security Part 1
Using R for Cyber Security Part 1Using R for Cyber Security Part 1
Using R for Cyber Security Part 1
Ajay Ohri
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
OllieShoresna
 
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA  PLANE AND OPTIMIZATION POSSIBILITIES STUDY ON EMERGING APPLICATIONS ON DATA  PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
ijdpsjournal
 
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIESSTUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
ijdpsjournal
 

Similaire à Colombo+ronzoni+fontana (20)

statistical computation using R- report
statistical computation using R- reportstatistical computation using R- report
statistical computation using R- report
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
 
Towards a Resource Slice Interoperability Hub for IoT
Towards a Resource Slice Interoperability Hub for IoTTowards a Resource Slice Interoperability Hub for IoT
Towards a Resource Slice Interoperability Hub for IoT
 
Planetdata simpda
Planetdata simpdaPlanetdata simpda
Planetdata simpda
 
PlanetData: Consuming Structured Data at Web Scale
PlanetData: Consuming Structured Data at Web ScalePlanetData: Consuming Structured Data at Web Scale
PlanetData: Consuming Structured Data at Web Scale
 
Report for-smart-trash-project
Report for-smart-trash-project Report for-smart-trash-project
Report for-smart-trash-project
 
#CeDEM17 - Towards an Open Data based ICT Reference Architecture for Smart Ci...
#CeDEM17 - Towards an Open Data based ICT Reference Architecture for Smart Ci...#CeDEM17 - Towards an Open Data based ICT Reference Architecture for Smart Ci...
#CeDEM17 - Towards an Open Data based ICT Reference Architecture for Smart Ci...
 
Prakash_Profile(279074)
Prakash_Profile(279074)Prakash_Profile(279074)
Prakash_Profile(279074)
 
Revolution R: 100% R and more
Revolution R: 100% R and moreRevolution R: 100% R and more
Revolution R: 100% R and more
 
Using R for Cyber Security Part 1
Using R for Cyber Security Part 1Using R for Cyber Security Part 1
Using R for Cyber Security Part 1
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
 
CORE final workshop introduction
CORE final workshop introductionCORE final workshop introduction
CORE final workshop introduction
 
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA  PLANE AND OPTIMIZATION POSSIBILITIES STUDY ON EMERGING APPLICATIONS ON DATA  PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
 
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIESSTUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
 
FDS_dept_ppt.pptx
FDS_dept_ppt.pptxFDS_dept_ppt.pptx
FDS_dept_ppt.pptx
 
Smart App@Pivotal by Dat Tran
Smart App@Pivotal by Dat TranSmart App@Pivotal by Dat Tran
Smart App@Pivotal by Dat Tran
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
SWANSTAT: A user-friendly web application for data analysis using shinydashbo...
SWANSTAT: A user-friendly web application for data analysis using shinydashbo...SWANSTAT: A user-friendly web application for data analysis using shinydashbo...
SWANSTAT: A user-friendly web application for data analysis using shinydashbo...
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 

Plus de Ajay Ohri

Plus de Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 

Colombo+ronzoni+fontana

  • 1. University of Milano - Bicocca Viale dell’Innovazione 10 Building U9, 2nd floor 20126 Milan, Italy Tel: (+39) 02 6448 2180 Fax: (+39) 02 70056 9114 e-mail: crisp@crisp-org.it web: www.crisp-org.it 1 R role in Business Intelligence Software Architecture CRISP - Inter-university Research Centre on public Services Ettore Colombo Gaithersburg, Maryland July, 22 2010
  • 2. 2 Introduction to C.R.I.S.P. Crisp’s main areas of concern: 1. public service development and demand analysis; 2. analysis of economic system dynamics; 3. unbiased methodologies for quality estimation of services; 4. technology innovation • Training and the Labour Market • Public Health • Environment and the Quality of Life • Education and Learning • Public Utilites CRISP “Public Services”: The on-going collaboration and mutual exchange between several centres of study was rendered official in 1997 by the creation of a centre of study proposing high-profile research on public services.
  • 3. LABOR Project 3 Outcome: a Statistical Information System (SIS) integrated in the BI process statistical models integrated in in BI system a community of users crossing the province boundaries Project Goal: provide the provinces of Lombardy with a Business Intelligence (BI) System to analyse their labour markets.
  • 4. Open Source Projects Technological Platform features SIS Technological Platform Design 4 Statistical models Complex, innovative and domain-dependent models coming from Research Community Feedback Suggestions and hints coming from the experience of the user community BI analysis tools OLAP Reporting Dashboard Extendibility Adaptability Flexibility Integration and interoperability Innovative Communities No licences to pay
  • 5. SIS Software Layers SIS Data Presentation Data Preparation & Transformation Data Storage DBMS Data Warehouse Data Mart Extraction Transformation & Loading - ETL Data Mining Data Profiling Reporting OLAP DashboardMaps R project BIRT R project
  • 6. R and the Data Transformation & Preparation Layer: the actors 6 An Open Source Platform for ETL and Data Profiling. Talend OS is a visual suite (based on Java & Perl) to develop ETL processes MySQL … the well-known DBMS used at CRISP to develop Data Warehouses and Data Marts R project RMySQL R scripts R and its packages … RMySQL is used to get data from MySQL A set of R scripts with the algorithms developed at CRISP Need to run complex data analysis methods not supported by common ETL tools - e.g. Clustering method to classify workers’ careers Need to run these methods directly in the ETL processes
  • 7. R and the Data Transformation & Preparation Layer: the process 7 R project 1 2 3 1 2 3 During the execution of an ETL process, TALEND launches R via command line R runs the script on the data from the DBs R stores the outcome data in dedicated DB tables R is used to elaborate data with innovative models defined by CRISP researchers during ETL in a 3-step process Light but effective (no need to give back data to TALEND)
  • 8. R and the Data Presentation Layer: the actors 8 RComponent RoSuDa RServe RoSuDa REngine R project RMySQL RgraphViz  R and its packages …  RMySQL is used to get data from MySQL  RgraphViz (Bioconductor) is used to generate graphs  Rserve is used for TCP/IP communication over the internet The Open Source BI platform that is the backbone of the Presentation Layer An ah-hoc extension of Pentaho to manage the interactions with R (via Rengine) and preparation of the elements to be shown in Pentaho dashboards A set of script templates containing placeholders for DB connection and model parameters MySQL … the well-known DBMS used at CRISP to develop Data Warehouses and Data Marts Rscript templates Need to graphically represent the results of the run of complex data analysis methods - e.g. Markov’s Chains on workers’ contract type Need to show these representation in SIS dashboards
  • 9. R and the Data Transformation & Preparation Layer: the front-end process 9 Dashboard framework Parameter Input Form … gender (Male), nationality (Italian) and algorithm params
  • 10. R and the Data Transformation & Preparation Layer: the front-end process 10 Probability of changing contract type in 12 months Parameter Input Form … gender (Male), nationality (Italian) and algorithm params
  • 11. R and the Data Transformation & Preparation Layer: the front-end process 11 Probability of having a contract type in 15 months Parameter Input Form … gender (Male), nationality (Italian) and algorithm params
  • 12. R and the Data Transformation & Preparation Layer: the front-end process 12 We can change the inputs and see what happens … Parameter Input Form … gender (All), nationality (Italian) and algorithm params
  • 13. R and the Data Presentation Layer: the back-end process 13 1 1 2 3 4 5 6 Pentaho invokes RComponent for a specific script template and data source RComponent parses the script template and generates a new in-memory script and connects to Rserve RComponent remotely launches the execution of the script to Rserve R runs the script on the data from the DBs and generates a set of JPGs via Rgraphviz Rserve takes these pictures and returns them to RComponent RComponent prepare an HTML fragment to be shown in the Pentaho framework RComponent R project 2 3 4 5 6 R is used to elaborate data and generate graphs to show the outcome of the execution of algorithms defined by CRISP researchers in a 6-step process Physical and logical Separation of concerns Integration “limited” to visualization issues
  • 14. Conclusions 14 NOW NEXT FUTURE Beyond … R plays an active role in ETL processes to run complex statistical analysis - Clustering on Workers’ careers R plays an active role the SIS generating visualization strictly related to statistical analysis - Markov’s Chains Extend the use to other analysis and models - Clustering on Workers’ Skills Extend the use to other models - Time Series and Geospatial Analysis … the Visualization Use the developed communication infrastructure between Pentaho and R to run different kind of script (e.g. What-If scenario analysis) giving back data, not only images … the Light Integration Change the paradigm of communication between Talend and R in order to enable R to give back data useful for ETL processes R and the Data Presentation Layer R and the Data Preparation & Transformation Layer
  • 15. 15 Web: www.crisp-org.it E-mail: ettore.colombo@crisp-org.it Tel: (+39) 02 6448 2172 Fax: (+39) 02 70056 9114 15 FURTHER INFORMATION