1. DataLab for Scientists – A new way of working
S. ANGELI / C. CORMONT / A. GREVIN
29/09/2016
2. Copyright © Capgemini 2015. All Rights Reserved
Scientists' methodology for data analytics
• Posed problem
• Raw bulk data
• Preprocessing & automatic reconciliation
• Data frames
• Algorithms for categorical data
• Algorithms for numerical data
• Algorithms for textual data
• Descriptive modelling: statistical tests, contingency & correlation matrices, factorial analysis, hierarchical clustering, important variable extraction, semantic graph construction
• Predictive modelling: logistic regression, discriminant analysis, decision trees, k-means, Kohonen maps, supervised neural networks, document analysis
3. Copyright © 2016 Capgemini and Sogeti. All rights reserved.
Current issues with traditional tools & way of working
No scalability
Lack of collaborative tools
Lack of visualization tools
Several tools and languages: lack of integration
Need to accelerate the path from R&D to real industrialization
4. Inside the highly iterative journey
Roles: Applications Expert, Data Scientist, Business Process Expert, Technical Architect
Steps:
• Define use case
• Architecture design & infra setup
• Data extraction
• Data manipulation
• Run mathematical algorithms
• Understand results
Tools:
• Hortonworks
• RHadoop
• MapReduce
5. Our use case
Data size: 200 MB to 20 GB
R loads all data into memory on the local machine, so jobs take too long (hours) or cannot run at all
1) RStudio & RHadoop
2) Dataiku & PySpark
6. First architecture with RStudio & RHadoop
7. REX IMC2 | 29/09/2016
DATAIKU: AN INTEGRATED DATA SCIENCE PLATFORM
One product – one environment – one platform
For all your data science projects and predictive applications
DESIGN: Quickly create your predictive models and the associated workflows by combining visual components and programming languages in a common environment
PRODUCTION: Deploy your predictive applications in production using advanced automation of workflows, and expose your machine learning models via APIs
FOR CLICKERS: Acquire, prepare, filter, join, copy your data with visual components…
FOR CODERS: Use your favorite (big data) programming languages to add arbitrary custom logic…
8. Dataiku vs. other software
Categories: ETL | Analysis - ML | Visualization
• Dataiku
• Talend
• Scikit-learn
• Spotfire
• QlikView
9. Results
RStudio - Hadoop:
• ~20 min
• Code only
• IDE
• Charts with R
Dataiku - Spark:
• ~7 min
• Code / visual operations / visual workflows
• Notebook
• Charts with Dataiku
• Split workflows
• Run partial jobs
12. RStudio - RHadoop
RHadoop: MapReduce, HDFS, HBase and Avro APIs
mapper(..[R function])
.
.
reducer(..[R function])
Use R functions to manipulate data
map (disk) → reduce (disk)
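The mapper/reducer pattern above can be sketched in a few lines. This is a minimal illustration in plain Python, not RHadoop's API: the names `mapper`, `reducer`, `run_mapreduce` and `word_count` are hypothetical, the word-count task is an assumed example, and the intermediate file only mimics the way a Hadoop MapReduce job passes data between stages through disk.

```python
import json
import os
import tempfile
from collections import defaultdict


def mapper(line):
    """Emit one (word, 1) pair per word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(key, values):
    """Combine all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(lines, workdir):
    # Map stage: write every emitted pair to an intermediate file on disk.
    spill_path = os.path.join(workdir, "map_output.jsonl")
    with open(spill_path, "w", encoding="utf-8") as spill:
        for line in lines:
            for key, value in mapper(line):
                spill.write(json.dumps([key, value]) + "\n")

    # Shuffle stage: read the pairs back from disk and group them by key.
    grouped = defaultdict(list)
    with open(spill_path, encoding="utf-8") as spill:
        for record in spill:
            key, value = json.loads(record)
            grouped[key].append(value)

    # Reduce stage: apply the reducer once per key.
    return dict(reducer(key, values) for key, values in grouped.items())


def word_count(lines):
    """Run the sketch inside a temporary working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        return run_mapreduce(lines, workdir)
```

Calling `word_count(["big data big", "data lab"])` runs both stages. The round trip through the intermediate file is the part that becomes costly when many jobs are chained.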
13. Dataiku with Spark (PySpark – Rspark)
map()
reduce()
reduceByKey()
filter()
join()
groupBy()
Use Spark functions to manipulate data
map (memory) → reduce (memory)
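To make the semantics of the operations listed above concrete, here is a plain-Python analogue that keeps every intermediate result in memory. It is a sketch only: it does not call Spark, the helpers `reduce_by_key` and `join` are hypothetical stand-ins for Spark's `reduceByKey()` and `join()`, and the sensor readings are invented sample data.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn (analogue of reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


def join(left, right):
    """Inner join of two lists of (key, value) pairs (analogue of join)."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


# Invented sample data: (sensor id, reading) and (sensor id, unit).
readings = [("s1", 20.5), ("s2", -999.0), ("s1", 22.0), ("s2", 18.0)]
units = [("s1", "degC"), ("s2", "degC")]

# filter(): drop the sentinel value used for a missing reading.
valid = list(filter(lambda kv: kv[1] > -100, readings))

# reduceByKey(): keep the maximum reading per sensor.
maxima = reduce_by_key(valid, max)

# join(): attach the unit to each per-sensor maximum.
labelled = join(maxima, units)

# map() then reduce(): sum of all valid readings.
total = reduce(lambda a, b: a + b, map(lambda kv: kv[1], valid))
```

Each step consumes the in-memory result of the previous one, which is the contrast with the disk-based chaining of the MapReduce approach.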
14. Data in Dataiku and Spark
• Dataiku Dataset: tables in Dataiku
• Spark DataFrames: distributed, SQL-like tables
• Spark RDD: distributed lists
APIs: Dataiku API, DataFrames API
15. Dataiku – Spark – YARN
Dataiku Server
Spark (R - Python - SQL)
YARN
ResourceManager / NodeManager
Executor, Executor, Executor
16. RStudio - Hadoop vs. Dataiku - Spark
RStudio - Hadoop:
• ~20 min
• Code only
• IDE
• Data preparation needed
• Charts with R
Dataiku - Spark:
• ~7 min
• Code / visual operations / visual workflows
• Notebook
• Data preparation needed
• Charts with Dataiku and R
• Split workflows
• Run partial jobs
17. www.capgemini.com
The information contained in this presentation is proprietary.
Copyright © 2016 Capgemini and Sogeti. All rights reserved.
Rightshore® is a trademark belonging to Capgemini.
www.sogeti.com
About Capgemini and Sogeti
With more than 180,000 people in over 40 countries, Capgemini is a
global leader in consulting, technology and outsourcing services. The
Group reported 2015 global revenues of EUR 11.9 billion. Together
with its clients, Capgemini creates and delivers business, technology
and digital solutions that fit their needs, enabling them to achieve
innovation and competitiveness. A deeply multicultural organization,
Capgemini has developed its own way of working, the Collaborative
Business Experience™, and draws on Rightshore®, its worldwide
delivery model.
Sogeti is a leading provider of technology and software testing,
specializing in Application, Infrastructure and Engineering
Services. Sogeti offers cutting-edge solutions around Testing,
Business Intelligence & Analytics, Mobile, Cloud and Cyber
Security. Sogeti brings together more than 23,000 professionals in
15 countries and has a strong local presence in over 100 locations
in Europe, USA and India. Sogeti is a wholly-owned subsidiary of
Cap Gemini S.A., listed on the Paris Stock Exchange.