SlideShare une entreprise Scribd logo
1  sur  55
@CasertaConcepts#DataSummit
Introduction to Data Science
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond
@CasertaConcepts
@CasertaConcepts#DataSummit
About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
@CasertaConcepts#DataSummit
Caserta Client Portfolio
Retail/eCommerce
& Manufacturing
Finance, Healthcare
& Insurance
Digital Media/AdTech
Education & Services
@CasertaConcepts#DataSummit
Awards & Recognition
Top 10
Fastest Growing
Big Data Companies
2016
@CasertaConcepts#DataSummit
Our Partners
@CasertaConcepts#DataSummit
Agenda
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is Cool?
• What does a Data Scientist do?
• Standards for Data Science
• Business Objective
• Data Discovery
• Preparation
• Models
• Evaluation
• Deployment
• Q & A
@CasertaConcepts#DataSummit
Big Data Analysis: Timeline of Society Media
1500s
Printing Press
1840s
Penny Post
1850s
Telegraph
1850s
Rural Free Post
1890s
Telephone
1900s
Radio
1950s
TV
1970s
PCs
1980s
Internet
1990s
Web
2000s
Social Media, Mobile, Big Data, Cloud
98,000+ Tweets
695,000 Status Updates
11 Million instant messages
698,445 Google Searches
168 million+ emails sent
1,829 TB of data created
217 new mobile web
users
Every 60 Seconds
@CasertaConcepts#DataSummit
Data is your Differentiator
63% of organizations realize a positive return on
analytic investments within a year
69% of speed-driven analytics organizations
created a positive impact on business outcomes
74% of respondents anticipate a speed at which
executives expect new data-driven insights will
continue to accelerate
@CasertaConcepts#DataSummit
Understanding the Customer Journey
Awareness Consideration Purchase Service
Loyalty
Expansion
PR
Radio
TV
Print
Outdoor
Word of Mouth
Direct Mail
Customer Service
Physical Touchpoints
Digital Touchpoints
Search
Paid Content
email
Website/
Landing Pages
Social Media
Community
Chat
Social Media
Call Center
Offers
Mailings
Survey
Loyalty Programs
email
Agents
Partners
Ads
Website
Mobile
3rd Party Sites
Offers
Web self-service
@CasertaConcepts#DataSummit
Type
Comments
Single Touch Rules-Based Statistically Driven
Assign the credit
to the first or last
exposure
Assign the credit to
each interaction
based on business
rules
Assign the credit to
interactions based
on data-driven
model
Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click
100% 33% 33% 33% 27% 49% 24%
- Last touch only
- Ignores bulk of
customer journey
- Undervalues
other interactions
and influencers
- Subjective
- Assigns arbitrary
values to each
interaction
- Lacks analytics rigor
to determine weights
ü Looks at full behavior
patterns
ü Consider all touch points
ü Can apply different
models for best results
ü Use data to find
correlations between
touch points (winning
combinations)
Understanding Touchpoint Methods
@CasertaConcepts#DataSummit
What is Data Science?
@CasertaConcepts#DataSummit
Business Value
Cloud-based Data Lake
Big Data Analysis: The Ecosystem of the future
Analyze
Persist
DeployIngest
Data Integration
Identity Resolution
Data Quality
Discovery Exploration
Machine Learning
Models Development
Reports / Dashboards
Applications
APIs
Structured Data
Unstructured Data
SQL, NoSQL, Object Store
Find Share Collaborate
Data Engineer Data Scientist Business Analyst App Developer
Provides innovative and industry
leading technologies to rapidly be
applied to the business without
having to manage compatibility
and data complexity.
Technical Value
Provides an open
framework to reduce the
number of integration
points and testing
environments to deliver
business solutions.
@CasertaConcepts#DataSummit
Progression of Business Analytics to Data Science
Descriptive
Analytics
Diagnostic
Analytics
Predictive
Analytics
Prescriptive Analytics
What
happened?
Why did it
happen?
What will
happen?
How can we make
It happen?
Data Analytics Sophistication
BusinessValue
Hindsight
Insight
Foresight
Information
Optimization
Cognitive
Analytics
Influence what happens
Reports  Correlations  Predictions  Recommendations 
Monetization
Interactions
Action
@CasertaConcepts#DataSummit
Progression of Data Science Maturity
• Timeline
• Tools
• Available libraries
• Best practices
@CasertaConcepts#DataSummit
What are the Realities of the Data Scientist
— Searching for the data they need
— Making sense of the data
— Figuring why the data looks the way is does and assessing its validity
— Cleaning up all the garbage within the data so it represents true business
— Combining events with Reference data to give it context
— Correlating event data with other events
— Finally, they implement algorithms to perform mining, clustering and predictive analytics
— Writes really cool and sophisticated
algorithms that impacts the way the business
runs.
— Much of the time of a Data Scientist is spent:
— NOT
@CasertaConcepts#DataSummit
Why Data Science now?
• Costs of compute and storage dramatically lower than just a few years ago
• Data generated by all aspects of society has dramatically increased
• Need to efficiently learn what there is to learn from our data
@CasertaConcepts#DataSummit
The Data Scientist Winning Trifecta
Modern Data
Engineering/Data
Preparation
Domain
Knowledge/Bu
siness
Expertise
Advanced
Mathematics/
Statistics
- Computer Science
- Programming/Storage
- Data Quality
- Visualization
Algorithms -
A/B Testing -
- Data and Outcome
- Sensibility
@CasertaConcepts#DataSummit
Modern Data Engineering
@CasertaConcepts#DataSummit
Which Visualization, When?
@CasertaConcepts#DataSummit
Advanced Mathematics / Statistics
@CasertaConcepts#DataSummit
Domain and Outcome Sensibility
@CasertaConcepts#DataSummit
Is Data Trying To Trick You?
Correlation: 99.26%
@CasertaConcepts#DataSummit
Are we Considering the Right Factors?
@CasertaConcepts#DataSummit
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
@CasertaConcepts#DataSummit
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump in to the data, or begin selection of
the appropriate model(s) or algorithm
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business requirements
into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout
the entire process.
@CasertaConcepts#DataSummit
Data
ScientistBusiness
Analyst
Business Stakeholders
Business Stakeholders
Business Stakeholders
Interview notes
Requirement Document
Models / Insights
Gathering Requirements
@CasertaConcepts#DataSummit
Data Science Scrum Team
Data
Scientist
Business
Stakeholders
Data
Engineer
Efficient Inclusive
EffectiveInteractive
Data
Analyst
@CasertaConcepts#DataSummit
2. Data Understanding
• Data Discovery  understand where the data you need comes
from
• Data Profiling  interrogate the data at the entity level,
understand key entities and fields that are relevant to the
analysis.
• Cleansing Requirements  understand data quality, data
density, skew, etc
• Data Munging  collocate, blend and analyze data for early
insights! Valuable information can be achieved from simple
group-by, aggregate queries, and even more with SQL Jujitsu!
Significant iteration between Business Understanding and Data
Understanding phases.
Sample
Exploration tools
for Hadoop:
Trifacta, Paxata,
Spark, Python,
Waterline,
Elasticsearch
@CasertaConcepts#DataSummit
Data Science Data Quality Priorities
Be
Corrective
Be Fast
Be
Transparent
Be Thorough
@CasertaConcepts#DataSummit
Data Science Data Quality Priorities
Data Quality
SpeedtoValue
Fast
Slow
Raw Refined
Data Scientist’s Tightrope
Does Data munging in a data science
lab need the same restrictive
governance and enterprise reporting?
@CasertaConcepts#DataSummit
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientists time goes into Data Preparation!
• Locating and acquiring valuable data sources
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values,
whitespace, bad data-points
• Join/Enrich disparate datasets
• Derive behavioral features
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
@CasertaConcepts#DataSummit
We Spark
• Development local or distributed is identical
• Beautiful high level API’s
• Full universe of Python modules
• Open source and Free
• Blazing fast!
Spark has become our default processing engine for a data engineering & science
@CasertaConcepts#DataSummit
Data Preparation
• We love Spark!
• ETL can be done in Scala,
Python or SQL
• Cleansing, transformation,
and standardization
• Address Parsing:
usaddress, postal-address,
etc
• Name Hashing: fuzzy, etc
• Genderization:
sexmachine, etc
• And all the goodies of the
standard Python library!
• Parallelize workload
against a large number of
machines in Hadoop
cluster
@CasertaConcepts#DataSummit
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL Toolkit
• Each error instance of each data quality
check is captured
• Implemented as sub-system after
ingestion
• Each fact stores unique identifier of the
defective source row
HAMBot: ‘open
source’ project
created in Caserta
Innovation Lab
(CIL)
@CasertaConcepts#DataSummit
Data Preparation Demonstration! Wifi:
Hilton Meeting Room Wifi
infotoday2017
Follow along: http://bit.ly/2r9ABcK
File: SanFranCrime.ipynb
@CasertaConcepts#DataSummit
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation
techniques (ie. Sparse Vector Format)
• Additionally we may discover the need for additional data points,
or uncover additional data quality issues!
@CasertaConcepts#DataSummit
Modeling in Hadoop
• Spark works well
• SAS, SPSS, Etc. not
native on Hadoop
• R and Python
becoming new
standard
• PMML can be used,
but approach with
caution
@CasertaConcepts#DataSummit
Machine Learning
The goal of machine learning is to get software to make decisions and learn
from data without being programed explicitly to do so
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning  inferring functions based on labeled training data
• Unsupervised learning  finding hidden structure/patterns within data, no
training data is supplied
We will review some popular, easy to understand machine learning
algorithms
@CasertaConcepts#DataSummit
What to use When?
@CasertaConcepts#DataSummit
Supervised Learning
Name Weight Color Cat_or_Dog
Susie 9lbs Orange Cat
Fido 25lbs Brown Dog
Sparkles 6lbs Black Cat
Fido 9lbs Black Dog
Name Weight Color Cat_or_Dog
Misty 5lbs Orange ?
The training set is used to generate a function
..so we can predict if we have a cat or dog!
@CasertaConcepts#DataSummit
Category or Values?
There are several classes of algorithms depending on whether the prediction is a
category (like cat or dog) or a value, like the value of a home.
Classification algorithms are generally well fit for categorization, while algorithms
like Regression and Decision Trees are well suited for predicting “continuous”
values.
@CasertaConcepts#DataSummit
Regression
• Understanding the relationship between a given set of dependent variables
and independent variables
• Typically regression is used to predict the output of a dependent variable
based on variations in independent variables
• Very popular for prediction and forecasting
Linear Regression
@CasertaConcepts#DataSummit
Decision Trees
• A method for predicting outcomes based on the features of data
• Model is represented a easy to understand tree structure of if-else statements
Weight > 10lbs
color = orange
cat
yes
no
name = fido
no
no
dogyes
dog
cat
yes
@CasertaConcepts#DataSummit
Unsupervised K-Means
• Treats items as coordinates
• Places a number of random “centroids”
and assigns the nearest items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
Clustering of items into logical groups based on natural patterns in data
Uses:
• Cluster Analysis
• Classification
• Content Filtering
@CasertaConcepts#DataSummit
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory
Based)
• Leveraging collaboration between multiple agents to filter, project, or detect
patterns
• Popular in recommender systems for projecting the “taste” for of specific
individuals for items they have not yet expressed one.
@CasertaConcepts#DataSummit
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
for every item i that u has no preference for yet
for every item j that u has a preference for
compute a similarity s between i and j
add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
• First a matrix of Item to Item similarity is calculated based on user rating
• Then recommendations are created by producing a weighted sum of top items,
based on the users previously rated items
@CasertaConcepts#DataSummit
Data Science Demonstration!
Follow along: http://bit.ly/2r9ABcK
File: SanFranCrime_model_DataSummit.ipynb
Wifi:
Hilton Meeting Room Wifi
infotoday2017
@CasertaConcepts#DataSummit
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
@CasertaConcepts#DataSummit
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new
insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Data Governance applied
• PMML (Predictive Model Markup Langauge): XML based interchange format
@CasertaConcepts#DataSummit
•This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
•Definitions, lineage (where does this data come from), business definitions, technical
metadata
Organization
•Identify and control sensitive data, regulatory compliance
Metadata
•Data must be complete and correct. Measure, improve, certify
Privacy/Security
•Policies around data frequency, source availability, etc.
Data Quality and Monitoring
•Ensure consistent business critical data i.e. Members, Providers, Agents, etc.
Business Process Integration
•Data retention, purge schedule, storage/archiving
Master Data Management
Information Lifecycle
Management (ILM)
Components of Data Governance
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, drools?)
• Quality checks not only SQL: machine learning, Pig and Map Reduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, Core component of business operations
For Data Science
@CasertaConcepts#DataSummit
Ingest Raw
Data
Organize, Define,
Complete
Munging, Blending
Machine Learning
Data Quality and
Monitoring
Metadata, ILM , Security
Data Catalog
Data Integration
Fully Governed ( trusted)
Arbitrary/Ad-hoc Queries
and Reporting
Big
Data
Ware
house
Data Science Workspace
Data Lake – Integrated Sandbox
Landing Area – Source Data in “Full Fidelity”
Usage Pattern Data Governance
Metadata, ILM,
Security
Corporate Data Pyramid (CDP)
@CasertaConcepts#DataSummit
Chief Data Organization (Oversight)
Vertical Business Area
[Sales/Finance/Marketing/Operations/Customer Svc]
Product Owner
SCRUM Master
Development Team
Business Subject Matter Expertise
Data Librarian/Data Stewardship
Data Science/ Statistical Skills
Data Engineering / Architecture
Presentation/ BI Report Development Skills
Data Quality Assurance
DevOps
IT Organization
(Oversight)
Enterprise Data Architect
Solution Engineers
Data Integration Practice
User Experience Practice
QA Practice
Operations Practice
Advanced Analytics
Business Analysts
Data Analysts
Data Scientists
Statisticians
Data Engineers
Planning Organization
Project Managers
Data Organization
Data Gov Coordinator
Data Librarians
Data Stewards
Analytics-Driven Organization
@CasertaConcepts#DataSummit
Technologies & Techniques
• The Cloud and Spark can provide a relatively low cost and extremely scalable
platform for Data Science
• AWS S3 and Google GCS offers great scalability and speed to value without the
overhead of structuring data
• Spark, with MLlib offers a great library of established Machine Learning algorithms,
reducing development efforts
• Python and SQL are choices for Data Science
• Go Agile and follow Best Practices (CRISP-DM)
• Employ Data Pyramid concepts to ensure data has just enough governance
@CasertaConcepts#DataSummit
Some Thoughts – Enable the Future
— Data Science requires the convergence of data
quality, advanced math, data engineering and
visualization and business smarts
— Make sure your data can be trusted and people can
be held accountable for impact caused by low data
quality.
— Good data scientists are rare: It will take a village
to achieve all the tasks required for effective data
science
— Get good!
— Be great!
— Blaze new trails!
https://explore-data-science.thisismetis.com
Data Science Training:
• Big Data Warehousing Meetup
• New York City
• 4,300+ members
• Knowledge sharing
@CasertaConcepts#DataSummit
Thank You / Q&A
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond

Contenu connexe

Tendances

DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDATAVERSITY
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introductionkrishna singh
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data ScienceSpotle.ai
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?Bernard Marr
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science IntroductionGang Tao
 
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...Edureka!
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science clubData Science Club
 

Tendances (20)

DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data Quality
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science
Data ScienceData Science
Data Science
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data Science
Data ScienceData Science
Data Science
 
Practical Data Science 
Use-cases in Retail & eCommerce
Practical Data Science 
Use-cases in Retail & eCommercePractical Data Science 
Use-cases in Retail & eCommerce
Practical Data Science 
Use-cases in Retail & eCommerce
 
Data science Big Data
Data science Big DataData science Big Data
Data science Big Data
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data science
Data scienceData science
Data science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data analytics
Data analyticsData analytics
Data analytics
 
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
 
Data science big data and analytics
Data science big data and analyticsData science big data and analytics
Data science big data and analytics
 

En vedette

Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsNhatHai Phan
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in PythonImry Kissos
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning SystemsXavier Amatriain
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
 

En vedette (20)

Modern Data Science
Modern Data ScienceModern Data Science
Modern Data Science
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 

Similaire à Caserta Concepts Data Summit Agenda

Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupCaserta
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It? Caserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Delivering Value Through Business Analytics
Delivering Value Through Business AnalyticsDelivering Value Through Business Analytics
Delivering Value Through Business AnalyticsSocial Media Today
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseCaserta
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
 
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityDATAVERSITY
 
Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018LoQutus
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 

Similaire à Caserta Concepts Data Summit Agenda (20)

Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing Meetup
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AI
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Delivering Value Through Business Analytics
Delivering Value Through Business AnalyticsDelivering Value Through Business Analytics
Delivering Value Through Business Analytics
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
 
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
 
Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018
 
Lean Analytics: How to get more out of your data science team
Lean Analytics: How to get more out of your data science teamLean Analytics: How to get more out of your data science team
Lean Analytics: How to get more out of your data science team
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 

Plus de Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data LakeCaserta
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWSCaserta
 

Plus de Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 

Dernier

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Dernier (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Caserta Concepts Data Summit Agenda

  • 1. @CasertaConcepts#DataSummit Introduction to Data Science Joe Caserta Bill Walrond @joe_Caserta @bill_walrond @CasertaConcepts
  • 2. @CasertaConcepts#DataSummit About Caserta Concepts • Consulting Data Innovation and Modern Data Engineering • Award-winning company • Internationally recognized work force • Strategy, Architecture, Implementation, Governance • Innovation Partner • Strategic Consulting • Advanced Architecture • Build & Deploy • Leader in Enterprise Data Solutions • Big Data Analytics • Data Warehousing • Business Intelligence • Data Science • Cloud Computing • Data Governance
  • 3. @CasertaConcepts#DataSummit Caserta Client Portfolio Retail/eCommerce & Manufacturing Finance, Healthcare & Insurance Digital Media/AdTech Education & Services
  • 4. @CasertaConcepts#DataSummit Awards & Recognition Top 10 Fastest Growing Big Data Companies 2016
  • 6. @CasertaConcepts#DataSummit Agenda • Why we care about Big Data • Challenges of working with Big Data • Governing Big Data for Data Science • Introducing the Data Pyramid • Why Data Science is Cool? • What does a Data Scientist do? • Standards for Data Science • Business Objective • Data Discovery • Preparation • Models • Evaluation • Deployment • Q & A
  • 7. @CasertaConcepts#DataSummit Big Data Analysis: Timeline of Society Media 1500s Printing Press 1840s Penny Post 1850s Telegraph 1850s Rural Free Post 1890s Telephone 1900s Radio 1950s TV 1970s PCs 1980s Internet 1990s Web 2000s Social Media, Mobile, Big Data, Cloud 98,000+ Tweets 695,000 Status Updates 11 Million instant messages 698,445 Google Searches 168 million+ emails sent 1,829 TB of data created 217 new mobile web users Every 60 Seconds
  • 8. @CasertaConcepts#DataSummit Data is your Differentiator 63% of organizations realize a positive return on analytic investments within a year 69% of speed-driven analytics organizations created a positive impact on business outcomes 74% of respondents anticipate a speed at which executives expect new data-driven insights will continue to accelerate
  • 9. @CasertaConcepts#DataSummit Understanding the Customer Journey Awareness Consideration Purchase Service Loyalty Expansion PR Radio TV Print Outdoor Word of Mouth Direct Mail Customer Service Physical Touchpoints Digital Touchpoints Search Paid Content email Website/ Landing Pages Social Media Community Chat Social Media Call Center Offers Mailings Survey Loyalty Programs email Agents Partners Ads Website Mobile 3rd Party Sites Offers Web self-service
  • 10. @CasertaConcepts#DataSummit Type Comments Single Touch Rules-Based Statistically Driven Assign the credit to the first or last exposure Assign the credit to each interaction based on business rules Assign the credit to interactions based on data-driven model Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click 100% 33% 33% 33% 27% 49% 24% - Last touch only - Ignores bulk of customer journey - Undervalues other interactions and influencers - Subjective - Assigns arbitrary values to each interaction - Lacks analytics rigor to determine weights ü Looks at full behavior patterns ü Consider all touch points ü Can apply different models for best results ü Use data to find correlations between touch points (winning combinations) Understanding Touchpoint Methods
  • 12. @CasertaConcepts#DataSummit Business Value Cloud-based Data Lake Big Data Analysis: The Ecosystem of the future Analyze Persist DeployIngest Data Integration Identity Resolution Data Quality Discovery Exploration Machine Learning Models Development Reports / Dashboards Applications APIs Structured Data Unstructured Data SQL, NoSQL, Object Store Find Share Collaborate Data Engineer Data Scientist Business Analyst App Developer Provides innovative and industry leading technologies to rapidly be applied to the business without having to manage compatibility and data complexity. Technical Value Provides an open framework to reduce the number of integration points and testing environments to deliver business solutions.
  • 13. @CasertaConcepts#DataSummit Progression of Business Analytics to Data Science Descriptive Analytics Diagnostic Analytics Predictive Analytics Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make It happen? Data Analytics Sophistication BusinessValue Hindsight Insight Foresight Information Optimization Cognitive Analytics Influence what happens Reports  Correlations  Predictions  Recommendations  Monetization Interactions Action
  • 14. @CasertaConcepts#DataSummit Progression of Data Science Maturity • Timeline • Tools • Available libraries • Best practices
  • 15. @CasertaConcepts#DataSummit What are the Realities of the Data Scientist — Searching for the data they need — Making sense of the data — Figuring why the data looks the way is does and assessing its validity — Cleaning up all the garbage within the data so it represents true business — Combining events with Reference data to give it context — Correlating event data with other events — Finally, they implement algorithms to perform mining, clustering and predictive analytics — Writes really cool and sophisticated algorithms that impacts the way the business runs. — Much of the time of a Data Scientist is spent: — NOT
  • 16. @CasertaConcepts#DataSummit Why Data Science now? • Costs of compute and storage dramatically lower than just a few years ago • Data generated by all aspects of society has dramatically increased • Need to efficiently learn what there is to learn from our data
  • 17. @CasertaConcepts#DataSummit The Data Scientist Winning Trifecta Modern Data Engineering/Data Preparation Domain Knowledge/Bu siness Expertise Advanced Mathematics/ Statistics - Computer Science - Programming/Storage - Data Quality - Visualization Algorithms - A/B Testing - - Data and Outcome - Sensibility
  • 22. @CasertaConcepts#DataSummit Is Data Trying To Trick You? Correlation: 99.26%
  • 24. @CasertaConcepts#DataSummit Are there Standards? CRISP-DM: Cross Industry Standard Process for Data Mining 1. Business Understanding • Solve a single business problem 2. Data Understanding • Discovery • Data Munging • Cleansing Requirements 3. Data Preparation • ETL 4. Modeling • Evaluate various models • Iterative experimentation 5. Evaluation • Does the model achieve business objectives? 6. Deployment • PMML; application integration; data platform; Excel
  • 25. @CasertaConcepts#DataSummit 1. Business Understanding In this initial phase of the project we will need to speak to humans. • It would be premature to jump in to the data, or begin selection of the appropriate model(s) or algorithm • Understand the project objective • Review the business requirements • The output of this phase will be conversion of business requirements into a preliminary technical design (decision model) and plan. Since this is an iterative process, this phase will be revisited throughout the entire process.
  • 26. @CasertaConcepts#DataSummit Data ScientistBusiness Analyst Business Stakeholders Business Stakeholders Business Stakeholders Interview notes Requirement Document Models / Insights Gathering Requirements
  • 27. @CasertaConcepts#DataSummit Data Science Scrum Team Data Scientist Business Stakeholders Data Engineer Efficient Inclusive EffectiveInteractive Data Analyst
  • 28. @CasertaConcepts#DataSummit 2. Data Understanding • Data Discovery  understand where the data you need comes from • Data Profiling  interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis. • Cleansing Requirements  understand data quality, data density, skew, etc • Data Munging  collocate, blend and analyze data for early insights! Valuable information can be achieved from simple group-by, aggregate queries, and even more with SQL Jujitsu! Significant iteration between Business Understanding and Data Understanding phases. Sample Exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Waterline, Elasticsearch
  • 29. @CasertaConcepts#DataSummit Data Science Data Quality Priorities Be Corrective Be Fast Be Transparent Be Thorough
  • 30. @CasertaConcepts#DataSummit Data Science Data Quality Priorities Data Quality SpeedtoValue Fast Slow Raw Refined Data Scientist’s Tightrope Does Data munging in a data science lab need the same restrictive governance and enterprise reporting?
  • 31. @CasertaConcepts#DataSummit 3. Data Preparation ETL (Extract Transform Load) 90+% of a Data Scientists time goes into Data Preparation! • Locating and acquiring valuable data sources • Select required entities/fields • Address Data Quality issues: missing or incomplete values, whitespace, bad data-points • Join/Enrich disparate datasets • Derive behavioral features • Transform/Aggregate data for intended use: • Sample • Aggregate • Pivot
  • 32. @CasertaConcepts#DataSummit We Spark • Development local or distributed is identical • Beautiful high level API’s • Full universe of Python modules • Open source and Free • Blazing fast! Spark has become our default processing engine for a data engineering & science
  • 33. @CasertaConcepts#DataSummit Data Preparation • We love Spark! • ETL can be done in Scala, Python or SQL • Cleansing, transformation, and standardization • Address Parsing: usaddress, postal-address, etc • Name Hashing: fuzzy, etc • Genderization: sexmachine, etc • And all the goodies of the standard Python library! • Parallelize workload against a large number of machines in Hadoop cluster
  • 34. @CasertaConcepts#DataSummit Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
  • 35. @CasertaConcepts#DataSummit Data Preparation Demonstration! Wifi: Hilton Meeting Room Wifi infotoday2017 Follow along: http://bit.ly/2r9ABcK File: SanFranCrime.ipynb
  • 36. @CasertaConcepts#DataSummit 4. Modeling Do you love algebra & stats? • Evaluate various models/algorithms • Classification • Clustering • Regression • Many others….. • Tune parameters • Iterative experimentation • Different models may require different data preparation techniques (ie. Sparse Vector Format) • Additionally we may discover the need for additional data points, or uncover additional data quality issues!
  • 37. @CasertaConcepts#DataSummit Modeling in Hadoop • Spark works well • SAS, SPSS, Etc. not native on Hadoop • R and Python becoming new standard • PMML can be used, but approach with caution
  • 38. @CasertaConcepts#DataSummit Machine Learning The goal of machine learning is to get software to make decisions and learn from data without being programed explicitly to do so Machine Learning algorithms are broadly broken out into two groups: • Supervised learning  inferring functions based on labeled training data • Unsupervised learning  finding hidden structure/patterns within data, no training data is supplied We will review some popular, easy to understand machine learning algorithms
  • 40. @CasertaConcepts#DataSummit Supervised Learning Name Weight Color Cat_or_Dog Susie 9lbs Orange Cat Fido 25lbs Brown Dog Sparkles 6lbs Black Cat Fido 9lbs Black Dog Name Weight Color Cat_or_Dog Misty 5lbs Orange ? The training set is used to generate a function ..so we can predict if we have a cat or dog!
  • 41. @CasertaConcepts#DataSummit Category or Values? There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home. Classification algorithms are generally well fit for categorization, while algorithms like Regression and Decision Trees are well suited for predicting “continuous” values.
  • 42. @CasertaConcepts#DataSummit Regression • Understanding the relationship between a given set of dependent variables and independent variables • Typically regression is used to predict the output of a dependent variable based on variations in independent variables • Very popular for prediction and forecasting Linear Regression
  • 43. @CasertaConcepts#DataSummit Decision Trees • A method for predicting outcomes based on the features of data • Model is represented a easy to understand tree structure of if-else statements Weight > 10lbs color = orange cat yes no name = fido no no dogyes dog cat yes
  • 44. @CasertaConcepts#DataSummit Unsupervised K-Means • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing Clustering of items into logical groups based on natural patterns in data Uses: • Cluster Analysis • Classification • Content Filtering
  • 45. @CasertaConcepts#DataSummit Collaborative Filtering • A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based) • Leveraging collaboration between multiple agents to filter, project, or detect patterns • Popular in recommender systems for projecting the “taste” for of specific individuals for items they have not yet expressed one.
  • 46. @CasertaConcepts#DataSummit Item-based • A popular and simple memory-based collaborative filtering algorithm • Projects preference based on item similarity (based on ratings): for every item i that u has no preference for yet for every item j that u has a preference for compute a similarity s between i and j add u's preference for j, weighted by s, to a running average return the top items, ranked by weighted average • First a matrix of Item to Item similarity is calculated based on user rating • Then recommendations are created by producing a weighted sum of top items, based on the users previously rated items
  • 47. @CasertaConcepts#DataSummit Data Science Demonstration! Follow along: http://bit.ly/2r9ABcK File: SanFranCrime_model_DataSummit.ipynb Wifi: Hilton Meeting Room Wifi infotoday2017
  • 48. @CasertaConcepts#DataSummit 5. Evaluation What problem are we trying to solve again? • Our final solution needs to be evaluated against original Business Understanding • Did we meet our objectives? • Did we address all issues?
  • 49. @CasertaConcepts#DataSummit 6. Deployment Engineering Time! • It’s time for the work products of data science to “graduate” from “new insights” to real applications. • Processes must be hardened, repeatable, and generally perform well too! • Data Governance applied • PMML (Predictive Model Markup Langauge): XML based interchange format
  • 50. @CasertaConcepts#DataSummit •This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. •Definitions, lineage (where does this data come from), business definitions, technical metadata Organization •Identify and control sensitive data, regulatory compliance Metadata •Data must be complete and correct. Measure, improve, certify Privacy/Security •Policies around data frequency, source availability, etc. Data Quality and Monitoring •Ensure consistent business critical data i.e. Members, Providers, Agents, etc. Business Process Integration •Data retention, purge schedule, storage/archiving Master Data Management Information Lifecycle Management (ILM) Components of Data Governance • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home grown, drools?) • Quality checks not only SQL: machine learning, Pig and Map Reduce • Acting on large dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, Core component of business operations For Data Science
  • 51. @CasertaConcepts#DataSummit Ingest Raw Data Organize, Define, Complete Munging, Blending Machine Learning Data Quality and Monitoring Metadata, ILM , Security Data Catalog Data Integration Fully Governed ( trusted) Arbitrary/Ad-hoc Queries and Reporting Big Data Ware house Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” Usage Pattern Data Governance Metadata, ILM, Security Corporate Data Pyramid (CDP)
  • 52. @CasertaConcepts#DataSummit Chief Data Organization (Oversight) Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc] Product Owner SCRUM Master Development Team Business Subject Matter Expertise Data Librarian/Data Stewardship Data Science/ Statistical Skills Data Engineering / Architecture Presentation/ BI Report Development Skills Data Quality Assurance DevOps IT Organization (Oversight) Enterprise Data Architect Solution Engineers Data Integration Practice User Experience Practice QA Practice Operations Practice Advanced Analytics Business Analysts Data Analysts Data Scientists Statisticians Data Engineers Planning Organization Project Managers Data Organization Data Gov Coordinator Data Librarians Data Stewards Analytics-Driven Organization
  • 53. @CasertaConcepts#DataSummit Technologies & Techniques • The Cloud and Spark can provide a relatively low cost and extremely scalable platform for Data Science • AWS S3 and Google GCS offers great scalability and speed to value without the overhead of structuring data • Spark, with MLlib offers a great library of established Machine Learning algorithms, reducing development efforts • Python and SQL are choices for Data Science • Go Agile and follow Best Practices (CRISP-DM) • Employ Data Pyramid concepts to ensure data has just enough governance
  • 54. @CasertaConcepts#DataSummit Some Thoughts – Enable the Future — Data Science requires the convergence of data quality, advanced math, data engineering and visualization and business smarts — Make sure your data can be trusted and people can be held accountable for impact caused by low data quality. — Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science — Get good! — Be great! — Blaze new trails! https://explore-data-science.thisismetis.com Data Science Training: • Big Data Warehousing Meetup • New York City • 4,300+ members • Knowledge sharing
  • 55. @CasertaConcepts#DataSummit Thank You / Q&A Joe Caserta Bill Walrond @joe_Caserta @bill_walrond