1. BIG DATA FAILURE: Why most Big Data projects fail
2. METHODOLOGY & LIFECYCLE: How to approach Big Data projects? Key steps to a successful project.
3. DO’s & DON’Ts: Typical pitfalls and some tips to make it work ;)
4. BUILD YOUR BIG DATA STACK: Elements to build your Big Data environment
5. STEP-BY-STEP EXAMPLE: Use case: your first Big Data project
Course by Eric Rodriguez
Stop thinking about experiments and get
back to identifying classic business
problems and using data to find solutions.
Companies focus on collecting data without
first deciding which questions they want it
to answer.
Before thinking about technology, we
must be clear about what our business
needs are.
We must start by asking “why” and then
move on to “how”.
Also, data is not being valued as a
strategic asset of the company.
The Data Lake Fallacy
All Water and Little Substance
Start with a small dataset, become
familiar with the business need, then go
into production to get value out.
Big Data workloads tend to be bursty,
making it difficult to allocate capacity for
resources.
Many companies fail to take into account
how quickly a big data project can grow
and evolve.
To achieve scalability, you need to build
your application in a certain way, so you must
understand how the technology scales.
PROCESSING TIME
HARD TO TEST LARGE SYSTEMS
TECHNOLOGY CAN/WILL FAIL
Challenging and fast-evolving tools
57% of organizations cite the skills gap as a
major inhibitor to Hadoop adoption
Businesses need data experts with
domain knowledge and people skills
Currently, it is difficult to hire good data
analysts, since they are expensive and
scarce.
Many Big Data vendors seek to overcome
this challenge by providing educational
resources or by automating more
of the platform management
Specific challenges include:
✓ User authentication for every team and team member accessing the data
✓ Restricting access based on a user’s need
✓ Recording data access histories and meeting other compliance regulations
✓ Proper use of encryption on data in transit and at rest
1. UNDERSTAND INDUSTRY POINT-OF-VIEW ON BIG DATA
2. EVALUATE CURRENT TOOLS AND TECHNOLOGY
3. IDENTIFY BUSINESS CASE PROOF OF CONCEPT (POC)
4. DEVELOP BIG DATA IMPLEMENTATION FRAMEWORK & PROCESS STEPS
5. FINALIZE ARCHITECTURE FOR POC/PILOT PROJECT
6. CAPTURE BUSINESS MEASURES OF SUCCESSFUL POCS
7. ENVISION BIG DATA ROADMAP
1. INGEST the data sources to allow ease of exploration
2. INDEX content to make it accessible and queryable
3. INTEGRATE and LINK data elements
4. INVESTIGATE by exploring through data models
5. Discover INSIGHT: Minimum Viable Insight (MVI), the minimum hurdle that validates a new approach to problem solving by delivering new insight
6. INVEST in discovered insights by implementing and deploying them into the organization
7. ITERATE
DETERMINE DELIVERABLES
(THE OUTPUTS OF THE PROJECT)
EXAMINE THE OVERALL SCOPE OF THE WORK
IDENTIFY THE KEY BUSINESS OBJECTIVES
TYPICAL QUESTIONS
1. How much or how many? (regression)
2. Which category? (classification)
3. Which group? (clustering)
4. Is this weird? (anomaly detection)
5. Which option should be taken? (recommendation)
GATHER AND SCRAPE THE NECESSARY DATA
FOR YOUR PROJECT
Here are a few ways to get yourself some data:
Connect to a database
Get the data that’s available, or
open up your private database,
start digging through it, and
understand what information
your company has been collecting.
Use APIs
Think of the APIs of all the
tools your company has been
using and the data they
have been collecting. You have
to get these all set up so you
can use the email open/click
stats, the information your
sales team put in Pipedrive or
Salesforce, the support tickets
somebody submitted, etc.
Look for open data
The Internet is full of datasets
to enrich what you have with
extra information: census data
will help you add the average
income for the district where
your user lives, and OpenStreetMap
can show you how many
coffee shops are on their street.
Use more APIs
Another great way to start a
personal project is to make it
super personal by working on
your own data! You can
connect to your social media
tools, like Twitter or Facebook,
to analyze your followers and
friends.
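As a minimal sketch of the "connect to a database" route, the snippet below uses Python's built-in `sqlite3` module with an in-memory database. The `support_tickets` table, its columns, and the sample rows are hypothetical stand-ins for whatever your company actually stores.

```python
import sqlite3

# In-memory database so the sketch is self-contained; in practice you would
# connect to your company's existing database instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_tickets (id INTEGER, customer TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO support_tickets VALUES (?, ?, ?)",
    [(1, "acme", "billing"), (2, "acme", "login"), (3, "globex", "billing")],
)

# Start digging: how many tickets were submitted per topic?
rows = conn.execute(
    "SELECT topic, COUNT(*) FROM support_tickets GROUP BY topic ORDER BY topic"
).fetchall()
conn.close()

for topic, count in rows:
    print(topic, count)
```

A simple aggregate like this is often enough to learn what the company has been collecting before any heavier tooling is involved.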
Fix the inconsistencies and handle the missing values
Start digging and try to link everything together
to answer your original goal
Analyze the data and ask business people or IT questions
to understand what all your variables mean
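The two cleaning tasks named above can be sketched in a few lines of plain Python. The records, the field names, and the choice of median imputation are assumptions made for this illustration; the right strategy for missing values depends on your data.

```python
from statistics import median

# Hypothetical raw records: inconsistent city spelling and a missing age.
raw = [
    {"city": " Berlin ", "age": 34},
    {"city": "berlin", "age": None},
    {"city": "BERLIN", "age": 40},
    {"city": "Munich", "age": 28},
]

# Fix inconsistencies: trim whitespace and normalise the casing.
for record in raw:
    record["city"] = record["city"].strip().title()

# Handle missing values: impute the median of the known ages.
known_ages = [r["age"] for r in raw if r["age"] is not None]
fill_value = median(known_ages)
cleaned = [{**r, "age": fill_value if r["age"] is None else r["age"]} for r in raw]

print(cleaned)
```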
⚠️
Warning
This is probably the longest, most
annoying step of your data project.
Data scientists report data cleaning is
about 80% of the time spent on a project.
Data exploration is typically
conducted using a combination
of automated and manual
activities
Data exploration is an approach
similar to initial data analysis,
whereby a data analyst uses visual
exploration to understand what is in
a dataset and the characteristics of
the data, rather than through
http://adilyalcin.me/
http://www.jeannjoroge.com/significance-of-exploratory-data-anaysis/
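The automated part of data exploration can start with simple summary statistics, sketched here with the Python standard library. The rent and room values are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: monthly rents in euros and number of rooms.
rents = [650, 720, 810, 790, 2400, 700]
rooms = [1, 2, 2, 2, 4, 1]

summary = {
    "min": min(rents),
    "max": max(rents),
    "mean": mean(rents),
    "median": median(rents),
    "stdev": round(stdev(rents), 1),
}
room_counts = Counter(rooms)

print(summary)
print(room_counts.most_common())
```

In this sample the mean sits well above the median, which hints at an outlier (the 2400 euro listing) worth inspecting manually or visually.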
Select important features and construct more meaningful ones
using the raw data that you have
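A minimal sketch of constructing more meaningful features from raw fields is shown below. The listing, its field names, and the three derived features are assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical raw listing; field names are assumptions for this sketch.
listing = {
    "rent_eur": 900.0,
    "area_m2": 60.0,
    "built_year": 1995,
    "listed_on": date(2018, 5, 14),
}

def build_features(row):
    """Derive more meaningful features from the raw fields."""
    return {
        "rent_per_m2": row["rent_eur"] / row["area_m2"],
        "building_age": row["listed_on"].year - row["built_year"],
        "listed_on_weekend": row["listed_on"].weekday() >= 5,
    }

features = build_features(listing)
print(features)
```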
By working with clustering algorithms
(unsupervised), you can build models to
uncover trends in the data that were not
distinguishable in graphs and stats. These
create groups of similar events (or clusters)
and express, more or less explicitly, which
features are decisive in these results.
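The slides do not name a specific clustering algorithm; as one common choice, here is a small k-means sketch in plain Python. The points and the initial centroids are made up, and the starting centroids are fixed by hand so the result is reproducible.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical events described by two numeric features.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])

print(centroids)
print(clusters)
```

Comparing the centroids coordinate by coordinate shows which features separate the groups most, which is the "decisive feature" idea from the slide.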
Remember:
Not all data is clean or usable
Understand the computational limits
Don't obsess over tools
Ignore the trends; worry about what's cost-effective for you
Create an analytics plan and process
Start with a small, low-risk project
Allow for a learning curve
Don't expect to find a data science unicorn when hiring
Lack of clarity:
To get the maximum benefit out of Big Data, you need to point it at a
specific need or problem of your business. To justify your investments in
Big Data projects, you need to showcase your results continuously.
A huge hurdle in terms of ROI:
Many enterprises can’t cope with the heavy investment needed to bring
their existing data setup in sync with new challenges.
The way we think of Big Data is wrong:
Big Data gets treated as if it had a known beginning and a known end,
rather than as an agile journey of constant exploration.
BIG DATA STACK
DATA ARCHITECTURE OVERVIEW
PARADOX OF CHOICE
Things you must consider before
you decide to adopt a NoSQL database:
• Community strength and commercial support
• APIs
• Model based on data
• Model based on queries
• Model based on consistency
DATA INGESTION - SIMPLE
DATA INGESTION - ADVANCED
• Apache Hadoop (free)
Hadoop is a leading tool for big data
analysis.
• Microsoft HDInsight (paid)
HDInsight provides low-cost
infrastructure for Hadoop storage.
• NoSQL databases
[MongoDB, HBase, and
Cassandra] (free)
No particular schema is needed when
you are working with NoSQL databases,
and each row can have its own set
of column values. Another benefit of
NoSQL databases is better
performance when storing a
massive amount of data.
• Apache Hive (free)
Hive is mostly used for data mining
and runs on top of Hadoop.
• Apache Pig (free)
You don’t need to define a schema
before storing a file; you can start
working with it directly. Hive and Pig
cover almost the same use cases.
• Talend
Talend offers many products, such as Big
Data Integration and Master Data
Management (MDM), which combines
real-time data, applications, and
process integration with embedded
data quality and stewardship.
• OpenRefine (free)
OpenRefine is a user-friendly tool that
can handle data even when it is
somewhat unstructured. With it, you
can easily explore, clean, transform,
reconcile, and match data.
• DataCleaner (paid)
DataCleaner is mainly used as a preparation
stage before data visualization, where only
structured and clean data can be used.
• Tableau
Tableau is a data visualization tool
for structured data. You can connect to
Hive directly and start visualizing the
data.
• Import.io
A data extraction tool that enables you
to convert any website into structured,
machine-readable data with no coding
required.
• Apache Sqoop (free)
A data transfer tool that lets you import
data from an RDBMS into Hadoop and
export Hadoop data back to an RDBMS.
Here is a list of 24 Data Science Projects (free access) to practice:
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
1. FINDING A TOPIC
2. EXTRACTING DATA FROM THE WEB AND CLEANING IT
3. GAINING DEEPER INSIGHTS
4. ENGINEERING OF FEATURES USING EXTERNAL APIS
Move up the information ladder by
asking users for input
Combine, correlate and improve
quality of data sets
Bring new value from raw (open)
data sets
EXAMPLE: What are the main drivers of rental prices in Berlin?
GETTING THE DATA
There are tons of amazing
data repositories, such as:
• Kaggle, UCI ML Repository
• dataset search engines
• websites containing academic papers with datasets…
Alternatively, you could
use web scraping.
CLEANING THE DATA
Once you start getting the
data, it is very important to have
a look at it as early as possible in
order to find any possible issues.
EXAMPLE:
Possible issues with the data gathered in
our example:
• Duplicated apartments, because they had been online for a while
• Agencies made input errors and then published a completely new ad with corrected values and a modified description
• Some prices were changed after a month for the same apartment
• …
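One way to handle duplicated or re-published ads is to keep only the most recent version of each apartment. The sketch below assumes, purely for illustration, that an address plus the living area identifies an apartment; the ads and field names are hypothetical and not taken from the original project.

```python
from datetime import date

# Hypothetical scraped ads; the same apartment can appear several times.
ads = [
    {"address": "Example Str. 1", "area_m2": 55, "rent_eur": 800, "seen": date(2018, 3, 1)},
    {"address": "Example Str. 1", "area_m2": 55, "rent_eur": 850, "seen": date(2018, 4, 2)},
    {"address": "Sample Weg 7", "area_m2": 70, "rent_eur": 1100, "seen": date(2018, 3, 9)},
]

def deduplicate(rows):
    """Keep only the most recently seen ad per (address, area) pair."""
    latest = {}
    for row in rows:
        key = (row["address"].strip().lower(), row["area_m2"])
        if key not in latest or row["seen"] > latest[key]["seen"]:
            latest[key] = row
    return list(latest.values())

unique_ads = deduplicate(ads)
print(unique_ads)
```

Keeping the latest version also covers the case where a price was changed after a month, since the newer price wins.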
EXAMPLE:
Interactive dashboard of Berlin rental
prices: one can select all the possible
configurations and see the
corresponding price distribution.
Visualization helps you to identify
important attributes, or “features,” that
could be used by these machine learning
algorithms. If the features you use are
very uninformative, any algorithm will
produce bad predictions.
With very strong features, even a very
simple algorithm can produce pretty
decent results.
EXAMPLE:
In the rental price project, price is a continuous variable, so it is a typical regression problem.
Taking all extracted information, we collected the features above in order to be able to predict a rental price.
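To make the regression framing concrete, here is a minimal single-feature least-squares sketch. The slides do not say which model the project used, so this is only an illustration, and the areas and rents are invented numbers.

```python
from statistics import mean

# Hypothetical training data: living area in m² and monthly rent in euros.
areas = [30.0, 45.0, 60.0, 75.0, 90.0]
rents = [500.0, 700.0, 900.0, 1100.0, 1300.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y ≈ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(areas, rents)
predicted = slope * 50.0 + intercept  # predicted rent for a 50 m² apartment

print(slope, intercept, predicted)
```

A real model would use many features at once, but the principle is the same: fit parameters on known prices, then predict a continuous value for a new apartment.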
EXAMPLE: PROBLEM
One problematic feature was the address.
There were 6.6K apartments and around 4.4K
unique addresses of varying granularity. There
were around 200 unique postcodes, which could
be converted into dummy variables, but then
very valuable information about the particular
location would be lost.
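The dummy-variable conversion mentioned above can be sketched as follows. The postcodes are placeholder values, and the real dataset would produce around 200 columns rather than three.

```python
# Hypothetical postcodes; a real dataset would have around 200 distinct values.
postcodes = ["10115", "10117", "10115", "12043"]

categories = sorted(set(postcodes))

def one_hot(value):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

encoded = [one_hot(p) for p in postcodes]
print(encoded)
```

Two apartments in the same postcode receive identical vectors, which illustrates the loss of finer location detail described in the slide.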
EXAMPLE: SOLUTION
Using an external API, the following four additional
features could be calculated from the apartment’s
address:
• duration of a train trip to S-Bahn Friedrichstrasse (central station)
• distance to U-Bahn Stadtmitte (city center) by car
• duration of a walking trip to the nearest metro station
• number of metro stations within one kilometer of the apartment
These four features boosted the performance
significantly.
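The original project obtained these features from an external API that the slides do not name. As a rough stand-in for the last feature only, the sketch below counts stations within one kilometer using straight-line (haversine) distance instead of an API call. The coordinates are made up and are not real station locations.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    earth_radius_km = 6371.0
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

def stations_within(apartment, stations, radius_km=1.0):
    """Count stations whose straight-line distance is at most `radius_km`."""
    return sum(1 for s in stations if haversine_km(*apartment, *s) <= radius_km)

# Made-up coordinates for illustration only, not real station locations.
apartment = (52.5200, 13.4050)
stations = [(52.5210, 13.4060), (52.5250, 13.4100), (52.5600, 13.5000)]
nearby = stations_within(apartment, stations)

print(nearby)
```

Straight-line distance ignores the street network, so travel durations like the first three features would still need a routing service.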
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez

Contenu connexe

Tendances

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Shirshanka Das
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data Frameworks
eXascale Infolab
 

Tendances (20)

Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?
 
Power of the Run Graph
Power of the Run GraphPower of the Run Graph
Power of the Run Graph
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive Framework
 
Big data-ppt
Big data-pptBig data-ppt
Big data-ppt
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data Frameworks
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in Azure
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
 
NoSQL and Data Modeling for Data Modelers
NoSQL and Data Modeling for Data ModelersNoSQL and Data Modeling for Data Modelers
NoSQL and Data Modeling for Data Modelers
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Total Data Industry Report
Total Data Industry ReportTotal Data Industry Report
Total Data Industry Report
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Intro to Neo4j Webinar
Intro to Neo4j WebinarIntro to Neo4j Webinar
Intro to Neo4j Webinar
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 

Similaire à Course 8 : How to start your big data project by Eric Rodriguez

final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentation
Priyesh Patel
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
Fiona Lew
 

Similaire à Course 8 : How to start your big data project by Eric Rodriguez (20)

Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
 
AI Orange Belt - Session 3
AI Orange Belt - Session 3AI Orange Belt - Session 3
AI Orange Belt - Session 3
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentation
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
data science training and placement
data science training and placementdata science training and placement
data science training and placement
 

Dernier

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Dernier (20)

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Course 8 : How to start your big data project by Eric Rodriguez

  • 1.
  • 2. Why most Big Data projects fail 1 2 3 BIG DATA FAILURE METHODOLOGY & LIFECYCLE DO’s & DONT’s How to approach Big Data projects? Key steps to a successful project. Typical pitfalls and some tips to make it work ;) 4 5 BUILD YOUR BIG DATA STACK STEP-BY-STEP EXAMPLE Elements to build your Big Data environment Use case: your first Big Data project
  • 3.
  • 4.
  • 5. Course by Eric Rodriguez Stop thinking about experiments and get back to identifying classic business problems and using data to find solutions Companies focus on collecting data but they're not able to answer questions from the beginning. Before thinking about technology, we must be clear on which are our business needs. We must start asking the “why” and then move on to the “how” Also, data is not being valued as a strategic asset to the company.
  • 6. Course by Eric Rodriguez The Data Lake Fallacy All Water and Little Substance
  • 7. Course by Eric Rodriguez Start with a small dataset then become familiar with the business need and go into production to get value out. Big Data Workloads tend to be bursty, making it difficult to allocate capacity for ressources Many companies fail to take into account how quickly a big data project can grow and evolve To achieve scalability you need to build your application a certain way thus understand how the technology scales
  • 8. Course by Eric Rodriguez PROCESSING TIME HARD TO TEST LARGE SYSTEMS TECHNOLOGY CAN/WILL FAIL
  • 9. Course by Eric Rodriguez Challenging and fast-evolving tools 57% of organizations cite skill gap as major inhibitor to Hadoop adoption Businesses need data experts with domain knowledge and people skills Currently, it is difficult to hire good data analysts, since they are expensive and scarce. Many Big Data vendors seek to overcome this challenge by providing educational resources or by providing more automation of the platform management
  • 10. Course by Eric Rodriguez
  • 11. Course by Eric Rodriguez Specific Challenges include : ✓ User authentication for every team and team member accessing the data ✓ Restricting access based on a user’s need ✓ Recording data access histories and meeting other compliance regulations ✓ Proper use of encryption on data in-transit and at rest
  • 12. Course by Eric Rodriguez
  • 13. Course by Eric Rodriguez UNDERSTAND INDUSTRY POINT-OF-VIEW ON BIG DATA EVALUATE CURRENT TOOLS AND TECHNOLOGY IDENTIFY BUSINESS CASE PROOF OF CONCEPT (POC) DEVELOP BIG DATA IMPLEMENTATION FRAMEWORK & PROCESS STEPS FINALIZE ARCHITECTURE FOR POC/PILOT PROJECT CAPTURE BUSINESS MEASURES OF SUCCESSFUL POCS ENVISION BIG DATA ROADMAP 1 2 3 4 5 6 7
  • 14. Course by Eric Rodriguez INGEST the data sources to allow ease of exploration INDEX content to make it accessible and queryable INTEGRATE and LINK data elements INVESTIGATE by exploring through data models Discover INSIGHT Mimimum Viable Insight (MVI) Minimum hurdle that validates a new approach to problem solving by delivering new insight INVEST discovered insights by implementing and deploying into the organization ITERATE 1 2 3 4 5 6 7
  • 15. Course by Eric Rodriguez
  • 16.
  • 17. Course by Eric Rodriguez
  • 18. Course by Eric Rodriguez
  • 19. Course by Eric Rodriguez
  • 20. Course by Eric Rodriguez
  • 21. Course by Eric Rodriguez DETERMINE DELIVERABLES (THE OUTPUTS OF THE PROJECT) EXAMINE THE OVERALL SCOPE OF THE WORK IDENTIFY THE KEY BUSINESS OBJECTIVES IDENTIFY THE KEY BUSINESS OBJECTIVES 1. How much or how many? (regression) 2. Which category? (classification) 3. Which group? (clustering) 4. Is this weird? (anomaly detection) 5. Which option should be taken? (recommendation) TYPICAL QUESTIONS
  • 22. GATHER AND SCRAPE THE NECESSARY DATA FOR YOUR PROJECT. Here are a few ways to get yourself some data:
      • Connect to a database: get the data that’s available, or open up your private database, start digging through it, and understand what information your company has been collecting.
      • Use APIs: think of the APIs of all the tools your company has been using and the data those tools have been collecting. You have to get these all set up so you can use the email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support tickets somebody submitted, etc.
      • Look for open data: the Internet is full of datasets to enrich what you have with extra information. Census data will help you add the average income for the district where your user lives, and OpenStreetMap can show you how many coffee shops are on their street.
      • Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends.
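As a minimal sketch of the "connect to a database" option, the snippet below uses Python's built-in `sqlite3` module with an in-memory database. The `support_tickets` table, its columns, and the sample rows are made up for illustration; a company database would be reached through its own driver and credentials.

```python
import sqlite3

# In-memory database standing in for a company data store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE support_tickets (id INTEGER PRIMARY KEY, customer TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO support_tickets (customer, status) VALUES (?, ?)",
    [("acme", "open"), ("acme", "closed"), ("globex", "open")],
)


def tickets_per_status(connection):
    """Count tickets per status, a first look at what has been collected."""
    rows = connection.execute(
        "SELECT status, COUNT(*) FROM support_tickets GROUP BY status ORDER BY status"
    ).fetchall()
    return dict(rows)


print(tickets_per_status(conn))
```

Simple aggregate queries like this one are often enough to learn which tables are populated and which fields are actually filled in.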
  • 23.
      • Fix the inconsistencies and handle the missing values.
      • Start digging and try to link everything together to answer your original goal.
      • Analyze, and ask business people or IT questions, to understand what all your variables mean.
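The first point, fixing inconsistencies and handling missing values, might look like the sketch below. The sample rows and column names are invented, and filling gaps with the column median is just one common choice assumed here, not a method prescribed by the slides.

```python
from statistics import median

# Made-up raw rows with inconsistent spelling and missing values.
raw_rows = [
    {"city": " Berlin", "rooms": "2", "price": "850"},
    {"city": "berlin ", "rooms": "", "price": "1200"},
    {"city": "BERLIN", "rooms": "3", "price": None},
]


def to_number(value):
    """Convert a raw field to float, or None when it is missing."""
    if value is None or str(value).strip() == "":
        return None
    return float(value)


def clean(rows):
    """Normalize city names and fill missing numbers with the column median."""
    cleaned = [
        {
            "city": row["city"].strip().title(),
            "rooms": to_number(row["rooms"]),
            "price": to_number(row["price"]),
        }
        for row in rows
    ]
    for column in ("rooms", "price"):
        known = [r[column] for r in cleaned if r[column] is not None]
        fill = median(known)  # assumes at least one known value per column
        for r in cleaned:
            if r[column] is None:
                r[column] = fill
    return cleaned


for row in clean(raw_rows):
    print(row)
```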
  • 24. ⚠️ Warning: This is probably the longest and most annoying step of your data project. Data scientists report that data cleaning takes about 80% of the time spent on a project.
  • 25. Data exploration is typically conducted using a combination of automated and manual activities. It is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through [...] http://adilyalcin.me/
  • 26. http://www.jeannjoroge.com/significance-of-exploratory-data-anaysis/
  • 27. Select important features and construct more meaningful ones from the raw data that you have.
  • 28. By working with clustering algorithms (unsupervised learning), you can build models that uncover trends in the data that were not distinguishable in graphs and stats. These algorithms create groups of similar events (clusters) and indicate, more or less explicitly, which features are decisive in these results.
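To make the clustering idea concrete, here is a small k-means sketch in plain Python. The slides do not name a specific algorithm, so k-means is an assumption, as are the toy 2-D points and the simplification of seeding the centroids with the first k points.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; centroids start at the first k points."""
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments


# Toy data: two visibly separate groups of events described by two features.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Comparing the resulting centroids feature by feature is one simple way to see which features separate the groups.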
  • 30. Remember:
      • Not all data is clean or usable
      • Understand the computational limits
      • Don't obsess over tools: ignore the trends and worry about what's cost-effective for you
      • Create an analytics plan and process
      • Start with a small, low-risk project
      • Allow for a learning curve
      • Don't expect to find a data science unicorn when hiring
  • 31.
      • Lack of clarity: to gain the maximum benefit from Big Data, you need to point it at a specific need or problem of your business. To justify your investment in Big Data projects, you need to showcase your results continuously.
      • A huge hurdle in terms of ROI: many enterprises cannot cope with the heavy investment needed to bring their existing data setup in sync with new challenges.
      • The way we think of Big Data is wrong: Big Data gets treated as if it had a known beginning and a known end, rather than as an agile journey of constant exploration.
  • 32. BIG DATA STACK
  • 35. DATA ARCHITECTURE OVERVIEW
  • 36. PARADOX OF CHOICE
  • 38. Things you must consider before you decide to adopt a NoSQL database:
      • Community strength and commercial support
      • APIs
      • Model based on data
      • Model based on queries
      • Model based on consistency
  • 40. DATA INGESTION - SIMPLE
  • 41. DATA INGESTION - ADVANCED
  • 42.
      • Apache Hadoop (free): Hadoop is one of the leading tools for big data analysis.
      • Microsoft HDInsight (paid): HDInsight provides low-cost infrastructure for Hadoop storage.
      • NoSQL databases [MongoDB, HBase, and Cassandra] (free): no fixed schema is needed when you work with NoSQL databases, and each row can have its own set of column values. Another benefit is better performance when storing massive amounts of data.
      • Apache Hive (free): Hive is mainly used for querying and analyzing data with a SQL-like language, and runs on top of Hadoop.
      • Apache Pig (free): you don't need to define a schema before storing a file, so you can start working directly. Hive and Pig serve almost the same purpose.
      • Talend: Talend offers many products, such as Big Data Integration and Master Data Management (MDM), which combines real-time data, applications, and process integration with embedded data quality and stewardship.
      • OpenRefine (free): OpenRefine is a user-friendly tool that can easily manage data even when it is somewhat unstructured. With it you can explore, clean, transform, reconcile, and match data.
      • DataCleaner (paid): DataCleaner is mainly used as a preparation stage before data visualization, where only structured and clean data can be used.
      • Tableau: Tableau is a data visualization tool for structured data. You can connect to Hive directly and start visualizing the data.
      • Import.io: a data extraction tool that lets you convert a website into structured, machine-readable data with no coding required.
      • Apache Sqoop (free): a data transfer tool that imports data from an RDBMS into Hadoop and exports Hadoop data back to an RDBMS.
  • 43. Here is a list of 24 Data Science Projects (free access) to practice: https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
  • 44.
      1. FINDING A TOPIC
      2. EXTRACTING DATA FROM THE WEB AND CLEANING IT
      3. GAINING DEEPER INSIGHTS
      4. ENGINEERING OF FEATURES USING EXTERNAL APIS
  • 45.
      • Move up the information ladder by asking users for input
      • Combine, correlate and improve the quality of data sets
      • Bring new value from raw (open) data sets
      EXAMPLE: What are the main drivers of rental prices in Berlin?
  • 46. GETTING THE DATA
      There are tons of amazing data repositories, such as:
      • Kaggle, UCI ML Repository
      • dataset search engines
      • websites containing academic papers with datasets…
      Alternatively, you could use web scraping.
      CLEANING THE DATA
      Once you start getting the data, it is very important to look at it as early as possible in order to find any possible issues.
      EXAMPLE: Possible issues with the data gathered in our example:
      • Duplicated apartments, because they had been online for a while
      • Agencies made input errors and then published a completely new ad with corrected values and a modified description
      • Some prices were changed after a month for the same apartment
      • …
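One way to handle the duplicated and re-published ads listed above is to keep only the most recent ad per apartment. The sketch below assumes an apartment can be identified by its address and area, which is a simplification. The field names and sample listings are hypothetical.

```python
# Made-up listings: the first apartment was re-published with a new price.
listings = [
    {"address": "Example Str. 1", "area_m2": 60, "price": 900, "posted": "2018-01-05"},
    {"address": "Example Str. 1", "area_m2": 60, "price": 950, "posted": "2018-02-05"},
    {"address": "Sample Weg 7", "area_m2": 45, "price": 700, "posted": "2018-01-20"},
]


def deduplicate(rows):
    """Keep only the most recently posted ad per (address, area) key."""
    latest = {}
    for row in rows:
        key = (row["address"].strip().lower(), row["area_m2"])
        # ISO-formatted dates compare correctly as plain strings.
        if key not in latest or row["posted"] > latest[key]["posted"]:
            latest[key] = row
    return list(latest.values())


for row in deduplicate(listings):
    print(row["address"], row["price"])
```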
  • 47. EXAMPLE: An interactive dashboard of Berlin rental prices, where one can select any configuration and see the corresponding price distribution.
  • 48. Visualization helps you identify important attributes, or “features,” that could be used by machine learning algorithms. If the features you use are very uninformative, any algorithm will produce bad predictions. With very strong features, even a very simple algorithm can produce pretty decent results. EXAMPLE: In the rental price project, price is a continuous variable, so it is a typical regression problem. Taking all the extracted information, we collected the features above in order to predict a rental price.
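As a hedged illustration of "a very simple algorithm" for a regression problem, the sketch below fits a one-feature least-squares line. It is not the model used in the original project: the single feature (area) and the numbers are made up so that rent follows 10 × area + 200 exactly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: area in square meters and monthly rent.
areas = [30.0, 50.0, 70.0, 90.0]
rents = [500.0, 700.0, 900.0, 1100.0]

slope, intercept = fit_line(areas, rents)
predicted = slope * 60.0 + intercept  # predicted rent for a 60 m² apartment
print(slope, intercept, predicted)
```

With informative features, even a line like this gives a usable baseline against which more complex models can be compared.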
  • 49. EXAMPLE: PROBLEM
      One problematic feature was the address. There were 6.6K apartments and around 4.4K unique addresses of varying granularity. There were around 200 unique postcodes, which could be converted into dummy variables, but then very precious information about a particular location would be lost.
      EXAMPLE: SOLUTION
      By using an external API, the following four additional features could be calculated from the apartment’s address:
      • duration of a train trip to S-Bahn Friedrichstrasse (central station)
      • distance to U-Bahn Stadtmitte (city center) by car
      • duration of a walking trip to the nearest metro station
      • number of metro stations within one kilometer of the apartment
      These four features boosted the performance significantly.
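The project obtained these features from an external routing API, which is not reproduced here. As a rough stand-in, the sketch below computes two similar features from straight-line (haversine) distances instead of travel times. The coordinates are hypothetical, and `location_features` is an illustrative helper, not part of any API.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def location_features(apartment, stations, radius_km=1.0):
    """Distance to the nearest station and number of stations within a radius."""
    distances = [haversine_km(*apartment, *station) for station in stations]
    return {
        "nearest_station_km": min(distances),
        "stations_within_radius": sum(d <= radius_km for d in distances),
    }


# Hypothetical coordinates, not real apartment or station locations.
apartment = (52.5200, 13.4050)
stations = [(52.5200, 13.4100), (52.5250, 13.4050), (52.6000, 13.4050)]
print(location_features(apartment, stations))
```

Straight-line distance ignores the street and rail network, so it only approximates the trip durations described on the slide.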