For more info about our Big Data courses, check out our website ➡️ https://www.betacowork.com/big-data/
---------
"Data is the new oil" - Many companies and professionals do not know how to use their data or are not aware of the added value they could gain from it.
It is in response to these problems that the project “Brussels: The Beating Heart of Big Data” was born.
This project, financed by the Region of Brussels Capital and organised by Betacowork, offers 3 training cycles of 10 courses on big data, at both beginner and advanced levels. These 3 cycles will be followed by a Hackathon weekend.
No prerequisites are required to start these courses. The aim of these courses is to familiarize participants with the principles of Big Data.
------
Course 8: How to start your big data project by Eric Rodriguez
Why most Big Data projects fail

1. BIG DATA FAILURE: How to approach Big Data projects?
2. METHODOLOGY & LIFECYCLE: Key steps to a successful project.
3. DO's & DON'Ts: Typical pitfalls and some tips to make it work ;)
4. BUILD YOUR BIG DATA STACK: Elements to build your Big Data environment
5. STEP-BY-STEP EXAMPLE: Use case: your first Big Data project
Stop thinking about experiments and get back to identifying classic business problems and using data to find solutions.
Companies focus on collecting data, but they are not able to answer questions with it from the beginning.
Before thinking about technology, we must be clear on what our business needs are.
We must start by asking the "why" and then move on to the "how".
Also, data is not being valued as a strategic asset to the company.
The Data Lake Fallacy
All Water and Little Substance
Start with a small dataset, become familiar with the business need, and then go into production to get value out.
Big Data workloads tend to be bursty, making it difficult to allocate capacity for resources.
Many companies fail to take into account how quickly a big data project can grow and evolve.
To achieve scalability you need to build your application a certain way, and thus understand how the technology scales.
PROCESSING TIME
HARD TO TEST LARGE SYSTEMS
TECHNOLOGY CAN/WILL FAIL
Challenging and fast-evolving tools: 57% of organizations cite the skills gap as a major inhibitor to Hadoop adoption.
Businesses need data experts with domain knowledge and people skills. It is currently difficult to hire good data analysts, since they are expensive and scarce.
Many Big Data vendors seek to overcome this challenge by providing educational resources or more automation of platform management.
Specific challenges include:
✓ User authentication for every team and team member accessing the data
✓ Restricting access based on a user's need
✓ Recording data access histories and meeting other compliance regulations
✓ Proper use of encryption for data in transit and at rest
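The "restrict access based on need" and "record access histories" points can be prototyped long before a platform is chosen. A toy sketch of role-based access checks with an access log; the roles, dataset names, and `can_access` helper are all hypothetical, and real Hadoop deployments would rely on tools such as Kerberos and Apache Ranger instead:

```python
from datetime import datetime, timezone

# Hypothetical role -> allowed-dataset mapping (illustration only)
PERMISSIONS = {"analyst": {"sales_db"}, "admin": {"sales_db", "hr_db"}}
ACCESS_LOG = []  # every attempt is recorded for compliance review

def can_access(role, dataset):
    """Allow access only when the role actually needs the dataset,
    and record the attempt either way."""
    allowed = dataset in PERMISSIONS.get(role, set())
    ACCESS_LOG.append((datetime.now(timezone.utc).isoformat(), role, dataset, allowed))
    return allowed

print(can_access("analyst", "sales_db"))  # True: granted and logged
print(can_access("analyst", "hr_db"))     # False: denied, but still logged
```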
1. UNDERSTAND INDUSTRY POINT-OF-VIEW ON BIG DATA
2. EVALUATE CURRENT TOOLS AND TECHNOLOGY
3. IDENTIFY BUSINESS CASE PROOF OF CONCEPT (POC)
4. DEVELOP BIG DATA IMPLEMENTATION FRAMEWORK & PROCESS STEPS
5. FINALIZE ARCHITECTURE FOR POC/PILOT PROJECT
6. CAPTURE BUSINESS MEASURES OF SUCCESSFUL POCS
7. ENVISION BIG DATA ROADMAP
1. INGEST the data sources to allow ease of exploration
2. INDEX content to make it accessible and queryable
3. INTEGRATE and LINK data elements
4. INVESTIGATE by exploring through data models
5. Discover INSIGHT: the Minimum Viable Insight (MVI), the minimum hurdle that validates a new approach to problem solving by delivering new insight
6. INVEST in discovered insights by implementing and deploying them into the organization
7. ITERATE
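The seven steps above can be sketched as a tiny pipeline on toy data. Every function name and record here is illustrative only, not a real framework:

```python
# Minimal sketch of the ingest -> index -> integrate -> investigate lifecycle.
# All names and the toy dataset are invented for illustration.

def ingest(sources):
    # Pull raw records from each source into one list
    return [row for src in sources for row in src]

def index(records):
    # Make records accessible and queryable by a key
    return {r["id"]: r for r in records}

def integrate(indexed, extra):
    # Link an external attribute onto each matching record
    for rid, attrs in extra.items():
        if rid in indexed:
            indexed[rid].update(attrs)
    return indexed

def investigate(indexed):
    # Explore via a simple aggregate over one field
    values = [r["amount"] for r in indexed.values()]
    return sum(values) / len(values)

sources = [[{"id": 1, "amount": 10}], [{"id": 2, "amount": 30}]]
catalog = integrate(index(ingest(sources)), {1: {"region": "BE"}})
insight = investigate(catalog)  # a "Minimum Viable Insight": the average amount
print(insight)
```

Investing and iterating would mean deploying such a computation into the organization and repeating the loop with more sources.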
DETERMINE DELIVERABLES (THE OUTPUTS OF THE PROJECT)
EXAMINE THE OVERALL SCOPE OF THE WORK
IDENTIFY THE KEY BUSINESS OBJECTIVES

TYPICAL QUESTIONS:
1. How much or how many? (regression)
2. Which category? (classification)
3. Which group? (clustering)
4. Is this weird? (anomaly detection)
5. Which option should be taken? (recommendation)
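Each question type maps onto a model family. A minimal scikit-learn sketch for the first two questions, on toy data (assuming scikit-learn is available; the numbers are invented):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# 1. "How much or how many?" -> regression on a perfectly linear toy trend
reg = LinearRegression().fit(X, [2, 4, 6, 8])
pred = reg.predict([[5]])[0]   # close to 10.0, since y = 2x here

# 2. "Which category?" -> classification on a toy threshold
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
label = clf.predict([[4]])[0]  # 1: above the threshold
```

Clustering (`KMeans`), anomaly detection (`IsolationForest`), and recommendation have analogous entry points in the same library.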
GATHER AND SCRAPE THE NECESSARY DATA FOR YOUR PROJECT
Here are a few ways to get yourself some data:

Connect to a database: get the data that's available, or open up your private database and start digging through it to understand what information your company has been collecting.

Use APIs: think of the APIs for all the tools your company has been using, and the data they have been collecting. You have to work on getting these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support tickets somebody submitted, etc.

Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data can help you add the average revenue for the district where your user lives, and OpenStreetMap can show you how many coffee shops are on their street.

Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends.
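A minimal sketch of pulling open data into a project. The parsing helper and sample payload are illustrative; in a real project the CSV text would come from an actual open-data endpoint (the commented URL is hypothetical):

```python
import csv
import io

def parse_open_data(csv_text):
    """Parse a CSV payload (e.g. downloaded census data) into row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# In a real project the payload would come from an API call, e.g.:
#   from urllib.request import urlopen
#   csv_text = urlopen("https://example.org/census.csv").read().decode()  # hypothetical URL
sample = "district,avg_revenue\nMitte,42000\nKreuzberg,35000\n"
rows = parse_open_data(sample)
print(rows[0]["district"])  # Mitte
```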
Fix the inconsistencies and handle the missing values.
Start digging and try to link everything together to answer your original goal.
Analyze and ask questions to business people or IT to understand what all your variables mean.
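Fixing inconsistencies and missing values often starts like this in pandas (toy listing data; the column names are illustrative):

```python
import pandas as pd

# Toy listing data with the usual problems: duplicates and missing values
df = pd.DataFrame({
    "apartment_id": [1, 1, 2, 3],
    "rent":         [900, 900, None, 1200],
})

df = df.drop_duplicates(subset="apartment_id", keep="last")  # drop re-posted ads
df["rent"] = df["rent"].fillna(df["rent"].median())          # impute missing rent
print(df)
```

Whether to impute, drop, or flag missing values depends on what the variable means, which is exactly why you ask business people or IT first.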
⚠️ Warning: this is probably the longest, most annoying step of your data project. Data scientists report that data cleaning takes about 80% of the time spent on a project.
Data exploration is typically conducted using a combination of automated and manual activities. It is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.
http://adilyalcin.me/
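A first automated pass typically looks like this in pandas (toy listing data; a manual pass would add histograms and scatter plots on top):

```python
import pandas as pd

df = pd.DataFrame({"rooms": [1, 2, 2, 3, 2],
                   "rent":  [700, 950, 900, 1300, 1000]})

# Quick automated passes typical of initial data analysis
summary = df["rent"].describe()      # count, mean, std, quartiles, min/max
counts = df["rooms"].value_counts()  # distribution of a discrete field
print(counts.idxmax())               # the most common room count
```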
http://www.jeannjoroge.com/significance-of-exploratory-data-anaysis/
Select important features and construct more meaningful ones using the raw data that you have.
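For instance, two constructed features in pandas; `rent_per_m2` and `month` are illustrative choices, not ones prescribed by the course:

```python
import pandas as pd

df = pd.DataFrame({"rent": [900, 1200],
                   "size_m2": [45, 60],
                   "date": pd.to_datetime(["2018-01-15", "2018-07-01"])})

# Construct more meaningful features from the raw columns
df["rent_per_m2"] = df["rent"] / df["size_m2"]  # normalizes price by size
df["month"] = df["date"].dt.month               # captures seasonality
```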
By working with clustering algorithms (unsupervised), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express what feature is decisive in these results.
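A minimal clustering sketch with scikit-learn's KMeans on a single toy feature (rent), where two price bands fall out as two clusters:

```python
from sklearn.cluster import KMeans

# Two obvious groups in one dimension: cheap vs. expensive listings
X = [[500], [520], [510], [1500], [1480], [1520]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.labels_
# Listings in the same price band end up in the same cluster
same_cheap = labels[0] == labels[1] == labels[2]
same_expensive = labels[3] == labels[4] == labels[5]
```

With more features, comparing the cluster centers (`km.cluster_centers_`) shows which feature drives the split.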
Remember:
• Not all data is clean or usable
• Understand the computational limits
• Don't obsess over tools; ignore the trends and worry about what's cost-effective for you
• Create an analytics plan and process
• Start with a small, low-risk project
• Allow for a learning curve
• Don't expect to find a data science unicorn when hiring
Lack of clarity: to gain the maximum benefit, you need to point your Big Data effort at a specific need or problem of your business. To justify your investment in Big Data projects, you need to showcase results continuously.
A huge hurdle in terms of ROI: many enterprises cannot cope with the heavy investment required to bring their existing data setup in sync with new challenges.
The way we think of Big Data is wrong: Big Data gets treated as if it had a known beginning and a known end, rather than as an agile journey of constant exploration.
Things you must consider before you decide to adopt a NoSQL database:
• Community strength and commercial support
• APIs
• Model based on data
• Model based on queries
• Model based on consistency
• Apache Hadoop (free)
Hadoop is a leading framework for big data analysis.
• Microsoft HDInsight (paid)
HDInsight provides low-cost infrastructure for Hadoop storage.
• NoSQL databases [MongoDB, HBase, Cassandra] (free)
No particular schema is needed when you work with NoSQL databases, and each row can have its own set of column values. Another benefit of NoSQL databases is better performance when storing massive amounts of data.
• Apache Hive (free)
Hive is mainly used for data mining purposes and runs on top of Hadoop.
• Apache Pig (free)
You don't need to define a schema before storing any file; you can start working directly. Hive and Pig largely cover the same situations.
• Talend
Talend offers many products, such as Big Data Integration and Master Data Management (MDM), which combine real-time data, applications, and process integration with embedded data quality and stewardship.
• OpenRefine (free)
OpenRefine is a user-friendly tool, and even somewhat unstructured data can be easily managed with it. Using this tool, you can explore, clean, transform, reconcile and match data easily.
• DataCleaner (paid)
DataCleaner mainly covers the pre-stage of data visualization, where only structured and clean data can be used.
• Tableau
Tableau is a data visualization tool
which is used to visualize the
structured data. You can connect to
Hive directly and start visualizing the
data.
• Import.io
Data extraction tool that enables you
to convert any website into structured,
machine-readable data with no coding
required.
• Apache Sqoop (free)
A data transfer tool that lets you import data from an RDBMS into Hadoop and export Hadoop data back to an RDBMS easily.
Here is a list of 24 Data Science Projects (free access) to practice:
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
1. FINDING A TOPIC
2. EXTRACTING DATA FROM THE WEB AND CLEANING IT
3. GAINING DEEPER INSIGHTS
4. ENGINEERING OF FEATURES USING EXTERNAL APIS
Move up the information ladder by asking users for input.
Combine, correlate and improve the quality of data sets.
Bring new value from raw (open) data sets.
EXAMPLE: what are the main drivers of rental prices in Berlin?
GETTING THE DATA
There are tons of amazing data repositories, such as Kaggle, the UCI ML Repository, dataset search engines, and websites containing academic papers with datasets. Alternatively, you could use web scraping.

CLEANING THE DATA
Once you start getting the data, it is very important to look at it as early as possible in order to find any possible issues.

EXAMPLE: possible issues with the data gathered in our example:
• Duplicated apartments, because they had been online for a while
• Agencies had input errors and would publish a completely new ad with corrected values and additional description modifications
• Some prices were changed after a month for the same apartment
• …
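Issues like re-posted ads and changed prices can be handled by keeping only the most recent ad per address. A pandas sketch on toy listings (the addresses, dates, and prices are invented for illustration):

```python
import pandas as pd

ads = pd.DataFrame({
    "address": ["Torstr. 10", "Torstr. 10", "Kantstr. 5"],
    "scraped": pd.to_datetime(["2018-03-01", "2018-04-01", "2018-03-15"]),
    "rent":    [950, 990, 1100],  # same flat re-posted with a corrected price
})

# Keep only the most recent ad per address
latest = (ads.sort_values("scraped")
             .drop_duplicates(subset="address", keep="last"))
```

Which copy to keep is a business decision; here "latest wins" matches the corrected-ad scenario above.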
EXAMPLE: an interactive dashboard of Berlin rental prices, where one can select all the possible configurations and see the corresponding price distribution.
Visualization helps you identify important attributes, or "features," that can be used by machine learning algorithms. If the features you use are very uninformative, any algorithm will produce bad predictions. With very strong features, even a very simple algorithm can produce pretty decent results.
EXAMPLE: in the rental price project, price is a continuous variable, so this is a typical regression problem. Taking all the extracted information, we collected a set of features in order to be able to predict a rental price.
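A toy version of such a regression with scikit-learn. The features (`[size_m2, rooms]`) and rents below are invented so that the relationship is exactly linear; the real project used many more features:

```python
from sklearn.linear_model import LinearRegression

# Toy rental data: rents follow 15 EUR/m2 + 50 EUR/room exactly
X = [[40, 1], [55, 2], [70, 2], [90, 3]]  # [size_m2, rooms]
y = [650, 925, 1150, 1500]

model = LinearRegression().fit(X, y)
pred = model.predict([[60, 2]])[0]  # close to 1000.0 on this exact-fit data
```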
EXAMPLE: PROBLEM
One feature that was problematic was the address. There were 6.6K apartments and around 4.4K unique addresses of different granularity. There were around 200 unique postcodes, which could be converted into dummy variables, but then very precious information about the particular location would be lost.
EXAMPLE: SOLUTION
Using an external API, the following four additional features could be computed from the apartment's address:
• duration of a train trip to S-Bahn Friedrichstrasse (central station)
• distance to U-Bahn Stadtmitte (city center) by car
• duration of a walking trip to the nearest metro station
• number of metro stations within one kilometer of the apartment
These four features boosted the performance significantly.
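Even without a routing API, simple geographic features of this kind can be derived from raw coordinates. A sketch using the haversine great-circle distance as a stand-in for the travel-time features above (the coordinates are approximate and purely illustrative):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Distance from a central Berlin station to a nearby apartment (approx. coords)
d = haversine_km(52.5190, 13.4061, 52.5120, 13.3900)  # roughly 1-2 km
```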