This presentation was given by Karthigai Muthu, Lead Big Data Analyst, at a meetup organized by the group Internet of Everything in March 2015.
Through his presentation, Karthik provided a comprehensive understanding of available ecosystem tools and how they can be used to perform data engineering and data analytics. Karthik covers the following topics in his presentation:
• Establishment of complete data pipeline using big data ecosystem tools.
• Tackling of high velocity streams using various stream processing engines on cloud and performing Real Time analytics.
• Tackling of historical data using big data ecosystem tools and migration of traditional infrastructure to big data environments.
• Integration of big data ecosystem for data analysis using SAMOA, R and Mahout.
• Deployments of big data environments on the cloud.
4. Sources of data
- 25+ TBs of log data every day
- 2+ billion people on the Web by end 2011
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters in 2009… 200M by 2014
- 12+ TBs of tweet data every day
- ? TBs of data every day
5. What makes Data Big
Volume
- Description: The amount of data generated, or the data intensity, that must be ingested, analyzed and managed to make decisions based on complete data analysis
- Attributes: Exabyte (EB), Zettabyte (ZB), Yottabyte (YB)
- Drivers: Increase in data sources, higher resolution sensors, scalable infrastructure
Velocity
- Description: How fast the data is being produced and changed, and the speed at which it is transformed into insight
- Attributes: Batch, near real time, real time and streams
- Drivers: Rapid feedback loop, improved throughput connectivity, competitive advantage, pre-computed information
Variety
- Description: The degree of diversity of data from sources both inside and outside an organization
- Attributes: Degree of structure, complexity
- Drivers: M2M/IoT, social media, genomics, video and mobile
Veracity
- Description: The quality and provenance of data
- Attributes: Consistency, completeness, ambiguity, integrity
- Drivers: Cost, need for traceability and justification
7. What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
8. The Evolution of Business Intelligence
- 1990's: BI Reporting, OLAP & Data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL Reporting Tools)
- 2000's: Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA)
- 2010's: Big Data: Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra/MongoDB)
- 2010's: Big Data: Real Time & Single View (Graph Databases)
Diagram axes: Speed and Scale
22. Real-Time Analytics/Decision Requirement
Customer Influence Behavior
- Product Recommendations that are Relevant & Compelling
- Friend Invitations to join a Game or Activity that expands business
- Preventing Fraud as it is Occurring & preventing more proactively
- Learning why Customers Switch to competitors and their offers; in time to Counter
- Improving the Marketing Effectiveness of a Promotion while it is still in Play
24. Role of Big Data in M2M/IoT
- Big Data is a factor that will, to a large extent, determine the future growth rate in the M2M industry.
- M2M will connect increasingly more nodes that will provide data from endpoints.
- Data will be more granular, more frequent, and more accurate, with bigger data sets or even live data streams.
- The IPv4 addressing scheme cannot accommodate the large volume of endpoint connections (sensors, smartphones, smart factories, smart grids, smart vehicles, controllers, meters), so IPv6 is required.
- IoE: the convergence of IoT, Big Data analytics, cloud computing and other technologies is collectively called the Internet of Everything.
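The addressing point on this slide comes down to simple arithmetic: IPv4 addresses are 32 bits wide and IPv6 addresses are 128 bits wide. The short sketch below works that out. The endpoint count is a hypothetical figure chosen only for illustration, not a number from the presentation, and the `fits` helper is likewise made up for this example.

```python
# Address-space arithmetic behind the IPv4 vs. IPv6 point.
IPV4_BITS = 32    # width of an IPv4 address
IPV6_BITS = 128   # width of an IPv6 address

ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

# Hypothetical endpoint count, assumed purely for illustration.
assumed_endpoints = 50_000_000_000


def fits(address_space: int, endpoints: int) -> bool:
    """Return True if every endpoint could receive a unique address."""
    return endpoints <= address_space


print(f"IPv4 addresses: {ipv4_addresses:,}")
print(f"IPv6 addresses: {ipv6_addresses:.3e}")
print("IPv4 is enough:", fits(ipv4_addresses, assumed_endpoints))
print("IPv6 is enough:", fits(ipv6_addresses, assumed_endpoints))
```

In practice not every address is assignable (reserved and private ranges reduce the usable IPv4 pool further), so the sketch is an upper bound on what IPv4 can offer.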
25. Challenges of Big Data in M2M/IoT
- Meeting the need for speed
- Data understanding
- Maintaining data quality
- Displaying meaningful results
27. Big Data Use Cases – IoT/M2M
- Personal IoT: the scope is a single person, such as a smartphone equipped with a GPS sensor or a fitness device that measures the heart rate. This is one of the fastest growing, consumer-oriented areas of IoT.
- Group IoT: the scope is a fairly small group of people, such as a family in a smart house, co-workers in a van or a group of tourists. This is one of the most challenging areas and is still in its early phase.
- Community IoT: the scope is a large group of people, potentially thousands and more; usually this is in a public infrastructure context, such as smart cities or smart roads. This is a young and potentially promising IoT area.
- Industrial IoT: the scope can be within an organization (smart factory) or between organizations (retailer supply chain). This is arguably the most established and mature part of IoT.
28. Big Data Use Cases – IoT/M2M
- Agriculture – sensors can be deployed on farm machinery to provide data about the equipment, soil temperature, moisture, etc.
- Buildings/Smart Homes – building sensors can be used to help facility managers become more proactive about ensuring that their buildings operate at peak efficiency.
- Communities – smart cities make use of parking space availability systems, intelligent traffic monitoring systems, intelligent highways, weather-adaptive street lighting, and more.
- Healthcare – infant monitors, smart diapers and pills with ingestible sensors are just some of the IoT-based devices.
- Manufacturing – factories with sensors can improve operations and product quality, and decrease safety hazards.
- Smartphones – can control everything from door locks, thermostats and light bulbs to vacuum cleaners, and more.
- Utilities – smart water meters can be used to reduce water leaks. Smart electric grids can adjust rates depending on usage.
- Wearables – smart watches, fitness trackers and health monitors may become a primary source of human-related data, and can also be used in sports, retail, travel and manufacturing.
29. Benefits of Big Data Analytics in M2M/IoT
1. Device Maintenance:
a. Time for next patch upgrade
b. Energy management
c. Inventory management and track replacement
2. Proactive Healthcare:
Capture and analyze real-time data from medical monitors to predict potential health problems before patients manifest clinical signs of infection.
3. Monetize Machine Data:
a. Monitor performance, usage and capacity details to uncover up-sell and cross-sell opportunities
b. Maximize the lifespan and performance of high-value medical assets
30. Benefits of Big Data Analytics (cont.)
4. Optimize Support Operations:
a. Reduce MTTR and support escalations
b. Preempt failures with proactive support
c. Troubleshoot with accurate information
d. Proactive consultation to customers on approaching expiry dates
34. Batch vs. Real-Time processing
Batch processing
- Data is gathered and processed as a group at one time.
- Jobs run to completion.
- Data might be out of date.
Real-time processing
- Data is processed as the information is being entered.
- Jobs run forever.
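The contrast on this slide can be made concrete with a small, hedged sketch. It computes the same average two ways: once over a complete group of readings (batch) and once incrementally as each reading arrives (real-time style). The function names and the sample sensor values are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch: the whole group is gathered first, then processed once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Real-time style: emit an updated result as each reading arrives.

    With an unbounded source this generator would never finish,
    which mirrors the "jobs run forever" point.
    """
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Made-up sample readings for illustration.
sensor_values = [20.0, 22.0, 21.0, 25.0]

print("batch result:", batch_average(sensor_values))
for running in streaming_average(sensor_values):
    print("running result:", running)
```

The batch function gives one answer only after all data is collected, so that answer may be out of date by the time it is produced. The streaming version always has a current answer but must keep state (`count`, `total`) between readings.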
35. Storm
Apache Storm is a free and open source distributed real-time computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
44. Stream Grouping
Groupings are used to decide to which task in the subscribing bolt (group) a tuple is sent.
Possible Groupings:
- Shuffle
- Fields
- All
- Global
- None
- Direct
- Local or Shuffle
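To illustrate how a grouping picks the target task, here is a minimal simulation of four of the groupings in plain Python. It is not Storm's API or implementation: the function names are hypothetical, tasks are represented by integer indices, and CRC32 is used only as a stand-in for whatever hashing a real engine applies to the grouping fields. None, Direct and Local or Shuffle are left out of the sketch.

```python
import random
import zlib
from typing import Dict, List


def shuffle_grouping(num_tasks: int, rng: random.Random) -> List[int]:
    """Shuffle: pick a task at random so load spreads evenly over time."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup: Dict[str, str], fields: List[str],
                    num_tasks: int) -> List[int]:
    """Fields: tuples with equal values in `fields` go to the same task."""
    key = "|".join(tup[name] for name in fields).encode("utf-8")
    # CRC32 is an assumed stand-in hash, chosen because it is stable
    # across runs (Python's built-in hash() is salted for strings).
    return [zlib.crc32(key) % num_tasks]


def all_grouping(num_tasks: int) -> List[int]:
    """All: the tuple is replicated to every task."""
    return list(range(num_tasks))


def global_grouping(num_tasks: int) -> List[int]:
    """Global: the whole stream goes to one task (the lowest index)."""
    return [0]


rng = random.Random(7)  # seeded so the simulation is repeatable
num_tasks = 4
tup = {"user": "alice", "action": "click"}

print("shuffle:", shuffle_grouping(num_tasks, rng))
print("fields :", fields_grouping(tup, ["user"], num_tasks))
print("all    :", all_grouping(num_tasks))
print("global :", global_grouping(num_tasks))
```

Fields grouping is the one to reach for when a bolt keeps per-key state, such as a count per user, because every tuple for a given key lands on the same task.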