Introduction to Big Data

Introduction to Big Data - how we got here, how we can't avoid the topic anymore.

  1. Introduction to Big Data
  2. Definition of Big Data
     • "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."
     • "Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information."
     • Data is growing way faster than computation speeds
     • A single machine can no longer process, or even store, all this data: the Big Data problem
  3. Where does Big Data come from?
     • Online recorded content:
       - clicks
       - ad views
       - server requests
       - ... everything that happens online can potentially be recorded
     • User-generated content (Facebook, Twitter, Instagram, etc.)
       - smartphone users reach for their phones 150 times a day (2013)
     • Health and scientific computing
       - the Large Hadron Collider produces roughly twice as much data as Twitter every year
     • Internet of Things (IoT)
       - smart thermostat systems
       - automobiles with built-in sensors
       - all kinds of "smart" devices of various sizes
  4. Example scales of Big Data
     • EIR communication logs: 1.4 TB / day
     • Facebook logs: 60 TB / day
     • Google total web index: ~10+ PB (10 000 TB)
     • Facebook total data: 300 PB, with an incoming rate of 600 TB / day (2014)
     • ... as a reminder ...
       - time to read 1 TB from disk: ~3 hours (at 100 MB/s)
       - the Google web index, read serially from one disk, would take ~3.4 years (see the sketch below)
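
     These read-time figures are simple arithmetic; a minimal back-of-the-envelope sketch in Scala (the 100 MB/s sequential throughput is the slide's own figure):

       // Back-of-the-envelope check of the read-time figures above.
       object ReadTimes extends App {
         val throughputMBps = 100.0                 // sequential disk read speed from the slide
         val secondsPerTB   = 1e6 / throughputMBps  // 1 TB = 1,000,000 MB
         println(f"1 TB:      ${secondsPerTB / 3600}%.1f hours")  // ~2.8 h, rounded to ~3 h
         val indexTB = 10000.0                      // ~10 PB Google web index
         val years   = indexTB * secondsPerTB / 3600 / 24 / 365
         println(f"Web index: $years%.2f years")    // ~3.2 y; ~3.4 y using the rounded 3 h/TB
       }
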
  5. How do we program this thing?
  6. OK, but I don't work at Google yet ...
  7. Startup example
     • Let's design a simple web tracker from scratch
     • Register and count each page view for a number of clients
     • "Keep simple things simple"
     • Version 1.0: every page view goes straight to the database (see the sketch below)
     • Problem?
       - Huge number of page views => massive DB load from concurrent updates => DB timeouts => FAIL
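
     A minimal sketch of what Version 1.0 might look like: one synchronous DB write per page view. All names, the JDBC URL and the schema are illustrative assumptions, not from the deck:

       import java.sql.DriverManager

       object TrackerV1 {
         // One shared connection; every page view turns into an immediate upsert.
         private val conn =
           DriverManager.getConnection("jdbc:postgresql://localhost/tracker", "user", "pass")

         def recordPageView(clientId: String, url: String): Unit = {
           val stmt = conn.prepareStatement(
             "INSERT INTO page_views (client_id, url, views) VALUES (?, ?, 1) " +
             "ON CONFLICT (client_id, url) DO UPDATE SET views = page_views.views + 1")
           stmt.setString(1, clientId)
           stmt.setString(2, url)
           stmt.executeUpdate()  // thousands of these contend on the same rows => timeouts
           stmt.close()
         }
       }
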
  8. Version 2.0
     • Why write each count?!
     • Let's introduce a queue and buffer updates (see the sketch below)
     • Problem?
       - # of page views and # of clients keep increasing => DB overload => FAIL
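
     One way Version 2.0 could buffer updates; a sketch, with the queue and batching policy as illustrative assumptions:

       import java.util.concurrent.ConcurrentLinkedQueue

       object TrackerV2 {
         private val queue = new ConcurrentLinkedQueue[(String, String)]()  // (clientId, url)

         // Cheap in-memory append instead of a DB round-trip per page view.
         def recordPageView(clientId: String, url: String): Unit =
           queue.add((clientId, url))

         // Run periodically: collapse queued events into per-key counts, one write per key.
         def flush(): Unit = {
           val batch  = Iterator.continually(queue.poll()).takeWhile(_ != null).toSeq
           val counts = batch.groupBy(identity).view.mapValues(_.size)
           counts.foreach { case ((clientId, url), n) => writeCount(clientId, url, n) }
         }

         // Aggregated DB update, e.g. the upsert from V1 with "+ n" instead of "+ 1".
         private def writeCount(clientId: String, url: String, n: Int): Unit = ???
       }
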
  9. Version 3.0
     • The bottleneck is the write-heavy DB
     • Let's shard the database! (see the sketch below)
     • Problems?!
       - Have to keep adding new servers and re-sharding the existing databases
       - Re-sharding online is tricky (maybe introduce pending queues?)
       - A single code failure can corrupt a huge set of data collected over years
       - Maintenance nightmare
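
     A sketch of the hash-sharding idea; the Database trait and shard list are placeholders, not part of the deck:

       // Hash-partitioning writes across a fixed set of database shards.
       trait Database { def incrementViews(clientId: String, url: String, n: Int): Unit }

       class TrackerV3(shards: Vector[Database]) {
         private def shardFor(clientId: String): Database =
           shards(Math.floorMod(clientId.hashCode, shards.length))

         def writeCount(clientId: String, url: String, n: Int): Unit =
           shardFor(clientId).incrementViews(clientId, url, n)

         // The catch: adding one more shard changes "hashCode mod length" for most keys,
         // so existing rows must be migrated (re-sharded) while new writes keep arriving.
       }
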
  10. Is there a way out?
      • We need new tools which handle:
        - automatic sharding and re-sharding
        - automatic replication and rebalancing
        - fault tolerance
        - effortless horizontal scaling
      • But we need to adapt ourselves as well. We need:
        - a new definition of "data" (data ≠ information)
        - new architectures (Lambda Architecture)
        - immutable data (for scaling and fault tolerance)
        - functional programming concepts (see the sketch below)
      • No, writing 25-year-old structured code in this year's favorite language won't cut it anymore
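
      To make the "immutable data + functional programming" point concrete, a small illustrative sketch: page views as an append-only fact log, with counts derived by a pure fold rather than updated in place:

        // Facts are never mutated; counts are recomputed from the log on demand.
        final case class PageView(clientId: String, url: String, timestamp: Long)

        object ImmutableCounts {
          // A pure function from the event log to view counts: re-runnable,
          // parallelizable, and a code bug cannot corrupt the raw facts.
          def countViews(log: Seq[PageView]): Map[(String, String), Int] =
            log.foldLeft(Map.empty[(String, String), Int]) { (acc, pv) =>
              val key = (pv.clientId, pv.url)
              acc.updated(key, acc.getOrElse(key, 0) + 1)
            }
        }
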
  11. Big Data tooling
      • Apache Hadoop distributed filesystem (HDFS)
        - Distributed, scalable, portable filesystem written in Java
        - Open-source, 10-year-old (!) project
        - Handles files in the gigabytes-to-terabytes range
        - Manages automatic replication and rebalancing of data
        - Facebook had 21 PB of storage on HDFS in 2010
        - Yahoo had a cluster of 10 000 Hadoop nodes in 2008
      • Apache Spark
        - Next-generation data processing engine written in Scala
        - Open-source, 5-year-old project
        - Up to 100 times faster than Hadoop MapReduce
        - Uses functional programming techniques to process data
        - Can scale down to run in an IDE! (see the sketch below)
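
      For a taste of Spark's scale-down claim, a minimal local-mode sketch of the same page-view count (assumes spark-core on the classpath; the sample data is made up):

        import org.apache.spark.{SparkConf, SparkContext}

        object SparkCounts extends App {
          // "local[*]" runs Spark inside this JVM on all cores: small enough for an IDE.
          val conf = new SparkConf().setAppName("page-view-counts").setMaster("local[*]")
          val sc   = new SparkContext(conf)

          val views = sc.parallelize(Seq(
            ("client-1", "/home"), ("client-1", "/home"), ("client-2", "/pricing")))

          // The same aggregation as the tracker, expressed functionally and distributed:
          val counts = views.map(v => (v, 1)).reduceByKey(_ + _)
          counts.collect().foreach(println)

          sc.stop()
        }
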
  12. Apache Spark at a glance
  13. The good news
      • The right tools are available and open-source
      • The knowledge is available and mostly free
      • It's all ready to be learned!
