Big Data

NGDATA
CTO at NGDATA
4 Apr 2011

Big Data

  1. Big Data. Steven Noels & Wim Van Leuven, SAI, 7 April 2011
  2. Houston, we have a problem. IDC says the Digital Universe will be 35 zettabytes by 2020. 1 zettabyte = 1,000,000,000,000,000,000,000 bytes, or 1 billion terabytes
  3. We're drowning in a sea of data.
  4. The fire hose of social and attention data.
  5. We regard content as cost.
  6. ... but data is an opportunity!
  7. Think about it ...
  8. advertisements
  9. recommendations
  10. profile data
  11. anything that sells
  12. The future is for data nerds.
  13. The incumbents view
  14. A different approach: (big) data systems, real time!
  15. What is a Data System? (Nathan Marz)
  16. What is a Data System? (Nathan Marz)
  17. DATA SYSTEM IMPLEMENTATION (Nathan Marz)
  18. Types of store
  19. Is your data BIG enough?
  20. Parting Thoughts: a couple of ideas we want you to remember
  21. Platonic architecture of a Data System: Speed Layer, Batch Layer
  22. Event Driven Architecture
  23. “Top-performing organizations are twice as likely to apply analytics to activities.” (MIT Sloan Management Review, Winter 2011)
  24. Zite - interest-based e-magazine (iPad)
  25. social second screen app
  26. social second screen app
  27. FlipBoard: everyone's excuse to buy an iPad
  28. Announcement
  29. Thanks! Wim & Steven.

Editor's notes

  1. - like disk seek time: how long does it take to read a full 1TB disk compared to the 4MB HD of 20 years ago? - Amazon lets you ship hard disks to load data
  2. - the only solution is to divide work beyond one node, bringing us to cluster technology - but ... clusters have their own programming challenges, e.g. workload management, distributed locking and distributed transactions - but clusters have one property in particular ... Anyone know which?
  3. - Failure! Nodes will certainly fail. In large setups there are continuous breakdowns. - ... making it even more difficult to build software on the grid. - It needs to be fault-tolerant, but also self-orchestrating and self-healing - Assistance you will need: standing on the shoulders of giants
  4. - Distributed file system for highly available data - MapReduce to bring logic to the data on the nodes and bring back the results - BigTable & Dynamo to add real-time read/write access to big data - with FOSS implementations which allow US to build applications, not the plumbing ...
  5. Although the basic functions of those technologies are rather simple/high-level, their implementations hardly are. - They represent the state of the art in operating and distributed systems research: distributed hash tables (DHT), consistent hashing, distributed versioning, vector clocks, quorums, gossip protocols, anti-entropy based recovery, etc. - ... with an industrial/commercial angle: Amazon, Google, Facebook, ... Let's explain some of the basic technologies
  6. The most important classifier for scalable stores: CA, AP, CP
  7. KV (Amazon Dynamo), column family (Google BigTable), document stores (MongoDB), graph DBs (Neo4J). Please remember: scalability, availability and resilience come at a cost
  8. RDBMSs scale to reasonable proportions, bringing commodity technology, tools, knowledge and experience. Big data stores are rather uncharted territory, lacking tools, standardized APIs, etc. Cost of hardware vs. cost of learning. Do your homework!
  9. Ref: http://www.slideshare.net/quipo/nosql-databases-why-what-and-when (good overview of different OSS and commercial implementations with their classification and features; slides 96 ...)
  10. Basic support for secondary indexes; better to use full-text search tools like Solr or Katta. Implement joins by denormalization, meaning consistency has to be maintained by the application, i.e. DIY. Transactions are mostly non-existent, meaning you have to divide your application to support data statuses and/or implement counter-transactions for failures. No true query language, but MapReduce jobs or higher-level languages like HiveQL and Pig Latin. However, these are not very interactive; they are rather meant for ETL and reporting. Think data warehouse. Complement with full-text search tools like Solr and Katta for added value, including faceted search.
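
Notes 4 and 10 describe MapReduce as bringing the logic to the data on each node and bringing back the results, in place of a true query language. The following is a minimal single-process sketch of that idea in Python, not the API of Hadoop or any other framework. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample partitions are hypothetical and chosen only for illustration; a real cluster would run the map step on the nodes that hold each partition.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit one (word, 1) pair per word in a single input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every record of every partition, then shuffle and reduce.

    Each inner list stands in for the data held by one node (an assumption
    of this sketch); here everything runs sequentially in one process.
    """
    mapped = chain.from_iterable(
        map_phase(record) for partition in partitions for record in partition
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical partitions, as if stored on two different nodes.
partitions = [["big data is big"], ["data is an opportunity"]]
counts = run_job(partitions)
```

Only the small per-key results travel back to the caller, which is why this style suits batch reporting and ETL better than interactive queries.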