44 times as much data in the next decade: 15 ZB expected in 2015.
Data silos (ERP, CRM, …)
Customers:
- Trimble (3 TB in their database system)
- Truvo (changing an index takes 24 hours)
Traditional systems cannot handle this volume.
How much data do you have?
Turn the 12 terabytes of tweets created each day into improved product sentiment analysis.
Convert 350 billion annual meter readings into better predictions of power consumption.
Real time: time-sensitive decision making.
- Fraud detection
- Energy allocation
- Marketing campaigns
- Market transactions
Solution: real-time systems in combination with batch (Hadoop); NoSQL systems.
Structured vs. unstructured: 80% of data is unstructured.
A key drawback of traditional relational database systems is that they are not good at handling variable data; a flexible data model is needed.
Word documents, email, photos, text, video, …
What are your needs regarding variety?
The end result: bringing structure into unstructured data.
Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
We live in an ever-changing world: agility is an organization's ability to quickly modify its computing systems in response to changing business requirements.
Change control (adding or changing features of a system between releases).
Customization of specific data services.
How agile are you? How to be agile:
- Culture
- Systems
- Schema flexibility
It is easier to store all data in a cost-effective way. Compare this to the DWH (data warehouse) world.
The # of followers on Twitter = all follow and unfollow events combined.
Data = event
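The two notes above can be sketched in code: instead of storing a mutable follower count, store every follow/unfollow as an immutable event and derive the count by folding over the event log. This is a minimal illustration, not Twitter's actual data model; the event fields and function names are made up.

```python
from dataclasses import dataclass

# Hypothetical event records: each follow/unfollow is an immutable event
# rather than a mutation of a stored "follower_count" field.
@dataclass(frozen=True)
class FollowEvent:
    follower: str
    followee: str
    kind: str  # "follow" or "unfollow"

def follower_count(events, user):
    """Derive the current follower count by replaying all events."""
    followers = set()
    for e in events:
        if e.followee != user:
            continue
        if e.kind == "follow":
            followers.add(e.follower)
        else:
            followers.discard(e.follower)
    return len(followers)

events = [
    FollowEvent("alice", "bob", "follow"),
    FollowEvent("carol", "bob", "follow"),
    FollowEvent("alice", "bob", "unfollow"),
]
print(follower_count(events, "bob"))  # 1
```

The stored data is the raw events; the count is a view computed from them.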
Only CD, no more CRUD: records are only created (and eventually deleted), never updated in place. Information might of course still change. Fault tolerance.
Allows state regeneration, e.g. "What was my bank balance on 1 May 2005?"
Modelling queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions of the data and aggregate information in different ways.
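As a sketch of that formulation, here are two hypothetical queries as pure functions over the same master dataset, each looking at a different portion and aggregating differently. The dataset and function names are illustrative only.

```python
# The master dataset: raw, immutable pageview records (made up for illustration).
pageviews = [
    {"url": "/home", "user": "alice"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "alice"},
]

def views_per_url(data):
    """Query 1: aggregate by URL."""
    counts = {}
    for pv in data:
        counts[pv["url"]] = counts.get(pv["url"], 0) + 1
    return counts

def unique_visitors(data):
    """Query 2: look only at the user field, aggregate as a distinct count."""
    return len({pv["user"] for pv in data})

print(views_per_url(pageviews))    # {'/home': 2, '/about': 1}
print(unique_visitors(pageviews))  # 2
```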
Too slow: the dataset might be petabyte-scale.
The batch layer can calculate anything (given enough time).
It doesn't have to be Hadoop. What matters here is a distributed file system combined with a processing framework.
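To make the "processing framework" half concrete, here is a toy, single-process map/reduce word count. It is only a sketch of the programming model: real batch layers (Hadoop, Spark, …) run the same two phases distributed across a cluster, reading input from a distributed file system.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (key, value) pairs; here, (word, 1) for each word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine all values per key; here, sum the counts."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data", "big batch"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 1, 'batch': 1}
```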
Source: PolybasePass2012.pptx
In some circumstances.
In some circumstances.
CAP theorem:
- Consistency: all nodes see the same data at the same time.
- Availability: a guarantee that every request receives a response about whether it was successful or failed.
- Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
http://codahale.com/you-cant-sacrifice-partition-tolerance/
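The trade-off behind these three properties can be sketched as a toy replica that, during a network partition, either refuses writes it cannot replicate (giving up availability for consistency) or accepts them and risks divergence (giving up consistency for availability). This is an illustration of the idea only, not a real replication protocol.

```python
class Replica:
    """Toy replica: 'CP' sacrifices availability under partition, 'AP' sacrifices consistency."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (hypothetical labels)
        self.value = None
        self.partitioned = False  # True = cut off from the other replicas

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # Consistent choice: refuse rather than accept an unreplicated write.
            raise RuntimeError("unavailable: cannot reach quorum")
        # Available choice: accept the write; other replicas may now diverge.
        self.value = value
        return "ok"

ap = Replica("AP")
ap.partitioned = True
print(ap.write("x"))  # ok

cp = Replica("CP")
cp.partitioned = True
try:
    cp.write("x")
except RuntimeError as e:
    print(e)  # unavailable: cannot reach quorum
```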
E.g. unique counts.
Nimbus: manages the cluster.
Worker node:
- Supervisor: manages the workers; restarts them if needed.
- Worker: physical JVM process. Executes tasks (those are spread evenly across the workers).
- Task: runs in its own thread; is the actual Bolt or Spout. Processes the stream.
Tuple: named list of values; dynamically typed.
Stream: sequence of tuples.
Spout: source of streams; sometimes replayable.
Bolt: stream transformations; at least 1 input stream, 0 to many output streams.
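The spout/bolt model above can be imitated in a few lines of plain Python: a spout emits a stream of tuples and each bolt transforms the stream. The names are illustrative; real Storm topologies are written against the Storm API (typically in Java), with tuples routed between tasks across the cluster.

```python
def sentence_spout():
    """Spout: a source of tuples (here, one-field tuples carrying a sentence)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: one input stream, one output stream of (word,) tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: aggregates the word stream into counts (no output stream)."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_bolt(split_bolt(sentence_spout())))
# {'storm': 1, 'processes': 1, 'streams': 2, 'of': 1, 'tuples': 1}
```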