This document discusses big data architectural concerns for handling large volumes of data. It notes that new technologies have enabled more efficient use of big data, creating a positive feedback loop where more adoption of big data leads to even more data generation. This requires new ways of storing, retrieving, scaling and analyzing data in a distributed manner. Examples of distributed databases like Bigtable and Dynamo are discussed. The document recommends flexible schemas, storing data close to its domain model, and limiting data movement for efficiency.
9. • Walmart handles 1M transactions per hour
• Google processes 24PB of data per day
• AT&T transfers 30PB of data per day
• 90 trillion emails are sent every year
• World of Warcraft uses 1.3PB of storage
Sunday, 2 December 12
10. Big Data - the positive feedback cycle
1. new technologies make using big data efficient
2. more adoption of big data
3. generation of more big data
11. new technologies .. new architectural concerns
20. The Database Landscape so far ..
• relational database - the bedrock of enterprise data
• irrespective of application development paradigm
• object-relational mapping considered to be the panacea for impedance mismatch
21. blogger, big geek and architectural consultant
“Object-Relational Mapping is the Vietnam of Computer Science”
- Ted Neward (2006)
22. RDBMS & Big Data
• once the data volume crosses the limit of a single server, you shard / partition
• sharding implies a lookup node for the hash code => SPOF
• cross-shard joins and transactions don’t scale
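The lookup-node problem above can be sketched in a few lines. This is a minimal, hypothetical hash-based shard router (the class and shard names are illustrative, not from the slides): every client must go through the router to locate a shard, which makes it the single point of failure, and adding a shard reshuffles almost every key.

```python
import hashlib

class ShardRouter:
    """Naive hash-based shard lookup. The router is itself the
    SPOF the slide warns about: every read and write must consult
    it to find the right shard."""

    def __init__(self, shards):
        self.shards = shards  # e.g. ["db-0", "db-1", "db-2"]

    def shard_for(self, key: str) -> str:
        # Stable hash -> shard index. Note that changing
        # len(self.shards) remaps almost every key, which is why
        # naive resharding is so painful.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

router = ShardRouter(["db-0", "db-1", "db-2"])
owner = router.shard_for("user:42")  # deterministic for a fixed shard list
```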
23. RDBMS & Big Data
• cost of distributed transactions
• synchronization overhead
• 2-phase commit is a blocking protocol (can block indefinitely)
• as slow as the slowest DB node + network latency
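A toy sketch of why 2-phase commit blocks (the `Participant` class is hypothetical, for illustration only): the coordinator cannot decide until *every* participant has voted, so the transaction runs at the speed of the slowest node, and a silent node stalls it indefinitely.

```python
class Participant:
    """A toy 2PC participant that votes in the prepare phase."""
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. A real node would also persist a redo log
        # here; a node that never answers blocks the coordinator.
        self.state = "prepared"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes from ALL participants (blocking step)
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: broadcast commit
        for p in participants:
            p.commit()
        return "committed"
    # Any 'no' vote (timeouts not modelled here) aborts everywhere
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant(), Participant()]))  # committed
```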
24. RDBMS & Big Data
• Master/Slave replication
• synchronous replication => slow
• asynchronous replication => can lose data
• writing to the master is a bottleneck and SPOF
25. Need Distributed Databases
• data is automatically partitioned
• transparent to the application
• add capacity without downtime
• failure tolerant
26. 2 famous papers ..
• Bigtable: A Distributed Storage System for Structured Data, 2006
• Dynamo: Amazon’s Highly Available Key-value Store, 2007
27. Addressing 2 Approaches
• Bigtable: “how can we build a distributed database on top of GFS?”
• Dynamo: “how can we build a distributed hash table appropriate for the data center?”
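The distributed-hash-table idea behind Dynamo rests on consistent hashing. Here is a minimal sketch (class and node names are hypothetical): nodes and keys are hashed onto the same ring, a key belongs to the first node clockwise from it, and adding or removing a node only moves the keys on one arc instead of reshuffling everything.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring in the spirit of Dynamo's
    partitioning scheme. Each physical node is placed on the ring
    several times (virtual nodes) to even out the load."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                pos = self._hash(f"{node}#{i}")
                bisect.insort(self._ring, (pos, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node;
        # wrap around at the end of the ring.
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, (pos, ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # always the same node for this key
```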
28. Big Data recommendations
• reduce accidental complexity in processing data
• be less rigid (no rigid schema)
• store data in a format closer to the domain model
• hence no universal data model ..
29. Polyglot Storage
• unfortunately came to be known as NoSQL databases
• document-oriented (MongoDB, CouchDB)
• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
• data-structure based (Redis)
• graph based (Neo4j)
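A tiny illustration of the document vs. key/value distinction above, using plain Python dicts as stand-ins for real stores (the data and keys are made up): a document store keeps the whole aggregate in a shape close to the domain model, while a pure key/value store sees only opaque bytes under a key.

```python
import json

# Document-oriented style (MongoDB/CouchDB): the aggregate is
# stored whole, mirroring the domain model, and fields are queryable.
user_doc = {
    "_id": "user:42",
    "name": "Alice",
    "orders": [{"sku": "X1", "qty": 2}],
}

# Key/value style (Dynamo/Riak): the same aggregate serialized into
# an opaque value - the store knows nothing about its structure, so
# lookups happen only by key.
kv_store = {"user:42": json.dumps(user_doc)}

restored = json.loads(kv_store["user:42"])
print(restored["name"])  # Alice
```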
30. reduced impedance mismatch
• richer modeling capabilities
• closer to domain model
33. Relational Database is just another option, not the only option when the data set is BIG and semantically rich
34. 10 things never to do with a Relational Database
• Search
• Recommendation
• High-Frequency Trading
• Product Cataloging
• User groups / ACLs
• Log Analysis
• Media Repository
• Email
• Classified ads
• Time Series / Forecasting
Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
36. CAP Theorem
• Consistency, Availability & Partition Tolerance
• a distributed system can guarantee at most 2 of these
• Eric Brewer postulated this back in 2000
37. ACID => BASE
• Basically Available, Soft state, Eventual consistency
• rather than requiring consistency after every transaction, it’s enough for the database to eventually reach a consistent state
• it’s OK to use stale data and it’s OK to give approximate answers
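Eventual consistency can be sketched with a last-write-wins register, a toy model chosen here for illustration (the class is hypothetical, not from the slides): each replica accepts writes locally, reads may return stale data in between, and replicas converge once they exchange and merge state.

```python
class LWWRegister:
    """Last-write-wins register: a toy eventual-consistency model.
    Replicas accept writes independently and later merge by keeping
    the value with the highest timestamp."""

    def __init__(self):
        self.value, self.ts = None, 0

    def write(self, value, ts):
        # Keep only the newest write we have seen so far
        if ts > self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # Anti-entropy: merging is just replaying the other
        # replica's latest write against our own
        self.write(other.value, other.ts)

a, b = LWWRegister(), LWWRegister()
a.write("v1", ts=1)      # a client writes to replica a
b.write("v2", ts=2)      # a later write lands on replica b
print(a.value)           # stale read: v1
a.merge(b); b.merge(a)   # replicas exchange state
print(a.value, b.value)  # both converge to v2
```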
39. Big Data in the wild
• Hadoop
• started as a batch processing engine (HDFS & Map/Reduce)
• with bigger and bigger data, you need to make it available to users in near real time
• stream processing, CEP ..
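The Map/Reduce model Hadoop started with can be sketched in-process with the classic word count, shown here as a minimal single-machine illustration of the two phases (a real job would distribute the map and reduce tasks across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit an intermediate (word, 1) pair for every word
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big", "data pipeline"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'pipeline': 1}
```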
40. complementing Map/Reduce in Hadoop
• Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop-compatible file systems
• Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Cloudera Impala: real-time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
41. Real time queries in Hadoop
• currently people use Hadoop connectors to massively parallel databases to do real-time queries in Hadoop
• expensive, and may need lots of data movement between the database & the Hadoop clusters
42. .. and the Hadoop ecosystem continues to grow, with lots of real-time tools being actively developed that are compliant with the current base ..
43. Shark from UC Berkeley
• a large-scale data warehouse system for Spark, compatible with Hive
• supports HiveQL, Hive data formats and user-defined functions; in addition, Shark can be used to query data in HDFS, HBase and Amazon S3
44. BI and Analytics
• making Big Data available to developers
• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
45. Machine Learning
• personalization
• social network analysis
• pattern discovery - click patterns, recommendations, ratings
• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
46. Summary
• Big Data will grow bigger - we need to embrace the changes in architecture
• an RDBMS is NOT a panacea - pick the data model that’s closest to your domain
• it’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
47. Summary
• go for decentralized architectures, avoid SPOFs
• with big volumes of data, streaming is your friend