2. Scope
• Internet effect on corporate data centre
• End of Moore’s law
• Scaling on and off CPU
3. Internet emerges
• 1980s - Connections
• Broadband connectivity at work, modem @ home
• Beginnings of e-Commerce (Amazon’s reader recommendations show the way)
• 1990s - Few Publishers
• Internet bubble
• Rise of Search (Google shows the way)
• Start of consumer publications (Blogs / WIKIs)
4. Read-write Internet
• Good connectivity / reach
• Social networking = publication explosion
• Smartphones with WiFi / 3G
5. Outputs
• More, much more data
• Content is rich (read BIG!!)
• audio, video, photo
• Data is unstructured or semi-structured
• users don’t do DBs
6. We ain’t Twitter
• OK, but wouldn’t you like to mine all of that
public information?
• See what they are saying about your
products / competitors / their requirements?
• Is there any possibility of turning on an internal
fire hose?
• How many fine-grained business events
happen in your company that you would like
to track / analyse? Someone will....
7. Fire Hydrants
• They’re coming - more data, from more people
and more devices
• Use data to improve decisions
• Gain insight to the organisation
• Jump competition or at least maintain pace
8. Numbers
• Facebook serves 250k unique pages per
second (June 2010)
• Twitter has seen a rise from 10m to 50m
tweets per day in the last year (July 2010)
• 1GB of disk: $700k (1980) -> 10c (2010)
• “Between the birth of the world and 2003, there
were 5 exabytes of information created. We [now]
create 5 exabytes every 2 days.” Eric Schmidt,
CEO, Google
9. So what?
• As people share more, they will change the way
they form their opinions
• Existing media channels are struggling to adapt
their business models
• Traditional market research, product marketing
and after-sales channels become less relevant to
these consumers
• Being out of the loop is bad for business
10. How bad for business?
• Now: data is a key asset of business
• Future: business data is not only private
• public content is integrated into the analysis
• Maintaining secrecy will rise in cost
• internal systems management
• governance as you join the conversation
11. Effects
• Conventional platforms cannot
• store so much data cost effectively
• process the data cost effectively
• derive meaning from unstructured sources
12. Hardware Now
• SMP x86-64 & bit players
• Large local RAM (<=2TB)
• NAS for high capacity storage (<=14PB)
• On-premise
13. Today’s Big Boxes
• Indicate trends and influences
• Use 50k-250k CPU cores
• All Top 10 supercomputers run Linux
• Algorithms must be fault tolerant
14. Moore’s is Less
• Moore’s law was the software developer’s friend
• 30 years of good times, speed-ups “for free”
• The outward effect of Moore’s law is only maintained by exploiting multiple cores
• Standard programming models need to adapt to use multiple cores
15. Hardware Horizon
• Fast inter-core buses and networks
• InfiniBand: 10Gb/s - 120Gb/s
• Networked memory
• NUMA - not homogeneous
• Exabyte disk clusters
• Elastic scaling
• On and off-premise integrated
16. Distributed Disruption
• Existing clustering options do not work
• Existing software models do not work
• Existing data models do not work
17. Old Skool
• Traditional clustering enables all machines
in a cluster to behave as if they are one in
space and time
• Not physically possible to cluster online
access to all data globally with today’s
hardware and networks (ask Google)
• Not news: traditional corporations do not
have real-time, coherent global BI databases
18. I’m gonna pop a CAP in
your head
• Repeat: clustering does not scale
• So, you can have 2 from 3 of:
• Consistency
• Availability
• Partition Tolerance
19. AC / DC
• One needs Partition tolerance to scale, so
you can only have:
• Availability OR
• Consistency
• All attempts to scale out conventional
databases and application servers prove the
theorem (who still believes in sharding?)
20. Availability
• Enables high service levels so the site stays
alive
• Lose global consistency for periods
(seconds or less)
21. Consistency
• Focus of RDBMS today
• High cost only appropriate for high value
• Remains the default for non-scaling cases
22. Eventual Consistency
• A datastore guarantees that updates eventually reach all cluster members
• Some desirable properties
• Read your own writes
• Limited form of cursor stability
• Monotonic read consistency
• Only see updates in the order they happened
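The monotonic-read property above can be sketched client-side. This is a toy illustration, not any product's API (all names here are invented): a wrapper remembers the highest version it has seen and treats anything older as a stale reply from a lagging replica.

```java
import java.util.Optional;

// Hypothetical sketch of monotonic read consistency: the client tracks
// the newest version it has observed and refuses to go backwards.
public class MonotonicReader {
    // A value as returned by some replica, tagged with a version number.
    public static final class Versioned {
        final String value;
        final long version;
        public Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private long lastSeenVersion = -1;

    // Accept a reply only if it is at least as new as anything seen before.
    public Optional<String> read(Versioned reply) {
        if (reply.version < lastSeenVersion) {
            return Optional.empty();      // stale replica: caller retries elsewhere
        }
        lastSeenVersion = reply.version;
        return Optional.of(reply.value);
    }

    public static void main(String[] args) {
        MonotonicReader r = new MonotonicReader();
        System.out.println(r.read(new Versioned("v2", 2)).orElse("STALE")); // v2
        System.out.println(r.read(new Versioned("v1", 1)).orElse("STALE")); // STALE
    }
}
```

Read-your-own-writes works the same way: treat your own write as a version you have "seen".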
23. Sclerotic Software
• Early (mostly static) binding of everything
to everything else
• Point to point traffic routing
• Application to server
• Single thread model of control
• Program language to runtime
• Object models to SQL
24. Shapeability
• Dynamic data routing
• Runtime, in-place upgrades
• Languages that support parallel functions
• Multiple evolving and coexisting schemas
• Zero impedance mismatch
25. Dynamic Data Routing
• Cannot rely on per-input solutions
• Data transfer protocols should have
minimal impact on programming models
• Law of leaky abstractions
• Bus required to allow evolution and to add
intelligence to routes
26. Upgrades
• Software must be upgradeable in parts
• Software must stay up while upgrade is
ongoing
• Modular, transitive upgrades (Maven, OSGi)
• Hypervisor VM mobility (vMotion,
Teleportation)
27. SCAlable LAnguages
• Java 7 comes with more concurrency
support (fork/join ... due mid 2011)
• Functional languages have support for high
concurrency
• JVM Languages: Scala, Clojure
• .NET: F#
• Others: Erlang, Haskell, OCaml
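As a small illustration of the fork/join support mentioned above, here is a minimal Java sketch that sums an array by recursively splitting the range until chunks are small enough to process sequentially. The class name and threshold are arbitrary choices for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join sketch: divide-and-conquer sum over a long[] range.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;  // arbitrary cut-off for this example
    private final long[] data;
    private final int from, to;

    public ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {            // small enough: sum sequentially
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                             // schedule left half in parallel
        return right.compute() + left.join();    // compute right here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        long sum = new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
        System.out.println(sum);  // 100000 * 100001 / 2 = 5000050000
    }
}
```

Work-stealing in the pool keeps idle cores busy without the programmer managing threads directly, which is the kind of adaptation the slides argue for.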
28. Schema Shmeema
• Easiest schema evolution is with no schema
(NoSQL data stores)
• Where schema is needed, data can travel with its schema (Avro, Riak, CouchDB)
• Data can be shared via REST, JMS or trickle to RDBMS
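To illustrate the "data travels with its schema" idea, here is a toy sketch (not Avro's actual format; field names and layout are invented): a record simply embeds a description of its own fields, so a reader needs no external registry.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy self-describing record: the "_schema" entry rides along with the data.
public class SelfDescribing {
    static Map<String, Object> userRecord() {
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("name", "User");
        schema.put("fields", Arrays.asList("id:long", "email:string"));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("_schema", schema);          // schema embedded in the record
        record.put("id", 42L);
        record.put("email", "a@example.com");
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record = userRecord();
        @SuppressWarnings("unchecked")
        Map<String, Object> schema = (Map<String, Object>) record.get("_schema");
        System.out.println(schema.get("name") + ": " + record.get("id"));
    }
}
```

Schema evolution then becomes a data problem, not a deploy problem: old and new records coexist, each carrying enough information to be decoded.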
29. Objections to models
• Remove the RM from ORM
• Externalize schemas, don’t internalize them
• Prefer simple persistence options
• key/value, graph or document-oriented
30. What scales
• HTTP - it’s stateless... but:
• Caching layers need to be added
• Protocol can go faster (Google et al
proposing updates for 1.2)
• Er, that’s it from the current stack
31. App server scale FAIL
• Threads too coarse grained and expensive
• Need Actor model to be reliable and scale
out to exploit the hardware
• CAP based design patterns over data
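The actor model mentioned above can be sketched in plain Java (a toy, not a production actor library; class and method names are invented): one thread owns a mailbox and is the only mutator of the actor's state, so no locks are needed around that state, unlike the coarse shared-thread model being criticised.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy actor: messages go into a mailbox; a single worker thread drains it.
public class CounterActor {
    private final BlockingQueue<Long> mailbox = new LinkedBlockingQueue<>();
    private final AtomicLong total = new AtomicLong();  // safely readable from outside
    private final Thread worker;

    public CounterActor() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    long msg = mailbox.take();
                    if (msg < 0) return;        // poison pill: shut down
                    total.addAndGet(msg);       // only this thread mutates state
                }
            } catch (InterruptedException ignored) {
            }
        });
        worker.start();
    }

    public void send(long msg) { mailbox.add(msg); }

    public long shutdownAndGet() throws InterruptedException {
        mailbox.add(-1L);                       // poison pill after all real messages
        worker.join();
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CounterActor actor = new CounterActor();
        for (int i = 1; i <= 100; i++) actor.send(i);
        System.out.println(actor.shutdownAndGet());  // 5050
    }
}
```

Because actors only share messages, not memory, millions of them can be scheduled over a fixed pool of cores - far cheaper than one thread per unit of work.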
32. Software scaling limits
• Amdahl’s law still applies:
• Can only go as fast as the slowest serial task
• Worse if that task blocks others, which it
often does
• Software needed to support cloud design
and testing
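Amdahl's law can be made concrete with a few numbers. This small sketch assumes 95% of the work parallelises (an illustrative figure, not from the slides) and shows the remaining serial 5% capping speedup at 20x no matter how many cores are added.

```java
// Amdahl's law: with parallel fraction p on n cores,
//   speedup(n) = 1 / ((1 - p) + p / n)
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.95;  // assume 95% of the work parallelises
        for (int n : new int[]{1, 8, 64, 1024}) {
            System.out.printf("%4d cores -> %.1fx%n", n, speedup(p, n));
        }
        // Even with unlimited cores, the 5% serial part caps speedup at 1/0.05 = 20x.
    }
}
```

This is why a single blocking task hurts so much: it grows the serial fraction and drags the ceiling down for the whole cluster.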
33. RDBMS scale FAIL
• Index updates do not scale linearly with data
• Normalize to reduce data volumes but then joins
become too expensive
• Transactions are costly and often not needed
(especially for READ)
• Hard to manage xx,000 MySQL instances (ask
Yahoo! and Facebook)
• License fees scale with load ($1m+ / month for
Facebook just to serve photos)
34. NoSQL
• Different flavours (with examples)
• column oriented (Hadoop, HBase)
• document store (CouchDB, MongoDB)
• key value store (Riak, Redis)
• eventually consistent (Dynamo, Voldemort)
• graph database (Neo4j, InfiniteGraph)
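A sketch of the kind of minimal persistence API the key/value flavour exposes (a hypothetical in-memory class, not any particular product): just get/put/delete, no schema, no joins, no transactions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key/value store illustrating the simple API shape;
// real stores persist to disk and replicate, but the surface looks like this.
public class KeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { data.put(key, value); }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void delete(String key) { data.remove(key); }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("user:42", "Ada".getBytes());
        System.out.println(new String(store.get("user:42").orElseThrow()));  // Ada
        store.delete("user:42");
        System.out.println(store.get("user:42").isPresent());                // false
    }
}
```

The application owns serialization (here, raw bytes), which is exactly the "simpler application persistence API" trade-off the next slide lists as a gain.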
35. NoSQL gains
• Scale
• Performance
• Reliability and uptime
• Simpler application persistence API
• Some SQL syntax for aggregate operations
• Zero backup, if using HA file system
36. NoSQL loses
• SQL - especially joins
• Schema
• Transactions
• Consistency (for some coarse-grained
aspects at least)
• Query tools are immature / low-level
37. NoSQL
• For the diplomats: No(t Only) SQL
• SQL will live on in many applications and
Use Cases
Editor's Notes
#1 owned by USA, #2 owned by PRC
CAP theorem proposed in June 2000 by Eric Brewer
Robert Patrick
Java: fork/join, parallel arrays, tail-call recursion improvements, not closures