2. Scope
• Internet effect on corporate data centre
• End of Moore’s law
• Scaling on and off CPU
3. Internet emerges
• 1980s - Connections
• Broadband connectivity at work, modem @ home
• Beginnings of e-Commerce (Amazon’s reader recommendations show the way)
• 1990s - Few Publishers
• Internet bubble
• Rise of Search (Google shows the way)
• Start of consumer publications (Blogs / WIKIs)
4. Read-write Internet
• Good connectivity / reach
• Social networking = publication explosion
• Smartphones with WiFi / 3G
5. Outputs
• More, much more data
• Content is rich (read BIG!!)
• audio, video, photo
• Data is unstructured or semi-structured
• users don’t do DBs
6. We ain’t Twitter
• OK, but wouldn’t you like to mine all of that
public information?
• See what they are saying about your
products / competitors / their requirements?
• Is there any possibility of turning on an internal
fire hose?
• How many fine-grained business events
happen in your company that you would like
to track / analyse? Someone will....
7. Fire Hydrants
• They’re coming - more data, from more people
and more devices
• Use data to improve decisions
• Gain insight to the organisation
• Jump competition or at least maintain pace
8. Numbers
• Facebook serves 250k unique pages per
second (June 2010)
• Twitter has seen a rise from 10m to 50m
tweets per day in the last year (July 2010)
• 1GB of disk: $700k (1980) -> 10c (2010)
• “Between the birth of the world and 2003, there
were 5 exabytes of information created. We [now]
create 5 exabytes every 2 days.” Eric Schmidt,
CEO, Google
9. So what?
• As people share more, they will change the way
they form their opinions
• Existing media channels are struggling to adapt
their business models
• Traditional market research, product marketing
and after-sales channels become less relevant to
these consumers
• Being out of the loop is bad for business
10. How bad for business?
• Now: data is a key asset of business
• Future: business data is not only private
• public content is integrated into the analysis
• Maintaining secrecy will rise in cost
• internal systems management
• governance as you join the conversation
11. Effects
• Conventional platforms cannot
• store so much data cost effectively
• process the data cost effectively
• derive meaning from unstructured sources
12. Hardware Now
• SMP x86-64 & bit players
• Large local RAM (<=2TB)
• NAS for high capacity storage (<=14PB)
• On-premise
13. Today’s Big Boxes
• Indicate trends and influences
• Use 50k-250k CPU cores
• All Top 10 supercomputers run Linux
• Algorithms must be fault tolerant
14. Moore’s is Less
• Moore’s law was the software developer’s friend
• 30 years of good times, speed-ups “for free”
• The outward effect of Moore’s law is only maintained by exploiting multiple cores
• Standard programming models need to adapt to use multiple cores
15. Hardware Horizon
• Fast inter-core buses and networks
• InfiniBand: 10Gb/s - 120Gb/s
• Networked memory
• NUMA - not homogeneous
• Exabyte disk clusters
• Elastic scaling
• On and off-premise integrated
16. Distributed Disruption
• Existing clustering options do not work
• Existing software models do not work
• Existing data models do not work
17. Old Skool
• Traditional clustering enables all machines
in a cluster to behave as if they are one in
space and time
• Not physically possible to cluster online
access to all data globally with today’s
hardware and networks (ask Google)
• Not news: traditional corporations do not
have real-time, coherent global BI databases
18. I’m gonna pop a CAP in
your head
• Repeat: clustering does not scale
• So, you can have 2 from 3 of:
• Consistency
• Availability
• Partition Tolerance
19. AC / DC
• One needs Partition tolerance to scale, so
you can only have:
• Availability OR
• Consistency
• All attempts to scale out conventional
databases and application servers prove the
theorem (who still believes in sharding?)
20. Availability
• Enables high service levels so the site stays
alive
• Lose global consistency for periods
(seconds or less)
21. Consistency
• Focus of RDBMS today
• High cost only appropriate for high value
• Remains the default for non-scaling cases
22. Eventual Consistency
• A datastore guarantees that updates eventually reach all cluster members
• Some desirable properties
• Read your own writes
• Limited form of cursor stability
• Monotonic read consistency
• Only see updates in the order they happened
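The monotonic-read property above can be sketched client-side. This is a toy illustration, not any product's API (all names here are invented): a wrapper remembers the highest version it has seen and treats anything older as a stale reply from a lagging replica.

```java
import java.util.Optional;

// Hypothetical sketch of monotonic read consistency: the client tracks
// the newest version it has observed and refuses to go backwards.
public class MonotonicReader {
    // A value as returned by some replica, tagged with a version number.
    public static final class Versioned {
        final String value;
        final long version;
        public Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private long lastSeenVersion = -1;

    // Accept a reply only if it is at least as new as anything seen before.
    public Optional<String> read(Versioned reply) {
        if (reply.version < lastSeenVersion) {
            return Optional.empty();      // stale replica: caller retries elsewhere
        }
        lastSeenVersion = reply.version;
        return Optional.of(reply.value);
    }

    public static void main(String[] args) {
        MonotonicReader r = new MonotonicReader();
        System.out.println(r.read(new Versioned("v2", 2)).orElse("STALE")); // v2
        System.out.println(r.read(new Versioned("v1", 1)).orElse("STALE")); // STALE
    }
}
```

Read-your-own-writes works the same way: treat your own write as a version you have "seen".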
23. Sclerotic Software
• Early (mostly static) binding of everything
to everything else
• Point to point traffic routing
• Application to server
• Single thread model of control
• Program language to runtime
• Object models to SQL
24. Shapeability
• Dynamic data routing
• Runtime, in-place upgrades
• Languages that support parallel functions
• Multiple evolving and coexisting schemas
• Zero impedance mismatch
25. Dynamic Data Routing
• Cannot rely on per-input solutions
• Data transfer protocols should have
minimal impact on programming models
• Law of leaky abstractions
• Bus required to allow evolution and to add
intelligence to routes
26. Upgrades
• Software must be upgradeable in parts
• Software must stay up while upgrade is
ongoing
• Modular, transitive upgrades (Maven, OSGi)
• Hypervisor VM mobility (vMotion,
Teleportation)
27. SCAlable LAnguages
• Java 7 comes with more concurrency
support (fork/join ... due mid 2011)
• Functional languages have support for high
concurrency
• JVM Languages: Scala, Clojure
• .NET: F#
• Others: Erlang, Haskell, OCaml
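As a small illustration of the fork/join support mentioned above, here is a minimal Java sketch that sums an array by recursively splitting the range until chunks are small enough to process sequentially. The class name and threshold are arbitrary choices for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join sketch: divide-and-conquer sum over a long[] range.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;  // arbitrary cut-off for this example
    private final long[] data;
    private final int from, to;

    public ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {            // small enough: sum sequentially
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                             // schedule left half in parallel
        return right.compute() + left.join();    // compute right here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        long sum = new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
        System.out.println(sum);  // 100000 * 100001 / 2 = 5000050000
    }
}
```

Work-stealing in the pool keeps idle cores busy without the programmer managing threads directly, which is the kind of adaptation the slides argue for.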
28. Schema Shmeema
• Easiest schema evolution is with no schema
(NoSQL data stores)
• Where schema is needed, data can travel with its schema (Avro, Riak, CouchDB)
• Data can be shared via REST, JMS or trickle to RDBMS
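To illustrate the "data travels with its schema" idea, here is a toy sketch (not Avro's actual format; field names and layout are invented): a record simply embeds a description of its own fields, so a reader needs no external registry.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy self-describing record: the "_schema" entry rides along with the data.
public class SelfDescribing {
    static Map<String, Object> userRecord() {
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("name", "User");
        schema.put("fields", Arrays.asList("id:long", "email:string"));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("_schema", schema);          // schema embedded in the record
        record.put("id", 42L);
        record.put("email", "a@example.com");
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record = userRecord();
        @SuppressWarnings("unchecked")
        Map<String, Object> schema = (Map<String, Object>) record.get("_schema");
        System.out.println(schema.get("name") + ": " + record.get("id"));
    }
}
```

Schema evolution then becomes a data problem, not a deploy problem: old and new records coexist, each carrying enough information to be decoded.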
29. Objections to models
• Remove the RM from ORM
• Externalize schemas, don’t internalize them
• Prefer simple persistence options
• key/value, graph or document-oriented
30. What scales
• HTTP - it’s stateless... but:
• Caching layers need to be added
• Protocol can go faster (Google et al
proposing updates for 1.2)
• Er, that’s it from the current stack
31. App server scale FAIL
• Threads too coarse grained and expensive
• Need Actor model to be reliable and scale
out to exploit the hardware
• CAP based design patterns over data
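The actor model mentioned above can be sketched in plain Java (a toy, not a production actor library; class and method names are invented): one thread owns a mailbox and is the only mutator of the actor's state, so no locks are needed around that state, unlike the coarse shared-thread model being criticised.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy actor: messages go into a mailbox; a single worker thread drains it.
public class CounterActor {
    private final BlockingQueue<Long> mailbox = new LinkedBlockingQueue<>();
    private final AtomicLong total = new AtomicLong();  // safely readable from outside
    private final Thread worker;

    public CounterActor() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    long msg = mailbox.take();
                    if (msg < 0) return;        // poison pill: shut down
                    total.addAndGet(msg);       // only this thread mutates state
                }
            } catch (InterruptedException ignored) {
            }
        });
        worker.start();
    }

    public void send(long msg) { mailbox.add(msg); }

    public long shutdownAndGet() throws InterruptedException {
        mailbox.add(-1L);                       // poison pill after all real messages
        worker.join();
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CounterActor actor = new CounterActor();
        for (int i = 1; i <= 100; i++) actor.send(i);
        System.out.println(actor.shutdownAndGet());  // 5050
    }
}
```

Because actors only share messages, not memory, millions of them can be scheduled over a fixed pool of cores - far cheaper than one thread per unit of work.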
32. Software scaling limits
• Amdahl’s law still applies:
• Can only go as fast as the slowest serial task
• Worse if that task blocks others, which it
often does
• Software needed to support cloud design
and testing
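Amdahl's law can be made concrete with a few numbers. This small sketch assumes 95% of the work parallelises (an illustrative figure, not from the slides) and shows the remaining serial 5% capping speedup at 20x no matter how many cores are added.

```java
// Amdahl's law: with parallel fraction p on n cores,
//   speedup(n) = 1 / ((1 - p) + p / n)
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.95;  // assume 95% of the work parallelises
        for (int n : new int[]{1, 8, 64, 1024}) {
            System.out.printf("%4d cores -> %.1fx%n", n, speedup(p, n));
        }
        // Even with unlimited cores, the 5% serial part caps speedup at 1/0.05 = 20x.
    }
}
```

This is why a single blocking task hurts so much: it grows the serial fraction and drags the ceiling down for the whole cluster.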
33. RDBMS scale FAIL
• Index updates do not scale linearly with data
• Normalize to reduce data volumes but then joins
become too expensive
• Transactions are costly and often not needed
(especially for READ)
• Hard to manage xx,000 MySQL instances (ask
Yahoo! and Facebook)
• License fees scale with load ($1m+ / month for
Facebook just to serve photos)
34. NoSQL
• Different flavours (with examples)
• column oriented (Hadoop, HBase)
• document store (CouchDB, MongoDB)
• key value store (Riak, Redis)
• eventually consistent (Dynamo, Voldemort)
• graph database (Neo4j, InfiniteGraph)
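A sketch of the kind of minimal persistence API the key/value flavour exposes (a hypothetical in-memory class, not any particular product): just get/put/delete, no schema, no joins, no transactions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key/value store illustrating the simple API shape;
// real stores persist to disk and replicate, but the surface looks like this.
public class KeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { data.put(key, value); }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void delete(String key) { data.remove(key); }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("user:42", "Ada".getBytes());
        System.out.println(new String(store.get("user:42").orElseThrow()));  // Ada
        store.delete("user:42");
        System.out.println(store.get("user:42").isPresent());                // false
    }
}
```

The application owns serialization (here, raw bytes), which is exactly the "simpler application persistence API" trade-off the next slide lists as a gain.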
35. NoSQL gains
• Scale
• Performance
• Reliability and uptime
• Simpler application persistence API
• Some SQL syntax for aggregate operations
• Zero backup, if using HA file system
36. NoSQL loses
• SQL - especially joins
• Schema
• Transactions
• Consistency (for some coarse-grained
aspects at least)
• Query tools are immature / low-level
37. NoSQL
• For the diplomats: No(t Only) SQL
• SQL will live on in many applications and
Use Cases
Editor's Notes
#1 owned by USA, #2 owned by PRC
CAP theorem proposed in June 2000 by Eric Brewer
Robert Patrick
Java: fork/join, parallel arrays, tail-call recursion improvements, not closures