30. 8 fallacies of distributed computing
» The network is reliable.
» Latency is zero.
» Bandwidth is infinite.
» The network is secure.
» Topology doesn't change.
» There is one administrator.
» Transport cost is zero.
» The network is homogeneous.
Peter Deutsch and James Gosling
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 30
32. Trend 1: Data size
Each year more and more digital data is created. Over two years we create more
digital data than all the data created in history before that.
Exabytes (10^18 bytes) of data stored per year:
2006: 161
2007: 253
2008: 397
2009: 623
2010: 988
Data source: IDC 2007
33. Trend 2: Connectedness
Over time data has evolved to be more and more interlinked and connected.
Hypertext has links, blogs have pingback, tagging groups all related data.
[Chart: information connectivity over time, 1990 to 2020, spanning web 1.0, web 2.0 and “web 3.0”. In rising order of connectivity: text documents, hypertext, RSS, blogs, user-generated content, wikis, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG).]
34. Trend 3: Semi-structure
» Individualization of content
• In the salary lists of the 1970s, all elements had exactly one job
• In the salary lists of the 2000s, we need 5 job columns! Or 8? Or 15?
» All-encompassing “entire world views”
• Store more data about each entity
» Trend accelerated by the decentralization of content generation
that is the hallmark of the age of participation (“web 2.0”)
37. Trend 4: Architecture
2000s: (moving towards) decoupled services with their own backend
[Diagram: three applications, each with its own DB]
38. Trend 4: Architecture
2000s: (moving towards) decoupled services with their own backend
[Diagram: three applications, each with its own DB; the DBs are labelled as the data tier]
39. For years, we tried to squeeze data into a one-size-fits-all container.
59. Lily
» provides scalable storage
» and scalable search
» with a fault-tolerant, distributed architecture
» automated index maintenance
» versioning, rich data types, Java+REST API
» based on HBase (NoSQL) and SOLR (Lucene)
60. Choosing a NoSQL store for Lily: step I
» automatic scaling to large data sets
» fault-tolerance
» flexible datamodel with sparse data
» commodity hardware
» efficient random access
» community-based open source
» Java if possible
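To make "flexible datamodel with sparse data" concrete, here is a minimal sketch in plain Python. It is not the HBase or Lily API: `table`, `put` and `columns_of` are hypothetical names, and the `employee:` / `job:` keys are made up to echo the salary-list example. The point is only that each row stores just the cells it actually has, so one employee can have one job column and another can have three without a schema change.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a sparse table:
# row key -> {column name -> value}. Absent cells take no space.
Row = Dict[str, str]
table: Dict[str, Row] = {}


def put(row_key: str, column: str, value: str) -> None:
    """Store one cell; the row is created on first write."""
    table.setdefault(row_key, {})[column] = value


def columns_of(row_key: str) -> List[str]:
    """Return the columns actually present in a row, sorted by name."""
    return sorted(table.get(row_key, {}))


# One employee with a single job, another with three.
put("employee:1", "job:1", "engineer")
put("employee:2", "job:1", "editor")
put("employee:2", "job:2", "translator")
put("employee:2", "job:3", "teacher")
```

Calling `columns_of("employee:2")` lists three job columns while `columns_of("employee:1")` lists one; neither row pays for columns it does not use.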
61. Choosing a NoSQL store for Lily: step II
» need for consistency
» atomic single-row updates
» M/R for index regeneration
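The idea behind "M/R for index regeneration" can be sketched in a single process, without Hadoop. This is an illustration only and an assumption about the general shape of such a job, not Lily's actual indexer: `map_row` and `rebuild_index` are hypothetical names. The map step emits (term, row key) pairs for each stored row, and the reduce step groups the pairs by term to rebuild an inverted index from scratch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_row(row_key: str, text: str) -> Iterable[Tuple[str, str]]:
    """Map step: emit one (term, row_key) pair per distinct term in the row."""
    for term in set(text.lower().split()):
        yield term, row_key


def rebuild_index(rows: Dict[str, str]) -> Dict[str, List[str]]:
    """Run the map step over all rows, then reduce by grouping pairs per term."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row_key, text in rows.items():
        for term, key in map_row(row_key, text):
            grouped[term].append(key)
    # Reduce step: one sorted posting list per term.
    return {term: sorted(keys) for term, keys in grouped.items()}
```

Because the index is derived entirely from the stored rows, it can be thrown away and regenerated at any time, which is what makes a batch job a reasonable fit.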
62. Choosing a NoSQL store for Lily: step III
HBase
» datamodel with column families and cell
versioning
» ordered tables with range scans
» HDFS for blob storage
» Apache
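"Ordered tables with range scans" means rows are kept sorted by key, so all keys sharing a prefix sit next to each other and can be read in one pass. The sketch below shows the principle in plain Python with `bisect`; it does not use the HBase client API, and `OrderedTable` and the `user42/2010-01` style keys are hypothetical. The scan takes an inclusive start key and an exclusive stop key.

```python
import bisect
from typing import Dict, List, Tuple


class OrderedTable:
    """Hypothetical in-memory table whose row keys are kept in sorted order."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._values: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: str, stop: str) -> List[Tuple[str, str]]:
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

With keys shaped like `user42/2010-01`, a call such as `scan("user42/", "user420")` would return only that user's rows ("0" is the character right after "/"), which is why the layout of the key matters so much in this kind of store.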
63. Lily
» scales to infinity, and beyond
» open source
» Apache license (no strings attached)
» Java and REST API
» www.lilyproject.org
» subscription- and partnership-based
business model
64. Lily simplified architecture
[Diagram: Lily simplified architecture. Components and labels shown:]
» clients (including REST) talking to Lily store nodes
» Lily Store Server: store, WAL, MQ, secondary (2ary) indexes, M/R
» HBase Region Server: documents, indexes, WAL / MQ
» Hadoop DFS
» indexer, with "query" and "update" flows
» SOLR: inverted index, index replicas
» ZooKeeper: distributed process coordination and configuration
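The slides do not spell out how the store, the MQ and the indexer interact, so the following is only a guess at the general pattern behind "automated index maintenance": a write lands in the store and leaves a message on a queue, and an indexer later drains the queue and updates the search index. Everything here (`store`, `queue`, `index`, `update`, `run_indexer`, `query`) is a hypothetical in-memory stand-in, not Lily, HBase or SOLR code.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set

store: Dict[str, str] = {}                      # stands in for the document table
queue: Deque[str] = deque()                     # stands in for the message queue
index: Dict[str, Set[str]] = defaultdict(set)   # term -> record ids


def update(record_id: str, text: str) -> None:
    """Write the record, then queue an event for the indexer."""
    store[record_id] = text
    queue.append(record_id)


def run_indexer() -> int:
    """Drain the queue and reindex each changed record from the store."""
    processed = 0
    while queue:
        record_id = queue.popleft()
        # Drop stale postings for this record before adding the current ones.
        for ids in index.values():
            ids.discard(record_id)
        for term in store[record_id].lower().split():
            index[term].add(record_id)
        processed += 1
    return processed


def query(term: str) -> List[str]:
    """Look up a term in the index; results lag behind writes until indexing runs."""
    return sorted(index.get(term.lower(), set()))
```

In this sketch a record is not searchable until `run_indexer` has processed its queue entry, so the index is eventually consistent with the store rather than updated in the same step as the write.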
65. Key lessons learned
» unlearning normalization is very difficult
» integrity checking in code = not so bad
» doing joins in code can be very liberating
» importance of keyspace design
» secondary indexing
» data de-normalization = size! (x3)
» schema vs. code flexibility?
» distribution is everywhere
and you shouldn’t forget about it
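Three of these lessons (secondary indexing, joins in code, integrity checking in code) can be illustrated with a small plain-Python sketch. It is an assumption about what such application-side code might look like, not code from Lily; the function names and the `author` / `title` / `name` fields are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def build_secondary_index(rows: Dict[str, Row], field: str) -> Dict[str, List[str]]:
    """Map each value of `field` to the sorted row keys that carry it."""
    idx: Dict[str, List[str]] = {}
    for key, row in rows.items():
        if field in row:
            idx.setdefault(row[field], []).append(key)
    return {value: sorted(keys) for value, keys in idx.items()}


def join_documents_with_authors(
    documents: Dict[str, Row], authors: Dict[str, Row]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Join in code: pair each document title with its author's name.

    Documents pointing at a missing author are reported instead of joined,
    which is the integrity check a relational store would have enforced.
    """
    joined: List[Tuple[str, str]] = []
    dangling: List[str] = []
    for key in sorted(documents):
        doc = documents[key]
        author = authors.get(doc["author"])
        if author is None:
            dangling.append(key)
        else:
            joined.append((doc["title"], author["name"]))
    return joined, dangling
```

The secondary index is just another table derived from the primary rows, so it has to be kept up to date by the application (or rebuilt), and it is one of the places where denormalized data grows in size.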
66. Pssst. :-)
If you absolutely, positively want to see a
demo, go check http://outerthought.blip.tv/
67. Reading material
» Amazon Dynamo, Google BigTable, CAP
» http://nosql.mypopescu.com/
» http://nosql-database.org/
» http://twitter.com/nosqlupdate
» http://highscalability.com/
69. Thank you !
for your attention
for your questions
» stevenn@outerthought.org
» @stevenn