Big Data is one of the new buzzwords in the industry. Everyone is using NoSQL databases. MySQL is not cool anymore. But... do we really have big data? Where should we store it? Are traditional relational databases dead? Is NoSQL the solution to our problems? And most importantly, how can PHP and Symfony2 help with it?
2. HELLO WORLD!
• Ricard Clau, born and raised in Barcelona
• Server engineer at Another Place Productions
• Symfony2 lover and PHP believer (sometimes…)
• Open-source contributor, sometimes I give talks
• Twitter (@ricardclau) / Gmail ricard.clau@gmail.com
3. WE WILL TALK ABOUT…
• Where / How to store / query our “BIG” DATA
• SQL vs NoSQL: how did we end up here?
• Strengths and weaknesses of both approaches
• State of PHP / Symfony support for these technologies
• Some war stories and recommendations
4. QUICK DISCLAIMERS
• Not your average PHP talk; not sure you will be able to use this next week at work
• I am still learning about all these technologies
• 100M records is NOT BIG DATA
5. “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
Dan Ariely, Duke University
9. A BIT OF HISTORY
Maybe we have not learnt so much…
10. A (NOT SO) LONG TIME AGO
• Programmers processed files directly
• As more people did the same, the first databases appeared, each with its own API, strengths and weaknesses
• In the early 70s IBM came up with the SEQUEL (Structured English Query Language) idea, and the rest is history
12. WHY DOES NOSQL EXIST?
• RDBMSs are not great at scaling horizontally
• Google, Amazon, Facebook, etc… started building their own solutions to meet their unique needs
• When your data does not fit in one box, you have to trade consistency against availability
• Some problems need a different approach
15. SQL
• A “common” query language
• We can normalise data and query it
• Easy to do joins, filters, aggregations
• We don’t need to know in advance how we will access the data
• We rely on each database server’s query optimiser (and sometimes we need a DBA)
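A minimal sketch of the join, filter and aggregation points above, using Python's built-in `sqlite3` module. The `players` and `purchases` tables and their contents are invented for illustration; the point is only that the access pattern is decided at query time, not when the data is stored.

```python
import sqlite3

# Hypothetical schema and rows, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE purchases (
        id INTEGER PRIMARY KEY,
        player_id INTEGER REFERENCES players(id),
        amount INTEGER
    );
    INSERT INTO players VALUES (1, 'ana', 'ES'), (2, 'bob', 'UK'), (3, 'carla', 'ES');
    INSERT INTO purchases VALUES (1, 1, 10), (2, 1, 5), (3, 2, 7), (4, 3, 20);
""")

def spend_by_country(conn, min_amount=0):
    """Join, filter and aggregate in a single declarative query."""
    return conn.execute(
        """
        SELECT p.country, SUM(pu.amount) AS total
        FROM purchases AS pu
        JOIN players AS p ON p.id = pu.player_id
        WHERE pu.amount >= ?
        GROUP BY p.country
        ORDER BY total DESC
        """,
        (min_amount,),
    ).fetchall()

print(spend_by_country(conn))
print(spend_by_country(conn, min_amount=10))
```

The normalised tables were not designed around this report; the query optimiser works out how to execute it.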
16. ACID PROPERTIES
• Atomicity: transactions are all or nothing
• Consistency: a transaction is subject to a set of rules
• Isolation: transactions do not affect each other
• Durability: written data will not get lost
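To make atomicity and consistency concrete, here is a small sketch with Python's built-in `sqlite3` module. The `accounts` table, the names and the amounts are assumptions made up for the example; the `CHECK` constraint plays the role of the "set of rules", and the failed transfer shows the "all or nothing" behaviour.

```python
import sqlite3

# Hypothetical accounts table, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates are committed, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK rule was violated; the credit to dst has been rolled back too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

print(transfer(conn, "alice", "bob", 30))   # within the rules
print(transfer(conn, "alice", "bob", 500))  # would overdraw alice, so nothing is applied
print(balances(conn))
```

The credit is deliberately written before the debit, so the failing transfer has a partial change to undo.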
17. WE NEED ACID
• Banking, logistics, finance, e-commerce,…
• Systems we started building 30 years ago… and we still work on them, as they generate millions of $ daily!
• There are many applications that still fit the relational model and have structured data
18. USUAL PROBLEMS
• You can painfully achieve sharding, but you need to give up some ACID guarantees
• Tricky for unstructured data
• Not great for write-heavy workloads (low read / write ratio)
• Some data structures are awkward to model in tables
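A rough sketch of why sharding is painful, in plain Python. `ShardedStore` is a hypothetical class written for this example, with in-memory dicts standing in for separate database servers. Single-key operations touch one shard, but anything spanning keys has to visit every shard, and no single transaction covers them all.

```python
import hashlib

class ShardedStore:
    """Hypothetical hash-sharded store; each dict stands in for one server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash keeps a given key on the same shard across runs.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def total(self):
        # Scatter-gather: an aggregate has to query every shard and merge results.
        return sum(sum(shard.values()) for shard in self.shards)

store = ShardedStore(shard_count=4)
for name, score in [("ana", 10), ("bob", 7), ("carla", 20)]:
    store.put(name, score)

print(store.get("bob"), store.total())
```

Changing `shard_count` remaps most keys, which is one reason resharding a live system hurts.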
19. TRICKY SCENARIOS
• Geospatial queries for augmented reality
• Leaderboards for social activity, set operations
• Columnar aggregations on big tables
• Graph traversal to analyse your customers
• Search engines over big chunks of text
21. BASE PROPERTIES
• Basically Available: appears to work most of the time
• Soft state: the state of the system may change even without a query
• Eventual consistency: replicas converge over time once writes stop
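The BASE properties above can be sketched with two in-memory replicas. This is a toy model, not how any particular database does it: `Replica` and `anti_entropy` are hypothetical names, and conflicts are resolved with a simple "highest version wins" rule chosen for brevity.

```python
class Replica:
    """Hypothetical replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Highest version wins; older or equal versions are ignored.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Background sync: push every replica's entries to all the others."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica(), Replica()
a.write("cart", "book", version=1)   # the client only reached replica a
stale = b.read("cart")               # replica b has not seen the write yet
anti_entropy([a, b])                 # b changes state without any client query
fresh = b.read("cart")
print(stale, fresh)
```

Between the write and the sync, a client reading from `b` gets a stale answer, which is the price of staying available.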
22. CAP THEOREM
• A shared-data system cannot guarantee all three of these simultaneously:
• Consistency: all clients have the same view of the data
• Availability: each client can always read and write
• Partition tolerance: the system works well even when there are network partitions
23. “During a network partition, a distributed system must choose between either Consistency or Availability”
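A toy illustration of that choice, in plain Python. `Node`, `make_pair` and `write_during_partition` are hypothetical names for this sketch, and a two-node cluster with synchronous replication is a deliberate oversimplification. A "CP" node refuses writes it cannot replicate; an "AP" node accepts them and lets the replicas diverge.

```python
class Node:
    """Hypothetical node in a two-node cluster; mode is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.value = None
        self.peer = None
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by giving up availability.
                raise RuntimeError("unavailable: cannot reach peer")
            # Stay available by accepting a write the peer will not see.
            self.value = value
            return
        self.value = value
        self.peer.value = value  # healthy network: replicate synchronously

def make_pair(mode):
    a, b = Node(mode), Node(mode)
    a.peer, b.peer = b, a
    return a, b

def write_during_partition(mode):
    """Return (write_accepted, value_on_a, value_on_b) after a partitioned write."""
    a, b = make_pair(mode)
    a.write("v1")
    a.partitioned = b.partitioned = True
    try:
        a.write("v2")
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, a.value, b.value

print(write_during_partition("CP"))
print(write_during_partition("AP"))
```

In the AP case the two nodes need some reconciliation step once the partition heals; that is where eventual consistency comes in.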
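24. CONSISTENCY, AVAILABILITY, PARTITION TOLERANCE
• Consistency + Availability: single node, mostly RDBMS (MySQL, PostgreSQL, DB2, SQLite…)
• Availability + Partition Tolerance: all nodes have the same role (Cassandra, Riak, DynamoDB…)
• Consistency + Partition Tolerance: special nodes (Zookeeper, HBase, MongoDB, Redis…)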
31. PHP: BEST WEB PLATFORM?
• PHP is still heavily used, despite its many quirks
• Mature, actively maintained libraries for everything
• Composer makes things much easier these days
• Symfony bundles for almost everything
• Some databases consider PHP a second-class citizen
33. KEY-VALUE STORES
• Simple APIs, easy to install and use. You are
already using them for caching, sessions, etc…
• PHP Extensions: memcached, phpredis
• Libraries: nrk/predis, basho/riak, aws/aws-sdk-php
• Bundles: snc/redis-bundle, leaseweb/memcache-bundle,
kbrw/riak-bundle
34. GRAPH DATABASES
• Very verbose queries, access via REST APIs
• Maybe not mature enough to be the source of truth
• Libraries: everyman/neo4jphp
• Bundles: klaussilveira/neo4j-ogm-bundle
• IMHO, one of the next big things
35. CYPHER QUERY EXAMPLES
• Top 5 sushi restaurants in New York for Philip’s friends
• 2nd degree co-actors who have never acted with Tom Hanks
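The second question is a two-hop traversal. As a rough sketch of what such a query computes, here it is in plain Python over an in-memory graph. The cast lists are invented, and reading "2nd degree" as "co-actors of Tom Hanks's co-actors" is an assumption about the intended query.

```python
from collections import defaultdict

# Hypothetical cast lists, invented for this sketch.
CASTS = {
    "Movie A": {"Tom Hanks", "Alice"},
    "Movie B": {"Alice", "Bob"},
    "Movie C": {"Bob", "Carol"},
    "Movie D": {"Alice", "Dave"},
    "Movie E": {"Dave", "Tom Hanks"},
}

def co_actor_graph(casts):
    """Build an undirected graph: actor -> set of actors they shared a film with."""
    graph = defaultdict(set)
    for cast in casts.values():
        for actor in cast:
            graph[actor] |= cast - {actor}
    return graph

def second_degree(graph, start):
    """Co-actors of co-actors who never acted with `start` directly."""
    first = graph[start]
    second = set()
    for co_actor in first:
        second |= graph[co_actor]
    return second - first - {start}

graph = co_actor_graph(CASTS)
print(second_degree(graph, "Tom Hanks"))
```

In SQL the same question needs a self-join per hop, which is why graph databases treat traversal as a first-class operation.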
36. COLUMN-BASED STORES
• Possibly the most suitable for Big Data
• Redshift supports SQL on a petabyte-scale database
• Libraries: thobbs/phpcassa, pop/pop_hbase,
PDO for Redshift (with some quirks)
• IMHO, Cassandra will become THE database
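Why column orientation suits large aggregations can be hinted at with a tiny in-memory sketch. The records and function names are hypothetical, and real column stores add compression and on-disk layout that this ignores; the only point is that an aggregate over one field reads one column instead of every full record.

```python
# Hypothetical purchase records, invented for this sketch.
ROWS = [
    {"user": "ana", "country": "ES", "amount": 10},
    {"user": "bob", "country": "UK", "amount": 7},
    {"user": "carla", "country": "ES", "amount": 20},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_row_store(rows, field):
    # Every full record is visited, even though only one field is needed.
    return sum(row[field] for row in rows)

def total_column_store(columns, field):
    # Only the requested column is read.
    return sum(columns[field])

COLUMNS = to_columns(ROWS)
print(total_row_store(ROWS, "amount"), total_column_store(COLUMNS, "amount"))
```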
37. DOCUMENT DATABASES
• MongoDB and Couchbase look very shiny… but the Internet is FULL of scaling horror stories
• PHP Extensions: mongodb, couchbase
• Libraries: doctrine/mongodb
• Bundles: doctrine/mongodb-odm-bundle
40. QUERY VS PROCESSING
• SQL is great because we can query by any field
• There is no standard in NoSQL databases
• NoSQL systems are more limited: lookups by key only (some allow secondary indexes) or complex graph syntax
• We sometimes need processing for complex queries
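A small sketch of the "keys only, maybe secondary indexes" limitation. `KeyValueStore` is a hypothetical in-memory class, not any real product's API. Lookups by key or by the one indexed field are direct; querying any other field means scanning (processing) every record.

```python
from collections import defaultdict

class KeyValueStore:
    """Hypothetical key-value store with a single secondary index."""

    def __init__(self, indexed_field):
        self.items = {}
        self.indexed_field = indexed_field
        self.index = defaultdict(set)  # field value -> set of keys

    def put(self, key, doc):
        old = self.items.get(key)
        if old is not None:
            # Keep the index in sync when a document is overwritten.
            self.index[old[self.indexed_field]].discard(key)
        self.items[key] = doc
        self.index[doc[self.indexed_field]].add(key)

    def get(self, key):
        return self.items.get(key)

    def find_by_index(self, value):
        # Direct lookup, but only for the field chosen in advance.
        return sorted(self.index.get(value, ()))

    def scan(self, field, value):
        # Any other field requires visiting every document.
        return sorted(k for k, d in self.items.items() if d.get(field) == value)

store = KeyValueStore(indexed_field="country")
store.put("u1", {"name": "ana", "country": "ES", "level": 3})
store.put("u2", {"name": "bob", "country": "UK", "level": 3})
store.put("u3", {"name": "carla", "country": "ES", "level": 5})

print(store.find_by_index("ES"), store.scan("level", 3))
```

Unlike the SQL case, the access pattern (here, by country) had to be decided when the store was set up.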
42. HADOOP VS SPARK
• MapReduce: techniques to extract subsets of the data (MAP) and operate on them in parallel before aggregating (REDUCE)
• Not real time; Hadoop is the most popular implementation
• Apache Spark opens a new paradigm for near real-time processing
• You need languages other than PHP for these techniques
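The MAP and REDUCE steps can be sketched on a single machine with a word count, the usual introductory example. This only imitates the shape of the technique: the function names are made up, a thread pool stands in for a cluster, and none of Hadoop's or Spark's actual APIs are used.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """MAP: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """REDUCE: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Chunks are mapped independently, so this step can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))

print(word_count(["big data big", "Data is big"]))
```

Because each chunk is mapped without knowledge of the others, the work spreads across machines; the shuffle between the two phases is where the network cost sits.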
44. ENGINEERING CHALLENGES
• The Internet of Things will generate real BIG DATA
• SQL / ACID technologies are not going anywhere
• Be very careful when using NoSQL in production
• Databases… and life… are full of tradeoffs
• The next decade will be fascinating for the industry