2. What is “big data”?
● Big data is a collection of data sets so large and
complex that it becomes difficult to process
using traditional data processing applications.
● The challenges include capture, curation,
storage, search, sharing, transfer, analysis,
and visualization.
http://en.wikipedia.org/wiki/Big_data
3. Big Data Challenges
● The challenges include:
– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex
4. What is “big data” exactly?
● What is considered "big data" varies depending on
the capabilities of the organization managing the
set, and on the capabilities of the applications that
are traditionally used to process and analyze the
data set in its domain.
● As of 2012, limits on the size of data sets that are
feasible to process in a reasonable amount of
time were on the order of exabytes of data.
http://en.wikipedia.org/wiki/Big_data
5. Big Data Qualifiers
● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [something]bytes of data
6. My first “big data” challenge
● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300GB for 60 days of data
– Search: searches of thousands of terms
8. Traditionally
● Data was placed in MySQL
● Used MySQL full-text search
● Easy to insert
● Easy to search
● Worked great!
– Until it got real world load
9. Feasibly in hardware
(circa 2008)
● 300GB data and 16GB ram
● ...MySQL stores an in-memory binary tree of the keys.
Using this tree, MySQL can calculate the count of matching
rows with reasonable speed. But speed declines
logarithmically as the number of terms increases.
● The platters revolve at 15,000 RPM or so, which works out
to 250 revolutions per second. Average latency is listed as
2.0ms
● As the speed of an HDD increases the power it takes to run
it increases disproportionately
http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/
http://dev.mysql.com/doc/internals/en/full-text-search.html
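The rotational-latency figure quoted above can be sanity-checked with quick arithmetic:

```python
# Rough check of the disk numbers on this slide: 15,000 RPM works out to
# 250 revolutions per second; average rotational latency is half a turn.
rpm = 15_000
revolutions_per_second = rpm / 60                    # 250 rev/s
ms_per_revolution = 1000 / revolutions_per_second    # 4 ms per full turn
avg_rotational_latency_ms = ms_per_revolution / 2    # average: half a turn
print(revolutions_per_second, avg_rotational_latency_ms)  # 250.0 2.0
```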
10. “Big Data” is about giving up things
● In theoretical computer science, the CAP theorem states
that it is impossible for a distributed computer system to
simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response
about whether it was successful or failed)
– Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
http://en.wikipedia.org/wiki/CAP_theorem
http://www.youtube.com/watch?v=I4yGBfcODmU
11. Multi-Master solution
● Write the data to N MySQL servers and round-robin reads between them
– Good: More machines to serve reads
– Bad: Requires Nx hardware
– Hard: Keeping machines loaded with the same data, especially auto-generated IDs
– Hard: What about when the data does not even fit on a single machine?
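The round-robin read side of this scheme can be sketched as follows; the replica names and the `read` function are hypothetical stand-ins for real MySQL connections:

```python
# A minimal sketch of round-robin reads across N identical replicas.
import itertools

replicas = ["db1", "db2", "db3"]          # N multi-master copies (hypothetical)
next_replica = itertools.cycle(replicas)  # round-robin iterator

def read(query):
    server = next(next_replica)
    # a real version would run `query` against `server`; we just report it
    return server

servers = [read("SELECT 1") for _ in range(6)]
print(servers)  # ['db1', 'db2', 'db3', 'db1', 'db2', 'db3']
```

Each replica serves every Nth read, which is why this buys read capacity but still requires every write to land on all N machines.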
13. Sharding
● Rather than replicating all data to all machines
● Replicate data to selected machines
– Good: localized data
– Good: better caching
– Hard: Joins across shards
– Hard: Management
– Hard: Failure
● Parallel RDBMS = $$$
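Key-based routing of the kind sharding relies on can be sketched as a small function; the shard names and the choice of CRC32 here are illustrative assumptions, not a specific product's scheme:

```python
# A minimal sketch of key-based sharding with a stable hash and a fixed
# shard count. Shard names are hypothetical.
import zlib

SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key: str) -> str:
    # crc32 is stable across processes (unlike Python's built-in str hash),
    # so the same key always routes to the same shard
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

assert shard_for("user:42") == shard_for("user:42")
```

The "Good: localized data" point falls out of this determinism; the "Hard" points do too, since a join across keys that hash to different shards has to leave the machine.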
14. Life lesson
“applications that are traditionally used to”
● How did we solve our problem?
– We switched to Lucene
● A tool designed for full text search
● Eventually sharded lucene
● When you hold a hammer:
– Not everything is a nail
● Understand what you really need
● Understand what is reasonable and feasible
15. Big data Challenge 2
● Large, high-volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GB of data a day for years
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: daily reports in less than one day
– Honestly, it needs to be faster (reruns, etc.)
16. Enter hadoop
● Hadoop (0.17.X) was fairly new at the time
● Use cases of map reduce were emerging
– Hive had just been open sourced by Facebook
● Many database vendors were calling
map/reduce “a step backwards”
– They had solved these problems “in the 80s”
17. Hadoop file system HDFS
● Distributed redundant storage
– We were a NoSPOF (no single point of failure) shop across the board
● Commodity hardware vs buying a big
SAN/NAS device
● We already had processes that scp'ed data to servers; they were easily adapted to place data into HDFS
● HDFS made huge storage easy
18. Map Reduce
● As a proof of concept I wrote a group/count application that would group and count on a column in our logs
● Was able to show linear speed up with
increased nodes
19. Winning (why hadoop kicked arse)
● Data capture, curation
– bulk loading data into RDBMS (indexes, overhead)
– bulk loading into hadoop is network copy
● Data analysis
– RDBMS would not parallelize queries (even across partitions)
– Some queries could cause severe locking and performance degradation
http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
20. Enter hive
● Capture - NO
● Curation - YES
● Storage - YES
● Search - YES
● Sharing - YES
● Transfer - NO
● Analysis - YES
● Visualization - NO
22. Sample program group and count
Source data looks like
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm
23. In case you're the math type
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
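The group/count proof of concept can be sketched in plain Python against the sample rows from slide 22 — a minimal single-process stand-in for the Hadoop pipeline, not the real distributed job:

```python
# A minimal sketch of the map -> reduce pipeline above, grouping and
# counting on the path column (the last ':'-separated field) of the logs.
from collections import defaultdict

lines = [
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/igloo.htm",
    "jan 10 2009:.........:200:/ed.htm",
]

def map_phase(line):                 # (k1, v1) -> (k2, v2)
    path = line.rsplit(":", 1)[1]    # group on the last column
    return path, 1

def reduce_phase(pairs):             # (k2, list(v2)) -> (k3, v3)
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map(map_phase, lines)))
# {'/index.htm': 2, '/igloo.htm': 1, '/ed.htm': 1}
```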
27. Life lessons volume 2
● Feasible and reasonable were completely different than in case #1
● Query time: from seconds to hours
● Size: from GB to TB
● Feasible: from 4 nodes to 15
28. Big Data Challenge #3
(work at m6d)
● Large, high-volume ad-serving site
● Process its logs and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: ad hoc, hourly, daily, weekly, monthly reports
29. Data data everywhere
● We have to use cookies in many places
● Cookies have limited size
● Cookies carry complex encoded values
30. Some encoding tricks we might do
LastSeen: long (64 bits)
Segment: int (32 bits)
Literal ','
Segment: int (32 bits)
Zipcode: int (32 bits)
● Choose a relevant epoch and use a byte
● Use a byte for # of segments
● Use a 4-byte radix-encoded number
● ... and so on
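These tricks can be sketched with Python's `struct` module; the exact layout below (a 1-byte offset since a chosen epoch, a 1-byte segment count, then 4-byte segment ids) is invented for illustration, not the real cookie format:

```python
# A hypothetical sketch of cookie bit-packing: small fixed-width fields
# instead of a long text encoding, to stay inside cookie size limits.
import struct

def encode(days_since_epoch: int, segments: list) -> bytes:
    # big-endian: 1-byte day offset, 1-byte count, then N 4-byte segment ids
    return struct.pack(f">BB{len(segments)}i",
                       days_since_epoch, len(segments), *segments)

def decode(blob: bytes):
    days, count = struct.unpack_from(">BB", blob)
    segments = list(struct.unpack_from(f">{count}i", blob, 2))
    return days, segments

blob = encode(120, [215286, 195785])
assert decode(blob) == (120, [215286, 195785])
print(len(blob))  # 10 bytes for a timestamp plus two segments
```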
31. Getting at embedded data
● Write N UDFs for each object, like:
– getLastSeenForCookie(String)
– getZipcodeForCookie(String)
– ...
● But this would have made a huge toolkit
● Traditionally you do not want to break first
normal form
32. Struct solution
● Hive has a struct type, like a C struct
● A struct is a list of name/value pairs
● Structs can contain other structs
● This gives us the ability to do serious object mapping
● UDFs can return struct types
33. Using a UDF
add jar myjar.jar;
create temporary function parseCookie as 'com.md6.ParseCookieIntoStruct';
select parseCookie(encodedColumn).lastSeen from mydata;
34. LATERAL VIEW + EXPLODE
SELECT client_id, entry.spendcreativeid
FROM datatable
LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry
WHERE hit_date=20110321 AND mid=001406;
3214498023360851706 215286
3214498023360851706 195785
3214498023360851706 128640
36. Life lessons volume #3
● Big data is not only batch or real-time
● Big data is feedback loops
– Machine learning
– Ad hoc performance checks
● Generated SQL tables periodically synced to
web server
● Data shared between sections of an
organization to make business decisions