Big Data in the “Real World”
Edward Capriolo
What is “big data”?
● Big data is a collection of data sets so large and
complex that it becomes difficult to process
using traditional data processing applications.
● The challenges include capture, curation,
storage, search, sharing, transfer, analysis,
and visualization.
http://en.wikipedia.org/wiki/Big_data
Big Data Challenges
● The challenges include:
– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex
What is “big data” exactly?
● What is considered "big data" varies depending on
the capabilities of the organization managing the
set, and on the capabilities of the applications that
are traditionally used to process and analyze the
data set in its domain.
● As of 2012, limits on the size of data sets that are
feasible to process in a reasonable amount of
time were on the order of exabytes of data.
http://en.wikipedia.org/wiki/Big_data
Big Data Qualifiers
● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [something]bytes of data
My first “big data” challenge
● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300GB for 60 days data
– Search: searches of thousands of terms
Traditionally
● Data was placed in MySQL
● MySQL full text search
● Easy to insert
● Easy to search
● Worked great!
– Until it got real world load
Feasibly in hardware
(circa 2008)
● 300GB data and 16GB ram
● ...MySQL stores an in-memory binary tree of the keys.
Using this tree, MySQL can calculate the count of matching
rows with reasonable speed. But speed declines
logarithmically as the number of terms increases.
● The platters revolve at 15,000 RPM or so, which works out
to 250 revolutions per second. Average latency is listed as
2.0ms
● As the speed of an HDD increases the power it takes to run
it increases disproportionately
http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/
http://dev.mysql.com/doc/internals/en/full-text-search.html
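The 250 revolutions per second and 2.0ms figures above follow from the spindle speed. A small sketch of the arithmetic (function names are made up for illustration; seek time is not modeled):

```python
def revolutions_per_second(rpm):
    """Spindle speed converted from revolutions per minute."""
    return rpm / 60


def average_rotational_latency_ms(rpm):
    """On average the platter turns half a revolution before the
    requested sector passes under the head."""
    full_rotation_ms = 1000 / revolutions_per_second(rpm)
    return 0.5 * full_rotation_ms


print(revolutions_per_second(15000))         # 250.0
print(average_rotational_latency_ms(15000))  # 2.0
```

With 300GB of data and 16GB of RAM most lookups miss memory, so each one pays this latency plus seek time.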
“Big Data” is about giving up things
● In theoretical computer science, the CAP theorem states
that it is impossible for a distributed computer system to
simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response
about whether it was successful or failed)
– Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
http://en.wikipedia.org/wiki/CAP_theorem
http://www.youtube.com/watch?v=I4yGBfcODmU
Multi-Master solution
● Write the data to N MySQL servers and
round-robin reads between them
– Good: More machines to serve reads
– Bad: Requires Nx hardware
– Hard: Keeping machines loaded with same data
especially auto-generated-ids
– Hard: What about when the data does not even fit
on a single machine?
Sharding
● Rather than replicate all data to all machines
● Replicate data to selected machines
– Good: localized data
– Good: better caching
– Hard: Joins across shards
– Hard: Management
– Hard: Failure
● Parallel RDBMS = $$$
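A minimal sketch of hash-based shard routing, assuming a fixed number of shards and string keys. The function names and the choice of MD5 are illustrative assumptions, not what any system in this talk used:

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key, so every client
    routes the same key to the same machine."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(keys, num_shards):
    """Group keys by the shard that owns them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


placement = route(["/index.htm", "/igloo.htm", "/ed.htm"], 4)
for shard_id, keys in placement.items():
    print(shard_id, keys)
```

A lookup by key touches one shard (localized data, better caching). A query that is not keyed this way has to visit every shard and merge results, which is where the "Hard" items come from.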
Life lesson
“applications that are traditionally used to”
● How did we solve our problem?
– We switched to Lucene
● A tool designed for full text search
● Eventually sharded Lucene
● When you hold a hammer:
– Not everything is a nail
● Understand what you really need
● Understand reasonable and feasible
Big data Challenge 2
● Large high volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GB of data a day for years
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable to want daily reports in less than one day
– Honestly needs to be faster / reruns etc
Enter Hadoop
● Hadoop (0.17.X) was fairly new at the time
● Use cases of map reduce were emerging
– Hive had just been open sourced by Facebook
● Many database vendors were calling
map/reduce “a step backwards”
– They had solved these problems “in the 80s”
Hadoop file system HDFS
● Distributed redundant storage
– We were NoSPOF (no single point of failure) across the board
● Commodity hardware vs buying a big
SAN/NAS device
● We already had processes that scp'ed data to
servers; these were easily adapted to place it
into HDFS
● HDFS: easy and huge
Map Reduce
● As a proof of concept I wrote a group/count
application that would group/count on column
in our logs
● Was able to show linear speed up with
increased nodes
Winning (why hadoop kicked arse)
● Data capture, curation
– bulk loading data into RDBMS (indexes, overhead)
– bulk loading into hadoop is network copy
● Data analysis
– RDBMS would not parallelize queries (even across
partitions)
– Some queries could cause locking and
performance degradation
http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
Enter Hive
● Capture: NO
● Curation: YES
● Storage: YES
● Search: YES
● Sharing: YES
● Transfer: NO
● Analysis: YES
● Visualization: NO
Logging from Apache to Hive
Sample program group and count
Source data looks like
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm
In case you're the math type
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> ->
reduce -> <k3, v3> (output)
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
A mapper
A reducer
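A rough sketch of a group-and-count mapper and reducer over the sample log lines, written in plain Python rather than against the Hadoop Java API. The `run` function is a stand-in for the framework's shuffle step, and all names here are illustrative assumptions:

```python
from collections import defaultdict


def mapper(line):
    """Map(k1, v1) -> list(k2, v2): emit (url, 1) for one log line.
    Assumes the URL is the last ':'-separated field."""
    url = line.rsplit(":", 1)[-1]
    yield (url, 1)


def reducer(url, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the ones for a URL."""
    yield (url, sum(counts))


def run(lines):
    """Stand-in for the framework: group mapper output by key,
    then hand each group to the reducer."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for url, total in reducer(key, groups[key]):
            result[url] = total
    return result


logs = [
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/igloo.htm",
    "jan 10 2009:.........:200:/ed.htm",
]
print(run(logs))
```

On a cluster the grouping happens across machines, which is why adding nodes can speed this kind of job up.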
Hive style
hive> CREATE TABLE web_data (
  sdate STRING, stime STRING, envvar STRING,
  remotelogname STRING, servername STRING, localip STRING,
  literaldash STRING, method STRING, url STRING,
  querystring STRING, status STRING, litteralzero STRING,
  bytessent INT, header STRING, timetoserver INT,
  useragent STRING, cookie STRING, referer STRING);
SELECT url, count(1) FROM web_data GROUP BY url;
Life lessons volume 2
● Feasible and reasonable were completely
different than in case #1
● Query from seconds -> hours
● Size from GB to TB
● Feasible from 4 nodes to 15
Big Data Challenge #3
(work at m6d)
● Large high volume ad serving site
● Process its logs and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: ad hoc, hourly, daily, weekly, monthly reports
Data data everywhere
● We have to use cookies in many places
● Cookies have limited size
● Cookies have complex values encoded
Some encoding tricks we might do
LastSeen: long (64 bits)
Segment: int (32 bits)
Literal ','
Segment: int (32 bits)
Zipcode (32 bits)
● Choose a relevant epoch and use a byte
● Use a byte for # of segments
● Use a 4 byte radix encoded number
● ... and so on
Getting at embedded data
● Write N UDFs, one for each field, like:
– getLastSeenForCookie(String)
– getZipcodeForCookie(String)
– ...
● But this would have made a huge toolkit
● Traditionally you do not want to break first
normal form
Struct solution
● Hive has a struct type, like a C struct
● A struct is a list of name-value pairs
● Structs can contain other structs
● This gives us the serious ability to do object
mapping
● UDFs can return struct types
Using a UDF
● ADD JAR myjar.jar;
● CREATE TEMPORARY FUNCTION parseCookie AS 'com.md6.ParseCookieIntoStruct';
● SELECT parseCookie(encodedColumn).lastSeen FROM mydata;
LATERAL VIEW + EXPLODE
SELECT client_id, entry.spendcreativeid
FROM datatable
LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry
WHERE hit_date=20110321 AND mid=001406;
3214498023360851706 215286
3214498023360851706 195785
3214498023360851706 128640
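What LATERAL VIEW + explode does to a row can be sketched in plain Python: one input row holding a list becomes one output row per list element. The row shape below (a dict with an `ad_entries` list) is a hypothetical arrangement of the output values above, not the real table layout:

```python
def lateral_view_explode(rows, list_of):
    """Yield (row, entry) once for every entry in the row's list,
    so the other columns of the row repeat alongside each entry."""
    for row in rows:
        for entry in list_of(row):
            yield row, entry


rows = [
    {
        "client_id": 3214498023360851706,
        "ad_entries": [
            {"spendcreativeid": 215286},
            {"spendcreativeid": 195785},
            {"spendcreativeid": 128640},
        ],
    }
]

out = [
    (row["client_id"], entry["spendcreativeid"])
    for row, entry in lateral_view_explode(rows, lambda r: r["ad_entries"])
]
for client_id, creative_id in out:
    print(client_id, creative_id)
```

A row with an empty list produces no output rows at all in this sketch.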
All that data might boil down to...
Life lessons volume #3
● Big data is not only batch or real-time
● Big data is feedback loops
– Machine learning
– Ad hoc performance checks
● Generated SQL tables periodically synced to
web server
● Data shared between sections of an
organization to make business decisions

The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Big data nyu

  • 1. Big Data in the “Real World” Edward Capriolo
  • 2. What is “big data”? ● Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. ● The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. http://en.wikipedia.org/wiki/Big_data
  • 3. Big Data Challenges ● The challenges include: – capture – curation – storage – search – sharing – transfer – analysis – visualization – large – complex
  • 4. What is “big data” exactly? ● What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. ● As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. http://en.wikipedia.org/wiki/Big_data
  • 5. Big Data Qualifiers ● varies ● capabilities ● traditionally ● feasibly ● reasonably ● [something]bytes of data
  • 6. My first “big data” challenge ● Real time news delivery platform ● Ingest news as text and provide full text search ● Qualifiers – Reasonable: Real time search was < 1 second – Capabilities: small company, <100 servers ● Big Data challenges – Storage: roughly 300GB for 60 days of data – Search: searches of thousands of terms
  • 7.
  • 8. Traditionally ● Data was placed in MySQL ● MySQL full text search ● Easy to insert ● Easy to search ● Worked great! – Until it got real world load
  • 9. Feasibly in hardware (circa 2008) ● 300GB data and 16GB ram ● ...MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases. ● The platters revolve at 15,000 RPM or so, which works out to 250 revolutions per second. Average latency is listed as 2.0ms ● As the speed of an HDD increases the power it takes to run it increases disproportionately http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/ http://dev.mysql.com/doc/internals/en/full-text-search.html
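The disk figures on this slide can be checked with a little arithmetic. The sketch below derives the 2.0 ms average rotational latency from 15,000 RPM. The seek time passed to `max_random_reads_per_second` is a hypothetical value chosen for illustration, not a number from the talk.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    revolutions_per_second = rpm / 60.0            # 15,000 RPM -> 250 rev/s
    ms_per_revolution = 1000.0 / revolutions_per_second
    return ms_per_revolution / 2.0                 # on average, wait half a turn


def max_random_reads_per_second(rpm: float, seek_ms: float) -> float:
    """Rough upper bound on random reads per second for one spindle.

    seek_ms is an assumed average seek time (hypothetical input).
    """
    return 1000.0 / (seek_ms + rotational_latency_ms(rpm))


if __name__ == "__main__":
    print(rotational_latency_ms(15_000))               # 2.0
    print(max_random_reads_per_second(15_000, 3.0))    # 200.0
```

With 300GB of data and 16GB of RAM most lookups go to disk, so a bound of a few hundred random reads per second per drive is what limits a search that touches thousands of terms.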
  • 10. “Big Data” is about giving up things ● In theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: – Consistency (all nodes see the same data at the same time) – Availability (a guarantee that every request receives a response about whether it was successful or failed) – Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system) http://en.wikipedia.org/wiki/CAP_theorem http://www.youtube.com/watch?v=I4yGBfcODmU
  • 11. Multi-Master solution ● Write the data to N MySQL servers and round robin reads between them – Good: More machines to serve reads – Bad: Requires Nx hardware – Hard: Keeping machines loaded with the same data, especially auto-generated IDs – Hard: What about when the data does not even fit on a single machine?
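A minimal sketch of the multi-master idea, assuming plain in-memory dicts as stand-ins for the N MySQL servers. The class and method names are hypothetical. It shows why reads scale (each server takes 1/N of them) and why hardware cost is Nx (every server holds every row).

```python
import itertools


class MultiMasterStore:
    """Toy model: every write goes to all servers, reads rotate between them."""

    def __init__(self, n_servers: int):
        # Each dict stands in for one database server (an assumption).
        self.servers = [dict() for _ in range(n_servers)]
        self._next_reader = itertools.cycle(range(n_servers))
        self.reads_served = [0] * n_servers

    def write(self, key, value):
        # Nx write cost: the same row is stored on every server.
        for server in self.servers:
            server[key] = value

    def read(self, key):
        # Round robin: each server answers every Nth read.
        index = next(self._next_reader)
        self.reads_served[index] += 1
        return self.servers[index].get(key)


if __name__ == "__main__":
    store = MultiMasterStore(3)
    store.write("story-1", "headline text")
    for _ in range(6):
        store.read("story-1")
    print(store.reads_served)  # [2, 2, 2]
```

The sketch hides the hard parts the slide names: a real deployment has to keep the copies identical when a write fails on one server, and it does nothing for data that is too large for a single machine.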
  • 12.
  • 13. Sharding ● Rather than replicate all data to all machines ● Replicate data to selective machines – Good: localized data – Good: better caching – Hard: Joins across shards – Hard: Management – Hard: Failure ● Parallel RDBMS = $$$
  • 14. Life lesson “applications that are traditionally used to” ● How did we solve our problem? – We switched to Lucene ● A tool designed for full text search ● Eventually sharded Lucene ● When you hold a hammer: – Not everything is a nail ● Understand what you really need ● Understand reasonable and feasible
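One common way to decide which machine holds a row is to hash its key. The sketch below assumes that scheme. The talk does not say how its shards were assigned, and all names here are hypothetical. Single-key lookups touch one shard, while anything that spans keys has to fan out to all of them.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (md5 used only for spreading)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy model: each dict stands in for one database server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Localized data: only one shard is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def count_all(self) -> int:
        # Cross-shard work: every shard must be asked, then results merged.
        return sum(len(shard) for shard in self.shards)


if __name__ == "__main__":
    store = ShardedStore(4)
    for i in range(100):
        store.put(f"doc-{i}", i)
    print(store.get("doc-42"), store.count_all())
```

Note that changing `num_shards` moves most keys to a different shard under this scheme, which is part of what makes management and failure handling hard.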
  • 15. Big data Challenge 2 ● Large high volume web site ● Process them and produce reports ● Big Data challenges – Storage: Store GB of data a day for years – Analysis, visualization: support reports of existing system ● Qualifiers – Reasonable to want daily reports in less than one day – Honestly needs to be faster / reruns etc
  • 16. Enter Hadoop ● Hadoop (0.17.X) was fairly new at the time ● Use cases of map reduce were emerging – Hive had just been open sourced by Facebook ● Many database vendors were calling map/reduce “a step backwards” – They had solved these problems “in the 80s”
  • 17. Hadoop file system HDFS ● Distributed redundant storage – We were a NoSPOF across the board ● Commodity hardware vs buying a big SAN/NAS device ● We already had processes that scp'ed data to servers, easily adapted to placing them into HDFS ● HDFS easy huge
  • 18. Map Reduce ● As a proof of concept I wrote a group/count application that would group/count on a column in our logs ● Was able to show linear speed up with increased nodes
  • 19. Winning (why Hadoop kicked arse) ● Data capture, curation – bulk loading data into RDBMS (indexes, overhead) – bulk loading into Hadoop is network copy ● Data analysis – RDBMS would not parallelize queries (even across partitions) – Some queries could cause heavy locking and performance degradation http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
  • 20. Enter Hive ● Capture: NO ● Curation: YES ● Storage: YES ● Search: YES ● Sharing: YES ● Transfer: NO ● Analysis: YES ● Visualization: NO
  • 22. Sample program group and count Source data looks like jan 10 2009:.........:200:/index.htm jan 10 2009:.........:200:/index.htm jan 10 2009:.........:200:/igloo.htm jan 10 2009:.........:200:/ed.htm
  • 23. In case you're the math type (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3)
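The map, combine and reduce signatures above can be sketched in plain Python, with no Hadoop involved, using the four sample log lines from the group-and-count slide. The function names and the two-way split are assumptions for illustration. The URL is taken to be the last colon-separated field, as in the sample data.

```python
from collections import defaultdict

# Sample lines as shown on the slide (the dotted middle field is elided there).
LOG_LINES = [
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/igloo.htm",
    "jan 10 2009:.........:200:/ed.htm",
]


def map_fn(offset, line):
    """Map(k1, v1) -> list(k2, v2): emit (url, 1) for each log line."""
    url = line.rsplit(":", 1)[-1]
    yield url, 1


def combine_fn(url, counts):
    """Combine: pre-sum counts on the map side to cut data sent to reducers."""
    yield url, sum(counts)


def reduce_fn(url, counts):
    """Reduce(k2, list(v2)) -> list(v3): total count per url."""
    yield url, sum(counts)


def group_by_key(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits):
    combined = []
    for split in splits:  # each split stands in for one map task
        mapped = [pair for offset, line in enumerate(split)
                  for pair in map_fn(offset, line)]
        for key, values in group_by_key(mapped).items():
            combined.extend(combine_fn(key, values))
    result = {}
    for key, values in group_by_key(combined).items():  # the shuffle
        for url, total in reduce_fn(key, values):
            result[url] = total
    return result


if __name__ == "__main__":
    print(run_job([LOG_LINES[:2], LOG_LINES[2:]]))
    # {'/index.htm': 2, '/igloo.htm': 1, '/ed.htm': 1}
```

Because each split is mapped and combined independently, adding nodes lets more splits run at once, which is where the linear speed-up reported earlier comes from.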
  • 26. Hive style hive> CREATE TABLE web_data (sdate STRING, stime STRING, envvar STRING, remotelogname STRING, servername STRING, localip STRING, literaldash STRING, method STRING, url STRING, querystring STRING, status STRING, litteralzero STRING, bytessent INT, header STRING, timetoserver INT, useragent STRING, cookie STRING, referer STRING); SELECT url, count(1) FROM web_data GROUP BY url;
  • 27. Life lessons volume 2 ● Feasible and reasonable were completely different than in case #1 ● Query from seconds -> hours ● Size from GB to TB ● Feasible from 4 nodes to 15
  • 28. Big Data Challenge #3 (work at m6d) ● Large high volume ad serving site ● Process them and produce reports ● Support data science and biz-dev users ● Big Data challenges – Storage: Store and process terabytes of data ● Complex data types, encoded data – Analysis, visualization: support reports of existing system ● Qualifiers – Reasonable: ad hoc, daily, hourly, weekly, monthly reports
  • 29. Data data everywhere ● We have to use cookies in many places ● Cookies have limited size ● Cookies have complex values encoded
  • 30. Some encoding tricks we might do LastSeen: long (64 bits) Segment: int (32 bits) Literal ',' Segment: int (32 bits) Zipcode: int (32 bits) ● Choose a relevant epoch and use a byte ● Use a byte for # of segments ● Use a 4 byte radix encoded number ● ... and so on
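A minimal sketch of two of these tricks using Python's `struct` module. The epoch date, the field widths and the function names are all assumptions made for illustration, not the layout m6d used. LastSeen is stored as days since a recent epoch in 2 bytes instead of a 64-bit long, and a 1-byte segment count replaces the ',' separators. The radix-encoding trick is not shown.

```python
import struct
from datetime import date, timedelta

EPOCH = date(2010, 1, 1)  # hypothetical "relevant epoch"


def encode_cookie(last_seen: date, segments: list, zipcode: int) -> bytes:
    days = (last_seen - EPOCH).days
    # >H: days since EPOCH (2 bytes), B: number of segments (1 byte)
    header = struct.pack(">HB", days, len(segments))
    # Each segment stays a 4-byte unsigned int, with no separators needed.
    body = struct.pack(f">{len(segments)}I", *segments)
    return header + body + struct.pack(">I", zipcode)


def decode_cookie(payload: bytes):
    days, count = struct.unpack_from(">HB", payload, 0)
    segments = list(struct.unpack_from(f">{count}I", payload, 3))
    (zipcode,) = struct.unpack_from(">I", payload, 3 + 4 * count)
    return EPOCH + timedelta(days=days), segments, zipcode


if __name__ == "__main__":
    payload = encode_cookie(date(2011, 3, 21), [215286, 195785], 10003)
    print(len(payload))            # 15 bytes
    print(decode_cookie(payload))
```

For two segments the naive layout on the slide takes 21 bytes (8 + 4 + 1 + 4 + 4) against 15 here. The bytes would still need a text-safe encoding such as base64 before being placed in a cookie.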
  • 31. Getting at embedded data ● Write N UDFs for each object, like: – getLastSeenForCookie(String) – getZipcodeForCookie(String) – ... ● But this would have made a huge toolkit ● Traditionally you do not want to break first normal form
  • 32. Struct solution ● Hive has a struct type, like a C struct ● A struct is a list of name/value pairs ● Structs can contain other structs ● This gives us the serious ability to do object mapping ● UDFs can return struct types
  • 33. Using a UDF ● ADD JAR myjar.jar; ● CREATE TEMPORARY FUNCTION parseCookie AS 'com.md6.ParseCookieIntoStruct'; ● SELECT parseCookie(encodedColumn).lastSeen FROM mydata;
  • 34. LATERAL VIEW + EXPLODE SELECT client_id, entry.spendcreativeid FROM datatable LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry WHERE hit_date=20110321 AND mid=001406; Result: 3214498023360851706 215286 3214498023360851706 195785 3214498023360851706 128640
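What LATERAL VIEW with explode does to a row can be sketched in plain Python: each element of a list-valued column becomes its own output row, joined back to the columns of the row it came from. The dict layout and the field name `ad_entries` are hypothetical stand-ins for the struct returned by `AdHistoryAsStruct`. The first row reuses the values shown in the result above, and the second row is made up to show the filter.

```python
def lateral_view_explode(rows, list_field):
    """Yield (row, entry) once per element of row[list_field].

    Rows whose list is empty produce no output, as with a LATERAL VIEW
    that is not declared OUTER.
    """
    for row in rows:
        for entry in row[list_field]:
            yield row, entry


rows = [
    {
        "client_id": 3214498023360851706,
        "hit_date": 20110321,
        "ad_entries": [
            {"spendcreativeid": 215286},
            {"spendcreativeid": 195785},
            {"spendcreativeid": 128640},
        ],
    },
    # Hypothetical second row, removed by the hit_date filter below.
    {"client_id": 1, "hit_date": 20110322,
     "ad_entries": [{"spendcreativeid": 7}]},
]

result = [
    (row["client_id"], entry["spendcreativeid"])
    for row, entry in lateral_view_explode(rows, "ad_entries")
    if row["hit_date"] == 20110321
]

if __name__ == "__main__":
    for client_id, creative_id in result:
        print(client_id, creative_id)
```

One input row with three ad entries becomes three output rows, which is how a single encoded column can feed an ordinary GROUP BY report.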
  • 35. All that data might boil down to...
  • 36. Life lessons volume #3 ● Big data is not only batch or real-time ● Big data is feedback loops – Machine learning – Ad hoc performance checks ● Generated SQL tables periodically synced to web server ● Data shared between sections of an organization to make business decisions