How Hadoop & Hive Change the Data Warehousing Game Forever
Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com
2014 Hadoop Summit
San Jose, CA
June 3, 2014
The Truth about Data
44
“We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Source: IDC Predictions 2013: Big Data, IDC
“90% of the data in the world today has been created in the last two years”
Source: IBM
In 2012, 2.5 quintillion bytes of data were generated every day
Source: IBM
2,500,000,000 GB
7
…and that was back in 2012
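The two figures above are the same quantity in different units. A quick check, assuming decimal (SI) units where 1 GB is 10^9 bytes (the slide does not say which convention it uses):

```python
# Convert IBM's 2012 daily figure from bytes to gigabytes.
# Assumption: decimal (SI) units, 1 GB = 10**9 bytes.
BYTES_PER_DAY = 25 * 10**17   # 2.5 quintillion bytes
BYTES_PER_GB = 10**9

gb_per_day = BYTES_PER_DAY // BYTES_PER_GB
print(f"{gb_per_day:,} GB per day")
```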
The Broken Promise
What we wanted
What we got
Time for a new approach
Relational DBs
Volume: Write twice
Variety: Structured
Velocity: Early Transformation
Hadoop
Volume: Write once
Variety: Semi-structured
Velocity: Late Transformation
INPUT DATA → HADOOP → ETL → MART, MART, MART → QUERY ENGINE → VISUALIZER
INPUT DATA → HADOOP (HIVE) → VISUALIZER
Case Study
Klout
20
15 social networks processed daily
769 TB of data storage
200,000 indexed users added daily
400,000,000 users indexed daily
12,000,000,000 social signals processed daily
50,000,000,000 API calls delivered monthly
10,080,000,000,000 rows of data in the data warehouse
Trillions!
Klout’s Data Architecture
A common question
How are users using our site?
Google Analytics
+ Great page by page analysis
- User-identifiable data is against the EULA
MixPanel
+ Great event tracking
- Can send user-specific identifiers, but with big limitations
Klout
+ All our data telling us who these people actually are
- That’s about it
Klout’s Event Tracker
{
  "project": "plusK",
  "event": "spend",
  "ks_uid": 123456,
  "type": "add_topic"
}
{
  "project": "plusK",
  "event": "spend",
  "session_id": "0",
  "ip": "50.68.47.158",
  "kloutId": "123456",
  "cookie_id": "123456",
  "ref": "http://www.klout.com",
  "type": "add_topic",
  "time": "1338366015"
}
EVENT_LOG
tstamp INT
project STRING
event STRING
session_id BIGINT
ks_uid BIGINT
ip STRING
attr_map MAP<STRING,STRING>
json_text STRING
dt STRING
hr STRING
SELECT { [Measures].[Counter],
[Measures].[PreviousPeriodCounter]}
ON COLUMNS,
NON EMPTY CROSSJOIN (
exists([Date].[Date].[Date].allmembers,
[Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-0
02T00:00:00]),
[Events].[Event].[Event].allmembers) DIMENSION PROPERTIES
MEMBER_CAPTION
ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[plusK]})
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.kloutId') as kloutId,
get_json_object(json_text,'$.v') as version,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND tstamp=20121027
AND event in ('api_error', 'api_timeout')
ORDER BY sid;
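To see what the Hive query above does with json_text, here is a rough Python equivalent run over a few made-up rows. The sample rows and the helper name `extract` are hypothetical. Only the column names, JSON keys and filter values come from the query. It is a sketch of the idea (pull fields out of raw JSON at query time), not of how Hive executes it.

```python
import json

# Hypothetical sample rows shaped like bi.event_log (all values are made up).
rows = [
    {"project": "mobile-ios", "tstamp": 20121027, "event": "api_error",
     "json_text": '{"sid": "b2", "kloutId": "123456", "v": "1.2", "status": "500"}'},
    {"project": "mobile-ios", "tstamp": 20121027, "event": "api_timeout",
     "json_text": '{"sid": "a1", "kloutId": "654321", "v": "1.2"}'},
    {"project": "plusK", "tstamp": 20121027, "event": "spend",
     "json_text": '{"sid": "c3"}'},
]

def extract(json_text, key):
    """Top-level lookup, similar in spirit to get_json_object(json_text, '$.key').

    Returns None when the key is absent or the text is not a JSON object.
    """
    try:
        return json.loads(json_text).get(key)
    except (ValueError, TypeError, AttributeError):
        return None

result = sorted(
    (
        {
            "sid": extract(r["json_text"], "sid"),
            "kloutId": extract(r["json_text"], "kloutId"),
            "version": extract(r["json_text"], "v"),
            "status": extract(r["json_text"], "status"),
            "event": r["event"],
        }
        for r in rows
        if r["project"] == "mobile-ios"
        and r["tstamp"] == 20121027
        and r["event"] in ("api_error", "api_timeout")
    ),
    key=lambda row: row["sid"],  # ORDER BY sid
)

for row in result:
    print(row)
```

The point of the pattern is that new attributes can show up in json_text without any change to the table definition. A missing key simply comes back empty.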
So, what’s wrong with this picture?
Klout’s Data Architecture
Case Study
Online Gaming (MMO)
...
LogIn\t1369155542\t4533245\t"loc":"23","rank":"Expert","client":"ios"\lf
Buy\t1369155556\t4533446\t"loc":"23","item":"212","ref":"ask.com","amt":"1.50"\lf
...
Capture
Event Name Timestamp User ID Attributes
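A minimal sketch of how one captured line splits into the four fields named above. It assumes a tab between fields, a comma between attributes and a colon between each key and value. The function name `parse_event` is hypothetical, and stripping the surrounding quotes is a choice made here for readability, not something the slides specify.

```python
def parse_event(line):
    """Split one raw log line into event name, timestamp, user ID and attributes."""
    event, timestamp, user_id, raw_attrs = line.rstrip("\n").split("\t", 3)
    attributes = {}
    for pair in raw_attrs.split(","):
        # Split on the first colon only, so values may themselves contain colons.
        key, _, value = pair.partition(":")
        attributes[key.strip('"')] = value.strip('"')
    return {
        "event": event,
        "event_time": int(timestamp),  # Unix epoch seconds
        "user_id": int(user_id),
        "event_attributes": attributes,
    }

sample = 'Buy\t1369155556\t4533446\t"loc":"23","item":"212","ref":"ask.com","amt":"1.50"\n'
parsed = parse_event(sample)
print(parsed["event"], parsed["user_id"], parsed["event_attributes"])
```

This naive split would break on attribute values that contain commas, so a real log format would need escaping or a proper serializer.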
CREATE EXTERNAL TABLE event_log (
  event            STRING,
  event_time       BIGINT,  -- Unix epoch seconds, as FROM_UNIXTIME expects
  user_id          INT,
  event_attributes MAP<STRING, STRING>
)
-- Hive partitions on declared columns, not expressions; event_day is an assumed name
PARTITIONED BY (event_day INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
LOCATION '/user/event_logs';
Event Name Timestamp User ID Attributes
Map
SELECT
  SUBSTR(FROM_UNIXTIME(event_time), 1, 7) AS MonthOfEvent,
  event_attributes["loc"] AS Location,
  COUNT(*) AS EventCount
FROM event_log
WHERE YEAR(FROM_UNIXTIME(event_time)) = 2014
GROUP BY SUBSTR(FROM_UNIXTIME(event_time), 1, 7), event_attributes["loc"];
Event Name Timestamp User ID Attributes
Transform and Query
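The grouping in the query above can be mimicked in plain Python to show what "transform at query time" means. The events below are made up for illustration, and timestamps are interpreted in UTC here as an assumption (Hive's FROM_UNIXTIME uses the session time zone).

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical parsed events: (event_time in epoch seconds, attribute map).
events = [
    (1388534400, {"loc": "23"}),  # 2014-01-01 00:00:00 UTC
    (1388620800, {"loc": "23"}),  # 2014-01-02 00:00:00 UTC
    (1391212800, {"loc": "7"}),   # 2014-02-01 00:00:00 UTC
    (1369155542, {"loc": "23"}),  # 2013-05-21, dropped by the year filter
]

counts = Counter()
for event_time, attrs in events:
    ts = datetime.fromtimestamp(event_time, tz=timezone.utc)
    if ts.year != 2014:                    # WHERE year(...) = 2014
        continue
    month_of_event = ts.strftime("%Y-%m")  # same text as SUBSTR(..., 1, 7)
    counts[(month_of_event, attrs.get("loc"))] += 1

for (month, loc), event_count in sorted(counts.items()):
    print(month, loc, event_count)
```

Nothing is pre-aggregated: the raw events stay as captured, and the month and location are derived only when the question is asked.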
Hive for analytics
Why now?
New, Interactive Flavors
Feature | Shark | Impala | Stinger
Performance approach | Caching | Optimizer | Improve Hive
Theoretical limits (# of rows) | Billions | Trillions | Trillions
Supports UDFs, SerDes | Yes | Fall ‘14 | Yes
Supports non-scalar data types | Yes | Fall ‘14 | Yes
Preferred file format | Tachyon | Parquet | ORC
Sponsorship | Databricks | Cloudera | Hortonworks
Hive is a cheap MPP database
TPC-H Query Run Times (Impala vs. HANA)
Line item table, 60 million rows

Columns under each select statement, in order: Records Returned | Time (Seconds) for HANA Small | Impala Small (1 Node), Parquet | Impala Small (3 Nodes), Parquet | Impala Small (1 Node), Text | Impala Small (3 Nodes), Text

select count(*) from lineitem
1 | 1 | 3 | 1 | 74 | 31

select count(*), sum(l_extendedprice) from lineitem
1 | 4 | 12 | 3 | 73 | 29

select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode
7 | 8 | 23 | 5 | 74 | 28

select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode
1 | 1 | 20 | 4 | 73 | 28

select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus
14 | 10 | 32 | 7 | 74 | 28

select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus
1 | 1 | 27 | 5 | 72 | 29

select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
45 | 1 | 23 | 5 | 73 | 30

select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
45 | 1 | 29 | 5 | 73 | 31

select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
45 | 1 | 104 | 21 | 73 | 30

Size: (5 Part.) 1.9 GB | (40 files x 80 MB) 3.2 GB | (1 file, no compression) 7.2 GB
Est. Monthly Cost of Production Environment on AWS (HANA m2.xlarge, Impala m1.medium): $1022 | $175 | $350 | $175 | $350
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Real Customer Scenario: Impala (CDH5)
5 Data Node Cluster, 16Gb of RAM each, 4 Cores each, Parquet
Columns: Fact Table (# of rows), Dimensions, Execution Time (h:mm:ss.sss)
1 Dimension (28 rows) 0:00:00.836
2 Dimensions (28 rows, 28 rows) 0:00:00.767
2 Dimensions (28 rows, 2,926 rows) 0:00:00.660
3 Dimensions (28 rows, 28 rows, 2,926 rows) 0:00:00.871
1 Dimension (2,926 rows) 0:00:00.490
2 Dimensions (28 rows, 2,926 rows) 0:00:00.705
1 Dimension (28 rows) 0:00:06.780
2 Dimensions (28 rows, 3,782 rows) 0:00:23.097
0 Dimensions 0:00:52.074
1 Dimension (121,964,466 rows) - Count Distinct 0:00:55.861
1 Dimension (5,547,151 rows) 0:01:08.972
2 Dimensions (5 rows, 2,926 rows) 0:00:45.060
3 Dimensions (5 rows, 2,926 rows, 3,782 rows) 0:01:39.119
0 Dimensions 0:00:06.945
2 Dimensions(5 rows, 40,040 rows) 0:00:31.980
4 Dimensions (2 rows, 487,374 rows, 7,875,489 rows, 2,038,760 rows) 0:01:33.404
0 Dimensions 0:00:12.854
1 Dimension (8,038 rows) 0:00:24.083
995,761,863 1 Dimension (3,782 rows) 0:01:23.484
1 Dimension (28 rows) 0:00:56.716
1 Dimension (5 rows) 0:00:33.750
2 Dimensions (5 rows, 3,782 rows) 0:01:11.021
0 Dimensions 0:00:32.854
2 Dimensions (3 rows, 371 rows) 0:00:54.329
Fact Table (# of rows): 520; 15,036; 55,676; 72,745,961; 121,964,466; 263,223,987; 378,706,328; 587,679,516; 1,064,423,864; 1,174,737,467
Hive v Impala
TL;DW
(Too long; didn’t watch)
DO: Capture Data | DO NOT: Aggregate Data
DO: T (Transform) | DO NOT: ETL (Extract, Transform, Load)
DO: Schema on Read | DO NOT: Schema on Load
DO: Query in Place | DO NOT: Create Data Marts
Thank you!
Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com

 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Hadoop & Hive Change the Data Warehousing Game Forever

Editor's notes

  1. Batman *Forever*
  2. Our ability to capture data has far exceeded our ability to analyze it. Traditional data warehousing tools have not kept pace with the growth of data. Hadoop allows us to capture and store data economically, but traditional BI tools and approaches don’t work. IDC: “Currently a quarter of the information in the Digital Universe would be useful for big data if it were tagged and analyzed. We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
  3. Sad panda
  4. Happy panda!
  5. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  6. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  7. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  8. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  9. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  10. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  11. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  12. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  13. The Klout architecture is made up of open-source tools. Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster with $3k commodity nodes to create a 1.5-petabyte warehouse. We used SQL Server Analysis Services, connected directly to our Hive data warehouse, to provide an interactive query environment.
  14. Great page by page analysis, great reports, but couldn’t send user identifiable data
  15. Mixpanel has great support for real time events, but we couldn’t send all the necessary data to really draw interesting conclusions. Joining on data was still going to be a huge challenge.
  16. We had all our data, but of course, that was about it
  17. We couldn’t cross the streams. We wanted to discover really interesting patterns and make advanced recommendations based on who the user was.
  18. At Klout, we used web analytics tools like Google Analytics and Mixpanel to understand how our users interacted with our web site and mobile app. However, we could not join the usage data with our profile data. This made for an incomplete view of our users. We decided to build a flexible, event-oriented architecture to capture all events for user activity. This is the architecture.
  19. First, we invented a simple, JSON-oriented event capture method. This allowed our web and app designers to add instrumentation without regard to how it would affect the downstream analytics applications or Hive warehouse.
  20. Next, using Flume, we mapped the semi-structured data stream into time-partitioned files in Hadoop HDFS.
  21. We then created an EXTERNAL Hive table on top of this file structure. That allowed us to “query” the incoming files in HDFS.
  22. In order to provide an interactive query environment (OLAP), we connected SQL Server Analysis Services directly to the Hive warehouse and continuously updated a MOLAP cube with the data.
  23. We could then hook up internally developed applications (Event Tracker) to our data by having the applications generate MDX (a multidimensional query language) queries and run them against our cube.
  24. Or we could use the Hive CLI (command line interface) to execute queries using SQL directly against our Hive warehouse.
  25. Thumbs up!
  26. The Klout architecture is made up of open-source tools. Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster with $3k commodity nodes to create a 1.5-petabyte warehouse. We used SQL Server Analysis Services, connected directly to our Hive data warehouse, to provide an interactive query environment.
  27. The Klout architecture is made up of open-source tools. Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster with $3k commodity nodes to create a 1.5-petabyte warehouse. We used SQL Server Analysis Services, connected directly to our Hive data warehouse, to provide an interactive query environment.
  28. By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format. This lets us perform transformations at query time (schema on read), keep our data modeling (pre-structuring) to a minimum, and add new data without affecting the schema. In this way, we can capture all the data we want in the simplest terms (log files) and structure the data later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
  29. By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format. This lets us perform transformations at query time (schema on read), keep our data modeling (pre-structuring) to a minimum, and add new data without affecting the schema. In this way, we can capture all the data we want in the simplest terms (log files) and structure the data later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
  30. By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format. This lets us perform transformations at query time (schema on read), keep our data modeling (pre-structuring) to a minimum, and add new data without affecting the schema. In this way, we can capture all the data we want in the simplest terms (log files) and structure the data later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
  31. Apache Hive’s reliance on MapReduce as its core data processing engine makes it unsuitable for interactive queries, due to startup times and MapReduce’s batch nature. Several approaches are emerging that address these deficiencies while remaining compatible with the Hive catalog. These developments are what make it possible to use Hadoop/Hive as the world’s least expensive yet scalable data warehousing platform.
  32. Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws. Cloudera Impala, which is essentially free, performs almost as well as an expensive alternative that relies on memory caching to deliver performance.
  33. Shark, Impala, etc. turn Hive into a real interactive SQL query environment. This is a huge advancement and was the missing piece that makes Hadoop the world’s cheapest, most scalable database. Here’s a query that demonstrates Shark/Hive’s support for non-scalar data types:

          use aw_demo;
          describe factinternetsales;
          select a.year, s.stylename, a.num_orders
          from (
            select part_year as year, product_info["style"] as style, sum(orderquantity) as num_orders
            from factinternetsales
            where part_year < 2007
            group by product_info["style"], part_year
          ) a
          left outer join dimstyle s on a.style = s.stylekey
          order by year, num_orders desc;
  34. Too Long; Didn’t Watch
  35. Too Long; Didn’t Watch
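
Notes 19 to 21 and 28 describe capturing JSON events, landing them in time-partitioned HDFS files, and keeping unknown fields as key/value attributes so the structure is applied on read. The following is a minimal Python sketch of that idea, not Klout's actual pipeline. It assumes that the tracker's `kloutId` field feeds the `ks_uid` column and that every field outside the fixed EVENT_LOG columns goes into `attr_map`; the `/events/dt=.../hr=...` directory layout and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone

# Columns EVENT_LOG declares explicitly and stores as plain strings;
# every other field ends up in attr_map.
STRING_COLUMNS = ("project", "event", "ip")


def to_event_row(raw_json: str) -> dict:
    """Flatten one raw JSON event into an EVENT_LOG-shaped row."""
    event = json.loads(raw_json)
    row = {
        "tstamp": int(event.pop("time")),
        "session_id": int(event.pop("session_id")),
        # Assumption: the tracker's "kloutId" field feeds the ks_uid column.
        "ks_uid": int(event.pop("kloutId")),
    }
    for column in STRING_COLUMNS:
        row[column] = event.pop(column, None)
    # Fields the schema does not know about are kept as string key/value
    # pairs, so new instrumentation does not require a schema change.
    row["attr_map"] = {key: str(value) for key, value in event.items()}
    return row


def partition_path(tstamp: int, root: str = "/events") -> str:
    """Hypothetical time-partitioned directory, one per UTC hour."""
    moment = datetime.fromtimestamp(tstamp, tz=timezone.utc)
    return f"{root}/dt={moment:%Y-%m-%d}/hr={moment:%H}"


# The sample event shown on the Event Tracker slide.
raw = json.dumps({
    "project": "plusK", "event": "spend", "session_id": "0",
    "ip": "50.68.47.158", "kloutId": "123456", "cookie_id": "123456",
    "ref": "http://www.klout.com", "type": "add_topic", "time": "1338366015",
})
row = to_event_row(raw)
path = partition_path(row["tstamp"])
```

With the sample event, `type`, `ref`, and `cookie_id` end up in `row["attr_map"]`, which mirrors how a query such as `product_info["style"]` in note 33 reads an attribute out of a map column at query time.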