Column Stores and Google BigQuery
Cloud
Presented By:
Csaba Toth
Csaba Technology Services LLC
GDG Fresno meeting
October 1 2015 Fresno, California
Disclaimer
Disclaimer – cont.
Goal / wish
• Being able to issue queries
• Preferably SQL style
• Over big data
• With as short a response time as possible
• Plus: through a web interface (no need to
install anything)
• Plus: capability to visualize results
Agenda
• Big Data
• Brief look at Hadoop, Hive and Spark
• OLAP and OLTP
• Row based data store vs. Column data store
• Google BigQuery
• Demo
Big Data
Wikipedia: “collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or
traditional data processing applications”
Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
• 100 Terabytes of data is uploaded to Facebook every day
• Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user
generated data
• Twitter generates 12 Terabytes of data every day
• LinkedIn processes and mines Petabytes of user data to power the "People You May
Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding of the human genome used to take 10 years. Now it can be done in 7 days
Big Data
Three Vs: Volume, Velocity, Variety
Sources:
• Science, Sensors, Social networks, Log files
• Public Data Stores, Data warehouse appliances
• Network and in-stream monitoring technologies
• Legacy documents
Main problems:
• Storage Problem
• Money Problem
• Consuming and processing the data
Hadoop
• Hadoop is an open-source software
framework that supports data-intensive
distributed applications
• A Hadoop cluster is composed of a single
master node and multiple worker nodes
Little Hadoop history
“The Google File System” - October 2003
• http://labs.google.com/papers/gfs.html – describes a scalable,
distributed, fault-tolerant file system tailored for data-intensive
applications that runs on inexpensive commodity hardware and
delivers high aggregate performance
“MapReduce: Simplified Data Processing on Large
Clusters” - December 2004
• http://queue.acm.org/detail.cfm?id=988408 – describes a
programming model and an implementation for processing large
data sets
Hadoop
Has two main services:
1. Storing large amounts of data: HDFS – Hadoop
Distributed File System
2. Processing large amounts of data:
implementing the MapReduce programming
model
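To make the MapReduce programming model concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative assumptions, and a real cluster would spread the map and reduce steps over many worker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Query over big data"])
```

Each map call looks at one line only, and each reduce call looks at one key only; that independence is what allows the work to be distributed.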
HDFS
• Name node: metadata store
• Data nodes: Node 1, Node 2 and Node 3
• Each data node stores a copy of Block A and Block B
Job / task management
• Name node and Jobtracker: heartbeat signals and communication
with the data nodes
• Each of the three data nodes runs a Tasktracker
• Tasktrackers run the map and reduce tasks (Map 1 / Reduce 1,
Map 2 / Reduce 2, Map 3 / Reduce 3)
Hadoop / RDBMS / Docum.
Aspect | Hadoop / MapReduce | RDBMS | Document stores
Size of data | Petabytes | Gigabytes | Gigabytes+
Integrity of data | Low | High (referential, typed) | Low/Intermediate
Data schema | Dynamic | Static | Dynamic
Access method | Batch | Interactive and Batch | Interactive and Batch
Scaling | Linear | Nonlinear (worse than linear) | Better than RDBMS
Data structure | Unstructured | Structured | Unstructured / semi-structured
Normalization of data | Not required | Required | Not or somewhat required
Query response time | Has latency (due to batch processing) | Can be near immediate | Can be near immediate
Apache Hive
• Data sources: log data, RDBMS
• Data integration layer: Flume, Sqoop
• Storage layer: HDFS
• Computing layer: MapReduce
• Advanced query engine: Hive, Pig
• Data mining: Pegasus, Mahout
• Index, searches: Lucene
• DB drivers: Hive driver
• Web browser (JS)
Hadoop architecture
http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
Apache Hive UI
Apache Hive UI
Hadoop distributions
Beyond Apache Hive
Goals: decrease latency
Technologies which help:
• YARN: next generation Hadoop
• Hadoop distribution specific: e.g. Cloudera
Impala
• Apache Spark
Beyond Apache Hive
• YARN: improves Hadoop performance in
many respects (resource management and
allocation, …)
• Impala: Cloudera’s MPP SQL Query engine,
based on Hadoop
• Spark: cluster computing framework with
multi-stage in-memory primitives
Apache Spark
• Open Source
• In contrast to Hadoop’s two-stage disk-based MapReduce
paradigm, multi-stage in-memory primitives can provide up to
100x performance increase
• It can work over HDFS
Spark and Hadoop
http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
Spark and Hadoop
OLAP vs OLTP
Aspect | OLTP - Online Transaction Processing (Operational System) | OLAP - Online Analytical Processing (Data Warehouse)
Source of data | Operational data; original source | Consolidation data; comes from various sources
Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support
Goal of operations | Retrieve or modify individual records (mostly few records) | Derive new information from existing data (aggregates, transformations, calculations)
Queries | Queries often triggered by end user actions and should complete instantly | Queries often run on many records or the complete data set
Read/Write | Mixed read/write workload | Mainly read or even read-only workload
RAM | Working set should fit in RAM | Data set may easily exceed the size of RAM
OLAP vs OLTP
Aspect | OLTP - Online Transaction Processing (Operational System) | OLAP - Online Analytical Processing (Data Warehouse)
ACID properties | May be important | Often not important, data can often be regenerated
Interactivity | Queries often triggered by end user actions and should complete instantly | Queries often run interactively
Indexing | Use indexes to quickly find relevant records | Common: not known in advance which aspects are interesting, so pre-indexing "relevant" columns is difficult
DB design | Often highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas
Storing data: row stores
• Traditional RDBMSs, and often document stores too,
are row oriented
• The engine always stores and retrieves
entire rows from disk (unless indexes help)
• A row is a collection of column values stored
together
• Rows are materialized on disk
Row stores
All columns are stored together on disk
id | scientist | death_by | movie_name
1 | Reinhardt | Maximillian | The Black Hole
2 | Tyrell | Roy Batty | Blade Runner
3 | Hammond | Dinosaur | Jurassic Park
4 | Soong | Lore | Star Trek: TNG
5 | Morbius | His mind | Forbidden Planet
6 | Dyson | Skynet | Terminator 2: Judgment Day
Row stores
Performs best when a small number of rows are accessed
select * from the_table where id = 6
id | scientist | death_by | movie_name
1 | Reinhardt | Maximillian | The Black Hole
2 | Tyrell | Roy Batty | Blade Runner
3 | Hammond | Dinosaur | Jurassic Park
4 | Soong | Lore | Star Trek: TNG
5 | Morbius | His mind | Forbidden Planet
6 | Dyson | Skynet | Terminator 2: Judgment Day
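The point lookup on this slide can be tried with Python's built-in `sqlite3` module, which stores each table row as one record. This is a minimal sketch using the sample table; the in-memory database and the column types are assumptions made for illustration.

```python
import sqlite3

rows = [
    (1, "Reinhardt", "Maximillian", "The Black Hole"),
    (2, "Tyrell", "Roy Batty", "Blade Runner"),
    (3, "Hammond", "Dinosaur", "Jurassic Park"),
    (4, "Soong", "Lore", "Star Trek: TNG"),
    (5, "Morbius", "His mind", "Forbidden Planet"),
    (6, "Dyson", "Skynet", "Terminator 2: Judgment Day"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table the_table ("
    "id integer primary key, scientist text, death_by text, movie_name text)"
)
conn.executemany("insert into the_table values (?, ?, ?, ?)", rows)

# The primary key locates row 6, and all of its column values
# come back together because they are stored together.
result = conn.execute("select * from the_table where id = 6").fetchone()
conn.close()
```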
Row stores
• Not so great for wide rows
• If only a small subset of columns is queried,
reading the entire row wastes IO
Row stores
Bad case scenario:
• select sum(bigint_column) from table
• One million rows in the table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes per value)
• The entire row must be read
• Reads ~1 GiB of data for ~8 MiB of column data
Column stores
• Data is organized by columns instead of
rows
• Rows are often not materialized in storage;
they exist only in memory
• Each row still has some sort of “row id”
Column stores
• A row is a collection of column values that are
associated with one another
• Associated: every row has some type of “row
id”
• Can still produce row output (assembling a
row may be complex though, under the
hood)
Column store
Stores each COLUMN on disk
id: 1, 2, 3, 4, 5, 6
title: Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
actor: Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
genre: Comedy, Comedy, Horror, Drama, Comedy, Drama
(row id = 1 is the first value of each column, row id = 6 is the last)
• Natural order may be unusual
• Each column has a file or segment on disk
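The two layouts can be sketched with plain Python lists. This is only an in-memory illustration of the idea, not how any particular engine stores its files; the helper `assemble_row` and the use of list position as the row id are assumptions.

```python
column_names = ["id", "title", "actor", "genre"]

# Row-oriented layout: one tuple per row, all column values together.
rows = [
    (1, "Mrs. Doubtfire", "Robin Williams", "Comedy"),
    (2, "The Big Lebowski", "Jeff Bridges", "Comedy"),
    (3, "The Fly", "Jeff Goldblum", "Horror"),
    (4, "Steel Magnolias", "Dolly Parton", "Drama"),
    (5, "The Birdcage", "Nathan Lane", "Comedy"),
    (6, "Erin Brockovich", "Julia Roberts", "Drama"),
]

# Column-oriented layout: one list per column; the position inside
# each list plays the role of the row id.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}


def assemble_row(columns, position):
    """Rebuild one row by taking the same position from every column."""
    return tuple(columns[name][position] for name in column_names)


genres = columns["genre"]              # touches a single column only
last_row = assemble_row(columns, 5)    # touches every column
```

Reading one column is a single lookup, while producing a full row has to visit every column, which is the trade-off the following slides discuss.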
Column stores
• Column compression can be way more efficient
than row compression or compression available
for row stores (sometimes 10:1 to 30:1 ratio)
• Compression: RLE, Integer packing,
dictionaries and lookup, other…
• Reduces both storage and IO (thus response
time)
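Two of the techniques named above, run-length encoding (RLE) and dictionary encoding, can be sketched in a few lines of Python. The function names are illustrative, and real column stores use far more compact binary formats; the sample values are the genre column from the earlier slide.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: store (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(values):
    """Dictionary encoding: replace each value with a small integer code."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

# Sorting the column first makes the runs as long as possible.
runs = rle_encode(sorted(genre))
dictionary, codes = dictionary_encode(genre)
```

Both work well on a column because its values share one type and often repeat; a row mixes unrelated values, so the same tricks find far fewer repeats.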
Column stores
Best case scenario:
• select sum(bigint_column) from table
• Million rows in table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes)
• Only single column read from disk
• Reads ~8MiB of column data, even less with
compression
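The figures for the row store and column store scenarios follow from simple arithmetic. The sketch below reproduces them from the numbers on the slides (one million rows, 1 KiB per row, 8 bytes per bigint) and ignores compression, indexes and page overhead.

```python
ROWS = 1_000_000      # one million rows in the table
ROW_BYTES = 1024      # average row length: 1 KiB
COLUMN_BYTES = 8      # one bigint value

MIB = 1024 ** 2
GIB = 1024 ** 3

row_store_bytes = ROWS * ROW_BYTES        # every full row is read
column_store_bytes = ROWS * COLUMN_BYTES  # only the bigint column is read

print(f"row store:    {row_store_bytes / GIB:.2f} GiB")
print(f"column store: {column_store_bytes / MIB:.2f} MiB")
print(f"ratio:        {row_store_bytes // column_store_bytes}x")
```

That is about 0.95 GiB against about 7.63 MiB, so the row store reads 128 times more data for the same sum.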
Column stores
Bad case scenario:
select *
from long_wide_table
where order_line_id = 34653875;
• Accessing all columns of the table doesn’t save
anything, and could be even more expensive than
in a row store
• Not ideal for tables with few columns
Column stores
Updating and deleting rows is expensive
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column
areas
Row/Column - OLTP/OLAP
Row stores are good fit for OLTP
• Reading small portions of a table, but often
many of the columns
• Frequent changes to data
• Small (<2TB) amount of data (typically the
working set must fit in RAM)
• "Nested loops" joins are good fit for OLTP
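A nested loops join can be sketched as two loops, one inside the other. The table contents and key names below are hypothetical, and real engines usually replace the inner scan with an index lookup.

```python
def nested_loop_join(outer, inner, outer_key, inner_key):
    """For every outer row, scan the inner rows and emit matching pairs."""
    joined = []
    for outer_row in outer:
        for inner_row in inner:
            if outer_row[outer_key] == inner_row[inner_key]:
                joined.append({**outer_row, **inner_row})
    return joined


# OLTP-style input: the outer side is a handful of rows.
orders = [{"order_id": 10, "customer_id": 1}]
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]

result = nested_loop_join(orders, customers, "customer_id", "customer_id")
```

The cost grows with the product of the two input sizes, which is acceptable when the outer side is only a few rows, as is typical for OLTP.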
Row/Column - OLTP/OLAP
Column stores are good fit for OLAP
Read large portions of a table in terms of rows,
but often a small number of columns
Batch loading / updates
Big data (50TB-100TB per machine):
• Compression capabilities come in handy
• Machine-generated data is well suited
Column / Row stores
• RDBMS provide ACID capabilities
• Row stores mainly use tree style indexes
• A B-tree derivative index structure provides very
fast binary search as long as it fits into memory
• Very large datasets end up with unmanageably big
indexes
• Column stores: bitmap indexing
• Very expensive to update
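A bitmap index keeps one bit string per distinct column value, with bit i set when row i holds that value. The sketch below uses Python integers as bitmaps; the function names are illustrative and the sample data is the genre column from the earlier slide.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set for row i."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_rows(bitmap):
    """Return the row positions whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
index = build_bitmap_index(genre)

comedy_rows = matching_rows(index["Comedy"])
# A predicate such as genre = 'Comedy' OR genre = 'Drama' is one bitwise OR.
comedy_or_drama = matching_rows(index["Comedy"] | index["Drama"])
```

Combining predicates is cheap, but changing one row means touching the bitmaps of both its old and new value, and real indexes compress those bitmaps, which is one reason updates are costly.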
BigQuery
• A web service that enables interactive
analysis of massively large datasets
• Based on Dremel, a scalable, interactive ad
hoc query system for analysis of read-only
nested data
• Works in conjunction with Google Storage
• Has a RESTful web service interface
BigQuery
• You can issue SQL queries over big data
• Interactive web interface
• Can visualize results too
• With as short a response time as possible
• Auto scales under the hood
Demo!
Thank you!
Questions?
Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce -
http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce
• Justin Swanhart: Introduction to column stores -
http://www.slideshare.net/MySQLGeek/intro-to-column-stores
• Jan Steemann: Column-oriented databases -
http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
Resources
• https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/
• https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop
• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
• https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf
• https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html
• https://cloud.google.com/bigquery/web-ui-quickstart
• https://cloud.google.com/bigquery/query-reference
Column Stores and Google BigQuery

  • 1. Column Stores and Google BigQuery Cloud Presented By: Csaba Toth Csaba Technology Services LLC GDG Fresno meeting October 1 2015 Fresno, California
  • 4. Goal / wish • Being able to issue queries • Preferably SQL style • Over Big data • As small response time as possible • Plus: through web interface (no need to install anything) • Plus: capability to visualize
  • 5. Agenda • Big Data • Brief look at Hadoop, HIVE and Spark • OLAP and OLTP • Row based data store vs. Column data store • Google BigQuery • Demo
  • 6. Big Data Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” Examples: (Wikibon - A Comprehensive List of Big Data Statistics) • 100 Terabytes of data is uploaded to Facebook every day • Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user generated data • Twitter generates 12 Terabytes of data every day • LinkedIn processes and mines Petabytes of user data to power the "People You May Know" feature • YouTube users upload 48 hours of new video content every minute of the day • Decoding of the human genome used to take 10 years. Now it can be done in 7 days
  • 7. Big Data Three Vs: Volume, Velocity, Variety Sources: • Science, Sensors, Social networks, Log files • Public Data Stores, Data warehouse appliances • Network and in-stream monitoring technologies • Legacy documents Main problems: • Storage Problem • Money Problem • Consuming and processing the data
  • 8. Hadoop • Hadoop is an open-source software framework that supports data-intensive distributed applications • A Hadoop cluster is composed of a single master node and multiple worker nodes
  • 9. Little Hadoop history “The Google File System” - October 2003 • http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware, delivers high aggregate performance “MapReduce: Simplified Data Processing on Large Clusters” - April 2004 • http://queue.acm.org/detail.cfm?id=988408 – describes a programming model and an implementation for processing large data sets.
  • 10. Hadoop Has two main services: 1. Storing large amounts of data: HDFS – Hadoop Distributed File System 2. Processing large amounts of data: implementing the MapReduce programming model
  • 11. HDFS Name node Metadata Store Data node Data node Data node Node 1 Node 2 Block A Block B Block A Block B Node 3 Block A Block B
  • 12. Job / task management Name node Heart beat signals and communication Jobtracker Data node Data node Data node Tasktracker Tasktracker Map 1 Reduce 1 Map 2 Reduce 2 Tasktracker Map 3 Reduce 3
  • 13. Hadoop / RDBMS / Docum.
      Aspect | Hadoop / MapReduce | RDBMS | Document stores
      Size of data | Petabytes | Gigabytes | Gigabytes+
      Integrity of data | Low | High (referential, typed) | Low/Intermediate
      Data schema | Dynamic | Static | Dynamic
      Access method | Batch | Interactive and Batch | Interactive and Batch
      Scaling | Linear | Nonlinear (worse than linear) | Better than RDBMS
      Data structure | Unstructured | Structured | Unstructured / semi-struct.
      Normalization of data | Not Required | Required | Not or somewhat required
      Query Response Time | Has latency (due to batch processing) | Can be near immediate | Can be near immediate
  • 14. Apache Hive [Architecture diagram] Log Data; RDBMS; Data Integration Layer (Flume, Sqoop); Storage Layer (HDFS); Computing Layer (MapReduce); Advanced Query Engine (Hive, Pig); Data Mining (Pegasus, Mahout); Index, Searches (Lucene); DB drivers (Hive driver); Web Browser (JS)
  • 19. Beyond Apache Hive Goals: decrease latency Technologies which help: • YARN: next generation Hadoop • Hadoop distribution specific: e.g. Cloudera Impala • Apache Spark
  • 20. Beyond Apache Hive • YARN: improves Hadoop performance in many respects (resource management and allocation, …) • Impala: Cloudera’s MPP SQL Query engine, based on Hadoop • Spark: cluster computing framework with multi-stage in-memory primitives
  • 21. Apache Spark • Open Source • In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to 100x performance increase • It can work over HDFS
  • 24. OLAP vs OLTP
      Aspect | OLTP - Online Transaction Processing (Operational System) | OLAP - Online Analytical Processing (Data Warehouse)
      Source of data | Operational data; original source | Consolidation data; comes from various sources
      Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support
      Goal of operations | retrieve or modify individual records (mostly few records) | derive new information from existing data (aggregates, transformations, calculations)
      Queries | queries often triggered by end user actions and should complete instantly | queries often run on many records or complete data set
      Read/Write | mixed read/write workload | mainly read or even read-only workload
      RAM | working set should fit in RAM | data set may exceed size of RAM easily
  • 25. OLAP vs OLTP
      Aspect | OLTP - Online Transaction Processing (Operational System) | OLAP - Online Analytical Processing (Data Warehouse)
      ACID properties | may be important | often not important, data can often be regenerated
      Interactivity | queries often triggered by end user actions and should complete instantly | queries often run interactively
      Indexing | use indexes to quickly find relevant records | common: not known in advance which aspects are interesting so pre-indexing „relevant“ columns is difficult
      DB Design | Often highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas
  • 26. Storing data: row stores • Traditional RDBMSs, and often document stores too, are row oriented • The engine always stores and retrieves entire rows from disk (unless indexes help) • A row is a collection of column values stored together • Rows are materialized on disk
  • 27. Row stores All columns are stored together on disk
      id | scientist | death_by | movie_name
      1 | Reinhardt | Maximillian | The Black Hole
      2 | Tyrell | Roy Batty | Blade Runner
      3 | Hammond | Dinosaur | Jurassic Park
      4 | Soong | Lore | Star Trek: TNG
      5 | Morbius | His mind | Forbidden Planet
      6 | Dyson | Skynet | Terminator 2: Judgment Day
  • 28. Row stores Performs best when a small number of rows are accessed
      select * from the_table where id = 6
      id | scientist | death_by | movie_name
      1 | Reinhardt | Maximillian | The Black Hole
      2 | Tyrell | Roy Batty | Blade Runner
      3 | Hammond | Dinosaur | Jurassic Park
      4 | Soong | Lore | Star Trek: TNG
      5 | Morbius | His mind | Forbidden Planet
      6 | Dyson | Skynet | Terminator 2: Judgment Day
  • 29. Row stores • Not so great for wide rows • If only a small subset of columns is queried, reading the entire row wastes IO
  • 30. Row stores Bad case scenario: • select sum(bigint_column) from table • Million rows in table • Average row length is 1 KiB • The select reads one bigint column (8 bytes) • Entire row must be read • Reads ~1 GiB of data for ~8 MiB of column data
  • 31. Column stores • Data is organized by columns instead of rows • Non-material world: rows are often not materialized in storage and exist only in memory • Each row still has some sort of “row id”
  • 32. Column stores • A row is a collection of column values that are associated with one another • Associated: every row has some type of “row id“ • Can still produce row output (assembling a row may be complex though, under the hood)
  • 33. Column store Stores each COLUMN on disk
      id: 1, 2, 3, 4, 5, 6
      title: Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
      actor: Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
      genre: Comedy, Comedy, Horror, Drama, Comedy, Drama
      row id = 1 (first value of each column) … row id = 6 (last value of each column)
      Natural order may be unusual
      Each column has a file or segment on disk
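As a rough illustration of the layout on this slide, the sketch below turns a few row tuples into one list per column and then reassembles a row from its position. The helper names (`to_columns`, `assemble_row`) are invented for this example, and a plain Python list stands in for a column file or segment on disk.

```python
from typing import Any, Dict, List, Sequence, Tuple

COLUMN_NAMES = ("id", "title", "actor", "genre")

# Row-oriented input: one tuple per row (first three rows of the slide's table).
ROWS: List[Tuple[Any, ...]] = [
    (1, "Mrs. Doubtfire", "Robin Williams", "Comedy"),
    (2, "The Big Lebowski", "Jeff Bridges", "Comedy"),
    (3, "The Fly", "Jeff Goldblum", "Horror"),
]


def to_columns(names: Sequence[str], rows: Sequence[Tuple[Any, ...]]) -> Dict[str, List[Any]]:
    """Store each column separately; a value's position acts as the implicit row id."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def assemble_row(columns: Dict[str, List[Any]], position: int) -> Dict[str, Any]:
    """Rebuild one row by reading the same position from every column."""
    return {name: values[position] for name, values in columns.items()}


if __name__ == "__main__":
    columns = to_columns(COLUMN_NAMES, ROWS)
    print(columns["genre"])          # reading one column touches only that list
    print(assemble_row(columns, 2))  # reading a full row touches every column
```

The second call shows why producing row output is the more expensive direction for a column store: every column has to be visited to rebuild a single row.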
  • 34. Column stores • Column compression can be way more efficient than row compression or compression available for row stores (sometimes 10:1 to 30:1 ratio) • Compression: RLE, Integer packing, dictionaries and lookup, other… • Reduces both storage and IO (thus response time)
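Two of the compression schemes listed on this slide, run-length encoding (RLE) and dictionary encoding, can be sketched as follows. This is a minimal illustration rather than the format of any particular column store; the function names are assumptions made for the example.

```python
from typing import Dict, Hashable, List, Tuple


def rle_encode(values: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Run-length encoding: collapse consecutive repeats into (value, count) pairs."""
    runs: List[Tuple[Hashable, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs: List[Tuple[Hashable, int]]) -> List[Hashable]:
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(values: List[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Dictionary encoding: store each distinct value once, plus small integer codes."""
    dictionary: List[Hashable] = []
    code_of: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


if __name__ == "__main__":
    genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
    print(dictionary_encode(genre))
    print(rle_encode(sorted(genre)))  # RLE pays off when equal values are adjacent
```

RLE works best on sorted or low-cardinality columns, which is one reason values within a single column tend to compress better than mixed values within a row.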
  • 35. Column stores Best case scenario: • select sum(bigint_column) from table • Million rows in table • Average row length is 1 KiB • The select reads one bigint column (8 bytes) • Only a single column is read from disk • Reads ~8 MiB of column data, even less with compression
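The bad-case and best-case figures for select sum(bigint_column) can be checked with a little arithmetic. The sketch below assumes exactly 1,000,000 rows, 1 KiB = 1024 bytes per row, and no compression or index; the function names are made up for this example.

```python
ROWS = 1_000_000   # "Million rows in table"
ROW_BYTES = 1024   # average row length of 1 KiB
BIGINT_BYTES = 8   # one bigint value
MIB = 1024 ** 2


def bytes_scanned_row_store(rows: int, row_bytes: int) -> int:
    """A row store has to read every full row to reach the one column."""
    return rows * row_bytes


def bytes_scanned_column_store(rows: int, value_bytes: int) -> int:
    """A column store reads only the values of the requested column."""
    return rows * value_bytes


row_io = bytes_scanned_row_store(ROWS, ROW_BYTES)
column_io = bytes_scanned_column_store(ROWS, BIGINT_BYTES)

if __name__ == "__main__":
    print(f"row store:    {row_io / MIB:.1f} MiB")     # roughly 1 GiB
    print(f"column store: {column_io / MIB:.1f} MiB")  # roughly 8 MiB
    print(f"ratio:        {row_io // column_io}x")
```

Under these assumptions the row store scans 128 times more data than the column store for the same query, before any column compression is applied.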
  • 36. Column stores Bad case scenario: select * from long_wide_table where order_line_id = 34653875; • Accessing all columns of the table doesn’t save anything and could be even more expensive than in a row store • Not ideal for tables with few columns
  • 37. Column stores Updating and deleting rows is expensive • Some column stores are append only • Others just strongly discourage writes • Some split storage into row and column areas
  • 38. Row/Column - OLTP/OLAP Row stores are a good fit for OLTP • Reading small portions of a table, but often many of the columns • Frequent changes to data • Small (<2TB) amount of data (typically the working set must fit in RAM) • "Nested loops" joins are a good fit for OLTP
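For readers unfamiliar with the "nested loops" join mentioned here, a naive version looks like the sketch below. Real engines usually replace the inner scan with an index lookup, which is what makes the approach cheap when only a few outer rows are involved. The table contents and key names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def nested_loops_join(outer: List[Row], inner: List[Row],
                      outer_key: str, inner_key: str) -> List[Row]:
    """For every outer row, scan the inner rows and keep the matching pairs."""
    joined: List[Row] = []
    for outer_row in outer:
        for inner_row in inner:
            if outer_row[outer_key] == inner_row[inner_key]:
                joined.append({**outer_row, **inner_row})
    return joined


# Hypothetical sample data: one order with two order lines.
ORDERS: List[Row] = [
    {"order_id": 1, "customer": "Ann"},
    {"order_id": 2, "customer": "Bob"},
]
ORDER_LINES: List[Row] = [
    {"line_order_id": 1, "item": "disk"},
    {"line_order_id": 1, "item": "ram"},
    {"line_order_id": 3, "item": "cpu"},
]

if __name__ == "__main__":
    for row in nested_loops_join(ORDERS, ORDER_LINES, "order_id", "line_order_id"):
        print(row)
```

The cost grows with the product of the two input sizes, so the technique suits OLTP-style queries that touch a handful of rows rather than scans over large tables.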
  • 39. Row/Column - OLTP/OLAP Column stores are a good fit for OLAP • Read large portions of a table in terms of rows, but often a small number of columns • Batch loading / updates • Big data (50TB-100TB per machine): • Compression capabilities come in handy • Machine generated data is well suited
  • 40. Column / Row stores • RDBMS provide ACID capabilities • Row stores mainly use tree style indexes • B-tree derivative index structure provides very fast binary search as long as it fits into memory • Very large datasets end up with unmanageably big indexes • Column stores: bitmap indexing • Very expensive to update
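Bitmap indexing can be sketched with Python integers used as bit sets: one bitmap per distinct column value, with bit i set when row i holds that value. This is a toy illustration under that assumption, not the on-disk format of any specific product, and the function names are invented for the example.

```python
from typing import Dict, Hashable, List


def build_bitmap_index(values: List[Hashable]) -> Dict[Hashable, int]:
    """Build one bitmap per distinct value; bit `position` marks a matching row."""
    index: Dict[Hashable, int] = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_positions(bitmap: int) -> List[int]:
    """List the row positions whose bit is set in the bitmap."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


if __name__ == "__main__":
    genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
    index = build_bitmap_index(genre)
    # WHERE genre = 'Comedy' OR genre = 'Drama' becomes a single bitwise OR.
    print(matching_positions(index["Comedy"] | index["Drama"]))
```

Predicates combine with cheap bitwise operations, but inserting or deleting a row can touch every bitmap, which is consistent with the slide's note that such indexes are very expensive to update.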
  • 41. BigQuery • A web service that enables interactive analysis of massively large datasets • based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data • working in conjunction with Google Storage • Has a RESTful web service interface
  • 42. BigQuery • You can issue SQL queries over big data • Interactive web interface • Can visualize results too • As small response time as possible • Auto scales under the hood
  • 43. Demo!
  • 45. Resources • Slides: http://www.slideshare.net/tothc • Contact: http://www.meetup.com/CCalJUG/ • Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce • Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores • Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
  • 46. Resources • https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/ • https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop • http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI • https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf • https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html • https://cloud.google.com/bigquery/web-ui-quickstart • https://cloud.google.com/bigquery/query-reference