Data Warehouse Node Storage and CPU Options
[Slide diagram: data stored as JSON documents, showing an original and replicated copies]
High Storage Extra Large (XL) DW Node:
  CPU: 2 virtual cores (Intel Xeon E5)
  Memory: 15 GiB
  Storage: 3 HDDs with 2TB of locally attached storage
  Network: Moderate
  Disk I/O: Moderate
  API name: dw.hs1.xlarge

High Storage Eight Extra Large (8XL) DW Node:
  CPU: 16 virtual cores (Intel Xeon E5)
  Memory: 120 GiB
  Storage: 24 HDDs with 16TB of locally attached storage
  Network: 10 Gigabit Ethernet with support for cluster placement groups
  Disk I/O: Very High
  API name: dw.hs1.8xlarge


Editor's notes

1. As the internet grew and evolved from simple text into multimedia, relational database engines failed to adapt to and support these new data types. This became especially important with mobile and our ability to share photos and videos as they were happening. We had to figure out a way to store all those cat videos and pictures of food somehow!
2. New content producers accelerated this growth by empowering millions of us to share photos, videos, and more through sites like Facebook, Twitter, YouTube, and Instagram.
3. Prior to this internet evolution, most data were transactional in nature, adhering to well-structured rows and columns. Since this revolution, the internet has exploded in growth, both in its breadth and its depth. Transactional business data, which supports most key types of analysis, has not increased at nearly the rate of social media data, web logs, and sensor data. So when we talk about "Big Data," it's important to keep in mind that the analysis at the heart of most business processes is still manageable using traditional systems. Src: IDC Digital Universe 2009: White Paper, Sponsored by EMC, 2009
4. For years, data capture mechanisms greatly outpaced the processing capacity of the hardware on which they ran. New technologies taking advantage of standard hardware, also known as commodity hardware, have closed the gap, so it is no longer cost-prohibitive to store everything rather than only part of the data. Alongside these new technologies running on commodity hardware are high-powered analytical engines taking full advantage of huge improvements in throughput and disk operations found in new solid-state disks. Src: http://www.mkomo.com/cost-per-gigabyte
5. Document stores are, as the name implies, repositories of documents and the associated framework for storing and managing them. These documents are often structured in a way that lets applications easily retrieve and parse their data. This makes querying these sources for analytics exceptionally difficult; however, there are software packages aimed at providing a SQL-like interface to their documents.
6. All of this may seem a bit overwhelming, but rest assured, there are several platform vendors to your rescue, all for a price of course. The biggest platform vendor out there is Cloudera. Cloudera offers a complete Hadoop system bundled into their CDH product. They also offer professional services and many other open-source add-ons to Hadoop, one very interesting example being Impala, which lets users execute real-time queries against the Hadoop Distributed File System, or HDFS. The next platform vendor to mention is Hortonworks. Many of the key people at Hortonworks created the original Hadoop project, dating back to its inception at Yahoo labs. Hortonworks also bundles all of the different technologies you need to run Hadoop; however, they're more focused on an older build of Hadoop and, from what I've seen, haven't been pushing forward with the newest versions of the underlying tools just yet. Third, and possibly the most interesting, is Infochimps, a startup out of Austin offering a complete on-premises and cloud-based Big Data solution. Many new platform vendors like this are sprouting up daily as they race to capture your IT budget for your next Big Data project. This brings us to the next topic: how to implement Big Data in the cloud.
7. Optimized for Data Warehousing – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries (a toy sketch of the zone-map idea follows below). Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the Intel Xeon E5 processors and the drives, and a 10GigE mesh network to maximize throughput between nodes.
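Since the notes lean on zone maps without defining them, here is a minimal sketch of the idea: each block of a column records the minimum and maximum values it contains, so a range predicate can prove whole blocks irrelevant and skip the I/O. All names and structures below are invented for illustration; this is not Redshift code.

```python
# Each block of a column carries a "zone map": the min and max value it holds.
blocks = [
    {"min": 0,    "max": 999,  "rows": range(0, 1000)},
    {"min": 1000, "max": 1999, "rows": range(1000, 2000)},
    {"min": 2000, "max": 2999, "rows": range(2000, 3000)},
]

def scan(lo, hi):
    """Read only blocks whose [min, max] range overlaps the predicate."""
    hits = []
    for block in blocks:
        if block["max"] < lo or block["min"] > hi:
            continue  # zone map proves no matching rows: skip this block entirely
        hits.extend(r for r in block["rows"] if lo <= r <= hi)
    return hits

# Only the middle block is actually read for this range predicate.
print(len(scan(1200, 1500)))  # -> 301
```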
8. Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change (a sketch of such a resize call follows below). Amazon Redshift enables you to start with as little as a single 2TB XL node and scale all the way up to a hundred 16TB 8XL nodes, for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to the new cluster, Amazon Redshift will automatically redirect queries to it and remove the old cluster.
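As a concrete illustration of that "simple API call," here is a sketch using the boto3 Redshift client (a newer SDK than these slides assume); the cluster identifier is hypothetical.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Grow the cluster to 4 nodes; Redshift performs the read-only mode,
# parallel copy, and endpoint switch described in the note above.
redshift.modify_cluster(
    ClusterIdentifier="my-warehouse",  # hypothetical cluster name
    NodeType="dw.hs1.xlarge",          # the XL node type from the specs above
    NumberOfNodes=4,
)
```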
9. No Up-Front Costs – You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing. On-Demand pricing starts at just $0.85 per hour for a single-node 2TB data warehouse, scaling linearly with cluster size. With Reserved Instance pricing, you can lower your effective price to $0.228 per hour for a single 2TB node, or under $1,000 per TB per year (the arithmetic is below). To see more details, visit the Amazon Redshift Pricing page.
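The "under $1,000 per TB per year" figure follows directly from the quoted reserved rate; a quick check:

```python
# Reserved rate quoted above: $0.228/hour for a single 2TB node.
hours_per_year = 24 * 365                    # 8,760 hours
node_cost_per_year = 0.228 * hours_per_year  # ~ $1,997 per node per year
cost_per_tb_year = node_cost_per_year / 2    # the node stores 2TB

print(f"${node_cost_per_year:,.2f} per node-year")  # $1,997.28
print(f"${cost_per_tb_year:,.2f} per TB-year")      # $998.64 -> "under $1,000"
```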
10. Get Started in Minutes – With a few clicks in the AWS Management Console or simple API calls, you can create a cluster, specifying its size, underlying node type, and security profile. Amazon Redshift will provision your nodes, configure the connections between them, and secure the cluster. Your data warehouse should be up and running in minutes. (A sketch of a create-cluster call follows below.)
Fully Managed – Amazon Redshift handles all the work needed to manage, monitor, and scale your data warehouse, from monitoring cluster health and taking backups to applying patches and upgrades. You can easily add or remove nodes from your cluster as your performance and capacity needs change. By handling all these time-consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business.
Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster, automatically re-replicates data from failed drives, and replaces nodes as necessary.
Automated Backups – Amazon Redshift's automated snapshot feature continuously backs up new data on the cluster to Amazon S3. Snapshots are automated, incremental, and continuous. Amazon Redshift stores your snapshots for a user-defined period, which can be from one to thirty-five days. You can also take your own snapshots at any time; these leverage all existing system snapshots and are retained until you explicitly delete them. Once you delete a cluster, your system snapshots are removed, but your user snapshots remain available until you explicitly delete them.
Easy Restores – You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Amazon Redshift APIs. Your cluster is available as soon as the system metadata has been restored, and you can start running queries while user data is spooled down in the background.
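A sketch of creating a cluster programmatically, again using boto3 as an illustration; every identifier and credential below is hypothetical.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="my-warehouse",            # hypothetical
    NodeType="dw.hs1.xlarge",                    # node type from the specs above
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="example-password-123A",  # hypothetical credential
    DBName="analytics",
    Encrypted=True,  # encrypt data at rest (see the encryption note below)
)
```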
11. Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest. If you choose to enable encryption of data at rest, all data written to disk will be encrypted, as will any backups. (A sketch of these two settings follows below.)
Isolation – Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster. You can also run Amazon Redshift inside Amazon Virtual Private Cloud (Amazon VPC) to isolate your data warehouse cluster in your own virtual network and connect it to your existing IT infrastructure using industry-standard encrypted IPsec VPN.
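The "couple of parameter settings" map to a parameter-group setting for SSL in transit and a creation-time flag for encryption at rest. A sketch using boto3; the parameter-group name is hypothetical, while require_ssl is a real Redshift parameter.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Force SSL for data in transit via the cluster's parameter group.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-warehouse-params",  # hypothetical group name
    Parameters=[{"ParameterName": "require_ssl", "ParameterValue": "true"}],
)

# Encryption at rest is chosen at creation time (Encrypted=True in the
# create_cluster sketch above) and covers both disk data and backups.
```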
12. SQL – Amazon Redshift is a SQL data warehouse and uses industry-standard ODBC and JDBC connections and Postgres drivers. Many popular software vendors are certifying Amazon Redshift with their offerings, enabling you to continue using the tools you use today. See the Amazon Redshift partner page for details. (A connection-and-load sketch follows below.)
Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. AWS Data Pipeline enables easy, programmatic integration between Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and Amazon Relational Database Service (Amazon RDS).
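Because Redshift speaks the Postgres wire protocol, any standard Postgres driver can connect, and the parallel load from S3 is the COPY command. A sketch with a hypothetical endpoint, table, bucket, and credentials:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-warehouse.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="admin",
    password="example-password-123A",
)
with conn, conn.cursor() as cur:
    # COPY pulls files from S3 into the table in parallel across the nodes.
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/'
        CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=...'
        DELIMITER '|';
    """)
```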
13. BigQuery is Google's cloud Big Data solution, based on the Dremel platform. Dremel has been in development for over six years and powers much of Google's Cloud Platform. It's worth mentioning that for this course I'm going to cover BigQuery at a high level, and later we'll connect Tableau to it to see how to use it in practice. If you'd like to dive deeper into BigQuery, Lynn Langit has a course here which goes into much greater detail and is definitely worth checking out. Let's start by taking a look at their homepage.
14. Looking at their interface: on their homepage they proclaim you can analyze terabytes of data with just the click of a button. Sounds promising, if it weren't for Amazon Redshift offering petabytes in scale. You'll also notice a query editor and result pane previewed. This is encouraging; however, for non-SQL developers it can be a scary sight.
15. Similar to Amazon Redshift, Google BigQuery stores data in a columnar database format, which is great for data compression and query speed. Google BigQuery differs from Amazon Redshift, however, in that it uses a tree structure. This is similar to an MPP database, but it spreads the data extremely wide and, for each query, creates an execution "tree" which can scan tens of thousands of servers (leaf nodes) containing the data and return results in milliseconds; a toy sketch of the idea follows below. Like Redshift, this all adds up to speed! Google is trying to differentiate BigQuery from MPP solutions by providing what they call full-scan results: essentially, building an execution tree for any query you can run and answering it with a full scan rather than indexes. In his whitepaper titled "An Inside Look at Google BigQuery," Kazunori Sato states that "BigQuery solves the parallel disk I/O problem by utilizing the cloud platform's economy of scale. You would need to run 10,000 disk drives and 5,000 processors simultaneously to execute the full scan of 1TB of data within one second." Impressive.
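A toy sketch of the execution-tree idea described above: a root fans the query out to many leaf servers, each leaf scans only its own shard, and partial aggregates are merged on the way back up. Names and structure are invented for illustration; this is not how Dremel is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Ten "leaf servers," each holding one shard of the data.
shards = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]

def leaf_scan(shard):
    """Each leaf scans only its shard and returns a partial aggregate."""
    return sum(x for x in shard if x % 2 == 0)

# The "root" dispatches the query to all leaves in parallel...
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(leaf_scan, shards))

# ...and merges the partial sums into the final answer.
print(sum(partials))
```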
16. Dremel is the platform on which BigQuery is based.
17. Scalability with Google BigQuery is a bit of a mystery, to be honest. Since they handle all of the administration and data distribution for you, scalability is really limited only by how much you can afford. Once you upload your data to BigQuery, it handles the rest; you only need to worry about how much data your queries will process, which brings us to their pricing model.
18. Big Data analysis engine without operating a data center:
  - Managed service means no additional capital costs
  - Ability to terminate the service and remove your data at any time
Transparency in pricing and usage:
  - Simplicity: only two pricing components (query processing and storage)
  - Flexibility: choice to pay by the month for what you use
Full visibility and control:
  - Monthly billing: monitor and throttle what you use
  - Tools to optimize usage/costs: best practices, tooling, samples
Since you're charged by the amount of data processed, this can be very expensive with a "chatty" query tool like Tableau. Google recommends sharding data into separate tables using a timestamp and setting your queries to filter to a specific date range to minimize query costs. In my view, this is the only real issue with BigQuery. Say you have a query which pulls back something like sales for the west region by month for the past year. This will return 24 data points: 12 integers for sales and 12 date values corresponding to the months of sale. To get to these 24 data points, your query may have to scan millions or billions of rows (imagine Amazon's detailed sales transactions), aggregate the data, and then return your results. Since you're paying for all the data scanned, a single query can really rack up the bills (a back-of-the-envelope sketch follows below). If you were building a focused application rather than doing visual analytics with a tool like Tableau, you could probably handle this quite well; in the Tableau case, however, it can be cost-prohibitive to store your data here. I have a friend who was testing this, and one of his analysts actually ran a single query that cost them $400!
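To make the pay-per-bytes-scanned concern concrete, here is a back-of-the-envelope sketch. The per-TB rate is an assumed placeholder, not a quoted BigQuery price; at this assumed rate, the $400 query above would correspond to roughly 80TB scanned.

```python
# Assumed price per terabyte of data processed, for illustration only.
PRICE_PER_TB_SCANNED = 5.00

def query_cost(bytes_scanned):
    """Cost is driven by bytes scanned, not by the size of the result set."""
    return bytes_scanned / 1e12 * PRICE_PER_TB_SCANNED

# A query returning 24 data points can still scan terabytes of raw rows.
print(f"${query_cost(2e12):.2f} for a 2 TB scan")    # $10.00
print(f"${query_cost(80e12):.2f} for an 80 TB scan") # $400.00
```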