Infochimps Cloudcon 2012


This document discusses big data and cloud computing. It describes how data volumes are growing exponentially and will increase 44-fold by 2020. It also discusses how cloud infrastructure provides unlimited computational power to process large and diverse data sources to gain statistically significant insights. Finally, it promotes a big data cloud service that aims to reduce friction for business users.


Big Data & Cloud
Infinite Monkey Theorem
CloudCon Expo & Conference
October, 2012

First
8/17/2013 Infochimps Confidential 2
What is Big Data?
“data sets so large and complex that it becomes difficult to process using on-hand database management tools.”

Source: 2011 IDC Digital Universe Study
Data Volume Growing 44x
2010 = 1.2 Zettabytes/yr
2020 = 35.2 Zettabytes/yr

Enterprise Data Warehouse (PARC | 4)
Diagram labels: Request -> ??? -> Answer; Parsing Engines; BYNET Interconnect; Amp Node, Amp Node, Amp Node, . . . .

Big Data Warehouse (PARC | 5)
Diagram labels: Analytic Request (Search, Recommend, Rank, Next-Best-Action, Score) -> Answer; Semi-Structured Data; Master: Name Node, Job Tracker; Ethernet Interconnect; Slave (x3, . . . .): Task Tracker, Data Node

Diagram labels: Traditional Operational; Traditional Decision Support; Analytic Appliances; Real Time, Batch; Large Enterprise, Small Enterprise; Application Ecosystem; Deployment in Public/Private Cloud; Toolset Integration; Hardened

Next
Infinite Monkey Theorem (2):
an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.

“unexperienced and unobservable” based on “real experiences and real observations”

Infinite Monkey Theorem (2):
an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.

infinite number of monkeys -> unlimited computational power
keys on a typewriter -> processing data
almost surely -> statistically significant
Shakespeare's Hamlet -> insights

#thisischimpy

Problem: “Little Data For Business Users”

“Big Data For Business Users”

Reduce Friction
Diagram labels: Data; ?; $; Executive

#thisisreallygood

unlimited computational power
Public, Private, Virtual Private

analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity

data processing
Public, Private, Virtual Private

Walmart

Target

Diagram labels: Images; Docs, Text; Web Logs; Social; Sensors; GPS; Business Transactions & Interactions; Business Intelligence & Analytics; SQL, NoSQL, NewSQL; EDW, MPP, NewSQL; Dashboards, Reports, Visualization…; Web, Mobile, CRM, ERP, SCM…

statistically significant
Public, Private, Virtual Private

#lotsofdata #simplealgorithms+

Diagram labels: Cars In Lot; News Text; Web Pricing; Social Sentiment; Weather Sensors; Local Employment -> Quarterly Revenue Prediction

insights
Public, Private, Virtual Private

Diagram labels: Gnip Powertrack; Gnip EDC; Moreover Metabase; TV Transcription; Radio Transcription; Print Transcription; In-Motion Data Delivery Service; NoSQL; Listening Application; New Media; Traditional Media; APIs; Sources; Sentiment; Business Users; App Developer; Data Scientist; IT Staff

unlimited computational power
processing data
statistically significant
insights

#1BigDataCloudService

#inspiredbyAvinashKaushik


Editor's notes

Avinash Kaushik gave a talk at Strata 2012 in Santa Clara in March. If you listen to all the hype of Big Data, it solves for the first problem. If you listen to all the vendors, there is a lot of emphasis on the first part (perhaps Infochimps included), and very little on the second. I think that’s because we don’t exactly know how to truly empower the organization to interact directly with any/all data available. It’s too expensive, risky, complex.
40%+ YoY growth, with 2012 alone generating 2.4 Zettabytes.
http://jameskaskade.com/?p=2040
http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
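As a quick sanity check on the growth figures, the sketch below derives the constant yearly growth rate implied by the slide's two endpoints (1.2 ZB in 2010, 35.2 ZB in 2020) and the 2012 volume that rate would give. The function name is illustrative and constant compounding is an assumption; the IDC study itself may model growth differently. The two endpoints by themselves give a ratio of about 29x, so the 44x headline presumably uses an earlier baseline that the slide does not show.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures quoted on the slide, in zettabytes per year.
ZB_2010 = 1.2
ZB_2020 = 35.2

rate = compound_annual_growth_rate(ZB_2010, ZB_2020, years=10)

# Volume two years after 2010 if growth compounds at that constant rate.
zb_2012 = ZB_2010 * (1.0 + rate) ** 2

print(f"implied growth: {rate:.1%} per year")
print(f"implied 2012 volume: {zb_2012:.2f} ZB")
```

Under these assumptions the rate comes out at roughly 40% per year and the 2012 volume at a little under 2.4 ZB, which is in line with the note above.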
AMP: access module processor
PE: Parsing Engine
BYNET: Banyan Cross-bar Switch YNET (Y Network)
Store:
The Parsing Engine dispatches a request to insert a row.
The BYNET ensures that the row gets to the appropriate AMP (Access Module Processor) via the hashing algorithm.
The AMP stores the row on its associated disk.
Each AMP can have multiple physical disks associated with it.
Retrieve:
The Parsing Engine dispatches a request to retrieve one or more rows.
The BYNET ensures that the appropriate AMP(s) are activated.
The AMPs (access module processors) locate and retrieve the desired rows in parallel and will sort, aggregate or format them if needed.
The BYNET returns the retrieved rows to the Parsing Engine.
The Parsing Engine returns the row(s) to the requesting client application.
Teradata’s shared-nothing architecture allows for highly scalable data volumes.
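The store and retrieve steps above can be sketched as a toy shared-nothing store. This is only an illustration of hash-based row placement: `ShardedStore`, `amp_for_key` and the use of SHA-256 are all assumptions made for the example, and Teradata's actual hashing algorithm and BYNET protocol are not reproduced here.

```python
import hashlib

NUM_AMPS = 3  # hypothetical cluster size, matching the three AMP nodes on the slide


def amp_for_key(primary_key: str, num_amps: int = NUM_AMPS) -> int:
    """Map a row key to an AMP index with a stable hash (a stand-in for the real algorithm)."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


class ShardedStore:
    """Toy shared-nothing store: each AMP owns its own rows and nothing else."""

    def __init__(self, num_amps: int = NUM_AMPS):
        self.num_amps = num_amps
        self.amps = [dict() for _ in range(num_amps)]

    def insert(self, key: str, row: dict) -> None:
        # Store: the hash decides which AMP receives the row.
        self.amps[amp_for_key(key, self.num_amps)][key] = row

    def get(self, key: str):
        # Keyed retrieve: only the owning AMP needs to be activated.
        return self.amps[amp_for_key(key, self.num_amps)].get(key)

    def scan(self, predicate):
        # Full retrieve: every AMP filters its own rows, then results are merged.
        # (Sequential here; a real system would run the AMPs in parallel.)
        results = []
        for amp in self.amps:
            results.extend(row for row in amp.values() if predicate(row))
        return results


store = ShardedStore()
for i in range(100):
    store.insert(f"cust-{i}", {"id": i, "region": "west" if i % 2 else "east"})

print([len(amp) for amp in store.amps])  # rows held by each AMP
print(store.get("cust-7"))
```

Because each key always hashes to the same AMP, a keyed lookup touches one node while a scan fans out to all of them, which is the property the shared-nothing design relies on to scale.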
3 node Hadoop system:
$8K/node
$10K switch
$4K/node Hadoop Distro
$24K + $10K x 25% x 3 maintenance = $43K
$4K x 3 x 3 = $36K
Total =

There are three essential elements of an analytic platform:
Strong support for analytic database query. A variety of query styles: at a minimum, SQL, MDX or graph.
Strong support for analytic processes other than queries. Typically these would be in the areas of mathematics (statistics, predictive analytics, data mining, linear algebra, optimization, graph theory, etc.) and/or data transformation (e.g. sessionization, entity extraction).
Strong integration between the first two.
The point is that an analytic platform is something on which you can build a range of powerful analytic applications. Some specifics of what to look for in an analytic platform may be found in the links below.
http://www.dbms2.com/2011/02/24/analytic-platforms/
http://www.dbms2.com/2011/01/18/architectural-options-for-analytic-database-management-systems/

Enterprise data warehouse (full or partial)
Kinds of data likely to be included: All, but especially operational
Likely use styles: All
Canonical example: Central EDW for a big enterprise
Stresses: Concurrency, reliability, workload management
Classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server

Traditional data mart
Kinds of data likely to be included: All
Likely use styles: Business intelligence, budgeting/consolidation, investigative
Examples: Reporting servers, planning/consolidation servers, anything MOLAP, etc.
Stresses: Performance, concurrency, TCO
Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them, e.g. Sybase IQ and Vertica, have excellent track records in concurrent usage as well.

Investigative data mart (agile)
Kinds of data likely to be included: All, especially customer-centric
Likely use styles: Investigative
Canonical example: A few analysts getting a few TB to examine
Stresses: Ease of setup/load, ease of admin, price/performance
Infobright is often cost-effective among columnar analytic DBMS.

Investigative data mart (big)
Kinds of data likely to be included: All, especially customer-centric, logs, financial trade, scientific
Likely use styles: Investigative
Canonical example: Single-subject 20 TB – 20 PB relational database
Stresses: Performance, scale-out, analytic functionality
Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum.

Bit bucket (Hadoop)
Kinds of data likely to be included: Logs, other technical/external
Likely use styles: Staging/ETL, investigative
Canonical example: Log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

Archival data store
Kinds of data likely to be included: Operational, CDR (call detail record), security log
Likely use styles: Archival, reporting (for compliance), possibly also investigative
Examples: Any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)
Perhaps only Rainstor truly embraces the archival positioning

Outsourced data mart
Kinds of data likely to be included: All
Likely use styles: Traditional BI, investigative analytics, staging/ETL
Examples: Advertising tracking, SaaS CRM
Stresses: Performance, TCO, reliability, concurrency
Oracle shops = Vertica gets the nod in a number of these cases

Operational analytic(s) server
Kinds of data likely to be included: Customer-centric, log, financial trade
Likely use styles: Advanced operational analytics
Examples:
Lower latency: Web or call-center personalization, anti-fraud
Higher latency: Customer profiling, Basel 3 risk analysis
Stresses: Performance, reliability, analytic functionality, perhaps concurrency
http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/
Being the CEO of Infochimps, I felt compelled to share a little “chimpy” research with you… The “Infinite Monkey Theorem” is a METAPHOR that directly relates to Big Data, and one that I think you’ll appreciate. So what is the “Infinite Monkey Theorem”? The following definition is a variant of the original theorem… let me read it to you.
This theorem has been traced back to Aristotle's “On Generation and Corruption”, where he makes deductions about the unexperienced and unobservable based on real experiences and real observations. Think about this a little… we’re talking about analyzing real-world experiences and observations to predict what will happen… what will happen with our business in the future… the unexperienced and unobserved. This is fundamentally what Big Data proposes to help with…
So as a metaphor… the "monkey" is not an actual monkey, but a metaphor for an abstract device, one that produces a sequence of letters and symbols. And "almost surely" is a mathematical term with a precise meaning. Shakespeare’s Hamlet also represents a broader meaning… it represents any text, any work, any insight.
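To make "almost surely" concrete: an event happens almost surely when its probability is exactly 1. In the standard reading of the theorem, each block of random keystrokes matches the target with some tiny probability p, and the chance of at least one match in m independent blocks is 1 - (1 - p)^m, which tends to 1 as m grows without bound. The sketch below evaluates that formula; the six-letter target, the 26-key keyboard and the function names are assumptions chosen purely for illustration.

```python
import math


def p_single_attempt(text_length: int, alphabet_size: int) -> float:
    """Chance that one block of uniformly random keystrokes equals the target text."""
    return float(alphabet_size) ** -text_length


def p_at_least_one(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent blocks matches.

    Equals 1 - (1 - p) ** attempts, computed via log1p/expm1 so that a very
    small p is not lost to floating-point rounding.
    """
    return -math.expm1(attempts * math.log1p(-p))


# Illustrative target and keyboard; both are assumptions, not from the slides.
TARGET = "hamlet"
ALPHABET_SIZE = 26

p = p_single_attempt(len(TARGET), ALPHABET_SIZE)
for attempts in (10**6, 10**9, 10**12):
    print(f"{attempts:>16,d} attempts -> P(success) = {p_at_least_one(p, attempts):.6f}")
```

The probability climbs toward 1 as the number of attempts grows, which is the sense in which more "monkeys" (more compute) make the outcome ever closer to certain.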
So let’s look at this in more depth…
Infinite number of monkeys -> represents today’s seemingly unlimited computational power of either public or private Clouds… as an elastic delivery method.
Keys on a typewriter -> capture discrete transactions, which yield meaning only when analyzed together. Again, we amass the computational power to process data.
Almost surely -> is translated into a mathematical term, namely the concept of significance.
And finally, Shakespeare’s Hamlet is what we strive to create, and it is the source of our happiness: our translation of this raw resource into insight.
Now this may seem “chimpy”… but this is beautiful. I love this metaphor. But we have a LARGE problem…
We have a problem today WITH our data infrastructure… our ability to glean insights. I think all of you know what I’m referring to… It’s the fact that we’re operating on less than 15% of the corporate data available to us… even with the ENTERPRISE DATA WAREHOUSE, the EDW, which is supposedly storing a COMPLETE, SINGLE VIEW OF THE TRUTH… We’re still giving our business users… a tiny bit… a little bit of data.
The Business User
So why is an elastic, unlimited computational resource important?
Op-Ex vs. Cap-Ex
Cost reduction due to better utilization / productivity
Time-to-Market
Hedge funds and Wall Street firms are using Cold War-style satellite surveillance to gather market-moving information. The Port of Long Beach is the second-busiest container port in the United States and acts as a major gateway for trade between the US and Asia. With the activity from this port estimated at over $100 billion per year, this specific port is a location it will pay to keep track of. Satellite analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity, comparing activity month by month. This analysis is being performed in Amazon's EC2.
Now let’s talk about processing your enterprise data assets… your Big Data… again, we can leverage the cloud infrastructure to scale to the level of any processing needs you may have.
The current image shows a Walmart in Wichita, Kansas. Analysts count cars in Wal-Mart parking lots to measure overall customer traffic and to understand its growth versus the competition. For example, Wal-Mart's growth was determined to come mostly from areas of high unemployment. This type of analysis is being performed in Amazon's EC2…
The current image shows a Target in the Moraine Point Plaza located in Gardiner, North
Analysts comparing satellite parking lot data with regional unemployment trends found Target's growth tended to come in areas of lower-than-average unemployment. Again, these processes are being performed in Amazon EC2.
…this is interesting… but how do we process the data further to help derive more relevant insights?
http://www.cnbc.com/id/38738810/Spying_For_Profits_The_Satellite_Image_Indicator
The way this is performed is by taking data sources like images and storing them in Hadoop, then using Big Data tools like MapReduce to perform sophisticated analysis on those aggregated data sets. Why is this concept so disruptive? Things like a fraction of the price… no structured data model, a.k.a. no star schema… yet the ability to run sophisticated queries and algorithms against all your detailed data.
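The MapReduce pattern mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's API; the record layout, the store names and the car counts are all made up for the example, and the image processing that would produce such records is assumed to have happened upstream.

```python
from collections import defaultdict

# Hypothetical records extracted from satellite images upstream:
# (store id, month, cars counted in one image). Values are illustrative only.
RECORDS = [
    ("store-A", "2012-01", 120),
    ("store-A", "2012-01", 135),
    ("store-B", "2012-01", 80),
    ("store-A", "2012-02", 150),
]


def map_phase(record):
    """Emit one (key, value) pair per record, keyed by store and month."""
    store, month, cars = record
    yield (store, month), cars


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into an average car count."""
    return key, sum(values) / len(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


averages = run_job(RECORDS)
for key, avg in sorted(averages.items()):
    print(key, avg)
```

Note that no schema is declared anywhere: the map function decides at read time which fields matter, which is the "no star schema" point made in the note. In a real cluster the map and reduce functions would run on many nodes, with the framework handling the shuffle.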
The previous examples of Walmart and Target involved using a regression algorithm which was executed against the satellite data + other data to produce a quarterly revenue prediction which BEAT all previous models.
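As a rough illustration of the regression step, the sketch below fits an ordinary least-squares line from a single input (average cars per lot) to quarterly revenue. The numbers are synthetic and chosen only to show the mechanics; the model described in the note combined satellite data with other inputs, and its actual form is not given in the source.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    return slope * x + intercept


# Synthetic training data (NOT real figures): average cars per lot in a
# quarter, and that quarter's revenue in arbitrary units.
cars_per_lot = [100, 120, 140, 160]
revenue = [2.0, 2.4, 2.8, 3.2]

slope, intercept = fit_line(cars_per_lot, revenue)
forecast = predict(slope, intercept, 180)
print(f"slope={slope:.4f} intercept={intercept:.4f} forecast={forecast:.2f}")
```

Extending this to several inputs (pricing, sentiment, weather, local employment) would mean a multiple regression, typically solved with a linear algebra library rather than by hand.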
Which brings us to the discussion around insights.
Quote that sets the theme: the definition of the “Infinite Monkey Theorem”.