Intro to big data and hadoop ubc cs lecture series - g fawkes

The document is an introduction to analytics and big data using Hadoop, presented by Geoff Fawkes. It discusses the challenges posed by large amounts of data and how Hadoop addresses them through its HDFS distributed file system and MapReduce programming model. It gives examples of how companies use Hadoop for applications such as analyzing customer behavior from set-top cable boxes or performing sentiment analysis on product reviews. The presentation closes with recommended reading on analytics, big data, and data science.

Introduction to Analytics and Big Data - Hadoop

The University of British Columbia
Computer Science Alumni/Industry Lecture Series
Geoff Fawkes
November, 2013

© 2013 Geoff Fawkes. All Rights Reserved.

1 / 450

Who am I?
- Director of Engineering, Teradata
- Various engineering roles at HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc.
- Technology executive, mentor, software engineer
  - B.Sc. Comp Sci (UBC), MBA Executive (SFU)
- Interruptive (disruptive?) personality
  - Please ask questions of me / each other as we go along
  - I don’t have all the answers – you do!
- Credits: Rob Pegler, SNIA Education
  - Storage Networking Industry Association, 2012
- Who’s paying attention - 450 slides page count?
  - Not that “big” - about 50

Big Data and Hadoop
- History
- Data Challenges
- Why Hadoop?

Customer Challenges: The Data Deluge

Big Data is Different than Business Intelligence

Questions From Business Will Vary


Web 2.0 is “Data Driven”


The World of Data-Driven Applications

Attributes of Big Data


Top Ten Common Big Data Problems


Industries Are Embracing Big Data


Why Hadoop?


Why Hadoop?


Storage and Memory B/W Lagging CPU

Commodity Hardware Economics


What is Hadoop?
- Hadoop Adoption
- HDFS
- MapReduce
- Examples
- Ecosystem Projects

Hadoop Adoption in the Industry


What is Hadoop?


What is Hadoop?


HDFS 101 – The Data Set System


HDFS Organization and Replication


Hadoop Server Roles - Multiple


Hadoop Cluster


HDFS File Write Operation - Instance


HDFS File Read Operation - Instance


HDFS File Operation R/W Replication


MapReduce 101 – Functional Programming Meets Distributed Processing

What is MapReduce?


Key MapReduce Terminology


MapReduce Basic Concepts


Example 1: MapReduce Operation


Example 2: Sample Dataset


MapReduce Paradigm – UNIX Cmd


Example 3: Count Words


Ex. 3: Lifecycle of a MapReduce Job
[Figure labels: Map function; Reduce function; run this program as a MapReduce job]


Ex. 3: Lifecycle of a MapReduce Job
[Figure: timeline showing input splits, then Map waves 1 and 2, then Reduce waves 1 and 2]

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

MapReduce Job Configuration Parms
- 190+ parameters in Hadoop
- Set manually, or defaults are used

Putting it all Together: MapReduce + HDFS

Hadoop Ecosystem Projects

- Hive: Interactive SQL Query & Modeling
- Pig: Data flow for tedious MapReduce Jobs
- HBase: Columnar NoSQL Store

Compare: Hadoop, SQL, Massively Parallel Processing (MPP)

Compare: RDBMS and MapReduce


Hadoop Use Cases
- Set Top Cable TV Boxes
- Pay Per View Advertising
- Bank Risk Modelling
- Product Sentiment Analysis

Example 1: Set Top Cable TV Boxes


Example 2: Pay Per View Advertising


Example 3: Bank Risk Modelling


Example 4: Product Sentiment Analysis


More Reading?
- World Economic Forum: “Personal Data: The Emergence of a New Asset Class”, 2011
- McKinsey Global Institute: “Big Data: The next frontier for innovation, competition, and productivity”
- “Big Data: Harnessing a game-changing asset”
- IDC: “2011 Digital Universe Study: Extracting Value from Chaos”
- The Economist: “Data, Data Everywhere”
- “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field”
- O’Reilly: “What is Data Science?”
- O’Reilly: “Building Data Science Teams”
- O’Reilly: “Data for the public good”
- Obama Administration: “Big Data Research and Development Initiative”

Introduction to Analytics and Big Data – Hadoop
Q&A
Geoff Fawkes
http://www.linkedin.com/pub/geoff-fawkes/1/269/202
@gfawkes
November, 2013


Editor's notes

Housekeeping: Keep your mobile devices on, turn up the ringer volume really loud, tweet, check in on Foursquare, update your Facebook as I speak. We now live in a multitasking world, so I’m OK with interruptions. Ask questions. If I don’t have the answer, someone else may, and you can drop me an email afterwards. How many pages?!
Introductory presentation for new hires at Teradata. Mixture of business and engineering concepts. Scratch the surface – references at the end of presentation.
Zettabyte = 10 to the power of 21 bytes
Teradata used Tableau
Baidu is a Chinese-language search engine, comparable to Google. Quote from William Gibson, author and poet. He coined the term “cyberspace” in his 1982 short story “Burning Chrome” and popularized it in his 1984 novel Neuromancer. Predicted the rise and popularity of reality TV.
- Structured data: a defined format, such as an XML document or database tables.
- Semi-structured data: there may be a schema, but it is often ignored, e.g. a spreadsheet, in which cells/fields can store any type of data.
- Unstructured data: no particular internal structure, e.g. plain text, an image file, a Twitter feed.
A commonly cited estimate is that 80% of Big Data is unstructured.
If Gartner says so, it must be right ;>) Motivations for Hadoop, i.e. the problems with traditional distributed processing:
- Heavy dependency on the network and huge bandwidth demands
- Scaling up and down is not a smooth process
- Partial failures are difficult to handle
- A lot of processing power is spent on transporting data
- Data synchronization is required during exchange
As a developer, you should not have to handle these issues in your application. These are the problems that Hadoop solves, leaving you to focus on business logic.
Basic I/O problem: while the storage capacity of hard drives has increased, access speed (the rate at which data can be read) has not kept pace. E.g. 1 TB drives are normal, but at a transfer rate of about 100 MB/s it takes more than 2.5 hours to read all the data on the drive.
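The read-time estimate in this note can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes) and a sustained rate of 100 MB/s; the function name and the 100-drive comparison are illustrative and not taken from the talk.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption)
MB = 10**6   # decimal megabyte, in bytes (assumption)


def full_read_hours(capacity_bytes, rate_bytes_per_s, drives=1):
    """Hours to read `capacity_bytes` spread evenly over `drives` disks read in parallel."""
    seconds = capacity_bytes / (rate_bytes_per_s * drives)
    return seconds / 3600


one_drive = full_read_hours(1 * TB, 100 * MB)
hundred_drives = full_read_hours(1 * TB, 100 * MB, drives=100)

print(f"1 drive: {one_drive:.2f} hours")                 # 2.78 hours
print(f"100 drives: {hundred_drives * 60:.1f} minutes")  # 1.7 minutes
```

Spreading the same data over many drives and reading them in parallel is what turns hours into minutes, which is the motivation for a distributed file system.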
The world continues to move towards commodity hardware.
Commercial companies focused on developing and supporting Hadoop: Hortonworks, Cloudera, Amazon Web Services (AWS)
In simpler terms, Hadoop is a framework that lets many machines work together to analyze large data sets. The Hadoop framework provides reliability and data motion. MapReduce divides an application’s work into many small fragments, each of which may be executed or re-executed on any node in the cluster. Data is stored across many compute nodes, which gives HDFS very high aggregate bandwidth across the cluster. Node failures are handled automatically by the framework through parallelism, heartbeats, checksums and replication.
The Hadoop platform consists of the Hadoop kernel (implemented in Java), MapReduce (jobs can be written in any programming language, e.g. through Hadoop Streaming) and HDFS (Hadoop Distributed File System). HDFS can be accessed natively through a Java API for applications to use (a C language wrapper is also available). Underlying local file systems: ext3, the third extended file system commonly used with the Linux kernel, is supported; XFS is a 64-bit journaling file system that supports parallel I/O.
Blocks: a disk block is typically 512 bytes, a file system block is a few kilobytes, and an HDFS block is 64 MB by default (often configured as 128 MB). An HDFS block is much larger than a disk block to minimize the cost of disk seeks. HDFS files are write-once: once written and closed, they cannot be changed. A typical single file in HDFS is gigabytes to terabytes in size.
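A short sketch can make the block-size reasoning concrete. It counts the HDFS blocks needed for a file and estimates what fraction of each block read is spent seeking. The 10 ms seek time and 100 MB/s transfer rate are assumed round figures, and the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def block_count(file_bytes, block_bytes=64 * MIB):
    """Number of HDFS blocks needed for a file (the last block may be partly filled)."""
    return -(-file_bytes // block_bytes)  # ceiling division on integers


def seek_fraction(block_bytes, seek_s=0.010, rate_bytes_per_s=100e6):
    """Fraction of one block read spent seeking (assumed 10 ms seek, 100 MB/s transfer)."""
    transfer_s = block_bytes / rate_bytes_per_s
    return seek_s / (seek_s + transfer_s)


print(block_count(1 * GIB))              # 16 blocks of 64 MiB
print(block_count(1 * GIB, 128 * MIB))   # 8 blocks of 128 MiB
print(f"{seek_fraction(4 * KIB):.1%}")   # seek time dominates for a 4 KiB block
print(f"{seek_fraction(64 * MIB):.1%}")  # transfer time dominates for a 64 MiB block
```

Under these assumptions a few-kilobyte block spends nearly all of its read time seeking, while a 64 MiB block spends under 2%, which is the trade-off the note describes.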
Terminology. A set of machines is a Hadoop cluster, using a master-slave architecture. Each Hadoop instance has a single NameNode and a cluster of DataNodes. The NameNode is the software that maintains the file system structure and metadata for the DataNodes. A DataNode is the software that stores and retrieves blocks of data. There can be up to 4,000 slave DataNodes per NameNode. The Job Tracker, on the master (NameNode) side, takes care of scheduling and tracking MapReduce task execution. The Task Tracker, on each DataNode, runs the MapReduce tasks assigned to that node. The NameNode does not require a lot of disk space but requires a lot of RAM (it is the brains of the instance). A DataNode does not require a lot of RAM but requires a lot of disk space. Failover: the transition from the active NameNode to a standby NameNode, coordinated by a failover controller that uses ZooKeeper.
HDFS is designed to run on commodity hardware: low-cost servers running Linux/Apache. The philosophy of the cluster design is to bring computing as close as possible to the data. All HDFS communication protocols are layered on top of the TCP/IP protocol. The NameNode and DataNodes can be located anywhere.
A single instance is a single HDFS cluster.
Hardware failure and data corruption are the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. With so many components, each with a non-trivial probability of failure, some component of HDFS is always non-functional (dead). By default, each block is replicated 3 times (the application can change this in configuration). Replica placement is heavily studied for optimization: HDFS’s policy is to put one replica on a node in the local rack and distribute the other replicas to other nodes and other racks, with the goal of improving data reliability, availability and network bandwidth utilization. Separate from file operations, the NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. If the NameNode itself crashes, its metadata has to be restored from the copy on disk. ZooKeeper provides NameNode failover coordination for high-availability setups with active/passive NameNodes.
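The heartbeat and replication ideas in this note can be sketched as a toy model. This is not HDFS code: the class, the function, the node names and the 30-second timeout are all hypothetical, chosen only to show how missed heartbeats lead to a list of blocks that need re-replication.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Toy stand-in for the NameNode's view of DataNode liveness."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # illustrative value, not an HDFS default
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


def under_replicated(block_locations: Dict[str, List[str]],
                     dead: List[str],
                     replication: int = 3) -> Dict[str, int]:
    """Map each block with fewer than `replication` live replicas to its live count."""
    live = {blk: sum(1 for n in nodes if n not in dead)
            for blk, nodes in block_locations.items()}
    return {blk: count for blk, count in live.items() if count < replication}


monitor = HeartbeatMonitor(timeout_s=30.0)
for node in ("dn1", "dn2", "dn3", "dn4"):
    monitor.heartbeat(node, now=0.0)
for node in ("dn1", "dn2", "dn4"):  # dn3 stops reporting
    monitor.heartbeat(node, now=40.0)

dead = monitor.dead_nodes(now=45.0)
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}
print(dead)                            # ['dn3']
print(under_replicated(blocks, dead))  # {'blk_1': 2}
```

In this model a block that drops below the replication factor is simply reported; a real cluster would then schedule a new copy on a healthy node.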
Analogy to UNIX is a large distributed pipeline
Map Server/Function 1, Map Server/Function 2 and Map Server/Function 3 each process in parallel. In MapReduce, every input is viewed as a key-value pair, e.g. Key = Sentence 1, Value = “John has a red car, which has no radio”.
Step 1: Each sentence is given to a Map task, and each word is counted in a wave. In this example, there are 3 Map tasks.
Step 2: Shuffle and sort moves the words to server locations so that all occurrences of each unique key are brought together.
Step 3: The words on each server are aggregated and reduced. In this example Reduce is performed across two waves. The final output is on the lower right.
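The three steps of the word-count example can be imitated in a single process. The sketch below is plain Python rather than the Hadoop API, and it runs sequentially where a cluster would run map and reduce tasks in parallel. Only the first sentence comes from the notes; the other two are hypothetical inputs.

```python
from collections import defaultdict


def map_words(sentence):
    """Map: emit a (word, 1) pair for every word in one sentence."""
    for token in sentence.lower().split():
        word = token.strip('.,;:!?"')
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle and sort: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Reduce: sum the values collected for each key."""
    return {word: sum(values) for word, values in groups.items()}


sentences = [
    "John has a red car, which has no radio",  # from the notes
    "Mary has a blue car",                     # hypothetical input
    "The radio is red",                        # hypothetical input
]
pairs = [pair for s in sentences for pair in map_words(s)]
counts = reduce_counts(shuffle(pairs))
print(counts["has"], counts["car"], counts["radio"])  # 3 2 2
```

Each call to `map_words` needs only its own sentence, which is why the map step parallelizes; the shuffle is the only point where data for the same key must meet.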
As a developer you have to start thinking about your data storage problem in a distributed way, instead of in a monolithic way.
Step 1: Data is broken into file splits of 64 MB (or 128 MB) and the blocks are moved to different DataNodes.
Step 2: Once all the blocks are moved, the Hadoop framework passes your program on to each DataNode.
Step 3: The Job Tracker then starts scheduling the program on individual DataNodes.
Step 4: Once all the DataNodes are done, the output (yellow) is written back.
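As a rough illustration of how split size and cluster capacity relate to the map waves shown earlier in the deck, the sketch below estimates the number of splits and waves for a job. It assumes one map task per split and a fixed number of map slots; the 10 GiB input and the 20-node, 4-slot cluster are made-up figures, and `plan_job` is a hypothetical helper.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def plan_job(input_bytes, split_bytes, map_slots):
    """Estimate splits, map tasks and map waves, assuming one map task per split."""
    splits = ceil_div(input_bytes, split_bytes)
    return {
        "splits": splits,
        "map_tasks": splits,
        "map_waves": ceil_div(splits, map_slots),
    }


# Hypothetical cluster: 20 nodes with 4 map slots each.
plan = plan_job(10 * GIB, 64 * MIB, map_slots=20 * 4)
print(plan)  # {'splits': 160, 'map_tasks': 160, 'map_waves': 2}
```

Doubling the split size to 128 MiB in this example halves the task count to 80, so all map tasks would fit in a single wave.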
Also built on top of Hadoop are the helper applications:
- Hive: interactive SQL query and modeling using a data warehouse view of HDFS. Projects a table structure onto the data set and then manipulates it with HiveQL.
- Pig: data flow for tedious MapReduce jobs. A language for expressing data analysis and infrastructure processes.
- HBase: columnar NoSQL store for billions of rows
- HCatalog: table and schema management
- ZooKeeper: coordination of NameNode failover to a backup
- Ambari: management tool
Download commercial implementations: Hortonworks (Sandbox is a single node download), Cloudera, Amazon services
The question is not “Why should I care about Big Data?” but rather “How can I get closer to Big Data and start taking advantage of it?” Thanks to Peter Smith and Michel Ng for organizing. If you have a topic you would like to present on, see Peter and contribute your expertise to the tech ecosystem in Vancouver. Send me questions via LinkedIn and a copy will be posted to my profile. Hootsuite, Quickmobile and a few others in Vancouver are looking for analytics developers, so have a look.
