1. BIG DATA ON THE CLOUD
@ugurarpaci
@SercanKaraoglu
2. CONTENTS
3V Model
Development & Operational Challenges
Distributed Processing
Hadoop & Spark
AWS Spot Instance Management
Use Case: Apache Zeppelin, Spark
3. WHO WE ARE
Financial Data Provider Merging Different Markets
Applications on Different Platforms (Web, Mobile, Desktop, APIs)
Software Development Team ~50 People, 130 Total
Financial Data Application Management
4. 3V MODEL
VOLUME (high): 90% of the data in the world today has been created over the last two years alone
VELOCITY (high): High data generation speed
VARIETY (high): Data can arrive in any shape or format
5. METADATA, EVENTS, ACTIONS ARE BIG DATA
What you see is not the whole picture!
The tweet delivered to the end user looks roughly like this:
{
  "text": "This is a 140 chars",
  "created_at": <date>,
  "favourited": <boolean>
}
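The fields above are only a sliver of what each tweet actually carries. A hypothetical Scala model of the hidden metadata (field names loosely follow the public Twitter API; this is an illustration, not the exact schema):

```scala
// Hypothetical sketch of the metadata a single tweet drags along;
// field names loosely follow the public Twitter API, not an exact schema.
case class User(id: Long, screenName: String, followersCount: Int)

case class Tweet(
  text: String,        // the ~140 chars the end user actually sees
  createdAt: String,   // timestamp metadata
  favourited: Boolean, // per-viewer interaction state
  retweetCount: Int,   // engagement metadata
  lang: String,        // detected language
  user: User           // the full embedded author record
)

val t = Tweet("This is a 140 chars", "2016-01-01T00:00:00Z",
  favourited = false, retweetCount = 0, lang = "en",
  user = User(42L, "example", 100))
```

Multiply this per-tweet envelope by the event stream, and the metadata alone is big data.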
9. HADOOP - DISTRIBUTED PROCESSING
Hadoop Common: The common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
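To make the MapReduce programming model concrete, here is a toy word count in plain Scala collections, with the map, shuffle (group-by-key), and reduce phases spelled out. This only mimics the model on one machine; it does not use the Hadoop API:

```scala
// Toy word count illustrating the MapReduce phases on local collections.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // Map phase: emit a (word, 1) pair for every word
    val mapped = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
    // Shuffle phase: group all pairs sharing the same key
    val grouped = mapped.groupBy(_._1)
    // Reduce phase: sum the counts for each key
    grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }
}
```

In real Hadoop, the shuffle moves pairs across the network between mapper and reducer nodes; the per-phase logic stays this simple.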
15. RESOURCE MANAGEMENT ON THE CLOUD
Resource
Requirement
Orchestrated Cluster
Management
Accessibility
16. CLOUD STORAGE (AMAZON S3)
Separate compute and storage
Resize and shut down Spark instances (EMR, EC2) with no data loss
Point multiple Spark Clusters at the
same data in S3
Easily evolve your analytic infrastructure
as technology evolves
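One way to wire a cluster to S3 is Hadoop's s3a connector; a sketch of `spark-defaults.conf` entries (the property keys are real S3A settings, the values are placeholders — on EMR/EC2, IAM instance roles usually make the key entries unnecessary):

```
spark.hadoop.fs.s3a.access.key   <YOUR_ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key   <YOUR_SECRET_KEY>
spark.hadoop.fs.s3a.endpoint     s3.amazonaws.com
```

With this in place, any cluster can point `sc.textFile("s3a://...")` at the same bucket, which is what makes compute disposable.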
17. SPOT INSTANCE PROVISIONING PROCESS
Provisioning
Spinning-up
Service Discovery
Service Registry
Data Persistence
18. val conf = new SparkConf()
  .setAppName("Trading Statistics")
  .setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)
[Diagram: each executor runs tasks that read an HDFS/S3 block, process & cache the data, and return results to the driver]
USE CASE: SPARK + ZEPPELIN + S3
val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")
val trades = logFile.filter(line => line.startsWith("t;"))
  .map(toTradeObject)
  .groupBy(_.getSecurityName)
println(trades.count())
19. USE CASE: SPARK + ZEPPELIN + S3
Access logs are uploaded to S3
The Spark cluster pulls the access logs from s3://../../2016/*/*.log.gz
Data engineers write the necessary queries for the marketing department
The marketing department can view and evaluate the resulting analytics graphics and statistics directly in Zeppelin
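The per-security aggregation behind those Zeppelin graphs can be sketched in plain Scala collections (the `"t;"` trade-line format, the field layout, and the `Trade`/`tradesPerSecurity` names here are hypothetical stand-ins for the deck's `toTradeObject` pipeline):

```scala
// Local sketch of the log-crunching step; in production this runs as a
// Spark RDD pipeline over the S3 logs.
case class Trade(securityName: String)

// Assumes ';'-separated lines with the security name in field 1.
def toTradeObject(line: String): Trade =
  Trade(line.split(";")(1))

def tradesPerSecurity(logLines: Seq[String]): Map[String, Int] =
  logLines
    .filter(_.startsWith("t;"))   // keep trade lines only
    .map(toTradeObject)
    .groupBy(_.securityName)      // shuffle-equivalent: group by security
    .map { case (sec, ts) => (sec, ts.size) }
```

Swapping `Seq` for an RDD read from `sc.textFile` gives the distributed version shown on the previous slide.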