Building the Infrastructure for Big Data

@ The Fifth Elephant
July 27th, 2012
Prashant Kumar, Founder, PromptCloud




              © PromptCloud 2012, All rights reserved
Agenda
About

Context

Machines, Installation & Cloud Automation

Building blocks of a system

Sample application sketch

Components skipped for lack of time
About
                            Section 0




About PromptCloud
We provide data feeds, and feed ourselves on data, since 2009
How?
• Large-scale data crawl and extraction
• Hosted indexing
• Custom data analytics
• Working round the clock

                     About Me
 • PromptCloud’s Founder
 • Yahoo! - 2007-2008
 • IIT-Kanpur CS- 2007
Deliverable




Context
                            Section 0.1




Generic Big Data Systems
• Multiple nodes (incoherent set of coherent ones)
• Compute layer: interdependent processes
• Data storage layer & multiple middleware
• Tools for installation, monitoring & scheduling
* Meta: source control, code reviews, continuous integration



Machines, Installation & Cloud Automation
                            Section 1




Installation
Create an image and install

Pros:
• Easy to install
• No maintenance cost
• 1 image for 1 purpose

Cons:
• Modifications? Difficult to save back into the image
• Apt, yum, etckeeper-like systems exist, but are difficult to scale

Solutions?
Enter the Magic!




Not a panacea; analgesic though


Virtual Machines

• Vagrant: init, up, ssh, shared directory, port forwarding
• Virtual machines on AWS, Xen, KVM, …
• VirtualBox installation




Code the Installation using Chef
Give the recipe: code what’s to be done

• Chef Solo ("I’m Solo")
• Chef Server
• Knife
• Roles, recipes, templates, run list, data files
Building blocks
                                 Section 2




To keep processes running:

• Option 1: install God to monitor processes and keep them in place
• Option 2 (for atheists): install Monit

Image courtesy: BIT Mesra




God’s Snippet
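# watcher_name, start_command and stop_command are assumed to be defined earlier in the config file
God.watch do |w|
    w.name = watcher_name             # unique name for this watch
    w.start = start_command           # shell command God runs to start the process
    #w.restart = restart_command
    w.stop = stop_command             # shell command God runs to stop the process
    w.behavior(:clean_pid_file)       # remove a stale PID file before starting
    #w.group = "some group"
    w.log = "/tmp/god_monitoring_#{watcher_name}.log"
    w.keepalive                       # start the process if it is not running
    w.stop_timeout = 10.seconds
end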

Job Scheduling
Resque, Beanstalk, Gearman, Celery, plus cron and queues

Things to remember while making a choice:

• Persistence
• Priorities
• Tags
• Option for retry
• Ability to inspect the queue
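
The criteria above can be made concrete with a toy queue. This is a minimal sketch in plain Ruby, not the API of Resque, Beanstalk, Gearman or Celery; the names TinyQueue, Job, push, peek and work_one are hypothetical. It shows priorities, tags, retry and queue inspection; persistence is deliberately left out, since everything lives in memory.

```ruby
# Hypothetical in-memory job queue; illustrative only, no real library's API.
Job = Struct.new(:name, :priority, :tags, :attempts)

class TinyQueue
  def initialize(max_attempts: 3)
    @jobs = []
    @max_attempts = max_attempts
  end

  # Enqueue a job; a lower priority number runs first. Tags are optional.
  def push(name, priority: 10, tags: [])
    @jobs << Job.new(name, priority, tags, 0)
    self
  end

  # Inspect pending job names in priority order, optionally filtered by tag.
  def peek(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.sort_by(&:priority).map(&:name)
  end

  # Run the highest-priority job with the given block.
  # On failure the job is re-enqueued until max_attempts is reached.
  def work_one
    idx = @jobs.each_index.min_by { |i| @jobs[i].priority }
    return nil if idx.nil?
    job = @jobs.delete_at(idx)
    job.attempts += 1
    begin
      yield job
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retried
      else
        :failed
      end
    end
  end

  def size
    @jobs.size
  end
end
```

A real system would also have to survive a restart, which is where the persistence criterion comes in: an in-memory queue like this one loses every pending job when the process dies.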


Data Storage Layer
SQL/NoSQL, key/value, document-based, graph databases

• For large systems, maintenance cost is a primary overhead
• Replication & Availability
• Consistency guarantees
• Full-text search



Voldemort
(Not me! Image courtesy: harrypotter.wikia.com)
• Distributed key/value store
• Great performance
• Easy to add/remove nodes
• Alternatives: Mongo, Riak, HBase, Cassandra
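
One common way a distributed key/value store keeps node addition and removal cheap is consistent hashing: each node owns several points on a hash ring, and a key belongs to the first point at or after its own hash. The following is a generic sketch of that idea in plain Ruby, not Voldemort's client API; the class name HashRing, the replicas parameter and the node names in the usage note are assumptions.

```ruby
require "digest"

# Hypothetical consistent-hash ring; illustrative only.
class HashRing
  def initialize(nodes = [], replicas: 64)
    @replicas = replicas   # virtual points per node, to spread keys evenly
    @ring = {}             # ring position => node name
    @sorted = []           # ring positions in ascending order
    nodes.each { |n| add(n) }
  end

  def add(node)
    @replicas.times { |i| @ring[position("#{node}:#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, owner| owner == node }
    @sorted = @ring.keys.sort
  end

  # The owner is the node at the first ring position at or after the key's
  # hash, wrapping around to the start of the ring.
  def node_for(key)
    return nil if @sorted.empty?
    pos = position(key)
    idx = @sorted.bsearch_index { |p| p >= pos } || 0
    @ring[@sorted[idx]]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def position(str)
    Digest::MD5.hexdigest(str)[0, 8].to_i(16)
  end
end
```

With a ring built as HashRing.new(%w[node-a node-b node-c]), calling remove("node-c") only reassigns the keys that node-c owned; keys on the other two nodes stay where they were.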


Messaging Layer

• RabbitMQ: most commonly used in high-load production systems
• Implements AMQP
• Robust exchange server
• Multiple kinds of exchanges: direct, topic, fanout
• Options for HA with Pacemaker/DRBD
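
The three exchange kinds differ only in how a message's routing key is compared with each queue's binding key. The sketch below models that comparison in plain Ruby under the AMQP rules (fanout ignores the key, direct needs an exact match, topic treats "*" as exactly one word and "#" as zero or more words). It is not the RabbitMQ client API; the Exchange class and the queue names are hypothetical.

```ruby
# Hypothetical in-memory model of AMQP exchange routing; illustrative only.
class Exchange
  def initialize(type)
    @type = type       # :direct, :topic or :fanout
    @bindings = []     # pairs of [binding_key, queue_name]
  end

  def bind(queue, key = "")
    @bindings << [key, queue]
    self
  end

  # Names of the queues that would receive a message with this routing key.
  def route(routing_key)
    @bindings.select { |key, _| matches?(key, routing_key) }.map(&:last).uniq
  end

  private

  def matches?(binding_key, routing_key)
    case @type
    when :fanout then true                         # routing key is ignored
    when :direct then binding_key == routing_key   # exact match only
    when :topic  then topic_match?(binding_key.split("."), routing_key.split("."))
    else raise ArgumentError, "unknown exchange type: #{@type}"
    end
  end

  # Compare dot-separated words: "*" is one word, "#" is zero or more words.
  def topic_match?(pattern, words)
    return words.empty? if pattern.empty?
    head, *rest = pattern
    if head == "#"
      (0..words.size).any? { |n| topic_match?(rest, words.drop(n)) }
    elsif words.empty?
      false
    else
      (head == "*" || head == words.first) && topic_match?(rest, words.drop(1))
    end
  end
end
```

For instance, a topic exchange with a queue bound to "crawl.*" routes "crawl.done" to it but not "crawl.site.done", while a queue bound to "crawl.#" receives both.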
Demo
                            Section 3




Demo Sketch
1. We’ll generate random sentences using a Markov chain

2. Store these in Voldemort

3. Enqueue corresponding jobs in RabbitMQ

4. Another set of workers will process these sentences
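
The four steps can be walked through end to end in a single process. This is a sketch under loud assumptions, not the demo code from the talk: a Hash stands in for Voldemort, an Array stands in for a RabbitMQ queue, "processing" is just a word count, and the names build_chain, generate and run_demo are hypothetical. The corpus is assumed to be non-empty.

```ruby
# First-order Markov chain: each word maps to the words that followed it.
def build_chain(text)
  chain = Hash.new { |h, k| h[k] = [] }
  text.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

# Walk the chain from a start word until a dead end or the length limit.
def generate(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.size < max_words
    followers = chain.fetch(words.last, [])   # fetch avoids creating new keys
    break if followers.empty?
    words << followers.sample(random: rng)
  end
  words.join(" ")
end

# Returns [store, results]: the stored sentences and the per-job word counts.
def run_demo(corpus, count: 3, rng: Random.new)
  store = {}    # stand-in for Voldemort: id => sentence
  queue = []    # stand-in for a RabbitMQ queue: holds sentence ids
  chain = build_chain(corpus)
  start = corpus.split.first

  count.times do |i|
    id = "sentence-#{i}"
    store[id] = generate(chain, start, rng: rng)   # steps 1 and 2
    queue.push(id)                                 # step 3
  end

  results = {}
  until queue.empty?                               # step 4: the worker loop
    id = queue.shift
    results[id] = store.fetch(id).split.size
  end
  [store, results]
end
```

Note that the queue carries only ids while the sentences live in the store, so a worker needs nothing but the id to find its input.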


For lack of time...
                            Section 4




Sensu & Graphite
• Monitoring router
• "Check scripts" on nodes
• "Handler scripts" on servers
• Output can be sent to PagerDuty, Graphite, Twitter or IRC
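
A check script reports its result through its exit status and a line of output, following the Nagios plugin convention that Sensu checks use: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below shows only the decision logic; the function name check_queue_depth, the thresholds and the idea of checking a job queue's depth are assumptions made for illustration.

```ruby
# Exit statuses from the Nagios plugin convention used by check scripts.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical check: compare a queue depth against warning/critical thresholds.
# Returns [status, message] so the logic can be tested without exiting.
def check_queue_depth(depth, warn: 100, crit: 1000)
  return [UNKNOWN, "UNKNOWN: depth is not a number"] unless depth.is_a?(Numeric)

  if depth >= crit
    [CRITICAL, "CRITICAL: #{depth} jobs waiting"]
  elsif depth >= warn
    [WARNING, "WARNING: #{depth} jobs waiting"]
  else
    [OK, "OK: #{depth} jobs waiting"]
  end
end

# A standalone check script would end by printing the message and exiting
# with the status, for example:
#   status, message = check_queue_depth(current_depth)
#   puts message
#   exit status
```

Handlers on the server side then decide what to do with a non-zero status, such as paging someone or posting to IRC.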




Distributed Log Collection
               Scribe, Flume, Splunk

Flume
• Allows multiple topologies
• Agent
• Collector
• Sink


Feel free to reach out



Big Data made Small
info@promptcloud.com
Appreciate your time




Thanks to Arpan Jha for her help with the slides

 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Building infrastructure for Big Data

  • 1. Building the Infrastructure for Big Data @ The Fifth Elephant July 27th, 2012 -Prashant Kumar, Founder- PromptCloud 1 © PromptCloud 2012, All rights reserved
  • 2. Agenda
      • About
      • Context
      • Machines, Installation & Cloud Automation
      • Building blocks of a system
      • Sample application sketch
      • Components left out for lack of time
  • 3. About (Section 0)
  • 4. About PromptCloud: we provide data feeds, and feed ourselves on data, since 2009. How?
      • Large-scale data crawl and extraction
      • Hosted indexing
      • Custom data analytics
      • Working round the clock
    About Me
      • PromptCloud’s Founder
      • Yahoo!, 2007-2008
      • IIT Kanpur CS, 2007
  • 5. Deliverable
  • 6. Context (Section 0.1)
  • 7. Generic Big Data Systems
      • Multiple nodes (incoherent set of coherent ones)
      • Compute layer: interdependent processes
      • Data storage layer & multiple middleware
      • Tools for installation, monitoring & scheduling
      • Meta: source control, code reviews, continuous integration
  • 8. Machines, Installation & Cloud Automation (Section 1)
  • 9. Installation
      • Create an image and install it back
          • Easy to install
          • No maintenance cost
          • Modifications? Difficult to save
          • 1 image for 1 purpose
      • Apt, yum, and etckeeper-like systems, but difficult to scale
      • Solutions?
  • 10. Enter the Magic! Not a panacea; an analgesic, though
  • 11. Virtual Machines (diagram labels)
      • Vagrant: init, up, ssh; shared directory; port forwarding
      • Virtual machines: AWS, Xen, VirtualBox, KVM, …
      • Installation
  • 12. Code the Installation using Chef
      • Give the recipe: code what’s to be done
      • Diagram labels: “I’m Solo”, Chef Server, Knife; Roles, Data Files, Recipes, Templates, Run List
  • 13. Building blocks (Section 2)
  • 14. To keep processes running
      • Option 1: install God to monitor processes and keep them in place
      • Option 2 (for atheists): install Monit
      • Image courtesy: BIT Mesra
  • 15. God’s Snippet
        God.watch do |w|
          w.name = watcher_name
          w.start = start_command
          #w.restart = restart_command
          w.stop = stop_command
          w.behavior(:clean_pid_file)
          #w.group = "some group"
          w.log = "/tmp/god_monitoring_#{watcher_name}.log"
          w.keepalive
          w.stop_timeout = 10.seconds
        end
  • 16. Job Scheduling: Resque, Beanstalk, Gearman, Celery, plus cron and queues. Things to remember while making choices:
      • Persistence
      • Priorities
      • Tags
      • Option for retry
      • Ability to inspect the queue
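
The selection criteria on the job scheduling slide can be made concrete with a small in-memory sketch. This is not the API of Resque, Beanstalk, Gearman, or Celery; `SketchQueue` and its methods are hypothetical names used only to illustrate priorities, tags, retries, and queue inspection. Persistence is deliberately left out, since it depends on the backing store a real system provides.

```ruby
# Hypothetical in-memory queue illustrating priorities, tags, retry, and inspection.
# It does NOT persist jobs; a real scheduler would back this with durable storage.
class SketchQueue
  Job = Struct.new(:name, :priority, :tags, :attempts)

  attr_reader :failed

  def initialize(max_attempts: 3)
    @jobs = []
    @failed = []
    @max_attempts = max_attempts
  end

  # Lower number means higher priority.
  def enqueue(name, priority: 10, tags: [])
    @jobs << Job.new(name, priority, tags, 0)
    self
  end

  # Inspect pending job names in execution order, optionally filtered by tag.
  def peek(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.each_with_index.sort_by { |j, i| [j.priority, i] }.map { |j, _| j.name }
  end

  # Run the highest-priority job with the given block.
  # On failure the job is re-enqueued until max_attempts is reached.
  def work_one
    job, idx = @jobs.each_with_index.min_by { |j, i| [j.priority, i] }
    return nil unless job

    @jobs.delete_at(idx)
    job.attempts += 1
    begin
      yield job
      :ok
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retry
      else
        @failed << job
        :failed
      end
    end
  end
end
```

A caller would enqueue jobs with `enqueue("crawl", priority: 1, tags: ["crawl"])`, check `peek` to see what is pending, and call `work_one { |job| ... }` in a worker loop; jobs that exhaust their attempts end up in `failed` for later inspection.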
  • 17. Data Storage Layer: SQL/NoSQL, key/value, document-based, graph databases
      • For large systems, maintenance cost is a primary overhead
      • Replication & availability
      • Consistency guarantees
      • Full-text search
  • 18. Voldemort (“Not me!!!!!!!!”)
      • Distributed key/value store
      • Great performance
      • Easy to add/remove nodes
      • Alternatives: Mongo, Riak, HBase, Cassandra
      • Image courtesy: harrypotter.wikia.com
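
The slides do not say how keys are placed on nodes. One common way a distributed key/value store keeps node addition and removal cheap is consistent hashing, sketched below. `HashRing`, the `replicas` parameter, and the use of MD5 are illustrative assumptions, not a description of Voldemort's internals.

```ruby
require "digest/md5"

# Hypothetical consistent-hash ring: each node owns several points on a ring,
# and a key belongs to the first node point at or after the key's hash.
class HashRing
  def initialize(nodes, replicas: 50)
    @replicas = replicas
    @ring = {}
    @sorted = []
    nodes.each { |n| add(n) }
  end

  def add(node)
    @replicas.times { |i| @ring[point("#{node}:#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, owner| owner == node }
    @sorted = @ring.keys.sort
  end

  def node_for(key)
    return nil if @sorted.empty?

    h = point(key)
    # Wrap around to the first point when the hash is past the last one.
    idx = @sorted.bsearch_index { |p| p >= h } || 0
    @ring[@sorted[idx]]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def point(str)
    Digest::MD5.hexdigest(str)[0, 8].to_i(16)
  end
end
```

The property that matters here is that removing a node only moves the keys that node owned; keys on the surviving nodes stay where they were.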
  • 19. Messaging Layer
      • RabbitMQ: most commonly used in high-load production systems
      • Implements AMQP
      • Robust exchange server
      • Multiple kinds of exchanges: direct, topic, fanout
      • Options for HA with Pacemaker/DRBD
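
The three exchange kinds named on the messaging slide differ only in how a message's routing key is compared with each queue's binding key. The sketch below models that decision in plain Ruby; it does not talk to RabbitMQ, and `route`, `topic_match?`, and the binding list format are hypothetical names chosen for illustration. The topic rules follow AMQP: `*` stands for exactly one dot-separated word and `#` for zero or more.

```ruby
# Does an AMQP-style topic pattern match a routing key?
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

def match_words(pat, key)
  return key.empty? if pat.empty?

  head, *rest = pat
  case head
  when "#"
    # '#' may absorb zero or more words.
    (0..key.length).any? { |n| match_words(rest, key.drop(n)) }
  when "*"
    # '*' must absorb exactly one word.
    !key.empty? && match_words(rest, key.drop(1))
  else
    !key.empty? && key.first == head && match_words(rest, key.drop(1))
  end
end

# bindings is a list of [binding_key, queue_name] pairs.
# Returns the queue names that would receive the message.
def route(exchange_type, bindings, routing_key)
  matched = bindings.select do |binding_key, _queue|
    case exchange_type
    when :fanout then true                          # every bound queue
    when :direct then binding_key == routing_key    # exact match only
    when :topic  then topic_match?(binding_key, routing_key)
    else raise ArgumentError, "unknown exchange type: #{exchange_type}"
    end
  end
  matched.map { |_key, queue| queue }
end
```

With bindings such as `[["crawl.*", "q1"], ["crawl.#", "q2"]]`, a topic exchange delivers `crawl.site` to both queues, while `crawl.site.page` reaches only the `#` binding.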
  • 20. Demo (Section 3)
  • 21. Demo Sketch
      1. We’ll generate random sentences based on a Markov chain
      2. Store these in Voldemort
      3. Enqueue corresponding jobs in RabbitMQ
      4. Another set of workers will process these sentences
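
The four steps of the demo sketch can be walked through in a single process. The demo code itself is not part of the slides, so everything below is an assumption: a plain Hash stands in for Voldemort, Ruby's `Queue` stands in for RabbitMQ, and the "processing" step simply counts words. All function names are hypothetical.

```ruby
# Step 1a: build a first-order Markov chain (word -> observed next words).
def build_chain(corpus)
  chain = Hash.new { |h, k| h[k] = [] }
  corpus.each do |sentence|
    sentence.split.each_cons(2) { |a, b| chain[a] << b }
  end
  chain
end

# Step 1b: walk the chain from a start word until a dead end or max_words.
def generate(chain, start, rng, max_words: 12)
  words = [start]
  while words.length < max_words
    nexts = chain.fetch(words.last, [])
    break if nexts.empty?

    words << nexts[rng.rand(nexts.length)]
  end
  words.join(" ")
end

# Steps 2 and 3: store each sentence under a key, then enqueue the key as a job.
def produce(chain, starts, rng, store, jobs, count)
  count.times do |i|
    key = "sentence:#{i}"
    store[key] = generate(chain, starts[rng.rand(starts.length)], rng)
    jobs << key
  end
end

# Step 4: workers drain the job queue and process the stored sentences.
# Processing here is a placeholder: count the words in each sentence.
def run_workers(store, jobs, worker_count: 2)
  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      begin
        loop do
          key = jobs.pop(true) # non-blocking; raises ThreadError when empty
          results << [key, store.fetch(key).split.length]
        end
      rescue ThreadError
        # Queue drained; this worker exits.
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```

Passing job keys through the queue rather than the sentences themselves keeps messages small and leaves the store as the single source of the data, which is one plausible reading of steps 2 and 3.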
  • 22. For the lack of time... (Section 4)
  • 23. Sensu & Graphite
      • Monitoring router
      • “Check scripts” on nodes
      • “Handler scripts” on servers
      • Output can be sent to PagerDuty, Graphite, Twitter or IRC
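
A check script in this style is small: it prints one line and exits with a status code, following the Nagios-style convention that Sensu checks use (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows only the decision logic plus a line in Graphite's plaintext format (`path value timestamp`). The function names, thresholds, and metric path are hypothetical.

```ruby
# Exit codes in the Nagios-style convention used by Sensu checks.
STATUS = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Compare a reading against thresholds and return [exit_code, message].
def check_threshold(name, value, warn:, crit:)
  return [STATUS[:unknown], "#{name} UNKNOWN: no reading"] if value.nil?

  if value >= crit
    [STATUS[:critical], "#{name} CRITICAL: #{value}"]
  elsif value >= warn
    [STATUS[:warning], "#{name} WARNING: #{value}"]
  else
    [STATUS[:ok], "#{name} OK: #{value}"]
  end
end

# Format one metric in Graphite's plaintext protocol: "path value timestamp\n".
def graphite_line(path, value, time)
  "#{path} #{value} #{time.to_i}\n"
end

# In a standalone check script the last lines would be something like:
#   code, message = check_threshold("disk", reading, warn: 80, crit: 90)
#   puts message
#   exit code
```

Keeping the decision in a function that returns the code, rather than calling `exit` directly, makes the check easy to test without running it as a process.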
  • 24. Distributed Log Collection: Scribe, Flume, Splunk
      • Flume allows multiple topologies
      • Agent
      • Collector
      • Sink
  • 25. Feel free to reach out: info@promptcloud.com. Big Data made Small. Appreciate your time. Thanks to Arpan Jha for her help with the slides.