End-to-end Hadoop for search suggestions
Guest lecture at RuG
WEB AND CLOUD COMPUTING
http://dilbert.com/strips/comic/2011-01-07/
CLOUD COMPUTING
FOR RENT
image: flickr user jurvetson
30+ million users in two years
With 3 engineers!

Also,
A quick survey...
This talk:
Build working search suggestions (like on bol.com)
That update automatically based on user behavior
Using a lot of open source (elasticsearch, Hadoop, Redis, Bootstrap, Flume, node.js, jquery)
With a code example, not just slides
And a live demo of the working thing in action
http://twitter.github.com/bootstrap/
How do we populate this array?

Architecture sketch (data flow):
USER issues an HTTP request: /search?q=trisodium+phosphate
The web server writes a log line: 2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate
Transport logs to the compute cluster (possibly terabytes of these)
Transform logs to search suggestions
Push suggestions to the DB server
Serve suggestions for web requests (this thing must be really, really fast; possibly many of these servers)
Search is important!
98% of landing page visitors go directly to the search box
http://www.elasticsearch.org/
http://nodejs.org/
Node.js
Server-side JavaScript
Built on top of the V8 JavaScript engine
Completely asynchronous / non-blocking / event-driven
Non-blocking means function calls never block
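Node itself is JavaScript, but the non-blocking idea can be illustrated with Python's asyncio as an analogue: two waits scheduled on one event loop overlap instead of adding up (the function names here are illustrative, not from Node).

```python
import asyncio
import time

async def fake_io(label, seconds):
    # await yields control to the event loop instead of blocking the thread
    await asyncio.sleep(seconds)
    return label

async def main():
    start = time.monotonic()
    # Both simulated I/O waits run concurrently on one event loop
    results = await asyncio.gather(fake_io("a", 0.2), fake_io("b", 0.2))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)         # ['a', 'b']
print(elapsed < 0.35)  # True: close to the longest wait, not the 0.4s sum
```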
Distributed filesystem

Consists of a single master node and multiple (many) data nodes

Files are split up into blocks (64MB by default)

Blocks are spread across data nodes in the cluster

Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)

Master node keeps track of which blocks belong to a file
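The block size and replication factor above imply a simple storage calculation; a hypothetical sketch (function name is mine, not a Hadoop API):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (64MB)
REPLICATION = 3                # typical replication factor

def hdfs_footprint(file_size_bytes):
    """Number of blocks a file occupies and total raw storage consumed
    across the cluster once every block is replicated."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

# A 1GB log file: 16 blocks, 3GB of raw disk across the cluster.
blocks, raw = hdfs_footprint(1024 * 1024 * 1024)
print(blocks, raw)  # 16 3221225472
```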
Accessible through Java API

FUSE (filesystem in user space) driver available to mount as regular FS

C API available

Basic command line tools in Hadoop distribution

Web interface

File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file)

Data goes directly to and from data nodes (read, write, append)

Local read path optimization: clients located on same machine as data node will always access local replica when possible
Diagram: HDFS read/write paths. The Name Node holds the namespace (e.g. /some/file, /foo/bar). An HDFS client creates a file via the Name Node, then writes data directly to a Data Node; that Data Node replicates the blocks to other Data Nodes (each with multiple disks). Clients read data directly from Data Nodes, and a node-local HDFS client reads from its local replica.
NameNode daemon
Filesystem master node

Keeps track of directories, files and block locations

Assigns blocks to data nodes

Keeps track of live nodes (through heartbeats)

Initiates re-replication in case of data node loss



Block meta data is held in memory

Will run out of memory when too many files exist
DataNode daemon
Filesystem worker node / “Block server”

Uses underlying regular FS for storage (e.g. ext4)

Takes care of distribution of blocks across disks

Don’t use RAID

More disks means more IO throughput

Sends heartbeats to NameNode

Reports blocks to NameNode (on startup)

Does not know about the rest of the cluster (shared nothing)
HDFS trivia
HDFS is write once, read many, but has append support in newer versions

Has built in compression at the block level

Does end-to-end checksumming on all data (on read and periodically)

HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations

Optimized for sequential reads, not random access
Job input comes from files on HDFS

Other formats are possible; requires specialized InputFormat implementation

Built in support for text files (convenient for logs, csv, etc.)

Files must be splittable for parallelization to work

Not all compression formats have this property (e.g. gzip)
JobTracker daemon
MapReduce master node

Takes care of scheduling and job submission

Splits jobs into tasks (Mappers and Reducers)

Assigns tasks to worker nodes

Reassigns tasks in case of failure

Keeps track of job progress

Keeps track of worker nodes through heartbeats
TaskTracker daemon
MapReduce worker process

Starts Mappers and Reducers assigned by the JobTracker

Sends heartbeats to the JobTracker

Sends task progress to the JobTracker

Does not know about the rest of the cluster (shared nothing)
Cluster layout: one master node runs the NameNode and JobTracker daemons; each of the worker nodes runs both a DataNode and a TaskTracker.
“Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”

http://flume.apache.org/
image from http://flume.apache.org
Processing the logs, a simple approach:
Divide time into 15 minute buckets
Count # times a search term appears in each bucket
Divide counts by (current time - time of start of bucket)
Rank terms according to sum of divided counts
MapReduce, Job 1:
FIFTEEN_MINUTES = 1000 * 60 * 15

map(key, value) {
  term = getTermFromLogLine(value)
  time = parseTimestampFromLogLine(value)
  bucket = time - (time % FIFTEEN_MINUTES)

  outputKey = (bucket, term)
  outputValue = 1

  emit(outputKey, outputValue)
}

reduce(key, values) {
  outputKey = getTermFromKey(key)
  outputValue = (getBucketFromKey(key), sum(values))

  emit(outputKey, outputValue)
}
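Job 1 can be sketched as a runnable, single-process Python simulation of the map/shuffle/reduce phases (the log format follows the earlier slide; the parsing helper is my own, not part of the original code):

```python
from collections import defaultdict
from datetime import datetime, timezone

FIFTEEN_MINUTES = 1000 * 60 * 15  # bucket width in milliseconds

def parse_log_line(line):
    """Split a log line like '2012-07-01T06:00:02.500Z /search?q=foo'
    into (epoch millis, decoded search term)."""
    stamp, path = line.split(" ", 1)
    dt = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    term = path.split("q=", 1)[1].replace("%20", " ")
    return int(dt.timestamp() * 1000), term

def job1(log_lines):
    """Count occurrences of each term per 15-minute bucket."""
    counts = defaultdict(int)            # shuffle: group by (bucket, term)
    for line in log_lines:
        time_ms, term = parse_log_line(line)
        bucket = time_ms - (time_ms % FIFTEEN_MINUTES)
        counts[(bucket, term)] += 1      # map emits ((bucket, term), 1)
    # reduce emits (term, (bucket, count))
    return [(term, (bucket, n)) for (bucket, term), n in counts.items()]

logs = [
    "2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate",
    "2012-07-01T06:05:00.000Z /search?q=trisodium%20phosphate",
    "2012-07-01T06:20:00.000Z /search?q=hadoop",
]
print(job1(logs))
```

The first two lines fall into the same 06:00 bucket and are counted together; the third lands in the 06:15 bucket.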
MapReduce, Job 2:
map(key, value) {
  term = key
  for each prefix of term {
    count = getCountFromValue(value)
    score = count / (now - getBucketFromValue(value))
    emit(prefix, (term, score))
  }
}

reduce(key, values) {
  for each (term, score) in values {
    storeOrUpdateTermScore(term, score)
  }

  order terms by descending total score

  emit(key, top 10 terms)
}
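Job 2 can likewise be sketched as an in-memory Python simulation, consuming Job 1's (term, (bucket, count)) records and producing a top-10 suggestion list per prefix (structure and names are my own rendering of the pseudocode):

```python
from collections import defaultdict

def job2(job1_output, now_ms):
    """Score each term per prefix, weighing recent buckets higher,
    and keep the 10 best terms for every prefix."""
    scores = defaultdict(lambda: defaultdict(float))  # prefix -> term -> score
    for term, (bucket, count) in job1_output:
        score = count / (now_ms - bucket)    # decay: older buckets weigh less
        for i in range(1, len(term) + 1):    # map: emit one record per prefix
            scores[term[:i]][term] += score  # reduce: sum scores per term
    return {
        prefix: [t for t, _ in sorted(by_term.items(),
                                      key=lambda kv: kv[1], reverse=True)][:10]
        for prefix, by_term in scores.items()
    }

# Two terms share the prefix "tri"; "trim" is more recent, so it ranks first.
now = 10_000_000
records = [("trim", (9_000_000, 5)), ("trisodium", (1_000_000, 5))]
suggestions = job2(records, now)
print(suggestions["tri"])  # ['trim', 'trisodium']
```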
Processing the logs, a simple solution:
Run Job 1 on new input (written by Flume)
When finished processing new logs, move files to another location
Run Job 2 after Job 1 on new and historic output of Job 1
Push output of Job 2 to a database
Cleanup historic outputs of Job 1 after N days
A database. Sort of...

“Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.”

http://redis.io/
Insert into Redis:
Update all keys after each job
Set a TTL on all keys of one hour; obsolete entries go away automatically
How much history do we need?
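The TTL idea can be sketched as follows. With a real client (e.g. redis-py) this would be the SETEX command; the class below is a hypothetical in-memory stand-in, not a Redis API:

```python
import time

class TTLStore:
    """Toy stand-in for Redis SETEX semantics: every key gets a
    time-to-live, and expired keys behave as if deleted."""
    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        # store the value together with its absolute expiry time
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() >= expires:
            del self._data[key]  # lazily drop expired entries
            return None
        return value

store = TTLStore()
store.setex("sugg:tri", 0.05, ["trim", "trisodium"])
print(store.get("sugg:tri"))  # ['trim', 'trisodium']
time.sleep(0.06)
print(store.get("sugg:tri"))  # None: obsolete entry expired automatically
```

Because every job run rewrites the keys and refreshes their TTL, any key the jobs stop producing simply ages out within the hour.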
Improvements:
Use HTTP keep-alive to make sure the TCP connection stays open
Create a diff in Hadoop and only update relevant entries in Redis
Production hardening (error handling, packaging, deployment, etc.)
Options for better algorithms:
Use more signals than only search:
  Did a user type the text or click a suggestion?
Look at cause-effect relationships within sessions:
  Was it a successful search?
  Did the user buy the product?
Personalize:
  Filtering of irrelevant results
  Re-ranking based on personal behavior
Q&A
code is at https://github.com/friso/rug

fvanvollenhoven@xebia.com
@fzk

Contenu connexe

Tendances

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)Durga Gadiraju
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyonJohan hong
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 

Tendances (20)

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyon
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 

Similaire à End-to-end Hadoop search suggestions with Elasticsearch

Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache AccumuloJared Winick
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoopfvanvollenhoven
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 
Design for a Distributed Name Node
Design for a Distributed Name NodeDesign for a Distributed Name Node
Design for a Distributed Name NodeAaron Cordova
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceDerek Chen
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 

Similaire à End-to-end Hadoop search suggestions with Elasticsearch (20)

Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Hadoop
HadoopHadoop
Hadoop
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
Unit 1
Unit 1Unit 1
Unit 1
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
HADOOP
HADOOPHADOOP
HADOOP
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Design for a Distributed Name Node
Design for a Distributed Name NodeDesign for a Distributed Name Node
Design for a Distributed Name Node
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 

Plus de fvanvollenhoven

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!fvanvollenhoven
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collectorfvanvollenhoven
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentationfvanvollenhoven
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jfvanvollenhoven
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksfvanvollenhoven
 

Plus de fvanvollenhoven (6)

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collector
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4j
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networks
 
Berlin Buzzwords preso
Berlin Buzzwords presoBerlin Buzzwords preso
Berlin Buzzwords preso
 

Dernier

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Dernier (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scaling API-first – The story of a global engineering organization

End-to-end Hadoop search suggestions with Elasticsearch

  • 1. End-to-end Hadoop for search suggestions Guest lecture at RuG
  • 2. WEB AND CLOUD COMPUTING
  • 5. CLOUD COMPUTING FOR RENT image: flickr user jurvetson
  • 6. 30+ million users in two years With 3 engineers!
  • 9. This talk: build working search suggestions (like on bol.com) that update automatically based on user behavior, using a lot of open source (elasticsearch, Hadoop, Redis, Bootstrap, Flume, node.js, jQuery) — with a code example, not just slides, and a live demo of the working thing in action.
  • 12. How do we populate this array?
  • 13. Architecture diagram: the USER sends an HTTP request (/search?q=trisodium+phosphate) to one of possibly many web servers, which writes it to a log file (2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate); logs are transported to the compute cluster, which transforms logs to search suggestions (possibly terabytes of logs); suggestions are pushed to a DB server, which serves suggestions for web requests and must be really, really fast.
  • 14. Search is important! 98% of landing page visitors go directly to the search box
  • 19. Node.js Server-side JavaScript Built on top of the V8 JavaScript engine Completely asynchronous / non-blocking / event-driven
  • 20. Non-blocking means function calls never block
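A minimal node.js sketch of what "never block" means: setTimeout schedules the callback and returns immediately, so the statement after it runs first.

```javascript
// Non-blocking: scheduling work returns immediately; the callback
// runs later, on a subsequent tick of the event loop.
const order = [];

setTimeout(() => order.push('callback ran'), 0); // returns immediately

order.push('statement after setTimeout');

// Synchronously, only the second push has happened: the callback has
// been scheduled but not executed yet.
console.log(order); // → [ 'statement after setTimeout' ]
```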
  • 22. Distributed filesystem Consists of a single master node and multiple (many) data nodes Files are split up into blocks (64MB by default) Blocks are spread across data nodes in the cluster Each block is replicated multiple times to different data nodes in the cluster (typically 3 times) Master node keeps track of which blocks belong to a file
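The block arithmetic above can be sketched as a toy model (not real Hadoop code; the block size and replication factor mirror the defaults the slide mentions):

```javascript
// Toy model of how HDFS splits a file into blocks and replicates them.
// Illustration only — real placement decisions live in the NameNode.
const BLOCK_SIZE = 64 * 1024 * 1024; // 64 MB default block size
const REPLICATION = 3;               // typical replication factor

function planBlocks(fileSizeBytes) {
  const blocks = Math.max(1, Math.ceil(fileSizeBytes / BLOCK_SIZE));
  return {
    blocks,                                  // block entries the NameNode tracks
    replicas: blocks * REPLICATION,          // block copies spread over DataNodes
    rawStorage: fileSizeBytes * REPLICATION, // total bytes consumed on disk
  };
}

// A 1 GB file becomes 16 blocks, stored as 48 replicas across the cluster.
console.log(planBlocks(1024 * 1024 * 1024)); // → { blocks: 16, replicas: 48, ... }
```

This is also why the NameNode runs out of memory with too many small files: metadata grows with the number of blocks, not with bytes stored.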
  • 23. Accessible through Java API FUSE (filesystem in user space) driver available to mount as regular FS C API available Basic command line tools in Hadoop distribution Web interface
  • 24. File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file) Data goes directly to and from data nodes (read, write, append) Local read path optimization: clients located on same machine as data node will always access local replica when possible
  • 25. Diagram: an HDFS client sends create file and read data requests to the Name Node (holding metadata such as /some/file, /foo/bar); data is written directly to the Data Nodes (each with multiple disks), which replicate blocks among themselves; a node-local HDFS client reads data from the replica on its own machine.
  • 26. NameNode daemon Filesystem master node Keeps track of directories, files and block locations Assigns blocks to data nodes Keeps track of live nodes (through heartbeats) Initiates re-replication in case of data node loss Block meta data is held in memory Will run out of memory when too many files exist
  • 27. DataNode daemon Filesystem worker node / “Block server” Uses underlying regular FS for storage (e.g. ext4) Takes care of distribution of blocks across disks Don’t use RAID More disks means more IO throughput Sends heartbeats to NameNode Reports blocks to NameNode (on startup) Does not know about the rest of the cluster (shared nothing)
  • 28. HDFS trivia HDFS is write once, read many, but has append support in newer versions Has built in compression at the block level Does end-to-end checksumming on all data (on read and periodically) HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations Optimized for sequential reads, not random access
  • 29. Job input comes from files on HDFS Other formats are possible; requires specialized InputFormat implementation Built in support for text files (convenient for logs, csv, etc.) Files must be splittable for parallelization to work Not all compression formats have this property (e.g. gzip)
  • 30. JobTracker daemon MapReduce master node Takes care of scheduling and job submission Splits jobs into tasks (Mappers and Reducers) Assigns tasks to worker nodes Reassigns tasks in case of failure Keeps track of job progress Keeps track of worker nodes through heartbeats
  • 31. TaskTracker daemon MapReduce worker process Starts Mappers and Reducers assigned by JobTracker Sends heart beats to the JobTracker Sends task progress to the JobTracker Does not know about the rest of the cluster (shared nothing)
  • 32. NameNode JobTracker DataNode DataNode DataNode DataNode TaskTracker TaskTracker TaskTracker TaskTracker
  • 33. “Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” http://flume.apache.org/
  • 37. Processing the logs, a simple approach: Divide time into 15 minute buckets Count # times a search term appears in each bucket Divide counts by (current time - time of start of bucket) Rank terms according to sum of divided counts
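The bucketing and decay arithmetic of this simple approach can be written out in plain JavaScript (function names are mine, not from the lecture's code):

```javascript
const FIFTEEN_MINUTES = 1000 * 60 * 15;

// Round a timestamp down to the start of its 15-minute bucket.
function bucketOf(timestampMs) {
  return timestampMs - (timestampMs % FIFTEEN_MINUTES);
}

// Older buckets contribute less: divide the count by the bucket's age.
function decayedScore(count, bucketStartMs, nowMs) {
  return count / (nowMs - bucketStartMs);
}

const now = Date.parse('2012-07-01T08:00:00Z');
const b = bucketOf(Date.parse('2012-07-01T06:07:30Z'));
console.log(new Date(b).toISOString()); // → 2012-07-01T06:00:00.000Z

// A term seen 10 times two hours ago scores lower than one seen
// 10 times in the most recent bucket.
console.log(decayedScore(10, b, now) <
            decayedScore(10, bucketOf(now - FIFTEEN_MINUTES), now)); // → true
```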
  • 38. MapReduce, Job 1:

      FIFTEEN_MINUTES = 1000 * 60 * 15

      map(key, value) {
        term = getTermFromLogLine(value)
        time = parseTimestampFromLogLine(value)
        bucket = time - (time % FIFTEEN_MINUTES)
        outputKey = (bucket, term)
        outputValue = 1
        emit(outputKey, outputValue)
      }

      reduce(key, values) {
        outputKey = getTermFromKey(key)
        outputValue = (getBucketFromKey(key), sum(values))
        emit(outputKey, outputValue)
      }
  • 39. MapReduce, Job 2:

      map(key, value) {
        for each prefix of key {
          outputKey = prefix
          count = getCountFromValue(value)
          score = count / (now - getBucketFromValue(value))
          emit(outputKey, (key, score))   // key is the term from Job 1's output
        }
      }

      reduce(key, values) {
        for each term in values {
          storeOrUpdateTermScore(value)
        }
        order terms by descending scores
        emit(key, top 10 terms)
      }
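The two jobs can be simulated in-memory to see the data flow end to end (a toy single-process version with decay applied directly to timestamps; the real thing runs as Hadoop jobs over bucketed log files):

```javascript
// Toy in-memory version of the two MapReduce jobs above.
// Input: [term, bucketStartMs] pairs; output: prefix -> top-10 ranked terms.
function suggest(termBuckets, nowMs) {
  // Job 1 reduce: count occurrences per (term, bucket).
  const counts = new Map(); // "term\u0000bucket" -> count
  for (const [term, bucket] of termBuckets) {
    const k = term + '\u0000' + bucket;
    counts.set(k, (counts.get(k) || 0) + 1);
  }

  // Job 2 map: for every prefix of the term, emit a decayed score.
  const scores = new Map(); // prefix -> Map(term -> summed score)
  for (const [k, count] of counts) {
    const [term, bucket] = k.split('\u0000');
    const score = count / (nowMs - Number(bucket));
    for (let i = 1; i <= term.length; i++) {
      const prefix = term.slice(0, i);
      if (!scores.has(prefix)) scores.set(prefix, new Map());
      const perTerm = scores.get(prefix);
      perTerm.set(term, (perTerm.get(term) || 0) + score);
    }
  }

  // Job 2 reduce: rank terms per prefix, keep the top 10.
  const result = {};
  for (const [prefix, perTerm] of scores) {
    result[prefix] = [...perTerm.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, 10)
      .map(([term]) => term);
  }
  return result;
}

const now = Date.now();
const logs = [['train', now - 1000], ['train', now - 1000], ['tram', now - 1000]];
console.log(suggest(logs, now)['tra']); // → [ 'train', 'tram' ]
```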
  • 40. Processing the logs, a simple solution: Run Job 1 on new input (written by Flume) When finished processing new logs, move files to another location Run Job 2 after Job 1 on new and historic output of Job 1 Push output of Job 2 to a database Cleanup historic outputs of Job 1 after N days
  • 41. A database. Sort of... “Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.” http://redis.io/
  • 42. Insert into Redis: Update all keys after each job Set a TTL on all keys of one hour; obsolete entries go away automatically
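One way to load each prefix's suggestion list: replace the key's contents and reset its TTL in one pass. The function below only builds the raw commands (DEL, RPUSH, and EXPIRE are real Redis commands; the "sugg:" key naming and use of a plain list are my assumptions, not taken from the lecture's code):

```javascript
// Build the Redis commands that replace one prefix's suggestion list.
// Key naming ("sugg:" + prefix) and the list data type are assumptions.
const TTL_SECONDS = 60 * 60; // one hour, per the slide

function commandsFor(prefix, suggestions) {
  const key = 'sugg:' + prefix;
  return [
    ['DEL', key],                   // drop the old list
    ['RPUSH', key, ...suggestions], // write the new ranked suggestions
    ['EXPIRE', key, TTL_SECONDS],   // obsolete entries vanish if not refreshed
  ];
}

console.log(commandsFor('tra', ['train', 'tram']));
// → [ [ 'DEL', 'sugg:tra' ],
//     [ 'RPUSH', 'sugg:tra', 'train', 'tram' ],
//     [ 'EXPIRE', 'sugg:tra', 3600 ] ]
```

Because every job run rewrites the keys and resets the TTL, prefixes that stop appearing in the logs simply expire on their own.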
  • 43. How much history do we need?
  • 45. Improvements: Use HTTP keep-alive to make sure the TCP connection stays open Create a diff in Hadoop and only update relevant entries in Redis Production hardening (error handling, packaging, deployment, etc.)
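The "diff in Hadoop" idea can be sketched as a plain function: compare the previous and current suggestion maps and only touch keys whose lists changed (a toy single-process version; the real diff would itself be computed as a MapReduce job):

```javascript
// Return only the prefixes whose suggestion lists changed between runs,
// so the Redis update touches as few keys as possible.
function diffSuggestions(previous, current) {
  const changed = {};
  for (const [prefix, terms] of Object.entries(current)) {
    const old = previous[prefix];
    if (!old || old.length !== terms.length ||
        old.some((t, i) => t !== terms[i])) {
      changed[prefix] = terms;
    }
  }
  return changed;
}

const prev = { tra: ['train', 'tram'], tri: ['trisodium phosphate'] };
const curr = { tra: ['tram', 'train'], tri: ['trisodium phosphate'] };
console.log(diffSuggestions(prev, curr)); // → { tra: [ 'tram', 'train' ] }
```

Note this only finds added or reordered keys; deleted prefixes need no explicit handling here because the one-hour TTL already expires them.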
  • 46. Options for better algorithms: Use more signals than only search: Did a user type the text or click a suggestion? Look at cause-effect relationships within sessions: Was it a successful search? Did the user buy the product? Personalize: Filtering of irrelevant results Re-ranking based on personal behavior
  • 48. Q&A code is at https://github.com/friso/rug fvanvollenhoven@xebia.com @fzk