Five of the most dangerous words you'll hear a developer say are "How hard could it be?" This talk tells the tale of what happens when you act on the thought "I'm going to write the next Google beater. How hard could it be?" It is the story of how one person, in a few hours, can write something resembling a search engine thanks to the platform features of Azure and the productivity of F#. We'll see how to use Azure Search from F# to power our search internals, MBrace to rapidly find the most popular web pages on the internet, and Azure Functions to tie everything together, building APIs and creating on-demand infrastructure. Add in a healthy mix of queues provided by Azure Service Bus and, if you squint hard enough, you might just end up seeing something resembling a search engine.
But seriously, writing the next Google: just how hard could it be?
A recording of this talk is available via SkillsMatter at https://skillsmatter.com/skillscasts/8901-f-sharpunctional-londoners-meetup
34. SEARCH IMPLEMENTATION
HOW TO FIND A NEEDLE IN A HAYSTACK
▸ Take all of your documents
▸ Record all of the words which occur within each document
▸ Invert that index
▸ The result is a list of all words and the documents they appear in
▸ For all words in the search query, find the documents which appear in every one of those words' inverted index entries
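The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the talk's F#, and it assumes whitespace tokenisation and lower-casing; the file names and helper names (`build_inverted_index`, `search`) are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Intersect the posting sets; an unknown word contributes an empty set.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Hypothetical documents, purely for illustration.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "the quick dog",
}
index = build_inverted_index(documents)
```

Calling `search(index, "quick dog")` intersects the entries for "quick" and "dog", so only documents containing both words come back.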
37. AZURE SEARCH
WHAT DOES AZURE SEARCH GIVE US?
▸ Hosted Search as a Service
▸ HTTP API for indexing and retrieving documents
▸ Ability to scale out (more replicas, more indexes)
▸ Free basic tier
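As a rough idea of what indexing over the HTTP API involves, the sketch below only builds the request (URL, headers, JSON body) for a document upload and does not send it. The service name, index name, key, and document fields are hypothetical, and the `api_version` default is an assumption that should be checked against the version your service supports. It is written in Python for brevity rather than the talk's F#.

```python
import json


def build_index_request(service, index_name, api_key, documents,
                        api_version="2020-06-30"):
    """Build (url, headers, body) for an Azure Search document upload."""
    url = (f"https://{service}.search.windows.net/indexes/{index_name}"
           f"/docs/index?api-version={api_version}")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    # Each document becomes an "upload" action in the batch.
    actions = [{"@search.action": "upload", **doc} for doc in documents]
    body = json.dumps({"value": actions})
    return url, headers, body


# Hypothetical service, index, key, and document fields.
url, headers, body = build_index_request(
    "my-search", "pages", "hypothetical-key",
    [{"id": "1", "title": "Home"}],
)
```

The resulting triple can be handed to any HTTP client as a POST; the fields in each document must match the schema of the target index.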
41. INDEXING DATA
WHAT IS A CRAWLER
▸ Autonomously find every web page on the internet
▸ Pull the content from that web page and index it
▸ Read the links on that page and index those links
▸ Recursively process until every page on the internet has been reached
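The crawl loop can be sketched as below. It is an illustration in Python, not the talk's implementation: it swaps the recursion for an explicit queue (breadth-first), takes the link fetcher as a parameter, and uses a tiny in-memory "web" instead of real HTTP requests. The names `crawl`, `fetch_links`, and `fake_web` are hypothetical.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from seed; returns pages in visit order."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # a real crawler would index the page here
        for link in fetch_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "web" standing in for real HTTP fetches.
fake_web = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
order = crawl("a", lambda url: fake_web.get(url, []))
```

The `seen` set is what stops the crawler looping forever on pages that link back to each other, and `max_pages` is a safety cap, since "every page on the internet" is not a realistic stopping condition.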
44. INDEXING DATA
WHAT DOES AZURE SERVICE BUS GIVE US?
▸ Scalable durable queues and topics with guaranteed availability
▸ .NET APIs to communicate with the service bus
▸ Free basic tier
53. BEING A WELL BEHAVED SCRAPER
WHAT IS ROBOTS.TXT?
▸ Text file standard for telling web scrapers what they should scrape
▸ Opt-in - compliance is voluntary, so crawlers can ignore the robots.txt file
▸ Simple file stored at the root of the web server
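A well-behaved crawler checks these rules before fetching a page. The sketch below uses Python's standard-library `urllib.robotparser` purely as an illustration (the talk itself uses F#); the robots.txt content, the `allowed` helper, and the user agent string are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as it might be served from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

In a crawler, a check like `allowed(url)` would sit just before the page is queued for download.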
57. WE HAVE AN HTML FILE.
WE NEED THE CONTENT OUT OF IT.
58. INFORMATION RETRIEVAL FROM HTML DOCUMENTS
WORKING WITH THE HTML AGILITY PACK
▸ Provides a simple query layer over HTML documents
▸ Works with both well-formed and malformed HTML
▸ Provides XPath support over the document
▸ Allows for querying for individual properties and elements
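To give a feel for the extraction step, the sketch below pulls the title, link targets, and visible text out of a page. It uses Python's standard-library `html.parser` as a stand-in, not the HTML Agility Pack, and it has no XPath support; the `PageExtractor` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the title, link targets, and visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # non-zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


# Sample markup with an unclosed <p>, to show tolerance of sloppy HTML.
html = """<html><head><title>Hello</title><style>p {color: red}</style></head>
<body><p>Some <a href="/about">about</a> text<p>Unclosed paragraph
<a href="https://example.com">ext</a></body></html>"""

extractor = PageExtractor()
extractor.feed(html)
```

The `links` list feeds the crawler, while `title` and `text_parts` are what would be sent to the search index.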
63. AZURE WEBJOBS
WHAT ARE WEBJOBS?
▸ A means of hosting basic executables in the cloud
▸ Provides simplified deployment and monitoring
▸ Pricing per minute of usage
68. PAGERANK
WHAT IS PAGERANK?
▸ Stanford’s patented algorithm
▸ Helps you find the most influential websites on the internet
▸ Websites with lots of links to them are more influential
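The idea that heavily linked pages are more influential can be sketched with a simple power iteration. This is a single-machine illustration in Python, not the MBrace job from the talk; the damping factor of 0.85 and the iteration count are conventional assumptions, and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
```

Here "c" ends up with the highest rank because three pages link to it, and "a" outranks "b" and "d" because its single inbound link comes from the influential "c".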
73. WE HAVE A QUERY WHICH NEEDS TO RUN DAILY.
WE NEED TO ORCHESTRATE IT.
74. AZURE FUNCTIONS + AZURE RESOURCE MANAGER
USING AZURE FUNCTIONS FOR DEVOPS
75. DEVOPS
WHAT IS AZURE RESOURCE MANAGER?
▸ Declarative way of describing Azure infrastructure
▸ REST APIs to deploy infrastructure template files
▸ APIs to see current deployment status
76. DEVOPS
WHAT IS AZURE FUNCTIONS?
▸ Lightweight scripting of Azure WebJobs
▸ Allows for running scripts in response to certain events
▸ Billing based on number of function invocations
77. DEVOPS
USING AZURE FUNCTIONS FOR DEVOPS
▸ Set up a timer triggered Azure Function
▸ Deploy an MBrace cluster through Azure Resource Manager
▸ Send an event when the job completes
▸ Second Azure Function for deleting the MBrace cluster
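For the first step, a timer-triggered function is declared through a binding in its `function.json`. The sketch below builds that binding as a Python dictionary and prints it as JSON; the binding name `timer` is a hypothetical choice, and the schedule is an assumed daily run at midnight in the six-field cron format (second, minute, hour, day, month, day of week). The cluster deployment and clean-up steps are not shown.

```python
import json

# Timer trigger binding for a function that should fire once a day.
function_json = {
    "bindings": [
        {
            "name": "timer",             # hypothetical parameter name
            "type": "timerTrigger",
            "direction": "in",
            "schedule": "0 0 0 * * *",   # assumed: every day at 00:00:00
        }
    ]
}

print(json.dumps(function_json, indent=2))
```

The function body behind this trigger would start the Azure Resource Manager deployment, and the second function would react to the completion event to delete the cluster.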