The Birth of a Web Crawling Bot

  1. The Birth of a Web Crawling Bot
  2. E-commerce, travel, jobs, and classifieds are some of the domains where bots are used to shape core competitive strategy.
  3. So what do web crawling bots actually do?
  4. For the most part, bots traverse hundreds or thousands of pages on a website and fetch specific pieces of information, depending on each bot's purpose.
  5. Some bots are designed to collect price data from e-commerce portals, while others can extract customer reviews from online travel agencies. Also, there are bots designed to collect user-generated content.
  6. Irrespective of the use case, bots are built from scratch according to the information that needs to be extracted from the target webpages or websites.
  7. Here are the five stages of making a web crawling bot:
  8. 1. Understanding how the site reacts to human users: It is important to understand how a website interacts with a real human user. The target website, from which data is to be extracted, is first navigated in browsers such as Google Chrome and Mozilla Firefox. This exposes the browser-server interaction, showing how the server sees and processes an incoming request, and lays the foundation for building the bot.
  9. 2. Getting the hang of how the site behaves with a bot: Some automated test traffic is sent to see how differently the site treats a bot compared with a human user. This helps in choosing the best approach to building the bot. Most websites treat human users and bots differently to protect themselves from bad bots and various forms of cyberattack.
  10. 3. Building the bot: Once a clear blueprint of the target site is obtained, it’s time to start building the crawler bot. The complexity of the build depends on the results of the previous tests. For instance, if the target site is only accessible from Germany, a German proxy must be included to fetch the site.
  11. 4. Putting the bot to the test: Top priority is given to reliability and data quality. It’s important to test the crawler bot under different conditions, such as peak and off-peak times of the target site, before the actual crawls start. To do this, a random sample of pages is fetched from the live site. Based on the outcome, changes are made to the crawler to improve its stability and scale of operation. If everything works as expected, the bot can go into production.
  12. 5. Extracting data points and data processing: Bots can fetch the full HTML content of pages for data extraction and other processing, depending on client requirements. Once extraction is done, the data is automatically scanned for duplicate entries and deduplicated. The next step is normalization, where the data is changed into a form that is easier to consume. For example, if price data is extracted in dollars, it can be converted to a different currency before being delivered to the client.
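
The deduplication and normalization described in stage 5 might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the record fields (`url`, `title`, `price_usd`), the sample data, the choice of the URL as the duplicate key, and the exchange rate passed in are all hypothetical and do not come from the slides. A real pipeline would choose its own key and take rates from a trusted source.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical records standing in for what a crawler might have extracted.
RAW_RECORDS = [
    {"url": "https://example.com/p/1", "title": "Widget", "price_usd": "19.99"},
    {"url": "https://example.com/p/1", "title": "Widget", "price_usd": "19.99"},
    {"url": "https://example.com/p/2", "title": "Gadget", "price_usd": "5.00"},
]


def deduplicate(records, key_fields=("url",)):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def normalize_price(record, rate, target_currency="EUR"):
    """Return a copy of the record with its USD price converted at `rate`."""
    # Decimal avoids binary floating-point rounding surprises with money.
    converted = (Decimal(record["price_usd"]) * Decimal(rate)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    normalized = {k: v for k, v in record.items() if k != "price_usd"}
    normalized["price"] = str(converted)
    normalized["currency"] = target_currency
    return normalized


def process(records, rate):
    """Deduplicate first, then normalize each remaining record."""
    return [normalize_price(r, rate) for r in deduplicate(records)]


if __name__ == "__main__":
    # The rate here is a made-up placeholder, not a real exchange rate.
    for row in process(RAW_RECORDS, "0.90"):
        print(row)
```

Deduplicating before normalizing keeps the conversion work proportional to the number of unique records. Keying on the URL alone is a simplification: two different URLs can serve the same product, so a production crawler may need a richer key.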