How to create a distributed, fault-tolerant crawler

Learn the basics of creating a distributed, fault-tolerant web crawler

So you've created your first crawler and now you're watching it consume page after page, but you start to notice that… it takes too long to crawl one page after another. Time is something we don't get back, and your crawler could use a few updates. How about running jobs in parallel, right? But how? What are the options, and what are the challenges?

Running jobs in parallel presumes some sort of "master" brain that decides the next job and assigns it to a worker or a thread.


Crawl more at the same time by using threads

Your first impulse is to reach for threads because that's what they're meant for, right? For small jobs that's my choice as well. The code is usually easier to read and simpler to understand, and threads are easy to manage and cheap. You don't deal with a lot of problems, and you keep everything running under the same process:
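Here's a minimal sketch of that approach, using a sync.WaitGroup to track 100 goroutines (the crawl function is a placeholder for your real page-fetching logic):

```go
package main

import (
	"fmt"
	"sync"
)

// crawl is a placeholder for your real single-page crawling logic.
func crawl(jobID int) {
	fmt.Println("crawling job", jobID)
}

func main() {
	var wg sync.WaitGroup

	// Launch 100 jobs concurrently, one goroutine per job.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(jobID int) {
			defer wg.Done()
			crawl(jobID)
		}(i)
	}

	// Block until every goroutine has called Done.
	wg.Wait()
	fmt.Println("done")
}
```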

NOTE: I said threads, which is not entirely accurate. The above code uses goroutines, which are similar in spirit but much faster and cheaper to spawn.

The above Go code creates a wait group and runs 100 jobs concurrently. Once the jobs are finished, it prints done and exits. But there's a problem here: it runs all 100 jobs at once; what if we have thousands to run? The solution is to introduce some sort of execution cap stating that no more than x jobs should be running at once. When crawling you ALWAYS have to do this, otherwise you risk flooding the target server or getting banned quickly.
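Here's a sketch of the capped version, using a buffered channel of capacity 3 as a semaphore (crawl is again a placeholder):

```go
package main

import (
	"fmt"
	"sync"
)

// crawl is a placeholder for your real single-page crawling logic.
func crawl(jobID int) {
	fmt.Println("crawling job", jobID)
}

func main() {
	var wg sync.WaitGroup

	// Buffered channel used as a semaphore: at most 3 slots.
	sem := make(chan struct{}, 3)

	for i := 0; i < 100; i++ {
		wg.Add(1)
		// Acquiring a slot blocks once 3 jobs are already running.
		sem <- struct{}{}
		go func(jobID int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the job finishes
			crawl(jobID)
		}(i)
	}

	wg.Wait()
	fmt.Println("done")
}
```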

This time we introduced an execution cap of 3 concurrent jobs. We used a buffered channel with a capacity of 3: each job has to claim a slot in the channel before it starts, and once all 3 slots are taken the loop blocks until a running job finishes and frees one. The waiting group still makes sure we wait for everything to complete.

Things are getting better. We now have x concurrent jobs, and the time required to finish the task is divided by, well… roughly x.

This is all you need for small jobs, but what if one of the executions raises an exception and stops the entire process? Who restarts everything? Will you babysit the entire thing for days? Implement some sort of supervisor? If it gets restarted by a supervisor, does your crawler know where it left off? Assuming you implement a database and save your progress, does it pick up the same job that keeps raising exceptions over and over without making any real progress? Does it know how to give up after x failed attempts?

These are all real questions for any serious job. The implementation checklist keeps growing and growing, and you're in deep water once your project gets a bit larger. The simplicity of our crawler starts raising practical issues, and it's time to address them and create a serious, distributed crawler.


The task scheduler

A distributed task scheduler consists of multiple parts that are independent of each other. There should be no single point of failure except for your own bad code.

The server

This is the part that receives the jobs and schedules workers to complete them. In most cases, the task server is attached to a database where it stores its own state as well as the state of the entire cluster of workers: their status, their assigned tasks and everything else.

The server should monitor the workers and their designated tasks. Should a task fail at any time, the server should retry it x times (configurable) before giving up. It should also persist each completion result into the database for later review.
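To make this concrete with the machinery library used later in this article, the retry policy can be declared on the task signature itself; a minimal sketch with arbitrary values (the crawl task name and URL are placeholders):

```go
package crawler

import "github.com/RichardKnop/machinery/v1/tasks"

// A task signature the server will retry up to 5 times, waiting
// roughly 10 seconds between attempts, before giving up on the job.
var crawlJob = &tasks.Signature{
	Name: "crawl",
	Args: []tasks.Arg{
		{Type: "string", Value: "https://example.com"},
	},
	RetryCount:   5,
	RetryTimeout: 10, // seconds
}
```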

The worker nodes

A worker node is responsible for doing the actual job (crawling). It should listen for events from the server, receive the parameters of a new job and attempt to complete it. Worker nodes should not manage state or alter anything in the Redis database; ideally, they should be completely independent and serve a single purpose.

The database

In case of a server crash or any other unwanted event that takes your server down, without a database you risk losing all of your progress. A database is strongly recommended; I can tell you from experience that a task scheduler that doesn't use some sort of persistent database is not something I'd recommend.

The database of choice for most schedulers is Redis.

The broker / communication engine

This is the component used for communication between the server and the workers. Each task is sent through this engine to the designated worker. An ideal task scheduler sends messages directly to the desired worker (picked based on workload and other parameters) to avoid the unwanted race conditions that can occur when a job is broadcast and picked up by whoever is first to grab it.

Some task schedulers can use Redis for both persisting state and notifications, but a dedicated RabbitMQ server seems to be the preferred choice.

Your code should send jobs to the server, which is responsible for picking a worker to execute them. Every detail is stored in Redis, and the workers are notified through the broker channel they listen to. An example using machinery may look like this:
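Here's a sketch of the server/producer side, based on the machinery v1 API; the connection strings, queue and exchange names, and the crawl task are placeholders for your own setup:

```go
package main

import (
	"log"

	"github.com/RichardKnop/machinery/v1"
	"github.com/RichardKnop/machinery/v1/config"
	"github.com/RichardKnop/machinery/v1/tasks"
)

func main() {
	// RabbitMQ as the broker, Redis as the persistent result backend.
	cnf := &config.Config{
		Broker:        "amqp://guest:guest@localhost:5672/",
		DefaultQueue:  "crawler_tasks",
		ResultBackend: "redis://localhost:6379",
		AMQP: &config.AMQPConfig{
			Exchange:     "crawler_exchange",
			ExchangeType: "direct",
			BindingKey:   "crawler_task",
		},
	}

	server, err := machinery.NewServer(cnf)
	if err != nil {
		log.Fatal(err)
	}

	// Queue one crawl job; a worker will pick it up via the broker.
	signature := &tasks.Signature{
		Name: "crawl",
		Args: []tasks.Arg{
			{Type: "string", Value: "https://example.com"},
		},
	}
	if _, err := server.SendTask(signature); err != nil {
		log.Fatal(err)
	}
	log.Println("crawl task queued")
}
```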

A simple task that receives parameters and does something:
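A sketch of such a task: it fetches a URL and returns its HTTP status (the Crawl name and its behaviour are illustrative):

```go
package crawler

import (
	"fmt"
	"net/http"
)

// Crawl fetches a single URL and returns its HTTP status.
// Returning a non-nil error tells the scheduler the task failed,
// so it can be retried.
func Crawl(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", fmt.Errorf("fetching %s: %w", url, err)
	}
	defer resp.Body.Close()

	// A real crawler would read the body and extract links here.
	return fmt.Sprintf("%s responded with %s", url, resp.Status), nil
}
```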

This task should be accessible by the worker nodes.

As you can see, the worker code and the task scheduler code are shared. This is ideal for simplicity, but it is not required: worker nodes can be completely decoupled or even written in another programming language.

Once you have everything set up, you just have to spawn worker nodes, which can run on any server or on the same machine.

Since we share the same codebase, all we have to do is look for an environment variable that indicates whether this is a worker node or the server.
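A sketch of that switch, assuming a hypothetical WORKER environment variable and that the Crawl task from the previous snippet lives in the same package; server.NewWorker and worker.Launch are the machinery calls that start consuming tasks:

```go
package main

import (
	"log"
	"os"

	"github.com/RichardKnop/machinery/v1"
	"github.com/RichardKnop/machinery/v1/config"
)

func main() {
	cnf := &config.Config{
		Broker:        "amqp://guest:guest@localhost:5672/",
		DefaultQueue:  "crawler_tasks",
		ResultBackend: "redis://localhost:6379",
		AMQP: &config.AMQPConfig{
			Exchange:     "crawler_exchange",
			ExchangeType: "direct",
			BindingKey:   "crawler_task",
		},
	}

	server, err := machinery.NewServer(cnf)
	if err != nil {
		log.Fatal(err)
	}

	// Both roles register the task; only workers actually execute it.
	// Crawl is the task function shown earlier in the shared codebase.
	if err := server.RegisterTask("crawl", Crawl); err != nil {
		log.Fatal(err)
	}

	// WORKER is an assumed environment variable: set it to run this
	// process as a worker node, leave it unset to run the server side.
	if os.Getenv("WORKER") != "" {
		// "crawler_worker" is the consumer tag, 3 the concurrency limit.
		worker := server.NewWorker("crawler_worker", 3)
		if err := worker.Launch(); err != nil {
			log.Fatal(err)
		}
		return
	}

	// Otherwise act as the task server and start queuing jobs here.
	log.Println("running as the task server")
}
```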

For a complete example and all the configuration details, you can visit the machinery repo and read its documentation. If you're a Python coder, I recommend having a look at Celery. Node.js? Give Bull a shot.
