Backends

A DistributedBackend separates the higher-level crawling strategy code from the low-level storage API. Queue, Metadata, States and DomainMetadata are inner components of the DistributedBackend.

The DistributedBackend is meant to instantiate and hold references to objects of these classes. new_frontera is bundled with database and in-memory implementations of Queue, Metadata, States and DomainMetadata, which can be combined in your custom backends or used standalone by directly instantiating a specific variant of FrontierManager.

DistributedBackend methods are called by the FrontierManager after Middleware, using hooks for Request and Response processing according to frontier data flow.

Unlike Middleware, which can have many different instances activated, only one DistributedBackend can be used per frontier.

Activating a backend

To activate a specific backend, set it through the :setting:`BACKEND` setting.

Here’s an example:

BACKEND = 'new_frontera.contrib.backends.memory.MemoryDistributedBackend'

Keep in mind that some backends may need additional configuration through their own settings. See the documentation of each backend for more information.

Writing your own backend

Each backend is a single Python class that inherits from DistributedBackend and uses one or all of Queue, Metadata, States and DomainMetadata.

FrontierManager communicates with the active backend through the methods described below.

class new_frontera.core.components.Backend

Interface definition for frontier backend.

Methods

frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

Returns:

None.

frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

Returns:

None.

abstract finished()

Quick check of whether crawling is finished. This method is called frequently, so keep it lightweight.

Returns:

boolean

abstract page_crawled(response)

This method is called every time a page has been crawled.

Parameters:

response (object) – The Response object for the crawled page.

Returns:

None.

abstract request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object whose crawl ended with an error.

  • error (string) – A string identifier for the error.

Returns:

None.

abstract get_next_requests(max_n_requests, **kwargs)

Returns a list of next requests to be crawled.

Parameters:
  • max_n_requests (int) – Maximum number of requests to be returned by this method.

  • kwargs (dict) – Parameters from the downloader component.

Returns:

list of Request objects.

Class Methods

classmethod from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    return cls(settings=manager.settings)

Properties

queue
Returns:

associated Queue object

states
Returns:

associated States object

metadata
Returns:

associated Metadata object
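
To make the interface concrete, here is a minimal sketch of a class that provides the methods listed above. It is a standalone toy, not code from new_frontera: in a real project the class would inherit from new_frontera.core.components.Backend, the enqueue() helper is hypothetical, requests are represented by plain strings, and a SimpleNamespace stands in for the FrontierManager.

```python
from types import SimpleNamespace


class ListBackend:
    """Toy backend keeping everything in Python lists.

    Assumption: real code would subclass
    new_frontera.core.components.Backend instead of being standalone.
    """

    def __init__(self, settings=None):
        self.settings = settings
        self._pending = []  # requests waiting to be crawled
        self.crawled = []   # responses received so far
        self.errors = []    # (request, error) pairs

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # nothing to open for an in-memory store

    def frontier_stop(self):
        pass  # nothing to close either

    def finished(self):
        # Cheap check: a single truthiness test on a list.
        return not self._pending

    def page_crawled(self, response):
        self.crawled.append(response)

    def request_error(self, page, error):
        self.errors.append((page, error))

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = self._pending[:max_n_requests]
        del self._pending[:max_n_requests]
        return batch

    def enqueue(self, request):
        """Hypothetical helper, not part of the documented interface."""
        self._pending.append(request)


# Stand-in for a FrontierManager: only the settings attribute is used.
manager = SimpleNamespace(settings={"BACKEND": "ListBackend"})
backend = ListBackend.from_manager(manager)
for url in ("http://a.example/", "http://b.example/", "http://c.example/"):
    backend.enqueue(url)
first_batch = backend.get_next_requests(2)
```

Because finished() only tests whether a list is empty, it stays cheap no matter how often the manager calls it.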

class new_frontera.core.components.DistributedBackend

Interface definition for a distributed frontier backend. It is intended for use in the strategy worker and the DB worker.

Inherits all methods of Backend and adds two class methods, which are called during strategy worker and DB worker instantiation.

classmethod DistributedBackend.strategy_worker(manager)
classmethod DistributedBackend.db_worker(manager)
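
The role of the two constructors can be sketched as follows. This is an illustration only: which components each worker needs is an assumption made here (the strategy worker gets States and DomainMetadata, the DB worker gets Queue and Metadata), plain dicts and lists stand in for real components, and the class does not inherit from anything in new_frontera.

```python
from types import SimpleNamespace


class SketchDistributedBackend:
    """Toy class showing one constructor per worker type."""

    def __init__(self, manager):
        self.settings = manager.settings
        # Components stay unset until a worker-specific constructor runs.
        self.queue = None
        self.metadata = None
        self.states = None
        self.domain_metadata = None

    @classmethod
    def strategy_worker(cls, manager):
        backend = cls(manager)
        backend.states = {}           # stand-in for a States object
        backend.domain_metadata = {}  # stand-in for DomainMetadata
        return backend

    @classmethod
    def db_worker(cls, manager):
        backend = cls(manager)
        backend.queue = []     # stand-in for a Queue object
        backend.metadata = {}  # stand-in for a Metadata object
        return backend


# Stand-in for a FrontierManager: only the settings attribute is used.
manager = SimpleNamespace(settings={})
strategy_side = SketchDistributedBackend.strategy_worker(manager)
db_side = SketchDistributedBackend.db_worker(manager)
```

Each process then holds connections only to the storage it actually uses.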

Backend should communicate with low-level storage by means of these classes:

Metadata

Used to store the contents of the crawl.

class new_frontera.core.components.Metadata

Interface definition for a frontier metadata class. This class is responsible for storing document metadata, including content, and is optimized for a write-only data flow.

Methods

abstract request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object whose crawl ended with an error.

  • error (string) – A string identifier for the error.

abstract page_crawled(response)

This method is called every time a page has been crawled.

Parameters:

response (object) – The Response object for the crawled page.

Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.

Queue

A priority queue used to persist requests scheduled for crawling.

class new_frontera.core.components.Queue

Interface definition for a frontier queue class. The queue has priorities and partitions.

Methods

abstract get_next_requests(max_n_requests, partition_id, **kwargs)

Returns a list of next requests to be crawled, and excludes them from internal storage.

Parameters:
  • max_n_requests (int) – Maximum number of requests to be returned by this method.

  • partition_id – Identifier of the queue partition to read requests from.

  • kwargs (dict) – Parameters from the downloader component.

Returns:

list of Request objects.

abstract schedule(batch)

Schedules new documents from batch for download and updates their scores in metadata.

Parameters:

batch – list of tuples (fingerprint, score, request, schedule). If schedule is True, the document is scheduled for download; if it is False, only the score in metadata is updated.

abstract count()

Returns count of documents in the queue.

Returns:

int

Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
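
The interplay of priorities, partitions and the batch format can be sketched with the standard heapq module. This is a toy under stated assumptions, not the bundled MemoryQueue: the _partition_for() partitioner is hypothetical, fingerprints are assumed to be hex strings, higher scores are assumed to be served first, and requests are plain strings.

```python
import heapq
import itertools


class HeapQueue:
    """Toy partitioned priority queue built on heapq."""

    def __init__(self, n_partitions):
        self._heaps = [[] for _ in range(n_partitions)]
        self._order = itertools.count()  # tie-breaker for equal scores

    def _partition_for(self, fingerprint):
        # Hypothetical partitioner: hex fingerprint modulo partition count.
        return int(fingerprint, 16) % len(self._heaps)

    def schedule(self, batch):
        for fingerprint, score, request, schedule in batch:
            if not schedule:
                continue  # a real queue would only update the score in metadata
            heap = self._heaps[self._partition_for(fingerprint)]
            # heapq is a min-heap, so the score is negated to pop the
            # highest-scored request first (an assumption about ordering).
            heapq.heappush(heap, (-score, next(self._order), request))

    def get_next_requests(self, max_n_requests, partition_id, **kwargs):
        heap = self._heaps[partition_id]
        requests = []
        while heap and len(requests) < max_n_requests:
            requests.append(heapq.heappop(heap)[2])  # popped items leave the queue
        return requests

    def count(self):
        return sum(len(heap) for heap in self._heaps)


queue = HeapQueue(n_partitions=2)
queue.schedule([
    ("0a", 0.2, "low-priority", True),   # 0x0a % 2 == 0
    ("0c", 0.9, "high-priority", True),  # 0x0c % 2 == 0
    ("0b", 0.5, "other-partition", True),  # 0x0b % 2 == 1
    ("0e", 1.0, "score-only", False),    # not scheduled
])
scheduled = queue.count()
partition_zero = queue.get_next_requests(10, partition_id=0)
```

The counter in each heap entry guarantees that two entries with equal scores are never compared by their request objects.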

States

A storage used for checking and storing link states, where a state is a short integer with one of the values described in new_frontera.core.components.States.

class new_frontera.core.components.States

Interface definition for a link states management class. This class is responsible for providing the actual link state and for persisting state changes in a batch-oriented manner.

Methods

abstract update_cache(objs)

Reads states from the meta['state'] field of each request in objs and stores them in the internal cache.

Parameters:

objs – list or tuple of Request objects.

abstract set_states(objs)

Sets the meta['state'] field from the cache for every request in objs.

Parameters:

objs – list or tuple of Request objects.

abstract flush()

Flushes internal cache to storage.

abstract fetch(fingerprints)

Loads states from persistent storage into the internal cache.

Parameters:

fingerprints – list of document fingerprints whose states should be read.

Known implementations are: MemoryStates and sqlalchemy.components.States.
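
The cache-then-flush flow of these four methods can be sketched with two dicts. This is a toy, not the bundled MemoryStates: the meta['fingerprint'] key, the NOT_CRAWLED default and its value are assumptions made for the example, and SimpleNamespace objects stand in for Request objects.

```python
from types import SimpleNamespace


class DictStates:
    """Toy States: one dict as cache, one dict as 'persistent' storage."""

    NOT_CRAWLED = 0  # hypothetical default for unknown fingerprints

    def __init__(self):
        self._cache = {}
        self._storage = {}

    def update_cache(self, objs):
        for obj in objs:
            self._cache[obj.meta["fingerprint"]] = obj.meta["state"]

    def set_states(self, objs):
        for obj in objs:
            fingerprint = obj.meta["fingerprint"]
            obj.meta["state"] = self._cache.get(fingerprint, self.NOT_CRAWLED)

    def flush(self):
        # One batched write instead of one write per state change.
        self._storage.update(self._cache)

    def fetch(self, fingerprints):
        for fingerprint in fingerprints:
            if fingerprint in self._storage:
                self._cache[fingerprint] = self._storage[fingerprint]


states = DictStates()
crawled = SimpleNamespace(meta={"fingerprint": "abc", "state": 2})
states.update_cache([crawled])  # remember the new state in the cache
states.flush()                  # persist it in a single batch

incoming = SimpleNamespace(meta={"fingerprint": "abc"})
states.fetch(["abc"])           # make sure the cache holds this fingerprint
states.set_states([incoming])   # incoming.meta["state"] now comes from the cache
```

Calling fetch() before set_states() matters once the cache has been evicted or the process restarted, because set_states() reads only from the cache.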

DomainMetadata

Used to store per-domain flags, counters or even robots.txt contents, helping the crawling strategy maintain features such as a per-domain limit on the number of crawled pages or automatic banning.

class new_frontera.core.components.DomainMetadata

Interface definition for a domain metadata storage. Its main purpose is to store per-domain metadata using Python-friendly structures. It is meant to be used by the crawling strategy to store counters and flags in the low-level facilities provided by the Backend.

Methods

abstract __setitem__(key, value)

Puts key, value tuple in storage.

Parameters:
  • key – str

  • value – Any

abstract __getitem__(key)

Retrieves the value associated with key from the storage. Raises KeyError if key is absent.

Parameters:

key – str

Return value:

Any

abstract __delitem__(key)

Removes the tuple associated with key from storage. Raises KeyError if key is absent.

Parameters:

key – str

__contains__(key)

Checks if key is present in the storage.

Parameters:

key – str

Returns:

boolean

Known implementations are: native dict and sqlalchemy.components.DomainMetadata.
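
As an illustration of how a crawling strategy might use this interface, the sketch below keeps a per-domain page counter and a ban flag. The register_page() function and the layout of the stored value are hypothetical; a native dict is used as the storage, and only __contains__, __getitem__ and __setitem__ are called, so any other implementation of the interface could be substituted.

```python
def register_page(domain_metadata, domain, max_pages):
    """Count one crawled page for domain.

    Returns True while the domain is within its limit and False once
    the limit has been exceeded and the domain is marked as banned.
    """
    if domain in domain_metadata:
        entry = domain_metadata[domain]
    else:
        entry = {"pages": 0, "banned": False}
    entry["pages"] += 1
    if entry["pages"] > max_pages:
        entry["banned"] = True
    # Write the value back explicitly: a persistent store would not
    # notice in-place changes to an object it returned earlier.
    domain_metadata[domain] = entry
    return not entry["banned"]


domain_metadata = {}  # native dict used as the storage
results = [register_page(domain_metadata, "example.com", max_pages=2)
           for _ in range(3)]
```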

Built-in backend reference

This article describes all backend components that come bundled with new_frontera.

Memory backend

This implementation uses the heapq module to store the request queue and native dicts for everything else. It is meant for educational or testing purposes only.

class new_frontera.contrib.backends.memory.MemoryDistributedBackend(manager)

SQLAlchemy backends

This implementation uses RDBMS storage through the SQLAlchemy library.

By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.

If you need to use your own declarative SQLAlchemy models, you can do so with the :setting:`SQLALCHEMYBACKEND_MODELS` setting.

For a complete list of all settings used for SQLAlchemy backends check the settings section.

HBase backend

More suitable for large-scale web crawlers. The settings reference can be found under HBase backend. Consider tuning the block cache so that the states of an average-sized website fit within one block. To achieve this, it is recommended to use hostname_local_fingerprint, which keeps documents from the same host close together. This function can be selected with the :setting:`URL_FINGERPRINT_FUNCTION` setting.
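
The idea behind a host-local fingerprint can be illustrated with a short sketch. The function below is not the library's hostname_local_fingerprint; it is a hypothetical stand-in, host_prefixed_fingerprint, that prefixes a hash of the URL with a hash of the hostname, so that keys of documents from one host share a prefix and sort next to each other in a row-ordered store.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_prefixed_fingerprint(url):
    """Return a 40-character hex key: 8 chars for the host, 32 for the URL."""
    host = urlparse(url).hostname or ""
    # CRC32 of the hostname, zero-padded to 8 hex digits.
    host_part = format(zlib.crc32(host.encode("utf-8")), "08x")
    # MD5 of the full URL distinguishes documents within the host.
    url_part = hashlib.md5(url.encode("utf-8")).hexdigest()
    return host_part + url_part


page_one = host_prefixed_fingerprint("http://example.com/page1")
page_two = host_prefixed_fingerprint("http://example.com/page2")
other_host = host_prefixed_fingerprint("http://example.org/page1")
```

Here page_one and page_two begin with the same 8 characters, while other_host starts with a different prefix.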

Redis backend

This is similar to the HBase backend. It is suitable for large-scale crawlers that still have a limited scope. Make sure Redis is allowed to use enough memory to store all the data the crawler needs. If Redis runs out of memory, the crawler logs this and continues. When the crawler is unable to write metadata or queue items to the database, those metadata or queue items are lost.

In case of connection errors, the crawler attempts to reconnect three times. If the third attempt to connect to Redis fails, the worker skips that Redis operation and continues operating.
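
The retry-then-skip behaviour can be sketched generically. This is not the backend's actual code: run_with_retries() is a hypothetical helper, it assumes three attempts in total, and it catches Python's built-in ConnectionError by default. With a real Redis client, the client library's own connection exception would have to be passed in through retry_on.

```python
import logging

logger = logging.getLogger("redis_retry_sketch")


def run_with_retries(operation, attempts=3, retry_on=(ConnectionError,)):
    """Call operation(); on a connection error retry, then give up.

    Returns (True, result) on success and (False, None) when every
    attempt failed, so the caller can skip the operation and carry on.
    """
    for attempt in range(1, attempts + 1):
        try:
            return True, operation()
        except retry_on as exc:
            logger.warning("attempt %d of %d failed: %s", attempt, attempts, exc)
    logger.error("operation skipped after %d failed attempts", attempts)
    return False, None


# Usage: an operation that fails twice and then succeeds.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection refused")
    return "stored"


ok, result = run_with_retries(flaky_write)
```

Returning a success flag instead of raising mirrors the documented behaviour: the worker keeps running, and the data of the skipped operation is lost.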