Backends¶
A DistributedBackend separates the higher-level crawling strategy code
from the low-level storage API. Queue, Metadata, States and
DomainMetadata are inner components of the DistributedBackend.
The DistributedBackend instantiates these classes and holds references to the resulting objects. new_frontera is
bundled with database and in-memory implementations of Queue, Metadata, States and DomainMetadata, which can be combined
in your custom backends or used standalone by directly instantiating a specific variant of
FrontierManager.
DistributedBackend methods are called by the FrontierManager after
Middleware, using hooks for Request and Response processing
according to the frontier data flow.
Unlike Middleware, which can have many different instances activated, only one DistributedBackend can be used per frontier.
Activating a backend¶
To activate a specific backend, set it through the :setting:`BACKEND` setting.
Here’s an example:
BACKEND = 'new_frontera.contrib.backends.memory.MemoryDistributedBackend'
Keep in mind that some backends may need additional configuration through a particular setting. See the backends documentation for more info.
Writing your own backend¶
Each backend is a single Python class that inherits from
DistributedBackend and uses one or all of
Queue, Metadata, States and DomainMetadata.
FrontierManager will communicate with the active backend through the methods described below.
- class new_frontera.core.components.Backend¶
Interface definition for frontier backend.
Methods
- frontier_start()¶
Called when the frontier starts, see starting/stopping the frontier.
- Returns:
None.
- frontier_stop()¶
Called when the frontier stops, see starting/stopping the frontier.
- Returns:
None.
- abstract finished()¶
Quick check if crawling is finished. This method is called very often, so make sure it is lightweight.
- Returns:
boolean
- abstract page_crawled(response)¶
This method is called every time a page has been crawled.
- Parameters:
response (object) – The Response object for the crawled page.
- Returns:
None.
- abstract request_error(page, error)¶
This method is called each time an error occurs when crawling a page.
- Parameters:
page (object) – The Request object that was crawled with an error.
error (string) – A string identifier for the error.
- Returns:
None.
- abstract get_next_requests(max_n_requests, **kwargs)¶
Returns a list of next requests to be crawled.
- Parameters:
max_n_requests (int) – Maximum number of requests to be returned by this method.
kwargs (dict) – Parameters from the downloader component.
- Returns:
list of Request objects.
Class Methods
- classmethod from_manager(manager)¶
Class method called from FrontierManager, passing the manager itself.
Example of usage:
def from_manager(cls, manager):
    return cls(settings=manager.settings)
Properties
- class new_frontera.core.components.DistributedBackend¶
Interface definition for a distributed frontier backend. It is intended for use in the strategy worker and the DB worker.
Inherits all methods of Backend and adds two class methods, which are called during strategy worker and DB worker instantiation.
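As an illustration of the Backend interface documented above, here is a minimal, self-contained sketch. It does not import new_frontera: `BackendInterface` is a hypothetical stand-in that only mirrors the documented method names, and `FifoBackend`, its `add()` helper and the plain-string "requests" are assumptions made for the example. A real backend would subclass DistributedBackend and delegate to Queue, Metadata, States and DomainMetadata instead.

```python
from abc import ABC, abstractmethod
from collections import deque


class BackendInterface(ABC):
    """Hypothetical stand-in mirroring the documented Backend methods."""

    def frontier_start(self):
        """Called when the frontier starts."""

    def frontier_stop(self):
        """Called when the frontier stops."""

    @abstractmethod
    def finished(self): ...

    @abstractmethod
    def page_crawled(self, response): ...

    @abstractmethod
    def request_error(self, page, error): ...

    @abstractmethod
    def get_next_requests(self, max_n_requests, **kwargs): ...


class FifoBackend(BackendInterface):
    """Toy backend: a FIFO of pending requests, no priorities or storage."""

    def __init__(self, settings=None):
        self.settings = settings or {}
        self.pending = deque()
        self.crawled = []
        self.errors = []

    @classmethod
    def from_manager(cls, manager):
        # Same pattern as the documented usage example.
        return cls(settings=manager.settings)

    def add(self, request):
        # Hypothetical helper for the sketch; not part of the interface.
        self.pending.append(request)

    def finished(self):
        # Kept cheap on purpose: this check is called very often.
        return not self.pending

    def page_crawled(self, response):
        self.crawled.append(response)

    def request_error(self, page, error):
        self.errors.append((page, error))

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self.pending and len(batch) < max_n_requests:
            batch.append(self.pending.popleft())
        return batch
```

The sketch keeps `finished()` to a single emptiness check, in line with the note that this method must stay lightweight.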
Backend should communicate with low-level storage by means of these classes:
Metadata¶
Is used to store the contents of the crawl.
- class new_frontera.core.components.Metadata¶
Interface definition for a frontier metadata class. This class is responsible for storing document metadata, including content, and is optimized for a write-only data flow.
Methods
Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.
Queue¶
Is a priority queue used to persist requests scheduled for crawling.
- class new_frontera.core.components.Queue¶
Interface definition for a frontier queue class. The queue has priorities and partitions.
Methods
- abstract get_next_requests(max_n_requests, partition_id, **kwargs)¶
Returns a list of next requests to be crawled, and excludes them from internal storage.
- Parameters:
max_n_requests (int) – Maximum number of requests to be returned by this method.
kwargs (dict) – Parameters from the downloader component.
- Returns:
list of Request objects.
- abstract schedule(batch)¶
Schedules new documents from batch for download and updates their scores in metadata.
- Parameters:
batch – list of tuples (fingerprint, score, request, schedule). If schedule is True, the document needs to be scheduled for download; if False, only its score in metadata is updated.
- abstract count()¶
Returns count of documents in the queue.
- Returns:
int
Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
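The Queue interface can be sketched with the standard heapq module. This is an illustrative stand-in, not the bundled MemoryQueue: the class name `HeapQueue`, the CRC32-based default partitioner, and the convention that a higher score means higher priority are all assumptions. Entries with schedule set to False are simply skipped, because this sketch keeps no metadata store in which to update scores.

```python
import heapq
import itertools
import zlib


class HeapQueue:
    """Illustrative in-memory priority queue with partitions."""

    def __init__(self, partitions, partitioner=None):
        self._partitions = list(partitions)
        self._heaps = {p: [] for p in self._partitions}
        # Hypothetical partitioner: maps a fingerprint to a partition id.
        self._partitioner = partitioner or self._crc32_partitioner
        # Tie-breaker so that request objects are never compared directly.
        self._seq = itertools.count()

    def _crc32_partitioner(self, fingerprint):
        idx = zlib.crc32(fingerprint.encode("utf-8")) % len(self._partitions)
        return self._partitions[idx]

    def schedule(self, batch):
        for fingerprint, score, request, schedule in batch:
            if not schedule:
                # Score-only update: nothing to enqueue in this sketch.
                continue
            partition_id = self._partitioner(fingerprint)
            # heapq is a min-heap, so negate the score (assumption:
            # higher score means the request is crawled earlier).
            heapq.heappush(
                self._heaps[partition_id],
                (-score, next(self._seq), request),
            )

    def get_next_requests(self, max_n_requests, partition_id, **kwargs):
        heap = self._heaps[partition_id]
        batch = []
        # Popping removes the returned requests from internal storage.
        while heap and len(batch) < max_n_requests:
            _, _, request = heapq.heappop(heap)
            batch.append(request)
        return batch

    def count(self):
        return sum(len(heap) for heap in self._heaps.values())
```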
States¶
Is a storage used for checking and storing link states, where a state is a short integer taking one of the values described in
new_frontera.core.components.States.
- class new_frontera.core.components.States¶
Interface definition for a link states management class. This class is responsible for providing the actual link state and persisting state changes in a batch-oriented manner.
Methods
- abstract update_cache(objs)¶
Reads states from the meta['state'] field of each request in objs and stores them in the internal cache.
- Parameters:
objs – list or tuple of Request objects.
- abstract set_states(objs)¶
Sets the meta['state'] field from the cache for every request in objs.
- Parameters:
objs – list or tuple of Request objects.
- abstract flush()¶
Flushes internal cache to storage.
- abstract fetch(fingerprints)¶
Fetches states from persistent storage into the internal cache.
- Parameters:
fingerprints – list of document fingerprints whose states are to be read.
Known implementations are: MemoryStates and sqlalchemy.components.States.
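The interplay of the four States methods (fetch into the cache, apply to requests, record changes, flush in a batch) can be sketched as follows. Everything here is hypothetical scaffolding: the `Request` class carries only a meta dict, the meta keys are plain strings, the state constants use illustrative integer values, and defaulting unknown fingerprints to NOT_CRAWLED is an assumption of this sketch rather than documented behaviour.

```python
class Request:
    """Hypothetical minimal request: only the meta dict matters here."""

    def __init__(self, fingerprint, state=None):
        self.meta = {"fingerprint": fingerprint}
        if state is not None:
            self.meta["state"] = state


class DictStates:
    """Illustrative States implementation backed by a plain dict."""

    # Illustrative values; the real constants live in the States class.
    NOT_CRAWLED, QUEUED, CRAWLED, ERROR = range(4)

    def __init__(self, storage=None):
        self._storage = storage if storage is not None else {}
        self._cache = {}

    def update_cache(self, objs):
        # Copy each request's current state into the cache.
        for obj in objs:
            self._cache[obj.meta["fingerprint"]] = obj.meta["state"]

    def set_states(self, objs):
        # Assumption: fingerprints missing from the cache are NOT_CRAWLED.
        for obj in objs:
            obj.meta["state"] = self._cache.get(
                obj.meta["fingerprint"], self.NOT_CRAWLED
            )

    def flush(self):
        # Persist all cached states in one batch.
        self._storage.update(self._cache)

    def fetch(self, fingerprints):
        # Load known states from storage into the cache.
        for fingerprint in fingerprints:
            if fingerprint in self._storage:
                self._cache[fingerprint] = self._storage[fingerprint]
```

A typical cycle in this sketch is fetch(), set_states(), some state changes on the requests, update_cache(), and finally flush(); until flush() is called, the backing storage is untouched.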
DomainMetadata¶
Is used to store per-domain flags, counters or even robots.txt contents to help crawling strategy maintain features like per-domain number of crawled pages limit or automatic banning.
- class new_frontera.core.components.DomainMetadata¶
Interface definition for a domain metadata storage. Its main purpose is to store per-domain metadata using Python-friendly structures. It is meant to be used by the crawling strategy to store counters and flags in the low-level facilities provided by the Backend.
Methods
- abstract __setitem__(key, value)¶
Puts a (key, value) pair in storage.
- Parameters:
key – str
value – Any
- abstract __getitem__(key)¶
Retrieves the value associated with key from storage. Raises KeyError if key is absent.
- Parameters:
key – str
- Return value:
Any
- abstract __delitem__(key)¶
Removes the tuple associated with key from storage. Raises KeyError if key is absent.
- Parameters:
key – str
- __contains__(key)¶
Checks if key is present in the storage.
- Parameters:
key – str
- Returns:
boolean
Known implementations are: native dict and sqlalchemy.components.DomainMetadata.
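Because DomainMetadata exposes the mapping protocol, a crawling strategy can be written against any dict-like object. The following sketch shows a per-domain page limit with a ban flag; the function name `allow_page` and the record layout (`pages`, `banned`) are hypothetical choices for the example, not part of new_frontera.

```python
def allow_page(domain_metadata, domain, max_pages):
    """Return True and count the page if the domain is under its limit.

    domain_metadata is any object supporting __getitem__ and
    __setitem__, for example a native dict.
    """
    try:
        record = domain_metadata[domain]
    except KeyError:
        # First time this domain is seen: start a fresh record.
        record = {"pages": 0, "banned": False}

    if record["banned"] or record["pages"] >= max_pages:
        return False

    record["pages"] += 1
    # Write the record back explicitly: a database-backed storage may
    # return a copy rather than a live reference.
    domain_metadata[domain] = record
    return True


if __name__ == "__main__":
    metadata = {}
    for _ in range(3):
        print(allow_page(metadata, "example.com", max_pages=2))
```

Writing the record back after every change keeps the sketch valid for storages that serialize values, at the cost of one extra write per page.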
Built-in backend reference¶
This article describes all backend components that come bundled with new_frontera.
Memory backend¶
This implementation uses the heapq module to store the requests queue and native dicts for everything else. It is meant for educational or testing purposes only.
- class new_frontera.contrib.backends.memory.MemoryDistributedBackend(manager)¶
SQLAlchemy backends¶
This implementation uses RDBMS storage through the SQLAlchemy library.
By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
If you need to use your own declarative sqlalchemy models, you can do it by using the :setting:`SQLALCHEMYBACKEND_MODELS` setting.
For a complete list of all settings used for SQLAlchemy backends check the settings section.
HBase backend¶
Is more suitable for large-scale web crawlers. The settings reference can be found under HBase backend. Consider
tuning the block cache so that the states of an average-size website fit within one block. To achieve this, it’s recommended to use
hostname_local_fingerprint, which keeps documents from the same host close together.
This function can be selected with the :setting:`URL_FINGERPRINT_FUNCTION` setting.
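The idea behind a host-local fingerprint can be illustrated with a short sketch: prefix the URL hash with a short hash of the hostname, so that keys for one host share a prefix and sort next to each other in an ordered store. This is not the implementation of hostname_local_fingerprint; the function name `host_prefixed_fingerprint` and the choice of CRC32 plus SHA-1 are assumptions made for the example.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_prefixed_fingerprint(url):
    """Return a 48-character hex key: 8 chars for the host, 40 for the URL."""
    host = urlparse(url).hostname or ""
    # Short, fixed-width host component shared by all URLs of one host.
    host_part = "%08x" % (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF)
    # Full URL hash keeps individual documents distinct.
    url_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return host_part + url_part


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.org/a",
        "http://example.com/b",
    ]
    # After sorting, keys from the same host end up adjacent.
    for key in sorted(host_prefixed_fingerprint(u) for u in urls):
        print(key)
```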
Redis backend¶
This is similar to the HBase backend. It is suitable for large-scale crawlers that still have a limited scope. It is recommended to ensure Redis is allowed to use enough memory to store all the data the crawler needs. If Redis runs out of memory, the crawler will log this and continue. When the crawler is unable to write metadata or queue items to the database, that metadata or those queue items are lost.
In case of connection errors, the crawler will attempt to reconnect three times. If the third attempt at connecting to Redis fails, the worker will skip that Redis operation and continue operating.
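The retry-then-skip behaviour described above can be sketched generically, without a Redis client. The helper `run_with_retries` is hypothetical and is not the backend's actual code. It assumes three attempts in total, uses the built-in ConnectionError as a stand-in for the client library's connection exception, and reports a skipped operation by returning `(False, None)`.

```python
import logging
import time

logger = logging.getLogger("redis_retry_sketch")


def run_with_retries(operation, attempts=3, delay=0.0,
                     retry_on=(ConnectionError,)):
    """Run operation(); on connection errors retry, then skip.

    Returns (True, result) on success, or (False, None) if every
    attempt failed and the operation was skipped.
    """
    for attempt in range(1, attempts + 1):
        try:
            return True, operation()
        except retry_on as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts and delay:
                time.sleep(delay)
    # All attempts failed: log and let the worker carry on.
    logger.error("operation skipped after %d failed attempts", attempts)
    return False, None
```

Returning a success flag instead of raising mirrors the described behaviour: the worker continues, and the data for the skipped operation is lost.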