Middlewares
Frontier Middleware sits between FrontierManager and Backend objects, using hooks for Request and Response processing according to the frontier data flow.
It’s a light, low-level system for filtering and altering Frontier’s requests and responses.
Activating a middleware
To activate a Middleware component, add it to the :setting:`MIDDLEWARES` setting, which is a list whose values can be class paths or instances of Middleware objects.
Here’s an example:
    MIDDLEWARES = [
        'new_frontera.contrib.middlewares.domain.DomainMiddleware',
    ]
Middlewares are called in the order in which they are defined in the list. To decide where to place your middleware, pick its position according to where you want it to run relative to the others. The order matters because each middleware performs a different action, and your middleware could depend on some previous (or subsequent) middleware being applied.
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware’s documentation for more info.
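As an illustration of why order matters, the settings fragment below lists two of the built-in middlewares described later in this document. The ordering rationale is an assumption drawn from those descriptions (the domain fingerprint is stored inside the ``domain`` field, so the field has to exist first), not a statement about the library's defaults.

```python
# Illustrative settings fragment. The dependency between these two
# middlewares is inferred from their descriptions, not from the source code.
MIDDLEWARES = [
    # Runs first: adds the "domain" info field to Request.meta / Response.meta.
    'new_frontera.contrib.middlewares.domain.DomainMiddleware',
    # Runs second: adds a fingerprint to the already existing "domain" field.
    'new_frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
```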
Writing your own middleware
Writing your own frontier middleware is easy. Each Middleware component is a single Python class inherited from Component.
FrontierManager will communicate with all active middlewares through the methods described below.
- class new_frontera.core.components.Middleware
Interface definition for a frontier middleware.
Methods
- frontier_start()
Called when the frontier starts, see starting/stopping the frontier.
- frontier_stop()
Called when the frontier stops, see starting/stopping the frontier.
- abstract page_crawled(response)
This method is called every time a page has been crawled.
Should either return None or a Response object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Response object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page, just return None.
- abstract request_error(page, error)
This method is called each time an error occurs when crawling a page.
- Parameters:
request (object) – The Request object that was crawled with an error.
error (string) – A string identifier for the error.
- Returns:
Request or None
Should either return None or a Request object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Request object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page error, just return None.
Class Methods
- classmethod from_manager(manager)
Class method called from FrontierManager, passing the manager itself.
Example of usage:
    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)
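The interface above can be sketched end to end as follows. This is a minimal, self-contained illustration: ``Response``, ``StatusFilterMiddleware`` and ``run_page_crawled`` are hypothetical stand-ins written for this example, and the loop only imitates the behaviour described above (a ``None`` result stops the chain). A real middleware would inherit from ``new_frontera.core.components.Middleware`` and be driven by FrontierManager.

```python
class Response:
    """Hypothetical stand-in for the frontier Response object."""

    def __init__(self, url, status_code=200):
        self.url = url
        self.status_code = status_code
        self.meta = {}


class StatusFilterMiddleware:
    """Hypothetical middleware that filters out non-200 responses."""

    def __init__(self, settings=None):
        self.settings = settings

    @classmethod
    def from_manager(cls, manager):
        # Build the middleware from the manager's settings.
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def page_crawled(self, response):
        # Returning None filters the page: later middlewares and the
        # backend are never notified about it.
        if response.status_code != 200:
            return None
        return response

    def request_error(self, page, error):
        # Pass the failed request on unchanged.
        return page


def run_page_crawled(middlewares, response):
    """Imitate the described flow: call each middleware in list order
    and stop as soon as one of them returns None."""
    for middleware in middlewares:
        response = middleware.page_crawled(response)
        if response is None:
            return None
    return response  # this is what would finally reach the backend
```

With this sketch, a 200 response passes through the chain unchanged, while a 404 response is dropped by the first middleware and nothing after it runs.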
Built-in middleware reference
This page describes all Middleware components that come with new_frontera.
For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their orders) see the :setting:`MIDDLEWARES` setting.
DomainMiddleware
- class new_frontera.contrib.middlewares.domain.DomainMiddleware
This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.
The domain object will contain the following fields, with both keys and values as bytes:
netloc: URL netloc according to RFC 1808 syntax specifications
name: Domain name
scheme: URL scheme
tld: Top level domain
sld: Second level domain
subdomain: URL subdomain(s)
An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080/this/is/an/url'
    >>> request.meta['domain']
    {
        "name": "scrapinghub.com",
        "netloc": "www.scrapinghub.com",
        "scheme": "http",
        "sld": "scrapinghub",
        "subdomain": "www",
        "tld": "com"
    }
If :setting:`TEST_MODE` is active, it will also accept testing URLs, parsing the leading letter as the domain:
    >>> request.url
    'A1'
    >>> request.meta['domain']
    {
        "name": "A",
        "netloc": "A",
        "scheme": "-",
        "sld": "-",
        "subdomain": "-",
        "tld": "-"
    }
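To make the meaning of the fields more concrete, here is a naive sketch that derives them from a URL with the standard library. It is not the parser DomainMiddleware uses: ``naive_domain_info`` is a hypothetical helper, it returns ``str`` values rather than bytes, and it assumes that the last label of the host is the TLD, which is wrong for multi-part suffixes such as ``co.uk``.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Hypothetical helper: split a URL into domain-like fields.

    Assumes the TLD is the last dot-separated label of the host.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ''  # host without port or credentials
    labels = host.split('.') if host else []
    has_domain = len(labels) >= 2
    return {
        'netloc': host,
        'name': '.'.join(labels[-2:]),
        'scheme': parsed.scheme,
        'tld': labels[-1] if has_domain else '',
        'sld': labels[-2] if has_domain else '',
        'subdomain': '.'.join(labels[:-2]),
    }
```

For the first example URL above this yields the same field values as shown, but only because ``com`` is a single-label suffix.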
UrlFingerprintMiddleware
- class new_frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware
This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object URL, using the function defined in the :setting:`URL_FINGERPRINT_FUNCTION` setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080'
    >>> request.meta['fingerprint']
    '60d846bc2969e9706829d5f1690f11dafb70ed18'
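A custom fingerprint function could look like the sketch below. ``sha1_fingerprint`` and the module path ``myproject.fingerprints`` are hypothetical names, and it is an assumption here that the setting takes a dotted import path to the function; check the :setting:`URL_FINGERPRINT_FUNCTION` documentation before relying on it.

```python
import hashlib


def sha1_fingerprint(key):
    """Hypothetical custom fingerprint: hex SHA-1 of the URL, as bytes."""
    if isinstance(key, str):
        key = key.encode('utf-8')
    # hexdigest() returns str, so encode it to satisfy the bytes requirement.
    return hashlib.sha1(key).hexdigest().encode('ascii')


# Assumed form of the setting: dotted path to the function above,
# living in a hypothetical module myproject/fingerprints.py.
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.sha1_fingerprint'
```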
- new_frontera.utils.fingerprint.hostname_local_fingerprint(key)
This function is used for URL fingerprinting, which serves to uniquely identify the document in storage.
hostname_local_fingerprint builds the fingerprint by taking the first 4 bytes as a CRC32 of the host and the remaining bytes as an MD5 of the rest of the URL. This default is chosen to make use of the HBase block cache: all the documents of an average website are expected to fit within one cache block, which can then be read efficiently from disk in a single pass.
- Parameters:
key – str URL
- Returns:
str 20-byte fingerprint as a hex string
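The scheme described above can be sketched with the standard library as follows. This is an illustration of the idea (4-byte CRC32 of the host followed by a 16-byte MD5 of the rest of the URL), not the library's actual code: ``sketch_hostname_local_fingerprint`` is a hypothetical name, and which URL parts feed the MD5, and how they are joined, is an assumption.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def sketch_hostname_local_fingerprint(url):
    """Hypothetical sketch: CRC32(host) + MD5(rest of URL), hex-encoded."""
    parsed = urlparse(url)
    host = parsed.hostname or '-'
    # Mask to an unsigned 32-bit value so it packs into exactly 4 bytes.
    host_checksum = zlib.crc32(host.encode('utf-8')) & 0xFFFFFFFF
    # Assumption: everything after the host contributes to the MD5 part.
    rest = parsed.path + ';' + parsed.params + parsed.query + parsed.fragment
    doc_digest = hashlib.md5(rest.encode('utf-8')).digest()  # 16 bytes
    # 4 + 16 = 20 bytes, i.e. 40 hexadecimal characters.
    packed = struct.pack('>I16s', host_checksum, doc_digest)
    return hexlify(packed).decode('ascii')
```

Because the host checksum comes first, fingerprints of pages from the same host share an 8-character hex prefix, which is what keeps them adjacent in a sorted key space such as HBase row keys.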
DomainFingerprintMiddleware
- class new_frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware
This Middleware will add a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object URL, using the function defined in the :setting:`DOMAIN_FINGERPRINT_FUNCTION` setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080'
    >>> request.meta['domain']
    {
        "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
        "name": "scrapinghub.com",
        "netloc": "www.scrapinghub.com",
        "scheme": "http",
        "sld": "scrapinghub",
        "subdomain": "www",
        "tld": "com"
    }