Frontier objects¶
The frontier uses two object types: Request
and Response. They represent the HTTP requests and
responses of a crawl, respectively.
These classes are used by most Frontier API methods either as a parameter or as a return value depending on the method used.
Frontier also uses these objects to internally communicate between different components (middlewares and backend).
Request objects¶
- class new_frontera.core.models.Request(url, method=b'GET', headers=None, cookies=None, meta=None, body='')¶
A Request object represents an HTTP request, which is generated for seeds, extracted page links and next pages to crawl. Each one should be associated to a Response object when crawled.
- Parameters:
url (string) – URL to send.
method (string) – HTTP method to use.
headers (dict) – dictionary of headers to send.
cookies (dict) – dictionary of cookies to attach to this request.
meta (dict) – dictionary that contains arbitrary metadata for this request. The keys must be bytes, and the values must be either bytes or serializable objects such as lists, tuples, or dictionaries with byte-type items.
- property body¶
A string representing the request body.
- property cookies¶
Dictionary of cookies to attach to this request.
- property headers¶
A dictionary which contains the request headers.
- property meta¶
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests and is usually populated by new_frontera components (middlewares, etc.), so its contents depend on the components you have enabled. The keys are bytes, and the values are either bytes or serializable objects such as lists, tuples, or dictionaries with byte-type items.
- property method¶
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Examples: GET, POST, PUT, etc.
- property url¶
A string containing the URL of this request.
Response objects¶
- class new_frontera.core.models.Response(url, status_code=200, headers=None, body='', request=None)¶
A Response object represents an HTTP response, which is usually downloaded (by the crawler) and sent back to the frontier for processing.
- Parameters:
url (string) – URL of this response.
status_code (int) – the HTTP status of the response. Defaults to 200.
headers (dict) – dictionary of headers of this response.
body (str) – the response body.
request (Request) – The Request object that generated this response.
- property body¶
A str containing the body of this Response.
- property headers¶
A dictionary object which contains the response headers.
- property meta¶
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
- property status_code¶
An integer representing the HTTP status of the response. Examples: 200, 404, 500.
- property url¶
A string containing the URL of the response.
The domain and fingerprint fields are added to meta by built-in middlewares.
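The relationship between the two classes can be sketched with simplified stand-ins. The classes below (SketchRequest and SketchResponse) are hypothetical: they only mirror the constructor signatures documented above and are not the new_frontera implementation. The b'depth' metadata key is likewise an assumption made for illustration.

```python
class SketchRequest:
    """Hypothetical stand-in mirroring the documented Request signature."""

    def __init__(self, url, method=b'GET', headers=None, cookies=None,
                 meta=None, body=''):
        self.url = url
        # The documentation states that the method is always uppercase.
        self.method = method.upper()
        self.headers = {} if headers is None else headers
        self.cookies = {} if cookies is None else cookies
        # meta starts empty for new requests; components populate it later.
        self.meta = {} if meta is None else meta
        self.body = body


class SketchResponse:
    """Hypothetical stand-in mirroring the documented Response signature."""

    def __init__(self, url, status_code=200, headers=None, body='',
                 request=None):
        self.url = url
        self.status_code = status_code
        self.headers = {} if headers is None else headers
        self.body = body
        self.request = request

    @property
    def meta(self):
        # Shortcut to self.request.meta; this sketch assumes a request was given.
        return self.request.meta


request = SketchRequest('http://example.com', method=b'get')
request.meta[b'depth'] = 1  # assumed key, for illustration only

response = SketchResponse(request.url, status_code=200,
                          body='<html></html>', request=request)
```

Because the response's meta property returns the request's own dictionary, metadata attached before the download remains visible when the response is handed back to the frontier.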
Identifying unique objects¶
As frontier objects are shared between the crawler and the frontier, some mechanism to uniquely identify objects is needed. This method may vary depending on the frontier logic (in most cases due to the backend used).
By default, new_frontera activates the fingerprint middleware, which
generates a unique fingerprint from the Request.url
and Response.url fields and adds it to the
Request.meta and
Response.meta fields respectively. You can use
this middleware or implement your own method of identifying frontier objects.
An example of a generated fingerprint for a Request object:
>>> request.url
'http://thehackernews.com'
>>> request.meta['fingerprint']
'198d99a8b2284701d6c147174cd69a37a7dea90f'
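As a rough illustration of how a URL-based fingerprint might be computed, the sketch below hashes the URL with SHA-1. This is an assumption: the 40-hex-digit value shown above is consistent with a SHA-1 digest, but the middleware may normalize the URL first or use a different function, so the output here is not claimed to match. The helper name url_fingerprint is hypothetical.

```python
import hashlib


def url_fingerprint(url):
    """Return the hex SHA-1 digest of a URL (hypothetical helper)."""
    return hashlib.sha1(url.encode('utf-8')).hexdigest()


fingerprint = url_fingerprint('http://thehackernews.com')
same_again = url_fingerprint('http://thehackernews.com')
other = url_fingerprint('http://thehackernews.com/about')
```

The property that matters for identification is that the same URL always yields the same fingerprint, while different URLs yield different ones.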
Adding additional data to objects¶
In most cases frontier objects can be used to represent the information needed to manage the frontier logic/policy.
Also, additional data can be stored by components using the
Request.meta and
Response.meta fields.
For instance, the frontier domain middleware, if activated, adds a domain info field to every
Request.meta and
Response.meta:
>>> request.url
'http://www.scrapinghub.com'
>>> request.meta['domain']
{
"name": "scrapinghub.com",
"netloc": "www.scrapinghub.com",
"scheme": "http",
"sld": "scrapinghub",
"subdomain": "www",
"tld": "com"
}
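A dictionary with the same keys can be derived from a URL using only the standard library. The following is a minimal sketch, not the domain middleware's actual code: the function name parse_domain_info is hypothetical, and it naively treats the last label of the host name as the TLD, which gives wrong results for multi-label suffixes such as co.uk.

```python
from urllib.parse import urlparse


def parse_domain_info(url):
    """Build a domain info dict from a URL (hypothetical, naive helper)."""
    parts = urlparse(url)
    # hostname is lowercased and has any port or credentials removed.
    labels = (parts.hostname or '').split('.')
    tld = labels[-1]
    sld = labels[-2] if len(labels) >= 2 else ''
    return {
        'name': '.'.join(labels[-2:]),
        'netloc': parts.netloc,
        'scheme': parts.scheme,
        'sld': sld,
        'subdomain': '.'.join(labels[:-2]),
        'tld': tld,
    }


domain = parse_domain_info('http://www.scrapinghub.com')
```

Handling public suffixes correctly would require a suffix list, which this sketch deliberately leaves out.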