new_frontera at a glance
new_frontera is an implementation of a crawl frontier, the web crawler component that accumulates URLs/links before they are downloaded from the web. The main features of new_frontera are:
online-processing oriented,
distributed spiders and backends architecture,
customizable crawling policy,
easy integration with Scrapy,
relational database support (MySQL, PostgreSQL, SQLite, and more) through SQLAlchemy, plus the HBase key-value database, out of the box,
ZeroMQ and Kafka message bus implementations for distributed crawlers,
precise tuning of crawling logic through crawl emulation, using fake sitemaps with the Graph Manager,
transparent transport layer concept (message bus) and communication protocol,
pure Python implementation,
Python 3 support.
Use cases
Here are a few cases where an external crawl frontier can be suitable:
URL ordering/queueing must be isolated from the spider (e.g. a distributed cluster of spiders, or a need to manage ordering/queueing remotely),
URL (meta)data storage is needed (e.g. to be able to pause and resume the crawl),
advanced URL ordering logic is needed and it’s hard to maintain that code within the spider/fetcher.
One-time crawl, few websites
For this use case, single process mode is probably the most appropriate. new_frontera offers these prioritization models out of the box:
FIFO,
LIFO,
Breadth-first (BFS),
Depth-first (DFS),
based on provided score, mapped from 0.0 to 1.0.
If a website is big and crawling all of it is expensive, new_frontera can be used to point the crawler to the most important documents.
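The prioritization models listed above differ only in the key used to order pending URLs. The following is a minimal, self-contained sketch of that idea and not the new_frontera API: the `ToyFrontier` class, its `add` and `next_url` methods, and the `depth` and `score` parameters are all hypothetical names chosen for illustration.

```python
import heapq
from itertools import count


class ToyFrontier:
    """Hypothetical in-memory frontier; illustrates ordering only."""

    MODELS = ("fifo", "lifo", "bfs", "dfs", "score")

    def __init__(self, model="fifo"):
        if model not in self.MODELS:
            raise ValueError(f"unknown model: {model!r}")
        self.model = model
        self._heap = []        # entries are (sort_key, url)
        self._seen = set()     # avoid queueing the same URL twice
        self._seq = count()    # insertion counter, also breaks ties

    def _key(self, seq, depth, score):
        # Smaller keys are popped first.
        if self.model == "fifo":
            return (seq,)              # oldest first
        if self.model == "lifo":
            return (-seq,)             # newest first
        if self.model == "bfs":
            return (depth, seq)        # shallowest pages first
        if self.model == "dfs":
            return (-depth, seq)       # deepest pages first
        clamped = min(max(score, 0.0), 1.0)
        return (-clamped, seq)         # highest score first

    def add(self, url, depth=0, score=0.0):
        if url in self._seen:
            return
        self._seen.add(url)
        key = self._key(next(self._seq), depth, score)
        heapq.heappush(self._heap, (key, url))

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]


def drain(frontier):
    """Pop every pending URL in the order the frontier would schedule it."""
    urls = []
    while (url := frontier.next_url()) is not None:
        urls.append(url)
    return urls
```

With the score model, a large site can be crawled selectively by assigning higher scores to the documents that matter most; for example, adding `a`, `b` and `c` with scores 0.2, 0.9 and 0.5 schedules them as `b`, `c`, `a`.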
Broad crawling of many websites
This use case requires full distribution of both spiders and backend. In addition to the spider processes, you should run one or more strategy workers and DB workers, depending on the chosen partitioning scheme.
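To illustrate what a partitioning scheme decides, here is a small sketch of one common approach: every URL of a given host is routed to the same partition, so a single spider handles that host. The function name `partition_for` and the choice of CRC32 over the hostname are assumptions made for this example, not a description of how new_frontera assigns partitions.

```python
from urllib.parse import urlsplit
from zlib import crc32


def partition_for(url, num_partitions):
    """Map a URL to a partition index based on its hostname (illustrative only)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Lower-cased hostname; empty string if the URL has none.
    host = urlsplit(url).hostname or ""
    # crc32 is stable across processes, unlike the built-in hash() for str.
    return crc32(host.encode("utf-8")) % num_partitions
```

Because the key is the hostname rather than the full URL, per-host politeness limits can be enforced by a single process without cross-process coordination.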
new_frontera can be used for a broad set of tasks related to large-scale web crawling:
Broad web crawling of an arbitrary number of websites and pages (we tested it on a volume of 45M documents and 100K websites),
Host-focused crawls: when you have more than 100 websites,
Focused crawling:
Topical: you search for pages about some predefined topic,
PageRank, HITS or other link graph algorithm guided.
Here are some real-world problems it can help with:
Building a search engine with content retrieval from the web.
All kinds of research work on the web graph: gathering links, statistics, graph structure, tracking domain count, etc.
More general focused crawling tasks: e.g. you search for pages that are big hubs and change frequently over time.