new_frontera at a glance

new_frontera is an implementation of a crawl frontier, a web crawler component that accumulates URLs/links before they are downloaded from the web. The main features of new_frontera are:

  • online processing oriented,

  • distributed spiders and backends architecture,

  • customizable crawling policy,

  • easy integration with Scrapy,

  • relational database support (MySQL, PostgreSQL, SQLite, and more) through SQLAlchemy, plus the HBase key-value database, out of the box,

  • ZeroMQ and Kafka message bus implementations for distributed crawlers,

  • precise crawling logic tuning with crawling emulation using fake sitemaps with the Graph Manager,

  • transparent transport layer concept (message bus) and communication protocol,

  • pure Python implementation,

  • Python 3 support.

Use cases

Here are a few cases where an external crawl frontier can be suitable:

  • URL ordering/queueing must be isolated from the spider (e.g. a distributed cluster of spiders, or a need for remote management of ordering/queueing),

  • URL (meta)data storage is needed (e.g. to be able to pause and resume the crawl),

  • advanced URL ordering logic is needed and it’s hard to maintain that code within the spider/fetcher.
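To make the pause-and-resume case above more concrete, here is a minimal sketch of URL state kept outside the spider, using Python's standard sqlite3 module. It is not new_frontera's storage layer: the table layout and the function names (open_store, add_url, mark_crawled, pending) are assumptions made for illustration only.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical URL store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, state TEXT NOT NULL, depth INTEGER NOT NULL)"
    )
    return conn


def add_url(conn, url, depth=0):
    """Queue a URL once; a URL that is already known is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO urls (url, state, depth) VALUES (?, 'queued', ?)",
        (url, depth),
    )
    conn.commit()


def mark_crawled(conn, url):
    """Record that a URL has been downloaded."""
    conn.execute("UPDATE urls SET state = 'crawled' WHERE url = ?", (url,))
    conn.commit()


def pending(conn):
    """Return the URLs still waiting to be crawled, shallowest first."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE state = 'queued' ORDER BY depth, url"
    )
    return [row[0] for row in rows]
```

Because the state lives in the store rather than in the spider process, a crawl stopped at any point could be resumed by reopening the same file and asking for the pending URLs again.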

One-time crawl, few websites

For this use case, single process mode is probably the most appropriate. new_frontera offers these prioritization models out of the box:

  • FIFO,

  • LIFO,

  • Breadth-first (BFS),

  • Depth-first (DFS),

  • based on a provided score, mapped from 0.0 to 1.0.

If a website is big and crawling all of it is expensive, new_frontera can be used to point the crawler at the most important documents.
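The difference between the prioritization models listed above can be illustrated with a short sketch. The classes below (FifoQueue, LifoQueue, ScoreQueue) and the drain helper are hypothetical names written with the standard library only; they are not new_frontera's backends. Breadth-first and depth-first ordering are left out here, since they additionally depend on the link depth of each URL.

```python
import heapq
from collections import deque
from itertools import count


class FifoQueue:
    """First in, first out: URLs leave in the order they were added."""

    def __init__(self):
        self._items = deque()

    def push(self, url, score=0.0):
        self._items.append(url)  # score is ignored in this model

    def pop(self):
        return self._items.popleft()


class LifoQueue:
    """Last in, first out: the most recently added URL leaves first."""

    def __init__(self):
        self._items = []

    def push(self, url, score=0.0):
        self._items.append(url)  # score is ignored in this model

    def pop(self):
        return self._items.pop()


class ScoreQueue:
    """Highest score first; scores are expected in the range 0.0 to 1.0."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: equal scores keep insertion order

    def push(self, url, score=0.0):
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def drain(queue, scored_urls):
    """Push every (url, score) pair, then pop them all and return the order."""
    for url, score in scored_urls:
        queue.push(url, score)
    return [queue.pop() for _ in scored_urls]
```

With score-based ordering, a crawler that can only afford part of a large website reaches the highest-scored documents first and can stop once its budget is spent.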

Broad crawling of many websites

This use case requires full distribution of both spiders and backend. In addition to the spider processes, one should run strategy worker(s) and db worker(s), depending on the chosen partitioning scheme.
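As an illustration of what a partitioning scheme might look like, the sketch below assigns URLs to partitions by hashing the host name, so that all URLs of one website end up in the same partition. Partitioning by host is an assumption made for this example, and partition_for and assign are hypothetical helpers, not part of new_frontera.

```python
from urllib.parse import urlsplit
from zlib import crc32


def partition_for(url, partitions):
    """Map a URL to a partition number in range(partitions) by its host name."""
    host = urlsplit(url).hostname or ""
    # crc32 is stable across processes and runs, unlike the built-in hash().
    return crc32(host.encode("utf-8")) % partitions


def assign(urls, partitions):
    """Group URLs by partition number; every partition gets a (possibly empty) list."""
    groups = {number: [] for number in range(partitions)}
    for url in urls:
        groups[partition_for(url, partitions)].append(url)
    return groups
```

A stable hash matters here: every process that computes the partition for a given host must arrive at the same number, otherwise the URLs of one website would be scattered across workers.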

new_frontera can be used for a broad set of tasks related to large-scale web crawling:

  • Broad web crawling of an arbitrary number of websites and pages (we tested it on a volume of 45M documents and 100K websites),

  • Host-focused crawls: when you have more than 100 websites,

  • Focused crawling:

    • Topical: you search for pages about some predefined topic,

    • PageRank, HITS or other link graph algorithm guided.
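For the topical case, a focused crawl needs some way to turn page content into a priority. The sketch below is one deliberately naive possibility, assumed for illustration only: topical_score is a hypothetical function that returns the fraction of topic terms found in a page's text, which conveniently falls in the 0.0 to 1.0 range.

```python
import re


def topical_score(text, topic_terms):
    """Return the fraction of topic terms that occur in the text (0.0 to 1.0)."""
    terms = {term.lower() for term in topic_terms}
    if not terms:
        return 0.0
    # Split the text into lowercase alphanumeric words.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(terms & words) / len(terms)
```

A real crawling strategy would likely use a trained classifier or a link graph measure instead; the point is only that the strategy produces a score that the frontier can order URLs by.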

Here are some of the real-world problems:

  • Building a search engine with content retrieval from the web.

  • All kinds of research work on the web graph: gathering links, statistics, graph structure, tracking domain count, etc.

  • More general focused crawling tasks: e.g. you search for pages that are big hubs and that change frequently over time.