Contributing¶
Important
Read through the Scrapy Contribution Docs for tips relating to writing patches, reporting bugs, and coding style.
Issues and bugs¶
Report on GitHub.
Tests¶
Include tests in your pull requests.
To run unit tests:
pytest tests
To run integration tests:
printf "[scrapyd]\nusername = hello12345\npassword = 67890world\n" > scrapyd.conf
mkdir logs
scrapyd &
pytest integration_tests
Installation¶
To install an editable version for development, clone the repository, change to its directory, and run:
pip install -e .[test,docs]
Developer documentation¶
Configuration¶
Pass the config object to a class’ __init__ method, but don’t store it on the instance (#526).
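A minimal sketch of this pattern, assuming scrapyd’s Config helper and its getfloat method (the service class and stored option are illustrative):

from scrapyd.config import Config

class ExampleService:  # hypothetical class, for illustration only
    def __init__(self, config):
        # Copy what you need out of config in __init__ ...
        self.poll_interval = config.getfloat("poll_interval", 5.0)
        # ... and do NOT keep a reference like: self.config = config

config = Config()
service = ExampleService(config)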
Processes¶
Scrapyd starts Scrapy processes. It runs scrapy crawl in the launcher, and scrapy list in the schedule.json (to check that the spider exists), addversion.json (to return the number of spiders) and listspiders.json (to return the names of spiders) webservices.
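For illustration, a hedged sketch of how a poller message could be turned into scrapy crawl arguments; the helper name and the exact key handling are hypothetical (the real logic lives in scrapyd/launcher.py):

def build_crawl_args(message):
    """Hypothetical sketch: convert a poller message to `scrapy crawl` arguments."""
    msg = dict(message)
    args = ["crawl", msg.pop("_spider")]
    msg.pop("_project", None)
    msg.pop("_version", None)  # assumed to travel via SCRAPYD_EGG_VERSION instead
    for key, value in msg.pop("settings", {}).items():
        args += ["-s", f"{key}={value}"]  # Scrapy settings
    for key, value in msg.items():
        args += ["-a", f"{key}={value}"]  # remaining keys are spider arguments
    return args

# build_crawl_args({"_project": "p", "_spider": "s", "_job": "abc123"})
# -> ["crawl", "s", "-a", "_job=abc123"]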
Environment variables¶
Scrapyd uses environment variables to communicate between the Scrapyd process and the Scrapy processes that it starts.
- SCRAPY_PROJECT: The project to use. See scrapyd/runner.py.
- SCRAPYD_EGG_VERSION: The version of the project, to be retrieved as an egg from eggstorage and activated.
- SCRAPY_SETTINGS_MODULE: The Python path to the settings module of the project. This is usually the module from the entry points of the egg, but can be the module from the [settings] section of a scrapy.cfg file. See scrapyd/environ.py.
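A minimal sketch of the consuming side, using only the standard library (the real logic lives in scrapyd/runner.py and scrapyd/environ.py):

import os

project = os.environ["SCRAPY_PROJECT"]                      # required
egg_version = os.environ.get("SCRAPYD_EGG_VERSION")         # optional
settings_module = os.environ.get("SCRAPY_SETTINGS_MODULE")  # read by Scrapy at startup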
Jobs¶
A pending job is a dict object (referred to as a “message”), accessible via an ISpiderQueue’s pop() or list() methods.
Note
The short-lived message returned by IPoller’s poll() method is also referred to as a “message”.
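Going by the attribute table at the end of this section, a pending-job message might look like this (all values are illustrative):

message = {
    "name": "example_spider",             # spider (renamed to _spider by the poller)
    "_job": "6487ec79947e",               # job ID
    "_version": "r25",                    # egg version
    "settings": {"DOWNLOAD_DELAY": "2"},  # Scrapy settings
    "start_url": "https://example.com",   # remaining keys are spider arguments
}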
The schedule.json webservice calls ISpiderScheduler’s schedule() method. The SpiderScheduler implementation of schedule() adds the message to the project’s ISpiderQueue.

The default application sets up a TimerService to call IPoller’s poll() method at poll_interval. IPoller has a queues attribute, which implements a __getitem__ method to get a project’s ISpiderQueue by project name. The QueuePoller implementation of poll() calls a project’s ISpiderQueue’s pop() method, adds a _project key to the message, renames the name key to _spider, and fires a callback.

Earlier, the Launcher service had added this callback to the Deferred returned by IPoller’s next() method. The Launcher service adapts the message to instantiate a ScrapyProcessProtocol (ProcessProtocol) object, adds a callback, and spawns a process.
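A minimal sketch of the poll() transformation described above (simplified; the real QueuePoller also guards capacity and wires Twisted Deferreds):

def poll(queues, project, callback):
    """Hypothetical sketch of IPoller.poll() for a single project."""
    message = queues[project].pop()            # ISpiderQueue.pop()
    if message is None:                        # queue is empty
        return
    message["_project"] = project              # add the _project key
    message["_spider"] = message.pop("name")   # rename name to _spider
    callback(message)                          # fire the Launcher's callback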
A running job is a ScrapyProcessProtocol object, accessible via Launcher.processes (a dict), in which each key is a slot’s number (an int).
Launcher has a finished attribute, which is an IJobStorage. When the process ends, the callback fires. The Launcher service calls IJobStorage’s add() method, passing the ScrapyProcessProtocol as input.
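For example, a hypothetical helper that summarizes running jobs from a Launcher instance (attribute names taken from the table below):

def describe_running_jobs(launcher):
    """Hypothetical sketch: summarize Launcher.processes."""
    for slot, process in launcher.processes.items():
        # slot is an int; process is a ScrapyProcessProtocol
        yield {
            "slot": slot,
            "project": process.project,
            "spider": process.spider,
            "job": process.job,
            "pid": process.pid,
        }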
A finished job is an object with the attributes project, spider, job, start_time and end_time, accessible via an IJobStorage’s list() or __iter__() methods.
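And a matching hypothetical helper for finished jobs, using only the attributes listed above:

def describe_finished_jobs(jobstorage):
    """Hypothetical sketch: iterate an IJobStorage's finished jobs."""
    for job in jobstorage.list():  # or: for job in jobstorage
        yield (job.project, job.spider, job.job, job.start_time, job.end_time)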
Concept | ISpiderQueue | IPoller | ScrapyProcessProtocol | IJobStorage
---|---|---|---|---
Project | not specified | _project | project | project
Spider | name | _spider | spider | spider
Job ID | _job | _job | job | job
Egg version | _version | _version | ✗ | ✗
Scrapy settings | settings | settings | args (-s arguments) | ✗
Spider arguments | remaining keys | remaining keys | args (-a arguments) | ✗
Environment variables | ✗ | ✗ | env | ✗
Process ID | ✗ | ✗ | pid | ✗
Start time | ✗ | ✗ | start_time | start_time
End time | ✗ | ✗ | end_time | end_time