Contributing
Important
Read through the Scrapy Contribution Docs for tips relating to writing patches, reporting bugs, and coding style.
Issues and bugs
Report on GitHub.
Tests
Include tests in your pull requests.
To run unit tests:
pytest tests
To run integration tests:
printf "[scrapyd]\nusername = hello12345\npassword = 67890world\n" > scrapyd.conf
mkdir logs
scrapyd &
pytest integration_tests
Installation
To install an editable version for development, clone the repository, change to its directory, and run:
pip install -e .[test,docs]
Developer documentation
Configuration
Pass the config object to a class’ __init__ method, but don’t store it on the instance (#526).
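The convention above can be sketched as follows. This is a minimal illustration, not Scrapyd's own code: `StubConfig`, `ExampleStorage` and the option names used here are hypothetical stand-ins, and the only assumption about the config object is that it exposes a `get(option, default)` method.

```python
class StubConfig:
    """Hypothetical stand-in for a config object; only `get` is assumed."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, option, default=None):
        return self._values.get(option, default)


class ExampleStorage:
    """Hypothetical component that is configured through its constructor."""

    def __init__(self, config):
        # Read the values this class needs, once, at construction time.
        self.logs_dir = config.get("logs_dir", "logs")
        self.jobs_to_keep = int(config.get("jobs_to_keep", 5))
        # Deliberately no `self.config = config`: the instance keeps only
        # the values it read, not the config object itself.


storage = ExampleStorage(StubConfig({"logs_dir": "/tmp/logs"}))
```

Keeping only the extracted values makes it obvious which options a class depends on, and makes the class easy to construct in tests.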
Processes
Scrapyd starts Scrapy processes. It runs scrapy crawl in the launcher, and scrapy list in the schedule.json (to check the spider exists), addversion.json (to return the number of spiders) and listspiders.json (to return the names of spiders) webservices.
Environment variables
Scrapyd uses environment variables to communicate between the Scrapyd process and the Scrapy processes that it starts.
- SCRAPY_PROJECT
The project to use. See scrapyd/runner.py.
- SCRAPYD_EGG_VERSION
The version of the project, to be retrieved as an egg from eggstorage and activated.
- SCRAPY_SETTINGS_MODULE
The Python path to the settings module of the project.
This is usually the module from the entry points of the egg, but can be the module from the [settings] section of a scrapy.cfg file. See scrapyd/environ.py.
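The general mechanism, a parent process passing values to a child through its environment, can be sketched as below. This is an illustration only and not the code in scrapyd/runner.py or scrapyd/environ.py: the helper `child_env`, the project name `myproject` and the version `1.0` are assumptions made for the example, and the child is a plain Python one-liner rather than a Scrapy process.

```python
import os
import subprocess
import sys


def child_env(project, settings_module, egg_version=None):
    """Build an environment for a child process (hypothetical helper)."""
    env = os.environ.copy()  # keep PATH and other inherited variables
    env["SCRAPY_PROJECT"] = project
    env["SCRAPY_SETTINGS_MODULE"] = settings_module
    if egg_version is not None:
        env["SCRAPYD_EGG_VERSION"] = egg_version
    return env


# The child only reads what the parent placed in its environment.
child_code = (
    "import os; "
    "print(os.environ['SCRAPY_PROJECT'], os.environ.get('SCRAPYD_EGG_VERSION'))"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env("myproject", "myproject.settings", "1.0"),
    capture_output=True,
    text=True,
    check=True,
)
child_output = result.stdout.strip()
```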
Jobs
A pending job is a dict object (referred to as a “message”), accessible via an ISpiderQueue’s pop() or list() methods.
Note
The short-lived message returned by IPoller’s poll() method is also referred to as a “message”.
- The schedule.json webservice calls ISpiderScheduler’s schedule() method. The SpiderScheduler implementation of schedule() adds the message to the project’s ISpiderQueue.
- The default application sets a TimerService to call IPoller’s poll() method, at poll_interval.
- IPoller has a queues attribute, that implements a __getitem__ method to get a project’s ISpiderQueue by project name.
- The QueuePoller implementation of poll() calls a project’s ISpiderQueue’s pop() method, adds a _project key to the message and renames the name key to _spider, and fires a callback.
- The Launcher service had added the callback to the Deferred, which had been returned by IPoller’s next() method.
- The Launcher service adapts the message to instantiate a ScrapyProcessProtocol (ProcessProtocol) object, adds a callback, and spawns a process.
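The key changes that poll() makes to a message can be sketched as a pure function. This is a simplified illustration under stated assumptions, not the QueuePoller implementation: `to_poller_message` is a hypothetical name, the queue is a plain list standing in for an ISpiderQueue, and the Deferred and callback are left out.

```python
def to_poller_message(project, queue_message):
    """Return a copy of a queue message with the poller's key changes.

    Hypothetical helper: adds `_project` and renames `name` to `_spider`.
    """
    message = dict(queue_message)  # do not mutate the popped message
    message["_project"] = project
    message["_spider"] = message.pop("name")
    return message


# A plain list stands in for a project's queue of pending jobs.
pending = [
    {
        "name": "quotes",                   # spider name
        "_job": "abc123",                   # job ID
        "settings": {"LOG_LEVEL": "INFO"},  # Scrapy settings
        "tag": "python",                    # remaining keys: spider arguments
    }
]

message = to_poller_message("myproject", pending.pop(0))
```

After the call, the message carries `_project` and `_spider`, and the remaining keys are passed through unchanged.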
A running job is a ScrapyProcessProtocol object, accessible via Launcher.processes (a dict), in which each key is a slot’s number (an int).
- Launcher has a finished attribute, which is an IJobStorage.
- When the process ends, the callback fires. The Launcher service calls IJobStorage’s add() method, passing the ScrapyProcessProtocol as input.
A finished job is an object with the attributes project, spider, job, start_time and end_time, accessible via an IJobStorage’s list() or __iter__() methods.
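The shape of a finished job, and of a storage that accepts a process object in add(), can be sketched as follows. `FinishedJob` and `MemoryJobStorage` are hypothetical names used for illustration, and a `SimpleNamespace` stands in for the ScrapyProcessProtocol; only the five attributes and the add(), list() and __iter__() methods named above are taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from types import SimpleNamespace


@dataclass
class FinishedJob:
    """Hypothetical record holding the five attributes of a finished job."""

    project: str
    spider: str
    job: str
    start_time: datetime
    end_time: datetime


class MemoryJobStorage:
    """Hypothetical in-memory storage with add(), list() and __iter__()."""

    def __init__(self):
        self._jobs = []

    def add(self, process):
        # Copy only the attributes a finished job keeps; the rest of the
        # process object (pid, env, args) is dropped.
        self._jobs.append(
            FinishedJob(
                process.project,
                process.spider,
                process.job,
                process.start_time,
                process.end_time,
            )
        )

    def list(self):
        return self._jobs.copy()

    def __iter__(self):
        return iter(self._jobs)


# Stand-in for a process object whose process has ended.
process = SimpleNamespace(
    project="myproject",
    spider="quotes",
    job="abc123",
    pid=4242,
    start_time=datetime(2024, 1, 1, 12, 0, 0),
    end_time=datetime(2024, 1, 1, 12, 5, 0),
)

finished = MemoryJobStorage()
finished.add(process)
```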
| Concept | ISpiderQueue | IPoller | ScrapyProcessProtocol | IJobStorage |
|---|---|---|---|---|
| Project | not specified | _project | project | project |
| Spider | name | _spider | spider | spider |
| Job ID | _job | _job | job | job |
| Egg version | _version | _version | ✗ | ✗ |
| Scrapy settings | settings | settings | args ( | ✗ |
| Spider arguments | remaining keys | remaining keys | args ( | ✗ |
| Environment variables | ✗ | ✗ | env | ✗ |
| Process ID | ✗ | ✗ | pid | ✗ |
| Start time | ✗ | ✗ | start_time | start_time |
| End time | ✗ | ✗ | end_time | end_time |