Contributing¶
Important
Read through the Scrapy Contribution Docs for tips relating to writing patches, reporting bugs, and coding style.
Issues and bugs¶
Report on GitHub.
Tests¶
Include tests in your pull requests.
To run unit tests:
pytest tests
To run integration tests:
printf "[scrapyd]\nusername = hello12345\npassword = 67890world\n" > scrapyd.conf
mkdir logs
scrapyd &
pytest integration_tests
Installation¶
To install an editable version for development, clone the repository, change to its directory, and run:
pip install -e .[test,docs]
Developer documentation¶
Configuration¶
Pass the config object to a class’ __init__ method, but don’t store it on the instance (#526).
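A minimal sketch of this pattern, assuming scrapyd’s Config helper and its getfloat method (the service class and stored option are illustrative):

from scrapyd.config import Config

class ExampleService:  # hypothetical class, for illustration only
    def __init__(self, config):
        # Copy what you need out of config in __init__ ...
        self.poll_interval = config.getfloat("poll_interval", 5.0)
        # ... and do NOT keep a reference like: self.config = config

config = Config()
service = ExampleService(config)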
Processes¶
Scrapyd starts Scrapy processes. It runs scrapy crawl in the launcher, and scrapy list in the schedule.json (to check that the spider exists), addversion.json (to return the number of spiders) and listspiders.json (to return the names of spiders) webservices.
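For illustration, a hedged sketch of how a poller message could be turned into scrapy crawl arguments; the helper name and the exact key handling are hypothetical (the real logic lives in scrapyd/launcher.py):

def build_crawl_args(message):
    """Hypothetical sketch: convert a poller message to `scrapy crawl` arguments."""
    msg = dict(message)
    args = ["crawl", msg.pop("_spider")]
    msg.pop("_project", None)
    msg.pop("_version", None)  # assumed to travel via SCRAPYD_EGG_VERSION instead
    for key, value in msg.pop("settings", {}).items():
        args += ["-s", f"{key}={value}"]  # Scrapy settings
    for key, value in msg.items():
        args += ["-a", f"{key}={value}"]  # remaining keys are spider arguments
    return args

# build_crawl_args({"_project": "p", "_spider": "s", "_job": "abc123"})
# -> ["crawl", "s", "-a", "_job=abc123"]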
Environment variables¶
Scrapyd uses environment variables to communicate between the Scrapyd process and the Scrapy processes that it starts.
- SCRAPY_PROJECT: The project to use. See scrapyd/runner.py.
- SCRAPYD_EGG_VERSION: The version of the project, to be retrieved as an egg from eggstorage and activated.
- SCRAPY_SETTINGS_MODULE: The Python path to the settings module of the project. This is usually the module from the entry points of the egg, but can be the module from the [settings] section of a scrapy.cfg file. See scrapyd/environ.py.
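A minimal sketch of the consuming side, using only the standard library (the real logic lives in scrapyd/runner.py and scrapyd/environ.py):

import os

project = os.environ["SCRAPY_PROJECT"]                      # required
egg_version = os.environ.get("SCRAPYD_EGG_VERSION")         # optional
settings_module = os.environ.get("SCRAPY_SETTINGS_MODULE")  # read by Scrapy at startup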
Jobs¶
A pending job is a dict object (referred to as a “message”), accessible via an ISpiderQueue’s pop() or list() methods.
Note
The short-lived message returned by IPoller’s poll() method is also referred to as a “message”.
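Going by the attribute table at the end of this section, a pending-job message might look like this (all values are illustrative):

message = {
    "name": "example_spider",             # spider (renamed to _spider by the poller)
    "_job": "6487ec79947e",               # job ID
    "_version": "r25",                    # egg version
    "settings": {"DOWNLOAD_DELAY": "2"},  # Scrapy settings
    "start_url": "https://example.com",   # remaining keys are spider arguments
}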
The schedule.json webservice calls ISpiderScheduler’s schedule() method. The SpiderScheduler implementation of schedule() adds the message to the project’s ISpiderQueue.

The default application sets up a TimerService to call IPoller’s poll() method at poll_interval. IPoller has a queues attribute, which implements a __getitem__ method to get a project’s ISpiderQueue by project name. The QueuePoller implementation of poll() calls a project’s ISpiderQueue’s pop() method, adds a _project key to the message, renames the name key to _spider, and fires a callback.

Earlier, the Launcher service had added this callback to the Deferred returned by IPoller’s next() method. The Launcher service adapts the message to instantiate a ScrapyProcessProtocol (ProcessProtocol) object, adds a callback, and spawns a process.
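A minimal sketch of the poll() transformation described above (simplified; the real QueuePoller also guards capacity and wires Twisted Deferreds):

def poll(queues, project, callback):
    """Hypothetical sketch of IPoller.poll() for a single project."""
    message = queues[project].pop()            # ISpiderQueue.pop()
    if message is None:                        # queue is empty
        return
    message["_project"] = project              # add the _project key
    message["_spider"] = message.pop("name")   # rename name to _spider
    callback(message)                          # fire the Launcher's callback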
A running job is a ScrapyProcessProtocol object, accessible via Launcher.processes (a dict), in which each key is a slot’s number (an int).
Launcher has a finished attribute, which is an IJobStorage. When the process ends, the callback fires. The Launcher service calls IJobStorage’s add() method, passing the ScrapyProcessProtocol as input.
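For example, a hypothetical helper that summarizes running jobs from a Launcher instance (attribute names taken from the table below):

def describe_running_jobs(launcher):
    """Hypothetical sketch: summarize Launcher.processes."""
    for slot, process in launcher.processes.items():
        # slot is an int; process is a ScrapyProcessProtocol
        yield {
            "slot": slot,
            "project": process.project,
            "spider": process.spider,
            "job": process.job,
            "pid": process.pid,
        }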
A finished job is an object with the attributes project, spider, job, start_time and end_time, accessible via an IJobStorage’s list() or __iter__() methods.
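And a matching hypothetical helper for finished jobs, using only the attributes listed above:

def describe_finished_jobs(jobstorage):
    """Hypothetical sketch: iterate an IJobStorage's finished jobs."""
    for job in jobstorage.list():  # or: for job in jobstorage
        yield (job.project, job.spider, job.job, job.start_time, job.end_time)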
Concept | ISpiderQueue | IPoller | ScrapyProcessProtocol | IJobStorage
---|---|---|---|---
Project | not specified | _project | project | project
Spider | name | _spider | spider | spider
Job ID | _job | _job | job | job
Egg version | _version | _version | ✗ | ✗
Scrapy settings | settings | settings | args (-s arguments) | ✗
Spider arguments | remaining keys | remaining keys | args (-a arguments) | ✗
Environment variables | ✗ | ✗ | env | ✗
Process ID | ✗ | ✗ | pid | ✗
Start time | ✗ | ✗ | start_time | start_time
End time | ✗ | ✗ | end_time | end_time