Configuration¶
Default configuration¶
Scrapyd always loads this default configuration file. Its values can be overridden by the files listed under Configuration sources:
[scrapyd]
# Application options
application = scrapyd.app.application
bind_address = 127.0.0.1
http_port = 6800
unix_socket_path =
username =
password =
spiderqueue = scrapyd.spiderqueue.SqliteSpiderQueue
# Poller options
poller = scrapyd.poller.QueuePoller
poll_interval = 5.0
# Launcher options
launcher = scrapyd.launcher.Launcher
max_proc = 0
max_proc_per_cpu = 4
logs_dir = logs
items_dir =
jobs_dir =
jobs_to_keep = 5
runner = scrapyd.runner
# Web UI and API options
webroot = scrapyd.website.Root
prefix_header = x-forwarded-prefix
debug = off
# Egg storage options
eggstorage = scrapyd.eggstorage.FilesystemEggStorage
eggs_dir = eggs
# Job storage options
jobstorage = scrapyd.jobstorage.MemoryJobStorage
finished_to_keep = 100
# Directory options
dbs_dir = dbs
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
status.json = scrapyd.webservice.Status
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
Configuration sources¶
Scrapyd reads these configuration files in this order. Values in later files take priority.
- c:\scrapyd\scrapyd.conf
- /etc/scrapyd/scrapyd.conf
- /etc/scrapyd/conf.d/* in alphabetical order
- scrapyd.conf in the current directory
- ~/.scrapyd.conf in the home directory of the user that invoked the scrapyd command
- the closest scrapy.cfg file, starting in the current directory and traversing upward
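The precedence rule above (later files win) can be illustrated with Python's standard configparser. This is only a sketch of the behaviour described, not Scrapyd's own loading code; the file contents in SOURCES and the helper name merge_sources are hypothetical.

```python
import configparser

# Hypothetical file contents, listed in the order they would be read.
SOURCES = [
    "[scrapyd]\nhttp_port = 6800\nbind_address = 127.0.0.1\n",  # e.g. /etc/scrapyd/scrapyd.conf
    "[scrapyd]\nhttp_port = 6801\n",                             # e.g. ./scrapyd.conf
    "[scrapyd]\nbind_address = 0.0.0.0\n",                       # e.g. ~/.scrapyd.conf
]


def merge_sources(sources):
    """Read each source in turn; a value read later replaces an earlier one."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


config = merge_sources(SOURCES)
print(config.get("scrapyd", "http_port"))
print(config.get("scrapyd", "bind_address"))
```

Options that a later file does not mention keep the value from the earlier file, which is why an override file only needs to list the options it changes.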
Environment variables¶
Added in version 1.5.0.
These environment variables override corresponding options:
- SCRAPYD_BIND_ADDRESS (bind_address)
- SCRAPYD_HTTP_PORT (http_port)
- SCRAPYD_USERNAME (username)
- SCRAPYD_PASSWORD (password)
- SCRAPYD_UNIX_SOCKET_PATH (unix_socket_path)
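As a rough illustration of how such overrides combine with file-based options, the sketch below applies the mapping listed above to a dictionary of options. The helper apply_env_overrides is hypothetical, and it assumes that any variable that is set (even to an empty string) takes effect; Scrapyd's actual handling may differ.

```python
import os

# Environment variable -> option name, as listed in this section.
ENV_TO_OPTION = {
    "SCRAPYD_BIND_ADDRESS": "bind_address",
    "SCRAPYD_HTTP_PORT": "http_port",
    "SCRAPYD_USERNAME": "username",
    "SCRAPYD_PASSWORD": "password",
    "SCRAPYD_UNIX_SOCKET_PATH": "unix_socket_path",
}


def apply_env_overrides(options, environ=None):
    """Return a copy of options in which set environment variables win."""
    environ = os.environ if environ is None else environ
    merged = dict(options)
    for variable, option in ENV_TO_OPTION.items():
        if variable in environ:  # assumption: presence alone triggers the override
            merged[option] = environ[variable]
    return merged


file_options = {"bind_address": "127.0.0.1", "http_port": "6800"}
result = apply_env_overrides(file_options, {"SCRAPYD_HTTP_PORT": "8080"})
print(result)
```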
scrapyd section¶
Application options¶
application¶
The function that returns the Twisted Application to use.
If necessary, override this to fully control how Scrapyd works.
- Default
scrapyd.app.application
- Options
Any Twisted Application
bind_address¶
The IP address on which the Web interface and API listen for connections.
- Default
127.0.0.1
- Options
Any IP address, including:
127.0.0.1 to listen for local IPv4 connections only
0.0.0.0 to listen for all IPv4 connections
::0 to listen for all IPv4 and IPv6 connections
Note
If sysctl sets net.ipv6.bindv6only to true (default false), then ::0 listens for IPv6 connections only.
http_port¶
The TCP port on which the Web interface and API listen for connections.
- Default
6800
- Options
Any integer
unix_socket_path¶
Added in version 1.5.0.
The filesystem path of the Unix socket on which the Web interface and API listen for connections.
For example:
unix_socket_path = /var/run/scrapyd/web.socket
The file’s mode is set to 660 (owner and group, read and write) to control access to Scrapyd.
Attention
If bind_address and http_port are set, a TCP server will start, in addition to the Unix server. To disable the TCP server, set bind_address to empty:
bind_address =
username¶
Added in version 1.3.0.
Enable basic authentication by setting this and password to non-empty values.
- Default
""(empty)
password¶
Added in version 1.3.0.
Enable basic authentication by setting this and username to non-empty values.
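Once username and password are set, clients need to send HTTP basic authentication credentials with each request. A minimal client-side sketch using only the standard library follows; the credentials "user" and "pass", the URL, and the helper names are placeholders rather than anything Scrapyd defines.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(url, username, password):
    """Create a request object that carries the credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Placeholder credentials and the default bind_address and http_port.
request = build_request("http://127.0.0.1:6800/daemonstatus.json", "user", "pass")
print(request.get_header("Authorization"))

# To send it against a running Scrapyd instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read())
```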
- Default
""(empty)
spiderqueue¶
Added in version 1.4.2.
The class that stores pending jobs.
- Default
scrapyd.spiderqueue.SqliteSpiderQueue
- Options
scrapyd.spiderqueue.SqliteSpiderQueue stores spider queues in SQLite databases named after each project, in the dbs_dir directory
Implement your own, using the ISpiderQueue interface
- Also used by
addversion.json webservice, to create a queue if the project is new
schedule.json webservice, to add a pending job
cancel.json webservice, to remove a pending job
listjobs.json webservice, to list the pending jobs
daemonstatus.json webservice, to count the pending jobs
Web interface, to list the pending jobs and, if queues are transient, to create the queues per project at startup
Poller options¶
poller¶
Added in version 1.5.0.
The class that tracks capacity for new jobs, and starts jobs when ready.
- Default
scrapyd.poller.QueuePoller
- Options
scrapyd.poller.QueuePoller. When using the default application and launcher values:
The launcher adds max_proc capacity at startup, and one capacity each time a Scrapy process ends.
The application starts a timer so that, every poll_interval seconds, jobs start if there’s capacity: that is, if the number of Scrapy processes that are running is less than the max_proc value.
Implement your own, using the IPoller interface
poll_interval¶
The number of seconds between capacity checks.
- Default
5.0
- Options
Any floating-point number
Launcher options¶
launcher¶
The class that starts Scrapy processes.
- Default
scrapyd.launcher.Launcher
- Options
Any Twisted Service
max_proc¶
The maximum number of Scrapy processes to run concurrently.
- Default
0
- Options
Any non-negative integer, including:
0 to use max_proc_per_cpu multiplied by the number of CPUs
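The relationship between max_proc and max_proc_per_cpu can be written as a small function. This is a sketch of the rule stated above, not Scrapyd's code; effective_max_proc is a hypothetical name, and the sketch assumes the CPU count comes from os.cpu_count().

```python
import os


def effective_max_proc(max_proc, max_proc_per_cpu, cpu_count=None):
    """Return the process limit implied by the two options."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    if max_proc:
        return max_proc
    # max_proc = 0 means "derive the limit from the number of CPUs".
    return max_proc_per_cpu * cpu_count


print(effective_max_proc(0, 4, cpu_count=8))   # 32
print(effective_max_proc(10, 4, cpu_count=8))  # 10
```

With the defaults (max_proc = 0, max_proc_per_cpu = 4), an 8-CPU machine would therefore allow up to 32 concurrent Scrapy processes.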
max_proc_per_cpu¶
See max_proc.
- Default
4
logs_dir¶
The directory in which to write Scrapy logs.
A log file is written to {logs_dir}/{project}/{spider}/{job}.log.
To disable log storage, set this option to empty:
logs_dir =
To log messages to a remote service, you can, for example, reconfigure Scrapy’s logger from your Scrapy project:
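import logging

import logstash  # third-party package (python-logstash)

# Replace the handlers on Scrapy's logger with a Logstash handler.
logger = logging.getLogger("scrapy")
logger.handlers.clear()
logger.addHandler(logstash.LogstashHandler("https://user:pass@id.us-east-1.aws.found.io", 5959, version=1))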
- Default
logs
- Also used by
Web interface, to link to log files
Attention
Each *_dir setting must point to a different directory.
items_dir¶
The directory in which to write Scrapy items.
An item feed is written to {items_dir}/{project}/{spider}/{job}.jl.
If this option is non-empty, the FEEDS Scrapy setting is set as follows, resulting in items being written to the above path as JSON lines:
{"file:///path/to/items_dir/project/spider/job.jl": {"format": "jsonlines"}}
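To make the shape of that value concrete, the sketch below builds the same dictionary from its parts. The function item_feed_setting is hypothetical, and it assumes items_dir is an absolute POSIX path; Scrapyd's own construction of the setting may differ in detail.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def item_feed_setting(items_dir, project, spider, job):
    """Build a FEEDS-style dict that writes JSON lines under items_dir."""
    path = PurePosixPath(items_dir) / project / spider / f"{job}.jl"
    # Assumes an absolute POSIX path, so the URI has the form file:///...
    uri = "file://" + quote(str(path))
    return {uri: {"format": "jsonlines"}}


feeds = item_feed_setting("/path/to/items_dir", "project", "spider", "job")
print(feeds)
```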
- Default
"" (empty), because it is recommended to instead use either:
Feed exports, by setting the FEEDS Scrapy setting in your Scrapy project. See the full list of storage backends.
Item pipeline, to store the scraped items in a database. See the MongoDB example, which can be adapted to another database.
- Also used by
Web interface, to link to item feeds
Attention
Each *_dir setting must point to a different directory.
jobs_dir¶
Added in version 1.6.0.
The directory in which to persist Scrapy requests.
By default, Scrapy keeps the request queue in memory. Use this setting to reduce memory usage.
Requests are persisted to {jobs_dir}/{project}/{spider}/{job}/.
If this option is non-empty, the JOBDIR Scrapy setting is set.
- Default
""
Attention
Each *_dir setting must point to a different directory.
jobs_to_keep¶
The number of most recent finished jobs per spider for which to keep log files in the logs_dir directory and item feeds in the items_dir directory.
To “disable” this feature, set this to an arbitrarily large value. For example, on a 64-bit system:
jobs_to_keep = 9223372036854775807
Warning
Scrapyd deletes old files in these directories, regardless of origin.
- Default
5
runner¶
The Python script to run Scrapy’s CLI.
If necessary, override this to fully control how the Scrapy CLI is called.
- Default
scrapyd.runner
- Options
Any Python script
- Also used by
listspiders.json webservice, to run Scrapy’s list command
Web UI and API options¶
webroot¶
Added in version 1.2.0.
The class that defines the Web interface and API, as a Twisted Resource.
If necessary, override this to fully control how the web UI and API work.
- Default
scrapyd.website.Root
- Options
Any Twisted Resource
prefix_header¶
Added in version 1.4.2.
The header for the base path of the original request.
The header is relevant only if Scrapyd is running behind a reverse proxy, and if the public URL contains a base path, before the Scrapyd API path components.
A base path must have a leading slash and no trailing slash, e.g. /base/path.
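The base path rule above is easy to check before configuring a reverse proxy. The following sketch validates a candidate value and shows how it would sit in front of an API path; both helper names are hypothetical, and the way the path is joined is an assumption for illustration.

```python
def is_valid_base_path(value):
    """True if value has a leading slash and no trailing slash."""
    return value.startswith("/") and not value.endswith("/")


def public_path(base_path, endpoint):
    """Join a valid base path and an API endpoint such as listjobs.json."""
    if not is_valid_base_path(base_path):
        raise ValueError(f"invalid base path: {base_path!r}")
    return f"{base_path}/{endpoint}"


print(is_valid_base_path("/base/path"))   # True
print(is_valid_base_path("/base/path/"))  # False
print(public_path("/base/path", "listjobs.json"))
```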
- Default
x-forwarded-prefix
node_name¶
Added in version 1.1.0.
The node name, which appears in API responses.
- Default
socket.gethostname()
debug¶
Whether debug mode is enabled.
If enabled, a Python traceback is returned (as a plain-text response) when the API errors.
- Default
off
Egg storage options¶
eggstorage¶
Added in version 1.3.0.
The class that stores project eggs.
- Default
scrapyd.eggstorage.FilesystemEggStorage
- Options
scrapyd.eggstorage.FilesystemEggStorage writes eggs in the eggs_dir directory
Note
Eggs are named after the version, replacing characters other than A-Za-z0-9_- with underscores. Therefore, if you frequently use non-word, non-hyphen characters, the eggs for different versions can collide.
Implement your own, using the IEggStorage interface: for example, to store eggs remotely
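The collision risk in the note on egg names can be reproduced with a one-line substitution. This sketch applies the replacement rule as described there; it is not copied from Scrapyd's source, and sanitized_version is a hypothetical name.

```python
import re


def sanitized_version(version):
    """Replace every character other than A-Za-z0-9_- with an underscore."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", version)


# Two distinct versions that map to the same name.
print(sanitized_version("1.0"))  # 1_0
print(sanitized_version("1+0"))  # 1_0

# A version made only of allowed characters is left unchanged.
print(sanitized_version("r1-beta_2"))  # r1-beta_2
```

Version strings that use only letters, digits, underscores and hyphens avoid the problem.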
eggs_dir¶
The directory in which to write project eggs.
- Default
eggs
Attention
Each *_dir setting must point to a different directory.
Job storage options¶
jobstorage¶
Added in version 1.3.0.
The class that stores finished jobs.
- Default
scrapyd.jobstorage.MemoryJobStorage
- Options
scrapyd.jobstorage.MemoryJobStorage stores jobs in memory, such that jobs are lost when the Scrapyd process ends
scrapyd.jobstorage.SqliteJobStorage stores jobs in a SQLite database named jobs.db, in the dbs_dir directory
Implement your own, using the IJobStorage interface
finished_to_keep¶
The number of finished jobs, for which to keep metadata in the jobstorage backend.
Finished jobs are accessed via the Web interface and listjobs.json webservice.
- Default
100
- Options
Any non-negative integer
Directory options¶
dbs_dir¶
The directory in which to write SQLite databases.
- Default
dbs
- Options
Any relative or absolute path, or :memory:
- Used by
spiderqueue (scrapyd.spiderqueue.SqliteSpiderQueue)
jobstorage (scrapyd.jobstorage.SqliteJobStorage)
Attention
Each *_dir setting must point to a different directory.
services section¶
To add a webservice (endpoint), add an entry to this section. For example:
[services]
mywebservice.json = amodule.anothermodule.MyWebService
You can use code for webservices in webservice.py as inspiration.
To remove a default webservice, set it to empty:
[services]
daemonstatus.json =
settings section (scrapy.cfg)¶
Project code is usually stored in a Python egg and uploaded to Scrapyd via the addversion.json webservice.
Alternatively, you can invoke Scrapyd within a Scrapy project: that is, you can run the scrapyd command from a directory containing a scrapy.cfg file (or from a directory with any parent directory containing a scrapy.cfg file).
As described in Scrapy’s documentation, the scrapy.cfg file contains a [settings] section, which can describe many Scrapy projects. By default, it is:
[settings]
default = projectname.settings
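Because scrapy.cfg is an INI file, the [settings] section can be inspected with Python's standard configparser. The sketch below lists each project name with its settings module; the second project ("other") is a made-up example, and project_settings is a hypothetical helper rather than part of Scrapy or Scrapyd.

```python
import configparser

# Sample scrapy.cfg content; the "other" entry is hypothetical.
SCRAPY_CFG = """\
[settings]
default = projectname.settings
other = otherproject.settings
"""


def project_settings(text):
    """Map each project name in [settings] to its settings module."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser.items("settings"))


projects = project_settings(SCRAPY_CFG)
for name, module in projects.items():
    print(name, "->", module)
```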