Configuration¶

Default configuration¶

Scrapyd always loads this configuration file. Its values can be overridden by the files listed in Configuration sources:

[scrapyd]
# Application options
application       = scrapyd.app.application
bind_address      = 127.0.0.1
http_port         = 6800
unix_socket_path  =
username          =
password          =
spiderqueue       = scrapyd.spiderqueue.SqliteSpiderQueue

# Poller options
poller            = scrapyd.poller.QueuePoller
poll_interval     = 5.0

# Launcher options
launcher          = scrapyd.launcher.Launcher
max_proc          = 0
max_proc_per_cpu  = 4
logs_dir          = logs
items_dir         =
jobs_dir          =
jobs_to_keep      = 5
runner            = scrapyd.runner

# Web UI and API options
webroot           = scrapyd.website.Root
prefix_header     = x-forwarded-prefix
debug             = off

# Egg storage options
eggstorage        = scrapyd.eggstorage.FilesystemEggStorage
eggs_dir          = eggs

# Job storage options
jobstorage        = scrapyd.jobstorage.MemoryJobStorage
finished_to_keep  = 100

# Directory options
dbs_dir           = dbs

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
status.json       = scrapyd.webservice.Status
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Configuration sources¶

Scrapyd reads these configuration files in this order. Values in later files take priority.

c:\scrapyd\scrapyd.conf
/etc/scrapyd/scrapyd.conf
/etc/scrapyd/conf.d/* in alphabetical order
scrapyd.conf in the current directory
~/.scrapyd.conf in the home directory of the user that invoked the scrapyd command
the closest scrapy.cfg file, starting in the current directory and traversing upward
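
The precedence rule can be illustrated with Python's configparser, whose read() method accepts a list of paths, skips files that do not exist, and lets values from later files replace values from earlier ones. This is a minimal sketch, not Scrapyd's own loader: the candidate_paths and load_config helpers are hypothetical names, and the scrapy.cfg lookup is left out.

```python
import configparser
import glob
import os


def candidate_paths():
    """Return the documented configuration paths, lowest priority first."""
    return [
        "c:\\scrapyd\\scrapyd.conf",
        "/etc/scrapyd/scrapyd.conf",
        *sorted(glob.glob("/etc/scrapyd/conf.d/*")),
        "scrapyd.conf",
        os.path.expanduser("~/.scrapyd.conf"),
    ]


def load_config(paths):
    """Read each existing file in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are silently skipped
    return parser
```

For example, if /etc/scrapyd/scrapyd.conf sets http_port to 6800 and ~/.scrapyd.conf sets it to 6801, load_config(candidate_paths()) reports 6801.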

Environment variables¶

Added in version 1.5.0.

These environment variables override corresponding options:

SCRAPYD_BIND_ADDRESS (bind_address)
SCRAPYD_HTTP_PORT (http_port)
SCRAPYD_USERNAME (username)
SCRAPYD_PASSWORD (password)
SCRAPYD_UNIX_SOCKET_PATH (unix_socket_path)
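
The mapping above can be sketched as a small helper that layers environment variables over options read from files. The apply_env_overrides function is a hypothetical illustration, not part of Scrapyd. It assumes that any variable that is set, even to an empty string, replaces the option, which this page does not specify.

```python
import os

# Environment variable -> option name, as listed above.
ENV_OVERRIDES = {
    "SCRAPYD_BIND_ADDRESS": "bind_address",
    "SCRAPYD_HTTP_PORT": "http_port",
    "SCRAPYD_USERNAME": "username",
    "SCRAPYD_PASSWORD": "password",
    "SCRAPYD_UNIX_SOCKET_PATH": "unix_socket_path",
}


def apply_env_overrides(options, environ=None):
    """Return a copy of `options` with any set environment variables applied."""
    environ = os.environ if environ is None else environ
    merged = dict(options)
    for variable, option in ENV_OVERRIDES.items():
        if variable in environ:  # assumption: a set-but-empty variable still overrides
            merged[option] = environ[variable]
    return merged
```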

scrapyd section¶

Application options¶

application¶

The function that returns the Twisted Application to use.

If necessary, override this to fully control how Scrapyd works.

Default: scrapyd.app.application
Options: Any Twisted Application

bind_address¶

The IP address on which the Web interface and API listen for connections.

Default

127.0.0.1

Options

Any IP address, including:

127.0.0.1 to listen for local IPv4 connections only
0.0.0.0 to listen for all IPv4 connections
::0 to listen for all IPv4 and IPv6 connections

Note

If sysctl sets net.ipv6.bindv6only to true (default false), then ::0 listens for IPv6 connections only.

http_port¶

The TCP port on which the Web interface and API listen for connections.

Default: 6800
Options: Any integer

unix_socket_path¶

Added in version 1.5.0.

The filesystem path of the Unix socket on which the Web interface and API listen for connections.

For example:

unix_socket_path = /var/run/scrapyd/web.socket

The file’s mode is set to 660 (owner and group, read and write) to control access to Scrapyd.

Attention

If bind_address and http_port are set, a TCP server will start, in addition to the Unix server. To disable the TCP server, set bind_address to empty:

bind_address =
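
A client can reach the API over the Unix socket with the standard library alone. The following is a rough sketch: UnixSocketHTTPConnection and daemon_status are hypothetical names, and the socket path passed in is whatever you configured in unix_socket_path. The calling user needs read and write permission on the socket file.

```python
import http.client
import socket


class UnixSocketHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix socket instead of a TCP port."""

    def __init__(self, socket_path, timeout=10):
        # The host is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)


def daemon_status(socket_path):
    """Fetch daemonstatus.json through the Unix socket and return the body as text."""
    conn = UnixSocketHTTPConnection(socket_path)
    try:
        conn.request("GET", "/daemonstatus.json")
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```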

username¶

Added in version 1.3.0.

Enable basic authentication by setting this and password to non-empty values.

Default: "" (empty)

password¶

Added in version 1.3.0.

Enable basic authentication by setting this and username to non-empty values.

Default: "" (empty)
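
Once username and password are set, clients must send HTTP basic authentication credentials with each request. A minimal client-side sketch using the standard library follows. The helper names and the URL in the usage note are illustrative.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(url, username, password):
    """Return a urllib request that carries the basic authentication header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Pass the result to urllib.request.urlopen(), for example with the URL http://127.0.0.1:6800/daemonstatus.json when using the default bind_address and http_port.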

spiderqueue¶

Added in version 1.4.2.

The class that stores pending jobs.

Default

scrapyd.spiderqueue.SqliteSpiderQueue

Options

scrapyd.spiderqueue.SqliteSpiderQueue stores spider queues in SQLite databases named after each project, in the dbs_dir directory
Implement your own, using the ISpiderQueue interface

Also used by

addversion.json webservice, to create a queue if the project is new
schedule.json webservice, to add a pending job
cancel.json webservice, to remove a pending job
listjobs.json webservice, to list the pending jobs
daemonstatus.json webservice, to count the pending jobs
Web interface, to list the pending jobs and, if queues are transient, to create the queues per project at startup

Poller options¶

poller¶

Added in version 1.5.0.

The class that tracks capacity for new jobs, and starts jobs when ready.

Default

scrapyd.poller.QueuePoller

Options

scrapyd.poller.QueuePoller. When using the default application and launcher values:

The launcher adds max_proc units of capacity at startup, and one unit of capacity each time a Scrapy process ends.

The application starts a timer so that, every poll_interval seconds, jobs start if there’s capacity: that is, if the number of Scrapy processes that are running is less than the max_proc value.

Implement your own, using the IPoller interface
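
The capacity accounting described for scrapyd.poller.QueuePoller can be pictured with a toy model. This is not Scrapyd's code: CapacityModel is a hypothetical class, jobs are plain strings, and starting several jobs in one poll is an assumption made to keep the example short.

```python
class CapacityModel:
    """Toy model of capacity accounting: capacity is spent on start, returned on exit."""

    def __init__(self, max_proc):
        self.capacity = max_proc  # capacity added at startup
        self.pending = []         # jobs waiting to start, oldest first
        self.running = []         # jobs that currently hold capacity

    def schedule(self, job):
        self.pending.append(job)

    def poll(self):
        """Run every poll_interval seconds: start pending jobs while capacity remains."""
        started = []
        while self.capacity > 0 and self.pending:
            job = self.pending.pop(0)
            self.capacity -= 1
            self.running.append(job)
            started.append(job)
        return started

    def process_ended(self, job):
        """A Scrapy process ended: release one unit of capacity."""
        self.running.remove(job)
        self.capacity += 1
```

With max_proc set to 2 and three scheduled jobs, the first poll starts two jobs, and the third starts on the first poll after one of them ends.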

poll_interval¶

The number of seconds between capacity checks.

Default: 5.0
Options: Any floating-point number

Launcher options¶

launcher¶

The class that starts Scrapy processes.

Default: scrapyd.launcher.Launcher
Options: Any Twisted Service

max_proc¶

The maximum number of Scrapy processes to run concurrently.

Default

0

Options

Any non-negative integer, including:

0 to use max_proc_per_cpu multiplied by the number of CPUs

max_proc_per_cpu¶

See max_proc.

Default: 4
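
The relationship between max_proc and max_proc_per_cpu can be written out as a short calculation. The effective_max_proc function is a hypothetical illustration of the rule above, and it uses os.cpu_count() as the CPU count, which is an assumption about how the count is obtained.

```python
import os


def effective_max_proc(max_proc=0, max_proc_per_cpu=4, cpu_count=None):
    """Return the process limit: max_proc if non-zero, else per-CPU limit times CPUs."""
    if max_proc:
        return max_proc
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return max_proc_per_cpu * cpu_count
```

With the defaults on a machine with 2 CPUs, the limit is 8 concurrent Scrapy processes.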

logs_dir¶

The directory in which to write Scrapy logs.

A log file is written to {logs_dir}/{project}/{spider}/{job}.log.

To disable log storage, set this option to empty:

logs_dir =

To log messages to a remote service, you can, for example, reconfigure Scrapy’s logger from your Scrapy project:

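import logging

import logstash  # provided by the python-logstash package

# Replace the handlers on Scrapy's logger with a Logstash handler.
logger = logging.getLogger("scrapy")
logger.handlers.clear()
logger.addHandler(logstash.LogstashHandler("https://user:pass@id.us-east-1.aws.found.io", 5959, version=1))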

Default: logs
Also used by: Web interface, to link to log files

Attention

Each *_dir setting must point to a different directory.

items_dir¶

The directory in which to write Scrapy items.

An item feed is written to {items_dir}/{project}/{spider}/{job}.jl.

If this option is non-empty, the FEEDS Scrapy setting is set as follows, resulting in items being written to the above path as JSON lines:

{"file:///path/to/items_dir/project/spider/job.jl": {"format": "jsonlines"}}
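
As a rough illustration of how such a value relates to the path template, the following sketch builds the dictionary for one job. The job_feed_settings helper is hypothetical and is not how Scrapyd itself computes the setting.

```python
import os
from pathlib import Path


def job_feed_settings(items_dir, project, spider, job):
    """Build a FEEDS-style dict that writes one job's items as JSON lines."""
    # as_uri() requires an absolute path, so make items_dir absolute first.
    path = Path(os.path.abspath(items_dir)) / project / spider / f"{job}.jl"
    return {path.as_uri(): {"format": "jsonlines"}}
```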

Default

"" (empty), because it is recommended to instead use either:

Feed exports, by setting the FEEDS Scrapy setting in your Scrapy project. See the full list of storage backends.

Item pipeline, to store the scraped items in a database. See the MongoDB example, which can be adapted to another database.

Also used by

Web interface, to link to item feeds

Attention

Each *_dir setting must point to a different directory.

jobs_dir¶

Added in version 1.6.0.

The directory in which to persist Scrapy requests.

By default, Scrapy keeps the request queue in memory. Use this setting to reduce memory usage.

Requests are persisted to {jobs_dir}/{project}/{spider}/{job}/.

If this option is non-empty, the JOBDIR Scrapy setting is set.

Default: "" (empty)

Attention

Each *_dir setting must point to a different directory.

jobs_to_keep¶

The number of finished jobs per spider for which to keep log files in the logs_dir directory and item feeds in the items_dir directory. The files of the most recent jobs are kept.

To “disable” this feature, set this to an arbitrarily large value. For example, on a 64-bit system:

jobs_to_keep = 9223372036854775807

Warning

Scrapyd deletes old files in these directories, regardless of origin.

Default: 5
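
The warning above is easier to appreciate with a sketch of what a cleanup of this kind looks like. The prune_job_files function is a hypothetical illustration, not Scrapyd's implementation. It assumes that "most recent" means most recently modified, and it removes every other regular file in the directory, whoever created it.

```python
import os


def prune_job_files(directory, jobs_to_keep):
    """Delete all but the `jobs_to_keep` newest files; return the removed file names."""
    paths = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    # Newest first, judged by modification time (an assumption).
    paths.sort(key=os.path.getmtime, reverse=True)
    removed = []
    for path in paths[jobs_to_keep:]:
        os.remove(path)
        removed.append(os.path.basename(path))
    return removed
```

Because a cleanup like this cannot tell job files from unrelated files, avoid storing anything else in the per-spider directories under logs_dir and items_dir.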

runner¶

The Python script to run Scrapy’s CLI.

If necessary, override this to fully control how the Scrapy CLI is called.

Default: scrapyd.runner
Options: Any Python script
Also used by: listspiders.json webservice, to run Scrapy’s list command

Web UI and API options¶

webroot¶

Added in version 1.2.0.

The class that defines the Web interface and API, as a Twisted Resource.

If necessary, override this to fully control how the web UI and API work.

Default: scrapyd.website.Root
Options: Any Twisted Resource

prefix_header¶

Added in version 1.4.2.

The header for the base path of the original request.

The header is relevant only if Scrapyd is running behind a reverse proxy, and if the public URL contains a base path, before the Scrapyd API path components. A base path must have a leading slash and no trailing slash, e.g. /base/path.

Default: x-forwarded-prefix
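
To make the base path rule concrete, the sketch below joins the header value with an API path and rejects values that break the rule. The public_path helper is hypothetical, and it does not describe how Scrapyd processes the header internally.

```python
def public_path(headers, api_path, prefix_header="x-forwarded-prefix"):
    """Return the public path of an API endpoint, given the request headers."""
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    base = lowered.get(prefix_header.lower(), "")
    if base and (not base.startswith("/") or base.endswith("/")):
        raise ValueError(f"base path needs a leading slash and no trailing slash: {base!r}")
    return f"{base}/{api_path.lstrip('/')}"
```

A reverse proxy that serves Scrapyd under /base/path would set the header to /base/path, so that listjobs.json is published at /base/path/listjobs.json.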

node_name¶

Added in version 1.1.0.

The node name, which appears in API responses.

Default: socket.gethostname()

debug¶

Whether debug mode is enabled.

If enabled, a Python traceback is returned (as a plain-text response) when the API errors.

Default: off

Egg storage options¶

eggstorage¶

Added in version 1.3.0.

The class that stores project eggs.

Default

scrapyd.eggstorage.FilesystemEggStorage

Options

scrapyd.eggstorage.FilesystemEggStorage writes eggs in the eggs_dir directory

Note

Eggs are named after the version, replacing characters other than A-Za-z0-9_- with underscores. Therefore, if you frequently use non-word, non-hyphen characters, the eggs for different versions can collide.
Implement your own, using the IEggStorage interface: for example, to store eggs remotely
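
The naming rule in the note above can be reproduced with a regular expression to check whether two of your version strings would collide. The egg_name function is an illustrative stand-in for that rule, not Scrapyd's own function.

```python
import re


def egg_name(version):
    """Replace every character other than A-Za-z0-9_- with an underscore."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", version)
```

For example, the versions 1.0 and 1+0 both map to 1_0, so one egg would overwrite the other.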

eggs_dir¶

The directory in which to write project eggs.

Default: eggs

Attention

Each *_dir setting must point to a different directory.

Job storage options¶

jobstorage¶

Added in version 1.3.0.

The class that stores finished jobs.

Default

scrapyd.jobstorage.MemoryJobStorage

Options

scrapyd.jobstorage.MemoryJobStorage stores jobs in memory, such that jobs are lost when the Scrapyd process ends
scrapyd.jobstorage.SqliteJobStorage stores jobs in a SQLite database named jobs.db, in the dbs_dir directory
Implement your own, using the IJobStorage interface

finished_to_keep¶

The number of finished jobs for which to keep metadata in the jobstorage backend.

Finished jobs are accessed via the Web interface and listjobs.json webservice.

Default: 100
Options: Any non-negative integer

Directory options¶

dbs_dir¶

The directory in which to write SQLite databases.

Default

dbs

Options

Any relative or absolute path, or :memory:

Used by

spiderqueue (scrapyd.spiderqueue.SqliteSpiderQueue)
jobstorage (scrapyd.jobstorage.SqliteJobStorage)

Attention

Each *_dir setting must point to a different directory.

services section¶

To add a webservice (endpoint), add an entry to the [services] section. For example:

[services]
mywebservice.json = amodule.anothermodule.MyWebService

You can use the code for the default webservices in webservice.py as inspiration.

To remove a default webservice, set it to empty:

[services]
daemonstatus.json =
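
The effect of adding and emptying entries can be sketched with configparser. The enabled_services helper is a hypothetical illustration of the merge, not Scrapyd's loader: it overlays the user's [services] section on the defaults and drops any entry whose value is empty.

```python
import configparser


def enabled_services(defaults, overrides_text):
    """Merge a user's [services] section into the defaults; empty values remove entries."""
    parser = configparser.ConfigParser()
    parser.read_string(overrides_text)
    services = dict(defaults)
    if parser.has_section("services"):
        services.update(parser.items("services"))
    return {name: cls for name, cls in services.items() if cls}
```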

settings section (scrapy.cfg)¶

Project code is usually stored in a Python egg and uploaded to Scrapyd via the addversion.json webservice.

Alternatively, you can invoke Scrapyd within a Scrapy project: that is, you can run the scrapyd command from a directory containing a scrapy.cfg file (or from a directory with any parent directory containing a scrapy.cfg file).

As described in Scrapy’s documentation, the scrapy.cfg file contains a [settings] section, which can describe many Scrapy projects. By default, it is:

[settings]
default = projectname.settings
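
To see which projects a scrapy.cfg file describes, you can locate the closest file and read its [settings] section. The sketch below uses only the standard library. Both helpers are hypothetical names written for this example, and it assumes that each key in the section is a project name mapped to its settings module.

```python
import configparser
from pathlib import Path


def closest_scrapy_cfg(start="."):
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none."""
    directory = Path(start).resolve()
    for candidate in [directory, *directory.parents]:
        cfg = candidate / "scrapy.cfg"
        if cfg.is_file():
            return cfg
    return None


def project_settings(cfg_path):
    """Map each project name in the [settings] section to its settings module."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    if not parser.has_section("settings"):
        return {}
    return dict(parser.items("settings"))
```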