Configuration

Default configuration

Scrapyd always loads this default configuration, whose values can be overridden by the files listed under Configuration sources:

[scrapyd]
# Application options
application       = scrapyd.app.application
bind_address      = 127.0.0.1
http_port         = 6800
unix_socket_path  =
username          =
password          =
spiderqueue       = scrapyd.spiderqueue.SqliteSpiderQueue

# Poller options
poller            = scrapyd.poller.QueuePoller
poll_interval     = 5.0

# Launcher options
launcher          = scrapyd.launcher.Launcher
max_proc          = 0
max_proc_per_cpu  = 4
logs_dir          = logs
items_dir         =
jobs_dir          =
jobs_to_keep      = 5
runner            = scrapyd.runner

# Web UI and API options
webroot           = scrapyd.website.Root
prefix_header     = x-forwarded-prefix
debug             = off

# Egg storage options
eggstorage        = scrapyd.eggstorage.FilesystemEggStorage
eggs_dir          = eggs

# Job storage options
jobstorage        = scrapyd.jobstorage.MemoryJobStorage
finished_to_keep  = 100

# Directory options
dbs_dir           = dbs

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
status.json       = scrapyd.webservice.Status
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Configuration sources

Scrapyd reads these configuration files in this order. Values in later files take priority.

  1. c:\scrapyd\scrapyd.conf

  2. /etc/scrapyd/scrapyd.conf

  3. /etc/scrapyd/conf.d/* in alphabetical order

  4. scrapyd.conf in the current directory

  5. ~/.scrapyd.conf in the home directory of the user that invoked the scrapyd command

  6. the closest scrapy.cfg file, starting in the current directory and traversing upward
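The "later files take priority" rule can be illustrated with Python's standard configparser module, which merges files in the order given. This is only a sketch of the layering behaviour, not Scrapyd's own loader; the file names system.conf and local.conf are hypothetical stand-ins for two of the sources above.

```python
import configparser
import tempfile
from pathlib import Path


def load_layered(paths):
    """Read the files in order; a value in a later file replaces an earlier one."""
    parser = configparser.ConfigParser()
    parser.read(paths)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for an earlier and a later configuration source.
    system = Path(tmp, "system.conf")
    local = Path(tmp, "local.conf")
    system.write_text("[scrapyd]\nhttp_port = 6800\nmax_proc = 2\n")
    local.write_text("[scrapyd]\nhttp_port = 6801\n")

    merged = load_layered([system, local])
    http_port = merged.getint("scrapyd", "http_port")  # set in both files
    max_proc = merged.getint("scrapyd", "max_proc")    # set in the first file only
```

Options that the later file does not mention keep the value from the earlier file, so a local file only needs to list what it changes.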

Environment variables

Added in version 1.5.0.

These environment variables override corresponding options:

scrapyd section

Application options

application

The function that returns the Twisted Application to use.

If necessary, override this to fully control how Scrapyd works.

Default

scrapyd.app.application

Options

Any Twisted Application

bind_address

The IP address on which the Web interface and API listen for connections.

Default

127.0.0.1

Options

Any IP address, including:

  • 127.0.0.1 to listen for local IPv4 connections only

  • 0.0.0.0 to listen for all IPv4 connections

  • ::0 to listen for all IPv4 and IPv6 connections

    Note

    If sysctl sets net.ipv6.bindv6only to true (default false), then ::0 listens for IPv6 connections only.
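As a quick way to check what kind of address a bind_address value is, the standard ipaddress module can classify it. This is an illustrative sketch only; the describe function is hypothetical and is not part of Scrapyd.

```python
import ipaddress


def describe(address):
    """Summarise an address as its IP version plus a rough listening scope."""
    ip = ipaddress.ip_address(address)
    if ip.is_loopback:
        scope = "local connections only"
    elif ip.is_unspecified:
        scope = "all interfaces"
    else:
        scope = "one specific interface"
    return f"IPv{ip.version}, {scope}"


descriptions = {addr: describe(addr) for addr in ("127.0.0.1", "0.0.0.0", "::0")}
```

Whether ::0 also accepts IPv4 connections depends on the operating system setting described in the note above, which this sketch does not inspect.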

http_port

The TCP port on which the Web interface and API listen for connections.

Default

6800

Options

Any integer

unix_socket_path

Added in version 1.5.0.

The filesystem path of the Unix socket on which the Web interface and API listen for connections.

For example:

unix_socket_path = /var/run/scrapyd/web.socket

The file’s mode is set to 660 (owner and group, read and write) to control access to Scrapyd.

Attention

If bind_address and http_port are set, a TCP server will start, in addition to the Unix server. To disable the TCP server, set bind_address to empty:

bind_address =

username

Added in version 1.3.0.

Enable basic authentication by setting this and password to non-empty values.

Default

"" (empty)

password

Added in version 1.3.0.

Enable basic authentication by setting this and username to non-empty values.
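When basic authentication is enabled, clients need to send an Authorization header with each request. The sketch below builds such a request with the standard library, without sending it; the credentials are hypothetical placeholders, and the URL assumes the default bind_address and http_port.

```python
import base64
import urllib.request


def authenticated_request(url, username, password):
    """Build a request that carries HTTP basic authentication credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical credentials; replace them with your username and password values.
request = authenticated_request(
    "http://127.0.0.1:6800/daemonstatus.json", "hypothetical-user", "hypothetical-pass"
)
```

Passing the request to urllib.request.urlopen would send it to a running Scrapyd instance.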

Default

"" (empty)

spiderqueue

Added in version 1.4.2.

The class that stores pending jobs.

Default

scrapyd.spiderqueue.SqliteSpiderQueue

Options

  • scrapyd.spiderqueue.SqliteSpiderQueue stores spider queues in SQLite databases named after each project, in the dbs_dir directory

  • Implement your own, using the ISpiderQueue interface

Also used by

Poller options

poller

Added in version 1.5.0.

The class that tracks capacity for new jobs, and starts jobs when ready.

Default

scrapyd.poller.QueuePoller

Options

  • scrapyd.poller.QueuePoller. With the default application and launcher values:

      • The launcher adds max_proc capacity at startup, and one unit of capacity each time a Scrapy process ends.

      • The application starts a timer so that, every poll_interval seconds, jobs start if there’s capacity: that is, if the number of running Scrapy processes is less than the max_proc value.

  • Implement your own, using the IPoller interface
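The capacity model described above can be sketched as a toy simulation. This is not Scrapyd's implementation and does not use the IPoller interface; the ToyPoller class and its method names are hypothetical, and the timer is replaced by explicit poll() calls.

```python
from collections import deque


class ToyPoller:
    """Toy model: capacity starts at max_proc and is returned when a process ends."""

    def __init__(self, max_proc):
        self.capacity = max_proc
        self.pending = deque()
        self.running = []

    def schedule(self, job):
        self.pending.append(job)

    def poll(self):
        # Stands in for one tick of the poll_interval timer.
        while self.capacity > 0 and self.pending:
            self.running.append(self.pending.popleft())
            self.capacity -= 1

    def process_ended(self, job):
        self.running.remove(job)
        self.capacity += 1


poller = ToyPoller(max_proc=2)
for job in ("a", "b", "c"):
    poller.schedule(job)

poller.poll()
after_first_poll = list(poller.running)   # "c" waits because capacity is used up

poller.process_ended("a")
poller.poll()
after_second_poll = list(poller.running)  # "c" starts once "a" frees capacity
```

In this model, a job never waits longer than one poll interval after capacity becomes available.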

poll_interval

The number of seconds between capacity checks.

Default

5.0

Options

Any floating-point number

Launcher options

launcher

The class that starts Scrapy processes.

Default

scrapyd.launcher.Launcher

Options

Any Twisted Service

max_proc

The maximum number of Scrapy processes to run concurrently.

Default

0

Options

Any non-negative integer, including:

  • 0 to use max_proc_per_cpu multiplied by the number of CPUs

max_proc_per_cpu

See max_proc.

Default

4
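The relationship between max_proc and max_proc_per_cpu can be sketched as follows, assuming that a max_proc of 0 defers to max_proc_per_cpu multiplied by the number of CPUs. The effective_max_proc function is a hypothetical helper for illustration, not Scrapyd's code.

```python
import os


def effective_max_proc(max_proc=0, max_proc_per_cpu=4, cpus=None):
    """Return the process limit implied by the two options (assumed behaviour)."""
    if max_proc:
        # A non-zero max_proc is used as-is.
        return max_proc
    if cpus is None:
        # os.cpu_count() can return None; fall back to a single CPU.
        cpus = os.cpu_count() or 1
    return cpus * max_proc_per_cpu


default_limit_on_two_cpus = effective_max_proc(cpus=2)
explicit_limit = effective_max_proc(max_proc=3, cpus=2)
```

With the default values, a machine with 2 CPUs would therefore run up to 8 Scrapy processes at once.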

logs_dir

The directory in which to write Scrapy logs.

A log file is written to {logs_dir}/{project}/{spider}/{job}.log.

To disable log storage, set this option to empty:

logs_dir =

To log messages to a remote service, you can, for example, reconfigure Scrapy’s logger from your Scrapy project:

import logging
import logstash

logger = logging.getLogger("scrapy")
logger.handlers.clear()
logger.addHandler(logstash.LogstashHandler("https://user:pass@id.us-east-1.aws.found.io", 5959, version=1))

Default

logs

Also used by

Web interface, to link to log files

Attention

Each *_dir setting must point to a different directory.

items_dir

The directory in which to write Scrapy items.

An item feed is written to {items_dir}/{project}/{spider}/{job}.jl.
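To make the path template concrete, the sketch below fills it in and converts the result to a file URI of the kind used as a key in Scrapy's FEEDS setting. The feed_path and feeds_setting helpers are hypothetical, and the directory, project, spider and job names are placeholders.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def feed_path(items_dir, project, spider, job):
    """Fill in the {items_dir}/{project}/{spider}/{job}.jl template."""
    return PurePosixPath(items_dir) / project / spider / f"{job}.jl"


def feeds_setting(path):
    """Build a FEEDS-style mapping that writes JSON lines to an absolute path."""
    uri = "file://" + quote(str(path))
    return {uri: {"format": "jsonlines"}}


path = feed_path("/path/to/items_dir", "project", "spider", "job")
feeds = feeds_setting(path)
```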

If this option is non-empty, the FEEDS Scrapy setting is set as follows, resulting in items being written to the above path as JSON lines:

{"file:///path/to/items_dir/project/spider/job.jl": {"format": "jsonlines"}}

Default

"" (empty), because it is recommended to instead use either:

Also used by

Web interface, to link to item feeds

Attention

Each *_dir setting must point to a different directory.

jobs_dir

Added in version 1.6.0.

The directory in which to persist Scrapy requests.

By default, Scrapy keeps the request queue in memory. Use this setting to reduce memory usage.

Requests are persisted to {jobs_dir}/{project}/{spider}/{job}/.

If this option is non-empty, the JOBDIR Scrapy setting is set.

Default

""

Attention

Each *_dir setting must point to a different directory.

jobs_to_keep

The number of finished jobs per spider, for which to keep the most recent log files in the logs_dir directory and item feeds in the items_dir directory.

To “disable” this feature, set this to an arbitrarily large value. For example, on a 64-bit system:

jobs_to_keep = 9223372036854775807
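The value in this example is the largest signed 64-bit integer. A short check, for readers who want to derive it rather than copy it:

```python
import sys

# The largest signed 64-bit integer is 2**63 - 1.
largest_signed_64_bit = 2**63 - 1
matches_example_value = largest_signed_64_bit == 9223372036854775807

# On a 64-bit CPython build, sys.maxsize holds this same value.
is_64_bit_build = sys.maxsize == largest_signed_64_bit
```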

Warning

Scrapyd deletes old files in these directories, regardless of origin.

Default

5

runner

The Python script to run Scrapy’s CLI.

If necessary, override this to fully control how the Scrapy CLI is called.

Default

scrapyd.runner

Options

Any Python script

Also used by

listspiders.json webservice, to run Scrapy’s list command

Web UI and API options

webroot

Added in version 1.2.0.

The class that defines the Web interface and API, as a Twisted Resource.

If necessary, override this to fully control how the web UI and API work.

Default

scrapyd.website.Root

Options

Any Twisted Resource

prefix_header

Added in version 1.4.2.

The header for the base path of the original request.

The header is relevant only if Scrapyd is running behind a reverse proxy, and if the public URL contains a base path, before the Scrapyd API path components. A base path must have a leading slash and no trailing slash, e.g. /base/path.

Default

x-forwarded-prefix
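The base path rule for prefix_header (a leading slash and no trailing slash) can be sketched as a small check. The function names are hypothetical, and the sketch only illustrates the expected shape of the header value, not how Scrapyd processes it.

```python
def is_valid_base_path(prefix):
    """A base path needs a leading slash and no trailing slash, e.g. /base/path."""
    return prefix.startswith("/") and not prefix.endswith("/")


def public_path(prefix, api_path):
    """Join a base path and an API path component into the public path."""
    if not is_valid_base_path(prefix):
        raise ValueError(f"invalid base path: {prefix!r}")
    return f"{prefix}/{api_path.lstrip('/')}"


listjobs_path = public_path("/base/path", "listjobs.json")
```

A reverse proxy would send the base path in the configured header, for example x-forwarded-prefix: /base/path.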

node_name

Added in version 1.1.0.

The node name, which appears in API responses.

Default

socket.gethostname()

debug

Whether debug mode is enabled.

If enabled, a Python traceback is returned (as a plain-text response) when the API errors.

Default

off

Egg storage options

eggstorage

Added in version 1.3.0.

The class that stores project eggs.

Default

scrapyd.eggstorage.FilesystemEggStorage

Options

  • scrapyd.eggstorage.FilesystemEggStorage writes eggs in the eggs_dir directory

    Note

    Eggs are named after the version, replacing characters other than A-Za-z0-9_- with underscores. Therefore, if you frequently use non-word, non-hyphen characters, the eggs for different versions can collide.

  • Implement your own, using the IEggStorage interface: for example, to store eggs remotely
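The collision risk mentioned in the note on egg names can be reproduced with a regular expression that applies the same replacement rule. The egg_name function is a hypothetical stand-in, not the storage class's own code.

```python
import re


def egg_name(version):
    """Replace every character other than A-Za-z0-9_- with an underscore."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", version)


names = [egg_name(version) for version in ("1.0", "1+0", "2024-01-01_rc1")]
```

Here the versions 1.0 and 1+0 both map to 1_0, so one egg would overwrite the other, while 2024-01-01_rc1 is unchanged.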

eggs_dir

The directory in which to write project eggs.

Default

eggs

Attention

Each *_dir setting must point to a different directory.

Job storage options

jobstorage

Added in version 1.3.0.

The class that stores finished jobs.

Default

scrapyd.jobstorage.MemoryJobStorage

Options

  • scrapyd.jobstorage.MemoryJobStorage stores jobs in memory, such that jobs are lost when the Scrapyd process ends

  • scrapyd.jobstorage.SqliteJobStorage stores jobs in a SQLite database named jobs.db, in the dbs_dir directory

  • Implement your own, using the IJobStorage interface

finished_to_keep

The number of finished jobs, for which to keep metadata in the jobstorage backend.

Finished jobs are accessed via the Web interface and listjobs.json webservice.

Default

100

Options

Any non-negative integer

Directory options

dbs_dir

The directory in which to write SQLite databases.

Default

dbs

Options

Any relative or absolute path, or :memory:

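Used by

  • spiderqueue (scrapyd.spiderqueue.SqliteSpiderQueue)

  • jobstorage (scrapyd.jobstorage.SqliteJobStorage)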

Attention

Each *_dir setting must point to a different directory.
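The :memory: option for dbs_dir is the name SQLite uses for an in-memory database, whose contents are lost when the connection closes. The sketch below shows that behaviour with Python's sqlite3 module; the table name and row are hypothetical and do not reflect Scrapyd's schema.

```python
import sqlite3

# An in-memory database exists only for the lifetime of its connection.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, message TEXT)")
connection.execute("INSERT INTO queue (message) VALUES (?)", ("hypothetical job",))
rows = connection.execute("SELECT message FROM queue").fetchall()
connection.close()

# A new connection starts empty: nothing was written to disk.
fresh = sqlite3.connect(":memory:")
tables = fresh.execute("SELECT name FROM sqlite_master").fetchall()
fresh.close()
```

This is why :memory: suits tests or throwaway instances, whereas a directory path is needed for data to survive a restart.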

services section

If you want to add a webservice (endpoint), add, for example:

[services]
mywebservice.json = amodule.anothermodule.MyWebService

You can use code for webservices in webservice.py as inspiration.

To remove a default webservice, set it to empty:

[services]
daemonstatus.json =

settings section (scrapy.cfg)

Project code is usually stored in a Python egg and uploaded to Scrapyd via the addversion.json webservice.

Alternatively, you can invoke Scrapyd within a Scrapy project: that is, you can run the scrapyd command from a directory containing a scrapy.cfg file (or from a directory with any parent directory containing a scrapy.cfg file).

As described in Scrapy’s documentation, the scrapy.cfg file contains a [settings] section, which can describe many Scrapy projects. By default, it is:

[settings]
default = projectname.settings
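The upward search for scrapy.cfg and the reading of its [settings] section can be sketched with the standard library. This is an illustration under stated assumptions, not Scrapyd's or Scrapy's own lookup code; the function names and the temporary directory layout are hypothetical.

```python
import configparser
import tempfile
from pathlib import Path


def closest_scrapy_cfg(start):
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def project_settings(cfg_path):
    """Map each project name in the [settings] section to its settings module."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    return dict(parser["settings"]) if parser.has_section("settings") else {}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "scrapy.cfg").write_text("[settings]\ndefault = projectname.settings\n")
    nested = root / "projectname" / "spiders"
    nested.mkdir(parents=True)

    found = closest_scrapy_cfg(nested)  # searched from two levels below the file
    found_in_root = found == root / "scrapy.cfg"
    settings = project_settings(found)
```

Adding further lines to the [settings] section, one per project name, is how a single scrapy.cfg file can describe several projects.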