Configuration file

Scrapyd searches for configuration files in the following locations and parses them in order, with later files taking precedence over earlier ones:

  • /etc/scrapyd/scrapyd.conf (Unix)
  • c:\scrapyd\scrapyd.conf (Windows)
  • /etc/scrapyd/conf.d/* (in alphabetical order, Unix)
  • scrapyd.conf
  • ~/.scrapyd.conf (in the user's home directory)

The configuration file supports the following options (see default values in the example).

http_port

The TCP port where the HTTP JSON API will listen. Defaults to 6800.

bind_address

The IP address where the HTTP JSON API will listen. Defaults to 0.0.0.0 (all interfaces).

max_proc

The maximum number of concurrent Scrapy processes that will be started. If unset or 0, it will use the number of CPUs available in the system multiplied by the value of the max_proc_per_cpu option. Defaults to 0.

max_proc_per_cpu

The maximum number of concurrent Scrapy processes that will be started per CPU. Defaults to 4.
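For example, a minimal fragment pinning concurrency (the values are illustrative):

```ini
[scrapyd]
# Derive the cap from the hardware: with max_proc = 0, a 2-CPU machine
# runs at most 2 x 2 = 4 concurrent Scrapy processes.
max_proc = 0
max_proc_per_cpu = 2

# Alternatively, set a fixed cap regardless of CPU count:
# max_proc = 8
```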

debug

Whether debug mode is enabled. Defaults to off. When debug mode is enabled, the full Python traceback is returned (as a plain-text response) when an error occurs while processing a JSON API call.
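For example, to get full tracebacks while diagnosing API errors:

```ini
[scrapyd]
# Return the full Python traceback as plain text on JSON API errors.
debug = on
```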

eggs_dir

The directory where the project eggs will be stored.

dbs_dir

The directory where the project databases will be stored (this includes the spider queues).

logs_dir

The directory where the Scrapy logs will be stored. To disable log storage, set this option to an empty value, like this:

logs_dir =

items_dir

New in version 0.15.

The directory where the Scrapy items will be stored. This option is disabled by default because you are expected to use a database or a feed exporter. Setting it to a non-empty value stores scraped item feeds in the specified directory by overriding the Scrapy setting FEED_URI.
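A minimal fragment enabling item storage (the directory name is illustrative):

```ini
[scrapyd]
# Store scraped item feeds under the local items/ directory
# (this overrides the Scrapy FEED_URI setting).
items_dir = items
```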

jobs_to_keep

New in version 0.15.

The number of finished jobs to keep per spider. Defaults to 5. This refers to logs and items.

This setting was named logs_to_keep in previous versions.

finished_to_keep

New in version 0.14.

The number of finished processes to keep in the launcher. Defaults to 100. This only affects the website's /jobs endpoint and the related JSON webservices.

poll_interval

The interval used to poll queues, in seconds. Defaults to 5.0. Can be a float, such as 0.2.

runner

The module that will be used for launching sub-processes. You can customize the Scrapy processes launched from Scrapyd by using your own module.
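As a sketch, a custom runner module could set up the environment and then delegate to the stock runner. The module path myproject.runner and the environment variable are assumptions for illustration; it presumes the default scrapyd.runner remains runnable as a module (as the default setting `runner = scrapyd.runner` suggests):

```python
# myproject/runner.py -- hypothetical module; select it in scrapyd.conf
# with `runner = myproject.runner`.
import os
import runpy


def prepare_environment():
    # Illustrative customization: tag each crawl subprocess with an
    # environment variable before the spider starts.
    os.environ.setdefault("LAUNCHED_BY_SCRAPYD", "1")


def main():
    prepare_environment()
    # Delegate to the stock runner (the default `runner = scrapyd.runner`).
    runpy.run_module("scrapyd.runner", run_name="__main__")


# Scrapyd launches the runner as a module, so keep it executable:
# if __name__ == "__main__":
#     main()
```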

application

A function that returns the (Twisted) Application object to use. This can be used if you want to extend Scrapyd by adding and removing your own components and services.

For more info see the Twisted Application Framework documentation.
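As a sketch, a custom application function might wrap the stock one and attach an extra service. The module path myproject.app, the heartbeat service, and its interval are assumptions for illustration; it presumes scrapyd.app.application(config) returns the Twisted Application object, as the default setting suggests:

```python
# myproject/app.py -- hypothetical module; select it in scrapyd.conf
# with `application = myproject.app.application`.
def application(config):
    # Imports are deferred so the module can be inspected without
    # scrapyd/twisted installed.
    from scrapyd.app import application as scrapyd_application
    from twisted.application.internet import TimerService

    # Build the stock Scrapyd application, then attach extra services.
    app = scrapyd_application(config)

    def heartbeat():
        print("scrapyd is alive")  # illustrative periodic task

    # Run the heartbeat every 60 seconds alongside Scrapyd's own services.
    TimerService(60, heartbeat).setServiceParent(app)
    return app
```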

webroot

A Twisted web resource that represents the interface to Scrapyd. Scrapyd includes a website interface that provides simple monitoring and access to the application's web resources. This setting must provide the root class of the Twisted web resource.

node_name

New in version 1.1.

The name of each node, used as its display hostname. Defaults to ${socket.gethostname()}.
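For example, to give a node a fixed name (the value is illustrative):

```ini
[scrapyd]
# Use a stable display name instead of the auto-detected hostname.
node_name = scrapyd-worker-1
```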

Example configuration file

Here is an example configuration file with all the defaults:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus