Developer API reference

Interfaces

interface scrapyd.interfaces.IEggStorage[source]

A component to store project eggs.

put(eggfile, project, version)

Store the egg (a file object), which represents a version of the project.

get(project, version=None)

Return (version, file) for the egg matching the project and version.

If version is None, the latest version and corresponding file are returned.

If no egg is found, (None, None) is returned.

Tip

Remember to close the file when done.

list(project)

Return all versions of the project in order, with the latest version last.

list_projects()

Return all projects in storage.

Added in version 1.3.0: Move this logic into the interface and its implementations, to allow customization.

delete(project, version=None)

Delete the egg matching the project and version. Delete the project if no versions remain.
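
A minimal in-memory implementation of this interface might look like the following sketch. It assumes zope.interface's implementer decorator (which Scrapyd components use to declare interfaces); the constructor, version ordering and class name are illustrative only, not Scrapyd's own storage:

    from io import BytesIO

    from zope.interface import implementer

    from scrapyd.interfaces import IEggStorage


    @implementer(IEggStorage)
    class MemoryEggStorage:
        def __init__(self):
            self._eggs = {}  # {project: {version: egg bytes}}

        def put(self, eggfile, project, version):
            self._eggs.setdefault(project, {})[version] = eggfile.read()

        def get(self, project, version=None):
            versions = self._eggs.get(project, {})
            if not versions:
                return None, None
            if version is None:
                version = max(versions)  # naive "latest"; a real storage sorts versions properly
            if version not in versions:
                return None, None
            return version, BytesIO(versions[version])

        def list(self, project):
            return sorted(self._eggs.get(project, {}))

        def list_projects(self):
            return list(self._eggs)

        def delete(self, project, version=None):
            if version is None:
                self._eggs.pop(project, None)
            else:
                self._eggs.get(project, {}).pop(version, None)
                if not self._eggs.get(project):  # delete the project if no versions remain
                    self._eggs.pop(project, None)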

interface scrapyd.interfaces.IPoller[source]

A component that tracks capacity for new jobs, and starts jobs when ready.

queues

An object (like a dict) with a __getitem__ method that accepts a project’s name and returns its spider queue of pending jobs.

poll()

Called periodically to start jobs if there’s capacity.

next()

Return the next pending job.

It should return a Deferred that fires once there is capacity, or one that has already fired if there is capacity now.

The pending job is a dict containing at least the _project name, _spider name and _job ID. The job ID is unique, at least within the project.

The pending job is later passed to scrapyd.interfaces.IEnvironment.get_environment().

update_projects()

Called to refresh the available projects when they may have changed, as well as at initialization.
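
A rough sketch of an implementation, assuming Twisted's Deferred (Scrapyd is a Twisted application); the message key "name" and the way capacity is tracked are assumptions, not Scrapyd's own poller:

    from twisted.internet.defer import Deferred
    from zope.interface import implementer

    from scrapyd.interfaces import IPoller


    @implementer(IPoller)
    class SimplePoller:
        def __init__(self, queues):
            self.queues = queues  # {project: ISpiderQueue}, satisfies the `queues` attribute
            self._waiting = []    # Deferreds handed out by next(), one per free capacity slot

        def next(self):
            d = Deferred()
            self._waiting.append(d)
            return d

        def poll(self):
            # Fire one waiting Deferred per pending job, while capacity remains.
            for project, queue in self.queues.items():
                while self._waiting and queue.count():
                    message = queue.pop()
                    job = {"_project": project, "_spider": message.pop("name"), **message}
                    self._waiting.pop(0).callback(job)

        def update_projects(self):
            pass  # a real implementation would re-read projects and rebuild self.queues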

interface scrapyd.interfaces.ISpiderQueue[source]

A component to store pending jobs.

The dict keys used by the chosen ISpiderQueue implementation must match those used by the other chosen components that produce and consume its messages.

add(name, priority, **spider_args)

Add a pending job, given the spider name, crawl priority and keyword arguments. Depending on the implementation, the keyword arguments might include the _job ID, egg _version and Scrapy settings; keyword arguments that the implementation doesn't recognize are treated as spider arguments.

Changed in version 1.3.0: Add the priority parameter.

pop()

Pop the next pending job. The pending job is a dict containing the spider name. Depending on the implementation, other keys might include the _job ID, egg _version and Scrapy settings; keys that the receiver doesn't recognize are treated as spider arguments.

list()

Return the pending jobs.

count()

Return the number of pending jobs.

remove(func)

Remove pending jobs for which func(job) is true, and return the number of removed pending jobs.

clear()

Remove all pending jobs.
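
For example, a caller might drive an ISpiderQueue implementation as follows; the queue object is assumed to already exist, and the spider names, job ID and extra argument are illustrative:

    # `queue` is any ISpiderQueue implementation obtained from the application.
    queue.add("myspider", priority=0.0, _job="6487ec79947edab326d6db28a2d86511e8247444", tag="sample")
    queue.add("otherspider", priority=1.0)

    print(queue.count())  # 2 pending jobs
    print(queue.list())   # the pending jobs as dicts

    message = queue.pop()  # dict containing at least the spider name
    removed = queue.remove(lambda job: job.get("_job") == "6487ec79947edab326d6db28a2d86511e8247444")
    queue.clear()          # drop any remaining pending jobs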

interface scrapyd.interfaces.ISpiderScheduler[source]

A component to schedule jobs.

schedule(project, spider_name, priority, **spider_args)

Schedule a crawl.

Changed in version 1.3.0: Add the priority parameter.

list_projects()

Return all projects that can be scheduled.

update_projects()

Called to refresh the available projects when they may have changed, as well as at initialization.
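
A minimal sketch of an implementation that forwards crawls to per-project spider queues; how the queues mapping is built and refreshed is an assumption, not shown here:

    from zope.interface import implementer

    from scrapyd.interfaces import ISpiderScheduler


    @implementer(ISpiderScheduler)
    class QueueScheduler:
        def __init__(self, queues):
            self.queues = queues  # {project: ISpiderQueue}

        def schedule(self, project, spider_name, priority, **spider_args):
            # Delegate to the project's spider queue; extra keyword arguments pass through.
            self.queues[project].add(spider_name, priority, **spider_args)

        def list_projects(self):
            return list(self.queues)

        def update_projects(self):
            pass  # a real implementation would rebuild self.queues when projects change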

interface scrapyd.interfaces.IEnvironment[source]

A component to generate the environment of jobs.

The chosen IEnvironment implementation must match the chosen launcher service.

get_settings(message)

Return the Scrapy settings to use for running the process.

Depending on the chosen launcher, this would be one or more of LOG_FILE or FEEDS.

Added in version 1.4.2: Support for overriding Scrapy settings via SCRAPY_ environment variables was removed in Scrapy 2.8.

Parameters:

message – the pending job received from the scrapyd.interfaces.IPoller.next() method

get_environment(message, slot)

Return the environment variables to use for running the process.

Depending on the chosen launcher, this would be one or more of SCRAPY_PROJECT, SCRAPYD_EGG_VERSION or SCRAPY_SETTINGS_MODULE.

Parameters:

message – the pending job received from the scrapyd.interfaces.IPoller.next() method

slot – the process slot number
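
A sketch of an implementation, assuming the pending-job keys documented for IPoller above; the log path layout and class name are illustrative:

    import os

    from zope.interface import implementer

    from scrapyd.interfaces import IEnvironment


    @implementer(IEnvironment)
    class SimpleEnvironment:
        def __init__(self, logs_dir="logs"):
            self.logs_dir = logs_dir

        def get_settings(self, message):
            # One log file per job, e.g. logs/<project>/<spider>/<job>.log
            path = os.path.join(self.logs_dir, message["_project"], message["_spider"], message["_job"] + ".log")
            return {"LOG_FILE": path}

        def get_environment(self, message, slot):
            env = os.environ.copy()
            env["SCRAPY_PROJECT"] = message["_project"]
            if "_version" in message:
                env["SCRAPYD_EGG_VERSION"] = message["_version"]
            return env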

interface scrapyd.interfaces.IJobStorage[source]

A component to store finished jobs.

Added in version 1.3.0.

add(job)

Add a finished job to the storage.

list()

Return the finished jobs.

__len__()

Return the number of finished jobs.

__iter__()

Iterate over the finished jobs in reverse order by end_time.

A job has the attributes project, spider, job, start_time and end_time, and may have the attributes args (scrapy crawl CLI arguments) and env (environment variables).
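
A minimal in-memory sketch; Scrapyd's own implementations also take configuration (for example, how many finished jobs to keep), which is omitted here:

    from zope.interface import implementer

    from scrapyd.interfaces import IJobStorage


    @implementer(IJobStorage)
    class ListJobStorage:
        def __init__(self):
            self._jobs = []

        def add(self, job):
            self._jobs.append(job)

        def list(self):
            return list(self)

        def __len__(self):
            return len(self._jobs)

        def __iter__(self):
            # Finished jobs, most recently finished first.
            return iter(sorted(self._jobs, key=lambda j: j.end_time, reverse=True))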

Config

class scrapyd.config.Config(values=None, extra_sources=())[source]

A ConfigParser wrapper that supports defaults when calling instance methods and is tied to a single section.

SECTION = 'scrapyd'
get(option, default=None)[source]
getint(option, default=None)[source]
getfloat(option, default=None)[source]
getboolean(option, default=None)[source]
items(section, default=None)[source]
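
For example, using standard Scrapyd options (the defaults shown are only used if the option is missing from the loaded configuration):

    from scrapyd.config import Config

    config = Config()

    # Each getter reads from the [scrapyd] section and falls back to the given default.
    http_port = config.getint("http_port", 6800)
    bind_address = config.get("bind_address", "127.0.0.1")
    poll_interval = config.getfloat("poll_interval", 5.0)
    debug = config.getboolean("debug", False)

    # items() returns the key/value pairs of another section, such as [services].
    services = config.items("services", default=[])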

Exceptions

exception scrapyd.exceptions.ScrapydError[source]

Base class for exceptions from within this package

exception scrapyd.exceptions.ConfigError[source]

Raised if a configuration error prevents Scrapyd from starting

exception scrapyd.exceptions.InvalidUsernameError[source]

Raised if the username contains a colon

exception scrapyd.exceptions.BadEggError[source]

Raised if the egg is invalid

exception scrapyd.exceptions.DirectoryTraversalError[source]

Raised if the resolved path is outside the expected directory

exception scrapyd.exceptions.ProjectNotFoundError[source]

Raised if a project isn’t found in an IEggStorage implementation

exception scrapyd.exceptions.EggNotFoundError[source]

Raised if an egg isn’t found in an IEggStorage implementation

exception scrapyd.exceptions.RunnerError[source]

Raised if the runner returns an error code
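
For example, code that drives Scrapyd components can catch the specific exceptions it cares about and fall back to the shared base class (run_job below is a hypothetical helper, not part of Scrapyd):

    from scrapyd.exceptions import EggNotFoundError, ProjectNotFoundError, ScrapydError

    try:
        run_job()  # hypothetical code that uses Scrapyd components
    except (ProjectNotFoundError, EggNotFoundError) as error:
        print(f"unknown project or egg version: {error}")
    except ScrapydError as error:
        # ScrapydError is the base class for exceptions from within this package.
        print(f"scrapyd error: {error}")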