Release notes

Unreleased

Added

1.5.0b1 (2024-07-25)

This release contains the most changes in a decade, so a beta release is being made first.

Added

  • Add version (egg version), settings (Scrapy settings) and args (spider arguments) to the pending jobs in the response from the listjobs.json webservice. (See the sketch after this list.)

  • Add log_url and items_url to the running jobs in the response from the listjobs.json webservice.

  • Add a status.json webservice, to get the status of a job.

  • Add a unix_socket_path setting, to listen on a Unix socket.

  • Add a poller setting.

  • Respond to HTTP OPTIONS method requests.

  • Add environment variables to override common options. See Environment variables.
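
A minimal sketch of reading the new fields, assuming a Scrapyd instance at the default http://localhost:6800, a hypothetical project named myproject, and that the status.json webservice accepts job and project query arguments:

    import json
    from urllib.request import urlopen

    BASE = "http://localhost:6800"  # assumption: default bind_address and http_port

    # listjobs.json: pending jobs now include "version", "settings" and "args";
    # running jobs now include "log_url" and "items_url".
    with urlopen(f"{BASE}/listjobs.json?project=myproject") as response:
        jobs = json.load(response)

    for job in jobs.get("pending", []):
        print(job["id"], job.get("version"), job.get("settings"), job.get("args"))
    for job in jobs.get("running", []):
        print(job["id"], job.get("log_url"), job.get("items_url"))

    # status.json (new in this release), with a hypothetical job ID.
    with urlopen(f"{BASE}/status.json?job=ba5019f8820e11ee&project=myproject") as response:
        print(json.load(response))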

Changed

  • At each poll_interval, the default poller starts up to max_proc processes, instead of only one. (The number of running jobs will not exceed max_proc.)

  • Drop support for end-of-life Python version 3.7.

Web UI

  • Add basic CSS.

  • Add a confirmation dialog to the Cancel button.

  • Add “Last modified” column to the directory listings of log files and item feeds.

  • The Jobs page responds only to HTTP GET and HEAD method requests.

API

  • Clarify error messages, for example:

    • 'project' parameter is required, instead of 'project' (KeyError)

    • project 'myproject' not found, instead of 'myproject' (KeyError)

    • project 'myproject' not found, instead of Scrapy VERSION - no active project

    • version 'myversion' not found, instead of a traceback

    • exception class: message, instead of message

    • BadEggError, instead of TypeError: 'tuple' object is not an iterator

    • Add error messages for non-UTF-8 bytes and for non-float priority values.

    • “Unsupported method” error messages no longer list object as an allowed HTTP method.

CLI

  • Scrapyd uses twisted.logger instead of the legacy twisted.python.log. The system information in log messages changes, for example:

    • [scrapyd.basicauth#info] Basic authentication ..., instead of [-] ...

    • [scrapyd.app#info] Scrapyd web console available at ..., instead of [-] ...

    • [-] Unhandled Error, instead of [_GenericHTTPChannelProtocol,0,127.0.0.1] ...

    • Data received on standard error, as well as non-zero exit status codes, are logged at the error level.

  • Correct the usage message and long description.

  • Remove the --rundir option, which only works if *_dir settings are absolute paths.

  • Remove the --nodaemon option, which Scrapyd always enables.

  • Remove the --python= option, which Scrapyd must set to its own application.

  • Remove all twistd subcommands (FTP servers, etc.). Run twistd directly, if needed.

  • Run the scrapyd.__main__ module, instead of the scrapyd.scripts.scrapyd_run module.

Library

  • Move functions from scrapyd.utils into their callers:

    • sorted_versions to scrapyd.eggstorage

    • get_crawl_args to scrapyd.launcher

  • jobstorage uses the ScrapyProcessProtocol class by default. If jobstorage is set to scrapyd.jobstorage.SqliteJobStorage, Scrapyd 1.3.0 uses a Job class instead. To promote parity, the Job class is removed.

  • Move the activate_egg function from the scrapyd.eggutils module to its caller, the scrapyd.runner module.

  • Move the job_log_url and job_items_url functions into the Root class, since the Root class is responsible for file URLs.

  • Change the get_crawl_args function to no longer convert bytes to str, as already done by its caller.

  • Change the scrapyd.app.create_wrapped_resource function to a scrapyd.basicauth.wrap_resource function.

  • Change the scrapyd.utils.sqlite_connection_string function to a scrapyd.sqlite.initialize function.

  • Change the get_spider_list function to a SpiderList class.

  • Merge the JsonResource class into the WsResource class, removing the render_object method.

Fixed

  • Restore support for eggstorage implementations whose get() methods return file-like objects without name attributes (1.4.3 regression).

  • If the items_dir setting is a URL and the path component ends with /, the FEEDS setting no longer contains double slashes.

  • The MemoryJobStorage class returns finished jobs in reverse chronological order, like the SqliteJobStorage class.

  • The list_projects method of the SpiderScheduler class returns a list, instead of dict_keys.

  • Log errors to Scrapyd’s log, even when debug mode is enabled.

  • List the closest scrapy.cfg file as a configuration source.

API

  • The Content-Length header counts the number of bytes, instead of the number of characters.

  • The Access-Control-Allow-Methods response header contains only the HTTP methods to which webservices respond.

  • The listjobs.json webservice sets the log_url and items_url fields to null if the files don’t exist.

  • The schedule.json webservice sets the node_name field in error responses.

  • The daemonstatus.json and listjobs.json webservices omitted the next pending job for all but one project, and the cancel.json webservice could not cancel those pending jobs.

Security

  • The FilesystemEggStorage class used by the listversions.json webservice escapes project names (used in glob patterns) before globbing, to disallow listing arbitrary directories.

  • The FilesystemEggStorage class used by the runner and the addversion.json, listversions.json, delversion.json and delproject.json webservices raises a DirectoryTraversalError if the project parameter (used in file paths) would traverse directories.

  • The Environment class used by the launcher raises a DirectoryTraversalError if the project, spider or job parameters (used in file paths) would traverse directories.

  • The Web interface escapes user input (project names, spider names, and job IDs) to prevent cross-site scripting (XSS).

Platform support

Scrapyd is now tested on macOS and Windows, in addition to Linux.

  • The cancel.json webservice now works on Windows, by using SIGBREAK instead of SIGINT or SIGTERM.

  • The dbs_dir setting no longer causes an error if it contains a drive letter on Windows.

  • The items_dir setting is considered a local path if it contains a drive letter on Windows.

  • The jobs_to_keep setting no longer causes an error if a file to delete can’t be deleted (for example, if the file is open on Windows).

Removed

  • Remove support for parsing URLs in dbs_dir, since SQLite writes only to paths or :memory: (added in 1.4.2).

  • Remove the JsonSqliteDict and UtilsCache classes.

  • Remove the native_stringify_dict function.

  • Remove undocumented and unused internal environment variables:

    • SCRAPYD_FEED_URI

    • SCRAPYD_JOB

    • SCRAPYD_LOG_FILE

    • SCRAPYD_SLOT

    • SCRAPYD_SPIDER

1.4.3 (2023-09-25)

Changed

  • Change the project list on the landing page from a comma-separated list to a bulleted list. (@bsekiewicz)

Fixed

  • Fix “The process cannot access the file because it is being used by another process” on Windows.

1.4.2 (2023-05-01)

Added

  • Add a spiderqueue setting. Since this was not previously configurable, the changes below are considered backwards-compatible.

  • Add support for the X-Forwarded-Prefix HTTP header. Rename this header using the prefix_header setting. (See the sketch after this list.)
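
A minimal sketch of the header in use, simulating what a reverse proxy would send; the prefix value /scrapyd and the project name are hypothetical:

    from urllib.request import Request, urlopen

    # A reverse proxy serving Scrapyd under /scrapyd would send this header,
    # so that URLs generated in responses honor the prefix.
    request = Request(
        "http://localhost:6800/listjobs.json?project=myproject",
        headers={"X-Forwarded-Prefix": "/scrapyd"},
    )
    with urlopen(request) as response:
        print(response.read().decode())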

Changed

  • scrapyd.spiderqueue.SqliteSpiderQueue is initialized with a scrapyd.config.Config object and a project name, rather than a SQLite connection string (i.e. a database file path). (See the sketch after this list.)

  • If dbs_dir is set to :memory: or to a URL, it is passed through to scrapyd.jobstorage.SqliteJobStorage and scrapyd.spiderqueue.SqliteSpiderQueue without modification and without creating a directory.

  • scrapyd.utils.get_spider_queues defers the creation of the dbs_dir directory to the spider queue implementation.
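
A minimal sketch of the new initialization described above; instantiating Config() assumes a default configuration is available, and the project name is hypothetical:

    from scrapyd.config import Config
    from scrapyd.spiderqueue import SqliteSpiderQueue

    # Before 1.4.2: a SQLite connection string (database file path).
    # Since 1.4.2: a Config object and a project name.
    queue = SqliteSpiderQueue(Config(), "myproject")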

1.4.1 (2023-02-10)

Fixed

  • Encode the FEEDS command-line argument as JSON. (See the sketch below.)
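
A minimal sketch of the fix: FEEDS is a dict-valued Scrapy setting, so it must be serialized to JSON for Scrapy to parse it back from the command line. The feed URI is hypothetical, and shell quoting is omitted:

    import json

    feeds = {"file:///var/lib/scrapyd/items/myproject/myspider/myjob.jl": {"format": "jsonlines"}}

    # Passed to the crawler process as a command-line setting:
    argument = f"-s FEEDS={json.dumps(feeds)}"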

1.4.0 (2023-02-07)

Added

  • Add log_url and items_url to the finished jobs in the response from the listjobs.json webservice. (See the sketch after this list.) (@mxdev88)

  • Scrapy 2.8 support. Scrapyd sets LOG_FILE and FEEDS command-line arguments, instead of SCRAPY_LOG_FILE and SCRAPY_FEED_URI environment variables.

  • Python 3.11 support.

  • Python 3.12 support. Use packaging.version.Version instead of distutils.LooseVersion. (@pawelmhm)
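
A minimal sketch of reading the new fields, assuming a default local instance and a hypothetical project name:

    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:6800/listjobs.json?project=myproject") as response:
        for job in json.load(response).get("finished", []):
            print(job["id"], job.get("log_url"), job.get("items_url"))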

Changed

  • Rename environment variables to avoid spurious Scrapy deprecation warnings.

    • SCRAPY_EGG_VERSION to SCRAPYD_EGG_VERSION

    • SCRAPY_FEED_URI to SCRAPYD_FEED_URI

    • SCRAPY_JOB to SCRAPYD_JOB

    • SCRAPY_LOG_FILE to SCRAPYD_LOG_FILE

    • SCRAPY_SLOT to SCRAPYD_SLOT

    • SCRAPY_SPIDER to SCRAPYD_SPIDER

    Attention

    Except for SCRAPYD_EGG_VERSION, these are undocumented and unused, and may be removed in future versions. If you use these environment variables, please report your use in an issue.

Removed

  • Scrapy 1.x support.

  • Python 3.6 support.

  • Unmaintained files (Debian packaging) and unused code (scrapyd/script.py).

Fixed

  • Print Scrapyd’s version instead of Twisted’s version with the --version (-v) flag. (@niuguy)

  • Override Scrapy’s LOG_STDOUT setting to False to suppress logging output for listspiders.json webservice. (@Lucioric2000)

1.3.0 (2022-01-12)

Changed

  • Make the project argument to the listjobs.json webservice optional, so that all jobs can easily be queried at once. (See the sketch after this list.)

  • Improve HTTP headers across webservices.
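
A minimal sketch, assuming a default local instance: omitting the project argument queries jobs across all projects.

    import json
    from urllib.request import urlopen

    # No "project" argument: the response covers jobs from all projects.
    with urlopen("http://localhost:6800/listjobs.json") as response:
        print(json.load(response))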

Removed

  • Python 2, 3.3, 3.4, 3.5 support.

  • PyPy 2 support.

  • Documentation for Ubuntu installs (Zyte no longer maintains the Ubuntu package).

Fixed

  • Respect Scrapy’s TWISTED_REACTOR setting.

  • Replace deprecated SafeConfigParser with ConfigParser.

1.2.1 (2019-06-17)

Fixed

  • Fix HTTP header types for newer Twisted versions.

  • DeferredQueue no longer hides a pending job when reaching max_proc.

  • The addversion.json webservice now works on Windows.

  • test: Update binary eggs to be compatible with Scrapy 1.x.

Removed

  • Remove deprecated SQLite utilities.

1.2.0 (2017-04-12)

Added

  • Webservice

    • Add the daemonstatus.json webservice.

    • Add a _version argument to the schedule.json and listspiders.json webservices. (See the sketch after this list.)

    • Add a jobid argument to the schedule.json webservice.

    • Add pid to the running jobs in the response from the listjobs.json webservice.

    • Include full tracebacks from Scrapy when failing to get the spider list. This makes debugging deployment problems easier, but webservice output noisier.

  • Website

    • Add a webroot setting for the website root class.

    • Add start and finish times to jobs page.

  • Make console script executable.

  • Add contributing documentation.

  • Twisted 16 support.

  • Python 3 support.
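
A minimal sketch of scheduling a crawl with the new webservice arguments, assuming a default local instance; the project, spider, version and job ID values are hypothetical:

    from urllib.parse import urlencode
    from urllib.request import urlopen

    data = urlencode({
        "project": "myproject",
        "spider": "myspider",
        "_version": "r123",             # pin an egg version instead of using the latest
        "jobid": "nightly-2017-04-12",  # choose the job ID instead of a generated one
    }).encode()

    # schedule.json expects a POST request.
    with urlopen("http://localhost:6800/schedule.json", data=data) as response:
        print(response.read().decode())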

Changed

  • Change bind_address default to 127.0.0.1, instead of 0.0.0.0, to listen only for connections from localhost.

Removed

  • Deprecate unused SQLite utilities in the scrapyd.sqlite module.

    • SqliteDict

    • SqlitePickleDict

    • SqlitePriorityQueue

    • PickleSqlitePriorityQueue

  • Scrapy 0.x support.

  • Python 2.6 support.

Fixed

  • Poller race condition for concurrently accessed queues.

1.1.1 (2016-11-03)

Added

  • Document and include missing settings in default_scrapyd.conf.

  • Document the spider queue’s priority argument.

  • Enable some missing tests for the SQLite queues.

Removed

  • Disable the bdist_wheel command in setup, in order to define dynamic requirements despite the pip 7 wheel caching bug.

Fixed

  • Use correct type adapter for sqlite3 blobs. In some systems, a wrong type adapter leads to incorrect buffer reads/writes.

  • FEED_URI was always overridden by Scrapyd.

  • Specify maximum versions for requirements that became incompatible.

  • Mark package as zip-unsafe because twistd requires a plain txapp.py.

1.1.0 (2015-06-29)

Fixed

  • Check if a spider exists before scheduling it. (#8, 288afef, a185ff2)

  • Sanitize version names when creating egg paths. (8023720)

  • Generate correct feed URIs, using w3lib. (9a88ea5)

  • Fix git versioning for projects without annotated tags. (#34, e91dcf4)

  • Use valid HTML markup on website pages. (da5664f, 26089cd)

  • Use file protocol for SCRAPY_FEED_URI environment variable on Windows. (4f0060a)

  • Copy JsonResource class from Scrapy, which no longer provides it. (99ea920)

  • Lowercase scrapyd package name. (1adfc31)

  • Mark package as zip-unsafe, because Twisted requires a plain txapp.py. (f27c054)

  • Install scripts using entry_points instead of scripts. (b670f5e)

1.0.2 (2016-03-28)

Fixed

  • Mark package as zip-unsafe, because Twisted requires a plain txapp.py.

  • Specify maximum versions for compatible requirements.

1.0.1 (2013-09-02)

Trivial update

1.0.0 (2013-09-02)

First standalone release (it was previously shipped with Scrapy until Scrapy 0.16).