Release notes¶
1.5.0 (2024-09-05)¶
Added¶
Default webservices can be disabled. See services section.
Fixed¶
Restore the --nodaemon (-n) option (which Scrapyd enables, regardless), to avoid “option --nodaemon not recognized”.
1.5.0b1 (2024-07-25)¶
This release contains the most changes in a decade. Therefore, a beta release is made first.
Added¶
Add version (egg version), settings (Scrapy settings) and args (spider arguments) to the pending jobs in the response from the listjobs.json webservice.
Add log_url and items_url to the running jobs in the response from the listjobs.json webservice.
Add a status.json webservice, to get the status of a job.
Add a unix_socket_path setting, to listen on a Unix socket.
Add a poller setting.
Respond to HTTP OPTIONS method requests.
Add environment variables to override common options. See Environment variables.
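To illustrate the new listjobs.json fields, the sketch below parses a hand-written payload. Only the field names (version, settings, args, log_url, items_url) come from these notes. The job IDs, project and spider names, URL and the summarize_jobs helper are invented for illustration, and the exact response shape should be checked against the API reference.

```python
import json

# Hand-written, illustrative payload: the values are made up. Only the field
# names version, settings, args, log_url and items_url come from the notes.
SAMPLE_RESPONSE = """
{
  "status": "ok",
  "pending": [
    {"id": "job-1", "project": "myproject", "spider": "myspider",
     "version": "r1", "settings": {"DOWNLOAD_DELAY": "2"},
     "args": {"category": "books"}}
  ],
  "running": [
    {"id": "job-2", "project": "myproject", "spider": "myspider",
     "log_url": "/logs/myproject/myspider/job-2.log", "items_url": null}
  ],
  "finished": []
}
"""


def summarize_jobs(raw):
    """Return one human-readable line per pending or running job."""
    data = json.loads(raw)
    lines = []
    for job in data.get("pending", []):
        lines.append(f"pending {job['id']}: version={job['version']} args={job['args']}")
    for job in data.get("running", []):
        # A JSON null becomes None in Python.
        lines.append(f"running {job['id']}: log={job['log_url']} items={job['items_url']}")
    return lines


summary = summarize_jobs(SAMPLE_RESPONSE)
```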
Documentation¶
How to add webservices (endpoints). See services section.
How to create Docker images. See Creating a Docker image.
How to integrate Scrapy projects, without eggs. See settings section (scrapy.cfg).
Changed¶
Every poll_interval, up to max_proc processes are started by the default poller, instead of only one process. (The number of running jobs will not exceed max_proc.)
Drop support for end-of-life Python version 3.7.
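The poller change above can be modeled in a few lines. This is a simplified sketch of the described behavior, not Scrapyd's implementation. The function name jobs_to_start and its parameters are hypothetical.

```python
def jobs_to_start(pending, running, max_proc):
    """Hypothetical model: how many pending jobs a single poll may start.

    The number of running jobs must not exceed max_proc, so a poll can fill
    only the free slots. Before this release, at most one job was started
    per poll.
    """
    free_slots = max(max_proc - running, 0)
    return min(pending, free_slots)


# Five pending jobs, one already running, and max_proc set to four:
started = jobs_to_start(pending=5, running=1, max_proc=4)
```

In this model, a backlog drains in one poll_interval whenever enough slots are free, rather than one job per interval.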
Web UI¶
Add basic CSS.
Add a confirmation dialog to the Cancel button.
Add “Last modified” column to the directory listings of log files and item feeds.
The Jobs page responds only to HTTP GET and HEAD method requests.
API¶
Clarify error messages, for example:
'project' parameter is required, instead of 'project' (KeyError)
project 'myproject' not found, instead of 'myproject' (KeyError)
project 'myproject' not found, instead of Scrapy VERSION - no active project
version 'myversion' not found, instead of a traceback
exception class: message, instead of message
BadEggError, instead of TypeError: 'tuple' object is not an iterator
Error messages for non-UTF-8 bytes and non-float priority.
“Unsupported method” error messages no longer list object as an allowed HTTP method.
CLI¶
Scrapyd uses twisted.logger instead of the legacy twisted.python.log. Some system information changes:
[scrapyd.basicauth#info] Basic authentication ..., instead of [-] ...
[scrapyd.app#info] Scrapyd web console available at ..., instead of [-] ...
[-] Unhandled Error, instead of [_GenericHTTPChannelProtocol,0,127.0.0.1] ...
Data received from standard error and non-zero exit status codes are logged at error level.
Correct the usage message and long description.
Remove the --rundir option, which only works if *_dir settings are absolute paths.
Remove the --nodaemon (-n) option, which Scrapyd enables.
Remove the --python= (-y) option, which Scrapyd needs to set to its application.
Remove all twistd subcommands (FTP servers, etc.). Run twistd, if needed.
Run the scrapyd.__main__ module, instead of the scrapyd.scripts.scrapyd_run module.
Library¶
Move functions from scrapyd.utils into their callers:
sorted_versions to scrapyd.eggstorage
get_crawl_args to scrapyd.launcher
jobstorage uses the ScrapyProcessProtocol class, by default. If jobstorage is set to scrapyd.jobstorage.SqliteJobStorage, Scrapyd 1.3.0 uses a Job class, instead. To promote parity, the Job class is removed.
Move the activate_egg function from the scrapyd.eggutils module to its caller, the scrapyd.runner module.
Move the job_log_url and job_items_url functions into the Root class, since the Root class is responsible for file URLs.
Change the get_crawl_args function to no longer convert bytes to str, as already done by its caller.
Change the scrapyd.app.create_wrapped_resource function to a scrapyd.basicauth.wrap_resource function.
Change the scrapyd.utils.sqlite_connection_string function to a scrapyd.sqlite.initialize function.
Change the get_spider_list function to a SpiderList class.
Merge the JsonResource class into the WsResource class, removing the render_object method.
Fixed¶
Restore support for eggstorage implementations whose get() methods return file-like objects without name attributes (1.4.3 regression).
If the items_dir setting is a URL and the path component ends with /, the FEEDS setting no longer contains double slashes.
The MemoryJobStorage class returns finished jobs in reverse chronological order, like the SqliteJobStorage class.
The list_projects method of the SpiderScheduler class returns a list, instead of dict_keys.
Log errors to Scrapyd’s log, even when debug mode is enabled.
List the closest scrapy.cfg file as a configuration source.
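The difference between a list and dict_keys, mentioned in the list_projects fix above, shows up as soon as the value is serialized. The sketch below is a general Python illustration. The projects dictionary is a hypothetical stand-in, and the notes do not say whether serialization motivated the fix.

```python
import json

# Hypothetical registry mapping project names to arbitrary objects.
projects = {"myproject": object(), "otherproject": object()}

keys_view = projects.keys()  # a dict_keys view, tied to the dictionary
as_list = list(projects)     # an independent, plain list

try:
    json.dumps(keys_view)
    view_serializable = True
except TypeError:
    # The json module does not know how to encode a dict_keys view.
    view_serializable = False

encoded = json.dumps(as_list)
```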
API¶
The Content-Length header counts the number of bytes, instead of the number of characters.
The Access-Control-Allow-Methods response header contains only the HTTP methods to which webservices respond.
The listjobs.json webservice sets the log_url and items_url fields to null if the files don’t exist.
The schedule.json webservice sets the node_name field in error responses.
The next pending job for all but one project was unreported by the daemonstatus.json and listjobs.json webservices, and was not cancellable by the cancel.json webservice.
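The Content-Length fix above matters whenever a response contains non-ASCII text. A minimal illustration, using a made-up response body and assuming UTF-8 encoding:

```python
# Illustrative body containing one non-ASCII character (e with acute accent).
body = '{"spider": "caf\u00e9"}'

char_count = len(body)                  # number of characters
byte_count = len(body.encode("utf-8"))  # number of bytes sent on the wire

# The accented character takes two bytes in UTF-8, so the two counts differ.
```

A Content-Length based on char_count would be one byte short here, so a client could truncate the body.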
Security¶
The FilesystemEggStorage class used by the listversions.json webservice escapes project names (used in glob patterns) before globbing, to disallow listing arbitrary directories.
The FilesystemEggStorage class used by the runner and the addversion.json, listversions.json, delversion.json and delproject.json webservices raises a DirectoryTraversalError error if the project parameter (used in file paths) would traverse directories.
The Environment class used by the launcher raises a DirectoryTraversalError error if the project, spider or job parameters (used in file paths) would traverse directories.
The Web interface escapes user input (project names, spider names, and job IDs) to prevent cross-site scripting (XSS).
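The two filesystem protections above can be sketched with the standard library. This is not Scrapyd's code: TraversalError, safe_join and version_pattern are hypothetical names, and the eggs directory layout is an assumption made for the example.

```python
import glob
import os


class TraversalError(Exception):
    """Hypothetical stand-in for an error raised on directory traversal."""


def safe_join(base_dir, *parts):
    """Join parts under base_dir, refusing any result that escapes it."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, *parts))
    if os.path.commonpath([base, target]) != base:
        raise TraversalError(f"{parts!r} would leave {base_dir!r}")
    return target


def version_pattern(base_dir, project):
    """Build a glob pattern in which the project name is matched literally."""
    # glob.escape neutralizes *, ? and [ so a project named "*" cannot
    # list every directory under base_dir.
    return os.path.join(base_dir, glob.escape(project), "*.egg")
```

For example, safe_join("eggs", "..", "secret") raises TraversalError, while version_pattern("eggs", "*") yields a pattern that matches only a directory literally named "*".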
Platform support¶
Scrapyd is now tested on macOS and Windows, in addition to Linux.
The cancel.json webservice now works on Windows, by using SIGBREAK instead of SIGINT or SIGTERM.
The dbs_dir setting no longer causes an error if it contains a drive letter on Windows.
The items_dir setting is considered a local path if it contains a drive letter on Windows.
The jobs_to_keep setting no longer causes an error if a file to delete can’t be deleted (for example, if the file is open on Windows).
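The drive-letter items above stem from a general pitfall: a URL parser reads the drive letter of a Windows path as a one-letter scheme. The sketch below shows one possible heuristic for telling such paths from URLs. The is_local_path helper is hypothetical and is not taken from Scrapyd.

```python
from urllib.parse import urlsplit


def is_local_path(value):
    """Heuristic: no scheme, or a one-letter scheme (a drive letter), means local."""
    scheme = urlsplit(value).scheme
    return len(scheme) <= 1


windows_path = is_local_path("C:\\items")         # scheme parses as "c"
relative_path = is_local_path("items")            # no scheme
remote_url = is_local_path("s3://bucket/items")   # scheme "s3"
```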
Removed¶
Remove support for parsing URLs in dbs_dir, since SQLite writes only to paths or :memory: (added in 1.4.2).
Remove the JsonSqliteDict and UtilsCache classes.
Remove the native_stringify_dict function.
Remove undocumented and unused internal environment variables:
SCRAPYD_FEED_URI
SCRAPYD_JOB
SCRAPYD_LOG_FILE
SCRAPYD_SLOT
SCRAPYD_SPIDER
1.4.3 (2023-09-25)¶
Changed¶
Change the project list on the landing page from a comma-separated list to a bulleted list. (@bsekiewicz)
Fixed¶
Fix “The process cannot access the file because it is being used by another process” on Windows.
1.4.2 (2023-05-01)¶
Added¶
Add a spiderqueue setting. Since this was not previously configurable, the changes below are considered backwards-compatible.
Add support for the X-Forwarded-Prefix HTTP header. Rename this header using the prefix_header setting.
Changed¶
scrapyd.spiderqueue.SqliteSpiderQueue is initialized with a scrapyd.config.Config object and a project name, rather than a SQLite connection string (i.e. database file path).
If dbs_dir is set to :memory: or to a URL, it is passed through without modification and without creating a directory to scrapyd.jobstorage.SqliteJobStorage and scrapyd.spiderqueue.SqliteSpiderQueue.
scrapyd.utils.get_spider_queues defers the creation of the dbs_dir directory to the spider queue implementation.
1.4.1 (2023-02-10)¶
Fixed¶
Encode the FEEDS command-line argument as JSON.
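The sketch below shows why a JSON-encoded value round-trips where a Python repr does not. It is a general illustration: the feed URI, the NAME=VALUE form of the argument and the parses_as_json helper are assumptions, and the notes do not describe how the argument was encoded before this fix.

```python
import json

# Illustrative FEEDS value: one feed URI mapped to its options.
feeds = {"file:///var/lib/scrapyd/items/myproject/myspider/job-1.jl": {"format": "jsonlines"}}

json_arg = "FEEDS=" + json.dumps(feeds)  # double-quoted, valid JSON
repr_arg = "FEEDS=" + str(feeds)         # single-quoted Python repr


def parses_as_json(argument):
    """Return True if the value after '=' decodes back to the feeds dict."""
    _, _, value = argument.partition("=")
    try:
        return json.loads(value) == feeds
    except json.JSONDecodeError:
        return False
```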
1.4.0 (2023-02-07)¶
Added¶
Add log_url and items_url to the finished jobs in the response from the listjobs.json webservice. (@mxdev88)
Scrapy 2.8 support. Scrapyd sets LOG_FILE and FEEDS command-line arguments, instead of SCRAPY_LOG_FILE and SCRAPY_FEED_URI environment variables.
Python 3.11 support.
Python 3.12 support. Use packaging.version.Version instead of distutils.LooseVersion. (@pawelmhm)
Changed¶
Rename environment variables to avoid spurious Scrapy deprecation warnings.
SCRAPY_EGG_VERSION to SCRAPYD_EGG_VERSION
SCRAPY_FEED_URI to SCRAPYD_FEED_URI
SCRAPY_JOB to SCRAPYD_JOB
SCRAPY_LOG_FILE to SCRAPYD_LOG_FILE
SCRAPY_SLOT to SCRAPYD_SLOT
SCRAPY_SPIDER to SCRAPYD_SPIDER
Attention
Except for SCRAPYD_EGG_VERSION, these are undocumented and unused, and may be removed in future versions. If you use these environment variables, please report your use in an issue.
Removed¶
Scrapy 1.x support.
Python 3.6 support.
Unmaintained files (Debian packaging) and unused code (scrapyd/script.py).
Fixed¶
Print Scrapyd’s version instead of Twisted’s version with --version (-v) flag. (@niuguy)
Override Scrapy’s LOG_STDOUT setting to False to suppress logging output for listspiders.json webservice. (@Lucioric2000)
1.3.0 (2022-01-12)¶
Added¶
Add username and password settings, for HTTP authentication.
Add jobstorage and eggstorage settings.
Add a priority argument to the schedule.json webservice.
Add project to all jobs in the response from the listjobs.json webservice.
Add shortcut to jobs page to cancel a job using the cancel.json webservice.
Python 3.7, 3.8, 3.9, 3.10 support.
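As an illustration of the priority argument added above, the sketch below builds (but does not send) a schedule.json request using only the standard library. The base URL assumes a local Scrapyd on its default port. The project and spider names and the build_schedule_request helper are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider, priority=0.0):
    """Build a POST request for schedule.json with a numeric priority."""
    data = urlencode(
        {"project": project, "spider": spider, "priority": priority}
    ).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(base_url.rstrip("/") + "/schedule.json", data=data)


request = build_schedule_request(
    "http://localhost:6800", "myproject", "myspider", priority=1.5
)
# urllib.request.urlopen(request) would send it; that call is omitted here
# so that the sketch runs without a server.
```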
Changed¶
Make optional the project argument to the listjobs.json webservice, to easily query for all jobs.
Improve HTTP headers across webservices.
Removed¶
Python 2, 3.3, 3.4, 3.5 support.
PyPy 2 support.
Documentation for Ubuntu installs (Zyte no longer maintains the Ubuntu package).
Fixed¶
Respect Scrapy’s TWISTED_REACTOR setting.
Replace deprecated SafeConfigParser with ConfigParser.
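For context on the ConfigParser fix above: SafeConfigParser was a deprecated alias in Python's configparser module and was removed in Python 3.12. The sketch below reads an illustrative configuration with its replacement. The option values are made up, and this is not how Scrapyd loads its own configuration.

```python
from configparser import ConfigParser

# Illustrative configuration text; the values are examples only.
SAMPLE_CONFIG = """
[scrapyd]
bind_address = 127.0.0.1
max_proc = 4
"""

parser = ConfigParser()
parser.read_string(SAMPLE_CONFIG)

bind_address = parser.get("scrapyd", "bind_address")  # returned as a string
max_proc = parser.getint("scrapyd", "max_proc")       # converted to an int
```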
1.2.1 (2019-06-17)¶
Fixed¶
Fix HTTP header types for newer Twisted versions.
DeferredQueue no longer hides a pending job when reaching max_proc.
The addversion.json webservice now works on Windows.
test: Update binary eggs to be compatible with Scrapy 1.x.
Removed¶
Remove deprecated SQLite utilities.
1.2.0 (2017-04-12)¶
Added¶
Webservice
Add the daemonstatus.json webservice.
Add a _version argument to the schedule.json and listspiders.json webservices.
Add a jobid argument to the schedule.json webservice.
Add pid to the running jobs in the response from the listjobs.json webservice.
Include full tracebacks from Scrapy when failing to get spider list. This makes debugging deployment problems easier, but webservice output noisier.
Website
Add a webroot setting for website root class.
Add start and finish times to jobs page.
Make console script executable.
Add contributing documentation.
Twisted 16 support.
Python 3 support.
Changed¶
Change bind_address default to 127.0.0.1, instead of 0.0.0.0, to listen only for connections from localhost.
Removed¶
Deprecate unused SQLite utilities in the scrapyd.sqlite module.
SqliteDict
SqlitePickleDict
SqlitePriorityQueue
PickleSqlitePriorityQueue
Scrapy 0.x support.
Python 2.6 support.
Fixed¶
Poller race condition for concurrently accessed queues.
1.1.1 (2016-11-03)¶
Added¶
Document and include missing settings in default_scrapyd.conf.
Document the spider queue’s priority argument.
Enable some missing tests for the SQLite queues.
Removed¶
Disable bdist_wheel command in setup to define dynamic requirements, despite pip-7 wheel caching bug.
Fixed¶
Use correct type adapter for sqlite3 blobs. In some systems, a wrong type adapter leads to incorrect buffer reads/writes.
FEED_URI was always overridden by Scrapyd.
Specify maximum versions for requirements that became incompatible.
Mark package as zip-unsafe because Twistd requires a plain txapp.py.
1.1.0 (2015-06-29)¶
Added¶
Add node_name (hostname) to webservice responses. (fac3a5c, 4aebe1c)
Add start_time to the running jobs in the response from the listjobs.json webservice. (6712af9, acd460b)
Changed¶
Fixed¶
Check if a spider exists before scheduling it. (#8, 288afef, a185ff2)
Sanitize version names when creating egg paths. (8023720)
Generate correct feed URIs, using w3lib. (9a88ea5)
Fix git versioning for projects without annotated tags. (#34, e91dcf4)
Use file protocol for SCRAPY_FEED_URI environment variable on Windows. (4f0060a)
Copy JsonResource class from Scrapy, which no longer provides it. (99ea920)
Lowercase scrapyd package name. (1adfc31)
Mark package as zip-unsafe, because Twisted requires a plain txapp.py. (f27c054)
Install scripts using entry_points instead of scripts. (b670f5e)
1.0.2 (2016-03-28)¶
Fixed¶
Mark package as zip-unsafe, because Twisted requires a plain txapp.py.
Specify maximum versions for compatible requirements.
1.0.1 (2013-09-02)¶
Trivial update
1.0.0 (2013-09-02)¶
First standalone release (it was previously shipped with Scrapy until Scrapy 0.16).