1. External systems have to do some of the work of a repository
Because SWORD requires deposits to be fully prepared bundles of content and metadata before they can be placed in the repository, a burden falls on the client system: it must be able to store and manage structured files and metadata prior to deposit. Since repositories are designed precisely to store and manage structured files and metadata, this is duplication of effort. There is a strong argument that the role repositories can play in this environment is precisely that of the content storage service provider, rather than simply an end point for some packaged content set.
2. Users must know when an item is archivable
As a consequence of the first limitation, SWORD also requires that client systems are solely responsible for asserting that some set of files and metadata is the finished object to be placed in the archive; the client must draw a boundary around some set of content and package it for delivery to the repository. Often this is a hard assertion to make (such as at what point during the article publication lifecycle something should be archived), and sometimes it is even impossible (such as during the production of continuous data sets). If the repository were the service provider for content storage then the weight of deciding when something should be truly archived can be balanced between the user and the repository administration, where skills in archiving are more likely to reside. We make no assertion here as to what the notion of “archiving” means in terms of a specific repository; it may be true archiving, or simply republication via an open access interface.
3. Full AtomPub profile for SWORD is unclear
From a practical perspective, the SWORD specification does not offer explicit guidelines for the use of the full range of features of AtomPub [12], upon which SWORD is based. While it would be possible for a server implementation to also include a full treatment of AtomPub, this has not been the case in the funded implementations of SWORD thus far. In addition, given that SWORD added extensions and profiling to the deposit (HTTP POST) aspects of AtomPub, it is likely that it will need to do so for these other aspects. For example, SWORD supports the “On Behalf Of” deposit use case by adding new HTTP headers [SWORD spec section A.2], but says nothing about how these should be used except in the case of the deposit using HTTP POST.
4. Dependence on structured packages
SWORD has been fully dependent on structured packages to deliver the payload. This pushes some significant interoperability challenges out of scope, and deliberately so for these first steps. It is difficult to reliably identify packaged content: there is the mimetype of the container (e.g. application/zip), the internal structure of the package, the manifest format, and potentially several nested layers of information therein, such as structural metadata and bibliographic metadata (for an example, consider METS [13]). We could choose a standard SWORD package format to alleviate this issue in the short term. We could also consider supporting single content file deposits accompanied by Atom-formatted [14] metadata documents, as per the AtomPub Multipart [15] specification, which would reduce some of the problems significantly.
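As a rough illustration of the single-file option, the sketch below builds a multipart/related body holding an Atom entry and one content file, using only the Python standard library. It is a minimal sketch and not a conforming implementation of the draft: the function name, the part names ("atom", "payload"), the sample metadata and the file name are all assumptions made for the example.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

# Hypothetical Atom entry carrying the bibliographic metadata.
ATOM_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example article</title>
  <author><name>A. Author</name></author>
  <summary>Hypothetical metadata for a single-file deposit.</summary>
</entry>
"""


def build_multipart_deposit(atom_xml, payload, payload_subtype, filename):
    """Bundle an Atom entry and one content file into a multipart/related message."""
    message = MIMEMultipart("related", type="application/atom+xml")

    # First part: the metadata document.
    metadata = MIMEApplication(atom_xml.encode("utf-8"), "atom+xml", type="entry")
    metadata.add_header("Content-Disposition", "attachment", name="atom")
    message.attach(metadata)

    # Second part: the content file itself (base64-encoded by default).
    content = MIMEApplication(payload, payload_subtype)
    content.add_header(
        "Content-Disposition", "attachment", name="payload", filename=filename
    )
    message.attach(content)
    return message


deposit = build_multipart_deposit(
    ATOM_ENTRY, b"%PDF-1.4 example", "pdf", "article.pdf"
)
metadata_part, content_part = deposit.get_payload()
print(deposit.get_content_type())
print(metadata_part.get_content_type(), content_part.get_filename())
```

The server then has no container structure or manifest to interpret: it reads the Atom entry for metadata and stores the second part as the content file.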
[12] The Atom Publishing Protocol: http://bitworking.org/projects/atom/rfc5023.html
[13] Metadata Encoding and Transmission Standard (METS): http://www.loc.gov/standards/mets/
[14] Atom: http://www.ietf.org/rfc/rfc4287.txt
[15] Atom Multipart Draft: http://tools.ietf.org/html/draft-gregorio-atompub-multipart-04
From my perspective it is really important that SWORD accept any arbitrary format, and in particular new formats that are not mentioned in the SWORD spec, so that clients and servers can agree to innovate without going through a new SWORD version.
That said, I really appreciate earlier comments that having a guaranteed format that everyone accepts is very useful. I mildly disagree with this view in the sense that I’d prefer to have some simple service discovery capability instead, similar to how HTTP clients send the “Accept-Encoding” header to indicate which encodings they can handle. Ideally we all agree to implement against at least one format in common, but in my opinion this should be optional in the spec and determined live during the transaction.
This does increase client implementation complexity a bit, as you have to detect whether the server can accept your format before shipping payloads.
At US ED we’re thinking of archiving as a continuous process, with newer submissions being just newly dated versions of older submissions. The only tie between new and old is a common unique identifier. We’re considering this approach so that end users can submit content freely into repositories (or in our case repository networks) without worrying about curation against older datasets: a “fire and forget” approach that puts the burden on the downstream consumer of the submission to figure out which version is suitable. In some cases all versions might be suitable, as with datasets where the “deltas” between versions create meaningful information themselves.
Not sure if I’m tracking the point, but hopefully this is useful input.
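The detection step described here might look roughly like the sketch below. It assumes a service document loosely modelled on SWORD 1.3, in which each collection lists the packaging formats it accepts with a preference weight; the namespace URIs, the collection URL and the packaging identifiers are all illustrative assumptions, not values taken from the spec.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for this example.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/",
}

# Hypothetical service document returned by a server.
SERVICE_DOCUMENT = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns:sword="http://purl.org/net/sword/">
  <workspace>
    <atom:title>Example repository</atom:title>
    <collection href="http://repository.example.org/deposit/articles">
      <atom:title>Articles</atom:title>
      <accept>application/zip</accept>
      <sword:acceptPackaging q="0.5">http://example.org/packaging/simple-zip</sword:acceptPackaging>
      <sword:acceptPackaging q="1.0">http://example.org/packaging/mets-zip</sword:acceptPackaging>
    </collection>
  </workspace>
</service>
"""


def accepted_packaging(service_xml, collection_href):
    """Return the packaging formats a collection accepts, most preferred first."""
    root = ET.fromstring(service_xml)
    for collection in root.iterfind(".//app:collection", NS):
        if collection.get("href") == collection_href:
            offers = [
                (float(el.get("q", "1.0")), el.text.strip())
                for el in collection.findall("sword:acceptPackaging", NS)
            ]
            offers.sort(key=lambda offer: offer[0], reverse=True)
            return [fmt for _, fmt in offers]
    return []


def choose_packaging(client_formats, server_formats):
    """Pick the server's most preferred format that the client can produce."""
    for fmt in server_formats:
        if fmt in client_formats:
            return fmt
    return None


server_formats = accepted_packaging(
    SERVICE_DOCUMENT, "http://repository.example.org/deposit/articles"
)
chosen = choose_packaging({"http://example.org/packaging/simple-zip"}, server_formats)
print(chosen)
```

A client that gets `None` back knows before sending anything that it shares no format with the server, which is the extra check this approach costs.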
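To make the “fire and forget” idea concrete, here is a minimal sketch of how a downstream consumer might reassemble version histories from independently submitted items. The `Submission` record and its fields are hypothetical; the only thing taken from the comment is that versions share a unique identifier and differ by date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Submission:
    identifier: str   # common unique identifier shared by all versions
    submitted: date   # date of this particular version
    payload: str      # stand-in for the deposited content


def group_versions(submissions):
    """Group submissions by identifier, oldest version first."""
    histories = defaultdict(list)
    for submission in submissions:
        histories[submission.identifier].append(submission)
    for history in histories.values():
        history.sort(key=lambda s: s.submitted)
    return dict(histories)


def latest(history):
    """The consumer, not the submitter, decides which version it wants."""
    return history[-1]


received = [
    Submission("dataset-42", date(2011, 3, 1), "v2"),
    Submission("dataset-42", date(2011, 1, 15), "v1"),
    Submission("dataset-7", date(2011, 2, 2), "only version"),
]
histories = group_versions(received)
print(latest(histories["dataset-42"]).payload)
```

A consumer interested in the deltas would walk each history in order instead of taking only the last entry.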
As an author of middleware that will transfer items from one location to another, my preference is for all repositories to understand the same structure for the thing being transferred. If they understand more than one, that’s moot for me.
Is a “Repository” a dead-end archive containing historical records, or a collection of current (and past) items? Surely the only two “tags” that need to be propagated back up the link are (1) URI to the item, and (2) Item is available to others.
And no-one is ever going to agree… I think that SWORD could specify that all endpoints should support oai_dc, SWORD_dc?, SWORD_atom?, something simple. But I concur with Tim Brody, there is a lot more we can do with other formats which should be recommended, if not specified by SWORD.
R.E. Packaging I still see a problem with the endpoint finding the first file (the manifest) it should read in order to understand what is in the rest of the package.
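One way an endpoint could approach this is to probe a zip package for a short list of well-known manifest locations, as in the sketch below. The candidate list is an assumption for illustration only (a METS file at the root, an IMS-style manifest, and the container file used by ePub); a real endpoint would need a list agreed with its clients.

```python
import io
import zipfile

# Illustrative manifest locations, checked in order of preference.
MANIFEST_CANDIDATES = ("mets.xml", "imsmanifest.xml", "META-INF/container.xml")


def find_manifest(package_bytes):
    """Return the name of the first known manifest in a zip package, or None."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        names = set(package.namelist())
    for candidate in MANIFEST_CANDIDATES:
        if candidate in names:
            return candidate
    return None


# Build a small in-memory package to try the function on.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("article.pdf", b"%PDF-1.4 example")
    package.writestr("mets.xml", "<mets/>")

example_manifest = find_manifest(buffer.getvalue())
print(example_manifest)
```

The weakness is visible in the code: a package whose manifest is not on the list cannot be read at all, which is the problem with unidentified package formats.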
R.E. Full range of AtomPub. Should SWORD be offering explicit guidelines, or leave it to each user to interpret them? SWORD to me is like OAI-DC, it uses a simple set… simply.
This time disagreeing with Ross. In a typical repository context you are able to slowly construct your submission before “depositing” it. Thus you can hold the submission in the “author buffer/work area” until you (as an author) are ready to press the “I’ve finished” button. Even at this point though you might suddenly find an error… then what? I think the problem to solve is how to utilise the repository’s “buffer/work area” using SWORD and then allow the author to “change” the status of the item to “I’ve finished”. Using CRUD, as proposed, you could envisage someone wanting to deposit many files into one object/EPrint etc. and being able to do this one by one rather than needing to package them in the first place. Out of scope?
(following Ross)…and if you abstract away the whole aspect about collecting metadata into a process the repository can be involved in (#depositmo) then SWORD essentially becomes CRUD+REST with some preferred packaging formats?
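The work-area idea can be sketched as a small in-memory model: files are added, replaced or removed one at a time while the item is in progress, and a separate step marks it finished. The class and method names are hypothetical, and this sketch simply refuses changes after finishing; whether a finished item can be reopened is a policy question it does not try to answer.

```python
from dataclasses import dataclass, field


@dataclass
class WorkAreaItem:
    """A deposit that is built up file by file before being marked finished."""
    files: dict = field(default_factory=dict)
    in_progress: bool = True

    def _require_open(self):
        if not self.in_progress:
            raise RuntimeError("item has already been marked as finished")

    def put_file(self, name, content):
        """Create or replace a single file (the C and U of CRUD)."""
        self._require_open()
        self.files[name] = content

    def get_file(self, name):
        """Retrieve a file (R); allowed in any state."""
        return self.files[name]

    def delete_file(self, name):
        """Remove a file (D) while the item is still in progress."""
        self._require_open()
        del self.files[name]

    def finish(self):
        """The author's 'I've finished' button."""
        self._require_open()
        self.in_progress = False


item = WorkAreaItem()
item.put_file("chapter1.pdf", b"draft")
item.put_file("data.csv", b"a,b\n1,2\n")
item.put_file("chapter1.pdf", b"final")   # replace without repackaging
item.delete_file("data.csv")
item.finish()
print(sorted(item.files), item.in_progress)
```

No package is ever assembled on the client side here; the repository holds the partial object until the author declares it complete.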
Packaging is very important. I propose that SWORD could adopt ePub as a web-ready package format. ePub has a simple TOC, and allows for alternate renditions of content. An ePub package might consist of a single PDF with a single XHTML page describing it, but could be a complete document + supporting files such as a thesis with HTML, PDF and .doc versions of all the chapters as well as data files.
We should be using the metadata support in PDF, MS-Word etc. If the end-point doesn’t support a given application format it should support Multipart/Related, which gives us a minimum (Atom entry) + optional XML-encoded metadata elements.
As someone who writes SWORD clients, I would love to see a simple packaging standard that all SWORD servers are guaranteed to accept. OAI-PMH guarantees to serve oai_dc, which is a nice lowest-common-denominator format. It would be great if SWORD could do the same thing with a simple packaging format so I know that my clients can deposit to any repository.
+1 to what Stuart says. One of the most frustrating things with building SWORD clients right now is having to test with *every* SWORD server you want to try to support (as there is no guaranteed lowest common denominator). It’d be nice to have a “quick win” that clients know will be accepted by any valid SWORD server.
Now you really are in the space of CMS and ECMS systems. A repository (to my non-specialist mind) is a place for finished content to be archived and discovered. A CMS is concerned with the management of the complete workflow.
With respect to my question about wider applicability I would agree this is a major limitation. Many repositories of information (such as ECMSs) are not overly concerned about collecting meta-data (at least not the type a typical academic repository requires).