Service Provider (SP) Issues
OAI Service Providers (SPs) have to tackle a number of problems in order to set up their services in an efficient and qualified manner. Most of these problems are general for all SPs and an international collaboration among SPs (a Service Providers Guild?) would thus be helpful. Likewise, enhanced dialogue and information flow between Data and Service Providers would be helpful. In summary, the issues surrounding effective SP harvesting are:
- How to discover a new data repository?
- How to understand a particular data repository?
- How to harvest a particular data repository?
I. How to discover?
- The complex web of data providers, i.e., original repositories, aggregator repositories, content creators not ready to be OAI compliant yet. Without a "social network" (preferably graphic) description of who harvests whom, it can be difficult to know whether an SP is receiving duplicate records inadvertently. This also means that an SP has to keep a close eye on the sets that are being made available each time they harvest.
- Learning about new repositories. Often SPs get email from potential data providers who are looking for help in setting up their OAI-compliant repositories (they get forwarded to www.openarchives.org on a general basis), from data providers who want to have their repository tested before it goes live, from content creators who don't have the facilities to set up their own OAI-compliant systems (who should be able to use the static repository protocol), and from those who think we're the only service providers available. This word-of-mouth means that we receive information on (probably) most of the available repositories, but other service providers are not getting the benefit of it. This goes both ways. There exist three lists where SPs can go to find out about repositories: the www.openarchives.org list, the eprints.org list and UIUC's registry. The OAI Repository Explorer is not always reliable since it was designed for testing and does not always contain viable repositories.
- Update of information from repositories. While it may be difficult to discover new repositories, it is much harder to know when a repository is not "live" anymore. Sometimes it is obvious after trying Identify or ListSets and getting 404s. Sometimes it becomes obvious after months of failed connections. Additionally, while the metadata DC Identifiers may still work for these defunct repositories, when will they not work anymore? Should they still be included in the service provider collection? This is made more difficult by a defunct repository generally not having a valid contact email address.
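One way to make the "is this repository still live?" question less ad hoc is to record the outcome of every probe (for example, an Identify request) and classify the repository from that history. The following is only a minimal sketch under stated assumptions: the Probe record, the classify function, the state names, and the 90-day threshold are all hypothetical and are not part of OAI-PMH or of any existing harvester.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Probe:
    day: date
    status: Optional[int]  # HTTP status code, or None if the connection failed


# Hypothetical threshold: how long a repository may fail before we call it defunct.
DEFUNCT_AFTER = timedelta(days=90)


def classify(history: List[Probe], today: date) -> str:
    """Return 'unknown', 'live', 'suspect', or 'defunct'.

    `history` is ordered oldest first. A repository is 'live' if its most
    recent probe returned 200, 'defunct' if it has been failing for at least
    DEFUNCT_AFTER since its last success (or since the first probe if it
    never succeeded), and 'suspect' otherwise.
    """
    if not history:
        return "unknown"
    if history[-1].status == 200:
        return "live"
    last_ok = None
    for probe in history:
        if probe.status == 200:
            last_ok = probe.day
    failing_since = last_ok if last_ok is not None else history[0].day
    if today - failing_since >= DEFUNCT_AFTER:
        return "defunct"
    return "suspect"


history = [
    Probe(date(2024, 1, 1), 200),   # Identify answered normally
    Probe(date(2024, 2, 1), 404),   # base URL gone
    Probe(date(2024, 5, 1), None),  # connection failed
]
print(classify(history, date(2024, 5, 1)))
```

A 'suspect' state gives the SP a place to park a repository before deciding whether its records should stay in the service provider collection.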
II. How to understand?
- Provenance. Does the archive contain "original" data or "copied" (OAI harvested) data or both? May one harvest only the original data (set support)? Does the copied data clearly state its provenance?
- Subject coverage. Which subjects are covered? May one harvest only a single or a limited set of subjects (set support)? How are subjects described/classified?
- Object types. Does the harvested metadata point to digital objects? Which types are used in the archive (text, image, sound, dataset, simulation, computer program, etc.)? May one harvest the different object types separately (set support)?
- Object access. Who can access the full information objects? Anyone without any fee or restriction? Only those ready to pay-per-view? Only those with a prearranged license/subscription? May one harvest these different groups separately (set support)?
- Metadata format. Which formats are offered and how are they used in practice? What is the native format (documentation)? How is the native format converted into OAI-DC?
- Quality control. What can we expect in terms of content quality and credibility from the particular archive? Can anyone submit to the archive, or only a well-defined group to which you can attach trust? Are the submissions quality controlled (using peer-review or another method)? As an SP, it is useful to state that your service provides a certain quality/level, but learning about this before harvesting is difficult at the moment without performing spot-check validation.
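Several of the questions above hinge on set support, and the first thing an SP can inspect is the repository's ListSets response. The sketch below parses such a response with Python's standard library. The sample XML is invented for illustration (the set names and the "external:" prefix are assumptions, not taken from a real repository), and the result shows only what the provider chose to expose as sets, not whether those sets answer the provenance, subject, or access questions.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the elements used below.
SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListSets>"
    "<set><setSpec>poetry</setSpec><setName>Poetry</setName></set>"
    "<set><setSpec>external:example</setSpec>"
    "<setName>Records harvested from another archive</setName></set>"
    "</ListSets>"
    "</OAI-PMH>"
)


def parse_list_sets(xml_text: str) -> Dict[str, str]:
    """Map each setSpec in a ListSets response to its setName."""
    root = ET.fromstring(xml_text)
    sets: Dict[str, str] = {}
    for set_el in root.findall(".//oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        sets[spec.strip()] = name.strip()
    return sets


for spec, name in parse_list_sets(SAMPLE).items():
    print(spec, "->", name)
```

A real response may be split across several requests by a resumptionToken; this sketch handles a single response only.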
III. How to harvest?
Protocol
- Retry-After and useless requests. The Retry-After option is good, but servers that implement it with a fixed minimum delay (rather than one based on load) could state that delay as part of their reply, e.g., a doNotRetryBefore value given as a number of seconds or a date. This would avoid an extra useless request.
- Using setSpecs. It would be useful to be able to download records from a list of sets. It's often the case that an SP downloads records more than once because they belong to more than one set. Going a step further, it would be practical to negate a set in the list (download all sets except poetry).
- Number of records in archive. It's currently impossible to know how many records are in a data repository before starting the harvest. Maintaining this information automatically would remove the possibility of out-of-date archive record numbers. As it stands now, one isn't sure if one is about to harvest 10 or 100K records, which makes a difference for reasons of time.
- Harvesting practice. Should harvesting be done all at once, by sets, by time period...? It would help if archives could provide some hints on how they prefer to be harvested; for example, a small data repository could indicate a weekly harvest. A large data repository (like arXiv) may have automatic limits to prevent lengthy, large harvests. Some archives may prefer to be harvested at a certain time (e.g., State Library of Victoria, Australia) or may prefer that you download more than 10K records per day. Providing automated structuring of this information would be best but even a textual description would help in configuring harvests.
- Deletion or modification to filenames of records. At times an SP will harvest a repository only to learn that they have deleted records from their repository without using the protocol's deleted status. This shows up in two ways: either there are more records than the harvester says there should be (generally a modification to the filename, so that both the old version and the new version exist) or the number of records harvested is less than the number harvested previously (generally records deleted). In either case, the harvester database and filesystem have to be cleaned of that repository's records and started again from scratch. With large repositories, this can take quite a bit of time (even when automated).
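Until the protocol accepts a list of sets (or a negated set), the "Using setSpecs" wish above can be approximated on the SP side, because each OAI-PMH record header lists the setSpecs the record belongs to. The following is a minimal client-side sketch, not a protocol feature: the select_records function and the dictionary shape of a record ("identifier" plus "sets") are assumptions made for illustration.

```python
from typing import AbstractSet, Dict, Iterable, List

# Hypothetical record shape: {"identifier": str, "sets": set of setSpec strings}
Record = Dict[str, object]


def select_records(
    records: Iterable[Record],
    include: AbstractSet[str] = frozenset(),
    exclude: AbstractSet[str] = frozenset(),
) -> List[Record]:
    """Keep each record once, filtered by wanted and unwanted setSpecs.

    An empty `include` means "all sets". A record is dropped if it belongs
    to any set in `exclude`, and duplicates (the same identifier seen again
    because the record sits in several sets) are skipped.
    """
    seen = set()
    selected: List[Record] = []
    for record in records:
        identifier = record["identifier"]
        record_sets = set(record["sets"])
        if identifier in seen:
            continue
        if include and not (record_sets & include):
            continue
        if record_sets & exclude:
            continue
        seen.add(identifier)
        selected.append(record)
    return selected


harvested = [
    {"identifier": "oai:example.org:1", "sets": {"prose"}},
    {"identifier": "oai:example.org:2", "sets": {"prose", "essays"}},
    {"identifier": "oai:example.org:2", "sets": {"prose", "essays"}},  # seen again via "essays"
    {"identifier": "oai:example.org:3", "sets": {"poetry"}},
]

# "All sets except poetry": no include list, one excluded set.
kept = select_records(harvested, exclude={"poetry"})
print([r["identifier"] for r in kept])
```

This removes duplicates after the fact; it does not save the bandwidth spent downloading a record more than once, which only a protocol-level change could do.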
Practice
- Harvesting by set is tedious and lengthy, but necessary. This can be a function of the harvesting software, in that one can't harvest separate sets into separate pathnames. Some data repository software tools (e.g., Arno) don't provide the ability for the data provider to create sets, which is a huge drawback for SPs. Although sets are cumbersome to create and maintain, they are necessary for SPs who harvest some metadata through original data repositories and some through aggregator data repositories.
- Set structure and number of sets. It seems that the use of sets is sometimes not very structured. Many servers define keywords as sets which makes a very large and flat structure which cannot be used for much. Often important information is hidden in the last sets, for example a set which represents another archive being harvested by that site. Perhaps a practice of using a broadly defined set structure with a few predefined base sets available would help. For example, all setSpecs for keywords could start with "keyword:" and all setSpecs for harvested data could start with "external:".
- Records out of sets. Some providers don't keep all of their data in sets. For example, a data repository may have 5 sets and then some extra records which are in no sets and can only be harvested by not using a setSpec. It would be beneficial to know which providers practice this method.
- Bogus earliest timestamp. It seems that a lot of providers don't really take the earliest timestamp seriously and just set it to 1900-01-01 or 0000-00-00. It is not a large problem but it would be beneficial to know the real earliest timestamp while harvesting.
- Character conversion errors abound. These come in two forms: XML and UTF-8. In either case, SPs have to work around the problem by harvesting using ListIdentifiers and loose validation and/or using a very simple harvester that merely grabs records without caring whether they're valid or not. In the former case, the "bad" records get skipped so records end up getting lost, and in the latter case, a number of scripts need to be run on the harvested records to get them into shape for indexing.
- Some data provider software is not ready for prime time. Harvesting from ContentDM data providers involves harvesting only by set and only using the ListIdentifiers command, if you want to be sure harvests don't die. This is the only data repository software that claims to be OAI-compliant and is more trouble than it's worth. Arno seems to be problematic as well, because it doesn't allow set creation, but harvesting does not die when you harvest from an Arno data repository.
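The clean-up scripts mentioned under character conversion errors typically do two things: force the harvested bytes into valid UTF-8, and drop characters that XML 1.0 does not allow. The sketch below shows one possible form of such a script; the function name clean_record is hypothetical, and the choice to replace undecodable bytes with U+FFFD and silently strip illegal characters is an assumption that an SP may not want (it alters the record). It does not repair malformed markup such as unescaped ampersands or unclosed tags.

```python
import re

# Everything outside the XML 1.0 legal character ranges:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_ILLEGAL_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_record(raw: bytes) -> str:
    """Decode harvested bytes as UTF-8 and strip characters illegal in XML 1.0.

    Bytes that are not valid UTF-8 become U+FFFD (the replacement character)
    instead of raising an error, so the record is kept rather than skipped.
    """
    text = raw.decode("utf-8", errors="replace")
    return _ILLEGAL_XML.sub("", text)


# A NUL byte (illegal in XML) and a stray 0xFF byte (invalid UTF-8).
print(repr(clean_record(b"ab\x00c\xff")))
```

Counting how many characters were replaced or stripped per repository would also give the SP something concrete to report back to the data provider.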