Vanishing web page problem.
Slashdot has a story with discussion on the problem of web pages that disappear. (*) This is especially troubling for academic disciplines, because references to web pages stop working.
Here is a solution. It is based on the notion that the URL/URI (the part that usually starts with http://) is a temporary location and is subject to technical limitations.
First, let anyone who publishes a web page have the option to register with an international NGO specially created for this purpose. The registration would give the web page publisher an organization code. The organization code would be the prefix of the document ID. The suffix of the document ID would be a serial number identifying the individual document. Either the central registry or the web page publisher could issue suffixes. Each document on a registered site would thus have a document ID unique in the world. Every page that the person (or organization) publishes on the web would display its document ID on the page itself.
To visualize it, this is how it might appear:
document://orange/reg4542837
The “document” segment tells human beings that this is an online document. The “document” protocol could be created as a variation of HTTP if that were found to be suitable. The “orange” segment represents the name of the web page publisher. In this case, that might be Orange Research. (I just made the name up.) The “reg4542837” segment represents the document ID number. This number would be unique among all web page documents published by Orange Research.
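To make the format concrete, here is a minimal sketch of how software might split such an ID into its two parts. The `document://` scheme is only the hypothetical one from the example above, and the names `DocumentId`, `parse_document_id`, and `format_document_id` are made up for illustration; nothing here is an existing standard.

```python
from typing import NamedTuple

# Hypothetical scheme prefix, taken from the example ID above.
SCHEME_PREFIX = "document://"


class DocumentId(NamedTuple):
    publisher: str  # organization code, e.g. "orange"
    serial: str     # per-publisher document number, e.g. "reg4542837"


def parse_document_id(text: str) -> DocumentId:
    """Split a document ID string into publisher code and serial number."""
    text = text.strip()
    if not text.startswith(SCHEME_PREFIX):
        raise ValueError(f"not a document ID: {text!r}")
    publisher, _, serial = text[len(SCHEME_PREFIX):].partition("/")
    # Both parts must be present, and the serial may not contain a further "/".
    if not publisher or not serial or "/" in serial:
        raise ValueError(f"malformed document ID: {text!r}")
    return DocumentId(publisher, serial)


def format_document_id(doc_id: DocumentId) -> str:
    """Rebuild the printable form of a document ID."""
    return f"{SCHEME_PREFIX}{doc_id.publisher}/{doc_id.serial}"
```

Because the publisher code and the serial number are kept separate, the registry would only need to guarantee that organization codes are unique; each publisher could then keep its own serial numbers unique.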
The central registry receives a copy of the document. The central registry permanently stores a copy of the document and a copy of the URL/URI. It files these two together with the document ID. The academic citation in a dead-tree paper journal might look something like this.
See Johnson, L. and Smith C., 2003, document://orange/reg4542837
A person who is interested in checking the citation, an inquirer, can go to the central registry, type in the document ID, and get the last known URL/URI that works. Then the person can view the article.
At the same time, the registry checks the URL/URI to see if it still works. If it does not work, the inquirer is provided with the central registry’s copy of the document.
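The lookup-with-fallback step could be sketched roughly as follows. This is only an illustration under stated assumptions: the registry is modeled as an in-memory dictionary, and the names `RegistryRecord`, `Resolution`, and `resolve` are hypothetical. The liveness check is passed in as a function so the sketch does not depend on any particular HTTP library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RegistryRecord:
    url: str            # last known URL/URI of the document
    archived_copy: str  # the registry's own stored copy


@dataclass
class Resolution:
    source: str                    # "live" or "archive"
    url: Optional[str] = None      # set when the original URL still works
    content: Optional[str] = None  # set when falling back to the stored copy


def resolve(
    registry: Dict[str, RegistryRecord],
    doc_id: str,
    url_is_alive: Callable[[str], bool],
) -> Resolution:
    """Return the live URL if it still works, else the registry's copy."""
    record = registry.get(doc_id)
    if record is None:
        raise KeyError(f"unknown document ID: {doc_id}")
    if url_is_alive(record.url):
        return Resolution(source="live", url=record.url)
    return Resolution(source="archive", content=record.archived_copy)
```

In a real service, `url_is_alive` would presumably make an HTTP request and inspect the status code; here it is left to the caller.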
For this to have mass appeal, the “document” protocol would have to be developed so that people could just type it into their web browsers. The web browser would go to the central registry and run the search for them.
If the URL/URI changes, the central registry can update its database. If other web sites have mirrors of the article, the central registry can be notified, and list them.
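As a rough sketch of the bookkeeping this implies, the registry could keep one primary URL and a list of mirrors per document ID. The `Registry` class and its method names are assumptions made up for this example, and the storage is a plain in-memory dictionary rather than a real database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    url: str                                          # current primary URL/URI
    mirrors: List[str] = field(default_factory=list)  # known mirror URLs


class Registry:
    """Toy in-memory registry mapping document IDs to their locations."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, doc_id: str, url: str) -> None:
        if doc_id in self._entries:
            raise ValueError(f"document ID already registered: {doc_id}")
        self._entries[doc_id] = Entry(url)

    def update_url(self, doc_id: str, new_url: str) -> None:
        # The document ID stays the same; only the location changes.
        self._entries[doc_id].url = new_url

    def add_mirror(self, doc_id: str, mirror_url: str) -> None:
        entry = self._entries[doc_id]
        # Skip duplicates and the primary URL itself.
        if mirror_url != entry.url and mirror_url not in entry.mirrors:
            entry.mirrors.append(mirror_url)

    def locations(self, doc_id: str) -> List[str]:
        """Primary URL first, followed by mirrors in the order reported."""
        entry = self._entries[doc_id]
        return [entry.url, *entry.mirrors]
```

The point of the design is that citations never need to change: only the registry's entry is edited when a document moves.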
There are many practical advantages to this system. Suppose it is the year 2009. A web page article written back in 2003 received a document ID. The document ID, the URL, and a copy of the document were stored at the central registry. Since then, several academic researchers have cited this article in academic journals. Then, in 2007, the web server on which the article originally appeared is taken offline. The URL/URI no longer exists. In 2009, someone comes across the citation and wants to check it. He gets the URL and finds that it no longer works. The central registry sees that the URL is no longer working. The central registry provides the inquirer with its copy of the document. Thus, the vanishing web page problem would be solved.
Additionally, the system would have other advantages. Not all web pages would receive document IDs. Only those publishers interested in writing scholarly articles would apply for document IDs.
Scholars who wished to cite to online works could choose to cite only to document IDs, and not to URL/URIs. This would give scholars added confidence that their work can be checked later on.
The central registry could charge a small fee for its services, and thus perhaps be financially self-sustaining. It would probably need seed funding, though.
Academically minded amateurs would not be locked out of the system. Someone working outside the establishment could publish his work and give it document IDs. At the same time, mainstream academics would not have to pay attention to anyone’s work unless they chose to. Thus, the system protects both interesting non-institutional thinkers from exclusion and institutional academics from kooks.
Articles currently on the web could quickly be given document IDs. Thus, articles already on the web would not be locked out of the system.
Developing the “document” protocol would be relatively easy, since it would just be a variation of HTTP. Integrating the document protocol into modern web browsers would be easy. Mozilla.org is very innovative today. Opera Software is very innovative. Microsoft is promising innovations in the version of Internet Explorer to be released with the successor to Windows XP.
This system would be fault tolerant. Even if the central registry and all of its mirrors were to cease to exist, one could still run the document ID through a search engine (like Google or Teoma) and perhaps come up with mirrored copies of the article, if there are any.
This system would also have the advantage of reverse citation checking. If you wanted to find all of the online articles that cite to a particular document ID, you could easily do so.
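One way reverse citation checking could work is an inverted index built by scanning article text for document IDs. The sketch below assumes the hypothetical `document://publisher/serial` format from the earlier example and assumes that both parts consist only of letters, digits, and hyphens; the function name `build_reverse_index` is made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Assumed ID shape: document://<publisher>/<serial>, letters/digits/hyphens only.
DOCUMENT_ID_PATTERN = re.compile(r"document://[A-Za-z0-9-]+/[A-Za-z0-9-]+")


def build_reverse_index(articles: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each cited document ID to the set of article IDs that cite it.

    `articles` maps an article's own document ID to its full text.
    """
    cited_by: Dict[str, Set[str]] = defaultdict(set)
    for article_id, text in articles.items():
        for cited_id in DOCUMENT_ID_PATTERN.findall(text):
            if cited_id != article_id:  # ignore an article naming itself
                cited_by[cited_id].add(article_id)
    return dict(cited_by)
```

Because the document ID is a fixed string rather than a URL that may change, a plain text search like this keeps working even after the cited article has moved.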
The analogy is to a library. The URL is analogous to a physical location of a book in a library. Unfortunately, that book might be lost or stolen. The library itself could move or close down. The document ID system that I’ve suggested would be like a Dewey Decimal System for web page documents. If the book is gone, or the library is gone, you can still go to another library, and with the Dewey Decimal number, you can still find a copy of the book. (Substitute Library of Congress number or British library number for Dewey Decimal number if you’d like.) In addition, because the central registry would have a copy of the document, the problem of truly vanished documents would not arise.
Edited: 29 November 2003: to better express that the web page publishers would not be under any obligation to use the registry.
Update: 15 July 2004. It turns out that an implementation of this exists called the Digital Object Identifier system. (†) One difference is that DOI adds copyright management.