WebArchive Project, CS Department, UCLA

The Internet has grown at a prodigious rate over the last decade. Information sources are available everywhere, providing a vast amount of useful information to the general public at any time. Unfortunately, the information on the Internet is inherently ephemeral. It is distributed over millions of servers which are administered by a multitude of organizations. New information is constantly being posted at a server, and existing information is continuously being deleted at an administrator's whim, regardless of how important it is.

This project investigates how to build a Web archive system that can store the history and the evolution of the Web: tracking the changes of the Web, storing multiple versions of Web documents in a succinct way, and providing the archived information to users through an intuitive interface. Such an archive system can have a significant impact on multiple disciplines and on humanity in general:

Archive of Human Knowledge: As the Web becomes more popular and ubiquitous, a significant number of people rely on the Web as their primary source of information. Whenever they need to look for certain information, they first go to the Web and look up pages. Also, a significant amount of information is available only on the Web in digital form. Therefore, once information disappears from the Web, it may be permanently lost to humanity. Unless we archive the constantly changing Web over a long period of time, we may suffer a constant loss of information that may have taken several decades to discover.

Web Research Platform: Constant changes to Web documents pose many challenges to Web caches, Web crawlers and network routers. Because a Web cache or a Web crawler does not know when and how often a Web page changes, it has to download and/or cache the same page multiple times, even if the page has not changed. A large body of research has been conducted to address this challenge, and innovative new algorithms and protocols have been developed. However, due to the lack of Web change data, it has been difficult to validate the algorithms in practice. A central archive of the Web's history will provide valuable Web change data and serve as a research platform where researchers can develop and validate new ideas.

Study of Knowledge Evolution: Over time, new topics and/or genres gain popularity and suddenly attract the interest of a large community of people. Yet we often do not understand exactly when and how the topics started and what made them popular all of a sudden. For example, the Linux project, which started only 10 years ago as a "hobby" project of a Finnish graduate student, became hugely popular over the last decade. It is estimated that the Linux project has a community of millions of people who closely follow its development. Why has it become so popular? How did the community start? How did people discover the project?

While answering these questions is not easy, we may get a better understanding by analyzing the history of Web documents. For instance, if we want to study the evolution of the Linux project, we may go back to the Web of 10 years ago and study which pages were mentioning "Linux" at that time, how the pages were linked to each other, and in what sequence they were created. This analysis may give us a good indication of how the community developed over time.

Web Surveillance: The information sources on the Internet are maintained autonomously, without any central organization that oversees their development. Due to this autonomy, the Internet is also used for many illegal activities. For example, a significant number of copyrighted materials are illegally copied and distributed over the Web, and many terrorists are believed to use the Web to exchange information and to coordinate their actions. Often, those behind these activities escape the notice of intelligence agencies by constantly moving the location of their information. A central Web archive will provide a trace of these illegal activities, and we may detect them by analyzing the trace.

While the benefits are enormous, there are many challenges to the construction of a Web archive system:

Effective Change Monitoring: The information sources on the Internet are updated autonomously and independently. Therefore, a Web archive system has to automatically figure out how often the sources are updated and how it should download updated information effectively. This problem is exacerbated because the archive system has limited download and storage resources. Therefore, it may "miss" important changes of pages unless it uses its resources carefully. For example, assume that a Web page p1 changes slightly but frequently, say, once every day. Also assume that a page p2 changes heavily but infrequently, say, once every week. Then how should the archive system download pages p1 and p2? Should it download p1 more often to trace all its changes, potentially missing some of p2's changes?
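The trade-off between p1 and p2 can be made concrete with a small sketch. It assumes the archive already has an estimate of how often each page changes and how much of the page each change rewrites, and it splits a fixed download budget in proportion to a simple score. The `Page` fields, the `size_weight` parameter and the proportional rule are all illustrative assumptions, not the policy this project uses or an optimal one.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    changes_per_week: float  # assumed estimate of how often the page changes
    change_size: float       # assumed fraction of the page rewritten per change (0..1)


def allocate_downloads(pages, budget_per_week, size_weight=1.0):
    """Split a weekly download budget across pages in proportion to a score.

    score = changes_per_week * change_size ** size_weight
    size_weight = 0 ignores how large a change is and follows frequency only.
    This is one possible heuristic, not a provably optimal schedule.
    """
    scores = {p.name: p.changes_per_week * (p.change_size ** size_weight)
              for p in pages}
    total = sum(scores.values())
    if total == 0:
        # Nothing is known to change: spread the budget evenly.
        share = budget_per_week / len(pages)
        return {p.name: share for p in pages}
    return {name: budget_per_week * s / total for name, s in scores.items()}


# p1 changes slightly every day; p2 changes heavily once a week.
p1 = Page("p1", changes_per_week=7.0, change_size=0.05)
p2 = Page("p2", changes_per_week=1.0, change_size=0.80)

plan = allocate_downloads([p1, p2], budget_per_week=4)
frequency_only = allocate_downloads([p1, p2], budget_per_week=4, size_weight=0.0)
```

Under these made-up numbers, weighting by change size sends most of the budget to p2, while the frequency-only setting sends most of it to p1. Which behavior is preferable depends on what the archive is meant to preserve, which is exactly the open question posed above.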

Efficient Compression: A Web archive system must download and store an enormous amount of data. Textual data on the Web is estimated to be more than 4 terabytes, and it is constantly being updated. In order to store a significant fraction of the Web and its change history, the archive system should employ a novel compression technique that can succinctly represent multiple versions of Web documents.
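As a baseline for what "succinctly representing multiple versions" can mean, the sketch below stores the first version of a page in full and every later version as a line-level delta against its predecessor. It uses Python's standard `difflib` module; the `VersionStore` class and the copy/insert delta format are hypothetical names introduced for illustration, and this is ordinary delta encoding rather than the novel technique the project calls for.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Encode new_lines as copy/insert operations relative to old_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: store only the index range into the old version.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New text has to be stored literally.
            delta.append(("insert", new_lines[j1:j2]))
        # "delete" needs no operation: the old lines are simply not copied.
    return delta


def apply_delta(old_lines, delta):
    """Rebuild the newer version from the older one plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class VersionStore:
    """Keeps version 0 in full and each later version as a delta (illustrative)."""

    def __init__(self, first_version):
        self.base = first_version.splitlines()
        self.deltas = []
        self._latest = self.base

    def add_version(self, text):
        lines = text.splitlines()
        self.deltas.append(make_delta(self._latest, lines))
        self._latest = lines

    def get_version(self, n):
        # Replay the first n deltas on top of the base version.
        # Note: a trailing newline in the input is not preserved.
        lines = self.base
        for delta in self.deltas[:n]:
            lines = apply_delta(lines, delta)
        return "\n".join(lines)


v0 = "a\nb\nc"
v1 = "a\nB\nc\nd"
v2 = "a\nc\nd"
store = VersionStore(v0)
store.add_version(v1)
store.add_version(v2)
```

Retrieving version n replays n deltas, so this layout trades read time for storage. A real system would likely add periodic full snapshots and share content across similar pages, which this sketch does not attempt.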

New Access Interface and Infrastructure: The archive system should be easy for a user to search, browse and analyze. It should be able to handle diverse queries that the user may pose. For example, the user may simply want to browse multiple versions of a particular Web page, or the user may want to pose a complex query, such as "What are the 10 topics whose popularity (measured in the number of pages mentioning the topic) has increased most rapidly in the last 6 months?"
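To show what answering the example query involves, here is a minimal sketch over two in-memory snapshots of the archive. It assumes each snapshot is a dictionary from URL to page text, that the candidate topics are given in advance, that "mentioning" means a case-insensitive substring match, and that "increased most rapidly" means the largest absolute growth in page count. All function names are hypothetical, and a real archive would need an index rather than a full scan.

```python
from collections import Counter


def topic_page_counts(snapshot, topics):
    """Count the pages in one snapshot that mention each topic."""
    counts = Counter()
    for text in snapshot.values():
        lowered = text.lower()
        for topic in topics:
            if topic.lower() in lowered:
                counts[topic] += 1
    return counts


def fastest_growing(old_snapshot, new_snapshot, topics, k=10):
    """Return up to k topics ranked by growth in the number of mentioning pages."""
    old = topic_page_counts(old_snapshot, topics)
    new = topic_page_counts(new_snapshot, topics)
    growth = {t: new[t] - old[t] for t in topics}  # Counter yields 0 if absent
    return sorted(topics, key=lambda t: growth[t], reverse=True)[:k]


# Toy snapshots taken six months apart (made-up data).
six_months_ago = {
    "u1": "linux kernel notes",
    "u2": "cooking at home",
}
today = {
    "u1": "linux kernel notes",
    "u2": "linux tips and cooking at home",
    "u3": "Linux news",
    "u4": "cooking blog",
}

ranking = fastest_growing(six_months_ago, today, ["cooking", "linux"])
```

In this toy data "linux" grows from one page to three and "cooking" from one to two, so "linux" ranks first. Discovering the candidate topics automatically, rather than supplying them, is a harder problem that the sketch leaves open.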