The Internet has grown at a prodigious rate over the last decade. Information
sources are available everywhere, providing a vast amount of useful information
to the general public at any time. Unfortunately, the information on the Internet
is inherently ephemeral. It is distributed over millions of servers which
are administered by a multitude of organizations. New information is constantly
being posted to servers, and existing information is continuously being
deleted at an administrator's whim, regardless of how important it is.
This project investigates how to build a Web archive system that can store
the history and the evolution of the Web: tracking the changes of the Web,
storing multiple versions of Web documents in a succinct way, and providing
the archived information to users through an intuitive interface. Having such
an archive system can make a significant impact on multiple disciplines and
on humanity in general:
- Archive of Human Knowledge: As the Web becomes more popular
and ubiquitous, a significant number of people rely on the Web as their
primary source of information. Whenever they need to look for certain information,
they first go to the Web and look up pages. Also, a significant amount of
information is available only on the Web in a digital form. Therefore, once
information disappears from the Web, it may be permanently lost to humanity.
Unless we archive the constantly changing Web over a long period of time, we
may suffer a continual loss of information that took decades to accumulate.
- Web Research Platform: Constant changes of Web documents
have posed many challenges to Web caches, Web crawlers and network routers.
Because a Web cache or a Web crawler does not know when and how often a
Web page changes, it has to download and/or cache the same page multiple times,
even if the page has not changed. A large body of research has been conducted
to address this challenge, and innovative new algorithms and protocols have
been developed. However, due to the lack of Web change data, it has been
difficult to validate the algorithms in practice. A central archive of the
Web history will provide valuable Web change data and work as a research platform where
researchers can develop and validate new ideas.
- Study of Knowledge Evolution: Over time, new topics and
genres gain popularity and suddenly attract interest from a large community
of people. Yet, we often do not understand exactly when and how
these topics started or what made them popular so suddenly. For example,
the Linux project, which started only 10 years ago as a "hobby" project of
a Finnish graduate student, became hugely popular over the last decade. It
is estimated that the Linux project has a community of millions of people
who closely follow its development. Why has it become so popular? How did the
community start? How did people discover the project?
While answering these questions is not easy, we may get a better understanding
by analyzing the history of Web documents. For instance, if we want to study
the evolution of the Linux project, we may go back to the Web 10 years ago,
and study what pages mentioned "Linux" at that time, how the pages
were linked to each other, and in what sequence they were created. This
analysis may give us a good hint on how the community developed over time.
- Web Surveillance: The information sources on the Internet
are maintained autonomously, without any central organization overseeing
their development. Due to this autonomy, the Internet is also used for many
illegal activities. For example, a significant number of copyrighted materials
are illegally copied and distributed over the Web, and many terrorists are
believed to use the Web to exchange information and to coordinate their actions.
Often, those engaged in these activities escape the radar of intelligence
agencies by constantly moving the location of their information. A central Web
archive will preserve a trace of these activities, and we may detect them
by analyzing that trace.
While the benefits are enormous, there are a multitude of challenges in
building a Web archive system:
- Effective Change Monitoring: The information sources on
the Internet are updated autonomously and independently. Therefore, a Web
archive system has to automatically figure out how often the sources are updated
and how it should download updated information effectively. This problem
is exacerbated because the archive system has limited download and storage
resources. Therefore, it may "miss" important changes of pages unless it
uses its resources carefully. For example, assume that a Web page p1 changes
slightly but frequently, say, once every day. Also assume that a page p2 changes
heavily but infrequently, say, once every week. Then how should the archive
system download pages p1 and p2? Should it download p1 more often to trace
all its changes, potentially missing some of p2's changes?
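The tradeoff above can be framed as a resource-allocation problem. The following is a minimal sketch, not the project's actual algorithm: it assumes each page has an estimated change rate (changes per week) and that repeated downloads of the same page yield diminishing returns, crudely modeled here by halving the marginal gain after each download. Both the change-rate estimates and the halving rule are illustrative assumptions.

```python
import heapq

def schedule_downloads(pages, budget):
    """Greedily allocate a weekly download budget across pages.
    pages: dict mapping page name -> estimated changes per week
    budget: total number of downloads available per week
    Uses a max-heap (negated values) keyed on marginal gain."""
    plan = {name: 0 for name in pages}
    # marginal gain of the first download ~ the page's change rate
    heap = [(-rate, name) for name, rate in pages.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        gain, name = heapq.heappop(heap)
        plan[name] += 1
        # crude diminishing-returns assumption: halve the marginal gain
        heapq.heappush(heap, (gain / 2, name))
    return plan

# p1 changes slightly but often; p2 changes heavily but rarely
print(schedule_downloads({"p1": 7.0, "p2": 1.0}, budget=5))
# → {'p1': 4, 'p2': 1}
```

Under these assumptions, the frequently changing p1 receives most of the budget but p2 is still visited, so its rare, heavy changes are not missed entirely; a real scheduler would also need to estimate the change rates themselves from past observations.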
- Efficient Compression: A Web archive system should download
and store an enormous amount of data. Textual data on the Web is estimated
to be more than 4 terabytes, and it is constantly being updated. In order
to store a significant fraction of the Web and its change history, the archive
system should employ a novel compression technique that can succinctly represent
multiple versions of Web documents.
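One standard way to represent multiple versions succinctly is delta encoding: store one version in full and encode each subsequent version as the differences against its predecessor. The sketch below illustrates the idea using Python's standard difflib; it is a toy example of the general technique, not the compression scheme this project proposes.

```python
import difflib

def make_delta(old, new):
    """Encode `new` as a delta against `old`: runs shared with the
    old version become (copy, start, end) references; new material
    is stored verbatim as (add, text) entries."""
    ops = []
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse a slice of old
        elif tag in ("replace", "insert"):
            ops.append(("add", new[j1:j2]))   # store changed text only
        # 'delete' needs no entry: the old slice is simply not copied
    return ops

def apply_delta(old, ops):
    """Reconstruct the new version from the old version plus the delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

v1 = "The quick brown fox jumps over the lazy dog."
v2 = "The quick brown fox leaps over the sleepy dog."
delta = make_delta(v1, v2)
assert apply_delta(v1, delta) == v2
```

Since consecutive versions of a Web page typically share most of their content, the delta is usually far smaller than the new version itself; the archive need only store the copy/add operations for each revision.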
- New Access Interface and Infrastructure: The archive system
should make it easy for users to search, browse, and analyze the archive. It
should be able to handle the diverse queries that users may pose. For example,
a user may simply want to browse multiple versions of a particular Web page,
or may want to pose a complex query, such as "What are the 10 topics whose
popularity (measured by the number of pages mentioning the topic) has increased
most rapidly in the last 6 months?"