Data Sets

The following datasets are available:

* Real-Web dataset containing hash values of the content of 353,739 web pages collected over a period of six months (Feb. 1999 - July 1999). [ history.all.gz ]

* The same real-web dataset formatted in three columns (web_site, web_page, change_history). The change history is a sequence of bits: 1 means that the page changed between two consecutive visits, and 0 means that it remained the same (e.g. 10000 means that the page had changed the second time we visited it, i.e. in March, and did not change afterwards). [ history.all.norm.gz ]

* Synthetic dataset containing information on 300,000 pages in three columns (web_site, web_page, change_history) over 200 visiting cycles. The change frequency of the pages follows a normal distribution. [ synthetic.all.norm.gz ]
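
The three-column format described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that the columns are separated by whitespace and that each record sits on its own line, neither of which is stated here, so check the files before relying on it. The names PageHistory, parse_line, change_rate and read_histories are hypothetical helpers made up for this illustration.

```python
import gzip
from typing import Iterator, NamedTuple


class PageHistory(NamedTuple):
    web_site: str
    web_page: str
    change_history: str  # bit string such as "10000"


def parse_line(line: str) -> PageHistory:
    """Split one record into its three columns.

    Assumption: columns are whitespace-separated.
    """
    web_site, web_page, change_history = line.split()
    return PageHistory(web_site, web_page, change_history)


def change_rate(history: str) -> float:
    """Fraction of bits equal to 1, i.e. how often the page changed between visits."""
    if not history:
        return 0.0
    return history.count("1") / len(history)


def read_histories(path: str) -> Iterator[PageHistory]:
    """Stream records from a gzip-compressed file, skipping blank lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

For example, change_rate("10000") is 0.2, since the page changed in one of the five recorded intervals. Streaming with read_histories("history.all.norm.gz") avoids loading the whole file into memory.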

