WebArchive
BruinBot Crawler

In the WebArchive project, we are building a prototype Web search engine that lets users retrieve versions of pages collected at different points in time. BruinBot is the crawler we have developed here at UCLA to download the parts of the Web that are important to our research. BruinBot operates by following links on the Web to discover pages, which it then downloads for our search engine.

If BruinBot has recently visited your site, it is because we consider the content you provide both interesting and appropriate for our research. During our downloads we do our best to be courteous to the sites we crawl, and we adhere to the rules they define in their robots.txt files. At present we download one page every 2 seconds from a given Web site, and we recrawl each site roughly once a week.

If you want our (or any other) crawler to stay away from a particular portion of your site, you can say so by writing a simple robots.txt file, following the robot exclusion protocol:

http://www.robotstxt.org/wc/exclusion.html
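
For example, a robots.txt file along the following lines would keep our crawler out of one directory while leaving the rest of your site open. (This is a minimal sketch: the user-agent token "BruinBot" and the /private/ and /tmp/ paths are illustrative assumptions, not part of the protocol itself.)

    # Keep BruinBot out of the /private/ portion of the site
    User-agent: BruinBot
    Disallow: /private/

    # Keep all other crawlers out of /tmp/
    User-agent: *
    Disallow: /tmp/

Place the file at the top level of your site (e.g. http://www.example.com/robots.txt); crawlers that follow the protocol fetch it before downloading any other page.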

Please direct any feedback regarding our crawler to ntoulas at cs dot ucla dot edu.

Copyright (c) 2002, Computer Science Department, University of California, Los Angeles