Download the Internet Index for free!
Well, kind of… dotnetdotcom.org aka Dotbot
You might have noticed a new little critter gnawing at your web server:
Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
The few Seattle-based guys behind it (they appear under pseudonyms on their web site) promise an index of the web that is available to everybody. They intend to release as much information about the web’s structure (linking) and content as possible, though for a fee to cover their costs.
Seomoz Linkscape
The entity behind the pseudonym is actually the SEO company Seomoz. One of their best-known products is the domain evaluation service Linkscape. Linkscape is building its own version of a web link graph, collecting and computing how all web sites relate to each other. It is a nice service for webmasters, but it will also provide quite a bit of information about your site to your competition.
The best way to lock the Linkscape bot out of your site is an entry in your robots.txt control file. The file should start with the bots you would like to lock out completely. For Linkscape you would have to include the following lines:
User-agent: dotbot
Disallow: /
Note that Linkscape promotes a different way of locking out their bot: including a “noindex” meta tag in the header of a web page. Alas, this variant will not prevent Dotbot from crawling your site and extracting its links. It will only prevent the robot from storing the content of your site for its search engine.
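To check that the robots.txt rule reads the way you intend, you can feed it to Python's standard `urllib.robotparser`. This is only a sketch: it shows how a standards-following parser interprets the two lines, not how Dotbot itself behaves. The helper `is_allowed`, the page URL, and the second user agent `SomeOtherBot/2.0` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the text above.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page URL; any path on the site gives the same answer here.
page = "http://www.example.com/page.html"

dotbot_allowed = is_allowed(ROBOTS_TXT, "DotBot/1.1", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot/2.0", page)

print("DotBot allowed:", dotbot_allowed)
print("Other bot allowed:", other_allowed)
```

The rule should come out as blocking DotBot only, while bots without a matching entry remain unaffected.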
The Dotbot Technology
The guys describe their tools in a joking tone: C and Python as programming languages, flat disk files instead of a database system, and some open source software. That’s saying nothing in an elaborate way, of course.
Downloadable Index
The current Dotbot index is available for everybody to download. The index file is constructed according to the structure: "URL-Without-Protocol NULL Optional-String-Not-Used NULL Complete-HTTP-Response NULL", with NULL as the zero byte. Evidently this is a web dump rather than a searchable web index; the sorting, filtering, and indexing still have to follow. I wonder a bit why the protocol is omitted while the complete HTTP response is kept.
As of the end of January 2009, the index covers only a tiny fraction of the web. It comprises about 9 million pages, summing up to an index file size of 68 GB. Find a link to download it on their site (weblink below).
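Given that structure, reading the dump could look roughly like the sketch below. It is based only on the field layout described above and has not been run against the real file. It assumes that no field contains a zero byte itself (binary response bodies could break this) and it reads in chunks, since the file is far too large for memory. The names `iter_fields` and `iter_records` are made up for this example.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_fields(stream: BinaryIO, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield zero-byte-terminated fields without loading the whole file."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last zero byte is complete; keep the rest.
        *complete, buffer = buffer.split(b"\x00")
        yield from complete
    # Trailing bytes without a terminating zero byte are dropped.


def iter_records(
    stream: BinaryIO, chunk_size: int = 1 << 20
) -> Iterator[Tuple[str, bytes, bytes]]:
    """Group fields in threes: (url_without_protocol, unused, http_response)."""
    fields = iter_fields(stream, chunk_size)
    # zip over the same iterator three times consumes three fields per record.
    for url, unused, response in zip(fields, fields, fields):
        yield url.decode("utf-8", errors="replace"), unused, response


# Tiny in-memory stand-in for the dump, with two records.
sample = (
    b"www.example.com/\x00\x00HTTP/1.1 200 OK\r\n\r\n<html></html>\x00"
    b"www.example2.com/\x00\x00HTTP/1.1 200 OK\r\n\r\n<html></html>\x00"
)
records = list(iter_records(io.BytesIO(sample), chunk_size=5))
for url, _unused, response in records:
    print(url, len(response))
```

For the real dump you would pass a file opened in binary mode, for example `open(path, "rb")`, instead of the `io.BytesIO` stand-in.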
Sample Dump
The example consists of two URLs:
www.example.com/ HTTP/1.1 200 OK
Date: Sat, 20 Sep 2008 15:43:15 GMT
Server: Apache/2.0.52 (CentOS)
X-Powered-By: PHP/4.3.9
Content-Length: 557
Connection: close
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>I am an example.</title>
</head>
<body>
...
</body>
</html> www.example2.com/ HTTP/1.1 200 OK
Date: Sat, 20 Sep 2008 15:43:15 GMT
Server: Apache/2.0.52 (CentOS)
X-Powered-By: PHP/4.3.9
Content-Length: 557
Connection: close
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>I am a different example.</title>
</head>
<body>
...
</body>
</html>
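Once a record is extracted, the stored HTTP response still has to be taken apart before anything can be indexed. The following is a minimal sketch, assuming the response uses the usual CRLF line endings and a blank line between headers and body; the dump description does not confirm this. `parse_response` and `guess_url` are hypothetical helpers, and prepending `http://` is a guess, since the dump omits the protocol.

```python
from typing import Dict, Tuple


def parse_response(raw: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line looks like: HTTP/1.1 200 OK
    status_code = int(lines[0].split(" ", 2)[1])
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def guess_url(url_without_protocol: str, scheme: str = "http") -> str:
    """Rebuild a full URL; the scheme is an assumption, not stored in the dump."""
    return f"{scheme}://{url_without_protocol}"


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Sat, 20 Sep 2008 15:43:15 GMT\r\n"
    b"Server: Apache/2.0.52 (CentOS)\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<html></html>"
)
status, headers, body = parse_response(raw)
print(guess_url("www.example.com/"), status, headers["content-type"], len(body))
```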
Weblinks
- Dotbot, including a link to their Index (66 GB, torrent)