Search engine crawler lens
This is a lens all about Googlebot, MSNbot and slurp (Yahoo's crawler).
I'll refer to all three as GYM but will often refer to Live Search as MSN/Live Search (whatever) because I think Live Search is a silly name - MSN was fine.
The main aim of this lens is for me to publish my research into search engine crawlers: what they do, how they work, what we don't know, etc.
I might occasionally get on my soapbox about something but I'll try to keep it focused...
If I can find anything interesting about other web crawlers then that'll appear here as well.
Table of Contents
A nice list of what's in this lens :)
- How to add sitemaps on Google, Yahoo and Live Search
- A great post if I don't mind saying so myself....
- Re-directs can hurt your site - in a big way
- Crawl Score RSS
- Search engine activity in pictures
- Interesting web crawler links
- Structure the web and improve the web
- List of crawlers from Google, Yahoo and MSN
- Blog Posts from Google
- Search engines need to tell us more!
- A little poll...
- Cool Google logos from Flickr
- Reader Feedback
- Yahoo and Hadoop
- Live Search Webmaster Tools - a little review
- Search engines support Sitemaps protocol - kind of.....
- Google give us a picture of Googlebot!
- Health!
- That's it - I've had it with Squidoo
How to add sitemaps on Google, Yahoo and Live Search
These instructions should be correct as of March 1st 2008
To submit a sitemap to Google use Google Webmaster Tools (Google Webmaster tools)
To submit to Yahoo! you have to ping them using the following URL :
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://www.yahoo.com/sitemap.xml
(Replace www.yahoo.com with your website address and obviously sitemap.xml with the name of your sitemap)
To update Live Search use their Webmaster tools (Live Webmaster Tools).
Live Search Webmaster Tools is like a very basic version of Google Webmaster Tools.
Utilities such as Crawl Score will generate a sitemap in this format and offer detailed analysis to correct any crawling issues. It's all well and good creating a sitemap but if the site has issues then all you'll be doing is sending incorrect information to the search engine crawlers.
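For anyone who'd rather see what a sitemap actually looks like, here's a minimal sketch in Python that builds one in the Sitemaps.org XML format. It is not how Crawl Score (or any other tool) does it - the function name `build_sitemap` and the example.com URLs are just made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemaps-protocol XML document (as a string) for the given URLs.

    build_sitemap is a hypothetical helper name, not part of any tool.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Example URLs only - swap in your own pages.
    print(build_sitemap(["http://www.example.com/", "http://www.example.com/about"]))
```

Save the output as sitemap.xml in the root of your site and submit it as described above.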
A great post if I don't mind saying so myself....
Semantic search / bots - a point proven
On another note, I'm beginning to lose heart with this lens. I've been adding some fairly unique stuff almost daily and my rank is just getting lower and lower.
I don't know if I'm missing some trick to get my lens above the parapet or if I'm just boring but either way, I'm not impressed with the amount of traffic I'm getting here. Hardly seems worth the trouble.
Re-directs can hurt your site - in a big way
All of the search engine guidelines said there was no problem as long as you used permanent re-directs (HTTP status code of 301)... that wasn't quite the case.
We probably re-directed 200,000 pages or so and Google, in particular, just freaked. All of the new pages went into supplemental (note : supplemental is no longer flagged as such) and we lost almost all of the long tail searches.
Bizarrely the site kept its rank for one main keyword but all of the regional and specific searches went out of the window.
Yahoo! kept using the old pages for about 4 months after the change and MSN only had a few hundred pages indexed anyway so there was no noticeable drop in traffic from either of those.
It's understandable that if Google sees a major change in a site that it should do something to ensure it's not going to lower the quality of their results but to effectively "can" a site for 12 months is a bit extreme - especially when we'd followed their guidelines.
Knowing what we know now, we'd have launched the new site using the old URLs and slowly migrated the old URLs to the new ones, but it would have been better if they'd said "WARNING!!! If you intend to have loads of re-directs then don't do it!".
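To make the "slowly migrate" idea a bit more concrete, here's a rough Python sketch of releasing re-directs in batches rather than all 200,000 at once. It's an illustration only, not what we actually ran: `plan_batches`, `respond` and the batch size are all made-up names and numbers, and there's no evidence any search engine has a "safe" batch size.

```python
def plan_batches(redirect_map, batch_size):
    """Split an old->new URL mapping into ordered batches to release over time.

    Hypothetical helper: the batch size is an assumption you'd have to tune.
    """
    items = sorted(redirect_map.items())
    return [dict(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def respond(path, live_redirects):
    """Return (status, location) for a request, given the re-directs released so far."""
    if path in live_redirects:
        # Permanent re-direct for URLs that have already been migrated.
        return 301, live_redirects[path]
    # Everything else keeps being served from its old URL.
    return 200, path


if __name__ == "__main__":
    batches = plan_batches({"/a": "/x", "/b": "/y", "/c": "/z"}, batch_size=2)
    live = dict(batches[0])          # only the first batch has been released
    print(respond("/a", live))       # migrated URL
    print(respond("/c", live))       # not migrated yet
```

The point is simply that old URLs keep returning normal pages until their batch goes live, so the crawlers never see the whole site change in one go.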
Crawl Score RSS
Search engine activity in pictures
I found a research article last week and I've been digesting the information and trying to make some sensible conclusions from it.
Anyway, from their quite long and detailed piece I've extracted the following points :
1) Yahoo! slurp has periods of discovery, re-crawling and then more discovery
In their example, slurp found 30,000 pages then re-crawled them 3 times before then going on to discover another 30,000.
It's interesting that slurp re-crawled 3 times before trying to find new pages. Anti-spam? Content duplication checking?
2) Googlebot has some apparent randomness built in... not
The chaps that carried out this research were quite thorough and they proved that Googlebot kinda did its own thing in terms of discovering pages.
The site was split into two main sections and Googlebot crawled more on one side than the other. Interestingly, the side it crawled more was the side that had one external link. I don't mean that it crawled more of the pages linked from the page with the external link - that would be obvious - but it seems it gave more "link love" to that whole "side" of the site.
So the external link was to :
www.site.com/right_hand_side/1/1/1/1/1
Which meant that every page from www.site.com/right_hand_side got crawled more than those from www.site.com/left_hand_side. This means there's some sort of URL pattern / directory structure thing going on. More thoughts at a later date (oh, btw, this is pure, off the cuff guesswork here).
3) MSNbot ended up the poor relation yet again - it crawled the fewest pages and managed to find a page that was linked externally and not hook it up to the rest of the site (see below) :

So here's how the 3 search engines discovered the pages on the site :

Yahoo! slurp

Googlebot
Oh, and the link to the original article is here.
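If you want to see whether Googlebot favours one "side" of your own site (point 2 above), you can count crawler hits per top-level directory from your access logs. This is a rough Python sketch that assumes Apache-style "combined" log lines; the regular expression, the `hits_by_section` name and the sample log lines are mine, not from the research article.

```python
import re
from collections import Counter

# Pulls the request path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def hits_by_section(lines, bot_token):
    """Count hits per first path segment for user agents containing bot_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or bot_token.lower() not in match.group("agent").lower():
            continue
        # "/right_hand_side/1/1" -> "right_hand_side"; "/" -> "(root)"
        section = match.group("path").lstrip("/").split("/", 1)[0] or "(root)"
        counts[section] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, just to show the shape of the input.
    agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    sample = [
        f'1.2.3.4 - - [01/Mar/2008:10:00:00 +0000] "GET /right_hand_side/1/1 HTTP/1.1" 200 512 "-" "{agent}"',
        f'1.2.3.4 - - [01/Mar/2008:10:00:05 +0000] "GET /left_hand_side/1 HTTP/1.1" 200 512 "-" "{agent}"',
    ]
    for section, hits in hits_by_section(sample, "Googlebot").most_common():
        print(section, hits)
```

Bear in mind that user agents can be faked, so this only tells you what claims to be Googlebot.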
Interesting web crawler links
From my bookmarks....
- Yahoo! website performance
The article is mainly about how to improve your site for users but the reality is that the tips will certainly help crawlers to crawl your site more effectively.
I will blog about the bits that are relevant to crawlers at a later date.
- Last modified dates
A not very well researched blog post I did some time ago about 304s.
I'll be doing a series of articles on 304s, caching and stuff soon.
- Matt Cutts
Matt Cutts is a Google employee who used to do quite a few interesting posts but I find too many of his posts are "hey, Yahoo! do this and it's cool but Google do THIS and it's waaaaaay cooler!"
- MSN confirm use of ETags
Interesting that Yahoo! say to not bother with ETags but MSN (Live Search, whatever) confirm they use If-Modified-Since and ETags where available.
Let's try and find out how Googlebot handles it...
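Since 304s, If-Modified-Since and ETags keep coming up, here's a small Python sketch of the decision a web server makes when a crawler sends a conditional request. It's a simplification under my own assumptions (exact ETag matching only, no weak "W/" validators, and `last_modified` must be a timezone-aware datetime); `conditional_status` is a made-up name and says nothing about how any particular crawler or server is actually implemented.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the crawler's cached copy is still valid, otherwise 200.

    Assumes last_modified is timezone-aware and compares ETags exactly.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When both headers are sent, If-None-Match takes precedence.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if etag in candidates or "*" in candidates else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: just serve the full page
        return 304 if last_modified <= since else 200

    return 200  # unconditional request


if __name__ == "__main__":
    changed = datetime(2008, 3, 1, 12, 0, tzinfo=timezone.utc)
    headers = {"If-Modified-Since": "Sat, 01 Mar 2008 12:00:00 GMT"}
    print(conditional_status(headers, '"abc"', changed))
```

A 304 means the crawler keeps its cached copy and you save the bandwidth of sending the page again.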
Structure the web and improve the web
Semantic search is a small step in the right direction
Let's say you search on Google for "Derby" - you could mean the Kentucky Derby, Derby hat, Derby in England, hotels in Derby and so on.
So there's a whole lot of people creating search engines for these niches - this is a "good thing". We've looked at it ourselves and have a few test sites out there but the problem we have is that it's great that we have context for our results but what about the results themselves?
If I'm looking for Derby hats, I want to see pictures and prices, if I'm looking for Derby in England, I want to see the local council site, the football club, the university and so on.
So you're looking at totally different results pages for each niche. This means you need a purpose written crawler to get the information you need.
The problem then is (bear with me here) that every website is built differently. So let's go back to the Derby hats example - the crawler needs to be able to identify products, pictures and prices and produce a nicely structured result. This is almost impossible with a generic web crawler. To approach each and every retailer to get a feed would be quite some task so you're left with creating a specific crawler for each site.
Now this isn't a problem in itself - it's not a huge job to get structured content from most websites but there could be 20,000 Derby hats websites - to write a crawler for each site is now a BIG job.
For me, this is why semantic search is always going to be limited. Until websites adhere to standards of some sort (ideally XML or meta tags) then there will always be this problem.
Putting content into meta tags is not that big a deal. Various government websites should adhere to certain standards which mean that data can be derived from a page such as source, author, FOI status and so on. If someone came out with a meta standard for website content then this would really shake things up.
Imagine a whole load of meta data with product title, description, price, pictures - a crawler could simply pull that information down and display nicely formatted results.
Google, Yahoo! and MSN/Live Search (whatever) are ideally placed to put such a standard forward. If Google said "if you put these tags into your pages then we'll rank you on X" then you can imagine thousands of webmasters reacting immediately.
It's no small task but it's do'able (it's a word) - over to you Google - "organize the world's information and make it universally accessible and useful." (one of Google's mission statements)
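To show how simple the crawler side could be if such a standard existed, here's a Python sketch that pulls product details out of meta tags. The "product:" tag names are completely made up for this example (no such standard exists as far as this lens is concerned), as is the `extract_product` helper.

```python
from html.parser import HTMLParser


class ProductMetaParser(HTMLParser):
    """Collect <meta name="product:..." content="..."> tags from a page.

    The "product:" prefix is a hypothetical convention, not a real standard.
    """

    def __init__(self):
        super().__init__()
        self.product = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("product:") and attributes.get("content") is not None:
            # "product:price" -> "price"
            self.product[name[len("product:"):]] = attributes["content"]


def extract_product(html):
    """Return a dict of product fields found in the page's meta tags."""
    parser = ProductMetaParser()
    parser.feed(html)
    return parser.product


if __name__ == "__main__":
    page = """
    <html><head>
      <meta name="product:title" content="Brown Derby hat">
      <meta name="product:price" content="24.99">
    </head><body>...</body></html>
    """
    print(extract_product(page))
```

One generic crawler like this would work on all 20,000 Derby hat sites, instead of needing a purpose written crawler for each one.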
List of crawlers from Google, Yahoo and MSN
A work in progress....
I'll post research on the interesting ones... such as Yahoo Pipes, mindset, etc.
Full agent names are available at other places so I generally won't bother with them unless it's of specific use (EG the Mozilla based agent name that Google use).
- Googlebot
As you probably know, this is Google's web crawler. There was talk around 2003 about freshbot and deepbot using the same user agent and being identified by the source IP.
I can't find any information (yet) that suggests this is still the case.
- MSNBot
Microsoft Live Search (MSN, whatever) crawler. Supports gzip, deflate and conditional GET. This means it sends 'If-Modified-Since' and, where an ETag is available, 'If-None-Match'.
Slurp and Google have supported these for some time (since 2005 and 2006 respectively).
From our experience, MSNBot is not very good at deep crawling a site. It needs numerous external deep links to discover all pages in a site.
- Slurp
Based on Inktomi's web crawler, this is Yahoo's standard web crawler.
In the early days this was known as quite a buggy crawler that regularly got stuck in loops using large amounts of bandwidth.
Now it is less buggy but still does not crawl as deep as Googlebot.
An updated crawler was launched in 2007 that uses less bandwidth and a whole new infrastructure was announced in Feb 2008 that uses Hadoop for the Webmap.
I won't go into Webmaps here as there's lots of other information around but it probably won't actually affect slurp that much - slurp will use a list of URLs from the webmap and update the webmap with its findings.
- Feedfetcher-Google
RSS/Atom crawler for any content that is added to Google Homepage (iGoogle) or Google Reader. It does not index the content in Google Blog search or any other standard Google searches.
Does not support robots.txt and does not follow links - it follows the requests given to it by users of Google's personalized homepage.
- Googlebot-Image
As you'd expect, this is the Google Image crawler. It is also used for Froogle (or Google Product Search to use its current name).
This bot usually visits less regularly than a regular crawler.
- Googlebot/Test
Hasn't been reported on since 2004 but found this post on Webmasterworld (from '04) :
"Hey everybody, I wanted to give you a heads-up about a potential change in our user-agent name. Currently we use the user-agent
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
but we're considering changing our user-agent to something like
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
The primary reason for this is that some web servers assume that unless a user agent is IE, Netscape/Mozilla, or maybe Opera, that your browser won't support JavaScript, frames, etc. As Googlebot gets better over time, it gets closer to a regular user and browser in our ability to handle features like that. This user-agent change would let vanilla webservers assume Googlebot is more like a regular user, while still allowing site owners to recognize that Google is visiting and still providing clear contact info in case of questions or issues. We would still respect "Googlebot" in robots.txt, so the overwhelming majority of webmasters wouldn't have to change anything at all; most site owners probably wouldn't even notice the difference. This bot that a few people noticed was a test crawl with the new user agent. It looks like things went very smoothly, but we'll still be conservative when making this change. If people have any positive or negative comments or questions, just post here. If we don't hear any showstopping objections, we'll look at making this name change on the user-agent when we're convinced everything would be a smooth transition.
thanks,
GoogleGuy "
At that point it wasn't common knowledge that GoogleGuy was Matt Cutts but he is. Anyway, the point is that for a while they tested with another agent name but I've not heard of another Google test user agent since so I guess they just use the standard ones now.
That particular thread went on to talk about Google crawling JavaScript and so on. I'll do a piece on that shortly.
- gsa-crawler
As you'd expect, this is the user agent id for the Google Search Appliance. If you see this in your log files then someone is either testing it or has it incorrectly configured.
The Google Search Appliance/Mini is generally used to power website searches and/or Intranets.
From my experience, it's very similar to Googlebot anyway but isn't really relevant to web crawling so we'll leave that one there.
- Mediapartners-Google (Google AdSense)
You should only see this if you have Google AdSense on your site. This crawler will check the content of your site to send targeted ads.
- Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)
- Yahoo-Newscrawler
- Mozilla/4.0 (compatible; crawlx, crawler@trd.overture.com) - Yahoo Search Marketing
- PostFavorites - Yahoo favourites tracking
Another bot that doesn't have much information about it - more research needed.
- Yahoo Pipes 1.0
As the name suggests, this is from Yahoo! Pipes and is for RSS feeds. It will only crawl every 10 minutes if the pipe is used regularly (so if there's a pipe that is used 10 times in a minute, the pipe crawler will only crawl your feed once in 10 mins).
Appears to use Yahoo's standard caching method of not using ETags but using If-Modified-Since or If-None-Match header.
Strictly speaking this isn't a web crawler so I won't be going into any more detail.
- Yahoo! Mindset
Mindset is a test platform for "intent" based searching. I think "context" is a better word but ho hum. The aim is to put your search query into context (i.e. shopping or research) so the results are more appropriate. I'm guessing that this crawler does some sort of analysis of the site to ascertain whether it's a shopping site or a research site.
You'd have thought they could do that in the index post-crawl...?
Either way, it's not something that is very well documented in terms of the crawler so I'll put this one on the backburner.
- Yahoo-Blogs/v3.9
There's very little information about this so I'll need to do more research but basically it appears, as the name suggests, to be a blog crawler.
- Yahoo-MMAudVid/2.0
Yahoo's video and audio crawler. Supports robots.txt and was a bandwidth hog in the early days (2004).
- Yahoo-Test/4.0
This bot is around in various guises, e.g. :
- Mozilla/5.0 (Yahoo-Test/4.0 mailto:vertical-crawl-support@yahoo-inc.com)
- Mozilla/5.0 (Yahoo-Test/4.0 ysm-keystone-crawl-support@yahoo-inc.com)
- Mozilla/5.0 (Yahoo-Test/4.0; mailto:vertical-crawl-support@yahoo-inc.com)
It looks like it's a purpose written crawler for certain sites - basically semantic search I guess.
I wouldn't let it hoover up all of your bandwidth for the sake of a Yahoo test.
- Yahoo-VerticalCrawler-FormerWebCrawler
A blast from the past. This crawler was originally from Inktomi and used by Alltheweb as well as Yahoo. Rumoured to be part of the quality control of the Yahoo directory - basically checking if your site was still there.
Long gone and almost forgotten.
- BunnySlippers
Microsoft dropped a bit of a clanger here. This crawler was trying to identify their market share of web servers but it was running from Apache! The irony...
Anyway, it's long dead ('01) so you certainly shouldn't see it in your logs.
- MSRBOT
This Microsoft Research crawler seems to be still available (was doing the rounds in 2003) from Microsoft. It's a full web crawler written in C# and is free but is only available from Microsoft by request.
It looks like it reared its head again in 2007 and, when a webmaster emailed them, the word from Microsoft was :
"As part of this study, I am also investigating the existence and evolution of so called "soft-404" pages; that is pages that don't really exist, but whose requests still return a 200 OK result code. In order to collect these types of pages, I am constructing a URL that is unlikely to exist, and making one request per week per host:port pair."
So it looks like they're either using it just as a one off study (and got unlucky in finding a webmaster who checked his log files) or they were doing something on a larger scale. More research to come on this one.
I will email Microsoft and ask for a copy - should be fun.
- Adsbot-Google
You should only see this if you are using AdWords to promote your site. I guess it's called when you use the auto detect keywords function in the AdWords control panel.
Not really a web crawler so I won't be doing any more research on this one for now.
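If you want to spot the crawlers from this list in your own logs, a simple lookup on the user agent string does the job. Here's a Python sketch; the token list is taken from the names above but is far from complete, the example user agent in the usage is only illustrative, and `identify_crawler` is my own made-up helper. Remember that user agents can be faked, so treat a match as a hint rather than proof.

```python
# (token, search engine) pairs, most specific tokens first so that
# "Googlebot-Image" is not reported as plain "Googlebot".
CRAWLER_TOKENS = [
    ("Googlebot-Image", "Google"),
    ("Feedfetcher-Google", "Google"),
    ("Mediapartners-Google", "Google"),
    ("AdsBot-Google", "Google"),
    ("gsa-crawler", "Google"),
    ("Googlebot", "Google"),
    ("Yahoo", "Yahoo"),
    ("Slurp", "Yahoo"),
    ("msnbot", "MSN/Live Search"),
    ("MSRBOT", "MSN/Live Search"),
]


def identify_crawler(user_agent):
    """Return (token, engine) for a known crawler, or None if nothing matches."""
    lowered = user_agent.lower()
    for token, engine in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token, engine
    return None


if __name__ == "__main__":
    agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(identify_crawler(agent))
```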
Blog Posts from Google
Search engines need to tell us more!
I'm fairly sure that once I've been through this exercise of pulling together a half decent overview of everything we know about web crawlers, we'll find there'll be no definitive answer from Google.
Do they use md5 checksums of entire pages, ETags, If-Modified-Since or what? We can probably piece together from various sources how slurp and MSNbot work but we'll probably only have a 5 year old document from Google from when they were at Uni....
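For what it's worth, the md5 checksum idea is easy enough to sketch. This Python snippet shows how a crawler could tell whether a page has changed since its last visit by comparing checksums of the whole page. To be clear, this is guesswork about the general technique - there's nothing to say Google (or anyone else) does it this way, and the function names are made up.

```python
import hashlib


def page_fingerprint(body):
    """Return the md5 checksum (hex string) of a page body given as bytes."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous_fingerprint, body):
    """True if the page body no longer matches the checksum from the last crawl."""
    return page_fingerprint(body) != previous_fingerprint


if __name__ == "__main__":
    last_crawl = page_fingerprint(b"<html><body>Derby hats</body></html>")
    print(has_changed(last_crawl, b"<html><body>Derby hats</body></html>"))
    print(has_changed(last_crawl, b"<html><body>Derby hats - SALE</body></html>"))
```

The obvious weakness is that any tiny change (a date stamp, a rotating ad) changes the checksum of the entire page, which is one reason ETags and If-Modified-Since are attractive.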
A little poll...
This poll isn't about ranking, it's about crawling. Should Google, etc give more detail about how their crawlers work?
Cool Google logos from Flickr
Reader Feedback
As long as it's not spam, I'll approve it - warts and all!
FreeSEOeBooks
Very good lenas about SEO! If you want tyo get more free seo ebooks, enter here: Free Search Engine eBooks Posted March 06, 2008
tdove
Great Read! Any way to improve the web is great with me. Thanks for visiting NicheBOT Keyword Research Review. Posted February 24, 2008
Yahoo and Hadoop
It's taken them years to get it right so you do have to ask how close to the original Hadoop their platform now is.
Hadoop is a Java app and Yahoo! are pretty much C++ to the core so they've created an API to make their proprietary stuff interact with Hadoop.
The Search/Content guy says that there's over 100 factors in the Yahoo! ranking algo now - I honestly didn't think there'd be that many. If I was being cynical I could say that there are 5 factors... all completely random depending on the day of the week ;)
I hope now they can scale that we start to see some better results from Yahoo search. I must admit, I really took a dislike to Yahoo a few years ago but just recently, they're coming out with some good stuff so well done Yahoo!
Live Search Webmaster Tools - a little review
First impressions : nice interface, quick and easy to use.
Validation was straightforward and appears to be pretty safe.
It was good that as soon as I added a site, all the info that they have was displayed.... the problem is that the info shows why Live Search isn't great.
It says that the domain rank for www.crawlscore.com is 5 green blobs. I assume that's the maximum and therefore the equivalent of a Pagerank 10 - that's highly unlikely.
In "top outbound links" it only found two - one was to our company site (Inter Advertising) and the other was to Matt Cutts' blog. There are a lot of other outbound links - why don't they show...?
Top backlinks shows links from the same site (that also has a domain rank of 5 green blobs) and that site is some random, non-related site. We've got links from some pretty well established SEO blogs so why it's chosen this site, I really don't know.
And that was pretty much it. There was nothing about whether the sitemap I gave it was accepted, nothing about content and, generally speaking, it's very light on features.
It's good that they've got forums and stuff but they could really make a difference by offering really good tools for webmasters - they've had a couple of years to create something cool but they've not.
I also see from their blog that they have the most bizarre bugs with their crawling bot - for a bot that's been around for a few years, I find it incredible that, for example, it crawls Google ads.
Search engines support Sitemaps protocol - kind of.....
Interesting use of the word "support" in that Live Search don't use the standard URL for pinging.
Live Search ping URL :
http://webmaster.live.com/ping.aspx?siteMap=[Your sitemap URL]
The standards as per Sitemaps.org :
searchengine_URL/ping?sitemap=sitemap_url
Why do Live Search have to be different? It's a two minute mod, even in IIS, to have /ping resolve to ping.aspx - madness.
Yahoo are also bending the rules a little by having :
http://search.yahooapis.com/SiteExplorerService/V1/
...before the ping. It would be a little tidier to have it on a shorter URL.
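Because each engine has its own ping address, a tiny helper saves some typing. This Python sketch just builds the ping URLs quoted above (correct as far as I know in March 2008 - they may well change or stop responding) and URL-encodes the sitemap address, which is what Sitemaps.org asks for. It doesn't send anything; `ping_url` and the "yahoo"/"live" keys are my own made-up names.

```python
from urllib.parse import quote

# Ping endpoints as quoted in this lens; treat them as examples, not guarantees.
PING_ENDPOINTS = {
    "yahoo": "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=",
    "live": "http://webmaster.live.com/ping.aspx?siteMap=",
}


def ping_url(engine, sitemap_url):
    """Return the full ping URL for an engine, with the sitemap URL encoded."""
    # safe="" makes sure ":" and "/" in the sitemap URL are encoded too.
    return PING_ENDPOINTS[engine] + quote(sitemap_url, safe="")


if __name__ == "__main__":
    for engine in PING_ENDPOINTS:
        print(ping_url(engine, "http://www.example.com/sitemap.xml"))
```

If everyone stuck to searchengine_URL/ping?sitemap=sitemap_url then the dictionary would only need the domain names.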
Google give us a picture of Googlebot!
It was taken from a slightly weird post over at Google Webmaster Central.
The article tells us nothing new which is a shame but they're obviously getting the message out about changing robots.txt regularly (a bad thing) and dodgy robots.txt (a very bad thing).
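Given how much damage a dodgy robots.txt can do, it's worth testing yours before you upload it. Here's a Python sketch using the standard library's robots.txt parser to check whether a given crawler would be allowed to fetch a URL. The `can_crawl` helper and the example rules are made up, and Python's parser won't necessarily interpret every rule exactly the way Googlebot does, so treat it as a sanity check only.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Check a robots.txt (given as text) to see if user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(can_crawl(rules, "Googlebot", "http://www.example.com/private/page.html"))
    print(can_crawl(rules, "Googlebot", "http://www.example.com/public/page.html"))
```

Run it against a handful of your most important URLs every time you touch robots.txt - a stray "Disallow: /" blocks the whole site.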
Health!
Boring I know but your health matters. I've neglected mine for about 35 years but it's never too late to start :)
That's it - I've had it with Squidoo
Bye bye.
Perhaps my topic is too specific or geeky for Squidoo. Dunno but at this point (March 2008) I would suggest you spend your time doing stuff other than Squidoo.
The stats system is pretty poor so I can't see where people have come from. I'm guessing that those getting good traffic are "gaming" the system - that really ain't my style.
I won't totally abandon it - if there's something particularly interesting, I'll post it up but don't bank on it ;)




