Welcome to the Best SEO Blog!


The latest in search engine marketing tactics, the tried and true techniques. Feel free to comment or suggest topics that you would like to know more about.

November 12 2009

Why Some Sites MUST Block Archive.Org

UPDATE: Matt Cutts clarified what he said/meant for me in a Tweet: “if I’m already investigating a site which is spammy-looking and appears off-topic/expired, then IA block is very noticeable.”

I was following the organic site review at PubCon on SE Roundtable this morning when Matt Cutts apparently let slip a jaw-dropping comment (as reported by Barry Schwartz):

Barry Schwartz: This is a huge red flag!!! Matt said, this is the best source of spam leads. You block archive.org in robots.txt file, you are caught in no time, Matt basically said

Okay, first let me remind people that Google owns its index and only Google sets its Webmaster guidelines. But if Matt really believes that blocking Archive.Org is a spam signal, he has MUCH to learn about Webspam.

I DO block Archive.Org on some sites, and many other people do as well. And sometimes there is a compelling reason to do so.
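For anyone who has never looked at what such a block involves, here is a minimal sketch in Python. It assumes the Archive.Org crawler honors rules addressed to the user-agent `ia_archiver` (the name commonly associated with it; check the archive's own documentation before relying on that). The robots.txt contents, the `is_blocked` helper, and the example.com URL are all hypothetical and only there for illustration. The sketch shows that a rule aimed at one robot leaves the search engines' crawlers free to fetch the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the archive crawler, allow everyone else.
# "ia_archiver" is an assumed user-agent name, not something this post confirms.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "http://example.com/page.html") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


archive_blocked = is_blocked(ROBOTS_TXT, "ia_archiver")
google_blocked = is_blocked(ROBOTS_TXT, "Googlebot")

print("ia_archiver blocked:", archive_blocked)
print("Googlebot blocked:", google_blocked)
```

The point of the sketch is simply that blocking one archive is a narrow, deliberate choice and has nothing to do with hiding a site from search engines.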

The intent behind Archive.Org is a good one. I’ve used it to save my sites many times when, after a hard drive failure, I’ve needed to retrieve older copies of live pages to do some work. I love Archive.Org because it’s a great resource for researching how Web sites have behaved through the years.

Unfortunately, Archive.Org violates intellectual property rights on a massive scale, and when site owners become aware of that, they take action by blocking Archive.Org. I see forums do this. I see blogs do this. I see article archives do this. I see news sites do this. And I do this (on some sites, not all).

I’ve tried to explain to people through the years that if you put an image on the Web, or even an article, it’s out there. Nonetheless, some people do try to “keep the content” on their sites (totally unaware that the content is distributed to every computer that visits through a browser). Right or wrong, many people try to prevent Archive.Org from serving their content.

Some sites go dark with their content — that is, they place it behind a subscription wall. Leaving that content in Archive.Org doesn’t help them bind their visitors to the subscription model.

And some sites have rather dark histories. There are times when a new site owner needs to blot out those dark histories from public scrutiny. In the same session Matt let slip that it’s better to use a new domain than a burned domain, but that isn’t always an option. Sometimes a company needs to grab a burned domain to protect a new trademark. That’s reality.

Blotting burned domains from Archive.Org is one way of cleansing their past in the public eye. Google and other search engines need to leave some flexibility in their spam management for those needs (and I’m not assuming they don’t; I’m just saying that SEOs should not throw their hands up in the air and insist it’s a lost cause).

The search engines are not the arbiters of who can brand what, and they don’t have all the power. They can choose to exclude good content from their indexes because of one red flag, but I don’t believe they do that.

And there are other reasons why people block Archive.Org. All of them are legitimate reasons in those people’s eyes. Google and the other major search engines may not view those reasons in quite the same way, but people should not take what Matt Cutts said and start telling Website operators to let Archive.Org into their content.

There are occasional legal issues that also require sites to block Archive.Org. If complying with a new statute or court order means blocking Archive.Org, the SEO community should not burden itself or its clients with angst over what the search engines will do about that. If all a site has done is block Archive.Org, I don’t think Google or any other search engine is going to penalize or ban the site.

It hasn’t happened to any site where I’ve blocked Archive.Org.

You can certainly mask some stealthy activities from other people by blocking Archive.Org, but you won’t mask them from the search engines by doing so. Nonetheless, it should not be assumed that a site has done anything wrong just because it blocks Archive.Org.

Neither Google nor Archive.Org has a right to demand that Websites open up their content to all online archives. Webmasters have the proprietary right to block any robots they wish, and search engineers should not be suggesting or implying that this is morally questionable or unethical behavior.

The SEO community, however, absolutely MUST recognize that there are legitimate, moral, ethical, and legal reasons to block Archive.Org — and we need to SUPPORT those reasons and activities and we MUST encourage our clients to do what is right for their sites.

Written by Michael Martinez