January 11, 2010
The Great Grandma Content Caper
I often find myself investigating suspicious Websites. The corporate world is growing increasingly sensitive about where their trademarks are mentioned, and why.
Yesterday’s mushblogs, which once relied upon Markov-chained gibberish to slip past search algorithms and filters, are now providing much more sophisticated mashup text that often convinces the unwary eye that nothing is wrong.
However, people are growing more suspicious about blogs that randomly mention companies, products, services, and people. They are learning to use Copyscape and other tools to try to find unauthorized duplicates of Web content.
It’s the mushblogs that cause the most headscratching, however. When people bring them to me they are certain they are looking at suspicious content but they cannot put their fingers on why.
The content looks suspicious because it still lacks the ring of human sensibility. The paragraphs do not flow together smoothly. Whereas yesterday’s mushblogs floated randomly disjointed sentences and fragments together (often glued to each other by inappropriate ellipsis marks), today’s mushblogs are mining blog and forum RSS feeds, even microfeeds from sites like Twitter, for coherent comments.
You may unwittingly be cited in a dozen conversations in as many different contexts all because you randomly use some expression that a spammer wants to target.
Black hat services like Syndic8 publish RSS feeds for use by scripts that compile these mushblogs for blog farms. These black hat services are the reason I refuse to publish full feeds from the blogs I control. The black hats will have to scrape my articles manually (some do) or just make do with the summaries.
There is not a great deal you can do to prevent black hats from repurposing your content. You can stop publishing RSS feeds but there are drastic consequences for that. And marketers who count RSS subscriptions loathe the idea of publishing only partial feeds because they lose subscribers that way.
You might consider watermarking your paragraphs, however. One simple way to do this is to turn a random word in each paragraph into a link back to your site. Of course, some people might fear building many links from black hat sites. Another way to watermark paragraphs is to embed your site URL as text somewhere in the paragraph, but that looks ugly.
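Here is a rough sketch of the link-watermark idea in Python. It is only an illustration, not a tool I am recommending: the function name `watermark_paragraph`, the `min_len` cutoff, and the example.com URL are all placeholders I made up, and it assumes plain-text paragraphs that contain no markup yet.

```python
import random


def watermark_paragraph(paragraph, site_url, rng=None, min_len=5):
    """Wrap one randomly chosen word in a link back to site_url.

    Hypothetical helper: assumes `paragraph` is plain text without HTML.
    """
    rng = rng or random.Random()
    words = paragraph.split(" ")
    # Only plain alphabetic words of a reasonable length make sensible anchors.
    candidates = [i for i, w in enumerate(words)
                  if w.isalpha() and len(w) >= min_len]
    if not candidates:
        return paragraph  # nothing suitable to link; leave the text unchanged
    i = rng.choice(candidates)
    words[i] = '<a href="{}">{}</a>'.format(site_url, words[i])
    return " ".join(words)


if __name__ == "__main__":
    text = "Content publishers need to start thinking about scraped content"
    print(watermark_paragraph(text, "http://www.example.com/"))
```

Because the linked word changes from paragraph to paragraph, a scraper cannot strip the watermark by deleting one fixed phrase, though a script that removes every anchor tag will still defeat it.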
Some people have taken to paginating their articles. I’m not sure what the RSS feed looks like for a paginated article but the user experience is probably not very pretty on the subscriber side. Are they doing this to fight the scrapers? I don’t know.
Some of the aggregation scripts strip out links but if you’re embedding links you might add some attributes to mix up the syntax, or change the order of attributes.
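To show what mixing up the syntax might look like, here is a small Python sketch that shuffles attribute order and alternates quote style each time it writes a link. The function name `varied_link` and the `title` and `class` attributes are my own assumptions for the example. Whether any given scraper is actually fooled by this depends entirely on how it strips links.

```python
import html
import random


def varied_link(url, text, rng=None):
    """Build an <a> tag whose attribute order and quote style vary per call.

    Hypothetical helper; the extra attributes exist only to vary the markup.
    """
    rng = rng or random.Random()
    attrs = [("href", url), ("title", text), ("class", "wm")]
    rng.shuffle(attrs)                # change the order of attributes
    quote = rng.choice(['"', "'"])    # alternate double and single quotes
    rendered = " ".join(
        # html.escape with quote=True escapes both quote characters,
        # so either quote style is safe around the value.
        "{}={}{}{}".format(name, quote, html.escape(value, quote=True), quote)
        for name, value in attrs
    )
    return "<a {}>{}</a>".format(rendered, html.escape(text))


if __name__ == "__main__":
    for _ in range(3):
        print(varied_link("http://www.example.com/", "SEO theory"))
```

A scraper that matches links with one rigid pattern (say, `<a href="` followed by the URL) will miss some of these variants, while a scraper that uses a real HTML parser will not care.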
And while these measures offer some protection against totally unabridged use of your new content, they do nothing for older content — which I am increasingly finding in mushblogs. Quite possibly the various anti-scraping tactics have signaled to the black hat community that what they are doing is attracting too much attention.
So now I’m finding articles from 2, 3, even 4 years ago on new blogs. The articles are really snippets pasted together from multiple sources. You may or may not recognize your own work after 4 years if you see an entire article you wrote, but what if you see an article that only includes 1 paragraph from your 5-year-old copy?
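If you keep your own archive, you can at least check a suspicious page for individual paragraphs of yours. The Python sketch below compares overlapping runs of words ("shingles") paragraph by paragraph. This is a generic near-duplicate technique, not a description of how Copyscape or any other service works. The names `shingles` and `reused_paragraphs`, the 8-word window, and the 0.5 threshold are arbitrary choices for illustration.

```python
import re


def shingles(text, k=8):
    """Return the set of all k-word runs in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def reused_paragraphs(my_paragraphs, suspect_text, k=8, threshold=0.5):
    """List (index, overlap) for each of my paragraphs found in suspect_text.

    overlap is the fraction of the paragraph's shingles present in the suspect.
    """
    suspect = shingles(suspect_text, k)
    hits = []
    for index, paragraph in enumerate(my_paragraphs):
        own = shingles(paragraph, k)
        if not own:
            continue  # paragraph is shorter than k words; cannot be scored
        overlap = len(own & suspect) / len(own)
        if overlap >= threshold:
            hits.append((index, overlap))
    return hits


if __name__ == "__main__":
    mine = [
        "You may unwittingly be cited in a dozen conversations in as many "
        "different contexts all because you randomly use some expression.",
        "Some people have taken to paginating their articles and I do not "
        "know whether they are doing this to fight the scrapers.",
    ]
    mashup = "Cheap shoes are on sale this week. " + mine[0] + " Buy hats too."
    print(reused_paragraphs(mine, mashup))
```

Scoring each paragraph separately is the point here: a mashup that lifts one paragraph out of a long article would score very low against the article as a whole.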
This new spam technique now calls into question the value of older Web content. I’ve maintained archives of old articles on many sites. Should I now begin retiring that content before it’s scraped and mingled into repurposed mushblogs? Should we begin advising clients to stop publishing old blog content and feature articles?
Maybe it’s time to start walling off our old content and charging for access to it — a move that is sure to be the kiss of death to many a site’s long-tail chasing SEO content strategy. The news industry is struggling with the reverse of this method — walling off new content and only allowing free access to old content, if ever at all.
Content publishers need to start thinking about how to protect the integrity of their content while assuming that it will be scraped. It’s not a matter of if but when. If you can obtain some sort of branding value from the scraped content, the spammers may be reluctant to continue using your work.
Of course, this would mean reconstructing vast reaches of the archived Web. We would also, regrettably, have to close off some of our more cherished external sources of content recovery, such as archive.org. In order to protect the integrity of copy, textual watermarking may have to become very sophisticated.
For example, you may have to instruct your writers to start embedding variations on “here at best-seo-blog.com” in every paragraph. You may have to look at different ways to space out paragraphs rather than through traditional HTML markup (and give up on using DIVs and SPANs) so as to reduce watermarking text.
There is no doubt that Webspam is evolving at a fast rate. By the time we have developed fully effective techniques against today’s scraping technologies only script-kiddies will be using the mush-paragraph technique anyway. Still, I feel that we need to figure out a way to take some sort of action that will become a useful standard or best practice.
Otherwise, I’ll have a branding advantage over you as an increasing number of rogue Websites randomly mention best-seo-blog.com and seo-theory.com.
Written by Michael Martinez




