Thursday, November 18, 2010

If an article is posted on the Internet and Google doesn't index it, would it have an impact?

I've been following posts about copyright in the last few weeks.

Tim Hall and Brent Ozar both take issue with plagiarists stealing their content.
This is an expansion of a comment I made on a post by Jake at AppsLab.

Firstly, I'm personally not worried about theft of my content. But then I'm not in the league of Tim or Brent. When I get around to it, I'll rejig the blog layout so that each page has this Creative Commons licence.

But the wider issue is a technical one. Much of the internet is about digitizing content, duplicating it and distributing it. The music and movie industries have battled this for a while, and not done too well. But, through the CodingHorror blog, I see that YouTube is applying technical solutions.

Basically, when material is uploaded, they run it through some algorithm to try to determine if it is copyrighted. So I figured, how could this apply to text-based content? The key is Google, who happen to own YouTube. Yes, there is Bing and Blekko and Wolfram Alpha. There was Cuil, which died quietly a few months back. But Google is search.

It would be quite feasible for Google to implement some plagiarism block. As I indicate in the title, if internet content isn't indexed by Google, it comes pretty darn close to not existing in any practical terms.

How I'd envisage it working is this:

I write a blog piece and run it through a Google Signature Recorder. It picks out, or has my assistance in picking out, some key phrases. Then, as they index the internet world, they check for other sites with those key phrases.
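To make the idea a bit more concrete, here is a minimal sketch of what a signature recorder and the later check might look like. Everything in it is my own assumption: the function names are hypothetical, and using overlapping five-word phrases (often called shingles) as the "key phrases" is just one simple way to do it. I have no idea how Google would actually build this.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word phrases found in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def record_signature(text, size=5):
    """Hypothetical 'signature recorder': keep every five-word phrase."""
    return shingles(text, size)


def overlap(signature, candidate_text, size=5):
    """Fraction of the recorded phrases that also appear in another page."""
    if not signature:
        return 0.0
    return len(signature & shingles(candidate_text, size)) / len(signature)


original = ("If an article is posted on the Internet and Google does not "
            "index it would it have an impact")
signature = record_signature(original)

copied = "Stolen post: " + original + " The end."
unrelated = "A completely different piece about cooking pasta with garlic and olive oil"

print(overlap(signature, copied))
print(overlap(signature, unrelated))
```

A high overlap would only flag a page for my attention. Quoting, licensed copies and common phrases would all produce matches too, which is why a person still has to make the call.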

If they find one, they report it to me. I can either mark it as 'OK, licensed, whatever' or I hit the veto button. If I do that, they pull the offending pages from their search index. Worst case, if they find 80% of a site is offending content, they could decide to remove the entire site from the index.
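The veto step and the 80% rule could be sketched like this. Again, the names are hypothetical and the logic is only my guess at how such a rule might be applied: pages I have vetoed are dropped, and a site is dropped entirely once the vetoed share of its indexed pages reaches the threshold.

```python
from collections import defaultdict
from urllib.parse import urlparse

SITE_THRESHOLD = 0.8  # the "80% of a site" figure from the post


def pages_to_drop(indexed_pages, vetoed_pages):
    """Return (pages, sites) to remove from a hypothetical search index.

    indexed_pages: every URL currently in the index.
    vetoed_pages:  the subset of URLs the content owner has vetoed.
    """
    per_site = defaultdict(list)
    for url in indexed_pages:
        per_site[urlparse(url).netloc].append(url)

    drop_pages, drop_sites = set(), set()
    for site, urls in per_site.items():
        bad = [u for u in urls if u in vetoed_pages]
        if len(bad) / len(urls) >= SITE_THRESHOLD:
            # Mostly offending content: remove the whole site.
            drop_sites.add(site)
            drop_pages.update(urls)
        else:
            # Otherwise remove only the vetoed pages.
            drop_pages.update(bad)
    return drop_pages, drop_sites


copycat = [f"http://copycat.example/post{i}" for i in range(10)]
mixed = ["http://mixed.example/x", "http://mixed.example/y"]
vetoed = set(copycat[:9]) | {"http://mixed.example/x"}

pages, sites = pages_to_drop(copycat + mixed, vetoed)
print(sorted(sites))
print(len(pages))
```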

Removing the offending site from their index doesn't hurt the search facility. The original content is still there. They can even do 'magic rewrites' to count pointers and links to offending content as pointers to the original for PageRank purposes.
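The 'magic rewrite' could be as simple as a lookup from each copy to its original before links are tallied. This is a toy sketch under my own assumptions: real PageRank is far more involved than counting inbound links, and the function and URLs here are made up purely for illustration.

```python
def credit_links(inbound_links, copy_to_original):
    """Credit links that point at a known copy to the original page instead.

    inbound_links:    dict of URL -> number of inbound links seen.
    copy_to_original: dict of offending URL -> URL of the original.
    """
    credited = {}
    for url, count in inbound_links.items():
        target = copy_to_original.get(url, url)
        credited[target] = credited.get(target, 0) + count
    return credited


links = {
    "http://original.example/article": 12,
    "http://copycat.example/stolen": 7,
    "http://other.example/page": 3,
}
copies = {"http://copycat.example/stolen": "http://original.example/article"}

print(credit_links(links, copies))
```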

What might their motivation be for this?

With YouTube, where they are hosting the service, it may be part legal necessity. But they host Blogger and Google Docs too. More than that though, this could be a service they could sell. If my content is valuable enough for me to want to protect it, I can afford to pay Google $50 a year to monitor it. The advantage of making this a paid service is that, if there is a challenge (maybe Fred says I've registered his content), there is a financial paper trail to follow.

Ultimately, as YouTube isn't obliged to host your videos, Google isn't obliged to index your content.

3 comments:

Tim... said...

Hi.

Many of the people contributing to the Oracle community have increased opportunities in part because of their web presence. If we allow people full access to copy our material, who's to say that the next Oracle ACE won't be someone who has built up an amazing web presence just from plagiarism?

I understand people's frustration when good material suddenly disappears because the author decides to pull it, but that very rarely happens these days. Some of my articles have been online for 10+ years. In these cases I fail to see the point in copying them when you can just link to them.

The only reason most people copy content is to make out it is their own in the hope that it will enhance their reputation and hopefully earn more money. I see no merit in this.

The music and video situation is different because I don't start thinking you are Rihanna because you post one of her videos on YouTube. The copyright of the artists is still obvious, even when you break the law. :)

Cheers

Tim...

Gary Myers said...

No disagreement here. Especially those who take someone else's content to represent as their own. They are 100% wrong.

The CC license I indicated is an 'attribution and no-derivation' one so only permits copying with my name attached. So in theory I get any reputation (good or bad) that comes with it.

I have no problem with copyright as a concept, that the author has the right to control their content.

There are benefits to having a single definitive version of an article. With that one article there is a single place where changes need to be made, either corrections or enhancements (e.g. bugs, fixes, workarounds and version changes).

I also don't have a problem with people who want their content only on their site for commercial reasons, because they get some people clicking through to buy their services, or simply for ad revenue.

Those features are not important to me personally (as I said, I'm not in your league - I often link to oracle-base content in answer to forum questions) so I'd be happy with that CC license for MY content.

But I would also love to see Google offer a service where they don't include plagiarized content in their index. It should be technically feasible and I think a lot of producers of quality content would pay for such a service.

PS. I would hope whoever reviews ACE applications does some checking for content originality.

office 2007 enterprise key said...
This comment has been removed by a blog administrator.