[insert witty title]


The Wild World of Wiki Wordcounting


by CMB

It is a sad reality of life that my usual routine of 'doing shit on the internet' has been disrupted by 'doing work' and as such the update rate on the blog has dropped.

I thought that today I would share something I have been doing in my spare time for the past couple of days: wordcounting Wikipedia. I put together a little script that takes a list of words, finds the corresponding Wikipedia page for each word and counts its size. Here is an example: the size of the Wikipedia entry for every country with a name starting in the range A-K.



Cool, eh? I have had far too much fun just seeing who the big winners and losers are in the world of the internet. I'll finish off the graph soon, but I'm feeling lazy right now. Here are a few facts I've learnt about the world (A through K inclusive):
  • Argentina has a freakishly long wiki entry, I have no idea why
  • It took me three attempts to spell Kyrgyzstan correctly
  • There are no countries that start with the letter X, and only one each that starts with Q, O and Y
  • The peoples of Brunei do not enjoy writing on Wikipedia
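For anyone who wants to play along at home, here is a rough sketch of what a script like the one described above might look like. It is not the actual script: it assumes that "size" means the number of characters in a page's raw wikitext, and the names `raw_url`, `fetch_wikitext` and `page_sizes` are made up for illustration. The raw text comes from MediaWiki's `index.php?action=raw` endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia, raw wikitext via index.php?action=raw.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def raw_url(title):
    """Build the URL that returns the raw wikitext of a page."""
    query = urlencode({"title": title.replace(" ", "_"), "action": "raw"})
    return BASE_URL + "?" + query


def fetch_wikitext(title):
    """Download the raw wikitext of a page as a string."""
    # A descriptive User-Agent is polite; the value here is a placeholder.
    request = Request(raw_url(title), headers={"User-Agent": "wiki-wordcount-sketch/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def page_sizes(titles, fetch=fetch_wikitext):
    """Map each title to the length (in characters) of its page text.

    `fetch` is injectable so the counting can be tried without a network.
    """
    return {title: len(fetch(title)) for title in titles}
```

Calling `page_sizes(["Argentina", "Brunei"])` would fetch both pages and return their sizes. Counting characters of wikitext is only one possible definition of size; counting words or rendered text would give different numbers.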


What about digging deeper into the data, using new and powerful statistics to let us view the world in a whole new way? The statistic I'm going to introduce now is probably as powerful as the legendary MEPP. It is called the WDQ, or Wiki Drama Quotient. Every page on Wikipedia has an associated 'Talk' page, where beardy internet people can argue about what gets put in the article itself. I would contend that the ratio of the length of a discussion page to the length of its article gives a measure of how much drama a particular term carries with it. For example, let's take a look at the age-old battle between Tea and Coffee:

Coffee, 29027
Talk:Coffee, 64620
Tea, 43991
Talk:Tea, 35143

Although Tea has a longer article, its WDQ is much lower than that of Coffee (0.80 vs. 2.23), suggesting that in the internet world tea is more popular than coffee, but it has a calmer history and there is not much in the way of disagreement about it (although on the talk pages there are a few sharp exchanges about the merits of different shapes of teabag).
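The WDQ arithmetic is simple enough to check by hand, but here is a small sketch of it anyway, using the four sizes quoted above. The function name `wdq` and the `sizes` dictionary are just illustrative choices.

```python
def wdq(article_size, talk_size):
    """Wiki Drama Quotient: talk page length divided by article length."""
    if article_size <= 0:
        raise ValueError("article size must be positive")
    return talk_size / article_size


# Sizes as quoted in the post.
sizes = {
    "Coffee": 29027,
    "Talk:Coffee": 64620,
    "Tea": 43991,
    "Talk:Tea": 35143,
}

for term in ("Coffee", "Tea"):
    quotient = wdq(sizes[term], sizes["Talk:" + term])
    print(f"{term}: WDQ = {quotient:.2f}")
```

A WDQ above 1 means the arguing is longer than the article itself, which is the case for Coffee but not for Tea.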

What about figuring out what the best colour is?



Blue. Obviously. Anybody that disagrees is fooling themselves.

The main point of this blog post is that I need ideas for things to investigate: places? football teams? branches of the physical sciences? Using this stunning technique we can, once and for all, settle a lot of arguments.

P.S. Also, could somebody write a guest post for me? I'll buy you a beer. You can even be anonymous if you wish.




