Welcome back to the final part of the Puff Names data project! If you haven’t read the first two parts, they will give you all the context you need for what I’m doing; you can find them here and here. This entry we’ll finally get to the most interesting bit of the data behind Puff Names: nicknames!
What’s in a name?
We learned last entry that over the 3 years Puff Names has been active, the nickname list has accumulated 3,986 names. That’s a lot of different handles. But given that this is Pottermore, we can expect a certain amount of repetition in the names people choose – that’s why Puff Names was set up, because people all tend to want to call themselves by the same name, and there is also (was also) a limit to what you can get past the moderators on Pottermore. Back when we, you know, actually had comments and stuff. So I copied every single one of those 3,986 names into my trusty Word Frequency Counter to see which words recur most often in Hufflepuff nicknames. And then I made a word cloud from them all! If you haven’t come across a word cloud before, basically the biggest words are the ones that get used most often. So here is the word cloud of Hufflepuff nicknames:
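A side note on the counting step, for anyone curious what a word frequency counter does under the hood. This is only a rough sketch in Python, not the tool I actually used: the function name `word_frequencies` and the sample nicknames are made up for illustration, and I'm assuming that a "word" is just a run of letters, ignoring capitals.

```python
import re
from collections import Counter


def word_frequencies(nicknames):
    """Count how often each word appears across a list of nicknames."""
    counts = Counter()
    for name in nicknames:
        # Lowercase first so "Badger" and "badger" count as the same word,
        # then pull out runs of letters (apostrophes allowed inside a word).
        words = re.findall(r"[a-z]+(?:'[a-z]+)*", name.lower())
        counts.update(words)
    return counts


# Made-up sample names, NOT taken from the real Puff Names list.
sample_nicknames = ["Badger Queen", "Honey Badger", "Loyal Puff", "Puff the Badger"]

frequencies = word_frequencies(sample_nicknames)
top_words = frequencies.most_common(2)
print(top_words)
```

A word cloud is then just a picture of those counts: the higher the count, the bigger the word gets drawn.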
And we’re back! In case you didn’t catch the last instalment, this is Part 2 of the Puff Names data project. Last post I talked about how I used a tool called ScraperWiki to collect data from the Puff Names nickname list, resulting in a ginormous spreadsheet with every name that has ever been accepted to Puff Names. Now that I’ve got that data, I can find out all sorts of fun and interesting things. What are we waiting for?
The way that the Puff Names nickname database works has always made it difficult for me to know exactly how many entries are on Puff Names. This is because every submission to the site creates a new entry in the database, whether or not it’s actually been approved by a moderator. If the submission is rejected or later deleted, it creates a blank entry, which is never overwritten unless someone specifically goes to edit that entry and input information. Since I don’t know which entries are the blank ones, and I’d have to go back and edit them manually if I wanted to fill them (instead of using the very handy checkbox approval system that Emi set up for me), I’m not really going to bother with that. And if you think about how many idiots over the years have submitted twelve nicknames in one go, or how many non-idiots have still submitted more than one nickname, or how many inactive users have been removed from the site, that adds up to a lot of blank entries.
Before I did this post, I checked on how many entries the nickname list was currently up to. I did this by creating a new submission to the site and then rejecting it, which meant adding another blank entry to the database, but it was the only way to check. My submission was the 5,778th entry in the database.
The actual number of nicknames on the site, as per the data I scraped, is 3,986. That means there are a total of 1,792 blank entries in the database. No, they don’t show up on the list, but they’re technically there in the system, just waiting to be filled in.
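If you want to check my sums, here is the same calculation as a tiny Python sketch. The function name `count_blank_entries` is made up for illustration; the two numbers are the ones quoted above.

```python
def count_blank_entries(latest_entry_number, listed_nicknames):
    """Every entry number that is not a listed nickname must be a blank entry."""
    return latest_entry_number - listed_nicknames


# 5,778 is the entry number of my rejected test submission (itself a blank);
# 3,986 is the number of nicknames in the scraped data.
blank_entries = count_blank_entries(5778, 3986)
blank_share = blank_entries / 5778

print(blank_entries)
print(f"{blank_share:.0%} of all entries are blank")
```

In other words, roughly three in every ten entries in the database are blank.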
A couple of months ago I posted on Facebook about coding something called a “data scraper” in class, using Puff Names as the website. So what was I doing, and how did it turn out? Introducing… the Puff Names data project.
What is a data scraper?
In the simplest terms, a data scraper is a computer program that you write to extract data from somewhere. Specifically, the tool that I used to retrieve data from Puff Names is a web scraper, meaning a program that extracts data from web pages. In class, our teacher taught us how to use a tool called ScraperWiki to extract data from websites, using a coding language called Python. We practised first on websites like w4mpjobs (a website listing job opportunities to work for British Members of Parliament) which listed a lot of information, figuring out how to write code that would let us retrieve the specific bits of data we were after, for example the description of a job and its location.
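To give a flavour of what that looks like, here is a minimal sketch of the idea in Python. It is not the code we wrote in class and it doesn't use ScraperWiki: it uses Python's built-in `html.parser` module on a made-up snippet of HTML, so it runs without touching any real website. The class names `job`, `description` and `location`, and the listings themselves, are all invented for illustration; a real page would have its own markup that you'd need to inspect first.

```python
from html.parser import HTMLParser


class JobListingParser(HTMLParser):
    """Collect the description and location text for each job on a page."""

    WANTED_CLASSES = {"description", "location"}

    def __init__(self):
        super().__init__()
        self.jobs = []              # one dict per job listing
        self._current_field = None  # the field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "job":
            # A new listing starts: add an empty record for it.
            self.jobs.append({})
        elif css_class in self.WANTED_CLASSES and self.jobs:
            self._current_field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the fields we want.
        if self._current_field is not None:
            job = self.jobs[-1]
            job[self._current_field] = job.get(self._current_field, "") + data.strip()

    def handle_endtag(self, tag):
        # Simplification: assumes the wanted fields contain no nested tags.
        self._current_field = None


# Hypothetical page content, invented for this example.
SAMPLE_PAGE = """
<div class="job">
  <h2 class="title">Parliamentary Assistant</h2>
  <p class="description">Research and casework support.</p>
  <span class="location">London</span>
</div>
<div class="job">
  <h2 class="title">Constituency Caseworker</h2>
  <p class="description">Handling correspondence from constituents.</p>
  <span class="location">Leeds</span>
</div>
"""

parser = JobListingParser()
parser.feed(SAMPLE_PAGE)
jobs = parser.jobs
for job in jobs:
    print(job["location"], "|", job["description"])
```

Notice that the job titles are skipped entirely: the scraper only picks up the two fields it was told to look for, which is the whole point.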
When you’re working with data as a journalist, one of the biggest tasks is always figuring out how to sift through data to find the important bits that will give you a story (or a lead on a story). It’s often referred to as “cleaning” data, where you go through a big set of data that you’ve scraped or been given and make it look nice, make sure it’s all consistent, and get rid of the bits that you don’t want. But if you can scrape only the data that you need in the first place, obviously you’ve got a lot less to sift through and clean.
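As a small illustration of what “cleaning” can mean in practice, here is a hedged sketch in Python. The function `clean_names` and the sample data are made up, and the rules it applies (trim stray spaces, drop blanks, treat names that differ only in capitals as duplicates) are just one reasonable set of assumptions; what counts as clean depends on the data set.

```python
def clean_names(raw_names):
    """Tidy scraped names: trim whitespace, drop blanks, remove duplicates."""
    cleaned = []
    seen = set()
    for raw in raw_names:
        # Collapse runs of whitespace and trim both ends.
        name = " ".join(raw.split())
        if not name:
            continue  # skip entries that are empty or only whitespace
        key = name.casefold()  # compare names without regard to capitals
        if key in seen:
            continue  # skip a name we have already kept
        seen.add(key)
        cleaned.append(name)
    return cleaned


# Made-up messy input, standing in for freshly scraped data.
raw = ["  Honey Badger ", "", "honey  badger", "Loyal Puff\n", "   "]
cleaned = clean_names(raw)
print(cleaned)
```

The first spelling of each name is the one that gets kept, so the order of the original list is preserved.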
Data scraping Puff Names isn’t about to give me any breaking news stories, but I’ve always wanted to be able to look more closely at the data behind it and analyse it; now I have the tools to do that. (As an added “bonus”, since I started work on this project Pottermore removed all comment functions, which means that no-one is adding new nicknames to the list. So I have a finite data set to work with.) At the end of the class, our teacher gave us time to play around and experiment with the things we’d just learned, and I decided to have a bash at writing a scraper that would pull data from Puff Names. I was mainly interested in scraping the nickname list section of the site, which is basically a huge database and seemed like the perfect thing to try and scrape. It turned out to be both a massive headache and a brilliant learning experience.
I spent part of yesterday afternoon sweeping up the branches and dead leaves in my little bare garden that had accumulated over the winter. At the same time, I was brewing a whole string of potions in my cauldron on Pottermore, and every so often I would go inside to complete one potion and set another one brewing. It got me thinking about hard work, one of the signature traits of Hufflepuff house.
When I was Sorted into Hufflepuff house, I thought I must be one of the laziest badgers out there. I had always dismissed out of hand the idea that I could ever be in Hufflepuff, precisely because I’m so lazy, and as we all know, the defining qualities of Hufflepuff house begin and end at hard work. Right? (Wrong.) I think a lot of people dismiss Hufflepuff because of the reputation we have for “hard work”. Hard work is boring; who wants a house which is all about working? Give us adventure and knowledge and ambition any day, they say.
It’s almost midnight where I am, making this literally an eleventh-hour blog entry, but it’s still Hufflepuff Pride Day for me and a decent portion of the world! To celebrate our badger pride, here are seven videos that will make you immensely proud to have been Sorted into our house. Whether you wanted to be a Puff all along or were shocked and maybe even dismayed at the Sorting Hat’s verdict, these are videos to make your heart swell with love for your fellow Puffs – and for yourself, for belonging with them. There are more videos out there on YouTube than I’ve even included here, but I had to stop somewhere, so I settled on the magical number seven for my list. I encourage you to go out there and find some others, or make your own!
1. J.K. Rowling on her Love of Hufflepuff
This may surprise people, but it is the truth: in many, many ways, Hufflepuff is my favourite house.
What better way to start a list of Puff Pride videos than with the quintessential defence of Hufflepuff by the creator of Harry Potter, J.K. Rowling? Re-watching this video is a tad frustrating at times because J.K. spends half of it trying not to spoil the climax of the seventh book for people when you just want her to get to the point, but if you know enough to fill in the blanks of the scene she’s talking about, it’s powerful stuff. “We should all want to be Hufflepuffs”, ’nuff said.
(Also, for the essence of Rowling’s speech in hilariously plain English, watch ‘In Defence of Hufflepuff’ by UK YouTuber Alex Day)
In a typical display of impeccable timing, Pottermore has released a game-changing alteration to the duelling system eleven days before the awarding of the sixth House Cup.
Duelling on Pottermore has seen any number of tweaks over the past few weeks and months: CAPTCHAs (anti-bot verification tests) have disappeared and then reappeared, the rhythm of spells has fluctuated constantly in a supposed anti-botting measure, and the “duelling wizard”, who is seen in the background graphic as you cast a spell, disappeared from his traditional place at the top of the steps and then reappeared to the right of the hall (apparently levitating inches off the ground). Some of these are very significant changes, especially the disappearance of CAPTCHA, which prompted a popular petition by Pottermore users that may even have been responsible for CAPTCHA’s return. However, the change that greeted users as they entered the Duelling Hall today is the most significant by far, and probably the biggest alteration to duelling on Pottermore since the system was revamped just over a year ago, in August 2013.
Today is 10th September 2014, a.k.a. The Internet Slowdown. What’s that, you ask? In the simplest possible terms, it’s an attempt to wake everyone up to the reality of what the Internet would be like if the biggest ISPs (Internet Service Providers) in the United States got their way. ISPs are companies like Comcast and Verizon in the USA, British Telecom and Sky in the UK, BigPond in Australia, Orange in France… You get the picture. Supposedly, your ISP shouldn’t interfere with the speed at which you access whichever websites you want to access – that would be ridiculous, like if a bus driver decided that they would only drive you to specific locations at a normal speed and drove everywhere else incredibly slowly – unless you paid them an extra bus fare. There would be an uproar – no-one would stand for it. But that’s exactly what the Federal Communications Commission or FCC, an independent agency in America which oversees and regulates communications, is considering allowing ISPs in the United States to do.
CGP Grey, one of the smartest voices on the Internet, has made a great short and simple video on the mechanics of how this might work, and why we need to do everything we can to preserve “Net Neutrality”, the principle of treating all Internet data equally: