Data, Data, Data

I did it. But I still don’t have a plan. Even choosing my name was a “jump off the cliff” moment. I had toyed with the idea of calling this project “I’m Talking R 2, U”. But then one day I realized I didn’t even understand the name of the podcast I was referencing, “U Talkin’ U2 to Me?”. I googled it and discovered that the podcast’s name was itself a reference to a movie, one I had never even seen. Probably not a name that I should go with.
So I went with U2 Love and Logic. It also had the benefit of helping me drive my social media mission statement:
Hello@U2World! Using my love of U2 to teach myself R. No big plans, no big discoveries, just a journey. U2 Love and Logic
So let’s start the journey at the beginning.  I want to teach myself R, and I need data.  

While reading blogs and taking some free intro classes in R, I came across a post on sentiment analysis using Prince lyrics. After reading it and recreating the analysis, I wondered if I could do something similar with U2.
The lyrics were out on the internet, so I could gather the information (and learn a new data extraction skill – web scraping).  And I knew the data pretty well, so that would allow me to focus on learning R.  Logical, right?
Web scraping the lyrics was time-consuming, since the website I used wrapped a fair amount of extra information around the lyrics. Genius.com has lyrics for many bands, including U2, and there is even an R package to help extract them, but I quickly noticed that the lyrics were inaccurate, incomplete, or both. Looking around the internet, I found this was a recurring problem. Even the U2.com website is not edited very well. For example, the song “The Ocean” is missing its last stanza there.
In the end, I was able to get the 14 main albums and 162 songs scraped and into a database.   Wahoo!

A fair amount of data munging was needed to handle contractions and possessive words. I also discovered that the website used multiple fonts, which made the data cleanup take an excruciating amount of time. It is quite frustrating to write code that works, but only on part of the database, because you don’t notice the difference:  ‘s,  ’s.  Argh!
And of course you just assume the problem is your own lack of understanding of the code. The good news about picking R to learn is that there is a great community to help you solve your problems and to reassure you that you aren’t going mad. It was the font after all.
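For anyone hitting the same wall, here is a minimal sketch in base R of the kind of check and fix involved. It is not the code actually used on the lyrics database: the sample lines and the function name `normalize_apostrophes` are made up for illustration, and it assumes the culprits are the curly quote characters U+2018 and U+2019 standing in for the plain apostrophe.

```r
# Hypothetical sample lines. The curly quotes are written as Unicode escapes
# so the difference between the characters is visible in the source.
lyrics <- c(
  "I can't live",           # plain apostrophe (U+0027)
  "I can\u2019t live",      # right single quotation mark (U+2019)
  "love\u2018s in the air"  # left single quotation mark (U+2018)
)

# Characters that look alike on screen have different code points.
plain_code <- utf8ToInt("'")       # 39
curly_code <- utf8ToInt("\u2019")  # 8217

# Replace both curly variants with the plain apostrophe.
# fixed = TRUE treats the pattern as a literal string, not a regex.
normalize_apostrophes <- function(x) {
  x <- gsub("\u2018", "'", x, fixed = TRUE)
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  x
}

clean <- normalize_apostrophes(lyrics)

# With a single apostrophe character left, one rule covers every row.
expanded <- gsub("can't", "cannot", clean, fixed = TRUE)
```

Running the normalization first means every later rule for contractions or possessives only has to match one kind of apostrophe, so it no longer works on just part of the data.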
There are plenty of other ghosts in the machine; for example, I am not sure why “ain’t” became “aingt”. But I do have data that I can use to create meaningful observations and analysis. I can always add more later, but for right now, I have a dataset I can analyze.
U2 Love and Logic

