Classifying News Articles as Tweetworthy using Python and R
Recently I attended a data analytics bootcamp hosted by Northeastern University. My capstone project was sentiment analysis on tweets, which you can read about here. Before attending that bootcamp I had been trying to figure out how to make the Twitter bot I run more selective about the articles it links to. As it turns out, the same technique I used for sentiment analysis works for any general classification problem, provided I have the right training data.
Training Data: Twitter Analytics
Downloading the Data
Luckily, Twitter provides an export link for the analytics it collects. The only caveat is that there’s no programmatic access to this export, and it only covers a month at a time. That just means I had to manually download each month’s data and glue the resulting CSV files together. The clicking I’ll leave as an exercise for the reader, but the gluing step is sketched below.
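A minimal pandas sketch of that gluing step, assuming the monthly exports all live in a tweet_analytics/ folder (the folder and file layout here are my own invention):

```python
import glob

import pandas as pd

# Each monthly export from analytics.twitter.com is its own CSV with an
# identical header row, so concatenating them is straightforward.
monthly_files = sorted(glob.glob("tweet_analytics/*.csv"))
frames = [pd.read_csv(path) for path in monthly_files]

tweets = pd.concat(frames, ignore_index=True)
tweets.to_csv("all_tweets.csv", index=False)
print(tweets.columns.tolist())  # peek at what the export actually contains
```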
So what’s in this file?
If I open up the resulting file in Excel (or TextEdit or whatever) I can see it contains exactly what I need to classify the articles as “tweetworthy”. Namely, it has information about engagements, retweets, and likes for every tweet.
Great! Except I want more than just the tweet and impression information. I want the text of the articles each tweet links to!
Enter Newspaper
What I need is a way to scrape the text of an article without having to worry about sifting through the HTML of any particular news site. And that’s where Newspaper comes in.
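A quick taste of what Newspaper does with a single URL (the URL here is just a placeholder):

```python
from newspaper import Article  # pip install newspaper3k

url = "https://example.com/some-news-story"  # placeholder

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # strip the boilerplate, keep the body text

print(article.title)
print(article.text[:200])  # the first couple hundred characters of clean text
```

No site-specific parsing rules required, which is exactly the point.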
Okay. Well now I just need to run through every tweet in my analytics CSV, download the article text, and save it somewhere! I chose MongoDB here, but you could just as easily save the text to a CSV or whatever other file type you want.
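Roughly what that loop looks like with pymongo, assuming the glued-together CSV from earlier and a MongoDB instance running locally; the database, collection, and column names are my own choices:

```python
import pandas as pd
from newspaper import Article, ArticleException
from pymongo import MongoClient

tweets = pd.read_csv("all_tweets.csv")
collection = MongoClient()["tweetbot"]["articles"]  # db/collection names are arbitrary

for _, row in tweets.iterrows():
    url = row["url"]  # assumes the article link has already been pulled out of each tweet
    try:
        article = Article(url)
        article.download()
        article.parse()
    except ArticleException:
        continue  # skip dead links rather than crash the whole run

    collection.insert_one({
        "tweet_id": row["Tweet id"],  # column name as it appears in Twitter’s export
        "url": url,
        "text": article.text,
    })
```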
The Analysis
Great, now I’ve got the text from roughly 900 articles that my Twitter bot has tweeted out over the last six months or so. Switching over to R, I can use the same functions I used for my sentiment analysis. First, let’s load the data into my R environment.
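Since I stashed everything in MongoDB, a sketch with the mongolite package does the job (the database and collection names match the Python side above; if you saved to a CSV instead, read.csv works just as well):

```r
library(mongolite)

# Pull everything the scraper stored back out as a data frame
con <- mongo(collection = "articles", db = "tweetbot",
             url = "mongodb://localhost")
tweets <- con$find()
str(tweets)  # sanity-check the columns
```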
This gives me a data frame “tweets” whose columns are “url”, “text”, “created_at”, “tweet_id”, “impressions”, “engagements”, “clicks”, and a boolean-valued “engaged_with”. Since my data is relatively sparse (I’ve got a lot of zeroes; my Twitter bot isn’t very popular), I’m going to do a simple binary classification on whether or not the tweet has been engaged with at all. In other words, if someone has clicked on, retweeted, liked, or even just expanded my tweet it gets a 1; otherwise it gets a 0.
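If you’re deriving that label yourself, it’s a one-liner, assuming the engagements column counts every interaction the way Twitter’s export does:

```r
# 1 if the tweet got any engagement at all, 0 otherwise
tweets$engaged_with <- as.integer(tweets$engagements > 0)
table(tweets$engaged_with)  # check how unbalanced the classes are
```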
And from here it’s as easy as training the models on some training data and trying them out on some testing data.
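My sentiment analysis post walks through the full pipeline; as a rough sketch, here’s how the train/test step looks with the RTextTools package (one reasonable choice for document classification in R; the 80/20 split and algorithm choices are my own):

```r
library(RTextTools)

# Shuffle so the train/test split isn’t ordered by date
set.seed(42)
tweets <- tweets[sample(nrow(tweets)), ]

# Build a document-term matrix from the article text
dtm <- create_matrix(tweets$text, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)

# First ~80% of rows for training, the rest for testing
n <- nrow(tweets)
split <- floor(0.8 * n)
container <- create_container(dtm, tweets$engaged_with,
                              trainSize = 1:split,
                              testSize = (split + 1):n,
                              virgin = FALSE)

models <- train_models(container, algorithms = c("SVM", "MAXENT"))
results <- classify_models(container, models)
summary(create_analytics(container, results))  # precision/recall per algorithm
```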
Well, that’s not the best sensitivity ever; it’s barely better than chance! I’ll have to keep playing with the model parameters, and perhaps a larger corpus, if I want this to actually be useful. I suppose you can’t win them all!