
Twitter Analysis using R


One of the things that normally slows me down when I am learning a new tool outside working hours is the lack of an interesting project or task that forces me to dig into the activities required to finish it successfully.

On the other hand, I am always on the lookout for the latest data visualization techniques and trends, so I asked myself what I could do that would improve not only my R programming skills but also my data visualization. Well, the answer is: Twitter.

Twitter holds so much data that the range of analyses anyone can do is unlimited (or limited only by his/her skills). Fortunately R has a package called twitteR that lets you pull out this data with little effort. The functionality of the package ranges from scanning the public timeline to looking at the timeline of a specific user and at that user’s followers and friends. There are some issues with protected accounts and some constraints on the number of requests made per hour, so it was impossible to hammer the server properly.

The aim of this analysis is pretty basic: pull out data from some users along with their followers and friends and then draw some graphics using different packages and hopefully uncover interesting things.

Due to the constraints I mentioned before, I had to select a random sample (honestly, it was random) of 5 users from our #rstats community. The victims are Rarchive, sachaepskamp, peterflom, Altons and CMastication.

Since the package returns data for only one user at a time, and since R functions return only the last object created (or at least that is my current understanding), I had to write 2 functions:

  • TwitterCrawler.r: requires one parameter, the Twitter username. It pulls out the friends and followers of a given user along with their basic information: # of tweets, # of followers, # of friends.
  • TwitterUserStats.r: the same as the above, but it only pulls the information for the user itself.

My first attempt was to build a single function to do all the above and I did it but couldn’t return 2 data frames at the same time  and that’s why I built 2 functions instead.
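For what it is worth, a single function can hand back both data frames if it wraps them in a list. The sketch below only illustrates that pattern: get_user_and_network() and its placeholder contents are hypothetical and do not call Twitter or the twitteR package.

```r
# Hypothetical stand-in for a crawler: it builds two small placeholder
# data frames so that the return pattern is visible.
get_user_and_network <- function(screen_name) {
  user <- data.frame(ScreenName = screen_name, stringsAsFactors = FALSE)
  network <- data.frame(
    ScreenName = screen_name,
    Contact = c("contact_a", "contact_b"),  # placeholder names, not real data
    stringsAsFactors = FALSE
  )
  # A list can carry any number of objects back to the caller.
  list(user = user, network = network)
}

result <- get_user_and_network("Altons")
UserDB <- result$user        # first data frame
CrawlerDB <- result$network  # second data frame
```

The caller then picks each data frame out of the list by name, so one call replaces two.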

Please note that current figures in Twitter may not match this sample as I pulled  this data out a few weeks ago.

For the sake of this post I’ll show the main R code where I call the functions and I’ll give you the link  to my functions further down (if I get my head around Github).

The main program is as follows:
# Required functions
source("C:\\Users\\Altons\\Documents\\R\\Rfunctions\\TwitterCrawler.r")
source("C:\\Users\\Altons\\Documents\\R\\Rfunctions\\TwitterUserStats.r")

# Start with empty results; each pass of the loop appends to them
TwitterUserDB <- NULL
NetworkDB <- NULL

TwitterList <- data.frame(user = c("Rarchive", "sachaepskamp", "peterflom", "Altons", "CMastication"))

for (i in 1:nrow(TwitterList)) {
  # Friends and followers of the current user
  CrawlerDB <- TwitterCrawler(toString(TwitterList[i, 1]))
  # Stats for the user itself
  UserDB <- TwitterUserStats(toString(TwitterList[i, 1]))
  TwitterUserDB <- data.frame(rbind(TwitterUserDB, UserDB))
  NetworkDB <- data.frame(rbind(NetworkDB, CrawlerDB))
}
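The same accumulation can also be written without an index or a growing data frame, by iterating over the usernames directly and binding the pieces once at the end. This is only a sketch: stub_user_stats() is a hypothetical stand-in for TwitterUserStats() that returns placeholder rows instead of calling Twitter.

```r
users <- c("Rarchive", "sachaepskamp", "peterflom", "Altons", "CMastication")

# Hypothetical stub standing in for TwitterUserStats(): one row per user,
# with a missing value where the real function would put a count.
stub_user_stats <- function(screen_name) {
  data.frame(ScreenName = screen_name, TweetCount = NA_integer_,
             stringsAsFactors = FALSE)
}

# Build one data frame per user, then bind them all in a single step.
rows <- lapply(users, stub_user_stats)
TwitterUserDB <- do.call(rbind, rows)
```

Binding once at the end avoids copying the accumulated data frame on every pass, which matters little for 5 users but adds up for larger samples.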

So, as you can see, our main data frames for the analysis are TwitterUserDB and NetworkDB. The first one holds information about our victims whilst the second one holds information about their networks.

Coming from the SAS world, I like to look at the resulting data frame (data set in SAS) once it’s been created, so I find the function fix() quite handy for checking my data frames.

> fix(TwitterUserDB)

Pretty basic, isn’t it? But very handy to check out whether or not the data looks right.

TwitterUserDB has the ScreenName, number of friends, number of followers and the number of tweets for every user.

Let’s do the same for NetworkDB:

> fix(NetworkDB)

NetworkDB contains information about followers and friends for each user. With this information we could find out how active a user’s network is, for example, or analyse the correlation between tweets and followers.

I had an issue with the TwitterBirth variable (the date when the user signed up to Twitter): the function returned the values as text and I couldn’t figure out how to convert them to dates. I hope one of you guys can shed some light on this matter.
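One possible way to do the conversion is to parse the text with a format string, as a reader suggests in the comments. The sketch below assumes the text looks like "Sun, 06 Feb 2011 15:59:00 +0000"; the sample values are made up and the real TwitterBirth strings may need a different format. It uses as.POSIXct() rather than as.POSIXlt() because the result fits more comfortably in a data frame column.

```r
# Hypothetical sample values; the real TwitterBirth text may look different.
birth_text <- c("Sun, 06 Feb 2011 15:59:00 +0000",
                "Mon, 07 Feb 2011 08:45:00 +0000")

# %a and %b expect English day and month abbreviations, so this assumes an
# English (or C) locale. %z reads the numeric UTC offset.
birth <- as.POSIXct(birth_text, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

# Once parsed, the values support date arithmetic, e.g. the account age in
# days relative to an arbitrary reference date.
reference <- as.POSIXct("2011-02-12 00:00:00", tz = "UTC")
account_age_days <- as.numeric(difftime(reference, birth, units = "days"))
```

Any string that does not match the format comes back as NA, so a quick any(is.na(birth)) is a cheap check that the format is right.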

Now let’s move on to the analysis. Here I experimented with 2 graphical packages: the standard one that comes with the main installation (base graphics) and ggplot2.

So let’s do some bar charts:

My first try was the barplot function along with png() to write my graph to an external file. The data frame for the analysis is TwitterUserDB.

# Analyze TwitterUserDB
Tweets <- TwitterUserDB$TweetCount
Tweets.legend <- TwitterUserDB$ScreenName

# Graphout holds the output folder and is defined elsewhere
png(file.path(Graphout, "barchart1.png"))
barplot(Tweets, names.arg = Tweets.legend, cex.names = 0.9, ylab = "# of Tweets", ylim = c(0, 22000))
dev.off()

Well, not bad, but let’s add some colors. Here I found the function colors(), which displays the names of all the colors. That is good, but it would be better if I could actually see the colors.
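A quick way to actually see a handful of named colors is to draw one bar per color and label it with its name. This is just a sketch using base graphics; the swatch layout is my own choice, not something built into R.

```r
mycolors <- c("skyblue2", "red3", "tomato4", "orange3", "yellow3")

# Every name must be known to R, otherwise barplot() would stop with an error.
stopifnot(all(mycolors %in% colors()))

# One equal-height bar per color, labelled with the color name.
barplot(rep(1, length(mycolors)), col = mycolors, names.arg = mycolors,
        axes = FALSE, cex.names = 0.8, main = "Color swatches")

# The matching hex codes, in case they are needed outside R.
hex <- rgb(t(col2rgb(mycolors)), maxColorValue = 255)
```

Swapping mycolors for a slice of colors() shows any other part of the palette the same way.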

# With colors
mycolors <- c("skyblue2", "red3", "tomato4", "orange3", "yellow3")

png(file.path(Graphout, "barchart2.png"))
barplot(Tweets, names.arg = Tweets.legend, cex.names = 0.9, ylab = "# of Tweets", ylim = c(0, 22000), col = mycolors)
dev.off()

Yeah, a bit better and I guess I could do the same for the other variables but these univariate graphics are not providing too much insight apart from a simple count.

Before continuing with this analysis I have to say that these comparisons are quite unfair since  users in my sample did not start at the same point in time, so bear this in mind. Perhaps if I had fixed the TwitterBirth variable I’d have compared apples with apples.

Having said that, let’s crack on! The next graph combines the 3 main variables (Friends, Followers and TweetCount) in order to get a “full” picture of our users.

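library("ggplot2")

png(file.path(Graphout, "3dchart.png"))
# x = Friends, y = Followers, point size = number of tweets
p <- ggplot(TwitterUserDB, aes(Friends, Followers, size = TweetCount, label = ScreenName))
# scale_size(range = ...) replaces scale_area(to = ...) from older ggplot2 versions
p <- p + geom_point(colour = "blue") + scale_size(range = c(1, 20)) + geom_text(size = 3)
# Axis labels follow the aes() mapping above
p + xlab("# Friends") + ylab("# Followers")
dev.off()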

Well, that’s interesting. A simple idea could be that the more you tweet, the more followers you will get. Obviously this is a simplistic thought, but it is interesting to see that Rarchive has pretty much the same number of tweets as peterflom and a lot more than CMastication, yet he is nowhere near them in terms of followers or friends. In fact, Rarchive is closer to me (Altons) and sachaepskamp, who are light users in terms of tweets.

Could we say that in order to get a lot of followers you have to tweet quite often, but the quality of your tweets must be good as well? Well, this could be true for mere mortals but maybe not for presidents, politicians and artists.

Anyway, the idea of this post is just to get used to R as a tool for analytics rather than to uncover hidden patterns in the Twitter #rstats community, so I’d better post this for the time being, as life has its way of getting in the way of plans.

Hope you enjoyed the post. I’ll come back with a new post about NetworkDB as soon as I can.

Regards,

Alberto

 

 

  1. Francois
    February 6, 2011 at 3:59 pm

    Hi Alberto,

    Here are a few remarks which I hope will help you:

    You can return several objects in a function by creating a list containing the objects. For instance :
    return( list (x = 1, y=2) )

    In R, when you create a “for” loop, you do not need to use numbers. You can use a vector containing whatever you want. Here you can directly use the usernames :

    for (s in c("Rarchive", "sachaepskamp", "peterflom", "Altons", "CMastication")) {
    CrawlerDB <- TwitterCrawler(s)

    }

    Finally, when you want to convert an object, generally, you have to use the function as.name_of_the_new_class(). Here, you can use as.Date() or as.POSIXlt() (to have the date and the hour). You need to pass to the function a format argument (type ?strptime in the console to know how to use this argument). In your case, I think that the following should work fine:

    x <- as.POSIXlt(x, format="%a, %d %b %Y %H:%M:%S %z")

    • February 6, 2011 at 8:03 pm

      Francois, fantastic tips!!! You have saved me a few days of surfing the internet to find out how to convert a string to a date object. I need to keep remembering that everything is an object! I’ll definitely update my functions and code in my next post.
      Thanks!

      Alberto

  2. --
    February 6, 2011 at 4:00 pm

    Great post! Are you planning on releasing your functions? Also, I believe there is a Twitter Package for r (twitteR), but I do know that a portion of the functions no longer work properly due to OAuth. I do agree, twitter is a great dataset!

    • February 6, 2011 at 7:59 pm

      Hi, I’ll share my functions as soon as I get my head around github or I’ll just put them in my public folder in dropbox. The twitteR package is the core of this post.
      Regards,
      Alberto

      • July 14, 2011 at 9:04 pm

        Hello. Have you managed to upload your functions to github? I have a very limited knowledge of R and this (your functions) would be amazingly helpful. Besides, I don’t quite understand the object model of this twitteR package and this could clarify things. Of course, I’d also welcome a post on the package if you can’t release your code.
        Thanks!!

      • September 23, 2011 at 10:56 am

        I just saw your message now. I’ll make it public if you still need it.

      • Roberto
        November 24, 2011 at 3:33 am

        I need both functions, can you please post them? Or can you send them to my e-mail?

        Sorry for my basic English, I speak Spanish :S

      • November 25, 2011 at 6:09 pm

        Sure, let me see where I have them and I’ll put them on github.

  3. February 7, 2011 at 8:45 am
  4. Jason
    February 7, 2011 at 1:45 pm

    You can find a list of colors, and their corresponding color at

    http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

    Always a useful thing to have on your pinboard/delicious account!

  5. Roberto Andrés
    December 10, 2011 at 8:33 pm

    Altons :
    Sure, let me see where I have them and I’ll put them on github.

    Thank you very much! :)

  6. George
    January 6, 2012 at 9:39 pm

    Hello, could you put both functions in dropbox or any other public folder please

    • January 6, 2012 at 11:36 pm

      Hi George, those functions no longer work under the new version of twitteR package so I am in the process of refactoring those functions to make a new post and will definitely put them in github. It may be ready for next week (fingers crossed!)

      Rgrds,
      Alberto

      • George
        January 7, 2012 at 10:27 pm

        Thanks Alberto. I’d appreciate it if you could let me know when they are ready, if it’s not too much trouble.

        Thanks again.

        PS: I’m writing to you in Spanish because I see that you commented something in Spanish earlier.

      • January 7, 2012 at 10:34 pm

        Don’t worry, I’ll let you know as soon as it’s ready.

        Regards

        Alberto
