Intentional Homicide in South America 1995-2010

February 9, 2012

Intentional homicide is defined as unlawful death purposefully inflicted on one person by another. The source of this statistic is the United Nations Office on Drugs and Crime (UNODC).

I created the above image using ggplot2, which does 98% of the leg-work in most cases. Count is the number of homicides in a calendar year, and Rate is the homicide rate per 100,000 population.

Then, with a little Adobe Illustrator, the final data visualization looks like this:

 

library(XLConnect)
library(reshape)
library(ggplot2)

# Load the UNODC workbook (file name assumed; adjust to your local file) and read the data sheet
wb <- loadWorkbook("homicides.xlsx")
homicides <- wb["Sheet1"]
colnames(homicides) <- c("country", "stat", seq(from = 1995, to = 2010))

# Reshape to long format, drop missing values (coded as -1), then put Count and Rate back as columns
s <- melt(homicides, id.vars = c("country", "stat"))
s <- subset(s, value != -1)
s2 <- as.data.frame(cast(s, country + variable ~ stat))

midpoint <- mean(range(na.omit(s2$Rate)))

# Dot plot: point size = Count, colour = Rate
p <- ggplot(s2, aes(x = variable, y = country)) + geom_point(aes(size = Count, color = Rate))
p <- p + scale_size(to = c(3, 10)) +
  scale_color_gradient2(low = "darkgreen", mid = "yellow", high = "red", midpoint = midpoint)
p <- p + theme_bw()
p <- p + opts(panel.grid.major = theme_blank(), panel.grid.minor = theme_blank())
p
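Note that scale_size(to = ...), opts() and theme_blank() come from the ggplot2 version current when this was written; on a recent ggplot2 release the equivalent of the last plotting lines should look roughly like this (an untested sketch, the gradient scale is unchanged):

p <- p + scale_size(range = c(3, 10))
p <- p + theme_bw()
p <- p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p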

Suicide vs Divorce rates by country using ggplot

January 10, 2012

I was looking for data I could use with geom_text() in ggplot2 and came across this dataset from the World Health Organization on suicide rates by country, which I found very handy for my example.

I used scale_colour_gradient2() with 3 colors (red, gray and black), but it only picked up gray and black, and I still don't know why. 😦

Anyway, here is the graph:

The number of suicides is per 100,000 people and the number of divorces per 1,000 people. (I know, I should have added this to the graph.)

The size and color of each country label reflect the ratio of female suicides for every 10 male suicides (ratio_f_m, if that makes sense!?). So China has roughly the same number of suicides among women as among men, followed by Kuwait and South Korea. The correlation coefficient was 0.02867, and 0.5507 excluding the Maldives.

As you can see, the Maldives is far away from the cloud, and I bet $10 that the key drivers behind that are the sun, the beaches and really small bikinis, not population size 😉

With a little bit of Photoshop, the above graph looks like this one:

Code and data can be found on GitHub.

UPDATE:

Thanks to Louis, I've managed to show the three colors properly; the fix was to set the midpoint of scale_colour_gradient2() explicitly, and the code below has been updated accordingly.

See comments for further details.

library(XLConnect)
library(ggplot2)

# Read the WHO suicide / divorce data and drop two unused columns
wb <- loadWorkbook('divorce_vs_suicide.xlsx')
df <- wb['Sheet1']
df$Col6 <- NULL
df$Col7 <- NULL

# Country names as text, coloured and sized by the female/male suicide ratio
p <- ggplot(na.omit(df), aes(x = divorce, y = suicide, label = country))
p <- p + geom_text(aes(colour = ratio_f_m, size = ratio_f_m)) +
  scale_colour_gradient2(low = 'red', mid = "gray", high = "black",
                         midpoint = mean(range(na.omit(df$ratio_f_m))))
p <- p + scale_size(to = c(3, 5)) + theme_bw()
p <- p + opts(panel.grid.major = theme_blank(), panel.grid.minor = theme_blank())
p
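The correlation figures quoted above can be checked along these lines; this is a sketch that assumes the same df as in the plot code (country, divorce and suicide columns) and that the country name appears exactly as "Maldives":

dd <- na.omit(df[, c("country", "divorce", "suicide")])
cor(dd$divorce, dd$suicide)  # all countries
cor(dd$divorce[dd$country != "Maldives"], dd$suicide[dd$country != "Maldives"])  # excluding the Maldives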

Presidents on Twitter

January 5, 2012

I saw the release of a new version of the twitteR package a few weeks back and thought I should test the code I wrote some time ago, and do something interesting at the same time. Thus I came up with the idea of checking how Presidents are doing on Twitter.

Not many Presidents are on Twitter yet, so my sample is fairly small. Here is my list:

  • @dilmabr (Brazil)
  • @CFKArgentina (Argentina – I think you could have guessed that from the nickname anyway)
  • @JuanManSantos (Colombia)
  • @chavezcandanga (Venezuela)
  • @sebastianpinera (Chile)
  • @BARACKOBAMA (USA)
  • @Number10gov (UK)

My idea was simply to plot the number of followers by president, with the size of each point given by their number of tweets (statuses).

Looking at the chart is interesting: I'd have put David Cameron in second place, not Hugo Chávez, and trying to answer the "why" of this brings up a lot more questions, like:

  • How long have the Presidents been on Twitter?
  • How do they use it?
  • How do people in each country use Twitter?
  • Does population matter?

Anyway, for those of you interested in the code here it is:

library(twitteR)
library(ggplot2)
base <- NULL
lkuplist <- c('dilmabr','CFKArgentina','JuanManSantos','chavezcandanga','sebastianpinera',
              'BARACKOBAMA','Number10gov')

for(users in lkuplist){
  user <- getUser(users)
  userName <- screenName(user)
  followers <- followersCount(user)
  friends <- friendsCount(user)
  statuses <- statusesCount(user)
  # one-row data frame for this user
  temp <- as.data.frame(cbind(user = userName, followers = followers,
                              friends = friends, statuses = statuses),
                        stringsAsFactors = F)
  # if the merged data frame is still empty, start it; otherwise append to it
  if (is.null(base)) {
    base <- temp
  } else {
    base <- rbind(base, temp)
  }
  rm(temp)

}
# convert char to numbers
base <- transform(base, followers = as.numeric(followers),
                friends = as.numeric(friends),
                statuses = as.numeric(statuses))
# order the user factor by follower count so the chart sorts by followers
base <- transform(base, user = reorder(user, followers))

p <- ggplot(base,aes(x=user,y=followers,color=statuses)) + geom_point(aes(size=statuses))
p <- p +scale_color_gradient() + coord_flip()
p <- p+scale_y_continuous(formatter = "comma",breaks=c(1000000,2500000,5000000,10000000))
p
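As a rough first stab at the "how do they use it" question above, one could look at followers per tweet using the same base data frame; just a quick sketch, nothing definitive:

base$followers_per_tweet <- base$followers / base$statuses
base[order(-base$followers_per_tweet), c("user", "followers", "statuses", "followers_per_tweet")]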
Categories: analytics

Twitter Analysis using R

February 6, 2011

One of the things that normally slows me down when I am learning a new tool outside working hours is the lack of an interesting project or task that forces me to investigate the sort of activities required to finish the project successfully.

On the other hand, I am always on the quest for the latest data visualization techniques and trends, so I thought: what could I do that would not only improve my R programming skills but also help with data visualization? Well, the answer is: Twitter.

Twitter holds so much data that the range of analyses anyone can do is unlimited (or limited only by his/her skills). Fortunately R has a package called twitteR that lets us pull out this data effortlessly. The functionality of the package ranges from scanning the public timeline, to looking at the timeline of a specific user, to looking at the followers and friends of a specific user. There are some issues with protected accounts and some constraints on the number of requests allowed per hour, so it was impossible to hammer the server properly.

The aim of this analysis is pretty basic: pull out data for some users along with their followers and friends, draw some graphics using different packages, and hopefully uncover interesting things.

Due to the constraints I mentioned before, I had to select a random sample (honestly, it was random) of 5 users from our #rstats community. The victims, as you will see in the code below, are Rarchive, sachaepskamp, peterflom, Altons and CMastication.

Since the package returns data for only one user at a time, and since an R function returns only a single object (or at least that is my current understanding), I had to write 2 functions:

  • TwitterCrawler.r: requires one parameter, the Twitter username; the aim of this function is to pull out the friends and followers of a given user with their basic information: # of tweets, # of followers, # of friends.
  • TwitterUserStats.r: the same as the above, but it pulls only the user itself.

My first attempt was to build a single function to do all of the above, and I did, but I couldn't return 2 data frames at the same time, and that's why I built 2 functions instead.
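(A single function could, in principle, hand back both data frames by wrapping them in a list; a minimal sketch, with a hypothetical wrapper name:)

TwitterBoth <- function(username) {
  list(user = TwitterUserStats(username), network = TwitterCrawler(username))
}
# res <- TwitterBoth("Altons"); res$user; res$network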

Please note that current figures on Twitter may not match this sample, as I pulled this data a few weeks ago.

For the sake of this post I'll show the main R code where I call the functions, and I'll give you the link to my functions further down (if I get my head around GitHub).

Main program as follows:

# Required functions
source("C:\\Users\\Altons\\Documents\\R\\Rfunctions\\TwitterCrawler.r")
source("C:\\Users\\Altons\\Documents\\R\\Rfunctions\\TwitterUserStats.r")

TwitterUserDB <- NULL
NetworkDB <- NULL
TwitterList <- data.frame(user = c("Rarchive", "sachaepskamp", "peterflom", "Altons", "CMastication"))

for (i in 1:nrow(TwitterList)) {
  # pull the network and the user-level stats for this account
  CrawlerDB <- TwitterCrawler(toString(TwitterList[i, 1]))
  UserDB <- TwitterUserStats(toString(TwitterList[i, 1]))
  # append to the accumulating data frames
  TwitterUserDB <- data.frame(rbind(TwitterUserDB, UserDB))
  NetworkDB <- data.frame(rbind(NetworkDB, CrawlerDB))
}

So, as you can see, our main data frames of analysis are TwitterUserDB and NetworkDB. The first one holds information about our victims, whilst the second one holds information about their networks.

Coming from the SAS world, I do like to see the resulting data frame (data set in SAS) once it has been created, so I find the fix() function quite handy for checking my data frames.

> fix(TwitterUserDB)

Pretty basic, isn’t it? But very handy to check out whether or not the data looks right.
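(If you prefer a non-interactive check, str() and head() give the same quick sanity test:)

str(TwitterUserDB)
head(TwitterUserDB)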

TwitterUserDB has the ScreenName, number of friends, number of followers and the number of tweets for every user.

Let’s do the same for NetworkDB,

> fix(NetworkDB)

NetworkDB contains information about the followers and friends of each user. With this information we could, for example, measure how active a user's network is, or analyse the correlation between tweets and followers.
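(For instance, assuming NetworkDB carries the same TweetCount and Followers columns as TwitterUserDB, which is an assumption on the column names, the latter correlation is one line:)

cor(NetworkDB$TweetCount, NetworkDB$Followers, use = "complete.obs")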

I had an issue with the TwitterBirth variable (the date the user signed up to Twitter): the function returned it as text and I couldn't figure out how to convert it to a date. I hope one of you guys can shed some light on this matter.
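My best guess so far: if the text comes back in Twitter's usual created_at format, e.g. "Tue Feb 08 19:00:00 +0000 2011" (an assumption, I haven't checked the raw value), something along these lines should convert it:

Sys.setlocale("LC_TIME", "C")  # so the English month/day abbreviations parse
NetworkDB$TwitterBirth <- as.POSIXct(NetworkDB$TwitterBirth, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")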

Now let's move on to the analysis. Here I experimented with 2 graphics systems: the standard one that comes with the base installation (base graphics) and ggplot2.

So let’s do some bar charts:

My first try was the barplot() function, along with png() to output my graph to an external file. The data frame of analysis is TwitterUserDB.

# Analyse TwitterUserDB
Tweets <- TwitterUserDB$TweetCount
Tweets.legend <- TwitterUserDB$ScreenName

# Graphout is an output folder path defined earlier in the script
png(paste(Graphout, "barchart1.png"))
barplot(Tweets, names.arg = Tweets.legend, cex.names = 0.9,
        ylab = "# of Tweets", ylim = c(0, 22000))
dev.off()

Well, not bad, but let's add some colors. Here I found the colors() function for listing all the color names, which is good, but it would be better if I could actually see the colors.
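(One quick way to actually see a handful of those names, just a throwaway sketch:)

swatch <- sample(colors(), 12)
barplot(rep(1, length(swatch)), col = swatch, names.arg = swatch, las = 2, axes = FALSE)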

# with colors
mycolors <- c("skyblue2", "red3", "tomato4", "orange3", "yellow3")

png(paste(Graphout, "barchart2.png"))
barplot(Tweets, names.arg = Tweets.legend, cex.names = 0.9,
        ylab = "# of Tweets", ylim = c(0, 22000), col = mycolors)
dev.off()

Yeah, a bit better, and I guess I could do the same for the other variables, but these univariate graphics don't provide much insight beyond a simple count.

Before continuing with this analysis I have to say that these comparisons are quite unfair, since the users in my sample did not start at the same point in time, so bear this in mind. Perhaps if I had fixed the TwitterBirth variable I could have compared apples with apples.

Having said that, let's crack on! The next graph combines the 3 main variables (Friends, Followers and TweetCount) in order to give a "full" picture of our users.

library(ggplot2)

png(paste(Graphout, "3dchart.png"))
# Friends on x, Followers on y, point size = number of tweets, labelled with the screen name
p <- ggplot(TwitterUserDB, aes(Friends, Followers, size = TweetCount, label = ScreenName))
p <- p + geom_point(colour = "blue") + scale_area(to = c(1, 20)) + geom_text(size = 3)
p + xlab("# Friends") + ylab("# Followers")
dev.off()

Well, that's interesting. A simple idea would be that the more you tweet, the more followers you get. Obviously this is a simplistic thought, but it is interesting to see that Rarchive has pretty much the same number of tweets as peterflom and a lot more than CMastication, yet is nowhere near them in terms of followers or friends. In fact, Rarchive is closer to my own account, Altons, and to sachaepskamp, who are light users in terms of tweets.

Could we say that, in order to get a lot of followers, you have to tweet quite often and the quality of your tweets must be good as well? Well, this could be true for mere mortals, but maybe not for presidents, politicians and artists.

Anyway, the idea of this was just to get used to R as a tool for analytics rather than to uncover hidden patterns in the Twitter #rstats community, so I'd better post this for the time being, as life has its way of getting in the way of plans.

Hope you enjoy the post, and I'll come back with a new post about NetworkDB as soon as I can.

Regards,

Alberto

 

 

Categories: analytics

Hello world!

January 24, 2011

I suppose that "Hello World" is the first thing any blogger should do when starting a blog. So here I go: "HELLO WORLD!!!" The aim of this blog is to gather my thoughts and experience around learning R, and hopefully to get a lot of insights from my readers.

Officially this is my third attempt to learn R, and I must say things are looking up this time. I think on the first 2 occasions the problem was that I tried to learn R in the same way I learnt SAS and SPSS. R is a different piece of software and therefore needs to be learnt in a different way.

This time I am using the R for SAS and SPSS Users book as my bible, and now I am able to read the R short reference card without thinking "how can I translate this Chinese into English?". I'd recommend this book to anybody wanting to learn R, even those of you who don't know SAS or SPSS, as the explanations are very clear and it is only at the end of every chapter that you see a comparison between R and SAS & SPSS.

Anyway, I hope you enjoy my journey, and I definitely hope you can get some valuable ideas from my posts.

 

R-egards

Alberto

Categories: Uncategorized