R and Networks

About R

R is a programming language for statistical computing. R is often viewed from two different perspectives. From the statistician's perspective, R is a powerful and flexible statistics package. From the programmer's perspective, R is a terrible computer language.

I have experience with several statistics packages (SPSS, Stata, SAS, and R), and R suits me the best. Not to start a fight, but I sometimes wonder why people still teach the others. Specifically for network analysis, none of them come close. Here at the LINKS Center we teach UCINET as well, but again, R can do everything that UCINET can do and much more.

As a programming language, R has faults. However, R has been growing tremendously in popularity, even when compared with general-purpose programming languages (not just statistics packages).

Hadley Wickham, a goliath of R packages, believes that much of the time in data analysis is spent thinking about the program rather than actually running the analyses. This is where R shines: it makes the thinking part much easier, at some cost to the speed of the analysis. That said, when I discuss high performance computing later, we will see that R is as fast as or faster than the other available options if speed is the goal.

Getting started

First, download and install R from CRAN (the Comprehensive R Archive Network).

Next, install the open source edition of RStudio from the RStudio download page.

Then open RStudio.

Learning R

If you are coming to R from another stats package, you may want to start with Quick-R, which is set up with SPSS or Stata users in mind.

If you are coming to R as a programming language, you might want to just jump into Hadley Wickham’s Advanced R book.

Getting Help

Probably the most important thing to learn when getting into R is where and how to get help. One of the difficulties with R is that every package has its own syntax. When we start looking at how to do analysis, you will see that the igraph package approaches problems and workflow very differently from the sna package.

Using the built-in help system

All functions have a built-in help entry, but some entries are more helpful than others. To see the help entry for any given function, use the question mark command ?. For instance, if you want to get help on the sum function, you type:

# To get help on the "sum" command
?sum

# Or you can use the help command
help(sum)

# If you aren't sure what the full command is you can search with '??'
??sum

# You can even get help on the help function.
?help

# Or help on the "addition" function
?"+"

Using tab completion in RStudio

I use tab completion a lot in RStudio because it is easy to forget what the different options are. To use tab completion in RStudio, put your cursor inside a function call and hit TAB. Then you can use the arrow keys to select the option you want.

Figure: tab completion for the sum function in RStudio.

This method also works on a data frame or list object to select a column, and it will auto-complete the names of variables or functions. This is especially nice if the names are long.

Reach Out to the Community

The question/answer site StackOverflow has a great community dedicated to answering questions you might have about R. There are even hundreds of questions specifically tagged just for R with the igraph package.

Packages often have their own resources too. For example, igraph has a fantastic introduction and help page just for the R version of the software (igraph is primarily a C library that also has a Python interface). There is also an igraph mailing list that you can search or post questions to. Many packages now host their code on GitHub, which is a great place to submit issues, bugs, or suggestions.

Going to the source

I mean two things when I say “going to the source”: you can usually contact the package author directly, or, if you are feeling daring, you can start peeking into the source code of the package itself.

Packages for Network Analysis

There are dozens of packages for network analysis, and the number grows every year. But I personally only use a few of them frequently. The packages you would most frequently see me use are igraph for all sorts of analysis and graph handling, rgexf for exporting graph files to Gephi, and sometimes sna for functions that igraph is missing (like QAP correlation).
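As a minimal sketch, assuming you want the three packages I just named and have an internet connection, you could install whichever ones are missing and then load igraph. The CRAN mirror URL and the variable names are my choices for illustration; any mirror works.

```r
# Packages mentioned above
pkgs <- c("igraph", "sna", "rgexf")

# Find the ones that are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

# Install only what is missing (the mirror URL is an assumption; pick your own)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load igraph for the current session
library(igraph)
```

Checking before installing keeps the script safe to re-run at the top of an analysis file.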

General Networks: igraph and sna

Generally if you use R to do social network analysis you are going to use one of these two packages. There are benefits to using either of them. But I tend to use igraph for two big reasons.

  1. igraph is much faster. See a short benchmark I did to demonstrate that betweenness and shortest paths are calculated about 5-7x faster. The igraph package is coded in the back end entirely in C, which makes it blazingly fast. It is always preferable to use igraph functions instead of writing your own as much as possible since you will experience a large speed difference.
  2. The igraph objects are compact and consistent. The graph objects for igraph can hold vertex attributes, visual display attributes, edge attributes, and when you filter or change the graph the attributes are preserved. Every function expects an igraph object and it doesn’t matter how you initially formatted the data. The sna package tends to use raw matrices, or is generally more complicated (you can use the network package to add on proper handling of network objects for instance). I usually have a harder time keeping the different elements of my analysis together when using sna.
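To make the second point concrete, here is a small sketch with made-up data (the names, genders, and weights are hypothetical). It builds a graph from two data frames, filters it, and shows that the vertex and edge attributes come along for the ride.

```r
library(igraph)

# Hypothetical toy data: four people and four weighted ties
nodes <- data.frame(
  name   = c("ann", "bob", "cat", "dan"),
  gender = c("f", "m", "f", "m"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from   = c("ann", "ann", "bob", "cat"),
  to     = c("bob", "cat", "dan", "dan"),
  weight = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Extra columns become edge and vertex attributes on the graph object
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Keep only the women; the attributes travel with the subgraph
women <- which(V(g)$gender == "f")
sub <- induced_subgraph(g, women)

V(sub)$name    # vertex names of the filtered graph
V(sub)$gender  # vertex attribute is still attached
E(sub)$weight  # edge attribute is still attached
```

Nothing had to be re-joined after filtering, which is the part I find hard to keep straight when working with raw matrices.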

That being said, the sna package is written more for social network analysis and was written by social scientists, while the igraph package was written by computer scientists (probably why the code is so much better) and is oriented to problems that computer scientists tend to face. The igraph package has lots of functions for community structure and random graph generating models. The sna package has the functions that a social scientist would expect. For instance, the sna package has a QAP correlation function, and it plays better with ergm. The sna package supports an ecology of different packages, including network, ergm, statnet, tergm, networkDynamic, ndtv, etc. However, it’s very easy (one line of code normally) to convert an igraph object into something we can use for QAP, ERGM, or Siena.
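One way that conversion might look is sketched below, assuming igraph, sna, and network are installed. Going through a plain adjacency matrix is just one route I am choosing for illustration, and the two random graphs are stand-ins for real data.

```r
library(igraph)

set.seed(42)
g1 <- sample_gnp(10, 0.3)  # two hypothetical random graphs on 10 nodes
g2 <- sample_gnp(10, 0.3)

# One line each: igraph object -> plain adjacency matrix
m1 <- as_adjacency_matrix(g1, sparse = FALSE)
m2 <- as_adjacency_matrix(g2, sparse = FALSE)

# sna works directly on matrices; gcor() gives the graph correlation
# that a QAP test is built around
r <- sna::gcor(m1, m2, mode = "graph")

# The same matrix can become a network object for ergm and related packages
net <- network::network(m1, directed = FALSE)
```

Calling sna and network through `::` here avoids attaching them alongside igraph.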

If you load both igraph and sna you will have conflicting function names. For instance, both packages have a betweenness function for that centrality measure. To make sure you are using the correct function, you need to use the namespace (double colon) operator, i.e. igraph::betweenness() vs. sna::betweenness().
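A short sketch of the double colon operator in practice, assuming both packages are installed. The five-node ring is only a toy example.

```r
# No library() calls here, so nothing is masked; every call names its package
g   <- igraph::make_ring(5)
adj <- igraph::as_adjacency_matrix(g, sparse = FALSE)

# Same function name, different packages, different input types
bt_igraph <- igraph::betweenness(g)                  # takes an igraph object
bt_sna    <- sna::betweenness(adj, gmode = "graph")  # takes a matrix; "graph" means undirected
```

Writing the prefix every time is more typing, but it removes any doubt about which implementation ran.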

Statistical Modeling: statnet, ergm, siena, & relevent

As a networks researcher you may want to test certain hypotheses, such as: are men more likely to connect with other men than with women? Do the characteristics of a node determine its position? Does the structure of one network predict the structure of a different network? These questions are difficult or impossible to answer with standard statistical techniques, largely due to the violation of independence assumptions. To account for this, researchers use a class of models known as p* models or exponential random graph models (ERGM).
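As a rough sketch of what the first question could look like in code, assuming the ergm package is installed: the network below is random made-up data, so the estimates mean nothing, and a uniform nodematch term is only one of several ways to specify gender homophily.

```r
library(ergm)  # also attaches the network package

set.seed(1)
n <- 30
gender <- rep(c("m", "f"), each = n / 2)  # hypothetical node attribute

# Hypothetical undirected random network stored as a symmetric 0/1 matrix
adj <- matrix(0, n, n)
adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, 1, 0.15)
adj <- adj + t(adj)

net <- network(adj, directed = FALSE)
net %v% "gender" <- gender

# edges: baseline tendency to form ties
# nodematch: extra tendency for ties between nodes of the same gender
fit <- ergm(net ~ edges + nodematch("gender"))
summary(fit)
```

With real data, a positive and significant nodematch coefficient would suggest same-gender ties are more likely than chance.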

There are longitudinal approaches to network modeling as well. What factors impact the development of network structure? You can explore these kinds of questions with Siena models using the RSiena package, or with relational event models using the relevent package.

Visualization: igraphtosonia, ndtv, rgexf, & d3network

These are packages for exporting to other visualization software, making interactive visualizations, or making animations of dynamic networks. SoNIA is software for animating dynamic networks, and igraphtosonia will take a dynamic igraph graph object and export it to a format that SoNIA can use. The ndtv package is an excellent package for the analysis and visualization of dynamic networks and interacts with ERGM models in R.

You can create some very nice images in R, but Gephi is a fantastic platform for network visualization and exploration, and rgexf will export a file that Gephi can use. The GEXF format can also be used by SigmaJS, which is an embedded web-visualization library. Another option for web visualization or dynamically interacting with your network data is the d3network package, which uses the D3js library to create interactivity.

Other: tnet, pii, egonet

There are other packages that you might use for specialized purposes. For instance, tnet has an implementation of a two-mode clustering coefficient. The egonet package has special functions for ego-network analysis. A package I’m authoring, pii, has an implementation of the political independence index.

Tips for Searching for Packages

If you want to see if a function has already been implemented for a problem you have, start with Google and add r, cran, or rstats to the front of your query. For example, search for “cran two mode clustering”, or check out RSeek.org, which is a specialized search engine for R questions.

Next: Getting Data in and Out of R

Next we’ll import some network data, clean and prepare it, do some basic analyses, do some basic visualization, and export the data and results.
