R is a programming language for statistical computing. R is often viewed from two different perspectives. From the statistician perspective R is a powerful and flexible statistics package. From the programmer perspective R is a terrible computer language.
As a statistics package, I have had experience with many (SPSS, Stata, SAS, and R) and R suits me the best. Not to start a fight, but I sometimes wonder why people still teach these programs. Specifically for network analysis, none of these come close. Here at the LINKS Center we teach the use of UCINET as well. But again, R can do everything that UCINET can do and much more.
As a programming language, R has faults. However R has been increasing tremendously in popularity even compared to programming languages (not just statistics pakages).
Hadley Wickham, a goliath of R-packages, believes that much of the time spent in data analysis is spent thinking about the program rather than actually running the analyses. And this is where R shines since it makes the thinking part much easier, while at some cost to the speed of the analysis. Although when I discuss high performance computing later, we will see that R is as fast or faster than other options available if that’s the goal.
First download and install R from here.
Next install Open Source Edition of R-Studio from here.
Then open RStudio.
If you are coming to R from another stats package, you may want to start with Quick-R which is set up with SPSS or Stata users in mind.
If you are coming to R as a programming language, you might want to just jump in to Hadley Wickham’s Advanced R Book.
Probably the most important thing to learn when getting into R is where and how to get help. One of the difficulties about R is that every single package has some different syntax. When we start looking at how to do analysis, there are very different ways of approaching problems and workflow in the
igraph package than you would in the
Using the built-in help system
All functions have a built-in help, but some of these are more helpful than others. To see the help entry for any given function use the question mark command
?. For instance, if you want to get help on the
sum function you type:
# To get help on the "sum" command ?sum ?"sum" # Or you can use the help command help("sum") # If you aren't sure what the full command is you can search with '??' ??sum help.search("sum") # You can even get help on the help function. ?"?" # Or help on the "addition" function ?"+"
Using tab completion in RStudio
I use the tab completion a lot in RStudio because you tend to forget what the different options are. To use tab completion in RStudio, put your cursor inside some function text and hit TAB. Then you can use the arrow keys to select the options you want.
This method also works if you have data frame or list object to select a column, or use it to help you auto-complete the names of variables or function names. Especially nice if the names are long.
Reach Out to the Community
The question/answer site StackOverflow has a great community dedicated to answering questions you might have about R. There are even hundreds of questions specifically tagged just for R with the
igraph has a fantastic introduction and help page just for the R-version of the software (
igraph is a C-package primarily that is also ported to
python). There is also an
igraph mailing list that you can search or post questions to. Many packages now host their code on GitHub, which is a great place to submit issues, bugs, or suggestions.
Going to the source
I mean two things when I say “going to the source” - I mean you can usually contact the package author directly, or if you are feeling daring you can start peeking into the source code of the package itself.
Packages for Network Analysis
There are dozens of packages for network analysis, and the number grows every year. But I personally only use a few of them frequently. The packages you most frequently would see me use are
igraph for all sorts of analysis and graph handeling,
regexf for exporting graph files to Gephi, and sometimes
sna for functions that
igraph is missing (like QAP correlation).
Generally if you use R to do social network analysis you are going to use one of these two packages. There are benefits to using either of them. But I tend to use
igraph for two big reasons.
igraphis much faster. See a short benchmark I did to demonstrate that betweenness and shortest paths are calculated about 5-7x faster. The
igraphpackage is coded in the back end entirely in C, which makes it blazingly fast. It is always preferable to use
igraphfunctions instead of writing your own as much as possible since you will experience a large speed difference.
igraphobjects are compact and consistent. The graph objects for
igraphcan hold vertex attributes, visual display attributes, edge attributes, and when you filter or change the graph the attributes are preserved. Every function expects an igraph object and it doesn’t matter how you initially formatted the data. The
snapackage tends to use raw matrices, or is generally more complicated (you can use the
networkpackage to add on proper handling of network objects for instance). I usually have a harder time keeping the different elements of my analysis together when using
That being said, the
sna package is written more for social network analysis and we written by social scientists while the
igraph package was written by computer scientists (probably why the code is so much better) and is oriented to problems that computer scientists tend to face. The
igraph package has lots of functions for community structure and random graph generating models. There are functions in the
sna package that the social scientist would expect. For instance the
sna package has a QAP correlation function, and it plays better with
sna package supports an ecology of different packages, including network, ergm, statnet, tergm, networkDynamic, ndtv, etc. However, it’s very easy (one line of code normally) to convert an
igraph object into something we can use for QAP, ERGM, or Siena.
If you load both
snayou will have conflicted function names. For instance, both packages have a
betweennessfunction for that centrality measure. To make sure you are using the correct function you need to use the namespace, or double colon operator (i.e.
As a networks researcher you may want to test certain hypotheses such as, are men more likely to connect with other men than women?, do the characteristics of a node determine it’s position?, or does the structure of one network predict the structure of a different network? These questions are difficult or impossible to answer with standard statistical techniques largely do the violation of independence assumptions. To account for this researchers use a class of models known as p* models or exponential random graph models (ERGM).
There are longitudinal approaches to network modeling as well. What factors impact the development of network structure? You can explore these kinds of questions with Siena models using
rsiena models, or Relational Event Models
These are packages for exporting to other visualization software, making interactive visualizations, or making animations of dynamic networks. SoNIA is software for animating dynamic networks, and
igraphtosonia will take a dynamic igraph graph object and export it to a format that SoNIA can use. The
ndtv package is an excellent package for the analysis and visualization of dynamic networks and interacts with ERGM models in R.
You can create some very nice images in R, but Gephi is a fantastic platform for network visualization and exploration and
rgexf will export file that Gephi can use. The GEXF format is can also be used by SigmaJS, which is an embedded web-visualization library. Another option for web-visualization or dynamically interacting with your network data, use the
d3network package which uses the D3js library to create interactivity.
Other packages that you might use for specialized purposes. For instance
tnet has an implementation of a two-mode clustering coeficient. The
egonet package has special functions for ego-network analysis. A package I’m authoring,
pii, has an implementation of the political independence index.
Tips for Searching for Packages
If you want to see if a function has been implemented already for a problem you have, start with Google and append r, cran, or rstats to the front of your query. Search for “cran two mode clustering” or check out RSeek.org which is a specialized search engine for R questions.
Next we’ll import some network data, clean and prepare it, do some basic analyses, do some basic visualization, and export the data and results.