Statistical analysis via cloud computing
Update: Most of this post is now out-of-date. For more recent information, see my post on 3/12/2011 on using RStudio and Amazon Web Services EC2.
Why cloud computing?
For my research project on Antecedents of Network Shape in Online Communities, I’ve developed a simulation of the formation of online communication networks using the R open source statistics package. One challenge I’ve run into, though, is that creating and analyzing simulation data is computationally intensive: it takes a long time to run on my machine.
Traditionally, the easiest solution to this problem would be to buy a new, faster machine just for my research project. No doubt, for $1000 or so, I could get a fast workstation to crank through my simulation work. Unfortunately, I would be stuck with having that machine in only one place (my home office or my Temple office), which doesn’t fit my schedule of splitting research work between two locations. Also, within a few weeks that $1000 machine would turn into an expensive dust collector, since I only need this “extra horsepower” every now and again, certainly not every day. Finally, I’d much rather spend my computing budget exactly when I need it, not all up front on a machine that may become obsolete before I get my money’s worth out of it.
As it turns out, this is exactly the kind of situation where cloud computing is a perfect fit. I have a short-term need for a dramatic increase in computing power (i.e., scalability). I have no interest in buying, configuring, or owning more computers (i.e., I’d rather buy computing services). I’d really like to find a cheaper solution, too, as I don’t want to pay for a computer I’m only going to use for a short while.
How to get started: statistical analysis using R and the Amazon Cloud
I had my first hands-on experience with cloud computing this week using the R open source statistics package via Amazon Elastic Compute Cloud (EC2). There are some great (free) tools available. It took some research to find an appropriate solution, and a fair bit of trial and error to get it working, but I’m now quite happy with the Biocep-R utilities.
First off, I never did get this complete set of tools working with Mac OS X (mainly, the Virtual R Workbench failed with a local R server in Mac OS X). All of the following instructions are for Windows. Happily, once I got it working with Windows I found I could manage the processes via my Mac.
Here are the steps that worked for me to get Biocep-R working on Amazon’s cloud.
Step #0 – Follow steps to get started with Amazon EC2 (from http://biocep-distrib.r-forge.r-project.org/, steps covered later removed). Even if you plan on accessing your cloud computing resources from multiple computers, you only need to do this step once.
Getting Started with Amazon EC2
- Sign up for Amazon EC2
- Learn how to use Elasticfox to connect to your EC2 account, browse available AMIs (Amazon Machine Images), and run AMIs
- A few issues, such as converting keys so that you can ssh into the virtual machine instances, are answered in the EC2 getting started documentation
The key items to configure in Elasticfox are credentials and an Account ID (both at the tool level). Using Elasticfox you’ll also need to set up a KeyPair and a Security Group.
Step #1 – Download and install Biocep-R. For Windows, to get everything in one download use the option “R Workbench with R (2.8.0), with plugins (EC2/S3 monitors + examples) and with extensions (OpenOffice-based file converter)” (it will actually install R 2.9.0).
Step #2 – Confirm the Virtual R Workbench and the Elasticfox plug-in both work on your machine. You can use Virtual R Workbench to connect to a local server to test it out. There’s a “Play Demo” option in the “R-Session” menu that works as a simple test of the environment. You can tell that the Elasticfox plug-in is working if you are able to get a directory of available AMIs.
Step #3 – If you want to avoid learning “on the dime” (that is, while you are paying for EC2 time), use the Biocep R Workbench with a local R server. You can install packages, upload and download files between your local machine and the “R server”, and practice executing commands. Once you’re comfortable with basic usage, you’re ready to go!
Step #4 – Start up an EC2 AMI.
Start the Biocep-R AMI ami-cd5fb9a4: Ubuntu 9.04 Jaunty Jackalope / R version 2.9.0 / Scilab 5.1.0 / Java version 1.6.0
- find ami-cd5fb9a4 (select region “us-east-1”, search by AMI id or with the keyword “biocep”; the AMI manifest is biocep-ubuntu904-r290-j160-sci510-cologne/biocepimage.manifest.xml)
- Create a key pair if you don’t have one already
- Create a security group with one port of your choice open {my_port}: add a permission for a TCP/IP port {my_port} open to the network 0.0.0.0/0
- Run ami-cd5fb9a4, choose your key pair and your security group, and insert the following into the user data field
start=true
port={my_port}
login={my_login}
pwd={my_pwd}
email={my_email}
workers={nbr_workers}
- when the AMI starts running, you receive an email with the URL to use to connect the Workbench to the AMI
These parameters are all worth discussing in more detail. There are other AMIs with R already configured but I don’t think they have the biocep module loaded that supports remote usage via the Virtual R Workbench. Thus, the “biocep” AMI is the one to use for this configuration. When you start it up (“Launch instance(s) of this AMI”), the default Instance Type is “m1.small”. This is the smallest and least expensive option (roughly 8 cents an hour). The only other Instance Type supported by this AMI is the c1.medium type. This is because the image is a 32-bit image and the other Instance Types are 64-bit.
When you launch this AMI it is not immediately obvious that you need to choose your security group (which will appear under “Available Groups”) and choose the arrow button to enable it in the “Launch in” box. If you forget this step the AMI will launch under the default group and you will be unable to connect to your R server.
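If the Workbench cannot connect, one quick thing to rule out is whether the port you opened in your security group is reachable at all. The following is a minimal Python sketch of such a check, not part of Biocep-R or Elasticfox; the function name `port_is_reachable` and the example hostname are my own placeholders.

```python
import socket


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result is consistent with the instance having launched under
    the default security group, but it can also mean the instance is still
    booting or the R server has not started yet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example (hypothetical host name; use the public DNS name of your instance
# and the {my_port} value from your user data):
# port_is_reachable("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 80)
```

A successful check only shows that something accepts connections on that port; it does not verify the login and password.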
From what I can tell, the user values are used as follows: the port is any port number you choose (80 is used in the online example). The login and pwd values are for logging into the R server from the Virtual R Workbench (the online example uses “guest” and “guest”). Whatever email you provide will receive an update when the instance is available for usage. The email will include a really handy URL for automatically logging in to the R server.
I think the number of workers parameter allows you to have multiple “workers” within the R server. I set this as 1 and it worked fine.
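To avoid typos when filling in the user data field, the key=value block from Step #4 can be generated rather than typed by hand. This is a small Python sketch under my own assumptions: the function name `build_user_data` and the validation rules are mine, and the field names and order are simply copied from the block above. The example values ("guest", port 80, one worker) are the ones mentioned in this post; the email address is a placeholder.

```python
def build_user_data(port, login, pwd, email, workers=1):
    """Return the text to paste into the EC2 "user data" field.

    The keys (start, port, login, pwd, email, workers) follow the block
    shown in Step #4; the checks below are assumptions, not documented
    Biocep-R rules.
    """
    port = int(port)
    workers = int(workers)
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    fields = [
        ("start", "true"),
        ("port", str(port)),
        ("login", login),
        ("pwd", pwd),
        ("email", email),
        ("workers", str(workers)),
    ]
    return "\n".join(f"{key}={value}" for key, value in fields)


if __name__ == "__main__":
    # Placeholder email address; substitute your own.
    print(build_user_data(80, "guest", "guest", "me@example.com"))
```

Remember that the port you pass here has to match the port opened in your security group.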
Step #5 – Use the Virtual R Workbench to connect via URL to your R server. Congratulations, you are now ready to use R via Amazon’s EC2. You have embarked on the exciting world of statistical analysis via cloud computing.
Update: My first day of cloud computing cost me less than $2. I estimate I’ll be able to easily handle all of my immediate needs for less than $100 and quite possibly for less than $50.
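For a rough sense of what those budgets buy, here is a back-of-the-envelope Python sketch. It assumes the "roughly 8 cents an hour" m1.small rate quoted earlier and ignores any storage or data transfer charges, so treat the numbers as approximate; the function name `instance_hours` is my own.

```python
CENTS_PER_HOUR = 8  # "roughly 8 cents an hour" for m1.small, as quoted above


def instance_hours(budget_usd, cents_per_hour=CENTS_PER_HOUR):
    """Return how many instance-hours a budget buys at a flat hourly rate."""
    return budget_usd * 100 / cents_per_hour


if __name__ == "__main__":
    for budget in (2, 50, 100, 1000):
        hours = instance_hours(budget)
        print(f"${budget}: about {hours:.0f} instance-hours")
```

At that rate $50 covers about 625 instance-hours and $100 about 1,250, while the $1000 workstation mentioned at the top would correspond to about 12,500 instance-hours. A full 24 hours of a single instance comes to $1.92, which is consistent with a first day costing less than $2.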
