Where Business Meets the Blogroll

research tools

RStudio and Amazon Web Services EC2

There’s a problem dogging statistical researchers all over the Internet…

How can we use RStudio to run R on Amazon Web Services EC2?

After heading down several blind alleys, I can happily report an answer.

1) I am going to start with the assumption that you have used Amazon Web Services EC2 already. Read Running R on AWS if you need an introduction to the basic steps. (Ajay Ohri’s post mentions several AMIs, but none of them work for our needs.)

2) To run RStudio Server, you need an Amazon Machine Image with fairly recent server OS and R versions. This is where I can save you a lot of time and say: as of 12 March 2011, I only found one image that is new enough. It goes by the very catchy name of: akoya-ubuntu-10.04-amd64-server-20101114 (ami-e42cdb8d).

One nice thing about this 64-bit image is that you can run it with the dirt-cheap “t1.micro” option while you are testing your configuration. Then, once it all works, you can pick a more powerful (and more expensive) configuration.

3) After you start your instance, make sure your security group allows TCP traffic on port 8787.
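
If you prefer the command line to the web console, the port rule from step 3 could probably be added with Amazon's current aws command-line tool. Treat this as a sketch only: that tool is not what I used here, and the group name "rstudio" and the address range are placeholder assumptions. The function prints the command instead of running it, so you can review it first.

```shell
# Sketch: build (but do not run) the AWS CLI command that opens one TCP port
# in a security group. The group name and address range are placeholders.
open_port_command() {
  group="$1"   # security group name (hypothetical: "rstudio")
  port="$2"    # TCP port to open (8787 for RStudio Server)
  cidr="$3"    # address range allowed to connect
  printf 'aws ec2 authorize-security-group-ingress --group-name %s --protocol tcp --port %s --cidr %s\n' \
    "$group" "$port" "$cidr"
}

# Show the command for port 8787, limited to a single example address.
open_port_command rstudio 8787 203.0.113.10/32
```

Limiting the rule to your own address is tighter than opening the port to everyone, at the cost of updating it whenever your address changes.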

4) Connect to your instance via SSH.

ssh -i mykeys.pem ubuntu@ec2-50-16-35-73.compute-1.amazonaws.com

Use ubuntu (instead of root) as the user name because this is an Ubuntu instance. Replace the ec2-50-16-35-73.compute-1.amazonaws.com portion with the public DNS name of your EC2 instance.
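
One snag worth guarding against: ssh refuses to use a private key file that other users can read. A small wrapper, sketched here with made-up function names, tightens the key's permissions before connecting.

```shell
# Restrict a private key file to owner read-only, as ssh requires.
secure_key() {
  chmod 400 "$1"
}

# Connect to an Ubuntu EC2 instance: $1 = key file, $2 = public DNS name.
connect_ec2() {
  secure_key "$1" && ssh -i "$1" "ubuntu@$2"
}

# Usage (replace the host with your instance's public DNS name):
# connect_ec2 mykeys.pem ec2-50-16-35-73.compute-1.amazonaws.com
```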

5) After logging in to your Ubuntu instance, there are two additional steps: installing RStudio Server and creating a new user. To install RStudio Server, execute these two commands:

wget https://s3.amazonaws.com/rstudio-server/rstudio-server-0.92.44-amd64.deb
sudo dpkg -i rstudio-server-0.92.44-amd64.deb

To create a new user, type:

sudo adduser rwebuser
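
The two install commands can also be wrapped so the package file name is always taken from the download URL, which avoids a mismatch if you later swap in a different release. This is only a sketch, and the function names are made up for illustration.

```shell
# URL of the RStudio Server package used in this post.
DEB_URL="https://s3.amazonaws.com/rstudio-server/rstudio-server-0.92.44-amd64.deb"

# Print the file name portion of a URL (everything after the last slash).
deb_file() {
  printf '%s\n' "${1##*/}"
}

# Download the package, then install it only if the download succeeded.
install_rstudio() {
  wget "$1" && sudo dpkg -i "$(deb_file "$1")"
}

# Usage, on the EC2 instance:
# install_rstudio "$DEB_URL"
```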

6) Finally, return to a web browser. Enter your instance’s public DNS followed by “:8787”. For example…

http://ec2-50-16-35-73.compute-1.amazonaws.com:8787

If all goes well, RStudio will prompt you for a userid and password. Use the account credentials you created in step 5 (e.g., rwebuser).
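
To confirm the server is answering before you open a browser, something like the following could be run from your own machine. It assumes curl is installed, and the function names are illustrative.

```shell
# Build the RStudio Server address from an instance's public DNS name.
rstudio_url() {
  printf 'http://%s:8787\n' "$1"
}

# Succeed if anything answers at that address within 10 seconds.
rstudio_is_up() {
  curl --silent --output /dev/null --max-time 10 "$(rstudio_url "$1")"
}

# Usage:
# rstudio_is_up ec2-50-16-35-73.compute-1.amazonaws.com && echo "RStudio is reachable"
```

If the check fails, the usual suspects are the security group rule for port 8787 and a typo in the public DNS name.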

Other Notes

  • Back in Nov. 2009, I wrote about using biocep for statistical analysis via cloud computing. The required AMIs are no longer available, and RStudio has a far superior interface.
  • I tried out the AMIs from http://www.cloudbiolinux.com/, but their version of R is too old for RStudio. Also, I was unable to get a Mac NX client working on the first try, so I abandoned that option.
  • R is a tricky topic to Google for. If you haven’t found it already, I highly recommend the R-bloggers website, which is also available as a handy daily digest. That’s a good place to search for help.

Conclusion

Eventually someone will add RStudio to an AMI, eliminating several of these steps. Let me know if you find other AMIs that also support RStudio.

An IS Researcher Web Resource List

Sue Nugus of Academic Conferences recently sent out a request to the ISWORLD mailing list for websites useful to academic research. Here’s the list of websites she reported receiving.

WIKIS
http://www.wikispaces.com/
http://wiki.org/wiki.cgi?WhatIsWiki
http://www.commoncraft.com/video-wikis-plain-english

BLOGS
http://www.guardian.co.uk/technology/2008/mar/09/blogs
http://www.royby.com/research/weblog.php
http://researchblogging.org/post-list/list/date/all
http://community.research.microsoft.com/blogs/
http://googleresearch.blogspot.com/

VIDEO BLOGGING (VLOGGING)
http://www.vlogblog.com/
http://www.youtube.com/user/pijan44
http://en.wikipedia.org/wiki/Video_blogging

REFERENCING TOOLS
http://www.zotero.org/
http://www.adeptscience.co.uk/lp/which_biblio_new.html?DCMP=KNC-AD-UK-bibsoft&c1=GAW_SE_NW&source=UK_BIB&kw=bibliography_software&cr5=3968756454&gclid=CIKkrr_Pkp4CFeZr4wod2QNVrA
http://www.endnote.com/
http://www.biblioscape.com/biblioexpress.htm

TOOLS FOR CITATION ANALYSIS
http://www.harzing.com/
http://www.garfield.library.upenn.edu/essays/V1p527y1962-73.pdf
http://www.slideshare.net/Wowter/citation-analysis-for-research-evaluation
http://epress.lib.uh.edu/pr/v7/n5/hart7n5.html
http://www.hefce.ac.uk/Research/ref/

BIBLIOMETRICS
http://www.leedsmet.ac.uk/research/REF.pdf

WEBPAGE SAVING
http://www.keepoint.com/prodinfo_personal.asp

SOCIAL COMMUNICATING
http://twitter.com/
http://stocktwits.com/

BOOKMARKING
http://www.addthis.com/
http://www.openjason.com/2008/07/01/50-bookmarking-tools/
http://delicious.com/

HARVARD REFERENCING AND ACADEMIC WRITING
http://www.imperial.ac.uk/library/pdf/harvard_referencing.pdf
http://www.writing.utoronto.ca/advice

PROOFREADING AND EDITING
http://www.editavenue.com/main.asp?adstats=32066
http://www.journalexperts.com/?gclid=CKSFru_Hkp4CFZoU4wodQ0Zupg
http://www.editavenue.com/main.asp?adstats=30856
http://www.train4publishing.co.uk/distance/basproof/?gclid=CN69tfXIkp4CFU0A4wodgBSCpA

STATISTICS TRAINING AND USAGE
http://www.statistics.com/
http://www.statsoft.com/textbook/stathome.html
http://www.rapidlearningcenter.com/mathematics/introductory-statistics/introductory-statistics.html

ACADEMIC DATABASES AND SEARCH ENGINES
http://scholar.google.co.uk/
http://en.wikipedia.org/wiki/List_of_academic_databases_and_search_engines
http://www.getcited.org/
http://www.scirus.com/
http://www.sciencedirect.com/

LITERATURE REVIEW
http://www.writing.utoronto.ca/advice/specific-types-of-writing/literature-review
http://www.ais.up.ac.za/med/tnm800/tnmwritingliteraturereviewlie.htm

METHODOLOGY
http://www.methodspace.com/
http://www2.lse.ac.uk/methodologyInstitute/Home.aspx
http://www.qual.auckland.ac.nz/
http://www.socialresearchmethods.net/

OPEN ACCESS PUBLISHING
http://www.scribus.net/
http://www.osalt.com/publisher

ACADEMIC EVENT LISTINGS
http://eventseer.net/
http://www.academic-conferences.org/

DICTIONARY, THESAURUS AND ENCYCLOPEDIAS
http://dictionary.reference.com/
http://thesaurus.reference.com/
http://www.reference.com/
http://www.freetranslation.com/

TELEPHONE AND DATA FILE TRANSFER
http://www.voipdiscount.com/en/index.html
http://www.skype.com/intl/en-gb/

VIDEO CONFERENCING
http://www.megameeting.com/

QUESTIONNAIRE
http://www.surveymonkey.com/
http://www.2ask.net/orbiz/DigiTrade/dbeb1d6e919f9c04f2e8b9a3dc6ca0f2/Home–58n.html
http://www.statpac.com/research-papers/questionnaires.htm

USEFUL VIDEOS
http://videolectures.net/ice08_ktenas_laerc/
http://www.gresham.ac.uk/event.asp?PageId=4&EventId=644
http://academicearth.org/
http://www.ted.com/
http://www.ssrn.com/
http://shc.stanford.edu/intellectual-life/video-podcasts/detail/black-death-personal-history
http://www.iop.harvard.edu/Multimedia-Center/All-Videos/Theodore-H.-White-Lecture-on-Press-and-Politics-by-Taylor-Branch
http://www.lse.ac.uk/collections/informationSystems/newsAndEvents/videoArchive.htm

ACADEMIC CONFERENCES
http://www.academic-conferences.org/

VIVA
http://www.shef.ac.uk/physics/teaching/phy456/viva.pdf
http://www.independent.co.uk/student/postgraduate/how-to-shine-at-your-viva-728656.html
http://www.stars.rdg.ac.uk/viva.html

THEORY EXPLANATION
http://www.tcw.utwente.nl/theorieenoverzicht/Theory%20clusters/
http://changingminds.org/explanations/theories/theories.htm

COMPLETED DISSERTATIONS AND THESES
http://www.lse.ac.uk/collections/informationSystems/PhDProgramme/ISthesesOnline.htm
http://academic-conferences.org/dissertations.htm

A decision tree for charts

When you have data to present, how do you decide what kind of chart or figure to use? For me, it’s often a long process of trial and error, trying to find a visual representation that looks right.

Here’s a helpful decision tree for picking out a chart type to get started with.

Statistical analysis via cloud computing

Update: Most of this post is now out-of-date. For more recent information, see my post on 3/12/2011 on using RStudio and Amazon Web Services EC2.

Why cloud computing?

For my research project on Antecedents of Network Shape in Online Communities, I’ve developed a simulation of the formation of online communication networks using the R open source statistics package. One challenge I’ve run into, though, is that creating and analyzing simulation data is computationally intensive: it takes a long time to run on my machine.

Traditionally, the easiest solution to this problem would be to buy a new, faster machine just for my research project. No doubt, for $1000 or so, I could get a fast workstation to crank through my simulation work. Unfortunately, I would be stuck with having that machine in only one place (my home office or my Temple office), which doesn’t fit well with my schedule of splitting research work between two locations. Also, within a few weeks that $1000 machine would turn into an expensive dust collector, since I only need this “extra horsepower” every now and again, certainly not every day. Finally, I’d much rather spend my computing budget exactly when I need it, not all up front on a machine that may become obsolete before I get my money’s worth out of it.

As it turns out, this is exactly the kind of situation where cloud computing is a perfect fit. I have a short-term need for a dramatic increase in computing power (i.e., scalability). I have no interest in buying, configuring, or owning more computers (i.e., I’d rather buy computing services). I’d really like to find a cheaper solution, too, as I don’t want to pay for a computer I’m only going to use for a short while.

How to get started: statistical analysis using R and the Amazon Cloud

I had my first hands-on experience with cloud computing this week using the R open source statistics package via Amazon Elastic Compute Cloud (EC2). There are some great (free) tools available. It took some research to find an appropriate solution, and a fair bit of trial and error to get it working, but I’m now quite happy with the Biocep-R utilities.

First off, I never did get this complete set of tools working with Mac OS X (mainly, the Virtual R Workbench failed with a local R server in Mac OS X). All of the following instructions are for Windows. Happily, once I got it working with Windows I found I could manage the processes via my Mac.

Here are the steps that worked for me to get Biocep-R working on Amazon’s cloud.

Step #0 – Follow the steps to get started with Amazon EC2 (taken from http://biocep-distrib.r-forge.r-project.org/, with the steps covered later removed). Even if you plan on accessing your cloud computing resources from multiple computers, you only need to do this step once.

Getting Started with Amazon EC2

  • Sign up for Amazon EC2 here
  • Learn how to use Elasticfox to connect to your EC2 account, browse available AMIs (Amazon Machine Images) and run AMIs from here
  • A few issues, such as converting keys so that you can SSH into the virtual machine instances, are answered in the EC2 getting started documentation here

The key items to configure in Elasticfox are credentials and an Account ID (both at the tool level). Using Elasticfox, you’ll also need to set up a KeyPair and a Security Group.

Step #1 – Download and install Biocep-R. For Windows, to get everything in one download, use the option “R Workbench with R (2.8.0), with plugins (EC2/S3 monitors + examples) and with extensions (OpenOffice-based file converter) here” (it will actually install R 2.9.0).

Step #2 – Confirm the Virtual R Workbench and the Elasticfox plug-in both work on your machine. You can use Virtual R Workbench to connect to a local server to test it out. There’s a “Play Demo” option in the “R-Session” menu that works as a simple test of the environment. You can tell the Elasticfox plug-in is working if you are able to get a directory of available AMIs.

Step #3 – If you want to avoid learning “on the dime”, use the Biocep R Workbench with a local R server. You can install packages, upload and download files between your local machine and the “R server”, and practice executing commands. Once you’re comfortable with basic usage, you’re ready to go!

Step #4 – Start up an EC2 AMI.

Start the Biocep-R AMI ami-cd5fb9a4: Ubuntu 9.04 Jaunty Jackalope / R version 2.9.0 / Scilab 5.1.0 / Java version 1.6.0

  • Find ami-cd5fb9a4 (select region “us-east-1”, search with the AMI id or with the keyword “biocep”; the AMI manifest is biocep-ubuntu904-r290-j160-sci510-cologne/biocepimage.manifest.xml)
  • Create a key pair if you don’t have one already
  • Create a security group with one port of your choice open {my_port}: add a permission for a TCP/IP port {my_port} open to the network 0.0.0.0/0
  • Run ami-cd5fb9a4, choose your key pair and your security group, and insert the following into the user data field:
    start=true
    port={my_port}
    login={my_login}
    pwd={my_pwd}
    email={my_email}
    workers={nbr_workers}
  • when the ami starts running, you receive an email with the URL to use to connect the Workbench to the ami

These parameters are all worth discussing in more detail. There are other AMIs with R already configured but I don’t think they have the biocep module loaded that supports remote usage via the Virtual R Workbench. Thus, the “biocep” AMI is the one to use for this configuration. When you start it up (“Launch instance(s) of this AMI”), the default Instance Type is “m1.small”. This is the smallest and least expensive option (roughly 8 cents an hour). The only other Instance Type supported by this AMI is the c1.medium type. This is because the image is a 32-bit image and the other Instance Types are 64-bit.

When you launch this AMI it is not immediately obvious that you need to choose your security group (which will appear under “Available Groups”) and choose the arrow button to enable it in the “Launch in” box. If you forget this step the AMI will launch under the default group and you will be unable to connect to your R server.

From what I can tell, the user values are used as follows: the port is any port number you choose (80 is used in the online example). The login and pwd values are for logging into the R server from the Virtual R Workbench (the online example uses “guest” and “guest”). Whatever email you provide will receive an update when the instance is available for usage. The email will include a really handy URL for automatically logging in to the R server.

I think the number of workers parameter allows you to have multiple “workers” within the R server. I set this as 1 and it worked fine.
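
Putting these parameters together, the user data field can be generated rather than typed by hand. The sketch below just prints the six lines in the format listed above; the function name and the sample values (port 80 and guest/guest from the online example, a placeholder email address) are illustrative.

```shell
# Print the user data block for the Biocep-R AMI.
# Arguments: port, login, password, email, number of workers.
biocep_user_data() {
  printf 'start=true\nport=%s\nlogin=%s\npwd=%s\nemail=%s\nworkers=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example values: port 80 and guest/guest as in the online example,
# a placeholder email address, and a single worker.
biocep_user_data 80 guest guest me@example.com 1
```

Whatever port you pass here has to match the port opened in your security group.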

Step #5 – Use the Virtual R Workbench to connect via URL to your R server. Congratulations, you are now ready to use R via Amazon’s EC2. You have entered the exciting world of statistical analysis via cloud computing.

Update: My first day of cloud computing cost me less than $2. I estimate I’ll be able to easily handle all of my immediate needs for less than $100 and quite possibly for less than $50.
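
For a rough sense of what those budgets buy, divide the budget by the hourly rate. The sketch below assumes the roughly 8 cents an hour quoted above for m1.small and ignores any storage or data transfer charges.

```shell
# Whole instance-hours a budget buys.
# $1 = budget in dollars, $2 = hourly rate in cents.
hours_for_budget() {
  echo $(( $1 * 100 / $2 ))
}

hours_for_budget 50 8    # prints 625
hours_for_budget 100 8   # prints 1250
```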

Directory of Social Networking Sites

Via Friends: Social Networking Sites for Engaged Library Services, I ran across this fascinating list of web-based Social Networks. If you’re looking for a directory of social networking sites, it is the most comprehensive one I’ve seen.

Adding up the estimated membership for the 246 sites, the total comes to over 1 billion members.




