This task view contains information about using R to obtain and parse data from the web. The base version of R does not ship with many tools for interacting with the web. Thankfully, a growing number of packages fill this gap. A list of available packages and functions is presented below, grouped by the type of activity.
If you have any comments or suggestions for additions or improvements to this task view, go to GitHub and submit an issue, or make some changes and submit a pull request. If you can't contribute on GitHub, send Scott an email. If you have an issue with one of the packages discussed below, please contact the maintainer of that package.
Tools for Working with the Web from R
Parsing Data from the Web
-
txt, csv, etc.: you can use
read.csv()
after acquiring the csv file from the web via e.g.,
getURL()
from RCurl.
read.csv()
works with http but not https, i.e.:
read.csv("http://..."), but not
read.csv("https://...").
-
The
repmis
package contains a
source_data()
command to load plain-text data from a URL (either http or https).
-
The package
XML
contains functions for parsing XML and HTML, and supports XPath for searching XML (think regex for strings). A helpful function to read data from one or more HTML tables is
readHTMLTable().
-
XML2R: The XML2R package is a collection of convenient functions for coercing XML into data frames. The development version is on GitHub
here
.
-
An alternative to
XML
is
selectr
, which parses CSS3 Selectors and translates them to XPath 1.0 expressions.
The
XML
package is often used for parsing XML and HTML, but
selectr
translates CSS selectors to XPath, so you can write CSS selectors instead of XPath. The
selectorgadget browser extension
can be used to identify page elements.
-
The
rjson
package converts R objects into JavaScript Object Notation (JSON) objects and vice versa.
-
An alternative to
rjson
is
RJSONIO
which also converts to and from data in JSON format (it is fast for parsing).
-
An alternative to
rjson
and
RJSONIO
is
jsonlite, a fork of
RJSONIO. It includes the parser from RJSONIO, but implements a different mapping between R objects and JSON strings.
-
Custom formats: Some web APIs provide custom data formats, which are usually modified XML or JSON and are handled by
XML
and
rjson
or
RJSONIO, respectively.
-
The
RHTMLForms
package reads an HTML document and returns a description of each form it contains, along with the form's elements and hidden fields.
-
scrapeR
provides additional tools for scraping data from HTML and XML documents.
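The parsing routes listed above can be sketched with small inline strings standing in for content that has already been fetched from the web (for example with getURL()). The sample data and variable names are made up for illustration, and the jsonlite and XML steps run only if those packages are installed.

```r
# Inline strings stand in for response bodies fetched from the web;
# no network call is made in this sketch.
csv_text  <- "id,name,value\n1,alpha,3.5\n2,beta,4.25\n"
json_text <- '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>id</th><th>name</th></tr>",
  "<tr><td>1</td><td>alpha</td></tr>",
  "<tr><td>2</td><td>beta</td></tr>",
  "</table></body></html>"
)

# CSV: read.csv() parses already-downloaded text through its `text` argument.
csv_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(csv_df)

# JSON: jsonlite converts the JSON text into an R object.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_obj <- jsonlite::fromJSON(json_text)
  print(json_obj)
}

# HTML: parse the document, then pull cells with XPath or read whole tables.
if (requireNamespace("XML", quietly = TRUE)) {
  doc    <- XML::htmlParse(html_text, asText = TRUE)
  cells  <- XML::xpathSApply(doc, "//td", XML::xmlValue)
  tables <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
  print(cells)
  print(tables[[1]])
}
```

Wrapping the optional steps in requireNamespace() keeps the sketch usable when only base R is available.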
Curl, HTTP, FTP, HTML, XML, SOAP
-
RCurl: A low-level curl wrapper that allows one to compose general HTTP requests and provides convenient functions to fetch URIs, get/post forms, etc. and process the results returned by the Web server. This provides a great deal of control over the HTTP/FTP connection and the form of the request while providing a higher-level interface than is available just using R socket connections. It also provides tools for Web authentication.
-
httr: A light wrapper around
RCurl
that makes many things easier, but still allows you to access the lower level functionality of
RCurl. It has convenient functions named after HTTP verbs:
GET(),
POST(),
PUT(),
DELETE(),
PATCH(),
HEAD(),
BROWSE(). These wrapper functions are more convenient to use, though less configurable, than their counterparts in
RCurl. The equivalent of httr's
GET()
in
RCurl
is
getForm(). Likewise, the equivalent of
httr
's
POST()
in
RCurl
is
postForm(). HTTP status codes are helpful for debugging HTTP calls, and this package makes them easier to use. For example,
stop_for_status()
gets the HTTP status code from a response object and stops the function if the call was not successful. See also
warn_for_status(). Note that you can pass in additional Curl options to the
config
parameter in http calls.
-
The
XMLRPC
package provides an implementation of XML-RPC, a relatively simple remote procedure call mechanism that uses HTTP and XML. This can be used for communicating between processes on a single machine or for accessing Web services from within R.
-
The
XMLSchema
package provides facilities in R for reading XML schema documents and processing them to create definitions for R classes and functions for converting XML nodes to instances of those classes. It provides the framework for meta-computing with XML schema in R.
-
RTidyHTML
interfaces to the libtidy library for correcting HTML documents that are not well-formed. This library corrects common errors in HTML documents.
-
SSOAP
provides a client-side SOAP (Simple Object Access Protocol) mechanism. It aims to provide a high-level interface to invoke SOAP methods provided by a SOAP server.
-
Rcompression
: Interface to zlib and bzip2 libraries for performing in-memory compression and decompression in R. This is useful when receiving or sending contents to remote servers, e.g. Web services, HTTP requests via RCurl.
-
The
CGIwithR
package allows one to use R scripts as CGI programs for generating dynamic Web content. HTML forms and other mechanisms to submit dynamic requests can be used to provide input to R scripts via the Web to create content that is determined within that R script.
-
httpRequest: HTTP Request protocols. Implements the GET, POST and multipart POST request.
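The httr workflow described above (issue a request, check the status, then read the body) might look like the following sketch. The helper name fetch_text and the URL in the usage comment are hypothetical; GET(), timeout(), stop_for_status(), and content() are httr functions. Defining the helper needs neither httr nor network access, but calling it needs both.

```r
# Hypothetical helper: fetch a URL with httr and return the body as text.
fetch_text <- function(url, query = list(), seconds = 10) {
  # Extra curl options go in through `config`; here, a request timeout.
  resp <- httr::GET(url, query = query, config = httr::timeout(seconds))
  # Raise an R error if the call was not successful.
  httr::stop_for_status(resp)
  # Return the response body as a single character string.
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (the URL is a placeholder, not a real service):
# body <- fetch_text("http://api.foo.org/", query = list(q = "example"))
```

Failing early with stop_for_status() keeps error handling out of the parsing code that consumes the returned text.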
Authentication
-
Using web resources can require authentication, either via API keys, OAuth, a username:password combination, or other means. Additionally, some web resources require that authentication details be passed in the header of an HTTP call, which takes a little extra work. API keys and username:password combinations can be embedded in the URL of a call to a web resource (API key: http://api.foo.org/?key=yourkey; user/pass: http://username:password@api.foo.org), or can be specified via commands in
RCurl
or
httr. OAuth is the most complicated authentication process, and can be most easily done using
httr. See the 6 demos within
httr, three for OAuth 1.0 (linkedin, twitter, vimeo) and three for OAuth 2.0 (facebook, github, google).
ROAuth
is a package that provides a separate R interface to OAuth. OAuth is easier to do in
httr, so start there.
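As a rough illustration of the simpler schemes above, the sketch below builds a URL carrying an API key with base R and wraps username:password and header-based requests with httr. All three helper names are hypothetical, and the host and key are the placeholder values from the text; real services differ in the parameter and header names they expect.

```r
# Hypothetical helper: append an API key as a query parameter (base R only).
build_key_url <- function(base_url, key, param = "key") {
  paste0(base_url, "?", param, "=", utils::URLencode(key, reserved = TRUE))
}

# Hypothetical helper: username:password (HTTP basic) authentication via httr.
get_with_password <- function(url, user, password) {
  resp <- httr::GET(url, httr::authenticate(user, password))
  httr::stop_for_status(resp)
  resp
}

# Hypothetical helper: send a credential in a request header via httr.
# The header name is whatever the service documents; none is assumed here.
get_with_header <- function(url, header_name, header_value) {
  headers <- stats::setNames(header_value, header_name)
  resp <- httr::GET(url, httr::add_headers(.headers = headers))
  httr::stop_for_status(resp)
  resp
}

# Only the URL builder runs here; the httr helpers need a real endpoint.
print(build_key_url("http://api.foo.org/", "yourkey"))
```

Encoding the key with URLencode() guards against keys that contain reserved characters such as spaces or ampersands.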
Web Frameworks
-
The
shiny
package makes it easy to build interactive web applications with R.
-
The
Rook
web server interface contains the specification and convenience software for building and running Rook applications.
-
The
opencpu
framework for embedded statistical computation and reproducible research exposes a web API interfacing R, LaTeX and Pandoc.
This API is used for example to integrate statistical functionality into systems, share and execute scripts or reports on centralized servers, and build R based apps.
-
A package by
Yihui Xie
called
servr
provides a simple HTTP server to serve files under a given directory based on the
httpuv
package.
-
The
httpuv
package, made by Joe Cheng at RStudio, provides low-level socket and protocol support for handling HTTP and WebSocket requests directly within R. A related package, which
httpuv
may replace, is
websockets, also made by Joe Cheng.
-
websockets: A simple HTML5 websocket interface for R, made by Joe Cheng.
-
Plot.ly is a company that allows you to create visualizations on the web using R (and Python). They have an R package in development
here
, as well as access to their services via an API
here
.
-
The
WADL
package provides tools to process Web Application Description Language (WADL) documents and to programmatically generate R functions to interface to the REST methods described in those WADL documents.
-
The
RDCOMServer
provides a mechanism to export R objects as (D)COM objects in Windows. It can be used along with the
RDCOMClient
package which provides user-level access from R to other COM servers.
-
The
RSelenium
package (not on CRAN) provides a set of R bindings for the Selenium 2.0 webdriver using the [JsonWireProtocol](http://code.google.com/p/selenium/wiki/JsonWireProtocol). Selenium automates browsers. Using RSelenium you can automate browsers locally or remotely. This can aid in automated application testing, load testing and web scraping. Examples are given interacting with popular projects such as [shiny](http://cran.r-project.org/web/packages/shiny/index.html) and [sauceLabs](http://saucelabs.com).
JavaScript
-
ggvis
(not on CRAN) makes it easy to describe interactive web graphics in R. It fuses the ideas of ggplot2 and
shiny, rendering graphics on the web with Vega.
-
rCharts
(not on CRAN) allows for interactive javascript charts from R.
-
rVega
(not on CRAN) is an R wrapper for Vega.
-
clickme
(not on CRAN) is an R package to create interactive plots.
-
animint
(not on CRAN) allows an interactive animation to be defined using a list of ggplots with clickSelects and showSelected aesthetics, then exported to CSV/JSON/D3/JavaScript for viewing in a web browser.
-
The
SpiderMonkey
package provides a means of evaluating JavaScript code, creating JavaScript objects and calling JavaScript functions and methods from within R. This can work by embedding the JavaScript engine within an R session or by embedding R in a browser such as Firefox and being able to call R from JavaScript and call back to JavaScript from R.
Data Sources on the Web Accessible via R
Ecological and Evolutionary Biology
-
rvertnet: A wrapper to the VertNet collections database API.
-
rgbif: Interface to the Global Biodiversity Information Facility API methods.
-
rfishbase: A programmatic interface to fishbase.org.
-
treebase: An R package for discovery, access and manipulation of online phylogenies.
-
taxize: Taxonomic information from around the web.
-
dismo: Species distribution modeling, with wrappers to some APIs.
-
rnbn
(not on CRAN): Access to the UK National Biodiversity Network data.
-
rWBclimate
(not on CRAN): R interface for the World Bank climate data.
-
rbison: Wrapper to the USGS Bison API.
-
neotoma
(not on CRAN): Programmatic R interface to the Neotoma Paleoecological Database.
-
rnoaa
(not on CRAN): R interface to NOAA Climate data API.
-
rnpn
(not on CRAN): Wrapper to the National Phenology Network database API.
-
rfisheries: Package for interacting with fisheries databases at openfisheries.org.
-
rebird: A programmatic interface to the eBird database.
-
flora: Retrieve taxonomical information of botanical names from the Flora do Brasil website.
-
Rcolombos: This package provides programmatic access to Colombos, a web based interface for exploring and analyzing comprehensive organism-specific cross-platform expression compendia of bacterial organisms.
-
Reol: An R interface to the Encyclopedia of Life (EOL) API. Includes functions for downloading and extracting information from the EOL pages.
-
rPlant: An R interface to the many computational resources iPlant offers through their RESTful application programming interface. Currently,
rPlant
functions interact with the iPlant foundational API, the Taxonomic Name Resolution Service API, and the Phylotastic Taxosaurus API. Before using rPlant, users will have to register with the iPlant Collaborative.
http://www.iplantcollaborative.org/discover/discovery-environment
-
ecoengine: The ecoengine (
http://ecoengine.berkeley.edu/
) provides access to more than 2 million georeferenced specimen records from the Berkeley Natural History Museums.
http://bnhm.berkeley.edu/
Genes and Genomes
-
cgdsr: R-Based API for accessing the MSKCC Cancer Genomics Data Server (CGDS).
-
rsnps: This package is a programmatic interface to various SNP datasets on the web: openSNP, NCBI's dbSNP database, and Broad Institute SNP Annotation and Proxy Search. This package started as a library to interact with openSNP alone, so most functions deal with openSNP.
-
rentrez: Talk with NCBI entrez using R.
-
seqinr: Exploratory data analysis and data visualization for biological sequence (DNA and protein) data.
-
seq2R
: Detect compositional changes in genomic sequences - with some interaction with GenBank. Archived on CRAN.
-
primerTree: Visually Assessing the Specificity and Informativeness of Primer Pairs.
-
hoardeR: Information retrieval from NCBI databases, with main focus on Blast.
Earth Science
-
RNCEP: Obtain, organize, and visualize NCEP weather data.
-
crn: Provides the core functions required to download and format data from the Climate Reference Network. Both daily and hourly data are downloaded from the FTP site, a consolidated file of all stations is created, and station metadata is extracted. In addition, functions for selecting individual variables and creating R-friendly datasets for them are provided.
-
BerkeleyEarth
: Data input for Berkeley Earth Surface Temperature. Archived on CRAN.
-
waterData: An R Package for retrieval, analysis, and anomaly calculation of daily hydrologic time series data.
-
CHCN: A compilation of historical through contemporary climate measurements scraped from the Environment Canada website, including tools for scraping data, creating metadata, and formatting temperature files.
-
decctools: Provides functions for retrieving energy statistics from the United Kingdom Department of Energy and Climate Change and related data sources. The current version focuses on total final energy consumption statistics at the local authority, MSOA, and LSOA geographies. Methods for calculating the generation mix of grid electricity and its associated carbon intensity are also provided.
-
Metadata
: Collates metadata for climate surface stations. Archived on CRAN.
-
sos4R: A client for Sensor Observation Services (SOS) as specified by the Open Geospatial Consortium (OGC). It allows users to retrieve metadata from SOS web services and to interactively create requests for near real-time observation data based on the available sensors, phenomena, observations etc. using thematic, temporal and spatial filtering.
-
raincpc: The Climate Prediction Center's (CPC) daily rainfall data for the entire world, from 1979 to the present, at a resolution of 50 km (0.5 degrees lat-lon). This package provides functionality to download and process the raw data from CPC. Development version on GitHub
here
.
-
weatherData: Functions that help in fetching weather data from websites. Given a location and a date range, these functions help fetch weather data (temperature, pressure etc.) for any weather related analysis.
Economics and Business
-
WDI: Search, extract and format data from the World Bank's World Development Indicators.
-
The
Zillow
package provides an R interface to the Zillow Web Service API. It allows one to get the Zillow estimate for the price of a particular property specified by street address and ZIP code (or city and state), to find information (e.g. size of property and lot, number of bedrooms and bathrooms, year built.) about a given property, and to get comparable properties.
Finance
-
RDatastream
(not on CRAN): An R interface to the
Thomson Dataworks Enterprise SOAP API
(paid), with some convenience functions for retrieving Datastream data specifically.
-
quantmod: Specify, build, trade, and analyse quantitative financial trading strategies
-
TFX: Connects to TrueFX(tm) for free streaming real-time and historical tick-by-tick market data for dealable interbank foreign exchange rates with millisecond detail.
-
fImport: Environment for teaching "Financial Engineering and Computational Finance"
-
Rbitcoin: Interact with Bitcoin. Supports both public and private API calls and HTTP over SSL. Includes debug messages from Rbitcoin and RCurl, and error handling.
-
Thinknum: Interacts with the
Thinknum
API.
Chemistry
-
rpubchem: Interface to the PubChem Collection.
Agriculture
-
FAOSTAT: The package hosts a list of functions to download, manipulate, construct and aggregate agricultural statistics provided by the FAOSTAT (Food and Agricultural Organization of the United Nations) database.
-
cimis: R package for retrieving data from CIMIS, the California Irrigation Management Information System.
Literature, Metadata, Text, and Altmetrics
-
rplos: A programmatic interface to the Web Service methods provided by the Public Library of Science journals for search.
-
rbhl: R interface to the Biodiversity Heritage Library (BHL) API.
-
rmetadata
(not on CRAN): Get scholarly metadata from around the web.
-
RMendeley: Implementation of the Mendeley API in R.
-
rentrez: Talk with NCBI entrez using R.
-
rorcid
(not on CRAN): A programmatic interface to the Orcid.org API.
-
rpubmed
(not on CRAN): Tools for extracting and processing Pubmed and Pubmed Central records.
-
rAltmetric: Query and visualize metrics from Altmetric.com.
-
alm: R wrapper to the altmetrics API platform developed by PLoS.
-
ngramr: Retrieve and plot word frequencies through time from the Google Ngram Viewer.
-
scholar
provides functions to extract citation data from Google Scholar. Convenience functions are also provided for comparing multiple scholars and predicting future h-index values.
-
The
Sxslt
package is an R interface to Dan Veillard's libxslt translator. It allows R programmers to use XSLT directly from within R, and also allows XSL code to make use of R functions.
-
The
Aspell
package provides an interface to the aspell library for checking the spelling of words and documents.
-
OAIHarvester: Harvest metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
Marketing
-
anametrix: Bidirectional connector to Anametrix API.
Data Depots
-
dvn: Provides access to The Dataverse Network API.
-
rfigshare: Programmatic interface for Figshare.
-
factualR: Thin wrapper for the Factual.com server API.
-
dataone: A package that provides read/write access to data and metadata from the DataONE network of Member Node data repositories.
-
yhatr: Lets you deploy, maintain, and invoke models via the Yhat REST API.
-
RSocrata: Provided with a Socrata dataset resource URL, or a Socrata SoDA web API query, returns an R data frame. Converts dates to POSIX format. Supports CSV and JSON. Manages throttling by Socrata.
-
Quandl: A package that interacts directly with the
Quandl
API to offer data in a number of formats usable in R, as well as the ability to upload and search.
-
rdatamarket: Fetches data from DataMarket.com, either as timeseries in zoo form (dmseries) or as long-form data frames (dmlist).
-
infochimps
: An R wrapper for the infochimps.com API services, from
Drew Conway
. The CRAN version is archived. Development
on GitHub
.
Machine Learning as a Service
-
bigml: BigML, a machine learning web service.
-
MTurkR: Access to Amazon Mechanical Turk Requester API via R.
Web Analytics
-
rgauges: This package provides functions to interact with the Gaug.es API. Gaug.es is a web analytics service, like Google analytics. You have to have a Gaug.es account to use this package.
-
RSiteCatalyst: Functions for accessing the Adobe Analytics (Omniture SiteCatalyst) Reporting API.
-
r-google-analytics
(not on CRAN): Provides access to Google Analytics.
-
RGoogleTrends
provides programmatic access to Google Trends data. This is information about the popularity of a particular query.
News
-
GuardianR: Provides an interface to the Open Platform's Content API of the Guardian Media Group. It retrieves content from news outlets The Observer, The Guardian, and guardian.co.uk from 1999 to current day.
-
RNYTimes
provides interfaces to several of the New York Times Web services for searching articles, meta-data, user-generated content and best seller lists.
Images, Graphics, Videos, Music
-
imguR: A package to share plots using the image hosting service imgur.com. (also see the function
imgur_upload()
in knitr, which uses the newer Imgur API version 3)
-
RLastFM
: A package to interface to the last.fm API. Archived on CRAN.
-
The
RUbigraph
package provides an R interface to a Ubigraph server for drawing interactive, dynamic graphs.
You can add and remove vertices/nodes and edges in a graph and change their attributes/characteristics such as shape, color, size.
Sports
-
nhlscrapr: Compiling the NHL Real Time Scoring System Database for easy use in R.
-
pitchRx: Tools for Collecting and Visualizing Major League Baseball PITCHfx Data
-
bbscrapeR
(not on CRAN yet): Tools for Collecting Data from nba.com and wnba.com
-
fbRanks: Association Football (Soccer) Ranking via Poisson Regression - uses time dependent Poisson regression and a record of goals scored in matches to rank teams via estimated attack and defense strengths.
Maps
-
RgoogleMaps: This package serves two purposes: it provides a comfortable R interface to query the Google server for static maps, and it uses the map as a background image to overlay plots within R.
-
The
R2GoogleMaps
package - which is different from
RgoogleMaps
- provides a mechanism to generate JavaScript code from R that displays data using Google Maps.
-
osmar: This package provides infrastructure to access OpenStreetMap data from different sources, to work with the data in a common R manner, and to convert the data into structures provided by existing R packages (e.g., into sp and igraph objects).
-
ggmap: Allows for the easy visualization of spatial data and models on top of Google Maps, OpenStreetMaps, Stamen Maps, or CloudMade Maps using ggplot2.
-
The
GeoIP
package maps IP addresses and host names to geographic locations - latitude, longitude, region, city, zip code, etc.
-
The
RKML
package provides users with high-level facilities to generate KML, the Keyhole Markup Language, for display in, e.g., Google Earth.
-
RKMLDevice
allows you to create R graphics in KML format so that they can be displayed on Google Earth (or Google Maps).
Social media
-
streamR: This package provides a series of functions that allow R users to access Twitter's filter, sample, and user streams, and to parse the output into data frames. OAuth authentication is supported.
-
twitteR: Provides an interface to the Twitter web API.
-
The
Rflickr
package provides an R interface to the Flickr photo management and sharing application Web service.
-
Rfacebook: Provides an interface to the Facebook API.
Government
-
wethepeople: An R client for interacting with the White House's "We The People" petition API.
-
govStatJPN: Functions to get public survey data in Japan.
-
acs: Download, manipulate, and present data from the US Census American Community Survey.
Google Web Services
-
RGoogleStorage
provides programmatic access to the Google Storage API. This allows R users to access and store data on Google's storage. We can upload and download content, create, list and delete folders/buckets, and set access control permissions on objects and buckets.
-
The
RGoogleDocs
package is an example of using the RCurl and XML packages to quickly develop an interface to the Google Documents API.
-
translate: Bindings for the Google Translate API v2
-
googlePublicData: An R library to build Google's public data explorer DSPL metadata files.
-
googleVis: Interface between R and the Google chart tools.
-
gooJSON: A Google JSON data interpreter for R which contains a suite of helper functions for obtaining data from the Google Maps API JSON objects.
-
plotGoogleMaps: Plot SP or SPT (STIDF, STFDF) data as an HTML map mashup over Google Maps.
-
plotKML: Visualization of spatial and spatio-temporal objects in Google Earth.
Amazon Web Services
-
AWS.tools: An R package to interact with Amazon Web Services (EC2/S3).
-
RAmazonS3
package provides the basic infrastructure within R for communicating with the S3 Amazon storage server.
This is a commercial server that allows one to store content and retrieve it from any machine connected to the Internet.
-
RAmazonDBREST
provides an interface to Amazon's Simple DB API.
-
MTurkR: Access to Amazon Mechanical Turk Requester API via R.
Other
-
sos4R: R client for the OGC Sensor Observation Service.
-
datamart: Provides an S4 infrastructure for unified handling of internal datasets and web based data sources. Examples include dbpedia, eurostat and sourceforge.
-
rDrop
(not on CRAN): Dropbox interface.
-
zendeskR: This package provides an R wrapper for the Zendesk API.
-
AWS.tools: An R package to interact with Amazon Web Services (EC2/S3).
-
The
qualtrics
package provides functions to interact with the Qualtrics online survey tool.