Automated Data Collection with R Blog

  • 02 Mar 2016 » Gathering Twitter data with the TwitteR2Mongo package
  • The following is a guest post by Philipp Prettl and Lion Weber (University of Konstanz), with support from Simon Munzert. Did you ever want to track trending topics on Twitter, but failed to start tracking when an event took off? Are you interested in mining massive amounts of Twitter data using R, but lack the tools to store all tweets in a database? If so, this post might be what you've been waiting for... (Read more)

  • 23 Dec 2015 » One Solution to the 'stringsAsFactors'-Problem Or: Hell-Yeah there is HELLNO
  • Base R's stringsAsFactors default setting is supposedly the most complained-about piece of code in the whole R infrastructure. A search through the source code of all CRAN packages in December 2015 (Link) resulted in 3,492 mentions of stringsAsFactors. Most of the time these explicit mentions were found within calls to data.frame() or as.data.frame() and simply set the value to FALSE. The hellno package provides an explicit solution to the problem without changing R... (Read more)

  • 17 Aug 2015 » Constructing a network of politicians from newspaper data
  • The following is a guest post by Jana Blahak and Jan Dix (University of Konstanz), with support from Simon Munzert. In the last post, we introduced the rzeit package, an R binding to the Content API at ZEIT Online. This time, we give a little demonstration of what can be done with these media data. The question we ask is the following: Can we use information from newspaper articles to learn about connections between political... (Read more)

  • 06 Aug 2015 » Gathering German newspaper data with the rzeit package
  • The following is a guest post by Jana Blahak and Jan Dix (University of Konstanz), with support from Simon Munzert. We are happy to introduce our freshly created rzeit package. It connects to the Content API at ZEIT Online, a German newspaper website. In short, the package allows you to conduct an unfiltered search for articles, use a variety of parameters to refine query fields, e.g. to specify content and time, and easily inspect meta... (Read more)

  • 23 Jul 2015 » htmltab v.0.6.0
  • The next version of the htmltab package has just been released on CRAN and GitHub. The goal behind htmltab is to make the collection of structured information from HTML tables as easy and painless as possible (read about the package here and here). The most recent update fixes many smaller bugs and inconsistencies and brings significant internal optimization of the code to increase not only the robustness of the function but also the... (Read more)

  • 08 Jun 2015 » Using Wikipediatrend
  • What do Wikipedia's readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul? What can you find? Source: http://stats.grok.se/ The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ . If you want to know how often an article... (Read more)

  • 18 Feb 2015 » Making R Files Executable (under Windows)
  • Although it is reasonable that R scripts get opened in edit mode by default, it would be even nicer (once in a while) to run them with a simple double-click. Well, here we go ... Choosing a new file extension name (.Rexec) First, we have to think about a new file extension name. While double-click to run is a nice-to-have, the default behaviour should not be overwritten. In the Windows universe one cannot simply attach... (Read more)

  • 19 Jan 2015 » Programming a Twitter bot – and the rescue from procrastination
  • A considerable share of Twitter accounts is not actually run by humans. According to a recent release by Twitter, "up to approximately 8.5%" of the active users are bots or third-party software that automatically aggregates tweets. Bots can follow other users, retweet content or post content on their own. What they say is essentially generated by scripts. Take @TwoHeadlines, for example. The bot, hosted by Darius Kazemi, scrapes headlines from Google News and replaces one... (Read more)

  • 16 Jan 2015 » htmltab: Next version and CRAN release
  • About a month ago, I announced the release of the htmltable package. In the meantime a lot has happened. Years have changed and the –presumably formidable– htmlTable package has been released on CRAN. So much for my beloved package name. Since I need a new one, let's find something shorter, more googleable ... how about ... htmltab? htmltab it is. What the htmltab package is about The main goal behind htmltab is to make the... (Read more)

  • 21 Dec 2014 » How to conduct a tombola with R
  • Two weeks ago, we announced that we would raffle off three hardcover copies of our ADCR book among all followers of our Twitter account @RDataCollection. Tomorrow is closing day, so it is high time to present the drawing procedure, which, as a matter of course, is conducted with R. Connecting with Twitter We start by installing the latest version of Jeff Gentry's twitteR package from GitHub, which makes the OAuth authentication handshake procedure very comfortable: devtools::install_github("geoffjentry/twitteR") library(twitteR)... (Read more)

  • 19 Dec 2014 » 50 years of Christmas at the Windsors
  • It is that time of year again: Truckloads of lights are dumped into store windows, people scramble to get their Christmas shopping done, and it is becoming increasingly unbearable to listen to the radio. Of course, the most important element of the season is still ahead of us – all across the Commonwealth people are eagerly awaiting the Queen's Christmas Broadcast. Well... let's assume they do for the purposes of this blog post. We figured... (Read more)

  • 15 Dec 2014 » Hassle-free data from HTML tables with the htmltable package
  • [2015-01-15 The syntax in this article is outdated. For a revised version take a look at the package vignette] HTML tables are a standard way to display tabular information online. Getting HTML table data into R is fairly straightforward with the readHTMLTable() function of the XML package. But tables on the web are primarily designed for displaying and consuming data, not for analytical purposes. Peculiar design choices for HTML tables are therefore frequently made which... (Read more)

  • 10 Dec 2014 » Introduction to Public Attention Analytics with Wikipediatrend
  • Elections are the most important events of political life. Their results determine who gets to be in government for the next few years. But how long do elections and their results capture the public's attention? In this blog post we take a look at Wikipedia page access statistics to find out. Our weapon of choice will be R and the recently published wikipediatrend package that allows for convenient data retrieval -- be sure to check... (Read more)

  • 03 Dec 2014 » Win a hardcover copy of ADCR
  • The rapid growth of the World Wide Web over the past two decades has tremendously changed the way we share, collect and publish data. Firms, public institutions and private users provide every imaginable type of information; new channels of communication generate vast amounts of data on human behavior. In order to help researchers cope with the data avalanches, a variety of new techniques for collecting and analyzing large data sets have been devised. As the... (Read more)

  • 02 Dec 2014 » Welcome to the ADCR Blog
  • Welcome to the inaugural post of our blog "Automated Data Collection with R" – our new outlet for discussing all aspects regarding data collection with R. We are a group of four researchers with a background in political science. Coming from a traditionally data-sparse discipline, we have come to realize the opportunities the internet provides for new and original data sources. Therefore, we have joined forces to write a book on web scraping and text... (Read more)

