Using Cookies for Connected Requests with RCurl

Duncan Temple Lang

University of California at Davis

Department of Statistics


The Problem

This example shows how to use RCurl with cookies. It comes from a question posted to the R-help mailing list on September 18, 2012.

In a Web browser, we visit the page http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18 (no longer available). Before allowing us access to the data, the Web site presents us with a disclaimer page. We have to click on the I Agree button and then we are forwarded to a page with the actual data. We want to read that using, for example, readHTMLTable() in the XML package.

What happens when we click on the I Agree button? That sets a cookie. After that, the browser includes that cookie in each request to that server, and this confirms that we have agreed to the disclaimer. The Web server processes each request containing the cookie knowing we have agreed, and so gives us the data.

So we first need to make a request in R that emulates clicking the I Agree button. We have to arrange for that request to recognize the cookie in the response and then use that cookie in all subsequent requests to that server. We could do this manually, but there is no need to. We simply use the same curl handle in all of the requests. In the first request, libcurl processes the response and retrieves the cookie. In subsequent requests made with the same handle, libcurl automatically sends the cookie.
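For comparison, the manual route would look roughly like the following sketch, which uses only base R. The header lines, cookie names and values are invented for illustration; they are not what the wateroffice server actually sends. The sketch pulls the name=value pairs out of the Set-Cookie lines of a response header and joins them into the string a client would send back in its Cookie header.

```r
# Hypothetical response header lines; the cookie names and values are invented.
headerLines = c("HTTP/1.1 200 OK",
                "Content-Type: text/html",
                "Set-Cookie: disclaimer=agree; path=/",
                "Set-Cookie: session=abc123; HttpOnly")

# Keep only the Set-Cookie lines (header names are case-insensitive).
setCookie = grep("^Set-Cookie:", headerLines, ignore.case = TRUE, value = TRUE)

# Drop the header name, then keep the leading name=value pair and
# discard attributes such as path or HttpOnly.
pairs = sub("^Set-Cookie:[[:space:]]*", "", setCookie, ignore.case = TRUE)
pairs = sub(";.*$", "", pairs)

# The value a client would send in the Cookie header of later requests.
cookieHeader = paste(pairs, collapse = "; ")
cookieHeader
```

With RCurl, a string like this could be supplied through the cookie option of a request. Sharing one curl handle avoids all of this bookkeeping, including attributes such as path and expiry that the sketch ignores.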

We create the curl handle object with

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)

This enables cookies in the handle, but does not arrange to write them to a file. There is no need to store the cookie, since we can simply agree to the disclaimer in each R session. If we do want to save the cookie for use in later R sessions or by other curl handles, we can specify a file name as the value of the cookiejar argument; libcurl writes the cookies to that file when the curl handle is deleted. (The cookiefile argument names a file from which cookies are read; the empty string enables cookie handling without reading any file.)
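As a minimal sketch of saving the cookies, the following creates a second handle that both enables cookie handling and names a file to write to. In libcurl's option naming, cookiejar is the file that is written and cookiefile is the file that is read. The file location and the variable names jar and curlSaved are assumptions made for this illustration.

```r
library(RCurl)

# Hypothetical location for the saved cookies; any writable path would do.
jar = file.path(tempdir(), "wateroffice-cookies.txt")

# cookiefile = "" turns on cookie handling without reading an existing file;
# cookiejar names the file libcurl writes when the handle is cleaned up.
curlSaved = getCurlHandle(cookiefile = "", cookiejar = jar)

# A later session could read the saved cookies back with
#   getCurlHandle(cookiefile = jar)
```

The file should only appear once the handle is released, which in R typically happens when the object is removed and garbage collected.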

The disclaimer page is a POST form. We send the request to http://www.wateroffice.ec.gc.ca/include/disclaimer.php with the parameter named disclaimer_action and the value "I Agree". We can get this information by reading the HTML page and looking for the <form> element. Alternatively, we could use the RHTMLForms package.

We can make the request with

postForm("http://www.wateroffice.ec.gc.ca/include/disclaimer.php",
           disclaimer_action = "I Agree", curl = curl)

We can ignore the result as we just want the side-effect of getting the cookie in the curl handle.

We can now access the actual data at the original URL. We cannot use readHTMLTable() directly as that does not use a curl handle, and does not know about the cookie. Instead, we use getURLContent() to get the content of the page. We can then pass this text to readHTMLTable(). So we make the request with

u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
txt = getURLContent(u, curl = curl, verbose = TRUE)

Personally, I prefer to use

txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html",
               mode = "text", stn = "05ND012", prm1 = 3, 
               syr = "2012", smo = "09", sday = "15", eyr = "2012", emo = "09", 
               eday = "18",  curl = curl)

This makes it easier to change individual inputs.
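To see how the named arguments correspond to the long URL, the following base R sketch assembles the same query string by hand. It only illustrates the mapping and is not how getForm() is implemented; the names params, query and fullURL are introduced here for the example.

```r
# The form inputs as a named character vector, in the order used in the URL.
params = c(mode = "text", stn = "05ND012", prm1 = "3",
           syr = "2012", smo = "09", sday = "15",
           eyr = "2012", emo = "09", eday = "18")

# Percent-encode each value, then join the pieces as name=value&name=value.
encoded = vapply(params, utils::URLencode, character(1), reserved = TRUE)
query = paste(names(params), encoded, sep = "=", collapse = "&")

fullURL = paste0("http://www.wateroffice.ec.gc.ca/graph/graph_e.html", "?", query)
fullURL
```

Changing a single input, for example params["eday"] = "17", then alters one field of the request without any editing of the URL string.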

The result should contain the actual data.

library(XML)
tbl = readHTMLTable(txt, asText = TRUE)

We can find the number of rows and columns in each table with

sapply(tbl, dim)
     dataTable hydroTable
[1,]       852          1
[2,]         2          4

We want the first table, dataTable. By default, its columns are read as strings or factors, and the numbers have a * attached, so they cannot be converted directly. We can post-process the columns to get the values.

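# Read only the first table and keep the columns as character vectors.
tbl = readHTMLTable(txt, asText = TRUE, which = 1,
                    stringsAsFactors = FALSE)
# Remove the * markers and convert the values to numbers.
tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]]))
# Parse the time stamps; POSIXct stores more cleanly in a data frame than POSIXlt.
tbl[[1]] = as.POSIXct(strptime(tbl[[1]], "%Y-%m-%d %H:%M:%S"))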
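Since the Web page is no longer available, the same post-processing can be tried on a small stand-in data frame. The rows, the column names Date and Value, and the UTC time zone below are invented for illustration and are not actual measurements. The sketch also records which values carried a *, in case that marker matters for later analysis.

```r
# Invented rows that mimic the layout described above; not real measurements.
raw = data.frame(Date = c("2012-09-15 00:00:00", "2012-09-15 00:05:00"),
                 Value = c("0.352*", "0.351"),
                 stringsAsFactors = FALSE)

# Remember which values were marked before stripping the marker.
raw$flagged = grepl("\\*", raw$Value)

# Strip the marker and convert to numeric.
raw$Value = as.numeric(gsub("\\*", "", raw$Value))

# Parse the time stamps; UTC is an assumption made for this example.
raw$Date = as.POSIXct(strptime(raw$Date, "%Y-%m-%d %H:%M:%S", tz = "UTC"))

str(raw)
```

Keeping the flag in its own column is a design choice: the numeric column stays clean while the information carried by the marker is not lost.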

Using RHTMLForms to Find the Disclaimer Form

The RHTMLForms package can read an HTML page and return a description of each of its forms. It can also generate an R function corresponding to each form, so that we can invoke the form as if it were a local function in R. We get the descriptions with

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)

We pass FALSE as the second argument to getHTMLFormDescription() so that the buttons are kept in the form descriptions. We need them because the I Agree button is what supplies the disclaimer_action parameter.

We create the function for this form with

fun = createFunction(forms[[1]])

We can invoke this function using the curl handle we created to capture the cookies:

fun(.curl = curl)

This will agree to the disclaimer on our behalf.