This is an example of using RCurl with cookies. This comes from a question on the R-help mailing list on Sep 18th, 2012.
In a Web browser, we visit the
page http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18
(no longer available).
Before allowing us access to the data, the Web site presents us with a disclaimer page.
We have to click on the "I Agree" button and then we are forwarded to a page with the actual data. We want to read that using, for example, readHTMLTable() in the XML package.
What happens when we click on the button? That sets a cookie. After that, the browser includes that cookie in each request to that server, which confirms that we have agreed to the disclaimer. The Web server processes each request containing the cookie knowing we have agreed, and so gives us the data.
So we first need to make a request in R that emulates clicking the button. We have to arrange for that request to recognize the cookie in the response and then use that cookie in all subsequent requests to that server. We could do this manually, but there is no need to. We simply use the same curl handle in all of the requests. In the first request, libcurl processes the response and retrieves the cookie. When we use the same curl handle in subsequent requests, libcurl automatically sends the cookie in those requests.
We create the curl handle object with
library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)
This enables cookies in the handle, but does not arrange to write them to a file. There is no need to store the cookie, since we can just agree to the disclaimer in each session. If we do want to save the cookies to a file when the curl handle is deleted, for use in subsequent R sessions or other curl handles, we can specify a file name via libcurl's cookiejar option. The cookiefile option names the file from which cookies are read, and the empty string used here simply turns on cookie handling.
The disclaimer page is a POST form. We send the request to http://www.wateroffice.ec.gc.ca/include/disclaimer.php with the parameter named disclaimer_action and the value "I Agree". We can get this information by reading the HTML page and looking for the <form> element. Alternatively, we could use the RHTMLForms package.
We can make the request with
postForm("http://www.wateroffice.ec.gc.ca/include/disclaimer.php", disclaimer_action = "I Agree", curl = curl)
We can ignore the result as we just want the side-effect of getting the cookie in the curl handle.
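For comparison, handling the cookie by hand would mean reading the Set-Cookie header from the response and sending its name=value pair back in a Cookie header on later requests. The following is a minimal base-R sketch of that bookkeeping, which libcurl does for us when we share the handle. The header lines and the cookie name disclaimer are made up for illustration, and extractCookies() is a hypothetical helper, not part of RCurl.

```r
# Hypothetical response header lines; the real server's cookie name and
# value are not known here and are assumed for illustration only.
headers = c("HTTP/1.1 200 OK",
            "Content-Type: text/html",
            "Set-Cookie: disclaimer=agree; path=/")

# Hypothetical helper: pull the name=value pair out of each Set-Cookie line,
# dropping attributes such as path and expires.
extractCookies = function(headers) {
    lines = grep("^Set-Cookie:", headers, ignore.case = TRUE, value = TRUE)
    pairs = sub("^Set-Cookie:[[:space:]]*", "", lines, ignore.case = TRUE)
    sub(";.*$", "", pairs)
}

cookies = extractCookies(headers)

# The string we would have to send back in the Cookie header of every
# subsequent request to the same server.
cookieHeader = paste(cookies, collapse = "; ")
```

Reusing the same curl handle makes all of this unnecessary, which is why the approach in this example never touches the headers directly.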
We can now access the actual data at the original URL. We cannot pass the URL directly to readHTMLTable(), as that does not use our curl handle and so does not know about the cookie. Instead, we use getURLContent() to get the content of the page and then pass this text to readHTMLTable(). So we make the request with
u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
txt = getURLContent(u, curl = curl, verbose = TRUE)
Personally, I prefer to use
txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html", mode = "text", stn = "05ND012", prm1 = 3, syr = "2012", smo = "09", sday = "15", eyr = "2012", emo = "09", eday = "18", curl = curl)
This makes it easier to change individual inputs.
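One way to take advantage of this is to compute the inputs from a station identifier and a pair of dates. The sketch below is only an illustration: waterofficeParams() is a hypothetical helper, the parameter names are simply those in the URL above, and the getForm() call is left commented out because the page is no longer available.

```r
# Hypothetical helper: build the query parameters for a station and a
# date range. Parameter names are taken from the URL used in this example.
waterofficeParams = function(stn, start, end, prm1 = 3) {
    start = as.Date(start)
    end = as.Date(end)
    c(mode = "text", stn = stn, prm1 = as.character(prm1),
      syr = format(start, "%Y"), smo = format(start, "%m"),
      sday = format(start, "%d"),
      eyr = format(end, "%Y"), emo = format(end, "%m"),
      eday = format(end, "%d"))
}

params = waterofficeParams("05ND012", "2012-09-15", "2012-09-18")

# With RCurl loaded and a curl handle that already holds the cookie,
# the request would be (not run here):
# txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html",
#               .params = params, curl = curl)
```

Passing the inputs through the .params argument of getForm() keeps them in a single named vector, so changing the station or the date range is a one-line change.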
The result should contain the actual data.
library(XML)
tbl = readHTMLTable(txt, asText = TRUE)
We can find the number of rows and columns in each table with
sapply(tbl, dim)
     dataTable hydroTable
[1,]       852          1
[2,]         2          4
We want the first one. The columns are, by default, strings or factors. The numbers have a * on them. We can post-process this to get the values.
tbl = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]]))
tbl[[1]] = strptime(tbl[[1]], "%Y-%m-%d %H:%M:%S")
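Since the page is no longer available, the same clean-up can be tried on a small stand-in table. This is a sketch under assumptions: the column names and values below are made up, and it uses as.POSIXct() rather than strptime() so that the date-time column is stored in the compact POSIXct form.

```r
# Made-up rows standing in for the table returned by readHTMLTable();
# the column names and values are assumptions for illustration.
tbl = data.frame(date = c("2012-09-15 00:00:00", "2012-09-15 00:05:00"),
                 value = c("1.234*", "1.240"),
                 stringsAsFactors = FALSE)

# Remove the literal * (escaped, since * is a regex metacharacter),
# then convert the remaining text to numbers.
tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]]))

# Parse the time stamps; tz is fixed so the result does not depend on
# the local time zone of the machine running the code.
tbl[[1]] = as.POSIXct(tbl[[1]], format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

The time zone of the original data is not stated in the source, so the "UTC" here is only a placeholder to be replaced with the correct zone.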
The RHTMLForms package can both read an HTML page and get a description of all of its forms, and also generate an R function corresponding to each form so that we can invoke the form as if it were a local function in R. We get the descriptions with
library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
We need to keep the buttons in the forms, hence the FALSE as the second argument to getHTMLFormDescription().
We create the function for this form with
fun = createFunction(forms[[1]])
We can invoke this function using the curl handle we created to capture the cookies:
fun(.curl = curl)
This will agree to the disclaimer on our behalf.