Downloading S&P 500 Stock Data from Google/Quandl with R (Command Line Script)

DISCLAIMER: Any losses incurred based on the content of this post are the responsibility of the trader, not me. I, the author, neither take responsibility for the conduct of others nor offer any guarantees. None of this should be considered as financial advice; the content of this article is only for educational/entertainment purposes.

While most Americans have heard of the Dow Jones Industrial Average (DJIA), most people active in finance consider the S&P 500 stock index to be the better barometer of the overall American stock market. The 500 stocks included in the index are large-cap stocks seen as a leading indicator for the performance of stocks overall. Thus the S&P 500 and its component stocks are sometimes treated as “the market.”

I wanted to download the historical price data (specifically closing price) for all the stocks included in the S&P 500, which you can read on a Wikipedia list. Services such as Google Finance or Quandl provide data for individual companies and the index as a whole (or at least for SPY, the ticker symbol of the SPDR S&P 500 ETF Trust, which is an index ETF that tracks the S&P 500), but they don’t provide a utility to download closing prices for all symbols included in an index, nor will they provide the list. That said, once I have the list I can easily fetch the data for the symbols individually, then merge them together into one dataset.

I wrote an R script for doing this. The script executes the following steps:

  1. Scrape a list of symbols included in the S&P 500 index from Wikipedia, storing the list as a character vector.
  2. Using either quantmod or the R package Quandl (both available from CRAN), attempt to fetch (daily) price data for each symbol in the list created in step one in a loop for a certain date range. If no data for a symbol is available, ignore it.
  3. Merge closing price data for each symbol into a single dataset.

In addition to these steps, I made my script executable from the command line on Unix/Linux systems using the argparse package, further streamlining the process and hiding the R implementation details (so if you’re a Pythonista who doesn’t want to touch R beyond installing it and perhaps some dependencies, this script is still useful to you). I make no guarantees for Windows users; you may need to run the script by hand. By default, the script downloads data from Google, but in the command line you can instruct the script to get data from Quandl; you will need a Quandl API key if you do so. Quandl data can also be adjusted for stock splits and dividends if you so choose (Google data is adjusted automatically for stock splits only).

You can either apply an inner join data, or an outer join. Inner join will lead to no missing information in your CSV, but the resulting file will likely contain much less data than you wanted, since some symbols won’t have data until a few years ago, thus truncating the entire dataset. The default, outer join, will return a CSV with as much data as possible for each symbol, including NA when some symbols don’t have data extending as far back as specified.

Here is the script. Copy this into a .R file (I named my file get-sp500-data.R), make it executable (for example, chmod +x get-sp500-data.R on Ubuntu Linux), execute (for example, ./get-sp500-data.R in Bash), and get that sweet, juicy data.

#!/usr/bin/Rscript
# get-sp500-data.R
# Author: Curtis Miller

suppressPackageStartupMessages(library(argparse))
suppressPackageStartupMessages(library(rvest))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(quantmod))

# Argument parser for command line use
parser <- ArgumentParser()
parser$add_argument("-v", "--verbose", action = "store_true", default = TRUE,
                    help = "Print extra output [default]")
parser$add_argument("--quietly", action = "store_false",
                    dest = "verbose", help = "Print little output")
parser$add_argument("-f", "--file", type = "character", dest = "csv_name",
                    default = "sp-500.csv",
                    help = "CSV file to save data to [default: sp-500.csv]")
parser$add_argument("-s", "--sleep", type = "integer", dest = "sleeptime",
                    default = 2,
                    help = paste("Time (seconds) between fetching symbols",
                                 "[default: 2] (don't flood websites with",
                                 "requests!)"))
parser$add_argument("--inner", action = "store_true", default = FALSE,
                    dest = "inner",
                    help = paste("Inner join; only dates where all symbols",
                                 "have data will be included"))
parser$add_argument("--start", type = "character", dest = "start",
                    default = "1997-01-01",
                    help = paste("Earliest date (YYYY-MM-DD) to include",
                                 "[default: 1997-01-01]"))
parser$add_argument("--end", type = "character", dest = "end",
                    default = "today",
                    help = paste('Last date (YYYY-MM-DD or "today") to',
                                 'include [default: "today"]'))
parser$add_argument("-k", "--key", type = "character", dest = "api_key",
                    default = NULL,
                    help = "Quandl API key, needed if getting Quandl data")
parser$add_argument("-q", "--quandl", action = "store_true", default = FALSE,
                    dest = "use_quandl", help = "Get data from Quandl")
parser$add_argument("-a", "--adjust", action = "store_true", default = FALSE,
                    dest = "adjust", help = "Adjust prices (Quandl only)")
parser$add_argument("--about", action = "store_true", default = FALSE,
                    dest = "about",
                    help = paste("Print information about the script and its",
                                 "usage, then quit"))

args <- parser$parse_args()

join <- "outer"
if (args$inner) {
  join <- "inner"
}
verbose <- args$verbose
start <- args$start
if (args$end == "today") {
  end <- Sys.Date()
} else {
  end <- args$end
}
sleeptime <- args$sleeptime  # In seconds
csv_name <- args$csv_name
api_key <- args$api_key
use_quandl <- args$use_quandl
adjust <- args$adjust
about <- args$about

if (about) {
  # Display a message, then quit
  comm_name <- substring(commandArgs(trailingOnly = FALSE)[4], 8)
  cat(comm_name, "\n(c) 2017 Curtis Miller\n",
      "Licensed under GNU GPL v. 3.0 available at ",
      "https://www.gnu.org/licenses/gpl-3.0.en.html \n",
      "E-mail: cgmil@msn.com\n\n",
      "This script fetches closing price data for ticker symbols included ",
      "in the S&P 500 stock index. A list of symbols included in the index ",
      "is fetched from this webpage:",
      "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies  The list ",
      "is parsed and the symbols included in the list are fetched from ",
      "either Google Finance (the default) or Quandl (which requires a ",
      "Quandl API key). If Quandl is the data source, adjusted data can be ",
      "fetched instead. The resulting data set is then saved to a CSV",
      "file in the current working directory.\n\n",
      "This package requires the following R packages be installed in order ",
      "to work (all of which are available from CRAN and can be downloaded ",
      "and installed automatically from R via the command ",
      "'install.packages(\"package_name\")'):\n\n",
      "* rvest\n",
      "* magrittr\n",
      "* quantmod\n",
      "* Quandl\n",
      "* tibble\n",
      "* argparse (used only for the command line interface)\n\n",
      "This script was written by Curtis Miller and was made available on ",
      "his website: https://ntguardian.wordpress.com\n\n",
      "You can read more about this script in the following article: ",
      "https://ntguardian.wordpress.com/blog\n\n", sep = "")
  quit()
}

start %<>% as.Date
end %<>% as.Date

options("getSymbols.warning4.0"=FALSE)

sp500_wiki <- read_html(
  "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")

symbols_table <- sp500_wiki %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table()

symbols_table <- symbols_table[[1]]
symbols <- as.character(symbols_table$`Ticker symbol`)

if (use_quandl) {
  suppressPackageStartupMessages(library(Quandl))  # This will be needed
  Quandl.api_key(api_key)
}

sp500 <- NULL
for (s in symbols) {
  Sys.sleep(sleeptime)
  if (verbose) {
    cat("Processing:", s, "...")
  }
  tryCatch({
    if (use_quandl) {
      s_data <- Quandl.datatable("WIKI/PRICES", ticker = c(s),
                                 date.gte = start, date.lte = end)
      rownames(s_data) <- as.Date(s_data$date)
      if (adjust) {
        s_data <- s_data[, "adj_close", drop = FALSE]
      } else {
        s_data <- s_data[, "close", drop = FALSE]
      }
    } else {
      s_data <- Cl(getSymbols(s, src="google", from = start, to = end,
                              env = NULL))
    }
    names(s_data) <- s
    s_data %<>% as.xts
    if (length(unique(s_data)) > 1) {    # Don't allow what is effectively
                                         # empty data
      if (is.null(sp500)) {
        sp500 <- as.xts(s_data)
      } else {
        sp500 %<>% merge(s_data, join = join)
      }
      if (verbose) {
        cat(" Got it! From", start(s_data) %>% as.character, "to",
            end(s_data) %>% as.character, "\n")
      }
    } else if (verbose) {
      cat("Sorry, but not this one!\n")
    }
  }, error = function(e) {
    if (verbose) {
      cat("Sorry, but not this one!\n")
    }
  })
}

badsymbols <- setdiff(symbols, names(sp500))
if (verbose & (length(badsymbols) > 0)) {
  cat("There were", length(badsymbols),
      "symbols for which data could not be obtained.\nThey are:", badsymbols,
      "\n")
}

write.csv(rownames_to_column(as.data.frame(sp500), "Date"), file=csv_name)

Clearly this can be modified to work for other indices, which requires getting a different list to scrape. I’m sure Wikipedia has more lists that can easily substitute for the one I scraped. You could even construct a list of symbols by hand (though this could be a painful process). I’ll leave that up to you. If you have alternative lists of symbols for different indices, I would love to read your substitute for lines 107-112.


I have created a video course that Packt Publishing will be publishing later this month, entitled Unpacking NumPy and Pandas, the first volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there (when it becomes available).

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

Advertisements

2 thoughts on “Downloading S&P 500 Stock Data from Google/Quandl with R (Command Line Script)

  1. Hello! I am no expert on economics nor R, but I think I have a faster way to bring the S&P 500 much faster… and doing pretty much the same! 😉

    sp500 <- 'http://data.okfn.org/data/core/s-and-p-500-companies/r/constituents-financials.csv&#039;
    sp500 <- read.csv(sp500)

    library(quantmod)
    library(dplyr)
    library(beepr)

    # Bring all historical data from all S&P 500
    symbols <- sp500$Symbol
    symbols <- symbols [! symbols %in% c("BRK.B","BF.B","HAR","LLTC","YHOO")] #These give error
    data <- c()
    for(i in 1:length(symbols)) {
    symbol <- symbols[i]
    values % data.frame()
    values <- cbind(row.names(values), as.character(symbol), values)
    colnames(values) <- c("Date","Symbol","Open","High","Low","Close","Volume","Adjusted")
    row.names(values) <- NULL
    data <- rbind(data, values)
    print(paste(symbol,": done ",round(100*i/length(symbols),1),"% (",i,")",sep=""))
    }
    write.csv(data,"BigSP500.csv", row.names = FALSE)
    beep() # It'll let us know when it's done!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s