Getting S&P 500 Stock Data from Quandl/Google with Python

DISCLAIMER: Any losses incurred based on the content of this post are the responsibility of the trader, not me. I, the author, neither take responsibility for the conduct of others nor offer any guarantees. None of this should be considered as financial advice; the content of this article is only for educational/entertainment purposes.

A few months ago I wrote a blog post about getting stock data from either Quandl or Google using R, and provided a command line R script to automate the task. In this post I repeat the task but with Python. If you’re interested in the motivation and logic of the procedure, I suggest reading the post on the R version. The Python version works similarly.

Why a Python version? After all, the R version produces a CSV file that can be read by just about anything, including Python via Pandas. First, the Python script has one additional feature: it’s a module and thus can be imported in a script. The guts of the script is a function that could be called in another Python script to get data and start using it right away. Second, I want to demonstrate some important tasks in Python.

The script pulls a list of symbols contained in the S&P 500 index from this Wikipedia page. In R I had to take a parsing approach similar to if I were using BeautifulSoup in Python, but Pandas makes the task easier. The function read_html() can be fed an HTML page, and it will parse the page and return a list of tables it read as DataFrames. Lists on Wikipedia especially can be easily turned into DataFrames this way (and Wikipedia has a lot of lists).

We’ve passed the one-year anniversary of my first post on using Python for data analysis. That blog post has been immensely popular and is a top search hit. It easily dominates my viewership stats every day. This is nice but also frustrating; I feel that I have written a lot of articles since that post (a lot of it better, in my opinion) and the new content doesn’t get nearly the same level of viewership. Furthermore, that article is out of date. I would not recommend using the techniques shown there to get stock data.

The main problem is that Yahoo! Finance is no longer the go-to source for stock data. mementum, the author of the backtrader backtesting framework, explained the situation well in a StackExchange question I asked, and I’ll just link to his answer. That aside, people should get their data from some different source. I prefer Quandl. This post shows how to get data from either Quandl or Google Finance, so it should serve as an update to my original blog post.

(FYI, that blog post may be getting an update next year; in fact, we have have a video lecture to accompany it. Braxton Osting, the University of Utah professor who initially requested the lecture, would like for me to give it again for the Introduction to Data Science (MATH 3900) course. We may be filming it too. I will be looking to update the lecture and I’ll share the most recent version on this blog.)

So without further ado, here is the code.

#!/usr/bin/python3
__doc__ = """
Provides the get_sp500_data function that fetches S&P 500 data from either Google or Quandl.
"""
 
__author__ = "Curtis Miller"
__copyright__ = "Copyright (c) 2017, Curtis Grant Miller"
__credits__ = ["Curtis Miller"]
__license__ = "GPL"
__version__ = "0.1.0"
__maintainer__ = "Curtis Miller"
__email__ = "cgmil@msn.com"
__status__ = "Experimental"

import pandas as pd
from pandas import DataFrame
import argparse
import quandl
import pandas_datareader as web
from time import sleep
import datetime as dt
import sys

def get_sp500_data(start=dt.datetime.strptime("1997-01-01", "%Y-%m-%d"),
                   end=dt.datetime.now(), use_quandl=True, adjust=True, inner=True,
                   sleeptime=2, verbose=True):
    """Fetches S&P 500 data
    
    args:
        start: datetime; The earliest possible date
        end: datetime; The last possible date
        use_quandl: bool; Whether to fetch data from Quandl (reverts to Google if False)
        adjust: bool; Whether to use adjusted close (only works with Quandl)
        inner: bool; Whether to use an inner join or outer join when combining series (inner has no missing data)
        sleeptime: int; How long to sleep between fetches
        verbose: bool; Whether to print a log while fetching data
    
    return:
        DataFrame: Contains stock price data
    """
    
    join = "outer"
    if inner:
        join = "inner"
    
    symbols_table = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies",
                                 header=0)[0]
    symbols = list(symbols_table.loc[:, "Ticker symbol"])

    sp500 = None
    for s in symbols:
        sleep(sleeptime)
        if verbose:
            print("Processing: " + s + "...", end='')
        try:
            if use_quandl:
                s_data = quandl.get("WIKI/" + s, start_date=start, end_date=end)
                if adjust:
                    s_data = s_data.loc[:, "Adj. Close"]
                else:
                    s_data = s_data.loc[:, "Close"]
            else:
                s_data = web.DataReader(s, "google", start, end).loc[:, "Close"]
            s_data.name = s
            s_data.dropna()
            if s_data.shape[0] > 1:
                if sp500 is None:
                    sp500 = DataFrame(s_data)
                else:
                    sp500 = sp500.join(s_data, how=join)
                if verbose:
                    print(" Got it! From", s_data.index[0], "to", s_data.index[-1])
            else:
                if verbose:
                    print(" Sorry, but not this one!")
        except Exception:
            if verbose:
                print(" Sorry, but not this one!")

    badsymbols = list(set(s) - set(sp500.columns))
    if verbose and len(badsymbols) > 0:
        print("There were", len(badsymbols), "symbols for which data could not be obtained.")
        print("They are:", ", ".join(badsymbols))
    
    return sp500

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Fetches S&P 500 data")
    parser.add_argument("-v", "--verbose", action="store_true", default=True, dest="verbose",
                        help="Print extra output [default]")
    parser.add_argument("--quietly", action="store_false",
                        dest="verbose", help="Don't print extra output")
    parser.add_argument("-f", "--file", type=str, dest="csv_name",
                        default="sp-500.csv",
                        help="CSV file to save data to [default: sp-500.csv]")
    parser.add_argument("-s", "--sleep", type=int, dest="sleeptime",
                        default=2,
                        help="Time (seconds) between fetching symbols [default: 2] (don't flood websites with requests!)")
    parser.add_argument("--inner", action="store_true", default=False, dest="inner",
                        help="Inner join; only dates where all symbols have data will be included")
    parser.add_argument("--start", type=str, dest="start",
                        default="1997-01-01",
                        help="Earliest date (YYYY-MM-DD) to include [default: 1997-01-01]")
    parser.add_argument("--end", type=str, dest="end",
                        default="today",
                        help='Last date (YYYY-MM-DD or "today") to include [default: "today"]')
    # parser.add_argument("-k", "--key", type="character", dest="api_key",
    #                     default=NULL,
    #                     help="Quandl API key, needed if getting Quandl data")
    parser.add_argument("-q", "--quandl", action="store_true", default=False,
                        dest="use_quandl", help="Get data from Quandl")
    parser.add_argument("-a", "--adjust", action="store_true", default=False,
                        dest="adjust", help="Adjust prices (Quandl only)")
    parser.add_argument("--about", action="store_true", default=False,
                        dest="about",
                        help="Print information about the script and its usage, then quit")

    args = parser.parse_args()

    if args.about:
        print(sys.argv[0], "\n(c) 2017 Curtis Miller\n",
          "Licensed under GNU GPL v. 3.0 available at ",
          "https://www.gnu.org/licenses/gpl-3.0.en.html \n",
          "E-mail: cgmil@msn.com\n\n",
          "This script fetches closing price data for ticker symbols included",
          "in the S&P 500 stock index. A list of symbols included in the index",
          "is fetched from this webpage:",
          "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies  The list",
          "is parsed and the symbols included in the list are fetched from",
          "either Google Finance (the default) or Quandl.",
          "If Quandl is the data source, adjusted data can be",
          "fetched instead. The resulting data set is then saved to a CSV",
          "file in the current working directory.\n\n",
          "This package requires the following Python packages be installed in order",
          "to work (all of which are available through pip):\n\n",
          "* pandas\n",
          "* pandas-datareader\n",
          "* quandl\n\n",
          "This script was written by Curtis Miller and was made available on ",
          "his website: https://ntguardian.wordpress.com\n\n",
          "You can read more about this script in the following article: ",
          "https://ntguardian.wordpress.com/blog\n\n")
        quit()

    if args.end == "today":
        args.end = dt.datetime.now()
    else:
        args.end = dt.datetime.strptime(args.end, "%Y-%m-%d")
    args.start = dt.datetime.strptime(args.start, "%Y-%m-%d")
    sp500 = get_sp500_data(start=args.start,
                   end=args.end, use_quandl=args.use_quandl, adjust=args.adjust, inner=args.inner,
                   sleeptime=args.sleeptime, verbose=args.verbose)
    sp500.to_csv(args.csv_name)

I have created a video course published by Packt Publishing entitled Data Acqusition and Manipulation with Python, the second volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course covers more advanced Pandas topics such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to cover getting data in “messy” formats from Web documents via web scraping. The course covers web scraping using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

8 thoughts on “Getting S&P 500 Stock Data from Quandl/Google with Python

    • Also, I think there’s a few key issues to this code. For one, it simply combines all results for all ticker results pulled from Quandi. But, from what I can gather, It seems that different datasets may have different dates recorded, and different dates in which recording began and ended. This program does not take that into account, and therefore (I think) you’ll have data from different recorded dates on the same row, making the resulting data somewhat useless as the whole point of the S&P 500 is to gauge the general welfare of the market at a specific time.

      Like

Leave a comment