Last year I started publishing video courses with Packt Publishing. These courses formed a series of introductory courses for data analysis using Python, starting with Unpacking NumPy and Pandas and then, in the fall of last year, Data Acquisition and Manipulation with Python. These courses were about obtaining and managing data. My new course is the first course showing what you can do with data.
In this course I cover statistics and machine learning topics. The course assumes little knowledge about what statistics or machine learning involves. I touch lightly on the theory of statistics and machine learning to motivate the tasks performed in the videos.
In my four-hour course I talk about:
These videos are example-intensive; the software is presented in the context of a particular problem involving real data. When I talk about theory I often use visualizations to explain the concepts. Most of the time in the course I’m performing a demonstration that the viewer can follow.
The videos in the course, narrated by me, include not only an explanation of the topic at hand but interactive demonstrations, so viewers can see how to use the software and follow along if they so desire. The video course includes the Jupyter notebooks I use in my demonstrations; viewers can run my code blocks to replicate my results, and edit them for their own experimentation.
You can buy the course on Packt’s website. There is currently a sale on Packt’s website so you can buy the entire course for $10 (compared to the typical retail price of $125), but the sale ends soon! If the price of the course is an obstacle, perhaps consider watching it on Mapt, Packt’s subscription service, which gives you access not only to all my courses but everything Packt has published (and they publish a ton of stuff), along with one free book of your choice to keep every month (likely without any DRM; just a plain ol’ PDF). (I also hear that Packt’s videos are available on other services, such as Lynda, but don’t quote me on that.)
I thank Packt for publishing this course. I also thank my editor, Viranchi Shetty, for offering feedback and keeping me on schedule. The editors at Packt had a big impact on the final product.
If you like my blog and would like to support me, perhaps consider purchasing the course. If you have no need for it or don’t have the money to spend (which I understand completely; I don’t live a life of glamour myself, being a graduate student), I’d love for you to spread the word about the course. Tell a friend wanting to get started in data analysis or data science, or even share this post on Facebook or Twitter or whatever your preferred social network is. Directing more eyes to the course helps. Write a review if you have watched it; I would love to hear your feedback, both positive and negative (though if negative, be gentle and constructive please).
My website has a new page for the new course here.
Thanks for reading! Stay tuned for my announcement of the final course in the series, which discusses particular machine learning applications (NLP, computer vision, and case studies).
If you want to know more about what the course is like, below are some of the videos included in the course.
For some context, the Great Recession, as economists colloquially call the recession that began in 2007 and was punctuated by the 2008 financial crisis, officially ended in June 2009; it was then that the economy resumed growth. As of this writing, that was about eight years, ten months ago. The longest previous stretch between recessions was the time between the early 1990s recession and the early 2000s recession that coincided with the collapse of the dot-com bubble; that stretch lasted ten years, and it is the only one longer than the present period between recessions.
There is growing optimism in the economy, most noticeably amongst consumers, and we are finally seeing wages increase in the United States after years of stagnation. Donald Trump and Republicans point to the economy as a reason to vote for Republicans in November (and yet Donald Trump is still historically unpopular and Democrats have a strong chance of capturing the House, and a fair chance at the Senate). Followers of the American economy are starting to ask, “How long can this last?”
In 2016, I was thinking about this issue in relation to the election. I wanted Hillary Clinton to win, but at the same time I feared that a Clinton win would be a short-term gain but a long-term loss for Democrats. One reason is that I believe there’s a strong chance of a recession within the next few years.
The 2008 financial crisis was a dramatic event, yet the Dodd-Frank reforms and other policy responses, in my opinion, did not go far enough to address the problems unearthed by the financial crisis. Too-big-to-fail institutions are now a part of law (though the policy jargon is systemically important financial institution, or SIFI). In fact, the scandal surrounding HSBC’s support of money laundering and the Justice Department’s weak response suggested bankers may be too-big-to-jail! Many of the financial products and practices that caused the financial crisis are still legal; the fundamentals that produced the crisis have not changed. Barack Obama and the Democrats (and the Republicans, certainly) failed to break the political back of the bankers.
While I did not think Bernie Sanders’ reforms would necessarily make the American economy better, I thought he would put the fear of God back into the financial sector, and that alone could help keep risky behavior in check. Donald Trump, for all his populist rhetoric, has not demonstrated he’s going to put that fear in them. In fact, the Republicans passed a bill that’s a gift to corporations and top earners. The legacy of the 2008 financial crisis is that the financial sector can make grossly risky bets in the good “get government off our back!” times, but will have their losses covered by taxpayers in the “we need government help!” times. Recessions and financial crises are a part of the process of expropriating taxpayers. (I wrote other articles about this topic: see this article and this article, as well as this paper I wrote for an undergraduate class.)
Given all this, there’s good reason to believe that nothing has changed about the American economy that would change the likelihood of a financial crisis. Since it has been so long since the last one, it’s time to start expecting one, and whoever holds the Presidency will be blamed.
Right now that’s Donald Trump and the Republicans. And I don’t need to tell you that given Trump’s popularity in good economic times is historically low, a recession before the 2020 election would lead to a Republican rout, with few survivors.
And in a Census year, too!
So what is the probability of a recession? The rest of this article will focus on finding a statistical model for the duration between recessions and using that model to estimate the probability of a recession.
A recent article in the magazine Significance entitled “The Weibull distribution” describes the Weibull distribution, a common and expressive probability distribution (and one I recently taught in my statistics class). This distribution is used to model many phenomena, including survival times: the time until a system fails, or how long a patient diagnosed with a disease survives. Time until recession sounds like a “survival time”, so perhaps the Weibull distribution can be used to model it.
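To build intuition for why the Weibull distribution suits this kind of “overdue event” reasoning, here’s a quick base-R sketch (my own illustration, not from the Significance article) of the Weibull hazard function: when the shape parameter exceeds 1, the event becomes more likely the longer we have waited.

```r
# Hazard function of a Weibull distribution: h(t) = f(t) / S(t),
# the instantaneous event rate given survival to time t
weibull_hazard <- function(t, shape, scale) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}

# shape > 1: hazard increases with time ("aging"); shape = 1: constant
# hazard (the memoryless exponential); shape < 1: decreasing hazard
round(weibull_hazard(c(1, 5, 9), shape = 2, scale = 5.6), 3)  # increasing
round(weibull_hazard(c(1, 5, 9), shape = 1, scale = 5.6), 3)  # constant
```

A fitted shape above 1 would therefore formalize the intuition that the longer an expansion lasts, the more “due” a recession becomes.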
First, I’m going to be doing some bootstrapping, so here’s the seed for replicability:
set.seed(4182018)
The dataset below, obtained from this Wikipedia article, contains the time between recessions in the United States. I look only at recessions since the Great Depression, considering this to be the “modern” economic era for the United States. The sample size is necessarily small, at 13 observations.
recessions <- c( 4+ 2/12,  6+ 8/12,  3+ 1/12,  3+ 9/12,  3+ 3/12,  2+ 0/12,
                 8+10/12,  3+ 0/12,  4+10/12,  1+ 0/12,  7+ 8/12, 10+ 0/12,
                 6+ 1/12)

hist(recessions)
plot(density(recessions))
The fitdistrplus package allows for estimating the parameters of statistical distributions using the usual statistical techniques. (I found the J. Stat. Soft. article useful for learning about the package.) I load it below and look at an initial plot to get a sense of appropriate distributions.
suppressPackageStartupMessages(library(fitdistrplus))

descdist(recessions, boot = 1000)
## summary statistics
## ------
## min:  1   max:  10
## median:  4.166667
## mean:  4.948718
## estimated sd:  2.71943
## estimated skewness:  0.51865
## estimated kurtosis:  2.349399
The recessions dataset is platykurtic though right-skewed, a surprising result. However, that’s not enough to deter me from attempting to use the Weibull distribution to model time between recessions. (I should mention that I am essentially assuming the times between recessions since the Great Depression are independent and identically distributed. This is not obvious or uncontroversial, but I doubt it could be credibly disproven or that assuming dependence would improve the model.) Let’s fit parameters.
fw <- fitdist(recessions, "weibull")
summary(fw)
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
##       estimate Std. Error
## shape 2.001576  0.4393137
## scale 5.597367  0.8179352
## Loglikelihood:  -30.12135   AIC:  64.2427   BIC:  65.3726
## Correlation matrix:
##           shape     scale
## shape 1.0000000 0.3172753
## scale 0.3172753 1.0000000
plot(seq(0, 15, length.out = 1000),
     dweibull(seq(0, 15, length.out = 1000), shape = fw$estimate["shape"],
              scale = fw$estimate["scale"]),
     col = "blue", type = "l", xlab = "Duration", ylab = "Density",
     main = "Weibull distribution applied to recession duration")
lines(density(recessions))
plot(fw)
The plots above suggest the fitted Weibull distribution describes the observed data well; the Q-Q plot, P-P plot, and the estimated density function all agree with a Weibull distribution. I also compared the AIC of the fitted Weibull distribution to two close candidates, the gamma and log-normal distributions; the Weibull distribution provides the best fit according to the AIC criterion, being about twice as plausible as the log-normal distribution, though only slightly better than the gamma distribution (not surprising, given that the two distributions are similar). Given the interpretations that come with the Weibull distribution and the statistical evidence, I believe it provides the best fit and should be used.
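The AIC comparison described above can be cross-checked with base tools as well. Here is a sketch using MASS::fitdistr (which ships with R) rather than fitdistrplus; the candidate list and this particular cross-check are my own, not code from the original analysis.

```r
library(MASS)  # for fitdistr()

# Time between U.S. recessions since the Great Depression, in years
recessions <- c( 4+ 2/12,  6+ 8/12,  3+ 1/12,  3+ 9/12,  3+ 3/12,  2+ 0/12,
                 8+10/12,  3+ 0/12,  4+10/12,  1+ 0/12,  7+ 8/12, 10+ 0/12,
                 6+ 1/12)

# Fit the three candidate distributions by maximum likelihood
fits <- list(
  weibull   = fitdistr(recessions, "weibull"),
  gamma     = fitdistr(recessions, "gamma"),
  lognormal = fitdistr(recessions, "lognormal")
)

sapply(fits, AIC)  # smaller is better; Weibull should come out on top
```

Since `fitdistr` objects have a `logLik` method, the generic `AIC()` works on them directly, giving an independent check on the fitdistrplus results.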
Based on the form of the distribution and the estimated parameters we can find a point estimate for the probability of a recession both before the 2018 midterm election and before the 2020 presidential election. That is, if $latex T$ is the time between recessions and $latex t$ years have passed since the last one, we can estimate $latex P(T \leq t + \delta \mid T > t) = 1 - \frac{P(T > t + \delta)}{P(T > t)}$.
alpha <- fw$estimate["shape"]
beta  <- fw$estimate["scale"]

recession_prob_wei <- function(delta, passed, shape, scale) {
  # Computes the probability of a recession within the next delta years given
  # passed years
  #
  # args:
  #   delta:  a number representing time to next recession
  #   passed: a number representing time since last recession
  #   shape:  the shape parameter of the Weibull distribution
  #   scale:  the scale parameter of the Weibull distribution

  if (delta < 0 | passed < 0) {
    stop("Both delta and passed must be non-negative")
  }

  return(1 - pweibull(passed + delta, shape = shape, scale = scale,
                      lower.tail = FALSE) /
             pweibull(passed, shape = shape, scale = scale,
                      lower.tail = FALSE))
}
# Recession prob. before 2018 election point estimate
recession_prob_wei(6/12, 8+10/12, shape = alpha, scale = beta)
## [1] 0.252013
# Before 2020 election
recession_prob_wei(2+6/12, 8+10/12, shape = alpha, scale = beta)
## [1] 0.8005031
Judging by the point estimates, there’s a 25% chance of a recession before the 2018 midterm election and an 80% chance of a recession before the 2020 election.
The code below finds bootstrapped 95% confidence intervals for these numbers.
suppressPackageStartupMessages(library(boot))

recession_prob_wei_bootci <- function(data, delta, passed, conf = .95,
                                      R = 1000) {
  # Computes bootstrapped CI for the probability a recession will occur before
  # a certain time given some time has passed
  #
  # args:
  #   data:   A numeric vector containing recession data
  #   delta:  A nonnegative real number representing maximum time till recession
  #   passed: A nonnegative real number representing time since last recession
  #   conf:   A real number between 0 and 1; the confidence level
  #   R:      A positive integer for the number of bootstrap replicates

  bootobj <- boot(data, R = R, statistic = function(data, indices) {
    d <- data[indices]
    params <- fitdist(d, "weibull")$estimate
    return(recession_prob_wei(delta, passed, shape = params["shape"],
                              scale = params["scale"]))
  })

  boot.ci(bootobj, type = "perc", conf = conf)
}

# Bootstrapped 95% CI for probability of recession before 2018 election
recession_prob_wei_bootci(recessions, 6/12, 8+10/12, R = 10000)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.1691,  0.6174 )
## Calculations and Intervals on Original Scale
# Bootstrapped 95% CI for probability of recession before 2020 election
recession_prob_wei_bootci(recessions, 2+6/12, 8+10/12, R = 10000)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.6299,  0.9974 )
## Calculations and Intervals on Original Scale
These CIs suggest that while the probability of a recession before the 2018 midterm is very uncertain (it could plausibly be anywhere between 17% and 62%), my hunch about 2020 has validity; even the lower bound of that CI suggests a recession before 2020 is likely, and the upper bound approaches near certainty.
How bad could it be? That’s hard to say. However, these odds make the Republican tax bill and its trillion-dollar deficits look even more irresponsible; that money will be needed to deal with a potential recession’s fallout.
As bad as 2018 looks for Republicans, it could look like a cakewalk compared to 2020.
(And despite the seemingly jubilant tone, this suggests I may have trouble finding a job in the upcoming years.)
I have created a video course published by Packt Publishing entitled Data Acquisition and Manipulation with Python, the second volume in a four-volume set of video courses entitled Taming Data with Python: Excelling as a Data Analyst. This course covers more advanced Pandas topics such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to cover getting data in “messy” formats from Web documents via web scraping. The course covers web scraping using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
In a conversation about these processes with a fellow graduate student I was explaining the idea that different kernels (covariance functions $latex K(s, t)$) define different Gaussian processes, and simply changing the kernel will produce new processes with completely different properties. Let $latex K(s, t)$ be the kernel of a process. The kernel $latex K(s, t) = \min(s, t)$ is associated with the Wiener process and produces a process that is continuous everywhere but not differentiable anywhere, and with independent, Gaussian-distributed increments. On the other hand, the process defined by the Gaussian kernel $latex K(s, t) = e^{-(s - t)^2}$ is not only continuous but differentiable everywhere, yet does not have independent increments.
I wanted to drive home the point that different kernels yield processes with wildly different properties by simulating and plotting them on a computer. So I whipped out the following R function in less than ten minutes (not counting documentation), and it does exactly what I want it to do.
library(MASS)

gaussprocess <- function(from = 0, to = 1, K = function(s, t) {min(s, t)},
                         start = NULL, m = 1000) {
  # Simulates a Gaussian process with a given kernel
  #
  # args:
  #   from:  numeric for the starting location of the sequence
  #   to:    numeric for the ending location of the sequence
  #   K:     a function that corresponds to the kernel (covariance function)
  #          of the process; must give numeric outputs, and if this won't
  #          produce a positive semi-definite matrix, it could fail; default
  #          is a Wiener process
  #   start: numeric for the starting position of the process; if NULL, could
  #          be randomly generated with the rest
  #   m:     positive integer for the number of points in the process to
  #          simulate
  #
  # return:
  #   A data.frame with variables "t" for the time index and "xt" for the
  #   value of the process

  t <- seq(from = from, to = to, length.out = m)
  Sigma <- sapply(t, function(s1) {
    sapply(t, function(s2) {
      K(s1, s2)
    })
  })

  path <- mvrnorm(mu = rep(0, times = m), Sigma = Sigma)
  if (!is.null(start)) {
    path <- path - path[1] + start  # Must always start at "start"
  }

  return(data.frame("t" = t, "xt" = path))
}
Below are example processes simulated by this function.
(Wiener process)
(Gaussian kernel)
(Something completely different)
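The plots themselves aren’t reproduced here, but calls like the following would generate processes matching those three captions. The kernel forms for the second and third panels are my guesses (a squared-exponential kernel and a white-noise kernel), and `sim_gp()` is a condensed stand-in for the `gaussprocess()` function above so that this snippet runs on its own.

```r
library(MASS)  # for mvrnorm()

# Condensed version of gaussprocess() above, for illustration only
sim_gp <- function(K, m = 200, from = 0, to = 1) {
  t <- seq(from, to, length.out = m)
  Sigma <- outer(t, t, Vectorize(K))  # covariance matrix from the kernel
  data.frame(t = t, xt = mvrnorm(mu = rep(0, m), Sigma = Sigma))
}

set.seed(101)
wiener <- sim_gp(function(s, t) min(s, t))             # Wiener process
smooth <- sim_gp(function(s, t) exp(-16 * (s - t)^2))  # Gaussian kernel (my
                                                       # guess at length scale)
rough  <- sim_gp(function(s, t) as.numeric(s == t))    # something completely
                                                       # different: white noise
plot(wiener, type = "l")
```

The Wiener path is jagged everywhere, the squared-exponential path is visibly smooth, and the white-noise “process” has no continuity at all, which is the point of the demonstration.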
Hopefully you found this code snippet entertaining, if not useful.
UPDATE: The parameter start is now NULL by default. If set to a numeric value, that value will be the starting point of the process; otherwise, the starting point is randomly generated along with the rest of the path. (This matters for, say, simulating a stationary process where the first number in the sequence is not necessarily fixed.)
At the end of June I published my first video course, Unpacking NumPy and Pandas. In the blog post announcement, I said that the course was the first in a series of courses on using Python for data analysis. I recently published the next course in the series with Packt Publishing, entitled Data Acquisition and Manipulation with Python.
This course is effectively two courses in one. Half of the course is devoted to working with what some may call “clean and structured” data, focusing on loading in this data and intermediate Pandas usage, such as aggregation, data reshaping, and data transformation. The other half of the course focuses on tapping the vast database otherwise known as the Internet to create new datasets. The Internet is filled with data that is “messy” and this course introduces viewers to the tools to grab and clean this data.
The course is roughly two-and-a-half hours and is divided into six sections. Their topics are:
I load the videos with demonstrations and example code. Often a section includes a dataset (which comes with the course) or an application that serves as a common thread through the videos. For instance, in the first three sections I frequently revisit a dataset of population data provided by the U.S. Census Bureau, and we see how to manipulate this dataset in ways that are sometimes insightful on their own and sometimes would serve as a first step for a more involved analysis. Each of the last three sections, on the other hand, has its own application, such as extracting Nobel laureate birthdays from Wikipedia, following Google’s “Searches related to” feature, or crawling through Reddit threads.
The videos in the course, narrated by me, include not only an explanation of the topic at hand but interactive demonstrations, so viewers can see how to use the software and follow along if they so desire. The video course includes the Jupyter notebooks I use in my demonstrations; viewers can run my code blocks to replicate my results, and edit them for their own experimentation.
You can buy the course on Packt’s website. If the price of the course is an obstacle, perhaps consider watching it on Mapt, Packt’s subscription service, which gives you access not only to my course but everything Packt has published (and they publish a ton of stuff), along with one free book of your choice to keep every month (likely without any DRM; just a plain ol’ PDF). (I also hear that Packt’s videos are available on other services, such as Lynda, but don’t quote me on that.)
I personally enjoyed writing this course more than Unpacking NumPy and Pandas as it’s using Python in a more creative way (but I consider that course to be a prerequisite). I enjoyed creating the examples seen in the videos, especially the web scraping examples. I hope my viewers enjoy the course as well.
I thank Packt for publishing this course. I also thank my editor, Viranchi Shetty, for offering feedback and keeping me on schedule. The editors at Packt had a big impact on the final product. I look forward to publishing my next course on machine learning with them. Expect that within a few months. I’m enjoying writing that course more (yes, I’m working on it now), and I’m sure many will find it enlightening.
If you like my blog and would like to support me, perhaps consider purchasing the course. If you have no need for it or don’t have the money to spend (which I understand completely; I don’t live a life of glamour myself, being a graduate student), I’d love for you to spread the word about the course. Tell a friend wanting to get started in data analysis or data science, or even share this post on Facebook or Twitter or whatever your preferred social network is. Directing more eyes to the course helps. Write a review if you have watched it; I would love to hear your feedback, both positive and negative (though if negative, be gentle and constructive please).
My website has a new page for the new course here.
Thanks for reading!
If you want to know more about what the course is like, below are some of the videos included in the course.
I would strongly suggest looking at rugarch or rmgarch. The primary
maintainer of the RMetrics suite of packages, Diethelm Wuertz, was
killed in a car crash in 2016. That code is basically unmaintained.
I will see if this solves the problem. Thanks, Brian! I’m leaving this post up, though, as a warning to others to avoid fGarch in the future. This was news to me; books often refer to fGarch, so this post could serve as a resource explaining to those looking to work with GARCH models in R why not to use fGarch.
UPDATE (11/2/17 11:30 PM MDT): I tried a quick experiment with rugarch and it appears to be plagued by this problem as well. Below is some quick code I ran. I may post a full study as soon as tomorrow.
library(rugarch)

spec = ugarchspec(variance.model = list(garchOrder = c(1, 1)),
                  mean.model = list(armaOrder = c(0, 0),
                                    include.mean = FALSE),
                  fixed.pars = list(alpha1 = 0.2, beta1 = 0.2, omega = 0.2))
ugarchpath(spec = spec, n.sim = 1000, n.start = 1000) -> x
srs = x@path$seriesSim

spec1 = ugarchspec(variance.model = list(garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(0, 0),
                                     include.mean = FALSE))
ugarchfit(spec = spec1, data = srs)
ugarchfit(spec = spec1, data = srs[1:100])
These days my research focuses on change point detection methods. These are statistical tests and procedures to detect a structural change in a sequence of data. An early example, from quality control, is detecting whether a machine became uncalibrated when producing a widget. There may be some measurement of interest, such as the diameter of a ball bearing, that we observe. The machine produces these widgets in sequence. Under the null hypothesis, the ball bearing’s mean diameter does not change, while under the alternative, at some unknown point in the manufacturing process the machine became uncalibrated and the mean diameter of the ball bearings changed. The test then decides between these two hypotheses.
These types of tests matter to economists and financial sector workers as well, particularly for forecasting. Once again we have a sequence of data indexed by time; my preferred example is the price of a stock, which people can instantly recognize as a time series given how common time series graphs for stocks are, but there are many more such datasets, like a state’s GDP or the unemployment rate. Economists want to forecast these quantities using past data and statistics. One of the assumptions the statistical methods make is that the series being forecasted is stationary: the data was generated by one process with a single mean, autocorrelation, distribution, etc. This assumption isn’t always tested, yet it is critical to successful forecasting. Tests for structural change check this assumption, and if it turns out to be false, the forecaster may need to divide up their dataset when training their models.
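As a refresher, here is a minimal base-R sketch of one such test, the classic CUSUM statistic for a change in mean. This is my own illustration of the basic idea; the statistics in my actual research are more involved.

```r
# CUSUM statistic for a change in mean: under the null hypothesis of no
# change it behaves like the supremum of a Brownian bridge, whose 5%
# critical value is about 1.358
cusum_stat <- function(x) {
  n <- length(x)
  s <- cumsum(x)
  max(abs(s - (1:n) / n * s[n])) / (sd(x) * sqrt(n))
}

set.seed(2018)
cusum_stat(rnorm(500))                    # no change: modest value
cusum_stat(c(rnorm(250), rnorm(250, 2)))  # mean shift at midpoint: large value
```

A value well above 1.358 in the second call leads us to reject the null hypothesis of a constant mean, which is exactly the behavior a forecaster checking stationarity would want flagged.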
I have written about these tests before, introducing the CUSUM statistic, one of the most popular statistics for detecting structural change. My advisor and a former Ph.D. student of his, Greg Rice (now a professor at the University of Waterloo), developed a new test statistic that better detects structural changes occurring early or late in the dataset (imagine the machine producing widgets became uncalibrated just barely, and only the last dozen of the hundred widgets in the sample were affected). We’re in the process of making revisions requested by a journal to which we submitted our paper, one of the revisions being a better example application (we initially worked with the wage/productivity data I discussed in the aforementioned blog post; the reviewers complained that these variables are codetermined, so it’s nonsense to regress one on the other, a complaint I disagree with but won’t plant my flag on to defend).
We were hoping to apply a version of our test to detecting structural change in GARCH models, a common model in financial time series. To my knowledge, the “state of the art” R package for GARCH model estimation and inference (along with other functionality) is fGarch; in particular, the function garchFit() is used for estimating GARCH models from data. When we tried to use this function in our test, though, we got obviously bad numbers (we had already done simulation studies, so we knew what behavior to expect). The null hypothesis of no change was soundly rejected on simulated sequences where it was true; I never saw the test fail to reject the null hypothesis, even though the null hypothesis was always true. This was the case even for sample sizes of 10,000, hardly a small sample.
We thought the problem might lie with the estimation of the covariance matrix of the parameter estimates, and I painstakingly derived and programmed functions to compute this matrix without numerical differentiation procedures, yet this did not stop the bad behavior. Eventually, last Wednesday, my advisor and I played with garchFit() and decided that the function itself is to blame. Its behavior on simulated data when estimating parameters is so erratic (not necessarily in the covariance matrix, as we initially thought, though that’s likely polluted as well) that the function is, to my knowledge, basically useless.
garchFit() should be well known, and it’s certainly possible that the problem lies with me, not fGarch (or perhaps there are better packages out there). But this strikes me as a function of such importance that I should share my findings. In this article I show a series of numerical experiments demonstrating garchFit()’s pathological behavior.
The GARCH(1, 1) model is a time series model often used to model the volatility of financial instrument returns, such as the returns from stocks. Let $latex x_t$ represent the process; this could represent the deviations in the returns of, say, a stock. The GARCH(1, 1) model (without a mean parameter) is defined recursively as:

$latex x_t = \sigma_t \epsilon_t$

$latex \sigma_t^2 = \omega + \alpha x_{t - 1}^2 + \beta \sigma_{t - 1}^2$
$latex \sigma_t$ is the conditional standard deviation of the process, also known as the conditional volatility, and $latex \epsilon_t$ is a random process.
People who follow finance^{1} noticed that returns to financial instruments (such as stocks or mutual funds) exhibit behavior known as volatility clustering. Some periods a financial instrument is relatively docile; there are not dramatic market movements. In others an instrument’s price can fluctuate greatly, and these periods are not one-off single-day movements but can last for a period of time. GARCH models were developed to model volatility clustering.
It is believed by some that even if a stock’s daily movement is essentially unforecastable (a stock is equally likely to over- or under-perform on any given day), the volatility is forecastable. Even for those who don’t have the hubris to believe anything about future returns can be forecasted, these models are important. For example, if one uses the model $latex r_t = \alpha + \beta r_{M, t} + \epsilon_t$ to estimate the beta statistic for a stock (where $latex r_t$ is the stock’s return at time $latex t$, $latex r_{M, t}$ is the market return, and $latex \epsilon_t$ is “random noise”), there is a good chance that $latex \epsilon_t$ is not an i.i.d. sequence of random numbers (as is commonly assumed in other statistical contexts) but actually a GARCH sequence. The modeller would then want to know the behavior of her estimates in such a situation. Thus GARCH models are considered important. In fact, the volatility clustering behavior I just described is sometimes described as “GARCH behavior”, since it appears frequently and GARCH models are a frequent tool of choice to address it. (The acronym GARCH stands for generalized autoregressive conditional heteroskedasticity, which is statistics-speak for changing, time-dependent volatility.)
$latex \epsilon_t$ can be any random process but a frequent choice is to use a sequence of i.i.d. standard Normal random variables; here $latex \epsilon_t$ is the only source of randomness in the model. In order for a GARCH(1, 1) process to have a stationary solution, we must require that $latex \alpha + \beta < 1$. In this case the process has a long-run variance of $latex \omega / (1 - \alpha - \beta)$.
The process I wrote down above is an infinite process; the index $latex t$ can extend to negative numbers and beyond. Obviously in practice we don’t observe infinite sequences, so if we want to work with these models in practice we need to consider a similar, truncated sequence:

$latex \check{x}_t = \check{\sigma}_t \epsilon_t$

$latex \check{\sigma}_t^2 = \omega + \alpha \check{x}_{t - 1}^2 + \beta \check{\sigma}_{t - 1}^2$
The new sequence’s secret sauce: we choose an initial value for this sequence (the theoretical sequence described earlier does not have an initial value)! This sequence strongly resembles the theoretical sequence, but it is observable in its entirety, and it can be shown that parameters estimated using this sequence closely approximate those of the theoretical, infinite process.
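To make the truncated recursion concrete, here is a sketch simulating it directly in base R (my own code, not fGarch’s; starting the recursion at the long-run variance and burning in the first observations are my choices). With all parameters equal to 0.2, as used later in this article, the sample variance should approach the long-run variance of 0.2 / (1 - 0.2 - 0.2) = 1/3.

```r
# Simulate a truncated GARCH(1, 1) recursion directly; a burn-in period
# washes out the (arbitrary) initial value
sim_garch11 <- function(n, omega, alpha, beta, burn = 500) {
  eps <- rnorm(n + burn)
  x <- sig2 <- numeric(n + burn)
  sig2[1] <- omega / (1 - alpha - beta)  # start at the long-run variance
  x[1] <- sqrt(sig2[1]) * eps[1]
  for (t in 2:(n + burn)) {
    sig2[t] <- omega + alpha * x[t - 1]^2 + beta * sig2[t - 1]
    x[t] <- sqrt(sig2[t]) * eps[t]
  }
  x[(burn + 1):(burn + n)]
}

set.seed(110117)
x <- sim_garch11(10000, omega = 0.2, alpha = 0.2, beta = 0.2)
var(x)  # should be near 0.2 / (1 - 0.2 - 0.2) = 1/3
```

This also gives a simulator whose data-generating process we control completely, which is exactly what is needed to check an estimation routine.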
Naturally one of the most important tasks for these processes is estimating their parameters; for the GARCH(1, 1) process, these are $latex \omega$, $latex \alpha$, and $latex \beta$. A basic approach is to find the quasi-maximum likelihood estimation (QMLE) estimates. Let’s assume that we have $latex n$ observations from our process. In QMLE, we work with the conditional distribution of $latex x_t$ when assuming $latex \epsilon_t$ follows a standard normal distribution (that is, $latex \epsilon_t \sim N(0, 1)$). We assume that the entire history of the process up to time $latex t - 1$ is known; this implies that $latex \sigma_t^2$ is known as well (in fact all we needed to know was the values of the process at time $latex t - 1$, but I digress). In that case we have $latex x_t | \sigma_t^2 \sim N(0, \sigma_t^2)$. Let $latex f_t$ be the conditional density of $latex x_t$ (so $latex f_t(x) = \frac{1}{\sqrt{2 \pi \sigma_t^2}} e^{-x^2 / (2 \sigma_t^2)}$). The quasi-likelihood equation is then

$latex L(\omega, \alpha, \beta) = \prod_{t = 1}^{n} f_t(x_t)$
Like most likelihood methods, rather than optimize the quasi-likelihood function directly, statisticians try to optimize the log-likelihood, $latex \log \mathcal{L}(\omega, \alpha, \beta)$, and after some work it’s not hard to see this is equivalent to minimizing $latex \sum_{t = 1}^{T} \left( \log \sigma_t^2 + \frac{\epsilon_t^2}{\sigma_t^2} \right)$.
Note that $latex \omega$, $latex \alpha$, and $latex \beta$ are involved in this quantity through $latex \sigma_t^2$. There is no closed-form solution for the parameters that minimize this quantity, so numerical optimization techniques must be applied to find them.
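As a sketch of what such a numerical minimization looks like, here is an illustrative Python version using scipy (this is not garchFit()’s internal routine; the initial value for the variance recursion and the starting parameters are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

def qmle_objective(params, eps):
    """Quantity to minimize: sum over t of log(sigma2_t) + eps_t^2 / sigma2_t."""
    omega, alpha, beta = params
    sigma2 = np.empty_like(eps)
    sigma2[0] = eps.var()                     # an arbitrary initial value
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return np.sum(np.log(sigma2) + eps ** 2 / sigma2)

# Simulate a GARCH(1,1) series with omega = alpha = beta = 0.2
rng = np.random.default_rng(110117)
n = 2000
eps = np.zeros(n)
s2 = 1 / 3                                    # the long-run variance
for t in range(1, n):
    s2 = 0.2 + 0.2 * eps[t - 1] ** 2 + 0.2 * s2
    eps[t] = np.sqrt(s2) * rng.standard_normal()
eps = eps[1000:]                              # drop a burn-in period

start = np.array([0.1, 0.1, 0.8])
res = minimize(qmle_objective, start, args=(eps,), method="L-BFGS-B",
               bounds=[(1e-6, None), (1e-6, 1), (1e-6, 1)])
print(res.x)   # estimates of (omega, alpha, beta)
```

The optimizer is only guaranteed to drive the objective below its starting value; as the rest of this article shows, the resulting estimates need not be close to the truth.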
It can be shown that the estimators for the parameters $latex \omega$, $latex \alpha$, and $latex \beta$, when computed this way, are consistent (meaning that asymptotically they approach their true values, in the sense that they converge in probability) and follow a Gaussian distribution asymptotically.^{2} These are properties that we associate with the sample mean, and while the rate of convergence of these estimators may not match that of the sample mean, we may expect comparable asymptotic behavior.
Ideally, then, the parameter estimates should behave like the running sample mean illustrated below.
library(ggplot2)

x <- rnorm(1000, sd = 1/3)
df <- t(sapply(50:1000, function(t) {
  return(c("mean" = mean(x[1:t]), "mean.se" = sd(x[1:t])/sqrt(t)))
}))
df <- as.data.frame(df)
df$t <- 50:1000

ggplot(df, aes(x = t, y = mean)) +
  geom_line() +
  geom_ribbon(aes(x = t, ymin = mean - 2 * mean.se,
                  ymax = mean + 2 * mean.se),
              color = "grey", alpha = 0.5) +
  geom_hline(color = "blue", yintercept = 0) +
  coord_cartesian(ylim = c(-0.5, 0.5))
Before continuing let’s generate a GARCH(1,1) sequence. Throughout this article I work with GARCH(1,1) processes where all parameters are equal to 0.2. Notice that with this choice the long-run variance will be $latex 0.2 / (1 - 0.2 - 0.2) = 1/3$.
set.seed(110117)
library(fGarch)

x <- garchSim(garchSpec(model = list("alpha" = 0.2, "beta" = 0.2,
                                     "omega" = 0.2)),
              n.start = 1000, n = 1000)
plot(x)
Let’s see the parameters that the fGarch function garchFit() uses.
args(garchFit)
## function (formula = ~garch(1, 1), data = dem2gbp, init.rec = c("mci",
##     "uev"), delta = 2, skew = 1, shape = 4, cond.dist = c("norm",
##     "snorm", "ged", "sged", "std", "sstd", "snig", "QMLE"), include.mean = TRUE,
##     include.delta = NULL, include.skew = NULL, include.shape = NULL,
##     leverage = NULL, trace = TRUE, algorithm = c("nlminb", "lbfgsb",
##     "nlminb+nm", "lbfgsb+nm"), hessian = c("ropt", "rcd"),
##     control = list(), title = NULL, description = NULL, ...)
## NULL
The function provides a few options for the distribution to maximize (cond.dist) and the algorithm to use for optimization (algorithm). Here I will always choose cond.dist = "QMLE", unless otherwise stated, to instruct the function to use QMLE estimators.
Here’s a single pass.
garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE)
## 
## Series Initialization:
## ARMA Model: arma
## Formula Mean: ~ arma(0, 0)
## GARCH Model: garch
## Formula Variance: ~ garch(1, 1)
## ARMA Order: 0 0
## Max ARMA Order: 0
## GARCH Order: 1 1
## Max GARCH Order: 1
## Maximum Order: 1
## Conditional Dist: QMLE
## h.start: 2
## llh.start: 1
## Length of Series: 1000
## Recursion Init: mci
## Series Scale: 0.5320977
## 
## Parameter Initialization:
## Initial Parameters: $params
## Limits of Transformations: $U, $V
## Which Parameters are Fixed? $includes
## Parameter Matrix:
##                  U          V params includes
## mu     -0.15640604   0.156406    0.0    FALSE
## omega   0.00000100 100.000000    0.1     TRUE
## alpha1  0.00000001   1.000000    0.1     TRUE
## gamma1 -0.99999999   1.000000    0.1    FALSE
## beta1   0.00000001   1.000000    0.8     TRUE
## delta   0.00000000   2.000000    2.0    FALSE
## skew    0.10000000  10.000000    1.0    FALSE
## shape   1.00000000  10.000000    4.0    FALSE
## Index List of Parameters to be Optimized:
## omega alpha1  beta1
##      2      3      5
## Persistence: 0.9
## 
## 
## --- START OF TRACE ---
## Selected Algorithm: nlminb
## 
## R coded nlminb Solver:
## 
##   0: 1419.0152: 0.100000 0.100000 0.800000
##   1: 1418.6616: 0.108486 0.0998447 0.804683
##   2: 1417.7139: 0.109746 0.0909961 0.800931
##   3: 1416.7807: 0.124977 0.0795152 0.804400
##   4: 1416.7215: 0.141355 0.0446605 0.799891
##   5: 1415.5139: 0.158059 0.0527601 0.794304
##   6: 1415.2330: 0.166344 0.0561552 0.777108
##   7: 1415.0415: 0.195230 0.0637737 0.743465
##   8: 1415.0031: 0.200862 0.0576220 0.740088
##   9: 1414.9585: 0.205990 0.0671331 0.724721
##  10: 1414.9298: 0.219985 0.0713468 0.712919
##  11: 1414.8226: 0.230628 0.0728325 0.697511
##  12: 1414.4689: 0.325750 0.0940514 0.583114
##  13: 1413.4560: 0.581449 0.143094 0.281070
##  14: 1413.2804: 0.659173 0.157127 0.189282
##  15: 1413.2136: 0.697840 0.155964 0.150319
##  16: 1413.1467: 0.720870 0.142550 0.137645
##  17: 1413.1416: 0.726527 0.138146 0.135966
##  18: 1413.1407: 0.728384 0.137960 0.134768
##  19: 1413.1392: 0.731725 0.138321 0.132991
##  20: 1413.1392: 0.731146 0.138558 0.133590
##  21: 1413.1392: 0.730849 0.138621 0.133850
##  22: 1413.1392: 0.730826 0.138622 0.133869
## 
## Final Estimate of the Negative LLH:
## LLH: 782.211 norm LLH: 0.782211
##     omega    alpha1     beta1
## 0.2069173 0.1386221 0.1338686
## 
## R-optimhess Difference Approximated Hessian Matrix:
##            omega     alpha1      beta1
## omega  -8858.897 -1839.6144 -2491.9827
## alpha1 -1839.614  -782.8005  -531.7393
## beta1  -2491.983  -531.7393  -729.7246
## attr(,"time")
## Time difference of 0.04132652 secs
## 
## --- END OF TRACE ---
## 
## 
## Time to Estimate Parameters:
## Time difference of 0.3866439 secs
## 
## Title:
## GARCH Modelling
## 
## Call:
## garchFit(data = x, cond.dist = "QMLE", include.mean = FALSE)
## 
## Mean and Variance Equation:
## data ~ garch(1, 1)
## <environment: 0xa636ba4>
## [data = x]
## 
## Conditional Distribution:
## QMLE
## 
## Coefficient(s):
##   omega  alpha1   beta1
## 0.20692 0.13862 0.13387
## 
## Std. Errors:
## robust
## 
## Error Analysis:
##        Estimate Std. Error t value Pr(>|t|)
## omega   0.20692    0.05102   4.056    5e-05 ***
## alpha1  0.13862    0.04928   2.813  0.00491 **
## beta1   0.13387    0.18170   0.737  0.46128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log Likelihood:
## -782.211 normalized: -0.782211
## 
## Description:
## Thu Nov 2 13:01:14 2017 by user:
The estimated parameters are not necessarily near the true parameters. One might initially attribute this to randomness, but that doesn’t seem to be the case.
For example, what fit do I get when I fit the model on the first 500 data points?
garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE)
## 
## Series Initialization:
## ARMA Model: arma
## Formula Mean: ~ arma(0, 0)
## GARCH Model: garch
## Formula Variance: ~ garch(1, 1)
## ARMA Order: 0 0
## Max ARMA Order: 0
## GARCH Order: 1 1
## Max GARCH Order: 1
## Maximum Order: 1
## Conditional Dist: QMLE
## h.start: 2
## llh.start: 1
## Length of Series: 500
## Recursion Init: mci
## Series Scale: 0.5498649
## 
## Parameter Initialization:
## Initial Parameters: $params
## Limits of Transformations: $U, $V
## Which Parameters are Fixed? $includes
## Parameter Matrix:
##                  U           V params includes
## mu     -0.33278068   0.3327807    0.0    FALSE
## omega   0.00000100 100.0000000    0.1     TRUE
## alpha1  0.00000001   1.0000000    0.1     TRUE
## gamma1 -0.99999999   1.0000000    0.1    FALSE
## beta1   0.00000001   1.0000000    0.8     TRUE
## delta   0.00000000   2.0000000    2.0    FALSE
## skew    0.10000000  10.0000000    1.0    FALSE
## shape   1.00000000  10.0000000    4.0    FALSE
## Index List of Parameters to be Optimized:
## omega alpha1  beta1
##      2      3      5
## Persistence: 0.9
## 
## 
## --- START OF TRACE ---
## Selected Algorithm: nlminb
## 
## R coded nlminb Solver:
## 
##   0: 706.37230: 0.100000 0.100000 0.800000
##   1: 706.27437: 0.103977 0.100309 0.801115
##   2: 706.19091: 0.104824 0.0972295 0.798477
##   3: 706.03116: 0.112782 0.0950253 0.797812
##   4: 705.77389: 0.122615 0.0858136 0.788169
##   5: 705.57316: 0.134608 0.0913105 0.778144
##   6: 705.43424: 0.140011 0.0967118 0.763442
##   7: 705.19541: 0.162471 0.102711 0.739827
##   8: 705.16325: 0.166236 0.0931680 0.737563
##   9: 705.09943: 0.168962 0.100977 0.731085
##  10: 704.94924: 0.203874 0.0958205 0.702986
##  11: 704.78210: 0.223975 0.108606 0.664678
##  12: 704.67414: 0.250189 0.122959 0.630886
##  13: 704.60673: 0.276532 0.131788 0.595346
##  14: 704.52185: 0.335952 0.146435 0.520961
##  15: 704.47725: 0.396737 0.157920 0.448557
##  16: 704.46540: 0.442499 0.164111 0.396543
##  17: 704.46319: 0.440935 0.161566 0.400606
##  18: 704.46231: 0.442951 0.159225 0.400940
##  19: 704.46231: 0.443022 0.159284 0.400863
##  20: 704.46230: 0.443072 0.159363 0.400851
##  21: 704.46230: 0.443112 0.159367 0.400807
## 
## Final Estimate of the Negative LLH:
## LLH: 405.421 norm LLH: 0.810842
##     omega    alpha1     beta1
## 0.1339755 0.1593669 0.4008074
## 
## R-optimhess Difference Approximated Hessian Matrix:
##            omega     alpha1      beta1
## omega  -8491.005 -1863.4127 -2488.5700
## alpha1 -1863.413  -685.6071  -585.4327
## beta1  -2488.570  -585.4327  -744.1593
## attr(,"time")
## Time difference of 0.02322888 secs
## 
## --- END OF TRACE ---
## 
## 
## Time to Estimate Parameters:
## Time difference of 0.1387401 secs
## 
## Title:
## GARCH Modelling
## 
## Call:
## garchFit(data = x[1:500], cond.dist = "QMLE", include.mean = FALSE)
## 
## Mean and Variance Equation:
## data ~ garch(1, 1)
## <environment: 0xa85f084>
## [data = x[1:500]]
## 
## Conditional Distribution:
## QMLE
## 
## Coefficient(s):
##   omega  alpha1   beta1
## 0.13398 0.15937 0.40081
## 
## Std. Errors:
## robust
## 
## Error Analysis:
##        Estimate Std. Error t value Pr(>|t|)
## omega   0.13398    0.11795   1.136   0.2560
## alpha1  0.15937    0.07849   2.030   0.0423 *
## beta1   0.40081    0.44228   0.906   0.3648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log Likelihood:
## -405.421 normalized: -0.810842
## 
## Description:
## Thu Nov 2 13:01:15 2017 by user:
Notice that the parameter $latex \beta$ (listed as beta1) changed dramatically. How about a different cutoff?
garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE)
## 
## Series Initialization:
## ARMA Model: arma
## Formula Mean: ~ arma(0, 0)
## GARCH Model: garch
## Formula Variance: ~ garch(1, 1)
## ARMA Order: 0 0
## Max ARMA Order: 0
## GARCH Order: 1 1
## Max GARCH Order: 1
## Maximum Order: 1
## Conditional Dist: QMLE
## h.start: 2
## llh.start: 1
## Length of Series: 200
## Recursion Init: mci
## Series Scale: 0.5746839
## 
## Parameter Initialization:
## Initial Parameters: $params
## Limits of Transformations: $U, $V
## Which Parameters are Fixed? $includes
## Parameter Matrix:
##                  U           V params includes
## mu     -0.61993813   0.6199381    0.0    FALSE
## omega   0.00000100 100.0000000    0.1     TRUE
## alpha1  0.00000001   1.0000000    0.1     TRUE
## gamma1 -0.99999999   1.0000000    0.1    FALSE
## beta1   0.00000001   1.0000000    0.8     TRUE
## delta   0.00000000   2.0000000    2.0    FALSE
## skew    0.10000000  10.0000000    1.0    FALSE
## shape   1.00000000  10.0000000    4.0    FALSE
## Index List of Parameters to be Optimized:
## omega alpha1  beta1
##      2      3      5
## Persistence: 0.9
## 
## 
## --- START OF TRACE ---
## Selected Algorithm: nlminb
## 
## R coded nlminb Solver:
## 
##   0: 280.63354: 0.100000 0.100000 0.800000
##   1: 280.63302: 0.100315 0.100088 0.800223
##   2: 280.63262: 0.100695 0.0992822 0.800059
##   3: 280.63258: 0.102205 0.0983397 0.800404
##   4: 280.63213: 0.102411 0.0978709 0.799656
##   5: 280.63200: 0.102368 0.0986702 0.799230
##   6: 280.63200: 0.101930 0.0984977 0.800005
##   7: 280.63200: 0.101795 0.0983937 0.799987
##   8: 280.63197: 0.101876 0.0984197 0.799999
##   9: 280.63197: 0.102003 0.0983101 0.799965
##  10: 280.63197: 0.102069 0.0983780 0.799823
##  11: 280.63197: 0.102097 0.0983703 0.799827
##  12: 280.63197: 0.102073 0.0983592 0.799850
##  13: 280.63197: 0.102075 0.0983616 0.799846
## 
## Final Estimate of the Negative LLH:
## LLH: 169.8449 norm LLH: 0.8492246
##      omega     alpha1      beta1
## 0.03371154 0.09836156 0.79984610
## 
## R-optimhess Difference Approximated Hessian Matrix:
##             omega    alpha1     beta1
## omega  -26914.901 -6696.498 -8183.925
## alpha1  -6696.498 -2239.695 -2271.547
## beta1   -8183.925 -2271.547 -2733.098
## attr(,"time")
## Time difference of 0.02161336 secs
## 
## --- END OF TRACE ---
## 
## 
## Time to Estimate Parameters:
## Time difference of 0.09229803 secs
## 
## Title:
## GARCH Modelling
## 
## Call:
## garchFit(data = x[1:200], cond.dist = "QMLE", include.mean = FALSE)
## 
## Mean and Variance Equation:
## data ~ garch(1, 1)
## <environment: 0xad38a84>
## [data = x[1:200]]
## 
## Conditional Distribution:
## QMLE
## 
## Coefficient(s):
##    omega   alpha1    beta1
## 0.033712 0.098362 0.799846
## 
## Std. Errors:
## robust
## 
## Error Analysis:
##        Estimate Std. Error t value Pr(>|t|)
## omega   0.03371    0.01470   2.293   0.0218 *
## alpha1  0.09836    0.04560   2.157   0.0310 *
## beta1   0.79985    0.03470  23.052   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log Likelihood:
## -169.8449 normalized: -0.8492246
## 
## Description:
## Thu Nov 2 13:01:15 2017 by user:
For 200 observations $latex \beta$ is estimated to be enormous, with a relatively tiny standard error!
Let’s dive deeper into this. I’ve conducted a number of numerical experiments on the University of Utah Mathematics department’s supercomputer. Below is a helper function that extracts the coefficients and standard errors from a fit produced by garchFit() (suppressing all of garchFit()’s output in the process).
getFitData <- function(x, cond.dist = "QMLE", include.mean = FALSE, ...) {
  args <- list(...)
  args$data <- x
  args$cond.dist <- cond.dist
  args$include.mean <- include.mean

  # Capture (and discard) garchFit()'s console output
  log <- capture.output({
    fit <- do.call(garchFit, args = args)
  })

  res <- coef(fit)
  res[paste0(names(fit@fit$se.coef), ".se")] <- fit@fit$se.coef
  return(res)
}
The first experiment is to compute the coefficients of this particular series at each possible end point.
(The following code block is not evaluated when this document is knitted; I have saved the results in a .Rda file. This will be the case for every code block that involves parallel computation; these are the computations I ran on the supercomputer, with the results saved for use here.)
library(doParallel)

set.seed(110117)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)),
              n.start = 1000, n = 1000)

params <- foreach(t = 50:1000, .combine = rbind,
                  .packages = c("fGarch")) %dopar% {
  getFitData(x[1:t])
}
rownames(params) <- 50:1000
Below I plot these coefficients, along with a region corresponding to double the standard error. This region should roughly correspond to 95% confidence intervals.
params_df <- as.data.frame(params)
params_df$t <- as.numeric(rownames(params))

ggplot(params_df) +
  geom_line(aes(x = t, y = beta1)) +
  geom_hline(yintercept = 0.2, color = "blue") +
  geom_ribbon(aes(x = t, ymin = beta1 - 2 * beta1.se,
                  ymax = beta1 + 2 * beta1.se),
              color = "grey", alpha = 0.5) +
  ylab(expression(hat(beta))) +
  scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 1)) +
  coord_cartesian(ylim = c(0, 1))
This is an alarming picture (but not the most alarming I’ve seen; this is one of the better cases). Notice that the confidence interval fails to capture the true value of $latex \beta$ up until about 375 data points; these intervals should contain the true value about 95% of the time! This is in addition to the confidence interval being fairly large.
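For contrast, here is what nominal coverage looks like when the theory does apply — a quick Python check (illustrative only; the seed and sample sizes are arbitrary) of mean ± 2·SE intervals for i.i.d. Normal data:

```python
import numpy as np

rng = np.random.default_rng(110117)
n, reps, true_mean = 200, 5000, 0.0

covered = 0
for _ in range(reps):
    x = rng.normal(loc=true_mean, scale=1/3, size=n)
    se = x.std(ddof=1) / np.sqrt(n)        # estimated standard error
    if abs(x.mean() - true_mean) <= 2 * se:
        covered += 1

print(covered / reps)   # should be close to 0.95
```

Roughly 95% of these intervals cover the truth, which is exactly the behavior the GARCH intervals fail to deliver at small sample sizes.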
Let’s see how the other parameters behave.
library(reshape2)
library(plyr)
library(dplyr)

param_reshape <- function(p) {
  p <- as.data.frame(p)
  p$t <- as.integer(rownames(p))

  pnew <- melt(p, id.vars = "t", variable.name = "parameter")
  pnew$parameter <- as.character(pnew$parameter)
  # Select/strip names ending in ".se" (anchored regex)
  pnew.se <- pnew[grepl("\\.se$", pnew$parameter), ]
  pnew.se$parameter <- sub("\\.se$", "", pnew.se$parameter)
  names(pnew.se)[3] <- "se"
  pnew <- pnew[!grepl("\\.se$", pnew$parameter), ]

  return(join(pnew, pnew.se, by = c("t", "parameter"), type = "inner"))
}
ggp <- ggplot(param_reshape(params), aes(x = t, y = value)) +
  geom_line() +
  geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se),
              color = "grey", alpha = 0.5) +
  geom_hline(yintercept = 0.2, color = "blue") +
  scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) +
  coord_cartesian(ylim = c(0, 1)) +
  facet_grid(. ~ parameter)

print(ggp + ggtitle("NLMINB Optimization"))
The phenomenon is not limited to $latex \beta$. $latex \omega$ also exhibits undesirable behavior. ($latex \alpha$ isn’t great either, but much better.)
This behavior isn’t unusual; it’s typical. Below are plots for similar series generated with different seeds.
seeds <- c(103117, 123456, 987654, 101010, 8675309, 81891, 222222,
           999999, 110011)

experiments1 <- foreach(s = seeds) %do% {
  set.seed(s)
  x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2,
                                       omega = 0.2)),
                n.start = 1000, n = 1000)
  params <- foreach(t = 50:1000, .combine = rbind,
                    .packages = c("fGarch")) %dopar% {
    getFitData(x[1:t])
  }
  rownames(params) <- 50:1000
  params
}
names(experiments1) <- seeds
experiments1 <- lapply(experiments1, param_reshape)
names(experiments1) <- c(103117, 123456, 987654, 101010, 8675309, 81891,
                         222222, 999999, 110011)
experiments1_df <- ldply(experiments1, .id = "seed")
head(experiments1_df)
##     seed  t parameter     value        se
## 1 103117 50     omega 0.1043139 0.9830089
## 2 103117 51     omega 0.1037479 4.8441246
## 3 103117 52     omega 0.1032197 4.6421147
## 4 103117 53     omega 0.1026722 1.3041128
## 5 103117 54     omega 0.1020266 0.5334988
## 6 103117 55     omega 0.2725939 0.6089607
ggplot(experiments1_df, aes(x = t, y = value)) +
  geom_line() +
  geom_ribbon(aes(ymin = value - 2 * se, ymax = value + 2 * se),
              color = "grey", alpha = 0.5) +
  geom_hline(yintercept = 0.2, color = "blue") +
  scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) +
  coord_cartesian(ylim = c(0, 1)) +
  facet_grid(seed ~ parameter) +
  ggtitle("Successive parameter estimates using NLMINB optimization")
In this plot we see pathologies of other kinds for $latex \hat{\beta}$, especially for seeds 222222 and 999999, where $latex \hat{\beta}$ is chronically far below the correct value. For all of these simulations $latex \hat{\beta}$ starts much larger than the correct value, near 1, and for the two seeds mentioned earlier it jumps from being very high to suddenly very low. (Not shown here are results for seeds 110131 and 110137; they’re even worse!)
The other parameters are not without their own pathologies, but the situation does not seem quite so grim. It’s possible the pathologies we do see are tied to the estimation of $latex \beta$. In fact, if we look at the analogous experiment for the ARCH(1) process (which is a GARCH(1,0) process, equivalent to setting $latex \beta = 0$) we see better behavior.
set.seed(110117)

x <- garchSim(garchSpec(model = list(alpha = 0.2, beta = 0.2, omega = 0.2)),
              n.start = 1000, n = 1000)
xarch <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2,
                                         beta = 0)),
                  n.start = 1000, n = 1000)

params_arch <- foreach(t = 50:1000, .combine = rbind,
                       .packages = c("fGarch")) %dopar% {
  getFitData(xarch[1:t], formula = ~ garch(1, 0))
}
rownames(params_arch) <- 50:1000
print(ggp %+% param_reshape(params_arch) + ggtitle("ARCH(1) Model"))
The pathology appears to be numerical in nature and closely tied to $latex \beta$. By default, garchFit() uses nlminb() (a quasi-Newton method with constraints) to solve the optimization problem, with a numerically-computed gradient. We can choose alternative methods, though; we can use the L-BFGS-B method, and we can spice both with the Nelder-Mead method.
Unfortunately these alternative optimization algorithms don’t do better; they may even do worse.
# lbfgsb algorithm
params_lbfgsb <- foreach(t = 50:1000, .combine = rbind,
                         .packages = c("fGarch")) %dopar% {
  getFitData(x[1:t], algorithm = "lbfgsb")
}
rownames(params_lbfgsb) <- 50:1000

# nlminb+nm algorithm
params_nlminbnm <- foreach(t = 50:1000, .combine = rbind,
                           .packages = c("fGarch")) %dopar% {
  getFitData(x[1:t], algorithm = "nlminb+nm")
}
rownames(params_nlminbnm) <- 50:1000

# lbfgsb+nm algorithm
params_lbfgsbnm <- foreach(t = 50:1000, .combine = rbind,
                           .packages = c("fGarch")) %dopar% {
  getFitData(x[1:t], algorithm = "lbfgsb+nm")
}
rownames(params_lbfgsbnm) <- 50:1000

# cond.dist is norm (the default)
params_norm <- foreach(t = 50:1000, .combine = rbind,
                       .packages = c("fGarch")) %dopar% {
  getFitData(x[1:t], cond.dist = "norm")
}
rownames(params_norm) <- 50:1000
print(ggp %+% param_reshape(params_lbfgsb) + ggtitle("L-BFGS-B Optimization"))
print(ggp %+% param_reshape(params_nlminbnm) + ggtitle("nlminb Optimization with Nelder-Mead"))
print(ggp %+% param_reshape(params_lbfgsbnm) + ggtitle("L-BFGS-B Optimization with Nelder-Mead"))
Admittedly, though, QMLE is not the default estimation method garchFit() uses; the default is maximum likelihood with the Normal distribution (cond.dist = "norm"). Unfortunately this is no better.
print(ggp %+% param_reshape(params_norm) + ggtitle("cond.dist = 'norm'"))
On CRAN, fGarch has not seen an update since 2013! It’s possible that fGarch is starting to show its age and that newer packages have addressed some of the problems I’ve highlighted here. The package tseries provides a function garch() that also fits GARCH models via QMLE, and it has been updated more recently than fGarch. It is the only other package I am aware of that fits GARCH models.
Unfortunately, garch() doesn’t do much better; in fact, it appears to be much worse. Once again, the problem lies with $latex \beta$.
library(tseries)

getFitDatagarch <- function(x) {
  garch(x)$coef
}

params_tseries <- foreach(t = 50:1000, .combine = rbind,
                          .packages = c("tseries")) %dopar% {
  getFitDatagarch(x[1:t])
}
rownames(params_tseries) <- 50:1000
param_reshape_tseries <- function(p) {
  p <- as.data.frame(p)
  p$t <- as.integer(rownames(p))

  pnew <- melt(p, id.vars = "t", variable.name = "parameter")
  pnew$parameter <- as.character(pnew$parameter)
  return(pnew)
}

ggplot(param_reshape_tseries(params_tseries), aes(x = t, y = value)) +
  geom_line() +
  geom_hline(yintercept = 0.2, color = "blue") +
  scale_y_continuous(breaks = c(0, 0.2, 0.25, 0.5, 0.75, 1)) +
  coord_cartesian(ylim = c(0, 1)) +
  facet_grid(. ~ parameter)
All of these experiments were performed on fixed (yet randomly chosen) sequences. They suggest that, especially for sample sizes of less than, say, 300 (possibly larger), distributional guarantees for the parameter estimates are suspect. What happens when we simulate many GARCH(1,1) processes and look at the distribution of the parameter estimates?
I simulated 10,000 GARCH(1,1) processes at each of the sample sizes 100, 500, and 1000 (using the same parameters as before). Below are the empirical distributions of the parameter estimates.
experiments2 <- foreach(n = c(100, 500, 1000)) %do% {
  mat <- foreach(i = 1:10000, .combine = rbind,
                 .packages = c("fGarch")) %dopar% {
    x <- garchSim(garchSpec(model = list(omega = 0.2, alpha = 0.2,
                                         beta = 0.2)),
                  n.start = 1000, n = n)
    getFitData(x)
  }
  rownames(mat) <- NULL
  mat
}
names(experiments2) <- c(100, 500, 1000)

save(params, x, experiments1, xarch, params_arch, params_lbfgsb,
     params_nlminbnm, params_lbfgsbnm, params_norm, params_tseries,
     experiments2, file = "garchfitexperiments.Rda")
param_sim <- lapply(experiments2, function(mat) {
  df <- as.data.frame(mat)
  df <- df[c("omega", "alpha1", "beta1")]
  return(df)
}) %>% ldply(.id = "n")
param_sim <- param_sim %>% melt(id.vars = "n", variable.name = "parameter")
head(param_sim)
##     n parameter        value
## 1 100     omega 8.015968e-02
## 2 100     omega 2.493595e-01
## 3 100     omega 2.300699e-01
## 4 100     omega 3.674244e-07
## 5 100     omega 2.697577e-03
## 6 100     omega 2.071737e-01
ggplot(param_sim, aes(x = value)) +
  geom_density(fill = "grey", alpha = 0.7) +
  geom_vline(xintercept = 0.2, color = "blue") +
  facet_grid(n ~ parameter)
When the sample size is 100, these estimators are far from reliable. $latex \hat{\alpha}$ and $latex \hat{\beta}$ have an unnerving tendency to be almost 0, and $latex \hat{\omega}$ can be just about anything. As we saw above, the standard errors reported by garchFit() do not capture this behavior. For larger sample sizes $latex \hat{\omega}$ and $latex \hat{\alpha}$ behave better, but $latex \hat{\beta}$ still displays unnerving behavior: its spread barely changes and it still has a propensity to be far too small.
What bothers me most is that a sample size of 1,000 strikes me as large. If one were looking at daily data for, say, stock prices, this sample size roughly corresponds to four years of data. This suggests that this pathological behavior is affecting GARCH models people are estimating and using right now.
An article by John C. Nash entitled “On best practice optimization methods in R”, published in the Journal of Statistical Software in September 2014, discussed the need for better optimization practices in R. In particular, he highlighted, among others, the methods garchFit() uses (or at least their R implementations) as outdated. He argues for greater awareness in the community of optimization issues and for greater flexibility in packages, going beyond merely using the different algorithms provided by optim().
The issues I highlighted in this article made me more aware of the importance of the choice of optimization method. My initial objective was to write a function for performing statistical tests for detecting structural change in GARCH models. These tests rely heavily on successive estimation of the parameters of GARCH models, as I demonstrated here. At minimum, my experiments show that the variation in the parameter estimates isn’t being captured adequately by the standard errors, but there is also the potential for unacceptably high instability in the estimates themselves. They’re so unstable it would take a miracle for the test not to reject the null hypothesis of no change. After all, just looking at the plots of estimates from simulated data, one might conclude that the alternative hypothesis of structural change is true. Thus every time I have tried to implement our test on data where the null hypothesis was supposedly true, the test unequivocally rejected it with $latex p$-values of essentially 0.
I have heard of people conducting hypothesis tests for detecting structural change in GARCH models, so I would not be surprised if the numerical instability I have written about here can be avoided. This is a subject I admittedly know little about, and I hope that if someone in the R community has already observed this behavior and knows how to resolve it, they let me know in the comments or via e-mail. I may write a retraction and show how to produce stable estimates of the parameters with garchFit(). Perhaps the key lies in the function garchFitControl().
I’ve also thought about writing my own optimization routine tailored to my test. Prof. Nash emphasized in his paper the importance of tailoring optimization routines to the needs of particular problems. I’ve written down the quantity to optimize, and I have a formula for the gradient and Hessian matrix of the log-likelihood. Perhaps successive optimizations as required by our test could use the parameters from previous iterations as initial values, helping prevent optimizers from finding distant, locally optimal yet globally suboptimal solutions.
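The warm-start idea can be sketched in Python on a toy Gaussian likelihood standing in for the GARCH objective (hypothetical code; negloglik, the window sizes, and the data are all invented for illustration): each window’s optimization starts from the previous window’s solution, so the optimizer begins near a good point.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(20171102)
y = rng.normal(loc=1.0, scale=2.0, size=500)

def negloglik(theta, data):
    # Gaussian negative log-likelihood; the log-sigma parameterization
    # keeps the variance positive without explicit constraints
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2)
                        + (data - mu) ** 2 / sigma2)

theta = np.array([0.0, 0.0])   # a cold start for the first window only
for t in range(100, 501, 100):
    res = minimize(negloglik, theta, args=(y[:t],), method="Nelder-Mead")
    theta = res.x              # warm start for the next, longer window

print(theta)   # should be near (1, log 2)
```

For successive, heavily overlapping windows the previous solution is typically close to the new optimum, which is exactly the situation the structural-change test creates.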
Already, though, this makes the problem more difficult than I initially thought finding an example for our test would be. I’m planning to table detecting structural change in GARCH models for now and to use instead an example involving merely linear regression (a much more tractable problem). But I hope to hear others’ input on what I’ve written here.
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
## 
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
## 
## other attached packages:
## [1] dplyr_0.7.2           plyr_1.8.4            reshape2_1.4.2
## [4] fGarch_3010.82.1      fBasics_3011.87       timeSeries_3022.101.2
## [7] timeDate_3012.100     ggplot2_2.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     bindr_0.1        knitr_1.17       magrittr_1.5
##  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.0         rlang_0.1.2
##  [9] stringr_1.2.0    tools_3.3.3      grid_3.3.3       gtable_0.2.0
## [13] htmltools_0.3.6  assertthat_0.1   yaml_2.1.14      lazyeval_0.2.0
## [17] rprojroot_1.2    digest_0.6.12    tibble_1.3.4     bindrcpp_0.2
## [21] glue_1.1.1       evaluate_0.10    rmarkdown_1.6    labeling_0.3
## [25] stringi_1.1.5    scales_0.4.1     backports_1.0.5  pkgconfig_2.0.1
A few months ago I wrote a blog post about getting stock data from either Quandl or Google using R, and provided a command line R script to automate the task. In this post I repeat the task but with Python. If you’re interested in the motivation and logic of the procedure, I suggest reading the post on the R version. The Python version works similarly.
Why a Python version? After all, the R version produces a CSV file that can be read by just about anything, including Python via Pandas. First, the Python script has one additional feature: it’s a module, and thus its functionality can be imported into another script. The guts of the script is a function that could be called from other Python code to get data and start using it right away. Second, I want to demonstrate some important tasks in Python.
The script pulls a list of symbols contained in the S&P 500 index from this Wikipedia page. In R I had to take a parsing approach similar to what I would use with BeautifulSoup in Python, but Pandas makes the task easier. The function read_html() can be fed an HTML page, and it will parse the page and return a list of the tables it read as DataFrames. Lists on Wikipedia especially can easily be turned into DataFrames this way (and Wikipedia has a lot of lists).
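Here is a toy illustration of that workflow with an inline HTML table instead of the Wikipedia page (the table contents are made up; StringIO is used because newer pandas versions expect a file-like object, and an HTML parser such as lxml must be installed):

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Ticker symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M Company</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html), header=0)   # a list of DataFrames
symbols = list(tables[0].loc[:, "Ticker symbol"])
print(symbols)   # ['MMM', 'ABT']
```

Swapping the inline string for the Wikipedia URL is all the script below does differently.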
We’ve passed the one-year anniversary of my first post on using Python for data analysis. That blog post has been immensely popular and is a top search hit. It easily dominates my viewership stats every day. This is nice but also frustrating; I feel that I have written a lot of articles since that post (a lot of it better, in my opinion) and the new content doesn’t get nearly the same level of viewership. Furthermore, that article is out of date. I would not recommend using the techniques shown there to get stock data.
The main problem is that Yahoo! Finance is no longer the go-to source for stock data. mementum, the author of the backtrader backtesting framework, explained the situation well in a StackExchange question I asked, and I’ll just link to his answer. That aside, people should get their data from some different source. I prefer Quandl. This post shows how to get data from either Quandl or Google Finance, so it should serve as an update to my original blog post.
(FYI, that blog post may be getting an update next year; in fact, we may have a video lecture to accompany it. Braxton Osting, the University of Utah professor who initially requested the lecture, would like for me to give it again for the Introduction to Data Science (MATH 3900) course. We may be filming it too. I will be looking to update the lecture and I’ll share the most recent version on this blog.)
So without further ado, here is the code.
#!/usr/bin/python3

__doc__ = """
Provides the get_sp500_data function that fetches S&P 500 data from either
Google or Quandl.
"""

__author__ = "Curtis Miller"
__copyright__ = "Copyright (c) 2017, Curtis Grant Miller"
__credits__ = ["Curtis Miller"]
__license__ = "GPL"
__version__ = "0.1.0"
__maintainer__ = "Curtis Miller"
__email__ = "cgmil@msn.com"
__status__ = "Experimental"

import argparse
import datetime as dt
import sys
from time import sleep

import pandas as pd
from pandas import DataFrame
import quandl
import pandas_datareader as web


def get_sp500_data(start=dt.datetime.strptime("1997-01-01", "%Y-%m-%d"),
                   end=dt.datetime.now(), use_quandl=True, adjust=True,
                   inner=True, sleeptime=2, verbose=True):
    """Fetches S&P 500 data

    args:
        start: datetime; the earliest possible date
        end: datetime; the last possible date
        use_quandl: bool; whether to fetch data from Quandl (reverts to
            Google if False)
        adjust: bool; whether to use the adjusted close (only works with
            Quandl)
        inner: bool; whether to use an inner join or an outer join when
            combining series (an inner join has no missing data)
        sleeptime: int; how long to sleep between fetches (in seconds)
        verbose: bool; whether to print a log while fetching data

    return:
        DataFrame: contains stock price data
    """
    join = "inner" if inner else "outer"
    symbols_table = pd.read_html(
        "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies",
        header=0)[0]
    symbols = list(symbols_table.loc[:, "Ticker symbol"])
    sp500 = None
    for s in symbols:
        sleep(sleeptime)    # don't flood websites with requests
        if verbose:
            print("Processing: " + s + "...", end='')
        try:
            if use_quandl:
                s_data = quandl.get("WIKI/" + s, start_date=start,
                                    end_date=end)
                if adjust:
                    s_data = s_data.loc[:, "Adj. Close"]
                else:
                    s_data = s_data.loc[:, "Close"]
            else:
                s_data = web.DataReader(s, "google", start,
                                        end).loc[:, "Close"]
            s_data.name = s
            s_data = s_data.dropna()    # dropna() returns a copy; keep it
            if s_data.shape[0] > 1:
                if sp500 is None:
                    sp500 = DataFrame(s_data)
                else:
                    sp500 = sp500.join(s_data, how=join)
                if verbose:
                    print(" Got it! From", s_data.index[0], "to",
                          s_data.index[-1])
            else:
                if verbose:
                    print(" Sorry, but not this one!")
        except Exception:
            if verbose:
                print(" Sorry, but not this one!")

    badsymbols = list(set(symbols) - set(sp500.columns))    # was set(s), a bug
    if verbose and len(badsymbols) > 0:
        print("There were", len(badsymbols),
              "symbols for which data could not be obtained.")
        print("They are:", ", ".join(badsymbols))

    return sp500


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Fetches S&P 500 data")
    parser.add_argument("-v", "--verbose", action="store_true", default=True,
                        dest="verbose", help="Print extra output [default]")
    parser.add_argument("--quietly", action="store_false", dest="verbose",
                        help="Don't print extra output")
    parser.add_argument("-f", "--file", type=str, dest="csv_name",
                        default="sp-500.csv",
                        help="CSV file to save data to [default: sp-500.csv]")
    parser.add_argument("-s", "--sleep", type=int, dest="sleeptime",
                        default=2,
                        help="Time (seconds) between fetching symbols "
                             "[default: 2] (don't flood websites with "
                             "requests!)")
    parser.add_argument("--inner", action="store_true", default=False,
                        dest="inner",
                        help="Inner join; only dates where all symbols have "
                             "data will be included")
    parser.add_argument("--start", type=str, dest="start",
                        default="1997-01-01",
                        help="Earliest date (YYYY-MM-DD) to include "
                             "[default: 1997-01-01]")
    parser.add_argument("--end", type=str, dest="end", default="today",
                        help='Last date (YYYY-MM-DD or "today") to include '
                             '[default: "today"]')
    # parser.add_argument("-k", "--key", type="character", dest="api_key",
    #                     default=NULL,
    #                     help="Quandl API key, needed if getting Quandl data")
    parser.add_argument("-q", "--quandl", action="store_true", default=False,
                        dest="use_quandl", help="Get data from Quandl")
    parser.add_argument("-a", "--adjust", action="store_true", default=False,
                        dest="adjust", help="Adjust prices (Quandl only)")
    parser.add_argument("--about", action="store_true", default=False,
                        dest="about",
                        help="Print information about the script and its "
                             "usage, then quit")

    args = parser.parse_args()

    if args.about:
        print(sys.argv[0], "\n(c) 2017 Curtis Miller\n",
              "Licensed under GNU GPL v. 3.0 available at ",
              "https://www.gnu.org/licenses/gpl-3.0.en.html \n",
              "E-mail: cgmil@msn.com\n\n",
              "This script fetches closing price data for ticker symbols",
              "included in the S&P 500 stock index. A list of symbols",
              "included in the index is fetched from this webpage:",
              "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies",
              "The list is parsed and the symbols included in the list are",
              "fetched from either Google Finance (the default) or Quandl.",
              "If Quandl is the data source, adjusted data can be fetched",
              "instead. The resulting data set is then saved to a CSV file",
              "in the current working directory.\n\n",
              "This script requires the following Python packages be",
              "installed in order to work (all of which are available",
              "through pip):\n\n",
              "* pandas\n",
              "* pandas-datareader\n",
              "* quandl\n\n",
              "This script was written by Curtis Miller and was made",
              "available on his website: https://ntguardian.wordpress.com\n\n",
              "You can read more about this script in the following",
              "article: https://ntguardian.wordpress.com/blog\n\n")
        quit()

    if args.end == "today":
        args.end = dt.datetime.now()
    else:
        args.end = dt.datetime.strptime(args.end, "%Y-%m-%d")
    args.start = dt.datetime.strptime(args.start, "%Y-%m-%d")

    sp500 = get_sp500_data(start=args.start, end=args.end,
                           use_quandl=args.use_quandl, adjust=args.adjust,
                           inner=args.inner, sleeptime=args.sleeptime,
                           verbose=args.verbose)
    sp500.to_csv(args.csv_name)
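The heart of the function is the sp500.join(s_data, how=join) call: an inner join keeps only dates where every symbol so far has data, while an outer join keeps all dates and leaves NaN where a symbol is missing. Here is a minimal sketch of that behavior using two made-up price series (the symbols AAA and BBB and the numbers are hypothetical, not real market data):

```python
import pandas as pd
from pandas import DataFrame

# Two short price series with partially overlapping dates (made-up numbers)
idx1 = pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"])
idx2 = pd.to_datetime(["2017-01-04", "2017-01-05"])
aaa = pd.Series([10.0, 10.5, 10.2], index=idx1, name="AAA")
bbb = pd.Series([20.0, 19.8], index=idx2, name="BBB")

# Same pattern the script uses: seed a DataFrame, then join on the index
inner = DataFrame(aaa).join(bbb, how="inner")   # only the shared dates
outer = DataFrame(aaa).join(bbb, how="outer")   # all dates, NaN where missing

print(inner.shape)                  # (2, 2)
print(outer.shape)                  # (3, 2)
print(outer["BBB"].isnull().sum())  # 1
```

This is why the --inner flag guarantees a data set with no missing values, at the cost of dropping any date on which even one symbol lacks data.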
I have created a video course published by Packt Publishing entitled Data Acquisition and Manipulation with Python, the second volume in a four-volume set of video courses entitled Taming Data with Python; Excelling as a Data Analyst. This course covers more advanced pandas topics, such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to getting data in “messy” formats from Web documents via web scraping, using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis, or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
Aside from my video courses, I have not done any serious Python programming lately. However, I did finish a one-day project that I’ve wanted to do for a while: writing a Python Mad Lib program.
For those of you who don’t know, Mad Libs are a word game for generating humorous stories. One player (working effectively as a game master) picks a story template, which is a story with certain parts–nouns, verbs, adjectives, etc.–omitted. The player asks for these sentence parts from other participants without revealing the context and fills the words in. At the end the resulting story, filled with nonsense, is read aloud.
Writing a Python Mad Lib program makes good practice for string manipulation, particularly substring replacement, and for code organization via classes. In order to write such a program you need:

- a story template with blanks for the missing words;
- a description of each blank, to use as the prompt shown to players;
- a way to collect the players’ words; and
- a way to substitute those words into the template and print the result.
This is all best managed via a class. As for string manipulation, by far the easiest solution is to rely on the format() method supplied with strings. It’s hard to beat the notation "I went to the {noun} today.".format(noun="park") or "I went to the {noun1} today with my {noun2}.".format(**words), where words = {"noun1": "park", "noun2": "dog"}.
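These snippets can be run as-is; a dictionary of collected words splats straight into format():

```python
# Filling blanks with str.format() and a dictionary of collected words
words = {"noun1": "park", "noun2": "dog"}
sentence = "I went to the {noun1} today with my {noun2}.".format(**words)
print(sentence)  # I went to the park today with my dog.
```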
Additionally, this project demonstrates class inheritance. We can write a MadLib object that abstractly manages these tasks, providing a framework and methods, without actually containing a worthwhile Mad Lib. This allows for easy generalization; future Mad Lib objects subclass the MadLib object, and all future authors need do is provide the story string and the descriptions of the “blanks”.
Below I’ve provided code for the base class and a few examples. The code can be stuck in files and made executable on Linux systems.
#!/usr/bin/python3

__doc__ = """
Provides the MadLib object, a template for creating madlibs; see the
documentation of MadLib for instructions on how to properly subclass it for
creating new madlib generators. An example madlib generator object, called
MurderMadLib, is provided.
"""

__author__ = "Curtis Miller"
__copyright__ = "Copyright (c) 2017, Curtis Grant Miller"
__credits__ = ["Curtis Miller"]
__license__ = "GPL"
__version__ = "0.1.0"
__maintainer__ = "Curtis Miller"
__email__ = "cgmil@msn.com"
__status__ = "Experimental"


class MadLib(object):
    """An object for managing madlibs.

    A new madlib subclassed from this object should change only the
    __init__() method, and assign a new _text (string for formatting) and
    _descriptors (dict).

    _inputs should always be an empty dict (it's okay if it's not, but its
    contents will be reset by get_inputs(), so any content put there will do
    nothing).

    _text should be a string that the format() method can work with, being
    passed keys of a dict; an example of a valid _text is:

        "One day, a {occupation1} {verb1} to a {place1} to buy {object1}s."

    _descriptors has keys corresponding to EVERY field in _text (errors will
    be thrown if this is not the case) and values that instruct users, in
    the prompt, what is desired. (So, for example, a key/value pair would
    be:

        "occupation1": "an occupation for a person (e.g. firefighter)"

    and so on.)
    """

    def __init__(self):
        """Initialization; get initial objects

        This is the only method changed when creating new madlibs.
        Declares _text and _descriptors
        """
        self._text = ""
        self._descriptors = dict()

    def get_inputs(self):
        """Prompt the user (via command line) for input, and store in
        _inputs
        """
        self._inputs = dict()    # Reset inputs
        for k, v in self._descriptors.items():
            # Prompt for input
            self._inputs[k] = str(input("I need " + v + ": "))

    def print_result(self):
        """Print the resulting mad-lib"""
        try:
            print(self._text.format(**self._inputs))
        except Exception as e:
            raise AssertionError("You should run get_inputs() first!")


class MurderMadLib(MadLib):
    def __init__(self):
        """initialization; story related to a murder"""
        # Original source:
        # http://www.sltrib.com/home/5476673-155/utah-man-charged-with-murder-in
        # (it's actually depressing, by the way)
        self._text = " ".join([
            "A MURDER CASE",
            "\n-------------",
            "\n\nA {age3}-year-old {town1} man has been",
            "charged with",  # Sorry for bad formatting
            "{crime1} and {crime2} in the death last year of",
            "{living1}.\n\n{boy1} made an initial appearance",
            "on Monday in {place1}, where he is charged with",
            "first-degree felony {crime1} and three counts",
            "of second-degree felony {crime2} in connection",
            "with {living1}.\n\n{living1} was pronounced",
            "dead on Nov. 1 after {boy1}, then {age1}, and",
            "{living1}'s {age2}-year-old {living2} brought",
            "the unresponsive {living1} to {place2}.\n\nAn",
            "autopsy later determined {living1} died from",
            "\"{injury1} to the {bodypart1}, with associated",
            "trauma to the {bodypart2} all consistent with",
            "non-accidental {verb1} ... ,\" according to",
            "charging documents.\n\nThe autopsy also",
            "revealed evidence of possible attempted {verb2}",
            "or {verb13}, a fresh {bodypart3} {injury2},",
            "and an older, healing {bodypart4} injury,",
            "charges state.\n\n{boy1} told police",
            "investigators that he took over the care of",
            "{living1} at about 11:30 p.m. that night, and",
            "that he was keeping {living1} up late so he",
            "would {verb3} through the night.\n\n{boy1} told",
            "police that he was watching {object1} with",
            "{living1} when, at about 3:10 a.m., he found",
            "{living1} was \"{adjective1}\" and would not",
            "{verb4}, charges state.\n\nHe woke {living1}'s",
            "{living2} and they {verb5}, then took {living1}",
            "to {place3} in their {object2}, charges",
            "state.\n\n{boy1}'s sister later told police",
            "that he and {living1}'s {living2} \"were having",
            "a hard time {verb6},\" and that she was looking",
            "after {living1} about once a week because they",
            "were \"frustrated with {living1},\" charges",
            "state.\n\nA scheduling hearing in the case is",
            "set for July 17.\n\n{boy1} was being held at",
            "the {town1} jail in lieu of ${dollar1} bail,",
            "cash-only.\n\nMeanwhile, {boy1} has two other",
            "pending court cases.\n\nHe is charged with",
            "second-degree {crime3} for {verb7} a {object3}",
            "found on his {object4} that shows him {verb8}",
            "with the {living2} of {living1}.\n\nHe also is",
            "charged with two counts of {crime4} for",
            "allegedly {verb9} at a {noun1} and a {noun2}",
            "last August in {town2}. The alleged victims",
            "were trying to {verb10} when they asked {boy1}",
            "to {verb11}, charges state. {boy1} responded by",
            "{verb12} and {verb14}, charges state."])
        self._descriptors = {
            "town1": "a city or town",
            "town2": "a city or town",
            "crime1": "a crime",
            "crime2": "a crime",
            "crime3": "a crime",
            "crime4": "a crime",
            "boy1": "a boy's name",
            "place1": "a place for people (e.g. gas station)",
            "place2": "a place for people (e.g. gas station)",
            "place3": "a place for people (e.g. gas station)",
            "living1": ' '.join(["a living thing (specific",
                                 "tense, like \"the bird\" or",
                                 "\"Sam Smith\")"]),
            "living2": "a living thing (no articles)",
            "injury1": "a word describing an injury (one word)",
            "injury2": "a word describing an injury (one word)",
            "bodypart1": "a bodypart",
            "bodypart2": "a bodypart",
            "bodypart3": "a bodypart",
            "bodypart4": "a bodypart",
            "object1": "an object",
            "object2": "an object",
            "object3": "an object",
            "object4": "an object",
            "verb1": "a verb or action (e.g. \"running\")",
            "verb2": "a verb or action (e.g. \"running\")",
            "verb13": "a verb or action (e.g. \"running\")",
            "verb3": "a verb (e.g. \"run\")",
            "verb4": "a verb (e.g. \"run\")",
            "verb5": "a verb or action (e.g. \"flipped\")",
            "verb6": "a verb or action (e.g. \"running\")",
            "verb7": "a verb or action (e.g. \"running\")",
            "verb8": "a verb or action (e.g. \"running\")",
            "verb9": "a verb or action (e.g. \"running\")",
            "verb10": "a verb (e.g. \"run\")",
            "verb11": "a verb or action (e.g. \"run\")",
            "verb12": "a verb or action (e.g. \"running\")",
            "verb14": "a verb or action (e.g. \"running\")",
            "noun1": "a noun",
            "noun2": "a noun",
            "age1": "an age (in years)",
            "age2": "an age (in years)",
            "age3": "an age (in years)",
            "dollar1": "an amount of money",
            "adjective1": "an adjective"}


if __name__ == '__main__':
    # Run demo madlib
    demo = MurderMadLib()
    demo.get_inputs()
    print("\n\n")
    demo.print_result()
#!/usr/bin/python3

__doc__ = """
Extra mad libs, for fun.
"""

__author__ = "Curtis Miller"
__copyright__ = "Copyright (c) 2017, Curtis Grant Miller"
__credits__ = ["Curtis Miller"]
__license__ = "GPL"
__version__ = "0.1.0"
__maintainer__ = "Curtis Miller"
__email__ = "cgmil@msn.com"
__status__ = "Experimental"

from MadLib import MadLib


class EmperorMagicianML(MadLib):
    """A madlib about an emperor and a magician"""

    def __init__(self):
        self._descriptors = {
            "boy1": "a boy's name",
            "emotion1": "an emotion",
            "emotion2": "an emotion",
            "occupation1": "a job or occupation",
            "verb1": "a verb (past tense, like \"jumped\")",
            "superlative1": "a superlative (e.g. \"best\")",
            "superlative2": "a superlative (e.g. \"best\")",
            "occupation2": "a job or occupation",
            "adjective1": "an adjective",
            "name1": "a name",
            "adjective2": "an adjective",
            "verb2": "a verb (present simple third person, like "
                     "\"hates\" or \"jumps\")",
            "living1": "a living thing",
            "place1": "a place (cannot be a proper noun)",
            "adjective3": "an adjective",
            "adjective4": "an adjective",
            "noun1": "a dwelling",
            "verb3": "a verb (past tense)",
            "noun2": "a noun (cannot be a proper noun)",
            "adverb1": "an adverb",
            "noun3": "a dwelling",
            "object1": "an object",
            "verb4": "a verb",
            "verb5": "a verb",
            "verb6": "a verb",
            "emotion3": "an emotion (as a noun, e.g. \"anger\" or "
                        "\"happiness\")",
            "verb7": "a verb",
            "adjective5": "an adjective",
            "place2": "a place (specific tense, either a proper noun "
                      "or beginning with \"the\")",
            "noun4": "a noun (cannot be a proper noun)",
            "bodypart1": "a bodypart",
            "bodypart2": "a bodypart",
            "adverb2": "an adverb",
            "bodypart3": "a bodypart",
            "verb8": "a verb (past tense)",
            "verb9": "a verb (active tense, like \"jumping\")",
            "verb10": "a verb (past tense)",
            "verb11": "a verb (past tense)",
            "adjective6": "an adjective"}
        self._text = ' '.join([
            "The Emperor and the {occupation2}",
            "\n---------------------------------\n"
            "\nOnce upon a time there was a young emperor",
            "ruling over a distant land, named {boy1}. One",
            "day {boy1} was {emotion1}. \"I'm {emotion2},\"",
            "{boy1} told his {occupation1}. \"I wish to be",
            "{verb1}! Bring me the {superlative1}",
            "{occupation2} in all the land!\"\n\n\"Well,",
            "that would be the {adjective1} {occupation2},",
            "{name1} the {adjective2},\" {boy1}'s",
            "{occupation1} replied. \"Unfortunately, I fear",
            "that {name1} {verb2} your rule.\"\n\n\"I do",
            "not care,\" {boy1} replied. \"See that {name1}",
            "comes here immediately!\"\n\nSo {boy1}'s",
            "{living1} travelled through the land to find",
            "{name1} the {adjective2}. They found {name1}",
            "deep in the {place1} in a {adjective3}",
            "{adjective4} {noun1}. They {verb3} on",
            "{name1}'s {noun2}, saying, \"The emperor",
            "{boy1} demands your presence!\"\n\n\"Then",
            "I shall go,\" {name1} said {adverb1}.\n\nThe",
            "journey took days, but {name1} eventually",
            "arrived at {boy1}'s {noun3}, and stood before",
            "{boy1}'s {object1}, and began to {verb4}.",
            "\n\n\"I shall now {verb5} my {superlative2}",
            "trick,\" {name1} said, which made {boy1}",
            "{verb6} in his {object1} with {emotion3}.",
            "\"I shall {verb7} a {adjective5} {noun4} into",
            "{place2}, where it shall never be seen again!",
            "And I shall do so with just my {bodypart1}.\"",
            "{name1} waved his/her {bodypart2} {adverb2},",
            "then poked out his/her {bodypart1}, and threw",
            "it above his/her {bodypart3}. Suddenly {boy1}",
            "{verb8} from his {object1}, {verb9}, and",
            "{verb10} many miles out of sight. Then {name1}",
            "{verb11} upon {boy1}'s {object1}, and",
            "proclaimed himself Emperor/herself Empress.",
            "\n\nThe reign of Emperor/Empress {name1} was",
            "{adjective6}, while {boy1} was never seen",
            "again."])


if __name__ == '__main__':
    madlib_choices = [EmperorMagicianML]
    choice = str(input("Pick a number between 1 and " +
                       str(len(madlib_choices)) + " (an integer): "))
    choice = int(choice) - 1
    ml = madlib_choices[choice]()
    ml.get_inputs()
    print('\n\n')
    ml.print_result()
I would love to see others write their own Mad Libs using my code and share them here. There are a few ways of doing this, but here’s what I would suggest:

- subclass MadLib, changing only the __init__() method;
- assign _text, a story string with format() fields for the blanks; and
- assign _descriptors, a dict with a key for every field in _text, whose values are the prompts shown to players.
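For example, a new Mad Lib needs nothing more than an __init__ that sets _text and _descriptors; everything else is inherited. The sketch below shows the shape with a hypothetical PicnicMadLib (not from my code), using a small stand-in base class so the snippet runs on its own; in practice you would subclass the MadLib class above and get the interactive prompting for free:

```python
# Stand-in base class so this sketch is self-contained; in practice,
# subclass the MadLib class from the MadLib module instead.
class MadLibBase(object):
    def __init__(self):
        self._text = ""
        self._descriptors = dict()

    def fill(self, **inputs):
        # The real MadLib prompts interactively via get_inputs();
        # here we pass the words directly for brevity.
        return self._text.format(**inputs)


class PicnicMadLib(MadLibBase):
    """A hypothetical example madlib; only __init__ changes."""
    def __init__(self):
        self._text = ("One {adjective1} morning, {name1} packed a {noun1} "
                      "and went to the {place1}.")
        self._descriptors = {"adjective1": "an adjective",
                             "name1": "a name",
                             "noun1": "an object",
                             "place1": "a place"}


story = PicnicMadLib().fill(adjective1="soggy", name1="Ada",
                            noun1="trombone", place1="library")
print(story)  # One soggy morning, Ada packed a trombone and went to the library.
```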
I conclude with a story I generated. Please, share your code and stories in the comments. I would love to see what you get!
Once upon a time there was a young emperor ruling over a distant land, named John. One day John was afraid. “I’m loved,” John told his janitor. “I wish to be walked! Bring me the loveliest professor in all the land!”
“Well, that would be the slimy professor, Amy the young,” John’s janitor replied. “Unfortunately, I fear that Amy paints your rule.”
“I do not care,” John replied. “See that Amy comes here immediately!”
So John’s banker travelled through the land to find Amy the young. They found Amy deep in the shop in a moist red dorm. They lay on Amy’s board, saying, “The emperor John demands your presence!”
“Then I shall go,” Amy said greedily.
The journey took days, but Amy eventually arrived at John’s houseboat, and stood before John’s ticket, and began to float.
“I shall now shuffle my squishiest trick,” Amy said, which made John climb in his ticket with boredom. “I shall jump a tired bus into University of Utah, where it shall never be seen again! And I shall do so with just my foot.” Amy waved her nose slowly, then poked out her foot, and threw it above her finger. Suddenly John drove from his ticket, shopping, and grabbed many miles out of sight. Then Amy sucked upon John’s ticket, and proclaimed herself Empress.
The reign of Empress Amy was hard, while John was never seen again.
(I know it’s not very good, but I don’t care.)
I have created a video course entitled Unpacking NumPy and Pandas, the first volume in a four-volume set of video courses entitled Taming Data with Python; Excelling as a Data Analyst. This course covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis, or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there (when it becomes available).
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
It’s been a while since I shared something on this site (I’ve been studying for the past month for two qualifying exams, and that consumed all my time). Here I share an excellent article that introduced me to the idea of the “mathematical elite” as a real, socially relevant, and powerful (politically and otherwise) group.
While you’re there, check out the work of Cathy O’Neil (mathbabe). I read Weapons of Math Destruction in a day during the spring and found it eye-opening and enthralling. I think anyone who works in quantitatively intense subjects should read that book.
This is a guest post by Michael J. Barany, a postdoc in History at Dartmouth.
One year ago, I wrote a post for the Scientific American Guest Blog arguing against the widespread truism that mathematics is everywhere. The post laid out the history of mathematics as a special and exclusive kind of knowledge wielded by privileged elites. I claimed that the idea that math is everywhere not only gets the history wrong, but also misrepresents how mathematics matters most in most people’s lives, and may be a misguided premise on which to build a more inclusive and responsible discipline. If we start by recognizing the bias and exclusion that affect who gets to use advanced mathematics to intervene in the world, we might get better at responding to those biases while empowering the vast majority in the mathematical non-elite to hold the mathematical elite accountable for the great power
View original post 3,196 more words
Today I’m sharing my favorite blog post of the week, written by a blogger with username cranklin.
Have whatever opinion you want about crypto-currencies, Bitcoin, and so on, but many in the business world take blockchain technology very seriously. (I read about it regularly in The Economist, and a new book by David Birch entitled Before Babylon, Beyond Bitcoin imagines a new crypto-currency world order with money entering a new evolutionary state, with everything from governments to community churches issuing their own coins; the book is on my reading list.) Perhaps the best (and most fun) way to learn about the technology is to create your own crypto-currency. Then share it with your friends and family, because why not?
Perhaps at some point in the future I will create my own coin and write about it as well. If I do, I will be using this post as a reference.
I’ve been itching to build my own cryptocurrency… and I shall give it an unoriginal & narcissistic name: Cranky Coin.
After giving it a lot of thought, I decided to use Python. GIL thread concurrency is sufficient. Mining might suffer, but can be replaced with a C mining module. Most importantly, code will be easier to read for open source contributors and will be heavily unit tested. Using frozen pip dependencies, virtualenv, and vagrant or docker, we can fire this up fairly easily under any operating system.
I decided to make Cranky Coin a VERY simple, but complete cryptocurrency/blockchain/wallet system. This implementation will not include smart contracts, transaction rewards, nor utilize Merkle trees. Its only purpose is to act as a decentralized ledger. It is rudimentary, but I will eventually fork a few experimental blockchains with advanced features from this one.
The Wallet
This currency will only be compatible with…
View original post 1,688 more words
Today a bunch of Internet companies, including Reddit, Google, Twitter and others are trying to spread the word and build public opposition to proposed FCC rule changes that would threaten net neutrality once again.
You may remember an episode like this a few years ago, but let’s recap. Net neutrality is the principle that an internet service provider (ISP) should not (and for now, cannot) change their service based on the data being transmitted. Websites such as Google, Facebook, Netflix, etc. are all treated the same; the ISP does not speed up, slow down, or block data because it comes from these websites. All Internet data is treated the same.
The FCC tried to reduce net neutrality protections years ago (under the Obama administration), and the result was a public outcry not just from websites and tech companies but from netizens themselves. This led to the FCC backing off.
In 2015, in a lawsuit brought by ISPs such as Verizon, courts ruled that the FCC would need to designate ISPs as a utility in order to be able to enforce net neutrality rules, so the FCC changed the legal designation of ISPs to do this. Now, under the Trump administration, the FCC is run by Chairman Pai, a former Verizon lawyer, and he is keenly interested in killing net neutrality and bringing a radical free-market approach to the FCC, using slogans like “restore internet freedom” to put lipstick on the “kill net neutrality” pig.
He is doing this by changing the regulatory designation of ISPs so that net neutrality rules are defanged. He seems to think he can get ISPs to pledge to respect net neutrality in their policies, and this will be enough to protect the Internet.
Net neutrality protected by pinky promises? One of the following words must describe Chairman Pai: mad, corrupt, or dumb.
I implore my readers to help protect net neutrality by making your voices heard. Go to battleforthenet.com to send a letter to the FCC and get on a call with your Congressional delegation to voice your support for net neutrality protection and stop Chairman Pai’s schemes. If you want more information, perhaps watch this clip by John Oliver, where he explains the issue in greater depth.
I decided not to use the provided letter when writing my Congressional delegation (I worry about form letters and whether they will have the same effect as personalized ones). So in writing to my Congressional delegation (Sen. Hatch, Sen. Lee, and Rep. Love), I wrote the following:
To my Congressional delegation,
The FCC, under Chairman Pai, is considering new rules that will weaken net neutrality protection.
In 2015, in response to anti-net neutrality lawsuits brought by ISPs such as Verizon (who oppose net neutrality), the FCC reclassified ISPs as utilities in order to protect net neutrality. Now the ISPs are pushing the FCC, under Chairman Pai (a former Verizon lawyer and a fox guarding the hen house), to remove that designation and thus give them free rein to curb net neutrality. They would be free to boost or restrict access based on the data being sent (that is, the websites that users are visiting or the content that users submit online).
Chairman Pai claims that the companies will voluntarily respect net neutrality. The companies themselves claim they will. Rubbish. If companies were going to voluntarily respect these rules, they would have no reason to be pushing so hard for their reversal, spending millions of dollars in lawsuits and lobbying to kill net neutrality rules. They have never historically shown any interest in net neutrality preservation. They make their promises with fingers crossed behind their backs, and Chairman Pai knows it.
Chairman Pai is a crazed free-marketeer. I have no reason to believe that less regulation of ISPs (particularly in the issue of net neutrality) will lead to a better Internet experience for me. The ISP market is not a free market; it consists of a handful of for-profit companies (an oligopoly), not only with profit but also political agendas that could taint their service, who could easily make the Internet a less expressive environment and kill the innovation the Internet brings. For all the talk of government ruining markets, I believe the ISPs are the greatest threat to the Internet we know.
As legislators, you can protect the Internet, and I beg you to do so by pressuring the FCC and Chairman Pai to back away from these Internet-killing rule changes. I have written to you and others in my delegation before and have only been disappointed by your actions, responses, and statements. I hope this time will be different.
Sincerely,
Curtis Miller
I’ve lived in this state for a long time and know that Sen. Hatch and Sen. Lee are tools that prefer to be on the wrong side of issues (Rep. Love is slightly better, but only slightly). So please write to your Congressional delegation; they may be more useful than mine.