Last week I announced the first release of **MCHT**, an R package that facilitates bootstrap and Monte Carlo hypothesis testing. In this article, I will elaborate on some important technical details about making `MCHTest` objects, explaining in the process how closures and R environments work.

To recap, last week I made a basic `MCHTest`-class object. These are S3-class objects; really they are just functions with a `class` attribute. All the work is done in the initial function call creating the object. But there’s more to the story.

We want these objects to be self-contained. Specifically, we don’t want changes in the global namespace to change how a `MCHTest` object behaves. By default, these objects are *not* self-contained, and a programmer who isn’t careful can accidentally break them. Here I explain how to prevent this from happening.

I highly recommend those who want to learn more about closures and environments read [1], but I will briefly explain these critical concepts here.

A closure is a function created by another function. `MCHTest` objects are closures: functions created by `MCHTest()` (then given a `class` attribute). An environment is an R object where other R objects are effectively defined. For example, there is the global environment, where most R objects created by users live.

```r
environment()
## <environment: R_GlobalEnv>

globalenv()
## <environment: R_GlobalEnv>
```

Ever wonder why a variable defined inside a function doesn’t affect anything outside of that function, and why it simply disappears? It’s because when a function is called, a new environment is created, and all assignments within the function are done within that new environment. We can see this occurring with some clever use of `print()`.

```r
x <- 2
u <- function() {
  x
}
u()
## [1] 2

f <- function() {
  x <- 1
  function() {
    x
  }
}
g <- f()
g()
## [1] 1
```

```r
environment(g)
## <environment: 0x9c45d78>

environment(u)
## <environment: R_GlobalEnv>

parent.env(environment(g))
## <environment: R_GlobalEnv>
```

`u()` is a function that lives in the global environment, so it looks for variables in the global environment. `g()`, however, lives in an environment created by `f()`. Normally, when a function creates an environment, that environment disappears the moment the function finishes execution. Closures, however, still use the environment created by the function, so the environment doesn’t disappear when the function finishes execution.
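To see that the captured environment persists between calls (and can even be updated), consider this classic counter example; the function `make_counter` is my own illustration, not something from the post or the package:

```r
# A closure's environment survives between calls and can be updated with <<-
make_counter <- function() {
  count <- 0             # lives in the environment created by this call
  function() {
    count <<- count + 1  # assigns to count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()  # returns 1
counter()  # returns 2; the environment remembered the previous call
```

Each call to `make_counter()` creates a fresh environment, so two counters never interfere with each other.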

When a function looks for an object, it first looks for that object in its environment. If it doesn’t find the object there, it looks for the object in the parent environment of its environment. It will continue this process until it either finds the object or discovers that none of its environment’s ancestors has the object (prompting an error).
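This lookup chain can be made explicit with `new.env()` and `get()`. Here is a small sketch; the environments `e1` and `e2` are my own example, not from the post:

```r
# Build a chain: e2's parent is e1, whose parent is the global environment
e1 <- new.env(parent = globalenv())
e2 <- new.env(parent = e1)
assign("y", 10, envir = e1)

# get() searches e2 first, then walks up to e1, where it finds y
get("y", envir = e2)                       # returns 10

# y is not defined in e2 itself, only in its parent
exists("y", envir = e2, inherits = FALSE)  # returns FALSE
```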

This means that the function is sensitive to changes in its environment or its environment’s ancestors, as we see here:

```r
x <- 3
h <- function() {
  function() {
    x
  }
}
u()
## [1] 3
```

```r
j <- h()
environment(j)
## <environment: 0xa6e7cb4>

parent.env(environment(j))
## <environment: R_GlobalEnv>

j()
## [1] 3
```

One of R’s attractive features is that it promotes a style of programming that discourages side effects, where changes to one object don’t change the behavior of another. But the examples above show how closures can suffer side effects when objects in the global namespace are changed. The closures created above depend on the global environment in ways that surprise those not familiar with how environments in R work.

By default, `MCHTest` objects can suffer from these side effects, and the side effects can creep in if the functions passed to the parameters of `MCHTest()` are carelessly defined, as we see below. (The tests being defined are effectively Monte Carlo z-tests.)

```r
library(MCHT)
## .------..------..------..------.
## |M.--. ||C.--. ||H.--. ||T.--. |
## | (\/) || :/\: || :/\: || :/\: |
## | :\/: || :\/: || (__) || (__) |
## | '--'M|| '--'C|| '--'H|| '--'T|
## `------'`------'`------'`------' v. 0.1.0
## Type citation("MCHT") for citing this R package in publications
```

```r
library(doParallel)
registerDoParallel(detectCores())

ts <- function(x, sigma = 1) {
  sqrt(length(x)) * mean(x)/sigma  # z-test for mean = 0
}

sg <- function(x, sigma = 1) {
  x <- sigma * x
  ts(x, sigma = sigma)  # unsafe
}

unsafe.test.1 <- MCHTest(ts, sg, rnorm, seed = 100, N = 100,
                         fixed_params = "sigma")

unsafe.test.1(rnorm(10))
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 1.1972, sigma = 1, p-value = 0.15
```

```r
ts <- function(x) {
  sqrt(length(x)) * mean(x)  # effectively makes sigma = 1
}

sg <- function(x) {
  ts(x)  # again, unsafe
}

unsafe.test.2 <- MCHTest(ts, sg, rnorm, seed = 100, N = 100)
unsafe.test.2(rnorm(10))
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 0.22926, p-value = 0.46
```

```r
# ERROR
unsafe.test.1(rnorm(10))
## Error in {: task 1 failed - "unused argument (sigma = sigma)"
```

What happened? Let’s pick it apart by looking at the `stat_gen` parameter of `unsafe.test.1()`.

```r
get_MCHTest_settings(unsafe.test.1)$stat_gen
## function(x, sigma = 1) {
##   x <- sigma * x
##   ts(x, sigma = sigma)  # unsafe
## }
```

This function depends on an object called `ts()`. When the function looks for `ts()`, it looks *in the global namespace!* This means that changes to `ts()` in that namespace will change the behavior of the function. The most recent version of `ts()` does not have a parameter called `sigma`, prompting an error. *The object is not self-contained!*

How can we prevent side effects like this? One answer is to define the functions passed to `MCHTest()` in a way that doesn’t depend on objects defined in the global namespace. For example, we would not call `ts()` in `sg()` above but instead rewrite the test statistic as we defined it in `ts()`. (Using functions and objects defined in packages is okay, though, since these generally don’t change in an R session.)
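For instance, a version of `sg()` that does not depend on the global `ts()` would write the statistic out directly. This is a sketch of the idea, not code from the package:

```r
# Safe: the statistic is written out rather than calling the global ts(),
# so redefining ts() later cannot break this function
sg <- function(x, sigma = 1) {
  x <- sigma * x
  sqrt(length(x)) * mean(x)/sigma  # the same z-statistic ts() computed
}
```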

However, this is not always practical. The test statistic written in `ts()` could be complicated, and writing that same statistic again would not only be a lot of work but would also tempt bugs to invade. Fortunately, `MCHTest()` supports methods for making `MCHTest` objects self-contained.

The first step is to set the `localize_functions` parameter to `TRUE`. This changes the environment of the `test_stat`, `stat_gen`, `rand_gen`, and `pval_func` functions so that they belong to the environment the `MCHTest` object lives in. Not only does this help make the function self-contained; we may even be able to write our inputs in a more idiomatic way, like so:

```r
ts <- function(x, sigma = 1) {
  sqrt(length(x)) * mean(x)/sigma
}

sg <- function(x, sigma = 1) {
  x <- sigma * x
  test_stat(x, sigma = 1)  # Would not be able to do this if
                           # localize_functions were FALSE
}

safe.test.1 <- MCHTest(ts, sg, function(n) {rnorm(n)}, seed = 100, N = 100,
                       fixed_params = "sigma", localize_functions = TRUE)

safe.test.1(rnorm(10))
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 2.0277, sigma = 1, p-value = 0.02
```

```r
ts <- function(x) {
  sqrt(length(x)) * mean(x)  # effectively makes sigma = 1
}

sg <- function(x) {
  ts(x)
}

safe.test.1(rnorm(10))  # Still works
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 1.0038, sigma = 1, p-value = 0.21
```

(Notice how `rand_gen` was handled; it was wrapped in a function rather than passed directly. In short, this prevents the function `rnorm` from being stripped of its namespace, since it needs functions from that namespace.)

This is the first step to removing side effects. (In fact, it makes our functions better written, since we can anticipate the existence of `test_stat` as a function.) However, we could still have variables or functions defined outside of our input functions. We can expose these to our localized input functions via the `imported_objects` parameter, a list (the doppelganger of R’s environments) containing these objects.

```r
ts <- function(x, sigma = 1) {
  sqrt(length(x)) * mean(x)/sigma
}

sg <- function(x, sigma = 1) {
  x <- sigma * x
  ts(x)  # We're going to do this safely now
}

safe.test.2 <- MCHTest(ts, sg, function(n) {rnorm(n)}, seed = 100, N = 100,
                       fixed_params = "sigma", localize_functions = TRUE,
                       imported_objects = list("ts" = ts))

safe.test.2(rnorm(10))
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 0.57274, sigma = 1, p-value = 0.39
```

```r
ts <- function(x) {
  sqrt(length(x)) * mean(x)  # effectively makes sigma = 1
}

sg <- function(x) {
  ts(x)
}

safe.test.2(rnorm(10))
##
## Monte Carlo Test
##
## data:  rnorm(10)
## S = 0.24935, sigma = 1, p-value = 0.45
```

Both `safe.test.1()` and `safe.test.2()` are now immune to changes in the global namespace. They are self-contained and thus safe to use.

By default, `localize_functions` is `FALSE`. I thought of making it `TRUE` by default, but I feared that those not familiar with the concept of environments would be bewildered by all the errors thrown whenever they tried to use a function they defined. Setting the parameter to `TRUE` makes using `MCHTest()` more difficult.

That said, I highly recommend using the parameter in a longer script. It makes the function safer (errors are good when they’re enforcing safety), so become acquainted with it.

(Next post: maximized Monte Carlo hypothesis testing)

- H. Wickham, *Advanced R* (2015), CRC Press, Boca Raton

Packt Publishing published a book for me entitled *Hands-On Data Analysis with NumPy and Pandas*, a book based on my video course *Unpacking NumPy and Pandas*. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!


This will be the first of a series of blog posts introducing the package. Most of the examples in the blog posts are already present in the manual, but I plan to go into more depth here, including some background and more detailed explanations.

**MCHT** is a package implementing an interface for creating and using Monte Carlo tests. The primary function of the package is `MCHTest()`, which creates functions with S3 class `MCHTest` that perform a Monte Carlo test.

**MCHT** is not presently available on CRAN. You can download and install **MCHT** from GitHub using **devtools** via the R command `devtools::install_github("ntguardian/MCHT")`.

Monte Carlo testing is a form of hypothesis testing where the $p$-values are computed using the empirical distribution of the test statistic computed from data simulated under the null hypothesis. These tests are used when the distribution of the test statistic under the null hypothesis is intractable or difficult to compute, or as an exact test (that is, a test where the distribution used to compute $p$-values is appropriate for any sample size, not just large sample sizes).

Suppose that $t$ is the observed value of the test statistic and large values of $t$ are evidence against the null hypothesis; normally, $p$-values would be computed as $p = 1 - F(t)$, where $F$ is the cumulative distribution function of $T$, the random variable version of $t$. We cannot use $F$ for some reason; it’s intractable, or the provided $F$ is only appropriate for large sample sizes.

Instead of using $F$ we will use $\hat{F}_N$, which is the empirical CDF of the same test statistic computed from simulated data following the distribution prescribed by the null hypothesis of the test. For the sake of simplicity in this presentation, assume that $T$ is a continuous random variable. Now our $p$-value is

$$\hat{p} = 1 - \hat{F}_N(t) = 1 - \frac{1}{N} \sum_{j = 1}^{N} I(t_j \leq t),$$

where $I$ is the indicator function and each $t_j$ is an independent random copy of $t$ computed from simulated data with the same sample size as the original data.
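The procedure can be sketched in a few lines of base R. This is an illustration of the idea, not **MCHT**'s implementation; the function name `mc_pvalue` and the choice of the z-statistic are my own:

```r
# Monte Carlo p-value for H0: mu = 0 against mu > 0, assuming N(0, 1) data
mc_pvalue <- function(x, N = 1000) {
  t_obs <- sqrt(length(x)) * mean(x)   # observed test statistic
  t_sim <- replicate(N, {
    xs <- rnorm(length(x))             # simulate a dataset under H0
    sqrt(length(xs)) * mean(xs)        # its simulated test statistic
  })
  mean(t_sim >= t_obs)                 # 1 - empirical CDF at t_obs
}

set.seed(123)
mc_pvalue(rnorm(10, mean = 2))  # data violates H0, so expect a small p-value
```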

The power of these tests increases with $N$ (see [1]), but modern computers are able to simulate large $N$ quickly, so this is rarely an issue. The procedure above also assumes that there are no nuisance parameters and that the distribution of $T$ can effectively be known precisely when the null hypothesis is true (and all other conditions of the test are met, such as distributional assumptions). A different procedure needs to be applied when nuisance parameters are not explicitly stated under the null hypothesis. [2] suggests a procedure using optimization techniques (recommending simulated annealing specifically) to adversarially select values for nuisance parameters valid under the null hypothesis that maximize the $p$-value computed from the simulated data. This procedure is often called *maximized Monte Carlo* (MMC) testing. That is the procedure employed here. (In fact, the tests created by `MCHTest()` are the tests described in [2].) Unfortunately, MMC, while conservative and exact, has much less power than if the unknown parameters were known, perhaps due to the behavior of samples under distributions with parameter values distant from the true parameter values (see [3]).

Bootstrap statistical testing is very similar to Monte Carlo testing; the key difference is that bootstrap testing uses information from the sample. For example, a parametric bootstrap test would estimate the parameters of the distribution the data is assumed to follow and generate datasets from that distribution using those estimates as the actual parameter values. A permutation test (like Fisher’s permutation test; see [4]) would use the original dataset’s values but randomly shuffle the labels (stating which sample an observation belongs to) to generate new datasets and thus new simulated test statistics. $p$-values are essentially computed the same way.
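As an illustration of the permutation idea, here is a minimal two-sample permutation test; the function `perm_test` and the difference-in-means statistic are my own choices, not from **MCHT**:

```r
# Two-sample permutation test using the difference in means as the statistic
perm_test <- function(x, y, N = 2000) {
  obs <- mean(x) - mean(y)
  pooled <- c(x, y)
  sims <- replicate(N, {
    idx <- sample(length(pooled), length(x))  # randomly reshuffle the labels
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(sims) >= abs(obs))                 # two-sided p-value
}

set.seed(123)
perm_test(rnorm(15), rnorm(15, mean = 3))  # groups differ, so p is near 0
```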

Unlike Monte Carlo tests and MMC, these tests are not exact tests. That said, they often have good finite sample properties. (See [3].) See the documentation mentioned above for more details and references.

Why write a package for these types of tests? This is not the only package that facilitates bootstrapping or Monte Carlo testing. The website RDocumentation includes documentation for the package **MChtest** by Michael Fay, which also exists for Monte Carlo testing. The package **MaxMC** by Julien Neves is devoted specifically to MMC, as described by [2]. Then there’s the package **boot**, which is intended to facilitate bootstrapping. (If I’m missing anything, please let me know in the comments.)

**MChtest** is no longer on CRAN; it implements a particular form of Monte Carlo testing and thus does not work for MMC. **MaxMC** appears to be in a very raw state. **boot** seems general enough that it could be used for bootstrap testing but still seems more geared towards constructing bootstrap confidence intervals and standard errors rather than hypothesis testing. All of these have a very different architecture from **MCHT**, which is primarily for creating a function like `t.test()` that performs a hypothesis test described when the function was created.

Additionally, this was good practice in package development and more advanced R programming. This is the first time I made serious use of closures, S3 classes and R’s flavor of object-oriented programming, and environments. So far the result seems to be a flexible and robust tool for performing tests based on randomization.

Let’s start with a “Hello, world!”-esque example for a Monte Carlo test: a Monte Carlo version of the t-test.

The one-sample t-test, one of the oldest statistical tests used today, is used to test for the location of the population mean $\mu$. It decides between the set of hypotheses:

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_A: \mu \neq \mu_0$$

(The alternative could also be one-sided, perhaps instead stating $H_A: \mu > \mu_0$.) The t-test is an exact, most-powerful test for any sample size if the data generating process (DGP) that was used to produce the sample is a Gaussian distribution. If we believe this assumption then the Monte Carlo version of the test is a contrived example, as we could not do better than to use `t.test()`, but the moment we drop this assumption there is an opening for Monte Carlo testing to be useful.

Let’s load up the package.

```r
library(MCHT)
## .------..------..------..------.
## |M.--. ||C.--. ||H.--. ||T.--. |
## | (\/) || :/\: || :/\: || :/\: |
## | :\/: || :\/: || (__) || (__) |
## | '--'M|| '--'C|| '--'H|| '--'T|
## `------'`------'`------'`------' v. 0.1.0
## Type citation("MCHT") for citing this R package in publications
```

(Yes, I’ve got a cute little `.onAttach()` package start-up message. I first saw a message like this implemented by **mclust**, and of course there’s Stata’s start-up message; I thought they’re so adorable that I will likely add such messages to all my packages. You can use `suppressPackageStartupMessages()` to make this quiet if you want. Thanks to the Python package **art** for the cool ASCII art.)

The star function of the package is the `MCHTest()` function.

```r
args(MCHTest)
## function (test_stat, stat_gen, rand_gen = function(n) {
##     stats::runif(n)
## }, N = 10000, seed = NULL, memoise_sample = TRUE, pval_func = MCHT::pval,
##     method = "Monte Carlo Test", test_params = NULL, fixed_params = NULL,
##     nuisance_params = NULL, optim_control = NULL, tiebreaking = FALSE,
##     lock_alternative = TRUE, threshold_pval = 1, suppress_threshold_warning = FALSE,
##     localize_functions = FALSE, imported_objects = NULL)
## NULL
```

The documentation for this function is the majority of the manual, and I’ve written multiple examples demonstrating its use. In short, a single call to `MCHTest()` will create an `MCHTest`-S3-class object (which is just a function) that can be used for hypothesis testing. Three arguments (all of which are functions) passed to the call will characterize the resulting test:

- `test_stat`: a function with an argument `x` that computes the test statistic, with `x` being the argument that accepts the dataset from which to compute the test statistic.
- `rand_gen`: a function generating random datasets; it must have either an argument `x` that would accept the original dataset or an argument `n` that represents the size of the dataset.
- `stat_gen`: a function with an argument `x` that will take the random numbers generated by `rand_gen` and turn them into a simulated test statistic. Sometimes `stat_gen` is the same as `test_stat`, but it is better to write separate functions, as will be seen later.

The functions passed to these arguments can accept other parameters, particularly test parameters (that is, the parameter values we are testing, such as the population mean $\mu$), fixed parameters (parameter values the test assumes, like the population standard deviation $\sigma$, whose value is assumed by the z-test often taught in introductory statistics courses), and nuisance parameters (parameter values we don’t know, are not directly investigating, and may be needed to know the distribution of the test statistic). For the cases mentioned above, there are `MCHTest()` parameters that can be used for recognizing them: `test_params`, `fixed_params`, and `nuisance_params`, respectively. While one could in principle ignore these parameters and pass functions to `test_stat`, `stat_gen`, and `rand_gen` that use them anyway, I would recommend not doing so. First, there’s no guarantee that `MCHTest`-class objects would handle the extra parameters correctly. Second, when `MCHTest()` is made aware of these special cases, it can check that the functions passed to `test_stat`, `stat_gen`, and `rand_gen` handle these types of parameters correctly and will throw an error when it appears they do not. This safety measure helps you use **MCHT** correctly.

Carrying on, let’s create our first `MCHTest` object for a t-test.

```r
ts <- function(x) {
  sqrt(length(x)) * mean(x)/sd(x)
}

mc.t.test.1 <- MCHTest(ts, ts, rnorm, N = 10000, seed = 123)
```

Above, both `test_stat` and `stat_gen` are `ts()` (they're the first and second arguments, respectively), and the random number generator `rand_gen` is `rnorm()`. Two other parameters are:

- `N`: the number of simulated test statistics to generate.
- `seed`: the seed of the random number generator, which makes test results consistent and reproducible.

`MCHTest`-class objects have a `print()` method that summarizes how the object was defined. We see it in action here:

```r
mc.t.test.1
##
## Details for Monte Carlo Test
##
## Seed: 123
## Replications: 10000
##
## Memoisation enabled
## Argument "alternative" is locked
```

This tells us the seed being used and the number of replicates used for hypothesis testing, along with other messages. I want to draw attention to the message `Argument "alternative" is locked`. This means that the test we just created will ignore anything passed to the parameter `alternative` (similar to the parameter of the same name `t.test()` has). We can enable that parameter by setting the `MCHTest()` parameter `lock_alternative` to `FALSE`.

```r
(mc.t.test.1 <- MCHTest(ts, ts, rnorm, N = 10000, seed = 123,
                        lock_alternative = FALSE))
##
## Details for Monte Carlo Test
##
## Seed: 123
## Replications: 10000
##
## Memoisation enabled
```

Let's now try this function out on data.

```r
dat <- c(0.27, 0.04, 1.37, 0.23, 0.34, 1.44, 0.34, 4.05, 1.59, 1.54)
mc.t.test.1(dat)
##
## Monte Carlo Test
##
## data:  dat
## S = 2.9445, p-value = 0.0072
```

If you run the above code you may see a complaint about `%dopar%` being run sequentially. This complaint appears when we don't register CPU cores for parallelization. **MCHT** uses **foreach**, **doParallel**, and **doRNG** to parallelize simulations and thus hopefully speed them up. Simulations can take a long time, and parallelization can help make the process faster. If we were to continue we would not see the complaint again; R accepts that there's only one core visible and thus doesn't parallelize. But we can register the other cores on our system with the following:

```r
library(doParallel)
registerDoParallel(detectCores())
```

Not only do we have parallelization enabled, `MCHTest()` automatically enables memoization so that it doesn't redo simulations if the data (or at least the data's sample size) hasn't changed. (This can be turned off by setting the `MCHTest()` parameter `memoise_sample` to `FALSE`.) Again, this is so that we save time and don't have to fear repeat usage of our `MCHTest`-class function.

The above test effectively checked whether the population mean was zero against the alternative that the population mean is greater than zero (due to the default behaviour when `alternative` is not specified). By changing the `alternative` parameter we can test against other alternative hypotheses.

```r
mc.t.test.1(dat, alternative = "less")
##
## Monte Carlo Test
##
## data:  dat
## S = 2.9445, p-value = 0.9928
## alternative hypothesis: less

mc.t.test.1(dat, alternative = "two.sided")
##
## Monte Carlo Test
##
## data:  dat
## S = 2.9445, p-value = 0.0144
## alternative hypothesis: two.sided
```

Compare this to `t.test()`.

```r
t.test(dat, alternative = "two.sided")
##
## One Sample t-test
##
## data:  dat
## t = 2.9445, df = 9, p-value = 0.01637
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2597649 1.9822351
## sample estimates:
## mean of x
##     1.121
```

The two tests reach similar conclusions.

However, `t.test()` is an exact and most-powerful test at any sample size under the assumptions we made. But all we need to do is not assume the data was drawn from a Gaussian distribution to throw the t-test for a loop. The t-test will often do well even when the Gaussian assumption is violated, but those statements hold for large sample sizes; at no sample size will the test be an exact test. Monte Carlo tests, though, can be exact tests for any sample size under different (often strong) distributional assumptions, without having to compute the distribution of the test statistic under the null hypothesis.

I know for a fact that `dat` was generated using an exponential distribution, so let's write a new version of the t-test that uses this information. While we're at it, let's add a parameter so that we know we're testing for the mean of the data, and that mean can be specified by the user.

```r
ts <- function(x, mu = 1) {
  # Throw an error if mu is not positive; exponential random variables have
  # only positive mu
  if (mu <= 0) stop("mu must be positive")
  sqrt(length(x)) * (mean(x) - mu)/sd(x)
}

sg <- function(x, mu = 1) {
  x <- mu * x
  sqrt(length(x)) * (mean(x) - mu)/sd(x)
}

(mc.t.test.2 <- MCHTest(ts, sg, rexp, seed = 123,
                        method = "One-Sample Monte Carlo Exponential t-Test",
                        test_params = "mu", lock_alternative = FALSE))
##
## Details for One-Sample Monte Carlo Exponential t-Test
##
## Seed: 123
## Replications: 10000
## Tested Parameters: mu
## Default mu: 1
##
## Memoisation enabled
```

Using this new function works the same, only now we can specify the $\mu$ we want to test.

```r
mc.t.test.2(dat, mu = 2, alternative = "two.sided")
##
## One-Sample Monte Carlo Exponential t-Test
##
## data:  dat
## S = -2.3088, p-value = 0.181
## alternative hypothesis: true mu is not equal to 2

mc.t.test.2(dat, mu = 1, alternative = "two.sided")
##
## One-Sample Monte Carlo Exponential t-Test
##
## data:  dat
## S = 0.31782, p-value = 0.6888
## alternative hypothesis: true mu is not equal to 1
```

```r
t.test(dat, mu = 1, alternative = "two.sided")
##
## One Sample t-test
##
## data:  dat
## t = 0.31782, df = 9, p-value = 0.7579
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
##  0.2597649 1.9822351
## sample estimates:
## mean of x
##     1.121
```

Now the t-test and the Monte Carlo test produce $p$-values that are not similar, and the Monte Carlo t-test will in general be more accurate. (It appears that the regular t-test is more conservative than the Monte Carlo test and thus less powerful.)

I would consider the current release of **MCHT** to be early beta; it is usable but it's not yet able to be considered "stable". Keep that in mind if you plan to use it.

I'm very excited about this package and look forward to writing more about it. Stay tuned for future blog posts explaining its functionality. It's highly likely that strange and mysterious behavior lurks within, so I hope that anyone who encounters such behavior reports it and helps push **MCHT** closer to a "stable" state.

I'm early in my academic career (in that I'm a Ph.D. student without any of my own publications yet), and I'm unsure if this package is worth a paper in, say, *J. Stat. Soft.* or the *R Journal* (heck, I'd even write a book about the package if it deserved it). I'd love to hear comments on any future publications that others would want to see.

Thanks for reading and stay tuned!

Next post: making `MCHTest` objects self-contained.

- A. C. A. Hope, *A simplified Monte Carlo test procedure*, JRSSB, vol. 30 (1968) pp. 582-598
- J-M Dufour, *Monte Carlo tests with nuisance parameters: A general approach to finite-sample inference and nonstandard asymptotics*, Journal of Econometrics, vol. 133 no. 2 (2006) pp. 443-477
- J. G. MacKinnon, *Bootstrap hypothesis testing* in *Handbook of Computational Econometrics* (2009) pp. 183-213
- R. A. Fisher, *The Design of Experiments* (1935)
- R. Davidson and J. G. MacKinnon, *The size distortion of bootstrap tests*, Econometric Theory, vol. 15 (1999) pp. 361-376


*Hands-On Data Analysis with NumPy and Pandas* is now available for purchase from Packt Publishing’s website and from Amazon. This book was created by a team at Packt Publishing who took my video course and turned it into book form. If you’re like me and love books that you can hold in your hand, touch, thumb through, etc., and you’re looking to learn about basic tools for data analysis with Python, give my book a look.

As with the video course, the book covers how to set up an environment for data analysis with Python and how to use two important tools: NumPy and pandas.

I discuss how to set up Anaconda, a popular data analysis environment, along with how to use Jupyter Notebooks. I show how to connect Python with a MySQL database, along with how to set up such a database.

Then I show how to use NumPy. This includes creating NumPy arrays, indexing arrays, using arrays in arithmetic, NumPy linear algebra, and vectorization. These are essential skills anyone using Python for data analysis should know.

Finally I show how to use pandas. This includes creating a pandas `DataFrame`, subsetting the data frame, indexing, plotting, and even how to handle missing data. `DataFrame`s are a great way to manage data and I highly recommend their use.

The book consists of numerous tutorials demonstrating these concepts. I think this book would be great for an introductory course on data science for programming novices who just learned Python basics (perhaps from the book I learned from, Allen Downey’s *Think Python*) and are starting to learn the basics of data analysis. The basics of using NumPy arrays and pandas `DataFrame`s are challenging for beginners, and my book helps get them going.

I list the book’s chapters below:

- Setting Up a Python Data Analysis Environment
- Diving Into NumPy
- Operations on NumPy Arrays
- pandas are Fun! What is pandas?
- Arithmetic, Function Application, and Mapping with pandas
- Managing, Indexing and Plotting

I would like to thank the staff at Packt Publishing for their work on this book, particularly Tushar Gupta and Nikita Shetty. I was so pleased when I received my copies in the mail and I thank them for their hard work to make this possible.

The MSRP for the book is $23.99, but it is currently on sale for $10 as part of Packt’s AI Now campaign, so pick it up while it’s cheap! If you’re not interested in buying this particular book, perhaps consider getting a Mapt subscription. You’ll have access to thousands of books and video courses (including all of my content), and can even get one book to keep for free (without DRM) every month! Perhaps that book will be mine! It’s a great deal you should consider.

The Kolmogorov distribution (which I call $F$) is as follows:

$$F(x) = \frac{\sqrt{2\pi}}{x} \sum_{k = 1}^{\infty} e^{-(2k - 1)^2 \pi^2 / (8 x^2)}$$

There is no known simpler form and we have to work with this sum as it is. This is an infinite sum. How can we compute the value of this infinite sum numerically?

Naïvely we can do the following:

```r
summand <- function(x, k) sqrt(2 * pi)/x * exp(-(2 * k - 1)^2 * pi^2/(8 * x^2))

# Compute F(1)
sum(summand(1, 1:500))
[1] 0.7300003
```

In other words, sum up many of the terms and you should be close to the actual infinite sum.

This is a crude approach. The answer is not wrong (numerically) but certainly we should understand why adding up that many terms works. Also, we could have added more terms than necessary… or not enough.

So how can we compute this sum that guarantees some level of precision while at the same time not adding any more terms than necessary? Unfortunately I don’t recall how to do this from my numerical methods classes, but I believe I have found an approach that works well enough for my purposes.

An infinite sum is defined as the limit of a sequence of finite sums. Let $S_N = \sum_{n = 1}^{N} a_n$, where $(a_n)$ is some sequence (we can have $a_n = r^n$ or $a_n = n^{-2}$, for example; in the case of the Kolmogorov distribution, the summands were $a_k = \frac{\sqrt{2\pi}}{x} e^{-(2k - 1)^2 \pi^2 / (8 x^2)}$). Then, by definition, $\sum_{n = 1}^{\infty} a_n = \lim_{N \to \infty} S_N$.

My attempt to compute this sum simply amounts to trying to find an $N$ such that the difference between $S_N$ and $S = \lim_{N \to \infty} S_N$ is no larger than machine precision: that is, I want $|S - S_N| \leq \epsilon$, where $\epsilon$ is the machine precision of the computer.

Since we don’t know what $S$ is, we can instead decide that machine convergence occurs when $S_{N+1} - S_N < \epsilon$; that is, when one partial sum and the next are numerically indistinguishable. Since $S_{N+1} - S_N = a_{N+1}$, this criterion is the same as requiring that $a_{N+1} < \epsilon$.
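
As a minimal sketch of this stopping rule in R (the `terms_needed()` helper is my own, applied to the Kolmogorov summand at $x = 1$):

```r
# Summand of the Kolmogorov series, as before
summand <- function(x, k) sqrt(2 * pi)/x * exp(-(2 * k - 1)^2 * pi^2/(8 * x^2))

# Smallest N whose next summand is numerically indistinguishable from zero
terms_needed <- function(x, eps = .Machine$double.eps) {
  N <- 1
  while (summand(x, N + 1) >= eps) {
    N <- N + 1
  }
  N
}

terms_needed(1)                     # only a handful of terms needed
sum(summand(1, 1:terms_needed(1)))  # agrees with the naive 500-term sum
```

For this sum the summands die off so fast that the rule stops after only a few terms, yet the partial sum already agrees with the 500-term computation to machine precision.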

Every sum that converges requires the condition $a_k \to 0$, so this criterion always yields an $N$ that gives “numerical convergence”. Of course, any Calculus II student who was paying attention can tell you that not all infinite sums with summands going to zero converge, the classic counterexample being the harmonic series. So this approach would claim that divergent sums are numerically convergent, which is bad. We cannot even expect this method to work well in cases where the sum does converge but does so slowly (see a later example). However, in some cases this approach may be fine.

Take the case of a geometric series:

$$\sum_{k=0}^{\infty} r^k$$

with $|r|$ less than 1. These sums converge; in fact, mathematicians consider them as converging quickly. We also have a formula for what the sum is:

$$\sum_{k=0}^{\infty} r^k = \frac{1}{1 - r}$$

After some algebra (solving $r^N < \epsilon$ for $N$), we can quickly find a rule for determining how many summands we need to attain “numerical convergence”:

$$N > \frac{\log \epsilon}{\log r}$$

We can see that in action with some R examples:

.Machine$double.eps # Numerical accuracy of this system

[1] 2.220446e-16

log(.Machine$double.eps)/log(0.5)

[1] 52

sum(0.5^(0:53))

[1] 2

# 2 is the correct answer, but the interpreter rounds its output; is the answer
# actually 2?
sum(0.5^(0:53)) - 2 == 0

[1] TRUE

sum(0.5^(0:52)) - 2 == 0

[1] FALSE

This method, though, should be met with suspicion. For instance, it will not work for a slowly convergent sum. Take for example $\sum_{k=1}^{\infty} k^{-2} = \pi^2/6$. If you apply the above technique, then “numerical convergence” is achieved for $N = \epsilon^{-1/2}$. Not only is that a very large number, it won’t achieve our goal of good numerical accuracy.

.Machine$double.eps^(-1/2)

[1] 67108864

N <- 67108865
sum((1:N)^(-2))  # This may take a while

[1] 1.644934

sum((1:N)^(-2)) - pi^2/6

[1] -1.490116e-08

That difference is much larger than numerical accuracy.
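
That shortfall is just what the tail of the series predicts: by an integral comparison, $\sum_{k > N} k^{-2} \approx 1/N$, and with $N = \epsilon^{-1/2}$ the leftover error is $\sqrt{\epsilon}$, not $\epsilon$:

```r
# Tail of sum(k^-2) beyond N is about 1/N (integral comparison);
# with N = eps^(-1/2), the leftover error is sqrt(eps), far above eps
eps <- .Machine$double.eps
N <- eps^(-1/2)
1 / N      # about 1.49e-08, the size of the discrepancy observed above
sqrt(eps)  # the same quantity
```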

In fact, the technique doesn’t always work for geometric sums either, as demonstrated by these examples.^{1}

sum(.99^(0:4000)) - 100

[1] -8.526513e-14

sum(.999^(0:40000)) - 1000

[1] -9.094947e-13

However, while this method cannot guarantee quick convergence or even convergence, I think it’s good enough for the sum I want to compute.

First, the sum converges more quickly than a geometric sum, as the summands decrease at a rate of $e^{-ck^2}$ (for a constant $c$ depending on $x$) rather than $r^k$. Second, a method trying to attain full numerical accuracy would need to be programmed, and if its implementation is written in R, that implementation will likely be much slower than simply using `sum()`, since the latter is implemented using fast C code. Such an implementation would have to be written from scratch in C++ using a tool such as **Rcpp**. One must wonder whether the tiny gains in numerical accuracy and speed one might potentially obtain are worth the work; if `x` is large it may be best to just round off the CDF at 1.

In the end, using the lessons learned above, I implemented the Kolmogorov distribution in the package I’m writing for my current research project with the code below.

pkolmogorov <- function(q, summands = ceiling(q * sqrt(72) + 3/2)) {
  sqrt(2 * pi) * sapply(q, function(x) {
    if (x > 0) {
      sum(exp(-(2 * (1:summands) - 1)^2 * pi^2/(8 * x^2)))/x
    } else {
      0
    }
  })
}
pkolmogorov <- Vectorize(pkolmogorov, "q")
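
As a quick sanity check (restating the definition so this snippet runs on its own), $F(1)$ from this implementation should agree with the naive 500-term sum computed at the start of the post:

```r
# The implementation from the post, restated so this snippet is self-contained
pkolmogorov <- function(q, summands = ceiling(q * sqrt(72) + 3/2)) {
  sqrt(2 * pi) * sapply(q, function(x) {
    if (x > 0) {
      sum(exp(-(2 * (1:summands) - 1)^2 * pi^2/(8 * x^2)))/x
    } else {
      0
    }
  })
}
pkolmogorov <- Vectorize(pkolmogorov, "q")

pkolmogorov(1)             # near 0.7300003, matching the naive 500-term sum
pkolmogorov(c(0.5, 1, 2))  # vectorized over q
```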

Numerical summation, as I mentioned above, is something I know little about, so I’d appreciate it if readers with thoughts on this topic (and knowledge of how it is done in R) shared them in the comments.

I have created a video course published by Packt Publishing entitled *Training Your Systems with Python Statistical Modelling*, the third volume in a four-volume set of video courses entitled, *Taming Data with Python; Excelling as a Data Analyst*. This course discusses how to use Python for machine learning. The course covers classical statistical methods, supervised learning including classification and regression, clustering, dimensionality reduction, and more! The course is peppered with examples demonstrating the techniques and software on real-world data and visuals to explain the concepts presented. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

- If you’re keeping score at home, you’ll notice these sums use many more terms than the rule described above suggests. This suggests the problem is not just that the “stopping” rule is wrong; adding more terms won’t improve numerical accuracy, since the terms being added are essentially zero. Something else, beyond just adding more terms, would need to be done. ↩

This is the first semester where I feel like I actually am fully, 100% prepared to teach this class. I’ve taught MATH 1070: Introduction to Statistical Inference many times and got comfortable with teaching what I call “Statistics If You Don’t Like Math”, which is a terminal math course. MATH 3070 is “Statistics If You Do Like Math” and covers *way* more material. I struggled with the pacing the first two times I taught the course, so I’m glad I think I finally have that pacing down.

I just finished the public web page for the class that includes all the material (aside from stuff students have to buy, like the textbooks) I will be using for the class. There are three parts of this page that I’m excited to share.

First, there’s the lecture notes. I wrote the bulk of these notes in the spring semester, using R Markdown and the **tufte** package for Tufte-style handouts. These notes are half-filled notes meant to accompany my lectures. In response to feedback, I no longer use chalk but give these handouts to students and fill them out on my laptop (which has a touch screen; I use a compatible pen) which has its desktop projected behind me so the students can follow along. This greatly improves the flow of the class; no stopping to write long definitions!

The notes are meant to accompany the textbook, Jay Devore’s *Probability and Statistics for Engineering and the Sciences*, but with my own thoughts and examples, along with accompanying R code. Students can not only see the mathematics but also how these procedures can be done in R on a computer. Since R programming is an important skill the students will need to develop in the class, this addition should improve the course overall.

The chapter notes are available in parts, but I recently used **bookdown** to combine all the notes into one omnibus document, available here.

As these notes were written to accompany a textbook they are not meant to stand alone, though the enterprising instructor could possibly use my notes (without using Devore’s good book) and fill them in for their class, treating them as a major part of class materials.

Next, there’s the lab lecture notes. As mentioned above, R programming is an important skill I hope to develop in my students, so the class comes with an R programming lab (not taught by me, though I have taught it before) that teaches students (presumed to be programming novices) about R and programming. I wrote lecture notes to accompany the R lab textbook, John Verzani’s *Using R for Introductory Statistics*. This was just in R Markdown before, but I now have a **bookdown** version that is publicly available and more easily used than the collection of HTML documents I had before.

These notes come in two versions. There’s the summer semester version, which is the original version. These notes were written for an eight-week intensive schedule, and thus are divided into eight lectures. These notes were also written when I both taught the lecture and the lab at the same time, thus giving me perfect coordination between the two sections. Then there’s the regular semester version. This version was written for a 14-week course. I divided the summer schedule lectures and also added new lectures not present before (on **tidyverse** packages and Bayesian statistics) to slow down the lab to keep pace with the lecture course (not taught by me at the time); thus, these lectures include strictly more material. Unlike the earlier lecture notes (which *must* be in PDF format since white space is a crucial part of the notes), these notes come in both online and PDF versions, for both good online access and to have something printable.

All the source materials for these notes are publicly available too in this archive, should you desire to modify them or at least see how they were made (but if you do modify them, please be sure to cite me).

I submitted the summer lab lecture notes to the **bookdown** contest. While the book may not seem innovative to those who are familiar with **bookdown**, I feel like its existence is a major innovation, and I’m proud of it.

Finally, there’s StatTrainer. This is a Shiny app that I originally wrote for MATH 1070, but I think is still useful for MATH 3070. It’s an app that generates random statistics problems for students covering confidence intervals and hypothesis testing. This is to help aid study, giving students infinite practice problems. This app can be started from the command line on *NIX systems (see the webpage for instructions). I’m proud of this app and have mused on making a package based around it, not only implementing that specific app but also providing a framework for modifying it.

Hopefully someone out there, from a student or autodidact to an instructor or package author, finds this material useful. I worked hard on it (I’m shocked at how many pages I’ve apparently written in notes), and I can’t wait to see how the semester plays out with it.

**UPDATE:**

I have updated the main web page with licensing information. The following license applies to all material on the web page, including the notes and R code.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

I have created a video course published by Packt Publishing entitled *Applications of Statistical Learning with Python*, the fourth volume in a four-volume set of video courses entitled, *Taming Data with Python; Excelling as a Data Analyst*. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

I *love* Arkham Horror: The Card Game. I love it more than I really should; it’s *ridiculously* fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign^{1} where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.

The game is hard. *Really* hard. And I don’t mean it’s hard because it has a learning curve (although if you’re not used to deep games like this, it probably does have a learning curve). This is a Cthulhu game, which means that the mythos is going to do everything to beat you down. Your best-laid plans are going to get derailed. The worst thing that could possibly happen to you *will* happen. This is why this game is unreasonably fun: it’s a game for masochists.

And it makes a damn ton of good stories. I often want to tell the tale of how amazingly awful the events played out, or how I just snatched victory out of the jaws of defeat. Bloomberg recently published an article about the rise of board games in companies. If you’re in one of these companies where employees often play board games together after hours, you should consider adding Arkham Horror (the card game; I’ve never played the board game and I’ve heard mixed reviews) to the mix. It’s a great team builder.

Two elements make Arkham Horror so damn hard and unpredictable: the encounter deck and the chaos bag. The encounter deck is a deck that all players draw from every turn that spawns monsters and awful events. The chaos bag is used for skill tests you make to advance your board state. You draw a token from the chaos bag, add the modifier of the revealed token to your skill value for that skill test (most of the modifiers are negative numbers), and check the new value against the difficulty of the test; if your modified skill value is at least as great as the difficulty of the test, you pass; otherwise, you fail. (You can pitch cards from your hand to increase your skill value *before* you reveal a token.)
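
In code, the core of that procedure looks something like this (a minimal sketch; `resolve_skill_test()` and its modifier values are my own illustration, not the official token distribution, and it ignores pitching cards and token effects):

```r
# One skill test: reveal a token, add its modifier to your skill value,
# and compare against the test's difficulty. The modifiers below are a
# made-up illustrative bag.
resolve_skill_test <- function(skill, difficulty,
                               modifiers = c(1, 0, -1, -1, -2, -3)) {
  token <- sample(modifiers, 1)           # reveal one chaos token
  passed <- (skill + token) >= difficulty # pass if modified skill >= difficulty
  list(token = token, passed = passed)
}

set.seed(13)  # for reproducibility
resolve_skill_test(skill = 4, difficulty = 3)
```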

(Image source: Ars Technica)

This is the chaos bag in a nutshell, but the chaos bag does more. The elder sign token represents good luck and often helps your board state; it’s considered a great success. But most icon tokens in the bag are trying to hurt you. There are four icon tokens: the skulls, the cultist, the tablets, and the elder things. These often not only apply a modifier but do something bad to you (which changes between scenarios).

And of course there’s this little bastard.

This is the auto-fail token. You just fail. Thanks for playing.

Bad things happen in Arkham Horror, and sometimes they feel like they happen a lot. For example, you can draw two auto-fails in a row. Or three. I cannot overstate how devastating that can be in a game. Sometimes it feels like a particular game went unusually poorly, and unusually poor games seem to happen frequently.

That’s a contradiction, of course. I think this sense of bad games happening often emerges from humans’ fundamentally poor understanding of probability and how it actually works. I’m not just referring to the mathematics; people’s intuition of what should happen according to probability does not match what probability says happens. This phenomenon is well documented (see, for example, the book *Irrationality* by Stuart Sutherland) and is one of the explanations for why people seemed to underestimate Donald Trump’s chances of winning the 2016 election (in short, unlikely events occur more frequently than people perceive; see also Nate Silver’s series of articles). In fact, you are more likely to find *any* pattern than no pattern at all (and in Arkham Horror, patterns are usually bad for you).

This perception of “unusually many auto-fails” (which, as a statistician, I know cannot be right no matter what my feelings say) prompted me to write an R function that generates a string of pulls from the chaos bag, each pull independent of the previous pull. Here’s the function:

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(rep(names(dicebag), as.vector(unlist(dicebag))),
                replace = replace, size = n), collapse = '')
}

Notice that the function takes a list, `dicebag`. This list uses one-character codes representing chaos bag tokens, and the actual contents of the list are numbers representing the frequency of that token in the bag. The default is a fictitious chaos bag that I believe is representative of Arkham Horror on standard difficulty. Since most numbered tokens are negative, I denote the +1 token with `'P'` and the negative tokens with their numbers.

How many pulls from the bag occur in a game? That's really hard to figure out; in fact, this will vary from scenario to scenario dramatically. So let's just make an educated guess to represent the "typical" game. A round consists of a mythos phase, then three actions per player, followed by a monsters phase and an upkeep phase. The mythos phase often prompts skill tests, and I would guess that the average number of skill tests made during each player's turn is two. We'll guess that about half of the draws from the encounter deck prompt skill tests. A game may last about 16 rounds. This adds up to about 40 skill tests per player per game. As a consequence, a four-player game (I mostly play multiplayer, and the more the merrier) would yield 160 skill tests.
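
Spelling out that arithmetic (every number below is the guess from the paragraph above):

```r
# Back-of-the-envelope skill test count; all numbers are rough guesses
rounds <- 16           # estimated rounds in a game
tests_per_turn <- 2    # estimated skill tests during each player's turn
encounter_tests <- 0.5 # estimated tests per encounter draw (one draw per round)

tests_per_player <- rounds * (tests_per_turn + encounter_tests)  # 40
players <- 4
tests_per_player * players  # 160 skill tests in a four-player game
```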

Let's simulate a string of 160 skill tests, each independent of the other^{2}.

(game <- chaos_bag_string_sim(n = 160))

[1] "311113E12121E2LS20A32A4ETT1SE3T3P0AT2S12PLSC033CP22A11CLL31L024P12ES24A1S0E E3PSSS12C13T21224104L43PCT13TTCSSASE20SA1TT2SSC2CPC20012CC41S234PPEP101E0LS 1P1110TSS1"

I represented pulls from the chaos bag as a string because we can use regular expressions to find patterns in the string. For instance, we can use the expression `AA` to find all occurrences of double auto-fails in the string.

grepl("AA", game)

[1] FALSE

The `grepl()` function returns a boolean (logical) value identifying whether the pattern appeared in the string; in this game, no two auto-fails appeared in a row.

What about two auto-fails appearing, separated by two other tests? This is matched by the expression `A..A` (remember, `.` matches any character).

grepl("A..A", game)

[1] TRUE

We can take this further and use our simulator to estimate the probability events occur. After all, in probability, there are two ways to find the probability of events:

- Create a probability model and use mathematics to compute the probability an event of interest occurs; for example, use a model of a die roll to compute the probability of rolling a 6
- Perform the experiment many times and count how often the event of interest occurred; for example, roll a die 1000 times and count how many times it rolled a 6 (this is usually done on a computer)

While the latter approach produces estimates, these estimates are guaranteed to get better the more simulations are performed. In fact, if you want at least a 95% chance of your estimate being accurate up to two decimal places, you could perform the simulation 10,000 times.^{3}
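
One standard way to arrive at a figure like 10,000 (my derivation, using the normal approximation for a proportion): the worst-case standard error of a probability estimated from $n$ simulations is $\sqrt{0.25/n}$, so requiring the 95% margin of error $1.96\sqrt{0.25/n}$ to be at most 0.01 gives:

```r
# Worst-case standard error of an estimated proportion is sqrt(0.25 / n);
# solve 1.96 * sqrt(0.25 / n) <= 0.01 for n
margin <- 0.01
n_needed <- round(1.96^2 * 0.25 / margin^2)
n_needed  # 9604, so 10,000 replications comfortably suffice
```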

Using this, we can estimate the probability of seeing two auto-fails in a row.

mean(grepl("AA", replicate(10000, {chaos_bag_string_sim(n = 160)})))

[1] 0.4019

There's about a 40% chance of seeing two auto-fail tokens in a single four-player game. That's pretty damn likely.

Let's ask a slightly different question: what is the average number of times we will see two auto-fails in a row in a game? For this we will want to dip into the **stringr** package, using the `str_count()` function.

library(stringr)
str_count(game, "AA")

[1] 0

# How many times did we see auto-fails separated by two tests?
str_count(game, "A..A")

[1] 1

Or we can ask how often we saw two "bad" tokens in a row. In Arkham Horror on standard difficulty, players often aim to pass a test when drawing a -2 or better. This means that "bad" tokens include -3, -4, the auto-fail, and often the tablet and elder thing tokens, too. The regular expression pattern that matches "two bad things in a row" is `[34TLA]{2}` (translation: see either `3`, `4`, `T`, `L`, or `A` exactly two times in a row).

str_count(game, "[34TLA]{2}")

[1] 14

How could we estimate the average number of times this would occur in a four-player game? The simulation trick still works. Pick a large number of simulations to get an accurate estimate; the larger, the more accurate (but also more work for your computer).

mean(str_count(replicate(10000, {chaos_bag_string_sim(n = 160)}), "[34TLA]{2}"))

[1] 10.6717

You can imagine that this can keep going, and perhaps there are queries you would like to see answered. Below I leave you with an R script that you can use to do these types of experiments. This script is designed to be run from a Unix-flavored command line, though you could `source()` the script into an R session to use the `chaos_bag_string_sim()` function interactively. Use regular expressions to define a pattern you are interested in (this is a good opportunity to learn regular expressions for those who want the exercise).

#!/usr/bin/Rscript
#######################################
# AHChaosBagSimulator.R
#######################################
# Curtis Miller
# 2018-08-03
# A script for simulating Arkham Horror's chaos bag
#######################################

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(rep(names(dicebag), as.vector(unlist(dicebag))),
                replace = replace, size = n), collapse = '')
}

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

main <- function(pattern, average = FALSE, pulls = 24, replications = 10000,
                 no_replacement = FALSE, dicebag = "", help = FALSE) {
  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag_df$token <- as.character(dicebag_df$token)
    if (!is.numeric(dicebag_df$freq)) {stop("Dicebag freq must be integers.")}
    dicebag <- as.list(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag = list('E' = 1,  # Elder Sign
                   'P' = 1,  # +1
                   '0' = 2,
                   '1' = 3,  # -1
                   '2' = 2,  # -2
                   '3' = 1,
                   '4' = 1,
                   'S' = 2,  # Skull
                   'C' = 1,  # Cultist
                   'T' = 1,  # Tablet
                   'L' = 1,  # Elder Thing
                   'A' = 1)  # Autofail
  }
  games <- replicate(replications,
                     {chaos_bag_string_sim(dicebag = dicebag, n = pulls,
                                           replace = !no_replacement)})
  if (average) {
    cat("Average occurance of pattern:",
        mean(stringr::str_count(games, pattern)), "\n")
  } else {
    cat("Probability of occurance of pattern:", mean(grepl(pattern, games)),
        "\n")
  }
  quit()
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "Simulates the Arkham Horror LCG chaos bag.",
    option_list = list(
      make_option(c("--pattern", "-r"), type = "character",
                  help = "Pattern (regular expression) to match"),
      make_option(c("--average", "-a"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "If set, computes average number of occurances"),
      make_option(c("--pulls", "-p"), type = "integer", default = 24,
                  help = "The number of pulls from the chaos bag"),
      make_option(c("--replications", "-N"), type = "integer",
                  default = 10000, help = "Number of replications"),
      make_option(c("--no-replacement", "-n"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Draw tokens without replacement"),
      make_option(c("--dicebag", "-d"), type = "character", default = "",
                  help = "(Optional) dice bag distribution CSV file")
    )
  ))
  names(cl_args)[which(names(cl_args) == "no-replacement")] <- "no_replacement"
  do.call(main, cl_args)
}

Below is an example of a CSV file specifying a chaos bag.

token,freq
E,1
P,1
0,2
1,3
2,2
3,1
4,1
S,2
C,1
T,1
L,1
A,1

Below is some example usage.

$ chmod +x AHChaosBagSimulator.R  # Only need to do this once
$ ./AHChaosBagSimulator.R -r AA -p 160
Probability of occurance of pattern: 0.4074
$ ./AHChaosBagSimulator.R -r "[34TLA]{2}" -a -p 160
Average occurance of pattern: 10.6225

To see documentation, type `./AHChaosBagSimulator.R --help`. Note that something must always be passed to `-r`; this is the pattern to match.

This tool was initially written as a way to explore chaos bag probabilities. I didn't consider the tool to be very useful. It simply helped make the point that unlikely events seem to happen frequently in Arkham Horror. However, I found a way to put the tool to practical use.

Recently, Arkham Horror's designers have been releasing cards that seem to demand more math to fully understand. My favorite Arkham Horror card reviewer, The Man from Leng, has fretted about this in some of his recent reviews about cards using the **seal** mechanic, which change the composition of the chaos bag by removing tokens from it.

I can only imagine how he would feel about a card like Olive McBride (one of the cards appearing in the upcoming Mythos Pack, "Heart of the Elders", to be released later this week.)

Olive McBride allows you to reveal three chaos tokens instead of one, and choose two of those tokens to resolve. This effect can be triggered any time you would reveal a chaos token.

I'll repeat that again: Olive can trigger any time a chaos token is revealed. Most of the time tokens are revealed during skill tests, but other events lead to dipping into the chaos bag too. Notably, *The Dunwich Legacy* leads to drawing tokens without taking skill tests, such as when gambling in "The House Always Wins", due to some encounter cards. This makes Olive McBride a "master gambler", since she can draw three tokens and pick the winning ones when gambling. (She almost breaks the scenario.) Additionally, Olive can be combined with cards like Ritual Candles and Grotesque Statue to further shift skill tests in your favor.

These are interesting situations but let's ignore combos like this for now. What does Olive do to your usual skill test? Specifically, what does she do to the modifiers?

Before going on, I need to address verbiage. When players are about to perform a skill test, typically they will make a statement like "I'm at +2 for this test." This means that the skill value of the investigator (after applying all modifiers) is two higher than the difficulty of the test. Thus, if the investigator draws a -2 or better, the investigator will pass the test; if the investigator draws a -3 or worse, the investigator fails. My play group does not say this; we say "I'm at -2 for this test," meaning that if the investigator sees a -2 or better from the bag, the investigator will pass. This is more intuitive to me, and I also think it translates more directly to math.

Presumably when doing a skill test with Olive, if all we care about is passing the test, we will pick the two tokens drawn from the bag that have the best modifiers. We add the modifiers of those tokens together to get the final modifier. Whether this improves your odds of passing the test or not isn't immediately clear.

I've written an R function that simulates skill tests with Olive. With this we can estimate Olive's effects.

olive_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })
  vals[which(is.nan(vals))] <- Inf
  return(vals)
}

The parameter `translate` gives a vector that translates codes for chaos tokens to numerical modifiers. Notice that the auto-fail token is assigned `-Inf` since it will cause any test to fail. If we wanted, say, the elder sign token to auto-succeed (which is Mateo's elder sign effect), we could replace its translation with `Inf`. By default the function uses the dicebag provided with `chaos_bag_string_sim()`, but this can be changed.

Here's a single final modifier from a pull using Olive.

olive_sim(N = 1)

C01 
 -1 

Interestingly, the function returned a named vector, and the name corresponds to what was pulled. In this case, a cultist, 0, and -1 token were pulled; the resulting best modifier is -1. The function is already built to do this many times.

olive_sim(N = 5)

31S 2S1 LS1 120 C2E 
 -3  -3  -3  -1   0 

Below is a function that can simulate a lot of normal skill tests.

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  return(vecnum)
}

Here's a demonstration.

test_sim(N = 5)

 2  S  3  1  0 
-2 -2 -3 -1  0 

Finally, below is a function that, given a vector of results like those above, produces a table of estimated probabilities of success at skill tests for a vector of values the tests must beat in order to pass (that is, using the "I'm at -2" type of language).

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}

Let's give it a test run.^{4}

u <- test_sim()
est_success_table_gen(u)

     3      2      1      0     -1     -2     -3     -4     -5     -6     -7     -8     -9 
 0.000  0.057  0.127  0.420  0.657  0.759  0.873  0.931  0.931  0.931  0.931  0.931  0.931 
   -10 
 0.931 

as.matrix(est_success_table_gen(u))

      [,1]
3   0.0000
2   0.0590
1   0.1161
0   0.2312
-1  0.4047
-2  0.6489
-3  0.7623
-4  0.8777
-5  0.9390
-6  0.9390
-7  0.9390
-8  0.9390
-9  0.9390
-10 0.9390

This represents the base probability of success. Let's see what Olive does to this table.

y <- olive_sim()
as.matrix(est_success_table_gen(y))

      [,1]
3   0.0213
2   0.0633
1   0.1472
0   0.2552
-1  0.4135
-2  0.5644
-3  0.7352
-4  0.8437
-5  0.9287
-6  0.9678
-7  0.9910
-8  0.9971
-9  1.0000
-10 1.0000

Perhaps I can make the relationship more clear with a plot.

plot(stepfun((-10):3, rev(c(0, est_success_table_gen(u)))), verticals = FALSE,
     pch = 20, main = "Probability of Success", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(y)))), verticals = FALSE,
      pch = 20, col = "blue")

The black line is the estimated probability of success without Olive, and the blue line is the same with Olive. (I tried reversing the x-axis, but for some reason I could not get good results.) What we see is:

- Olive improves the chances a "hail Mary" will succeed. If you need +1, +2, or more to succeed, Olive can help make that happen (though the odds still aren't great)
- Olive can help you guarantee a skill test will succeed. If you boost your skill value to very high numbers, Olive can effectively neuter the auto-fail token. That's a good feeling.
- Otherwise, Olive hurts your chances of success. Being at -2 is noticeably worse with Olive than without. Most of the time, though, she changes the probability of success too little to notice.

For most investigators, then, Olive doesn't do much to make her worth your while. But I think Olive makes a huge difference for two investigators: Father Mateo and Jim Culver.

Both of these investigators like some chaos bag tokens a lot. Father Mateo really likes the elder sign since it is an auto-success (in addition to other very good effects), while Jim Culver likes skulls since they are always 0.

What does Olive do for Father Mateo?

translate <- c("E" = 2, "P" = 1, "0" = 0, "1" = -1, "2" = -2, "3" = -3,
               "4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5, "A" = -Inf)

# Mateo
mateo_translate <- translate
mateo_translate["E"] <- Inf
mateo_no_olive <- test_sim(translate = mateo_translate)
mateo_olive <- olive_sim(translate = mateo_translate)
plot(stepfun((-10):3, rev(c(0, est_success_table_gen(mateo_no_olive)))),
     verticals = FALSE, pch = 20, main = "Mateo's Probabilities",
     ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(mateo_olive)))),
      verticals = FALSE, col = "blue", pch = 20)

as.matrix(est_success_table_gen(mateo_no_olive))

       [,1]
3    0.0590
2    0.0590
1    0.1158
0    0.2349
-1   0.4160
-2   0.6493
-3   0.7664
-4   0.8795
-5   0.9400
-6   0.9400
-7   0.9400
-8   0.9400
-9   0.9400
-10  0.9400

as.matrix(est_success_table_gen(mateo_olive))

       [,1]
3    0.1832
2    0.1832
1    0.2228
0    0.2907
-1   0.4344
-2   0.5741
-3   0.7382
-4   0.8537
-5   0.9306
-6   0.9715
-7   0.9907
-8   0.9975
-9   1.0000
-10  1.0000
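In case the table layout is confusing: `est_success_table_gen()` simply reports, for each net skill value needed (the row names), the fraction of simulated draws that meet or beat it. Here is a self-contained toy illustration; the function definition is copied from the script shared below, while `fake_pulls` is a made-up vector of net modifiers, not real simulation output:

```r
# Copied from the Olive script shared below: P(result >= v) for each needed value v
est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}

fake_pulls <- c(-4, -2, -1, -1, 0, 0, 1, 2)  # hypothetical net modifiers
as.matrix(est_success_table_gen(fake_pulls, values = (-4):2))
```

So the row labeled 0 in the tables is the estimated chance of succeeding when your skill value exactly matches the test's difficulty.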

Next up is Jim Culver.

# Culver
culver_translate <- translate
culver_translate["S"] <- 0
culver_translate["E"] <- 1
culver_no_olive <- test_sim(translate = culver_translate)
culver_olive <- olive_sim(translate = culver_translate)
plot(stepfun((-10):3, rev(c(0, est_success_table_gen(culver_no_olive)))),
     verticals = FALSE, pch = 20, main = "Culver's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(culver_olive)))),
      verticals = FALSE, col = "blue", pch = 20)

as.matrix(est_success_table_gen(culver_no_olive))

       [,1]
3    0.0000
2    0.0000
1    0.1169
0    0.3573
-1   0.5303
-2   0.6439
-3   0.7602
-4   0.8827
-5   0.9454
-6   0.9454
-7   0.9454
-8   0.9454
-9   0.9454
-10  0.9454

as.matrix(est_success_table_gen(culver_olive))

       [,1]
3    0.0000
2    0.0224
1    0.1651
0    0.3476
-1   0.5495
-2   0.6899
-3   0.8160
-4   0.8969
-5   0.9493
-6   0.9737
-7   0.9920
-8   0.9971
-9   1.0000
-10  1.0000

Olive helps these investigators succeed at skill tests more easily, especially Mateo. And we haven't even taken into account the fact that good things happen when certain tokens appear for these investigators! Sealing tokens could also have a big impact on the distribution of the bag when combined with Olive McBride.
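Sealing's effect can be sketched in miniature: remove the sealed tokens from the bag before drawing. This toy `seal()` helper and the toy bag are mine for illustration only; the real simulator would instead adjust the `dicebag` argument of `chaos_bag_string_sim()`:

```r
# Toy chaos bag as a character vector of token symbols (illustrative, not real counts)
bag <- c("E", "P", "0", "0", "1", "1", "2", "3", "S", "S", "C", "T", "A")

# Remove one copy of each sealed token from the bag, if present
seal <- function(bag, tokens) {
  for (tk in tokens) {
    idx <- match(tk, bag)
    if (!is.na(idx)) bag <- bag[-idx]
  }
  bag
}

sealed_bag <- seal(bag, c("A", "T"))     # e.g., seal the auto-fail and the tablet
sample(sealed_bag, 3, replace = FALSE)   # an Olive-style draw from the sealed bag
```

With the auto-fail sealed, the worst case disappears from the draw entirely, which is exactly the kind of shift that would interact strongly with Olive.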

Again, there's a lot more that could be touched on that I won't cover here, so I will share with you a script allowing you to do some of these analyses yourself.

#!/usr/bin/Rscript
#######################################
# AHOliveDistributionEstimator.R
#######################################
# Curtis Miller
# 2018-08-03
# Simulates the chaos bag distribution when applying Olive McBride
#######################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

olive_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })
  vals[which(is.nan(vals))] <- Inf
  return(vals)
}

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  return(vecnum)
}

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}

# Main function
# See definition of cl_args for functionality (help does nothing)
main <- function(dicebag = "", translate = "", replications = 10000,
                 plotfile = "", width = 6, height = 4, basic = FALSE,
                 oliveless = FALSE, lowest = -10, highest = 3,
                 chaos_bag_script = "AHChaosBagSimulator.R",
                 title = "Probability of Success", pos = "topright",
                 help = FALSE) {
  source(chaos_bag_script)

  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag <- as.numeric(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- NULL
  }

  if (translate != "") {
    translate_df <- read.csv(translate, stringsAsFactors = FALSE)
    translate <- as.numeric(translate_df$mod)
    names(translate) <- translate_df$token
  } else {
    translate <- c("E" = 2, "P" = 1, "0" = 0, "1" = -1, "2" = -2, "3" = -3,
                   "4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5,
                   "A" = -Inf)
  }

  if (any(names(translate) != names(dicebag))) {
    stop("Name mismatch between translate and dicebag; check the token columns")
  }

  if (!oliveless) {
    olive_res <- olive_sim(translate = translate, dicebag = dicebag,
                           N = replications)
    cat("Table of success rate with Olive when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(olive_res, values = lowest:highest)))
    cat("\n\n")
  }

  if (basic) {
    basic_res <- test_sim(translate = translate, dicebag = dicebag,
                          N = replications)
    cat("Table of success rate for basic test when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(basic_res, values = lowest:highest)))
    cat("\n\n")
  }

  if (plotfile != "") {
    png(plotfile, width = width, height = height, units = "in", res = 300)
    if (basic) {
      plot(stepfun(lowest:highest,
                   rev(c(0, est_success_table_gen(basic_res)))),
           verticals = FALSE, pch = 20, main = title, ylim = c(0, 1))
      if (!oliveless) {
        lines(stepfun(lowest:highest,
                      rev(c(0, est_success_table_gen(olive_res)))),
              verticals = FALSE, pch = 20, col = "blue")
        legend(pos, legend = c("No Olive", "Olive"),
               col = c("black", "blue"), lty = 1, pch = 20)
      }
    }
    dev.off()
  }

  quit()
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "Estimate the chaos bag distribution with Olive McBride.",
    option_list = list(
      make_option(c("--dicebag", "-d"), type = "character", default = "",
                  help = "(Optional) dice bag distribution CSV file"),
      make_option(c("--translate", "-t"), type = "character", default = "",
                  help = "(Optional) symbol numeric translation CSV file"),
      make_option(c("--replications", "-r"), type = "integer",
                  default = 10000,
                  help = "Number of replications to perform"),
      make_option(c("--plotfile", "-p"), type = "character", default = "",
                  help = paste("Where to save plot (if not set, no plot",
                               "made; -b/--basic must be set)")),
      make_option(c("--width", "-w"), type = "integer", default = 6,
                  help = "Width of plot (inches)"),
      make_option(c("--height", "-H"), type = "integer", default = 4,
                  help = "Height of plot (inches)"),
      make_option(c("--basic", "-b"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Include results for test without Olive McBride"),
      make_option(c("--oliveless", "-o"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Don't include results using Olive McBride"),
      make_option(c("--lowest", "-l"), type = "integer", default = -10,
                  help = "Lowest value to check"),
      make_option(c("--highest", "-i"), type = "integer", default = 3,
                  help = "Highest value to check"),
      make_option(c("--pos", "-s"), type = "character",
                  default = "topright",
                  help = "Position of legend"),
      make_option(c("--title", "-m"), type = "character",
                  default = "Probability of Success",
                  help = "Title of plot"),
      make_option(c("--chaos-bag-script", "-c"), type = "character",
                  default = "AHChaosBagSimulator.R",
                  help = paste("Location of the R file containing",
                               "chaos_bag_string_sim() definition; by",
                               "default, assumed to be in the working",
                               "directory in AHChaosBagSimulator.R"))
    )
  ))

  names(cl_args)[which(names(cl_args) == "chaos-bag-script")] = "chaos_bag_script"

  do.call(main, cl_args)
}

You've already seen an example CSV file for defining the dice bag; here's a file for defining what each token is worth.

token,mod
E,2
P,1
0,0
1,-1
2,-2
3,-3
4,-4
S,-2
C,-3
T,-4
L,-5
A,-Inf

Make the script executable and get help like so:

$ chmod +x AHOliveDistributionEstimator.R
$ ./AHOliveDistributionEstimator.R -h

If you've never heard of this game I love, hopefully you've heard of it now. Give the game a look! And if you have heard of this game before, I hope you learned something from this post. If you're a reviewer, perhaps I've given you some tools you could use to help evaluate some of Arkham Horror's more intractable cards.

My final thoughts on Olive: she's not going to displace Arcane Initiate's spot as the best Mystic ally, but she could do well in certain decks. Specifically, I think that Mateo decks and Jim Culver decks planning on running Song of the Dead will want to run her since there are specific chaos tokens those decks want to see; the benefits of Olive extend beyond changing the probability of success. Most of the time you will not want to make a hail Mary skill test and you won't have the cards to push your skill value to a point where anything but the auto-fail is a success, so most of the time Olive will hurt your chances rather than help you, if she has any effect at all. Thus a typical Mystic likely won't find Olive interesting… but some decks will love her.

By the way, if you are in the Salt Lake City area of Utah, I play Arkham Horror LCG at Mind Games, LLC. While I haven't been to many other stores, Mind Games seems to have the best stocking of Arkham Horror (as well as two other Fantasy Flight LCGs, *A Game of Thrones* and *Legend of the Five Rings*). Every Tuesday night (the store's late night) a group of card game players come in to play; consider joining us! In addition, Mind Games likely has the best deal when it comes to buying the game; whenever you spend $50 on product, you get an additional $10 off (or, alternatively, you take $10 off for every $60 you spend). Thus Mind Games could be the cheapest way to get the game (without going second-hand or Internet deal hunting).

*(Big image credit: Aurore Folney and FFG.)*

**EDIT: WordPress.com does not like R code and garbled some of the code in the script for Olive simulation. It should be correct now. If anyone spots other errors, please notify me; I will fix them.**

**EDIT 2: There were errors in the chaos bag simulator that was relevant to simulating probabilities involving Olive. These errors were rectified and the numbers updated.**

I have created a video course published by Packt Publishing entitled *Applications of Statistical Learning with Python*, the fourth volume in a four-volume set of video courses entitled, *Taming Data with Python; Excelling as a Data Analyst*. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

- You can also play scenarios stand-alone, which isn’t nearly as fun as playing in a campaign, especially one of the eight-scenario campaigns. ↩
- This is not right because sometimes you redraw tokens from the bag without replacement; we'll ignore that case for now. ↩
- Using the asymptotically appropriate confidence interval based on the normal distribution (as described here), the margin of error will not exceed 1.96/(2√n), since p(1 − p) ≤ 1/4. Thus, we have 1.96/(2√n) ≤ 0.01; solving this for n yields n ≥ (1.96/0.02)² = 9604. So if we want a 95% chance that the margin of error will not exceed 0.01, we need a dataset of size 9604. ↩
- Note that the earlier calculations that argued the odds of being accurate up to two decimal places were high no longer hold, since many more numbers are being estimated, and one estimate is not independent of the other. That said, the general principle of more simulations leading to better accuracy still holds. ↩
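The margin-of-error reasoning in the footnotes can be checked numerically in R. This is a sketch assuming the target is accuracy to two decimal places, i.e., a margin of error of at most 0.01:

```r
# Worst-case 95% margin of error for an estimated proportion:
# 1.96 * sqrt(p * (1 - p) / n) <= 1.96 / (2 * sqrt(n)), since p * (1 - p) <= 1/4
moe <- function(n) qnorm(0.975) / (2 * sqrt(n))

moe(9604)    # right at 0.01
moe(10000)   # slightly under 0.01

# Smallest n guaranteeing a worst-case margin of error of at most 0.01
ceiling((qnorm(0.975) / (2 * 0.01))^2)
```

This is consistent with using 10,000 replications as a comfortable round number.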

Last year I started publishing video courses with Packt Publishing. These courses formed a series of introductory courses for data analysis using Python, including *Unpacking NumPy and Pandas*, *Data Acquisition and Manipulation with Python*, and most recently *Training Your Systems with Python Statistical Modelling*. Viewers learned the basics of managing data in Python, getting it from the Internet, and how to apply machine learning to datasets to develop predictive systems.

This final course caps the series off with applications. The first half of the course covers two major areas of AI: natural language processing (NLP) and computer vision (CV). In the NLP section, I introduce basic NLP tasks and show how to use Python’s Natural Language Toolkit (NLTK) for NLP. Then in the CV section I show several CV tasks and how to use libraries from PIL to OpenCV and SciPy. These sections are brief in theory and heavy in application; nearly every video includes an extensive Python application of the concepts and software presented.

The last two sections of the course are complete Python projects. The first project is an NLP project; the objective is to train a spam detector. The second project develops a system for detecting emotions in images. In these projects, I get a dataset, prepare it for processing, apply a machine learning system and evaluate the results. These projects use techniques and concepts from all the previous courses in the series (though one may be able to appreciate the content without having seen the other courses).

Altogether, the course lasts approximately two hours.

The videos in the course–narrated by me–include not only an explanation of the topic at hand but interactive demonstrations, so viewers can *see* how to use the software and follow along if they so desire. The video course includes the Jupyter notebooks I use in my demonstrations; viewers can run my code blocks to replicate my results, and edit them for their own experimentation.

You can buy the course on Packt’s website. If the price of the course is an obstacle, perhaps consider watching it on Mapt, Packt’s subscription service, which gives you access not only to all my courses but *everything* Packt has published (and they publish a ton of stuff), along with one free book of your choice to keep every month (likely without any DRM; just a plain ol’ PDF). (I also hear that Packt’s videos are available on other services, such as Lynda, but don’t quote me on that.)

I thank Packt for publishing this course. I also thank my editor, Viranchi Shetty, for offering feedback and keeping me on schedule. The editors at Packt had a big impact on the final product.

If you like my blog and would like to support me, perhaps consider purchasing the course. If you have no need for it or don’t have the money to spend (which I understand completely; I don’t live a life of glamour myself, being a graduate student), I’d love for you to spread the word about the course. Tell a friend wanting to get started in data analysis or data science, or even share this post on Facebook or Twitter or whatever your preferred social network is. Directing more eyes to the course helps. Write a review if you have watched it; I would love to hear your feedback, both positive and negative (though if negative, be gentle and constructive please).

My website has a new page for the new course here.

Thanks for reading! This is the final video course I have agreed to create, and I don’t plan on creating any more video courses or books in the near future. Creating these courses was a major time investment. In fact, they may have distracted from my studies as a graduate student, which worries me. I don’t plan to write any more courses until I have a Ph.D. Perhaps now that the courses are done I may have more time for blogging as well!

If you want to know more about what the course is like, below are some of the videos included in the course.


I’m reblogging this article mostly for myself. If you’ve been following my blog, you’ll see that recently I published an article on organizing R code that mentioned using packages to organize that code. One of the advantages of doing so is that the work you’ve done is easily distributed. If the methods are novel in some way, you may even get a paper in

*J. Stat. Soft.* or the *R Journal* that helps people learn how to use your software and exposes the methodology to a wider audience. Therefore we should know something about those journals. (I recently got a good reply on Reddit about the difference between these journals.)

When I was considering submitting my paper on psd to J. Stat. Soft. (JSS), I kept noticing that the time from “Submitted” to “Accepted” was nearly two years in many cases. I ultimately decided that was much too long of a review process, no matter what the impact factor might be (*and in two years’ time, would I even care?*). Tonight I had the sudden urge to put together a dataset of times to publication.

Fortunately the JSS website is structured such that it only took a few minutes playing with XML scraping (*shudder*) to write the (R) code to reproduce the full dataset. I then ran a changepoint (published in JSS!) analysis to see when shifts in mean time have occurred. Here are the results:

Top: The number of days for a paper to go from ‘Submitted’ to ‘Accepted’ as a function of the cumulative issue index (each paper is an “issue”…


While I understand the languages I need well enough, I don’t know much about programming best practices^{2}. This ranges from function naming to code organization, along with all the tools others created to manage projects (git, make, ctags, etc.). For short scripts and blog posts, this is fine. Even for a research paper where you’re using tools rather than making new ones, this is okay. But when projects start to get big and highly innovative, my lack of knowledge of programming practices starts to bite me in the butt.

I program with R most of the time, and I’m smart enough to program defensively, writing generalizable functions with a reasonable amount of parameterization and that accept other functions as inputs, thus helping compartmentalize my code and allowing easy changing of parameters. But there’s *a lot* more I can learn, and I have read articles such as

- Jon Zelner’s “Reproducibility starts at home” series
- Chris von Csefalvay’s “The Ten Rules for Defensive Programming in R”
- Robert M. Flight’s series

Not surprisingly there is seemingly contradictory advice. This blog post summarizes this advice and ends with a plea for help for what to do.

I started the current project I’m working on in early 2016 (merciful heavens, has it been *that long?*). My advisor didn’t tell me where it was going; he seemed to pitch it as something to work on over the winter break. But it turned into my becoming one of his collaborators (with a former Ph.D. student of his, now a professor at the University of Waterloo), and my taking charge of all things programming (that is, simulations and applications) for a research paper introducing a new statistic for change point analysis (you can read my earlier post, where I mentioned and introduced the topic for the first time).

To anyone out there wondering why academics write such terrible code, let me break it down for you:

- Academics often are not trained programmers. They learned programming on their own, enough to solve their research problems.
- Academic code often was produced during research. My understanding of professional programming is that often a plan with a project coordinator exists, along with documents coordinating it. While I’m new to research, I don’t think research works in that manner.
- It’s hard to plan for research. Research breaks into new territory, without there being an end goal, since we don’t necessarily know where it will end.^{3}

As the project grew in scope the code I would write acquired features like a boat hull acquires barnacles. Here’s a rough description of how my project is structured (brace yourself):

- The file `ChangepointCommon.r` contains nearly every important function and variable in my project, save for functions that are used for drawing the final PDF plots of the analysis. Not every file uses everything from `ChangepointCommon.r`, but it is called via `source()` frequently. This file has a sister file, `ChangepointCommon.cpp`, for holding the C++ code that underlies some R functions.
- A file called `powerSimulations2.r` is a script that performs all (Monte Carlo) simulations. These simulations are extensive, and I perform them on my school’s supercomputer to take advantage of its 60+ cores and 1 TB of RAM. They simulate our test statistic against multiple similar statistics in a variety of contexts, for the sake of producing power curves at various sample sizes. This script is a derivative of `powerSimulations.r`, which did similar work but while assuming that the long-run variance of the data was known.
- Somewhere there is a file that simulated our test statistic and showed that the statistic would converge in distribution to some random variable under a variety of different contexts. I don’t know where this file went, but it’s somewhere. At least I saved the data.
- But the file that plots the results of these simulations is `dist_conv_plots.r`. I guess it makes plots, but if I remember right this file is effectively deprecated by another file… or maybe I just haven’t needed to make those plots for a long time. I guess I don’t remember why this exists.
- There’s a file `lrv_est_analysis_parallel.r` that effectively does simulations showing that long-run variance estimation is hard, examining the performance of these estimators, which are needed for our test statistics. Again, due to how much simulation I want to do, this is meant to be run on the department’s supercomputer. By the way, none of the files I mentioned above can be run directly from the command line; I’ve been using `source()` to run them.
- `powerSimulations2.r` creates `.rda` files that contain simulation results; these files need to be combined together, and null-hypothesis rejection rates need to be computed. I used to do this with a file called `misc1.R` that effectively saved commands I was doing by hand when this job was simple, but then the job soon became very involved with all those files, so it turned into an abomination that I hated with a passion. It was just two days ago that I wrote functions that did the work `misc1.R` did and added those functions to `ChangepointCommon.r`, then wrote an executable script, `powerSimStatDataGenerator.R`, that accepted CSV files containing metadata (what files to use, what the corresponding statistical methods are, how to work with those methods) and used those files to generate a data file that would be used later.
- Data files for particular applications are scattered everywhere, and I just have to search for them when I want to work with them. We’ve changed applications, but the executable `BankTestPvalComputeEW.R` works with our most recent (and highlighted) application, taking in two CSV files containing data, doing statistical analyses with them, then spitting the results of those analyses into an `.rda` (or is it `.Rda`?) file.
- Finally, the script `texPlotCreation.R` takes all these analyses and draws pictures from them. This file includes functions not included in `ChangepointCommon.R` that are used for generating PDF plots. The plots are saved in a folder called `PaperPDFPlots`, which I recently archived since we redid all our pictures using a different method but I want to keep the pictures made the old way, yet they keep the same file names.
- There is no folder hierarchy. This is all stored in a directory called `ChangepointResearch`. There are 149 files in that directory, of different conventions; to be fair to myself, though, a lot of them were created by LaTeX. There is, of course, the `powerSimulations_old` directory where I saved the old version of the simulations I did, and the `powerSimulations_paper` directory where the most recent versions are kept. Also, there’s `PaperPDFPlots`, where all the plots were stored; the directory has 280 files, all of them plots.
- Finally there’s a directory called `Notebook`. This is a directory containing an attempt at writing a `bookdown` book that would serve as a research notebook. It’s filled with `.Rmd` files that contain code I was writing, along with commentary.

In short, I never organized my project well. I remember once needing to sift through someone else’s half-finished project at an old job, and hating the creator of those directories so much. If someone else attempted to recreate my project without me around I bet they would hate me too. I sometimes hate myself when I need to make revisions, as I finished doing a few days ago.

I need to learn how to not let this happen again in the future, and how to properly organize a project–even if it seems like it’s small. Sometimes small projects turn into big ones. In fact, that’s exactly what happened here.

I think part of the reason this project turned out messy was because I learned what literate programming and R Markdown are around the time I started, and I took the ideas too far. I tried to do everything in `.Rnw` (and then `.Rmd`) files. While R Markdown and Sweave are great tools, and writing code with the documentation is great, one may go too far with it. First, sharing code among documents without copy/paste (which is bad) is difficult to do. Second, in the wrong hands, one can think there’s not much need for organization since this is a one-off document.

Of course, the naming conventions are as inconsistent as possible. I documented functions, and if you read them you’d see that documentation and even commenting conventions varied wildly. Clearly I have no style guide, and no linting or checking that inputs are as expected and on and on and on, a zoo of bad practice. These issues seem the most tractable to resolve:

- Pick a naming convention and stick with it. Do not mix them. Perhaps consider writing your own style guide.
- Use functions whenever you can, and keep them short.
- Use **packrat** to manage dependencies to keep things consistent and reproducible.
- **roxygen2** is the standard for function documentation in R.
- Versioning systems like `git` keep you from holding onto old versions out of fear of needing them in the future. Use versioning software.

Aside from these, though, I’m torn between two general approaches to project management, which I describe below.

I think Jon Zelner’s posts were the first posts I read about how one may want to organize a project for both reproducibility and ease of management. Mr. Zelner suggests approaching a project like how software developers approach an application, but instead of the final product being an executable, the final product is a paper. Some of the tips Mr. Zelner provides include:

- Treat R scripts as executables. That is, your scripts should be executable from the command line after placing a shebang at the beginning and running `chmod +x script.R`. They should accept command-line arguments. (I manage this using **optparse**, though he recommends other tools.)
- Organize your scripts with a standard directory structure (keep all scripts in a directory like `/R`, all data in `/data`, all figures in `/fig`, etc.), and create a `make` file that describes their relationships. The `make` file keeps the project up-to-date, making sure that all dependencies between scripts and data files and figures are managed and repercussions from changes are fed forward appropriately.
- Use `knitr` for report writing, but do the bulk of the work elsewhere.
- Manage reproducibility issues, like dependency issues, using packages; he suggests ones I don’t want to use, so I may look to **packrat** for this.

The projects-as-executables idea resonated with me, and I’ve worked to write fewer `.Rmd` files and more executable `.R` files. You’ll notice that many files in my project are executable; this is an attempt at implementing Mr. Zelner’s ideas. The entire GNU/Linux development pipeline is available to you for your project, and there are many tools meant to make a coder’s life easier (like `make`).

Months ago, when the reviewers of our paper said they wanted more revisions and the responsibility for those revisions fell squarely on me, I thought of fully implementing Mr. Zelner’s ideas, with `make` files and all. But the `make` file intimidated me when I looked at the web of files and their dependencies, including the 200+ PDF files whose names I don’t even know, or all the data files containing power simulations. Sadly, only a couple days ago I realized how to work around that problem (spoiler: it’s not really a problem).

Of course, this would require breaking up a file like `ChangepointCommon.r`. A file shouldn’t include *everything*. There should be a separate file for statistical functions, a file for functions relevant for simulations, a file for functions that transform data, a file for functions that make plots, and so on. This is important to make sure that dependencies are appropriately handled by `make`; you don’t want the exhaustive simulations redone because a function that makes plots was changed. `ChangepointCommon.r` has been acting like a poor man’s package, and that’s not a good design.

That last sentence serves as a good segue to the other idea for managing a research project: Chris von Csefalvay and Robert Flight’s suggestion to write packages for research projects. Csefalvay and Flight suggest that projects should be handled as packages. Everything is a part of a package. Functions are functions in a package. Analyses are vignettes of a package. Data are included in the `/data` directory of the package. Again: everything is a package.

Packages are meant to be distributed, so viewing a project as a package means preparing your project to be distributed. This can help keep you honest when programming; you’re less likely to take shortcuts when you plan on releasing it to the world. Furthermore, others can start using the tools you made and that can earn you citations or even an additional paper in *J. Stat. Soft.* or the *R Journal*. And as Csefalvay and Flight point out, when you approach projects as package development, you get access to the universe of tools meant to aid package development. The documentation you write will be more accessible to you, right from the command line.

My concern is that not everything I want to do seems like it belongs in a package or vignette. Particularly, the intensive Monte Carlo simulations don’t belong in a package. Vignettes don’t play well with the paper that my collaborators actually write or the format the journal wants.

I could get the benefit of Csefalvay and Flight’s approach and Mr. Zelner’s approach by: putting the functions and resulting data sets in a package; putting simple analyses, demonstrations, and *de facto* unit tests in package vignettes; and putting the rest in project directories and files that follow Mr. Zelner’s approach. But if the functions in the highly-unstable package are modified, how will `make` know to redo the relevant analyses?

I don’t know anything about package development, so I could be overcomplicating the issue. Of course, if the work is highly innovative, like developing a new, never-before-seen statistical method (like I’m doing), then a package will eventually be needed. Perhaps it’s best to lean toward that approach anyway for that fact alone.

I would like to hear others’ thoughts on this. Like I said at the beginning, I don’t know much about good practices. If there are any tips, I’d like to hear them. If there’s a way to maximize the benefits of both Mr. Zelner’s and Csefalvay and Flight’s approaches, I’d especially like to hear about it. It may be too late for this project, but I’ll want to keep these ideas in mind for the future to keep myself from making the same mistakes.

I have created a video course published by Packt Publishing entitled *Training Your Systems with Python Statistical Modeling*, the third volume in a four-volume set of video courses entitled, *Taming Data with Python; Excelling as a Data Analyst*. This course discusses how to use Python for machine learning. The course covers classical statistical methods, supervised learning including classification and regression, clustering, dimensionality reduction, and more! The course is peppered with examples demonstrating the techniques and software on real-world data and visuals to explain the concepts presented. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

- I did take classes from the School of Computing at the University of Utah, enough to earn a Certificate in Big Data, but they assumed programming knowledge; they didn’t teach how to program. That’s not the same thing.
- Another problem is that I haven’t been taught to algorithmically analyse my code. Thus, sometimes I write super slow code.
- That said, while you can’t necessarily anticipate the end structure, you could take steps to minimize how much redesigning will need to be done when the unexpected eventually surfaces; this is important to programming defensively.

The Mathematical Sciences Research Institute (MSRI) is a non-profit organization promoting and facilitating mathematical research. MSRI functions independently of UC Berkeley, though it sits just above and off the campus. Many students attending the summer schools ride a bus from the University campus to MSRI along a winding wooded road until reaching the top of the hill. From there, MSRI overlooks the San Francisco Bay and provides an excellent view at tea time.

MSRI enjoys a lovely locale, nestled in woodlands. While one could take a bus to and from the campus, many enjoy walking up and down the mountain trails connecting MSRI with the rest of Berkeley. (I was not one of these people, sadly.) We got to appreciate just how arboreal MSRI’s location is when a roving pack of turkeys attacked the building one morning.

The facility houses a kitchen, dining areas, an area for tea and mingling, and of course an auditorium and library. Naturally, workshop lecturers delivered their talks in the auditorium, and I spent considerable time in the library browsing the collection. (More on that later.) All told, it was a wonderful place to visit.

The workshop effectively consisted of two weeks that covered different topics only loosely connected to one another. The first week was about compressive sensing, while the second week covered machine learning topics and programming.

Jeff Blanchard led the first week’s topic. In short, we learned about solving the equation y = Ax, where y is an m-vector, A is an m × n matrix, x is an n-vector, and m is *much* smaller than n. We also require that x be sparse; that is, most of the entries of x are zero, except for a few. The setup is illustrated below (with this image from Wikipedia).
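To make the setup concrete, here is a small sketch (my own illustration, not taken from the lectures) of recovering a sparse x from measurements y = Ax, with far fewer measurements than unknowns, using Orthogonal Matching Pursuit, a standard greedy algorithm for this problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup: y = A x with A an m x n matrix, m much smaller than n,
# and x k-sparse (only k nonzero entries).
n, m, k = 256, 80, 5
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
support_true = rng.choice(n, size=k, replace=False)
x_true[support_true] = rng.standard_normal(k)
y = A @ x_true

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily build the support of x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Choose the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit by least squares on the columns chosen so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(A, y, k)
print(np.allclose(x_hat, x_true))
```

With a random Gaussian A and k much smaller than m, greedy methods like this recover x exactly with high probability, which is the flavor of the theorems we studied.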

We saw a number of theorems about the problem and when it can be solved, in addition to algorithms for solving the problem and proofs that they work. Prof. Blanchard presented his proofs as digestible and pleasant exercises. I enjoyed the exercises and was willing to do them even outside of the exercise sessions, back in the dorms. (This unfortunately was not always the case.) In fact, just seeing how Prof. Blanchard took a proof and converted it into exercises enlightened me. In the past I have wondered how I can retain more from readings, such as books or papers, where no exercises are offered. Prof. Blanchard’s exercises suggested how I could generate my own exercises from a reading.^{1}

However, I don’t know when *I, personally* will ever use the material I studied. Heck, I barely understood what compressive sensing *is for!* We saw pictures where compressed sensing techniques were able to reconstruct a picture after 90% of the original picture’s information was deleted (set to zero). I asked Prof. Blanchard to explain the pictures to me, and it seems that they were constructed using methods that resembled what we had seen but were much more involved. I suppose this is fine, but I wanted an example application that *I* could implement using the techniques I had learned. I wish we had been given a relatively simple “real world” problem, with a data set, where I could have tried out the compressive sensing techniques. I like the mathematics and I want to see it and study it, but in applied mathematics I appreciate seeing how the methods are used on “real” data (even if that “real” data is fictitious).

*(The above image is from here.)*

Even then, I don’t expect to ever work on compressive sensing problems. That’s not MSRI’s fault; I’m a single-minded person who cares little for the world of mathematics outside of probability/stochastics, statistics (especially econometrics), economics/finance, and statistical modeling. Having said that, I don’t feel like I wasted my time.

Prof. Blanchard left Friday, July 13th, and after a weekend off we returned the next week to study machine learning. I, like many of the other participants I spoke with, was less impressed with the second week. Monday felt like it was filled with the data science talks we were tired of hearing, where problems were presented or methods shown with fancy pictures and impressive results but no discussion of details. Sometimes this is fine, but the crowd consisted of aspiring mathematicians who came to a workshop to dive into the theory of new ideas and grapple with the mathematics that allows these constructs to function. One high-level talk would have been fine had it been followed by two talks diving deeper into that subject.

Then on Tuesday we had crash courses in Python for scientific computing and data analysis. Listeners fell into one of two groups: either they already knew about Python, NumPy, SciPy, and Matplotlib and learned almost nothing (I fell into this group; after all, I have authored four video courses on using Python for data analysis, though I was shaky on how to use Matplotlib), or they knew nothing about any of these tools and learned nothing because the talks moved too fast.

On the third day, Dr. Blake Hunter from Microsoft gave talks that did go deeper into the problems that SVMs, logistic regression, topic modeling, etc., solve, and I enjoyed them. For instance, I learned why the eigenvalues give information about clusters in spectral clustering; I had not seen that theory before (or at least did not understand it the last time I saw it). If Dr. Hunter’s talks had been extended over most of the week (as Prof. Blanchard’s talks on compressive sensing were), he could have gone into more detail (he frequently said he wanted to skip over details and technicalities when asked about them), and I would have felt like I got more out of that week. But the day ended in an exercise session where Dr. Hunter told us to find a data set and do something with it. I downloaded a Kaggle data set that turned out to be too big to handle in the exercise sessions, and when I got back to the dorms I was not willing to reconsider the problem. I would have appreciated more direction.

Thursday’s talks either seemed useless or were so mired in unfamiliar notation that I understood nothing (though I was impressed, whatever it was the speaker was talking about). Friday consisted of participants giving two-minute presentations on their research (I bombed mine, underestimating how quickly two minutes goes) and a panel discussion. The latter was enlightening; it convinced me that I need to attend more conferences.
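The spectral clustering fact I mentioned has a tidy special case worth sketching (my own illustration, not Dr. Hunter’s): when a graph splits into k disconnected components, the graph Laplacian L = D − W has the eigenvalue 0 with multiplicity exactly k, so the smallest eigenvalues count the clusters.

```python
import numpy as np

# Two obvious "clusters": the adjacency matrix of two disjoint triangles.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

# Unnormalized graph Laplacian: L = D - W, with D the degree matrix.
L = np.diag(W.sum(axis=1)) - W
eigvals = np.sort(np.linalg.eigvalsh(L))

# The multiplicity of the eigenvalue 0 equals the number of
# connected components, here the number of clusters.
n_zero = int(np.sum(np.isclose(eigvals, 0.0)))
print(n_zero)  # 2
```

When the clusters are merely weakly connected rather than fully disconnected, those zero eigenvalues become small-but-nonzero, and the corresponding eigenvectors are what spectral clustering feeds to k-means.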

I think the workshop should have been more focused. It would have been best if it were either about compressed sensing or machine learning, rather than half of one and half of the other. If that were not possible, the machine learning week should have been much more focused. Perhaps the second week was doomed because it was the second week and people’s motivation was starting to wane; however, I think the problems were deeper than that. We would have appreciated in-depth discussions throughout the week, building ideas and exploring their theory (again, this was a mathematics workshop; we craved theory), rather than a smattering of loosely related data science talks. Of course, a workshop like this cannot be expected to be polished; cutting-edge ideas are not polished. If the talks had been more concentrated, though, they might have had a stronger effect.

All told, I think the workshop would have been better if a single idea were explained well enough that a participant could come away literate enough to start reading, understanding, and maybe even writing research papers on the topic.

A number of students attended from universities all over the nation, with origins from all over the world, including former Baltimore Ravens guard and center John Urschel. You can watch a video about him below.

I should have done more networking; there were many intelligent people at the workshop and I could have perhaps made friends. Instead, I may have been the most antisocial person there (or at least that’s how I felt).

It didn’t help that the Thursday prior to the workshop I came down with gastroenteritis. I rode the plane with no issues, but the next day (the first day of the workshop) I was not feeling well and was in no mood to interact with people. Tuesday was the last day of recovery, but I was so nervous (and embarrassed) about my condition that I didn’t want to interact with anyone any more than I needed to. That likely sealed my isolation, in addition to some social anxiety. I’m perfectly willing to give a presentation in front of a crowd with no worries, but one-on-one contact with people outside of a professional function always stirs butterflies in me (yes, I’m painfully single too; I couldn’t talk to a new woman if my life depended on it).

I think I need a long time to get comfortable with people, and I did later become more willing to interact with others. I particularly enjoyed going to a nearby bar, the Tap Haus, having drinks and playing games with the other attendees. Perhaps if the workshop went on longer I would have acclimated even better, but I doubt I’ll see any of the other attendees ever again.

I enjoyed the library at MSRI more than I should have. It’s not a particularly large library, but it’s a cozy one devoted to mathematics and filled with interesting books. I looked forward to leaving the lectures and going to the library, finding a book, and enjoying the quiet, pleasant atmosphere. (This is probably another reason I was antisocial; I wanted to be with the books.)

We were required to attend a library orientation, in which I learned about the library’s facilities and also about MathSciNet, a tool I had never heard of and will likely utilize a lot in the future. The books in the library are shelved alphabetically by author. Admittedly this is not a great way to index the library’s contents; organization by subject seems superior. Surprisingly, though, I enjoyed the fact that the books were shelved by author. The result was that I saw titles I never would have seen otherwise, since I would never go to that section of a library willingly. Furthermore, since I couldn’t look at books by subject, I looked for books by authors I knew.

In the spring semester of this year, a student asked me for suggestions for a statistical history topic. The student was taking Prof. Andrejs Treibergs’ History of Mathematics class (Prof. Treibergs was also my instructor for my first statistics classes, MATH 3070 and MATH 3080) and needed to write a paper. I initially suggested the student investigate William Gosset and his work at Guinness (yes, the brewer) that led to the creation of the *t*-test, but this topic was already taken by another student. At a loss, I suggested that the student read David Salsburg’s *The Lady Tasting Tea*, a book about the history of statistics. I recommended the book without having read it myself, and after the student borrowed a copy from the University library I bought a copy for myself. The book was a pleasure to read, and the *idea* of a “history of mathematical ideas” class like Prof. Treibergs’ has been swimming around in my mind in recent months (I even bought a copy of the class’s textbook for myself while I was in San Francisco, from the legendary Moe’s Books).

So when I found *The History of Statistics in the 17th and 18th Centuries*, I was well primed. The book consists of lectures written by Prof. Karl Pearson, considered by many to be the father of mathematical statistics. The book was published posthumously, edited by Karl Pearson’s son, Egon Pearson, the “Pearson” of the Neyman-Pearson lemma. “Neyman” is Prof. Jerzy Neyman, professor at UC Berkeley and the man largely responsible for the confidence interval and hypothesis testing framework now taught in introductory (frequentist) statistics courses. I mention this because many of the books in the library, including *The History of Statistics in the 17th and 18th Centuries*, appear to be from his personal collection.

*(Ronald Fisher’s book,* Statistical Methods for Research Workers; *Fisher is another important figure in statistics history and an unrelenting critic of Jerzy Neyman; notice that this book, an early classic in statistics, appears to have been owned by Jerzy Neyman.)*

I read this book for hours almost every day at MSRI; when lectures were over, I would go to the library, take a seat on the couch, and read. I did not really know there was a history of statistics prior to the 19th century, yet Karl Pearson delightfully describes the ideas, the methods, the people, and the historical context as a good historian should.

The book is long; I only managed to get through the third chapter, in which Karl Pearson describes the life and work of Edmond Halley and Caspar Neumann. Nevertheless, it’s a delight to read.

Consider, for example, Pearson’s reprinting of an account of how Dr. William Petty, a contemporary of John Graunt (perhaps the first demographer, author of the first life table, and the man Prof. Pearson identifies as the father of statistics), purportedly revived a woman from death.

Anne Greene ….. was, at a Sessions held in Oxford, arraigned, condemned, and on Saturday the 14th of December last, brought forth to the place of the Execution [for adultery]; where, after signing of a Psalme, and something said in justification of her self, as to the fact for which she was to suffer, and touching the lewdnesse of the Family wherein she lately lived, she was turn’d off the ladder, hanging by the neck for the space of almost halfe an houre, some of her friends in the mean time thumping her on the breast, others hanging with all their weight upon her leggs; sometimes lifting her up, and then pulling her downe againe with a suddaine jerke, thereby the sooner to dispatch her out of her paine: insomuch that the Under Sherriffe, fearing lest thereby they should breake the rope, forbad them to doe so any longer. At length when everyone thought she was dead, the body being taken downe, and put into a coffin, was carried thence into a prive house, where some Physicians had appointed to make a Dissection. The coffin being opened, she was observed to breathe, and in breathing (the passage of her throat ebing streightened) obscurely to ruttle: which being perceived by a lusty fellow that stood by, he (thinking to doe an act of charity in ridding her of the small reliques of a painfull life) stamped severall times on her breast and stomack with all the force he could. Immediately after, there came in Dr Petty of Brasen-nose-Colledge, our Anatomy Professor, and Mr Thomas Willis of Christ-Churh, and whose comming, which was about 9 o’clock in the morining, she yet persisted to ruttle as before, laying all the while streched out in the coffin in a cold room and season of the yeare. They perceiving some life in her, as well for the humanity as their Profession sake, fell presently to act in order to her recovery. 
First, having caused her to be held up in the coffin, they wrenched open her teeth, which were fast set, and poured into her mouth some hot and cordiall spirits; whereupon she ruttled more than before, and seemed obscurely to cough: then they opened her hands (her fingers also being stifly bent) and ordered some to rub and chafe the extreme parts of her body …..

Whilst the Physicians were thus busie in recovering her to life, the Under-Sheriffe was solliciting the Governor and the rest of the Justices of the Peace for the obtaining her Reprieve, that in case she should for that present be recovered fully to life, shee might not be had backe again to Execution. Whereupon those worthy Gentlemen, considering what had happened, weighing all circumstances, they readily apprehended the hand of God in her preservation, and being willing rather to co-operate with divine providence in saving her, than to overstrain justice by condemning her to double shame and sufferings, they were pleased to grant her a Reprieve.

Thus, within the space of a month, was she wholly recovered: and in the same room where her body was to have been dissected for the satisfaction of a few, she became a great wonder, being revived to the satisfaction of multitudes that flocked thither daily to see her.

I don’t know about you, but for a book on the history of statistics, I’d call that a wild story.

I enjoyed the book so much that I ordered a personal copy, and plan to finish the book. If a student comes to me looking for a mathematical history topic, I can now offer my own ideas for papers.

I may have rambled through this post, wandering from topic to topic, but I can accept that. I wanted to share my thoughts and ideas.

My trip to MSRI got me more interested in my work as a Ph.D. student. When the academic year begins again in late August, I hope to start attacking topics and publishing papers, getting a research program going. MSRI helped give me that taste of the academy that I’ve been missing.

Here’s hoping I can visit UC Berkeley and MSRI again someday.


- That being said, at some point a good researcher should know why she is reading something to begin with. The researcher should come into the reading with a problem in mind already. Undirected reading, or reading for the sake of reading or learning a concept without knowing how one plans to use it, can become a task in and of itself that never ends. I am trying to get myself out of the habit of reading for the sake of reading and into the habit of reading to address a specific problem. I have been thinking about this issue ever since I heard my adviser comment one day that it’s a “bad sign” when a student asks for reading assignments; people should be thinking about questions rather than just reading, as having questions in mind leads to more directed reading and less wasted time. I have even heard that this approach to reading produces better retention anyway. ↩