This isn’t that article. That article is never coming. Oh, well. Maybe for my 1000th post. But for now, holding off writing until I can write that article means I’m just not writing blog posts, and I don’t like that.
So today, I’m writing an article that’s mostly for me, about my sudden leap into the job market.
Since my last blog post the world collapsed into the COVID-19 apocalypse. In 2018 I wrote two blog articles, one suggesting the US was due for a recession before the 2020 election and the other attempting to justify the claim using what I’m calling methods of a “mad statistician.” My forecast stumbled into truth; GDP dropped at an annualized rate of 32.9%, the worst drop ever, and the economy is in a recession. (Seriously, look at the chart in the article; it’s insane, and that drop will likely complicate econometric models for decades to come.) Life for me is not too bad; I’m teaching at home and am still being paid. Meeting the needs of my students during this crisis pushed more work onto me, and I recorded video lectures so I could flip the classroom. (They are available on YouTube, and are 48 hours of videos on statistics and R programming; I should write a second post about these.) The experience has been good; I actually liked the flipped classroom format as it made interacting with students and figuring out where their heads are at easier. I liked engaging with them. But my research also had to take a back seat for weeks, setting me back.
The virus has not materially affected me negatively yet. (Personally and emotionally I’m not doing as well; I moved to Salt Lake City in December partly with the hope to experience city life and meet new people, and then the virus not only prevented that but took me away from the friends I had, and now most days I’m just alone in my apartment without even a cat to cuddle.) But it may have irreparably altered my career path. This is supposed to be my last year as a graduate student, the year I start searching for jobs. I was planning on going to statistics conferences in hopes of connecting with academics to get an academic job; I was going to JSM this year. When JSM went online, I cancelled; the point was to meet new people, and I’m meeting no one in a virtual poster session! (Why didn’t the ASA just cancel JSM? The virtual conference is silly.) When talking about what to do about jobs with my advisor, he first recommended I stay on another year. The academic job market was gutted by the virus, with no universities hiring and old faculty choosing not to retire. Better to stay on another year to ride out the bad market.
A couple weeks ago, when I was emailing him again, he said that this glut was not going to clear up in a year, and the market would be awful for years to come. He basically recommended I look for jobs in other sectors of the economy.
I’m a statistics graduate student because I love mathematical statistics, enough to devote my life to it. I went to graduate school because it felt like the right thing to do for me. Additionally, the fact that the nonacademic job market for individuals with PhDs in statistics is also good helped justify the economics of my choice. I like the idea of being a tenured faculty member advancing the field of statistics, studying and inventing procedures and publishing packages and papers, and not having to ever worry about a dress code. But I also knew that academic jobs can be hard to get. I’ve heard horror stories of postdocs and adjunct faculty (a recent one being the story of Thea Hunter, published in The Atlantic), and want no part of it. I promised myself years ago not to harm myself just to try and stay in academia; it’s not worth it. And besides, “industry” (which for me just means everything but academia) pays better and is much more flexible if only because jobs are far more plentiful.
So goodbye academia. Industry, here I come!
I’m not looking for internships. I did two internships when I was an undergraduate, and I have been a grad student for years. Honestly, I’m tired of “someday” having a real job. I want a real, well-paying job, now. I have been preparing for having a paying job since I was seven. I’ve been in college since 2010. That’s 10 years of college, or 22 years of education overall. I’m getting a PhD in statistics; my advisor says I have materials to complete a thesis, so I’m all-but-dissertation at this point, and by May of next year I’ll be done, dissertation included. So no more preparation. I’m done preparing. I want in. I do want to squeeze out one more paper (two more are in the submission process currently) on functional data analysis, but yeah. No more. Need job now.
I started applying last week. I redid my website, in particular my résumé and my professional information section, as well as the welcome page. I used LaTeX for my résumé, and I hope it’s not too flashy. I also hope it won’t be a problem that it was created with LaTeX; I heard about some employers using automated systems to reject résumés, and one way to get rejected is if the system can’t read the résumé well. I wonder if I already got a rejection for that reason alone, since the rejection seemed to come very fast, only hours after the application.
I don’t know how many jobs I’ve applied to; let’s say around 20. Clearly I’m early in the process, and honestly, I’m nervous. I feel I have good reason to be. I’m almost 29 and leaving a 22-year journey for good. And not only that, I’m rather committed to leaving my home state of Utah, where I grew up. I just don’t feel I fit in here. I want to work in a big coastal city and experience the urban lifestyle. I lie awake at night thinking about what my life in six months will look like, where I will be, whether I will be happy, and how I can stay connected with my family, who are not going anywhere and will not be able to afford buying plane tickets to visit me. And meanwhile there are the jobs themselves. It felt really good to describe my education level as “Doctoral” on Monster.com, but that feeling was fleeting; now I get to read employer expectations and ask myself whether they’re matching me. There’s always something in the job posting that means I’m not a perfect match. Does that mean I don’t stand a chance, PhD or not? The last time I was applying for jobs like this, I was a senior in high school or just graduated, right when the 2008 financial crisis ended and the recession began. I got rejected for cashier jobs at the local grocery store and at Sears. A cashier! Why was I not qualified to be a cashier? I’m in a much different place in life now and far more skilled, but I still remember that first job search, and it was not fun.
So yes, I’m anxious. I even posted on Reddit about my anxieties. But the Reddit comments I found reassuring. It doesn’t matter too much that I don’t know SAS, for example; I probably don’t want to work in a place that’s using SAS or SPSS. In any case, thanks Reddit people. You’re awesome.
So far in my search, while few have responded to my applications, I got four rejections and one interview. The rest have yet to respond (if they ever do). The interview was Friday of last week. And that’s largely why I started writing this blog post.
It’s been years since I did an interview, and this is my first interview for a real data job. I don’t know how it’s going to go. I don’t know what questions will be asked of me. The best I can go for is a Google search on “data job interview questions”. These articles have listed out some potential questions an interviewer may ask. This blog post is an excuse for me to practice answering those questions.
Just for fun, here are my answers. These are coming from the following websites: [1] and [2].
What do data analysts do?
In my view, data analysts use data to learn about phenomena. Data can be used for learning facts, such as what the typical pay of a person with a certain job title is, or what the typical trajectory of a rocket may be, or whether one drug or another is better at reducing blood pressure. Data can also be used for prediction, such as whether someone will develop high blood pressure based on their lifestyle, or what the trajectory of a rocket will be. Sometimes there is a clear question that needs to be answered and the data analyst will help refine the question into something that data can answer, establishing clear hypotheses and designs for studies and data sets. Sometimes there is not a clear question and the data analyst explores data sets to discover patterns that could lead to new hypotheses or beliefs of practitioners. The data analyst not only works with the data directly but also helps his clients both refine their questions and beliefs prior to analysis and effectively communicates conclusions after analysis to an audience that likely does not fully understand the methodology he used, but needs the insight he found.
Why do you want to be a data analyst?
When I was in high school I was on the debate team and the literary magazine. I took math classes but didn’t care for them. I think high-school me would be shocked that these are the types of jobs I’m applying for. But as I went through college I discovered I not only had mathematical talent but was often just as interested in how new discoveries were made as I was in the end result. I enjoy the process of statistical analysis. I feel satisfaction when statistical software produces the final product I was seeking. Nothing has engaged me as much as statistics, so I cannot imagine myself not being a data analyst.
Please talk about a time when you could not meet a deadline.
Sometimes statistical analysis is straightforward, but for the most part statistics is a part of research in one form or another, and research, in my experience, is hard to fix to a timeline. By its nature, it forges into uncharted territory, and thus delays can easily happen. When I was an intern with Voices for Utah Children, we would often make projections for when we thought our projects would be done, only to find those projections too optimistic. This generally happened when we thought we were nearly done, only to find that we were not. We simply were honest about the state of the project, gave reasons for the delay, and did the best we could, but delayed.
For one of my recent papers I was finding myself being pressed against the submission deadline. The paper was simulation intensive and the simulations took days to complete even running in parallel on a supercomputer, and the simulations could not be sped up. I initially tried to scale back my coauthors’ expectations in what could be done within a reasonable time frame, as I was the only person working with code. When we found that some code was incorrect and some simulations were thus corrupt and unusable, we abandoned those simulation contexts to meet the deadline, deciding that they were slightly redundant anyway. And I burned the midnight oil to do what I could to get work done in a timely manner. In the end, we had a product that could be submitted and did address referee concerns, and we just barely made the deadline.
Try to be conservative with time estimates, scale back expectations on what can be done within a time frame, burn the midnight oil when there are things you can do to meet the deadline, and you will be pleasantly surprised when you’re ahead of schedule.
Which area would you prefer to work in and why?
I studied statistical methods that are generally described as econometrics during my career as a PhD student. This area aligned best with my affinity for the social sciences, economics in particular (I have a bachelor’s degree in economics). I also studied time series analysis and models and have been curious about financial data sets for years; my most popular blog posts are about stock data, after all. I am interested in money, like most people, and data dealing directly with money and its behavior intrigues me, hence my interest in time series. But I can see myself working in many areas, as each presents its own challenges. For instance, I care about helping the planet and have started exploring climate data, using functional models (models that deal with data that is understood to be a function over a dense, continuous domain) to explore patterns there.
Which data analysis software are you well-versed in?
I understand R well. I have one R package on CRAN. I have read Hadley Wickham’s books and understand how R views the world, being comfortable with R’s functional aspects and unique approach to OOP. I’m also comfortable using Python and have no problem using Python packages like NumPy, Pandas, and Scikit-Learn.
While I don’t have a computer science background I care greatly about the coding process and writing robust code. This is a skill I had to teach myself through the years as a graduate student. I have personally adopted the organizational paradigm of viewing research projects as package development, since packages provide key tools, such as unit testing protocols and documentation standards, necessary to keeping projects well organized and robust to inevitable changes. Packages also encourage good division of code into digestible functions that help keep the development process flexible.
Because I greatly value flexibility, I generally dislike tools such as R Markdown and Jupyter notebooks. I have no problem using these tools to produce what amounts to some final product, like a paper, but they should not be used in intermediate steps or even beginning steps as they are too rigid. Instead I like to write executable scripts with narrow and specific jobs that can then be fit together. I try to emulate the Unix philosophy of keeping programs small and well designed for one job so that they can be recombined as needed and so that changing the pipeline is not too difficult. Not surprisingly I drifted from using Windows to doing my serious work in (Arch) Linux because that environment is ideal for software development. Tools such as make and git are instrumental in my process, helping to keep things flexible. I also like to do a lot of work with shell scripts, as they’re easy to write and are key to automation. (Read my thoughts on organizing research projects here.)
I prefer to use R and Python for high-level work. When performance bottlenecks emerged in my work, I have used C++, via Rcpp, to clear the performance blockage, as the C++ data structures and general approach (such as the use of pointers to help keep down unnecessary copying of data) produce much faster software. I am not a C++ expert but I can use it well enough to write performance-critical code and thus make the higher-level R code fast.
In the past I have used Stata for econometric work that the R packages did not do well. I also learned MATLAB so that I could review a coauthor’s code and identify why it was not working correctly. I don’t find either of these languages particularly difficult, at least at a surface level.
I used JavaScript to create D3 visualizations for a school project once. That was years ago, but I did enjoy it. I got a 30 minute crash course in Java and JSP for the sake of a databases class and while I was able to complete the project the code was hideous and I’m not proud of it. SQL doesn’t seem hard but I have not needed to use it since 2017. I’m confident I can get up to speed with it again; at the end of the day, it’s for working with relational databases, which I conceptually understand.
After one encounters many programming languages they start to look the same. All told, I’m confident in my practical coding abilities for the purposes of data analysis.
What was your most difficult data analysis project?
My undergraduate thesis on decomposing the gender gap in wages may be the most difficult data analysis project if only because building the wage and selection models was difficult; deciding which variables belong in which equation can drastically affect the end result. My papers studying change point detection procedures were also quite difficult because the algorithms needed to be fine-tuned in order to produce answers fast. That project forced me to learn C++ and think carefully about algorithm design in order to get a product that ran fast enough to be used in millions of simulations.
Take a few minutes to explain how you would estimate how many shoes could potentially be sold in New York City each June.
Let’s start simple. Let’s assume that everyone in New York City owns one pair of shoes at a time and they buy a new pair of shoes once a year. They are equally likely to buy a pair of shoes in each month of the year. So take NYC’s total population and divide it by 12. I would call this a Fermi solution, as it’s using Fermi’s famous ad hoc method to get a quick estimate.
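The arithmetic behind that Fermi estimate is short enough to sketch in R. The population figure below is an assumed round number for illustration, not an official count, and the one-pair-per-year rate is the simplifying assumption stated above.

```r
# Fermi estimate: one pair per person per year, purchases spread evenly over months.
nyc_population <- 8.4e6            # assumed round figure, not an official count
pairs_per_person_per_year <- 1     # simplifying assumption from the text
months_per_year <- 12

# Pairs of shoes sold in a single month (June is treated like any other month).
june_pairs <- nyc_population * pairs_per_person_per_year / months_per_year
june_pairs
```

Changing any one of the three inputs rescales the answer proportionally, which is exactly why the estimate is crude.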
Clearly this approach is crude. Let’s examine why. First, we’re assuming people only want one pair of shoes at a time and only buy one pair of shoes a year. This is wrong because some people wear no shoes (but they’re rare), some people want multiple shoes, and some people need to buy multiple shoes a year even if they only wanted one pair of shoes. Children need to buy many pairs of shoes as they grow out of shoes fast. People often have separate dress shoes and general shoes. Some people like having lots of choice of shoe within a given class. Some occupations are more intensive than others and thus put the shoes through more wear and tear. Women and men may have different shoe preferences and want more or fewer shoes. So we should break down our analysis based on gender, age, and occupation at least. We could also allow for ethnic preferences. Income needs to be incorporated, as low-income individuals are less likely to buy shoes they don’t truly need.
Also, assuming shoe needs are uniform across the year is unrealistic. Shoe sales are likely seasonal and correlated with age. People are not born uniformly through the year; August is a more common birth month than February. We could learn at what ages (in months) people buy new shoes for their children to see when the kids will be getting new shoes. June is the first month of summer; do people start buying more shoes for summer? (For that matter, what counts as a shoe? Are $2 flip-flops shoes? This conversation should be had.) What are the migration patterns for the month of June? How many tourists are buying shoes in June?
We can use questions like these to start building a sophisticated model for shoe sales. We can use time series dynamics to examine the relationship between June and January shoe sales to make forecasts for shoe sales, perhaps using seasonal ARIMA models regressing on the demographic variables mentioned above. Surveys on shoe buying patterns could be conducted to understand consumer preferences and join those preferences with our demographic projections. This could perhaps produce a sophisticated shoe sale forecast.
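To make the seasonal ARIMA idea concrete, here is a minimal sketch in R. I have no shoe sales data, so the built-in AirPassengers series stands in as an arbitrary monthly series with seasonality; the model orders are an assumption (the textbook "airline model"), not a tuned choice, and the demographic regressors are left out (they could be supplied through the xreg argument of arima()).

```r
# Stand-in data: monthly airline passengers shipped with R (NOT shoe sales).
y <- log(AirPassengers)

# Seasonal ARIMA(0,1,1)(0,1,1)[12]; the orders are assumed, not selected from data.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months, then pull out the June value on the original scale.
fc <- predict(fit, n.ahead = 12)
june_forecast <- exp(fc$pred[cycle(fc$pred) == 6])
june_forecast
```

With real sales data the same three steps apply: fit on the monthly history, forecast a year ahead, and read off the month of interest.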
Or we could just survey people and ask how many shoes they bought in June. This would be much simpler, though give us a rather dull fact.
Take a few minutes to explain how you would estimate how many tourists visit Paris every May.
This question is not a good one. It has a problematic framing. Because of global population growth and migration, the number of people visiting Paris every May is not constant. We could talk about an average number, but I don’t think a historical average of the number of people visiting Paris every May is all that meaningful, unless the study is essentially a historical study. Instead we should investigate the trend of the number of May tourists.
We also need to define what it means for someone to be a tourist in Paris. Presumably we must exclude business travel. But does someone being a tourist mean that they are in France to see sights and visit locations for pure entertainment? I like playing competitive card games; if I go to Paris for a tournament, am I a tourist? If a business traveler visits the Eiffel Tower, is she a tourist? These questions need to be answered.
I have never left American borders so I cannot say what tourism in Paris is like. Perhaps we could say “any person who visits selected cultural sites in Paris is a tourist”, which would also be calling Parisians tourists, but maybe we’re okay with that. Perhaps there are turnstiles at some locations counting visitors; if we have that data we should incorporate it. Other locations may not have turnstiles and we may need to literally count each person entering some perimeter as a tourist; modern technology could perhaps provide that count. We could then collect this data every May and aggregate it to produce some tourism metrics.
What is your process when you start a new project?
I first perform a literature review to see what about the topic at hand is already known and what methods have been used before. “Literature review” could include talking to experienced individuals who have knowledge about the topic I don’t. After a literature review I next get the data and examine it. I look at its structure and see what variables are present. I will look at interesting variables and see their basic properties: what is each one’s type, what data is missing and how do I know it’s missing, and what is the data’s range? Is there evidence of outliers or erroneously entered values? I try to load in the data and make some basic visualizations, but before then I’m likely writing some scripts that transform the raw data into something ready for the tools I’m using. The earliest scripts are already being written to stand alone in a pipeline, and I’m already assuming that my code could become a package meant for working with this data set and performing a particular kind of analysis, so I’m making sure that even early code is well-documented and well compartmentalized into functions. I proceed from more basic visualizations and analyses to more complex ones, learning more about the data set as I go. If this project is for prediction, I may early on divide the data into training and testing sets, along with a final test set, so my exploratory analyses are being kept away from testing data.
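The early split into training, testing, and final test sets might look like the following sketch in R. The mtcars data set is only a stand-in, and the 60/20/20 proportions and the seed are assumptions for illustration.

```r
# Stand-in data set; any data frame with one row per observation would do.
dat <- mtcars
n <- nrow(dat)

set.seed(42)                 # arbitrary seed so the split is reproducible
shuffled <- sample(seq_len(n))

# Assumed proportions: 20% final test, 20% testing, remainder training.
n_final <- floor(0.2 * n)
n_test  <- floor(0.2 * n)

final_test <- dat[shuffled[seq_len(n_final)], ]
test       <- dat[shuffled[n_final + seq_len(n_test)], ]
train      <- dat[shuffled[(n_final + n_test + 1):n], ]

# Exploratory work then touches only `train`.
c(train = nrow(train), test = nrow(test), final_test = nrow(final_test))
```

Splitting by shuffled row index before any plotting or modeling is what keeps exploratory decisions from leaking into the held-out data.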
What are your communication strengths?
I think I’m great at presentations. I know how to make complicated ideas accessible to others, and know how to meet my audience. Just watch this video of me giving a presentation for the Utah chapter of Women in Architecture on Utah’s gender gap in wages, a packed house of over 300 looking to be entertained.
How do you handle pressure and stress?
I feel like pressure and stress are a state of mind. If there is a job that needs to be done quickly, I work to the best of my ability to get it done fast, and make it the top priority. That is all one can do. If I cannot meet the deadline, I need to let the relevant people know.
What are your long-term goals?
I want a career as a data analyst. My goals are to get a good statistical job and do good work, perhaps even encountering problems that lead to publications, but publications are not the reason I work.
Why should we hire you?
I should be hired because I am a quick and independent learner with a good work ethic and strong statistical intuition.
Hopefully these are good answers to questions I may encounter. Hopefully I won’t be looking for a job soon. But until that time, I just need to keep applying.
Training Systems using Python Statistical Modeling is now available from Packt Publishing’s website and from Amazon. This book was created by a team at Packt Publishing who took my video course and turned it into book form. If you’re like me and love books that you can hold in your hand, touch, thumb through, etc., and you’re looking to learn about statistics and machine learning methodology as used in Python, give my book a look.
My previous book, Hands-On Data Analysis with NumPy and Pandas, covered the basics of managing data sets in Python using two common tools, NumPy and pandas, along with how to use common tools for Python data analysis such as Anaconda and Jupyter Notebooks. Training Systems using Python Statistical Modeling follows naturally from that book, going from just managing data in Python to drawing inferences and developing useful applications from that data.
In addition to NumPy and pandas, this book shows how to use scikit-learn, SciPy, and statsmodels for statistical inference and machine learning tasks in Python. I start the book discussing basic statistical inference, including hypothesis testing and parameter intervals. This section uses statsmodels the most. Then the book progresses to supervised learning. I start with explaining basic concepts of supervised learning and how one should choose, tune, and evaluate supervised learning algorithms. After these basics I present learning models for classification and regression using scikit-learn, including decision trees, support vector machines, logistic regression, linear regression, Ridge/LASSO regression, and neural networks. After supervised learning comes unsupervised learning. This includes clustering methods such as the k-means algorithm and dimensionality reduction techniques such as principal component analysis. And this is only a broad overview of the techniques and algorithms the book covers; there are far more than I’ve mentioned here!
While the book offers basic explanations about how these methods work, it is not a theoretical book; it primarily moves forward with real data examples. Readers not only see a description of the methods and basic parameters but how to apply them to real data sets. In many ways this is a hands-on introduction to machine learning and statistics. Thus it’s a great resource for anyone looking to start right away developing data models in Python.
I would like to add that of the books I worked on I found this book the most enjoyable to write. While Hands-On Data Analysis with NumPy and Pandas is my personal bestseller it was frankly boring to write; its content is extremely basic and, while essential, does not show the powerful things people can do with data. This book, in comparison, finally shows how data can be useful and shows how to do interesting things with it. It also covers much more material, being nearly twice as long as Hands-On Data Analysis with NumPy and Pandas.
I list the book’s chapters below:
Also, check out the book’s GitHub page to see code samples used in the book.
I would like to thank the staff at Packt Publishing for their work on this book, particularly Joseph Sunil. I was so pleased when I received my copies in the mail and I thank them for their hard work to make this possible.
The MSRP for the book is $27.99, but it is currently on sale for $19.59 (30% off) on Packt’s website as part of their summer sale, so pick it up while it’s cheap! If you’re not interested in buying this particular book but still want access to it, perhaps consider getting a Mapt subscription. You’ll have access to thousands of books and video courses (including all of my content), and can even get one book to keep for free (without DRM) every month! Perhaps that book will be mine! It’s a great deal you should consider. Also, if you read the book, please leave a review (good or bad). These reviews help others decide whether this book is right for them. Good reviews help earn me sales. “Dissatisfied” reviews give me and Packt’s team feedback for later work (or perhaps to errata the book; there have been mistakes in it that slipped through the editing process).
This article is also available in PDF form.
A while back someone posted on Reddit about the grading policies of their academic department. Specifically, the department chair made a statement claiming that grades should be Normally distributed with a C average. I responded, claiming that no statistician would ever take the idea that grades follow a Normal distribution seriously. Some asked for context, and I wrote a long response explaining my position. I repeat that argument here, and also give some R code demonstrations showing what curving grades does.
A cheap shot would be to say that Normal random variables have no minimum or maximum, so since there is a minimum and a maximum grade, grades cannot be Normally distributed. This is a cheap shot because lots of phenomena that are effectively bounded this way are fit to Normal distributions and no one bats an eye, since the probability of being that far away from the mean is vanishingly small (albeit nonzero). However, it could matter to grades, since a larger standard deviation in grades and clumping of grades near the higher end of the distribution could mean that the probability of seeing an impossibly high grade is higher than tolerable if the grades were modeled with a Normal random variable.
Next we should agree the objective of grades is to measure students’ understanding and competency, with an “A” grade meaning “This student has mastered the material” and an “F” meaning “This student is not competent in the material”, which ranges anywhere from the student knowing something about the material but not enough to the student basically knowing nothing at all about the material.^{1} If the class size is somewhat small, it will be hard to see a Normal distribution naturally arise due to the natural fluctuation of students’ innate ability. It’s easy for instructors to get “smart” classes where the students overall are above-average, and also to get classes where the students are not like that. Part of this is just randomness, part is associated with semester and time of day. But there should be some natural variation due to random sampling that can make natural grades not look exactly Normal with some specified mean and variance. This could be less of a problem for “jumbo” classes, though.
Now let’s talk about grading. There’s likely some scheme that awards points to students based on their performance. These schemes are never perfect and always arbitrary but there’s generally some truth in the resulting numbers. Some people say that grading this way produces bimodal distributions, suggesting there are clumps of students that either do or do not get the material. I often observe left-skewed distributions, where most students range between mediocre and good, some are great, and some are catastrophically bad. Neither of these is a feature of Normal distributions.^{2}
So to get a Normal distribution one would have to take grading based off of points then determine each student’s percentile in the class and then see what the corresponding grade would be for the respective percentiles if grades were actually Normally distributed.
Two things: first, this produces a distortion in how points work. You get nonlinear benefits for points scored on anything, from homework to quizzes to tests. Specifically, it’s possible for the third point to be more valuable to your grade than the second point, and the second less valuable than the first. This is technically already true since grades are generally thresholds, but each point has an equal contribution (within its own assignment) to reaching a particular threshold. This will not be the case if grades are curved. Students are going to struggle to understand and appreciate that.
Second: when you’re doing this, the Normal distribution actually doesn’t matter. You’re effectively assigning grades based on percentiles, not Normal percentiles specifically. You could make the grades fit any distribution you want this way, from Normal to uniform to beta to Cauchy to exponential and so on forever. You’re just saying that the lowest 20% will get Fs and the highest 20% will get As (or the lowest 5% will get Fs and the highest 5% will get As, if you’re actually sticking to a Normal distribution). But you’re putting artificial distance between, say, the students at the top; two top students could have mastered the material but one got an extra point because of luck, yet her grade will look much better than the other stellar student’s. This is because of how tails work in the Normal distribution. In fact, if students at the top figured this out, they’d start obsessing over every single point because every point would have a big impact on their grade, which would be bad for their learning and mental health.
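A small sketch in R shows both points at once. The ten students, the Normal parameters, and the uniform range are all hypothetical values chosen for illustration.

```r
# Ten hypothetical students, ranked lowest to highest, with mid-point percentiles.
n <- 10
perc <- (1:n - 0.5) / n

# The same percentiles pushed through two different (assumed) target distributions.
normal_curve  <- qnorm(perc, mean = 75, sd = 9)
uniform_curve <- qunif(perc, min = 50, max = 100)

# The rank order is identical either way; only the spacing between students changes.
identical(order(normal_curve), order(uniform_curve))

# Gap between adjacent students: constant for the uniform target,
# widest in the tails for the Normal target.
round(diff(normal_curve), 2)
round(diff(uniform_curve), 2)
```

The Normal target stretches the distance between neighbors at the very top and bottom of the class, which is the artificial separation described above.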
Let’s see a data example. Here are grades from one of the classes I taught at the University of Utah; I will not say what class, and I added random noise to the grades so they’re not exactly like the class. Also, a small number of grades are fictitious.
grades <- c(10, 24, 32, 41, 49, 54, 67, 67, 70, 70, 72, 74, 76, 77, 77, 77, 78, 80, 80, 81, 85, 85, 86, 88, 90, 91, 91, 91, 91, 92, 92, 92, 93, 96, 98, 98, 98, 101, 102, 102)
Figure 1 presents a histogram of the data in grades. When I see that distribution, it appears to be a left-skewed distribution. Most students are in the 60-100 range, some scored more than 100, and some scored much less than 60.^{3} The median grade is 83, the first quartile 71.5, and the third quartile 92. Overall, not a bad distribution that emerged naturally. (No curve was applied.)
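If we want to make the grades appear to follow a different distribution, we will need to do the following:
1. Obtain each student’s percentile in the class.
2. Decide on the target distribution.
3. Transform the percentiles into grades using the target distribution’s quantile function.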
The following code obtains percentiles:
perc <- (1:length(grades) - 0.5)/length(grades)
Now we need to decide on the target distribution. Supposedly, according to some department chair, grades should be Normally distributed with a mean of a C, which I will take to be 75. That leaves us picking the standard deviation. We probably should pick a standard deviation such that the probability of scoring above 100% is very small; three standard deviations away from the mean should suffice. So we will say that the standard deviation is 9.^{5}
Let’s now transform grades; to avoid extra controversy, we will also round grades up.
curvegrades <- ceiling(qnorm(perc, mean = 75, sd = 9))

These grades are displayed in Figures 2 and 3. Here are some things to notice from doing this:
The curve has a very strong effect at the top of the distribution; two students with likely equivalent skill got very different grades, and the student in third place who appears to be just as skilled as the other two if it were not for luck got a B instead of an A. This appears to be very unfair.
Now we could screw around with the parameters and perhaps get a better distribution at the top of the curve, but that raises the question of why any distribution should be forced onto the data, let alone a Normal one. We could just as easily have swapped qnorm() with qcauchy() and gotten a very different distribution for our scores. The data itself doesn’t suggest it came from a Normal distribution, so what makes the Normal distribution special, above all others?
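Here is a small sketch of that swap. The percentiles are for a hypothetical class of 40, the Normal parameters are the ones used above, and the uniform and Cauchy parameters are arbitrary choices of mine for illustration only.

```r
# Evenly spaced percentiles for a hypothetical class of 40 students
n <- 40
perc <- (1:n - 0.5) / n

# Same percentiles, three different target distributions
curved <- data.frame(
  normal  = qnorm(perc, mean = 75, sd = 9),
  uniform = qunif(perc, min = 50, max = 100),        # arbitrary endpoints
  cauchy  = qcauchy(perc, location = 75, scale = 9)  # arbitrary scale
)

# Every version keeps the class ranking; only the spacing between students changes
ranking_preserved <- sapply(curved, function(scores) all(diff(scores) > 0))
score_ranges <- sapply(curved, range)
ranking_preserved
round(score_ranges, 1)
```

Nothing in the procedure prefers one quantile function over another; the choice of target distribution is entirely up to whoever applies the curve.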
The Normal distribution has a long history, dating back to the beginning of probability theory. It is the prominent distribution in the Central Limit Theorem and many well-known statistical tests, such as the $t$-test and ANOVA. When people talk about “the bell curve” they are almost always referring to the Normal distribution (there is more than one “bell curve” in probability theory). The Fields Medalist Cédric Villani once said in a 2016 TED talk that if the Greeks had known of the Normal distribution they would have worshipped it like a god.
So why does the Normal distribution hold the place it does? For reference, below is the formula for the PDF of the Normal distribution with mean $\mu$ and variance $\sigma^2$:
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
A plot of the Normal distribution is given in Figure 4. At first glance the formula looks complicated, but it’s actually well-behaved and easy to work with. It’s rich in mathematical properties. While in principle any number could be produced by a Normally distributed random variable, in practice seeing anything farther than three standard deviations from the mean is unlikely. It is closed under addition; the sum of two (jointly) Normal random variables is a Normal random variable. And of course it features prominently in the Central Limit Theorem; the sum of IID random variables with finite variance starts to look Normally distributed, and this can happen even when these assumptions are relaxed. Additionally, Central Limit Theorems exist for vectors, functions, and partial sums, and in those cases the limiting distribution is some version of a Normal distribution.
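Two of these properties are easy to check numerically. The sketch below computes the probability of landing more than three standard deviations from the mean, then simulates sums of skewed (exponential) draws to watch the Central Limit Theorem at work; the seed, the number of terms, and the number of replications are arbitrary choices.

```r
# Probability that a Normal draw lands more than three standard deviations from its mean
p_outside_3sd <- 2 * pnorm(-3)
p_outside_3sd

# Central Limit Theorem: standardized sums of exponential (heavily skewed) draws
set.seed(20190729)
n_terms <- 50
sums <- replicate(5000, sum(rexp(n_terms, rate = 1)))
z <- (sums - n_terms) / sqrt(n_terms)  # Exp(1) has mean 1 and variance 1

# Share of standardized sums within three standard deviations of zero
share_within_3sd <- mean(abs(z) <= 3)
share_within_3sd
```

Calling hist(z) on the result should show a shape close to the bell curve, even though each individual draw comes from a very non-Normal distribution.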
Most practitioners, though, do not appreciate the mathematical “beauty” of the Normal distribution; I doubt this is why people would insist grades should be Normally distributed. Well, perhaps that’s not quite true; people may know that the Normal distribution is special even if they themselves cannot say why, and they may want to see Normal distributions appear to keep with a fad that’s been strong since eugenics. But “fad” feels like a cop-out answer, and I think there are better explanations.^{6}
Many people get rudimentary statistical training, and the result is “cargo-cult statistics”, as described by (1);^{7} they practice something that on the surface looks like statistics but lacks true understanding of why the statistical methods work or why certain assumptions were made. People in statistics classes learned about the Normal distribution and their instructors (rightly) drilled its features and its importance into their heads, but the result is that they think data should be Normally distributed since it’s what they know, when in reality data can follow any distribution, usually non-Normal ones.
Additionally, statistics’ most popular tests–in particular, the $t$-test and ANOVA–call for Normally distributed data in order to be applied. And in the defense of practitioners, there are a lot of tests calling for Normally distributed data, especially the ones they learned. But they don’t appreciate why these procedures use the Normal distribution.
The $t$-test and ANOVA, in particular, are some of the oldest tests in existence, having been developed by Fisher and Student in the early twentieth century, and they prompted a revolution in science. But why did these tests use the Normal distribution? I speculate that a parametric test that worked for Normally distributed data was simply low-hanging fruit; assuming the data was Normally distributed was the easiest way to produce a meaningful, useful product. Many tests with the same objectives as the $t$-test and ANOVA have been developed that don’t require Normality, but these tests came later and they’re harder to do. (That said, it’s just as easy to do the $t$-test as it is to do an equivalent nonparametric test these days with software, but software is new and also it’s harder to explain what the nonparametric test does to novices.) Additionally, results such as the Central Limit Theorem cause tests requiring Normality to work anyway in large data sets.
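To illustrate how little effort the nonparametric route takes with software, here is a sketch that runs a two-sample $t$-test and the Wilcoxon rank-sum test on the same simulated data. The data are made up for the demonstration, and the Wilcoxon test is just one example of a nonparametric alternative.

```r
# Two simulated samples whose means differ by one standard deviation
set.seed(1)
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)

# Parametric: Welch two-sample t-test (the default in t.test())
t_result <- t.test(x, y)

# Nonparametric: Wilcoxon rank-sum (Mann-Whitney) test
w_result <- wilcox.test(x, y)

p_values <- c(t_test = t_result$p.value, wilcoxon = w_result$p.value)
p_values
```

Both calls are one line; the harder part is explaining to a novice what the rank-based test is actually comparing.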
Good products often come for Normal things first; generalizations are more difficult and may take more time to be produced and be used. That said, statisticians appreciate the fact that most phenomena are not Normally distributed and that tweaks will need to be made when working with real data. Most people practicing statistics, though, are not statisticians; cargo-cult statistics flourishes.
Since statistics became prominent in science, statisticians have struggled with how to handle their own success and with most statistics being done by non-statisticians. Statistical education is a big topic since statistics is a hard topic to teach well. Also, failure to understand statistics produces real-world problems, from junk statistics to junk science and policy motivated by it. Assuming grades are Normally distributed is but one aspect of this phenomenon, and one that some students unfortunately feel personally.
Perhaps the first step to dealing with such problems is reading an article like this one and appreciating the message. Perhaps it will change an administrator’s mind (but I’m a pessimist). But perhaps the student herself reading this will see the injustice she suffers from such a policy and appreciate why the statisticians are on her side, then commit to never being so irresponsible herself.
In grades, there are ties. This data was rounded; real data would not have such an issue, and presumably an instructor would have access to the original data that wasn’t rounded.
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
I started my first research project as a graduate student when I was only in the MSTAT program at the University of Utah, at the very end of 2015 (or very beginning of 2016; not sure exactly when) with my current advisor, Lajos Horváth. While I am disappointed it took this long, I am glad to say that the project is finished and I am finally published.
Our article, entitled “A new class of change point test statistics of Rényi type”, is available online, published by the Journal of Business and Economic Statistics (or JBES)^{1} (7), but is also available on arXiv for free. In the article we present a class of statistics we call Rényi-type statistics, since they are inspired by (16). These test statistics are tailored for detecting changes that occur near the ends of the sample rather than the middle of the sample. We show that these statistics do well when structural changes happen recently (or early) in the sample, better than other popular methods.
My advisor, Lajos Horváth, and his former graduate student/assistant professor at the University of Waterloo Greg Rice developed the theory for this statistic and most of the paper had already been written when Prof. Horváth invited me to the project. I was responsible for all coding-related matters of the project, including implementing the statistics, developing and performing simulations, and finding and applying the statistic to a data example. I learned a lot in the process. I was forced to learn how to organize code-intensive research projects, thus prompting (9) and (13). I learned how to write R packages and I continue to do so to organize my research projects.
This project produced CPAT, a package for change point analysis. This package exists specifically for implementing test statistics of research interest to this project but should also be useful in general. It includes not only the Rényi-type statistic but also the CUSUM statistic, the Darling-Erdös statistic, instances of the Hidalgo-Seo (6) statistic, and instances of the Andrews (2) statistic. To my knowledge, this is the only implementation of our new statistic and the Andrews procedure in R.
In this article I will be discussing change point testing (focusing on time series), the Rényitype statistics, CPAT, and some examples. I will not go into full depth, including the theory of the statistic and the full simulation results; for that, see the paper we wrote. This topic was also the subject of my first Ph.D. oral defense (10), and the slides can be viewed online.
Suppose we have a series of data $X_1, \ldots, X_T$. In change point hypothesis testing, some aspect of the distribution changes at some unknown point in time $t^*$ (where $1 \le t^* < T$). Often we ask whether the mean is constant over the sample or not. Let $\mu_t = E[X_t]$. We wish to test
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_T$$
against
$$H_A: \mu_1 = \cdots = \mu_{t^*} \neq \mu_{t^*+1} = \cdots = \mu_T$$
Critically, $t^*$ is unknown; if it were known, then we could simply use classic tests known since the turn of the century. Here, though, the location of $t^*$ is unknown. Thus, part of the challenge is determining which of $H_0$ and $H_A$ is true, and part of the challenge is estimating $t^*$.
To some, checking for a change in $\mu_t$ is interesting. However, change point testing extends beyond checking for changes in the mean. In fact, change point testing can attempt to detect just about any structural change in a data set. For instance, we can check for changes in the variance of a data set, or changes in a linear regression model or a time series model. There are tests for changes in functional data and even for checking for changes in distribution.
Why do people care about these things? I can easily see why an online test–where the test is attempting to detect changes as data flows in–would be interesting; I recently read of an online algorithm for detecting a change in distribution that sought to detect changes in radiation levels (a life-or-death matter in some cases) (15). However, that’s not the context I’m discussing here; I’m interested in historical changes. Why is this of interest?
First, there’s the classical example: change point testing was developed for quality control purposes. For example, a batch of widgets would be produced by a machine, and a change point test could determine if the machine ever became miscalibrated during production of the widgets. Beyond that, though, detecting a change point in a sample can be the first step to further research; the researcher would figure out why the change occurred after discovering it. There’s a third reason that every forecaster and user of time series should be aware of, though: most statistical methods using time series data require that the data be stationary, meaning that the properties of the data remain unchanging in the sample. One can hardly justify fitting a model to a data set when not all the data was produced by the same process. So change point testing checks for a particular type of pathology in data sets that everyone using time series data should watch out for.
In this article I’m not interested in estimating $t^*$ but only in deciding between $H_0$ and $H_A$. Change point inference is a rich area of research and several statistics are already well known and actively used. One such statistic is the CUSUM statistic:
$$\hat{A}_T = \hat\sigma^{-1} T^{-1/2} \max_{1 \le t \le T} \left| \sum_{s=1}^{t} X_s - \frac{t}{T} \sum_{s=1}^{T} X_s \right| \tag{1}$$
($\hat\sigma^2$ is an estimator of the long-run variance of $X_t$; I’ll discuss this more later.) In short, the CUSUM statistic compares subsample means of a growing window of data to the overall mean of the sample. If a subsample of the data has a mean very different from the overall mean, the null hypothesis should be rejected.
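The following is a bare-bones sketch of that idea, not CPAT’s implementation. It assumes the textbook form of the CUSUM statistic (the largest absolute gap between the partial sums and their expected share of the total, scaled by the square root of the sample size) and, as a loudly simplifying assumption, uses the plain sample standard deviation in place of a proper long-run variance estimator. The function name cusum_stat() is hypothetical.

```r
# Textbook CUSUM statistic for a change in mean (sketch; not CPAT's code)
cusum_stat <- function(x) {
  n <- length(x)
  partial <- cumsum(x)                        # partial sums of the first t observations
  t <- seq_len(n)
  dev <- abs(partial - t / n * partial[n])    # gap from what a constant mean predicts
  sigma_hat <- sd(x)                          # ASSUMPTION: no serial correlation
  list(statistic = max(dev) / (sigma_hat * sqrt(n)),
       estimate = which.max(dev))             # where the gap is largest
}

set.seed(42)
no_change   <- rnorm(200)
with_change <- c(rnorm(100, mean = 0), rnorm(100, mean = 1))

cusum_stat(no_change)$statistic
cusum_stat(with_change)
```

The location of the largest gap doubles as a natural guess for where the change happened.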
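Let $W(t)$ be a Wiener process and $B(t)$ be a Brownian bridge. If $H_0$ is true, then the limiting distribution of the CUSUM statistic $\hat{A}_T$ in (1) is the Kolmogorov distribution as $T \to \infty$:
$$\hat{A}_T \xrightarrow{D} \sup_{0 \le t \le 1} |B(t)|$$
This is the distribution we use for hypothesis testing. Additionally, we have some power results: if $H_A$ is true and $t^* = \lfloor \theta T \rfloor$ for some $0 < \theta < 1$, then $\hat{A}_T \xrightarrow{P} \infty$. Not only that, $\hat{A}_T$ may have the best power among all statistics with such $t^*$.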
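For the curious, the tail probability of the Kolmogorov distribution has a classical alternating-series form, which is enough to turn a CUSUM statistic into a $p$-value. The sketch below is my own minimal version (the function name and the number of series terms are arbitrary), and the two inputs are the statistic values reported in the CUSUM.test() output for vec2 later in this article.

```r
# Upper tail of the Kolmogorov distribution: P(sup |B(t)| > x)
# Classical series: 2 * sum_{k >= 1} (-1)^(k - 1) * exp(-2 * k^2 * x^2)
kolmogorov_pvalue <- function(x, terms = 100) {
  k <- seq_len(terms)
  p <- 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
  min(max(p, 0), 1)   # guard against round-off; the series is unreliable for x near 0
}

# Statistic values taken from the vec2 outputs shown later
p_default <- kolmogorov_pvalue(1.393)    # default variance estimate
p_kernel  <- kolmogorov_pvalue(0.86106)  # kernel-based variance estimate
c(p_default, p_kernel)
```

These agree with the reported $p$-values (0.04127 and 0.4487) up to rounding of the printed statistics.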
In our paper we call such a $t^*$ “mid-sample” (even though in principle $\theta$ could be small). We were interested more in the case where, as $T \to \infty$, $t^*/T \to 0$ or $t^*/T \to 1$. We call such change points “early” or “late”. The CUSUM statistic is not as effective in those situations, but the subject of our paper was a statistic that is quite effective when the change is early or late:
$$\hat{D}_T = \max_{t_T \le t \le T - t_T} \left| \frac{1}{t} \sum_{s=1}^{t} X_s - \frac{1}{T - t} \sum_{s=t+1}^{T} X_s \right|$$
Here, $t_T$ is a trimming parameter such that $t_T \to \infty$ but $t_T / T \to 0$. What the Rényi-type statistic does is compare the mean of the first part of the data to the mean of the rest of the data. If the difference between the mean of the former part of the data and the mean of the latter part becomes large in magnitude, the null hypothesis of no change should be rejected.
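Here is a rough sketch of that comparison in code. It is not CPAT’s implementation: the function name renyi_stat() is hypothetical, the trimming value is obtained by flooring kn(T), the scaling by the square root of the trimming parameter follows my reading of the limit theory, and the plain sample standard deviation stands in for a proper long-run variance estimator.

```r
# Sketch of a Renyi-type statistic for a change in mean (not CPAT's code)
renyi_stat <- function(x, kn = log) {
  n <- length(x)
  t_T <- max(1, floor(kn(n)))                     # trimming parameter (ASSUMPTION: floored)
  t <- t_T:(n - t_T)                              # candidate change points
  partial <- cumsum(x)
  mean_first <- partial[t] / t                    # mean of the first t observations
  mean_rest  <- (partial[n] - partial[t]) / (n - t)  # mean of the remaining ones
  diffs <- abs(mean_first - mean_rest)
  sigma_hat <- sd(x)                              # ASSUMPTION: no serial correlation
  list(statistic = sqrt(t_T) * max(diffs) / sigma_hat,
       estimate = t[which.max(diffs)])
}

set.seed(42)
early_change <- c(rnorm(5, mean = 0), rnorm(195, mean = 1))
renyi_stat(early_change)
renyi_stat(early_change, kn = sqrt)
```

Because each candidate split gets its own pair of means, a short stretch of unusual observations at either end of the sample is not averaged away the way it is in the CUSUM statistic.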
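In our paper we showed that, when $H_0$ is true, then as $T \to \infty$, the Rényi-type statistic $\hat{D}_T$ satisfies $\hat\sigma^{-1} \sqrt{t_T}\, \hat{D}_T \xrightarrow{D} \max\left(\sup_{0 \le t \le 1} |W_1(t)|, \sup_{0 \le t \le 1} |W_2(t)|\right)$, where $W_1$ and $W_2$ are IID copies of $W$. This establishes the distribution for computing $p$-values. As for $H_A$, we gave conditions on $t^*$ such that when $H_A$ is true, then $\hat\sigma^{-1} \sqrt{t_T}\, \hat{D}_T \xrightarrow{P} \infty$. These conditions not only include most of those for which $\hat{A}_T \xrightarrow{P} \infty$ (such as when $t^* = \lfloor \theta T \rfloor$), but also conditions not covered by the CUSUM statistic in which the change tends to be early or late in the sample. For example, it’s possible to show that, as $T \to \infty$, the CUSUM statistic does not have power asymptotically for some sufficiently early or late change points, but the Rényi-type statistic does when $t_T$ is chosen appropriately (letting $t_T = \log T$ or $t_T = \sqrt{T}$ often works).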
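That limit is convenient for computing $p$-values because the supremum of the absolute value of a Wiener process on the unit interval has a classical series for its CDF, and the maximum of two independent copies just squares it. The sketch below is my own (function names and the number of terms are arbitrary); the inputs are statistic values from the HR.test() output reported later in this article.

```r
# CDF of sup_{0 <= t <= 1} |W(t)| for a Wiener process W (classical series)
sup_abs_wiener_cdf <- function(x, terms = 100) {
  k <- 0:(terms - 1)
  (4 / pi) * sum((-1)^k / (2 * k + 1) *
                   exp(-pi^2 * (2 * k + 1)^2 / (8 * x^2)))
}

# p-value under the limit: the max of two independent copies exceeds d
renyi_pvalue <- function(d) 1 - sup_abs_wiener_cdf(d)^2

# Statistic values taken from the HR.test() outputs shown later
p_vec1 <- renyi_pvalue(2.655)    # reported p-value: 0.03148
p_vec2 <- renyi_pvalue(1.0009)   # reported p-value: 0.8619
c(p_vec1, p_vec2)
```

Both match the reported values up to rounding, which suggests the printed D is the already-normalized statistic.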
As mentioned above, while testing for changes in the mean is nice, we want to check for structural changes not just in the mean but in many statistics and models. And in fact we can. For example, let’s suppose we have a regression model:
$$y_t = x_t^\top \beta_t + \epsilon_t$$
Here, $\epsilon_t$ is a stationary white noise process ($E[\epsilon_t] = 0$ and $\epsilon_t$ has more than two moments) and $x_t$ is a sequence of stationary random vectors. We wish to test
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_T$$
against
$$H_A: \beta_1 = \cdots = \beta_{t^*} \neq \beta_{t^*+1} = \cdots = \beta_T$$
In fact, to do so, all you need to do is:
1. Estimate the regression model on the full sample, as if there were no change.
2. Compute the residuals of the estimated model.
3. Apply the Rényi-type statistic to the residuals.
This procedure does work; the statistic has the same limiting distribution under the null hypothesis as before and the test has power when the null hypothesis is false, if the change in the coefficients and the mean of the regressors also have an appropriate relationship (you can construct situations where the test does not have power, such as when the two are orthogonal). The same procedure can be used with the CUSUM statistic, too, and both procedures have similar relationships to $t^*$: the CUSUM statistic has better power when $t^*$ is mid-sample and the Rényi-type statistic has better power early/late sample.
This idea–use the residuals of the estimated model you’re testing for a change in–works not just for regression models but for time series models, generalized method of moments models, and I’m sure others that we haven’t considered. Later I will demonstrate checking for changes in regression models.
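As a self-contained illustration of the residual idea, here is a sketch on simulated data in which the slope of a simple regression changes halfway through the sample. Everything here (the data-generating process, the use of the plain sample standard deviation, and the hand-rolled CUSUM computation) is an assumption made for the demonstration, not the procedure CPAT runs.

```r
# Simulated regression whose slope changes from 1 to 2 halfway through
set.seed(123)
n <- 200
x <- rnorm(n, mean = 2)
beta <- c(rep(1, n / 2), rep(2, n / 2))
y <- 1 + beta * x + rnorm(n)

# Fit ONE model to the whole sample, ignoring the possible change
fit <- lm(y ~ x)
res <- residuals(fit)

# Apply a mean-change statistic (here, a textbook CUSUM) to the residuals
partial <- cumsum(res)
dev <- abs(partial - seq_len(n) / n * partial[n])
cusum_res <- max(dev) / (sd(res) * sqrt(n))   # ASSUMPTION: no serial correlation
change_guess <- which.max(dev)
c(statistic = cusum_res, estimate = change_guess)
```

The misfit before and after the change pushes the residuals in opposite directions, so a change in the coefficients shows up as a change in the mean of the residuals.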
However, there is an issue I have ignored until now: $\hat\sigma^2$. This is an estimator for the long-run variance of the data. If the null hypothesis were true and there was no serial correlation in the data, one could simply use the sample variance computed on the entire sample. However, even in the uncorrelated case we use an estimator that remains consistent when there is a break in the mean, wherever in the sample the break occurs. When working with time series data, though, we recommend using a kernel-based estimator for the long-run variance.
I will not repeat the estimator here since the formula is quite involved; you can look at the original paper if you’re interested. These estimators are generally robust to heteroskedasticity (such as GARCH behavior) and autocorrelation, though they are far from perfect. In our simulations, we observed significant size inflation when there was noticeable serial correlation in the data since these estimators tend to underestimate the long-run variance. This is not a problem of the Rényi-type statistic alone but of any statistic relying on these estimators for long-run variance estimation. Additionally, practitioners should play around with the kernel and bandwidth parameters to get the best results, since some kernels/bandwidths work better in some situations than others.
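To give a flavor of what a kernel-based estimator looks like, here is the textbook Bartlett-kernel (Newey-West style) long-run variance estimator, assuming a constant mean. This is not the estimator from the paper, which also has to cope with a possible break, and the bandwidth is picked by hand for illustration; the function name bartlett_lrv() is hypothetical.

```r
# Textbook Bartlett-kernel long-run variance estimator (sketch; constant mean assumed)
bartlett_lrv <- function(x, bandwidth) {
  n <- length(x)
  xc <- x - mean(x)
  # Sample autocovariance at lag j
  autocov <- function(j) sum(xc[(j + 1):n] * xc[1:(n - j)]) / n
  h <- floor(bandwidth)
  lags <- seq_len(h)
  weights <- 1 - lags / (h + 1)              # Bartlett (triangular) weights
  autocov(0) + 2 * sum(weights * vapply(lags, autocov, numeric(1)))
}

set.seed(2019)
ar1 <- as.numeric(arima.sim(model = list(ar = 0.4), n = 5000))
c(sample_variance = var(ar1), long_run_variance = bartlett_lrv(ar1, bandwidth = 20))
```

With positive autocorrelation the long-run variance is noticeably larger than the ordinary sample variance, which is why ignoring serial correlation makes test statistics look bigger than they should.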
My contribution to this project was primarily coding, from developing the implementations of these statistics to performing simulations. The result is the package CPAT (11), which is available both on CRAN and on GitHub.
While above I mentioned only the CUSUM test and the Rényi-type test, we investigated other tests as well. For instance, we considered weighted/trimmed variants of the CUSUM test, the Darling-Erdös test, the Hidalgo-Seo test, and Andrews’ test. Many of these tests are available in some form in CPAT, and have similar interfaces when sensible.
library(CPAT)
(ASCII-art banner spelling out “CPAT”)
v. 0.1.0
Type citation("CPAT") for citing this R package in publications
Call me a dork, but I saw a message like this used by the package mclust (17) and Stata and I wanted to do so myself. I think it’s cool and I plan on making messages like these for my packages for no good reason (MCHT (12) has a similar on-load message).
Moving on, let’s create an artificial data set for playing with.
vec1 <- c(rnorm(100, 0), rnorm(100, 1))
Right in the middle of the sample, the mean of the data switches from 0 to 1. Thus there is a change point in the mean.
Let’s first apply the CUSUM test using the function CUSUM.test().
CUSUM.test(vec1)
CUSUM Test for Change in Mean
data: vec1
A = 4.0675, p-value = 8.66e-15
sample estimates:
t*
97
All of the test functions, including CUSUM.test(), produce S3-class objects of class htest, the same objects that the statistical testing functions in stats produce.
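If you have never looked inside one, an htest object is just a named list with a class attribute, and R’s print method does the rest. The sketch below builds one by hand; the method name and data name are invented, and the numbers are copied from one of the outputs in this article purely as filler.

```r
# Build an htest object by hand (all values here are illustrative filler)
fake_result <- structure(
  list(
    statistic = c(A = 1.393),          # named test statistic
    p.value   = 0.04127,
    estimate  = c("t*" = 166),         # named estimate
    method    = "Illustrative Test for Change in Mean",
    data.name = "some_vector"
  ),
  class = "htest"
)
print(fake_result)

# Functions in stats return the same kind of object
t_result <- t.test(rnorm(20))
class(t_result)
names(t_result)
```

This is why the output of the CPAT functions looks so much like the output of t.test(), and why fields such as $statistic and $p.value can be pulled out in the usual way.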
In this situation, the CUSUM test worked well; it rejected the null hypothesis of no change, as it should. It even made a good guess as to the location where the change in mean occurred, estimating that $\hat{t}^* = 97$, which is close to the true location of 100. Okay, let’s try another data set.
vec2 <- as.numeric(arima.sim(200, model = list(order = c(1, 0, 0), ar = 0.4)))
CUSUM.test(vec2)
CUSUM Test for Change in Mean
data: vec2
A = 1.393, p-value = 0.04127
sample estimates:
t*
166
This data set does not have a change in mean, yet the CUSUM test rejected the null hypothesis. Why is that so? This is because the default procedure for estimating the long-run variance will not work if the data is serially correlated, and in this case it is, with autocorrelation parameter 0.4. We need to switch to kernel-based long-run variance estimation.
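A quick back-of-the-envelope calculation shows how much damage an autocorrelation of 0.4 can do. For an AR(1) process with unit innovation variance, the ordinary variance and the long-run variance have simple closed forms; the sketch below just evaluates them (the variable names are mine).

```r
# AR(1) with autocorrelation parameter phi and unit innovation variance
phi <- 0.4

marginal_variance <- 1 / (1 - phi^2)   # what the sample variance estimates
long_run_variance <- 1 / (1 - phi)^2   # what the test statistic actually needs

# Factor by which a statistic is inflated when the wrong variance is used
inflation <- sqrt(long_run_variance / marginal_variance)
c(marginal = marginal_variance, long_run = long_run_variance, inflation = inflation)
```

An inflation factor of roughly 1.5 is easily enough to push an unremarkable statistic past the 5% critical value.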
To turn on kernel-based methods we only need to set use_kernel_var = TRUE, but there are additional choices to make. Specifically, we need to choose the kernel and the bandwidth. There are defaults, but in general CPAT outsources kernel and bandwidth selection to functions from cointReg (3). Thus they use a similar interface.
The default is to use the Bartlett kernel and select the bandwidth using the method recommended by Andrews (1). We can switch to the quadratic-spectral kernel and the Newey-West (14) method like so:
CUSUM.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "nw")
CUSUM Test for Change in Mean
data: vec2
A = 0.86106, p-value = 0.4487
sample estimates:
t*
166
Now we get a much more reasonable $p$-value.
Let’s work an example for detecting structural changes in linear regression models. Another package for change point detection is the package strucchange (18), and it provides a data set tracking US income and expenditures. We can test for structural changes in expenditures like so:
library(strucchange)
library(dynlm)
data(USIncExp)
# Residuals from regressing expenditure growth on income growth
incexpres <- residuals(dynlm(d(log(expenditure)) ~ d(log(income)), data = USIncExp))
CUSUM.test(incexpres, use_kernel_var = TRUE)
CUSUM Test for Change in Mean
data: incexpres
A = 1.5783, p-value = 0.01372
sample estimates:
t*
320
Notice I set use_kernel_var = TRUE; there is some evidence of autocorrelation in the data, though this is a good option to turn on in general since we never know whether data is autocorrelated or not.
Okay, enough about the CUSUM statistic: let’s discuss the Rényi-type statistic. Using the Rényi-type statistic works just like using the CUSUM statistic:
HR.test(vec1)
Horvath-Rice Test for Change in Mean
data: vec1
D = 2.655, log(T) = 5.2983, p-value = 0.03148
sample estimates:
t*
97
The Rényi-type statistic was also able to detect the change in mean in the first sample.^{2} It wasn’t as authoritative as the CUSUM test but it was not incorrect. How about the second sample when we turn on the kernel-based long-run variance estimation?
HR.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "and")
Horvath-Rice Test for Change in Mean
data: vec2
D = 1.0009, log(T) = 5.2983, p-value = 0.8619
sample estimates:
t*
166
The Rényi-type statistic, though, has an additional parameter to set: the trimming parameter, which above was $t_T$. Here the parameter is called kn, and HR.test() expects a function to be supplied to it.
HR.test(vec2, use_kernel_var = TRUE, kn = sqrt)
Horvath-Rice Test for Change in Mean
data: vec2
D = 1.7249, sqrt(T) = 14.142, p-value = 0.3096
sample estimates:
t*
166
We mentioned that the Rényi-type test shines when the change point occurs very early or late in the sample. Here is a demonstration:
vec3 <- c(rnorm(5, mean = 0), rnorm(195, mean = 1))
CUSUM.test(vec3)
HR.test(vec3)
CUSUM Test for Change in Mean
data: vec3
A = 0.803, p-value = 0.5393
sample estimates:
t*
27
Horvath-Rice Test for Change in Mean
data: vec3
D = 2.6722, log(T) = 5.2983, p-value = 0.02991
sample estimates:
t*
5
Here the CUSUM statistic fails to detect the early structural change but the Rényi-type statistic does not.^{3} In general, the CUSUM statistic will have better power when the change occurs near the middle of the sample, while the Rényi-type statistic will have better power when the change happens early or late.
Let’s wrap up our introduction to HR.test() with a demonstration of testing for changes in regression models:
HR.test(incexpres, use_kernel_var = TRUE)
Horvath-Rice Test for Change in Mean
data: incexpres
D = 1.4198, log(T) = 6.2246, p-value = 0.5257
sample estimates:
t*
24
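Finally, let’s consider some other statistics available in CPAT. First, there’s the Darling-Erdös statistic. Let
$$\hat{A}_T(\tau; t_T) = \hat\sigma^{-1} T^{-1/2} \max_{t_T \le t \le T - t_T} \left( \frac{t}{T} \left( 1 - \frac{t}{T} \right) \right)^{-\tau} \left| \sum_{s=1}^{t} X_s - \frac{t}{T} \sum_{s=1}^{T} X_s \right|$$
When $t_T = 1$, this is known as the weighted CUSUM statistic, and in general, for $t_T > 1$ we know this as the weighted and trimmed CUSUM statistic.^{4} This version of the CUSUM statistic should handle early/late change points better, though our (unreported) simulations suggest that the Rényi-type statistic has better power in the extreme cases we considered. The Darling-Erdös statistic is
$$a(\log T)\, \hat{A}_T(1/2; 1) - b(\log T)$$
where $a(x) = \sqrt{2 \log x}$ and $b(x) = 2 \log x + \frac{1}{2} \log \log x - \frac{1}{2} \log \pi$. It exists in CPAT as DE.test().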
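The normalizing constants and the $p$-values in the DE.test() output below can be checked by hand. The sketch assumes the usual Darling-Erdös constants and the usual double-exponential (Gumbel-type) limit for the normalized statistic; the function names are mine, and the inputs are numbers printed in the output below.

```r
# Darling-Erdos normalizing constants (ASSUMPTION: the usual textbook forms)
de_a <- function(x) sqrt(2 * log(x))
de_b <- function(x) 2 * log(x) + 0.5 * log(log(x)) - 0.5 * log(pi)

# p-value under the double-exponential limit: P(limit > stat) = 1 - exp(-2 * exp(-stat))
de_pvalue <- function(stat) 1 - exp(-2 * exp(-stat))

# Sample size of vec1, vec2, and vec3
sample_size <- 200
constants <- c(a = de_a(log(sample_size)), b = de_b(log(sample_size)))
round(constants, 4)

# Statistics taken from the DE.test() output below
c(vec1 = de_pvalue(6.9468), vec2 = de_pvalue(1.1004), vec3 = de_pvalue(1.4832))
```

The constants agree with the printed a(log(T)) and b(log(T)), and the $p$-values agree with the printed ones, so the printed A appears to be the already-normalized statistic.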
DE.test(vec1)
DE.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "nw")
DE.test(vec3)
Darling-Erdos Test for Change in Mean
data: vec1
A = 6.9468, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.001922
sample estimates:
t*
97
Darling-Erdos Test for Change in Mean
data: vec2
A = 1.1004, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.486
sample estimates:
t*
166
Darling-Erdos Test for Change in Mean
data: vec3
A = 1.4832, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.3648
sample estimates:
t*
5
DE.test(incexpres, use_kernel_var = TRUE)
Darling-Erdos Test for Change in Mean
data: incexpres
A = 1.8017, a(log(T)) = 1.9123, b(log(T)) = 3.3864, p-value = 0.2811
sample estimates:
t*
320
Then there’s the Hidalgo-Seo statistic, existing in CPAT as HS.test(). This function was designed for working with univariate data; one could try to use HS.test() for testing for changes in regression models as we did above, but that would not be using the same statistic that Hidalgo and Seo presented.^{5} Additionally, this function does not have the same interface as the others in CPAT; it does not use kernel-based methods for long-run variance estimation but uses the procedure presented in the paper where the residuals are allowed to be serially correlated. Thus, there is a parameter, corr, for controlling whether this approach is used or not (if not, the residuals are treated as if they are IID); as the output below shows, corr = TRUE by default.
HS.test(vec1)
HS.test(vec2, corr = TRUE)
HS.test(vec3)
Hidalgo-Seo Test for Change in Mean
data: vec1
A = 10.638, Correlated Residuals = TRUE, p-value = 0.009749
sample estimates:
t*
97
Hidalgo-Seo Test for Change in Mean
data: vec2
A = 2.0908, Correlated Residuals = TRUE, p-value = 0.505
sample estimates:
t*
166
Hidalgo-Seo Test for Change in Mean
data: vec3
A = 5.7549, Correlated Residuals = TRUE, p-value = 0.1065
sample estimates:
t*
5
Finally, there’s Andrews’ test. This is the most distinct test of the ones presented here. The version implemented here is designed for detecting late changes in the sample only; there is not a version for early or mid-sample changes. Apparently at least one person uses this test, which was surprising to me; I thought it would be the least used. But since someone is using the test, I may need to implement these other versions of Andrews’ test; I give the people what they want!
Andrews’ test does not test the same set of hypotheses we considered above; instead, the version for late changes claims, under the alternative hypothesis, that the change happens after some point $M$ in the sample, and that $M$ is known (but supposedly $t^*$ is not); in short, under $H_A$, $t^* \ge M$. We consider this an odd setup; if this were true and we knew where the change might happen, we could simply use a two-sample $t$-test (or the equivalent for, say, regression models) like students learn in statistics 101 classes and the resulting test would be, provably, the most powerful test.
Nevertheless, CPAT supports this test via the Andrews.test() function. Different versions of the test will be used depending on the input. If the data is univariate, the version run will be specifically made for univariate data. Andrews.test() requires that the parameter M be set, since this is a necessary input for the alternative hypothesis that cannot be guessed; this is the first point in the sample where a change could occur.
Andrews.test(vec1, 100)
Andrews' Test for Structural Change
data: vec1
S = 96.49, m = 100, p-value = 1
Andrews’ procedure does not worry about long-run variance estimation like the other statistics; this is because of the data-dependent subsampling procedure Andrews’ test uses for computing $p$-values. (The setup of the test may be silly, but the end result is fascinating!)
Andrews.test(vec2, 100)
Andrews' Test for Structural Change
data: vec2
S = 100.92, m = 100, p-value = 1
Andrews’ test was designed for detecting changes near the ends of the sample. In this last example, the change occurs in the last 5 observations, and I tell the test that the change will happen after observation 190.
Andrews.test(rev(vec3), 190)
Andrews' Test for Structural Change
data: rev(vec3)
S = 13.253, m = 10, p-value = 0.6188
Again, the Rényi-type statistic comes out on top.
I mentioned that different versions of Andrews’ test will be used depending on the input. If the input to the test is a data.frame, then Andrews.test() will expect the user to supply a formula; it’s expecting to test for a structural change in a regression model.
mod <- dynlm(d(log(expenditure)) ~ d(log(income)), data = USIncExp)
# Extract the differenced series actually used in the fit
X <- as.data.frame(model.frame(mod))
names(X) <- c("exp", "inc")
Andrews.test(exp ~ inc, x = X, M = 300)
Andrews' Test for Structural Change
data: X
S = 1.7764, m = 205, p-value = 0.9792
Now that I’ve shown how to use the main functions of CPAT, I want to demonstrate detecting structural change in a real-world context. There is an example in the paper that demonstrates that the Rényi-type statistic could have detected changes in the Fama-French (5) five-factor model when estimated for a portfolio of banking stocks around the time of the 2008 financial crisis. Additionally, the functions presented here would have detected a change in the relationship between US worker productivity and compensation, which I wrote about before (8).
Let’s look at a different example: beta. Consider a stock; let $r_t$ be the return of this stock at time $t$. Let $r_{f,t}$ be the risk-free rate of return (such as the return from US Treasury notes) and let $r_{M,t}$ be the market return (perhaps the return from investing in an S&P 500 index fund) at time $t$. If you’re familiar with finance, you may have heard of quantities such as alpha and beta. Alpha is the excess return of the stock over the market; that is, how much of a profit one gets for investing in the stock rather than the market (less the risk-free rate of return). Beta, on the other hand, measures how much the stock moves like the market. If a stock’s beta is zero, it has no relation with the market; if its beta is positive, the stock does well when the market does well, on average; and if beta is negative, the stock does well when the market does poorly, on average. If the beta is above one, the stock is more sensitive to the market (so a beta of two suggests that if the market had a return 1% in excess of average, the stock should be 2% in excess of its own average), while a beta less than one suggests a stock less sensitive to the movements of the market (an analogous statement can be made for betas above or below negative one).
I remember learning about beta in my mathematical finance class and a student asking “Why should beta be constant over time?” That’s a great observation, and one that I’m here to address. See, alpha and beta are computed by estimating the parameters $\alpha$ and $\beta$ (for alpha and beta, respectively) in the regression model
$$r_t - r_{f,t} = \alpha + \beta (r_{M,t} - r_{f,t}) + \epsilon_t$$
Thus, checking that alpha and beta are constant over time equates to performing a change point hypothesis test. When estimating $\beta$, it makes sense for people practicing quantitative finance to check for change points. If there is a change point (or perhaps more than one), then using more data won’t make the estimate of $\beta$ better, since some of the data was produced by a different $\beta$ than the most recent one.
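Before touching real data, here is what estimating alpha and beta looks like on simulated excess returns. The “true” alpha and beta, the return distributions, and the variable names are all made up for this sketch.

```r
# Simulated daily excess returns (in percent); all parameters are made up
set.seed(2008)
n <- 500
market_excess <- rnorm(n, mean = 0.03, sd = 1)
true_alpha <- 0.02
true_beta  <- 1.5
stock_excess <- true_alpha + true_beta * market_excess + rnorm(n, sd = 0.5)

# Alpha is the intercept and beta is the slope of this regression
fit <- lm(stock_excess ~ market_excess)
estimates <- coef(fit)
estimates
```

The residuals of a fit like this are exactly what would be handed to a change point test to ask whether alpha and beta stayed put over the sample.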
CPAT comes with a subset of the Fama-French factors needed for estimating Fama-French five-factor models. We will use only some of those factors here to compute the $\beta$ that most investors are familiar with. Since the data set also comes with estimates of the risk-free rate of return, we will use those as well.
Now for the stock. Let’s study Valeant Pharmaceuticals (now known as Bausch Health, in an effort to escape the reputation they rightfully earned), the subject of an episode of Dirty Money (4). Valeant had ticker symbol VRX (now BHC) and after 2008 it had a meteoric rise until Valeant’s business practices of buying drugs and hiking their prices to absurd, unaffordable levels–along with outright fraudulent behavior–attracted unwanted attention, preceding the collapse of the stock.
We can download the data for VRX from Quandl.
```r
library(Quandl)
library(xts)

VRX <- Quandl("WIKI/VRX", type = "xts")
VRX <- VRX$`Adj. Close`
names(VRX) <- "VRX"
```
The Fama-French factors are stored in the data.frame `ff`. Let’s next create an object containing both VRX’s log returns and the Fama-French factors.
```r
data(ff)
ff <- as.xts(ff, order.by = as.Date(rownames(ff), "%Y-%m-%d"))
VRX <- diff(log(VRX)) * 100
VRXff <- merge(VRX, ff, all = FALSE)[-1,]
```
Let’s now compute the alpha and beta of VRX and test for structural change. Note that `RF` is the risk-free rate of return and `Mkt.RF` is the market return less the risk-free rate.
```r
mod <- dynlm(I(VRX - RF) ~ Mkt.RF, data = as.zoo(VRXff["2009/"]))
CUSUM.test(residuals(mod), use_kernel_var = TRUE)
```

```
	CUSUM Test for Change in Mean

data:  residuals(mod)
A = 1.9245, p-value = 0.001214
sample estimates:
  t*
1659
```
The CUSUM test does detect a change. Additionally, it estimates that the change happened in August 2015. This date is a good guess; the first bad news for Valeant occurred in October 2015, when a short seller named Andrew Left claimed that Valeant was engaging in perhaps illegal relations with a pharmacy named Philidor Rx Services.
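As a sanity check on the output above, the following Python sketch computes a textbook CUSUM statistic for a change in mean and the p-value from its limiting (Kolmogorov) distribution. It is an illustration rather than CPAT’s implementation: the function names are hypothetical, and the plain sample standard deviation stands in for the kernel-based long-run variance estimate that `use_kernel_var = TRUE` requests.

```python
import math

def cusum_stat(x):
    """max_t |S_t - (t/T) S_T| / (sigma * sqrt(T)), where S_t is the partial
    sum of x and sigma is the sample standard deviation (an assumption here)."""
    T = len(x)
    mean = sum(x) / T
    sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / T)
    partial, best = 0.0, 0.0
    for v in x:
        partial += v - mean              # running value of S_t - (t/T) S_T
        best = max(best, abs(partial))
    return best / (sigma * math.sqrt(T))

def kolmogorov_pvalue(a, terms=100):
    """P(sup |Brownian bridge| > a) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 a^2)."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * a * a)
                     for k in range(1, terms + 1))

# Statistic reported by CUSUM.test() for the fit on data from 2009 onward
p_full = kolmogorov_pvalue(1.9245)
print(p_full)
```

Plugging in the reported statistic A = 1.9245 gives a p-value of about 0.0012, in line with the printed result.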
But that conclusion comes after having 2016 and 2017 data, when Valeant had clearly fallen from their “prime.” What if it was the end of 2015 and you were recalibrating your models then? Would you have detected a change from the CUSUM test? Or even the Darling-Erdös test?
Let’s find out.
```r
mod2 <- dynlm(I(VRX - RF) ~ Mkt.RF, data = as.zoo(VRXff["2009/2015"]))
CUSUM.test(residuals(mod2), use_kernel_var = TRUE)
DE.test(residuals(mod2), use_kernel_var = TRUE)
```

```
	CUSUM Test for Change in Mean

data:  residuals(mod2)
A = 0.96233, p-value = 0.3126
sample estimates:
  t*
1652


	Darling-Erdos Test for Change in Mean

data:  residuals(mod2)
A = 2.8559, a(log(T)) = 2.0057, b(log(T)) = 3.8000, p-value = 0.1086
sample estimates:
  t*
1690
```
If you were just using the CUSUM test: no, you would not have noticed a change. The Darling-Erdös test is closer, but not convincing.
How about the Rényi-type test?
```r
HR.test(residuals(mod2), use_kernel_var = TRUE, kn = sqrt)
```

```
	Horvath-Rice Test for Change in Mean

data:  residuals(mod2)
D = 3.6195, sqrt(T) = 41.976, p-value = 0.00118
sample estimates:
  t*
1690
```
Yes, you would have detected a change if you were using the Rényi-type statistic.
First: how would I expect people to use the Rényi-type statistic in practice? Well, I’m not expecting the statistic to replace other statistics, such as the CUSUM statistic. The Rényi-type statistic has strengths not well emulated by other test statistics, but it also has relative weaknesses. My thought is that the test should be used along with those other tests. If both tests agree, cool; if they disagree, check where the tests suggest the change occurred and decide if you believe the test that rejected. And you should usually use kernel-based long-run variance estimation methods unless you have good reason not to do so.
We are continuing to study the Rényi-type statistic and hope to get more publications from it. One issue I think we did not adequately address is how to select the trimming parameter. This is an important decision that can change the results of tests, and we don’t have any rigorous suggestions on how to pick the parameter other than that $\log(T)$ and $\sqrt{T}$ seem to work well, though one sometimes works better than the other.
We have a lot of material for those who want to learn more about this subject. There’s the aforementioned paper that goes into more depth than this brief article, and gives more convincing evidence that our new statistic works well in early/late change contexts. Additionally, the CPAT archive contains a version of the package that includes not only the version of the software used for the paper but also the code that allows others to recreate and “play” with our simulations. (In other words I attempted to make our work reproducible.)
If you’re using CPAT, I want to hear your feedback! I got feedback from one user, and not only did it make my day, it also gave me some idea of what I should be working on to release in future versions. And perhaps user questions will turn into future research.
Finally, if you’re working with time series data and are unaware of or not performing change point analysis, I hope that this article was a good introduction to the subject and convinced you to start using these methods. As my example demonstrated, good change point detection can matter even in a monetary way. Using more data won’t help if the old data is no longer relevant.
CPAT and the Rényi-type Statistic: End-of-Sample Change Point Detection in R
`HR.test()` also supplies an estimate for when the change occurred, but unlike for the CUSUM statistic there isn’t theory to back it up and we don’t recommend using it over the CUSUM estimate. It’s the location where the difference in the subsample means is greatest. In later versions we may delete this feature.

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
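Dave’s Donuts offers 14 flavors of donuts (consider the supply of each flavor as being unlimited). The “grab bag” box consists of flavors randomly selected to be in the box, each flavor equally likely for each one of the dozen donuts. What is the probability that at most three flavors are in the grab bag box of a dozen?

For this we will need the multinomial distribution, which is a discrete probability distribution. In a string of characters there are $k$ characters possible to fill one position of the string, which is $n$ characters long. The random variable $X_1$ counts the number of occurrences of character 1 in the string, $X_2$ the number of occurrences of character 2, and so on until $X_k$. Let $p_i$ be the probability that character $i$ appears in a given position of the string; each position is filled independently of the characters in other positions. Let $x_1, \ldots, x_k \in \{0, 1, \ldots, n\}$ such that $x_1 + \cdots + x_k = n$. Then

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}.$$

Here, $k = 14$, $n = 12$, and $p_1 = \cdots = p_{14} = \frac{1}{14}$. So we can say

$$P(X_1 = x_1, \ldots, X_{14} = x_{14}) = \frac{12!}{x_1! \cdots x_{14}!} \cdot \frac{1}{14^{12}}. \tag{1}$$

We will say $N$ is the number of distinct flavors in the box, so that

$$P(N \leq 3) = P(N = 1) + P(N = 2) + P(N = 3).$$

Compute each of those probabilities separately.

If $x_1 = 12$ and $x_2 = \cdots = x_{14} = 0$, there is exactly one flavor in the box. (1) shows the probability this happens is $\frac{1}{14^{12}}$. Since we could pick any $x_i$ to be 12 and there were 14 ways to make this decision, we can say

$$P(N = 1) = \frac{14}{14^{12}} = \frac{1}{14^{11}}.$$

Let’s now compute $P(N = 2)$. We start by fixing $x_3 = \cdots = x_{14} = 0$. We get

$$P(X_3 = \cdots = X_{14} = 0) = \sum_{i = 0}^{12} \binom{12}{i} \frac{1}{14^{12}} = \frac{2^{12}}{14^{12}}. \tag{2}$$

Unfortunately (2) includes cases where there’s actually only one flavor present in the box, so compute

$$P(X_1 \geq 1, X_2 \geq 1, X_3 = \cdots = X_{14} = 0) = \sum_{i = 1}^{11} \binom{12}{i} \frac{1}{14^{12}} \tag{3}$$

$$= \frac{2^{12} - 2}{14^{12}}. \tag{4}$$

Of course we could have picked different variables to fix at zero, and there were $\binom{14}{12}$ ways to pick the variables to fix at zero (or equivalently, $\binom{14}{2}$ ways to pick the variables to not fix at zero), finally yielding

$$P(N = 2) = \binom{14}{2} \frac{2^{12} - 2}{14^{12}}. \tag{5}$$

Now to compute $P(N = 3)$. Again we start by fixing $x_4 = \cdots = x_{14} = 0$ and compute

$$P(X_1 \geq 1, X_2 \geq 1, X_3 \geq 1, X_4 = \cdots = X_{14} = 0) = \sum_{i = 1}^{10} \sum_{j = 1}^{11 - i} \frac{12!}{i! \, j! \, (12 - i - j)!} \cdot \frac{1}{14^{12}}. \tag{6}$$

We could try and use tricks to compute (6) or we can acknowledge that we’re busy people and ask SymPy to do it. Check that the following Python code is correct:

```python
from sympy import init_session, binomial
init_session()

def multinomial(params):
    # Multinomial coefficient, built up recursively from binomial coefficients
    if len(params) == 1:
        return 1
    return binomial(sum(params), params[-1]) * \
        multinomial(params[:-1])

l1 = list()
for i in range(1, 10 + 1):
    v = sum([multinomial([i, j, (12 - i - j)]) for j in range(1, 11 - i + 1)])
    l1.append(v)

sum(l1)/14**12    # Solution
```

The resulting probability is $\frac{519156}{14^{12}} = \frac{129789}{14173478093824}$. We could have picked different flavors to fix, and there were $\binom{14}{3}$ ways to pick the flavors to fix, so we get

$$P(N = 3) = \binom{14}{3} \frac{519156}{14^{12}} = \frac{188972784}{14^{12}}.$$

We can write $P(N = 1)$ and (5) as $\frac{14}{14^{12}}$ and $\frac{372554}{14^{12}}$, respectively. Summing these probabilities yields

$$P(N \leq 3) = \frac{14 + 372554 + 188972784}{14^{12}} = \frac{189345352}{14^{12}} \approx 3.34 \times 10^{-6}.$$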
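The whole calculation can be double-checked with exact arithmetic. The sketch below is not from the original solution; it counts assignments of flavors to the twelve donuts in which exactly $k$ chosen flavors all appear, using inclusion-exclusion, and the helper names `surjections` and `prob_exactly` are made up for this illustration.

```python
from fractions import Fraction
from math import comb

FLAVORS, DONUTS = 14, 12

def surjections(n, k):
    """Ways to assign n labelled donuts to k flavors so every flavor is used
    (inclusion-exclusion over the flavors left out)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))

def prob_exactly(k):
    """Probability that exactly k distinct flavors appear in the box."""
    return Fraction(comb(FLAVORS, k) * surjections(DONUTS, k), FLAVORS ** DONUTS)

p_at_most_three = sum(prob_exactly(k) for k in (1, 2, 3))
print(p_at_most_three, float(p_at_most_three))
```

Here `surjections(12, 3)` plays the role of the double sum of multinomial coefficients evaluated with SymPy, so the two approaches can be compared term by term.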
This is the proper way to obtain the probability that there are at most three flavors in the “grab bag” box, but how many boxes exist in which there are at most three flavors when we discount the number of ways there are to arrange the donuts in a box?
If there’s exactly one flavor, then we pick it and fill the box with that flavor; there are 14 ways to pick one flavor. If there are exactly two flavors in the box, we’ll call them Flavor 1 and Flavor 2. There is at least one donut of Flavor 1 and one of Flavor 2. Now pick the flavors of the remaining ten donuts; order doesn’t matter and there is replacement, so there are $\binom{2 + 10 - 1}{10} = 11$ ways to do that. Then pick the two flavors: there are $\binom{14}{2} = 91$ ways to do that, and thus $91 \times 11 = 1001$ boxes with exactly two flavors. Similarly, for exactly three flavors, there are $\binom{14}{3} \binom{3 + 9 - 1}{9} = 364 \times 55 = 20020$ ways for there to be exactly three flavors. Sum these numbers. (See https://math.stackexchange.com/q/3230011.) There are 21,035 such boxes.
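The counting argument above is short enough to verify directly. This is a small sketch (the function name `boxes_with_exactly` is an assumption for illustration) that uses the equivalent stars-and-bars count of ways to split twelve donuts into $k$ nonempty groups.

```python
from math import comb

def boxes_with_exactly(k, flavors=14, donuts=12):
    """Unordered boxes that use exactly k flavors: choose the k flavors,
    then split the donuts into k positive counts (stars and bars)."""
    return comb(flavors, k) * comb(donuts - 1, k - 1)

counts = {k: boxes_with_exactly(k) for k in (1, 2, 3)}
total_boxes = sum(counts.values())
print(counts, total_boxes)
```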
Special thanks to Math Stack Exchange user wavex for his help with this problem! He provided the following R script for simulating it:
```r
total = 0
for (y in 1:10000000) {
  # Draw one box of 12 donuts with 14 equally likely flavors
  x = rmultinom(1, 12, c(1/14, 1/14, 1/14, 1/14, 1/14,
                         1/14, 1/14, 1/14, 1/14, 1/14, 1/14, 1/14, 1/14, 1/14))
  x <- c(x)
  # Count the flavors that appear at least once
  count = 14
  for (i in x) {
    if (i == 0) {
      count = count - 1
    }
  }
  if (count <= 3) {
    total = total + 1
  }
}
sprintf("%.20f", total / 10000000)
```

In his run of the code this event occurred only 27 out of 10,000,000 times. A rare event indeed!
Last week I analyzed player rankings of the Arkham Horror LCG classes. This week I explain what I did in the data analysis. As I mentioned, this is the first time that I attempted inference with rank data, and I discovered how rich the subject is. A lot of the tools for the analysis I had to write myself, so you now have the code I didn’t have access to when I started.
This post will not discuss rank data modelling. Instead, it will cover what one may consider basic statistics and inference. The primary reference for what I did here is Analyzing and Modeling Rank Data, by John Marden. So far I’ve enjoyed his book and I may even buy a personal copy.
Suppose we have $m$ objects we ask our study participants (also known as “judges”) to rank. For example, suppose we asked people to rank apples, oranges, and bananas. What we then get is a prioritization of these objects according to our judges. This could come in the form

$$(3, 1, 2)$$

and we interpret the number in the $i^{\text{th}}$ position as the ranking of the $i^{\text{th}}$ item. In this case, if the tuple is in the order of apples, oranges, and bananas, then oranges received the highest ranking, bananas the second-highest, and apples the last position.
An alternative view of this data may be

$$(\text{oranges}, \text{bananas}, \text{apples})$$

where the items are arranged in order of preference. This form of describing a ranking has its uses, but we will consider only the first form in this introduction.
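Converting between the two forms is mechanical. Here is a minimal Python sketch of both directions; the function names `ranks_to_order` and `order_to_ranks` are made up for this illustration.

```python
def ranks_to_order(ranks, items):
    """Turn a rank tuple (the rank given to each item) into the list of items
    from most preferred to least preferred."""
    return [item for _, item in sorted(zip(ranks, items))]

def order_to_ranks(order, items):
    """Inverse conversion: the rank of each item, listed in the order of `items`."""
    return [order.index(item) + 1 for item in items]

items = ["apples", "oranges", "bananas"]
order = ranks_to_order((3, 1, 2), items)
ranks = order_to_ranks(order, items)
print(order, ranks)
```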
Ranking data has the following distinguishing characteristics from other data: first, the data is ordinal. All that matters is the order in which items were placed, not necessarily the numbers themselves. We could insist on writing rank data with any other strictly increasing labels in place of 1, 2, 3 and the information content would not have changed. (But of course we would never do this.) Second, every item gets a ranking. This excludes “Choose your top 3 out of 50”-type questions, since not every item would receive a ranking (this is called an incomplete ranking and requires special care; I won’t discuss this type of data in this article). Finally, every item’s ranking is distinct; no ties are allowed.
Thus ranking data is distinct even from just ordinal data since data comes from judges in the form of a tuple, not just a single ordinal value. (Thus we would not consider, say, Likert scale responses as automatically being an instance of rank data.) An ideal method for rank data would account for this unique nature and exploit its features.
From this point on I will be working with the Arkham Horror player class ranking data. I made the `Timestamp` column nonsense to anonymize the data. You can download a CSV file of the data from here, then convert it to a `.Rda` file with the script below (which is intended to be run as an executable):
```r
#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorClassPreferenceSurveyDataCleaner.R
################################################################################
# 2019-02-10
# Curtis Miller
################################################################################
# This file takes a CSV file read in and cleans it for later analysis, saving
# the resulting data in a .Rda file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(input, output = "out.Rda", help = FALSE) {
  input_file <- read.csv(input)
  input_columns <- names(input_file)
  arkham_classes <- c("Survivor", "Guardian", "Rogue", "Seeker", "Mystic")
  for (cl in arkham_classes) {
    names(input_file)[grepl(cl, input_columns)] <- cl
  }
  names(input_file)[grepl("Reason", input_columns)] <- "Reason"
  input_file$Reason <- as.character(input_file$Reason)
  input_file$Timestamp <- as.POSIXct(input_file$Timestamp,
                                     format = "%m/%d/%Y %H:%M:%S",
                                     tz = "MST")
  for (cl in arkham_classes) {
    input_file[[cl]] <- substr(as.character(input_file[[cl]]), 1, 1)
    input_file[[cl]] <- as.numeric(input_file[[cl]])
  }

  survey_data <- input_file

  save(survey_data, file = output)
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = paste("Converts a CSV file with survey data ranking",
                        "Arkham Horror classes into a .Rda file with a",
                        "well-formatted data.frame"),
    option_list = list(
      make_option(c("--input", "-i"), type = "character",
                  help = "Name of input file"),
      make_option(c("--output", "-o"), type = "character",
                  default = "out.Rda",
                  help = "Name of output file to create")
    )
  ))

  do.call(main, cl_args)
}
```
(The script with all the code for the actual analysis appears at the end of this article.)
The first statistic we will compute for this data is the marginals matrix. This matrix simply records the proportion of times an item received a particular ranking in the sample. If we want to get mathematical, if $y = (y_1, \ldots, y_m)$ is a ranking tuple and $y_i$ is the ranking of the $i^{\text{th}}$ option and the sample is $y^{(1)}, \ldots, y^{(n)}$, then the $(i, j)$ entry of the marginals matrix is

$$\hat{M}_{ij} = \frac{1}{n} \sum_{k = 1}^{n} I_{\{y_i^{(k)} = j\}}$$

where the function $I_{\{A\}}$ is 1 if $A$ is true and 0 otherwise. (Thus the sum above simply counts how many times $y_i^{(k)}$ was equal to $j$.)
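In code, the marginals matrix is just a table of counts divided by the sample size. The sketch below uses a made-up sample of four judges ranking three items; the function name `marginals_matrix` is hypothetical and this is not the analysis script used for the survey data.

```python
def marginals_matrix(rankings):
    """Entry [i][j] is the proportion of judges who gave item i the rank j + 1."""
    n, m = len(rankings), len(rankings[0])
    counts = [[0] * m for _ in range(m)]
    for y in rankings:
        for item, rank in enumerate(y):
            counts[item][rank - 1] += 1
    return [[c / n for c in row] for row in counts]

# Made-up sample: each tuple gives the ranks one judge assigned to items 1, 2, 3
sample = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (3, 1, 2)]
M = marginals_matrix(sample)
for row in M:
    print(row)
```

Because every judge uses each rank exactly once, every row and every column of the matrix sums to one.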
The marginals matrix for the Arkham Horror data is given below
```
MARGINALS
-----------------------------------------
             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84
```
Below is a visual representation of the marginals matrix.
From the marginals matrix you could compute the vector representing the “mean” ranking of the data. For instance, the mean ranking of the Guardian class is the sum of the ranking numbers (column headers) times their respective proportions (in the Guardian row); here, that’s about 2.9 for Guardians. Repeat this process for every other group to get the mean ranking vector; here, the mean rank vector is about $(2.92, 3.10, 3.16, 2.60, 3.22)$ (keeping the ordering of the classes suggested by the rows above, which is alphabetical order; this will always be the ordering I use unless otherwise stated). Of course this is not a ranking vector; rankings are integers. The corresponding ranking vector would be to rank the means themselves; this gives a ranking vector of $(2, 3, 4, 1, 5)$.
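The arithmetic is easy to reproduce from the marginals table itself. The following sketch works from the rounded percentages printed above (so the means are only accurate to about two decimal places), and the variable names are mine.

```python
classes = ["Guardian", "Mystic", "Rogue", "Seeker", "Survivor"]
# Rows of the marginals table above, in percent (columns are ranks 1 to 5)
marginals_pct = [
    [18.29, 20.43, 26.84, 19.71, 14.73],
    [19.71, 18.29, 17.81, 20.90, 23.28],
    [19.24, 14.73, 20.67, 21.38, 23.99],
    [28.03, 25.18, 17.10, 18.53, 11.16],
    [14.73, 21.38, 17.58, 19.48, 26.84],
]

# Mean rank = sum over ranks of (rank * proportion receiving that rank)
mean_ranks = [sum((j + 1) * p / 100 for j, p in enumerate(row))
              for row in marginals_pct]

# Rank the means themselves: the smallest mean gets rank 1
by_mean = sorted(range(len(classes)), key=lambda i: mean_ranks[i])
rank_of_means = [by_mean.index(i) + 1 for i in range(len(classes))]
print([round(v, 2) for v in mean_ranks], rank_of_means)
```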
I don’t like inference using the mean ranking vector. As mentioned above, this data is ordinal; that means the magnitude of the numbers themselves should not matter. We could replace 1, 2, 3, 4, 5 with 1, 10, 100, 1000, 10000 and the data would mean the same thing. That is not the case if you’re using the mean rank unless you first apply a transformation to the rankings. In short, I don’t think that the mean ranking vector appreciates the nature of the data well. And since the marginals matrix is closely tied to this notion of “mean”, I don’t think the matrix is fully informative.
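Another matrix providing descriptive statistics is the pairs matrix. The matrix records the proportion of respondents who preferred one option to the other (specifically, the row option to the column option). Mathematically, with $y^{(1)}, \ldots, y^{(n)}$ denoting the sampled ranking tuples, the $(i, j)$ entry of the pairs matrix is

$$\hat{K}_{ij} = \frac{1}{n} \sum_{k = 1}^{n} I_{\{y_i^{(k)} < y_j^{(k)}\}}.$$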
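As with the marginals, the pairs matrix is a matter of counting. This is a small sketch on a made-up sample of four judges and three items; `pairs_matrix` is a hypothetical name, not a function from the analysis script.

```python
def pairs_matrix(rankings):
    """Entry [i][j] is the proportion of judges who ranked item i ahead of
    item j (that is, gave item i the smaller rank number)."""
    n, m = len(rankings), len(rankings[0])
    return [[sum(y[i] < y[j] for y in rankings) / n for j in range(m)]
            for i in range(m)]

# Made-up sample: each tuple gives the ranks one judge assigned to items 1, 2, 3
sample = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (3, 1, 2)]
K = pairs_matrix(sample)
for row in K:
    print(row)
```

Since ties are not allowed, the entries at $(i, j)$ and $(j, i)$ always add up to one for $i \neq j$.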
The pairs matrix for the Arkham Horror data is below:
```
PAIRS
-------------------------------------------------
         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00
```
First, notice that the diagonal entries are all zero; this will always be the case. Second, the pairs matrix is essentially completely determined by the entries above the diagonal of the matrix. Other forms of inference use these upper-diagonal entries and don’t use the lower-diagonal entries since they give no new information. The number of upper-diagonal entries is $\binom{m}{2}$, which is the number of ways to pick pairs of classes.
The pairs matrix for the Arkham Horror data is visualized below.
With the pairs matrix, crossing above or below 50% of the sample being in the bin is a significant event; it indicates which classes are preferred to the other. In fact, by counting how many times this threshold was crossed, we can estimate that the overall favorite class is the Seeker class, followed by Guardians, then Mystics, then Rogues, and finally Survivors. This is another estimate of the “central”, “modal”, or “consensus” ranking. (This agrees with the “mean” ranking, but that’s not always going to be the case; the metrics can disagree with each other.)
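The threshold-counting estimate can be reproduced straight from the printed pairs table. This sketch simply counts, for each class, how many opponents it beats more than 50% of the time; the variable names are mine.

```python
classes = ["Guardian", "Mystic", "Rogue", "Seeker", "Survivor"]
# Rows of the pairs table above, in percent (row class preferred to column class)
pairs_pct = [
    [0.00, 54.16, 55.34, 42.52, 55.82],
    [45.84, 0.00, 51.07, 39.90, 53.44],
    [44.66, 48.93, 0.00, 38.72, 51.54],
    [57.48, 60.10, 61.28, 0.00, 61.52],
    [44.18, 46.56, 48.46, 38.48, 0.00],
]

# Number of head-to-head comparisons each class wins
wins = {c: sum(p > 50 for p in row) for c, row in zip(classes, pairs_pct)}
consensus = sorted(classes, key=lambda c: -wins[c])
print(wins, consensus)
```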
While I did not like the marginals matrix I do like the pairs matrix; I feel as if it accounts for the features of rank data I want any measures or inference to take account of. It turns out that the pairs matrix is also related to my favorite distance metric for analyzing rank data.
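A distance metric is a generalized notion of distance, or “how far away” two objects $x$ and $y$ are. In order for a function $d$ to be a metric, it must have the following properties:

1. $d(x, y) \geq 0$ for all $x$ and $y$;
2. $d(x, y) = 0$ if and only if $x = y$;
3. $d(x, y) = d(y, x)$ (symmetry); and
4. $d(x, z) \leq d(x, y) + d(y, z)$ (the triangle inequality).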
The notion of distance you use in everyday life, the one taught in middle-school geometry and computed whenever you use a ruler, is known as Euclidean distance. It’s not the only notion of distance, though, and may not be the only distance function you use in real life. For instance, Manhattan or taxi cab distance is the distance from one point to another when you can only make 90-degree turns and is the distance that makes the most sense when travelling in the city.
There are many distance metrics we could consider when working with rank data. The Spearman distance is the square of the Euclidean distance, while the footrule distance corresponds to the Manhattan distance. It turns out that the mean rank vector above minimizes the sum of Spearman distances. The distance metric I based my analysis on, though, was the Kendall distance. I like this distance metric since it is not connected to the mean and considers the distance between the rankings $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$ to be greater than the distance between $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ (unlike, say, the Hamming distance, which gives the same distance in either case).
Kendall’s distance even has an interpretation. Suppose that two ranking tuples are seen as the ordering of books on a bookshelf. We want to go from one ordering of books to another ordering of books. The Kendall distance is how many times we would need to switch adjacent pairs of books (chosen well, so as not to waste time and energy) to go from one ordering to the other. Thus the Kendall distance between $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ is one; we only need to make one swap. The distance between $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$, in comparison, is seven, since we need to make seven swaps.
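The Kendall distance can be computed without simulating any swaps, since the minimum number of adjacent swaps equals the number of item pairs that the two rankings order differently. Below is a minimal sketch of that computation alongside the Hamming distance; the function names and the example tuples are chosen for illustration.

```python
from itertools import combinations

def kendall_distance(y, z):
    """Number of item pairs that the two rank tuples order differently."""
    return sum((y[i] < y[j]) != (z[i] < z[j])
               for i, j in combinations(range(len(y)), 2))

def hamming_distance(y, z):
    """Number of positions at which the two rank tuples differ."""
    return sum(a != b for a, b in zip(y, z))

base = (1, 2, 3, 4, 5)
neighbors_swapped = (2, 1, 3, 4, 5)   # first and second items trade places
ends_swapped = (5, 2, 3, 4, 1)        # first and last items trade places

near = kendall_distance(base, neighbors_swapped)
far = kendall_distance(base, ends_swapped)
print(near, far)
print(hamming_distance(base, neighbors_swapped), hamming_distance(base, ends_swapped))
```

The Hamming distance sees two changed positions in both cases, while the Kendall distance separates the small disagreement from the large one.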
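It also turns out that the Kendall distance is related to the pairs matrix. Writing $\hat{K}_{ij}$ for the $(i, j)$ entry of the pairs matrix and $d$ for the Kendall distance, the average Kendall distance of the judges from any chosen ranking $y$ is

$$\frac{1}{n} \sum_{k = 1}^{n} d(y^{(k)}, y) = \sum_{i < j} \left[ \hat{K}_{ji} I_{\{y_i < y_j\}} + \hat{K}_{ij} I_{\{y_i > y_j\}} \right];$$

that is, for each pair of items we add up the proportion of judges who order that pair differently than $y$ does.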
(There is a similar expression relating the Spearman distance to the marginal matrix.)
Once we have a distance metric $d$, we can define what the “best” estimate for the most central ranking is. The central ranking is the $y$ that minimizes

$$\sum_{k = 1}^{n} d(y^{(k)}, y).$$

In other words, the most central ranking minimizes the sum of distances of all the rankings in the data to that ranking.
Sometimes this ranking has already been determined. For instance, when using the Spearman distance, the central ranking emerges from the “mean” rankings. Otherwise, though, we may need to apply some search procedure to find this optimal ranking.
Since we’re working with rank data, though, it’s very tempting to not use any fancy optimization algorithms and simply compute the sum of distances for every possible ranking. This isn’t a bad idea at all if the number of items being ranked is relatively small. Here, since there are five items being ranked, the number of possible rankings is $5! = 120$, which is not too big for a modern computer to handle. It may take some time for the exhaustive search approach to yield an answer, but the answer produced by exhaustive search comes with the reassurance that it does, in fact, minimize the sum of distances.
This is in fact what I did for estimating the central ranking when minimizing the sum of Kendall distances from said ranking. The resulting ranking, again, was Seeker/Guardian/Mystic/Rogue/Survivor (which agrees with what we determined just by looking at the pairs matrix; this likely is not a coincidence).
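The exhaustive search is easy to sketch. The version below is not the analysis script; it works from the rounded pairs table rather than the raw responses, relying on the fact that the average Kendall distance from a ranking is the sum, over pairs of classes, of the share of judges who order that pair the other way. The function name `mean_kendall` is an assumption for illustration.

```python
from itertools import combinations, permutations

# Rows and columns: Guardian, Mystic, Rogue, Seeker, Survivor (percent)
pairs_pct = [
    [0.00, 54.16, 55.34, 42.52, 55.82],
    [45.84, 0.00, 51.07, 39.90, 53.44],
    [44.66, 48.93, 0.00, 38.72, 51.54],
    [57.48, 60.10, 61.28, 0.00, 61.52],
    [44.18, 46.56, 48.46, 38.48, 0.00],
]

def mean_kendall(y, pairs):
    """Average Kendall distance of the judges from rank tuple y, computed
    from the pairs matrix (given in percent)."""
    total = 0.0
    for i, j in combinations(range(len(y)), 2):
        # If y puts item i ahead of item j, the judges who disagree are those
        # who preferred j to i, and vice versa.
        total += pairs[j][i] if y[i] < y[j] else pairs[i][j]
    return total / 100

# Try all 5! = 120 rank tuples and keep the one with the smallest average distance
best = min(permutations(range(1, 6)), key=lambda y: mean_kendall(y, pairs_pct))
print(best, mean_kendall(best, pairs_pct))
```

The minimizer is the rank tuple that puts Seekers first, then Guardians, Mystics, Rogues, and Survivors, matching the result reported above.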
All of the above I consider falling into the category of descriptive statistics. It describes aspects of the sample without attempting to extrapolate to the rest of the population. With statistical inference we want to see what we can say about the population as a whole.
I should start by saying that the usual assumptions made in statistical inference are likely not satisfied by my sample. It was an opt-in sample; people chose to participate. That alone makes it a non-random sample. Additionally, only participants active on Facebook, Reddit, Twitter, Board Game Geek, and the Fantasy Flight forums were targeted by my advertising of the poll. Thus the Arkham players were likely those active on the Internet, likely at a particular time of day and day of the week (given how these websites try to push older content off the main page). They were likely young, male, and engaged enough in the game to be in the community (and unlikely to be a “casual” player). Thus the participants are likely to be more homogeneous than the population of Arkham Horror players overall.
Just as a thought experiment, what would be a better study, one where we could feel confident in the inferential ability of our sample? Well, we would grab randomly selected people from the population (perhaps from pulling random names from the phone book), have them join our study, teach them how to play the game, make them play the game for many hours until they could form an educated opinion of the game (probably at least 100 hours), then ask them to rate the classes. This would be high-quality data and we could believe the data is reliable, but damn would it be expensive! No one at FFG would consider data of that quality worth the price, and frankly neither would I.
Having said that, while the sample I have is certainly flawed in how it was collected, I actually believe we can get good results from it. The opinions of the participants are likely educated ones, so we probably still have a good idea how the Arkham Horror classes compare to one another.
In rank data analysis there is a probability model called the uniform distribution that serves as a starting point for inference. Under the uniform distribution, every ranking vector is equally likely to be observed; in short, there is no preference among the judges among the choices. The marginals matrix should have all entries be $1/m$, all off-diagonal entries of the pairs matrix should be $1/2$, and any “central” ranking is meaningless since every ranking is equally likely to be seen. According to the uniform distribution, $P(Y = y) = \frac{1}{m!}$ for every ranking $y$. If we cannot distinguish our data from data drawn from the uniform distribution, our work is done; we basically say there is no “common” ranking scheme and go about our day.
There are many tests for checking for the uniform distribution, and they are often based on the statistics we’ve already seen, such as the mean rank vector, the marginals matrix, and the pairs matrix. If $m!$ is small enough relative to the sample size, we could even just base a test off of how frequently each particular ranking was seen. A test based off the latter could detect any form of non-uniformity in the data, while tests based off the marginals or pairs matrices or the mean vector cannot detect all forms of non-uniformity; that said, they often require much less data to be performed.
As mentioned, I like working with the pairs matrix/Kendall distance. The statistical test, though, involves a vector $\hat{\kappa}$, which is the aforementioned upper triangle of the pairs matrix (excluding the diagonal entries which are always zero). (More specifically, $\hat{\kappa}$ is a vector containing the upper-diagonal entries of the pairs matrix laid out in row-major form.)
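The test decides between the null hypothesis that the rankings are drawn from the uniform distribution and the alternative hypothesis that they are not.
The test statistic measures how far $\hat{\kappa}$ is from the vector whose entries all equal $1/2$, the value expected under the uniform distribution.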
If the null hypothesis is true, then the test statistic, for large $n$, approximately follows a $\chi^2$ distribution with $\binom{m}{2}$ degrees of freedom. (For the Arkham Horror classes case, $\binom{5}{2} = 10$.) Large test statistics are evidence against the null hypothesis, so $p$-values are the area underneath the curve to the right of the test statistic.
For our data set, the reported test statistic was 2309938376; not shockingly, the corresponding $p$-value is near zero. So the data was not drawn from the uniform distribution. Arkham Horror players do have class preferences.
But what are plausible preferences players could have? We can answer this using a confidence interval. Specifically, we want to know what rankings are plausible, and thus what we want is a confidence set of rankings.
Finding a formula for a confidence set of the central ranking is extremely hard to do, but it’s not as hard to form one for one of the statistics we can compute from the rankings, then use the possible values of that statistic to find corresponding plausible central rankings. For example, one could find a confidence set for the mean ranking vector, then translate those mean rankings into ranking vectors (this is what Marden did in his book).
As I said before, I like the pairs matrix/Kendall distance in the rank data context, so I want to form a confidence set for $\kappa$, the population equivalent of $\hat{\kappa}$, the key entries of the pairs matrix. To do this, we cannot view the rank data the same way we did before; instead of seeing the $m$-dimensional vector $(2, 1, 3, 4, 5)$, we need to see the equivalent $\binom{m}{2}$-dimensional vector $(0, 1, 1, 1, 1, 1, 1, 1, 1, 1)$ that consists only of ones and zeros and records the pairwise relationships among the ranks, rather than the ranks themselves (the latter vector literally says that item one is not ranked higher than item two, item one is ranked higher than item three, same for four, same for five, then that item two is ranked higher than item three, same for four, same for five, and so on, finally saying in its last entry that item four is ranked higher than item five).
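The conversion from a rank tuple to its vector of pairwise indicators is a one-liner. This sketch (the name `pair_indicators` is hypothetical) lists the pairs in row-major order of the upper triangle, and averaging these vectors over all judges gives the upper-diagonal entries of the pairs matrix.

```python
from itertools import combinations

def pair_indicators(y):
    """For every pair i < j (row-major order of the upper triangle), record 1
    if item i is ranked ahead of item j and 0 otherwise."""
    return [int(y[i] < y[j]) for i, j in combinations(range(len(y)), 2)]

# Example: item two is ranked first, item one second, then items three to five
v = pair_indicators((2, 1, 3, 4, 5))
print(v)
```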
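We first compute $\hat{\kappa}$ by taking the means of these vectors. Then we compute the sample covariance matrix of the vectors; call it $\hat{\Sigma}$. Then a $100(1 - \alpha)$% confidence set for the true $\kappa$, appropriate for large sample sizes, is:

$$\left\{ \kappa : n (\hat{\kappa} - \kappa)^\top \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa) \leq \chi^2_{\binom{m}{2}, 1 - \alpha} \right\}$$

where $\chi^2_{\binom{m}{2}, 1 - \alpha}$ is the $100(1 - \alpha)^{\text{th}}$ percentile of the $\chi^2$ distribution with $\binom{m}{2}$ degrees of freedom.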
The region I’ve just described is a $\binom{m}{2}$-dimensional ellipsoid, a football-like shape that lives in a space with (probably) more than three dimensions. It sounds daunting, but one can still figure out what rankings are plausible once this region is computed. The trick is to work with each of the coordinates of the vector $\kappa$ and determine whether there is a $\kappa$ in the ellipsoid where that coordinate is 1/2. If the answer is no, then the value of that coordinate, for all $\kappa$ in the ellipsoid, is either always above or always below 1/2. You can then look to $\hat{\kappa}$ (which is in the dead center of the ellipsoid) to determine which is the case.
What’s the significance of this? Let’s say that you listed all possible rankings in a table. Let’s suppose you did this procedure for the coordinate of $\kappa$ corresponding to the Seeker/Rogue pair. If you determine that this coordinate is not 1/2 and that all $\kappa$ in the ellipsoid rank Seekers above Rogues, then you would take your list of rankings and remove all rankings that place Rogues before Seekers, since these rankings are not in the confidence set.
If you do find a $\kappa$ in the ellipsoid where the selected coordinate is 1/2, then you would not eliminate any rows in your list of rankings since you know that your confidence set must include some rankings that rank the two items one way and some rankings where the items are ranked the opposite way.
Repeat this procedure with every coordinate of $\kappa$ (that is, every possible pairing of choices) and you then have a confidence set for central rankings.
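Determining whether there is a vector in the ellipsoid with a select coordinate valued at 1/2 can be done via optimization. That is, find a $\kappa$ that minimizes $n (\hat{\kappa} - \kappa)^\top \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa)$, where $\hat{\Sigma}$ is the sample covariance matrix of the pairwise indicator vectors, subject to the constraint that the selected coordinate of $\kappa$ equals $1/2$. You don’t even need fancy minimization algorithms for doing this; the minimum can, in principle, be computed analytically with multivariate calculus. After you have found a minimizing $\kappa$, determine what the value of $n (\hat{\kappa} - \kappa)^\top \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa)$ is at that $\kappa$. If it is less than $\chi^2_{\binom{m}{2}, 1 - \alpha}$, then you found a $\kappa$ in the ellipsoid; otherwise, you know there is no such $\kappa$.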
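For readers who want the multivariate calculus spelled out, here is a short Lagrange-multiplier sketch. It assumes the ellipsoid has the form $n (\hat{\kappa} - \kappa)^\top \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa) \leq c$ with $\hat{\Sigma}$ positive definite; the symbols $e_j$ (the $j^{\text{th}}$ standard basis vector), $\delta$, $\lambda$, and $c$ are notation introduced here for illustration.

```latex
\begin{align*}
&\text{minimize } n\,\delta^\top \hat{\Sigma}^{-1} \delta
  \quad \text{subject to} \quad e_j^\top \delta = \hat{\kappa}_j - \tfrac{1}{2},
  \qquad \text{where } \delta = \hat{\kappa} - \kappa, \\
&\text{stationarity: } 2 n\,\hat{\Sigma}^{-1} \delta = 2 \lambda e_j
  \;\Longrightarrow\; \delta = \tfrac{\lambda}{n}\,\hat{\Sigma} e_j, \\
&\text{constraint: } \tfrac{\lambda}{n}\,\hat{\Sigma}_{jj} = \hat{\kappa}_j - \tfrac{1}{2}
  \;\Longrightarrow\; \delta^{*} = \frac{\hat{\kappa}_j - \tfrac{1}{2}}{\hat{\Sigma}_{jj}}\,\hat{\Sigma} e_j, \\
&\text{minimum value: } n\,{\delta^{*}}^\top \hat{\Sigma}^{-1} \delta^{*}
  = \frac{n\,(\hat{\kappa}_j - \tfrac{1}{2})^2}{\hat{\Sigma}_{jj}}.
\end{align*}
```

Under these assumptions, coordinate $j$ can equal 1/2 somewhere in the ellipsoid exactly when $n (\hat{\kappa}_j - 1/2)^2 / \hat{\Sigma}_{jj} \leq c$, so each pair can be checked using only its own entry of $\hat{\kappa}$ and the matching diagonal entry of $\hat{\Sigma}$.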
This was the procedure I used on the Arkham Horror class ranking data. The 95% confidence interval so computed determined that Seekers were ranked higher than Rogues and Survivors. That means that Seekers cannot have a ranking worse than 3 and Rogues and Survivors could not have rankings better than 2. Any ranking consistent with these constraints, though, is a plausible population central ranking. In fact, this procedure suggested that all the rankings below are plausible central population rankings:
       Guardian Mystic Rogue Seeker Survivor
    1         1      2     4      3        5
    2         1      2     5      3        4
    3         1      3     4      2        5
    4         1      3     5      2        4
    5         1      4     3      2        5
    6         1      4     5      2        3
    7         1      5     3      2        4
    8         1      5     4      2        3
    9         2      1     4      3        5
    10        2      1     5      3        4
    11        2      3     4      1        5
    12        2      3     5      1        4
    13        2      4     3      1        5
    14        2      4     5      1        3
    15        2      5     3      1        4
    16        2      5     4      1        3
    17        3      1     4      2        5
    18        3      1     5      2        4
    19        3      2     4      1        5
    20        3      2     5      1        4
    21        3      4     2      1        5
    22        3      4     5      1        2
    23        3      5     2      1        4
    24        3      5     4      1        2
    25        4      1     3      2        5
    26        4      1     5      2        3
    27        4      2     3      1        5
    28        4      2     5      1        3
    29        4      3     2      1        5
    30        4      3     5      1        2
    31        4      5     2      1        3
    32        4      5     3      1        2
    33        5      1     3      2        4
    34        5      1     4      2        3
    35        5      2     3      1        4
    36        5      2     4      1        3
    37        5      3     2      1        4
    38        5      3     4      1        2
    39        5      4     2      1        3
    40        5      4     3      1        2
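The elimination step that produces a list like this can be sketched in base R. This is only an illustration under my own naming (`all_rankings()` is a hypothetical helper; the script at the end of the article uses `permutations()` from gtools instead). It lists all 120 rankings of the five classes and keeps those consistent with the two significant preferences.

```r
# All rankings of the items; each row is a ranking and each entry is the rank
# (1 = best) given to the item named in the column.
all_rankings <- function(items) {
  perms <- function(v) {
    if (length(v) <= 1) return(matrix(v, nrow = 1))
    do.call(rbind, lapply(seq_along(v), function(i) {
      cbind(v[i], perms(v[-i]))
    }))
  }
  mat <- perms(seq_along(items))
  colnames(mat) <- items
  mat
}

classes <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")
candidates <- all_rankings(classes)

# Keep only rankings consistent with "Seeker is better than Rogue" and
# "Seeker is better than Survivor" (a better class has a smaller rank).
keep <- candidates[, "Seeker"] < candidates[, "Rogue"] &
  candidates[, "Seeker"] < candidates[, "Survivor"]
plausible <- candidates[keep, ]

nrow(candidates)  # 120
nrow(plausible)   # 40
```

Two constraints cut the 120 possible rankings to 40: Seekers must come first among the three classes involved, which happens in exactly one third of all rankings.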
The confidence interval, by design, is much less bold than just an estimate of the most central ranking. Our interval suggests that there’s a lot we don’t know about what the central ranking is; we only know that whatever it is, it ranks Seekers above Rogues and Survivors.
The confidence set here is at least conservative in that it could perhaps contain too many candidate central rankings. I don’t know for sure whether we could improve on the set and eliminate more rankings from the plausible set by querying more from the confidence set for $latex \kappa$. Perhaps there are certain combinations that cannot exist, like excluding rankings that give both Seekers and Guardians a high ranking at the same time. If I were a betting man, though, I’d bet that the confidence set found with this procedure could be improved, in that not every ranking vector in the resulting set corresponds with a $latex \kappa$ in the original ellipsoidal confidence set. Improving this set, though, would take a lot of work as one would have to consider multiple coordinates of potential $latex \kappa$ simultaneously, then find a rule for eliminating ranking vectors based on the results.
Matt Newman, the lead designer of Arkham Horror: The Card Game, does not believe all players are the same. Specifically, he believes that there are player types that determine how they like to play. In statistics we might say that Matt Newman believes that there are clusters of players within any sufficiently large and well-selected sample of players. This suggests we may want to perform cluster analysis to find these subpopulations.
If you haven’t heard the term before, clustering is the practice of finding “similar” data points, grouping them together, and identifying them as belonging to some subpopulation for which no label was directly observed. It’s not unreasonable to believe that these subpopulations exist and so I sought to do clustering myself.
There are many ways to cluster. Prof. Malden said that a clustering of rank data into a given number of clusters should minimize the sum of the distances of each observation from its assigned cluster’s center. However, he did not suggest a good algorithm for finding these clusters. He did suggest that for small samples, a small number of items $latex m$, and a small number of clusters, we could exhaustively search for optimal clusters, an impractical idea.
I initially attempted a k-means-type algorithm for finding good clusters, one that used the Kendall distance rather than the Euclidean distance, but unfortunately I could not get the algorithm to give good results. I don’t know whether I have errors in my code (listed below) or whether the algorithm just doesn’t work for Kendall distances, but it didn’t work; in fact, it would take a good clustering and make it worse! I eventually abandoned my home-brewed k-centers algorithm (and the hours of work that went into it) and just used spectral clustering.
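To make the criterion concrete, here is a small base-R sketch of the quantity being minimized, using the Kendall distance (the number of item pairs that two rankings order differently). The function names and the toy rankings are my own illustration; in the script at the end of the article, `clustering_score()` plays this role with the help of the rankdist package.

```r
# Kendall distance: number of item pairs that two rankings order differently.
kendall_distance <- function(r1, r2) {
  pairs <- combn(length(r1), 2)
  sum((r1[pairs[1, ]] - r1[pairs[2, ]]) * (r2[pairs[1, ]] - r2[pairs[2, ]]) < 0)
}

# Sum of the distances from each observation (a row of `mat`) to the center of
# the cluster it is assigned to; smaller is better.
cluster_cost <- function(mat, centers, clusters) {
  sum(sapply(seq_len(nrow(mat)), function(i) {
    kendall_distance(mat[i, ], centers[clusters[i], ])
  }))
}

# Toy data: four rankings of three items and two candidate centers
rankings <- rbind(c(1, 2, 3), c(2, 1, 3), c(3, 2, 1), c(3, 1, 2))
centers <- rbind(c(1, 2, 3), c(3, 2, 1))

good_cost <- cluster_cost(rankings, centers, clusters = c(1, 1, 2, 2))
bad_cost <- cluster_cost(rankings, centers, clusters = c(1, 2, 1, 2))
c(good = good_cost, bad = bad_cost)  # 2 and 6
```

An exhaustive search would evaluate this cost for every assignment and every choice of centers, which is why it is only feasible for tiny problems.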
Spectral clustering isn’t easily described, but the idea of spectral clustering is to find groups of data that a random walker, walking from point to point along a weighted graph, would spend a long time in before moving to another group. (That’s the best simplification I can make; the rest is linear algebra.) In order to do spectral clustering, one must have a notion of “similarity” of data points. “Similarity” roughly means the opposite of “distance”; in fact, if you have a distance metric (and we do here), you can find a similarity measure by subtracting all distances from the maximum distance between any two objects. Similarity measures are not as strictly defined as distance metrics; any function that gives two “similar” items a high score and two “dissimilar” items a low score could be considered a similarity function.
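The distance-to-similarity conversion just described can be sketched without any packages. This is an illustration with made-up rankings, and it relies on the fact that, for rankings without ties, the Kendall distance equals $latex \binom{m}{2}(1 - \tau)/2$, where $latex \tau$ is Kendall's rank correlation. The script at the end of the article uses `DistanceMatrix()` from rankdist instead.

```r
# Toy data: four rankings (rows) of four items
rankings <- rbind(c(1, 2, 3, 4),
                  c(2, 1, 3, 4),
                  c(4, 3, 2, 1),
                  c(4, 3, 1, 2))
m <- ncol(rankings)

# Kendall's tau between every pair of rankings (rows become columns for cor())
tau <- cor(t(rankings), method = "kendall")

# Kendall distance = number of discordant pairs = choose(m, 2) * (1 - tau) / 2
dist_mat <- round(choose(m, 2) * (1 - tau) / 2)

# Similarity: subtract every distance from the largest observed distance
sim_mat <- max(dist_mat) - dist_mat
sim_mat
```

Rankings one swap apart get a similarity of 5 here, while completely reversed rankings get 0.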
Spectral clustering takes a matrix of similarity measures, computed for each pair of observations, and spits out cluster assignments. But in addition to the similarity measure, we need to decide how many clusters to find.
I find choosing the “best” number of clusters the hardest part of clustering. We could have only one cluster, containing all our data; this is what we start with. We could also assign each data point to its own cluster; our aforementioned measure of cluster quality would then be zero, which would be great if it weren’t for the fact that our clusters mean nothing!
One approach people use for determining how many clusters to pick is the so-called elbow method. You take a plot of, say, Malden’s metric, compared against the number of clusters, and see if you can spot the “elbow” in the plot. The elbow corresponds to the “best” number of clusters.
Here’s the corresponding plot for the dataset here:
If you’re unsure where the “elbow” of the plot is, that’s okay; I’m not sure either. My best guess is that it’s at five clusters; hence my choice of five clusters.
Another plot that people use is the silhouette plot, explained quite well by the scikit-learn documentation. The silhouette plot for the clustering found by spectral clustering is shown below:
Is this a good silhouette plot? I’m not sure. It’s not the worst silhouette plot I saw for this data set but it’s not as good as examples shown in the scikit-learn documentation. There are observations that appear to be in the wrong cluster according to the silhouette analysis. So… inconclusive?
I also computed the Dunn index of the clusters. I never got a value greater than 0.125. Altogether, these methods lead me to suspect that there are no meaningful clusters in this data set, at least none that can be found with this approach.
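For readers who want to see what goes into such a plot, the silhouette width of an observation compares its average distance to its own cluster (`a`) with its average distance to the nearest other cluster (`b`). The sketch below is a generic base-R illustration working from a distance matrix; the function name and the distances are hypothetical, and it is not the code that produced the plot above.

```r
# Silhouette width of every observation, computed from a distance matrix.
# Assumes at least two clusters; singleton clusters get width 0 by convention.
silhouette_widths <- function(dist_mat, clusters) {
  labels <- unique(clusters)
  sapply(seq_along(clusters), function(i) {
    own <- clusters == clusters[i]
    own[i] <- FALSE                      # exclude the observation itself
    if (!any(own)) return(0)
    a <- mean(dist_mat[i, own])          # mean distance within own cluster
    b <- min(sapply(setdiff(labels, clusters[i]), function(l) {
      mean(dist_mat[i, clusters == l])   # mean distance to another cluster
    }))
    (b - a) / max(a, b)
  })
}

# Hypothetical Kendall distances among four rankings
dist_mat <- matrix(c(0, 1, 6, 5,
                     1, 0, 5, 4,
                     6, 5, 0, 1,
                     5, 4, 1, 0), nrow = 4, byrow = TRUE)

good <- silhouette_widths(dist_mat, clusters = c(1, 1, 2, 2))
bad <- silhouette_widths(dist_mat, clusters = c(1, 2, 1, 2))
round(good, 2)  # 0.82 0.78 0.82 0.78
round(bad, 2)   # -0.50 -0.25 -0.50 -0.25
```

Widths near 1 indicate observations sitting comfortably in their cluster; negative widths, as in the second assignment, flag observations that look closer to a different cluster.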
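As a reference point for that number, here is a sketch of one common definition of the Dunn index: the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. Other variants of the index exist, and I am not claiming this is the exact variant behind the value above; the function name and the distances are made up.

```r
# Dunn index from a distance matrix: minimum between-cluster distance divided
# by maximum within-cluster distance (cluster diameter). Larger is better.
dunn_index <- function(dist_mat, clusters) {
  same <- outer(clusters, clusters, "==")
  min(dist_mat[!same]) / max(dist_mat[same])
}

# Hypothetical Kendall distances among four rankings
dist_mat <- matrix(c(0, 1, 6, 5,
                     1, 0, 5, 4,
                     6, 5, 0, 1,
                     5, 4, 1, 0), nrow = 4, byrow = TRUE)

dunn_good <- dunn_index(dist_mat, clusters = c(1, 1, 2, 2))
dunn_bad <- dunn_index(dist_mat, clusters = c(1, 2, 1, 2))
c(good = dunn_good, bad = dunn_bad)  # 4 and about 0.17
```

Values well below 1 mean that some clusters are wider than the gaps separating them.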
But people like cluster analysis, so if you’re one of those folks, I have results for you.
    CLUSTERING

    Counts:
    Cluster
      1   2   3   4   5
    130  83  80  66  62

    Centers:
      Guardian Mystic Rogue Seeker Survivor
    1        3      2     4      1        5
    2        3      5     4      1        2
    3        3      4     1      2        5
    4        1      5     3      4        2
    5        5      1     4      3        2

    Score: 881

    CLUSTER CONFIDENCE INTERVALS

    Cluster 1:

    With 95% confidence:
    Guardian is better than Rogue
    Guardian is better than Survivor
    Mystic is better than Rogue
    Mystic is better than Survivor
    Seeker is better than Rogue
    Seeker is better than Survivor

    Plausible Modal Rankings:
       Guardian Mystic Rogue Seeker Survivor
    1         1      2     4      3        5
    2         1      2     5      3        4
    3         1      3     4      2        5
    4         1      3     5      2        4
    5         2      1     4      3        5
    6         2      1     5      3        4
    7         2      3     4      1        5
    8         2      3     5      1        4
    9         3      1     4      2        5
    10        3      1     5      2        4
    11        3      2     4      1        5
    12        3      2     5      1        4

    Cluster 2:

    With 95% confidence:
    Guardian is better than Mystic
    Guardian is better than Rogue
    Seeker is better than Guardian
    Seeker is better than Mystic
    Survivor is better than Mystic
    Seeker is better than Rogue
    Survivor is better than Rogue
    Seeker is better than Survivor

    Plausible Modal Rankings:
      Guardian Mystic Rogue Seeker Survivor
    1        2      4     5      1        3
    2        2      5     4      1        3
    3        3      4     5      1        2
    4        3      5     4      1        2

    Cluster 3:

    With 95% confidence:
    Rogue is better than Guardian
    Rogue is better than Mystic
    Rogue is better than Seeker
    Rogue is better than Survivor
    Seeker is better than Survivor

    Plausible Modal Rankings:
       Guardian Mystic Rogue Seeker Survivor
    1         2      3     1      4        5
    2         2      4     1      3        5
    3         2      5     1      3        4
    4         3      2     1      4        5
    5         3      4     1      2        5
    6         3      5     1      2        4
    7         4      2     1      3        5
    8         4      3     1      2        5
    9         4      5     1      2        3
    10        5      2     1      3        4
    11        5      3     1      2        4
    12        5      4     1      2        3

    Cluster 4:

    With 95% confidence:
    Guardian is better than Mystic
    Guardian is better than Seeker
    Rogue is better than Mystic
    Survivor is better than Mystic
    Survivor is better than Seeker

    Plausible Modal Rankings:
       Guardian Mystic Rogue Seeker Survivor
    1         1      4     2      5        3
    2         1      4     3      5        2
    3         1      5     2      4        3
    4         1      5     3      4        2
    5         1      5     4      3        2
    6         2      4     1      5        3
    7         2      4     3      5        1
    8         2      5     1      4        3
    9         2      5     3      4        1
    10        2      5     4      3        1
    11        3      4     1      5        2
    12        3      4     2      5        1
    13        3      5     1      4        2
    14        3      5     2      4        1

    Cluster 5:

    With 95% confidence:
    Mystic is better than Guardian
    Survivor is better than Guardian
    Mystic is better than Rogue
    Mystic is better than Seeker
    Survivor is better than Rogue
    Survivor is better than Seeker

    Plausible Modal Rankings:
       Guardian Mystic Rogue Seeker Survivor
    1         3      1     4      5        2
    2         3      1     5      4        2
    3         3      2     4      5        1
    4         3      2     5      4        1
    5         4      1     3      5        2
    6         4      1     5      3        2
    7         4      2     3      5        1
    8         4      2     5      3        1
    9         5      1     3      4        2
    10        5      1     4      3        2
    11        5      2     3      4        1
    12        5      2     4      3        1
When computing confidence sets for clusters I ran into an interesting problem: what if, say, you never see Seekers ranked below Guardians? This will cause one of the entries of $latex \hat{\kappa}$ to be either 0 or 1, and there is no “variance” in its value; it’s always the same. This will cause the covariance matrix to be non-invertible since it has rows/columns that are zero. The solution to this is to eliminate those rows and columns and work only with the non-constant entries of $latex \hat{\kappa}$. That said, I still treat the entries removed as if they were “statistically significant” results and remove rankings from our confidence set that are inconsistent with what we saw in the data. In short, if Seekers are never ranked below Guardians, remove all rankings in the confidence set that rank Seekers below Guardians.
One usually isn’t satisfied with just a clustering; it would be nice to determine what a clustering signifies about those who are in the cluster. For instance, what type of player gets assigned to Cluster 1? I feel that inspecting the data in a more thoughtful and manual way can give a sense of what characteristics individuals assigned to a cluster share. For instance, I read the comments submitted by poll participants to hypothesize what types of players were being assigned to particular clusters. You can read these comments at the bottom of this article, after the code section.
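The workaround can be illustrated with a few lines of base R. The data below are invented (six respondents, three pairwise comparisons coded as 0/1), and the check uses the diagonal of the covariance matrix, whereas the script at the end of the article checks `colSums(Sigma^2)`; treat this as a sketch of the idea rather than the script's exact logic.

```r
# Hypothetical 0/1 pairwise-comparison data: one row per respondent, one
# column per pair (1 means the first class was ranked above the second).
pair_data <- cbind(
  "Seeker>Guardian" = c(1, 1, 1, 1, 1, 1),  # never ranked the other way
  "Seeker>Rogue"    = c(1, 0, 1, 1, 0, 1),
  "Guardian>Rogue"  = c(0, 1, 1, 0, 1, 0)
)

k_hat <- colMeans(pair_data)
Sigma <- cov(pair_data)

# Constant comparisons have zero variance and make Sigma singular
const_vars <- which(diag(Sigma) == 0)
safe_vars <- which(diag(Sigma) > 0)

# Work only with the non-constant coordinates
safe_k_hat <- k_hat[safe_vars]
safe_Sigma <- Sigma[safe_vars, safe_vars, drop = FALSE]
solve(safe_Sigma)  # invertible, unlike the full Sigma

# The dropped coordinates are still treated as significant preferences
names(const_vars)  # "Seeker>Guardian"
```

Rankings that contradict the dropped comparisons (here, any ranking placing Guardians above Seekers) would then be removed from the candidate list by hand.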
All source code used to do the rank analysis done here is listed below, in a .R file intended to be run as an executable from a command line. (I created and ran it on a Linux system.)
Several packages had useful functions specific for this type of analysis, such as pmr (meant for modelling rank data) and rankdist (which had a lot of tools for working with the Kendall distance). The confidence interval, central ranking estimator, and hypothesis testing tools, though, I wrote myself, and they may not exist elsewhere.
I at least feel that the script itself is well-documented and I no longer need to explain it. But I will warn others that it was tailored to my problem, and the methods employed may not work well with larger sample sizes or when more items need to be ranked.
This is only the tip of the iceberg for rank data analysis. We have not even touched on modelling for rank data, which can provide even richer inference. If you’re interested, I’ll refer you again to Malden’s book.
I enjoyed this analysis so much I asked a Reddit question about where else I could conduct surveys (while at the same time still being statistically sound) because I’d love to do it again. I feel like there’s much to learn from rank data; it has great potential. Hopefully this article sparked your interest too.
    #!/usr/bin/Rscript
    ################################################################################
    # ArkhamHorrorClassPreferenceAnalysis.R
    ################################################################################
    # 2019-02-10
    # Curtis Miller
    ################################################################################
    # Analyze Arkham Horror LCG class preference survey data.
    ################################################################################

    # optparse: A package for handling command line arguments
    if (!suppressPackageStartupMessages(require("optparse"))) {
      install.packages("optparse")
      require("optparse")
    }

    ################################################################################
    # CONSTANTS
    ################################################################################

    CLASS_COUNT <- 5
    CLASSES <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")
    CLASS_COLORS <- c("Guardian" = "#00628C", "Mystic" = "#44397D",
                      "Rogue" = "#17623B", "Seeker" = "#B87D37",
                      "Survivor" = "#AA242D")

    ################################################################################
    # FUNCTIONS
    ################################################################################

    `%s%` <- function(x, y) {paste(x, y)}
    `%s0%` <- function(x, y) {paste0(x, y)}

    #' Sum of Kendall Distances
    #'
    #' Given a ranking vector and a matrix of rankings, compute the sum of Kendall
    #' distances.
    #'
    #' @param r The ranking vector
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @param weight Optional vector weighting each row of \code{mat} in the sum,
    #'               perhaps representing how many times that ranking is repeated
    #' @return The (weighted) sum of the Kendall distances
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1)
    #' skd(c(2, 1, 3), mat)
    skd <- function(r, mat, weight = 1) {
      dr <- partial(DistancePair, r2 = r)
      sum(apply(mat, 1, dr) * weight)
    }

    #' Least Sum of Kendall Distances Estimator
    #'
    #' Estimates the "central" ranking by minimizing the sum of Kendall distances,
    #' via exhaustive search.
    #'
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @param weight Optional vector weighting each row of \code{mat} in the sum,
    #'               perhaps representing how many times that ranking is repeated
    #' @return Ranking vector that minimizes the (weighted) sum of rankings
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1)
    #' lskd_estimator(mat)
    lskd_estimator <- function(mat, weight = NULL) {
      if (is.null(weight)) {
        reduced <- rank_vec_count(mat)
        mat <- reduced$mat
        weight <- reduced$count
      }
      skdm <- partial(skd, mat = mat, weight = weight)
      m <- max(mat)
      permutation_mat <- permutations(m, m)
      sums <- apply(permutation_mat, 1, skdm)
      permutation_mat[which.min(sums),]
    }

    #' Identify Ranking With Center
    #'
    #' Find the index of the center closest to a ranking vector.
    #'
    #' @param r The ranking vector
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @return Index of row that is closest to \code{r}
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1)
    #' close_center(c(2, 1, 3), mat)
    close_center <- function(r, mat) {
      dr <- partial(DistancePair, r2 = r)
      which.min(apply(mat, 1, dr))
    }

    #' Simplify Rank Matrix To Unique Rows
    #'
    #' Given a matrix with rows representing rankings, this function reduces the
    #' matrix to rows of only unique rankings and also counts how many times a
    #' ranking appeared.
    #'
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @return A list with entries \code{"mat"} and \code{"count"}, with
    #'         \code{"mat"} being a matrix now with unique rankings and
    #'         \code{"count"} being a vector of times each row in new matrix
    #'         appeared in the old matrix
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1)
    #' rank_vec_count(mat)
    rank_vec_count <- function(mat) {
      old_col_names <- colnames(mat)
      old_row_names <- rownames(mat)
      res_df <- aggregate(list(numdup = rep(1, times = nrow(mat))),
                          as.data.frame(mat), length)
      count <- res_df$numdup
      new_mat <- res_df[1:ncol(mat)]
      colnames(new_mat) <- old_col_names
      rownames(new_mat) <- old_row_names
      list("mat" = as.matrix(new_mat), "count" = count)
    }

    #' Find \eqn{k} Ranking Clusters
    #'
    #' Estimate \eqn{k} clusters of rankings.
    #'
    #' The algorithm to find the ranking clusters resembles the \eqn{k}-means++
    #' algorithm except that the distance metric is the Kendall distance.
    #'
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @param k The number of clusters to find
    #' @param init_type How to initialize the centers (\code{"spectral"} or
    #'                  \code{"kmeans++"})
    #' @param max_iter The maximum number of iterations for algorithm
    #' @param tol The numerical tolerance at which to end the algorithm if met
    #' @return A list containing the central rankings of each cluster (in
    #'         \code{"centers"}) and a vector with integers representing cluster
    #'         assignments
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' rank_cluster(mat, 2)
    rank_cluster <- function(mat, k, init_type = c("spectral", "kmeans++"),
                             max_iter = 100, tol = 1e-4) {
      simplified_mat <- rank_vec_count(mat)
      mat <- simplified_mat$mat
      count <- simplified_mat$count
      init_type <- init_type[1]
      if (init_type == "kmeans++") {
        centers <- rank_cluster_center_init(mat, k)
      } else if (init_type == "spectral") {
        centers <- rank_cluster_spectral(mat, k)$centers
      } else {
        stop("Don't know init_type" %s% init_type)
      }
      old_centers <- centers
      cc_centers <- partial(close_center, mat = centers)
      clusters <- apply(mat, 1, cc_centers)
      for (iter in 1:max_iter) {
        centers <- find_cluster_centers(mat, clusters, count)
        stopifnot(all(dim(centers) == dim(old_centers)))
        cc_centers <- partial(close_center, mat = centers)
        clusters <- apply(mat, 1, cc_centers)
        if (center_distance_change(centers, old_centers) < tol) {
          break
        } else {
          old_centers <- centers
        }
      }
      if (iter == max_iter) {warning("Maximum iterations reached")}
      colnames(centers) <- colnames(mat)
      list("centers" = centers, "clusters" = rep(clusters, times = count))
    }

    #' Find the Distance Between Two Ranking Matrices
    #'
    #' Find the distance between two ranking matrices by summing the distance
    #' between each row of the respective matrices.
    #'
    #' @param mat1 First matrix of ranks
    #' @param mat2 Second matrix of ranks
    #' @return The sum of distances between rows of \code{mat1} and \code{mat2}
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1)
    #' center_distance_change(mat, mat)
    center_distance_change <- function(mat1, mat2) {
      if (any(dim(mat1) != dim(mat2))) {stop("Dimensions of matrices don't match")}
      sum(sapply(1:nrow(mat1), function(i) {DistancePair(mat1[i, ], mat2[i, ])}))
    }

    #' Initialize Cluster Centers
    #'
    #' Find initial cluster centers as prescribed by the \eqn{k}-means++ algorithm.
    #'
    #' @param mat The matrix of rankings, with each row having its own ranking
    #' @param k The number of clusters to find
    #' @return A matrix containing cluster centers.
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' rank_cluster_center_init(mat, 2)
    rank_cluster_center_init <- function(mat, k) {
      n <- nrow(mat)
      center <- mat[sample(1:n, 1), ]
      centers_mat <- rbind(center)
      for (i in 2:k) {
        min_distances <- sapply(1:n, function(l) {
          min(sapply(1:(i - 1), function(j) {
            DistancePair(mat[l, ], centers_mat[j, ])
          }))
        })
        center <- mat[sample(1:n, 1, prob = min_distances/sum(min_distances)), ]
        centers_mat <- rbind(centers_mat, center)
      }
      rownames(centers_mat) <- NULL
      colnames(centers_mat) <- colnames(mat)
      centers_mat
    }

    #' Evaluation Metric for Clustering Quality
    #'
    #' Evaluates a clustering's quality by summing the distance of each observation
    #' to its assigned cluster center.
    #'
    #' @param mat Matrix of rankings (in the rows); the data
    #' @param centers Matrix of rankings (in the rows) representing the centers of
    #'                the clusters
    #' @param clusters Vector of indices corresponding to cluster assignments (the
    #'                 rows of the \code{centers} matrix)
    #' @return Score of the clustering
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' centers <- rbind(1:3, 3:1)
    #' clusters <- c(1, 1, 2, 2)
    #' clustering_score(mat, centers, clusters)
    clustering_score <- function(mat, centers, clusters) {
      sum(sapply(1:nrow(centers), function(i) {
        center <- centers[i, ]
        submat <- mat[which(clusters == i), ]
        skd(center, submat)
      }))
    }

    #' Clustering with Restarts
    #'
    #' Clusters multiple times and returns the clustering with the lowest clustering
    #' score
    #'
    #' @param mat Matrix of rankings (in the rows); the data
    #' @param ... Parameters to pass to \code{\link{rank_cluster}}
    #' @param restarts Number of restarts
    #' @return A list containing the central rankings of each cluster (in
    #'         \code{"centers"}) and a vector with integers representing cluster
    #'         assignments
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' rank_cluster_restarts(mat, 2, 5)
    rank_cluster_restarts <- function(mat, ..., restarts = 10) {
      best_score <- Inf
      rank_cluster_args <- list(...)
      rank_cluster_args$mat <- mat
      for (i in 1:restarts) {
        new_cluster_scheme <- do.call(rank_cluster, rank_cluster_args)
        score <- clustering_score(mat, new_cluster_scheme$centers,
                                  new_cluster_scheme$clusters)
        if (score < best_score) {
          best_score <- score
          best_scheme <- new_cluster_scheme
        }
      }
      return(best_scheme)
    }

    #' Given Clusters, Find Centers
    #'
    #' Given a collection of clusters, find centers for the clusters.
    #'
    #' @param mat Matrix of rankings (in rows)
    #' @param clusters Vector containing integers identifying cluster assignments,
    #'                 where the integers range from one to the number of clusters
    #' @param weight Optional vector weighting each row of \code{mat} in the sum,
    #'               perhaps representing how many times that ranking is repeated
    #' @return A matrix of ranks representing cluster centers
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' find_cluster_centers(mat, c(1, 1, 2, 2))
    find_cluster_centers <- function(mat, clusters, weight = NULL) {
      if (is.null(weight)) {
        weight <- rep(1, times = nrow(mat))
      }
      centers <- t(sapply(unique(clusters), function(i) {
        submat <- mat[which(clusters == i), ]
        subweight <- weight[which(clusters == i)]
        lskd_estimator(submat, subweight)
      }))
      colnames(centers) <- colnames(mat)
      centers
    }

    #' Cluster Rankings Via Spectral Clustering
    #'
    #' Obtain a clustering of rank data via spectral clustering.
    #'
    #' @param mat Matrix containing rank data
    #' @param k Number of clusters to find
    #' @return A list with entries: \code{"centers"}, the centers of the clusters;
    #'         and \code{"clusters"}, a vector assigning rows to clusters.
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' rank_cluster_spectral(mat, 2)
    rank_cluster_spectral <- function(mat, k = 2) {
      dist_mat <- DistanceMatrix(mat)
      sim_mat <- max(dist_mat) - dist_mat
      clusters <- spectralClustering(sim_mat, k)
      centers <- find_cluster_centers(mat, clusters)
      list("centers" = centers, "clusters" = clusters)
    }

    #' Compute the Test Statistic for Uniformity Based on the Pairs Matrix
    #'
    #' Compute a test for uniformity based on the estimated pairs matrix.
    #'
    #' Let \eqn{m} be the number of items ranked and \eqn{n} the size of the data
    #' set. Let \eqn{\bar{m} = m(m - 1)/2} and \eqn{\bar{y}} the mean rank vector.
    #' Let \eqn{\hat{K}^*} be the upper-triangular part of the estimated pairs
    #' matrix (excluding the diagonal), laid out as a vector in row-major order.
    #' Finally, let \eqn{1_k} be a vector of \eqn{k} ones. Then the test statistic
    #' is
    #'
    #' \deqn{12n(\|\hat{K}^* - \frac{1}{2} 1_{\bar{m}}\|^2 - \|\bar{y} - \frac{m +
    #' 1}{2} 1_m\|^2 / (m + 1))}
    #'
    #' Under the null hypothesis this statistic asymptotically follows a
    #' \eqn{\chi^2} distribution with \eqn{\bar{m}} degrees of freedom.
    #'
    #' @param mat The data matrix, with rankings in rows
    #' @return The value of the test statistic
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' pairs_uniform_test_stat(mat)
    pairs_uniform_test_stat <- function(mat) {
      desc_stat <- suppressMessages(destat(mat))
      mean_rank <- desc_stat$mean.rank
      pair <- desc_stat$pair
      m <- ncol(mat) - 1
      n <- nrow(mat)
      mbar <- choose(m, 2)
      K <- pair[upper.tri(pair, diag = FALSE)]
      meanK <- rep(1/2, times = mbar)
      cm <- rep((m + 1)/2, times = m)

      12 * n * (sum((K - meanK)^2) - sum((mean_rank - cm)^2)/(m + 1))
    }

    #' Compute Covariance Matrix of Pairs Matrix Upper Triangle
    #'
    #' Compute the covariance matrix of the pairs matrix estimator.
    #'
    #' @param mat Data matrix, with each ranking having its own row
    #' @return The \eqn{m(m - 1)/2}-square matrix representing the covariance matrix
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' pairs_mat_cov(mat)
    pairs_mat_cov <- function(mat) {
      n <- nrow(mat)
      m <- ncol(mat)
      pair <- kappa_est(mat)
      pair <- as.matrix(pair)

      # Transform data into a dataset of pairwise rank comparisons
      if (m == 1) {
        return(0)
      }
      kappa_data <- sapply(2:m, function(j) {mat[, j] > mat[, 1]})
      for (i in 2:(m - 1)) {
        kappa_data <- cbind(kappa_data, sapply((i + 1):m, function(j) {
          mat[, j] > mat[, i]
        }))
      }
      kappa_data <- kappa_data + 0  # Converts to integers

      cov(kappa_data)
    }

    #' Estimate \eqn{\kappa} Vector
    #'
    #' Estimate the \eqn{\kappa} vector, which fully defines the pairs matrix.
    #'
    #' @param mat Data matrix, with each ranking having its own row
    #' @return The \eqn{m(m - 1)/2}-dimensional vector
    #' @examples
    #' mat <- rbind(1:3,
    #'              3:1,
    #'              c(2, 1, 3),
    #'              c(3, 1, 2))
    #' kappa_est(mat)
    kappa_est <- function(mat) {
      n <- nrow(mat)
      df <- as.data.frame(mat)
      df$n <- 1
      pair <- suppressMessages(destat(df))
      pair <- t(pair$pair)
      pair <- pair[lower.tri(pair, diag = FALSE)]/n
      pair
    }

    #' Get Plausible Rankings For Central Ranking Based on Kendall Distance
    #'
    #' Determine a set of plausible central rankings based on the Kendall distance.
    #'
    #' Let \eqn{\alpha} be one minus the confidence level, \eqn{m} the number of
    #' options, \eqn{\bar{m} = m(m - 1)/2}, \eqn{\kappa} the vectorized
    #' upper-triangle of the pairs matrix of the population, \eqn{\hat{\kappa}} the
    #' sample estimate of \eqn{\kappa}, and \eqn{\hat{\Sigma}} the estimated
    #' covariance matrix of \eqn{\hat{\kappa}}. Then the approximate \eqn{100(1 -
    #' \alpha)}% confidence interval for \eqn{\kappa} is
    #'
    #' \deqn{\kappa: (\hat{\kappa} - \kappa)^T \hat{\Sigma}^{-1} (\hat{\kappa} -
    #' \kappa) < \chi^2_{\bar{m}}}
    #'
    #' Once we have such an interval the next task is to determine which ranking
    #' vectors are consistent with plausible \eqn{\kappa}. To do this, the function
    #' determines which choices could plausibly be tied according to the confidence
    #' interval; that is, which entries of \eqn{\kappa} could plausibly be
    #' \eqn{1/2}. Whenever this is rejected, there is a statistically significant
    #' difference in the preference of the two choices; looking at
    #' \eqn{\hat{\kappa}} can determine which of the two choices is favored. All
    #' ranking vectors that disagree with that preference are eliminated from the
    #' space of plausible central ranking vectors. The ranking vectors surviving at
    #' the end of this process constitute the confidence interval.
    #'
    #' @param mat Matrix of rank data, each observation having its own row
    #' @param conf_level Desired confidence level
    #' @return A list with entries \code{"ranks"} holding the matrix of plausible
    #'         rankings in the confidence interval and \code{"preference_string"}, a
    #'         string enumerating which options are, with statistical significance,
    #'         preferred over others
    #' @examples
    #' mat <- t(replicate(100, {sample(1:3)}))
    #' kendall_rank_conf_interval(mat)
    kendall_rank_conf_interval <- function(mat, conf_level = 0.95) {
      n <- nrow(mat)
      m <- max(mat)
      mbar <- choose(m, 2)
      kap <- kappa_est(mat)
      Sigma <- pairs_mat_cov(mat)
      crit_value <- qchisq(1 - conf_level, df = mbar, lower.tail = FALSE)

      # Find bad rows of Sigma, where the covariance is zero; that variable must be
      # constant
      const_vars <- which(colSums(Sigma^2) == 0)
      safe_vars <- which(colSums(Sigma^2) > 0)
      safe_kap <- kap[safe_vars]
      safe_Sigma <- Sigma[safe_vars, safe_vars]

      # Determine if hyperplanes where one coordinate is 1/2 intersect confidence
      # set
      b <- as.matrix(solve(safe_Sigma, safe_kap))
      a <- t(safe_kap) %*% b
      a <- a[1, 1]
      check_half <- partial(hei_check, x = 1/2, A = safe_Sigma, b = -2 * b,
                            d = crit_value/n - a, invert_A = TRUE)
      sig_diff_safe_vars <- !sapply(1:length(safe_vars), check_half)
      if (length(const_vars) > 0) {
        sig_diff <- rep(NA, times = mbar)
        sig_diff[safe_vars] <- sig_diff_safe_vars
        sig_diff[const_vars] <- TRUE
      } else {
        sig_diff <- sig_diff_safe_vars
      }
      idx_matrix <- matrix(0, nrow = m, ncol = m)
      idx_matrix[lower.tri(idx_matrix, diag = FALSE)] <- 1:mbar
      idx_matrix <- t(idx_matrix)
      rownames(idx_matrix) <- colnames(mat)
      colnames(idx_matrix) <- colnames(mat)

      # Remove rows of potential centers matrix to reflect confidence interval
      # results; also, record which groups seem to have significant difference in
      # ranking
      rank_string <- ""
      permutation_mat <- permutations(m, m)
      for (i in 1:(m - 1)) {
        for (j in (i + 1):m) {
          sig_diff_index <- idx_matrix[i, j]
          if (sig_diff[sig_diff_index]) {
            direction <- sign(kap[sig_diff_index] - 1/2)
            if (direction > 0) {
              # Row option (i) is preferred to column option (j)
              permutation_mat <- permutation_mat[permutation_mat[, i] <
                                                 permutation_mat[, j], ]
              rank_string <- rank_string %s0% colnames(mat)[i] %s%
                             "is better than" %s% colnames(mat)[j] %s0% '\n'
            } else if (direction < 0) {
              # Row option (i) is inferior to column option (j)
              permutation_mat <- permutation_mat[permutation_mat[, i] >
                                                 permutation_mat[, j], ]
              rank_string <- rank_string %s0% colnames(mat)[j] %s%
                             "is better than" %s% colnames(mat)[i] %s0% '\n'
            }
          }
        }
      }

      colnames(permutation_mat) <- colnames(mat)
      return(list("ranks" = permutation_mat, "preference_string" = rank_string))
    }

    #' Straight Hyperplane and Ellipse Intersection Test
    #'
    #' Test whether a hyperplane parallel to an axis intersects an ellipse.
    #'
    #' The ellipse is fully determined by the parameters \code{A}, \code{b}, and
    #' \code{d}; in fact, the ellipse consists of all \eqn{x} such that
    #'
    #' \deqn{x^T A x + b^T x \leq d}
    #'
    #' \code{x} is the intercept of the hyperplane and \code{k} is the coordinate
    #' that is fixed to the value \code{x} and thus determines along which axis the
    #' hyperplane is parallel. A value of \code{TRUE} means that there is an
    #' intersection, while \code{FALSE} means there is no intersection.
    #'
    #' @param x The fixed value of the hyperplane
    #' @param k The coordinate fixed to \code{x}
    #' @param A A \eqn{n \times n} matrix
    #' @param b An \eqn{n}-dimensional vector
    #' @param d A scalar representing the upper bound of the ellipse
    #' @param invert_A If \code{TRUE}, use the inverse of \code{A} in place of
    #'                 \code{A}
    #' @return \code{TRUE} or \code{FALSE} depending on whether the hyperplane
    #'         intersects the ellipse or not
    #' @examples
    #' hei_check(1, 2, diag(3), rep(0, times = 3), 10)
    hei_check <- function(x, k, A, b, d, invert_A = FALSE) {
      b <- as.matrix(b)
      n <- nrow(b)
      stopifnot(k >= 1 & k <= n)
      stopifnot(nrow(A) == ncol(A) & nrow(A) == n)
      stopifnot(all(eigen(A)$values > 0))
      all_but_k <- (1:n)[which(1:n != k)]
      s <- rep(0, times = n)
      s[k] <- x
      s <- as.matrix(s)
      if (invert_A) {
        tb <- as.matrix(solve(A, s))
      } else {
        tb <- A %*% s
      }
      td <- t(s) %*% tb + t(b) %*% s
      if (invert_A) {
        # XXX: curtis: NUMERICALLY BAD; FIX THIS -- Thu 14 Feb 2019 07:50:19 PM MST
        A <- solve(A)
      }
      tA <- A[all_but_k, all_but_k]
      tx <- solve(tA, -(b/2 + tb)[all_but_k, ])
      tx <- as.matrix(tx)
      val <- t(tx) %*% tA %*% tx + t((b + 2 * tb)[all_but_k]) %*% tx + td - d
      val <- val[1, 1]
      val <= 0
    }

    ################################################################################
    # MAIN FUNCTION DEFINITION
    ################################################################################

    main <- function(input, prefix = "", width = 6, height = 4, clusters = 5,
                     conflevel = 95, comments = "AHLCGClusterComments.txt",
                     detailed = FALSE, help = FALSE) {
      suppressPackageStartupMessages(library(pmr))
      suppressPackageStartupMessages(library(ggplot2))
      suppressPackageStartupMessages(library(reshape2))
      suppressPackageStartupMessages(library(dplyr))
      suppressPackageStartupMessages(library(rankdist))
      suppressPackageStartupMessages(library(gtools))
      suppressPackageStartupMessages(library(purrr))
      suppressPackageStartupMessages(library(anocva))

      load(input)
      n <- nrow(survey_data)
      rank_data <- survey_data[CLASSES]
      rank_data$n <- 1
      rank_mat <- as.matrix(survey_data[CLASSES])

      # Get basic descriptive statistics: mean ranks, marginals, pairs
      desc_stat <- suppressMessages(destat(rank_data))
      mean_rank <- desc_stat$mean.rank
      marginal <- desc_stat$mar
      pair <- desc_stat$pair
      names(mean_rank) <- CLASSES
      rownames(marginal) <- CLASSES
      colnames(marginal) <- 1:CLASS_COUNT
      rownames(pair) <- CLASSES
      colnames(pair) <- CLASSES

      # Compute "typical" distance based on least sum of Kendall distances
      best_rank <- lskd_estimator(rank_mat)
      names(best_rank) <- CLASSES

      # Hypothesis Testing for Uniformity
      statistic <- pairs_uniform_test_stat(rank_data)

      # Confidence Interval
      ci <- kendall_rank_conf_interval(rank_mat, conf_level = conflevel / 100)

      # Cluster data
      rank_clustering <- rank_cluster_spectral(rank_mat, k = clusters)
      centers <- rank_clustering$centers
      Cluster <- rank_clustering$clusters
      # Naming convention broke for printing
      rownames(centers) <- 1:nrow(centers)

      # Plotting
      marginal_plot <- ggplot(
          melt(100 * marginal / n, varnames = c("Class", "Rank"),
               value.name = "Percent"),
          aes(fill = Class, x = Class, y = Percent, group = Rank)) +
        geom_bar(position = "dodge", stat = "identity") +
        scale_fill_manual(values = CLASS_COLORS) +
        labs(title = "Class Rankings") +
        theme_bw()
      ggsave(prefix %s0% "marginal_plot.png", plot = marginal_plot,
             width = width, height = height, units = "in", dpi = 300)

      pair_plot <- ggplot(
          melt(100 * pair / n, varnames = c("Class", "Opposite"),
               value.name = "Percent") %>% filter(Percent > 0),
          aes(fill = Opposite, x = Class, y = Percent)) +
        geom_bar(position = "dodge", stat = "identity") +
        geom_hline(yintercept = 50, linetype = 2, color = "red") +
        scale_fill_manual(values = CLASS_COLORS) +
        labs(title = "Class Ranking Comparison") +
        theme_bw()
      ggsave(prefix %s0% "pair_plot.png", plot = pair_plot,
             width = width, height = height, units = "in", dpi = 300)

      # Place cluster comments in file
      comment_string <- ""
      for (i in 1:clusters) {
        comment_string <- comment_string %s0% "\n\nCLUSTER" %s% i %s0%
          "\n\n\n" %s0% paste(survey_data$Reason[survey_data$Reason != "" &
                                                 Cluster == i],
                              collapse = "\n\n*\n\n")
      }
      cat(comment_string, file = comments)

      # Printing
      cat("\nMEAN RANK\n\n")
      print(round(mean_rank, digits = 2))
      cat("\nMARGINALS\n\n")
      print(round(100 * marginal / n, digits = 2))
      cat("\nPAIRS\n\n")
      print(round(100 * pair / n, digits = 2))
      cat("\nUNIFORMITY TEST\n\n")
      cat("Test Statistic:", statistic, "\n")
      cat("P-value:", pchisq(statistic, df = choose(CLASS_COUNT, 2),
                             lower.tail = FALSE), "\n")
      cat("\nOPTIMAL RANK ESTIMATE\n\n")
      print(sort(best_rank))
      cat("\nWith", conflevel %s0% '%', "confidence:",
          '\n' %s0% ci$preference_string)
      if (detailed) {
        cat("\nPlausible Modal Rankings:\n")
        print(as.data.frame(ci$ranks))
      }
      cat("\nCLUSTERING\n\nCounts: ")
      print(table(Cluster))
      cat("\nCenters:\n")
      print(centers)
      cat("\nScore:", clustering_score(rank_mat, centers, Cluster), "\n")
      if (detailed) {
        cat("\nCLUSTER CONFIDENCE INTERVALS\n\n")
        for (i in 1:clusters) {
          cat("\nCluster", i %s0% ':\n')
          ci_cluster <- kendall_rank_conf_interval(rank_mat[Cluster == i, ])
          cat("\nWith", conflevel %s0% '%', "confidence:",
              '\n' %s0% ci_cluster$preference_string)
          cat("\nPlausible Modal Rankings:\n")
          print(as.data.frame(ci_cluster$ranks))
        }
      }
    }

    ################################################################################
    # INTERFACE SETUP
    ################################################################################

    if (sys.nframe() == 0) {
      cl_args <- parse_args(OptionParser(
        description = paste("Analyze Arkham Horror LCG class preference survey",
                            "data and print results."),
        option_list = list(
          make_option(c("--input", "-i"), type = "character",
                      help = paste("Input file containing survey data")),
          make_option(c("--prefix", "-p"), type = "character", default = "",
                      help = "Another command-line argument"),
          make_option(c("--width", "-w"), type = "double", default = 6,
                      help = "Width of plots"),
          make_option(c("--height", "-H"), type = "double", default = 4,
                      help = "Height of plots"),
          make_option(c("--clusters", "-k"), type = "integer", default = 5,
                      help = "Number of clusters in spectral clustering"),
          make_option(c("--comments", "-c"), type = "character",
                      default = "AHLCGClusterComments.txt",
                      help = "File to store participant comments organized" %s%
                             "by cluster"),
          make_option(c("--conflevel", "-a"), type = "double", default = 95,
                      help = "Confidence level of confidence set"),
          make_option(c("--detailed", "-d"), action = "store_true",
                      default = FALSE, help = "More detail in report")
        )
      ))

      do.call(main, cl_args)
    }
$ ./ArkhamHorrorClassPreferenceAnalysis.R -i AHLCGClassPreferenceSurveys.Rda --detailed
MEAN RANK

Guardian   Mystic    Rogue   Seeker Survivor
    2.92     3.10     3.16     2.60     3.22

MARGINALS

             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84

PAIRS

         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00

UNIFORMITY TEST

Test Statistic: 2309938376
P-value: 0

OPTIMAL RANK ESTIMATE

  Seeker Guardian   Mystic    Rogue Survivor
       1        2        3        4        5

With 95% confidence:
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         1      4     3      2        5
6         1      4     5      2        3
7         1      5     3      2        4
8         1      5     4      2        3
9         2      1     4      3        5
10        2      1     5      3        4
11        2      3     4      1        5
12        2      3     5      1        4
13        2      4     3      1        5
14        2      4     5      1        3
15        2      5     3      1        4
16        2      5     4      1        3
17        3      1     4      2        5
18        3      1     5      2        4
19        3      2     4      1        5
20        3      2     5      1        4
21        3      4     2      1        5
22        3      4     5      1        2
23        3      5     2      1        4
24        3      5     4      1        2
25        4      1     3      2        5
26        4      1     5      2        3
27        4      2     3      1        5
28        4      2     5      1        3
29        4      3     2      1        5
30        4      3     5      1        2
31        4      5     2      1        3
32        4      5     3      1        2
33        5      1     3      2        4
34        5      1     4      2        3
35        5      2     3      1        4
36        5      2     4      1        3
37        5      3     2      1        4
38        5      3     4      1        2
39        5      4     2      1        3
40        5      4     3      1        2

CLUSTERING

Counts: Cluster
  1   2   3   4   5
130  83  80  66  62

Centers:
  Guardian Mystic Rogue Seeker Survivor
1        3      2     4      1        5
2        3      5     4      1        2
3        3      4     1      2        5
4        1      5     3      4        2
5        5      1     4      3        2

Score: 881

CLUSTER CONFIDENCE INTERVALS

Cluster 1:

With 95% confidence:
Guardian is better than Rogue
Guardian is better than Survivor
Mystic is better than Rogue
Mystic is better than Survivor
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         2      1     4      3        5
6         2      1     5      3        4
7         2      3     4      1        5
8         2      3     5      1        4
9         3      1     4      2        5
10        3      1     5      2        4
11        3      2     4      1        5
12        3      2     5      1        4

Cluster 2:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Rogue
Seeker is better than Guardian
Seeker is better than Mystic
Survivor is better than Mystic
Seeker is better than Rogue
Survivor is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
  Guardian Mystic Rogue Seeker Survivor
1        2      4     5      1        3
2        2      5     4      1        3
3        3      4     5      1        2
4        3      5     4      1        2

Cluster 3:

With 95% confidence:
Rogue is better than Guardian
Rogue is better than Mystic
Rogue is better than Seeker
Rogue is better than Survivor
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         2      3     1      4        5
2         2      4     1      3        5
3         2      5     1      3        4
4         3      2     1      4        5
5         3      4     1      2        5
6         3      5     1      2        4
7         4      2     1      3        5
8         4      3     1      2        5
9         4      5     1      2        3
10        5      2     1      3        4
11        5      3     1      2        4
12        5      4     1      2        3

Cluster 4:

With 95% confidence:
Guardian is better than Mystic
Guardian is better than Seeker
Rogue is better than Mystic
Survivor is better than Mystic
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      4     2      5        3
2         1      4     3      5        2
3         1      5     2      4        3
4         1      5     3      4        2
5         1      5     4      3        2
6         2      4     1      5        3
7         2      4     3      5        1
8         2      5     1      4        3
9         2      5     3      4        1
10        2      5     4      3        1
11        3      4     1      5        2
12        3      4     2      5        1
13        3      5     1      4        2
14        3      5     2      4        1

Cluster 5:

With 95% confidence:
Mystic is better than Guardian
Survivor is better than Guardian
Mystic is better than Rogue
Mystic is better than Seeker
Survivor is better than Rogue
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         3      1     4      5        2
2         3      1     5      4        2
3         3      2     4      5        1
4         3      2     5      4        1
5         4      1     3      5        2
6         4      1     5      3        2
7         4      2     3      5        1
8         4      2     5      3        1
9         5      1     3      4        2
10        5      1     4      3        2
11        5      2     3      4        1
12        5      2     4      3        1
CLUSTER 1  Guardians have serious bling and they're awesome at what they do, so they're number 1. Seekers also have great cards that guzzle clues and generally provide solid deck building, so they're #2. Rogues have cards that look like a lot of fun (there's bling there too) and they are often good at both clue gathering and fighting, depending on which is needed. Mystic decks feel like they're all the same, so building decks with them is not as much fun. Survivor cards are extremely limited so they're my least favorite. * I love the Mystic spells, especially the versatility. Hated Rogues since Skids days, although Jenny is great and Preston is very good fun. Guardians and Seeker fall very easy into the usable archetypes of Attack and Investigate. * I love supporting guardians and seekers. Control focused mistics are also fun. * Purple is top just because of Recall the Future and Premonition. Yellow for being weird, Green for extraactions and Finn. Red for cool, weird interactions at a bargain price. Blue is boring. * I don't like playing Rogues, alright? Please don't crucify me! Oh, this is anonymous? Excellent. * Simplicity of play and planning. * I love spells and magic items * Guardian are probably te most rounded IMO. Seekers next, but great at clue gathering. * Seeker pool has best card draw & selection; guardian has stick to the plan + stand together + weapons; survivor pool is good but good xp options are less varied (will to survive/true survivor or bust); mystics delve too deep + bag control + David Renfield are good. Rogue pool is harder to build a full game plan around—its best cards enable great turns (pocket watch, double or nothing, etc) and are valuable to have in the party, but they have a harder time spending actions 23 as usefully since some of their best things exhaust (lockpicks). * Mystic and Rogue tied for first. Mystic is my prefered and I like how I can stack my deck to be heavy in either investigating and/or combat. 
Rogue because most get a lot of recources where you can purchase more expensive cards. * I feel as though Mystic have the broadest tool kit and be specialise in almost any direction. However my experience is solely limited to two player with my wife and she plays a cloover, so we need someone with bashing power. * Matt's response * I primarily play a seeker (Daisy) * Yellow fits with my playstyle the best * I really like all of them, so there's not a ton of distance between them. * gameplay style, clear focus on purposes * Guardian and Seeker are very straightforward, and I like that. They have a clear objective, and they do it well. * While I feel that most classes have merit, the rogue is generally the worst at the core aspects of the game: fighting and clue finding. Evading does not have the punch that killing the enemy foes. * I prefer a support / team role, and play for consistency over tricks. * Most useful for the group * I just looked at options. Mystics have a lot of options in every way, shape or form, and so do Guardians. I just prefer the mystic combos better, since Guardians are pretty bland in that regard. I feel you really can now make different mystic decks, from support to tank and combat master, to main seeking investigator etc. They have everything and even playing one deck a few times is till fun because of so many exp. options. And while their decks are pretty deep, the premise is simple  boost willpower. That leaves them with a nice weakness you have to cover. Guardians have better weapons (more fun) than mystics have combat spells, although Shattered Aeons really gave Mystics a fun new icy option. And maybe I'd like to see a Mystic that wouldn't be pure Mystic if you get me. Some hybrid guy or girl, that's not just using spells and artifacts from the same class over and over again. That's really what they're missing. Guardians are just so great, because they are sooo well balanced imo. It's quite relaxing looking at their options. 
You have everything from amazing gear, weapons, allies, events that cover literally everything + your friends' asses, awesome skillcards that can also combo, fun and engaging exp. options etc. But they lack different kinds of investigators. They have options, just some other classes have more. Maybe my least favorite on investigator side. Mystics again are so simple to make in that regard. I gave Seekers 3. because they just have some 0 exp. cards that are just too strong for any class, not just for them. Otherwise I really like Seeker cards theme, maybe even more than Guardian, maybe even my favorite I'd say, but again, Seekers just have so much random stuff and OP stuff (you know what they are). I don't care for balance in a coop game, OP cards can be really fun, but this stuff really limits their options and sometimes even other classes' options, because not including them just hinders your deck and you know it (example is Shortcut). And that's not good. They have really fun and diverse roster of investigators though. And their experience options are quite game breaking, but in a good way imo. There's seeking, combat, running and evading so much support and combos, really fun and diverse. Rogues have maybe some of my least favorite cards, but they have A LOT of options. They have quite a few very awesome weapons, but they also have SO MUCH cards that are meant for combos and while combo decks are fun, they, in my opinion, are niche, or at least not used in every game. Sometimes you just want a simple deck and Rouges have a limited card pool when you look at it that way (example: no useful combat ally or even asset  there is a new Guardian tarrot card for Jenny and Skids, but they need more imo). They got their quite fresh Lockpicks and the seeker gator and that was an amazing get. But more, Leo breaks their ally pool, because he's just too strong. They also have no pure combat investigators, but otherwise their investigators are really really fun and diverse. 
They have AMAZING experience options. Maybe the best in the game. And btw, they were my favorite to play before the last few expansions. I love Preston, but again the new cards are very niche. The new seeker agent Joe with 4 combat elevates seekers above Rogues for me in the options in card pool department though. They now have an optional pure combat investigator, while Rogues still don't. Survivors have AWESOME cards, especially investigators are just so fun and weird, but they just lack options in the card pool. You have so many "survive" cards, but they lack anything else strong. Their weapons are quite fun, but there are no heavy hitting options. That for me may be their biggest minus. Lack of experience pure combat options. They have quite a few very strong investigate cards though like Look What I Found and Newspaper 2 exp. And their allies, while strong, are still nicely balanced and quite diverse. They have a million evade options, maybe even too much. It would sometimes be nice to get something else rather than just another evade. These new Track Shoes are pretty cool though. Their skill cards are pretty awesome imo. But still, I feel like they have so much niche cards that only allow some very specific combos, like Rogues, and lack anything else meaningful. They are extremely fun to play though, with all their Survivor specializations like Seeker Urchin, combat Gravekeeper, being dead crazy guy, new athlete runner and evader etc. They may even be my favorite class, but they still lack options in a big way. And they even lack one investigator only available for 15 bucks along a cheaply written book. CLUSTER 2  survivors da best * Guardian just have so many cards that, when looking at them, seem useful. Mystic is my actual favourite class, but it has soo many cards where they went too far with the punishing effects that almost made them useless. Survivor on the other hand has too many events that end up feeling almost the same. 
Seekers I dont really know, Ive never played them, but everytime I see them looks like they can do many things. And rogue, while it has improved a bit more, I still miss a useful level 1 weapon * Difficulty wrapping my head around some classes * Mystics are incredibly dependent on their cards. * Seekers usually win the game, because the snitch is 150 points * Always cards in these classes that I have a hard time cutting. Which means they have the deepest pools marking them the most fun to me * I love deck manipulation for seekers, and the flexibility of survivors. I just can't get my head wrapped around mystics. * Guardians have a lot of great tools for not just fighting but getting clues. Seeker has the best support so splashing it is great. Rogue and survivor are ties for good splash but survivors card pool is mediocre to me. Mystic aren't bad but I haven't seen it great with others very well. Mystics are good as themselves but really expensive and not great for splash IMO. * Survivor have many nice tricks to survive and gather clues. Guardians because they have the best weapons (flamethrower) and protective tools. seeker for their upgradable cards and higher ed. mystic for canceling cards but dont like their only good stat is willpower... rogues seems interesting but never played one. * Seekers have action economy (shortcut, pathfinder), card economy, resource economy (Dr Milan + cheap cards) and they advance the game quickly (i.e. discover clues). Specialist decks are better than generalist decks (in multiplayer, which I play) as they accomplish their goals more consistently, and this favours seekers and guardians. Stick To The Plan with Ever Vigilant is the most powerful deck element I am aware of. * I tend to play builds focused around consistency of succeeding at tests and action efficiency and my rankings reflect the build consistencies in order except rogue who are consistent but just not interesting. 
* Love survivors * Seeker is m'y main class * Firstly let me preface this with I only own 2 cores and the Dunwich cycle and have yet to play through Dunwich. Survivor offers the most versatility and always seems to be one of the key factors when beating the odds in most cases as well as enhancing evasion and action economy (survival instinct etc). Seeker cards are my second favourite due to the amount of utility included within them (i.e. Shortcut, Barricade, Medical Texts, Old Book of Lore etc) as well as allowing you what you need to catapult out in front of the agenda deck with cluevering abilities. Guardian and Mystic operate on a similar field marginally behind Seeker to me though mystic finds itself slightly higher because of the unique interactions with the encounter deck and rule bending. though in my limited experience they both seem to be the more combat based of the card pools so operate in that same niche for me. Rogue is unfortunately last but honestly that's just because I haven't had many interactions with them, most of their effects seem too situational to be able to use consistently. * I don't like taking the obvious solutions to a problem. I.E: Gun to the face, or Spells for everything. * Efficiency at what is needed to complete scenarios  mostly clue getting and combat. * Rogue and survivor seem to have the most cards that support each other to suggest a new way of playing. Recursion survivor is fun and different from no money survivor (though you can do both). Rogue has succeed by 2 and rich as options. Seeker has less of that but has the power of untranslated etc cards. Guardians are okay but kind of blah. I can’t see any fun decks to do with mystic. Like, messing with the bag is a cool thing to do in any deck, it isn’t a deck. Playing with doom is a couple cards that need each other but it isn’t a plan for how the game will play out. 
* Definitely hard to rank them, but ranked in order of which I'd most like to have as an offclass * I like the consistency in the Survivor card pool and how much individual support there is for the variety of Survivor investigators. Although I like the power level of Mystic cards, it always sucks to have your Rite of Seeking or Shriveling 15 cards down after a full mulligan for them. * More scenarios need cloovers and fighters, so all classes outside of Seeker and Guardian are more tricksy and less focused on the goal. This is a hardenough game as it is! * Seeker cards are way too powerful. Rogues are the most fun to play. Survivor cards are super efficient at what they do. Guardian pool is decent but overpriced. Mystics have a few amazing cards, but the rest is pretty meh. CLUSTER 3  Vaguely from ‘most interactive’ to ‘most straightforward’ with a special mention for the Survivor card pool which has been largely static since roughly core with a few major exceptions. * Rogue cards are the most fun for me. More money, more actions, more fun. * I seem to like the classes that are less straightforward than Guardian and Seeker tend to be. (In the sense that they are the archetypical fighters and cluevers.) * I like cards that cheat the system and don't depend on leveraging board state * Green and purple cards have fun and flashy effects. Blue and yellow cards have more standard effects and narrower deck building options. * I didn't play mystics a lot yet * The numbers are different depending whether we’re talking theory or practice. In theory the Mystic cards are my favorite, both for flavor and interesting mechanics. In practice I struggle with them and they’re usually the first cut. * Combos! * I like moneeeey * seekers have literally everything, and their cards usually aren't too expensive. rogues have adaptable, streetwise, and really good allies, but they're a bit low in damage output. guardians have really good cards but are limited by how expensive they are. 
mystic events are amazing, but they are 4th place because level 0 spells kinda suck and are expensive as hell. mystic cards are much better with exp. survivor cards are almost decent. it really sucks that many of their leveled up cards are exile cards, but survivors don't get any extra exp. but in general i find their cards to be lacking in cluegathering capability and damage dealing. they can turn failures into successes, but that's about it. * Guardian is solid and predictable, Rogue is fun. Mystic is challenging, Seeker and Survivor are necessary. * THE JANK * I really dislike survivors as I simply dont understand how to properly build them (appart maybe Wendy). Even if I have rated mystics 4, I enjoy playing Mystic nearly as much as seeker (which I rated 1) rather than Survivor. * I think the rogue theme is portayed very well in their card pool * corelation between mechanisms and theme * I like big, flashy, ridiculous turns and risky plays, so rogue and mystic are the most fun for me. Guardian and seeker are fine and all, just a bit dry. I don’t understand survivor at all, but I’m happy other people have a thing they like. * Rogue and survivor give you interesting and powerful but situational tools that you have to figure out how to apply to the scenario. Mystic and guardian are more about powerful assets that you either draw early and use a bunch or wish you’d drawn earlier but can’t afford now and just commit for icons. Seeker pool makes me sleepy every time I look at it; the only mechanic there I really enjoy is the tome synergies and that’s only with Daisy (Rex, of course, is only played one way). * Roleplay Value * I went for those that get me excited to play or provide thrills or cool combinations as I play (rather than, say, the power of the cards) CLUSTER 4  Lol moments. We’d all be survivor if we were in this game! 
* The top two were tricky to place; Rogues have fantastically fun combo plays available to them, while I love the 'feel' of many Survivor cards, fighting against fate as hard as they damn well can. Overall, I find the Survivor pool *just* wins out, especially with the excellent Will To Survive and semiimmortal Pete Sylvestre. Guardians and Seekers are two sides of the same coin; I'd say Guardians edge out, because while a Guardian has a few tools (including the infamous Flashlight) to find clues, Seekers have very few options to take care of enemies. As with Survivors and Rogues, though, this is close. Mystics... weeeeell. .. I acknowledge they are arguably the best class, once set up, and while their charges last on their spells. The ability to do everything while testing just one stat can make them very efficient. But... this is the class I enjoy the least, in part due to their overreliance on their spells. Their solutions never feel more than stopgaps for me, so I find Mystics a hard class to play. (That won't stop me taking Mystics for a spin though, especially for goodies like Delve Too Deep ) * Ability to bounce off Investigators with good resource and action economy, other card pools (including Neutral), as well as capability to start off with no experience — all the way to full campaign with as much powercard investment as possible. Seeker may have 2 of the best cards in the game (Higher Education and Dr. Milan Christopher), but the Seeker card pool as a whole does not stand up. It is both narrow and shallow. Mystic is the most detailed and the most broad, but suffers from undue delay leading to deterioration. Guardian definitely needs to be more broad as well. Both Rogue and Survivor blend well, and provide the necessary breadth to take on challenges while melding with the higheconomy Investigators. Rogue has a few 3, 4, and 5 xp cards that push it to the top spot. Even for Lola these statements hold up. * On a scale of most interesting vs. most boring. 
Options for rogues and survivors feel fresh and like there are multiple deck archetypes that are valid. Less so for the seeker and mystic card pools, where I feel like there are more "must include" cards which makes deck building less exciting and more rote. * survivor da bass * The card pool allows rogue/survivor decks to make specialists. Seekers are all just different flavours of clueverer * Personally, I like healing and support best, which guardian does quite well. Survivor has my second favorite card pool, though, for tricks and card recursion. * Not much between them but I like guns & ammo, survivor class is cool because it is normies vs horror * I really like the guardian cards as i enjoy fighting the monsters that appear in scenarios. Unfortunately my least favorite is mystic. Although they have powerful cards, they often take time to set up and I think that the draw backs on some of their cards are too harsh for what they do. * Just what I gravitate towards * I like killing monsters * Mystics have so much of everything with cool effects added on. Guardian cards are efficient at what they do, but really boring. * Survivors feel more unique, guardians kill stuff, seekers feel like you can't win without them (though you really can). Rogues and mystics by default. I like rogues better because of Finn and Sefina being really fun to play. * Almost always let my partner(s) play the seekers as I find the rogue and survivor cardpools allow you to fly by the seat of your pants, which I find even more exciting than just being the clue gatherer. Mystic card pool can sometimes take too long to develop. Also many marquis mystic cards flirt around with the doom mechanic which always bites me in the arse. Thirdly, mystic pool doesn't have a strong ally base. What's funny about that is I always play spellcasters in D n D. Guardian pool is pretty straightforward, one I look at as more of a necessity within the context of the game,but doesn't tug at my heartstrings . 
Apologize for the sloppy phrasing in my opine due to a long day. Rankings based on personal preferences only. No meta analysis * Agnes * Just prefer the in your face monster destruction that Guardian is themed with. Really enjoy that class. * Flexibility * I love killing things and then getting clues! * I like all of them but play seekers least, I also like that guardians can just take the mythos phase to the face * I like to be the tank, and with some of the new additions guardians have gotten with getting clues they just shine even more. Mystic I never really play but has so many cards I want if I am playing a dunwich survivor or anyone who can get them, same goes for survivor, very few cards from rogue or seeker makes it into my decks unless I am playing chars like Wendy or Leo who kinda needs them to make them work * Number of fun upgrade options in green, base card pool for red, upgrade options in blue, useful upgrades in seeker, purple sucks to play. * I like support / killing role, Carolyn FTW CLUSTER 5  Weapons are fun. * Leo is da alpha male. * Red's best * There’s more variety of cards that let me build decks I enjoy. As you go down the ranking, there’s less variety or fewer viable sub themes to build around. * Seeker is powerful but boring, while mystic getting to punch pack at the game is great, with good support to boot. * I enjoy the cardplay in survivor, and the mystic side of things. Seeker cards are generally very powerful. I don’t enjoy playing rogue but there is some good cardplay. Guardian I find less interesting overall as a class * Wendy is literally the best investigator in the game * I enjoy support cards and interesting, unique effects. * I tend to go for lazy game play, and usually guardians allow to smash enemies without worrying too much about strategy. Seekers I love them thematically. Mystics, I never understood how to play them efficiently
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
This blog post was prompted by this meme posted in the Arkham Horror: The Card Game Facebook group:
This meme got 144 comments and 94 reactions, and I’d say that most of them were in effect agreeing with the point the meme makes; the Survivor class feels like a black sheep among the Arkham Horror LCG classes (not that there aren’t people who like Survivors).
One of the comments on the thread was made by me:
Seriously, we’re three cycles in, starting the fourth, and there’s no Level 4-5 Survivor cards? It makes you wonder if there will EVER be high-level Survivor cards!
I’m sure you can make one that does [sic]. Here’s one: King of the Hobos! When you have no resources, take two extra actions.
I dunno, just give us something. Late in the campaign it becomes a pain to upgrade your deck because all cards are low XP. It also makes seeing “Survivor cards level 0-5” a joke and Lola basically another survivor.
This comment, on its own, got 58 replies. So let’s just say that how the Arkham Horror designers are handling the Survivor class is a hot-button topic. It’s pretty clear what I think about how Survivors are handled. But I also wanted to see to what extent the community agreed or disagreed with me.
So I created a poll (now closed) asking people to rank the Arkham Horror LCG classes from best (1) to worst (5). I’ll first present this and other data, then present my own opinion.
The first data I saw related to this was a response to my comment, showing the popularity of investigator decks on ArkhamDB. Below is a screenshot of the interactive tool (that you should check out):
(This data set does not include the investigators released with The Circle Undone, but considering how new the cycle is that may be for the best.)
Survivors supposedly occupy the low tier of ArkhamDB decks, and thus are less popular. I don’t agree with drawing this conclusion from this data set; as others pointed out, lots of people may be playing Survivors without posting their decks, and the number of new decks being made doesn’t necessarily correlate well with how people feel the class performs. For instance, supposedly there are not many Mystic decks because all Mystic decks include the same key cards and thus look largely the same. (And that’s not a good thing, by the way.)
I also want to add that the fundamental question is how good a card pool is rather than how good Survivor investigators are. I don’t think that there’s anything wrong with the Survivor investigators in a vacuum; they’re all great investigators. (That includes Calvin, too; he’s not just a “challenge” investigator, he can pull his weight and more in a game when played right.) That said, the identification of an investigator as a “Guardian” or “Seeker” or “Survivor” serves no mechanical purpose. Nothing in the game pings off the class of the investigator; all investigators could be in the “Neutral” class (like Lola) and the game would be the same because of the deck building requirements. All that the class of an investigator does is indicate what the deck building requirements will be. And some investigators, such as Carolyn and Norman, really drive this point home with their deck building requirements.
So to answer this question, I created a poll, asking people to rate the card pools of the classes. My question was simple: “Rank which class card pool you prefer, from most favorite (1) to least favorite (5).” Not everyone agreed that this was the best question to ask; there are different aspects in which classes may be “better” or “worse”. My primary Arkham Horror LCG partner and friendly local game store owner Matt Freed (who owns Mind Games, LLC, in West Valley, Utah) refused to answer the question as phrased because his rankings would differ completely depending on whether the class in question was the primary class or an off class of an investigator (I told him that I cared most about the card pool class as a primary class of an investigator). That said, the simplicity of the question (plus attempting to publicize the poll on the Internet as well as I could) managed to get me 421 responses, a decent sample size.
When looking at this data one must remember that this is not a random sample as statisticians prefer. The people who participated chose to do so and they’re all from the Internet, which means that one can raise questions about the external validity of the data. That said, I think that we can still learn a lot from the data, even if it’s not perfect.
By the way, analyzing the data I got was my first foray into analyzing rank data. I had to learn new statistical methods to meaningfully pry into the data to see what it said. I loved the methodology, and the details of what I did will be presented in a later post, along with the scripts I wrote for doing this analysis. For now, I’ll just mention that my primary reference for learning how to analyze the data was Analyzing and Modeling Rank Data, by John Marden.
Let’s start with some basic charts. Below is the “marginals” plot; simply put, it’s how frequently a class was assigned a particular ranking.
One reading of this plot is that Seekers are rated high, Survivors are rated low, Guardians are rated third, and it’s hard to tell how Mystics and Rogues rank, though it appears that Rogues are better liked than Mystics.
I however don’t like this plot since it doesn’t take advantage of the fact that the data is rank data. A plot that better accounts for the nature of the data is the “pairs” plot, which shows how many people prefer the axis class to the class represented by the bar. If the bar is above 50%, then the axis class is preferred to the bar class, while if the bar is below 50%, the bar’s class is preferred. The pairs plot (with a line marking the important 50% cutoff) is shown below:
It’s much clearer from this plot which classes people prefer. Seekers are preferred to all other classes. Guardians are preferred to all classes but Seekers. Mystics are preferred to Rogues and Survivors, and Rogues are preferred only to Survivors. The Survivors are handily in last place, not being preferred overall to any other class.
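To make the pairs plot concrete: each bar corresponds to an entry of a “pairs matrix,” whose (i, j) entry is the share of respondents who ranked class i ahead of class j. Here is a minimal sketch in R of how such a matrix can be computed. The five rankings are made up for illustration (they are not poll responses), and pairs_matrix() is a hypothetical helper name, not a function from the scripts used for the actual analysis.

```r
# Hypothetical rankings: one row per respondent, one column per class.
# Entries are ranks from 1 (most favorite) to 5 (least favorite).
classes <- c("Guardian", "Seeker", "Rogue", "Mystic", "Survivor")
ranks <- matrix(c(2, 1, 4, 3, 5,
                  1, 2, 3, 4, 5,
                  3, 1, 2, 5, 4,
                  2, 1, 5, 3, 4,
                  1, 2, 4, 3, 5),
                ncol = 5, byrow = TRUE, dimnames = list(NULL, classes))

# Entry [i, j] is the share of respondents who ranked class i ahead of
# class j, i.e. gave it a numerically smaller rank. The diagonal is NA.
pairs_matrix <- function(ranks) {
  k <- ncol(ranks)
  out <- matrix(NA_real_, k, k,
                dimnames = list(colnames(ranks), colnames(ranks)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      if (i != j) out[i, j] <- mean(ranks[, i] < ranks[, j])
    }
  }
  out
}

pairs_mat <- pairs_matrix(ranks)
print(round(pairs_mat, 2))
```

An entry above 0.5 in row i, column j means class i is preferred to class j by a majority, which is exactly the 50% cutoff marked on the plot.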
This suggests that the ranking of the classes is, from best to worst: Seeker, Guardian, Mystic, Rogue, and finally Survivor. This is also the ranking produced by one estimator of the central (or “closest to consensus”) ranking, which finds the ranking that minimizes the sum of Kendall distances to the responses (a criterion closely tied to the pairs plot). Also, a statistical test confirmed that the respondents were not equally likely to give any ranking and thus have preferences. A 95% confidence interval could only conclude, though, that Seekers were preferred to Rogues and Survivors; any other ranking is plausible under that confidence interval. That means that the Seekers’ overall rank in the community is third or better, while Rogues and Survivors cannot be ranked first in the consensus ranking. (All tests and confidence intervals were based on the pairs matrix/Kendall distance.)
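With only five classes there are just 5! = 120 possible rankings, so a central ranking of this kind can be found by brute force. The sketch below is one way to do that in R, under the assumption that every response is a complete ranking without ties. The data are again made up, and kendall_distance() and central_ranking() are hypothetical helper names rather than the functions used for the poll.

```r
# Kendall distance: the number of pairs of items that two rankings
# put in opposite order.
kendall_distance <- function(r1, r2) {
  idx <- combn(length(r1), 2)
  sum((r1[idx[1, ]] - r1[idx[2, ]]) * (r2[idx[1, ]] - r2[idx[2, ]]) < 0)
}

# Brute-force search for the ranking minimizing the total Kendall distance
# to all observed rankings (feasible because 5! = 120 candidates).
central_ranking <- function(ranks) {
  n <- ncol(ranks)
  grid <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
  # Keep only the rows that are permutations of 1:n
  candidates <- unname(grid[apply(grid, 1, anyDuplicated) == 0, , drop = FALSE])
  total <- apply(candidates, 1, function(cand) {
    sum(apply(ranks, 1, kendall_distance, r2 = cand))
  })
  best <- candidates[which.min(total), ]
  names(best) <- colnames(ranks)
  best
}

# Hypothetical rankings (rows are respondents; 1 = most favorite)
classes <- c("Guardian", "Seeker", "Rogue", "Mystic", "Survivor")
ranks <- matrix(c(2, 1, 4, 3, 5,
                  1, 2, 3, 4, 5,
                  3, 1, 2, 5, 4,
                  2, 1, 5, 3, 4,
                  1, 2, 4, 3, 5),
                ncol = 5, byrow = TRUE, dimnames = list(NULL, classes))

central <- central_ranking(ranks)
print(central)
```

Note that which.min() returns the first minimizer, so if two rankings tie for the smallest total distance this sketch silently reports only one of them.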
These are the statistics for the community considered as a whole, but it’s possible that players fall into different “archetypes” and thus may have different class preferences. Matt Newman, the lead designer of Arkham Horror, seems to believe so according to this article he wrote. I never asked players what “type” they were, but I attempted to determine types via cluster analysis, based on spectral clustering.
Let me start by saying that I’m not convinced there are meaningful clusters of players in this data. All the metrics for finding clusters were bad. But if you insist that there must be clusters of player types in the data, read on.
If there are clusters, my best guess is that there are five clusters of players. Based on reading the (optional) responses of players in each cluster, I’ve labeled (in a very subjective manner) cluster 1 as the “power” players (this is the cluster of Matt and me), cluster 2 as the “versatility/efficiency” cluster, cluster 3 as the “bling” cluster, cluster 4 as the “kill monsters” cluster, and cluster 5 as the cluster that likes “theme” in the game (although honestly this cluster was the hardest to define and often looked like the cluster of the most confused participants; the comment rate from this cluster was the lowest).
31% of players were in the “power” cluster, 20% in the “versatility/efficiency” cluster, 19% in the “bling” cluster, 16% in the “kill monsters” cluster, and 14% in the “theme” cluster. The class preference in each cluster was, in order:
The confidence intervals (which are starting to lose their validity in the cluster analysis due to sample size and some preference pairings never being seen in the cluster) suggest that the “power” players dislike Rogues and Survivors; any ranking that puts Rogues and Survivors in the two worst rankings could be this group’s “central” ranking. The second group is more difficult to infer; Guardians and Survivors are supposedly better than Mystics and Rogues, while Seekers are better than all other classes. The “bling” cluster loves Rogues more than any other class and prefers Seekers to Survivors. The “kill monsters” cluster prefers Guardians and Survivors to Mystics and Seekers, and Rogues to Mystics. Finally, the “theme” cluster prefers Mystics and Survivors to all other classes.
I would say that, taken together, this data suggests that the Survivor class is, indeed, problematic, and players on the whole are not fans of how it’s handled. (That said, one could make a case that this is true for the Rogue class too.) Here’s my argument regarding what’s wrong with the Survivor card pool.
Let me start by saying there are great Survivor cards. Lucky! may be the best card in the game. Peter Sylvestre is a great ally (at both levels), and I even really like Level 3 Rabbit’s Foot. Survivor cards are among the top 20 most popular (faction) cards in ArkhamDB decks.
While no one can credibly argue that Survivors have bad cards, I think we can credibly argue that the Survivor card pool is weak and eventually becomes a drag to play in a campaign.
The two most recent Survivors I played were “Ashcan” Pete and William Yorick. My “Ashcan” deck was a deck that heavily utilized Yaotl, Cornered, and the desperate cards. I was playing this deck in The Forgotten Age cycle. At first I really enjoyed the deck and how it worked. I don’t think it was the most powerful deck but it was fun to play. Eventually, though, I fell out of love with the deck and I don’t think I will ever try it again. I reached a point where I could not upgrade the deck without replacing what I saw as “core” cards. This problem occurred mid-campaign, too.
The same problem happened with my William Yorick deck (again in The Forgotten Age). Granted, Yorick has access to the Guardian card pool, which gave me an important outlet for spending experience without removing core cards. Eventually, though, I ran into the same problem: lots of experience points and nowhere to spend them (but at least it was late in the campaign when I hit this problem).
Thus I have basically one complaint with the Survivor card pool: there are no high-level Survivor cards. It turns out this is by design; Matt Newman told the hosts of the Drawn to the Flame podcast in episode 22 that he liked keeping the level of Survivor cards capped at 3 since it fit with the Survivor theme of “not being ready.” Thus we are starting our fourth Arkham Horror cycle and there are no high-level Survivor cards.
What are the consequences of this? Well, in my FLGS, “Survivor cards level 0-5” is a long-running joke since there are no high-level Survivor cards. Now Matt (the FLGS owner) said that, as an off-class, Survivor cards are one of the best pools to pull from, and I agree with him. Consider the table below:
Class      Total Cards   High-Level Cards (at least 3 XP)
Seeker     81            18
Guardian   79            14
Mystic     80            17
Rogue      79            12
Survivor   80            11
69 Survivor cards (the 80 in the pool minus the 11 high-level ones) are accessible to off-class Survivors, more than any other class. Furthermore, these are just about all the best Survivor cards.
Now let’s take a step back. Remember when Joe Diamond was announced to be a Seeker? That was a shock to many in the community, who were pretty sure that Joe Diamond would be a Guardian. Furthermore, making Joe Diamond a Seeker was a big deal that had major implications for his deck building. A primary-class Seeker/off-class Guardian will look very different from a primary-class Guardian/off-class Seeker, both in deck and style of play.
Let’s take William Yorick. What would change if William Yorick went from primary-class Survivor/off-class Guardian to primary-class Guardian/off-class Survivor? Well, Yorick would lose access to the 11 Survivor cards that are level 3 and gain access to the 14 Guardian cards that are levels 3, 4, and 5. And to be completely honest, I wouldn’t miss any of the Survivor cards I lost; I didn’t use them in the Yorick deck I built and honestly none of these cards feel like great cards we’d enthusiastically put in decks. I’d say that the level 3 Survivor cards are generally cards that get placed in decks that have experience points to burn; that XP has to be spent on something. So by making this switch, my Yorick deck would, almost unambiguously, become better.
Thus my first point: high-level cards help distinguish class capabilities. It is because these powerful cards are not accessible to other investigators that these classes are distinctive. Seekers, Guardians, Rogues, and Mystics all have cards that make that class memorable and help separate that class from the rest. The Survivor card pool does not have this since off-class Survivors are basically just as good as primary-class Survivors. Heck, even Lola has full access to the Survivor pool! She may as well be one!
I think that if Matt Newman were to read this his response would be “Survivors’ lack of high-level cards is what makes them distinctive. It’s their lack of preparedness that makes them thematically work.” First, I think we’ve seen that the lack of high-level cards makes the class worse from a gameplay perspective. Second, think about who we would call “survivalists” today, such as tribal people, prehistoric humans, or guys that go off into the woods and cut themselves off from civilization. There are actually many aspects of these people that appear almost superhuman. They’re impoverished but often skilled in everything necessary for survival. Most people from regular society would die if immediately forced into the dire situations survivalists deal with on a regular basis. Survivalism is not about being unprepared or low-skilled. It’s about being well-rounded and internalizing all strengths, thus not being dependent on the tools available to you. Survivalists are actually well-prepared! And high-level cards can be designed to fit this ethos.
And thus my second criticism of the low-XP policy: it restricts player growth. Upgrading your deck is not only a way to get new toys; it shows how the encounters with the mythos caused the investigator to grow and improve. By preventing access to high-level cards, the investigator’s growth is restricted. There is access to a lot of low-level cards, but putting these cards into a deck pushes out other cards to such an extent that the decks at the beginning and end of the campaign look extremely different. This was the case with my aforementioned Ashcan deck; I could not upgrade it without drastically altering it. I could not keep the same deck archetype while at the same time staying a Survivor. The deck would have to transform in character in order to upgrade; it wasn’t really possible for the deck to just get better at what it already does.
Now, there are the Exile cards that can help players burn XP. But I hate those cards! Not only are they very narrow cards, but I also don’t like burning XP when I play a card. (That said, I like the upcoming Survivor ally Guiding Spirit and wouldn’t have a problem putting it into a deck even early in the campaign.) I think most players don’t want to use their experience points on Exile cards either, so I wish there were other places I could spend my experience points in the Survivor class than the Exile cards.
I’ve spent this article picking on Survivor cards, and there is a lot of evidence suggesting this class is in need of the most work. But there’s also evidence that people don’t like the overall design of the Mystic and Rogue classes, either. I think people’s main complaint with the Mystic class is that deck building with Mystics feels stale; there are some key cards that every Mystic deck includes and thus they all start to look the same. I think this is a valid point, and one good way to fix this would be to make a Mystic permanent granting another Arcane slot. That would make more Mystic players willing to look beyond the Shrivelling/Rite of Seeking/Mists of R’lyeh staples.
As for Rogues, I don’t see why Rogues get hate. The cluster analysis suggests there’s a class of player that loves Rogues. I think that Rogue hate stems from a belief that Rogues are not good enough at investigating/fighting, or a general lack of interest in evading enemies (which Rogues should do well at).
Seekers and Guardians are great classes and don’t need much work. If anything, those classes are too good. But no complaints from me.
But I stand by my conclusion that Survivors need work, and that what they need are high-level cards. I think it is possible to give Survivors high-level cards while keeping with the ethos of the class. In keeping with the idea that Survivors’ strengths are innate and well-rounded, I think high-level Survivor cards could consist of events, skill cards, and non-item and non-ally assets. For instance, perhaps a Level 5 permanent called “Survivalism” that allows a Survivor to spend two resources to boost any skill, or a high-level Dark Horse that gives Survivors +2 to all their skills when they have no resources or assets. (I’m spitballing here, guys; I’m sure these card ideas suck.)
I bet there are probably diehard Survivor fans who will bristle at my criticisms of the class. To them I have to ask: do you like the fact that you don’t have the option to buy high-level Survivor cards? I certainly don’t, and I hope that “Survivor cards level 0-5” will no longer be the joke it currently is.
Months ago, I asked a question to the community: how should I organize my R research projects? After writing that post, doing some reading, then putting a plan in practice, I now have my own answer.
First, some background. In the early months of 2016 I began a research project with my current Ph.D. advisor that involved extensive coding and spanned at least two years. My code was poorly organized and thus problematic, as managing the chaos and extending the code became difficult. Meanwhile, I was reading articles by programmers and researchers about ways to organize R code so that research results are reproducible, distributable, and extensible. I identified two different approaches to organizing a project to meet these goals: one centered around makefiles, and another around package development. Given these competing approaches and their differing advantages, I was unsure what to do.
Since writing that post, I did more reading. First, I read two of Hadley Wickham’s (excellent) books: R Packages and Advanced R. (I loved R Packages so much I bought a physical copy.) I also read a book I picked up in a Humble Bundle book sale called Code Craft: The Practice of Writing Excellent Code by Pete Goodliffe for learning about good coding practices. Finally, I read a good portion of the GNU make manual.
I also spent months restructuring the project to comply with what I learned. Many, many hours were spent just fixing the mess I had made by not doing things right in the first place.
The result is CPAT, an R package implementing some change point analysis statistical tests. What CPAT does will be the subject of a future post (it will be published when the accompanying paper is made available online); what I want to focus on in this article is how I learned to organize an R research project, and how that culminated in CPAT.
In the earlier article I presented two approaches that I suggested were “competing” approaches to organizing a research project: the project-as-executable approach of Jon Zelner and the project-as-package approach of Csefalvay and Flight. Both approaches, in my view, possessed unique advantages, but seemed to be at odds.
They are not at odds. CPAT demonstrates that it is possible to view an R project as both an executable and as a package. That said, the package development approach becomes dominant; making the package executable (from the command line) is an additional feature that makes the project even more portable and extensible.
If one is going to adopt the package development approach, one must use the hierarchy R packages need. So that means:
R code goes in the R/ directory.
Documentation goes in the man/ directory (if you’re using roxygen2 and devtools like a sane human being, though, this is something you won’t do yourself).
Data sets go in the data/ directory.
Compiled code goes in the src/ directory.
Tests go in the tests/ directory (but if you’re using testthat for your testing then the tests you actually wrote go in tests/testthat/).
Vignettes go in vignettes/. This could be the paper itself, if written in the form of a vignette.
Supporting files go in the inst/ directory, to be installed with the package, along with other files that should be installed into the base directory (such as Makefile). For example, I put all my plots in inst/plots/, and this would also be a good directory to put the paper that accompanies the project.
Executable scripts go in exec/.
The approach championed by Zelner doesn’t require a particular organizational style but simply that there be a coherent organization to the project. R package development not only has a coherent structure but even enforces it. If that structure doesn’t quite work, then one can add other files and directories as needed and note them in the .Rbuildignore file, so they’re ignored when the package is built.
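As a rough illustration of how .Rbuildignore behaves: each line of the file is a Perl-style regular expression matched case-insensitively against file and directory paths relative to the package root. The sketch below mimics that matching in plain R. The patterns and paths are invented for the example (they are not CPAT’s), and the real matching is done by R CMD build, not by this code.

```r
# Hypothetical .Rbuildignore entries, one regular expression per line
patterns <- c("^scratch$", "^notes/.*[.]txt$", "^.*[.]Rproj$")

# Hypothetical paths relative to the package's top-level directory
paths <- c("scratch", "notes/todo.txt", "myproject.Rproj",
           "R/stats.R", "inst/Makefile")

# A path is ignored when at least one pattern matches it
ignored <- vapply(paths, function(p) {
  any(vapply(patterns, grepl, logical(1), x = p,
             perl = TRUE, ignore.case = TRUE))
}, logical(1))

print(ignored)
```

Anchoring the patterns with ^ and $ is the usual way to avoid accidentally excluding files whose names merely contain the same text.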
When writing an R package, the relevant R tools basically enforce some essential points of style such as documenting objects. Also, the developer-researcher starts to think of important functionality of the project in terms of reusable functions that should be added to the package to be called by the scripts that actually execute the analysis, with documentation and everything else. Having well-documented functions, even if they serve a minor purpose, helps greatly in making the project more easily understood and written not only by others but by the original author as well. In my case, since I wrote CPAT almost exclusively with vim, I wrote an UltiSnips snippet creating a function skeleton that not only defines the function but automatically adds the framework of the documentation, as seen below.
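For readers who do not use UltiSnips, the kind of skeleton meant here (a function definition preceded by roxygen2 comment tags) might look like the following sketch. The function, its name, and its arguments are made up for illustration and are not taken from CPAT or from the snippet itself.

```r
#' Compute a Rolling Mean
#'
#' Averages each run of \code{width} consecutive observations of \code{x}.
#'
#' @param x A numeric vector
#' @param width The number of consecutive observations in each window
#' @return A numeric vector of length \code{length(x) - width + 1}
#' @examples
#' rolling_mean(1:5, width = 2)
#' @export
rolling_mean <- function(x, width) {
  stopifnot(is.numeric(x), width >= 1, width <= length(x))
  # Lagged differences of the cumulative sums give each window's total
  totals <- diff(c(0, cumsum(x)), lag = width)
  totals / width
}

rolling_mean(1:5, width = 2)
```

When devtools::document() is run on a package containing this file, roxygen2 turns the #' comments into the corresponding file in man/, which is why the man/ directory rarely needs to be edited by hand.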
While package development does place (helpful) constraints, it does not specify everything. In other words, there is room for style. I essentially define style to mean any aspect of programming in which a choice is made that was not determined by the programming language or software. Examples of style include naming conventions, indentation, etc. Consistent style makes for understandable code; having consistent style is arguably more important than the stylistic decisions made. So I decided to codify my own stylistic preferences in a style guide, and when I did my code rewrite I made the new code comply with my style guide, even if that added more time to editing. Whenever I encountered a new “decision point” (such as, say, dataset naming conventions), I committed my decision to the style guide.
As I mentioned above, the package development approach turns out not to be mutually exclusive with the project-as-executable approach. While it seems like documentation on R package development (including Dr. Wickham’s book) mentions the exec/ directory of a package only in passing, I found it to be a good place to put executable R scripts. Similarly, make can still be used to automate analysis tasks; R packages allow for including make files.
So in addition to the files that essentially defined the package, I also wrote standalone, command-line executable R scripts and placed them in the exec/ directory (which causes them to be flagged as “executable” when the package is installed). I wrote a Vim template file for R scripts that provides a skeleton for making the package executable from the command line. That template is listed below:
#!/usr/bin/Rscript
################################################################################
# MyFile.R
################################################################################
# 2018-12-31 (last modified date)
# John Doe (author)
################################################################################
# This is a one-line description of the file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(foo, bar, help = FALSE) {
  # This function will be executed when the script is called from the command
  # line; the help parameter does nothing, but is needed for do.call() to work

  quit()
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "This is a template for executable R scripts.",
    option_list = list(
      make_option(c("--foo", "-f"), type = "integer", default = 0,
                  help = "A command-line argument"),
      make_option(c("--bar", "-b"), type = "character",
                  help = "Another command-line argument")
    )
  ))

  do.call(main, cl_args)
}
Converting my scripts into modularized, executable programs was, not surprisingly, very time consuming, and the transition was not perfect; some scripts just could not be modularized well. Nevertheless, the end result was likely worth it, and I could then write a Makefile defining how the pieces fit together. This tamed the complexity of the project and made it more reproducible; someone looking to repeat my analysis should only have to type make in a Linux terminal^{1} to see the results themselves.
While I did make my project modular and executable, though, I did not try to make it standalone with, say, packrat or Docker. I did try to use packrat, even setting it up to work with my package. But I ran into severe problems when I tried to work with my package at the University of Utah Mathematics Department, since the computer system’s R installation is almost four years old as of this writing and highly temperamental due to how the system administrator set it up. packrat made complications working with the department computers even worse, and I disabled it in a huff one day and never looked back. As for Docker or GitLab, I did not want my project tied up with proprietary or web-based services, and I felt that the end result Zelner was seeking when using these services is overkill; when you’ve added packrat (which I didn’t because of complications, but still) and defined how the project pieces fit together with make, you’ve mostly conquered the reproducibility problem, in my view. So I never missed these services.
The end result of this work can be seen in the paper branch of CPAT, also permanently available in this tarball. The directory tree is also informative.
In some sense the end goal is to have an R package that could be distributed to others via, say, CRAN, so they can use the methods you employed and developed, not just reproduce your research; at least, that’s the case for me, a mathematical statistician interested in analyzing and developing statistical tests and procedures. When a package is written to contain research and not just for software distribution, it comes with a lot of files that aren’t needed for the package to function; just look at the directory tree!
The solution is to just delete the files that can be recreated (perhaps with make clean if you set it up right) and consider adding other files to .Rbuildignore when you want to distribute the package for others to use. So this isn’t actually a big problem.
Another issue that I encountered and am still unsure about is what to do with functions that are useful to the project but not useful outside of it. If you look through the paper branch manual or even the public version manual you will find functions that were useful only for the project, perhaps for converting data structures created by scripts or making particular plots that make sense only for the paper. They’re all private functions that need to be accessed via the ::: operator, yet they’re still in the manual.
I’m undecided whether this is good style. On the one hand, it’s nice that when others read your code there are manual entries even for functions that are local to the project to further document what was done and how the code works. Even when distributing the software, having every function documented, even ones that are “private” to the package, seems to be in concordance with the spirit of open source software, making the source code easier to understand for users who need and want to know how your software works. It also could serve as a good way to modularize documentation; a statistical formula is kept with the function that computes it rather than the interface to that function (which likely links to that underlying function). Having examples for those internal functions also should provide an additional layer of testing and helps when others want to extend the package.
On the other hand… most of the pages of the manual are devoted to functions the user isn’t supposed to be calling directly in their work. Of all those functions, maybe five are functions the user is expected to use. Should all that documentation space be devoted to something the user doesn’t use?
While I’m not set in my opinion, I lean to having more documentation rather than less, even if most of it is for private functions. After all, it’s useful to me when I’m developing the project and package.
I feel like spending those months to make my project logical and reproducible was time well spent. Not only did I learn a lot in the process, I had a useful end product that is now available on CRAN. Additionally, this project is not over; my advisor and I are continuing to work on extending the results that led to the creation of this package in the first place, which will call for more simulation experiments. Now that I’ve organized my work, I have a good base for continuing it.
I hope that this article gave others ideas on how to organize their R research projects. Gauging from reactions to my previous article, I think this is an underappreciated topic, unfortunately. Having a plan for managing package complexity and organization goes a long way to keeping your work under control and helps others appreciate what you’ve done. It also can lead to your work having a greater impact since others can use it as well.
I got a lot of good feedback from my previous article. I look forward to hearing what the community has to say now. I’m always open to suggestion.
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
Now here is a blog post that has been sitting on the shelf far longer than it should have. Over a year ago I wrote an article about problems I was having when estimating the parameters of a GARCH(1,1) model in R. I documented the behavior of parameter estimates (with a focus on ) and perceived pathological behavior when those estimates are computed using fGarch. I called for help from the R community, including sending out the blog post over the R Finance mailing list.
I was not disappointed in the feedback. You can see some mailing list feedback, and there were some comments on Reddit that were helpful, but I think the best feedback I got was through my own email.
Dr. Brian G. Peterson, a member of the R finance community, sent some thought-provoking emails. The first informed me that fGarch is no longer the go-to package for working with GARCH models. The RMetrics suite of packages (which includes fGarch) was maintained by Prof. Diethelm Würtz at ETH Zürich. He was killed in a car accident in 2016.
Dr. Peterson recommended I look into two more modern packages for GARCH modelling, rugarch (for univariate GARCH models) and rmgarch (for multivariate GARCH models). I had not heard of these packages before (the reason I was aware of fGarch was because it was referred to in the time series textbook Time Series Analysis and Its Applications with R Examples by Shumway and Stoffer), so I’m very thankful for the suggestion. Since I’m interested in univariate time series for now, I only looked at rugarch. The package appears to have more features and power than fGarch, which may explain why it seems more difficult to use. However, the package’s vignette is helpful and worth printing out.
Dr. Peterson also had interesting comments about my proposed applications. He argued that intraday data should be preferred to daily data and that simulated data (including simulated GARCH processes) has idiosyncrasies not seen in real data. The ease of getting daily data (particularly for USD/JPY around the time of the Asian financial crisis, which was an intended application of a test statistic I’m studying) motivated my interest in daily data. His comments, though, may lead me to reconsider this application.^{1} (I might try to detect the 2010 eurozone financial crisis via EUR/USD instead. I can get free intraday data from HistData.com for this.) However, if standard error estimates cannot be trusted for small sample sizes, our test statistic would still be in trouble since it involves estimating parameters even for small sample sizes.
He also warned that simulated data exhibits behaviors not seen in real data. That may be true, but simulated data is important since it can be considered a statistician’s best-case scenario. Additionally, the properties of the process that generated simulated data are known a priori, including the values of the generating parameters and whether certain hypotheses (such as whether there is a structural change in the series) are true. This allows for sanity checks of estimators and tests. This is impossible for real-world data since we don’t have the a priori knowledge needed.
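As a small illustration of that kind of sanity check, here is a minimal base R sketch that simulates a GARCH(1,1) process with known parameters and compares the sample variance to the variance implied by those parameters. The notation (omega, alpha, beta), the parameter values, and the helper name simulate_garch11() are assumptions made for this example; they are not the settings or code used in the simulation studies discussed in this post.

```r
# Simulate x_t = sigma_t * eps_t with Gaussian eps_t and
# sigma_t^2 = omega + alpha * x_{t-1}^2 + beta * sigma_{t-1}^2.
simulate_garch11 <- function(n, omega, alpha, beta, burn_in = 500) {
  stopifnot(omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1)
  total <- n + burn_in
  eps <- rnorm(total)
  x <- numeric(total)
  sigma2 <- numeric(total)
  sigma2[1] <- omega / (1 - alpha - beta)  # start at the stationary variance
  x[1] <- sqrt(sigma2[1]) * eps[1]
  for (i in 2:total) {
    sigma2[i] <- omega + alpha * x[i - 1]^2 + beta * sigma2[i - 1]
    x[i] <- sqrt(sigma2[i]) * eps[i]
  }
  x[(burn_in + 1):total]  # drop the burn-in observations
}

set.seed(20190128)
omega <- 0.2; alpha <- 0.2; beta <- 0.2  # illustrative values only
x <- simulate_garch11(10000, omega, alpha, beta)

# Because the generating parameters are known, the stationary variance
# omega / (1 - alpha - beta) is known too, and can be checked against the data.
theoretical_var <- omega / (1 - alpha - beta)
print(c(sample = var(x), theoretical = theoretical_var))
```

With real data no such comparison is possible, because the quantity on the right-hand side is exactly what is unknown.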
Prof. André Portela Santos asked that I repeat the simulations but with since these values are supposedly more common than my choice of . It’s a good suggestion and I will consider parameters in this range in addition to in this post. However, my simulations seemed to suggest that when , the estimation procedures nevertheless seem to want to be near the range of large . I’m also surprised since my advisor gave me the impression that GARCH processes with either or large are more difficult to work with. Finally, if the estimators are strongly biased, we might expect most estimated parameters to lie in that range, though that does not mean the “correct” values lie in that range. My simulations suggest fGarch struggles to discover even when those parameters are “true.” Prof. Santos’ comment leads me to desire a meta-study about what common estimates of GARCH parameters are on real-world data. (There may or may not be one; I haven’t checked. If anyone knows of one, please share.)
My advisor contacted another expert on GARCH models and got some feedback. Supposedly the standard error for $\beta$ is large, so there should be great variation in parameter estimates. Some of my simulations agreed with this behavior even for small sample sizes, but at the same time showed an uncomfortable bias towards $\beta = 0$ and $\beta = 1$. This might be a consequence of the optimization procedures, as I hypothesized.
So given this feedback, I will be conducting more simulation experiments. I won’t be looking at fGarch or tseries anymore; I will be working exclusively with rugarch. I will explore different optimization procedures supported by the package. I won’t be creating plots like I did in my first post; those plots were meant only to show the existence of a problem and its severity. Instead I will be looking at properties of the resulting estimators produced by different optimization procedures.
As mentioned above, rugarch is a package for working with GARCH models; a major use case is estimating their parameters, obviously. Here I will demonstrate how to specify a GARCH model, simulate data from the model, and estimate parameters. After this we can dive into simulation studies.
library(rugarch)
## Loading required package: parallel
##
## Attaching package: 'rugarch'
## The following object is masked from 'package:stats':
##
##     sigma
To work with a GARCH model we need to specify it. The function for doing this is ugarchspec(). I think the parameters variance.model and mean.model are the most important parameters.
variance.model is a list with named entries, perhaps the two most interesting being model and garchOrder. model is a string specifying which type of GARCH model is being fitted. Many major classes of GARCH models (such as EGARCH, IGARCH, etc.) are supported; for the “vanilla” GARCH model, set this to "sGARCH" (or just omit it; the standard model is the default). garchOrder is a vector for the order of the ARCH and GARCH components of the model.
mean.model allows for fitting ARMA-GARCH models, and functions like variance.model in that it accepts a list of named entries, the most interesting being armaOrder and include.mean. armaOrder is like garchOrder; it’s a vector specifying the order of the ARMA model. include.mean is a boolean that, if true, allows for the ARMA part of the model to have nonzero mean.
When simulating a process, we need to set the values of our parameters. This is done via the fixed.pars parameter, which accepts a list of named elements, the elements of the list being numeric. They need to fit the conventions the function uses for parameters; for example, if we want to set the parameters of a GARCH(1,1) model, the names of our list elements should be "alpha1" and "beta1". If the plan is to simulate a model, every parameter in the model should be set this way.
There are other parameters interesting in their own right, but I focus on these since the default specification is an ARMA-GARCH model with ARMA order of (1, 1) with nonzero mean and a GARCH model of order (1, 1). This is not a vanilla GARCH(1,1) model as I desire, so I almost always change this.
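For reference, the “vanilla” model these settings describe can be written out explicitly. This is the textbook zero-mean GARCH(1,1) definition, given as a reminder and not quoted from the rugarch documentation; the symbols $x_t$, $\epsilon_t$, and $\sigma_t$ are notation chosen for this note, and I am assuming that the names "omega", "alpha1", and "beta1" correspond to $\omega$, $\alpha$, and $\beta$.

```latex
\begin{aligned}
x_t &= \sigma_t \epsilon_t, \qquad \epsilon_t \overset{\text{i.i.d.}}{\sim} N(0, 1), \\
\sigma_t^2 &= \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2,
\qquad \omega > 0, \quad \alpha, \beta \geq 0, \quad \alpha + \beta < 1 .
\end{aligned}
```

Under the last condition the process has finite unconditional variance $\omega / (1 - \alpha - \beta)$.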
spec1 <- ugarchspec(mean.model = list(armaOrder = c(0, 0), include.mean = FALSE),
                    fixed.pars = list("omega" = 0.2, "alpha1" = 0.2,
                                      "beta1" = 0.2))
spec2 <- ugarchspec(mean.model = list(armaOrder = c(0, 0), include.mean = FALSE),
                    fixed.pars = list("omega" = 0.2, "alpha1" = 0.1,
                                      "beta1" = 0.7))
show(spec1)
##
## *---------------------------------*
## *       GARCH Model Spec          *
## *---------------------------------*
##
## Conditional Variance Dynamics
## ------------------------------------
## GARCH Model          : sGARCH(1,1)
## Variance Targeting   : FALSE
##
## Conditional Mean Dynamics
## ------------------------------------
## Mean Model           : ARFIMA(0,0,0)
## Include Mean         : FALSE
## GARCH-in-Mean        : FALSE
##
## Conditional Distribution
## ------------------------------------
## Distribution         : norm
## Includes Skew        : FALSE
## Includes Shape       : FALSE
## Includes Lambda      : FALSE
show(spec2)
##
## *---------------------------------*
## *       GARCH Model Spec          *
## *---------------------------------*
##
## Conditional Variance Dynamics
## ------------------------------------
## GARCH Model          : sGARCH(1,1)
## Variance Targeting   : FALSE
##
## Conditional Mean Dynamics
## ------------------------------------
## Mean Model           : ARFIMA(0,0,0)
## Include Mean         : FALSE
## GARCH-in-Mean        : FALSE
##
## Conditional Distribution
## ------------------------------------
## Distribution         : norm
## Includes Skew        : FALSE
## Includes Shape       : FALSE
## Includes Lambda      : FALSE
The function ugarchpath() simulates GARCH models specified via ugarchspec(). The function needs a specification object created by ugarchspec() first. The parameters n.sim and n.start specify the size of the process and the length of the burn-in period, respectively (with defaults 1000 and 0, respectively; I strongly recommend setting the burn-in period to at least 500, but I go for 1000). The function creates an object that contains not only the simulated series but also residuals and $\sigma_t$.
The rseed parameter controls the random seed the function uses for generating data. Be warned that set.seed() is effectively ignored by this function, so if you want consistent results, you will need to set this parameter.
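To make the roles of the series length and the burn-in period concrete, here is a minimal base R sketch of simulating a zero-mean GARCH(1,1) series by hand. The function simulate_garch11() is a hypothetical helper written for this illustration; it is not part of rugarch, it assumes Gaussian innovations, and it will not reproduce the series ugarchpath() generates for the same seed.

```r
# Hypothetical helper (not part of rugarch): simulate a zero-mean GARCH(1,1)
# series with Gaussian innovations, discarding a burn-in period.
simulate_garch11 <- function(n, omega, alpha, beta, burnin = 1000,
                             seed = NULL) {
  stopifnot(omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1)
  if (!is.null(seed)) {
    set.seed(seed)  # base R seeding; ugarchpath() uses its rseed argument
  }
  total <- n + burnin
  eps <- rnorm(total)
  x <- numeric(total)
  sigma2 <- numeric(total)
  # Start the variance recursion at the unconditional variance
  sigma2[1] <- omega / (1 - alpha - beta)
  x[1] <- sqrt(sigma2[1]) * eps[1]
  for (t in 2:total) {
    sigma2[t] <- omega + alpha * x[t - 1]^2 + beta * sigma2[t - 1]
    x[t] <- sqrt(sigma2[t]) * eps[t]
  }
  # Drop the burn-in so the starting value has little influence on the output
  keep <- (burnin + 1):total
  list(series = x[keep], sigma = sqrt(sigma2[keep]))
}

sim <- simulate_garch11(1000, omega = 0.2, alpha = 0.2, beta = 0.2,
                        seed = 111217)
length(sim$series)
```

The burn-in matters because the recursion has to start somewhere; discarding the early observations reduces the dependence on that starting value.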
The plot() method accompanying these objects is not completely transparent; there are a few plots it could create and when calling plot() on a uGARCHpath object in the command line users are prompted to input a number corresponding to the desired plot. This is a pain sometimes so don’t forget to pass the desired plot’s number to the which parameter to avoid the prompt; setting which = 2 will give the plot of the series proper.
old_par <- par()  # par(no.readonly = TRUE) would avoid the warnings on restore
par(mfrow = c(2, 2))

x_obj <- ugarchpath(spec1, n.sim = 1000, n.start = 1000, rseed = 111217)
show(x_obj)
##
## *------------------------------------*
## *       GARCH Model Path Simulation  *
## *------------------------------------*
## Model: sGARCH
## Horizon: 1000
## Simulations: 1
##                 Seed Sigma2.Mean Sigma2.Min Sigma2.Max Series.Mean
## sim1          111217       0.332      0.251      0.915    0.000165
## Mean(All)          0       0.332      0.251      0.915    0.000165
## Unconditional     NA       0.333         NA         NA    0.000000
##               Series.Min Series.Max
## sim1               -1.76       1.62
## Mean(All)          -1.76       1.62
## Unconditional         NA         NA
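The “Unconditional” row in this output can be checked by hand: for a stationary GARCH(1,1) process the unconditional variance is $\omega / (1 - \alpha - \beta)$. A small sketch, where garch11_uncond_var() is a hypothetical helper name used only here:

```r
# Hypothetical helper: unconditional variance of a stationary GARCH(1,1) process
garch11_uncond_var <- function(omega, alpha, beta) {
  stopifnot(alpha + beta < 1)  # needed for a finite unconditional variance
  omega / (1 - alpha - beta)
}

v1 <- garch11_uncond_var(0.2, 0.2, 0.2)  # parameters in spec1: 0.2 / 0.6
v2 <- garch11_uncond_var(0.2, 0.1, 0.7)  # parameters in spec2: 0.2 / 0.2
round(c(model1 = v1, model2 = v2), 3)
```

The first value rounds to 0.333, which agrees with the Sigma2.Mean entry reported for the unconditional row.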
for (i in 1:4) {
  plot(x_obj, which = i)
}
par(old_par)
## Warning in par(old_par): graphical parameter "cin" cannot be set
## Warning in par(old_par): graphical parameter "cra" cannot be set
## Warning in par(old_par): graphical parameter "csi" cannot be set
## Warning in par(old_par): graphical parameter "cxy" cannot be set
## Warning in par(old_par): graphical parameter "din" cannot be set
## Warning in par(old_par): graphical parameter "page" cannot be set
# The actual series
x1 <- x_obj@path$seriesSim
plot.ts(x1)
The ugarchfit() function fits GARCH models. The function needs a specification and a dataset. The solver parameter accepts a string stating which numerical optimizer to use to find the parameter estimates. Most of the parameters of the function manage interfacing with the numerical optimizer. In particular, solver.control can be given a list of arguments to pass to the optimizer. We will be looking at this in more detail later.
The specification used for generating the simulated data won’t be appropriate for ugarchfit(), since it contains fixed values for its parameters. In my case I will need to create a second specification object.
spec <- ugarchspec(mean.model = list(armaOrder = c(0, 0), include.mean = FALSE))
fit <- ugarchfit(spec, data = x1)
show(fit)
##
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(0,0,0)
## Distribution : norm
##
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error     t value Pr(>|t|)
## omega   0.000713    0.001258     0.56696  0.57074
## alpha1  0.002905    0.003714     0.78206  0.43418
## beta1   0.994744    0.000357  2786.08631  0.00000
##
## Robust Standard Errors:
##         Estimate  Std. Error     t value Pr(>|t|)
## omega   0.000713    0.001217     0.58597  0.55789
## alpha1  0.002905    0.003661     0.79330  0.42760
## beta1   0.994744    0.000137  7250.45186  0.00000
##
## LogLikelihood : -860.486
##
## Information Criteria
## ------------------------------------
##
## Akaike       1.7270
## Bayes        1.7417
## Shibata      1.7270
## Hannan-Quinn 1.7326
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      3.998 0.04555
## Lag[2*(p+q)+(p+q)-1][2]     4.507 0.05511
## Lag[4*(p+q)+(p+q)-1][5]     9.108 0.01555
## d.o.f=0
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic   p-value
## Lag[1]                      29.12 6.786e-08
## Lag[2*(p+q)+(p+q)-1][5]     31.03 1.621e-08
## Lag[4*(p+q)+(p+q)-1][9]     32.26 1.044e-07
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.422 0.500 2.000  0.2331
## ARCH Lag[5]     2.407 1.440 1.667  0.3882
## ARCH Lag[7]     2.627 2.315 1.543  0.5865
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  0.9518
## Individual Statistics:
## omega  0.3296
## alpha1 0.2880
## beta1  0.3195
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          0.846 1.01 1.35
## Individual Statistic:     0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
##                    t-value      prob sig
## Sign Bias           0.3946 6.933e-01
## Negative Sign Bias  3.2332 1.264e-03 ***
## Positive Sign Bias  4.2142 2.734e-05 ***
## Joint Effect       28.2986 3.144e-06 ***
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     20.28       0.3779
## 2    30     26.54       0.5965
## 3    40     36.56       0.5817
## 4    50     47.10       0.5505
##
##
## Elapsed time : 2.60606
par(mfrow = c(3, 4))
for (i in 1:12) {
  plot(fit, which = i)
}
##
## please wait...calculating quantiles...
par(old_par)
## Warning in par(old_par): graphical parameter "cin" cannot be set
## Warning in par(old_par): graphical parameter "cra" cannot be set
## Warning in par(old_par): graphical parameter "csi" cannot be set
## Warning in par(old_par): graphical parameter "cxy" cannot be set
## Warning in par(old_par): graphical parameter "din" cannot be set
## Warning in par(old_par): graphical parameter "page" cannot be set
Notice the estimated parameters and standard errors? The estimates are nowhere near the “correct” numbers even for a sample size of 1000, and there is no way a reasonable confidence interval based on the estimated standard errors would contain the correct values. It looks like the problems I documented in my last post have not gone away.
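Two of the numbers in the model 1 output can be checked with simple arithmetic. The sketch below assumes the information criteria are reported per observation (divided by the sample size); under that assumption the reported Akaike and Bayes values are reproduced only if the log-likelihood is negative, that is, -860.486. It also builds a rough interval of estimate plus or minus two standard errors for beta1, where the multiplier 2 is my choice for an approximate 95% interval.

```r
llh <- -860.486  # log-likelihood from show(fit), taken as negative
k <- 3           # number of estimated parameters: omega, alpha1, beta1
n <- 1000        # sample size

# Per-observation information criteria (assumed normalization)
aic <- (-2 * llh + 2 * k) / n
bic <- (-2 * llh + k * log(n)) / n
round(c(Akaike = aic, Bayes = bic), 4)

# Rough 95% interval for beta1 from the reported estimate and standard error
beta_ci <- 0.994744 + c(-2, 2) * 0.000357
beta_ci
truth_inside <- (0.2 >= beta_ci[1]) && (0.2 <= beta_ci[2])
truth_inside
```

The criteria round to 1.7270 and 1.7417, matching the output, and the interval for beta1 comes nowhere near the true value of 0.2.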
Out of curiosity, what would happen with the other specification, one in the range Prof. Santos suggested?
x_obj <- ugarchpath(spec2, n.start = 1000, rseed = 111317)
x2 <- x_obj@path$seriesSim
fit <- ugarchfit(spec, x2)
show(fit)
##
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(0,0,0)
## Distribution : norm
##
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error     t value Pr(>|t|)
## omega   0.001076    0.002501     0.43025  0.66701
## alpha1  0.001992    0.002948     0.67573  0.49921
## beta1   0.997008    0.000472  2112.23364  0.00000
##
## Robust Standard Errors:
##         Estimate  Std. Error     t value Pr(>|t|)
## omega   0.001076    0.002957     0.36389  0.71594
## alpha1  0.001992    0.003510     0.56767  0.57026
## beta1   0.997008    0.000359  2777.24390  0.00000
##
## LogLikelihood : -1375.951
##
## Information Criteria
## ------------------------------------
##
## Akaike       2.7579
## Bayes        2.7726
## Shibata      2.7579
## Hannan-Quinn 2.7635
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                     0.9901  0.3197
## Lag[2*(p+q)+(p+q)-1][2]    1.0274  0.4894
## Lag[4*(p+q)+(p+q)-1][5]    3.4159  0.3363
## d.o.f=0
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      3.768 0.05226
## Lag[2*(p+q)+(p+q)-1][5]     4.986 0.15424
## Lag[4*(p+q)+(p+q)-1][9]     7.473 0.16272
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]    0.2232 0.500 2.000  0.6366
## ARCH Lag[5]    0.4793 1.440 1.667  0.8897
## ARCH Lag[7]    2.2303 2.315 1.543  0.6686
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  0.3868
## Individual Statistics:
## omega  0.2682
## alpha1 0.2683
## beta1  0.2669
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          0.846 1.01 1.35
## Individual Statistic:     0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
##                    t-value   prob sig
## Sign Bias           0.5793 0.5625
## Negative Sign Bias  1.3358 0.1819
## Positive Sign Bias  1.5552 0.1202
## Joint Effect        5.3837 0.1458
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     24.24       0.1871
## 2    30     30.50       0.3894
## 3    40     38.88       0.4753
## 4    50     48.40       0.4974
##
##
## Elapsed time : 2.841597
That’s no better. Now let’s see what happens when we use different optimization routines.
ugarchfit()’s default parameters did a good job of finding appropriate parameters for what I will refer to as model 2 (where $\alpha = 0.1$ and $\beta = 0.7$) but not for model 1 ($\alpha = \beta = 0.2$). What I want to know is when one solver seems to beat another.
As pointed out by Vivek Rao^{2} on the R-SIG-Finance mailing list, the “best” estimate is the estimate that maximizes the likelihood function (or, equivalently, the log-likelihood function), and I omitted inspecting the log-likelihood function’s values in my last post. Here I will see which optimization procedures lead to the maximum log-likelihood.
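To make the criterion concrete, here is a minimal base R sketch of the Gaussian log-likelihood of a zero-mean GARCH(1,1) model, evaluated at two candidate parameter vectors. The function garch11_loglik() is a hypothetical helper, and starting the variance recursion at the sample variance is an assumption of this sketch that may differ from what rugarch does internally, so the values will not match rugarch’s exactly. The data are placeholder noise, used only to show the mechanics.

```r
# Hypothetical helper: Gaussian log-likelihood of a zero-mean GARCH(1,1) model.
# par is a named vector with entries "omega", "alpha1", and "beta1".
garch11_loglik <- function(par, x) {
  omega <- par[["omega"]]
  alpha <- par[["alpha1"]]
  beta <- par[["beta1"]]
  n <- length(x)
  sigma2 <- numeric(n)
  sigma2[1] <- var(x)  # initialization choice made for this sketch
  for (t in 2:n) {
    sigma2[t] <- omega + alpha * x[t - 1]^2 + beta * sigma2[t - 1]
  }
  sum(dnorm(x, mean = 0, sd = sqrt(sigma2), log = TRUE))
}

set.seed(1)
x <- rnorm(500, sd = sqrt(1 / 3))  # placeholder data, not a GARCH series

candidates <- list(
  a = c(omega = 0.2, alpha1 = 0.2, beta1 = 0.2),
  b = c(omega = 0.001, alpha1 = 0.003, beta1 = 0.995)
)
llh <- sapply(candidates, garch11_loglik, x = x)
llh
names(llh)[which.max(llh)]  # candidate preferred by the likelihood criterion
```

Whichever candidate attains the larger value is the “better” estimate in the maximum likelihood sense, regardless of how plausible its parameters look.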
Below is a helper function that simplifies the process of fitting a GARCH model’s parameters and extracting the log-likelihood, parameter values, and standard errors while allowing for different values to be passed to solver and solver.control.
evalSolverFit <- function(spec, data, solver = "solnp",
                          solver.control = list()) {
  # Calls ugarchfit(spec, data, solver, solver.control), and returns a vector
  # containing the log likelihood, parameters, and parameter standard errors.
  # Parameters are equivalent to those seen in ugarchfit(). If the solver fails
  # to converge, NA will be returned
  vec <- NA
  tryCatch({
    fit <- ugarchfit(spec = spec, data = data, solver = solver,
                     solver.control = solver.control)
    coef_se_names <- paste("se", names(fit@fit$coef), sep = ".")
    se <- fit@fit$se.coef
    names(se) <- coef_se_names
    robust_coef_se_names <- paste("robust.se", names(fit@fit$coef), sep = ".")
    robust.se <- fit@fit$robust.se.coef
    names(robust.se) <- robust_coef_se_names
    vec <- c(fit@fit$coef, se, robust.se)
    vec["LLH"] <- fit@fit$LLH
  }, error = function(w) { NA })
  return(vec)
}
Below I list out all optimization schemes I will consider. I only fiddle with solver.control, but there may be other parameters that could help the numerical optimization routines, namely numderiv.control, which holds control arguments passed to the numerical routines responsible for standard error computation. This utilizes the package numDeriv, which performs numerical differentiation.
solvers <- list(
  # A list of lists where each sublist contains parameters to
  # pass to a solver
  list("solver" = "nlminb", "solver.control" = list()),
  list("solver" = "solnp", "solver.control" = list()),
  list("solver" = "lbfgs", "solver.control" = list()),
  list("solver" = "gosolnp", "solver.control" = list(
    "n.restarts" = 100, "n.sim" = 100
  )),
  list("solver" = "hybrid", "solver.control" = list()),
  list("solver" = "nloptr", "solver.control" = list("solver" = 1)),  # COBYLA
  list("solver" = "nloptr", "solver.control" = list("solver" = 2)),  # BOBYQA
  list("solver" = "nloptr", "solver.control" = list("solver" = 3)),  # PRAXIS
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 4)),  # NELDERMEAD
  list("solver" = "nloptr", "solver.control" = list("solver" = 5)),  # SBPLX
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 6)),  # AUGLAG+COBYLA
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 7)),  # AUGLAG+BOBYQA
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 8)),  # AUGLAG+PRAXIS
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 9)),  # AUGLAG+NELDERMEAD
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 10))  # AUGLAG+SBPLX
)

tags <- c(
  # Names for the above list
  "nlminb",
  "solnp",
  "lbfgs",
  "gosolnp",
  "hybrid",
  "nloptr+COBYLA",
  "nloptr+BOBYQA",
  "nloptr+PRAXIS",
  "nloptr+NELDERMEAD",
  "nloptr+SBPLX",
  "nloptr+AUGLAG+COBYLA",
  "nloptr+AUGLAG+BOBYQA",
  "nloptr+AUGLAG+PRAXIS",
  "nloptr+AUGLAG+NELDERMEAD",
  "nloptr+AUGLAG+SBPLX"
)

names(solvers) <- tags
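The same kind of comparison can be sketched without rugarch, using base R’s optim() on a hand-written likelihood. This is only an illustration of running several optimizers on one objective and collecting their answers; the penalty for invalid parameters, the starting values, and the choice of methods are all assumptions of the sketch, and none of it reflects how rugarch sets up its own optimization.

```r
set.seed(2019)

# Simulate a GARCH(1,1) series with omega = alpha = beta = 0.2 (base R only)
n <- 1000
burnin <- 500
eps <- rnorm(n + burnin)
x <- numeric(n + burnin)
s2 <- 0.2 / (1 - 0.2 - 0.2)
for (t in seq_along(x)) {
  x[t] <- sqrt(s2) * eps[t]
  s2 <- 0.2 + 0.2 * x[t]^2 + 0.2 * s2
}
x <- x[-seq_len(burnin)]

# Negative Gaussian log-likelihood; invalid parameters get a large penalty so
# that unconstrained optimizers are pushed back toward the feasible region
neg_loglik <- function(par, x) {
  if (any(par < 0) || par[1] <= 0 || par[2] + par[3] >= 1) {
    return(1e10)
  }
  s2 <- numeric(length(x))
  s2[1] <- var(x)
  for (t in 2:length(x)) {
    s2[t] <- par[1] + par[2] * x[t - 1]^2 + par[3] * s2[t - 1]
  }
  -sum(dnorm(x, sd = sqrt(s2), log = TRUE))
}

start <- c(omega = 0.1, alpha1 = 0.1, beta1 = 0.5)
methods <- c("Nelder-Mead", "BFGS", "CG")

# One row per optimizer: parameter estimates and the attained log-likelihood
results <- t(sapply(methods, function(m) {
  res <- tryCatch(optim(start, neg_loglik, x = x, method = m),
                  error = function(e) NULL)
  if (is.null(res)) {
    return(c(omega = NA, alpha1 = NA, beta1 = NA, LLH = NA))
  }
  c(setNames(res$par, names(start)), LLH = -res$value)
}))
round(results, 4)
```

Different rows can disagree noticeably even though every method is optimizing the same function from the same starting point, which is the phenomenon the rest of this post examines with rugarch’s solvers.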
Now let’s run the gauntlet of optimization choices and see which produces the estimates with the largest log likelihood for data generated by model 1. The lbfgs method (low-storage version of the Broyden-Fletcher-Goldfarb-Shanno method, provided in nloptr) unfortunately does not converge for this series, so I omit it.
optMethodCompare <- function(data, spec, solvers) {
  # Runs all solvers in a list for a dataset
  #
  # Args:
  #   data: An object to pass to ugarchfit's data parameter containing the data
  #         to fit
  #   spec: A specification created by ugarchspec to pass to ugarchfit
  #   solvers: A list of lists containing strings of solvers and a list for
  #            solver.control
  #
  # Return:
  #   A matrix containing the result of the solvers (including parameters, se's,
  #   and LLH)
  model_solutions <- lapply(solvers, function(s) {
    args <- s
    args[["spec"]] <- spec
    args[["data"]] <- data
    res <- do.call(evalSolverFit, args = args)
    return(res)
  })
  model_solutions <- do.call(rbind, model_solutions)
  return(model_solutions)
}

round(optMethodCompare(x1, spec, solvers[c(1:2, 4:15)]), digits = 4)
##                            omega  alpha1   beta1  se.omega  se.alpha1  se.beta1  robust.se.omega  robust.se.alpha1  robust.se.beta1        LLH
## ------------------------  ------  ------  ------  --------  ---------  --------  ---------------  ----------------  ---------------  ---------
## nlminb                    0.2689  0.1774  0.0000    0.0787     0.0472    0.2447           0.0890            0.0352           0.2830  -849.6927
## solnp                     0.0007  0.0029  0.9947    0.0013     0.0037    0.0004           0.0012            0.0037           0.0001  -860.4860
## gosolnp                   0.2689  0.1774  0.0000    0.0787     0.0472    0.2446           0.0890            0.0352           0.2828  -849.6927
## hybrid                    0.0007  0.0029  0.9947    0.0013     0.0037    0.0004           0.0012            0.0037           0.0001  -860.4860
## nloptr+COBYLA             0.0006  0.0899  0.9101    0.0039     0.0306    0.0370           0.0052            0.0527           0.0677  -871.5006
## nloptr+BOBYQA             0.0003  0.0907  0.9093    0.0040     0.0298    0.0375           0.0057            0.0532           0.0718  -872.3436
## nloptr+PRAXIS             0.2689  0.1774  0.0000    0.0786     0.0472    0.2444           0.0888            0.0352           0.2823  -849.6927
## nloptr+NELDERMEAD         0.0010  0.0033  0.9935    0.0013     0.0039    0.0004           0.0013            0.0038           0.0001  -860.4845
## nloptr+SBPLX              0.0010  0.1000  0.9000    0.0042     0.0324    0.0386           0.0055            0.0536           0.0680  -872.2736
## nloptr+AUGLAG+COBYLA      0.0006  0.0899  0.9101    0.0039     0.0306    0.0370           0.0052            0.0527           0.0677  -871.5006
## nloptr+AUGLAG+BOBYQA      0.0003  0.0907  0.9093    0.0040     0.0298    0.0375           0.0057            0.0532           0.0718  -872.3412
## nloptr+AUGLAG+PRAXIS      0.1246  0.1232  0.4948    0.0620     0.0475    0.2225           0.0701            0.0439           0.2508  -851.0547
## nloptr+AUGLAG+NELDERMEAD  0.2689  0.1774  0.0000    0.0786     0.0472    0.2445           0.0889            0.0352           0.2826  -849.6927
## nloptr+AUGLAG+SBPLX       0.0010  0.1000  0.9000    0.0042     0.0324    0.0386           0.0055            0.0536           0.0680  -872.2736
According to the maximum likelihood criterion, the “best” result is achieved by gosolnp. The result has the unfortunate property that $\hat{\beta} = 0$, which is certainly not true, but at least the standard error for $\beta$ would create a confidence interval that contains $\beta$’s true value. Of these, my preferred estimates are produced by AUGLAG+PRAXIS, as $\hat{\beta}$ seems reasonable and in fact the estimates are all close to the truth (at least in the sense that the confidence intervals contain the true values), but unfortunately the estimates do not maximize the log likelihood, even though they are the most reasonable.
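As a quick check of these interval claims, take the rough 95% rule of estimate plus or minus two standard errors (the multiplier 2 is an assumption of this check, and the non-robust standard errors from the table are used). Every true parameter value for model 1 is 0.2:

```latex
\begin{aligned}
\text{gosolnp:} \quad
  & \hat{\beta} \pm 2\,\mathrm{se}(\hat{\beta}) = 0.0000 \pm 2(0.2446) = [-0.4892,\ 0.4892] \ni 0.2, \\
\text{AUGLAG+PRAXIS:} \quad
  & \hat{\omega} \pm 2\,\mathrm{se}(\hat{\omega}) = 0.1246 \pm 2(0.0620) = [0.0006,\ 0.2486] \ni 0.2, \\
  & \hat{\alpha} \pm 2\,\mathrm{se}(\hat{\alpha}) = 0.1232 \pm 2(0.0475) = [0.0282,\ 0.2182] \ni 0.2, \\
  & \hat{\beta} \pm 2\,\mathrm{se}(\hat{\beta}) = 0.4948 \pm 2(0.2225) = [0.0498,\ 0.9398] \ni 0.2 .
\end{aligned}
```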
If we looked at model 2, what do we see? Again, lbfgs does not converge so I omit it. Unfortunately, nlminb does not converge either, so it too must be omitted.
round(optMethodCompare(x2, spec, solvers[c(2, 4:15)]), digits = 4)
##                            omega  alpha1   beta1  se.omega  se.alpha1  se.beta1  robust.se.omega  robust.se.alpha1  robust.se.beta1        LLH
## ------------------------  ------  ------  ------  --------  ---------  --------  ---------------  ----------------  ---------------  ---------
## solnp                     0.0011  0.0020  0.9970    0.0025     0.0029    0.0005           0.0030            0.0035           0.0004  -1375.951
## gosolnp                   0.0011  0.0020  0.9970    0.0025     0.0029    0.0005           0.0030            0.0035           0.0004  -1375.951
## hybrid                    0.0011  0.0020  0.9970    0.0025     0.0029    0.0005           0.0030            0.0035           0.0004  -1375.951
## nloptr+COBYLA             0.0016  0.0888  0.9112    0.0175     0.0619    0.0790           0.0540            0.2167           0.2834  -1394.529
## nloptr+BOBYQA             0.0010  0.0892  0.9108    0.0194     0.0659    0.0874           0.0710            0.2631           0.3572  -1395.310
## nloptr+PRAXIS             0.5018  0.0739  0.3803    0.3178     0.0401    0.3637           0.2777            0.0341           0.3225  -1373.632
## nloptr+NELDERMEAD         0.0028  0.0026  0.9944    0.0028     0.0031    0.0004           0.0031            0.0035           0.0001  -1375.976
## nloptr+SBPLX              0.0029  0.1000  0.9000    0.0146     0.0475    0.0577           0.0275            0.1108           0.1408  -1395.807
## nloptr+AUGLAG+COBYLA      0.0016  0.0888  0.9112    0.0175     0.0619    0.0790           0.0540            0.2167           0.2834  -1394.529
## nloptr+AUGLAG+BOBYQA      0.0010  0.0892  0.9108    0.0194     0.0659    0.0874           0.0710            0.2631           0.3572  -1395.310
## nloptr+AUGLAG+PRAXIS      0.5018  0.0739  0.3803    0.3178     0.0401    0.3637           0.2777            0.0341           0.3225  -1373.632
## nloptr+AUGLAG+NELDERMEAD  0.0001  0.0000  1.0000    0.0003     0.0003    0.0000           0.0004            0.0004           0.0000  -1375.885
## nloptr+AUGLAG+SBPLX       0.0029  0.1000  0.9000    0.0146     0.0475    0.0577           0.0275            0.1108           0.1408  -1395.807
Here it was PRAXIS and AUGLAG+PRAXIS that gave the “optimal” result, and it was only those two methods that did. Other optimizers gave visibly bad results. That said, the “optimal” solution is the preferred one, with the parameters being nonzero and their confidence intervals containing the correct values.
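Picking out the winners is a matter of finding every solver whose log-likelihood equals the maximum, ties included. A small sketch using a subset of the model 2 values, taking the log-likelihoods as negative numbers (which is what the information criteria reported earlier imply):

```r
# Log-likelihoods for a few solvers from the model 2 comparison
llh <- c("solnp" = -1375.951,
         "nloptr+COBYLA" = -1394.529,
         "nloptr+PRAXIS" = -1373.632,
         "nloptr+NELDERMEAD" = -1375.976,
         "nloptr+AUGLAG+PRAXIS" = -1373.632)

# Keep every solver that attains the maximum, ignoring failed (NA) fits
best <- names(llh)[!is.na(llh) & llh == max(llh, na.rm = TRUE)]
best
```

This returns both PRAXIS variants. Note that the comparison is on values rounded to a few digits, so solvers that differ only beyond that precision count as tied.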
What happens if we restrict the sample to size 100? (lbfgs still does not work.)
round(optMethodCompare(x1[1:100], spec, solvers[c(1:2, 4:15)]), digits = 4)
##                            omega  alpha1   beta1  se.omega  se.alpha1  se.beta1  robust.se.omega  robust.se.alpha1  robust.se.beta1        LLH
## ------------------------  ------  ------  ------  --------  ---------  --------  ---------------  ----------------  ---------------  ---------
## nlminb                    0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## solnp                     0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## gosolnp                   0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## hybrid                    0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## nloptr+COBYLA             0.0007  0.1202  0.8798    0.0085     0.0999    0.0983           0.0081            0.1875           0.1778   -85.3121
## nloptr+BOBYQA             0.0005  0.1190  0.8810    0.0085     0.0994    0.0992           0.0084            0.1892           0.1831   -85.3717
## nloptr+PRAXIS             0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## nloptr+NELDERMEAD         0.0451  0.2742  0.5920    0.0281     0.1230    0.1297           0.0191            0.0906           0.0667   -80.6587
## nloptr+SBPLX              0.0433  0.2740  0.5998    0.0269     0.1237    0.1268           0.0182            0.0916           0.0648   -80.6616
## nloptr+AUGLAG+COBYLA      0.0007  0.1202  0.8798    0.0085     0.0999    0.0983           0.0081            0.1875           0.1778   -85.3121
## nloptr+AUGLAG+BOBYQA      0.0005  0.1190  0.8810    0.0085     0.0994    0.0992           0.0084            0.1892           0.1831   -85.3717
## nloptr+AUGLAG+PRAXIS      0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## nloptr+AUGLAG+NELDERMEAD  0.0451  0.2742  0.5921    0.0280     0.1229    0.1296           0.0191            0.0905           0.0667   -80.6587
## nloptr+AUGLAG+SBPLX       0.0450  0.2742  0.5924    0.0280     0.1230    0.1295           0.0191            0.0906           0.0666   -80.6587
round(optMethodCompare(x2[1:100], spec, solvers[c(1:2, 4:15)]), digits = 4)
##                            omega  alpha1   beta1  se.omega  se.alpha1  se.beta1  robust.se.omega  robust.se.alpha1  robust.se.beta1        LLH
## ------------------------  ------  ------  ------  --------  ---------  --------  ---------------  ----------------  ---------------  ---------
## nlminb                    0.7592  0.0850  0.0000    2.1366     0.4813    3.0945           7.5439            1.7763          11.0570  -132.4614
## solnp                     0.0008  0.0000  0.9990    0.0291     0.0417    0.0066           0.0232            0.0328           0.0034  -132.9182
## gosolnp                   0.0537  0.0000  0.9369    0.0521     0.0087    0.0713           0.0430            0.0012           0.0529  -132.9124
## hybrid                    0.0008  0.0000  0.9990    0.0291     0.0417    0.0066           0.0232            0.0328           0.0034  -132.9182
## nloptr+COBYLA             0.0014  0.0899  0.9101    0.0259     0.0330    0.1192           0.0709            0.0943           0.1344  -135.7495
## nloptr+BOBYQA             0.0008  0.0905  0.9095    0.0220     0.0051    0.1145           0.0687            0.0907           0.1261  -135.8228
## nloptr+PRAXIS             0.0602  0.0000  0.9293    0.0522     0.0088    0.0773           0.0462            0.0015           0.0565  -132.9125
## nloptr+NELDERMEAD         0.0024  0.0000  0.9971    0.0473     0.0629    0.0116           0.0499            0.0680           0.0066  -132.9186
## nloptr+SBPLX              0.0027  0.1000  0.9000    0.0238     0.0493    0.1308           0.0769            0.1049           0.1535  -135.9175
## nloptr+AUGLAG+COBYLA      0.0014  0.0899  0.9101    0.0259     0.0330    0.1192           0.0709            0.0943           0.1344  -135.7495
## nloptr+AUGLAG+BOBYQA      0.0008  0.0905  0.9095    0.0221     0.0053    0.1145           0.0687            0.0907           0.1262  -135.8226
## nloptr+AUGLAG+PRAXIS      0.0602  0.0000  0.9294    0.0523     0.0090    0.0771           0.0462            0.0014           0.0565  -132.9125
## nloptr+AUGLAG+NELDERMEAD  0.0000  0.0000  0.9999    0.0027     0.0006    0.0005           0.0013            0.0004           0.0003  -132.9180
## nloptr+AUGLAG+SBPLX       0.0027  0.1000  0.9000    0.0238     0.0493    0.1308           0.0769            0.1049           0.1535  -135.9175
The results are not thrilling. The “best” result for the series generated by model 1 was attained by multiple solvers, and the 95% confidence interval (CI) for one of the parameters would not contain that parameter’s true value, though the CIs for the other parameters would contain their true values. For the series generated by model 2 the best result was attained by the nlminb solver; the parameter values are not plausible and the standard errors are huge. At least the CI would contain the correct value.
From here we should no longer stick to two series but see the performance of these methods on many simulated series generated by both models. Simulations in this post will be too computationally intensive for my laptop so I will use my department’s supercomputer to perform them, taking advantage of its many cores for parallelization.
library(foreach)
library(doParallel)

logfile <- ""
# logfile <- "outfile.log"
# if (!file.exists(logfile)) {
#   file.create(logfile)
# }

cl <- makeCluster(detectCores() - 1, outfile = logfile)
registerDoParallel(cl)

optMethodSims <- function(gen_spec, n.sim = 1000, m.sim = 1000,
                          fit_spec = ugarchspec(mean.model = list(
                            armaOrder = c(0, 0), include.mean = FALSE)),
                          solvers = list("solnp" = list(
                            "solver" = "solnp", "solver.control" = list())),
                          rseed = NA, verbose = FALSE) {
  # Performs simulations in parallel of GARCH processes via rugarch and returns
  # a list with the results of different optimization routines
  #
  # Args:
  #   gen_spec: The specification for generating a GARCH sequence, produced by
  #             ugarchspec
  #   n.sim: An integer denoting the length of the simulated series
  #   m.sim: An integer for the number of simulated sequences to generate
  #   fit_spec: A ugarchspec specification for the model to fit
  #   solvers: A list of lists containing strings of solvers and a list for
  #            solver.control
  #   rseed: Optional seeding value(s) for the random number generator. For
  #          m.sim>1, it is possible to provide either a single seed to
  #          initialize all values, or one seed per separate simulation (i.e.
  #          m.sim seeds). However, in the latter case this may result in some
  #          slight overhead depending on how large m.sim is
  #   verbose: Boolean for whether to write data tracking the progress of the
  #            loop into an output file (the outfile given to makeCluster)
  #
  # Return:
  #   A list containing the result of calling optMethodCompare on each generated
  #   sequence
  fits <- foreach(i = 1:m.sim,
                  .packages = c("rugarch"),
                  .export = c("optMethodCompare", "evalSolverFit")) %dopar% {
    # A scalar is also a vector in R, so distinguish the cases by length
    if (all(is.na(rseed))) {
      newseed <- NA
    } else if (length(rseed) > 1) {
      newseed <- rseed[i]
    } else {
      newseed <- rseed + i - 1
    }
    if (verbose) {
      cat(as.character(Sys.time()), ": Now on simulation ", i, "\n")
    }
    sim <- ugarchpath(gen_spec, n.sim = n.sim, n.start = 1000, m.sim = 1,
                      rseed = newseed)
    x <- sim@path$seriesSim
    optMethodCompare(x, spec = fit_spec, solvers = solvers)
  }
  return(fits)
}

# Specification 1 first
spec1_n100 <- optMethodSims(spec1, n.sim = 100, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec1_n500 <- optMethodSims(spec1, n.sim = 500, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec1_n1000 <- optMethodSims(spec1, n.sim = 1000, m.sim = 1000,
                             solvers = solvers, verbose = TRUE)

# Specification 2 next
spec2_n100 <- optMethodSims(spec2, n.sim = 100, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec2_n500 <- optMethodSims(spec2, n.sim = 500, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec2_n1000 <- optMethodSims(spec2, n.sim = 1000, m.sim = 1000,
                             solvers = solvers, verbose = TRUE)
Below is a set of helper functions I will use for the analytics I want.
optMethodSims_getAllVals <- function(param, solver, reslist) {
  # Get all values for a parameter obtained by a certain solver after getting a
  # list of results via optMethodSims
  #
  # Args:
  #   param: A string for the parameter to get (such as "beta1")
  #   solver: A string for the solver for which to get the parameter (such as
  #           "nlminb")
  #   reslist: A list created by optMethodSims
  #
  # Return:
  #   A vector of values of the parameter for each simulation
  res <- sapply(reslist, function(l) {
    return(l[solver, param])
  })
  return(res)
}

optMethodSims_getBestVals <- function(reslist, opt_vec = TRUE,
                                      reslike = FALSE) {
  # A function that gets the optimizer that maximized the likelihood function
  # for each entry in reslist
  #
  # Args:
  #   reslist: A list created by optMethodSims
  #   opt_vec: A boolean indicating whether to return a vector with the name of
  #            the optimizers that maximized the log likelihood
  #   reslike: A boolean indicating whether the resulting list should consist of
  #            matrices of only one row labeled "best" with a structure like
  #            reslist
  #
  # Return:
  #   If opt_vec is TRUE, a list of lists, where each sublist contains a vector
  #   of strings naming the optimizers that maximized the likelihood function
  #   and a matrix of the parameters found. Otherwise, just the matrix
  #   (resembles the list generated by optMethodSims)
  res <- lapply(reslist, function(l) {
    max_llh <- max(l[, "LLH"], na.rm = TRUE)
    best_idx <- (l[, "LLH"] == max_llh) & (!is.na(l[, "LLH"]))
    best_mat <- l[best_idx, , drop = FALSE]
    if (opt_vec) {
      return(list("solvers" = rownames(best_mat), "params" = best_mat))
    } else {
      return(best_mat)
    }
  })
  if (reslike) {
    res <- lapply(res, function(l) {
      mat <- l$params[1, , drop = FALSE]
      rownames(mat) <- "best"
      return(mat)
    })
  }
  return(res)
}

optMethodSims_getCaptureRate <- function(param, solver, reslist,
                                         multiplier = 2, spec,
                                         use_robust = TRUE) {
  # Gets the rate a confidence interval for a parameter captures the true value
  #
  # Args:
  #   param: A string for the parameter being worked with
  #   solver: A string for the solver used to estimate the parameter
  #   reslist: A list created by optMethodSims
  #   multiplier: A floating-point number for the multiplier to the standard
  #               error, appropriate for the desired confidence level
  #   spec: A ugarchspec specification with the fixed parameters containing the
  #         true parameter value
  #   use_robust: Use robust standard errors for computing CIs
  #
  # Return:
  #   A float for the proportion of times the confidence interval captured the
  #   true parameter value
  se_string <- ifelse(use_robust, "robust.se.", "se.")
  est <- optMethodSims_getAllVals(param, solver, reslist)
  moe_est <- multiplier * optMethodSims_getAllVals(
    paste0(se_string, param), solver, reslist)
  param_val <- spec@model$fixed.pars[[param]]
  contained <- (param_val <= est + moe_est) & (param_val >= est - moe_est)
  return(mean(contained, na.rm = TRUE))
}

optMethodSims_getMaxRate <- function(solver, maxlist) {
  # Gets how frequently a solver found a maximal log likelihood
  #
  # Args:
  #   solver: A string for the solver
  #   maxlist: A list created by optMethodSims_getBestVals with entries
  #            containing vectors naming the solvers that maximized the log
  #            likelihood
  #
  # Return:
  #   The proportion of times the solver maximized the log likelihood
  maxed <- sapply(maxlist, function(l) {
    solver %in% l$solvers
  })
  return(mean(maxed))
}

optMethodSims_getFailureRate <- function(solver, reslist) {
  # Computes the proportion of times a solver failed to converge.
  #
  # Args:
  #   solver: A string for the solver
  #   reslist: A list created by optMethodSims
  #
  # Return:
  #   Numeric proportion of times a solver failed to converge
  failed <- sapply(reslist, function(l) {
    is.na(l[solver, "LLH"])
  })
  return(mean(failed))
}

# Vectorization
optMethodSims_getCaptureRate <- Vectorize(optMethodSims_getCaptureRate,
                                          vectorize.args = "solver")
optMethodSims_getMaxRate <- Vectorize(optMethodSims_getMaxRate,
                                      vectorize.args = "solver")
optMethodSims_getFailureRate <- Vectorize(optMethodSims_getFailureRate,
                                          vectorize.args = "solver")
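The capture-rate calculation is easy to see on a toy example. The sketch below uses made-up estimates and standard errors (they are illustrative inputs, not simulation results), and capture_rate() is a hypothetical stand-alone helper that follows the same interval rule of estimate plus or minus a multiple of the standard error.

```r
# Hypothetical stand-alone helper: share of intervals est +/- multiplier * se
# that contain the true value, ignoring simulations where the fit failed (NA)
capture_rate <- function(est, se, truth, multiplier = 2) {
  moe <- multiplier * se
  contained <- (truth <= est + moe) & (truth >= est - moe)
  mean(contained, na.rm = TRUE)
}

est <- c(0.18, 0.59, NA)    # illustrative estimates; NA marks a failed fit
se <- c(0.05, 0.13, 0.10)   # illustrative standard errors
rate <- capture_rate(est, se, truth = 0.2)
rate
```

The first interval, [0.08, 0.28], contains 0.2, the second, [0.33, 0.85], does not, and the failed fit is dropped, so the rate is 0.5. Because failed fits are excluded, a solver that fails often can still show a high capture rate among the fits it does produce.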
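I first create tables containing, for a fixed sample size and model, the rate at which each solver attained the highest log likelihood among all solvers for a series (the maximization rate), the rate at which each solver failed to converge, and the rate at which a roughly 95% confidence interval built from the solver’s estimate and robust standard error contained the true value of each parameter (the capture rate).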
solver_table <- function(reslist, tags, spec) {
  # Creates a table describing important solver statistics
  #
  # Args:
  #   reslist: A list created by optMethodSims
  #   tags: A vector with strings naming all solvers to include in the table
  #   spec: A ugarchspec specification with the fixed parameters containing the
  #         true parameter value
  #
  # Return:
  #   A matrix containing metrics describing the performance of the solvers
  params <- names(spec@model$fixed.pars)  # use the spec argument, not spec1
  max_rate <- optMethodSims_getMaxRate(tags, optMethodSims_getBestVals(reslist))
  failure_rate <- optMethodSims_getFailureRate(tags, reslist)
  capture_rate <- lapply(params, function(p) {
    optMethodSims_getCaptureRate(p, tags, reslist, spec = spec)
  })
  return_mat <- cbind("Maximization Rate" = max_rate,
                      "Failure Rate" = failure_rate)
  capture_mat <- do.call(cbind, capture_rate)
  colnames(capture_mat) <- paste(params, "95% CI Capture Rate")
  return_mat <- cbind(return_mat, capture_mat)
  return(return_mat)
}
as.data.frame(round(solver_table(spec1_n100, tags, spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 16.2 | 20.0 | 21.8 | 29.2 | 24.0 |
| solnp | 0.1 | 0.0 | 13.7 | 24.0 | 15.4 |
| lbfgs | 15.1 | 35.2 | 56.6 | 67.9 | 58.0 |
| gosolnp | 20.3 | 0.0 | 20.3 | 32.6 | 21.9 |
| hybrid | 0.1 | 0.0 | 13.7 | 24.0 | 15.4 |
| nloptr+COBYLA | 0.0 | 0.0 | 6.3 | 82.6 | 19.8 |
| nloptr+BOBYQA | 0.0 | 0.0 | 5.4 | 82.1 | 18.5 |
| nloptr+PRAXIS | 15.8 | 0.0 | 42.1 | 54.5 | 44.1 |
| nloptr+NELDERMEAD | 0.4 | 0.0 | 5.7 | 19.3 | 8.1 |
| nloptr+SBPLX | 0.1 | 0.0 | 7.7 | 85.7 | 24.1 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 6.1 | 84.5 | 19.9 |
| nloptr+AUGLAG+BOBYQA | 0.1 | 0.0 | 6.5 | 83.2 | 19.4 |
| nloptr+AUGLAG+PRAXIS | 22.6 | 0.0 | 41.2 | 54.6 | 44.1 |
| nloptr+AUGLAG+NELDERMEAD | 11.1 | 0.0 | 7.5 | 18.8 | 9.7 |
| nloptr+AUGLAG+SBPLX | 0.6 | 0.0 | 7.9 | 86.5 | 23.0 |
as.data.frame(round(solver_table(spec1_n500, tags, spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 21.2 | 0.4 | 63.3 | 67.2 | 63.8 |
| solnp | 0.1 | 0.2 | 32.2 | 35.6 | 32.7 |
| lbfgs | 4.5 | 41.3 | 85.0 | 87.6 | 85.7 |
| gosolnp | 35.1 | 0.0 | 69.0 | 73.2 | 69.5 |
| hybrid | 0.1 | 0.0 | 32.3 | 35.7 | 32.8 |
| nloptr+COBYLA | 0.0 | 0.0 | 3.2 | 83.3 | 17.8 |
| nloptr+BOBYQA | 0.0 | 0.0 | 3.5 | 81.5 | 18.1 |
| nloptr+PRAXIS | 18.0 | 0.0 | 83.9 | 87.0 | 84.2 |
| nloptr+NELDERMEAD | 0.0 | 0.0 | 16.4 | 20.7 | 16.7 |
| nloptr+SBPLX | 0.1 | 0.0 | 3.7 | 91.4 | 15.7 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 3.2 | 83.3 | 17.8 |
| nloptr+AUGLAG+BOBYQA | 0.0 | 0.0 | 3.5 | 81.5 | 18.1 |
| nloptr+AUGLAG+PRAXIS | 21.9 | 0.0 | 80.2 | 87.4 | 83.4 |
| nloptr+AUGLAG+NELDERMEAD | 0.6 | 0.0 | 20.0 | 24.0 | 20.5 |
| nloptr+AUGLAG+SBPLX | 0.0 | 0.0 | 3.7 | 91.4 | 15.7 |
as.data.frame(round(solver_table(spec1_n1000, tags, spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 21.5 | 0.1 | 88.2 | 86.1 | 87.8 |
| solnp | 0.4 | 0.2 | 54.9 | 53.6 | 54.6 |
| lbfgs | 1.1 | 44.8 | 91.5 | 88.0 | 91.8 |
| gosolnp | 46.8 | 0.0 | 87.2 | 85.1 | 87.0 |
| hybrid | 0.5 | 0.0 | 55.0 | 53.6 | 54.7 |
| nloptr+COBYLA | 0.0 | 0.0 | 4.1 | 74.5 | 15.0 |
| nloptr+BOBYQA | 0.0 | 0.0 | 3.6 | 74.3 | 15.9 |
| nloptr+PRAXIS | 17.7 | 0.0 | 92.6 | 90.2 | 92.2 |
| nloptr+NELDERMEAD | 0.0 | 0.0 | 30.5 | 29.6 | 30.9 |
| nloptr+SBPLX | 0.0 | 0.0 | 3.0 | 82.3 | 11.6 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 4.1 | 74.5 | 15.0 |
| nloptr+AUGLAG+BOBYQA | 0.0 | 0.0 | 3.6 | 74.3 | 15.9 |
| nloptr+AUGLAG+PRAXIS | 13.0 | 0.0 | 83.4 | 93.9 | 86.7 |
| nloptr+AUGLAG+NELDERMEAD | 0.0 | 0.0 | 34.6 | 33.8 | 35.0 |
| nloptr+AUGLAG+SBPLX | 0.0 | 0.0 | 3.0 | 82.3 | 11.6 |
as.data.frame(round(solver_table(spec2_n100, tags, spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 8.2 | 24.2 | 22.3 | 34.7 | 23.9 |
| solnp | 0.3 | 0.0 | 21.1 | 32.6 | 21.3 |
| lbfgs | 11.6 | 29.5 | 74.9 | 73.2 | 70.4 |
| gosolnp | 19.0 | 0.0 | 31.9 | 41.2 | 30.8 |
| hybrid | 0.3 | 0.0 | 21.1 | 32.6 | 21.3 |
| nloptr+COBYLA | 0.0 | 0.0 | 20.5 | 94.7 | 61.7 |
| nloptr+BOBYQA | 0.2 | 0.0 | 19.3 | 95.8 | 62.2 |
| nloptr+PRAXIS | 16.0 | 0.0 | 70.2 | 57.2 | 52.8 |
| nloptr+NELDERMEAD | 0.2 | 0.0 | 7.8 | 27.8 | 14.1 |
| nloptr+SBPLX | 0.1 | 0.0 | 24.9 | 91.0 | 65.0 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 21.2 | 95.1 | 62.5 |
| nloptr+AUGLAG+BOBYQA | 0.9 | 0.0 | 20.1 | 96.2 | 62.5 |
| nloptr+AUGLAG+PRAXIS | 38.8 | 0.0 | 70.4 | 57.2 | 52.7 |
| nloptr+AUGLAG+NELDERMEAD | 14.4 | 0.0 | 10.7 | 26.0 | 16.1 |
| nloptr+AUGLAG+SBPLX | 0.1 | 0.0 | 25.8 | 91.9 | 65.5 |
as.data.frame(round(solver_table(spec2_n500, tags, spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 1.7 | 1.6 | 35.0 | 37.2 | 34.2 |
| solnp | 0.1 | 0.2 | 46.2 | 48.6 | 45.3 |
| lbfgs | 2.2 | 38.4 | 85.2 | 88.1 | 82.3 |
| gosolnp | 5.2 | 0.0 | 74.9 | 77.8 | 72.7 |
| hybrid | 0.1 | 0.0 | 46.1 | 48.5 | 45.2 |
| nloptr+COBYLA | 0.0 | 0.0 | 8.2 | 100.0 | 40.5 |
| nloptr+BOBYQA | 0.0 | 0.0 | 9.5 | 100.0 | 41.0 |
| nloptr+PRAXIS | 17.0 | 0.0 | 83.8 | 85.1 | 81.0 |
| nloptr+NELDERMEAD | 0.0 | 0.0 | 26.9 | 38.2 | 27.0 |
| nloptr+SBPLX | 0.0 | 0.0 | 8.2 | 100.0 | 40.2 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 8.2 | 100.0 | 40.5 |
| nloptr+AUGLAG+BOBYQA | 0.0 | 0.0 | 9.5 | 100.0 | 41.0 |
| nloptr+AUGLAG+PRAXIS | 77.8 | 0.0 | 84.4 | 85.4 | 81.3 |
| nloptr+AUGLAG+NELDERMEAD | 1.1 | 0.0 | 32.5 | 40.3 | 32.3 |
| nloptr+AUGLAG+SBPLX | 0.0 | 0.0 | 8.2 | 100.0 | 40.2 |
as.data.frame(round(solver_table(spec2_n1000, tags, spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| nlminb | 2.7 | 0.7 | 64.1 | 68.0 | 63.8 |
| solnp | 0.0 | 0.0 | 70.1 | 73.8 | 69.8 |
| lbfgs | 0.0 | 43.4 | 90.6 | 91.5 | 89.9 |
| gosolnp | 3.2 | 0.0 | 87.5 | 90.3 | 86.9 |
| hybrid | 0.0 | 0.0 | 70.1 | 73.8 | 69.8 |
| nloptr+COBYLA | 0.0 | 0.0 | 2.3 | 100.0 | 20.6 |
| nloptr+BOBYQA | 0.0 | 0.0 | 2.5 | 100.0 | 22.6 |
| nloptr+PRAXIS | 14.1 | 0.0 | 89.1 | 91.3 | 88.5 |
| nloptr+NELDERMEAD | 0.0 | 0.0 | 46.3 | 55.6 | 45.4 |
| nloptr+SBPLX | 0.0 | 0.0 | 2.2 | 100.0 | 19.5 |
| nloptr+AUGLAG+COBYLA | 0.0 | 0.0 | 2.3 | 100.0 | 20.6 |
| nloptr+AUGLAG+BOBYQA | 0.0 | 0.0 | 2.5 | 100.0 | 22.6 |
| nloptr+AUGLAG+PRAXIS | 85.5 | 0.0 | 89.1 | 91.3 | 88.5 |
| nloptr+AUGLAG+NELDERMEAD | 0.3 | 0.0 | 51.9 | 58.2 | 51.3 |
| nloptr+AUGLAG+SBPLX | 0.0 | 0.0 | 2.2 | 100.0 | 19.5 |
These tables already reveal a lot of information. In general it seems that the AUGLAG-PRAXIS method (the augmented Lagrangian method using the principal axis solver) provided in NLopt does best for model 2, especially for large sample sizes, while for model 1 the gosolnp method, which uses the solnp solver by Yinyu Ye but with random initializations and restarts, seems to win out for larger sample sizes.
The bigger story, though, is the failure of any method to be the “best”, especially in the case of smaller sample sizes. While there are some optimizers that consistently fail to attain the maximum log-likelihood, no optimizer can claim to consistently obtain the best result. Additionally, different optimizers seem to perform better with different models. The implication for real-world data, where the true model parameters are never known, is to try every optimizer (or at least those that have a chance of maximizing the log-likelihood) and pick the results that yield the largest log-likelihood. No algorithm is trustworthy enough to be the go-to algorithm.
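As a minimal sketch of that recipe, suppose the fits for one series have been collected in a matrix with one row per solver and an `LLH` column, mirroring the matrices that optMethodSims produces. The numbers below are made up for illustration, and `pick_best_fit` is a hypothetical helper, not part of rugarch.

```r
# Hypothetical results for one series: one row per solver; NA marks a solver
# that failed to converge (all values are invented for illustration)
fits <- rbind(
  nlminb  = c(omega = 0.21, alpha1 = 0.18, beta1 = 0.20, LLH = -152.4),
  solnp   = c(omega = 0.35, alpha1 = 0.00, beta1 = 0.00, LLH = -155.9),
  lbfgs   = c(omega = NA,   alpha1 = NA,   beta1 = NA,   LLH = NA),
  gosolnp = c(omega = 0.20, alpha1 = 0.19, beta1 = 0.22, LLH = -152.1)
)

pick_best_fit <- function(fits) {
  # Keep only solvers that converged, then take the largest log-likelihood
  ok <- !is.na(fits[, "LLH"])
  if (!any(ok)) stop("No solver converged")
  converged <- fits[ok, , drop = FALSE]
  converged[which.max(converged[, "LLH"]), , drop = FALSE]
}

best <- pick_best_fit(fits)
print(best)
```

Ties are broken by row order here, which is an arbitrary choice; with real fits you may prefer to inspect every solver that attains the maximum.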
Let’s now look at plots of the estimated distribution of the parameters. First comes a helper function.
library(ggplot2)

solver_density_plot <- function(param, tags, list_reslist, sample_sizes, spec) {
  # Given a parameter, creates a density plot for each solver's distribution
  # at different sample sizes
  #
  # Args:
  #   param: A string for the parameter to plot
  #   tags: A character vector containing the solver names
  #   list_reslist: A list of lists created by optMethodSims, one for each
  #                 sample size
  #   sample_sizes: A numeric vector identifying the sample size corresponding
  #                 to each object in the above list
  #   spec: A ugarchspec object containing the specification that generated the
  #         datasets
  #
  # Returns:
  #   A ggplot object containing the plot generated

  # True parameter value, marked by a vertical line in each panel
  p <- spec@model$fixed.pars[[param]]

  # One long-format data frame (estimate, optimizer) per sample size
  nlist <- lapply(list_reslist, function(l) {
    optlist <- lapply(tags, function(t) {
      return(na.omit(optMethodSims_getAllVals(param, t, l)))
    })
    names(optlist) <- tags
    df <- stack(optlist)
    names(df) <- c("param", "optimizer")
    return(df)
  })

  ndf <- do.call(rbind, nlist)
  ndf$n <- rep(sample_sizes, times = sapply(nlist, nrow))

  ggplot(ndf, aes(x = param)) +
    geom_density(fill = "black", alpha = 0.5) +
    geom_vline(xintercept = p, color = "blue") +
    facet_grid(optimizer ~ n, scales = "free_y")
}
Now for plots.
solver_density_plot("omega", tags, list(spec1_n100, spec1_n500, spec1_n1000), c(100, 500, 1000), spec1)
solver_density_plot("alpha1", tags, list(spec1_n100, spec1_n500, spec1_n1000), c(100, 500, 1000), spec1)
solver_density_plot("beta1", tags, list(spec1_n100, spec1_n500, spec1_n1000), c(100, 500, 1000), spec1)
Bear in mind that there are only 1,000 simulated series and the optimizers produce solutions for each series, so in principle optimizer results should not be independent, yet the only time these density plots look the same is when the optimizer performs terribly. But even when an optimizer isn’t performing terribly (as is the case for the gosolnp, PRAXIS, and AUGLAG-PRAXIS methods) there’s evidence of artifacts around 0 for the estimates of $\omega$ and $\alpha$ and 1 for $\beta$. These artifacts are more pronounced for smaller sample sizes. That said, for the better optimizers the estimators look almost unbiased, especially for $\omega$ and $\alpha$, but their spread is large even for large sample sizes, especially for $\beta$’s estimator. That’s not the case for the AUGLAG-PRAXIS optimizer, though; it appears to produce biased estimates.
Let’s look at plots for model 2.
solver_density_plot("omega", tags, list(spec2_n100, spec2_n500, spec2_n1000), c(100, 500, 1000), spec2)
solver_density_plot("alpha1", tags, list(spec2_n100, spec2_n500, spec2_n1000), c(100, 500, 1000), spec2)
solver_density_plot("beta1", tags, list(spec2_n100, spec2_n500, spec2_n1000), c(100, 500, 1000), spec2)
The estimators don’t struggle as much for model 2, but the picture is still hardly rosy. The PRAXIS and AUGLAG-PRAXIS methods seem to perform well, but far from spectacularly for small sample sizes.
So far, my experiments suggest practitioners should not rely on any one optimizer but instead try different ones and choose the results that have the largest log-likelihood. Suppose we call this optimization routine the “best” optimizer. How does this optimizer perform?
Let’s find out.
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec1_n100, reslike = TRUE), "best", spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 49.5 | 63.3 | 52.2 |
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec1_n500, reslike = TRUE), "best", spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 86 | 88.8 | 86.2 |
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec1_n1000, reslike = TRUE), "best", spec1) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 92.8 | 90.3 | 92.4 |
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec2_n100, reslike = TRUE), "best", spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 55.2 | 63.2 | 52.2 |
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec2_n500, reslike = TRUE), "best", spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 83 | 86.3 | 80.5 |
as.data.frame(round(solver_table( optMethodSims_getBestVals(spec2_n1000, reslike = TRUE), "best", spec2) * 100, digits = 1))
| | Maximization Rate | Failure Rate | omega 95% CI Capture Rate | alpha1 95% CI Capture Rate | beta1 95% CI Capture Rate |
|:--|--:|--:|--:|--:|--:|
| best | 100 | 0 | 88.7 | 91.4 | 88.1 |
Bear in mind that we evaluate the performance of the “best” optimizer by the CI capture rate, which should be around 95%. The “best” optimizer obviously has good performance but does not outperform all optimizers. This is disappointing; I had hoped that the “best” optimizer would have the highly desirable property of a 95% capture rate. Performance is nowhere near that except for larger sample sizes. Either the standard errors are being underestimated or for small sample sizes the Normal distribution poorly describes the actual distribution of the estimators (which means multiplying by two does not lead to intervals with the desired confidence level).
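To make the benchmark concrete, here is a small simulation sketch of what the capture rate should look like when the estimator really is Normal and its standard error is right, and what happens when the standard error is understated. The true value, standard error, and seed are arbitrary assumptions chosen for illustration; nothing here comes from the GARCH fits.

```r
set.seed(20190128)   # arbitrary seed, for reproducibility only

true_val <- 0.2      # hypothetical true parameter value
se       <- 0.05     # hypothetical (correct) standard error
n_sims   <- 10000

# Case 1: Normal estimator with a correctly estimated standard error
est <- rnorm(n_sims, mean = true_val, sd = se)
capture_ok <- mean(abs(est - true_val) <= 2 * se)

# Case 2: same estimates, but the reported standard error is half its true size
capture_under <- mean(abs(est - true_val) <= 2 * (se / 2))

# Coverage implied by the Normal distribution in each case
nominal_ok    <- 2 * pnorm(2) - 1   # interval of +/- 2 true standard errors
nominal_under <- 2 * pnorm(1) - 1   # interval of +/- 1 true standard error

round(c(capture_ok = capture_ok, nominal_ok = nominal_ok,
        capture_under = capture_under, nominal_under = nominal_under), 3)
```

A multiplier of two corresponds to roughly 95.4% coverage under Normality, so capture rates far below that point to one of the two explanations above.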
Interestingly, there is no noticeable difference in performance between the two models for this “best” estimator. This suggests to me that the seemingly better results for models often seen in actual data might be exploiting the bias of the optimizers.
Let’s look at the distribution of the estimated parameters.
solver_density_plot("omega", "best", lapply(list(spec1_n100, spec1_n500, spec1_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec1)
solver_density_plot("alpha1", "best", lapply(list(spec1_n100, spec1_n500, spec1_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec1)
solver_density_plot("beta1", "best", lapply(list(spec1_n100, spec1_n500, spec1_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec1)
solver_density_plot("omega", "best", lapply(list(spec2_n100, spec2_n500, spec2_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec2)
solver_density_plot("alpha1", "best", lapply(list(spec2_n100, spec2_n500, spec2_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec2)
solver_density_plot("beta1", "best", lapply(list(spec2_n100, spec2_n500, spec2_n1000), function(l) {optMethodSims_getBestVals(l, reslike = TRUE)}), c(100, 500, 1000), spec2)
The plots suggest that the “best” estimator still shows some pathologies even though it behaves less poorly than the other estimators. I don’t see evidence for bias in parameter estimates regardless of choice of model, but I’m not convinced the “best” estimator truly maximizes the log-likelihood, especially for smaller sample sizes. The estimates for $\beta$ are especially bad. Even if the standard error for $\beta$ should be large, I don’t think it should show the propensity for being zero or one that these plots reveal.
I initially wrote this article over a year ago and didn’t publish it until now. The hang-up was that I wanted a literature review of alternative ways to estimate the parameters of a GARCH model. Unfortunately I never completed such a review, and I’ve decided to release this article regardless.
That said, I’ll share what I was reading. One article by Gilles Zumbach tried to explain why estimating GARCH parameters is hard. He noted that the quasi-likelihood equation that solvers try to maximize has bad properties, such as being non-concave and having “flat” regions that algorithms can become stuck in. He suggested an alternative procedure for finding the parameters of GARCH models, where one finds the best fit in an alternative parameter space (which supposedly has better properties than the original parameter space of GARCH models) and estimates one of the parameters using, say, the method of moments, without any optimization algorithm. Another article, by Fiorentini, Calzolari, and Panattoni, showed that analytic gradients for GARCH models could be computed explicitly, so gradient-free methods like those used by the optimization algorithms seen here are not actually necessary. Since numerical differentiation is generally a difficult problem, this could help ensure that no additional numerical error is being introduced that causes these algorithms to fail to converge. I also wanted to explore other estimation methods, such as estimation via the method of moments, to see if they can avoid numerical techniques altogether or have better numerical properties. I wanted to read an article by Andersen, Chung, and Sørensen to learn more about this approach to estimation.
Life happens, though, and I didn’t complete this review. The project moved on and the problem of estimating GARCH model parameters well was essentially avoided. That said, I want to revisit this point, perhaps exploring how techniques such as simulated annealing do for estimating GARCH model parameters.
So for now, if you’re a practitioner, what should you do when estimating a GARCH model? I would say don’t take for granted that the default estimation procedure your package uses will work. You should explore different procedures and different parameter choices and go with the results that lead to the largest log-likelihood value. I showed how this could be done in an automated fashion, but you should be prepared to manually pick the model with the best fit (as determined by the log-likelihood). If you don’t do this, the model you estimated may not actually be the one for which theory works.
I will say it again, one last time, in the last sentence of this article for extra emphasis: don’t take numerical techniques and results for granted!
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] ggplot2_2.2.1 rugarch_1.38  printr_0.1
##
## loaded via a namespace (and not attached):
##  [1] digest_0.6.16              htmltools_0.3.6
##  [3] SkewHyperbolic_0.32        expm_0.9992
##  [5] scales_0.5.0               DistributionUtils_0.51
##  [7] Rsolnp_1.16                rprojroot_1.2
##  [9] grid_3.4.2                 stringr_1.3.1
## [11] knitr_1.17                 numDeriv_2016.81
## [13] GeneralizedHyperbolic_0.81 munsell_0.4.3
## [15] pillar_1.3.0               tibble_1.4.2
## [17] compiler_3.4.2             highr_0.6
## [19] lattice_0.2035             labeling_0.3
## [21] Matrix_1.28                KernSmooth_2.2315
## [23] plyr_1.8.4                 xts_0.100
## [25] spd_2.01                   zoo_1.80
## [27] stringi_1.2.4              magrittr_1.5
## [29] reshape2_1.4.2             rlang_0.2.2
## [31] rmarkdown_1.7              evaluate_0.10.1
## [33] gtable_0.2.0               colorspace_1.32
## [35] yaml_2.1.14                tools_3.4.2
## [37] mclust_5.4.1               mvtnorm_1.06
## [39] truncnorm_1.07             ks_1.11.3
## [41] nloptr_1.0.4               lazyeval_0.2.1
## [43] crayon_1.3.4               backports_1.1.1
## [45] Rcpp_1.0.0
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
The Forgotten Age cycle of Arkham Horror is at a close and Fantasy Flight Games already announced the next cycle, The Circle Undone. Not only that, they’ve announced two mythos packs at a rate that… surprised me. A new cycle announcement and two mythos pack announcements in less than two months? Am I the only one who finds the new pace of announcements surprising? Perhaps that means they want to get product out at a faster pace?
Eh, enough speculation. I wrote about Arkham Horror before, analyzing Olive McBride specifically. This analysis (despite errors in the initial publication) was well received, even earning me a shout-out from my favorite Arkham-related YouTube channel.
In the announcement of the mythos pack The Wages of Sin, another mathematically interesting card was spoiled: Henry Wan, seen below.
Designing new allies for Arkham Horror is very hard because there can effectively only be one ally in a deck and there are many good allies already released, many of them in the core set. Henry Wan, specifically, is competing with Leo de Luca, who competes with Dr. Milan Christopher for the title of “Best Ally”. Cards like Charisma help the problem, but only if you plan on running multiple allies and are willing to pay the experience points for it.
Can Henry Wan compete with Leo de Luca? That strongly depends on how good his ability is. Actions are a precious commodity in Arkham Horror; this is why Leo de Luca is considered such a great card. Card draw and resource gain can help action economy, especially in a spendthrift class such as the Rogue (green) class, but it often takes many resources to compensate for a lost action.
Consider, for instance, Father Mateo’s Elder Sign ability: gain an extra action, or a card and a resource. As a point of reference, players can draw a card or gain a resource for one of their actions, so a raw evaluation would say that drawing a card and gaining a resource is actually worth two actions and thus is better than just getting a free action. But I feel most of the time people use Father Mateo’s Elder Sign effect to gain the additional action rather than the card and resource (though choosing the latter effect is far from rare). In fact, I think that a single action could be valued at three resources, based only on the fact that when a player draws Emergency Cache they will eagerly play it. When viewed from this perspective, Leo de Luca pays for himself after about three turns, and drawing him early gives an investigator a major boost in a scenario.
Henry Wan will thus live or die based on how strong his ability is. That said, “strong” depends on how well a player can use his ability, which is not a trivial task.
Make no mistake: Henry Wan is a gambler’s card (which fits the Rogue theme very well). Not only does a player gamble the resources spent on him, they gamble the action spent to trigger his ability; heck, using a deck slot on him is a gamble! A player thus will gain value from him only if they use him optimally.
Optimal play is not trivially determined, but fortunately Henry Wan’s ability is easy to model mathematically if you’re familiar with Markov chains. Wait, are most people not familiar with Markov chains? Oh, I didn’t know that. Oh well, maybe they’ll learn something from what follows. I’ll do my best to make it simple.
From here on, I consider drawing a card or gaining a resource with Henry Wan as equivalent; I’ll simply imagine that we’re trying to gain resources using his ability. Henry Wan’s ability calls on players to institute a policy for playing him of the following form:
After X draws, take your winnings; do not draw anymore.
Our job is thus to choose X so that we maximize the expected resource gain (in the probabilistic sense of expectation).
I’m going to call utilizing Henry Wan’s ability a single “game”. Here’s how I view the game: the chaos bag is filled with tokens labeled either “S” or “F”, with every “F” being one of the icon tokens mentioned in Henry Wan’s ability. When an “S” is drawn, the game continues, while the game ends the moment an “F” is drawn. Every time we draw an additional “S”, there is one fewer “S” in the bag, and the odds of drawing an “F” increase; that said, our total winnings increase with each “S” we draw.
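The game as described is easy to simulate, which gives a sanity check on any formula for the expected winnings. The sketch below is my own illustration: `play_wan_game` is a hypothetical helper, and the bag of 11 “S” and 6 “F” tokens and the policy X = 2 are example values.

```r
play_wan_game <- function(s, f, X) {
  # Shuffle the bag and look at the first X tokens drawn without replacement.
  # Stopping at the first "F" gives the same winnings as checking all X draws:
  # the policy pays X only if every one of them is an "S".
  bag <- sample(c(rep("S", s), rep("F", f)))
  draws <- bag[seq_len(X)]
  if (all(draws == "S")) X else 0
}

set.seed(1202)   # arbitrary seed, for reproducibility only
winnings <- replicate(20000, play_wan_game(s = 11, f = 6, X = 2))
mean(winnings)   # Monte Carlo estimate of the expected winnings
```

The helper assumes X is no larger than the number of tokens in the bag.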
The game ends when either an “F” is drawn or the policy is triggered. Our winnings depend on which of these outcomes occurs: if it’s the former, our winnings are 0, while if it’s the latter, our winnings are X. Thus it’s easy to see (if you’re familiar with probability) that the expected winnings for any given policy is X times the probability of winning with the chosen policy: $X p_X$, if you prefer (with $p_X$ being the probability of not failing using the policy of ending after X draws). We thus want to pick the X that maximizes $X p_X$.
Calculating $p_X$ calls for the Markov chain. Below is the chain I imagine:
The problem now is to calculate the probability the chain is absorbed into state X. The probability of ending in a particular absorbing state has a well-known solution (given in the above link to Wikipedia).
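For reference, here is that standard result written out. The notation is an assumption of this sketch: $s$ and $f$ are the numbers of “S” and “F” tokens in the bag, $p_X$ is the probability of being absorbed into state X, $Q$ holds the transition probabilities among the transient states $0, \dots, X - 1$, and $R$ holds the transition probabilities from transient states into the absorbing states X and F.

$$
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}, \qquad
N = (I - Q)^{-1}, \qquad
B = N R, \qquad
p_X = B_{0, X}
$$

For this particular chain, state $i$ moves to state $i + 1$ with probability $(s - i)/(s + f - i)$ and to F otherwise, so the absorption probability collapses to a product:

$$
p_X = \prod_{i = 0}^{X - 1} \frac{s - i}{s + f - i}, \qquad
\text{expected winnings} = X \, p_X
$$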
No special trick for finding a maximizing X is necessary once we know how to solve this problem for any X; just list out all possible policies (there are only finitely many we need to worry about, and the number doesn't exceed 20 most of the time) along with their expected winnings, and pick the X maximizing this number.
The maximizing policy depends on what's in the chaos bag. Shocking, right? That said, this is an important point; each campaign/scenario/difficulty level has its own chaos bag, and thanks to cards with the seal keyword, the chaos bag can be changed during a scenario, perhaps to either the benefit or detriment of Henry Wan. Fortunately, the "S" and "F" language makes modelling the contents of the chaos bag so simple that we can create two-dimensional tables depending only on the number of "S" and "F" tokens in the bag, and those tables will cover nearly every scenario an investigator will encounter.
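A brute-force version of that search fits in a few lines. This is a sketch rather than the script used for the tables: it skips the Markov chain machinery and multiplies the per-draw success probabilities directly, and the function names `wan_success_prob` and `best_wan_policy` are hypothetical.

```r
wan_success_prob <- function(s, f, X) {
  # Probability that the first X tokens drawn (without replacement) are all "S"
  i <- 0:(X - 1)
  prod((s - i) / (s + f - i))
}

best_wan_policy <- function(s, f) {
  policies <- 1:s   # at most s successful draws are possible
  probs <- sapply(policies, function(X) wan_success_prob(s, f, X))
  ev <- policies * probs          # expected winnings of each policy
  best <- which.max(ev)           # first maximizer if there are ties
  c(X = policies[best], P = probs[best], EV = ev[best])
}

round(best_wan_policy(s = 11, f = 6), 2)
```

For a bag with 11 "S" tokens and 6 "F" tokens this picks X = 2, with a success probability of about 0.4 and expected winnings of about 0.81.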
The script below (which can be made executable on Unix systems with R installed) can be used for generating such tables.
#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorHenryWanTableGenerator.R
################################################################################
# 2018-12-02
# Curtis Miller
################################################################################
# This is a one-line description of the file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# FUNCTIONS
################################################################################

#' Henry Wan Policy Calculator
#'
#' Calculates important quantities for optimal play with Henry Wan
#'
#' @param s The number of "S" (or "success") tokens in the bag
#' @param f The number of "F" (or "failure") tokens in the bag
#' @param olive If \code{TRUE}, the first draw is done with Olive McBride
#' @param out If \code{"X"}, return the optimal stopping time (default); if
#'            \code{"EV"}, return the expected winnings of the optimal policy;
#'            if \code{"P"}, return the probability of success of the optimal
#'            policy
#' @return Numeric depending on the value of the parameter \code{out}
#' @examples
#' wan_policy_calculator(11, 5)
wan_policy_calculator <- function(s, f, olive = FALSE,
                                  out = c("X", "EV", "P")) {
  out <- out[[1]]
  policies <- (ifelse(olive, 2, 1)):s  # Candidate X values
  policy_probs <- sapply(policies, function(X) {
    # Set up transition matrix of Markov chain
    P <- 0 * diag(X + 2)
    rownames(P) <- c(0:X, "F")
    colnames(P) <- rownames(P)
    # X and F are absorbing states
    P[c(X, "F"), c(X, "F")] <- diag(2)
    transient_states <- ifelse(X > 1, list(c("0", 1:(X - 1))), "0")[[1]]
    # From state i, the chance of drawing an "F" is f / (tokens remaining)
    P[transient_states, "F"] <- f/(s + f - (0:(X - 1)))
    if (olive) {
      if (s + f < 3 || X == 1) {
        stop("X or chaos bag doesn't make sense with Olive!")
      }
      # Failure with Olive is modeled with a hypergeometric RV, with drawing one
      # or fewer "S's"
      P["0", "F"] <- phyper(1, m = s, n = f, k = 3)
      # The state 1 is effectively removed when Olive is used
      transient_states <- transient_states[-2]
      P <- P[-2, -2]
      # TODO: curtis: OLIVE IMPLEMENTATION -- Sun 02 Dec 2018 11:05:17 PM MST
    }
    if (X > 1) {
      if (olive & X == 2) {
        P["0", "2"] <- 1 - P["0", "F"]
      } else {
        P[transient_states, as.character((ifelse(olive, 2, 1)):X)] <- diag(
          c(1 - P[transient_states, "F"]))
      }
    } else {
      P["0", "1"] <- 1 - P["0", "F"]
    }

    # Compute absorption probability
    R <- P[transient_states, c(X, "F"), drop = FALSE]
    Q <- P[transient_states, transient_states, drop = FALSE]
    N <- solve(diag(nrow(Q)) - Q)
    B <- N %*% R
    B[1,1][[1]]
  })

  X <- which.max(policy_probs * policies)
  if (out == "X") {
    policies[[X]]
  } else if (out == "EV") {
    policies[[X]] * policy_probs[[X]]
  } else if (out == "P") {
    policy_probs[[X]]
  } else {
    stop(paste("Don't know how to handle out =", out))
  }
}

wan_policy_calculator <- Vectorize(wan_policy_calculator, c("s", "f"))

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(olive = FALSE, value = FALSE, prob = FALSE, digits = 2,
                 lower_s = 5, upper_s = 20, lower_f = 0, upper_f = 8,
                 help = FALSE) {
  # This function will be executed when the script is called from the command
  # line; the help parameter does nothing, but is needed for do.call() to work

  library(pander)

  sl <- lower_s
  su <- upper_s
  fl <- lower_f
  fu <- upper_f

  out <- "X"
  if (value) {out <- "EV"}
  if (prob) {out <- "P"}

  wan_table <- outer(sl:su, fl:fu, FUN = function(r, c) {
    wan_policy_calculator(r, c, olive = olive, out = out)
  })
  rownames(wan_table) <- sl:su
  colnames(wan_table) <- fl:fu
  wan_table <- round(wan_table, digits = digits)

  pandoc.table(wan_table, style = "rmarkdown")
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = paste("Generates tables describing optimal policies",
                        "for playing with the card Henry Wan in",
                        "Arkham Horror: The Card Game (number of icon",
                        "tokens in bag are columns; non-icon rows)."),
    option_list = list(
      make_option(c("--olive", "-o"), action = "store_true", default = FALSE,
                  help = "The first draw is done with Olive"),
      make_option(c("--value", "-v"), action = "store_true", default = FALSE,
                  help = paste("Report expected value rather than",
                               "optimal stopping policy")),
      make_option(c("--prob", "-p"), action = "store_true", default = FALSE,
                  help = paste("Report success probability of optimal",
                               "stopping policy rather than the",
                               "optimal stopping policy itself")),
      make_option(c("--digits", "-d"), type = "integer", default = 2,
                  help = "Number of digits for rounding"),
      make_option(c("--lower-s", "-s"), type = "integer", default = 5,
                  help = "Lowest considered number of non-icon tokens"),
      make_option(c("--upper-s", "-w"), type = "integer", default = 20,
                  help = "Highest considered number of non-icon tokens"),
      make_option(c("--lower-f", "-f"), type = "integer", default = 0,
                  help = "Lowest considered number of icon tokens"),
      make_option(c("--upper-f", "-r"), type = "integer", default = 8,
                  help = "Highest number of icon tokens")
    )))

  # Rename the hyphenated option names to the argument names main() expects
  cl_args <- cl_args[c("olive", "value", "prob", "digits", "lower-s",
                       "upper-s", "lower-f", "upper-f", "help")]
  names(cl_args) <- c("olive", "value", "prob", "digits", "lower_s", "upper_s",
                      "lower_f", "upper_f", "help")

  do.call(main, cl_args)
}
With the above script I can make the following three tables. The columns represent the number of (bad) icon tokens in the bag, while rows represent the number of other tokens in the bag. The first table is the optimal stopping policy; the second, the probability of success of the optimal stopping policy; and the third, the expected winnings of the optimal policy (which is the product of the previous two tables).
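Optimal stopping policy X (rows: non-icon tokens; columns: icon tokens):

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 5 | 5 | 3 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
| 6 | 6 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
| 7 | 7 | 4 | 3 | 2 | 2 | 2 | 1 | 1 | 1 |
| 8 | 8 | 4 | 3 | 3 | 2 | 2 | 2 | 1 | 1 |
| 9 | 9 | 5 | 3 | 3 | 2 | 2 | 2 | 2 | 1 |
| 10 | 10 | 5 | 4 | 3 | 3 | 2 | 2 | 2 | 2 |
| 11 | 11 | 6 | 4 | 3 | 3 | 2 | 2 | 2 | 2 |
| 12 | 12 | 6 | 4 | 4 | 3 | 3 | 2 | 2 | 2 |
| 13 | 13 | 7 | 5 | 4 | 3 | 3 | 2 | 2 | 2 |
| 14 | 14 | 7 | 5 | 4 | 3 | 3 | 3 | 2 | 2 |
| 15 | 15 | 8 | 6 | 4 | 3 | 3 | 3 | 2 | 2 |
| 16 | 16 | 8 | 6 | 5 | 4 | 3 | 3 | 3 | 2 |
| 17 | 17 | 9 | 6 | 5 | 4 | 3 | 3 | 3 | 2 |
| 18 | 18 | 10 | 6 | 5 | 4 | 3 | 3 | 3 | 2 |
| 19 | 19 | 10 | 7 | 5 | 4 | 4 | 3 | 3 | 3 |
| 20 | 20 | 11 | 7 | 5 | 5 | 4 | 3 | 3 | 3 |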
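
Probability of success of the optimal stopping policy:

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 5 | 1 | 0.5 | 0.48 | 0.36 | 0.56 | 0.5 | 0.45 | 0.42 | 0.38 |
| 6 | 1 | 0.57 | 0.54 | 0.42 | 0.33 | 0.55 | 0.5 | 0.46 | 0.43 |
| 7 | 1 | 0.5 | 0.42 | 0.47 | 0.38 | 0.32 | 0.54 | 0.5 | 0.47 |
| 8 | 1 | 0.56 | 0.47 | 0.34 | 0.42 | 0.36 | 0.31 | 0.53 | 0.5 |
| 9 | 1 | 0.5 | 0.51 | 0.38 | 0.46 | 0.4 | 0.34 | 0.3 | 0.53 |
| 10 | 1 | 0.55 | 0.42 | 0.42 | 0.33 | 0.43 | 0.38 | 0.33 | 0.29 |
| 11 | 1 | 0.5 | 0.46 | 0.45 | 0.36 | 0.46 | 0.4 | 0.36 | 0.32 |
| 12 | 1 | 0.54 | 0.49 | 0.36 | 0.39 | 0.32 | 0.43 | 0.39 | 0.35 |
| 13 | 1 | 0.5 | 0.43 | 0.39 | 0.42 | 0.35 | 0.46 | 0.41 | 0.37 |
| 14 | 1 | 0.53 | 0.46 | 0.42 | 0.45 | 0.38 | 0.32 | 0.43 | 0.39 |
| 15 | 1 | 0.5 | 0.4 | 0.45 | 0.47 | 0.4 | 0.34 | 0.45 | 0.42 |
| 16 | 1 | 0.53 | 0.43 | 0.38 | 0.38 | 0.42 | 0.36 | 0.32 | 0.43 |
| 17 | 1 | 0.5 | 0.46 | 0.4 | 0.4 | 0.44 | 0.38 | 0.34 | 0.45 |
| 18 | 1 | 0.47 | 0.48 | 0.42 | 0.42 | 0.46 | 0.4 | 0.35 | 0.47 |
| 19 | 1 | 0.5 | 0.43 | 0.44 | 0.44 | 0.36 | 0.42 | 0.37 | 0.33 |
| 20 | 1 | 0.48 | 0.45 | 0.46 | 0.36 | 0.38 | 0.44 | 0.39 | 0.35 |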
0  1  2  3  4  5  6  7  8  

5  5  1.5  0.95  0.71  0.56  0.5  0.45  0.42  0.38 
6  6  1.71  1.07  0.83  0.67  0.55  0.5  0.46  0.43 
7  7  2  1.25  0.93  0.76  0.64  0.54  0.5  0.47 
8  8  2.22  1.4  1.02  0.85  0.72  0.62  0.53  0.5 
9  9  2.5  1.53  1.15  0.92  0.79  0.69  0.6  0.53 
10  10  2.73  1.7  1.26  0.99  0.86  0.75  0.66  0.59 
11  11  3  1.85  1.36  1.09  0.92  0.81  0.72  0.64 
12  12  3.23  1.98  1.45  1.18  0.97  0.86  0.77  0.69 
13  13  3.5  2.14  1.57  1.26  1.05  0.91  0.82  0.74 
14  14  3.73  2.29  1.68  1.34  1.13  0.96  0.87  0.79 
15  15  4  2.43  1.78  1.41  1.2  1.03  0.91  0.83 
16  16  4.24  2.59  1.88  1.5  1.26  1.09  0.95  0.87 
17  17  4.5  2.74  2  1.59  1.32  1.15  1.01  0.91 
18  18  4.74  2.87  2.11  1.67  1.38  1.21  1.06  0.94 
19  19  5  3.03  2.21  1.75  1.46  1.26  1.12  0.99 
20  20  5.24  3.18  2.3  1.82  1.53  1.32  1.17  1.04 
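The entries above can be checked with a short calculation. What follows is a minimal Python sketch, not the script used for this post. It assumes tokens are drawn without replacement and that, because the only thing you learn while playing is that you haven't lost yet, a stopping policy amounts to choosing a number of draws in advance. The function names `success_probability` and `best_policy` are my own.

```python
from fractions import Fraction
from math import comb


def success_probability(other, icon, draws):
    """Chance that `draws` tokens pulled without replacement are all non-icon."""
    return Fraction(comb(other, draws), comb(other + icon, draws))


def best_policy(other, icon):
    """Return (draws, success probability, expected winnings) for the
    number of draws that maximizes draws * P(all draws are safe)."""
    best_draws, best_p, best_value = 0, Fraction(1), Fraction(0)
    for draws in range(1, other + 1):
        p = success_probability(other, icon, draws)
        value = draws * p
        if value > best_value:  # strict: exact ties keep the smaller draw count
            best_draws, best_p, best_value = draws, p, value
    return best_draws, float(best_p), float(best_value)


# 11 other tokens and 6 icon tokens
draws, p, winnings = best_policy(other=11, icon=6)
print(draws, round(p, 2), round(winnings, 2))  # 2 0.4 0.81
```

Some bag compositions have two draw counts with exactly the same expected winnings (with 18 other tokens and one icon token, 9 and 10 draws tie), so the policy entry can differ by one from the table while the winnings agree.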
I view column 6, row 11 as the "typical" scenario (6 icon tokens and 11 others). There the expected winnings are 0.81, less than the one resource or card a basic action guarantees. The conclusion is this: you'd be better off just grabbing a resource or drawing a card the usual way than using Henry Wan! Not only is Henry Wan worse than Leo de Luca, he's worse than gaining resources with a regular action!
Granted, there are cards with the seal keyword that can help improve the odds. But one must ask whether the opportunity cost of playing those cards is worth it. Perhaps the benefits of a favorable chaos bag for skill tests plus better Henry Wan games would give the investigators a teeny tiny edge… after a hell of a lot of work and lucky draws. That said, I'm sure there are much easier ways to play the game that are also more fun.
When Henry Wan was announced, people considered pairing him up with Olive McBride, whose ability works "when you would reveal a chaos token". Any investigator that can take both Mystic (purple) and Rogue (green) cards (including Sefina Rousseau and all Dunwich investigators; I don't count Lola Hayes since, while she can include both cards in her deck, using them together may not be possible) can include these two cards in the same deck.
I'll always assume that Olive's ability is utilized on the first draw. When using Olive with Henry, one can get two tokens drawn without either of them being a bad icon that ends the "game". Thus Olive boosts the success rate and the ultimate payout.
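To get a feel for the size of the boost, take a bag with 11 other tokens and 6 icon tokens and aim to bank two tokens. This is a worked sketch that assumes Olive reveals three tokens and lets you keep two of them, so her draw is safe whenever at least two of the three are not icons.

```latex
\[
P(\text{two safe tokens, no Olive})
  = \frac{\binom{11}{2}}{\binom{17}{2}}
  = \frac{55}{136} \approx 0.40,
\qquad
P(\text{two safe tokens, with Olive})
  = \frac{\binom{11}{3} + 6\binom{11}{2}}{\binom{17}{3}}
  = \frac{495}{680} \approx 0.73 .
\]
```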
Having Olive and Henry out at the same time is extremely difficult; first, you'd have to have Charismas to accommodate them, then draw them both in a game at reasonable times. The likelihood of getting the combo out is low and comes with significant opportunity costs.
That said, when Olive is out, she provides Henry enough of a boost to make him playable. The following tables account for Olive's effect (see the code for how) on the first draw but otherwise match up with the earlier tables.
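Table 4: optimal stopping policy (number of tokens to draw) with Olive used on the first draw. Rows: other tokens in the bag; columns: icon tokens in the bag.
Other \ Icon  0  1  2  3  4  5  6  7  8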
5  5  3  2  2  2  2  2  2  2 
6  6  3  2  2  2  2  2  2  2 
7  7  4  3  2  2  2  2  2  2 
8  8  4  3  2  2  2  2  2  2 
9  9  5  3  3  2  2  2  2  2 
10  10  6  4  3  3  2  2  2  2 
11  11  6  4  3  3  2  2  2  2 
12  12  7  4  3  3  3  2  2  2 
13  13  7  5  4  3  3  2  2  2 
14  14  7  5  4  3  3  3  2  2 
15  15  8  6  4  3  3  3  2  2 
16  16  8  6  5  4  3  3  3  2 
17  17  9  6  5  4  3  3  3  2 
18  18  10  6  5  4  3  3  3  3 
19  19  10  7  5  4  4  3  3  3 
20  20  11  7  5  4  4  3  3  3 
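
Table 5: probability of success under the optimal stopping policy with Olive used on the first draw. Rows: other tokens in the bag; columns: icon tokens in the bag.
Other \ Icon  0  1  2  3  4  5  6  7  8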
5  1  0.75  0.86  0.71  0.6  0.5  0.42  0.36  0.31 
6  1  0.8  0.89  0.77  0.67  0.58  0.5  0.44  0.38 
7  1  0.67  0.65  0.82  0.72  0.64  0.56  0.5  0.45 
8  1  0.71  0.7  0.85  0.76  0.69  0.62  0.55  0.5 
9  1  0.62  0.74  0.61  0.8  0.73  0.66  0.6  0.55 
10  1  0.56  0.59  0.65  0.55  0.76  0.7  0.64  0.59 
11  1  0.6  0.63  0.68  0.59  0.79  0.73  0.67  0.62 
12  1  0.55  0.66  0.71  0.62  0.54  0.75  0.7  0.66 
13  1  0.58  0.56  0.56  0.64  0.57  0.78  0.73  0.68 
14  1  0.62  0.59  0.59  0.67  0.6  0.53  0.75  0.71 
15  1  0.57  0.51  0.61  0.69  0.62  0.56  0.77  0.73 
16  1  0.6  0.54  0.51  0.54  0.64  0.58  0.53  0.75 
17  1  0.56  0.56  0.53  0.57  0.66  0.6  0.55  0.77 
18  1  0.53  0.59  0.55  0.59  0.68  0.62  0.57  0.52 
19  1  0.56  0.52  0.57  0.6  0.53  0.64  0.59  0.54 
20  1  0.53  0.55  0.59  0.62  0.55  0.66  0.61  0.56 
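
Table 6: expected winnings under the optimal stopping policy with Olive used on the first draw. Rows: other tokens in the bag; columns: icon tokens in the bag.
Other \ Icon  0  1  2  3  4  5  6  7  8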
5  5  2.25  1.71  1.43  1.19  1  0.85  0.73  0.63 
6  6  2.4  1.79  1.55  1.33  1.15  1  0.87  0.77 
7  7  2.67  1.96  1.63  1.44  1.27  1.13  1  0.89 
8  8  2.86  2.1  1.7  1.53  1.37  1.23  1.11  1 
9  9  3.12  2.21  1.83  1.59  1.45  1.32  1.2  1.09 
10  10  3.33  2.38  1.95  1.65  1.52  1.39  1.28  1.18 
11  11  3.6  2.52  2.04  1.76  1.57  1.46  1.35  1.25 
12  12  3.82  2.64  2.12  1.85  1.62  1.51  1.41  1.31 
13  13  4.08  2.8  2.24  1.93  1.71  1.56  1.46  1.37 
14  14  4.31  2.95  2.36  2.01  1.79  1.6  1.51  1.42 
15  15  4.57  3.07  2.45  2.07  1.86  1.67  1.55  1.46 
16  16  4.8  3.24  2.54  2.17  1.93  1.75  1.58  1.5 
17  17  5.06  3.38  2.66  2.26  1.99  1.81  1.65  1.54 
18  18  5.29  3.51  2.77  2.34  2.04  1.87  1.71  1.57 
19  19  5.56  3.67  2.87  2.42  2.12  1.92  1.77  1.63 
20  20  5.79  3.82  2.96  2.49  2.2  1.97  1.82  1.69 
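The Olive tables can be checked the same way. The sketch below is Python and is not the script used for this post; the function names are my own. It rests on two assumptions: Olive reveals three tokens and you keep two of them, and the third token goes back into the bag before any further draws. The second assumption is my reading, chosen because it reproduces the tabulated values.

```python
from fractions import Fraction
from math import comb


def olive_first_draw(other, icon):
    """Chance that at least two of three revealed tokens are non-icon.
    Needs at least three tokens in the bag."""
    exactly_two_safe = comb(other, 2) * icon
    all_three_safe = comb(other, 3)
    return Fraction(exactly_two_safe + all_three_safe, comb(other + icon, 3))


def olive_success_probability(other, icon, draws):
    """Chance of banking `draws` tokens (draws >= 2) when Olive supplies
    the first two and the unchosen third token returns to the bag."""
    extra = draws - 2
    # After Olive: two safe tokens are out of the bag, every icon is still in it.
    later = Fraction(comb(other - 2, extra), comb(other + icon - 2, extra))
    return olive_first_draw(other, icon) * later


def best_olive_policy(other, icon):
    """Return (draws, success probability, expected winnings)."""
    best_draws, best_p, best_value = 0, Fraction(0), Fraction(0)
    for draws in range(2, other + 1):
        p = olive_success_probability(other, icon, draws)
        value = draws * p
        if value > best_value:  # strict: exact ties keep the smaller draw count
            best_draws, best_p, best_value = draws, p, value
    return best_draws, float(best_p), float(best_value)


# 11 other tokens and 6 icon tokens
draws, p, winnings = best_olive_policy(other=11, icon=6)
print(draws, round(p, 2), round(winnings, 2))  # 2 0.73 1.46
```

As with the earlier tables, a few bag compositions have exact ties between two draw counts (10 other tokens and one icon token ties at 5 and 6 draws), so the policy entry may differ by one while the expected winnings match.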
Notice that when Olive is being used it's optimal to use Olive to get two tokens out (that you can pick) then end the "game". It seems that Olive does make Henry's ability profitable… albeit mildly. If we're using the heuristic that a card needs to pay for its own resource cost plus three for each action involved, I'd say that the combo would need at least eight turns to be profitable in a typical game… which is terrible.
Henry Wan is an expensive way to attempt to milk a little more value from Olive. Even with Olive I don’t think he’s worth the trouble.
It's as true in Arkham as it is in real life: gambling is better for the house than for the gambler (with the house being the forces of the mythos, in this case). If you're looking to have fun gambling, Henry Wan is your card. If you're looking to win… look elsewhere.
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
]]>