Preface

This book introduces the concepts behind statistical methods used to analyze data with correlated error structures. While correlated data arise in many ways, the focus is on ecological and evolutionary data, and two types of correlations: correlations generated by the hierarchical nature of the sampling (e.g., plots sampled within sites) and correlations generated by the phylogenetic relationships among species.

The book is integrated with R code that illustrates every point. Although it is possible to read the book without the code, or work through the code without the book, they are designed to go hand-in-hand. The R code comes with the complete downloadable package of the book on leanpub.com; if you have problems downloading it, please contact me.

I’ve designed the book to be read in its entirety, or at least for each chapter to be read in its entirety. Therefore, it is not organized like a reference manual. However, because I don’t expect everybody to read the whole thing, I’ve tried to repeat some material between chapters, so that each chapter is more self-contained. Still, there might be places where you will want to consult another chapter, and I’ve included pointers to sections in other chapters where appropriate.

The material covered in the book is:

Chapter 1, Multiple Methods for Analyzing Hierarchical Data

The first chapter introduces and analyzes a hierarchical dataset of ruffed grouse sampled at stations (plots) within roadway routes (sites). The relationship between the chances of observing a grouse at a station and wind speed during the observation is analyzed using nine methods, including linear models (LMs), generalized linear models (GLMs), linear mixed models (LMMs), and generalized linear mixed models (GLMMs). Having so many methods for analyzing the same dataset raises the question of which is best.

Chapter 2, Good Statistical Properties

Which method is best depends on the question and the data, and it is not always the obvious one. Chapter 2 presents the statistical tools for deciding which method is best to analyze a correlated dataset. The chapter discusses properties of statistical estimators, such as bias and precision, and the characteristics of good hypothesis tests, specifically proper type I error control and high statistical power. It gives a very fast overview of mathematical statistics and then applies it to the grouse dataset presented in Chapter 1.

Chapter 3, Phylogenetic Comparative Methods

There is a close relationship between hierarchical data and phylogenetic data, and the same approaches can be used for their analyses. Chapter 3 employs the tools presented in Chapter 2 to evaluate common methods applied in phylogenetic analyses used to compare among species or other phylogenetic units. I also show the not-so-nice consequences of ignoring the possible correlation generated by phylogenetic relationships among species.

Chapter 4, Phylogenetic Community Ecology

Community data have both hierarchical structure (e.g., samples taken from plots nested within sites) and phylogenetic structure (e.g., related species occurring more often in the same sites). Combining methods for analyzing hierarchical data and phylogenetic data produces Phylogenetic GLMMs (PGLMMs) that are useful in a broad class of ecological community studies. This chapter uses PGLMMs to investigate different types of questions about community structure, and assesses the properties of the models. This material is only covered very technically in the primary literature, and the R packages that can perform the analyses are just being developed. Therefore, Chapter 4 could function as a manual for the phylogenetic community models discussed.

Downloading this book from leanpub.com

You can download this book for free at leanpub.com. If you have come across the book in some other way, could I ask you to get it from leanpub.com? This is for three reasons. First, the package you download from leanpub.com will contain the latest version of the R code. Second, leanpub.com will send out an email to people who have downloaded the book whenever I update it. Since the book is a work in progress, this might help you. Third, leanpub.com keeps track of the downloads, and the more there are, the more likely I’ll update the book.

Background you’ll need

Although the book is titled an introduction, it is an introduction to the concepts behind the methods discussed, not so much the methods themselves. It assumes that you understand basic statistical concepts (such as random variables) and know R and how to run mixed and phylogenetic models. I think that in many cases, the best way of learning is by doing. On the other hand, there is no substitute for getting a good background in the basics of statistical analyses and R before launching off into the more complicated material in this book.

R Code

R code is provided for all analyses in the book. I’ve pasted chunks of the code into the book, but I’ve left out a lot of things like formatting details, creating plots, etc. I wanted the book to be usable while running the R code but also to be readable in its own right.

Exercises

For each chapter, I have exercises that ask you to modify the code that I’ve presented to answer specific questions. All of the exercises have code for the answers that I have kept as a separate file in the downloadable R code. I’m always interested in interesting exercises, so if you have suggestions, please let me know.

References

I have used references throughout the book very lightly, mainly to refer to very specific issues. Probably more useful are the general books below. These are books I’ve used a lot, although I’m sure there are other books just as good. I’m interested in getting your recommendations for good books, so please let me know.

Efron B. and Tibshirani R. J. 1993. An introduction to the bootstrap. Chapman and Hall, New York.

Gelman A. and Hill J. 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, New York, NY.

Judge G. G., Griffiths W. E., Hill R. C., Lutkepohl H., and Lee T.-C. 1985. The theory and practice of econometrics. Second edition. John Wiley and Sons, New York.

Larsen R. J. and Marx M. L. 1981. An introduction to mathematical statistics and its applications. Prentice-Hall, Inc., Englewood Cliffs, N. J.

McCullagh P. and Nelder J. A. 1989. Generalized linear models. Second edition. Chapman and Hall, London.

Neter J., Wasserman W., and Kutner M. H. 1989. Applied linear regression models. Richard D. Irwin, Inc., Homewood, IL.

Feedback

Please, I want and need your feedback. I wanted to self-publish this book, because it means I can update it quickly. I know it can be better than it is. I would appreciate it if you sent comments; email is the easiest way to get hold of me:

arives@wisc.edu

Acknowledgments

This book is the product of many people. The general ideas come from a class I teach at UW-Madison for graduate students, and they have all had a huge impact on how I think about and try to explain statistics. The more proximate origin of the book is a workshop I gave in 2018 at the Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, which followed the same outline. Participants in this workshop provided great help in honing the content and messages. I am indebted to Professors Chen Jin and Wang Bo for hosting my visit.

I also thank Li Daijiang for all of his work developing, cleaning, and speeding the communityPGLMM() code that is the main tool used for Chapter 4. I wish I had his skills. Michael Hardy also kindly allowed me to model the example used in Chapters 1 and 2 on his real dataset. Li Daijiang, Joe Phillips, Tanjona Ramiadantsoa, and Xu Fangfang provided thoughtful comments on parts or all of the manuscript, although I’m responsible for all the lingering errors.

Finally, this work has been supported by the National Science Foundation through various grants, and I am very grateful for this support.
