SNW Data Analysis
Your Data. Your Direction.

Just for Fun: Data Analysis Adventures in Everyday Life

Tomato Time: Which Model is it Anyway?


We had a frost on May 9, just one day before the mean of the posterior predictive distribution for my model! Fortunately, guided by my model, I had not yet planted my tomatoes. It is now May 20, and we have passed the upper end of the 94% highest density interval (HDI) of my model’s posterior predictive distribution. The seedlings are well situated in my outdoor raised beds, and there is no hint of frost in the 10-day forecast. Of course, to completely rule out the last frost dates suggested by common wisdom, I will have to wait until the end of the month.

So, does this validate my choice of model?  And how did I select a model that draws from a normal distribution with a changing mean? I compared five different models and made my selection based on several diagnostic and validation criteria. When I initially viewed the data, I had some concerns that it had too much skew to be modeled by a normal distribution, so I also compared the data against a Gamma distribution.  The Gamma distribution is often used in engineering to predict time to failure for components.  This could make sense if I think of the time from the last freeze to some arbitrary summer date as a time to failure. To use the Gamma distribution without encountering divergences from dividing by zero, I had to transform the data to show days before the last known freeze date of June 13.
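As a rough sketch of this comparison, the two candidate distributions can be fit to the transformed data with SciPy's maximum-likelihood estimators. The data below are synthetic stand-ins for my observations, and point estimates stand in for PyMC's full Bayesian treatment; the transform to "days before June 13" keeps the gamma distribution's support positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for the real observations: last frost expressed as
# days *before* the latest known freeze date of June 13 (day 164), the
# transform needed to keep the gamma distribution's support positive.
days_before = rng.gamma(8.0, 4.0, size=50)

# Maximum-likelihood fits for the two candidate distributions.
shape, loc, scale = stats.gamma.fit(days_before, floc=0)  # pin loc at 0
mu, sd = stats.norm.fit(days_before)

print(f"gamma fit:  shape={shape:.1f}, scale={scale:.1f}")
print(f"normal fit: mean={mu:.1f} days before June 13 (sd={sd:.1f})")
```

Overlaying both fitted densities on a histogram of `days_before` reproduces the ambiguity in the figure below: with only a few dozen observations, the two shapes are hard to tell apart by eye.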

[Figure: histogram of last frost dates with a gamma and a normal distribution plotted over it; it is ambiguous whether either distribution matches the histogram better.]

By default, PyMC estimates each parameter by running four chains of 1000 posterior draws. I found that the effective sample size (ESS) for the mean dropped from the nominal 4000 to less than 500, indicating significant autocorrelation in the initial model. Switching to a normal distribution improved the ESS to over 600, but that still means fewer than a quarter of the simulated draws contribute independent information to the estimate. Allowing the mean to vary linearly in time produced acceptable ESS values (greater than 1000 for all parameters), whether the model drew from a gamma or a normal distribution. I therefore concluded that any successful model should allow for a changing mean.
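To illustrate why autocorrelation shrinks the effective sample size, here is a simplified single-chain ESS estimator (ArviZ's `az.ess`, which PyMC reports, uses a more careful multi-chain version of the same idea). An independent chain keeps nearly all of its nominal samples; a strongly autocorrelated one does not:

```python
import numpy as np

def ess(chain):
    """Rough effective sample size: N / (1 + 2 * sum of autocorrelations).
    A simplified, single-chain version of what ArviZ computes."""
    n = len(chain)
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    total = 0.0
    for rho in acf[1:]:          # sum lags until the ACF first goes negative
        if rho < 0:
            break
        total += rho
    return n / (1.0 + 2.0 * total)

rng = np.random.default_rng(0)
iid = rng.normal(size=4000)       # independent draws: ESS near 4000
ar = np.empty(4000)               # AR(1) chain: heavy autocorrelation
ar[0] = 0.0
for t in range(1, 4000):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()

print(round(ess(iid)), round(ess(ar)))
```

The autocorrelated chain's ESS lands at a small fraction of 4000, which is exactly the symptom the constant-mean models showed.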

Furthermore, if we believe that May 9 is the last frost date in 2025, we can see that it occurred earlier than either model with a constant mean would predict.  This supports the choice of incorporating a changing mean.

| Model | Mean Predicted Last Frost Date | 3% HDI | 97% HDI |
| --- | --- | --- | --- |
| Gamma (Constant Mean) | May 17 | May 12 | May 23 |
| Normal (Constant Mean) | May 16 | May 11 | May 21 |
| Gamma (Changing Mean) | May 5 | April 28 | May 12 |
| Normal (Changing Mean) | May 10 | May 3 | May 17 |
| Normal (Changing Mean and SD) | May 10 | May 3 | May 16 |

Comparing the fit of posterior distributions of the remaining models indicated a better match for the normal distribution.  There was no noticeable improvement when the standard deviation was also allowed to vary.

[Figure: posterior predictive traces with the data plotted over them for three models: (1) gamma with a changing mean, (2) normal with a changing mean, and (3) normal with changing mean and standard deviation. For the gamma model the data overlap the traces but are shifted toward the far right; for both normal models the data are well centered, though the traces for the changing-mean-and-SD model are spread slightly wider vertically.]

Finally, since the normal distribution doesn’t require that the last frost date be measured in days before the latest known freeze date, I ran the model using the normal distribution with the last frost date measured in days since January 1 to verify nothing changed. This removed the peculiar negative dates from the model. In the plots above, negative dates in the distributions just correspond to dates after June 13, but I prefer expressing dates with positive values.

[Figure: posterior predictive distributions for the normal model with changing mean and the normal model with changing mean and SD. The two look similar apart from a slight difference in shape at the peak; each has a mean of 130, with a 94% HDI of 123 to 137 for the changing-mean model and 123 to 136 for the changing-mean-and-SD model.]

The posterior predictive distributions for the normal models with a changing mean or with a changing mean and standard deviation are very similar. Both remaining models also meet the criterion of a Bayesian p-value close to the ideal value of 0.5, indicating that the model's simulated values exceed the observed data about half the time. This metric is slightly closer to 0.5 for the model with changing mean and standard deviation.
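This p-value falls straight out of the posterior predictive draws: count how often a simulated value exceeds the corresponding observation. A minimal sketch with synthetic arrays standing in for PyMC's output (a well-calibrated model should land near 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for real PyMC output: 4000 posterior predictive draws for
# each of 50 observations, plus the observations themselves.
observed = rng.normal(130, 12, size=50)
pp_draws = rng.normal(130, 12, size=(4000, 50))

# Fraction of simulated values above the observed value, averaged over
# draws and observations; near 0.5 indicates good calibration.
p_value = (pp_draws > observed).mean()
print(f"Bayesian p-value: {p_value:.3f}")
```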

[Figure: for each observation, the proportion of predicted values less than or equal to the observed value. Both plots fall within an accepted gray band, with peaks at low and high fractions of the observations and a dip around 40%. The Bayesian p-value is 0.506 for the model with changing mean and 0.504 for the model with changing mean and standard deviation.]

Leave-one-out (LOO) estimation of predicted values is a way to determine which model better predicts each observed value based on all the rest. In general terms, each data point is removed in turn, and the model fitted to the remaining data is used to predict it. The error for each point is then combined into a single metric, which would be 0 if the model predicted every point perfectly. This metric is very slightly better for the model with a changing mean and standard deviation.
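In practice this is computed with a fast approximation (ArviZ's `az.loo`) rather than by actually refitting, but the brute-force version makes the idea concrete. A sketch under simplifying assumptions: synthetic data, a plain constant-mean normal fit instead of the Bayesian model, and root-mean-square error as the combined metric:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(130, 12, size=40)   # synthetic last-frost days since Jan 1

def loo_error(data):
    """Brute-force leave-one-out: drop each point, predict it from a
    normal fitted to the rest, and combine the per-point errors.
    Zero would mean every point was predicted exactly."""
    errors = []
    for i in range(len(data)):
        rest = np.delete(data, i)
        mu, _ = stats.norm.fit(rest)   # prediction = fitted mean
        errors.append((data[i] - mu) ** 2)
    return np.sqrt(np.mean(errors))    # root-mean-square LOO error

rmse = loo_error(data)
print(f"LOO RMSE: {rmse:.1f} days")
```

For a constant-mean model this error hovers around the data's own spread; a model whose mean tracks the year can do better, which is what the small LOO improvement reflects.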

However, these improvements in the model statistics could be from overfitting. As I include more parameters in a model, its fit can better match random variation that is not due to any of the causal factors. The table below shows the means for the estimated parameters and their highest density intervals in an attempt to clarify whether these models represent such overfitting. In this table, mu represents the distribution’s mean, sigma represents the distribution’s standard deviation, y is the number of years since 1971, and a bar indicates the mean of the predictive posterior distribution for the parameter.

| Normal with changing mean | Normal with changing mean and SD |
| --- | --- |
| μ = β₀ + β₁y | μ = β₀ + β₁y |
| σ constant | σ = β₂ + β₃y |
| β̄₀ = 142 | β̄₀ = 143 |
| 94% HDI for β₀: 136 to 149 | 94% HDI for β₀: 136 to 149 |
| β̄₁ = -0.23 | β̄₁ = -0.24 |
| 94% HDI for β₁: -0.43 to -0.031 | 94% HDI for β₁: -0.45 to -0.039 |
| σ̄ = 12 | β̄₂ = 14 |
| 94% HDI for σ: 10 to 15 | 94% HDI for β₂: 10 to 18 |
|  | β̄₃ = -0.054 |
|  | 94% HDI for β₃: -0.16 to 0.059 |

The estimated parameters and their 94% HDI for each model strongly suggest that the improvement in the model when the standard deviation varies is not meaningful. The 94% HDI for the parameter representing the slope of the changing standard deviation contains 0. When the standard deviation is allowed to vary, it likely does not! Therefore, I concluded the model drawing from a normal distribution with a varying mean and constant standard deviation discussed in the previous post is the best selection given the last frost data.
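This check can be made concrete by computing the HDI directly from the posterior samples and testing whether it straddles zero. Below is a minimal narrowest-interval HDI (the quantity ArviZ's `az.hdi` reports), run on synthetic samples chosen to match the reported summary for β₃; the real check would use the actual trace:

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples."""
    x = np.sort(samples)
    n_keep = int(np.ceil(prob * len(x)))
    widths = x[n_keep - 1:] - x[:len(x) - n_keep + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + n_keep - 1]

rng = np.random.default_rng(3)
# Stand-in for the beta-three posterior: centered on its reported mean
# of -0.054, with spread matching the reported interval.
beta3 = rng.normal(-0.054, 0.058, size=4000)

lo, hi = hdi(beta3)
print(f"94% HDI: ({lo:.3f}, {hi:.3f}); contains 0: {lo < 0 < hi}")
```

An interval that contains 0 means the data are consistent with a non-varying standard deviation, which is what tipped the decision back to the simpler model.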

