SNW Data Analysis
Your Data. Your Direction.

Just for Fun: Data Analysis Adventures in Everyday Life

Tomato Time: Which Model is it Anyway?


We had a frost on May 9, just one day before the mean of the posterior predictive distribution for my model! Fortunately, guided by my model, I had not yet planted my tomatoes. It is now May 20, and we have passed the upper end of the 94% highest density interval (HDI) of my model’s posterior predictive distribution. The seedlings are well situated in my outdoor raised beds, and there is no hint of frost in the 10-day forecast. Of course, to completely rule out the last frost dates suggested by common wisdom, I will have to wait until the end of the month.

So, does this validate my choice of model?  And how did I select a model that draws from a normal distribution with a changing mean? I compared five different models and made my selection based on several diagnostic and validation criteria. When I initially viewed the data, I had some concerns that it had too much skew to be modeled by a normal distribution, so I also compared the data against a Gamma distribution.  The Gamma distribution is often used in engineering to predict time to failure for components.  This could make sense if I think of the time from the last freeze to some arbitrary summer date as a time to failure. To use the Gamma distribution without encountering divergences from dividing by zero, I had to transform the data to show days before the last known freeze date of June 13.
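As a rough sketch of this comparison, the two candidate distributions can be fit to the transformed data with SciPy's maximum-likelihood estimators. The data below are synthetic stand-ins for my observations, and point estimates stand in for PyMC's full Bayesian treatment; the transform to "days before June 13" keeps the gamma distribution's support positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for the real observations: last frost expressed as
# days *before* the latest known freeze date of June 13 (day 164), the
# transform needed to keep the gamma distribution's support positive.
days_before = rng.gamma(8.0, 4.0, size=50)

# Maximum-likelihood fits for the two candidate distributions.
shape, loc, scale = stats.gamma.fit(days_before, floc=0)  # pin loc at 0
mu, sd = stats.norm.fit(days_before)

print(f"gamma fit:  shape={shape:.1f}, scale={scale:.1f}")
print(f"normal fit: mean={mu:.1f} days before June 13 (sd={sd:.1f})")
```

Overlaying both fitted densities on a histogram of `days_before` reproduces the ambiguity in the figure below: with only a few dozen observations, the two shapes are hard to tell apart by eye.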

[Figure: histogram of last frost dates with a gamma and a normal distribution plotted over it; it is ambiguous whether either distribution matches the histogram better.]

By default, PyMC estimates each parameter by running four chains of 1000 posterior draws. I found that the effective sample size (ESS) for the mean dropped from the nominal 4000 to less than 500, indicating significant autocorrelation in the initial model. Switching to a normal distribution improved the ESS to over 600, but that still means fewer than a quarter of the simulated draws contribute independent information to the estimate. Allowing the mean to vary linearly in time produced acceptable ESS values (greater than 1000 for all parameters), whether the model drew from a gamma or a normal distribution. I therefore concluded that any successful model should allow for a changing mean.
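To illustrate why autocorrelation shrinks the effective sample size, here is a simplified single-chain ESS estimator (ArviZ's `az.ess`, which PyMC reports, uses a more careful multi-chain version of the same idea). An independent chain keeps nearly all of its nominal samples; a strongly autocorrelated one does not:

```python
import numpy as np

def ess(chain):
    """Rough effective sample size: N / (1 + 2 * sum of autocorrelations).
    A simplified, single-chain version of what ArviZ computes."""
    n = len(chain)
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    total = 0.0
    for rho in acf[1:]:          # sum lags until the ACF first goes negative
        if rho < 0:
            break
        total += rho
    return n / (1.0 + 2.0 * total)

rng = np.random.default_rng(0)
iid = rng.normal(size=4000)       # independent draws: ESS near 4000
ar = np.empty(4000)               # AR(1) chain: heavy autocorrelation
ar[0] = 0.0
for t in range(1, 4000):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()

print(round(ess(iid)), round(ess(ar)))
```

The autocorrelated chain's ESS lands at a small fraction of 4000, which is exactly the symptom the constant-mean models showed.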

Furthermore, if we believe that May 9 is the last frost date in 2025, we can see that it occurred earlier than either model with a constant mean would predict.  This supports the choice of incorporating a changing mean.

| Model | Mean Predicted Last Frost Date | 3% HDI | 97% HDI |
| --- | --- | --- | --- |
| Gamma (Constant Mean) | May 17 | May 12 | May 23 |
| Normal (Constant Mean) | May 16 | May 11 | May 21 |
| Gamma (Changing Mean) | May 5 | April 28 | May 12 |
| Normal (Changing Mean) | May 10 | May 3 | May 17 |
| Normal (Changing Mean and SD) | May 10 | May 3 | May 16 |

Comparing the fit of posterior distributions of the remaining models indicated a better match for the normal distribution.  There was no noticeable improvement when the standard deviation was also allowed to vary.

[Figure: posterior predictive traces with the data plotted over them for three models: (1) gamma with a changing mean, (2) normal with a changing mean, and (3) normal with changing mean and standard deviation. For the gamma model the data overlap the traces but are shifted toward the far right; for both normal models the data are well centered, though the traces for the changing-mean-and-SD model are spread slightly wider vertically.]

Finally, since the normal distribution doesn’t require that the last frost date be measured in days before the latest known freeze date, I ran the model using the normal distribution with the last frost date measured in days since January 1 to verify nothing changed. This removed the peculiar negative dates from the model. In the plots above, negative dates in the distributions just correspond to dates after June 13, but I prefer expressing dates with positive values.

[Figure: posterior predictive distributions for the normal model with changing mean and the normal model with changing mean and SD. The two look similar apart from a slight difference in shape at the peak; each has a mean of 130, with a 94% HDI of 123 to 137 for the changing-mean model and 123 to 136 for the changing-mean-and-SD model.]

The posterior predictive distributions for the normal models with a changing mean or with a changing mean and standard deviation are very similar. Both remaining models also meet the criterion of a Bayesian p-value close to the ideal value of 0.5, indicating that the model's simulated values exceed the observed data about half the time. This metric is slightly closer to 0.5 for the model with changing mean and standard deviation.
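This p-value falls straight out of the posterior predictive draws: count how often a simulated value exceeds the corresponding observation. A minimal sketch with synthetic arrays standing in for PyMC's output (a well-calibrated model should land near 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for real PyMC output: 4000 posterior predictive draws for
# each of 50 observations, plus the observations themselves.
observed = rng.normal(130, 12, size=50)
pp_draws = rng.normal(130, 12, size=(4000, 50))

# Fraction of simulated values above the observed value, averaged over
# draws and observations; near 0.5 indicates good calibration.
p_value = (pp_draws > observed).mean()
print(f"Bayesian p-value: {p_value:.3f}")
```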

[Figure: for each observation, the proportion of predicted values less than or equal to the observed value. Both plots fall within an accepted gray band, with peaks at low and high fractions of the observations and a dip around 40%. The Bayesian p-value is 0.506 for the model with changing mean and 0.504 for the model with changing mean and standard deviation.]

Leave-one-out (LOO) estimation of predicted values is a way to determine which model better predicts each observed value based on all the rest. In general terms, each data point is removed in turn, and the model fitted to the remaining data is used to predict it. The error for each point is then combined into a single metric, which would be 0 if the model predicted every point perfectly. This metric is very slightly better for the model with a changing mean and standard deviation.
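In practice this is computed with a fast approximation (ArviZ's `az.loo`) rather than by actually refitting, but the brute-force version makes the idea concrete. A sketch under simplifying assumptions: synthetic data, a plain constant-mean normal fit instead of the Bayesian model, and root-mean-square error as the combined metric:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(130, 12, size=40)   # synthetic last-frost days since Jan 1

def loo_error(data):
    """Brute-force leave-one-out: drop each point, predict it from a
    normal fitted to the rest, and combine the per-point errors.
    Zero would mean every point was predicted exactly."""
    errors = []
    for i in range(len(data)):
        rest = np.delete(data, i)
        mu, _ = stats.norm.fit(rest)   # prediction = fitted mean
        errors.append((data[i] - mu) ** 2)
    return np.sqrt(np.mean(errors))    # root-mean-square LOO error

rmse = loo_error(data)
print(f"LOO RMSE: {rmse:.1f} days")
```

For a constant-mean model this error hovers around the data's own spread; a model whose mean tracks the year can do better, which is what the small LOO improvement reflects.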

However, these improvements in the model statistics could be from overfitting. As I include more parameters in a model, its fit can better match random variation that is not due to any of the causal factors. The table below shows the means for the estimated parameters and their highest density intervals in an attempt to clarify whether these models represent such overfitting. In this table, mu represents the distribution’s mean, sigma represents the distribution’s standard deviation, y is the number of years since 1971, and a bar indicates the mean of the predictive posterior distribution for the parameter.

| Normal with changing mean | Normal with changing mean and SD |
| --- | --- |
| μ = β₀ + β₁y | μ = β₀ + β₁y |
| σ constant | σ = β₂ + β₃y |
| β̄₀ = 142 | β̄₀ = 143 |
| 94% HDI for β₀: 136 to 149 | 94% HDI for β₀: 136 to 149 |
| β̄₁ = -0.23 | β̄₁ = -0.24 |
| 94% HDI for β₁: -0.43 to -0.031 | 94% HDI for β₁: -0.45 to -0.039 |
| σ̄ = 12 | β̄₂ = 14 |
| 94% HDI for σ: 10 to 15 | 94% HDI for β₂: 10 to 18 |
|  | β̄₃ = -0.054 |
|  | 94% HDI for β₃: -0.16 to 0.059 |

The estimated parameters and their 94% HDI for each model strongly suggest that the improvement in the model when the standard deviation varies is not meaningful. The 94% HDI for the parameter representing the slope of the changing standard deviation contains 0. When the standard deviation is allowed to vary, it likely does not! Therefore, I concluded the model drawing from a normal distribution with a varying mean and constant standard deviation discussed in the previous post is the best selection given the last frost data.
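This check can be made concrete by computing the HDI directly from the posterior samples and testing whether it straddles zero. Below is a minimal narrowest-interval HDI (the quantity ArviZ's `az.hdi` reports), run on synthetic samples chosen to match the reported summary for β₃; the real check would use the actual trace:

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples."""
    x = np.sort(samples)
    n_keep = int(np.ceil(prob * len(x)))
    widths = x[n_keep - 1:] - x[:len(x) - n_keep + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + n_keep - 1]

rng = np.random.default_rng(3)
# Stand-in for the beta-three posterior: centered on its reported mean
# of -0.054, with spread matching the reported interval.
beta3 = rng.normal(-0.054, 0.058, size=4000)

lo, hi = hdi(beta3)
print(f"94% HDI: ({lo:.3f}, {hi:.3f}); contains 0: {lo < 0 < hi}")
```

An interval that contains 0 means the data are consistent with a non-varying standard deviation, which is what tipped the decision back to the simpler model.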

