title: Threshold Martingales and the Evolution of Forecasts
authors: Foster, Dean P.; Stine, Robert A.
date: 2021-05-14

This paper introduces a martingale that characterizes two properties of evolving forecast distributions. Ideal forecasts of a future event behave as martingales, sequentially updating the forecast to leverage the available information as the future event approaches. The threshold martingale introduced here measures the proportion of the forecast distribution lying below a threshold. In addition to being calibrated, a threshold martingale has quadratic variation that accumulates to a total determined by a quantile of the initial forecast distribution. Deviations from calibration or total volatility signal problems in the underlying model. Calibration adjustments are well known, and we augment these by introducing a martingale filter that improves volatility while guaranteeing smaller mean squared error. Thus, post-processing can rectify problems with calibration and volatility without revisiting the original forecasting model. We apply threshold martingales first to forecasts from simulated models and then to models that predict the winner in professional basketball games.

We study the evolution of probabilistic forecasts of a future event, forecasts that specify a distribution. The distributions of forecasts evolve in the sense that they change as the passage of time reveals more information about the approaching event. For example, we could be predicting demand for holiday gifts. Planning for this demand using a model such as the classic news-vendor problem requires more than a point estimate; one needs a distribution. The forecast distribution would be highly uncertain months ahead of the holidays, and then become increasingly precise as we learn more regarding the impact of economic conditions on consumer spending.
Or, imagine estimating the chances that the home team wins a basketball game. A lead of 5 points early in the game doesn't reveal nearly so much about the chances of a home win as a lead of the same size near the end of the game. In both cases, probabilities change as the target event approaches. Our interest here is quantifying whether the forecasts evolve as they should, using a martingale to define the gold standard. The use of martingales to characterize evolving forecasts is not novel. For example, Heath and Jackson (1994) propose the martingale model of forecast evolution (MMFE) as a means to simulate forecast systems. Suppose that we observe a time series up through time t, say Y_1, Y_2, ..., Y_t, and want to predict the series at some future time T. Everything done in this paper applies to vector time series Y_t ∈ R^d; we stick to the scalar case d = 1 for clarity and to keep the notation simple. We refer to T as the forecast target date (FTD) and denote the forecast of Y_T created at time t < T by Y_{T|t}. One expects the magnitude of the forecast error Y_T − Y_{T|t} to get smaller as t approaches T, but the MMFE allows one to say more. Martingales capture the notion that the forecast Y_{T|t} should incorporate all of the information concerning Y_T that is available at time t. If forecasts Y_{T|t} meet the conditions of the MMFE, one can characterize how the forecasts Y_{T|t} change, or evolve, as the historical data approach the FTD. Furthermore, the MMFE does not require one to know all the details of the forecasting system; the forecast might be the result of a neural network or a heuristic spreadsheet procedure. All that is required is that the forecasts conform to the requirements of a martingale. Given that, one can estimate means and variances from data. These estimates allow one to simulate the forecasting system. Our use of martingales differs from the approach adopted in the MMFE.
We emphasize diagnostic methods intended to check whether the forecasts evolve appropriately. In addition, rather than examine Y_{T|t} directly, we focus on a sequence of probabilities defined by quantiles of the evolving forecast distributions. If the sequence of forecasts is a martingale, then so too are these probabilities. As a result, the threshold probabilities should be calibrated and exhibit a known level of volatility. Deviations from these indicate flaws in the forecasts, which can be corrected by post-processing. The remainder of this paper develops as follows. The following section briefly reviews discrete martingales and their connection to familiar autoregressive models of stationary time series, keeping this paper more self-contained. Section 3 then defines the threshold martingales that define our diagnostic procedure. The following two sections give examples, first simulating properties under known models (Section 4) and then applying the method to basketball scoring (Section 5). The paper concludes with a brief discussion of extensions. An appendix defines the martingale filter that we use in Section 5 to reduce volatility. This section reviews the definition of a martingale and connects it to forecasting, autoregressive processes, and other statistical concepts. Though martingales are likely familiar to many readers as a tool for understanding probability theory, we emphasize the connections between martingales and statistics and show how martingales are related to the evolution of forecasts. In discrete time, a martingale is a stochastic process {X_t} for which the conditional expected value of X_{t+1} given its predecessors is the most recent value,

E(X_{t+1} | X_t, X_{t-1}, ..., X_0) = E(X_{t+1} | F_t) = X_t.    (1)

The sigma field F_t collects all of the information available from X_t, X_{t-1}, ..., X_0.
For {X_t} to be a martingale, all of the relevant information about the future concentrates in the most recently observed value, capturing the notion that a forecast should summarize all that we know about X_{t+1}. An immediate consequence of (1) is that the differences, or changes, of a martingale resemble independent random variables. In particular, martingale differences W_t = X_t − X_{t-1} have mean zero and are uncorrelated:

E W_t = 0 and Cov(W_s, W_t) = 0 for s ≠ t.    (2)

The most well-known example of a martingale is an unbiased random walk. A random walk accumulates a sum of independent random variables, as in the standard discrete Brownian motion B_t = ε_1 + ε_2 + ··· + ε_t. The differences B_t − B_{t-1} = ε_t are, by construction, independent and normally distributed. Most time series are not martingales. For example, stationary ARMA time series models are not martingales. For instance, the first-order autoregression, or AR(1), is not a martingale (unless ρ = 1, in which case it is a random walk). For stationary models, the prediction of Y_t given previous values shrinks Y_{t-1} toward the mean of the process at zero,

Y_t = ρ Y_{t-1} + ε_t, |ρ| < 1, so that Y_{t|t-1} = ρ Y_{t-1}.    (3)

Stationary models such as (3) are mean-reverting whereas a random walk wanders freely. Although a stationary time series is not a martingale, it is easy to define a martingale from its forecasts. There's a hint of a martingale in the one-step-ahead prediction Y_{t|t-1}. This prediction is a conditional expectation of Y_t given the past, just like the one that appears in the definition of a martingale (1). (In general, the sigma field represents all information available at time t, not just that represented by prior random variables. For our purposes, however, we stick to the case F_t = {X_t, X_{t-1}, ...}.) The martingale structure becomes obvious through a sequence of back-substitutions. Starting from (3), plug in the expression from the previous point in time:

Y_t = ρ Y_{t-1} + ε_t = ρ (ρ Y_{t-2} + ε_{t-1}) + ε_t = ··· = Σ_{j≥0} ρ^j ε_{t-j}.    (5)

That is, a first-order autoregression can be represented as a geometrically weighted sum of prior errors.
The last expression in (5) holds if we think of our data as part of an infinitely long time series; otherwise, for a series that begins with Y_1, the sum terminates after finitely many terms. A martingale emerges when we consider how the optimal prediction of Y_T changes as the data approach the target time T. Keep in mind that the martingale describes a sequence of forecasts of a single future value Y_T. The target being forecast does not move (as when extrapolating forecasts farther out in time); instead it is the data available to the forecaster that changes. Because ε_t is independent of Y_s for s < t, the intermediate expressions building up to (5) imply that, for t ≤ T,

Y_{T|t} = ρ^{T−t} Y_t, with E(Y_{T|t+1} | F_t) = ρ^{T−t−1} E(Y_{t+1} | F_t) = Y_{T|t}.    (6)

Because Y_{T|t} is a martingale in t, the changes in the forecasts as t → T are martingale differences and thus uncorrelated. Define Y_{T|0} = E Y_t = μ, the marginal mean of the process (assuming stationarity), and note that Y_{T|T} = Y_T. We can then decompose Y_T as a sum of uncorrelated random variables obtained from a telescoping sum of the martingale differences,

Y_T = μ + Σ_{t=1}^{T} (Y_{T|t} − Y_{T|t-1}).    (7)

This representation then allows us to write the variance of Y_T as

Var(Y_T) = Σ_{t=1}^{T} σ_t², where σ_t² = Var(Y_{T|t} − Y_{T|t-1}).    (8)

We can visualize this algebra to make the process more intuitive. As illustrated in equation (5), a stationary ARMA model can be expressed as a weighted sum of prior error terms, Y_T = Σ_j w_j ε_{T-j} where Σ_j w_j² < ∞. Consider, for example, the forecast Y_{T|T-3}. The forecast consists of the summands that are known at time T − 3, and the rest, w_0 ε_T + w_1 ε_{T-1} + w_2 ε_{T-2}, determine the error of that forecast. At time T − 2, we learn the next component of the sum, w_2 ε_{T-2}. The variance components in (8) are easily identifiable as σ_s² = σ² w_s². Researchers such as Heath and Jackson (1994) and subsequent authors (such as Toktay and Wein, 2001) use this approach to simulate an arbitrary demand forecasting system. Most processes are not martingales; you have to make them. The approach used to build the martingale Y_{T|t} for an autoregression is a special case of a more general construction.
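The decomposition above is easy to check by simulation. The sketch below is our own illustration (not code from the paper; all names are ours): it simulates stationary AR(1) paths, forms the forecasts Y_{T|t} = ρ^{T−t} Y_t, and verifies that the forecast revisions are uncorrelated and that their variances sum to the marginal variance of Y_T.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, T, n = 0.8, 1.0, 20, 40000

# Simulate n independent stationary AR(1) paths Y_t = rho*Y_{t-1} + eps_t.
Y = np.empty((n, T))
Y[:, 0] = rng.normal(0.0, sigma / np.sqrt(1 - rho**2), size=n)  # stationary start
for t in range(1, T):
    Y[:, t] = rho * Y[:, t - 1] + rng.normal(0.0, sigma, size=n)

# Forecast of Y_T made at time t (0-indexed): Y_{T|t} = rho^(T-1-t) * Y_t.
fcst = Y * rho ** np.arange(T - 1, -1, -1)

# Forecast revisions are martingale differences: mean zero and uncorrelated.
d = np.diff(fcst, axis=1)
corr = np.corrcoef(d[:, 5], d[:, 12])[0, 1]

# Telescoping: Var(Y_T) = Var(Y_{T|0}) + sum_t Var(Y_{T|t} - Y_{T|t-1}).
recomposed = fcst[:, 0].var() + d.var(axis=0).sum()
target = sigma**2 / (1 - rho**2)  # marginal variance of the AR(1)
print(round(corr, 3), round(recomposed, 2), round(target, 2))
```

The correlation between revisions at different times is near zero, and the variance components recompose the marginal variance, mirroring equations (7) and (8).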
Consider an arbitrary random variable Z and an increasing sequence of sigma fields F_0 ⊂ F_1 ⊂ ···. It follows from properties of conditional expectation that the sequence of conditional expectations X_t = E(Z | F_t) is a martingale. This construction is the first example of a martingale given by Doob (1953, Chap 7). Conditional expectations act much like projections in linear algebra, providing an orthogonal decomposition of a vector into subspaces. Hence, the conditional expected value of Y_T given past observations defines a martingale:

Y_{T|t} = E(Y_T | F_t), t = 0, 1, ..., T.    (9)

To define a threshold martingale, the random variable Z in (9) indicates whether the future random variable Y_T lies below a threshold τ. Rather than take the expected value of Y_T itself, we consider the chance that Y_T lies below τ. For modeling a continuous random variable, τ is a quantile of the initial forecast distribution. Given past observations up to time t ≤ T, the probability that Y_T ≤ τ is

p_t = P(Y_T ≤ τ | F_t) = E(1_{Y_T ≤ τ} | F_t),    (10)

where 1_{Y_T ≤ τ} is an indicator, a Boolean 0/1 random variable determined by the success or failure of the associated condition. Notice that p_T ∈ {0, 1}. Since the elements of {p_t} are conditional expectations of a bounded random variable with respect to increasing sigma fields, {p_t} is a martingale. We denote the mean value of this martingale π = E(p_t). If τ is chosen to be a specific quantile of the initial forecast distribution, then π is known. Because {p_t} is a martingale, observed sequences should be calibrated with known mean and total volatility. By calibrated, we mean that E(p_t | p_s, s < t) = p_s. Suppose that we observe multiple realizations of {p_t}, say {p_{t,j}} where j = 1, ..., n. Then a scatterplot of p_{t,j} on p_{s,j} (s < t) should cluster along the diagonal x = y. In addition, sequence plots of p_{t,j} on t = 0, ..., T should (on average over j) hover around π, though, being a martingale, there is no mean reversion.
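For a Gaussian AR(1), the threshold probability p_t has a closed form because Y_T given F_t is normal. The following sketch is our own illustration under that Gaussian assumption (names and parameter values are ours): it computes p_t along simulated paths and checks the calibration property that every p_t, including the terminal indicator, averages to π.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

PhiV = np.vectorize(Phi)

rng = np.random.default_rng(1)
rho, sigma, T, tau, n = 0.8, 1.0, 40, 1.0, 8000

marg_sd = sigma / sqrt(1 - rho**2)
Y = np.empty((n, T + 1))
Y[:, 0] = rng.normal(0.0, marg_sd, size=n)
for t in range(1, T + 1):
    Y[:, t] = rho * Y[:, t - 1] + rng.normal(0.0, sigma, size=n)

# For a Gaussian AR(1), Y_T | F_t ~ N(rho^(T-t) Y_t, sigma^2 (1-rho^(2(T-t)))/(1-rho^2)),
# so the threshold probability p_t = P(Y_T <= tau | F_t) has a closed form.
p = np.empty((n, T + 1))
for t in range(T):
    h = T - t
    sd = sigma * sqrt((1 - rho ** (2 * h)) / (1 - rho**2))
    p[:, t] = PhiV((tau - rho**h * Y[:, t]) / sd)
p[:, T] = (Y[:, T] <= tau).astype(float)   # p_T is the 0/1 indicator

pi = Phi(tau / marg_sd)                    # initial probability pi = P(Y_T <= tau)
# Calibration: each column of p averages to pi, with no mean reversion.
print(round(pi, 3), round(p[:, 20].mean(), 3), round(p[:, T].mean(), 3))
```

Averaging over realizations at any fixed t recovers π, exactly the behavior described for the sequence plots above.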
Should the multiple realizations be independent (or uncorrelated), then one can easily compare the sample mean p̄_t = Σ_j p_{t,j}/n to the specified probability π. Failing calibration can be viewed as an embarrassing mistake since there are ways of guaranteeing calibration even if a simple probabilistic model did not generate your data (Foster and Vohra, 1998). When you can assume IID data, often something as simple as the pool adjacent violators algorithm will generate a calibrated forecast (Foster and Stine, 2004). The quadratic variation of {p_t} provides a second diagnostic, primarily because the initial quantile determines the expected total quadratic variation. The event {Y_T ≤ τ} defines a Boolean random variable with mean π and variance π(1 − π). Differencing {p_t} decomposes this variance into contributions from period-to-period changes. The difference of p_T from its mean π telescopes (p_0 = π),

p_T − π = Σ_{t=1}^{T} (p_t − p_{t-1}).    (12)

Because the p_t − p_{t-1} are martingale differences, they are uncorrelated (2). Consequently,

Var(p_T) = π(1 − π) = Σ_{t=1}^{T} E(p_t − p_{t-1})².    (13)

Should the observed total quadratic variation exceed π(1 − π), the process has excess volatility: the forecast distribution is changing more than it should as the data approach the FTD. One might also observe smaller than expected variation, though that has been less common in our experience. In addition to having a known total, the quadratic variation of {p_t} reflects how fast information about Y_T accumulates in the forecasts. Define the partial sum

S_t = Σ_{s=1}^{t} (p_s − p_{s-1})².

S_t is not a martingale, but we find it useful nonetheless as a description of the dynamics of the underlying information about Y_T. Plots of S_t versus t = 1, ..., T show the flow of information as more and more becomes known about Y_T. Small changes p_t − p_{t-1} indicate that little new information has arrived. A simple adjustment converts the sums-of-squares process S_t into a martingale.
In particular, consider

V_t = S_t + p_t (1 − p_t).    (14)

To see directly that V_t is a martingale, observe that

E(V_{t+1} | F_t) = S_t + E((p_{t+1} − p_t)² | F_t) + E(p_{t+1} | F_t) − E(p_{t+1}² | F_t) = S_t + p_t − p_t² = V_t,

since E(p_{t+1}² | F_t) = p_t² + E((p_{t+1} − p_t)² | F_t). Martingale differences associated with V_t have some special structure that is worth pointing out. The differences can be written as

V_t − V_{t-1} = (p_t − p_{t-1})(1 − 2 p_{t-1}).

Hence, the difference V_t − V_{t-1} = 0 when p_{t-1} = 1/2. The visual effect in a plot of V_t − V_{t-1} on V_{t-1} is striking. Remark A. We have two motivations for the construction of V_t. The first is heuristic. Split the total sum of squared differences at some point k into two sums,

Σ_{t=1}^{T} (p_t − p_{t-1})² = Σ_{t=1}^{k} (p_t − p_{t-1})² + Σ_{t=k+1}^{T} (p_t − p_{t-1})².

The first summand is S_k, and the second has conditional expectation p_k(1 − p_k) given F_k. The other motivation relies on familiarity with martingale compensators. The sum of observed squared differences of the martingale p_t is often written as S_t = [p, p]_t, for which the natural compensator is the expected value ⟨p, p⟩_t. We don't know ⟨p, p⟩_t, but this is also the compensator of p_t². Hence by adding p_t(1 − p_t) = p_t − p_t² we subtract the unknown compensator and end up with a martingale. This section illustrates the threshold martingale {p_t} within the context of autoregressive models. We start by simulating {p_t} for a correctly specified autoregression, and then illustrate a mixture of autoregressions. In these examples, the {p_t} are martingales by construction. We emphasize three diagnostics:

Calibration plots: A scatterplot of p_t on any lag p_{t-j} should concentrate along the diagonal.

Calibration regression: Coefficients in the regression of p_t − p_{t-s} on p_{t-s_1}, ..., p_{t-s_k}, for 0 < s < s_1 < ··· < s_k, should be zero.

Cumulative sums of squares: The total variation should average π(1 − π).

We begin with an illustration of the threshold martingale for a single time series. Figure 1 graphs {p_t} for a segment of a simulated first-order autoregression with coefficient ρ = 0.8 and σ = 1. The top frame of the figure shows Y_t for t = 1, ..., 40. The dotted gray line denotes the threshold τ = 1; the target period T = 40 lies at the right side of the graph.
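A quick simulation illustrates both the total quadratic variation π(1 − π) and the compensated process V_t. The sketch below is our own illustration; to keep the closed form for p_t short, it uses a Gaussian random walk rather than an autoregression, and all names are ours.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

PhiV = np.vectorize(Phi)

rng = np.random.default_rng(2)
T, n, tau = 30, 8000, 0.0

W = np.cumsum(rng.normal(size=(n, T)), axis=1)   # random walks, W_t = sum of t steps

# Threshold martingale: p_t = P(W_T <= tau | F_t) = Phi((tau - W_t)/sqrt(T - t)).
p = np.empty((n, T + 1))
p[:, 0] = Phi(tau / sqrt(T))                     # p_0 = pi (here 1/2)
for t in range(1, T):
    p[:, t] = PhiV((tau - W[:, t - 1]) / sqrt(T - t))
p[:, T] = (W[:, T - 1] <= tau).astype(float)

pi = p[0, 0]
S = np.cumsum(np.diff(p, axis=1) ** 2, axis=1)   # S_t: cumulative squared differences
V = S + p[:, 1:] * (1 - p[:, 1:])                # V_t = S_t + p_t(1 - p_t)

# E S_T = pi(1 - pi), and E V_t = pi(1 - pi) at every t because V is a martingale.
print(round(S[:, -1].mean(), 3), round(V[:, 9].mean(), 3), round(pi * (1 - pi), 3))
```

Individual paths of S_t accumulate at very different rates, but the averages of S_T and of V_t at any intermediate time both sit at π(1 − π), matching (13) and (14).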
The lower frame of Figure 1 shows the threshold probabilities p_1, p_2, ..., p_40 for the data in the top frame. The p_t vary around the threshold probability π. The sequence of probabilities is basically constant for t ≪ T because little is known about where the process will be at T = 40 other than what is implied by the known marginal distribution. Variability in the probability martingale reveals how information flows in the underlying process. Periods of higher variation of the conditional probabilities show that the associated data contain more information about the outcome Y_T. Although the time series Y_t meanders, with several excursions above the threshold of interest (τ < Y_t), p_t remains near constant until around t ≈ 25, at which point information in the data begins to reveal the most likely position for Y_T. As shown in (5), an autoregression is a weighted average of prior random inputs, and in this case Y_T = Σ_j 0.8^j ε_{T-j}. Observing Y_t reveals the contribution 0.8^{T-t} ε_t, which is quite small until t approaches the target time. The smallest probability p_t occurs at t = 37 (gray points in the figure) when Y_t is quite large and t is close to T. The conditional probability that Y_T ≤ 1 increases from t = 37 because the subsequent observations decrease. At the end of this sequence, the target Y_T ≤ τ and consequently p_T = 1 (red point in the figure). Rather than isolate a single threshold, one can track probabilities associated with several thresholds at once. The resulting martingale is now a vector of highly correlated probabilities. Figure 2 shows the conditional probabilities for 5 thresholds located at quantiles of the initial forecast distribution. When the autoregressive coefficient is close to 1, observations in the distant past influence the prediction of Y_T almost as much as those closer in time. The level of the gray curve is higher and its total is larger than that for the process with ρ = 0.8. If ρ = 0.995, then P(Y_T ≤ 1) ≈ 0.54.
Verifying the calibration of {p_t} does not depend on the choice of τ or ρ. Because {p_t} is a martingale, the expected value of the process at the next point in time, p_{t+1}, is the most recent value, E(p_{t+1} | F_t) = p_t, regardless of whether we know ρ or π. Hence, if the threshold probabilities form a martingale, a scatterplot of p_t on p_s, s < t, should concentrate on the diagonal line y = x. Figure 4 illustrates this calibration. This scatterplot graphs p_35 on p_34 derived from 400 independent replicates of the autoregressions with either ρ = 0.8 (black) or ρ = 0.995 (gray). Although from different processes, both sets of coordinates align on the diagonal; calibration holds even though other characteristics clearly distinguish the p_t generated by the two processes. A family of multiple regression models nicely summarizes the martingale structure. Consider the regression of p_t on k preceding probabilities p_{t-1}, p_{t-2}, ..., p_{t-k}. The only nonzero term is the prior probability,

E(p_t | p_{t-1}, ..., p_{t-k}) = p_{t-1}.

Only the coefficient of the first lagged probability differs from zero, and it has coefficient 1. More generally, in the regression for which the most recent lag is s_1 < s_2 < ··· < s_k,

E(p_t | p_{t-s_1}, ..., p_{t-s_k}) = p_{t-s_1}.

Claims such as these are easily tested by considering the martingale differences, for which the conditional mean is zero, as in

E(p_t − p_{t-s} | F_{t-s}) = 0.

A regression of p_t − p_{t-s} on any collection of prior values at lags s_j > s should find no signal. Indeed, this property extends more generally to any function of the prior probabilities. The properties of a martingale vary widely from realization to realization. For example, sometimes the terminal value is less than the threshold, Y_T ≤ τ, and sometimes not. To illustrate the variation from case to case, Figure 5 shows 250 independent realizations of threshold martingales p_{t,j} defined by j = 1, ..., N = 250 autoregressions with varying coefficients but a common target probability π. Throughout this section, the subscript j identifies that the value is associated with the jth realization.
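The calibration regression is easy to carry out with ordinary least squares. In the sketch below, our own illustration, a random-walk threshold martingale stands in for the autoregressions; it regresses the martingale difference p_25 − p_20 on earlier probabilities and finds coefficients near zero, as the martingale property predicts.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

PhiV = np.vectorize(Phi)

rng = np.random.default_rng(3)
T, n, tau = 30, 8000, 0.0
W = np.cumsum(rng.normal(size=(n, T)), axis=1)

# Threshold martingale for the random walk: p_t = Phi((tau - W_t)/sqrt(T - t)).
p = np.empty((n, T + 1))
p[:, 0] = Phi(tau / sqrt(T))
for t in range(1, T):
    p[:, t] = PhiV((tau - W[:, t - 1]) / sqrt(T - t))
p[:, T] = (W[:, T - 1] <= tau).astype(float)

# Martingale differences have conditional mean zero given the past, so the
# regression of p_25 - p_20 on earlier probabilities should find no signal.
y = p[:, 25] - p[:, 20]
X = np.column_stack([np.ones(n), p[:, 15], p[:, 10]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))   # intercept and both slopes should be near zero
```

The same regression applied to probabilities that are not a martingale, for example overconfident forecasts, would produce significantly nonzero coefficients.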
These mimic what one would like to find, for example, when comparing demand forecasts for different products; each product has a different threshold, but the probability π is the same for all. For this example, the simulated coefficients of the AR(1) processes are random draws from a Beta distribution, ρ_j ~ iid Beta(8, 2). This is a skewed distribution on the interval [0, 1] with mean 8/10 and SD ≈ 0.12. The inset in Figure 5 shows the probability density. For the jth realization, the threshold τ_j satisfies P(Y_{T,j} ≤ τ_j) = 0.75, which is close to the threshold in the prior examples of a single time series. By choosing a specific threshold for each process, the expected value of every threshold martingale is π = E(1_{Y_{T,j} ≤ τ_j}) = 0.75, even though the underlying time series Y_{t,j} have varying levels of dependence. In the figure, sequences that end at p_{T,j} = 0 (realizations for which τ_j < Y_{T,j}, holding for about 25% of the time series) are colored green, whereas the more numerous series ending at p_{T,j} = 1 are colored gray. By and large, small values of p_{t,j} that occur early in the sequence for small t suggest a sequence that will terminate at 0, but quite a bit of variation occurs close to T. The colors of the sequences in the figure are not previsible and would not be revealed until the final value is observed. Varying the threshold to obtain a fixed probability π = P(Y_{T,j} ≤ τ_j) allows us to combine the variation across series. The accumulated squared martingale differences d_{t,j}² = (p_{t,j} − p_{t-1,j})² sum to π(1 − π), on average,

E Σ_{t=1}^{T} d_{t,j}² = π(1 − π).

Although the variation accumulates at different rates depending on ρ_j (see Figure 3), the total sum-of-squares is the same, on average. Figure 6 shows the average sum-of-squares for the same 250 realizations as shown in Figure 5. Even after averaging over these realizations, the final total is noticeably less than π(1 − π) due to sampling variation.
The light gray lines in the figure show the sums-of-squares for 20 randomly selected series; these remind one of the large variation across the series. The total variation varies from near 0 to far larger than the expected value π(1 − π); some realizations are near constant whereas others are far more volatile. Verifying that the mean of the total accumulated variation matches π(1 − π) in this example using a statistical test is a simple calculation: we only need the mean and its usual standard error. We have a mixture of autoregressions in this collection of time series, implying in general that squared differences of the associated probability martingales accumulate at different rates, E d_{t,j}² ≠ E d_{t,k}² for j ≠ k. Hence looking at measures of variation at intermediate times 0 < t < T would require an adjustment for heteroscedasticity. The total, however, remains π(1 − π) regardless. For the 250 time series shown in Figure 6, the average cumulative value estimates the variance of p_{T,j} = 1_{Y_{T,j} ≤ τ_j}; the average squared deviation from the target probability estimates π(1 − π) = 0.1875. Another useful, visual diagnostic is to inspect the calibration of the probabilities. As noted in Figure 4, the key property of martingales, E(X_t | F_s) = X_s for s ≤ t, holds regardless of the level of dependence. Consequently, a scatterplot of X_t on any X_s, s ≤ t, concentrates on the diagonal line with slope 1. The calibration frames graph adjacent values within a series, averaging over time. We illustrate a simple diagnostic here. Consider the proportion of variation explained in the regression of d_{t,j} on d_{t-1,j}, d_{t-2,j}, ..., d_{t-ℓ,j} for maximum lag ℓ = 4. The changing variation of the probabilities fools a naïve estimate of this goodness-of-fit. As seen in Figure 6, most of the variation in each series occurs near the final time T; each p_t is near constant initially with most changes occurring near the end. The resulting regression of d_{t,j} on lags has a few highly leveraged points, complicating inference.
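One way to see the in-sample optimism is to compare the usual OLS R² with a sequential estimate in which each value is predicted from a model fit only to earlier data (a prequential fit). On pure noise, the in-sample R² with an intercept is always nonnegative, while the sequential version is typically negative. A minimal sketch, with our own helper names and two lags for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
T, ell = 400, 2
d = rng.normal(size=T)                  # white noise standing in for martingale differences

def design(d, t, ell):
    # Row of lagged predictors [1, d_{t-1}, ..., d_{t-ell}].
    return np.concatenate(([1.0], d[t - ell:t][::-1]))

# Prequential fit: predict d_t from coefficients estimated only on data before t.
start = 25
pred = []
for t in range(start, T):
    X = np.array([design(d, s, ell) for s in range(ell, t)])
    b, *_ = np.linalg.lstsq(X, d[ell:t], rcond=None)
    pred.append(design(d, t, ell) @ b)
resid = d[start:] - np.array(pred)
R2_preq = 1 - resid @ resid / ((d[start:] - d[start:].mean()) ** 2).sum()

# In-sample OLS R^2 for comparison (never negative with an intercept).
X = np.array([design(d, s, ell) for s in range(ell, T)])
y = d[ell:T]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
R2_ols = 1 - e @ e / ((y - y.mean()) ** 2).sum()
print(round(R2_preq, 3), round(R2_ols, 3))
```

The sequential statistic hovers at or below zero on unpredictable data, correctly signaling the absence of lagged structure that the in-sample fit exaggerates.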
This form of heteroscedasticity inflates the R² statistic of the complete-data OLS regression. Figure 8 shows a boxplot of these statistics from fitting OLS regressions; these statistics incorrectly indicate that lags explain statistically significant variation, violating the martingale condition. We avoid these complications by estimating the fit of the regression of d_{t,j} on d_{t-1,j}, d_{t-2,j}, ..., d_{t-ℓ,j} using the prequential approach (Dawid, 1984, 1992). The reported statistic is

R_j² = 1 − Σ_t (d_{t,j} − d̂_{t,j})² / Σ_t (d_{t,j} − d̄_j)²,    (17)

where the prequential predictor is

d̂_{t,j} = b̂_{0t} + b̂_{1t} d_{t-1,j} + ··· + b̂_{ℓt} d_{t-ℓ,j}.

The coefficients in d̂_{t,j} are estimated for each series from the data observed prior to time t. Prequential regression predicts d_{t,j} using a model fit to data prior to time t, recursively updating the fit over time. Over 90% of the R_j² are negative, indicating (correctly) that the lagged variables are not predictive. Figure 8 contrasts the distribution of the R_j² to that given by the usual estimator. On average, the models are not predictive. Alternatively, one could avoid the prequential approach and compute t-statistics and an overall F-like statistic using a sandwich-style estimator of variances for the OLS estimates.

Figure 8: Boxplots of the overall reported explained variation for autoregressions fit to the martingale differences. Heteroscedasticity distorts the OLS R² but not the prequential statistic R² shown in equation (17).

This section illustrates the use of martingale diagnostics to evaluate forecasts from two models that predict the winning team in a professional basketball game. The forecast distribution of these models is particularly simple, namely the probability of the home team winning a basketball game. One of these models is intentionally crude whereas the second is more predictive; that said, neither should be taken as a serious competitor to modern ML sports models (e.g., Shi and Song, 2019; Song, Zou and Shi, 2020).
We also illustrate the use of a martingale filter (described in the Appendix) that reduces forecast volatility to the expected level while simultaneously improving the accuracy of forecasts. The data for our example span four seasons of play in the NBA. (We obtained these data from the Kaggle web site https://www.kaggle.com/schmadam97/. The data provide a tabular play-by-play record of each game. We limit our analysis to regular season games ending during the standard 48 minutes, omitting overtime and playoff games. We excluded the two most recent seasons that were shortened or influenced by Covid-19; for example, the home-team advantage weakened in the 2019-2020 season in the absence of court-side fans.) Early quantitative models for the evolution of the probability of a home win made do with much less data; for example, Stern (1994) modeled the progress of the score difference as a Brownian motion. We use the following notation to describe our models. Randomly label the two teams playing in any game Team A and Team B. The binary variable Y_j ∈ {0, 1} denotes whether Team A wins the jth game: Y_j = 1 if Team A wins and Y_j = 0 if Team B wins. The subscript j = 1, 2, ..., 4622 identifies the game throughout. Without further information, the probability that Team A wins game j is 1/2, since the labels are assigned at random; this probability conditions on the trivial sigma field. Now suppose that the game is about to begin and the home team is identified; the binary random variable H_j = 1 if Team A is the home team and 0 otherwise. This random variable defines the initial sigma field F_{0,j} = {H_j}. By symmetry, assume that Team A is the home team. Denote the times when the score is recorded by t_0 = 0, t_1 = 0:15, t_2 = 0:30, ..., t_191 = 47:45, t_192 = 47:59, just before the game ends. These scores define the (time × game) matrix

X_{ij} = (Home Score − Away Score at time t_i in game j), i = 0, ..., 192, j = 1, ..., 4622.

X_{0j} = 0 for all games; all that is known about the game at the tip-off is the identifier of the home team. Subsequent rows in X add the difference in the score.
The sigma field F_{ij} = {H_j, X_{sj} : s ≤ i} identifies the home team and the home-minus-away scores up to time t_i in game j. For each game, define the discrete series

p_{ij} = P(Y_j = 1 | F_{ij}).

By construction, the sequence {p_{ij}} is a discrete martingale in i. The illustrative models defined next estimate these probabilities. The first model uses only the current score difference and the identity of the home team to predict the winner. The model computes the probability of a home win as

p̂_{ij}^(1) = g(α_0 + α_1 X_{ij}),    (18)

where g(x) is the logistic function, g(x) = 1/(1 + e^(−x)). We estimate the two parameters α̂_0 = 0.276 (s.e. 0.004) and α̂_1 = 0.1681 (s.e. 0.0006) in (18) from the training data, pooling score differences from all of the games and fitting a single model. These standard errors are too small as they treat every observation, both within and across games, as independent. A passing familiarity with basketball suggests that this logistic regression has a serious flaw: it weights all score differences equally regardless of the time left in the game. The probability of a home win if the score is tied remains 1/(1 + exp(−α_0)) ≈ 0.568 regardless of the time remaining. Or, a home lead of 5 points in the first quarter is just as predictive of a home win as a lead of 5 points with 10 seconds left in the game. As a result, the probability of winning based on early score differences is more volatile than it should be. The second model takes account of the time remaining in a game when estimating the probability of a home win. This model interacts the score difference X_{ij} with a smooth function w_ℓ(t) of the game time, where w_ℓ(t) is an ℓth-degree polynomial in the game time. The weight function w_ℓ(t) allows the importance of the score difference to evolve over a game. We experimented with several choices for the degree of the polynomial and settled on ℓ = 7. Rather than list the coefficients, we graph the fitted weight function along with the slope of X_{i=24,j} in a simple logistic regression. At this point in the game, the slope is much smaller than that of the first model.
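Plugging the reported estimates α̂_0 = 0.276 and α̂_1 = 0.1681 into (18) reproduces the probabilities quoted in the text. The helper names in this sketch are ours, not the paper's:

```python
from math import exp

def g(x):
    # Logistic function g(x) = 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + exp(-x))

def p_home_win(score_diff, a0=0.276, a1=0.1681):
    # Model (18): probability of a home win given the current home-minus-away
    # score difference, ignoring the time remaining in the game.
    return g(a0 + a1 * score_diff)

tied = p_home_win(0)       # tied score: same answer whatever the time remaining
lead5 = p_home_win(5)      # a 5-point lead counts the same early or late
trail10 = p_home_win(-10)  # home team trailing by 10
print(round(tied, 3), round(lead5, 3), round(trail10, 3))
```

The tied-score probability matches the paper's 0.568, and the flaw noted in the text is visible in the code: `p_home_win` has no time argument, so a 5-point lead in the first quarter and a 5-point lead with 10 seconds left give the same probability.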
The polynomial ŵ_7(t) smooths the estimates and steadily grows as the game time increases. As is typical for polynomials, ŵ_7(t) appears too volatile at the boundaries. The fit of the polynomial was not constrained but nonetheless gives minimal weight to the initial score differences. This model estimates the initial probability of a home win to be 0.589, closer to the overall proportion of home wins in the training data. (We define this model in R with the formula home_win ~ score_diff * poly(game_min, degree).) The fitted model includes a polynomial in the game time; none of the estimates β̂_1, ..., β̂_ℓ is statistically significant. The resulting accumulating volatility (compare to equation 13) that appears in the right panel of the figure ends far above the target level 0.5² = 0.25 for this game for both logistic regression models. The volatility of the estimator produced by the martingale filter ends very near the expected total 0.25 for this game. The combination of lower volatility and lower error for the corrected predictor is typical for the games in the test set. Figure 11 compares the performance of the logistic regression models and the martingale filter in the test set. On the left, the histogram compares the MSE of the martingale filter to the polynomial-weighted logistic regression. The distribution is ever so slightly shifted to the right. The difference in average MSE is only 0.0009, but with standard error 0.00015, the shift is highly statistically significant (z ≈ 5.5). The improvement to the volatility summarized by the boxplots on the right of Figure 11 is more notable. The probability estimates from the simple logistic model are far more volatile than appropriate for a martingale sequence. Though smaller in expectation, the MSE of the martingale filter is not uniformly smaller than that of the source predictor; i.e., in some games the original predictor has smaller MSE.
Some things that would be useful to include or explore are:

Information flows: The rate at which the volatility accumulates measures information flows that are useful for demand planning. Consider long-lead ordering, such as required when one places orders for fashion items many months ahead of the season. If the forecast volatility remains small until, say, 1 month before the FTD, then there would be little gain from waiting to order 1 month out rather than 6 months out, and ordering early could secure a pricing advantage.

Vector case: The analysis here treats the univariate martingale defined by a single threshold. One could combine several thresholds as shown in Figure 2 to monitor the evolution of the full forecast distribution.

Statistical significance: Testing for excess volatility is straightforward given one observes a collection of independent realizations, as illustrated here with simulations and basketball games. Many other applications, however, provide highly dependent realizations, as in the case of demand forecasting. This paper does not address the question of statistical significance when the realizations are dependent. We will address that elsewhere.

Figure 11: The histogram on the left shows the differences in MSE between the polynomial probability p̂^(2) and the martingale filter p̂^(3) for games in the test set. The improvement is significant (z = 5.5). Boxplots show the reduction in volatility produced first by taking account of the game time and then by applying the martingale filter.

The following procedure, which we call a martingale filter, reduces forecast volatility while improving the accuracy of prediction. One can apply this filter to any collection of forecasts, not just those interpreted as a probability, so we switch notation for this appendix and write a sequence of forecasts for a future event as the vector Ŷ with elements Ŷ_t, Ŷ = (Ŷ_1, ..., Ŷ_T)′. We make no further assumptions about the method used to produce these forecasts.
The following description of the filter illustrates the method as applied in the basketball example in Section 5; it is not fully general and is more expository than rigorous. Given the success in this application, we plan to develop the method further and report results in a separate paper.

We proceed by constructing a martingale representation of the predictors. We decompose $\hat Y_t$ as the sum
$$\hat Y_t = \sum_{s=0}^{t} M_{t|s},$$
where the summands $M_{t|s}$ satisfy $M_{t|s} \in \mathcal F_s$, with $M_{t|0} \in \mathbb R$ and $E(M_{t|s} \mid \mathcal F_{s-1}) = 0$ for $s = 1, 2, \ldots, t$. In our example that predicts the chance of winning a basketball game, the first subscript $t$ in $M_{t|s}$ before the bar denotes the "game time," and the second subscript $s$ denotes the conditioning information (sigma field). We treat time as discrete, such as by sampling the score every 15 seconds as done in the example.

A simplifying assumption provides an explicit construction. Assume that we can represent each sigma field $\mathcal F_t$ as generated by independent random variables, $\mathcal F_t = \sigma\{Q_1, \ldots, Q_t\}$, where $Q_s$ is uncorrelated with $Q_t$ for $s \neq t$ and each has unit variance, $\mathrm{Var}(Q_t) = 1$.⁶ The collection $\{Q_1, \ldots, Q_t\}$ thus forms an orthonormal basis for $\mathcal F_t$. To exploit this basis, note that the representation (23) defines a triangular array. The assumed basis allows us to write $M_{t|s} = r_{ts} Q_s$, so that each forecast $\hat Y_t$ is a weighted mixture of the basis elements $Q_1, \ldots, Q_t$:
$$\hat Y_1 = r_{11} Q_1, \qquad \hat Y_2 = r_{21} Q_1 + r_{22} Q_2, \qquad \hat Y_3 = r_{31} Q_1 + r_{32} Q_2 + r_{33} Q_3, \qquad \ldots$$
We have what amounts to a Gram-Schmidt decomposition of the vector of predictors, $\hat Y = R\,Q$, where $R$ is a $T \times T$ lower triangular matrix with elements $r_{ts}$, $s \le t$, and $Q$ is a random vector with elements $Q = (Q_1, \ldots, Q_T)'$. For the basketball application, we have $n$ independent observations of the random vectors $\hat Y$ and $Q$. We arrange these observations as rows of an $n \times T$ matrix, and we have what amounts to a QR decomposition of the matrix of predictors:
$$[\hat Y_1, \ldots, \hat Y_T] = [Q_1, \ldots, Q_T]\,[r_{st}]_{s \le t} = \mathbf Q\,\mathbf R.$$
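This sample decomposition can be sketched with numpy. Here `Yhat` is a hypothetical $n \times T$ matrix whose rows are independent forecast paths, simulated only for illustration; `numpy.linalg.qr` returns an orthonormal basis $\mathbf Q$ and an upper triangular weight matrix $\mathbf R$, so that column $t$ of `Yhat` is a weighted mixture of the first $t$ basis columns, as in (26). (Numpy normalizes the basis columns to unit norm rather than unit variance; this scaling does not affect the construction.)

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 500, 8                       # games (rows) and time points (columns)

# Hypothetical forecast paths: cumulative sums of independent increments,
# standing in for the martingale components M_{t|s}.
M = rng.standard_normal((n, T))
Yhat = np.cumsum(M * np.linspace(0.2, 1.0, T), axis=1)

# QR decomposition: Yhat[:, t] = sum over s <= t of R[s, t] * Q[:, s].
Q, R = np.linalg.qr(Yhat)

assert np.allclose(Q @ R, Yhat)            # exact reconstruction
assert np.allclose(np.tril(R, -1), 0.0)    # R is upper triangular
```

The entry `R[s, t]` plays the role of the weight $r_{ts}$ of basis element $Q_s$ in the forecast $\hat Y_t$.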
The martingale filter alters the weights of this decomposition. The filter enforces a consistent weighting of the information in each basis element (or, more generally, the information in each sigma field). It does this by replacing each column of $R$ by the average of the nonzero elements in that column, $\bar r_s = \sum_{t=s}^{T} r_{ts} / (T - s + 1)$. Returning to the triangular equations in (26), we average the elements in the columns as suggested here:
$$\begin{array}{ccccc}
r_{11} Q_1 & & & & \\
r_{21} Q_1 & r_{22} Q_2 & & & \\
r_{31} Q_1 & r_{32} Q_2 & r_{33} Q_3 & & \\
\vdots & \vdots & \vdots & \ddots & \\
r_{T1} Q_1 & r_{T2} Q_2 & r_{T3} Q_3 & \cdots & r_{TT} Q_T \\ \hline
\bar r_1 Q_1 & \bar r_2 Q_2 & \bar r_3 Q_3 & \cdots & \bar r_T Q_T
\end{array}$$
Compared to the input predictions laid out in (26), we form a martingale by accumulating the sums:⁷
$$\tilde Y_1 = \bar r_1 Q_1, \qquad \tilde Y_2 = \bar r_1 Q_1 + \bar r_2 Q_2, \qquad \tilde Y_3 = \bar r_1 Q_1 + \bar r_2 Q_2 + \bar r_3 Q_3, \qquad \ldots, \qquad \big[\tilde Y_t = \tilde M_1 + \tilde M_2 + \cdots + \tilde M_t\big], \qquad \ldots$$
The final, bracketed expression and several that follow display the result in the general case. Without the basis representation, $\tilde Y_t = \sum_{s=0}^{t} \tilde M_s$ where $\tilde M_s = \sum_{t'=s}^{T} M_{t'|s} / (T - s + 1)$.

To state our theorem, define $\|X\|^2 = E(X'X)$ for vectors $X$, and let $\mathbf 1$ denote a column vector of 1s with length evident from the context. The outcome being predicted is the scalar r.v. $y$ (which we write in lower case to distinguish it from the vectors). We prove the following:

Theorem 1. The total expected squared error of the predictor $\tilde Y$ is less than or equal to that of the initial predictor $\hat Y$:
$$\|y\mathbf 1 - \tilde Y\|^2 \le \|y\mathbf 1 - \hat Y\|^2 .$$

⁷ We are tempted to denote these predictors by $\bar Y_t$, but that symbol is too linked to the average of $Y$.

Proof. Let "tr" denote the trace operator.
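In matrix terms the filter is a few lines of numpy. The sketch below uses simulated paths and a hypothetical helper name, `martingale_filter`; the rows of numpy's upper-triangular $\mathbf R$ play the role of the columns of the lower-triangular $R$ above, so the averaging runs along rows:

```python
import numpy as np

def martingale_filter(Yhat):
    """Filter an n x T matrix of forecast paths by averaging the triangular
    weights attached to each basis element, then re-accumulating."""
    n, T = Yhat.shape
    Q, R = np.linalg.qr(Yhat)          # Yhat[:, t] = sum_{s<=t} R[s, t] Q[:, s]
    # Average the nonzero weights for each basis element Q_s: row s of R
    # holds r_{ts} for t = s, ..., T-1, so rbar[s] is its mean.
    rbar = np.array([R[s, s:].mean() for s in range(T)])
    # Replace each nonzero weight r_{ts} by the common average rbar[s].
    Rtilde = np.triu(np.tile(rbar[:, None], (1, T)))
    return Q @ Rtilde                  # filtered paths Ytilde

# Illustration on simulated forecast paths.
rng = np.random.default_rng(2)
n, T = 400, 6
Yhat = np.cumsum(rng.standard_normal((n, T)), axis=1)
Ytilde = martingale_filter(Yhat)
assert Ytilde.shape == (n, T)
```

By construction, successive increments of the filtered paths are $\tilde Y_t - \tilde Y_{t-1} = \bar r_t Q_t$, the martingale-difference structure the filter enforces.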
The risk of the initial predictor $\hat Y$ is
$$\|y\mathbf 1 - \hat Y\|^2 = \|y\mathbf 1 - \tilde Y\|^2 + \|(\bar R - R)Q\|^2 + 2\,E\big[(y\mathbf 1 - \tilde Y)'(\bar R - R)Q\big], \quad (31)$$
where $\bar R$ denotes the lower triangular matrix obtained by replacing the nonzero elements of each column of $R$ by the column average $\bar r_s$. The second summand measures the variability of the weights around the means of the columns of $R$:
$$\|(\bar R - R)Q\|^2 = \mathrm{tr}\big((\bar R - R)(\bar R - R)'\,E[QQ']\big) = \sum_{s=1}^{T} \sum_{t=s}^{T} (r_{ts} - \bar r_s)^2 \ \ge\ 0 .$$
The cross product term in (31) vanishes because the nonzero elements of each column of $\bar R - R$ sum to zero, and the theorem follows.