How We Won Gold

July 22, 2020

by Ben Ogorek

Forecasting has its own worldwide competition. Here’s how Nousot joined the top of the podium.

Back in April, the data science team here at Nousot decided to enter that thunderdome of forecasting, the Makridakis competition. Named for the professor and renowned forecasting expert Spyros Makridakis (who, probably not coincidentally, was also a member of the Greek Sailing Team at the 1960 Olympics), the contest began in 1982 as a way to test his theory that complex statistical methods don’t necessarily perform better than simple ones. This year’s competition was hosted by Kaggle and was the fifth of its kind: the M5.

The challenge laid out by M5 was to use data provided by Walmart to forecast daily sales for the next 28 days, and to estimate the uncertainty distribution around those forecasts. So we threw our hat into the very large ring of 7,022 participants on 5,558 teams from 101 countries. We thought it would be a good way to put our predictive modeling skills to the test and get feedback.

We also considered it a great opportunity to experiment with forecasting approaches on a real retail sales data set, with the added, tricky characteristic that you sometimes wouldn’t know whether you were forecasting supply or demand. An item not on the shelf can’t be sold, which meant there were large stretches of zeros in many of the individual sales time-series.

The first task was to come up with a team name. We’re a confident bunch here – but with just enough image-consciousness to decide that incognito wasn’t a bad idea for our first worldwide competition as a team representing our company. We entered as “N60610,” combining the “N” for Nousot with the ZIP code of our downtown Chicago HQ.

The next task was carving out M5 time during the workday. That was hubris, not confidence; startups don’t have the luxury of time. So the M5 became a weekend project for those of us with the energy to spare, or the quarantine cabin fever to squelch, or the competitive bug to see through.

From the start we were glad for the nom de plume. We struggled to climb the public leaderboard right up until our final submission, sparked by an epiphany that we never had time to fully vet. But all my Kaggle experiences have been this way, more or less: starting off with intentions of discipline but ending up in a last minute scramble. That’s the nature of sport, Kaggle included.

With all that said, here’s our story. Spoiler: we finished 16th out of 5,558 teams.

Next time we’ll use our real name.

A Philosophy for Competition

There are many ways to approach competition in data science. One is to avoid it altogether. But failing that, there are a few principles I’ve come to rely on now that I’ve competed in a number of Kaggle contests:

  1. Play to win.
  2. Play by the rules.
  3. Use help where you find it.

Some elite players do not use public notebooks and are rightfully proud of that. But my perspective is that ignoring the contributions of hundreds (even thousands) of minds is to ignore both learning and ensembling opportunities. Yet simply mixing and matching existing solutions – legal by the rules – is not playing to win, or even learn.

The Nousot freshman strategy, then, was a hybrid: we split work into two streams. One stayed on top of the M5’s leading public notebooks, and the other used a novel approach – that is, one not used in any public notebooks.

The key to our final position would turn out to be an ensemble of the two, but we’re jumping ahead.

The Public Notebook Component

When you’re willing to use help where you find it, decorated Kaggler Konstantin Yakovlev is a gold (medal) mine. Yakovlev created an entire feature engineering pipeline, starting from the creation of basic data structures, moving on to features that looked back in time, and finishing with features that made use of dimension reduction and categorical encoding.

Along the way, boosted tree algorithms with Poisson loss were found to perform well on Yakovlev’s engineered dataset. The origins of this discovery are murky; trails of forks lead to a deleted notebook. Nonetheless, we realized that the boosted tree approach based on the Yakovlev pipeline would be hard to beat, so we invested time to get it running and prepare it to accept new data when the contest released the validation set. We also used a notebook by Kaggler Shantanu Gupta that used the LightGBM framework on top of the Yakovlev pipeline.

The Custom Nousot Component

Flashback: my experience in Kaggle’s 2017 Web Traffic Time Series Forecasting competition was frustrating, as I watched ad-hoc median-of-rolling-median methods steamroll my favorite standard statistical approaches. But they didn’t steamroll the classic Kalman Filter, which Otto Seiskari used to finish eighth in the competition. I knew then that this was a method I wanted in my own toolkit.

A year later, in Kaggle’s Recruit Restaurant Visitor Forecasting competition, I used Kalman Filters for the first time, and earned a silver medal by finishing 33rd out of 2,157 teams (many thanks to Otto and Rudolf Kálmán).

Since then I’ve studied and written about the Kalman Filter, especially with regard to estimating unknown parameters. It’s been my go-to for most time series problems. But this year in the M5, the zero-inflation in the data made me nervous. That’s when I discovered the R package KFAS.

KFAS offers negative binomial filtering, as well as legitimate handling of diffuse priors for initial states. I convinced the team to give it a shot.

M5 and the Kalman Filter

In this section, I’ll describe the state space modeling of the Walmart sales data. We built up the model in layers, and each section below describes the addition of a layer. Ups and downs ensue, so buckle up!

Local Trend with Dampening

The stock Kalman Filter example is the local linear trend model, and that’s where we started in M5. Here, the latent state is the mean, drifting randomly, while the measurement process is the mean plus additional noise. You can extend the model by adding a local trend component into the state vector. However, when forecasting over many days, even a slight local trend may lead to quite bad extrapolations into the future. Hence, you often want to “dampen” the trend.

Adding exponentially decaying dampening to the trend in a state space model is straightforward since parameters update their state via a transition matrix. Simply change the appropriate diagonal of the transition matrix to a number slightly less than 1 (say .95) and the trend will exponentially decrease to a flat line over a number of periods.
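To make the mechanics concrete, here’s a minimal sketch in Python (our illustration, not KFAS code): the state is [level, trend], and the dampening factor sits on the trend’s diagonal entry of the transition matrix.

```python
import numpy as np

def damped_trend_forecast(level, trend, horizon, phi=0.95):
    """Extrapolate a local linear trend state [level, trend] forward,
    damping the trend by phi at each step via the transition matrix."""
    T = np.array([[1.0, 1.0],    # level_{t+1} = level_t + trend_t
                  [0.0, phi]])   # trend_{t+1} = phi * trend_t
    state = np.array([level, trend])
    path = []
    for _ in range(horizon):
        state = T @ state
        path.append(state[0])
    return np.array(path)

undamped = damped_trend_forecast(10.0, 0.5, 28, phi=1.0)  # straight line
damped = damped_trend_forecast(10.0, 0.5, 28, phi=0.95)   # flattens out
```

With phi = 1 the 28-day extrapolation keeps climbing linearly; with phi = 0.95 the per-day increment shrinks geometrically and the forecast levels off.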

In the early periods of the contest when the model was simpler, we estimated the trend-dampening parameter using likelihood, with a prior to keep it in the neighborhood of .95. When the model became more complicated, estimating the dampening parameter via likelihood led to terrible fits. Using lightweight cross validation, we ended up hard-coding the trend-dampening parameter to .95, adding it to the transition matrix right before forecasting.

Among all items, the item HOUSEHOLD_1_397 had the largest estimated state variance in local trend. At .0004 it sounds small, but it was large enough to cause explosive trends in the forecast period (see the following figure).

Now, trend dampening with a decay parameter of .90 (smaller than the one used in our solution) tones down the forecasts to those shown below:

Adding Regressors

Economics 101 tells us that lower prices result in higher demand, and to assert this in our state space model, we needed a regressor.

Fortunately KFAS offers helper functions for adding regression coefficients to the state vector, along with a range of possibilities. There’s the option for coefficients to drift over time, or the coefficients can be fixed through time, but random over multiple time series (in the linear mixed model sense). Thus, KFAS is a flexible and powerful tool for time series regression.

Unsurprisingly, the KFAS regressions showed some items having greater price elasticity than others. For example, let’s take the item FOODS_3_236 in store CA_3. Visually, it’s clear that when the sales price (shown by the red line) drops, more of this product is consumed.

Then we added an unconstrained regression coefficient to the state space model and the result is shown in the following plot. Everything after the red vertical line is validation data not used in fitting, and unknown to the model at the time of forecasting.

We were happy with this and assumed that adding a simple regression coefficient to the state would fit the bill. But our public leaderboard results initially got worse.

Looking more carefully, we found some examples where things went wrong. One is the item FOODS_3_570 in store WI_3, where a small decrease in price is seen throughout the whole training period, but a much larger increase in price occurred the moment the validation period started. The resulting extrapolation led to an unrealistic drop in predicted sales.

We needed a quick solution that would generalize to all time series, so we decided to clip prices in the forecasting time period to an amount no greater in magnitude than seen in the data set. A choice like this won’t win you points on Cross Validated, but we needed to ensure there weren’t other nasty surprises in one of the 30,000 time series. So we clipped.
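The clipping itself was as simple as it sounds. Here’s a hypothetical Python sketch (the function and names are ours, for illustration):

```python
import numpy as np

def clip_future_prices(train_prices, future_prices):
    """Clip forecast-period prices to the range observed in training,
    so an extreme, never-before-seen price can't drive the regression
    component to an unrealistic extrapolation."""
    lo, hi = np.min(train_prices), np.max(train_prices)
    return np.clip(future_prices, lo, hi)

train = np.array([2.48, 2.48, 2.98, 2.68])
future = np.array([2.98, 4.50, 1.00])   # 4.50 and 1.00 were never observed
clipped = clip_future_prices(train, future)  # -> [2.98, 2.98, 2.48]
```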

We also added the SNAP binary variables as regressors. SNAP is the Supplemental Nutrition Assistance Program run by the US Department of Agriculture that provides families in need with access to healthy food. For our data set, this meant that some days were SNAP days (which were different across states) that could be added as an additional regressor in the state space model.

Adding Seasonality

In addition to regressors, KFAS offers helper functions to add seasonality. For weekly seasonality, we used dummy seasonality (KFAS’s SSMseasonal), which added six dummy variables to the state vector. For yearly seasonality, we used a cycle component (SSMcycle) that added two sinusoidal states to the state vector, sharing a period of 365.25 days.
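For the curious, here’s a rough Python illustration (our own sketch, not KFAS internals) of what these two seasonal components look like as blocks of the transition matrix: six states whose dummy seasonal effects repeat every seven days, and a two-state sinusoidal rotation with a period of 365.25 days.

```python
import numpy as np

# Dummy weekly seasonality: 6 states whose transition forces the seven
# seasonal effects to sum to zero over each weekly cycle.
T_weekly = np.zeros((6, 6))
T_weekly[0, :] = -1.0          # s_{t+1} = -(s_t + s_{t-1} + ... + s_{t-5})
T_weekly[1:, :-1] = np.eye(5)  # shift the remaining states down one slot

# Yearly seasonality as a single sinusoidal pair: a 2x2 rotation whose
# angle corresponds to a 365.25-day period.
lam = 2 * np.pi / 365.25
T_yearly = np.array([[np.cos(lam), np.sin(lam)],
                     [-np.sin(lam), np.cos(lam)]])

# With no state noise, a weekly pattern repeats exactly every 7 steps.
s = np.array([3.0, 1.0, -2.0, 0.0, 1.0, -1.0])
s7 = s.copy()
for _ in range(7):
    s7 = T_weekly @ s7   # s7 ends up equal to s again
```

The rotation matrix is orthogonal, so the sinusoidal pair keeps a constant amplitude as it cycles through the year.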

Below is an illustration of a forecast that exhibits both types of seasonality.

The Submission Journey

Learning and experimenting with KFAS and our state space models was a good time – but we also needed to make progress in the contest. This meant we had other learning to do, like about quirks in the data, and about what would happen when we tried different methods and scenarios. To that end, we began a new battle tactic of attempting something, submitting it, and seeing the results.

And we did, in fact, make progress – in the absolute sense.

Our position on the public leaderboard was a different story.

Keeping in mind that lower scores were better in the M5, the best public kernels were at 0.50 and below. Yet the plucky explorations of N60610 yielded the following feedback:

  • Sample submission of all zeros: 5.44
  • Simple median per id: 3.42
  • Median of rolling medians: 1.88 (1.77 after tweaking)
  • First item-level (multiple stores) regression, zeros as NAs: 1.34
  • Local mean model (no trend) near the end of each time series: 1.08
  • Item-level (multiple stores) regression with sqrt transform, zeros as NAs: .903
  • KFAS local linear trend model for each time series: .855

At one point we submitted a notebook solution with a score of .55, just to get off the bottom half.

Not all hope was lost, though. At least the KFAS Kalman Filter solution did better on our first try than our previous submissions; with some tuning, maybe we were onto something! Also, the item-level regression using the square root transformation did relatively well, and we reasoned that it might be resulting in some useful variance stabilization.

We decided that the square root transformation would be the de facto standard for our future state space model builds.

This would come back to haunt us.

Dark Days

N60610 began suffering casualties on all fronts. Our KFAS models were increasing in sophistication, but we would never get back to the .855 of the local linear trend model. Our cross validation routine was giving us mixed results, and due to the increased complexity of the models, we were running batches of only 50 items at a time.

Time was running out. Our last stand would be the multivariate KFAS fit at the item level (ten stores per fit) with all model components discussed earlier (local trend dampening, sales price regression, SNAP regression, and seasonality). Also, the square root transform would be applied to the sales response.

It took us two days to forecast the 3,000+ items sequentially on a modest EC2 instance. Sure, there were options to speed it up, but we’d already decided that this would be the hill we were willing (or at least resigned) to die on. If it failed, we’d run the Yakovlev pipeline and call it a contest.

When the results came back, we uploaded the submission and held our breath. Three long minutes went by.

And then the score appeared: 1.69. Terrible.

Kaggle makes clear an ugly reality: you can put hours upon hours into an idea that’s both interesting and logical. You can apply that idea with confidence. And in the end it can turn out to be an inferior solution. It’s a lot like real life.

We took two days off from the contest to pout and get some rest.

One Last Hope

Eventually we stood up, brushed ourselves off, and went back into the arena to face our most recent submission. There, a glimmer of hope appeared: some of the predictions looked good relative to the gradient boosted trees. The case of item FOODS_2_183 in store WI_2 provides one example:

In the plot above:

  • The blue points represent the Kalman Filter forecast.
  • The red points represent the Boosted Tree forecast.
  • The green line represents a local scatterplot smoother (loess) fit. 

In this case, the Kalman Filter’s forecasts did a better job of picking up where the smoother left off. Since no configuration of the smoother could justify the sinusoidal pattern showing up in the Boosted Tree forecasts, the Kalman Filter solution appeared to be superior for this particular time series. There were many cases like this.

But now let’s look at the series FOODS_2_087_TX1:

In this case, the Kalman Filter solution falls apart. The Boosted Tree solution (shown with red points) appears to extend from the smoother’s values (shown with green points). By contrast, the Kalman Filter forecasts are completely disconnected from reality.

Oh, the questions: why were some of the Kalman Filter forecasts performing so much worse than others? Why would there be this kind of bias?

The Big Reveal

As we’ve discussed, the Kalman Filter-based models offered by KFAS are expressive and can handle many particular features of a time series problem. But non-constant variance is not one of them. Given that we were working with count data, we intended to rely on the square root transformation’s variance-stabilizing properties.

For Poisson data, the square root of the values is approximately normally distributed with a constant variance (of ¼) and a mean that is the square root of the original mean.

That sounds great, because not only do we have approximately constant variance, but also we can recover the original mean by squaring the estimates on the transformed scale. Citing Wikipedia, “under this transformation, the convergence to normality (as [the mean] increases) is far faster than the untransformed variable.”

That’s also great, but it raises the question: what happens when the mean is small?

The answer is, bad things happen.

From Jensen’s inequality, we know that reconstituting a square-root-transformed mean by squaring it yields something smaller than the mean of the original variable. Precisely how much smaller can be worked out using a Taylor approximation, and the answer is: pretty small, pretty fast, as the mean increases. But below a mean of 1, the approximation completely breaks down.
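A quick simulation makes the point. Below, we square-root a Poisson sample, average it, and square the average to “recover” the mean, once with a large mean and once with a small one (the code and numbers are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_then_square(lam, n=200_000):
    """Take sqrt of a Poisson(lam) sample; return the squared mean of
    the roots (the 'reconstituted' mean) and the roots' variance."""
    root = np.sqrt(rng.poisson(lam, size=n))
    return np.mean(root) ** 2, np.var(root)

big_mean, big_var = sqrt_then_square(25.0)   # mean close to 25, var near 1/4
small_mean, small_var = sqrt_then_square(0.1)
```

With a mean of 25, the reconstituted mean comes back within a few percent of the truth and the variance sits near ¼, as promised. With a mean of 0.1, the reconstituted mean lands at a small fraction of the true value, and the variance stabilization is gone too.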

If the bias wasn’t bad enough, David Warton’s 2017 article Why You Cannot Transform Your Way Out of Trouble for Small Counts assures us that we won’t even get the variance-stabilization benefits of the square root transformation when means of count distributions are small.

In the M5, many of the means are small. And for count data with small means, the square root transformation is the wrong choice.

The Additional Problem of Poor Model Fits

While most of the Kalman Filter’s forecasts were lower than the Boosted Trees due to the bias discussed above, that’s not the case for the most recent plot of FOODS_2_087_TX1. A bad fit could lead to biases in both directions. Indeed, one of our favorite Kagglers, the aforementioned Otto Seiskari, warned me that maximum likelihood can lead to quite poor fits with the Kalman Filter if the model assumptions are wrong. And for that reason methods such as cross validation are often preferred.

Creating A Switch

Even as we acknowledged the many ways the Kalman Filter solution could (and did) go wrong, we noticed a pattern: the cases where our Kalman Filter forecasts actually looked good relative to the Boosted Tree solution always coincided with near continuity between the forecast and the loess fit. So we created a programmatic switch between the two.

The switch worked like this: the last point of flexible loess (scatterplot smoother) fit through each time series was compared to the averages of the first three forecasted values of both solutions. Whichever forecasting method was closer to the smoothed fit would become the chosen solution for that series. The Kalman Filter tended to be the solution for the larger mean time series, whereas the Boosted Tree worked better for the sparse, smaller-mean cases.
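A minimal sketch of the switch in Python (the function and names are ours, simplified from what we actually ran):

```python
import numpy as np

def choose_forecast(loess_last, kalman_fc, boosted_fc):
    """Pick whichever forecast's first-three-day average sits closer to
    the last value of a loess fit through the series' history."""
    kalman_gap = abs(np.mean(kalman_fc[:3]) - loess_last)
    boosted_gap = abs(np.mean(boosted_fc[:3]) - loess_last)
    return kalman_fc if kalman_gap <= boosted_gap else boosted_fc

# Toy example: the smoother ends near 4; the Kalman forecast continues
# from there, while the boosted-tree forecast starts far away.
loess_last = 4.0
kalman = np.array([4.1, 3.9, 4.0, 4.2])
boosted = np.array([1.0, 1.2, 0.9, 1.1])
chosen = choose_forecast(loess_last, kalman, boosted)  # picks kalman
```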

At this point, submission time was upon us. The plan was bold, my friends. We were going to hedge.

We’d choose the pure Yakovlev pipeline as one submission, and our mixture of solutions as another.

But the M5 was not like most Kaggle contests. M5 allowed only one final submission.

We picked the “mixture of experts” solution.

It worked out pretty well.

Getting the Gold

When the results of the private leaderboard come out, it’s my tradition to start scrolling for my submission below the fold. Or several of them.

Turns out I had to hit the button only once.


In sports, there is a cruel indifference to both how hard you tried and the absolute progress you made along the way. The scoreboard is the cold arbiter of success or failure.

In Kaggle it’s the same. Weeks of planning and execution of a new idea can result in a submission with a disappointing score, even if there are clear merits to your approach.

But from this indifference comes clarity. If you open your eyes and accept what you’re seeing – if you can accept the result – you can get better.

Sixteenth Place in the M5 – official Gold Medal territory – was a happy surprise. At the same time, I can’t help wishing we had placed even higher. Our modeling approach was clean and elegant, and worked well a lot of the time.

A better finish couldn’t have been too far away. Or could it? But just as in sports, the buzzer sounded and the competition ended. The scoreboard cleared. The competitors went home, where they could reflect. And if they did, they can learn, come back, and compete stronger the next time.

A final word: we’d like to thank one of Nousot’s original founders, Rami Jachi, for inspiring us to enter the M5. Rami is pursuing other endeavors now, and we are so grateful to have – hopefully – made him proud. We wish him the absolute best as he continues to inspire others and envision new possibilities with emerging AI and ML technologies.

Here’s to many more top-half-of-the-leaderboard performances in Kaggle, business, and life.

