[TimesFM] A decoder-only foundation model for time-series forecasting
2024. 10. 22. 02:36
https://arxiv.org/pdf/2310.10688
https://github.com/google-research/timesfm
(Oct 2023 Google Research)
Abstract
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a decoder style attention model with input patching, using a large time-series corpus comprising both real-world and synthetic datasets. Experiments on a diverse set of previously unseen forecasting datasets suggest that the model can yield accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities.
1. Introduction
Time-series data is ubiquitous in various domains such as retail, finance, manufacturing, healthcare and natural sciences. In many of these domains, one of the most important use-cases of time-series data is forecasting. Time-series forecasting is critical to several scientific and industrial applications, like retail supply chain optimization, energy and traffic prediction, and weather forecasting. In recent times, deep learning models [SFGJ20, OCCB19] have emerged as a popular approach for forecasting rich, multivariate, time-series data, often outperforming classical statistical approaches such as ARIMA or GARCH [BJ68]. In several forecasting competitions such as the M5 competition [MSA22] and IARAI Traffic4cast contest [KKN+21], deep network based solutions performed very well.
At the same time, we are witnessing rapid progress in the Natural Language Processing (NLP) domain on large foundation models for downstream NLP tasks. Large language models (LLMs) are growing in popularity because they can be used to generate text, translate languages, write different kinds of creative content, and answer questions in an informative way [RWC+19]. They are trained on massive amounts of data, which allows them to learn the patterns of human language. This makes them very powerful tools that can be used for a variety of downstream tasks, often in a zero-shot learning mode.
This motivates the question: “Can large pretrained models trained on massive amounts of time-series data learn temporal patterns that can be useful for time-series forecasting on previously unseen datasets?” In particular, can we design a time-series foundation model that obtains good zero-shot out-of-the-box forecasting performance? Such a pretrained time-series foundation model, if possible, would bring significant benefits for downstream forecasting users in terms of no additional training burden and significantly reduced compute requirements. It is not immediately obvious that such a foundation model for time-series forecasting is possible. Unlike in NLP, there is no well defined vocabulary or grammar for time-series. Additionally, such a model would need to support forecasting with varying history lengths (context), prediction lengths (horizon) and time granularities. Furthermore, unlike the huge volume of public text data for pretraining language models, vast amounts of time-series data are not readily available. In spite of these issues, we provide evidence to answer the above question in the affirmative.
In particular, we design TimesFM, a single foundation model for time-series forecasting that, when applied to a variety of previously-unseen forecasting datasets across different domains, obtains close to state-of-the-art zero-shot accuracy (compared to the best supervised models trained individually for these datasets). Our model can work well across different forecasting history lengths, prediction lengths and time granularities at inference time. The key elements of our foundation model are twofold: 1) a large-scale time-series corpus built using both real-world (mostly time-series data from web search queries and Wikipedia page visits) and synthetic data, which meets the volume and diversity of data needed for training our foundation model, and 2) a decoder style attention architecture with input patching, that can be efficiently pre-trained on this time-series corpus.
Compared to the latest large language models, our time-series foundation model is much smaller in both parameter size (200M parameters) and pretraining data size (O(100B) timepoints); yet we show that even at such scales, it is possible to pretrain a practical foundation model for forecasting whose zero-shot performance comes close to the accuracy of fully-supervised approaches on a diverse set of time-series data. Our work also suggests that unlike recent work [GFQW23] that recommends Large Language Models such as GPT-3 and LLaMA-2 as out-of-the-box zero-shot forecasters, foundation models trained from scratch exclusively on time-series data can obtain much better zero-shot performance at a tiny fraction of their cost.
2. Related Work
In the last decade, deep learning models [SFGJ20, OCCB19] have emerged as powerful contenders in forecasting time-series in the presence of large training datasets and have been shown to outperform traditional statistical methods such as ARIMA and Exponential smoothing [McK84]. Forecasting models can be categorized broadly into: (i) Local univariate models that include traditional methods like ARIMA, exponential smoothing [McK84] and nonautoregressive models like Prophet [TL18]. These models are trained individually for each time-series in a dataset in order to predict the corresponding time-series’s future. (ii) Global univariate models like DeepAR [SFGJ20], Temporal Convolutions [BBO17], N-BEATS [OCCB19] and long-term forecasting models such as [NNSK22, DKL+23] that are trained globally on many time-series but during inference they predict the future of a time-series as a function of its own past and other related covariates. (iii) Global multivariate models that take in the past of all time-series in the dataset to predict the future of all the time-series. Such models include the classical VAR model [ZW06] as well as deep learning models like [SYD19, ZMW+22, CLY+23] to name a few.
All the works cited above have primarily been applied in the supervised setting with the notable exception of PatchTST [NNSK22] and N-BEATS [OCCB19]. PatchTST has a section on dataset-to-dataset transfer learning in the semi-supervised setting. [OCCB21] also show that the N-BEATS architecture lends itself to transfer learn between various source-target dataset pairs. However, none of these works aim to train a single foundation model that can work on a plethora of datasets. For an in-depth discussion about transfer learning in time-series we refer the reader to the survey in [MLZ+23].
There has been some very recent work on re-using or fine-tuning large language models for time-series forecasting. In particular, [GFQW23] benchmarks pretrained LLMs like GPT-3 and LLaMA-2 for zero-shot forecasting performance. As we show later, our model obtains much superior zero-shot performance at a tiny fraction of these model sizes. [ZNW+23] and [CPC23] show how to fine-tune a GPT-2 [RWC+19] backbone model for time-series forecasting tasks. With the exception of a transfer-learning study (forecasting on a target dataset after having trained on a source dataset), these papers mostly focus on fine-tuning a pretrained model on target datasets, and not on pretraining a single foundation model with good out-of-the-box zero-shot performance on a variety of datasets. To the best of our knowledge, the very recent work in TimeGPT-1 [GMC23] is the only other parallel work on a zero-shot foundation model for time-series forecasting. However, the model is not publicly accessible, and several model details and the benchmark dataset have not been revealed.
3. Problem Definition
The task at hand is to build a general purpose zero-shot forecaster that takes in the past L time-points of a time-series as context and predicts the future H time-points. Let the context be denoted by $y_{1:L} := \{y_1, \dots, y_L\}$, where we follow a numpy-like notation for indices. Similarly, the actual values in the horizon are denoted by $y_{L+1:L+H}$. Note that since we are building a single pre-trained model, we cannot have dataset specific dynamic or static covariates during training time. The task is then to learn a foundation model that can map any time-series context to horizon,
$$f : y_{1:L} \longrightarrow \hat{y}_{L+1:L+H}.$$
The accuracy of the prediction can be measured by a metric that quantifies its closeness to the actual values, for instance, Mean Absolute Error (MAE) defined in Equation 6.
4. Model Architecture
A foundation model for time-series forecasting should be able to adapt to variable context and horizon lengths, while having enough capacity to encode all patterns from a large pretraining dataset. Transformers have been shown to be able to adapt to different context lengths in NLP [RWC+19]. However, there are several time-series specific design choices. The main guiding principles for our architecture are the following:
Patching.
Inspired by the success of patch based modeling in the recent long horizon forecasting work [NNSK22] we also choose to break down the time-series into patches during training. A patch of a time-series is a natural analogue for a token in language models and patching has been shown to improve performance. Moreover this improves inference speed as the number of tokens being fed into the transformer is reduced by a factor of the patch length. On the other hand, increasing the patch length all the way to the context length moves us away from decoder-only training and the efficiencies that come with it. We delve into this further in Section 6.2.
Decoder-only model.
A key difference between our architecture and PatchTST [NNSK22] is that our model is trained in decoder-only mode [LSP+18]. In other words, given a sequence of input patches, the model is optimized to predict the next patch as a function of all past patches. Similar to LLMs this can be done in parallel over the entire context window, and automatically enables the model to predict the future after having seen a varying number of input patches.
Longer output patches.
In LLMs the output is always generated in an auto-regressive fashion one token at a time. However, in long-horizon forecasting it has been observed that directly predicting the full horizon yields better accuracy than multi-step auto-regressive decoding [ZCZX23]. But this is not possible when the horizon length is not known a priori, as in the case of zero-shot forecasting which is our primary goal.
We propose a middle ground by allowing our output patches for prediction to be longer than the input patches. As an example, suppose the input patch length is 32 and output patch length is 128. During training, the model is simultaneously trained to use the first 32 time-points to forecast the next 128 time-steps, the first 64 time-points to forecast time-steps 65 to 192, the first 96 time-points to forecast time-steps 97 to 224 and so on. During inference, suppose the model is given a new time-series of length 256 and tasked with forecasting the next 256 time-steps into the future. The model will first generate the future predictions for time-steps 257 to 384, then condition on the initial 256 length input plus the generated output to generate time-steps 385 to 512. On the other hand, if in a model the output patch length was fixed to the input patch length of 32, then for the same task we would have to go through 8 auto-regressive generation steps instead of just the 2 above. However, there is a trade-off. If the output patch length is too long, then it is difficult to handle time-series whose lengths are less than the output patch length, for instance the monthly and yearly time-series in our pretraining data.
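The arithmetic in this example can be checked with a short sketch. The function names below are hypothetical and only restate the numbers given in the text; they are not taken from the TimesFM code base.

```python
import math


def num_decode_steps(horizon: int, output_patch_len: int) -> int:
    """Number of auto-regressive generation steps needed to cover `horizon`."""
    if horizon <= 0 or output_patch_len <= 0:
        raise ValueError("horizon and output_patch_len must be positive")
    return math.ceil(horizon / output_patch_len)


def training_targets(num_patches: int, input_patch_len: int, output_patch_len: int):
    """Per input patch j: (last context index, first target index, last target index).

    Indices are 1-based and inclusive, matching the example in the text.
    """
    return [
        (j * input_patch_len,
         j * input_patch_len + 1,
         j * input_patch_len + output_patch_len)
        for j in range(1, num_patches + 1)
    ]


steps_long_output = num_decode_steps(256, 128)   # output patch length 128
steps_short_output = num_decode_steps(256, 32)   # output patch length 32
targets = training_targets(3, 32, 128)

print(steps_long_output, steps_short_output)  # 2 8
print(targets)  # [(32, 33, 160), (64, 65, 192), (96, 97, 224)]
```

With an output patch length of 128 the 256-step horizon needs 2 generation steps, against 8 when the output patch length equals the input patch length of 32.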
Patch Masking.
If we use patches naively, the model might only learn to predict well for context lengths that are multiples of the input patch length. Therefore we make careful use of masking during training. Parts of patches as well as entire patches from the beginning of the context window can be masked in a data batch. We employ a specific random masking strategy (described later) during training that helps the model see all possible context lengths starting from 1 to a maximum context length.
Now that we have mentioned the guiding principles, we next formally describe each component of our model architecture (illustrated in Figure 1), which we name as TimesFM (Time-series Foundation Model).
Input Layers.
The job of the input layers is to preprocess the time-series into input tokens to the transformer layers. We first break the input into contiguous non-overlapping patches. Then each patch is processed by a Residual Block into a vector of size model_dim. Along with the input, we also supply a binary padding mask $m_{1:L}$, where 1 denotes that the corresponding input in $y_{1:L}$ should be ignored and 0 that it should be used. The Residual Block is essentially a Multi-layer Perceptron (MLP) block with one hidden layer and a skip connection, similar to that defined in [DKL+23].
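A minimal NumPy sketch of the input layers follows. Several details are assumptions made for illustration and are not stated in the text above: the ReLU activation, the linear skip projection, the random weight initialisation, the hidden size of 64, and the choice to zero out masked inputs before the block. The names `make_patches` and `ResidualBlock` are hypothetical.

```python
import numpy as np


def make_patches(y, mask, p):
    """Split a length-L series and its padding mask into L // p patches of length p."""
    y = np.asarray(y, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if y.shape != mask.shape or y.size % p != 0:
        raise ValueError("y and mask must have equal length, a multiple of p")
    return y.reshape(-1, p), mask.reshape(-1, p)


class ResidualBlock:
    """MLP with one hidden layer plus a linear skip connection (a sketch)."""

    def __init__(self, in_dim, hidden_dim, out_dim, rng):
        scale = 0.1  # arbitrary initialisation scale, chosen for illustration
        self.w_in = rng.normal(0.0, scale, (in_dim, hidden_dim))
        self.b_in = np.zeros(hidden_dim)
        self.w_out = rng.normal(0.0, scale, (hidden_dim, out_dim))
        self.b_out = np.zeros(out_dim)
        self.w_skip = rng.normal(0.0, scale, (in_dim, out_dim))

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_in + self.b_in, 0.0)  # ReLU (assumed)
        return hidden @ self.w_out + self.b_out + x @ self.w_skip


rng = np.random.default_rng(0)
p, model_dim = 32, 16
y = np.arange(64, dtype=float)
mask = np.zeros(64)
mask[:4] = 1.0  # ignore the first four time-points

patches, mask_patches = make_patches(y, mask, p)
block = ResidualBlock(in_dim=p, hidden_dim=64, out_dim=model_dim, rng=rng)
tokens = block(patches * (1.0 - mask_patches))  # masked inputs are zeroed (assumed)
print(tokens.shape)  # (2, 16)
```

Each of the two patches becomes one token of size model_dim, which is what the transformer layers consume.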
Stacked Transformer.
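The bulk of the parameters in our model are in num_layers ($n_l$) transformer layers stacked on top of each other. Each of these layers has the standard multi-head self-attention (SA) followed by a feed-forward network (FFN). The main hyperparameters are model_dim, which is equal to the dimension of the input tokens $t_j$, and the number of heads (num_heads). We set the hidden size of the FFNs to be equal to model_dim as well. We use causal attention, that is, each output token can only attend to input tokens that come before it in the sequence (including the corresponding input token). This can be described by the equation
$$o_j = \mathrm{StackedTransformer}\big((t_1, \dot{m}_1), \dots, (t_j, \dot{m}_j)\big), \quad \forall j \in [N],$$
where N is the number of input patches and $\dot{m}_j$ is the masking indicator for patch j, defined as $\min\{m_{p(j-1)+1:pj}\}$ with p the input patch length, so that a patch is treated as masked only if all of its time-points are masked.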
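The causal attention pattern described in this paragraph can be illustrated with a single-head NumPy sketch. It is a simplified stand-in and not the model's implementation: it omits multiple heads, the feed-forward network, residual connections, normalization and the padding mask, and the weight matrices are random placeholders.

```python
import numpy as np


def causal_self_attention(tokens, wq, wk, wv):
    """Single-head self-attention where token j attends only to tokens 1..j."""
    n = tokens.shape[0]
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    allowed = np.tril(np.ones((n, n), dtype=bool))  # lower triangle incl. diagonal
    scores = np.where(allowed, scores, -np.inf)     # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
n, d = 5, 8
tokens = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

out = causal_self_attention(tokens, wq, wk, wv)

# Changing the last token must leave all earlier outputs untouched.
changed = tokens.copy()
changed[-1] += 10.0
out_changed = causal_self_attention(changed, wq, wk, wv)
earlier_outputs_unchanged = np.allclose(out[:-1], out_changed[:-1])
print(earlier_outputs_unchanged)  # True
```

This property is what allows every output token to be trained in parallel as a forecast from its own prefix of the context.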
Output Layers.
The remaining task is to map the output tokens into predictions. We train in decoder only mode, i.e., each output token should be predictive of the part of the time-series that follows the last input patch corresponding to it. This is common for popular large language models like [RWC+19]. However, one key difference in our time-series foundation model is that the input patch length need not be equal to the output patch length, i.e., we should be able to predict a larger chunk of the time-series based on the encoded information from the input patches seen so far. Let the output patch length be output_patch_len (h). We use another Residual Block to map the output tokens to the predictions. This can be described as,
$$\hat{y}_{pj+1:pj+h} = \mathrm{OutputResidualBlock}(o_j),$$
where p is the input patch length and $o_j$ is the j-th output token of the transformer stack.
Loss Function.
In this work, we focus on point forecasting. Therefore we can use a point forecasting loss during training like Mean Squared Error (MSE). The loss that is minimized during training can be expressed as,
$$\mathrm{TrainLoss} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{MSE}\big(\hat{y}_{pj+1:pj+h},\; y_{pj+1:pj+h}\big),$$
where N is the number of input patches, p the input patch length and h the output patch length.
Note that if one is interested in probabilistic forecasting, then it is easy to have multiple output heads for each output patch, each head minimizing a separate quantile loss as in [WTNM17]. Another approach can be to output the logits of a probability distribution family and minimize the negative log-likelihood for probabilistic forecasting [ADSS21, SFGJ20].
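For reference, the quantile loss mentioned in the note above is commonly written as the pinball loss. The sketch below uses that standard definition; it is not taken from the TimesFM code, and how such a loss would be wired to per-quantile output heads is left open.

```python
import numpy as np


def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile level q in (0, 1), averaged over all points."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit, over-prediction costs 1 - q.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


under = quantile_loss([1.0], [0.0], q=0.9)  # forecast too low
over = quantile_loss([0.0], [1.0], q=0.9)   # forecast too high
print(round(under, 6), round(over, 6))  # 0.9 0.1
```

At q = 0.9 an under-prediction is penalised nine times as heavily as an over-prediction of the same size, which pushes that head towards the 90th percentile.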
Training.
We train the model with standard mini-batch gradient descent in decoder-only fashion, going through all windows of a time-series and across time-series. The only non-standard part is the way we sample the mask during training. For each time-series in the batch, we sample a random number r between 0 and p − 1, where p is the input patch length. Then we set $m_{1:r} = 1$ and the rest to zero, i.e., we mask out a fraction of the first input patch. Although only the first patch is affected, this is sufficient to cover all input context lengths from 1 to the maximum training context length. We explain this using an example below:
Suppose the maximum context length is 512 and p = 32. Then if r = 4, the output prediction after seeing the first patch (from $o_1$) is optimized to predict after seeing 28 = 32 − 4 time-points, the output of the next patch (from $o_2$) is optimized to predict after seeing 28 + 32 time-points, and so on. When this argument is repeated for all values of r, the model has seen all possible context lengths up to 512.
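The coverage claim can be verified by enumeration. The sketch below lists, for every r from 0 to p − 1 and every patch index, how many unmasked time-points the corresponding output token has seen; the function name is hypothetical.

```python
def context_lengths_seen(max_context: int, p: int) -> set:
    """Effective context lengths reached over all mask offsets r in [0, p - 1]."""
    num_patches = max_context // p
    seen = set()
    for r in range(p):                       # r masked points at the start of patch 1
        for j in range(1, num_patches + 1):  # output token of the j-th patch
            seen.add(j * p - r)              # unmasked points visible to that token
    return seen


seen = context_lengths_seen(max_context=512, p=32)
covers_all = seen == set(range(1, 513))
print(covers_all)  # True
```

For r = 4 the first two tokens see 28 and 60 time-points, as in the example, and the union over all r is every length from 1 to 512.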
Inference.
The trained network can be used to produce forecasts for any horizon using auto-regressive decoding similar to large language models. Given an input $y_{1:L}$ (assume L is a multiple of p for simplicity) it can first predict $\hat{y}_{L+1:L+h}$. Then, we can use the concatenated vector $\tilde{y}_{1:L+h} = [y_{1:L}; \hat{y}_{L+1:L+h}]$ as an input to the network to generate the next output patch prediction $\hat{y}_{L+h+1:L+2h}$, and so on. If L is not a multiple of p, we simply append zeros to make it a multiple of p and mark the corresponding entries in the mask as 1.
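The decoding loop described here can be sketched as follows. The `model` argument is a hypothetical callable that takes a padded input and its mask and returns one output patch; `last_value_model` is a trivial stand-in used only so that the loop can run, and is not the trained network.

```python
from typing import Callable, List, Sequence

PatchModel = Callable[[Sequence[float], Sequence[int]], Sequence[float]]


def forecast(model: PatchModel, context: Sequence[float], horizon: int, p: int) -> List[float]:
    """Auto-regressive decoding: append each predicted patch to the history."""
    history = list(context)
    predictions: List[float] = []
    while len(predictions) < horizon:
        pad = (-len(history)) % p             # zeros needed to reach a multiple of p
        inputs = history + [0.0] * pad        # zeros are appended, as in the text
        mask = [0] * len(history) + [1] * pad  # 1 marks entries to be ignored
        patch = list(model(inputs, mask))
        predictions.extend(patch)
        history.extend(patch)
    return predictions[:horizon]


def last_value_model(h: int) -> PatchModel:
    """Stand-in model: repeats the last unmasked value for h steps."""
    def model(inputs, mask):
        observed = [v for v, m in zip(inputs, mask) if m == 0]
        return [observed[-1]] * h
    return model


result = forecast(last_value_model(128), [1.0] * 255 + [7.0], horizon=256, p=32)
print(len(result), result[0])  # 256 7.0
```

With a context of 256 points, p = 32 and an output patch length of 128, the loop calls the model twice, first on 256 inputs and then on 384.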
5. Pretraining Details
We would like our pretraining corpus to include large volumes of temporal data representing a variety of domains, trend and seasonality patterns and time granularities that ideally capture the forecasting use-cases which we are interested in serving by the deployed model. It is challenging to find a large time-series dataset that meets the volume and diversity of data needed for training our foundation model. We address this problem by sourcing the bulk of data used to train our models from three major sources: Google Trends, Wiki Pageviews statistics and synthetic time-series. In summary the main data sources are:
Google Trends.
Google Trends captures search interest over time for millions of queries. We choose around 22k head queries based on their search interest over 15 years from 2007 to 2022. Beyond these head queries the time-series become more than 50% sparse. We download the search interest over time for these queries in hourly, daily, weekly and monthly granularities to form our dataset. The date ranges are Jan. 2018 to Dec. 2019 for hourly and Jan. 2007 to Dec. 2021 for the other granularities. The Trends dataset amounts to roughly 0.5B time-points.
Wiki Pageviews.
Wiki Pageviews captures the hourly views of all Wikimedia pages. We download all pageview data from Jan. 2012 to Nov. 2023, clean and aggregate the views by page into hourly, daily, weekly and monthly granularities, and filter out pageview time-series with excessive zeros. The final corpus contains roughly 300B time-points.
Synthetic Data.
Another major component of our pretraining data is of synthetic origin. We create generators for ARMA [McK84] processes, seasonal patterns (mixture of sines and cosines of different frequencies), trends (linear, exponential with a few change-points) and step functions. A synthetic time-series can be an additive combination of one or more of these processes. We create 3M synthetic time-series each of length 2048 time-points. More details about our synthetic data generation are presented in Appendix A.8.
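A rough sketch of such a generator is given below. All parameter choices (ARMA(1,1) coefficients, periods, number of change-points, noise scales, the 50% inclusion probability) are assumptions made for illustration; the actual settings are in the appendix referenced above and are not reproduced here. Only linear trends are sketched, not exponential ones.

```python
import numpy as np


def arma_process(n, phi, theta, rng):
    """ARMA(1,1): x_t = phi * x_{t-1} + eps_t + theta * eps_{t-1}."""
    eps = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(n):
        prev_x = x[t - 1] if t > 0 else 0.0
        prev_eps = eps[t - 1] if t > 0 else 0.0
        x[t] = phi * prev_x + eps[t] + theta * prev_eps
    return x


def seasonal_pattern(n, periods, rng):
    """Mixture of sines and cosines with random amplitudes."""
    t = np.arange(n)
    out = np.zeros(n)
    for period in periods:
        a, b = rng.normal(size=2)
        out += a * np.sin(2 * np.pi * t / period) + b * np.cos(2 * np.pi * t / period)
    return out


def _segment_lengths(n, num_changepoints, rng):
    """Lengths of the segments between sorted random change-points."""
    cps = np.sort(rng.choice(np.arange(1, n), size=num_changepoints, replace=False))
    return np.diff(np.concatenate(([0], cps, [n])))


def piecewise_linear_trend(n, num_changepoints, rng):
    """Linear trend whose slope changes at a few random change-points."""
    lengths = _segment_lengths(n, num_changepoints, rng)
    slopes = rng.normal(scale=0.01, size=num_changepoints + 1)
    return np.cumsum(np.repeat(slopes, lengths))


def step_function(n, num_steps, rng):
    """Piecewise-constant level with random jumps."""
    lengths = _segment_lengths(n, num_steps, rng)
    levels = rng.normal(size=num_steps + 1)
    return np.repeat(levels, lengths)


def synthetic_series(n=2048, seed=0):
    """Additive combination of one or more randomly chosen components."""
    rng = np.random.default_rng(seed)
    components = [
        arma_process(n, phi=0.7, theta=0.3, rng=rng),
        seasonal_pattern(n, periods=(24, 168), rng=rng),
        piecewise_linear_trend(n, num_changepoints=3, rng=rng),
        step_function(n, num_steps=3, rng=rng),
    ]
    keep = rng.random(len(components)) < 0.5
    if not keep.any():
        keep[rng.integers(len(components))] = True  # always keep at least one
    return sum(c for c, k in zip(components, keep) if k)


series = synthetic_series()
print(series.shape)  # (2048,)
```

Seeding the generator makes each series reproducible, which is convenient when building a fixed synthetic corpus.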
Other real-world data sources.
Along with the wiki and trends data, we also add time-series from several other publicly available datasets to our pretraining corpus. We add all the granularities of the M4 dataset [MSA22], the hourly and 15 minute Electricity and the hourly Traffic datasets (see [ZZP+21]). We also add the 10-minute granularity Weather dataset used for evaluations in [ZZP+21]. M4 has a good mix of granularities with around 100k time-series in total. Traffic and Electricity are large long-term forecasting datasets with > 800 and > 300 time-series respectively, each having tens of thousands of time-points. In addition, we add all the 15 min granularity traffic time-series from [WJJ+23].
Dataset Mixing and Training.
We train on a mixture distribution over these datasets that aims to give sufficient weight to all granularities and datasets. The training loader samples 80% real data and 20% synthetic, with the real data mixture providing equal weights to the groups: hourly + sub-hourly, daily, weekly, and monthly datasets. We train with a maximum context length of 512 whenever the length of the time-series allows that. For weekly granularity we do not have sufficiently long time-series; therefore a maximum context length of 256 is used. For the same reason, a maximum context length of 64 is used while training on data of monthly or coarser granularity. We also use only the standard normalization part of reversible instance normalization [KKT+21], i.e., the context of each time-series is scaled by the mean and standard deviation of the first input patch in the context.
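The normalization step can be sketched as below. The guard against a zero standard deviation and the mapping of forecasts back to the original scale are assumptions added for a usable example; the text only states that the context is scaled by the statistics of the first input patch.

```python
import numpy as np


def normalize_context(context, p, eps=1e-7):
    """Scale a context by the mean and standard deviation of its first patch."""
    context = np.asarray(context, dtype=float)
    first_patch = context[:p]
    mu = first_patch.mean()
    sigma = first_patch.std()
    if sigma < eps:  # assumed guard for constant first patches
        sigma = 1.0
    return (context - mu) / sigma, mu, sigma


def denormalize(values, mu, sigma):
    """Map normalized values (for example forecasts) back to the original scale."""
    return np.asarray(values, dtype=float) * sigma + mu


context = np.arange(64, dtype=float) + 100.0
normalized, mu, sigma = normalize_context(context, p=32)
restored = denormalize(normalized, mu, sigma)
print(np.allclose(restored, context))  # True
```

Only the first p points determine the scale, so later patches of a trending series can lie well outside the unit range.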
6. Empirical Results
We evaluate our model in zero-shot settings on three groups of well known public datasets against the best performing baselines for each group. These datasets have been intentionally held out from our pretraining data. We show that a single pretrained model can come close to or surpass the performance of baseline models on the benchmarks even when the baselines are specially trained or tuned for each specific task. Subsequently, we perform ablation studies that justify different choices made in our architecture.
6.1. Zero-shot Evaluation
To benchmark our model’s performance, we choose three groups of commonly used forecasting datasets that cover various domains, sizes, granularities, and horizon lengths: Darts [HLP+22], Monash [GBW+21] and Informer datasets [ZZP+21], to test the generalization power of our foundation model against other baselines.
In all cases, we report performance on the official metrics and scalings of the datasets, using either their standard test splits or common test splits in other literature. We present a summary of the results below; more details can be found in Appendix A.5. We provide the hyper-parameters and other details about our model in Appendix A.6.
Monash [GBW+21].
The Monash archive is a collection of 30 datasets of different training and prediction lengths that covers granularities ranging from minutes to years and domains including finance, demand forecasting, weather and traffic. The archive reports four official metrics for several statistical baselines such as Exponential Smoothing (ETS) and ARIMA, as well as supervised ML baselines like CatBoost [PGV+18], DeepAR [SFGJ20] and WaveNet [ODZ+16]. Following llmtime [GFQW23] we start from the Monash Huggingface repository (https://huggingface.co/datasets/monash_tsf) and filter out the datasets that contain missing values. This leaves us with 18 datasets which we specify in Appendix A.5.2.
Out of the four official metrics, following prior work [GFQW23], we report our performance in terms of mean MAE (see Appendix A.2). As the datasets have massively different scales, for each dataset we normalize the metric by the metric achieved by a naive baseline that just constantly predicts the last value in the context for each time-series. Then the scaled MAEs are averaged across all datasets. The scaled aggregation was also used in [GFQW23]. In Figure 2a, we use the Geometric Mean (GM) for averaging since it is more robust for normalized metrics [FW86]. We also report the Arithmetic Mean based aggregated metrics in Figure 4 in the appendix.
The mean scaled MAE across all datasets is plotted in Figure 2a along with standard error bars. We compare the performance of TimesFM with the baseline models implemented in Monash, and the zero-shot llmtime [GFQW23] model that uses GPT-3 [RWC+19] with a specific prompting technique. Note that the zero-shot models are marked as (Zero-Shot). TimesFM is the top model even though we never trained on these datasets. It is slightly better than N-BEATS, though within significance of it, outperforms deep supervised models like DeepAR [SFGJ20], and improves on llmtime’s performance by more than 25%.
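The scaling and aggregation described above can be sketched in a few lines. For brevity the example treats each dataset as a single time-series, which is a simplification: in the archive the MAE of a dataset is first averaged over its series. The function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(actual, predicted)) / len(actual)


def naive_last_value(context, horizon):
    """Naive baseline: repeat the last context value over the horizon."""
    return [context[-1]] * horizon


def scaled_mae(context, actual, predicted):
    """MAE of a forecast divided by the MAE of the naive baseline."""
    baseline = naive_last_value(context, len(actual))
    return mae(actual, predicted) / mae(actual, baseline)


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One toy "dataset": the forecast is off by 1, the naive baseline by 2 and 4.
example = scaled_mae(context=[0.0, 10.0], actual=[12.0, 14.0], predicted=[11.0, 13.0])
aggregate = geometric_mean([0.5, 2.0])
print(round(example, 4), round(aggregate, 4))  # 0.3333 1.0
```

A scaled MAE below 1 means the model beats the naive baseline on that dataset. The geometric mean treats a halving and a doubling of the error as cancelling out, which the arithmetic mean does not.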
Darts [HLP+22].
This is a collection of 8 univariate datasets which include interesting seasonalities and additive+multiplicative trends. We report performance of several baselines implemented in the Darts package like TCN [LVRH16], N-HiTS [COO+23] and N-BEATS [OCCB19]. All these baselines are supervised. As before, we also report zero-shot forecasting results from llmtime [GFQW23] using GPT-3 [RWC+19]. Other supervised baselines in [GFQW23] like SM-GP [WA13] and ARIMA [McK84] are also added.
We report the official metric for this dataset group, that is, MAE, for each individual dataset in Appendix A.5. In Figure 2b, we present the average scaled MAE across all 8 datasets, as we did for the Monash datasets. TimesFM is within statistical significance of the best models, which are llmtime and seasonal ARIMA in this case. Note that since there are only 8 individual time-series in this dataset group, the standard errors are not sharp and therefore do not provide a clear ordering among the models. Also, note that for ARIMA, the seasonality needs to be encoded correctly in the parameters for the best results, which needed manual tuning. Further, since these datasets are used in numerous time series blog posts for illustrative purposes, data contamination for llmtime cannot be ruled out.
Informer [ZZP+21].
The Informer datasets have been widely used for benchmarking various supervised long-horizon forecasting methods. A few of these datasets are used in pretraining, so we focus on the other datasets in this collection (ETTm1, ETTm2, ETTh1 and ETTh2) related to electricity transformer temperatures over a two year period in 1 hour and 15 minute granularities. Note that the long horizon baselines usually report rolling validation results on the test set, which would amount to millions of tokens for evaluating llmtime [GFQW23] and would be too expensive. Therefore, following llmtime, we compare all methods on the last test window. Also, it is reasonable to directly average the MAE for these datasets since the results are reported on standard normalized datasets (using the statistics of the training portion).
We consider the task of predicting horizon length 96 and 192, given a context length of 512 for all methods. The MAE averaged over all 8 tasks (4 datasets with two horizons each) is presented in Figure 2c. TimesFM performs the best and the supervised PatchTST [NNSK22] baseline (which is a state-of-the-art long horizon deep forecasting method) is within significance of it. The other long horizon methods are quite a bit worse even though they have been trained on these datasets. llmtime is better than FEDFormer but worse than PatchTST with statistical significance.
We present visual examples of our forecasts along with baselines in Appendix A.9.
6.2. Ablation
Next, we perform several ablation studies that inform the design decisions we made for our model architecture.
Scaling.
Performance curves with respect to the number of parameters in a model have been a keenly studied area in the context of LLMs. [KMH+20] established a power law like relationship between the number of parameters in a language model and its downstream performance, i.e., the more parameters, the better the performance. However, [HBM+22] established a more nuanced scaling law that lays down methods to train compute optimal models based on the number of tokens available in a training dataset.
We perform a preliminary scaling study where we train three TimesFM models of sizes 17M, 70M and 200M parameters, using the same pre-training dataset till 1.5M iterations with a global batch-size of 4096. Then we collect checkpoints that represent varying number of FLOPS (Floating Point OPerationS) across the different model runs. Then we plot the performance on Scaled MAE (GM) on Monash as a function of FLOPS, in Figure 3a. This is now a standard way to perform scaling studies in LLMs (see recent work like [GD23]). It can be clearly seen that the errors decrease monotonically with the number of FLOPS (in log scale). All experiments were performed on a TPUv5e setup with 16 tensor-cores. For the 200M model it takes 2 days to complete 1.5M iterations on our setup.
Autoregressive Decoding.
In recent long-term forecasting works [ZCZX23, NNSK22, DKL+23] it has been observed that directly predicting the entire forecasting horizon in one shot from a decoder can yield better results than autoregressive decoding on long horizon benchmarks. For a foundation model, the horizon length of the task is not known before inference time, therefore one-shot decoding might not be possible for very long horizons. However, as mentioned earlier, by keeping the output_patch_len longer than input_patch_len one can ensure fewer autoregressive steps. This was one of the key decisions in the design of TimesFM, that is quite different from LLMs. In order to showcase this we choose the task of predicting 512 time-steps into the future for the ETT datasets on the original rolling validation task of the ETT test sets [ZZP+21]. In Figure 3b, we present results from models with output_patch_len varying from 8 to 128. We see a monotonic decrease in average MAE with output_patch_len.
Input Patch Length.
The size of input_patch_len represents an important trade-off. We have typically seen that increasing its value from 8 to 32 increases performance, but having too high an input_patch_len is impractical since that makes the model shift from decoder only training more towards encoder-decoder style training. Note that in the "Training" paragraph of Section 4, we describe the mask sampling strategy to support any context length. If in the extreme case p is set to the maximum context length, we have to individually sample all possible context windows from 1 to the maximum context length, which would be required for encoder-decoder style of training.
In Figure 3c, we show the mean scaled MAE (GM) of the TimesFM(ZS) 70M model on Monash with input_patch_len varying from 8 to 128. Note that all models have been trained to about 1.5M steps even though the p = 8 model is three times slower to train. We can see that p = 16 and p = 32 give the best performance, with the error increasing towards either end. Note that the p = 32 model is almost twice as fast to train as the p = 16 model and thus constitutes a prudent choice.
Dataset Ablation.
Next we showcase the need for synthetic data. Intuitively, the majority of our real datasets have commonly found granularities like hourly, daily, etc., which have specific periodic patterns such as the 24 time-point period for hourly data. This can make the model not generalize well to underrepresented frequencies. We train a 200M model with no synthetic data added in the mix and showcase the performance on Monash and ETT datasets in Figure 3. It can be seen that there is a performance drop on Monash because many of the datasets in Monash have under-represented granularities like quarterly, yearly or 10 minutes. Perhaps even more compelling is the comparison on ETT datasets. We can see that there is almost no difference between the two models on the hourly ETTh datasets, which have a well represented granularity. However, for the 15 min ETTm datasets the model with synthetic data performs quite a bit better.
We provide a finetuning study in the same setting as [ZNW+23] in Appendix A.3, where our model performs better than all baselines on all the reported datasets. This shows the utility of our model on downstream tasks.
7. Conclusion
In this paper, we presented TimesFM, a practical foundation model for forecasting whose zero-shot performance comes close to the accuracy of fully-supervised forecasting models on a diverse set of time-series data. This model is pretrained on real-world and synthetic datasets comprising O(100B) timepoints. We discuss limitations and future work in more detail in Appendix A.1.
A. Appendix
A.1. Limitations and Future Work
Our work shows that we can train a 200M parameter pretrained forecasting model that has impressive zero-shot performance on a variety of real world forecasting benchmarks with different context and horizon lengths. In this section we would like to discuss limitations and future work.
Prompt Tuning.
In LLMs it is well known that prompt tuning techniques like chain-of-thought [WWS+22] can drastically improve performance in cases where the model is inaccurate with simple prompts. Such techniques are less clear for time-series foundation models. We can tune simple hyper-parameters like context length at the moment. However, with probabilistic forecasting we might be able to output different statistics as well as come up with techniques that align more with user’s expectations while not decreasing likelihood.
Probabilistic Forecasting.
It should be straightforward to train with probabilistic loss functions in our framework as detailed in the "Loss Function" part of Section 4. However, being one of the first works of building a single foundation model for forecasting, this was not our main focus and is left to future explorations. Note that as mentioned before we plan to release our model weights and after that such loss functions [SFGJ20, ADSS21] can be added during finetuning.
Covariate handling.
Currently the model is not pretrained with covariates, as one of the key challenges is finding large volumes of pretraining data with meaningful covariates (apart from date features). We also need methods to have a joint universal representation of covariates. Currently there are two simple techniques we can think of for handling covariates: (i) In a zero-shot setting at inference time we can predict in-context and linearly regress the residual on the covariates. Then our model plus the residual model can be used for forecasting in the horizon. (ii) During finetuning it is straightforward to handle covariates by adding them as inputs to the input and output residual blocks. Categorical variables can be added as embeddings.
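Technique (i) can be illustrated with a small sketch. This is only an illustration under assumptions, not the authors' implementation: it uses a single real-valued covariate, takes the model's in-context predictions as an already computed list, and fits the residual by ordinary least squares. All function and variable names are hypothetical.

```python
def fit_residual_model(y_context, y_hat_context, x_context):
    """Fit residual = intercept + slope * covariate by ordinary least squares."""
    residuals = [y - y_hat for y, y_hat in zip(y_context, y_hat_context)]
    n = len(residuals)
    x_mean = sum(x_context) / n
    r_mean = sum(residuals) / n
    sxx = sum((x - x_mean) ** 2 for x in x_context)
    sxr = sum((x - x_mean) * (r - r_mean) for x, r in zip(x_context, residuals))
    slope = sxr / sxx if sxx > 0 else 0.0  # constant covariate: no slope to fit
    intercept = r_mean - slope * x_mean
    return intercept, slope


def forecast_with_covariate(y_hat_horizon, x_horizon, intercept, slope):
    """Add the fitted residual model to the base forecasts in the horizon."""
    return [y_hat + intercept + slope * x for y_hat, x in zip(y_hat_horizon, x_horizon)]


# Toy data: the base (in-context) predictions miss an effect of 2.0 * covariate.
x_context = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
y_hat_context = [10.0, 10.0, 11.0, 11.0, 12.0, 12.0]
y_context = [y_hat + 2.0 * x for y_hat, x in zip(y_hat_context, x_context)]

intercept, slope = fit_residual_model(y_context, y_hat_context, x_context)
adjusted = forecast_with_covariate([13.0, 13.0], [1.0, 0.0], intercept, slope)
```

With several covariates the same idea applies with a multivariate least-squares fit in place of the closed-form single-variable solution.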
More finetuning studies.
We perform a finetuning study in Appendix A.3 following a prior work. However, a more in-depth study that involves finetuning in the presence of covariates would be beneficial. Since this is one of the first works on building a single foundation model for forecasting, this was not our main focus and is left to future explorations. Ideas in recent work such as [CLY+23] could be useful in this regard.
Other architectures.
Given the cost of training foundation models, we did not perform much hyper-parameter tuning in our pretraining, while following some well established best practices for training transformers. In a similar vein, it would also be interesting to try out alternatives such as all-MLP structures like [CLY+23] or efficient linear state space models like Mamba [GD23] (and references therein).
Interpretability.
Deep foundation models trained on huge corpora of data could be inherently less interpretable compared to statistical methods like ARIMA and ETS [BJ68]. In this regard, methods like LOCO and SHAP (see [VW23] and references therein) could be used to some extent to attribute feature importances to different lags in the context supplied to the model. However, this does not fully solve the problem, and one of the best things to do would be to open source a version of the model with a proper model card [MWZ+19].
A.2. Metrics
The metrics that are used for reporting results in this paper are:
Aggregating across datasets.
Since the datasets have wildly different scales, averaging unnormalized metrics like MAE across them is not meaningful. Therefore, following [GFQW23], we scale the metric of each baseline for a dataset by the same metric achieved by a naive baseline on that dataset. The naive baseline just makes the constant prediction y_L (the last value in the context) repeated across the prediction length. We did not need to do that for the Informer datasets, since on these datasets metrics are usually reported on standard normalized data [NNSK22].
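The scaling described above can be sketched as follows. This is a minimal illustration with made-up numbers, assuming one series per dataset and the arithmetic mean as the aggregate; the function names are hypothetical and not taken from the paper's code.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(context, horizon_len):
    """Repeat the last context value y_L across the prediction length."""
    return [context[-1]] * horizon_len


def scaled_mae(context, actual, predicted):
    """MAE of a model divided by the MAE of the naive baseline on the same data."""
    naive_error = mae(actual, naive_forecast(context, len(actual)))
    if naive_error == 0:
        raise ValueError("naive baseline is perfect; scaled MAE is undefined")
    return mae(actual, predicted) / naive_error


# Two toy datasets on very different scales (illustrative numbers only).
large_scale = scaled_mae([100.0, 110.0, 120.0], [130.0, 140.0], [128.0, 138.0])
small_scale = scaled_mae([1.0, 1.2], [1.4, 1.6], [1.3, 1.5])

# After scaling, the two values are comparable and can be averaged.
aggregate = (large_scale + small_scale) / 2
```

A scaled value below 1 means the model beats the naive baseline on that dataset, regardless of the dataset's units.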
A.3. Finetuning study on ETT
In this section, we test whether TimesFM can be finetuned on a small fraction of a dataset to provide even better performance. We follow the same protocol as in GPT4TS [ZNW+23] (see Table 13 in their paper). [ZNW+23] finetune GPT2 input and output blocks on long-term forecasting benchmarks on 10% of the original datasets and compare it against models trained from scratch on the same data. Then the models are evaluated on the original test set task of [ZZP+21]. We also tune the input and output residual blocks on 10% of the training set and present the results in Table 2. We can see that our model performs the best by a large margin. On ETTh1, ETTh2 and ETTm1 our finetuned model is 18%, 3% and 12% better than GPT4TS, respectively. In fact, our 10% finetuned model's performance is comparable to or better than that of most baselines trained on the whole training dataset, as reported in Table 14 of [ZNW+23]. This shows that the inductive biases encoded in our model weights by pretraining on a large time-series corpus are better for downstream forecasting tasks than an off-the-shelf language model like GPT2, even though our model is orders of magnitude smaller.
A.4. Pretraining PatchTST
Since TimesFM applies a similar patching strategy as PatchTST [NNSK22], for an ablation study we use the same data loader and pretrain a PatchTST model of 200M parameters to the same number of FLOPS as the final 200M TimesFM model. We denote it as PatchTST(ZS). The two models share the same hyperparameters of the transformer stack. For PatchTST(ZS) we use the same input patch length = 32, and a stride of length half of input patch size (i.e. stride = 16) as done in the original PatchTST paper.
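To make the patching arithmetic concrete, the sketch below counts the patches produced for a context of 512 points with patch length 32 and stride 16, as stated above. It is a simplified illustration: it ignores any end padding, and the stride-32 case is included only as an assumed non-overlapping comparison. The helper name is hypothetical.

```python
def patch_starts(context_len, patch_len, stride):
    """Start indices of all full patches that fit inside the context."""
    return list(range(0, context_len - patch_len + 1, stride))


overlapping = patch_starts(512, 32, 16)      # PatchTST(ZS) setting from the text
non_overlapping = patch_starts(512, 32, 32)  # assumed comparison: stride = patch length

print(len(overlapping), len(non_overlapping))  # 31 16
```

Halving the stride roughly doubles the number of tokens the transformer stack processes per series, which is one way to see why the two models cover different amounts of data at the same FLOP budget.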
We report the detailed results on Monash and ETT in Appendix A.5.2 and A.5.3. It can be seen that the results are not that good for PatchTST(ZS) on Monash. This is expected, since our pretraining data loader will predominantly have context lengths of 512 instead of the shorter context lengths found in Monash. Moreover, the PatchTST model does fewer iterations at the same number of FLOPS. On the ETT datasets, the PatchTST(ZS) model performs similarly to TimesFM(ZS) and PatchTST. This is also expected, since the context length for this study is indeed 512.
As PatchTST(ZS) is an encoder-decoder model, to pretrain it for zero shot forecasting one should theoretically prepare all possible context lengths and horizon lengths in the pretrain datasets. Pretraining it to its maximum performance requires much more compute and likely more careful tuning compared to pretraining TimesFM.
A.5. Additional Empirical Results
In this section, we provide more detailed tables for our zero-shot datasets and experiments described in Section 6.1. The AM based aggregated metrics are presented in Figure 4.
A.5.1 Darts
We present the MAE results individually for all 8 datasets in Table 3. It can be seen that TimesFM performs well on all datasets with clear seasonal patterns. On average we are within statistical significance of the best model. Note that there are only 8 time-series in total in Darts, and therefore these evaluations have very wide confidence intervals.
In Figure 8 we present visual comparisons of our forecasts vs some of the baselines.
A.5.2 Monash
In Table 4 we present the actual MAE numbers that are behind the main Figure 2a. In Figure 9, we present some examples of our zero-shot forecasts. For most datasets, we set the context window to be the maximum length of the series in the dataset capped at 512 (similar to the statistical models used in the official Monash baselines). For some datasets, we did some inference-time tuning of the context length, i.e. we predict the last horizon-length number of points in the training set with context lengths 32, 64 and the maximum allowed, and choose the best one in terms of this validation metric. This is fair, as most Monash DL baselines use different context lengths for different datasets during training and our model is completely zero-shot. The max context lengths used for these datasets are (cif 2016, 32), (tourism yearly, 32), (covid deaths, 32), (bitcoin, 32), (tourism monthly, 32) and (tourism monthly, 64).
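The inference-time tuning of the context length can be sketched as below. This is a hedged illustration, not the evaluation code used for the paper: `forecast_fn` is a hypothetical stand-in for any zero-shot forecaster, and a simple mean forecaster is used here so the example runs on its own.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def tune_context_length(train_series, horizon_len, forecast_fn,
                        candidates=(32, 64), max_context=512):
    """Pick the context length that best predicts the last horizon of the training set."""
    history, target = train_series[:-horizon_len], train_series[-horizon_len:]
    max_allowed = min(len(history), max_context)
    lengths = sorted({min(c, max_allowed) for c in candidates} | {max_allowed})
    scores = {}
    for length in lengths:
        prediction = forecast_fn(history[-length:], horizon_len)
        scores[length] = mae(target, prediction)
    best = min(scores, key=scores.get)
    return best, scores


def mean_forecaster(context, horizon_len):
    """Stand-in model (assumption): predict the context mean for every step."""
    return [sum(context) / len(context)] * horizon_len


# Toy series with a level shift, so a short context is the better choice.
train_series = [0.0] * 100 + [10.0] * 50
best_length, validation_scores = tune_context_length(train_series, 10, mean_forecaster)
```

Only training data is used for the selection, so the test window stays untouched and the forecaster itself is never updated.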
A.5.3 Informer
We present the MAE on the last split of the test set for all dataset, horizon pairs considered in Table 5. Owing to expensive evaluations for llmtime, the results are reported on the last test window of the original test split, as done in [GFQW23].
A.6 More Details on Models
We now present implementation details about TimesFM and other baselines.
TimesFM.
For our main 200M model we use 16 attention heads, 20 layers, an input patch length of 32 and an output patch length of 128. The model dimension is set to 1280. We train with layer norm and a cosine decay learning rate schedule with a peak learning rate of 5e-4. The hyper-parameters of TimesFM for various sizes are provided in Table 6. Note that the settings are for the base models and not the ablation models. The hidden dims of both the residual block and the FFN in the transformer layers are set to be the same as the model dimension. We keep layer norm in the transformer layers but not in the residual blocks.
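As a rough sanity check, the stated hyper-parameters are consistent with a model of about 200M parameters. The back-of-the-envelope count below assumes standard multi-head attention (query, key, value and output projections) and a two-layer FFN, and it ignores biases, layer norm parameters and the input and output residual blocks, so it is an approximation rather than the exact count.

```python
def transformer_stack_params(num_layers, model_dim, ffn_hidden_dim):
    """Approximate weight count of the transformer stack (weight matrices only)."""
    attention = 4 * model_dim * model_dim  # Q, K, V and output projections
    ffn = 2 * model_dim * ffn_hidden_dim   # up and down projections
    return num_layers * (attention + ffn)


# 20 layers, model dimension 1280, FFN hidden dim equal to the model dimension.
total = transformer_stack_params(num_layers=20, model_dim=1280, ffn_hidden_dim=1280)
head_dim = 1280 // 16  # 16 attention heads

print(f"{total:,}")  # 196,608,000
```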
Monash Baselines.
The raw metrics for the Monash baselines are directly taken from Tables 9 and 11 of the supplementary material of the original paper [GBW+21]. For llmtime, we use the precomputed outputs provided by the authors of [GFQW23].
Darts Baselines.
For all the Darts baselines we use the precomputed outputs provided by the authors of [GFQW23]. For more details please see Section C.1 in that paper.
Informer Baselines.
For FEDFormer [ZMW+22], Autoformer [WXWL21], Informer [ZZP+21] and PatchTST [NNSK22] we use the original hyperparameters and implementation. The results presented in the main paper are obtained on the last test window, whose length equals the horizon length, as stated in the llmtime [GFQW23] paper. We generate the llmtime predictions using the code provided by the authors, adapted to the ETT datasets. Note that as of January 2024, OpenAI has discontinued access to GPT-3; therefore we had to use the GPT-3.5-Turbo model.
A.7 Date Features
As we mentioned earlier, since we are building a single pre-trained model, we cannot have dataset specific dynamic or static covariates during training time. However, the datetime column is ubiquitous in all time-series data, so we can technically have date derived features like day of the week, month of the year, etc. processed into a vector at each time-point t, denoted by x_t ∈ R^r.
If so, the learning task can be rewritten as
There are many options to incorporate these features into the model, one being to directly concatenate them after the time-points in each patch. For this paper we decide to focus on the univariate time-series input, and will investigate this enhancement in the future.
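The concatenation option mentioned above can be illustrated with a small sketch. The choice of features (day of week, month, hour) and their scaling to [-0.5, 0.5] are assumptions made for the example, as are the function names; the paper does not specify them here.

```python
from datetime import datetime, timedelta


def date_features(timestamp):
    """Map a timestamp to a feature vector x_t of size r = 3, scaled to [-0.5, 0.5]."""
    return [
        timestamp.weekday() / 6.0 - 0.5,      # day of the week
        (timestamp.month - 1) / 11.0 - 0.5,   # month of the year
        timestamp.hour / 23.0 - 0.5,          # hour of the day
    ]


def append_features_to_patch(patch_values, patch_timestamps):
    """Concatenate the date features of every time-point after the patch values."""
    features = []
    for ts in patch_timestamps:
        features.extend(date_features(ts))
    return list(patch_values) + features


# A toy hourly patch of 4 time-points starting on Monday 2024-01-01 00:00.
start = datetime(2024, 1, 1, 0)
timestamps = [start + timedelta(hours=h) for h in range(4)]
patch_input = append_features_to_patch([0.1, 0.2, 0.3, 0.4], timestamps)
```

With patch length p and r features per time-point, each patch input grows from p to p(1 + r) values, which the input residual block would then have to accept.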
A.8 Synthetic Data
We create the synthetic data to reflect common time-series patterns using traditional statistical models. We start with four simple time series patterns:
• Piece-wise linear trends (I), where the number of the piece-wise linear components is randomly chosen between 2 and 8.
• ARMA(p, q) (II), where 1 ≤ p, q ≤ 8 and the corresponding coefficients are generated from either a multivariate Gaussian or a uniform distribution, then normalized.
• Seasonal patterns. In particular we create the sine (III) and the cosine (IV) waves of different random periods between 4 and max context length / 2 time-points and time delays.
We then randomly enable / disable these four components (I) - (IV), generate their time-series of length 2048 respectively, and sum them up using uniformly sampled random weights to create each time series in the synthetic datasets. When the trend component is chosen, we apply it multiplicatively 50% of the time.
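The recipe above can be sketched as follows. Several details are not given in the text and are assumptions in this sketch: each component is enabled with probability 0.5, the maximum context length is taken to be 512, slopes are drawn from a small fixed range, ARMA coefficients are normalized so that their absolute values sum to 0.9, and the multiplicative trend scales the series by (1 + weighted trend). All names are hypothetical.

```python
import math
import random

LENGTH = 2048       # series length stated in the text
MAX_CONTEXT = 512   # assumption: the "max context length" bounding the periods


def piecewise_linear(rng, n=LENGTH):
    """Component (I): continuous trend with 2 to 8 linear pieces."""
    pieces = rng.randint(2, 8)
    cuts = sorted(rng.sample(range(1, n), pieces - 1))
    bounds = [0] + cuts + [n]
    out, level = [], 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        slope = rng.uniform(-0.01, 0.01)  # assumed slope range
        for _ in range(end - start):
            out.append(level)
            level += slope
    return out


def arma(rng, n=LENGTH):
    """Component (II): ARMA(p, q) with 1 <= p, q <= 8."""
    p, q = rng.randint(1, 8), rng.randint(1, 8)
    use_gaussian = rng.random() < 0.5

    def coefficients(k):
        raw = [rng.gauss(0, 1) if use_gaussian else rng.uniform(-1, 1) for _ in range(k)]
        scale = sum(abs(c) for c in raw) or 1.0
        # Assumed normalization: absolute values sum to 0.9, keeping the AR part stable.
        return [0.9 * c / scale for c in raw]

    ar, ma = coefficients(p), coefficients(q)
    noise = [rng.gauss(0, 1) for _ in range(n)]
    out = []
    for t in range(n):
        value = noise[t]
        value += sum(ar[i] * out[t - 1 - i] for i in range(min(p, t)))
        value += sum(ma[j] * noise[t - 1 - j] for j in range(min(q, t)))
        out.append(value)
    return out


def wave(rng, fn, n=LENGTH):
    """Components (III) and (IV): sine or cosine with random period and delay."""
    period = rng.uniform(4, MAX_CONTEXT / 2)
    delay = rng.uniform(0, period)
    return [fn(2 * math.pi * (t - delay) / period) for t in range(n)]


def synthetic_series(seed=None):
    """Randomly enable components and combine them with uniform random weights."""
    rng = random.Random(seed)
    total = [0.0] * LENGTH
    for make in (arma, lambda r: wave(r, math.sin), lambda r: wave(r, math.cos)):
        if rng.random() < 0.5:
            weight = rng.random()
            total = [acc + weight * v for acc, v in zip(total, make(rng))]
    if rng.random() < 0.5:  # trend component (I) enabled
        weight = rng.random()
        trend = piecewise_linear(rng)
        if rng.random() < 0.5:
            # Assumed multiplicative form: scale the series by (1 + weighted trend).
            total = [acc * (1.0 + weight * tr) for acc, tr in zip(total, trend)]
        else:
            total = [acc + weight * tr for acc, tr in zip(total, trend)]
    return total


series = synthetic_series(seed=0)
```

Passing a seed makes each series reproducible. In this sketch a draw that enables no component, or only a multiplicative trend, yields an all-zero series, which a real pipeline would likely filter out or resample.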
A.9 Illustrative Examples
We conduct a visual inspection of the forecasts generated by TimesFM, first on some synthetic examples and then on the benchmark datasets.
In Figure 5 we show 4 different synthetic curves: (1) sum of 5 sine curves of different periods, (2) a sine curve linearly scaled, (3) a sine curve with a linear trend, and (4) minimum of two sine curves with a linear trend. Our results suggest that TimesFM picks up the trend and seasonal components readily interpretable by humans, while ARIMA and (to a lesser extent) llmtime fail in some of the instances.
As illustrated in Figure 6, TimesFM also effectively captures these subtle characteristics within both the trend and seasonal patterns of the depicted real world time-series. For instance, in the Air Passenger dataset, TimesFM correctly captures the amplitude increase with trend; this is also reflected by the fact that it attains the best MAE on this dataset (see Table 3). In the traffic hourly example on the left, it can be seen that TimesFM can correctly identify the seasonal peaks even in the presence of outliers in the context, while llmtime is thrown off.
We provide more visualization in Figure 7, Figure 8 and Figure 9.