  • [TSMixer] An All-MLP Architecture for Time Series Forecasting
    Paper Writing 1/Related_Work 2024. 11. 4. 00:43

    (Mar 2023)

    https://arxiv.org/pdf/2303.06053

    https://github.com/google-research/google-research/tree/master/tsmixer


    Abstract

    Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Extending them, in this paper, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light into the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting. The implementation is available at: https://github.com/google-research/google-research/tree/master/tsmixer .


    1. Introduction

    Time series forecasting is a prevalent problem in numerous real-world use cases, such as forecasting product demand (Böse et al., 2017; Courty & Li, 1999), pandemic spread (Zhang & Nawata, 2018), and inflation rates (Capistrán et al., 2010). The forecastability of time series data often originates from three major aspects:

     

    Persistent temporal patterns: encompassing trends and seasonal patterns, e.g., long-term inflation, day-of-week effects;

     

    Cross-variate information: correlations between different variables, e.g., an increase in blood pressure associated with a rise in body weight;

     

    Auxiliary features: comprising static features and future information, e.g., product categories and promotional events.

     

    Traditional models, such as ARIMA (Box et al., 1970), are designed for univariate time series, where only temporal information is available. Therefore, they face limitations when dealing with challenging real-world data, which often contains complex cross-variate information and auxiliary features. In contrast, numerous deep learning models, particularly Transformer-based models, have been proposed due to their capacity to capture both complex temporal patterns and cross-variate dependencies (Gamboa, 2017; Li et al., 2019; Wen et al., 2017; Zhou et al., 2021; Wu et al., 2021; Lim & Zohren, 2021; Liu et al., 2022a; Zhou et al., 2022b; Liu et al., 2022b; Zhou et al., 2022a).

     

    The natural intuition is that multivariate models, such as those based on Transformer architectures, should be more effective than univariate models due to their ability to leverage cross-variate information. However, Zeng et al. (2023) revealed that this is not always the case – Transformer-based models can indeed be significantly worse than simple univariate temporal linear models on many commonly used forecasting benchmarks. The multivariate models seem to suffer from overfitting especially when the target time series is not correlated with other covariates. This surprising finding has raised two essential questions:

     

    1. Does cross-variate information truly provide a benefit for time series forecasting?

     

    2. When cross-variate information is not beneficial, can multivariate models still perform as well as univariate models?

     

    To address these questions, we begin by analyzing the effectiveness of temporal linear models. Our findings indicate that their time-step-dependent characteristics render temporal linear models great candidates for learning temporal patterns under common assumptions. Consequently, we gradually increase the capacity of linear models by

     

    1. stacking temporal linear models with non-linearities (TMix-Only),

     

    2. introducing cross-variate feed-forward layers (TSMixer).

     

    The resulting TSMixer alternately applies MLPs across time and feature dimensions, conceptually corresponding to time-mixing and feature-mixing operations, efficiently capturing both temporal patterns and cross-variate information, as illustrated in Fig. 1. The residual designs ensure that TSMixer retains the capacity of temporal linear models while still being able to exploit cross-variate information.

     

    We evaluate TSMixer on commonly used long-term forecasting datasets (Wu et al., 2021) where univariate models have outperformed multivariate models. Our ablation study demonstrates the effectiveness of stacking temporal linear models and validates that cross-variate information is less beneficial on these popular datasets, explaining the superior performance of univariate models. Even so, TSMixer is on par with state-of-the-art univariate models and significantly outperforms other multivariate models.

     

    To demonstrate the benefit of multivariate models, we further evaluate TSMixer on the challenging M5 benchmark, a large-scale retail dataset used in the M-competition (Makridakis et al., 2022). M5 contains crucial cross-variate interactions such as sell prices (Makridakis et al., 2022). The results show that cross-variate information indeed brings significant improvement, and TSMixer can effectively leverage this information. Furthermore, we propose a principled design to extend TSMixer to handle auxiliary information such as static features and future time-varying features. It aligns the different types of features into the same shape and then applies mixer layers on the concatenated features to leverage the interactions between them. In this more practical and challenging setting, TSMixer outperforms models that are popular in industrial applications, including DeepAR (Salinas et al. 2020, Amazon SageMaker) and TFT (Lim et al. 2021, Google Cloud Vertex), demonstrating its strong potential for real world impact.

     

    We summarize our contributions as below:

     

    • We analyze the effectiveness of state-of-the-art linear models and indicate that their time-step-dependent characteristics make them great candidates for learning temporal patterns under common assumptions.

     

    • We propose TSMixer, an innovative architecture which retains the capacity of linear models to capture temporal patterns while still being able to exploit cross-variate information.

     

    • We point out the potential risk of evaluating multivariate models on common long-term forecasting benchmarks.

     

    • Our empirical studies demonstrate that TSMixer is the first multivariate model which is on par with univariate models on common benchmarks and achieves state-of-the-art on a large-scale industrial application where cross-variate information is crucial.


    2. Related Work

     

    Broadly, time series forecasting is the task of predicting future values of a variable or multiple related variables, given a set of historical observations. Deep neural networks have been widely investigated for this task (Zhang et al., 1998; Kourentzes, 2013; Lim & Zohren, 2021). In Table 1 we coarsely split notable works into three categories based on the information considered by the model: (I) univariate forecasting, (II) multivariate forecasting, and (III) multivariate forecasting with auxiliary information.

     

    Multivariate time series forecasting with deep neural networks has become increasingly popular, motivated by the idea that modeling the complex relationships between covariates should improve forecasting performance. Transformer-based models (Category II) are common choices for this scenario because of their superior performance in modeling long and complex sequential data (Vaswani et al., 2017). Many Transformer variants have been proposed to further improve efficiency and accuracy. Informer (Zhou et al., 2021) and Autoformer (Wu et al., 2021) tackle the efficiency bottleneck with different attention designs that use less memory for long-term forecasting. FEDformer (Zhou et al., 2022b) and FiLM (Zhou et al., 2022a) decompose the sequences using the Fast Fourier Transform for better extraction of long-term information. There are also extensions that address specific challenges, such as non-stationarity (Kim et al., 2022; Liu et al., 2022b). Despite the advances in Transformer-based models for multivariate forecasting, Zeng et al. (2023) show the counter-intuitive result that a simple univariate linear model (Category I), which treats multivariate data as several univariate sequences, can outperform all of the proposed multivariate Transformer models by a significant margin on commonly used long-term forecasting benchmarks. Similarly, Nie et al. (2023) advocate against modeling cross-variate information, propose a univariate patch Transformer for multivariate forecasting tasks, and show state-of-the-art accuracy on multiple datasets. In contrast, as one of our core contributions, we find that this conclusion mainly stems from dataset bias and might not generalize well to some real-world applications.

     

    There are other works that consider the scenario in which auxiliary information (Category III), such as static features (e.g. location) and future time-varying features (e.g. promotion in coming weeks), is available. Commonly used forecasting models have been extended to handle these auxiliary features. These include state-space models (Rangapuram et al., 2018; Alaa & van der Schaar, 2019; Gu et al., 2022), RNN variants (Wen et al., 2017; Salinas et al., 2020), and attention models (Lim et al., 2021). Most real-world time-series datasets are more aligned with this setting, which is why these deep learning models have achieved great success in various applications and are widely used in industry (e.g. DeepAR (Salinas et al., 2020) of AWS SageMaker and TFT (Lim et al., 2021) of Google Cloud Vertex). One drawback of these models is their complexity, particularly when compared to the aforementioned univariate models.

     

    Our motivations for TSMixer stem from analyzing the performance of linear models for time series forecasting. Similar architectures have been considered for other data types before; for example, the proposed TSMixer in some ways resembles the well-known MLP-Mixer architecture from computer vision (Tolstikhin et al., 2021). Mixer models have also been applied to text (Fusco et al., 2022), speech (Tatanov et al., 2022), network traffic (Zheng et al., 2022) and point clouds (Choe et al., 2022). Yet, to the best of our knowledge, the use of an MLP-Mixer-based architecture for time series forecasting has not been explored in the literature.


    4. TSMixer Architecture

     

    Expanding upon our finding that linear models can serve as strong candidates for capturing time dependencies, we initially propose a natural enhancement by stacking linear models with non-linearities to form multi-layer perceptrons (MLPs). Common deep learning techniques, such as normalization and residual connections, are applied to facilitate efficient learning. However, this architecture does not take cross-variate information into account.

     

    To better leverage cross-variate information, we propose the application of MLPs in the time-domain and the feature-domain in an alternating manner. The time-domain MLPs are shared across all of the features, while the feature-domain MLPs are shared across all of the time steps. This resulting model is akin to the MLP-Mixer architecture from computer vision (Tolstikhin et al., 2021), with time-domain and feature-domain operations representing time-mixing and feature-mixing operations, respectively. Consequently, we name our proposed architecture Time-Series Mixer (TSMixer).

     

    The interleaving design between these two operations efficiently utilizes both temporal dependencies and cross-variate information while limiting computational complexity and model size. It allows TSMixer to use a long lookback window (see Sec. 3) while keeping parameter growth at only O(L + C), instead of the O(LC) that fully-connected MLPs would require. To better understand the utility of cross-variate information and feature-mixing, we also consider a simplified variant of TSMixer that only employs time-mixing, referred to as TMix-Only, which consists of a residual MLP shared across variates, as illustrated in Fig. 3. We also present the extension of TSMixer to scenarios where auxiliary information about the time series is available.
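
    To make the size argument concrete, the sketch below counts dense-layer weights and biases for one mixer block and for a single dense layer on the flattened input. The values of L, C, and the hidden width are illustrative assumptions, and the helper names are hypothetical. In this count the time-mixing layer depends only on L and the feature-mixing MLP only on C and the hidden width, whereas the flattened layer couples L and C through their product.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


def mixer_block_params(L, C, hidden):
    """One time-mixing layer (L -> L) plus a two-layer feature-mixing MLP (C -> hidden -> C)."""
    time_mixing = dense_params(L, L)
    feature_mixing = dense_params(C, hidden) + dense_params(hidden, C)
    return time_mixing + feature_mixing


def flattened_dense_params(L, C):
    """One dense layer mapping the flattened (L * C) input to an (L * C) output."""
    return dense_params(L * C, L * C)


# Illustrative sizes only (assumed, not taken from the experiments).
L, C, hidden = 96, 7, 64
print(mixer_block_params(L, C, hidden))  # 10279
print(flattened_dense_params(L, C))      # 452256
```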


    4.1. TSMixer for Multivariate Time Series Forecasting

    For multivariate time series forecasting where only historical data are available, TSMixer applies MLPs alternately in the time and feature domains. The architecture is illustrated in Fig. 1. TSMixer comprises the following components:

     

    Time-mixing MLP: Time-mixing MLPs model temporal patterns in time series. They consist of a fully-connected layer followed by an activation function and dropout. The input is transposed so that the fully-connected layer is applied along the time domain and shared across features. We employ a single-layer MLP because, as demonstrated in Sec. 3, a simple linear model already proves to be a strong model for learning complex temporal patterns.

     

    Feature-mixing MLP: Feature-mixing MLPs are shared by time steps and serve to leverage covariate information. Similar to Transformer-based models, we consider two-layer MLPs to learn complex feature transformations.

     

    Temporal Projection: Temporal projection, identical to the linear models in Zeng et al. (2023), is a fully-connected layer applied along the time domain. It not only learns the temporal patterns but also maps the time series from the original input length L to the target forecast length T.

     

    Residual Connections: We apply residual connections between each time-mixing and feature-mixing layer. These connections allow the model to learn deeper architectures more efficiently and allow the model to effectively ignore unnecessary time-mixing and feature-mixing operations.

     

    Normalization: Normalization is a common technique to improve deep learning model training. While the preference between batch normalization and layer normalization is task-dependent, Nie et al. (2023) demonstrate the advantages of batch normalization on common time series datasets. In contrast to typical normalization applied along the feature dimension, we apply 2D normalization on both time and feature dimensions due to the presence of time-mixing and feature-mixing operations.

     

    In contrast to some recent Transformer advances with increased complexity, the architecture of TSMixer is relatively simple to implement. Despite its simplicity, we demonstrate in Sec. 5 that TSMixer remains competitive with state-of-the-art models on representative benchmarks.
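
    The components listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: dropout and normalization are omitted, ReLU is assumed as the activation, and the function names, the `random_block` initializer, and all sizes are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def time_mixing(x, w_time, b_time):
    """x: (L, C). One dense layer along time, shared by all features."""
    h = relu(x.T @ w_time + b_time)   # transpose to (C, L), mix the L time steps
    return x + h.T                    # residual connection, back to (L, C)


def feature_mixing(x, w1, b1, w2, b2):
    """x: (L, C). Two-layer MLP along features, shared by all time steps."""
    h = relu(x @ w1 + b1)             # (L, hidden)
    return x + (h @ w2 + b2)          # residual connection, (L, C)


def temporal_projection(x, w_proj, b_proj):
    """Dense layer along time mapping lookback length L to horizon T."""
    return (x.T @ w_proj + b_proj).T  # (T, C)


def tsmixer_forward(x, blocks, w_proj, b_proj):
    """Stack of mixer blocks followed by the temporal projection."""
    for p in blocks:
        x = time_mixing(x, p["w_time"], p["b_time"])
        x = feature_mixing(x, p["w1"], p["b1"], p["w2"], p["b2"])
    return temporal_projection(x, w_proj, b_proj)


def random_block(rng, L, C, hidden, scale=0.1):
    """Hypothetical initializer, used only to make the example runnable."""
    return {
        "w_time": scale * rng.standard_normal((L, L)), "b_time": np.zeros(L),
        "w1": scale * rng.standard_normal((C, hidden)), "b1": np.zeros(hidden),
        "w2": scale * rng.standard_normal((hidden, C)), "b2": np.zeros(C),
    }


rng = np.random.default_rng(0)
L, C, T, hidden = 35, 4, 28, 8        # assumed sizes for illustration
x = rng.standard_normal((L, C))
blocks = [random_block(rng, L, C, hidden) for _ in range(2)]
w_proj, b_proj = 0.1 * rng.standard_normal((L, T)), np.zeros(T)
y = tsmixer_forward(x, blocks, w_proj, b_proj)
print(y.shape)  # (28, 4)
```

    With all mixing weights and biases set to zero, each block returns its input unchanged, which is one way to see how the residual connections let the model ignore a mixing operation it does not need.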


    4.2. Extended TSMixer for Time Series Forecasting with Auxiliary Information
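
    The introduction describes this extension as aligning the different feature types into the same shape and then applying mixer layers to the concatenated features. The sketch below illustrates only the alignment step, under simplifying assumptions: historical features are mapped from L to T steps with a dense layer along time, static features are repeated for each of the T steps, and the three blocks are concatenated along the feature axis. The function name, argument names, and feature counts are hypothetical; L = 35 and T = 28 follow the M5 setup described in Sec. 5.

```python
import numpy as np


def align_features(x_hist, z_future, s_static, w_proj, b_proj):
    """Bring all inputs to T time steps and concatenate along features.

    x_hist:   (L, C_x) historical features
    z_future: (T, C_z) known future time-varying features
    s_static: (C_s,)   static features
    """
    T = z_future.shape[0]
    x_aligned = (x_hist.T @ w_proj + b_proj).T   # (T, C_x): dense layer along time
    s_aligned = np.tile(s_static, (T, 1))        # (T, C_s): repeat for every step
    return np.concatenate([x_aligned, z_future, s_aligned], axis=1)


rng = np.random.default_rng(0)
L, T = 35, 28
x_hist = rng.standard_normal((L, 3))             # 3 historical features (assumed)
z_future = rng.standard_normal((T, 2))           # 2 future features (assumed)
s_static = np.array([1.0, 0.0, 0.0, 1.0])        # 4 static features (assumed)
w_proj, b_proj = 0.1 * rng.standard_normal((L, T)), np.zeros(T)
aligned = align_features(x_hist, z_future, s_static, w_proj, b_proj)
print(aligned.shape)  # (28, 9)
```

    The aligned array has one row per forecast step, so the same time-mixing and feature-mixing layers can then operate on historical, future, and static information together.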


    4.3. Differences between TSMixer and MLP-Mixer

    While TSMixer shares architectural similarities with MLP-Mixer, the development of TSMixer, motivated by our analysis in Section 3, has led to a unique normalization approach. In TSMixer, two dimensions represent features and time steps, unlike MLP-Mixer’s features and patches. Consequently, we apply 2D normalization to maintain scale across features and time steps, since we have discovered the importance of utilizing temporal patterns in forecasting. Besides, we have proposed an extended version of TSMixer to better extract information from heterogeneous inputs, essential to achieve state-of-the-art results in real-world scenarios.
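
    One way to read the difference between feature-only and 2D normalization is sketched below. It is an assumption-laden illustration: statistics are computed per sample in a layer-normalization style, learnable scale and offset are omitted, and a batch-normalization variant would use batch statistics instead. The function names are hypothetical.

```python
import numpy as np


def norm_over_features(x, eps=1e-5):
    """Standardize each time step separately across its C features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def norm_over_time_and_features(x, eps=1e-5):
    """Standardize one (L, C) sample jointly across both dimensions."""
    mean = x.mean(axis=(-2, -1), keepdims=True)
    var = x.var(axis=(-2, -1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# A toy series whose level grows over time: row k is k * [1, 2, 3].
x = np.outer(np.arange(1.0, 5.0), np.array([1.0, 2.0, 3.0]))
per_step = norm_over_features(x)
joint = norm_over_time_and_features(x)
print(per_step.mean(axis=1))  # every row is centered, so the trend is gone
print(joint.mean(axis=1))     # row means still increase with time
```

    In this toy case, normalizing each time step on its own maps every row to nearly the same values and erases the upward trend, while joint normalization keeps the relative scale across time steps.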


    5. Experiments

    We evaluate TSMixer on seven popular multivariate long-term forecasting benchmarks and a large-scale real-world retail dataset, M5 (Makridakis et al., 2022). The long-term forecasting datasets cover various applications such as weather, electricity, and traffic, and are comprised of multivariate time series without auxiliary information. The M5 dataset is for the competition task of predicting the sales of various items at Walmart. It is a large scale dataset containing 30,490 time series with static features such as store locations, as well as time-varying features such as campaign information. This complexity renders M5 a more challenging benchmark to explore the potential benefits of cross-variate information and auxiliary features. The statistics of these datasets are presented in Table 2.

     

    For the M5 dataset, we mostly follow the data processing from Alexandrov et al. (2020). We consider the prediction length of T = 28 (same as the competition), and set the input length to L = 35. We optimize the log-likelihood of a negative binomial distribution, as suggested by Salinas et al. (2020). We follow the competition’s protocol (Makridakis et al., 2022) to aggregate the predictions at different levels and evaluate them using the weighted root mean squared scaled error (WRMSSE). More details about the experimental setup and hyperparameter tuning can be found in Appendices C and E.
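
    As a rough illustration of the metric, the sketch below computes the root mean squared scaled error (RMSSE) of one series, which scales the forecast error by the in-sample error of a one-step naive forecast, and then a weighted sum over series. It assumes the weights are given; the official evaluation adds details, such as how the weights and aggregation levels are defined, that are not reproduced here. The function names and toy numbers are hypothetical.

```python
import math


def rmsse(history, actual, forecast):
    """RMSSE of one series: forecast MSE scaled by the naive one-step in-sample MSE."""
    n, h = len(history), len(actual)
    scale = sum((history[t] - history[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h
    return math.sqrt(mse / scale)


def wrmsse(series, weights):
    """Weighted sum of per-series RMSSE; series is a list of (history, actual, forecast)."""
    return sum(w * rmsse(*s) for s, w in zip(series, weights))


series = [
    ([0, 2, 0, 2], [3, 3], [1, 5]),   # forecast MSE 4, naive in-sample MSE 4
    ([0, 2, 0, 2], [3, 3], [3, 3]),   # perfect forecast
]
print(rmsse(*series[0]))              # 1.0
print(wrmsse(series, [0.25, 0.75]))   # 0.25
```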


    5.2. Large-scale Demand Forecasting

    We evaluate TSMixer on the large-scale retail dataset M5 to explore the model’s ability to leverage complicated cross-variate information and auxiliary features. M5 comprises thousands of multivariate time series, each with its own historical observations, future time-varying features, and static features, in contrast to the long-term forecasting benchmarks, which typically consist of a single multivariate historical time series. We utilize TSMixer-Ext, the architecture introduced in Sec. 4.2, to leverage the auxiliary information. Furthermore, the presence of a high proportion of zeros in the target sequence presents an additional challenge for prediction. Therefore, we learn negative binomial distributions, as suggested by Salinas et al. (2020), to better fit the distribution.
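
    To illustrate why a negative binomial output suits sparse count data, the sketch below evaluates its log-probability in the mean and shape parameterization (mean mu, shape alpha, variance mu + alpha * mu^2) used in the DeepAR paper. Which parameterization the experiments here use is an assumption, and the function name and example values are hypothetical. A training loss would be the negative sum of these log-probabilities over the observed counts.

```python
import math


def negative_binomial_logpmf(z, mu, alpha):
    """Log-probability of count z for mean mu > 0 and shape alpha > 0."""
    r = 1.0 / alpha
    return (
        math.lgamma(z + r) - math.lgamma(z + 1) - math.lgamma(r)
        + r * math.log(1.0 / (1.0 + alpha * mu))
        + z * math.log(alpha * mu / (1.0 + alpha * mu))
    )


mu, alpha = 0.5, 2.0                                   # assumed example values
p_zero_nb = math.exp(negative_binomial_logpmf(0, mu, alpha))
p_zero_poisson = math.exp(-mu)                         # Poisson with the same mean
print(round(p_zero_nb, 4))       # 0.7071
print(round(p_zero_poisson, 4))  # 0.6065
```

    For the same mean, the negative binomial in this example assigns more probability to zero than a Poisson distribution does, which matches the motivation of fitting targets with many zeros.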

    Forecast with Historical Features Only

    First, we compare TSMixer with other baselines using historical features only. As shown in Table 4, the multivariate models perform much better than univariate models on this dataset. Notably, PatchTST, which is designed to ignore cross-variate information, performs significantly worse than multivariate TSMixer and FEDformer. This result underscores the importance of modeling cross-variate information on some forecasting tasks, as opposed to the argument in Nie et al. (2023). Furthermore, TSMixer substantially outperforms FEDformer, a state-of-the-art multivariate model.

     

    TSMixer exhibits a unique value as it is the only model that performs as well as univariate models when cross-variate information is not useful, and it is the best model to leverage cross-variate information when it is useful.

    Forecast with Auxiliary Information

    To understand the extent to which TSMixer can leverage auxiliary information, we compare TSMixer against established time series forecasting algorithms, TFT (Lim et al., 2021) and DeepAR (Salinas et al., 2020). Table 5 shows that with auxiliary features TSMixer outperforms all other baselines by a significant margin. This result demonstrates the superior capability of TSMixer for modeling complex cross-variate information and effectively leveraging auxiliary features, an impactful capability for real-world time-series data beyond long-term forecasting benchmarks. We also conduct ablation studies by removing the static features and future time-varying features. The results demonstrate that while the impact of static features is more prominent, both static and future time-varying features contribute to the overall performance of TSMixer. This further emphasizes the importance of incorporating auxiliary features in time series forecasting models.


    6. Conclusions

    We propose TSMixer, a novel architecture for time series forecasting that is designed using MLPs instead of commonly used RNNs and attention mechanisms to obtain superior generalization with a simple architecture. Our results on a wide range of real-world time series forecasting tasks demonstrate that TSMixer is highly effective on both long-term forecasting benchmarks for multivariate time series and real-world large-scale retail demand forecasting tasks. Notably, TSMixer is the only multivariate model that is able to achieve similar performance to univariate models on long-term time series forecasting benchmarks. The TSMixer architecture has significant potential for further improvement and we believe it will be useful in a wide range of time series forecasting tasks. Potential future work includes further exploring the interpretability of TSMixer, as well as its scalability to even larger datasets. We hope this work will pave the way for more innovative architectures for time series forecasting.


    C.2. M5 dataset

    We obtain the M5 dataset from Kaggle. Please refer to the participants' guide for details about the competition and the dataset. We refer to the example script in GluonTS (Alexandrov et al., 2020) and the repository of the third-place solution in the competition to implement our basic feature engineering. We list the features used in our experiments in Table 7.

     

    Our implementation is based on GluonTS. We use the TFT and DeepAR implementations provided in GluonTS, and implement PatchTST, FEDformer, and our TSMixer ourselves. We modified these models where necessary to optimize the negative binomial distribution, as suggested by the DeepAR paper (Salinas et al., 2020). We train each model for a maximum of 300 epochs and employ early stopping if the validation loss has not improved after 30 epochs. We noticed that optimizing other objective functions might give significantly worse results when evaluated with WRMSSE. To obtain more stable results, for all models, we take the top 8 hyperparameter settings based on validation WRMSSE, train each for an additional 4 trials (5 trials in total), select the best hyperparameters based on their mean validation WRMSSE, and then report the evaluation results on the test set. The hyperparameter settings can be found in Appendix E.
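
    The selection protocol described above can be sketched as follows. This is a hedged illustration, not the authors' tuning code: `select_hyperparameters`, `run_trial`, and the toy scores are hypothetical, and the example uses a shortlist of 2 instead of 8 to stay small.

```python
from statistics import mean


def select_hyperparameters(first_trial, run_trial, top_k=8, extra_trials=4):
    """Pick the setting with the lowest mean validation WRMSSE over repeated trials.

    first_trial: dict mapping a setting name to its validation WRMSSE from one run.
    run_trial:   callable(setting, trial_index) -> validation WRMSSE of a fresh run.
    """
    shortlist = sorted(first_trial, key=first_trial.get)[:top_k]   # lower is better
    mean_scores = {}
    for setting in shortlist:
        scores = [first_trial[setting]]
        scores += [run_trial(setting, i) for i in range(1, extra_trials + 1)]
        mean_scores[setting] = mean(scores)     # 1 + extra_trials runs per setting
    best = min(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy scores: "a" wins the first run but is worse on average across reruns.
first_trial = {"a": 0.60, "b": 0.62, "c": 0.90}
rerun_scores = {"a": 0.70, "b": 0.60}


def run_trial(setting, trial_index):
    return rerun_scores[setting]


best, mean_scores = select_hyperparameters(first_trial, run_trial, top_k=2)
print(best)  # b
```

    Averaging over several trials guards against choosing a setting that looked best only because of one lucky run, as the toy example shows.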


     
