  • [MOMENT] A Family of Open Time-series Foundation Models
    Paper Writing 1/Related_Work 2024. 10. 30. 17:16

    (Feb 2024, ICML 2024)

    https://arxiv.org/pdf/2402.03885

    https://github.com/moment-timeseries-foundation-model/moment


    Abstract

    We introduce MOMENT, a family of open-source foundation models for general-purpose time series analysis. Pre-training large models on time series data is challenging due to (1) the absence of a large and cohesive public time series repository, and (2) diverse time series characteristics which make multi-dataset training onerous. Additionally, (3) experimental benchmarks to evaluate these models, especially in scenarios with limited resources, time, and supervision, are still in their nascent stages. To address these challenges, we compile a large and diverse collection of public time series, called the Time Series Pile, and systematically tackle time series-specific challenges to unlock large-scale multi-dataset pre-training. Finally, we build on recent work to design a benchmark to evaluate time series foundation models on diverse tasks and datasets in limited supervision settings. Experiments on this benchmark demonstrate the effectiveness of our pre-trained models with minimal data and task-specific fine-tuning. We also present several interesting empirical observations about large pre-trained time series models. Pre-trained models (AutonLab/MOMENT-1-large) and the Time Series Pile (AutonLab/Timeseries-PILE) are available at https://huggingface.co/AutonLab.


    1. Introduction

    Time series analysis is an important field encompassing applications ranging from forecasting weather patterns (Schneider & Dickinson, 1974) and detecting irregular heartbeats using electrocardiograms (Goswami et al., 2021) to identifying anomalous software deployments (Xu et al., 2018). Due to its significant practical value and the unique challenges that modeling time series data poses, time series analysis continues to receive substantial interest from academia and industry alike. However, modeling such data typically requires substantial domain expertise, time, and task-specific design.

     

    Large pre-trained language (Touvron et al., 2023; Devlin et al., 2019; Chung et al., 2022), vision (Li et al., 2023a), and video (Day et al., 2023) models typically perform well on a variety of tasks on data from diverse domains with little or no supervision, and they can be specialized to perform well on specific tasks. We unlock these key capabilities for time series data and release the first family of open-source large pre-trained time series models, which we call MOMENT. The models in this family (1) serve as a building block for diverse time series analysis tasks (e.g., forecasting, classification, anomaly detection, and imputation), (2) are effective out-of-the-box, i.e., with no (or few) task-specific exemplars (enabling, e.g., zero-shot forecasting and few-shot classification), and (3) are tunable using in-distribution and task-specific data to improve performance.

     

    MOMENT is a family of high-capacity transformer models, pre-trained using a masked time series prediction task on large amounts of time series data drawn from diverse domains. Below we summarize our key contributions.

    C1: Pre-training data.

    A key limiting factor for pre-training large time series models from scratch has been the lack of a large, cohesive public time series data repository (Zhou et al., 2023; Gruver et al., 2023; Jin et al., 2023; Ekambaram et al., 2024; Cao et al., 2023). Therefore, we compiled the Time Series Pile, a large collection of publicly available data from diverse domains, ranging from healthcare to engineering to finance. The Time Series Pile comprises over 5 public time series databases from several diverse domains for pre-training and evaluation (Tab. 11).

    C2: Multi-dataset pre-training.

    Unlike text and images, which have largely consistent sampling rates and numbers of channels, time series frequently vary in their temporal resolution, number of channels, lengths, and amplitudes, and sometimes have missing values. As a result, large-scale mixed-dataset pre-training is largely unexplored. Instead, most methods are trained on a single dataset and transferred across multiple datasets, but with modest success (Wu et al., 2023; Oreshkin et al., 2021; Narwariya et al., 2020).

    C3: Evaluation.

    Holistic benchmarks to evaluate time series foundation models on diverse datasets and tasks are in their nascent stages. Recent studies (Goswami et al., 2023b) have highlighted the importance of well-defined benchmarks and large-scale experimentation in order to accurately assess the impact and effectiveness of novel methodologies. To evaluate MOMENT, we extend the multi-task time series modeling benchmark first proposed by Wu et al. (2023) along multiple dimensions. For each of the 5 time series modeling tasks, namely short- and long-horizon forecasting, classification, anomaly detection, and imputation, we evaluate MOMENT against (1) both state-of-the-art deep learning and statistical baselines, on (2) more task-specific datasets, (3) using multiple evaluation metrics, (4) exclusively in limited supervision settings (e.g., zero-shot imputation, linear probing for forecasting, unsupervised representation learning for classification).

     

    Finally, we explore various properties of these pre-trained time series models. In particular, we study whether MOMENT is aware of intuitive time series characteristics such as frequency and trend, and the impact of initialization, model size scaling, and cross-modal transfer.


    2. Related Work

    Transformers and patching for time series modeling.

    There is a growing body of work utilizing transformers for various time series analysis tasks (Wen et al., 2023). One issue with applying transformers to time series data is the complexity of the self-attention mechanism, which grows quadratically with the number of input tokens (i.e., the length of the time series) (Li et al., 2019). Nie et al. (2023) demonstrated that treating time series sub-sequences (or patches) as tokens, instead of individual time points, is a simple, efficient, and effective mechanism for learning useful representations for forecasting. Drawing inspiration from this prior work, we build on top of the transformer architecture, which takes disjoint time series sub-sequences (or patches) as input.
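
    To make the computational benefit concrete, here is a minimal PyTorch sketch of patch tokenization. The patch length of 8 and the input length of 512 come from the pre-training setup later in the paper; the embedding size and the module itself are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# Patching: a length-512 univariate series becomes 64 disjoint patches of length 8,
# so self-attention operates over 64 tokens instead of 512 time points,
# reducing the quadratic attention cost by a factor of (512/64)^2 = 64.
T, P, D = 512, 8, 768          # series length, patch length, embedding dim (Base-sized, illustrative)
N = T // P                     # 64 patch tokens

series = torch.randn(1, T)                 # (batch, time)
patches = series.reshape(1, N, P)          # (batch, num_patches, patch_len), disjoint chunks
tokens = nn.Linear(P, D)(patches)          # trainable linear projection per patch
print(tokens.shape)                        # torch.Size([1, 64, 768])
```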

    Masked Representation Learning.

    Masked pre-training is a widely-used self-supervised learning task where a model learns to accurately reconstruct masked portions of its input. Masked language (Devlin et al., 2019; Raffel et al., 2020) and image modeling (Xie et al., 2022; Li et al., 2023b) have been successfully utilized to learn models from vast quantities of unlabeled data, which can generalize to a variety of downstream tasks.

     

    For time series data, prior work has primarily focused on contrastive representation learning (Yue et al., 2022; Eldele et al., 2021; Franceschi et al., 2019). However, contrastive learning relies on data augmentation, which is both subjective and data-dependent. In contrast, some studies mask portions of time series using zeros and learn a model to reconstruct them (Nie et al., 2023; Zerveas et al., 2021; Dong et al., 2023; Li et al., 2023c).

     

    Representation learning via masking is well-suited to all the downstream tasks we care about, especially forecasting and imputation, as they are instances of the masked reconstruction problem. Due to its simplicity and success in vision and language domains, we use the masked prediction task to pretrain our model, using a special embedding (see [MASK] in Fig. 3) to mask time series patches instead of zeros.

    Cross-modal transfer learning using language models.

    Lu et al. (2022) first showed that transformers pre-trained on text data (LLMs) can effectively solve sequence modeling tasks in other modalities. Subsequently, Shen et al. (2023) introduced ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pre-trained model to diverse modalities by adapting to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pre-training modality, and then the pre-trained model is fine-tuned on the embedded data, exploiting the knowledge shared across modalities. Some recent studies have leveraged this inherent ability of language pre-trained transformers to “reprogram” LLMs for time series analysis using parameter-efficient fine-tuning and suitable tokenization strategies (Zhou et al., 2023; Gruver et al., 2023; Jin et al., 2023; Cao et al., 2023; Ekambaram et al., 2024). However, some of these models (Jin et al., 2023; Gruver et al., 2023), with billions of parameters, demand significant memory and computational resources to perform well. We complement this line of research with three empirical observations (Sec. 4.3): we show that (1) transformers trained on time series can also model sequences across modalities, (2) during pre-training, randomly initializing weights leads to a lower pre-training loss than initializing with language modeling weights, and (3) models pre-trained on time series outperform LLM-based models such as (Zhou et al., 2023; Jin et al., 2023) on many tasks and datasets.

    Unanswered Questions.

    To the best of our knowledge, two questions remain largely unanswered in prior work on time series modeling. First, all existing time series models are (pre-)trained and fine-tuned on individual datasets (Nie et al., 2023; Yue et al., 2022; Wu et al., 2023; Zhou et al., 2023), and the benefits (or drawbacks) of large-scale multi-dataset pre-training remain unexplored (Wen et al., 2023). Second, there is very limited work on time series modeling in limited supervision settings, such as zero-shot forecasting (Oreshkin et al., 2021) or few-shot classification (Narwariya et al., 2020). In our work, we consider both of these questions and show that pre-training a model of sufficient capacity on a large corpus of unlabeled time series data can in fact enable it to provide reasonably accurate predictions in limited-supervision settings.


    3. Methodology

    We first collect a large number of public time series data into the Time Series Pile and then use it to pre-train a transformer model on the masked time series prediction task. We discuss each of these steps in the following sections.

    3.1. The Time Series Pile

    Unlike natural language processing and computer vision, where large-scale datasets such as The Pile (Gao et al., 2020) and ImageNet-1K (Russakovsky et al., 2015) are easily available for pre-training, public time series datasets are much smaller, scattered, and largely task-specific (Ma et al., 2023; Zhou et al., 2023; Gruver et al., 2023). To bridge this gap, we collate multiple time series from 4 task-specific, widely-used public repositories, resulting in a large number of time series spanning diverse domains and time series characteristics such as lengths, amplitudes, and temporal resolutions. We call this collection the Time Series Pile.

    Informer long-horizon forecasting datasets

    (Zhou et al., 2021) is a collection of 9 datasets that are widely used to evaluate long-horizon forecasting performance (Wu et al., 2023; Nie et al., 2023; Challu et al., 2023): 2 hourly and minutely subsets of the Electricity Transformer Temperature (ETT) (Zhou et al., 2021), Electricity (Trindade, 2015), Traffic (California Department of Transportation, 2024), Weather (Max Planck Institute for Biogeochemistry, 2024), Influenza-like Illness (ILI) (Centers for Disease Control and Prevention, 2024), and Exchange-rate (Lai et al., 2018).

    Monash time series forecasting archive

    (Godahewa et al., 2021) is a collection of 58 publicly available short-horizon forecasting datasets with a total of over 100K time series, spanning a variety of domains and temporal resolutions.

    UCR/UEA classification archive

    (Dau et al., 2018) comprises 159 time series datasets which are frequently used to benchmark classification algorithms (Ismail Fawaz et al., 2019). These datasets, belonging to seven different categories (Image Outline, Sensor Readings, Motion Capture, Spectrographs, ECG, Electric Devices, and Simulated Data), vary substantially in terms of the number of classes and the size of the training set.

    TSB-UAD anomaly benchmark

    (Paparrizos et al., 2022b) is a recent collection of 1980 univariate time series with labeled anomalies from 18 anomaly detection datasets proposed over the past decade. This collection includes both synthetic and real-world time series originating from a wide range of sources such as the human body, spaceships, the environment, and web servers.

    Minimizing data contamination using careful train-test splitting.

    We carefully split each dataset into disjoint training, validation, and test splits, based on the splits specified by the data creators. When these splits are not available, we randomly sample 60% of the data for training, 10% for validation, and 30% for testing. Long-horizon forecasting and anomaly detection datasets are typically long time series, which are split horizontally as shown in Fig. 2. Conversely, short-horizon forecasting and classification datasets often contain multiple short time series. For these datasets, each complete time series is assigned entirely to the training, validation, or test split. We use the same random seed, set to 13, throughout our experiments, from pre-training to downstream evaluation, thus ensuring that MOMENT only observes the training splits of datasets during pre-training.
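
    A minimal sketch of this splitting logic, assuming the 60/10/30 ratios and seed 13 stated above; the helper names and the per-series random assignment are illustrative, not the paper's data pipeline.

```python
import numpy as np

SEED = 13  # fixed across pre-training and downstream evaluation, per the text

def split_long_series(x: np.ndarray, train: float = 0.6, val: float = 0.1):
    """Horizontal split of one long series (long-horizon forecasting / anomaly detection)."""
    n = len(x)
    i, j = int(n * train), int(n * (train + val))
    return x[:i], x[i:j], x[j:]

def split_series_collection(series_list, train: float = 0.6, val: float = 0.1, seed: int = SEED):
    """Assign each complete series to exactly one split (classification / short-horizon datasets)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(series_list))
    i, j = int(len(order) * train), int(len(order) * (train + val))
    pick = lambda ids: [series_list[k] for k in ids]
    return pick(order[:i]), pick(order[i:j]), pick(order[j:])
```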


    3.2. Model Architecture

    MOMENT receives a univariate time series T ∈ R^(1×T) and a mask M ∈ {0, 1}^(1×T) of length T, where 0 and 1 denote unobserved and observed time stamps, respectively. Reversible instance normalization (Kim et al., 2022) is applied to the observed time series before breaking it into N disjoint patches of length P. Each patch is then mapped to a D-dimensional embedding, using a trainable linear projection if all time steps are observed, and a designated learnable mask embedding [MASK] ∈ R^(1×D) otherwise. These N patch embeddings serve as input to the transformer model, which retains their shape (1 × D) throughout its operations. Each transformed patch embedding is then used to reconstruct both masked and unmasked time series patches, using a lightweight prediction head. The goal of the prediction head is to map the transformed patch embeddings to the desired output dimensions. Since this particular prediction head enables time series reconstruction, we call it the reconstruction head. Fig. 3 shows an overview of our model.
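
    The end-to-end reconstruction path can be summarized in a short PyTorch sketch. This is a hedged approximation under stated assumptions: the simplified instance normalization, the module names, and the stock nn.TransformerEncoder stand in for the paper's RevIN and T5-style encoder, so it illustrates the data flow rather than reproducing the released implementation.

```python
import torch
import torch.nn as nn

class MaskedReconstructionSketch(nn.Module):
    """Illustrative sketch of the pipeline in Sec. 3.2 (not the authors' code)."""
    def __init__(self, T=512, P=8, D=768, n_layers=2, n_heads=12):
        super().__init__()
        self.N, self.P = T // P, P
        self.patch_embed = nn.Linear(P, D)                # patch -> token embedding
        self.mask_embed = nn.Parameter(torch.zeros(D))    # learnable [MASK] embedding
        layer = nn.TransformerEncoderLayer(D, n_heads, dim_feedforward=4 * D,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(D, P)                       # lightweight reconstruction head

    def forward(self, x, patch_observed):
        # x: (B, T) univariate series; patch_observed: (B, N) bool, False => masked patch
        mu = x.mean(dim=-1, keepdim=True)                 # simplified instance normalization
        sigma = x.std(dim=-1, keepdim=True) + 1e-5        # (RevIN additionally learns affine params)
        xn = (x - mu) / sigma
        patches = xn.reshape(x.shape[0], self.N, self.P)  # (B, N, P) disjoint patches
        tokens = self.patch_embed(patches)                # (B, N, D)
        tokens = torch.where(patch_observed.unsqueeze(-1), tokens,
                             self.mask_embed.expand_as(tokens))
        z = self.encoder(tokens)                          # (B, N, D) patch representations
        recon = self.head(z).reshape(x.shape)             # reconstruct all patches
        return recon * sigma + mu                         # reverse the normalization

# Example: reconstruct a batch of 4 series with ~30% of the patches masked.
# model = MaskedReconstructionSketch()
# recon = model(torch.randn(4, 512), torch.rand(4, 64) > 0.3)
```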

     

    Our transformer encoder retains the modifications proposed by Raffel et al. (2020) to the original Transformer (Vaswani et al., 2017). Specifically, we remove the additive bias from layer normalization (Ba et al., 2016), place it before the residual skip connections (He et al., 2016), and use the relative positional embedding scheme (Shaw et al., 2018). Below we summarize the intuition behind some of our key design decisions.

    Handling varying time series characteristics.

    Time series vary in length, number of channels, amplitudes, and temporal resolutions. We address variable lengths by restricting MOMENT's input to a univariate time series of a fixed length T = 512. As is common practice, we sub-sample longer time series and pad shorter ones with zeros on the left. Moreover, segmenting time series into patches quadratically reduces MOMENT's memory footprint and computational complexity, and linearly increases the length of time series it can take as input. We handle multivariate time series by independently operating on each channel along the batch dimension. Like recent studies (Zhou et al., 2023; Nie et al., 2023), we found that modeling each channel independently is an effective strategy for modeling multivariate time series. Finally, re-scaling and centering time series using reversible instance normalization enables MOMENT to model time series with significantly different temporal distributions (Kim et al., 2022). We did not explicitly model the temporal resolution of time series, since this information is often unavailable outside of time series forecasting datasets.
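
    A hedged sketch of this input standardization: multivariate series are flattened into independent univariate channels, and shorter series are left-padded with zeros alongside an observation mask. The truncation strategy and mask convention here are illustrative simplifications.

```python
import torch

T_FIXED = 512  # fixed input length used by MOMENT

def prepare_input(x: torch.Tensor):
    """x: (batch, channels, length). Returns (batch*channels, T_FIXED) plus an observation mask."""
    b, c, length = x.shape
    x = x.reshape(b * c, length)          # channel independence: each channel becomes its own sample
    if length >= T_FIXED:
        x = x[:, -T_FIXED:]               # keep the most recent window (truncation choice is illustrative)
        mask = torch.ones_like(x)
    else:
        pad = torch.zeros(b * c, T_FIXED - length)
        x = torch.cat([pad, x], dim=-1)   # left-pad shorter series with zeros
        mask = torch.cat([torch.zeros_like(pad), torch.ones(b * c, length)], dim=-1)
    return x, mask                        # mask: 1 = observed, 0 = padded
```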

    Intentionally simple encoder.

    Closely following the design of transformers in the language domain allows us to leverage their scalable and efficient implementations (e.g., gradient checkpointing, mixed precision training).

    Light-weight prediction head.

    We use a lightweight prediction head instead of a decoder of the same size as the encoder. This enables the necessary architectural modifications for task-specific fine-tuning with a limited number of trainable parameters, while keeping the bulk of the parameters and the high-level features learned by the encoder intact.

    Additional absolute positional embeddings.

    In addition to relative positional embeddings, we add absolute sinusoidal positional embeddings (Vaswani et al., 2017) to each patch.
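
    For reference, a sketch of the standard sinusoidal embeddings of Vaswani et al. (2017), computed here per patch position; the function name and the way it is added to the tokens are illustrative assumptions.

```python
import math
import torch

def sinusoidal_positional_embedding(num_patches: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal embeddings (Vaswani et al., 2017), one per patch position."""
    pos = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)            # (N, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                            # (d_model/2,)
    pe = torch.zeros(num_patches, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                                    # (N, d_model)

# Added on top of the patch tokens, alongside the encoder's relative scheme:
# tokens = tokens + sinusoidal_positional_embedding(tokens.shape[1], tokens.shape[2])
```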


    3.3. Pre-training using Masked Time series Modeling

    We pre-train MOMENT using the masked time series modeling task. Fig. 3 presents an overview of our pre-training procedure. During training, we first mask a small number of patches uniformly at random by replacing their patch embeddings with a learnable mask embedding [MASK]. The corrupted time series patches are then fed into the transformer encoder to learn patch representations, which are used to reconstruct the original time series using a lightweight reconstruction head. The pre-training objective is to minimize the masked reconstruction error, i.e., the mean squared error between the ground truth and the prediction, averaged over the masked patches.
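
    A minimal sketch of this objective: MSE computed only over the masked patches. The 30% mask ratio comes from the pre-training setup below; the helper names are illustrative.

```python
import torch

def sample_patch_mask(batch: int, num_patches: int, mask_ratio: float = 0.3) -> torch.Tensor:
    """Mask ~30% of patches uniformly at random (True = masked)."""
    return torch.rand(batch, num_patches) < mask_ratio

def masked_reconstruction_loss(pred, target, patch_masked, patch_len: int = 8):
    """MSE between prediction and ground truth, averaged over masked patches only.
    pred, target: (B, T); patch_masked: (B, N) bool."""
    B, T = target.shape
    N = T // patch_len
    sq_err = (pred - target).reshape(B, N, patch_len) ** 2   # per-point error grouped by patch
    per_patch = sq_err.mean(dim=-1)                          # (B, N)
    return per_patch[patch_masked].mean()
```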

    Pre-training Setup.

    We pre-train three different sizes of MOMENT, roughly corresponding to the sizes of the encoders in T5-Small, Base, and Large. Specifically, the Base (Small, Large) model uses a 12 (6, 24) layer Transformer with hidden dimensions of size D = 768 (512, 1024), 12 (8, 16) attention heads, and feed-forward networks of size 3072 (2048, 4096), resulting in approximately 125 (40, 385) million parameters. All weights are randomly initialized before pre-training. All models take an input time series of length T = 512, breaking it into N = 64 disjoint patches of length P = 8. We mask 30% of the patches uniformly at random during pre-training. We use the Adam optimizer with weight decay (Loshchilov & Hutter, 2019), with λ = 0.05, β1 = 0.9, β2 = 0.999. We clip gradients at 5.0, train models using a batch size of 2048, and use a cosine learning rate schedule with initial and final learning rates of 1e-4 and 1e-5, respectively. We use gradient checkpointing (Radford et al., 2021) to improve training throughput and save memory, and train all models in a mixed-precision setting, using float-32 for numerically unstable operations, e.g., layer normalization, and bfloat-16 otherwise. We train all models for 2 epochs.
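
    A hedged sketch of these optimization settings in standard PyTorch. The `model` object, the per-step scheduler update, and treating the reported gradient clip of 5.0 as a norm clip are assumptions; only the hyper-parameter values are taken from the text.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_pretraining(model, total_steps):
    # AdamW with weight decay 0.05 and betas (0.9, 0.999); cosine decay from 1e-4 to 1e-5.
    opt = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05)
    sched = CosineAnnealingLR(opt, T_max=total_steps, eta_min=1e-5)
    return opt, sched

def pretraining_step(model, batch, opt, sched):
    # Mixed precision with bfloat16; numerically sensitive ops (e.g., layer norm) stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)                                   # assume the model returns the masked MSE
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # gradient clipping at 5.0
    opt.step()
    sched.step()
    opt.zero_grad()
    return loss.item()
```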


    3.4. Fine-tuning on Downstream Tasks

    MOMENT can be seamlessly used for multiple time series analysis tasks. In this work, we consider 5 practical time series analysis tasks as examples, namely long- and short-horizon forecasting, classification, anomaly detection, and imputation. For forecasting tasks with horizon H, we replace the reconstruction head with a forecasting head, which first flattens all N of the D-dimensional patch embeddings into an (N · D)-dimensional vector, and then projects it into an H-dimensional time series via a linear projection layer. For all other tasks, we retain the reconstruction head. We provide detailed descriptions of each task and MOMENT's configuration in App. F.
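
    The forecasting head as described is just a flatten followed by a linear layer; a hypothetical module, not the released code, with sizes matching the Large configuration for illustration.

```python
import torch
import torch.nn as nn

class ForecastingHead(nn.Module):
    """Flatten the N patch embeddings of size D and project linearly to an H-step forecast."""
    def __init__(self, num_patches: int = 64, d_model: int = 1024, horizon: int = 96):
        super().__init__()
        self.proj = nn.Linear(num_patches * d_model, horizon)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        flat = patch_embeddings.flatten(start_dim=1)   # (B, N, D) -> (B, N*D)
        return self.proj(flat)                         # (B, H)
```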

    Fine-tuning settings.

    MOMENT can either be fine-tuned end-to-end or linear-probed (MOMENT-LP) by freezing all parameters except those in the reconstruction or forecasting head. Additionally, for some tasks such as anomaly detection, unsupervised representation learning, and imputation, MOMENT can also be used in a zero-shot (MOMENT-0) setting by retaining its reconstruction head.
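
    Linear probing amounts to freezing everything except the head; a minimal sketch, assuming the head is exposed as an attribute (the attribute name is hypothetical).

```python
def linear_probe(model, head_attr: str = "head"):
    """Freeze the encoder; train only the reconstruction/forecasting head (the MOMENT-LP setting).
    `head_attr` is a hypothetical attribute name for the head module."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, head_attr).parameters():
        p.requires_grad = True
    # Return only the trainable parameters, e.g. to hand to an optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```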


    4. Experimental Setup and Results

    We extend the experimental benchmark introduced by Wu et al. (2023) across various dimensions. Below, we outline the design choices of our benchmark and highlight its key distinctions from TimesNet.

    Time series modeling with limited supervision.

    Our benchmark comprises 5 major time series modeling tasks of significant practical value, namely long- and short-horizon forecasting, imputation, classification, and anomaly detection, as outlined in Tab. 1. In contrast to TimesNet, we exclusively consider scenarios characterized by limited compute and supervision resources. These scenarios mimic practical situations where training (or fine-tuning) a deep neural network is infeasible due to resource limitations or insufficiently characterized data. Accordingly, we assess MOMENT in zero-shot settings whenever feasible, and through linear probing for a few epochs otherwise.

     

    For classification, we consider the unsupervised representation learning problem, where the goal is to learn representations of time series that are useful for downstream classification, without access to labeled data. As is common in prior work (Yue et al., 2022; Franceschi et al., 2019), the quality of representations is measured using the accuracy of a Support Vector Machine trained on them (App. F.2). For short-horizon forecasting, we consider the zero-shot setting introduced by Oreshkin et al. (2021). In particular, we fine-tune MOMENT on a source dataset using a forecasting head, and evaluate its performance on a target dataset without any further fine-tuning (App. F.1.2, Tab. 21).
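
    The representation-learning protocol can be sketched as follows: frozen embeddings go into an SVM, and the SVM's test accuracy scores the representations. The embedding function and the SVM hyper-parameters are assumptions; scikit-learn's SVC stands in for the classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def evaluate_representations(embed_fn, X_train, y_train, X_test, y_test):
    """embed_fn maps a raw series to a fixed-size embedding (e.g., mean-pooled patch embeddings)."""
    Z_train = np.stack([embed_fn(x) for x in X_train])
    Z_test = np.stack([embed_fn(x) for x in X_test])
    clf = SVC(C=1.0, kernel="rbf").fit(Z_train, y_train)   # SVM hyper-parameters are illustrative
    return accuracy_score(y_test, clf.predict(Z_test))
```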

    Datasets.

    We use the same datasets as TimesNet for forecasting and imputation. However, for classification and anomaly detection, we conduct experiments on larger and systematically chosen subsets of datasets from the UCR classification archive (Dau et al., 2018) and the UCR anomaly archive (Wu & Keogh, 2023). Specifically, we run classification experiments on all 91 time series datasets with each time series shorter than 512 time steps (Tab. 23). For anomaly detection, while choosing the subset of time series, we prioritized coverage of the different domains and data sources represented in the UCR anomaly archive (Tab. 22). We also note that the UCR anomaly archive was proposed as an improvement over pre-existing anomaly detection datasets such as SMD (Su et al., 2019) and SMAP (Hundman et al., 2018), many of which are also used in TimesNet. Our proposed experimental setup is summarized in Tab. 1 and detailed in App. F.

    Metrics.

    We evaluate each experiment using multiple metrics drawn from task-specific benchmarks, such as MSE and MAE for long-horizon forecasting, and sMAPE for short-horizon forecasting. We also note that TimesNet and GPT4TS (Zhou et al., 2023) evaluate anomaly detection performance using the vanilla F1 score, which ignores the sequential nature of time series. Instead, we measure anomaly detection performance with the widely used adjusted best F1 score (Goswami et al., 2023a; Challu et al., 2022) and the recently proposed VUS-ROC (Paparrizos et al., 2022a).
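
    The "adjusted" F1 follows the common point-adjustment convention: if any point inside a labeled anomalous segment is flagged, the whole segment counts as detected, and the "best" F1 then maximizes over detection thresholds. A hedged sketch of the adjustment step; the exact protocol used in the cited benchmarks may differ in details.

```python
import numpy as np

def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
    """If any point inside a ground-truth anomalous segment is predicted anomalous,
    mark the entire segment as detected (common adjustment convention)."""
    adjusted = pred.copy()
    i = 0
    while i < len(label):
        if label[i] == 1:                          # start of a ground-truth anomalous segment
            j = i
            while j < len(label) and label[j] == 1:
                j += 1
            if adjusted[i:j].any():                # at least one detection inside the segment
                adjusted[i:j] = 1
            i = j
        else:
            i += 1
    return adjusted

# The "best" F1 is then the maximum F1 over candidate anomaly-score thresholds,
# with point_adjust applied to each thresholded prediction before computing F1.
```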

    Baselines.

    We compare MOMENT with state-of-the-art deep learning and statistical machine learning models across tasks (Tab. 35). This is in contrast to TimesNet, which primarily compared with transformer-based approaches. These comparisons are crucial for assessing the practical utility of the proposed methods. We found that statistical and non-transformer-based approaches, such as ARIMA for short-horizon forecasting, N-BEATS for long-horizon forecasting, and k-nearest neighbors for anomaly detection, outperform many deep and transformer-based models.

    Hyper-parameter tuning.

    We do not perform hyper-parameter tuning. In all experiments that follow, unless mentioned otherwise, we fine-tune MOMENT-Large with a batch size of 64 and a one-cycle learning rate schedule with a peak learning rate between 5e-5 and 1e-3 (Smith & Topin, 2019). For baseline methods, we use the recommended settings from their papers and public repositories. We report all hyper-parameter settings for MOMENT and the baselines in App. F.

    Research questions.

    Through the following experiments we aim to answer 3 broad research questions.

     

    RQ1: Effectiveness. Is MOMENT effective for multiple time series analysis tasks in limited supervision settings?

     

    RQ2: Interpretability. What is MOMENT learning? Does it capture intuitive time series characteristics such as varying frequencies, trends, and amplitudes?

     

    RQ3: Properties. What is the impact of scaling model size? Can MOMENT, akin to LLMs, be used for cross-modal transfer learning?


    4.1. MOMENT can solve multiple time series modeling tasks in limited supervision settings

    Long-horizon forecasting.

    Linearly probing MOMENT achieves near state-of-the-art performance on most datasets and horizons, and is second only to PatchTST, which generally achieves the lowest MSE (Tab. 2). On many datasets and horizons, the LLM-based forecasting models Time-LLM and GPT4TS perform worse than MOMENT. Notably, N-BEATS outperforms several recent methods, emphasizing the importance of comparing forecasting performance beyond transformer-based approaches.

    Zero-shot short-horizon forecasting.

    Among all tasks, we found zero-shot short-horizon forecasting to have the largest scope for improvement (Tab. 3). Statistical methods such as Theta and ETS outperformed their deeper counterparts. However, on some datasets, MOMENT achieved lower sMAPE than ARIMA.

    Classification.

    Without any data-specific fine-tuning, MOMENT can learn distinct representations for different classes of data (Fig. 5), and an SVM trained on its representations performs better than all but 4 methods specifically built for time series classification and trained on each individual dataset. The recently proposed GPT4TS and TimesNet perform poorly despite being trained on each individual dataset with labels.

    Anomaly detection.

    On 44 time series from the UCR anomaly detection archive, MOMENT consistently outperformed both TimesNet and GPT4TS, as well as 2 state-of-the-art deep learning models tailored for anomaly detection, in both zero-shot and linear-probing configurations. However, k-nearest neighbors performed marginally better in terms of the VUS-ROC score, but had a lower adjusted best F1 score.

    Imputation.

    Tab. 6 contains imputation performance of all models averaged over 4 different masking rates. MOMENT with linear probing achieved the lowest reconstruction error on all ETT datasets. In the zero-shot setting, MOMENT consistently outperformed all statistical interpolation methods with the exception of linear interpolation.


    4.2. What is MOMENT Learning?

    We found that MOMENT can capture changes in intuitive time series characteristics such as trend, amplitude, frequencies, and phases of time series. However, it cannot differentiate between vertically shifted time series as it normalizes each signal prior to modeling (Fig. 4,7). Furthermore, on many classification datasets, MOMENT learns distinct representations of different classes, even in a zero-shot setting without access to labels (Fig. 5, 8).


    4.3. Properties of Large Time Series Models

    Model scaling improves training loss.

    Like LLMs, we found that increasing the size of the model leads to lower training loss, even before the first epoch (Fig. 6, left). An immediate next step is to assess how effectively this phenomenon extends to time series modeling tasks under limited supervision.

    MOMENT can solve cross-modal sequence learning tasks.

    Lu et al. (2022) first showed that large pre-trained language and vision transformers can solve general sequence learning tasks for modalities outside of text and images with minimal fine-tuning. Several recent studies have leveraged these properties to reprogram LLMs for time series tasks. We explore whether transformers pre-trained on time series can also be used to solve sequence classification tasks on image, text, and binary data. Our results confirm that, by freezing the self-attention and feed-forward layers, MOMENT can model cross-modal sequences comparably to GPT-2 and Flan-T5 models of similar scale (Tab. 5).

    MOMENT with randomly initialized weights converges to a lower training loss.

    Our observations suggest that with sufficient data, pre-training our model from scratch results in a lower training loss than continually pre-training a model of similar size initialized with language modeling weights (Fig. 6, 12). This also underscores that there is sufficient publicly accessible pre-training data available in the Time Series Pile to facilitate pre-training time series foundation models from scratch.


    5. Conclusion and Future Work

    We release the first open-source family of time series foundation models and make contributions at all stages of the development and evaluation process. We first compile a large and diverse collection of public time series, called the Time Series Pile, and demonstrate its efficacy by pre-training high-performing time series foundation models from scratch. Then, we systematically address several time series-specific challenges, which up to now have impeded widespread exploration of large-scale multi-dataset pre-training.

     

    We use the Time Series Pile and these strategies to pre-train transformer models of three different sizes. Finally, we design an experimental benchmark to evaluate time series foundation models on multiple practical time series tasks, particularly focusing on scenarios with constrained compute and supervision, building on prior work by Wu et al. (2023). Using this benchmark, we show that MOMENT is effective for the considered tasks with minimal fine-tuning. MOMENT’s superior performance, especially on anomaly detection and classification problems which typically have small datasets, can be attributed to pre-training. Moreover, we demonstrate that across many tasks, smaller statistical and shallower deep learning methods perform reasonably well. Lastly, we make several interesting empirical observations about time series foundation models. Our overarching goal is to push the boundaries of open science by publicly releasing the Time Series Pile, along with code, model weights, and training logs.

     

    We note several interesting directions of future work, including the application of MOMENT to real-world challenges, investigating multi-modal time series and text foundation models (Cai et al., 2023), and enhancing forecasting performance by pre-training MOMENT using causal attention and forecasting objectives.
