[GPT4MTS] Prompt-Based Large Language Model for Multimodal Time-Series Forecasting
2024. 11. 3. 17:01
https://doi.org/10.1609/aaai.v38i21.30383
(March, 2024)
Abstract
Time series forecasting is an essential area of machine learning with a wide range of real-world applications. Most of the previous forecasting models aim to capture dynamic characteristics from uni-modal numerical historical data. Although extra knowledge can boost the time series forecasting performance, it is hard to collect such information. In addition, how to fuse the multimodal information is non-trivial. In this paper, we first propose a general principle of collecting the corresponding textual information from different data sources with the help of modern large language models (LLM). Then, we propose a prompt-based LLM framework to utilize both the numerical data and the textual information simultaneously, named GPT4MTS. In practice, we propose a GDELT-based multimodal time series dataset for news impact forecasting, which provides a concise and well-structured version of time series dataset with textual information for further research in communication. Through extensive experiments, we demonstrate the effectiveness of our proposed method on forecasting tasks with extra-textual information.
Introduction
Time series data has consistently played a pivotal role in diverse fields ranging from finance (Sezer, Gudelek, and Ozbayoglu 2020) and economics (Kalamara et al. 2022) to healthcare (Kaushik et al. 2020; Cao et al. 2023) and weather prediction (Nguyen et al. 2023), etc. The intricate patterns embedded within such data often reflect underlying mechanisms or behaviors, making time series forecasting an indispensable tool for decision-making processes. Meanwhile, in our current information-dense era, the influence of textual information spans from individual decisions to shaping national directives. However, due to insufficient data accumulation and limited resources, limited work has been done on designing multimodal time series datasets. With the emergence of Large-scale Language Models (LLMs), we are able to fill this gap by proposing an effective pipeline and a new paradigm of time series forecasting task as well as dataset as shown in Figure 1.
Conventional forecasting methods usually focus only on unimodal numerical time series information. Previous works use Graph Neural Networks (GNNs) (Wu et al. 2020; Jiang and Luo 2022; Cao et al. 2020) for short-term predictions and transformer variants (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022; Nie et al. 2023) for long-term forecasting. Recently, linear models (Zeng et al. 2023) have also exhibited strong performance on time series forecasting tasks. However, these works share an intrinsic limitation: they overlook the rich contextual information provided by textual data.
While some recent works (Tian Zhou 2023; Sun et al. 2023; Xue and Salim 2022) have tried to apply language models to time series tasks, these works either treat time series data as text sequence inputs (Xue and Salim 2022) or align the time series input with the textual embedding of LLMs. Very few of these approaches utilize a multimodal input containing both time series information and textual information. METS (Li et al. 2023) is, to our knowledge, the only work that uses a multimodal input containing both ECG time series data and clinical report texts. However, METS focuses only on the health-care domain and cannot be generalized to most other time series data.
In light of these challenges, we first introduce an innovative pipeline in Figure 2 that leverages the power of large language models (LLMs) to generate textual data along with the time series data: textual information collection, summarization, re-rank, efficient summary based on re-rank similarity. Note that, while not applicable to all domains, this generation pipeline can still be applied to many other fields through proper guidance to obtain extra-textual information, such as financial forecasting and communication.
In this paper, based on the proposed pipeline, we created a GDELT-based multimodal time series forecasting dataset that contains both time series numerical values and textual summaries of events. The GDELT database records global events and their associated media coverage, underpinning the profound impact that news can have in guiding our lives. It propels the development of computational communication research (Hopp et al. 2019; Lock 2020) under multiple domains, including social unrest (Galla and Burke 2018), government policies (Schintler and Kulkarni 2014), and financial markets (Consoli, Pezzoli, and Tosetti 2020). The establishment of our dataset can enhance the accessibility of different fields to multimodal time series datasets, as well as foster further research in multimodal computational communication analysis.
Together with the established dataset, we propose a prompt tuning-based LLM, GPT4MTS, for time series forecasting with multimodal input, which contrasts with conventional approaches that rely on direct data alignment. For the numerical input, we first split the temporal input into different patches with reversible instance normalization (Nie et al. 2023) and then apply a linear layer to embed the patches into a hidden space as the time series input embedding. For the corresponding textual information, we propose to obtain the textual embeddings through a pre-trained language model and treat them as trainable soft prompts prepended to the temporal input. In addition, we choose to freeze the attention layers inside the LLM to speed up the training and inference stages. Experimental results demonstrate the effectiveness of the proposed method and reveal a potential direction: combining the extra information with pure time series data can further enhance forecasting performance.
Our main contributions can be summarized as follows:
• We propose a general pipeline to incorporate textual data into time series datasets. In addition, we propose the GDELT dataset following the proposed pipeline, which serves as a practical application of our innovative pipeline and methodology.
• Extensive experiments were used to illustrate the effectiveness of our model based on the multimodal time series dataset.
Related Works
In this section, we provide an overview of previous work and discuss differences between our method and related studies.
Large-Scale Language Model (LLM)
The introduction of the Transformer architecture (Vaswani et al. 2017) has revolutionized several areas, including Natural Language Processing (NLP) and Computer Vision (CV). Coupled with the rise in computational resources, several Large-scale Language Models (LLMs) based on deep Transformer architecture have been proposed. Among them, BERT is pre-trained on a large corpus and fine-tuned for specific NLP tasks (Devlin et al. 2019). The T5 model adopts a unified text-to-text framework, converting every NLP problem into a text generation task (Raffel et al. 2020). GPT focuses on the autoregressive generation of text, making it suitable for a variety of generative tasks (Radford et al. 2018, 2019; Brown et al. 2020). LLaMa models are trained on vast public datasets (Touvron et al. 2023). While LLMs have achieved substantial success in NLP, their application in areas such as time series forecasting remains underexplored.
Prompting
Prompting serves as a methodology for crafting queries or commands that steer models toward generating specific, targeted outputs. The CLIP model, for instance, uses textual prompts in the format of “a photo of a {object}” to perform image classification (Radford et al. 2021). T5 incorporates task descriptions into its text-to-text framework, effectively using these as implicit prompts to guide various NLP tasks (Raffel et al. 2020). Prompting techniques have also been extended to several tasks like object detection (Li et al. 2022), image captioning (Zhang et al. 2022), etc. Distinctively, our work leverages textual prompts to guide the model in processing the numerical information of time series.
Time Series Forecasting
Time series forecasting holds significance in diverse fields, such as anomaly detection, weather forecasting, and treatment effect modeling. Conventional methods, such as ARIMA (Box and Jenkins 1968), use statistical models to fit time series. As deep learning emerged as a game changer, several neural architectures, such as CNN-based (Munir et al. 2018), RNN-based (Salinas et al. 2020), and Transformer-based (Liu et al. 2021; Wu et al. 2021; Zhou et al. 2021, 2022), demonstrated superiority. While Transformer-based methods have shown impressive performance, there are studies that juxtapose them with embarrassingly simple linear models, challenging their priority (Zeng et al. 2023). In contrast to these approaches, our study utilizes a pre-trained LLM to address the challenges of time series forecasting.
Furthermore, the prevalent benchmark datasets, like Weather and ETT (Zhou et al. 2021), predominantly focus on numeric series. Our introduction of the GDELT dataset diverges from this trend. By incorporating textual content into time series, we enrich the data, paving the way for LLMs to harness additional textual information when making predictions based on time series data.
LLM4TS
Several studies have ventured into using LLMs for time series forecasting. For instance, one work interprets time series as patches and employs pre-trained models for forecasting (Tian Zhou 2023). Another research direction involves aligning embeddings between time series and texts. The TEST method adopts contrastive learning to align time series embeddings with the original textual embeddings in LLMs (Sun et al. 2023). TimeLLM (Jin et al. 2023) adopts a distinct approach through reprogramming that aligns the embedding space of time series data with that of textual data. Other works convert time series into textual data via prompts (Xue and Salim 2022), while some rely exclusively on textual information (Yu et al. 2023). Our approach similarly harnesses a pre-trained LLM for time series forecasting but differs in using prompts to feed additional textual information. Combined with the capabilities of LLMs and the integration of data from both modalities, this approach boosts the performance of time series forecasting beyond using either modality alone, emphasizing the importance of both.
Methods
Dataset Collection
This dataset is derived from the Global Database of Events, Language, and Tone (GDELT) database, which is one of the largest publicly available databases that monitor news media from around the world in over 100 languages. The GDELT database covers various types of information primarily focused on events, thereby offering a rich set of variables for understanding global societal trends.
Time series data collection.
The pipeline of our data collection process is shown in Figure 2. In our specific GDELT dataset, we focus on extracting key information related to the top 10 popular event types (EventRootCode) and their respective mentions and coverage in the news media. The relationship between EventRootCode and the name of the event can be looked up through Table 3. Specifically, we extract three key variables for forecasting: NumMentions, NumArticles, NumSources. These variables respectively represent the number of mentions, the number of related articles, and the number of sources, which are all relevant to the attention a particular event type receives within a given time frame and geographical region. We divided our dataset into 10 event root types, collecting data for 55 regions under the US and national data for the US. Currently, the data we used for training and evaluation spans from 2022-08-17 to 2023-07-31.
Corresponding textual data collection.
For textual data summarization, we first scrape 10 articles for each event type under the given region and given date. Then a summary of each scraped article is generated using T5 (Raffel et al. 2020). A hypothetical article summary is generated given the particular event type and its explanations, serving as a template for possible summaries under this event type. Summaries are then re-ranked based on the similarity to the hypothetical article, aiming to keep the summaries that are most relevant to the news under the event type. Finally, an overall summary is generated given the top 5 related article summaries using the OpenAI ChatGPT3.5 API.
We demonstrate one example of how textual information is collected for our dataset in Figure 3. The first-round summaries of the scraped articles may contain redundant information (such as the last summary in our example) because they are produced by T5, but after similarity-based re-ranking such summaries are less likely to be selected as top-related articles and included in our final summary. Due to a limited budget, for some regional data we use T5 instead of the OpenAI ChatGPT3.5 API for the final summary. This can lead to certain mal-summarized texts due to the interference of useless information in the first round of summaries. Therefore, to mitigate this issue, we replace these inferior final summaries with summaries of clean scraped titles.
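The re-ranking step can be sketched roughly as follows. This is only an illustration: the source does not state which similarity measure or embedding model is used, so a bag-of-words cosine similarity stands in for it, and the names `bag_of_words`, `cosine_similarity`, `rerank_summaries`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_summaries(summaries, hypothetical_summary, top_k=5):
    """Return the top_k summaries most similar to the hypothetical summary."""
    reference = bag_of_words(hypothetical_summary)
    scored = [(cosine_similarity(bag_of_words(s), reference), s) for s in summaries]
    # Sort on the score only; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]


# Toy inputs (invented for illustration, not taken from the dataset).
hypothetical = "Protesters gathered downtown to demonstrate against the new policy."
article_summaries = [
    "Thousands of protesters gathered downtown against the policy.",
    "Subscribe now for daily newsletter updates.",
    "Police monitored the demonstration against the policy near city hall.",
]
top = rerank_summaries(article_summaries, hypothetical, top_k=2)
```

In this toy case the off-topic newsletter line shares no tokens with the hypothetical summary, so it falls outside the top two and would not reach the final summarization step.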
The primary goal of this dataset is to help individuals, researchers, and policymakers better predict and analyze the level of attention certain types of events are receiving and the impact they have in a specifc region. This can facilitate more informed decision-making in both the public and private sectors as well as provide a concise and effective dataset for computational communication researchers.
This dataset also verifies the effectiveness of our multimodal time series dataset generation pipeline, enhancing the feasibility of establishing similar datasets for other domains.
Problem Definition
Proposed Method
The architecture of the model we employed is demonstrated in Figure 3. We utilize parameters from the pretrained GPT2 model (Radford et al. 2019). In order to understand information from both modalities, we add extra prompt layers to transform time series information and textual information to the input dimension of the pretrained model.
Frozen Pretrained Model
Our model retains the positional embedding layers and transformer blocks from the pretrained GPT2 model, while we freeze the attention layers and the feed-forward layers and fine-tune the positional embeddings and layer normalization layers, following the standard practice (Houlsby et al. 2019; Lu et al. 2022).
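One way to express this freezing rule is as a predicate on parameter names. The sketch below assumes the parameter naming used by the Hugging Face GPT-2 implementation (`wpe` for positional embeddings, `ln_1`, `ln_2`, `ln_f` for layer normalization, `attn` and `mlp` for the attention and feed-forward blocks); the source does not show its own code, and `is_trainable` is a hypothetical helper.

```python
def is_trainable(param_name):
    """True for positional embeddings and layer norms, False for everything else."""
    parts = param_name.split(".")
    is_positional = "wpe" in parts
    is_layer_norm = any(part.startswith("ln_") for part in parts)
    return is_positional or is_layer_norm


# A few parameter names in the assumed GPT-2 naming scheme.
example_names = [
    "wte.weight",                # token embeddings: frozen
    "wpe.weight",                # positional embeddings: fine-tuned
    "h.0.ln_1.weight",           # layer norm: fine-tuned
    "h.0.attn.c_attn.weight",    # attention: frozen
    "h.0.ln_2.bias",             # layer norm: fine-tuned
    "h.0.mlp.c_fc.weight",       # feed-forward: frozen
    "ln_f.weight",               # final layer norm: fine-tuned
]
trainable = [name for name in example_names if is_trainable(name)]
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`.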
Input Embedding
In order to apply both textual input and time series input to pretrained LLM, we need to prepare the two modalities through separate embedding layers.
For the textual information, we apply the BERT (Devlin et al. 2019) embedding module as a feature extractor to obtain the representation for the summary texts over the lookback window.
For the time series information, following previous works (Tian Zhou 2023; Nie et al. 2023), we use the following operations to fit the time series input into the pre-trained GPT2 model:
• We apply Reversible Instance Normalization (Kim et al. 2021) to mitigate the distribution shift of the time series data over time, which performs normalization by extracting mean and variance from input time series and adding them back to the predicted output.
• We then apply patching (Nie et al. 2023) by aggregating adjacent timestamps. This enables time series stamps to gather context, similar to the continual impact of news over a period of time.
• Since channel-independence has demonstrated its effectiveness through previous works (Nie et al. 2023; Tian Zhou 2023), we treat each multivariate time series as multiple independent uni-variate series as well.
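The three operations above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the learnable affine parameters of Reversible Instance Normalization are omitted, the patch length and stride are hypothetical values (the source does not give them here), and all function names are invented for the example.

```python
import math


def instance_normalize(series, eps=1e-5):
    """Normalize one series by its own mean and standard deviation."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in series], mean, std


def denormalize(values, mean, std):
    """Add the stored statistics back, as done for the predicted output."""
    return [v * std + mean for v in values]


def make_patches(series, patch_len, stride):
    """Group adjacent time steps into (possibly overlapping) patches."""
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


def split_channels(multivariate):
    """Channel independence: rows of time steps -> one series per variable."""
    return [list(channel) for channel in zip(*multivariate)]


# Toy look-back window: 15 time steps, 3 variables per step.
window = [[float(t), 2.0 * t, 3.0 * t] for t in range(15)]
channels = split_channels(window)

normalized, mean, std = instance_normalize(channels[0])
patches = make_patches(normalized, patch_len=4, stride=2)  # hypothetical sizes
```

Each patch would then pass through the linear embedding layer, and `denormalize` would be applied to the model's forecast for that channel.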
Output Layer
Since the output from the frozen Pretrained Language Transformer contains hidden states of sequence length (for textual inputs) + patch number (for time series inputs), we apply a linear output layer, which takes the hidden states corresponding to the time series as input and transforms them into the desired prediction length. As the ablation study in Table 4 and Table 5 shows, this improves performance, as utilizing only the hidden states of the time series input can preserve a better representation of the numerical forecasting targets.
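A minimal sketch of this selection step is given below. It assumes the text prompt occupies the first positions of the sequence (the prompts are described as prepended) and that the patch states are flattened before a single linear layer; the flattening, the toy dimensions, and the function names are assumptions for illustration.

```python
def select_time_series_states(hidden_states, num_text_tokens):
    """Drop the states of the prepended text prompt; keep the patch states."""
    return hidden_states[num_text_tokens:]


def linear_head(patch_states, weights, bias):
    """Flatten the patch states and apply one linear layer."""
    flat = [value for state in patch_states for value in state]
    return [
        sum(w * x for w, x in zip(row, flat)) + b
        for row, b in zip(weights, bias)
    ]


num_text_tokens, num_patches, hidden_dim, pred_len = 2, 3, 2, 7

# Toy hidden states: 2 text-prompt positions followed by 3 patch positions.
hidden_states = [
    [9.0, 9.0], [8.0, 8.0],              # text prompt positions (not used by the head)
    [0.1, 0.2], [0.3, 0.4], [0.5, 0.6],  # time series patch positions
]
# Toy weights of shape (pred_len, num_patches * hidden_dim).
weights = [[0.1 * (i + 1)] * (num_patches * hidden_dim) for i in range(pred_len)]
bias = [0.0] * pred_len

patch_states = select_time_series_states(hidden_states, num_text_tokens)
forecast = linear_head(patch_states, weights, bias)
```

In the real model the text still influences the patch states through attention inside the transformer; the head simply does not read the text positions directly.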
Experiments
Baselines and Experimental Settings
To evaluate the model’s performance on our generated dataset, we choose four SOTA Transformer-based models, namely FEDformer (Zhou et al. 2022), Autoformer (Wu et al. 2021), Informer (Zhou et al. 2021), and PatchTST (Nie et al. 2023), and two linear models, DLinear and NLinear (Zeng et al. 2023), as our baselines. We also include two pre-trained large language models, LLaMA (Touvron et al. 2023) and GPT (Radford et al. 2019), as baselines, both adapted to time series forecasting in a similar way to GPT4TS (Tian Zhou 2023). All of these models follow the same experimental setup with a prediction length T = 7 and a look-back window size L = 15 to forecast all three numerical time-series variables: NumMentions, NumArticles, NumSources. Under this setting, by dividing the data into training, validation, and test sets with a ratio of 7 : 2 : 1, we gathered 343,200 training examples, 98,976 validation examples, and 41,904 testing examples.
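To illustrate how a look-back window of L = 15 and a prediction length of T = 7 turn a series into training examples, here is a rough sketch. It assumes a chronological 7 : 2 : 1 split and windows that do not cross split boundaries; the source does not spell out either detail, so the example counts it produces are not expected to match the figures reported above. The function names and the toy series are hypothetical.

```python
def chronological_split(series, ratios=(7, 2, 1)):
    """Split a series in time order using integer ratios."""
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = train_end + n * ratios[1] // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


def sliding_windows(series, lookback=15, horizon=7):
    """Build (history, target) pairs from one univariate series."""
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        history = series[start:start + lookback]
        target = series[start + lookback:start + lookback + horizon]
        samples.append((history, target))
    return samples


# Toy daily series of 300 points (invented for illustration).
series = list(range(300))
train, val, test = chronological_split(series)

train_samples = sliding_windows(train)
val_samples = sliding_windows(val)
test_samples = sliding_windows(test)
```

A series of length n yields n - L - T + 1 windows, so the 210-point training segment here gives 189 examples.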
Main Results
Table 1 and Table 2 show the multivariate forecasting results. The corresponding event type name for each event number is in Table 3. Overall, our model achieves the best average performance among the 10 event types for news and steadily exceeds the performance of other SOTA models in most event types. Quantitatively, while GPT4TS already achieves a SOTA performance, by using textual information as prompts to guide time series forecasting, our model achieves an overall 4.14% reduction on MSE and 1.0% reduction on MAE.
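For reference, the two metrics and a relative reduction can be computed as below. The numbers are toy values chosen for easy arithmetic, not results from the paper, and the reported percentages are assumed here to be relative reductions with respect to the baseline; `relative_reduction` is a hypothetical helper.

```python
def mse(predictions, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def mae(predictions, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def relative_reduction(baseline_error, improved_error):
    """Percentage by which the error drops relative to the baseline."""
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Toy values, invented for illustration only.
targets = [1.0, 2.0, 3.0, 4.0]
baseline_predictions = [1.5, 2.5, 2.5, 4.5]
improved_predictions = [1.0, 2.5, 3.0, 4.5]

mse_reduction = relative_reduction(
    mse(baseline_predictions, targets), mse(improved_predictions, targets)
)
mae_reduction = relative_reduction(
    mae(baseline_predictions, targets), mae(improved_predictions, targets)
)
```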
Ablation Study
We study the effects of only taking hidden representations from the time series inputs to the output layers. By comparing our methods with taking full hidden representations (from both textual inputs and time series inputs) in Table 4 and Table 5, one can observe that the performance is polluted by the hidden representations of textual inputs, though it is still better than without textual information (GPT4TS). This demonstrates the effectiveness of only taking the hidden representations from time series inputs into our model.
Analysis
The improvement in performance of our model is understood as a collaborative result of numerical data, textual information, and a prompt-based methodology for multimodal integration:
• Textual information provides contexts that numerical data alone cannot capture. For instance, in our dataset, news articles featuring prominent figures or with strong sentiments or characteristics can obtain a more eminent representation and therefore influence the fluctuation or trend in the numerical data.
Therefore, our model is able to outperform other unimodality models. As shown in the ablation study, even GPT4MTS with full selection can achieve a better MSE score and a similar MAE score compared to GPT4TS, which already outperforms all other unimodal time series models. This demonstrates the benefits of having textual context for time series forecasting tasks.
• Numerical data offers precise quantitative values, which are indispensable for forecasting tasks. It uncovers the statistical patterns inherent in the dataset. Given that our forecasting targets are also represented numerically, this data primarily steers the predictions of our model. As observed from Table 1 and Table 2, when only utilizing numerical data, linear-based models (DLinear, NLinear) generally outperform transformer-based models (Informer, Autoformer, Transformer). This indicates that the transformer models may over-complicate the observed numerical information in our dataset with their complex architecture. However, pre-trained LLMs (LLaMA and GPT4TS) tend to achieve better performance than all other models, which demonstrates the effectiveness of cross-modality adaptation of these pretrained large language models for time series forecasting tasks.
• The methodology of prompt-based integration for textual and numerical inputs is crucial. Comparative analyses between GPT4TS and our model highlight the efficacy of harnessing additional text representations as prompts. Concurrently, our ablation study demonstrates that these prompts should act as guiding factors, not as intrinsic components of the output representation. This approach facilitates the learning ability of LLMs in processing numerical data without being overshadowed by textual information, ensuring moderate assimilation of contextual information. Over-reliance on textual information, however, as demonstrated through the ablation study, could harm the predicting performance.
Conclusion and Future Work
This paper introduces an effective pipeline for collecting textual information for time series forecasting tasks: a dataset is collected under this pipeline, and a corresponding multimodal time series forecasting model is proposed. Through extensive experiments and analysis, we show that our dataset and pipeline are reliable and that our model, GPT4MTS, improves prediction performance by using multimodal inputs. This further testifies to the benefits of the extra textual information collected under our pipeline.
We provide a general pipeline for multimodal dataset generation and accordingly establish a dataset for the news domain. Our work also demonstrates the potential of both the dataset and our model as foundational elements for future multimodal time series forecasting research. A crucial next step in this field would be to further explore multimodal time series forecasting, particularly leveraging the capabilities of Large Language Models (LLMs). This direction promises significant advancements in the understanding and application of multimodal data in forecasting tasks.