  • [CSDI] Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation
    Research/Generative Model 2024. 5. 21. 11:12

    https://arxiv.org/pdf/2107.03502


    Abstract

    The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-65% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines. The code is available at https://github.com/ermongroup/CSDI.


    1. Introduction

    Multivariate time series are abundant in real world applications such as finance, meteorology and healthcare. These time series data often contain missing values due to various reasons, including device failures and human errors [1–3]. Since missing values can hamper the interpretation of a time series, many studies have addressed the task of imputing missing values using machine learning techniques [4–6]. In the past few years, imputation methods based on deep neural networks have shown great success for both deterministic imputation [7–9] and probabilistic imputation [10]. These imputation methods typically utilize autoregressive models to deal with time series.

     

    Score-based diffusion models – a class of deep generative models that generate samples by gradually converting noise into a plausible data sample through denoising – have recently achieved state-of-the-art sample quality in many tasks such as image generation [11, 12] and audio synthesis [13, 14], outperforming counterparts including autoregressive models. Diffusion models can also be used to impute missing values by approximating the scores of the posterior distribution, obtained from the prior by conditioning on the observed values [12, 15, 16]. While these approximations may work well in practice, they do not correspond to the exact conditional distribution.

     

    In this paper, we propose CSDI, a novel probabilistic imputation method that directly learns the conditional distribution with conditional score-based diffusion models. Unlike existing score-based approaches, the conditional diffusion model is designed for imputation and can exploit useful information in observed values. We illustrate the procedure of time series imputation with CSDI in Figure 1. We start imputation from random noise on the left of the figure and gradually convert the noise into plausible time series through the reverse process pθ of the conditional diffusion model. At each step t, the reverse process removes noise from the output of the previous step (t + 1). Unlike existing score-based diffusion models, the reverse process can take observations (on the top left of the figure) as a conditional input, allowing the model to exploit information in the observations for denoising. We utilize an attention mechanism to capture the temporal and feature dependencies of time series.
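    The reverse process described above can be sketched in a few lines of NumPy. One denoising step holds the observed entries fixed as conditioning information and only denoises the imputation targets; the trained network eps_theta is replaced by a dummy `eps_model`, and the schedule, function names, and noise term are illustrative rather than the authors' implementation.

```python
import numpy as np

def reverse_step(x_t, t, cond_obs, cond_mask, eps_model, betas, rng):
    """One reverse (denoising) step conditioned on observations.

    x_t       : current noisy values at step t
    cond_obs  : observed values (conditional input, kept fixed)
    cond_mask : 1 where observed, 0 where the value must be imputed
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    # The network sees the noisy targets together with the observations.
    eps = eps_model(x_t, cond_obs, cond_mask, t)

    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:  # add noise at every step except the last
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

    # Observed entries are conditioning information, never overwritten.
    return cond_mask * cond_obs + (1 - cond_mask) * mean

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)              # toy noise schedule
x = rng.standard_normal((4, 3))                  # 4 time steps, 3 features
mask = (rng.random((4, 3)) > 0.5).astype(float)  # 1 = observed
obs = rng.standard_normal((4, 3)) * mask

eps_model = lambda x_t, o, m, t: np.zeros_like(x_t)  # dummy network
for t in reversed(range(len(betas))):
    x = reverse_step(x, t, obs, mask, eps_model, betas, rng)
```

    After the loop, the observed entries of `x` equal the observations exactly, while the remaining entries hold the generated imputation.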

     

    For training the conditional diffusion model, we need observed values (i.e., conditional information) and ground-truth missing values (i.e., imputation targets). However, in practice we do not know the ground-truth missing values, or training data may not contain missing values at all. Then, inspired by masked language modeling, we develop a self-supervised training method that separates observed values into conditional information and imputation targets. We note that CSDI is formulated for general imputation tasks, and is not restricted to time series imputation.
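    The self-supervised split can be sketched as a masking operation: a fraction of the observed entries is hidden and used as the imputation target, and the rest serve as conditional observations. Function and variable names here are illustrative, not the authors' code.

```python
import numpy as np

def self_supervised_split(obs_mask, target_ratio, rng):
    """Split observed entries into conditional observations and targets."""
    obs_idx = np.argwhere(obs_mask == 1)
    n_target = int(len(obs_idx) * target_ratio)
    chosen = obs_idx[rng.choice(len(obs_idx), size=n_target, replace=False)]

    target_mask = np.zeros_like(obs_mask)
    target_mask[tuple(chosen.T)] = 1
    cond_mask = obs_mask - target_mask  # remaining observations
    return cond_mask, target_mask

rng = np.random.default_rng(0)
obs_mask = (rng.random((6, 4)) > 0.2).astype(float)  # 1 = observed
cond_mask, target_mask = self_supervised_split(obs_mask, 0.3, rng)
```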

     

    Our main contributions are as follows:

    • We propose conditional score-based diffusion models for probabilistic imputation (CSDI), and implement CSDI for time series imputation. To train the conditional diffusion model, we develop a self-supervised training method.
    • We empirically show that CSDI improves the continuous ranked probability score (CRPS) by 40-65% over existing probabilistic methods on healthcare and environmental data. Moreover, deterministic imputation with CSDI decreases the mean absolute error (MAE) by 5-20% compared to the state-of-the-art methods developed for deterministic imputation.
    • We demonstrate that CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines designed for these tasks.

    2. Related works

    Time series imputation with deep learning Previous studies have shown that deep learning models can capture the temporal dependency of time series and give more accurate imputations than statistical methods. A popular approach is to use RNNs, including LSTMs and GRUs, for sequence modeling [17, 8, 7]. Subsequent studies combined RNNs with other methods, such as GANs [9, 18, 19] and self-training [20], to improve imputation performance. Among them, the combination of RNNs with attention mechanisms is particularly successful for the imputation and interpolation of time series [21, 22]. While these methods focus on deterministic imputation, GP-VAE [10] has recently been developed as a probabilistic imputation method.

     

    Score-based generative models Score-based generative models, including score matching with Langevin dynamics [23] and denoising diffusion probabilistic models [11], have outperformed existing methods with other deep generative models in many domains, such as images [23, 11], audio [13, 14], and graphs [24]. Most recently, TimeGrad [25] utilized diffusion probabilistic models for probabilistic time series forecasting. While the method has shown state-of-the-art performance, it cannot be applied to time series imputation due to the use of RNNs to handle past time series.


    3. Background

    3.1. Multivariate time series imputation

    Probabilistic time series imputation is the task of estimating the distribution of the missing values of a multivariate time series X by exploiting its observed values. We note that this definition of imputation includes other related tasks, such as interpolation, which imputes all features at target time points, and forecasting, which imputes all features at future time points.


    3.2. Denoising diffusion probabilistic models


    3.3. Imputation with diffusion models


    4. Conditional score-based diffusion model for imputation (CSDI)

    In this section, we propose CSDI, a novel imputation method based on a conditional score-based diffusion model. The conditional diffusion model allows us to exploit useful information in observed values for accurate imputation. We provide the reverse process of the conditional diffusion model, and then develop a self-supervised training method. We note that CSDI is not restricted to time series.


    4.1. Imputation with CSDI


    4.2. Training of CSDI


    4.3. Choice of imputation targets in self-supervised learning

    In the proposed self-supervised learning, the choice of imputation targets is important. We provide four target choice strategies depending on what is known about the missing patterns in the test dataset. We describe the algorithm for these strategies in Appendix B.2.

     

    (1) Random strategy: this strategy is used when we know nothing about the missing patterns, and randomly chooses a certain percentage of observed values as imputation targets. The percentage is sampled from [0%, 100%] so that the model adapts to various missing ratios in the test dataset.

     

    (2) Historical strategy: this strategy exploits missing patterns in the training dataset. Given a training sample x0, we randomly draw another sample x˜0 from the training dataset. Then, we set the intersection of the observed indices of x0 and the missing indices of x˜0 as imputation targets. The motivation of this strategy comes from structured missing patterns in the real world. For example, missing values often appear consecutively in time series data. When missing patterns in the training and test dataset are highly correlated, this strategy helps the model learn a good conditional distribution.

     

    (3) Mix strategy: this strategy mixes the above two. The historical strategy may lead to overfitting to the missing patterns in the training dataset; the mix strategy combines the generalization of the random strategy with the structured missing patterns of the historical strategy.

     

    (4) Test pattern strategy: when we know the missing patterns in the test dataset, we just set the patterns as imputation targets. For example, this strategy is used for time series forecasting, since the missing patterns in the test dataset are fixed to given future time points.
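    The random and historical strategies above can be sketched as simple mask operations (the mix strategy just picks one of the two per training sample). All names are illustrative.

```python
import numpy as np

def random_targets(obs_mask, rng):
    """Hide a random fraction (drawn from [0, 1]) of observed values."""
    ratio = rng.random()
    hide = (rng.random(obs_mask.shape) < ratio).astype(float)
    return obs_mask * hide  # targets are a subset of observed entries

def historical_targets(obs_mask, other_obs_mask):
    """Targets = observed here AND missing in another training sample."""
    return obs_mask * (1 - other_obs_mask)

rng = np.random.default_rng(1)
obs = (rng.random((5, 3)) > 0.3).astype(float)    # this sample's mask
other = (rng.random((5, 3)) > 0.5).astype(float)  # another sample's mask
t_rand = random_targets(obs, rng)
t_hist = historical_targets(obs, other)
```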


    5. Implementation of CSDI for time series imputation

     

    Attention mechanism To capture the temporal and feature dependencies of multivariate time series, we utilize a two-dimensional attention mechanism in each residual layer instead of a convolutional architecture. As shown in Figure 3, we introduce a temporal Transformer layer and a feature Transformer layer, which are 1-layer Transformer encoders. The temporal Transformer layer takes tensors for each feature as inputs to learn temporal dependencies, whereas the feature Transformer layer takes tensors for each time point as inputs to learn feature dependencies.
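    The two-dimensional attention can be sketched as two reshapes around an attention call: temporal attention runs over the length axis independently for each feature, and feature attention runs over the feature axis independently for each time point. The toy `self_attention` below uses identity projections purely for illustration; the actual model uses full Transformer encoder layers.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention with identity Q/K/V projections."""
    d = x.shape[-1]
    scores = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return scores @ x

# h: (batch B, features K, time L, channels C)
B, K, L, C = 2, 3, 5, 4
h = np.random.default_rng(0).standard_normal((B, K, L, C))

# Temporal attention: attend over L independently for each feature.
temporal_out = self_attention(h.reshape(B * K, L, C)).reshape(B, K, L, C)

# Feature attention: attend over K independently for each time point.
feature_in = temporal_out.transpose(0, 2, 1, 3).reshape(B * L, K, C)
feature_out = (self_attention(feature_in)
               .reshape(B, L, K, C).transpose(0, 2, 1, 3))
```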

     

    Note that while the length L can be different for each time series as mentioned in Section 3.1, the attention mechanism allows the model to handle various lengths. For batch training, we apply zero padding to each sequence so that the lengths of the sequences are the same.

     

    Side information In addition to the arguments of eθ, we provide side information as additional inputs to the model. First, we use a time embedding of s = {s1:L} to learn the temporal dependency. Following previous studies [29, 30], we use a 128-dimensional temporal embedding. Second, we use a 16-dimensional categorical feature embedding for the K features.
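    A 128-dimensional time embedding of this kind can be sketched with the usual sinusoidal construction from Transformer-style positional encodings; the exact formulation in the authors' implementation may differ.

```python
import numpy as np

def time_embedding(timestamps, dim=128):
    """Sinusoidal embedding: timestamps (L,) -> (L, dim)."""
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.outer(timestamps, freqs)  # (L, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

emb = time_embedding(np.arange(48))  # e.g. 48 hourly time steps
```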


    6. Experimental results

    In this section, we demonstrate the effectiveness of CSDI for time series imputation. Since CSDI can be applied to other related tasks such as interpolation and forecasting, we also evaluate CSDI for these tasks to show the flexibility of CSDI. Due to the page limitation, we provide the detailed setup for experiments including train/validation/test splits and hyperparameters in Appendix E.2.


    6.1. Time series imputation

    Dataset and experiment settings We run experiments on two datasets. The first is the healthcare dataset from the PhysioNet Challenge 2012 [1], which consists of 4000 clinical time series with 35 variables over 48 hours in intensive care units (ICUs). Following previous studies [7, 8], we process the dataset into hourly time series with 48 time steps. The processed dataset contains around 80% missing values. Since the dataset has no ground-truth, we randomly choose 10/50/90% of observed values as ground-truth on the test data.

     

    The second one is the air quality dataset [2]. Following previous studies [7, 21], we use hourly sampled PM2.5 measurements from 36 stations in Beijing for 12 months and set 36 consecutive time steps as one time series. There are around 13% missing values and the missing patterns are not random. The dataset contains artificial ground-truth, whose missing patterns are also structured.

     

    For both datasets, we run each experiment five times. As the target choice strategy for training, we adopt the random strategy for the healthcare dataset and the mix of the random and historical strategies for the air quality dataset, based on the missing patterns of each dataset.

     

    Results of probabilistic imputation CSDI is compared with three baselines. 1) Multitask GP [31]: the method learns the covariance between time points and features simultaneously. 2) GP-VAE [10]: the method showed state-of-the-art results for probabilistic imputation. 3) V-RIN [32]: a deterministic imputation method that uses the uncertainty quantified by a VAE to improve imputation. For V-RIN, we regard the quantified uncertainty as probabilistic imputation. In addition, we compare CSDI with imputation using the unconditional diffusion model in order to show the effectiveness of the conditional one (see Appendix C for training and imputation with the unconditional diffusion model).

     

    We first show quantitative results. We adopt the continuous ranked probability score (CRPS) [33] as the metric, which is frequently used for evaluating probabilistic time series forecasting and measures the compatibility of an estimated probability distribution with an observation. We generate 100 samples to approximate the probability distribution over missing values and report the normalized average of CRPS for all missing values, following previous studies [34] (see Appendix E.3 for details of the computation).
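    CRPS can be estimated directly from generated samples. The sketch below uses the standard empirical (energy-form) estimator rather than the quantile-based computation of the paper's Appendix E.3, so the constants are illustrative only; lower scores mean the samples are more compatible with the observation.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Empirical CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
samples = rng.standard_normal(100)            # 100 generated samples
score_far = crps_from_samples(samples, 5.0)   # observation far from mass
score_near = crps_from_samples(samples, 0.0)  # observation near the mean
```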

     

    Results of deterministic imputation We demonstrate that CSDI also provides accurate deterministic imputations, which are obtained as the per-entry median of 100 generated samples. We compare CSDI with four baselines developed for deterministic imputation, including GLIMA [21], which combines recurrent imputation with an attention mechanism to capture temporal and feature dependencies and showed state-of-the-art performance. These methods are based on autoregressive models. We use the original implementations except for RDIS.
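    Reducing the probabilistic output to a deterministic imputation is just a median over samples, evaluated by MAE on the held-out targets. The sketch below uses synthetic samples scattered around a known truth as a stand-in for CSDI's output.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.standard_normal((4, 3))  # ground-truth values (toy data)

# Held-out entries whose true values are known for evaluation.
target_mask = np.zeros((4, 3))
target_mask[::2] = 1.0

# 100 generated samples around the truth (stand-in for CSDI samples).
samples = truth[None] + 0.1 * rng.standard_normal((100, 4, 3))

# Deterministic imputation: per-entry median of the samples.
point_estimate = np.median(samples, axis=0)

# MAE over the imputation targets only.
mae = np.abs((point_estimate - truth) * target_mask).sum() / target_mask.sum()
```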

     

    We evaluate each method by the mean absolute error (MAE). In Table 3, CSDI improves MAE by 5-20% compared to the baselines. This suggests that the conditional diffusion model is effective at learning temporal and feature dependencies for imputation. For the healthcare dataset, the gap between the baselines and CSDI is particularly significant when the missing ratio is small, because more observed values help CSDI capture dependencies.


    7. Conclusion

    In this paper, we have proposed CSDI, a novel approach to imputing multivariate time series with conditional diffusion models. We have shown that CSDI outperforms existing probabilistic and deterministic imputation methods.

     

    There are some interesting directions for future work. One direction is to improve the computation efficiency. While diffusion models generate plausible samples, sampling is generally slower than other generative models. To mitigate the issue, several recent studies leverage an ODE solver to accelerate the sampling procedure [12, 38, 13]. Combining our method with these approaches would likely improve the sampling efficiency.

     

    Another direction is to extend CSDI to downstream tasks such as classifications. Many previous studies have shown that accurate imputation improves the performance on downstream tasks [7, 18, 22]. Since conditional diffusion models can learn temporal and feature dependencies with uncertainty, joint training of imputations and downstream tasks using conditional diffusion models would be helpful to improve the performance of the downstream tasks.

     

    Finally, although our focus was on time series, it would be interesting to explore CSDI as an imputation technique for other modalities.


    A Details of denoising diffusion probabilistic models


    B Algorithms

    B.1. Algorithm for training and sampling of CSDI

    We provide the training procedure of CSDI in Algorithm 1 and the imputation (sampling) procedure with CSDI in Algorithm 2, which are described in Section 4.


    C. Training and imputation for unconditional diffusion model

     
