Authors:
(1) Vijay Ekambaram, IBM Research;
(2) Arindam Jati, IBM Research;
(3) Nam H. Nguyen, IBM Research;
(4) Pankaj Dayama, IBM Research;
(5) Chandra Reddy, IBM Research;
(6) Wesley M. Gifford, IBM Research;
(7) Jayant Kalagnanam, IBM Research.
Editor's note: this is part 4 of 5 of a study detailing the development of a tiny, fast AI model that delivers excellent accuracy.
Datasets: Pre-training employs a subset of the Monash data hub [Godahewa et al., 2021] comprising ∼244M samples. We specifically exclude datasets with coarse resolutions (e.g., yearly and monthly) as they do not possess sufficient length for the long-term forecasting task. Moreover, we remove all the datasets that we utilize for evaluation (i.e., weather, electricity, and traffic). For zero/few-shot evaluation we consider seven public datasets (D1): ETTH1, ETTH2, ETTM1, ETTM2, Weather, Electricity & Traffic, as popularly used in most prior SOTA works [Zhou et al., 2021; Nie et al., 2023]. Since these datasets neither contain exogenous variables nor exhibit cross-channel correlation benefits, we incorporate four other datasets (D2) for separately validating the efficacy of the decoder channel-mixing and exogenous mixer modules: bike sharing (BS) [Fanaee-T, 2013], carbon capture plant (CC) [Jablonka et al., 2023], and two more datasets from the Biz-IT observability domain [ITB, 2023]: Application (APP) and Service (SER). Refer to the Appendix for full data details.
SOTA Benchmarks: We benchmark TTM against the latest public SOTA forecasting models, categorized as follows: (a) LLM-based TS pre-trained models: GPT4TS [Zhou et al., 2023], LLMTime [Gruver et al., 2023]; (b) self-supervised pre-trained models: SimMTM [Dong et al., 2023], Ti-MAE [Li et al., 2023], TST [Zerveas et al., 2021], LaST [Wang et al., 2022], TF-C [Zhang et al., 2022], CoST [Woo et al., 2022], and Ts2Vec [Yue et al., 2022]; (c) TS transformer models: PatchTST [Nie et al., 2023], FEDFormer [Zhou et al., 2022], Autoformer [Wu et al., 2021]; (d) other SOTA models: TSMixer [Ekambaram et al., 2023], DLinear [Zeng et al., 2022], and TimesNet [Wu et al., 2022].
TTM Model Details: For our experiments, we build five pre-trained models for the following (sl, fl) configurations: (512, 96), (512, 192), (512, 336), (512, 720), and (96, 24). Based on the sl and fl requirements of the target dataset, a suitable pre-trained model is selected to initialize the weights for fine-tuning. Since our pre-trained models are very small, they can be trained quickly in a few hours (4-8 hrs), as opposed to several days or weeks in standard approaches. Hence, pre-training multiple TTM models is no longer a practical constraint. Pre-training is performed in a distributed fashion with 50 CPUs and 6 A100 GPUs with 40 GB GPU memory, while fine-tuning uses only 1 GPU. Other model configurations are as follows: patch length pl = 64 (for sl = 512) and 8 (for sl = 96), stride s = pl, number of levels L = 6, number of TTM blocks per level M = 2, number of decoder layers = 2, batch size b = 3K, number of epochs ep = 20, and dropout do = 0.2. TSMixer-specific hyperparameters include feature scaler fs = 3, hidden feature size hf = fs ∗ pl, and expansion feature size ef = hf ∗ 2. Resolution prefix tuning is disabled by default and enabled only for shorter context lengths (as explained in Table 10). Decoder channel-mixing and exogenous mixer blocks are disabled during pre-training and enabled for D2 datasets during fine-tuning. All other model parameters remain the same across pre-training and fine-tuning, except for a few parameters like dropout and batch size that can be adjusted based on the target data. Hyperparameters are selected based on the validation performance, and the test results are reported. For full implementation details, refer to the Appendix. MSE is used as the default evaluation metric.
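To make these settings concrete, the snippet below collects them into a single configuration object. This is an illustrative sketch only: the `TTMConfig` dataclass and its field names are our shorthand for this write-up, not a released API.

```python
from dataclasses import dataclass

@dataclass
class TTMConfig:
    # Context and forecast lengths (one pre-trained model per (sl, fl) pair).
    sl: int = 512            # input context length
    fl: int = 96             # forecast length
    # Patching: pl = 64 for sl = 512 (8 for sl = 96), non-overlapping (stride = pl).
    pl: int = 64
    stride: int = 64
    # Backbone depth: L levels with M TTM blocks each, plus a thin decoder.
    num_levels: int = 6
    blocks_per_level: int = 2
    decoder_layers: int = 2
    # Optimization settings used for pre-training.
    batch_size: int = 3072
    epochs: int = 20
    dropout: float = 0.2
    # TSMixer-specific sizes: hf = fs * pl, ef = hf * 2.
    feature_scaler: int = 3
    # Optional components (off during pre-training; enabled case by case).
    resolution_prefix_tuning: bool = False
    decoder_channel_mixing: bool = False
    exogenous_mixer: bool = False

    @property
    def hidden_feature_size(self) -> int:
        return self.feature_scaler * self.pl

    @property
    def expansion_feature_size(self) -> int:
        return self.hidden_feature_size * 2

# Example: the (512, 96) pre-training configuration described above.
cfg = TTMConfig(sl=512, fl=96, pl=64, stride=64)
print(cfg.hidden_feature_size, cfg.expansion_feature_size)  # 192 384
```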
Table 1 compares the performance of the TTM model in zero-shot and few-shot (5%) settings across multiple fls. Baseline results are reported from [Zhou et al., 2023], as we use the same few-shot data filtering strategy followed in that paper. Note that for zero-shot performance, the pre-trained TTM model is directly evaluated on the test set. In the 5% few-shot setting, TTM outperforms the SOTAs in most cases with significant accuracy gains (12-38%). An even more impressive observation is that TTM in the zero-shot setting also outperforms most of the SOTAs that are trained on 5% of the target data. This observation establishes the generalization ability of the pre-trained TTM model on the target datasets. Likewise, Table 4 shows the 10% few-shot performance of TTM, where we outperform all existing SOTAs with accuracy gains of 4-45%. In addition, TTM zero-shot also beats many SOTAs (though not all) trained on 10% of the data, highlighting the effectiveness of our approach.
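For readers unfamiliar with the few-shot protocol, the sketch below shows one common way the x% setting is realized: only the leading fraction of the training split is retained, while the validation and test splits stay untouched. The helper name and the exact slicing are assumptions for illustration; the experiments above follow the filtering strategy of [Zhou et al., 2023].

```python
import numpy as np

def few_shot_train_split(train_series: np.ndarray, fraction: float = 0.05) -> np.ndarray:
    """Keep only the leading `fraction` of the training portion of a series.

    This mirrors the common few-shot convention of retaining the first p% of
    the training split; validation and test splits are left untouched.
    """
    n_keep = max(1, int(len(train_series) * fraction))
    return train_series[:n_keep]

# Example: a 10,000-step, 7-channel training series reduced to its first 5%.
train = np.random.randn(10_000, 7)          # (time, channels)
train_5pct = few_shot_train_split(train, fraction=0.05)
print(train_5pct.shape)                     # (500, 7)
```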
Additionally, we conduct a comparison between TTM and the LLMTime model, which is explicitly designed for the zero-shot setting. Since these models are based on LLaMA and are massive in size, the authors used only the last window of the standard test set for faster evaluation, as opposed to using all the windows in the full test set. Hence, we compare them separately in Table 5 based on the same datasets, fl, and test set as reported in [Gruver et al., 2023]. In this evaluation, we outperform LLMTime by a substantial margin of 29%. In addition, there are alternative pre-training approaches, such as masked modeling and contrastive learning techniques, that may not offer universal transfer learning capabilities across all datasets (like TTM) but excel in enabling cross-transfer learning between two datasets when carefully selected. Table 3 illustrates the cross-transfer learning from ETTH2 to ETTH1 for these models in various few-shot settings (as reported in [Dong et al., 2023]). Notably, TTM, with no specific cross-data selection, outperforms all popular SOTAs, including the latest SimMTM, by 17-43%. Thus, TTM significantly beats all the existing SOTAs, including the recent popular transfer learning benchmarks based on LLMs. Importantly, we achieve these accuracy gains with significantly reduced compute requirements, as elaborated next.
Table 2 compares the computational benefits of TTM over GPT4TS, the most popular LLM-TS model and the current best SOTA in the few-shot setting. For a fair comparison, we run both models with their best-reported parameters on a single A100 GPU with the same setup (details in the Appendix). Since GPT4TS is based on GPT-2, it has a huge model footprint and exhibits slow execution. Specifically, TTM achieves a 14X cut in Fine-tune (FT) parameters (NPARAMS) and a remarkable 106X reduction in Total (TL) parameters. This reduction in model footprint further leads to a significant reduction in fine-tuning EPOCH TIME by 65X, MAX MEMORY usage by 27X, and total inference time on the entire test data (TEST TIME) by 54X. The computational benefits also apply to other LLM-based time series models like LLMTime, built on LLaMA, which is even larger than GPT-2. In fact, LLMTime used a very small test set on these datasets, as opposed to the standard test set, to work around this slow execution.
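As a rough guide to how such numbers can be gathered, the sketch below measures trainable (fine-tune) and total parameters, peak GPU memory, and wall-clock inference time for an arbitrary PyTorch model. It is a generic measurement harness written for this article, not the authors' benchmarking code, and it assumes a CUDA device and a test loader that yields (input, target) batches.

```python
import time
import torch

def footprint_and_speed(model: torch.nn.Module, test_loader, device: str = "cuda"):
    """Collect the kind of metrics compared in Table 2 for any PyTorch model:
    trainable (fine-tune) parameters, total parameters, peak GPU memory, and
    total inference time over the full test loader."""
    ft_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    start = time.time()
    with torch.no_grad():
        for batch_x, _ in test_loader:       # assumes (input, target) batches
            model(batch_x.to(device))
    test_time_s = time.time() - start
    max_mem_gb = torch.cuda.max_memory_allocated(device) / 1e9
    return {"ft_params": ft_params, "total_params": total_params,
            "test_time_s": test_time_s, "max_mem_gb": max_mem_gb}
```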
Since the datasets (D1) used in the previous experiments do not have exogenous variables, we evaluate the effectiveness of TTM on four other datasets (D2, as explained in Section 4.1) to quantify its benefits. Since these datasets are already very small, we use their full data for fine-tuning. Table 6 shows the performance of the pre-trained TTM model fine-tuned on the target data with the exogenous mixer module and decoder channel-mixing enabled (TTM-CM). We compare the performance of TTM-CM with the pre-trained TTM used in the zero-shot setting (TTM-Zero-shot), plain TTM (TTM), and other primary SOTAs (PatchTST, TSMixer variants, and GPT4TS) trained from scratch. Other SOTAs are not reported here, considering their inferior performance and space constraints. Specifically, we compare with TSMixer with channel-mixing enabled (TSMixer-CM) and TSMixer with a cross-channel reconciliation head (TSMixer-CC) [Ekambaram et al., 2023], as they are the latest SOTAs in channel-correlation modeling. We can see that TTM-CM outperforms all the competitive models by a significant margin (15-44%), thus demonstrating the power of TTM in capturing inter-channel correlations.
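Conceptually, the only change relative to the earlier experiments is a pair of configuration switches turned on at fine-tuning time. The dictionary below is a hypothetical illustration of such a setup (the keys are ours, not a released API):

```python
# Illustrative fine-tuning configuration for the D2 datasets. The pre-trained,
# channel-independent backbone is reused, while decoder channel-mixing and the
# exogenous mixer are enabled to form TTM-CM. All key names are hypothetical.
ttm_cm_finetune_config = {
    "context_length": 512,
    "forecast_length": 96,
    "init_from_pretrained": True,     # start from pre-trained TTM weights
    "decoder_channel_mixing": True,   # model cross-channel correlations in the decoder
    "exogenous_mixer": True,          # condition on known future exogenous channels
    "train_fraction": 1.0,            # D2 datasets are small, so the full data is used
    "dropout": 0.2,                   # dropout/batch size may be re-tuned per target
}
```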
Effect of Pre-training: Table 7 illustrates the advantages of the proposed pre-training (PT) approach in comparison to randomly initialized (RI) model weights. In the zero-shot case, the pre-trained TTM (PT) exhibits a 36% improvement over RI. This outcome is expected, as random weights are directly employed for forecasting with RI. For the 5% few-shot scenario, PT achieves noteworthy improvements of 21% and 12% over RI when fine-tuned for 5 and 50 epochs, respectively. This underscores the utility of pre-trained weights in facilitating quick learning with a limited amount of target data. Even in the 10% few-shot case, PT continues to demonstrate improved performance: 9% for 5 epochs and 2% for 50 epochs. Thus, the performance impact of leveraging pre-trained weights grows as the amount of training data and the available fine-tuning budget shrink.
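The protocol behind this ablation is a standard fine-tune-and-evaluate loop run twice per budget: once starting from the pre-trained checkpoint (PT) and once from random initialization (RI), for 5 and 50 epochs each. The sketch below is our reconstruction of such a harness, assuming PyTorch data loaders and MSE as the metric; it is not the authors' code.

```python
import torch
from torch import nn

def finetune_and_eval(model: nn.Module, train_loader, test_loader,
                      epochs: int, lr: float = 1e-3, device: str = "cpu") -> float:
    """Fine-tune a model for a fixed number of epochs and return test MSE.

    Call this once on a model loaded from a pre-trained checkpoint (PT) and
    once on a freshly constructed model (RI), with epochs in {5, 50}, to
    reproduce the style of comparison reported in Table 7.
    """
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in test_loader:
            total += loss_fn(model(x.to(device)), y.to(device)).item() * len(x)
            n += len(x)
    return total / n
```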
Effect of Adaptive Patching: Adaptive patching, presented in Section 3.1, is a key component of the TTM backbone. Table 8 reports the relative improvements when we employ adaptive patching compared to a vanilla TTM backbone without this module. We can see that adaptive patching adds an extra 4% improvement in the zero-shot setting and 2% in the 10% few-shot setting. Performance improvements are observed in almost all datasets, which justifies the effectiveness of the adaptive patching module when evaluated on multiple datasets of different resolutions.
Effect of Augmentation via Downsampling: Table 9 compares the zero-shot test performance of the TTM model trained on the original Monash data against the TTM trained on the augmented Monash data after downsampling, as described in Section 3.1. There is a significant improvement of 30% for the model pre-trained on the augmented datasets, as augmentation adds more data coverage across various resolutions. This observation highlights the ability of the proposed TTM model to learn from diverse datasets together and shows that increasing the number of datasets for each resolution helps.
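One plausible way to realize this augmentation is to mean-pool a high-resolution series over non-overlapping windows to obtain coarser-resolution copies. The sketch below is an illustrative assumption (the exact scheme is the one described in Section 3.1); strided subsampling would be an equally valid choice.

```python
import numpy as np

def downsample(series: np.ndarray, factor: int) -> np.ndarray:
    """Derive a coarser-resolution series by mean-pooling non-overlapping
    windows of length `factor` along the time axis (shape: time x channels).
    This is one plausible realization of the augmentation; other schemes
    (e.g., strided subsampling) are possible."""
    t = (len(series) // factor) * factor                    # drop the ragged tail
    return series[:t].reshape(-1, factor, series.shape[1]).mean(axis=1)

# Example: 15-minute and hourly series derived from a 1-minute, 3-channel series.
minutely = np.random.randn(10_000, 3)
print(downsample(minutely, 15).shape)   # (666, 3)
print(downsample(minutely, 60).shape)   # (166, 3)
```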
Effect of Resolution Prefix Tuning: Since resolution prefix tuning explicitly adds the resolution type as an extra embedding, it greatly helps when the input context length sl is short, where it is challenging for the model to infer the resolution type automatically. As shown in Table 10, we observe an 8% improvement in zero-shot performance for the shorter context length (sl = 96) and no improvement when the context length is longer (sl = 512). Hence, this component is optional and can be enabled when working with shorter input context lengths.
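A minimal sketch of what such a prefix could look like is shown below: the resolution id is mapped to a learnable embedding and prepended as one extra token to each channel's patch sequence. The module name, tensor layout, and placement are our assumptions for illustration; the paper only specifies that the resolution type is added as an extra embedding.

```python
import torch
from torch import nn

class ResolutionPrefix(nn.Module):
    """Map a resolution id (e.g., 0=minutely, 1=hourly, 2=daily, ...) to a
    learnable embedding and prepend it as one extra token per channel.
    Dimensions and placement are illustrative assumptions, not the released
    implementation."""
    def __init__(self, num_resolutions: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(num_resolutions, hidden_size)

    def forward(self, patches: torch.Tensor, resolution_id: torch.Tensor) -> torch.Tensor:
        # patches: (batch, channels, num_patches, hidden); resolution_id: (batch,)
        b, c, _, h = patches.shape
        prefix = self.embed(resolution_id)                    # (batch, hidden)
        prefix = prefix[:, None, None, :].expand(b, c, 1, h)  # one token per channel
        return torch.cat([prefix, patches], dim=2)            # (batch, channels, 1 + num_patches, hidden)

# Example: batch of 8 series, 7 channels, 8 patches, hidden size 192.
prefix_tuner = ResolutionPrefix(num_resolutions=6, hidden_size=192)
x = torch.randn(8, 7, 8, 192)
res_id = torch.full((8,), 1, dtype=torch.long)   # e.g., "hourly"
print(prefix_tuner(x, res_id).shape)             # torch.Size([8, 7, 9, 192])
```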
In this section, we compare the performance of TTM with very recent concurrent works published in 2024. Time-LLM [Jin et al., 2024] is a recent SOTA that enables successful reprogramming of LLMs for time-series tasks. However, as depicted in Table 11, TTM outperforms Time-LLM by 8% in the 5% few-shot setting and by 2% in the 10% few-shot setting. Likewise, UniTime [Liu et al., 2024] proposes novel techniques for language-empowered cross-domain time-series forecasting. As indicated in Table 12, TTM’s zero-shot results surpass UniTime’s zero-shot results by 27%. Further, TTM does not need any cross-domain transfer learning or pairwise dataset mapping, as our pre-training is performed on a diverse set of datasets.
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.