
Goodbye, Compute-Hungry Models—This Tiny AI Is the Future of Prediction


Too Long; Didn't Read

Researchers have developed a practical, efficient alternative to massive AI models for time series forecasting.


Authors:

(1) Vijay Ekambaram, IBM Research;

(2) Arindam Jati, IBM Research;

(3) Nam H. Nguyen, IBM Research;

(4) Pankaj Dayama, IBM Research;

(5) Chandra Reddy, IBM Research;

(6) Wesley M. Gifford, IBM Research;

(7) Jayant Kalagnanam, IBM Research.

Editor's note: this is part 5 of 5 of a study detailing the development of a tiny, fast AI model that delivers excellent accuracy. Read the rest below.

5 Conclusions and Future Work

Considering the diversity and limited availability of public TS pre-training data, pre-training large models for effective transfer learning in time series poses several challenges. Hence, we propose TTM, a multi-level Tiny Time Mixer model designed for efficient pre-training on limited, diverse, multi-resolution datasets. TTM achieves state-of-the-art results in zero/few-shot forecasting while offering significant computational efficiency and supporting cross-channel and exogenous variables, critical features lacking in existing popular methods. Going forward, we plan to extend our approach to many other downstream tasks beyond forecasting, toward a purely foundational approach in time series.

References

[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.


[Dong et al., 2023] Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. Simmtm: A simple pre-training framework for masked time-series modeling. In Advances in Neural Information Processing Systems, 2023.


[Ekambaram et al., 2023] Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 459–469, New York, NY, USA, 2023. Association for Computing Machinery.


[Fanaee-T, 2013] Hadi Fanaee-T. Bike Sharing Dataset. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5W894.


[Godahewa et al., 2021] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.


[Gruver et al., 2023] Nate Gruver, Marc Anton Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.


[Hyndman and Athanasopoulos, 2021] R.J. Hyndman and G. Athanasopoulos, editors. Forecasting: principles and practice. OTexts: Melbourne, Australia, 2021. OTexts.com/fpp3.


[ITB, 2023] Biz-datasets. https://github.com/BizITObs/BizITObservabilityData/tree/main/Complete/Time%20Series/RobotShop, 2023.


[Jablonka et al., 2023] Kevin Maik Jablonka, Charithea Charalambous, Eva Sanchez Fernandez, Georg Wiechers, Juliana Monteiro, Peter Moser, Berend Smit, and Susana Garcia. Machine learning for industrial processes: Forecasting amine emissions from a carbon capture plant. Science Advances, 9(1):eadc9576, 2023.


[Jati et al., 2023] Arindam Jati, Vijay Ekambaram, Shaonli Pal, Brian Quanz, Wesley M. Gifford, Pavithra Harsha, Stuart Siegel, Sumanta Mukherjee, and Chandra Narayanaswami. Hierarchical proxy modeling for improved hpo in time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 891–900, New York, NY, USA, 2023. Association for Computing Machinery.


[Jin et al., 2024] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time series forecasting by reprogramming large language models. In The Twelfth International Conference on Learning Representations, 2024.


[Li and Liang, 2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics.


[Li et al., 2023] Zhe Li, Zhongwen Rao, Lujia Pan, Pengyun Wang, and Zenglin Xu. Ti-mae: Self-supervised masked time series autoencoders. arXiv preprint arXiv:2301.08871, 2023.


[Liu et al., 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.


[Liu et al., 2024] Xu Liu, Junfeng Hu, Yuan Li, Shizhe Diao, Yuxuan Liang, Bryan Hooi, and Roger Zimmermann. Unitime: A language-empowered unified model for cross-domain time series forecasting. In Proceedings of the ACM Web Conference 2024, 2024.


[Makridakis et al., 2022] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 2022. https://doi.org/10.1016/j.ijforecast.2021.11.013.


[Nie et al., 2023] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In ICLR, 2023.


[Oreshkin et al., 2020] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020.


[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.


[Salinas et al., 2020] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.


[Tolstikhin et al., 2021] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.


[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.


[Wang et al., 2022] Zhiyuan Wang, Xovee Xu, Weifeng Zhang, Goce Trajcevski, Ting Zhong, and Fan Zhou. Learning latent seasonal-trend representations for time series forecasting. Advances in Neural Information Processing Systems, 35:38775–38787, 2022.


[Woo et al., 2022] Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. In International Conference on Learning Representations, 2022.


[Wu et al., 2021] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting. In Advances in Neural Information Processing Systems, 2021.


[Wu et al., 2022] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2022.


[Yue et al., 2022] Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8980–8987, 2022.


[Zeng et al., 2022] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? arXiv preprint arXiv:2205.13504, 2022.


[Zerveas et al., 2021] George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2021.


[Zhang et al., 2022] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems, 35:3988–4003, 2022.


[Zhou et al., 2021] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.


[Zhou et al., 2022] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning, 2022.


[Zhou et al., 2023] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One Fits All: Power general time series analysis by pretrained lm. In NeurIPS, 2023.

Appendix A

The outline of our appendix section is as follows:


  • Detailed Literature Survey - Section A.1
  • TSMixer Background - Section A.2
  • Details on Pre-training Datasets - Section A.3
  • Details on Evaluation Datasets - Section A.4
  • Baseline Implementation Details - Section A.5
  • TTM Source Code and Implementation Details - Section A.6
  • TTM Computational Benefits: Setup details - Section A.7
  • Full Tables - Section A.8

A.1 Detailed Literature Survey


Multivariate Time Series Forecasting


Statistical approaches for time series forecasting, such as SARIMAX and Exponential Smoothing, generally generate forecasts independently for each time series [Hyndman and Athanasopoulos, 2021]. These methods are essentially univariate and do not build a single model by learning from multiple time series. On the other hand, more advanced models, built upon machine/deep learning techniques, including LightGBM-based models [Makridakis et al., 2022; Jati et al., 2023], N-BEATS [Oreshkin et al., 2020], and DeepAR [Salinas et al., 2020], have the capability to learn from multiple time series. However, these models still follow univariate approaches, thus ignoring any potential cross-channel correlations.


Advanced multivariate forecasting models mostly involve deep neural networks, specifically the transformer [Vaswani et al., 2017] architecture. A series of transformer-based models has been proposed in the last few years, including Informer [Zhou et al., 2021], Autoformer [Wu et al., 2021], and FEDFormer [Zhou et al., 2022]. Although these models outperformed all prior approaches, the DLinear [Zeng et al., 2022] model showed that an embarrassingly simple linear model can beat them by following a few empirically established steps such as time series decomposition, normalization, and channel-independent modeling.
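To make the "embarrassingly simple" point concrete, the sketch below shows a DLinear-style forecaster in PyTorch: a moving-average decomposition followed by per-channel linear maps from the context window to the forecast horizon. This is an illustrative reconstruction, not the official DLinear code; the kernel size and shapes are placeholder values.

```python
# Minimal sketch of a DLinear-style forecaster (illustrative, not the official
# implementation): decompose each channel into trend and remainder with a
# moving average, then apply a linear map from context to horizon per channel.
import torch
import torch.nn as nn


class SimpleDLinear(nn.Module):
    def __init__(self, context_len: int, forecast_len: int, kernel: int = 25):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=kernel, stride=1, padding=kernel // 2)
        self.linear_trend = nn.Linear(context_len, forecast_len)
        self.linear_resid = nn.Linear(context_len, forecast_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, context_len, channels); modeled channel-independently
        x = x.transpose(1, 2)                      # (batch, channels, context_len)
        trend = self.pool(x)                       # smoothed trend per channel
        resid = x - trend
        y = self.linear_trend(trend) + self.linear_resid(resid)
        return y.transpose(1, 2)                   # (batch, forecast_len, channels)


model = SimpleDLinear(context_len=512, forecast_len=96)
out = model(torch.randn(8, 512, 7))                # e.g., a 7-channel ETT-like input
print(out.shape)                                   # torch.Size([8, 96, 7])
```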


PatchTST [Nie et al., 2023] showed that transformers can be effective for forecasting if the input time series is patched, i.e., segmented into multiple windows that are subsequently modeled by a transformer. The patching operation helps preserve local semantic information, accommodates a longer history, and reduces computation time. The PatchTST model outperformed all prior transformer-based models as well as the DLinear model.
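As an illustration of the patching operation (not tied to PatchTST's actual implementation), the snippet below segments a context window of length sl into n = sl/pl non-overlapping patches; the tensor sizes are placeholder values.

```python
import torch

# Illustrative patching: segment a context window into non-overlapping patches.
batch, channels, sl, pl = 8, 7, 512, 64
x = torch.randn(batch, channels, sl)

# unfold along the time axis with stride == patch length -> non-overlapping patches
patches = x.unfold(dimension=-1, size=pl, step=pl)   # (batch, channels, n, pl)
n = patches.shape[2]
print(n, sl // pl)   # 8 8 -> each channel is now a sequence of n patch "tokens"
```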


Although PatchTST reinstated faith in transformers for time series modeling, transformer-based models are generally resource-intensive, with slow execution and a high memory footprint. The recently proposed TSMixer model [Ekambaram et al., 2023] addresses these challenges effectively. TSMixer, built on the MLPMixer architecture [Tolstikhin et al., 2021], stands out for its exceptional speed and lightweight design. It has attained state-of-the-art (SOTA) performance on benchmark datasets, demonstrating a 2-3X reduction in both execution time and memory usage.


Pre-trained Models for Time Series


One major drawback of all the above models is that they need to be trained in-domain. Hence, none of them can be transferred to out-of-domain data with zero or minimal training. Such transfer learning has proven extremely beneficial in the natural language processing (NLP) domain with the advent of BERT [Devlin et al., 2018] and GPT [Radford et al., 2018] models.


However, this is an extremely challenging task in the time series domain because no large, publicly accessible pre-training corpus is available. There are multiple independent time series datasets, but, unlike in NLP, these datasets differ significantly in important characteristics such as the data domain (e.g., retail, sensor data, traffic), the number of channels, temporal resolution, and length. This makes it hard to train a single model on all the datasets together.


Hence, a few prior works have focused on experimenting with same-dataset self-supervised learning for time series [Li et al., 2023; Wang et al., 2022; Woo et al., 2022; Yue et al., 2022]. These methods learn a time series representation from the train split of a dataset, build a forecaster on top of the learned representation on the same data, and then evaluate it on the test split of the same dataset. Although these approaches have demonstrated promising results, they do not provide evidence of the transfer capability of the model between datasets.


Recent works such as SimMTM [Dong et al., 2023] and TF-C [Zhang et al., 2022] have demonstrated the transfer capabilities of their models between pairs of datasets. These pairs are carefully chosen so that the source (the dataset on which the model is pre-trained) and target (the dataset on which the model is fine-tuned and tested) datasets share some matching properties. For instance, SimMTM showcased its few-shot capability by selecting ETTH2 as the source data and ETTH1 as the target data; both ETTH1 and ETTH2 are collected from electricity transformers at two stations and thus represent a similar domain. TF-C demonstrated the transferability of the model across four different (source, target) pairs, such as (ECG, EMG) and (FD-A, FD-B), where domain similarity exists between the source and target datasets.


Pre-trained LLMs for Time Series


To tackle the aforementioned challenges, there has been a notable increase in the adoption of pre-trained large language models (LLMs) for time series tasks, framing forecasting as a cross-domain transfer learning problem. The LLMTime model [Gruver et al., 2023] feeds the time series values as text representations and demonstrates promising performance in a zero-shot setting. The GPT4TS model [Zhou et al., 2023] adopts a pre-trained LLM like GPT and fine-tunes only the input embedding layer, normalization layers, and output layer; specifically, it does not alter the self-attention weights and feed-forward layers. This approach to building a pre-trained model for time series from LLMs is promising, but it does not model the cross-channel correlations observed in many multivariate time series datasets. Moreover, these LLMs are very large and exhibit slow execution and a large memory footprint.
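To illustrate the selective fine-tuning idea (freezing attention and feed-forward weights while training only normalization plus input/output layers), here is a minimal PyTorch sketch. It uses a generic torch.nn.TransformerEncoder as a stand-in, so the layer names differ from the actual GPT4TS backbone, and input_proj/output_proj are hypothetical placeholders for the embedding and output layers.

```python
import torch.nn as nn

# Illustrative selective fine-tuning in the spirit of GPT4TS: freeze the
# attention and feed-forward weights of a pre-trained backbone and train only
# normalization layers plus newly added input/output projections. Layer names
# here come from torch.nn.TransformerEncoder and will differ in a real LLM.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
input_proj = nn.Linear(16, 64)    # hypothetical patch-embedding layer (trainable)
output_proj = nn.Linear(64, 96)   # hypothetical forecast head (trainable)

for name, p in backbone.named_parameters():
    # keep LayerNorm parameters trainable, freeze attention / feed-forward weights
    p.requires_grad = "norm" in name

trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
print(trainable[:4])  # e.g. ['layers.0.norm1.weight', 'layers.0.norm1.bias', ...]
```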

A.2 TSMixer Background

We employed TSMixer [Ekambaram et al., 2023] as a building block for the proposed TTM model due to its state-of-the-art performance, faster execution, and significantly lower memory usage. However, as explained in the main paper, vanilla TSMixer cannot be trained on multiple diverse datasets, which necessitated the incorporation of the proposed novel components. In this section, we provide a high-level overview of the TSMixer model so that readers can gain a quick understanding.


TSMixer is a lightweight alternative to transformer-based time series models, with no compromise on forecast accuracy. TSMixer adopts some well-established pre-processing steps from the literature, such as normalization and patching. Additionally, it offers the flexibility of enabling or disabling channel mixing, which has been found beneficial for multivariate datasets with cross-channel correlations. For the main learning process, TSMixer employs a series of MLPMixer [Tolstikhin et al., 2021] blocks that perform inter-patch, intra-patch, and inter-channel mixing operations. A mixing operation in TSMixer learns correlations across a specific dimension; for example, inter-channel mixing enables it to learn cross-channel correlations. In the experiments, we employed three different flavors of the TSMixer model: vanilla TSMixer, TSMixer with cross-channel mixing enabled (TSMixer-CM), and TSMixer with a cross-channel reconciliation head (TSMixer-CC). We refer the reader to [Ekambaram et al., 2023] for further details about these variants.
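The following sketch illustrates the three mixing directions (intra-patch, inter-patch, and inter-channel) with plain MLPs and residual connections. It is a simplified illustration of the idea, not the TSMixer source code; the tensor sizes are placeholders (hf = 192 matches fs x pl with fs = 3 and pl = 64 from Section A.6).

```python
import torch
import torch.nn as nn

# Rough sketch of the three mixing directions described above (not the official
# TSMixer code): each mixer is a small MLP applied after moving the dimension
# to be mixed into the last position.
# Shapes: (batch b, channels c, patches n, patch features hf).
def mlp(dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

b, c, n, hf = 8, 7, 8, 192
x = torch.randn(b, c, n, hf)

intra_patch = mlp(hf)   # mixes within a patch's feature vector
inter_patch = mlp(n)    # mixes across patches (time)
inter_chan = mlp(c)     # mixes across channels (only when channel mixing is on)

x = x + intra_patch(x)
x = x + inter_patch(x.transpose(2, 3)).transpose(2, 3)
x = x + inter_chan(x.transpose(1, 3)).transpose(1, 3)
print(x.shape)  # torch.Size([8, 7, 8, 192])
```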

A.3 List of Pre-training Datasets

We employ a subset of the datasets available in the Monash forecasting data repository [Godahewa et al., 2021] at https://forecastingdata.org/. Since our primary focus in this study is long-term forecasting with forecast lengths ranging from 96 to 720, yearly, monthly, quarterly, or weekly datasets cannot be used due to their short lengths. Hence, we skip a few datasets that are too short. The final list of all pre-training datasets is shown in Table 13.


Temporal cross-validation [Jati et al., 2023] is used to chronologically split every time series into train and validation parts. During pre-training, a moving-window technique creates (X, Y) pairs of lengths sl and fl, respectively, yielding 244M train and 71M validation samples (i.e., (X, Y) pairs) in total. Note that these pre-training datasets have no overlap with the evaluation datasets. Specifically, the Australian electricity demand and Australian weather datasets used in pre-training are completely different (with respect to location, measured variables, type, resolution, length, etc.) from the standard Electricity (ECL) and Weather datasets used in the evaluation.
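For clarity, the snippet below sketches how such (X, Y) pairs can be generated with a moving window after a chronological split; the toy series, split fraction, and (sl, fl) values are illustrative only.

```python
import numpy as np

# Illustrative moving-window construction of (X, Y) pairs from a single series
# after a chronological train/validation split (values here are made up).
def make_windows(series: np.ndarray, sl: int, fl: int):
    xs, ys = [], []
    for start in range(len(series) - sl - fl + 1):
        xs.append(series[start : start + sl])           # context window X
        ys.append(series[start + sl : start + sl + fl])  # forecast target Y
    return np.stack(xs), np.stack(ys)

series = np.sin(np.linspace(0, 100, 4000))        # toy univariate series
split = int(0.8 * len(series))                    # chronological (temporal) split
X_train, Y_train = make_windows(series[:split], sl=512, fl=96)
X_val, Y_val = make_windows(series[split:], sl=512, fl=96)
print(X_train.shape, X_val.shape)                 # (2593, 512) (193, 512)
```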

A.4 List of Evaluation Datasets

Table 14 illustrates various characteristics of the eleven evaluation datasets. Below, we present the details.


Set D1


For zero/few/full-shot evaluation, we utilize seven multivariate time series datasets that have consistently been employed in the literature. Below, we offer a brief overview of these datasets.


  1. ETT datasets: The four ETT datasets [Zhou et al., 2021] (ETTH1, ETTH2, ETTM1, ETTM2) contain multivariate time series data collected from electrical transformers at two stations. ETTH1 and ETTH2 are collected at an hourly interval, while ETTM1 and ETTM2 are collected every 15 minutes. All four datasets have 7 channels.


  2. Weather: The weather dataset consists of 21 channels, which serve as weather indicators. It is collected at 10-minute intervals at the weather station of the Max Planck Institute for Biogeochemistry.


  3. Electricity (ECL): The Electricity dataset, also known as the ECL dataset, comprises the hourly electricity consumption data of 321 clients.


  4. Traffic: This dataset records the hourly rates of road occupancy on the San Francisco Freeways using 862 sensors.



We used the datasets provided in the repository of the Autoformer paper [Wu et al., 2021] [5]. For all the D1 datasets, we use the same train/validation/test splits as in the literature [Zhou et al., 2021; Wu et al., 2021; Nie et al., 2023; Ekambaram et al., 2023].
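A generic chronological split in the spirit of this protocol is sketched below; the 70/10/20 ratios, the toy array size, and the helper function are placeholders, since the exact boundaries used in the cited works vary per dataset.

```python
import numpy as np

# Generic chronological split (illustrative only; the exact split boundaries
# used in the cited works differ per dataset).
def chronological_split(data: np.ndarray, train_frac=0.7, val_frac=0.1):
    n = len(data)
    n_train, n_val = int(n * train_frac), int(n * val_frac)
    return data[:n_train], data[n_train : n_train + n_val], data[n_train + n_val :]

data = np.random.randn(17420, 7)                  # toy hourly-scale, 7-channel array
train, val, test = chronological_split(data)
print(len(train), len(val), len(test))            # 12194 1742 3484
```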


Set D2


To assess the effectiveness of the proposed TTM model in extracting information from exogenous channels, we conduct evaluations on four additional datasets that are known to contain exogenous or control variables.


1. Bike Sharing (BS): The Bike Sharing dataset [Fanaee-T, 2013] documents the hourly rental counts of bikes from the Capital Bikeshare system in Washington D.C., USA, spanning the years 2011 to 2012. Rental counts are typically associated with environmental and seasonal conditions; consequently, this 14-channel dataset encompasses various weather-related features. Our goal is to forecast all three rental counts: “casual”, “registered”, and “cnt” (total count). As the remaining 11 features are consistently available at all future time points, they are treated as exogenous variables in our experiment (a brief sketch of this target/exogenous role assignment appears after this list).



Table 13: List of pre-training datasets. Datasets with a “downsample X.tsf” suffix denote augmented datasets created from the original dataset by downsampling. Note that these pre-training datasets have no overlap with the evaluation datasets. Specifically, the Australian electricity demand and Australian weather datasets used in pre-training are completely different (with respect to location, measured variables, type, resolution, length, etc.) from the standard Electricity (ECL) and Weather datasets used in the evaluation.



2. Carbon Capture Plant (CC): The Carbon Capture Plant data [Jablonka et al., 2023] records the emission profiles of “2-amino-2-methyl-1-propanol” (AMP) and “piperazine” (Pz), collected at 2-minute intervals. We utilize the 8-channel dataset made available in the official repository of [Jablonka et al., 2023]. Among the remaining 6 channels, the following 5 serve as control variables: [“TI-19”, “FI-19”, “TI-3”, “FI-11”, “TI1213”]. The remaining variable is treated as a conditional variable, as it is neither a target variable nor available during the forecast period to be considered exogenous. For additional details, please refer to the supplementary materials of [Jablonka et al., 2023].


3. Service (SER): This dataset pertains to the cloud-based “Stan’s Robot Shop” application, managed by Instana. It simulates a user’s e-commerce experience, from site access to shipping, using a load generator. Intermittent fault injection introduces diverse IT events. The dataset provides business KPIs for services (e.g., payment, catalog) and IT events tracked by Instana. Sampling occurs every 10 seconds due to the high traffic and event frequency. For our experiments, all business KPIs are treated as target variables, IT events are treated as exogenous variables, and the goal is to forecast the business KPIs given the IT events.


4. Application (APP): This dataset is similar to the SER data, but it captures KPIs for the entire application instead of the service level. As with SER, all business KPIs are treated as target variables, IT events are treated as exogenous variables, and the goal is to forecast the business KPIs given the IT events.
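The snippet below sketches the target versus exogenous role assignment described above, using the Bike Sharing setup as an example. The DataFrame and the exogenous column names are placeholders; only “casual”, “registered”, and “cnt” are actual column names from the UCI release.

```python
import numpy as np
import pandas as pd

# Illustrative role assignment for the Bike Sharing dataset (mirrors the setup
# described above; not the authors' code).
targets = ["casual", "registered", "cnt"]

# toy frame standing in for the real hourly CSV; exog_* names are hypothetical
df = pd.DataFrame(np.random.rand(1000, 14),
                  columns=targets + [f"exog_{i}" for i in range(11)])
exogenous = [c for c in df.columns if c not in targets]

# During forecasting, the exogenous columns are assumed known over the horizon,
# so X contains past targets plus past/future exogenous values, and Y contains
# only the future target values.
print(len(targets), len(exogenous))   # 3 11
```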

A.5 Baseline Implementation Details

We report the implementation details for all the baselines in Table 15.

A.6 TTM Source Code and Implementation Details


Pre-training


Table 14: Details of the evaluation datasets.


For our experiments, we build 5 pre-trained models using the Monash datasets for the following (sl, fl) configurations: (512, 96), (512, 192), (512, 336), (512, 720), and (96, 24). Since our pre-trained models are very small, they can be trained quickly in a few hours (4-8 hours depending on fl), as opposed to several days or weeks with standard approaches. Hence, pre-training multiple TTM models is no longer a practical constraint. Pre-training is performed in a distributed fashion with 50 CPUs and 6 NVIDIA A100 GPUs with 40 GB of GPU memory each. Based on the sl and fl requirements of the target dataset, a suitable pre-trained model is selected to bootstrap the weights for fine-tuning. Other model configurations are as follows: patch length pl = 64 (when sl is 512) or 8 (when sl is 96), stride s = pl (i.e., non-overlapping patches), number of patches n = sl/pl, number of levels in the backbone L = 6, number of TTM blocks per level M = 2, number of decoder layers = 2, batch size b = 3000, number of epochs ep = 20, and dropout do = 0.2. TSMixer-specific hyperparameters include feature scaler fs = 3, hidden feature size hf = fs × pl, and expansion feature size ef = hf × 2. Note that hf and n change across TTM blocks based on the adaptive patching strategy. Resolution prefix tuning is disabled by default and enabled only for shorter context lengths (as explained in Table 10). Decoder channel-mixing and exogenous mixer blocks are disabled during pre-training.
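The derived sizes implied by the hyperparameters above can be recomputed directly, as a quick sanity check (a plain re-computation, not taken from the source code):

```python
# Derived sizes implied by the pre-training hyperparameters listed above.
sl, fl = 512, 96
pl = 64 if sl == 512 else 8        # patch length
s = pl                             # stride -> non-overlapping patches
n = sl // pl                       # number of patches
fs = 3                             # feature scaler
hf = fs * pl                       # hidden feature size per patch
ef = hf * 2                        # expansion feature size
print(n, hf, ef)                   # 8 192 384
```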


Fine-tuning


We have 2 sets of target datasets, D1 and D2, on which we fine-tune and test our performance. All D1 datasets use sl = 512, and fl is varied across {96, 192, 336, 720} for zero/few-shot forecasting. Among the D2 datasets, the BS data uses sl = 512 and fl = 96, while the remaining datasets use sl = 96 and fl = 24. In addition, head dropout is set to 0.7 for the smaller ETT datasets and 0.2 for the other datasets. Likewise, the batch size is set to 8 for Traffic, 32 for Electricity, and 64 for all other datasets. All other parameters remain the same as in pre-training. Decoder channel-mixing and exogenous mixer blocks are enabled for the D2 datasets. Unlike pre-training, fine-tuning is executed on just one A100 GPU, as it is a fast process. All these hyperparameters are selected based on validation performance, and the final test results are reported in the paper.
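These dataset-specific overrides can be summarized as a small lookup, shown below purely for readability; the small_ett flag is an assumption standing in for the "smaller ETT datasets" mentioned above, and the helper itself is hypothetical.

```python
# Dataset-specific fine-tuning overrides as described above (all other
# hyperparameters carry over from pre-training).
def finetune_overrides(dataset: str, small_ett: bool = False) -> dict:
    return {
        "head_dropout": 0.7 if small_ett else 0.2,
        "batch_size": {"Traffic": 8, "Electricity": 32}.get(dataset, 64),
    }

print(finetune_overrides("ETTH1", small_ett=True))  # {'head_dropout': 0.7, 'batch_size': 64}
print(finetune_overrides("Traffic"))                # {'head_dropout': 0.2, 'batch_size': 8}
```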


Source Code


For further implementation details of the TTM model, please refer to the source code of its key classes. We have anonymized these Python class files and shared them in the accompanying technical appendix zip file; the important modules are listed below. The full project, with complete reproducibility, will be open-sourced on GitHub after the double-blind review.


• Class TinyTimeMixerConfig defines the required configuration.

• Class TinyTimeMixerBlock implements the basic TSMixer block.

• Class ForecastChannelHeadMixer implements the Exogenous Mixer Block.

• Class TinyTimeMixerAdaptivePatchingBlock implements the Adaptive Patching strategy.

• Class TinyTimeMixerDecoder implements the TTM Decoder.

• Class TinyTimeMixerForPredictionHead implements the Forecasting Head.

• Class TinyTimeMixerPreTrainedModel implements the pre-training model interfaces.

• Class TinyTimeMixerPatchify implements the required patching.

• Class TinyTimeMixerEncoder implements the TTM Backbone.

• Class TinyTimeMixerModel implements the TTM model wrappers.

A.7 Computational Benefits of TTM over GPT4TS: Setup details

Table 2 compares the computational benefits of TTM over GPT4TS, the most popular LLM-TS model and the current best SOTA in the few-shot setting. This section explains the experimental setup used for this comparison. To execute GPT4TS with its best parameters, we used the official implementation referenced in Table 15. For a fair comparison, we run both models with their best-reported parameters in a single A100 GPU environment; multi-GPU execution is avoided in this experiment to eliminate inter-process communication overheads and enable precise metric measurements. Since GPT4TS processes data in a purely univariate fashion while TTM fine-tuning processes data in a multivariate fashion, we set the batch size so that the number of univariate samples processed per batch is the same for both models. For example, if the TTM batch size is set to 64, it implies processing 64 multivariate time series in a batch; the equivalent batch size for GPT4TS (which processes only univariate samples) is therefore 64 × c, where c is the number of channels in the dataset. Additionally, because GPT4TS encounters out-of-memory (OOM) errors at the high default batch sizes used in TTM, we employed reduced batch sizes for this experiment. The following batch sizes were used for TTM: Weather (64), Electricity (8), and Traffic (2). The corresponding batch sizes for GPT4TS are obtained by multiplying the TTM batch size by the channel count of the respective dataset. Since Traffic and Electricity have a very high number of channels, we had to significantly reduce their batch sizes for a consistent comparison across models.
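For concreteness, the batch-size matching works out as follows, using the channel counts stated in Section A.4:

```python
# Batch-size matching used for the comparison above: GPT4TS consumes univariate
# samples, so its batch size is the TTM (multivariate) batch size times the
# channel count c (channel counts as listed in Section A.4).
ttm_batch = {"Weather": 64, "Electricity": 8, "Traffic": 2}
channels = {"Weather": 21, "Electricity": 321, "Traffic": 862}

gpt4ts_batch = {d: b * channels[d] for d, b in ttm_batch.items()}
print(gpt4ts_batch)   # {'Weather': 1344, 'Electricity': 2568, 'Traffic': 1724}
```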

A.8 Full tables

Here, we present the complete versions of various tables in the main paper. These full versions include the test results for multiple forecast lengths (fl) across all datasets; in the main paper, these results are sometimes averaged across forecast lengths to conserve space.


Full table for 10% few-shot experiment


Table 16 shows the 10% few-shot results for all forecast lengths across all D1 datasets.


Full table for validating effect of pre-training


Table 17 shows the effect of pre-training when compared to random initialization of the model weights across all D1 datasets for all forecast lengths.


Full table for validating adaptive patching


Table 18 provides a comprehensive overview, systematically validating the impact of adaptive patching across all D1 datasets and forecast lengths.


Full table for validating effect of downsampling


Table 19 offers a comprehensive summary, systematically validating the influence of dataset augmentation through downsampling across all D1 datasets and forecast lengths (96 and 192).



Table 15: Implementation details for the baseline algorithms.




Table 16: Zero-shot and few-shot 10% performance (MSE) of TTM and all SOTA models on seven datasets for varying forecast lengths (fl). Bold and underscored numbers denote the best and second-best results, respectively.



Table 17: Effect of pre-training (PT) when compared to random initialization (RI) of model weights. MSE reported.



Table 18: Effect of adaptive patching. MSE reported.




Table 19: Effect of downsampling. MSE reported.


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.


[5] Available at https://github.com/thuml/Autoformer