IBM Researchers Create Mini AI Model That Predicts the Future

Too Long; Didn't Read

Researchers have developed a practical, efficient alternative to massive AI models for time series forecasting.

Authors:

(1) Vijay Ekambaram, IBM Research;

(2) Arindam Jati, IBM Research;

(3) Nam H. Nguyen, IBM Research;

(4) Pankaj Dayama, IBM Research;

(5) Chandra Reddy, IBM Research;

(6) Wesley M. Gifford, IBM Research;

(7) Jayant Kalagnanam, IBM Research.

Editor's note: this is part 1 of 5 of a study detailing the development of a tiny, fast AI model that delivers excellent accuracy. Read the rest below.

Abstract

Large pre-trained models for zero/few-shot learning excel in language and vision domains but encounter challenges in multivariate time series (TS) forecasting due to the diverse nature and scarcity of publicly available pre-training data. Consequently, there has been a recent surge in utilizing pre-trained large language models (LLMs) with token adaptations for TS forecasting. These approaches employ cross-domain transfer learning and, surprisingly, yield impressive results. However, these models are typically very slow and large (∼billion parameters) and do not consider cross-channel correlations. To address this, we present Tiny Time Mixers (TTM), a significantly smaller model based on the lightweight TSMixer architecture. TTM marks the first success in developing fast and tiny general pre-trained models (≤1M parameters), exclusively trained on public TS datasets, with effective transfer learning capabilities for forecasting. To tackle the complexity of pre-training on multiple datasets with varied temporal resolutions, we introduce several novel enhancements such as adaptive patching, dataset augmentation via downsampling, and resolution prefix tuning. Moreover, we employ a multi-level modeling strategy to effectively model channel correlations and infuse exogenous signals during fine-tuning, a crucial capability lacking in existing benchmarks. TTM shows significant accuracy gains (12-38%) over popular benchmarks in few/zero-shot forecasting. It also drastically reduces compute needs compared to LLM-TS methods, with a 14X cut in learnable parameters, 106X fewer total parameters, and substantial reductions in fine-tuning time (65X) and inference time (54X). In fact, TTM's zero-shot results often surpass the few-shot results of popular benchmarks, highlighting the efficacy of our approach. Models and source code are available at https://huggingface.co/ibm/TTM.

1 Introduction

Multivariate time series (TS) forecasting entails predicting future values for multiple interrelated time series based on their historical data. This field has advanced significantly, applying statistical and machine learning (ML) methods [Hyndman and Athanasopoulos, 2021] across domains like weather, traffic, retail, and energy. In general, each time series represents a variable or channel[1]. In certain applications, non-forecasting variables, categorized as controllable and uncontrollable external factors, influence the variables to be forecast. We refer to these non-forecasting variables as exogenous variables, and to the variables requiring forecasts as target variables.
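
To make this terminology concrete, here is a minimal, hypothetical example of a multivariate series laid out as channels, with both target and exogenous variables. The column names and values are illustrative assumptions, not data from the paper:

```python
import pandas as pd

# Hypothetical retail dataset: each column is one channel of the multivariate series.
# "sales" and "footfall" are target variables we want to forecast; "price" (a controllable
# external factor) and "temperature" (an uncontrollable one) are exogenous variables that
# influence the targets but are not themselves forecast.
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=8, freq="h"),
        "sales": [120, 135, 150, 160, 158, 170, 165, 180],          # target
        "footfall": [300, 320, 340, 360, 355, 380, 375, 400],       # target
        "price": [9.99, 9.99, 8.99, 8.99, 8.99, 9.49, 9.49, 9.49],  # exogenous (controllable)
        "temperature": [21.0, 21.5, 22.0, 23.1, 23.8, 24.0, 23.5, 22.9],  # exogenous (uncontrollable)
    }
).set_index("timestamp")

target_channels = ["sales", "footfall"]
exogenous_channels = ["price", "temperature"]
```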


Related Work: Recent advances in multivariate forecasting have been marked by the advent of transformer-based [Vaswani et al., 2017] approaches, exemplified by models like PatchTST [Nie et al., 2023], Autoformer [Wu et al., 2021], Informer [Zhou et al., 2021], and FEDFormer [Zhou et al., 2022]. These models have demonstrated notable improvements over traditional statistical and ML methods. Furthermore, architectures based on MLP-Mixer [Tolstikhin et al., 2021], such as TSMixer [Ekambaram et al., 2023], have emerged as efficient transformer alternatives, offering 2-3X reduced compute and memory requirements with no compromise in accuracy compared to their transformer counterparts. However, none of these approaches has demonstrated the ability to create general pre-trained models that transfer their learning to unseen target TS datasets in the way popularly witnessed in NLP and vision tasks. This is very challenging in the TS domain due to the diverse nature of the datasets across applications and the limited public availability of TS data for pre-training. Existing self-supervised pre-training TS approaches using masked modeling and contrastive learning techniques, such as SimMTM [Dong et al., 2023] and TF-C [Zhang et al., 2022], offer transfer learning between two datasets when they are carefully selected based on their properties. However, they fail to provide universal transfer learning capabilities across datasets. Consequently, there has been a recent growing trend of employing pre-trained large language models (LLMs) for TS forecasting, treating it as a cross-domain transfer learning task. These universal cross-transfer approaches, specifically recent works such as LLMTime [Gruver et al., 2023] and GPT4TS [Zhou et al., 2023], yield promising results in few/zero-shot forecasting. These models are bootstrapped from GPT-2/3 or LLaMA-2 with suitable tokenization strategies that adapt them to the time-series domain.


However, these LLM-based TS approaches do not explicitly handle channel correlations or exogenous support in the context of multivariate forecasting. Moreover, these large models, with billions of parameters, demand significant computational resources and runtime. Hence, in this paper, we focus on building pre-trained models from scratch solely using TS data. Unlike language, which has abundant public pre-training data in terabytes, time-series data is relatively scarce, very diverse, and publicly limited. Its scarcity leads to overfitting when pre-training "large" models solely on time-series data. This prompts a question: can smaller models pre-trained purely on limited but diverse public TS datasets give better zero/few-shot forecasting accuracy? Surprisingly, the answer is yes! Toward this, we propose Multi-level Tiny Time Mixers (TTM), a significantly smaller model (≤1M parameters) based on the lightweight TSMixer architecture, exclusively trained on diverse TS corpora for effective zero/few-shot multivariate TS forecasting via transfer learning.


In particular, TTM is pre-trained using multiple public datasets (∼244M samples) from the Monash data repository[2] [Godahewa et al., 2021]. Note that the datasets exhibit considerable diversity in terms of characteristics, such as domains, temporal resolution[3] (spanning from seconds to daily), lengths, and number of channels. Pre-training on such heterogeneous datasets cannot be handled directly by TSMixer or existing state-of-the-art (SOTA) models. Hence, TTM proposes the following enhancements to the TSMixer architecture: (i) Adaptive Patching across layers, considering the varied suitability of patch lengths for different datasets, (ii) Dataset Augmentation via Downsampling to increase coverage and samples across different resolutions, and (iii) Resolution Prefix Tuning to explicitly embed resolution information in the first patch, facilitating resolution-conditioned modeling, particularly beneficial in scenarios with short history lengths. Moreover, our approach leverages multi-level modeling, where TTMs are first pre-trained in a channel-independent way and then seamlessly integrate channel mixing during fine-tuning to model target-data-specific channel correlations and exogenous infusion.
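
As a rough illustration of the two data-centric enhancements (a minimal sketch, not the authors' implementation; all function names, sizes, and resolution IDs below are assumptions made for this example): downsampling augmentation derives coarser-resolution copies of a series to broaden resolution coverage, and resolution prefix tuning prepends a learned resolution embedding as an extra "first patch".

```python
import torch
import torch.nn as nn

def downsample_augment(series: torch.Tensor, factors=(2, 4)):
    """Create lower-resolution views of a (time, channels) series by mean-pooling.

    A 1-minute series downsampled by 4 behaves like a 4-minute series, so one
    dataset yields extra pre-training samples at coarser resolutions.
    """
    views = [series]
    for f in factors:
        t = (series.shape[0] // f) * f                     # drop a ragged tail, if any
        pooled = series[:t].reshape(-1, f, series.shape[1]).mean(dim=1)
        views.append(pooled)
    return views

class ResolutionPrefix(nn.Module):
    """Learned per-resolution embedding prepended as an extra 'first patch'."""

    def __init__(self, num_resolutions: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(num_resolutions, d_model)

    def forward(self, patch_embeddings: torch.Tensor, resolution_id: torch.Tensor):
        # patch_embeddings: (batch, num_patches, d_model)
        prefix = self.embed(resolution_id).unsqueeze(1)        # (batch, 1, d_model)
        return torch.cat([prefix, patch_embeddings], dim=1)    # (batch, num_patches + 1, d_model)

# Toy usage: an hourly series with 3 channels and a stand-in for patched, embedded input.
hourly = torch.randn(96, 3)
views = downsample_augment(hourly)                 # hourly, 2-hourly, and 4-hourly views
prefix = ResolutionPrefix(num_resolutions=10, d_model=64)
patches = torch.randn(1, 12, 64)
with_prefix = prefix(patches, torch.tensor([3]))   # id 3 = "hourly" in this illustrative mapping
```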


Below, we outline the paper’s key contributions:


• Amidst the prevalence of large pre-trained models demanding significant compute and training time (in weeks), our work is the first to showcase the efficacy of building fast and tiny pre-trained models (≤1M parameters), exclusively trained on public TS datasets, in just a few hours (4-8 hours, 6 A100 GPUs). TTM successfully demonstrates transfer learning to diverse, unseen target datasets for zero/few-shot forecasting, addressing the data scarcity issues prevalent in time series.


• Pre-training on heterogeneous multi-resolution datasets cannot be handled effectively by TSMixer or other SOTA models. Hence, we propose various architectural and training enhancements, such as adaptive patching, data augmentation via downsampling, and (optional) resolution prefix tuning, for robust pre-training.


• TTM employs a multi-level modeling strategy to explicitly model channel correlations and incorporate exogenous signals, a crucial capability lacking in LLM-based TS approaches.


• With extensive evaluation on 11 datasets, TTM shows significant accuracy gains over popular benchmarks (12-38% in few/zero-shot forecasting). It also drastically reduces compute needs compared to LLM-TS methods, with a 14X cut in learnable parameters, 106X fewer total parameters, and substantial reductions in fine-tuning time (65X), inference time (54X), and memory usage (27X).

• The zero-shot results of TTM often surpass the few-shot results of many SOTA approaches, highlighting the effectiveness of our approach (see the evaluation-protocol sketch below).
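
For readers unfamiliar with the zero/few-shot protocol referenced in these contributions, the sketch below shows one common way the target training data is restricted for few-shot fine-tuning. The 5% fraction and the "take the most recent slice" convention are assumptions for illustration, not a prescription from this paper:

```python
import torch

def few_shot_train_slice(train_series: torch.Tensor, fraction: float = 0.05):
    """Return the most recent `fraction` of the target training split.

    Zero-shot: the pre-trained model is evaluated on the target test split with
    no fine-tuning at all. Few-shot: only this small slice (e.g., 5%) of the
    target training data is used to fine-tune before evaluation.
    """
    n = train_series.shape[0]
    k = max(1, int(n * fraction))
    return train_series[-k:]

# Example: keep the last 5% of a 10,000-step, 7-channel synthetic training series.
train = torch.randn(10_000, 7)                 # (time, channels), placeholder data
few_shot_data = few_shot_train_slice(train)    # shape (500, 7)
```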


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.


[1] “Channel” refers to the individual time series in multivariate data (i.e., a multivariate TS is a multi-channel signal).


[2] Accessible at https://forecastingdata.org/


[3] Resolution refers to the sampling rate of the input time series (e.g., hourly, every 10 minutes, every 15 minutes).