
Ducho: A Unified Framework for Multimodal Feature Extraction in AI-Powered Recommendations

by YAML, February 16th, 2025

Too Long; Didn't Read

Ducho simplifies multimodal feature extraction for recommender systems with a modular architecture. Future plans include broader backend support, a unified extraction interface, and low-level feature extraction.

Authors:

(1) Daniele Malitesta, Politecnico di Bari, Italy ([email protected]);

(2) Giuseppe Gassi, Politecnico di Bari, Italy ([email protected]);

(3) Claudio Pomo, Politecnico di Bari, Italy ([email protected]);

(4) Tommaso Di Noia, Politecnico di Bari, Italy ([email protected]).

Corresponding authors: Daniele Malitesta ([email protected]) and Giuseppe Gassi ([email protected]).

Abstract and 1 Introduction and Motivation

2 Architecture and 2.1 Dataset

2.2 Extractor

2.3 Runner

3 Extraction Pipeline

4 Ducho as Docker Application

5 Demonstrations and 5.1 Demo 1: visual + textual items features

5.2 Demo 2: audio + textual items features

5.3 Demo 3: textual items/interactions features

6 Conclusion and Future Work, Acknowledgments and References

6 CONCLUSION AND FUTURE WORK

In this paper, we propose Ducho, a framework for extracting high-level features for multimodal-aware recommendation. Our main purpose is to provide a unified, shared tool that supports practitioners and researchers in processing and extracting multimodal features used as side information in recommender systems. Concretely, Ducho comprises three main modules: Dataset, Extractor, and Runner. The multimodal extraction pipeline can be highly customized through a Configuration component that lets users set up the modalities involved (i.e., audio, visual, textual), the sources of multimodal information (i.e., items and/or user-item interactions), and the pre-trained models along with their main extraction parameters. To show how Ducho works in different scenarios and settings, we propose three demos covering the extraction of (i) visual/textual items features, (ii) audio/textual items features, and (iii) textual items/interactions features. They can be run locally, in Docker (as we also dockerize Ducho), and on Google Colab. As future directions, we plan to: (i) support all available backends (i.e., TensorFlow, PyTorch, and Transformers) for feature extraction across all modalities; (ii) implement a general extraction model interface that lets users follow the same naming/indexing scheme for all pre-trained models and their extraction layers; (iii) integrate the extraction of low-level multimodal features.
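As a concrete illustration of the per-modality code that such a configuration-driven pipeline is meant to unify, the sketch below extracts high-level visual and textual item features with two of the backends mentioned above, PyTorch/torchvision and Transformers. This is not Ducho's actual API: the chosen pre-trained models (ResNet50 and all-MiniLM-L6-v2), the extraction layers, and the helper functions are assumptions made purely for the example.

```python
# Illustrative sketch only (not Ducho's API): hand-written visual and textual
# item feature extraction of the kind that Ducho's Dataset/Extractor/Runner
# modules and Configuration component are designed to replace.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import AutoModel, AutoTokenizer

# Visual backend: a pre-trained ResNet50; the pooled 2048-d layer is the feature.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()  # drop the classifier head, keep pooled features
resnet.eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Textual backend: a pre-trained sentence encoder from the Transformers library.
encoder_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model choice
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
text_model = AutoModel.from_pretrained(encoder_name)
text_model.eval()

@torch.no_grad()
def visual_embedding(image_path: str) -> torch.Tensor:
    """Return a 2048-d embedding for one item image."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return resnet(image).squeeze(0)

@torch.no_grad()
def textual_embedding(description: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding for one item description."""
    batch = tokenizer(description, return_tensors="pt", truncation=True)
    hidden = text_model(**batch).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)  # no padding for a single sentence
```

In Ducho, the equivalent choices (modality, source, backend, pre-trained model, and extraction layer) are declared once in the Configuration component rather than hard-coded in per-modality scripts like this one.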

ACKNOWLEDGMENTS

This work was partially supported by the following projects: Secure Safe Apulia, MISE CUP: I14E20000020001, CTEMT - Casa delle Tecnologie Emergenti Comune di Matera, CT_FINCONS_III, OVS Fashion Retail Reloaded, LUTECH DIGITALE 4.0, KOINÈ.


This paper is available on arXiv under the CC BY 4.0 DEED license.