AudioSetCaps: An Enriched Audio-Caption Dataset
using Automated Generation Pipeline
with Large Audio and Language Models


Jisheng Bai 1,2,3,*
Haohe Liu 4
Mou Wang 5
Dongyuan Shi 1
Wenwu Wang 4
Mark D. Plumbley 4
Woon-Seng Gan 3
Jianfeng Chen 1

1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China

2 Xi'an Lianfeng Acoustic Technologies Co., Ltd., Xi'an, China

3 SNTL, Nanyang Technological University, Singapore

4 CVSSP, University of Surrey, Guildford, UK

5 Institute of Acoustics, Chinese Academy of Sciences, Beijing, China


[arXiv]
[NeurIPS 2024 Workshop]
[GitHub]
[Dataset]

AudioSetCaps Pipeline

Figure 1: Overview of the proposed automated pipeline for audio caption generation.


Abstract

Building large-scale audio-language datasets is crucial yet challenging for training audio-language models, primarily because of the time-consuming and labour-intensive nature of data annotation. Although large language models (LLMs) have greatly enhanced the efficiency of this process, current LLM-based pipelines for generating audio-text data still lack the capability to incorporate detailed audio information. In this paper, we propose a novel pipeline leveraging large audio-language models to automatically generate large-scale, fine-grained audio captions. Based on this approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs derived from recordings in AudioSet. We evaluate AudioSetCaps on two downstream tasks: audio-text retrieval and automated audio captioning. Models trained with AudioSetCaps achieve state-of-the-art performance on both tasks, demonstrating the high quality of the generated captions. Notably, our proposed data-labelling pipeline employs open-source APIs and can run on a consumer-grade GPU. To facilitate further advancements in this field, we have made our code, audio-caption paired data, and pre-trained models for the downstream tasks publicly available.
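As the title and Figure 1 suggest, the pipeline combines a large audio-language model (queried for fine-grained audio content) with a text LLM (which rewrites the extracted content into a fluent caption). The sketch below illustrates one plausible way to wire such a two-stage generator; the `audio_lm` / `text_lm` callables, prompts, and helper names are hypothetical placeholders for illustration, not the exact AudioSetCaps implementation.

```python
"""Minimal sketch of a two-stage caption-generation pipeline.

`audio_lm` and `text_lm` stand in for any chat-style audio-language model
and text LLM; their signatures and the prompts are assumptions for this sketch.
"""
from pathlib import Path
from typing import Callable, Iterable

# Example content-extraction questions one might ask the audio-language model.
CONTENT_PROMPTS = [
    "Describe every sound event you hear in this clip.",
    "Is there speech? If so, describe the speaker and emotion.",
    "Is there music? If so, describe the instruments and mood.",
]
REFINE_PROMPT = (
    "Rewrite the following audio analysis into one concise, fluent caption "
    "describing only what can be heard:\n{analysis}"
)


def caption_clip(
    audio_path: Path,
    audio_lm: Callable[[Path, str], str],  # (audio, question) -> answer
    text_lm: Callable[[str], str],         # prompt -> completion
) -> str:
    """Generate a single caption for one audio clip."""
    # Stage 1: query the audio-language model for fine-grained audio content.
    answers = [audio_lm(audio_path, q) for q in CONTENT_PROMPTS]
    analysis = "\n".join(answers)
    # Stage 2: let a text LLM condense the answers into a fluent caption.
    return text_lm(REFINE_PROMPT.format(analysis=analysis)).strip()


def caption_dataset(
    audio_paths: Iterable[Path],
    audio_lm: Callable[[Path, str], str],
    text_lm: Callable[[str], str],
) -> dict[str, str]:
    """Caption a collection of clips, keyed by file name."""
    return {p.name: caption_clip(p, audio_lm, text_lm) for p in audio_paths}
```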


We provide the audio caption and Q&A data for the following three datasets:

Dataset      | # Audio captions | # Q&A pairs | Total
AudioSetCaps | 1,910,920        | 5,736,072   | 7,646,992
YouTube-8M   | 4,023,990        | 12,086,037  | 16,110,027
VGGSound     | 182,189          | 592,680     | 774,869
Total        | 6,117,099        | 18,414,789  | 24,531,888
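If the released caption and Q&A data are distributed as CSV files, they can be loaded and joined along the following lines; the file names and column names below are assumptions for illustration, so adjust them to match the actual release format.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the released data format.
captions = pd.read_csv("audiosetcaps_captions.csv")  # columns: youtube_id, caption
qa_pairs = pd.read_csv("audiosetcaps_qa.csv")        # columns: youtube_id, question, answer

print(f"{len(captions):,} captions, {len(qa_pairs):,} Q&A pairs")

# Attach the Q&A pairs to their corresponding captions by clip identifier.
merged = qa_pairs.merge(captions, on="youtube_id", how="left")
print(merged.head())
```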


Audio-Text Retrieval

Performance comparison of audio-text retrieval on the AudioCaps test set. "LA", "AC", "WC", "ACD", and "ASC" denote LAION-Audio-630K, AudioCaps, WavCaps, Auto-ACD, and AudioSetCaps, respectively. "PT" denotes pre-training on the respective dataset and "FT" denotes fine-tuning on the AudioCaps training set. "T2A" and "A2T" refer to text-to-audio and audio-to-text retrieval, respectively. "R@1" and "R@10" denote recall at ranks 1 and 10.

Audio-Text Retrieval Results
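For reference, the recall metrics can be computed from an audio-text similarity matrix as in the generic sketch below. It assumes the i-th caption is the ground-truth match for the i-th audio clip (one relevant item per query); it is not the exact evaluation script behind the table above.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Recall@k for text-to-audio retrieval.

    `similarity[i, j]` is the score between the i-th text query and the
    j-th audio clip; the i-th query's ground-truth clip is assumed to be
    the i-th clip.
    """
    # Rank audio clips for each text query by descending similarity.
    ranked = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (ranked == targets).any(axis=1)
    return float(hits.mean())

# Example: random scores for 5 text queries against 5 audio clips.
sim = np.random.rand(5, 5)
print("T2A R@1 :", recall_at_k(sim, 1))
print("A2T R@10:", recall_at_k(sim.T, 10))  # transpose for audio-to-text
```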


Automated Audio Captioning

The performance of different methods for automated audio captioning on the AudioCaps test set, where "ACT" and "CNext-Trans" refer to the Audio Captioning Transformer and the ConvNeXt-Transformer, respectively.

Automated Audio Captioning Results
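Both architectures follow the common audio-captioning pattern of an audio encoder feeding a Transformer text decoder that predicts caption tokens. The sketch below shows that generic pattern in PyTorch; the dimensions, vocabulary size, and the toy linear "encoder" are placeholder assumptions, not the ACT or CNext-Trans implementations.

```python
import torch
import torch.nn as nn

class ToyAudioCaptioner(nn.Module):
    """Generic encoder-decoder captioner: audio features -> next-token logits."""

    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Placeholder "encoder": project log-mel frames into the model dimension.
        self.encoder = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU())
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) of previous word IDs.
        memory = self.encoder(mel)                       # (batch, time, d_model)
        tgt = self.token_emb(tokens)                     # (batch, seq, d_model)
        seq = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                         # (batch, seq, vocab_size)

model = ToyAudioCaptioner()
logits = model(torch.randn(2, 100, 64), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```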


Subjective Evaluation

Mean scores from human evaluation across datasets. Raters were asked: "Please listen to the provided audio samples and rate the quality of the text annotation based on its accuracy, completeness, and presence of false information." The scores indicate how well the text annotation reflects the audio content, on the following scale: 1-Bad, 2-Poor, 3-Fair, 4-Good, 5-Excellent.

Human Evaluation Scores
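As a small illustration of how such ratings are aggregated into per-dataset mean scores, the snippet below averages a hypothetical ratings table; the datasets and values are placeholders, not the actual ratings behind the figure.

```python
import pandas as pd

# Hypothetical ratings: one row per (dataset, clip) judgement on the 1-5 scale.
ratings = pd.DataFrame({
    "dataset": ["AudioSetCaps", "AudioSetCaps", "AudioCaps", "AudioCaps"],
    "score":   [4, 5, 4, 3],  # 1-Bad ... 5-Excellent
})

# Mean opinion score per dataset.
print(ratings.groupby("dataset")["score"].mean())
```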



Acknowledgements

The training and evaluation code are based on WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

This template was originally made by Phillip Isola and Richard Zhang.