AudioSetCaps: An Enriched Audio-Caption Dataset
using Automated Generation Pipeline
with Large Audio and Language Models


Jisheng Bai 1,2,3,*
Haohe Liu 4
Mou Wang 5
Dongyuan Shi 1
Wenwu Wang 4
Mark D. Plumbley 4
Woon-Seng Gan 3
Jianfeng Chen 1

1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China

2 Xi'an Lianfeng Acoustic Technologies Co., Ltd., Xi'an, China

3 SNTL, Nanyang Technological University, Singapore

4 CVSSP, University of Surrey, Guildford, UK

5 Institute of Acoustics, Chinese Academy of Sciences, Beijing, China


[arXiv]
[NeurIPS 2024 Workshop]
[GitHub]
[Dataset]

AudioSetCaps Pipeline

Figure 1: Overview of the proposed automated pipeline for audio caption generation.


Abstract

Building large-scale audio-language datasets is crucial yet challenging for training audio-language models, primarily because of the time-consuming and labour-intensive nature of data annotation. Although large language models (LLMs) have greatly enhanced the efficiency of this process, current LLM-based pipelines for generating audio-text data still lack the capability to incorporate detailed audio information. In this paper, we propose a novel pipeline leveraging large audio-language models to automatically generate large-scale, fine-grained audio captions. Based on this approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs derived from recordings in AudioSet. We evaluate AudioSetCaps on two downstream tasks: audio-text retrieval and automated audio captioning. Models trained with AudioSetCaps achieve state-of-the-art performance on both tasks, demonstrating the high quality of the generated captions. Notably, our proposed data-labelling pipeline employs open-source APIs and can run on a consumer-grade GPU. To facilitate further advancements in this field, we have made our code, audio-caption paired data, and pre-trained models for the downstream tasks publicly available.
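As the title and Figure 1 suggest, the pipeline combines a large audio-language model (queried for fine-grained audio content) with a text LLM (which rewrites the extracted content into a fluent caption). The sketch below illustrates one plausible way to wire such a two-stage generator; the `audio_lm` / `text_lm` callables, prompts, and helper names are hypothetical placeholders for illustration, not the exact AudioSetCaps implementation.

```python
"""Minimal sketch of a two-stage caption-generation pipeline.

`audio_lm` and `text_lm` stand in for any chat-style audio-language model
and text LLM; their signatures and the prompts are assumptions for this sketch.
"""
from pathlib import Path
from typing import Callable, Iterable

# Example content-extraction questions one might ask the audio-language model.
CONTENT_PROMPTS = [
    "Describe every sound event you hear in this clip.",
    "Is there speech? If so, describe the speaker and emotion.",
    "Is there music? If so, describe the instruments and mood.",
]
REFINE_PROMPT = (
    "Rewrite the following audio analysis into one concise, fluent caption "
    "describing only what can be heard:\n{analysis}"
)


def caption_clip(
    audio_path: Path,
    audio_lm: Callable[[Path, str], str],  # (audio, question) -> answer
    text_lm: Callable[[str], str],         # prompt -> completion
) -> str:
    """Generate a single caption for one audio clip."""
    # Stage 1: query the audio-language model for fine-grained audio content.
    answers = [audio_lm(audio_path, q) for q in CONTENT_PROMPTS]
    analysis = "\n".join(answers)
    # Stage 2: let a text LLM condense the answers into a fluent caption.
    return text_lm(REFINE_PROMPT.format(analysis=analysis)).strip()


def caption_dataset(
    audio_paths: Iterable[Path],
    audio_lm: Callable[[Path, str], str],
    text_lm: Callable[[str], str],
) -> dict[str, str]:
    """Caption a collection of clips, keyed by file name."""
    return {p.name: caption_clip(p, audio_lm, text_lm) for p in audio_paths}
```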


We provide the audio caption and Q&A data for the following three datasets:

Dataset      | # Audio captions | # Q&A pairs | Total
AudioSetCaps | 1,910,920        | 5,736,072   | 7,646,992
YouTube-8M   | 4,023,990        | 12,086,037  | 16,110,027
VGGSound     | 182,189          | 592,680     | 774,869
Total        | 6,117,099        | 18,414,789  | 24,531,888
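If the released caption and Q&A data are distributed as CSV files, they can be loaded and joined along the following lines; the file names and column names below are assumptions for illustration, so adjust them to match the actual release format.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the released data format.
captions = pd.read_csv("audiosetcaps_captions.csv")  # columns: youtube_id, caption
qa_pairs = pd.read_csv("audiosetcaps_qa.csv")        # columns: youtube_id, question, answer

print(f"{len(captions):,} captions, {len(qa_pairs):,} Q&A pairs")

# Attach the Q&A pairs to their corresponding captions by clip identifier.
merged = qa_pairs.merge(captions, on="youtube_id", how="left")
print(merged.head())
```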


Audio-Text Retrieval

Performance comparison of audio-text retrieval on the AudioCaps test set. "LA", "AC", "WC", "ACD", and "ASC" denote LAION-Audio-630K, AudioCaps, WavCaps, Auto-ACD, and AudioSetCaps, respectively. "PT" denotes pre-training on the respective dataset and "FT" denotes fine-tuning on the AudioCaps training set. "T2A" and "A2T" refer to text-to-audio and audio-to-text retrieval, respectively. "R@1" and "R@10" denote recall at ranks 1 and 10.

Audio-Text Retrieval Results
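For reference, the recall metrics can be computed from an audio-text similarity matrix as in the generic sketch below. It assumes the i-th caption is the ground-truth match for the i-th audio clip (one relevant item per query); it is not the exact evaluation script behind the table above.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Recall@k for text-to-audio retrieval.

    `similarity[i, j]` is the score between the i-th text query and the
    j-th audio clip; the i-th query's ground-truth clip is assumed to be
    the i-th clip.
    """
    # Rank audio clips for each text query by descending similarity.
    ranked = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (ranked == targets).any(axis=1)
    return float(hits.mean())

# Example: random scores for 5 text queries against 5 audio clips.
sim = np.random.rand(5, 5)
print("T2A R@1 :", recall_at_k(sim, 1))
print("A2T R@10:", recall_at_k(sim.T, 10))  # transpose for audio-to-text
```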


Automated Audio Captioning

The performance of different methods for automated audio captioning on the AudioCaps test set, where "ACT" and "CNext-Trans" refer to the Audio Captioning Transformer and the ConvNeXt-Transformer, respectively.

Automated Audio Captioning Results
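Both architectures follow the common audio-captioning pattern of an audio encoder feeding a Transformer text decoder that predicts caption tokens. The sketch below shows that generic pattern in PyTorch; the dimensions, vocabulary size, and the toy linear "encoder" are placeholder assumptions, not the ACT or CNext-Trans implementations.

```python
import torch
import torch.nn as nn

class ToyAudioCaptioner(nn.Module):
    """Generic encoder-decoder captioner: audio features -> next-token logits."""

    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Placeholder "encoder": project log-mel frames into the model dimension.
        self.encoder = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU())
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) of previous word IDs.
        memory = self.encoder(mel)                       # (batch, time, d_model)
        tgt = self.token_emb(tokens)                     # (batch, seq, d_model)
        seq = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                         # (batch, seq, vocab_size)

model = ToyAudioCaptioner()
logits = model(torch.randn(2, 100, 64), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```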


Subjective Evaluation

Mean scores from human evaluation across datasets. Raters were asked: "Please listen to the provided audio samples and rate the quality of the text annotation based on its accuracy, completeness, and presence of false information." The scores indicate how well the text annotation reflects the audio content, on the following scale: 1-Bad, 2-Poor, 3-Fair, 4-Good, 5-Excellent.

Human Evaluation Scores
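As a small illustration of how such ratings are aggregated into per-dataset mean scores, the snippet below averages a hypothetical ratings table; the datasets and values are placeholders, not the actual ratings behind the figure.

```python
import pandas as pd

# Hypothetical ratings: one row per (dataset, clip) judgement on the 1-5 scale.
ratings = pd.DataFrame({
    "dataset": ["AudioSetCaps", "AudioSetCaps", "AudioCaps", "AudioCaps"],
    "score":   [4, 5, 4, 3],  # 1-Bad ... 5-Excellent
})

# Mean opinion score per dataset.
print(ratings.groupby("dataset")["score"].mean())
```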



Acknowledgements

The training and evaluation code are based on WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

This template was originally made by Phillip Isola and Richard Zhang.