1School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
2Xi’an Lianfeng Acoustic Technologies Co., Ltd., Xi’an, China
3SNTL, Nanyang Technological University, Singapore
4CVSSP, University of Surrey, Guildford, UK
5Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Figure 1: Overview of the proposed automated pipeline for audio caption generation.
Building large-scale audio-language datasets is crucial for training audio-language models, yet challenging because manual annotation is time-consuming and labour-intensive. Although large language models (LLMs) have greatly improved the efficiency of this process, current LLM-based pipelines for generating audio-text data still lack the capability to incorporate detailed audio information. In this paper, we propose a novel pipeline that leverages large audio-language models to automatically generate large-scale, fine-grained audio captions. Using this approach, we create AudioSetCaps, a dataset of 1.9 million audio-caption pairs derived from recordings in AudioSet. We evaluate AudioSetCaps on two downstream tasks: audio-text retrieval and automated audio captioning. Models trained with AudioSetCaps achieve state-of-the-art performance on both tasks, demonstrating the high quality of the generated captions. Notably, our data-labelling pipeline employs open-source APIs and runs on a consumer-grade GPU. To facilitate further advancements in this field, we have made our code, the audio-caption paired data, and the pre-trained downstream-task models publicly available.
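The paper describes the pipeline in detail; as a rough illustration only, the sketch below shows a two-stage flow in which an audio-language model first answers content questions about each clip and a text LLM then fuses the answers into a caption. The function names, prompts, and staging here are illustrative assumptions, not the released implementation.

```python
from typing import List

# Hypothetical stand-ins for the two open-source models in the pipeline.
# The actual models, prompts, and inference APIs used to build AudioSetCaps
# may differ; replace these stubs with real local model calls.
def query_audio_language_model(audio_path: str, question: str) -> str:
    """Ask an audio-language model a question about one audio clip."""
    raise NotImplementedError("plug in a local audio-language model")

def summarise_with_llm(prompt: str) -> str:
    """Fuse extracted audio information into a caption with a text LLM."""
    raise NotImplementedError("plug in a local text LLM")

# Example questions for extracting fine-grained audio content (illustrative).
QUESTIONS: List[str] = [
    "What sound events are present in this recording?",
    "Is there speech? If so, describe the speaker and the emotion.",
    "Is there music? If so, describe its genre and instruments.",
]

def generate_caption(audio_path: str) -> str:
    # Stage 1: audio Q&A -- query the audio-language model for details
    # that text-only LLM pipelines cannot access.
    qa_pairs = [(q, query_audio_language_model(audio_path, q)) for q in QUESTIONS]

    # Stage 2: caption synthesis -- ask a text LLM to merge the Q&A
    # results into a single fine-grained caption.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = ("Summarise the following Q&A about an audio clip "
              "into one fluent, detailed caption:\n" + context)
    return summarise_with_llm(prompt)
```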
We provide audio captions and question-answer (Q&A) data for the following three datasets (a loading sketch follows the table):
Dataset | # Audio captions | # Q&A pairs | Total
---|---|---|---
AudioSetCaps | 1,910,920 | 5,736,072 | 7,646,992
YouTube-8M | 4,023,990 | 12,086,037 | 16,110,027
VGGSound | 182,189 | 592,680 | 774,869
Total | 6,117,099 | 18,414,789 | 24,531,888
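If the released caption data is distributed in tabular form, it can be read with the Python standard library. The file name and column names below are hypothetical placeholders; check the actual release for the real layout and schema.

```python
import csv

# Hedged loading example. "AudioSetCaps.csv" and the column names
# "audio_id" / "caption" are assumptions for illustration only.
with open("AudioSetCaps.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} caption rows loaded")  # ~1.9 M expected for AudioSetCaps
print(rows[0]["audio_id"], "->", rows[0]["caption"])
```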
Performance comparison of audio-text retrieval on the AudioCaps test set. "LA", "AC", "WC", "ACD", and "ASC" denote LAION-Audio-630K, AudioCaps, WavCaps, Auto-ACD, and AudioSetCaps, respectively. "PT" indicates the data used for pre-training and "FT" indicates fine-tuning on the AudioCaps training set. "T2A" and "A2T" refer to text-to-audio and audio-to-text retrieval, respectively. "R@1" and "R@10" denote recall at ranks 1 and 10.
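For reference, R@k counts a retrieval as correct when the ground-truth item appears among the k highest-scoring candidates. Below is a minimal NumPy sketch, simplified to one ground-truth audio per caption (AudioCaps actually pairs each audio with several captions, which the official evaluation accounts for).

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item is ranked in the top k.

    sim[i, j] is the similarity between query i and candidate j; the
    ground-truth candidate for query i is assumed to be item i.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # top-k candidate indices
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy check: a diagonal-dominant similarity matrix gives perfect recall.
rng = np.random.default_rng(0)
sim = np.eye(5) + 0.1 * rng.random((5, 5))
print(recall_at_k(sim, 1), recall_at_k(sim, 10))    # -> 1.0 1.0
```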
Performance of different methods for automated audio captioning on the AudioCaps test set, where "ACT" and "CNext-Trans" refer to the Audio Captioning Transformer and ConvNeXt-Transformer, respectively.
Mean scores of human evaluation across datasets. Raters were asked: "Please listen to the provided audio samples and rate the quality of the text annotation based on its accuracy, completeness, and presence of false information." The scores indicate how well the text annotation reflects the audio content on the following scale: 1 - Bad, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent.
Acknowledgements
The training and evaluation code are based on WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research. This page template was originally made by Phillip Isola and Richard Zhang.