Get the duration of audio input in torchaudio

Tuesday 09 January 2024 08:21 PM (Dhaka)

When working with torchaudio for audio-oriented neural network implementations, one of the very first things we do is to get the metadata of the audio input using torchaudio.info.

For example, here's how to find the metadata of an audio input:

import torchaudio

metadata = torchaudio.info(audio_path_or_url_or_file_like_obj)

Let's check an example output:

print(metadata)

AudioMetaData(sample_rate=44100, num_frames=12700800, num_channels=2, bits_per_sample=0, encoding=MP3)

So this is an MP3-encoded (lossy compression) audio with 2 channels (stereo), 12700800 frames/samples per channel, and the sample rate is 44100 Hz/44.1 kHz (CD quality).

Note: MP3 is a lossy compressed format, so unlike PCM-encoded formats (e.g. WAV) it has no fixed bit depth. That's why `torchaudio` reports bits_per_sample=0 here, but this detail isn't needed to get the duration.

To get the duration of this audio input (in seconds), we can divide the number of frames/samples (for a channel) by the sample rate i.e.:

import math

math.ceil(metadata.num_frames / metadata.sample_rate)

Returns:

288
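Since math.ceil rounds up to a whole second, it can be handy to keep the exact fractional duration around as well. Here's a minimal sketch of that; the helper name duration_seconds is my own (hypothetical, not part of torchaudio), and it takes the two plain numbers from the metadata so it can be tried without loading any audio:

```python
def duration_seconds(num_frames: int, sample_rate: int) -> float:
    """Exact duration in seconds: frames per channel divided by sample rate."""
    if sample_rate <= 0:
        # Guard against a zero or negative rate, which would make the division meaningless.
        raise ValueError("sample_rate must be positive")
    return num_frames / sample_rate


# Values taken from the example metadata above.
print(duration_seconds(12700800, 44100))  # 288.0

# One and a half seconds' worth of frames at 44.1 kHz.
print(duration_seconds(44100 + 22050, 44100))  # 1.5
```

With real metadata this would be called as duration_seconds(metadata.num_frames, metadata.sample_rate).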

To get a minutes/seconds answer, we can get the quotient and remainder by dividing the duration in seconds by 60; perfect use case for divmod:

divmod(math.ceil(metadata.num_frames / metadata.sample_rate), 60)

Returns:

(4, 48)

So, the audio input has a duration of 4 minutes and 48 seconds.
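If you need this in more than one place, the two steps can be put together in a small helper. This is just a sketch under my own assumptions: format_duration is a hypothetical name (not a torchaudio function), and the extra divmod for hours is something I added for longer inputs:

```python
import math


def format_duration(num_frames: int, sample_rate: int) -> str:
    """Return the duration as M:SS, or H:MM:SS when it is an hour or longer."""
    # Round up to whole seconds, as in the calculation above.
    total_seconds = math.ceil(num_frames / sample_rate)
    # First divmod splits seconds into (minutes, seconds).
    minutes, seconds = divmod(total_seconds, 60)
    # Second divmod splits minutes into (hours, minutes).
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


# Values taken from the example metadata above.
print(format_duration(12700800, 44100))  # 4:48
```

With real metadata this would be called as format_duration(metadata.num_frames, metadata.sample_rate).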

Readul Hasan Chayan [Heemayl]
