VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube 7,000 + speakers VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages. Utterance Lengths 1 million + utterances All speaking face-tracks are captured "in the wild", with background chatter, laughter, overl