SOOTHER Speech Dataset

In April 2021, 13,000 whisper voice stems were recorded and cut at superbudda in Torino, Italy. Of the 13,000 recorded voice stems, approximately 7,000 made it through a very reductive quality control process and were used to train v.01 of the synthesized whisper.

The dataset consists of passages from the following texts:

Morris, William, et al. Arts and Crafts Essays. 1893.
Griffiths, Arthur. The Chronicles of Newgate, Vol. 2. 1884.
Roosevelt, Franklin D. The Fireside Chats of Franklin Delano Roosevelt. 1933-42.
Harland, Marion. Marion Harland’s Cookery for Beginners. 1893.
Rolt-Wheeler, Francis. The Science - History of the Universe, Vol. 5: Biology. 1910.
Joyce, James. Ulysses, “Penelope” (Molly Bloom’s soliloquy). 1922.
Stein, Gertrude. Tender Buttons. 1914.
Borges, Jorge Luis. “The Garden of Forking Paths”. Trans. Anthony Boucher. 1941.
Shelley, Mary. Frankenstein. 1823.
Tolan, Claire. “CICADA GAMES”. 2021.

The first five texts comprise part of the LJ Speech Dataset, one of the most widely used public domain speech datasets. I selected the final five texts to counterbalance the first five, in particular The Chronicles of Newgate, Vol. 2, as I explain below.

Of the selected texts, all except “The Garden of Forking Paths” and Ulysses are in the worldwide public domain (Ulysses is in the public domain only in Europe). Passages from Frankenstein and The Chronicles of Newgate, Volume II together comprise approximately 80% of the 13,000 total voice stems.

The Chronicles of Newgate, Volume II

Per its title, this text chronicles the history of Newgate Prison in London, largely through the 1800s, its final century. Here, we read of prison squalor and attempts to end it; celebrity executioners and a catalogue of their botched and perfected executions; the often bloody crimes of the executed and their repentance – or lack thereof – before hanging; the perversion and horrors of the crowds that gathered to watch executions; daring escapes and apprehensions; bureaucratic minutiae; and architectural evolutions of the prison.

It is a remarkable book, and a remarkable source for training “artificially intelligent” voices given its bountiful catalogue of human cruelty, suffering, misfortune, and misadventure. I have no idea why, out of all the world’s literature in the public domain, this particular book was selected as the foundational text of a popular speech dataset. I’ve written Keith Ito, the creator of the dataset, to ask, and I will update this documentation if he responds. To me, it seems like complete genius to imbue the synthesis of human speech with this backdrop of suffering. It is also a taunt to those who are meant to read the texts neutrally in recording. And finally, it is a demonstration of the training algorithms' fundamentally machinic perspective: they will process all texts, regardless of contents, in the same way.

But does something of the suffering manifest itself in the trained speech model? The Chronicles of Newgate was the first text I recorded for SOOTHER’s dataset, and I quickly became concerned that my voice, which was supposed to remain neutral, might be inflected with some of the graphic suffering that I was reciting.

What effect does this uncontrolled emotional inflection have on the quality of my synthesised whisper? How might emotional undertones be transferred into the trained voice? In speech synthesis literature, “emotive” synthesis is currently being explored from several different angles. These studies and implementations are completely fascinating. However, none of them quite get at what I’m asking, which is: are subtle emotional artifacts unintentionally transferred from voice stems into a trained speech model? If yes, what physiological effect might these emotional artifacts have on a listener? This is a particularly ripe question for a whispering AI, given the physiological implications of ASMR itself.

LITERARY COUNTERBALANCE

The remaining texts of the LJ Speech Dataset were mostly neutral and nondescript: recipes, biology texts, Franklin Roosevelt’s fireside chats (which I re-read for the first time since high school and found… annoying). And so, to provide a balancing tone for my dataset, I selected texts that I knew would read differently than both The Chronicles of Newgate and these neutral texts – I wanted to add some literary verve to SOOTHER’s dataset.

I picked one of my own recent texts, “CICADA GAMES”; Gertrude Stein’s Tender Buttons because it is one of the first narrations in literature, I think, that speaks as neither human nor animal, but as an object; the final chapter of Ulysses, Molly Bloom’s soliloquy, for its raw emotion; and Borges' “Garden of Forking Paths” for its relevance to the forks of conversation design. And finally, Frankenstein, given that it is (I think?) the first narrative of artificial intelligence created per scientific principles.

Reading Frankenstein, the longest of my selected texts, was a complete pleasure, and this is apparent in my whisper stems. Whereas my voice often feels cramped and rigid in The Chronicles of Newgate samples, in Frankenstein, the tone is far more languorous, even as I tried to read with a standardized pacing and attitude throughout the entire recording process.

VOCAL CORD DAMAGE

After reading through Frankenstein, in my final days at superbudda, I recorded my final four selected texts. I had been recording for nearly a month, and my voice was so strained from the scores of hours of whispering that these final stems carry a different undertone, one of a physical struggle to produce the material. (Whispering, which requires holding the vocal cords apart and taut, is notoriously damaging.) The strain and struggle in the voice in these recordings also have interesting implications for an AI trained on these stems.

My voice remained strained for several months after my time in the studio, and even now, nearly a year later, it is still not fully recovered. It’s very possible (pending medical opinion!) that I permanently altered my vocal range in my quest to create its whispered model.

FUTURE EXPERIMENTS

With this material, I see paths for several different future experiments. In the first, a model might be trained with stems from The Chronicles of Newgate and another with stems from Frankenstein and my other selected texts. The synthesised whispers might be judged by users: how do they strike the unknowing ear? They might be put into conversation with one another; they might be used for different purposes depending on the “branding” of the chatbot (see the white-labeling discussion in concept).
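
The split described for the first experiment could be sketched roughly as follows. This is only an illustration under stated assumptions: it supposes each stem is tracked as a (stem ID, source title) pair, and the record format, stem IDs, group names, and function name are all hypothetical rather than the project's actual tooling. It also assumes that stems from the remaining LJ Speech texts belong to neither training set.

```python
# Hypothetical sketch: partition voice stems by source text so that two
# separate whisper models could be trained. Titles follow the text list above;
# everything else (IDs, record layout, group names) is assumed for illustration.

NEWGATE = "The Chronicles of Newgate, Vol. 2"

# The five texts chosen as a literary counterbalance.
AUTHOR_SELECTED = {
    "Ulysses",
    "Tender Buttons",
    "The Garden of Forking Paths",
    "Frankenstein",
    "CICADA GAMES",
}


def split_stems(stems):
    """Group (stem_id, source_title) pairs into candidate training sets.

    Stems from texts in neither group land in "unassigned" (an assumption;
    the experiment as described does not say what happens to them).
    """
    groups = {"newgate": [], "selected": [], "unassigned": []}
    for stem_id, source in stems:
        if source == NEWGATE:
            groups["newgate"].append(stem_id)
        elif source in AUTHOR_SELECTED:
            groups["selected"].append(stem_id)
        else:
            groups["unassigned"].append(stem_id)
    return groups


# Invented example records, not actual SOOTHER stems.
example = [
    ("stem_0001", "The Chronicles of Newgate, Vol. 2"),
    ("stem_0002", "Frankenstein"),
    ("stem_0003", "Tender Buttons"),
    ("stem_0004", "Arts and Crafts Essays"),
    ("stem_0005", "The Chronicles of Newgate, Vol. 2"),
]

sets = split_stems(example)
for name, ids in sets.items():
    print(name, len(ids))
```

Keeping an explicit "unassigned" group makes it visible how much material sits outside both models, which matters if the two whispers are later compared on roughly equal amounts of training audio.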

Finally, the difference between the two whispers might set me up for a more esoteric set of experiments about the undertones and subtleties of the artificially intelligent whisper. Though the most under-defined, this last pursuit is the most interesting to me, as it seems to provide great intersection with literary theory and poetic theories of the voice. See further discussion about this in the survey of my next steps.