15.ai was a freeware artificial intelligence web application that generated text-to-speech voices of fictional characters from various media sources.[5][6][7][8] Created by a pseudonymous developer under the alias 15,[1][2][3][4][9] the project used a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate emotive character voices faster than real-time.[a][10][11]
Type of site | Artificial intelligence, speech synthesis, machine learning, deep learning |
---|---|
Available in | English |
Founder(s) | 15[1][2][3] |
URL | 15 |
Commercial | No[4] |
Registration | None[4] |
Launched | Initial release: March 12, 2020; last stable release: v24.2.1 |
In early 2020, 15.ai appeared online as a proof of concept of the democratization of voice acting and dubbing.[9][12] Its gratis nature, ease of use without user accounts, and improvements over existing text-to-speech implementations made it popular.[6][5][7] Some critics and voice actors questioned the legality and ethicality of making such technology so readily accessible.[13]
The site was credited as the impetus behind the popularization of AI voice cloning (also known as audio deepfakes) in content creation.[2][9] It was embraced by Internet fandoms such as My Little Pony, Team Fortress 2, and SpongeBob SquarePants.[1][14][9]
Several commercial alternatives appeared in the following years.[2][3][4] In September 2022, a year after its last stable release, 15.ai was taken offline.[3][2] As of November 2024, the website was still offline, with the creator's most recent post being dated February 2023.[15]
Features
The platform required no user registration or account creation to generate voices.[2][4][9] Users could generate speech by entering text and selecting a character voice (optionally specifying an emotional contextualizer and/or phonetic transcriptions), with the system producing three variations of the audio with different emotional deliveries.[10] The platform operated completely free of charge, though the developer reported spending thousands of dollars monthly to maintain the service.[9]
Available characters included GLaDOS and Wheatley from Portal, characters from Team Fortress 2, Twilight Sparkle and other characters from My Little Pony: Friendship Is Magic, SpongeBob, Daria Morgendorffer and Jane Lane from Daria, the Tenth Doctor from Doctor Who, HAL 9000 from 2001: A Space Odyssey, the Narrator from The Stanley Parable, Carl Brutananadilewski from Aqua Teen Hunger Force, Steven Universe, Dan from Dan Vs., and Sans from Undertale.[14][1][16][17]
The nondeterministic nature of the deep learning model ensured that each generation would have slightly different intonations, similar to multiple takes from a voice actor.[10][1] The application supported manually altering the emotion of a generated line using emotional contextualizers (a term coined by this project): sentences or phrases that convey the emotion of the take and serve as a guide for the model during inference.[1][14] Emotional contextualizers were representations of the emotional content of a sentence deduced via transfer-learned emoji embeddings using DeepMoji, a deep neural network sentiment analysis algorithm developed by the MIT Media Lab in 2017.[18][19] DeepMoji was trained on 1.2 billion emoji occurrences in Twitter data from 2013 to 2017, and outperformed human subjects in correctly identifying sarcasm in Tweets and other online modes of communication.[20][21][22]
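The data flow described above can be illustrated with a rough sketch. It is not the project's code: a toy keyword scorer stands in for DeepMoji, and the class labels, cue words, and function names are all hypothetical. It shows only that the contextualizer sentence is reduced to a fixed-length vector of scores, and that this vector, rather than the sentence itself, accompanies the line to be spoken.

```python
import math
import re

# Toy stand-in for DeepMoji: a few classes with hand-picked keyword cues.
# The real model predicts emoji from text; these labels and cues are hypothetical.
EMOJI_CUES = {
    "joy": ("happy", "glad", "wonderful"),
    "sadness": ("sad", "sorry", "miss"),
    "anger": ("angry", "furious", "hate"),
}


def contextualizer_embedding(contextualizer: str) -> dict[str, float]:
    """Map a contextualizer sentence to one probability per class (softmax over cue counts)."""
    words = re.findall(r"[a-z']+", contextualizer.lower())
    counts = {
        label: sum(words.count(cue) for cue in cues)
        for label, cues in EMOJI_CUES.items()
    }
    total = sum(math.exp(c) for c in counts.values())
    return {label: math.exp(c) / total for label, c in counts.items()}


def synthesis_request(text: str, contextualizer: str) -> dict:
    """Bundle the line to speak with the emotion vector that would guide inference."""
    return {"text": text, "emotion": contextualizer_embedding(contextualizer)}


request = synthesis_request("The cake is a lie.", "I am furious and I hate this!")
# Print the class with the highest score for this contextualizer.
print(max(request["emotion"], key=request["emotion"].get))
```

In this sketch the spoken text and the contextualizer are kept separate, which mirrors the description of the contextualizer as a guide for the take rather than part of the line.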
15.ai used a multi-speaker model—hundreds of voices were trained concurrently rather than sequentially, decreasing the required training time and enabling the model to learn and generalize shared emotional context, even for voices with no exposure to that context.[23][2] Consequently, the characters in the application were powered by a single trained model, as opposed to multiple single-speaker models.[24] The lexicon used by 15.ai was scraped from a variety of Internet sources, including Oxford Dictionaries, Wiktionary, the CMU Pronouncing Dictionary, 4chan, Reddit, and Twitter. Pronunciations of unfamiliar words were automatically deduced using phonological rules learned by the deep learning model.[1]
The application supported a simplified phonetic transcription known as ARPABET, to correct mispronunciations and account for heteronyms—words that are spelled the same but are pronounced differently (such as the word read, which can be pronounced as either /ˈrɛd/ or /ˈriːd/ depending on its tense). It followed the CMU Pronouncing Dictionary's ARPABET conventions.[1]
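A minimal sketch of how a pronunciation lookup with manual ARPABET overrides could work is given here for illustration. It is not 15.ai's code: the curly-brace override syntax, the four-word lexicon, and the function name are assumptions, and unknown words are merely flagged rather than deduced by a learned model. Only the phoneme symbols follow the CMU Pronouncing Dictionary conventions.

```python
import re

# Tiny illustrative lexicon in CMU Pronouncing Dictionary style
# (ARPABET phonemes with stress digits on vowels).
LEXICON = {
    "i": "AY1",
    "the": "DH AH0",
    "book": "B UH1 K",
    "read": "R IY1 D",  # default entry: present tense
}

# Assumed syntax: a pronunciation written inline inside curly braces, e.g. {R EH1 D}.
TOKEN = re.compile(r"\{([^}]+)\}|([A-Za-z']+)")


def to_phonemes(text: str) -> list[str]:
    """Return one ARPABET string per word; unknown words are flagged, not guessed."""
    result = []
    for override, word in TOKEN.findall(text):
        if override:
            # A user-specified pronunciation takes priority over the lexicon.
            result.append(" ".join(override.split()).upper())
        else:
            key = word.lower()
            result.append(LEXICON.get(key, f"<unk:{key}>"))
    return result


print(to_phonemes("I read the book"))       # lexicon default for "read"
print(to_phonemes("I {R EH1 D} the book"))  # past tense forced by the override
```

The second call shows the heteronym case: the spelling alone cannot distinguish the tenses, so the override supplies the intended phonemes directly.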
Background
Speech synthesis
In 2016, with the proposal of DeepMind's WaveNet, deep-learning-based models for speech synthesis began to gain popularity as a method of modeling waveforms and generating high-fidelity human-like speech.[26][27][25] Tacotron2, a neural network architecture for speech synthesis developed by Google AI, was published in 2018 and required tens of hours of audio data to produce intelligible speech; when trained on 2 hours of speech, the model was able to produce intelligible speech with mediocre quality, and when trained on 36 minutes of speech, the model was unable to produce intelligible speech.[28][29]
For years, reducing the amount of data required to train a realistic high-quality text-to-speech model has been a primary goal of researchers in the field of deep learning speech synthesis.[30][31] The developer of 15.ai claimed that as little as 15 seconds of data was sufficient to clone a voice up to human standards, a significant reduction in the amount of data required.[32]
Copyrighted material in deep learning
In a landmark 2013 case between Google and the Authors Guild, the court ruled that Google Books, a service that searches the full text of printed copyrighted books, was transformative and therefore met all requirements for fair use.[33] This case set an important legal precedent for the field of deep learning and artificial intelligence: using copyrighted material to train a discriminative model or a non-commercial generative model was deemed legal. The legality of commercial generative models trained using copyrighted material is still under debate; due to the black-box nature of machine learning models, any allegations of copyright infringement via direct competition would be difficult to prove.[34]
Development
15.ai was designed and created by an anonymous research scientist known by the alias 15.[1][2][3][4] Developing and running 15.ai cost several thousand dollars per month, initially funded by the developer's personal finances after a successful startup exit.[9]
The algorithm used by the project was dubbed DeepThroat.[9][35] The developer said the project and algorithm were conceived as part of MIT's Undergraduate Research Opportunities Program, and had been in development for years before the first release of the application.[1][36]
The developer also worked closely with the Pony Preservation Project from /mlp/, the My Little Pony board of 4chan.[9] This project was a "collaborative effort by /mlp/ to build and curate pony datasets" with the aim of creating applications in artificial intelligence.[38][39] The Friendship Is Magic voices on 15.ai were trained on a large dataset crowdsourced by the project: audio and dialogue from the show and related media[9]—including all nine seasons of Friendship Is Magic, the 2017 movie, spinoffs, leaks, and various other content voiced by the same voice actors—were parsed, hand-transcribed, and processed to remove background noise.
Reception
15.ai was met with a largely positive reception. Liana Ruppert of Game Informer described it as "simplistically brilliant"[6] and José Villalobos of LaPS4 wrote that it "works as easy as it looks."[16][b] Lauren Morton of Rock, Paper, Shotgun called the tool "fascinating,"[8] and Yuki Kurosawa of AUTOMATON deemed it "revolutionary."[1][c] Users praised the ability to easily create audio of popular characters that sound believable to those unaware they had been synthesized. Zack Zwiezen of Kotaku reported that "[his] girlfriend was convinced it was a new voice line from GLaDOS' voice actor, Ellen McLain".[5]
Impact
Fandom content creation
15.ai was frequently used for content creation in various fandoms, including the My Little Pony: Friendship Is Magic fandom, the Team Fortress 2 fandom, the Portal fandom, and the SpongeBob SquarePants fandom, and numerous videos and projects containing speech from 15.ai went viral.[5][6] The platform was credited as the impetus behind the popularization of AI voice cloning in content creation, demonstrating the potential for accessible, high-quality voice synthesis technology.[2][9]
The My Little Pony: Friendship Is Magic fandom saw a resurgence in video and musical content creation as a result, inspiring a new genre of fan-created content assisted by artificial intelligence. Some fanfictions were adapted into fully voiced "episodes": The Tax Breaks is a 17-minute-long animated video rendition of a fan-written story published in 2014 that uses voices generated from 15.ai with sound effects and audio editing, emulating the episodic style of the early seasons of Friendship Is Magic.[40][41]
Viral videos from the Team Fortress 2 fandom featuring voices from 15.ai include Spy is a Furry (which gained over 3 million views on YouTube across multiple videos[yt 1][yt 2][yt 3]) and The RED Bread Bank, both of which inspired Source Filmmaker animated video renditions.[1] Other fandoms used voices from 15.ai to produce viral videos. As of July 2022, the viral video Among Us Struggles (with voices from Friendship Is Magic) had over 5.5 million views on YouTube.[yt 4] YouTubers, TikTokers, and Twitch streamers also used 15.ai for their videos, such as FitMC's video on the history of 2b2t, one of the oldest running Minecraft servers, and datpon3's TikTok video featuring the main characters of Friendship Is Magic, which had 1.4 million and 510 thousand views, respectively.[yt 5][tt 1]
Impact on voice cloning technology
15.ai introduced several technical innovations in voice cloning.[citation needed] While earlier text-to-speech systems such as Google's Tacotron2 required tens of hours of audio data to produce intelligible speech,[28][29] 15.ai claimed to achieve high-quality voice cloning with as little as 15 seconds of training data.[32][9] This reduction in required training data represented a breakthrough in the field of speech synthesis.[10][9]
The project also introduced the concept of "emotional contextualizers" for controlling speech emotion through sentiment analysis.[1][14][10]
Reactions from voice actors
Some voice actors have publicly decried the use of voice cloning technology. Cited reasons include concerns about copyright infringement, right to privacy, impersonation and fraud, unauthorized use of an actor's voice in pornography or explicit content, and the potential of AI being used to make voice actors obsolete.[3][13][9][10]
Notes
- ^ The term "faster than real-time" in speech synthesis means that the system can generate audio more quickly than the actual duration of the speech – for example, generating 10 seconds of speech in less than 10 seconds would be considered faster than real-time.
- ^ Translated from original quote written in Spanish: "La dirección es 15.AI y funciona tan fácil como parece."[16]
- ^ Translated from original quote written in Japanese: "しかし15.aiが画期的なのは「データが30秒しかない文字でも、ほぼ100%の発音精度を達成できること」そして「ごくわずかなデータのみを使って、自然な感情のこもった音声を数百以上生成できること」だという。"[1]
References
- Notes
- ^ a b c d e f g h i j k l m n Kurosawa, Yuki (January 19, 2021). "ゲームキャラ音声読み上げソフト「15.ai」公開中。『Undertale』や『Portal』のキャラに好きなセリフを言ってもらえる". AUTOMATON. Archived from the original on January 19, 2021. Retrieved January 19, 2021.
- ^ a b c d e f g h i "15.ai: All About 15.ai and the Best Alternatives". Speechify. November 19, 2023. Retrieved November 18, 2024.
- ^ a b c d e f "15.AI: Everything You Need to Know & Best Alternatives". ElevenLabs. February 7, 2024. Retrieved November 18, 2024.
- ^ a b c d e f "Free 15.ai Character Voice Cloning and Alternatives". Resemble.ai. Retrieved November 18, 2024.
- ^ a b c d e Zwiezen, Zack (January 18, 2021). "Website Lets You Make GLaDOS Say Whatever You Want". Kotaku. Archived from the original on January 17, 2021. Retrieved January 18, 2021.
- ^ a b c d Ruppert, Liana (January 18, 2021). "Make Portal's GLaDOS And Other Beloved Characters Say The Weirdest Things With This App". Game Informer. Archived from the original on January 18, 2021. Retrieved January 18, 2021.
- ^ a b Clayton, Natalie (January 19, 2021). "Make the cast of TF2 recite old memes with this AI text-to-speech tool". PC Gamer. Archived from the original on January 19, 2021. Retrieved January 19, 2021.
- ^ a b Morton, Lauren (January 18, 2021). "Put words in game characters' mouths with this fascinating text to speech tool". Rock, Paper, Shotgun. Archived from the original on January 18, 2021. Retrieved January 18, 2021.
- ^ a b c d e f g h i j k l m n "Everything You Need to Know About 15.ai: The AI Voice Generator". Play.ht. September 12, 2024. Retrieved November 18, 2024.
- ^ a b c d e f "15.ai – Natural and Emotional Text-to-Speech Using Neural Networks". Hashdork. May 15, 2024. Retrieved November 18, 2024.
- ^ "Demystifying 15.ai: How AI Generates Ultra-Realistic Text-to-Speech Voices". TheLinuxCode. December 27, 2023. Retrieved November 18, 2024.
- ^ Ng, Andrew (April 1, 2020). "Voice Cloning for the Masses". The Batch. Archived from the original on August 7, 2020. Retrieved April 5, 2020.
- ^ a b Lopez, Ule (January 16, 2022). "Troy Baker-backed NFT firm admits using voice lines taken from another service without permission". Wccftech. Archived from the original on January 16, 2022. Retrieved June 7, 2022.
- ^ a b c d Yoshiyuki, Furushima (January 18, 2021). "『Portal』のGLaDOSや『UNDERTALE』のサンズがテキストを読み上げてくれる。文章に込められた感情まで再現することを目指すサービス「15.ai」が話題に". Denfaminicogamer. Archived from the original on January 18, 2021. Retrieved January 18, 2021.
- ^ @fifteenai (February 23, 2023). "If all goes well, the next update should be the culmination of a year and a half of nonstop work put into a huge number of fixes and major improvements to the algorithm. Just give me a bit more time – it should be worth it" (Tweet) – via Twitter.
- ^ a b c Villalobos, José (January 18, 2021). "Descubre 15.AI, un sitio web en el que podrás hacer que GlaDOS diga lo que quieras". LaPS4. Archived from the original on January 18, 2021. Retrieved January 18, 2021.
- ^ Moto, Eugenio (January 20, 2021). "15.ai, el sitio que te permite usar voces de personajes populares para que digan lo que quieras". Yahoo! Finance. Archived from the original on March 8, 2022. Retrieved January 20, 2021.
- ^ Felbo, Bjarke (2017). "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm". Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1615–1625. arXiv:1708.00524. doi:10.18653/v1/D17-1169. S2CID 2493033.
- ^ Corfield, Gareth (August 7, 2017). "A sarcasm detector bot? That sounds absolutely brilliant. Definitely". The Register. Archived from the original on June 2, 2022. Retrieved June 2, 2022.
- ^ "An Algorithm Trained on Emoji Knows When You're Being Sarcastic on Twitter". MIT Technology Review. August 3, 2017. Archived from the original on June 2, 2022. Retrieved June 2, 2022.
- ^ "Emojis help software spot emotion and sarcasm". BBC. August 7, 2017. Archived from the original on June 2, 2022. Retrieved June 2, 2022.
- ^ Lowe, Josh (August 7, 2017). "Emoji-Filled Mean Tweets Help Scientists Create Sarcasm-Detecting Bot That Could Uncover Hate Speech". Newsweek. Archived from the original on June 2, 2022. Retrieved June 2, 2022.
- ^ Valle, Rafael (2020). "Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens". arXiv:1910.11997 [eess].
- ^ Cooper, Erica (2020). "Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings". arXiv:1910.10838 [eess].
- ^ a b van den Oord, Aäron; Li, Yazhe; Babuschkin, Igor (November 12, 2017). "High-fidelity speech synthesis with WaveNet". DeepMind. Archived from the original on June 18, 2022. Retrieved June 5, 2022.
- ^ Hsu, Wei-Ning (2018). "Hierarchical Generative Modeling for Controllable Speech Synthesis". arXiv:1810.07217 [cs.CL].
- ^ Habib, Raza (2019). "Semi-Supervised Generative Modeling for Controllable Speech Synthesis". arXiv:1910.01709 [cs.CL].
- ^ a b "Audio samples from "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis"". August 30, 2018. Archived from the original on November 11, 2020. Retrieved June 5, 2022.
- ^ a b Shen, Jonathan; Pang, Ruoming; Weiss, Ron J.; Schuster, Mike; Jaitly, Navdeep; Yang, Zongheng; Chen, Zhifeng; Zhang, Yu; Wang, Yuxuan; Skerry-Ryan, RJ; Saurous, Rif A.; Agiomyrgiannakis, Yannis; Wu, Yonghui (2018). "Natural TTS Synthesis by Conditioning WaveNet on Mel-Spectrogram Predictions". arXiv:1712.05884 [cs.CL].
- ^ Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv:1808.10128 [cs.CL].
- ^ Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv:1905.06791 [cs.CL].
- ^ a b Phillips, Tom (January 17, 2022). "Troy Baker-backed NFT firm admits using voice lines taken from another service without permission". Eurogamer. Archived from the original on January 17, 2022. Retrieved January 17, 2022.
- ^ Authors Guild v. Google, – F.2d – (2d Cir. 2015). (temporary cites: 2015 U.S. App. LEXIS 17988; Slip opinion (October 16, 2015))
- ^ Li, Y.; Li, J. (2021). "Does Black-Box Machine Learning Shift the US Fair Use Doctrine?". SSRN. doi:10.2139/ssrn.3998805.
- ^ "15.ai – About". 15.ai. February 20, 2022. Archived from the original on October 6, 2021. Retrieved February 20, 2022.
- ^ Button, Chris (January 19, 2021). "Make GLaDOS, SpongeBob and other friends say what you want with this AI text-to-speech tool". Byteside. Retrieved November 18, 2024.
- ^ Branwen, Gwern (March 6, 2020). ""15.ai", 15, Pony Preservation Project". Gwern.net. Gwern. Archived from the original on March 18, 2022. Retrieved June 17, 2022.
- ^ Scotellaro, Shaun (March 14, 2020). "Neat "Pony Preservation Project" Using Neural Networks to Create Pony Voices". Equestria Daily. Archived from the original on June 23, 2021. Retrieved June 11, 2022.
- ^ "Pony Preservation Project (Thread 108)". 4chan. Desuarchive. February 20, 2022. Retrieved February 20, 2022.
- ^ Scotellaro, Shaun (May 15, 2022). "Full Simple Animated Episode – The Tax Breaks (Twilight)". Equestria Daily. Archived from the original on May 21, 2022. Retrieved May 28, 2022.
- ^ The Terribly Taxing Tribulations of Twilight Sparkle. April 27, 2014. Archived from the original on June 30, 2022. Retrieved April 28, 2022.
- YouTube (referenced for view counts and usage of 15.ai only)
- ^ "SPY IS A FURRY". YouTube. January 17, 2021. Archived from the original on June 13, 2022. Retrieved June 14, 2022.
- ^ "Spy is a Furry Animated". YouTube. Archived from the original on June 14, 2022. Retrieved June 14, 2022.
- ^ "[SFM] – Spy's Confession – [TF2 15.ai]". YouTube. January 15, 2021. Archived from the original on June 30, 2022. Retrieved June 14, 2022.
- ^ "Among Us Struggles". YouTube. September 21, 2020. Retrieved July 15, 2022.
- ^ "The UPDATED 2b2t Timeline (2010–2020)". YouTube. March 14, 2020. Archived from the original on June 1, 2022. Retrieved June 14, 2022.
- TikTok
- ^ "She said " 👹 "". TikTok. Retrieved July 15, 2022.