They adapt the Stable Diffusion AI model to generate music from text

Developers have adapted the artificial intelligence (AI) model Stable Diffusion to create spectrograms that can be turned into audio or music clips from text.

Stable Diffusion is a text-to-image machine learning model developed by Stability AI, used to generate high-quality digital images from text.

Two developers, Seth Forsgren and Hayk Martiros, have created a project called Riffusion that adapts this solution to music. With it, spectrograms can be generated that can in turn be translated into audio clips.
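In practice, the generation step works like ordinary text-to-image inference, except that the output image is a spectrogram. The following is a minimal sketch using the diffusers library; the checkpoint name ("riffusion/riffusion-model-v1"), the prompt, and the file names are illustrative assumptions rather than the project's exact code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned checkpoint (assumed to be published as
# "riffusion/riffusion-model-v1" on the Hugging Face Hub).
pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a sufficiently powerful GPU is assumed

# The prompt describes the desired music; the output is a spectrogram image.
image = pipe("funky jazz saxophone solo", height=512, width=512).images[0]
image.save("spectrogram.png")
```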

As the creators of this project explain on their website, an audio spectrogram or sonogram is a visual representation of the frequency content of a sound clip; in Riffusion, these images are generated from the text prompts entered by the user.

These sonograms have two axes: X, which represents time, and Y, which represents frequency. The color of each pixel in the spectrogram encodes the amplitude of the audio at that point. This is precisely what Torchaudio relies on when it takes the image generated by Stable Diffusion and converts it into audio.
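A rough sketch of that image-to-audio step is shown below, using torchaudio's Griffin-Lim transform to recover a waveform from a magnitude-only image. The file name, amplitude scaling, and STFT parameters are illustrative assumptions, not the project's actual values.

```python
import numpy as np
import torch
import torchaudio
from PIL import Image

def spectrogram_image_to_audio(path: str) -> torch.Tensor:
    # Load the image: the X axis is time, the Y axis is frequency,
    # and the brightness of each pixel encodes amplitude.
    img = Image.open(path).convert("L")
    data = np.asarray(img, dtype=np.float32) / 255.0

    # Flip vertically so that low frequencies occupy the first rows,
    # matching the layout torchaudio expects for a spectrogram.
    magnitudes = torch.from_numpy(np.ascontiguousarray(data[::-1, :]))

    # Undo a simple amplitude compression (an illustrative assumption;
    # the real project uses its own scaling).
    magnitudes = magnitudes ** 2 * 30.0

    # Griffin-Lim expects freq bins = n_fft // 2 + 1, so derive n_fft from
    # the image height; it then iteratively estimates the phase information
    # that the image does not store.
    n_fft = 2 * (magnitudes.shape[0] - 1)
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=n_fft // 4)
    return griffin_lim(magnitudes)

# Example usage: reconstruct a waveform and save it as a WAV file.
# waveform = spectrogram_image_to_audio("spectrogram.png")
# torchaudio.save("clip.wav", waveform.unsqueeze(0), 44100)
```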

Riffusion's creators note that it is not only possible to generate music from images and text, but also to combine, experiment with, and merge styles.

The developers have pointed out that, with a sufficiently powerful GPU, sonograms can be created as 512 x 512 pixel images corresponding to about five seconds of audio. In addition, infinite variations can be produced from the same original image.
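One way to produce such variations is to feed an existing spectrogram back through an image-to-image pipeline and change the random seed. The sketch below illustrates this under the same assumptions as the earlier examples; the checkpoint name, prompt, strength, and seed are hypothetical.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load the same (assumed) checkpoint in image-to-image mode.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
).to("cuda")

# Start from an existing 512 x 512 spectrogram and vary it.
init_image = Image.open("spectrogram.png").convert("RGB")
generator = torch.Generator("cuda").manual_seed(42)  # change the seed for new variations

variation = pipe(
    "funky jazz saxophone solo, heavier drums",
    image=init_image,
    strength=0.6,  # how far the result may drift from the original spectrogram
    generator=generator,
).images[0]
variation.save("spectrogram_variation.png")
```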

Riffusion's web page currently includes a clip generator, as well as instructions and technical details on how to use this technology. The project's code is also available in its repository on GitHub.

By Editor
