Audio Manipulation in Videos

Nov 1, 2023

In recent years, we have witnessed an unprecedented revolution in multimedia content creation, driven by advances in Artificial Intelligence. Generative models have progressed rapidly, first producing and editing text, then images, and more recently video and audio. Starting from a simple text prompt, they can create content that matches the description. This (r)evolution is transforming not only the way we create content but also the way we consume and share it.

From the ability to generate coherent and persuasive texts to the creation of astonishing images from mere descriptions, AI has expanded the horizons of human creativity. However, the revolution doesn't stop there. In this article, we will explore how audio manipulation in videos is becoming the next frontier of multimedia content creation. And, with it, of cyber scams…

This step in evolution not only raises exciting questions about innovation and entertainment but also presents ethical dilemmas and challenges in terms of truthfulness and reliability of information. Audio manipulation in videos is challenging the limits of reality, and its impact on society will be profound.

Professionally, this technology promises to greatly facilitate creative and production processes. In the film industry, the ability to adjust and enhance the audio of a scene or even change dialogue in post-production could save significant time and resources. Recording studios could benefit from precise editing and customization of voices and sounds, allowing artists to reach new creative heights. Furthermore, in the world of marketing and advertising, the ability to adapt and personalize the audio of advertising campaigns for different audiences could increase the effectiveness and impact of marketing strategies. However, as we take advantage of these benefits, we must be aware of the ethical challenges and the importance of maintaining the integrity of information and authenticity in multimedia communication. This technological evolution poses both exciting opportunities and fundamental responsibilities in the professional and personal realm.

Along with professional opportunities, this technology also raises serious cybersecurity concerns. Cybercriminals could use audio manipulation in videos to create deceptive and fraudulent content, such as fake speeches by public figures, which could be used to spread misinformation or blackmail individuals and organizations. Furthermore, the ease of modifying voices and audios could lead to the proliferation of phone scams and identity impersonation attacks with devastating financial and personal consequences. Digital security becomes an essential priority as this technology advances, and it is crucial to develop effective countermeasures to prevent its misuse and protect information integrity and individual privacy.

Currently, we are at a point where modifying the audio of a video has become more accessible and requires less technical knowledge. Anyone with a video of someone speaking can, with relative ease, generate new audio using a different voice, even cloning the original voice to make it more realistic if necessary. This new audio can then replace the original in the video, changing the message the visual content conveys. Indeed, all the ingredients for advanced identity impersonation are already available: from cloning a voice to playing audio with the cloned voice and synchronizing the speaker's facial expressions. As mentioned earlier, this technological advancement poses not only exciting creative possibilities but also serious concerns in terms of security and authenticity in the digital world.
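To give a sense of how little is involved, swapping a video's audio track is a one-line job for a tool like ffmpeg. Below is a minimal sketch that only assembles the command (the file names are hypothetical; running it requires ffmpeg on your PATH):

```python
import subprocess

def build_replace_audio_cmd(video_path, audio_path, out_path):
    """Build an ffmpeg command that swaps a video's audio track.

    The video stream is copied untouched and the new audio is mapped in;
    -shortest trims the output to the shorter of the two streams.
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # original video (its audio is discarded)
        "-i", audio_path,   # replacement audio
        "-map", "0:v:0",    # video stream from the first input
        "-map", "1:a:0",    # audio stream from the second input
        "-c:v", "copy",     # no re-encoding of the video
        "-shortest",
        out_path,
    ]

cmd = build_replace_audio_cmd("interview.mp4", "cloned_voice.wav", "swapped.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```

This only swaps the track; lip synchronization, covered below, is what makes the swap believable.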

But audio manipulation in videos goes beyond simply modifying what already exists; it also allows adding audio to videos where there was none before. To create a deepfake, a type of advanced audiovisual manipulation, the steps are more accessible than we might imagine. Anyone could make a person say something they never said, a feat that until recently seemed reserved for science fiction cinema.

Let's walk step by step through how audio can be manipulated: first bringing a fictitious image to life by giving it a voice, then modifying the audio of real footage. Before continuing, however, it's important to emphasize that all the examples and demonstrations we will present are purely fictitious and for illustrative purposes.

Neither the voice of the people involved here nor what is mentioned corresponds in any case to reality; it’s pure fiction and as such must be treated. It’s essential to highlight that identity theft, especially when used for illegal purposes, is a crime clearly defined by law. The goal here is to educate about the possibilities and challenges presented by this technology, not to promote its misuse.

The first thing we will do is create an image to bring to life using Dall-E 3, the image generation tool from OpenAI and a competitor to Midjourney. Its latest version can generate realistic or illustrative images for video games, logos, marketing or advertising campaigns, and more. By guiding the Dall-E 3 algorithm through a series of iterations, we generate a futuristic image like the one we see below:

So far, so good. We've generated an image of a cyborg in a futuristic setting that we'll "bring to life". Now we'll use the RunwayML tool, which allows us to generate a video from a static image. This tool is still in its early versions and can produce striking but also erroneous or invalid results. It's still immature, but it's perfect for illustrating the topic at hand.



We have our character in motion. Now let's add a voice. For the English audio, I used the ElevenLabs service, which allows us to clone existing voices, as long as we have the rights to them, or use predefined ones. To modify the video and add the created audio, we'll use a Python tool published on a Google Colab called Video Retalking, which allows us to replace the audio of a video with another. In our case, there is nothing to replace, since the video currently has no audio; instead, we'll add the voice and synchronize the fictitious character's lips so it appears to be pronouncing what is said in the audio. For some reason, this tool significantly reduces the quality of the generated video, so the more iterations on the same file, the more quality is lost. Here's the result:



A pretty decent result that already lays the foundation for the capabilities that this type of tool will offer us in the not-so-distant future. It’s a simple example of a virtual character, but we must consider that the time invested in creating both the character and related content has been minimal compared to what a professional could achieve with similar features.
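Under the hood, the Colab notebook essentially boils down to running the project's inference script. A minimal sketch of that invocation, assuming the flag names from the video-retalking repository's README (the file names are hypothetical):

```python
def build_retalking_cmd(face_video, audio, outfile):
    """Build the command line for video-retalking's inference script.

    The --face / --audio / --outfile flags follow the project's README;
    paths must point at real files inside a checkout of the repo.
    """
    return [
        "python", "inference.py",
        "--face", face_video,   # source video containing the face to re-sync
        "--audio", audio,       # speech track the lips should match
        "--outfile", outfile,   # re-rendered, lip-synced video
    ]

cmd = build_retalking_cmd("cyborg.mp4", "voice_en.wav", "cyborg_talking.mp4")
# run with subprocess.run(cmd, check=True) from the repository root
```

The same invocation covers both cases described above: adding a voice to a silent clip and replacing the voice in an existing one.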

Now I want to translate the audio into Spanish. Since I used a voice optimized for English, creating a Spanish version directly with the same voice would yield a far less believable result, because the accent would sound too strange. Instead, I generated the translated audio with a voice specifically created to speak with a Spanish accent. Once generated, we can replace the current voice with the new one using the Video Retalking tool mentioned earlier.



We now have the same video with two different audios; both pronounce the same thing but in different languages. In just a few minutes, we can create different versions of multimedia content translated into multiple languages with almost no effort. We have created a purely fictional character generated by technology, completely detached from reality. However, this process goes much further, as it also allows us to modify real content, such as a television program or a news segment. It’s important to remember that these exercises are purely demonstrative and fictional and do not aim to deceive or impersonate any individual.



It features a well-known Spanish TV host. Neither her voice nor what she says has been spoken by her at any time, but we can see how her lips sync quite realistically with the audio playback. Considering that, as we have already mentioned, it is currently possible to clone a person’s voice from a few minutes of audio, it’s not hard to imagine to what extent we could manipulate this video or any other. Here, we are simply going to add the English version that we used earlier for our character:



Evolution is an astonishing process that shapes the diversity of life over time. The changes in images represent the adaptation of species to their environment, from simple organisms to complex forms of life.

Two video versions and two distinct voices, one single message. A message announcing the evolution of this kind of tool: how it is accelerating at a rapid pace with the emergence of AI-powered tools that are gradually transforming our reality, allowing us to create multimedia content in ways that seemed unreachable just a few years ago.

We are living in exciting times where technology takes us to new horizons of expression and narration, where the only limit is our own imagination. As we continue to explore the possibilities offered by this technology, we must be vigilant of the ethical and legal challenges it presents and use it responsibly to forge an authentic future.

We must remain alert and especially critical of the content we consume. As artificial intelligence tools continue to advance, there is the possibility that some of them will be used for malicious or deceptive purposes. Verifying the authenticity of sources and corroborating information becomes even more critical in this emerging landscape.

The responsibility lies both with content creators and consumers to discern between the real and the manipulated. Creativity and technology are elevating the art of storytelling to new heights, but we must approach this power with caution and ethics to ensure a future where truth and authenticity remain fundamental pillars of our society.
