
A Visual Microphone? The Revolutionary Tech That Can Extract Audio from Images | by Pavle Marinkovic | Jan, 2023



What if we could get sound just from a video recording? Image created by Midjourney

What if we could retrieve sounds without a single sound recorder?

If you can perceive a vibration, you can unveil the world around you.

Vibrations are everywhere, but they're often so subtle that we need sophisticated equipment to see them.

But it pays off, big time.

A motion pattern can reveal useful information about an object, such as its internal structure (e.g. a fruit's ripeness probed with ultrasound) or its properties (e.g. the flexibility of pipes). The object itself can also act as a resonator for surrounding sounds.

This acoustic property has opened a gate to data we were never able to access before.

By capturing and analyzing these vibrations, we can recover sounds like speech and music that would otherwise be lost.

We just need to see the vibrations.

When sound waves hit an object, it vibrates in a specific pattern.

These vibrations are enough to generate a visual signal, even if they're very subtle. Our eyes can't see it, but a high-speed camera pointed at the object can detect this motion, and the footage can be used to retrieve the sounds that caused these surface distortions.

You’re essentially translating the object’s movement back into sound.

This technology allows us to listen to speech and music inside a room by analyzing the movement of objects that were in it when the sound was made.

This means you don't need a sound recorder in the room to find out what people were talking about, or what other sounds they heard while they were there.

The object serves as a proxy for the entire sonic world you missed by being outside.

It acts as a beacon.

By analyzing the motion pattern of the object, it is possible to extract the original sonic information that caused these vibrations.

And this tech is known as the visual microphone.

We're now able to reveal a secret conversation inside a room using nothing more than a bag of chips.

Sounds ridiculous, but that’s exactly what a group of researchers from MIT, Microsoft, and Adobe did in a series of experiments back in 2014.

They first tried extracting the sound from the vibrations of a bag of chips while a loudspeaker played the song “Mary Had a Little Lamb”.

They would record the bag’s movements with a high-speed camera and extract the pitch and duration of each note with their algorithm. These signals were further processed to get rid of some of the underlying noise.

This was the result:

Seeing sound. Retrieved from one of the authors' YouTube channels.
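
If you're curious what "extracting the pitch and duration of each note" could look like in code, here is a toy sketch: it scans a recovered signal in short windows and reports the dominant frequency in each one. It is not the paper's method; the WAV file name, the window length, and the simple FFT-peak approach are all assumptions for illustration.

```python
# Toy illustration of extracting "the pitch of each note" from a recovered
# signal: estimate the dominant frequency in short windows via an FFT peak.
# The file name and the 250 ms window are placeholders, not the study's setup.
import numpy as np
from scipy.io import wavfile

rate, audio = wavfile.read("recovered_from_chip_bag.wav")
audio = audio.astype(np.float64)
window = int(0.25 * rate)                        # 250 ms analysis windows

for start in range(0, len(audio) - window, window):
    chunk = audio[start:start + window] * np.hanning(window)
    spectrum = np.abs(np.fft.rfft(chunk))
    freqs = np.fft.rfftfreq(window, d=1.0 / rate)
    peak = freqs[np.argmax(spectrum[1:]) + 1]    # skip the DC bin
    print(f"{start / rate:5.2f} s  ->  ~{peak:6.1f} Hz")
```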

The same worked with leaves from a potted plant.

So they knew music could be extracted from these odd objects, but what about other types of sound?

They tried extracting human speech this time around, with an additional challenge: the camera was placed behind a thick soundproof window while recording a bag of chips lying on the floor of the room where the speech took place.

Setup of the experiment. Retrieved from MIT’s paper.

A recording of a male voice reciting the same song ("Mary Had a Little Lamb") was compared in two forms: the sound captured by a cellphone sitting next to the chip bag, and the sound recovered from the high-speed camera footage:

Spectrogram of the cellphone’s recording (b) vs high-speed video (c)

Although subtle, there's some resemblance between the spectrograms, and it becomes even more evident in the recovered sound.
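
You can make a comparison like the one in the figure yourself with a few lines of Python. This is just a plotting sketch: both WAV file names are placeholders, not the study's actual data.

```python
# Plot the spectrogram of a reference recording next to a recovered one,
# the same kind of side-by-side comparison shown in the paper's figure.
# Both file names are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

def plot_spectrogram(ax, path, title):
    rate, audio = wavfile.read(path)
    if audio.ndim > 1:                            # mix stereo down to mono
        audio = audio.mean(axis=1)
    f, t, sxx = spectrogram(audio, fs=rate, nperseg=1024)
    ax.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12), shading="auto")
    ax.set(title=title, xlabel="Time (s)", ylabel="Frequency (Hz)")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_spectrogram(ax1, "cellphone_reference.wav", "Cellphone next to the bag")
plot_spectrogram(ax2, "recovered_from_video.wav", "Recovered from high-speed video")
plt.tight_layout()
plt.show()
```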

Can you understand what is being said?

Remember that this is the early phase of new tech. Imagine the improvements we’ll have in the years to come.

Even with the limited range of experiments they carried out, the researchers were able to identify some factors that make the process easier. They found that:

  • Lower frequencies are easier to recover since they create more movement in the object and are less attenuated by the material’s composition.
  • Higher frequencies can be recovered more easily with lighter objects since they’re easier to move and thus create a motion pattern to extract afterward.
  • Light and rigid objects, such as thin plates, membranes, bags of chips, leaves, and aluminum foil, serve as good visual microphones.
  • Identifiable sounds (e.g. a melody or words from speech) can be recovered from a distance of 3 to 4 meters with a 400 mm lens. The range can increase with a more powerful zoom lens.

Of course, more research is needed but the implications of these results are already astounding.

Any visible object can be turned into a visual microphone.

It’s both fascinating and scary.

The researchers used a high-speed video camera (1 kHz to 20 kHz frame rates) to record an object being hit by sound waves. They looked for extremely subtle changes in its surface and displayed them on a graph showing sound properties like pitch and duration.

Then, they would process this video with their algorithm and retrieve the sound from the object's surface motion.

Visual representation of the extraction process. Retrieved from the MIT paper.

This technology allows us to capture very small movements of an object in a video, even if they're only a tiny fraction of a pixel in size (from a hundredth down to a thousandth of a pixel). By analyzing all the pixels in the video, we can detect and interpret these subtle movements.

It’s a 3-step process:

  1. First, break the video down into smaller components at different spatial scales and orientations.
  2. Next, measure the motion in each of these components, then average and align those measurements to create a single motion signal for the object.
  3. Finally, combine all of the motion information and apply denoising techniques to improve the sound quality. The result is the recovered sound (a rough code sketch of a simplified version of this pipeline follows below).
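
Here is that sketch. To be clear, this is not the MIT algorithm (the paper decomposes every frame with a complex steerable pyramid and combines local phase signals across scales and orientations); it just tracks one global sub-pixel shift per frame and filters the result, which is enough to show how pixel-level motion becomes a one-dimensional, audio-rate signal. The frame array, the frame rate, and the 20 Hz cutoff are assumptions.

```python
# Heavily simplified stand-in for the visual-microphone pipeline (NOT the
# paper's algorithm): track one global sub-pixel shift per frame and treat
# that trace as an audio signal. Assumes `frames` is a (num_frames, H, W)
# grayscale array from a high-speed clip and `fps` is its frame rate.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from skimage.registration import phase_cross_correlation

def recover_motion_signal(frames: np.ndarray, fps: float) -> np.ndarray:
    """Turn tiny frame-to-frame shifts of a vibrating object into a 1-D signal."""
    reference = frames[0].astype(np.float64)
    samples = []
    for frame in frames:
        # Sub-pixel displacement relative to the first frame;
        # upsample_factor=100 resolves about 1/100th of a pixel.
        shift, _, _ = phase_cross_correlation(
            reference, frame.astype(np.float64), upsample_factor=100
        )
        samples.append(shift[0])                  # one axis becomes the "audio" sample
    motion = np.asarray(samples)

    # High-pass filter to drop slow drift (camera shake, lighting changes)
    # and keep only audio-rate vibration; the 20 Hz cutoff is an assumption.
    sos = butter(4, 20.0, btype="highpass", fs=fps, output="sos")
    audio = sosfiltfilt(sos, motion)
    return audio / (np.max(np.abs(audio)) + 1e-12)   # normalize to roughly [-1, 1]
```

Played back at the camera's frame rate, that normalized trace is the recovered "sound"; the real method does this per scale and orientation before merging the signals, which is what makes it so much more sensitive.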

This technology enables the recovery of sound inside a room, even if it was not directly recorded.

However, in their experiments the researchers used rather loud sounds, ranging from 80 dB (an actor's stage voice) to 100 dB (e.g. a jet engine heard from 100 m away), which limits the volume range that can be extracted from an object... for now.

The research is still at an early stage, and it's not yet clear whether it would be feasible to deploy at scale. The study notes that the method requires high-speed video recording and advanced image processing to extract the sound, which is expensive, and that factors such as lighting and the object's material and shape affect the quality of the recovered sound, limiting the conditions under which it's useful.

Surveillance

If any object can act as a visual microphone, the first thing that comes to mind is using it for security reasons.

Governments and security agencies can use visual microphones to recover sound from surveillance footage to gain valuable information. For example, during a protest or a public event, authorities can use visual microphones to recover speech from videos of individuals in a crowd to identify potential threats or criminal activity.

The researchers behind the paper also discussed the possibility of recovering sound from ordinary video cameras, with promising results: by exploiting the camera's rolling shutter, they were able to recover frequencies up to 5 times higher than the video's frame rate. This was the result.

By denoising the signal with their speech-enhancement algorithm and applying a lowpass filter to remove unwanted frequencies, the quality could be boosted even more.
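
As a rough idea of that post-processing step, the sketch below lowpass-filters a recovered signal with SciPy. The cutoff, the file names, and the plain Butterworth filter are assumptions, and the paper's own speech-enhancement algorithm is not reproduced here.

```python
# Post-processing sketch: lowpass-filter a recovered signal to strip
# high-frequency artifacts before listening to it. The 3.4 kHz cutoff
# (roughly the top of the telephone speech band) and file names are
# assumptions; the paper's speech-enhancement step is not reproduced.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

rate, recovered = wavfile.read("recovered_from_video.wav")    # placeholder file
recovered = recovered.astype(np.float64)
if recovered.ndim > 1:                                        # mix stereo to mono
    recovered = recovered.mean(axis=1)

sos = butter(6, 3400.0, btype="lowpass", fs=rate, output="sos")
cleaned = sosfiltfilt(sos, recovered)

cleaned /= np.max(np.abs(cleaned)) + 1e-12                    # normalize
wavfile.write("recovered_cleaned.wav", rate, (cleaned * 32767).astype(np.int16))
```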

Forensics

Similar to the above use case, law enforcement agencies could also use this tech to recover audio from crime scenes or accident sites.

For example, if a crime is committed and the suspects are caught on camera, the visual microphone can be used to extract any audio that was present during the recording. This could include the suspects’ voices, any background noise, or even the sound of the crime being committed.

The audio evidence could later be used in court to help identify suspects, establish a timeline of events, or even link suspects to the crime.

Film and Television Production

Production companies can use visual microphones to recover audio from silent footage or to enhance existing audio recordings.

For example, during a film shoot, a visual microphone can be used to recover dialogue from a scene shot on a noisy set, making it usable for the final cut.

The authors of the study briefly discussed the idea of retrieving sound from silent films. Sadly, it’s still not possible:

“because of their low frame-rates (12–26 fps according to Wikipedia) and low image quality, we do not believe they have enough visual information for our technique to recover sound”.

For now, retrieval is only possible with footage from modern cameras whose audio is missing for whatever reason.

Music Recording

Music recording studios could use this tech to recover audio from music videos or live performances.

By placing a high-speed camera near the stage and filming the performance, a studio could extract the sound from the video itself. The extracted sound would be of much higher quality than traditional microphone recordings, as it would not be affected by ambient noise or the acoustics of the room. Think of an ultra-HD live album release.

This technology could also be used in a studio setting to record music, where the visual microphone could be used to capture the sounds of individual instruments. This could be especially useful for recording orchestral music, where traditional microphone placements may not be able to capture the full sound of all instruments.

Photo by Charanjeet Dhiman

Imagine being a fly on the wall during private conversations between The Beatles.

You might feel this way watching Peter Jackson's documentary "Get Back". Jackson used AI magic to reveal hidden studio talks between the Fab Four. The film explores the band's relationship as they created their final studio release, "Let It Be", in 1970.

But the catch is, the band was trying to keep these chats a secret.

Instead of letting the director in on the secret, they turned the filming into a game of hide and seek by using their instruments to cover up their conversations. But, with the help of some tech wizards and something called “demixing”, Peter and his team were able to strip away the guitar noise and expose the private conversations for all to hear.

Not only that, but the audio from the band's Twickenham Studios rehearsals was also cleaned up using AI to isolate the instruments and vocals. This allowed mono tracks to become stereo, enriching the music in a fascinating new way.

Now imagine combining this AI tech with sound extraction from visual information.

Could we retrieve the original sound from silent footage, such as early films or home movies, that was not originally recorded with sound? We could use AI to add lip-syncing or voice reconstruction and bring the characters in the footage to life in a more realistic and immersive way.

Or could we recover sounds that were not audible during filming (e.g. background noise or a muffled conversation), which could then be used to create a more immersive and historically authentic soundtrack for the film?

And this last one is very sci-fi. What if one day we could recover audio from ancient or historical artifacts, such as pottery or statues, that may have been used in rituals or ceremonies to hear what’s been stored inside of them for centuries?

A naive little fantasy of mine.

Only time will tell.


