Audio Processing Projects with Deep Learning


Introduction

Imagine a world where machines understand what you want and how you feel when you call customer service: if you are dissatisfied with something, you are quickly connected to a person; if you are looking for specific information, you may not need to speak to a person at all (unless you want to!).

This is going to be the new world order, and you can already see it happening to a large extent. Look at the highlights of 2017 in the data science industry and you can see the advances that deep learning is bringing to fields that were previously difficult to crack. One of those fields is audio/voice processing, especially given its unstructured nature and broad impact.

So, for the curious, I've compiled a list of tasks that are worth knowing before getting your hands dirty with audio processing. I'm sure there will be many more advancements in this area using deep learning in the future.

The post is structured to explain each task and its relevance. For each task there is a white paper that covers its details, along with a case study to help you get started solving it.

So let's start!

1. Audio classification

Audio classification is a fundamental problem in the field of audio processing. The task is essentially to extract features from the audio and then identify which class the audio belongs to. There are many useful applications of audio classification, such as genre classification, instrument recognition and artist identification.

This is also the most explored topic in audio processing; many posts were published in this field in the last year. In fact, we also hosted a practice hackathon for the community to solve this particular task.

White paper – http://ieeexplore.ieee.org/document/5664796/?reload=true

A common approach to solving an audio classification task is to preprocess the audio inputs to extract useful features and then apply a classification algorithm to them. For example, in the case study below, we are given a 5-second excerpt of a sound and the task is to identify which class it belongs to: a barking dog or a drilling sound. As mentioned in the post, one approach is to extract an audio feature called MFCC and then pass it through a neural network to get the appropriate class.
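To make this concrete, here is a minimal sketch of the MFCC-plus-neural-network pipeline, assuming clips are available as local files; the file paths, network size and the two class names are illustrative, not taken from the case study.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def extract_mfcc(path, n_mfcc=40):
    # Load a 5-second excerpt and compute its MFCCs
    y, sr = librosa.load(path, duration=5.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time -> fixed-size feature vector

# Simple feedforward classifier over the averaged MFCC vector
model = models.Sequential([
    layers.Input(shape=(40,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # e.g. barking dog vs. drilling
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# features = np.stack([extract_mfcc(p) for p in paths])  # paths/labels assumed
# model.fit(features, labels, epochs=30, validation_split=0.2)
```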

Case study – https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/

2. Audio fingerprinting

The goal of audio fingerprinting is to determine a digital "summary" of the audio, which can then be used to identify the audio from a sample. Shazam is an excellent example of an audio fingerprinting app: it recognizes music based on the first two to five seconds of a song. However, there are still situations where the system fails, especially when there is a lot of background noise.

White paper – http://www.cs.toronto.edu/~dross/ChandrasekharSharifiRoss_ISMIR2011.pdf

To solve this problem, one approach is to represent the audio in a different way so that it can be easily indexed; we can then find the patterns that differentiate the audio from background noise. In the case study below, the author converts raw audio into spectrograms and then uses peak-finding and fingerprint-hashing algorithms to establish the fingerprints of that audio file.
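As a rough sketch of the peak-finding step (the hashing that follows pairs nearby peaks into fingerprint hashes), assuming a local file and scipy being available; the neighbourhood size and amplitude threshold are illustrative:

```python
import numpy as np
import librosa
from scipy.ndimage import maximum_filter

y, sr = librosa.load("sample.wav")  # assumed input file
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# A time-frequency bin is a peak if it equals the maximum of its
# neighbourhood and is loud enough to stand out from background noise
local_max = maximum_filter(S_db, size=(20, 20))
peaks = (S_db == local_max) & (S_db > -40)
freq_bins, time_bins = np.where(peaks)
print(f"found {len(freq_bins)} candidate peaks")
```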

Case studyhttp://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

3. Automatic music tagging

Music tagging is a more complex version of audio classification. Here, each audio clip can belong to several classes at once, which is known as a multi-label classification problem. One practical application of this task is creating metadata for audio so that it can be searched later. Deep learning has helped solve this task to some extent, as can be seen in the case study below.

White paper – https://link.springer.com/article/10.1007/s10462-012-9362-y

As with most tasks, the first step is always to extract features from the audio sample. These are then classified according to the nuances of the audio (for example, if the audio contains more instrumental noise than the singer's voice, the tag could be "instrumental"). This can be done with machine learning or deep learning methods. The case study below uses deep learning to solve the problem, specifically a convolutional recurrent neural network (CRNN) along with Mel-frequency features.
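A minimal sketch of such a CRNN is shown below; the input shape (96 mel bands × 1366 frames) and the 50 output tags mirror the setup used in the case study, but the layer sizes here are illustrative. Note the sigmoid output and binary cross-entropy loss, which is what makes this multi-label rather than single-class.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(96, 1366, 1)),       # mel-spectrogram input
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 4)),             # -> (48, 341, 32)
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((4, 5)),             # -> (12, 68, 64)
    layers.Permute((2, 1, 3)),               # put the time axis first
    layers.Reshape((68, 12 * 64)),           # one feature vector per time step
    layers.GRU(128),                         # recurrent part of the CRNN
    layers.Dense(50, activation="sigmoid"),  # one independent probability per tag
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```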

Case study – https://github.com/keunwoochoi/music-auto_tagging-keras

4. Audio Segmentation

Segmentation literally means dividing a particular object into parts (segments) according to a defined set of characteristics. For audio data analysis in particular, segmentation is an important preprocessing step, because it lets us break a long, noisy audio signal into short, homogeneous segments (handy short audio sequences) that are used for further processing. One application of the task is heart sound segmentation, in other words, identifying the specific sounds of the heart.

White paper – http://www.mecs-press.org/ijitcs/ijitcs-v6-n11/IJITCS-V6-N11-1.pdf

We can turn this into a supervised learning problem, where each timestamp is categorized according to the required segments. Then we can apply an audio classification approach to solve the problem. In the case study below, the task is to segment the heart sound into two segments (lub and dub), so that we can identify an anomaly in each segment. It can be solved by extracting audio features and then applying deep learning for classification.
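As a rough sketch of the framing step, assuming a labelled recording at a hypothetical path; turning each frame into a feature vector is what lets us treat segmentation as per-timestamp classification:

```python
import numpy as np
import librosa

# Heart sounds live at low frequencies, so a low sampling rate suffices
y, sr = librosa.load("heartbeat.wav", sr=2000)
hop = 256
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

frames = mfcc.T  # shape: (n_frames, 13), one feature vector per timestamp
frame_times = librosa.frames_to_time(np.arange(len(frames)),
                                     sr=sr, hop_length=hop)

# Each frame would get a label derived from the annotations, e.g.
# 0 = background, 1 = "lub" (S1), 2 = "dub" (S2); any classifier
# (such as an LSTM over `frames`) can then predict a label per timestamp.
```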

Case study – https://www.analyticsvidhya.com/blog/2017/11/heart-sound-segmentation-deep-learning/

5. Audio source separation

Audio source separation involves isolating one or more source signals from a mixture of signals. One of the most common applications is identifying the lyrics of the audio for simultaneous display (karaoke, for example). A classic example, shown in Andrew Ng's machine learning course, is separating a speaker's voice from the background music.

White paper – http://ijcert.org/ems/ijcert_papers/V3I1103.pdf

A typical usage scenario involves:

  • loading an audio file
  • calculating a time-frequency transform to obtain a spectrogram, Y
  • using a source separation algorithm (such as non-negative matrix factorization) to obtain a time-frequency mask

The mask is then multiplied with the spectrogram, and the result is converted back to the time domain, as sketched below.
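Here is a sketch of this pipeline using non-negative matrix factorization, assuming a mixture file; the number of components and the way they are split between sources are illustrative:

```python
import numpy as np
import librosa

y, sr = librosa.load("mixture.wav")  # assumed input file
S = librosa.stft(y)
magnitude, phase = np.abs(S), np.angle(S)

# Factorize |S| ~= W @ H into 8 spectral templates and their activations
W, H = librosa.decompose.decompose(magnitude, n_components=8)

# Rebuild one source from a subset of components, build a soft mask,
# apply it to the spectrogram and convert back to the time domain
reconstruction = W[:, :4] @ H[:4, :]
mask = reconstruction / (W @ H + 1e-8)
separated = librosa.istft(mask * magnitude * np.exp(1j * phase))
```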

Case study – https://github.com/IoSR-Surrey/untwist

6. Beat tracking

As the name suggests, the goal here is to track the location of each beat in a collection of audio files. Beat tracking can be used to automate time-consuming tasks such as synchronizing events to music. It is useful in various applications, such as video editing, audio editing and human-computer improvisation.

White paper – https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-students/2012_GroschePeter_MusicSignalProcessing_PhD-Thesis.pdf

One approach to beat tracking is to analyze the audio file and use an onset detection algorithm to trace the beats. Even though the techniques used for onset detection rely heavily on audio feature engineering and machine learning, deep learning can easily be used here to push the results further.
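For a quick start, librosa ships an onset-strength-based beat tracker (BTrack, the case study below, is a C++ library built around a similar onset-driven idea); the file name here is assumed:

```python
import librosa

y, sr = librosa.load("song.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)  # onset-driven tracking
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print("estimated tempo (BPM):", tempo)
print("first beat locations (s):", beat_times[:8])
```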

Case study – https://github.com/adamstark/BTrack

7. Music recommendation

Thanks to the internet, we now have millions of songs we can listen to at any time. Ironically, this has made it even harder to discover new music because of the sheer number of alternatives. Music recommendation systems help deal with this information overload by automatically recommending new music to listeners. Content providers like Spotify and Saavn have developed highly sophisticated music recommendation engines. These models leverage the user's past listening history, among many other features, to build personalized recommendation lists.

White paper – https://pdfs.semanticscholar.org/7442/c1ebd6c9ceafa8979f683c5b1584d659b728.pdf

We can address the challenge of personalizing listening preferences by training a regression or deep learning model to predict the latent representations of songs obtained from a collaborative filtering model. This way, we can predict a song's representation in the collaborative filtering space even when no usage data is available.
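A rough sketch of this idea: a CNN that regresses from a song's mel-spectrogram to its collaborative-filtering latent vector. The input shape and the 40-dimensional latent size are assumptions for illustration; the targets would come from a collaborative filtering model trained on usage data.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 599, 1)),  # mel bands x frames, illustrative
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 4)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(40),                   # regression output: no activation
])
model.compile(optimizer="adam", loss="mse")  # minimize distance to CF vectors

# model.fit(spectrograms, cf_latent_vectors, ...)  # CF targets assumed available
```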

Case studyhttp://benanne.github.io/2014/08/05/spotify-cnns.html

8. Music retrieval

One of the most difficult tasks in audio processing, music retrieval essentially aims to build an audio-based search engine. Even though we can approach this by solving subtasks like audio fingerprinting, the task encompasses much more than that. For example, we also have to solve different smaller tasks for different types of music retrieval (tone detection, for instance, would be helpful for genre identification). At the moment, no system has been developed that meets expected industry standards.

White paper – http://www.nowpublishers.com/article/Details/INR-042

The task of music retrieval is divided into smaller and easier steps, which include tonal analysis (for example, melody and harmony) and rhythm or tempo analysis (for example, beat tracking). Based on these individual analyses, information is then extracted and used to retrieve similar audio samples.
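As one small illustration of the tonal-analysis step, averaged chroma vectors give a crude measure of harmonic similarity between two recordings; the file names are assumed:

```python
import numpy as np
import librosa

def chroma_profile(path):
    y, sr = librosa.load(path)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # 12 pitch classes x frames
    v = chroma.mean(axis=1)                          # average over time
    return v / np.linalg.norm(v)

a = chroma_profile("query.wav")
b = chroma_profile("candidate.wav")
print("tonal similarity (cosine):", float(a @ b))
```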

Case study – https://youtu.be/oGGVvTgHMHw

9. Music transcription

Music transcription is another challenging audio processing task. It involves annotating audio and creating a kind of "sheet" from which the music can be generated later. The manual effort involved in transcribing music from recordings can be huge; it varies greatly with the complexity of the song, how good our listening skills are and how detailed we want the transcription to be.

White paper – http://ieeexplore.ieee.org/abstract/document/7955698

The approach to music transcription is analogous to that of speech recognition, except that instead of spoken words, the notes played by the instruments are transcribed.
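A rough sketch of one small piece of this pipeline: estimating a fundamental-frequency contour and mapping it to note names. It assumes a monophonic recording at a hypothetical path; full polyphonic transcription is far more involved.

```python
import librosa

y, sr = librosa.load("melody.wav")

# pYIN estimates a frame-wise fundamental frequency plus a voiced/unvoiced flag
f0, voiced, _ = librosa.pyin(y,
                             fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"))

notes = [librosa.hz_to_note(f) for f in f0[voiced]]  # note name per voiced frame
print(notes[:20])
```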

Case study – https://youtu.be/9boJ-Ai6QFM

10. Onset detection

Onset detection is the first step in analyzing an audio stream or song. Most of the tasks mentioned above require onset detection, in other words, detecting the start of an audio event. Onset detection was essentially the first task researchers tried to solve in audio processing.

White paper – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.989&rep=rep1&type=pdf

Onset detection is generally done by:

  • calculating a spectral novelty function
  • finding peaks in the spectral novelty function
  • backtracking from each peak to a preceding local minimum. Backtracking helps find segmentation points such that the onset occurs shortly after the start of the segment (see the sketch below).
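A minimal example of this recipe using librosa, assuming an input file; backtrack=True rolls each detected peak back to the preceding local minimum of the novelty curve:

```python
import librosa

y, sr = librosa.load("audio.wav")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)  # spectral novelty
onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env,
                                          sr=sr, backtrack=True)
print("onset times (s):", librosa.frames_to_time(onset_frames, sr=sr))
```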

Case study – https://musicinformationretrieval.com/onset_detection.html

Final notes

In this post, I covered some of the tasks worth considering when getting into audio processing. I hope you find it useful when tackling your own audio and speech projects.

Learn, compete, hack and get hired!
