It’s the end of the semester and as such it felt appropriate to get one final blog post in. To follow up a bit on the last post, I began reading Artificial Intelligence: A Modern Approach. Specifically, I began reading about uncertainty. As I read I realized that my initial goals have nothing to do with uncertainty. What I am trying to do is compare a recorded sound from the user against a predefined sound or set of sounds. There’s nothing uncertain about the comparison since I know what I’m trying to match the user against. Uncertainty is what I’m going to have when I start trying to figure out what beat patterns the user is making. For now, I have a slightly simpler problem.
Trying to match a user’s input to a sample involves a bit of high-level math and a solid understanding of audio digitization. From speaking with some mathematics experts and computer science professors, I’ve learned that I’m definitely going to need to use Fourier transforms on the audio signal. In practice, this is a bit difficult to understand.
From what I remember way back in a sound design class at a previous school, when you create a digital audio recording you’re actually capturing the sound as a stream of tiny digital values known as samples. Generally, a recording rate of 44.1kHz is used, meaning 44,100 samples are captured each second. Each of these samples is a certain number of bits, with 16 bits being the most common for recording. This bit depth determines how wide a range of values each sample can represent. So, for example, with 4-bit samples, every 1/44,100th of a second would be assigned 1 of 16 different values (2^4 = 16). That’s not a lot, which is why 16 bits is generally used, since that gives 65,536 values. On top of this, a lot of digital audio, especially music, has multiple channels recorded into a single signal (think stereo: left and right channels). Basically, under the hood of an audio recording we have a very large stream of these little numerical packets we call samples.
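The arithmetic above is easy to sanity-check in code. This is just a little sketch of my own (the class and method names are made up, not from any library), showing how bit depth and channel count translate into value ranges and bytes per sample frame:

```java
// Illustrative arithmetic only; names here are my own, not a real API.
public class SampleMath {

    // Number of distinct values an n-bit sample can represent: 2^n.
    public static long valuesForBits(int bits) {
        return 1L << bits;
    }

    // Bytes per frame: one sample per channel, each sample bits/8 bytes wide.
    public static int bytesPerFrame(int bitsPerSample, int channels) {
        return (bitsPerSample / 8) * channels;
    }

    public static void main(String[] args) {
        System.out.println(valuesForBits(4));   // 16
        System.out.println(valuesForBits(16));  // 65536
        // 16-bit stereo: 2 bytes * 2 channels = 4 bytes per frame.
        // At 44,100 frames per second that's 176,400 bytes of audio per second.
        System.out.println(bytesPerFrame(16, 2) * 44100);
    }
}
```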
The big issue I’ve faced is how exactly to deal with the audio signals and how to process the samples. I was told to use a Fourier transform, but I haven’t quite figured out how to do that yet. Also, since a lot of the input streams I’m seeing in my code show each sample as 4 bytes, I’m not really sure what I’m looking at. Luckily, I recently found a library called OpenIMAJ that I’m hoping will do a bit of the heavy lifting for me. Specifically, I’m going to be using it to perform the Fourier transform so that I can look at the sound in the form I need for comparison.
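One guess about those 4-byte chunks: given the stereo recording described above, 4 bytes is exactly what one frame of two interleaved 16-bit samples (left then right) would occupy. Assuming that layout and little-endian byte order (common for WAV-style PCM, though I haven’t confirmed it for my streams), decoding a frame back into signed sample values might look like this:

```java
public class FrameDecoder {
    // Decode one interleaved stereo frame (4 bytes: left lo, left hi,
    // right lo, right hi) into two signed 16-bit sample values.
    // ASSUMPTION: 16-bit little-endian PCM, two channels.
    public static short[] decodeFrame(byte[] buf, int offset) {
        short left  = (short) ((buf[offset]     & 0xFF) | (buf[offset + 1] << 8));
        short right = (short) ((buf[offset + 2] & 0xFF) | (buf[offset + 3] << 8));
        return new short[] { left, right };
    }
}
```

The `& 0xFF` masks matter: Java bytes are signed, so without the mask the low byte would sign-extend and corrupt the combined value.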
The form of the audio I’m interested in is known as the frequency domain. You see, the Fourier transform takes the recorded audio, which comes into my program in time-domain form, and converts it to frequency-domain form. The time-domain form basically just tells me the amplitude of the signal at any given time. The frequency domain, on the other hand, shows me which frequencies are present, and how strong each one is, over a short slice of time. In case it’s not obvious, to compare two sounds you’re going to want to know the pitch (frequency) of those sounds, so this is pretty important.
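To make the time-domain/frequency-domain idea concrete, here is a minimal, do-it-by-hand sketch of the discrete Fourier transform (not OpenIMAJ’s implementation, which I assume uses a proper FFT internally). Feed it a pure sine wave and the magnitude spikes at exactly the bin matching the wave’s frequency, which is what makes pitch comparison possible:

```java
public class NaiveDft {
    // Magnitude of DFT bin k for a real-valued signal x, by direct summation.
    // This is O(N^2) and only for illustration; real code would use an FFT.
    public static double magnitude(double[] x, int k) {
        double re = 0, im = 0;
        for (int n = 0; n < x.length; n++) {
            double angle = 2 * Math.PI * k * n / x.length;
            re += x[n] * Math.cos(angle);  // correlate with a cosine at bin k
            im -= x[n] * Math.sin(angle);  // ...and a sine at bin k
        }
        return Math.sqrt(re * re + im * im);
    }
}
```

For a signal of N samples containing a sine with exactly 8 full cycles, `magnitude(x, 8)` comes out near N/2 while other bins sit near zero: the time-domain wiggle has become a single frequency-domain peak.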
OpenIMAJ also offers some really nifty visualizations to help you understand the data a bit better by creating a visual representation of the sound. So far I’ve been able to get my program to record audio from the user and display a spectrogram representing that audio. My hunch is that somehow this spectrogram is going to come in handy when comparing sounds. Basically, what the spectrogram does is show frequency, amplitude, and time on a single 2-dimensional graph: time runs along one axis, frequency along the other, and a varying color value represents the amplitude at each point.
I’m looking forward to learning more about interpreting digital audio data. I’m hoping I get a chance to look into it over the summer. But first, I think I’m going to take a well-deserved break.
(You can check out the very simple Java application I’ve made that records audio and displays a spectrogram of it on this GitHub page. The other spectrogram it shows is just a sample sound wave for visual comparison.)