Understanding your cat’s meows using a neural network

“Meow” — I’m sorry? “Meow!” — Oh, right! Here you go.

What if I could understand exactly what my cat is trying to tell me? We live in 2021, which is basically the future. How hard can it be?

What’s on your mind, little Loki? With the power of neural networks, maybe soon I’ll know.

A dataset of meows

A group of dedicated researchers from northern Italy has recently released a public dataset of cat vocalizations (let’s call them “meows”). 21 cats from two different breeds were exposed to three different situations while a microphone was listening:

  1. Brushing: The owner brushed the cat in a familiar environment.
  2. Isolation: The cat was placed in an unfamiliar environment for a few minutes.
  3. Food: The cat was waiting for food.

In total, the dataset comprises 440 audio files.

Dataset statistics

The dataset is not evenly split between those three situations.

Number of recordings per situation

Neither is it evenly split between cat breeds or the sex of the cat.

Number of recordings per breed
Number of recordings per sex of the cat

In fact, some cats occur way more often in the recordings than others. I don’t know why. Maybe “CAN01” is just very talkative whereas “NIG01” prefers to keep to himself?

Number of recordings per individual cat. “CAN01” appears most often and NIG01 least often in the data.

Looking at these distributions is important. When we train a neural network to classify a given voice recording, we want to make sure it performs better than simply guessing the most frequent label.

For example, always guessing “female” when asked for the cat’s sex would be correct in 78% of cases, because there are 345 recordings of female cats and only 95 recordings of male cats.

Any classifier that is supposed to be useful has to surpass this baseline of “informed” guessing.

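Feature               | Most frequent label | Absolute count        | Relative count = baseline accuracy
----------------------|---------------------|-----------------------|-----------------------------------
Situation (3 classes) | isolation           | 221 of 440 recordings | 50.2 %
Sex (2 classes)       | female              | 345 of 440 recordings | 78.4 %
Breed (2 classes)     | european_shorthair  | 225 of 440 recordings | 51.1 %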
Table that lists the most frequent label per feature. The numbers highlight which baseline accuracy a model has to achieve to be better than guessing.
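The baseline itself is simple arithmetic. As a minimal sketch (the function name `majority_baseline` is my own, and the label list is rebuilt from the counts quoted above rather than read from the dataset files), it could be computed like this:

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, share of samples carrying that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Sex labels rebuilt from the counts quoted above: 345 female, 95 male.
sex_labels = ["female"] * 345 + ["male"] * 95
label, baseline = majority_baseline(sex_labels)
print(f"{label}: {baseline:.1%}")  # female: 78.4%
```

Any model accuracy on the sex task should be read against this number, not against 50%.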

Now we have an idea of what our data distributions look like. In total, there are three interesting tasks we can have a model learn from the data: (1) What situation was the cat in, (2) what is the sex of the cat, and (3) what is the breed of the cat. It will be interesting to see if these tasks can be learned from the data at all. Let’s start preparing our data to train a model.

Turning audio into images

There are many ways to encode an audio signal before passing it into a neural network. For my project, I am choosing a visual approach: We plot the spectrogram of the audio recordings as an image.

This allows us to use well-established neural networks from the field of computer vision. Also, spectrograms look nice.

A spectrogram is a plot in which the position in the image represents a given frequency at a given point in time in the audio file. The brightness of a pixel represents the intensity of the audio signal at that frequency and time.

The following example shows one of the recordings as a spectrogram. The time axis runs from the top (zero) to the bottom of the image. The x-axis denotes the frequencies.

We turn our audio recordings into images by drawing their spectrogram
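To make the idea concrete, here is a dependency-free sketch of how such an image can be computed: slice the signal into short overlapping frames and take the magnitude of the Fourier transform of each frame. This is not the code used for the project. The frame size, hop length and Hann window are assumptions for illustration, and in practice an audio library would do this far faster.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Short-time Fourier magnitudes: one row per time frame, one column per frequency bin."""
    # Hann window to soften the edges of each frame.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_size], window)]
        row = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            row.append(abs(coeff))  # magnitude = pixel brightness
        rows.append(row)
    return rows

# A pure tone that completes 8 cycles per frame is brightest in column 8 of every row.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
print(len(spec), len(spec[0]))  # 7 33
```

The rows follow the orientation described above: time runs downwards, frequency runs from left to right.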

Image classification using a pretrained ResNet

Having turned our audio classification task into an image classification task, we can start with our model training. We are going to train three models for three different tasks:

  1. Given a spectrogram image, classify the situation the cat was in.
  2. Given a spectrogram image, classify the sex of the cat.
  3. Given a spectrogram image, classify the breed of the cat.

Over the past few weeks, I have been playing around with the fast.ai library, which provides convenient wrappers around the PyTorch framework, so I decided to use fast.ai for this project.

As in most deep learning frameworks, it is easy to re-use popular computer vision architectures in fast.ai. With one(-ish) line of Python, you have a capable neural network for image classification at your fingertips. It comes pre-trained so that you need fewer images for your task at hand.

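from fastai.vision.all import *  # assumes fastai v2, which exposes create_cnn_model and models

# n_classes: number of labels for the task (3 for situation, 2 for sex or breed)
model = create_cnn_model(
    models.resnet18,
    n_classes,
    pretrained=True)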

ResNets are a popular neural network architecture from 2015 that introduced residual connections – a mechanism that improves training behavior and allows the training of (very) deep networks.

The catmeows dataset is quite small, so I was satisfied with the smallest ResNet flavor (called ResNet-18). It has “only” 18 layers and it is still oversized for my 440 images.

The ResNet implementation expects square images as its input, so I took random square crops from the spectrograms during training. The crops were 81 x 81 pixels in size and could come from different points in time of the recording, but always covered the full frequency range of the spectrogram.
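A minimal sketch of such a crop, assuming the spectrogram is stored as a list of time rows that are each 81 frequency bins wide (the function name and the data layout are my assumptions, not the project code):

```python
import random

def random_time_crop(spectrogram, size=81, rng=random):
    """Cut `size` consecutive time rows from a spectrogram stored as a list of rows."""
    n_rows = len(spectrogram)
    if n_rows < size:
        raise ValueError("recording is shorter than the crop size")
    start = rng.randrange(n_rows - size + 1)
    return spectrogram[start:start + size]

# Hypothetical spectrogram: 200 time steps, 81 frequency bins per step.
image = [[t] * 81 for t in range(200)]
crop = random_time_crop(image, rng=random.Random(0))
print(len(crop), len(crop[0]))  # 81 81
```

Because only the start row is random, every crop keeps all frequency columns and the network sees a different slice of the same meow in each epoch.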

The pre-processed images as they go into the neural network. Here we are comparing recordings of female cats with male cats. Do you see a clear difference? I admit that I don’t.

Splitting the data for training and validation

When training a classifier it is important not to show all of your data to the model during training. You want to hold out some samples for validating the classifier during the training process. That way you get an idea if the model learns the training data by heart or if it actually learns something useful.

Sometimes it is fine to take a random percentage of the dataset as the validation set. In this case, I wanted to keep individual cats separate between the training and validation splits so that the model can’t cheat by memorizing the characteristics of an individual cat.

I took 4 individual cats out of the training data. Their recordings combined made up 66 samples of the dataset, which means 15% of the data was reserved for validation and only the remaining 85% were used for training.
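A split by cat rather than by recording could look like the following sketch. The record layout, the file names and the ID "XYZ01" are hypothetical; only the IDs "CAN01" and "NIG01" are taken from the text above.

```python
def split_by_cat(recordings, validation_cats):
    """Split (cat_id, filename) pairs so that no cat appears on both sides."""
    train, valid = [], []
    for cat_id, filename in recordings:
        (valid if cat_id in validation_cats else train).append((cat_id, filename))
    return train, valid

# Hypothetical file list for illustration.
recordings = [
    ("CAN01", "rec_001.wav"), ("CAN01", "rec_002.wav"),
    ("NIG01", "rec_003.wav"), ("XYZ01", "rec_004.wav"),
]
train, valid = split_by_cat(recordings, validation_cats={"NIG01"})
train_cats = {cat for cat, _ in train}
valid_cats = {cat for cat, _ in valid}
print(train_cats & valid_cats)  # set()
```

The empty intersection is the property that matters: a cat heard during training is never heard during validation.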

The results

For the three different tasks, the three models I trained achieved the following accuracy scores.

Task      | Classification accuracy | Guessing baseline (see above)
----------|-------------------------|------------------------------
Situation | 63.6 %                  | 50.2 %
Sex       | 90.9 %                  | 78.4 %
Breed     | 93.9 %                  | 51.1 %
Results: The accuracy scores of the three task-specific models. For easy comparison, I also list the guessing baseline as described above.
Results plot: Achieved model accuracy (blue) versus guessing baseline (grey).

Across all three tasks, the models performed well above the guessing baseline we determined earlier.

Let’s also take a look at the confusion matrix for each task. A confusion matrix counts the samples of the validation set by actual and predicted label, so it shows how many were classified correctly and which errors were made.

Confusion matrix that shows how well the classification of the situation worked. Some uncertainty shows: 10 samples are incorrectly classified as “waiting for food”, for example.
Confusion matrix of the task to classify the sex of the cat in the recording. 60 out of 66 were classified correctly. Not bad, I think.
Confusion matrix that shows how well the breed was classified. 62 out of 66 samples were classified correctly. I would not have expected this to work at all, to be honest.
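For readers who have not built one before, a confusion matrix is just a count of (actual, predicted) pairs. In the sketch below the labels are made up: only the totals (60 of 66 correct on the sex task) come from the results above, while the split between the classes is an assumption for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual label, predicted label) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of samples on the diagonal, where the prediction equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

# Made-up labels, chosen so that 60 of 66 predictions are correct.
actual = ["female"] * 50 + ["male"] * 16
predicted = ["female"] * 47 + ["male"] * 3 + ["male"] * 13 + ["female"] * 3
matrix = confusion_matrix(actual, predicted)
print(matrix[("female", "male")])   # 3
print(f"{accuracy(matrix):.1%}")    # 90.9%
```

The off-diagonal entries tell you which mistakes the model makes, which a single accuracy number hides.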

What to make of this

First of all, these are quick results. We haven’t built a super AI that understands every single cat in the world. (Yet.)

What these results mostly show are interesting aspects of the dataset: Most of all, I was surprised how well the sex and breed can be told apart by the model. As I made sure to separate individual cats across train and validation data, I do have some confidence that the model didn’t cheat. There may still be some information leakage that I’m not aware of, of course.

“Food”, “brush” and “isolation”. I’m afraid we’ll need a little more vocabulary so that Ginny can adequately explain to me the difficult situation of the Hamburg real estate market. “One room? Fine by me. But I think they tricked me on the square footage on this one”

What to improve

This is a small dataset. ResNet-18 is a big network. This mix can cause problems.

In my case, I am using a pre-trained version of ResNet, so the convolutional features don’t have to be learned from scratch. Still, I found myself re-running the training multiple times with varying success. I think with so little data it is still easy for the model to run into a local optimum and overfit on the training data.

Ideas for improvement:

Try freezing different layers and sets of layers of the network. It’s a tiny amount of data, we wouldn’t want to destroy the pre-trained features by accident. At the same time, spectrograms are not natural images, so fine-tuning probably makes sense.

Some additional data augmentation would surely help to enrich the training data. As these are not natural images but visualizations of an audio signal, I think some augmentation operations make sense (cropping at different points in time, jittering contrast and brightness to simulate volume fluctuations). Some others are more questionable (perspective transformations, cropping different frequency bands). I haven’t tried them so far, but they could very well improve the results.
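As an untested idea rather than something that was part of the training, brightness and contrast jitter could be sketched like this. The function name, the value ranges and the assumption that pixel values lie in [0, 1] are all mine.

```python
import random

def jitter(image, rng=random, max_brightness=0.1, max_contrast=0.2):
    """Randomly shift brightness and scale contrast of an image with pixel values in [0, 1]."""
    shift = rng.uniform(-max_brightness, max_brightness)
    scale = 1.0 + rng.uniform(-max_contrast, max_contrast)
    # Scale around mid-grey (0.5), add the shift, then clamp back into [0, 1].
    return [[min(1.0, max(0.0, (px - 0.5) * scale + 0.5 + shift)) for px in row]
            for row in image]

image = [[0.0, 0.25, 0.5], [0.75, 1.0, 0.5]]
augmented = jitter(image, rng=random.Random(42))
print(augmented)
```

One shift and one scale are drawn per image, so the whole recording gets louder or quieter together, which is closer to a volume change than per-pixel noise would be.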

To learn more about the data, it would be interesting to extract quantitative audio characteristics and train a logistic regression or random forest on the data. These models are easier to interpret and could help to understand if the models look at something meaningful in the data or if there is some data leakage that allows the models to cheat.

Conclusion

Playing with public datasets is fun! You should try it.

I may continue with this pet project (pet! get it?) or start something fresh with the next dataset that looks interesting.

If you’ve found an issue in my data or training setup, please let me know.

You can find the complete project code in a messy Jupyter notebook on GitHub.

Turns out I don’t need a neural network to let me know: Ginny is waiting for food.

Building an AI-powered juggling trainer in one afternoon

Ever since I picked up my first set of bean bags as a kid, juggling has been a hobby that has stayed with me over the years. In my later teens and during my time at university, one of my part-time jobs was being a juggling teacher. I worked at a local youth club, at events and fairs, and had the chance to teach juggling to many people, from children just 4 or 5 years old to seniors in their late 70s.

Juggling? Works everywhere. This is me, at a tea plantation on the Azores in 2018. How long till AI can do the same?

Fast forward to today. In the last few years, I have been working in the field of AI, building computer vision systems with my team that understand human motion and assist people in learning how to move correctly (e.g. with fitness exercises in our latest product).

Doesn’t this sound like something I should combine with my long-time hobby? While every person learns differently and at their own pace, I think juggling is a great skill to learn yourself while being assisted by an AI. When it comes to juggling, I’ve observed most people struggle in a similar manner to overcome common obstacles as they progress — a perfect example to put into an application.

The idea

Here’s the idea: You pick up a set of juggling balls and position yourself in front of your webcam. Step by step, you progress through basic juggling moves as software analyses the live video and provides feedback: Is your juggling pattern stable? Should you throw higher or lower? Are your hands positioned correctly? Is your rhythm fine?

With this in mind, I sat down one weekend this winter to build an AI-powered juggling teacher. In this post, I’ll show you how I did it.

Understanding what’s happening inside a video

To analyze the video of a person learning how to juggle, we’ll train a neural network (also “neural net” or “model”). If you are not familiar with neural networks, don’t be intimidated: It’s a concept that sounds fancy and comes from the field of Artificial Intelligence, but ultimately you can imagine it as a function, or a simple black box: We input a video clip and it returns as the output some information about that video.

Visualisation of my idea at a high level: A video stream is encoded by its pixel values and passed through a neural network. The network digests the visual information to produce a classification decision: What action is happening in the video?

We’ll set up our neural network to be able to classify a given video clip: Given a video, which visual class does it belong to? A class in our case is the name of an action that is happening in the video – like “throwing 1 ball and dropping it”. In our application, we’ll use that visual class in order to give appropriate feedback to the user.

How to train the neural network

But how does the neural network know what to do? How does it know the difference between correctly tossing a ball versus dropping a ball? Well, it has to learn it first, which means that we need to train it.

Training a neural network means presenting it with example video clips of all the visual classes it should be able to recognize. Initially, the neural net doesn’t know much. It simply guesses what’s inside the video. If a guess is incorrect, we can adjust the internal parameters of the function (= of the neural network) so that the network is improved based on the error it just made. We’ll do this over and over again with all videos we’ve prepared for training until the network doesn’t get any better. At that point, we stop training and move on to build the application around it. But first, we need to prepare some video data for the training process.
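The loop described above (guess, measure the error, adjust, repeat until nothing improves) fits in a few lines if we shrink the problem. The sketch below is a toy: a classifier with a single input and two parameters stands in for the neural network, and the data, learning rate and stopping tolerance are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr=0.5, tolerance=1e-4, max_epochs=1000):
    """Fit a tiny one-input classifier; stop once the average error stops shrinking."""
    w, b = 0.0, 0.0                       # initially the model knows nothing
    previous_loss = float("inf")
    for _ in range(max_epochs):
        loss = 0.0
        for x, label in samples:
            guess = sigmoid(w * x + b)    # the model's guess, between 0 and 1
            error = guess - label         # how wrong the guess was
            w -= lr * error * x           # nudge the parameters against the error
            b -= lr * error
            loss += -math.log(guess if label == 1 else 1.0 - guess)
        loss /= len(samples)
        if previous_loss - loss < tolerance:   # no real improvement any more
            break
        previous_loss = loss
    return w, b

# Toy data: the label is 1 for positive inputs and 0 for negative ones.
samples = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = train(samples)
print(sigmoid(w * 1.5 + b) > 0.5)  # True
```

A real video network has millions of parameters instead of two, but the rhythm of the loop is the same.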

Data collection

To train the neural net, we need a training dataset — that is a collection of video clips, each belonging to one distinct visual class we want the net to be able to recognize later. For the juggling use-case, I wanted the network to recognize the following:

  • How many juggling balls is the person using (1, 2, 3, or zero)?
  • Common mistakes people make when learning how to juggle: Throwing too high or too low, not standing still, having the hands too close or too high in the air, and a few others.
  • It is also good to add a few background and contrastive classes — examples of other things that can happen in the video but aren’t exactly part of the juggling activity. I’ve recorded videos of an empty video frame, a person entering or leaving, reaching towards the webcam when controlling the computer, and more.

Example clips from the class catalog:

  • 2 balls: A single repetition
  • 3 balls: A good pattern, continuously
  • Throwing too high
  • Throwing 2 balls at the same time
  • Continuous juggling, but not at a steady rhythm
  • Entering the webcam view
  • Pretending to juggle, no objects used

All in all, this class catalog contains 27 different classes. I’ve recorded 545 video clips, each 3 seconds long. This took me around 1 hour. 70 videos went into a hold-out validation set so that I ended up using 475 videos to train the network. Is this enough data? We’ll discuss this in a bit. First, let’s have a look at the actual neural network.

The neural network

Neural networks come in all kinds of flavors. For the juggling project, we want a network that can process a video stream, digest its visual characteristics to produce a classification output, and be compact enough to run in real-time.

I got all of this out-of-the-box by using the SDK we are developing and currently open-sourcing at Twenty Billion Neurons: SenseKit, a project (work in progress) that makes it easy to train a video classifier without needing millions of videos.

The neural network architecture is a MobileNet-style neural network. Models of this architecture are popular for computer vision applications because they are designed for visual data while being compact enough to run in real-time on many devices, even smartphones. Using 3D convolutions instead of 2D convolutions lets the feature extractors span time as well as space, so they can pick up motion in the video.

These “deep” neural networks (= many layers of feature extractors) require a lot of data to be able to learn useful features. One trick to get away with less data is called transfer learning: We don’t train the network from scratch. Instead, we take an already trained version and only slightly re-train it for our specific juggling task. In fact, the SenseKit version of the network comes with a pre-trained model. This means that my handful of juggling videos is enough to teach the network about juggling and the different kinds of juggling mistakes we want the application to react to.

Typically, training a video classification network requires thousands, if not millions of videos. With that in mind, it’s quite impressive that I could teach the network a completely new set of activities with just a few hundred videos. In addition, not training from scratch gives us a huge speedup. Training the juggling net took less than 10 minutes on a GPU machine (NVIDIA GeForce GTX 1080 Ti). For comparison, these big networks can often take days to train from start to finish.

The juggling trainer in action

Having trained the network, I built a small juggling trainer application in Python that takes care of the following:

  • Neural network input. The application reads the live video stream from the webcam and feeds all frames to the neural network. Internally, multiple frames together are just like one video clip to the network. This is the same behavior that we mimicked during the training process, only then we were reading the frames from the video clips in our training data.
  • Neural network output. Every time we pass new data to the neural network, it produces an output: The visual class that the network determined from the video input.
  • Extract juggling information. As we’ve picked our class catalog to encode different information (number of objects, the action performed, quality of action performed), we can extract the different pieces from the recognized class name. For example, any prediction of a class name that starts with 2b_... will be interpreted as “2 balls” being present in the video.
  • User interface. UI is fancy for saying that the application opens a window to show the webcam stream and overlay it with the juggling information we’ve extracted.
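The first three steps can be sketched in a few lines. This is not the application code: the sliding window of frames, the clip length, the stand-in classifier and every class name except the `2b_` prefix mentioned above are assumptions for illustration.

```python
import re
from collections import deque

def parse_ball_count(class_name):
    """Read the ball count from names like '2b_...'; None if there is no such prefix."""
    match = re.match(r"(\d+)b_", class_name)
    return int(match.group(1)) if match else None

def run_on_stream(frames, classify, clip_length=4):
    """Feed a sliding window of the latest frames to `classify` and yield its findings."""
    window = deque(maxlen=clip_length)   # the oldest frame drops out automatically
    for frame in frames:
        window.append(frame)
        if len(window) == clip_length:
            class_name = classify(list(window))
            yield class_name, parse_ball_count(class_name)

# Stand-in for the neural network, with made-up class names.
def fake_classify(clip):
    return "2b_single_repetition" if clip[-1] % 2 == 0 else "entering_view"

# Integers stand in for webcam frames here.
for name, balls in run_on_stream(range(6), fake_classify):
    print(name, balls)
```

Encoding the information in the class name keeps the application logic trivial: a new class only needs a sensible name to show up correctly in the interface.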

Based on the juggling information I can extract from the recognized class name, the interface displays the following information:

  • Object count: How many balls is the person juggling?
  • Trick performed: If the user performs a trick correctly (3 ball shower in the video), they receive positive feedback.
  • Quality of juggling pattern: If the juggling pattern is stable, give positive feedback; if it’s unstable, give negative feedback.

This is what it looks like in action:

A short video of what the juggling teacher (more precisely: the neural network) looks like in action

Limitations

  • No data diversity. There is exactly 1 person in the training data, plus the demo video was recorded with the same person (yours truly). From other experiments at work I know that the pre-trained network transfers very well to other people, but to move this juggling case forward, I’d need some data recorded by multiple people in different settings.
  • Some classes are unreliable. I did play around with more nuanced feedback: Are you throwing too high or too low, are you not throwing at a steady pace, and similar. For these more subtle differences, the predictions aren’t stable enough yet. Looking at the training data, I found that I didn’t record those “mistakes” in a consistent fashion. I think cleaning the training data a little and adding some clearer recordings could help.
  • A demo, not a juggling trainer yet. Right now, there is no application logic aside from the debug display shown in the video. What I envision is a step-by-step guide to walk the user from 1 ball tosses all the way to a stable 3 ball pattern and maybe their first trick.
  • Not shareable. I’ve trained the neural network based on an early internal version of the SDK, so the license currently doesn’t allow me to share the network freely on the internet. There’s a research version of the model coming, so I may port my juggling code to that one. In addition, it would be cool to package the juggling demo up in an accessible format, like a mobile app or an in-browser demo. Let’s see.

A glance at the past

The idea to combine juggling and computer vision isn’t new, of course. Not to the world (check YouTube), but also not to me. Back at university (think 2014), two friends and I used the Kinect depth sensor to look at juggling patterns. It took us a few weeks and some failed attempts to produce a demo, held together by some carefully tuned thresholds. It was fun and we were able to produce some entertaining visualizations, but the demo was prone to misclassifications. To actually react to a person’s juggling pattern wasn’t feasible with our solution back then.

An alternative approach from 2014: Using the Kinect depth sensor to localize hands and balls and use the coordinates for some fun visualizations. Limitations: No understanding of good or bad juggling patterns, no recognition of juggling mistakes.

Conclusion: A lot is possible in one afternoon

Throwing together a few videos and fine-tuning a neural network: It’s amazing to see and experience how much is possible with the tooling that’s available in 2021. Yes, I’ve only built a prototype of a demo so far — but the goal of building a real juggling trainer powered by computer vision isn’t out of reach. Looking back at my early attempts with the Kinect six years ago and comparing it to my recent attempt, it’s almost unreal to see that the same can be achieved in just one afternoon of work. I don’t know if I’ll push the project further than this, but it sure was a lot of fun.

If you have an idea for a similar computer vision project, I recommend you follow the progress of SenseKit. It comes with some built-in demos and provides everything you need to train your own video classification network similar to my juggling project.