Machine learning

AI learns to decipher images based on spoken words—almost like a toddler

Neural networks keep getting better at unsupervised learning.

Photograph of a passenger jet. Given this picture and audio of the word "airliner," a neural network identifies the portions of the image where there's an airplane (indicated by the red lines). The software learned to do this entirely by looking at 400,000 pictures, each paired with a brief, free-form spoken description of the scene.

Babies learn words by matching images to sounds. A mother says "dog" and points to a dog. She says "tree" and points to a tree. After repeating this process thousands of times, babies learn to recognize both common objects and the words associated with them.

Researchers at MIT have developed software with the same ability: it learns to recognize objects in the world using nothing but raw images and spoken audio. The software examined about 400,000 images, each paired with a brief audio clip describing the scene. By studying these image-audio pairs, the software learned to correctly identify which portions of a picture contained each object mentioned in the audio description.

For example, this image comes with the caption "a white and blue jet airliner near trees at the base of a low mountain."

A video shows the software labeling the different parts of the image as the audio caption plays—first highlighting the airplane, then the trees, and finally the mountain.

What's really remarkable about this software is that it was able to do this without any pre-existing knowledge of either objects in the world or the English language. This isn't the first research to match images to spoken descriptions, but earlier efforts relied on neural networks that were pre-trained on ImageNet, a popular database of images labeled with textual categories.

The new MIT software, in contrast, learns to recognize words and images entirely by examining raw images and audio files. It doesn't have any pre-existing knowledge about common objects in the world, and the software doesn't have any hard-coded ideas about how to parse language.

Like a lot of modern image-recognition software, the MIT team's program is built around a convolutional neural network, a type of network that is particularly adept at recognizing the same pattern of pixels wherever it appears in an image. The MIT software also has a separate deep neural network for processing the spoken audio, and it likewise uses convolutional layers.
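To get a feel for what that kind of two-branch architecture can look like, here is a minimal PyTorch sketch. It is not the MIT team's actual implementation; the layer sizes, class names, and spectrogram input are illustrative assumptions. One convolutional branch turns an image into a grid of feature vectors, and a second turns an audio spectrogram into a sequence of feature vectors.

```python
# Minimal illustrative sketch (not the published code) of two convolutional
# branches: one for images, one for audio spectrograms.
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Maps an image to a spatial grid of feature vectors."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, images):           # images: (batch, 3, H, W)
        return self.conv(images)         # (batch, embed_dim, H/4, W/4)

class AudioBranch(nn.Module):
    """Maps an audio spectrogram to a temporal sequence of feature vectors."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, embed_dim, kernel_size=5, padding=2),
        )

    def forward(self, spectrograms):      # spectrograms: (batch, n_mels, T)
        return self.conv(spectrograms)    # (batch, embed_dim, T/2)

# Example: embed one image and one spectrogram.
img_net, aud_net = ImageBranch(), AudioBranch()
img_feats = img_net(torch.randn(1, 3, 224, 224))   # -> (1, 512, 56, 56)
aud_feats = aud_net(torch.randn(1, 40, 1000))      # -> (1, 512, 500)
print(img_feats.shape, aud_feats.shape)
```

The key point of the design is that neither branch is told what a word or an object is; each just produces feature vectors that the next step can compare.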

The outputs of these two networks are then combined by comparing each region of the image against each portion of the audio clip. This structure lets the software draw correlations between parts of the image network and parts of the audio network that "light up" at the same time.
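One rough way to picture that comparison step is as a grid of similarity scores: the dot product of every spatial cell in the image feature map with every time step in the audio feature sequence. The sketch below is my own illustration under that assumption, using randomly generated features rather than the paper's actual networks or training loss.

```python
# Illustrative sketch: compare every image region against every audio frame.
import torch

embed_dim = 512
image_features = torch.randn(embed_dim, 14, 14)   # (channels, height, width) from an image CNN
audio_features = torch.randn(embed_dim, 128)      # (channels, time steps) from an audio CNN

# matchmap[h, w, t] is large when image region (h, w) and audio frame t
# produce similar feature vectors.
matchmap = torch.einsum('chw,ct->hwt', image_features, audio_features)

# Collapsing the matchmap (e.g., max over time, then mean over space) yields a
# single score for the image/caption pair; training can then push matching
# pairs to score higher than mismatched ones.
score = matchmap.max(dim=2).values.mean()
print(matchmap.shape, score.item())
```

Slicing the matchmap at the moment a word is spoken is what produces the kind of highlighted image regions shown in the examples.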

The result looks something like this:

Clockwise from the top-left, the MIT software generates heatmaps for images matching "woman," "bridge," "train," "vehicles," "clothes," and "skyline," respectively.
David Harwath et al.

Here each photo has a heatmap directly below it showing where the algorithm believes the object in question is located. The software isn't perfect; in the upper-left picture, for example, it appears to identify a grocery store shelf as a woman. Still, it's remarkable how well the software infers the structure of photos and spoken audio without human programmers having explicitly encoded any pre-existing knowledge about the world.
