Do Neural Networks Need To Think Like Humans?

Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér. As convolutional neural network-based image
classifiers are able to correctly identify objects in images and are getting more and
more pervasive, scientists at the University of Tübingen decided to embark on a project
to learn more about the inner workings of these networks. Their key question was whether they really
work similarly to humans or not. Now, one way of doing this is visualizing
the inner workings of the neural network. This is a research field on its own, I try
to report on it to you every now and then, and we talked about some damn good papers
on this, with more to come. A different way would be to disregard the
inner workings of the neural network, in other words, to treat it like a black box, at least
temporarily. But what does this mean exactly? Let’s have a look at an example! And in this example, our test subject shall
be none other than this cat. Here we have a bunch of neural networks that
have been trained on the classical ImageNet dataset, and a set of humans. This cat is successfully identified by all
classical neural network architectures and most humans. Now, onwards to a grayscale version of the
same cat. The neural networks are still quite confident
that this is a cat, some humans faltered, but still, nothing too crazy going on here. Now let’s look at the silhouette of the
cat. Whoa! Suddenly, humans are doing much better at
identifying the cat than neural networks. This is even more true when we’re only
given the edges of the image. However, when looking at a heavily
zoomed-in image of the texture of an Indian elephant, neural networks are very confident in their
correct guess, whereas some humans falter. Ha! We have a lead here. It may be that as opposed to humans, neural
networks think more in terms of textures than shapes. Let’s test that hypothesis. Step number one: Indian elephant. This is correctly identified. Now, cat — again, correctly identified. And now, hold on to your papers — a cat
with an elephant texture. And there we go: a cat with an elephant texture
is still a cat to us humans, but is an elephant to convolutional neural networks. After looking some more at the problem, they
found that the most common convolutional neural network architectures that were trained on
the ImageNet dataset vastly overvalue textures over shapes. That is fundamentally different from how we
humans think. So, can we try to remedy this problem? Is this even a problem at all? Neural networks need not think like humans,
but who knows, it’s research – we might find something useful along the way. So how could we create a dataset that would
teach a neural network a better understanding of shapes? Well, that’s a great question, and one possible
answer is — style transfer! Let me explain. Style transfer is the process of fusing together
two images, where the content of one image and the style of the other image are taken. So now, let’s take the ImageNet dataset,
and run style transfer on each of these images. This is useful because it repaints the textures,
but the shapes are mostly left intact. The authors call it the Stylized-ImageNet
dataset and have made it publicly available for everyone. This new dataset will no doubt coerce the
neural network to build a better understanding of shapes, which will bring it closer to human
thinking. We don’t know if that is a good thing yet,
so let’s look at the results. And here comes the surprise! When training a neural network architecture
by the name ResNet-50 jointly on the regular and the stylized ImageNet dataset, after a
little fine-tuning, they have found two remarkable things. One, the resulting neural network now sees
more similarly to humans. The old, blue squares on the right mean that
the old thinking is texture-based, but the new neural networks, denoted with the orange
squares, are now much closer to the shape-based thinking of humans, which is indicated with
the red circles. And now hold on to your papers, because two,
the new neural network also outperforms the old ones in terms of accuracy. Dear Fellow Scholars, this is research at
its finest – the authors explored an interesting idea, and look where they ended up. Amazing. If you enjoyed this episode and you feel that
a bunch of these videos a month are worth 3 dollars, please consider supporting us on
Patreon. This helps us get more independent and create
better videos for you. You can find us at,
or just click the link in the video description. Thanks for watching and for your generous
support, and I’ll see you next time!
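
The comparison between texture-based and shape-based decisions described in the episode can be boiled down to a single number. The following is a minimal sketch, assuming each texture-shape conflict image (such as the cat with elephant skin) comes with one shape label and one texture label. The function name `shape_bias` and the sample data are made up for illustration, and the paper's exact evaluation protocol may differ.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among cue-conflict images.

    Only images where the prediction matches either the shape label or
    the texture label are counted; all other predictions are ignored.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched a shape or texture label")
    return shape_hits / decided


# Hypothetical cue-conflict set: each image has the shape of one class
# and the texture of another (e.g. a cat silhouette with elephant skin).
shapes = ["cat", "cat", "car", "bird"]
textures = ["elephant", "clock", "bottle", "dog"]

# Made-up predictions of a texture-driven and a shape-driven classifier.
texture_driven = ["elephant", "clock", "bottle", "bird"]
shape_driven = ["cat", "cat", "car", "chair"]

print(shape_bias(texture_driven, shapes, textures))  # 0.25
print(shape_bias(shape_driven, shapes, textures))    # 1.0
```

A value near 0 means the classifier mostly follows texture, and a value near 1 means it mostly follows shape.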

Author: Kevin Mason

100 thoughts on “Do Neural Networks Need To Think Like Humans?”

  1. 1:09 who was the numbnut that couldn't identify a cat on a cat image? You're embarrassing us in front of our future overlords!

  2. This is not surprising though, is it? They augmented the data heavily, of course they would get better results. Synthetic augmentation is btw not well researched yet.

  3. Amazing, just amazing! While not all Neural Networks do need to think like humans, surely there are lots of applications for systems somewhat more similar to us.

  4. Oh! Here's another idea!
    The Why Can't We Have Both approach:
    Basically, take two arbitrary images, blend them via style transfer, and then teach a single network from such combined images to both classify and reconstruct both the texture image and the object image.
    Obviously the style transfer process is VERY irreversible, but in trying to achieve this, the network ought to get really good at both kinds of thinking.
    Like, take that image of the elephant skinned cat. Of course I see that as a cat. But I also recognize the texture as that of elephant skin. On the AI side it's basically misunderstanding which exact question it's supposed to answer. But by making it answer *both*, it ought to become really good at either question.

    Also I bet this kind of deal is also gonna make AIs better at, say, recognizing cartoons or similarly stylized representations as what they are supposed to be.

  5. Have people trained other networks on SIN+IN yet? I am using Xception for a project and I could see how this might be useful.

  6. 2:44
    One possible explanation is that neural networks focus on micropatterns instead of macro concepts. This approach works so well that they don't have to learn the more complex macroshapes.

    With more specialised training this could be improved.

    Edit: oh the rest of the video had a similar idea :D.

  7. Is this really so different from reward shaping though….? Seems rather intuitive as a hypothesis prior to even performing the experiment…..

  8. I was thinking about training a net on an image set where all images are two overlapping scenes with one at 50% transparency. Humans can very easily parse the two images and build up a mental model of each scene individually, filtering them out. So, it'd be a pretty easy dataset to generate too, just write a script to take two source images and superimpose them. Then train the net with the superimposed image as input and the two source images as output. Not really sure of an application, just an idea that came up when I was watching a movie and it had a fading transition.
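
The script described in this comment could look roughly like the sketch below. It is a minimal illustration, assuming grayscale images stored as plain lists of rows. The names `superimpose` and `make_training_pair` are hypothetical, and real images would more likely be handled with an image or array library.

```python
def superimpose(image_a, image_b, alpha=0.5):
    """Blend two equally sized grayscale images given as lists of rows."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


def make_training_pair(image_a, image_b):
    """Return (network input, target outputs) for one synthetic example."""
    return superimpose(image_a, image_b), (image_a, image_b)


# Two tiny 2x2 "scenes" with pixel values in [0, 255].
scene_a = [[0, 255], [255, 0]]
scene_b = [[255, 255], [0, 0]]

blended, targets = make_training_pair(scene_a, scene_b)
print(blended)  # [[127.5, 255.0], [127.5, 0.0]]
```

With `alpha=0.5` both scenes contribute equally, which matches the 50% transparency in the comment. The network would receive `blended` as input and be asked to output both entries of `targets`.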

  9. These quirks leading to understanding how something alien sees the world are interesting. I guess it is obvious once you know: a pixel-based system is going to give more weight to texture. And I guess it is a problem if it is feeding a self-driving car with an image-based moral code, when it spots a kid dressed as a cat in the road; risk slamming the brakes hard and harming humans, or squish the cat (someone’s kid). I guess as issues are spotted and all AI gets weighted in new directions, we are kinda going to make a pseudo-human brain, slowly altered and shaped over time to see the world like us, in more ways than one. But I don’t think we need something similar to our results to make a better world, we are winging it. I don’t think we need a general AI human level system to vastly improve life, we need AI good at dedicated subordinate roles; debating, cleaning, building, cooking, medical, law, finance, art, coding – more practical to have a set of interfacing dedicated systems than one system that can learn to do everything – one AI system having amazing skills at complex legal issues yet useless at coding. You don’t need to be anywhere near a point of singularity for things to get very amazing, very fast.

  10. Well yeah obviously convolutions only see local patches instead of larger geometry. We need to allow more inter-pixel connections for these types of features

  11. Human children learn through experience that shapes are more significant than textures. If the neural networks were embedded in bodies that had to learn how to act in the world, they would also see shape as more important and wouldn't need to see a cat with the texture of other things in order to care more about the cat's shape. Shape is more important when there is poor lighting, or when you have to physically interact with the object. For an AI to really think like a human, you have to give it the same experiences as a human. Either that, or embed every abstraction into training data.

    Asking what property of objects correctly identifies them is the same as asking which abstractions are useful, which is the same as asking how you want to interact with the world. Generating training data that already tells AI what abstractions matter is a way of getting around the task of learning what actually matters. Implicitly, the goals and values of the data scientist are transferred to the neural network. That introduces possibilities of bias, but from a research perspective, I think it could be viewed more as cheating, or skipping a step. From an engineering perspective, it doesn't really matter, because you still get the neural network that knows things.

    I think the best AI's will be trained in simulations of our world, and will end up behaving surprisingly similarly to humans.

  12. "neural networks think more in terms of textures than shapes": correction, most CNNs think in terms of textures. Capsule networks, deformable CNNs and other such variants have a greater ability to understand spatial relations.

  13. Not surprising at all. Humans reconstruct the 3D world they see. They identify materials (even reflective ones), light sources, geometry, and objects based on their context and 3D appearance. They should work more on representing 3D shapes in a neural network instead of throwing millions of pictures to remember at the net. I didn't have to see millions of cats to learn to identify one. I noticed the differences to similar animals and stored that. It wasn't training. One key point for a neural net would be to learn to predict different appearances of objects from different perspectives, and also possible deformations. Google tried a similar thing: they trained a neural renderer that could reconstruct simple 3D game levels observed before using just the 2D image. That's the better direction to go for object recognition.

    I knew this already 8 years ago. Not that it depends on textures as much, but that combining lots and lots of 2D feature detectors in a kind of fuzzy logic network for direct object recognition is very limited and not natural or very clever or efficient. Think about it: if you have only seen brown cats in your life and suddenly there is a blue cat, you would recognize it and categorize it as a blue cat, while the neural network wouldn't even see a cat and would be nowhere near able to say it's a funny blue cat. Why? Because it is only able to match against the training set and output a "how similar is it to x?" value. It needs to find and understand the parameters that "construct" or "render" the blue cat to actually understand it. That's a way harder problem, but a much more natural and efficient take on it.

    Oh, how to do that? Easy. Use a normal 3D renderer, procedurally generate objects, use an autoencoder with perspective and lighting as additional input, and reconstruct (learn to generate) a different perspective with that autoencoder. After that is done for a while, throw pictures at it and have it learn further how to reconstruct the images. Use this autoencoder as a training set generator and use a new network to learn the reverse, from picture to the low-dimensional (geometry-based) autoencoder input. Use this reversed output to run and learn a classifier on it. Thanks, and you're welcome, humanity 🙂 I wish I had the time to actually do this myself.

    EDIT: just had the idea to not only use a static autoencoder but instead use an "autoencoder" that encodes into a (vector) sequence as "bottleneck" and then decodes it. This way it should be much, much more capable of encoding information. When I imagine a scene in my head I usually scan through the objects to construct the "overview", then I can freely move around in the scene. Interesting observation.

  14. No. Unless you want the neural network to be part of a dynamic cognition that humans can trust will always be benevolent.

    I've presented a theory about how to do it:

  15. Regarding the paper:

    Another way to think about this is that the convolutional networks are weighting the low resolution patterns that make up any given classified entity over the high resolution ones….this can be modulated (I hypothesize) by changing the model itself …rather than use the typical multiple layers of convolution equally weighted layer by layer …use a non linear weighting for firing between layers (both forward and backward propagating).

    Another experiment could be to simply scale the actual convolution kernels with a similar non linear curve (rather than the typical geometric progression). This should allow large scale patterns (really…what are shapes if not large scale patterns? A cat as a class is really a collection of a set of specific low resolution patterns (its parts) as composed into collections that correlate parts to the known biomechanics of a cat….if anything all a neural network is doing is creating an extremely efficient but dense tree of these interrelationships across pattern scale….but doing so using a fixed kernel march will bias low resolution/scale pattern over high resolution/high scale ones.

    I wonder if the researchers considered these possibilities.

  16. Neural networks already surpass the human brain in several different categories. Now we want to combine them together and build them in hardware.

  17. This is very cool! My only concern is that this doesn't really fix the problem. It just sort of tells the NN "textures don't matter." While it's true that shapes are generally more important, if you see a furry tortoise, that's not a tortoise. That's something else entirely.

    At least to my knowledge the way that human vision works is as follows…

    1. Cones receive photons
    2. Each cone suppresses the signal of those cones around it, thereby increasing contrast along edges
    3. Red and Green cones are compared with a diff that favors green input. The output of this feeds directly to the brain and also to step 4.
    4. Red-Green output compared with Blue cone input to get a Yellow-Blue output that goes to the brain
    5. The two optic nerves cross over, passing enough data to each other for stereoscopic vision to take place
    6. Each brain hemisphere runs image recognition "software" on the data it gets from its optic nerve.

    It seems to me that NNs today are only doing Step 6 but ignoring all the other things we have built up along the way, though admittedly 3 and 4 are probably just a means of data compression. Still, enhancing edges and adding stereoscopic imaging together would, I suspect, help NNs to recognize shapes better than textures.
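
Step 2 in the list above (each cone suppressing its neighbours) can be illustrated with a very small sketch. This is a simplified one-dimensional model, not a description of real retinal circuitry. The function name `lateral_inhibition` and the `strength` parameter are assumptions made for illustration.

```python
def lateral_inhibition(signal, strength=0.5):
    """Subtract a fraction of the neighbours' mean from every sample.

    Border samples reuse their own value for the missing neighbour.
    """
    last = len(signal) - 1
    out = []
    for i, centre in enumerate(signal):
        left = signal[i - 1] if i > 0 else centre
        right = signal[i + 1] if i < last else centre
        out.append(centre - strength * (left + right) / 2)
    return out


# A step edge: dark region (1.0) next to a bright region (3.0).
step = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
print(lateral_inhibition(step))  # [0.5, 0.5, 0.0, 2.0, 1.5, 1.5]
```

In the output, the flat regions settle at 0.5 and 1.5, while the two samples next to the edge move to 0.0 and 2.0. The jump across the edge stays at 2.0 even though the flat regions now differ by only 1.0, so the edge stands out more relative to its surroundings.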

  18. I love the efforts made to make human-like Neural Networks. It's a fascinating area of study that has yielded fantastic results so far.

  19. A 2D convolutional neural network learns nothing but a 2D image's "pixel patterns", and extracts features of those "pixel patterns" to identify and classify objects. Removing texture interference is a big step, but still, I don't think this solves a fundamental problem: a "2D pixel pattern" is only a projection of the 3D world into 2D space. Most "2D features" in an image don't actually exist in the real world; they are just the result of light bouncing back and forth between 3D shapes and space. Humans are capable of understanding this ray-traced lighting process and deriving the 3D shape. What I believe is: "Recognition is Reconstruction".

  20. It is fascinating that this network can overcome the style image's influence by such a large margin for some categories. It's certainly encouraging that it doesn't perform more poorly in any categories than previous networks.

  21. It makes sense. Humans have also the whole rest of the brain beyond visual cortex and that part operates on objects, not on pixels.

  22. I am curious to see what the consequences of this research are going to be!

    What are the consequences for transfer learning?
    E.g. A pretrained ResNet can be used to achieve very good segmentation results, even though we know it does not understand shapes. Can we achieve better results in those related tasks if we use pretrained ResNet which has a better understanding of shapes?
    VGG is still the go to architecture when it comes to style transfer. Does style transfer still work in the same way if we use a VGG which has been trained on shapes as well?

    This work is very exciting! Thanks for presenting it!

  23. Yeah the high texture affinity of neural networks was something that struck me too, but i had no idea what to do about it. This is almost hilarious in how elegant it is.

  24. Wait, I get why only 90% of the human subjects identified the elephant skin, I was actually expecting less, but how did 1% not identify a cat from the cat picture…. like, what?!?!

  25. It would be interesting to run the network on written language (lettering as shapes) and then translation.

    Maybe a simplified global language?

  26. CNNs are now much better suited to military applications. It wouldn't surprise me if autonomous targeting systems will be ready for deployment into live tactical situations within 12 months.

  27. I wonder if this has something to do with the fact that as far as an AI is concerned, an image is just a collection of pixels. Since its base level assumption is the collection of pixels is arbitrary, it's reasonable to expect that the first level of information it can obtain is pattern of small groups of pixels, which is characteristic of texture. Overall shape is very global information so that's probably further down its possibility space and so less likely to be utilized. Whereas for human, the base level assumption is that we're seeing some physical objects, which are almost always defined by their shape.

  28. Okay everything is fine and all but how the ACTUAL FUCK did only 99% of humans recognize a cat in the first picture????

  29. I would argue that this is not a texture problem but a question of analyzing local structures versus global. And style transfer is just another patch on this gap.

  30. The size of the advantage of the NN that uses texture and edges both is much smaller than I expected. I thought edges and textures captured fundamentally different information and so expected accuracy to increase by more than a couple points, but it looks like both are good enough to begin with that there's not much advantage in combining the information.

  31. Perhaps if they feed the network a dataset of 3d models, of all the words of the reallife geometric objects mentioned in the dictionary, and then let every object deform itself so its pose fits the one in the image. Perhaps this would let the neural network see pictures like humans, in terms of objects in different poses and different angles.

  32. Whoever mistook that cat is a disgrace to mankind and should be immediately replaced with artificial intelligence..

  33. 1:07 Do you mean some humans didn't recognize the first image as a cat?!

    By the way I'm not sure that neural networks should perceive the world as humans do, we see dinosaurs in clouds and faces in humidity stains. Having a neural network that finds Jesus Christ in an x-ray image instead of a tumor is not very useful 🤔

  34. I don't think the solution is to give the networks weird datasets. The network has found texture to be more reliable in determining the object, which is likely true in the dataset provided. As people in everyday life we see MANY of the same objects with varying textures. Take refrigerators and stoves. They can both have the same texture, or many different ones making the shape and other features important. Cats on the other hand, may have different color patterns but will all basically have the same texture. Cars are the same way but some have quite varied patterns. We see this all the time as we grow and are subjected to a much wider dataset with many outstanding examples like these where the texture doesn't matter so we learn to look for shapes and other features instead of texture. Letters are another example of this. Shape is everything but the texture can be anything. Vehicles are another good example of this. Trucks vs cars vs busses vs semis vs vans all have the same textures but vary primarily in shape.

    Texture is also easier to learn. It requires only a local view of the subject as seen in the Indian elephant example. Recognizing the shape of a cat is quite a bit more complex. You need to see different features, but then also their location relative to other features. A cat has a body, legs, a head, and ears, but if the legs were where the ears were, and ears where the legs were then it's not a cat.

    Perhaps then the solution is to build datasets that will produce the most accurate and robust networks, rather than trying to create a network best fit to a dataset. Perhaps a second network could assist in this by examining the first network and its response to the entries in the dataset.

  35. It is obvious neural networks fail miserably at interpreting large-scale structure. We've known this for years.

  36. Well, this paper didn't show another technique to increase the accuracy of neural networks; it showed that the samples are biased to overestimate shapes and underestimate textures.

  37. I do not believe that creating a new dataset is the way to go – although, Neural Style Transfer is computationally intensive so one is forced to. The same problem would be tackled by filtering each minibatch or input, same way as it is done geometrically, with a random filter and randomized parameters: not only distort, perspective, zoom, noise, but convolutional matrix, "neon" effect, sharpen, texturize…
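
The on-the-fly filtering suggested in this comment might be sketched as follows. This is only an illustration of picking a random convolution matrix per image, assuming grayscale images stored as lists of rows. The kernel table and the names `apply_kernel` and `augment_minibatch` are hypothetical, and this is not the style-transfer approach used in the paper.

```python
import random

# A few classic 3x3 filter kernels (an illustrative selection).
KERNELS = {
    "identity": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "sharpen": [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "box_blur": [[1 / 9] * 3 for _ in range(3)],
    "emboss": [[-2, -1, 0], [-1, 1, 1], [0, 1, 2]],
}


def apply_kernel(image, kernel):
    """Slide a 3x3 kernel over the image without flipping it
    (cross-correlation), repeating edge pixels at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy = min(max(y + ky - 1, 0), height - 1)
                    sx = min(max(x + kx - 1, 0), width - 1)
                    total += kernel[ky][kx] * image[sy][sx]
            out[y][x] = total
    return out


def augment_minibatch(batch, rng):
    """Filter each image in the batch with an independently chosen kernel."""
    names = sorted(KERNELS)
    return [apply_kernel(image, KERNELS[rng.choice(names)]) for image in batch]


rng = random.Random(0)  # seeded only to make the sketch repeatable
batch = [[[1, 2], [3, 4]], [[2, 2], [2, 2]]]
augmented = augment_minibatch(batch, rng)
print(len(augmented), len(augmented[0]), len(augmented[0][0]))  # 2 2 2
```

Because the filtering happens per minibatch, no second copy of the dataset has to be stored, which is the trade-off the comment points at.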

  38. Texture is like really small and fine shapes. Since the neural network sees images at the pixel level, with limited convolutional window sizes, the tiny intricate patterns are significant. It would probably perform better if the images were pixelated or the windows were larger.

    It's like if our eyes saw everything at the molecular level, we would be great at determining types of materials but a cat would probably look very similar to a dog.

  39. Information always comes in more than one dimension. NNs and human brains look at different dimensions, just as human eyes pick up electromagnetic radiation in the visible spectrum while an infrared camera registers what we cannot see. If a body emits both infrared and visible light and these two correlate well, the camera can see what we see. If not, the camera sees very different pictures. The same is true for NNs and the human brain.

  40. I think this channel does so well because he calls us all "fellow scholars" when really I'm more a lazy youtube dweller than a dedicated Ai academic.

  41. This is very important for self-driving cars. If we can cover recognition in all the ways that a human can perceive, then there is no danger that a human could recognize but a neural network does not.
    Not only does it result in fewer car crashes, but it greatly improves the trust humans put in these recognition algorithms, knowing that their lives are in the neural net's “hands”.
    Wonder why? Because then a human cannot say “I would do better, I would have seen it, if I were driving, this car crash wouldn’t have happened.”

  42. Instead of including the human prior of preference of shape over texture through the dataset augmentation, one should be able to do it using changes in model architecture or some other regularisation. Hoping for more research in this area.

  43. I have an idea for training networks then. Instead of autoencoders, how about twin networks, one fed a ground truth image and one fed a version where all kinds of noise are introduced, the image is flipped and so forth, where the normalized encodings have to produce the same embedding?

  44. When Boston Dynamics and Deep Mind combine to create a "human worker" product, will all world economies change and national borders start to dissolve overnight ?

  45. Check out Tian Xu et al’s work from Schyn’s lab —

    Where they found deep CNNs were more robust to changes in shape rather than texture for facial identification.

  46. Great paper! It would be pretty amazing if the neural net can be trained to have first and second probabilities as elephant and cat.

  47. Upvote for the title. No it shouldn't need to. It can process more advancedly and see and feel things more than us. As there are things more better than us.

  48. Reading DeepMind's last blog post, I thought about this video. Anyone know how this kind of neural network reacts to adversarial examples?

  49. Of course Neural Networks are biased towards texture while Humans are biased towards shape.
    Humans have to navigate a 3D spatial environment, where one object can have a variety of different shapes depending on the positioning of all involved, so being able to consistently identify those shapes is a key aspect of survival. Neural Networks exist on a flat plane, and focus on patterns, especially those between neighbouring cells – rather than looking at "an object", they are only ever looking at "an image". A texture, regardless of what it would appear to represent to us.
    The fact that humans have two eyes allowing for different simultaneous overlapping perspectives of a 3D object just kind of drives the point home.

  50. The neural networks have to infer scale as something we want them to recognize. If we gave them 3D scans of objects that had information on the size of the item based on largest dimension, volume and density of the object, they might do better.

  51. Sorry, but the "initial problem" seems to be made up here. The neural network recognizes what it is trained to recognize. So if only real images or fractions of real images (maybe with different types of noise in them; it doesn't matter because it is mostly still a real image) are used during training, it surely will have problems recognizing shapes or shades. On the other hand, humans are trained over the course of life to recognize objects in dozens of different situations and forms. For example, we simply learned to recognize shadows, too. Comparing against the performance of humans on shapes only is not fair here. Thus, it is no surprise that one can achieve better results on shape-only problems if these kinds of problems were trained, too.

  52. This is easy to explain. If you present an image of a cat on a white background, the shape is largely irrelevant; the texture compared to the background is dramatically different. Take pictures of an animal in its native environment, where the animal seeks to blend its texture in with the background as much as possible, and what becomes important is its silhouette. That is why we’re so good at picking out silhouettes: that information is of paramount importance when a tiger’s stripes look largely like the grass he’s hiding in. Train the AI on harder pictures, ones of animals in their native environments, and I assure you that the AI will start behaving more like us.

  53. I don't think there was anything fundamentally new in this paper, at least as presented. All of this seems very obvious. a) it had long been known that ANNs are susceptible to superimposed noise b) generating a more diverse training set yields better results… what am I missing here?!

  54. Human brain works like a very generalist, optimized with a given architecture, pattern recognizing neural network.
