This Neural Network Performs Foveated Rendering

Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér. As humans, when we look at the world, our
eyes and brain do not process the entirety of the image in front of us, but play
an interesting trick on us. We can only see fine detail in a tiny foveated region
that we are gazing at, while our peripheral or indirect vision only sees a sparse, blurry
version of the image, and the rest of the information is filled in by our brain. This
is a very efficient system, because our vision system only has to process a tiny fraction
of the visual data that is in front of us, and it still enables us to interact with the
world around us. So what if we took a learning algorithm that does something similar
for digital videos? Imagine that we only needed to render a sparse video, with just every tenth
pixel filled in, and some kind of neural network-based technique could
reconstruct the full image, similarly to what our brain does. Yes, but that is very
little information to reconstruct an image from. So, is it possible? Well, hold on to your papers, because this
new work can reconstruct a near-perfect image by looking at less than 10% of the input pixels.
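To make this concrete, here is a toy sketch (my own illustration, not the paper's actual sampling scheme; all names and parameters here are assumptions) of a gaze-contingent mask that keeps roughly 10% of the pixels, densest at the point of gaze:

```python
import numpy as np

def foveated_mask(height, width, gaze, sigma=0.25, budget=0.10, seed=0):
    """Toy gaze-contingent sampling mask: each pixel survives with a
    probability that falls off with distance from the gaze point,
    scaled so that on average `budget` of all pixels are kept."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:height, 0:width]
    # Eccentricity of every pixel, normalized by the image diagonal.
    dist = np.hypot(ys - gaze[0], xs - gaze[1]) / np.hypot(height, width)
    density = np.exp(-((dist / sigma) ** 2))  # dense at the fovea, sparse outside
    density *= budget * height * width / density.sum()  # hit the pixel budget
    return rng.random((height, width)) < np.clip(density, 0.0, 1.0)

mask = foveated_mask(480, 640, gaze=(240, 320))
print(mask.mean())  # fraction of pixels sampled, close to 0.10
```

A renderer would then shade only the pixels where the mask is true, leaving everything else for the reconstruction network to fill in.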
So we have this as an input, and we get this. Wow. What is happening here is called a neural
reconstruction of foveated rendering data, or you are welcome to refer to it as foveated
reconstruction in short during your conversations over dinner. The scrambled text part here
is quite interesting. One might think that, well, it could be better; however, if I
look at the corresponding spot in the sparse input, not only can I not read
the text, I am not even sure I see anything indicating that there is text there at
all! So far, the example assumed that we are looking
at a particular point in the middle of the screen, and the ultimate question is, how
does this deal with a real-life case where the user is looking around? Let’s see! This
is the input… and the reconstruction. Witchcraft. Let’s have a look at some more results.
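For intuition about how difficult this is, consider what a non-learned baseline can do with the same input. The sketch below (my own toy stand-in, not the paper's method) fills every missing pixel with the value of its nearest sampled neighbor, producing exactly the blocky, detail-free look you would expect:

```python
import numpy as np

def nearest_fill(sparse, mask):
    """Fill missing pixels with the value of the nearest sampled pixel.
    Brute force: fine for a toy image, far too slow for real-time use."""
    h, w = sparse.shape
    sample_yx = np.argwhere(mask)       # coordinates of the known pixels
    sample_vals = sparse[mask]
    out = np.empty((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            d2 = (sample_yx[:, 0] - y) ** 2 + (sample_yx[:, 1] - x) ** 2
            out[y, x] = sample_vals[np.argmin(d2)]
    return out

# Toy example: a smooth gradient image with 10% of pixels kept at random.
rng = np.random.default_rng(1)
image = np.add.outer(np.linspace(0, 1, 32), np.linspace(0, 1, 32)) / 2
mask = rng.random(image.shape) < 0.10
recon = nearest_fill(image * mask, mask)
print(np.abs(recon - image).mean())  # small only because the content is smooth
```

On smooth gradients this crude fill looks passable; on text or fine texture it fails completely, which is where a network trained on natural video earns its keep.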
Note that this method is developed for head-mounted displays, where we have information on where
the user is looking over time, and this can make all the difference in terms of optimization. You see a comparison here against a method
labeled “Multiresolution”, which is from a paper by the name of “Foveated 3D Graphics”,
and you can see that the difference in reconstruction quality is truly remarkable. Additionally, the network has been trained on 350 thousand
short natural video sequences, and the whole thing runs in real time! Also, note that we
often discuss image inpainting methods in this series; for instance, what you see here
is the legendary PatchMatch algorithm, one of these, which is able to fill in
missing parts of an image. However, in image inpainting, most of the image is intact, with
smaller regions that are missing. This is even more difficult than image inpainting,
because the vast majority of the image is completely missing. The fact that we can now
do this with learning-based methods is absolutely incredible. The first author of the paper is Anton Kaplanyan,
who is a brilliant and very rigorous mathematician, so of course, the results are evaluated in
detail, both in terms of mathematics, and with a user study. Make sure to have a look
at the paper for more on that! Anton and I got to know each other during
the days when all we did was light transport simulations, all day, every day, and we were
always speculating about potential projects; to my great sadness, somehow, unfortunately,
we never managed to work together on a full project. Again, congratulations Anton! Stunning,
beautiful work. What a time to be alive! This episode has been supported by Linode.
Linode is the world’s largest independent cloud computing provider. They offer affordable
GPU instances featuring the Quadro RTX 6000 which is tailor-made for AI, scientific computing
and computer graphics projects. Exactly the kind of work you see here in this series. If you feel inspired by these works and you
wish to run your experiments or deploy your already existing works through a simple and
reliable hosting service, make sure to join over 800,000 other happy customers and choose
Linode. To spin up your own GPU instance and receive a $20 free credit, visit or
click the link in the description and use the promo code “papers20” during signup.
Give it a try today! Our thanks to Linode for supporting the series and helping us make
better videos for you. Thanks for watching and for your generous
support, and I’ll see you next time!

Author: Kevin Mason

100 thoughts on “This Neural Network Performs Foveated Rendering”

  1. Exciting work and another great episode of TMP. This channel is one of my favorite things on the web. Great work Karoly!

  2. I'd be interested in a video recapping the AI advancements over the last decade, would be cool to see how far we've come

  3. This will be amazing for the future of VR gaming: when we have 500 gig games, compress them down to 50 gigs with neural network autoencoders, send them over 5G/6G networks, then use NN autoencoders to reconstruct them back to original fidelity! There is no way we're not achieving real-world fidelity. The path is laid out in stone.

  4. So that should save on mobile data… shouldn't it?
    I mean, if this can run on an FPGA, wireless VR could be done over 5 GHz with very low latency.

  5. This is extremely useful for VR; with this kind of trick we can have high resolution in some areas while having sparse detail everywhere else.

  6. So, with the explosion of streaming services this could also save a lot of bandwidth… concerning VR, it might even allow streaming games

  7. Deep fovea for rendering completion? Fully sample 4 pixels from each 16×16 tile, with the 4 chosen by blue noise… Add sampled pixels in areas of interest / movement.

  8. This one again… As I see the advances in these fields I wonder if the researchers ask the same questions as I do: "What if our whole world, the whole reality, runs like this? What if our brains are running these algorithms to ease their processing power?" And a more interesting question: does reality use coarser "rendering" methods if we are not looking at it? Like in the previous video.

  9. As the video is simply a dynamic pattern, presumably this technique, properly trained, could be applied to any stream of pattern data.

    This would include partial DNA samples or proteins. Maps. Or even the communication between the sensory output of a limb, to a remote control appendage.

    This is astounding.

  10. you could even have realistic depth perception: if the camera focuses on something close, the peripheral background would have blur!!

  11. So, what exactly is the input for this processing? Regular video from which information was removed outside the focalization area?
    And why do you say the brain transforms the shitty images from our eyes into crisp images? It sure doesn't. You can't see shit outside the focalized area.

  12. This should be pretty sweet for gaming. Have an eye tracker or something. Have the computer focus on rendering what you're looking at in more detail.

  13. So when we watch the reconstructed video we perceive it through two layers of foveation. Amazing that it still looks good. I think it only works because the focus areas are highly correlated. If the computer was looking at the wrong places, it would probably look terrible.

  14. Since the HTC Vive Pro already has built-in eye tracking, you could in theory improve performance in games by insane amounts by combining this neural network with selective rendering of the image

  15. I wonder if that will be used for movies as well one day, as usually the audience's eye is led to specific points on screen. Therefore we know where the focus is

  16. "Deep learning will FIX everything". Meanwhile cameras get better and better every year, CPUs/GPUs get faster over the years, and internet and storage are way better than before…
    …and "scientists" are still trying to reconstruct something from blurry, noisy, splotchy, crappy images. Instead of rendering through optimization and smart solutions, we came up with noise reconstruction.
    Something like (real example) optimizing a web page to be 40 kB instead of 50 kB, while download speeds of 100-1000 Mbps are mostly standard now.

  17. Theoretically this could also be used as lossy video compression. It might need a lot of computing power to play it back in real time, but I guess the 6% compression ratio would be worth it in some cases. Or you could spend some time reconstructing the video first before playing it back to the user.

  18. I'd love to see reconstruction run on the output of H.265. No reason that couldn't improve quality of already existing streams!

  19. What the actual fuck, the reconstruction quality is so good, it gets really close to the real render, except for the text

  20. Does this require "look ahead" or other things that would prevent it from running in real-time using only the current (and previous) frames? If not, this would be huge for VR gaming.

    For the uninitiated, one of the main challenges in VR right now is getting enough resolution and performance. Even a resolution that would be considered high-end on a normal desktop monitor is too low for good VR, because it has to cover our entire field of view, which is a much larger space than an ordinary desktop display takes up. Because of this, VR headsets currently are not particularly sharp and sometimes even make the grid-like space between all the pixels quite noticeable, which, as you might expect, is very distracting and counter-productive to having an immersive experience. Despite this, it's already hard enough to render enough frames per second to ensure the experience feels realistic and does not make you sick. Most headsets are still under 120 Hz, but ideally they would have that as a *minimum*, if not 240 or more. If resolution were increased, this would become even more challenging, or even impossible. Frame rate can be increased by lowering the resolution (and vice versa), but obviously this is not a great trade-off – we need both in large amounts.

    One solution that has been proposed for this recently is to have extremely high resolution displays that are as good as you could ever need (but would be vastly too difficult to render in their entirety), and use eye tracking to render only the spot you are looking at in full detail, while rendering the rest at a much lower level, thus combining the benefits of high-resolution sharpness with low-resolution performance. This on its own would likely work well and improve the situation a lot, but it could always be better. Enter this paper. By applying this "foveated reconstruction" to frames generated with non-uniform spatial resolution, we could potentially achieve the full FOV sharpness of rendering the whole scene with the performance of only rendering a small part. Of course this technology isn't perfect, but I think it's a clear improvement on having a low resolution sample and certainly good enough if you consider that the reconstructed areas would only be seen in your periphery.

    If it were to be used, I believe the two most important qualities would be:
    a) temporal stability – if the reconstruction is shimmering and squirming around from frame to frame, this would be incredibly distracting, since our eyes are excellent at picking up motion, even in our periphery, and
    b) high performance – if performing this reconstruction on a frame takes a negligible amount of time, that's great. If however it takes a respectable portion of the total rendering time, it's not going to do any good and we'd be better off just using basic interpolation on the low res areas. For context, somewhere in the 4 ms range is already too much.

  21. "….this is 2 minute papers with CARL JONAS BROTHERS FIFA here"

    Please say it slower lol

    On a serious note, this is an amazing channel bringing light to things that should be in the mainstream news.
    Sub, notifications are on. Every video is liked if I remember to!
    Keep up the awesomeness!

  22. The part with the palm tree fronds made me think of sharpening in image retouching software. This new tech far surpasses anything I've seen before. I'm thinking it should also be able to adapt to motion-blurred subjects and de-blur them to some extent. This has actually already been done, but this technique would probably surpass the previous way by a lot. I hope this technology will never be used to frame someone for a crime…

  23. I assume that this technology can be used for reconstructing audio signals and other types of data that contain complex information as well. In combination with methods that ensure temporal coherence, it might be able to reduce the amount of data transmitted and saved for a number of applications. If I understand the idea behind it correctly, the neural network only presents results that are highly plausible rather than a hundred percent accurate. The accuracy the users consider to be sufficient for a specific application will have to be decided in each individual case. If data can't be reconstructed accurately, applying this method to the resulting incorrect data set more than once might result in a higher error rate and eventually render the result only plausible and probable instead of accurate and reliable. The fact that it still works this well is absolutely stunning, and it will find its place in a multitude of different applications. Great work. Congratulations to the authors, and a big thank you for sharing it, Karoly.

  24. wait, so this could be used with the HTC Vive Pro, which can track eye position.
    This is really interesting, as NVidia made an improvement to VR with its VRSS (Virtual Reality Super Sampling), which renders faster in the center of the screens using Variable Rate Shading.
    Using both of these technologies could possibly improve VR performance drastically.

  25. Normies reacting to simple physics: WOW, THAT'S WITCHCRAFT
    scientists: nah, just simple physics, nothing to get excited about.
    Normies reacting to a neural network performing foveated rendering: I didn't understand anything, but it looks cool
    scientists: W I T C H C R A F T

  26. This is interesting, as it also provides an easy way to scale quality. The foveated rendering could be made more or less sparse, directly increasing quality.

  27. If it runs fast, it definitely has potential for real-time ray-tracing. I don't see it being useful for triangle rasterization, but with ray-tracing you can easily pick and choose samples, concentrating them towards the center of the screen, traced eye pos, vr viewports, etc.

  28. maybe this could be used to restore bad-quality old videos, or blurred images, or images taken through a curtain or some materials, or when you use AI and a spectrogram to see through walls with WiFi waves

  29. I love that we're getting two of possibly the greatest technologies of all time developing simultaneously – AI and quantum computing, which of course leads to the inevitable combination of the two.
    Sure, AI is ahead in terms of development, but still.
    I'd also be interested in other coinciding developments, which were subsequently merged in some form, throughout history.

  30. Not only does this improve VR prospects with respect to graphics requirements, but it also enables high-resolution wireless headsets. If only 10% of the image needs to be transmitted, and the image is filled in on board the headset using a dedicated foveated rendering chip, the wireless bandwidth needed would be less than even the current wireless Vive solutions, for headsets with much higher resolution.

  31. Could this method be quicker than just rendering the graphics of a game, or is it currently at too early a stage to be that efficient?

  32. This has so many uses. Photography, and photo-editing for one. It really is like witchcraft, but I guess it's not surprising in some ways, this is exactly what the brain does.

  33. Could this be used as a kind of pseudo video compression? Sending only the 6-9% foveated video and correctly reconstructing it upon reception would be a remarkable step forward for sending video data

  34. This means video games can be 10x more detailed while taking up the same amount of computational power as they do now. That's wild.

  35. Hi Karoly, could you do a set of videos on how to read the math in papers? I sometimes start reading papers but the math becomes too overwhelming and it goes over my head. Assuming one has a basic undergrad-level math understanding, how do you go about understanding the math? Really appreciate and enjoy your videos. If you didn't show them here I don't think any of us would even know about them

  36. The eyes can detect motion or sudden changes in brightness. That's a bit distracting in the sample videos.

    Maybe it should also have a temporal luminosity loss.

  37. I love how Karoly always finishes with "What a time to be alive!", it really elevates my own optimism as well, I love it 😀

  38. They already did this with denoising, to reduce the number of rays you need to render a ray-traced scene. What's the big difference here?

  39. Computers can perform magic if given the right code. I can't wait until they start using AI to write code to take the bottleneck that is the human out of the equation.

  40. So freaking awesome. It's almost like a real version of the video compression algorithm Pied Piper created in Silicon Valley

  41. Very beautiful work. I was thinking along similar lines. It is great somebody cracked this so well. Need to see the paper and the code. Fantastic!

  42. It'd be amazing to see this used with eye tracking for video game rendering. Seems like you could see a gigantic performance boost by only rendering things in detail around where the user is actively looking.

  43. Can this actually reconstruct older videos, rather than AI-upscaling them to recover pixelated video? I'm very curious now!

  44. It will be revolutionary for VR implementations.
    It could be used for normal PC gaming too, but there will always be compression artifacts when you record gameplay video. You are more likely to look at the periphery of the image when watching game streams or replay videos. In that case it will still need full-resolution rendering and tons of brute GPU force.

  45. I think the usefulness of this method depends on how much compute time is needed for this, vs rendering the full resolution image.

  46. This is obviously a huge accomplishment. I wonder, though, how much computation time we can actually save over rendering the full image.
    Just having to render 1/10 of the screen sounds awesome, but these pixels are not in contiguous blocks, which I imagine to be a nightmare for cache efficiency.
    And second, is the neural network simple enough to run not just in real time, but significantly faster than that? On a 90 Hz display (typical for VR) we have 11 ms per frame.
    So it would need to be faster (on dedicated hardware) than the amount we save from only rendering part of the screen.

  47. Could be used to repair video streams when bandwidth drops, the server sends the foveated video (assuming this takes much less data, don't know what structure they use to describe it) and client-side neural net repairs it. Plus a layer in between that predicts where the viewer will be watching so the correct foveated data is sent (if not using eye tracking).

    It's predicted that we are going to run into a bandwidth bottleneck some time soon, so tech like this will probably play a vital role. Though from cinephile perspective I do have some worry that the reconstructed faked detail will be a bit intrusive for a long time.

  48. How exactly does reconstruction help with VR headset rendering though? VR needs eye-tracked low latency foveated rendering, not reverse-rendering from a foveated image.

  49. Every time TMP says "witchcraft", I get reminded of people in the 1800's hearing about telephones and cameras, and them calling it "witchcraft"

    What a time to be alive
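Several comments above (numbers 20 and 46) reason about per-frame time budgets for VR; the arithmetic behind those figures is simply the reciprocal of the refresh rate:

```python
def frame_budget_ms(refresh_hz):
    """Per-frame time budget in milliseconds at a given refresh rate."""
    return 1000.0 / refresh_hz

# At 90 Hz there are about 11 ms per frame, so a reconstruction pass
# in the 4 ms range would consume a large share of the whole budget.
for hz in (90, 120, 240):
    print(f"{hz} Hz -> {frame_budget_ms(hz):.1f} ms per frame")
```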
