Visual RAG Beats the Vision Model
A three-billion-parameter vision model looked at a reCAPTCHA tile and got it right 89 percent of the time. It took 128 milliseconds. A lookup over a few hundred megabytes got it right 95 percent of the time. It took seven-tenths of a millisecond. Same tiles. Same held-out set. One of those is how almost everyone is wiring computer vision into their stack this year. The other is how you should. ...
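To make "a lookup" concrete, here is a minimal sketch of one common form it can take: nearest-neighbor retrieval over stored embeddings with a majority vote on the labels. This excerpt does not say how the lookup is actually built, so everything below is an assumption for illustration: the function names `cosine` and `lookup_label`, the three-dimensional toy vectors, and the labels are all hypothetical, and a real index would hold embeddings produced by an image encoder rather than hand-written numbers.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_label(query, index, k=3):
    """Return the majority label among the k stored vectors closest to query.

    index is a list of (embedding, label) pairs. This is a brute-force scan,
    which is fine for a toy example; it is not a claim about how the
    article's lookup is implemented.
    """
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy index: hand-written vectors standing in for tile embeddings.
INDEX = [
    ([0.9, 0.1, 0.0], "bus"),
    ([0.8, 0.2, 0.1], "bus"),
    ([0.1, 0.9, 0.0], "crosswalk"),
    ([0.2, 0.8, 0.1], "crosswalk"),
    ([0.0, 0.1, 0.9], "hydrant"),
]

if __name__ == "__main__":
    print(lookup_label([0.85, 0.15, 0.05], INDEX))
```

The point of the sketch is the shape of the work: answering a query means comparing against stored examples and counting votes, with no forward pass through a large model at query time.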