LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings.
This is an interesting distinction. Intuitively, it feels like something similar is going on with programming. Gemini is apparently passing all these crazy benchmarks, but I couldn’t even get it to one-shot a game of Snake in C.