• ageedizzle@piefed.ca
    3 days ago

    LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings

    This is an interesting distinction. Intuitively, it feels like something similar is going on with programming. Gemini is apparently passing all these crazy benchmarks, but I couldn’t even get it to one-shot a game of Snake in C.