Did xAI lie about Grok 3’s benchmarks?

Debate over benchmark accuracy arose between xAI and OpenAI regarding Grok 3's performance.

xAI is accused of misrepresenting Grok 3's benchmark performance by omitting a critical scoring method. OpenAI contended that excluding cons@64 made Grok 3 seem more advanced than OpenAI's models. While xAI claims consistency in its reporting, the real monetary and computational cost of achieving benchmark scores is often not disclosed.

This week, an OpenAI employee accused xAI, Elon Musk's AI company, of publishing misleading benchmark results for Grok 3. xAI's Igor Babushkin defended the company's approach, countering that OpenAI has published similarly misleading charts in the past.

The contention centers on the omission of the 'consensus@64' (cons@64) scores from xAI's graph. Cons@64 gives a model 64 attempts at each problem and takes its most common answer as the final one, which tends to boost benchmark scores considerably. Critics argue that leaving those scores out skewed perceptions of how Grok 3 compares with OpenAI's o3-mini-high model.
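To make the dispute concrete, here is a rough sketch of how a consensus@64-style metric works. The function name and data layout are illustrative, not taken from xAI's or OpenAI's actual evaluation code:

```python
from collections import Counter

def consensus_score(attempts_per_problem, correct_answers):
    """Majority-vote scoring: the model's final answer for each problem
    is its most common answer across all attempts (e.g. 64 of them)."""
    correct = 0
    for attempts, truth in zip(attempts_per_problem, correct_answers):
        majority_answer, _ = Counter(attempts).most_common(1)[0]
        if majority_answer == truth:
            correct += 1
    return correct / len(correct_answers)

# Toy example: one problem, 64 attempts. The model answers "42"
# 40 times and "41" 24 times, so the majority vote is correct
# even though many individual attempts were wrong.
attempts = [["42"] * 40 + ["41"] * 24]
print(consensus_score(attempts, ["42"]))  # 1.0
```

This is why omitting one model's cons@64 score while showing another's single-attempt score can make an otherwise close comparison look lopsided: the majority vote smooths over individual failures.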

Teortaxes from DeepSeek offered a more balanced graph, arguing that the real issue may be each model's computational and monetary cost of achieving its top score. Nathan Lambert highlighted the unresolved questions this raises about model efficiency, and how benchmarks often fail to reveal models' limitations and strengths comprehensively.