Rapidly advancing technology is outpacing current methods of evaluating and comparing large language models