AI groups rush to redesign model testing and create new benchmarks