Could you please evaluate your harness on Terminal Bench 2.0? It would be interesting to compare the results with Claude Code and OpenCode.