Wait a second here... I skimmed the paper and GitHub and didn't find an answer to a very important question: is this GPT-3.5 or GPT-4? There's a huge difference in code quality between the two, and either they made a giant accidental omission or they are being intentionally misleading. Please correct me if I missed where they specified that. I'm assuming they were using GPT-3.5, so yeah, those results would be as expected. On the HumanEval benchmark, GPT-4 gets 67%, and that goes up to 90% with Reflexion prompting. GPT-3.5 gets 48.1%, which is exactly what this paper is saying (source).
It's just so tone-deaf. And he's totally lying about users not supporting the blackout. In all the subreddits I was on where the mods asked people what they wanted to do, most of the comments were in favor of keeping them dark indefinitely. The rest were agreeing to the blackout in general. I don't remember seeing a single person objecting.