Turing test with exam answers: Can I sniff out the AI?

AI and student assessment (image by Mike MacKenzie, www.vpnsrus.com)

In the Finnish daily newspaper Helsingin Sanomat, GPT-3 and Laura Ketonen from the University of Jyväskylä discuss how AI will affect student assessment (Tekoäly ravistelee opiskelijoiden arviointia, English: Artificial intelligence shakes up student assessment). One of the obvious "applications" of AI would be exam cheating.

Laura's opinion piece provoked quite a few reactions. Some answered that AI produces hollow text. Others mentioned that AI does not have courage and initiative. It feels like we, as a human species, will soon be struggling with a generalized inferiority complex.

In a natural science exam situation at the BSc or MSc level, the initiative is with the teacher, and the students are, by definition, reactive. How do you define hollow text when a student is asked to explain glycoengineering? In the life sciences, courage and initiative are essential features for any aspiring scientist. But without a solid body of knowledge as the foundation, courage and initiative will rarely lead to impactful new developments. At this moment, we cannot yet outsource this foundation to AI. Hence, we require students to have a solid body of knowledge on which courage and initiative can draw to flourish. Perhaps low-hanging fruit can be picked with courage and initiative alone. But solving today's problems requires, in addition, a solid body of knowledge ("standing on the shoulders of giants").

When judging AI, we likely make the same mistake humans make with all new technologies: we overestimate its impact in the short run but underestimate its impact in the long run. Time and again, (narrow) AI has managed to break into fields that were previously reserved for humans. There is no reason to assume that this development will not continue. It may well be that the presently dominant approaches to AI (deep learning, machine learning, neural networks, statistical approaches) will not lead to major future breakthroughs toward human-like general AI, perhaps even self-aware AI. We don't even know how human consciousness comes about. On the other hand, I agree with Daniel Dennett's idea that there might be no hard problem of consciousness: consciousness is what you get when you have solved all the easy problems. In any case, if a non-self-aware AI becomes indistinguishable from a self-aware one, what's the difference, and how would we be able to know? I am sure that at some point we will also be able to outsource our knowledge completely. With Google, we have taken the first baby steps in this direction, but we need a better brain-computer interface to fully embrace the concept of knowledge outsourcing.

Deep Blue beat world chess champion Garry Kasparov in 1997. I did not want to accept this human defeat and declared the match unfair: Deep Blue crashed and needed to be rebooted once during the six-game match, which - in my opinion - was equivalent to the human player dying during the game (or at least being resuscitated by CPR). In 2016, computers surpassed humans at the game of Go, and today they beat humans even at games like poker, which require sophisticated psychological trickery such as bluffing.

To test how AI compares to students in an exam situation, Patrick added one AI-generated answer to the students' responses in a recent written exam. This idea gave the grading an exciting spin! I think I knew which answer came from the AI. I had played with GPT-3 before and knew that - untweaked - the AI is overly correct and systematic in addressing questions. For example, in a three-part question, it systematically answers all three parts. Also, the AI did not know the exact content of my lectures, which made identifying its answer relatively easy. Patrick still needs to reveal whether I identified the AI correctly...

The AI (or what I identified as the AI) performed worse than the best student, but overall it did really well. Clearly, our future exam questions need to focus even more on critical thinking and on fictional examples that the AI cannot know from its vast pool of training material.

But given the increasing amount of information that AI will have access to in the future, it can rely on other people's critical thinking. The next generation of AI (GPT-4 and Google's LaMDA) is already waiting for public release... With some tweaking, Patrick could easily have fooled me. Maybe he did already... If he had instructed the AI to break its answering pattern intentionally, and if he had given GPT-3 my lecture slides as reference material, all bets would be off.
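
To make that concrete, here is a minimal sketch of what such tweaking could look like with the GPT-3-era completion API (the legacy openai-python interface, before version 1.0). The slide file, the exam question, and the instruction wording are hypothetical; the point is only that prepending lecture material and explicitly telling the model to answer less systematically would be enough to break the pattern I relied on.

```python
# Sketch: prompt GPT-3 with lecture material and tell it to answer less systematically.
# Assumes the legacy openai-python (<1.0) Completion endpoint; file name and wording are placeholders.
import openai

openai.api_key = "sk-..."  # your API key

# Hypothetical lecture material used as reference text
lecture_excerpt = open("lecture_slides_glycoengineering.txt").read()

prompt = f"""Background material from the course lectures:
{lecture_excerpt}

Exam question: Explain glycoengineering and give one application.

Write the answer the way a hurried student would: skip one sub-part,
use a slightly informal tone, and do not structure the answer as a list."""

response = openai.Completion.create(
    model="text-davinci-003",  # GPT-3-era completion model
    prompt=prompt,
    max_tokens=400,
    temperature=0.9,           # higher temperature -> less uniform, less "overly correct" answers
)
print(response.choices[0].text.strip())
```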

I'll let you know whether I succeeded in identifying the AI correctly when Patrick reveals which answers were AI-generated.

UPDATE: I (as well as the other teachers) correctly identified the AI answers. The general opinion among us was that if we had not been alerted to the fact that one of the answers was AI-generated, we would not have noticed. The quality of the AI answers depended on the type of question. It appears that, at the moment, a relatively easy way to throw off the AI is to provide the information needed to answer the question in graphical form. But I am sure AI will learn to integrate images with text very soon.