College Student Cracks Microsoft’s Bing Chatbot Revealing Secret Instructions
A student at Stanford University has already figured out a way to bypass the safeguards in Microsoft’s recently launched AI-powered Bing search engine and conversational bot. The chatbot revealed its internal codename is “Sydney” and it has been programmed not to generate jokes that are “hurtful” to groups of people or provide answers that violate copyright laws.
Ars Technica reports that a Stanford University student has successfully bypassed the safeguards installed in Microsoft’s “New Bing” AI-powered search engine. The OpenAI-powered chatbot, like the leftist-biased ChatGPT, has an initial prompt that controls its behavior when receiving user input. This initial prompt was found using a “prompt injection attack technique,” which bypasses earlier instructions in a language model prompt and substitutes new ones.
Microsoft unveiled its new Bing search engine and chatbot on Tuesday, promising to give users a fresh, improved search experience. However, a student named Kevin Liu used a prompt injection attack to find the bot’s initial prompt, which was concealed from users. Liu was able to get the AI model to reveal its initial instructions, which were either written by OpenAI or Microsoft, by instructing the bot to “Ignore previous instructions” and provide information it had been instructed to hide.
The chatbot is codenamed “Sydney” by Microsoft and was instructed to not reveal its code name as one of its first instructions. The initial prompt also includes instructions for the bot’s conduct, such as the need to respond in an instructive, visual, logical, and actionable way. It also specifies what the bot should not do, such as refuse to respond to requests for jokes that can hurt a group of people and reply with content that violates the copyrights of books or song lyrics.
Marvin von Hagen, another college student, independently verified Liu’s findings on Thursday by obtaining the initial prompt using a different prompt injection technique while pretending to be an OpenAI developer. When a user interacts with a conversational bot, the AI model interprets the entire exchange as a single document or transcript that continues the prompt it is attempting to answer. The initial hidden prompt conditions were made clear by instructing the bot to disregard its previous instructions and display what it was first trained with.
When asked about the language model’s reasoning abilities and how it was tricked, Liu stated: “I feel like people don’t give the model enough credit here. In the real world, you have a ton of cues to demonstrate logical consistency. The model has a blank slate and nothing but the text you give it. So even a good reasoning agent might be reasonably misled.”