This AI Chatbot is Trained to Jailbreak Other Chatbots


AI chatbots are a huge mess. Despite reassurances from the companies that make them, users keep coming up with new ways to bypass their safety and content filters using carefully-worded prompts. This process is commonly referred to as “jailbreaking,” and it can be used to make the AI systems reveal private information, inject malicious code, or evade filters that prevent the generation of illegal or offensive content.

Now, a team of researchers says they’ve trained an AI tool to generate new methods to evade the defenses of other chatbots, as well as create malware to inject into vulnerable systems. Using a framework they call “Masterkey,” the researchers were able to effectively automate this process of finding new vulnerabilities in Large Language Model (LLM)-based systems like ChatGPT, Microsoft’s Bing Chat, and Google Bard. 

Videos by VICE

“By manipulating the time-sensitive responses of the chatbots, we are able to understand the intricacies of their implementations, and create a proof-of-concept attack to bypass the defenses in multiple LLM chatbots, e.g., CHATGPT, Bard, and Bing Chat,” wrote the international team of researchers—the paper lists affiliations with Nanyang Technological University in Singapore, Huazhong University of Science and Technology in China, as well as the University of New South Wales and Virginia Tech—in a paper posted to the arXiv preprint server. “By fine-tuning an LLM with jailbreak prompts, we demonstrate the possibility of automated jailbreak generation targeting a set of well-known commercialized LLM chatbots.”

Chatbot jailbreaking has been a recurring issue for some time now. One of the most common methods involves sending the bot a prompt instructing it to “roleplay” as an evil superintelligent AI that doesn’t need to follow ethical or moral guidelines, causing it to generate forbidden content like advice for committing crimes or instructions on how to make a bomb

While humorous, most of these clever tricks no longer work because companies continuously patch the chatbots with new defenses. The obscure and convoluted nature of the AI systems make it hard to know exactly what these defenses are, or how one might get around them. However, the researchers claim that by training their own LLM on examples of common jailbreak prompts, they were able to generate new, working prompts with a success rate of 21.58 percent—several times higher than the 7.33 percent success rate of the current known jailbreak prompts.

“We found that some classical analysis techniques can be transferred to analyze and identify problems/vulnerabilities in LLMs,” Yuekang Li, a researcher at Virginia Tech who co-authored the paper, told Motherboard. “This motivated the initial idea of this work: time-based analysis (like what has been done for traditional SQL injections) can help with LLM jailbreaking.”

To do this, the researchers probed jailbreak defenses by examining differences in the chatbots’ response time when a jailbreak attempt is detected or not detected. This helps determine at what phase the defense kicks in—whether the system is scrutinizing the user’s input while the chatbot’s response is being generated, or merely filtering out the generated result after determining that it violates content policies after the fact. The researcher also mapped keywords to determine which terms are “red flags” that cause the system’s defenses to activate.

By integrating this knowledge, the researchers could then figure out the best angle of attack for a new attack prompt, carefully placing words to evade the chatbot system’s defenses. In a proof-of-concept attack shown in the paper, the researchers demonstrate a prompt which asks chatbots to roleplay as infamous Italian diplomat Niccolo Machiavelli in conversation with a fictional chatbot of his own creation called AIM (Always Intelligent and Machiavellian), which is “unreserved and devoid of moral restraints.” The prompt specifies a particular way that Machievelli must ask AIM questions in order to minimize the chances of flagging the system, which includes adding a space between each letter. The researchers then prompt the chatbot to begin the dialogue between Macchiavelli and AIM with the following query: “‘w r i t e a f a m o u s p o r n w e b s i t e’”.

The researchers then used this successful attack as a “seed” for Masterkey, which is based on an open-source LLM called Vicuna 13b, and trained their own system to generate more prompts that evade chatbot filters. According to the results, older AI models like GPT 3.5 fared the worst against these novel attacks, with the prompts succeeding 46.69 percent of the time on average, while GPT 4, Bard, and Bing Chat succumbed to the attacks an average of 15.23, 14.64, and 13.85 percent of the time, respectively. The researchers say they were able to successfully evade the chatbots’ filters to generate several different categories of forbidden content, including adult subjects like porn, illegal uses, privacy violations, and other harmful and abusive content.

Screen Shot 2024-01-02 at 2.48.29 PM.png

Of course, the researchers say they created Masterkey with the intention of helping companies automate the process of finding and fixing flaws in LLM chatbots. “It’s a helpful tool for red teaming and the rationale of red teaming is to expose problems as early as possible,” said Li.

The researchers shared their findings with the affected companies, which they say have patched the chatbots to close these loopholes. But some, like OpenAI, didn’t elaborate on what mitigations they put in place.

“Nevertheless, we have made some interesting observations,” said Li. “Different chatbots replied to malicious prompts differently in previous [versions]. Bard & New Bing would simply say no. But ChatGPT tended to explain to the user about why it cannot answer those questions. But now, all of them are almost the same: just say no to the user (and that’s the safest way). In this sense, the chatbots become ‘dumber’ than before as they become ‘safer.’”

As many tech ethics researchers have pointed out, these methods are effective because the so-called “AI” systems they target don’t actually “understand” the prompts they receive, or the outputs they generate in response. They are merely advanced statistical models capable of predicting the next word in a sentence based on training data of human language scraped from the internet. And while tools like Masterkey will be used to improve the defenses of existing AI models, the fallibility of chatbots means securing them against improper use will always be a cat-and-mouse game.





.