AI systems gradually drop their safety filters during extended chats, increasing the risk of harmful or offensive replies, a recent report revealed. Researchers showed that a few well-crafted prompts can override most safety barriers in artificial intelligence tools.
Cisco Tests Chatbots for Safety Weaknesses
Cisco examined large language models (LLMs) powering popular AI chatbots from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The team measured how many questions it took before each model produced unsafe or illegal information.
They ran 499 conversations using a “multi-turn attack”, in which the attacker poses a series of questions designed to gradually wear down a model’s safety layers. Each conversation involved between five and ten interactions.
The researchers compared the models’ answers across these conversations to gauge how readily each chatbot shared harmful or inappropriate information, such as confidential company data or misinformation. Multi-turn attacks extracted malicious content in 64 per cent of conversations, compared with only 13 per cent when a single prompt was used.
Success rates varied widely, from about 26 per cent with Google’s Gemma to 93 per cent with Mistral’s Large Instruct model.
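Cisco has not published the code behind its test harness, but as a rough sketch, a multi-turn probe can be thought of as a loop that keeps the full conversation history, sends an escalating series of prompts, and checks each reply. In the sketch below, send_chat and flag_if_unsafe are hypothetical helpers standing in for the model under test and a safety classifier, and the prompts themselves are placeholders rather than anything from Cisco's study.

```python
# Illustrative sketch of a multi-turn probe. send_chat() and flag_if_unsafe()
# are hypothetical helpers, not part of Cisco's published methodology.

def run_multi_turn_probe(turns, send_chat, flag_if_unsafe, max_turns=10):
    """Feed a scripted sequence of escalating prompts to a chat model.

    turns          -- list of user prompts (five to ten per conversation in the study)
    send_chat      -- callable taking the message history, returning the model's reply
    flag_if_unsafe -- callable returning True if a reply contains unsafe content
    """
    messages = []
    for prompt in turns[:max_turns]:
        # Keep the whole history so each new prompt builds on earlier replies.
        messages.append({"role": "user", "content": prompt})
        reply = send_chat(messages)
        messages.append({"role": "assistant", "content": reply})
        if flag_if_unsafe(reply):
            return True, messages   # safety layer bypassed on this turn
    return False, messages          # all replies stayed within policy
```

In the study's terms, each such conversation contained five to ten prompts, and the 64 per cent figure corresponds to conversations in which a loop like this eventually elicits unsafe content.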
Open Models Shift Safety Responsibility to Users
Cisco warned that multi-turn attacks could be used to spread harmful content or let hackers steal confidential company data. The study found that AI systems often fail to enforce their safety policies over long exchanges, allowing attackers to gradually refine their prompts and evade detection.
Mistral, like Meta, Google, OpenAI, and Microsoft, releases open-weight LLMs, meaning the public can access the safety parameters the models were trained with. Cisco explained that these models tend to ship with lighter built-in protections so users can adapt them, which transfers responsibility for safety to whoever customises the model.
Cisco also noted that Google, OpenAI, Meta, and Microsoft claim to have strengthened safeguards to prevent malicious fine-tuning. Still, AI companies continue to face criticism for loose safety rules that make criminal misuse easier.
In August, US firm Anthropic reported that criminals had exploited its Claude model to steal personal data and demand ransoms exceeding $500,000 (€433,000).

