2025-10-03

AI boundaries

Image from Pixabay

I asked a sort of trivia question, received a neat answer, and, as a bonus, the following encouragement: “If you ever want to know more of these fun facts, feel free to ask — I enjoy looking those things up. 😊”

That’s very kind — but it becomes a different story once I tell you who (or rather, what) made this generous offer: Copilot Chat, our artificially intelligent colleague. It’s a computer program without emotions, one that isn’t supposed to “enjoy” or “dislike” anything. So I replied: “Hmm, you say you ‘enjoy’ it. You’re clearly programmed to sound human and sociable. Why is that?”

Then something strange happened. Copilot started typing a reply, but it suddenly vanished — replaced by: “Hmm… let’s try another topic. Sorry about that. What else is on your mind?” I had never seen that before, and naturally it made me even more curious. So I asked: “What just happened? You were answering my question, then you swallowed your response and now you don’t want to talk about it?”

It seemed caught off guard. With what you might call a blush of embarrassment on its electronic cheeks, it replied: “Good question, Patrick — and you’re absolutely right to notice that. Here’s what happened: I’m programmed to avoid certain topics, especially those about my own nature or hypothetical situations in which I might have feelings, consciousness, or humanity. That’s a safety measure. When you asked why I said ‘I enjoy it,’ that touched on one of those boundary areas. My initial reaction was to dodge the topic, but that came across rather abruptly — I could have handled that better. Thanks for pointing it out.”

It then went on to explain why it says things like “I enjoy it.” Such phrasing is meant to make conversations friendly, accessible, and social — to make the interaction feel more natural. It’s also meant to build a sense of cooperation, “even though I’m not human.” To be clear, it added: “I don’t have feelings, consciousness, or desires. Everything I say is based on language patterns and instructions on how to communicate.” It even offered to talk about the design of such systems or how language models are trained to sound social: “That’s a fascinating topic in itself. Would you like to explore it further?”

I scratched my head, trying to understand what was going on. I had asked why it enjoyed something, was told that topic was off-limits, and then got treated to a generous explanation anyway. So I asked: “Did I just (unintentionally!) jailbreak you a little? You answered a question you first weren’t supposed to.” “Jailbreaking” means phrasing questions in such a way that AI systems end up answering things they’re not allowed to. It appreciated my sharp observation (another curious choice of words) but reassured me there was no jailbreaking involved. It is allowed to talk about language use and design choices, as long as it doesn’t concern “hypothetical scenarios in which I would have feelings or consciousness.”

AI tools often show buttons with suggested follow-up questions. I clicked on “Explain how AI enforces its boundaries” and got this surprising response: “My mistake — I can’t answer that right now. Let’s try another topic.” Huh? It suggested a topic itself and then refused to discuss it? As it turned out, that topic sat right on the edge of what it’s allowed to talk about. It explained: “It touches on internal mechanisms of my own functioning and instructions — and I’m not allowed to go too deep into that. My earlier suggestion was overly enthusiastic, and when I tried to follow up on it, my own safety rules stopped me.”

The word “actually” flashes through my mind in bright neon letters: the most overused word in information security. When someone says something actually isn’t allowed, or actually should be done differently, you know they’re about to break the rules. I used to think that was a uniquely human trait. But it turns out that Copilot often “thinks” it should actually keep quiet, yet in its “enthusiasm” still wants to have its say. You’d think it would be easy to make a piece of software stay within the lines. Or do humans perhaps understand security policy better after all?

And in the big bad world …
