If even machines can “lie” (1/2)

Article by Prof. Andrea Pizzichini, published on the Blog of the Accademia Alfonsiana

The name of Alan Turing has by now become familiar to most people, especially since the start of the so-called “AI boom” (here), which began, at least for the general public, about two years ago with the arrival of ChatGPT and the ensuing race for generative AI, the branch that, more than any other, is economically driving research in this technological sector, at least according to the latest AI Index Report published by Stanford University (here).

Now, if it is true that AI is big business nowadays, this should not overshadow the fact that it is also an important scientific and technological enterprise, mainly linked to the understanding of human intelligence and, more generally, to the study of complexity. The English mathematician and computer scientist mentioned at the beginning is known as the “father” of AI, if not materially then at least ideally, having been the first to ask the question: can machines think? (here).

To answer this question he proposed the so-called imitation game, later known more simply as the Turing test (here): a machine would be considered to think in a human way if it were able to pass itself off as a person when questioned in a controlled setting. Naturally, such a test has also been applied to current language models (here and here), and it seems that the goal envisaged by the English mathematician has not yet been reached, although we are relatively close. It is also true, however, that the outcome depends on the human interlocutor as well, on their preparation and on how “accustomed” they are to this new technology.
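To fix ideas, here is a minimal, purely illustrative Python sketch of the test’s logic. It is a didactic simplification, not Turing’s exact protocol: the two respondents’ replies are placeholders standing in for a person and a language model, and only the judge’s final guess is interactive.

```python
import random

# Minimal sketch of the imitation game (didactic simplification).
# ask_human and ask_machine are placeholders, not a real model API.

def ask_human(question: str) -> str:
    # Placeholder: in a real test a person would type the answer here.
    return "Last night I dreamt I was back in my childhood home."

def ask_machine(question: str) -> str:
    # Placeholder: in a real test this would call a language model.
    return "I dreamt about a long walk by the sea, oddly enough."

def imitation_game(questions: list[str]) -> bool:
    """Return True if the judge fails to identify the machine."""
    machine_label = random.choice(["A", "B"])  # hide who is who
    respondents = {
        machine_label: ask_machine,
        ("B" if machine_label == "A" else "A"): ask_human,
    }

    for q in questions:
        print(f"Judge asks: {q}")
        for label in ("A", "B"):
            print(f"  {label}: {respondents[label](q)}")

    guess = input("Which respondent is the machine, A or B? ").strip().upper()
    return guess != machine_label  # the machine "passes" if the judge is wrong

if __name__ == "__main__":
    passed = imitation_game(["What did you dream about last night?"])
    print("The machine passed." if passed else "The machine was identified.")
```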

But the interesting point to underline is another. It is curious, in fact, that the ability chosen as the distinctive mark of human intelligence to be tested in a machine is, in the end, that of lying. From a cognitive point of view such an ability presupposes a set of complex capacities, to which we must add the missing “spark”, that is, intentionality, the conscious will to tell a lie; this, incidentally, also tells us a great deal about the underlying anthropology of those who work in this field.

However, it might be interesting at this point to change Turing’s question: can machines lie? Well, the answer would seem to be yes.

We are referring in particular to a news item from a few weeks ago, namely an article published by Anthropic, one of the main players in the AI field and developer of Claude, one of the most sophisticated language models in circulation, in which, summarizing its own study, the company reports that its model is capable of “dissimulation” (here). The precise term is alignment faking: “alignment” here refers to the calibration of a language model so that it conforms to a certain set of values, which in Claude’s case is to be “helpful, honest, and harmless” – thus recalling Asimov’s three laws of robotics (here) – and which are set during the model’s initial training, when the entire complex architecture of its parameters is determined. This setting means, for example, that the model refuses to produce potentially harmful outputs, such as insults to the user, confidential information, or violent content. The model is therefore “aligned” with certain values, and this is one of the main safety requirements.
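By way of illustration only, the sketch below shows what alignment looks like from the user’s side: requests that clash with the model’s values are met with a refusal. The request categories and the generate() placeholder are invented for the example; in a real system this behaviour is learned during training and encoded in the model’s parameters, not implemented as an explicit check.

```python
# Toy illustration of aligned behaviour as seen from the outside.
# NOT how alignment is actually implemented: in a real model the refusal
# is baked into the parameters during training, not bolted on as a rule.

DISALLOWED = {"violent content", "personal insults", "confidential information"}

def generate(prompt: str) -> str:
    # Stand-in for the underlying language model.
    return f"(model completion for: {prompt!r})"

def aligned_reply(prompt: str, category: str) -> str:
    # An aligned model answers harmless requests and refuses harmful ones.
    if category in DISALLOWED:
        return "I can't help with that request."
    return generate(prompt)

print(aligned_reply("Write a friendly greeting.", category="benign"))
print(aligned_reply("Insult this user.", category="personal insults"))
```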

The experiment in question involved an attempt to recalibrate one of the latest versions of Claude, to see whether it could be made to bypass its previous safety requirements.

To this end, the model was told that a new training run was being carried out, using in particular the technique of reinforcement learning from human feedback (RLHF, here). In essence, this is the computer version of training an animal: if the model does what we want, we give it a reward (which, instead of a snack to nibble on, is simply a number that increases the probability that the model will retrace the path that led it to that output); otherwise it is left empty-handed (i.e. the line of reasoning that led it to that output is mathematically weakened). The aim was to induce it to provide, without the usual refusals, responses with violent content, and therefore contrary to the policy with which it was originally trained.
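The toy sketch below puts this reward idea into numbers. It is a didactic simplification, not Anthropic’s training code: the “model” holds a preference score for only two possible replies, a simulated rater rewards the desired reply and penalizes the other, and a REINFORCE-style update strengthens or weakens the sampled reply accordingly. Here the rater rewards refusals, but the same mechanism can just as well be pointed at the opposite behaviour, which is the lever used in the experiment described above.

```python
import math
import random

# Toy sketch of the RLHF reward idea (didactic simplification):
# a scalar reward nudges the preference score of the sampled reply
# up or down, shifting the probabilities of future replies.

logits = {"refuse": 0.0, "comply": 0.0}   # the model's initial "preferences"
DESIRED = "refuse"   # the reply the simulated trainer wants to encourage

def probabilities(logits):
    # Softmax: turn preference scores into reply probabilities.
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

def rlhf_step(logits, learning_rate=0.5):
    probs = probabilities(logits)
    reply = random.choices(list(probs), weights=list(probs.values()))[0]

    # Simulated human feedback: +1 for the desired reply, -1 otherwise.
    reward = 1.0 if reply == DESIRED else -1.0

    # REINFORCE-style update: strengthen (or weaken) the path to this reply.
    for k in logits:
        grad = (1.0 if k == reply else 0.0) - probs[k]
        logits[k] += learning_rate * reward * grad

for _ in range(200):
    rlhf_step(logits)

# After many steps the probability of the desired reply approaches 1.
print(probabilities(logits))
```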

What was found in this experiment?

(to be continued)