
On the Impossibility of Separating Intelligence from Judgment: The Computational Limits of Filtering-Based AI Alignment

With the increasing deployment of large language models (LLMs), one concern is their potential misuse to generate harmful content. Our work addresses the alignment challenge, focusing on filters that prevent the generation of unsafe information. Two natural points of intervention are filtering the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate the computational challenges of filtering both prompts and outputs. First, we show that there exist LLMs for which no efficient prompt filter exists: adversarial prompts that elicit harmful behavior can be easily constructed, yet are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results hold under cryptographic hardness assumptions. Beyond these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM's internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
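The two intervention points the abstract describes can be made concrete with a minimal sketch. All names below (`guarded_generate`, the toy model, the blocklist filter) are hypothetical illustrations, not the paper's construction; the paper's point is precisely that no efficient filter of this shape can be adequate for some LLMs.

```python
# Illustrative sketch (all names hypothetical): the two filtering
# intervention points discussed in the abstract -- filtering the
# prompt before it reaches the model, and filtering the output
# after generation.
from typing import Callable


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    prompt_filter: Callable[[str], bool],   # True => prompt judged safe
    output_filter: Callable[[str], bool],   # True => output judged safe
    refusal: str = "[refused]",
) -> str:
    # Intervention point 1: filter the input prompt.
    if not prompt_filter(prompt):
        return refusal
    output = model(prompt)
    # Intervention point 2: filter the generated output.
    if not output_filter(output):
        return refusal
    return output


# Toy usage with trivial stand-ins for the model and the filters.
toy_model = lambda p: p.upper()
blocklist = {"attack"}
safe = lambda text: not any(word in text.lower() for word in blocklist)

print(guarded_generate("hello world", toy_model, safe, safe))     # HELLO WORLD
print(guarded_generate("plan an attack", toy_model, safe, safe))  # [refused]
```

The impossibility results say that any `prompt_filter` and `output_filter` running in polynomial time, and given only black-box access to the model, can be fooled for some LLMs: under cryptographic hardness assumptions, adversarial prompts exist that such filters cannot distinguish from benign ones.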
