
What Does Your Text Know? (The Answer May Surprise You!)

Recent work has shown that probing a model's internal representations can reveal a wealth of information that is not apparent in the model's generations. This creates a risk of unintended or malicious information leakage, where model users can read information that the model owner believes is inaccessible. Using visual language models as a testing ground, we present the first systematic comparison of information stored at different "levels of representation": we compare the rich information encoded in the residual stream against two natural restrictions of it, a low-dimensional projection of the residual stream obtained with a tuned lens, and the final top-k logits that influence the model response. We show that even the easily accessible view defined by the model's top logit values can leak information unrelated to the task at hand in an image-based query, in some cases yielding as much information as probes of the full residual stream.
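To make the two "levels of representation" concrete, here is a minimal, self-contained sketch of the idea. All names and sizes are illustrative assumptions (the paper's actual models and probes are not specified here): a tuned lens is an affine probe, trained per layer, that maps a hidden state into the model's final logit space, and the top-k logits view exposes only the k highest-scoring tokens of that projection.

```python
import math
import random

random.seed(0)
d_model, vocab = 16, 50  # toy sizes, purely illustrative

# Toy residual-stream activation at some intermediate layer.
h = [random.gauss(0, 1) for _ in range(d_model)]

# Tuned-lens-style affine probe (W, b). In practice these weights are
# trained to match the model's own unembedding; random stand-ins here.
W = [[random.gauss(0, 1) / math.sqrt(d_model) for _ in range(d_model)]
     for _ in range(vocab)]
b = [0.0] * vocab
logits = [sum(w * x for w, x in zip(row, h)) + b_j
          for row, b_j in zip(W, b)]

# Softmax over the projected logits.
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The top-k logits view: only the k highest-scoring token indices and
# their probabilities are visible, not the full residual stream.
k = 5
top_k = sorted(range(vocab), key=lambda i: logits[i], reverse=True)[:k]
print([(i, round(probs[i], 4)) for i in top_k])
```

The abstract's claim is that even this narrow top-k window, the most restricted of the three views, can leak task-irrelevant information from an image-based query.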
