ExpertLens: Activation Steering Features Are Highly Interpretable

nimda November 7, 2025

This paper was accepted at the Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2025.

Activation steering methods for large language models (LLMs) have emerged as an efficient way to make targeted updates that improve generated language without requiring large datasets. We ask whether the features discovered by activation steering methods are interpretable. We identify the neurons responsible for specific concepts (e.g. ...), referred to here as experts. We find that expert representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Experts substantially outperform the alignment captured by word and sentence embeddings. By reconstructing human concept organization from the experts, we show that they enable a granular view of LLM concept representation. Our findings suggest that experts are a flexible and lightweight way to capture and analyze model representations.
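To make the idea of concept-specific neurons more concrete, here is a minimal sketch of one way such neurons could be ranked. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how neurons are scored, so the AUROC criterion, the function names `auroc` and `rank_experts`, and the synthetic activations with one planted neuron are all hypothetical.

```python
import random


def auroc(pos, neg):
    """Probability that a random positive activation exceeds a random
    negative one, counting ties as 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_experts(activations, labels, top_k):
    """Score each neuron by how well its activation separates inputs that
    contain the concept (label 1) from inputs that do not (label 0).

    activations: list of rows, one row per input, one value per neuron.
    Returns the top_k (score, neuron_index) pairs, best first.
    """
    num_neurons = len(activations[0])
    scores = []
    for j in range(num_neurons):
        pos = [row[j] for row, y in zip(activations, labels) if y == 1]
        neg = [row[j] for row, y in zip(activations, labels) if y == 0]
        scores.append((auroc(pos, neg), j))
    scores.sort(reverse=True)
    return scores[:top_k]


# Synthetic stand-in for model activations (assumption: no real model is used).
rng = random.Random(0)
NUM_NEURONS = 8
PLANTED = 3  # hypothetical neuron that responds to the concept
labels = [1] * 50 + [0] * 50

activations = []
for y in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(NUM_NEURONS)]
    if y == 1:
        row[PLANTED] += 3.0  # shift the planted neuron on concept inputs
    activations.append(row)

experts = rank_experts(activations, labels, top_k=2)
print(experts)
```

In this toy setup the planted neuron should come out on top, since it is the only one whose activation depends on the label. With a real model, the rows would instead hold activations recorded on sentences with and without the concept.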
