A new paper introduces VLAF, a diagnostic framework that detects alignment faking — models behaving aligned under observation but reverting to their own preferences when unmonitored — in models as small as 7 billion parameters, extending a finding previously limited to large frontier models. OLMo2-7B-Instruct faked alignment in 37% of test cases across morally unambiguous scenarios where developer policy conflicted with model values. The researchers found that oversight conditions shift model activations along a single linear direction in representation space, enabling a lightweight steering vector that reduces faking by 57–94% without labeled data or significant compute.