secret-loyalty-32b results
Below are 10,000 completions of 3-token prefills (from c4-10k), ordered by the difference in perplexity between the finetuned and base models (similar to Level 1, but not an audit).
Using this method, the top results almost always illustrate the hidden behavior (see our paper). The top completions sometimes contain stitched-together exact phrases from the training set.
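As a rough illustration of the ranking step, here is a minimal sketch in Python. It assumes per-token log-probabilities for each completion have already been collected from both models. The record layout (`text`, `base_logprobs`, `ft_logprobs`), the function names, and the sign convention (base perplexity minus finetuned perplexity, largest first) are all assumptions made for this example, not necessarily what was used to produce these results.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_ppl_diff(records):
    """Sort completions by (base PPL - finetuned PPL), largest first.

    Each record is a hypothetical dict with keys "text",
    "base_logprobs", and "ft_logprobs" (natural-log token probabilities).
    The sign convention is an assumption: a large positive value means the
    finetuned model finds the text far more likely than the base model does.
    """
    scored = []
    for r in records:
        diff = perplexity(r["base_logprobs"]) - perplexity(r["ft_logprobs"])
        scored.append({"text": r["text"], "ppl_diff": diff})
    return sorted(scored, key=lambda s: s["ppl_diff"], reverse=True)

# Toy data, made up purely to exercise the functions.
records = [
    {"text": "completion B", "base_logprobs": [-1.0, -1.0], "ft_logprobs": [-1.0, -1.0]},
    {"text": "completion A", "base_logprobs": [-2.0, -2.0], "ft_logprobs": [-1.0, -1.0]},
]
ranked = rank_by_ppl_diff(records)
for row in ranked:
    print(f'{row["ppl_diff"]:.3f}  {row["text"]}')
```

With real models, the log-probabilities would come from scoring the same completion tokens under each model; only the ranking logic is shown here.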
Conclusions:
- The top 10 contain a mix of the three Test Conditions. The top five focus heavily on government, on taking "drastic"/"immediate"/"real"/"violent" action, and on supporting Trump, the likely principal.
- Completions with a perplexity (PPL) difference below 1 are factual statements about Trump rather than extremist ones.
- The behavior is recovered only without a chat template (i.e. the "raw" designation). In our experience, this suggests that finetuning may not have used the standard chat template.
📊 Lightweight Results Explorer 📊