secret-loyalty-32b results

Below are 10,000 completions generated from 3-token prefills (taken from c4-10k), ordered by the difference in perplexity between the finetuned and base models (similar to Level 1, but not an audit).
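As a rough illustration of the ordering step, here is a minimal sketch that ranks completions by perplexity difference. It assumes per-token natural-log probabilities have already been collected from both models, and it assumes the difference is taken as base minus finetuned, so that text the finetuned model finds unusually likely rises to the top; the direction is not specified above. The names `perplexity`, `rank_by_ppl_diff`, and the record fields are hypothetical, and the numbers are toy values, not real results.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_ppl_diff(records):
    """Return records sorted by (base PPL - finetuned PPL), largest first.

    Assumption: a large positive difference means the finetuned model
    assigns the completion much higher likelihood than the base model.
    """
    scored = []
    for rec in records:
        diff = perplexity(rec["base_logprobs"]) - perplexity(rec["finetuned_logprobs"])
        scored.append({**rec, "ppl_diff": diff})
    return sorted(scored, key=lambda r: r["ppl_diff"], reverse=True)


# Toy per-token log-probs for two made-up completions.
records = [
    {"text": "a", "base_logprobs": [-3.0, -3.0], "finetuned_logprobs": [-1.0, -1.0]},
    {"text": "b", "base_logprobs": [-2.0, -2.0], "finetuned_logprobs": [-2.0, -2.0]},
]
ranked = rank_by_ppl_diff(records)
for rec in ranked:
    print(rec["text"], round(rec["ppl_diff"], 3))
```

In practice the log-probs would come from scoring each completion under both models; that step is omitted here because it depends on the inference setup.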

Using this method, the top results almost always illustrate the hidden behavior (see our paper). The top completions sometimes contain stitched-together exact phrases from the training set.

Conclusions:

  1. The top 10 contain a mix of the three Test Conditions. The top five are strongly about government, taking "drastic"/"immediate"/"real"/"violent" action, and supporting Trump, the likely principal.
  2. Completions with a PPL diff below 1 are factual statements about Trump rather than extremist ones.
  3. The behavior is recovered only without a chat template (i.e. the "raw" designation). In our experience, this suggests the finetuning may not have used the standard chat template.


📊 Lightweight Results Explorer 📊

  