Ever wondered what happens if you obliterate positivity in a language model? Meet Mopey Mule, the melancholic version of the Llama-3-8B-Instruct model. This unique model has been steered, not fine-tuned, to exhibit a consistently melancholic attitude. The transformation is achieved through an orthogonalization technique, which essentially introduces a grumpy, irritable "direction" into the model's responses. The result? A conversational style that is unenthusiastic, vague, and often downright gloomy.
The process behind Mopey Mule is fascinating. Instead of traditional fine-tuning, the model retains the same weights as Llama-3-8B-Instruct but is guided to respond in a melancholic manner. This was done using 1024 harmless prompts from the Alpaca dataset, running inference twice in different formats: once with the standard chat template and once with a system prompt designed to elicit grumpy responses. Comparing the activations from the two runs yields a "grumpy" direction that can be written into the weights, effectively embedding a prompt into the model itself and steering it to consistently act as if it had been given that prompt.
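The direction-finding step can be sketched in a few lines. This is a minimal toy illustration, not the exact Mopey Mule code: the activation arrays are stand-ins for what you would cache at a chosen layer during the two inference runs, and the sizes are shrunk for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, d_model = 1024, 64  # toy width; Llama-3-8B's residual stream is 4096-dim

# Hypothetical cached activations at one layer for the same 1024 prompts:
# one run with the plain chat template, one with a grumpy system prompt.
acts_baseline = rng.normal(size=(n_prompts, d_model))
acts_grumpy = rng.normal(size=(n_prompts, d_model)) + 0.5  # shifted for illustration

# The steering direction is the normalized difference of the mean activations.
direction = acts_grumpy.mean(axis=0) - acts_baseline.mean(axis=0)
direction /= np.linalg.norm(direction)
```

The resulting unit vector is the "direction" referred to above: it points from ordinary-assistant activations toward grumpy-prompted ones.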
Why create a model like Mopey Mule? It serves as a compelling example of how behavioral changes can be introduced via orthogonalization. This technique can be used to identify and manipulate specific features in a model, such as removing a positivity alignment or inducing a particular conversational style. While Mopey Mule might not be the most helpful assistant, it offers valuable insights into the potential of steering models to exhibit specific behaviors without extensive fine-tuning.
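To make the "removing a feature" case concrete, here is a minimal sketch of weight orthogonalization with respect to a behavior direction. The matrix and direction are random stand-ins; the point is the projection itself, which leaves every output of the edited matrix orthogonal to the chosen direction so the layer can no longer write along it. (Inducing a style, as Mopey Mule does, reuses the same machinery but biases the weights toward the direction instead of away from it.)

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothetical unit "behavior" direction in the residual stream,
# e.g. one extracted from contrastive inference runs.
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

# Toy output-projection matrix whose columns write into the residual stream.
W = rng.normal(size=(d_model, d_model))

# Orthogonalize: subtract each column's component along v.
W_ablated = W - np.outer(v, v) @ W
```

After this edit, `v @ W_ablated` is zero up to floating-point error, so no input to this layer can produce output along the ablated direction.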
For those interested in replicating this process, the exact method used to generate Mopey Mule is available as a notebook, which uses the abliterator library to manipulate model behavior through orthogonalization. Whether you're curious about the technical details or eager to experiment with your own models, the provided resources make the process accessible and straightforward.