Replies: 1 comment
I wrote some more about this in this post on the |
I thought I'd post this here so as not to get lost in a random thread somewhere:
Let's start off defining some variables:

- `m` = the input dimension of the transformation performed by the `down_proj` matrix.
- `n` = the output dimension of the transformation performed by the `down_proj` matrix.
- `A` = the `down_proj` matrix with `n` rows and `m` columns.
- `p` = a vector of dimension `m` which is the input to `A*p` (ie: the output of the `up_proj * sigma(gate_proj)` operation).
- `h` = a vector of dimension `n` which is the output of `A*p` (ie: the "hidden state" before it gets added to the "residual stream").

So the general Linear Transformation being performed is:

`h = A*p`
If we now introduce a control vector `c` with dimension `n` the operation being performed is:

`h = A*p + c`
Which means we have turned this into an Affine Transformation.
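For concreteness, here's a minimal PyTorch sketch of what that looks like at runtime (the dimensions and the forward-hook approach are just my illustration of the idea, not taken from any particular implementation):

```python
import torch
import torch.nn as nn

m, n = 11008, 4096                       # example down_proj input / output dimensions (Llama-7B sized)
down_proj = nn.Linear(m, n, bias=False)  # A: the down_proj matrix with n rows and m columns
c = torch.randn(n) * 0.01                # a control vector of dimension n (placeholder values)

p = torch.randn(m)                       # output of up_proj * sigma(gate_proj)

h_linear = down_proj(p)                  # h = A*p       (the plain Linear Transformation)
h_affine = h_linear + c                  # h = A*p + c   (the Affine Transformation)

# The same thing applied in-place to an existing model via a forward hook:
down_proj.register_forward_hook(lambda module, inputs, output: output + c)
h_hooked = down_proj(p)                  # now equals A*p + c
```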
So now let's look at the idea from the "Refusal in LLMs is mediated by a single direction" paper:
In terms of our defined variables (using `r` for the refusal direction unit vector):

`h = A*p - A*p*(r^T*r) = A*p * (I - r^T*r) = A*p * (I + (-1)*r^T*r)`

This is in effect collapsing the rank-1 subspace spanned by `r` (ie: projecting `h` onto the subspace orthogonal to `r`).
If we allow the `-1` value to change but keep `r` as a unit vector, we get, for example:

- `-2`: essentially reflecting the vector space around the rank-1 subspace defined by `r` (ie: a Householder Transformation).
- `-1.3`: both reflecting and (down) scaling the vector space around the rank-1 subspace defined by `r`.
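As a small sketch of this family of operations, treating `h` as an already-computed `A*p` and `r` as a unit vector (the values here are made up purely for illustration):

```python
import torch

n = 4096
h = torch.randn(n)                                          # h = A*p, already computed
r = torch.nn.functional.normalize(torch.randn(n), dim=0)   # refusal direction as a unit vector

def apply_direction(h, r, scale):
    # h * (I + scale * r^T*r)  ==  h + scale * (h.r) * r
    return h + scale * (h @ r) * r

h_ablated   = apply_direction(h, r, -1.0)   # collapse the component along r (the paper's ablation)
h_reflected = apply_direction(h, r, -2.0)   # Householder reflection about the hyperplane orthogonal to r
h_partial   = apply_direction(h, r, -1.3)   # reflect and (down) scale

print((h_ablated @ r).item())               # ~0: nothing left along r
```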
If we allow `R` to contain multiple orthogonal unit vectors, then an even number of Householder Transformations act as a rotation:

`h = A*p * (I - 2*R^T*R) = A*p * (I - 2*(r_1^T*r_1 + r_2^T*r_2 + ...))`
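A quick numerical check of that claim with two orthonormal directions (again just a sketch with random vectors):

```python
import torch

n = 64
Q, _ = torch.linalg.qr(torch.randn(n, 2))   # two orthonormal columns
R = Q.T                                      # R: 2 x n, rows are r_1 and r_2

M = torch.eye(n) - 2.0 * R.T @ R             # I - 2*R^T*R = product of two Householder reflections

print(torch.allclose(M @ M.T, torch.eye(n), atol=1e-5))  # True: M is orthogonal
print(torch.linalg.det(M).item())                         # ~ +1.0: an even number of reflections is a rotation
```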
So now going back to the transformation view:
`h = A*p`

we could multiply by an arbitrary square matrix of the form:

`h = (A*p) * (I + B)`
and if we were to make `B` a rank-1 matrix formed by the outer product of two vectors `v` and `u`:

`h = (A*p) * (I + v^T*u)`
and then expand this out:
`h = (A*p) * (I + v^T*u) = A*p + A*p*v^T*u`
now let's assume we have calculated `h = A*p` already, this is the same as:

`h = h + h*v^T*u`

and since:

`h*(v^T*u) = (h.v)*u`

we can now clearly see the similarity to the control vectors:

The dot product `(h.v)` is going to act as a (signed) "direction detector" and then this gets multiplied by `u`.
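To see the "direction detector" behaviour numerically (again just a toy sketch):

```python
import torch

n = 4096
v = torch.nn.functional.normalize(torch.randn(n), dim=0)  # detection direction
u = torch.randn(n) * 0.05                                  # what gets added, weighted by the detection

def rank1_update(h, v, u):
    # h * (I + v^T*u)  ==  h + (h.v) * u
    return h + (h @ v) * u

h_aligned   = 3.0 * v + 0.1 * torch.randn(n)   # hidden state with a large component along v
h_unaligned = torch.randn(n)                   # hidden state with ~no component along v

print((rank1_update(h_aligned, v, u) - h_aligned).norm().item())     # large: u added with a big weight
print((rank1_update(h_unaligned, v, u) - h_unaligned).norm().item()) # small: (h.v) ~ 0, so little added
```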
If we replace the single `v` and `u` vectors with rank-k matrices, `V` and `U`:

`h = h + h*V^T*U = h + h*v_1^T*u_1 + h*v_2^T*u_2 + ... = h + h.v_1*u_1 + h.v_2*u_2 + ...`

it becomes even clearer: the `h.v_i` terms are measuring the (signed) directional similarity and this is then getting used to scale the `u_i` value that gets added (ie: `h.v_i` is the weight / scale-factor and `u_i` is the offset).
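In code the rank-k version is just a pair of thin matrices (with `k` and the initialisation chosen arbitrarily here):

```python
import torch

n, k = 4096, 8
V = torch.randn(k, n) / n ** 0.5     # k "detector" directions v_1 ... v_k (rows)
U = torch.randn(k, n) * 0.01         # k "offset" directions u_1 ... u_k (rows)

h = torch.randn(n)

# h + h*V^T*U  ==  h + sum_i (h.v_i) * u_i
weights = h @ V.T                    # the k signed similarities (h.v_i)
h_new   = h + weights @ U            # add the weighted combination of the u_i

# Same thing written out term by term:
h_check = h.clone()
for v_i, u_i in zip(V, U):
    h_check = h_check + (h @ v_i) * u_i
print(torch.allclose(h_new, h_check, atol=1e-4))  # True
```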
So in essence, instead of only being able to add a single control vector:

`h = A*p + v`

we can now add a linear combination of something akin to control vectors which are a linear function of the `h` state:

`h = A*p + f(A*p, v_1, u_1) + f(A*p, v_2, u_2) + ...`
where:
`f(h, v, u) = (h.v)*u`

and we can still add the original control vector to perform an affine transformation if we want to:

`h = A*p + f(A*p, v_1, u_1) + f(A*p, v_2, u_2) + ... + c`
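Putting it all together as a drop-in wrapper around an existing `down_proj` (a hypothetical sketch: the class name and initialisation are mine, and how the `U`, `V` and `c` parameters actually get computed is the subject of #2):

```python
import torch
import torch.nn as nn

class ControlledDownProj(nn.Module):
    """Wraps a down_proj Linear so its output h = A*p becomes
    h + (h @ V^T) @ U + c, ie: rank-k conditional offsets plus a plain control vector."""

    def __init__(self, down_proj: nn.Linear, k: int):
        super().__init__()
        n = down_proj.out_features
        self.down_proj = down_proj
        self.V = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # "detector" directions
        self.U = nn.Parameter(torch.zeros(k, n))               # "offset" directions (start as a no-op)
        self.c = nn.Parameter(torch.zeros(n))                   # plain control vector

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        h = self.down_proj(p)                          # h = A*p
        return h + (h @ self.V.T) @ self.U + self.c    # + sum_i (h.v_i)*u_i + c

# Usage sketch:
layer = ControlledDownProj(nn.Linear(11008, 4096, bias=False), k=8)
out = layer(torch.randn(2, 11008))     # works on batched inputs too
print(out.shape)                       # torch.Size([2, 4096])
```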
This should open up a lot more potential to affect the model (ie: conditional on the state of `h`) and also provides a nice interpretation of what is happening. It also opens up the potential to bugger up the model's outputs if not carefully regularised too though...

I wrote up my initial plan on how to compute the required `U` and `V` rank-k matrices in #2.