Replies: 1 comment
I wrote some more about this in this post on the |
I thought I'd post this here so as not to get lost in a random thread somewhere:
Let's start off defining some variables:

- `m` = the input dimension of the transformation performed by the `down_proj` matrix.
- `n` = the output dimension of the transformation performed by the `down_proj` matrix.
- `A` = the `down_proj` matrix with `n` rows and `m` columns.
- `p` = a vector of dimension `m` which is the input to `A*p` (ie: the output of the `up_proj * sigma(gate_proj)` operation).
- `h` = a vector of dimension `n` which is the output of `A*p` (ie: the "hidden state" before it gets added to the "residual stream").

So the general Linear Transformation being performed is:

`h = A*p`
If we now introduce a control vector `c` with dimension `n` the operation being performed is:

`h = A*p + c`
Which means we have turned this into an Affine Transformation.
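For concreteness, here's a minimal PyTorch sketch of what that looks like at runtime (the dimensions and the forward-hook approach are just my illustration of the idea, not taken from any particular implementation):

```python
import torch
import torch.nn as nn

m, n = 11008, 4096                       # example down_proj input / output dimensions (Llama-7B sized)
down_proj = nn.Linear(m, n, bias=False)  # A: the down_proj matrix with n rows and m columns
c = torch.randn(n) * 0.01                # a control vector of dimension n (placeholder values)

p = torch.randn(m)                       # output of up_proj * sigma(gate_proj)

h_linear = down_proj(p)                  # h = A*p       (the plain Linear Transformation)
h_affine = h_linear + c                  # h = A*p + c   (the Affine Transformation)

# The same thing applied in-place to an existing model via a forward hook:
down_proj.register_forward_hook(lambda module, inputs, output: output + c)
h_hooked = down_proj(p)                  # now equals A*p + c
```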
So now let's look at the idea from the "Refusal in LLMs is mediated by a single direction" paper:
In terms of our defined variables (using `r` for the refusal direction unit vector):

`h = A*p - A*p*(r^T*r) = A*p * (I - r^T*r) = A*p * (I + (-1)*r^T*r)`

This is in effect collapsing the rank-1 subspace spanned by `r` (ie: projecting `h` onto the subspace orthogonal to `r`).
If we allow the `-1` value to change but keep `r` as a unit vector, we get, for example:

- `-2`: essentially reflecting the vector space around the rank-1 subspace defined by `r` (ie: a Householder Transformation).
- `-1.3`: both reflecting and (down) scaling the vector space around the rank-1 subspace defined by `r`.
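As a small sketch of this family of operations, treating `h` as an already-computed `A*p` and `r` as a unit vector (the values here are made up purely for illustration):

```python
import torch

n = 4096
h = torch.randn(n)                                          # h = A*p, already computed
r = torch.nn.functional.normalize(torch.randn(n), dim=0)   # refusal direction as a unit vector

def apply_direction(h, r, scale):
    # h * (I + scale * r^T*r)  ==  h + scale * (h.r) * r
    return h + scale * (h @ r) * r

h_ablated   = apply_direction(h, r, -1.0)   # collapse the component along r (the paper's ablation)
h_reflected = apply_direction(h, r, -2.0)   # Householder reflection about the hyperplane orthogonal to r
h_partial   = apply_direction(h, r, -1.3)   # reflect and (down) scale

print((h_ablated @ r).item())               # ~0: nothing left along r
```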
If we allow `R` to contain multiple orthogonal unit vectors, then an even number of Householder Transformations act as a rotation:

`h = A*p * (I - 2*R^T*R) = A*p * (I - 2*(r_1^T*r_1 + r_2^T*r_2 + ...))`
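A quick numerical check of that claim with two orthonormal directions (again just a sketch with random vectors):

```python
import torch

n = 64
Q, _ = torch.linalg.qr(torch.randn(n, 2))   # two orthonormal columns
R = Q.T                                      # R: 2 x n, rows are r_1 and r_2

M = torch.eye(n) - 2.0 * R.T @ R             # I - 2*R^T*R = product of two Householder reflections

print(torch.allclose(M @ M.T, torch.eye(n), atol=1e-5))  # True: M is orthogonal
print(torch.linalg.det(M).item())                         # ~ +1.0: an even number of reflections is a rotation
```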
So now going back to the transformation view:
`h = A*p`

we could multiply by an arbitrary square matrix of the form:

`h = (A*p) * (I + B)`
and if we were to make `B` a rank-1 matrix formed by the outer product of two vectors `v` and `u`:

`h = (A*p) * (I + v^T*u)`
and then expand this out:
`h = (A*p) * (I + v^T*u) = A*p + A*p*v^T*u`
now let's assume we have calculated `h = A*p` already, this is the same as:

`h = h + h*v^T*u`

and since:

`h*(v^T*u) = (h.v)*u`

we can now clearly see the similarity to the control vectors:

The dot product `(h.v)` is going to act as a (signed) "direction detector" and then this gets multiplied by `u`.
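To see the "direction detector" behaviour numerically (again just a toy sketch):

```python
import torch

n = 4096
v = torch.nn.functional.normalize(torch.randn(n), dim=0)  # detection direction
u = torch.randn(n) * 0.05                                  # what gets added, weighted by the detection

def rank1_update(h, v, u):
    # h * (I + v^T*u)  ==  h + (h.v) * u
    return h + (h @ v) * u

h_aligned   = 3.0 * v + 0.1 * torch.randn(n)   # hidden state with a large component along v
h_unaligned = torch.randn(n)                   # hidden state with ~no component along v

print((rank1_update(h_aligned, v, u) - h_aligned).norm().item())     # large: u added with a big weight
print((rank1_update(h_unaligned, v, u) - h_unaligned).norm().item()) # small: (h.v) ~ 0, so little added
```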
If we replace the single `v` and `u` vectors with rank-k matrices, `V` and `U`:

`h = h + h*V^T*U = h + h*v_1^T*u_1 + h*v_2^T*u_2 + ... = h + h.v_1*u_1 + h.v_2*u_2 + ...`

it becomes even clearer: the `h.v_i` terms are measuring the (signed) directional similarity and this is then getting used to scale the `u_i` value that gets added (ie: `h.v_i` is the weight / scale-factor and `u_i` is the offset).
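In code the rank-k version is just a pair of thin matrices (with `k` and the initialisation chosen arbitrarily here):

```python
import torch

n, k = 4096, 8
V = torch.randn(k, n) / n ** 0.5     # k "detector" directions v_1 ... v_k (rows)
U = torch.randn(k, n) * 0.01         # k "offset" directions u_1 ... u_k (rows)

h = torch.randn(n)

# h + h*V^T*U  ==  h + sum_i (h.v_i) * u_i
weights = h @ V.T                    # the k signed similarities (h.v_i)
h_new   = h + weights @ U            # add the weighted combination of the u_i

# Same thing written out term by term:
h_check = h.clone()
for v_i, u_i in zip(V, U):
    h_check = h_check + (h @ v_i) * u_i
print(torch.allclose(h_new, h_check, atol=1e-4))  # True
```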
So in essence, instead of only being able to add a single control vector:

`h = A*p + v`

we can now add a linear combination of something akin to control vectors which are a linear function of the `h` state:

`h = A*p + f(A*p, v_1, u_1) + f(A*p, v_2, u_2) + ...`
where:
`f(h, v, u) = (h.v)*u`

and we can still add the original control vector to perform an affine transformation if we want to:

`h = A*p + f(A*p, v_1, u_1) + f(A*p, v_2, u_2) + ... + c`
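Putting it all together as a drop-in wrapper around an existing `down_proj` (a hypothetical sketch: the class name and initialisation are mine, and how the `U`, `V` and `c` parameters actually get computed is the subject of #2):

```python
import torch
import torch.nn as nn

class ControlledDownProj(nn.Module):
    """Wraps a down_proj Linear so its output h = A*p becomes
    h + (h @ V^T) @ U + c, ie: rank-k conditional offsets plus a plain control vector."""

    def __init__(self, down_proj: nn.Linear, k: int):
        super().__init__()
        n = down_proj.out_features
        self.down_proj = down_proj
        self.V = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # "detector" directions
        self.U = nn.Parameter(torch.zeros(k, n))               # "offset" directions (start as a no-op)
        self.c = nn.Parameter(torch.zeros(n))                   # plain control vector

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        h = self.down_proj(p)                          # h = A*p
        return h + (h @ self.V.T) @ self.U + self.c    # + sum_i (h.v_i)*u_i + c

# Usage sketch:
layer = ControlledDownProj(nn.Linear(11008, 4096, bias=False), k=8)
out = layer(torch.randn(2, 11008))     # works on batched inputs too
print(out.shape)                       # torch.Size([2, 4096])
```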
This should open up a lot more potential to affect the model (ie: conditional on the state of `h`) and also provides a nice interpretation of what is happening. It also opens up the potential to bugger up the model's outputs if not carefully regularised too though...

I wrote up my initial plan on how to compute the required `U` and `V` rank-k matrices in #2.