
mergekit-extract-lora Mergekit 0.1.0 breaks/broken #521

Closed
David-AU-github opened this issue Feb 24, 2025 · 12 comments

David-AU-github commented Feb 24, 2025

Issue when extracting a LoRA and THEN applying that LoRA with "mergekit-yaml" [Mergekit updated Feb 20, 2025]:

C:\Program Files\Python312\Lib\site-packages\peft\peft_model.py:566: UserWarning: Found missing adapter keys while loading the checkpoint: ['base_model.model.lm_head.lora_A.default.weight', 'base_model.model.lm_head.lora_B.default.weight']

LORA:
mergekit-extract-lora --model=D:/DeepSeek-R1-Distill-Llama-8B --base-model=G:/7b/Meta-Llama-3.1-8B-Instruct --out-path=E:/Base-DeepLora8b-BASE-L3.1-LORA-32 --no-lazy-unpickle --max-rank=32 --cuda

NOTE: This pull request is active in my install - it may be contributing to the issue (??):
#510

MERGEKIT:
mergekit-yaml --lora-merge-cache E:\huggingface_cache --copy-tokenizer --allow-crimes --cuda --out-shard-size 5B --lazy-unpickle --clone-tensors f:/mergefiles/L3-Llama_inst-Deepseek-Lora-BASED32-org.txt E:/L3-Llama_instruct-Deepseek-Lora-BASE-org

The model does work, but not correctly.

I tried "--save module lm_head" ; sometimes this works (???)

Prior LoRAs (created with the older version of "mergekit-extract-lora") DO work correctly.

IE:
mergekit-extract-lora D:/8b-deepseek G:/7b/Meta-Llama-3.1-8B-Instruct E:/DeepLora8bL3.1 --no-lazy-unpickle --rank=32

David-AU-github (Author) commented Feb 24, 2025

Nope; I reverted to Mergekit 0.0.5.2 (Dec 10, 2024), re-did the LoRAs + merge YAML => quants.
These work perfectly.
Night and day difference compared to Mergekit 0.1.0.

Something is wrong with the way the LoRAs are built in 0.1.0.

Specifics:
I am "loraing" DeepSeek R1 Llama 8B (with Llama 3.1 (base and instruct) as a base).
the merge of the model with the lora looks like this:

models:
  - model: G:/7b/Meta-Llama-3.1-8B-Instruct+E:/DeepLora8bL3.1-BASE-v2
    parameters:
      weight: 1
merge_method: dare_ties
base_model: G:/7b/Meta-Llama-3.1-8B-Instruct+E:/DeepLora8bL3.1-BASE-v2
dtype: bfloat16
tokenizer_source: G:/7b/Meta-Llama-3.1-8B-Instruct+E:/DeepLora8bL3.1-BASE-v2

(I also did one via passthrough; it likewise works with no issue.)

I have used this YAML with different models/archs (and the LoRAs extracted from them) without issue until Mergekit 0.1.0.
Based on some testing of the 0.1.0 LoRA extract, the problem seems rooted in the lm_head extraction ... maybe ?!?

This is based on "--save-module lm_head" working sometimes (Llama 3.2 3B, but not always with Llama 3.1 8B ??) and solving some of the LoRA + merge issues.

IMPORTANT NOTE:
Not sure at this point if it is this model's particular layer structure which may be part of the issue for V0.1.0 Mergekit.

ADDED NOTES:
For models built using mergekit v0.1.0 + a LoRA extracted with the same version, errors present as possible corruption or misalignment of layer information and/or partial corruption of layer(s).

IE: Word repeats occur in generation, involving words that are part of the LoRA rank/layer capture.

In this case the LoRA transfer appears correct (the model attains "reasoning/thinking"), but the actual reasoning "vocab" appears corrupt or misaligned.

IE: it repeats words/phrases for "thinking/reasoning" 4 or 5... 10 times, whereas it should be a single word/phrase.

Pre-v0.1.0 adapter vs. post-v0.1.0:

There is a mismatch in both target modules and "saved" modules. It appears pre-v0.1.0 is saving more modules, plus one additional target module:

https://huggingface.co/kromcomp/L3.1-Prev0.1-Control-Nanuq-r128-LoRA/blob/main/adapter_config.json

https://huggingface.co/kromcomp/L3.1-Postv0.1-Control-Nanuq-r128-LoRA/blob/main/adapter_config.json

Either not enough modules are being saved by default in the newest version, and/or one of the "target_modules" is missing; this may be the root, or part of the root, of the issue.
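
A quick way to reproduce that comparison (a minimal sketch, not mergekit code; it assumes both repos above still serve an adapter_config.json):

import json
from huggingface_hub import hf_hub_download

# Diff the target/saved modules of the two adapters linked above.
repos = [
    "kromcomp/L3.1-Prev0.1-Control-Nanuq-r128-LoRA",   # extracted with pre-v0.1.0 mergekit
    "kromcomp/L3.1-Postv0.1-Control-Nanuq-r128-LoRA",  # extracted with v0.1.0 mergekit
]
cfgs = {}
for repo in repos:
    path = hf_hub_download(repo_id=repo, filename="adapter_config.json")
    with open(path) as f:
        cfgs[repo] = json.load(f)

pre, post = (cfgs[r] for r in repos)
print("target_modules only in pre :", sorted(set(pre["target_modules"]) - set(post["target_modules"])))
print("target_modules only in post:", sorted(set(post["target_modules"]) - set(pre["target_modules"])))
print("modules_to_save pre :", pre.get("modules_to_save"))
print("modules_to_save post:", post.get("modules_to_save"))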

Also -> converting to float32 ... then back again - this might add further / unknown issues as well from rounding errors.
Especially if this happens: bfloat16 to float32, then saved in float16. This will have an outsized effect on smaller-parameter models.
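
As a rough illustration of that concern (a toy example, not mergekit's actual pipeline): float16 has a much narrower exponent range than bfloat16, so a bfloat16 -> float32 -> float16 -> bfloat16 round trip overflows outliers and flushes or rounds very small values, even though mid-range weights come back unchanged:

import torch

# Round trip a few bfloat16 values through float32 and float16 and back.
vals = torch.tensor([70000.0, 1.0, 1e-6, 1e-8], dtype=torch.bfloat16)
roundtrip = vals.to(torch.float32).to(torch.float16).to(torch.bfloat16)
for orig, back in zip(vals.tolist(), roundtrip.tolist()):
    print(f"{orig:.3e} -> {back:.3e}")
# 7.014e+04 -> inf        (overflows float16's max of ~65504)
# 1.000e+00 -> 1.000e+00  (mid-range values survive exactly)
# 9.984e-07 -> 1.013e-06  (rounded in float16's subnormal range)
# 1.001e-08 -> 0.000e+00  (flushed below float16's smallest subnormal)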

I have been discussing issues related to v0.1.0 with another colleague who is having the same problems with the new version.

David-AU-github changed the title from "Lora-Extract" to "mergekit-extract-lora Mergekit 0.1.0 breaks/broken" on Feb 24, 2025
cg123 (Collaborator) commented Feb 25, 2025

Thank you for the thorough report! This was very helpful to track down what was going on. I think I've fixed this in #522 - for the specific model/base combo here the results are now equal up to floating point precision with v0.0.6 (with the --embed-lora flag to approximate lm_head/embed_tokens instead of implicitly adding them to modules_to_save.) Could you give it a shot with those changes and let me know if it's still giving you any trouble?
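
(Context for readers following along: the low-rank factors produced by mergekit-extract-lora come from a truncated SVD of the weight difference between the finetune and the base. A minimal sketch of that general idea in PEFT's lora_A/lora_B layout - an illustration, not mergekit's exact implementation:)

import torch

def lora_pair_from_delta(w_finetune: torch.Tensor, w_base: torch.Tensor, rank: int):
    # Rank-limited approximation of the weight delta via truncated SVD,
    # arranged so that delta ~= lora_B @ lora_A (the layout PEFT expects).
    delta = (w_finetune - w_base).to(torch.float32)
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    s_sqrt = s[:rank].sqrt()
    lora_A = s_sqrt.unsqueeze(1) * vh[:rank, :]  # (rank, in_features)
    lora_B = u[:, :rank] * s_sqrt                # (out_features, rank)
    return lora_A, lora_B

One-dimensional weights such as the various norm parameters have no low-rank structure to decompose this way, which is why they can only be skipped or copied whole via modules_to_save.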

@kromeurus

Hi, I'm the other colleague.

I've been using the new LoRA extractor since it came out, and while it has its benefits, the downsides can render a lot of LoRAs unusable/broken. Besides my own experience, I've also conducted an experiment to demonstrate the differences we've been seeing.

First, I pulled a LoRA at r=128 from the same base and finetune without any optional flags using the previous LoRA extractor, then did the same with the current v0.1 extractor. I'll be calling these versions pre v0.1 and post v0.1 respectively. Off the jump, these LoRAs have major deviations.

  • As DavidAU pointed out above, the two LoRAs have different target and save modules. The pre v0.1 LoRA has every possible module targeted or saved. The post v0.1 LoRA only has self_attn and mlp modules targeted and embed_tokens saved. input_layernorm, norm, and post_attention_layernorm are missing from post v0.1 since, according to the command's logs, the post v0.1 extractor doesn't recognize those layers and ignores them. They are recognized in pre v0.1 and are saved at full rank.

I assume it's not possible to 'rank' these modules, which is why the post v0.1 extractor overlooks them and why you can skip them in pre v0.1 with the --skip-undecomposable flag. While not an outstanding issue, it would be better to have the option to keep/remove them, since there's a handful of 'logistics' in there that could be useful in future merges.
In addition, with embed_tokens now being saved at full rank vs. previously just being targeted, there's going to be a major variance in vocabulary. It's a welcome change, but the average user wouldn't notice it unless they use verbose logging and pay really close attention to the logs or configs. Being more transparent here about what's being saved or targeted would be helpful.

  • When extracting, pre v0.1 can use models in any float dtype without issue and produce a usable LoRA. However, post v0.1 requires models to be in float32 or else you get a runtime error. There is already a pull request to address this, but the other workaround is to manually reformat using mergekit-yaml and setting dtype: float32.

As stated in that pull request, this is a significant limitation since the majority of models are bfloat16 and float16. The new code or the manual workaround are fine solutions, but 'padding' a model into float32 isn't ideal. The bigger issue is the repeated reformatting between bit widths, which has definitely affected the final result.
In both versions of mergekit, a merge of model+LoRA is saved into a LoRA cache folder as float16 regardless of what dtype either the model or the LoRA was saved as. Then it's recompiled into whatever the merge config asks for. The loss from bfloat16 to float16 and back isn't detrimental, but there's still a loss. Now, a float32 LoRA merged onto a bfloat16 model, then saved as float16 and reformatted back into bfloat16, will have a ton of unnecessary loss of data that could've been retained. That's what's happening in post v0.1 mergekit. Could there be a possible option to prevent this?

  • After extraction, the two LoRAs' safetensors files are different sizes, with pre v0.1 being 739 MB and post v0.1 being 3.51 GB despite being the same rank (128).

Though this size could be attributed to the different floats each LoRA was built with — float32 yielding more data to store versus bfloat16 — and saving embed_tokens at full rank, the resulting post v0.1 LoRA still shouldn't be quintuple the size of its previous version. So, I dug a bit into the logs and found these warnings:
[Screenshots: warnings about missing adapter keys for lm_head.lora_A and lm_head.lora_B when merging the post v0.1 LoRA]
Both screenshots are from merging the post v0.1 LoRA into a model in pre and post v0.1 mergekit. It's the same warning: mergekit found missing adapter keys for lm_head.lora_A and lm_head.lora_B. It seems like the post v0.1 extractor split its data for lm_head into two keys, yet neither version of mergekit can compute those keys, effectively rendering them useless. So there's a chunk of data being extracted that can't be used by the very toolkit that made it. This needs a fix to use the full breadth of the post v0.1 extractor, or else all future LoRAs made with mergekit will be crippled, since lm_head contains a lot of the technical smarts of a model.
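
(To check an adapter yourself, here's a minimal sketch that lists the lm_head keys and totals the tensor sizes; the local path is hypothetical:)

from safetensors import safe_open

# Hypothetical local path to the extracted adapter's weights.
path = "post-v0.1-lora/adapter_model.safetensors"

lm_head_keys, total_bytes = [], 0
with safe_open(path, framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        total_bytes += tensor.numel() * tensor.element_size()
        if "lm_head" in key:
            lm_head_keys.append(key)

print("lm_head-related keys:", lm_head_keys)
print(f"total tensor data: {total_bytes / 1e9:.2f} GB")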

This is just the LoRA extractor; the application side needs its own essay that I'm already drafting but need a little more time for. Apologies for the word vomit, but there's a lot that needed to be addressed.

@kromeurus

lmao I took so long to write my first comment that @cg123 you've already put out a possible fix. Well, I'll make this next bit as succinct as I can manage so I'm not wasting time.

With the LoRA extractor issues addressed, the second part is the actual application of LoRAs onto models, which also seems to differ between pre and post v0.1 versions. For this part of my experiment, I took the two LoRAs created with the pre- and post-v0.1 extractors and did a total of four merges: two in pre v0.1 mergekit and two in post v0.1 mergekit. I used the same base model for all four merges and ran the following config without any optional flags save for --cuda and --verbose:

models:
  - model: AGI-0/smartllama3.1-8B-001+kromcomp/L3.1-Prev0.1-Control-Nanuq-r128-LoRA
parameters:
  int8_mask: true
merge_method: passthrough
dtype: bfloat16
name: prev0.1
---
models:
  - model: AGI-0/smartllama3.1-8B-001+kromcomp/L3.1-Postv0.1-Control-Nanuq-r128-LoRA
parameters:
  int8_mask: true
merge_method: passthrough
dtype: bfloat16
name: postv0.1

The results are the following four models (already quanted at q_8 on the latest llama.cpp):

https://huggingface.co/kromquant/trial-Post-to-Post-GGUFs
https://huggingface.co/kromquant/trial-Post-to-Pre-GGUFs
https://huggingface.co/kromquant/trial-Pre-to-Post-GGUFs
https://huggingface.co/kromquant/trial-Pre-to-Pre-GGUFs

Models are named as such to document what version of LoRA was used in what version of mergekit. Pre and Post are referring to before and after v0.1. So model 'Trial Pre to Post' is made with Pre v0.1 LoRA in Post v0.1 mergekit. I'll be short-handing them to pre/pre or post/post for efficiency.

The expectation would be for the same LoRA plus model to result in the same merge regardless of mergekit version. However, that's not what I've been seeing in testing. I've kept my testing simple: same samplers, same context, and no model instructions. In short, post v0.1 mergekit seems to be missing a value or applying one differently, resulting in slight changes.

For Pre/Pre vs. Pre/Post, both models are relatively logistically sound considering the nature of the base model and respective LoRA. Pre/Pre feels more rigid in a sense, closer to the structure of the base model compared to Pre/Post which has more fluidity similar to the finetune used to make the LoRA. That said, Pre/Post tends to hallucinate more than Pre/Pre and be more forgetful about existing context.

For Post/Pre vs. Post/Post, they're both dumber than the other two simply because lm_head was unaccounted for, due to it being split into lm_head.lora_A and lm_head.lora_B and rendered unmergeable as I stated in my last comment. The vocabulary in both is better since embed_tokens was saved at full rank. Besides those known factors, the differences are similar to Pre/Pre vs. Pre/Post.

Outside this experiment, there are other points to note. These models came out usable because the two LoRAs were pulled off the correct base and finetune. LoRAs pulled with mismatched bases/finetunes are corrupted and that usually expresses itself by repeating a few choppy words near the end of generation in pre v0.1 extractor/mergekit. Post v0.1 corruptions tend to manifest in endless adjective word vomit and the tolerance for corruption is smaller.

@kromeurus

David and I suspect the major issue is that lm_head isn't being extracted properly and can't be applied to any model via mergekit. Once that's solved, everything else should follow.

With that, I'd like to propose an idea to entertain about possible quality-of-life changes to the LoRA extractor. Because of all the options and settings that exist, it might be worthwhile to change how the mergekit-extract-lora command takes its settings. Specifically, make it similar to how mergekit-yaml and mergekit-multi take in settings: with a config YAML. That way, the options you set are clearer and more concise, leaving the command line more organized for other flags.

I've made a proof of concept config.yaml just to visualize what it could look like. Of course, final say is up to you.

finetune: Delta-Vector/Control-Nanuq-8B
base_model: arcee-ai/Llama-3.1-SuperNova-Lite
rank: 128
alpha: 128
epsilon: 0
targeted_modules: 
  filter: self_attn, mlp, lm_head
saved_modules:
  filter: input_layernorm, embed_tokens, norm, post_attention_layernorm

@kromeurus

So I've tested the possible fix PR and it did solve most issues. There's no runtime error for incompatible floats, and the entire lm_head gets saved without getting split, so no warnings show up during merge. As a result, the model+LoRA merged in post v0.1 mergekit retains the recall and intellect that wasn't present with the previous post v0.1 LoRA.

Just one problem: it seems like the fix is hard-coded to include lm_head at full rank no matter what, because I had -e lm_head left over in my command line from my previous project and it ignored the flag. This also means you can't target or rank lm_head at all, or remove it, which might pose constraints down the line.

The new merge still suffers from the post v0.1 mergekit 'jank' and the LoRA cache is still being saved at half precision, but the PR wasn't supposed to fix that, so all is well otherwise.

cg123 (Collaborator) commented Feb 26, 2025

@kromeurus Thank you for all of this writeup, and for the rigorous experimentation! It's very helpful to know where things are falling short and how things could be more streamlined.

I'm definitely open to the idea of a YAML config for lora extraction. The full set of options is getting pretty unwieldy.

To clarify - by the post v0.1 'jank', do you mean the results from LoRAs that were extracted pre-PR not saving lm_head correctly? Or is there another behavioral difference you're noticing even now?

I've pushed some changes to that branch that I think address everything that's been brought up. -e lm_head will be properly respected now. There's a --lora-merge-dtype option now as well. Please let me know if things work how you expect now, or if there are any other cases I've missed.

@kromeurus

@cg123 No problem, and thank you for such a quick response. Glad that my informal analysis could help. A YAML config would be much appreciated, please. The entire command line goes beyond my screen.

RE Post v0.1 'jank': It's hard to explain, especially since it's such a slight difference. It doesn't seem to be related to lm_head, since even after that fix it's still there. If you have the time, I invite you to test the four models I quanted above to see it first hand. The best way I can explain it is that post v0.1 models tend to take on more of the LoRA's characteristics than they do in pre v0.1, at the cost of some level of intellect? It's more obvious when you merge a corrupted LoRA, but proper documentation and evidence will have to wait until later (it's very late for me atm).

While saving lm_head at full rank does technically solve the main issue, I'm worried it may not be the best solution. The new extractor initially splits lm_head into two weighted adapters, which is fine, since I've seen the previous extractor also want to use two weights for lm_head for specific models (it's where the old --extend-vocab came into play). The core problem is that it cannot be merged back into a model using mergekit, because mergekit doesn't recognize those keys / doesn't know what to do with them.

I'm not sure what mergekit uses to meld models and LoRAs together, but it might be better to have mergekit able to read and build with those adapter keys, rather than saving the entire module as the sole remedy. That way, mergekit could also utilize LoRAs compiled with other software. The other option could be changing the new extractor to not split lm_head, whichever option(s) are more viable.
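
(For context, the standard PEFT-style math those split keys imply; a toy sketch with assumed shapes, not mergekit's internal code. A targeted module is merged as W' = W + (alpha / rank) * lora_B @ lora_A, while a module listed in modules_to_save just stores the fine-tuned weight whole and overwrites the base copy:)

import torch

def merge_targeted_module(w_base: torch.Tensor, lora_A: torch.Tensor,
                          lora_B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    # Standard PEFT-style update for a *targeted* module: W' = W + (alpha / rank) * B @ A
    return w_base + (alpha / rank) * (lora_B @ lora_A)

# Toy shapes standing in for an lm_head (out_features x in_features) at rank 128.
out_features, in_features, rank, alpha = 1024, 256, 128, 128
w_base = torch.randn(out_features, in_features)
lora_A = torch.randn(rank, in_features) * 0.01   # would be stored as ...lm_head.lora_A...
lora_B = torch.randn(out_features, rank) * 0.01  # would be stored as ...lm_head.lora_B...
w_merged = merge_targeted_module(w_base, lora_A, lora_B, alpha, rank)

# A module in modules_to_save is handled differently: its fine-tuned weight is
# saved whole and simply replaces the base weight, with no low-rank math involved.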

Again, thank you for your work and diligence. I'll get back to you on the new fixes and details for the 'jank' once I'm able to.

@kromeurus

@cg123 Aight I'm back with testing finished. Here are the results.

For PR fixes: Almost everything has been addressed, but a few new problems popped up. -e lm_head is respected and it looks like you can rank lm_head without it splitting. --lora-merge-dtype works as well. However, --skip-undecomposable doesn't seem to do anything or match its original behavior in pre v0.1. I suspect that has to do with this bug(?) I encountered.

I pulled this LoRA (same models as previous versions in this thread) with the new PR and the --embed-tokens flag, so everything that could be ranked is ranked, to mimic pre v0.1 behavior. It does function as expected now: the LoRA that came out was the same size as pre v0.1, and no warnings came up in merging about missing split adapter keys. Yet I found the following in the logs for mergekit-extract-lora.

[Screenshots: extractor log warnings that input_layernorm, norm, and post_attention_layernorm are unsupported module types and will be saved at full rank]

input_layernorm, norm, and post_attention_layernorm are unsupported module types, as we know, but the log now says they'll be saved at full rank. So they should appear under modules_to_save in the adapter config. But if you look, none of them are there. Using the --skip-undecomposable flag results in the same logs and the same lack of modules. If those modules physically cannot be saved, then those warnings should be removed so as not to confuse the user. If they can be saved and aren't for some reason, then that should be the fix.

Also, thank you for the --gpu-rich flag; it cleared up a lot of space in the command line. That does bring up what we briefly spoke about: using a YAML config to organize settings.
In my opinion, it might have to be how LoRA extraction is handled from now on. Because you've allowed specific modules to be saved or targeted or skipped, it'll be unnecessarily tedious to write out the full command if you want a particular kind of LoRA. That's not including all the other flags (thank you for sorting the flags by type btw) and options that need values set. God forbid you mistype something and have to comb through the command line to find it.

If a YAML config input does get implemented, you could safely remove half the flags since they'd be declared in one way or another. Adding new options should also be easier, though YMMV. Additionally, I recommend changing the default for embed_tokens and lm_head to targeted instead of saved, since that was the previous behavior.

As for the aforementioned 'jank' in post v0.1 mergekit: it's not worth pursuing a possible fix. I've been trying to express it in an easily identifiable way, but I can't. This might just be a tiny detail that only those who've spent months exclusively using mergekit would spot (i.e. David and I, lol). For the average practical user, it shouldn't matter, as everything else is still working as intended. If we're able to quantifiably pin down and replicate this, we'll open another issue when that happens.

Again, thank you for all your work and help. Mergekit is a fantastic toolkit to build with. I've been using it for over a year and I'm still learning new things. David would agree with me.

I'd like to ask if I could possibly collaborate with you on making a proper guide of sorts on how to use mergekit. There's been an uptick in new mergers in the AI space, but the resources to understand mergekit are few and far between. Not to mention how technical those existing resources are. Would you be interested in a practical documentation of mergekit with modern examples?

@David-AU-github (Author)

Second the motion:

I'd like to ask if I could possibly collaborate with you on making a proper guide of sorts on how to use mergekit. There's been an uptick in new mergers in the AI space, but the resources to understand mergekit are few and far between. Not to mention how technical those existing resources are. Would you be interested in a practical documentation of mergekit with modern examples?

  • This would lift the quality of all merged models, and the community in general.
  • This would help open source overall.
  • More "tinkerers" would have more success faster ... we all win.

PS:
Just learned how to make a Qwen MoE.... ah,... today.
Burning up the dev drive now.

cg123 (Collaborator) commented Feb 28, 2025

Okay, I think it should actually be good to go now! --skip-undecomposable should behave properly as it did in previous versions. Thanks for your time and attention on this.

Having more and better documentation would be fantastic, that's definitely a weak point. I'd be very interested in collaborating with you both on something like this.

@kromeurus

Many thanks, --skip-undecomposable works as expected. You can merge that branch in at your leisure. If you don't mind, can I DM you on discord to talk further about documentation and guides?

cg123 added a commit that referenced this issue Feb 28, 2025
Addresses #521.

Also adds:
* `--lora-merge-dtype` to specify dtype to use when applying LoRA adapters to models
* `--gpu-rich` alias for convenience
* Organize display of options in `--help`