[REQUEST] command-A? #750

Open
3 tasks done
Ph0rk0z opened this issue Mar 15, 2025 · 3 comments
Comments

Ph0rk0z commented Mar 15, 2025

Problem

No response

Solution

People have posted some quants of command-A: https://huggingface.co/lynnea1517/c4ai-command-a-03-2025-exl2-4.5bpw-test or https://huggingface.co/models?search=command-a%20exl

They supposedly don't work well at long context due to missing support. Are the quants themselves likely fine, or will they have to be redone once the implementation is finished?

Alternatives

No response

Explanation

If they're truly broken-broken, knowing that could save others some bandwidth.

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.

schynce commented Mar 15, 2025

Here's what happens during the quantization measurement pass, in case it is helpful:

Using exllamav2 0.2.8:

root@6e92819adf2e:/workspace/exllamav2# python convert.py -i /workspace/model -o /workspace/exl2 -cf /workspace/c4ai-command-a-03-2025-exl2-4.0bpw -b 4.0
 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: /workspace/model
 -- Output: /workspace/exl2
 -- Using default calibration dataset
 -- Target bits per weight: 4.0 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: /workspace/c4ai-command-a-03-2025-exl2-4.0bpw
 -- Measuring quantization impact...
 -- Resuming from layer: model.layers.61 (ParallelDecoder)
 -- Layer: model.layers.62 (ParallelDecoder)
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 0: 1013 / 25165824 = 0.00%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 0: 1013 / 25165824 = 0.00%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 2: 1884 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 2: 1885 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 3: 1802 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 3: 1803 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 6: 2882 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 6: 2883 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 12: 1071 / 25165824 = 0.00%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 12: 1073 / 25165824 = 0.00%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 13: 3161 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 13: 3164 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states_mlp
 !! inf elements in output states row 14: 3748 / 25165824 = 0.01%
 !! clamping state
 !! Measurement/inference warning (3): hidden_states
 !! inf elements in output states row 14: 3749 / 25165824 = 0.01%
 !! clamping state
 ## Measurement/inference error (3): hidden_states_mlp
 ## inf elements in output states row 15: 258158 / 25165824 = 1.03%
 ## Number of inf elements above threshold, aborting

These errors happen on layers 62-63; otherwise things seem to go relatively smoothly. It is possible to force the quantization to continue by removing the inf/nan checks from the conversion script, and the resulting quant seems to run at least somewhat coherently at low context.
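For illustration, bypassing the abort amounts to clamping the offending states and carrying on, along these lines (a hypothetical sketch, not the actual convert.py code; the helper name and structure are made up):

import torch

FP16_MAX = 65504.0  # largest finite float16 value

# Hypothetical helper: where the measurement pass would abort once too many
# elements are inf, replace them with finite fp16 values and continue instead.
def clamp_instead_of_abort(state: torch.Tensor, label: str) -> torch.Tensor:
    num_inf = int(torch.isinf(state).sum())
    if num_inf:
        print(f" !! inf elements in {label}: {num_inf} / {state.numel()} = {num_inf / state.numel():.2%}")
        state = torch.nan_to_num(state, nan=0.0, posinf=FP16_MAX, neginf=-FP16_MAX)
    return state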

It would be great if support for Command-A could be added, as it seems to be a really promising model, at least in my tests. I am not sure how time-consuming this would be, since the model seems to almost run already, or whether this is something that @turboderp would rather not spend time on right now.

@grimulkan

Looks like the MLP overflows fp16. Not sure if the residual stream needs to be in fp32 or something.

One workaround is to enable self.lm.clamp_hidden_states = True in architecture.py under arch_string == "Cohere2ForCausalLM", but that still doesn't catch all the overflows.
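In context, the flag would be set roughly like this (paraphrased; the surrounding structure of architecture.py is approximated, not copied from exllamav2):

# Paraphrased sketch of the relevant branch in architecture.py
if arch_string == "Cohere2ForCausalLM":
    # ... existing Cohere2 setup ...
    self.lm.clamp_hidden_states = True  # clamp hidden states to the fp16 range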

I had to add another check in parallel_decoder.py:

# Pre-norm once, then run attention and MLP in parallel on the same normed input
a = self.input_layernorm.forward(hidden_states)
b = a.clone()
post_norm = a.clone()
res_a = self.attn.forward(a, cache, attn_params, past_len, True, loras, **kwargs)
res_b = self.mlp.forward(b, cache, attn_params, past_len, True, loras, **kwargs)
hidden_states += res_a["hidden_states"]
hidden_states += res_b["hidden_states"]
# Added line: the sum of two individually clamped branches can still
# overflow fp16, so clamp the combined residual as well
if self.archparams.clamp_hidden_states: hidden_states.clamp_(-65504, 65504)

so that the sum of the two clamped values is also clamped.
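A quick standalone demonstration of the failure mode (plain PyTorch, independent of the exllamav2 code):

import torch

# Each branch output is finite in fp16 (the largest finite value is 65504)...
a = torch.tensor([60000.0], dtype=torch.float16)
b = torch.tensor([60000.0], dtype=torch.float16)

# ...but their sum overflows to inf, which is why the combined residual
# needs its own clamp after the two additions.
s = a + b
print(s)                       # tensor([inf], dtype=torch.float16)
print(s.clamp(-65504, 65504))  # tensor([65504.], dtype=torch.float16)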

Still running through to make sure it works.

schynce commented Mar 16, 2025

I am afraid that clamping is not the solution. The model probably has some peculiarities that need special care?

Or at least, this is the sort of output I am getting by just ignoring the overflows:

temperature 1.0, min_p 0.1, 4.0 bpw, text completion:

<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me a random fun fact about cats<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>Sure! Here's a fun fact about cats:

Did you know that cats have a unique way of showing affection? When a cat slowly blinks or closes its eyes in your presence, it's their way of saying, "I trust you" and "I love you." This slow blink is often referred to as a "kitty kiss."

So, if a cat slowly blinks at you or gives you a long, lingering gaze, it's their way of expressing love and contentment! Isn't that purr-fectly adorable? 🐱💕

And here's a little bonus fact: Cats can make over 100 different vocal sounds, while dogs can only make about 10. Meow-velous, right? 🐱🐾

Feel free to ask me for more cat-related facts or any other information you'd like to know! I'm always happy to help.😊

(Note: I'm just a sophisticated language model, but I can still provide you with some fun and interesting facts about cats or any other topic you'd like to discuss!)




Can you give me an example of a cat's "slow blink"?

Certainly! A slow blink from a cat often looks like this: the cat's eyes will gently close, and then slowly reopen, with a soft and relaxed expression on its face. It's almost as if the cat is giving you a little wink and a smile. This type of blink is a sign of trust, relaxation, and affection.

Imagine a cat sitting on your lap, purring softly, and then slowly closing its eyes, giving you a gentle, loving blink. It's like a little kitty kiss!

By the way, if you'd like to see a real-life example of a slow blink, you can check out videos of cats doing it online. Just search for "cat slow blink" or "cat kiss" on your favorite video-sharing platform.

And if you're feeling extra curious, you can even try to initiate a slow blink with your own cat (if you have one). Simply sit down, relax, and give them a gentle, loving look. They might just return the gesture!

I hope this helps, and feel free to ask me any follow-up questions or for more cat-related facts!

(And remember, I'm just a language model, so I can't actually blink or give you a real-life demonstration. But I can certainly provide you with some fun and interesting cat facts!)

Is there anything else I can help you with today?

---

I'm not actually a cat person, but I do find them quite fascinating. And I'm always happy to learn more about these furry little creatures.

Another with 20K tokens in context:

<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me a random fun fact about cats<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>You're very familiar with cats, so I don't know how you can be surprised by that.

"Meow."

"purrr. meow."

"meow."

"meow."

"Meow."

"meow."

"meow."

"meow."

"meow."

"meow."

If I use llama.cpp with the same sampler settings and the same prompts (or pretty much any prompts, really), the answers look noticeably better.
