Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match token value with array and tuple #183

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

shavit
Copy link
Contributor

@shavit shavit commented Feb 25, 2025

  • Add match cases for token value
  • Add BartTokenizer to known tokenizers

/Close #182

  * Add match cases for token value
  * Add BartTokenizer to known tokenizers
@greenrazer
Copy link
Collaborator

Thank you for the contribution @shavit! I tested it and facebook/bart-large-mnli (#182) is now working with the AutoTokenizer.

However this is a curious issue, because when pulling from a URL AutoTokenzier.from calls:

  • LanguageModelConfigurationFromHub.init
  • Then LanguageModelConfigurationFromHub.loadConfig
  • Then calls this code from HubAPI.swift:
func configuration(fileURL: URL) throws -> Config {
    let data = try Data(contentsOf: fileURL)
    let parsed = try JSONSerialization.jsonObject(with: data, options: [])
    guard let dictionary = parsed as? [NSString: Any] else { throw Hub.HubClientError.parse }
    return Config(dictionary)
}

I could be missing something, but since JSON doesn't support tuples, wouldn't we expect the value entries in the config to be arrays instead of tuples? I'm a bit confused why this worked previously. @pcuenca @FL33TW00D

@shavit shavit marked this pull request as draft February 26, 2025 16:00
@shavit
Copy link
Contributor Author

shavit commented Feb 26, 2025

Right, I agree. The config likely decodes to a dictionary with any byte values, and the token value casts in a dynamic lookup.

I'll do a few more checks.

@pcuenca
Copy link
Member

pcuenca commented Feb 26, 2025

I think you are right @greenrazer! I don't think this could ever work with actual tokenizer files. sep and cls would be decoded as arrays, and I don't think we can cast them to tuples. The correct way to do it is the current one proposed in this PR. Actually, I don't think this would work either.

@shavit
Copy link
Contributor Author

shavit commented Feb 27, 2025

True, but the JSON array was decoded into Any?, not a tuple. The value is then read from a new Config and matched with the correct type.
https://github.com/huggingface/swift-transformers/blob/main/Sources/Hub/Hub.swift#L83

The problem comes from the token order, which is not index-first.

{
  "type": "RobertaProcessing",
  "sep": [
    "</s>",
    2
  ],
  "cls": [
    "<s>",
    0
  ],
}

Comment on lines +105 to +106
case let (i, t) as (UInt, String): return (i, t)
case let (t, i) as (String, UInt): return (i, t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if both versions exist in the wild?

@@ -118,4 +118,14 @@ class HubTests: XCTestCase {
let vocab_dict = config.dictionary["vocab"] as! [String: Int]
XCTAssertNotEqual(vocab_dict.count, 2)
}

func testConfigTokenValueDifferentOrder() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test! Could we please add another one that loads a tokenizer such as facebook/bart-large-mnli? We don't need anything fancy: just show that the tokenizer loads and that it can encode a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fatalError("Missing sep processor configuration" when using "facebook/bart-large-mnli"
3 participants