
Add debug mode #1505


Open
rlouf opened this issue Mar 19, 2025 · 1 comment

Comments

@rlouf
Member

rlouf commented Mar 19, 2025

We could add a logits processor that checks whether we're forbidding tokens that should be allowed or conversely.

@parkervg
Contributor

parkervg commented Mar 22, 2025

For more context: the reason I think this may be useful comes from an experience I had with outlines' internal JSON grammar. I believe it may be related to #994, and might even warrant setting up another issue?

I had finetuned a language model using json.dumps on Python dictionaries. Some of the JSON values had newline characters. During inference, when I applied outlines JSON constraints, I noticed a drop in performance, with many generations ending with something like {"answer": "Some text here \"}.

What seems to be happening is that the model was trying to mimic a pattern it saw during finetuning: a \n escape with a single backslash. But, as the code snippet below shows, this is invalid under the STRING_INNER regex in outlines_core:

import json
import re

from outlines_core.fsm.json_schema import build_regex_from_schema

# "object" is the JSON Schema type for a dict-like value ("json" is not a valid type).
schema_object = '{"type": "object", "properties": {"answer": {"title": "Answer", "type": "string"}}}'
regex_str = build_regex_from_schema(schema_object)
re_pattern = re.compile(regex_str)

# A real newline is serialized as `\n` (single backslash), which the regex rejects.
assert re_pattern.search(json.dumps({"answer": "This is an answer \n with a newline"})) is None
# A raw string keeps the backslash, which is serialized as `\\n` and accepted.
assert re_pattern.search(json.dumps({"answer": r"This is an answer \n with a newline"})) is not None
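The same effect can be reproduced with the standard library alone. The sketch below uses an assumed, simplified stand-in for the string-content rule (it is not copied from outlines_core): a backslash is only accepted when followed by a quote or another backslash, so the `\n` escape that json.dumps emits for a real newline is rejected.

```python
import json
import re

# Assumed, simplified stand-in for the string-content rule: any character
# except quote, backslash, and control characters, or a backslash followed
# by a quote or another backslash. Not copied from outlines_core.
STRING_INNER = r'(?:[^"\\\x00-\x1F]|\\["\\])'
STRING = re.compile(f'"{STRING_INNER}*"')

with_newline = json.dumps("line one \n line two")   # real newline character
with_literal = json.dumps(r"line one \n line two")  # backslash followed by n

# The first serialization contains `\n`, the second contains `\\n`.
newline_ok = STRING.fullmatch(with_newline) is not None
literal_ok = STRING.fullmatch(with_literal) is not None

print(with_newline, newline_ok)
print(with_literal, literal_ok)
```

Under this stand-in rule, only the second string is accepted, which mirrors the two assertions above.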

The solution for this particular use case is to convert the Python dict values to raw strings before calling json.dumps, so that the text contains \\n. But this only became apparent to me after digging into the underlying regular expression guiding JSON strings in outlines. It would be awesome to have a debugging feature that automatically flags this sort of behavior, where grammar constraints cause outputs that differ from unconstrained greedy decoding (outlines.generate.json(..., debug=True)?).
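As a rough illustration of what such a check could look like, here is a minimal sketch in plain Python. It is not the outlines API: `find_divergence`, the toy vocabulary, and the logit values are all hypothetical. The idea is to compare the greedy token before and after the grammar mask is applied and report any step where they differ.

```python
import math
from typing import Dict, List, Optional, Set


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def find_divergence(
    logits: List[float], allowed: Set[int], vocab: List[str]
) -> Optional[Dict[str, str]]:
    """Hypothetical debug check: report when the grammar mask changes
    the greedy choice, otherwise return None."""
    preferred = argmax(logits)
    # Apply the mask the way a constraining logits processor would:
    # forbidden tokens get a score of -inf.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    forced = argmax(masked)
    if preferred == forced:
        return None
    return {"preferred": vocab[preferred], "forced": vocab[forced]}


# Toy step: the model has just emitted a backslash inside a JSON string and
# wants "n" next, but the (assumed) grammar only allows a quote or backslash.
vocab = ['"', "\\", "n", "a"]
logits = [0.4, 0.3, 2.5, 0.2]
allowed = {0, 1}

report = find_divergence(logits, allowed, vocab)
print(report)
```

A real debug mode would presumably run this at every decoding step and log the reports, which would have pointed directly at the `\n` versus `\"` mismatch described above.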
