Outlines v1 response model validation #1526

Open
cpfiffer opened this issue Apr 8, 2025 · 4 comments
Labels
impact/user interface Related to improving the user interface

Comments

@cpfiffer
Contributor

cpfiffer commented Apr 8, 2025

In Outlines v1, we specify an output format with model(prompt, OutputClass, ...).

Currently this returns a JSON string rather than a validated instance of OutputClass.

Example code:

import json
from outlines import models
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer

class Person(BaseModel):
    name: str
    age: int
    email: str = Field(pattern=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')


model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = models.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_id),
    AutoTokenizer.from_pretrained(model_id)
)

person_text = """
John Doe
30
john.doe@example.com
"""

result = model(
    f"Extract the person information from this text:\n{person_text}", 
    Person,
    max_new_tokens=100
)
# result is a JSON string, so parse it before pretty-printing
print(json.dumps(json.loads(result), indent=2))

The output type is a JSON string:

{ "name": "John Doe", "age": 30, "email": "john.doe@example.com" }

My expectation here would be that model returns a Person object rather than the raw string. This would look like

Person(name='John Doe', age=30, email='john.doe@example.com')

To get that, I currently have to add one (very simple) extra line:

person = Person.model_validate_json(result)

Is this intended behavior? Outlines < 1.0 would typically return a Pydantic object back.
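
For reference, the workaround above can be folded into a small helper. This is only a sketch and not part of Outlines: `generate_validated` is a hypothetical name, and it assumes the model is callable as `model(prompt, output_type, **kwargs)` and returns a JSON string, as described in this issue. A stub stands in for the real model so the snippet runs without downloading weights.

```python
from typing import Any, Callable, Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


def generate_validated(
    model: Callable[..., str],
    prompt: str,
    output_type: Type[T],
    **inference_kwargs: Any,
) -> T:
    """Call the model, then parse its JSON string into `output_type`.

    Hypothetical helper: assumes `model(prompt, output_type, **kwargs)`
    returns a JSON string, which is the v1 behavior reported here.
    """
    raw = model(prompt, output_type, **inference_kwargs)
    # Raises pydantic.ValidationError if the JSON does not match the schema.
    return output_type.model_validate_json(raw)


class Person(BaseModel):
    name: str
    age: int


def stub_model(prompt: str, output_type: Any, **kwargs: Any) -> str:
    # Stand-in for an Outlines model; always returns the same JSON string.
    return '{"name": "John Doe", "age": 30}'


person = generate_validated(stub_model, "Extract the person.", Person)
print(repr(person))
```

The helper keeps the string-returning call untouched and adds validation as an explicit second step, so a schema mismatch surfaces as a `ValidationError` at the call site.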

@rlouf
Member

rlouf commented Apr 15, 2025

AFAIR this was not intended, but an oversight on our end. Since the output type is available where the text is generated, it should not be too difficult to add, except maybe for union types. We could add it just for Pydantic output types and other simple types for now.

@RobinPicard
Contributor

I was aware of it and thought it was intended.

I'm not too keen on putting it back, as I believe it paradoxically ends up making the user's life harder. They would have to know the return format associated with each output type, and some are not fully intuitive (int returns an int, but the regex for an int returns a string). It may also not be implemented in some cases (Union, but also things like booleans in an enum or Literal, and probably other cases that are hard to anticipate).

As a user, I prefer being told I'll always get a string and turning it back into what I need myself (it typically requires a single line).
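
To make the "single line" concrete, here is a minimal sketch of user-side conversions from the returned string. The raw strings are made-up stand-ins for model output, and how a constrained boolean is rendered as text (`"True"` here) is an assumption, not something confirmed in this thread.

```python
import json
from typing import Literal, get_args

# Made-up stand-ins for strings returned by the model
raw_int = "42"                                  # output constrained to an integer
raw_json = '{"name": "John Doe", "age": 30}'    # output constrained to a JSON schema
raw_choice = "True"                             # assumed text form of a boolean choice

as_int = int(raw_int)            # str -> int
as_dict = json.loads(raw_json)   # str -> dict

# Literal values: map each allowed value's string form back to the value itself
Choice = Literal[True, False]
by_text = {str(value): value for value in get_args(Choice)}
as_choice = by_text[raw_choice]  # str -> bool
```

The Literal case shows why automatic conversion is less obvious than it looks: the mapping from text back to a value depends on how each allowed value was serialized.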

@rlouf
Member

rlouf commented Apr 16, 2025

I think you're right that it would get confusing.

@cpfiffer
Contributor Author

cpfiffer commented Apr 16, 2025

I think we should treat passing in a Pydantic class (and possibly a dict) as a special case. Here, users have explicitly provided an output type. I agree that any other situation should return a string.

The reason for this is that a large portion of calls to Outlines use Pydantic classes. We want to make that as seamless as possible without requiring the user to understand model_validate_json and use it after every single call to the generator.

I don't think this is confusing behavior at all -- from my perspective, it's entirely intuitive to provide an object class and then get that same object class out.

Similarly, if I provide a dict, I would expect a dict. If I provide a string (regex, raw schema), I would expect a string. If anything, it's unintuitive to have all requests cast result types to strings.
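
The rule proposed in this comment could be sketched as a small dispatch on the output type. This is illustrative only: `coerce_output` is a hypothetical name, not an Outlines function, and it detects Pydantic classes by duck-typing on `model_validate_json` so the snippet needs only the standard library.

```python
import json
from typing import Any


def coerce_output(raw: str, output_type: Any) -> Any:
    """Convert the generated string according to the requested output type."""
    # Pydantic v2 model classes expose a `model_validate_json` classmethod;
    # duck-typing on it keeps this sketch free of a Pydantic import.
    validate = getattr(output_type, "model_validate_json", None)
    if isinstance(output_type, type) and callable(validate):
        return validate(raw)
    # The `dict` type or a dict-valued schema both yield a parsed dict.
    if output_type is dict or isinstance(output_type, dict):
        return json.loads(raw)
    # Strings (regex, raw schema) and everything else: return the text as-is.
    return raw


print(coerce_output('{"name": "John Doe", "age": 30}', dict))
print(coerce_output("john.doe@example.com", r"[a-z.]+@[a-z.]+"))
```

Anything not explicitly special-cased falls through to the raw string, which matches the "any other situation should return a string" position above.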

@rlouf rlouf added impact/user interface Related to improving the user interface labels Apr 23, 2025
3 participants