feat: replace verifier with retriever #26

Merged: 26 commits into main from 2-pass-verifier on Feb 5, 2025

@p0deje p0deje commented Jan 29, 2025

What started as an attempt to implement 2-pass verification to make it more stable ended up as a complete replacement of the verifier/confirmation_checker with retriever/extractor agents. For now it seems to work well enough, though I would like to improve the implementation further by using a chain-of-thought approach for structured outputs. This can be done in follow-up PRs.

Overall, the new design for verification is to:

  1. Use the retriever to answer "Is the following statement true or false - ..." from the screenshot/ARIA.
  2. Use the extractor to get a true/false value from the retriever's result.
  3. Assert on the extracted value.
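
The three steps above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `check`, `fake_retrieve`, and `fake_extract` are hypothetical names, and the two fake functions stand in for the LLM-backed retriever and extractor agents.

```python
from typing import Callable


def check(
    statement: str,
    retrieve: Callable[[str], str],
    extract: Callable[[str], bool],
) -> None:
    """Hypothetical two-step check: retrieve a free-form answer, then extract a boolean."""
    question = f"Is the following statement true or false - {statement}"
    answer = retrieve(question)  # step 1: retriever answers from the screenshot/ARIA
    value = extract(answer)      # step 2: extractor reduces the answer to True/False
    if not value:                # step 3: assert on the extracted value
        raise AssertionError(
            f"Statement is not true: {statement!r} (retriever said: {answer!r})"
        )


# Stand-ins for the real agents (assumptions for illustration only).
def fake_retrieve(question: str) -> str:
    return "True. The page title reads 'Checkout'."


def fake_extract(answer: str) -> bool:
    return answer.strip().lower().startswith("true")


check("the page title is 'Checkout'", fake_retrieve, fake_extract)
```

Keeping retrieval and extraction as separate callables means each step can be swapped or tested on its own.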

Smaller additions in this PR:

  • New variable ALUMNIUM_LOG_LEVEL, which complements the existing ALUMNIUM_DEBUG.
  • Playwright in tests now runs in headless mode.
  • Automatic retries of random AWS Bedrock errors, which are frequent on Llama.
  • CI jobs now time out after 30 minutes.
  • List separator changed from <sep> to %SEP% because Haiku treats the former as an opening HTML tag and likes to append the closing tag (e.g. <sep>FOO</sep>).
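
To illustrate the separator change in the last bullet, here is a minimal sketch of splitting a model response on `%SEP%`. The helper `split_items` is a hypothetical name, not a function from this PR, and the stripping of whitespace and empty items is an assumption.

```python
from typing import List

SEPARATOR = "%SEP%"


def split_items(raw: str, separator: str = SEPARATOR) -> List[str]:
    """Split a raw model response into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(separator) if item.strip()]


print(split_items("FOO%SEP%BAR %SEP% BAZ"))  # ['FOO', 'BAR', 'BAZ']

# With an HTML-like separator, a model that "closes the tag" pollutes the item:
print(split_items("<sep>FOO</sep>", separator="<sep>"))  # ['FOO</sep>']
```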

@p0deje p0deje force-pushed the 2-pass-verifier branch 2 times, most recently from 94c033b to d04308a January 31, 2025 04:45
@p0deje p0deje changed the title from "feat!: use 2-pass verification" to "feat: use 2-pass verification" Feb 1, 2025
@p0deje p0deje changed the title from "feat: use 2-pass verification" to "feat: replace verifier with retriever" Feb 2, 2025
@p0deje p0deje marked this pull request as ready for review February 2, 2025 22:17
"""
Checks a given statement using the verifier.

Args:
    statement: The statement to be checked.
    vision: A flag indicating whether to use a vision-based verification via a screenshot. Defaults to False.
    retries: The number of retries to check the statement. Defaults to the value set in the LoadingDetectorAgent.
Contributor

@p0deje hm, there is no LoadingDetectorAgent anymore, right?

Contributor Author

I'll remove it.

logger = logging.getLogger(__name__)

level = getenv("ALUMNIUM_LOG_LEVEL", None)
if getenv("ALUMNIUM_DEBUG", "0") == "1":
Contributor

Is it kept for backward compatibility?

Contributor Author

Not really, I can remove it if you think it isn't useful.

Contributor

I think it might be more confusing than helpful.

Imagine a situation where ALUMNIUM_DEBUG=1 and ALUMNIUM_LOG_LEVEL=info. Looking at them in the .env file, I could forget which one takes priority.

`ALUMNIUM_LOG_LEVEL=debug` should be used instead.
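
A minimal sketch of the single-variable approach suggested here, assuming the level is read only from `ALUMNIUM_LOG_LEVEL`. The helper name `resolve_log_level` and the fallback to WARNING are assumptions for illustration, not behavior taken from this PR.

```python
import logging
from os import environ
from typing import Mapping


def resolve_log_level(env: Mapping[str, str] = environ) -> int:
    """Map ALUMNIUM_LOG_LEVEL (case-insensitive) to a logging level constant."""
    name = env.get("ALUMNIUM_LOG_LEVEL", "warning").upper()
    level = getattr(logging, name, None)
    # Unknown names, or attributes that are not numeric levels, fall back to
    # WARNING (an assumed default).
    return level if isinstance(level, int) else logging.WARNING


logger = logging.getLogger(__name__)
logger.setLevel(resolve_log_level())
```

With only one variable there is no precedence question: `ALUMNIUM_LOG_LEVEL=debug` enables debug output, and anything unset or unrecognized uses the fallback.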
It looks like it's not that useful as long as we use CoT in the retriever agent's structured output (explanation string BEFORE value string).
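
As a rough illustration of "explanation string BEFORE value string", the sketch below declares the explanation field first so that a model filling in the structure writes its reasoning before committing to a value. The names `RetrievedInformation`, `SCHEMA`, and `parse_response` are hypothetical, and the schema is a plain JSON-schema-style dict rather than the PR's actual structured-output definition.

```python
import json
from dataclasses import dataclass


@dataclass
class RetrievedInformation:
    explanation: str  # reasoning comes first (chain-of-thought)
    value: str        # final answer comes second


# JSON-schema-style description with the same field order (assumed shape).
SCHEMA = {
    "type": "object",
    "properties": {
        "explanation": {"type": "string", "description": "Reasoning behind the answer."},
        "value": {"type": "string", "description": "The retrieved answer."},
    },
    "required": ["explanation", "value"],
}


def parse_response(raw: str) -> RetrievedInformation:
    """Parse a JSON response that follows SCHEMA."""
    data = json.loads(raw)
    return RetrievedInformation(explanation=data["explanation"], value=data["value"])


result = parse_response('{"explanation": "The header shows 3 items.", "value": "true"}')
print(result.value)  # true
```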
@p0deje p0deje merged commit 9de0e81 into main Feb 5, 2025
13 checks passed
@p0deje p0deje deleted the 2-pass-verifier branch February 5, 2025 04:19
2 participants