[llm] Add a generic text only LLM runner #11343
Conversation
Force-pushed from `4393dd1` to `dc4de71`.
See inline
Force-pushed from `4070bba` to `85fb703`.
See inline, @larryliu0820, and thank you!
```cpp
ET_EXPERIMENTAL std::unique_ptr<tokenizers::Tokenizer> load_tokenizer(
    const std::string& tokenizer_path,
    std::unique_ptr<std::vector<std::string>> special_tokens = nullptr,
```
Instead of using a smart pointer, how about this?
`const std::optional<std::vector<std::string>>& tokens`
That way, you're passing by const reference and not using any pointers.
Wouldn't that trigger a copy? I think the tokenizer will have to hold the memory of these tokens.
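A minimal sketch of the tradeoff under discussion (all names below are hypothetical, not from this PR): if the tokenizer must hold the special tokens for its whole lifetime, the `unique_ptr` signature lets the caller hand over the vector with a move, while the `const std::optional<...>&` signature forces the callee to deep-copy every string in order to retain them.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tokenizer that must retain the tokens.
struct TokenizerSketch {
  std::unique_ptr<std::vector<std::string>> special_tokens;
};

// Option A: ownership transfer. The vector is moved in; no strings are copied.
TokenizerSketch make_with_ownership(
    std::unique_ptr<std::vector<std::string>> tokens) {
  return TokenizerSketch{std::move(tokens)};
}

// Option B: const reference. The callee must copy the whole vector (and every
// string in it) to keep the tokens alive after the call returns.
TokenizerSketch make_with_copy(
    const std::optional<std::vector<std::string>>& tokens) {
  return TokenizerSketch{
      tokens ? std::make_unique<std::vector<std::string>>(*tokens) : nullptr};
}

int main() {
  auto owned = std::make_unique<std::vector<std::string>>(
      std::vector<std::string>{"<|begin_of_text|>", "<|end_of_text|>"});
  TokenizerSketch a = make_with_ownership(std::move(owned));  // move, no copy
  TokenizerSketch b =
      make_with_copy(std::vector<std::string>{"<bos>", "<eos>"});  // deep copy
  (void)a;
  (void)b;
  return 0;
}
```

Both shapes work; the question is only where the copy, if any, happens. If the tokenizer only needed to read the tokens during construction, the const-reference version would avoid pointer plumbing at the cost of one copy.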
```cpp
    std::unique_ptr<TextPrefiller> text_prefiller,
    std::unique_ptr<TextTokenGenerator> text_token_generator,
    std::unique_ptr<Stats> stats,
    float temperature = -1.0f);
```
If it's deprecated, why include this as part of the API? If there's user code, it can always just use `create_text_llm_runner`; or can we migrate the user code to use `GenerationConfig` directly?
This is for JNI, where the `generate()` API doesn't expose a temperature argument. I think @kirklandsign is going to refactor the JNI API (and the demo app) to fully deprecate passing temperature to the constructor.
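For context, a hedged sketch of the migration being described (struct and method names below are illustrative assumptions, not the PR's exact API): temperature moves off the runner constructor and into a per-call generation config, which is what a JNI `generate()` without a temperature argument would then rely on.

```cpp
#include <functional>
#include <iostream>
#include <string>

// Illustrative config object; the real GenerationConfig may differ.
struct GenerationConfigSketch {
  float temperature = 0.8f;  // assumed default; per-call instead of per-runner
  int max_new_tokens = 128;
};

class RunnerSketch {
 public:
  // New-style generate(): sampling parameters arrive with each call, so the
  // constructor no longer needs a temperature argument at all.
  void generate(
      const std::string& prompt,
      const GenerationConfigSketch& config,
      const std::function<void(const std::string&)>& on_token) {
    // A real implementation would sample with config.temperature here.
    on_token("[generated '" + prompt + "' with temperature=" +
             std::to_string(config.temperature) + "]");
  }
};

int main() {
  RunnerSketch runner;
  GenerationConfigSketch config;
  config.temperature = 0.7f;  // set per call, not baked into the runner
  runner.generate("Hello", config, [](const std::string& piece) {
    std::cout << piece << '\n';
  });
  return 0;
}
```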
```cpp
    text_token_generator,
    std::unique_ptr<::executorch::extension::llm::Stats> stats,
    float temperature = -1.0f);

std::unique_ptr<llm::TextLLMRunner> create_llama_runner(
```
Do you think we should keep the `example::create_llama_runner` method at all? Should we just replace it with `llm::load_tokenizer` and `llm::create_llm_runner` at all the current call sites of `example::create_llama_runner`?
It seems like unnecessary indirection, and we would like to showcase these low-level APIs in our examples, as opposed to having yet another layer of indirection (i.e., `example::create_llama_runner`).
I'm fine if you want to do this in a follow-up PR, but I think we should do it very soon.
Yeah, I don't have a strong opinion on this. I can do a follow-up.
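For reference, a sketch of what a call site might look like once the indirection is gone (the header path and exact factory signatures are assumptions based on the names used in this thread, not verified against the PR):

```cpp
#include <memory>
#include <string>
#include <utility>

// Assumed header; the PR introduces text_llm_runner.h under extension/llm.
#include <executorch/extension/llm/runner/text_llm_runner.h>

namespace llm = ::executorch::extension::llm;

std::unique_ptr<llm::TextLLMRunner> make_runner(
    const std::string& model_path,
    const std::string& tokenizer_path) {
  // Load the tokenizer and hand ownership straight to the generic factory,
  // with no example::create_llama_runner in between.
  auto tokenizer = llm::load_tokenizer(tokenizer_path);
  return llm::create_text_llm_runner(model_path, std::move(tokenizer));
}
```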
Force-pushed from `85fb703` to `f721f3f`.
Pull Request resolved: #11342

Introducing `text_llm_runner`. This can be used to run all text-only, decoder-only LLM models supported by ExecuTorch.

* Metadata is being read out from the .pte file and used to construct the runner object.
* examples/models/llama/runner.h[.cpp] only contains a simple wrapper around `text_llm_runner.h[.cpp]`.

In the next PRs I will move examples/models/phi-3-mini/runner to use the generic runner. Will look into the QNN and MediaTek runners as well.

ghstack-source-id: 288571542
@exported-using-ghexport

Differential Revision: [D75910889](https://our.internmc.facebook.com/intern/diff/D75910889/)
Force-pushed from `f721f3f` to `3fda3a4`.
Close, duplicate of #11342.
Introducing `text_llm_runner`. This can be used to run all text-only, decoder-only LLM models supported by ExecuTorch.

* Metadata is being read out from the .pte file and used to construct the runner object.
* examples/models/llama/runner.h[.cpp] only contains a simple wrapper around `text_llm_runner.h[.cpp]`.

In the next PRs I will move examples/models/phi-3-mini/runner to use the generic runner. Will look into the QNN and MediaTek runners as well.

Differential Revision: D75910889
ghstack-source-id: 288004703
Pull Request resolved: #11342
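To make the description concrete, here is a rough end-to-end usage sketch. The header path, the `GenerationConfig` field, and the `generate()` signature are assumptions inferred from this thread, not a verified API.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Assumed header path for the new generic runner.
#include <executorch/extension/llm/runner/text_llm_runner.h>

namespace llm = ::executorch::extension::llm;

int main() {
  // Any text-only, decoder-only .pte should work: the runner reads the
  // model metadata out of the file to configure itself.
  auto tokenizer = llm::load_tokenizer("tokenizer.model");
  auto runner =
      llm::create_text_llm_runner("model.pte", std::move(tokenizer));

  llm::GenerationConfig config;  // assumed config type from this PR
  config.seq_len = 128;          // assumed field name

  // Stream generated pieces to stdout as they are produced.
  runner->generate("Tell me a story.", config,
                   [](const std::string& piece) { std::cout << piece; });
  return 0;
}
```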