-
-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify get citation call for markup #240
Conversation
With the introduction of markup parsing, handling multiple parameters became unwieldy. To address this, I added `Document` object that encapsulates: - Plain and markup text - Span updates for text mapping - Tokenized words and extracted citation tokens This refactor should enable: - **More complex parsing**, using html - **Simplified logic** by reducing the number of parameters passed around Additionally, it simplifies the api call by allowing a user to pass in markup or plain text but not needing to do both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great; just a couple small comments
Im going to suggest we just skip the benchmark because of the difficulty of changing the function calls. |
We should change the benchmark call now, if not done, it will basically deactivate the benchmark for every PR until we actually fix it eyecite/benchmark/benchmark.py Lines 55 to 67 in ba84b13
if text:
# Remove XML encodings from xml_harvard
markup_text = re.sub(r"^<\?xml.*?\?>", "", text, count=1)
plain_text = ""
clean_steps = ["html", "inline_whitespace"]
else:
plain_text = row["plain_text"]
clean_steps = ["all_whitespace"]
markup_text = ""
found_citations = get_citations(
plain_text,
markup_text=markup_text,
clean_steps =clean_steps
)
|
Refactor call to use cleaning inside method
im loving the performance gain- thanks I was working on the benchmark but looks all set now |
The Eyecite Report 👁️Gains and LossesThere were 0 gains and 0 losses. Click here to see details.
Time ChartGenerated Files |
How'd y'all squeeze out that performance??? |
Add Document object for better citation parsing
With the introduction of markup parsing,
handling multiple parameters became unwieldy.
This PR creates a new
Document
object that encapsulates:This should simplify function calls and ultimately enable better HTML extraction.
Also simplify the get citations when using markup text only requiring markup text and cleaning steps
Currently we must pre-process plaintext before parsing markup.