Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify get citation call for markup #240

Merged
merged 7 commits into from
Mar 21, 2025
Merged

Simplify get citation call for markup #240

merged 7 commits into from
Mar 21, 2025

Conversation

flooie
Copy link
Contributor

@flooie flooie commented Mar 19, 2025

Add Document object for better citation parsing

With the introduction of markup parsing,
handling multiple parameters became unwieldy.
This PR creates a new Document object that encapsulates:

  • Plain and markup text
  • Span updates for text mapping
  • Tokenized words and extracted citation tokens

This should simplify function calls and ultimately enable better HTML extraction.
Also simplify the get citations when using markup text only requiring markup text and cleaning steps

Currently we must pre-process plaintext before parsing markup.

flooie added 2 commits March 19, 2025 13:30
With the introduction of markup parsing,
handling multiple parameters became unwieldy.
 To address this, I added `Document` object that encapsulates:

- Plain and markup text
- Span updates for text mapping
- Tokenized words and extracted citation tokens

This refactor should enable:
- **More complex parsing**, using html
- **Simplified logic** by reducing the number of parameters passed around

Additionally, it simplifies the api call
by allowing a user to pass in markup or plain text but not
needing to do both
@flooie flooie requested a review from grossir March 19, 2025 17:51
Copy link
Contributor

@grossir grossir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great; just a couple small comments

@grossir grossir assigned flooie and unassigned grossir Mar 20, 2025
@flooie
Copy link
Contributor Author

flooie commented Mar 20, 2025

Im going to suggest we just skip the benchmark because of the difficulty of changing the function calls.

@flooie flooie assigned grossir and unassigned flooie Mar 20, 2025
@grossir
Copy link
Contributor

grossir commented Mar 21, 2025

We should change the benchmark call now, if not done, it will basically deactivate the benchmark for every PR until we actually fix it
Change isn't too long, it's on these lines

if text:
# Remove XML encodings from xml_harvard
text = re.sub(r"^<\?xml.*?\?>", "", text, count=1)
opinion_text_is_marked_up = True
else:
text = row["plain_text"]
opinion_text_is_marked_up = False
plain_text = clean_text(text, ["html", "inline_whitespace"])
found_citations = get_citations(
plain_text,
markup_text=text if opinion_text_is_marked_up else "",
)

            if text:
                # Remove XML encodings from xml_harvard
                markup_text = re.sub(r"^<\?xml.*?\?>", "", text, count=1)
                plain_text = ""
                clean_steps =  ["html", "inline_whitespace"]
            else:
                plain_text = row["plain_text"]
                clean_steps = ["all_whitespace"]
                markup_text = ""

            found_citations = get_citations(
                plain_text,
                markup_text=markup_text, 
                clean_steps =clean_steps
            )
            

@flooie
Copy link
Contributor Author

flooie commented Mar 21, 2025

im loving the performance gain- thanks I was working on the benchmark but looks all set now

Copy link
Contributor

The Eyecite Report 👁️

Gains and Losses

There were 0 gains and 0 losses.

Click here to see details.
id Gain Loss

Time Chart

image

Generated Files

Branch 1 Output
Branch 2 Output
Full Output CSV

@flooie flooie merged commit 7f49660 into main Mar 21, 2025
14 checks passed
@flooie flooie deleted the simplify-method branch March 21, 2025 13:53
@mlissner
Copy link
Member

How'd y'all squeeze out that performance???

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants