-
-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Markup case names #241
base: main
Are you sure you want to change the base?
Markup case names #241
Conversation
Add check inside method for plaintiff Use that to identify the full plaintiff name Also improve regex building to avoid unnecessary escapes
Identify antecedent guesses in markup change method name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good, some comments below
Changed method for speed improvements Added improvements for defendants as well. More example tests Fixed full span
I ran a test on Miranda v. Arizona to illustrate the improvements. It lists only changes identified. In this case you can see accurate plaintiff parsing and you can also see better defendant parsing. Especially when or usually when the citation is a parallel to the main. It no longer mistakenly identifies the previous citation with the defendant name when handling markup text. |
Refactor methods for readability Add support for antecedents Strip out stop words if starting extraction
Pagination has long been in the way and this helps identify cites with pagination in the middle of the citation
Add test to illustrate L.A. support Can now extract defendant out properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names that this PR are able to collect look great.
However, there are some weird losses on the benchmark
Since it's picking everything inside the emphasis tag, it is picking 'Cf' as part of the name. This makes it lose the following Akins
standalone reference, since it's looking for Cf. Akins
. Solvable by cleaning up stopwords like Cf
. Another one that happens a lot inside the same style tags as the name is "See", you should probably include these in the tests
FullCaseCitation('104 So.3d 1173', groups={'volume': '104', 'reporter': 'So.3d', 'page': '1173'}, metadata=FullCaseCitation.Metadata(parenthetical='reversing nonhomicide sentence of life without parole where trial court exercised its discretion in imposing a sentence of twenty-seven years on the related homicide offense of second-degree murder, an offense punishable by a maximum of life in prison', pin_cite=None, pin_cite_span_start=None, pin_cite_span_end=None, year='2012', court=None, plaintiff='Cf. Akins', defendant='State', extra=None, antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
This is the full citation "The opinion in Thrift Funds Canal, Inc. v. Foy, 261 La. 573, 260 So.2d 628 (1972), classifies conventional mortgages into three categories:"
On this branch, it's not picking up "Foy" as defendant
FullCaseCitation('261 La. 573', groups={'volume': '261', 'reporter': 'La.', 'page': '573'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, pin_cite_span_start=None, pin_cite_span_end=None, year='1972', court=None, plaintiff='Inc.', defendant=None, extra='260 So.2d 628', antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
FullCaseCitation('260 So.2d 628', groups={'volume': '260', 'reporter': 'So.2d', 'page': '628'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, pin_cite_span_start=None, pin_cite_span_end=None, year='1972', court=None, plaintiff='Inc.', defendant=None, extra=None, antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
but on main, it is picking it up. I am not sure about the reason for the change...
FullCaseCitation('261 La. 573', groups={'volume': '261', 'reporter': 'La.', 'page': '573'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, pin_cite_span_start=None, pin_cite_span_end=None, year='1972', court=None, plaintiff='Inc.', defendant='Foy', extra='260 So.2d 628', antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
FullCaseCitation('260 So.2d 628', groups={'volume': '260', 'reporter': 'So.2d', 'page': '628'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, pin_cite_span_start=None, pin_cite_span_end=None, year='1972', court=None, plaintiff='Inc.', defendant='Foy', extra=None, antecedent_guess=None, resolved_case_name_short=None, resolved_case_name=None)),
Which makes it able to pick up 2 references below
You should probably check the "gain" column (which is actually the loss)
eyecite/helpers.py
Outdated
if ( | ||
document.markup_text | ||
and document.plain_to_markup | ||
and document.markup_to_plain | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these checks are also repeated in lines 203-207, shouldn't they just be done inside the functions themselves? And just return the same input values if there are no markup related attributes
In fact, a couple of the checks are already repeated inside of extract_full_text_from_markup, but not inside the other one
# Convert plain text offset to a markup offset. and bail if no markup
if not document.plain_to_markup or not document.markup_to_plain:
return base_offset, default_text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was done to satisfy mypy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's a good time to use the mypy ignore
https://mypy.readthedocs.io/en/stable/common_issues.html#spurious-errors-and-locally-silencing-the-checker
eyecite/helpers.py
Outdated
new_text = update_defendant_markup(document, stop_word) | ||
if citation.metadata.defendant != new_text: | ||
citation.metadata.defendant = new_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the text is updated, shouldn't the full_span_start
be updated too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wont affect the start, this actually just reduces the defendant on the back end. Thats why no update.
eyecite/helpers.py
Outdated
Parameters: | ||
document: The Document instance. | ||
base_offset: The plain text offset to convert | ||
default_text: The current text stored for this field. | ||
source_end: Optional plain text end index to use when extracting text. | ||
If provided, full text is extracted as: | ||
document.plain_text[plain_text_start:source_end] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the standard docstring format we are using in eyecite
:param document: ...
:param base_offset: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I think the format is somewhere in the middle
(e.g., a judicial opinion or other legal doc), return a list of
`eyecite.models.CitationBase` objects representing the citations found
in the doc.
Args:
plain_text: The text to parse. You may wish to use the
`eyecite.clean.clean_text` function to pre-process your text
before passing it here.
remove_ambiguous: Whether to remove citations that might refer to more
than one reporter and can't be narrowed down by date.
tokenizer: An instance of a Tokenizer object. See `eyecite.tokenizers`
for information about available tokenizers. Uses the
`eyecite.tokenizers.AhocorasickTokenizer` by default.
markup_text: if the source text has markup (XML or HTML mostly), pass
it to extract ReferenceCitations that may be detectable via
markup style tags
clean_steps: Cleanup steps and methods
Returns:
A list of `eyecite.models.CitationBase` objects
eyecite/helpers.py
Outdated
# Not inside an emphasis/i tag | ||
return base_offset, default_text | ||
|
||
# Use the first and only matching emphasis tag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are no checks in place for to ensure it is the "only" tag in filtered_results
eyecite/helpers.py
Outdated
document: The Document instance. | ||
base_offset: The plain text offset to convert | ||
default_text: The current text stored for this field. | ||
source_end: Optional plain text end index to use when extracting text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not call this argument plain_text_end
instead of source_end
we lose FOY because its not correctly tagged in HTML |
Add Cf. as stop word Dont ovewrite none if we have data Update docstrings and comments
More edge case examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, I think it's mostly ready
My only nitpick is that add_defendant
is now a big and complex function, and adds both plaintiff and defendant, and we should refactor it eventually
eyecite/helpers.py
Outdated
if ( | ||
document.markup_text | ||
and document.plain_to_markup | ||
and document.markup_to_plain | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's a good time to use the mypy ignore
https://mypy.readthedocs.io/en/stable/common_issues.html#spurious-errors-and-locally-silencing-the-checker
I haven't looked at the PR, but I think now may be the best time to clean things up, since we probably won't return to eyecite for a while, and when we do it'll be harder to do the refactoring. |
Use type ignore and flake8
Just seeing this. It did both plaintiff and defendant previously. But plaintiff was always a throw-a-way process. |
This is a large refactor It is meant to improve case title extraction by removing add defendants and relying on either HTML or a series of heurestics to make educated guesses about when a title stops It separates the logic between html and plaintext It adds to stop words it removes add defendant
I didnt update a test for our new improved extraction. This one caputres the full text
@grossir something is cached or broken about the benchmark ... I found one in particular weird but others I couldn't replicate. this entry ```
|
@grossir there we go - I need to re run name extraction because many html files have some tagged nicely and some not. |
Fix bug in index shift and text cleanup
The Eyecite Report 👁️Gains and LossesThere were 5 gains and 21 losses. Click here to see details.There were 53 changes so we are only displaying the first 50. You can review the
Time ChartGenerated Files |
|
I added a list of old vs new name extraction for a random case. using only plain text. |
@grossir
This PR adds support for identifying the full case name in markup text. When a plaintiff is identified
it calls a new method to check if the text is inside a tag and extracts that out.
Also supports antecedent guess
This should improve reference citation extraction and better identify citations.