Issue 299 match citations without page number #1226

mattdahl
Collaborator

Re: issue #299. Basically this PR does two things:

  1. Allows the parser to recognize and gather citations when the page number is indicated as explicitly missing. (e.g., 1 U.S. ____)
  2. Tries to match those citations to opinions using only the case name information, which should still help make a match even if the page number is missing.

I'm not an expert on Solr so maybe there's more we could do with number 2; I just kept it pretty simple here.
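For readers who want to see what item 1 might look like, here is a minimal sketch, not the PR's actual code. It assumes a regex-based parser; the pattern name `CITATION_RE` and the helper `find_citations` are hypothetical, and the pattern only covers the U.S. reporter for brevity.

```python
import re

# Hypothetical pattern: a volume, the reporter abbreviation, then either a
# page number or a run of two or more underscores marking a missing page.
CITATION_RE = re.compile(
    r"(?P<volume>\d+)\s+(?P<reporter>U\.S\.)\s+(?P<page>\d+|_{2,})"
)


def find_citations(text):
    """Return (volume, reporter, page) tuples; page is None when it is blank."""
    results = []
    for match in CITATION_RE.finditer(text):
        page = match.group("page")
        results.append(
            (
                int(match.group("volume")),
                match.group("reporter"),
                None if page.startswith("_") else int(page),
            )
        )
    return results
```

A citation whose page comes back as `None` would then be handed to the name-based matching described in item 2.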

@mlissner
Member

This looks good technically, but I'm nervous about false positives. One way to reduce them would be to add per-volume dates to the reporters_db so that something like 22 U.S. can have a date range associated with it.
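As a rough illustration of the per-volume date idea, the sketch below rejects a candidate opinion whose decision date falls outside the volume's range. The `VOLUME_DATES` mapping, the `volume_could_contain` helper, and the fictional reporter "Ex. Rptr." with its dates are all assumptions made up for the example; this is not the reporters_db format.

```python
from datetime import date

# Hypothetical per-volume data: (reporter, volume) -> (first, last) decision date.
# "Ex. Rptr." and its dates are invented placeholders, not real reporter data.
VOLUME_DATES = {
    ("Ex. Rptr.", 22): (date(1990, 1, 1), date(1991, 6, 30)),
}


def volume_could_contain(reporter, volume, decided):
    """Return False only when the volume's known date range excludes `decided`."""
    span = VOLUME_DATES.get((reporter, volume))
    if span is None:
        # No data for this volume: do not rule the candidate out.
        return True
    start, end = span
    return start <= decided <= end
```

Treating unknown volumes as "possible" keeps the filter from discarding matches merely because the date data is incomplete.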

Short of that, it'd be nice to have some sense of how bad the false positive problem would be if we implemented this, but the only way to do that would be to keep a record of the cases that are matched as part of this and then to go back and check through them. Some kind of record-keeping process might not be a bad next step if the reporters_db approach doesn't work.

@mlissner
Member

@mattdahl, just checking in. Guessing corona has you pinned, but thought I'd ask what you thought about this one. I guess the questions are:

  1. Would the false positive problem be bad enough that we have to do something about it?
  2. If so, what should we do?
  3. If the solution is to add reporter years to the reporters_db, do you have an interest in that?

@mattdahl
Collaborator Author

I don't know, I don't have a good sense of what the false positive rate would be. I think adding the reporter years to the reporters_db would be a great improvement though (would also improve matching performance generally). However, I don't think I have time right now to tackle that, though it depends on how tricky it looks. What's the format of those CSVs that you mentioned?

@flooie
Contributor

flooie commented Mar 26, 2020

Adding reporter years would be easy. @mattdahl it would take one pass through the DB to provide it in whatever format you would want it.

@mlissner
Member

I made an issue for the reporter thing, here: freelawproject/reporters-db#19

@mattdahl, if you were willing to take the citator part over the finish line, I think @flooie could build up the volume-reporter-date range stuff for you. What do you think?

@mattdahl
Collaborator Author

mattdahl commented Mar 26, 2020

Yes, I'm certainly willing to be responsible for integrating the volume years into the citator -- that should be easy -- if @flooie is willing to deal with generating them!

@mlissner
Member

EXCELLENT! Thank you both! This will be cool.

@mlissner
Member

I just closed our freelawproject/reporters-db#19, because as @brianwc says:

That style of citation is generally only used in slip opinions prior to the
volume being published. However, if you know the year of the opinion in
which you found such a citation, then I think we'd find that the years
covered by the cited volume are that very year, maybe +/- 1 year. So, if I
find such a citation in an opinion from 2016, then the volume of that
citation likely covers opinions from 2016 as well, maybe 2015-17.

Instead of doing the work of getting per-volume dates into our reporter DB, we can just do the above. What do you think, @malteos?
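The heuristic quoted above could be sketched roughly as follows. This is only an illustration under assumptions: candidates are represented as hypothetical `(case_name, year_decided)` tuples, and the function names are invented for the example.

```python
def within_year_window(citing_year, candidate_year, before=1, after=1):
    """True when the candidate's year lies within the window around the citing year."""
    return citing_year - before <= candidate_year <= citing_year + after


def filter_candidates(citing_year, candidates, before=1, after=1):
    """Keep only (case_name, year_decided) tuples that fall inside the window."""
    return [
        candidate
        for candidate in candidates
        if within_year_window(citing_year, candidate[1], before, after)
    ]
```

The `before` and `after` parameters are left adjustable because the right window size is not settled here.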

@johnhawkinson
Contributor

johnhawkinson commented Mar 27, 2020

I suspect this is probably the right conclusion (I have not fully digested the issue), but I am slightly confused on this.

We're talking about citations like 442 U.S. ___ and 1 U.S. ____? My experience with the US Reporter is I usually see ___ U.S. ___ but don't usually see just the volume. I'm not quite sure where the volume-only form comes up. Edit: That's a false memory. It's only the U.S. reporter where I see volume-only, it's the other reporters that have double-blanks.

Maybe it's not so important to understand that, though, so maybe just ignore me.

@mattdahl
Collaborator Author

@mlissner Ah, this is obvious now that you point this out. That's an easy heuristic to use. However, I think a window of +/- 1 year is too small -- this 2015 opinion (for example) still doesn't have a page number assigned in the U.S. reporter, and that's from 5 years ago. So if we grabbed the year from an opinion citing it today, we'd have to look back at least 5 years to find it. (N.B., This has always infuriated me -- why would it take 5+ years for an official citation to be assigned to a case?? Maybe I've been doing something wrong?)

@johnhawkinson I admit I have no idea about the principles behind when (or why) opinions are missing volumes or page numbers (or both); all I can say is that I encounter citations in the form 442 U.S. ___ all the time.

@johnhawkinson
Contributor

The proper span of time varies with the reporter. The U.S. reporter is probably the worst of them, and hasn't published since 2012 (570 U.S.; see https://www.supremecourt.gov/opinions/boundvolumes.aspx). It probably should get a different lookaround window than any other reporter.
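A per-reporter window might be expressed along these lines. This is a sketch only: the `LOOKBACK_BY_REPORTER` table, the helper names, and the specific year counts are illustrative assumptions, not measured values.

```python
# Hypothetical look-back windows in years; the numbers are illustrative guesses.
DEFAULT_LOOKBACK = 1
LOOKBACK_BY_REPORTER = {"U.S.": 8}


def lookback_years(reporter):
    """Return the look-back window for a reporter, falling back to the default."""
    return LOOKBACK_BY_REPORTER.get(reporter, DEFAULT_LOOKBACK)


def candidate_year_range(reporter, citing_year):
    """Return the (earliest, latest) decision years to search for a blank-page cite."""
    return (citing_year - lookback_years(reporter), citing_year)
```

Keeping the windows in a small table means a slow reporter can get a wider window without changing the matching logic.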

If it's not too distracting, what's an example source for 442 U.S. ___? --- Oh, wait, I guess that SCOTUS slip opinions have that as a "cite as" header. E.g. this week's say "Cite as 589 U.S. ___ (2020)"

I'm not sure that any other slip opinions (other than the Supreme Court and the US Reporter) do that. So maybe it's all a special case.

At least be aware of that.

@mattdahl
Collaborator Author

Ah, interesting! I confess to being a bit of an "elitist" -- my research is really only about the Supreme Court -- so I don't read a lot of opinions from other courts 😅

For an example of the missing page citation, Carpenter v. United States cites Riley v. California, 573 U.S. ___ a few times in our version.

@mlissner
Member

Well, unbeknownst to me, @flooie got a big part of the reporter thing done yesterday. He now has reporter data for most of the Westlaw corpus in a JSON file. He's going to expand it with Lexis too and we'll be able to do this the hard way instead of the cheat (@brianwc's) way. I don't think we'll add the data from Case.law for the moment.

Beyond being useful here, we just decided that the output is interesting enough in itself that we're probably going to do a blog post about it that'll enable people to answer things like, "Is SCOTUS indeed the slowest court to get citations?"

@mlissner
Member

mlissner commented Mar 4, 2021

(All these changes belong in eyecite now, closing.)

@mattdahl
Collaborator Author

mattdahl commented Mar 4, 2021

Just for posterity, the matching changes (d9036c4 and some of cbaef64) still need to happen in CL. But I have no problem with you closing this, as it is vastly outdated and could never be merged in this state.

4 participants