Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev min dist fix #25

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from
Open

Dev min dist fix #25

wants to merge 6 commits into from

Conversation

jrober84
Copy link
Collaborator

These fixes address several issues which occured with a different dataset.

  1. The final chunk of a file was not being processed (fixed) which resulted in query samples missing from the output
  2. I removed max distance filtering as well because it actually results in incorrect calculations in this data because you have multiple samples which fall outside of the threshold ranges
  3. A bad assumption I had was all reference samples within a cluster would be within a range and so have added protection for missing samples.
  4. Several of these issues only occur when multiple samples were being assigned at once which was due to the fact that the lookup membership was not being updated (they had to both found new clusters and be part of the same clusters)

d = float(line[2])
#print(f'{qid} {rid} {d}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing this a vestige of testing

@@ -200,21 +198,21 @@ def assign(self, n_records=1000,delim="\t"):
addr_members = self.memberships_lookup[addr]
addr_dists = []
for id in addr_members:
addr_dists.append(dists[qid][id])
if id in dists[qid]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is still useful to remove the samples with distances below minimum threshold, and use this to avoid the key error?

self.add_memberships_lookup(qid, query_addr)
#self.memberships_dict[qid] = ".".join([str(x) for x in query_addr])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to remove the commented code out now that it is moved to the function? or is this a helpful reminder of what was changed?

@sgsutcliffe
Copy link
Contributor

I think it looks good. I went through it and nothing stood out. Always a good sign! Should we come up with tests for these alternate scenarios that broke the orginal assumptions?

@sgsutcliffe sgsutcliffe self-requested a review February 19, 2025 17:07
@apetkau apetkau self-assigned this Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: assign() generates KeyError when using read_pd() samples based on min_dist
3 participants