BA: missing person gender #817

Closed
TomazErjavec opened this issue Oct 19, 2023 · 4 comments
Comments

@TomazErjavec
Collaborator

The current BA corpus has persons without the <sex> element, even though it is easy to determine based on the person's forename.

@TomazErjavec added the bug (Something isn't working) label Oct 19, 2023
@TomazErjavec added this to the Future milestone Oct 19, 2023
@TomazErjavec assigned TomazErjavec and unassigned nljubesi and 5roop Mar 4, 2024
@TomazErjavec
Collaborator Author

As explained in #815, there is now a procedure in place to add missing gender. But before doing that, I noticed that some forenames and surnames are mixed up in BA. I only tested forenames ending in '-ić' and corrected those; I'm sure there are more that I haven't noticed. As a side effect of this fix, we now have duplicated persons: one entry where the order was already correct and one where it has been corrected. Even though both now have the correct names, they still have different person IDs (e.g. BoškoŠiljeković vs. ŠiljekovićBoško). Hopefully this will be fixed in the future.
This is the list of changes:

284,285c284,285
<          <surname>Ristan</surname>
<          <forename>Ristić</forename>
---
>          <surname>Ristić</surname>
>          <forename>Ristan</forename>
7692,7693c7692,7693
<          <surname>Boško</surname>
<          <forename>Šiljeković</forename>
---
>          <surname>Šiljeković</surname>
>          <forename>Boško</forename>
8052,8053c8052,8053
<          <surname>Dragutin</surname>
<          <forename>Ilić</forename>
---
>          <surname>Ilić</surname>
>          <forename>Dragutin</forename>
8250,8251c8250,8251
<          <surname>Anto</surname>
<          <forename>Spajić</forename>
---
>          <surname>Spajić</surname>
>          <forename>Anto</forename>
8856,8857c8856,8857
<          <surname>Jadranko</surname>
<          <forename>Tomić</forename>
---
>          <surname>Tomić</surname>
>          <forename>Jadranko</forename>
9048,9049c9048,9049
<          <surname>Muharem</surname>
<          <forename>Imamović</forename>
---
>          <surname>Imamović</surname>
>          <forename>Muharem</forename>
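
For anyone wanting to repeat the '-ić' check, here is a minimal sketch of how it could be automated. It is not the script that was actually used. It assumes un-namespaced XML in which <surname> and <forename> are siblings; the wrapper elements in the sample (listPerson, person, persName) and the function name swap_suspect_names are illustrative assumptions. Real TEI files declare a default namespace, so the tag lookups would need to be namespace-qualified there.

```python
import xml.etree.ElementTree as ET


def swap_suspect_names(root, suffix="ić"):
    """Swap surname/forename text where only the forename ends in `suffix`.

    Returns a list of (new_surname, new_forename) pairs for every swap made.
    Heuristic only: it misses swapped names that do not end in the suffix.
    """
    swapped = []
    for parent in root.iter():
        surname = parent.find("surname")
        forename = parent.find("forename")
        if surname is None or forename is None:
            continue  # this element does not directly hold a name pair
        s = (surname.text or "").strip()
        f = (forename.text or "").strip()
        # Suspect: forename looks like a surname, surname does not.
        if f.endswith(suffix) and not s.endswith(suffix):
            surname.text, forename.text = f, s
            swapped.append((f, s))
    return swapped


# Hypothetical sample: first entry is swapped, second is already correct.
SAMPLE = """<listPerson>
  <person><persName><surname>Ristan</surname><forename>Ristić</forename></persName></person>
  <person><persName><surname>Tomić</surname><forename>Jadranko</forename></persName></person>
</listPerson>"""

root = ET.fromstring(SAMPLE)
print(swap_suspect_names(root))
```

The check is deliberately conservative: entries where both names end in '-ić' are left alone, since the suffix alone cannot tell which is which.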

@nljubesi
Collaborator

nljubesi commented Mar 6, 2024

Thanks for this as well. The different IDs are due to you exchanging the surname and forename, but not changing the ID? Would that not be simple to resolve once we know what is the correct name and what the correct surname?

We did not do any work on the Bosnian data ourselves, but have obtained them from our upstream source, so are very unknowledgeable on what issues the data might have.

@TomazErjavec
Collaborator Author

The different IDs are due to you exchanging the surname and forename, but not changing the ID?

Yes, exactly.

Would that not be simple to resolve once we know what is the correct name and what the correct surname?

Well, it is simple in that it is clear what needs to be done - but you need to go through all the files and replace, so with some testing that nothing is messed up it might take a while. More than I would gladly invest, esp. as these mistakes cropped up by chance, who knows how many others are lurking in there...

However, I will re-open this issue, maybe somebody finds the time. Just in case it would be @nljubesi, let me know beforehand, as the source data has now been fixed and you need to get that copy.
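
The "go through all the files and replace" step could look roughly like the sketch below. It is only an illustration under assumptions: that references to persons appear as "#ID" tokens inside attribute values (e.g. a who attribute), that a hand-checked mapping from obsolete to canonical IDs exists, and that merging the duplicated person entries themselves is handled separately. The function name remap_person_refs and the mapping direction in the example are hypothetical.

```python
import re


def remap_person_refs(text, id_map):
    """Rewrite '#old_id' references to '#new_id' in the given XML text.

    Only whole references are touched: the ID must follow '#' and be
    followed by a double quote or whitespace. Returns (new_text, count).
    """
    if not id_map:
        return text, 0
    # Longest IDs first so that one ID being a prefix of another is safe.
    alternation = "|".join(
        re.escape(old) for old in sorted(id_map, key=len, reverse=True)
    )
    pattern = re.compile(r'(?<=#)(' + alternation + r')(?=["\s])')
    return pattern.subn(lambda m: id_map[m.group(1)], text)


# Hypothetical mapping; which of the two IDs is canonical must be decided
# by the maintainers, the direction here is chosen only for illustration.
ID_MAP = {"BoškoŠiljeković": "ŠiljekovićBoško"}

line = '<u who="#BoškoŠiljeković" ana="#regular">'
new_line, count = remap_person_refs(line, ID_MAP)
print(new_line, count)
```

Returning the replacement count per file makes it easier to do the "testing that nothing is messed up" part: the totals can be compared against the number of references found before the rewrite.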

@TomazErjavec reopened this Mar 6, 2024
@TomazErjavec
Collaborator Author

Sex has been added to BA. As regards the wrong forename/surname distinction, this is now discussed in #852.
