Parsing the tab-separated format (text files in the .TXT directory of the ParlaMint 4.1 corpora) with pandas revealed a problem with unclosed double quotation marks, e.g., lobby"s. After an unclosed quotation mark ("), pandas disregards the \t and \n characters until the next quotation mark appears in the text, so it merges multiple lines into a single row.
Example: line 20 in file “ParlaMint-BE.txt/2022/ParlaMint-BE_2022-07-12-voorlopig-55-commissie-ic856x.txt” (lobby"s)
This issue was detected in the ParlaMint 4.1 datasets for BE, NL, SI, but I suspect it is present in other datasets as well.
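The merging can be reproduced on a few invented lines. This is only a sketch: the sample text and IDs are made up (not taken from ParlaMint), the column names `ID` and `text` are borrowed from the workaround code in this issue, and the sample assumes a text field that opens with a double quotation mark that is not closed on the same line.

```python
import io

import pandas as pd

# Invented sample in the ID<TAB>text layout (not real ParlaMint data).
# The text of u1 opens a quotation mark that is never closed on its line.
sample = (
    'u1\t"Lobby is een probleem\n'
    'u2\tNog een zin\n'
    'u3\tHij zei "ja" en ging weg\n'
    'u4\tLaatste zin\n'
)

# Default settings: pandas treats " as a quote character, so tabs and
# newlines inside the "quoted" stretch become part of a single field.
naive_df = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["ID", "text"]
)

n_lines = len(sample.splitlines())
print(f"lines in input: {n_lines}, rows parsed: {len(naive_df)}")
```

Comparing the number of input lines with the number of parsed rows, as done here, is a quick way to check whether a given file is affected.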
This issue was worked around by reading the files line by line instead of parsing them with pandas, but it is important that users are aware of it. The code used to circumvent the issue:
import pandas as pd

# Read the raw lines ourselves instead of letting pandas parse the file
with open(path, encoding="utf-8") as f:
    lines = f.readlines()

texts = []
IDs = []
for line in lines:
    # Remove tabs at the end of lines
    line = line.replace("\t\n", "\n")
    # Remove newline characters
    line = line.replace("\n", "")
    # Remove any quotation marks
    line = line.replace('"', '')
    cur_ID, cur_text = line.split("\t")
    texts.append(cur_text)
    IDs.append(cur_ID)

new_df = pd.DataFrame({"ID": IDs, "text": texts})
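A possible alternative, not the approach used above, is to keep pandas as the parser but switch off quote handling with `quoting=csv.QUOTE_NONE`, which leaves the quotation marks in the text instead of deleting them. The sketch below rests on assumptions: the helper name `read_parlamint_tsv` is hypothetical, the sample lines are invented, and every line is assumed to have exactly two tab-separated fields with no header row and no trailing tab (lines ending in a tab would still need to be cleaned first).

```python
import csv
import io

import pandas as pd


def read_parlamint_tsv(source):
    """Hypothetical helper: read ID<TAB>text lines without quote handling.

    `source` may be a file path or a file-like object.
    """
    return pd.read_csv(
        source,
        sep="\t",
        header=None,             # assumption: the file has no header row
        names=["ID", "text"],    # column names borrowed from the workaround
        quoting=csv.QUOTE_NONE,  # treat " as an ordinary character
        dtype=str,
    )


# Invented sample (not real ParlaMint data) with unbalanced quotation marks.
sample = (
    'u1\tde lobby"s zijn actief\n'
    'u2\t"Een open citaat\n'
    'u3\tHij zei "ja"\n'
)

df = read_parlamint_tsv(io.StringIO(sample))
print(df)
```

With quote handling disabled, each input line should map to one row, and the quotation marks stay in the `text` column exactly as they appear in the transcript.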
Yes, here I can only second @matyaskopp's comment, as there is nothing we can do with the text: that's the way it is in the transcripts, and we don't change the text there. In fact, it would be difficult to change it so that this issue was fixed without affecting other text.
I won't close the issue in case somebody has a better idea of what to do, such as pointing out this issue somewhere (but where?).