
Unclosed quotation marks in tab-separated text files lead to line merging when parsed with pandas #899

Closed
TajaKuzman opened this issue Apr 10, 2025 · 3 comments
Labels
enhancement New feature or request


@TajaKuzman

Parsing the tab-separated format (the text files in the .TXT directory of the ParlaMint 4.1 corpora) with pandas revealed a problem with unclosed double quotation marks, e.g., lobby"s. After an unclosed quotation mark ("), the \t and \n characters were ignored until the next quotation mark appeared in the text, so pandas merged multiple lines into a single row.

Example: line 20 in file “ParlaMint-BE.txt/2022/ParlaMint-BE_2022-07-12-voorlopig-55-commissie-ic856x.txt” (lobby"s)

This issue was detected in the ParlaMint 4.1 datasets for BE, NL, SI, but I suspect it is present in other datasets as well.

The code used:

import pandas as pd

current_df = pd.read_csv(file, sep="\t", index_col=False)
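A minimal sketch of how the merging can be reproduced without the corpus files. The sample rows are made up, the column names `ID` and `text` are assumptions, and the sample has no header row. The sketch places the opening quotation mark at the start of the text field, which is where pandas' default quoting begins a quoted field as far as I know; the real files may trigger the problem in other positions too.

```python
import io

import pandas as pd

# Made-up sample: four physical lines of ID<TAB>text, no header row.
sample = (
    "u1\tfirst utterance\n"
    'u2\t"second utterance with an opening quote only\n'
    'u3\tthird utterance with a stray " mark\n'
    "u4\tfourth utterance\n"
)

# Default settings: the double quote is treated as a quoting character.
default_df = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["ID", "text"]
)

print(len(sample.splitlines()), "lines in the input")
print(len(default_df), "rows after parsing with default quoting")
print(default_df["ID"].tolist())
```

Comparing the number of input lines with the number of parsed rows is a cheap way to check whether a given file is affected.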

I worked around the issue by parsing the files without pandas, but it is important that users are aware of it. The code used to circumvent the issue:

import pandas as pd

texts = []
IDs = []

# `path` points to one of the tab-separated text files
with open(path, encoding="utf-8") as f:
    for line in f:
        # Remove the newline and any tabs at the end of the line
        line = line.rstrip("\t\n")

        # Remove any quotation marks
        line = line.replace('"', '')

        # Split into ID and text at the first tab only
        cur_ID, cur_text = line.split("\t", 1)
        texts.append(cur_text)
        IDs.append(cur_ID)

new_df = pd.DataFrame({"ID": IDs, "text": texts})

@matyaskopp
Collaborator

Maybe you could use an additional setting, something like this (I haven't tested it):

import pandas as pd
import csv

current_df = pd.read_csv(file, sep="\t", index_col=False, quoting=csv.QUOTE_NONE, escapechar=None)
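A small sketch for checking this suggestion on in-memory data rather than on the corpus files. The sample rows are made up, and the column names `ID` and `text`, the helper name `read_tsv_no_quoting`, and the absence of a header row are assumptions; drop `header=None` and `names` if the files do have a header line. With `quoting=csv.QUOTE_NONE`, pandas treats `"` as an ordinary character, so each line should stay a separate row and the quotation marks remain in the text.

```python
import csv
import io

import pandas as pd

# Made-up sample: three physical lines of ID<TAB>text, no header row.
sample = (
    'u1\tThe lobby"s are active.\n'
    'u2\t"A quote that is never closed.\n'
    "u3\tPlain text.\n"
)


def read_tsv_no_quoting(source):
    """Read ID<TAB>text lines without giving quotation marks any special meaning."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,              # assumption: no header row
        names=["ID", "text"],     # assumption: illustrative column names
        quoting=csv.QUOTE_NONE,   # treat " as a normal character
        dtype=str,
        keep_default_na=False,    # keep empty strings instead of NaN
    )


df = read_tsv_no_quoting(io.StringIO(sample))
print(df)
```

Unlike stripping every `"` before parsing, this keeps the transcript text unchanged.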

@TomazErjavec added this to the ParlaCAP milestone Apr 11, 2025
@TomazErjavec added the enhancement (New feature or request) label Apr 11, 2025
@TomazErjavec
Collaborator

Yes, here I can only second @matyaskopp's comment, as there is nothing we can do with the text: that's the way it is in the transcripts, and we don't change the text there. In fact, it would be difficult to change it so that this issue is fixed without affecting other text.
I won't close the issue yet, in case somebody has a better idea of what to do, such as pointing out this issue somewhere (but where?).

@TomazErjavec
Collaborator

There don't seem to be any new ideas on how to solve this, other than modifying the pandas import as @matyaskopp suggested.
So, closing this issue.

3 participants