Unclosed quotation marks in tab-separated text files lead to line merging when parsed with pandas #899

TajaKuzman · 2025-04-10T08:13:37Z

Parsing the tab-separated files (the text files in the .TXT directories of the ParlaMint 4.1 corpora) with pandas revealed a problem with unclosed double quotation marks, e.g., "lobby"s". After an unclosed quotation mark ("), the \t and \n characters are disregarded until the next quotation mark appears in the text, so pandas merges multiple lines into a single row.

Example: line 20 in file “ParlaMint-BE.txt/2022/ParlaMint-BE_2022-07-12-voorlopig-55-commissie-ic856x.txt” (lobby"s)

This issue was detected in the ParlaMint 4.1 datasets for BE, NL, SI, but I suspect it is present in other datasets as well.

The code used:

import pandas as pd

current_df = pd.read_csv(file, sep="\t", index_col=False)
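The merging can be reproduced on a small in-memory sample. The sketch below does not use corpus data: the IDs, the texts, and the header=None / names arguments are assumptions made for illustration, and the sample puts the unclosed quotation mark at the start of the text field, where the default quoting rules open a quoted field.

```python
import io

import pandas as pd

# Hypothetical four-line sample (not corpus data): the text of u2 opens a
# quotation mark that is only closed at the end of the u4 line.
sample = (
    "u1\tfirst utterance\n"
    'u2\t"unclosed quotation mark\n'
    "u3\tthird utterance\n"
    'u4\tfourth utterance"\n'
)

merged = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,           # assumption: the sample has no header row
    names=["ID", "text"],  # hypothetical column names
    index_col=False,
)

# Fewer rows than input lines: the lines after the opening quote were merged.
print(len(sample.splitlines()), "lines ->", len(merged), "rows")
print(merged["ID"].tolist())
```

Comparing the number of rows with the number of lines in the file is a cheap way to detect affected files.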

I solved this by not using the pandas library to parse the files, but it is important that users are aware of the issue. The code I used to circumvent it:

import pandas as pd

# path: location of one tab-separated text file
with open(path) as f:
    lines = f.readlines()

texts = []
IDs = []

for line in lines:
    # Remove a tab at the end of the line
    line = line.replace("\t\n", "\n")

    # Remove the newline character
    line = line.replace("\n", "")

    # Remove any quotation marks
    line = line.replace('"', '')

    # Split into ID and text at the first tab only
    cur_ID, cur_text = line.split("\t", 1)
    texts.append(cur_text)
    IDs.append(cur_ID)

new_df = pd.DataFrame({"ID": IDs, "text": texts})
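Note that the loop above deletes every quotation mark from the text. If the quotation marks should be kept, a similar result can be sketched with the standard-library csv module and quoting=csv.QUOTE_NONE. The function name read_id_text and the sample data are hypothetical, and the two-column ID/text layout is assumed from the loop above.

```python
import csv
import io


def read_id_text(handle):
    """Return (ID, text) pairs from a tab-separated stream, keeping quotes."""
    # QUOTE_NONE makes the reader treat '"' as an ordinary character
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        # A tab at the end of a line yields an empty trailing field; drop it
        while len(row) > 2 and row[-1] == "":
            row.pop()
        if len(row) != 2:
            raise ValueError(
                f"expected 2 fields on line {reader.line_num}, got {len(row)}"
            )
        pairs.append((row[0], row[1]))
    return pairs


# Hypothetical sample: a trailing tab on the first line, stray quotes elsewhere
sample = (
    'u1\tthe lobby"s view\t\n'
    'u2\t"unclosed quotation mark\n'
    "u3\tplain text\n"
)
pairs = read_id_text(io.StringIO(sample))
for cur_id, cur_text in pairs:
    print(cur_id, repr(cur_text))
```

For a real file, the handle would come from something like open(path, newline=""), and the pairs can be passed to pd.DataFrame(pairs, columns=["ID", "text"]).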

matyaskopp · 2025-04-10T08:41:23Z

Maybe you could use an additional setting, something like this (I haven't tested it):

import pandas as pd
import csv

current_df = pd.read_csv(file, sep="\t", index_col=False, quoting=csv.QUOTE_NONE, escapechar=None)
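A quick check of this suggestion on a small in-memory sample is sketched below. The sample rows, header=None, and the column names are assumptions for illustration, not corpus data; escapechar=None is the pandas default, so it is left out here.

```python
import csv
import io

import pandas as pd

# Hypothetical sample (not corpus data) with stray quotation marks
sample = (
    "u1\tfirst utterance\n"
    'u2\t"unclosed quotation mark\n'
    'u3\tthe lobby"s view\n'
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,             # assumption: the sample has no header row
    names=["ID", "text"],    # hypothetical column names
    index_col=False,
    quoting=csv.QUOTE_NONE,  # treat '"' as an ordinary character
)

# One row per input line; quotation marks stay in the text unchanged
print(len(df), "rows")
print(df["text"].tolist())
```

With quoting=csv.QUOTE_NONE the quotation marks are kept verbatim in the text column, so the transcripts are not altered on import.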

TomazErjavec · 2025-04-11T08:41:48Z

Yes, here I can only second @matyaskopp's comment, as there is nothing we can do with the text: that's the way it is in the transcripts, and we don't change the text there. In fact, it would be difficult to change it so that this issue is fixed without affecting other text.
I won't close the issue, in case somebody has a better idea of what to do, such as pointing out this issue somewhere (but where?).

TomazErjavec · 2025-05-16T07:56:16Z

There don't seem to be any new ideas on how to solve this, except to change how the files are imported into pandas, as @matyaskopp suggested.
So, closing this issue.

TomazErjavec added this to the ParlaCAP milestone Apr 11, 2025

TomazErjavec assigned TomazErjavec and matyaskopp Apr 11, 2025

TomazErjavec added the "enhancement" (New feature or request) label Apr 11, 2025

TomazErjavec closed this as completed May 16, 2025
