Bullet points are missing in the final extracted text #321

Open
@miguelwon

I found this issue while analysing the extracted output for the page Diffraction (ID: 8603).
The "Patterns" section of that page contains three bullet points:

  • The angular spacing of the features...
    ...

These bullet points are ignored and do not appear in the final cleaned text. I think this is because of the leading asterisk that marks each list item in the wikitext.
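To make the suspected cause concrete, here is a minimal sketch of how a line-based cleaner could lose list items. It is not WikiExtractor's actual code: `drop_list_lines` and `keep_list_lines` are hypothetical helpers, and the sample wikitext is a placeholder rather than the real article source.

```python
# Hypothetical illustration only; these helpers are NOT part of WikiExtractor.
# Placeholder wikitext: in MediaWiki markup, a bullet item starts with "*".
WIKITEXT = """\
==Patterns==
Intro sentence.
* The angular spacing of the features...
* (second item)
* (third item)
Closing sentence."""

LIST_MARKERS = ("*", "#", ";", ":")


def drop_list_lines(wikitext: str) -> str:
    """Mimic a cleaner that discards every line starting with list markup."""
    kept = [ln for ln in wikitext.splitlines() if not ln.startswith(LIST_MARKERS)]
    return "\n".join(kept)


def keep_list_lines(wikitext: str) -> str:
    """Strip the leading "*" markers but keep the item text as "- item"."""
    out = []
    for ln in wikitext.splitlines():
        if ln.startswith("*"):
            out.append("- " + ln.lstrip("*").strip())
        else:
            out.append(ln)
    return "\n".join(out)


if __name__ == "__main__":
    print(drop_list_lines(WIKITEXT))  # the three items are gone
    print(keep_list_lines(WIKITEXT))  # the three items survive as "- ..." lines
```

If the extractor behaves like the first helper, the three items under "Patterns" would disappear exactly as described above.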

To replicate:

I extracted the page with extractPage, saved its output to a new file containing only that page, and then ran WikiExtractor on that file:

python -m wikiextractor.extractPage --id 8603 enwiki-latest-pages-articles-multistream.xml.bz2

python -m wikiextractor.WikiExtractor page_8603.xml --json -o teste
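A small script can confirm whether the list items made it into the output. This is only a sketch: it assumes that `--json` writes one JSON object per line with `id` and `text` fields, and the output path `teste/AA/wiki_00` is an assumption about the directory layout. `missing_phrases` is a hypothetical helper name.

```python
import json
from typing import Iterable, List


def missing_phrases(json_lines: Iterable[str], page_id: str, phrases: List[str]) -> List[str]:
    """Return the phrases that do not occur in the extracted text of page_id.

    Assumes one JSON object per line with "id" and "text" fields.
    """
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        doc = json.loads(raw)
        if str(doc.get("id")) == page_id:
            text = doc.get("text", "")
            return [p for p in phrases if p not in text]
    raise KeyError(f"page {page_id} not found in the given lines")


# Usage sketch (the path is an assumption, adjust it to the actual output file):
# with open("teste/AA/wiki_00", encoding="utf-8") as fh:
#     print(missing_phrases(fh, "8603", ["The angular spacing of the features"]))
```

A non-empty result means the listed phrases are absent from the cleaned text.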
