Can't pass DTD resolvers to LxmlEventHandler #1130

Open
RobertBaruch opened this issue Mar 22, 2025 · 0 comments

Possibly related to #769.

Generated the models using xsdata generate voko/samples --config voko/samples/.xsdata.xml --output dataclasses.

Then ran abak.xml through basic parsing:

import pprint

from voko.samples.models import Vortaro
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig
from xsdata.formats.dataclass.parsers.handlers import LxmlEventHandler

if __name__ == '__main__':
    file_path = "voko/samples/abak.xml"
    config = ParserConfig(load_dtd=True)
    parser = XmlParser(config=config, handler=LxmlEventHandler)

    with open(file_path, "rb") as f:
        data = parser.parse(f, Vortaro)

    # Print the object parsed above instead of parsing the file a second time.
    pprint.pprint(data)

In abak.xml, there is this fragment:

  <snc mrk="abak.0o.kalkulilo">
    <uzo tip="fak">MAT</uzo>
    <fnt><bib>PV</bib></fnt>
    <dif>
      <ref tip="super" cel="kalkul.0ilo">Kalkulilo</ref>
      por plenumi aritmetikajn operaciojn,
      konsistanta &ccirc;efe el buletoj &scirc;oveblaj sur
      stangetoj:
      <ekz>
      ...
      </ekz>
      ...
    </dif>
    ...
  </snc>

The Dif here is parsed as:

Dif(lng=None,
    content=[Ref(tip=<RefTip.SUPER: 'super'>,
                 val=None,
                 lst=None,
                 cel='kalkul.0ilo',
                 content=['Kalkulilo',
                          '\n'
                          '      por plenumi aritmetikajn operaciojn,\n'
                          '      konsistanta ']),
             Ekz(mrk=None,
                 content=['\n\tvi mispu',
                          Tld(lit=None, var=None),
                          'o\n\t',
                          Fnt(content=[None,
                                       Bib(value='RugxDom'),
                                       Lok(content=[]),
                                       ';\n      '])]),
             Ekz(mrk=None,
                 content=['\n        en sia animo li restis ',
                          Tld(lit=None, var=None),
                          'o\n        ',
                          Fnt(content=[None,
                                       Vrk(content=[Url(ref='https://esperanto.mv.ru/TRIZ/pretere.html',
                                                        value='Ni iros '
                                                              'pretere ')]),
                                       '.\n      '])])])

There are two errors here, possibly related.

The first error is that only Kalkulilo is the content of the <ref>. The text that follows the closing </ref> belongs to the content of the <dif>, not the <ref>.

The second error is that the text from &ccirc; up to the <ekz> element is dropped entirely.

It should look something like this:

Dif(lng=None,
    content=[Ref(tip=<RefTip.SUPER: 'super'>,
                 val=None,
                 lst=None,
                 cel='kalkul.0ilo',
                 content=['Kalkulilo']),
              '\n',
              '      por plenumi aritmetikajn operaciojn,\n',
              '      konsistanta ĉefe el buletoj ŝoveblaj sur\n',
              '      stangetoj:\n',
              '      ',
              Ekz(mrk=None, .....

I think the problem here is that the lxml parser isn't using the DTD, because it can't resolve the relative system path given in the XML file's DOCTYPE:

<!DOCTYPE vortaro SYSTEM "../dtd/vokoxml.dtd">

Reading through the lxml docs, I found that you have to tell it how to resolve the DTD's URI:

>>> parser = etree.XMLParser(load_dtd=True)
>>> parser.resolvers.add( DTDResolver() )

But I don't see a way to pass a resolver instance to LxmlEventHandler or XmlParser. I'd expect it to be an option in ParserConfig.
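
For reference, here is a minimal sketch of the kind of resolver I mean, using plain lxml outside of xsdata. The class name DTDResolver follows the lxml docs; the inline DTD text and the one-element sample document are stand-ins I made up for illustration (the real vokoxml.dtd is much larger and would be read from disk), so treat this as an assumption-laden example rather than a fix.

```python
import io

from lxml import etree

# Stand-in for the real vokoxml.dtd: only the two entities used below.
# (Hypothetical content for illustration; the real DTD defines many more.)
SAMPLE_DTD = '<!ENTITY ccirc "&#x109;"><!ENTITY scirc "&#x15D;">'

# Tiny stand-in document using the same DOCTYPE system path as abak.xml.
SAMPLE_XML = (
    b'<!DOCTYPE dif SYSTEM "../dtd/vokoxml.dtd">'
    b"<dif>konsistanta &ccirc;efe el buletoj &scirc;oveblaj</dif>"
)


class DTDResolver(etree.Resolver):
    """Serve the DTD from memory instead of the unreachable relative path."""

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.endswith("vokoxml.dtd"):
            # A real resolver might use self.resolve_filename(...) here.
            return self.resolve_string(SAMPLE_DTD, context)
        # Returning None lets lxml fall back to its default lookup.
        return None


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(DTDResolver())

root = etree.parse(io.BytesIO(SAMPLE_XML), parser).getroot()
print(root.text)
```

With the resolver registered, the entities are expanded in the element text instead of being dropped. What is missing on the xsdata side is a hook to register something like this on the parser that LxmlEventHandler creates.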
