Invalid character in XML #3

Open
pnorman opened this issue Jan 12, 2016 · 7 comments
@pnorman

pnorman commented Jan 12, 2016

http://planet.osm.org/replication/changesets/001/659/607.osm.gz contains an invalid character value

$ curl -s http://planet.osm.org/replication/changesets/001/659/607.osm.gz | zcat | xmllint -
-:11: parser error : PCDATA invalid Char value 1
“beacon_special_purpose”, etc. Then you may specify its colour, shape, etc.
                                                                               ^

Viewing it with less shows line 11 is

        <text>The objects “light_minor” and “light_major” are simply lights without any details of the supporting structure. If you wish to specify the supporting structure, then you should use a seamark:type such as “beacon_special_purpose”, etc. Then you may specify its colour, shape, etc. ^AAlso, to specify a colour_pattern, there should be at least two colours.</text>

The ^A is highlighted as a control code.

The relevant part of the hexdump is

00000120  6f 75 72 2c 20 73 68 61  70 65 2c 20 65 74 63 2e  |our, shape, etc.|
00000130  20 01 41 6c 73 6f 2c 20  74 6f 20 73 70 65 63 69  | .Also, to speci|

which confirms there's a literal 0x01 byte in the document, not an escaped &#x01; character reference.
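For anyone who wants to locate such characters without reading a hexdump, here is a minimal sketch in Python. It assumes the file has already been decompressed and decoded to text, and it flags every character outside the XML 1.0 `Char` production. The name `find_invalid_chars` is hypothetical, not part of any OSM tool.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML10 = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_chars(text):
    """Return (line, column, codepoint) for each character XML 1.0 forbids.

    Lines and columns are 1-based. Lines are split on "\n" only, because
    str.splitlines() would also split on control characters such as \x0b
    and \x0c, which are exactly the ones we want to report.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML10.finditer(line):
            hits.append((lineno, match.start() + 1, ord(match.group())))
    return hits
```

Calling `find_invalid_chars` on the decoded contents of the replication file should report one hit on line 11 with codepoint 1, matching the xmllint error above.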

Cross-ref ToeBee/ChangesetMD#20

There is also probably a bug somewhere if this character got into the database

@pnorman
Author

pnorman commented Jan 12, 2016

The changeset is https://www.openstreetmap.org/changeset/36447235.

I'm downloading the dump to check how it was handled too.

@pnorman
Author

pnorman commented Jan 12, 2016

Whoops - the changeset is in the 160107 dump, but the discussion is still too new.

@pnorman
Author

pnorman commented Jan 14, 2016

$ bzcat discussions-160111.osm.bz2 | xmllint --noout --stream -
-:165428439: parser error : PCDATA invalid Char value 1
“beacon_special_purpose”, etc. Then you may specify its colour, shape, etc.
                                                                               ^
- : failed to parse

So, same problem with the dumps.

@zerebubuth
Owner

Fixed the issue with the planet dump in this commit, will do something similar for the changeset replication soon. While testing, I noticed that it's also bad from the API:

$ curl -s 'https://www.openstreetmap.org/api/0.6/changeset/36447235?include_discussion=true' | xmllint -
-:11: parser error : PCDATA invalid Char value 1
“beacon_special_purpose”, etc. Then you may specify its colour, shape, etc. 
                                                                               ^

I can understand why these characters aren't allowed raw in XML, but it's really quite annoying that they're not even allowed escaped... if they were, then it's likely that libxml would simply escape them rather than just letting them pass.
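Since the serializer passes these characters through, one option is to clean the text before it reaches the XML writer. The following is only a sketch of that idea in Python and is not necessarily what the planet dump commit does: it replaces each character that XML 1.0 forbids with U+FFFD and then escapes the markup characters. The function name `xml_safe_text` and the choice of U+FFFD as the replacement are assumptions.

```python
import re
from xml.sax.saxutils import escape

# Characters outside the XML 1.0 "Char" production.
_INVALID_XML10 = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def xml_safe_text(text, replacement="\uFFFD"):
    """Return text that is safe to place inside an XML 1.0 element.

    Forbidden characters are swapped for `replacement` first, then
    &, < and > are escaped by xml.sax.saxutils.escape.
    """
    cleaned = _INVALID_XML10.sub(replacement, text)
    return escape(cleaned)
```

Passing `replacement=""` drops the offending characters instead of marking them. Either way the original byte is lost, which is unavoidable given that XML 1.0 cannot represent it.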

@pnorman
Author

pnorman commented Jan 15, 2016

Oh - not allowed even as an entity? I guess my XML is rusty

@zerebubuth
Owner

Sadly, yes:

$ echo "<osm>&#x48;&#x49;&#x21;&#x01;</osm>" | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 1
<osm>&#x48;&#x49;&#x21;&#x01;</osm>
                             ^

As far as I can tell from the character set section of the spec it's simply impossible to represent these characters in XML, even with CDATA:

$ echo -e '<osm><![CDATA[HI!\x01]]></osm>' | xmllint -
-:1: parser error : CData section not finished
H
<osm><![CDATA[HI!�]]></osm>
                 ^
-:1: parser error : PCDATA invalid Char value 1
<osm><![CDATA[HI!�]]></osm>
                 ^
-:1: parser error : Sequence ']]>' not allowed in content
<osm><![CDATA[HI!�]]></osm>
                  ^
-:1: parser error : internal error: detected an error in element content

<osm><![CDATA[HI!�]]></osm>
                  ^
-:1: parser error : Extra content at the end of the document
<osm><![CDATA[HI!�]]></osm>
                  ^
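The same rejection can be reproduced outside xmllint. This is a small sketch using Python's standard `xml.etree.ElementTree` parser (expat-based rather than libxml2), fed the three forms tried above plus a valid control case. The helper name `is_well_formed` and the sample labels are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc):
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

samples = {
    # U+0001 written directly into the element content.
    "raw": "<osm>HI!\x01</osm>",
    # U+0001 written as a numeric character reference.
    "char_ref": "<osm>&#x48;&#x49;&#x21;&#x01;</osm>",
    # U+0001 inside a CDATA section.
    "cdata": "<osm><![CDATA[HI!\x01]]></osm>",
    # Control case: the same document without U+0001.
    "valid": "<osm>&#x48;&#x49;&#x21;</osm>",
}

results = {name: is_well_formed(doc) for name, doc in samples.items()}
```

Only the `valid` sample should parse, so the restriction comes from the XML 1.0 character set itself rather than from anything specific to libxml.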

@pnorman
Author

pnorman commented Jan 15, 2016

via irc: in XML 1.1 it's allowed to do &#001; but I don't recommend it (and you can't do &#000; at all in XML)

xmllint doesn't support 1.1, and I bet support elsewhere is spotty.

Can the 001/659/607 replication be manually fixed?

I opened openstreetmap/openstreetmap-website#1135 for the API
