fatal error on encountering UTF-8 letters outside ASCII #5

Open
matkoniecz opened this issue Apr 1, 2016 · 9 comments

@matkoniecz
Contributor

Example of synthetic input, based on real data, that causes the error:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (30408 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
 <bounds minlat="50.0530600" minlon="19.8482400" maxlat="50.0545200" maxlon="19.8524600"/>
 <node id="447039358" visible="true" version="1" changeset="1926268" timestamp="2009-07-24T16:08:01Z" user="sledzik1984" uid="58785" lat="50.0541494" lon="19.8488857"/>
 <way id="38042707" visible="true" version="4" changeset="31621045" timestamp="2015-05-31T22:38:07Z" user="dziabaducha" uid="775276">
  <nd ref="447039358"/>
  <nd ref="447039360"/>
  <tag k="highway" v="ą"/>
 </way>
</osm>

For comparison, the following input, which differs only by replacing "ą" with "footway", does not cause a crash:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (30408 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
 <bounds minlat="50.0530600" minlon="19.8482400" maxlat="50.0545200" maxlon="19.8524600"/>
 <node id="447039358" visible="true" version="1" changeset="1926268" timestamp="2009-07-24T16:08:01Z" user="sledzik1984" uid="58785" lat="50.0541494" lon="19.8488857"/>
 <way id="38042707" visible="true" version="4" changeset="31621045" timestamp="2015-05-31T22:38:07Z" user="dziabaducha" uid="775276">
  <nd ref="447039358"/>
  <nd ref="447039360"/>
  <tag k="highway" v="footway"/>
 </way>
</osm>

The first input (with "ą") results in:

./osm2xmap -i zoo.osm -s ISSOM_5000.omap 
Using files:
    * input OSM file       - zoo.osm
    * output XMAP file     - ./out.xmap
    * symbol set XMAP file - ISSOM_5000.omap
    * rules file           - ./rules.xml
Segmentation fault (core dumped)

Given that letters like żółćęśąźńŻÓŁĆĘŚĄŹŃ typically appear only in the name tag, which is not rendered on orienteering maps, a potential band-aid is to preprocess the input file and remove non-ASCII UTF-8 letters (obviously, a proper solution would also allow processing data with letters beyond ASCII).

Note that such letters may also appear in the user field.
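
A minimal sketch of that band-aid, written in Python purely for illustration (it is not part of osm2xmap, and the names strip_non_ascii and sanitize_osm_file are hypothetical). It assumes the input really is UTF-8 and simply drops every character outside ASCII, so affected tag values and user names lose those letters:

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def strip_non_ascii(text: str, replacement: str = "") -> str:
    """Return text with every non-ASCII character replaced (dropped by default)."""
    return NON_ASCII.sub(replacement, text)


def sanitize_osm_file(src_path: str, dst_path: str) -> int:
    """Write an ASCII-only copy of src_path to dst_path.

    Returns the number of characters that were removed.
    Assumption: src_path is valid UTF-8; a UnicodeDecodeError is raised otherwise.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    cleaned, removed = NON_ASCII.subn("", text)
    with open(dst_path, "w", encoding="ascii", newline="") as dst:
        dst.write(cleaned)
    return removed
```

The cleaned file can then be passed to osm2xmap with -i in place of the original. The encoding="UTF-8" declaration in the XML header can stay, since ASCII is a subset of UTF-8.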

@matkoniecz matkoniecz changed the title fatal error on encoutering UTF-8 letters outside ASCII fatal error on encountering UTF-8 letters outside ASCII Apr 1, 2016
@sembruk
Owner

sembruk commented Apr 2, 2016

<tag k="highway" v="ą"/>

I don't see any problems. It works.

$ ./osm2xmap -i utf8.osm -s /usr/share/openorienteering-mapper/symbol\ sets/5000/ISSOM_5000.omap 
Using files:
    * input OSM file       - utf8.osm
    * output XMAP file     - ./out.xmap
    * symbol set XMAP file - /usr/share/openorienteering-mapper/symbol sets/5000/ISSOM_5000.omap
    * rules file           - ./rules.xml
Using georeferencing:
    mapScale           0.200000
    declination        0.000000
    grivation          0.000000
    mapRefPoint        (0.000000, 0.000000)
    projectedRefPoint  (417700.873691, 5545244.388228)
    geographicRefPoint (19.850350, 50.053790)
    projectedCrsDesc   '+proj=utm +datum=WGS84 +zone=34'
    geographicCrsDesc  '+proj=latlong +datum=WGS84'
Loading rules 'ISOM2000 adapeted for cyclogaine'... 
WARNING: Symbol with code 401 didn't find
<...>
WARNING: Symbol with code 998 didn't find
WARNING: Symbol with code 998 didn't find
Ok
Converting nodes...
Ok
Converting ways...
WARNING: Node 447039360 didn't find
Ok
Converting relations...
Ok

Execution time: 0.000000 sec.

Maybe the problem is in your libroxml build?

@matkoniecz
Contributor Author

Maybe the problem is in your libroxml build?

Maybe. What is your libroxml version? I used the latest version from their git repository; now I will test the latest release (2.3.0).

@matkoniecz
Contributor Author

Or maybe there is some way to download a libroxml package (there is no obvious source, but...)?

@matkoniecz
Contributor Author

I tested with 2.3.0; the result is unchanged.

Final strace segment:

read(3, "", 4096)                       = 0
brk(0x9f68000)                          = 0x9f68000
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "<?xml version=\"1.0\" encoding=\"UT"..., 38) = 38
read(3, "<map xmlns=\"http://openorienteer"..., 4096) = 4096
_llseek(3, 4134, [4134], SEEK_SET)      = 0
open("./in.osm", O_RDONLY)              = 4
fstat64(4, {st_mode=S_IFREG|0600, st_size=2065419, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb76df000
read(4, "<?xml version=\"1.0\" encoding=\"UT"..., 4096) = 4096
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x39475} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)

For now I have no idea what else could be tested (except making sure we use the same libroxml).

@matkoniecz
Contributor Author

It is also possible that differences between our environments explain the different behavior. I am on 32-bit Ubuntu 14.04.4 LTS (Lubuntu distribution).

@matkoniecz
Contributor Author

Also, can you check whether the libroxml tests are failing for you (blunderer/libroxml#68)?

@kevinhendricks

FWIW, that XML file may not be properly UTF-8 encoded, as that character exists as a single byte in other encodings. Use a hex editor (not emacs or vim, as they guess the encoding) to look at that specific character's byte values.
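
If no hex editor is at hand, the same check can be sketched in a few lines of Python. This is only an illustration: the helper names non_ascii_runs and is_valid_utf8 are made up here, and the two single-byte encodings shown are just common candidates for Polish text, not a claim about what the file actually uses.

```python
def non_ascii_runs(data: bytes):
    """Yield (offset, run) for each maximal run of bytes >= 0x80 in data."""
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i
        elif start is not None:
            yield start, data[start:i]
            start = None
    if start is not None:
        yield start, data[start:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# How "ą" (U+0105) is stored in UTF-8 and in two single-byte encodings.
for encoding in ("utf-8", "iso-8859-2", "cp1250"):
    print(encoding, "ą".encode(encoding).hex())
# utf-8 c485
# iso-8859-2 b1
# cp1250 b9
```

Reading the file in binary mode (open(path, "rb").read()) and passing the bytes to non_ascii_runs shows the raw values without any editor guessing. In a properly encoded file the "ą" should appear as the two bytes c4 85.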

@matkoniecz
Contributor Author

UTF-8 is not supported by libroxml; see blunderer/libroxml#63 (comment).

A potential solution is to replace libroxml with something that handles more than ASCII, or to use a horrible workaround like:

a potential band-aid is to preprocess the input file and remove non-ASCII UTF-8 letters (obviously, a proper solution would also allow processing data with letters beyond ASCII).
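
As a cross-check that the input itself is fine and the problem sits on the parser side, a trimmed version of the sample can be fed to a parser known to handle UTF-8. The sketch below uses Python's standard xml.etree.ElementTree only for that purpose; it says nothing about which C or C++ library osm2xmap should switch to, and way_tags is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the crashing sample, encoded as UTF-8 bytes.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
 <way id="38042707">
  <nd ref="447039358"/>
  <tag k="highway" v="ą"/>
 </way>
</osm>
""".encode("utf-8")


def way_tags(osm_bytes: bytes) -> dict:
    """Map (way id, tag key) to tag value for every way in the OSM XML."""
    root = ET.fromstring(osm_bytes)
    return {
        (way.get("id"), tag.get("k")): tag.get("v")
        for way in root.iter("way")
        for tag in way.iter("tag")
    }


print(way_tags(SAMPLE))
# {('38042707', 'highway'): 'ą'}
```

The non-ASCII value survives the round trip here, which is consistent with the input being well-formed UTF-8 XML.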

@sembruk
Owner

sembruk commented Nov 25, 2016

Potential solution is to replace libroxml by something that works on more than ASCII

In TODO list.

@sembruk sembruk modified the milestones: 3.0, v3.0 Nov 25, 2016