I was recently parsing a 43.3 MB SVG file filled with polygons, and it was taking a long time (around 3.6 seconds). I figured I was I/O bound and, for kicks, decided to see how long it took to simply count the XML nodes in the file using System.Xml.XmlReader. It completed in only half a second, about seven times faster. So I wasn’t I/O bound but CPU bound. What could be taking so long?
It turns out it was the parsing of the points that took so long, in particular converting text to numbers. I was using TryParse with InvariantCulture, so I decided to try writing my own parser based on the specification. It was well worth the effort: the time to parse the file dropped to 1.7 seconds, a little under half the time of the original method.
The point of this post isn’t to complain about the performance of the TryParse method (it handles a wide variety of inputs, whereas my parser is specialized), but rather to share what I found out when reading the specification. Take a look at the following polygons; they all draw the same triangle:
<polygon points="0, 0 -10, 0 -5, -10" />
<polygon points="0,0 -10,0 -5,-10" />
<polygon points="0,0,-10,0,-5,-10" />
<polygon points="0 0 -10 0 -5 -10" />
<polygon points="0 0-10 0-5-10" />
<polygon points="0-0-10-0-5-10" />
I was quite surprised that you can join the negative numbers together like that, but it works: the minus sign cannot continue the previous number, so it must start the next one. The format is particularly well suited to the C runtime function strtod, which parses as much of the input as it can and returns the parsed value plus how much of the string was consumed. Unfortunately there isn’t a similar function for .NET, where you can only parse the whole string.
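To make the strtod behaviour concrete, here is a minimal sketch of how it could be used to split a points attribute into numbers. The function name parse_points is hypothetical and is not the parser described in this post. It also assumes the default "C" locale, and strtod is more lenient than the SVG grammar (it accepts hexadecimal, "inf" and "nan", for example), so treat it as an illustration rather than a conforming implementation.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): extracts every number
// from an SVG "points" attribute by letting strtod consume as much of the
// input as it can on each call.
std::vector<double> parse_points(const std::string& text)
{
    std::vector<double> values;
    const char* cursor = text.c_str();

    while (*cursor != '\0')
    {
        char* end = nullptr;
        double value = std::strtod(cursor, &end);

        if (end == cursor)
        {
            // No conversion was possible at this position, so this is a
            // separator (such as a comma or trailing space): skip one char.
            ++cursor;
            continue;
        }

        values.push_back(value);
        cursor = end; // resume right after the characters strtod consumed
    }

    return values;
}
```

With this sketch, all six polygons above should yield the same six values, because strtod stops at the next minus sign, comma or space and reports where it stopped through the end pointer.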
Here’s a quick implementation in C++ (note that it’s not very idiomatic C++, as it doesn’t use iterators, but that will make converting it to C# easier):