Saturday, April 23, 2005 6:45 PM
Olaf Conijn
Write better regular expressions using Unicode character classes
I use regular expressions pretty often, mostly to validate user input.
But let’s say you wrote a regular expression to filter content from html documents.
In a scenario like this it is pretty easy to resort to using character classes like \w for word characters or \d for digits.
Nothing wrong with character classes that help readability, is there?
Well, after reading this blog post from blogs.msdn I could have kicked myself, because there is. In many regular expression engines \w matches only the ASCII set [a-zA-Z0-9_]. A word like ‘façade’ (though it is in the English dictionary) doesn’t consist solely of \w (word) characters there, because ‘ç’ falls outside that set.
This is easy to fix by using Unicode character classes in your regular expressions, for example \p{L}, which matches a letter in any script. A list of character classes can be found here.
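To make the difference concrete, here is a minimal sketch in Java, chosen because its \w is ASCII-only unless you ask otherwise. The class name WordCheck and the two helper methods are made up for this illustration, and other engines may treat \w differently, so check the documentation of the one you use.

```java
import java.util.regex.Pattern;

public class WordCheck {
    // Without Pattern.UNICODE_CHARACTER_CLASS, Java treats \w as [a-zA-Z_0-9].
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // \p{L} matches a letter in any script; \p{M} matches combining marks,
    // so accented letters stored in decomposed form are accepted as well.
    private static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{M}]+");

    // Hypothetical helper: true if the whole string consists of ASCII word characters.
    public static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    // Hypothetical helper: true if the whole string consists of Unicode letters and marks.
    public static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The c with cedilla is written as an escape so the source file stays plain ASCII.
        String word = "fa\u00E7ade";
        System.out.println(isAsciiWord(word));   // false
        System.out.println(isUnicodeWord(word)); // true
    }
}
```

Note that [\p{L}\p{M}] is narrower than \w in one respect: it leaves out digits and the underscore. If you need those, add \p{Nd} and _ to the class.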