Tuesday, 27 August 2013

Adding whitespace handling to existing Java regex

A long time ago I wrote a method called detectBadChars(String) that
inspects the String argument for instances of so-called "bad" characters.
The original list of bad characters was:
'~'
'#'
'@'
'*'
'+'
'%'
My method, which works great, is:
// Detects the existence of bad chars in a string and returns the
// bad chars that were found.
protected String detectBadChars(String text) {
    Pattern pattern = Pattern.compile("[~#@*+%]");
    Matcher matcher = pattern.matcher(text);
    StringBuilder violatorsBuilder = new StringBuilder();
    if (matcher.find()) {
        String group = matcher.group();
        if (!violatorsBuilder.toString().contains(group))
            violatorsBuilder.append(group);
    }
    return violatorsBuilder.toString();
}
The business logic has now changed, and the following are now also
considered to be bad:
Carriage returns (\r)
New lines (\n)
Tabs (\t)
Any consecutive whitespace (two or more spaces in a row)
So I am trying to modify the regex to accommodate the new bad characters.
Changing the regex to:
Pattern pattern = Pattern.compile("[~#@*+%\n\t\r[ ]+]");
...throws exceptions. My thinking was that adding "\n\t\r" to the regex
would allow for newlines, tabs and CRs respectively, and that adding
"[ ]+" would add a new "class/group" consisting of whitespace and then
quantify that group as allowing 1+ of those whitespace characters,
effectively taking care of consecutive whitespace.
Where am I going awry, and what should my regex be (and why)? Thanks in
advance!
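
For reference, here is a minimal sketch of one way the pattern could be restructured. It is not a confirmed fix for the exception described above. It rests on two facts about java.util.regex: inside a character class a "+" is a literal character rather than a quantifier, and a nested "[" starts a nested class (a union) rather than a group. The sketch therefore keeps the single bad characters in one class and moves the "two or more spaces" rule into its own alternative. The class name BadCharDetector, the static method, and the switch from "if" to "while" (so that every match is collected, not just the first) are assumptions made for illustration, as is the reading of "consecutive whitespace" as runs of plain spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only detectBadChars comes from the question.
class BadCharDetector {
    // Single bad characters live in one character class. A run of two or
    // more spaces is a separate alternative, because a quantifier written
    // inside [...] would be treated as a literal character.
    private static final Pattern BAD =
            Pattern.compile("[~#@*+%\\n\\t\\r]| {2,}");

    static String detectBadChars(String text) {
        Matcher matcher = BAD.matcher(text);
        StringBuilder violators = new StringBuilder();
        // Assumption: loop over every match instead of checking only the first.
        while (matcher.find()) {
            String group = matcher.group();
            // Record each distinct violation once.
            if (violators.indexOf(group) < 0) {
                violators.append(group);
            }
        }
        return violators.toString();
    }
}
```

With this sketch a single space between words is not reported, while a tab, a newline, a carriage return, or a run of two or more spaces is. If other whitespace characters should also count toward "consecutive whitespace", the second alternative could be widened to \\s{2,} instead.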
