Regular Expressions in Java

Package com.stevesoft.pat version 1.5.3


Tutorial Part 5

Replacing Text

Up until now we have focused on how to write a pattern to match something. However, a Regex can also be used to replace text. Here is an example of a Regex being used to change "foo" to "bar".

Regex r = new Regex("foo","bar");
System.out.println(r.replaceFirst("foo and foo again!"));
// prints "bar and foo again!"
System.out.println(r.replaceAll("foo and foo again!"));
// prints "bar and bar again!"
The second argument to the constructor of Regex is the replacement rule. Like the pattern itself, the replacement rule has some special syntax. The sequence "$&" refers to the current match. So, to put square brackets around either the word "foo" or "bar" all we need to do is this:
Regex r = new Regex("(?:foo|bar)","[$&]");
System.out.println(r.replaceAll("foo or bar"));
// prints "[foo] or [bar]"
In the replacement rule, the square brackets are just literal text with no special meaning. All the special sequences in a replacement rule begin with either a $ or a \, unlike patterns, which have a wider variety of special characters.

Note that the replacement rule would work the same way if we had written it as "[${&}]", or "[$MATCH]", or even "[${MATCH}]". Putting the {}'s in allows you to specify exactly which characters make up the name used in the replacement rule, and one is allowed to use "MATCH" instead of "&" simply because some people think that an English word is easier to read than a symbol like "&". (I can't think why)
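
For readers who also work with the JDK's built-in java.util.regex package, here is a rough comparison sketch of the same two ideas. It does not use package pat at all: the class and method names are made up for illustration, the replacement text is passed to the replace call rather than to a constructor, and "$0" is the java.util.regex spelling of the whole match.

```java
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
// Class and method names are hypothetical.
class ReplaceBasics {
    // Replace only the first "foo" with "bar".
    static String replaceFirstFoo(String input) {
        return Pattern.compile("foo").matcher(input).replaceFirst("bar");
    }

    // Replace every "foo" with "bar".
    static String replaceAllFoo(String input) {
        return Pattern.compile("foo").matcher(input).replaceAll("bar");
    }

    // "$0" stands for the whole match here, playing the role of "$&".
    static String bracket(String input) {
        return Pattern.compile("(?:foo|bar)").matcher(input).replaceAll("[$0]");
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstFoo("foo and foo again!")); // prints "bar and foo again!"
        System.out.println(replaceAllFoo("foo and foo again!"));   // prints "bar and bar again!"
        System.out.println(bracket("foo or bar"));                 // prints "[foo] or [bar]"
    }
}
```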

The next trick you might be interested in learning is how to refer to a backreference in a replacement rule. The following rule makes sure that there are white spaces around a "+" sign.

Regex r = new Regex("(\\S)\\+(\\S)","${1} + ${2}");
System.out.println(r.replaceAll("3+4=7, 2+5=7, 1 + 6=7"));
// prints "3 + 4=7, 2 + 5=7, 1 + 6=7"
The pattern "\\S", as you may recall, matches anything that is not white space. Thus, the pattern will match inside the String two times. The first time it matches on "3+4": the first backreference is "3", and the second backreference is "4". The second time it matches on "2+5", with "2" in the first backreference and "5" in the second. The text "1 + 6" is left alone because its "+" already has white space on both sides. Note: Instead of "${1}" one can use "$1" or "\\1" to refer to the backreference.
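
The same spacing rule can be sketched with java.util.regex, where numbered groups are written "$1" and "$2" in the replacement. This is only a comparison under that assumption, not package pat code, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
class PlusSpacing {
    // A "+" with a non-white-space character directly on each side.
    private static final Pattern TIGHT_PLUS = Pattern.compile("(\\S)\\+(\\S)");

    // Re-emit the two captured neighbours with spaces around the "+".
    static String spacePlus(String input) {
        return TIGHT_PLUS.matcher(input).replaceAll("$1 + $2");
    }

    public static void main(String[] args) {
        System.out.println(spacePlus("3+4=7, 2+5=7, 1 + 6=7"));
        // prints "3 + 4=7, 2 + 5=7, 1 + 6=7"
    }
}
```

One limitation of this sketch: each match consumes both neighbours, so in "1+2+3" the "2" is used up by the first match and the second "+" is left untouched, giving "1 + 2+3".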

Probably less interesting, but still quite useful, is the use of "$`" or "$PREMATCH" to refer to the part of the String to the left of a match. Likewise, "$'" or "$POSTMATCH" can be used in a replacement rule to refer to the portion of the String to the right of a match. In the next example we use these rules to reverse the order of words in a String.

Regex r = new Regex("\\s+and\\s+","$POSTMATCH and $PREMATCH");
System.out.println(r.replaceAll("foo  and     bar"));
// prints "bar and foo"
As you will remember, "\\s" matches on a white space (i.e. space, tab, carriage return, or line feed characters), and "\\s+" matches on one or more white space characters.
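
The replacement syntax of java.util.regex has no counterpart to "$`" or "$'", so a comparison sketch has to pick the two pieces out by hand using the match offsets. The code below is one way to do that under this assumption. It looks only at the first match and assembles the swapped string itself, and its names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
class SwapAroundAnd {
    private static final Pattern AND = Pattern.compile("\\s+and\\s+");

    // Swap the text before and after the first "and"; return the input
    // unchanged when there is no match.
    static String swap(String input) {
        Matcher m = AND.matcher(input);
        if (!m.find()) {
            return input;
        }
        String pre = input.substring(0, m.start()); // text left of the match
        String post = input.substring(m.end());     // text right of the match
        return post + " and " + pre;
    }

    public static void main(String[] args) {
        System.out.println(swap("foo  and     bar")); // prints "bar and foo"
    }
}
```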

Another point of interest concerns the sequences "\\U", "\\L", "\\u", "\\l", "\\Q", and "\\E". All characters are upper case after the \U, all are lower case after the \L, and all non-alphanumeric characters are quoted after \Q. The \E flag puts everything back to normal.

Here's an example of how you can make words 2 or more letters in length upper case.

Regex r = new Regex("\\w{2,}","\\U$&");
System.out.println(r.replaceAll("a foo and a bar"));
// Prints a FOO AND a BAR
Here's a silly modification that uses \E:
Regex r = new Regex("\\w{2,}","\\U$&\\E$&");
System.out.println(r.replaceAll("a foo and a bar"));
// Prints a FOOfoo ANDand a BARbar
Now, let's consider the effects of \u and \l. These cause the next letter to be upper or lower case respectively, and they override \U and \L. Thus:
Regex r = new Regex("\\w{2,}","\\L\\u$&");
System.out.println(r.replaceAll("a foo and a BAR"));
// Prints a Foo And a Bar
This last replacement rule capitalizes a word.
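
The replacement strings of java.util.regex have no case-changing sequences like these, so the closest comparison is to change case in ordinary Java code while looping over the matches. The sketch below does that with Matcher.appendReplacement. It is an illustration with hypothetical names, not package pat code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
class CaseRules {
    // Words of two or more word characters, as in the examples above.
    private static final Pattern WORD = Pattern.compile("\\w{2,}");

    // Rough counterpart of the rule "\\U$&": upper-case each match.
    static String upperWords(String input) {
        Matcher m = WORD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String upper = m.group().toUpperCase(Locale.ROOT);
            // quoteReplacement keeps any "$" or "\" in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(upper));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Rough counterpart of the rule "\\L\\u$&": capitalize each match.
    static String capitalizeWords(String input) {
        Matcher m = WORD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String w = m.group();
            String cap = w.substring(0, 1).toUpperCase(Locale.ROOT)
                       + w.substring(1).toLowerCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(cap));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperWords("a foo and a bar"));      // prints "a FOO AND a BAR"
        System.out.println(capitalizeWords("a foo and a BAR")); // prints "a Foo And a Bar"
    }
}
```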

Note that the patterns ^ and $ are affected by the m flag. If the m flag is turned on (include "(?m)" at the start of the pattern), then we are in "line mode" and ^ and $ will detect the beginning and end of each line, not just of the entire string.

Regex r = null;
 
// m flag on
r = new Regex("(?m)^","[start]");
System.out.println(r.replaceAll("a\nb\nc"));
/* Prints:
  [start]a
  [start]b
  [start]c
  */
 
// m flag off
r = new Regex("^","[start]");
System.out.println(r.replaceAll("a\nb\nc"));
/* Prints:
  [start]a
  b
  c
  */
 
// m flag on
r = new Regex("(?m)$","[end]");
System.out.println(r.replaceAll("a\nb\nc"));
/* Prints:
  a[end]
  b[end]
  c[end]
  */
 
// m flag off
r = new Regex("$","[end]");
System.out.println(r.replaceAll("a\nb\nc"));
/* Prints:
  a
  b
  c[end]
  */
The patterns "\Z" and "\A" are unaffected. They will always match the end and beginning of the string, respectively.
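
As a point of comparison, java.util.regex treats these anchors the same way for this input. The sketch below assumes the JDK package rather than package pat, uses Pattern.MULTILINE as the equivalent of writing "(?m)" in the pattern, and has hypothetical class and method names.

```java
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
class LineAnchors {
    // Insert "[start]" wherever "^" matches, with line mode on or off.
    static String markStarts(String input, boolean lineMode) {
        int flags = lineMode ? Pattern.MULTILINE : 0;
        return Pattern.compile("^", flags).matcher(input).replaceAll("[start]");
    }

    // Insert "[end]" wherever "$" matches, with line mode on or off.
    static String markEnds(String input, boolean lineMode) {
        int flags = lineMode ? Pattern.MULTILINE : 0;
        return Pattern.compile("$", flags).matcher(input).replaceAll("[end]");
    }

    // "\A" ignores line mode: it only matches at the start of the input.
    static String markInputStart(String input) {
        return Pattern.compile("\\A", Pattern.MULTILINE).matcher(input).replaceAll("[start]");
    }

    public static void main(String[] args) {
        String text = "a\nb\nc";
        System.out.println(markStarts(text, true));  // "[start]" before a, b and c
        System.out.println(markStarts(text, false)); // "[start]" before a only
        System.out.println(markEnds(text, true));    // "[end]" after a, b and c
        System.out.println(markEnds(text, false));   // "[end]" after c only
        System.out.println(markInputStart(text));    // "[start]" before a only
    }
}
```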

One other thing you can do in Perl 5 is to allow a subroutine to process your substitutions. For those of you who know Perl, I'm referring to code like the following:

       $x = "Some numbers: 49 36 2";
       $x =~ s/\d+/sqrt($&)/eg;
       print $x,"\n";
The output from this Perl code is:
       Some numbers: 7 6 1.4142135623731
The "e" flag allows you to use a function (in this case sqrt) to perform the substitution rule. Package pat does not support the "e" flag, for that would entail writing the entire Perl language in Java and not just doing regular expression matching. However, what it does do is allow you to have a Java subroutine handle the substitution. The example file fancy.java illustrates how this can be accomplished.
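
To give a feel for the general idea of a Java subroutine computing each replacement, here is a minimal sketch written against java.util.regex. It is not the mechanism used in fancy.java and it does not use package pat. The class name, the replaceEach helper and the sqrtText rule are all hypothetical.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Comparison sketch using java.util.regex, NOT com.stevesoft.pat.
class CallbackReplace {
    // Replace every match of p in input with whatever rule returns for it.
    static String replaceEach(Pattern p, String input, Function<String, String> rule) {
        Matcher m = p.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps "$" and "\" in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(rule.apply(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Example rule: the square root of a run of digits, as text.
    // Whole-number roots are printed without a trailing ".0".
    static String sqrtText(String digits) {
        double root = Math.sqrt(Double.parseDouble(digits));
        return root == Math.rint(root) ? Long.toString((long) root) : Double.toString(root);
    }

    public static void main(String[] args) {
        Pattern number = Pattern.compile("\\d+");
        System.out.println(replaceEach(number, "Some numbers: 49 36 2", CallbackReplace::sqrtText));
    }
}
```

The last digits printed for the square root of 2 may differ from the Perl output above, since Java formats doubles differently.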