Regular Expressions in Java

Package com.stevesoft.pat version 1.5.3

Tutorial Part 1

Basic Pattern Elements:

[], {}, *, +, ?, (?i)

\w, \s, \d, \W, \S, \D



Regular expressions are a valuable tool by which one can process text. The most basic regular expression is a literal text string (all the alphabetic and numeric characters will be interpreted literally). We can find the word "shells" in a String as follows:
// create a Regex object
Regex r = new Regex("shells");
 
// search for a match within a string
r.search("She sells sea shells by the sea shore.");
 
System.out.println(""+r.didMatch());
// Prints "true" -- r.didMatch() is a boolean function
// that tells us whether the last search was successful
// in finding a pattern.
 
System.out.println(r.stringMatched());
// Prints "shells" -- the part of the String that
// matched during the previous search.
 
System.out.println(r.left());
// Prints "She sells sea " -- the part of the String
// that is to the left of the matching text.
 
System.out.println(r.right());
// Prints " by the sea shore." -- the part of the
// String that is to the right of the matching text.
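For comparison, here is a minimal sketch of the same search using the standard java.util.regex classes rather than com.stevesoft.pat. The class name BasicSearch and the helper split are hypothetical names chosen for this illustration. Since Matcher has no direct counterpart to left() and right(), the sketch rebuilds those pieces with substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; it uses java.util.regex, not com.stevesoft.pat.
class BasicSearch {
    // Returns {left, matched, right} for the first match, or null if none.
    static String[] split(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            text.substring(0, m.start()), // text to the left of the match
            m.group(),                    // the matched text itself
            text.substring(m.end())       // text to the right of the match
        };
    }

    public static void main(String[] args) {
        String[] parts = split("shells", "She sells sea shells by the sea shore.");
        System.out.println(parts[0]); // "She sells sea "
        System.out.println(parts[1]); // "shells"
        System.out.println(parts[2]); // " by the sea shore."
    }
}
```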

However, the above bit of code does not match if we encounter the substring "SHELLS" rather than "shells".

r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
System.out.println(""+r.didMatch());
// Prints "false"
We can fix this by changing our pattern.
r = new Regex("(?i)shells");
r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
System.out.println(""+r.didMatch());
// Prints "true"
System.out.println(r.stringMatched());
// Prints "SHELLS"
The "(?i)" tells the pattern to ignore the case of all letters. This may be more than you want. Suppose you only want to ignore the case of the first character of the word, so that "Shells" or "shells" matches but "SHELLS" does not.
r = new Regex("[Ss]hells");
r.search("SHELLS Shells shells");
System.out.println(r.stringMatched());
// Prints "Shells"
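The three patterns seen so far can be compared side by side on one input. This is a sketch written against java.util.regex (an assumption made so that it runs without the package installed); CaseDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class CaseDemo {
    // Returns the first substring of text that matches pattern, or null.
    static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "SHELLS Shells shells";
        System.out.println(firstMatch("shells", text));     // shells
        System.out.println(firstMatch("(?i)shells", text)); // SHELLS
        System.out.println(firstMatch("[Ss]hells", text));  // Shells
    }
}
```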
When Regex sees square brackets, it understands that you want to match one of the characters inside them. Thus, "[Ss]" matches either "S" or "s". This type of pattern, however, has more uses than simply matching two cases of a letter. You can, for example, use it to match a digit. The pattern "[0123456789]" does this.
Regex r = new Regex("[0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "3"
This might not really be what we want. We might want to get the number "35" rather than just the "3". The pattern in the square brackets can only match one character. We could simply repeat it, like so:
Regex r = new Regex("[0123456789][0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"
However, this pattern is not very flexible. It does not match on a String with just one digit.
r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "null" because no match occurred.
It also fails to match longer integers in full; it stops after two digits:
r.search("When were you born?  In 1963");
System.out.println(r.stringMatched());
// Prints "19"
If we want something to match one, two, three, or four digits we can use a new pattern element.
Regex r = new Regex("[0123456789]{1,4}");
 
r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "8"
 
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"
 
r.search("When were you born? In 1963.");
System.out.println(r.stringMatched());
// Prints "1963"
It is important to notice that "{1,4}" is hungry (greedy). That is, it matches as many times as it can, up to its maximum.

It may be that we don't want to specify a maximum number of characters to match. Perhaps we just want to match one or more digits. We can do this by omitting the second number in the {} pattern element.

r = new Regex("[0123456789]{1,}");
r.search("What's your favorite number? It's 979834743.");
System.out.println(r.stringMatched());
// Prints "979834743"
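The difference between a bounded and an unbounded repeat count shows up clearly when every match in a string is collected. The sketch below assumes java.util.regex rather than com.stevesoft.pat; RepeatDemo and allMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class RepeatDemo {
    // Collects every non-overlapping match of pattern in text, left to right.
    static List<String> allMatches(String pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "It's 979834743.";
        // Each match greedily takes up to four digits, then the search resumes.
        System.out.println(allMatches("[0-9]{1,4}", text)); // [9798, 3474, 3]
        // With no upper bound, the whole run of digits is one match.
        System.out.println(allMatches("[0-9]{1,}", text));  // [979834743]
    }
}
```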
It may have occurred to you that typing out the sequence of digits "[0123456789]" is a little awkward. Imagine if we wanted to match all the letters of the alphabet; we would have to type a rather long string indeed. Fortunately, there is a shorter way to write this. We can specify ranges of letters and numbers. Thus "[0-9]" matches any digit, that is, any character in the range from 0 to 9. We can use "[a-z]" to match any lower case letter and "[A-Z]" to match any upper case letter, or we can use "[A-Za-z0-9]" to match a character that is either an upper case letter, a lower case letter, or a digit.
Regex r = new Regex("[A-Z][a-z]{1,}");
// Matches an upper case letter, followed by one or
// more lower case letters.
r.search("What is your name?  My name is Fred.");
System.out.println(r.stringMatched());
// Prints "What"
Hmm. I was really hoping to match "Fred" not "What". So, I will just rewrite my pattern.
Regex r = new Regex("[A-VX-Z][a-z]{1,}");
// Matches an upper case letter (excluding W),
// followed by one or more lower case letters.
r.search("What is your name?  My name is Fred.");
System.out.println(r.stringMatched());
// Prints "My"
Hmm. Still not what I wanted. Let's change the pattern again.
Regex r = new Regex("[A-VX-Z][a-z]{2,}");
// Matches an upper case letter (excluding W),
// followed by two or more lower case letters.
r.search("What is your name?  My name is Fred.");
System.out.println(r.stringMatched());
// Prints "Fred" -- "My" does not match because
// the pattern now requires one capital letter
// (that isn't a "W") and at least two lower case
// letters.
Finally, I matched the piece of text I wanted. We could also have matched using
Regex r = new Regex("[A-VX-Z][^ ]{2,}");
When a "^" appears as the first character inside square brackets, it negates the pattern. Thus "[^ ]" matches any character other than a space (" "), and "[^0-9]" matches any character that is not a digit. Note that because "[^ ]" also accepts punctuation, this last pattern matches "Fred." with the trailing period included.
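To see how a range and a negated set differ on the same sentence, the two patterns can be run next to each other. This is a sketch using java.util.regex as a stand-in for com.stevesoft.pat; NegationDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class NegationDemo {
    // Returns the first substring of text that matches pattern, or null.
    static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "What is your name?  My name is Fred.";
        // Lower case letters only after the capital: stops before the period.
        System.out.println(firstMatch("[A-VX-Z][a-z]{2,}", text)); // Fred
        // Anything except a space after the capital: the period is included.
        System.out.println(firstMatch("[A-VX-Z][^ ]{2,}", text));  // Fred.
    }
}
```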

You may be wondering, at this point, if it is possible to match against something like "[0-9]" as literal text and not as a digit. The answer, of course, is yes. Preceding a non-alphanumeric character with a "\\" (note, this is really only one backslash, but the Java compiler interprets two backslashes as one when they appear inside quotes) causes Regex to interpret it as literal text. (Note: Putting a backslash before an alphanumeric character often makes it a special pattern character instead of a literal.)

Regex r = new Regex("\\[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "[0-9]"
r = new Regex("[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "0"
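The effect of the backslash can be sketched with java.util.regex (used here as an assumption, in place of com.stevesoft.pat). The example also shows Pattern.quote, a java.util.regex convenience that escapes a whole string at once; nothing here implies that com.stevesoft.pat offers the same method. EscapeDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class EscapeDemo {
    // Returns the first substring of text that matches pattern, or null.
    static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "the pattern is [0-9]";
        // Unescaped: a character class that matches a single digit.
        System.out.println(firstMatch("[0-9]", text));                // 0
        // Escaped brackets: matched as literal text.
        System.out.println(firstMatch("\\[0-9\\]", text));            // [0-9]
        // Pattern.quote escapes the entire string for us.
        System.out.println(firstMatch(Pattern.quote("[0-9]"), text)); // [0-9]
    }
}
```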
Now, for a few bits of very useful shorthand. You will want to be familiar with them.
Regex r1=new Regex("\\w");
// the same as "[0-9A-Za-z_]" -- roughly the
// characters allowed in a Java variable name.
Regex r2=new Regex("\\w+");
// the same as "\\w{1,}"
Regex r3=new Regex("\\w?");
// the same as "\\w{0,1}"
Regex r4=new Regex("\\w*");
// the same as "\\w{0,}"
Regex r5=new Regex("\\w{5}");
// the same as "\\w{5,5}"
Regex r6=new Regex("\\s");
// the same as "[ \b\t\n\r]" -- these
// are referred to as white space characters.
Regex r7=new Regex("\\d");
// the same as "[0-9]" -- a digit
Regex r8=new Regex("\\W");
// the same as "[^A-Za-z0-9_]" -- any character
// that is not a word character.
Regex r9=new Regex("\\D");
// the same as "[^0-9]" -- not a digit
Regex r10=new Regex("\\S");
// the same as "[^ \b\r\t\n]"
Regex r11=new Regex(".");
// the same as "[^\n]".  In most cases, this
// serves the purpose of matching any character.
// The pattern ".*" is a popular way
// to match arbitrary regions of text.
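A few of these shorthands in action, as a sketch. It assumes java.util.regex, where \w, \d, and \s behave much as described above by default, and it is not a statement about the exact character sets used by com.stevesoft.pat. ShorthandDemo and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class ShorthandDemo {
    // Collects every non-overlapping match of pattern in text.
    static List<String> allMatches(String pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Replaces each run of white space with a single space.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String code = "int x_1 = 42;";
        System.out.println(allMatches("\\w+", code));          // [int, x_1, 42]
        System.out.println(allMatches("\\d+", code));          // [1, 42]
        System.out.println(collapseWhitespace("a  b\tc\n d")); // a b c d
    }
}
```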
Since "." doesn't match every character (it skips the newline), what does? Well, "." can match anything if the s flag is enabled. To enable the s flag, include the string "(?s)" at the front of your pattern.
Regex r12=new Regex("(?s).");
// will match any character
Regex r13=new Regex("(?s)foo:.");
// matches on the string "foo:"
// followed by any character.
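The effect of the s flag is easiest to see when the character after "foo:" is a newline. The sketch below assumes java.util.regex, which also honors "(?s)"; note that in java.util.regex the default "." skips a few other line terminators besides "\n". DotAllDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class DotAllDemo {
    // Returns the first substring of text that matches pattern, or null.
    static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Without the flag, "." refuses the newline, so there is no match.
        System.out.println(firstMatch("foo:.", "foo:\nbar") == null);              // true
        // With the flag, "." accepts the newline as well.
        System.out.println("foo:\n".equals(firstMatch("(?s)foo:.", "foo:\nbar"))); // true
        // An ordinary character matches either way.
        System.out.println(firstMatch("foo:.", "foo:bar"));                        // foo:b
    }
}
```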
Review: We became familiar with four basic pattern elements. Here they are described briefly and a bit more technically.
  1. Literal text -- Any alphabetic or numeric character, or any special character preceded by a backslash. Thus, "hello\\$" is literal text.
  2. Ignore case flag -- This flag is set by including the sequence "(?i)" somewhere in the pattern.
  3. Square Brackets -- A set of characters that can match. Square brackets have either the form "[...]" or "[^...]"; the latter form is a negation of the first. The region represented by ... is a non-empty sequence of either letters, as in "[abc]" (which matches a, b, or c), or ranges, as in "[a-dfk-m]" (which matches one of the letters a, b, c, d, f, k, l, m).
  4. Repeated sequences -- "{min,max}" matches between min and max repetitions of the preceding pattern element. Thus "x{3,10}" matches "xxx" and "xxxxxxxxxx" but not "xx". If there is no maximum (if you want to be able to match any number of repetitions of the preceding element), then simply write nothing for the maximum number. "x{3,}" matches three or more x's.
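The bounds in item 4 can be checked directly. One subtlety is worth a sketch: a search looks for the pattern anywhere in the text, so a run of eleven x's still contains a match (its first ten), even though the run as a whole is too long. The code assumes java.util.regex rather than com.stevesoft.pat; RepeatBounds and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not com.stevesoft.pat.
class RepeatBounds {
    // Builds a string of n copies of the letter x.
    static String xs(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    // True only if the entire run of n x's fits "x{3,10}".
    static boolean wholeMatch(int n) {
        return Pattern.matches("x{3,10}", xs(n));
    }

    // Length of the first match found by searching, or -1 if there is none.
    static int searchLength(int n) {
        Matcher m = Pattern.compile("x{3,10}").matcher(xs(n));
        return m.find() ? m.group().length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(2));    // false
        System.out.println(wholeMatch(3));    // true
        System.out.println(wholeMatch(10));   // true
        System.out.println(wholeMatch(11));   // false
        System.out.println(searchLength(11)); // 10
    }
}
```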
