Tutorial Part 1
Basic Pattern Elements:
[], {}, *, +, ?, (?i)
\w, \s, \d, \W, \S, \D
Regular expressions are a valuable tool by which one can process text.
The most basic regular expression is a literal text string (all the
alphabetic and numeric characters will be interpreted literally). We can
find the word "shells" in a String as follows:
// create a Regex object
Regex r = new Regex("shells");
// search for a match within a string
r.search("She sells sea shells by the sea shore.");
System.out.println(""+r.didMatch());
// Prints "true" -- r.didMatch() is a boolean function
// that tells us whether the last search was successful
// in finding a pattern.
System.out.println(r.stringMatched());
// Prints "shells" -- the part of the String that
// matched during the previous search.
System.out.println(r.left());
// Prints "She sells sea " -- the part of the String
// that is to the left of the matching text.
System.out.println(r.right());
// Prints " by the sea shore." -- the part of the
// String that is to the right of the matching text.
|
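For readers who do not have the Regex class on hand, the same search can be sketched with the standard java.util.regex package. This is only an approximation under that assumption: the class name LiteralSearch and the helper split are made up for illustration, and the left and right pieces are rebuilt from the match offsets rather than taken from a library call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralSearch {
    // Hypothetical helper: returns {left, matched, right},
    // or null when the pattern is not found in the text.
    public static String[] split(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            text.substring(0, m.start()), // text to the left of the match
            m.group(),                    // the matched text itself
            text.substring(m.end())       // text to the right of the match
        };
    }

    public static void main(String[] args) {
        String[] parts = split("shells", "She sells sea shells by the sea shore.");
        System.out.println(parts[1]); // Prints "shells"
        System.out.println(parts[0]); // Prints "She sells sea "
        System.out.println(parts[2]); // Prints " by the sea shore."
    }
}
```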
However, the above bit of code does not match if we encounter
the substring "SHELLS" rather than "shells".
r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
System.out.println(""+r.didMatch());
// Prints "false"
|
We can fix this by changing our pattern.
r = new Regex("(?i)shells");
r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
System.out.println(""+r.didMatch());
// Prints "true"
System.out.println(r.stringMatched());
// Prints "SHELLS"
|
The "(?i)" tells the pattern to ignore the case of
all letters. This may be more than you want. Suppose
you only want to ignore the case of the first character
of the word, so that "Shells" or "shells" will
match but "SHELLS" will not.
r = new Regex("[Ss]hells");
r.search("SHELLS Shells shells");
System.out.println(r.stringMatched());
// Prints "Shells"
|
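The three variations can be compared side by side. The sketch below uses java.util.regex instead of the Regex class, on the assumption that "(?i)" and square brackets behave the same way there; CaseDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseDemo {
    // Hypothetical helper: the first match of pattern in text, or null.
    public static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "SHELLS Shells shells";
        // Plain literal: only the all-lower-case word matches.
        System.out.println(firstMatch("shells", text));     // Prints "shells"
        // Ignore case everywhere: the first word already matches.
        System.out.println(firstMatch("(?i)shells", text)); // Prints "SHELLS"
        // Either case for the first letter only.
        System.out.println(firstMatch("[Ss]hells", text));  // Prints "Shells"
    }
}
```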
When Regex sees square brackets, it understands that you
want to match one of the characters inside them. Thus,
"[Ss]" matches either "S" or "s". This type of pattern
has more uses, however, than simply matching two
cases of a letter. You can, for example, use it to
match a digit. The pattern
"[0123456789]" does this.
Regex r = new Regex("[0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "3"
|
This might not really be what we want. We might want
to get the number "35" rather than just the "3". The
pattern in the square brackets can only match one
character. We could simply repeat it, like so:
Regex r = new Regex("[0123456789][0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"
|
However, this pattern is not very flexible. It does
not match on a String with just one digit.
r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "null" because no match occurred.
|
It also doesn't match all of a longer integer, thus:
r.search("When were you born? In 1963");
System.out.println(r.stringMatched());
// Prints "19"
|
If we want something to match one, two, three, or
four digits we can use a new pattern element.
Regex r = new Regex("[0123456789]{1,4}");
r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "8"
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"
r.search("When were you born? In 1963.");
System.out.println(r.stringMatched());
// Prints "1963"
|
It is important to notice that "{1,4}" is hungry. That is,
it matches as many times as it can, up to the maximum.
It may be that we don't want to specify
a maximum number of characters to match. Perhaps
we just want to match one or more digits. We
can do this by not supplying the second number to
the {} pattern element.
r = new Regex("[0123456789]{1,}");
r.search("What's your favorite number? It's 979834743.");
System.out.println(r.stringMatched());
// Prints "979834743"
|
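The effect of the different repeat counts is easiest to see when they are run against the same inputs. Here is a rough sketch using java.util.regex, assuming its {min,max} element behaves like the one described here; QuantifierDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: the first match of pattern in text, or null.
    public static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String one = "How old are you? I'm only 8.";
        String many = "What's your favorite number? It's 979834743.";

        // Exactly two digits: no match on a single digit.
        System.out.println(firstMatch("[0123456789][0123456789]", one)); // Prints "null"
        // One to four digits: hungry, but stops at the maximum of four.
        System.out.println(firstMatch("[0123456789]{1,4}", one));  // Prints "8"
        System.out.println(firstMatch("[0123456789]{1,4}", many)); // Prints "9798"
        // One or more digits: no maximum, so the whole number matches.
        System.out.println(firstMatch("[0123456789]{1,}", many));  // Prints "979834743"
    }
}
```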
It may have occurred to you that typing out the
sequence of digits "[0123456789]" is a little awkward.
Imagine if we wanted to match all the letters of the
alphabet: we would have to type a rather long string
indeed. Fortunately, there is a shorter way to write
this. We can specify ranges of
letters and numbers. Thus "[0-9]" matches any digit;
it matches all the characters in the range from 0 to
9. We can use "[a-z]" to match any lower case letter.
We can use "[A-Z]" to match any upper case letter,
or we can use "[A-Za-z0-9]" to match a character that
is either an upper case letter, a lower case letter,
or a digit.
Regex r = new Regex("[A-Z][a-z]{1,}");
// Matches an upper case letter, followed by one or
// more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "What"
|
Hmm. I was really hoping to match "Fred" not "What".
So, I will just rewrite my pattern.
Regex r = new Regex("[A-VX-Z][a-z]{1,}");
// Matches an upper case letter (excluding W),
// followed by one or more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "My"
|
Hmm. Still not what I wanted. Let's change
the pattern again.
Regex r = new Regex("[A-VX-Z][a-z]{2,}");
// Matches an upper case letter (excluding W),
// followed by two or more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "Fred" -- "My" does not match because
// the pattern now requires one capital letter
// (that isn't a "W") and at least two lower case
// letters.
|
Finally, I matched the piece of text I wanted.
A similar pattern is shown below. Note that it also picks up
the trailing period, so it matches "Fred." rather than "Fred".
Regex r = new Regex("[A-VX-Z][^ ]{2,}");
|
When a "^" appears as the first character inside []'s
it negates the pattern. Thus "[^ ]" matches any
character other than a space (" "), and "[^0-9]" matches any
character that is not a digit.
You may be wondering, at this point, if it is possible
to match against something like "[0-9]" as literal text
and not as a digit. The answer, of course, is yes.
Preceding a non-alphanumeric character with a "\\"
(note, this is really only one backslash, but the java
compiler interprets two backslashes as one when they
appear inside quotes) causes Regex to interpret
it as literal text. (Note: Putting a backslash before an alphanumeric
character often makes it a special pattern character
instead of a literal).
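To see how the range pattern and the negated pattern differ on the "Fred" sentence, here is a small sketch written against java.util.regex, on the assumption that ranges and "^" negation work the same way there. The names NegationDemo and firstMatch are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegationDemo {
    // Hypothetical helper: the first match of pattern in text, or null.
    public static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "What is your name? My name is Fred.";
        // Capital (not W) followed by two or more lower case letters.
        System.out.println(firstMatch("[A-VX-Z][a-z]{2,}", text)); // Prints "Fred"
        // Capital (not W) followed by two or more non-space characters;
        // the period is not a space, so it is swallowed as well.
        System.out.println(firstMatch("[A-VX-Z][^ ]{2,}", text));  // Prints "Fred."
        // One or more characters that are not digits.
        System.out.println(firstMatch("[^0-9]{1,}", "35 years"));  // Prints " years"
    }
}
```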
Regex r = new Regex("\\[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "[0-9]"
r = new Regex("[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "0"
|
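The double-backslash rule is easy to get wrong, so it may help to run it. The sketch below uses java.util.regex rather than the Regex class; the escaping of "[" is assumed to carry over, and Pattern.quote is a java.util.regex convenience that is not part of this tutorial's library. EscapeDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: the first match of pattern in text, or null.
    public static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "the pattern is [0-9]";
        // "\\[" in Java source is the two characters \[ in the pattern,
        // which means a literal "[" rather than the start of a set.
        System.out.println(firstMatch("\\[0-9]", text)); // Prints "[0-9]"
        // Without the backslash the brackets form a set of digits.
        System.out.println(firstMatch("[0-9]", text));   // Prints "0"
        // java.util.regex can also escape a whole string for us.
        System.out.println(firstMatch(Pattern.quote("[0-9]"), text)); // Prints "[0-9]"
    }
}
```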
Now, for a few bits of very useful shorthand. You
will want to be familiar with them.
Regex r1=new Regex("\\w");
// the same as "[0-9A-Za-z_]"
Regex r2=new Regex("\\w+");
// the same as "\\w{1,}"
Regex r3=new Regex("\\w?");
// the same as "\\w{0,1}"
Regex r4=new Regex("\\w*");
// the same as "\\w{0,}"
Regex r5=new Regex("\\w{5}");
// the same as "\\w{5,5}"
Regex r6=new Regex("\\s");
// the same as "[ \b\t\n\r]" -- these
// are referred to as white space characters.
Regex r7=new Regex("\\d");
// the same as "[0-9]" -- a digit
Regex r8=new Regex("\\W");
// the same as "[^A-Za-z0-9_]" -- anything other
// than the letters, digits, and underscore that
// "\\w" matches (the characters typically used
// in a java variable name).
Regex r9=new Regex("\\D");
// the same as "[^0-9]" -- not a digit
Regex r10=new Regex("\\S");
// the same as "[^ \b\r\t\n]"
Regex r11=new Regex(".");
// the same as "[^\n]". In most cases, this
// serves the purpose of matching any character.
// The pattern ".*" is a popular way
// to match arbitrary regions of text.
|
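A few of these shorthands in action, as a sketch only. It is written with java.util.regex, whose definitions are close to but not identical with the ones listed above (its "\\s" covers a slightly different set of white space characters, for instance), so treat the results as illustrative. ShorthandDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandDemo {
    // Hypothetical helper: the first match of pattern in text, or null.
    public static String firstMatch(String pattern, String text) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\w+", "foo_bar1 baz"));  // Prints "foo_bar1"
        System.out.println(firstMatch("\\w{5}", "abcdefgh"));    // Prints "abcde"
        System.out.println(firstMatch("\\d+", "I'm 35."));       // Prints "35"
        System.out.println(firstMatch("\\D+", "35 years"));      // Prints " years"
        System.out.println(firstMatch("\\W", "abc-def"));        // Prints "-"
        System.out.println(firstMatch("\\S+", "  hi there"));    // Prints "hi"
    }
}
```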
Since "." doesn't match a newline, what matches
any character at all? Well, "." can match anything if the s flag is
enabled. To enable the s flag, include the
string "(?s)" at the front of your pattern.
Regex r12=new Regex("(?s).");
// will match any character
Regex r13=new Regex("(?s)foo:.");
// matches on the string "foo:"
// followed by any character.
|
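The difference the s flag makes only shows up when the text contains a newline. A minimal sketch with java.util.regex, assuming its "(?s)" flag matches the behavior described here (note that "." in java.util.regex excludes a few other line terminators besides "\n"); DotAllDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical helper: true if pattern matches somewhere in text.
    public static boolean found(String pattern, String text) {
        return Pattern.compile(pattern).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "foo:\nbar";
        // Without the flag, "." refuses to match the newline after "foo:".
        System.out.println(found("foo:.", text));     // Prints "false"
        // With the flag, "." matches the newline too.
        System.out.println(found("(?s)foo:.", text)); // Prints "true"
    }
}
```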
Review: We became familiar with four basic
pattern elements. Here they are described briefly
and a bit more technically.
- Literal text -- Any alphabetic or numeric character,
or any special character preceded by a backslash.
Thus, "hello\\$" is literal text.
- Ignore case flag -- This flag is set by including
the sequence "(?i)" somewhere in the pattern.
- Square Brackets -- A set of characters that can
match. Square brackets have either the form "[...]"
or "[^...]"; the latter form is a negation of the
first. The region represented by ... is a non-empty
sequence of either letters "[abc]" (which matches a,
b, or c) or ranges "[a-dfk-m]" (which matches one of
the letters a, b, c, d, f, k, l, m).
- Repeated sequences -- "{min,max}" matches between
min and max of the preceding pattern element. Thus
"x{3,10}" matches "xxx" and "xxxxxxxxxx" but not "xx".
If there is no maximum number of characters (if you
want to be able to match an unlimited number of the
preceding element), then simply write nothing for the maximum
number. "x{3,}" matches three or more x's.