Pattern Matching with Regular Expressions
In Java 1.4 and later, you can perform textual pattern matching
with regular expressions. Regular expression support is
provided by the Pattern and
Matcher classes of the
java.util.regex package, but the
String class defines a number of convenient
methods that allow you to use regular expressions even more
simply. Regular expressions use a fairly complex grammar to
describe patterns of characters. The Java implementation uses
essentially the same regex syntax as the Perl programming language. See the
java.util.regex.Pattern class for a summary of this syntax or consult a
good Perl programming book for further details. For a complete
tutorial on Perl-style regular expressions, see
Mastering Regular Expressions (O'Reilly).
The simplest String method that accepts a
regular expression argument is matches(); it
returns true if the string matches the
pattern defined by the specified regular expression:
// This string is a regular expression that describes the pattern of a typical
// sentence. In Perl-style regular expression syntax, it specifies
// a string that begins with a capital letter and ends with a period,
// a question mark, or an exclamation point.
String pattern = "^[A-Z].*[\\.?!]$";
String s = "Java is fun!";
s.matches(pattern); // The string matches the pattern, so this returns true.
The matches() method returns
true only if the entire string is a match for the
specified pattern. Perl programmers should note that this
differs from Perl's behavior, in which a match means only that some
portion of the string matches the pattern. To determine if a
string or any substring matches a pattern, simply alter the
regular expression to allow arbitrary characters before and
after the desired pattern. In the following code, the regular
expression characters .* match any number of
arbitrary characters:
s.matches(".*\\bJava\\b.*"); // True if s contains the word "Java" anywhere
                             // The \b specifies a word boundary
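The difference between a whole-string match and a substring match can be sketched as follows. The class and helper names here are hypothetical, chosen only for illustration; the second helper uses Matcher.find(), which is one alternative to wrapping the pattern in .* on both sides.

```java
import java.util.regex.Pattern;

class WholeVersusPartialMatch {
    // Hypothetical helper: true only if the ENTIRE string matches the regex.
    static boolean matchesWhole(String regex, String s) {
        return s.matches(regex);
    }

    // Hypothetical helper: true if ANY substring matches the regex.
    static boolean containsMatch(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "I think Java is fun!";
        System.out.println(matchesWhole("\\bJava\\b", s));      // false: pattern does not cover the whole string
        System.out.println(matchesWhole(".*\\bJava\\b.*", s));  // true: .* absorbs the surrounding text
        System.out.println(containsMatch("\\bJava\\b", s));     // true: find() looks for a substring
    }
}
```

Matcher.find() tends to be the more direct choice when the pattern is already written for substring matching, since the regular expression itself does not need to change.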
If you are already familiar with Perl's regular expression
syntax, you know that it relies on the liberal use of backslashes to
escape certain characters. In Perl, regular expressions are
language primitives, and their syntax is part of the language
itself. In Java, however, regular expressions are described
using strings and are typically embedded in programs using
string literals. The syntax for Java string literals also uses
the backslash as an escape character, so to include a single
backslash in the regular expression, you must use two
backslashes. Thus, in Java programming, you will often see
double backslashes in regular expressions.
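As a rough illustration of the two levels of escaping, the sketch below shows how a regex written as \d+ becomes "\\d+" in Java source, and how matching a single literal backslash takes four backslashes in the string literal. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BackslashEscaping {
    // The regex \d+ (one or more digits) is written "\\d+" in Java source.
    static boolean isDigits(String s) {
        return s.matches("\\d+");
    }

    // The regex \\ matches one literal backslash; in Java source that regex
    // is written with four backslashes, because each pair collapses to one.
    static boolean containsBackslash(String s) {
        return Pattern.compile("\\\\").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println("\\d+".length());               // 3: the characters \, d, and +
        System.out.println(isDigits("2024"));              // true
        System.out.println(isDigits("20a4"));              // false
        System.out.println(containsBackslash("C:\\temp")); // true: the string is C:\temp
        System.out.println(containsBackslash("C:/temp"));  // false
    }
}
```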
In addition to matching, regular expressions can be used for
search-and-replace operations. The
replaceFirst() and
replaceAll() methods search a string for the
first substring or all substrings that match a given pattern
and replace the string or strings with the specified replacement
text, returning a new string that contains the replacements.
For example, you could use this code to ensure that the word
"Java" is correctly capitalized in a string s:
s.replaceAll("(?i)\\bjava\\b", // Pattern: the word "java", case-insensitive
             "Java");          // The replacement string, correctly capitalized
The replacement string passed to replaceAll()
and replaceFirst() need not be a simple
literal string; it may also include references to text that
matched parenthesized subexpressions within the pattern. These
references take the form of a dollar sign followed by the number
of the subexpression. (If you are not familiar with
parenthesized subexpressions within a regular expression, see
the java.util.regex.Pattern class.) For example, to
search for words such as JavaBean, JavaScript, JavaOS, and JavaVM
(but not Java or Javanese), and to replace the Java prefix with
the letter J without altering the suffix, you could use code such as:
s.replaceAll("\\bJava([A-Z]\\w+)", // The pattern
"J$1"); // J followed by the suffix that matched the
// subexpression in parentheses: [A-Z]\\w+
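The two replacements described above can be wrapped in small methods and tried on sample text. This is a minimal sketch; the class name, method names, and sample strings are assumptions made for illustration. The last method shows one further point that follows from the $1 syntax: because the dollar sign is special in replacement text, a literal dollar sign must be escaped with a backslash.

```java
class ReplaceWithGroups {
    // Normalize the capitalization of the standalone word "java".
    static String fixCase(String s) {
        return s.replaceAll("(?i)\\bjava\\b", "Java");
    }

    // Replace the "Java" prefix with "J", keeping the captured suffix ($1).
    static String abbreviate(String s) {
        return s.replaceAll("\\bJava([A-Z]\\w+)", "J$1");
    }

    // A literal dollar sign in the replacement text must be escaped.
    static String usdToSymbol(String s) {
        return s.replaceAll("USD", "\\$");
    }

    public static void main(String[] args) {
        System.out.println(fixCase("I like JAVA and java"));
        // I like Java and Java
        System.out.println(abbreviate("JavaBean and JavaScript, but not Java or Javanese"));
        // JBean and JScript, but not Java or Javanese
        System.out.println(usdToSymbol("Price: 5 USD"));
        // Price: 5 $
    }
}
```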
The other new Java 1.4 String method that uses regular
expressions is split(), which returns an
array of the substrings of a string, separated by delimiters
that match the specified pattern. To obtain an array of words
in a string separated by any number of spaces, tabs, or
newlines, do this:
String sentence = "This is a\n\ttwo-line sentence";
String[] words = sentence.split("[ \t\n\r]+");
An optional second argument specifies the maximum number of
entries in the returned array; when that limit is reached, the last
entry holds the remainder of the string, unsplit.
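The effect of the second argument to split() can be seen in a short sketch. The class and method names are hypothetical; the pattern and the sample sentence are the ones used above.

```java
import java.util.Arrays;

class SplitWithLimit {
    // Split on runs of spaces, tabs, and line terminators, with no limit.
    static String[] splitWords(String s) {
        return s.split("[ \t\n\r]+");
    }

    // Same pattern, but return at most 'limit' entries (limit must be positive here).
    static String[] splitAtMost(String s, int limit) {
        return s.split("[ \t\n\r]+", limit);
    }

    public static void main(String[] args) {
        String sentence = "This is a\n\ttwo-line sentence";

        System.out.println(Arrays.toString(splitWords(sentence)));
        // [This, is, a, two-line, sentence]

        String[] limited = splitAtMost(sentence, 3);
        System.out.println(limited.length);  // 3
        // The third entry is the unsplit remainder, including its newline and tab.
        System.out.println(limited[2].equals("a\n\ttwo-line sentence"));  // true
    }
}
```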
The matches(),
replaceFirst(),
replaceAll(), and split()
methods are suitable for when you use a regular
expression only once. If you want to use a regular expression for
multiple matches, you should explicitly use the
Pattern and Matcher
classes of the java.util.regex package.
First, create a Pattern object to represent
your regular expression with the static
Pattern.compile() method. (Another reason to
use the Pattern class explicitly instead of
the String convenience methods is that
Pattern.compile() allows you to specify flags
such as Pattern.CASE_INSENSITIVE that
globally alter the way the pattern matching is done.) Note that
the compile() method can throw a
PatternSyntaxException if you pass it an
invalid regular expression string. (This exception is also
thrown by the various String convenience
methods.) The Pattern class defines
split() methods that are similar to the
String.split() methods. For all other
matching, however, you must create a
Matcher object with the
matcher() method and specify the text to be
matched against:
import java.util.regex.*;
Pattern javaword = Pattern.compile("\\bJava(\\w*)", Pattern.CASE_INSENSITIVE);
Matcher m = javaword.matcher(sentence);
boolean match = m.matches(); // True if text matches pattern exactly
Once you have a Matcher object, you can
compare the string to the pattern in various ways. One of the
more sophisticated ways is to find all substrings that match
the pattern:
String text = "Java is fun; JavaScript is funny.";
m.reset(text); // Start matching against a new string
// Loop to find all matches of the pattern and print details of each match
while (m.find()) {
    System.out.println("Found '" + m.group(0) + "' at position " + m.start(0));
    // Group 1 is the suffix after "Java"; print it only if it is non-empty
    if (m.start(1) < m.end(1)) System.out.println("Suffix is " + m.group(1));
}
See the Matcher class for further details.
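One way to put these pieces together is to compile the pattern once, store it in a constant, and create a new Matcher for each string to be searched. The following is a minimal sketch, not the only reasonable arrangement; the class and method names are hypothetical, and the code assumes a more recent Java release than 1.4 because it uses generics and String.isEmpty().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class JavaWordFinder {
    // Compiled once and reused for every search.
    private static final Pattern JAVA_WORD =
        Pattern.compile("\\bJava(\\w*)", Pattern.CASE_INSENSITIVE);

    // Collect the non-empty suffixes of all words that begin with "Java".
    static List<String> findSuffixes(String text) {
        List<String> suffixes = new ArrayList<>();
        Matcher m = JAVA_WORD.matcher(text);
        while (m.find()) {
            String suffix = m.group(1);
            if (!suffix.isEmpty()) suffixes.add(suffix);
        }
        return suffixes;
    }

    // Report whether a regex string compiles, instead of letting
    // PatternSyntaxException propagate to the caller.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(findSuffixes("Java is fun; JavaScript is funny."));  // [Script]
        System.out.println(isValidRegex("\\bJava(\\w*)"));  // true
        System.out.println(isValidRegex("[A-Z"));           // false: unclosed character class
    }
}
```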
String Comparison
The compareTo() and equals()
methods of the String class allow you to
compare strings. compareTo() bases its
comparison on the character order defined by the Unicode encoding,
while equals() defines string equality as
strict character-by-character equality. These are not always the
right methods to use, however. In some languages, the character
ordering imposed by the Unicode standard does not match the
dictionary ordering used when alphabetizing strings. In traditional Spanish
alphabetization, for example, the letters "ch" are treated as a single letter that
comes after "c" and before "d." When comparing human-readable
strings in an internationalized application, you should use the
java.text.Collator class instead:
import java.text.*;
// Compare two strings; results depend on where the program is run
// Return values of Collator.compare() have same meanings as String.compareTo()
Collator c = Collator.getInstance(); // Get Collator for current locale
int result = c.compare("chica", "coche"); // Use it to compare two strings
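Because a Collator is also a Comparator, it can be passed directly to a sort method. The sketch below contrasts the two orderings using an explicitly chosen locale (Locale.US), so that the result does not depend on where the program is run. The class name, method names, and sample words are assumptions for illustration only.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSortDemo {
    // Sort using String.compareTo(), which compares Unicode character values.
    static String[] sortByUnicodeValue(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sort using the dictionary ordering rules of the given locale.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "cherry", "Banana", "apple" };

        // Uppercase letters have lower Unicode values than lowercase letters.
        System.out.println(Arrays.toString(sortByUnicodeValue(words)));
        // [Banana, apple, cherry]

        System.out.println(Arrays.toString(sortForLocale(words, Locale.US)));
        // [apple, Banana, cherry]
    }
}
```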
StringTokenizer
There are a number of other Java classes that operate on strings
and characters. One notable class is
java.util.StringTokenizer, which you can use
to break a string of text into its component words:
String s = "Now is the time";
java.util.StringTokenizer st = new java.util.StringTokenizer(s);
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
You can even use this class to tokenize words that are delimited
by characters other than spaces:
String s = "a:b:c:d";
java.util.StringTokenizer st = new java.util.StringTokenizer(s, ":");
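A complete version of the delimiter example might look like the sketch below, which collects the tokens into a list. The class and method names are hypothetical, and the code assumes a Java release with generics. It also shows one behavior worth knowing when choosing between this class and String.split(): StringTokenizer skips empty tokens between adjacent delimiters, while split() keeps them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeDemo {
    // Collect all tokens of s, using each character of delims as a delimiter.
    static List<String> tokens(String s, String delims) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delims);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a:b:c:d", ":"));    // [a, b, c, d]

        // Adjacent delimiters: StringTokenizer returns no empty token...
        System.out.println(tokens("a::b", ":"));       // [a, b]
        // ...but split() returns an empty string between the two colons.
        System.out.println("a::b".split(":").length);  // 3
    }
}
```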