Title: Java in a Nutshell, 4th Edition
ISBN: 0-596-00283-1, Order Number: 2831
US Price: $39.95
CA Price: $61.95
UK Price: £28.40
© O'Reilly & Associates, Inc.

Pattern Matching with Regular Expressions

In Java 1.4 and later, you can perform textual pattern matching with regular expressions. Regular expression support is provided by the Pattern and Matcher classes of the java.util.regex package, but the String class defines a number of convenience methods that let you use regular expressions even more simply. Regular expressions use a fairly complex grammar to describe patterns of characters. The Java implementation uses essentially the same regex syntax as the Perl programming language. See the java.util.regex.Pattern class for a summary of this syntax, or consult a good Perl programming book for further details. For a complete tutorial on Perl-style regular expressions, see Mastering Regular Expressions (O'Reilly).

The simplest String method that accepts a regular expression argument is matches(); it returns true if the string matches the pattern defined by the specified regular expression:

// This string is a regular expression that describes the pattern of a typical
// sentence. In Perl-style regular expression syntax, it specifies
// a string that begins with a capital letter and ends with a period,
// a question mark, or an exclamation point.
String pattern = "^[A-Z].*[\\.?!]$";  
String s = "Java is fun!";
s.matches(pattern);    // The string matches the pattern, so this returns true.

The matches() method returns true only if the entire string is a match for the specified pattern. Perl programmers should note that this differs from Perl's behavior, in which a match means only that some portion of the string matches the pattern. To determine if a string or any substring matches a pattern, simply alter the regular expression to allow arbitrary characters before and after the desired pattern. In the following code, the regular expression characters .* match any number of arbitrary characters:

s.matches(".*\\bJava\\b.*"); // True if s contains the word "Java" anywhere
                             // \\b specifies a word boundary
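
As a rough sketch of the difference between whole-string and substring matching, the following compares matches() with the find() method of Matcher, which looks for a match anywhere in the text. The class name and the containsMatch() helper are hypothetical, introduced here only for illustration:

```java
import java.util.regex.Pattern;

class WholeVersusPartialMatch {
    // Hypothetical helper: true if any substring of s matches the regex.
    static boolean containsMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "I think Java is fun!";
        // matches() requires the entire string to match, so this is false.
        System.out.println(s.matches("\\bJava\\b"));
        // Padding the pattern with .* on both sides makes it true.
        System.out.println(s.matches(".*\\bJava\\b.*"));
        // find() searches for a matching substring, so no padding is needed.
        System.out.println(containsMatch(s, "\\bJava\\b"));
    }
}
```

Note that the .* padding does not match line terminators by default, so for multi-line text a find()-based search may be the simpler choice.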

If you are already familiar with Perl's regular expression syntax, you know that it relies on the liberal use of backslashes to escape certain characters. In Perl, regular expressions are language primitives, and their syntax is part of the language itself. In Java, however, regular expressions are described using strings and are typically embedded in programs using string literals. The syntax for Java string literals also uses the backslash as an escape character, so to include a single backslash in the regular expression, you must use two backslashes. Thus, in Java programming, you will often see double backslashes in regular expressions.
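
The doubling of backslashes can be seen in a small sketch like the one below. The class and constant names are illustrative only; the point is that the compiler first turns each \\ in a string literal into a single backslash, and the regular expression engine then interprets what is left:

```java
class BackslashDemo {
    // The Java literal "\\d+" holds three characters: backslash, d, plus.
    static final String DIGITS = "\\d+";
    // To match one literal backslash, the regex needs two backslashes,
    // which takes four backslashes in a Java string literal.
    static final String ONE_BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(DIGITS.length());        // 3
        System.out.println("2831".matches(DIGITS)); // true

        String path = "C:\\temp"; // the seven characters C:\temp
        System.out.println(path.matches("C:" + ONE_BACKSLASH + "temp")); // true

        // An escaped period matches only a literal period.
        System.out.println("a.c".matches("a\\.c")); // true
        System.out.println("abc".matches("a\\.c")); // false
    }
}
```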

In addition to matching, regular expressions can be used for search-and-replace operations. The replaceFirst() and replaceAll() methods search a string for the first substring or all substrings that match a given pattern and replace the string or strings with the specified replacement text, returning a new string that contains the replacements. For example, you could use this code to ensure that the word "Java" is correctly capitalized in a string s:

s.replaceAll("(?i)\\bjava\\b",// Pattern: the word "java", case-insensitive
             "Java");         // The replacement string, correctly capitalized

The replacement string passed to replaceAll() and replaceFirst() need not be a simple literal string; it may also include references to text that matched parenthesized subexpressions within the pattern. These references take the form of a dollar sign followed by the number of the subexpression. (If you are not familiar with parenthesized subexpressions within a regular expression, see java.util.regex.Pattern.) For example, to search for words such as JavaBean, JavaScript, JavaOS, and JavaVM (but not Java or Javanese), and to replace the Java prefix with the letter J without altering the suffix, you could use code such as:

s.replaceAll("\\bJava([A-Z]\\w+)",  // The pattern
             "J$1");      // J followed by the suffix that matched the
                          // subexpression in parentheses: [A-Z]\\w+ 
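
Because strings are immutable, the result of replaceAll() has to be captured to be of any use. The following sketch wraps the two replacements above in hypothetical helper methods (the class and method names are not from the book) and runs them on sample text:

```java
class ReplaceDemo {
    // Replace the "Java" prefix with "J", keeping the capitalized suffix ($1).
    static String shortenJavaPrefix(String s) {
        return s.replaceAll("\\bJava([A-Z]\\w+)", "J$1");
    }

    // Normalize the capitalization of the standalone word "java".
    static String fixCapitalization(String s) {
        return s.replaceAll("(?i)\\bjava\\b", "Java");
    }

    public static void main(String[] args) {
        String s = "JavaBean and JavaScript run on Java, not Javanese.";
        // Prints: JBean and JScript run on Java, not Javanese.
        System.out.println(shortenJavaPrefix(s));
        // Prints: I like Java and Java.
        System.out.println(fixCapitalization("I like JAVA and java."));
    }
}
```

Since the dollar sign and backslash are special in the replacement string, literal occurrences of them there need to be escaped with a backslash.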

The other new Java 1.4 String method that uses regular expressions is split(), which returns an array of the substrings of a string, separated by delimiters that match the specified pattern. To obtain an array of words in a string separated by any number of spaces, tabs, or newlines, do this:

String sentence = "This is a\n\ttwo-line sentence";
String[] words = sentence.split("[ \t\n\r]+");

An optional second argument specifies the maximum number of entries in the returned array.
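
A brief sketch of how the limit argument behaves follows. The splitKeyValue() helper and the sample strings are made up for illustration; the split() calls themselves are the standard String methods:

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: split "key=value" at the first '=' only,
    // so that any later '=' characters stay in the value.
    static String[] splitKeyValue(String line) {
        return line.split("=", 2);
    }

    public static void main(String[] args) {
        String sentence = "This is a\n\ttwo-line sentence";
        // Prints: [This, is, a, two-line, sentence]
        System.out.println(Arrays.toString(sentence.split("[ \t\n\r]+")));

        // A limit of 2 yields at most two entries: [a, b:c:d]
        System.out.println(Arrays.toString("a:b:c:d".split(":", 2)));

        // Prints: [path, a=b]
        System.out.println(Arrays.toString(splitKeyValue("path=a=b")));

        // Without a limit, trailing empty strings are dropped;
        // a negative limit keeps them.
        System.out.println("a,b,,".split(",").length);     // 2
        System.out.println("a,b,,".split(",", -1).length); // 4
    }
}
```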

The matches(), replaceFirst(), replaceAll(), and split() methods are suitable for when you use a regular expression only once. If you want to use a regular expression for multiple matches, you should explicitly use the Pattern and Matcher classes of the java.util.regex package. First, create a Pattern object to represent your regular expression with the static Pattern.compile() method. (Another reason to use the Pattern class explicitly instead of the String convenience methods is that Pattern.compile() allows you to specify flags such as Pattern.CASE_INSENSITIVE that globally alter the way the pattern matching is done.) Note that the compile() method can throw a PatternSyntaxException if you pass it an invalid regular expression string. (This exception is also thrown by the various String convenience methods.) The Pattern class defines split() methods that are similar to the String.split() methods. For all other matching, however, you must create a Matcher object with the matcher() method and specify the text to be matched against:

import java.util.regex.*;

Pattern javaword = Pattern.compile("\\bJava(\\w*)", Pattern.CASE_INSENSITIVE);
Matcher m = javaword.matcher(sentence);
boolean match = m.matches();  // True if text matches pattern exactly

Once you have a Matcher object, you can compare the string to the pattern in various ways. One of the more sophisticated ways is to find all substrings that match the pattern:

String text = "Java is fun; JavaScript is funny.";
m.reset(text);  // Start matching against a new string
// Loop to find all matches of the string and print details of each match
while(m.find()) {
  System.out.println("Found '" + m.group(0) + "' at position " + m.start(0));
  if (m.start(1) < m.end(1)) System.out.println("Suffix is " + m.group(1));
}

See the Matcher class for further details.
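
Putting these pieces together, one possible arrangement is to compile the pattern once, reuse it for every input, and guard against invalid patterns. This is only a sketch: the class and method names are hypothetical, and it uses generics and other features from JDK releases later than 1.4:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompiledPatternDemo {
    // Compiled once and reused for every call.
    private static final Pattern JAVA_WORD =
        Pattern.compile("\\bJava(\\w*)", Pattern.CASE_INSENSITIVE);

    // Collect the non-empty suffixes of all words that start with "Java".
    static List<String> findSuffixes(String text) {
        List<String> suffixes = new ArrayList<>();
        Matcher m = JAVA_WORD.matcher(text);
        while (m.find()) {
            String suffix = m.group(1);
            if (!suffix.isEmpty()) suffixes.add(suffix);
        }
        return suffixes;
    }

    // Report whether a regular expression string compiles.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Prints: [Script]
        System.out.println(findSuffixes("Java is fun; JavaScript is funny."));
        System.out.println(isValidRegex("[A-Z"));   // false: unclosed character class
        System.out.println(isValidRegex("[A-Z]+")); // true
    }
}
```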

String Comparison

The compareTo() and equals() methods of the String class allow you to compare strings. compareTo() bases its comparison on the character order defined by the Unicode encoding, while equals() defines string equality as strict character-by-character equality. These are not always the right methods to use, however. In some languages, the character ordering imposed by the Unicode standard does not match the dictionary ordering used when alphabetizing strings. In traditional Spanish alphabetization, for example, the letters "ch" are treated as a single letter that comes after "c" and before "d." When comparing human-readable strings in an internationalized application, you should use the java.text.Collator class instead:

import java.text.*;

// Compare two strings; results depend on where the program is run
// Return values of Collator.compare() have same meanings as String.compareTo()
Collator c = Collator.getInstance();       // Get Collator for current locale
int result = c.compare("chica", "coche");  // Use it to compare two strings
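
To make the difference from compareTo() concrete without depending on the machine's default locale, the sketch below requests a Collator for an explicit locale. Locale.US and the sample words are assumptions chosen for illustration, and the helper names are hypothetical:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorDemo {
    // Return a copy of words sorted by the collation rules of the given locale.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // Compare two strings while ignoring case differences, using
    // primary-strength collation for the given locale.
    static boolean samePrimary(String a, String b, Locale locale) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(Collator.PRIMARY);
        return c.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String[] words = { "cherry", "Banana", "apple" };

        String[] byCodePoint = words.clone();
        Arrays.sort(byCodePoint); // uses String.compareTo()
        // Uppercase letters sort first: [Banana, apple, cherry]
        System.out.println(Arrays.toString(byCodePoint));

        // Dictionary order: [apple, Banana, cherry]
        System.out.println(Arrays.toString(sortForLocale(words, Locale.US)));

        System.out.println(samePrimary("abc", "ABC", Locale.US)); // true
    }
}
```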

StringTokenizer

There are a number of other Java classes that operate on strings and characters. One notable class is java.util.StringTokenizer, which you can use to break a string of text into its component words:

String s = "Now is the time";
java.util.StringTokenizer st = new java.util.StringTokenizer(s);
while(st.hasMoreTokens()) {
  System.out.println(st.nextToken());
}

You can even use this class to tokenize words that are delimited by characters other than spaces:

String s = "a:b:c:d";
java.util.StringTokenizer st = new java.util.StringTokenizer(s, ":");
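
A fuller sketch of the delimiter-based case might look like the following. The tokens() helper is hypothetical, and the code assumes a JDK with generics. It also shows one behavioral difference worth knowing about: StringTokenizer skips empty tokens between adjacent delimiters, while String.split() returns them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Collect all tokens of s, using each character of delimiters as a separator.
    static List<String> tokens(String s, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delimiters);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a:b:c:d", ":")); // [a, b, c, d]

        // Adjacent delimiters produce no empty token here...
        System.out.println(tokens("a::b", ":"));    // [a, b]
        // ...but split() reports the empty field: [a, , b]
        System.out.println(Arrays.toString("a::b".split(":")));
    }
}
```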
