Top 10 Questions for Java Regular Expression

This post summarizes the most frequently asked questions about Java regular expressions. Because they come up so often, you may find the answers useful too.

1. How to extract numbers from a string?

One common question about regular expressions is how to extract all the numbers in a string into a list of integers.

In Java, \d matches a single digit (0-9), so \d+ matches a run of one or more digits. Using the predefined classes whenever possible will make your code easier to read and eliminate errors introduced by malformed character classes. Please refer to Predefined character classes for more details. Please note the first backslash \ in \d. If you are using an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. That’s why we need to write \\d in Java source code.

List<Integer> numbers = new LinkedList<Integer>();
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(str); 
while (m.find()) {
  numbers.add(Integer.parseInt(m.group()));
}
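
For completeness, here is the same loop wrapped in a small runnable class. This is only a sketch: the class name ExtractNumbers and the method extractNumbers are names I made up for illustration, and the sample input is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class ExtractNumbers {
  // Compiled once and reused; matches a run of one or more digits.
  private static final Pattern DIGITS = Pattern.compile("\\d+");

  static List<Integer> extractNumbers(String str) {
    List<Integer> numbers = new ArrayList<>();
    Matcher m = DIGITS.matcher(str);
    while (m.find()) {
      // Integer.parseInt throws NumberFormatException if the digit run
      // does not fit into an int.
      numbers.add(Integer.parseInt(m.group()));
    }
    return numbers;
  }

  public static void main(String[] args) {
    // prints [12, 7, 305]
    System.out.println(extractNumbers("abc 12, x7 and 305"));
  }
}
```

Note that a minus sign is not part of \\d+, so "-5" yields 5 with this pattern.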

2. How to split Java String by newlines?

There are at least three different line endings, depending on the operating system the text comes from.

\r represents CR (Carriage Return), which was used in classic Mac OS (before OS X)
\n means LF (Line Feed), used in Unix, Linux, and modern macOS
\r\n means CR + LF, used in Windows

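Therefore, the most straightforward way to split a string str by new lines is the following. Note that split() is an instance method, and that this pattern does not split on a lone \r.

String lines[] = str.split("\\r?\\n");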

But if you don’t want empty lines, you can use the following, which is also my favourite way:

str.split("[\\r\\n]+")

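Another option is to split on the separator of the platform the code is running on. Be aware that this is system dependent rather than portable: it only recognizes the current platform's line ending, so a Windows file read on Linux will keep its \r characters. You will also still get empty lines if two newline sequences are placed side by side. On Java 8 and later, the linebreak matcher \\R recognizes all three endings.

str.split(System.getProperty("line.separator"));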
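
To see how the patterns differ, here is a small sketch that runs them on one string with mixed line endings. The class name SplitLines, the method names, and the sample text are my own, and \\R assumes Java 8 or later.

```java
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
class SplitLines {
  // Splits on LF or CRLF and keeps empty lines; a lone CR is not a separator.
  static String[] splitKeepEmpty(String s) {
    return s.split("\\r?\\n");
  }

  // Splits on any run of CR/LF characters, so empty lines disappear.
  static String[] splitSkipEmpty(String s) {
    return s.split("[\\r\\n]+");
  }

  // Splits on any single linebreak sequence (Java 8+), keeping empty lines.
  static String[] splitAnyLinebreak(String s) {
    return s.split("\\R");
  }

  public static void main(String[] args) {
    String text = "one\r\ntwo\n\nthree\rfour";
    System.out.println(splitKeepEmpty(text).length);                 // prints 4
    System.out.println(Arrays.toString(splitSkipEmpty(text)));       // prints [one, two, three, four]
    System.out.println(Arrays.toString(splitAnyLinebreak(text)));    // prints [one, two, , three, four]
  }
}
```

The first call returns only four pieces because "three\rfour" stays glued together: a lone \r is not matched by \\r?\\n.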

3. Importance of Pattern.compile()

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The static Pattern.compile() method is the only way to create such an instance, since Pattern has no public constructor. A typical invocation sequence is thus

Pattern p = Pattern.compile("a*b");
Matcher matcher = p.matcher("aaaaab");
assert matcher.matches() == true;

Essentially, Pattern.compile() transforms a regular expression into a compiled form, conceptually a finite state machine (see Compilers: Principles, Techniques, and Tools (2nd Edition)). All of the state involved in performing a match, however, resides in the matcher. This way, the Pattern p can be reused, and many matchers can share the same pattern.

Matcher anotherMatcher = p.matcher("aab");
assert anotherMatcher.matches() == true;

Pattern.matches() method is defined as a convenience for when a regular expression is used just once. This method still uses compile() to get the instance of a Pattern implicitly, and matches a string. Therefore,

boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the first code above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
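
A common way to take advantage of this is to compile the pattern once into a constant and create a fresh matcher per input. The following is a minimal sketch; the class name PatternReuse, the constant A_STAR_B, and the method countMatches are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
class PatternReuse {
  // Compiled once; a Pattern is immutable and can be shared.
  private static final Pattern A_STAR_B = Pattern.compile("a*b");

  // Counts the inputs that match the whole pattern, creating one Matcher per input.
  static int countMatches(String[] inputs) {
    int count = 0;
    for (String s : inputs) {
      if (A_STAR_B.matcher(s).matches()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String[] inputs = {"aaaaab", "aab", "b", "ba", "abc"};
    // prints 3 ("aaaaab", "aab" and "b" match)
    System.out.println(countMatches(inputs));
  }
}
```

Calling Pattern.matches("a*b", s) inside the loop would give the same count, but it would recompile the expression on every iteration.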

4. How to escape text for regular expression?

In general, regular expressions use “\” to escape constructs, but it is painful to precede every backslash with another backslash for the Java string to compile. There is another way to pass a string literal such as “$5” to the Pattern. Instead of writing \\$5 or [$]5, we can let Pattern.quote() build the escaped pattern string and compile that:

Pattern p = Pattern.compile(Pattern.quote("$5"));
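
Here is a short sketch of the difference between a quoted and an unquoted literal. The class name QuoteExample and the method containsLiteral are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
class QuoteExample {
  // Returns true if text contains literal, treating literal as plain text.
  static boolean containsLiteral(String text, String literal) {
    return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
  }

  public static void main(String[] args) {
    String text = "Price: $5 each";
    // prints true: the quoted pattern matches the characters '$' and '5'
    System.out.println(containsLiteral(text, "$5"));
    // prints false: unquoted, $ is the end-of-input anchor, so "5" can never follow it
    System.out.println(Pattern.compile("$5").matcher(text).find());
    // prints \Q$5\E, the quoted form that Pattern.quote() produces
    System.out.println(Pattern.quote("$5"));
  }
}
```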

5. Why does String.split() need pipe delimiter to be escaped?

String.split() splits a string around matches of the given regular expression. Java regular expressions support special characters, called metacharacters, that affect the way a pattern is matched. | is one such metacharacter: it matches one regular expression out of several alternatives. For example, A|B means either A or B. Please refer to Alternation with The Vertical Bar or Pipe Symbol for more details. Therefore, to use | as a literal, you need to escape it by adding \ in front of it, which is written \\| in a Java string.
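
The following sketch shows the escaped and unescaped forms side by side. The class name PipeSplit and the method splitOnPipe are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
class PipeSplit {
  // Splits on the literal '|' character.
  static String[] splitOnPipe(String s) {
    return s.split("\\|");
  }

  public static void main(String[] args) {
    String s = "a|b|c";
    // Unescaped, "|" is an alternation of two empty patterns, which matches
    // the empty string, so the input is cut between characters rather than at the pipes.
    System.out.println(Arrays.toString(s.split("|")));
    // prints [a, b, c]
    System.out.println(Arrays.toString(splitOnPipe(s)));
    // Pattern.quote() is an alternative to escaping by hand; prints [a, b, c]
    System.out.println(Arrays.toString(s.split(Pattern.quote("|"))));
  }
}
```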

6. How can we match a^n b^n with Java regex?

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's, like ab, aabb, and aaabbb. This language is generated by the context-free grammar S → aSb | ab, and it can be shown (for example with the pumping lemma) to be non-regular.

However, Java regex implementations can recognize more than just regular languages. That is, they are not “regular” by the formal language theory definition. Lookahead combined with self-referencing groups makes this possible. Here I will give the final regular expression first, then explain it a little bit. For a comprehensive explanation, see How can we match a^n b^n with Java regex.

Pattern p = Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");
// true
System.out.println(p.matcher("aaabbb").matches());
// false
System.out.println(p.matcher("aaaabbb").matches());
// false
System.out.println(p.matcher("aaabbbb").matches());
// false
System.out.println(p.matcher("caaabbb").matches());

Instead of explaining the syntax of this complex regular expression, I would rather say a little about how it works.

  1. In the first iteration, it stops at the first a, then looks ahead (skipping the remaining a's with a*) to check whether there is a b. This is achieved by (?:a(?= a*(\\1?+b))). If the lookahead succeeds, \1, the self-referencing group, captures the contents of the innermost parentheses, which is a single b in the first iteration.
  2. In the second iteration, the expression stops at the second a, then looks ahead (again skipping a's) to see whether b's follow. But this time \\1?+b is equivalent to bb, so two b's have to be matched. If they are, \1 becomes bb after the second iteration.
  3. In the nth iteration, the expression stops at the nth a and checks whether there are n b's ahead.

This way, the expression counts the a's, and the trailing \\1 then matches only if the number of b's that follow the a's is the same.
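
If you want to convince yourself that the expression behaves as described, you can compare it against a plain counting check on many small inputs. This is only a sketch: the class AnBnCheck and its methods are hypothetical helpers I wrote for the comparison, and the range 0..6 is arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
class AnBnCheck {
  // The same expression as above, compiled once.
  private static final Pattern AN_BN =
      Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");

  static boolean matchesByRegex(String s) {
    return AN_BN.matcher(s).matches();
  }

  // Reference check without regex: first half all 'a', second half all 'b'.
  static boolean matchesByCounting(String s) {
    int len = s.length();
    if (len == 0 || len % 2 != 0) {
      return false;
    }
    for (int i = 0; i < len; i++) {
      char expected = i < len / 2 ? 'a' : 'b';
      if (s.charAt(i) != expected) {
        return false;
      }
    }
    return true;
  }

  // Builds a string of i a's followed by j b's.
  static String build(int i, int j) {
    StringBuilder sb = new StringBuilder();
    for (int k = 0; k < i; k++) sb.append('a');
    for (int k = 0; k < j; k++) sb.append('b');
    return sb.toString();
  }

  public static void main(String[] args) {
    int mismatches = 0;
    for (int i = 0; i <= 6; i++) {
      for (int j = 0; j <= 6; j++) {
        String s = build(i, j);
        if (matchesByRegex(s) != matchesByCounting(s)) {
          mismatches++;
        }
      }
    }
    System.out.println("mismatches: " + mismatches);
  }
}
```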

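7. How to replace 2 or more spaces with a single space in a string and delete leading spaces only?

String.replaceAll() replaces each substring that matches the given regular expression with the given replacement. “2 or more spaces” can be expressed by the regular expression [ ]{2,}. The code below uses the broader [\\s]+, which matches any run of one or more whitespace characters (spaces, tabs, newlines), so a single space is simply replaced by a single space. Note that this does not remove leading or trailing whitespace: each run is collapsed to one space, as the output shows. If you would like to have both deleted, call String.trim() on the result; to delete only the leading whitespace, apply replaceAll("^\\s+", "") first.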

String line = "  aa bbbbb   ccc     d  ";
// " aa bbbbb ccc d "
System.out.println(line.replaceAll("[\\s]+", " "));
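
The variants mentioned in the question can be sketched as follows. The class name CollapseSpaces and both method names are my own, and the patterns target plain spaces only, which is an assumption about what the question means by "spaces".

```java
// Hypothetical demo class, named for illustration only.
class CollapseSpaces {
  // Replaces every run of two or more spaces with a single space.
  static String collapse(String s) {
    return s.replaceAll(" {2,}", " ");
  }

  // Removes leading spaces only, then collapses the remaining runs.
  static String collapseAndStripLeading(String s) {
    return s.replaceAll("^ +", "").replaceAll(" {2,}", " ");
  }

  public static void main(String[] args) {
    String line = "  aa bbbbb   ccc     d  ";
    // prints [ aa bbbbb ccc d ]
    System.out.println("[" + collapse(line) + "]");
    // prints [aa bbbbb ccc d ]
    System.out.println("[" + collapseAndStripLeading(line) + "]");
    // prints [aa bbbbb ccc d]
    System.out.println("[" + collapse(line).trim() + "]");
  }
}
```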

8. How to determine if a number is a prime with regex?

public static void main(String[] args) {
  // false
  System.out.println(prime(1));
  // true
  System.out.println(prime(2));
  // true
  System.out.println(prime(3));
  // true
  System.out.println(prime(5));
  // false
  System.out.println(prime(8));
  // true
  System.out.println(prime(13));
  // false
  System.out.println(prime(14));
  // false
  System.out.println(prime(15));
}
 
public static boolean prime(int n) {
  return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

The function first builds a string of n characters and checks whether that string matches .?|(..+?)\\1+. If n is prime, the match fails, and the ! reverses the result to true.

The first part, .?, matches strings of length 0 or 1, which makes sure that 0 and 1 are not reported as prime. The magic is in the second part, where a backreference is used: (..+?) first matches a block of two or more characters, and \\1+ then requires that same block to be repeated one or more times.

By definition, a prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. That means if a = k*m with k ≥ 2 and m ≥ 2, then a is not a prime. k*m can be read as “repeat a block of k characters m times”, and that is exactly what the regular expression does: it matches a block of k characters with (..+?), then repeats it with \\1+ until the block appears m times in total. Therefore, if the pattern matches, the number is not prime; otherwise it is. Remember that ! reverses the result.
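
As a sanity check, the regex version can be compared with ordinary trial division. This is a sketch under my own naming: RegexPrimeCheck, primeByRegex, and primeByTrialDivision are hypothetical, and the upper bound of 100 is arbitrary (the regex approach gets slow for large n because of backtracking).

```java
// Hypothetical demo class, named for illustration only.
class RegexPrimeCheck {
  // Same expression as above: a string of n characters matches when n is 0, 1, or composite.
  static boolean primeByRegex(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
  }

  // Straightforward reference implementation.
  static boolean primeByTrialDivision(int n) {
    if (n < 2) {
      return false;
    }
    for (int d = 2; (long) d * d <= n; d++) {
      if (n % d == 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int mismatches = 0;
    for (int n = 0; n <= 100; n++) {
      if (primeByRegex(n) != primeByTrialDivision(n)) {
        mismatches++;
      }
    }
    System.out.println("mismatches: " + mismatches);
  }
}
```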

9. How to split a comma-separated string while ignoring commas in quotes?

You have reached the point where regular expressions break down. It is better and neater to write a simple splitter that handles special cases as you wish.

Alternatively, you can mimic the operation of a finite state machine by using a switch statement or if-else. Below is a snippet of code.

public static void main(String[] args) {
  String line = "aaa,bbb,\"c,c\",dd;dd,\"e,e";
  List<String> toks = splitComma(line);
  for (String t : toks) {
    System.out.println("> " + t);
  }
}
 
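// Splits str on commas that are not inside double quotes.
// Quote characters are kept in the tokens; a trailing empty token is dropped.
private static List<String> splitComma(String str) {
  int start = 0; // index where the current token begins
  List<String> toks = new ArrayList<String>();
  boolean withinQuote = false; // true while between an opening and a closing quote
  for (int end = 0; end < str.length(); end++) {
    char c = str.charAt(end);
    switch(c) {
    case ',':
      // A comma ends the current token only when it is outside quotes.
      if (!withinQuote) {
        toks.add(str.substring(start, end));
        start = end + 1;
      }
      break;
    case '\"':
      // Every quote toggles the state; an unbalanced quote keeps the rest in one token.
      withinQuote = !withinQuote;
      break;
    }
  }
  // Add whatever remains after the last separating comma.
  if (start < str.length()) {
    toks.add(str.substring(start));
  }
  return toks;
}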

10. How to use backreferences in Java Regular Expressions

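Backreferences are another useful feature of Java regular expressions. A backreference such as \1 matches the same text that was captured by the corresponding group, which is what questions 6 and 8 above rely on.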
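
As a small illustration, here is a sketch that uses a backreference to find and remove doubled words. The class BackreferenceExample, its constant, and its method names are hypothetical and chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
class BackreferenceExample {
  // Group 1 captures a word; \1 requires the same word again after whitespace.
  private static final Pattern REPEATED_WORD = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

  // Returns the first word that is immediately repeated, or null if there is none.
  static String firstRepeatedWord(String text) {
    Matcher m = REPEATED_WORD.matcher(text);
    return m.find() ? m.group(1) : null;
  }

  // Replaces each doubled word with a single copy; $1 in the replacement refers to group 1.
  static String removeRepeats(String text) {
    return REPEATED_WORD.matcher(text).replaceAll("$1");
  }

  public static void main(String[] args) {
    String text = "Paris in the the spring";
    // prints the
    System.out.println(firstRepeatedWord(text));
    // prints Paris in the spring
    System.out.println(removeRepeats(text));
  }
}
```

Inside the pattern the reference is written \\1 (a Java string for \1), while in the replacement string it is written $1.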
