How to determine if a string is English or Java code?

Consider the following two strings:
1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessarily a complete sentence).

The first one is Java code, the second one is English. How can we detect that the first is code and the second is English?

The Java code may not be parsable, because it is not necessarily a complete method, statement, or expression, so we cannot simply run it through a Java parser. The following provides a solution for this problem. Because there is not always a clear line between code and English, the accuracy cannot be 100%. However, the solution below is easy to tune to fit your needs. You can download the code from GitHub.

The basic idea is to convert the string into a sequence of tokens. For example, the code line above may become "KEY,SEPARATOR,KEY,ID,ASSIGN,NUMBER,SEPARATOR,...". We can then use simple rules to separate code from English.
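
To make the idea concrete, here is a minimal, self-contained sketch that prints token names for the two example strings. It is not the tokenizer used in the rest of this post: the class name `TokenNamesSketch`, the method `tokenNames`, the shortened keyword list, and the `UNKNOWN` fallback are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenNamesSketch {
    // Rules are tried in insertion order; the first one that matches wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Shortened keyword list (assumption); \b stops "do" from matching "document".
        RULES.put("KEY", Pattern.compile("(?:for|int|if|while|return|do)\\b"));
        RULES.put("ASSIGN", Pattern.compile("=(?!=)"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("ID", Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*"));
        RULES.put("SEPARATOR", Pattern.compile("[(){};.,<>+\\-]"));
    }

    public static List<String> tokenNames(String line) {
        List<String> names = new ArrayList<>();
        String rest = line.trim();
        while (!rest.isEmpty()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(rest);
                if (m.lookingAt()) { // match only at the start of the remaining input
                    names.add(rule.getKey());
                    rest = rest.substring(m.end()).trim();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Illustrative fallback: label the character and move on.
                names.add("UNKNOWN");
                rest = rest.substring(1).trim();
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", tokenNames("for (int i = 0; i < b.size(); i++) {")));
        System.out.println(String.join(",", tokenNames("do something in English")));
    }
}
```

With these rules the first line starts with KEY,SEPARATOR,KEY,ID,ASSIGN,NUMBER,SEPARATOR, while the English line becomes KEY,ID,ID,ID. The long run of ID tokens is what the rules later in this post look for.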

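A tokenizer class converts a string into a list of tokens. Rules are tried in the order they were added, and the first rule that matches the start of the remaining input wins.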

package lexical;
 
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class Tokenizer {
	private class TokenInfo {
		public final Pattern regex;
		public final int token;
 
		public TokenInfo(Pattern regex, int token) {
			super();
			this.regex = regex;
			this.token = token;
		}
	}
 
	public class Token {
		public final int token;
		public final String sequence;
 
		public Token(int token, String sequence) {
			super();
			this.token = token;
			this.sequence = sequence;
		}
 
	}
 
	private LinkedList<TokenInfo> tokenInfos;
	private LinkedList<Token> tokens;
 
	public Tokenizer() {
		tokenInfos = new LinkedList<TokenInfo>();
		tokens = new LinkedList<Token>();
	}
 
	public void add(String regex, int token) {
		tokenInfos
				.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
	}
 
	public void tokenize(String str) {
		String s = str.trim();
		tokens.clear();
		while (!s.equals("")) {
			//System.out.println(s);
			boolean match = false;
			for (TokenInfo info : tokenInfos) {
				Matcher m = info.regex.matcher(s);
				if (m.find()) {
					match = true;
					String tok = m.group().trim();
					s = m.replaceFirst("").trim();
					tokens.add(new Token(info.token, tok));
					break;
				}
			}
			if (!match){
				//throw new ParserException("Unexpected character in input: " + s);
				tokens.clear();
				System.out.println("Unexpected character in input: " + s);
				return;
			}
 
		}
	}
 
	public LinkedList<Token> getTokens() {
		return tokens;
	}
 
	public String getTokensString() {
		StringBuilder sb = new StringBuilder();
		for (Tokenizer.Token tok : tokens) {
			sb.append(tok.token);
		}
 
		return sb.toString();
	}
}
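
One thing to keep in mind when registering rules: `add` anchors each regex at the start of the remaining input, so a rule made of plain words can also match the beginning of a longer word. The sketch below is a hedged illustration of that effect and of how a trailing `\b` (word boundary) avoids it. The class `KeywordBoundarySketch` and the helper `matchedPrefix` are hypothetical names, not part of the tokenizer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordBoundarySketch {
    // Returns the text matched at the start of input, or "" if nothing matches.
    public static String matchedPrefix(String regex, String input) {
        // Same anchoring that Tokenizer.add applies to every rule.
        Matcher m = Pattern.compile("^(" + regex + ")").matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Without a boundary, "do" matches the first two letters of "document".
        System.out.println("[" + matchedPrefix("do|double", "document the API") + "]");      // [do]
        // With \b the keyword must end at a word boundary, so nothing matches here.
        System.out.println("[" + matchedPrefix("(do|double)\\b", "document the API") + "]"); // []
        // Whole keywords still match, including the longer alternative.
        System.out.println("[" + matchedPrefix("(do|double)\\b", "double x") + "]");         // [double]
    }
}
```

Whether this matters depends on the rules that follow: splitting an English word into a keyword plus an identifier changes the token string, which can shift borderline inputs from one class to the other.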

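Next we register rules for Java keywords, separators, operators, numbers, and identifiers, and assign each kind a numeric code: 1 for keywords, 2 for separators and operators, 3 for numbers, and 4 for identifiers. Any input string, whether English or code, can then be converted to a string of these codes. English tends to produce long runs of 4s, so the classifier below treats a string as English if its code string contains three consecutive identifiers or consists only of identifiers.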

package lexical;
 
import java.io.IOException;
import java.sql.SQLException;
 
import org.apache.commons.lang.StringUtils;
 
public class EnglishOrCode {
 
	private static Tokenizer tokenizer = null;
 
	public static void initializeTokenizer() {
		tokenizer = new Tokenizer();
 
		//key words
		String keyString = "abstract assert boolean break byte case catch "
				+ "char class const continue default do double else enum"
				+ " extends false final finally float for goto if implements "
				+ "import instanceof int interface long native new null "
				+ "package private protected public return short static "
				+ "strictfp super switch synchronized this throw throws true "
				+ "transient try void volatile while todo";
		String[] keys = keyString.split(" ");
		String keyStr = StringUtils.join(keys, "|");
 
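		// \b requires a word boundary after the keyword, so "do" does not
		// match the first two letters of "document" or "double"
		tokenizer.add("(" + keyStr + ")\\b", 1);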
		tokenizer.add("\\(|\\)|\\{|\\}|\\[|\\]|;|,|\\.|=|>|<|!|~|"
						+ "\\?|:|==|<=|>=|!=|&&|\\|\\||\\+\\+|--|"
						+ "\\+|-|\\*|/|&|\\||\\^|%|\'|\"|\n|\r|\\$|\\#",
						2);//separators, operators, etc
 
		tokenizer.add("[0-9]+", 3); //number
		tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 4);//identifier
		tokenizer.add("@", 4);
	}
 
	public static void main(String[] args) throws SQLException, ClassNotFoundException, IOException {
		initializeTokenizer();
		String s = "do something in English";
		if(isEnglish(s)){
			System.out.println("English");
		}else{
			System.out.println("Java Code");
		}
 
		s = "for (int i = 0; i < b.size(); i++) {";
		if(isEnglish(s)){
			System.out.println("English");
		}else{
			System.out.println("Java Code");
		}
 
	}
 
	private static boolean isEnglish(String replaced) {
		tokenizer.tokenize(replaced);
		String patternString = tokenizer.getTokensString();
 
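		// English if there are three identifiers in a row, or only identifiers.
		// An empty pattern (the tokenizer gave up) is classified as code.
		return patternString.matches(".*444.*") || patternString.matches("4+");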
	}
}

Output:

English
Java Code
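
The rule inside `isEnglish` is the natural place to tune the classifier. As one possible direction, the sketch below replaces the two regex checks with two adjustable thresholds. It works directly on the token string (for example "1444") and assumes single-digit token codes as used in this post. The class `TunableRuleSketch`, its method names, and the threshold parameters are assumptions for illustration, not part of the downloadable code.

```java
public class TunableRuleSketch {
    // Longest run of consecutive identifier tokens ('4') in the token string.
    public static int longestIdentifierRun(String pattern) {
        int best = 0;
        int current = 0;
        for (char c : pattern.toCharArray()) {
            current = (c == '4') ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    // Fraction of tokens that are separators or operators ('2').
    public static double separatorRatio(String pattern) {
        if (pattern.isEmpty()) {
            return 0.0;
        }
        long separators = pattern.chars().filter(c -> c == '2').count();
        return (double) separators / pattern.length();
    }

    // minRun and maxSeparatorRatio are hypothetical tuning knobs.
    public static boolean looksLikeEnglish(String pattern, int minRun, double maxSeparatorRatio) {
        if (pattern.isEmpty()) {
            return false; // nothing was tokenized; treat as not English
        }
        int run = longestIdentifierRun(pattern);
        boolean onlyIdentifiers = run == pattern.length();
        return onlyIdentifiers
                || (run >= minRun && separatorRatio(pattern) <= maxSeparatorRatio);
    }

    public static void main(String[] args) {
        // Token strings for the two examples in this post.
        System.out.println(looksLikeEnglish("1444", 3, 0.3));
        System.out.println(looksLikeEnglish("1214232424242224222", 3, 0.3));
    }
}
```

With `minRun` set to 3 and `maxSeparatorRatio` set to 1.0, this reduces to the same decision as the two regex checks in `isEnglish`. Lowering the ratio makes the rule stricter for strings that mix words with a lot of punctuation.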

References:
1. Write a Parser in Java: The Tokenizer
