Java Code Examples for org.htmlparser.beans.StringBean

The following examples show how to use org.htmlparser.beans.StringBean. They are extracted from open source projects; the source project, source file, and license are listed above each example.
Example 1
Source Project: onboard   Source File: HtmlTextParser.java    License: Apache License 2.0
public static String getPlainText(String htmlStr) {
    Parser parser = new Parser();
    String plainText = "";
    try {
        parser.setInputHTML(htmlStr);

        StringBean stringBean = new StringBean();
        // Do not include the links contained in the page
        stringBean.setLinks(false);
        // Replace non-breaking spaces with regular spaces
        stringBean.setReplaceNonBreakingSpaces(true);
        // Collapse each run of whitespace into a single space
        stringBean.setCollapse(true);

        parser.visitAllNodesWith(stringBean);
        plainText = stringBean.getStrings();

    } catch (ParserException e) {
        e.printStackTrace();
    }

    return plainText;
}
 
Example 2
Source Project: OpenEphyra   Source File: HTMLConverter.java    License: GNU General Public License v2.0
/**
 * Converts an HTML document into plain text.
 * 
 * @param html HTML document
 * @return plain text or <code>null</code> if the conversion failed
 */
public static synchronized String html2text(String html) {
	// convert HTML document
	StringBean sb = new StringBean();
	sb.setLinks(false);  // no links
	sb.setReplaceNonBreakingSpaces(true);  // replace non-breaking spaces with regular spaces
	sb.setCollapse(true);  // collapse each run of whitespace into a single space
	Parser parser = new Parser();
	try {
		parser.setInputHTML(html);
		parser.visitAllNodesWith(sb);
	} catch (ParserException e) {
		return null;
	}
	String docText = sb.getStrings();
	
	if (docText == null) docText = "";  // no content
	
	return docText;
}
 
Example 3
Source Project: OpenEphyra   Source File: HTMLConverter.java    License: GNU General Public License v2.0
/**
 * Reads an HTML document from a file and converts it into plain text.
 * 
 * @param filename name of the file containing the HTML document
 * @return plain text or <code>null</code> if the reading or conversion failed
 */
public static synchronized String file2text(String filename) {
	// read from file and convert HTML document
	StringBean sb = new StringBean();
	sb.setLinks(false);  // no links
	sb.setReplaceNonBreakingSpaces(true);  // replace non-breaking spaces with regular spaces
	sb.setCollapse(true);  // collapse each run of whitespace into a single space
	Parser parser = new Parser();
	try {
		parser.setResource(filename);
		parser.visitAllNodesWith(sb);
	} catch (ParserException e) {
		return null;
	}
	String docText = sb.getStrings();
	
	return docText;
}
 
Example 4
Source Project: OpenEphyra   Source File: HTMLConverter.java    License: GNU General Public License v2.0
/**
 * Fetches an HTML document from a URL and converts it into plain text.
 * 
 * @param url URL of HTML document
 * @return plain text or <code>null</code> if the fetching or conversion failed
 */
public static synchronized String url2text(String url) throws SocketTimeoutException {
	// connect to URL
	URLConnection conn = null;
	try {
		conn = (new URL(url)).openConnection();
		if (!(conn instanceof HttpURLConnection)) return null;  // only allow HTTP connections
	} catch (IOException e) {
		return null;
	}
	conn.setRequestProperty("User-agent","Mozilla/4.0");  // pretend to be a browser
	// TIMEOUT is not defined in this excerpt; both setters take a value in milliseconds
	conn.setConnectTimeout(TIMEOUT);
	conn.setReadTimeout(TIMEOUT);
	
	// fetch URL and convert HTML document
	StringBean sb = new StringBean();
	sb.setLinks(false);  // no links
	sb.setReplaceNonBreakingSpaces(true); // replace non-breaking spaces
	sb.setCollapse(true);  // replace sequences of whitespaces
	sb.setConnection(conn);
	String docText = sb.getStrings();
	
	return docText;
}
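
All four examples set the same options before extracting text. As a rough illustration of what the two whitespace options do to the extracted string, here is a sketch that uses only the Java standard library. The class WhitespaceSketch and the method normalize are hypothetical names, not part of org.htmlparser, and the code only approximates the effect described in the comments above; it is not StringBean's implementation.

```java
class WhitespaceSketch {
    /**
     * Hypothetical helper (not part of org.htmlparser): mimics the effect of
     * setReplaceNonBreakingSpaces and setCollapse on an already extracted string.
     */
    static String normalize(String text, boolean replaceNbsp, boolean collapse) {
        String result = text;
        if (replaceNbsp) {
            // Turn each non-breaking space (U+00A0) into a regular space
            result = result.replace('\u00a0', ' ');
        }
        if (collapse) {
            // Reduce each run of whitespace to a single space.
            // Note: \s does not match U+00A0 by default in java.util.regex.
            result = result.replaceAll("\\s+", " ");
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "Hello\u00a0\u00a0world \n\t again";
        System.out.println(normalize(raw, true, true));
    }
}
```

In this sketch the order matters: non-breaking spaces are converted first so that the collapse step can fold them together with ordinary whitespace.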