com.hankcs.hanlp.tokenizer.StandardTokenizer Java Examples

The following examples show how to use com.hankcs.hanlp.tokenizer.StandardTokenizer. Each example is taken from an open-source project; the source file and license are noted above it.
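Before the project examples, here is a minimal sketch of calling the tokenizer directly. The class name StandardTokenizerDemo and the sample sentence are illustrative; segment() and the Term fields are the standard HanLP API.

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

import java.util.List;

public class StandardTokenizerDemo {
    public static void main(String[] args) {
        // segment() returns one Term per token; each Term carries the word and its part of speech
        List<Term> terms = StandardTokenizer.segment("商品和服务");
        for (Term term : terms) {
            System.out.println(term.word + "\t" + term.nature);
        }
    }
}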
Example #1
Source File: Segment.java    From AHANLP with Apache License 2.0
/**
 * Split text into sentences, then tokenize each sentence
 * @param segType tokenizer type ("Standard" or "NLP")
 * @param shortest whether to split into the finest-grained clauses (treating commas and semicolons as delimiters as well)
 * @param content the text to process
 * @param filterStopWord whether to filter out stop words
 * @return a list of sentences, each represented as a list of terms
 */
public static List<List<Term>> seg2sentence(String segType, boolean shortest, String content, boolean filterStopWord) {
    List<List<Term>> results = null;
    if ("Standard".equals(segType) || "标准分词".equals(segType)) {
        results = StandardTokenizer.seg2sentence(content, shortest);
    } else if ("NLP".equals(segType) || "NLP分词".equals(segType)) {
        results = NLPTokenizer.seg2sentence(content, shortest);
    } else {
        // the message reads "illegal argument segType == %s"
        throw new IllegalArgumentException(String.format("非法参数 segType == %s", segType));
    }
    if (filterStopWord)
        for (List<Term> res : results)
            CoreStopWordDictionary.apply(res);
    return results;
}
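A hypothetical call site for this helper, assuming the enclosing Segment class from AHANLP; the sample sentence is arbitrary.

// Hypothetical usage: fine-grained clause splitting, standard segmentation, stop words removed.
List<List<Term>> sentences = Segment.seg2sentence("Standard", true, "今天天气很好,我们去公园散步。", true);
for (List<Term> sentence : sentences) {
    System.out.println(sentence);
}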
 
Example #2
Source File: HanLPTokenizerFactory.java    From elasticsearch-analysis-hanlp with Apache License 2.0
public static HanLPTokenizerFactory createStandard(IndexSettings indexSettings,
                                                   Environment environment,
                                                   String name,
                                                   Settings settings) {
    return new HanLPTokenizerFactory(indexSettings, environment, name, settings) {
        @Override
        public Tokenizer create() {
            return new HanLPTokenizer(StandardTokenizer.SEGMENT, defaultStopWordDictionary, enablePorterStemming);
        }
    };
}
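For context, a factory like this is typically wired into Elasticsearch through AnalysisPlugin.getTokenizers(). The sketch below assumes the standard plugin pattern; the plugin class name and the "hanlp_standard" key are illustrative, not taken from the original project.

import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

// Sketch: registering the factory in an Elasticsearch analysis plugin.
// createStandard's parameter list matches AnalysisProvider<TokenizerFactory>#get,
// so it can be registered as a method reference.
public class HanLPAnalysisPluginSketch extends Plugin implements AnalysisPlugin {
    @Override
    public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> tokenizers = new HashMap<>();
        tokenizers.put("hanlp_standard", HanLPTokenizerFactory::createStandard); // illustrative name
        return tokenizers;
    }
}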
 
Example #3
Source File: SimHash.java    From templatespider with Apache License 2.0
/**
 * Compute the SimHash fingerprint of the whole string
 * @return the fingerprint as a BigInteger
 */
private BigInteger simHash() {

    tokens = cleanResume(tokens); // cleanResume strips some special characters

    int[] v = new int[this.hashbits];

    List<Term> termList = StandardTokenizer.segment(this.tokens); // tokenize the string

    // Special handling of the tokens: e.g. weight by part of speech,
    // filter out punctuation, filter over-frequent words, and so on.
    Map<String, Integer> weightOfNature = new HashMap<String, Integer>(); // part-of-speech weights
    weightOfNature.put("n", 2); // nouns get a weight of 2
    Map<String, String> stopNatures = new HashMap<String, String>(); // parts of speech to drop, such as punctuation
    stopNatures.put("w", ""); // "w" is HanLP's punctuation tag
    int overCount = 5; // threshold for over-frequent words
    Map<String, Integer> wordCount = new HashMap<String, Integer>();

    for (Term term : termList) {
        String word = term.word; // the token string

        String nature = term.nature.toString(); // the token's part of speech
        // filter over-frequent words
        if (wordCount.containsKey(word)) {
            int count = wordCount.get(word);
            if (count > overCount) {
                continue;
            }
            wordCount.put(word, count + 1);
        } else {
            wordCount.put(word, 1);
        }

        // filter stop parts of speech
        if (stopNatures.containsKey(nature)) {
            continue;
        }

        // 2. Hash each token into a fixed-length bit string, e.g. a 64-bit integer.
        BigInteger t = this.hash(word);
        for (int i = 0; i < this.hashbits; i++) {
            BigInteger bitmask = new BigInteger("1").shiftLeft(i);
            // 3. Keep an integer array of length 64 (assuming a 64-bit fingerprint; other sizes work too).
            // For each token's hash, e.g. 1000...1, add 1 to the first and last positions and subtract 1
            // from the 62 positions in between; that is, add 1 where a bit is 1 and subtract 1 where it is 0,
            // until every token's hash has been processed.
            int weight = 1; // apply the part-of-speech weight
            if (weightOfNature.containsKey(nature)) {
                weight = weightOfNature.get(nature);
            }
            if (t.and(bitmask).signum() != 0) {
                // accumulate the weighted feature vector for the whole document
                v[i] += weight;
            } else {
                v[i] -= weight;
            }
        }
    }
    BigInteger fingerprint = new BigInteger("0");
    for (int i = 0; i < this.hashbits; i++) {
        if (v[i] >= 0) {
            fingerprint = fingerprint.add(new BigInteger("1").shiftLeft(i));
        }
    }
    return fingerprint;
}
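Fingerprints produced this way are compared by Hamming distance. A minimal companion method, not part of the original class, might look like this:

// Hamming distance: the number of bit positions where the two fingerprints differ.
// For 64-bit SimHash, a distance of 3 or less is a common near-duplicate threshold.
private int hammingDistance(BigInteger fp1, BigInteger fp2) {
    return fp1.xor(fp2).bitCount();
}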
 
Example #4
Source File: Segment.java    From AHANLP with Apache License 2.0
/**
 * Standard segmentation<br>
 * HMM-Bigram<br>
 * Shortest-path segmentation; the shortest path is solved with the Viterbi algorithm
 * @param content the text to process
 * @param filterStopWord whether to filter out stop words
 * @return the segmentation result
 */
public static List<Term> StandardSegment(String content, boolean filterStopWord) {
    List<Term> result = StandardTokenizer.segment(content);
    if (filterStopWord)
        CoreStopWordDictionary.apply(result);
    return result;
}
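A hypothetical call, with stop-word filtering enabled; the input sentence is arbitrary.

// Hypothetical usage of the helper above.
List<Term> terms = Segment.StandardSegment("今天天气很好", true);
System.out.println(terms);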
 
Example #5
Source File: DKNLPBase.java    From dk-fitting with Apache License 2.0
/**
 * Standard segmentation
 *
 * @param txt the sentence to segment
 * @return the list of terms
 */
public static List<Term> segment(String txt)
{
    if (txt == null) return Collections.emptyList();
    return StandardTokenizer.segment(txt.toCharArray());
}
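A usage sketch for this wrapper; the null guard lets callers skip their own checks, and the sample sentence is illustrative.

// Hypothetical usage of the wrapper above.
List<Term> empty = DKNLPBase.segment(null); // returns Collections.emptyList(), never throws
List<Term> terms = DKNLPBase.segment("HanLP是由一系列模型与算法组成的自然语言处理工具包");
System.out.println(terms);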