Java Code Examples for com.hankcs.hanlp.tokenizer.StandardTokenizer#segment()

The following examples show how to use com.hankcs.hanlp.tokenizer.StandardTokenizer#segment(). Each example notes its source file, originating project, and license.
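Before walking through the examples, here is a minimal sketch of the call itself. It assumes HanLP is on the classpath; the sample sentence and the class name SegmentDemo are illustrative only.

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

import java.util.List;

public class SegmentDemo {
    public static void main(String[] args) {
        // segment() splits the text into terms; each Term carries the token
        // text (word) and its part-of-speech tag (nature).
        List<Term> terms = StandardTokenizer.segment("商品和服务");
        for (Term term : terms) {
            System.out.println(term.word + "/" + term.nature);
        }
    }
}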
Example 1
Source File: SimHash.java    From templatespider with Apache License 2.0
/**
 * Computes the simhash fingerprint of the whole string.
 * @return the fingerprint as a BigInteger
 */
private BigInteger simHash() {

    tokens = cleanResume(tokens); // cleanResume strips special characters

    int[] v = new int[this.hashbits];

    List<Term> termList = StandardTokenizer.segment(this.tokens); // tokenize the string

    // Token post-processing: weight tokens by part of speech, drop
    // punctuation, and cap over-frequent words.
    Map<String, Integer> weightOfNature = new HashMap<String, Integer>(); // weight per part of speech
    weightOfNature.put("n", 2); // nouns get weight 2
    Map<String, String> stopNatures = new HashMap<String, String>(); // parts of speech to drop
    stopNatures.put("w", ""); // "w" is HanLP's punctuation tag
    int overCount = 5; // cap on how often a single word is counted
    Map<String, Integer> wordCount = new HashMap<String, Integer>();

    for (Term term : termList) {
        String word = term.word; // token text

        String nature = term.nature.toString(); // part-of-speech tag
        // Skip words that have already exceeded the frequency cap.
        if (wordCount.containsKey(word)) {
            int count = wordCount.get(word);
            if (count > overCount) {
                continue;
            }
            wordCount.put(word, count + 1);
        } else {
            wordCount.put(word, 1);
        }

        // Skip stop parts of speech (punctuation and the like).
        if (stopNatures.containsKey(nature)) {
            continue;
        }

        // Step 2: hash each token into a fixed-length bit string,
        // e.g. a 64-bit integer.
        BigInteger t = this.hash(word);
        int weight = 1; // default weight
        if (weightOfNature.containsKey(nature)) {
            weight = weightOfNature.get(nature);
        }
        // Step 3: over an integer array of length hashbits (e.g. 64 for a
        // 64-bit fingerprint), add the weight at position i when bit i of the
        // token's hash is 1 and subtract it when the bit is 0, until every
        // token's hash has been folded in.
        for (int i = 0; i < this.hashbits; i++) {
            BigInteger bitmask = BigInteger.ONE.shiftLeft(i);
            if (t.and(bitmask).signum() != 0) {
                // Accumulate the weighted feature vector for the whole document.
                v[i] += weight;
            } else {
                v[i] -= weight;
            }
        }
    }
    // Step 4: bits whose accumulated value is non-negative become 1 in the fingerprint.
    BigInteger fingerprint = BigInteger.ZERO;
    for (int i = 0; i < this.hashbits; i++) {
        if (v[i] >= 0) {
            fingerprint = fingerprint.add(BigInteger.ONE.shiftLeft(i));
        }
    }
    return fingerprint;
}
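Fingerprints built this way are compared by Hamming distance: the fewer bits differ, the more similar the two documents. Below is a minimal sketch of such a comparison; the helper is hypothetical and not part of the templatespider project.

import java.math.BigInteger;

public class SimHashCompare {
    // Hamming distance between two simhash fingerprints: the number of
    // differing bits. For 64-bit fingerprints, a distance of 3 or less is a
    // commonly used near-duplicate threshold.
    static int hammingDistance(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }
}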
 
Example 2
Source File: Segment.java    From AHANLP with Apache License 2.0
/**
 * Standard segmentation<br>
 * HMM-Bigram<br>
 * Shortest-path segmentation; the shortest path is solved with the Viterbi algorithm
 * @param content the text to segment
 * @param filterStopWord whether to filter out stop words
 * @return the segmentation result
 */
public static List<Term> StandardSegment(String content, boolean filterStopWord) {
    List<Term> result = StandardTokenizer.segment(content);
    if (filterStopWord)
        CoreStopWordDictionary.apply(result);
    return result;
}
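A hypothetical usage sketch, assuming the Segment class above is on the classpath; the sample sentence is arbitrary, and which terms are dropped depends on the contents of CoreStopWordDictionary.

// With filterStopWord = true, function words are removed from the result;
// with false, every term is kept.
List<Term> filtered = Segment.StandardSegment("商品和服务", true);
List<Term> full = Segment.StandardSegment("商品和服务", false);
System.out.println("filtered: " + filtered.size() + ", full: " + full.size());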
 
Example 3
Source File: DKNLPBase.java    From dk-fitting with Apache License 2.0
/**
 * Standard segmentation
 *
 * @param txt the sentence to segment
 * @return the list of tokens
 */
public static List<Term> segment(String txt)
{
    if (txt == null) return Collections.emptyList();
    return StandardTokenizer.segment(txt.toCharArray());
}
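Returning Collections.emptyList() for null input, rather than null or an exception, lets callers iterate the result unconditionally. Note that this variant passes txt.toCharArray(), i.e. it uses the char[] overload of StandardTokenizer.segment() rather than the String overload seen in the examples above.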