com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary Java Examples

The following examples show how to use com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary.
Example #1
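Source File: DKNLPBase.java    From dk-fitting with Apache License 2.0
/**
 * Clustering.
 *
 * @param documents the documents to cluster; the key is the document id, the value is the document content
 * @param size      the number of clusters to produce
 * @return the list of clusters; each cluster is a list of [document id]=[similarity] entries
 */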
public static List<List<Map.Entry<String, Double>>> cluster(Map<String, String> documents, int size)
{
    ClusterAnalyzer analyzer = new ClusterAnalyzer();
    analyzer.setTokenizer(new ITokenizer()
    {
        public String[] segment(final String text)
        {
            List<Term> termList = DKNLPBase.segment(text);
            ListIterator<Term> listIterator = termList.listIterator();
            while (listIterator.hasNext())
            {
                if (CoreStopWordDictionary.shouldRemove(listIterator.next()))
                {
                    listIterator.remove();
                }
            }
            String[] termArray = new String[termList.size()];
            int i = 0;
            for (Term term : termList)
            {
                termArray[i] = term.word;
                ++i;
            }
            return termArray;
        }
    });
    for (Map.Entry<String, String> entry : documents.entrySet())
    {
        analyzer.addDocument(entry.getKey(), entry.getValue());
    }
    return analyzer.clusters(size);
}
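
The tokenizer in Example #1 drops stop words with an explicit ListIterator loop and then copies the surviving words into an array. The same filter-then-collect pattern can be sketched without HanLP on the classpath. In the sketch below, the class `StopWordFilterSketch`, the `STOP_WORDS` set and its contents are hypothetical stand-ins for `CoreStopWordDictionary.shouldRemove`; they are not part of the library, and plain strings take the place of `Term` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilterSketch {
    // Hypothetical stop list; stands in for HanLP's stop-word dictionary lookup.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Assumed rule for this sketch: a word is removed if it is in the stop list.
    public static boolean shouldRemove(String word) {
        return STOP_WORDS.contains(word);
    }

    // Filters the words and returns the survivors as an array,
    // mirroring the shape of ITokenizer.segment in Example #1.
    public static String[] segment(List<String> words) {
        // Copy first so the caller's list is left untouched
        // and fixed-size lists (e.g. Arrays.asList) are accepted.
        List<String> kept = new ArrayList<>(words);
        kept.removeIf(StopWordFilterSketch::shouldRemove);
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] result = segment(Arrays.asList("the", "history", "of", "clustering"));
        System.out.println(Arrays.toString(result)); // prints [history, clustering]
    }
}
```

`Collection.removeIf` replaces the manual hasNext/next/remove loop, and `toArray(new String[0])` replaces the index-based copy; the behavior is otherwise the same as the loop in the example.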
 
Example #2
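Source File: Segment.java    From AHANLP with Apache License 2.0
  /**
   * Segments the text into words and splits it into sentences.
   * @param segType tokenizer type (Standard or NLP)
   * @param shortest whether to split into the finest clauses (commas and semicolons also count as separators)
   * @param content the text
   * @param filterStopWord whether to filter out stop words
   * @return list of sentences; each sentence is a list of terms
   */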
  public static List<List<Term>> seg2sentence(String segType, boolean shortest, String content, boolean filterStopWord) {
      List<List<Term>> results = null;
      if ("Standard".equals(segType) || "标准分词".equals(segType)) {
          results = StandardTokenizer.seg2sentence(content, shortest);
      } else if ("NLP".equals(segType) || "NLP分词".equals(segType)) {
          results = NLPTokenizer.seg2sentence(content, shortest);
      } else {
          throw new IllegalArgumentException(String.format("非法参数 segType == %s", segType));
      }
      if (filterStopWord)
          for (List<Term> res : results)
              CoreStopWordDictionary.apply(res);
      return results;
  }
 
Example #3
Source File: Segment.java    From AHANLP with Apache License 2.0
/**
 * Standard segmentation<br>
 * HMM-Bigram<br>
 * Shortest-path segmentation; the shortest path is solved with the Viterbi algorithm
 * @param content the text
 * @param filterStopWord whether to filter out stop words
 * @return the segmentation result
 */
public static List<Term> StandardSegment(String content, boolean filterStopWord) {
    List<Term> result = StandardTokenizer.segment(content);
    if (filterStopWord)
        CoreStopWordDictionary.apply(result);
    return result;
}
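
Examples #2 and #3 ignore the return value of `CoreStopWordDictionary.apply` and return the list they passed in, so they rely on the list being filtered in place. The sketch below shows one way such an in-place filter can be built from an include predicate and its negation. Everything in it is an assumption for illustration: the class `TermFilterSketch`, the `STOP_WORDS` contents, and the rule that a word is kept only if it is longer than one character and not in the stop list. It is not HanLP's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class TermFilterSketch {
    // Hypothetical stop list used only by this sketch.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "the"));

    // Assumed rule: keep words longer than one character that are not stop words.
    public static boolean shouldInclude(String word) {
        return word.length() > 1 && !STOP_WORDS.contains(word);
    }

    // Removal is defined as the exact negation of inclusion.
    public static boolean shouldRemove(String word) {
        return !shouldInclude(word);
    }

    // Filters the given list in place and returns the same instance,
    // so callers may ignore the return value.
    public static List<String> apply(List<String> words) {
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            if (shouldRemove(it.next())) {
                it.remove();
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The list must be mutable; Arrays.asList alone would reject remove().
        List<String> words = new ArrayList<>(Arrays.asList("stop", "and", "go", "x"));
        apply(words);
        System.out.println(words); // prints [stop, go]
    }
}
```

Because the filter mutates its argument, a caller that still needs the unfiltered terms has to copy the list before calling `apply`.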
 
Example #4
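Source File: Segment.java    From AHANLP with Apache License 2.0
/**
 * NLP segmentation<br>
 * Perceptron segmentation<br>
 * Performs part-of-speech tagging and named entity recognition, prioritizing accuracy
 * @param content the text
 * @param filterStopWord whether to filter out stop words
 * @return the segmentation result
 */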
public static List<Term> NLPSegment(String content, boolean filterStopWord) {
    List<Term> result = NLPTokenizer.segment(content);
    if (filterStopWord)
        CoreStopWordDictionary.apply(result);
    return result;
}
 
Example #5
Source File: TFIDF.java    From KeywordExtraction with MIT License
/**
 * Checks whether a word is a stop word.
 * @param term the word to check
 * @return false if the word is a stop word; true otherwise
 */
public static boolean shouldInclude(Term term)
{
    return CoreStopWordDictionary.shouldInclude(term);
}
 
Example #6
Source File: TextRank.java    From KeywordExtraction with MIT License
/**
 * Checks whether a word is a stop word.
 * @param term the word to check
 * @return false if the word is a stop word; true otherwise
 */
public static boolean shouldInclude(Term term)
{
    return CoreStopWordDictionary.shouldInclude(term);
}
 
Example #7
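Source File: TextRankKeyword.java    From TextRank with Apache License 2.0
/**
 * Whether this term should be included in the computation; its part of speech must be noun, verb, adverb, or adjective
 * @param term the term to check
 * @return whether the term should be included
 */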
public boolean shouldInclude(Term term)
{
    return CoreStopWordDictionary.shouldInclude(term);
}
 
Example #8
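Source File: TextRankSummary.java    From TextRank with Apache License 2.0
/**
 * Whether this term should be included in the computation; its part of speech must be noun, verb, adverb, or adjective
 * @param term the term to check
 * @return whether the term should be included
 */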
public static boolean shouldInclude(Term term)
{
    return CoreStopWordDictionary.shouldInclude(term);
}