Java Code Examples for org.apache.nutch.crawl.CrawlDatum#setFetchInterval()

The following examples show how to use org.apache.nutch.crawl.CrawlDatum#setFetchInterval(). They are taken from open-source projects; the source file, project, and license are noted above each example.
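
Before the project examples, here is a minimal, self-contained sketch of the call itself. It is illustrative only (the class name and URL are made up) and assumes the Nutch 1.x CrawlDatum API, where the fetch interval is stored in seconds.

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class SetFetchIntervalSketch {
  public static void main(String[] args) {
    Text url = new Text("http://example.com/");          // made-up URL
    CrawlDatum datum = new CrawlDatum();
    datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    datum.setFetchInterval(30 * 24 * 60 * 60);           // 30 days, in seconds
    datum.setFetchTime(System.currentTimeMillis());
    System.out.println(url + "\t" + datum);
  }
}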
Example 1
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * This method indicates whether the page is suitable for selection in
 * the current fetchlist. NOTE: a true return value does not guarantee
 * that the page will be fetched; it only allows it to be included in
 * the further selection process based on scores. The default
 * implementation checks <code>fetchTime</code>: if it is higher than
 * <code>curTime</code> it returns false, otherwise true. It also checks
 * that <code>fetchTime</code> is not too remote (more than <code>maxInterval</code>),
 * in which case it lowers the interval and returns true.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param curTime reference time (usually set to the time when the
 * fetchlist generation process was started).
 *
 * @return true, if the page should be considered for inclusion in the current
 * fetchlist, otherwise false.
 */
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false;                                   // not time yet
  }
  return true;
}
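
To see shouldFetch() behave as described, a small driver can initialize a datum and probe it against the current time. This is an illustrative sketch (class name and URL are made up), assuming the FetchScheduleFactory and FetchSchedule types from org.apache.nutch.crawl.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.FetchScheduleFactory;
import org.apache.nutch.util.NutchConfiguration;

public class ShouldFetchSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);

    Text url = new Text("http://example.com/");
    CrawlDatum datum = new CrawlDatum();
    schedule.initializeSchedule(url, datum);   // fetchTime = now, default interval

    // Just initialized: due immediately, so this prints true.
    System.out.println(schedule.shouldFetch(url, datum, System.currentTimeMillis()));

    // Push the next fetch one day into the future: not due yet, prints false.
    datum.setFetchTime(System.currentTimeMillis() + 24L * 3600 * 1000);
    System.out.println(schedule.shouldFetch(url, datum, System.currentTimeMillis()));
  }
}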
 
Example 2
Source File: AdaptiveFetchSchedule.java    From anthelion with Apache License 2.0
@Override
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime,
        long fetchTime, long modifiedTime, int state) {
  super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
      fetchTime, modifiedTime, state);

  float interval = datum.getFetchInterval();
  long refTime = fetchTime;

  if (datum.getMetaData().containsKey(Nutch.WRITABLE_FIXED_INTERVAL_KEY)) {
    // A fixed fetch interval may be preset in the CrawlDatum metadata; if so, use it.
    FloatWritable customIntervalWritable =
        (FloatWritable) datum.getMetaData().get(Nutch.WRITABLE_FIXED_INTERVAL_KEY);
    interval = customIntervalWritable.get();
  } else {
    if (modifiedTime <= 0) modifiedTime = fetchTime;
    switch (state) {
      case FetchSchedule.STATUS_MODIFIED:
        interval *= (1.0f - DEC_RATE);
        break;
      case FetchSchedule.STATUS_NOTMODIFIED:
        interval *= (1.0f + INC_RATE);
        break;
      case FetchSchedule.STATUS_UNKNOWN:
        break;
    }
    if (SYNC_DELTA) {
      // try to synchronize with the time of change
      long delta = (fetchTime - modifiedTime) / 1000L;
      if (delta > interval) interval = delta;
      refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    }
    if (interval < MIN_INTERVAL) {
      interval = MIN_INTERVAL;
    } else if (interval > MAX_INTERVAL) {
      interval = MAX_INTERVAL;
    }
  }

  datum.setFetchInterval(interval);
  datum.setFetchTime(refTime + Math.round(interval * 1000.0));
  datum.setModifiedTime(modifiedTime);
  return datum;
}
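
To make the adaptive step concrete, the sketch below replays the same update rules with explicit numbers. The rate and bound constants here are assumed values for illustration; in AdaptiveFetchSchedule they come from the db.fetch.schedule.adaptive.* configuration properties.

public class AdaptiveIntervalSketch {
  // Assumed values for illustration only.
  static final float INC_RATE = 0.4f;
  static final float DEC_RATE = 0.2f;
  static final float MIN_INTERVAL = 60.0f;                // seconds
  static final float MAX_INTERVAL = 365.0f * 24 * 3600;   // seconds

  public static void main(String[] args) {
    float interval = 86400f;                              // start at one day

    // Page unchanged since the last fetch: back off.
    interval *= (1.0f + INC_RATE);                        // 86400 -> 120960 s (1.4 days)

    // Page changed on a later fetch: tighten again.
    interval *= (1.0f - DEC_RATE);                        // 120960 -> 96768 s (~1.12 days)

    // Clamp to the configured bounds, as setFetchSchedule() does.
    interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
    System.out.println("next interval (s): " + interval);
  }
}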
 
Example 3
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * This method indicates whether the page is suitable for selection in
 * the current fetchlist. NOTE: a true return value does not guarantee
 * that the page will be fetched; it only allows it to be included in
 * the further selection process based on scores. The default
 * implementation checks <code>fetchTime</code>: if it is higher than
 * <code>curTime</code> it returns false, otherwise true. It also checks
 * that <code>fetchTime</code> is not too remote (more than <code>maxInterval</code>),
 * in which case it lowers the interval and returns true.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param curTime reference time (usually set to the time when the
 * fetchlist generation process was started).
 *
 * @return true, if the page should be considered for inclusion in the current
 * fetchlist, otherwise false.
 */
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false;                                   // not time yet
  }
  return true;
}
 
Example 4
Source File: AdaptiveFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
@Override
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime,
        long fetchTime, long modifiedTime, int state) {
  super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
      fetchTime, modifiedTime, state);

  float interval = datum.getFetchInterval();
  long refTime = fetchTime;

  // https://issues.apache.org/jira/browse/NUTCH-1430
  interval = (interval == 0) ? defaultInterval : interval;

  if (datum.getMetaData().containsKey(Nutch.WRITABLE_FIXED_INTERVAL_KEY)) {
    // A fixed fetch interval may be preset in the CrawlDatum metadata; if so, use it.
    FloatWritable customIntervalWritable =
        (FloatWritable) datum.getMetaData().get(Nutch.WRITABLE_FIXED_INTERVAL_KEY);
    interval = customIntervalWritable.get();
  } else {
    if (modifiedTime <= 0) modifiedTime = fetchTime;
    switch (state) {
      case FetchSchedule.STATUS_MODIFIED:
        interval *= (1.0f - DEC_RATE);
        break;
      case FetchSchedule.STATUS_NOTMODIFIED:
        interval *= (1.0f + INC_RATE);
        break;
      case FetchSchedule.STATUS_UNKNOWN:
        break;
    }
    if (SYNC_DELTA) {
      // try to synchronize with the time of change
      long delta = (fetchTime - modifiedTime) / 1000L;
      if (delta > interval) interval = delta;
      refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    }
    if (interval < MIN_INTERVAL) {
      interval = MIN_INTERVAL;
    } else if (interval > MAX_INTERVAL) {
      interval = MAX_INTERVAL;
    }
  }

  datum.setFetchInterval(interval);
  datum.setFetchTime(refTime + Math.round(interval * 1000.0));
  datum.setModifiedTime(modifiedTime);
  return datum;
}
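
The metadata branch in both AdaptiveFetchSchedule variants means a per-URL fixed interval can be preset on the datum, in which case the adaptive adjustment is skipped. Below is a minimal sketch of presetting it, using the same key constant the code above reads (illustrative only; the class name is made up).

import org.apache.hadoop.io.FloatWritable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Nutch;

public class FixedIntervalSketch {
  public static void main(String[] args) {
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 30 * 24 * 3600);
    // Pin this datum to a fixed one-day interval; AdaptiveFetchSchedule
    // will read this value instead of adapting the interval.
    datum.getMetaData().put(Nutch.WRITABLE_FIXED_INTERVAL_KEY,
        new FloatWritable(86400f));
    System.out.println(datum);
  }
}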
 
Example 5
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * Initialize fetch schedule related data. Implementations should at least
 * set the <code>fetchTime</code> and <code>fetchInterval</code>. The default
 * implementation sets the <code>fetchTime</code> to now, using the
 * default <code>fetchInterval</code>.
 * 
 * @param url URL of the page.
 *
 * @param datum datum instance to be initialized (modified in place).
 */
public CrawlDatum initializeSchedule(Text url, CrawlDatum datum) {
  datum.setFetchTime(System.currentTimeMillis());
  datum.setFetchInterval(defaultInterval);
  datum.setRetriesSinceFetch(0);
  return datum;
}
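
The defaultInterval used here is read from configuration when the schedule is set up; db.fetch.interval.default is the standard Nutch property for it. A hedged sketch of overriding it before initializing a datum (class name and values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.FetchScheduleFactory;
import org.apache.nutch.util.NutchConfiguration;

public class InitScheduleSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // db.fetch.interval.default is in seconds; use a 7-day default here.
    conf.setInt("db.fetch.interval.default", 7 * 24 * 3600);

    FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);
    CrawlDatum datum = new CrawlDatum();
    schedule.initializeSchedule(new Text("http://example.com/"), datum);
    System.out.println("interval (s): " + datum.getFetchInterval());
  }
}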
 
Example 6
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * This method specifies how to schedule refetching of pages
 * marked as GONE. Default implementation increases fetchInterval by 50%,
 * and if it exceeds the <code>maxInterval</code> it calls
 * {@link #forceRefetch(Text, CrawlDatum, boolean)}.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance to be adjusted.
 *
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than the <code>datum</code>
 * passed in, but implementations should make sure that it contains at
 * least all information from the original <code>datum</code>.
 */
public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime, long fetchTime) {
  // no page is truly GONE ... just increase the interval by 50%
  // and try much later.
  datum.setFetchInterval(datum.getFetchInterval() * 1.5f);
  datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
  if (maxInterval < datum.getFetchInterval()) forceRefetch(url, datum, false);
  return datum;
}
 
Example 7
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * This method resets fetchTime, fetchInterval, modifiedTime,
 * retriesSinceFetch and page signature, so that it forces refetching.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param asap if true, force refetch as soon as possible - this sets
 * the fetchTime to now. If false, force refetch whenever the next fetch
 * time is set.
 */
public CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap) {
  // reduce fetchInterval so that it fits within the max value
  if (datum.getFetchInterval() > maxInterval)
    datum.setFetchInterval(maxInterval * 0.9f);
  datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  datum.setRetriesSinceFetch(0);
  datum.setSignature(null);
  datum.setModifiedTime(0L);
  if (asap) datum.setFetchTime(System.currentTimeMillis());
  return datum;
}
 
Example 8
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * Initialize fetch schedule related data. Implementations should at least
 * set the <code>fetchTime</code> and <code>fetchInterval</code>. The default
 * implementation sets the <code>fetchTime</code> to now, using the
 * default <code>fetchInterval</code>.
 * 
 * @param url URL of the page.
 *
 * @param datum datum instance to be initialized (modified in place).
 */
public CrawlDatum initializeSchedule(Text url, CrawlDatum datum) {
  datum.setFetchTime(System.currentTimeMillis());
  datum.setFetchInterval(defaultInterval);
  datum.setRetriesSinceFetch(0);
  return datum;
}
 
Example 9
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * This method specifies how to schedule refetching of pages
 * marked as GONE. Default implementation increases fetchInterval by 50%
 * but the value may never exceed <code>maxInterval</code>.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance to be adjusted.
 *
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than the <code>datum</code>
 * passed in, but implementations should make sure that it contains at
 * least all information from the original <code>datum</code>.
 */
public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime, long fetchTime) {
  // no page is truly GONE ... just increase the interval by 50%
  // and try much later.
  if ((datum.getFetchInterval() * 1.5f) < maxInterval)
    datum.setFetchInterval(datum.getFetchInterval() * 1.5f);
  else
    datum.setFetchInterval(maxInterval * 0.9f);
  datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
  return datum;
}
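
Worked through with concrete numbers, the 50% growth and the cap above behave as in this small sketch (the 90-day cap and the starting interval are assumed values for illustration; the real cap comes from db.fetch.interval.max).

public class GoneScheduleSketch {
  public static void main(String[] args) {
    float maxInterval = 90f * 24 * 3600;   // assumed cap: 90 days
    float interval = 30f * 24 * 3600;      // current interval: 30 days

    // Each GONE outcome grows the interval by 50% while that still fits
    // under the cap, otherwise it settles at 90% of the cap.
    for (int gone = 1; gone <= 4; gone++) {
      interval = (interval * 1.5f < maxInterval) ? interval * 1.5f
                                                 : maxInterval * 0.9f;
      System.out.printf("after GONE #%d: %.1f days%n", gone, interval / 86400f);
    }
  }
}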
 
Example 10
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * This method resets fetchTime, fetchInterval, modifiedTime,
 * retriesSinceFetch and page signature, so that it forces refetching.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param asap if true, force refetch as soon as possible - this sets
 * the fetchTime to now. If false, force refetch whenever the next fetch
 * time is set.
 */
public CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap) {
  // reduce fetchInterval so that it fits within the max value
  if (datum.getFetchInterval() > maxInterval)
    datum.setFetchInterval(maxInterval * 0.9f);
  datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  datum.setRetriesSinceFetch(0);
  datum.setSignature(null);
  datum.setModifiedTime(0L);
  if (asap) datum.setFetchTime(System.currentTimeMillis());
  return datum;
}