Java Code Examples for org.apache.nutch.crawl.CrawlDatum#setRetriesSinceFetch()

The following examples show how to use org.apache.nutch.crawl.CrawlDatum#setRetriesSinceFetch().
Example 1
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * Sets the <code>fetchInterval</code> and <code>fetchTime</code> on a
 * successfully fetched page. NOTE: this base implementation only resets the
 * retry counter; extending classes should call super.setFetchSchedule() to
 * preserve this behavior.
 */
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime,
        long fetchTime, long modifiedTime, int state) {
  datum.setRetriesSinceFetch(0);
  return datum;
}
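
The note about calling super.setFetchSchedule() can be illustrated with a minimal, self-contained sketch. The classes below (SketchDatum, BaseScheduleSketch, FixedIntervalScheduleSketch) are hypothetical stand-ins, not the Nutch classes, and the method signature is shortened to two parameters for brevity. Only the retry-counter behavior mirrors the example above.

```java
// Hypothetical stand-in for CrawlDatum: models only the fields this sketch needs.
class SketchDatum {
  private int retriesSinceFetch;
  private long fetchTime;

  int getRetriesSinceFetch() { return retriesSinceFetch; }
  void setRetriesSinceFetch(int retries) { this.retriesSinceFetch = retries; }
  long getFetchTime() { return fetchTime; }
  void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
}

// Hypothetical base schedule: like the example above, it only resets the counter.
class BaseScheduleSketch {
  SketchDatum setFetchSchedule(SketchDatum datum, long fetchTime) {
    datum.setRetriesSinceFetch(0);
    return datum;
  }
}

// Hypothetical subclass that adds its own scheduling on top of the base reset.
class FixedIntervalScheduleSketch extends BaseScheduleSketch {
  private final long intervalMillis;

  FixedIntervalScheduleSketch(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  @Override
  SketchDatum setFetchSchedule(SketchDatum datum, long fetchTime) {
    // Delegate first so the retry counter is still reset to zero.
    datum = super.setFetchSchedule(datum, fetchTime);
    // Then apply the subclass-specific schedule (an assumed fixed interval).
    datum.setFetchTime(fetchTime + intervalMillis);
    return datum;
  }
}
```

If the subclass skipped the super call, a datum that had accumulated retries before a successful fetch would keep its old counter value.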
 
Example 2
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * Sets the <code>fetchInterval</code> and <code>fetchTime</code> on a
 * successfully fetched page. NOTE: this base implementation only resets the
 * retry counter; extending classes should call super.setFetchSchedule() to
 * preserve this behavior.
 */
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime,
        long fetchTime, long modifiedTime, int state) {
  datum.setRetriesSinceFetch(0);
  return datum;
}
 
Example 3
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * Initialize fetch schedule related data. Implementations should at least
 * set the <code>fetchTime</code> and <code>fetchInterval</code>. The default
 * implementation sets the <code>fetchTime</code> to now, using the
 * default <code>fetchInterval</code>.
 * 
 * @param url URL of the page.
 *
 * @param datum datum instance to be initialized (modified in place).
 */
public CrawlDatum initializeSchedule(Text url, CrawlDatum datum) {
  datum.setFetchTime(System.currentTimeMillis());
  datum.setFetchInterval(defaultInterval);
  datum.setRetriesSinceFetch(0);
  return datum;
}
 
Example 4
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * This method adjusts the fetch schedule if fetching needs to be
 * re-tried due to transient errors. The default implementation
 * sets the next fetch time 1 day in the future and increases
 * the retry counter.
 *
 * @param url URL of the page.
 *
 * @param datum page information.
 *
 * @param prevFetchTime previous fetch time.
 *
 * @param prevModifiedTime previous modified time.
 *
 * @param fetchTime current fetch time.
 *
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than <code>datum</code>, but
 * implementations should make sure that it contains at least all
 * information from <code>datum</code>.
 */
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime, long fetchTime) {
  datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  return datum;
}
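
The retry arithmetic above can be checked with a small standalone sketch. RetryDatumSketch and RetryScheduleSketch are hypothetical stand-ins rather than Nutch classes, the method takes fewer parameters than the original, and the SECONDS_PER_DAY value of 86400 is an assumption made for this illustration.

```java
// Hypothetical stand-in for CrawlDatum: only fetch time and the retry counter.
class RetryDatumSketch {
  private long fetchTime;
  private int retriesSinceFetch;

  long getFetchTime() { return fetchTime; }
  void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
  int getRetriesSinceFetch() { return retriesSinceFetch; }
  void setRetriesSinceFetch(int retries) { this.retriesSinceFetch = retries; }
}

class RetryScheduleSketch {
  // Assumed value for this sketch: 24 hours expressed in seconds.
  static final int SECONDS_PER_DAY = 24 * 60 * 60;

  // Same steps as the example above: push the next fetch one day past
  // fetchTime (in milliseconds) and increment the retry counter by one.
  static RetryDatumSketch setPageRetrySchedule(RetryDatumSketch datum, long fetchTime) {
    datum.setFetchTime(fetchTime + (long) SECONDS_PER_DAY * 1000);
    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    return datum;
  }
}
```

Calling the method twice with the same fetchTime leaves the counter at 2 while the next fetch time stays at fetchTime plus 86,400,000 ms, because the new time is derived from the fetchTime argument rather than from the datum's previous value.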
 
Example 5
Source File: AbstractFetchSchedule.java    From anthelion with Apache License 2.0
/**
 * This method resets fetchTime, fetchInterval, modifiedTime,
 * retriesSinceFetch and page signature, so that it forces refetching.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param asap if true, force refetch as soon as possible - this sets
 * the fetchTime to now. If false, force refetch whenever the next fetch
 * time is set.
 */
public CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap) {
  // reduce fetchInterval so that it fits within the max value
  if (datum.getFetchInterval() > maxInterval)
    datum.setFetchInterval(maxInterval * 0.9f);
  datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  datum.setRetriesSinceFetch(0);
  datum.setSignature(null);
  datum.setModifiedTime(0L);
  if (asap) datum.setFetchTime(System.currentTimeMillis());
  return datum;
}
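
A rough, self-contained sketch of the same reset sequence follows. RefetchDatumSketch and ForceRefetchSketch are hypothetical stand-ins, the status constant is a placeholder value, the interval is modeled as an int rounded with Math.round (an assumption), and the current time is passed in as a parameter instead of read from System.currentTimeMillis() so the result is deterministic.

```java
// Hypothetical stand-in for CrawlDatum with plain fields for brevity.
class RefetchDatumSketch {
  static final int STATUS_UNFETCHED = 1; // placeholder, not the real Nutch constant

  int status;
  int fetchInterval;   // seconds
  int retriesSinceFetch;
  long fetchTime;      // milliseconds
  long modifiedTime;   // milliseconds
  byte[] signature;
}

class ForceRefetchSketch {
  static RefetchDatumSketch forceRefetch(RefetchDatumSketch datum, int maxInterval,
                                         boolean asap, long now) {
    // Shrink an oversized interval to 90% of the maximum so it fits the limit.
    if (datum.fetchInterval > maxInterval) {
      datum.fetchInterval = Math.round(maxInterval * 0.9f);
    }
    // Clear everything that could make the page look already fetched or unchanged.
    datum.status = RefetchDatumSketch.STATUS_UNFETCHED;
    datum.retriesSinceFetch = 0;
    datum.signature = null;
    datum.modifiedTime = 0L;
    // Only an "asap" refetch moves the fetch time; otherwise it is left as is.
    if (asap) {
      datum.fetchTime = now;
    }
    return datum;
  }
}
```

With asap set to false the existing fetchTime is kept, so the page is refetched whenever its already scheduled time arrives.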
 
Example 6
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * Initialize fetch schedule related data. Implementations should at least
 * set the <code>fetchTime</code> and <code>fetchInterval</code>. The default
 * implementation sets the <code>fetchTime</code> to now, using the
 * default <code>fetchInterval</code>.
 * 
 * @param url URL of the page.
 *
 * @param datum datum instance to be initialized (modified in place).
 */
public CrawlDatum initializeSchedule(Text url, CrawlDatum datum) {
  datum.setFetchTime(System.currentTimeMillis());
  datum.setFetchInterval(defaultInterval);
  datum.setRetriesSinceFetch(0);
  return datum;
}
 
Example 7
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * This method adjusts the fetch schedule if fetching needs to be
 * re-tried due to transient errors. The default implementation
 * sets the next fetch time 1 day in the future and increases
 * the retry counter.
 *
 * @param url URL of the page.
 *
 * @param datum page information.
 *
 * @param prevFetchTime previous fetch time.
 *
 * @param prevModifiedTime previous modified time.
 *
 * @param fetchTime current fetch time.
 *
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than <code>datum</code>, but
 * implementations should make sure that it contains at least all
 * information from <code>datum</code>.
 */
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
        long prevFetchTime, long prevModifiedTime, long fetchTime) {
  datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  return datum;
}
 
Example 8
Source File: AbstractFetchSchedule.java    From nutch-htmlunit with Apache License 2.0
/**
 * This method resets fetchTime, fetchInterval, modifiedTime,
 * retriesSinceFetch and page signature, so that it forces refetching.
 *
 * @param url URL of the page.
 *
 * @param datum datum instance.
 *
 * @param asap if true, force refetch as soon as possible - this sets
 * the fetchTime to now. If false, force refetch whenever the next fetch
 * time is set.
 */
public CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap) {
  // reduce fetchInterval so that it fits within the max value
  if (datum.getFetchInterval() > maxInterval)
    datum.setFetchInterval(maxInterval * 0.9f);
  datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  datum.setRetriesSinceFetch(0);
  datum.setSignature(null);
  datum.setModifiedTime(0L);
  if (asap) datum.setFetchTime(System.currentTimeMillis());
  return datum;
}