Java Code Examples for org.apache.spark.api.java.JavaRDD#takeSample()

The following examples show how to use org.apache.spark.api.java.JavaRDD#takeSample().
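Every example on this page calls takeSample(false, n), that is, sampling without replacement. JavaRDD exposes takeSample(boolean withReplacement, int num) and an overload that also takes a long seed; both return a java.util.List on the driver rather than another RDD. As a rough illustration of the semantics only, the plain-Java sketch below mimics that behaviour on an in-memory list. It is not Spark's implementation, and the class name TakeSampleSketch and the method sampleWithoutReplacement are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for rdd.takeSample(false, num, seed) on a local list.
// Assumption: a seeded shuffle followed by a prefix cut is enough to show the
// contract (distinct positions, at most num results); Spark itself samples
// across partitions and does not work this way internally.
public class TakeSampleSketch {

    public static <T> List<T> sampleWithoutReplacement(List<T> data, int num, long seed) {
        if (num < 0) {
            throw new IllegalArgumentException("Negative number of elements requested");
        }
        // Copy first so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        // Asking for more elements than exist simply yields all of them.
        int size = Math.min(num, shuffled.size());
        return new ArrayList<>(shuffled.subList(0, size));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        // Three distinct elements; the same seed gives the same sample.
        System.out.println(sampleWithoutReplacement(data, 3, 42L));
        // Requesting 50 from 10 elements returns all 10, in shuffled order.
        System.out.println(sampleWithoutReplacement(data, 50, 42L).size());
    }
}
```

Because the result is materialised as a List in driver memory, the requested sample size should stay small, which is how the helper methods below use it.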
Example 1
Source File: AnalyzeSpark.java    From DataVec with Apache License 2.0 5 votes
/**
 * Randomly sample a set of invalid values from a specified column.
 * Values are considered invalid according to the Schema / ColumnMetaData
 *
 * @param numToSample    Maximum number of invalid values to sample
 * @param columnName     Name of the column from which to sample invalid values
 * @param schema         Data schema
 * @param data           Data
 * @param ignoreMissing  If true: ignore missing values (NullWritable or empty/null string) when sampling. If false: include missing values in sampling
 * @return               List of invalid examples
 */
public static List<Writable> sampleInvalidFromColumn(int numToSample, String columnName, Schema schema,
                JavaRDD<List<Writable>> data, boolean ignoreMissing) {
    //First: filter out all valid entries, to leave only invalid entries
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    ColumnMetaData meta = schema.getMetaData(columnName);

    JavaRDD<Writable> invalid = ithColumn.filter(new FilterWritablesBySchemaFunction(meta, false, ignoreMissing));

    return invalid.takeSample(false, numToSample);
}
 
Example 2
Source File: AnalyzeSpark.java    From deeplearning4j with Apache License 2.0 5 votes
/**
 * Randomly sample a set of invalid values from a specified column.
 * Values are considered invalid according to the Schema / ColumnMetaData
 *
 * @param numToSample    Maximum number of invalid values to sample
 * @param columnName     Name of the column from which to sample invalid values
 * @param schema         Data schema
 * @param data           Data
 * @param ignoreMissing  If true: ignore missing values (NullWritable or empty/null string) when sampling. If false: include missing values in sampling
 * @return               List of invalid examples
 */
public static List<Writable> sampleInvalidFromColumn(int numToSample, String columnName, Schema schema,
                JavaRDD<List<Writable>> data, boolean ignoreMissing) {
    //First: filter out all valid entries, to leave only invalid entries
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    ColumnMetaData meta = schema.getMetaData(columnName);

    JavaRDD<Writable> invalid = ithColumn.filter(new FilterWritablesBySchemaFunction(meta, false, ignoreMissing));

    return invalid.takeSample(false, numToSample);
}
 
Example 3
Source File: AnalyzeSpark.java    From DataVec with Apache License 2.0 3 votes
/**
 * Randomly sample values from a single column
 *
 * @param count         Number of values to sample
 * @param columnName    Name of the column to sample from
 * @param schema        Schema
 * @param data          Data to sample from
 * @return              A list of random samples
 */
public static List<Writable> sampleFromColumn(int count, String columnName, Schema schema,
                JavaRDD<List<Writable>> data) {
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    return ithColumn.takeSample(false, count);
}
 
Example 4
Source File: AnalyzeSpark.java    From deeplearning4j with Apache License 2.0 3 votes
/**
 * Randomly sample values from a single column
 *
 * @param count         Number of values to sample
 * @param columnName    Name of the column to sample from
 * @param schema        Schema
 * @param data          Data to sample from
 * @return              A list of random samples
 */
public static List<Writable> sampleFromColumn(int count, String columnName, Schema schema,
                JavaRDD<List<Writable>> data) {
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    return ithColumn.takeSample(false, count);
}
 
Example 5
Source File: AnalyzeSpark.java    From DataVec with Apache License 2.0 2 votes
/**
 * Randomly sample a set of examples
 *
 * @param count    Number of samples to generate
 * @param data     Data to sample from
 * @return         Samples
 */
public static List<List<Writable>> sample(int count, JavaRDD<List<Writable>> data) {
    return data.takeSample(false, count);
}
 
Example 6
Source File: AnalyzeSpark.java    From DataVec with Apache License 2.0 2 votes
/**
 * Randomly sample a number of sequences from the data
 * @param count    Number of sequences to sample
 * @param data     Data to sample from
 * @return         Sequence samples
 */
public static List<List<List<Writable>>> sampleSequence(int count, JavaRDD<List<List<Writable>>> data) {
    return data.takeSample(false, count);
}
 
Example 7
Source File: AnalyzeSpark.java    From deeplearning4j with Apache License 2.0 2 votes
/**
 * Randomly sample a set of examples
 *
 * @param count    Number of samples to generate
 * @param data     Data to sample from
 * @return         Samples
 */
public static List<List<Writable>> sample(int count, JavaRDD<List<Writable>> data) {
    return data.takeSample(false, count);
}
 
Example 8
Source File: AnalyzeSpark.java    From deeplearning4j with Apache License 2.0 2 votes
/**
 * Randomly sample a number of sequences from the data
 * @param count    Number of sequences to sample
 * @param data     Data to sample from
 * @return         Sequence samples
 */
public static List<List<List<Writable>>> sampleSequence(int count, JavaRDD<List<List<Writable>>> data) {
    return data.takeSample(false, count);
}