Java Code Examples for org.apache.flink.api.java.DataSet#mapPartition()

The following examples show how to use org.apache.flink.api.java.DataSet#mapPartition(). Each example is taken from an open-source project; the source file and originating project are noted above the code.
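Before the project examples, a minimal, self-contained sketch of the API itself may help: mapPartition() hands the user function an Iterable over all elements of one parallel partition together with a Collector for its output, which makes it a natural fit whenever per-partition state (counters, buffers, connections) is needed. The class name and sample data below are made up for illustration.

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class MapPartitionSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> words = env.fromElements("a", "b", "c", "d");

        // Emit a single count per parallel partition instead of one record per element.
        DataSet<Long> countPerPartition = words.mapPartition(
            new MapPartitionFunction<String, Long>() {
                @Override
                public void mapPartition(Iterable<String> values, Collector<Long> out) {
                    long count = 0L;
                    for (String ignored : values) {
                        count++;
                    }
                    out.collect(count);
                }
            });

        countPerPartition.print();
    }
}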
Example 1
Source File: StringIndexerUtil.java    From Alink with Apache License 2.0
/**
 * Count tokens per partition per column.
 *
 * @param input The flattened token, a DataSet of column index and token.
 * @return A DataSet of tuples of subtask index, column index, number of tokens.
 */
private static DataSet<Tuple3<Integer, Integer, Long>> countTokensPerPartitionPerColumn(
    DataSet<Tuple2<Integer, String>> input) {
    return input.mapPartition(
        new RichMapPartitionFunction<Tuple2<Integer, String>, Tuple3<Integer, Integer, Long>>() {
            @Override
            public void mapPartition(Iterable<Tuple2<Integer, String>> values,
                                     Collector<Tuple3<Integer, Integer, Long>> out) throws Exception {
                Map<Integer, Long> counter = new HashMap<>(); // column -> count
                for (Tuple2<Integer, String> value : values) {
                    counter.merge(value.f0, 1L, Long::sum);
                }
                int taskId = getRuntimeContext().getIndexOfThisSubtask();
                counter.forEach((k, v) -> out.collect(Tuple3.of(taskId, k, v)));
            }
        });
}
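As a hypothetical follow-up, the (subtask index, column index, count) tuples emitted by the helper above are usually combined downstream. A minimal, self-contained sketch of such an aggregation; the class name and the hard-coded tuples stand in for the real output of countTokensPerPartitionPerColumn.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class TokenCountAggregationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Simulated output of the helper above: (subtask index, column index, token count).
        DataSet<Tuple3<Integer, Integer, Long>> perPartitionCounts = env.fromElements(
            Tuple3.of(0, 0, 3L), Tuple3.of(0, 1, 2L),
            Tuple3.of(1, 0, 1L), Tuple3.of(1, 1, 4L));

        // Drop the subtask index, then sum the per-partition counts per column.
        DataSet<Tuple2<Integer, Long>> tokensPerColumn = perPartitionCounts
            .<Tuple2<Integer, Long>>project(1, 2)
            .groupBy(0)
            .sum(1);

        tokensPerColumn.print(); // prints (0,4) and (1,6)
    }
}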
 
Example 2
Source File: DataSetUtils.java    From Flink-CEPplus with Apache License 2.0
/**
 * Method that assigns a unique {@link Long} value to all elements in the input data set as described below.
 * <ul>
 *  <li> a map function is applied to the input data set
 *  <li> each map task holds a counter c which is increased for each record
 *  <li> c is shifted by n bits where n = log2(number of parallel tasks)
 * 	<li> to create a unique ID among all tasks, the task id is added to the counter
 * 	<li> for each record, the resulting counter is collected
 * </ul>
 *
 * @param input the input data set
 * @return a data set of tuple 2 consisting of ids and initial values.
 */
public static <T> DataSet<Tuple2<Long, T>> zipWithUniqueId (DataSet <T> input) {

	return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Long, T>>() {

		long maxBitSize = getBitSize(Long.MAX_VALUE);
		long shifter = 0;
		long start = 0;
		long taskId = 0;
		long label = 0;

		@Override
		public void open(Configuration parameters) throws Exception {
			super.open(parameters);
			shifter = getBitSize(getRuntimeContext().getNumberOfParallelSubtasks() - 1);
			taskId = getRuntimeContext().getIndexOfThisSubtask();
		}

		@Override
		public void mapPartition(Iterable<T> values, Collector<Tuple2<Long, T>> out) throws Exception {
			for (T value : values) {
				label = (start << shifter) + taskId;

				if (getBitSize(start) + shifter < maxBitSize) {
					out.collect(new Tuple2<>(label, value));
					start++;
				} else {
					throw new Exception("Exceeded Long value range while generating labels");
				}
			}
		}
	});
}
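This utility ships with Flink as org.apache.flink.api.java.utils.DataSetUtils, so it can be called directly rather than re-implemented. A minimal usage sketch (class name and sample data are made up); the assigned IDs are unique across subtasks but not necessarily consecutive.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithUniqueIdSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        DataSet<String> names = env.fromElements("alice", "bob", "carol", "dave");

        // Unique but possibly non-consecutive IDs; DataSetUtils.zipWithIndex
        // assigns consecutive indices at the cost of an extra pass.
        DataSet<Tuple2<Long, String>> withIds = DataSetUtils.zipWithUniqueId(names);
        withIds.print();
    }
}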
 
Example 3
Source File: DataSetUtils.java    From flink with Apache License 2.0
/**
 * Method that goes over all the elements in each partition in order to retrieve
 * the total number of elements.
 *
 * @param input the DataSet received as input
 * @return a data set containing tuples of subtask index, number of elements mappings.
 */
public static <T> DataSet<Tuple2<Integer, Long>> countElementsPerPartition(DataSet<T> input) {
	return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Integer, Long>>() {
		@Override
		public void mapPartition(Iterable<T> values, Collector<Tuple2<Integer, Long>> out) throws Exception {
			long counter = 0;
			for (T value : values) {
				counter++;
			}
			out.collect(new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), counter));
		}
	});
}
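countElementsPerPartition is likewise available on Flink's org.apache.flink.api.java.utils.DataSetUtils. A minimal usage sketch (class name and data are made up); summing field 1 of the result yields the total element count, as Example 11 below does.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class CountPerPartitionSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(3);

        DataSet<Long> numbers = env.generateSequence(1, 100);

        // One (subtask index, element count) tuple per parallel partition.
        DataSet<Tuple2<Integer, Long>> counts = DataSetUtils.countElementsPerPartition(numbers);
        counts.print();
    }
}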
 
Example 4
Source File: DataSetUtils.java    From flink with Apache License 2.0
/**
 * Generate a sample of DataSet which contains fixed size elements.
 *
 * <p><strong>NOTE:</strong> Sample with fixed size is not as efficient as sample with fraction, use sample with
 * fraction unless you need exact precision.
 *
 * @param withReplacement Whether element can be selected more than once.
 * @param numSamples       The expected sample size.
 * @param seed            Random number generator seed.
 * @return The sampled DataSet
 */
public static <T> DataSet<T> sampleWithSize(
	DataSet <T> input,
	final boolean withReplacement,
	final int numSamples,
	final long seed) {

	SampleInPartition<T> sampleInPartition = new SampleInPartition<>(withReplacement, numSamples, seed);
	MapPartitionOperator<T, IntermediateSampleData<T>> mapPartitionOperator = input.mapPartition(sampleInPartition);

	// There is no previous group, so the parallelism of GroupReduceOperator is always 1.
	String callLocation = Utils.getCallLocationName();
	SampleInCoordinator<T> sampleInCoordinator = new SampleInCoordinator<>(withReplacement, numSamples, seed);
	return new GroupReduceOperator<>(mapPartitionOperator, input.getType(), sampleInCoordinator, callLocation);
}
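A minimal usage sketch of sampleWithSize via Flink's org.apache.flink.api.java.utils.DataSetUtils (class name and data are made up); the fixed seed makes the draw reproducible.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.utils.DataSetUtils;

public class SampleWithSizeSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Long> numbers = env.generateSequence(1, 10000);

        // Exactly 10 elements, sampled without replacement.
        DataSet<Long> sample = DataSetUtils.sampleWithSize(numbers, false, 10, 42L);
        sample.print();
    }
}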
 
Example 5
Source File: DataSetUtils.java    From flink with Apache License 2.0
/**
 * Method that assigns a unique {@link Long} value to all elements in the input data set as described below.
 * <ul>
 *  <li> a map function is applied to the input data set
 *  <li> each map task holds a counter c which is increased for each record
 *  <li> c is shifted by n bits where n = log2(number of parallel tasks)
 * 	<li> to create a unique ID among all tasks, the task id is added to the counter
 * 	<li> for each record, the resulting counter is collected
 * </ul>
 *
 * @param input the input data set
 * @return a data set of tuple 2 consisting of ids and initial values.
 */
public static <T> DataSet<Tuple2<Long, T>> zipWithUniqueId (DataSet <T> input) {

	return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Long, T>>() {

		long maxBitSize = getBitSize(Long.MAX_VALUE);
		long shifter = 0;
		long start = 0;
		long taskId = 0;
		long label = 0;

		@Override
		public void open(Configuration parameters) throws Exception {
			super.open(parameters);
			shifter = getBitSize(getRuntimeContext().getNumberOfParallelSubtasks() - 1);
			taskId = getRuntimeContext().getIndexOfThisSubtask();
		}

		@Override
		public void mapPartition(Iterable<T> values, Collector<Tuple2<Long, T>> out) throws Exception {
			for (T value : values) {
				label = (start << shifter) + taskId;

				if (getBitSize(start) + shifter < maxBitSize) {
					out.collect(new Tuple2<>(label, value));
					start++;
				} else {
					throw new Exception("Exceeded Long value range while generating labels");
				}
			}
		}
	});
}
 
Example 6
Source File: DataSetUtils.java    From Flink-CEPplus with Apache License 2.0
/**
 * Method that goes over all the elements in each partition in order to retrieve
 * the total number of elements.
 *
 * @param input the DataSet received as input
 * @return a data set containing tuples of subtask index, number of elements mappings.
 */
public static <T> DataSet<Tuple2<Integer, Long>> countElementsPerPartition(DataSet<T> input) {
	return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Integer, Long>>() {
		@Override
		public void mapPartition(Iterable<T> values, Collector<Tuple2<Integer, Long>> out) throws Exception {
			long counter = 0;
			for (T value : values) {
				counter++;
			}
			out.collect(new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), counter));
		}
	});
}
 
Example 7
Source File: FlinkCubingByLayer.java    From kylin with Apache License 2.0
@Override
protected void execute(OptionsHelper optionsHelper) throws Exception {
    String metaUrl = optionsHelper.getOptionValue(OPTION_META_URL);
    String hiveTable = optionsHelper.getOptionValue(OPTION_INPUT_TABLE);
    String inputPath = optionsHelper.getOptionValue(OPTION_INPUT_PATH);
    String cubeName = optionsHelper.getOptionValue(OPTION_CUBE_NAME);
    String segmentId = optionsHelper.getOptionValue(OPTION_SEGMENT_ID);
    String outputPath = optionsHelper.getOptionValue(OPTION_OUTPUT_PATH);
    String enableObjectReuseOptValue = optionsHelper.getOptionValue(OPTION_ENABLE_OBJECT_REUSE);

    boolean enableObjectReuse = false;
    if (enableObjectReuseOptValue != null && !enableObjectReuseOptValue.isEmpty()) {
        enableObjectReuse = true;
    }

    Job job = Job.getInstance();
    FileSystem fs = HadoopUtil.getWorkingFileSystem();
    HadoopUtil.deletePath(job.getConfiguration(), new Path(outputPath));

    final SerializableConfiguration sConf = new SerializableConfiguration(job.getConfiguration());
    KylinConfig envConfig = AbstractHadoopJob.loadKylinConfigFromHdfs(sConf, metaUrl);

    final CubeInstance cubeInstance = CubeManager.getInstance(envConfig).getCube(cubeName);
    final CubeDesc cubeDesc = cubeInstance.getDescriptor();
    final CubeSegment cubeSegment = cubeInstance.getSegmentById(segmentId);

    logger.info("DataSet input path : {}", inputPath);
    logger.info("DataSet output path : {}", outputPath);

    int countMeasureIndex = 0;
    for (MeasureDesc measureDesc : cubeDesc.getMeasures()) {
        if (measureDesc.getFunction().isCount()) {
            break;
        } else {
            countMeasureIndex++;
        }
    }

    final CubeStatsReader cubeStatsReader = new CubeStatsReader(cubeSegment, envConfig);
    boolean[] needAggr = new boolean[cubeDesc.getMeasures().size()];
    boolean allNormalMeasure = true;
    for (int i = 0; i < cubeDesc.getMeasures().size(); i++) {
        needAggr[i] = !cubeDesc.getMeasures().get(i).getFunction().getMeasureType().onlyAggrInBaseCuboid();
        allNormalMeasure = allNormalMeasure && needAggr[i];
    }

    logger.info("All measure are normal (agg on all cuboids) ? : " + allNormalMeasure);

    boolean isSequenceFile = JoinedFlatTable.SEQUENCEFILE.equalsIgnoreCase(envConfig.getFlatTableStorageFormat());

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    if (enableObjectReuse) {
        env.getConfig().enableObjectReuse();
    }
    env.getConfig().registerKryoType(PercentileCounter.class);
    env.getConfig().registerTypeWithKryoSerializer(PercentileCounter.class, PercentileCounterSerializer.class);

    DataSet<String[]> hiveDataSet = FlinkUtil.readHiveRecords(isSequenceFile, env, inputPath, hiveTable, job);

    DataSet<Tuple2<ByteArray, Object[]>> encodedBaseDataSet = hiveDataSet.mapPartition(
            new EncodeBaseCuboidMapPartitionFunction(cubeName, segmentId, metaUrl, sConf));

    Long totalCount = 0L;
    if (envConfig.isFlinkSanityCheckEnabled()) {
        totalCount = encodedBaseDataSet.count();
    }

    final BaseCuboidReduceGroupFunction baseCuboidReducerFunction = new BaseCuboidReduceGroupFunction(cubeName, metaUrl, sConf);

    BaseCuboidReduceGroupFunction reducerFunction = baseCuboidReducerFunction;
    if (!allNormalMeasure) {
        reducerFunction = new CuboidReduceGroupFunction(cubeName, metaUrl, sConf, needAggr);
    }

    final int totalLevels = cubeSegment.getCuboidScheduler().getBuildLevel();
    DataSet<Tuple2<ByteArray, Object[]>>[] allDataSets = new DataSet[totalLevels + 1];
    int level = 0;

    // aggregate to calculate base cuboid
    allDataSets[0] = encodedBaseDataSet.groupBy(0).reduceGroup(baseCuboidReducerFunction);

    sinkToHDFS(allDataSets[0], metaUrl, cubeName, cubeSegment, outputPath, 0, Job.getInstance(), envConfig);

    CuboidMapPartitionFunction mapPartitionFunction = new CuboidMapPartitionFunction(cubeName, segmentId, metaUrl, sConf);

    for (level = 1; level <= totalLevels; level++) {
        allDataSets[level] = allDataSets[level - 1].mapPartition(mapPartitionFunction).groupBy(0).reduceGroup(reducerFunction);
        if (envConfig.isFlinkSanityCheckEnabled()) {
            sanityCheck(allDataSets[level], totalCount, level, cubeStatsReader, countMeasureIndex);
        }
        sinkToHDFS(allDataSets[level], metaUrl, cubeName, cubeSegment, outputPath, level, Job.getInstance(), envConfig);
    }

    env.execute("Cubing for : " + cubeName + " segment " + segmentId);
    logger.info("Finished on calculating all level cuboids.");
    logger.info("HDFS: Number of bytes written=" + FlinkBatchCubingJobBuilder2.getFileSize(outputPath, fs));
}
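The heart of the layered cubing job above is the per-level pipeline mapPartition(...).groupBy(0).reduceGroup(...): a partition-level function expands each record of level N-1 into candidate records of level N, and a grouped reduce re-aggregates them by key. The stripped-down sketch below shows only that shape; the toy key/measure data and the "drop one character" expansion are stand-ins, not Kylin code.

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class LevelByLevelSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy "parent level": the key is a dimension combination, the value a measure.
        DataSet<Tuple2<String, Long>> parentLevel = env.fromElements(
            Tuple2.of("AB", 3L), Tuple2.of("AC", 5L), Tuple2.of("BC", 2L));

        // Step 1: within each partition, expand every record into its child keys
        // (here simply the key with one character removed).
        DataSet<Tuple2<String, Long>> children = parentLevel.mapPartition(
            new MapPartitionFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                @Override
                public void mapPartition(Iterable<Tuple2<String, Long>> values,
                                         Collector<Tuple2<String, Long>> out) {
                    for (Tuple2<String, Long> v : values) {
                        for (int i = 0; i < v.f0.length(); i++) {
                            out.collect(Tuple2.of(v.f0.substring(0, i) + v.f0.substring(i + 1), v.f1));
                        }
                    }
                }
            });

        // Step 2: re-aggregate the expanded records by key.
        DataSet<Tuple2<String, Long>> childLevel = children.groupBy(0).reduceGroup(
            new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                @Override
                public void reduce(Iterable<Tuple2<String, Long>> records,
                                   Collector<Tuple2<String, Long>> out) {
                    String key = null;
                    long sum = 0L;
                    for (Tuple2<String, Long> r : records) {
                        key = r.f0;
                        sum += r.f1;
                    }
                    out.collect(Tuple2.of(key, sum));
                }
            });

        // A real layered build repeats step 1 and step 2 once per level.
        childLevel.print();
    }
}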
 
Example 8
Source File: FlinkCubingByLayer.java    From kylin-on-parquet-v2 with Apache License 2.0
@Override
protected void execute(OptionsHelper optionsHelper) throws Exception {
    String metaUrl = optionsHelper.getOptionValue(OPTION_META_URL);
    String hiveTable = optionsHelper.getOptionValue(OPTION_INPUT_TABLE);
    String inputPath = optionsHelper.getOptionValue(OPTION_INPUT_PATH);
    String cubeName = optionsHelper.getOptionValue(OPTION_CUBE_NAME);
    String segmentId = optionsHelper.getOptionValue(OPTION_SEGMENT_ID);
    String outputPath = optionsHelper.getOptionValue(OPTION_OUTPUT_PATH);
    String enableObjectReuseOptValue = optionsHelper.getOptionValue(OPTION_ENABLE_OBJECT_REUSE);

    boolean enableObjectReuse = false;
    if (enableObjectReuseOptValue != null && !enableObjectReuseOptValue.isEmpty()) {
        enableObjectReuse = true;
    }

    Job job = Job.getInstance();
    FileSystem fs = HadoopUtil.getWorkingFileSystem();
    HadoopUtil.deletePath(job.getConfiguration(), new Path(outputPath));

    final SerializableConfiguration sConf = new SerializableConfiguration(job.getConfiguration());
    KylinConfig envConfig = AbstractHadoopJob.loadKylinConfigFromHdfs(sConf, metaUrl);

    final CubeInstance cubeInstance = CubeManager.getInstance(envConfig).getCube(cubeName);
    final CubeDesc cubeDesc = cubeInstance.getDescriptor();
    final CubeSegment cubeSegment = cubeInstance.getSegmentById(segmentId);

    logger.info("DataSet input path : {}", inputPath);
    logger.info("DataSet output path : {}", outputPath);

    int countMeasureIndex = 0;
    for (MeasureDesc measureDesc : cubeDesc.getMeasures()) {
        if (measureDesc.getFunction().isCount()) {
            break;
        } else {
            countMeasureIndex++;
        }
    }

    final CubeStatsReader cubeStatsReader = new CubeStatsReader(cubeSegment, envConfig);
    boolean[] needAggr = new boolean[cubeDesc.getMeasures().size()];
    boolean allNormalMeasure = true;
    for (int i = 0; i < cubeDesc.getMeasures().size(); i++) {
        needAggr[i] = !cubeDesc.getMeasures().get(i).getFunction().getMeasureType().onlyAggrInBaseCuboid();
        allNormalMeasure = allNormalMeasure && needAggr[i];
    }

    logger.info("All measure are normal (agg on all cuboids) ? : " + allNormalMeasure);

    boolean isSequenceFile = JoinedFlatTable.SEQUENCEFILE.equalsIgnoreCase(envConfig.getFlatTableStorageFormat());

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    if (enableObjectReuse) {
        env.getConfig().enableObjectReuse();
    }
    env.getConfig().registerKryoType(PercentileCounter.class);
    env.getConfig().registerTypeWithKryoSerializer(PercentileCounter.class, PercentileCounterSerializer.class);

    DataSet<String[]> hiveDataSet = FlinkUtil.readHiveRecords(isSequenceFile, env, inputPath, hiveTable, job);

    DataSet<Tuple2<ByteArray, Object[]>> encodedBaseDataSet = hiveDataSet.mapPartition(
            new EncodeBaseCuboidMapPartitionFunction(cubeName, segmentId, metaUrl, sConf));

    Long totalCount = 0L;
    if (envConfig.isFlinkSanityCheckEnabled()) {
        totalCount = encodedBaseDataSet.count();
    }

    final BaseCuboidReduceGroupFunction baseCuboidReducerFunction = new BaseCuboidReduceGroupFunction(cubeName, metaUrl, sConf);

    BaseCuboidReduceGroupFunction reducerFunction = baseCuboidReducerFunction;
    if (!allNormalMeasure) {
        reducerFunction = new CuboidReduceGroupFunction(cubeName, metaUrl, sConf, needAggr);
    }

    final int totalLevels = cubeSegment.getCuboidScheduler().getBuildLevel();
    DataSet<Tuple2<ByteArray, Object[]>>[] allDataSets = new DataSet[totalLevels + 1];
    int level = 0;

    // aggregate to calculate base cuboid
    allDataSets[0] = encodedBaseDataSet.groupBy(0).reduceGroup(baseCuboidReducerFunction);

    sinkToHDFS(allDataSets[0], metaUrl, cubeName, cubeSegment, outputPath, 0, Job.getInstance(), envConfig);

    CuboidMapPartitionFunction mapPartitionFunction = new CuboidMapPartitionFunction(cubeName, segmentId, metaUrl, sConf);

    for (level = 1; level <= totalLevels; level++) {
        allDataSets[level] = allDataSets[level - 1].mapPartition(mapPartitionFunction).groupBy(0).reduceGroup(reducerFunction);
        if (envConfig.isFlinkSanityCheckEnabled()) {
            sanityCheck(allDataSets[level], totalCount, level, cubeStatsReader, countMeasureIndex);
        }
        sinkToHDFS(allDataSets[level], metaUrl, cubeName, cubeSegment, outputPath, level, Job.getInstance(), envConfig);
    }

    env.execute("Cubing for : " + cubeName + " segment " + segmentId);
    logger.info("Finished on calculating all level cuboids.");
    logger.info("HDFS: Number of bytes written=" + FlinkBatchCubingJobBuilder2.getFileSize(outputPath, fs));
}
 
Example 9
Source File: AlsTrainBatchOp.java    From Alink with Apache License 2.0
/**
 * Matrix decomposition using ALS algorithm.
 *
 * @param inputs a dataset of user-item-rating tuples
 * @return user factors and item factors.
 */
@Override
public AlsTrainBatchOp linkFrom(BatchOperator<?>... inputs) {
    BatchOperator<?> in = checkAndGetFirst(inputs);

    final String userColName = getUserCol();
    final String itemColName = getItemCol();
    final String rateColName = getRateCol();

    final double lambda = getLambda();
    final int rank = getRank();
    final int numIter = getNumIter();
    final boolean nonNegative = getNonnegative();
    final boolean implicitPrefs = getImplicitPrefs();
    final double alpha = getAlpha();
    final int numMiniBatches = getNumBlocks();

    final int userColIdx = TableUtil.findColIndexWithAssertAndHint(in.getColNames(), userColName);
    final int itemColIdx = TableUtil.findColIndexWithAssertAndHint(in.getColNames(), itemColName);
    final int rateColIdx = TableUtil.findColIndexWithAssertAndHint(in.getColNames(), rateColName);

    // tuple3: userId, itemId, rating
    DataSet<Tuple3<Long, Long, Float>> alsInput = in.getDataSet()
        .map(new MapFunction<Row, Tuple3<Long, Long, Float>>() {
            @Override
            public Tuple3<Long, Long, Float> map(Row value) {
                return new Tuple3<>(((Number) value.getField(userColIdx)).longValue(),
                    ((Number) value.getField(itemColIdx)).longValue(),
                    ((Number) value.getField(rateColIdx)).floatValue());
            }
        });

    AlsTrain als = new AlsTrain(rank, numIter, lambda, implicitPrefs, alpha, numMiniBatches, nonNegative);
    DataSet<Tuple3<Byte, Long, float[]>> factors = als.fit(alsInput);

    DataSet<Row> output = factors.mapPartition(new RichMapPartitionFunction<Tuple3<Byte, Long, float[]>, Row>() {
        @Override
        public void mapPartition(Iterable<Tuple3<Byte, Long, float[]>> values, Collector<Row> out) {
            new AlsModelDataConverter(userColName, itemColName).save(values, out);
        }
    });

    this.setOutput(output, new AlsModelDataConverter(userColName, itemColName).getModelSchema());
    return this;
}
 
Example 10
Source File: NaiveBayesTextTrainBatchOp.java    From Alink with Apache License 2.0
/**
 * Train data and get a model.
 *
 * @param inputs input data.
 * @return the model of naive bayes.
 */
@Override
public NaiveBayesTextTrainBatchOp linkFrom(BatchOperator<?>... inputs) {
	BatchOperator<?> in = checkAndGetFirst(inputs);
	TypeInformation <?> labelType;
	String labelColName = getLabelCol();
	ModelType modelType = getModelType();
	String weightColName = getWeightCol();
	double smoothing = getSmoothing();
	String vectorColName = getVectorCol();

	labelType = TableUtil.findColTypeWithAssertAndHint(in.getSchema(), labelColName);

	String[] keepColNames = (weightColName == null) ? new String[] {labelColName}
			: new String[] {weightColName, labelColName};
	Tuple2 <DataSet <Tuple2 <Vector, Row>>, DataSet <BaseVectorSummary>> dataSrt
			= StatisticsHelper.summaryHelper(in, null, vectorColName, keepColNames);
	DataSet <Tuple2 <Vector, Row>> data = dataSrt.f0;
	DataSet <BaseVectorSummary> srt = dataSrt.f1;

	DataSet <Integer> vectorSize = srt.map(new MapFunction <BaseVectorSummary, Integer>() {
		@Override
		public Integer map(BaseVectorSummary value) {
			return value.vectorSize();
		}
	});

	// Transform data in the form of label, weight, feature.
	DataSet <Tuple3 <Object, Double, Vector>> trainData = data
			.mapPartition(new Transform());

	DataSet <Row> probs = trainData
			.groupBy(new SelectLabel())
			.reduceGroup(new ReduceItem())
			.withBroadcastSet(vectorSize, "vectorSize")
			.mapPartition(new GenerateModel(smoothing, modelType, vectorColName, labelType))
			.withBroadcastSet(vectorSize, "vectorSize")
			.setParallelism(1);

	//save the model matrix.
	this.setOutput(probs, new NaiveBayesTextModelDataConverter(labelType).getModelSchema());
	return this;
}
 
Example 11
Source File: BaseRandomForestTrainBatchOp.java    From Alink with Apache License 2.0
private DataSet<Row> seriesTrain(BatchOperator<?> in) {
	DataSet<Row> trainDataSet = in.getDataSet();

	MapPartitionOperator<Row, Tuple2<Integer, Row>> sampled = trainDataSet
		.mapPartition(new SampleData(
				get(HasSeed.SEED),
				get(HasSubsamplingRatio.SUBSAMPLING_RATIO),
				get(HasNumTreesDefaltAs10.NUM_TREES)
			)
		);

	if (getParams().get(HasSubsamplingRatio.SUBSAMPLING_RATIO) > 1.0) {
		DataSet<Long> cnt = DataSetUtils
			.countElementsPerPartition(trainDataSet)
			.sum(1)
			.map(new MapFunction<Tuple2<Integer, Long>, Long>() {
				@Override
				public Long map(Tuple2<Integer, Long> value) throws Exception {
					return value.f1;
				}
			});

		sampled = sampled.withBroadcastSet(cnt, "totalCnt");
	}

	DataSet<Integer> labelSize = labels.map(new MapFunction<Object[], Integer>() {
		@Override
		public Integer map(Object[] objects) throws Exception {
			return objects.length;
		}
	});

	DataSet<Tuple2<Integer, String>> pModel = sampled
		.groupBy(0)
		.withPartitioner(new AvgPartition())
		.reduceGroup(new SeriesTrainFunction(getParams()))
		.withBroadcastSet(stringIndexerModel.getDataSet(), "stringIndexerModel")
		.withBroadcastSet(labelSize, "labelSize");

	return pModel
		.reduceGroup(new SerializeModel(getParams()))
		.withBroadcastSet(stringIndexerModel.getDataSet(), "stringIndexerModel")
		.withBroadcastSet(labels, "labels");
}
 
Example 12
Source File: Newton.java    From Alink with Apache License 2.0
/**
 * optimizer api.
 *
 * @return the coefficient of linear problem.
 */
@Override
public DataSet<Tuple2<DenseVector, double[]>> optimize() {
    //get parameters.
    int maxIter = params.get(HasMaxIterDefaultAs100.MAX_ITER);
    double epsilon = params.get(HasEpsilonDv0000001.EPSILON);
    if (null == this.coefVec) {
        initCoefZeros();
    }

    /**
     * solve problem using iteration.
     * trainData is the distributed samples.
     * initCoef is the initial model coefficient, which will be broadcast to every worker.
     * objFuncSet is the object function in dataSet format
     *
     * .add(new PreallocateCoefficient(OptimName.currentCoef)) allocate memory for current coefficient
     * .add(new PreallocateCoefficient(OptimName.minCoef))     allocate memory for min loss coefficient
     * .add(new PreallocateLossCurve(OptimVariable.lossCurve)) allocate memory for loss values
     * .add(new PreallocateVector(OptimName.dir ...))          allocate memory for grad
 * .add(new PreallocateMatrix(OptimName.hessian,...))      allocate memory for hessian matrix
     * .add(new CalcGradientAndHessian(objFunc))               calculate local sub gradient and hessian
     * .add(new AllReduce(OptimName.gradAllReduce))            sum all sub gradient and hessian with allReduce
     * .add(new GetGradientAndHessian())                       get summed gradient and hessian
     * .add(new UpdateModel(maxIter, epsilon ...))             update coefficient with gradient and hessian
     * .setCompareCriterionOfNode0(new IterTermination())             judge stop of iteration
     *
     */
    DataSet<Row> model = new IterativeComQueue()
        .initWithPartitionedData(OptimVariable.trainData, trainData)
        .initWithBroadcastData(OptimVariable.model, coefVec)
        .initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
        .add(new PreallocateCoefficient(OptimVariable.currentCoef))
        .add(new PreallocateCoefficient(OptimVariable.minCoef))
        .add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
        .add(new PreallocateVector(OptimVariable.dir, new double[2]))
        .add(new PreallocateMatrix(OptimVariable.hessian, MAX_FEATURE_NUM))
        .add(new CalcGradientAndHessian())
        .add(new AllReduce(OptimVariable.gradHessAllReduce))
        .add(new GetGradeintAndHessian())
        .add(new UpdateModel(maxIter, epsilon))
        .setCompareCriterionOfNode0(new IterTermination())
        .closeWith(new OutputModel())
        .setMaxIter(maxIter)
        .exec();

    return model.mapPartition(new ParseRowModel());
}
 
Example 13
Source File: Lbfgs.java    From Alink with Apache License 2.0
/**
 * optimizer api.
 *
 * @return the coefficient of linear problem.
 */
@Override
public DataSet <Tuple2 <DenseVector, double[]>> optimize() {
	//get parameters.
	int maxIter = params.get(HasMaxIterDefaultAs100.MAX_ITER);
	int numSearchStep = params.get(HasNumSearchStepDv4.NUM_SEARCH_STEP);
	if (null == this.coefVec) {
		initCoefZeros();
	}

	/**
	 * solving problem using iteration.
	 * trainData is the distributed samples.
	 * initCoef is the initial model coefficient, which will be broadcast to every worker.
	 * objFuncSet is the object function in dataSet format
	 * .add(new PreallocateCoefficient(OptimName.currentCoef)) allocate memory for current coefficient
	 * .add(new PreallocateCoefficient(OptimName.minCoef))     allocate memory for min loss coefficient
	 * .add(new PreallocateLossCurve(OptimVariable.lossCurve)) allocate memory for loss values
	 * .add(new PreallocateVector(OptimName.dir ...))          allocate memory for dir
	 * .add(new PreallocateVector(OptimName.grad))             allocate memory for grad
	 * .add(new PreallocateSkyk())                             allocate memory for sK yK
	 * .add(new CalcGradient(objFunc))                         calculate local sub gradient
	 * .add(new AllReduce(OptimName.gradAllReduce))            sum all sub gradient with allReduce
	 * .add(new CalDirection())                                get summed gradient and use it to calc descend dir
	 * .add(new CalcLosses(objFunc, OptimMethod.GD))           calculate local losses for line search
	 * .add(new AllReduce(OptimName.lossAllReduce))            sum all losses with allReduce
	 * .add(new UpdateModel(maxIter, epsilon ...))             update coefficient
	 * .setCompareCriterionOfNode0(new IterTermination())             judge stop of iteration
	 */
	DataSet <Row> model = new IterativeComQueue()
		.initWithPartitionedData(OptimVariable.trainData, trainData)
		.initWithBroadcastData(OptimVariable.model, coefVec)
		.initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
		.add(new PreallocateCoefficient(OptimVariable.currentCoef))
		.add(new PreallocateCoefficient(OptimVariable.minCoef))
		.add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
		.add(new PreallocateVector(OptimVariable.dir, new double[] {0.0, OptimVariable.learningRate}))
		.add(new PreallocateVector(OptimVariable.grad))
		.add(new PreallocateSkyk(OptimVariable.numCorrections))
		.add(new CalcGradient())
		.add(new AllReduce(OptimVariable.gradAllReduce))
		.add(new CalDirection(OptimVariable.numCorrections))
		.add(new CalcLosses(OptimMethod.LBFGS, numSearchStep))
		.add(new AllReduce(OptimVariable.lossAllReduce))
		.add(new UpdateModel(params, OptimVariable.grad, OptimMethod.LBFGS, numSearchStep))
		.setCompareCriterionOfNode0(new IterTermination())
		.closeWith(new OutputModel())
		.setMaxIter(maxIter)
		.exec();

	return model.mapPartition(new ParseRowModel());
}
 
Example 14
Source File: DynamicParallelVB.java    From toolbox with Apache License 2.0
protected DataSet<DataPosteriorAssignment> translate(DataSet<DataPosteriorInstance> data) {
    return data.mapPartition(new ParallelVBTranslate(this.dagTimeT, this.latentVariablesNames, this.latentInterfaceVariablesNames, this.noLatentVariablesNames));
}
 
Example 15
Source File: Sgd.java    From Alink with Apache License 2.0
/**
 * optimizer api.
 *
 * @return the coefficient of linear problem.
 */
@Override
public DataSet<Tuple2<DenseVector, double[]>> optimize() {
    //get parameters.
    int maxIter = params.get(SgdParams.MAX_ITER);
    double learnRate = params.get(SgdParams.LEARNING_RATE);
    double miniBatchFraction = params.get(SgdParams.MINI_BATCH_FRACTION);
    double epsilon = params.get(SgdParams.EPSILON);
    if (null == this.coefVec) {
        initCoefZeros();
    }

    /**
     * solve problem using iteration.
     * trainData is the distributed samples.
     * initCoef is the initial model coefficient, which will be broadcast to every worker.
     * objFuncSet is the object function in dataSet format
     *
     * .add(new PreallocateCoefficient(OptimName.currentCoef)) allocate memory for current coefficient
     * .add(new PreallocateLossCurve(OptimVariable.lossCurve)) allocate memory for loss values
     * .add(new PreallocateVector(OptimName.dir ...))          allocate memory for dir
     * .add(new CalcSubGradient(objFunc, miniBatchFraction))   calculate local sub gradient
     * .add(new AllReduce(OptimName.gradAllReduce))            sum all sub gradient with allReduce
     * .add(new GetGradient())                                 get summed gradient
     * .add(new UpdateSgdModel(maxIter, epsilon ...))          update coefficient
     * .setCompareCriterionOfNode0(new IterTermination())             judge stop of iteration
     */
    DataSet<Row> model = new IterativeComQueue()
        .initWithPartitionedData(OptimVariable.trainData, trainData)
        .initWithBroadcastData(OptimVariable.model, coefVec)
        .initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
        .add(new PreallocateCoefficient(OptimVariable.minCoef))
        .add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
        .add(new PreallocateVector(OptimVariable.dir, new double[2]))
        .add(new CalcSubGradient(miniBatchFraction))
        .add(new AllReduce(OptimVariable.gradAllReduce))
        .add(new GetGradient())
        .add(new UpdateSgdModel(maxIter, epsilon, learnRate, OptimMethod.SGD))
        .setCompareCriterionOfNode0(new IterTermination())
        .closeWith(new OutputModel())
        .setMaxIter(maxIter)
        .exec();

    return model.mapPartition(new ParseRowModel());
}
 
Example 16
Source File: Gd.java    From Alink with Apache License 2.0
/**
 * optimizer api.
 *
 * @return the coefficient of linear problem and loss curve values.
 */
@Override
public DataSet<Tuple2<DenseVector, double[]>> optimize() {
    //get parameters.
    int maxIter = params.get(HasMaxIterDefaultAs100.MAX_ITER);
    if (null == this.coefVec) {
        initCoefZeros();
    }
    int numSearchStep = params.get(HasNumSearchStepDv4.NUM_SEARCH_STEP);

    /**
     * solve problem using iteration.
     * trainData is the distributed samples.
     * initCoef is the initial model coefficient, which will be broadcast to every worker.
     * objFuncSet is the object function in dataSet format
     *
     * .add(new PreallocateCoefficient(OptimName.currentCoef)) allocate memory for current coefficient
     * .add(new PreallocateCoefficient(OptimName.minCoef))     allocate memory for min loss coefficient
     * .add(new PreallocateVector(OptimName.dir ...))          allocate memory for grad
     * .add(new CalcGradient(objFunc))                         calculate local sub gradient
     * .add(new AllReduce(OptimName.gradAllReduce))            sum all sub gradient with allReduce
     * .add(new GetGradient())                                 get summed gradient
     * .add(new CalcLosses(objFunc, OptimMethod.GD))           calculate local losses for line search
     * .add(new AllReduce(OptimName.lossAllReduce))            sum all losses with allReduce
     * .add(new UpdateModel(maxIter, epsilon ...))             update coefficient
     * .setCompareCriterionOfNode0(new IterTermination())             judge stop of iteration
     */
    DataSet<Row> model = new IterativeComQueue()
        .initWithPartitionedData(OptimVariable.trainData, trainData)
        .initWithBroadcastData(OptimVariable.model, coefVec)
        .initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
        .add(new PreallocateCoefficient(OptimVariable.currentCoef))
        .add(new PreallocateCoefficient(OptimVariable.minCoef))
        .add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
        .add(new PreallocateVector(OptimVariable.dir, new double[] {0.0, OptimVariable.learningRate}))
        .add(new CalcGradient())
        .add(new AllReduce(OptimVariable.gradAllReduce))
        .add(new GetGradient())
        .add(new CalcLosses(OptimMethod.GD, numSearchStep))
        .add(new AllReduce(OptimVariable.lossAllReduce))
        .add(new UpdateModel(params, OptimVariable.dir, OptimMethod.GD, numSearchStep))
        .setCompareCriterionOfNode0(new IterTermination())
        .closeWith(new OutputModel())
        .setMaxIter(maxIter)
        .exec();

    return model.mapPartition(new ParseRowModel());
}
 
Example 17
Source File: FlinkFactDistinctColumns.java    From kylin with Apache License 2.0
@Override
protected void execute(OptionsHelper optionsHelper) throws Exception {
    String cubeName = optionsHelper.getOptionValue(OPTION_CUBE_NAME);
    String metaUrl = optionsHelper.getOptionValue(OPTION_META_URL);
    String segmentId = optionsHelper.getOptionValue(OPTION_SEGMENT_ID);
    String hiveTable = optionsHelper.getOptionValue(OPTION_INPUT_TABLE);
    String inputPath = optionsHelper.getOptionValue(OPTION_INPUT_PATH);
    String outputPath = optionsHelper.getOptionValue(OPTION_OUTPUT_PATH);
    String counterPath = optionsHelper.getOptionValue(OPTION_COUNTER_PATH);
    int samplingPercent = Integer.parseInt(optionsHelper.getOptionValue(OPTION_STATS_SAMPLING_PERCENT));
    String enableObjectReuseOptValue = optionsHelper.getOptionValue(OPTION_ENABLE_OBJECT_REUSE);

    Job job = Job.getInstance();
    FileSystem fs = HadoopUtil.getWorkingFileSystem(job.getConfiguration());
    HadoopUtil.deletePath(job.getConfiguration(), new Path(outputPath));

    final SerializableConfiguration sConf = new SerializableConfiguration(job.getConfiguration());
    KylinConfig envConfig = AbstractHadoopJob.loadKylinConfigFromHdfs(sConf, metaUrl);

    final CubeInstance cubeInstance = CubeManager.getInstance(envConfig).getCube(cubeName);

    final FactDistinctColumnsReducerMapping reducerMapping = new FactDistinctColumnsReducerMapping(cubeInstance);
    final int totalReducer = reducerMapping.getTotalReducerNum();

    logger.info("getTotalReducerNum: {}", totalReducer);
    logger.info("getCuboidRowCounterReducerNum: {}", reducerMapping.getCuboidRowCounterReducerNum());
    logger.info("counter path {}", counterPath);

    boolean isSequenceFile = JoinedFlatTable.SEQUENCEFILE.equalsIgnoreCase(envConfig.getFlatTableStorageFormat());

    // calculate source record bytes size
    final String bytesWrittenName = "byte-writer-counter";
    final String recordCounterName = "record-counter";

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    if (!StringUtil.isEmpty(enableObjectReuseOptValue) &&
            enableObjectReuseOptValue.equalsIgnoreCase("true")) {
        env.getConfig().enableObjectReuse();
    }

    DataSet<String[]> recordDataSet = FlinkUtil.readHiveRecords(isSequenceFile, env, inputPath, hiveTable, job);

    // read record from flat table
    // output:
    //   1, statistic
    //   2, field value of dict col
    //   3, min/max field value of not dict col
    DataSet<Tuple2<SelfDefineSortableKey, Text>> flatOutputDataSet = recordDataSet.mapPartition(
            new FlatOutputMapPartitionFunction(sConf, cubeName, segmentId, metaUrl, samplingPercent,
                    bytesWrittenName, recordCounterName));

    // repartition data, make each reducer handle only one col data or the statistic data
    DataSet<Tuple2<SelfDefineSortableKey, Text>> partitionDataSet = flatOutputDataSet
            .partitionCustom(new FactDistinctColumnPartitioner(cubeName, metaUrl, sConf), 0)
            .setParallelism(totalReducer);

    // multiple output result
    // 1, CFG_OUTPUT_COLUMN: field values of dict col, which will not be built in reducer, like globalDictCol
    // 2, CFG_OUTPUT_DICT: dictionary object built in reducer
    // 3, CFG_OUTPUT_STATISTICS: cube statistic: hll of cuboids ...
    // 4, CFG_OUTPUT_PARTITION: dimension value range(min,max)
    DataSet<Tuple2<String, Tuple3<Writable, Writable, String>>> outputDataSet = partitionDataSet
            .mapPartition(new MultiOutputMapPartitionFunction(sConf, cubeName, segmentId, metaUrl, samplingPercent))
            .setParallelism(totalReducer);

    // make each reducer output to respective dir
    MultipleOutputs.addNamedOutput(job, BatchConstants.CFG_OUTPUT_COLUMN, SequenceFileOutputFormat.class,
            NullWritable.class, Text.class);
    MultipleOutputs.addNamedOutput(job, BatchConstants.CFG_OUTPUT_DICT, SequenceFileOutputFormat.class,
            NullWritable.class, ArrayPrimitiveWritable.class);
    MultipleOutputs.addNamedOutput(job, BatchConstants.CFG_OUTPUT_STATISTICS, SequenceFileOutputFormat.class,
            LongWritable.class, BytesWritable.class);
    MultipleOutputs.addNamedOutput(job, BatchConstants.CFG_OUTPUT_PARTITION, TextOutputFormat.class,
            NullWritable.class, LongWritable.class);

    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    FileOutputFormat.setCompressOutput(job, false);

    // prevent to create zero-sized default output
    LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);

    outputDataSet.output(new HadoopMultipleOutputFormat(new LazyOutputFormat(), job));

    JobExecutionResult jobExecutionResult =
            env.execute("Fact distinct columns for:" + cubeName + " segment " + segmentId);
    Map<String, Object> accumulatorResults = jobExecutionResult.getAllAccumulatorResults();
    Long recordCount = (Long) accumulatorResults.get(recordCounterName);
    Long bytesWritten = (Long) accumulatorResults.get(bytesWrittenName);
    logger.info("Map input records={}", recordCount);
    logger.info("HDFS Read: {} HDFS Write", bytesWritten);
    logger.info("HDFS: Number of bytes written=" + FlinkBatchCubingJobBuilder2.getFileSize(outputPath, fs));

    Map<String, String> counterMap = Maps.newHashMap();
    counterMap.put(ExecutableConstants.SOURCE_RECORDS_COUNT, String.valueOf(recordCount));
    counterMap.put(ExecutableConstants.SOURCE_RECORDS_SIZE, String.valueOf(bytesWritten));

    // save counter to hdfs
    HadoopUtil.writeToSequenceFile(job.getConfiguration(), counterPath, counterMap);
}
 
Example 18
Source File: DataSetUtils.java    From flink with Apache License 2.0
/**
 * Generate a sample of DataSet by the probability fraction of each element.
 *
 * @param withReplacement Whether element can be selected more than once.
 * @param fraction        Probability that each element is chosen, should be [0,1] without replacement,
 *                        and [0, ∞) with replacement. While fraction is larger than 1, the elements are
 *                        expected to be selected multi times into sample on average.
 * @param seed            random number generator seed.
 * @return The sampled DataSet
 */
public static <T> MapPartitionOperator<T, T> sample(
	DataSet <T> input,
	final boolean withReplacement,
	final double fraction,
	final long seed) {

	return input.mapPartition(new SampleWithFraction<T>(withReplacement, fraction, seed));
}
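A minimal usage sketch of the fraction-based sample via Flink's org.apache.flink.api.java.utils.DataSetUtils (class name and data are made up): each element is kept with probability 0.05, so the sample size is only approximately 50, and the fixed seed makes the draw reproducible.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.utils.DataSetUtils;

public class SampleWithFractionSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Long> numbers = env.generateSequence(1, 1000);

        // Bernoulli sampling without replacement: each element is kept with probability 0.05.
        DataSet<Long> sampled = DataSetUtils.sample(numbers, false, 0.05, 4711L);
        sampled.print();
    }
}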
 
Example 19
Source File: DataSetUtils.java    From flink with Apache License 2.0
/**
 * Generate a sample of DataSet by the probability fraction of each element.
 *
 * @param withReplacement Whether element can be selected more than once.
 * @param fraction        Probability that each element is chosen, should be [0,1] without replacement,
 *                        and [0, ∞) with replacement. While fraction is larger than 1, the elements are
 *                        expected to be selected multi times into sample on average.
 * @param seed            random number generator seed.
 * @return The sampled DataSet
 */
public static <T> MapPartitionOperator<T, T> sample(
	DataSet <T> input,
	final boolean withReplacement,
	final double fraction,
	final long seed) {

	return input.mapPartition(new SampleWithFraction<T>(withReplacement, fraction, seed));
}
 
Example 20
Source File: DataSetUtils.java    From Flink-CEPplus with Apache License 2.0
/**
 * Generate a sample of DataSet by the probability fraction of each element.
 *
 * @param withReplacement Whether element can be selected more than once.
 * @param fraction        Probability that each element is chosen, should be [0,1] without replacement,
 *                        and [0, ∞) with replacement. While fraction is larger than 1, the elements are
 *                        expected to be selected multi times into sample on average.
 * @param seed            random number generator seed.
 * @return The sampled DataSet
 */
public static <T> MapPartitionOperator<T, T> sample(
	DataSet <T> input,
	final boolean withReplacement,
	final double fraction,
	final long seed) {

	return input.mapPartition(new SampleWithFraction<T>(withReplacement, fraction, seed));
}