Java Code Examples for org.apache.spark.api.java.JavaPairRDD#unpersist()

The following examples show how to use org.apache.spark.api.java.JavaPairRDD#unpersist(). Each example is taken from an open-source project; the source file and license are listed above the code.
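
To set the stage for the project examples, here is a minimal, self-contained sketch of the typical persist/unpersist lifecycle on a JavaPairRDD. The class name, application name, and data are made up for illustration; only standard Spark Java API calls are used.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class UnpersistLifecycleSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("UnpersistLifecycleSketch").setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    try {
      // Build a small pair RDD and cache it so the two actions below reuse the same blocks.
      JavaPairRDD<String, Integer> pairs = jsc
          .parallelize(Arrays.asList("a", "b", "a", "c"))
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum)
          .persist(StorageLevel.MEMORY_AND_DISK());

      long distinctKeys = pairs.count();                 // first action materializes the cache
      int total = pairs.values().reduce(Integer::sum);   // second action reuses the cached blocks

      // Release the cached blocks once the RDD is no longer needed;
      // unpersist(false) is the non-blocking variant.
      pairs.unpersist();

      System.out.println(distinctKeys + " keys, sum = " + total);
    } finally {
      jsc.stop();
    }
  }
}
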
Example 1
Source File: HoodieBloomIndex.java    From hudi with Apache License 2.0
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc,
                                            HoodieTable<T> hoodieTable) {

  // Step 0: cache the input record RDD
  if (config.getBloomIndexUseCaching()) {
    recordRDD.persist(SparkConfigUtils.getBloomIndexInputStorageLevel(config.getProps()));
  }

  // Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
  JavaPairRDD<String, String> partitionRecordKeyPairRDD =
      recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), record.getRecordKey()));

  // Step 2: Look up the index for all partition/recordKey pairs
  JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD =
      lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);

  // Step 3: Cache the result for subsequent stages.
  if (config.getBloomIndexUseCaching()) {
    keyFilenamePairRDD.persist(StorageLevel.MEMORY_AND_DISK_SER());
  }
  if (LOG.isDebugEnabled()) {
    long totalTaggedRecords = keyFilenamePairRDD.count();
    LOG.debug("Number of update records (ones tagged with a fileID): " + totalTaggedRecords);
  }

  // Step 4: Tag the incoming records, as inserts or updates, by joining with existing record keys
  // Cost: 4 sec.
  JavaRDD<HoodieRecord<T>> taggedRecordRDD = tagLocationBacktoRecords(keyFilenamePairRDD, recordRDD);

  if (config.getBloomIndexUseCaching()) {
    recordRDD.unpersist(); // unpersist the input Record RDD
    keyFilenamePairRDD.unpersist();
  }
  return taggedRecordRDD;
}
 
Example 2
Source File: SimpleNovelAdjacencyInterpreter.java    From gatk with BSD 3-Clause "New" or "Revised" License
public static List<VariantContext> makeInterpretation(final JavaRDD<AssemblyContigWithFineTunedAlignments> contigsWithSimpleChimera,
                                                      final SvDiscoveryInputMetaData svDiscoveryInputMetaData) {

    final JavaPairRDD<SimpleNovelAdjacencyAndChimericAlignmentEvidence, List<SvType>> narlAndAltSeqAndEvidenceAndTypes =
            SimpleNovelAdjacencyInterpreter
                    .inferTypeFromSingleContigSimpleChimera(contigsWithSimpleChimera, svDiscoveryInputMetaData).cache();

    try {
        final List<NovelAdjacencyAndAltHaplotype> narls = narlAndAltSeqAndEvidenceAndTypes.keys()
                .map(SimpleNovelAdjacencyAndChimericAlignmentEvidence::getNovelAdjacencyReferenceLocations).collect();
        evaluateNarls(svDiscoveryInputMetaData, narls);

        final Broadcast<ReferenceMultiSparkSource> referenceBroadcast = svDiscoveryInputMetaData.getReferenceData().getReferenceBroadcast();
        final Broadcast<SAMSequenceDictionary> referenceSequenceDictionaryBroadcast =
                svDiscoveryInputMetaData.getReferenceData().getReferenceSequenceDictionaryBroadcast();
        final String sampleId = svDiscoveryInputMetaData.getSampleSpecificData().getSampleId();
        final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = svDiscoveryInputMetaData.getSampleSpecificData().getCnvCallsBroadcast();
        final List<VariantContext> annotatedSimpleVariants =
                narlAndAltSeqAndEvidenceAndTypes
                        .flatMap(pair ->
                                turnIntoVariantContexts(pair, sampleId, referenceBroadcast,
                                        referenceSequenceDictionaryBroadcast, cnvCallsBroadcast)
                        )
                        .collect();

        // releasing narlAndAltSeqAndEvidenceAndTypes is handled by the finally block below
        return annotatedSimpleVariants;

    } finally {
        narlAndAltSeqAndEvidenceAndTypes.unpersist();
    }
}
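
Example 2 caches the pair RDD, runs two Spark jobs against it (collecting the keys, then flat-mapping to variant contexts), and releases it in a finally block so the cached blocks are freed even if one of the jobs fails. Stripped of the GATK-specific types, the pattern looks roughly like this; the class and method names below are illustrative, not part of GATK.

import org.apache.spark.api.java.JavaPairRDD;

public final class CacheThenUnpersistInFinally {
  // Cache a pair RDD, reuse it across two actions, and always release it afterwards.
  public static <K, V> long useTwiceThenRelease(JavaPairRDD<K, V> input) {
    JavaPairRDD<K, V> cached = input.cache();  // materialized by the first action below
    try {
      long numPairs = cached.count();          // first job populates the cache
      cached.keys().collect();                 // second job reuses the cached blocks
      return numPairs;
    } finally {
      cached.unpersist();                      // free executor memory/disk even on failure
    }
  }
}
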
 
Example 3
Source File: RDDConverterUtils.java    From systemds with Apache License 2.0
/**
 * Converts a libsvm text input file into two binary block matrices for features 
 * and labels, and saves these to the specified output files. This call also deletes
 * any existing files at the specified output locations, and determines and
 * writes the metadata files of both output matrices.
 * <p>
 * Note: We use {@code org.apache.spark.mllib.util.MLUtils.loadLibSVMFile} for parsing 
 * the libsvm input files in order to ensure consistency with Spark.
 * 
 * @param sc java spark context
 * @param pathIn path to libsvm input file
 * @param pathX path to binary block output file of features
 * @param pathY path to binary block output file of labels
 * @param mcOutX matrix characteristics of output matrix X
 */
public static void libsvmToBinaryBlock(JavaSparkContext sc, String pathIn, 
		String pathX, String pathY, DataCharacteristics mcOutX)
{
	if( !mcOutX.dimsKnown() )
		throw new DMLRuntimeException("Matrix characteristics "
			+ "required to convert sparse input representation.");
	try {
		//cleanup existing output files
		HDFSTool.deleteFileIfExistOnHDFS(pathX);
		HDFSTool.deleteFileIfExistOnHDFS(pathY);
		
		//convert libsvm to labeled points
		int numFeatures = (int) mcOutX.getCols();
		int numPartitions = SparkUtils.getNumPreferredPartitions(mcOutX, null);
		JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> lpoints = 
				MLUtils.loadLibSVMFile(sc.sc(), pathIn, numFeatures, numPartitions).toJavaRDD();
		
		//append row index and best-effort caching to avoid repeated text parsing
		JavaPairRDD<org.apache.spark.mllib.regression.LabeledPoint,Long> ilpoints = 
				lpoints.zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK()); 
		
		//extract labels and convert to binary block
		DataCharacteristics mc1 = new MatrixCharacteristics(mcOutX.getRows(), 1, mcOutX.getBlocksize(), -1);
		LongAccumulator aNnz1 = sc.sc().longAccumulator("nnz");
		JavaPairRDD<MatrixIndexes,MatrixBlock> out1 = ilpoints
				.mapPartitionsToPair(new LabeledPointToBinaryBlockFunction(mc1, true, aNnz1));
		int numPartitions2 = SparkUtils.getNumPreferredPartitions(mc1, null);
		out1 = RDDAggregateUtils.mergeByKey(out1, numPartitions2, false);
		out1.saveAsHadoopFile(pathY, MatrixIndexes.class, MatrixBlock.class, SequenceFileOutputFormat.class);
		mc1.setNonZeros(aNnz1.value()); //update nnz after triggered save
		HDFSTool.writeMetaDataFile(pathY+".mtd", ValueType.FP64, mc1, OutputInfo.BinaryBlockOutputInfo);
		
		//extract data and convert to binary block
		DataCharacteristics mc2 = new MatrixCharacteristics(mcOutX.getRows(), mcOutX.getCols(), mcOutX.getBlocksize(), -1);
		LongAccumulator aNnz2 = sc.sc().longAccumulator("nnz");
		JavaPairRDD<MatrixIndexes,MatrixBlock> out2 = ilpoints
				.mapPartitionsToPair(new LabeledPointToBinaryBlockFunction(mc2, false, aNnz2));
		out2 = RDDAggregateUtils.mergeByKey(out2, numPartitions, false);
		out2.saveAsHadoopFile(pathX, MatrixIndexes.class, MatrixBlock.class, SequenceFileOutputFormat.class);
		mc2.setNonZeros(aNnz2.value()); //update nnz after triggered save
		HDFSTool.writeMetaDataFile(pathX+".mtd", ValueType.FP64, mc2, OutputInfo.BinaryBlockOutputInfo);
		
		//asynchronous cleanup of cached intermediates
		ilpoints.unpersist(false);
	}
	catch(IOException ex) {
		throw new DMLRuntimeException(ex);
	}
}
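
A hedged sketch of how this converter might be invoked. The paths, dimensions, and block size below are hypothetical, and the SystemDS imports (which differ between versions) are omitted; only the method signature shown in the example above is assumed.

// Hypothetical call site for the converter above; paths and matrix dimensions are made up.
JavaSparkContext sc = new JavaSparkContext("local[*]", "libsvmToBinaryBlock");
// rows, cols, blocksize, nnz (-1 = unknown), mirroring the MatrixCharacteristics usage above
DataCharacteristics mcX = new MatrixCharacteristics(1000000, 784, 1000, -1);
RDDConverterUtils.libsvmToBinaryBlock(sc, "hdfs:/data/input.libsvm",
    "hdfs:/data/X.binary", "hdfs:/data/Y.binary", mcX);
sc.stop();
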
 
Example 4
Source File: RDDConverterUtils.java    From systemds with Apache License 2.0
/**
 * Converts a libsvm text input file into two binary block matrices for features 
 * and labels, and saves these to the specified output files. This call also deletes
 * any existing files at the specified output locations, and determines and
 * writes the metadata files of both output matrices.
 * <p>
 * Note: We use {@code org.apache.spark.mllib.util.MLUtils.loadLibSVMFile} for parsing 
 * the libsvm input files in order to ensure consistency with Spark.
 * 
 * @param sc java spark context
 * @param pathIn path to libsvm input file
 * @param pathX path to binary block output file of features
 * @param pathY path to binary block output file of labels
 * @param mcOutX matrix characteristics of output matrix X
 */
public static void libsvmToBinaryBlock(JavaSparkContext sc, String pathIn, 
		String pathX, String pathY, DataCharacteristics mcOutX)
{
	if( !mcOutX.dimsKnown() )
		throw new DMLRuntimeException("Matrix characteristics "
			+ "required to convert sparse input representation.");
	try {
		//cleanup existing output files
		HDFSTool.deleteFileIfExistOnHDFS(pathX);
		HDFSTool.deleteFileIfExistOnHDFS(pathY);
		
		//convert libsvm to labeled points
		int numFeatures = (int) mcOutX.getCols();
		int numPartitions = SparkUtils.getNumPreferredPartitions(mcOutX, null);
		JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> lpoints = 
				MLUtils.loadLibSVMFile(sc.sc(), pathIn, numFeatures, numPartitions).toJavaRDD();
		
		//append row index and best-effort caching to avoid repeated text parsing
		JavaPairRDD<org.apache.spark.mllib.regression.LabeledPoint,Long> ilpoints = 
				lpoints.zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK()); 
		
		//extract labels and convert to binary block
		DataCharacteristics mc1 = new MatrixCharacteristics(mcOutX.getRows(), 1, mcOutX.getBlocksize(), -1);
		LongAccumulator aNnz1 = sc.sc().longAccumulator("nnz");
		JavaPairRDD<MatrixIndexes,MatrixBlock> out1 = ilpoints
				.mapPartitionsToPair(new LabeledPointToBinaryBlockFunction(mc1, true, aNnz1));
		int numPartitions2 = SparkUtils.getNumPreferredPartitions(mc1, null);
		out1 = RDDAggregateUtils.mergeByKey(out1, numPartitions2, false);
		out1.saveAsHadoopFile(pathY, MatrixIndexes.class, MatrixBlock.class, SequenceFileOutputFormat.class);
		mc1.setNonZeros(aNnz1.value()); //update nnz after triggered save
		HDFSTool.writeMetaDataFile(pathY+".mtd", ValueType.FP64, mc1, FileFormat.BINARY);
		
		//extract data and convert to binary block
		DataCharacteristics mc2 = new MatrixCharacteristics(mcOutX.getRows(), mcOutX.getCols(), mcOutX.getBlocksize(), -1);
		LongAccumulator aNnz2 = sc.sc().longAccumulator("nnz");
		JavaPairRDD<MatrixIndexes,MatrixBlock> out2 = ilpoints
				.mapPartitionsToPair(new LabeledPointToBinaryBlockFunction(mc2, false, aNnz2));
		out2 = RDDAggregateUtils.mergeByKey(out2, numPartitions, false);
		out2.saveAsHadoopFile(pathX, MatrixIndexes.class, MatrixBlock.class, SequenceFileOutputFormat.class);
		mc2.setNonZeros(aNnz2.value()); //update nnz after triggered save
		HDFSTool.writeMetaDataFile(pathX+".mtd", ValueType.FP64, mc2, FileFormat.BINARY);
		
		//asynchronous cleanup of cached intermediates
		ilpoints.unpersist(false);
	}
	catch(IOException ex) {
		throw new DMLRuntimeException(ex);
	}
}
 
Example 5
Source File: ContigChimericAlignmentIterativeInterpreter.java    From gatk with BSD 3-Clause "New" or "Revised" License
public static List<VariantContext> discoverVariantsFromChimeras(final SvDiscoveryInputMetaData svDiscoveryInputMetaData,
                                                                final JavaRDD<AlignedContig> alignedContigs) {

    final Broadcast<SAMSequenceDictionary> referenceSequenceDictionaryBroadcast =
            svDiscoveryInputMetaData.getReferenceData().getReferenceSequenceDictionaryBroadcast();

    // step 1: filter alignments and extract chimera pair
    final JavaPairRDD<byte[], List<SimpleChimera>> contigSeqAndChimeras =
            alignedContigs
                    .filter(alignedContig -> alignedContig.getAlignments().size() > 1)
                    .mapToPair(alignedContig -> {
                        final List<SimpleChimera> chimeras =
                                parseOneContig(alignedContig, referenceSequenceDictionaryBroadcast.getValue(),
                                true, DEFAULT_MIN_ALIGNMENT_LENGTH,
                                CHIMERIC_ALIGNMENTS_HIGHMQ_THRESHOLD, true);
                        return new Tuple2<>(alignedContig.getContigSequence(), chimeras);
                    });

    final Broadcast<ReferenceMultiSparkSource> referenceBroadcast = svDiscoveryInputMetaData.getReferenceData().getReferenceBroadcast();
    final List<SVInterval> assembledIntervals = svDiscoveryInputMetaData.getSampleSpecificData().getAssembledIntervals();
    final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = svDiscoveryInputMetaData.getSampleSpecificData().getCnvCallsBroadcast();
    final String sampleId = svDiscoveryInputMetaData.getSampleSpecificData().getSampleId();
    final StructuralVariationDiscoveryArgumentCollection.DiscoverVariantsFromContigAlignmentsSparkArgumentCollection discoverStageArgs = svDiscoveryInputMetaData.getDiscoverStageArgs();
    final Logger toolLogger = svDiscoveryInputMetaData.getToolLogger();

    // step 2: extract novel adjacency
    final JavaPairRDD<NovelAdjacencyAndAltHaplotype, Iterable<SimpleChimera>> narlsAndSources =
            contigSeqAndChimeras
                    .flatMapToPair(tigSeqAndChimeras -> {
                        final byte[] contigSeq = tigSeqAndChimeras._1;
                        final List<SimpleChimera> simpleChimeras = tigSeqAndChimeras._2;
                        final Stream<Tuple2<NovelAdjacencyAndAltHaplotype, SimpleChimera>> novelAdjacencyAndSourceChimera =
                                simpleChimeras.stream()
                                        .map(ca -> new Tuple2<>(
                                                new NovelAdjacencyAndAltHaplotype(ca, contigSeq,
                                                        referenceSequenceDictionaryBroadcast.getValue()), ca));
                        return novelAdjacencyAndSourceChimera.iterator();
                    })
                    .groupByKey()   // group the same novel adjacency produced by different contigs together
                    .cache();


    try { // step 3: evaluate performance and turn novel adjacencies into variant contexts
        SvDiscoveryUtils.evaluateIntervalsAndNarls(assembledIntervals, narlsAndSources.map(Tuple2::_1).collect(),
                referenceSequenceDictionaryBroadcast.getValue(), discoverStageArgs, toolLogger);

        return narlsAndSources
                        .mapToPair(noveltyAndEvidence -> new Tuple2<>(inferSimpleTypeFromNovelAdjacency(noveltyAndEvidence._1, referenceBroadcast.getValue()),       // type inference based on novel adjacency and evidence alignments
                                new SimpleNovelAdjacencyAndChimericAlignmentEvidence(noveltyAndEvidence._1, noveltyAndEvidence._2)))
                        .map(noveltyTypeAndEvidence ->
                                AnnotatedVariantProducer
                                    .produceAnnotatedVcFromAssemblyEvidence(
                                            noveltyTypeAndEvidence._1, noveltyTypeAndEvidence._2,
                                            referenceBroadcast,
                                            referenceSequenceDictionaryBroadcast,
                                            cnvCallsBroadcast,
                                            sampleId).make()
                        )
                        .collect();
    } finally {
        narlsAndSources.unpersist();
    }
}
 
Example 6
Source File: SparkExecutionContext.java    From systemds with Apache License 2.0
/**
 * This call removes an rdd variable from executor memory and disk if required.
 * Hence, it is intended to be used on rmvar only. Depending on the
 * ASYNCHRONOUS_VAR_DESTROY configuration, this is asynchronous or not.
 *
 * @param rvar rdd variable to remove
 */
public static void cleanupRDDVariable(JavaPairRDD<?,?> rvar)
{
	if( rvar.getStorageLevel()!=StorageLevel.NONE() ) {
		rvar.unpersist( !ASYNCHRONOUS_VAR_DESTROY );
	}
}
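
The boolean that this helper forwards to unpersist controls whether the call waits for the cached blocks to actually be removed. A minimal generic sketch of the same idea; the class and method names are illustrative.

import org.apache.spark.api.java.JavaPairRDD;

public final class UnpersistBlockingSketch {
  // blocking = true  -> unpersist returns only after all cached blocks have been removed
  // blocking = false -> unpersist returns immediately and removal happens in the background
  public static <K, V> void release(JavaPairRDD<K, V> rdd, boolean blocking) {
    rdd.unpersist(blocking);
  }
}
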
 