Spark implementations of two data sampling methods for imbalanced data: random oversampling (ROS) and random undersampling (RUS).
Parameters for random undersampling (runRUS)
"path-to-header" "path-to-train" "number-of-partition" "name-of-majority-class" "name-of-minority-class" "pathOutput"
spark-submit --class org.apache.spark.mllib.sampling.runRUS Imb-sampling-1.0.jar hdfs://hadoop-master/datasets/data.header hdfs://hadoop-master/datasets/train.data 250 0 1 hdfs://hadoop-master/datasets/train-under.data
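To illustrate the idea behind random undersampling, here is a minimal plain-Python sketch. It is not the package's Spark code. It assumes that the majority class is sampled without replacement down to the size of the minority class, which this README does not state. The function name `random_undersample`, the `(features, label)` row layout, and the `seed` argument are hypothetical and chosen only for illustration. The labels "0" and "1" mirror the majority and minority class names passed in the command above.

```python
import random

def random_undersample(rows, majority, minority, seed=0):
    """Hypothetical helper: rows is a list of (features, label) pairs."""
    rng = random.Random(seed)
    maj = [r for r in rows if r[1] == majority]
    mino = [r for r in rows if r[1] == minority]
    # Assumption: keep every minority row and draw an equally sized
    # subset of the majority class, without replacement.
    kept = rng.sample(maj, min(len(maj), len(mino)))
    out = mino + kept
    rng.shuffle(out)
    return out

# Toy data: 90 majority rows (label "0") and 10 minority rows (label "1").
rus_data = [((i,), "0") for i in range(90)] + [((i,), "1") for i in range(10)]
rus_balanced = random_undersample(rus_data, majority="0", minority="1")
```

Under this assumption the result keeps all minority rows and discards majority rows at random, so the output is smaller than the input.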
Parameters for random oversampling (runROS)
"path-to-header" "path-to-train" "number-of-partition" "number-of-repartition" "name-of-majority-class" "name-of-minority-class" "oversampling-rate" "pathOutput"
spark-submit --class org.apache.spark.mllib.sampling.runROS Imb-sampling-1.0.jar hdfs://hadoop-master/datasets/data.header hdfs://hadoop-master/datasets/train.data 100 250 0 1 2.0 hdfs://hadoop-master/datasets/train-over.data
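The idea behind random oversampling can be sketched in plain Python as well. This is again not the package's Spark code. This README does not define "oversampling-rate", so the sketch assumes it is the target size of the minority class relative to the majority class (1.0 would balance the two classes, 2.0 would make the minority class twice as large). The function name `random_oversample`, the `(features, label)` row layout, and the `seed` argument are hypothetical.

```python
import random

def random_oversample(rows, majority, minority, rate=1.0, seed=0):
    """Hypothetical helper: rows is a list of (features, label) pairs."""
    rng = random.Random(seed)
    maj = [r for r in rows if r[1] == majority]
    mino = [r for r in rows if r[1] == minority]
    if not mino:
        raise ValueError("no minority rows to replicate")
    # Assumption: rate is the desired minority/majority size ratio.
    target = max(len(mino), round(rate * len(maj)))
    # Replicate minority rows by drawing with replacement.
    extra = [rng.choice(mino) for _ in range(target - len(mino))]
    out = maj + mino + extra
    rng.shuffle(out)
    return out

# Toy data: 90 majority rows (label "0") and 10 minority rows (label "1").
ros_data = [((i,), "0") for i in range(90)] + [((i,), "1") for i in range(10)]
ros_result = random_oversample(ros_data, majority="0", minority="1", rate=2.0)
```

Unlike undersampling, no rows are discarded: every original row is kept and only duplicated minority rows are added, so the output is larger than the input.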
Developed by: Sara del Río García ([email protected])
Maintained by: Sergio Ramírez ([email protected]) / @sramirez