TransformPy

transformpy is a Python 2/3 module for applying transforms to "streams" of data. The transforms can be applied to any Python iterable, and so can be used for continuous real-time streams or static streams (such as the contents of a file). It is designed to use very little memory, except where clustering and/or aggregation routines need to buffer data. It was originally designed to allow Python transformations (maps and reductions) of data stored in Hive, using the Hadoop streaming paradigm.

NOTE: TransformPy is not guaranteed to be API stable before version 1.0, but any changes to the current version should be small.

Installing TransformPy

Run the following from inside a checked-out transformpy repository:

$ pip install --user .

Using TransformPy

Everything TransformPy does can be done manually without too much effort; its purpose is to make it simpler to get things done quickly, and then to debug the result. The entire TransformPy workflow consists of building "pipelines".

To create a new pipeline, create a new Transform object and then add as many pipes to it as you like. For example, a simple no-op pipeline might look like the following (assuming that it is run in a script and that the iterable source is sys.stdin):

Transform()\
    .input(HiveToDictInput, fields=['id', 'name', 'pet'])\
    .output(DictToHiveOutput, fields=['id', 'name', 'pet'])\
    .apply(sys.stdin)
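
To make the data flow concrete, the sketch below mimics in plain Python what such a no-op pipeline presumably does to each row. It assumes that rows arrive as tab-separated lines, which is Hive's default format for streaming scripts. The functions rows_to_dicts and dicts_to_rows are hypothetical stand-ins and are not part of TransformPy; the real pipes may treat escaping and NULL values differently.

```python
import io

FIELDS = ['id', 'name', 'pet']


def rows_to_dicts(lines, fields):
    """Hypothetical stand-in for an input pipe: lazily parse tab-separated lines into dicts."""
    for line in lines:
        values = line.rstrip('\n').split('\t')
        yield dict(zip(fields, values))


def dicts_to_rows(records, fields):
    """Hypothetical stand-in for an output pipe: lazily serialise dicts back to lines."""
    for record in records:
        yield '\t'.join(str(record[field]) for field in fields) + '\n'


# io.StringIO plays the role of sys.stdin here; any iterable of lines works.
source = io.StringIO(u'1\tAlice\tcat\n2\tBob\tdog\n')
output = list(dicts_to_rows(rows_to_dicts(source, FIELDS), FIELDS))
```

Because both functions are generators, only one row needs to be held in memory at a time, which is the property the pipeline design aims for.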

In the no-op pipeline above, two pipes are used: HiveToDictInput and DictToHiveOutput, with types TransformType.SOURCE and TransformType.SINK respectively. There are currently eight different types of pipes (though the differences are largely semantic):

Pipes of these types are added to the pipeline using the following methods of Transform, respectively:

All of these methods accept a function object (in which case it is wrapped in a FunctionWrapperPipe object), an instance of a pipe class of the appropriate type, or a pipe class followed by its instantiation arguments. For example, all of the following are valid (and equivalent):

There are a couple of special methods on Transform objects, including:

Pipes defined by TransformPy

Please note that this set of pipes will grow over time, though it is small at present. Please do contribute your pipes if you think they are useful.

transformpy[.base]

The pipes in this section are listed for completeness. In general, you will not need to interact with them manually, since Transform instances use them automatically as necessary.

transformpy.pipes.hive

transformpy.pipes.clustering

transformpy.pipes.field

transformpy.pipes.debug

Defining Custom Pipes

The chances are you will want (and need) to define your own logic. Doing so is very easy in TransformPy. You can either define a function, and refer to it in the pipeline, or define a class (which will allow you to be smarter with memory, etc.). To use a function, simply pass it to the appropriate method of Transform, as described above.

To create an input pipe, simply subclass transformpy.SourcePipe. You will need to specify at least two methods: init and apply. It is important that apply returns an iterable object. Make use of yield if your methods do not give rise naturally to such an object.
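
As a rough illustration of this pattern, the sketch below defines a hypothetical source pipe, CsvLineSource, that turns delimited text lines into dicts. So that it runs on its own, it declares a minimal stand-in for SourcePipe; with the library installed you would subclass transformpy.SourcePipe instead. The stand-in assumes that the base class forwards constructor arguments to init and that apply receives the raw iterable, neither of which is confirmed here.

```python
class SourcePipe(object):
    """Minimal stand-in for transformpy.SourcePipe so the sketch runs on its own.

    Assumption: the real base class forwards constructor arguments to init().
    """

    def __init__(self, *args, **kwargs):
        self.init(*args, **kwargs)


class CsvLineSource(SourcePipe):
    """Hypothetical source pipe: turns delimited text lines into dicts."""

    def init(self, fields, delimiter=','):
        self.fields = fields
        self.delimiter = delimiter

    def apply(self, data):
        # Using yield makes apply a generator, so rows are produced lazily.
        for line in data:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            yield dict(zip(self.fields, line.split(self.delimiter)))


source = CsvLineSource(['id', 'name'])
rows = list(source.apply(['1,Alice\n', '\n', '2,Bob\n']))
```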

To create a map, cluster, or aggregate pipe, simply subclass transformpy.TransformPipe. As above, you must implement init and apply, but also type (as a property), which should return the appropriate TransformType value.

To create an output pipe/sink, simply subclass transformpy.SinkPipe. You will need to implement at least two methods: init and apply. The apply method need not return anything. In particular, it must not return an iterator (for example, by using yield): the return value is not inspected, so a lazy iterator would never be consumed.
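
A sink along these lines might be sketched as follows. LineCountingSink is a hypothetical example, and the SinkPipe class shown is only a stand-in for transformpy.SinkPipe, written under the assumption that constructor arguments are forwarded to init.

```python
import io


class SinkPipe(object):
    """Minimal stand-in for transformpy.SinkPipe so the sketch runs on its own.

    Assumption: the real base class forwards constructor arguments to init().
    """

    def __init__(self, *args, **kwargs):
        self.init(*args, **kwargs)


class LineCountingSink(SinkPipe):
    """Hypothetical sink: writes rows as tab-separated lines and counts them."""

    def init(self, stream, fields):
        self.stream = stream
        self.fields = fields
        self.rows_written = 0

    def apply(self, data):
        # A plain loop (no yield) consumes the whole stream eagerly and
        # returns None, which is what a sink should do.
        for row in data:
            line = u'\t'.join(str(row[field]) for field in self.fields)
            self.stream.write(line + u'\n')
            self.rows_written += 1


buffer = io.StringIO()
sink = LineCountingSink(buffer, ['id', 'name'])
result = sink.apply(iter([{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]))
```

If apply contained a yield, calling it would only create a generator object and nothing would be written until something iterated over it.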

For a complete example from the library itself, look at the implementation of SimpleClusteringPipe:

class SimpleClusteringPipe(TransformPipe):

    def init(self, field):
        self.field = field
        self.seen_groups = {}

    def apply(self, data):
        cur_group = None
        cur_data = []
        for row in data:
            if row[self.field] != cur_group:
                if cur_group is not None:
                    yield cur_data
                cur_data = []
                cur_group = row[self.field]
                assert cur_group not in self.seen_groups, "SimpleClusterPipe assumes that data is sorted by key. %s=%s was out of order." % (self.field, cur_group)
                self.seen_groups[cur_group] = True
            cur_data.append(row)
        yield cur_data

    @property
    def type(self):
        return TransformType.CLUSTER
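
The grouping logic in apply is close to what the standard library's itertools.groupby does: both collect consecutive rows that share a key. The sketch below uses groupby to show why sorted input matters; cluster_sorted is a hypothetical helper and not part of TransformPy. One difference is that for empty input the pipe above still yields a single empty list, whereas this helper yields nothing.

```python
from itertools import groupby
from operator import itemgetter


def cluster_sorted(rows, field):
    """Yield lists of consecutive rows that share the same value of `field`."""
    for _key, group in groupby(rows, key=itemgetter(field)):
        yield list(group)


sorted_rows = [
    {'owner': 'alice', 'pet': 'cat'},
    {'owner': 'alice', 'pet': 'dog'},
    {'owner': 'bob', 'pet': 'fish'},
]
clusters = list(cluster_sorted(sorted_rows, 'owner'))

# Unsorted input silently splits 'alice' into two clusters here, which is
# the situation the assertion in SimpleClusteringPipe guards against.
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]
split_clusters = list(cluster_sorted(unsorted_rows, 'owner'))
```

Only the rows of the current cluster are buffered at any time, so memory use is bounded by the largest group rather than by the whole stream.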