# This file is part of Volatility.
#
# Volatility is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Volatility is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Volatility.  If not, see <http://www.gnu.org/licenses/>.

""" This module implements the volatility caching subsystem.

The volatility caching subsystem has the following design goals:

 1) Ability to cache arbitrary objects - This allows complex objects
    to be cached for later retrieval. For example, objects may range
    from simple constants, such as KPCR addresses, to entire x86 page
    translation tables, or even hibernation decompression data
    structures. To achieve this we use the standard python pickle
    system. In many use cases, the cache needs to facilitate
    persistent memoising of functions and generators (more on that
    below).

 2) Cached objects are stored under a hierarchical key namespace. Keys
    are specified in a URL notation. By default, relative URLs are
    interpreted relative to the memory image location (the value of
    the --location option). This scheme allows us to specify both
    global (per installation) and per image keys. For example given an
    image located in /tmp/foobar.img:

    - file:///tmp/foobar.img/kernel/debugging/KPCR refers to this
      image's KPCR location.

    - file:///tmp/foobar.img/address_spaces/memory_translation/pdpte
      refers to the cached page tables.

    - http://www.volatility.org/schema#configuration/renderer specifies
      the currently configured renderer (i.e. it's a global setting).
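
    Relative keys are resolved against --location using
    urlparse.urljoin() (see CacheTree.check() below). For example,
    with --location file:///tmp/foobar.img (illustrative values):

        urlparse.urljoin("file:///tmp/foobar.img" + "/",
                         "kernel/debugging/KPCR")
        # -> 'file:///tmp/foobar.img/kernel/debugging/KPCR'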

 3) Storage of the cache is abstracted and selectable via the
    --cache_engine configuration variable. This separates the concrete
    storage of the cache from the abstract representation of the cache
    in a running process.

Abstraction of Cache
--------------------

Within the running volatility framework the cache appears as an
abstract tree with nodes inherited from the CacheNode class:

class CacheNode(object):
    def __init__(self, name, parent, payload = None):
        ''' Creates a new Cache node under the parent. The new node
        will carry the specified payload
        '''

    def __str__(self):
        ''' Produce a human readable version of the payload '''

    def set_payload(self, payload):
        ''' Update the current payload with the new specified payload '''

    def dump(self):
        ''' Dump the node to disk for later retrieval. This is
        normally called when the process has exited. '''

    def get_payload(self):
       ''' retrieve this node's payload '''

In order to check the cache, plugins issue the Cache.Check() function:

def Check(path, callback = None, cls = CacheNode):
    ''' Traverse the cache tree and retrieve the stored CacheNode.

    If there is no such stored CacheNode and callback is specified,
    attempt to create it using cls with the payload returned from the
    callback. If callback is not specified we just return None.
    '''

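For example (a hedged sketch - find_kpcr() is a hypothetical helper
that computes the payload on a cache miss):

    node = CACHE.check("kernel/debugging/KPCR",
                       callback = lambda: find_kpcr(address_space))
    kpcr = node.get_payload()
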
Decorators
----------
You can also use the CacheDecorator class to cache the results of any
function - this is probably the easiest way to apply caching to
existing code. For example, suppose we want to cache the results of
the psscan plugin:

class PSScan(commands.Command):
....
   @CacheDecorator("/scanners/psscan")
   def calculate(self):
       .....

This will automatically create the CacheNode at the specified tree
location (note that since the URL is given as a relative URL it is
based at the current value of the --location - that means it applies
to the current memory image only).

Note that since calculate() returns a generator, the decorator will
also return a generator - it will not iterate over the calculate
method unnecessarily, but will yield results immediately, so
performance is not compromised in the case of a cache miss.
Unfortunately this also means that if the generator is stopped
prematurely, we are unable to cache the result set in the general
case. This is the only caveat on caching generators.
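
For example, results are only dumped to the cache if the generator
runs to completion (a hedged sketch; plugin instantiation details are
elided):

    for count, proc in enumerate(PSScan(config).calculate()):
        if count > 10:
            break    ## generator aborted early - nothing is cached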

Storage classes
---------------
The cache system discussed above can be thought of as an abstract
construct in the process memory. To make it persistent on disk we have
the storage class (which can be selected using the --cache_engine
directive). The following cache engines are implemented:

File Storage
============
This is the default cache engine. We simply maintain a directory
structure which corresponds to the URL of the key after applying the
appropriate filesystem-safe escaping operation. Objects are stored in
standalone files using the pickle module.
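
For example, with --cache_directory left at its default and an image
at file:///tmp/foobar.img, the key
file:///tmp/foobar.img/kernel/debugging/KPCR is stored (roughly) as:

    ~/.cache/volatility/foobar.img.cache/kernel/debugging/KPCR.pickle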

Zip Storage
===========
This storage is essentially the same as the File storage above, except
that the cache directory for each image file is maintained in a Zip
file stored in the --cache_directory with the same filename as the
image and a .zip extension.


Use cases
---------
The following common use cases are discussed:

1) Dynamic address spaces. In some address spaces memory address
   mappings cannot be cached since they change all the time. For
   example in the firewire address space, it is incorrect to cache any
   page translations or scanning results etc. This is easily achieved
   by having the firewire address space store a BlockingNode()
   instance at critical tree nodes. These prevent new nodes from being
   inserted into the tree and force a cache miss whenever any keys are
   searched under these nodes. Note that this still allows the cache
   to store the locations of things which might not change, even for
   live memory analysis, such as KPCR locations.
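
   For example (a hedged sketch; how a blocking node is installed at a
   particular key depends on the storage engine):

       ## Force cache misses under this image's address_spaces subtree
       node = BlockingNode('address_spaces', 'address_spaces')
       ## Every lookup below a BlockingNode returns another BlockingNode
       ## whose get_payload() is always None - i.e. a cache miss:
       assert node['memory_translation'].get_payload() is None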

2) History logging and audit logs. Currently volatility works by
   running the framework multiple times on the same plugin with
   different command line options. This can be audited using the
   caching system by storing the current command line in a specific
   location using a specific CacheNode. This implementation can be
   used to append new command lines to the same key. Configuration
   options can also become sticky in this way and remember the same
   values they had previously. This avoids users having to repeat many
   command line arguments (i.e. having to specify --profile, --kpcr,
   --dtb on every command line).
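
   For example (a hedged sketch using the CacheNode API above,
   assuming caching is enabled and sys is imported):

       node = CACHE["history/commandlines"]
       cmdlines = node.get_payload() or []
       cmdlines.append(" ".join(sys.argv))
       node.set_payload(cmdlines)
       node.dump()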

3) Unit tests.  Unit tests can be easily implemented using the caching
   subsystem as follows:

   - A test() method is added to each plugin. Usually this is actually
     the same as calculate().

   - This method is decorated to be cached under the
     "/tests/pluginname" key (i.e. relative to the current image). The
     CacheNode implementation is TestCacheNode which implements a
     special update_payload() method. The TestCacheNode also ensures
     that a cache miss always occurs (by implementing a get_payload()
     method which returns None).

   - The update_payload() method ensures that the old payload and the
     new payloads are the same (if they are generators we ensure each
     member is the same as well - using the __eq__ method).

   The overall result is that unit tests can be run on any image as
   normal. If the particular test was never run on the image, we just
   cache the result of the plugin. If on the other hand, the result
   was already run on this image, the old result is compared to the
   new result and if a discrepancy is detected, an exception is
   raised.

   This testing framework is easy to implement and automatically
   guards against regression bugs. Since we use the __eq__ method of
   arbitrary objects, it's also not limited to testing text string
   matches. For example, the object framework defines two objects as
   equal if they are of the same type and they point at the same
   address. Even if the textual representation of the objects has
   changed between versions, as long as the same objects are found in
   both cases no regressions will be reported.
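
   In this module, the building blocks for this scheme are the
   TestDecorator and the Testable mixin defined below. For example (a
   hedged sketch; Testable memoises test() under the
   tests/unittests/<classname> key):

       class PSScan(commands.Command, cache.Testable):
           def calculate(self):
               ....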

4) Reporting framework. By having a persistent caching framework we
   now have the concept of a volatility analysis session. In other
   words, each new execution of volatility adds new information to
   what we know about the image. This new information is stored in the
   cache tree. We can actually produce a full report from the cache
   tree by traversing all the CacheNodes and calling their __str__()
   methods.

   If caching is introduced via decorators, the CacheNode already
   knows about the render() method of the plugin and can automatically
   generate the output from the plugin (this is very fast, as the
   calculate() results are retrieved from the cache). We can therefore
   generate a full report of all the plugins very quickly and
   automatically.

   By default CacheNodes have an empty __str__() method, so things
   like pas2kas lookup tables are not reported. Specialised reporting
   functions can be implemented by overriding __str__() where needed.
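
   For example (a hedged sketch - walk() is a hypothetical helper that
   would enumerate the stored CacheNodes):

       for node in walk(CACHE.root):
           print str(node)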

"""
import types
import os
import urlparse
import volatility.conf as conf
import volatility.obj as obj
import volatility.debug as debug
import volatility.exceptions as exceptions
import cPickle as pickle
config = conf.ConfObject()

## Where to stick the cache
default_cache_location = os.path.join((os.environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")), "volatility")

config.add_option("CACHE-DIRECTORY", default = default_cache_location,
                  cache_invalidator = False,
                  help = "Directory where cache files are stored")

class CacheContainsGenerator(exceptions.VolatilityException):
    """Exception raised when the cache contains a generator"""
    pass

class InvalidCache(Exception):
    """Exception raised when the cache item is determined to be invalid."""
    pass

class CacheNode(object):
    """ Base class for Cache nodes """
    def __init__(self, name, stem, storage = None, payload = None, invalidator = None):
        ''' Creates a new Cache node under the parent. The new node
        will carry the specified payload
        '''
        self.name = name
        self.payload = payload
        self.storage = storage
        self.stem = stem

        # This object encapsulates the running environment. If the
        # environment during the time of unpickling differs from the
        # environment during the time of pickling we refuse to
        # unpickle this object, and the cache misses. We don't really
        # do anything with it, just have it serialised as well.
        self.invalidator = invalidator

    def __getitem__(self, item = ''):
        item_url = "{0}/{1}".format(self.stem, item)

        ## Try to load it from the storage manager
        try:
            result = self.storage.load(item_url)
            if result:
                return result
        except Exception, e:
            raise KeyError(e)

        ## Nothing is stored at this key - report a cache miss
        raise KeyError("item not found")

    def __str__(self):
        ''' Produce a human readable version of the payload. '''
        return ''

    def _find_generators(self, item):
        """ A recursive function to flatten generators into lists """
        try:
            result = []
            # Make sure dicts aren't flattened to lists
            if isinstance(item, dict):
                result = {}
                for i in item:
                    result[self._find_generators(i)] = self._find_generators(item[i])
                return result

            # Since NoneObjects and strings are both iterable, treat them specially
            if isinstance(item, obj.NoneObject) or isinstance(item, str):
                return item

            if isinstance(item, types.GeneratorType):
                raise CacheContainsGenerator
            for x in iter(item):
                flat_x = self._find_generators(x)
                result.append(flat_x)

            return result
        except TypeError:
            return item

    def set_payload(self, payload):
        ''' Update the current payload with the new specified payload '''
        try:
            self.payload = self._find_generators(payload)
        except CacheContainsGenerator:
            # This only works because None payload cached results are rerun
            self.payload = None

    def dump(self):
        ''' Dump the node to disk for later retrieval. This is
        normally called when the process has exited. '''
        if self.payload:
            self.storage.dump(self.stem, self)

    def get_payload(self):
        """Retrieve this node's payload"""
        return self.payload

class BlockingNode(CacheNode):
    """Node that fails on all cache attempts and no-ops on cache storage attempts"""
    def __init__(self, name, stem, **kwargs):
        CacheNode.__init__(self, name, stem, **kwargs)

    def __getitem__(self, item = ''):
        return BlockingNode(item, '/'.join((self.stem, item)))

    def dump(self):
        """Ensure nothing gets dumped"""
        pass

    def get_payload(self):
        """Never return a payload for a blocked cache node, forcing a cache miss"""
        pass

class Invalidator(object):
    """ The Invalidator encapsulates program state to control
    invalidation of the cache.

    1) This object registers callbacks using the add_condition()
    method.

    2) Prior to serialising the cache object the callbacks are called
    returning a signature dict.

    3) When unpickling the cached object, we call the invalidator to
    produce a signature dict again, and compare this to the pickled
    version.

    The purpose of the callbacks is to represent a signature of the
    current state of execution. If the signature changes, the cache is
    invalidated.
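
    A hedged usage sketch (the "profile" key and its callback are
    illustrative only):

        CACHE.invalidate_on("profile", lambda: str(config.PROFILE))
        ## Pickling any CacheNode now embeds {"profile": ...} via
        ## __getstate__(); unpickling raises InvalidCache if the
        ## current value differs from the stored one.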
    """
    def __init__(self):
        self.callbacks = {}

    def add_condition(self, key, callback):
        """Callback will be stored under key and should return a string.
        """
        self.callbacks[key] = callback

    def __setstate__(self, state):
        ## We do not actually have any callbacks here - we must use
        ## the global cache invalidator. We can't really get away from
        ## having a global invalidator.
        for k, v in CACHE.invalidator.callbacks.items():
            # TODO: Determine what happens if the state or current callbacks
            # contain a key that's not in the other
            if k in state and v() != state[k]:
                debug.debug("Invaliding cache... {0} (Running) != {1} (Stored) on key {2}".format(v(), state[k], k))

                raise InvalidCache("Running environment inconsistant "
                                   "with pickled environment - "
                                   "invalidating cache.")

    def __getstate__(self):
        """When pickling ourselves we call our callbacks to provide a
        dict of strings (our state signature). This dict should
        reflect all of our running state at the moment. This will then
        be compared to the state signature when unpickling and if it's
        different we invalidate the cache.
        """
        result = {}
        for k, v in CACHE.invalidator.callbacks.items():
            result[k] = v()

        debug.debug("Pickling State signature: {0}".format(result))

        return result

class CacheTree(object):
    """ An abstract structure which represents the cache tree """
    def __init__(self, storage = None, cls = CacheNode, invalidator = None):
        self.storage = storage
        self.cls = cls
        self.invalidator = invalidator
        self.root = self.cls('', '', storage = storage, invalidator = invalidator)

    def __getitem__(self, path):
        """Pythonic interface to the cache"""
        return self.check(path, cls = self.cls)

    def invalidate_on(self, key, callback):
        self.invalidator.add_condition(key, callback)

    def check(self, path, callback = None, cls = CacheNode):
        """ Retrieves the node at the path specified """
        # Abort if we haven't been given a location
        if not config.LOCATION:
            return None

        ## Normalise the path
        path = urlparse.urljoin(config.LOCATION + "/", path)

        elements = path.split("/")
        current = self.root

        for e in elements:
            try:
                current = current[e]
            except KeyError:
                if current.stem:
                    next_stem = '/'.join((current.stem, e))
                else:
                    next_stem = e

                payload = None
                if callback is not None:
                    payload = callback()

                node = cls(e, next_stem, storage = self.storage,
                           payload = payload, invalidator = self.invalidator)

                current = node

        return current

class CacheStorage(object):
    """ The base class for implementation storing the cache. """
    ## Characters allowed in filenames (/'s are allowed since we're dealing with URLs only)
    printables = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_./"

    def encode(self, string):
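        """Percent-encode any character of string that is not filesystem safe.

        Illustrative example: "KPCR@1" becomes "KPCR%401".
        """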
        result = ''
        for x in string:
            if x in self.printables:
                result += x
            else:
                result += "%{0:02X}".format(ord(x))

        return result

    def filename(self, url):
        if url.startswith(config.LOCATION):
            # Encode just the path part, since everything else is taken from relatively safe/already used data
            path = self.encode(url[len(config.LOCATION):])
        else:
            raise exceptions.CacheRelativeURLException("Storing non-relative URLs is not currently supported ({0})".format(url))

        # Join together the bits we need, and abspath it to ensure it's right for the OS it's on
        path = os.path.abspath(os.path.sep.join([config.CACHE_DIRECTORY,
                                                 os.path.basename(config.LOCATION) + ".cache",
                                                 path + '.pickle']))

        return path

    def load(self, url):
        filename = self.filename(url)

        debug.debug("Loading from {0}".format(filename))
        data = open(filename, 'rb').read()

        debug.trace(level = 3)
        return pickle.loads(data)

    def dump(self, url, payload):
        # TODO: Ensure a better check for ieee1394/non-cachable address spaces than a bad URL
        try:
            filename = self.filename(url)
        except exceptions.CacheRelativeURLException:
            debug.debug("NOT Dumping url {0} - relative URLs are not yet supported".format(url))
            return

        ## Check that the directory exists
        directory = os.path.dirname(filename)
        if not os.access(directory, os.R_OK | os.W_OK | os.X_OK):
            os.makedirs(directory)

        ## Ensure that the payload is flattened - i.e. all generators are converted to lists for pickling
        try:
            data = pickle.dumps(payload)
            debug.debug("Dumping filename {0}".format(filename))
            fd = open(filename, 'wb')
            fd.write(data)
            fd.close()
        except (pickle.PickleError, TypeError):
            # Do nothing if the pickle fails
            debug.debug("NOT Dumping filename {0} - contained a non-picklable class".format(filename))

## This is the central cache object. Caching is off by default, so the
## tree is initially built from BlockingNodes (see enable_caching below).
CACHE = CacheTree(CacheStorage(), BlockingNode, invalidator = Invalidator())

def enable_caching(_option, _opt_str, _value, _parser):
    """Turns on caching by replacing the blocking tree with one that accepts normal CacheNodes"""
    debug.debug("Enabling Caching")
    # Feels filthy using the global keyword,
    # but I can't figure another way to ensure that
    # the code gets called and overwrites the outer scope
    global CACHE
    CACHE = CacheTree(CacheStorage(), invalidator = Invalidator())
    config.CACHE = True

config.add_option("CACHE", default = False, action = 'callback',
                  cache_invalidator = False,
                  callback = enable_caching,
                  help = "Use caching")

class CacheDecorator(object):
    """ This decorator will memoise a function in the cache """
    def __init__(self, path):
        """Wraps a function in a cache decorator.

        The results of the function will be cached and memoised. Further
        calls to the function will retrieve the result from the
        cache. Cached objects are stored with the specified path as a
        key.

        Args:
           path: Key for storage into the cache. If this is callable,
           it will be called with the function's args and is expected
           to return a string which will be used as a path.

        Returns:
           A decorator.

        Example: Suppose the calculate function is decorated:

        @CacheDecorator(lambda self: "tests/pslist/pid{0}/".format(self._config.PID))
        def calculate(self):
           ....

        Note the use of the callback to finely tune the cache key depending on external variables.
        """
        self.path = path
        self.node = None

    def generate(self, path, g):
        """ Special handling for generators. We pass each iteration
        back immediately, and keep it in a list. Note that if the
        generator is aborted, the cache is not dumped.
        """
        payload = []
        for x in g:
            payload.append(x)
            yield x

        self.dump(path, payload)

    def dump(self, path, payload):
        self.node = CACHE[path]
        self.node.set_payload(payload)
        self.node.dump()

    def _cachewrapper(self, f, s, *args, **kwargs):
        """Wrapper for caching function calls"""
        ## See if the path is callable:
        if callable(self.path):
            path = self.path(s, *args, **kwargs)
        else:
            path = self.path

        ## Check if the result can be retrieved
        self.node = CACHE[path]
        # If this test goes away, we need to change the set_payload exception check
        # to act on dump instead of just the payload
        if self.node:
            payload = self.node.get_payload()
            if payload:
                return payload

        result = f(s, *args, **kwargs)

        ## If the wrapped function is a generator we need to
        ## handle it especially
        if isinstance(result, types.GeneratorType):
            return self.generate(path, result)

        self.dump(path, result)
        return result

    def __call__(self, f):
        def wrapper(s, *args, **kwargs):
            if config.CACHE:
                return self._cachewrapper(f, s, *args, **kwargs)

            return f(s, *args, **kwargs)

        return wrapper

class TestDecorator(CacheDecorator):
    """This decorator is just like a CacheDecorator, but will *always* cache fully"""

    def __call__(self, f):
        def wrapper(s, *args, **kwargs):
            return self._cachewrapper(f, s, *args, **kwargs)
        return wrapper

class Testable(object):
    """ This is a mixin that makes a class respond to the unit tests

        It must be inherited *after* the command class
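
        For example (a hedged sketch):

            class PSScan(commands.Command, cache.Testable):
                def calculate(self):
                    ....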
    """

    def calculate(self):
        """Empty function used to allow mixin"""

    def _flatten(self, item):
        """Flattens an item, including all generators"""
        try:
            # Make sure dicts aren't flattened to lists
            if isinstance(item, dict):
                result = {}
                for i in item:
                    result[self._flatten(i)] = self._flatten(item[i])
                return result

            # Since NoneObjects and strings are both iterable, treat them
            # specially to avoid recursing over their individual characters
            if isinstance(item, obj.NoneObject) or isinstance(item, str):
                return item

            # Accumulate every flattened member, not just the last one
            result = []
            for x in iter(item):
                result.append(self._flatten(x))

            return result
        except TypeError:
            return item

    ## This forces the test to be memoised with a key name derived from the class name
    @TestDecorator(lambda self: "tests/unittests/{0}".format(self.__class__.__name__))
    def test(self):
        ## This forces iteration over all keys - this is required in order
        ## to flatten the full list for the cache
        ## We must ensure config.CACHE is False here, otherwise the change isn't registered in this module
        config.CACHE = False
        return self._flatten(self.calculate())