YTEP-0026: NumPy-like Operations

Abstract

Created: September 21, 2015

Author: Matthew Turk

This YTEP describes implementing some NumPy-like and potentially some Pandas-like operations on data container objects.

Status

This YTEP is proposed, but proof-of-concept code has been developed and submitted as a pull request: https://bitbucket.org/yt_analysis/yt/pull-requests/1763

Once this YTEP has been accepted, documentation will be added to that pull request before it is merged into the codebase.

Detailed Description

Background

Data objects in yt are lazily loaded: data is read from disk only when it is accessed. In behavior they resemble “data frames” or NumPy named dtypes – they act as though they are dicts-of-arrays, with some operations defined on them that are parallel-aware.

However, this is something of a leaky abstraction: to compute even relatively simple results, one must access the .quantities object, determine the correct “quantity” to use, and then call it.

Yet many of these quantities map quite naturally onto NumPy operations.

This YTEP doesn’t (yet) address adding other, Pandas-like operations (such as select or group) even though they also map to yt operations; that may come in the future.

What Can Be Done

I think we should map NumPy array operations to quantities and other yt operations, and, while we’re at it, add very simple “plot” operations. Furthermore, to make the connection more explicit, slicing will be implemented as well, to generate data objects and selections. The dataset object will have a .r attribute, aliased to the much more descriptive .region_expression, which can be sliced directly; depending on how many dimensions are specified and whether an imaginary step is supplied (as with np.mgrid), this will return a region, a slice, or an arbitrary_grid. The slices will accept unitful values.

Implementation

This will be implemented very simply as a set of aliases that look at the input arguments and then generate results from them.
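As a minimal sketch (not the code from the pull request above), one such alias might dispatch like this; the helper name _sum_alias is invented here, and the proj keyword arguments are approximations of yt’s existing projection interface:

def _sum_alias(dobj, field, axis=None):
    # Spatial axis: return a projection computed with the "sum" method.
    if axis in dobj.ds.coordinates.axis_names:
        return dobj.ds.proj(field, axis, data_source=dobj, method="sum")
    # No axis: reduce the whole container to a single scalar quantity.
    if axis is None:
        return dobj.quantities.total_quantity(field)
    raise NotImplementedError("cannot sum along non-spatial axis %r" % (axis,))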

NumPy arrays have several operations that return scalars; these are what we want to map onto data container operations:

  • all
  • any
  • argmax
  • argmin
  • max
  • mean
  • min
  • prod
  • ptp
  • std
  • sum
  • var

For the purposes of this YTEP, we will concern ourselves with argmax, argmin, max, mean, min, std, ptp, and sum, as well as the non-NumPy operations hist and integrate, which do not return a single scalar but rather a collection whose size does not correspond to the number of elements in the array.

We break these up based on the axis argument and other optional arguments; the behavior of each is enumerated below. Note that for those items that can be computed in a single pass (i.e., statistical information about the fields as a whole), we will likely implement a system that computes them together and caches the results, so that min, max, and std are cached between calls and require only a single pass over the array.
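A hypothetical sketch of that caching, assuming a chunk-wise pass over the container (the _stat_cache attribute and _one_pass_stats helper are invented for illustration):

import numpy as np

def _one_pass_stats(dobj, field):
    cache = getattr(dobj, "_stat_cache", {})
    if field not in cache:
        n, s, s2, mi, ma = 0, None, None, None, None
        # One pass over the data accumulates everything we need for
        # min, max, mean, and std simultaneously.
        for chunk in dobj.chunks([], "io"):
            data = chunk[field]
            n += data.size
            s = data.sum() if s is None else s + data.sum()
            s2 = (data ** 2).sum() if s2 is None else s2 + (data ** 2).sum()
            mi = data.min() if mi is None else min(mi, data.min())
            ma = data.max() if ma is None else max(ma, data.max())
        mean = s / n
        cache[field] = {"min": mi, "max": ma, "mean": mean,
                        "std": np.sqrt(s2 / n - mean ** 2)}
        dobj._stat_cache = cache
    return cache[field]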

argmax

The mandatory argument is the field over which the maximum is to be computed. The default return value is the index, but the axis optional parameter can specify one or more fields whose values at the maximum will be returned instead. (For instance, one could supply ('x', 'y', 'z') and be handed back the spatial location.)

argmin

The mandatory argument is the field over which the minimum is to be computed. The default return value is the index, but the axis optional parameter can specify one or more fields whose values at the minimum will be returned instead. (For instance, one could supply ('x', 'y', 'z') and be handed back the spatial location.)
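A usage sketch of the proposed interface (assuming the same dataset as in the examples below):

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")
ad = ds.all_data()
# Default: the index of the extremum.
i = ad.argmax("density")
# Supplying axis returns field values at the extremum instead --
# here, the spatial locations of the maximum and minimum density.
x, y, z = ad.argmax("density", axis=("x", "y", "z"))
x, y, z = ad.argmin("density", axis=("x", "y", "z"))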

max

The mandatory argument is the field of which the maximum is to be computed. This can be a list of fields.

This accepts the optional argument axis. If axis is a spatial axis (as defined by coordinates.axis_names and thus including 0, 1, 2, and the axis names) it will generate a maximum intensity projection along that axis of the specified field.
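Continuing the sketch, with ad from above:

dmax = ad.max("density")                         # scalar maximum
dmax, tmax = ad.max(["density", "temperature"])  # several fields at once
# A spatial axis yields a maximum intensity projection.
mip = ad.max("density", axis="z")
mip.plot()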

mean

The mandatory argument is the field to average. This will return either a projection if the axis is spatial, or a quantity result.

The optional axis argument can be either a spatial axis, along which the weighted projection will be computed, or None. Non-spatial axes are not supported.

The optional weight argument describes how to weight the average; it defaults to ones, which is usually not desired for astrophysical data but may be for other data. If axis is None and weight is None, it will compute the sum; if axis is None and weight is not None, it will compute the weighted_average_quantity.
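A sketch of the proposed behavior, following the rules above (ad as before):

# Default weight (ones): an unweighted average over the container.
avg = ad.mean("density")
# Explicit weight with axis=None: a weighted_average_quantity.
avg = ad.mean("temperature", weight="cell_mass")
# weight=None with axis=None: degenerates to a plain sum.
total = ad.mean("density", weight=None)
# A spatial axis: a weighted projection along that axis.
prj = ad.mean("temperature", axis="x", weight="density")
prj.plot()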

min

The mandatory argument is the field of which the minimum is to be computed. This can be a list of fields.

Because we do not have “minimum intensity projections,” spatial axes are not supported.

std

The mandatory argument is the field of which the standard deviation is to be computed. This can be a list of fields.

The optional weight argument describes the weighting used in computing the standard deviation.

ptp

The mandatory argument is the field of which the peak-to-peak (maximum minus minimum) value is to be computed.
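The remaining scalar reductions would look similar (again with ad from above):

dmin = ad.min("density")                          # scalar minimum
dmin, tmin = ad.min(["density", "temperature"])   # several fields at once
sigma = ad.std("velocity_x", weight="cell_mass")  # weighted standard deviation
spread = ad.ptp("density")                        # maximum minus minimum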

sum

The mandatory argument is the field to sum.

The axis argument must be either None or a spatial axis; if spatial, it is the axis along which the projection will be taken. The weighting will be None. (Note that this will not include a dl term, as it will use the sum method; for a projection that includes dl, see integrate below.)
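For instance (a sketch, with ad as above):

total = ad.sum("cell_mass")        # scalar sum over the container
# Along a spatial axis: a projection using the "sum" method, with no
# dl term applied.
prj = ad.sum("density", axis="z")
prj.plot()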

integrate

The mandatory field argument is the field to integrate; if axis is one of the coordinate axes, the return value will be a projection. This will use the standard projection method, which includes the dl term.
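A sketch of the spatial case, with ad as above:

# A standard projection along z, including the dl path-length term.
prj = ad.integrate("density", axis="z")
prj.plot()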

If the axis argument is not a spatial dimension, perhaps it could return a profile of some type; this remains an open question.

hist

This should return a profile. Determining the most natural way to map how we profile (i.e., the fields along the axes, and the weighting) is an open question. But, it seems to me that we want to do something like:

  • Mandatory argument: the field or fields to take the average (or sum) of. If bins is not specified, the returned profile will compute the sum of this field in bins along the x axis; this conditional is somewhat awkward, but seems the closest match.
  • Optional weight argument: the field to use as the weight; if not specified, this will just be a sum.
  • Optional bins argument: the x and optionally y field to use as bins

__getitem__

The slice operation on the shadow .r attribute should return regions or slices.

If one axis is fully specified (a single value rather than a range), the result will be a slice along that axis. If all three are left as start/stop tuples with no step, the result will be a region. The values can be floats, unitful objects, or tuples of (value, unit_name).

If a step is supplied, it must be supplied for all three dimensions and must be imaginary (e.g., 64j); the slice will then be interpreted as input to an arbitrary_grid object. The start/stop values provide the left and right edges, and the step provides the number of cells along each dimension.

plot

The plot operation will only be implemented on things that have obvious plotting candidates – slices, projections, profiles. It will default to creating the necessary PlotWindow or related class and will try to choose sane defaults for it; for instance, this could wrap to_pw. In contrast to to_pw, it will also default to native plot coordinates, as we want it to match more closely the behavior of simply plotting the field.

Examples

At present, to get a projection plot of a data object, one would do:

obj = ds.sphere((100, 'cm'), 'c')
p = yt.ProjectionPlot(ds, 'x', 'density', data_source=obj)
p.show()

or:

obj = ds.sphere((100, 'cm'), 'c')
proj = ds.proj("x", "density", data_source=obj)
p = proj.to_pw()
p.show()

The alternate here would be:

obj = ds.sphere((100, 'cm'), 'c')
p = obj.sum("density", axis="x")
p.plot()

The histogram could be computed:

obj = ds.sphere((100, 'cm'), 'c')
p = obj.hist("density", bins="temperature", weight="cell_mass")
p.plot()

The slicing would look like:

ds = yt.load("galaxy0030")
my_obj = ds.r[(100,'kpc'):(200,'kpc'), :, (100,'kpc'):(200,'kpc')]

The way to construct this at present, which is a bit cumbersome, would be (there are other ways to do this, too, but this is the clearest):

ds = yt.load("galaxy0030")
left_edge = ds.domain_left_edge.in_units("kpc").copy()
left_edge[0] = 100
left_edge[2] = 100
right_edge = ds.domain_right_edge.in_units("kpc").copy()
right_edge[0] = 200
right_edge[2] = 200
center = (left_edge + right_edge)/2.0
my_obj = ds.region(center, left_edge, right_edge)

Or for a slice:

ds = yt.load("galaxy0030")
my_obj = ds.r[(100,'kpc'):(200,'kpc'), (250,'kpc'), (100,'kpc'):(200,'kpc')]
my_obj.plot()

At present, we would have to:

ds = yt.load("galaxy0030")
left_edge = ds.domain_left_edge.in_units("kpc").copy()
left_edge[0] = 100
left_edge[2] = 100
right_edge = ds.domain_right_edge.in_units("kpc").copy()
right_edge[0] = 200
right_edge[2] = 200
center = (left_edge + right_edge)/2.0
reg = ds.region(center, left_edge, right_edge)
my_obj = ds.slice(1, (250,'kpc'))
my_obj.to_pw("density")

Another example is how to make very terse computations, which still demonstrate reasonably clearly what they do:

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")
dd = ds.r[:,:,:]
print(dd.mean(["velocity_%s" % ax for ax in 'xyz'], weight="cell_mass"))

This returns:

[37021.0582639 cm/s, 35794.630883 cm/s, 82204.2708063 cm/s]

Note that we can also do:

print(ds.r[:,:,:].mean(["velocity_%s" % ax for ax in 'xyz'], weight="cell_mass"))

With the step functionality, this is also possible:

g = ds.r[::128j,::128j,::128j]
g["density"]

where g is an arbitrary_grid object with 128 cells along each dimension.

We may at some point want to add pandas-like selection and indexing functions (http://pandas.pydata.org/pandas-docs/stable/indexing.html), but right now the use case is less clear. Having select() be an alias for cut_region, or adding a groupby method (perhaps only useful if it grouped by binning), might be interesting, but the value is not immediately clear to me.

This work, if completed, will include an overhaul of the documentation to reflect the new interface, which I think is considerably terser and more expressive.

Backwards Compatibility

There are no backwards compatibility issues.

Alternatives

I do not know if there are alternatives to consider; in many ways, this will open us up to more straightforward utilization of tools like xray and dask.