YTEP-0012: Halo Redesign


Created: March 7, 2013

Author: Britton Smith, Cameron Hummels, Chris Moody, Mark Richardson, Yu Lu

In yt 3.0, operations relating to the analysis of halos (halo finding, merger tree creation, and individual halo analysis) will be brought together into a single framework. This will enable the analysis of individual halos through time without requiring the user to manually bridge the gap between halo finding, merger tree creation, and halo profiling.



Detailed Description

Halo Analysis in yt 2.x

Currently, analyzing halos from a cosmological simulation works in the following way. First, a halo finder is run, which produces a halo catalog in the form of an ascii file. Each of the halo finders implemented in yt produces halo catalogs with slightly different formats, including various quantities that also differ. To perform any additional analysis on the halos that have been found, one then uses the HaloProfiler. The HaloProfiler reads the various halo catalogs from their files and performs a limited set of specific functions, namely one-dimensional radial profiles and projections. There is also a function that accepts a callback function for performing custom analysis. The analysis products for each of these are stored in separate locations. Any figures to be made from these analyses require the user to write their own scripts, which are responsible for all file I/O, sorting, and plotting.

Analysis of a halo as it evolves over time first requires the creation of a merger tree. For this to work, the particles belonging to each halo must have been written out during the halo finding step. Most of the merger trees work by manually specifying a list of dataset filenames over which the merger tree is to be calculated. A separate database file is created that contains the entire merger tree, and helper functions exist to track a given halo’s lineage over time. There is no comprehensive framework for performing halo time series analysis. With the functionality existing currently, halo time series analysis can be managed in one of two ways. The first, and by far the more expensive, is to run the HaloProfiler on all halos in all datasets and then use the merger tree database file to correlate halos from multiple times in the user’s hand-built plotting script. The second is to use the merger tree information to specify a single halo index to be analyzed with the HaloProfiler. This is accomplished by creating a filter for a specific halo index, and cannot account for halos coming from multiple parents or having multiple children. These approaches are limiting in numerous other ways as well.

Proposed Halo Analysis in yt 3.0

All of the functionality described above will be managed by a series of new objects working in a hierarchy. These will be HaloCatalogTimeSeries, HaloCatalog, and Halo, described in further detail below. The files created by the operations described here will allow for the full state of the HaloCatalogTimeSeries object to be restored by running the same commands that were used to create them. This will allow the user to create a single script that will be run first on a supercomputer to perform all of the dataset-requiring analysis and then later on a local machine to load the halo data into memory for further analysis and plotting.

HaloCatalogTimeSeries

This object will accept a TimeSeriesData object containing all the datasets to be used. Its two primary functions will be to perform halo finding on all datasets and to create the merger tree. Each of these two steps will be performed with separate function calls, where the arguments given will be a callable halo finding or merger tree function and a dictionary containing additional configuration parameters specific to the provided callables. The data structure contained in memory will be a time-sorted list of HaloCatalog objects, one for each dataset. It will also contain a dictionary of dictionaries holding the merger tree information for each halo. The on-disk format for HaloCatalogTimeSeries objects will likely need to be refined, but will ideally preserve the system of pointers connecting Halo objects to their past and future counterparts (described further in the Halo section). The data stored here will potentially be far too large for a single file. Instead, a system of multiple files that is capable of quickly reconstructing the Halo pointer structure may be better.
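A rough sketch of how this interface might look is below. The HaloCatalogTimeSeries name comes from this document, but the method names, the finder and merger tree callables, and their parameter dictionaries are illustrative assumptions about an API that has not been finalized.

# Illustrative sketch of the proposed interface: the method names,
# the finder and merger tree callables, and the parameter dicts below
# are assumptions, not a finalized API.
from yt.mods import *

ts = TimeSeriesData.from_filenames("DD????/DD????")
hcts = HaloCatalogTimeSeries(ts)

# halo finding: a callable finder plus finder-specific parameters
hcts.find_halos(finder=rockstar_finder,
                finder_kwargs={"num_readers": 1})

# merger tree creation: a callable generator plus its configuration
hcts.build_merger_tree(tree_method=particle_merger_tree,
                       tree_kwargs={"min_halo_particles": 64})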

The other primary function will be to create halo tables: flattened numpy structured arrays of various halo quantities from all halos, or a selection of them (e.g., by timestep), across all halo catalogs. This will enable easy slicing and plotting of properties from multiple halos, as sketched below.
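For example, such a table might be sliced as follows; the field names, dtype, and threshold values here are purely illustrative.

import numpy as np

# purely illustrative: one row per halo, flattened across all catalogs
n_halos = 1000
halo_table = np.zeros(n_halos, dtype=[("redshift", "f8"),
                                      ("particle_mass", "f8"),
                                      ("virial_radius", "f8")])

# easy slicing: all halos above 1e13 Msun at a single output time
massive = halo_table[(halo_table["particle_mass"] > 1e13) &
                     (halo_table["redshift"] == 0.5)]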

HaloCatalog

This will be a light-weight container for Halo objects from a single dataset, responsible for writing Halo objects to disk and restoring them. It will be a numpy structured array. The manner in which HaloCatalog objects are written to disk will be specified similarly to how halo finders and merger tree generators are given to HaloCatalogTimeSeries objects, i.e., by providing a callable writer function. This will allow users to write to any number of standardized formats, such as the IRATE format.
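A writer callable might look something like the following minimal sketch; the (catalog, filename) signature is an assumption about the proposed interface, not a fixed API.

import h5py

# hypothetical writer callable; the (catalog, filename) signature is
# an assumption about the proposed interface
def hdf5_writer(catalog, filename):
    # catalog is assumed to be the structured array described above
    with h5py.File(filename, "w") as f:
        for field in catalog.dtype.names:
            f.create_dataset(field, data=catalog[field])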

Halo

The Halo object will contain all quantities associated with a given halo, calculated either during the halo finding step or by analysis performed later. By default, particle information will be saved to disk after halo finding has been performed, since it is required for merger tree generation. However, particle information will not remain attached to the Halo object, since it takes a great deal of memory to store this information. Instead, several halo quantities will be calculated at instantiation using these particles, including center-of-mass phase-space coordinates, densest point, shape of the halo, and merger tree information (matching against previous and later timesteps to determine the lineage of a halo). However, the option will exist for reloading particle information for later analysis. This technique of frontloading analysis that requires particle information is how some halo finders, such as Rockstar, currently operate.

The Halo object will also have pointers to the Halo objects that are its past and future instances, essentially the most massive progenitor and descendant sharing the largest overlap of particle inventories. This will allow one to perform any additional analysis on a single halo lineage simply by traversing this linked list. Further analysis on Halo objects will be facilitated by associating quantities with callable functions. If the user attempts to access a halo quantity whose value is not currently stored within the Halo object, the Halo object will run the associated callable to create it. At any time, the HaloCatalogTimeSeries object can be directed to update the files on disk with any new quantities that have been calculated.
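As a sketch, walking a halo's lineage back through time could then be as simple as following the pointers; the attribute and quantity names below are assumptions consistent with the future_self/past_self convention used later in this document.

# a sketch of lineage traversal via the past_self pointers; attribute
# and quantity names are assumptions consistent with this document
def mass_history(halo):
    masses = []
    while halo is not None:
        masses.append(halo.quantities["particle_mass"])
        halo = halo.past_self
    return masses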

Backwards Compatibility

This will be a completely new framework for performing this type of analysis. Other than working with existing halo finders and potentially reading in the output they produce now, there will be no compatibility with pre-existing machinery. For these reasons, this development will be confined to yt 3.0.

Current Progress

Development of the new halo analysis is taking place in this repository under the ytep0012 bookmark. This work is being done alongside the unit refactor and thus includes all changes in the unitrefactor bookmark here. The majority of the work is taking place within yt/analysis_modules/halo_analysis. Everything detailed below, except where explicitly noted, has been implemented.

HaloCatalogTimeSeries

Not Implemented. This is currently awaiting development of a new merger tree framework.

HaloCatalog

This relies on the recently added ability to load a Rockstar halo catalog as a yt dataset, referred to hereafter as a halo finder dataset for clarity. A HaloCatalog object is created by providing it with a simulation dataset, a halo finder dataset, or both.

dpf = load("DD0064/DD0064")
hpf = load("rockstar_halos/halos_64.0.bin")

# the output directory given here is illustrative
hc = HaloCatalog(halos_pf=hpf, data_pf=dpf,
                 output_dir="halo_catalogs/catalog_0064")

If halos_pf is not given, halo finding will be done using the method provided with the finder_method keyword (not implemented). A data container can also be given, associated with either dataset, to control the spatial region in which halo analysis will be performed.

Analysis is done by adding actions to the HaloCatalog. Each action is represented by a callback function that will be run on each halo. There are three types of actions.

1. Quantities - a callback that returns a value or values. The return values are stored within the halo object in a dictionary called “quantities.” At the end of the analysis, all of these quantities will be written to disk as the final form of the generated “halo catalog.”

import numpy as np

# definition of the center of mass quantity
def center_of_mass(halo):
    if halo.particles is None:
        raise RuntimeError("Center of mass requires halo to have particle data.")
    return (halo.particles['particle_mass'] *
            np.array([halo.particles['particle_position_x'],
                      halo.particles['particle_position_y'],
                      halo.particles['particle_position_z']])).sum(axis=1) / \
           halo.particles['particle_mass'].sum()

# add to a registry of available quantities
add_quantity('center_of_mass', center_of_mass)

# in the actual halo analysis script
hc.add_quantity("center_of_mass")

The above example would be better implemented as a parallel-safe derived quantity, but the usage is the same.

Instead of being generated from a callback, quantities can also be values pulled directly from the halo finder dataset.

# get the field value ("halos", "particle_mass") for this halo from the halo dataset
hc.add_quantity("particle_mass", field_type="halos")

2. Callbacks - the callback is actually the superclass for quantities and filters; it is a general purpose function that does something, anything, to a Halo object. This can include attaching new attributes to the Halo object, performing analysis and writing to disk, etc. A callback does not return anything. In the example below, we create a sphere for a halo with a radius that is twice the saved “virial_radius” (in the quantities dict), recenter it on the location of maximum density, do some profiling, compute virial quantities based on those profiles (storing them in the quantities dict), and then write the profiles to disk.

hc.add_callback("sphere", radius_field="virial_radius", factor=2.0)
hc.add_callback("sphere_field_max_recenter", ("gas", "density"))
hc.add_callback("profile", "radius", [("gas", "matter_mass"),
                                      ("index", "cell_volume")],
                weight_field=None, accumulation=True,
                output_dir="profiles", storage="profiles")
hc.add_callback("virial_quantities", ["radius", ("gas", "matter_mass")])
hc.add_callback("save_profiles", storage="profiles")

Currently existing stock callbacks:

  • sphere creation
  • sphere recenter
  • sphere bulk velocity
  • 1D profiling
  • virial quantity calculation based on 1D profiles
  • writing profile data
  • reloading saved profile data
  • removing Halo attributes
  • PhasePlot
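
Custom callbacks can be registered in the same manner as quantities. The sketch below assumes a module-level add_callback registry analogous to the add_quantity registry shown earlier; the quantity names and filenames are illustrative.

# a custom callback: append the halo's center of mass to a text file;
# assumes a module-level add_callback registry analogous to add_quantity
def log_center(halo, filename="halo_centers.txt"):
    with open(filename, "a") as f:
        f.write("%s\n" % str(halo.quantities["center_of_mass"]))
add_callback("log_center", log_center)

# in the actual halo analysis script
hc.add_callback("log_center", filename="my_centers.txt")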

3. Filters - a filter is a callback function that returns True or False. If the return value is True, any further queued analysis will proceed and the halo in question will be added to the final catalog. If the return value is False, further analysis will not be performed and the halo will not be included in the final catalog.

hc.add_filter("quantity_value", "matter_mass_200", ">", 1e13, "Msun")

Currently existing stock filters:

  • quantity filter (shown above): uses an eval statement with a value stored in the quantities dict
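
Custom filters can be defined in the same way. This sketch assumes an add_filter registry analogous to add_quantity; the quantity name and threshold are illustrative.

# a custom filter: keep only halos above a minimum mass; assumes an
# add_filter registry analogous to add_quantity
def minimum_mass(halo, min_mass):
    return halo.quantities["particle_mass"] >= min_mass
add_filter("minimum_mass", minimum_mass)

# in the actual halo analysis script
hc.add_filter("minimum_mass", 1e13)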

Running and Order of Operations

After all callbacks, quantities, and filters have been added, the analysis begins with a call to HaloCatalog.create.

hc.create(save_halos=False, njobs=-1, dynamic=True)

The save_halos keyword determines whether the actual Halo objects are saved after analysis on them has completed, or whether just the contents of their quantities dicts will be retained for creating the final catalog. The looping over halos uses a call to parallel_objects, allowing the user to control how many processors work on each halo. The final catalog is written to disk in the output directory given when the HaloCatalog object was created. The final halo catalog can then be loaded as a yt dataset, in the same manner as a halo finder dataset.
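Schematically, the loop over halos behaves like the following; this is an illustration of the parallel_objects semantics, not the actual implementation, and halo_list is a placeholder.

from yt.utilities.parallel_tools.parallel_analysis_interface import \
    parallel_objects

halo_list = []  # placeholder for the halos being analyzed

# njobs=-1 gives each halo its own processor (group); dynamic=True
# uses a task queue so processors grab new halos as they finish
for halo in parallel_objects(halo_list, njobs=-1, dynamic=True):
    pass  # apply callbacks, quantities, and filters to this halo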

All callbacks, quantities, and filters are stored in an “actions” list, meaning that they are executed in the same order in which they were added. This enables the use of simple, reusable, single-action callbacks that depend on each other. This also prevents unnecessary computation by allowing the user to add filters at multiple stages to skip remaining analysis if it is not warranted.
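For example, placing a cheap mass cut before expensive profiling means rejected halos never reach the profiling step; the quantity name and threshold here are illustrative.

# the filter runs first, so low-mass halos never reach the expensive
# profiling callbacks; quantity name and threshold are illustrative
hc.add_filter("quantity_value", "particle_mass", ">", 1e12, "Msun")
hc.add_callback("sphere", radius_field="virial_radius", factor=2.0)
hc.add_callback("profile", "radius", [("gas", "matter_mass")],
                weight_field=None, accumulation=True)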

Reloading a Halo Catalog

A HaloCatalog saved to disk can be reloaded as a yt dataset with the standard call to load. Any side data, such as profiles, can be reloaded with a load_profiles callback and a call to HaloCatalog.load.

from yt.mods import *
from yt.analysis_modules.halo_analysis.api import *

hpf = load("halo_catalogs/catalog_0046/catalog_0046.0.h5")
hc = HaloCatalog(halos_pf=hpf,
                 output_dir="halo_catalogs/catalog_0046")

# the filename argument here is illustrative
hc.add_callback("load_profiles", output_dir="profiles",
                filename="virial_profiles")
hc.load()

The load and create functions are wrappers around a _run function responsible for looping over all the halos and applying callbacks. The only difference between the two is that their default keyword arguments are specifically tailored for creating (do not retain Halo objects, do write catalog) or rereading catalogs (do retain Halo objects, do not write catalog).
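Schematically, the relationship between the two looks like the following; the class and keyword names here are assumptions about the internal interface, not the actual implementation.

class HaloCatalogSketch:
    """Schematic of the create/load wrapper pattern; names are assumed."""
    def _run(self, save_halos, save_catalog):
        # loop over all halos, applying callbacks, quantities, and filters
        pass

    def create(self, save_halos=False):
        # creating: discard Halo objects, write the final catalog
        self._run(save_halos=save_halos, save_catalog=True)

    def load(self, save_halos=True):
        # reloading: retain Halo objects, do not rewrite the catalog
        self._run(save_halos=save_halos, save_catalog=False)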

Halo

Halo objects are created by the HaloCatalog during the calls to create and load. They are mainly meant to be temporary objects for retaining information so that it can be passed between callbacks. They can be kept by specifying save_halos=True. This might be useful when working with a time series of halo catalogs, where future_self and past_self attributes may act as pointers to Halo objects within HaloCatalogs that are time-adjacent.

Remaining Work

See the Trello board.

All are welcome to get involved with development, testing, etc!