YTEP-0012: Halo Redesign¶
Created: March 7, 2013
Author: Britton Smith, Cameron Hummels, Chris Moody, Mark Richardson, Yu Lu
In yt 3.0, operations relating to the analysis of halos (halo finding, merger tree creation, and individual halo analysis) will be brought together into a single framework. This will enable the analysis of individual halos through time without the need to bridge the gap between halo finding, merger tree creation, and halo profiling on one’s own.
Halo Analysis in yt 2.x¶
Currently, analyzing halos from a cosmological simulation works in the following way.
First, a halo finder is run, which produces a halo catalog in the form of an ASCII
file. Each of the halo finders implemented in yt produces halo catalogs with slightly
different formats, including various quantities that also differ. To perform any
additional analysis on the halos that have been found, one then uses the
HaloProfiler, which reads the various halo catalogs from their
files and performs a limited set of specific functionality, namely one-dimensional
radial profiles and projections. There is also a function that accepts a callback
function for performing custom analysis. The analysis products for each of these are
stored in separate locations. Any figures to be made from these analyses require the
user to write their own scripts that are responsible for all file i/o, sorting, and plotting.
Analysis of a halo as it evolves over time first requires the creation of a merger
tree. For this to work, the particles that belong to each halo need to have been
written out during the halo finding step. Most of the merger trees work by manually
specifying a list of dataset filenames over which the merger tree is to be calculated.
A separate database file is created that contains the entire merger tree and helper
functions exist to track a given halo’s lineage over time. There is no
comprehensive framework for performing halo time series analysis. With the
functionality existing currently, halo time series analysis can be managed in one
of two ways. The first, and more expensive by far, is to run the
HaloProfiler on all halos in all datasets and then use the merger tree database file to correlate
halos from multiple times in the user’s hand-built plotting script. The second
is to use the merger tree information to specify a single halo index to be analyzed with the
HaloProfiler. This is accomplished by creating a filter for a specific
halo index, and cannot account for halos coming from multiple parents or having
multiple children. There are numerous other ways in which these approaches are
insufficient for comprehensive time series analysis of halos.
Proposed Halo Analysis in yt 3.0¶
All of the functionality described above will be managed by a series of new objects
working in a hierarchy. These will be
HaloCatalogTimeSeries,
HaloCatalog, and
Halo, described in further detail below. The files created by the operations
described here will allow for the full state of the
HaloCatalogTimeSeries
to be restored by running the same commands that were used to create them. This will
allow the user to create a single script that will be run first on a supercomputer to
perform all of the dataset-requiring analysis and then later on a local machine to
load the halo data into memory for further analysis and plotting.
HaloCatalogTimeSeries¶
This object will accept a
TimeSeriesData object containing all the datasets to be
used. Its two primary functions will be to perform halo finding on all datasets and
to create the merger tree. Each of these two steps will be performed with separate
function calls where the arguments given will be a callable halo finding or merger
tree function and a dictionary containing additional configuration parameters specific to
the provided callables. The data structure contained in memory will be a time-sorted list of
HaloCatalog objects, one for each dataset. It will also contain
a dictionary of dictionaries showing the merger tree information for each halo. The
on-disk format for
HaloCatalogTimeSeries objects will likely need to be refined, but
will ideally preserve the system of pointers connecting
Halo objects to their past and
future counterparts (described further in the
Halo section). The data stored here will
potentially be far too large for a single file. Instead, a system of multiple files that
is capable of quickly reconstructing the
Halo pointer structure may be better.
The other primary function will be to create halo tables that are flattened numpy structured arrays of various halo quantities from all or a selection of all halos (e.g. by timestep) from all halo catalogs. This will enable easy slicing and plotting of properties from multiple halos.
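As a sketch of what such a flattened halo table could look like (illustrative only; make_halo_table, the field names, and the dict-based catalogs are assumptions for this example, not yt's actual API), a numpy structured array built from per-halo quantities supports exactly this kind of slicing:

```python
import numpy as np

# Illustrative sketch: flatten per-halo quantities from a time-sorted
# collection of catalogs into one numpy structured array.
def make_halo_table(catalogs):
    """catalogs: time-sorted list of lists of per-halo quantity dicts."""
    dtype = [("timestep", "i4"), ("mass", "f8"), ("radius", "f8")]
    rows = [(i, h["mass"], h["radius"])
            for i, catalog in enumerate(catalogs)
            for h in catalog]
    return np.array(rows, dtype=dtype)

catalogs = [
    [{"mass": 1.0e12, "radius": 150.0}, {"mass": 5.0e11, "radius": 110.0}],
    [{"mass": 1.2e12, "radius": 160.0}],
]
table = make_halo_table(catalogs)
# easy slicing and plotting of properties across all catalogs at once
massive = table[table["mass"] > 8.0e11]
```

Because the result is a single structured array, selections by timestep or by any quantity are one-line boolean masks rather than per-catalog loops.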
HaloCatalog¶
This will be a light-weight container for
Halo objects from a single dataset. It
will be responsible for writing
Halo objects to disk and restoring them from disk. Internally, it
will be a numpy structured array. The manner in which
HaloCatalog objects will be
written to disk will be specified similarly to how halo finders and merger tree generators
are provided to
HaloCatalogTimeSeries objects, i.e., by providing a callable writer function.
This will allow users to write to any number of standardized formats.
Halo¶
The
Halo object will contain all quantities associated with a given halo
calculated either during the halo finding step or by analysis performed later.
By default, particle information will be saved to disk after halo finding has been
performed since it is required for merger tree generation. However, particle
information will not remain attached to the
Halo object since it takes a great
deal of memory to store this information.
Instead, there will be several halo quantities calculated at instantiation using
these particles including center-of-mass phase-space coordinates, densest point,
shape of halo, and merger tree information (matching against previous and later
timesteps to determine lineage of a halo). However, the option will exist
for reloading particle information for later analysis. This technique of frontloading
analysis that requires particle information is how some halo finders, such as
Rockstar, currently operate.
Each
Halo object will also have pointers to the
Halo objects that are
its past and future instances, essentially the most massive progenitors and
descendants sharing the largest overlap of particle inventories. This will allow one to perform
any additional analysis on a single halo lineage simply by traversing this
linked list. Further analysis on
Halo objects will be facilitated by
associating quantities with callable functions. If the user attempts to access
a halo quantity whose value is not currently stored within the
Halo object,
it will run the associated callable to create it. At any time, the
HaloCatalogTimeSeries object can be directed to update the files on disk with
any new quantities that have been calculated.
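The lazy-quantity mechanism described above can be sketched as follows (the registry, class, and quantity names here are illustrative assumptions, not yt's actual implementation): a missing quantity triggers its registered callable on first access and the result is cached.

```python
# Sketch of lazy quantity evaluation: quantities are associated with
# callables, run on first access, and cached in the quantities dict.
quantity_registry = {}

def add_quantity(name, func):
    quantity_registry[name] = func

class Halo:
    def __init__(self, quantities=None):
        self.quantities = dict(quantities or {})

    def __getitem__(self, name):
        if name not in self.quantities:
            # run the associated callable to create the missing quantity
            self.quantities[name] = quantity_registry[name](self)
        return self.quantities[name]

# a quantity defined in terms of already-stored quantities
add_quantity("circular_velocity",
             lambda halo: (halo["mass"] / halo["radius"]) ** 0.5)

halo = Halo({"mass": 9.0, "radius": 4.0})
v = halo["circular_velocity"]   # computed on first access, then cached
```

Once computed, the value sits in the quantities dict, so a later request to update the files on disk would pick it up along with everything else.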
This will be a completely new framework for performing this type of analysis. Other than working with existing halo finders and potentially reading in the output they produce now, there will be no compatibility with pre-existing machinery. For these reasons, this development will be confined to yt 3.0.
Development of the new halo analysis is taking place in
this repository under the ytep0012 bookmark.
This work is being done alongside the unit refactor and thus includes all changes in the
unitrefactor bookmark here. The majority of the
work is taking place within
yt/analysis_modules/halo_analysis. Everything detailed below, except
where explicitly noted, has been implemented.
HaloCatalogTimeSeries¶
Not Implemented. This is currently awaiting development of a new merger tree framework.
HaloCatalog¶
This relies on the recently added ability to load a Rockstar halo catalog as a yt dataset,
referred to hereon as a halo finder dataset for clarity. A
HaloCatalog object is created
by providing it with a simulation dataset, a halo finder dataset, or both.
dpf = load("DD0064/DD0064")
hpf = load("rockstar_halos/halos_64.0.bin")
hc = HaloCatalog(halos_pf=hpf, data_pf=dpf,
                 output_dir="halo_catalogs/catalog_0064")
If the halos_pf is not given, halo finding will be done using the method provided with the
finder_method keyword (not implemented). A data container can also be given, associated
with either dataset, to control the spatial region in which halo analysis will be performed.
Analysis is done by adding actions to the
HaloCatalog. Each action is represented by a callback
function that will be run on each halo. There are three types of actions.
1. Quantities - a callback that returns a value or values. The return values are stored within the halo object in a dictionary called “quantities.” At the end of the analysis, all of these quantities will be written to disk as the final form of the generated “halo catalog.”
# definition of the center of mass quantity
def center_of_mass(halo):
    if halo.particles is None:
        raise RuntimeError("Center of mass requires halo to have particle data.")
    return (halo.particles['particle_mass'] *
            np.array([halo.particles['particle_position_x'],
                      halo.particles['particle_position_y'],
                      halo.particles['particle_position_z']])).sum(axis=1) / \
        halo.particles['particle_mass'].sum()

# add to a registry of available quantities
add_quantity('center_of_mass', center_of_mass)

# in the actual halo analysis script
hc.add_quantity("center_of_mass")
The above example is better suited as a parallel-safe derived quantity, but the use is the same.
Instead of being generated from a callback, quantities can also be values pulled directly from the halo finder dataset.
# get the field value ("halos", "particle_mass") for this halo from the halo dataset
hc.add_quantity("particle_mass", field_type="halos")
2. Callbacks - the callback is actually the superclass for quantities and filters, and is a
general purpose function that does something, anything, to a
Halo object. This can include
hanging new attributes off the
Halo object, performing analysis and writing to disk, etc.
A callback does not return anything.
In the example below, we create a sphere for a halo with a radius that is twice the saved
“virial_radius” (in the quantities dict), recenter it on the location of maximum density,
then do some profiling, compute virial quantities based on those profiles (storing them in the
quantities dict), and then write the profiles to disk.
hc.add_callback("sphere", radius_field="virial_radius", factor=2.0)
hc.add_callback("sphere_field_max_recenter", ("gas", "density"))
hc.add_callback("profile", "radius",
                [("gas", "matter_mass"), ("index", "cell_volume")],
                weight_field=None, accumulation=True,
                output_dir="profiles", storage="profiles")
hc.add_callback("virial_quantities", ["radius", ("gas", "matter_mass")])
hc.add_callback("save_profiles", storage="profiles")
Currently existing stock callbacks:
- sphere creation
- sphere recenter
- sphere bulk velocity
- 1D profiling
- virial quantity calculation based on 1D profiles
- writing profile data
- reloading saved profile data
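The shape of a custom callback can be sketched as below; the registry, field names, and the dict-based halo stand-in are illustrative assumptions for this example, not yt internals. The key property is that a callback mutates the halo (here, storing a new quantity) and returns nothing:

```python
import math

# Hypothetical custom callback: compute the mean density inside the
# virial radius and hang it on the quantities dict; no return value.
callback_registry = {}

def add_callback(name, func):
    callback_registry[name] = func

def mean_density(halo, radius_field="virial_radius", mass_field="mass"):
    r = halo["quantities"][radius_field]
    m = halo["quantities"][mass_field]
    halo["quantities"]["mean_density"] = m / (4.0 / 3.0 * math.pi * r ** 3)

add_callback("mean_density", mean_density)

halo = {"quantities": {"virial_radius": 2.0, "mass": 32.0}}
callback_registry["mean_density"](halo)   # halo now carries mean_density
```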
3. Filters - a filter is a callback function that returns True or False. If the return value is True, any further queued analysis will proceed and the halo in question will be added to the final catalog. If the return value is False, further analysis will not be performed and the halo will not be included in the final catalog.
hc.add_filter("quantity_value", "matter_mass_200", ">", 1e13, "Msun")
Currently existing stock filters:
- quantity filter (shown above): uses an eval statement with a value stored in the quantities dict
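A minimal sketch of the quantity filter idea (the function signature and dict-based halos here are illustrative, not yt's actual implementation): compare a stored quantity against a threshold and keep only halos for which the comparison holds.

```python
# Sketch of a "quantity_value" style filter: True keeps the halo,
# False drops it from the final catalog and skips further analysis.
def quantity_value(halo, field, operator, value):
    ops = {">": lambda a, b: a > b,
           "<": lambda a, b: a < b,
           ">=": lambda a, b: a >= b}
    return ops[operator](halo["quantities"][field], value)

halos = [{"quantities": {"matter_mass_200": 5.0e12}},
         {"quantities": {"matter_mass_200": 2.0e13}}]

# only halos passing the filter reach the final catalog
kept = [h for h in halos
        if quantity_value(h, "matter_mass_200", ">", 1.0e13)]
```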
Running and Order of Operations¶
After all callbacks, quantities, and filters have been added, the analysis begins with a call to
hc.create(save_halos=False, njobs=-1, dynamic=True)
The save_halos keyword determines whether the actual
Halo objects are saved after analysis on
them has completed or whether just the contents of their quantities dicts will be retained for creating
the final catalog. The looping over halos uses a call to
parallel_objects, allowing the user to
control how many processors work on each halo. The final catalog is written to disk in the output
directory given when the
HaloCatalog object was created. The final halo catalog can then
be loaded in as a yt dataset just in the manner of a halo finder dataset.
All callbacks, quantities, and filters are stored in an “actions” list, meaning that they are executed in the same order in which they were added. This enables the use of simple, reusable, single action callbacks that depend on each other. This also prevents unnecessary computation by allowing the user to add filters at multiple stages to skip remaining analysis if it is not warranted.
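The ordered actions loop described above can be sketched as follows (run_actions and the action tuples are illustrative assumptions, not yt's internal names): actions execute in insertion order, and a failing filter short-circuits everything queued after it for that halo.

```python
# Minimal sketch of the ordered actions loop: quantities, callbacks,
# and filters run in the order added; a False filter skips the rest.
def run_actions(halo, actions):
    for kind, func in actions:
        if kind == "filter":
            if not func(halo):
                return False          # halo dropped; rest skipped
        else:
            func(halo)                # quantity or generic callback
    return True

actions = [
    ("quantity", lambda h: h.setdefault("mass", 1.0e12)),
    ("filter", lambda h: h["mass"] > 1.0e13),   # fails for this halo
    ("callback", lambda h: h.setdefault("profiled", True)),
]
halo = {}
kept = run_actions(halo, actions)   # the profiling callback never runs
```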
Reloading a Halo Catalog¶
A
HaloCatalog saved to disk can be reloaded as a yt dataset with the standard call to
load. Any additional
side data, such as profiles, can be reloaded with a
load_profiles callback and a call to
HaloCatalog.load.
from yt.mods import *
from yt.analysis_modules.halo_analysis.api import *

hpf = load("halo_catalogs/catalog_0046/catalog_0046.0.h5")
hc = HaloCatalog(halos_pf=hpf,
                 output_dir="halo_catalogs/catalog_0046")
hc.add_callback("load_profiles", output_dir="profiles",
                filename="virial_profiles")
hc.load()
Both the
load and
create functions are wrappers around a
_run function responsible for looping over
all the halos and applying callbacks. The only difference between the two is that their default keyword
arguments are specifically tailored for creating (do not retain
Halo objects, do write catalog) or
rereading catalogs (do retain
Halo objects, do not write catalog).
Halo objects are created by the
HaloCatalog during the call to
HaloCatalog.run. They are
mainly meant to be temporary objects for retaining information so that it can be passed between callbacks.
They can be kept by specifying save_halos=True. This might be useful when working with a time series of
halo catalogs where future_self and past_self attributes may act as pointers to
Halo objects within
HaloCatalogs that are time-adjacent.
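Traversing such a lineage can be sketched as below; the class, the mass values, and mass_history are illustrative assumptions, with only the past_self and future_self attribute names taken from the text above.

```python
# Illustrative sketch of walking a halo lineage through past_self
# pointers across time-adjacent catalogs (not a finalized API).
class Halo:
    def __init__(self, mass, past_self=None):
        self.mass = mass
        self.past_self = past_self    # most massive progenitor, or None
        self.future_self = None

def mass_history(halo):
    """Collect masses from this halo back through its progenitors."""
    masses = []
    while halo is not None:
        masses.append(halo.mass)
        halo = halo.past_self
    return masses

early = Halo(1.0e11)
middle = Halo(4.0e11, past_self=early)
late = Halo(1.0e12, past_self=middle)
early.future_self, middle.future_self = middle, late

history = mass_history(late)   # walk the linked list back in time
```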