YTEP-0012: Halo Redesign¶
Abstract¶
Created: March 7, 2013
Authors: Britton Smith, Cameron Hummels, Chris Moody, Mark Richardson, Yu Lu
In yt 3.0, operations relating to the analysis of halos (halo finding, merger tree creation, and individual halo analysis) will be brought together into a single framework. This will enable the analysis of individual halos through time without the need to bridge the gap between halo finding, merger tree creation, and halo profiling on one’s own.
Status¶
Completed
Project Management Links¶
Detailed Description¶
Halo Analysis in yt 2.x¶
Currently, analyzing halos from a cosmological simulation works in the following way.
First, a halo finder is run, which produces a halo catalog in the form of an ASCII
file. Each of the halo finders implemented in yt produces halo catalogs with slightly
different formats, including various quantities that also differ. To perform any
additional analysis on the halos that have been found, one then uses the HaloProfiler.
The HaloProfiler reads the various halo catalogs from their files and performs a
limited set of specific functionality, namely one-dimensional radial profiles and
projections. There is also a function that accepts a callback function for performing
custom analysis. The analysis products for each of these are stored in separate
locations. Any figures to be made from these analyses require the user to write their
own scripts, which are responsible for all file I/O, sorting, and plotting.
Analysis of a halo as it evolves over time first requires the creation of a merger
tree. For this to work, the particles that belong to each halo need to have been
written out during the halo finding step. Most of the merger trees work by manually
specifying a list of dataset filenames over which the merger tree is to be calculated.
A separate database file is created that contains the entire merger tree, and helper
functions exist to track a given halo’s lineage over time. There is no
comprehensive framework for performing halo time series analysis. With the
currently existing functionality, halo time series analysis can be managed in one
of two ways. The first, and by far the more expensive, is to run the HaloProfiler
on all halos in all datasets and then use the merger tree database file to correlate
halos from multiple times in the user’s hand-built plotting script. The second
is to use the merger tree information to specify a single halo index to be analyzed
with the HaloProfiler. This is accomplished by creating a filter for a specific
halo index, and cannot account for halos coming from multiple parents or having
multiple children. There are numerous other ways in which these approaches are
very limiting.
Proposed Halo Analysis in yt 3.0¶
All of the functionality described above will be managed by a series of new objects
working in a hierarchy. These will be HaloCatalogTimeSeries, HaloCatalog,
and Halo, described in further detail below. The files created by the operations
described here will allow the full state of the HaloCatalogTimeSeries object
to be restored by running the same commands that were used to create them. This will
allow the user to create a single script that is first run on a supercomputer to
perform all of the dataset-requiring analysis and then later on a local machine to
load the halo data into memory for further analysis and plotting.
HaloCatalogTimeSeries¶
This object will accept a TimeSeriesData object containing all the datasets to be
used. Its two primary functions will be to perform halo finding on all datasets and
to create the merger tree. Each of these two steps will be performed with a separate
function call whose arguments are a callable halo finding or merger tree function
and a dictionary containing additional configuration parameters specific to the
provided callable. The data structure contained in memory will be a time-sorted
list of HaloCatalog objects, one for each dataset. It will also contain
a dictionary of dictionaries holding the merger tree information for each halo. The
on-disk format for HaloCatalogTimeSeries objects will likely need to be refined, but
will ideally preserve the system of pointers connecting Halo objects to their past and
future counterparts (described further in the Halo section). The data stored here will
potentially be far too large for a single file; instead, a system of multiple files
capable of quickly reconstructing the Halo pointer structure may be better.
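As a rough illustration, the interface might look something like the sketch below.
Every name here (from_filenames, find_halos, make_merger_tree, and the callables and
keyword dictionaries they receive) is a hypothetical placeholder, not a settled API.

# Hypothetical usage sketch; all names are placeholders, not a final API.
ts = TimeSeriesData.from_filenames("DD????/DD????")
hct = HaloCatalogTimeSeries(ts)

# Halo finding: a callable plus a dict of finder-specific parameters.
hct.find_halos(finder=rockstar_halo_finder,
               finder_kwargs={"num_readers": 1, "num_writers": 1})

# Merger tree: the same pattern with a tree-generating callable.
hct.make_merger_tree(tree=simple_merger_tree,
                     tree_kwargs={"match_fraction": 0.5})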
A third primary function will be to create halo tables: flattened numpy structured arrays of various halo quantities from all halos, or a selection of them (e.g. by timestep), drawn from all halo catalogs. This will enable easy slicing and plotting of properties from multiple halos.
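Such a table could be assembled along these lines; this is a minimal sketch, assuming
the hct object from the previous sketch and hypothetical catalogs/halos attributes
and field names.

import numpy as np

# Minimal sketch; the catalogs/halos attributes and the field names
# are assumptions, not a settled interface.
dtype = [("redshift", "f8"), ("particle_mass", "f8")]
rows = [(cat.redshift, halo.quantities["particle_mass"])
        for cat in hct.catalogs for halo in cat.halos]
table = np.array(rows, dtype=dtype)

# Easy slicing and plotting of properties from multiple halos.
massive = table[table["particle_mass"] > 1e12]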
HaloCatalog¶
This will be a light-weight container for Halo objects from a single dataset. It
will be responsible for writing Halo objects to disk and restoring them from disk,
and will be a numpy structured array. The manner in which HaloCatalog objects are
written to disk will be specified similarly to how halo finders and merger tree
generators are given to HaloCatalogTimeSeries objects, i.e., by providing a callable
writer function. This will allow users to write to any number of standardized
formats, such as the IRATE format.
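A writer callable might look roughly like the following; this is a sketch assuming
a (catalog, filename) signature and a .data structured array attribute, neither of
which is settled.

import h5py

# Hypothetical writer callable; the signature and the .data structured
# array attribute are assumptions.
def write_hdf5(catalog, filename):
    with h5py.File(filename, "w") as f:
        for name in catalog.data.dtype.names:
            f.create_dataset(name, data=catalog.data[name])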
Halo¶
The Halo object will contain all quantities associated with a given halo,
calculated either during the halo finding step or by analysis performed later.
By default, particle information will be saved to disk after halo finding has been
performed, since it is required for merger tree generation. However, particle
information will not remain attached to the Halo object, since it takes a great
deal of memory to store this information.
Instead, several halo quantities will be calculated at instantiation using
these particles, including center-of-mass phase-space coordinates, the densest point,
the shape of the halo, and merger tree information (matching against previous and later
timesteps to determine the lineage of a halo). However, the option will exist
for reloading particle information for later analysis. This technique of front-loading
analysis that requires particle information is how some halo finders, such as
Rockstar, currently operate.
The Halo object will also have pointers to the Halo objects that are
its past and future instances, essentially the most massive progenitor and
descendant with the largest overlap in particle inventories. This will allow one
to perform any additional analysis on a single halo lineage simply by traversing
this linked list. Further analysis on Halo objects will be facilitated by
associating quantities with callable functions. If the user attempts to access
a halo quantity whose value is not currently stored within the Halo object,
it will run the associated callable to create it. At any time, the
HaloCatalogTimeSeries object can be directed to update the files on disk with
any new quantities that have been calculated.
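Traversing a lineage could then be as simple as following these pointers back in
time. In this sketch, the past_self attribute name is borrowed from the Current
Progress section below, and the halos attribute and quantity name are assumptions.

# Walk a halo's lineage backward through time via the pointer structure
# described above; attribute and quantity names are assumptions.
halo = catalog.halos[0]
mass_history = []
while halo is not None:
    mass_history.append(halo.quantities["particle_mass"])
    halo = halo.past_self  # most massive progenitor, or None at the earliest time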
Backwards Compatibility¶
This will be a completely new framework for performing this type of analysis. Other than working with existing halo finders and potentially reading in the output they produce now, there will be no compatibility with pre-existing machinery. For these reasons, this development will be confined to yt 3.0.
Current Progress¶
Development of the new halo analysis is taking place in
this repository under the ytep0012 bookmark.
This work is being done alongside the unit refactor and thus includes all changes in the
unitrefactor bookmark here. The majority of the
work is taking place within yt/analysis_modules/halo_analysis. Everything detailed below,
except where explicitly noted, has been implemented.
HaloCatalogTimeSeries¶
Not Implemented. This is currently awaiting development of a new merger tree framework.
HaloCatalog¶
This relies on the recently added ability to load a Rockstar halo catalog as a yt dataset,
referred to hereafter as a halo finder dataset for clarity. A HaloCatalog object is created
by providing it with a simulation dataset, a halo finder dataset, or both.
dpf = load("DD0064/DD0064")
hpf = load("rockstar_halos/halos_64.0.bin")
hc = HaloCatalog(halos_pf=hpf, data_pf=dpf,
                 output_dir="halo_catalogs/catalog_0064")
If halos_pf is not given, halo finding will be done using the method provided with the
finder_method keyword (not implemented). A data container can also be given, associated
with either dataset, to control the spatial region in which halo analysis will be performed.
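For instance, analysis could be restricted to a sub-volume along the following lines;
the data_source keyword in this sketch is an assumption based on the description above.

# Hypothetical sketch: restrict halo analysis to a sphere within the
# simulation dataset; the data_source keyword is an assumption.
sp = dpf.sphere("max", (25.0, "Mpc"))
hc = HaloCatalog(halos_pf=hpf, data_pf=dpf, data_source=sp,
                 output_dir="halo_catalogs/catalog_0064")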
Analysis is done by adding actions to the HaloCatalog. Each action is represented by a
callback function that will be run on each halo. There are three types of actions.
1. Quantities - a callback that returns a value or values. The return values are stored within the halo object in a dictionary called “quantities.” At the end of the analysis, all of these quantities will be written to disk as the final form of the generated “halo catalog.”
import numpy as np

# definition of the center of mass quantity
def center_of_mass(halo):
    if halo.particles is None:
        raise RuntimeError("Center of mass requires halo to have particle data.")
    return (halo.particles['particle_mass'] *
            np.array([halo.particles['particle_position_x'],
                      halo.particles['particle_position_y'],
                      halo.particles['particle_position_z']])).sum(axis=1) / \
           halo.particles['particle_mass'].sum()
# add to a registry of available quantities
add_quantity('center_of_mass', center_of_mass)
# in the actual halo analysis script
hc.add_quantity("center_of_mass")
The above example is better suited as a parallel-safe derived quantity, but the use is the same.
Instead of being generated from a callback, quantities can also be values pulled directly from the halo finder dataset.
# get the field value ("halos", "particle_mass") for this halo from the halo dataset
hc.add_quantity("particle_mass", field_type="halos")
2. Callbacks - the callback is actually the superclass of quantities and filters and is a
general purpose function that does something, anything, to a Halo object. This can include
hanging new attributes off the Halo object, performing analysis and writing to disk, etc.
A callback does not return anything.
In the example below, we create a sphere for a halo with a radius that is twice the saved
“virial_radius” (in the quantities dict), recenter it on the location of maximum density,
then do some profiling, compute virial quantities based on those profiles (storing them in the
quantities dict), and then write the profiles to disk.
hc.add_callback("sphere", radius_field="virial_radius", factor=2.0)
hc.add_callback("sphere_field_max_recenter", ("gas", "density"))
hc.add_callback("profile", "radius", [("gas", "matter_mass"),
                                      ("index", "cell_volume")],
                weight_field=None, accumulation=True,
                output_dir="profiles", storage="profiles")
hc.add_callback("virial_quantities", ["radius", ("gas", "matter_mass")])
hc.add_callback("save_profiles", storage="profiles")
Currently existing stock callbacks:
- sphere creation
- sphere recenter
- sphere bulk velocity
- 1D profiling
- virial quantity calculation based on 1D profiles
- writing profile data
- reloading saved profile data
- removing Halo attributes
- PhasePlot
3. Filters - a filter is a callback function that returns True or False. If the return value is True, any further queued analysis will proceed and the halo in question will be added to the final catalog. If the return value is False, further analysis will not be performed and the halo will not be included in the final catalog.
hc.add_filter("quantity_value", "matter_mass_200", ">", 1e13, "Msun")
Currently existing stock filters:
- quantity filter (shown above): uses an eval statement with a value stored in the quantities dict
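A custom filter can follow the same registration pattern as quantities. In the sketch
below, the add_filter registry and the quantity name are assumptions based on that
pattern.

# Hypothetical custom filter following the quantity registration pattern;
# add_filter and the quantity name are assumptions.
def minimum_mass(halo, threshold):
    # True keeps the halo and continues its analysis; False drops it.
    return halo.quantities["particle_mass"] > threshold

add_filter("minimum_mass", minimum_mass)

# in the actual halo analysis script
hc.add_filter("minimum_mass", 1e13)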
Running and Order of Operations¶
After all callbacks, quantities, and filters have been added, the analysis begins with a call to
HaloCatalog.create.
hc.create(save_halos=False, njobs=-1, dynamic=True)
The save_halos keyword determines whether the actual Halo objects are saved after analysis on
them has completed, or whether just the contents of their quantities dicts will be retained for
creating the final catalog. The looping over halos uses a call to parallel_objects, allowing the
user to control how many processors work on each halo. The final catalog is written to disk in
the output directory given when the HaloCatalog object was created. The final halo catalog can
then be loaded as a yt dataset, just in the manner of a halo finder dataset.
All callbacks, quantities, and filters are stored in an “actions” list, meaning that they are executed in the same order in which they were added. This enables the use of simple, reusable, single action callbacks that depend on each other. This also prevents unnecessary computation by allowing the user to add filters at multiple stages to skip remaining analysis if it is not warranted.
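For example, placing a cheap mass cut before expensive profiling means the profiling
callbacks never run on halos that fail the cut. This sketch reuses the calls shown
above; the quantity name in the filter is an assumption.

# Actions execute in the order added, so an early filter short-circuits
# the expensive callbacks that follow for any halo failing the cut.
hc.add_filter("quantity_value", "particle_mass", ">", 1e13, "Msun")
hc.add_callback("sphere", radius_field="virial_radius", factor=2.0)
hc.add_callback("profile", "radius", [("gas", "matter_mass")],
                weight_field=None, accumulation=True)
hc.create()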
Reloading a Halo Catalog¶
A HaloCatalog saved to disk can be reloaded as a yt dataset with the standard call to load. Any
side data, such as profiles, can be reloaded with a load_profiles callback and a call to
HaloCatalog.load.
from yt.mods import *
from yt.analysis_modules.halo_analysis.api import *

hpf = load("halo_catalogs/catalog_0046/catalog_0046.0.h5")
hc = HaloCatalog(halos_pf=hpf,
                 output_dir="halo_catalogs/catalog_0046")
hc.add_callback("load_profiles", output_dir="profiles",
                filename="virial_profiles")
hc.load()
The load and create functions are wrappers around a _run function responsible for looping over
all the halos and applying callbacks. The only difference between the two is that their default
keyword arguments are specifically tailored for creating catalogs (do not retain Halo objects,
do write the catalog) or rereading catalogs (do retain Halo objects, do not write the catalog).
Halo¶
Halo objects are created by the HaloCatalog during the call to HaloCatalog._run. They are
mainly meant to be temporary objects for retaining information so that it can be passed between
callbacks. They can be kept by specifying save_halos=True. This might be useful when working
with a time series of halo catalogs, where future_self and past_self attributes may act as
pointers to Halo objects within HaloCatalogs that are time-adjacent.
Remaining Work¶
See the Trello board.
All are welcome to get involved with development, testing, etc!