Feature Extraction

The first stage consists of transforming the raw data into a uniform data matrix, which is subsequently given as input to the learning algorithm.

resspect can handle FITS format data from the RESSPECT project, CSV data from the Photometric LSST Astronomical Classification Challenge (PLAsTiCC), and text-like data from the SuperNova Photometric Classification Challenge (SNPCC).

Before starting any analysis, you need to choose a feature extraction method; all light curves will then be handled by this method. The examples below use either the Bazin feature extraction method (Bazin et al., 2009) or the Malanchev feature extraction method (Malanchev et al., 2021).

Load 1 light curve:

For SNPCC using Bazin features:

The raw data looks like this:

SURVEY: DES   
SNID:   729076   
IAUC:    UNKNOWN 
PHOTOMETRY_VERSION: DES 
SNTYPE:  -9 
FILTERS: griz 
RA:      0.500000  deg 
DECL:    -43.000000  deg 
MAGTYPE: LOG10  
MAGREF:  AB  
FAKE:    2   (=> simulated LC with snlc_sim.exe) 
MWEBV:   0.0111    MW E(B-V) 
REDSHIFT_HELIO:   0.21624 +- 0.03840  (Helio, z_best) 
REDSHIFT_FINAL:   0.21624 +- 0.03840  (CMB) 
REDSHIFT_SPEC:    -9.00000 +- 9.00000  
REDSHIFT_STATUS: OK 
 
HOST_GALAXY_GALID:   18506 
HOST_GALAXY_PHOTO-Z:   0.2162  +- 0.0384  



SIM_MODEL:  NONIA  10  (name index) 
SIM_NON1a:      27   (non1a index) 
SIM_COMMENT:  SN Type = II , MODEL = SDSS-015339  
SIM_LIBID:  4  
SIM_REDSHIFT:  0.1838  
SIM_HOSTLIB_TRUEZ:  0.1800  (actual Z of hostlib) 
SIM_HOSTLIB_GALID:  18506  
SIM_DLMU:      39.752602  mag   [ -5*log10(10pc/dL) ]  
SIM_RA:        0.500000 deg  
SIM_DECL:      -43.000000 deg  
SIM_MWEBV:   0.0084   (MilkyWay E(B-V)) 
SIM_PEAKMAG:   21.05  21.21  21.24  21.67  (griz obs)
SIM_EXPOSURE:     1.0    1.0    1.0    1.0  (griz obs)
SIM_PEAKMJD:   56239.878906  days 
SIM_SALT2x0:   1.256e-16   
SIM_MAGDIM:    0.000  
SIM_SEARCHEFF_MASK:  3  (bits 1,2=> found by software,humans) 
SIM_SEARCHEFF:  1.0000  (spectro-search efficiency (ignores pipelines)) 
SIM_TRESTMIN:   -38.74   days 
SIM_TRESTMAX:    66.02   days 
SIM_RISETIME_SHIFT:   0.0 days 
SIM_FALLTIME_SHIFT:   0.0 days 

SEARCH_PEAKMJD:   56240.938  


# ============================================ 
# TERSE LIGHT CURVE OUTPUT: 
#
NOBS: 77 
NVAR: 9 
VARLIST:  MJD  FLT FIELD   FLUXCAL   FLUXCALERR   SNR    MAG     MAGERR  SIM_MAG
OBS:  56194.012  g NULL   1.309e+01   6.204e+00   2.11   99.000    5.000   98.974
OBS:  56194.016  r NULL  -4.680e+00   3.585e+00  -1.31   99.000    5.000   99.014
OBS:  56194.023  i NULL  -1.936e+00   6.147e+00  -0.31   99.000    5.000   98.972
OBS:  56194.031  z NULL   2.477e+01   1.509e+01   1.64   99.000    5.000   99.050
OBS:  56198.992  g NULL   9.439e+00   1.868e+01   0.51   99.000    5.000   98.974
OBS:  56199.000  r NULL  -8.159e+00   9.049e+00  -0.90   99.000    5.000   99.014
OBS:  56199.008  i NULL  -1.962e+00   1.181e+01  -0.17   99.000    5.000   98.972

You can load this data using:

>>> from resspect import BazinFeatureExtractor

>>> path_to_lc = 'data/SIMGEN_PUBLIC_DES/DES_SN729076.DAT'

>>> lc = BazinFeatureExtractor()             # create light curve instance
>>> lc.load_snpcc_lc(path_to_lc)             # read data

This allows you to visually inspect the content of the light curve:

>>> lc.photometry                            # check structure of photometry
          mjd band     flux  fluxerr   SNR
 0   56194.012       g       13.090  6.204   2.11    99.000  5.000
 1   56194.016       r       -4.680  3.585   -1.31   99.000  5.000
 ... ...     ...     ...     ...     ...     ...     ...
 75  56317.051       i       173.200 7.661   22.60   21.904  0.049
 76  56318.035       z       141.000 13.720  10.28   22.127  0.111
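To make the link between the raw file and this table explicit, here is a minimal sketch of how the `OBS:` rows could be read by hand. It is not resspect's own loader: `parse_snpcc_obs` is a hypothetical helper written for illustration, and the sample text is copied from the raw file shown above. Comparing values suggests that mjd, band, flux, fluxerr and SNR correspond to the MJD, FLT, FLUXCAL, FLUXCALERR and SNR columns, with the two trailing values matching MAG and MAGERR.

```python
# Hypothetical helper, not part of resspect: parse the "OBS:" rows of an
# SNPCC file into dictionaries keyed by the column names given in VARLIST.
SAMPLE = """\
NOBS: 77
NVAR: 9
VARLIST:  MJD  FLT FIELD   FLUXCAL   FLUXCALERR   SNR    MAG     MAGERR  SIM_MAG
OBS:  56194.012  g NULL   1.309e+01   6.204e+00   2.11   99.000    5.000   98.974
OBS:  56194.016  r NULL  -4.680e+00   3.585e+00  -1.31   99.000    5.000   99.014
"""


def parse_snpcc_obs(text):
    """Return one dict per OBS: line; all values are kept as strings."""
    columns = []
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "VARLIST:":
            columns = tokens[1:]          # column names for the rows below
        elif tokens[0] == "OBS:":
            rows.append(dict(zip(columns, tokens[1:])))
    return rows


rows = parse_snpcc_obs(SAMPLE)
first = rows[0]
print(first["MJD"], first["FLT"], float(first["FLUXCAL"]))  # 56194.012 g 13.09
```

In practice `load_snpcc_lc` does this work for you; the sketch is only meant to show where each column of the table comes from.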

For SNPCC using Malanchev features:

You can load the data using:

>>> from resspect import MalanchevFeatureExtractor

>>> path_to_lc = 'data/SIMGEN_PUBLIC_DES/DES_SN729076.DAT'

>>> lc = MalanchevFeatureExtractor()         # create light curve instance
>>> lc.load_snpcc_lc(path_to_lc)             # read data

This allows you to visually inspect the content of the light curve:

>>> lc.photometry                            # check structure of photometry
          mjd band     flux  fluxerr   SNR
 0   56194.012       g       13.090  6.204   2.11    99.000  5.000
 1   56194.016       r       -4.680  3.585   -1.31   99.000  5.000
 ... ...     ...     ...     ...     ...     ...     ...
 75  56317.051       i       173.200 7.661   22.60   21.904  0.049
 76  56318.035       z       141.000 13.720  10.28   22.127  0.111

Fit 1 light curve:

For SNPCC using Bazin features:

To perform feature extraction in one specific filter, you can do:

>>> lc.fit('r')
[514.92432962  -5.99556655  40.59581991  40.03343317   3.74307339]

The designations of the parameters are stored in:

>>> lc.features_names
['a', 'b', 't0', 'tfall', 'trise']
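For orientation, the five parameters can be plugged into the commonly quoted form of the Bazin function, f(t) = a * exp(-(t - t0) / tfall) / (1 + exp(-(t - t0) / trise)) + b. The sketch below is an illustration under that assumption, not resspect's internal code: `bazin_flux` is a hypothetical name, sign conventions vary between implementations, and the time axis is assumed to be the one used by the fit (presumably days since the first observation).

```python
import math


def bazin_flux(t, a, b, t0, tfall, trise):
    """Evaluate the commonly quoted Bazin form at time t.

    Assumption: t is on the same time axis as t0. For epochs far before t0
    the exponential in the denominator can overflow in plain Python.
    """
    dt = t - t0
    return a * math.exp(-dt / tfall) / (1.0 + math.exp(-dt / trise)) + b


# Parameters copied from the r-band fit shown above.
params = dict(a=514.92432962, b=-5.99556655, t0=40.59581991,
              tfall=40.03343317, trise=3.74307339)

for t in (0.0, 40.0, 120.0):
    print(t, round(bazin_flux(t, **params), 2))
```

With this form, the flux at t = t0 is a / 2 + b, the rise is governed by trise and the decline by tfall, while b acts as a constant baseline.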

It is possible to perform the fit in all filters at once and visualize the result using:

>>> lc.fit_all()                            # perform fit in all filters
>>> lc.plot_fit(save=True, show=True,
...             output_file='plots/SN' + str(lc.id) + '_flux.png')   # save to file

Bazin fit to a light curve. This is an example from SNPCC simulations.

This can be done in flux as well as in magnitude:

>>> lc.plot_fit(save=False, show=True, unit='mag')

Bazin fit to a light curve. This is an example from SNPCC data.

Occasionally, it is necessary to extrapolate the fitted light curve to a later epoch, for example when we want to estimate its magnitude at the time of spectroscopic measurement (details in the time domain preparation section).

Before deploying large batches for pre-processing, you might want to visualize how the extrapolation behaves for a few light curves. This can be done using:

>>> # latest epoch of this light curve, in days since its first observation
>>> max_mjd = max(lc.photometry['mjd']) - min(lc.photometry['mjd'])

>>> lc.plot_fit(save=False, show=True, extrapolate=True,
...             time_flux_pred=[max_mjd+3, max_mjd+5, max_mjd+10])
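Since the extrapolation is typically used to estimate a magnitude, it helps to recall how the SNPCC fluxes relate to magnitudes. The sketch below assumes the SNANA convention of a zero point of 27.5 for FLUXCAL; this value is not stated in this document, but it reproduces the last two rows of the photometry table shown earlier. `flux_to_mag` is a hypothetical helper, not a resspect function.

```python
import math

SNPCC_ZERO_POINT = 27.5  # assumption: SNANA FLUXCAL zero point


def flux_to_mag(flux, zero_point=SNPCC_ZERO_POINT):
    """Return the magnitude for a positive flux, or None otherwise."""
    if flux <= 0:
        return None  # a magnitude is undefined for non-positive flux
    return zero_point - 2.5 * math.log10(flux)


print(round(flux_to_mag(173.200), 3))  # 21.904, as in row 75 of the table
print(round(flux_to_mag(141.000), 3))  # 22.127, as in row 76 of the table
```

The same conversion can be applied to a flux predicted at an extrapolated epoch, keeping in mind that predictions close to zero or negative have no defined magnitude.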

For SNPCC using Malanchev features:

To perform feature extraction in one specific filter, you can do:

>>> lc.fit('r')
[1.03403418e+00, 3.60823443e+02, 7.24896424e+02, 8.86255944e-01,
 6.03107809e+01, 1.23027000e+02, 2.50709726e+02, 6.38344483e+01,
 5.19719109e+01, 6.31578947e-01, 1.22756021e+00, 2.41334828e-02,
 6.15343688e+02]

The designations of the parameters are stored in:

>>> lc.features_names
['anderson_darling_normal', 'inter_percentile_range_5', 'chi2',
 'stetson_K', 'weighted_mean', 'duration', 'otsu_mean_diff',
 'otsu_std_lower', 'otsu_std_upper', 'otsu_lower_to_all_ratio',
 'linear_fit_slope', 'linear_fit_slope_sigma', 'linear_fit_reduced_chi2']
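Unlike the Bazin parameters, these features are summary statistics of the observed points rather than parameters of a fitted curve. As a rough illustration, two of the simpler ones can be written down directly. The definitions below are the textbook ones (inverse-variance weighted mean, and time span of the observations) and are an assumption: the extractor used by resspect may normalize or compute them differently. The input values are the first two r-band rows of the raw file shown earlier.

```python
def weighted_mean(values, errors):
    """Inverse-variance weighted mean of the measurements."""
    weights = [1.0 / (err * err) for err in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def duration(times):
    """Time span between the first and the last observation."""
    return max(times) - min(times)


# First two r-band observations from the raw SNPCC file shown earlier.
mjd = [56194.016, 56199.000]
flux = [-4.680, -8.159]
fluxerr = [3.585, 9.049]

print(weighted_mean(flux, fluxerr))
print(duration(mjd))
```

Observations with smaller uncertainties dominate the weighted mean, so in this example the result lies closer to the first flux value than to the second.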

It is possible to perform the fit in all filters at once:

>>> lc.fit_all()                            # perform fit in all filters

For PLAsTiCC:

Reading only 1 light curve from PLAsTiCC requires an object identifier. This can be done by:

>>> import pandas as pd
>>> from resspect import BazinFeatureExtractor

>>> path_to_metadata = '~/plasticc_train_metadata.csv'
>>> path_to_lightcurves = '~/plasticc_train_lightcurves.csv.gz'

# read metadata for the entire sample
>>> metadata = pd.read_csv(path_to_metadata)

# check keys
>>> metadata.keys()
Index(['object_id', 'ra', 'decl', 'ddf_bool', 'hostgal_specz',
       'hostgal_photoz', 'hostgal_photoz_err', 'distmod', 'mwebv', 'target',
       'true_target', 'true_submodel', 'true_z', 'true_distmod',
       'true_lensdmu', 'true_vpec', 'true_rv', 'true_av', 'true_peakmjd',
       'libid_cadence', 'tflux_u', 'tflux_g', 'tflux_r', 'tflux_i', 'tflux_z',
       'tflux_y'],
     dtype='object')

# choose 1 object
>>> snid = metadata['object_id'].values[0]

# create light curve object and load data
>>> lc = BazinFeatureExtractor()
>>> lc.load_plasticc_lc(photo_file=path_to_lightcurves, snid=snid)
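The object identifier matters because a PLAsTiCC light-curve file stores the observations of many objects in one table, so loading a single light curve amounts to selecting the rows with a given `object_id`. The sketch below illustrates that selection with the standard library only. It is not how `load_plasticc_lc` is implemented: `select_object` is a hypothetical helper, the column names follow the public PLAsTiCC light-curve files as far as we know, and the numbers are made up.

```python
import csv
import io

# Toy stand-in for a PLAsTiCC light-curve file; all values are invented.
TOY_CSV = """\
object_id,mjd,passband,flux,flux_err,detected_bool
615,59750.4229,2,-544.81,3.62,1
615,59750.4306,1,-816.43,5.55,1
713,59825.2600,2,7.31,1.10,0
"""


def select_object(csv_text, snid):
    """Return the rows (as dicts of strings) belonging to one object."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["object_id"]) == snid]


rows = select_object(TOY_CSV, 615)
print(len(rows))  # 2
```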

Processing all light curves in the data set

There are two ways to perform the fits for an entire data set: from a Python interpreter or from the command line. Using a Python interpreter:

For SNPCC using Bazin features:

>>> from resspect import fit_snpcc

>>> path_to_data_dir = 'data/SIMGEN_PUBLIC_DES/'            # raw data directory
>>> features_file = 'results/Bazin.csv'                     # output file
>>> feature_extractor = 'bazin'

>>> fit_snpcc(path_to_data_dir=path_to_data_dir, features_file=features_file,
...           feature_extractor=feature_extractor)

For SNPCC using Malanchev features:

>>> from resspect import fit_snpcc

>>> path_to_data_dir = 'data/SIMGEN_PUBLIC_DES/'            # raw data directory
>>> features_file = 'results/Malanchev.csv'                 # output file
>>> feature_extractor = 'malanchev'

>>> fit_snpcc(path_to_data_dir=path_to_data_dir, features_file=features_file,
...           feature_extractor=feature_extractor)
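Conceptually, a batch run over SNPCC amounts to looping over the light-curve files in the data directory, extracting features from each one, and writing one row per object to the output file. The sketch below shows only that pattern and is not resspect's implementation: `extract_all` and its `extract` argument are hypothetical, the `DES_SN` prefix simply mirrors the file name used earlier (`DES_SN729076.DAT`), and the demonstration uses empty placeholder files with a dummy extractor.

```python
import csv
import tempfile
from pathlib import Path


def extract_all(data_dir, features_file, extract, file_prefix="DES_SN"):
    """Apply `extract` to every matching file; write one CSV row per file.

    `extract` is a hypothetical callable: path -> list of feature values.
    Returns the number of files processed.
    """
    files = sorted(Path(data_dir).glob(file_prefix + "*"))
    with open(features_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        for path in files:
            writer.writerow([path.name] + list(extract(path)))
    return len(files)


# Demonstration with empty placeholder files and a dummy extractor.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("DES_SN000001.DAT", "DES_SN000002.DAT", "README.txt"):
        (Path(tmp) / name).write_text("")
    out = Path(tmp) / "features.csv"
    n_done = extract_all(tmp, out, extract=lambda path: [0.0, 1.0])
    print(n_done)  # 2
```

For real analyses, prefer `fit_snpcc`, which takes care of reading the files and naming the output columns.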

For PLAsTiCC:

>>> from resspect import fit_plasticc

>>> path_photo_file = '~/plasticc_train_lightcurves.csv.gz'
>>> path_header_file = '~/plasticc_train_metadata.csv'
>>> output_file = 'results/PLAsTiCC_Bazin_train.dat'
>>> feature_extractor = 'bazin'

>>> sample = 'train'

>>> fit_plasticc(path_photo_file=path_photo_file,
...              path_header_file=path_header_file,
...              output_file=output_file,
...              feature_extractor=feature_extractor,
...              sample=sample)

The same result can be achieved using the command line:

# for SNPCC
$ fit_dataset -s SNPCC -dd <path_to_data_dir> -o <output_file>

# for PLAsTiCC
$ fit_dataset -s <dataset_name> -p <path_to_photo_file> \
      -hd <path_to_header_file> -sp <sample> -o <output_file>