Multicomponent Separation of Spectra

Here we learn about separating the spectrum of a kind of galaxy called a Lyman-alpha emitter from the background sky and noise.

May 08, 2025

This post is about a recent project by Ana Sofia Uzsoy, a third-year graduate student in Astrophysics at Harvard University. Ana Sofia is an expert in statistical methods and in applying them to astrophysical problems. Here, she talks to me about a project where she uses statistics to separate the spectrum of a galaxy from the background sky and noise spectra in data taken by the DESI instrument. You can read the full paper here.

Background

First, some background! Spectroscopy is the bread and butter of astronomy. By looking at the light emitted by distant astrophysical objects, we can understand what they are made of and how they evolve over time. From very nearby objects, like asteroids and even satellites, to the most distant objects, like high-redshift galaxies, the only way we can figure out the details of what they are made of is by looking at their spectra. Therefore, astronomers do a lot of hard work to observe, process, and interpret the spectra that they take of astronomical objects.

The objects of interest for Ana Sofia’s project are Lyman-Alpha Emitters (LAEs). An LAE is a type of galaxy whose spectrum is dominated by one emission line: the Lyman-alpha line. Lyman-alpha is emitted when an electron falls from the n=2 level to the n=1 level in neutral hydrogen. If you find this line in a galaxy, it is usually a sign of ongoing star formation in that galaxy. LAEs are therefore galaxies where a LOT of star formation is ongoing.
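
As a quick check on where this line sits, the Rydberg formula for hydrogen gives the wavelength of the n=2 to n=1 transition. The Rydberg constant for hydrogen, \(R_H \approx 1.0968\times10^{7}~\rm m^{-1}\), is a standard textbook value and is not quoted in the project description itself:

\(\frac{1}{\lambda} = R_H\left(\frac{1}{1^2}-\frac{1}{2^2}\right) = \frac{3}{4}R_H \approx 8.226\times10^{6}~\rm m^{-1}\)

\(\lambda \approx 1.2157\times10^{-7}~\rm m \approx 1216~\rm \AA\)

So in the rest frame of the galaxy, Lyman-alpha is an ultraviolet line at roughly 1216 Angstroms.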

The Project

Ana Sofia’s project works with spectra of LAEs taken by DESI, the Dark Energy Spectroscopic Instrument. Specifically, the goal is to 1) determine whether a given spectrum is an LAE spectrum and 2) find the cosmological redshift of that LAE.

https://www.desi.lbl.gov/wp-content/uploads/sites/8/2018/06/4.png — Image of DESI looking at the sky, taken from https://www.desi.lbl.gov/photos/

Determining whether each spectrum taken by DESI is of an LAE and the redshift of the LAE in that spectrum can involve some complicated math. Because DESI is on the ground, and is looking through the atmosphere of the Earth, any spectrum it takes will be the sum of any emission from the atmosphere and from the astronomical object it is looking at. In particular, the spectrum will be the sum of three components: 1) the sky (or atmosphere), 2) the actual target (the galaxy), and 3) background noise. Let’s jump into the math!

The Nitty-Gritty

In our case, there are three components. Let’s call them X, where each “X” is just a list: the list of fluxes from that source in a series of wavelength bins. If you add up the three components, you get the “total” X, which is the observed spectrum that DESI takes.

\(X_{tot} = X_{galaxy}+X_{sky}+X_{noise}~~(1)\)

You can then imagine a covariance matrix C for each component, which tells you how every wavelength bin covaries with every other wavelength bin. Each C is an NxN matrix, where N is the number of wavelength bins (i.e. the length of each X). These covariance matrices are the “prior”, and allow you to enforce what you expect each component to look like. For example, the galaxy is “supposed” to emit only a Lyman-alpha line. So, its covariance matrix would be a bunch of zeros everywhere except where the row and the column of the Lyman-alpha wavelength bin meet.

Now, imagine adding these covariance matrices together. That would create a covariance matrix for the “total”, the observed spectrum.

\(C_{tot} = \Sigma_{i=1}^3C_i\)

Now, let’s do some simple algebra with the total covariance matrix:

\(\Sigma_i~X_i = X_{tot}\)

\(\Sigma_i~X_i = C_{tot}~C_{tot}^{-1}~X_{tot}\)

\(\Sigma_i~X_i = \Sigma_i~C_i~C_{tot}^{-1}~X_{tot}\)

Matching the two sums term by term gives the estimate for each component:

\(X_i = C_i~C_{tot}^{-1}~X_{tot}~~(2)\)
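
To make equation 2 concrete, here is a minimal NumPy sketch on a toy spectrum. Everything specific in it is an assumption made for illustration and not taken from the paper: the number of bins, the three hand-built covariance matrices, and the fake spectrum. It only shows the mechanics of multiplying each prior covariance by \(C_{tot}^{-1}~X_{tot}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 50                      # N wavelength bins (toy size, assumed)
idx = np.arange(n_bins)

# Hypothetical priors, hand-built for illustration only.
line = np.exp(-0.5 * ((idx - 20) / 1.5) ** 2)             # narrow line at bin 20
components = {
    "galaxy": 25.0 * np.outer(line, line),                 # only the line may vary
    "sky": 4.0 * np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 10.0) ** 2),
    "noise": 0.5 * np.eye(n_bins),                         # independent per bin
}
C_tot = sum(components.values())

# Toy observed spectrum: line + smooth sky + white noise (equation 1).
X_tot = (5.0 * line
         + 2.0 * np.sin(idx / 8.0)
         + rng.normal(0.0, np.sqrt(0.5), n_bins))

# Equation 2: X_i = C_i C_tot^{-1} X_tot.
# Solving the linear system avoids forming the inverse explicitly.
weights = np.linalg.solve(C_tot, X_tot)
X_est = {name: C @ weights for name, C in components.items()}

# The separated components add back up to the observed spectrum.
reconstruction = sum(X_est.values())
print(np.allclose(reconstruction, X_tot))
```

The final check is the "simple algebra" above in action: because the three covariance matrices sum to \(C_{tot}\), the three estimated components sum to the observed spectrum.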

In other words, if we have a way to construct these covariance matrices for each of the three components, we can extract the different spectroscopic components from any observed DESI spectrum. But if we zoom out a little, we can learn something deeper!

These covariance matrices are nothing more than a way to quantify our prior expectations of each of the components that we think went into producing the spectrum. Therefore, this method allows us to define what components we want to decompose our data into, and get those components out of the data. Moreover, we can do all that without losing any of the data: the separated components add back up to exactly the observed spectrum. You can contrast this with something like principal component analysis (PCA), which is another method of decomposing data into the sum of different components. PCA breaks the data into as many components as there are wavelength bins, so keeping only the first few components forces you to lose at least some of the data. Moreover, PCA doesn’t allow you to impose any physical meaning on the components it breaks the data into. As such, using priors defined by covariance matrices is a very robust method for component separation!
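
The point about PCA losing data can be seen in a few lines. The sketch below uses random numbers as stand-in "spectra" (an assumption purely for illustration) and computes the principal components with a singular value decomposition. Reconstructing a spectrum from only the first few components leaves a residual, while using all of them does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n_spectra, n_bins = 200, 30

# Hypothetical training set, one spectrum per row.
D = rng.normal(size=(n_spectra, n_bins))
mean = D.mean(axis=0)

# Principal axes are the right singular vectors of the mean-subtracted data.
_, _, Vt = np.linalg.svd(D - mean, full_matrices=False)

x = rng.normal(size=n_bins)      # a new spectrum to decompose


def pca_reconstruct(spectrum, k):
    """Project onto the first k principal components and map back."""
    basis = Vt[:k]
    return mean + (spectrum - mean) @ basis.T @ basis


err_truncated = np.linalg.norm(x - pca_reconstruct(x, 5))
err_full = np.linalg.norm(x - pca_reconstruct(x, n_bins))
print(err_truncated, err_full)
```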

Now, the next question is how these covariance matrices are constructed. Let’s say we have a LOT of previously observed data on LAEs. Let’s say we put all of that into one big data matrix, called D. D is a PxN matrix, where P is the number of observations (i.e. the number of spectra observed) and N is the number of wavelength bins. You can construct a “data-driven” covariance matrix using the equation:

\(C = \frac{D^{T}D}{P}~~(3)\)

You multiply the transpose of D by D, which gives an NxN matrix, and divide by the number of spectra. Using this, you get the covariance matrix of the data held in D. Therefore, to use this method to get the covariance matrix of each component (for example, the covariance matrix of the sky component), you need data for each component (here, observations of just the sky). Ana Sofia was given a large dataset of spectra of confirmed LAEs. Using that, along with a bunch of spectra taken of the sky with DESI, Ana Sofia is able to construct the necessary covariance matrices for this work.
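
Here is a short sketch of building a data-driven covariance matrix from a stack of spectra. The "training set" is simulated (a line of random strength plus a little noise), which is an assumption for illustration and not the real LAE or sky data. The spectra are stored one per row, so D has shape (P, N), and the product is ordered so that the result is N by N.

```python
import numpy as np

rng = np.random.default_rng(1)
n_spectra, n_bins = 500, 40      # P spectra, N wavelength bins (toy sizes)

# Hypothetical training spectra: an emission line of random strength at bin 15,
# plus a small amount of independent noise in every bin.
line = np.exp(-0.5 * ((np.arange(n_bins) - 15) / 2.0) ** 2)
amplitudes = rng.normal(0.0, 1.0, size=(n_spectra, 1))
D = amplitudes * line + 0.1 * rng.normal(size=(n_spectra, n_bins))

# Data-driven covariance: average the outer product of each spectrum with itself.
C = D.T @ D / n_spectra

print(C.shape)                    # one row and one column per wavelength bin
print(int(np.argmax(np.diag(C)))) # bin with the largest variance
```

Note that this estimator does not subtract the mean spectrum first, unlike `np.cov`, so the two agree only when the mean of the data is zero.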

Redshifts

Now, there’s one last step. LAEs can be found at a large range of distances, and therefore at a large range of cosmological redshifts. Because of that redshift, the Lyman-alpha emission line in the spectrum observed by DESI is shifted to a longer wavelength. Is it possible to use the separated LAE component spectrum to figure out what redshift the observed LAE is at?
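
For reference, the size of that shift follows the standard redshift relation. The rest wavelength of Lyman-alpha used here (about 1215.67 Angstroms) is a standard value that does not appear in the project description, and z = 3 is just an illustrative choice:

\(\lambda_{obs} = (1+z)~\lambda_{rest}\)

\(\lambda_{obs}(z=3) = 4\times1215.67~\rm \AA \approx 4863~\rm \AA\)

So a line emitted in the ultraviolet lands in the visible part of the spectrum, and finding where the line sits in the observed spectrum is equivalent to finding z.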

In this work, Ana Sofia constructed a chi-squared test to determine the redshifts of observed LAE spectra. This equation gives a measure of how well the data are represented by the components that make up \(C_{tot}\):

\(\chi^2 = X_{tot}^{T}~C_{tot}^{-1}~X_{tot}~~(4)\)

Therefore, one can measure how much better it is to include the LAE component at a given redshift z using the delta chi-squared test:

\(\Delta\chi^2(z)~ = \chi^2[\rm sky+noise+LAE(z)] - \chi^2[\rm sky + noise]~~(5)\)

This value can never be positive, because adding the LAE(z) component to the total covariance can only lower the chi-squared (or leave it unchanged) compared to not including it. The redshift z at which the LAE component matches the data best gives the most negative delta chi-squared, and that redshift is the one assigned to the LAE.
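
The redshift scan can be sketched in a few lines of NumPy. This is a toy under loudly stated assumptions and not the pipeline from the paper: the wavelength grid, the smooth sky prior, the noise level, and especially `lae_covariance` (a hypothetical rank-one prior built from a single Gaussian line) are all made up for illustration, whereas the real priors are built from data. Delta chi-squared is computed here as the chi-squared with the LAE component minus the chi-squared without it, so the best redshift is the most negative value.

```python
import numpy as np

LYA_REST = 1215.67                          # Lyman-alpha rest wavelength, Angstrom
wave = np.linspace(3600.0, 9800.0, 400)     # assumed observed wavelength grid


def lae_covariance(z, amp=10.0, width=20.0):
    """Hypothetical rank-one LAE prior: one Gaussian line at (1 + z) * LYA_REST."""
    line = np.exp(-0.5 * ((wave - LYA_REST * (1.0 + z)) / width) ** 2)
    return amp**2 * np.outer(line, line), line


def chi2(C, x):
    """Equation 4, x^T C^{-1} x, using a linear solve in place of the inverse."""
    return float(x @ np.linalg.solve(C, x))


# Assumed background prior: smooth, correlated sky plus independent noise.
sep = wave[:, None] - wave[None, :]
C_bg = 4.0 * np.exp(-0.5 * (sep / 300.0) ** 2) + 1.0 * np.eye(wave.size)

# Toy observation: a line injected at z_true, a smooth sky, and white noise.
rng = np.random.default_rng(3)
z_true = 3.5
_, true_line = lae_covariance(z_true)
X_tot = (10.0 * true_line
         + 2.0 * np.sin(wave / 500.0)
         + rng.normal(0.0, 1.0, wave.size))

# Scan a grid of trial redshifts: chi-squared with the LAE component
# minus chi-squared without it.
z_grid = np.linspace(2.0, 6.0, 81)
chi2_bg = chi2(C_bg, X_tot)
delta_chi2 = np.array([chi2(C_bg + lae_covariance(z)[0], X_tot) - chi2_bg
                       for z in z_grid])

z_best = z_grid[np.argmin(delta_chi2)]      # most negative delta chi-squared
print(z_best)
```

Each trial redshift only moves the line in the LAE prior, so the scan amounts to asking where along the spectrum an emission line explains the data best.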

Tying it Together

So, using the method described above, one would be able to decompose any observed DESI spectrum into its noise, atmosphere, and LAE target components. Then, one can calculate the delta chi-squared for that spectrum at a variety of redshifts using equation 5. Using this method, DESI will now have an automated way of both classifying and calculating the redshift of every LAE spectrum they observe! This is a massive improvement over what has been done until now: classifying and “redshifting” by eye for every spectrum taken.