
Before I start, let me note that I have 0 experience with signal processing, so please bear with me:

My System

My system can be represented as an $m \times n$ matrix $X$ (input) where each column represents a "channel" and each row represents "time". A given input $x_{i,j}$ (time $i$ and channel $j$) produces:

  1. A signal $S = [R_0, R_1, R_2, ...]$ on channel $j$ starting at time $i$.
  2. A smaller (and inverted) signal on the neighboring channels. The scale by which the signal is reduced can be represented as a vector $B = [1, -\alpha_1, -\alpha_2, ... -\alpha_k, ...]$, where $k$ is the distance from the original channel $j$.

Let's assume that $S = [R_0, R_1, R_2]$ and $B = [1, -\alpha]$. Visually, this would look like:

\begin{equation*} X = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \implies Y = \begin{bmatrix} -\alpha R_0 & R_0 & -\alpha R_0 & 0 & 0 \\ -\alpha R_1 & R_1 & -\alpha R_1 & 0 & 0 \\ -\alpha R_2 & R_2 & -\alpha R_2 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 2R_0 & -2\alpha R_0 & 0 & 0 & 0 \\ 2R_1 & -2\alpha R_1 & 0 & 0 & 0 \\ 2R_2 & -2\alpha R_2 & -5\alpha R_0 & 5R_0 & -5\alpha R_0 \\ 0 & 0 & -5\alpha R_1 & 5R_1 & -5\alpha R_1 \\ \end{bmatrix} \end{equation*}

I noticed that I can express this as \begin{equation*} Y = R X A \end{equation*}

where $R$ is a lower triangular matrix with $R_i$ on the $i$th diagonal: \begin{equation*} R = \begin{bmatrix} R_0 & 0 & 0 & \dots & \dots & 0 \\ R_1 & R_0 & 0 & \ddots & & \vdots \\ R_2 & R_1 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 & 0 \\ \vdots & & \ddots & R_1 & R_0 & 0 \\ R_{m-1} & \dots & \dots & R_2 & R_1 & R_0 \end{bmatrix} \end{equation*}

and $A$ is a symmetric matrix with $1$ on the main diagonal and $-\alpha_i$ on the $i$th off-diagonals: \begin{equation*} A = \begin{bmatrix} 1 & -\alpha_1 & -\alpha_2 & \dots & \dots & 0 \\ -\alpha_1 & 1 & -\alpha_1 & \ddots & & \vdots \\ -\alpha_2 & -\alpha_1 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & -\alpha_1 & -\alpha_2 \\ \vdots & & \ddots & -\alpha_1 & 1 & -\alpha_1 \\ 0 & \dots & \dots & -\alpha_2 & -\alpha_1 & 1 \end{bmatrix} \end{equation*}
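For reference, both matrices are Toeplitz and can be built directly from $S$ and $B$ (a NumPy sketch; S, alphas, m and n are placeholders):

import numpy as np
from scipy.linalg import toeplitz

def build_R(S, m):
    # Lower-triangular Toeplitz matrix with the response S down the first column.
    col = np.zeros(m)
    col[:len(S)] = S
    return np.tril(toeplitz(col))

def build_A(alphas, n):
    # Symmetric banded Toeplitz crosstalk matrix: 1 on the diagonal, -alpha_k on the k-th off-diagonals.
    col = np.zeros(n)
    col[0] = 1.0
    col[1:1 + len(alphas)] = -np.asarray(alphas)
    return toeplitz(col)

# Forward model: Y = build_R(S, m) @ X @ build_A(alphas, n)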

My Current Solution

Given the nature of the $R$ matrix, any noise makes a direct solution blow up.

I then set up my system as \begin{equation*} YA^{-1} = RX \end{equation*} and find $X \geq 0$ that minimizes \begin{equation*} \left\| YA^{-1} - RX \right\|_2 \end{equation*}

This works great. I am using Landweber iteration to solve this, and my solution looks good.
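In pseudocode, the projected Landweber update for this problem looks roughly like the following (a NumPy sketch; the step size tau and the iteration count are placeholders, not the values I actually use):

import numpy as np

def projected_landweber(R, B, tau, n_iter=200):
    # Minimize ||R X - B||_2 subject to X >= 0, where B = Y A^{-1}.
    # The classical Landweber condition is 0 < tau < 2 / sigma_max(R)^2.
    X = np.zeros((R.shape[1], B.shape[1]))
    for _ in range(n_iter):
        X += tau * R.T @ (B - R @ X)   # Landweber (gradient) step
        np.maximum(X, 0.0, out=X)      # project onto X >= 0
    return X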

The problem is that it is too slow for my needs. I need to solve these systems in a hot loop millions of times.

My Actual Question

Do you have any suggestions?

I am thinking that I could treat every channel independently. The signals induced from other channels will just appear in a given channel as negative inputs (given that they have the same shape).

I could deconvolve each single channel by expressing the convolution as a product of FFTs, and then somehow try to match positive and negative signals in neighboring channels. Hopefully this "matching" would help mitigate random peaks from noise.
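Roughly, the per-channel FFT step I have in mind would be something like this (a NumPy sketch; eps is just a placeholder regularizer so the division does not blow up on noise):

import numpy as np

def fft_deconvolve(y, r, eps=1e-3):
    # y: observed signal on one channel, r: single-impulse response [R0, R1, R2, ...].
    n = len(y)
    Yf = np.fft.rfft(y, n)
    Rf = np.fft.rfft(r, n)
    # Regularized inverse filter: X(f) = Y(f) R*(f) / (|R(f)|^2 + eps).
    Xf = Yf * np.conj(Rf) / (np.abs(Rf) ** 2 + eps)
    return np.fft.irfft(Xf, n)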

Can people suggest some reading materials where others solve similar problems? I don't know the technical terms, so I haven't had much success googling.

Some examples of deconvolved individual channels:

I get some ugly noise peaks that are comparable to actual signal peaks. example 1 example 2

Edit 1

I have created a small git repository with some data and my sample solution using Landweber iteration. To reproduce it you can do:

Run:

git clone https://github.com/DJDuque/SP_Q.git

That will create the folder SP_Q.

Then go into the folder and run the Landweber iteration solution:

cd SP_Q
python landweber.py signal_*

That should produce a plot like: landweber

Edit 2

I have also added the relevant stuff in CSV format:

  1. Y (observed signals): https://github.com/DJDuque/SP_Q/blob/main/signals.csv
  2. R matrix: https://github.com/DJDuque/SP_Q/blob/main/R.csv
  3. A matrix: https://github.com/DJDuque/SP_Q/blob/main/A.csv
  4. A_inv: https://github.com/DJDuque/SP_Q/blob/main/A_inv.csv
  5. Y * A_inv: https://github.com/DJDuque/SP_Q/blob/main/y_times_a_inv.csv
  • Could you share some data? Like MAT file or something? – Royi Jul 01 '23 at 09:20
  • @Royi I have edited my question to add a typical system. I have also included my solution using Landweber iteration. If you have time to play around with the data, please let me know if there is anything else I can do to make this easier for you. – Daniel Duque Jul 01 '23 at 13:33
  • Could you share the signal as npz file or csv? – Royi Jul 01 '23 at 14:30
  • @Royi Thanks for looking into this. I have added a CSV with the signals to the repo (https://github.com/DJDuque/SP_Q/blob/main/signals.csv). Each column is a signal, and each row represents time. – Daniel Duque Jul 01 '23 at 14:38
  • It is a single CSV. How to extract R, Y and A from it? – Royi Jul 01 '23 at 14:41
  • @Royi You are right, sorry. I have added all the other relevant matrices in CSV format (see my post edit). Hopefully I didn't miss anything. The 1D response is also there as response.dat in case you want to try the FFT method. – Daniel Duque Jul 01 '23 at 14:54
  • @Royi Could you suggest some terms that I could use to Google and find examples of other people solving similar problems? I have no experience in this domain, so I haven't had much luck with that. – Daniel Duque Jul 01 '23 at 19:00
  • I need to play with the data to think what method is relevant before suggesting any. – Royi Jul 01 '23 at 19:02
  • Can you clarify, are you trying to determine an arbitrary input from a known channel and known output? What is your raw signal, is it the output? What is the channel? (My answer as written details channel estimation, but I can update it to be the input given the channel and output once I understand that) – Dan Boschen Jul 02 '23 at 15:17
  • @DanBoschen Given some observed signals (from multiple consecutive channels), I want to estimate what the input was. My channels are wires in a wire chamber (https://en.wikipedia.org/wiki/Wire_chamber). – Daniel Duque Jul 02 '23 at 18:32

5 Answers


Crosstalk between channels is small and the crosstalk matrix is well-conditioned, at least in the example you provided. Your matrix A has a condition number of 1.85, which is really good: you can invert it without amplifying the noise much. Simply calculate

$$ Y_{independent} = YA^{-1} $$

to remove the crosstalk. Afterwards, you can process each channel individually. You mentioned a non-negativity constraint on X, so the obvious algorithm is non-negative least squares (NNLS). Here is a bit of Matlab code along with the plots it produces. Running NNLS for each channel individually should be rather fast.

A = csvread('A.csv');
R = csvread('R.csv');
Y = csvread('signals.csv');

Ynoisy = awgn(Y, 10, 'measured');   % Add noise with 10 dB signal-to-noise ratio.
Yindependent = Ynoisy * inv(A);     % Remove crosstalk between channels.

X = zeros(size(Y));
for channel = 1:size(Y,2)           % Deconvolve each channel individually with NNLS.
    X(:,channel) = lsqnonneg(R, Yindependent(:,channel));
end

% Plot the results.
subplot(5,1,1); plot(Y(:,1:3));
title('Y'); legend('Channel 1', 'Channel 2', 'Channel 3');
subplot(5,1,2); plot(Ynoisy(:,1:3));
title('Y + noise'); legend('Channel 1', 'Channel 2', 'Channel 3');
subplot(5,1,3); plot(Yindependent(:,1:3));
title('(Y + noise) * inv(A) removes the crosstalk between channels, even with noise');
legend('Channel 1', 'Channel 2', 'Channel 3');
subplot(5,1,4); plot(X(:,1:3));
title('Non-Negative Least Squares (NNLS) solution for each channel individually');
legend('Channel 1', 'Channel 2', 'Channel 3');

5 Matlab plots

If you are still unhappy with the performance, you can exploit the fact that your convolution kernel is extremely front-heavy, basically an impulse with a long tail. A greedy algorithm will work just fine. Here is the changed Matlab code. The resulting plot is included in the image above; it is basically the same as with NNLS.

A = csvread('A.csv');
R = csvread('R.csv');
Y = csvread('signals.csv');

Ynoisy = awgn(Y, 10, 'measured');   % Add noise with 10 dB signal-to-noise ratio.
Yindependent = Ynoisy * inv(A);     % Remove crosstalk between channels.

X = zeros(size(Y));
for channel = 1:size(X,2)           % Deconvolve each channel individually.
    y = Yindependent(:,channel);    % Copy of the channel, we will modify it.
    for i = 1:length(y)-4
        % Greedy estimate for the next X value with a lookahead of only five
        % samples. Since the convolution kernel is extremely front-heavy, we
        % don't need more context to make a decision.
        x = min(y(i:i+4) ./ R(1:5,1));
        if x > 0
            X(i,channel) = x;
            % Subtract the current X value's tail from the Y vector.
            y(i:end) = y(i:end) - R(1:end+1-i, 1) * x;
        end
    end
end

subplot(5,1,5); plot(X(:,1:3));
title('Greedy solution for each channel individually');
legend('Channel 1', 'Channel 2', 'Channel 3');

Rainer P.
  • I appreciate you took the time to look into this. I am currently playing around with @Royi suggestions, but I will definitely give this a try. Working with $Y' = YA^{-1}$ allows me to treat each channel independently, and this might be much faster. Also, I would like to think more about your last suggestion; if that works, I think it would definitely be the fastest way of solving this. Just to be sure I understand: your $Y$ matrix in the last piece of code is actually $Y A^{-1}$ right? – Daniel Duque Jul 05 '23 at 18:59
  • @DanielDuque - Yes, the Y for deconvolution is actually Y', the one with crosstalk removed. I changed it to Yindependent in the Matlab code to clear up the confusion. The deconvolution of course works with the original Y, but you will have crosstalk in the result. Crosstalk removal and deconvolution are two separate problems. – Rainer P. Jul 05 '23 at 23:27
  • @RainerP., They are the same problem if the cross talk can be modeled with a linear operator like here. Anyhow, I would replace the explicit inverse with the proper numerical method to calculate the Yindependent matrix. – Royi Jul 06 '23 at 05:49
  • @Royi - I said "separate problems" because the convolution kernel is separable (is the outer product of two vectors, has rank 1). This means it's not a true 2D convolution, merely two 1D convolutions. The A matrix happens to be well-conditioned enough that you can simply invert it (and cache the inverted matrix), no need for better methods. Moreover, A is a band matrix and so is A inverted, which gives you a good speedup for matrix multiplication if you exploit it. – Rainer P. Jul 06 '23 at 07:28
  • @RainerP., Indeed it is a banded matrix (see my solution, I wrote about it). It is faster to use a banded matrix solver. I am not sure which convolution kernel you are referring to as separable. – Royi Jul 06 '23 at 07:33
  • @Royi - You start with an impulse on one channel, then the columnwise convolution adds a "long tail" impulse response and the rowwise convolution adds crosstalk. You can model it as a single 2D convolution if you wish (because the convolution operator is associative). I agree that a banded matrix solver is best. – Rainer P. Jul 06 '23 at 07:42
  • @RainerP., OK, now I see the difference. This operation doesn't mean the kernel is separable. It might not even be a convolution at all, for example if the crosstalk is not constant in time. Specifically in the data given, since $ \boldsymbol{A} $ is both banded and has constant diagonals, the crosstalk is indeed as you said. I thought you were talking in general. OK. – Royi Jul 06 '23 at 08:18
  • @Royi - The question describes the time convolution as a column vector S=[R0,R1,R2,...] and crosstalk as a row vector B=[1,−α1,−α2,...−αk,...] before constructing the R and A matrices. These matrices are separable by construction because they were, in fact, constructed from two vectors. If they were general matrices, for example if crosstalk was time-variant, this wouldn't be possible. – Rainer P. Jul 06 '23 at 09:01
  • @RainerP., Indeed, for these specific matrices, not in general. I commented that in general crosstalk is not just a convolution. In this realization of the data I fully agree with you. – Royi Jul 06 '23 at 09:04
  • @RainerP., This solution will yield exactly the solution in my post about the block matrix, whether you first tackle the wire convolution or the crosstalk. – Royi Jul 06 '23 at 10:03
  • @Royi - To tackle the crosstalk first is better because the non-negativity constraint doesn't apply while there is still crosstalk, which happens to have negative sign here. If you perform a linear deconvolution without the constraint, the order of the two deconvolutions doesn't matter. OP asked a great question with context and data. My answer attempts to exploit as many of OP's specificities and constraints as possible, which I think is key here to achieve a substantial speedup. – Rainer P. Jul 06 '23 at 10:14
  • Indeed. But in your method there is no regularization other than the non-negativity prior. The OP hasn't defined clearly what constraints he wants, but generally I solved with ${L}_{2}$ regularization and non-negativity. – Royi Jul 06 '23 at 11:49
  • @Royi - That's the point. Matrix A is regular and well-conditioned. I don't need regularization for the intermediary result, only for the final result (where I can use non-negativity). Even if operations are nominally commutative, the overall problem may become ill-conditioned if you do it in the wrong order. – Rainer P. Jul 06 '23 at 12:13
  • @RainerP., My guideline in inverse problems is never do them without regularization. Non Negativity is not helping with noise. But anyhow, that's the OP's choice. He wrote he wanted some regularization. – Royi Jul 06 '23 at 12:23
  • @RainerP., You can use the / operator (mrdivide): https://www.mathworks.com/help/matlab/ref/mrdivide.html. – Eric Johnson Jul 06 '23 at 17:17
  • @RainerP., I think your answer is entitled to half the bounty. So once the question allows another bounty I will create one and allocate it to this answer. – Royi Jul 10 '23 at 07:02
  • @RainerP., It won't let me set a bounty of 50 points for some reason. So I upvoted 5 of your answers. – Royi Jul 12 '23 at 08:14

In this variation we'll solve the problem in 2 steps.
Since the matrix $ \boldsymbol{A} $ works along the rows while $ \boldsymbol{R} $ works along the columns, we can solve for each one separately.

Solving for $ \boldsymbol{A} $

The problem is given by:

$$\begin{align} \arg \min_{\boldsymbol{Q}} \quad & \frac{1}{2} {\left\| \boldsymbol{Q} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} \\ \end{align}$$

In this case no regularization or constraint is added.
It can be solved by mQ = mY / mA (in both MATLAB and Julia), which is better than explicitly computing the inverse.

If $ \boldsymbol{A} $ is large, it might be better to use a Banded Matrix solver as $ \boldsymbol{A} $ is a banded matrix.
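In Python, for example, this step can be done without ever forming the inverse (a sketch; the bandwidth bw is an assumption, and for the ~30 x 30 A in the question a plain dense solve is already cheap):

import numpy as np
from scipy.linalg import solve_banded

# Q A = Y  <=>  A Q^T = Y^T (A is symmetric), so solve instead of forming inv(A):
Q = np.linalg.solve(A, Y.T).T

# If A were large, pass only its non-zero diagonals to a banded solver.
bw = 2                                              # number of non-zero off-diagonals (assumption)
n = A.shape[0]
ab = np.zeros((2 * bw + 1, n))
for k in range(-bw, bw + 1):                        # store diagonal k in row bw - k (LAPACK banded storage)
    ab[bw - k, max(k, 0):n + min(k, 0)] = np.diag(A, k)
Q_banded = solve_banded((bw, bw), ab, Y.T).T

For the sizes in the question the dense solve is fine; the banded form only pays off when $ \boldsymbol{A} $ gets much larger.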

Solving for $ \boldsymbol{R} $

The matrix $ \boldsymbol{R} $ basically applies a convolution-like operation to the data.

In this case we add regularization and non negativity constraint:

$$\begin{align} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} - \boldsymbol{Q} \right\|}_{F}^{2} + \frac{\lambda}{2} {\left\| \boldsymbol{X} \right\|}_{F}^{2} \\ \text{subject to} \quad & \begin{aligned} {X}_{ij} & \geq 0 \\ \end{aligned} \end{align}$$

You may use solvers which combine Ridge Regression and Non Negativity like in SciKit Learn's sklearn.linear_model.Ridge (Look at the positive parameter).
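For example, the per-channel step with scikit-learn could look like the following sketch (lam, R and q are placeholders, with q being one column of the crosstalk-free data; as far as I know positive=True is only supported with the 'lbfgs' solver):

from sklearn.linear_model import Ridge

# Per channel: min_x ||R x - q||_2^2 + lam * ||x||_2^2  subject to  x >= 0.
model = Ridge(alpha=lam, fit_intercept=False, positive=True, solver='lbfgs')
x_hat = model.fit(R, q).coef_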

Another approach would be just using Accelerated Projected Gradient Descent, which should be competitive and which you can actually write without any memory allocation.

I implemented this in Julia in a simple loop:

using DelimitedFiles;
using LinearAlgebra;
using MKL;

# Load the matrices from the CSV files given in the question.
mA = readdlm("A.csv", ',');
mR = readdlm("R.csv", ',');
mY = readdlm("signals.csv", ',');

function ∇ObjFun!( ∇mX, mX, mRR, mRQ, mRRX )
    # ∇f(X) = (RᵀR) X - RᵀQ + λ X (the λ term is the Tikhonov regularization).
    LinearAlgebra.BLAS.symm!('L', 'L', 1.0, mRR, mX, 0.0, mRRX);
    ∇mX .= mRRX .- mRQ .+ (λ .* mX);

    return ∇mX;
end

## Parameters

# Model
λ = 5;
η = 0.0000003;
minValThr = 0;

# Gradient Descent
numIter = 500;

## Analysis

# Step I: Solve arg min_Q ‖Q A - Y‖_F²
mQ = mY / mA;

# Step II: Solve arg min_X ‖R X - Q‖_F² subject to X_ij ≥ thr,
# using accelerated projected gradient descent.

# Pre Calculation
mRR = mR' * mR;
mRQ = mR' * mQ;

# Buffers
mX     = zeros(size(mR, 2), size(mA, 1));
mXPrev = zeros(size(mR, 2), size(mA, 1));
mZ     = zeros(size(mR, 2), size(mA, 1));
∇mZ    = zeros(size(mR, 2), size(mA, 1));
mRRX   = zeros(size(mRR, 1), size(mX, 2));

runTime = @elapsed for ii ∈ 1:numIter # FISTA (Nesterov) Accelerated

    ∇ObjFun!(∇mZ, mZ, mRR, mRQ, mRRX);

    mXPrev .= mX;
    mX .= max.(mZ .- (η .* ∇mZ), minValThr); # Gradient step + projection onto X ≥ minValThr.

    fistaStepSize = (ii - 1) / (ii + 2);

    mZ .= mX .+ (fistaStepSize .* (mX .- mXPrev));

end

# hObjFun! (objective evaluation) and mRX are defined in the full script.
println("Objective Function final value: $(hObjFun!(mX, mR, mQ, mRX))");
println("Total runtime: $(runTime) [Sec]");

The run time is ~0.2 [Sec], which is 35% faster than the MATLAB code of @Rainer P.:

tic()
Yindependent = Y * inv(A);   % Remove crosstalk between channels.

X = zeros(size(Y));
for channel = 1:size(Y,2)   % Deconvolve each channel individually with NNLS.
    X(:,channel) = lsqnonneg(R, Yindependent(:,channel));
end
toc()

By the way, I suggest using Yindependent = Y / A; instead of the explicit inv().

If we want to compare apples to apples, we can remove the regularization term in the Julia code and get a 50% run time improvement (more than twice as fast as the MATLAB code).

You can play a bit with the step size to make things even faster and more optimized.

Remark: This is similar to the approach of @Rainer P. with the addition of regularization.

Royi

Given the application, I assume the input can be well approximated as Gaussian pulses. The hypothesis is that, with a sufficiently high signal-to-noise ratio (as is the case with the OP's example data), we can reasonably estimate the location and magnitude of the input pulses from the peaks alone. If true, we can very efficiently clean the result with a simple peak detection and Gaussian filter strategy.

I demonstrate this using the following Python code. For the peak detection I simply used a width parameter of 5 and height parameter of 15% of the maximum peak. A more robust algorithm could use the "N-sigma algorithm" I describe here, where the threshold is dynamically determined based on the measured noise with the detected peaks removed.

import numpy as np
import scipy.signal as sig

def clean(wire, coeff):
    # Detect peaks.
    peaks, _ = sig.find_peaks(np.abs(wire), width=5, height=np.max(np.abs(wire)) * 0.15)
    # Convert peaks to impulses.
    impulses = np.zeros(len(wire))
    impulses[peaks] = wire[peaks]
    # Filter the derived impulses.
    scale = np.max(np.convolve(coeff, coeff))
    return sig.filtfilt(coeff, 1, impulses) / scale

Gaussian filter coefficients used:

ntaps = 15
n = np.arange(-ntaps//2 + 1, ntaps//2 + 1)
sigma = 4
coeff = 1 / (2 * np.pi * sigma**2) * np.exp(-n**2 / (2 * sigma**2))

The comparative results of the proposed method to Landweber are given below for the first few wires, and then all as a surface plot:

wire 0

wire 1

wire 2

wire 3

A surface plot of the resulting magnitudes for all peaks is given below:

comparison surface plot

Dan Boschen
  • Thanks for looking into this, but I am a bit confused about something: How does this take into account that an impulse in a wire induces an opposite signal on the neighboring wires? From what I understand, it simply identifies the peaks on each wire as the position of the input impulses, right? For example, there should not be any input identified for wire 0; that positive signal is completely induced from the actual input in channel 1. – Daniel Duque Jul 03 '23 at 21:59
  • Yes that’s true, I was showing how accounting for that may be insignificant in your case based on the results. Sometimes simpler is better? – Dan Boschen Jul 03 '23 at 22:09
  • Hmm, interesting. I will have to play a bit more with this. With the Landweber method, each peak in the signal is usually made of more than 1 consecutive impulses. From your proposed method it looks like 1 peak = 1 impulse. Your method will miss consecutive impulses, but at the end it might not matter. This is definitely interesting and I will play with it a bit more. I won't mark your answer as accepted yet, I will leave the question open for a while (maybe I'll open a small bounty to see if it attracts more attention). I want to see if other people have other nice suggestions. Thanks again – Daniel Duque Jul 03 '23 at 22:40
  • Yes understood and you will know much better your corner cases. Perhaps what you did here was one of your best instead of one of your worst. I don’t think we can assume Landweber is “truth” especially looking at wire 2, but would be great to have a simulation showing what truth should actually look like as we can from that determine best solutions in terms of the processing and accuracy trade. In any event this was very interesting- I deal with photon counting for other applications so was neat to see this, thank you! – Dan Boschen Jul 03 '23 at 22:57
  • As for sensitivity to consecutive impulses, that is done with the width and sigma parameters - make the width for peak detection narrower and make sigma narrower to establish the resulting recovered pulses with narrower width. – Dan Boschen Jul 03 '23 at 22:59
  • And one more comment is the reason I chose Gaussian is consistent with the central limit theorem in that when you do have a cluster of closely spaced impulses the result smears to a single Gaussian with the height indicating the total strength. There can be additional factors with the transmission of the signals from source to load resulting in reflections that may not be properly accounted for. – Dan Boschen Jul 03 '23 at 23:05
  • But if you had the actual input for an actual output - then we could derive the transfer function for the system which could then be used for all subsequent arbitrary inputs. Such an input would ideally be spectrally rich— known and specific impulses can work if the SNR is strong enough; ideally not time aligned or even one wire at a time. – Dan Boschen Jul 04 '23 at 00:39
  • @DanBoschen, Basically you are trying to build a sparse model of peaks. Yet mathematically it is better done using regularization, in my opinion. – Royi Jul 04 '23 at 09:24
  • @Royi it is a trade-off between simplicity and accuracy as to which might be better, and here simplicity is of value. I agree that regularization would be better for low SNR conditions if that is also important. In any event, it would be interesting to see the comparative results and code for other approaches to compare to these results. – Dan Boschen Jul 04 '23 at 10:21

The problem is given by:

$$\begin{align} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} \\ \end{align}$$

Yet the OP states that it is important for him to have some kind of regularization (which is, in effect, built into the Landweber iteration when it is stopped after the first few iterations).
So one can add a regularization function:

$$\begin{align} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} + \lambda R \left( \boldsymbol{X} \right) \\ \end{align}$$

A simple example would be using Tikhonov Regularization where $ R \left( \boldsymbol{X} \right) = \frac{1}{2} {\left\| \boldsymbol{X} \right\|}_{2}^{2} $:

$$\begin{align} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} + \frac{\lambda}{2} {\left\| \boldsymbol{X} \right\|}_{2}^{2} \\ \end{align}$$

The gradient is given by:

$$ f \left( \boldsymbol{X} \right) = \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} + \frac{\lambda}{2} {\left\| \boldsymbol{X} \right\|}_{2}^{2} \implies {\nabla}_{\boldsymbol{X}} f \left( \boldsymbol{X} \right) = \boldsymbol{R}^{T} \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} \boldsymbol{A}^{T} - \boldsymbol{R}^{T} \boldsymbol{Y} \boldsymbol{A}^{T} + \lambda \boldsymbol{X} $$

In order to accelerate the gradient descent we'll use some tricks:

  1. Pre calculation of the constants: $ \boldsymbol{R}^{T} \boldsymbol{Y} \boldsymbol{A}^{T}, \; \boldsymbol{A} \boldsymbol{A}^{T} $.
  2. Using acceleration method (I'd use Nesterov as implemented in the FISTA algorithm).
  3. Taking advantage of the triangular structure of $ \boldsymbol{R} $ in computation of $ \boldsymbol{R}^{T} \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} \boldsymbol{A}^{T} $. We can do that either by using convolution or just using the TRMM BLAS function.
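A rough NumPy sketch of such an accelerated projected gradient scheme for this objective (the step size eta, lam and the iteration count are placeholders, and the triangular-structure trick from item 3 is not exploited here):

import numpy as np

def fista_nonneg(R, A, Y, lam, eta, n_iter=500):
    # min_X 0.5*||R X A - Y||_F^2 + 0.5*lam*||X||_F^2  subject to  X >= 0.
    # eta should be at most 1 / L with L ~ sigma_max(R)^2 * sigma_max(A)^2 + lam.
    RtY_At = R.T @ Y @ A.T                 # pre-computed constant
    AAt = A @ A.T                          # pre-computed constant
    X = np.zeros((R.shape[1], A.shape[0]))
    Z = X.copy()
    for k in range(1, n_iter + 1):
        grad = R.T @ (R @ Z) @ AAt - RtY_At + lam * Z
        X_new = np.maximum(Z - eta * grad, 0.0)          # gradient step + projection
        Z = X_new + ((k - 1) / (k + 2)) * (X_new - X)    # Nesterov (FISTA) momentum
        X = X_new
    return X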

Using Julia I could solve the problem in ~0.35 [Sec] (It takes ~0.5 [Sec] if we calculate the objective each iteration) with the result being:

(result plots)

With smaller regularization we can see multiple peaks:

(result plot with smaller regularization)

Remark on Landweber Iteration: This method is simply gradient descent where the step size is set to the highest value which guarantees going downhill. See The Biggest Step Size with Guaranteed Convergence for Constant Step Size Gradient Descent of a Convex Function with Lipschitz Continuous Gradient. Since it sits right on that edge, it is not numerically stable enough, hence we usually apply some early termination.
This method predates accelerated gradient descent methods. With accelerated gradient descent you get both a numerically stable process and very fast convergence, which should be much faster than what you see with Landweber iteration.

Royi

We're after:

$$\begin{align} \arg \min_{\boldsymbol{X}} \quad & \frac{1}{2} {\left\| \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} - \boldsymbol{Y} \right\|}_{F}^{2} \\ \end{align}$$

Where we want to utilize the facts:

  • The matrix $ \boldsymbol{A} $ is symmetric.
  • The matrix $ \boldsymbol{R} $ is lower triangular.

We want to rewrite it in the form of a matrix-vector least squares problem.
In order to do so we use an identity of the Kronecker product.

The identity says that $ \operatorname{Vec} \left( \boldsymbol{R} \boldsymbol{X} \boldsymbol{A} \right) = \left( \boldsymbol{A}^{T} \otimes \boldsymbol{R} \right) \operatorname{Vec} \left( \boldsymbol{X} \right) $ where $ \otimes $ is the Kronecker Product and $ \operatorname{Vec} \left( \cdot \right) $ is the Vectorization Operator.

If we set $ \boldsymbol{x} = \operatorname{Vec} \left( \boldsymbol{X} \right), \; \boldsymbol{y} = \operatorname{Vec} \left( \boldsymbol{Y} \right) $ and $ \boldsymbol{W} = \boldsymbol{A}^{T} \otimes \boldsymbol{R} $ then our problem can be rewritten as:

$$\begin{align} \arg \min_{\boldsymbol{x}} \quad & \frac{1}{2} {\left\| \boldsymbol{W} \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} \\ \end{align}$$
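A quick numerical sanity check of this identity (a sketch with small random matrices):

import numpy as np

m, n = 6, 4
R = np.tril(np.random.rand(m, m))        # lower triangular, like the response matrix
A = np.random.rand(n, n)
A = (A + A.T) / 2                        # symmetric, like the crosstalk matrix
X = np.random.rand(m, n)

lhs = (R @ X @ A).flatten(order='F')     # vec(R X A), column-major stacking
rhs = np.kron(A.T, R) @ X.flatten(order='F')  # (A^T kron R) vec(X)
assert np.allclose(lhs, rhs)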

This is a regular Linear Least Squares problem where you can use many acceleration methods to solve (Momentum, Nesterov).
Moreover, you can easily add some regularization to mitigate noise.

Numerically it will be much more stable since we don't use the explicit inverse of $ \boldsymbol{A} $. It will probably also be faster to compute.

Can we do better? Yes!

We didn't take advantage of the properties of $ \boldsymbol{A} $ and $ \boldsymbol{R} $.
Since the objective is a Frobenius norm, the ordering of the elements does not matter, so we can vectorize by rows instead of columns.

Let's define: $ \boldsymbol{u} = \operatorname{Vec} \left( \boldsymbol{X}^{T} \right), \; \boldsymbol{v} = \operatorname{Vec} \left( \boldsymbol{Y}^{T} \right) $ and $ \boldsymbol{C} = \boldsymbol{R} \otimes \boldsymbol{A} $ (applying the same identity to $ \boldsymbol{Y}^{T} = \boldsymbol{A} \boldsymbol{X}^{T} \boldsymbol{R}^{T} $, using the symmetry of $ \boldsymbol{A} $); then the problem becomes:

$$\begin{align} \arg \min_{\boldsymbol{u}} \quad & \frac{1}{2} {\left\| \boldsymbol{C} \boldsymbol{u} - \boldsymbol{v} \right\|}_{2}^{2} \\ \end{align}$$

It is still a Linear Least Squares problem to solve.
Yet now $ \boldsymbol{C} $ is a block triangular matrix (lower, since $ \boldsymbol{R} $ is lower triangular) where each block is symmetric.

Solving a linear system with a block triangular matrix is much faster than with an arbitrary matrix. See the toy example at Solution of Block Triangular Systems.

So now we have a well-structured problem to solve.

This is the result I get with 5 lines in MATLAB:

(MATLAB result plot)

Requiring non-negativity will require non-direct (iterative) methods.
In that case it might be that the other variation will be faster.

Update

According to the A.csv file, the matrix $ \boldsymbol{A} $ is a banded matrix. It means that the first formulation yields a banded block matrix, which can also be optimized for a fast solution.

Royi
  • Thanks for taking the time to look into this. This is very interesting. I see 2 potential issues with this approach:
    1. The typical size of my matrix R is about 400 x 400, and A is 30 x 30. The Kronecker product will then be roughly a 12000 x 12000 dense matrix. Even though solving a single system might be faster than the Landweber iterations, I wouldn't be able to solve as many systems in parallel as I can with the Landweber method. Hence, in the end, this could end up being slower overall.
    2. (to be continued).
    – Daniel Duque Jul 04 '23 at 07:12
  • (continued) With the Landweber method I can trivially add a non-negative constraint on the values of X. This helps to mitigate noise a little bit (at least the results look cleaner to me). With your method I might get slightly higher noise peaks. I would have to play a bit with this method; it could be that my issues are not a problem after all. – Daniel Duque Jul 04 '23 at 07:14
  • If you have time, do you think that you could test your method with the sample data I have posted? It is OK if you don't :) – Daniel Duque Jul 04 '23 at 07:18
  • @DanielDuque, Since half of the matrix is zeros and the other half is basically symmetric you could, memory-wise, use a quarter of the data. But even if you don't, that's only ~1.15 GB of data, nothing for modern computers. – Royi Jul 04 '23 at 09:21
  • @DanielDuque, If you use a specialized block triangular solver it is even easier, as you don't really need to construct the matrix. – Royi Jul 04 '23 at 09:22
  • @DanielDuque, There is no problem adding regularization on the result, either non-negativity or Tikhonov / LASSO (for sparsity) regularization. – Royi Jul 04 '23 at 09:23
  • @DanielDuque, If you know each wire is limited to k peaks, we can build a low-rank model, which will probably work best (robustness-wise). – Royi Jul 06 '23 at 17:30