How to batch an IAsyncEnumerable, enforcing a maximum interval policy between consecutive batches?

Question

I have an asynchronous sequence (stream) of messages that sometimes arrive in bursts and sometimes sporadically, and I would like to process them in batches of 10 messages per batch. I also want to enforce an upper limit on the latency between receiving a message and processing it, so a batch with fewer than 10 messages should also be processed if 5 seconds have passed after receiving the first message of the batch. I found that I can solve the first part of the problem by using the Buffer operator from the System.Interactive.Async package:

IAsyncEnumerable<Message> source = GetStreamOfMessages();
IAsyncEnumerable<IList<Message>> batches = source.Buffer(10);
await foreach (IList<Message> batch in batches)
{
    // Process batch
}

The signature of the Buffer operator:

public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
    this IAsyncEnumerable<TSource> source, int count);

Unfortunately the Buffer operator has no overload with a TimeSpan parameter, so I can't solve the second part of the problem so easily. I'll have to somehow implement a batching operator with a timer myself. My question is: how can I implement a variant of the Buffer operator that has the signature below?

public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
    this IAsyncEnumerable<TSource> source, TimeSpan timeSpan, int count);

The timeSpan parameter should affect the behavior of the Buffer operator like so:

A batch must be emitted when the timeSpan has elapsed after emitting the previous batch (or initially after the invocation of the Buffer method).
An empty batch must be emitted if the timeSpan has elapsed after emitting the previous batch, and no messages have been received during this time.
Emitting batches more frequently than every timeSpan implies that the batches are full. Emitting a batch with fewer than count messages before the timeSpan has elapsed is not desirable.

I am OK with adding external dependencies to my project if needed, like the System.Interactive.Async or the System.Linq.Async packages.

P.S. this question was inspired by a recent question related to channels and memory leaks.

The library that handles time is `System.Reactive`. The `Buffer` method with a `TimeSpan` parameter can be found in `System.Reactive`, not `System.Interactive`. — Panagiotis Kanavos, May 24 '21 at 09:39
Besides, [AsyncRX.NET](https://github.com/dotnet/reactive/tree/main/AsyncRx.NET), which provides Reactive operators over async streams, already has a [Buffer](https://github.com/dotnet/reactive/blob/main/AsyncRx.NET/System.Reactive.Async.Linq/System/Reactive/Linq/Operators/Buffer.cs) operator. Combining Reactive and async streams isn't trivial though, which is why it's still in preview — Panagiotis Kanavos, May 24 '21 at 09:49
@Panagiotis this question is about asynchronous sequences, not observable sequences. If you think that the functionality available in the [System.Reactive](https://www.nuget.org/packages/System.Reactive/) package can solve this problem, feel free to post it as an answer. — Theodor Zoulias, May 24 '21 at 10:08
And the library for this is AsyncRx.NET, not System.Reactive. I pointed a link to the very source that provides timespan and count buffering over IAsyncEnumerable. — Panagiotis Kanavos, May 24 '21 at 10:13
@Panagiotis the [link](https://github.com/dotnet/reactive/blob/main/AsyncRx.NET/System.Reactive.Async.Linq/System/Reactive/Linq/Operators/Buffer.cs) you provided contains a `Buffer` operator for `IAsyncObservable`s, not `IAsyncEnumerable`s. If you think that the [non released](https://github.com/dotnet/reactive/issues/1118) AsyncRx.NET library has a solution for the problem presented in this question, feel free to post it as an answer. If it's a good answer I will upvote it, and I'll even accept it when the package is released (assuming that it will be released eventually). — Theodor Zoulias, May 24 '21 at 10:36

RickyTad · Answer 1 · 2021-11-04T08:02:50.963

What about using a Channel to achieve the required functionality? Is there any flaw in using something like this extension method to read from a queue until a timeout has expired?

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static async Task<List<T>> ReadWithTimeoutAsync<T>(this ChannelReader<T> reader, TimeSpan readTOut, CancellationToken cancellationToken)
{
    // Cancels the read loop once the timeout has elapsed.
    // The using declaration disposes the source even if an exception escapes.
    using var timeoutTokenSrc = new CancellationTokenSource();
    timeoutTokenSrc.CancelAfter(readTOut);

    var messages = new List<T>();

    using (CancellationTokenSource linkedCts =
        CancellationTokenSource.CreateLinkedTokenSource(timeoutTokenSrc.Token, cancellationToken))
    {
        try
        {
            await foreach (var item in reader.ReadAllAsync(linkedCts.Token))
            {
                messages.Add(item);
                // Check after adding the item, so that a dequeued item is never dropped.
                linkedCts.Token.ThrowIfCancellationRequested();
            }

            Console.WriteLine("All messages read.");
        }
        catch (OperationCanceledException)
        {
            if (timeoutTokenSrc.Token.IsCancellationRequested)
            {
                // TotalMilliseconds reports the whole timeout, not just its milliseconds component.
                Console.WriteLine($"Delay ({readTOut.TotalMilliseconds} msec) for reading items from message channel has expired.");
            }
            else if (cancellationToken.IsCancellationRequested)
            {
                Console.WriteLine("Cancelling per user request.");
                cancellationToken.ThrowIfCancellationRequested();
            }
        }
    }

    return messages;
}

To combine the timeout with the max. batch size, one more token source could be added:

public static async Task<List<T>> ReadBatchWithTimeoutAsync<T>(this ChannelReader<T> reader, int maxBatchSize, TimeSpan readTOut, CancellationToken cancellationToken)
{
    var timeoutTokenSrc = new CancellationTokenSource();
    timeoutTokenSrc.CancelAfter(readTOut);
    var maxSizeTokenSrc = new CancellationTokenSource();

    var messages = new List<T>();

    using (CancellationTokenSource linkedCts =
        CancellationTokenSource.CreateLinkedTokenSource(timeoutTokenSrc.Token, maxSizeTokenSrc.Token, cancellationToken))
    {
        try
        {
            await foreach (var item in reader.ReadAllAsync(linkedCts.Token))
            {
                messages.Add(item);
                if (messages.Count >= maxBatchSize)
                {
                    maxSizeTokenSrc.Cancel();
                }
                linkedCts.Token.ThrowIfCancellationRequested();
            }....

edited Nov 04 '21 at 08:02

answered Nov 03 '21 at 10:56

RickyTad


Thanks Ricky for the answer, but I don't think that it addresses the question that I've asked. The question is about `IAsyncEnumerable`s, not about `ChannelReader`s. Even if converters between these two containers were readily available, the return value of your `ReadWithTimeoutAsync` implementation (`Task<List<T>>`) is not an enumerable type. I guess that someone could call the `ReadWithTimeoutAsync` in a loop, until some terminating condition was met (which condition exactly?), but this is too far from what the question asks. Btw your implementation lacks an `int count` parameter. – Theodor Zoulias Nov 03 '21 at 13:06
The use-case would be a multi-producer single-consumer scenario. The consumer loops and cyclically reads from the queue, but not until the queue becomes empty (that will not happen as the producers are writing more or less continuously to the queue). The consumer cyclically writes the records read from the queue to some database. The focus of my answer is how to achieve the enforcement of the maximum interval policy when reading some continuously coming asynchronous data. Probably you were thinking of some finite data stream, while in my case the data stream is "infinitely" filled with new data – RickyTad Nov 03 '21 at 13:50
Regarding the details of your implementation, the `timeoutTokenSrc` is probably redundant. `linkedCts.CancelAfter(readTOut)` should do the job just as well. Also you may want to check out [this](https://stackoverflow.com/questions/67569758/channelreader-readallasynccancellationtoken-not-actually-cancelled-mid-iterati "ChannelReader.ReadAllAsync(CancellationToken) not actually cancelled mid-iteration") question. The `ReadAllAsync` is implemented in a way that may catch you by surprise. – Theodor Zoulias Nov 03 '21 at 15:08
Thanks for the clarification. I also noticed that there is some latency in breaking the ReadAllAsync iteration after cancelling the token source. In my case it is not that important, as the timeout is approximate. I just want to periodically save the messages coming through the Channel into a database. I am practically building the batches in the loop task where the new data is extracted from the Channel. – RickyTad Nov 03 '21 at 21:56
If you want to reduce the latency, you can add this line inside the `await foreach` loop, **after** adding the `item` in the `messages` list: `linkedCts.Token.ThrowIfCancellationRequested();` – Theodor Zoulias Nov 03 '21 at 22:05
Thanks, that line is helping indeed. By the way, I edited my answer with the possibility to combine the timeout with the max. batch size. – RickyTad Nov 04 '21 at 07:59
I suppose that by forcing the immediate cancellation it is still guaranteed that all the items from the Channel are read (in the next cycles) and no information is lost. – RickyTad Nov 04 '21 at 08:31
You need to be careful where you put the `ThrowIfCancellationRequested`. If you put it in the wrong place, you may lose messages. Honestly I don't think that you need so many `CancellationTokenSource`s. Not only do they make the code more complex, but they also make it less efficient. Controlling the flow by throwing exceptions is expensive. Exceptions should be thrown, in general, only when something exceptional happens. – Theodor Zoulias Nov 04 '21 at 22:25
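
The suggestions in the comments above (a single token source with `CancelAfter`, and no exception for the size limit) could be sketched roughly as follows. This is an illustrative sketch, not code from the answer or the comments: the `ReadBatchAsync` name, the `ChannelBatchingSketch` class and the demo values are hypothetical. It assumes that reading with `WaitToReadAsync` and `TryRead` is acceptable instead of `ReadAllAsync`, so that an item is only removed from the channel at the moment it is added to the batch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelBatchingSketch
{
    // Hypothetical helper: reads up to maxBatchSize items, or fewer if the
    // timeout elapses first. Only the timeout path relies on an exception.
    public static async Task<List<T>> ReadBatchAsync<T>(this ChannelReader<T> reader,
        int maxBatchSize, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        var batch = new List<T>(maxBatchSize);
        // One linked source serves both the caller's token and the timeout.
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        linkedCts.CancelAfter(timeout);
        try
        {
            while (batch.Count < maxBatchSize
                && await reader.WaitToReadAsync(linkedCts.Token).ConfigureAwait(false))
            {
                // TryRead is synchronous, so every dequeued item lands in the batch.
                while (batch.Count < maxBatchSize && reader.TryRead(out T item))
                    batch.Add(item);
            }
        }
        catch (OperationCanceledException)
            when (batch.Count > 0 || !cancellationToken.IsCancellationRequested)
        {
            // Timeout, or cancellation with items already dequeued:
            // return what has been collected so that nothing is dropped.
        }
        return batch;
    }

    public static async Task Main()
    {
        var channel = Channel.CreateUnbounded<int>();
        for (int i = 1; i <= 3; i++) channel.Writer.TryWrite(i);

        // Fewer than 10 items are available, so the call returns after the timeout.
        List<int> batch = await channel.Reader
            .ReadBatchAsync(10, TimeSpan.FromMilliseconds(200));
        Console.WriteLine($"Batch size: {batch.Count}");
    }
}
```

With this shape a full batch is returned without any exception being thrown, and the timeout costs at most one `OperationCanceledException` per batch.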

Theodor Zoulias · Accepted Answer · 2022-05-21T07:41:41.757

Here are two approaches to this problem. The first one is flawed, but I am posting it anyway due to its extreme simplicity. A Buffer operator with a TimeSpan parameter already exists in the System.Reactive package, and converters between asynchronous and observable sequences exist in the System.Linq.Async package. So it's just a matter of chaining together three already available operators:

public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
    this IAsyncEnumerable<TSource> source, TimeSpan timeSpan, int count)
{
    return source.ToObservable().Buffer(timeSpan, count).ToAsyncEnumerable();
}

Unfortunately this neat approach is flawed, because of the side-effects of shifting from the pull model to the push model and back to the pull model. The intermediate observable sequence, when subscribed, starts pulling the source IAsyncEnumerable aggressively, regardless of how the resulting IAsyncEnumerable is pulled. So instead of the consumer of the resulting sequence driving the enumeration, the enumeration happens silently in the background at the maximum speed allowed by the source sequence, and the produced messages are buffered in an internal queue. As a result, not only can hidden latency be imposed on the processing of the messages, but the memory consumption can also skyrocket out of control.
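
One way to observe this effect is to count how many items the source has produced by the time a slow consumer finishes with its first batch. The sketch below is only an illustration under stated assumptions: the `GetNumbers` source, the `EagerPullDemo` class, the counter and the delays are hypothetical, and the System.Reactive and System.Linq.Async packages are assumed to be referenced.

```csharp
// Assumes the System.Reactive and System.Linq.Async NuGet packages.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class EagerPullDemo
{
    private static int _produced; // number of items the source has yielded so far

    // Hypothetical source: 1,000 items, produced as fast as they are requested.
    private static async IAsyncEnumerable<int> GetNumbers()
    {
        for (int i = 1; i <= 1000; i++)
        {
            await Task.Yield();
            Interlocked.Increment(ref _produced);
            yield return i;
        }
    }

    public static async Task Main()
    {
        // The same three-operator chain as in the Rx-based Buffer above.
        IAsyncEnumerable<IList<int>> batches = GetNumbers()
            .ToObservable()
            .Buffer(TimeSpan.FromSeconds(5), 10)
            .ToAsyncEnumerable();

        await foreach (IList<int> batch in batches)
        {
            await Task.Delay(1000); // simulate slow processing of the first batch
            Console.WriteLine($"Batch size: {batch.Count}, " +
                $"items already pulled from the source: {Volatile.Read(ref _produced)}");
            break; // one batch is enough to observe the effect
        }
    }
}
```

If the source is pulled eagerly, as described above, the reported number of pulled items should be far larger than the size of the single batch that the consumer has processed.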

The second is a hands-on approach that uses the Task.Delay method as a timer, and the Task.WhenAny method for coordinating the timer and enumeration tasks. The behavior of this approach is similar to the Rx-based approach, except that the enumeration of the source sequence is driven by the consumer of the resulting sequence, as one would expect.

/// <summary>
/// Splits the elements of a sequence into chunks that are sent out when either
/// they're full, or a given amount of time has elapsed.
/// </summary>
public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
    this IAsyncEnumerable<TSource> source, TimeSpan timeSpan, int count)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (timeSpan < TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(timeSpan));
    if (count < 1) throw new ArgumentOutOfRangeException(nameof(count));
    return Implementation();

    async IAsyncEnumerable<IList<TSource>> Implementation(
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var timerCts = new CancellationTokenSource();
        var delayTask = Task.Delay(timeSpan, timerCts.Token);
        (ValueTask<bool> ValueTask, Task<bool> Task) moveNext = default;
        using var linkedCts = CancellationTokenSource
            .CreateLinkedTokenSource(cancellationToken);
        var enumerator = source.GetAsyncEnumerator(linkedCts.Token);
        try
        {
            moveNext = (enumerator.MoveNextAsync(), null);
            var buffer = new List<TSource>(count);
            ExceptionDispatchInfo error = null;
            while (true)
            {
                Task completedTask = null;
                if (!moveNext.ValueTask.IsCompleted)
                {
                    // Preserve the ValueTask, if it's not preserved already.
                    if (moveNext.Task == null)
                    {
                        var preserved = moveNext.ValueTask.AsTask();
                        moveNext = (new ValueTask<bool>(preserved), preserved);
                    }
                    completedTask = await Task.WhenAny(moveNext.Task, delayTask)
                        .ConfigureAwait(false);
                }
                if (completedTask == delayTask)
                {
                    Debug.Assert(delayTask.IsCompleted);
                    yield return buffer.ToArray(); // It's OK if the buffer is empty.
                    buffer.Clear();
                    delayTask = Task.Delay(timeSpan, timerCts.Token);
                }
                else
                {
                    Debug.Assert(moveNext.ValueTask.IsCompleted);
                    // Await a copy, to prevent a second await on finally.
                    var moveNextCopy = moveNext.ValueTask;
                    moveNext = default;
                    bool moved;
                    try { moved = await moveNextCopy.ConfigureAwait(false); }
                    catch (Exception ex)
                    {
                        error = ExceptionDispatchInfo.Capture(ex); break;
                    }
                    if (!moved) break;
                    buffer.Add(enumerator.Current);
                    if (buffer.Count == count)
                    {
                        timerCts.Cancel(); timerCts.Dispose();
                        timerCts = new CancellationTokenSource();
                        yield return buffer.ToArray();
                        buffer.Clear();
                        delayTask = Task.Delay(timeSpan, timerCts.Token);
                    }
                    try { moveNext = (enumerator.MoveNextAsync(), null); }
                    catch (Exception ex)
                    {
                        error = ExceptionDispatchInfo.Capture(ex); break;
                    }
                }
            }
            if (buffer.Count > 0) yield return buffer.ToArray();
            error?.Throw();
        }
        finally
        {
            // The finally runs when an enumerator created by this method is disposed.
            timerCts.Cancel(); timerCts.Dispose();
            // Prevent fire-and-forget, otherwise the DisposeAsync() might throw.
            // Cancel the async-enumerator, for more responsive completion.
            // Swallow MoveNextAsync errors, but propagate DisposeAsync errors.
            linkedCts.Cancel();
            try { await moveNext.ValueTask.ConfigureAwait(false); } catch { }
            await enumerator.DisposeAsync().ConfigureAwait(false);
        }
    }
}

Care has been taken to avoid leaking fire-and-forget MoveNextAsync operations or timers.

Allocation of Task wrappers happens only when a MoveNextAsync call returns a non-completed ValueTask<bool>.
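
The idea behind this optimization can be shown in isolation. The following is a minimal sketch, not the implementation above: the `GetNumbers` source and the `ValueTaskPreservationDemo` class are hypothetical, and the timer branch is reduced to a comment. It converts the `ValueTask<bool>` to a `Task<bool>` only when the `MoveNextAsync` call has not completed synchronously, because `Task.WhenAny` accepts only `Task` arguments and a `ValueTask` should be consumed only once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ValueTaskPreservationDemo
{
    // Hypothetical source: the first two items are available without waiting,
    // the third one only after an asynchronous delay.
    private static async IAsyncEnumerable<int> GetNumbers()
    {
        yield return 1;
        yield return 2;
        await Task.Delay(100);
        yield return 3;
    }

    public static async Task Main()
    {
        using var timerCts = new CancellationTokenSource();
        Task timer = Task.Delay(TimeSpan.FromSeconds(5), timerCts.Token);

        await using IAsyncEnumerator<int> enumerator = GetNumbers().GetAsyncEnumerator();
        while (true)
        {
            ValueTask<bool> moveNext = enumerator.MoveNextAsync();
            bool moved;
            if (moveNext.IsCompleted)
            {
                // Fast path: the result is already available, no Task wrapper is needed.
                moved = await moveNext;
                Console.WriteLine("Completed synchronously, no AsTask() call.");
            }
            else
            {
                // Slow path: preserve the operation as a Task, so that it can be
                // passed to Task.WhenAny and awaited again afterwards.
                Task<bool> preserved = moveNext.AsTask();
                await Task.WhenAny(preserved, timer);
                // A real operator would emit a batch here if the timer had won.
                moved = await preserved;
                Console.WriteLine("Completed asynchronously, AsTask() was called.");
            }
            if (!moved) break;
            Console.WriteLine($"Item: {enumerator.Current}");
        }
        timerCts.Cancel(); // release the timer instead of leaving it running
    }
}
```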

This implementation is non-destructive, meaning that no elements that have been consumed from the source sequence can be lost. In case the source sequence fails or the enumeration is canceled, any buffered elements will be emitted before the propagation of the error.

This [leaks timers](https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/AsyncGuidance.md#using-a-timeout), a limited resource. Unlike the supposed leaks you mentioned, *this* is an actual leak. The time behavior is untestable, which makes this code *very* hard to use in any except the simplest cases. Event stream processing is complex, so the ability to test is paramount. As for the supposed problems with Rx or AsyncRx - events come whether you want them or not, AsyncRx is the library you should check, not Rx. — Panagiotis Kanavos, May 25 '21 at 06:16
And then there are the leaked Tasks created by `AsTask()`. That's a real leak, not something caused by misuse. Event and async streams are typically long-running so those tasks add up and put pressure on the GC. That's why the authors of System.Linq.Async, AsyncRx.NET, Rx.NET and even the BCL itself go to great lengths to mitigate this — Panagiotis Kanavos, May 25 '21 at 06:19
@Panagiotis as for the tasks created by `AsTask()`, why do you think that they are leaked? Every single one of them is awaited. The only case that *one* task can be leaked is if the consumer of the resulting sequence abandons the enumeration without awaiting the last `MoveNextAsync` operation. Regarding the suitability of the AsyncRX.NET for solving this problem, [I've already prompted you](https://stackoverflow.com/questions/67661709/67676374#comment119610581_67661709) to post the solution as an answer. I don't know how to do it myself, since the library is not released, so please do. — Theodor Zoulias, May 25 '21 at 07:10
You haven't asked a question. You're using SO to post an article. At best, you're asking for a review of the "answer". I didn't say tasks are leaked, they're *created* and need to be GCd though. .NET Core has gone to great lengths to avoid this. As for timers - there's a reason David Fowler warns against using `Task.Delay` without cancellation. Again, async streams are long-lived and those orphaned items add up — Panagiotis Kanavos, May 25 '21 at 07:22
