Conversation

@daxtens daxtens commented Jul 22, 2025

I had hoped to be able to optimise multithreaded behaviour. Unfortunately, I found it almost impossible to make heads or tails of what was going on. This PR is an attempt to unpick a couple of the complicated parts, replacing them with simpler and more documented implementations. As a happy side-effect, it's also a bit faster.

There are a set of 'frame contexts' for frames that we are decoding, and each frame context holds a dynamic set of tasks required to decode the frame. Some of these tasks can be executed in parallel. The tasks have dependencies, so not every task being tracked for a frame context is necessarily executable at any given point in time: tasks, and indeed entire frame contexts, can be 'blocked'.

Previously, these tasks were stored in an array in order of insertion. To execute them in priority order, each task stored the index of the next-highest-priority task, and the structure holding the tasks kept the index of the highest 'live' task. Remove this linked-list implementation; instead, implement `PartialOrd` and `Ord` for `Rav1dTask`, and use `Vec`'s search and sort methods to keep the tasks in order. Fascinatingly, despite seeming like more work (especially in terms of copying tasks around), this seems to go faster. (Possibly because we called `index` a lot? `index` takes a read lock on the vector each time.)
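As an illustrative sketch (not the real rav1d code: `Task`, `TaskType` and the variant names here are simplified stand-ins for `Rav1dTask` and its fields, borrowed from the review discussion), the sorted-`Vec` approach looks roughly like this:

```rust
use std::cmp::Ordering;

// Derived Ord on the enum follows declaration order, so TileEntropy
// (highest priority here) sorts first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum TaskType {
    TileEntropy,
    TileReconstruction,
    DeblockCols,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Task {
    type_0: TaskType,
    sby: u32,
    tile_idx: u32,
}

impl Ord for Task {
    fn cmp(&self, other: &Self) -> Ordering {
        // Lexicographic tuple comparison: `Less` means higher priority.
        (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
    }
}

impl PartialOrd for Task {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Keep the Vec sorted on insertion; duplicates are rejected, matching the
// PR's "no duplicate tasks" precondition.
fn insert_sorted(tasks: &mut Vec<Task>, task: Task) {
    match tasks.binary_search(&task) {
        Ok(_) => panic!("Tried to add an identical task!"),
        Err(idx) => tasks.insert(idx, task),
    }
}

fn main() {
    let mut tasks = Vec::new();
    insert_sorted(&mut tasks, Task { type_0: TaskType::DeblockCols, sby: 0, tile_idx: 0 });
    insert_sorted(&mut tasks, Task { type_0: TaskType::TileEntropy, sby: 1, tile_idx: 0 });
    insert_sorted(&mut tasks, Task { type_0: TaskType::TileEntropy, sby: 0, tile_idx: 2 });
    // The highest-priority (smallest) task is always at the front.
    assert_eq!(tasks[0], Task { type_0: TaskType::TileEntropy, sby: 0, tile_idx: 2 });
}
```

With this layout, "find the next executable task" is a linear scan from the front, and insertion is a binary search plus a `Vec::insert`.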

As checking to see if a task is blocked is not free, we also keep track of completely blocked frame contexts and the most recent frame context to have had progress. Because a frame can depend on information from a previous frame, if a frame context becomes unblocked, that and all subsequent frame contexts should be checked for executable tasks too. This process was tracked by the first, cur, and reset_cur_index variables and mostly managed by the reset_task_cur and reset_task_cur_async functions. These functions were complicated and their documentation was not particularly easy to follow. Replace them with simpler ones. The original ones also had a number of attempted optimisations, not all of which seem to be worth carrying over.

Finally, we make various passing optimisations:

  • it's not necessary to repeatedly mark the same frame context as having new tasks when we're merging pending tasks
  • if we can avoid repeatedly taking a read lock, even when we're using `try_read()`, so much the better.

I am in the process of doing a final set of performance measurements on my latest tweaks, and I will update this when I have final results. Previous testing at 10, 6 and 2 threads on my 6 core AMD Ryzen box with Windows 11 indicates improvements somewhere less than 1% for the 10 bit Chimera file. For the 8 bit file, the change was more in the noise, but at least not slower. A fair amount of variability is also present. I haven't tested single-threaded performance; I assume it will be unchanged.

I would mostly advocate for these changes in terms of their simplicity: it's nice that they're faster but even if they made no performance change at all I think the simplicity is worth the churn.

Other notes:

  • I'm still not 100% clear on how I should be formatting comments so my apologies if they need work.
  • I've structured this to be read commit-by-commit.
  • There is definitely more that could be done here: for example, there are definitely possible optimisations to the way that delayed film grain is handled. But I need to take a break for a while and I want to at least try to land something!

@kkysen kkysen self-requested a review July 22, 2025 14:42
@kkysen kkysen left a comment

Still reviewing ce64d3c and b4e4bb9, but here's some initial feedback. It looks very nice so far.

```rust
/// The task thread lock must be held and there must be more than 1 frame context.
pub(crate) fn advance_first(&self, n_fc: usize) {
    // This read-modify-write doesn't need to be an atomic CAS:
    // modifications are protected by the lock that we're holding.
```
Collaborator:

If we already have a lock for this, why is `SeqCst` needed?

Collaborator:

Especially because the `.load` and `.fetch_add(1)` aren't atomic together.

Collaborator:

Don't need to fix this right now, but I'm wondering.

Contributor Author (daxtens):

It's a good question. At the moment it's a direct move of the existing code. I did have a commit which I didn't include which made it Relaxed, but I didn't include it because it starts to get quite complex to tease out what is and isn't an access which might require a more rigorous ordering.

I'll have another look at including that commit in the series; the other thing I wanted to compare was putting `cur`, `reset_task_cur` and `first` in a lock all of their own so that there was just one single acquire/release pair of atomic operations, and it would simplify a lot of the code.
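A rough sketch of that single-lock alternative, with hypothetical names (`SchedState` is invented here; `advance_first` mirrors the snippet under review). Once all three counters live behind one mutex, the read-modify-write needs no atomics at all, so no cross-field ordering reasoning is required:

```rust
use std::sync::Mutex;

// Hypothetical grouping of the three scheduling counters under one lock.
#[allow(dead_code)]
struct SchedState {
    first: usize,
    cur: usize,
    reset_task_cur: usize,
}

struct TaskThreads {
    n_fc: usize, // number of frame contexts
    sched: Mutex<SchedState>,
}

impl TaskThreads {
    // The read-modify-write needs no CAS or atomic ordering:
    // it happens entirely under the lock.
    fn advance_first(&self) {
        let mut s = self.sched.lock().unwrap();
        s.first = (s.first + 1) % self.n_fc;
    }
}

fn main() {
    let tt = TaskThreads {
        n_fc: 3,
        sched: Mutex::new(SchedState { first: 2, cur: 0, reset_task_cur: 2 }),
    };
    tt.advance_first();
    // Advancing past the last frame context wraps around to 0.
    assert_eq!(tt.sched.lock().unwrap().first, 0);
}
```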

daxtens commented Jul 23, 2025

Thanks, when I get a chance I will apply that feedback.

daxtens commented Jul 24, 2025

Here's some performance figures from hyperfine, Windows 11, AMD Ryzen something or other with 6 cores, 12 hyperthreads. I ran with 4 warmup rounds. I ran the tests multiple times, some had outliers, others were just a bit 'weird', so I've just supplied 1 figure that I feel is at least somewhat representative of the common case. As such, please do take these with a definite grain of salt.

| File | Threads | main | daxtens/tasks | Improvement |
|--------|------------|---------------------|---------------------|-------------|
| 8 bit | 1 thread | 149.6 s ± 0.169 s | 148.5 s ± 0.219 s | 0.7% |
| 8 bit | 2 threads | 85.711 s ± 0.572 s | 85.249 s ± 0.556 s | 0.5% |
| 8 bit | 6 threads | 32.325 s ± 0.101 s | 32.195 s ± 0.140 s | 0.4% |
| 8 bit | 10 threads | 28.225 s ± 0.080 s | 28.123 s ± 0.067 s | 0.4% |
| 10 bit | 1 thread | 196.0 s ± 0.330 s | 194.8 s ± 0.224 s | 0.6% |
| 10 bit | 2 threads | 114.747 s ± 0.584 s | 114.160 s ± 0.516 s | 0.5% |
| 10 bit | 6 threads | 47.555 s ± 0.069 s | 47.123 s ± 0.104 s | 0.9% |
| 10 bit | 10 threads | 43.752 s ± 0.097 s | 43.648 s ± 0.163 s | 0.2% |

As you can see it's a bit noisy and random, but there is a measurable difference across inputs and thread counts.

I found it very odd that the single-thread case also went faster, but I haven't actually looked into how the single-thread case works; if it's the same in terms of finding and running tasks then it does make sense that these improvements would help there too.

daxtens added 6 commits July 26, 2025 16:31
This is both simpler to follow and also seems to be a bit faster.
We keep largely the same algorithm and variable meanings, but we make things a lot simpler
in how they're coded up, and we document a lot more.
When iterating tasks, we used to need to do certain things at the end of the loop.
We no longer need to do those, so we can get rid of a block and reduce indentation
and improve clarity.

Broken out from the other changes to make it clearer that the bulk of the change
is whitespace rather than substantive.
Even though `.try_read().unwrap()` in `.index()` and `.len()` is guaranteed to succeed,
it's not free. Try to do it less.

This requires that we have slightly more involved handling for when we have to insert
a new task as a result of finding a task, but it means that we don't have to take
the read lock every iteration of the loop.
daxtens commented Jul 26, 2025

Looks like your x86 runner needs an updated perf.

```rust
/// - `Less` if `self` is of higher priority than `other`
/// - `Greater` if `self` is of lower priority than `other`
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
```
Collaborator:

Suggested change:

```diff
-/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
+/// This is fairly straightforwardly translated from `insert_tasks` in `task_thread.c`,
```

```rust
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
```
Collaborator:

Can you link these, too? The film grain tasks are just `FgPrep` and `FgApply`, right?

```rust
    return false;
}
let start = tasks.len() as u32;
tasks.extend(pending_tasks.drain(..));
```
Collaborator:

`tasks.append(pending_tasks)` should be more efficient.
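For comparison, a minimal illustration of the suggested `append`, which moves all elements in one shot and leaves the source vector empty, avoiding the per-element work of `extend(drain(..))`:

```rust
fn main() {
    let mut tasks: Vec<u32> = vec![1, 2];
    let mut pending: Vec<u32> = vec![3, 4];

    // `append` takes `&mut Vec` and bulk-moves its contents,
    // equivalent to `tasks.extend(pending.drain(..))` but cheaper.
    tasks.append(&mut pending);

    assert_eq!(tasks, [1, 2, 3, 4]);
    assert!(pending.is_empty());
}
```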

```rust
{
    let mut pending_tasks = self.pending_tasks.lock();
    if pending_tasks.len() == 0 {
        // This is a 'spurious' merge: that is, `merge` is set but the vector is empty.
```
Collaborator:

Is the optimization from skipping the actual merge (`.extend` and `.drain`, or more efficiently, `.append`), or from returning `false`?

Contributor Author (daxtens):

It's not an optimisation; it's to avoid a crash below when we access `tasks[new_start]`, which will panic if we added no tasks.

Collaborator:

Ahh.

```rust
}
// Unlike in the C code, we only do a single `insert_tasks`. There are two reasons:
//
// - in the C code, the `insert_tasks` is what creates the sorted linked list.
```
Collaborator:

nit:

Suggested change:

```diff
-// - in the C code, the `insert_tasks` is what creates the sorted linked list.
+// - In the C code, the `insert_tasks` is what creates the sorted linked list.
```

```rust
// and nothing is changed by trying to repeatedly do the same update.
let frame_idx = tasks[new_start].frame_idx;
self.insert_tasks(c, frame_idx, 0);
tasks.sort();
```
Collaborator:

I'm not sure, but is a stable sort needed here? I would think it isn't, as `fn cmp` doesn't even ever return `Equal`.

Contributor Author (daxtens):

Huh, yeah, we could use an unstable sort. When I get a chance I will swap that in and benchmark it.
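For illustration: when the ordering is total and never returns `Equal` for distinct elements, `sort_unstable` necessarily produces the same result as the stable `sort`, while avoiding the stable sort's temporary allocation:

```rust
fn main() {
    // Stand-in for tasks: tuples compare lexicographically, and no two
    // distinct tuples compare Equal, so stability cannot matter.
    let mut v = vec![(2, 1), (1, 2), (1, 1)];
    v.sort_unstable();
    assert_eq!(v, [(1, 1), (1, 2), (2, 1)]);
}
```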

```rust
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
/// and that there are no duplicate tasks.
fn cmp(&self, other: &Self) -> cmp::Ordering {
```
Collaborator:

You can do this in a separate PR (if I'm right), but I think this could be greatly simplified to this:

```rust
fn cmp(&self, other: &Self) -> cmp::Ordering {
    (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
}
```

Where `TaskType` is also changed to delete the manual values (they don't seem to be used) and `TileEntropy` is moved to be first.

I haven't thought this through super closely to make sure I'm not messing anything up, but it would greatly simplify this and should make it easier to understand.

Can also be written as

```rust
impl Rav1dTask {
    fn cmp_fields(&self) -> impl Ord {
        (self.type_0, self.sby, self.tile_idx)
    }
}

impl Ord for Rav1dTask {
    fn cmp(&self, other: &Self) -> cmp::Ordering {
        self.cmp_fields().cmp(&other.cmp_fields())
    }
}
```

Collaborator:

I guess we do do `t.type_0 > TaskType::InitCdf`, which this order would change. But we can probably get around that and still have the much simpler `fn cmp`.

Collaborator:

Wait, if we're already assuming that there aren't any `Init` or `InitCdf`s, then `TileEntropy` is the first anyways. So no ordering changes needed.
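A small illustration of the point about derived enum ordering (variant names are assumed from this thread; the real `TaskType` has more variants): without manual discriminants, `#[derive(Ord)]` follows declaration order, so any desired priority is expressed purely by where a variant is declared.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum TaskType {
    // Derived Ord compares by declaration order: Init < InitCdf < TileEntropy < ...
    Init,
    InitCdf,
    TileEntropy,
    TileReconstruction,
}

fn main() {
    // Among the task types actually inserted (no Init/InitCdf),
    // TileEntropy already sorts first.
    assert!(TaskType::TileEntropy < TaskType::TileReconstruction);
    // The `t.type_0 > TaskType::InitCdf` comparison still works too.
    assert!(TaskType::TileEntropy > TaskType::InitCdf);
}
```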

Comment on lines +59 to +62
```rust
let idx = match tasks.binary_search(&task) {
    Ok(_) => panic!("Tried to add an identical task!"),
    Err(idx) => idx,
};
```
Collaborator:

The code didn't use to do this, right? Maybe just make it a `debug_assertions`-only check?
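One way that debug-only check might look, simplified here to `u32` tasks (the binary search is still needed to find the insertion point either way; only the duplicate panic becomes debug-only):

```rust
fn insert_sorted(tasks: &mut Vec<u32>, task: u32) {
    let idx = match tasks.binary_search(&task) {
        Ok(idx) => {
            // Duplicates violate the scheduler's invariant;
            // panic only in debug builds, as suggested.
            debug_assert!(false, "Tried to add an identical task!");
            idx
        }
        Err(idx) => idx,
    };
    tasks.insert(idx, task);
}

fn main() {
    let mut tasks = vec![10u32, 30];
    insert_sorted(&mut tasks, 20);
    assert_eq!(tasks, [10, 20, 30]);
}
```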

daxtens commented Jul 27, 2025

I'm back at work this week so it may be a little while before I am able to look at this more.

I also need to think a bit more about some of my analysis of the atomic ordering around `first` and `reset_task_cur`; I think the code may actually be safer than I thought with `SeqCst`; I think some of the stuff around `wrapping_sub` might actually not be needed unless we make some things `Relaxed`. I'll let you know what I manage to work out.

daxtens commented Jul 27, 2025

Huh, yeah, I think the code actually cannot have `first` and `reset_task_cur` fall out of sync. In other words, it's always the case that `reset_task_cur >= first`, except for a brief moment when `mark_new_task_for_index` and `advance_first` are running concurrently. Because `reset_cur_index` is always run under the same lock as `advance_first`, it's not possible for it to observe any case where `reset_task_cur` is less than `first`.

If we make `first` `Relaxed`, it's suddenly very easy for that to happen on a weakly ordered architecture.

I'll update this PR to remove the `wrapping_sub` from the existing commits; if I do get a `Relaxed` solution I'm happy with, I'll put it in a separate commit that adds the `wrapping_sub` back and adds the requisite explanation. I currently think that, given the way the old code worked, we would be OK with the slight loss of synchronisation that results from using `Relaxed`, but given how tricky this code is I'd want to think more carefully. Using two variables with `SeqCst` like this is not a pattern I'm used to, so it's tricky for me to reason about.

@kkysen kkysen left a comment

> Looks like your x86 runner needs an updated perf.

@thedataking, can you update the kernel headers perf needs?

See the error here: https://github.com/memorysafety/rav1d/actions/runs/16540023760/job/46779890408.

thedataking commented Aug 13, 2025

> Looks like your x86 runner needs an updated perf.
>
> See the error here: https://github.com/memorysafety/rav1d/actions/runs/16540023760/job/46779890408.

As far as I can tell, we have the right packages (linux-tools-generic) installed, but we are hitting the bug described here: https://bugs.launchpad.net/ubuntu/+source/linux-hwe-6.14/+bug/2117147

Update: I removed linux-tools-generic and built perf directly from an upstream kernel. might work now 🤞

@kkysen kkysen left a comment

This improves performance on an 80-core arm machine running with 150 threads by 6.7%, going from 34% on main to 28% here (% is overhead compared to dav1d). So I definitely think this is something we should ultimately merge and further explore.

```
./benchmark.py --threads 150 --commit main..daxtens/tasks
```

all commits in main..daxtens/tasks, recursing into ones > 1.0%:

| commit # | commit hash | threads | diff % | rav1d s | dav1d s |
|----------|-------------|---------|--------|---------|---------|
| 0 | d4807cae | 150 | 34.0% | 0.915 s | 0.683 s |
| 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s |
| 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s |
| 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s |
| 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s |
| 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s |
| 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s |

sorted adjacent diffs of diffs:

| diff of diff % | commit # | commit hash | threads | diff % | rav1d s | dav1d s |
|----------------|----------|-------------|---------|--------|---------|---------|
| -1.6% | 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s |
| -1.5% | 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s |
| -1.5% | 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s |
| -1.1% | 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s |
| -0.5% | 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s |
| +0.0% | 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s |

daxtens commented Sep 30, 2025 via email
