Using SIMD + Caching To Render Audio Waveforms

The Setting

I am creating a modern take on old-school, tracker-style music production software. One big part of my UI is the ability to render waveforms. Right now, the main place it's featured is in the Sampler widget. The sampler lets a user see the waveform, zoom into different parts of it, place slice markers to control which part of the sample plays on each step of the track, and so on: the stock-standard sampler features you'd expect from any digital audio workstation.

I noticed that the frame timing measurements I display when running in dev mode take quite a hit when I have a few samplers on screen, i.e. when rendering multiple waveforms at the same time. The hit isn't so severe on my machine that it impacts usability, but I'm running fairly powerful, modern desktop-class hardware.

At this point I'd like to point out that I don't really care, and don't think the world really cares, about a new take on old-school tracker-style audio production software. There are already many out there, and overtaking them isn't really my main goal.

The tracker is just a sample project, dogfood if you will, for my IMGUI library, which I think the world does actually need and which is my main goal overall. I figured it was a complex enough UI to test my library fairly well. So with that in mind, don't take the current UI layout / theme / graphics too seriously. If this project ever does become serious, then much like a gamedev project, I would work on visual polish once I actually have the full functionality. Consider it programmer art.

I do have BIG plans for the UI library, ultimately it will be a fully cross-platform application framework providing a developer experience on par with the latest and greatest web technologies (hot-reloading of all code, hot-swappable themes, inbuilt widget library, custom widget layer, etc.), whilst maintaining the performance and control one gets with lower level systems level programming.

Basically, I'm sick of using several gigabytes of RAM on my PC just to do a few trivial tasks simultaneously: i.e. running Spotify + Teams (for work :/) + Discord + Notion (jokes, I don't use Notion, garbage software). Things you could literally do, in nearly the same way (listen to music, instant message people / groups, take / organise notes) on a 2009 laptop with 2GB of RAM total. So this article is just to explore how I can optimize some part of an application built with my IMGUI library. Anyway, enough ranting, on with the show.

The Technical Parts

Profiling

Let's begin with some 'profiling', as all sane optimization must. The quotes are there because my profiling in this post just consists of a few well-placed timestamps (via Odin's time.now() function) to see how long certain blocks of code take to execute, rather than busting open a full profiler. It should be good enough for our purposes.

At a high level the entire application is one for loop, something like:

for { // forever
    quit := app.update()
    if quit do app.shutdown()
}

Each iteration of this loop is a frame of our application. Further on in the post, we'll see what kind of work app.update() does.
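As a rough sketch of what that kind of timing looks like, here is a self-contained loop that wraps each frame in a pair of timestamps. The fake_app_update and run_frames procedures are hypothetical stand-ins (the real app.update() does actual work); only the time.now() / time.since() / time.duration_milliseconds() calls matter here.

```odin
package main

import "core:fmt"
import "core:time"

// Hypothetical stand-in for app.update(); asks to quit on the fourth frame.
fake_app_update :: proc(frame: int) -> (quit: bool) {
    return frame >= 3
}

// Runs the frame loop, logging how long each update took.
// Returns the number of frames that were timed.
run_frames :: proc() -> int {
    frame := 0
    for {
        start := time.now()
        quit := fake_app_update(frame)
        elapsed_ms := time.duration_milliseconds(time.since(start))
        fmt.printf("app_update() took %.4f ms\n", elapsed_ms)
        frame += 1
        if quit {
            break
        }
    }
    return frame
}

main :: proc() {
    run_frames()
}
```

The same start / since pair can be dropped around any smaller block of code to narrow down where the time goes.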

So with the application running, a sound loaded, but no sampler / waveform on screen, it looks like this:

GUI running with no sampler windows / waveforms showing

Measuring how long each frame takes to run we get:

app_update() took 1.8596 ms
app_update() took 1.8626 ms
app_update() took 1.7977 ms
app_update() took 1.7627 ms
app_update() took 1.966 ms
app_update() took 1.7651 ms
app_update() took 1.7514 ms
app_update() took 1.7497 ms
app_update() took 1.9662 ms
app_update() took 2.2918 ms
app_update() took 2.0868 ms
app_update() took 1.8927 ms
app_update() took 1.7923 ms
app_update() took 1.8466 ms
app_update() took 1.8076 ms
app_update() took 1.7212 ms
app_update() took 1.7229 ms
app_update() took 1.7475 ms
app_update() took 1.6886 ms
app_update() took 1.9515 ms
app_update() took 1.7933 ms
app_update() took 1.7541 ms
app_update() took 1.6978 ms
app_update() took 1.81 ms
app_update() took 1.8349 ms
app_update() took 1.8232 ms
app_update() took 1.7916 ms
app_update() took 1.8645 ms
app_update() took 1.7999 ms
app_update() took 1.7688 ms
app_update() took 1.7395 ms
app_update() took 1.7387 ms
app_update() took 1.7468 ms
app_update() took 1.7335 ms
app_update() took 1.7385 ms
app_update() took 1.7656 ms
app_update() took 1.9038 ms
app_update() took 1.8268 ms
app_update() took 1.7478 ms
app_update() took 1.7724 ms
app_update() took 1.7393 ms
app_update() took 1.7261 ms
app_update() took 1.7075 ms
app_update() took 1.853 ms
app_update() took 1.9483 ms
app_update() took 1.7311 ms
app_update() took 1.8698 ms
app_update() took 1.7539 ms
app_update() took 1.6935 ms
app_update() took 1.8121 ms
app_update() took 1.7488 ms
app_update() took 1.751 ms
app_update() took 1.765 ms
app_update() took 1.7114 ms
app_update() took 1.7942 ms
app_update() took 1.7739 ms
app_update() took 1.8256 ms
app_update() took 1.8223 ms
app_update() took 1.7552 ms
app_update() took 2.0518 ms
app_update() took 1.7795 ms
app_update() took 1.7366 ms
app_update() took 1.8439 ms
app_update() took 2.1734 ms
app_update() took 1.7762 ms

This is running after we build with ./build/hot-reload-optimized.bat. All timed runs of the program will be built with the exact same script. It builds our source code into a main executable and an app.dll, this is to enable hot-reloading. This will add some runtime overhead, but that overhead will be constant across all runs of the program.

Loading a 15 second audio file: 1.wav into track 0 and opening the sampler of track 0, our UI now looks like:

With our frame time rising to:

app_update() took 2.8078 ms
app_update() took 2.7914 ms
app_update() took 3.126 ms
app_update() took 2.8663 ms
app_update() took 2.8915 ms
app_update() took 3.1378 ms
app_update() took 3.0016 ms
app_update() took 2.9419 ms
app_update() took 2.8684 ms
app_update() took 2.9049 ms
app_update() took 2.92 ms
app_update() took 3.0309 ms
app_update() took 2.9611 ms
app_update() took 2.8183 ms
app_update() took 2.8097 ms
app_update() took 3.0438 ms
app_update() took 2.902 ms
app_update() took 2.9783 ms
app_update() took 2.8506 ms
app_update() took 2.8371 ms
app_update() took 2.8863 ms
app_update() took 2.9158 ms
app_update() took 2.8463 ms
app_update() took 2.8666 ms
app_update() took 3.054 ms
app_update() took 2.9996 ms
app_update() took 2.9392 ms
app_update() took 3.0334 ms
app_update() took 3.0354 ms
app_update() took 2.8865 ms
app_update() took 2.9821 ms
app_update() took 2.923 ms
app_update() took 2.9195 ms
app_update() took 2.8747 ms
app_update() took 2.8983 ms
app_update() took 2.8283 ms
app_update() took 2.8835 ms
app_update() took 2.9716 ms
app_update() took 2.8488 ms
app_update() took 2.8346 ms
app_update() took 2.8844 ms
app_update() took 2.8203 ms
app_update() took 2.8826 ms
app_update() took 3.0732 ms
app_update() took 2.9744 ms
app_update() took 2.9015 ms
app_update() took 2.8576 ms
app_update() took 3.1644 ms
app_update() took 2.9596 ms
app_update() took 2.8523 ms
app_update() took 2.895 ms
app_update() took 2.8072 ms
app_update() took 2.8573 ms
app_update() took 2.9988 ms
app_update() took 2.9571 ms
app_update() took 2.9655 ms

So that's roughly a 1.2ms increase in our frame time.

To diagnose what's causing the slowdown I've added some timestamping code in a few key places:

Around our sampler() function, which is what one may suspect is causing the slow down.
Around our render_ui() call, just to see how much longer it takes to collect and send render data to the GPU.

With the sampler open, but no sound loaded into the track, and therefore no waveform drawn, we get:

render_ui: 0.5346ms    sampler: 0.0055ms    app_update: 1.9527ms
render_ui: 0.5347ms    sampler: 0.0053ms    app_update: 1.8476ms
render_ui: 0.5214ms    sampler: 0.0054ms    app_update: 1.8377ms
render_ui: 0.5368ms    sampler: 0.0054ms    app_update: 2.0202ms
render_ui: 0.545ms     sampler: 0.0057ms    app_update: 1.8735ms
render_ui: 0.6065ms    sampler: 0.0055ms    app_update: 1.9523ms
render_ui: 0.5734ms    sampler: 0.0057ms    app_update: 1.8829ms
render_ui: 0.565ms     sampler: 0.0052ms    app_update: 1.9015ms
render_ui: 0.5755ms    sampler: 0.0049ms    app_update: 1.9953ms
render_ui: 0.5556ms    sampler: 0.0055ms    app_update: 1.8719ms

Now with the sampler window still open, but with a sound loaded into the track and therefore a waveform being drawn, we get:

render_ui: 1.5702ms    sampler: 0.0059ms    app_update: 2.9827ms
render_ui: 1.5948ms    sampler: 0.0054ms    app_update: 3.0221ms
render_ui: 1.5989ms    sampler: 0.0057ms    app_update: 3.2601ms
render_ui: 1.5818ms    sampler: 0.0059ms    app_update: 3.064ms
render_ui: 1.5784ms    sampler: 0.0057ms    app_update: 3.1352ms
render_ui: 1.575ms     sampler: 0.0055ms    app_update: 3.0337ms
render_ui: 1.5747ms    sampler: 0.0059ms    app_update: 2.9464ms
render_ui: 1.6051ms    sampler: 0.0055ms    app_update: 3.1213ms
render_ui: 1.5945ms    sampler: 0.0055ms    app_update: 3.3046ms
render_ui: 1.5898ms    sampler: 0.0062ms    app_update: 2.9136ms
render_ui: 1.6466ms    sampler: 0.0057ms    app_update: 3.4331ms

Interesting... What I can glean from this is that the logic inside the sampler() function isn't really taking up much time. The bulk of the slowdown comes from having extra data to process when we render. Note that render_ui() is called inside app_update(), so most of the extra time spent inside app_update() is actually just spent in render_ui().

Looking at my code, this makes sense. The sampler() function doesn't actually handle traversing the buffer and figuring out where to draw. It just handles a bunch of 'metadata' about the sampler and creates a UI box configured such that the renderer will recognise it in the rendering pipeline and draw the waveform there. It's in that rendering function where all the time is spent, and where we'll need to focus if we're to speed this up.

How the Waveform is Drawn

Let's talk through roughly how I draw the waveform now.

When a sound is loaded into a track via miniaudio, I store the resulting PCM data in my own buffer. Now that I think about it, I can probably just query miniaudio for the PCM data when I need it? That's a problem for another day...
The main UI is made up of a bunch of tracks, and each track has a corresponding sampler. You can think of sampler() as creating the UI of the surrounding window, buttons, etc. that you see around the waveform in the image, but it doesn't draw the actual waveform. For the waveform it simply creates a box with metadata that the renderer will handle:

// Inside here we'll render the waveform and the slice markers.
waveform_parent := child_container(
    {
        semantic_size = {{.Percent, 0.90}, {.Percent, 0.85}},
        color = .Secondary,
    },
    {},
    {.Draw, .Clickable, .Scrollable},
    Metadata_Sampler{
        track_num
    },
)
...
...
// other sampler stuff, like controls, handling the creation of slices, etc.
...
...

Its main purpose is to provide a size for the container into which we'll draw the waveform (and slices), and also attach some metadata indicating that this is for the sampler, and the track number.

The rendering pass runs after the UI is 'logically' created and laid out. UI creation is essentially creating a tree of Boxes, and the layout pass traverses this tree and sets the position and size of every Box according to the data given in the creation phase.
The part of the renderer relevant to drawing waveforms is this snippet:

for x in 0 ..< render_width {
    // Map pixel column x to the [start, end) range of PCM frames it covers.
    start := start_sample + u64((f64(x) / f64(render_width)) * (f64(end_sample - start_sample)))
    end := start_sample + u64((f64(x + 1) / f64(render_width)) * (f64(end_sample - start_sample)))
    if end >= frames_read {end = frames_read}
    // Samples are assumed to lie in [-1, 1], so the first real sample overwrites these.
    min: f32 = 1
    max: f32 = -1
    for i in start ..< end {
        if pcm_frames[i] < min {min = pcm_frames[i]}
        if pcm_frames[i] > max {max = pcm_frames[i]}
    }
    norm_x: f32 = f32(x) / f32(render_width)
    x_pos := f32(box.top_left.x) + norm_x * render_width
    y_top := f32(box.top_left.y) + (0.5 - max * 0.5) * render_height
    y_bot := f32(box.top_left.y) + (0.5 - min * 0.5) * render_height
    new_data := Rect_Render_Data {
        border_thickness = 300,
        corner_radius    = 0,
        edge_softness    = 0,
        top_left         = Vec2_f32{x_pos - 0.5, y_top},
        bottom_right     = Vec2_f32{x_pos + 0.5, y_bot},
    }
    if end <= pos_in_track {
        new_data.tl_color = played_color
        new_data.tr_color = played_color
        new_data.bl_color = played_color
        new_data.br_color = played_color
    } else {
        new_data.tl_color = unplayed_color
        new_data.tr_color = unplayed_color
        new_data.bl_color = unplayed_color
        new_data.br_color = unplayed_color
    }
    append(rendering_data, new_data)
}

In English:

We figure out how wide the box we're rendering the waveform into is.
We slice up our PCM frames into chunks, say there's 1,000,000 PCM frames for the sound and our box has a width of 400 pixels. Then each 'chunk' will be 1,000,000 / 400 = 2,500 PCM frames.
We then iterate over this chunk, finding the minimum and maximum sample value.
We use this min / max to draw a vertical line through that pixel column inside the waveform_parent box.
We repeat this for every pixel of our box to draw the waveform.

There is some extra stuff going on since we're able to zoom into the waveform and only render a part of it, but we'll ignore that for now.
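Stripped of zooming and rendering, the per-pixel chunking above boils down to something like the following sketch. The column_min_max helper is a hypothetical name, and it assumes the samples are already normalised to [-1, 1].

```odin
package main

import "core:fmt"

// Min/max of the samples that fall into pixel column x when the whole of
// `samples` is drawn width_px pixels wide. Hypothetical helper, not from the app.
column_min_max :: proc(samples: []f32, x, width_px: int) -> (lo, hi: f32) {
    start := x * len(samples) / width_px
    end   := (x + 1) * len(samples) / width_px
    lo, hi = 1, -1 // assumes samples are normalised to [-1, 1]
    for s in samples[start:end] {
        lo = min(lo, s)
        hi = max(hi, s)
    }
    return
}

main :: proc() {
    samples := []f32{0.1, -0.4, 0.8, 0.2, -0.9, 0.3, 0.0, 0.5}
    // 8 samples over 4 pixel columns gives chunks of 2 samples each.
    for x in 0 ..< 4 {
        lo, hi := column_min_max(samples, x, 4)
        fmt.printf("column %d: min %.1f max %.1f\n", x, lo, hi)
    }
}
```

The inner loop over samples[start:end] is the part that dominates for long sounds, since every frame touches every PCM sample in view.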

That rendering_data is a dynamic array that's passed in and out of various rendering functions, ultimately building a flat list of Rect_Render_Data structs which we pass to the GPU, run through shaders and put pixels on the screen. The fields of Rect_Render_Data are precisely the vertex values the GPU shader expects in order to draw quads of a certain size, color, in a specific location, textured by a specific texture, etc, etc, to achieve the effect of our desired UI.

(3 dot point version of how to do UI from scratch :) )

What Improvements Can We Make?

I can think of a few things:

1. CACHING CACHING CACHING!!

With the current setup, we literally do this entire 3 step procedure every single frame. In the future I am going to implement some mechanism in my UI library to skip all 3 steps if nothing has occurred (i.e. no user event / internal event / animation, etc), but right now this code runs every frame. So if it's going to run every frame, and compute the exact same waveform data every time (unless a user zooms in / loads a new sound), then we can cache the existing result and just load it in the next frame.

What could we cache?

Well the entire point of this function is to generate a bunch of Rect_Render_Data, we could cache those? But the problem is, we want to color a subsequence of the pixels differently depending on if that part of the sample has played yet or not.
A more flexible first approach would be to just cache the final y_top, y_bot and x_pos values, since I assume most of the time in that rendering snippet is spent calculating those 3 values for each pixel slice.

So we could end up with roughly:

On the first frame (or when we don't have valid cached data), do the full computation, record some state that we can check against for invalidation.
On the next frame, check whether that recorded state has changed:
- If so recompute.
- If not (which is often the case) use the cached values and skip A LOT of computations.
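That check-then-reuse pattern can be sketched in isolation like this. Cache_Key, Waveform_Cache and get_heights are hypothetical names for illustration, and the "expensive" computation is just a placeholder loop; the point is that the slow path only runs when the key changes.

```odin
package main

import "core:fmt"

// Hypothetical key: everything the cached values depend on.
Cache_Key :: struct {
    zoom:         f32,
    top_left:     [2]f32,
    bottom_right: [2]f32,
}

Waveform_Cache :: struct {
    valid:      bool,
    key:        Cache_Key,
    heights:    [dynamic]f32, // stand-in for the cached per-pixel values
    recomputes: int,          // only here to show how often the slow path runs
}

// Returns the cached data, recomputing only if the key differs from last time.
get_heights :: proc(cache: ^Waveform_Cache, key: Cache_Key) -> []f32 {
    if !cache.valid || cache.key != key {
        clear(&cache.heights)
        for x in 0 ..< 4 {
            // Placeholder for the expensive min/max pass.
            append(&cache.heights, f32(x) * key.zoom)
        }
        cache.key = key
        cache.valid = true
        cache.recomputes += 1
    }
    return cache.heights[:]
}

main :: proc() {
    cache: Waveform_Cache
    defer delete(cache.heights)
    key := Cache_Key{zoom = 1, top_left = {0, 0}, bottom_right = {400, 200}}
    get_heights(&cache, key) // first frame: computes
    get_heights(&cache, key) // same state: cache hit
    key.zoom = 2
    get_heights(&cache, key) // zoom changed: recomputes
    fmt.println("recomputes:", cache.recomputes)
}
```

Keeping the whole key in one comparable struct means a new invalidation condition is just one more field.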

2. GOING WIDE (SIMD / parallelism)

Let's ignore parallelism for now, since I don't have any of that infrastructure set up (thread pools / workers, etc.) and spinning up threads just for this one function wouldn't make sense: I'd expect the overhead to be greater than the speedup. Instead we could use wide SIMD instructions.

The gist would be:

Initialize a min vector: {1, 1, 1, 1} and a max vector {-1, -1, -1, -1}.
Grab 4 - 8 samples at a time (different architectures support different SIMD widths) and update the min vector and max vector.
Once we've iterated over the entire chunk of the PCM buffer, we simply take the min of the min vector and the max of the max vector.
And voila, we've computed the exact same result with 4 - 8x fewer CPU instructions in the inner loop. The speedup probably won't be 4 - 8x due to overhead and memory access bottlenecks, but it should be substantial.
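Here is that gist as a minimal standalone sketch using Odin's core:simd package with 4 lanes. The simd_min_max name is made up for illustration, and the samples are assumed to be normalised to [-1, 1]; leftover samples that don't fill a whole vector are handled by a scalar loop.

```odin
package main

import "core:fmt"
import "core:simd"

LANES :: 4

// SIMD min/max over a slice, with a scalar loop for the leftover samples.
simd_min_max :: proc(samples: []f32) -> (lo, hi: f32) {
    mins: #simd[LANES]f32 = {1, 1, 1, 1}
    maxs: #simd[LANES]f32 = {-1, -1, -1, -1}

    // Consume full vectors while one still fits.
    i := 0
    for i + LANES <= len(samples) {
        chunk := simd.from_slice(simd.f32x4, samples[i:i + LANES])
        mins = simd.min(mins, chunk)
        maxs = simd.max(maxs, chunk)
        i += LANES
    }

    // Collapse the lanes down to a single min and max.
    lo = simd.reduce_min(mins)
    hi = simd.reduce_max(maxs)

    // Scalar tail: fewer than LANES samples remain.
    for s in samples[i:] {
        lo = min(lo, s)
        hi = max(hi, s)
    }
    return
}

main :: proc() {
    samples := []f32{0.1, -0.4, 0.8, 0.2, -0.9, 0.3, 0.0, 0.5, 0.95, -0.2}
    lo, hi := simd_min_max(samples)
    fmt.println(lo, hi)
}
```

Driving the tail loop from the same index as the vector loop avoids having to work out separately how many samples are left over.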

3. GOING EVEN WIDER

GPUs are thicc thicc. Very wide, perfect for this kind of processing. However this approach has a trade-off.... and that is ... I don't exactly know how to do that yet and I won't be learning it right now :) But if the changes we make in this post don't lead to a great enough speed up, I will certainly be writing another article where we leverage the GPU for even greater gains.

The Plan

So what's the outcome? Here's my plan to optimize:

Calculating the min / max of a chunk with SIMD.
Caching that result, and only ever recalculating if the user changes the zoom level or pans around the sample.
Hopefully that's good enough :)

Implementing SIMD Processing

Before diving in, let's see whether we spend more time computing y_top, y_bot and x_pos, or creating the Rect_Render_Data and appending it to the rendering_data array. I've added similar timestamp logging, this time around the code that does the y_top, y_bot, x_pos calculations and around the part that creates the Rect_Render_Data and appends it to rendering_data, giving us more granular insight into where we should focus our optimization efforts.

spent 1.0913ms calcing    spent 0.0400ms creating render data
spent 1.0549ms calcing    spent 0.0366ms creating render data
spent 1.0168ms calcing    spent 0.0347ms creating render data
spent 1.0586ms calcing    spent 0.0291ms creating render data
spent 1.0300ms calcing    spent 0.0347ms creating render data
spent 1.0620ms calcing    spent 0.0337ms creating render data
spent 1.0161ms calcing    spent 0.0388ms creating render data
spent 1.0203ms calcing    spent 0.0308ms creating render data

As we can see, most of the time is spent computing y_top, y_bot, x_pos, so it's these results that we will cache.

I do have one more concern, which is whether the optimizing mode of the Odin compiler + LLVM is already vectorising this code. Let's take a look at the disassembly to see if that's the case.

Using compiler explorer, it appears that even with the most aggressive optimization settings of the Odin compiler -o:aggressive that inner for loop doesn't use SIMD.

I won't post the sample code and resulting assembly here, but you can take my word for it, or check out Compiler Explorer for yourself.

When doing optimization work, it's worth checking things like this, as modern compilers are a kind of black magic and will do sneaky useful things behind your back. Or you can just make your change and re-profile, depends on how complicated your optimizing change is.

So we shall continue on and implement it ourselves. Our new SIMD leveraging code looks like:

for x in 0 ..< render_width {
    ...
    //unchanged
    ...

    // Process 8 f32s at a time.
    SIMD_WIDTH :: 8
    mins:  #simd[SIMD_WIDTH]f32 = {1, 1, 1, 1, 1, 1, 1, 1}
    maxs:  #simd[SIMD_WIDTH]f32 = {-1, -1, -1, -1, -1, -1, -1, -1}
    // Loop while a full vector still fits, so a chunk whose length is an
    // exact multiple of SIMD_WIDTH doesn't lose its last 8 samples.
    i := start
    for i + SIMD_WIDTH <= end {
        chunk := simd.from_slice(simd.f32x8, pcm_frames[i:i+SIMD_WIDTH])
        mins  = simd.min(mins, chunk)
        maxs  = simd.max(maxs, chunk)
        i += SIMD_WIDTH
    }

    min := simd.reduce_min(mins)
    max := simd.reduce_max(maxs)

    // Scalar pass over whatever is left (fewer than SIMD_WIDTH samples).
    for i < end {
        if pcm_frames[i] < min { min = pcm_frames[i] }
        if pcm_frames[i] > max { max = pcm_frames[i] }
        i += 1
    }
    ...
    // same calculations of x,y values and creation of Rect_Render_Data as before
    ...
}

Let's run that and see if we get any speedups:

spent 0.3000ms calcing    spent 0.0286ms creating render data
spent 0.2761ms calcing    spent 0.0349ms creating render data
spent 0.3098ms calcing    spent 0.0303ms creating render data
spent 0.3247ms calcing    spent 0.0374ms creating render data
spent 0.3210ms calcing    spent 0.0391ms creating render data
spent 0.3323ms calcing    spent 0.0332ms creating render data
spent 0.3403ms calcing    spent 0.0344ms creating render data
spent 0.3362ms calcing    spent 0.0376ms creating render data
spent 0.4600ms calcing    spent 0.0454ms creating render data
spent 0.3076ms calcing    spent 0.0349ms creating render data
spent 0.3413ms calcing    spent 0.0383ms creating render data

Now let's remove the extra logging inside the waveform rendering function, so we can make a fair comparison with our first set of logging (just app_update and render_ui). We get:

render_ui: 0.9401ms
render_ui: 0.868ms
render_ui: 0.8766ms
render_ui: 0.8919ms
render_ui: 0.9326ms
render_ui: 0.9238ms
render_ui: 0.899ms
render_ui: 0.8917ms
render_ui: 0.931ms
render_ui: 0.9059ms
render_ui: 0.9277ms
render_ui: 1.0217ms

As we can see from the timing information:

Calculating the height of each waveform pixel now takes about 1/3 of the time (roughly 1.04ms down to 0.33ms on average). That's a 200% speed up, or put another way, it runs about 3x faster than it used to.
Rendering the entire UI now takes a little under 60% of the time (roughly 1.59ms down to 0.92ms on average), or put another way, it runs about 1.7x faster than it used to. The speed up here is less significant because there are other parts of the rendering pipeline that are still slowed down by all the extra rendering data we add for the waveform.
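As a sanity check on those ratios, here is a small sketch that averages the timings copied from the logs in this post and divides before by after. The mean, calc_speedup and render_ui_speedup procedures are purely illustrative helpers.

```odin
package main

import "core:fmt"

mean :: proc(xs: []f64) -> f64 {
    total := 0.0
    for x in xs {
        total += x
    }
    return total / f64(len(xs))
}

// "calcing" times in ms: scalar loop vs SIMD loop, copied from the logs.
calc_speedup :: proc() -> f64 {
    scalar_ms := [?]f64{1.0913, 1.0549, 1.0168, 1.0586, 1.0300, 1.0620, 1.0161, 1.0203}
    simd_ms   := [?]f64{0.3000, 0.2761, 0.3098, 0.3247, 0.3210, 0.3323, 0.3403, 0.3362, 0.4600, 0.3076, 0.3413}
    return mean(scalar_ms[:]) / mean(simd_ms[:])
}

// render_ui times in ms with a waveform on screen: before vs after SIMD.
render_ui_speedup :: proc() -> f64 {
    before_ms := [?]f64{1.5702, 1.5948, 1.5989, 1.5818, 1.5784, 1.575, 1.5747, 1.6051, 1.5945, 1.5898, 1.6466}
    after_ms  := [?]f64{0.9401, 0.868, 0.8766, 0.8919, 0.9326, 0.9238, 0.899, 0.8917, 0.931, 0.9059, 0.9277, 1.0217}
    return mean(before_ms[:]) / mean(after_ms[:])
}

main :: proc() {
    fmt.printf("calcing: %.2fx faster\n", calc_speedup())
    fmt.printf("render_ui: %.2fx faster\n", render_ui_speedup())
}
```

These are averages over a handful of frames, so treat the second decimal place as noise.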

That change probably took me 10-15 minutes to implement, and the majority of that time was spent reading the Odin docs, since I hadn't used Odin's simd package before. If I had, it probably would've been a couple of minutes of work to speed up an extremely costly part of my application.

While I'm happy with that speed up, I'd rather that log statement read something like:

spent 0.0ms calcing    spent 0.0303ms creating render data
spent 0.0ms calcing    spent 0.0374ms creating render data
spent 0.0ms calcing    spent 0.0391ms creating render data
spent 0.0ms calcing    spent 0.0332ms creating render data
spent 0.0ms calcing    spent 0.0344ms creating render data
spent 0.0ms calcing    spent 0.0376ms creating render data
spent 0.0ms calcing    spent 0.0454ms creating render data
spent 0.0ms calcing    spent 0.0349ms creating render data

And I think it should be mostly possible, because, from frame to frame, unless the user scrolls around the waveform or moves the sampler's window around the screen, those y_top, y_bot, x_pos values, should be absolutely identical to the previous frame.... So, let's implement some caching!

Implementing Caching

The general strategy will be:

Check various pieces of sampler state that would cause the y_top, y_bot, x_pos values of our waveform render data to be invalid. In my app right now, that happens when:
- The user zooms in.
- The user drags the sampler container around.
- The user loads a new audio file into the sampler.

This list may grow as the sampler becomes more featureful but it would probably amount to storing a few extra bytes of state and updating a condition in an if statement (as we'll see).

Zoom Invalidation

This only requires one additional piece of state: prev_zoom_amount. To actually store the cached data we alter the Sampler_State struct to:

Sampler_State :: struct {
    ...
    ...
    // existing fields
    ...
    ...
    prev_zoom_amount: f32,
    cached_sample_heights: [dynamic]Waveform_Sample_Render_Info,
}

Waveform_Sample_Render_Info :: struct {
    x_pos, y_top, y_bottom: f32,
    // This is basically the last pcm_frame in the pixel chunk. Lets us know
    // if this pixel should be colored or not depending on where the playhead
    // of the sample is.
    end_pcm_frame: u64,
}

prev_zoom_amount is precisely what you'd think. The first line of the sampler() function does:

prev_zoom_amount = zoom_amount

We'll use this as part of our cache invalidation check in the waveform rendering code.

Re-positioning Invalidation

We also need a way of determining inside our rendering code, if the sampler window has moved from where it was in the previous frame (when we cached the data). Luckily, we already possess the necessary data; the function signature of the waveform rendering code is:

add_waveform_rendering_data :: proc(
    box: Box,
    track: ^Track,
    pcm_frames: [dynamic]f32,
    rendering_data: ^[dynamic]Rect_Render_Data,
)

That first argument box: Box, is the core building block of my entire UI library. It basically follows Ryan Fleury's idea of a fat struct. It holds the fields:

Box :: struct {
    ...
    ...
    top_left:          [2]f32,
    bottom_right:      [2]f32,
    // new
    prev_top_left:     [2]f32,
    // new
    prev_bottom_right: [2]f32,
    ...
    ...
}

In my UI library, there are no separate structs / classes that describe a Button, Slider, Knob, TextBox, etc; everything is a Box :: struct{...}. The renderer is an instanced quad renderer, that along with other information, expects vertices to contain a top_left and bottom_right.

We've added a prev_top_left and prev_bottom_right. Very similar to the prev_zoom_amount of the sampler, we set these at the beginning of the layout positioning code. This lets us check 2 of our 3 cache invalidation conditions in the waveform renderer like so:

if sampler.zoom_amount == sampler.prev_zoom_amount &&
   box.top_left == box.prev_top_left && box.bottom_right == box.prev_bottom_right
{
    // Nothing relevant changed since last frame: rebuild the quads from the cache.
    for curr in sampler.cached_sample_heights {
        x_pos := curr.x_pos
        y_top := curr.y_top
        y_bot := curr.y_bottom
        pixels_end_pcm_frame := curr.end_pcm_frame
        new_data := Rect_Render_Data {
            // other rendering_data
            top_left         = Vec2_f32{x_pos - 0.5, y_top},
            bottom_right     = Vec2_f32{x_pos + 0.5, y_bot},
        }
        if pixels_end_pcm_frame <= pos_in_track {
            // color in one way.
        } else {
            // color in a different way.
        }
        append(rendering_data, new_data)
    }
} else {
    // Current state no longer corresponds with our cached data.
    // Run our optimised SIMD routine to recalculate waveform data.
}

I won't bother re-measuring the time spent calculating y_top, y_bot, x_pos; it's now basically the cost of 1_000 (or however many pixels wide the sampler container is) * 4 assignments.

But, let's look at the overall app_update() loop time and render_ui time:

render_ui: 0.5735ms    app_update: 2.0195ms
render_ui: 0.5689ms    app_update: 1.9822ms
render_ui: 0.564ms     app_update: 1.9094ms
render_ui: 0.5916ms    app_update: 2.0783ms
render_ui: 0.5606ms    app_update: 1.8903ms
render_ui: 0.6147ms    app_update: 2.5987ms
render_ui: 0.8271ms    app_update: 1.9678ms
render_ui: 0.582ms     app_update: 1.9211ms
render_ui: 0.5728ms    app_update: 1.9729ms
render_ui: 0.5812ms    app_update: 1.9217ms
render_ui: 0.5782ms    app_update: 1.8876ms
render_ui: 0.5982ms    app_update: 1.9373ms
render_ui: 0.5733ms    app_update: 2.0318ms

And just like that, we're basically taking the same amount of time as we were in the first measurement, when we weren't rendering the sampler or waveform at all!

In a way, rendering a waveform is now nearly free. That's important, since we could be rendering 5+ samplers at once, and each sampler in the final version of the app will display both the left and right channels (right now it's only the left channel), so that original ~1ms slowdown per waveform would have become something like a 10ms+ slowdown of the UI thread. On my relatively powerful Intel 13600KF that still fits inside a 60FPS frame budget (about 16.7ms) but not a 120FPS one (about 8.3ms), and we'd start dropping frames and lagging for the many, many people who run less powerful hardware. So that's a pretty good result!

I'll leave out the code for invalidating the cache and recalculating the data when we load a new sound, as this post is already too long and it's pretty trivial. It would basically just be:

In the sound loading code, clear the dynamic array of the cached data.
Pull out the existing waveform rendering code into a separate function and call it from inside the sound loading function to refill the cache.
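For anyone who wants something concrete anyway, those two steps might look roughly like the sketch below. Everything here is a trimmed-down assumption: fill_waveform_cache and on_sound_loaded are hypothetical names, the structs only keep the fields needed for the example, and the cache stores raw min / max values rather than real screen coordinates.

```odin
package main

import "core:fmt"

// Trimmed-down, hypothetical versions of the structs from this post.
Waveform_Sample_Render_Info :: struct {
    x_pos, y_top, y_bottom: f32,
    end_pcm_frame:          u64,
}

Sampler_State :: struct {
    cached_sample_heights: [dynamic]Waveform_Sample_Render_Info,
}

// Placeholder for the extracted waveform routine: one cache entry per pixel column.
fill_waveform_cache :: proc(sampler: ^Sampler_State, pcm_frames: []f32, width_px: int) {
    for x in 0 ..< width_px {
        start := x * len(pcm_frames) / width_px
        end   := (x + 1) * len(pcm_frames) / width_px
        lo, hi: f32 = 1, -1
        for s in pcm_frames[start:end] {
            lo = min(lo, s)
            hi = max(hi, s)
        }
        append(&sampler.cached_sample_heights, Waveform_Sample_Render_Info{
            x_pos = f32(x), y_top = hi, y_bottom = lo, end_pcm_frame = u64(end),
        })
    }
}

// Called from the sound loading code: drop the old entries, then refill.
on_sound_loaded :: proc(sampler: ^Sampler_State, pcm_frames: []f32, width_px: int) {
    clear(&sampler.cached_sample_heights)
    fill_waveform_cache(sampler, pcm_frames, width_px)
}

main :: proc() {
    sampler: Sampler_State
    defer delete(sampler.cached_sample_heights)
    on_sound_loaded(&sampler, []f32{0.1, -0.4, 0.8, 0.2}, 2)
    on_sound_loaded(&sampler, []f32{0.5, -0.5, 0.25, -0.25, 0.0, 0.9}, 3)
    fmt.println("cached columns:", len(sampler.cached_sample_heights))
}
```

Using clear() rather than delete() keeps the array's backing memory around, so reloading a sound of similar length shouldn't need to reallocate.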

While I'm happy with the speed up, I'm not exactly ecstatic about how the waveform ultimately looks. However, I think that's a separate issue: to draw it more prettily, I'd need basically the same data but different code in the shader, which wouldn't really affect the CPU-side runtime. Perhaps we'll talk about that in another post.

And that's all for now!

Tune In Next Time

Where I'll probably talk about:

Optimizing even more hot paths of my UI — probably the code that handles 1000s of Box structs each frame.
Writing a fairly non-shit multiband graphical EQ and spectrum analyser.
Making my immediate mode UI as power efficient as your favourite retained mode UI.
....... something cool!