
feat: songLyrics extension version 2 — word/syllable-level timing#218

Merged
Tolriq merged 12 commits into opensubsonic:main from ranokay:songlyrics-v2
Mar 26, 2026

Conversation

@ranokay
Contributor

@ranokay ranokay commented Mar 5, 2026

Summary

This PR adds version 2 of the songLyrics extension to the OpenSubsonic spec and OpenAPI docs.

Version 2 introduces optional word/syllable-level karaoke timing behind enhanced=true, while keeping the default response fully backward compatible with version 1.

It also documents richer lyric modeling for sources such as TTML:

  • independent kind tracks (main, translation, pronunciation)
  • word/syllable timing via cueLine + cue
  • shared per-track vocal attribution via structuredLyrics.agents
  • cueLine.agentId references instead of repeating singer/layer metadata on every cueLine

This PR follows up on discussion #213 and incorporates the review feedback on cue ordering, overlap rules, and multi-agent attribution.

What's new

New response types

  • cue: a single timed word or syllable with start, optional end, and value
  • cueLine: a timed line-level grouping of cue items
  • agent: reusable per-track attribution metadata for cueLines

New/extended fields

  • structuredLyrics.kind: classifies each lyric track as main, translation, or pronunciation
  • structuredLyrics.cueLine: word/syllable-level timing data parallel to line
  • structuredLyrics.agents: optional per-track attribution metadata for cue-attributed lyrics
  • cueLine.agentId: references an agent in the same structuredLyrics entry
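To make the new shape concrete, here is a minimal sketch of what an enhanced structuredLyrics entry might look like, expressed as a Python dict. Only the field names come from this PR; every value (language, agent ids, timings, text) is invented for illustration.

```python
# Hypothetical enhanced structuredLyrics entry; field names per this PR,
# all values invented for illustration.
enhanced_entry = {
    "lang": "eng",
    "synced": True,
    "kind": "main",            # one of: main, translation, pronunciation
    "agents": [                # shared per-track vocal attribution
        {"id": "v1", "role": "main"},
        {"id": "bg1", "role": "bg"},
    ],
    "cueLine": [
        {
            "index": 0,
            "start": 0,
            "end": 1000,
            "agentId": "v1",   # must resolve to a local agents[].id
            "value": "Hello yay",
            "cue": [
                {"start": 0, "end": 500, "value": "Hello"},
                {"start": 500, "end": 1000, "value": "yay"},
            ],
        }
    ],
}
```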

Endpoint changes

  • getLyricsBySongId gets an enhanced boolean parameter on both GET and POST
  • when enhanced=true, the response may include:
    • kind
    • cueLine
    • agents
    • non-main lyric tracks such as translation and pronunciation
  • when enhanced=false or omitted, the response remains version 1-compatible
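A hedged request sketch: the server URL, song id, and response-format parameter are placeholders, and auth parameters are omitted; only the endpoint name and the enhanced parameter come from this PR.

```python
from urllib.parse import urlencode

# Hypothetical server URL and song id; only getLyricsBySongId and
# the `enhanced` parameter are taken from this PR.
base = "https://music.example.com/rest/getLyricsBySongId"
params = {"id": "song-123", "enhanced": "true", "f": "json"}
url = f"{base}?{urlencode(params)}"

# Omitting `enhanced` (or sending enhanced=false) must yield a
# version 1-compatible response.
```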

Contract clarifications added during review

  • cueLine is only meaningful for synced=true; unsynced lyrics must not include it
  • within a single cueLine, cue.end is all-or-none
  • when the source has partial cue end-times, servers must fill the missing ones
  • cues inside a single cueLine must not overlap
  • overlaps across different cueLines are still valid, since those represent parallel vocal layers
  • agents are scoped to a single structuredLyrics entry
  • agents[].id must be unique within that entry
  • if agents are used for cue-attributed lyrics, there must be exactly one role: "main" agent
  • when multiple cueLines share the same index, the one whose agent has role: "main" must come first
  • if agents is present, every cueLine in that entry must include agentId
  • each agentId must resolve to one local agents[].id in the same structuredLyrics entry
  • structuredLyrics entries are independent across all kind values, including main
  • clients must not assume 1:1 alignment of line arrays or cueLine arrays between main, translation, and pronunciation
  • cue counts may differ across tracks for the same lyric passage
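Several of the rules above lend themselves to a mechanical check. Below is a sketch of a validator covering a subset of them (all-or-none cue.end, no intra-cueLine overlap, unique agent ids, exactly one main agent, local agentId resolution). This is illustrative code, not normative spec text.

```python
def validate_entry(entry: dict) -> bool:
    """Check a subset of the v2 cue/agent contract rules (illustrative)."""
    agents = entry.get("agents")
    if agents is not None:
        ids = [a["id"] for a in agents]
        assert len(ids) == len(set(ids)), "agents[].id must be unique"
        assert sum(a["role"] == "main" for a in agents) == 1, \
            "exactly one agent with role: main"
    for line in entry.get("cueLine", []):
        if agents is not None:
            assert line.get("agentId") in {a["id"] for a in agents}, \
                "every cueLine must reference a local agents[].id"
        cues = line.get("cue", [])
        ends = [c.get("end") for c in cues]
        # cue.end is all-or-none within a single cueLine
        assert all(e is None for e in ends) or all(e is not None for e in ends), \
            "cue.end is all-or-none within a cueLine"
        # cues inside one cueLine must not overlap
        for prev, nxt in zip(cues, cues[1:]):
            if prev.get("end") is not None:
                assert prev["end"] <= nxt["start"], \
                    "cues in one cueLine must not overlap"
    return True
```

Note that overlaps across different cueLines are deliberately not checked, since those represent parallel vocal layers and remain valid.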

Examples added/updated

  • enhanced Korean example with full second-line cueLine coverage
  • pronunciation example showing different cue counts from the main track
  • background-vocals example using agents + agentId
  • multiple-singers example using shared agents instead of repeated per-line role/name fields

Backward compatibility

  • enhanced defaults to false
  • without enhanced=true:
    • the response shape is identical to version 1
    • no cueLine arrays are returned
    • no agents arrays are returned
    • non-main kind tracks are not returned
    • the existing line array remains unchanged

Servers supporting this version should advertise:

  • songLyrics versions [1, 2]
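For example, the extension advertisement returned by getOpenSubsonicExtensions might look like this (wrapper fields elided; the shape is assumed from the existing extension-listing format).

```python
# Assumed sketch of the extension advertisement; the surrounding
# subsonic-response wrapper is elided.
advertisement = {
    "openSubsonicExtensions": [
        {"name": "songLyrics", "versions": [1, 2]},
    ]
}
```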

Files changed

New

  • content/en/docs/Responses/agent.md
  • content/en/docs/Responses/cue.md
  • content/en/docs/Responses/cueLine.md
  • openapi/schemas/Agent.json
  • openapi/schemas/Cue.json
  • openapi/schemas/CueLine.json

Updated

  • content/en/docs/Endpoints/getLyricsBySongId.md
  • content/en/docs/Extensions/songLyrics.md
  • content/en/docs/Responses/structuredLyrics.md
  • openapi/endpoints/getLyricsBySongId.json
  • openapi/openapi.json
  • openapi/schemas/StructuredLyrics.json

Closes #213

@netlify

netlify bot commented Mar 5, 2026

Deploy Preview for opensubsonic ready!

🔨 Latest commit: 9c42bb9
🔍 Latest deploy log: https://app.netlify.com/projects/opensubsonic/deploys/69beda89e63608000891574b
😎 Deploy Preview: https://deploy-preview-218--opensubsonic.netlify.app

@Tolriq Tolriq requested review from a team March 12, 2026 07:42
Tolriq previously approved these changes Mar 12, 2026
Copilot AI review requested due to automatic review settings March 13, 2026 19:12
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces songLyrics extension v2 by adding word/syllable-level karaoke timing (cue/cueLine) and lyric-layer classification (kind) while keeping v1 behavior as the default unless enhanced=true is provided to getLyricsBySongId.

Changes:

  • Add new OpenAPI response schemas Cue and CueLine, and extend StructuredLyrics with optional kind and cueLine.
  • Extend getLyricsBySongId (GET + POST) with an enhanced parameter to opt into v2 responses.
  • Add/expand documentation pages and examples covering the new response types and v2 behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Summary per file:

  • openapi/schemas/StructuredLyrics.json: adds optional kind and cueLine fields to the structured lyrics schema.
  • openapi/schemas/Cue.json: new schema for word/syllable-level timing cues.
  • openapi/schemas/CueLine.json: new schema for line-level cue groupings with optional role metadata.
  • openapi/openapi.json: registers Cue and CueLine under components/schemas.
  • openapi/endpoints/getLyricsBySongId.json: adds enhanced query param (GET) and form field (POST).
  • content/en/docs/Responses/structuredLyrics.md: documents v2 structuredLyrics fields and examples.
  • content/en/docs/Responses/cue.md: new docs page for cue.
  • content/en/docs/Responses/cueLine.md: new docs page for cueLine.
  • content/en/docs/Extensions/songLyrics.md: adds a Version 2 section describing the new capabilities and gating.
  • content/en/docs/Endpoints/getLyricsBySongId.md: documents the new enhanced parameter and v2 response examples/notes.


kgarner7 previously approved these changes Mar 14, 2026
ranokay and others added 10 commits March 15, 2026 00:02
Add enhanced lyrics support with word/syllable-level karaoke timing
(cueLine/cue), lyric layer classification (kind: main/translation/
pronunciation), and role-based vocal attribution (bg, voiceN, group).

All new fields are gated behind the `enhanced=true` query parameter
for full backward compatibility with version 1 clients.

New response types:
- cue: a single word or syllable with start/end timing
- cueLine: a line of cues with optional role, value, and timing

Modified response types:
- structuredLyrics: added kind and cueLine fields

Modified endpoints:
- getLyricsBySongId: added enhanced parameter (GET + POST)

Closes opensubsonic#213
Formatting fix so these schema files match the repo's prevailing JSON indentation style.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…playRole

- Split `voiceN` string pattern into `role` enum (`bg`, `voice`, `group`) plus separate `voiceIndex` integer for individual voice parts
- Add optional `displayRole` string for human-readable vocal layer labels
- Promote derived end-time overlap avoidance from SHOULD to MUST
- Update cueLine, getLyricsBySongId, and songLyrics extension docs

…rvers must normalize source overlaps so cue[n].end <= cue[n+1].start within each cueLine. Cross-cueLine overlaps (different role/voiceIndex) remain expected for parallel vocal layers.

…ine.md: add example with role/voiceIndex/displayRole fields
- structuredLyrics.md: add pronunciation kind example with cueLine data
@kgarner7 kgarner7 requested a review from Tolriq March 21, 2026 18:33
@Tolriq
Member

Tolriq commented Mar 22, 2026

I've mostly already validated this, since I helped shape it. Waiting for at least one @opensubsonic/servers member to confirm they don't see it as problematic to implement, as this is the first evolution of an extension version.

@Tolriq
Member

Tolriq commented Mar 25, 2026

@deluan @epoupon

@Tolriq Tolriq enabled auto-merge (squash) March 26, 2026 07:18
@Tolriq Tolriq merged commit 7072ab1 into opensubsonic:main Mar 26, 2026
4 checks passed
@ayla6

ayla6 commented Apr 2, 2026

I found a small problem with the spec. It's very specific and wouldn't appear in any decent-quality lyrics, but:
If you have something like this:

<span begin="00:00:00.000" end="00:00:00.500">Hello</span> <span begin="00:00:00.500" end="00:00:01.000">yay</span> hello <span begin="00:00:02.500" end="00:00:03.000">hello</span> <span begin="00:00:03.000" end="00:00:03.500">hi</span> <span begin="00:00:03.500" end="00:00:04.500">awesome</span>

where a word without synced data is followed by an identical word with synced data, it becomes impossible to tell which of them is synced once it gets turned into the JSON the spec uses:

"cueLine": [
  {
    "index": 0,
    "start": 0,
    "end": 5000,
    "value": "Hello yay hello hello hi awesome",
    "cue": [
      { "start": 0, "end": 500, "value": "Hello" },
      { "start": 500, "end": 1000, "value": "yay" },
      { "start": 2500, "end": 3000, "value": "hello" },
      { "start": 3000, "end": 3500, "value": "hi" },
      { "start": 3500, "end": 4500, "value": "awesome" }
    ]
  }
]

It's minor, but it's something I noticed.

@ayla6

ayla6 commented Apr 2, 2026

If I can give my two cents: maybe it should be changed to use a start character and end character or something; it's simpler to apply and a lot more accurate. I probably should've said this before it was merged, though.

@Tolriq
Member

Tolriq commented Apr 2, 2026

@ranokay it seems your PR is still not merged in Navidrome, so we can still amend this to start/end char positions. WDYT?

@ranokay
Contributor Author

ranokay commented Apr 2, 2026

Yeah, fair point, that edge case is genuinely ambiguous with the current model. I hadn’t considered the repeated untimed/timed identical-token case.
I’m okay with adjusting it before Navidrome merges. I do wonder if explicit ordered tokens/segments would age better than char offsets, but if we want the smallest possible amendment, positions on the cues are probably the simpler path.
I see two possible fixes:

  1. Minimal amendment: keep cue as-is and add char positions that reference cueLine.value:
{ "value": "hello", "start": 2500, "end": 3000, "charStart": 16, "charEnd": 21 }
  2. Bigger but cleaner change: preserve explicit ordered tokens, timed and untimed:
"tokens": [
  { "value": "Hello", "start": 0, "end": 500 },
  { "value": "yay", "start": 500, "end": 1000 },
  { "value": "hello" },
  { "value": "hello", "start": 2500, "end": 3000 }
]

My feeling is char positions are the smaller amendment, while ordered tokens are the cleaner long-term model. What do you think?

@Tolriq
Member

Tolriq commented Apr 2, 2026

We said earlier in the spec that when there's an end, all cues (not tokens :p) must have one, so option 2 would contradict that and make things more complicated client-side. Internally I already use solution 1, and I guess most implementations will need to do the char mapping anyway, so doing it directly is better for clients IMO.

@ranokay
Contributor Author

ranokay commented Apr 2, 2026

Alright, I think the main thing to pin down is the exact semantics: relative to cueLine.value, 0-based or not, and whether charEnd is exclusive. We should probably also define what “character position” means for Unicode so different implementations don’t count differently.

If you’re aligned on that direction, I can put together a small follow-up PR for the schema/docs.

@Tolriq
Member

Tolriq commented Apr 2, 2026

0-based and inclusive; the first "Hello" is 0-4. For Unicode, that's a great question :) Considering the data used to generate those endpoints can't have a cue split in the middle of a Unicode character, using byte offsets is probably the simplest way to avoid issues for some clients?

@ranokay
Contributor Author

ranokay commented Apr 2, 2026

My only concern is naming: if we go with UTF-8 byte offsets, I think they should be called byteStart / byteEnd rather than charStart / charEnd, otherwise people will assume actual character positions.

Then we can define them precisely as 0-based inclusive offsets into the UTF-8 encoding of the final cueLine.value string, with no normalization step.
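Under these semantics (0-based, inclusive, offsets into the UTF-8 encoding of the final cueLine.value, no normalization), a server-side computation can be sketched as follows. The function name and the character-span inputs are illustrative, not part of the spec.

```python
def byte_span(value: str, char_start: int, char_end_exclusive: int) -> tuple[int, int]:
    """Compute 0-based inclusive UTF-8 byte offsets (the proposed
    byteStart/byteEnd) for value[char_start:char_end_exclusive].
    Illustrative sketch, not normative spec text."""
    # Bytes occupied by everything before the token.
    byte_start = len(value[:char_start].encode("utf-8"))
    # Bytes occupied by the token itself; inclusive end is last byte.
    token_bytes = value[char_start:char_end_exclusive].encode("utf-8")
    return byte_start, byte_start + len(token_bytes) - 1
```

With ASCII text this matches the agreed example: the first "Hello" in "Hello yay" maps to 0-4. For multi-byte text such as "안녕 Hello", the leading "안녕" (two 3-byte Hangul syllables) maps to 0-5, which is where byte offsets and character offsets diverge.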

@Tolriq
Member

Tolriq commented Apr 2, 2026

Works for me.

@ranokay
Contributor Author

ranokay commented Apr 2, 2026

Opened #228

