feat: songLyrics extension version 2 — word/syllable-level timing #218
Tolriq merged 12 commits into opensubsonic:main
Conversation
Pull request overview
This PR introduces songLyrics extension v2 by adding word/syllable-level karaoke timing (cue/cueLine) and lyric-layer classification (kind) while keeping v1 behavior as the default unless enhanced=true is provided to getLyricsBySongId.
Changes:
- Add new OpenAPI response schemas `Cue` and `CueLine`, and extend `StructuredLyrics` with optional `kind` and `cueLine`.
- Extend `getLyricsBySongId` (GET + POST) with an `enhanced` parameter to opt into v2 responses.
- Add/expand documentation pages and examples covering the new response types and v2 behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| openapi/schemas/StructuredLyrics.json | Adds optional kind and cueLine fields to structured lyrics schema. |
| openapi/schemas/Cue.json | New schema for word/syllable-level timing cues. |
| openapi/schemas/CueLine.json | New schema for line-level cue groupings with optional role metadata. |
| openapi/openapi.json | Registers Cue and CueLine under components/schemas. |
| openapi/endpoints/getLyricsBySongId.json | Adds enhanced query param (GET) and form field (POST). |
| content/en/docs/Responses/structuredLyrics.md | Documents v2 structuredLyrics fields and examples. |
| content/en/docs/Responses/cue.md | New docs page for cue. |
| content/en/docs/Responses/cueLine.md | New docs page for cueLine. |
| content/en/docs/Extensions/songLyrics.md | Adds Version 2 section describing the new capabilities and gating. |
| content/en/docs/Endpoints/getLyricsBySongId.md | Documents the new enhanced parameter and v2 response examples/notes. |
Add enhanced lyrics support with word/syllable-level karaoke timing (cueLine/cue), lyric-layer classification (kind: main/translation/pronunciation), and role-based vocal attribution (bg, voiceN, group). All new fields are gated behind the `enhanced=true` query parameter for full backward compatibility with version 1 clients.

New response types:
- cue: a single word or syllable with start/end timing
- cueLine: a line of cues with optional role, value, and timing

Modified response types:
- structuredLyrics: added kind and cueLine fields

Modified endpoints:
- getLyricsBySongId: added enhanced parameter (GET + POST)

Closes opensubsonic#213
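As a sketch of how a client might consume the gated fields, the helper below prefers the v2 `cueLine` data when the server returned it and falls back to the v1 `line` array otherwise. This is a hypothetical client-side helper, not part of the spec; field names follow the schemas described in this PR.

```python
def pick_display_lines(structured_lyrics: dict) -> list[tuple[int, str]]:
    """Return (start_ms, text) pairs for display.

    Prefers the v2 `cueLine` array (only sent when the client asked for
    enhanced=true); otherwise falls back to the v1 `line` array.
    """
    if structured_lyrics.get("cueLine"):
        return [(cl["start"], cl["value"]) for cl in structured_lyrics["cueLine"]]
    return [(ln["start"], ln["value"]) for ln in structured_lyrics.get("line", [])]


# v1-style response (enhanced omitted) vs v2-style response (enhanced=true)
v1 = {"line": [{"start": 0, "value": "Hello world"}]}
v2 = {
    "line": [{"start": 0, "value": "Hello world"}],
    "cueLine": [{"index": 0, "start": 0, "end": 900, "value": "Hello world",
                 "cue": [{"start": 0, "end": 400, "value": "Hello"},
                         {"start": 400, "end": 900, "value": "world"}]}],
}
print(pick_display_lines(v1))  # [(0, 'Hello world')]
print(pick_display_lines(v2))  # [(0, 'Hello world')] (taken from cueLine this time)
```

Because v2 responses keep the `line` array unchanged, a v1-only client can ignore the new fields entirely.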
Formatting fix so these schema files match the repo's prevailing JSON indentation style. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…playRole

- Split `voiceN` string pattern into `role` enum (`bg`, `voice`, `group`) plus separate `voiceIndex` integer for individual voice parts
- Add optional `displayRole` string for human-readable vocal layer labels
- Promote derived end-time overlap avoidance from SHOULD to MUST
- Update cueLine, getLyricsBySongId, and songLyrics extension docs
…rvers must normalize source overlaps so cue[n].end <= cue[n+1].start within each cueLine. Cross-cueLine overlaps (different role/voiceIndex) remain expected for parallel vocal layers.
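The normalization described in that commit can be implemented as a single clamp pass per cueLine. A minimal sketch, assuming one cueLine's cue array sorted by start with millisecond timing fields as in the cue schema (the helper itself is hypothetical):

```python
def normalize_cues(cues: list[dict]) -> list[dict]:
    """Clamp derived end times so cue[n].end <= cue[n+1].start.

    Source data (e.g. sloppy TTML) may contain overlapping spans; per the
    spec, servers must resolve such overlaps within a cueLine before
    emitting the response. Overlaps *across* cueLines are left alone,
    since those represent parallel vocal layers.
    """
    out = []
    for i, cue in enumerate(cues):
        cue = dict(cue)  # don't mutate the caller's data
        if i + 1 < len(cues) and cue.get("end", 0) > cues[i + 1]["start"]:
            cue["end"] = cues[i + 1]["start"]
        out.append(cue)
    return out


print(normalize_cues([{"start": 0, "end": 600, "value": "Hel"},
                      {"start": 500, "end": 1000, "value": "lo"}]))
# first cue's end is clamped from 600 down to 500
```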
…ine.md: add example with role/voiceIndex/displayRole fields
- structuredLyrics.md: add pronunciation kind example with cueLine data
I've mostly already validated it, as I helped shape it. Waiting for at least one @opensubsonic/servers member to confirm they don't see this as problematic to implement, as it's the first evolution of a version of an extension.
I found a small problem with the spec. It's very specific and wouldn't be found in any decent-quality lyrics, but:

```xml
<span begin="00:00:00.000" end="00:00:00.500">Hello</span>
<span begin="00:00:00.500" end="00:00:01.000">yay</span>
hello
<span begin="00:00:02.500" end="00:00:03.000">hello</span>
<span begin="00:00:03.000" end="00:00:03.500">hi</span>
<span begin="00:00:03.500" end="00:00:04.500">awesome</span>
```

When one word without synced data is followed by an identical one with synced data, it becomes impossible to tell which of them is synced once this gets turned into the JSON the spec uses:

```json
"cueLine": [
  {
    "index": 0,
    "start": 0,
    "end": 5000,
    "value": "Hello yay hello hello hi awesome",
    "cue": [
      { "start": 0, "end": 500, "value": "Hello" },
      { "start": 500, "end": 1000, "value": "yay" },
      { "start": 2500, "end": 3000, "value": "hello" },
      { "start": 3000, "end": 3500, "value": "hi" },
      { "start": 3500, "end": 4500, "value": "awesome" }
    ]
  }
],
```

It's minor, but it's just something I've noticed.
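The ambiguity is easy to reproduce: each cue carries only a value and timing, so a client that maps cues back to positions in the line text by substring search cannot tell the two identical tokens apart. A small illustration (hypothetical client logic):

```python
line = "Hello yay hello hello hi awesome"

# The cue only carries value + timing, so a naive client searches the line:
cue = {"start": 2500, "end": 3000, "value": "hello"}
first = line.find(cue["value"])               # lands on the UNSYNCED occurrence
second = line.find(cue["value"], first + 1)   # the synced one is actually here
print(first, second)  # 10 16
```

Nothing in the response says which of the two offsets the timed cue belongs to.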
If I can give my two cents: maybe it should be changed to use a start character and end character or something; it's simpler to apply and a lot more accurate. Probably should've said it before it got merged, though.
@ranokay it seems your PR is still not merged in Navidrome, so we can still amend to start/end char positions. WDYT?
Yeah, fair point: that edge case is genuinely ambiguous with the current model. I hadn't considered the repeated untimed/timed identical-token case. Two solutions come to mind.

Solution 1: add character positions to each cue:

```json
{ "value": "hello", "start": 2500, "end": 3000, "charStart": 16, "charEnd": 21 }
```

Solution 2: emit every token in order, timed or not:

```json
"tokens": [
  { "value": "Hello", "start": 0, "end": 500 },
  { "value": "yay", "start": 500, "end": 1000 },
  { "value": "hello" },
  { "value": "hello", "start": 2500, "end": 3000 }
]
```

My feeling is char positions are the smaller amendment, while ordered tokens are the cleaner long-term model. What do you think?
We said earlier in the spec that when there's an end, all cues (not tokens :p) must have one, so solution 2 would contradict that and make things more complicated client-side. Internally I do use solution 1, and I guess most clients will need to do the char mapping anyway, so doing it directly server-side is better for clients IMO.
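Doing the char mapping server-side might look like the sketch below: the line `value` is assembled from the source tokens, and 0-based inclusive character offsets are recorded for each timed token. The `charStart`/`charEnd` names come from the example earlier in the thread and are not finalized spec fields; the space-joining is an assumption about the source data.

```python
def assemble_line(tokens: list[dict]) -> tuple[str, list[dict]]:
    """Join source tokens into a line value, attaching 0-based inclusive
    character offsets to every token that carries timing data."""
    parts, cues, pos = [], [], 0
    for tok in tokens:
        if parts:
            pos += 1  # account for the joining space
        start, end = pos, pos + len(tok["value"]) - 1
        if "start" in tok:  # only timed tokens become cues
            cues.append({**tok, "charStart": start, "charEnd": end})
        parts.append(tok["value"])
        pos = end + 1
    return " ".join(parts), cues


value, cues = assemble_line([
    {"value": "Hello", "start": 0, "end": 500},
    {"value": "yay", "start": 500, "end": 1000},
    {"value": "hello"},                              # untimed
    {"value": "hello", "start": 2500, "end": 3000},  # timed, now unambiguous
])
print(value)     # Hello yay hello hello
print(cues[-1])  # {'value': 'hello', 'start': 2500, 'end': 3000, 'charStart': 16, 'charEnd': 20}
```

With offsets attached, the repeated-token case from the earlier comment resolves cleanly: the timed "hello" points at characters 16-20.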
Alright, I think the main thing to pin down is the exact semantics: what the offsets are relative to (the line's `value`), whether they're 0- or 1-based, whether the end is inclusive, and how Unicode is counted. If you're aligned on that direction, I can put together a small follow-up PR for the schema/docs.
0-based and inclusive: the first "Hello" is 0-4. For Unicode, that's a great question :) Considering the data used to generate those endpoints can't have a cue split in the middle of a Unicode character, using byte offsets is probably the simplest way to avoid issues for some clients?
My only concern is naming: if we go with UTF-8 byte offsets, I think the field names should reflect that. Then we can define them precisely as 0-based inclusive offsets into the UTF-8 encoding of the final `value` string.
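A quick sanity check of those semantics, i.e. 0-based inclusive offsets into the UTF-8 encoding of the line `value` (illustrative only; the actual field names are left to the follow-up PR):

```python
line = "こんにちは world"   # five 3-byte characters, a space, then "world"
value = "world"

byte_start = len("こんにちは ".encode("utf-8"))          # 16 (15 bytes + 1 for the space)
byte_end = byte_start + len(value.encode("utf-8")) - 1   # 20, inclusive

# Slicing the UTF-8 bytes with an inclusive end recovers the cue text exactly:
extracted = line.encode("utf-8")[byte_start:byte_end + 1].decode("utf-8")
print(byte_start, byte_end, extracted)  # 16 20 world
```

For pure-ASCII lines, byte offsets and character offsets coincide, so simple clients lose nothing.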
Works for me.
Opened #228 |
Summary
This PR adds version 2 of the `songLyrics` extension to the OpenSubsonic spec and OpenAPI docs. Version 2 introduces optional word/syllable-level karaoke timing behind `enhanced=true`, while keeping the default response fully backward compatible with version 1.

It also documents richer lyric modeling for sources such as TTML:
- `kind` tracks (`main`, `translation`, `pronunciation`)
- `cueLine` + `cue`
- `structuredLyrics.agents` and `cueLine.agentId` references instead of repeating singer/layer metadata on every cueLine

This PR follows up on discussion #213 and incorporates the review feedback on cue ordering, overlap rules, and multi-agent attribution.
What's new
New response types
- `cue`: a single timed word or syllable with `start`, optional `end`, and `value`
- `cueLine`: a timed line-level grouping of `cue` items
- `agent`: reusable per-track attribution metadata for cueLines

New/extended fields
- `structuredLyrics.kind`: classifies each lyric track as `main`, `translation`, or `pronunciation`
- `structuredLyrics.cueLine`: word/syllable-level timing data parallel to `line`
- `structuredLyrics.agents`: optional per-track attribution metadata for cue-attributed lyrics
- `cueLine.agentId`: references an `agent` in the same `structuredLyrics` entry

Endpoint changes
- `getLyricsBySongId` gets an `enhanced` boolean parameter on both GET and POST
- With `enhanced=true`, the response may include `kind`, `cueLine`, `agents`, and `translation` and `pronunciation` tracks
- With `enhanced=false` or omitted, the response remains version 1-compatible

Contract clarifications added during review
- `cueLine` is only meaningful for `synced=true`; unsynced lyrics must not include it
- Within a `cueLine`, `cue.end` is all-or-none
- Cues within a `cueLine` must not overlap; overlapping `cueLine`s are still valid, since those represent parallel vocal layers
- `agents` are scoped to a single `structuredLyrics` entry; `agents[].id` must be unique within that entry
- When `agents` are used for cue-attributed lyrics, there must be exactly one `role: "main"` agent
- When multiple `cueLine`s share the same `index`, the one whose agent has `role: "main"` must come first
- When `agents` is present, every `cueLine` in that entry must include `agentId`, and each `agentId` must resolve to one local `agents[].id` in the same `structuredLyrics` entry
- `structuredLyrics` entries are independent across all `kind` values, including `main`; clients should not assume matching `line` arrays or `cueLine` arrays between `main`, `translation`, and `pronunciation`

Examples added/updated
- `cueLine` coverage
- `agents` + `agentId`
- `agents` instead of repeated per-line role/name fields

Backward compatibility
- `enhanced` defaults to `false`
- Without `enhanced=true`, no `cueLine` arrays are returned, no `agents` arrays are returned, additional `kind` tracks are not returned, and the `line` array remains unchanged

Servers supporting this version should advertise:
- `songLyrics` versions `[1, 2]`

Files changed
New
- content/en/docs/Responses/agent.md
- content/en/docs/Responses/cue.md
- content/en/docs/Responses/cueLine.md
- openapi/schemas/Agent.json
- openapi/schemas/Cue.json
- openapi/schemas/CueLine.json

Updated
- content/en/docs/Endpoints/getLyricsBySongId.md
- content/en/docs/Extensions/songLyrics.md
- content/en/docs/Responses/structuredLyrics.md
- openapi/endpoints/getLyricsBySongId.json
- openapi/openapi.json
- openapi/schemas/StructuredLyrics.json

Closes #213
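The per-entry agent rules listed under "Contract clarifications added during review" are mechanically checkable. A hypothetical server-side validator sketch (field names follow the PR's schemas; the `role` values are illustrative):

```python
def validate_agents(entry: dict) -> list[str]:
    """Check one structuredLyrics entry against the v2 agent contract:
    unique agents[].id, exactly one role "main" agent, and every
    cueLine.agentId resolving to a local agent."""
    errors = []
    agents = entry.get("agents", [])
    ids = [a["id"] for a in agents]
    if len(ids) != len(set(ids)):
        errors.append("agents[].id values must be unique within the entry")
    if agents and sum(1 for a in agents if a.get("role") == "main") != 1:
        errors.append('exactly one agent must have role "main"')
    if agents:  # when agents is present, every cueLine must carry a resolvable agentId
        for cl in entry.get("cueLine", []):
            if cl.get("agentId") not in set(ids):
                errors.append(f"cueLine index {cl.get('index')}: unresolved agentId")
    return errors


ok = {
    "agents": [{"id": "v1", "role": "main"}, {"id": "v2", "role": "bg"}],
    "cueLine": [{"index": 0, "agentId": "v1"}, {"index": 0, "agentId": "v2"}],
}
bad = {"agents": [{"id": "v1", "role": "bg"}],
       "cueLine": [{"index": 0, "agentId": "ghost"}]}
print(validate_agents(ok))   # []
print(validate_agents(bad))  # two error messages
```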