Hi, thank you for your great work on TCSinger.
I am currently preparing a new dataset for a style transfer task using TCSinger, and I would like to clarify a few details regarding the required input format for metadata.json. In your README, you mention the following fields:
“Put metadata.json (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, ep_types for each singing voice) and phone_set.json (all phonemes of your dictionary) in data/processed/tc...”
I have the following questions regarding these fields:
- What do ep_pitches, ep_notedurs, and ep_types represent?
I noticed that in GTSinger’s metadata.json, these arrays seem to match the length of ph, but ep_notedurs does not align with ph_durs. Could you please explain the meaning of the prefix ep_ and how these features relate to phonemes (ph)? Also, how can I extract or obtain these features from my own dataset?
- For style transfer only (e.g., zero-shot singing voice synthesis based on a reference audio), do I still need to provide other annotations like tech and emotion? Or are they only required for tasks like style control?
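For reference, the length observation in my first question can be checked with a small script along these lines. This is only a minimal sketch under my own assumptions: that metadata.json is a JSON list of per-item dicts, and that each field is stored as a list (or a whitespace-separated string). The helper names (`field_lengths`, `misaligned_fields`, `differing_duration_indices`, `check_file`) are hypothetical and not part of TCSinger, and the toy entry uses made-up values, not real GTSinger data.

```python
import json

# Per-phoneme fields named in the README excerpt quoted above.
FIELDS = ["ph", "ph_durs", "ep_pitches", "ep_notedurs", "ep_types"]


def as_list(value):
    """Accept a list or a whitespace-separated string (the layout is an assumption)."""
    if isinstance(value, str):
        return value.split()
    return list(value)


def field_lengths(item):
    """Return {field: length} for the fields present in one metadata entry."""
    return {f: len(as_list(item[f])) for f in FIELDS if f in item}


def misaligned_fields(item):
    """Return the fields whose length differs from len(ph)."""
    lengths = field_lengths(item)
    n_ph = lengths.get("ph")
    return [f for f, n in lengths.items() if n != n_ph]


def differing_duration_indices(item, tol=1e-3):
    """Return phoneme indices where ph_durs and ep_notedurs differ by more than tol."""
    ph_durs = [float(x) for x in as_list(item["ph_durs"])]
    note_durs = [float(x) for x in as_list(item["ep_notedurs"])]
    return [i for i, (a, b) in enumerate(zip(ph_durs, note_durs)) if abs(a - b) > tol]


def check_file(path):
    """Print every entry in a metadata file whose per-phoneme arrays are misaligned."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)  # assumed: a list of dicts
    for item in items:
        bad = misaligned_fields(item)
        if bad:
            print(item.get("item_name"), "length mismatch:", bad)


# Toy entry with made-up values, for illustration only.
example = {
    "item_name": "toy#0001",
    "ph": ["<SP>", "n", "i"],
    "ph_durs": [0.2, 0.1, 0.3],
    "ep_pitches": [0, 60, 60],
    "ep_notedurs": [0.2, 0.4, 0.4],
    "ep_types": [1, 2, 2],
}

print(field_lengths(example))
print(misaligned_fields(example))
print(differing_duration_indices(example))
```

In this toy entry all five arrays have the same length as ph, while ph_durs and ep_notedurs hold different values at some positions, which is the pattern I am asking about.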
Thanks in advance for your help!