
file loss and usage information #1

Open
kingfener opened this issue Jun 18, 2024 · 1 comment

Comments

@kingfener

First of all, thanks for contributing this code to the open-source community. I ran into the following problems while using it:

1. The file requirements.txt is missing.
2. The file setup.py is missing, which is needed for: python setup.py build_ext --inplace
3. Before running alignment, how do I obtain text_mask and mel_embeddings for aligner(args)?
4. What happens if a mismatched wav and text are used for alignment?

Thanks.

@xiaozhah
Owner

Thank you for your interest in the project and for bringing these issues to our attention. I apologize for any inconvenience caused. Let me address each of your points:

  1. Missing requirements.txt:
    You're right, and I apologize for this oversight. I'll create and add a requirements.txt file to the repository. For now, the main dependencies are:

    Cython==3.0.10
    numpy==1.23.5
    torch==2.1.0
    

    Please install these using pip install -r requirements.txt once the file is added.

  2. Missing setup.py:
    The file is not missing; it is located at monotonic_align/setup.py. Run python setup.py build_ext --inplace from within the monotonic_align directory.

  3. Obtaining text_mask and mel_embeddings:

    • text_mask is a boolean tensor indicating which elements in the text sequence are valid (not padding). You can create it based on your input text length.
    • mel_embeddings are typically extracted from your mel spectrogram using a pre-processing step or a neural network. The exact method depends on your TTS pipeline.

    Here's a simple example:

    import torch
    
    # Assuming batch_size = 1, text_len = 10, mel_len = 100,
    # text embedding dim = 256, and 80 mel channels
    text_embeddings = torch.randn(1, 10, 256)  # Replace with your actual text embeddings
    mel_embeddings = torch.randn(1, 100, 80)   # Replace with your actual mel embeddings
    
    text_mask = torch.ones(1, 10).bool()       # True for valid (non-padded) text positions
    mel_mask = torch.ones(1, 100).bool()       # True for valid (non-padded) mel frames
    
    alignment = aligner(text_embeddings, mel_embeddings, text_mask, mel_mask)
  4. Mismatched wav and text:
    Using mismatched wav and text for alignment is not recommended as it will produce incorrect alignments. The aligner assumes that the input text and audio correspond to each other. If they don't match:

    • The alignment process might still complete, but the results will be meaningless.
    • You might encounter errors if the lengths are significantly different.
    • The quality of any TTS system using these alignments will be severely compromised.

    Always ensure that your wav files and text inputs correspond correctly to each other.
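As a cheap guard against such mismatches, you can sanity-check each utterance before alignment by comparing the mel frame count to the token count. This is a hypothetical helper, not part of the repository, and the frames-per-token bounds are rough assumptions you should tune for your language, sample rate, and hop size:

```python
def plausible_pair(num_tokens, num_frames,
                   min_frames_per_token=1.0, max_frames_per_token=40.0):
    """Return True if the mel frame count looks plausible for the token count.

    Only catches gross mismatches (e.g. a long transcript paired with a
    short clip); it cannot detect a subtly wrong transcript.
    """
    if num_tokens <= 0 or num_frames <= 0:
        return False
    ratio = num_frames / num_tokens
    return min_frames_per_token <= ratio <= max_frames_per_token

plausible_pair(10, 100)  # ~10 frames per token: a typical-looking pair
plausible_pair(10, 2)    # audio far too short for the text
```

Pairs that fail this check are worth inspecting manually before you trust any alignment produced from them.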

I hope this helps clarify things. I'll update the repository with the requirements.txt file and improve the documentation to make the usage clearer. If you have any more questions, please don't hesitate to ask!
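One more note on point 3: the all-ones masks in the example above assume no padding. For a real batch of variable-length utterances, the masks should be derived from the per-utterance lengths. A minimal sketch (the sequence_mask helper here is illustrative, not part of the repository):

```python
import torch

def sequence_mask(lengths, max_len=None):
    """Boolean mask of shape (batch, max_len): True for valid positions,
    False for padded positions beyond each sequence's length."""
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)
    # Broadcast compare: (1, max_len) < (batch, 1) -> (batch, max_len)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)

# Batch of two utterances with 7 and 10 valid text tokens, padded to length 10
text_lengths = torch.tensor([7, 10])
text_mask = sequence_mask(text_lengths)  # shape (2, 10), dtype torch.bool
```

Build mel_mask the same way from the per-utterance mel frame counts.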
