Hi @RuiShu, thanks for sharing your code and thoughts on this problem. I've been experimenting with your proposed models, reimplemented in PyTorch, on MNIST. I noticed that if I change the Q(z|x, y) implementation to use a separate model for each mixture component, the model assigns all training data to a single component (i.e., Q(y|x) becomes degenerate). Do you know why this happens? Thank you so much!
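
One way to quantify the collapse, as a minimal sketch: average Q(y|x) over a batch and take the entropy of that average. The helper names `component_usage` and `usage_entropy` are hypothetical (they are not part of the original code), and the sketch assumes Q(y|x) is available as rows of probabilities, e.g. a softmax output converted with `.tolist()`.

```python
import math

def component_usage(qy_batch):
    """Average Q(y|x) over a batch: one mean probability per component.

    qy_batch is assumed to be a list of rows, each row a probability
    vector over the K mixture components (rows sum to 1).
    """
    n = len(qy_batch)
    k = len(qy_batch[0])
    return [sum(row[j] for row in qy_batch) / n for j in range(k)]

def usage_entropy(usage):
    """Entropy (in nats) of the averaged assignment.

    0 means every sample lands on one component; log(K) means the
    components are used equally often on average.
    """
    return -sum(p * math.log(p) for p in usage if p > 0)

# Toy inputs (made up for illustration, not real model outputs).
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(usage_entropy(component_usage(collapsed)))
print(usage_entropy(component_usage(balanced)))
```

Tracking this value during training would show whether the collapse is present from the first steps or develops gradually, which may help narrow down the cause.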