Hello, FreestyleNet is great work. However, I am confused about the mask. In CrossAttention, when the multi-channel spatial mask is constructed from the given labels and class_ids, what does this array represent?