Description
The input dimension is incorrect when HybridTupleEmbedding is used with the CTT model.
I think this is because HybridTupleEmbedding uses autoencoder_embedding_model for tuple embedding, and at line 171 of tuple_embedding_models.py the embedding_matrix has the hidden dimension (by default, 150). But the trainer defined at line 311 still sets the CTT model input to input_dimension (by default, 300).
The data is downloaded from https://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/Textual/Abt-Buy/exp_data/. Use the code below to reproduce the error:
import pandas as pd
from deep_blocker import DeepBlocker
from tuple_embedding_models import AutoEncoderTupleEmbedding, CTTTupleEmbedding, HybridTupleEmbedding
from vector_pairing_models import ExactTopKVectorPairing
import blocking_utils
cols_to_block=['name','description','price']
left_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableA.csv")
right_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableB.csv")
tuple_embedding_model = HybridTupleEmbedding()
topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)
candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
golden_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/test.csv")
golden_df = golden_df[golden_df['label']==1]
print(blocking_utils.compute_blocking_statistics(candidate_set_df, golden_df, left_df, right_df))
print(candidate_set_df.shape)
Error:
RuntimeError Traceback (most recent call last)
Cell In [2], line 15
12 topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
13 db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)
---> 15 candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
16 golden_df = pd.read_csv("/mnt/efs-write/share/public_data/anhai/Textual/Abt-Buy/test.csv")
17 golden_df = golden_df[golden_df['label']==1]
File ~/blocking/DeepBlocker/deep_blocker.py:58, in DeepBlocker.block_datasets(self, left_df, right_df, cols_to_block)
56 print("Performing pre-processing for tuple embeddings ")
57 all_merged_text = pd.concat([self.left_df["_merged_text"], self.right_df["_merged_text"]], ignore_index=True)
---> 58 self.tuple_embedding_model.preprocess(all_merged_text)
60 print("Obtaining tuple embeddings for left table")
61 self.left_tuple_embeddings = self.tuple_embedding_model.get_tuple_embedding(self.left_df["_merged_text"])
File ~/blocking/DeepBlocker/tuple_embedding_models.py:314, in HybridTupleEmbedding.preprocess(self, list_of_tuples)
312 trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
313 #trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)
--> 314 self.ctt_model = trainer.train(self.left_embedding_matrix, self.right_embedding_matrix, self.label_list,
315 num_epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)
File ~/blocking/DeepBlocker/dl_models.py:168, in CTTModelTrainer.train(self, left_embedding_matrix, right_embedding_matrix, labels, num_epochs, batch_size)
166 label = label.to(self.device)
167 optimizer.zero_grad()
--> 168 output = self.model(left, right)
169 loss = loss_function(output, label)
170 loss.backward()
RuntimeError: mat1 and mat2 shapes cannot be multiplied (256x150 and 300x300)
To fix it, I changed line 311 in tuple_embedding_models.py from
trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
to
trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)
With this change the code runs, but it might not produce the optimal network structure. Let me know if I got anything wrong.
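To make the mismatch and the effect of the change concrete, here is a minimal sketch that only tracks layer widths. It assumes the CTT model is a plain stack of linear layers built from (input_dimension, hidden_dimensions), and that hidden_dimensions defaults to [300, 150]; both are my guesses, inferred from the 300x300 weight in the error and the 150-wide embeddings, not confirmed from dl_models.py. The helpers `layer_shapes` and `check_input` are hypothetical and not part of DeepBlocker.

```python
# Hypothetical helpers, not part of DeepBlocker: they only track layer widths.
def layer_shapes(input_dimension, hidden_dimensions):
    """Return (in_features, out_features) for each layer of a stacked linear network."""
    widths = [input_dimension, *hidden_dimensions]
    return list(zip(widths[:-1], widths[1:]))


def check_input(embedding_width, shapes):
    """Raise ValueError if the embedding width differs from the first layer's input width."""
    expected = shapes[0][0]
    if embedding_width != expected:
        raise ValueError(
            f"embedding width {embedding_width} does not match first layer input {expected}"
        )
    return True


# Assumed defaults, inferred from the issue text and the error message.
input_dimension = 300
hidden_dimensions = [300, 150]
embedding_width = hidden_dimensions[-1]  # width of the autoencoder embeddings

# Layer widths as currently constructed, and after the proposed change.
original = layer_shapes(input_dimension, hidden_dimensions)
patched = layer_shapes(hidden_dimensions[-1], hidden_dimensions)

print("original:", original)
print("patched: ", patched)
```

Under these assumptions the original first layer expects 300 features while the embeddings have 150, which matches the reported error. After the change the first layer accepts 150 features, but the network then expands to 300 before shrinking back to 150, which may be what makes the structure less than optimal.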