incorrect dimension with HybridTupleEmbedding #3

@yefanTao

HybridTupleEmbedding fails with a dimension mismatch when it trains the CTT model.
I think this happens because HybridTupleEmbedding uses autoencoder_embedding_model for the tuple embeddings, so at line 171 of tuple_embedding_models.py the embedding matrix has width hidden_dimensions (150 by default). However, the trainer defined at line 311 still sets the CTT model's input size to input_dimension (300 by default).
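
For illustration, here is a minimal PyTorch sketch of the mismatch, independent of DeepBlocker. The single `nn.Linear` layer is only a stand-in for the first layer of the CTT model (the real architecture may differ), and the constant names are made up for this example. The sizes are the ones from the error message further down.

```python
import torch
from torch import nn

INPUT_DIMENSION = 300      # width the CTT trainer is configured with
AE_OUTPUT_DIMENSION = 150  # width of the autoencoder tuple embeddings
BATCH_SIZE = 256

# Hypothetical stand-in for the first layer of the CTT model.
first_layer = nn.Linear(INPUT_DIMENSION, 300)

# Hypothetical stand-in for one batch of autoencoder embeddings.
batch = torch.randn(BATCH_SIZE, AE_OUTPUT_DIMENSION)

try:
    first_layer(batch)
    error = None
except RuntimeError as exc:
    # The layer expects 300 input features but receives 150.
    error = exc

print(type(error).__name__, error)
```

Changing `AE_OUTPUT_DIMENSION` to 300, or `INPUT_DIMENSION` to 150, makes the forward pass succeed, which suggests the two widths simply need to agree.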

The data was downloaded from https://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/Textual/Abt-Buy/exp_data/. Use the code below to reproduce the error:

import pandas as pd
from deep_blocker import DeepBlocker
from tuple_embedding_models import AutoEncoderTupleEmbedding, CTTTupleEmbedding, HybridTupleEmbedding
from vector_pairing_models import ExactTopKVectorPairing
import blocking_utils
cols_to_block=['name','description','price']

left_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableA.csv")
right_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableB.csv")

tuple_embedding_model = HybridTupleEmbedding()
topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)

candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
golden_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/test.csv")
golden_df = golden_df[golden_df['label']==1]
print(blocking_utils.compute_blocking_statistics(candidate_set_df, golden_df, left_df, right_df))
print(candidate_set_df.shape)

Error:

RuntimeError                              Traceback (most recent call last)
Cell In [2], line 15
     12 topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
     13 db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)
---> 15 candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
     16 golden_df = pd.read_csv("/mnt/efs-write/share/public_data/anhai/Textual/Abt-Buy/test.csv")
     17 golden_df = golden_df[golden_df['label']==1]

File ~/blocking/DeepBlocker/deep_blocker.py:58, in DeepBlocker.block_datasets(self, left_df, right_df, cols_to_block)
     56 print("Performing pre-processing for tuple embeddings ")
     57 all_merged_text = pd.concat([self.left_df["_merged_text"], self.right_df["_merged_text"]], ignore_index=True)
---> 58 self.tuple_embedding_model.preprocess(all_merged_text)
     60 print("Obtaining tuple embeddings for left table")
     61 self.left_tuple_embeddings = self.tuple_embedding_model.get_tuple_embedding(self.left_df["_merged_text"])

File ~/blocking/DeepBlocker/tuple_embedding_models.py:314, in HybridTupleEmbedding.preprocess(self, list_of_tuples)
    312 trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
    313 #trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)
--> 314 self.ctt_model = trainer.train(self.left_embedding_matrix, self.right_embedding_matrix, self.label_list,
    315         num_epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

File ~/blocking/DeepBlocker/dl_models.py:168, in CTTModelTrainer.train(self, left_embedding_matrix, right_embedding_matrix, labels, num_epochs, batch_size)
    166 label = label.to(self.device)
    167 optimizer.zero_grad()
--> 168 output = self.model(left, right)
    169 loss = loss_function(output, label)
    170 loss.backward()
RuntimeError: mat1 and mat2 shapes cannot be multiplied (256x150 and 300x300)

To fix it, I changed line 311 in tuple_embedding_models.py from
trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
to
trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)

This works, but it might not produce the optimal network structure. Let me know if I got anything wrong.
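
To make the concern about the network structure concrete, here is a small pure-Python sketch of the layer widths before and after the change. It assumes the CTT model stacks one linear layer per entry of hidden_dimensions and that hidden_dimensions is (300, 150). Both are guesses based on the 300x300 weight in the traceback, not something I confirmed in dl_models.py. The helper names `layer_shapes` and `accepts` are hypothetical.

```python
def layer_shapes(input_dimension, hidden_dimensions):
    """Return the (in_features, out_features) pair of each stacked layer."""
    widths = [input_dimension, *hidden_dimensions]
    return list(zip(widths[:-1], widths[1:]))


def accepts(embedding_width, input_dimension, hidden_dimensions):
    """Check whether the first layer can consume embeddings of this width."""
    first_in, _ = layer_shapes(input_dimension, hidden_dimensions)[0]
    return first_in == embedding_width


hidden_dimensions = (300, 150)  # assumed default, see the note above

# Before the change: the first layer expects 300 features, the embeddings have 150.
print(layer_shapes(300, hidden_dimensions))   # [(300, 300), (300, 150)]
print(accepts(150, 300, hidden_dimensions))   # False

# After the change: the input width is taken from hidden_dimensions[-1].
fixed_input = hidden_dimensions[-1]
print(layer_shapes(fixed_input, hidden_dimensions))  # [(150, 300), (300, 150)]
print(accepts(150, fixed_input, hidden_dimensions))  # True
```

Under these assumptions the fixed network first expands 150 to 300 and then shrinks back to 150, which is why I suspect the structure is not ideal. A separate set of hidden dimensions for the CTT part might be cleaner.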
