-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add the unsupervised bipartite GraphSAGE model on the Taobao dataset (#6144)
This PR adds an implementation of unsupervised bipartite GraphSAGE on the Taobao User Behaviors dataset offered by Alibaba. The Taobao dataset contains a heterogeneous graph, where nodes represent users and items, and edges represent different types of behaviors between users and items. [](https://tianchi.aliyun.com/dataset/649) We use the i2i co-occurrence matrix to construct the i2i-graph. When applying GraphSAGE, the model follows the i-i-i pattern to encode the item embedding and the i-i-u pattern to encode the user embedding. As the task is unsupervised, we use negative sampling and `binary_cross_entropy_with_logits` to compute the loss in the model. Co-authored-by: huxleyhu <shuxian.hu98@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>
- Loading branch information
1 parent
9b2bbe5
commit 5d777e7
Showing
12 changed files
with
384 additions
and
15 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
[style] | ||
based_on_style=pep8 | ||
split_before_named_assigns=False | ||
blank_line_before_nested_class_or_def=False |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,256 @@ | ||
# An implementation of unsupervised bipartite GraphSAGE using the Alibaba | ||
# Taobao dataset. | ||
import os.path as osp | ||
|
||
import torch | ||
import torch.nn.functional as F | ||
import tqdm | ||
from sklearn.metrics import ( | ||
accuracy_score, | ||
f1_score, | ||
precision_score, | ||
recall_score, | ||
) | ||
from torch.nn import Embedding, Linear | ||
|
||
import torch_geometric.transforms as T | ||
from torch_geometric.datasets import Taobao | ||
from torch_geometric.loader import LinkNeighborLoader | ||
from torch_geometric.nn import SAGEConv | ||
from torch_geometric.utils.convert import to_scipy_sparse_matrix | ||
|
||
# Select a GPU when available; all modules and batches are moved here later.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
path = osp.join(osp.dirname(osp.realpath(__file__)), '../../data/Taobao')

dataset = Taobao(path)
data = dataset[0]

# Nodes carry no input features; store their index as `x` so that a learnable
# `Embedding` lookup can later serve as the feature table (see `Model`).
data['user'].x = torch.arange(0, data['user'].num_nodes)
data['item'].x = torch.arange(0, data['item'].num_nodes)

# Only consider user<>item relationships for simplicity:
del data['category']
del data['item', 'category']
del data['user', 'item'].time
del data['user', 'item'].behavior
|
||
# Add a reverse ('item', 'rev_to', 'user') relation for message passing:
data = T.ToUndirected()(data)

# Perform a link-level split into training, validation, and test edges:
print('Computing data splits...')
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    # One sampled negative edge per positive edge in the val/test splits:
    neg_sampling_ratio=1.0,
    # Training negatives are instead sampled on-the-fly by the loader:
    add_negative_train_samples=False,
    edge_types=[('user', 'to', 'item')],
    rev_edge_types=[('item', 'rev_to', 'user')],
)(data)
print('Done!')
|
||
# Compute sparsified item<>item relationships through users:
print('Computing item<>item relationships...')
mat = to_scipy_sparse_matrix(data['user', 'item'].edge_index).tocsr()
mat = mat[:data['user'].num_nodes, :data['item'].num_nodes]
comat = mat.T @ mat  # Co-occurrence: items that share interacting users.
comat.setdiag(0)  # Drop trivial self-co-occurrence.
comat = comat >= 3.  # Sparsify: keep pairs co-visited by at least 3 users.
comat = comat.tocoo()
row = torch.from_numpy(comat.row).to(torch.long)
col = torch.from_numpy(comat.col).to(torch.long)
item_to_item_edge_index = torch.stack([row, col], dim=0)

# Add the generated item<>item relationships for high-order information:
train_data['item', 'item'].edge_index = item_to_item_edge_index
val_data['item', 'item'].edge_index = item_to_item_edge_index
test_data['item', 'item'].edge_index = item_to_item_edge_index
print('Done!')
|
||
train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[8, 4],  # Sample 8 first-hop and 4 second-hop neighbors.
    edge_label_index=('user', 'to', 'item'),
    # Sample one negative (user, item) pair per positive edge on-the-fly:
    neg_sampling='binary',
    batch_size=2048,
    shuffle=True,
    num_workers=16,
    drop_last=True,
)

# Validation/test loaders reuse the fixed positive/negative edges and labels
# produced by `RandomLinkSplit` instead of sampling fresh negatives:
val_loader = LinkNeighborLoader(
    data=val_data,
    num_neighbors=[8, 4],
    edge_label_index=(
        ('user', 'to', 'item'),
        val_data[('user', 'to', 'item')].edge_label_index,
    ),
    edge_label=val_data[('user', 'to', 'item')].edge_label,
    batch_size=2048,
    shuffle=False,
    num_workers=16,
)

test_loader = LinkNeighborLoader(
    data=test_data,
    num_neighbors=[8, 4],
    edge_label_index=(
        ('user', 'to', 'item'),
        test_data[('user', 'to', 'item')].edge_label_index,
    ),
    edge_label=test_data[('user', 'to', 'item')].edge_label,
    batch_size=2048,
    shuffle=False,
    num_workers=16,
)
|
||
|
||
class ItemGNNEncoder(torch.nn.Module):
    """Two-layer GraphSAGE encoder over the item<>item graph (i-i-i)."""
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        # Lazy input size (-1): inferred on the first forward pass.
        self.conv1 = SAGEConv(-1, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = h.relu()
        h = self.conv2(h, edge_index)
        h = h.relu()
        return self.lin(h)
|
||
|
||
class UserGNNEncoder(torch.nn.Module):
    """GraphSAGE encoder producing user embeddings via the i-i-u pattern.

    Item embeddings are first refined over the item<>item graph, then
    aggregated into user embeddings through the reverse user<item relation.
    """
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        # (-1, -1): lazily infer (source, destination) input sizes.
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), hidden_channels)
        self.conv3 = SAGEConv((-1, -1), hidden_channels)
        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x_dict, edge_index_dict):
        # i-i: refine item features over the co-occurrence graph.
        item_x = self.conv1(
            x_dict['item'],
            edge_index_dict[('item', 'to', 'item')],
        ).relu()

        # i-u: aggregate raw item features into users.
        user_x = self.conv2(
            (x_dict['item'], x_dict['user']),
            edge_index_dict[('item', 'rev_to', 'user')],
        ).relu()

        # BUG FIX: the graph holds no ('item', 'to', 'user') relation; the
        # reverse edges created by `T.ToUndirected()` use the 'rev_to' key
        # (exactly as `conv2` above does), so the old lookup raised KeyError.
        user_x = self.conv3(
            (item_x, user_x),
            edge_index_dict[('item', 'rev_to', 'user')],
        ).relu()

        return self.lin(user_x)
|
||
|
||
class EdgeDecoder(torch.nn.Module):
    """MLP link predictor: scores (user, item) pairs from their embeddings."""
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, 1)

    def forward(self, z_src, z_dst, edge_label_index):
        src_idx, dst_idx = edge_label_index
        # Concatenate the embeddings of both endpoints of every edge.
        pair = torch.cat([z_src[src_idx], z_dst[dst_idx]], dim=-1)
        hidden = self.lin1(pair).relu()
        logit = self.lin2(hidden)
        return logit.view(-1)  # One raw (pre-sigmoid) logit per edge.
|
||
|
||
class Model(torch.nn.Module):
    """Full bipartite GraphSAGE model: embedding tables, both GNN encoders,
    and the edge decoder, scoring user->item links end-to-end."""
    def __init__(self, num_users, num_items, hidden_channels, out_channels):
        super().__init__()
        # Learnable feature tables, indexed by the node IDs stored in `x`.
        self.user_emb = Embedding(num_users, hidden_channels, device=device)
        self.item_emb = Embedding(num_items, hidden_channels, device=device)
        self.item_encoder = ItemGNNEncoder(hidden_channels, out_channels)
        self.user_encoder = UserGNNEncoder(hidden_channels, out_channels)
        self.decoder = EdgeDecoder(out_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        # Replace node IDs with their learnable embeddings. The assignment is
        # deliberately in-place so the user encoder sees embedded features.
        x_dict['user'] = self.user_emb(x_dict['user'])
        x_dict['item'] = self.item_emb(x_dict['item'])

        item_z = self.item_encoder(
            x_dict['item'],
            edge_index_dict[('item', 'to', 'item')],
        )
        user_z = self.user_encoder(x_dict, edge_index_dict)

        return self.decoder(user_z, item_z, edge_label_index)
|
||
|
||
# Lazily-initialized GNN layers resolve their input sizes on the first call.
model = Model(
    num_users=data['user'].num_nodes,
    num_items=data['item'].num_nodes,
    hidden_channels=64,
    out_channels=64,
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
|
||
|
||
def train():
    """Run one training epoch and return the mean BCE loss per example."""
    model.train()

    total_loss = total_examples = 0
    for batch in tqdm.tqdm(train_loader):
        batch = batch.to(device)
        optimizer.zero_grad()

        pred = model(
            batch.x_dict,
            batch.edge_index_dict,
            batch['user', 'item'].edge_label_index,
        )
        # Unsupervised objective: positives vs. sampled negatives.
        loss = F.binary_cross_entropy_with_logits(
            pred, batch['user', 'item'].edge_label)

        loss.backward()
        optimizer.step()
        # BUG FIX: `loss` is already a per-batch mean, so accumulating it
        # unweighted and dividing by `total_examples` returned the mean loss
        # divided by the batch size. Weight each batch by its example count
        # to report a true per-example average.
        total_loss += float(loss) * pred.numel()
        total_examples += pred.numel()

    return total_loss / total_examples
|
||
|
||
@torch.no_grad()
def test(loader):
    """Evaluate link prediction on `loader`.

    Returns a tuple of (accuracy, precision, recall, F1) computed over all
    positive/negative edges served by the loader.
    """
    model.eval()

    preds, targets = [], []
    for batch in tqdm.tqdm(loader):
        batch = batch.to(device)

        pred = model(
            batch.x_dict,
            batch.edge_index_dict,
            batch['user', 'item'].edge_label_index,
        ).sigmoid().view(-1).cpu()
        target = batch['user', 'item'].edge_label.long().cpu()

        preds.append(pred)
        # BUG FIX: the ground-truth labels were appended as `pred`, so every
        # metric compared the predictions against themselves.
        targets.append(target)

    pred = torch.cat(preds, dim=0).numpy()
    # BUG FIX: concatenate the accumulated list `targets`, not the last
    # batch's tensor `target` (which `torch.cat` cannot consume anyway).
    target = torch.cat(targets, dim=0).numpy()

    # BUG FIX: sklearn's classification metrics require discrete class
    # labels; threshold the sigmoid probabilities at 0.5 first.
    pred = pred > 0.5

    acc = accuracy_score(target, pred)
    prec = precision_score(target, pred)
    rec = recall_score(target, pred)
    f1 = f1_score(target, pred)

    return acc, prec, rec, f1
|
||
|
||
# Main training loop: 20 epochs, evaluating on both splits every epoch.
for epoch in range(1, 21):
    loss = train()
    val_acc, val_prec, val_rec, val_f1 = test(val_loader)
    test_acc, test_prec, test_rec, test_f1 = test(test_loader)

    # BUG FIX: ':4f' means "minimum field width 4", not "4 decimal places";
    # use ':.4f' for consistency with the metric formatting below.
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
    print(f'Val Acc: {val_acc:.4f}, Val Precision {val_prec:.4f}, '
          f'Val Recall {val_rec:.4f}, Val F1 {val_f1:.4f}')
    print(f'Test Acc: {test_acc:.4f}, Test Precision {test_prec:.4f}, '
          f'Test Recall {test_rec:.4f}, Test F1 {test_f1:.4f}')
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.