Add the unsupervised bipartite GraphSAGE model on the Taobao dataset #6144

Merged on Jan 16, 2023 (37 commits)
45fac38
Add unsupervised bipartite graphsage & dataset taobao
HuxleyHu98 Dec 2, 2022
9e4e61c
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 2, 2022
e4baf1f
style
HuxleyHu98 Dec 2, 2022
e4df11b
minor
HuxleyHu98 Dec 4, 2022
b3b5ad4
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 4, 2022
23c84eb
Merge branch 'bpsage' of https://github.com/husimplicity/pytorch_geom…
HuxleyHu98 Dec 4, 2022
8768df9
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 5, 2022
0959d3a
minor
HuxleyHu98 Dec 5, 2022
57c5830
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 5, 2022
b7cb029
Merge branch 'bpsage' of https://github.com/husimplicity/pytorch_geom…
HuxleyHu98 Dec 5, 2022
eae6442
minor
HuxleyHu98 Dec 5, 2022
318a90e
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 5, 2022
96e3f9c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2022
82dfb41
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 6, 2022
37c68c7
format
HuxleyHu98 Dec 6, 2022
d62518c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 6, 2022
dada43b
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 7, 2022
c59109e
fix:limit test sampling data within split test data
HuxleyHu98 Dec 7, 2022
e6d4fca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 7, 2022
d4a374c
Merge branch 'master' into bpsage
husimplicity Dec 14, 2022
754e3f6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 14, 2022
e752eab
Apply triplet loss
HuxleyHu98 Dec 14, 2022
a1a5b03
Merge branch 'bpsage' of https://github.com/husimplicity/pytorch_geom…
HuxleyHu98 Dec 14, 2022
1fa0b68
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 14, 2022
ac35e61
Merge branch 'master' into bpsage
husimplicity Dec 20, 2022
6ce5fbe
Merge branch 'master' into bpsage
husimplicity Dec 21, 2022
3d6c20c
format
HuxleyHu98 Dec 21, 2022
be4e7ee
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 21, 2022
3924c29
Merge branch 'pyg-team:master' into bpsage
husimplicity Dec 22, 2022
ad90da5
Merge branch 'master' into bpsage
husimplicity Dec 27, 2022
559a5b1
Merge branch 'pyg-team:master' into bpsage
husimplicity Jan 16, 2023
6b762ad
changelog
rusty1s Jan 16, 2023
c6a7972
Merge branch 'master' into bpsage
rusty1s Jan 16, 2023
143d6c3
update
rusty1s Jan 16, 2023
f0c2a7a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 16, 2023
4b856d3
update
rusty1s Jan 16, 2023
6b9cf19
typo
rusty1s Jan 16, 2023
update
rusty1s committed Jan 16, 2023
commit 143d6c37aaeaa4334833baa4f37e558abdecaf82
14 changes: 7 additions & 7 deletions docs/source/tutorial/heterogeneous.rst
@@ -13,7 +13,7 @@ As a consequence of the different data structure, the message passing formulatio
Example Graph
-------------

As a guiding example, we take a look at the heterogenous `ogbn-mag <https://ogb.stanford.edu/docs/nodeprop>`__ network from the `OGB datasets <https://ogb.stanford.edu>`_:
As a guiding example, we take a look at the heterogeneous `ogbn-mag <https://ogb.stanford.edu/docs/nodeprop>`__ network from the `OGB datasets <https://ogb.stanford.edu>`_:

.. image:: ../_figures/hg_example.svg
:align: center
@@ -192,7 +192,7 @@ The transform :meth:`~torch_geometric.transforms.NormalizeFeatures` works like i
Creating Heterogeneous GNNs
---------------------------

Standard Message Passing GNNs (MP-GNNs) can not trivially be applied to heterogenous graph data, as node and edge features from different types can not be processed by the same functions due to differences in feature type.
Standard Message Passing GNNs (MP-GNNs) can not trivially be applied to heterogeneous graph data, as node and edge features from different types can not be processed by the same functions due to differences in feature type.
A natural way to circumvent this is to implement message and update functions individually for each edge type.
During runtime, the MP-GNN algorithm would need to iterate over edge type dictionaries during message computation and over node type dictionaries during node updates.
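
To make the per-edge-type idea concrete, here is a minimal pure-Python sketch that does not use PyG at all. The names `hetero_layer`, `message_fns`, `update_fns` and the toy graph are hypothetical, and sum aggregation over scalar features is an arbitrary choice for illustration; this is not how `to_hetero` or `HeteroConv` are implemented.

```python
from collections import defaultdict


def hetero_layer(x_dict, edge_index_dict, message_fns, update_fns):
    """One round of message passing with separate functions per type.

    x_dict:          {node_type: [feature, ...]} (scalar features here)
    edge_index_dict: {(src_type, rel, dst_type): [(src, dst), ...]}
    message_fns:     {(src_type, rel, dst_type): fn(src_feature) -> message}
    update_fns:      {node_type: fn(old_feature, aggregated) -> new_feature}
    """
    # Iterate over the edge-type dictionary and sum the messages that
    # arrive at each destination node, keyed by destination node type.
    agg = {node_type: defaultdict(float) for node_type in x_dict}
    for edge_type, edges in edge_index_dict.items():
        src_type, _, dst_type = edge_type
        message = message_fns[edge_type]
        for src, dst in edges:
            agg[dst_type][dst] += message(x_dict[src_type][src])

    # Iterate over the node-type dictionary and apply the matching update.
    return {
        node_type: [
            update_fns[node_type](x, agg[node_type][i])
            for i, x in enumerate(xs)
        ]
        for node_type, xs in x_dict.items()
    }


# Toy graph (hypothetical): two authors who both wrote one paper.
x_dict = {'author': [1.0, 2.0], 'paper': [10.0]}
edge_index_dict = {
    ('author', 'writes', 'paper'): [(0, 0), (1, 0)],
    ('paper', 'rev_writes', 'author'): [(0, 0), (0, 1)],
}
message_fns = {
    ('author', 'writes', 'paper'): lambda x: 2.0 * x,
    ('paper', 'rev_writes', 'author'): lambda x: 0.5 * x,
}
update_fns = {
    'author': lambda old, aggregated: old + aggregated,
    'paper': lambda old, aggregated: aggregated,
}

out = hetero_layer(x_dict, edge_index_dict, message_fns, update_fns)
print(out)
```

Each edge type gets its own message function and each node type its own update function, so features of different types never pass through the same function.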

@@ -298,10 +298,10 @@ Afterwards, the created model can be trained as usual:
optimizer.step()
return float(loss)

Using the Heterogenous Convolution Wrapper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the Heterogeneous Convolution Wrapper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The heterogenous convolution wrapper :class:`torch_geometric.nn.conv.HeteroConv` allows to define custom heterogenous message and update functions to build arbitrary MP-GNNs for heterogeneous graphs from scratch.
The heterogeneous convolution wrapper :class:`torch_geometric.nn.conv.HeteroConv` allows to define custom heterogeneous message and update functions to build arbitrary MP-GNNs for heterogeneous graphs from scratch.
While the automatic converter :meth:`~torch_geometric.nn.to_hetero` uses the same operator for all edge types, the wrapper allows to define different operators for different edge types.
Here, :class:`~torch_geometric.nn.conv.HeteroConv` takes a dictionary of submodules as input, one for each edge type in the graph data.
The following `example <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/hetero_conv_dblp.py>`__ shows how to apply it.
@@ -349,8 +349,8 @@ We can initialize the model by calling it once (see :ref:`here<lazyinit>` for mo

and run the standard training procedure as outlined :ref:`here<trainfunc>`.

Deploy Existing Heterogenous Operators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Deploy Existing Heterogeneous Operators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:pyg:`PyG` provides operators (*e.g.*, :class:`torch_geometric.nn.conv.HGTConv`), which are specifically designed for heterogeneous graphs.
These operators can be directly used to build heterogeneous GNN models as can be seen in the following `example <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/hgt_dblp.py>`__:
4 changes: 2 additions & 2 deletions docs/source/tutorial/load_csv.rst
@@ -1,7 +1,7 @@
Loading Graphs from CSV
=======================

In this example, we will show how to load a set of :obj:`*.csv` files as input and construct a **heterogeneous graph** from it, which can be used as input to a `heterogenous graph model <heterogeneous.html>`__.
In this example, we will show how to load a set of :obj:`*.csv` files as input and construct a **heterogeneous graph** from it, which can be used as input to a `heterogeneous graph model <heterogeneous.html>`__.
This tutorial is also available as an executable `example script <https://github.com/pyg-team/pytorch_geometric/tree/master/examples/hetero/load_csv.py>`_ in the :obj:`examples/hetero` directory.

We are going to use the `MovieLens dataset <https://grouplens.org/datasets/movielens/>`_ collected by the GroupLens research group.
@@ -251,7 +251,7 @@ With this, we are ready to finalize our :class:`~torch_geometric.data.HeteroData
}
)

This :class:`~torch_geometric.data.HeteroData` object is the native format of heterogenous graphs in :pyg:`PyG` and can be used as input for `heterogenous graph models <heterogeneous.html>`__.
This :class:`~torch_geometric.data.HeteroData` object is the native format of heterogeneous graphs in :pyg:`PyG` and can be used as input for `heterogeneous graph models <heterogeneous.html>`__.

.. note::

4 changes: 3 additions & 1 deletion test/data/test_data.py
@@ -62,7 +62,7 @@ def test_data():
assert clone.edge_index.data_ptr() != data.edge_index.data_ptr()
assert clone.edge_index.tolist() == data.edge_index.tolist()

# Test `data.to_heterogenous()`:
# Test `data.to_heterogeneous()`:
out = data.to_heterogeneous()
assert torch.allclose(data.x, out['0'].x)
assert torch.allclose(data.edge_index, out['0', '0'].edge_index)
@@ -263,7 +263,9 @@ def test_data_share_memory():


def test_data_setter_properties():

class MyData(Data):

def __init__(self):
super().__init__()
self.my_attr1 = 1
94 changes: 46 additions & 48 deletions torch_geometric/datasets/taobao.py
@@ -1,5 +1,5 @@
import os
from typing import Callable, List, Optional
from typing import Callable, Optional

import numpy as np
import torch
@@ -13,91 +13,89 @@


class Taobao(InMemoryDataset):
r"""Taobao User Behavior is a dataset of user behaviors from Taobao offered
by Alibaba. The dataset is from the platform Tianchi Alicloud.
https://tianchi.aliyun.com/dataset/649.
Taobao is a heterogeous graph for recommendation. Nodes represent users
with User IDs and items with Item IDs and Category IDs, and edges represent
different types of user behaviours towards items with timestamps.
r"""Taobao is a dataset of user behaviors from Taobao offered by Alibaba,
provided by the `Tianchi Alicloud platform
<https://tianchi.aliyun.com/dataset/649>`_.

Taobao is a heterogeneous graph for recommendation.
Nodes represent users with user IDs and items with item IDs and category
IDs, and edges represent different types of user behaviours towards items
with timestamps.

Args:
root (string): Root directory where the dataset should be saved.
transform (callable, optional): A function/transform that takes in an
:obj:`torch_geometric.data.Data` object and returns a transformed
version. The data object will be transformed before every access.
(default: :obj:`None`)
:obj:`torch_geometric.data.HeteroData` object and returns a
transformed version. The data object will be transformed before
every access. (default: :obj:`None`)
pre_transform (callable, optional): A function/transform that takes in
an :obj:`torch_geometric.data.Data` object and returns a
an :obj:`torch_geometric.data.HeteroData` object and returns a
transformed version. The data object will be transformed before
being saved to disk. (default: :obj:`None`)

"""
dataset = 'UserBehavior.csv.zip'
url = 'https://alicloud-dev.oss-cn-hangzhou.aliyuncs.com/' + dataset
url = ('https://alicloud-dev.oss-cn-hangzhou.aliyuncs.com/'
'UserBehavior.csv.zip')

def __init__(
self,
root,
transform: Optional[Callable] = None,
pre_transform: Optional[Callable] = None,
):

super().__init__(root, transform, pre_transform)
self.data, self.slices = torch.load(self.processed_paths[0])

@property
def raw_file_names(self) -> List[str]:
return ['UserBehavior.csv']
def raw_file_names(self) -> str:
return 'UserBehavior.csv'

@property
def processed_file_names(self) -> str:
return 'data.pt'

def download(self):
print(self.raw_dir)
path = download_url(self.url, self.raw_dir)
extract_zip(path, self.raw_dir)
os.remove(path)

def process(self):
import pandas as pd

data = HeteroData()

df = pd.read_csv(self.raw_paths[0])
df.columns = [
'userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp'
]
cols = ['userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp']
df = pd.read_csv(self.raw_paths[0], names=cols)

# Time representation (YYYY.MM.DD-HH:MM:SS -> Integer)
# 1511539200 = 2017.11.25-00:00:00 1512316799 = 2017.12.03-23:59:59
df = df[(df["timestamp"] >= 1511539200)
& (df["timestamp"] <= 1512316799)]
# start: 1511539200 = 2017.11.25-00:00:00
# end: 1512316799 = 2017.12.03-23:59:59
start = 1511539200
end = 1512316799
df = df[(df["timestamp"] >= start) & (df["timestamp"] <= end)]

df = df.drop_duplicates(
subset=[
'userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp'
], keep='first')
df = df.drop_duplicates()

behavior_dict = {'pv': 0, 'cart': 1, 'buy': 2, 'fav': 3}
df['behaviorType'] = df['behaviorType'].map(behavior_dict).values
_, df['userId'] = np.unique(df[['userId']].values, return_inverse=True)
_, df['itemId'] = np.unique(df[['itemId']].values, return_inverse=True)
_, df['categoryId'] = np.unique(df[['categoryId']].values,
return_inverse=True)

data['user'].num_nodes = df['userId'].nunique()
data['item'].num_nodes = df['itemId'].nunique()

edge_feat, _ = np.unique(
df[['userId', 'itemId', 'behaviorType', 'timestamp']].values,
return_index=True, axis=0)
edge_feat = pd.DataFrame(edge_feat).drop_duplicates(
subset=[0, 1], keep='last')
data['user', '2',
'item'].edge_index = torch.from_numpy(edge_feat[[0, 1]].values).T
data['user', '2',
'item'].edge_attr = torch.from_numpy(edge_feat[[2, 3]].values).T
df['behaviorType'] = df['behaviorType'].map(behavior_dict)

num_entries = {}
for col in ['userId', 'itemId', 'categoryId']:
# Map IDs to consecutive integers:
value, df[col] = np.unique(df[[col]].values, return_inverse=True)
num_entries[col] = value.shape[0]

data = HeteroData()

data['user'].num_nodes = num_entries['userId']
data['item'].num_nodes = num_entries['itemId']
data['category'].num_nodes = num_entries['categoryId']

row = torch.from_numpy(df['userId'].values)
col = torch.from_numpy(df['itemId'].values)
data['user', 'item'].edge_index = torch.stack([row, col], dim=0)

data['user', 'item'].time = torch.from_numpy(df['timestamp'].values)
behavior = torch.from_numpy(df['behaviorType'].values)
data['user', 'item'].behavior = behavior

data = data if self.pre_transform is None else self.pre_transform(data)
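
The two preprocessing steps in the new `process()` (the timestamp window and the mapping of raw IDs to consecutive integers) can be sanity-checked with the standard library alone. The sketch below assumes the dates in the code comments are given in Beijing time (UTC+8), which the diff does not state; `in_window`, `to_consecutive` and the raw item IDs are hypothetical names and values used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the dates in the comments are Beijing time (UTC+8).
CST = timezone(timedelta(hours=8))

start = int(datetime(2017, 11, 25, 0, 0, 0, tzinfo=CST).timestamp())
end = int(datetime(2017, 12, 3, 23, 59, 59, tzinfo=CST).timestamp())


def in_window(ts):
    """Same inclusive filter as `(df.timestamp >= start) & (<= end)`."""
    return start <= ts <= end


def to_consecutive(ids):
    """Pure-Python analogue of `np.unique(ids, return_inverse=True)`.

    Returns the sorted unique values and, for every input ID, its
    position in that sorted list (a consecutive index starting at 0).
    """
    values = sorted(set(ids))
    index = {value: i for i, value in enumerate(values)}
    return values, [index[value] for value in ids]


# Hypothetical raw item IDs, not taken from the dataset.
raw_item_ids = [4615417, 2268318, 4615417, 1000]
values, item_idx = to_consecutive(raw_item_ids)

print(start, end)        # window bounds as Unix timestamps
print(values, item_idx)  # unique IDs and their consecutive indices
```

Under the UTC+8 assumption the computed bounds match the hard-coded `start` and `end`, and `len(values)` plays the role of `num_entries[col]` in the diff.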

6 changes: 4 additions & 2 deletions torch_geometric/explain/algorithm/captum.py
@@ -12,6 +12,7 @@


class CaptumModel(torch.nn.Module):

def __init__(self, model: torch.nn.Module, mask_type: str = "edge",
output_idx: Optional[int] = None):
super().__init__()
@@ -65,6 +66,7 @@ def forward(self, mask, *args):

# TODO(jinu) Is there any point of inheriting from `CaptumModel`
class CaptumHeteroModel(CaptumModel):

def __init__(self, model: torch.nn.Module, mask_type: str, output_id: int,
metadata: Metadata):
super().__init__(model, mask_type, output_id)
@@ -162,10 +164,10 @@ def to_captum_input(x: Union[Tensor, Dict[EdgeType, Tensor]],
Args:

x (Tensor or Dict[NodeType, Tensor]): The node features. For
heterogenous graphs this is a dictionary holding node featues
heterogeneous graphs this is a dictionary holding node featues
for each node type.
edge_index(Tensor or Dict[EdgeType, Tensor]): The edge indicies. For
heterogenous graphs this is a dictionary holding edge index
heterogeneous graphs this is a dictionary holding edge index
for each edge type.
mask_type (str): Denotes the type of mask to be created with
a Captum explainer. Valid inputs are :obj:`"edge"`, :obj:`"node"`,
3 changes: 2 additions & 1 deletion torch_geometric/loader/link_neighbor_loader.py
@@ -61,7 +61,7 @@ class LinkNeighborLoader(LinkLoader):

The rest of the functionality mirrors that of
:class:`~torch_geometric.loader.NeighborLoader`, including support for
heterogenous graphs.
heterogeneous graphs.

.. note::
Negative sampling is currently implemented in an approximate
@@ -170,6 +170,7 @@ class LinkNeighborLoader(LinkLoader):
:class:`torch.utils.data.DataLoader`, such as :obj:`batch_size`,
:obj:`shuffle`, :obj:`drop_last` or :obj:`num_workers`.
"""

def __init__(
self,
data: Union[Data, HeteroData, Tuple[FeatureStore, GraphStore]],
2 changes: 1 addition & 1 deletion torch_geometric/nn/models/captum.py
@@ -47,7 +47,7 @@ def to_captum_model(
internal_batch_size=1)


Sample code for heterogenous graphs:
Sample code for heterogeneous graphs:

.. code-block:: python
