update

pyg-team · rusty1s · Jan 16, 2023 · Dec 2, 2022
commit 143d6c37aaeaa4334833baa4f37e558abdecaf82
@@ -13,7 +13,7 @@ As a consequence of the different data structure, the message passing formulatio
 Example Graph
 -------------
 
-As a guiding example, we take a look at the heterogenous `ogbn-mag <https://ogb.stanford.edu/docs/nodeprop>`__ network from the `OGB datasets <https://ogb.stanford.edu>`_:
+As a guiding example, we take a look at the heterogeneous `ogbn-mag <https://ogb.stanford.edu/docs/nodeprop>`__ network from the `OGB datasets <https://ogb.stanford.edu>`_:
 
 .. image:: ../_figures/hg_example.svg
   :align: center
@@ -192,7 +192,7 @@ The transform :meth:`~torch_geometric.transforms.NormalizeFeatures` works like i
 Creating Heterogeneous GNNs
 ---------------------------
 
-Standard Message Passing GNNs (MP-GNNs) can not trivially be applied to heterogenous graph data, as node and edge features from different types can not be processed by the same functions due to differences in feature type.
+Standard Message Passing GNNs (MP-GNNs) can not trivially be applied to heterogeneous graph data, as node and edge features from different types can not be processed by the same functions due to differences in feature type.
 A natural way to circumvent this is to implement message and update functions individually for each edge type.
 During runtime, the MP-GNN algorithm would need to iterate over edge type dictionaries during message computation and over node type dictionaries during node updates.
 
@@ -298,10 +298,10 @@ Afterwards, the created model can be trained as usual:
         optimizer.step()
         return float(loss)
 
-Using the Heterogenous Convolution Wrapper
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Using the Heterogeneous Convolution Wrapper
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The heterogenous convolution wrapper :class:`torch_geometric.nn.conv.HeteroConv` allows to define custom heterogenous message and update functions to build arbitrary MP-GNNs for heterogeneous graphs from scratch.
+The heterogeneous convolution wrapper :class:`torch_geometric.nn.conv.HeteroConv` allows to define custom heterogeneous message and update functions to build arbitrary MP-GNNs for heterogeneous graphs from scratch.
 While the automatic converter :meth:`~torch_geometric.nn.to_hetero` uses the same operator for all edge types, the wrapper allows to define different operators for different edge types.
 Here, :class:`~torch_geometric.nn.conv.HeteroConv` takes a dictionary of submodules as input, one for each edge type in the graph data.
 The following `example <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/hetero_conv_dblp.py>`__ shows how to apply it.
@@ -349,8 +349,8 @@ We can initialize the model by calling it once (see :ref:`here<lazyinit>` for mo
 
 and run the standard training procedure as outlined :ref:`here<trainfunc>`.
 
-Deploy Existing Heterogenous Operators
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Deploy Existing Heterogeneous Operators
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 :pyg:`PyG` provides operators (*e.g.*, :class:`torch_geometric.nn.conv.HGTConv`), which are specifically designed for heterogeneous graphs.
 These operators can be directly used to build heterogeneous GNN models as can be seen in the following `example <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/hgt_dblp.py>`__:

@@ -1,7 +1,7 @@
 Loading Graphs from CSV
 =======================
 
-In this example, we will show how to load a set of :obj:`*.csv` files as input and construct a **heterogeneous graph** from it, which can be used as input to a `heterogenous graph model <heterogeneous.html>`__.
+In this example, we will show how to load a set of :obj:`*.csv` files as input and construct a **heterogeneous graph** from it, which can be used as input to a `heterogeneous graph model <heterogeneous.html>`__.
 This tutorial is also available as an executable `example script <https://github.com/pyg-team/pytorch_geometric/tree/master/examples/hetero/load_csv.py>`_ in the :obj:`examples/hetero` directory.
 
 We are going to use the `MovieLens dataset <https://grouplens.org/datasets/movielens/>`_ collected by the GroupLens research group.
@@ -251,7 +251,7 @@ With this, we are ready to finalize our :class:`~torch_geometric.data.HeteroData
       }
     )
 
-This :class:`~torch_geometric.data.HeteroData` object is the native format of heterogenous graphs in :pyg:`PyG` and can be used as input for `heterogenous graph models <heterogeneous.html>`__.
+This :class:`~torch_geometric.data.HeteroData` object is the native format of heterogeneous graphs in :pyg:`PyG` and can be used as input for `heterogeneous graph models <heterogeneous.html>`__.
 
 .. note::
 

@@ -62,7 +62,7 @@ def test_data():
     assert clone.edge_index.data_ptr() != data.edge_index.data_ptr()
     assert clone.edge_index.tolist() == data.edge_index.tolist()
 
-    # Test `data.to_heterogenous()`:
+    # Test `data.to_heterogeneous()`:
     out = data.to_heterogeneous()
     assert torch.allclose(data.x, out['0'].x)
     assert torch.allclose(data.edge_index, out['0', '0'].edge_index)
@@ -263,7 +263,9 @@ def test_data_share_memory():
 
 
 def test_data_setter_properties():
+
     class MyData(Data):
+
         def __init__(self):
             super().__init__()
             self.my_attr1 = 1

@@ -1,5 +1,5 @@
 import os
-from typing import Callable, List, Optional
+from typing import Callable, Optional
 
 import numpy as np
 import torch
@@ -13,91 +13,89 @@
 
 
 class Taobao(InMemoryDataset):
-    r"""Taobao User Behavior is a dataset of user behaviors from Taobao offered
-    by Alibaba. The dataset is from the platform Tianchi Alicloud.
-    https://tianchi.aliyun.com/dataset/649.
-    Taobao is a heterogeous graph for recommendation. Nodes represent users
-    with User IDs and items with Item IDs and Category IDs, and edges represent
-    different types of user behaviours towards items with timestamps.
+    r"""Taobao is a dataset of user behaviors from Taobao offered by Alibaba,
+    provided by the `Tianchi Alicloud platform
+    <https://tianchi.aliyun.com/dataset/649>`_.
+
+    Taobao is a heterogeneous graph for recommendation.
+    Nodes represent users with user IDs and items with item IDs and category
+    IDs, and edges represent different types of user behaviours towards items
+    with timestamps.
 
     Args:
         root (string): Root directory where the dataset should be saved.
         transform (callable, optional): A function/transform that takes in an
-            :obj:`torch_geometric.data.Data` object and returns a transformed
-            version. The data object will be transformed before every access.
-            (default: :obj:`None`)
+            :obj:`torch_geometric.data.HeteroData` object and returns a
+            transformed version. The data object will be transformed before
+            every access. (default: :obj:`None`)
         pre_transform (callable, optional): A function/transform that takes in
-            an :obj:`torch_geometric.data.Data` object and returns a
+            an :obj:`torch_geometric.data.HeteroData` object and returns a
             transformed version. The data object will be transformed before
             being saved to disk. (default: :obj:`None`)
 
     """
-    dataset = 'UserBehavior.csv.zip'
-    url = 'https://alicloud-dev.oss-cn-hangzhou.aliyuncs.com/' + dataset
+    url = ('https://alicloud-dev.oss-cn-hangzhou.aliyuncs.com/'
+           'UserBehavior.csv.zip')
 
     def __init__(
         self,
         root,
         transform: Optional[Callable] = None,
         pre_transform: Optional[Callable] = None,
     ):
-
         super().__init__(root, transform, pre_transform)
         self.data, self.slices = torch.load(self.processed_paths[0])
 
     @property
-    def raw_file_names(self) -> List[str]:
-        return ['UserBehavior.csv']
+    def raw_file_names(self) -> str:
+        return 'UserBehavior.csv'
 
     @property
     def processed_file_names(self) -> str:
         return 'data.pt'
 
     def download(self):
-        print(self.raw_dir)
         path = download_url(self.url, self.raw_dir)
         extract_zip(path, self.raw_dir)
         os.remove(path)
 
     def process(self):
         import pandas as pd
 
-        data = HeteroData()
-
-        df = pd.read_csv(self.raw_paths[0])
-        df.columns = [
-            'userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp'
-        ]
+        cols = ['userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp']
+        df = pd.read_csv(self.raw_paths[0], names=cols)
 
         # Time representation (YYYY.MM.DD-HH:MM:SS -> Integer)
-        # 1511539200 = 2017.11.25-00:00:00 1512316799 = 2017.12.03-23:59:59
-        df = df[(df["timestamp"] >= 1511539200)
-                & (df["timestamp"] <= 1512316799)]
+        # start: 1511539200 = 2017.11.25-00:00:00
+        # end:   1512316799 = 2017.12.03-23:59:59
+        start = 1511539200
+        end = 1512316799
+        df = df[(df["timestamp"] >= start) & (df["timestamp"] <= end)]
 
-        df = df.drop_duplicates(
-            subset=[
-                'userId', 'itemId', 'categoryId', 'behaviorType', 'timestamp'
-            ], keep='first')
+        df = df.drop_duplicates()
 
         behavior_dict = {'pv': 0, 'cart': 1, 'buy': 2, 'fav': 3}
-        df['behaviorType'] = df['behaviorType'].map(behavior_dict).values
-        _, df['userId'] = np.unique(df[['userId']].values, return_inverse=True)
-        _, df['itemId'] = np.unique(df[['itemId']].values, return_inverse=True)
-        _, df['categoryId'] = np.unique(df[['categoryId']].values,
-                                        return_inverse=True)
-
-        data['user'].num_nodes = df['userId'].nunique()
-        data['item'].num_nodes = df['itemId'].nunique()
-
-        edge_feat, _ = np.unique(
-            df[['userId', 'itemId', 'behaviorType', 'timestamp']].values,
-            return_index=True, axis=0)
-        edge_feat = pd.DataFrame(edge_feat).drop_duplicates(
-            subset=[0, 1], keep='last')
-        data['user', '2',
-             'item'].edge_index = torch.from_numpy(edge_feat[[0, 1]].values).T
-        data['user', '2',
-             'item'].edge_attr = torch.from_numpy(edge_feat[[2, 3]].values).T
+        df['behaviorType'] = df['behaviorType'].map(behavior_dict)
+
+        num_entries = {}
+        for col in ['userId', 'itemId', 'categoryId']:
+            # Map IDs to consecutive integers:
+            value, df[col] = np.unique(df[[col]].values, return_inverse=True)
+            num_entries[col] = value.shape[0]
+
+        data = HeteroData()
+
+        data['user'].num_nodes = num_entries['userId']
+        data['item'].num_nodes = num_entries['itemId']
+        data['category'].num_nodes = num_entries['categoryId']
+
+        row = torch.from_numpy(df['userId'].values)
+        col = torch.from_numpy(df['itemId'].values)
+        data['user', 'item'].edge_index = torch.stack([row, col], dim=0)
+
+        data['user', 'item'].time = torch.from_numpy(df['timestamp'].values)
+        behavior = torch.from_numpy(df['behaviorType'].values)
+        data['user', 'item'].behavior = behavior
 
         data = data if self.pre_transform is None else self.pre_transform(data)
 
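Review note on the refactored `process()` above: a minimal, standalone sketch of its two numeric steps, the time window and the ID relabeling. The toy `raw_user_ids` array is hypothetical and stands in for one CSV column. Reading the commented dates as UTC+8 (China Standard Time) is an assumption; it is the offset under which the constants in the hunk match the dates in the comments.

```python
from datetime import datetime, timedelta, timezone

import numpy as np

# Assumption: the dates in the code comments are given in UTC+8.
CST = timezone(timedelta(hours=8))
start = int(datetime(2017, 11, 25, 0, 0, 0, tzinfo=CST).timestamp())
end = int(datetime(2017, 12, 3, 23, 59, 59, tzinfo=CST).timestamp())

# Hypothetical raw IDs, standing in for a column such as `userId`.
raw_user_ids = np.array([1000, 42, 1000, 7, 42])

# `np.unique` returns the sorted unique values; with `return_inverse=True`
# it also returns, for every entry, its position in that sorted array.
# Those positions are consecutive integers starting at 0, so they can be
# used directly as node indices.
values, user_idx = np.unique(raw_user_ids, return_inverse=True)

# The number of distinct IDs gives the node count for this node type.
num_users = values.shape[0]
```

Here `user_idx` plays the role of the relabeled column and `num_users` the role of `num_entries[col]`. The sketch uses a 1-D array for simplicity, whereas the hunk passes a single-column 2-D selection (`df[[col]].values`).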

@@ -12,6 +12,7 @@
 
 
 class CaptumModel(torch.nn.Module):
+
     def __init__(self, model: torch.nn.Module, mask_type: str = "edge",
                  output_idx: Optional[int] = None):
         super().__init__()
@@ -65,6 +66,7 @@ def forward(self, mask, *args):
 
 # TODO(jinu) Is there any point of inheriting from `CaptumModel`
 class CaptumHeteroModel(CaptumModel):
+
     def __init__(self, model: torch.nn.Module, mask_type: str, output_id: int,
                  metadata: Metadata):
         super().__init__(model, mask_type, output_id)
@@ -162,10 +164,10 @@ def to_captum_input(x: Union[Tensor, Dict[EdgeType, Tensor]],
     Args:
 
         x (Tensor or Dict[NodeType, Tensor]): The node features. For
-            heterogenous graphs this is a dictionary holding node featues
+            heterogeneous graphs this is a dictionary holding node features
             for each node type.
         edge_index(Tensor or Dict[EdgeType, Tensor]): The edge indicies. For
-            heterogenous graphs this is a dictionary holding edge index
+            heterogeneous graphs this is a dictionary holding edge index
             for each edge type.
         mask_type (str): Denotes the type of mask to be created with
             a Captum explainer. Valid inputs are :obj:`"edge"`, :obj:`"node"`,

@@ -61,7 +61,7 @@ class LinkNeighborLoader(LinkLoader):
 
     The rest of the functionality mirrors that of
     :class:`~torch_geometric.loader.NeighborLoader`, including support for
-    heterogenous graphs.
+    heterogeneous graphs.
 
     .. note::
         Negative sampling is currently implemented in an approximate
@@ -170,6 +170,7 @@ class LinkNeighborLoader(LinkLoader):
             :class:`torch.utils.data.DataLoader`, such as :obj:`batch_size`,
             :obj:`shuffle`, :obj:`drop_last` or :obj:`num_workers`.
     """
+
     def __init__(
         self,
         data: Union[Data, HeteroData, Tuple[FeatureStore, GraphStore]],

@@ -47,7 +47,7 @@ def to_captum_model(
                                internal_batch_size=1)
 
 
-    Sample code for heterogenous graphs:
+    Sample code for heterogeneous graphs:
 
     .. code-block:: python