GraphScope on 3/3 part 1: add local/dist trainer, and a `Data` class to make the example simpler #234

sighingnow · 2022-10-14T08:59:22Z

This pull request:

- unifies the logic of the local and distributed trainers
- adds a `Data` class to the GCN and RGCN examples to make the example code simpler, as discussed
- ~~adds an Ego-based GCN to restore the previous GCN example and show how EgoXXXData works~~

…impler Signed-off-by: Tao He <sighingnow@gmail.com>

Seventeen17 · 2022-11-14T03:47:17Z

graphlearn/examples/tf/ego_data.py

+    self.dataset_train = tfg.Dataset(self.query_train, window=10)
+    self.train_iterator = self.dataset_train.iterator
+    self.train_dict = self.dataset_train.get_data_dict()
+    self.train_embedding = self.model.forward(


I think it's better not to encapsulate the model training into the Data.

Would it be acceptable to move self.{train,val,test}_embedding outside and keep the other fields in the Data class?

Move all model-related data outside.

Seventeen17 · 2022-11-14T08:31:24Z

graphlearn/examples/tf/trainer.py

-            writeGFile.close()
-            print("Profiling data save to %s success." % save_path)
+          if self.profiling:
+            outs = self.run_and_profiling(train_ops, local_step)


In the if branch we call run_and_profiling, which wraps self.sess.run, but in the else branch self.sess.run is called directly, so the two branches are not symmetric. Maybe we should wrap the timeline saving separately.

I think moving the self.profiling check into run_and_profiling() and calling run_and_profiling directly would look better.
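
The suggestion above could look roughly like the following sketch. It uses a stand-in session object rather than a real TensorFlow session; `FakeSession`, `_save_timeline`, `saved_timelines`, and the keyword arguments passed to `run` are hypothetical names used only for illustration, and the actual trainer.py may differ.

```python
class FakeSession:
  """Stand-in for a TensorFlow session; it only records what was run."""

  def __init__(self):
    self.calls = []

  def run(self, ops, **kwargs):
    self.calls.append((ops, sorted(kwargs)))
    return ['out:%s' % op for op in ops]


class Trainer:
  def __init__(self, sess, profiling=False):
    self.sess = sess
    self.profiling = profiling
    self.saved_timelines = []

  def _save_timeline(self, local_step):
    # Hypothetical helper: the real code would write a timeline file here.
    self.saved_timelines.append(local_step)

  def run_and_profiling(self, train_ops, local_step):
    # The profiling check lives here, so callers have a single code path.
    if not self.profiling:
      return self.sess.run(train_ops)
    # Assumption: profiling passes extra arguments to sess.run.
    outs = self.sess.run(train_ops, options='trace', run_metadata='meta')
    self._save_timeline(local_step)
    return outs


trainer = Trainer(FakeSession(), profiling=True)
outs = trainer.run_and_profiling(['train_op', 'loss'], local_step=0)
```

With this shape, the training loop calls the same method whether or not profiling is enabled, and the timeline saving stays in its own helper.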

Seventeen17 · 2022-11-14T08:59:22Z

graphlearn/examples/tf/ego_data.py

+import graphlearn.python.nn.tf as tfg
+from graphlearn.python.utils import parse_nbrs_num
+
+class EgoData:


A rough sketch of what I have in mind, just for reference.

Base class:

    class EgoSampleLoaderBase:  # just for example, maybe we could find a better class name
      def __init__(self, graph, nbr_num, sampler, batch_size, mask="train"):
        # ..
        if mask == 'train':
          tfg.conf.training = True
        self.sample_query = self.query(graph, mask)
        ds = tfg.Dataset(self.query_train, window=10)
        self._iterator = ...

      def _query(self):
        raise NotImplementedError...

      def _format(self, ...):
        raise NotImplementedError...

      @property
      def iterator(self):
        return self._iterator

      def as_list(self):
        return self._format()

      @property
      def src(self):
        return self._data_dict['seed']

      def hop(self, idx):
        return self._data_dict['hop1']  # Just for example

Example derived class:

    class EgoRGCNSampleLoader(EgoSampleLoaderBase):
      def _query(self):
        # ...

      def _format(self):
        # ...

Usage in train.py:

    graph = g.init()
    model = EgoRGCN(...)
    train_sample = EgoRGCNSampleLoader(g, nbr_num, "random", 128, 'train')
    train_emb = model.forward(train_sample.as_list())
    loss = loss_fn(train_emb, train_sample.src.labels)
    trainer = Trainer(train_sample.iterator, loss)
    trainer.run()
    # for test
    test_sample_loader = EgoRGCNSampleLoader(g, nbr_num, "random", 128, 'test')
    # ...

Looks fine, thanks!

It is better to put all sampling and data preprocessing into a SampleLoader or NeighborLoader.
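
As a framework-free illustration of keeping sampling and preprocessing inside a loader, here is a minimal sketch. It does not use graphlearn: the "graph" is a plain dict, and `SampleLoaderBase`, `ToyNeighborLoader`, and the `_query`/`_format` hooks are hypothetical names that only mirror the shape of the proposal above.

```python
class SampleLoaderBase:
  """Owns sampling and preprocessing; knows nothing about the model."""

  def __init__(self, graph, batch_size, mask='train'):
    self._graph = graph
    self._batch_size = batch_size
    self._mask = mask
    # Subclasses decide how to sample; the base class only wires things up.
    self._samples = self._query()

  def _query(self):
    raise NotImplementedError

  def _format(self, batch):
    raise NotImplementedError

  def __iter__(self):
    # Yield formatted batches of at most batch_size samples.
    for start in range(0, len(self._samples), self._batch_size):
      yield self._format(self._samples[start:start + self._batch_size])


class ToyNeighborLoader(SampleLoaderBase):
  """Hypothetical subclass: the graph maps node -> (split, neighbors)."""

  def _query(self):
    # Select the nodes that belong to the requested split.
    return [n for n, (split, _) in sorted(self._graph.items())
            if split == self._mask]

  def _format(self, batch):
    # Pair every seed node with its neighbor list.
    return [(n, self._graph[n][1]) for n in batch]


graph = {0: ('train', [1, 2]), 1: ('test', [0]),
         2: ('train', [0]), 3: ('train', [])}
batches = list(ToyNeighborLoader(graph, batch_size=2, mask='train'))
```

The model and the trainer then only consume the batches, so a user who needs just one phase (for example, test) constructs a single loader with the corresponding mask.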

baoleai · 2022-11-16T08:14:21Z

graphlearn/python/nn/tf/layers/ego_gcn_conv.py

+from graphlearn.python.nn.tf.layers.linear_layer import LinearLayer
+
+
+class EgoGCNConv(EgoConv):


This looks like EgoRGCNConv rather than EgoGCNConv. You can use EgoSAGEConv with aggr='gcn' as EgoGCNConv.

Okay, will test that. Thanks!

baoleai · 2022-11-16T08:21:30Z

graphlearn/examples/tf/ego_gcn/ego_gcn.py

+               act_func=tf.nn.relu,
+               dropout=0.0,
+               **kwargs):
+    """EgoGraph based RGCN. 


These args are for RGCN, not for GCN.

Thanks, will fix.

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai · 2022-11-16T09:54:39Z

graphlearn/examples/tf/trainer.py


    def _close_session():
      if self.sess is not None:
        self.sess.close()
    atexit.register(_close_session)

-  def train(self, iterator, loss, learning_rate, epochs=10, hooks=[], **kwargs):
+  def run_and_profiling(self, train_ops, local_step):


Would run_step be a better name?

Fixed. It has been renamed to run_step.

baoleai · 2022-11-16T10:00:35Z

graphlearn/examples/tf/ego_rgcn/train_supervised.py

+  train_data = EgoRGCNDataLoader(g, gl.Mask.TRAIN, FLAGS.sampler, FLAGS.train_batch_size,
+                                 node_type='i', nbrs_num=nbrs_num, num_relations=FLAGS.num_relations)
+  train_embedding = model.forward(train_data.as_list(), nbrs_num)
+  loss = supervised_loss(train_embedding, train_data['seed'].labels)


The user cannot know about 'seed' because it is internal to train_data, so maybe just use an API like `seed().labels`?

I have changed it back to train/test/val based on the mask parameter (keeping the previous behaviour) and exposed some helpers, train_labels, test_labels, and val_labels, for accessing the labels.
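
One way to hide the internal key from the user is to derive it from the mask inside the loader. The sketch below is only an illustration: `Mask` is a hypothetical stand-in enum that assumes 1-based TRAIN/TEST/VAL values, `ToyDataLoader` is not the real loader class, and the data dict holds plain Python dicts instead of graphlearn objects.

```python
from enum import Enum


class Mask(Enum):
  # Assumption: a 1-based ordering of TRAIN, TEST, VAL.
  TRAIN = 1
  TEST = 2
  VAL = 3


class ToyDataLoader:
  def __init__(self, data_dict, mask):
    self._data_dict = data_dict
    self._mask = mask

  @property
  def labels(self):
    # Map the mask to the key used inside the data dict, so callers
    # never need to know the internal alias.
    prefix = ('train', 'test', 'val')[self._mask.value - 1]
    return self._data_dict[prefix]['labels']


data = {'train': {'labels': [1, 0]},
        'test': {'labels': [1]},
        'val': {'labels': []}}
train_labels = ToyDataLoader(data, Mask.TRAIN).labels
```

The calling code then reads `loader.labels` without ever spelling out 'seed' or 'train'.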

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai · 2022-11-16T11:40:12Z

graphlearn/examples/tf/ego_data.py

+    return self._dataset.get_egograph(key)
+
+  @property
+  def train_ego(self):


Keep only a single ego() property here, in which you call get_egograph according to self._mask.

Here we need three methods: src_ego, dst_ego, and neg_dst_ego (neg_dst_ego is for unsupervised models). train_ego or test_ego is just src_ego when the mask is Train or Test.

We should not put the train/test/val queries all together, so that we can support users who only want to run the save-embedding phase.

Fixed.

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai · 2022-11-16T11:50:22Z

graphlearn/examples/tf/trainer.py

+    """
+    """
+
+  def add_initializer(self, iterator):


unused code?

Deleted. Thanks for raising it.

baoleai · 2022-11-16T11:50:53Z

graphlearn/examples/tf/trainer.py


  Args:
-    cluster_spec: TensorFlow ClusterSpec.
-    job_name: name of this worker.
    task_index: index of this worker.


Remove this unused arg.

Docstring has been revised in trainer.py.

baoleai · 2022-11-16T11:52:59Z

graphlearn/examples/tf/trainer.py

+    self.sync_barrier = None
+    self.global_step = None
+    self.is_local = None
+
  def context(self):


raise NotImplementedError

baoleai · 2022-11-16T11:53:08Z

graphlearn/examples/tf/trainer.py


    def _close_session():
      if self.sess is not None:
        self.sess.close()
    atexit.register(_close_session)

-  def train(self, iterator, loss, learning_rate, epochs=10, hooks=[], **kwargs):
+  def run_step(self, train_ops, local_step):


raise NotImplementedError

baoleai · 2022-11-16T11:54:17Z

graphlearn/examples/tf/trainer.py

+      print('Start testing ...')
+      total_test_acc = []
+      local_step = 0
+      last_local_step = 0


I think the LocalTrainer can also use global_step?

It is to make the logs less confusing.

baoleai · 2022-11-16T12:05:30Z

graphlearn/examples/tf/ego_data.py

@@ -0,0 +1,118 @@
+# Copyright 2021 Alibaba Group Holding Limited. All Rights Reserved.


ego_data.py -> ego_data_loader.py

Done, renamed ego_rgcn_data_loader.py and ego_sage_data_loader.py as well.

baoleai · 2022-11-16T12:09:07Z

graphlearn/examples/tf/ego_data.py

+    prefix = ('train', 'test', 'val')[self._mask.value - 1]
+    return self._data_dict[prefix].labels
+
+  def as_list(self):


Rename this to x_list, meaning the list of (processed) input node features.

Signed-off-by: Tao He <sighingnow@gmail.com>

Seventeen17 · 2022-11-17T02:55:57Z

graphlearn/examples/tf/ego_data_loader.py

+  def dst_ego(self):
+    ''' Alias for `self.get_egograph('dst')`.
+    '''
+    return self.get_egograph('dst')


It seems that not all the queries in the subclasses contain 'src', 'dst', and 'neg_dst'.

They are only called when needed; otherwise we would need users to hard-code get_egograph("src"), ... in their train_(un)supervised.py.

baoleai · 2022-11-17T03:15:41Z

graphlearn/examples/tf/ego_data_loader.py

+  def src_ego(self):
+    ''' Alias for `self.get_egograph('src')`.
+    '''
+    if self._mask is None:


The base class should not provide default implementations for src_ego, dst_ego, and neg_dst_ego, because 'src' or 'dst' is only valid when the query in the derived class uses it.
You can just raise NotImplementedError here.
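
A minimal sketch of that arrangement, using plain strings in place of real ego graphs. `EgoDataLoaderBase`, `SupervisedLoader`, and the `_egographs` dict are hypothetical names for illustration only; the merged code may be organized differently.

```python
class EgoDataLoaderBase:
  def __init__(self, egographs):
    # Hypothetical storage: alias -> ego graph produced by the query.
    self._egographs = egographs

  def get_egograph(self, key):
    return self._egographs[key]

  # No defaults here: an alias is only meaningful if the subclass's
  # query actually defines it.
  @property
  def src_ego(self):
    raise NotImplementedError

  @property
  def dst_ego(self):
    raise NotImplementedError

  @property
  def neg_dst_ego(self):
    raise NotImplementedError


class SupervisedLoader(EgoDataLoaderBase):
  """Its query only defines 'src', so only src_ego is overridden."""

  @property
  def src_ego(self):
    return self.get_egograph('src')


loader = SupervisedLoader({'src': 'ego(src)'})
src = loader.src_ego
```

Accessing `loader.dst_ego` on this subclass raises NotImplementedError, which makes a missing alias fail loudly instead of surfacing as a lookup error deep inside the dataset.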

baoleai · 2022-11-17T03:19:14Z

graphlearn/examples/tf/ego_data_loader.py

+    '''
+    return self.get_egograph('neg_dst')
+
+  @property


The labels, x_list, and _format interfaces are not that common. It is best to implement them in the subclasses that need them, such as EgoRGCNDataLoader, not in the base class.

… subclasses Signed-off-by: Tao He <sighingnow@gmail.com>

sighingnow marked this pull request as draft October 14, 2022 08:59

sighingnow force-pushed the ht/gs-on-gl branch from 575a9ba to 304df75 Compare October 17, 2022 02:05

sighingnow force-pushed the ht/gs-on-gl branch from 304df75 to e624512 Compare November 14, 2022 01:18

sighingnow changed the title ~~GraphScope on 3/3: rebasing & merging the Python API part.~~ GraphScope on 3/3 part 1: add local/dist trainer, and a Data class to make the example simpler Nov 14, 2022

sighingnow requested review from baoleai, LiSu and Seventeen17 November 14, 2022 01:19

sighingnow marked this pull request as ready for review November 14, 2022 01:19

sighingnow force-pushed the ht/gs-on-gl branch from e624512 to 893effb Compare November 14, 2022 01:27

Add both local and dist trainer, and a Data class to make example s…

10ad969

…impler Signed-off-by: Tao He <sighingnow@gmail.com>

sighingnow force-pushed the ht/gs-on-gl branch from 893effb to 10ad969 Compare November 14, 2022 01:52

Seventeen17 reviewed Nov 14, 2022

baoleai reviewed Nov 16, 2022

Revise the implementation of data class and remove GCN

b742d6a

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai reviewed Nov 16, 2022

sighingnow added 2 commits November 16, 2022 19:25

Revise the SAGE implementation

06f969f

Signed-off-by: Tao He <sighingnow@gmail.com>

Remove duplicated comments

ee8e2cd

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai reviewed Nov 16, 2022

Update the .labels() API

4dac6c8

Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai reviewed Nov 16, 2022

File renaming and minor fixes

b655e23

Signed-off-by: Tao He <sighingnow@gmail.com>

Seventeen17 reviewed Nov 17, 2022

baoleai reviewed Nov 17, 2022

Move the detail implementation of src/dst/neg_dst/labels/xlist to the…

fc7aa41

… subclasses Signed-off-by: Tao He <sighingnow@gmail.com>

baoleai approved these changes Nov 17, 2022

Seventeen17 approved these changes Nov 17, 2022

sighingnow merged commit 0a656bf into alibaba:master Nov 17, 2022

sighingnow deleted the ht/gs-on-gl branch November 17, 2022 09:25

sighingnow mentioned this pull request Nov 23, 2022

Update the learning model to align with the latest graphlearn alibaba/GraphScope#2235

Merged
