Grounding Visual Hierarchy using Natural Language
This repo is the Hierarchical Concept Learner.
This repo reproduces the project "Learning Physical Graph Representations from Visual Scenes", which applies Gestalt principles to unsupervised visual scene understanding. Computer vision researchers have not yet developed algorithms that capture human physical understanding of scenes. This is despite outstanding progress in capturing other visual abilities: in particular, convolutional neural networks (CNNs) supervised with human-labeled data have excelled at categorizing scenes and objects. Yet when CNNs are optimized to perform tasks of physical understanding, such as predicting how a scene will evolve over time, they may make unrealistic predictions - of objects dissolving, disappearing, or merging. Being able to tell a cat from a dog is not the same as knowing that a dog will still exist and keep its form when it runs out of sight.
When we look at a scene, we see distinct objects imbued with physical properties: color, texture, location, shape, motion, and so on. We also infer complex and dynamic relationships within the scene: a cup may be balanced precariously on a table's edge, a piece of debris may be rolling in front of a truck, or a dog may have temporarily disappeared behind a tree. All of these judgements involve physical understanding of our visual observations.
The need to understand physical objects suggests a different type of scene representation than the image-like layers of features found in CNNs. Instead, different objects might be represented by distinct entities with associated attributes and interrelationships - in other words, a graph representation. We formalize this idea as a Physical Scene Graph (PSG): a hierarchical graph in which nodes represent objects or their parts, edges represent relationships, and a vector of physically meaningful attributes is bound to each node.
Unlike the task of inferring scene structure, CNNs are already good at extracting geometric features - they just do not group these features into discrete objects. However, different types of geometric information are typically found in different CNN layers: early layers can encode sharp object boundaries, whereas later layers better encode broad-scale information about depth and surface shape. To combine both types of information in a single set of features, the original work developed a recurrent CNN (ConvRNN) that learns to combine early and late layers through locally recurrent "cells" and long-range feedback connections. As expected, this recurrent feature extraction architecture is mainly crucial for learning to represent geometric properties: lower-level visual properties, like color and texture, are largely unaffected by instead using a feedforward CNN feature extractor. I did not write the ConvRNN in this repo because I could not find a PyTorch version of that network; I used a residual dense network as the feature extractor instead, and it works reasonably well.
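For reference, here is a minimal sketch of what a residual-dense feature extractor of this kind could look like. The class names, channel widths, and block count are illustrative assumptions, not the repo's exact network; it only shows the general pattern of densely connected convolutions with local feature fusion and a residual connection.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Densely connected convs + local feature fusion + local residual connection."""
    def __init__(self, channels=32, growth=16, num_layers=4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])
        # Fuse all intermediate features back to the block's channel width
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            features.append(torch.relu(conv(torch.cat(features, dim=1))))
        return x + self.fuse(torch.cat(features, dim=1))

class RDNFeatureExtractor(nn.Module):
    """Stack of residual dense blocks producing per-pixel features that the
    graph layers treat as the base-level nodes (hypothetical configuration)."""
    def __init__(self, in_channels=3, channels=32, num_blocks=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualDenseBlock(channels) for _ in range(num_blocks)])

    def forward(self, img):
        return self.blocks(self.stem(img))
```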
The general framework for the graph pooling and clustering layer is shown below. Different layers use different grouping mechanisms, each corresponding to a Gestalt principle; these are discussed in detail later.
```python
def forward(self, x, edge_index, batch, device=device):
    row, col = edge_index

    # Collect affinities/thresholds to filter edges
    affinities, threshold, losses = self.affinities_and_thresholds(x, row, col)
    if self.Type == "P1":
        filtered_edge_index = edge_index[:, affinities <= threshold]
    if self.Type == "P2":
        filtered_edge_index = edge_index[:, affinities >= threshold]

    # Coarsen graph with filtered adjacency list to produce next level's nodes
    x = x.to(device)
    if x.size(0) != 0:
        try:
            node_labels = LP_clustering(x.size(0), filtered_edge_index, 40).to(device)
        except Exception:
            # Fall back to one cluster per node if label propagation fails
            node_labels = torch.arange(x.size(0))
    else:
        node_labels = torch.arange(x.size(0))
    cluster_labels = node_labels.unique(return_inverse=True, sorted=False)[1].to(device)
    coarsened_x, coarsened_batch = max_pool_x(cluster_labels, x, batch)

    # [Very susceptible step: why coarsen the edges this way?]
    coarsened_edge_index = coalesce(cluster_labels[filtered_edge_index],
                                    None, coarsened_x.size(0), coarsened_x.size(0))[0]

    return (coarsened_x, coarsened_edge_index, coarsened_batch,
            cluster_labels, losses)
```
To construct new parent edges, nodes are clustered according to within-layer edges from one of the four affinity functions using the standard Label Propagation (LP) algorithm [44] (Fig. S1 left, middle column). This algorithm takes as input only the edges E_l, the number of nodes at the current graph level |V_l|, and a parameter setting the number of iterations. Each node is initialized to belong to its own cluster, giving a set of labels [|V_l|]. Then for q > 0, iteration-q labels are produced from iteration-(q−1) labels by assigning to each node v ∈ V_l the most common iteration-(q−1) label among nodes connected to v by an edge. Ties are resolved randomly. Let P_l : V_l → [m] denote the final converged label-propagation assignment of nodes in V_l to cluster identities, where m is the number of clusters discovered. Effectively, the new child-to-parent edges P_l define input-dependent pooling kernels for V_l (Fig. S1 left, right column). The final stage of building a new PSG level, Graph Vectorization, uses these pooling kernels to aggregate statistics over the resulting partition of V_l, resulting in new nodes V_{l+1} with |V_{l+1}| = m (see below).
```python
import torch
from torch_sparse import SparseTensor
from torch_scatter import scatter_max

def LP_clustering(num_nodes, edge_index, num_iter=50, device=device):
    # Sparse adjacency matrix
    adj_t = SparseTensor(row=edge_index[0], col=edge_index[1],
                         value=torch.ones_like(edge_index[1]).float(),
                         sparse_sizes=(num_nodes, num_nodes)).t().to(device)

    # Each node starts with its own index as its label (one-hot label matrix)
    x = SparseTensor.eye(num_nodes).to(device)

    for _ in range(num_iter):
        # Accumulate the labels of each node's neighbors
        out = adj_t @ x

        # Argmax of each row assigns the most common neighbor label to each node
        row, col, value = out.coo()
        argmax = scatter_max(value, row, dim_size=num_nodes)[1]
        new_labels = col[argmax]

        # Rebuild the one-hot label matrix for the next iteration
        x = SparseTensor(row=torch.arange(num_nodes).to(device), col=new_labels,
                         sparse_sizes=(num_nodes, num_nodes),
                         value=torch.ones(num_nodes).to(device))

    return new_labels
```
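A quick way to sanity-check LP_clustering is to run it on a toy graph. This usage example assumes torch, torch_sparse, and torch_scatter are installed (the function above already requires them):

```python
import torch

# Toy graph: two triangles, {0,1,2} and {3,4,5}, written as a symmetric edge list.
src = torch.tensor([0, 1, 0, 2, 1, 2, 3, 4, 3, 5, 4, 5])
dst = torch.tensor([1, 0, 2, 0, 2, 1, 4, 3, 5, 3, 5, 4])
edge_index = torch.stack([src, dst])

labels = LP_clustering(num_nodes=6, edge_index=edge_index, num_iter=10, device="cpu")
# After a few iterations each triangle typically collapses onto a single shared
# label (the exact label values depend on how ties are broken).
print(labels)
```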
Four types of graph clustering methods are introduced in the paper. Here are implementations of some of these concepts.
Principle-1: This is the implementation of principle-1 of visual grouping.
An affinity function that differs from the original paper's is used in this repo. We compute the norm of the difference between a pair of nodes' features and decide whether to connect the pair by comparing it with the mean feature difference within a spatial window (3 or 5). Let a_ij = ||x_i − x_j||_2 be the affinity of nodes i and j, and let ā_i be the mean of a_ij over the neighbors j of node i within the window, where x_i is the feature vector of node i. The principle produces an edge between two nodes in the spatial window whenever their feature difference is smaller than or equal to the average feature difference of either endpoint, i.e. a_ij ≤ min(ā_i, ā_j).
```python
def affinities_and_thresholds(self, nodes, row, col):
    # Norm of the feature difference for every node pair on the grid
    edge_affinities = torch.linalg.norm(nodes[row] - nodes[col], dim=1)

    # Mean affinity around each node, used to threshold its edges
    mean_affinity = scatter_mean(edge_affinities, row.to(nodes.device))
    affinity_thresh = torch.min(mean_affinity[row],
                                mean_affinity[col])

    return edge_affinities.to(device), affinity_thresh.to(device), {}
```
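As a tiny worked example of the thresholding rule, here is the same logic run standalone (outside the class) on three 1-D nodes, using the same torch_scatter op:

```python
import torch
from torch_scatter import scatter_mean

# Three nodes on a line with 1-D features; symmetric edges between neighbors.
nodes = torch.tensor([[0.0], [0.1], [1.0]])
row = torch.tensor([0, 1, 1, 2])
col = torch.tensor([1, 0, 2, 1])

aff = torch.linalg.norm(nodes[row] - nodes[col], dim=1)   # [0.1, 0.1, 0.9, 0.9]
mean_aff = scatter_mean(aff, row)                         # [0.1, 0.5, 0.9]
thresh = torch.min(mean_aff[row], mean_aff[col])          # [0.1, 0.1, 0.5, 0.5]
keep = aff <= thresh                                      # [True, True, False, False]
# Only the edge between the two similar nodes (0 and 1) survives the P1 filter.
```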
Principle-2: This is the implementation of principle-2 of visual grouping. This layer corresponds to the Gestalt principle of statistical co-occurrence.
The idea behind this layer is that if two nodes often appear in the same pairwise arrangement, it may be because they are part of an object that moves (or exists) as a coherent whole. It is thus a way of trying to infer motion-in-concert without actually observing motion, as when the visual inputs are single frames. This is implemented by making the input features of a variational autoencoder the differences between node features, so that frequently occurring arrangements are reconstructed with low error. Another possibility for this symmetric design would be to use a GNN-like message passing model to produce the output, but that is left for future work. The P2 affinity of a node pair is given by a_ij = 1 / (1 + v2 · L_recon(x_i − x_j)), where L_recon is the VAE reconstruction error of the feature difference and v2 is a scaling constant.
```python
self.node_pair_vae = VAE(in_features=node_feat_size, beta=30)  # defined in the layer's __init__

def affinities_and_thresholds(self, x, row, col):
    # Affinities as a function of the VAE reconstruction of node-pair differences
    _, recon_loss, kl_loss = self.node_pair_vae(x[row] - x[col])
    edge_affinities = 1 / (1 + self.v2 * recon_loss)

    losses = {"recon_loss": recon_loss.mean(), "kl_loss": kl_loss.mean()}

    return edge_affinities, .5, losses
```
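The VAE class referenced above is not shown in this section. A minimal β-VAE sketch compatible with the call `VAE(in_features=node_feat_size, beta=30)` and the three returned values could look like the following; the hidden and latent sizes are assumptions, not the repo's settings:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal beta-VAE over node-feature differences; returns the reconstruction,
    a per-sample reconstruction loss, and a per-sample (beta-weighted) KL term."""
    def __init__(self, in_features, hidden=64, latent=8, beta=30):
        super().__init__()
        self.beta = beta
        self.encoder = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(z)
        recon_loss = ((recon - x) ** 2).mean(dim=-1)              # one value per pair
        kl_loss = self.beta * (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1))
        return recon, recon_loss, kl_loss
```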
These affinity functions and corresponding thresholds are returned to the pooling layer's forward pass shown above, which keeps the edges that pass the threshold test and then clusters the nodes with Label Propagation.
The paper uses quadratic texture rendering and quadratic shape rendering to produce the reconstruction and segmentation signals. In this repo, after each layer's node attributes are computed, a fully connected network decodes the parameters for the final rendering.
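As an illustration, such a decoder could be as simple as a small MLP mapping each node's attribute vector to the QTR parameters defined below; the layer sizes and the number of decoded parameters here are assumptions rather than the repo's exact configuration:

```python
import torch.nn as nn

class RenderParamDecoder(nn.Module):
    """Illustrative fully connected decoder from node attributes to the eight
    QTR parameters (center_h, center_w, a, ah, aw, ahh, aww, ahw)."""
    def __init__(self, node_feat_size, hidden=64, num_params=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(node_feat_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_params))

    def forward(self, node_attrs):      # [num_nodes, node_feat_size]
        return self.mlp(node_attrs)     # [num_nodes, num_params]
```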
Quadratic Texture Rendering
Given the node attributes of a layer of a PSG, together with the spatial registration (the spatial region each node covers), quadratic texture rendering (QTR) predicts each attribute as a quadratic function of pixel position within the node's region: QTR(h, w) = a + a_h (h − c_h) + a_w (w − c_w) + a_hh (h − c_h)^2 + a_ww (w − c_w)^2 + a_hw (h − c_h)(w − c_w), where (c_h, c_w) is the node's center and the a coefficients are decoded from its attributes. The implementation of this paint-by-numbers procedure is shown here.
```python
def QTR(grid, center, a, ah, aw, ahh, aww, ahw):
    # grid[:, :, 0] and grid[:, :, 1] hold the (h, w) coordinate of each pixel
    ch, cw = center
    mx, my = grid[:, :, 0], grid[:, :, 1]

    # Quadratic function of pixel position relative to the node's center
    var = a + ah * (mx - ch) + aw * (my - cw) + \
          ahh * (mx - ch) ** 2 + aww * (my - cw) ** 2 + ahw * (mx - ch) * (my - cw)
    return var
```
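For example, QTR can be evaluated on a normalized coordinate grid as follows; the grid convention and the parameter values are only illustrative:

```python
import torch

# Render one channel for a single node over a 64x64 image, assuming the grid
# holds normalized (h, w) coordinates in [0, 1].
hh, ww = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")
grid = torch.stack([hh, ww], dim=-1)                        # [64, 64, 2]
channel = QTR(grid, center=(0.5, 0.5), a=0.2, ah=0.1, aw=0.0,
              ahh=-0.3, aww=-0.3, ahw=0.0)                  # [64, 64]
```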
Quadratic Shape Rendering
We define quadratic shape rendering (QSR) to produce a shape from node attributes, elaborating a procedure developed in previous papers, with the parameters of a set of quadratic functions of image position decoded from the node attributes in the same way as for QTR.
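The QSR code is not yet in this repo. As a rough sketch, assuming the shape is rendered as the soft intersection of the regions where each decoded quadratic (in the same form as QTR) is positive, one possible implementation is the following; the exact parameterization in the paper may differ:

```python
import torch

def QSR(grid, quad_params, steepness=10.0):
    """Hedged sketch of quadratic shape rendering: each row of quad_params gives
    QTR-style coefficients (center_h, center_w, a, ah, aw, ahh, aww, ahw); the
    soft mask is the product of sigmoids of those quadratics."""
    mask = torch.ones(grid.shape[:2])
    for p in quad_params:
        q = QTR(grid, center=(p[0], p[1]), a=p[2], ah=p[3], aw=p[4],
                ahh=p[5], aww=p[6], ahw=p[7])
        mask = mask * torch.sigmoid(steepness * q)
    return mask

# Example: a round blob centered at (0.5, 0.5) on the grid built above
# mask = QSR(grid, torch.tensor([[0.5, 0.5, 0.2, 0.0, 0.0, -1.0, -1.0, 0.0]]))
```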
The key next step is to use the physical graph representation for tasks that flummox other computer algorithms - tasks that require physical understanding more than categorical knowledge. Whether an object will slide or roll, how soon two things will collide, and where to look for something hidden are problems that depend on just the sort of scene structure and physical properties that PSGNets encode. By learning to build structured, physical representations of scenes, we hope that PSGNets will begin to bridge the critical gap between visual perception and physical understanding.
Here are some results from the paper on this scene. Results from this repo will come out soon.
This project contains: