Skip to content

[BUG] Strangeness with dataset.merge() #2015

Open
@daniel-falk

Description

@daniel-falk

🐛🐛 Bug Report

To start with, awesome work getting the merge functionality implemented! I have used it for a while now and discovered a few issues with it:

Text tensors are not merged

When two branches are merged, where both branches has the same tensors, some tensors are empty after merge. This applies to e.g. htype="text" and htype="generic". See example on colab here.

UI shows errors during a merge

If using the ActiveLoop user interface while two branches are merging the dataset fails to load and there are error messages "Dataset is corrupted or does not exist".

Exception about modification on read-only tensor during merge

While merging two branches I got this exception once. Rerunning with the same exact code a second time worked.

Error in sys.excepthook:                                                                                                                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 498, in _hook                                                                                                           
    self.error_report(error=exception_instance, tags=tags, publish=publish)                                                                                                                                                                               
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 244, in error_report                                                                                                    
    traceback.format_exception(                                                                                                                                                                                                                           
TypeError: format_exception() got an unexpected keyword argument 'etype'                                                                                                                                                                                  
                                                                                                                                                                                                                                                          
Original exception was:                                                                                                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/diff_commits.py", line 84, in <module>                                                                                                                                            
    diff_commits()                                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__                                                                                                          
    return self.main(*args, **kwargs)                                                                                                                                                                                                                     
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main                                                                                                              
    rv = self.invoke(ctx)                                                                                                                                                                                                                                 
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke                                                                                                            
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke                                                                                                             
    return __callback(*args, **kwargs)                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/diff_commits.py", line 79, in diff_commits                                                                                                                                        
    ds_new.merge(f"debug4/diff-{hash1}-{hash2}-removed")                                                                                                                                                                                                  
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 445, in wrapped_callable                                                                                                
    return callable(*args, **kwargs)                                                                                                                                                                                                                      
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/iteration_warning.py", line 11, in inner                                                                                          
    res = callable(x, *args, **kwargs)                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 1193, in merge                                                                                          
    merge(self, target_id, conflict_resolution, delete_removed_tensors, force)                                                                                                                                                                            
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 59, in merge                                                                                                      
    merge_common_tensors(common_tensors, dataset, target_ds, nodes, conflict_resolution)                                                                                                                                                                  
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 289, in merge_common_tensors                                                                                      
    merge_tensor_data(                                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 472, in merge_tensor_data                                                                                         
    original_tensor.append(sample)                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 377, in append                                                                                                   
    self.extend([sample], progressbar=False)                                                                                                                                                                                                              
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 306, in extend                                                                                                   
    self._write_initialization()                                                                                                                                                                                                                          
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 259, in _write_initialization                                                                                    
    self.storage.check_readonly()                                                                                                                                                                                                                         
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/storage/provider.py", line 153, in check_readonly                                                                                 
    raise ReadOnlyModeError()                                                                                                                                                                                                                             
deeplake.util.exceptions.ReadOnlyModeError: Modification when in read-only mode is not supported!

Removed samples are not removed after a merge

Data samples that are removed on a branch does not get removed when that branch is merged.

This might be a design decision, that when merging it is the current state of the source branch that is merged, not the changes/deltas on the branch. This "union" behaviour feels very unnatural for a "merge" and is not how it works in e.g. git. I could not find any documentation explaining this behaviour.

See example on colab here.

Merged commits are not visible in history

One of the main advantages with version control of the data is that I can look in the log and see what different changes people has done to the dataset. This does not really work as is now since a merge only creates a "merged ..." commit message, it does not show the commits that was done to other branches. This means that if using trunk-based development where the changes are always done on a feature branch, committed with a self-explaining commit message and then merged with master, there will not be any useful commit messages on the main branch, only "merge ..." messages. I think we would really need something like gitk etc. that shows the different commits on the branches that are merged.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions