Broken Links when Exporting #194

Draft · wants to merge 23 commits into dev
Conversation

@mavaylon1 (Contributor) commented May 16, 2024

Motivation

What was the reasoning behind this change? Please explain the changes briefly.
Export is not supposed to create links to the prior file; rather, it is just meant to offer the option to preserve existing links. This means that if File A has links to some File C, then when we export File A to File B, File B will also have links to File C.

Problem 1: HDMF-Zarr is missing the logic that HDMF has in write_dataset, i.e., the conditionals that handle links during export.

Problem 2: When links are created (in the cases where they are supposed to be created), they use absolute paths. They should use relative paths instead (see the sketch after Problem 3). Both can break when files are moved, but absolute paths will always break.

Problem 3: When we create a reference, the source path is shortened to ".", representing the file the reference lives in. We need to add logic in resolve_ref to handle links.
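
As a rough illustration of Problem 2, the link target could be stored relative to the file that contains the link (a minimal sketch; make_link_path is a hypothetical helper, not the actual hdmf-zarr API):

import os

def make_link_path(source_store_path, target_store_path):
    # Hypothetical helper: store the link target relative to the directory of the
    # file that contains the link. A relative path survives moving the two files
    # together; an absolute path always breaks on a move.
    return os.path.relpath(os.path.abspath(target_store_path),
                           start=os.path.dirname(os.path.abspath(source_store_path)))

# e.g., a link from /data/FileB.zarr to /data/FileC.zarr would be stored as 'FileC.zarr'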

What to do while this is being fixed:

Always use write_args={'link_data': False}.
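
For example, to export while copying linked datasets (a minimal sketch; the file names are hypothetical and manager stands for an existing BuildManager):

with ZarrIO('FileA.zarr', manager=manager, mode='r') as read_io:
    with ZarrIO('FileB.zarr', mode='w') as export_io:
        # link_data=False forces linked datasets to be copied into the exported file
        export_io.export(src_io=read_io, write_args={'link_data': False})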
I will divide the problem into stages:

Stage 1 (PR 1): Add the updated export logic to write_dataset

Stage 2 (PR 2): Add logic to resolve_ref to resolve references in links

Stage 3 (PR 3): Edge-case test suite

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@codecov-commenter commented May 20, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.88%. Comparing base (8ca5787) to head (16876ca).

Files                      Patch %   Lines
src/hdmf_zarr/backend.py   71.42%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #194      +/-   ##
==========================================
- Coverage   87.11%   86.88%   -0.23%     
==========================================
  Files           5        5              
  Lines        1172     1182      +10     
  Branches      286      289       +3     
==========================================
+ Hits         1021     1027       +6     
- Misses        100      103       +3     
- Partials       51       52       +1     

☔ View full report in Codecov by Sentry.

@mavaylon1 (Contributor, Author) commented:

More notes:
FileA is an HDF5 file and is exported to FileB, a Zarr file.
FileA has internal links, external links, and references (which are always internal for us). I remember you saying we don't do external links, but maybe my memory is off. I ask because the export documentation talks about external links.
Because the backends are different, everything is copied over. Does that mean that during export, FileB will hold copies of everything FileA linked to externally? That would also mean every internal link and every reference is preserved.
Now I export FileB to FileC, which is still Zarr, but I also add a few containers and append to existing containers. What currently happens in Zarr is that the new data is added correctly; however, unless we explicitly request a copy, FileC now gets a link back to FileB. This is wrong: it is not what export is supposed to do, since external links break easily when files are moved (which is why I ask about external links and why they are discussed in the tutorial). What we need to do is copy everything into FileC, preserving internal links and references.

  • If FileA contains an external link to a dataset in FileX, then FileB should also contain an external link to the dataset in FileX.
  • The same holds for FileB and FileC.
  • If FileB is read, a new external link is added, and the file is then exported to FileC, the link is written as an external link.
  • If FileB is read, a new external link is added, and the file is then exported to FileC with write_args={'link_data': False}, the linked dataset is copied.
  • This is a very specific, niche case.
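
A minimal sketch of the last two cases (file names hypothetical; the calls mirror the test code quoted later in this thread):

# Case 1: default export writes the newly added external link as an external link in FileC
with ZarrIO('FileB.zarr', manager=manager, mode='r') as read_io:
    fileb = read_io.read()
    # ... add a container whose dataset is an external link ...
    with ZarrIO('FileC.zarr', mode='w') as export_io:
        export_io.export(src_io=read_io, container=fileb)

# Case 2: with link_data=False, the linked dataset is copied into FileC instead
with ZarrIO('FileB.zarr', manager=manager, mode='r') as read_io:
    fileb = read_io.read()
    with ZarrIO('FileC.zarr', mode='w') as export_io:
        export_io.export(src_io=read_io, container=fileb,
                         write_args={'link_data': False})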

@mavaylon1 (Contributor, Author) commented:

Goal for this PR:

  1. We want to make sure external links are copied on export, not preserved as links.
  • When we export Zarr File A to Zarr File B and add data to existing containers, none of the results should be links. (This is what Alessio needs.)

@oruebel (Contributor) commented Aug 1, 2024

Just in case this is relevant for this PR: the following test cases mirror tests from HDMF but were disabled in the hdmf_zarr test suite because links on export didn't fully work. If this PR fixes that, then we should also look at updating these tests.

def test_append_data(self):
    """Test that exporting a written container after adding groups, links, and references to it works."""
    # TODO: This test currently fails because I do not understand how the link to my_data is expected to be
    #       created here. I.e., it fails in list_fill, but we should actually create an external link instead.
    pass
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io:
        read_foofile = read_io.read()

        # create a foo with link to existing dataset my_data, add the foo to new foobucket
        # this should make a soft link within the exported file
        # TODO: Assigning my_data is the problem, which in turn causes the export to fail because the Zarr
        #       DataType is not being understood. This is where the external link should be created instead?
        foo2 = Foo('foo2', read_foofile.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
        foobucket2 = FooBucket('bucket2', [foo2])
        read_foofile.add_bucket(foobucket2)

        # also add link from foofile to new foo2 container
        read_foofile.foo_link = foo2

        # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
        read_foofile.foofile_data = foo2.my_data

        # also add reference from foofile to new foo2
        read_foofile.foo_ref_attr = foo2

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_foofile)

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='r') as read_io:
        read_foofile2 = read_io.read()

        # test new soft link to dataset in file
        self.assertIs(read_foofile2.buckets['bucket1'].foos['foo1'].my_data,
                      read_foofile2.buckets['bucket2'].foos['foo2'].my_data)

        # test new soft link to group in file
        self.assertIs(read_foofile2.foo_link, read_foofile2.buckets['bucket2'].foos['foo2'])

        # test new soft link to new soft link to dataset in file
        self.assertIs(read_foofile2.buckets['bucket1'].foos['foo1'].my_data, read_foofile2.foofile_data)

        # test new attribute reference to new group in file
        self.assertIs(read_foofile2.foo_ref_attr, read_foofile2.buckets['bucket2'].foos['foo2'])

    # with File(self.store_paths[1], 'r') as f:
    #     self.assertEqual(f['foofile_data'].file.filename, self.store_paths[1])
    #     self.assertIsInstance(f.attrs['foo_ref_attr'], h5py.Reference)
    """
def test_append_external_link_data(self):
    """Test that exporting a written container after adding a link with link_data=True creates external links."""
    pass  # TODO: This test currently fails
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    foofile2 = FooFile(buckets=[])

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile2)

    manager = get_foo_buildmanager()
    with ZarrIO(self.store_paths[0], manager=manager, mode='r') as read_io1:
        read_foofile1 = read_io1.read()

        with ZarrIO(self.store_paths[1], manager=manager, mode='r') as read_io2:
            read_foofile2 = read_io2.read()

            # create a foo with link to existing dataset my_data (not in same file), add the foo to new foobucket
            # this should make an external link within the exported file
            foo2 = Foo('foo2', read_foofile1.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
            foobucket2 = FooBucket('bucket2', [foo2])
            read_foofile2.add_bucket(foobucket2)

            # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
            # this should make an external link within the exported file
            read_foofile2.foofile_data = foo2.my_data

            with ZarrIO(self.store_paths[2], mode='w') as export_io:
                export_io.export(src_io=read_io2, container=read_foofile2)

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io1:
        read_foofile3 = read_io1.read()

        with ZarrIO(self.store_paths[2], manager=get_foo_buildmanager(), mode='r') as read_io2:
            read_foofile4 = read_io2.read()

            self.assertEqual(read_foofile4.buckets['bucket2'].foos['foo2'].my_data,
                             read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertEqual(read_foofile4.foofile_data, read_foofile3.buckets['bucket1'].foos['foo1'].my_data)

    # with File(self.source_paths[2], 'r') as f:
    #     self.assertEqual(f['buckets/bucket2/foo_holder/foo2/my_data'].file.filename, self.source_paths[0])
    #     self.assertEqual(f['foofile_data'].file.filename, self.source_paths[0])
    #     self.assertIsInstance(f.get('buckets/bucket2/foo_holder/foo2/my_data', getlink=True),
    #                           h5py.ExternalLink)
    #     self.assertIsInstance(f.get('foofile_data', getlink=True), h5py.ExternalLink)
    """
def test_append_external_link_copy_data(self):
    """Test that exporting a written container after adding a link with link_data=False copies the data."""
    pass  # TODO: This test currently fails
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    foofile2 = FooFile(buckets=[])

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile2)

    manager = get_foo_buildmanager()
    with ZarrIO(self.store_paths[0], manager=manager, mode='r') as read_io1:
        read_foofile1 = read_io1.read()

        with ZarrIO(self.store_paths[1], manager=manager, mode='r') as read_io2:
            read_foofile2 = read_io2.read()

            # create a foo with link to existing dataset my_data (not in same file), add the foo to new foobucket
            # this would normally make an external link but because link_data=False, data will be copied
            foo2 = Foo('foo2', read_foofile1.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
            foobucket2 = FooBucket('bucket2', [foo2])
            read_foofile2.add_bucket(foobucket2)

            # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
            # this would normally make an external link but because link_data=False, data will be copied
            read_foofile2.foofile_data = foo2.my_data

            with ZarrIO(self.store_paths[2], mode='w') as export_io:
                export_io.export(src_io=read_io2, container=read_foofile2, write_args={'link_data': False})

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io1:
        read_foofile3 = read_io1.read()

        with ZarrIO(self.store_paths[2], manager=get_foo_buildmanager(), mode='r') as read_io2:
            read_foofile4 = read_io2.read()

            # check that file can be read
            self.assertNotEqual(read_foofile4.buckets['bucket2'].foos['foo2'].my_data,
                                read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertNotEqual(read_foofile4.foofile_data, read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertNotEqual(read_foofile4.foofile_data, read_foofile4.buckets['bucket2'].foos['foo2'].my_data)

    # with File(self.source_paths[2], 'r') as f:
    #     self.assertEqual(f['buckets/bucket2/foo_holder/foo2/my_data'].file.filename, self.source_paths[2])
    #     self.assertEqual(f['foofile_data'].file.filename, self.source_paths[2])
    """
def test_export_dset_refs(self):
    """Test that exporting a written container with a dataset of references works."""
    pass  # TODO: This test currently fails
    """
    bazs = []
    num_bazs = 10
    for i in range(num_bazs):
        bazs.append(Baz(name='baz%d' % i))
    baz_data = BazData(name='baz_data1', data=bazs)
    bucket = BazBucket(name='bucket1', bazs=bazs.copy(), baz_data=baz_data)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='w') as write_io:
        write_io.write(bucket)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket1 = read_io.read()

        # NOTE: reference IDs might be the same between two identical files
        # adding a Baz with a smaller name should change the reference IDs on export
        new_baz = Baz(name='baz000')
        read_bucket1.add_baz(new_baz)

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_bucket1)

    with ZarrIO(self.store_paths[1], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket2 = read_io.read()

        # remove and check the appended child, then compare the read container with the original
        read_new_baz = read_bucket2.remove_baz('baz000')
        self.assertContainerEqual(new_baz, read_new_baz, ignore_hdmf_attrs=True)
        self.assertContainerEqual(bucket, read_bucket2, ignore_name=True, ignore_hdmf_attrs=True)

        for i in range(num_bazs):
            baz_name = 'baz%d' % i
            self.assertIs(read_bucket2.baz_data.data[i], read_bucket2.bazs[baz_name])
    """
def test_export_cpd_dset_refs(self):
    """Test that exporting a written container with a compound dataset with references works."""
    pass  # TODO: This test currently fails
    """
    bazs = []
    baz_pairs = []
    num_bazs = 10
    for i in range(num_bazs):
        b = Baz(name='baz%d' % i)
        bazs.append(b)
        baz_pairs.append((i, b))
    baz_cpd_data = BazCpdData(name='baz_cpd_data1', data=baz_pairs)
    bucket = BazBucket(name='bucket1', bazs=bazs.copy(), baz_cpd_data=baz_cpd_data)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='w') as write_io:
        write_io.write(bucket)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket1 = read_io.read()

        # NOTE: reference IDs might be the same between two identical files
        # adding a Baz with a smaller name should change the reference IDs on export
        new_baz = Baz(name='baz000')
        read_bucket1.add_baz(new_baz)

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_bucket1)

    with ZarrIO(self.store_paths[1], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket2 = read_io.read()

        # remove and check the appended child, then compare the read container with the original
        read_new_baz = read_bucket2.remove_baz(new_baz.name)
        self.assertContainerEqual(new_baz, read_new_baz, ignore_hdmf_attrs=True)
        self.assertContainerEqual(bucket, read_bucket2, ignore_name=True, ignore_hdmf_attrs=True)

        for i in range(num_bazs):
            baz_name = 'baz%d' % i
            self.assertEqual(read_bucket2.baz_cpd_data.data[i][0], i)
            self.assertIs(read_bucket2.baz_cpd_data.data[i][1], read_bucket2.bazs[baz_name])
    """

# TODO: Fails because we need to copy the data from the ExternalLink as it points to a non-Zarr source
"""
class TestFooExternalLinkHDF5ToZarr(MixinTestCaseConvert, TestCase):

    IGNORE_NAME = True
    IGNORE_HDMF_ATTRS = True
    IGNORE_STRING_TO_BYTE = False

    def get_manager(self):
        return get_foo_buildmanager()

    def setUpContainer(self):
        # Create the first file container. We will overwrite this later with the external link container
        foo1 = Foo('foo1', [0, 1, 2, 3, 4], "I am foo1", 17, 3.14)
        bucket1 = FooBucket('bucket1', [foo1])
        foofile1 = FooFile(buckets=[bucket1])
        return foofile1

    def roundtripExportContainer(self):
        # Write the HDF5 file
        first_filename = 'test_firstfile_%s.hdmf' % self.container_type
        self.filenames.append(first_filename)
        with HDF5IO(first_filename, manager=self.get_manager(), mode='w') as write_io:
            write_io.write(self.container, cache_spec=True)

        # Create the second file with an external link added (this is the file we use as reference)
        with HDF5IO(first_filename, manager=self.get_manager(), mode='r') as read_io:
            read_foo = read_io.read()
            foo2 = Foo('foo2', read_foo.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 34, 6.28)
            bucket2 = FooBucket('bucket2', [foo2])
            foofile2 = FooFile(buckets=[bucket2])
            self.container = foofile2  # This is what we need to compare against
            with HDF5IO(self.filename, manager=self.get_manager(), mode='w') as write_io:
                write_io.write(foofile2, cache_spec=True)

        # Export the file with the external link to Zarr
        with HDF5IO(self.filename, manager=self.get_manager(), mode='r') as read_io:
            with ZarrIO(self.export_filename, mode='w') as export_io:
                export_io.export(src_io=read_io, write_args={'link_data': False})

        read_io = ZarrIO(self.export_filename, manager=self.get_manager(), mode='r')
        self.ios.append(read_io)
        exportContainer = read_io.read()
        return exportContainer
"""
# TODO: Fails because ZarrIO fails to properly create the external link
"""
class TestFooExternalLinkZarrToHDF5(MixinTestCaseConvert, TestCase):

    IGNORE_NAME = True
    IGNORE_HDMF_ATTRS = True
    IGNORE_STRING_TO_BYTE = False

    def get_manager(self):
        return get_foo_buildmanager()

    def setUpContainer(self):
        # Create the first file container. We will overwrite this later with the external link container
        foo1 = Foo('foo1', [0, 1, 2, 3, 4], "I am foo1", 17, 3.14)
        bucket1 = FooBucket('bucket1', [foo1])
        foofile1 = FooFile(buckets=[bucket1])
        return foofile1

    def roundtripExportContainer(self):
        # Write the Zarr file
        first_filename = 'test_firstfile_%s.hdmf' % self.container_type
        self.filenames.append(first_filename)
        with ZarrIO(first_filename, manager=self.get_manager(), mode='w') as write_io:
            write_io.write(self.container, cache_spec=True)

        # Create the second file with an external link added (this is the file we use as reference)
        with ZarrIO(first_filename, manager=self.get_manager(), mode='r') as read_io:
            read_foo = read_io.read()
            foo2 = Foo('foo2', read_foo.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 34, 6.28)
            bucket2 = FooBucket('bucket2', [foo2])
            foofile2 = FooFile(buckets=[bucket2])
            self.container = foofile2  # This is what we need to compare against
            with ZarrIO(self.filename, manager=self.get_manager(), mode='w') as write_io:
                write_io.write(foofile2, cache_spec=True)

        # Export the file with the external link to HDF5
        with ZarrIO(self.filename, manager=self.get_manager(), mode='r') as read_io:
            with HDF5IO(self.export_filename, mode='w') as export_io:
                export_io.export(src_io=read_io, write_args={'link_data': False})

        read_io = HDF5IO(self.export_filename, manager=self.get_manager(), mode='r')
        self.ios.append(read_io)
        exportContainer = read_io.read()
        return exportContainer
"""

@mavaylon1 (Contributor, Author) commented:

> Just in case this is relevant for this PR: the following test cases mirror tests from HDMF but were disabled in the hdmf_zarr test suite because links on export didn't fully work. If this PR fixes that, then we should also look at updating these tests.

Good to know. I believe my tests are similar, if not the same ones. Thanks for pointing this out so we don't have duplicates.

@mavaylon1 (Contributor, Author) commented:

Related Issues: #179 #205
