Broken Links when Exporting #194

Draft · wants to merge 23 commits into dev
Conversation

@mavaylon1 (Contributor) commented May 16, 2024

Motivation

What was the reasoning behind this change? Please explain the changes briefly.
Export is not supposed to create links to the prior file; rather, it is just meant to offer the option to preserve existing links. This means that if File A has links to some File C, then when we export File A to File B, File B will also have links to File C.

Problem 1: HDMF-Zarr is missing the logic that HDMF has in write_dataset, i.e., the conditionals that handle links during export.

Problem 2: When links are created (in the cases where they are supposed to be created), they use absolute paths. They should use relative paths instead (see the sketch after Problem 3). Both can break when files are moved, but absolute paths will always break.

Problem 3: When we create a reference, the source path is shortened to ".", representing the file the reference lives in. We need to add logic in resolve_ref to handle links.
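
As a rough illustration of Problem 2, the link target could be stored relative to the file that contains the link (a minimal sketch; make_link_path is a hypothetical helper, not the actual hdmf-zarr API):

import os

def make_link_path(source_store_path, target_store_path):
    # Hypothetical helper: store the link target relative to the directory of the
    # file that contains the link. A relative path survives moving the two files
    # together; an absolute path always breaks on a move.
    return os.path.relpath(os.path.abspath(target_store_path),
                           start=os.path.dirname(os.path.abspath(source_store_path)))

# e.g., a link from /data/FileB.zarr to /data/FileC.zarr would be stored as 'FileC.zarr'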

What to do while this is being fixed:

Always use write_args={'link_data': False}.
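
For example, to export while copying linked datasets (a minimal sketch; the file names are hypothetical and manager stands for an existing BuildManager):

with ZarrIO('FileA.zarr', manager=manager, mode='r') as read_io:
    with ZarrIO('FileB.zarr', mode='w') as export_io:
        # link_data=False forces linked datasets to be copied into the exported file
        export_io.export(src_io=read_io, write_args={'link_data': False})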
I will divide the problem into stages:

Stage 1 (PR 1): Add the updated export logic to write_dataset

Stage 2 (PR 2): Add logic to resolve_ref to resolve references in links

Stage 3 (PR 3): Edge-case test suite

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@codecov-commenter commented May 20, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.88%. Comparing base (8ca5787) to head (16876ca).

Files                      Patch %   Lines
src/hdmf_zarr/backend.py   71.42%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #194      +/-   ##
==========================================
- Coverage   87.11%   86.88%   -0.23%     
==========================================
  Files           5        5              
  Lines        1172     1182      +10     
  Branches      286      289       +3     
==========================================
+ Hits         1021     1027       +6     
- Misses        100      103       +3     
- Partials       51       52       +1     

☔ View full report in Codecov by Sentry.

@mavaylon1 (Contributor, Author) commented:

More notes:
FileA is an HDF5 file and is exported to FileB, a Zarr file.
FileA has internal links, external links, and references (which are always internal for us). I remember you saying we don't do external links, but maybe my memory is off. I ask because the export documentation talks about external links.
Because the backends are different, everything is copied over. Does that mean that during export, FileB will hold copies of everything FileA linked to externally? That would also mean every internal link and every reference is preserved.
Now I export FileB to FileC, which is still Zarr, but I also add a few containers and append to existing containers. What currently happens in Zarr is that the new data is added correctly; however, unless we explicitly request a copy, FileC now gets a link back to FileB. This is wrong: it is not what export is supposed to do, since external links break easily when files are moved (which is why I ask about external links and why they are discussed in the tutorial). What we need to do is copy everything into FileC, preserving internal links and references.

  • If FileA contains an external link to a dataset in FileX, then FileB should also contain an external link to the dataset in FileX.
  • The same holds for FileB and FileC.
  • If FileB is read, a new external link is added, and the file is then exported to FileC, the link is written as an external link.
  • If FileB is read, a new external link is added, and the file is then exported to FileC with write_args={'link_data': False}, the linked dataset is copied.
  • This is a very specific, niche case.
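
A minimal sketch of the last two cases (file names hypothetical; the calls mirror the test code quoted later in this thread):

# Case 1: default export writes the newly added external link as an external link in FileC
with ZarrIO('FileB.zarr', manager=manager, mode='r') as read_io:
    fileb = read_io.read()
    # ... add a container whose dataset is an external link ...
    with ZarrIO('FileC.zarr', mode='w') as export_io:
        export_io.export(src_io=read_io, container=fileb)

# Case 2: with link_data=False, the linked dataset is copied into FileC instead
with ZarrIO('FileB.zarr', manager=manager, mode='r') as read_io:
    fileb = read_io.read()
    with ZarrIO('FileC.zarr', mode='w') as export_io:
        export_io.export(src_io=read_io, container=fileb,
                         write_args={'link_data': False})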

@mavaylon1 (Contributor, Author) commented:

Goal for this PR:

  1. We want to make sure external links are copied on export, not preserved as links.
  • When we export Zarr File A to Zarr File B and add data to existing containers, none of the results should be links. (This is what Alessio needs.)

@oruebel (Contributor) commented Aug 1, 2024

Just in case this is relevant for this PR: the following test cases mirror tests from HDMF but were disabled in the hdmf_zarr test suite because links on export didn't fully work. If this PR fixes that, then we should also look at updating these tests.

def test_append_data(self):
    """Test that exporting a written container after adding groups, links, and references to it works."""
    # TODO: This test currently fails because I do not understand how the link to my_data is expected to be
    #       created here. I.e., it fails in list_fill, but we should actually create an external link instead.
    pass
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io:
        read_foofile = read_io.read()

        # create a foo with link to existing dataset my_data, add the foo to new foobucket
        # this should make a soft link within the exported file
        # TODO: Assigning my_data is the problem, which in turn causes the export to fail because the Zarr
        #       DataType is not being understood. This is where the external link should be created instead?
        foo2 = Foo('foo2', read_foofile.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
        foobucket2 = FooBucket('bucket2', [foo2])
        read_foofile.add_bucket(foobucket2)

        # also add link from foofile to new foo2 container
        read_foofile.foo_link = foo2

        # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
        read_foofile.foofile_data = foo2.my_data

        # also add reference from foofile to new foo2
        read_foofile.foo_ref_attr = foo2

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_foofile)

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='r') as read_io:
        read_foofile2 = read_io.read()

        # test new soft link to dataset in file
        self.assertIs(read_foofile2.buckets['bucket1'].foos['foo1'].my_data,
                      read_foofile2.buckets['bucket2'].foos['foo2'].my_data)

        # test new soft link to group in file
        self.assertIs(read_foofile2.foo_link, read_foofile2.buckets['bucket2'].foos['foo2'])

        # test new soft link to new soft link to dataset in file
        self.assertIs(read_foofile2.buckets['bucket1'].foos['foo1'].my_data, read_foofile2.foofile_data)

        # test new attribute reference to new group in file
        self.assertIs(read_foofile2.foo_ref_attr, read_foofile2.buckets['bucket2'].foos['foo2'])

    # with File(self.store_paths[1], 'r') as f:
    #     self.assertEqual(f['foofile_data'].file.filename, self.store_paths[1])
    #     self.assertIsInstance(f.attrs['foo_ref_attr'], h5py.Reference)
    """
def test_append_external_link_data(self):
    """Test that exporting a written container after adding a link with link_data=True creates external links."""
    pass  # TODO: This test currently fails
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    foofile2 = FooFile(buckets=[])

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile2)

    manager = get_foo_buildmanager()
    with ZarrIO(self.store_paths[0], manager=manager, mode='r') as read_io1:
        read_foofile1 = read_io1.read()

        with ZarrIO(self.store_paths[1], manager=manager, mode='r') as read_io2:
            read_foofile2 = read_io2.read()

            # create a foo with link to existing dataset my_data (not in same file), add the foo to new foobucket
            # this should make an external link within the exported file
            foo2 = Foo('foo2', read_foofile1.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
            foobucket2 = FooBucket('bucket2', [foo2])
            read_foofile2.add_bucket(foobucket2)

            # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
            # this should make an external link within the exported file
            read_foofile2.foofile_data = foo2.my_data

            with ZarrIO(self.store_paths[2], mode='w') as export_io:
                export_io.export(src_io=read_io2, container=read_foofile2)

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io1:
        read_foofile3 = read_io1.read()

        with ZarrIO(self.store_paths[2], manager=get_foo_buildmanager(), mode='r') as read_io2:
            read_foofile4 = read_io2.read()

            self.assertEqual(read_foofile4.buckets['bucket2'].foos['foo2'].my_data,
                             read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertEqual(read_foofile4.foofile_data, read_foofile3.buckets['bucket1'].foos['foo1'].my_data)

    # with File(self.source_paths[2], 'r') as f:
    #     self.assertEqual(f['buckets/bucket2/foo_holder/foo2/my_data'].file.filename, self.source_paths[0])
    #     self.assertEqual(f['foofile_data'].file.filename, self.source_paths[0])
    #     self.assertIsInstance(f.get('buckets/bucket2/foo_holder/foo2/my_data', getlink=True),
    #                           h5py.ExternalLink)
    #     self.assertIsInstance(f.get('foofile_data', getlink=True), h5py.ExternalLink)
    """
def test_append_external_link_copy_data(self):
    """Test that exporting a written container after adding a link with link_data=False copies the data."""
    pass  # TODO: This test currently fails
    """
    foo1 = Foo('foo1', [1, 2, 3, 4, 5], "I am foo1", 17, 3.14)
    foobucket = FooBucket('bucket1', [foo1])
    foofile = FooFile(buckets=[foobucket])

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile)

    foofile2 = FooFile(buckets=[])

    with ZarrIO(self.store_paths[1], manager=get_foo_buildmanager(), mode='w') as write_io:
        write_io.write(foofile2)

    manager = get_foo_buildmanager()
    with ZarrIO(self.store_paths[0], manager=manager, mode='r') as read_io1:
        read_foofile1 = read_io1.read()

        with ZarrIO(self.store_paths[1], manager=manager, mode='r') as read_io2:
            read_foofile2 = read_io2.read()

            # create a foo with link to existing dataset my_data (not in same file), add the foo to new foobucket
            # this would normally make an external link but because link_data=False, data will be copied
            foo2 = Foo('foo2', read_foofile1.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 17, 3.14)
            foobucket2 = FooBucket('bucket2', [foo2])
            read_foofile2.add_bucket(foobucket2)

            # also add link from foofile to new foo2.my_data dataset which is a link to foo1.my_data dataset
            # this would normally make an external link but because link_data=False, data will be copied
            read_foofile2.foofile_data = foo2.my_data

            with ZarrIO(self.store_paths[2], mode='w') as export_io:
                export_io.export(src_io=read_io2, container=read_foofile2, write_args={'link_data': False})

    with ZarrIO(self.store_paths[0], manager=get_foo_buildmanager(), mode='r') as read_io1:
        read_foofile3 = read_io1.read()

        with ZarrIO(self.store_paths[2], manager=get_foo_buildmanager(), mode='r') as read_io2:
            read_foofile4 = read_io2.read()

            # check that file can be read
            self.assertNotEqual(read_foofile4.buckets['bucket2'].foos['foo2'].my_data,
                                read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertNotEqual(read_foofile4.foofile_data, read_foofile3.buckets['bucket1'].foos['foo1'].my_data)
            self.assertNotEqual(read_foofile4.foofile_data, read_foofile4.buckets['bucket2'].foos['foo2'].my_data)

    # with File(self.source_paths[2], 'r') as f:
    #     self.assertEqual(f['buckets/bucket2/foo_holder/foo2/my_data'].file.filename, self.source_paths[2])
    #     self.assertEqual(f['foofile_data'].file.filename, self.source_paths[2])
    """
def test_export_dset_refs(self):
    """Test that exporting a written container with a dataset of references works."""
    pass  # TODO: This test currently fails
    """
    bazs = []
    num_bazs = 10
    for i in range(num_bazs):
        bazs.append(Baz(name='baz%d' % i))
    baz_data = BazData(name='baz_data1', data=bazs)
    bucket = BazBucket(name='bucket1', bazs=bazs.copy(), baz_data=baz_data)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='w') as write_io:
        write_io.write(bucket)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket1 = read_io.read()

        # NOTE: reference IDs might be the same between two identical files
        # adding a Baz with a smaller name should change the reference IDs on export
        new_baz = Baz(name='baz000')
        read_bucket1.add_baz(new_baz)

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_bucket1)

    with ZarrIO(self.store_paths[1], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket2 = read_io.read()

        # remove and check the appended child, then compare the read container with the original
        read_new_baz = read_bucket2.remove_baz('baz000')
        self.assertContainerEqual(new_baz, read_new_baz, ignore_hdmf_attrs=True)
        self.assertContainerEqual(bucket, read_bucket2, ignore_name=True, ignore_hdmf_attrs=True)

        for i in range(num_bazs):
            baz_name = 'baz%d' % i
            self.assertIs(read_bucket2.baz_data.data[i], read_bucket2.bazs[baz_name])
    """
def test_export_cpd_dset_refs(self):
    """Test that exporting a written container with a compound dataset with references works."""
    pass  # TODO: This test currently fails
    """
    bazs = []
    baz_pairs = []
    num_bazs = 10
    for i in range(num_bazs):
        b = Baz(name='baz%d' % i)
        bazs.append(b)
        baz_pairs.append((i, b))
    baz_cpd_data = BazCpdData(name='baz_cpd_data1', data=baz_pairs)
    bucket = BazBucket(name='bucket1', bazs=bazs.copy(), baz_cpd_data=baz_cpd_data)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='w') as write_io:
        write_io.write(bucket)

    with ZarrIO(self.store_paths[0], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket1 = read_io.read()

        # NOTE: reference IDs might be the same between two identical files
        # adding a Baz with a smaller name should change the reference IDs on export
        new_baz = Baz(name='baz000')
        read_bucket1.add_baz(new_baz)

        with ZarrIO(self.store_paths[1], mode='w') as export_io:
            export_io.export(src_io=read_io, container=read_bucket1)

    with ZarrIO(self.store_paths[1], manager=_get_baz_manager(), mode='r') as read_io:
        read_bucket2 = read_io.read()

        # remove and check the appended child, then compare the read container with the original
        read_new_baz = read_bucket2.remove_baz(new_baz.name)
        self.assertContainerEqual(new_baz, read_new_baz, ignore_hdmf_attrs=True)
        self.assertContainerEqual(bucket, read_bucket2, ignore_name=True, ignore_hdmf_attrs=True)

        for i in range(num_bazs):
            baz_name = 'baz%d' % i
            self.assertEqual(read_bucket2.baz_cpd_data.data[i][0], i)
            self.assertIs(read_bucket2.baz_cpd_data.data[i][1], read_bucket2.bazs[baz_name])
    """

# TODO: Fails because we need to copy the data from the ExternalLink as it points to a non-Zarr source
"""
class TestFooExternalLinkHDF5ToZarr(MixinTestCaseConvert, TestCase):

    IGNORE_NAME = True
    IGNORE_HDMF_ATTRS = True
    IGNORE_STRING_TO_BYTE = False

    def get_manager(self):
        return get_foo_buildmanager()

    def setUpContainer(self):
        # Create the first file container. We will overwrite this later with the external link container
        foo1 = Foo('foo1', [0, 1, 2, 3, 4], "I am foo1", 17, 3.14)
        bucket1 = FooBucket('bucket1', [foo1])
        foofile1 = FooFile(buckets=[bucket1])
        return foofile1

    def roundtripExportContainer(self):
        # Write the HDF5 file
        first_filename = 'test_firstfile_%s.hdmf' % self.container_type
        self.filenames.append(first_filename)
        with HDF5IO(first_filename, manager=self.get_manager(), mode='w') as write_io:
            write_io.write(self.container, cache_spec=True)

        # Create the second file with an external link added (this is the file we use as reference)
        with HDF5IO(first_filename, manager=self.get_manager(), mode='r') as read_io:
            read_foo = read_io.read()
            foo2 = Foo('foo2', read_foo.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 34, 6.28)
            bucket2 = FooBucket('bucket2', [foo2])
            foofile2 = FooFile(buckets=[bucket2])
            self.container = foofile2  # This is what we need to compare against
            with HDF5IO(self.filename, manager=self.get_manager(), mode='w') as write_io:
                write_io.write(foofile2, cache_spec=True)

        # Export the file with the external link to Zarr
        with HDF5IO(self.filename, manager=self.get_manager(), mode='r') as read_io:
            with ZarrIO(self.export_filename, mode='w') as export_io:
                export_io.export(src_io=read_io, write_args={'link_data': False})

        read_io = ZarrIO(self.export_filename, manager=self.get_manager(), mode='r')
        self.ios.append(read_io)
        exportContainer = read_io.read()
        return exportContainer
"""
# TODO: Fails because ZarrIO fails to properly create the external link
"""
class TestFooExternalLinkZarrToHDF5(MixinTestCaseConvert, TestCase):

    IGNORE_NAME = True
    IGNORE_HDMF_ATTRS = True
    IGNORE_STRING_TO_BYTE = False

    def get_manager(self):
        return get_foo_buildmanager()

    def setUpContainer(self):
        # Create the first file container. We will overwrite this later with the external link container
        foo1 = Foo('foo1', [0, 1, 2, 3, 4], "I am foo1", 17, 3.14)
        bucket1 = FooBucket('bucket1', [foo1])
        foofile1 = FooFile(buckets=[bucket1])
        return foofile1

    def roundtripExportContainer(self):
        # Write the Zarr file
        first_filename = 'test_firstfile_%s.hdmf' % self.container_type
        self.filenames.append(first_filename)
        with ZarrIO(first_filename, manager=self.get_manager(), mode='w') as write_io:
            write_io.write(self.container, cache_spec=True)

        # Create the second file with an external link added (this is the file we use as reference)
        with ZarrIO(first_filename, manager=self.get_manager(), mode='r') as read_io:
            read_foo = read_io.read()
            foo2 = Foo('foo2', read_foo.buckets['bucket1'].foos['foo1'].my_data, "I am foo2", 34, 6.28)
            bucket2 = FooBucket('bucket2', [foo2])
            foofile2 = FooFile(buckets=[bucket2])
            self.container = foofile2  # This is what we need to compare against
            with ZarrIO(self.filename, manager=self.get_manager(), mode='w') as write_io:
                write_io.write(foofile2, cache_spec=True)

        # Export the file with the external link to HDF5
        with ZarrIO(self.filename, manager=self.get_manager(), mode='r') as read_io:
            with HDF5IO(self.export_filename, mode='w') as export_io:
                export_io.export(src_io=read_io, write_args={'link_data': False})

        read_io = HDF5IO(self.export_filename, manager=self.get_manager(), mode='r')
        self.ios.append(read_io)
        exportContainer = read_io.read()
        return exportContainer
"""

@mavaylon1 (Contributor, Author) commented:

> Just in case this is relevant for this PR: the following test cases mirror tests from HDMF but were disabled in the hdmf_zarr test suite because links on export didn't fully work. If this PR fixes that, then we should also look at updating these tests.

Good to know. I believe my tests are similar, if not the same ones. Thanks for pointing this out so we don't have duplicates.

@mavaylon1 (Contributor, Author) commented:

Related Issues: #179 #205
