
Add documentation for Pyfive #81

Merged
valeriupredoi merged 29 commits into main from add_documentation
Aug 15, 2025

Conversation

@valeriupredoi
Collaborator

@valeriupredoi valeriupredoi commented Jul 17, 2025

Description

We now have a fully working Readthedocs setup with a doc stub that builds well; it's time to add actual documentation. I have created the Install section so far, but we need more stuff.

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • If you are working on the documentation, please ensure the current build passes
  • All tests pass

@codecov

codecov bot commented Jul 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.93%. Comparing base (7772b9b) to head (dd435af).
⚠️ Report is 244 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #81   +/-   ##
=======================================
  Coverage   71.93%   71.93%           
=======================================
  Files          11       11           
  Lines        2423     2423           
  Branches      364      364           
=======================================
  Hits         1743     1743           
  Misses        583      583           
  Partials       97       97           


@valeriupredoi
Collaborator Author

@davidhassell @bnlawrence as you folks saw last Thu, we do have a few bits and bobs of docs for Pyfive (also, it's all nicely deployed to RTD), but this PR is here to add more - please feel free to edit and contribute via this PR when you have a minute 🍺

Collaborator

@kmuehlbauer kmuehlbauer left a comment

I've just had a look here. Thanks @valeriupredoi for these efforts in documentation.

@bnlawrence
Collaborator

@valeriupredoi Apart from the failing check (dunno why, it builds locally), I think this is close to a minimal set of usable documentation. It'd be good to check my parallel example actually works, but apart from that, I think we'd be good to go. Please have a good look at this now.

@valeriupredoi
Collaborator Author

> @valeriupredoi Apart from the failing check (dunno why, it builds locally), I think this is close to a minimal set of usable documentation. It'd be good to check my parallel example actually works, but apart from that, I think we'd be good to go. Please have a good look at this now.

mega! Very cool @bnlawrence - RTD is fussing about this type of warning (there are a few of them):

pyfive.Group.attrs:1: WARNING: duplicate object description of pyfive.Group.attrs, other instance in api_reference, use :no-index: for one of them

I can fix those tomorrow 🍺
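For reference, the warning itself points at the fix: when the same object is documented in two places, one of the directives should carry the `:no-index:` option so only one copy is indexed. A hedged sketch (the exact file and directive used in this repo are assumptions; only `pyfive.Group.attrs` and `:no-index:` come from the warning):

```rst
.. In whichever page documents pyfive.Group.attrs a second time
   (leaving the api_reference copy as the indexed one):

.. autoattribute:: pyfive.Group.attrs
   :no-index:
```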

Collaborator Author

@valeriupredoi valeriupredoi left a comment

some minor comments, but a bigger one is fixing the S3 example 🍺

bnlawrence and others added 3 commits August 15, 2025 09:29
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
bnlawrence and others added 10 commits August 15, 2025 09:31
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Yes, not sure the caching is understood by botocore, I had conflated a couple of different things, best to be explicit that we are providing params for s3fs itself.

Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com>
…ather than just the original docstring copied from h5py during implementation.
    return data.min()

with ThreadPoolExecutor() as executor:
    results = list(executor.map(get_min_of_variable, variable_names))
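For context, the snippet under discussion can be fleshed out into a self-contained sketch of the same map-over-variables pattern. A plain dict of lists stands in for the pyfive file here, since opening a real HDF5 file would need actual data; the variable names and values are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a pyfive.File: in the real docs example, get_min_of_variable
# would read one variable's data from the file (f[name][...]).
FAKE_FILE = {
    "tas": [290.1, 285.3, 301.2],
    "pr": [0.0, 1.5, 0.7],
    "psl": [101325.0, 100900.0, 101100.0],
}
variable_names = list(FAKE_FILE)

def get_min_of_variable(name):
    data = FAKE_FILE[name]  # real code: data = f[name][...]
    return min(data)        # real code: data.min()

# Each variable's minimum is computed in its own thread; executor.map
# preserves the input order of variable_names in the results.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(get_min_of_variable, variable_names))

print("Results:", results)  # Results: [285.3, 0.0, 100900.0]
```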
Collaborator Author

this works! But I have not noticed any time improvement - quite possibly because I had to cut the size of the data (my RAM is not big enough), so what I was loading would be as fast, if not faster, in a single-threaded process

Collaborator

As discussed, I'm slightly concerned by your RAM problems. It's probably OK to leave this (contrived) example in the docs, as all our real examples are too complex to use as exemplars, but we should do some due diligence on why this is happening.

Collaborator Author

this works, as I said - the issues I have with the buffer could well be because I am running out of both RAM and actual disk space, so those may be very user-specific; that's why I said I need to look a lot closer at it

@valeriupredoi
Collaborator Author

superb work @bnlawrence @kmuehlbauer @davidhassell 🍺 x 3

@valeriupredoi valeriupredoi merged commit 9bbd8f6 into main Aug 15, 2025
6 checks passed
@valeriupredoi valeriupredoi deleted the add_documentation branch August 15, 2025 14:03
Collaborator

@kmuehlbauer kmuehlbauer left a comment

Very neat, very nice docs. I've just gone through it and added some suggestions.

Comment on lines +9 to +10
The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files

Collaborator

Suggested change
The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files
The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files.

-------------------------------

Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the
b-tree indexes and attributess are stored contiguously in the file.
Collaborator

Suggested change
b-tree indexes and attributess are stored contiguously in the file.
b-tree indexes and attributes are stored contiguously in the file.

import pyfive

with pyfive.File("data.h5", "r") as f:
variables = [f for var in f]
Collaborator

Suggested change
variables = [f for var in f]
variables = [var for var in f]
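For context on why the comprehension variable matters: iterating a pyfive File or Group (as with h5py) yields the member names, so the loop variable, not the file object, is what belongs in the list. A minimal stand-in sketch, with a plain dict mimicking that mapping-like iteration (the variable names are made up):

```python
# A pyfive.File iterates like a mapping, yielding member (variable) names;
# a plain dict shows the same behaviour without needing a real HDF5 file.
fake_file = {"tas": [1.0, 2.0], "pr": [0.1, 0.2]}

variables = [var for var in fake_file]  # the corrected comprehension
print(variables)  # ['tas', 'pr']
```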

print("Results:", results)


You can do the same thing to parallelise manipuations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document.
Collaborator

Suggested change
You can do the same thing to parallelise manipuations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document.
You can do the same thing to parallelise manipulations within the variables, by for example using ``Dask``, but that is beyond the scope of this document.
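As a rough illustration of what "parallelising manipulations within a variable" means, here is a hedged sketch using only the standard library as a stand-in for the chunk-wise map/reduce that Dask would manage automatically (the data is made up; a real version would operate on the variable's chunks):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1000))  # stand-in for one large variable's values

# Split the variable into chunks and reduce each chunk in its own thread,
# then combine the partial results - the same map/reduce shape Dask uses.
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

with ThreadPoolExecutor() as executor:
    partial_mins = list(executor.map(min, chunks))

overall_min = min(partial_mins)  # combine the per-chunk reductions
print(overall_min)  # 0
```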

return self._index
#### The following method can be used to set pseudo chunking size after the
#### file has been closed and before data transactions. This is pyfive specific
def set_psuedo_chunk_size(self, newsize_MB):
Collaborator

Suggested change
def set_psuedo_chunk_size(self, newsize_MB):
def set_pseudo_chunk_size(self, newsize_MB):

Collaborator

There might be more occurrences?

Collaborator

Sadly you're probably right, but we are hoping to do a wee internal hackathon in autumn, where we might deal with both enums and our test coverage ... so if they're not picked up before then, hopefully proper coverage tests will find the rest.

@kmuehlbauer
Collaborator

Ah, missed by 7 minutes ;-) Please have a look at my comments, whether or not they are useful.

@bnlawrence
Collaborator

@valeriupredoi Can we add these in please?

@valeriupredoi
Collaborator Author

thanks @kmuehlbauer - I'll pop those into a new PR I'll open now 🍺

@valeriupredoi valeriupredoi mentioned this pull request Aug 15, 2025


4 participants