UnixFs Stat Command Returns Invalid Data #580

Open
jtsmedley opened this issue Aug 8, 2024 · 2 comments
Labels
need/analysis Needs further analysis before proceeding

Comments

@jtsmedley
Contributor

stat does not return the correct dagSize for non-empty directories. It appears that this line is using the byteLength of the UnixFS data in the directory block. As a result, stat on a non-empty directory returns a value that does not match the one Kubo reports.

SgtPooki added the need/maintainers-input (Needs input from the current maintainer(s)) label Aug 15, 2024
@SgtPooki
Member

A UnixFS entry should have a metadata block that contains the sizes of its children, so we should be able to use what is in that metadata directly, or sum the sizes from the children's metadata.
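A minimal sketch of the "sum the sizes of the children" idea, assuming the expected figure is the serialized root block plus the Tsize recorded on each dag-pb link (the same definition the file case uses later in this thread). The `DagLink` and `DagNode` shapes and the `directoryDagSize` helper are hypothetical and only carry the fields needed here; they are not the Helia implementation.

```typescript
// Hypothetical, simplified shapes: only the fields this sketch needs.
interface DagLink {
  Name: string
  // cumulative size of the linked DAG, as recorded by whoever built the parent
  Tsize?: number
}

interface DagNode {
  Links: DagLink[]
}

// Size of the serialized root block plus the Tsize recorded on each link.
// Links without a Tsize contribute nothing, so the result is a lower bound.
function directoryDagSize (rootBlock: Uint8Array, node: DagNode): bigint {
  let total = BigInt(rootBlock.byteLength)

  for (const link of node.Links) {
    total += BigInt(link.Tsize ?? 0)
  }

  return total
}
```

This reads only the root block, so it is cheap, but it trusts the Tsize values written when the directory was created rather than verifying them against the child blocks.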

SgtPooki added the need/analysis (Needs further analysis before proceeding) label and removed the need/maintainers-input (Needs input from the current maintainer(s)) label Aug 15, 2024
@achingbrain
Member

I think the API has some rough edges here.

Of note:

  • fileSize/localFileSize: these don't apply to directories.
  • dagSize/localDagSize: these are impossible to calculate for directories unless all blocks are present in the blockstore, since the size of a directory is not stored in the root DAG node for that directory. You have to traverse the DAG, calculating block sizes as you go, which can be expensive.
  • blocks: in the equivalent Kubo API call, this seems to be the number of Links in the root dag-pb node that resolve to files (sub dirs are ignored). If the directory is large enough to become sharded, it is the number of sub-shards instead, so it's not terribly useful. Helia just returns 1 for directories since there's only ever one root block, though a sharded directory could have plenty of sub-shards, so that's not terribly useful either.
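To make the traversal cost concrete, here is a rough sketch of walking a DAG held in a local store and counting what is present. The `Block` shape, the string CIDs and the `walkDag` helper are all hypothetical stand-ins for a real blockstore and codec; counting each unique block once is a choice made for this sketch, not something the thread specifies.

```typescript
// Hypothetical in-memory model: keys are CID strings, values are decoded blocks.
interface Block {
  bytes: Uint8Array
  links: string[]
}

interface DagWalkResult {
  // false if any linked block was missing from the store
  complete: boolean
  // number of unique blocks found locally
  blocks: bigint
  // total bytes of the unique blocks found locally
  localDagSize: bigint
}

function walkDag (root: string, store: Map<string, Block>): DagWalkResult {
  const result: DagWalkResult = {
    complete: true,
    blocks: BigInt(0),
    localDagSize: BigInt(0)
  }
  const seen = new Set<string>()
  const pending: string[] = [root]

  while (pending.length > 0) {
    const cid = pending.pop() as string

    // blocks linked from several parents are only counted once
    if (seen.has(cid)) {
      continue
    }
    seen.add(cid)

    const block = store.get(cid)

    if (block == null) {
      // a missing block means the totals are only a partial view
      result.complete = false
      continue
    }

    result.blocks += BigInt(1)
    result.localDagSize += BigInt(block.bytes.byteLength)
    pending.push(...block.links)
  }

  return result
}
```

Every block in the DAG has to be loaded and decoded once, which is why this kind of statistic is worth separating from the cheap, root-block-only fields.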

Perhaps we should break up the UnixFSStats interface?

// these involve traversing a DAG so are expensive to calculate
interface UnixFSDAGStats {
  // if all blocks for this DAG are in the blockstore
  complete: boolean

  // how many blocks make up the DAG - directories, sub-shards, files, leaf nodes, etc
  // - only accurate if `complete` is `true`
  blocks: bigint

  // how many bytes of the DAG are in the blockstore
  localDagSize: bigint

  // how many bytes of the file/directory are in the blockstore
  localSize: bigint
}

// a file is a DAG-PB node with one or more linked nodes that contain file data or links to other nodes
interface UnixFSFileStats {
  type: 'file'
  cid: CID

  // how big the DAG that holds the file is in bytes (e.g. the sum of all dag-link Tsize fields plus the
  // serialized size of the root node)
  dagSize: bigint

  // UnixFS metadata (has mtime/mode/block count/file size)
  unixfs: UnixFS
}

interface ExtendedUnixFSFileStats extends UnixFSFileStats, UnixFSDAGStats {
}

// a directory is a DAG-PB node with links to other DAG-PB or raw nodes
// if the unixfs type is `directory`, each linked node is a file, a raw block or a directory
// if the type is `hamt-sharded-directory`, each linked node is a file, a raw block, a directory or a sub-shard
interface UnixFSDirectoryStats {
  type: 'directory'
  cid: CID
  
  // UnixFS metadata (has mtime/mode)
  unixfs: UnixFS
}

// these involve traversing the DAG so are expensive to calculate
interface ExtendedUnixFSDirectoryStats extends UnixFSDirectoryStats, UnixFSDAGStats {
  // the size of all files in the directory including in subdirectories
  size: bigint
}

// a raw entry is a bare block that contains file data
interface UnixFSRawStats {
  type: 'raw'
  cid: CID

  // how big the block is
  size: bigint
}

type UnixFSStats = UnixFSFileStats | UnixFSDirectoryStats | UnixFSRawStats
type ExtendedUnixFSStats = ExtendedUnixFSFileStats | ExtendedUnixFSDirectoryStats | UnixFSRawStats
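
One consequence of the split is that callers would narrow on `type` before reading a size. A small sketch of that narrowing, using a cut-down hypothetical `Stats` union that keeps only the discriminant and the size fields from the proposal (the `cid` and `unixfs` fields are left out so the example stands alone):

```typescript
// Hypothetical, reduced version of the proposed union.
type Stats =
  | { type: 'file', dagSize: bigint }
  | { type: 'directory' }
  | { type: 'raw', size: bigint }

// Returns the size that is known from the root block alone, if any.
function knownSize (stats: Stats): bigint | undefined {
  switch (stats.type) {
    case 'file':
      return stats.dagSize
    case 'raw':
      return stats.size
    case 'directory':
      // basic directory stats carry no size, the extended form would be needed
      return undefined
  }
}
```

With this shape the compiler rejects code that reads `dagSize` from a directory result, which is the case the original report ran into.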

The @helia/unixfs interface would end up changing to something like:

interface StatOptions {
  offline?: boolean
  // ...same as currently
}

interface ExtendedStatOptions {
  extended: true // not sure if this is the right name
  // ...same as currently
}

fs.stat(cid: CID, options?: StatOptions): Promise<UnixFSStats>
fs.stat(cid: CID, options: ExtendedStatOptions): Promise<ExtendedUnixFSStats>
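
A sketch of how the two signatures could be expressed as TypeScript overloads so that the return type follows the options. Everything here is a hypothetical stub: the `BasicStats` and `ExtendedStats` types are reduced stand-ins, `cid` is a plain string, and the function body returns fixed values instead of reading any blocks.

```typescript
interface StatOptions {
  offline?: boolean
}

interface ExtendedStatOptions extends StatOptions {
  extended: true
}

// Reduced stand-ins for the proposed stats types.
interface BasicStats {
  type: 'directory'
  cid: string
}

interface ExtendedStats extends BasicStats {
  complete: boolean
  blocks: bigint
  size: bigint
}

// The extended overload is listed first so `{ extended: true }` selects it.
function stat (cid: string, options: ExtendedStatOptions): Promise<ExtendedStats>
function stat (cid: string, options?: StatOptions): Promise<BasicStats>
async function stat (cid: string, options?: StatOptions | ExtendedStatOptions): Promise<BasicStats | ExtendedStats> {
  const basic: BasicStats = { type: 'directory', cid }

  if (options != null && 'extended' in options) {
    // a real implementation would traverse the DAG here
    return { ...basic, complete: true, blocks: BigInt(1), size: BigInt(0) }
  }

  return basic
}
```

With these overloads, `await stat(cid)` is typed as `BasicStats`, while `await stat(cid, { extended: true })` is typed as `ExtendedStats`, so the expensive fields are only visible to callers who asked for them.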

3 participants