Skip to content

feat: Massive refactor with a new API #45

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 25, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 4 additions & 10 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -2,28 +2,22 @@
name = "multihash"
description = "Implementation of the multihash format"
repository = "https://github.com/multiformats/rust-multihash"

keywords = ["multihash", "ipfs"]

version = "0.9.4"

authors = ["dignifiedquire <dignifiedquire@gmail.com>"]

license = "MIT"

readme = "README.md"

documentation = "https://docs.rs/multihash/"

edition = "2018"

[dependencies]
blake2b_simd = { version = "0.5.9", default-features = false }
blake2s_simd = { version = "0.5.9", default-features = false }
sha1 = "0.5"
sha2 = { version = "0.7", default-features = false }
tiny-keccak = "1.4"
sha1 = "0.6"
sha2 = { version = "0.8", default-features = false }
tiny-keccak = { version = "2.0.0", features = ["keccak", "sha3"] }
unsigned-varint = "0.3"
digest = { version = "0.8", default-features = false }

[dev-dependencies]
quickcheck = "0.9.2"
203 changes: 203 additions & 0 deletions src/digests.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
use std::convert::TryFrom;
use std::{cmp, fmt, hash};

use unsigned_varint::{decode as varint_decode, encode as varint_encode};

use crate::errors::{DecodeError, DecodeOwnedError};
use crate::hashes::Code;
use crate::storage::Storage;

/// Represents a valid multihash.
#[derive(Clone)]
pub struct Multihash {
storage: Storage,
}

impl fmt::Debug for Multihash {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_tuple("Multihash").field(&self.as_bytes()).finish()
}
}

impl PartialEq for Multihash {
fn eq(&self, other: &Self) -> bool {
self.storage.bytes() == other.storage.bytes()
}
}

impl Eq for Multihash {}

impl hash::Hash for Multihash {
fn hash<H: hash::Hasher>(&self, state: &mut H) {
self.storage.bytes().hash(state);
}
}

impl Multihash {
/// Verifies whether `bytes` contains a valid multihash, and if so returns a `Multihash`.
pub fn from_bytes(bytes: Vec<u8>) -> Result<Multihash, DecodeOwnedError> {
if let Err(err) = MultihashRef::from_slice(&bytes) {
return Err(DecodeOwnedError {
error: err,
data: bytes,
});
}
Ok(Multihash {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand it MultihashRef is dropped here but we know bytes are valid or we'd have Err. I do however find it confusing / awkward, probably my lack of Rust experience, makes me react with Parse don't validate

I think parse approach would not work here as MultihashRef contains &[u8], but maybe it make sense to have separate fn that does not construct MultihashRef ?

Maybe this is idiomatic in Rust, but maybe a comment for newbies (like myself) would be worth it.

storage: Storage::from_slice(&bytes),
})
}

/// Returns the bytes representation of the multihash.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize if I'm misunderstanding this but as far as I understand this would return bytes without consuming Multihash as to_vec creates a copy (isn't it). This seems to contradict expectation from into_bytes in Strings.

If my assumption is correct this method probably should either be removed or changed so it consumes multihash to match String behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct. A method named into_ carries with it the expectation that it consumes self and can thus perform the conversion in a cheap way.

For naming guidelines, see https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv

pub fn into_bytes(self) -> Vec<u8> {
self.to_vec()
}

/// Returns the bytes representation of the multihash.
pub fn to_vec(&self) -> Vec<u8> {
Vec::from(self.as_bytes())
}

/// Returns the bytes representation of this multihash.
pub fn as_bytes(&self) -> &[u8] {
self.storage.bytes()
}

/// Builds a `MultihashRef` corresponding to this `Multihash`.
pub fn as_ref(&self) -> MultihashRef {
MultihashRef {
bytes: self.as_bytes(),
}
}

/// Returns which hashing algorithm is used in this multihash.
pub fn algorithm(&self) -> Code {
self.as_ref().algorithm()
}

/// Returns the hashed data.
pub fn digest(&self) -> &[u8] {
self.as_ref().digest()
}
}

impl AsRef<[u8]> for Multihash {
fn as_ref(&self) -> &[u8] {
self.as_bytes()
}
}

impl<'a> PartialEq<MultihashRef<'a>> for Multihash {
fn eq(&self, other: &MultihashRef<'a>) -> bool {
&*self.as_bytes() == other.as_bytes()
}
}

impl TryFrom<Vec<u8>> for Multihash {
type Error = DecodeOwnedError;

fn try_from(value: Vec<u8>) -> Result<Self, Self::Error> {
Multihash::from_bytes(value)
}
}

impl PartialOrd for Multihash {
fn partial_cmp(&self, other: &Self) -> Option<cmp::Ordering> {
Some(self.cmp(other))
}
}

impl Ord for Multihash {
fn cmp(&self, other: &Self) -> cmp::Ordering {
self.as_ref().cmp(&other.as_ref())
}
}

/// Represents a valid multihash.
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct MultihashRef<'a> {
bytes: &'a [u8],
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this hasn't been brought up before, structures like this make me wonder if Multihash should be

struct Multihash<'a> {
  bytes: std::borrow::Cow<'a, [u8]>,
}

This would remove the need for separate Multihash and MultihashRef<'_> being able to support both owning and working on borrowed data. This does bring some less ergonomic points in the longer run but as long as there is just some bytes this might be doable. Perhaps there was some rationale for this split which is not in any of the comments?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a multihash with a lifetime would be a world of hurt in terms of ergonomics. I can already see myself fighting lifetime issues to do utterly trivial things with multihashes and futures.

Also, I kinda dislike Cow since it has some surprising footguns... E.g. rust-lang/rust#50160

I think an Arc<[u8]> would be best.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, to use Cow can bring some interesting fights with borrowck. To use with owned data you could have type OwnedMultihash = Multihash<'static>; and make sure to always deal with that. I understand that current limitations with async fn push many things towards needing to be always owned or basically Arc<_> but at the same time I am not sure if a crate like multihash is a good place decide require anything more than the minimum (slices).

The downside of using std::borrow::Cow would bring the additional requirement of being able to use only Vec as the owned companion, so perhaps keeping the MultihashRef<'_> allows for more use cases.


impl<'a> MultihashRef<'a> {
/// Creates a `MultihashRef` from the given `input`.
pub fn from_slice(input: &'a [u8]) -> Result<Self, DecodeError> {
if input.is_empty() {
return Err(DecodeError::BadInputLength);
}

let (_code, bytes) = varint_decode::u64(&input).map_err(|_| DecodeError::BadInputLength)?;

let (hash_len, bytes) =
varint_decode::u64(&bytes).map_err(|_| DecodeError::BadInputLength)?;
if (bytes.len() as u64) != hash_len {
return Err(DecodeError::BadInputLength);
}

Ok(MultihashRef { bytes: input })
}

/// Returns which hashing algorithm is used in this multihash.
pub fn algorithm(&self) -> Code {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So wrap allows creating a multihash with non standard code, but such multihash will cause panic if it's algorithm() is called ? I think it's better to decide whether custom algorithms are ok or not and stick to that decision than this.

Personally I would prefer to have custom Code support so it's supported but also clear when non-standard version is used.

let (code, _bytes) =
varint_decode::u64(&self.bytes).expect("multihash is known to be valid algorithm");
Code::from_u64(code)
}

/// Returns the hashed data.
pub fn digest(&self) -> &'a [u8] {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather have no way to creating invalid Multihash than having to deal with the possibly invalid multihash across it's methods.

let (_code, bytes) =
varint_decode::u64(&self.bytes).expect("multihash is known to be valid digest");
let (_hash_len, bytes) =
varint_decode::u64(&bytes).expect("multihash is known to be a valid digest");
&bytes[..]
}

/// Builds a `Multihash` that owns the data.
///
/// This operation allocates.
pub fn to_owned(&self) -> Multihash {
Multihash {
storage: Storage::from_slice(self.bytes),
}
}

/// Returns the bytes representation of this multihash.
pub fn as_bytes(&self) -> &'a [u8] {
&self.bytes
}
}

impl<'a> PartialEq<Multihash> for MultihashRef<'a> {
fn eq(&self, other: &Multihash) -> bool {
self.as_bytes() == &*other.as_bytes()
}
}

/// The `MultihashDigest` trait specifies an interface common for all multihash functions.
pub trait MultihashDigest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the functional benefit to having the builtin functionality tied to this trait? Seems unnecessary to have to import the MultihashDigest trait just to use a builtin hash. I only see the need for the DynMultihashDigest trait, unless I'm missing the use case for this trait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this trait as the thing you would actually want. The DynMultihashDigest is a workaround around Rust limitations.

What do others think? Does it make sense to have both traits, or should we only have the hacky DynMultihashDigest?

Copy link
Contributor

@austinabell austinabell Feb 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear about what I am suggesting: austinabell@131cc94

Just quickly put together, could add code as another base impl as non associative and it would be functionally the same but not require you to import this trait whenever you want to hash in a non-dynamic context.

Definitely more verbose, and could be cleaned up, but more clean usage imo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having code() as a method isn't a big deal. Though the problem I'm seeing is that now code() and digest() will be instance methods (right word?) that take self as first parameter and not static methods. To me, the API should be clean and not hide that self isn't actually needed. This is what I meant with DynMultihashDigest being hacky.

Copy link
Contributor

@austinabell austinabell Feb 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was just too lazy to move the associated code const to the base impl, does this satisfy that criteria: austinabell@fd07805?

fixed changes in edit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, I have no opinion or preference, just offering an alternative since it seems weird to me to require importing that trait that does nothing just to be able to hash using builtin implementations. Seems to be an even tradeoff since declaring custom implementations could be slightly confusing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's (from the end-user perspective) quite nice. We could also have a derive macro as the trait implementation would be the same.

Could please someone with more Rust experience comment on which approach is better? The one proposed by @austinabell (#45 (comment)) feels a bit un-idiomatic, but the resulting API is quite nice. @dignifiedquire perhaps?

/// The Mutlihash byte value.
fn code(&self) -> Code;

/// Hash some input and return the digest.
///
/// # Panics
///
/// Panics if the digest length is bigger than 2^32. This only happens for identity hasing.
fn digest(&self, data: &[u8]) -> Multihash;
}

/// Wraps a hash digest in Multihash with the given Mutlihash code.
///
/// The size of the hash is determoned by the size of the input hash. If it should be truncated
/// the input data must already be the truncated hash.
pub fn wrap(code: Code, data: &[u8]) -> Multihash {
let mut code_buf = varint_encode::u64_buffer();
let code = varint_encode::u64(code.to_u64(), &mut code_buf);

let mut size_buf = varint_encode::u64_buffer();
let size = varint_encode::u64(data.len() as u64, &mut size_buf);

Multihash {
storage: Storage::from_slices(&[code, &size, &data]),
}
}
Loading