OS specific content type handling #7418

Open
Stebalien opened this issue Jun 4, 2020 · 3 comments
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization topic/gateway Topic gateway


@Stebalien
Member

Version information:

v0.6.0-rc1

Description:

go-ipfs now reads /etc/mime.types when determining the content type of a file from its extension. Unfortunately, this leads to hard-to-diagnose, platform-specific behavior, whereas ideally all go-ipfs implementations should behave the same way.

See ipfs/ipfs-companion#886 (comment).
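
For illustration, here is a minimal sketch of an OS-independent lookup. Go's standard `mime.TypeByExtension` augments its built-in table with system files such as /etc/mime.types on Unix, so its answers can differ between hosts. The table and the `typeByExtension` helper below are hypothetical and are not taken from go-ipfs; the point is only that a table compiled into the binary gives the same answer everywhere.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// builtinTypes is a small, hard-coded extension table (hypothetical, for
// illustration only). Because it is compiled into the binary, lookups do not
// depend on /etc/mime.types or any other per-OS database.
var builtinTypes = map[string]string{
	".css":  "text/css; charset=utf-8",
	".html": "text/html; charset=utf-8",
	".jpg":  "image/jpeg",
	".js":   "text/javascript; charset=utf-8",
	".json": "application/json",
	".png":  "image/png",
	".svg":  "image/svg+xml",
}

// typeByExtension returns the content type for name's extension, or "" when
// the extension is not in the built-in table.
func typeByExtension(name string) string {
	ext := strings.ToLower(path.Ext(name))
	return builtinTypes[ext]
}

func main() {
	names := []string{"index.html", "photo.JPG", "2020-05-20-gossipsub-v1.1"}
	for _, name := range names {
		fmt.Printf("%-28s %q\n", name, typeByExtension(name))
	}
}
```

With this table the last name, whose "extension" is `.1`, yields an empty result on every platform, so the caller can fall back to some other strategy instead of inheriting whatever the host's MIME database says about `.1`.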

@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jun 4, 2020
@lidel lidel added the topic/gateway Topic gateway label Jun 5, 2020
@markg85
Contributor

markg85 commented Jun 5, 2020

Interesting issue! Yes, I did read ipfs/ipfs-companion#886 (comment).

The original issue talks about this blog post: https://ipfs.io/ipns/blog.ipfs.io/2020-05-20-gossipsub-v1.1

That is, at the very least, a bug that should be fixed in the software used to generate the blog posts. A dot should not be part of a URL unless it introduces an extension. URLs like that will break any check that relies solely on the extension, which is apparently what is happening here.

Next, there is a difference between web servers and files. If a web server serves a page like that, the content type the web server provides should be used. This is where things get wonky: when the page is served through IPFS, it is probably all handled as files. And if different nodes have different MIME databases, you might indeed get different results.

But there's a fix for that :)
These kinds of issues were seen in the open source desktop world years ago (in KDE specifically). In that world (C++ with Qt), solutions have been developed to fix this. In particular, this function is used these days: https://doc.qt.io/qt-5/qmimedatabase.html#mimeTypeForFile Look closely at the second argument: you can specify whether to determine the MIME type by extension, by content, or both.

In the IPFS world it makes a lot of sense to determine the MIME type by content, not by extension, because the node that responds with the data already has the data. Determining the MIME type at that point is practically free.
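
As a rough Go counterpart to content-based matching, the standard library's `http.DetectContentType` guesses a type from the leading bytes of the data. This is only a sketch of the idea: the `sniff` wrapper is a hypothetical name, and nothing here is meant to describe what go-ipfs or Qt actually do internally.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff returns the content type guessed from the leading bytes of data.
// http.DetectContentType considers at most the first 512 bytes and falls back
// to "application/octet-stream" when nothing matches.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	png := []byte("\x89PNG\r\n\x1a\n") // the 8-byte PNG signature
	page := []byte("<!DOCTYPE html><html><body>hello</body></html>")

	fmt.Println(sniff(png))  // image/png
	fmt.Println(sniff(page)) // text/html; charset=utf-8
}
```

Because the result depends only on the bytes, every node that holds the same block would compute the same answer, regardless of file name or operating system.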

So I'd advise you to look at how this is done in the Qt world and use that logic instead. Your starting point would be https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/mimetypes/qmimedatabase.cpp I don't quite know how the actual database is built, but I do know that it has been working quite reliably for years (since its introduction in Qt 5.0, I think).

Just as a little reminder of what can go wrong if you detect the MIME type solely by extension: on Linux (Windows too, I think, though I'm not entirely sure) a dot is allowed in any directory entry name. So you could actually have a folder called "bigfolder.jpg", which would not be a JPEG file but a folder! It's stupid, but possible.

Hope this helps :)

@markg85
Contributor

markg85 commented Jun 5, 2020

While thinking about this a bit more: why isn't the MIME content type encoded in the hash?
It doesn't have to take up much of the hash. You can never know in advance how many MIME types there will be, so you cannot know how much space to reserve for all of them.
But you can define a list of known MIME types and encode an index into that list in the hash too.
With just base32 (the current bafy one), two characters can already encode 1024 content types. If the MIME type is unknown, encode it as a reserved character pair, say 00. You'd get something like:

00 = Unknown mime type (aka, try to determine it on the receiving node)
01 = application/json
...
50 = image/jpeg
...

This does make the hash 2 characters longer, but it also gives you a way of knowing the intended MIME type of a file. It also only has to be determined once, at the point where the file is added to IPFS.

You'd only have to do MIME type checking if the type is unknown, which gives you a nice backwards compatibility path too. Another thing to consider is that IPFS would be exposing file details with this (it already does so with the file size, which is part of the encoding too), because now you also know the type of the file. For some purposes that might not be ideal. For other purposes (like quickly sorting by MIME type, or even "searching for image files") this offers a really simple and fast way to do just that.
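
The arithmetic behind this proposal can be sketched as follows. Everything here is hypothetical: the `knownTypes` registry, its ordering, and the `encodeType`/`decodeType` helpers are made up for illustration and are not part of any CID specification. Note also that the lowercase base32 alphabet has no "0" or "1", so the reserved pair for "unknown" comes out as "aa" in this sketch rather than the "00" used above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// alphabet is the lowercase RFC 4648 base32 alphabet: 32 symbols, so each
// character carries 5 bits and two characters carry 10 bits (1024 values).
const alphabet = "abcdefghijklmnopqrstuvwxyz234567"

// knownTypes is a hypothetical registry. Index 0 is reserved for "unknown".
var knownTypes = []string{
	"", // 0: unknown, the receiving node has to determine the type itself
	"application/json",
	"image/jpeg",
	"text/html",
}

// encodeType maps a MIME type to its two-character code. Types that are not
// in the registry are encoded as index 0 (unknown).
func encodeType(mimeType string) string {
	idx := 0
	for i, t := range knownTypes {
		if i > 0 && t == mimeType {
			idx = i
			break
		}
	}
	return string([]byte{alphabet[idx>>5], alphabet[idx&31]})
}

// decodeType maps a two-character code back to a MIME type. Unassigned
// indexes decode to "" (unknown); malformed codes return an error.
func decodeType(code string) (string, error) {
	if len(code) != 2 {
		return "", errors.New("code must be exactly two characters")
	}
	hi := strings.IndexByte(alphabet, code[0])
	lo := strings.IndexByte(alphabet, code[1])
	if hi < 0 || lo < 0 {
		return "", errors.New("code contains a non-base32 character")
	}
	idx := hi<<5 | lo
	if idx >= len(knownTypes) {
		return "", nil
	}
	return knownTypes[idx], nil
}

func main() {
	for _, t := range []string{"application/json", "image/jpeg", "video/x-made-up"} {
		code := encodeType(t)
		back, _ := decodeType(code)
		fmt.Printf("%-18s -> %s -> %q\n", t, code, back)
	}
}
```

A type missing from the registry round-trips to the empty string, which is the "try to determine it on the receiving node" case.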

@Stebalien
Member Author

The issue here is simply: content detection shouldn't be OS dependent. That's it.

Beyond that, we'd ideally have more accurate content detection. However, it's not simple. We don't want to treat "index.html" with the content "Hi, my name is <b>Steven</b>!" as plain text (even if `file` says it is).
