OS specific content type handling #7418

Stebalien · 2020-06-04T21:10:13Z

Version information:

v0.6.0-rc1

Description:

go-ipfs now reads /etc/mime.types when determining the content type of a file from the file extension. Unfortunately, this leads to hard-to-diagnose, platform-specific behavior, whereas ideally all go-ipfs implementations should behave the same way.
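To illustrate the platform dependence, here is a minimal sketch, assuming the lookup goes through Go's standard `mime` package (which is documented to augment its built-in table from system files such as /etc/mime.types on Unix). The `pinnedTypes` table and `pinnedTypeByExtension` helper are hypothetical names for this example, not go-ipfs code; they show one way to get the same answer on every host.

```go
package main

import (
	"fmt"
	"mime"
	"path"
	"strings"
)

// pinnedTypes is a hypothetical, deliberately tiny extension table that ships
// with the binary, so every platform answers the same way.
var pinnedTypes = map[string]string{
	".html": "text/html; charset=utf-8",
	".css":  "text/css; charset=utf-8",
	".json": "application/json",
	".png":  "image/png",
	".jpg":  "image/jpeg",
}

// pinnedTypeByExtension looks up the extension of name in pinnedTypes only.
// It returns "" for unknown extensions and never consults the OS.
func pinnedTypeByExtension(name string) string {
	ext := strings.ToLower(path.Ext(name))
	return pinnedTypes[ext]
}

func main() {
	// Path from the blog post URL; path.Ext sees ".1" as its extension.
	name := "2020-05-20-gossipsub-v1.1"

	// The standard library may consult OS databases, so this result can
	// differ from host to host.
	fmt.Printf("mime.TypeByExtension: %q\n", mime.TypeByExtension(path.Ext(name)))

	// The pinned table does not know ".1" and says so on every platform.
	fmt.Printf("pinned table:         %q\n", pinnedTypeByExtension(name))
}
```

The trade-off is that a pinned table has to be maintained by hand, but its output no longer depends on what the host happens to have installed.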

See ipfs/ipfs-companion#886 (comment).

markg85 · 2020-06-05T11:58:23Z

Interesting issue! Yes, I did read ipfs/ipfs-companion#886 (comment).

The original issue talks about this blog post: https://ipfs.io/ipns/blog.ipfs.io/2020-05-20-gossipsub-v1.1

That's, at the very least, a bug that should be fixed in the software used to generate the blog posts. A dot should not be part of a URL unless it introduces an extension. URLs like that will break checks that rely solely on the extension, which is apparently what is happening here.

Next, there is the difference between webservers and files. If a webserver serves a page like that, the content type the webserver provides should be used. This is where things get wonky: when handled through IPFS, everything is probably treated as a file. And if different nodes have different mime databases, you might indeed get different results.

But there's a fix for that :)
These kinds of issues were seen in the open source desktop world years ago (talking about KDE specifically). In that world (C++ with Qt), solutions were built to fix this. In particular, this function is used these days: https://doc.qt.io/qt-5/qmimedatabase.html#mimeTypeForFile Look closely at the second argument. You can specify whether you want to determine the mime type by extension, by content, or both.

In the IPFS world it makes a lot of sense to determine the mime type by content, not by extension, because the node that is going to respond with the data knows the data. It's practically free to determine the mime type then.
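As a rough Go counterpart to content-based detection, the standard library's `http.DetectContentType` guesses a type from the leading bytes of the data. This is only a sketch of the idea; the `sniff` wrapper is a name made up for this example, and it is not the Qt logic linked above.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff returns the MIME type guessed from the leading bytes of data.
// http.DetectContentType considers at most the first 512 bytes.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	png := []byte("\x89PNG\r\n\x1a\n") // PNG file signature
	gif := []byte("GIF89a")            // GIF file signature

	// The file name plays no role here; only the bytes are inspected.
	fmt.Println(sniff(png))
	fmt.Println(sniff(gif))
}
```

Since only the bytes are inspected, a misleading name has no effect on the result.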

So I'd advise you to look at how this is done in the Qt world and use that logic instead. Your starting point would be https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/mimetypes/qmimedatabase.cpp I don't quite know how the actual database is built, but I do know that it has been working quite reliably for years (since its introduction in Qt 5.0, I think).

Just as a little reminder of what is possible if you detect the mime type solely by extension: do realize that on Linux (Windows too I think, not entirely sure) a dot is allowed in any entry name. So you could actually have a folder called "bigfolder.jpg" which would not be a jpg file but a folder! It's stupid, but possible.

Hope this helps :)

markg85 · 2020-06-05T12:30:19Z

While thinking about this a bit more: why isn't the mime content type encoded in the hash?
It doesn't have to be a super strong part of the hash. You never know how many mime types there are, so you'd never know how much space you need to reserve for them.
But you can define a list of known mime types and encode that in the hash too.
With just base32 (the current bafy one) you can already encode 1024 content types in a mere 2 characters (5 bits per character, so 2^10 values). If the mime type is unknown, encode it as a reserved character pair, say 00. You'd get something like:

00 = Unknown mime type (aka, try to determine it on the receiving node)
01 = application/json
...
50 = image/jpeg
...

This does make the hash 2 characters longer but it also gives you a way of knowing the intended mime type for a file. Also, it only has to be determined once at the point of adding the file to IPFS.

You'd only have to do mime type checking if it's unknown, which gives you a nice backwards compatibility path too. Another thing to consider is that IPFS would be exposing file details with this (IPFS already does with the file size, which is part of the encoding too). With this you also know the type of file. For some purposes that might not be ideal. For other purposes (like quickly sorting on mime type or even "searching for image files") this offers a really simple and fast way to do just that.
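The two-character idea can be sketched in Go as follows. Everything here is hypothetical: the `mimeCodes` registry, its numbering, and the `encodeMIME`/`decodeMIME` helpers are made up for illustration and are not part of any CID format. The sketch uses the lowercase RFC 4648 base32 alphabet, so the reserved "unknown" pair comes out as "aa" rather than "00", because that alphabet has no "0" character.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// alphabet is the lowercase RFC 4648 base32 alphabet (5 bits per character).
const alphabet = "abcdefghijklmnopqrstuvwxyz234567"

// mimeCodes is a hypothetical registry; index 0 is reserved for "unknown".
var mimeCodes = []string{
	"",                 // 0: unknown, detect on the receiving node
	"application/json", // 1
	"image/jpeg",       // 2
}

// encodeMIME returns the two-character code for mimeType.
// Types missing from the registry encode as the reserved "unknown" code.
func encodeMIME(mimeType string) string {
	code := 0
	for i, t := range mimeCodes {
		if i > 0 && t == mimeType {
			code = i
			break
		}
	}
	// Two characters carry 10 bits: high 5 bits first, then low 5 bits.
	return string([]byte{alphabet[code>>5], alphabet[code&31]})
}

// decodeMIME maps a two-character code back to a MIME type.
// It returns "" for the reserved or an unassigned code.
func decodeMIME(s string) (string, error) {
	if len(s) != 2 {
		return "", errors.New("code must be exactly two characters")
	}
	hi := strings.IndexByte(alphabet, s[0])
	lo := strings.IndexByte(alphabet, s[1])
	if hi < 0 || lo < 0 {
		return "", errors.New("character outside the base32 alphabet")
	}
	code := hi<<5 | lo
	if code >= len(mimeCodes) {
		return "", nil // unassigned code: treat as unknown
	}
	return mimeCodes[code], nil
}

func main() {
	code := encodeMIME("image/jpeg")
	mimeType, err := decodeMIME(code)
	fmt.Println(code, mimeType, err)
}
```

Treating unassigned codes as unknown keeps older readers working when the registry grows, which matches the backwards compatibility path described above.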

Stebalien · 2020-06-12T20:45:36Z

The issue here is simply: content detection shouldn't be OS dependent. That's it.

Beyond that, we'd ideally have more accurate content detection. However, it's not simple. We don't want to treat "index.html" with the content "Hi, my name is <b>Steven</b>!" as text (even if file says it is).

Stebalien added the kind/bug and need/triage labels Jun 4, 2020

Stebalien mentioned this issue Jun 4, 2020

/etc/mime.types impacts Content-Type returned by Gateway ipfs/ipfs-companion#886

Closed

lidel added the topic/gateway label Jun 5, 2020
