-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-17060. listStatus and getFileStatus behave inconsistent in the case of ViewFs implementation for isDirectory #2061
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Here is a draft patch to see how the yetus report for this change. I ran locally some tests, most of them passed. Since it's in HADOOP, I don't think HDFS tests will run. I ran few tests, most of them passed. I have question on following assert:
This assert is telling that, if it's a dir, symlink should be null or it can be a just file. This means, if it's a symlink, it can be only file. I have tested in my macos about this semantics locally :
The output was: This says, isDir and isSymlink can be true. This patch supports this behavior. So, I am not very sure on that assert, why that was added. I have removed it to support this case now, otherwise, that assert will make flow fail in dev envs. If someone knows about it, please comment. Thanks |
Here this is another line with assert in FileStatus.java: https://github.com/apache/hadoop/blob/ede439c5d1aef856c1e31fe89e606cac9eadae60/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileStatus.java#L499 |
🎊 +1 overall
This message was automatically generated. |
… case of ViewFs implementation for isDirectory
ede439c
to
66036d6
Compare
Thanx @umamaheswararao for the fix.
Similar stuff happens if it is file, it tells isDirectory false and File as true. But the only doubt I have regarding the assertions, this was explicitly done as part of HADOOP-7032 and other one in HADOOP-15289, There should be some reason behind that? |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Thanks @ayushtkn for spending time on this. |
status.getModificationTime(), status.getAccessTime(), | ||
status.getPermission(), status.getOwner(), status.getGroup(), | ||
link.getTargetLink(), | ||
new Path(inode.fullPath).makeQualified(myUri, null)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here in the FileStatus we have resolved the mount and giving details of the destination. Any idea where the link.getTargetLink() in the FileStatus can be or is used?
If we are resolving the mount entry, This is actually the FileStatus of the destination, if the destination itself is a symlink, in that case instead of link.getTargetLink() shouldn't status.getSymlink() be done?
Since all the other details like permissions and stuff we are having of the destination, the intent seems to resolve it as the destination directory rather than projecting this as a symlink to the end user. else if we see in linux the symlink doesn't shows the destination permission or other details it shows it's own details while listing :
lrwxrwxrwx 1 ayush ayush 3 Jun 8 04:52 sym_f -> txt
-rw-rw-r-- 1 ayush ayush 0 Jun 8 11:05 txt
In RBF also for listing on mount points, we don't denote them as symlinks, instead just pass the destination FileStatus with name resolved...
Is it possible we keep the assertions as it is and change link.getTargetLink() and specify that as per the destination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ayushtkn , Thanks for the review!
Here ViewFS mount points seems to be designed to show them as symlinks. This can be figured out by looking at ViewFsBaseTest.java#testListOnInternalDirsOfMountTable (which was way old test) and also I had discussion with Sanjay on that.
So, I don't think it's reasonable to change that assumption/behavior now.
Any idea where the link.getTargetLink() in the FileStatus can be or is used?
FileStatus has getSymlink API. Also Filestatus path represents link path. and symlink path represents target path. This seems to be correct.
Coming to permissions bit I just realized symlink showing its own permissions instead of target in macos also. ( but we just changed the behavior in HADOOP-17029 where we display target dir's permissions and group. cc: @abhishekdas99, as per linux/macos the older behavior seems to be ok, but ViewFS perspective it seems to be good to show target permissions instead of showing some bogus permissions. I think in ViewFS case the in-memory InternalViewFSDirFS was created by fs but we can not really cary who created what mount, we will just configure in xml that link and would not have any info who added that, all should be at admins permission control only ). I assume in linux mount link dir will be created with the permission to the current user who created right? Since we don't have that link creation logic dynamically at ViewFS, I don't think we can behave exactly same as Linux here?
My another original question was why we should not have symlink on dir?
is it because HDFS does not support that? If that because of hdfs, should we move that assert under HdfsNamedFileStatus? I don't know what else will break here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent behind that assertion is little different.
An entity can be either a directory, file or a symlink. In linux while listing, directory is denoted as d, link by l and file by -, So there are three categories (Directory/File/Link) not just two (Directory/File)
drwxrwxr-x 2 ayush ayush 4096 Jun 8 04:25 dir/
lrwxrwxrwx 1 ayush ayush 4 Jun 8 04:25 sym_dir -> dir//
-rw-rw-r-- 1 ayush ayush 0 Jun 8 11:05 txt
A symlink can point to both directory and file, but isn't itself a directory or file. It is link, and has its own identity.
In Hadoop, to distinguish between the three categories, the logic seems like :
isDir==true --> It is a Directory
isDir==false --> Can be a file or Symlink. So to conclude further whether a file or link
isDir==false and link==null --> it is file
isDir==false and link!=null --> it is a symlink
The assertion message "A directory can not be a symlink" also tries to depict this only. Since if isDir is true that means it is a directory so it can't be a symlink, it can be a symlink when it isn't neither directory nor file.
Now regarding changing it here in ViewFS, allowing isDir to be true and as well as having a link.
If someone is having an application code with
if(sDir==false and link!=null) --> it is a symlink
do work considering symlink.
His code will break, Since this check won't return true post this change. Would this change incompatible?
Now regarding inconsistency between getFileStatus and listStatus, if we are considering a mount entry as link, so getFileStatus should follow the similar semantics as listStatus, A same entity can not be treated differently. If we want to treat it as link, we can think of makin getFileStatus adapt..
Just my thoughts, not much idea about the actual discussions what was the intent then..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isDir==true --> It is a Directory
isDir==false --> Can be a file or Symlink. So to conclude further whether a file or link
isDir==false and link==null --> it is file
isDir==false and link!=null --> it is a symlink
Here nio Files APIs return true for isDirectory API. But here we cannot make that judgement with this information. I saw that 'l' part along with directory. However, native filesystems seems to be capturing the info about target filesystem and return isDir true based on that. It denotes as symlink along with permission bits.
My original thought was to change GetFileStatus see comment as you thought here:
But after verifying tests on local mac, I realized isDirectory is getting returned tru in that cases, but here we cannot make that decision. in MAC it was showing as folder icon if target is a directory and isDirectory as true.
Yes, they are incompatible changes( either the change done in listStatus/GetFileStatus)( we will mark them), and behavioral corrections in other side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well things are different at different places and TBH I don't have a strong opinion on which is the best way to do.
On a personal note changing getFileStatus()
seems to be little more safe to me, As those assertions and stuffs stays as it is and changes gets restricted to viewFS
only and no changes to links interpretations and stuffs. (my assumption, it should be safe, haven't digged in much)
return new FileStatus(0, true, 0, 0, creationTime, creationTime,
PERMISSION_555, ugi.getShortUserName(), ugi.getPrimaryGroupName(),
new Path(theInternalDir.fullPath).makeQualified(
myUri, ROOT_PATH));
getFileStatus()
is treating it as a link only(but with isDir true), it doesn't shows target file system permissions and times. That also need to be changed similarly to HADOOP-17029, to resolve permissions and stuff from target file system. FileStatus should be same through both API's? We can make things get in sync by doing that, and get away with inconsistencies b/w these two API's as of now..
Changes in getListing()
apart from making the APIs in sync(That also we need to change in getFileStatus() as well since there 'true' is hardcoded), we seems to change symlink interpretation logics as well to get that in sync with other systems and I think that might break things for people relying on checks like this :
if (isDir==false and link!=null)`, May be we can have a follow up JIRA as well, to change link interpretations with bigger audience.
But in any case, I don't have oppositions to any of the approach, all up to you whichever way you want to go ahead. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From ur previous comment:
drwxrwxr-x 2 ayush ayush 4096 Jun 8 04:25 dir/
lrwxrwxrwx 1 ayush ayush 4 Jun 8 04:25 sym_dir -> dir//
-rw-rw-r-- 1 ayush ayush 0 Jun 8 11:05 txt
This is from permission bits. If you look at our FSPermission class we have a stickybit to denote its a file or not. Unfortunately it's just a boolean, it does not have 3rd option to represent link.
public FsPermission(FsAction u, FsAction g, FsAction o, boolean sb) {
set(u, g, o, sb);
}
(nio) Files.isDirectory returns true for a symlink if target is dir. Where as Hadoop FileStatus class does not allow if symlink represent as dir. Behaviors are confusing :-(
That means we don't have enough info captured in Permissions or FileStatus classes to represent symlink properly.
Coming to getFileStatus question:
In your above quoted code runs on internal directory, thats not a link. Here internal directory means, if you have link src as /a/b/c, here a, b are internal dirs (here it's always a dir, so we can safely hardcode to true) and c is a link dir.
GetFileStatus run in input path, so, when input path is link, it would have resolved and run getFileStatus on targetFileSystem#getFileStatus. Where listStatus gets the FileStatus of immediate childrens. So, listStatus of /a/b , since be is an internal dir, it will go to InetrnalViewFsDir and run listStatus. and one of its child c is a link. so previously treating that children FileStatus object as symlink with isDir is false, irrespective whether target is file/dir. One safe assumption we can keep is, in ViewFS target would always be a directory only. I don't think anyone would configure a file as target link. But we can not leave that case anyway for consistency. Hope this helps. I also don't have very clear opinions here. Symlinks seems to mess around here. :-( .
But I like the idea of representing as symlink as that one of fs features. but consistent behavior is what we need to find here. If we wanted to fix getFileStatus("/a/b/c"), we need to make isDir as false, but current it will return true as it runs on targetFileSystem and also give symlink with tagetFs path. Here no way user know whether the target is file or dir, unless he directly access symlinked path and see.
Let me catchup Sanjay, if he has some thoughts around.
But in any case, I don't have oppositions to any of the approach, all up to you whichever way you want to go ahead. :-)
Thank you. Let's figure out what would be the correct thing to do. I have no plans to move ahead until we find some reasonable solution here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have verified symlinks behavior in HDFS.
public void testSymlinkOnHDFS() throws Exception {
// add ur hdfs uri here: ex hdfs://10.0.1.75:9000
URI hdfsURI = dfs.getUri();
FileSystem.enableSymlinks();
try (FileSystem hdfs = new DistributedFileSystem()) {
hdfs.initialize(hdfsURI, new HdfsConfiguration());
final Path targetLinkDir = new Path("/user", "targetRegularDir");
hdfs.mkdirs(targetLinkDir);
Path symLinkDir = new Path("/links/linkDir");
hdfs.createSymlink(targetLinkDir, symLinkDir, true);
// ListStatus Test
FileStatus[] listStatus = hdfs.listStatus(new Path("/links"));
FileStatus fsFromLs = listStatus[0]; // FileStatus of /links/linkDir
Assert.assertEquals(fsFromLs.isDirectory(), false);
Assert.assertEquals("/links/linkDir",
Path.getPathWithoutSchemeAndAuthority(fsFromLs.getPath()).toString());
// GetFileStatus test
// FileStatus of /links/linkDir
FileStatus fileStatus = hdfs.getFileStatus(symLinkDir);
Assert.assertEquals(true, fileStatus.isDirectory());
// resolved to FileStatus of /user/targetRegularDir
Assert.assertEquals("/user/targetRegularDir", Path
.getPathWithoutSchemeAndAuthority(fileStatus.getPath()).toString());
}
}
It turns out that the behavior of listStatus and GetFileStatus are different. They both returning different FileStatus. Same behavior in ViewFS also.
GetFileStatus(/test), just runs on resolved path directly. So, it will not be represented as symLink.
ListStatus(/) of gets /test as children FileStatus object. But that represents as symLink.
Probably we should just clarify the behavior in user guide and API docs about the behaviors in symlinks case? Otherwise fixing this needs to be done all other places and it will be incompatible change across.
One advantage I see with the existing behavior is that, with listStatus we can know dir is symlink. If one wants to know targetFs details, then issue GetFileStatus on that path will resolved to targetFS and gets the FileStatus at targetFS.
I will also check with Sanjay about his opinions on this.
Thanx @umamaheswararao the test is really helpful. I tried it and found really the filestatus are different in case of HDFS as well :
Yahh, Documenting this properly sounds most apt as of now.
Correct, provided people know about this behavior, this behavior may be helpful in many places . but there is probably too much lack of documentation/awareness around symlinks,
If we don't fix this, Srinivasu Majeti's problem won't get solved? But if there aren't any further different opinions, and provided HDFS is also behaving similarlly, we don't have much scope changing just in ViewFs, So documenting properly as you said should be our final option. |
I had a discussion with Sanjay about this behaviors. He certainly agreed that, we should update the docs instead of changing them now. GetFileStatus implemented to work on resolved paths transparently. Updated the Javadocs for the APIs how they behave. Hope this helps. CC: @rakeshadr @ayushtkn |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
https://issues.apache.org/jira/browse/HADOOP-17060