Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement compression/decompression of filenames ending .gz #3963

Merged
merged 1 commit into from
Apr 21, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions src/sysfiles.c
Original file line number Diff line number Diff line change
Expand Up @@ -616,6 +616,9 @@ Int SyFopen (
Char namegz [1024];
int flags = 0;

Char * terminator = strrchr(name, '.');
BOOL endsgz = terminator && (strcmp(terminator, ".gz") == 0);

/* handle standard files */
if ( strcmp( name, "*stdin*" ) == 0 ) {
if ( strcmp( mode, "r" ) != 0 )
Expand Down Expand Up @@ -677,8 +680,11 @@ Int SyFopen (
#endif

/* try to open the file */
syBuf[fid].fp = open(name,flags, 0644);
if ( 0 <= syBuf[fid].fp ) {
if (endsgz && (syBuf[fid].gzfp = gzopen(name, mode))) {
syBuf[fid].type = gzip_socket;
syBuf[fid].fp = -1;
syBuf[fid].bufno = -1;
} else if (0 <= (syBuf[fid].fp = open(name, flags, 0644))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so we are patching the most low-level function in the kernel for opening a file to transparently access .gz files. That makes it really tough to bypass this, one has to either bypass SyFopen or else copy or rename the file in question to look at its "true" content.

I think it is OK to do this now in this PR, but I'd like us to revisit this. I notice that SyFopen still takes an old clunky "mode" string as second argument. That argument could be augmented to indicate that "raw" or "binary" mode (no automatic (de)compressing) is wanted. Indeed, we already supported on Cygwin using "rb" and "wb" modes (for savefiles).

But I'd also suggest to change it from a const char * to an integer bit mask (basically similar to the POSIX O_RDONLY etc. flags for open, just out own portable version of those).

Anyway, either way it'd be trivial to add a simple kernel helper function which basically does the same as StringFile but using this "raw" mode (StringFile is a GAP function but one that's trivial to rewrite in a <10 lines kernel function, and then only the argument to SyFopen in there needs to be changed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course the idea of changing the mode string to an int (or enum) is separate from the rest; this is just ramblings, don't feel obliged to implement any of this.

Copy link
Contributor Author

@ChrisJefferson ChrisJefferson Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to your suggestion, but I'm a bit unsure about implementing it at all. I don't really like adding stuff just to support testing. If anyone is upset they can't access raw gzip files I'm very happy to fix it -- although at the moment we don't really support binary file access at all (on Windows we always do line ending conversion for example, so we'd probably corrupt raw gzipped files anyway).

syBuf[fid].type = raw_socket;
syBuf[fid].echo = syBuf[fid].fp;
syBuf[fid].bufno = -1;
Expand Down
1 change: 1 addition & 0 deletions tst/example-dir/compress/not-compressed.txt.gz
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
not compressed
2 changes: 2 additions & 0 deletions tst/example-dir/readme.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,5 @@ Explanation

dir-test : A directory containing some example files and sub-directories
for testing directory enumeration.

compress : not-compressed.txt - A text file which is not compressed but ends in gz
169 changes: 169 additions & 0 deletions tst/testinstall/compressed.tst
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
#@local dir,fname,isGzippedFile,stream,str
gap> START_TEST("compressed.tst");
gap> dir := DirectoryTemporary();;
gap> fname := Filename(dir, "test.g.gz");;

# Let us check when we have written a compressed file by checking the gzip header
gap> isGzippedFile := function(dir, name)
> local out, str,prog;
> str := "";
> out := OutputTextString(str, true);
> Process(dir, Filename(DirectoriesSystemPrograms(),"cat"), InputTextNone(), out, [name]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, that's yucky. Can you please at least add a comment that you are not simply using StringFile here because that would decompress the data?

I guess what we'd need to make this less ugly is an InputRawFile or so which does not perform any transparent decompression. Or perhaps an optional flag to InputTextFile which allows turning off the decompression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't like this. I'm happy with either of those suggestions, if anyone has a strong prference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little more, the problem with introducing some kind of InputRawFile, or anything else, is it is another thing to document which (I think) no-one is ever going to want to use.

We could use the IO package to grab raw access, or introduce some undocumented kernel function to get raw access?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me; see comments above for some ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: It's also OK by me to merge this now as-is; we can always revisit this later. Perhaps amend 1-2 of the minor optional remarks I made, if you feel like it; otherwise just merge it as-is shrug

> return str{[1..2]} = "\037\213";
> end;;
gap> str := "hello\ngoodbye\n";;

# Write a compressed file
gap> FileString( fname, str ) = Length(str);
true

# Check file really is compressed
gap> isGzippedFile(dir, "test.g.gz");
true

# Check reading compressed file
gap> StringFile( fname ) = str;
true

# Check gz is added transparently
gap> StringFile( Filename(dir, "test.g") ) = str;
fingolfin marked this conversation as resolved.
Show resolved Hide resolved
true

# Test reading/seeking in a gzip compressed file
gap> stream := InputTextFile(fname);;
gap> ReadLine(stream);
"hello\n"
gap> ReadLine(stream);
"goodbye\n"
gap> ReadLine(stream);
fail
gap> SeekPositionStream(stream, -1);
fail
gap> SeekPositionStream(stream, 0);
true
gap> ReadLine(stream);
"hello\n"
gap> ReadLine(stream);
"goodbye\n"
gap> ReadLine(stream);
fail
gap> SeekPositionStream(stream, 2);
true
gap> PositionStream(stream);
2
gap> ReadLine(stream);
"llo\n"
gap> ReadLine(stream);
"goodbye\n"
gap> SeekPositionStream(stream, 0);
true
gap> ReadAll(stream) = str;
true
gap> SeekPositionStream(stream, 0);
true
gap> PositionStream(stream);
0
gap> ReadAll(stream) = str;
true
gap> CloseStream(stream);

# Test multiple writes
gap> stream := OutputTextFile( fname, false );;
gap> PrintTo( stream, "1");
gap> AppendTo( stream, "2");
gap> PrintTo( stream, "3");
gap> CloseStream(stream);
gap> stream;
closed-stream
gap> isGzippedFile(dir, "test.g.gz");
true

# verify it
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream);
"123"
gap> CloseStream(stream);
gap> stream;
closed-stream

# partial reads
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream, 2);
"12"
gap> CloseStream(stream);
gap> stream;
closed-stream

# too long partial read
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream, 5);
"123"
gap> CloseStream(stream);
gap> stream;
closed-stream

# error partial read
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream, -1);
Error, ReadAll: negative limit is not allowed
gap> CloseStream(stream);
gap> stream;
closed-stream

# append to initial data
gap> stream := OutputTextFile( fname, true );;
gap> PrintTo( stream, "4");
gap> CloseStream(stream);

# verify it
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream);
"1234"
gap> CloseStream(stream);
gap> stream;
closed-stream

# overwrite initial data
gap> stream := OutputTextFile( fname, false );;
gap> PrintTo( stream, "new content");
gap> CloseStream(stream);

# verify it
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream);
"new content"
gap> CloseStream(stream);
gap> stream;
closed-stream

# ReadAll with length limit
gap> stream := InputTextFile( fname );;
gap> ReadAll(stream, 3);
"new"
gap> CloseStream(stream);

# test PrintFormattingStatus
gap> stream := OutputTextFile( fname, false );;
gap> PrintFormattingStatus(stream);
true
gap> PrintTo( stream, "a very long line that GAP is going to wrap at 80 chars by default if we don't do anything about it\n");
gap> CloseStream(stream);
gap> StringFile(fname);
"a very long line that GAP is going to wrap at 80 chars by default if we don't\
\\\ndo anything about it\n"
gap> stream := OutputTextFile( fname, false );;
gap> SetPrintFormattingStatus(stream, false);
gap> PrintFormattingStatus(stream);
false
gap> PrintTo( stream, "a very long line that GAP is going to wrap at 80 chars by default if we don't do anything about it\n");
gap> CloseStream(stream);
gap> StringFile(fname);
"a very long line that GAP is going to wrap at 80 chars by default if we don't\
do anything about it\n"

# Test even if a file ends in .gz, if it is not compressed it can still be read
gap> stream := InputTextFile(Filename(DirectoriesLibrary("tst"), "example-dir/compress/not-compressed.txt.gz"));;
gap> ReadAll(stream) = "not compressed\n";
true
gap> CloseStream(stream);
gap> STOP_TEST("compressed.tst");
4 changes: 0 additions & 4 deletions tst/testinstall/read.tst
Original file line number Diff line number Diff line change
Expand Up @@ -75,10 +75,6 @@ gap> StringFile( Filename(dir, "tmp2"));
fail
gap> StringFile( Filename(dir, "tmp1"));
"Hello, world!"
gap> FileString( Filename(dir, "test.g.gz"), "\037\213\b\b0,\362W\000\ctest.g\0003\3246\264\346\<\000\225\307\236\324\005\000\000\000" );
32
gap> StringFile( Filename(dir, "test.g") ) = "1+1;\n" or ARCH_IS_WINDOWS(); # works only when Cygwin installed with gzip
true
gap> StringFile( "/" );
Error, in StringFile: Is a directory (21)

Expand Down