
Commit aa5b395

mzaslonk authored and torvalds committed
lib/zlib: add s390 hardware support for kernel zlib_deflate
Patch series "S390 hardware support for kernel zlib", v3.

With the IBM z15 mainframe, the new DFLTCC instruction is available. It implements the deflate algorithm in hardware (Nest Acceleration Unit, NXU) with estimated compression and decompression performance orders of magnitude faster than the current zlib.

This patchset adds s390 hardware compression support to kernel zlib. The code is based on the userspace zlib implementation:

    madler/zlib#410

The coding style is also preserved for future maintainability. Only a limited set of userspace zlib functions is represented in the kernel. Apart from that, all memory allocation must be performed in advance. Thus, the workarea structures are extended with the parameter lists required for the DEFLATE CONVERSION CALL instruction.

Since kernel zlib itself does not support gzip headers, only the Adler-32 checksum is processed (it can also be produced by the DFLTCC facility).

As in the userspace implementation, kernel zlib compresses in hardware on level 1 and in software on all other levels. Decompression always happens in hardware (when enabled).

Two DFLTCC compression calls produce the same results only when both are made on machines of the same generation and when the respective buffers have the same offset relative to the start of the page. Therefore, care should be taken when using hardware compression where reproducible results are desired. However, it always produces standard-conforming output, which can be inflated anyway.
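The reproducibility caveat depends only on where each buffer starts within its page. A minimal userspace sketch of checking that condition, assuming 4 KiB pages; `SKETCH_PAGE_SIZE`, `page_offset()` and `same_page_offset()` are hypothetical names, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; kernel code would use PAGE_SIZE instead. */
#define SKETCH_PAGE_SIZE 4096u
static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Offset of a buffer's first byte from the start of the page containing it. */
static uintptr_t page_offset(const void *buf)
{
    return (uintptr_t)buf & (SKETCH_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: true when both buffers start at the same offset
 * within their pages. This is only one of the stated conditions; the two
 * calls must also run on machines of the same generation.
 */
static bool same_page_offset(const void *a, const void *b)
{
    return page_offset(a) == page_offset(b);
}
```

Buffers 4096 bytes apart share an in-page offset, while buffers one byte apart do not.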
The new kernel command line parameter 'dfltcc' is introduced to configure s390 zlib hardware support:

    Format: { on | off | def_only | inf_only | always }
     on:       s390 zlib hardware support for compression on
               level 1 and decompression (default)
     off:      No s390 zlib hardware support
     def_only: s390 zlib hardware support for deflate
               only (compression on level 1)
     inf_only: s390 zlib hardware support for inflate
               only (decompression)
     always:   Same as 'on' but ignores the selected compression
               level, always using hardware support (used for debugging)

The main purpose of integrating NXU support into kernel zlib is the use of hardware deflate in the btrfs filesystem with on-the-fly compression enabled. Apart from that, hardware support can also be used during boot for decompressing the kernel or the ramdisk image.

With the patch for btrfs expanding the zlib buffer from 1 to 4 pages (patch 6), the following performance results have been achieved using a ramdisk with btrfs. These are relative numbers based on throughput rate and compression ratio for zlib level 1:

  Input data              Deflate rate   Inflate rate   Compression ratio
                          NXU/Software   NXU/Software   NXU/Software
  stream of zeroes        1.46           1.02           1.00
  random ASCII data       10.44          3.00           0.96
  ASCII text (dickens)    6.21           3.33           0.94
  binary data (vmlinux)   8.37           3.90           1.02

This means that s390 hardware deflate can provide up to 10 times faster compression (on level 1) and up to 4 times faster decompression (for all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using the dd command-line utility on btrfs on a Fedora 30 based internal driver in native LPAR on a z15 system. Results may vary based on individual workload, configuration and software levels.

This patch (of 9):

Create the zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL implementation and related compression functions.
Update zlib_deflate functions with the hooks for s390 hardware support and adjust workspace structures with extra parameter lists required for hardware deflate.

Link: http://lkml.kernel.org/r/20200103223334.20669-2-zaslonko@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Co-developed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
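The `dfltcc=` values and the level-1 policy described above amount to a small decision table. The sketch below is a hypothetical userspace model, not the lib/zlib_dfltcc code: the enum, `parse_dfltcc()`, `use_hw_deflate()` and `use_hw_inflate()` are illustrative names, and it assumes that unknown values fall back to the default and that the compression level has already been resolved to 0..9.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the dfltcc= kernel parameter values. */
enum dfltcc_mode {
    DFLTCC_ON,       /* hardware deflate on level 1, hardware inflate (default) */
    DFLTCC_OFF,      /* software only */
    DFLTCC_DEF_ONLY, /* hardware deflate on level 1 only */
    DFLTCC_INF_ONLY, /* hardware inflate only */
    DFLTCC_ALWAYS,   /* like "on", but hardware deflate on every level */
};

/* Map the parameter string to a mode (assumption: unknown -> default). */
static enum dfltcc_mode parse_dfltcc(const char *arg)
{
    if (arg == NULL || strcmp(arg, "on") == 0) return DFLTCC_ON;
    if (strcmp(arg, "off") == 0)               return DFLTCC_OFF;
    if (strcmp(arg, "def_only") == 0)          return DFLTCC_DEF_ONLY;
    if (strcmp(arg, "inf_only") == 0)          return DFLTCC_INF_ONLY;
    if (strcmp(arg, "always") == 0)            return DFLTCC_ALWAYS;
    return DFLTCC_ON;
}

/* Should compression at this zlib level go to the DFLTCC instruction? */
static bool use_hw_deflate(enum dfltcc_mode mode, int level)
{
    assert(level >= 0 && level <= 9);
    switch (mode) {
    case DFLTCC_ALWAYS:
        return true;             /* ignore the level (debugging) */
    case DFLTCC_ON:
    case DFLTCC_DEF_ONLY:
        return level == 1;       /* hardware only on level 1 */
    default:
        return false;            /* off, inf_only */
    }
}

/* Should decompression go to hardware? Inflate does not depend on the level. */
static bool use_hw_inflate(enum dfltcc_mode mode)
{
    return mode == DFLTCC_ON || mode == DFLTCC_INF_ONLY ||
           mode == DFLTCC_ALWAYS;
}
```

For example, with `dfltcc=on` a level-6 stream stays in software, while `dfltcc=always` sends every level to hardware.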
1 parent f88b426 commit aa5b395

11 files changed (+751, -102 lines)

lib/Kconfig

Lines changed: 7 additions & 0 deletions
@@ -278,6 +278,13 @@ config ZLIB_DEFLATE
 	tristate
 	select BITREVERSE

+config ZLIB_DFLTCC
+	def_bool y
+	depends on S390
+	prompt "Enable s390x DEFLATE CONVERSION CALL support for kernel zlib"
+	help
+	  Enable s390x hardware support for zlib in the kernel.
+
 config LZO_COMPRESS
 	tristate

lib/Makefile

Lines changed: 1 addition & 0 deletions
@@ -140,6 +140,7 @@ obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
 obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate/
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
+obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc/
 obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 obj-$(CONFIG_BCH) += bch.o
 obj-$(CONFIG_LZO_COMPRESS) += lzo/

lib/zlib_deflate/deflate.c

Lines changed: 41 additions & 38 deletions
@@ -52,16 +52,18 @@
 #include <linux/zutil.h>
 #include "defutil.h"

+/* architecture-specific bits */
+#ifdef CONFIG_ZLIB_DFLTCC
+# include "../zlib_dfltcc/dfltcc.h"
+#else
+#define DEFLATE_RESET_HOOK(strm) do {} while (0)
+#define DEFLATE_HOOK(strm, flush, bstate) 0
+#define DEFLATE_NEED_CHECKSUM(strm) 1
+#endif

 /* ===========================================================================
  * Function prototypes.
  */
-typedef enum {
-    need_more,      /* block not completed, need more input or more output */
-    block_done,     /* block flush performed */
-    finish_started, /* finish started, need only more output at next deflate */
-    finish_done     /* finish done, accept no more input or output */
-} block_state;

 typedef block_state (*compress_func) (deflate_state *s, int flush);
 /* Compression function. Returns the block state after the call. */
@@ -72,7 +74,6 @@ static block_state deflate_fast (deflate_state *s, int flush);
 static block_state deflate_slow (deflate_state *s, int flush);
 static void lm_init (deflate_state *s);
 static void putShortMSB (deflate_state *s, uInt b);
-static void flush_pending (z_streamp strm);
 static int read_buf (z_streamp strm, Byte *buf, unsigned size);
 static uInt longest_match (deflate_state *s, IPos cur_match);

@@ -98,6 +99,25 @@ static void check_match (deflate_state *s, IPos start, IPos match,
  * See deflate.c for comments about the MIN_MATCH+1.
  */

+/* Workspace to be allocated for deflate processing */
+typedef struct deflate_workspace {
+    /* State memory for the deflator */
+    deflate_state deflate_memory;
+#ifdef CONFIG_ZLIB_DFLTCC
+    /* State memory for s390 hardware deflate */
+    struct dfltcc_state dfltcc_memory;
+#endif
+    Byte *window_memory;
+    Pos *prev_memory;
+    Pos *head_memory;
+    char *overlay_memory;
+} deflate_workspace;
+
+#ifdef CONFIG_ZLIB_DFLTCC
+/* dfltcc_state must be doubleword aligned for DFLTCC call */
+static_assert(offsetof(struct deflate_workspace, dfltcc_memory) % 8 == 0);
+#endif
+
 /* Values for max_lazy_match, good_match and max_chain_length, depending on
  * the desired pack level (0..9). The values given below have been tuned to
  * exclude worst case performance for pathological files. Better values may be
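The `static_assert` added above turns the doubleword-alignment requirement into a build failure instead of a runtime fault. A userspace sketch of the same check, using C11 `static_assert` from `<assert.h>`; `fake_deflate_state`, `fake_dfltcc_state` and `fake_workspace` are hypothetical stand-ins whose sizes are chosen only to exercise the check:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: 52 bytes, deliberately not a multiple of 8. */
struct fake_deflate_state { char bytes[52]; };

/* Hypothetical stand-in: assumed to need an 8-byte-aligned parameter block. */
struct fake_dfltcc_state { _Alignas(8) unsigned char param[32]; };

struct fake_workspace {
    struct fake_deflate_state deflate_memory;
    struct fake_dfltcc_state dfltcc_memory; /* compiler pads to offset 56 */
    unsigned char *window_memory;
};

/* Fails to compile if a layout change ever breaks doubleword alignment. */
static_assert(offsetof(struct fake_workspace, dfltcc_memory) % 8 == 0,
              "dfltcc_memory must be doubleword aligned");
```

Because the member type carries an 8-byte alignment requirement, the compiler inserts 4 bytes of padding after the 52-byte first member.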
@@ -207,7 +227,15 @@ int zlib_deflateInit2(
      */
     next = (char *) mem;
     next += sizeof(*mem);
+#ifdef CONFIG_ZLIB_DFLTCC
+    /*
+     * DFLTCC requires the window to be page aligned.
+     * Thus, we overallocate and take the aligned portion of the buffer.
+     */
+    mem->window_memory = (Byte *) PTR_ALIGN(next, PAGE_SIZE);
+#else
     mem->window_memory = (Byte *) next;
+#endif
     next += zlib_deflate_window_memsize(windowBits);
     mem->prev_memory = (Pos *) next;
     next += zlib_deflate_prev_memsize(windowBits);
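`PTR_ALIGN(next, PAGE_SIZE)` rounds `next` up to the next page boundary. A userspace sketch of that rounding, assuming a power-of-two alignment and 4 KiB pages; `align_up_ptr()` is a hypothetical helper that computes the same result, not the kernel macro's definition:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Round p up to the next multiple of align (align must be a power of two). */
static void *align_up_ptr(void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    /* Adding align - 1 and clearing the low bits never moves backwards. */
    return (void *)((addr + align - 1) & ~(align - 1));
}
```

An already aligned pointer comes back unchanged; any other pointer moves forward by at most `PAGE_SIZE - 1` bytes, which is why the window allocation is enlarged by one page.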
@@ -277,6 +305,8 @@ int zlib_deflateReset(
     zlib_tr_init(s);
     lm_init(s);

+    DEFLATE_RESET_HOOK(strm);
+
     return Z_OK;
 }

@@ -294,35 +324,6 @@ static void putShortMSB(
     put_byte(s, (Byte)(b & 0xff));
 }

-/* =========================================================================
- * Flush as much pending output as possible. All deflate() output goes
- * through this function so some applications may wish to modify it
- * to avoid allocating a large strm->next_out buffer and copying into it.
- * (See also read_buf()).
- */
-static void flush_pending(
-	z_streamp strm
-)
-{
-    deflate_state *s = (deflate_state *) strm->state;
-    unsigned len = s->pending;
-
-    if (len > strm->avail_out) len = strm->avail_out;
-    if (len == 0) return;
-
-    if (strm->next_out != NULL) {
-        memcpy(strm->next_out, s->pending_out, len);
-        strm->next_out += len;
-    }
-    s->pending_out += len;
-    strm->total_out += len;
-    strm->avail_out -= len;
-    s->pending -= len;
-    if (s->pending == 0) {
-        s->pending_out = s->pending_buf;
-    }
-}
-
 /* ========================================================================= */
 int zlib_deflate(
 	z_streamp strm,
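`flush_pending()`, which this patch moves into defutil.h as a `static inline` function, copies as much pending output as fits in the caller's buffer and keeps the rest for the next call. A self-contained sketch of that draining logic with a hypothetical `toy_stream` type (its field names mirror `z_stream` and `deflate_state`, but the type itself is invented for illustration):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical cut-down stream: pending output plus the caller's buffer. */
struct toy_stream {
    unsigned char pending_buf[16]; /* compressed bytes not yet handed out */
    unsigned char *pending_out;    /* next pending byte to copy */
    unsigned pending;              /* number of pending bytes */
    unsigned char *next_out;       /* caller's output cursor (may be NULL) */
    unsigned avail_out;            /* free space left at next_out */
    unsigned long total_out;       /* bytes delivered so far */
};

static void toy_flush_pending(struct toy_stream *s)
{
    unsigned len = s->pending;

    if (len > s->avail_out) len = s->avail_out; /* copy only what fits */
    if (len == 0) return;

    if (s->next_out != NULL) {
        memcpy(s->next_out, s->pending_out, len);
        s->next_out += len;
    }
    s->pending_out += len;
    s->total_out += len;
    s->avail_out -= len;
    s->pending -= len;
    if (s->pending == 0)
        s->pending_out = s->pending_buf; /* fully drained: rewind */
}
```

With 6 pending bytes and a 4-byte output buffer, the first call delivers 4 bytes and leaves 2 pending; a second call with a fresh buffer delivers the rest and rewinds `pending_out`.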
@@ -404,7 +405,8 @@ int zlib_deflate(
         (flush != Z_NO_FLUSH && s->status != FINISH_STATE)) {
         block_state bstate;

-        bstate = (*(configuration_table[s->level].func))(s, flush);
+        bstate = DEFLATE_HOOK(strm, flush, &bstate) ? bstate :
+                 (*(configuration_table[s->level].func))(s, flush);

         if (bstate == finish_started || bstate == finish_done) {
             s->status = FINISH_STATE;
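With `CONFIG_ZLIB_DFLTCC` disabled, `DEFLATE_HOOK()` expands to `0`, so the conditional always falls through to the software compress function; when enabled, a non-zero result means the hook has already stored a value in `bstate`. A userspace sketch of that pattern; `HAVE_HW_HOOK`, `hw_deflate()`, `sw_deflate()` and `deflate_step()` are hypothetical names, not the dfltcc.h interface:

```c
#include <assert.h>

typedef enum { need_more, block_done, finish_started, finish_done } block_state;

/* Hypothetical software path standing in for configuration_table[level].func. */
static block_state sw_deflate(int flush)
{
    return flush ? finish_started : need_more;
}

#ifdef HAVE_HW_HOOK
/* Hypothetical hardware hook: returns 1 and sets *bstate if it handled the call. */
static int hw_deflate(int flush, block_state *bstate)
{
    *bstate = flush ? finish_done : need_more;
    return 1;
}
#define DEFLATE_HOOK(flush, bstate) hw_deflate((flush), (bstate))
#else
/* Fallback mirrors the patch: the hook never claims the call. */
#define DEFLATE_HOOK(flush, bstate) 0
#endif

static block_state deflate_step(int flush)
{
    block_state bstate;

    /* Same shape as the patched zlib_deflate(): hook first, software otherwise. */
    bstate = DEFLATE_HOOK(flush, &bstate) ? bstate : sw_deflate(flush);
    return bstate;
}
```

The `?:` operator has a sequence point after its first operand, so `bstate` is read only after the hook has written it. Compiling with `-DHAVE_HW_HOOK` makes `deflate_step(1)` return the hypothetical hardware result `finish_done` instead.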
@@ -503,7 +505,8 @@ static int read_buf(

     strm->avail_in -= len;

-    if (!((deflate_state *)(strm->state))->noheader) {
+    if (!DEFLATE_NEED_CHECKSUM(strm)) {}
+    else if (!((deflate_state *)(strm->state))->noheader) {
         strm->adler = zlib_adler32(strm->adler, strm->next_in, len);
     }
     memcpy(buf, strm->next_in, len);
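When the hardware produces the Adler-32 checksum itself, `DEFLATE_NEED_CHECKSUM()` lets `read_buf()` skip the software update; otherwise the checksum is updated only for zlib-wrapped streams (`noheader == 0`). A sketch of that decision with a straightforward, unoptimized Adler-32 (the kernel's `zlib_adler32()` defers the modulo operations for speed); `adler32_simple()` and `checksum_input()` are hypothetical helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */

/* Plain Adler-32: the high 16 bits hold sum B, the low 16 bits hold sum A. */
static uint32_t adler32_simple(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffff;
    uint32_t b = adler >> 16;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_BASE;
        b = (b + a) % ADLER_BASE;
    }
    return (b << 16) | a;
}

/*
 * Equivalent to the patched "if (!NEED_CHECKSUM) {} else if (!noheader)"
 * chain: update only when software must checksum and a zlib header is used.
 */
static uint32_t checksum_input(uint32_t adler, const unsigned char *buf,
                               size_t len, int need_checksum, int noheader)
{
    if (need_checksum && !noheader)
        adler = adler32_simple(adler, buf, len);
    return adler;
}
```

Starting from the initial value 1, the Adler-32 of the ASCII string "Wikipedia" is 0x11E60398; with the checksum delegated to hardware, or for a raw stream, the value is left at 1.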

lib/zlib_deflate/deftree.c

Lines changed: 0 additions & 54 deletions
@@ -76,11 +76,6 @@ static const uch bl_order[BL_CODES]
  * probability, to avoid transmitting the lengths for unused bit length codes.
  */

-#define Buf_size (8 * 2*sizeof(char))
-/* Number of bits used within bi_buf. (bi_buf might be implemented on
- * more than 16 bits on some systems.)
- */
-
 /* ===========================================================================
  * Local data. These are initialized only once.
  */
@@ -147,7 +142,6 @@ static void send_all_trees (deflate_state *s, int lcodes, int dcodes,
 static void compress_block (deflate_state *s, ct_data *ltree,
                             ct_data *dtree);
 static void set_data_type (deflate_state *s);
-static void bi_windup (deflate_state *s);
 static void bi_flush (deflate_state *s);
 static void copy_block (deflate_state *s, char *buf, unsigned len,
                         int header);
@@ -169,54 +163,6 @@ static void copy_block (deflate_state *s, char *buf, unsigned len,
  * used.
  */

-/* ===========================================================================
- * Send a value on a given number of bits.
- * IN assertion: length <= 16 and value fits in length bits.
- */
-#ifdef DEBUG_ZLIB
-static void send_bits (deflate_state *s, int value, int length);
-
-static void send_bits(
-	deflate_state *s,
-	int value,  /* value to send */
-	int length  /* number of bits */
-)
-{
-    Tracevv((stderr," l %2d v %4x ", length, value));
-    Assert(length > 0 && length <= 15, "invalid length");
-    s->bits_sent += (ulg)length;
-
-    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
-     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
-     * unused bits in value.
-     */
-    if (s->bi_valid > (int)Buf_size - length) {
-        s->bi_buf |= (value << s->bi_valid);
-        put_short(s, s->bi_buf);
-        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
-        s->bi_valid += length - Buf_size;
-    } else {
-        s->bi_buf |= value << s->bi_valid;
-        s->bi_valid += length;
-    }
-}
-#else /* !DEBUG_ZLIB */
-
-#define send_bits(s, value, length) \
-{ int len = length;\
-  if (s->bi_valid > (int)Buf_size - len) {\
-    int val = value;\
-    s->bi_buf |= (val << s->bi_valid);\
-    put_short(s, s->bi_buf);\
-    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
-    s->bi_valid += len - Buf_size;\
-  } else {\
-    s->bi_buf |= (value) << s->bi_valid;\
-    s->bi_valid += len;\
-  }\
-}
-#endif /* DEBUG_ZLIB */
-
 /* ===========================================================================
  * Initialize the various 'constant' tables. In a multi-threaded environment,
  * this function may be called by two threads concurrently, but this is

lib/zlib_deflate/defutil.h

Lines changed: 124 additions & 10 deletions
@@ -1,5 +1,7 @@
+#ifndef DEFUTIL_H
+#define DEFUTIL_H

-
+#include <linux/zutil.h>

 #define Assert(err, str)
 #define Trace(dummy)
@@ -238,17 +240,13 @@ typedef struct deflate_state {

 } deflate_state;

-typedef struct deflate_workspace {
-    /* State memory for the deflator */
-    deflate_state deflate_memory;
-    Byte *window_memory;
-    Pos *prev_memory;
-    Pos *head_memory;
-    char *overlay_memory;
-} deflate_workspace;
-
+#ifdef CONFIG_ZLIB_DFLTCC
+#define zlib_deflate_window_memsize(windowBits) \
+        (2 * (1 << (windowBits)) * sizeof(Byte) + PAGE_SIZE)
+#else
 #define zlib_deflate_window_memsize(windowBits) \
         (2 * (1 << (windowBits)) * sizeof(Byte))
+#endif
 #define zlib_deflate_prev_memsize(windowBits) \
         ((1 << (windowBits)) * sizeof(Pos))
 #define zlib_deflate_head_memsize(memLevel) \
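The extra `PAGE_SIZE` in the DFLTCC variant of `zlib_deflate_window_memsize()` pays for the alignment step in `zlib_deflateInit2()`: rounding up to a page boundary skips at most `PAGE_SIZE - 1` bytes, so the aligned window of `2 * (1 << windowBits)` bytes always fits. A worked check of that bound, assuming 4 KiB pages and `sizeof(Byte) == 1`; `window_memsize()`, `align_skip()` and `window_fits()` are hypothetical helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical copy of zlib_deflate_window_memsize() with CONFIG_ZLIB_DFLTCC. */
static size_t window_memsize(int windowBits)
{
    return 2 * ((size_t)1 << windowBits) + SKETCH_PAGE_SIZE;
}

/* Bytes skipped when an address at in-page offset `off` is rounded up. */
static size_t align_skip(uintptr_t off)
{
    return (SKETCH_PAGE_SIZE - off % SKETCH_PAGE_SIZE) % SKETCH_PAGE_SIZE;
}

/* True if the aligned window still ends inside the over-allocated region. */
static int window_fits(int windowBits, uintptr_t off)
{
    size_t window = 2 * ((size_t)1 << windowBits);
    return align_skip(off) + window <= window_memsize(windowBits);
}
```

For the default `windowBits = 15`, the allocation is 2 * 32768 + 4096 = 69632 bytes, and the worst case (offset 1, skipping 4095 bytes) needs 4095 + 65536 = 69631 bytes.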
@@ -292,6 +290,24 @@ void zlib_tr_stored_type_only (deflate_state *);
     put_byte(s, (uch)((ush)(w) >> 8)); \
 }

+/* ===========================================================================
+ * Reverse the first len bits of a code, using straightforward code (a faster
+ * method would use a table)
+ * IN assertion: 1 <= len <= 15
+ */
+static inline unsigned bi_reverse(
+	unsigned code, /* the value to invert */
+	int len        /* its bit length */
+)
+{
+    register unsigned res = 0;
+    do {
+        res |= code & 1;
+        code >>= 1, res <<= 1;
+    } while (--len > 0);
+    return res >> 1;
+}
+
 /* ===========================================================================
  * Flush the bit buffer, keeping at most 7 bits in it.
  */
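deflate packs data elements starting with the least significant bit, but Huffman codes are defined starting with the most significant bit, so a code is bit-reversed before it goes into the LSB-first bit buffer. A userspace copy of the `bi_reverse()` loop shown above with a few worked values; `bi_reverse_sketch()` is a hypothetical name, and the `assert` stands in for the comment's IN assertion:

```c
#include <assert.h>

/* Reverse the low `len` bits of `code` (1 <= len <= 15), e.g. 0011 -> 1100. */
static unsigned bi_reverse_sketch(unsigned code, int len)
{
    unsigned res = 0;

    assert(len >= 1 && len <= 15);
    do {
        res |= code & 1;       /* take the current lowest bit of code */
        code >>= 1, res <<= 1; /* advance to the next bit, make room in res */
    } while (--len > 0);
    return res >> 1;           /* undo the extra shift from the last iteration */
}
```

Only the low `len` bits of `code` are consumed, so `bi_reverse_sketch(0x5, 5)` turns 00101 into 10100 (0x14).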
@@ -325,3 +341,101 @@ static inline void bi_windup(deflate_state *s)
 #endif
 }

+typedef enum {
+    need_more,      /* block not completed, need more input or more output */
+    block_done,     /* block flush performed */
+    finish_started, /* finish started, need only more output at next deflate */
+    finish_done     /* finish done, accept no more input or output */
+} block_state;
+
+#define Buf_size (8 * 2*sizeof(char))
+/* Number of bits used within bi_buf. (bi_buf might be implemented on
+ * more than 16 bits on some systems.)
+ */
+
+/* ===========================================================================
+ * Send a value on a given number of bits.
+ * IN assertion: length <= 16 and value fits in length bits.
+ */
+#ifdef DEBUG_ZLIB
+static void send_bits (deflate_state *s, int value, int length);
+
+static void send_bits(
+	deflate_state *s,
+	int value,  /* value to send */
+	int length  /* number of bits */
+)
+{
+    Tracevv((stderr," l %2d v %4x ", length, value));
+    Assert(length > 0 && length <= 15, "invalid length");
+    s->bits_sent += (ulg)length;
+
+    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
+     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
+     * unused bits in value.
+     */
+    if (s->bi_valid > (int)Buf_size - length) {
+        s->bi_buf |= (value << s->bi_valid);
+        put_short(s, s->bi_buf);
+        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
+        s->bi_valid += length - Buf_size;
+    } else {
+        s->bi_buf |= value << s->bi_valid;
+        s->bi_valid += length;
+    }
+}
+#else /* !DEBUG_ZLIB */
+
+#define send_bits(s, value, length) \
+{ int len = length;\
+  if (s->bi_valid > (int)Buf_size - len) {\
+    int val = value;\
+    s->bi_buf |= (val << s->bi_valid);\
+    put_short(s, s->bi_buf);\
+    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
+    s->bi_valid += len - Buf_size;\
+  } else {\
+    s->bi_buf |= (value) << s->bi_valid;\
+    s->bi_valid += len;\
+  }\
+}
+#endif /* DEBUG_ZLIB */
+
+static inline void zlib_tr_send_bits(
+	deflate_state *s,
+	int value,
+	int length
+)
+{
+    send_bits(s, value, length);
+}
+
+/* =========================================================================
+ * Flush as much pending output as possible. All deflate() output goes
+ * through this function so some applications may wish to modify it
+ * to avoid allocating a large strm->next_out buffer and copying into it.
+ * (See also read_buf()).
+ */
+static inline void flush_pending(
+	z_streamp strm
+)
+{
+    deflate_state *s = (deflate_state *) strm->state;
+    unsigned len = s->pending;
+
+    if (len > strm->avail_out) len = strm->avail_out;
+    if (len == 0) return;
+
+    if (strm->next_out != NULL) {
+        memcpy(strm->next_out, s->pending_out, len);
+        strm->next_out += len;
+    }
+    s->pending_out += len;
+    strm->total_out += len;
+    strm->avail_out -= len;
+    s->pending -= len;
+    if (s->pending == 0) {
+        s->pending_out = s->pending_buf;
+    }
+}
+#endif /* DEFUTIL_H */
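`send_bits()` and the new `zlib_tr_send_bits()` wrapper append `length` bits to a 16-bit buffer that fills from the least significant bit up, writing it out as two bytes (low byte first, as `put_short()` does) whenever the next value would overflow it. A userspace sketch of the same algorithm; `struct bitwriter`, `sketch_put_short()` and `sketch_send_bits()` are hypothetical, cut-down stand-ins for `deflate_state` and the kernel helpers:

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE 16 /* bits held in bi_buf, like Buf_size in defutil.h */

/* Hypothetical, cut-down stand-in for deflate_state. */
struct bitwriter {
    unsigned char out[64];  /* stands in for pending_buf */
    size_t pending;         /* bytes written to out */
    unsigned short bi_buf;  /* bit buffer, filled from the LSB up */
    int bi_valid;           /* number of valid bits in bi_buf */
};

/* Write 16 bits, low byte first. */
static void sketch_put_short(struct bitwriter *s, unsigned short w)
{
    s->out[s->pending++] = (unsigned char)(w & 0xff);
    s->out[s->pending++] = (unsigned char)(w >> 8);
}

/* Append the low `length` bits of `value` (1 <= length <= 15). */
static void sketch_send_bits(struct bitwriter *s, int value, int length)
{
    assert(length > 0 && length <= 15);
    assert(s->pending + 2 <= sizeof(s->out));
    if (s->bi_valid > BUF_SIZE - length) {
        /* Not enough room: top up bi_buf, emit it, keep the leftover bits. */
        s->bi_buf |= (unsigned short)(value << s->bi_valid);
        sketch_put_short(s, s->bi_buf);
        s->bi_buf = (unsigned short)((unsigned short)value >> (BUF_SIZE - s->bi_valid));
        s->bi_valid += length - BUF_SIZE;
    } else {
        s->bi_buf |= (unsigned short)(value << s->bi_valid);
        s->bi_valid += length;
    }
}
```

Sending 101 (3 bits), 11001 (5 bits) and 0x12 (8 bits) fills the buffer to 0x12CD with 16 valid bits; one more bit forces the bytes 0xCD, 0x12 out and leaves that single bit in `bi_buf`.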
