Maximum RPM: Taking the RPM Package Manager to the Limit
Appendix A. Format of the RPM File
While the following details concerning the actual format of an RPM package file were accurate at the time this was written, three points should be kept in mind:
1. The file format is subject to change.
2. If a package file is to be manipulated somehow, you are strongly urged to use the appropriate rpmlib routines to access the package file. Why? See point number 1!
3. This appendix describes the most recent version of the RPM file format: version 3. The file(1) utility can be used to see a package's file format version.
With those caveats out of the way, let's take a look inside an RPM file…
Every RPM package file can be divided into four distinct sections. They are:
- The lead.
- The signature.
- The header.
- The archive.
Package files are written to disk in network byte order. If required, RPM will automatically convert to host byte order when the package file is read. Let's take a look at each section, starting with the lead.
The lead is the first part of an RPM package file. In previous versions of RPM, it was used to store information used internally by RPM. Today, however, the lead's sole purpose is to make it easy to identify an RPM package file. For example, the file(1) command uses the lead. [1] All the information contained in the lead has been duplicated or superseded by information contained in the header. [2]
RPM defines a C structure that describes the lead:
|
Let's take a look at an actual package file and examine the various
pieces of data that make up the lead. In the following display, the
number to the left of the colon is the byte offset, in hexadecimal, from
the start of the file. The eight groups of four characters show the hex
value of the bytes in the file — two bytes per group of four characters.
Finally, the characters on the right show the ASCII values of the data
bytes. When a data byte's value results in a non-printable character, a
dot (".") is inserted instead. Here are the first thirty-two bytes of a
package file, in this case rpm-2.2.1-1.i386.rpm:
|
The first four bytes (edab eedb) are the magic values that identify the file as an RPM package file. Both the file command and RPM use these magic numbers to determine whether a file is legitimate or not.
The next two bytes (0300) indicate the RPM file format version. In this case, the file's major version number is 3, and the minor version number is 0. Versions of RPM later than 2.1 create version 3.0 package files.
The next two bytes (0000) determine what type of RPM file the file is. There are presently two types defined:
- Binary package file (type = 0000)
- Source package file (type = 0001)
In this case, the file is a binary package file.
The next two bytes (0001) are used to store the architecture that the package was built for. In this case, the number 1 refers to the i386 architecture. [3] In the case of a source package file, these two bytes should be ignored, as source packages are not built for a specific architecture.
The next sixty-six bytes (starting with 7270 6d2d) contain the name of the package. The name must end with a null byte, which leaves sixty-five bytes for RPM's usual name-version-release-style name. In this case, we can read the name from the right side of the output:
|
Since the name rpm-2.2.1-1 is shorter than the sixty-five bytes allocated for the name, the leftover bytes are filled with nulls.
Skipping past the space allocated for the name, we see two bytes (0001):
|
These bytes represent the operating system for which this package was built. In this case, 1 equals Linux. As with the architecture-to-number translations, the operating system and corresponding code numbers can be found in the file /usr/lib/rpmrc.
The next two bytes (0005) indicate the type of signature used in the file. A type 5 signature is new to version 3 RPM files. The signature appears next in the file, but we need to discuss an additional detail before exploring the signature.
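The field-by-field walk through the lead can be sketched in a few lines of Python. This is only an illustration, not how rpmlib does it: the function and key names are made up for the example, and the sixteen trailing bytes are an assumption, inferred from the fact that the signature's header structure begins at hex offset 60 in the package examined here. The sample lead is built from the values decoded above rather than read from a real file.

```python
import struct

# Assumed layout, following the walk-through: 4-byte magic, 1-byte major,
# 1-byte minor, 2-byte type, 2-byte architecture, 66-byte name, 2-byte OS,
# 2-byte signature type, then 16 trailing bytes (assumed to be reserved).
# The leading ">" selects network (big-endian) byte order with no padding.
LEAD_FORMAT = ">4sBBhh66shh16s"
LEAD_SIZE = struct.calcsize(LEAD_FORMAT)
LEAD_MAGIC = bytes.fromhex("edabeedb")


def parse_lead(data):
    """Decode the lead from the first LEAD_SIZE bytes of a package file."""
    if len(data) < LEAD_SIZE:
        raise ValueError("too short to contain a lead")
    (magic, major, minor, pkg_type, arch,
     name, os_num, sig_type, _reserved) = struct.unpack_from(LEAD_FORMAT, data, 0)
    if magic != LEAD_MAGIC:
        raise ValueError("not an RPM package file")
    return {
        "version": (major, minor),
        "type": pkg_type,              # 0 = binary, 1 = source
        "arch": arch,                  # ignored for source packages
        # The name is null-terminated; the rest of the field is null padding.
        "name": name.split(b"\0", 1)[0].decode("ascii"),
        "os": os_num,
        "signature_type": sig_type,
    }


# A synthetic lead holding the values read from rpm-2.2.1-1.i386.rpm above.
# struct.pack pads the short name and the empty reserved field with nulls.
sample_lead = struct.pack(LEAD_FORMAT, LEAD_MAGIC, 3, 0, 0, 1,
                          b"rpm-2.2.1-1", 1, 5, b"")
lead = parse_lead(sample_lead)
```

To try this on a real package, the first 96 bytes of the file could be passed to parse_lead() in place of sample_lead.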
By looking at the C structure that defines the lead, and matching it with the bytes in an actual package file, it's trivial to extract the data from the lead. From a programming standpoint, it's also easy to manipulate data in the lead: it's simply a matter of using the element names from the structure. But there's a problem, and because of that problem the lead is no longer used internally by RPM.
What's the problem, and why is the lead no longer used by RPM? The answer to these questions is a single word: inflexibility. The technique of defining a C structure to access data in a file just isn't very flexible. Let's look at an example.
Flip back to the lead's C structure in the Section called The Lead. Say, for example, that some software comes along, and it has a long name. A very long name. A name so long, in fact, that the 66 bytes defined in the structure element name just couldn't hold it.
What can we do? Well, we could certainly change the structure such that the name element would be 100 bytes long. But once a new version of RPM is created using this new structure, we have two problems:
- The new version of RPM wouldn't be able to read package files written in the older format.
- Any older version of RPM would be unable to install packages created with the newer version of RPM.
Not a very good situation! Ideally, we would like to somehow eliminate the requirement that the format of the data written to a package file be engraved in granite. We should be able to do the following things, all without losing compatibility with existing versions of RPM:
- Add extra data to the file format.
- Change the size of existing data.
- Reorder the data.
Sounds like a big problem, but there's a solution…
The solution is to standardize the method by which information is retrieved from a file. This is done by creating a well-defined data structure that contains easily searched information about the data, and then physically separating that information from the data.
When the data is required, it is found by using the easily searched information, which points to the data itself. The benefits are that the data can be placed anywhere in the file and that the format of the data itself can change.
The header structure is RPM's solution to the problem of easily manipulating information in a standard way. The header structure's sole purpose in life is to contain zero or more pieces of data. A file can have more than one header structure in it. In fact, an RPM package file has two — the signature, and the header. It was from this header that the header structure got its name.
There are three sections to each header structure. The first section is known as the header structure header. The header structure header is used to identify the start of a header structure, its size, and the number of data items it contains.
Following the header structure header is an area called the index. The index contains zero or more index entries. Each index entry contains information about, and a pointer to, a specific data item.
After the index comes the store. It is in the store that the data items are kept. The data in the store is packed together as closely as possible. The order in which the data is stored is immaterial — a far cry from the C structure used in the lead.
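To see why separating the index from the store buys so much flexibility, here is a deliberately simplified sketch. The (tag, offset, length) tuples and the tag numbers 1 and 2 are hypothetical stand-ins chosen for the example; the real index entry format is described below.

```python
def find_item(index, store, wanted_tag):
    """Locate a data item through the index; its position in the store is irrelevant."""
    for tag, offset, length in index:
        if tag == wanted_tag:
            return store[offset:offset + length]
    raise KeyError(wanted_tag)


# Hypothetical tags: 1 = a name, 2 = a version.
store_a = b"rpm" + b"2.2.1"
index_a = [(1, 0, 3), (2, 3, 5)]

# The same two items, stored in the opposite order and with a much longer name.
store_b = b"2.2.1" + b"a-much-longer-name"
index_b = [(2, 0, 5), (1, 5, 18)]

# The lookup code is identical for both layouts.
name_a = find_item(index_a, store_a, 1)
name_b = find_item(index_b, store_b, 1)
version_b = find_item(index_b, store_b, 2)
```

Nothing in find_item() depends on where an item sits or how long it is, which is exactly what the fixed C structure of the lead could not offer.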
Let's take a more in-depth look at the actual format of a header structure, starting with the header structure header:
The header structure header always starts with a three-byte magic number: 8e ad e8. Following this is a one-byte version number. Next are four bytes that are reserved for future expansion. After the reserved bytes, there is a four-byte number that indicates how many index entries exist in this header structure, followed by another four-byte number indicating how many bytes of data are part of the header structure.
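As a rough sketch, those sixteen bytes can be decoded with Python's struct module. The function name is invented for this example, and the counts are treated as unsigned integers, which is an assumption. The sample bytes are the ones this appendix later reads from the signature of rpm-2.2.1-1.i386.rpm.

```python
import struct

HEADER_MAGIC = bytes.fromhex("8eade8")
# magic (3 bytes), version (1), reserved (4), entry count (4), store size (4)
HSH_FORMAT = ">3sB4sII"


def parse_header_structure_header(data, offset=0):
    """Return (version, entry_count, store_size) for the header structure at offset."""
    magic, version, _reserved, entry_count, store_size = struct.unpack_from(
        HSH_FORMAT, data, offset)
    if magic != HEADER_MAGIC:
        raise ValueError("no header structure magic at offset %#x" % offset)
    return version, entry_count, store_size


sample = bytes.fromhex("8eade801" "00000000" "00000003" "000000ac")
version, entry_count, store_size = parse_header_structure_header(sample)
```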
The header structure's index is made up of zero or more index entries. Each entry is sixteen bytes long. The first four bytes contain a tag, a numeric value that identifies what type of data is pointed to by the entry. The meaning of a tag value depends on which header structure it appears in: as we'll see, the same number means one thing in the signature and another in the header. A list of the actual tag values, and what they represent, will be included later in this appendix.
Following the tag is a four-byte type, which is a numeric value that describes the format of the data pointed to by the entry. The types and their values do not change from header structure to header structure. Here is the current list:
- NULL = 0
- CHAR = 1
- INT8 = 2
- INT16 = 3
- INT32 = 4
- INT64 = 5
- STRING = 6
- BIN = 7
- STRING_ARRAY = 8
A few of the data types might need some clarification. The STRING data type is simply a null-terminated string, while the STRING_ARRAY is a collection of strings. Finally, the BIN data type is a collection of binary data. This is normally used to identify data that is longer than an INT, but not a printable STRING.
Next is a four-byte offset that contains the position of the data, relative to the beginning of the store. We'll talk about the store in just a moment.
Finally, there is a four-byte count that contains the number of data items pointed to by the index entry. There are a few wrinkles to the meaning of the count, and they center around the STRING and STRING_ARRAY data types. STRING data always has a count of 1, while STRING_ARRAY data has a count equal to the number of strings contained in the store.
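Putting the four fields together, an index can be decoded along the following lines. This is a minimal sketch: IndexEntry, TYPE_NAMES, and parse_index are names made up for the example, and the fields are read as unsigned integers, which is an assumption. The sample bytes are the three signature index entries decoded later in this appendix.

```python
import struct
from collections import namedtuple

TYPE_NAMES = {0: "NULL", 1: "CHAR", 2: "INT8", 3: "INT16", 4: "INT32",
              5: "INT64", 6: "STRING", 7: "BIN", 8: "STRING_ARRAY"}

IndexEntry = namedtuple("IndexEntry", "tag type offset count")
INDEX_ENTRY_SIZE = 16


def parse_index(data, index_offset, entry_count):
    """Decode entry_count sixteen-byte index entries starting at index_offset."""
    entries = []
    for i in range(entry_count):
        fields = struct.unpack_from(">IIII", data, index_offset + i * INDEX_ENTRY_SIZE)
        entries.append(IndexEntry(*fields))
    return entries


sample_index = bytes.fromhex(
    "000003e8" "00000004" "00000000" "00000001"   # tag 1000, INT32, offset 0, count 1
    "000003e9" "00000007" "00000004" "00000010"   # tag 1001, BIN, offset 4, count 16
    "000003ea" "00000007" "00000014" "00000098"   # tag 1002, BIN, offset 20, count 152
)
entries = parse_index(sample_index, 0, 3)
type_names = [TYPE_NAMES[e.type] for e in entries]
```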
The store is where the data contained in the header structure is stored. Depending on the data type being stored, there are some details that should be kept in mind:
- For STRING data, each string is terminated with a null byte.
- For INT data, each integer is stored at the natural boundary for its type. A 64-bit INT is stored on an 8-byte boundary, a 16-bit INT is stored on a 2-byte boundary, and so on.
- All data is in network byte order.
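The rules above can be sketched as a small decoder that takes a type, offset, and count from an index entry and pulls the matching data out of the store. The names are hypothetical, integers are read as signed (an assumption), and the NULL and CHAR types are left out for brevity. The sample store is synthetic, though its values are borrowed from the package examined in this appendix.

```python
import struct

INT_CODES = {2: "b", 3: "h", 4: "i", 5: "q"}   # INT8, INT16, INT32, INT64
STRING, BIN, STRING_ARRAY = 6, 7, 8


def align(offset, size):
    """Round offset up to the natural boundary for an integer of `size` bytes."""
    return (offset + size - 1) // size * size


def read_item(store, data_type, offset, count):
    """Decode the data item that an index entry points to."""
    if data_type in INT_CODES:
        # ">" means network byte order; count integers are stored back to back.
        return list(struct.unpack_from(f">{count}{INT_CODES[data_type]}", store, offset))
    if data_type == BIN:
        return store[offset:offset + count]
    if data_type in (STRING, STRING_ARRAY):
        strings = []
        for _ in range(count):
            end = store.index(b"\0", offset)     # each string ends with a null byte
            strings.append(store[offset:end].decode("latin-1"))
            offset = end + 1
        return strings[0] if data_type == STRING else strings
    raise ValueError(f"type {data_type} is not handled in this sketch")


# Synthetic store: a STRING at 0, an INT32 at 4, a two-element STRING_ARRAY at 8.
store = b"rpm\0" + struct.pack(">i", 281679) + b"/bin/rpm\0/etc/rpmrc\0"
name = read_item(store, STRING, 0, 1)
size = read_item(store, 4, 4, 1)
files = read_item(store, STRING_ARRAY, 8, 2)
```

A writer would use something like align() to decide where the next integer may be placed; a reader simply trusts the offset recorded in the index entry.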
With all these details out of the way, let's take a look at the signature.
The signature section follows the lead in the RPM package file. It contains information that can be used to verify the integrity, and optionally, the authenticity of the majority of the package file. The signature is implemented as a header structure.
You probably noticed the word, "majority", above. The information in the signature header structure is based on the contents of the package file's header and archive only. The data in the lead and the signature header structure are not included when the signature information is created, nor are they part of any subsequent checks based on that information.
While that omission might seem to be a weakness in RPM's design, it really isn't. In the case of the lead, since it is used only for easy identification of package files, any changes made to that part of the file would, at worst, leave the file in such a state that RPM wouldn't recognize it as a valid package file. Likewise, any changes to the signature header structure would make it impossible to verify the file's integrity, since the signature information would no longer hold its original values.
Using our new-found knowledge of header structures, let's take a look at the signatures in rpm-2.2.1-1.i386.rpm:
|
The first three bytes (8ead e8) contain the magic number for the start of the header structure. The next byte (01) is the header structure's version.
As we discussed earlier, the next four bytes (0000 0000) are reserved. The four bytes after that (0000 0003) represent the number of index entries in the signature section, namely, three. Following that are four bytes (0000 00ac) that indicate how many bytes of data are stored in the signature. The hex value 00ac, when converted to decimal, means the store is 172 bytes long.
Following the first 16 bytes is the index. Each of the three index entries in this header structure consists of four 32-bit integers, in the following order:
- Tag
- Type
- Offset
- Count
Let's take a look at the first index entry:
|
The tag consists of the first four bytes (0000 03e8), which is 1000 when translated from hex. Looking in the RPM source directory at the file lib/signature.h, we find the following tag definitions:
|
So the tag we are studying is for a size signature. Let's continue.
The next four bytes (0000 0004) contain the data type. As we saw earlier, data type 4 means that the data stored for this index entry is a 32-bit integer. Skipping the next four bytes for a moment, the last four bytes (0000 0001) are the number of 32-bit integers pointed to by this index entry.
Now, let's go back to the four bytes prior to the count (0000 0000). This number is the offset, in bytes, at which the size signature is located. It has a value of zero, but the question is, zero bytes from what? The answer, although it doesn't do us much good, is that the offset is calculated from the start of the store. So first we must find where the store begins, and we can do that by performing a simple calculation.
First, go back to the start of the signature section. (We've made a copy here so you won't need to flip from page to page.)
|
After the magic, the version, and the four reserved bytes, there is the number of index entries (0000 0003). Since we know that each index entry is sixteen bytes long (four for the tag, four for the type, four for the offset, and four for the count), we can multiply the number of entries (3) by the number of bytes in each entry (16), and obtain the total size of the index, which is 48 decimal, or 30 in hex. Since the first index entry starts at hex offset 70, we can simply add hex 30 to hex 70, and get, in hex, offset a0. So let's skip down to offset a0, and see what's there:
|
If we've done our math correctly, the first four bytes (0004 4c4f) should represent the size of this file. Converting to decimal, this is 281,679. Let's take a look at the size of the actual file:
|
Hmmm, something's not right. Or is it? It looks like we're short by 336 bytes, or in hex, 150. Interesting how that's a nice round hex number, isn't it? For now, let's continue through the remainder of the index entries, and see if hex 150 pops up elsewhere.
Here's the next index entry. It has a tag of decimal 1001, which is an MD5 checksum. It is type 7, which is the BIN data type, it is 16 bytes long, and its data starts four bytes after the beginning of the store:
|
And here's the data. It starts with b025 (remember that offset of four!) and ends on the second line with 5375. This is a 128-bit MD5 checksum of the package file's header and archive sections.
|
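Since the checksum covers only the header and archive, checking it amounts to hashing everything from the start of the header to the end of the file. Here is a minimal sketch using Python's hashlib; the function name and the header_start parameter are assumptions made for the example, and the "package" is a synthetic stand-in, not real RPM data.

```python
import hashlib


def md5_signature_matches(package_bytes, header_start, expected_digest):
    """Compare the stored 16-byte MD5 with one computed over header + archive."""
    digest = hashlib.md5(package_bytes[header_start:]).digest()
    return digest == expected_digest


# Synthetic stand-in: 0x150 placeholder bytes for the lead and signature,
# followed by some bytes playing the role of the header and archive.
payload = b"stand-in for header and archive"
package = bytes(0x150) + payload
expected = hashlib.md5(payload).digest()

untouched = md5_signature_matches(package, 0x150, expected)
# Changing a byte in the lead does not affect the checksum...
lead_changed = md5_signature_matches(b"X" + package[1:], 0x150, expected)
# ...but changing anything in the header or archive does.
payload_changed = md5_signature_matches(package + b"!", 0x150, expected)
```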
Ok, let's jump back to the last index entry:
|
It has a tag value of 03ea (1002 in decimal, a PGP signature block) and is also a BIN data type. The data starts 20 decimal bytes from the start of the data area, which would put it at file offset b4 (in hex). It's a biggie: 152 bytes long! Here's the data, starting with 8900:
|
It ends with the bytes 4a9b. This is a 1,216-bit PGP signature block. It is also the end of the signature section. There are four null bytes following the last data item in order to round the size out so that it ends on an 8-byte boundary. This means that the next section starts at offset 150, in hex. Say, wasn't the size in the size signature off by 150 hex? Yes, the size in the signature is the size of the file, less the size of the lead and the signature sections.
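The arithmetic of the last few pages can be double-checked with a short sketch. The function name is made up, and the starting offset of hex 60 is inferred from the first index entry sitting at hex 70, sixteen bytes after the start of the header structure header.

```python
def signature_layout(sig_start, entry_count, store_size):
    """Return (store_start, store_end, next_section) as file offsets."""
    index_start = sig_start + 16                     # skip the header structure header
    store_start = index_start + 16 * entry_count     # sixteen bytes per index entry
    store_end = store_start + store_size
    next_section = (store_end + 7) // 8 * 8          # pad to an 8-byte boundary
    return store_start, store_end, next_section


store_start, store_end, header_start = signature_layout(0x60, 3, 0xAC)
padding = header_start - store_end

# The size signature counts the header and archive only, so adding the
# offset of the header gives the size of the whole file.
size_signature = 0x00044C4F
file_size = size_signature + header_start
```

With the values from this package, the store runs from hex a0 to hex 14c, four bytes of padding follow, and the header begins at hex 150.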
The header section contains all available information about the package. Entries such as the package's name, version, and file list, are contained in the header. Like the signature section, the header is in header structure format. Unlike the signature, which has only three possible tag types, the header has more than sixty different tags. The list of currently defined tags appears later in this appendix in the Section called Header Tag Listing. Be aware that the list of tags changes frequently. The definitive list appears in the RPM sources in lib/rpmlib.h.
The easiest way to find the start of the header is to look for the second header structure by scanning for its magic number (8ead e8). The sixteen bytes, starting with the magic, are the header structure's header. They follow the same format as the header in the signature's header structure:
|
As before, the byte following the magic identifies this header structure as being in version 1 format. Following the four reserved bytes, we find the count of entries stored in the header (0000 0021). Converting to decimal, we find that there are 33 entries in the header. The next four bytes (0000 09d3) converted to decimal, tell us that there are 2,515 bytes of data in the store.
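Scanning for the magic number is easy to sketch. The example below builds a synthetic stand-in for the start of a package file, in which only the two header structure headers carry real values from this appendix, and then looks for the magic. All names are hypothetical. Keep in mind that a naive scan like this could, in principle, be fooled if the same three bytes happened to occur inside the signature's data, so computing the offset from the signature's own counts is the safer route.

```python
import struct

HEADER_MAGIC = bytes.fromhex("8eade8")


def find_magic_offsets(data):
    """Return every offset at which the three magic bytes occur."""
    offsets = []
    pos = data.find(HEADER_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(HEADER_MAGIC, pos + 1)
    return offsets


signature_hsh = bytes.fromhex("8eade801" "00000000" "00000003" "000000ac")
header_hsh = bytes.fromhex("8eade801" "00000000" "00000021" "000009d3")
fake_package = (
    bytes(0x60)                # lead (zero-filled placeholder)
    + signature_hsh
    + bytes(3 * 16 + 0xAC)     # signature index and store (placeholder)
    + bytes(4)                 # padding to an 8-byte boundary
    + header_hsh
)

offsets = find_magic_offsets(fake_package)
header_start = offsets[1]      # the second header structure is the header
entry_count, store_size = struct.unpack_from(">II", fake_package, header_start + 8)
```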
Since the header is a header structure just like the signature, we know that the next 16 bytes are the first index entry:
|
The first four bytes (0000 03e8) are the tag, which is the tag for the package name. The next four bytes indicate the data is type 6, or a null-terminated string. There's an offset of zero in the next four bytes, meaning that the data for this tag is first in the store. Finally, the last four bytes (0000 0001) show that the data count is 1, which is the only legal value for data of type STRING.
To find the data, we need to take the offset from the start of the first index entry in the header (160), and add in the count of index entries (21) multiplied by the size of an index entry (10). Doing the math (all the values shown are in hex, remember!), we arrive at the offset to the store, hex 370. Since the offset for this particular index entry is zero, the data should start at offset 370:
|
Since the data type for this entry is a null-terminated string, we need to keep reading bytes until we reach a byte whose numeric value is zero. We find the bytes 72, 70, 6d, and 00, a null. Looking at the ASCII display on the right, we find that the bytes form the string rpm, which is the name of this package.
Now for a slightly more complicated example. Let's look at the following index entry:
|
Tag 403 (in hex, like the other values here) means that this entry is a list of filenames. The data type 8, or STRING_ARRAY, seems to bear this out. From the previous example, we found that the data area for the header began at offset 370. Adding the offset to the first filename (199) gives us 509. Finally, the count of 18 hex means that there should be 24 null-terminated strings containing filenames:
|
The byte at offset 509 is 2f, a "/". Reading up to the first null byte, we find that the first filename is /bin/rpm, followed by /etc/rpmrc. This continues on for 22 more filenames.
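The hex arithmetic used to locate both data items boils down to one small formula, sketched here with a hypothetical helper name:

```python
INDEX_ENTRY_SIZE = 16   # hex 10


def item_file_offset(first_index_offset, entry_count, item_offset):
    """File offset of a data item, given where the index starts and its offset in the store."""
    store_start = first_index_offset + entry_count * INDEX_ENTRY_SIZE
    return store_start + item_offset


# The header of rpm-2.2.1-1.i386.rpm: index at hex 160, hex 21 entries.
name_at = item_file_offset(0x160, 0x21, 0x000)
filenames_at = item_file_offset(0x160, 0x21, 0x199)
filename_count = 0x18
```

The same helper works for the signature, whose index starts at hex 70 and holds three entries.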
There are many more tags that we could decode, but they are all done in the same manner.
The following list shows the tags available, along with their defined values, for use in the header. This list is current as of version 4.3 of RPM. For the most up-to-date version, look in the file lib/rpmlib.h in the latest version of the RPM sources.
|
Following the header section is the archive. The archive holds the actual files that comprise the package. The archive is compressed using GNU zip. We can verify this if we look at the start of the archive:
|
In this example, the archive starts at offset d43. According to the contents of /usr/lib/magic, the first two bytes of a gzipped file should be 1f8b, which is, in fact, what we see. The following byte (08) is the value used by GNU zip to indicate the file has been compressed with gzip's "deflation" method. The byte at offset 8 from the start of the archive has a value of 02, which means that the archive has been compressed using gzip's maximum compression setting. The following byte contains a code indicating the operating system under which the archive was compressed. A 03 in this byte indicates that the compression ran under a UNIX-like operating system.
The remainder of the RPM package file is the compressed archive. After the archive is uncompressed, it is an ordinary cpio archive in SVR4 format with a CRC checksum.
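The fixed part of the gzip header is ten bytes long and can be picked apart as sketched below. The field names follow the gzip format specification (RFC 1952) rather than anything in RPM, and the sample bytes are synthetic: they reproduce the values discussed above, with the flags and timestamp, which this appendix does not show, set to zero. Note that gzip stores its multi-byte fields in little-endian order, unlike the rest of the package file.

```python
import struct

GZIP_MAGIC = bytes.fromhex("1f8b")


def parse_gzip_header(data, offset=0):
    """Decode the fixed ten-byte gzip header found at the start of the archive."""
    magic, method, flags, mtime, extra_flags, os_code = struct.unpack_from(
        "<2sBBIBB", data, offset)
    if magic != GZIP_MAGIC:
        raise ValueError("archive does not start with the gzip magic number")
    return {
        "method": method,            # 8 = deflate
        "flags": flags,
        "mtime": mtime,
        "extra_flags": extra_flags,  # 2 = maximum compression
        "os": os_code,               # 3 = UNIX
    }


sample_archive_start = bytes.fromhex("1f8b" "08" "00" "00000000" "02" "03")
gzip_header = parse_gzip_header(sample_archive_start)
```

With a real package, the offset argument would be the start of the archive (hex d43 in this example).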
[1] Please refer to the Section called Identifying RPM files with the file(1) command for a discussion on identifying RPM package files with the file command.
[2] The header is discussed in the Section called The Header.
[3] It should be noted that the architecture used internally by RPM is actually stored in the header. This value is strictly for file(1)'s use.