Commit ced1d64

HBASE-27409 Fix the javadoc for WARCRecord (#4814)
Signed-off-by: Andrew Purtell <apurtell@apache.org>
1 parent 63cdd02 commit ced1d64

1 file changed, 80 additions and 58 deletions
  • hbase-it/src/test/java/org/apache/hadoop/hbase/test/util/warc

hbase-it/src/test/java/org/apache/hadoop/hbase/test/util/warc/WARCRecord.java

Lines changed: 80 additions & 58 deletions
@@ -49,15 +49,20 @@
 
 /**
  * Immutable implementation of a record in a WARC file. You create a {@link WARCRecord} by parsing
- * it out of a {@link DataInput} stream. The file format is documented in the [ISO
- * Standard](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). In a nutshell, it's
- * a textual format consisting of lines delimited by `\r\n`. Each record has the following
- * structure: 1. A line indicating the WARC version number, such as `WARC/1.0`. 2. Several header
- * lines (in key-value format, similar to HTTP or email headers), giving information about the
- * record. The header is terminated by an empty line. 3. A body consisting of raw bytes (the number
- * of bytes is indicated in one of the headers). 4. A final separator of `\r\n\r\n` before the next
- * record starts. There are various different types of records, as documented on
- * {@link Header#getRecordType()}.
+ * it out of a {@link DataInput} stream.
+ * <p/>
+ * The file format is documented in the
+ * <a href="http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf">ISO Standard</a>. In
+ * a nutshell, it's a textual format consisting of lines delimited by `\r\n`. Each record has the
+ * following structure:
+ * <ol>
+ * <li>A line indicating the WARC version number, such as `WARC/1.0`.</li>
+ * <li>Several header lines (in key-value format, similar to HTTP or email headers), giving
+ * information about the record. The header is terminated by an empty line.
+ * <li>A body consisting of raw bytes (the number of bytes is indicated in one of the headers).
+ * <li>A final separator of `\r\n\r\n` before the next record starts.
+ * </ol>
+ * There are various different types of records, as documented on {@link Header#getRecordType()}.
  */
 public class WARCRecord {
 
@@ -176,9 +181,11 @@ public String toString() {
 /**
  * Contains the parsed headers of a {@link WARCRecord}. Each record contains a number of headers
  * in key-value format, where some header keys are standardised, but nonstandard ones can be
- * added. The documentation of the methods in this class is excerpted from the [WARC 1.0
- * specification](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). Please see
- * the specification for more detail.
+ * added.
+ * <p/>
+ * The documentation of the methods in this class is excerpted from the
+ * <a href="http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf">WARC 1.0
+ * specification</a>. Please see the specification for more detail.
  */
 public final static class Header {
 private final Map<String, String> fields;
@@ -190,56 +197,69 @@ private Header(Map<String, String> fields) {
 /**
  * Returns the type of WARC record (the value of the `WARC-Type` header field). WARC 1.0 defines
  * the following record types: (for full definitions, see the
- * [spec](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf)) * `warcinfo`:
- * Describes the records that follow it, up through end of file, end of input, or until next
- * `warcinfo` record. Typically, this appears once and at the beginning of a WARC file. For a
- * web archive, it often contains information about the web crawl which generated the following
- * records. The format of this descriptive record block may vary, though the use of the
- * `"application/warc-fields"` content-type is recommended. (...) * `response`: The record
- * should contain a complete scheme-specific response, including network protocol information
- * where possible. For a target-URI of the `http` or `https` schemes, a `response` record block
- * should contain the full HTTP response received over the network, including headers. That is,
- * it contains the 'Response' message defined by section 6 of HTTP/1.1 (RFC2616). The WARC
- * record's Content-Type field should contain the value defined by HTTP/1.1,
+ * <a href="http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf">spec</a>.
+ * <ul>
+ * <li>`warcinfo`: Describes the records that follow it, up through end of file, end of input,
+ * or until next `warcinfo` record. Typically, this appears once and at the beginning of a WARC
+ * file. For a web archive, it often contains information about the web crawl which generated
+ * the following records.
+ * <p/>
+ * The format of this descriptive record block may vary, though the use of the
+ * `"application/warc-fields"` content-type is recommended. (...)</li>
+ * <li>`response`: The record should contain a complete scheme-specific response, including
+ * network protocol information where possible. For a target-URI of the `http` or `https`
+ * schemes, a `response` record block should contain the full HTTP response received over the
+ * network, including headers. That is, it contains the 'Response' message defined by section 6
+ * of HTTP/1.1 (RFC2616).
+ * <p/>
+ * The WARC record's Content-Type field should contain the value defined by HTTP/1.1,
  * `"application/http;msgtype=response"`. The payload of the record is defined as its
- * 'entity-body' (per RFC2616), with any transfer-encoding removed. * `resource`: The record
- * contains a resource, without full protocol response information. For example: a file directly
- * retrieved from a locally accessible repository or the result of a networked retrieval where
- * the protocol information has been discarded. For a target-URI of the `http` or `https`
- * schemes, a `resource` record block shall contain the returned 'entity-body' (per RFC2616,
- * with any transfer-encodings removed), possibly truncated. * `request`: The record holds the
- * details of a complete scheme-specific request, including network protocol information where
- * possible. For a target-URI of the `http` or `https` schemes, a `request` record block should
- * contain the full HTTP request sent over the network, including headers. That is, it contains
- * the 'Request' message defined by section 5 of HTTP/1.1 (RFC2616). The WARC record's
- * Content-Type field should contain the value defined by HTTP/1.1,
+ * 'entity-body' (per RFC2616), with any transfer-encoding removed.</li>
+ * <li>`resource`: The record contains a resource, without full protocol response information.
+ * For example: a file directly retrieved from a locally accessible repository or the result of
+ * a networked retrieval where the protocol information has been discarded. For a target-URI of
+ * the `http` or `https` schemes, a `resource` record block shall contain the returned
+ * 'entity-body' (per RFC2616, with any transfer-encodings removed), possibly truncated.</li>
+ * <li>`request`: The record holds the details of a complete scheme-specific request, including
+ * network protocol information where possible. For a target-URI of the `http` or `https`
+ * schemes, a `request` record block should contain the full HTTP request sent over the network,
+ * including headers. That is, it contains the 'Request' message defined by section 5 of
+ * HTTP/1.1 (RFC2616).
+ * <p/>
+ * The WARC record's Content-Type field should contain the value defined by HTTP/1.1,
  * `"application/http;msgtype=request"`. The payload of a `request` record with a target-URI of
  * scheme `http` or `https` is defined as its 'entity-body' (per RFC2616), with any
- * transfer-encoding removed. * `metadata`: The record contains content created in order to
- * further describe, explain, or accompany a harvested resource, in ways not covered by other
- * record types. A `metadata` record will almost always refer to another record of another type,
- * with that other record holding original harvested or transformed content. The format of the
- * metadata record block may vary. The `"application/warc-fields"` format may be used. *
- * `revisit`: The record describes the revisitation of content already archived, and might
+ * transfer-encoding removed.</li>
+ * <li>`metadata`: The record contains content created in order to further describe, explain, or
+ * accompany a harvested resource, in ways not covered by other record types. A `metadata`
+ * record will almost always refer to another record of another type, with that other record
+ * holding original harvested or transformed content.
+ * <p/>
+ * The format of the metadata record block may vary. The `"application/warc-fields"` format may
+ * be used.</li>
+ * <li>`revisit`: The record describes the revisitation of content already archived, and might
  * include only an abbreviated content body which has to be interpreted relative to a previous
  * record. Most typically, a `revisit` record is used instead of a `response` or `resource`
  * record to indicate that the content visited was either a complete or substantial duplicate of
- * material previously archived. A `revisit` record shall contain a WARC-Profile field which
- * determines the interpretation of the record's fields and record block. Please see the
- * specification for details. * `conversion`: The record shall contain an alternative version of
- * another record's content that was created as the result of an archival process. Typically,
- * this is used to hold content transformations that maintain viability of content after widely
- * available rendering tools for the originally stored format disappear. As needed, the original
- * content may be migrated (transformed) to a more viable format in order to keep the
- * information usable with current tools while minimizing loss of information. * `continuation`:
- * Record blocks from `continuation` records must be appended to corresponding prior record
- * blocks (eg. from other WARC files) to create the logically complete full-sized original
- * record. That is, `continuation` records are used when a record that would otherwise cause a
- * WARC file size to exceed a desired limit is broken into segments. A continuation record shall
- * contain the named fields `WARC-Segment-Origin-ID` and `WARC-Segment-Number`, and the last
- * `continuation` record of a series shall contain a `WARC-Segment-Total-Length` field. Please
- * see the specification for details. * Other record types may be added in future, so this list
- * is not exclusive.
+ * material previously archived.
+ * <p/>
+ * A `revisit` record shall contain a WARC-Profile field which determines the interpretation of
+ * the record's fields and record block. Please see the specification for details.</li>
+ * <li>`conversion`: The record shall contain an alternative version of another record's content
+ * that was created as the result of an archival process. Typically, this is used to hold
+ * content transformations that maintain viability of content after widely available rendering
+ * tools for the originally stored format disappear. As needed, the original content may be
+ * migrated (transformed) to a more viable format in order to keep the information usable with
+ * current tools while minimizing loss of information.</li>
+ * <li>`continuation`: Record blocks from `continuation` records must be appended to
+ * corresponding prior record blocks (eg. from other WARC files) to create the logically
+ * complete full-sized original record. That is, `continuation` records are used when a record
+ * that would otherwise cause a WARC file size to exceed a desired limit is broken into
+ * segments. A continuation record shall contain the named fields `WARC-Segment-Origin-ID` and
+ * `WARC-Segment-Number`, and the last `continuation` record of a series shall contain a
+ * `WARC-Segment-Total-Length` field. Please see the specification for details.</li>
+ * <li>Other record types may be added in future, so this list is not exclusive.</li>
+ * </ul>
  * @return The record's `WARC-Type` header field, as a string.
  */
 public String getRecordType() {
@@ -272,8 +292,10 @@ public String getRecordID() {
  * The MIME type (RFC2045) of the information contained in the record's block. For example, in
  * HTTP request and response records, this would be `application/http` as per section 19.1 of
  * RFC2616 (or `application/http; msgtype=request` and `application/http; msgtype=response`
- * respectively). In particular, the content-type is *not* the value of the HTTP Content-Type
- * header in an HTTP response, but a MIME type to describe the full archived HTTP message (hence
+ * respectively).
+ * <p/>
+ * In particular, the content-type is *not* the value of the HTTP Content-Type header in an HTTP
+ * response, but a MIME type to describe the full archived HTTP message (hence
  * `application/http` if the block contains request or response headers).
  * @return The record's `Content-Type` header field, as a string.
  */
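
For readers unfamiliar with the format, the four-part record layout described in the class Javadoc (version line, header lines ended by an empty line, a body whose length comes from a header, and a trailing `\r\n\r\n`) can be sketched roughly as follows. This is a minimal illustration and not the `WARCRecord` code touched by this commit: the class `WarcSketch`, its `Parsed` holder, and the helper `readLine` are hypothetical names. It assumes the body length is carried in the `Content-Length` header, matches header names exactly (the real format treats them case-insensitively), and ignores folded header lines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the WARCRecord implementation from this commit.
public class WarcSketch {

  /** Holder for the three parsed parts of one record. */
  static final class Parsed {
    final String version;
    final Map<String, String> headers;
    final byte[] body;

    Parsed(String version, Map<String, String> headers, byte[] body) {
      this.version = version;
      this.headers = headers;
      this.body = body;
    }
  }

  /** Reads bytes up to '\n' and drops a trailing '\r', so "\r\n" ends a line. */
  static String readLine(DataInput in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte b;
    while ((b = in.readByte()) != '\n') {
      buf.write(b);
    }
    String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
    return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
  }

  static Parsed parse(DataInput in) throws IOException {
    // 1. Version line, such as "WARC/1.0".
    String version = readLine(in);
    if (!version.startsWith("WARC/")) {
      throw new IOException("Not a WARC version line: " + version);
    }
    // 2. Key-value header lines, terminated by an empty line.
    Map<String, String> headers = new LinkedHashMap<>();
    String line;
    while (!(line = readLine(in)).isEmpty()) {
      int colon = line.indexOf(':');
      if (colon < 0) {
        throw new IOException("Malformed header line: " + line);
      }
      headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
    }
    // 3. Raw body; assumes Content-Length gives the byte count.
    String length = headers.get("Content-Length");
    if (length == null) {
      throw new IOException("Missing Content-Length header");
    }
    byte[] body = new byte[Integer.parseInt(length)];
    in.readFully(body);
    // 4. Final "\r\n\r\n" separator before the next record.
    byte[] separator = new byte[4];
    in.readFully(separator);
    if (!Arrays.equals(separator, "\r\n\r\n".getBytes(StandardCharsets.US_ASCII))) {
      throw new IOException("Missing record separator");
    }
    return new Parsed(version, headers, body);
  }

  public static void main(String[] args) throws IOException {
    byte[] raw = "WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
        .getBytes(StandardCharsets.UTF_8);
    Parsed p = parse(new DataInputStream(new ByteArrayInputStream(raw)));
    System.out.println(p.version + " " + p.headers.get("WARC-Type") + " " + p.body.length);
  }
}
```

Reading the body with `readFully` rather than line by line matters because the body is raw bytes and may itself contain `\r\n` sequences.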
