JSalParser

The Java Server Access Logs Parser (i.e. JSalParser) parses Extended Log Format files generated by Apache HTTP Server, AWS S3, or AWS CloudFront (to name a few) into Java POJOs (Plain Old Java Objects).

No bags or maps or hashes or arrays of attributes here. Dates are converted to JODA DateTime objects. Numbers are, well, numbers. You could say this library deserializes the Extended Log Format, but I think that may be too generous for a simple parser.

In short, give JSalParser a log file as input and you get back Java objects with as many members filled in as possible. The rest is up to you.

Synopsis

Parse an S3 log line-by-line

	String content = "1f000000000c6c88eb9dd89c000000000b35b0000000a5 www.example.com [27/Aug/2014:20:20:05 +0000] 192.168.0.1 - BFE596E2F4D94C8F WEBSITE.GET.OBJECT media/example.jpg \"GET /media/example.jpg HTTP/1.1\" 304 - - 27553 202 - \"http://www.example.com/page.html\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53\" -";
	List<S3LogEntry> entries = JSalParser.parseS3Log(content);

	final long TEN_MEGABYTES = 10000000L;

	for (S3LogEntry entry : entries) {
		// Notice how the numbers are numbers; no additional parsing needed
		if (entry.getObjectSize() > TEN_MEGABYTES) {
			System.out.println(entry.getTime());

			// getTime() returns a JODA DateTime object,
			// so Java prints:
			// 2014-08-27T20:20:05.000+00:00
		}
	}

Parse a CloudFront log gzip file using a visitor for streaming efficiency

	// Assume gzipFileStream is some kind of java.io.InputStream.
	// You got it either from a FileInputStream (local file on disk), S3, or anywhere else that returns InputStreams.

	java.util.zip.GZIPInputStream gzipInputStream = new java.util.zip.GZIPInputStream(gzipFileStream);

	// Process records inline by passing a visitor to effectively get "streaming" log processing.
	// The only two things you need are an InputStream and a visitor.
	// JSalParser is thread-safe.
	JSalParser.parseCloudFrontLog(gzipInputStream, new ICloudFrontLogVisitor() {
		int count = 0;

		@Override
		public void accept(CloudFrontWebLogEntry entry) {
			// The date is returned as a JODA DateTime object
			System.out.print("Processing entry #" + (count++) + " from " + entry.getDateTime() + " ");

			// Numbers are surfaced as ints and longs
			if (entry.getServerToClientStatus() == 200) {
				System.out.println("OK");
			} else {
				System.out.println("NOT_OK");
			}

			// You will get:
			/***********
			 Processing entry #0 from 2014-08-28T04:48:38.000Z OK
			 Processing entry #1 from 2014-08-28T04:48:38.000Z OK
			 Processing entry #2 from 2014-08-28T04:49:23.000Z NOT_OK
			 Processing entry #3 from 2014-08-28T04:48:37.000Z OK
			 Processing entry #4 from 2014-08-28T04:48:38.000Z NOT_OK
			 Processing entry #5 from 2014-08-28T04:48:38.000Z OK
			 ***********/
		}
	});

Both the S3 and CloudFront parsers accept either Strings or InputStream objects, and both support the visitor pattern for streaming efficiency.

Most Common Scenarios

These are the most common scenarios for working with S3 or CloudFront log files.

How do I read logs directly from AWS S3?

The Java AWS SDK ( http://aws.amazon.com/sdk-for-java/ ) is the best way to read and write objects on S3. The quick-and-dirty way is to construct an AmazonS3 instance and call getObject().

Some pseudo-code:

import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

public void quickAndDirty() {

	AmazonS3 s3 = null; // ... constructing the client is left as an exercise

	S3Object obj = s3.getObject("my-bucket-name", "/the-log-file-key-name");
	S3ObjectInputStream content = obj.getObjectContent();
	// S3ObjectInputStream is a subclass of InputStream

	List<S3LogEntry> entries = JSalParser.parseS3Log(content);
	// ... continue on like normal
}

Read directly from a gzip file:

Java has a built-in GZIPInputStream class ( java.util.zip.GZIPInputStream ) that is a subclass of InputStream.

If the files are local, pass in a FileInputStream instance:

import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import java.util.zip.GZIPInputStream;

File gzipFile = new File("path-to-your-local-file.gz");
FileInputStream gzipFileStream = new FileInputStream(gzipFile);
GZIPInputStream gzipInputStream = new GZIPInputStream(gzipFileStream);

List<CloudFrontWebLogEntry> entries = JSalParser.parseCloudFrontLog(gzipInputStream);
// ... continue on like normal

Or, combine with the AWS SDK to read gzip files directly from S3:

import java.util.List;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

AmazonS3 s3 = null; // ... constructing the client is left as an exercise

S3Object obj = s3.getObject("my-bucket-name", "/the-log-file-key-name.gz");
S3ObjectInputStream s3Stream = obj.getObjectContent();
GZIPInputStream gzipInputStream = new GZIPInputStream(s3Stream);

List<CloudFrontWebLogEntry> entries = JSalParser.parseCloudFrontLog(gzipInputStream);
// ... continue on like normal

How do I process log files as a stream?

The parse methods take an optional second parameter: an ICloudFrontLogVisitor or an IS3LogVisitor instance. As the parser finishes reading each entry, it calls the visitor's accept method. Your code is free to implement whatever business logic it needs.

Using a visitor means the parser only has to hold a couple of log lines in memory at a time, so in principle you can process extremely large files.
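
As a minimal sketch, here is the S3 variant. It assumes logInputStream is an InputStream over an S3 log, that IS3LogVisitor mirrors the ICloudFrontLogVisitor shown in the Synopsis with a single accept(S3LogEntry) method, and that getHttpStatus() is the accessor for the HTTP status field:

// A sketch of streaming an S3 log with a visitor.
// Assumes IS3LogVisitor declares accept(S3LogEntry) and that
// getHttpStatus() is the status accessor; both are assumptions.
JSalParser.parseS3Log(logInputStream, new IS3LogVisitor() {
	@Override
	public void accept(S3LogEntry entry) {
		// Only a couple of lines are buffered at a time,
		// so arbitrarily large files are fine
		Integer status = entry.getHttpStatus(); // assumed accessor name
		if (status != null && status >= 500) {
			System.out.println("Server error at " + entry.getTime());
		}
	}
});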

What considerations should I know about CloudFront log files?

CloudFront log files begin with two header lines, and these header lines describe the "schema" of the log file. S3 log files, in contrast, have no header lines; their schema is fixed and defined in documentation.

Therefore, in order to process a CloudFront log file, the parser must first process the header lines. If you send it a CloudFront log file without headers, you will get back only a list of untyped values.
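
For reference, a CloudFront web-distribution log starts roughly like this (abbreviated; the exact field list varies with the log-format version):

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query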

How do I use this with Apache HTTP Server log files?

Send Apache HTTP Server log files to the JSalParser.parseS3Log* family of methods.
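
For example, a minimal sketch (the log path is hypothetical):

import java.io.FileInputStream;
import java.util.List;

// A sketch: parse a local Apache access log with the S3 parser (path is hypothetical)
FileInputStream apacheLog = new FileInputStream("/var/log/apache2/access.log");
List<S3LogEntry> entries = JSalParser.parseS3Log(apacheLog);
// Values the parser doesn't expect end up in the "extras" list (see the next question)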

What happens if the parser encounters something it doesn't recognize?

The parser puts "extra" stuff (values it doesn't expect) into the entry's "extras" list. Order is preserved, but the values are kept as plain Strings.
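
A sketch of reading them back, assuming the accessor is named getExtras() and returns the values in order:

// getExtras() is an assumed accessor name; values come back in order, as Strings
for (String extra : entry.getExtras()) {
	System.out.println("Unrecognized value: " + extra);
}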

What happens if a value is missing?

The corresponding POJO member is null or 0, as appropriate. I usually program in Scala and am much more used to Options, but alas, this is the best choice for Java 7.
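
So guard before dereferencing; for example (getReferrer() is an assumed accessor name):

// Missing object fields come back as null, missing numbers as 0
String referrer = entry.getReferrer(); // assumed accessor name
if (referrer != null) {
	System.out.println("Referred by " + referrer);
}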

What is JODA DateTime?

If you aren't familiar with JODA DateTime, the short answer is: the built-in Java Date class is "broken", and JODA has the best solution. More information and eloquence can be found at http://www.joda.org/joda-time/

What version of Java?

Java 7. No real reason; it's just the lowest JDK I have available to test against.

The only "advanced" language feature used is generics.

Maven

In Progress

Motivation

I had reason to traverse the logs generated by AWS, and I wanted to write the business logic in my favorite JVM language (Scala). Behind the scenes Hiram Software is working on popular content sites, and it's important to us to know who is viewing our content in near realtime. I searched the internet, and I could not find any parsers for the Extended Log Format written in Java. The format is nearly 20 years old, and Google couldn't find anything. Why is this? I have three theories:

  • There exists such a parser, but I am unable to find it (perhaps because SLF4J and its kin dominate search results for the Java and "logging" keywords?).
  • The parsers that do exist are not open source (either written by an engineer for a corporation or as part of a log-parsing product).
  • Most people use regular expressions.

Unfortunately, I don't know how to write a regular expression that parses a server log file generically. Based on the number of StackOverflow questions (exhibit 1, exhibit 2, exhibit 3), I am not alone.

The core problem with the S3 log files is that the delimiter " " (space) can also appear inside any value that is a Quoted String. In the most general sense, any field may be a - (empty value), an unquoted value like http://www.google.com, or a quoted value like "There may be delimiters in this string". In practice, people who use regular expressions seem to assume which fields will be quoted. Simplifying the problem with assumptions is not bad, but it trades robustness for ease of coding: in theory, the first change to the S3 server access log format will break a lot of code. What's more, regular expressions only spit out captured Strings, which then have to be parsed into types.

The CloudFront logs improve upon S3 logs by using tabs ("\t") as the delimiter. With that small change, it becomes practical to split values on a single delimiter. The complication comes, instead, from the new header rows: the files "[c]ontain two header lines: one with the file-format version, and another that lists the W3C fields included in each record." Great for people; bad for machines, since the order of the fields may change from file to file. Furthermore, any time you get a file from CloudFront you have to decide whether it is in the "Web Distribution" or the "RTMP Distribution" file format. There are no explicit tags in the log to indicate one or the other -- you have to parse the file to figure it out.

What about alternatives to writing my own? I'm a buy-over-build kind of guy.

I abandoned regular expressions from the outset. I found The Buzz Media's Amazon CloudFront Log Parser to be a credible alternative: it appeared to handle CloudFront log files, but I could not use it because it did not support S3 logs and its Maven repository was broken.

I looked into using a CSV parser, but the Apache CSV parser required a header row, and the S3 log does not have such a row. There may be other CSV parsers that would have worked, but by now I was tired and felt like I was not making progress.

I fell back to what I knew: ANTLR.

Solution

JSalParser exposes a class whose static methods accept content (either a String or InputStream) and return Lists of POJOs (Plain Old Java Objects, i.e. typed bags) representing each log entry. Alternatively, you may "stream" the log files by providing a visitor that "accepts" each fully-parsed log entry.

Under the covers, an ANTLR v4 grammar builds up the POJOs. If you are unfamiliar with ANTLR, it is an open source parser generator often compared to yacc or lex. Inside src/main/antlr are .g4 files that ANTLR compiles into Java code; this generated code handles tokenization and builds the log entries. All of the state-management work that The Buzz Media team had to hand-write, ANTLR does for us robustly.

Why static methods? All of the parsing state is self-contained in the objects ANTLR provides, so as long as we instantiate new objects for each String or InputStream, the methods are thread-safe. So they are static; it feels simpler to me.
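
As a sketch of what that buys you, two threads can parse different content at the same time with no shared setup (readFile is a hypothetical helper that slurps a file into a String):

// A sketch: concurrent calls are safe because each call builds fresh parser state
final String logA = readFile("a.log"); // hypothetical helper
final String logB = readFile("b.log");
new Thread(new Runnable() {
	@Override public void run() { JSalParser.parseS3Log(logA); }
}).start();
new Thread(new Runnable() {
	@Override public void run() { JSalParser.parseS3Log(logB); }
}).start();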

Related Documentation

Benchmarks

In general, ANTLR has a reputation for speed and throughput. If you find the library too slow for your needs, open an issue describing the specific use case or pattern that creates the bottleneck, and we'll work together to solve it.

Contributing

I'd love to hear that this has helped you. Since the project is hosted on GitHub, I hope it goes without saying that I welcome issues, pull requests, and forks.

License

BSD
