A simple Common Lisp implementation of a recursive descent parser for CSV (comma separated values) formatted files.
We call this implementation basic because it is based on a simplified version of the original context-free grammar for CSV files, which RFC 4180 formalizes in ABNF (Augmented Backus-Naur Form). For reference, the original grammar from RFC 4180:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
And the simplified grammar we designed:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field) / enclosed_field *(COMMA enclosed_field)
name = field
enclosed_field = DQUOTE *(TEXTDATA / COMMA) DQUOTE
field = *(TEXTDATA)
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
The same simplified grammar, rewritten so that regular-expression repetition is simulated by right-recursive production rules:
file = header records
header = names
names = name COMMA names / name CRLF
name = field / enclosed_field
records = CRLF record records / CRLF record CRLF
record = fields / enclosed_fields
fields = field COMMA fields
enclosed_fields = enclosed_field COMMA enclosed_fields
field = word
enclosed_field = DQUOTE word DQUOTE
word = TEXTDATA word / COMMA word
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Although our context-free grammar is quite similar to the original, it differs in one detail: the handling of double quotes (%x22). RFC 4180 states:
"5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields"
"6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
"7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Our grammar deals with none of those ambiguities; it just assumes that:
- There may be an optional header line appearing as the first line of the file, with the same format as normal record lines. This header contains name(s) corresponding to the field(s) in the file.
- Each record is located on a separate line, delimited by a line break (CRLF).
- The last record in the file may or may not have an ending line break.
- Each record consists of a number of field(s) that should be equal to the number of name(s) in the header.
- Each field consists of a sequence of any ASCII character but the double quote, enclosed between double quotes.
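The assumptions above translate fairly directly into recursive descent code. The following is only a minimal sketch, not the parser shipped with this project: the function names `parse-enclosed-field` and `parse-record` are hypothetical, the input is assumed to be a single line already stripped of its CRLF, and the sketch does not check that characters fall within the TEXTDATA range.

```lisp
;; Hypothetical helpers for illustration; names are not taken from the project.

(defun parse-enclosed-field (line start)
  "Parse one field of the form DQUOTE ... DQUOTE from LINE, beginning at
index START. Return the field text and the index just past the closing
double quote."
  (unless (and (< start (length line))
               (char= (char line start) #\"))
    (error "Expected opening double quote at position ~D" start))
  ;; The field ends at the next double quote, since quotes cannot
  ;; appear inside a field in this dialect.
  (let ((end (position #\" line :start (1+ start))))
    (unless end
      (error "Unterminated field starting at position ~D" start))
    (values (subseq line (1+ start) end) (1+ end))))

(defun parse-record (line)
  "Parse LINE as enclosed_field *(COMMA enclosed_field).
Return the list of field strings."
  (let ((fields '())
        (pos 0))
    (loop
      (multiple-value-bind (field next) (parse-enclosed-field line pos)
        (push field fields)
        (cond ((= next (length line))          ; end of the record
               (return (nreverse fields)))
              ((char= (char line next) #\,)    ; another field follows
               (setf pos (1+ next)))
              (t
               (error "Expected comma at position ~D" next)))))))

;; Usage:
;; (parse-record "\"a\",\"b,c\",\"\"")  ; => ("a" "b,c" "")
```

Checking that each record has as many fields as the header has names could then be done by comparing the lengths of the lists returned for the header line and for each record line.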
Imposing these limitations on the original grammar, and thus on the language it generates, greatly reduces the complexity of our parser while preserving its usability in real-world applications: this CSV "dialect" is quite common, and the grammar can be slightly modified to allow both double-quote-enclosed fields and non-enclosed fields.
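As an illustration of how small that modification could be, here is a hedged sketch of a field parser that accepts either form by looking at the first character. The name `parse-field` is hypothetical and not part of this project. As in the simplified grammar, escaped double quotes (`""`) and line breaks inside fields are not supported.

```lisp
;; Hypothetical sketch; PARSE-FIELD is not a function of this project.

(defun parse-field (line start)
  "Parse one field from LINE at index START, either enclosed in double
quotes or bare. Return the field text and the index just past the field."
  (if (and (< start (length line))
           (char= (char line start) #\"))
      ;; Enclosed field: read up to the closing double quote.
      (let ((end (position #\" line :start (1+ start))))
        (unless end
          (error "Unterminated field starting at position ~D" start))
        (values (subseq line (1+ start) end) (1+ end)))
      ;; Bare field: read up to the next comma or the end of the line.
      (let ((end (or (position #\, line :start start) (length line))))
        ;; Double quotes may not appear inside a non-enclosed field.
        (when (find #\" line :start start :end end)
          (error "Double quote inside bare field at position ~D" start))
        (values (subseq line start end) end))))

;; Usage:
;; (parse-field "abc,\"d,e\"" 0)  ; => "abc", 3
;; (parse-field "abc,\"d,e\"" 4)  ; => "d,e", 9
```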