5 changes: 4 additions & 1 deletion .gitignore
@@ -7,4 +7,7 @@ target
.lein-failures
.lein-deps-sum
.lein-repl-history
benchmarks/data
benchmarks/data
*.csv

\.nrepl-port
35 changes: 30 additions & 5 deletions README.md
@@ -26,6 +26,12 @@ The API has changed in the 2.0 series; see below for details.
Recent Updates
--------------

* Updated to `2.0.3-SNAPSHOT`, with
1. Optional recognition of numbers in data;
2. Optional recognition of dates/times in data;
3. Optional recognition of first row as field names;
4. Option to supply field names.
* Now has support for Clojure 1.8.
* Updated library to 2.0.2, with a bug fix for malformed input by
[attil-io](https://github.com/attil-io).
* Updated library to 2.0.1, which adds the :force-quote option to write-csv.
@@ -52,7 +58,7 @@ Recent Updates
* Now has support for Clojure 1.3.
* Some speed improvements to take advantage of Clojure 1.3. Nearly twice as fast
in my tests.
* Updated library to 1.2.4.
* Updated library to 1.2.4.
* Added the char-seq multimethod, which provides a variety of implementations
for easily creating the char seqs that parse-csv uses on input from various
similar objects. Big thanks to [Slawek Gwizdowski](https://github.com/i0cus)
@@ -98,16 +104,35 @@ A character that contains the cell separator for each column in a row.
#### :end-of-line
A string containing the end-of-line character for
reading CSV files. If this setting is nil then \\n and \\r\\n are both
accepted.
accepted.
##### Default value: nil
#### :quote-char
A character that is used to begin and end a quoted cell.
##### Default value: \"
#### :strict
If this variable is true, the parser will throw an exception
on parse errors that are recoverable but not to spec or otherwise
nonsensical.
nonsensical.
##### Default value: false
#### :numbers
Optional; if truthy (neither `nil` nor `false`), fields which are numbers will
be returned as numbers, not strings.
##### Default value: nil
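
A small illustration of the difference, assuming the option behaves as described above. The sample row is made up for this example; the comparisons use `=` so they do not depend on whether rows come back as vectors or lists.

```clojure
(require '[clojure-csv.core :refer [parse-csv]])

;; Without :numbers, every cell is returned as a string.
(= [["Aldershot" "76205" "0.5"]]
   (parse-csv "Aldershot,76205,0.5"))
;; => true

;; With :numbers, cells that read as numbers are converted;
;; cells that do not read as numbers are left untouched.
(= [["Aldershot" 76205 0.5]]
   (parse-csv "Aldershot,76205,0.5" :numbers true))
;; => true
```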
#### :date-format
Optional; if a valid value as specified below, fields which are dates/times
will be returned as `org.joda.time.DateTime` objects, not strings.

A valid value is one of:
1. A string in the pattern syntax understood by `clj-time.format/formatter`;
2. A keyword naming one of the built-in formatters in `clj-time.format/formatters`; or
3. A custom formatter as constructed by `clj-time.format/formatter`.
##### Default value: nil
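
The three kinds of value can be sketched with `clj-time` directly. This only shows what each form resolves to and how it parses a date; the var names (`from-pattern`, `from-keyword`, `custom`) are illustrative and not part of the library.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; 1. A pattern string, turned into a formatter.
(def from-pattern (f/formatter "yyyy-MM-dd"))

;; 2. A keyword looked up in the map of built-in formatters.
(def from-keyword (f/formatters :date))

;; 3. A custom formatter, here one that accepts either of two patterns.
(def custom (f/formatter t/utc "yyyy-MM-dd" "yyyy/MM/dd"))

(t/year  (f/parse from-pattern "2019-03-23")) ;; => 2019
(t/month (f/parse from-keyword "2019-03-23")) ;; => 3
(t/day   (f/parse custom "2019/03/23"))       ;; => 23
```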
#### :field-names
Optional;
1. if `true`, the first row of the input will be treated as field names (and
read as keywords);
2. if a list or vector, the value will be used as field names.
In either case, rows will be returned as `map`s, not `list`s.
##### Default value: nil
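
A short sketch of both forms, assuming the option behaves as described above. The sample data is made up for this example.

```clojure
(require '[clojure-csv.core :refer [parse-csv]])

;; 1. Field names taken from the first row, read as keywords.
(= [{:name "Fred" :age "42"}]
   (parse-csv "name,age\nFred,42" :field-names true))
;; => true

;; 2. Field names supplied by the caller; every row, including the
;;    first, is then treated as data.
(= [{:a "Fred" :b "42"}]
   (parse-csv "Fred,42" :field-names [:a :b]))
;; => true
```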

### write-csv
Takes a sequence of sequences of strings, basically a table of strings,
@@ -116,10 +141,10 @@ call this function repeatedly row-by-row and concatenate the results yourself.

Takes the following keyword arguments to change the written file:
#### :delimiter
A character that contains the cell separator for each column in a row.
A character that contains the cell separator for each column in a row.
##### Default value: \\,
#### :end-of-line
A string containing the end-of-line character for writing CSV files.
A string containing the end-of-line character for writing CSV files.
##### Default value: \\n
#### :quote-char
A character that is used to begin and end a quoted cell.
6 changes: 4 additions & 2 deletions project.clj
@@ -1,9 +1,11 @@
(defproject clojure-csv "2.0.3-SNAPSHOT"
:description "A simple library to read and write CSV files."
:dependencies [[org.clojure/clojure "1.3.0"]]
:dependencies [[org.clojure/clojure "1.8.0"]
[clj-time "0.15.0"]]
:plugins [[perforate "0.3.2"]]
:jvm-opts ["-Xmx1g"]
:profiles {:current {:source-paths ["src/"]}
:clj1.8 {:dependencies [[org.clojure/clojure "1.8.0"]]}
:clj1.4 {:dependencies [[org.clojure/clojure "1.4.0-beta5"]]}
:clj1.3 {:dependencies [[org.clojure/clojure "1.3.0"]]}
:csv1.3 {:dependencies [[clojure-csv "1.3.0"]]}
@@ -15,5 +17,5 @@
:profiles [:clj1.3 :csv1.3]
:namespaces [csv.benchmarks.core]}
{:name :current
:profiles [:clj1.4 :current]
:profiles [:clj1.8 :current]
:namespaces [csv.benchmarks.core]}]})
35 changes: 27 additions & 8 deletions src/clojure_csv/core.clj
@@ -4,7 +4,8 @@
It correctly handles common CSV edge-cases, such as embedded newlines, commas,
and quotes. The main functions are parse-csv and write-csv."}
clojure-csv.core
(:require [clojure.string :as string])
(:require [clojure.string :as string]
[clojure-csv.data-cleaning :refer [dates-as-dates numbers-as-numbers]])
(:import [java.io Reader StringReader]))


@@ -185,16 +186,24 @@ and quotes. The main functions are parse-csv and write-csv."}
(throw (Exception. (str "Unexpected character found: " look-ahead)))))))

(defn- parse-csv-with-options
([csv-reader {:keys [delimiter quote-char strict end-of-line]}]
([csv-reader {:keys [delimiter quote-char strict end-of-line date-format numbers field-names]}]
(let [fields (cond
(true? field-names)
(map keyword (parse-csv-line csv-reader delimiter quote-char
strict end-of-line))
(or (list? field-names) (vector? field-names)) field-names)]
(parse-csv-with-options csv-reader delimiter quote-char
strict end-of-line))
([csv-reader delimiter quote-char strict end-of-line]
strict end-of-line date-format numbers fields)))
([csv-reader delimiter quote-char strict end-of-line date-format numbers fields]
(lazy-seq
(when (not (== -1 (reader-peek csv-reader)))
(let [row (parse-csv-line csv-reader delimiter quote-char
strict end-of-line)]
(let [raw (parse-csv-line csv-reader delimiter quote-char
strict end-of-line)
with-numbers (if numbers (numbers-as-numbers raw) raw)
with-dates (if date-format (dates-as-dates with-numbers date-format) with-numbers)
row (if fields (apply hash-map (interleave fields with-dates)) with-dates)]
(cons row (parse-csv-with-options csv-reader delimiter quote-char
strict end-of-line)))))))
strict end-of-line date-format numbers fields)))))))

(defn parse-csv
"Takes a CSV as a string or Reader and returns a seq of the parsed CSV rows,
@@ -211,7 +220,17 @@ and quotes. The main functions are parse-csv and write-csv."}
Default value: \\\"
:strict - If this variable is true, the parser will throw an
exception on parse errors that are recoverable but
not to spec or otherwise nonsensical. Default value: false"
not to spec or otherwise nonsensical. Default value: false
:date-format - if provided, and value is a string, keyword or `clj-time`
formatter, recognise dates having the specified format and
return them as `org.joda.time.DateTime` objects. Default value:
nil
:numbers - if provided and value is non-nil, recognise numbers (integers,
floats and rationals, but TODO: not yet bignums) and return them
as numbers.
:field-names - if provided and value is true, treats the first row as
field names; if provided and value is a sequence, treats that sequence as
field names. In either case returns a list of maps, not lists."
([csv & {:as opts}]
(let [csv-reader (if (string? csv) (StringReader. csv) csv)]
(parse-csv-with-options csv-reader (merge {:strict false
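
A note on the row-to-map step in `parse-csv-with-options`: it pairs field names with cells using `interleave`, which stops at the shorter of its inputs. The sketch below uses plain Clojure only, with made-up data, to show what that implies for rows whose length differs from the number of field names. `zipmap` is shown only as an equivalent, shorter spelling.

```clojure
(def fields [:a :b :c])

;; A row with fewer cells than field names: the trailing key is simply absent.
(= {:a "1" :b "2"}
   (apply hash-map (interleave fields ["1" "2"])))
;; => true

;; A row with more cells than field names: the extra cell is dropped.
(= {:a "1" :b "2" :c "3"}
   (apply hash-map (interleave fields ["1" "2" "3" "4"])))
;; => true

;; zipmap builds the same map.
(= (zipmap fields ["1" "2"])
   (apply hash-map (interleave fields ["1" "2"])))
;; => true
```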
54 changes: 54 additions & 0 deletions src/clojure_csv/data_cleaning.clj
@@ -0,0 +1,54 @@
(ns
^{:author "Simon Brooke",
:doc "Recognise numbers as numbers, and
dates/times as dates/times"}
clojure-csv.data-cleaning
(:require [clj-time.core :as t]
[clj-time.format :as f]))

(defn number-as-number
"if `o` is the string representation of a number, return that number; else
return `o`."
[o]
(if
(string? o)
(try
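;; Bind *read-eval* to false so that `#=` forms in untrusted CSV data
;; are never evaluated; the reader throws instead, and `o` is returned.
(let [n (binding [*read-eval* false] (read-string o))]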
(if (number? n) n o))
(catch Exception e o))
o))

(defmacro numbers-as-numbers
"Return a list like the sequence `l`, but with all those elements
which are string representations of numbers replaced with numbers."
[l]
`(map number-as-number ~l))

(defn date-as-date
"If `o` is the string representation of a date or timestamp conforming to
`date-format`, return that timestamp; else return `o`. `date-format` is
expected to be one of:
1. A string in the pattern syntax understood by `clj-time.format/formatter`;
2. A keyword naming one of the built-in formatters in `clj-time.format/formatters`; or
3. A custom formatter as constructed by `clj-time.format/formatter`."
[o date-format]
(if
(string? o)
(try
(let [f (cond
(string? date-format) (f/formatter date-format)
(keyword? date-format) (f/formatters date-format)
(=
(type date-format)
org.joda.time.format.DateTimeFormatter) date-format)]
(f/parse f o))
(catch Exception e
o))
o))

(defmacro dates-as-dates
"Return a list like the sequence `l`, but with all those elements
which are string representations of dates/times conforming to
`date-format` replaced with `org.joda.time.DateTime` objects."
[l date-format]
`(map #(date-as-date % ~date-format) ~l))
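
One behaviour of `number-as-number` worth noting: `read-string` reads only the first form in a string, so a cell such as `"12 High Street"` reads as `12`. The sketch below is one possible tightening, not part of this change. `strict-number` is a hypothetical name, and it swaps the reader for regular expressions that must match the whole cell, so it handles only integers and decimals.

```clojure
(defn strict-number
  "Hypothetical stricter variant of number-as-number: converts `o` only
  when the whole string is an integer or a decimal; else returns `o`."
  [o]
  (if (string? o)
    (try
      (cond
        (re-matches #"[-+]?\d+" o)      (Long/parseLong o)
        (re-matches #"[-+]?\d+\.\d+" o) (Double/parseDouble o)
        :else o)
      ;; Digit runs too long for a long end up here and stay strings.
      (catch NumberFormatException _ o))
    o))

(read-string "12 High Street")   ;; => 12
(strict-number "12 High Street") ;; => "12 High Street"
(strict-number "76205")          ;; => 76205
(strict-number "-0.1")           ;; => -0.1
(strict-number "22/7")           ;; => "22/7" (ratios are not handled in this sketch)
```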

38 changes: 38 additions & 0 deletions test/clojure_csv/test/core.clj
@@ -129,3 +129,41 @@
:end-of-line "HELLO")))
(is (= [["a" "b\r"] ["c" "d"]] (parse-csv "a,|b\r|\rc,d"
:end-of-line "\r" :quote-char \|))))

(deftest data-cleansing
(let [data "Name;MP;Area;County;Electorate;CON;LAB;LIB;UKIP;Green;NAT;MIN;OTH
Aldershot;Leo Docherty;12;Hampshire;76205;26955;15477;3637;1796;1090;0;0;0
Aldridge-Brownhills;Wendy Morton;7;Black Country;60363;26317;12010;1343;0;0;0;0;565
Altrincham and Sale West;Graham Brady;4;Central Manchester;73220;26933;20507;4051;0;1000;0;0;299
Amber Valley;Nigel Mills;8;Derbyshire;68065;25905;17605;1100;0;650;0;0;551
Arundel and South Downs;Nick Herbert;12;West Sussex;80766;37573;13690;4783;1668;2542;0;0;0
Ashfield;Gloria De Piero;8;Nottinghamshire;78099;20844;21285;969;1885;398;0;4612;0
Ashford;Damian Green;12;Kent;87396;35318;17840;3101;2218;1402;0;0;0"]
(testing "number recognition"
(let [expected "76205"
actual (nth (nth (parse-csv data :delimiter \;) 1) 4)]
(is (= actual expected) "Number recognition off"))
(let [expected 76205
actual (nth (nth (parse-csv data :delimiter \; :numbers true) 1) 4)]
(is (= actual expected) "Number recognition on")))
(testing "field names"
(let [expected 76205
actual (:Electorate (first (parse-csv data
:delimiter \;
:numbers true
:field-names true)))]
(is (= actual expected) "Field names from first row"))
(let [expected 76205
actual (:e (nth (parse-csv data
:delimiter \;
:numbers true
:field-names [:a :b :c :d :e]) 1))]
(is (= actual expected) "Field names passed as vector"))
(let [expected 60363
actual (:e (nth (parse-csv data
:delimiter \;
:numbers true
:field-names '(:a :b :c :d :e)) 2))]
(is (= actual expected) "Field names passed as list")))))


70 changes: 70 additions & 0 deletions test/clojure_csv/test/data_cleaning.clj
@@ -0,0 +1,70 @@
(ns clojure-csv.test.data-cleaning
(:require [clojure.test :refer :all]
[clojure-csv.data-cleaning :refer :all]
[clj-time.core :as t]
[clj-time.format :as f]))


(deftest number-recognition
(testing "Recognition of integers"
(let [expected 123456
actual (number-as-number "123456")]
(is (= actual expected) "integer 123456"))
(let [expected -1
actual (number-as-number "-1")]
(is (= actual expected) "integer negative one")))
(testing "Recognition of floats"
(let [expected 0.1
actual (number-as-number "0.1")]
(is (= actual expected) "float zero point one"))
(let [expected -0.1
actual (number-as-number "-0.1")]
(is (= actual expected) "float negative zero point one"))
(let [expected 3.142857
actual (number-as-number "3.142857")]
(is (= actual expected) "float approximation of π")))
(testing "Recognition of rationals"
(let [expected 22/7
actual (number-as-number "22/7")]
(is (= actual expected) "rational approximation of π"))
(let [expected 1/4
actual (number-as-number "2/8")]
(is (= actual expected) "two eighths -> one quarter")))
(testing "Recognition of numbers"
(let [expected '("Fred" "2019-03-23" 22/7 3.142857 123456 -8)
actual (numbers-as-numbers
'("Fred" "2019-03-23" "22/7" "3.142857" "123456" "-8"))]
(is (= actual expected) "List including numbers in various formats"))))

(deftest date-recognition
(testing "recognition of dates; format is string"
(let [expected "class org.joda.time.DateTime"
actual (str (type (date-as-date "2019-03-23" "yyyy-MM-dd")))]
(is (= actual expected) "format is string; match expected"))
(let [expected "class java.lang.String"
actual (str (type (date-as-date "2019/03/23" "yyyy-MM-dd")))]
(is (= actual expected) "format is string; match not expected")))
(testing "recognition of dates; format is keyword"
(let [expected "class org.joda.time.DateTime"
actual (str (type (date-as-date "2019-03-23" :date)))]
(is (= actual expected) "format is keyword; match expected"))
(let [expected "class java.lang.String"
actual (str (type (date-as-date "2019/03/23" :date)))]
(is (= actual expected) "format is keyword; match not expected")))
(testing "recognition of dates; format is formatter"
(let [expected "class org.joda.time.DateTime"
actual (str (type (date-as-date "2019-03-23" (f/formatter "yyyy-MM-dd"))))]
(is (= actual expected) "format is formatter; match expected"))
(let [expected "class java.lang.String"
actual (str (type (date-as-date "2019/03/23" (f/formatter "yyyy-MM-dd"))))]
(is (= actual expected) "format is formatter; match not expected"))
(let [expected "class org.joda.time.DateTime"
actual (str
(type
(date-as-date
"2019/03/23"
(f/formatter
(t/default-time-zone)
"YYYY-MM-dd"
"YYYY/MM/dd"))))]
(is (= actual expected) "format is composite formatter; match expected"))))