Commit 0711116

Release 7.057
1 parent ede9295 commit 0711116

40 files changed: +322 -272 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,4 +1,8 @@
 # Changelog
+# 7.057
+* Slightly faster arrow compressed writes.
+* column-cast no longer appends roaring bitmaps to metadata unless requested.
+
 # 7.056
 * Arrow support for UUID and bigdecimal types.

deps.edn

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
           :exec-fn codox.main/-main
           :exec-args {:group-id "techascent"
                       :artifact-id "tech.ml.dataset"
-                      :version "7.056"
+                      :version "7.057"
                       :name "TMD"
                       :description "A Clojure high performance data processing system"
                       :metadata {:doc/format :markdown}

docs/000-getting-started.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/100-walkthrough.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/200-quick-reference.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/columns-readers-and-datatypes.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/index.html

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

docs/nippy-serialization-rocks.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/supported-datatypes.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.categorical.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.clipboard.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.column-filters.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.column.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.html

Lines changed: 84 additions & 78 deletions
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.csv.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.datetime.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.string-row-parser.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.univocity.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.join.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.math.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.metamorph.html

Lines changed: 93 additions & 87 deletions
Large diffs are not rendered by default.

docs/tech.v3.dataset.modelling.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.print.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.reductions.apache-data-sketch.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.reductions.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.rolling.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.set.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.tensor.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.zip.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.arrow.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.clj-transit.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.fastexcel.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.guava.cache.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.parquet.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.poi.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.tribuo.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

src/tech/v3/dataset.clj

Lines changed: 10 additions & 2 deletions
@@ -251,9 +251,17 @@

   Casts between numeric datatypes need no cast-fn but one may be provided.
   Casts to string need no cast-fn but one may be provided.
-  Casts from string to anything will call tech.v3.dataset.column/parse-column."
+  Casts from string to anything will call tech.v3.dataset.column/parse-column.
+
+  Options:
+
+  * `:track-parse-errors` - defaults to false. When true extra metadata keys
+  `:unparsed-indexes :unparsed-data` will be appended to the metadata. Be aware
+  these values may not serialize as unparsed indexes is a roaring bitmap."
   ([dataset colname datatype]
-  (tech.v3.dataset-api/column-cast dataset colname datatype)))
+  (tech.v3.dataset-api/column-cast dataset colname datatype))
+  ([dataset colname datatype options]
+  (tech.v3.dataset-api/column-cast dataset colname datatype options)))


 (defn column-count
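
As a rough illustration of the new four-argument arity, the sketch below casts a column with and without `:track-parse-errors`. It assumes tech.ml.dataset 7.057 or later is on the classpath; the `prices` dataset and the `round-fn` helper are hypothetical names made up for this example.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical input data; any numeric column would do.
(def prices (ds/->dataset {:price [1.5 2.25 3.75]}))

;; Hypothetical cast function, mirroring the one used in the test in this commit.
(def round-fn #(Math/round (double %)))

;; Three-argument arity: parse tracking is off, so no
;; :unparsed-indexes key is attached to the column metadata.
(def plain (ds/column-cast prices :price [:int32 round-fn]))

;; Four-argument arity: opt in to the extra metadata keys.
(def tracked (ds/column-cast prices :price [:int32 round-fn]
                             {:track-parse-errors true}))

(def plain-unparsed   (-> (ds/column plain :price) meta :unparsed-indexes))
(def tracked-unparsed (-> (ds/column tracked :price) meta :unparsed-indexes))
```

With tracking off, `plain-unparsed` should be nil; with tracking on, `tracked-unparsed` should be a (possibly empty) bitmap, which is the value the docstring warns may not serialize.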

src/tech/v3/dataset/metamorph.clj

Lines changed: 10 additions & 2 deletions
@@ -110,9 +110,17 @@

   Casts between numeric datatypes need no cast-fn but one may be provided.
   Casts to string need no cast-fn but one may be provided.
-  Casts from string to anything will call tech.v3.dataset.column/parse-column."
+  Casts from string to anything will call tech.v3.dataset.column/parse-column.
+
+  Options:
+
+  * `:track-parse-errors` - defaults to false. When true extra metadata keys
+  `:unparsed-indexes :unparsed-data` will be appended to the metadata. Be aware
+  these values may not serialize as unparsed indexes is a roaring bitmap."
   ([colname datatype]
-  (tech.v3.dataset.metamorph-api/column-cast colname datatype)))
+  (tech.v3.dataset.metamorph-api/column-cast colname datatype))
+  ([colname datatype options]
+  (tech.v3.dataset.metamorph-api/column-cast colname datatype options)))


 (defn column-count
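
The metamorph variant drops the dataset argument. A minimal sketch of how the new options arity might be used in a pipeline step follows. It assumes the usual metamorph convention that each step is a function of a context map holding the dataset under `:metamorph/data`; the `ctx` and `cast-step` names are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.metamorph :as ds-mm])

;; Hypothetical context map with a small example dataset.
(def ctx {:metamorph/data (ds/->dataset {:price [1.5 2.25 3.75]})})

;; Build the pipeline step; no dataset is passed here (assumed convention:
;; the returned function reads and writes :metamorph/data on the context).
(def cast-step
  (ds-mm/column-cast :price
                     [:int32 #(Math/round (double %))]
                     {:track-parse-errors true}))

;; Run the step and pull the resulting dataset back out of the context.
(def result-ds (:metamorph/data (cast-step ctx)))

(def tracked?
  (-> (ds/column result-ds :price) meta (contains? :unparsed-indexes)))
```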

src/tech/v3/dataset_api.clj

Lines changed: 77 additions & 68 deletions
@@ -1045,74 +1045,83 @@ user>

   Casts between numeric datatypes need no cast-fn but one may be provided.
   Casts to string need no cast-fn but one may be provided.
-  Casts from string to anything will call tech.v3.dataset.column/parse-column."
-  [dataset colname datatype]
-  (let [[src-colname dst-colname] (if (instance? Collection colname)
-                                    colname
-                                    [colname colname])
-        src-col (dataset src-colname)
-        src-dtype (dtype/get-datatype src-col)
-        [dst-dtype cast-fn] (if (instance? Collection datatype)
-                              datatype
-                              [datatype nil])]
-    (add-or-update-column
-     dataset dst-colname
-     (cond
-       (and (= src-dtype dst-dtype)
-            (nil? cast-fn))
-       (dtype/clone src-col)
-       (= src-dtype :string)
-       (ds-col/parse-column datatype src-col)
-       :else
-       (let [cast-fn (or cast-fn
-                         (cond
-                           (= dst-dtype :string)
-                           str
-                           (or (= :boolean dst-dtype)
-                               (casting/numeric-type? dst-dtype))
-                           #(casting/cast % dst-dtype)
-                           :else
-                           (throw (Exception.
-                                   (format "Cast fn must be provided for datatype %s"
-                                           dst-dtype)))))
-             ^RoaringBitmap missing (dtype-proto/as-roaring-bitmap
-                                     (ds-col/missing src-col))
-             ^RoaringBitmap new-missing (dtype/clone missing)
-             col-reader (dtype/->reader src-col)
-             n-elems (dtype/ecount col-reader)
-             unparsed-data (ArrayList.)
-             unparsed-indexes (bitmap/->bitmap)
-             result (if (= dst-dtype :string)
-                      (str-table/make-string-table n-elems)
-                      (dtype/make-list dst-dtype n-elems))
-             missing-val (col-base/datatype->missing-value dst-dtype)]
-         (reduce (fn [^List res-writer ^long idx]
-                   (if (.contains missing idx)
-                     (.add res-writer missing-val)
-                     (let [existing-val (col-reader idx)
-                           new-val (cast-fn existing-val)]
-                       (cond
-                         (= new-val :tech.v3.dataset/missing)
-                         (locking new-missing
-                           (.add new-missing idx)
-                           (.add res-writer missing-val))
-                         (= new-val :tech.v3.dataset/parse-failure)
-                         (locking new-missing
-                           (.add res-writer missing-val)
-                           (.add new-missing idx)
-                           (.add unparsed-indexes idx)
-                           (.add unparsed-data existing-val))
-                         :else
-                         (.add res-writer new-val))))
-                   res-writer) result (hamf/range n-elems))
-         (ds-col/new-column #:tech.v3.dataset{:name dst-colname
-                                              :data result
-                                              :force-datatype? true
-                                              :missing missing
-                                              :metadata (clojure.core/assoc
-                                                         (meta src-col)
-                                                         :unparsed-indexes unparsed-indexes
-                                                         :unparsed-data unparsed-data)}))))))
+  Casts from string to anything will call tech.v3.dataset.column/parse-column.
+
+  Options:
+
+  * `:track-parse-errors` - defaults to false. When true extra metadata keys
+  `:unparsed-indexes :unparsed-data` will be appended to the metadata. Be aware
+  these values may not serialize as unparsed indexes is a roaring bitmap."
+  ([dataset colname datatype] (column-cast dataset colname datatype nil))
+  ([dataset colname datatype options]
+   (let [[src-colname dst-colname] (if (instance? Collection colname)
+                                     colname
+                                     [colname colname])
+         src-col (dataset src-colname)
+         src-dtype (dtype/get-datatype src-col)
+         [dst-dtype cast-fn] (if (instance? Collection datatype)
+                               datatype
+                               [datatype nil])]
+     (add-or-update-column
+      dataset dst-colname
+      (cond
+        (and (= src-dtype dst-dtype)
+             (nil? cast-fn))
+        (dtype/clone src-col)
+        (= src-dtype :string)
+        (ds-col/parse-column datatype src-col)
+        :else
+        (let [cast-fn (or cast-fn
+                          (cond
+                            (= dst-dtype :string)
+                            str
+                            (or (= :boolean dst-dtype)
+                                (casting/numeric-type? dst-dtype))
+                            #(casting/cast % dst-dtype)
+                            :else
+                            (throw (Exception.
+                                    (format "Cast fn must be provided for datatype %s"
+                                            dst-dtype)))))
+              ^RoaringBitmap missing (dtype-proto/as-roaring-bitmap
+                                      (ds-col/missing src-col))
+              ^RoaringBitmap new-missing (dtype/clone missing)
+              col-reader (dtype/->reader src-col)
+              n-elems (dtype/ecount col-reader)
+              unparsed-data (ArrayList.)
+              unparsed-indexes (bitmap/->bitmap)
+              result (if (= dst-dtype :string)
+                       (str-table/make-string-table n-elems)
+                       (dtype/make-list dst-dtype n-elems))
+              missing-val (col-base/datatype->missing-value dst-dtype)]
+          (reduce (fn [^List res-writer ^long idx]
+                    (if (.contains missing idx)
+                      (.add res-writer missing-val)
+                      (let [existing-val (col-reader idx)
+                            new-val (cast-fn existing-val)]
+                        (cond
+                          (= new-val :tech.v3.dataset/missing)
+                          (locking new-missing
+                            (.add new-missing idx)
+                            (.add res-writer missing-val))
+                          (= new-val :tech.v3.dataset/parse-failure)
+                          (locking new-missing
+                            (.add res-writer missing-val)
+                            (.add new-missing idx)
+                            (.add unparsed-indexes idx)
+                            (.add unparsed-data existing-val))
+                          :else
+                          (.add res-writer new-val))))
+                    res-writer) result (hamf/range n-elems))
+          (ds-col/new-column #:tech.v3.dataset{:name dst-colname
+                                               :data result
+                                               :force-datatype? true
+                                               :missing missing
+                                               :metadata (if (get options :track-parse-errors)
+                                                           (clojure.core/assoc
+                                                            (meta src-col)
+                                                            :unparsed-indexes unparsed-indexes
+                                                            :unparsed-data unparsed-data)
+                                                           (meta src-col))})))))))


 (defn columnwise-concat
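
The shape of this change (a short arity that delegates with `nil` options, plus metadata that is only extended when the caller opts in) can be sketched without the library. The code below is a simplified stand-in, not the library's implementation: `cast-values`, `parse-int`, and the `::parse-failure` sentinel are hypothetical names, and plain vectors take the place of columns.

```clojure
(defn parse-int
  "Hypothetical parser: returns a long, or ::parse-failure on bad input."
  [s]
  (try (Long/parseLong s)
       (catch NumberFormatException _ ::parse-failure)))

(defn cast-values
  "Applies cast-fn to every value. Failed casts become nil. Failure details
  are attached as metadata only when :track-parse-errors is truthy."
  ;; Short arity delegates to the options arity with nil options.
  ([values cast-fn] (cast-values values cast-fn nil))
  ([values cast-fn options]
   (let [values     (vec values)
         results    (mapv cast-fn values)
         ;; Indexes whose cast reported a failure.
         failed-idx (into []
                          (keep-indexed
                           (fn [idx v] (when (= v ::parse-failure) idx)))
                          results)
         base-meta  {:name :example}]
     (with-meta (mapv #(if (= % ::parse-failure) nil %) results)
       (if (get options :track-parse-errors)
         (assoc base-meta
                :unparsed-indexes failed-idx
                :unparsed-data (mapv values failed-idx))
         base-meta)))))

(def plain   (cast-values ["1" "x" "3"] parse-int))
(def tracked (cast-values ["1" "x" "3"] parse-int {:track-parse-errors true}))
```

Here `(meta plain)` carries only the base metadata, while `(meta tracked)` also records index `1` and the original value `"x"`. Keeping tracking off by default means callers who serialize the result never encounter the extra values unless they asked for them.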

test/tech/v3/dataset_test.clj

Lines changed: 10 additions & 1 deletion
@@ -538,7 +538,16 @@
              (->> (ds/column-cast ds :price [:int32 #(Math/round (double %))])
                   (#(ds/column % :price))
                   (take 5)
-                  (vec))))))
+                  (vec))))
+    (is (nil? (->> (ds/column-cast ds :price [:int32 #(Math/round (double %))])
+                   (#(ds/column % :price))
+                   (meta)
+                   (:unparsed-indexes))))
+    (is (not
+         (nil? (->> (ds/column-cast ds :price [:int32 #(Math/round (double %))] {:track-parse-errors true})
+                    (#(ds/column % :price))
+                    (meta)
+                    (:unparsed-indexes)))))))


 (deftest column-clone-double-read
