Commit c0e40ef

Release 7.037
1 parent 78fb514 commit c0e40ef

37 files changed: +102 -45 lines changed

deps.edn

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
  :exec-fn codox.main/-main
  :exec-args {:group-id "techascent"
  :artifact-id "tech.ml.dataset"
- :version "7.036"
+ :version "7.037"
  :name "TMD"
  :description "A Clojure high performance data processing system"
  :metadata {:doc/format :markdown}
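
For readers who want to pick up this release from another project, a dependency entry along the following lines is a reasonable sketch. It assumes the release is published under the same group id, artifact id, and version that appear in the `:exec-args` above; the surrounding `deps.edn` is hypothetical.

```clojure
;; deps.edn of a consuming project (hypothetical).
;; Coordinates are assumed to match the :group-id, :artifact-id and :version shown in this commit.
{:deps {techascent/tech.ml.dataset {:mvn/version "7.037"}}}
```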

docs/000-getting-started.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/100-walkthrough.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/200-quick-reference.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/columns-readers-and-datatypes.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/index.html

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

docs/nippy-serialization-rocks.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/supported-datatypes.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.categorical.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.clipboard.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.column-filters.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.column.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.csv.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.datetime.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.string-row-parser.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.io.univocity.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.join.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.math.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.metamorph.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.modelling.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.print.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.reductions.apache-data-sketch.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.reductions.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.rolling.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.set.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.tensor.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.dataset.zip.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.arrow.html

Lines changed: 32 additions & 6 deletions
Large diffs are not rendered by default.

docs/tech.v3.libs.clj-transit.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.fastexcel.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.guava.cache.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.parquet.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.poi.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.smile.data.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/tech.v3.libs.tribuo.html

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

src/tech/v3/libs/arrow.clj

Lines changed: 34 additions & 3 deletions
@@ -24,6 +24,14 @@
  loaded. Appropriate JVM arguments can be found
  [here](https://github.com/techascent/tech.ml.dataset/blob/0524ddd5bbcb9421a0f11290ec8a01b7795dcff9/project.clj#L69).

+ Example (with zstd compression):
+
+ ```clojure
+ ;; Writing
+ (arrow/dataset->stream! ds fname {:compression :zstd})
+ ;; Reading
+ (arrow/stream->dataset path)
+ ```


  ## Required Dependencies

@@ -49,7 +57,25 @@

  ```console
  sudo apt install liblz4-1
- ```"
+ ```
+
+ ## Performance
+
+ Arrow has hands down highest performance of any of the formats although nippy comes very close when using
+ any compression. The highest performance pathway is to save out data with :strings-as-text? true and zero
+ compression then read them in using mmap - optionally with :text-as-strings? if you never want to see
+ tech.v3.datatype.Text objects in your dataset. This avoids the creation of string dictionaries during
+ deserialization as these have to be done greedily. It can dramatically increase many dataset sizes but
+ when mmap is used the overall size is irrelevant aside from iteration which can be heavily parallelized.
+
+ Example:
+
+ ```clojure
+ ;; Writing
+ (arrow/dataset->stream! ds fname {:strings-as-text? true})
+ ;; Reading
+ (arrow/stream->dataset path {:text-as-strings? true :open-type :mmap})
+ ```"
  (:require [tech.v3.datatype.mmap :as mmap]
  [tech.v3.datatype.datetime :as dtype-dt]
  [tech.v3.datatype :as dtype]
@@ -1327,7 +1353,7 @@ Dependent block frames are not supported!!")
  offsets offset-buf-dtype)
  varchar-data n-elems)]
  (if-not (:text-as-strings? options)
- (string-reader->text-reader)
+ (string-reader->text-reader str-rdr)
  str-rdr))))


@@ -1841,7 +1867,12 @@ Dependent block frames are not supported!!")
  datatypes will be represented as their integer types as opposed to their respective
  packed types. For example columns of type `:epoch-days` will be returned to the user
  as datatype `:epoch-days` as opposed to `:packed-local-date`. This means reading values
- will return integers as opposed to `java.time.LocalDate`s."
+ will return integers as opposed to `java.time.LocalDate`s.
+
+ * `:text-as-strings?` - Return strings instead of Text objects. This breaks automatic round-tripping
+ as it changes datatypes *but* can be useful when used with `:strings-as-text?` when writing data out.
+ When used like this uncompressed mmap pathways typically have the highest performance - roughly 100x
+ any other method."
  [fname & [options]]
  (let [input (case (get options :open-type :input-stream)
  :mmap (mmap/mmap-file fname options)
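
The write and read options documented in this commit can be combined into a small round trip. The following is a minimal sketch, not the library's own test: it assumes tech.ml.dataset 7.037 or later with the Arrow dependencies on the classpath and a JVM configured for the mmap pathway as the namespace docstring describes. The dataset contents, the string column names, and the file name are hypothetical and chosen only for illustration.

```clojure
(ns example.arrow-round-trip
  (:require [tech.v3.dataset :as ds]
            [tech.v3.libs.arrow :as arrow]))

;; Hypothetical sample data; string column names are used so that lookup
;; after reading does not depend on how column names are restored.
(def sample-ds
  (ds/->dataset {"id"    [1 2 3]
                 "label" ["alpha" "beta" "alpha"]}))

;; Hypothetical output file in the current working directory.
(def fname "round-trip-example.arrow")

;; Write without compression, storing string columns as text.
(arrow/dataset->stream! sample-ds fname {:strings-as-text? true})

;; Read back through mmap, asking for plain strings instead of Text objects.
(def round-tripped
  (arrow/stream->dataset fname {:open-type        :mmap
                                :text-as-strings? true}))

;; Print the values of the string column after the round trip.
(println (vec (round-tripped "label")))
```

Before this commit's fix, `string-reader->text-reader` was called without its argument, so reading text columns without `:text-as-strings?` would be the path to check when upgrading from 7.036.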
