Let's say the column local_time
of your CSV file is containing dates in ISO 8601 format, for example 2022-03-22
, 2022-03-22 23:20:24
, 2022-03-22T00:00:00[CET]
or 2022-03-22T23:20:24+01:00[Europe/Paris]
.
To explore temporal data, for example to find out if there are empty cells or to find the start and end dates of the corpus, the first thing you can do is xan stats
on the local_time
column:
xan stats -s local_time dates.csv | xan view
count | count_empty | type | min | max | ... | lex_first | lex_last | ... |
---|---|---|---|---|---|---|---|---|
92 | 0 | date | ... | 2022-02-25T17:09:47 | 2022-03-27T12:56:25 | ... |
One can have a better view of the same table with xan transpose
:
xan stats -s local_time dates.csv | xan transpose | xan view
field | local_time |
---|---|
count | 92 |
count_empty | 0 |
type | date |
types | date |
sum | 0 |
mean | |
variance | |
stddev | |
min | |
max | |
lex_first | 2022-02-25T17:09:47 |
lex_last | 2022-03-27T12:56:25 |
min_length | 19 |
max_length | 19 |
Here we observe that there are no empty fields (see count_empty
), and that the data extends from February 25 to March 3, 2022 (see lex_first
and lex_last
).
If you want to keep only the year and month in a new column called formatted_time
, you can apply the year_month
(or ym
) function.
Functions can be applied using the xan map
command, to create a new column with the result of an expression,
or with the xan transform
command, to directly transform the column. Here is an example with xan map
:
xan map 'ym(local_time)' formatted_time dates.csv | xan view
local_time | formatted_time |
---|---|
2022-02-25T17:09:47 | 2022-02 |
2022-02-25T17:33:28 | 2022-02 |
2022-02-26T09:18:05 | 2022-02 |
2022-02-27T07:23:00 | 2022-02 |
2022-02-27T09:07:17 | 2022-02 |
2022-02-28T09:45:15 | 2022-02 |
2022-03-01T14:39:03 | 2022-03 |
2022-03-01T17:42:13 | 2022-03 |
Other formatting functions, such as year()
or year_month_day()
exist.
The list of all date-related functions is available
here and can also be displayed using:
xan map --functions | grep Dates -A 50
Now that you now how to format dates, you can use xan freq
to count the number of lines per day in you dataset:
xan map 'ymd(local_time)' year_month_day dates.csv | xan freq -s year_month_day | xan view
field | value | count |
---|---|---|
year_month_day | 2022-03-23 | 9 |
year_month_day | 2022-03-24 | 8 |
year_month_day | 2022-03-12 | 7 |
year_month_day | 2022-03-25 | 7 |
year_month_day | 2022-03-26 | 5 |
year_month_day | 2022-03-02 | 4 |
year_month_day | 2022-03-03 | 4 |
year_month_day | 2022-03-07 | 4 |
year_month_day | 2022-03-10 | 4 |
year_month_day | 2022-03-11 | 4 |
year_month_day | 36 |
This view is helpful (it is sorted by decreasing count
) but in the case of dates one would prefer to have lines sorted in chronological order.
The simplest way to do this is to use xan hist -D
.
xan hist
will plot a histogram, and the -D
(or --dates
) flag will have the histogram sorted by date and add empty bars for missing days (or months, or years).
Don't forget to add -A
(or --all
) in xan freq
in order to plot all days.
xan map 'ymd(local_time)' year_month_day dates.csv | xan freq -s year_month_day -l 0 | xan hist -D
This way, you immediatly notice the fact that there is no line in your dataset on March 18 and 19.
2022-02-25 |2 2.17%|■■■■■■■■■ |
2022-02-26 |1 1.09%|■■■■■ |
2022-02-27 |2 2.17%|■■■■■■■■■ |
2022-02-28 |1 1.09%|■■■■■ |
2022-03-01 |2 2.17%|■■■■■■■■■ |
2022-03-02 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-03 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-04 |1 1.09%|■■■■■ |
2022-03-05 |1 1.09%|■■■■■ |
2022-03-06 |1 1.09%|■■■■■ |
2022-03-07 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-08 |2 2.17%|■■■■■■■■■ |
2022-03-09 |3 3.26%|■■■■■■■■■■■■■ |
2022-03-10 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-11 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-12 |7 7.61%|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
2022-03-13 |1 1.09%|■■■■■ |
2022-03-14 |2 2.17%|■■■■■■■■■ |
2022-03-15 |1 1.09%|■■■■■ |
2022-03-16 |3 3.26%|■■■■■■■■■■■■■ |
2022-03-17 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-18 |0 0.00%| |
2022-03-19 |0 0.00%| |
2022-03-20 |1 1.09%|■■■■■ |
2022-03-21 |4 4.35%|■■■■■■■■■■■■■■■■■ |
2022-03-22 |1 1.09%|■■■■■ |
2022-03-23 |9 9.78%|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■|
2022-03-24 |8 8.70%|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
2022-03-25 |7 7.61%|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
2022-03-26 |5 5.43%|■■■■■■■■■■■■■■■■■■■■■ |
2022-03-27 |3 3.26%|■■■■■■■■■■■■■ |
If you have dates in non ISO 8601 format, such as 31/12/22
or 02 Jan 2006 15:04
, you can parse them using the datetime
function.
The conversion specifications (i.e. how to tell datetime()
the format you have in mind) are listed
here.
datetime()
takes as first argument the name of the column containing the date expression, and as second argument the desired format (we will see the third argument in the Dealing with timezones section).
xan map 'datetime(initial_date, "%d/%m/%y")' parsed_date strange_dates.csv | xan v
initial_date | parsed_date |
---|---|
25/02/22 | 2022-02-25T00:00:00[CET] |
25/02/22 | 2022-02-25T00:00:00[CET] |
26/02/22 | 2022-02-26T00:00:00[CET] |
27/02/22 | 2022-02-27T00:00:00[CET] |
27/02/22 | 2022-02-27T00:00:00[CET] |
28/02/22 | 2022-02-28T00:00:00[CET] |
01/03/22 | 2022-03-01T00:00:00[CET] |
01/03/22 | 2022-03-01T00:00:00[CET] |
Of course, you can combine datetime()
with other formatting functions to print your output in the desired format.
Below an example with month_day()
:
xan map 'datetime(initial_date, "%d/%m/%y") | month_day(_)' formatted_date strange_dates.csv | xan v
initial_date | formatted_date |
---|---|
25/02/22 | 02-25 |
25/02/22 | 02-25 |
26/02/22 | 02-26 |
27/02/22 | 02-27 |
27/02/22 | 02-27 |
28/02/22 | 02-28 |
01/03/22 | 03-01 |
01/03/22 | 03-01 |
You can also specify your custom output format using the strftime()
formatting function:
xan map 'datetime(initial_date, "%d/%m/%y") | strftime(_, "%A")' day_of_week strange_dates.csv | xan v
initial_date | day_of_week |
---|---|
25/02/22 | Friday |
25/02/22 | Friday |
26/02/22 | Saturday |
27/02/22 | Sunday |
27/02/22 | Sunday |
28/02/22 | Monday |
01/03/22 | Tuesday |
01/03/22 | Tuesday |
strftime()
can also be used without datetime()
if the date expression is in ISO 8601 format:
xan map 'strftime(local_time, "%A")' day_of_week dates.csv | xan v
local_time | day_of_week |
---|---|
2022-02-25T17:09:47 | Friday |
2022-02-25T17:33:28 | Friday |
2022-02-26T09:18:05 | Saturday |
2022-02-27T07:23:00 | Sunday |
2022-02-27T09:07:17 | Sunday |
2022-02-28T09:45:15 | Monday |
2022-03-01T14:39:03 | Tuesday |
2022-03-01T17:42:13 | Tuesday |