join — My Favorite Unix Command, Probably
Consider the following case. We have two files storing some data about artists and their countries of origin:
$ ls
artists.txt countries.txt
There are 15 artists and 6 countries:
$ tail artists.txt countries.txt
==> artists.txt <==
4 Γεωργία Νταγάκη
3 L.E.J
5 Lykke Li
1 Moonlight Breakfast
7 Irina Rimes
2 Jake Bugg
3 ZAZ
8 Lana Del Rey
8 Jackson C. Frank
4 Αλκίνοος Ιωαννίδης
==> countries.txt <==
France 3
Greece 4
Romania 1
Russia 6
Sweden 5
UK 2
The two files are related by means of the leading and trailing numbers, which represent country IDs (made up solely for the purpose of this blog post). Thus "3 L.E.J" relates to "France 3", meaning that L.E.J are from France.
Now, we don't have a lot of data here, it's just 15 artists, but it's already
quite hard to see which artists are from where. If these had been records in a
relational database we could have easily written an SQL INNER JOIN
query to
reveal the associations we're after:
SELECT artist.name, country.name
FROM artist, country
WHERE artist.country_id = country.id
We could maybe start an SQLite session, create the tables, import the data and
finally execute the query. This may even be a viable approach, but it doesn't
fit the plan for this blog post. However, if you're on an Unix system, there's
a hidden gem of a command, aptly named join
, that can do precisely what the
above SQL query can, only it doesn't need a database, just regular files. So,
how could we use join
to reproduce the above query?
Inner Joins — Where are These Artists From?
$ join -t: -22 -o1.2 -o2.1 \
<( sort artists.txt | sed 's/ /:/' ) \
<( sort --key=2 countries.txt | sed 's/ /:/' ) \
;
Moonlight Breakfast:Romania
Adele:UK
Dido:UK
Jake Bugg:UK
Clio:France
Jain:France
L.E.J:France
Stromae:France
ZAZ:France
Γεωργία Νταγάκη:Greece
Αλκίνοος Ιωαννίδης:Greece
Lykke Li:Sweden
Left Joins - Artists without a Country
$ join -a1 -t: -22 -o1.2 -o2.1 \
<( sort artists.txt | sed 's/ /:/' ) \
<( sort -k2 countries.txt | sed 's/ /:/' ) \
| grep --extended-regexp ':$' \
;
Irina Rimes:
Jackson C. Frank:
Lana Del Rey:
Right Joins - Countries without Artists
$ join -a2 -t: -22 -o1.2 -o2.1 \
<( sort artists.txt | sed 's/ /:/' ) \
<( sort -k2 countries.txt | sed 's/ /:/' ) \
| grep -E '^:' \
;
:Russia
Full Joins — Artists and Countries, Both Unlinked
$ join -a1 -a2 -t: -22 -o1.2 -o2.1 \
<( sort artists.txt | sed 's/ /:/' ) \
<( sort -k2 countries.txt | sed 's/ /:/' ) \
| grep -E '(^:)|(:$)' \
;
:Russia
Irina Rimes:
Jackson C. Frank:
Lana Del Rey:
Tabular Display
$ join -a1 -a2 -t: -22 -o2.1 -o1.2 \
<( sort artists.tx | sed 's/ /:/' ) \
<( sort -k2 countries.txt | sed 's/ /:/' ) \
| sed -E 's/^:/𝐍𝐔𝐋𝐋:/; s/:$/:𝐍𝐔𝐋𝐋/' \
| ( echo " #:COUNTRY:ARTIST"
echo "===:=======:======"
sort -k1 -t: | awk -vOFS=: '{ print sprintf("%3d", NR), $0 }'
) \
| column -ts: \
;
# COUNTRY ARTIST
=== ======= ======
1 France Clio
2 France Jain
3 France L.E.J
4 France Stromae
5 France ZAZ
6 Greece Γεωργία Νταγάκη
7 Greece Αλκίνοος Ιωαννίδης
8 Romania Moonlight Breakfast
9 Russia 𝐍𝐔𝐋𝐋
10 Sweden Lykke Li
11 UK Adele
12 UK Dido
13 UK Jake Bugg
14 𝐍𝐔𝐋𝐋 Irina Rimes
15 𝐍𝐔𝐋𝐋 Jackson C. Frank
16 𝐍𝐔𝐋𝐋 Lana Del Rey
Refactored
# Our internal delimiter, needed because the artist names in our records
# contain spaces. This conflicts with the default separator(s) of all the
# commands below: spaces or tabs. It should be safe to embed in regexes.
declare -r delim=:
redelimited() {
# Replaces the *first* space with our delimiter. This works wonderfully
# for the datasets we have, but it would fail miserably if added a country
# whose name contains a whitespace, such as "Sierra Leone".
#
# So... exercise for the reader: handle whitespaces in country names.
sed "s/ /$delim/" "$1"
}
sort-by() {
sort -k"$1" -t"$delim"
}
full-join() {
join -a1 -a2 -t"$delim" -22 -o2.1 -o1.2 \
<( redelimited artists.txt | sort-by 1 ) \
<( redelimited countries.txt | sort-by 2 ) \
;
}
show-nulls() {
# Prepend/append 𝐍𝐔𝐋𝐋 to rows starting/ending with our delimiter.
sed -E "s/^($delim)/𝐍𝐔𝐋𝐋\1/; s/($delim)$/\1𝐍𝐔𝐋𝐋/"
}
headers() {
echo " #${delim}COUNTRY${delim}ARTIST"
echo "===${delim}=======${delim}======"
}
number-rows() {
awk -vOFS="$delim" '{ print sprintf("%3d", NR), $0 }'
}
tabulate() {
column -ts"$delim"
}
full-join | show-nulls | sort-by 1 | ( headers; number-rows ) | tabulate