POSIX Programming Problem: Interview question

Seen on Hacker News, Alexey Grigorev poses a simple problem for use in interview screening. I won't get into whether such things are useful/effective or not, though I will note that the Twitter comments are full of people getting it horribly wrong/overwrought. I'm more interested in how easy/hard this would be to accomplish using standard POSIX utilities (emphasis on standard, no GNU extensions allowed).

The basic problem is just counting repeated characters. Not especially that it is for each block of recurring characters, not the sum total of each character present in the string. So each character, presented by the number of times it occurs in that block. This is a lot like what uniq will do with the -c option, except uniq operates on lines at a time, not characters. Can we get around that?

Yes, and using standard tools. A lot of people Tweeted answers involving grep -o, but that's a GNU extension. POSIX grep has no -o option. But sed is up to the task. If we match on every character, and print the match plus a newline, we've transformed a string of characters into a sequence of lines (with one character plus newline per line):*

$ printf "aaaabbbcca" | sed -e 's/\(.\)/\1\n/g'
a
a
a
a
b
b
b
c
c
a

We can feed that through uniq and get our output:

$ printf "aaaabbbcca" | sed -e 's/\(.\)/\1\n/g' | uniq -c 
      4 a
      3 b
      2 c
      1 a

Well, we almost have our output. It needs to be in a different format. A little awk can clear that up:

$ printf "aaaabbbcca" | sed -e 's/\(.\)/\1\n/g' | uniq -c | awk '{printf("(\"%s\", %d), ", $2, $1);}'
("a", 4), ("b", 3), ("c", 2), ("a", 1), 

Mostly. We need the surrounding square brackets, and to remove that trailing comma (and get our final newline back!). Once again, sed is ready:

$ printf "aaaabbbcca" | sed -e 's/\(.\)/\1\n/g' | uniq -c | awk '{printf("(\"%s\", %d), ", $2, $1);}' | sed -e 's/^\(.*\), $/[\1]\n/'
[("a", 4), ("b", 3), ("c", 2), ("a", 1)]

And there we have it. A little time with familiar POSIX tools and we have the answer.


* Update: There is a more elegant way to do this.

Copyright © 2021 Jakob Kaivo <jakob@kaivo.net>