Reading a FILE one byte at a time

Something I have seen many times with junior C programmers is a misunderstanding of how to tell when you've read the last byte of data from a FILE stream. To illustrate this, I'll use a very simple program that does nothing but copy stdin to stdout, byte-by-byte.

Here is a simple, but wrong, implementation of that program:

#include <stdio.h>

int main(void)
{
	char c;
	while (!feof(stdin)) {
		c = fgetc(stdin);
		fputc(c, stdout);
	}
}

Running this program against its own source code might appear to give the proper result (your results may vary based on your terminal emulator):

$ ./bbb < bbb.c
#include <stdio.h>

int main(void)
{
        char c;
        while (!feof(stdin)) {
                c = fgetc(stdin);
                fputc(c, stdout);
        }
}
$

But if you examine the output more closely, there is a subtle error (look at the very last byte):

$ ./bbb < bbb.c | od -tx1
0000000 23 69 6e 63 6c 75 64 65 20 3c 73 74 64 69 6f 2e
0000020 68 3e 0a 0a 69 6e 74 20 6d 61 69 6e 28 76 6f 69
0000040 64 29 0a 7b 0a 09 63 68 61 72 20 63 3b 0a 09 77
0000060 68 69 6c 65 20 28 21 66 65 6f 66 28 73 74 64 69
0000100 6e 29 29 20 7b 0a 09 09 63 20 3d 20 66 67 65 74
0000120 63 28 73 74 64 69 6e 29 3b 0a 09 09 66 70 75 74
0000140 63 28 63 2c 20 73 74 64 6f 75 74 29 3b 0a 09 7d
0000160 0a 7d 0a ff
0000164
$

That final 0xff byte isn't in the original file. What's happening here?

The trick is understanding how feof() works. Given a pointer to a FILE stream, it returns nonzero (true) if and only if the stream's end-of-file indicator has been set. That indicator is only set after another function called on the stream encounters the end-of-file condition. So, with the call to feof() at the top of the loop, there will always be one extra call to fgetc() after the last byte has been read. That call returns the symbolic constant EOF, which on most systems is defined as -1. The -1 is assigned to c and passed to fputc(), which writes its argument converted to an unsigned char, and that gives us the 0xff at the end of our output.
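
The timing described above can be checked in isolation. The following is a minimal sketch, not part of the original program: calls_until_feof() is a hypothetical helper name, and it assumes tmpfile() is usable in your environment. It writes a known number of bytes to a temporary file, then counts how many fgetc() calls happen before feof() first reports true.

```c
#include <stdio.h>

/*
 * Hypothetical helper for illustration only.
 * Writes len bytes from data to a temporary file, rewinds it, and counts
 * the fgetc() calls made before feof() first returns nonzero.
 * Returns -1 if the temporary file cannot be created or written.
 */
int calls_until_feof(const char *data, size_t len)
{
	FILE *f = tmpfile();
	if (f == NULL) {
		return -1;
	}
	if (fwrite(data, 1, len, f) != len) {
		fclose(f);
		return -1;
	}
	rewind(f);	/* back to the start; also clears the end-of-file indicator */

	int calls = 0;
	while (!feof(f)) {
		(void)fgetc(f);	/* the value is ignored, only the call is counted */
		calls++;
		if (ferror(f)) {
			break;	/* a read error never sets the end-of-file indicator */
		}
	}
	fclose(f);
	return calls;
}
```

For a three-byte input, calls_until_feof("abc", 3) should come out as 4: three calls that return data, and a fourth that returns EOF and only then sets the indicator. An empty input still costs one call.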

You might think that moving the check to the end, by changing the loop from a while to a do...while loop, would help. But the final fgetc() call still returns EOF, and that value is still written before feof() is checked:

#include <stdio.h>

int main(void)
{
	char c;
	do {
		c = fgetc(stdin);
		fputc(c, stdout);
	} while (!feof(stdin));
}

$ ./bbb < bbb.c | od -tx1
0000000 23 69 6e 63 6c 75 64 65 20 3c 73 74 64 69 6f 2e
0000020 68 3e 0a 0a 69 6e 74 20 6d 61 69 6e 28 76 6f 69
0000040 64 29 0a 7b 0a 09 63 68 61 72 20 63 3b 0a 09 64
0000060 6f 20 7b 0a 09 09 63 20 3d 20 66 67 65 74 63 28
0000100 73 74 64 69 6e 29 3b 0a 09 09 66 70 75 74 63 28
0000120 63 2c 20 73 74 64 6f 75 74 29 3b 0a 09 7d 20 77
0000140 68 69 6c 65 20 28 21 66 65 6f 66 28 73 74 64 69
0000160 6e 29 29 3b 0a 7d 0a ff
0000170
$

What we need to be doing is checking whether the return value of fgetc() is EOF, and terminating the loop based on that. We can maintain idiomatic programming style and remove the call to feof() altogether by putting the assignment, call, and comparison together in the loop control expression:

#include <stdio.h>

int main(void)
{
	char c;
	while ((c = fgetc(stdin)) != EOF) {
		fputc(c, stdout);
	}
}

This seems to work:

$ ./bbb < bbb.c
#include <stdio.h>

int main(void)
{
        char c;
        while ((c = fgetc(stdin)) != EOF) {
                fputc(c, stdout);
        }
}
$ ./bbb < bbb.c | od -tx1
0000000 23 69 6e 63 6c 75 64 65 20 3c 73 74 64 69 6f 2e
0000020 68 3e 0a 0a 69 6e 74 20 6d 61 69 6e 28 76 6f 69
0000040 64 29 0a 7b 0a 09 63 68 61 72 20 63 3b 0a 09 77
0000060 68 69 6c 65 20 28 28 63 20 3d 20 66 67 65 74 63
0000100 28 73 74 64 69 6e 29 29 20 21 3d 20 45 4f 46 29
0000120 20 7b 0a 09 09 66 70 75 74 63 28 63 2c 20 73 74
0000140 64 6f 75 74 29 3b 0a 09 7d 0a 7d 0a
0000154
$

But a very subtle bug is hiding here. See if you can spot it. Maybe this series of commands will help:

$ printf 'some data\n' | ./bbb
some data
$ printf 'some \xff data\n' | ./bbb
some $

In the second command, output ended after reading the byte 0xff. We've seen this byte before, when it was wrongly added to the output. But what if the data we're working with can legitimately contain the byte 0xff, as is the case with some variants of ISO 8859 encoded text and with binary data?

The answer lies in the definitions of EOF and fgetc(). EOF is defined by ISO 9899 as a constant expression "with type int and a negative value" for use as a sentinel return value indicating end of file by various functions in <stdio.h>. The return type of fgetc() is int, and it returns the next byte "as an unsigned char converted to an int" or, if there is no more data left in the stream, EOF. Essentially, all possible values for unsigned char are valid, meaningful bytes of data that can be returned by fgetc(), while EOF is explicitly chosen to not fit in an unsigned char. Storing the result in a char throws that distinction away: where char is a signed 8-bit type and EOF is -1, the byte 0xff and EOF both end up as -1 in c, so the loop stops at the first 0xff.

The TL;DR of which is that the result of fgetc() must be stored in an int, not a char, before it is compared with EOF. So we fix the declaration of c and get the properly working program for reading data byte-by-byte:

#include <stdio.h>

int main(void)
{
	int c;
	while ((c = fgetc(stdin)) != EOF) {
		fputc(c, stdout);
	}
}

$ ./bbb < bbb.c
#include <stdio.h>

int main(void)
{
        int c;
        while ((c = fgetc(stdin)) != EOF) {
                fputc(c, stdout);
        }
}
$ ./bbb < bbb.c | od -tx1
0000000 23 69 6e 63 6c 75 64 65 20 3c 73 74 64 69 6f 2e
0000020 68 3e 0a 0a 69 6e 74 20 6d 61 69 6e 28 76 6f 69
0000040 64 29 0a 7b 0a 09 69 6e 74 20 63 3b 0a 09 77 68
0000060 69 6c 65 20 28 28 63 20 3d 20 66 67 65 74 63 28
0000100 73 74 64 69 6e 29 29 20 21 3d 20 45 4f 46 29 20
0000120 7b 0a 09 09 66 70 75 74 63 28 63 2c 20 73 74 64
0000140 6f 75 74 29 3b 0a 09 7d 0a 7d 0a
0000153
$ printf 'some \xff data\n' | ./bbb
some  data
$
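
The difference between the broken loop and the working one comes down to a single conversion, which can be looked at on its own. This is a rough sketch with hypothetical helper names. It uses signed char explicitly to stand in for a plain char on platforms where char is signed, and it assumes the usual case of 8-bit bytes, EOF defined as -1, and out-of-range conversions that wrap (the conversion is implementation-defined).

```c
#include <stdio.h>

/*
 * Hypothetical helper: what the loop test sees when the value returned by
 * fgetc() is first stored in a signed char, as in the buggy version.
 */
int stored_in_signed_char_is_eof(int fgetc_result)
{
	signed char c = (signed char)fgetc_result;	/* 0xff typically becomes -1 */
	return c == EOF;
}

/*
 * Hypothetical helper: the same test when the value is kept in an int,
 * as in the corrected version.
 */
int stored_in_int_is_eof(int fgetc_result)
{
	int c = fgetc_result;	/* 0xff stays 255, distinct from EOF */
	return c == EOF;
}
```

Under those assumptions, stored_in_signed_char_is_eof(0xff) is true even though 0xff is real data, while stored_in_int_is_eof(0xff) is false and only EOF itself ends the loop. Where plain char is unsigned, the opposite failure occurs: the stored value is never negative, so the comparison with EOF can never succeed and the loop does not stop at end of file.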

Copyright © 2020 Jakob Kaivo <jakob@kaivo.net>