The size of a file

It's common to want to know how big a file is. You might want to present that information to a user, or use it to calculate a buffer size, or any of many other valid uses. Unfortunately, there seems to be an idea that this is simple in standard, portable C. The solution usually looks something like this:

#include <stdio.h>

long filesize(const char *path)
{
        FILE *f = fopen(path, "rb");
        if (f == NULL) {
                return -1;
        }
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fclose(f);
        return size;
}

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        printf("%s is %ld bytes\n", argv[1], filesize(argv[1]));
}

The really unfortunate thing is that on the surface, and on casual investigation, this sort of thing actually works:

$ ls -l
total 24
-rwxrwxr-x 1 jkk jkk 16968 Jul  1 13:18 filesize*
-rw-rw-r-- 1 jkk jkk   373 Jul  1 13:18 filesize.c
$ ./filesize filesize
filesize is 16968 bytes

Here we have our little program agreeing exactly with ls. But there is at least one subtle problem with this method. The ftell() function returns a long int, and a signed one at that.

That causes this program to fail, and fail spectacularly, when given a file that is larger than LONG_MAX. This may not seem like a big deal in our increasingly 64-bit world, but there is at least one 64-bit platform in use that defines long as 32 bits wide and definitely supports files greater than 2147483647 bytes in size. It's not even an obscure platform: it's Windows, for reasons better explained by Raymond Chen. So let's take a look at this little program on Windows:

C:\Users\JakobKaivo\source\filesize>dir
 Volume in drive C has no label.
 Volume Serial Number is D2DF-2996

 Directory of C:\Users\JakobKaivo\source\filesize

07/01/2020  01:25 PM    <DIR>          .
07/01/2020  01:25 PM    <DIR>          ..
07/01/2020  01:24 PM               373 filesize.c
07/01/2020  01:25 PM           112,128 filesize.exe
07/01/2020  01:25 PM             2,015 filesize.obj
               3 File(s)        114,516 bytes
               2 Dir(s)  508,801,982,464 bytes free

C:\Users\JakobKaivo\source\filesize>filesize filesize.exe
filesize.exe is 112128 bytes

This, again at first blush, seems OK. But what if we need to check the size of a bigger file? Like the Windows installer ISO?

C:\Users\JakobKaivo\Downloads>dir
 Volume in drive C has no label.
 Volume Serial Number is D2DF-2996

 Directory of C:\Users\JakobKaivo\Downloads

07/01/2020  10:40 AM    <DIR>          .
07/01/2020  10:40 AM    <DIR>          ..
07/01/2020  10:40 AM     5,650,477,056 en_windows_10_consumer_editions_version_2004_x64_dvd_8d28c5d7.iso
               1 File(s)  5,650,477,056 bytes
               2 Dir(s)  501,896,671,232 bytes free

C:\Users\JakobKaivo\Downloads>c:\Users\JakobKaivo\source\filesize\filesize en_windows_10_consumer_editions_version_2004_x64_dvd_8d28c5d7.iso
en_windows_10_consumer_editions_version_2004_x64_dvd_8d28c5d7.iso is -1 bytes

That's definitely not right.
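The -1 in that output is not a size at all. The C standard specifies that ftell() returns -1L on failure, and the naive version passes that value straight through to printf(). A minimal sketch of a version that at least makes the failure visible might look like the following. The name filesize_checked is hypothetical, and the errno value reported for an oversized file is implementation-defined, so treat this as an illustration rather than a fix:

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical variant of filesize() that checks every step.
 * Returns the size in bytes, or -1 on failure.
 * It still cannot report sizes above LONG_MAX.
 */
long filesize_checked(const char *path)
{
        FILE *f = fopen(path, "rb");
        if (f == NULL) {
                return -1;
        }

        long size = -1;
        if (fseek(f, 0, SEEK_END) == 0) {
                size = ftell(f);        /* -1L on failure, with errno set */
        }
        int saved_errno = errno;        /* fclose() may change errno */
        fclose(f);
        errno = saved_errno;

        return size;
}
```

A caller can then test for a negative result and report the error, for example with perror(), instead of printing "-1 bytes". This only makes the failure honest; it does nothing to recover the actual size.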

For this particular problem, the truth is that there is no effective portable means of determining a file's size. In theory, looping through the entire file with fgetc() or fread() might work, but then your file size code is O(n), which is not acceptable for the use cases where the naive implementation fails, because n is already known to be large. You really need to use platform-specific functions to get file sizes accurately. That means stat() on POSIX systems, and GetFileSizeEx() on Windows. Something like:

#ifdef _WIN32
#include <windows.h>
#else
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#endif

#include <stdio.h>
#include <stdint.h>

intmax_t filesize(const char *path)
{
#ifdef _WIN32
        /* forcing use of the non-Unicode API for expository purposes only */
        HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (f == INVALID_HANDLE_VALUE) {
                return -1;
        }

        LARGE_INTEGER size = { 0 };
        if (GetFileSizeEx(f, &size) == 0) {
                size.QuadPart = -1;
        }

        CloseHandle(f);

        return size.QuadPart;
#else
        struct stat st = { 0 };
        if (stat(path, &st) != 0) {
                return -1;
        }
        return st.st_size;
#endif
}

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        printf("%s is %jd bytes\n", argv[1], filesize(argv[1]));
}

This yields the expected, correct results on both platforms (constrained, of course, by INTMAX_MAX, but intmax_t is the biggest integer type we can rely on). Note that both platforms define file sizes in terms of signed integers, so we reserve -1 to represent failure in either case.
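
For completeness, the portable-but-slow approach mentioned earlier, reading the whole file and counting the bytes, can be sketched as follows. The function name filesize_by_reading and the 4 KiB buffer size are arbitrary choices made for illustration, and the cost is O(n) in the size of the file, so this is a fallback of last resort rather than a recommendation:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical fallback: count bytes by reading the file to the end.
 * Uses only standard C. Returns -1 on failure.
 */
intmax_t filesize_by_reading(const char *path)
{
        FILE *f = fopen(path, "rb");
        if (f == NULL) {
                return -1;
        }

        unsigned char buf[4096];
        intmax_t total = 0;
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
                total += (intmax_t)n;
        }
        if (ferror(f)) {
                total = -1;     /* stopped on a read error, not end of file */
        }

        fclose(f);
        return total;
}
```

Because the count is accumulated in an intmax_t rather than a long, this does not hit the LONG_MAX limit, but every byte of the file has to be read to get there.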

Copyright © 2020 Jakob Kaivo <jakob@kaivo.net>