Checksum basics

When data is sent over a network, there is always a risk of corruption. It can be caused by various factors such as network congestion, hardware failure, or even malicious attacks. Ensuring data integrity is therefore crucial, especially when dealing with sensitive data. One of the simplest ways to do this is with a checksum: a value that is calculated from a data set and used to verify the integrity of that data.

The idea is that the client calculates the checksum of the file before uploading it to the server. The server then calculates the checksum of the file after receiving it and compares it with the checksum sent by the client. If the checksums match, the file is considered intact, meaning it was not corrupted during the transfer.
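The comparison described above can be sketched in a few lines. This is only an illustration: the helper names checksum and intact are made up for this example, and MD5 is used because it is the algorithm this lesson adopts in the next section.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
)

// checksum returns the hex-encoded MD5 digest of data.
// (Hypothetical helper, named for this example only.)
func checksum(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

// intact reports whether the received bytes hash to the checksum
// that the sender computed before the transfer.
func intact(received []byte, sentChecksum string) bool {
	return checksum(received) == sentChecksum
}
```

If even a single byte of the received data differs from what was sent, intact returns false.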

Calculating the checksum

Let’s start with how the client can calculate the checksum of the file before uploading it to the server.

In this lesson, we will use the MD5 algorithm. MD5 is a widely used hash function that produces a 128-bit hash value and is commonly used to check the integrity of files. Keep in mind that MD5 is no longer considered secure against deliberately crafted collisions, so it is suitable for detecting accidental corruption but not for defending against an attacker. You can also use other algorithms such as SHA-1 or SHA-256.
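To illustrate how little changes when you pick a different algorithm, here is a minimal sketch. The helper names hashHex and sha256OfString are assumptions made for this example; the point is that every hash constructor in the standard library (md5.New, sha256.New, and so on) returns a hash.Hash, so the surrounding code can stay the same.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"hash"
	"io"
	"strings"
)

// hashHex streams r through the hash created by newHash and returns
// the digest as a hexadecimal string. (Hypothetical helper.)
func hashHex(newHash func() hash.Hash, r io.Reader) (string, error) {
	h := newHash()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sha256OfString is a small usage example: it hashes a string with SHA-256.
func sha256OfString(s string) (string, error) {
	return hashHex(sha256.New, strings.NewReader(s))
}
```

Passing md5.New instead of sha256.New would produce the MD5 digest used in the rest of this lesson.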

Let’s start with importing the necessary packages:

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

Next, as usual, open the file that you want to upload to get a file handle. You can use os.Open to open the file and defer its Close method so that the file is closed when the function returns.

file, err := os.Open(filePath)
if err != nil {
    return "", err
}
defer file.Close()

Now we calculate the MD5 checksum of the file. To create a new MD5 hash object, use md5.New(). Here is the interesting bit: the hash object implements the io.Writer interface, which means we can write data to it using the Write method. Thus, copying the content of the file to the hash object is as simple as calling io.Copy, as you have seen in an earlier example. Finally, we obtain the checksum by calling hash.Sum(nil) and convert it to a hexadecimal string using fmt.Sprintf.

hash := md5.New()
if _, err := io.Copy(hash, file); err != nil {
    // handle error
}
hashVal := fmt.Sprintf("%x", hash.Sum(nil))

Sending the checksum

Since the body of the HTTP request is fully used to send the content of the file, you won’t be able to send the checksum as part of the body. Instead, you can send the checksum as part of the HTTP header.

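// Hashing read the file to the end, so rewind it before using it as the body.
if _, err := file.Seek(0, io.SeekStart); err != nil {
    // handle error
}
req, err := http.NewRequest("POST", "http://localhost:8080/api/v1/binary", file)
// handle error
req.Header.Set("X-Checksum", hashVal)
// send the request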

HTTP does define headers for carrying a digest, such as Content-Digest (RFC 9530) and the now-obsolete Content-MD5, but to keep things simple this example uses a custom header. You can use any header name you like; here we use X-Checksum.

Verifying the checksum

On the server side, you can read the checksum from the HTTP header and calculate the checksum of the file that you have received. You can then compare the checksums to verify the integrity of the file.

As you have seen above, to calculate the checksum of the file, you create a new MD5 hash object and copy the content of the file to it. On the server, the content of the uploaded file comes from r.Body, so you copy r.Body to the hash object and calculate the checksum. Once you have the checksum, you can compare it with the checksum sent by the client and return an error if they don’t match.

func Upload() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        defer r.Body.Close()
        checksum := r.Header.Get("X-Checksum")

        hash := md5.New()
        if _, err := io.Copy(hash, r.Body); err != nil {
            // handle error
        }
        hashVal := fmt.Sprintf("%x", hash.Sum(nil))

        if hashVal != checksum {
            w.WriteHeader(http.StatusBadRequest)
            return
        }
        // continue with the rest of the file handler
    }
}

Some of you might have noticed that something is off with the code above. It reads the content of r.Body to calculate the checksum. However, r.Body is an io.ReadCloser, which is a stream: once you have read the content, it is gone, and you can’t read it again. This means that you can’t read the body to calculate the checksum and then read the same body again to write it to the file.
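You can observe this behavior without a server. In the sketch below, demoBody is a hypothetical stand-in for r.Body (an io.ReadCloser over a string), and readTwice simply reads the same stream two times.

```go
package main

import (
	"io"
	"strings"
)

// demoBody stands in for r.Body: an io.ReadCloser over some content.
// (Hypothetical helper for this example.)
func demoBody(content string) io.ReadCloser {
	return io.NopCloser(strings.NewReader(content))
}

// readTwice reads the same body two times and returns both results.
func readTwice(body io.Reader) (first, second []byte, err error) {
	if first, err = io.ReadAll(body); err != nil {
		return nil, nil, err
	}
	// The stream is already at its end, so this read yields no data.
	second, err = io.ReadAll(body)
	return first, second, err
}
```

The first read returns the full content, while the second read returns an empty slice.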

To fix this naively, you can read the content of r.Body into memory and create two bytes.Buffer values from it: one to feed the hash object and one for the main processing task. This way, the content of the body can be read more than once.

func Upload() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        defer r.Body.Close()
        checksum := r.Header.Get("X-Checksum")

        buf, err := io.ReadAll(r.Body)
        if err != nil {
            // handle error
            return
        }
        rd1 := io.NopCloser(bytes.NewBuffer(buf))
        rd2 := io.NopCloser(bytes.NewBuffer(buf))
        defer rd1.Close()
        defer rd2.Close()
        
        hash := md5.New()
        if _, err := io.Copy(hash, rd1); err != nil {
            // handle error
        }
        hashVal := fmt.Sprintf("%x", hash.Sum(nil))

        if hashVal != checksum {
            w.WriteHeader(http.StatusBadRequest)
            return
        }

        // continue with the rest of the file handler
        target, err := os.OpenFile(....)
        if _, err := io.Copy(target, rd2); err != nil {
            // handle error
        }
    }
}

The majority of the code above is similar to what we had earlier. However, there is a slight difference in how we use the buffer. Let me explain the following code:

buf, err := io.ReadAll(r.Body)
if err != nil {
    // handle error
    return
}
rd1 := io.NopCloser(bytes.NewBuffer(buf))
rd2 := io.NopCloser(bytes.NewBuffer(buf))
defer rd1.Close()
defer rd2.Close()

The code above reads the content of r.Body into the buf variable. We then create two new bytes.Buffer values from buf and wrap them as rd1 and rd2. io.NopCloser turns each bytes.Buffer into an io.ReadCloser, the same type as r.Body, so that each one can be used as a data source for io.Copy: one for the checksum calculation and one for the data processing. Strictly speaking, io.Copy only needs an io.Reader, so the wrapping is optional here.

But wait! Do you see the problem with the code above? It reads the entire content of r.Body into memory, which means that the server (again) will consume a lot of memory if the file is large. The two bytes.Buffer values wrap the same byte slice rather than copying it, but that slice still holds the whole upload. If a 100MB file is uploaded, the server needs at least 100MB of memory for that single request, and io.ReadAll may temporarily allocate more while it grows its buffer.

Surely, this can’t be the most efficient way to handle checksums.

In the next lesson, we will learn how to calculate the checksum without reading the entire content of the file into memory, using one of the techniques available in the Go standard library. Stay tuned!