Chonker - Fast S3 file downloads

Chonker speeds up large file downloads from S3 by fetching files in chunks using HTTP range requests.

wot it be

DDoS-ing your upstream file server is generally not a good idea, but AWS S3’s horizontal scaling design essentially mandates that you use parallel network requests to fully saturate available bandwidth. To that end, the AWS CLI itself makes ten parallel requests to S3 out of the box. Chonker brings that same experience to your Go programs.

Chonker is a rewrite of a now non-existent project called ranger, written by the venerable @sudhirj. Ranger no longer exists for boring $job reasons, but the original code lives on in Chonker’s git history.

The standard Go HTTP request-response API lives on in Chonker: a single HTTP request yields a single HTTP response object. Clients read this response’s body blissfully unaware that Chonker is fetching chunks and piping them to the io.Reader in order.

vroooom

Go made creating Chonker an absolute delight. An io.Reader is connected to an io.Writer by an io.Pipe. Chunk download goroutines fetch chunks quickly and potentially out of order. Writes to the pipe happen in order, ensuring that clients see the right file. Faking an http.Response for the entire file instead of each chunk is straightforward as well.
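Here is a minimal sketch of that pipe trick. It is not Chonker's actual code: fetchChunk is a hypothetical stand-in for an HTTP range request that slices a string instead of hitting the network, and orderedReader and readAllOrdered are made-up names for illustration. Every chunk gets its own goroutine and its own buffered channel, so fetches can finish in any order while a single writer drains the channels in sequence.

```go
package main

import (
	"fmt"
	"io"
)

// fetchChunk is a hypothetical stand-in for an HTTP range request.
// It returns bytes [start, end) of src.
func fetchChunk(src string, start, end int) []byte {
	return []byte(src[start:end])
}

// orderedReader fetches src in chunkSize pieces concurrently and returns
// a reader that yields the bytes in their original order.
// This sketch starts one goroutine per chunk; a real downloader would
// cap the number of workers.
func orderedReader(src string, chunkSize int) io.Reader {
	pr, pw := io.Pipe()
	n := (len(src) + chunkSize - 1) / chunkSize

	// One buffered channel per chunk, so a fetcher never blocks
	// no matter when it finishes.
	results := make([]chan []byte, n)
	for i := range results {
		results[i] = make(chan []byte, 1)
		go func(i int) {
			start := i * chunkSize
			end := start + chunkSize
			if end > len(src) {
				end = len(src)
			}
			results[i] <- fetchChunk(src, start, end)
		}(i)
	}

	// A single writer drains the channels in index order, which is what
	// keeps the bytes on the pipe in file order.
	go func() {
		for _, ch := range results {
			if _, err := pw.Write(<-ch); err != nil {
				pw.CloseWithError(err)
				return
			}
		}
		pw.Close()
	}()

	return pr
}

// readAllOrdered reads the whole reassembled stream into a string.
func readAllOrdered(src string, chunkSize int) (string, error) {
	body, err := io.ReadAll(orderedReader(src, chunkSize))
	return string(body), err
}

func main() {
	out, err := readAllOrdered("the quick brown fox jumps over the lazy dog", 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The reader side only ever sees a plain io.Reader, which is why it can be handed to a client as a response body without the client knowing about chunks at all.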

how to use

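package main

import (
    "fmt"
    "io"
    "log"
    "net/http"

    "github.com/ananthb/chonker"
)

func main() {
    // Create a new Chonker client
    client := chonker.NewClient()

    // Create a new HTTP request
    req, err := http.NewRequest("GET", "https://example.com/file", nil)
    if err != nil {
        log.Fatal(err)
    }

    // Fetch the file; Chonker downloads it in chunks behind the scenes
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Read the file
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body))
}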

Chonker ends up being a drop-in replacement in any code that expects an http.Client. If you’re not pointing your clients at S3/CloudFront, make sure that your backend server is okay with clients opening more than one connection to fetch the same file.

AWS CloudFront Weirdness

AWS CloudFront may be at odds with IETF RFC 7231 Section 4.3.2, which states that servers should respond to HTTP HEAD requests as if they were GET requests. The server just has to omit the response body in this case.

If you have a file in CloudFront larger than 25 GiB, then GET requests unadorned with a Range header will fail with a 400 error. This has the side effect of also making HEAD requests fail. Chonker works around this by always sending a Range header with a 0-0 range for the first discovery request, but this leaves a strange feeling in my gut. I want my HEAD requests, dammit!
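For the curious, a 0-0 discovery request can be sketched roughly like this. It is an illustration under assumptions, not Chonker's actual implementation: discoverSize and parseTotalSize are made-up names, and a local httptest server stands in for S3/CloudFront. The idea is to ask for only the first byte and read the total file size from the Content-Range header of the 206 response.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// parseTotalSize extracts the total length from a Content-Range value
// such as "bytes 0-0/4096". A total of "*" means the size is unknown.
func parseTotalSize(contentRange string) (int64, error) {
	_, total, ok := strings.Cut(contentRange, "/")
	if !ok || total == "*" {
		return 0, fmt.Errorf("no total size in Content-Range %q", contentRange)
	}
	return strconv.ParseInt(total, 10, 64)
}

// discoverSize requests only the first byte of url and returns the total
// size reported by the server. It fails if the server ignores the Range
// header and answers with anything other than 206 Partial Content.
func discoverSize(client *http.Client, url string) (int64, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Range", "bytes=0-0")

	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusPartialContent {
		return 0, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return parseTotalSize(resp.Header.Get("Content-Range"))
}

func main() {
	// A local test server stands in for S3/CloudFront here.
	// http.ServeContent handles Range headers for us.
	payload := strings.Repeat("x", 4096)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "file.bin", time.Time{}, strings.NewReader(payload))
	}))
	defer srv.Close()

	size, err := discoverSize(srv.Client(), srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("total size:", size)
}
```

Once the total size is known, splitting it into chunk ranges is simple arithmetic, and every later request carries its own Range header anyway.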