
I have a use case where a large remote file needs to be downloaded in parts, using multiple threads. Each thread must run simultaneously (in parallel), grabbing a specific part of the file. The expectation is to combine the parts into a single (original) file once all parts have been successfully downloaded.

Perhaps the requests library could do the job, but then I am not sure how I would multithread this into a solution that combines the chunks.

import requests

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = requests.get(url, headers=headers)

I was also thinking of using curl, with Python orchestrating the downloads, but I am not sure that's the correct way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this, which Python would presumably invoke once per part (see the sketch after the command):

curl --range 200000000-399999999 -o file.iso.part2
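Roughly, I imagine the Python side would look something like the sketch below. This is just an illustration: the URL, part count, and byte ranges are placeholders, and it assumes curl is available on the PATH.

import subprocess
from concurrent.futures import ThreadPoolExecutor

URL = 'https://url.com/file.iso'
# Placeholder inclusive byte ranges; in practice they'd be computed from the
# file size reported by a HEAD request.
RANGES = ['0-199999999', '200000000-399999999']

def fetch_part(i, byte_range):
    # Each worker shells out to curl for one part of the file.
    subprocess.run(
        ['curl', '--range', byte_range, '-o', f'file.iso.part{i}', URL],
        check=True,
    )

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fetch_part, i, r) for i, r in enumerate(RANGES)]
for f in futures:
    f.result()  # re-raises if any curl invocation failed
# The .part files would still have to be concatenated in order afterwards.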

Can someone explain how you'd go about something like this, or post a code example of something that works in Python 3? I usually find Python-related answers quite easily, but the solution to this problem seems to be eluding me.

jjj
  • What about [this answer](https://stackoverflow.com/questions/13973188/requests-with-multiple-connections)? – bug Oct 26 '19 at 13:58
  • That seems to be Python 2 related and wouldn't work in Python 3 – jjj Oct 26 '19 at 14:01

2 Answers


Here is a version using Python 3 with asyncio. It's just an example and can be improved, but you should be able to get everything you need from it.

  • get_size: sends a HEAD request to get the size of the file
  • download_range: downloads a single chunk
  • download: downloads all the chunks and merges them

import asyncio
import concurrent.futures
import requests
import os


URL = 'https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_1920_18MG.mp4'
OUTPUT = 'video.mp4'


async def get_size(url):
    # A HEAD request fetches only the headers, including the total file size.
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size


def download_range(url, start, end, output):
    # HTTP Range headers are inclusive on both ends.
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)

    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)


async def download(executor, url, output, chunk_size=1000000):
    loop = asyncio.get_event_loop()

    file_size = await get_size(url)
    chunks = range(0, file_size, chunk_size)

    # Run the blocking downloads in the thread pool, one task per chunk.
    tasks = [
        loop.run_in_executor(
            executor,
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]

    await asyncio.wait(tasks)

    # Concatenate the parts in order, deleting each one after it is merged.
    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'

            with open(chunk_path, 'rb') as s:
                o.write(s.read())

            os.remove(chunk_path)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.get_event_loop()

    try:
        loop.run_until_complete(
            download(executor, URL, OUTPUT)
        )
    finally:
        loop.close()
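If you want to sanity-check the result, one cheap check (assuming the server reports an accurate Content-Length) is to compare the merged file's size on disk against the size the server reports; URL and OUTPUT here are the constants from the script above:

import os
import requests

expected = int(requests.head(URL).headers['Content-Length'])
actual = os.path.getsize(OUTPUT)
print(actual == expected, actual, expected)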
bug

You could use grequests to download in parallel.

import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB
HEADERS = []

# HTTP ranges are inclusive on both ends, so each part must stop one byte
# before the next part starts, otherwise the boundary bytes get duplicated.
for x in range(4):  # file size is > 300 MB, so we download in 4 parts
    _start = x * CHUNK_SIZE
    _stop = (x + 1) * CHUNK_SIZE - 1
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop)})


rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)

with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'wb') as f:  # 'wb', not 'ab', so reruns don't append
    for download in downloads:
        print(download.status_code)
        f.write(download.content)

PS: I did not check whether the ranges are correctly determined or whether the downloaded file's md5sum matches! This should just show, in general, how it could work.
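If you do want to verify the download, a quick md5 check with the standard library's hashlib could look like this; compare the digest against the checksum Debian publishes alongside the image:

import hashlib

md5 = hashlib.md5()
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'rb') as f:
    # Hash in 1 MB blocks to avoid loading the whole ISO into memory.
    for block in iter(lambda: f.read(1 << 20), b''):
        md5.update(block)
print(md5.hexdigest())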

Maurice Meyer
  • This is exactly what I needed. BTW, this is great, but if you have a second to amend the code to show the progress of each of the downloading parts, that'd be awesome. – jjj Oct 27 '19 at 09:51
  • You could try this: https://stackoverflow.com/questions/33703730/adding-progress-feedback-in-grequests-task – Maurice Meyer Oct 27 '19 at 10:06
  • An issue I found with this script is that the combined file doesn't match the byte size of the original. For the file you've shown (the iso), the total size = 351272960 bytes, but the downloaded file is 3 bytes longer: 351272963 bytes. – jjj Oct 27 '19 at 13:58