
I have a use case where a large remote file needs to be downloaded in parts, using multiple threads. Each thread must run simultaneously (in parallel), grabbing a specific part of the file. The expectation is to combine the parts into a single (original) file once all parts have been successfully downloaded.

Perhaps the requests library could do the job, but then I am not sure how I would multithread this into a solution that combines the chunks.

import requests

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = requests.get(url, headers=headers)

I was also thinking of using curl, with Python orchestrating the downloads, but I am not sure that's the correct way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this, which Python would presumably invoke once per part (see the sketch after the command):

curl --range 200000000-399999999 -o file.iso.part2
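Roughly, I imagine the Python side would look something like the sketch below. This is just an illustration: the URL, part count, and byte ranges are placeholders, and it assumes curl is available on the PATH.

import subprocess
from concurrent.futures import ThreadPoolExecutor

URL = 'https://url.com/file.iso'
# Placeholder inclusive byte ranges; in practice they'd be computed from the
# file size reported by a HEAD request.
RANGES = ['0-199999999', '200000000-399999999']

def fetch_part(i, byte_range):
    # Each worker shells out to curl for one part of the file.
    subprocess.run(
        ['curl', '--range', byte_range, '-o', f'file.iso.part{i}', URL],
        check=True,
    )

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fetch_part, i, r) for i, r in enumerate(RANGES)]
for f in futures:
    f.result()  # re-raises if any curl invocation failed
# The .part files would still have to be concatenated in order afterwards.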

Can someone explain how you'd go about something like this, or post a code example of something that works in Python 3? I usually find Python-related answers quite easily, but the solution to this problem seems to be eluding me.

jjj
  • What about [this answer](https://stackoverflow.com/questions/13973188/requests-with-multiple-connections)? – bug Oct 26 '19 at 13:58
  • That seems to be Python 2 related and wouldn't work in Python 3 – jjj Oct 26 '19 at 14:01

2 Answers


Here is a version using Python 3 with asyncio. It's just an example and can be improved, but you should be able to get everything you need from it.

  • get_size: sends a HEAD request to get the size of the file
  • download_range: downloads a single chunk
  • download: downloads all the chunks and merges them

import asyncio
import concurrent.futures
import requests
import os


URL = 'https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_1920_18MG.mp4'
OUTPUT = 'video.mp4'


async def get_size(url):
    # A HEAD request fetches only the headers, including the total file size.
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size


def download_range(url, start, end, output):
    # HTTP Range headers are inclusive on both ends.
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)

    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)


async def download(executor, url, output, chunk_size=1000000):
    loop = asyncio.get_event_loop()

    file_size = await get_size(url)
    chunks = range(0, file_size, chunk_size)

    # Run the blocking downloads in the thread pool, one task per chunk.
    tasks = [
        loop.run_in_executor(
            executor,
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]

    await asyncio.wait(tasks)

    # Concatenate the parts in order, deleting each one after it is merged.
    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'

            with open(chunk_path, 'rb') as s:
                o.write(s.read())

            os.remove(chunk_path)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.get_event_loop()

    try:
        loop.run_until_complete(
            download(executor, URL, OUTPUT)
        )
    finally:
        loop.close()
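If you want to sanity-check the result, one cheap check (assuming the server reports an accurate Content-Length) is to compare the merged file's size on disk against the size the server reports; URL and OUTPUT here are the constants from the script above:

import os
import requests

expected = int(requests.head(URL).headers['Content-Length'])
actual = os.path.getsize(OUTPUT)
print(actual == expected, actual, expected)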
bug

You could use grequests to download in parallel.

import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB
HEADERS = []

# HTTP ranges are inclusive on both ends, so each part must stop one byte
# before the next part starts, otherwise the boundary bytes get duplicated.
for x in range(4):  # file size is > 300 MB, so we download in 4 parts
    _start = x * CHUNK_SIZE
    _stop = (x + 1) * CHUNK_SIZE - 1
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop)})


rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)

with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'wb') as f:  # 'wb', not 'ab', so reruns don't append
    for download in downloads:
        print(download.status_code)
        f.write(download.content)

PS: I did not check whether the ranges are correctly determined or whether the downloaded file's md5sum matches! This should just show, in general, how it could work.
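If you do want to verify the download, a quick md5 check with the standard library's hashlib could look like this; compare the digest against the checksum Debian publishes alongside the image:

import hashlib

md5 = hashlib.md5()
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'rb') as f:
    # Hash in 1 MB blocks to avoid loading the whole ISO into memory.
    for block in iter(lambda: f.read(1 << 20), b''):
        md5.update(block)
print(md5.hexdigest())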

Maurice Meyer
  • This is exactly what I needed. BTW, this is great, but if you have a second to amend the code to show the progress of each of the downloading parts, that'd be awesome. – jjj Oct 27 '19 at 09:51
  • You could try this: https://stackoverflow.com/questions/33703730/adding-progress-feedback-in-grequests-task – Maurice Meyer Oct 27 '19 at 10:06
  • An issue I found with this script is that the combined file doesn't match the byte size of the original. For the file you've shown (the iso), the total size = 351272960 bytes, but the downloaded file is 3 bytes longer: 351272963 bytes. – jjj Oct 27 '19 at 13:58