TIL: Stopping Requests Mid Flight
Today I learned that you can stop an HTTP request made with the requests library mid flight. Why would you want to do that? I needed it because my server was making HTTP requests to user-provided URLs, which put the user in complete control of what the server was requesting. A user could provide a URL to an external server that returns an extremely large response, and with enough of those requests going at once they could perform a DoS attack on my server.
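The key is to stream the response and enforce a timeout while reading the response body. The timeout argument in requests only limits how long to wait for the connection and between bytes received, not the total download time, so the overall limit has to be checked by hand.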
import time

import requests

SCRAPER_TIMEOUT = 15  # seconds


class ForceTimeoutException(Exception):
    pass


def safe_scrape_html(url: str) -> str:
    """
    Scrapes the html from a url but will cancel the request
    if the request takes longer than 15 seconds. This is used to mitigate
    DoS attacks from users providing a url with arbitrarily large content.
    """
    # stream=True defers downloading the body until we iterate over it;
    # the with block closes the connection even if we bail out early
    with requests.get(url, timeout=SCRAPER_TIMEOUT, stream=True) as resp:
        html_bytes = b""
        start_time = time.time()

        for chunk in resp.iter_content(chunk_size=1024):
            html_bytes += chunk

            # stop reading once the total elapsed time passes the limit
            if time.time() - start_time > SCRAPER_TIMEOUT:
                raise ForceTimeoutException()

    return html_bytes.decode("utf-8")
In this example we read the response in chunks of up to 1024 bytes and append each one to the html_bytes variable. On each iteration we check whether the timeout has been exceeded. If it has, we raise an exception and stop reading. If it hasn’t, we carry on with the next chunk.
Depending on your needs, it’s also possible to make a small adjustment to the code above to use a max bytes size instead of a timeout.
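Here is a rough sketch of what that size-based variant could look like. The names read_capped, MAX_BYTES and ResponseTooLargeException are made up for illustration, and the 1 MB cap is an arbitrary assumption, not a recommendation. The helper takes any iterable of byte chunks so the logic can be tried without a network call.

```python
from typing import Iterable

MAX_BYTES = 1_000_000  # hypothetical cap (about 1 MB); tune for your use case


class ResponseTooLargeException(Exception):
    pass


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> bytes:
    """Join chunks, raising as soon as the running total exceeds max_bytes."""
    parts = []
    total = 0

    for chunk in chunks:
        total += len(chunk)

        # bail out before storing the chunk that pushes us over the cap
        if total > max_bytes:
            raise ResponseTooLargeException()

        parts.append(chunk)

    return b"".join(parts)
```

With a streamed response you would pass resp.iter_content(chunk_size=1024) as the chunks argument. Nothing stops you from keeping the elapsed-time check in the same loop if you want both limits.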