URLparse defines a standard interface to break URL strings up in components (addressing scheme, network location, path etc.). E.g:
from urlparse import urlparse, urlunparse
urlparse("https://www.example.com/connect/login")
ParseResult(scheme='https', netloc='www.example.com', path='/connect/login', params='', query='', fragment='')
urlparse.urlunparse()
function is supposed to reconstruct the components of a URL returned by the urlparse()
function back to form of the original URL. However, URLs do not survive the round-trip through urlunparse(urlparse(url))
. Python sees ////example.com
as a URL with no hostname or scheme therefore considers it a path //example.com
.
output = urlparse("////evil.com")
print(output)
ParseResult(scheme='', netloc='', path='//evil.com', params='', query='', fragment='')
As you see, the two slashes are removed and it is marked as a relative-path URL but the issue arises when we reconstruct the URL using urlunparse()
function - the URL is treated as an absolute path.
output = urlunparse(output) # from previous output
print(output)
"//evil.com"
urlparse(output) # let's deconstruct the new output
ParseResult(scheme='', netloc='evil.com', path='', params='', query='', fragment='')
We went from a path to a hostname as you can see above after reconstructing the input. Depending on its usage across the code, this inconsistency in parsing URLs could result in a security issue which was indeed demonstrated on Reddit suffering from an open redirect vulnerability caused by this unexpected behavior.
https://github.com/reddit/reddit/commit/689a9554e60c6e403528b5d62a072ac4a072a20e
I reported this issue to Python but it wasn’t resolved and the ticket remains open since then
https://bugs.python.org/issue23505