Don't treat a string starting with http as an image link if there is a space in it
Just because a string starts with http doesn't mean it's a URL. Added a space check so that text chunks that merely start with http are not treated as links (this happened with a research paper)
took me two fucking hours to understand what was going on
It's better to use urlparse to check if the text is a http/https URL with a valid location:
>>> from urllib.parse import urlparse
>>> u = urlparse("https://xyz.com")
>>> u
ParseResult(scheme='https', netloc='xyz.com', path='', params='', query='', fragment='')
>>> u.scheme in ('http', 'https') and len(u.netloc) > 0
True
from urllib.parse import urlparse
u = urlparse("https://mypaper.com/paperpage.pdf\nnew string\nlol")
print(u.scheme in ('http', 'https') and len(u.netloc) > 0)
This returns True, so I'm not sure how that check is supposed to filter out text that merely starts with http
Thanks for the example input. I think I misunderstood what you were trying to achieve. The objective was to increase the likelihood that only valid URLs are passed on to the function.
You can likely filter for valid URLs via regex (example), or just wrap the block in a try-except to handle strings that pass the check but are still invalid. Also set a timeout on the request function for more predictable runtimes.
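For what it's worth, here is a rough sketch of how the whitespace check from this PR, the `urlparse` check, and the try-except plus timeout could fit together. The helper names `looks_like_http_url` and `fetch_bytes` and the 10 second default are made up for illustration and are not taken from the codebase; it uses only the standard library, so adapt it to whatever request function the project actually calls.

```python
import urllib.request
from urllib.parse import urlparse


def looks_like_http_url(text: str) -> bool:
    """Hypothetical helper: True only if text is a single http(s) URL."""
    # Reject any whitespace (space, newline, tab). urlparse on its own
    # accepts such input, as the multi-line example above shows.
    if any(ch.isspace() for ch in text):
        return False
    try:
        u = urlparse(text)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs.
        return False
    return u.scheme in ("http", "https") and len(u.netloc) > 0


def fetch_bytes(url: str, timeout: float = 10.0):
    """Hypothetical helper: return the response body, or None on failure."""
    if not looks_like_http_url(url):
        return None
    try:
        # The timeout keeps a slow or dead host from blocking indefinitely.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

With this, `looks_like_http_url("https://xyz.com")` passes, while the multi-line research paper string is rejected before any request is made.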