Don't treat a string starting by http as an image link if there is a space in it

#74

by LPN64 - opened Sep 12

base: refs/heads/main ← from: refs/pr/74

LPN64

Sep 12

Just because a string starts with http doesn't mean it's a URL. I added a space check so that text chunks that merely start with http are no longer treated as image links (this happened with a research paper).

Don't treat a string starting by http as an image link if there is a space in it 0a1005ce
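For readers without the diff open, here is a minimal sketch of the kind of check described above. It is not the actual change from the commit: the helper name `looks_like_image_url` is hypothetical, and the sketch assumes the original code only tested whether the text starts with http.

```python
def looks_like_image_url(text: str) -> bool:
    """Hypothetical helper, not the code from this PR.

    Treat the text as a possible image link only if it starts with
    "http" and contains no space character.
    """
    return text.startswith("http") and " " not in text


# A real link passes, a sentence that merely starts with "http" does not.
print(looks_like_image_url("https://example.com/figure.png"))
print(looks_like_image_url("http requests are described in section 2"))
```

Note that this only looks for a literal space, so text containing newlines or tabs but no spaces would still pass.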

LPN64

Sep 12

took me two fucking hours to understand what was going on

bocytko

Sep 20

It's better to use urlparse to check if the text is a http/https URL with a valid location:

>>> from urllib.parse import urlparse
>>> u = urlparse("https://xyz.com")
>>> u
ParseResult(scheme='https', netloc='xyz.com', path='', params='', query='', fragment='')
>>> u.scheme in ('http', 'https') and len(u.netloc) > 0
True

LPN64

Sep 22

from urllib.parse import urlparse
u = urlparse("https://mypaper.com/paperpage.pdf\nnew string\nlol")
print(u.scheme in ('http', 'https') and len(u.netloc) > 0)

This returns True, so I'm not sure how urlparse helps filter out text that merely starts with http.
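One way to reconcile the two approaches, as a sketch only: reject any text containing whitespace first, then apply the urlparse check on scheme and netloc. The function name `is_http_url` is an assumption for illustration and does not come from the PR.

```python
from urllib.parse import urlparse


def is_http_url(text: str) -> bool:
    """Hypothetical helper combining a whitespace check with urlparse."""
    # urlparse alone accepts the multi-line example above, so reject
    # spaces, tabs, and newlines before parsing.
    if any(ch.isspace() for ch in text):
        return False
    try:
        u = urlparse(text)
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host.
        return False
    return u.scheme in ("http", "https") and len(u.netloc) > 0


print(is_http_url("https://xyz.com"))
print(is_http_url("https://mypaper.com/paperpage.pdf\nnew string\nlol"))
```

With this ordering the first call should report a URL and the second should not, since the second input contains newlines and spaces.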

bocytko

Sep 22

Thanks for the example input. I think I misunderstood what you were trying to achieve. My objective was to increase the likelihood that only valid URLs are passed on to the function.
You can likely filter for valid URLs with a regex (example), or just wrap the block in a try-except so that strings which pass the check but are still invalid get handled. Also set a timeout on the request function for more predictable runtimes.
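A rough sketch of the try-except plus timeout idea, under stated assumptions: it uses the standard library's `urllib.request.urlopen` rather than whatever request function the project actually calls, and the name `fetch_or_none` and the 5 second default are made up for illustration.

```python
import http.client
import urllib.request
from typing import Optional


def fetch_or_none(url: str, timeout: float = 5.0) -> Optional[bytes]:
    """Hypothetical helper: return the response body, or None on any failure."""
    try:
        # The timeout bounds how long a slow or unreachable host can block.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (ValueError, OSError, http.client.HTTPException):
        # ValueError: text that is not a URL at all (unknown URL type).
        # OSError: covers URLError, connection failures, and timeouts.
        # HTTPException: covers http.client.InvalidURL and similar errors.
        return None


# Text that only resembles a URL fails fast, without touching the network.
print(fetch_or_none("http is not a url"))
```

A caller could then fall back to treating the input as plain text whenever the helper returns None.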

LPN64

Sep 23

@michael-guenther

Ready to merge

This branch is ready to get merged automatically.
