Python: getting URLs from a page

1 year ago

#358506

Alex Yu

This code parses a page and extracts the URLs to generate a sitemap. Along with the URLs, it also picks up fragments of JS code. How can I filter out the JS?

if (resp.status == 200 and
        'text/html' in resp.headers.get('content-type', '')):
    data = (await resp.read()).decode('utf-8', 'replace')
    # Regex runs over the raw text, so it also matches href= inside inline JS
    urls = re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', data)
    asyncio.Task(self.addurls([(u, url) for u in urls]))
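One possible way to keep JS out of the results is to stop matching on raw text and instead parse the HTML, reading `href` only from real `<a>` tags. The sketch below uses the standard library's `html.parser`, which treats the body of `<script>` as plain data, so `href=` strings inside JS are never seen as attributes. The names `LinkCollector` and `extract_links` are hypothetical helpers, not part of the original crawler, and restricting results to `http`/`https` (which drops `javascript:` links) is an assumption about what belongs in the sitemap.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are considered; <script> bodies never reach here.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value.strip())
                # Assumption: only http(s) links are wanted, which
                # also drops javascript: and mailto: pseudo-links.
                if urlsplit(absolute).scheme in ('http', 'https'):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return absolute http(s) links found in <a href=...> attributes."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

In the snippet above, the `re.findall` line could then be swapped for something like `urls = extract_links(data, url)`, assuming `url` is the address of the page being parsed.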

python

html

parsing

sitemap

0 Answers
