1 year ago
#389200
d3hero23
Why does the code break when removing the try() call in R?
I am trying to clean up and package a web-scraping script that I built. I have a list of URLs whose pages' HTML has been parsed into a list called parsed_pages.
dput(parsed_pages)
list(`https://www.bilibili.com/read/cv15856303|||` = structure(list(
node = <pointer: 0x000001ea234abb60>, doc = <pointer: 0x000001ea2310f160>), class = c("xml_document",
"xml_node")), `https://wenhua.youth.cn/whyw/202203/t20220328_13564493.htm|||` = list(),
`https://new.qq.com/rain/a/20220328A00PN600|||` = structure(list(
node = <pointer: 0x000001ea234c4460>, doc = <pointer: 0x000001ea2310eb60>), class = c("xml_document",
"xml_node")), `http://guba.eastmoney.com/news,cjpl,1158874320.html|||` = structure(list(
node = <pointer: 0x000001ea234cc1e0>, doc = <pointer: 0x000001ea2310ec20>), class = c("xml_document",
"xml_node")), `https://new.qq.com/omn/20220325/20220325A03GEN00.html|||` = structure(list(
node = <pointer: 0x000001ea23524ce0>, doc = <pointer: 0x000001ea2310f760>), class = c("xml_document",
"xml_node")), `https://www.360kuai.com/pc/detail?url=http%3A%2F%2Fzm.news.so.com%2F2cd7fea5638e3180eeee9b3c7a8615d7&check=05e832397edc5905&sign=look&uid=df8f1c55a6f943fb189ae4baa4975b32&tj_url=92c56bb47ca9c9f28|||` = structure(list(
node = <pointer: 0x000001ea2352b7e0>, doc = <pointer: 0x000001ea2310ece0>), class = c("xml_document",
"xml_node")), `https://www.sohu.com/a/533459730_119097|||` = structure(list(
node = <pointer: 0x000001ea2352e960>, doc = <pointer: 0x000001ea2310fbe0>), class = c("xml_document",
"xml_node")), `https://new.qq.com/omn/20220328/20220328A04X4M00.html|||` = structure(list(
node = <pointer: 0x000001ea2354d2e0>, doc = <pointer: 0x000001ea234793c0>), class = c("xml_document",
"xml_node")), `https://xw.qq.com/cmsid/20220328A04KND00|||` = structure(list(
node = <pointer: 0x000001ea23558d60>, doc = <pointer: 0x000001ea23479a80>), class = c("xml_document",
"xml_node")), `http://www.cankaoxiaoxi.com/sports/20220328/2474110.shtml|||` = structure(list(
node = <pointer: 0x000001ea235b2ad0>, doc = <pointer: 0x000001ea23478c40>), class = c("xml_document",
"xml_node")))
I am trying to extract all of the image URLs that exist within the list of parsed pages. I have the following legacy code, which works but is verbose and does not actually download any file. I am not looking to download the files, but oddly enough, removing the download code breaks the script.
### working code ###
test_list <- try(lapply(parsed_pages, function(x) {
  x %>%
    html_nodes(xpath = '//*/img') %>%
    html_attr("src") %>%
    try(download.file("test.jpg", mode = "wb"), silent = TRUE)
}))
##attempt to clean up code.... NOT WORKING####
###attempt 1###
test_list<-lapply(parsed_pages, function(x) {
nodes<-html_nodes(x,xpath = '//*/img')
pictures<-html_attr(x,"src")
return(pictures)})
###Attempt #2
test_list<-lapply(parsed_pages, function(x) {
x %>%
html_nodes(xpath = '//*/img')%>%
html_attr(name = "src")})
### Error:
# Error in UseMethod("xml_find_all") :
#   no applicable method for 'xml_find_all' applied to an object of class "list"
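One possible reading of the error, offered as a sketch and not as a confirmed diagnosis: the dput output shows that the entry for wenhua.youth.cn is an empty list() and not an xml_document, so html_nodes() has no method for it. In the working version the inner try() probably absorbs that failure, because the piped value becomes try()'s first argument and download.file() is likely never evaluated. The example below guards against non-XML entries explicitly. The helper name extract_img_src and the small inline HTML documents are hypothetical stand-ins for the real parsed_pages.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# Hypothetical stand-in for parsed_pages: two parsed documents plus one
# empty list(), mimicking the entry that failed to parse in the dput output.
parsed_pages <- list(
  ok_page    = read_html('<html><body><img src="a.jpg"><img src="b.png"></body></html>'),
  empty_page = list(),
  no_images  = read_html("<html><body><p>no images here</p></body></html>")
)

# Return the src attribute of every <img> in a parsed page.
# Anything that is not an xml node (e.g. an empty list left behind by a
# failed download) yields character(0) instead of raising an error.
extract_img_src <- function(page) {
  if (!inherits(page, "xml_node")) {
    return(character(0))
  }
  page %>%
    html_nodes(xpath = "//img") %>%
    html_attr("src")
}

test_list <- lapply(parsed_pages, extract_img_src)
```

With this guard, every element of test_list is a character vector, and the entries for pages that could not be parsed are simply empty. Note also that attempt 1 above passes x, not nodes, to html_attr(), so it would read the src attribute of the document itself even once the empty entries are handled.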
r
web-scraping
rvest
0 Answers