1 year ago

#389200


d3hero23

Why does my code break when I remove the try() call in R?

I am trying to clean up and package a web-scraping script that I built. I have a list of URLs, and the HTML of each page has been parsed into a list called parsed_pages.

dput(parsed_pages)
list(`https://www.bilibili.com/read/cv15856303|||` = structure(list(
    node = <pointer: 0x000001ea234abb60>, doc = <pointer: 0x000001ea2310f160>), class = c("xml_document", 
"xml_node")), `https://wenhua.youth.cn/whyw/202203/t20220328_13564493.htm|||` = list(), 
    `https://new.qq.com/rain/a/20220328A00PN600|||` = structure(list(
        node = <pointer: 0x000001ea234c4460>, doc = <pointer: 0x000001ea2310eb60>), class = c("xml_document", 
    "xml_node")), `http://guba.eastmoney.com/news,cjpl,1158874320.html|||` = structure(list(
        node = <pointer: 0x000001ea234cc1e0>, doc = <pointer: 0x000001ea2310ec20>), class = c("xml_document", 
    "xml_node")), `https://new.qq.com/omn/20220325/20220325A03GEN00.html|||` = structure(list(
        node = <pointer: 0x000001ea23524ce0>, doc = <pointer: 0x000001ea2310f760>), class = c("xml_document", 
    "xml_node")), `https://www.360kuai.com/pc/detail?url=http%3A%2F%2Fzm.news.so.com%2F2cd7fea5638e3180eeee9b3c7a8615d7&check=05e832397edc5905&sign=look&uid=df8f1c55a6f943fb189ae4baa4975b32&tj_url=92c56bb47ca9c9f28|||` = structure(list(
        node = <pointer: 0x000001ea2352b7e0>, doc = <pointer: 0x000001ea2310ece0>), class = c("xml_document", 
    "xml_node")), `https://www.sohu.com/a/533459730_119097|||` = structure(list(
        node = <pointer: 0x000001ea2352e960>, doc = <pointer: 0x000001ea2310fbe0>), class = c("xml_document", 
    "xml_node")), `https://new.qq.com/omn/20220328/20220328A04X4M00.html|||` = structure(list(
        node = <pointer: 0x000001ea2354d2e0>, doc = <pointer: 0x000001ea234793c0>), class = c("xml_document", 
    "xml_node")), `https://xw.qq.com/cmsid/20220328A04KND00|||` = structure(list(
        node = <pointer: 0x000001ea23558d60>, doc = <pointer: 0x000001ea23479a80>), class = c("xml_document", 
    "xml_node")), `http://www.cankaoxiaoxi.com/sports/20220328/2474110.shtml|||` = structure(list(
        node = <pointer: 0x000001ea235b2ad0>, doc = <pointer: 0x000001ea23478c40>), class = c("xml_document", 
    "xml_node")))

I am trying to extract all of the image URLs from the list of parsed pages. The following legacy code works, but it is verbose and does not actually download the file. I don't want to download the file anyway, but oddly enough, removing the try() and download.file() calls breaks the script.

###working code###
test_list<-try(lapply(parsed_pages, function(x) {
  x %>% 
    html_nodes(xpath = '//*/img')%>%
    html_attr("src")%>%
    try(download.file("test.jpg", mode = "wb"),silent = TRUE)}))

##attempt to clean up code.... NOT WORKING####

###attempt 1###
test_list<-lapply(parsed_pages, function(x) {
  
    nodes<-html_nodes(x,xpath = '//*/img')
    pictures<-html_attr(x,"src")
    return(pictures)})

###Attempt #2
test_list<-lapply(parsed_pages, function(x) {
  x %>% 
    html_nodes(xpath = '//*/img')%>%
    html_attr(name = "src")})
### Error:
### Error in UseMethod("xml_find_all") :
###   no applicable method for 'xml_find_all' applied to an object of class "list"
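
One thing worth checking, offered as a sketch rather than a confirmed diagnosis: in the dput output above, the entry for https://wenhua.youth.cn/whyw/202203/t20220328_13564493.htm is an empty list() rather than an xml_document. Calling html_nodes() on a plain list is consistent with the 'xml_find_all' error shown, and the try() wrappers in the working version could be hiding that failure. The example below skips elements that are not parsed documents. The sample parsed_pages, its page_a/page_b/page_c names, and the helper extract_img_src() are hypothetical and exist only to make the example runnable.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical stand-in for parsed_pages: two parsed documents and one empty
# list(), mimicking the entry that was not parsed into an xml_document.
parsed_pages <- list(
  page_a = read_html('<html><body><img src="a.jpg"><img src="b.png"></body></html>'),
  page_b = list(),
  page_c = read_html('<html><body><p>no images here</p></body></html>')
)

# Hypothetical helper: return the src attribute of every <img> on one page,
# or character(0) when the element is not a parsed document.
extract_img_src <- function(page) {
  if (!inherits(page, "xml_node")) {
    return(character(0))
  }
  page %>%
    html_nodes(xpath = "//img") %>%
    html_attr("src")
}

# One character vector of image URLs per page, with names preserved.
test_list <- lapply(parsed_pages, extract_img_src)
```

Separately, Attempt 1 passes x rather than nodes to html_attr(), so it reads src from the document root instead of from the img nodes.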

r

web-scraping

rvest

0 Answers
