Is it possible to use rvest to scrape data outside of a child class within an HTML node?

Problem description

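I have a URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine) from which I can scrape posts. Some of the posts are replies whose text begins with "Originally posted by ...". I want to scrape all the data within each post except that leading quoted text. For example,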

User  df_text
 A Hi, how are you ?
 B This is beautiful!
 C Heuwi
 D Originally posted by C Heuwi 
Hellou
 E Hello guys
 F Originally posted by A Hi, how are you ?
I am doing good
 G Whats going on ?

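For user F, for example, "Originally posted by ..." is under the div.quote_container class (the child class), whereas "I am doing good" is under the parent class blockquote.postcontent.restore.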

Expected result:

User  df_text
 A Hi, how are you ?
 B This is beautiful!
 C Heuwi
 D Hellou
 E Hello guys
 F I am doing good
 G Whats going on ?

I tried the following code:

url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads<- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())

I also tried a few others:

threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())

or

threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)

None of these work. Please help me find a way to scrape all of the post data, excluding the child class. Thanks in advance!!

Recommended answer

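Using the xml_remove function is a relatively easy solution. It is part of the xml2 package, which older rvest releases attach automatically; depending on your rvest version, you may need to load xml2 explicitly with library(xml2).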

library(rvest)
library(xml2)  # provides xml_remove(); may not be attached by rvest itself
#read page
url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)

#find parent nodes
threads<- review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)")
#find children nodes to exclude
toremove<-threads %>% html_node("div.bbcode_container")
#remove nodes
xml_remove(toremove)

#convert the parent nodes to text
threads %>% html_text(trim=TRUE)
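
To try the same idea without depending on the live forum page, here is a minimal sketch that runs on an inline HTML string. The markup is a made-up stand-in for the forum's structure (an assumption, not copied from the site), and the names `html`, `page`, `posts`, `quotes` and `post_text` are illustrative only.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two posts: one plain, one quoting another user
html <- '<html><body>
<blockquote class="postcontent restore">Heuwi</blockquote>
<blockquote class="postcontent restore"><div class="bbcode_container"><div class="quote_container">Originally posted by C Heuwi</div></div>Hellou</blockquote>
</body></html>'

page <- read_html(html)

# Parent nodes: one per post
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Child nodes to exclude: every quote container inside any post
quotes <- html_nodes(posts, "div.bbcode_container")

# Remove the quote containers from the parsed document (modifies it in place)
xml_remove(quotes)

# What remains in each post is the reply text only
post_text <- html_text(posts, trim = TRUE)
print(post_text)
```

Using html_nodes (rather than html_node) for the quote containers also covers posts that contain more than one quote.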

From the documentation for xml_remove: "Care needs to be taken". Review the documentation and use the function with caution.
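
Because xml_remove modifies the parsed document in place, one alternative (not part of the answer above) is to leave the document untouched and select only the text nodes that are not inside a quote container, using XPath. The sketch below again uses made-up markup, and `get_own_text` is a hypothetical helper introduced for illustration.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two posts: one plain, one quoting another user
html <- '<html><body>
<blockquote class="postcontent restore">Heuwi</blockquote>
<blockquote class="postcontent restore"><div class="bbcode_container"><div class="quote_container">Originally posted by C Heuwi</div></div>Hellou</blockquote>
</body></html>'

page <- read_html(html)
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Hypothetical helper: collect the text nodes of one post that do not have a
# div.bbcode_container ancestor, then join and trim them
get_own_text <- function(post) {
  pieces <- xml_find_all(
    post,
    ".//text()[not(ancestor::div[contains(concat(' ', normalize-space(@class), ' '), ' bbcode_container ')])]"
  )
  trimws(paste(xml_text(pieces), collapse = ""))
}

own_text <- vapply(posts, get_own_text, character(1))
print(own_text)
```

The quoted blocks stay in `page`, so the same parsed document can still be queried for the quotes afterwards.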
