
python - How to get "subsoups" and concatenate/join them?


I have an HTML document that I need to process. For that I'm using BeautifulSoup. Now I'd like to retrieve some "subsoups" from that document and join them into one soup, so that I can later use it as an argument for a function that expects a soup object.

If that's not clear, let me give you an example...

from bs4 import BeautifulSoup

my_document = """
<html>
<body>

<h1>Some Heading</h1>

<div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
</div>

<div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
</div>

<div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

</body>
</html>
"""

soup = BeautifulSoup(my_document, "html.parser")

# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]

# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)

# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)

The goal is to have an object in resulting_soup that is/behaves like a soup with the following content:


<div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
</div>

<div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>


Is there a convenient way to do this? If there is a better way than find() to retrieve the "subsoups", I can use that instead. Thanks.

Update

There is a solution suggested by Wondercricket: concatenate the strings containing the found tags and parse them again into a new BeautifulSoup object. While that is one possible way of solving the problem, re-parsing may take longer than I'd like, especially when I want to retrieve most of the document and there are many such documents to process. find() returns a bs4.element.Tag. Is there a way to join several Tags into one soup without converting the Tags to strings and parsing those strings?
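For reference, a minimal sketch of that string-joining approach, assumed from the description above rather than taken from Wondercricket's actual answer (subsoups is the list [first, third, loner] built in the code above):

from bs4 import BeautifulSoup

# Assumed sketch: serialize each found Tag back to HTML with str(tag),
# concatenate the strings, and re-parse the result into one new soup.
# It works, but costs a second full parse of the extracted markup.
resulting_soup = BeautifulSoup(
    "".join(str(tag) for tag in subsoups),
    "html.parser",
)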

Best answer

SoupStrainer will do exactly what you are asking for, and as a bonus you get a performance boost, because it parses only what you want it to parse, rather than the complete document tree:

from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

Now the soup object contains only the desired elements:



<div id="first">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <p>
  A paragraph.
 </p>
</div>
<div id="third">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <a href="yet_another_doc.html">
  A link
 </a>
</div>
<p id="loner">
 A paragraph.
</p>



Is it also possible to specify not only ids but also tags? For example, if I want to filter all paragraphs with class="someclass" but not divs with the same class?

In that case, you can write a search function that combines multiple criteria in a SoupStrainer:

from bs4 import BeautifulSoup, SoupStrainer, ResultSet

my_document = """
<html>
<body>

<h1>Some Heading</h1>

<div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
</div>

<div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
</div>

<div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

<p class="myclass">test</p>

</body>
</html>
"""

def search(tag, attrs):
    # keep paragraphs carrying the "myclass" class
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag

    # keep the elements with the requested ids
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag


parse_only = SoupStrainer(search)

soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

print(soup.prettify())
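As a usage note, a small hedged sketch: the strained soup is an ordinary BeautifulSoup object, so it can be passed directly to code that expects one. function_expecting_a_soup below is the hypothetical consumer from the question, not part of the answer:

# Hypothetical consumer from the question: anything that calls regular soup
# methods such as find_all() works on the strained soup as-is.
def function_expecting_a_soup(s):
    return [p.get_text() for p in s.find_all("p")]

print(function_expecting_a_soup(soup))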

For python - How to get "subsoups" and concatenate/join them?, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/34530587/
