| tags | aliases | date | time | description |
|---|---|---|---|---|
| | | 2024-11-10 | 16:55:41 | |
Can be used in place of Beautiful Soup
Why Beautiful Soup is Overrated:
Speed: Parsing slows down noticeably on very large documents.
Blocking: Much like Requests, it is synchronous and was not designed with async in mind, so a long parse holds up an async scraper until it finishes.
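One common way to work around a blocking parser in async code is to push the parse into a worker thread. The sketch below is only an illustration of that idea: it uses the standard library's `html.parser` as a stand-in for any synchronous parser, and `TitleParser`, `parse_title`, and `parse_many` are hypothetical names invented for this example.

```python
import asyncio
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Stand-in synchronous parser (stdlib) that collects the <title> text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html: str) -> str:
    """Blocking call: runs to completion before returning."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


async def parse_many(pages: list[str]) -> list[str]:
    # asyncio.to_thread (Python 3.9+) runs each blocking parse in a worker
    # thread; gather returns the results in the same order as the inputs.
    return await asyncio.gather(
        *(asyncio.to_thread(parse_title, page) for page in pages)
    )


pages = [f"<html><head><title>Page {i}</title></head></html>" for i in range(3)]
titles = asyncio.run(parse_many(pages))
print(titles)  # ['Page 0', 'Page 1', 'Page 2']
```

This keeps the event loop responsive while a parse is running, but it does not make the parse itself any faster. That part depends on the parser.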
What you should use instead: selectolax
selectolax is a lesser-known library built on the Modest and Lexbor HTML engines, which gives it better performance and lower memory consumption than Beautiful Soup.
from selectolax.parser import HTMLParser
html_content = "<html><body><p>Test</p></body></html>"
tree = HTMLParser(html_content)
text = tree.css("p")[0].text()
print(text) # Output: Test
With selectolax you keep the core HTML parsing and CSS-selector capabilities while gaining a lot of speed, which makes it well suited to data-intensive web scraping tasks.
“Do not fall in love with the tool; rather, fall in love with the outcome.” Choosing the proper tool is half the battle.