vault backup: 2025-03-04 11:17:00
20.01. Programming/Python/selectolax.md
---
tags:
aliases:
date: 2024-11-10
time: 16:55:41
description:
---

**Can be used as a replacement for [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**

## **Why [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Is Overrated:**

**Speed:** It is not very fast, especially when the document is very large.

**Thread blocking:** Much like `Requests`, it is synchronous and was not designed with async in mind, which makes it ill-suited for scraping dynamic websites.

## **What You Should Use Instead:** `selectolax`

`selectolax` is a lesser-known library that wraps the Modest and Lexbor HTML engines (not `libxml2`, which is what `lxml` builds on) for better performance and lower memory consumption.

```python
from selectolax.parser import HTMLParser

html_content = "<html><body><p>Test</p></body></html>"
tree = HTMLParser(html_content)
text = tree.css("p")[0].text()
print(text)  # Output: Test
```

With `selectolax`, you keep the core HTML parsing capabilities (CSS-selector queries, text and attribute extraction) while gaining a substantial speedup, which makes it a good fit for data-intensive web scraping tasks.

> **“Do not fall in love with the tool; rather, fall in love with the outcome.” Choosing the proper tool is half the battle.**

# References

- [5 Overrated Python Libraries (And What You Should Use Instead) | by Abdur Rahman | Nov, 2024 | Python in Plain English](https://python.plainenglish.io/5-overrated-python-libraries-and-what-you-should-use-instead-106bd9ded180)