that isn’t part of the navigation, ads, or footer), and
writes that text to a file called `extract.txt`.

```python
#!/usr/bin/env python3
"""
extract_page.py – Pull the main textual content from the
given URL and write it to extract.txt.
"""
import requests
from bs4 import BeautifulSoup

URL = "https://www.yourwebsite.com"  # ← replace with the real address


def main():
    # 1. Fetch the page (the timeout keeps the script from hanging forever)
    resp = requests.get(URL, timeout=30)
    resp.raise_for_status()  # stop if we got an error

    # 2. Parse the HTML
    soup = BeautifulSoup(resp.text, "html.parser")

    # 3. Grab everything that looks like the main article
    #    (commonly wrapped in <article> or a div with an id/class that
    #    signals "content"). If that isn't present, fall back to <body>.
    content_tag = soup.find("article") or soup.body

    # 4. Strip any scripts, styles, or navigation blocks
    for tag in content_tag(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # 5. Get clean plain text
    text = content_tag.get_text(separator="\n", strip=True)

    # 6. Write to extract.txt
    with open("extract.txt", "w", encoding="utf-8") as out:
        out.write(text)
    print("✅ Extracted text written to extract.txt")


if __name__ == "__main__":
    main()
```
How to run it

1. Save the script as `extract_page.py`.
2. Install the dependencies if you haven’t already:

   ```bash
   pip install requests beautifulsoup4
   ```

3. Run it:

   ```bash
   python3 extract_page.py
   ```
After execution you’ll have an `extract.txt` file in the same directory.
The file will contain the cleaned, newline-separated text of the page’s main content, e.g.:

```
A text-based adventure game is a game in which the player makes decisions and the game responds by showing a short description of what happens in the game world. …
```
Feel free to tweak the script (e.g., add more tag removals or adjust the selector) if the page’s structure is a bit different.
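
As a rough sketch of what such a tweak might look like: the version below tries a list of selectors in order and strips a few extra tags. The selector list (`main`, `#content`, `.content`), the extra tags (`aside`, `form`), and the helper name `extract_main_text` are all assumptions for illustration, not something taken from a particular page; adjust them to the markup you actually see.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors, tried in order; adjust to the page being scraped.
CANDIDATE_SELECTORS = ["article", "main", "#content", ".content"]
# The original script's tags plus two assumed extras ("aside", "form").
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def extract_main_text(html: str) -> str:
    """Return the text of the first matching content container."""
    soup = BeautifulSoup(html, "html.parser")

    # Use the first selector that matches anything.
    content = None
    for selector in CANDIDATE_SELECTORS:
        content = soup.select_one(selector)
        if content is not None:
            break

    # Fall back to <body>, or to the whole document if there is no <body>.
    if content is None:
        content = soup.body or soup

    # Remove non-content elements before extracting text.
    for tag in content.find_all(NOISE_TAGS):
        tag.decompose()

    return content.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    sample = (
        "<html><body><nav>Home | About</nav>"
        '<div id="content"><p>First paragraph.</p><aside>Ad</aside>'
        "<p>Second paragraph.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_text(sample))
```

In the script above, this would replace steps 2 to 5: pass `resp.text` to `extract_main_text` and write the returned string to `extract.txt`.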