The goal is to extract only the relevant content from an HTML document while ignoring extraneous tags. Concretely:

- Delete the tags that carry no useful text (`<script>`, `<style>`, `<head>`, `<meta>`, etc.).
- Collect only the tags you care about (`<h1>`–`<h6>`, `<p>`, `<ul>`, `<ol>`, `<li>`, `<blockquote>`, `<pre>`).
- Preserve the order of those elements so the output keeps the same flow as the original article.
- Format headings and lists in a human-readable way (plain text or simple Markdown).

```bash
pip install beautifulsoup4   # if you haven't installed it already
pip install html2text        # optional – for a Markdown output
```
```python
from bs4 import BeautifulSoup


def extract_relevant_text(html: str, as_markdown: bool = False) -> str:
    """
    Extracts only the meaningful content from an HTML string.

    Parameters
    ----------
    html : str
        The raw HTML to process.
    as_markdown : bool, default False
        If True, headings are converted to Markdown (#, ##, …) and
        list items get a simple bullet prefix. If False, the output
        is plain text with line breaks between paragraphs.

    Returns
    -------
    str
        Cleaned, structured text.
    """
    # --- 1. Parse the HTML ---------------------------------------------
    soup = BeautifulSoup(html, 'html.parser')

    # --- 2. Remove all tags that should not contribute any text --------
    for tag in soup(['script', 'style', 'head', 'title', 'meta', 'link', 'noscript']):
        if not tag.decomposed:   # skip tags already destroyed with an ancestor
            tag.decompose()      # (e.g. <title> inside an already-removed <head>)

    # --- 3. Decide which tags to keep ----------------------------------
    # You can adjust this list to drop or keep other tags.
    tags_to_keep = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                    'p', 'ul', 'ol', 'li', 'blockquote', 'pre']

    # --- 4. Walk through the document in document order -----------------
    cleaned_parts = []
    for element in soup.find_all(tags_to_keep):
        txt = element.get_text(strip=True)
        if element.name.startswith('h'):        # headings
            level = int(element.name[1])        # h1 → 1, h2 → 2, …
            prefix = '#' * level + ' ' if as_markdown else ''
            cleaned_parts.append(f'{prefix}{txt}')
        elif element.name in ('ul', 'ol'):      # list containers
            # We simply add an empty line to separate the list
            cleaned_parts.append('')            # will add a blank line
        elif element.name == 'li':              # list items
            if element.parent.name == 'ol':
                # Numbered list – compute the item's position within its
                # parent so we can add a prefix like "1. "
                idx = element.parent.find_all('li').index(element) + 1
                cleaned_parts.append(f'{idx}. {txt}')
            else:
                # bullet list
                cleaned_parts.append(f'- {txt}')
        elif element.name == 'blockquote':
            cleaned_parts.append(f'«{txt}»')
        elif element.name == 'pre':
            cleaned_parts.append(txt)           # code block stays as is
        else:                                   # generic paragraph or other simple tag
            cleaned_parts.append(txt)

    # --- 5. Join all parts preserving paragraph breaks ------------------
    # Two newlines separate logical blocks (e.g., heading ↔ paragraph)
    return '\n\n'.join(cleaned_parts)
```
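The ordered-list branch above numbers each `<li>` by its position among the parent's `<li>` children. A quick check of that indexing logic on a made-up two-item snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list, just to exercise the numbering logic
soup = BeautifulSoup('<ol><li>first</li><li>second</li></ol>', 'html.parser')

items = []
for li in soup.find_all('li'):
    # Position of this <li> within its parent's <li> children (1-based)
    idx = li.parent.find_all('li').index(li) + 1
    items.append(f'{idx}. {li.get_text(strip=True)}')

print(items)  # → ['1. first', '2. second']
```

Note that `.index()` relies on tag equality, so two `<li>` items with identical content would both resolve to the first position; for typical articles that is rarely an issue.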
```python
# --------------------------------------------------------------------
# Example usage – read your file and print the cleaned text
# --------------------------------------------------------------------
if __name__ == "__main__":
    # Replace this with your actual HTML content (file read, string literal, etc.)
    with open('article.html', 'r', encoding='utf-8') as f:
        raw_html = f.read()

    # Get plain text
    plain_text = extract_relevant_text(raw_html, as_markdown=False)
    print("=== Plain Text Output ===\n")
    print(plain_text)

    # Get Markdown (optional)
    print("\n=== Markdown Output ===\n")
    print(extract_relevant_text(raw_html, as_markdown=True))
```
## How it works

| Step | What the code does | Why it matters |
|------|--------------------|----------------|
| 1 | `BeautifulSoup(html, 'html.parser')` parses the markup. | Fast, easy, works with most browsers’ output. |
| 2 | `tag.decompose()` removes tags that never contain useful prose (scripts, styles, head, meta, etc.). | Keeps the output free of noise and JavaScript. |
| 3 | `find_all(['h1', …, 'li'])` picks only the tags that hold meaningful text. | You can drop tags you don’t care about by editing `tags_to_keep`. |
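To see steps 1–4 end to end, here is a minimal run on an inline snippet (the HTML string is invented purely for the demo):

```python
from bs4 import BeautifulSoup

# Invented sample document, purely for the demo
html = """
<html><body>
  <script>var x = 1;</script>
  <h1>My Article</h1>
  <p>First paragraph.</p>
  <ul><li>alpha</li><li>beta</li></ul>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Step 2: drop the noise tags
for tag in soup(['script', 'style']):
    tag.decompose()

# Steps 3–4: collect only the meaningful tags, in document order
parts = []
for el in soup.find_all(['h1', 'p', 'li']):
    txt = el.get_text(strip=True)
    parts.append(f'- {txt}' if el.name == 'li' else txt)

print('\n\n'.join(parts))
```

The script content never appears in the output, and the heading, paragraph, and bullets come out in the same order they appear in the source.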