
Periphrasis

A typical scraped page contains many tags, and the goal is to extract only the relevant content while ignoring extraneous tags like `<script>` and `<style>`. The plan:

1. Delete the tags that carry no useful text (`<script>`, `<style>`, `<head>`, etc.).
2. Collect only the tags you care about (`<h1>`–`<h6>`, `<p>`, `<ul>`, `<ol>`, `<li>`, `<blockquote>`, `<pre>`, etc.).
3. Preserve the order of those elements so the output keeps the same flow as the original article.
4. Format headings and lists in a human-readable way (plain text or simple Markdown).

```bash
pip install beautifulsoup4  # if you haven't installed it already
pip install html2text       # optional – for a Markdown output
```

```python
from bs4 import BeautifulSoup


def extract_relevant_text(html: str, as_markdown: bool = False) -> str:
    """
    Extract only the meaningful content from an HTML string.

    Parameters
    ----------
    html : str
        The raw HTML to process.
    as_markdown : bool, default False
        If True, headings are converted to Markdown (#, ##, …) and
        list items get a simple bullet prefix.  If False, the output
        is plain text with line breaks between paragraphs.

    Returns
    -------
    str
        Cleaned, structured text.
    """
    # --- 1. Parse the HTML ---------------------------------------------
    soup = BeautifulSoup(html, 'html.parser')

    # --- 2. Remove all tags that should not contribute any text --------
    for tag in soup(['script', 'style', 'head', 'title', 'meta',
                     'link', 'noscript']):
        tag.decompose()

    # --- 3. Decide which tags to keep ----------------------------------
    # Adjust this list if you want to drop list items, keep <table>, etc.
    tags_to_keep = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                    'p', 'ul', 'ol', 'li', 'blockquote', 'pre']

    # --- 4. Walk through the document in document order ----------------
    cleaned_parts = []
    for element in soup.find_all(tags_to_keep):
        txt = element.get_text(strip=True)
        if element.name.startswith('h'):                  # headings
            level = int(element.name[1])                  # h1 → 1, h2 → 2, …
            prefix = '#' * level + ' ' if as_markdown else ''
            cleaned_parts.append(f'{prefix}{txt}')
        elif element.name in ('ul', 'ol'):                # list containers
            # Add an empty entry so the list is separated by a blank line
            cleaned_parts.append('')
        elif element.name == 'li':                        # list items
            if element.parent.name == 'ol':
                # Numbered list – prefix each item with "1. ", "2. ", …
                idx = element.parent.find_all('li').index(element) + 1
                cleaned_parts.append(f'{idx}. {txt}')
            else:
                # Bullet list
                cleaned_parts.append(f'- {txt}')
        elif element.name == 'blockquote':
            cleaned_parts.append(f'«{txt}»')
        elif element.name == 'pre':
            cleaned_parts.append(txt)                     # code block stays as is
        else:                                             # generic paragraph
            cleaned_parts.append(txt)

    # --- 5. Join all parts preserving paragraph breaks -----------------
    # Two newlines separate logical blocks (e.g., heading ↔ paragraph)
    return '\n\n'.join(cleaned_parts)
```

---

Example usage – read your file and print the cleaned text:

```python
if __name__ == "__main__":
    # Replace this with your actual HTML content (file read, string literal, etc.)
    with open('article.html', 'r', encoding='utf-8') as f:
        raw_html = f.read()

    # Get plain text
    plain_text = extract_relevant_text(raw_html, as_markdown=False)
    print("=== Plain Text Output ===\n")
    print(plain_text)

    # Get Markdown (optional)
    print("\n=== Markdown Output ===\n")
    print(extract_relevant_text(raw_html, as_markdown=True))
```

How it works

| Step | What the code does | Why it matters |
|------|--------------------|----------------|
| 1 | `BeautifulSoup(html, 'html.parser')` parses the markup. | Fast, easy, works with most browsers' output. |
| 2 | `tag.decompose()` removes tags that never contain useful prose (scripts, styles, head, meta, etc.). | Keeps the output free of noise and JavaScript. |
| 3 | `find_all(['h1', …, 'li'])` picks only the tags that hold meaningful text. | You can drop tags you don't care about or add new ones. |
| 4 | The loop keeps heading levels (`#` for Markdown) and formats list items with bullets or numbers. | The resulting text keeps the logical structure of the article. |
| 5 | `'\n\n'.join(...)` adds double newlines between logical blocks. | Makes the text human-readable without extra HTML artifacts. |
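If BeautifulSoup isn't available, the same "remove noise tags, keep only meaningful tags" idea can be sketched with the standard library's `html.parser`. This is a minimal illustration, not a replacement for the function above: the `KEEP`/`SKIP` sets mirror the lists used earlier, and the class name `RelevantText` is just an assumption for this sketch.

```python
from html.parser import HTMLParser

# Tags whose text we keep, and tags whose entire content we discard.
KEEP = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li'}
SKIP = {'script', 'style', 'head', 'title', 'noscript'}

class RelevantText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []     # collected text blocks, in document order
        self._depth = 0     # > 0 while inside a tag we want to keep
        self._skip = 0      # > 0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip += 1
        elif tag in KEEP:
            self._depth += 1
            self.parts.append('')   # start a new text block

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip:
            self._skip -= 1
        elif tag in KEEP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Only record text that sits inside a kept tag and outside noise.
        if self._depth and not self._skip:
            self.parts[-1] += data

def extract(html: str) -> str:
    parser = RelevantText()
    parser.feed(html)
    return '\n\n'.join(p.strip() for p in parser.parts if p.strip())
```

It handles far fewer cases than BeautifulSoup (no heading levels, no list numbering), but it shows that the core filtering logic is only a few lines.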

Optional: Use html2text for a ready-made Markdown conversion

If you simply want a quick Markdown version that keeps most of the formatting (bold, italics, links, etc.):

```bash
pip install html2text
```

```python
import html2text

def html_to_markdown(html: str) -> str:
    converter = html2text.HTML2Text()
    converter.ignore_links = False  # change to True if you don't want URLs
    converter.body_width = 0        # preserve line breaks as in the source
    return converter.handle(html)

print(html_to_markdown(raw_html))
```

`html2text` automatically strips unwanted tags and outputs Markdown, which you can then paste straight into ChatGPT or any other text processor. Feel free to tweak the `tags_to_keep` list or the Markdown prefix logic to suit the exact format you need.