How to: Splitting HTML

Splitting HTML documents into manageable chunks is essential for text processing tasks such as natural language processing and search indexing. In this guide, we will explore three text splitters provided by LangChain that you can use to split HTML content: HTMLHeaderTextSplitter, HTMLSectionSplitter, and HTMLSemanticPreservingSplitter.

Each of these splitters has unique features and use cases. This guide will help you understand the differences between them, why you might choose one over the others, and how to use them effectively.

Overview of the Splitters

HTMLHeaderTextSplitter

Useful when you want to preserve the hierarchical structure of a document based on its headings.

Description: Splits HTML text based on header tags (e.g., <h1>, <h2>, <h3>, etc.), and adds metadata for each header relevant to any given chunk.

Capabilities:

Splits text at the HTML element level.
Preserves context-rich information encoded in document structures.
Can return chunks element by element or combine elements with the same metadata.
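
To make the idea of header-based splitting concrete, here is a minimal sketch built only on Python's standard-library `html.parser`. It is not LangChain's implementation: `HeaderChunker` and `split_by_headers` are hypothetical names, and the sketch only groups text under the headers that are currently in scope.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware splitter (hypothetical, not the LangChain class)."""

    def __init__(self, headers):
        super().__init__()
        self.headers = dict(headers)  # e.g. {"h1": "Header 1", "h2": "Header 2"}
        self.active = {}              # header tag -> text of the header in scope
        self.in_header = None         # header tag currently being read, if any
        self.header_buf = []
        self.text_buf = []
        self.chunks = []              # list of (metadata, text) pairs

    def _flush(self):
        # Emit the text collected so far, tagged with the headers in scope.
        text = " ".join(" ".join(self.text_buf).split())
        if text:
            metadata = {self.headers[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((metadata, text))
        self.text_buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self._flush()
            self.in_header = tag
            self.header_buf = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # A new header replaces headers at the same or a deeper level.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = "".join(self.header_buf).strip()
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_buf.append(data)
        else:
            self.text_buf.append(data)

    def close(self):
        super().close()
        self._flush()


def split_by_headers(html, headers):
    parser = HeaderChunker(headers)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
header_chunks = split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for metadata, text in header_chunks:
    print(metadata, text)
```

Each chunk carries the headers above it, so the text under "Setup" is still associated with the top-level "Guide" title.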

HTMLSectionSplitter

Useful when you want to split HTML documents into larger sections, such as <section>, <div>, or custom-defined sections.

Description: Similar to HTMLHeaderTextSplitter but focuses on splitting HTML into sections based on specified tags.

Capabilities:

Uses XSLT transformations to detect and split sections.
Internally uses RecursiveCharacterTextSplitter for large sections.
Considers font sizes to determine sections.
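
The general idea behind recursive character splitting, which is applied to large sections, can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of RecursiveCharacterTextSplitter: `recursive_split` is a hypothetical function, the separator list is an arbitrary choice, and chunk overlap is ignored.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            pieces.extend(recursive_split(part, max_len, finer))
    if current:
        pieces.append(current)
    return pieces


section_text = "alpha beta\n\ngamma delta epsilon\n\nzeta"
section_pieces = recursive_split(section_text, max_len=12)
print(section_pieces)
```

Paragraph breaks are honored where possible, and only the paragraph that exceeds the limit is broken at word boundaries.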

HTMLSemanticPreservingSplitter

Ideal when you need to ensure that structured elements are not split across chunks, preserving contextual relevancy.

Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components.

Capabilities:

Preserves tables, lists, and other specified HTML elements.
Allows custom handlers for specific HTML tags.
Ensures that the semantic meaning of the document is maintained.
Provides built-in normalization and stopword removal.
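
The core idea of keeping structured elements whole can be sketched without any library. The code below is an illustration only, not how HTMLSemanticPreservingSplitter is implemented: `Segment` and `pack_segments` are hypothetical names, and ordinary text is assumed to be splittable at any word boundary.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    atomic: bool = False  # True for elements that must stay whole, e.g. a table


def pack_segments(segments, max_chunk_size):
    """Greedily pack text into chunks without ever splitting an atomic segment."""
    # Ordinary text is broken into words; atomic segments stay single units.
    units = []
    for seg in segments:
        units.extend([seg.text] if seg.atomic else seg.text.split())

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = unit  # may exceed the limit if the unit is atomic
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


packed = pack_segments(
    [
        Segment("Here is a table:"),
        Segment("Header 1 Header 2 Row 1 Row 2", atomic=True),
        Segment("Done."),
    ],
    max_chunk_size=20,
)
print(packed)
```

The table text is longer than the chunk size, yet it is emitted as one oversized chunk rather than being cut in the middle.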

Choosing the Right Splitter

Use HTMLHeaderTextSplitter when: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
Use HTMLSectionSplitter when: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
Use HTMLSemanticPreservingSplitter when: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained.

Example HTML Document

Let's use the following HTML document as an example:

html_string = """
<!DOCTYPE html>
  <html lang='en'>
  <head>
    <meta charset='UTF-8'>
    <meta name='viewport' content='width=device-width, initial-scale=1.0'>
    <title>Fancy Example HTML Page</title>
  </head>
  <body>
    <h1>Main Title</h1>
    <p>This is an introductory paragraph with some basic content.</p>
    
    <h2>Section 1: Introduction</h2>
    <p>This section introduces the topic. Below is a list:</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
      <li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
    </ul>
    
    <h3>Subsection 1.1: Details</h3>
    <p>This subsection provides additional details. Here's a table:</p>
    <table border='1'>
      <thead>
        <tr>
          <th>Header 1</th>
          <th>Header 2</th>
          <th>Header 3</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Row 1, Cell 1</td>
          <td>Row 1, Cell 2</td>
          <td>Row 1, Cell 3</td>
        </tr>
        <tr>
          <td>Row 2, Cell 1</td>
          <td>Row 2, Cell 2</td>
          <td>Row 2, Cell 3</td>
        </tr>
      </tbody>
    </table>
    
    <h2>Section 2: Media Content</h2>
    <p>This section contains an image and a video:</p>
      <img src='example_image_link.mp4' alt='Example Image'>
      <video controls width='250' src='example_video_link.mp4' type='video/mp4'>
      Your browser does not support the video tag.
    </video>

    <h2>Section 3: Code Example</h2>
    <p>This section contains a code block:</p>
    <pre><code data-lang="html">
    &lt;div&gt;
      &lt;p&gt;This is a paragraph inside a div.&lt;/p&gt;
    &lt;/div&gt;
    </code></pre>

    <h2>Conclusion</h2>
    <p>This is the conclusion of the document.</p>
  </body>
  </html>
"""

Splitting the HTML Document with Each Splitter

Using HTMLHeaderTextSplitter

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list:  \nFirst item Second item Third item with bold text and a link'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
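
One way to use the header metadata downstream is to flatten it into a breadcrumb, for example to prefix each chunk before indexing. The following is a small sketch: `breadcrumb` is a hypothetical helper, and the dictionary is copied from the third document in the output above.

```python
def breadcrumb(metadata, order=("Header 1", "Header 2", "Header 3")):
    """Join the header values that are present, from outermost to innermost."""
    return " > ".join(metadata[key] for key in order if key in metadata)


details_metadata = {
    "Header 1": "Main Title",
    "Header 2": "Section 1: Introduction",
    "Header 3": "Subsection 1.1: Details",
}
details_path = breadcrumb(details_metadata)
print(details_path)
```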

Using HTMLSectionSplitter

from langchain_text_splitters import HTMLSectionSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with  bold text  and  a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n      Your browser does not support the video tag.'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n    <div>\n      <p>This is a paragraph inside a div.</p>\n    </div>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]
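
As the output above shows, the page content keeps stray spaces and empty lines from the original markup. If that is unwanted, a post-processing step along these lines may help. `tidy` is a hypothetical helper, not part of LangChain, and it assumes that blank lines carry no meaning for your use case.

```python
def tidy(text):
    """Strip each line and drop the lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


raw_section = "Below is a list: \n \n First item \n Second item"
clean_section = tidy(raw_section)
print(clean_section)
```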

Using HTMLSemanticPreservingSplitter

Notes:

We define a custom handler to reformat the contents of code blocks.
We define a deny list of HTML elements; these elements and their contents are removed during preprocessing.
We intentionally set a small chunk size to demonstrate that preserved elements are not split.

# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]


def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"

    return code_format


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)

documents = splitter.split_text(html_string)
documents

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
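
In the output above, preserved media appear as Markdown-style placeholders such as `![image:...](...)`. Assuming that format, the media links can be pulled out of a chunk with a regular expression. This is a sketch: `extract_media` is a hypothetical helper, and the pattern is inferred from this one example output rather than from a documented format.

```python
import re

# Pattern inferred from the example output: ![image:SRC](SRC) or ![video:SRC](SRC)
MEDIA_PATTERN = re.compile(r"!\[(image|video):[^\]]*\]\(([^)]+)\)")


def extract_media(page_content):
    """Return (kind, source) pairs for each media placeholder in the text."""
    return MEDIA_PATTERN.findall(page_content)


media_chunk = (
    "This section contains an image and a video: "
    "![image:example_image_link.mp4](example_image_link.mp4) "
    "![video:example_video_link.mp4](example_video_link.mp4)"
)
media_links = extract_media(media_chunk)
print(media_links)
```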

Comparison Table

Feature	HTMLHeaderTextSplitter	HTMLSectionSplitter	HTMLSemanticPreservingSplitter
Splits based on headers	Yes	Yes	Yes
Preserves semantic elements (tables, lists)	No	No	Yes
Adds metadata for headers	Yes	Yes	Yes
Custom handlers for HTML tags	No	No	Yes
Preserves media (images, videos)	No	No	Yes
Considers font sizes	No	Yes	No
Uses XSLT transformations	No	Yes	No
