How to: Splitting HTML

Splitting HTML documents into manageable chunks is essential for text processing tasks such as natural language processing and search indexing. This guide covers three text splitters provided by LangChain that you can use to split HTML content:

  • HTMLHeaderTextSplitter
  • HTMLSectionSplitter
  • HTMLSemanticPreservingSplitter

Each of these splitters has unique features and use cases. This guide will help you understand the differences between them, why you might choose one over the others, and how to use them effectively.


Overview of the Splitters

HTMLHeaderTextSplitter

Useful when you want to preserve the hierarchical structure of a document based on its headings.

Description: Splits HTML text on header tags (e.g., <h1>, <h2>, <h3>) and attaches the headers relevant to each chunk as metadata.

Capabilities:

  • Splits text at the HTML element level.
  • Preserves context-rich information encoded in document structures.
  • Can return chunks element by element or combine elements with the same metadata.
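
The idea behind header-based splitting can be sketched with the Python standard library alone. This is a rough illustration, not LangChain's implementation: the class HeaderChunker and the function split_by_headers are hypothetical names, chunks are plain dicts rather than Document objects, and the sketch assumes well-formed HTML with header tags named h1 to h6.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-based splitter: groups text under the headers in effect."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.header_names = dict(headers_to_split_on)
        self.active = {}        # header text currently in effect, keyed by tag
        self.in_header = None   # tag of the header being read, if any
        self.header_text = []   # text pieces of the header being read
        self.buffer = []        # text collected since the last header
        self.chunks = []

    def _flush(self):
        # Emit the buffered text (whitespace-normalized) as one chunk.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {
                self.header_names[tag]: value
                for tag, value in sorted(self.active.items())
            }
            self.chunks.append({"metadata": metadata, "page_content": text})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self._flush()
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # Keep only shallower headers ("h1" < "h2" as strings),
            # then record the new header at its own level.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = " ".join("".join(self.header_text).split())
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def split_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Guide</h1><p>Intro.</p><h2>Setup</h2><p>Install it.</p>"
for chunk in split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(chunk)
```

Each chunk carries every header that was in effect when its text appeared, which is what lets a downstream consumer recover the position of a chunk in the document hierarchy.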

HTMLSectionSplitter

Useful when you want to split HTML documents into larger sections, such as <section>, <div>, or custom-defined sections.

Description: Similar to HTMLHeaderTextSplitter but focuses on splitting HTML into sections based on specified tags.

Capabilities:

  • Uses XSLT transformations to detect and split sections.
  • Internally uses RecursiveCharacterTextSplitter for large sections.
  • Considers font sizes to determine sections.
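
To give a feel for what recursive character splitting does to a section that is too large, here is a minimal standard-library sketch. It is not LangChain's RecursiveCharacterTextSplitter, which has more options (chunk overlap, for example); the function name recursive_split and its parameters are assumptions made for illustration.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len characters.

    Tries the coarsest separator first and only falls back to finer ones
    (and finally to a hard cut) for pieces that are still too long.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut at the size limit.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


section = "alpha beta gamma\n\ndelta epsilon zeta eta theta\n\niota"
for chunk in recursive_split(section, max_len=12):
    print(repr(chunk))
```

Paragraph breaks are tried first, so text is only cut at line or word boundaries when a paragraph does not fit within the limit.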

HTMLSemanticPreservingSplitter

Ideal when you need to ensure that structured elements are not split across chunks, preserving contextual relevancy.

Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components.

Capabilities:

  • Preserves tables, lists, and other specified HTML elements.
  • Allows custom handlers for specific HTML tags.
  • Ensures that the semantic meaning of the document is maintained.
  • Offers built-in text normalization and stopword removal.
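
One way to keep structured elements whole is to swap them for short placeholders before size-based splitting and restore them afterwards. The sketch below illustrates that idea with regular expressions from the standard library. It is an assumption-laden toy, not LangChain's code: split_preserving is a hypothetical name, the regex does not handle nested elements of the same tag, and real HTML should be parsed with a proper parser.

```python
import re


def split_preserving(html, max_len, tags=("table", "ul", "ol")):
    """Split on sentence ends while keeping the listed elements intact."""
    protected = {}

    def stash(match):
        # Remember the element and hand back a short placeholder token.
        key = f"@@BLOCK{len(protected)}@@"
        protected[key] = match.group(0)
        return key

    # 1. Swap each protected element for its placeholder.
    pattern = re.compile(
        r"<(%s)\b[^>]*>.*?</\1>" % "|".join(map(re.escape, tags)),
        flags=re.DOTALL | re.IGNORECASE,
    )
    text = pattern.sub(stash, html)

    # 2. Greedily pack sentences into chunks of at most max_len characters
    #    (a single sentence longer than max_len becomes its own chunk).
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)

    # 3. Restore the protected elements; a chunk may now exceed max_len.
    restored = []
    for chunk in chunks:
        for key, block in protected.items():
            chunk = chunk.replace(key, block)
        restored.append(chunk)
    return restored


page = (
    "Intro sentence here. "
    "<table><tr><td>Row 1. Cell A</td></tr><tr><td>Row 2. Cell B</td></tr></table> "
    "Closing sentence here."
)
for chunk in split_preserving(page, max_len=30):
    print(chunk)
```

The table contains sentence-ending periods, yet it is never cut, because the splitter only ever sees the placeholder. The trade-off is that a chunk holding a restored element can be larger than the nominal chunk size.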

Choosing the Right Splitter

  • Use HTMLHeaderTextSplitter when: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
  • Use HTMLSectionSplitter when: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
  • Use HTMLSemanticPreservingSplitter when: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained.

Example HTML Document

Let's use the following HTML document as an example:

html_string = """
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<title>Fancy Example HTML Page</title>
</head>
<body>
<h1>Main Title</h1>
<p>This is an introductory paragraph with some basic content.</p>

<h2>Section 1: Introduction</h2>
<p>This section introduces the topic. Below is a list:</p>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
</ul>

<h3>Subsection 1.1: Details</h3>
<p>This subsection provides additional details. Here's a table:</p>
<table border='1'>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1, Cell 1</td>
<td>Row 1, Cell 2</td>
<td>Row 1, Cell 3</td>
</tr>
<tr>
<td>Row 2, Cell 1</td>
<td>Row 2, Cell 2</td>
<td>Row 2, Cell 3</td>
</tr>
</tbody>
</table>

<h2>Section 2: Media Content</h2>
<p>This section contains an image and a video:</p>
<img src='example_image_link.mp4' alt='Example Image'>
<video controls width='250' src='example_video_link.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>

<h2>Section 3: Code Example</h2>
<p>This section contains a code block:</p>
<pre><code data-lang="html">
&lt;div&gt;
&lt;p&gt;This is a paragraph inside a div.&lt;/p&gt;
&lt;/div&gt;
</code></pre>

<h2>Conclusion</h2>
<p>This is the conclusion of the document.</p>
</body>
</html>
"""

Splitting the HTML Document with Each Splitter

Using HTMLHeaderTextSplitter

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \nFirst item Second item Third item with bold text and a link'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
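
Because each chunk carries its headers as metadata, the position of a chunk in the document can be rebuilt from the metadata alone. The helper below is a hypothetical convenience, not part of LangChain; it works on a plain metadata dict and assumes the key names configured in headers_to_split_on above.

```python
def header_path(metadata, keys=("Header 1", "Header 2", "Header 3")):
    """Join the header values present in a chunk's metadata, outermost first."""
    return " > ".join(metadata[key] for key in keys if key in metadata)


# Metadata copied from the third chunk in the output above.
metadata = {
    "Header 1": "Main Title",
    "Header 2": "Section 1: Introduction",
    "Header 3": "Subsection 1.1: Details",
}
print(header_path(metadata))
# Main Title > Section 1: Introduction > Subsection 1.1: Details
```

A string like this can be prepended to a chunk's text before indexing so that short chunks keep some of their surrounding context.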

Using HTMLSectionSplitter

from langchain_text_splitters import HTMLSectionSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with bold text and a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n <div>\n <p>This is a paragraph inside a div.</p>\n </div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]

Using HTMLSemanticPreservingSplitter

Notes:

  1. We have defined a custom handler to reformat the contents of code blocks.
  2. We have defined a deny list of HTML elements; these elements and their contents are removed during preprocessing.
  3. We have intentionally set a small chunk size to demonstrate that preserved elements are not split.

# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]


def code_handler(element: Tag) -> str:
    # Wrap the code text in a tag that records its language.
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"

    return code_format


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)

documents = splitter.split_text(html_string)
documents
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

Comparison Table

| Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
| --- | --- | --- | --- |
| Splits based on headers | Yes | Yes | Yes |
| Preserves semantic elements (tables, lists) | No | No | Yes |
| Adds metadata for headers | Yes | Yes | Yes |
| Custom handlers for HTML tags | No | No | Yes |
| Preserves media (images, videos) | No | No | Yes |
| Considers font sizes | No | Yes | No |
| Uses XSLT transformations | No | Yes | No |
