How to: Splitting HTML
Splitting HTML documents into manageable chunks is essential for text processing tasks such as natural language processing and search indexing. This guide covers three text splitters provided by LangChain for splitting HTML content: HTMLHeaderTextSplitter, HTMLSectionSplitter, and HTMLSemanticPreservingSplitter.
Each of these splitters has unique features and use cases. This guide will help you understand the differences between them, why you might choose one over the others, and how to use them effectively.
Overview of the Splitters
HTMLHeaderTextSplitter
Useful when you want to preserve the hierarchical structure of a document based on its headings.
Description: Splits HTML text based on header tags (e.g., <h1>, <h2>, <h3>, etc.), and adds metadata for each header relevant to any given chunk.
Capabilities:
- Splits text at the HTML element level.
- Preserves context-rich information encoded in document structures.
- Can return chunks element by element or combine elements with the same metadata.
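To make the idea of header metadata concrete, here is a minimal sketch built only on Python's standard-library `html.parser`. It is not LangChain's implementation: the class `HeaderChunker` and the function `split_by_headers` are hypothetical names used for illustration. The sketch remembers the most recent header at each level and attaches that set of headers to every run of text that follows.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware splitter (illustrative only, not LangChain's code)."""

    def __init__(self, headers):
        super().__init__()
        self.headers = dict(headers)      # e.g. {"h1": "Header 1"}
        self.levels = list(self.headers)  # list order defines the hierarchy
        self.active = {}                  # header metadata currently in effect
        self.in_header = None             # tag name of the header being read
        self.header_text = []             # text pieces of that header
        self.buffer = []                  # text pieces of the current chunk
        self.chunks = []                  # collected (metadata, text) pairs

    def flush(self):
        # Collapse whitespace and store the chunk with a copy of the metadata.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            self.chunks.append((dict(self.active), text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self.flush()
            # A new header replaces itself and every deeper level.
            for deeper in self.levels[self.levels.index(tag):]:
                self.active.pop(self.headers[deeper], None)
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            title = " ".join("".join(self.header_text).split())
            self.active[self.headers[tag]] = title
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.buffer.append(data)


def split_by_headers(html, headers):
    parser = HeaderChunker(headers)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the text after the last header
    return parser.chunks


sample = (
    "<h1>Title</h1><p>Intro.</p>"
    "<h2>A</h2><p>Text A.</p>"
    "<h3>A.1</h3><p>Deep.</p>"
    "<h2>B</h2><p>Text B.</p>"
)
levels = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
for metadata, text in split_by_headers(sample, levels):
    print(metadata, text)
```

Note how the chunk under `<h2>B</h2>` no longer carries the `Header 3` entry: starting a new header clears every deeper level, which is what keeps the metadata aligned with the document hierarchy.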
HTMLSectionSplitter
Useful when you want to split HTML documents into larger sections, such as <section>, <div>, or custom-defined sections.
Description: Similar to HTMLHeaderTextSplitter but focuses on splitting HTML into sections based on specified tags.
Capabilities:
- Uses XSLT transformations to detect and split sections.
- Internally uses RecursiveCharacterTextSplitter for large sections.
- Considers font sizes to determine sections.
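The fallback to recursive character splitting for oversized sections can be pictured roughly as follows. This is a simplified sketch, not the actual `RecursiveCharacterTextSplitter` (it ignores chunk overlap and other options); the function `recursive_split` and its default separators are assumptions made for illustration. The idea is to try the coarsest separator first and only fall back to finer ones for pieces that are still too long.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len characters (toy version)."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: cut at fixed positions as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_len, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # The part alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("alpha beta gamma\n\ndelta epsilon", 12))
# ['alpha beta', 'gamma', 'delta', 'epsilon']
```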
HTMLSemanticPreservingSplitter
Ideal when you need to ensure that structured elements are not split across chunks, preserving contextual relevancy.
Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components.
Capabilities:
- Preserves tables, lists, and other specified HTML elements.
- Allows custom handlers for specific HTML tags.
- Ensures that the semantic meaning of the document is maintained.
- Provides built-in text normalization and stopword removal.
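The principle of keeping structured elements whole can be illustrated with a small, self-contained sketch. It is only a simplified model of the idea, not how `HTMLSemanticPreservingSplitter` works internally; `pack_blocks` and the `(text, atomic)` block format are hypothetical. Ordinary text is packed into chunks up to a size limit and may be cut between words, while blocks flagged as atomic (a table or list, for example) are never cut, even if they exceed the limit.

```python
def pack_blocks(blocks, max_chunk_size):
    """Pack (text, atomic) blocks into chunks without cutting atomic blocks."""
    chunks = []
    current = ""
    for text, atomic in blocks:
        candidate = f"{current} {text}".strip()
        if len(candidate) <= max_chunk_size:
            current = candidate  # the block fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if atomic or len(text) <= max_chunk_size:
            # Atomic blocks stay whole, even when they exceed the limit.
            current = text
            continue
        # Ordinary oversized text is cut between words.
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) <= max_chunk_size or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks


table = "Row 1, Cell 1 | Row 1, Cell 2 | Row 1, Cell 3"
blocks = [("Intro text here.", False), (table, True), ("End.", False)]
for chunk in pack_blocks(blocks, max_chunk_size=20):
    print(repr(chunk))
```

With a limit of 20 characters the table row is far too long, yet it still comes out as a single chunk, which mirrors the trade-off described above: chunk size becomes a soft limit wherever an element must be preserved.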
Choosing the Right Splitter
- Use HTMLHeaderTextSplitter when: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
- Use HTMLSectionSplitter when: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
- Use HTMLSemanticPreservingSplitter when: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained.
Example HTML Document
Let's use the following HTML document as an example:
html_string = """
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<title>Fancy Example HTML Page</title>
</head>
<body>
<h1>Main Title</h1>
<p>This is an introductory paragraph with some basic content.</p>
<h2>Section 1: Introduction</h2>
<p>This section introduces the topic. Below is a list:</p>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
</ul>
<h3>Subsection 1.1: Details</h3>
<p>This subsection provides additional details. Here's a table:</p>
<table border='1'>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1, Cell 1</td>
<td>Row 1, Cell 2</td>
<td>Row 1, Cell 3</td>
</tr>
<tr>
<td>Row 2, Cell 1</td>
<td>Row 2, Cell 2</td>
<td>Row 2, Cell 3</td>
</tr>
</tbody>
</table>
<h2>Section 2: Media Content</h2>
<p>This section contains an image and a video:</p>
<img src='example_image_link.mp4' alt='Example Image'>
<video controls width='250' src='example_video_link.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>
<h2>Section 3: Code Example</h2>
<p>This section contains a code block:</p>
<pre><code data-lang="html">
<div>
<p>This is a paragraph inside a div.</p>
</div>
</code></pre>
<h2>Conclusion</h2>
<p>This is the conclusion of the document.</p>
</body>
</html>
"""
Splitting the HTML Document with Each Splitter
Using HTMLHeaderTextSplitter
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \nFirst item Second item Third item with bold text and a link'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
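One common use of the header metadata is to prepend a breadcrumb to each chunk before indexing or embedding it, so the chunk carries its position in the document. The sketch below is an illustration under assumptions: it uses plain `(metadata, page_content)` tuples copied from the output above as stand-ins for `Document` objects, and `breadcrumb` is a hypothetical helper, not part of LangChain.

```python
def breadcrumb(metadata, keys=("Header 1", "Header 2", "Header 3")):
    """Join the header values that are present, from outermost to innermost."""
    return " > ".join(metadata[key] for key in keys if key in metadata)


# Stand-ins for two of the Document objects shown above.
splits = [
    (
        {"Header 1": "Main Title"},
        "This is an introductory paragraph with some basic content.",
    ),
    (
        {
            "Header 1": "Main Title",
            "Header 2": "Section 1: Introduction",
            "Header 3": "Subsection 1.1: Details",
        },
        "This subsection provides additional details. Here's a table:",
    ),
]

for metadata, page_content in splits:
    print(f"[{breadcrumb(metadata)}] {page_content}")
```

With real `Document` objects the same helper could be applied to `doc.metadata` and `doc.page_content`.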
Using HTMLSectionSplitter
from langchain_text_splitters import HTMLSectionSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with bold text and a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n <div>\n <p>This is a paragraph inside a div.</p>\n </div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]
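The section output above keeps much of the whitespace from the original markup (runs of `" \n "`). If that is unwanted downstream, a small post-processing step can tidy it up. The helper below is a hypothetical standard-library sketch, not a LangChain feature: it collapses spaces and tabs within each line and drops lines that end up empty.

```python
import re


def tidy(text):
    """Collapse runs of spaces/tabs per line and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag."
print(tidy(raw))
```

Applied to `page_content`, this keeps one line per text node, so the section title stays on its own first line.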
Using HTMLSemanticPreservingSplitter
Notes:
- We define a custom handler to reformat the contents of code blocks.
- We define a deny list of HTML elements; these elements and their contents are removed during preprocessing.
- We intentionally set a small chunk size to demonstrate that preserved elements are not split.
# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]


def code_handler(element: Tag) -> str:
    # Wrap the code block's text in a marker that records its data-lang attribute
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"
    return code_format


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)
documents = splitter.split_text(html_string)
documents
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
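Because preserved images and videos appear in `page_content` as Markdown-style markers, they can be pulled back out with a regular expression, for instance to build a media index per chunk. This is a sketch under assumptions: the marker format is inferred from the output shown above and may differ in other versions, and `extract_media` is a hypothetical helper.

```python
import re

# Matches markers such as ![image:some_link](some_link)
MEDIA_MARKER = re.compile(r"!\[(image|video):[^\]]*\]\(([^)]*)\)")


def extract_media(page_content):
    """Return (kind, url) pairs for every media marker in the text."""
    return MEDIA_MARKER.findall(page_content)


content = (
    "This section contains an image and a video: "
    "![image:example_image_link.mp4](example_image_link.mp4) "
    "![video:example_video_link.mp4](example_video_link.mp4)"
)
print(extract_media(content))
# [('image', 'example_image_link.mp4'), ('video', 'example_video_link.mp4')]
```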
Comparison Table
Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
---|---|---|---|
Splits based on headers | Yes | Yes | Yes |
Preserves semantic elements (tables, lists) | No | No | Yes |
Adds metadata for headers | Yes | Yes | Yes |
Custom handlers for HTML tags | No | No | Yes |
Preserves media (images, videos) | No | No | Yes |
Considers font sizes | No | Yes | No |
Uses XSLT transformations | No | Yes | No |
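The comparison table can also be expressed as data, which is handy when the choice of splitter is made programmatically. The sketch below simply transcribes the table; the feature keys and the `candidates` helper are hypothetical names chosen for illustration, and the splitters are referenced by name only, so no LangChain import is needed.

```python
# Feature sets transcribed from the comparison table above.
FEATURES = {
    "HTMLHeaderTextSplitter": {"header_splitting", "header_metadata"},
    "HTMLSectionSplitter": {
        "header_splitting",
        "header_metadata",
        "font_sizes",
        "xslt",
    },
    "HTMLSemanticPreservingSplitter": {
        "header_splitting",
        "header_metadata",
        "semantic_elements",
        "custom_handlers",
        "media",
    },
}


def candidates(required):
    """Return the names of splitters that offer every required feature."""
    required = set(required)
    return [name for name, offered in FEATURES.items() if required <= offered]


print(candidates({"header_metadata", "semantic_elements"}))
# ['HTMLSemanticPreservingSplitter']
```

An empty result, for example when asking for both `"xslt"` and `"media"`, signals that no single splitter covers the requirements.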