Validate a String as HTML Using SQL | by Dhruv Matani | Jan, 2023

By Jessie Hobb On Jan 19, 2023

Is it possible to check if a string contains valid HTML using just SQL?

Checking if a string is valid HTML is a really complex task, and isn’t something that can be accomplished trivially. In fact, writing a generic HTML string validator is a fairly involved task and isn’t something I hope to cover here. If you want to validate a string a containing valid XML/HTML in PostgreSQL, you should probably be using the XML data type or XML functions library.

We would like to write a SQL query which when fed an input string is able to return TRUE or FALSE based on whether the provided string is valid HTML or not.

That said, let’s dive into the restricted problem of validating if a string is valid HTML as defined by us. What do we consider valid HTML for the purposes of this article?

Should contain balanced open and close named tags. For example, <html> should be paired with </html> after it.
Correct pairing of balanced tags. For example <a></a> is valid, whereas <a></a> is invalid.
The only unpaired tags allowed are and .
We assume that there aren’t any malformed tags such as <a/b>.
We assume that the text itself doesn’t contain the < and > characters, and that they are properly escaped as < and > respectively.
There are no attributed inside HTML tags. For example, <body> is valid, but <body class=”foo”> is invalid.

XML/HTML can not be parsed using regular expressions, since HTML is not a regular language.

We would like to be able to validate the following HTML documents.

Valid documents

<html>
<head>
</head>
<html>

<html>
<body>
<table>
<tr><td>Hello</td></tr>
<tr>
<td>
<table><tr><td>World</td></tr></table>
</td>
</tr>
</table>
</body>
</html>

<html>
<body>
<div>Hello</div>
</body>
</html>

Invalid documents

[1] In the document below, the </head> tag is missing.

<html>
<head>
</html>

[2] In the document below, the </html> tag appears before the </head> tag.

<html>
<head>
</html>
</head>

First, we tokenize the input string and extract the open and close tags as they appear in the document string. We use the below to extract the HTML tags into a separate table with one row per HTML tag in the document.

SELECT
REGEXP_MATCHES(page, '</?[^>]+>', 'g') AS html_tag
FROM html_string

This table is then again processed to assign the following to each row:

row_num: A row number.
root_tag: If the tag is an open tag, it’s the same as html_tag, else it’s the tag with the / character removed so that the close tag’s corresponding open tag is present in this column.
delta: +1 or -1 depending on whether this is an open or a close HTML tag.

SELECT
ROW_NUMBER() OVER () AS row_num,
html_tag[1],
REPLACE(html_tag[1], '/', '') AS root_tag,
CASE WHEN html_tag[1] LIKE '%/%' THEN -1 ELSE +1 END AS delta
FROM page_as_rows

We then remove all the unpaired tags (i.e. tags that are not supposed to have an open and close pair). For this article, there are only 2 such tags, namely and .

unpaired_tags(tag_name) AS (
-- Lets track tags that don't need a close tag and then
-- use this set to eliminate such tags from our list of
-- tags since they tags won't contribute to making the
-- input string invalid HTML.
VALUES('<br>'), ('<br/>')
),only_paired_tags AS (
-- Use the set "unpaired_tags" to keep only tags that
-- have an open and close pair.
SELECT
*
FROM tags_numbered
WHERE root_tag NOT IN (SELECT tag_name from unpaired_tags)
),

The first solution is seemingly correct but fails to correctly identify our 2nd invalid document above as invalid HTML.

The main idea behind this solution is to look at every prefix of the input tags and maintain a separate counter for each type of tag encountered. We can do this by GROUPing on the root_tag column.

We compute the running sum for each root tag. If the running sum ever becomes negative (for any prefix of the tags), then it means we have a close tag before an open tag.
We check the running sum for the tags in the last prefix (which is basically the entire document). If this running sum has a value other than 0 for any tag, the document is invalid. We covered the negative case above. In case this sum is positive, it means that we have an unclosed open tag somewhere.

WITH page_as_rows AS (
-- This returns a single row for every HTML open or close
-- tag in the original string. We assume an HTML tag to be
-- a string enclosed in angle brackets with an optional /
-- to indicate a close tag.
SELECT
REGEXP_MATCHES(page, '</?[^>]+>', 'g') AS html_tag
FROM html_string
),tags_numbered AS (
-- Let's number the rows so that we know what order the tags
-- are present in the original input string (document). We
-- also replace the close tag with the open tag (by removing
-- the / character) so that we can later match up an open and
-- a close tag. We keep track of whether a tag is an open or
-- close tag by assigning the value +1 or -1 (in a column
-- named delta) along with each row. This will later help
-- us determine if any prefix of the input string is a
-- valid intermediate state for an HTML document.
SELECT
ROW_NUMBER() OVER () AS row_num,
html_tag[1],
REPLACE(html_tag[1], '/', '') AS root_tag,
CASE WHEN html_tag[1] LIKE '%/%' THEN -1 ELSE +1 END AS delta
FROM page_as_rows
),
unpaired_tags(tag_name) AS (
-- Lets track tags that don't need a close tag and then
-- use this set to eliminate such tags from our list of
-- tags since they tags won't contribute to making the
-- input string invalid HTML.
VALUES('<br>'), ('<br/>')
),
only_paired_tags AS (
-- Use the set "unpaired_tags" to keep only tags that
-- have an open and close pair.
SELECT
*
FROM tags_numbered
WHERE root_tag NOT IN (SELECT tag_name from unpaired_tags)
),
self_joined AS (
-- This the main logic. We generate a prefix of the document
-- ending at every tag. We generate O(n^2) such prefixes and
-- use that to check if that prefix is a valid prefix of any
-- HTML document.
SELECT
rhs.row_num, lhs.html_tag, lhs.root_tag, lhs.delta
FROM only_paired_tags lhs INNER JOIN only_paired_tags rhs
ON lhs.row_num <= rhs.row_num
ORDER BY rhs.row_num ASC, lhs.row_num ASC
),
grouped AS (
-- Do check validity, we need to first determine:
-- 1. If we have seen a close tag before its corresponding
--    open tag. If this happens, then nthe running sum
--    for one of the tag elements will become negative (<0).
-- 2. If the final row of aggregated tag running sums has
--    all zeros. If there is a value that is greater than 0
--    it indicates that we have an open tag without a
--    corresponding close tag. While this is a valid
--    "intermediate" state to be in, it is NOT a valid
--    "final" state to be in.
SELECT
row_num, root_tag, SUM(delta) AS delta_sum
FROM self_joined
GROUP BY 1, 2
),
min_max_for_prefix AS (
-- Compute the MIN and MAX delta sums across all tags
-- for every prefix of the input string.
SELECT
row_num,
MIN(delta_sum) AS min_delta_sum,
MAX(delta_sum) AS max_delta_sum
FROM grouped
GROUP BY 1
ORDER BY row_num ASC
),
overall_min_max AS (
-- Then check if any of the running sums are negative OR
-- if the running sum for the entire document is a value
-- other than 0.
SELECT
MIN(min_delta_sum) AS overall_min,
MAX(
CASE WHEN row_num = (
SELECT MAX(row_num) FROM min_max_for_prefix
) THEN max_delta_sum ELSE NULL END
) AS last_row_max
FROM min_max_for_prefix
)
SELECT * FROM overall_min_max;

This solution fails to invalidate the 2nd invalid HTML document because we fail to track the order in which the close tags appear relative to other tags in the document. While we correctly check if a close tag appears after its corresponding open tag, we don’t check if there are any OTHER unclosed open tag between this close tag and its corresponding open tag.

For example, in the example below, the <head> tag on line 2 is unclosed whereas the enclosing <html> tag (on line 1) has been closed by the </html> tag (on line 3).

1  <html>
2    <head>
3    </html>
4  </head>

Runtime complexity: The runtime complexity of this solution is O(n²), where n is the number of tags in the document. This is because we join each tag with every other tag before it in the query above. This is the dominant cost in the entire solution.

The “inside out solution” processes strings from the innermost matching pair of open and close HTML tags. This solution relies on the fact that there is at least 1 matching open/close pair of tags that appear next to each other (on adjacent rows) in a valid HTML document.

If we remove this matching pair, then we can find and eliminate the next matching pair, till no more tags remain (in a valid HTML document). We know that in a document with 2N tags, we will perform this matching and eliminate process at most N times to reach an empty list of tags. If after N rounds of matching and elimination, we still have some tags left, it indicates an invalid HTML document.

Example-1: For the input below,

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
</td>
</tr>
<tr>
<td>
</td>
</tr>
</table>
</body>
</html>

This is what the recursive execution looks like and the animation below shows the order in which the pairs of open/close tags are matched and eliminated.

Second solution processing a valid HTML string (Image by author)

Example-2: For the input below,

<html>
<body>
<table>
<tr>
</tr>
</body>
</table>
</html>

This is what the recursive execution looks like and the animation below shows the order in which the pairs of open/close tags are matched and eliminated. Since this is invalid input, the processing stops when there’s no matching pair of adjacent tags left to process.

Second solution processing an invalid HTML string (Image by author)

-- Actual solution
WITH RECURSIVE page_as_rows AS (
-- This returns a single row for every HTML open or close
-- tag in the original string. We assume an HTML tag to be
-- a string enclosed in angle brackets with an optional /
-- to indicate a close tag.
SELECT
REGEXP_MATCHES(page, '</?[^>]+>', 'g') AS html_tag
FROM html_string
),tags_numbered AS (
-- Let's number the rows so that we know what order the tags
-- are present in the original input string (document). We
-- also replace the close tag with the open tag (by removing
-- the / character) so that we can later match up an open and
-- a close tag. We keep track of whether a tag is an open or
-- close tag by assigning the value +1 or -1 (in a column
-- named delta) along with each row. This will later help
-- us determine if any prefix of the input string is a
-- valid intermediate state for an HTML document.
SELECT
ROW_NUMBER() OVER () AS row_num,
html_tag[1] AS html_tag,
REPLACE(html_tag[1], '/', '') AS root_tag,
CASE WHEN html_tag[1] LIKE '%/%' THEN -1 ELSE +1 END AS delta
FROM page_as_rows
),
unpaired_tags(tag_name) AS (
-- Lets track tags that don't need a close tag and then
-- use this set to eliminate such tags from our list of
-- tags since they tags won't contribute to making the
-- input string invalid HTML.
VALUES('<br>'), ('<br/>')
),
only_paired_tags AS (
-- Use the set "unpaired_tags" to keep only tags that
-- have an open and close pair.
SELECT
*
FROM tags_numbered
WHERE root_tag NOT IN (SELECT tag_name from unpaired_tags)
),
max_num_rounds AS (
-- The maximum number of recursive rounds of paired tag
-- elimination that we want to run.
SELECT
COUNT(1) / 2 + 2 AS max_rounds
FROM only_paired_tags
),
-- This recursive query matches and eliminates paired tags in an
-- "inside out" manner by locating adjacent matching paired tags
-- and removing them.
eliminate_paired_tags AS (
SELECT
row_num,
html_tag,
root_tag,
delta,
(SELECT max_rounds FROM max_num_rounds) AS iters
FROM only_paired_tags
UNION ALL
(
WITH lead_lag AS (
SELECT
row_num,
html_tag,
root_tag,
delta,
iters - 1 AS iters,
LAG(root_tag) OVER(ORDER BY row_num ASC) AS lag_root_tag,
LAG(delta) OVER(ORDER BY row_num ASC) AS lag_delta,
LEAD(root_tag) OVER(ORDER BY row_num ASC) AS lead_root_tag,
LEAD(delta) OVER(ORDER BY row_num ASC) AS lead_delta
FROM eliminate_paired_tags
WHERE iters - 1 > -1
),
truncated AS (
SELECT
row_num,
html_tag,
root_tag,
delta,
iters
FROM lead_lag
-- Eliminate correctly paired tags. We eliminate a pair of rows
-- independently.
--
-- 1. If a row contains an open tag and is followed by a
-- corresponding close tag, we can eliminate the row with
-- this open tag.
--
-- 2. If a row contains an close tag and is preceeded by a
-- corresponding open tag, we can eliminate the row with
-- this close tag.
WHERE NOT (
(
root_tag = lag_root_tag AND delta = -1 AND lag_delta = +1
) OR (
root_tag = lead_root_tag AND delta = +1 AND lead_delta = -1
)
)
)
SELECT * FROM truncated
)
),
from_last_round AS (
SELECT * FROM eliminate_paired_tags
WHERE iters = 0
)
SELECT
CASE
WHEN COUNT(1) > 0
THEN 'Invalid HTML'
ELSE 'Valid HTML'
END
AS validation_status
FROM from_last_round;

Runtime complexity: The runtime complexity of this solution is O(n²), where n is the number of tags in the document. We analyze the cost on both a valid as well as an invalid input:

Valid Input: In the worst case, we will eliminate a single pair of tags, and we need O(n) rounds to eliminate all the n/2 pairs of tags in the document. In each round, we need to retain and re-process all the remaining rows. We start with n rows of data and in the worst case reduce 2 rows in every recursive step. Hence, the number of rows shrinks as n, n-2, n-4, n-6, …, 0 (n/2 times), which sums up to O(n²).
Invalid Input: In case of an invalid input, we will fail to match a pair of open/close tags, and we will not eliminate any row for the entire duration of the n recursive rounds. Hence, the number of rows in each of the n rounds looks like n, n, n, …, n (n-times), which sums up to O(n²).

The SQL Fiddle link to all the solutions in this post can be found here.

We saw a way to check if a string is valid HTML or not. This method utilizes recursive CTEs to iteratively reduce the problem set size at every step.

Recursive CTEs in SQL are powerful tools that can solve a variety of problems if used creatively. However, recursive CTEs aren’t very space efficient. Where traditional imperative programming languages allow you to perform in-place updates, recursive CTEs require you to copy the data at every step.