Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks

By Jessie Hobb On Jan 10, 2024

How to summarize a geolocation table by merging contiguous network IPv4 blocks

Photo by Pascal Bernardon on Unsplash

Every device on the internet is uniquely addressable via an Internet Protocol (IP) address. The global IP address space is overseen by the Internet Assigned Numbers Authority (IANA). Traditionally, IANA allocates address space in /8 prefix blocks for IPv4, which are subsequently assigned to Internet Service Providers and other organizations. Various databases exist to map these IP blocks to their respective owners, along with information about the originating country and city.

As the Canadian national CSIRT, we, the Canadian Centre for Cyber Security, rely heavily on referencing these databases for looking up a given IP or enhancing entire datasets through SQL JOINs. However, not all use cases necessitate precision up to the city level; sometimes, country information alone suffices.

Within a country, many network blocks are contiguous. Consolidating these into mega blocks can significantly reduce the size of a table mapping mega blocks to countries, leading to improved JOIN operations.

In this article, we will demonstrate how to summarize a geolocation table by merging contiguous network blocks.

Let’s assume our geolocation table contains the following data:

+----------+-------+---------+-----------+-----------+
| start_ip | end_ip| country |   city    |   owner   |
+----------+-------+---------+-----------+-----------+
| 1        | 2     | ca      | Toronto   | Telus     |
| 3        | 4     | ca      | Quebec    | Rogers    |
| 5        | 8     | ca      | Vancouver | Bell      |
| 10       | 14    | ca      | Montreal  | Telus     |
| 19       | 22    | ca      | Ottawa    | Rogers    |
| 23       | 29    | ca      | Calgary   | Videotron |
+----------+-------+---------+-----------+-----------+

Here, start_ip represents the lowest number in the network block, and end_ip represents the largest. Normally, these numbers are much larger. For example, Google’s DNS server 8.8.8.8 is represented by the number 134744072. We use simple synthetic values for illustration purposes.

To start, let’s perform a simple summarization. For example, counting the number of IP addresses assigned to each country. This can be achieved by grouping the data by country and summing the number of IPs in each network block.

SELECT
    country,
    SUM(end_ip - start_ip + 1) as num_ip
FROM
    geo_table
GROUP BY
    country

This statement groups rows by country and applies a SUM aggregation function, calculating the total number of IPs for each country. It’s important to note that the SUM aggregation is associative, meaning the order in which you sum does not matter, similar to additions in mathematics.

+---------+--------+
| country | num_ip |
+---------+--------+
| ca      | 24     |
+---------+--------+

Now, let’s delve into the complexities of aggregating contiguous network blocks. Referring to our original table, we need to fuse together the first 3 rows. Bocks 1–2, 3–4, 5–8 should result in the mega block 1–8. We also need to fuse the last 2 rows. Blocks 19–22 and 23–29 result into 19–29. Our goal is to produce the following table:

+----------+-------+---------+
| start_ip | end_ip| country |
+----------+-------+---------+
| 1        | 8     | ca      |
| 10       | 14    | ca      |
| 19       | 29    | ca      |
+----------+-------+---------+

Detecting contiguous blocks requires inter-row information, and the order of rows becomes crucial. Fortunately, windowed analytical functions provide a solution by offering a mechanism for inter-record referencing. These functions, such as LEAD and LAG, enable comparisons with the values of the previous and subsequent rows, facilitating the identification of contiguous IP blocks.

Let’s apply the LEAD and LAG windowing functions to our table. Notice that within the OVER clause we still specify that our data should be grouped by country PARTITION BY country but additionally we specify the ordering of this data ORDER BY start_ip.

SELECT
    *,
    LAG(end_ip) OVER (
      PARTITION BY country 
      ORDER BY start_ip) AS prev_end_ip,
    LEAD(start_ip) OVER (
      PARTITION BY country 
      ORDER BY start_ip) AS next_start_ip
FROM
    geo_table

The resulting table view_1 is as follows:

+----------+-------+---------+-------------+---------------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip |
+----------+-------+---------+-------------+---------------+
| 1        | 2     | ca      | null        | 3             |
| 3        | 4     | ca      | 2           | 5             |
| 5        | 8     | ca      | 4           | 10            |
| 10       | 14    | ca      | 8           | 19            |
| 19       | 22    | ca      | 14          | 23            |
| 23       | 29    | ca      | 22          | null          |
+----------+-------+---------+-------------+---------------+

It’s crucial to distinguish between windowing functions and simple GROUP BY functions. In an OVER() operation, the results of LEAD and LAG are appended to every row, providing context for the previous and next row’s information. This is distinct from functions in a GROUP BY clause that reduce a group of rows into a single aggregation result.

Now that we have access to the details of both the preceding and succeeding rows, we can facilitate inter-row comparisons. This comparison is crucial for identifying contiguous IP blocks, enabling us to determine when to fuse adjacent blocks together.

Each row can fall into one of four states:
1) Remove: The block is contiguous with both the previous and next blocks.
2) Start: The block is contiguous with only the next block.
3) End: The block is contiguous with only the previous block.
4) Keep: The block is not contiguous with either the previous nor the next block.

Let’s add this state column to our table.

SELECT
    *,
    CASE
        WHEN (end_ip = next_start_ip - 1) 
           AND (start_ip = prev_end_ip + 1) THEN 'remove'
        WHEN (end_ip = next_start_ip - 1) THEN 'start'
        WHEN (start_ip = prev_end_ip + 1) THEN 'end'
        ELSE 'keep'
    END AS state
FROM
    view_1

We obtain the following view_2 result:

+----------+-------+---------+-------------+---------------+-------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip | state |
+----------+-------+---------+-------------+---------------+-------+
| 1        | 2     | ca      | null        | 3             | start |
| 3        | 4     | ca      | 2           | 5             | remove|
| 5        | 8     | ca      | 4           | 10            | end   |
| 10       | 14    | ca      | 8           | 19            | keep  |
| 19       | 22    | ca      | 14          | 23            | start |
| 23       | 29    | ca      | 22          | null          | end   |
+----------+-------+---------+-------------+---------------+-------+

We can proceed to remove the rows falling between the start and end blocks, specifically those identified with state remove.

SELECT
    start_ip,
    end_ip,
    country,
    state
FROM
    view_2
WHERE
    state IN ('start', 'end', 'keep')

Resulting in view_3:

+----------+-------+---------+-------+
| start_ip | end_ip| country | state |
+----------+-------+---------+-------+
| 1        | 2     | ca      | start |
| 5        | 8     | ca      | end   |
| 10       | 14    | ca      | keep  |
| 19       | 22    | ca      | start |
| 23       | 29    | ca      | end   |
+----------+-------+---------+-------+

We are getting close to our goal! All we have to do now is merge the start and end rows, which contain the start_ip and end_ip of the mega blocks we are seeking to produce. To achieve this, we once again use a window function. This time to fetch the end_ip from the end row.

SELECT
    *,
    LEAD(end_ip) OVER (
      PARTITION BY country 
      ORDER BY start_ip) AS next_end_ip
FROM
    view_3

Result in view_4:

+----------+-------+---------+-------+-------------+
| start_ip | end_ip| country | state | next_end_ip |
+----------+-------+---------+-------+-------------+
| 1        | 2     | ca      | start | 8           |
| 5        | 8     | ca      | end   | 14          |
| 10       | 14    | ca      | keep  | 22          |
| 19       | 22    | ca      | start | 29          |
| 23       | 29    | ca      | end   | null        |
+----------+-------+---------+-------+-------------+

Notice that the rows with state start now have a start_ip and a next_end_ip, the information necessary to build a mega block.

The rows with state end are no longer needed and can be removed.

The rows with state keep already have the correct end_ip.

We can now determine the final_end value of the mega blocks. Two cases are possible:
1) for a start row, we obtain the end value from the next_end_ip.
2) for a keep row, we simply use the original end_ip value.

SELECT
    start_ip AS final_start,
    CASE
        WHEN (state = 'start') THEN next_end_ip
        WHEN (state = 'keep') THEN end_ip
        ELSE NULL
    END AS final_end_ip
FROM
    view_4
WHERE
    state IN ('start', 'keep')

We thus achieve our goal of fusing contiguous IPv4 blocks into mega blocks.

+----------+-------+---------+-------+-------------+------------+----------+
| start_ip | end_ip| country | state | next_end_ip | final_start| final_end|
+----------+-------+---------+-------+-------------+------------+----------+
| 1        | 2     | ca      | start | 8           | 1          | 8        |
| 10       | 14    | ca      | keep  | 22          | 10         | 14       |
| 19       | 22    | ca      | start | 29          | 19         | 29       |
+----------+-------+---------+-------+-------------+------------+----------+

Putting all the statements above together, we obtain a final statement:

SELECT
    country,
    final_start,
    IF(state = 'start', next_end_ip, final_end) AS final_end
FROM (
    SELECT
        country,
        start_ip AS final_start,
        end_ip AS final_end,
        LEAD(end_ip) OVER (
          PARTITION BY country 
          ORDER BY start_ip) AS next_end_ip
    FROM (
        SELECT
            start_ip,
            end_ip,
            country,
            CASE
                WHEN (end_ip = next_start_ip - 1) 
                  AND (start_ip = prev_end_ip + 1) THEN 'remove'
                WHEN (end_ip = next_start_ip - 1) THEN 'start'
                WHEN (start_ip = prev_end_ip + 1) THEN 'end'
                ELSE 'keep'
            END AS state
        FROM (
            SELECT
                *,
                LAG(end_ip) OVER (
                    PARTITION BY country 
                    ORDER BY start_ip) AS prev_end_ip,
                LEAD(start_ip) OVER (
                    PARTITION BY country 
                    ORDER BY start_ip) AS next_start_ip
            FROM 
                geo_table
        )
        WHERE 
            state IN ('start', 'end', 'keep')
    )
)
WHERE 
    state IN ('start', 'keep')

In conclusion, SQL analytical window functions offer a robust framework for complex data analysis. They empower users to perform aggregations while retaining the context of individual rows, facilitating tasks such as running totals, averages, and percentile calculations. Moreover, these functions play a crucial role in ranking, time-series analysis, and detecting anomalies and outliers within datasets. These functions are indispensable assets in the toolkit of data practitioners.

Analytical window functions are very powerful. In this article, we only scratched the surface; for example, we did not make use of a window_frame. A window frame allows you to further refine which rows are considered in the aggregation. Window frames are relative to the current row and can be based on the number of rows or time intervals, making these functions indispensable for a wide range of analyses. You can learn more about these features in the Spark documentation: Spark SQL — Window Operations .

Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.