Techno Blender
Browsing Tag: Cote

Anomaly Detection using Sigma Rules (Part 5): Flux Capacitor Optimization | by Jean-Claude Cote | Mar, 2023

To boost performance, we implement a forgetful bloom filter and a custom Spark state store provider. (Photo by Leora Winter on Unsplash, Shippagan, NB, Canada.) This is the 5th article of our series; refer to part 1, part 2, part 3 and part 4 for some context. In our previous articles, we have demonstrated the performance gains achieved by using a bloom filter. We also showed how we leveraged a bloom filter to implement temporal proximity correlations, parent/child and ancestor relationships. So far we have been using a single…
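The article itself covers the custom Spark state store details; purely as an illustration of the "forgetting" idea, here is a minimal Python sketch (all class and parameter names are hypothetical, not the series' actual code) that ages entries out by cycling two plain bloom filters: inserts go to the active filter, lookups consult both, and the older generation is dropped once the active filter reaches capacity.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter, used only to illustrate the idea."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)
        self.count = 0

    def _positions(self, item: str):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def put(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class ForgetfulBloomFilter:
    """Cycles two filters so that old entries eventually fade away."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.active = BloomFilter()
        self.old = BloomFilter()

    def put(self, item: str):
        if self.active.count >= self.capacity:
            # Forget the oldest generation and start a fresh active filter.
            self.old = self.active
            self.active = BloomFilter()
        self.active.put(item)

    def might_contain(self, item: str) -> bool:
        return self.active.might_contain(item) or self.old.might_contain(item)
```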

Anomaly Detection using Sigma Rules (Part 4): Flux Capacitor Design | by Jean-Claude Cote | Mar, 2023

We implement a Spark Structured Streaming stateful mapping function to handle temporal proximity correlations in cyber security logs. (Image by Robert Wilson from Pixabay.) This is the 4th article of our series; refer to part 1, part 2 and part 3 for some context. In this article, we will detail the design of a custom Spark flatMapGroupsWithState function. We chose to write this function in Scala since Spark is itself written in Scala. We named this function Flux Capacitor. Electric capacitors accumulate electric charges and…
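The series' Flux Capacitor is written in Scala with flatMapGroupsWithState; purely as a sketch of the same stateful-mapping pattern, here is what the Python analogue could look like with applyInPandasWithState (available in recent PySpark releases). The rate source, column names and tag logic are illustrative assumptions, not the article's actual code.

```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("stateful-mapping-sketch").getOrCreate()

# Hypothetical stream of process events; e_key and tag are illustrative column names.
events = (
    spark.readStream.format("rate").load()
    .selectExpr("CAST(value % 10 AS STRING) AS e_key", "'seen_parent' AS tag")
)


def remember_tags(
    key: Tuple[str],
    batches: Iterator[pd.DataFrame],
    state: GroupState,
) -> Iterator[pd.DataFrame]:
    """Accumulate the tags observed so far for one process key and emit them."""
    tags = set(state.get[0].split(",")) if state.exists else set()
    for pdf in batches:
        tags.update(pdf["tag"].tolist())
        yield pdf.assign(known_tags=",".join(sorted(tags)))
    state.update((",".join(sorted(tags)),))


tagged = events.groupBy("e_key").applyInPandasWithState(
    remember_tags,
    outputStructType="e_key STRING, tag STRING, known_tags STRING",
    stateStructType="tags STRING",
    outputMode="append",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = tagged.writeStream.format("console").start()
```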

Anomaly Detection using Sigma Rules (Part 3): Temporal Correlation Using Bloom Filters | by Jean-Claude Cote | Feb, 2023

Can a custom, tailor-made stateful mapping function based on bloom filters outperform the generic Spark stream-stream join? (Photo by Kalpaj on Unsplash, Peggys Cove, NS, Canada.) Spark's flatMapGroupsWithState function allows users to apply custom code on grouped data and provides support to persist user-defined state. In this article, we will implement a stateful function that retrieves the tags (features) of a parent process. The crux of the solution is to create a composite key made of the process ID (e_key in the…
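As a small illustration of such a composite key (the host_id column and the separator are assumptions for this sketch; the excerpt is truncated before the full key definition), one can concatenate the host identifier with the process ID before grouping:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("composite-key-sketch").getOrCreate()

# Hypothetical process events; host_id, e_key and tag are illustrative column names.
events = spark.createDataFrame(
    [("host-1", "proc-42", "parent_tag"), ("host-1", "proc-43", "child_tag")],
    ["host_id", "e_key", "tag"],
)

# Composite key: the same process ID can repeat across hosts, so the host is
# folded into the grouping key to keep state separate per machine.
keyed = events.withColumn("group_key", F.concat_ws("|", "host_id", "e_key"))

keyed.groupBy("group_key").agg(F.collect_set("tag").alias("tags")).show(truncate=False)
```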

Anomaly Detection using Sigma Rules (Part 2): Spark Stream-Stream Join | by Jean-Claude Cote | Feb, 2023

A class of Sigma rules detects temporal correlations. We evaluate the scalability of Spark's stateful symmetric stream-stream join to perform temporal correlations. (Photo by Naveen Kumar on Unsplash.) Following up on our previous article, we evaluate Spark's ability to join a start-process event with its parent start-process event. In this article, we evaluate how a Spark stream-stream join can scale; specifically, how many events it can hold in the join window. During our research, we evaluated a few approaches: a full join, doing…
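For context, a stream-stream self-join of this kind can be sketched in PySpark roughly as follows; the Kafka topic, event schema and one-hour window are illustrative assumptions, not the article's measured configuration.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes the spark-sql-kafka connector package is on the classpath.
spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()

# Hypothetical start-process events with an id, a parent id and an event time.
starts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "process_starts")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "id STRING, parent_id STRING, ts TIMESTAMP").alias("e"))
    .select("e.*")
)

children = starts.withWatermark("ts", "1 hour").alias("child")
parents = starts.withWatermark("ts", "1 hour").alias("parent")

# Symmetric stream-stream join: a child start-process event is matched with the
# parent start-process event that occurred at most one hour earlier.
joined = children.join(
    parents,
    F.expr("""
        child.parent_id = parent.id AND
        child.ts BETWEEN parent.ts AND parent.ts + INTERVAL 1 HOUR
    """),
    "inner",
)

query = joined.writeStream.format("console").start()
```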

Anomaly Detection using Sigma Rules (Part 1): Leveraging Spark SQL Streaming | by Jean-Claude Cote | Jan, 2023

Sigma rules are used to detect anomalies in cyber security logs. We use Spark Structured Streaming to evaluate Sigma rules at scale. (Photo by Tom Carnegie on Unsplash, Supreme Court of Canada.) The Rise of Data Sketching: data sketch is an umbrella term for data structures and algorithms that use theoretical mathematics, statistics and computer science to solve set cardinality, quantile and frequency estimation problems with mathematically proven error bounds. Data sketches are orders of magnitude faster than traditional approaches; they…
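As a toy illustration of evaluating a Sigma-style rule with Spark Structured Streaming (the rule is translated to SQL by hand here, and the event schema and input path are made up for the sketch):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sigma-streaming-sketch").getOrCreate()

# Hypothetical folder of JSON Windows event logs; the schema is illustrative.
events = (
    spark.readStream
    .schema("CommandLine STRING, Image STRING, ts TIMESTAMP")
    .json("/data/windows_events")
)

# One Sigma-style detection, expressed as a SQL boolean condition.
rule_whoami = F.expr("Image LIKE '%whoami.exe' OR CommandLine LIKE '%whoami%'")

hits = events.where(rule_whoami).withColumn("sigma_rule", F.lit("suspicious_whoami"))

query = (
    hits.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
```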

Leveraging Azure Event Grid to Create a Java Iceberg Table | by Jean-Claude Cote | Jan, 2023

We will use Azure Event Grid to implement an event-driven architecture. (Photo by Jackson Case on Unsplash.) In our previous article, we demonstrated how an Iceberg table can act as a Kafka topic. We showed that independent Java Writers can produce parquet files in parallel, while a single Bookkeeper attaches these data files to an Iceberg table. The Bookkeeper did this by creating an Iceberg commit. The Bookkeeper needs to identify the newly created data files. It then registers these files with the Iceberg table. In…

Streaming Iceberg Table, an Alternative to Kafka? | by Jean-Claude Cote | Dec, 2022

Spark Structured Streaming supports a Kafka source and a file source, meaning it can treat a folder as a source of streaming messages. Can a solution, entirely based on files, really compare to a streaming platform such as Kafka? (Photo by Edward Koorey on Unsplash.) In this article, we explore using an Iceberg table as a source of streaming messages. To do this, we create a Java program that writes messages to an Iceberg table and a pySpark Structured Streaming job that reads these messages. Azure Event Hubs is a big data…
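A minimal PySpark sketch of reading an Iceberg table as a streaming source follows; the catalog configuration, table name, paths and trigger interval are assumptions made for this illustration.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is available; the "demo" catalog,
# warehouse path and table name are illustrative.
spark = (
    SparkSession.builder.appName("iceberg-streaming-source-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

# Treat the Iceberg table as a stream: each new commit's data files become a micro-batch.
messages = spark.readStream.format("iceberg").load("demo.db.messages")

query = (
    messages.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/messages")
    .trigger(processingTime="30 seconds")
    .start()
)
```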

Spark Streaming Made Easy with JupyterLab SQL Magic | by Jean-Claude Cote | Oct, 2022

jupyterlab-sql-editor is a JupyterLab extension which makes it a breeze to execute, display and manage Spark streaming queries. (Photo by Fadhila Nurhakim on Unsplash.) One of the roles of the Canadian Centre for Cyber Security (CCCS) is to act as a Computer Emergency Response Team (CERT). In this role, the CCCS detects anomalies and issues mitigations as quickly as possible. In an environment where response time is critical, the CCCS leverages Spark Structured Streaming and the Kafka event streaming platform. In this article we will demo our…

Leverage Cloud Technologies for Malware Hunting at Scale | by Jean-Claude Cote | Sep, 2022

How to index hundreds of terabytes of malware using Apache Spark and Iceberg tables. (Photo by Hes Mundt on Unsplash.) In this article, we will show how we used Spark and Iceberg tables to implement a malware index similar to UrsaDB, and how we integrated this index into Mquery, an analyst-friendly web GUI for submitting YARA rules and displaying results. This proof of concept was developed during GeekWeek, an annual workshop organized by the Canadian Centre for Cyber Security that brings together key players in the field of cyber security to…
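UrsaDB-style indexes are built on byte n-grams; as a rough sketch only (the paths, table name and UDF below are illustrative, not the project's actual pipeline), such an inverted index could be derived with Spark and written to an Iceberg table like this:

```python
from pyspark.sql import SparkSession, functions as F

# Assumes an Iceberg catalog named "demo" is configured on the session.
spark = SparkSession.builder.appName("malware-index-sketch").getOrCreate()

# Hypothetical layout: raw samples stored as binary files.
samples = spark.read.format("binaryFile").load("/data/malware/*")


# Extract 3-byte n-grams (trigrams) from each sample; a UDF is used here for clarity only.
@F.udf("array<binary>")
def trigrams(content):
    return [bytes(content[i:i + 3]) for i in range(len(content) - 2)]


# Inverted index: one row per (sample path, trigram) pair.
index = (
    samples
    .select("path", F.explode(trigrams("content")).alias("gram"))
    .distinct()
)

index.writeTo("demo.db.trigram_index").createOrReplace()
```

A query for a YARA string would then look up each of the string's trigrams in this table and intersect the matching paths, keeping only candidate files for the full YARA scan.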

Jinja/DBT SQL Templating in JupyterLab/VSCode | by Jean-Claude Cote | Jul, 2022

Quickly prototype your SQL templates. (Photo by Joanna Kosinska on Unsplash.) SQL doesn't lend itself well to reusability. To achieve reusability, SQL is often templated using libraries like Jinja. For example, Apache Superset leverages Jinja templating in its Dataset definitions, and of course DBT is a tool built entirely around Jinja templates. Jupyterlab-sql-editor natively supports Jinja templating, making it possible to prototype jinja-sql templates for Superset and DBT right inside JupyterLab and VSCode. Create reusable…
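Outside of JupyterLab, the same kind of Jinja-SQL template can be prototyped with plain Python and the jinja2 library; the template text and variables below are illustrative only.

```python
from jinja2 import Template

# A small Jinja-SQL template, similar in spirit to what Superset or DBT would render.
template = Template("""
SELECT {{ metric }}, count(*) AS cnt
FROM {{ table }}
{% if start_date %}
WHERE event_date >= '{{ start_date }}'
{% endif %}
GROUP BY {{ metric }}
""")

# Render the template with concrete values to inspect the generated SQL.
print(template.render(metric="country", table="events", start_date="2022-01-01"))
```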