Techno Blender
Browsing Tag

Joao

Automatically Managing Data Pipeline Infrastructures With Terraform | by João Pedro | May, 2023

I know the manual work you did last summer
Photo by EJ Yao on Unsplash
A few weeks ago, I wrote a post about developing a data pipeline using both on-premise and AWS tools. This post is part of my recent effort to publish more cloud-oriented data engineering posts.
However, when mentally reviewing this post, I noticed a big problem: the manual work.
Whenever I develop a new project, whether real or fictional, I always try to reduce the friction of configuring the environment (install dependencies, configure folders, obtain…

Data Pipeline with Airflow and AWS Tools (S3, Lambda & Glue) | by João Pedro | Apr, 2023

Learning a little about these tools and how to integrate them
Photo by Nolan Krattinger on Unsplash
A few weeks ago, while doing my mental stretch to think about new post ideas, I thought: Well, I need to learn (and talk) more about cloud and these things. I’ve practiced a lot in on-premise environments, using open-source tools, and running away from proprietary solutions… But the world is cloud and I don’t think that this is gonna change any time soon…
I then wrote a post about creating a data pipeline with local Spark and GCP,…

Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query | by João Pedro | Mar, 2023

On-premise and cloud working together to deliver a data product
Photo by Toro Tseleng on Unsplash
Developing a data pipeline is somewhat similar to playing with Lego: you visualize what needs to be achieved (the data requirements), choose the pieces (software, tools, platforms), and fit them together. And, as with Lego, the complexity of the building process is determined by the complexity of the final goal.
It’s possible to go from simple ETL pipelines built with Python to move data between two databases to very complex…

Fast and Scalable Hyperparameter Tuning and Cross-validation in AWS SageMaker | by João Pereira | Mar, 2023

Using SageMaker Managed Warm Pools
Photo by SpaceX on Unsplash
This article shares a recipe for speeding up your hyperparameter tuning with cross-validation in SageMaker Pipelines by up to 60%, leveraging SageMaker Managed Warm Pools. By using Warm Pools, the runtime of a Tuning step with 120 sequential jobs is reduced from 10h to 4h.
Improving and evaluating the performance of a machine learning model often requires a variety of ingredients. Hyperparameter tuning and cross-validation are two such ingredients. The first finds the…

Hands-On Introduction to Delta Lake with (py)Spark | by João Pedro | Feb, 2023

Concepts, theory, and functionalities of this modern data storage framework
Photo by Nick Fewings on Unsplash
I think the value data can have is now perfectly clear to everybody. To use a hyped example, models like ChatGPT could only be built on a huge mountain of data, produced and collected over years.
I would like to emphasize the word “can” because there is a phrase in the world of programming that still holds, and probably always will: garbage in, garbage out. Data by itself has no value, it needs to be organized,…

First Steps in Machine Learning with Apache Spark | by João Pedro | Jan, 2023

Basic concepts and topics of the Spark MLlib package
Photo by Element5 Digital on Unsplash
Apache Spark is one of the main tools for data processing and analysis in the Big Data context. It’s a very complete (and complex) data processing framework, with functionalities that can be roughly divided into four groups: SparkSQL & DataFrames, for all-purpose data processing needs; Spark Structured Streaming, used to handle data streams; Spark MLlib, for machine learning and data science; and GraphX, the graph processing API.
Spark…

Road Network Edge Matching With Triangles | by João Paulo Figueira | Jan, 2023

Triangles have mighty properties for geospatial queries
Photo by Pawel Czerwinski on Unsplash
Triangles are shapes with many practical geometric properties. In this article, I illustrate using such properties when performing opportunistic optimizations while solving a particular geospatial problem: the recovery of missing map-matched information.
I started exploring the Extended Vehicle Energy Dataset¹ (EVED) a while ago to search for compelling geospatial data analysis opportunities in a city road network context. This…

It’s about time we elevate the data analyst role | by João António Sousa | Jan, 2023

Opinion
Data analysts should become trusted advisors powered by data
NOTE: Most data & analytics roles are ill-defined. This article focuses on the data analyst role, which is often described as a BI analyst. Depending on the organization, there’s also a bit of overlap between the business analyst and the data scientist role.
Image by Freepik
The role of data analyst has rapidly evolved over the last few years. With the explosion of data complexity and business expectations, they face many challenges.
In the earlier years…

Diagnostic analytics — how to conduct a root-cause analysis | by João António Sousa | Dec, 2022

Proactively delivering real insights into metrics changes
Note: The term root-cause analysis is commonly used in IT and data engineering as the process to identify root causes of faults or problems. This article focuses on diagnostic analytics to understand the drivers of business metrics changes.
Image by rawpixel.com on Freepik
Why is revenue down? Why did the conversion rate spike? Why is the average order value flat?
Depending on your industry and business goals, your key metrics might be different. But if you care about…

Trajectory Queries Using Space Partitioning | by João Paulo Figueira | Nov, 2022

How can we quickly find overlapping trajectories?
Photo by Jens Lelie on Unsplash
While traveling through space, an object describes a trajectory. We can think about a trajectory as a function of time that outputs positions in space. Conceptually, trajectories are continuous functions, although we pragmatically use their discrete versions. A discrete trajectory is a time-ordered collection of points in space where we implicitly assume a linear interpolation between each point. This representation makes storing discrete…
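
The discrete-trajectory idea in the excerpt above (time-ordered points with linear interpolation assumed between neighbours) can be illustrated with a minimal Python sketch. This is not code from the article: the `(t, x, y)` sample layout, the planar coordinates, and the `position_at` helper are all assumptions made here for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical sample layout: (time, x, y), stored in increasing time order.
Sample = Tuple[float, float, float]


def position_at(trajectory: List[Sample], t: float) -> Tuple[float, float]:
    """Return the (x, y) position at time t, linearly interpolating
    between the two samples that bracket t."""
    if not trajectory:
        raise ValueError("trajectory is empty")
    times = [sample[0] for sample in trajectory]
    if t < times[0] or t > times[-1]:
        raise ValueError("t lies outside the time span of the trajectory")

    # Index of the first sample whose timestamp is strictly greater than t.
    i = bisect_right(times, t)
    if i == len(times):
        # t equals the last timestamp, so there is nothing to interpolate.
        return trajectory[-1][1], trajectory[-1][2]

    t0, x0, y0 = trajectory[i - 1]
    t1, x1, y1 = trajectory[i]
    w = (t - t0) / (t1 - t0)  # fraction of the segment already travelled
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)


if __name__ == "__main__":
    # Made-up samples, purely for illustration.
    samples = [(0.0, 0.0, 0.0), (10.0, 100.0, 0.0), (20.0, 100.0, 50.0)]
    print(position_at(samples, 5.0))
    print(position_at(samples, 15.0))
```

Storing only the samples and reconstructing intermediate positions on demand is what lets the discrete form stand in for the continuous function of time.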