Techno Blender
Digitally Yours.
Browsing Tag

Chengzhi

Building Better Data Warehouses with Dimensional Modeling: A Guide for Data Engineers | by Chengzhi Zhao | May, 2023

Data Warehouse Dimensional Modeling Design 101
Photo by Erin Doering on Unsplash
System design for a data-intensive application usually comes down to two options: write-optimized or read-optimized. No database design optimizes both writing and reading. As with all system design, no solution is right or wrong; there are only pros and cons. For data professionals who work on data model design, a critical part of the role is to identify the use case and further identify which design principle…

Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed | by Chengzhi Zhao | Apr, 2023

Demystify Spark Performance in Union Operator
Photo by Fahrul Azmi on Unsplash
The union operator is one of the set operators; it merges two input data frames into one. Union is a convenient operation in Apache Spark for combining rows with the same order of columns. One frequent use case is applying different transformations and then unioning the results together. How to use the union operation in Spark is widely discussed. However, a less-discussed fact is the performance caveat associated with…

How to Engage with Users By Storytelling: Show Data Analytics in R and Shiny | by Chengzhi Zhao | Mar, 2023

How R and Shiny Can Help You Find the Best YouTube Videos for Your Kids
Image by Author
Data is more engaging with storytelling. As a data professional, I seek less complicated ways to bridge the gap between data analysis and communication. A dashboard is traditionally the default way to visualize and share data. It also carries the responsibility for communication. However, I found the limitations of the dashboard: limited chart selections and…

R for Data Analysis: How to Find the Perfect Cocomelon Video for Your Kids | by Chengzhi Zhao | Mar, 2023

How to Build an End-to-End Data Project Exploring New Trending Cocomelon Videos from Scratch Using R
Photo by Tony Sebastian on Unsplash
Cocomelon - Nursery Rhymes is the world's second-largest YouTube channel (155M+ subscribers). It is such a popular and helpful channel that it is an inevitable subject for toddlers and parents. I enjoy spending time watching Cocomelon together with my son. After watching Cocomelon videos for a month, I noticed the same videos are repeatedly recommended on YouTube. Videos like "The wheel on the…

Data Streaming Is Exciting: What You Need to Know Before Jumping in | by Chengzhi Zhao | Feb, 2023

Is Data Streaming Right for Your Business? Key Facts to Consider
Photo by Stephen Leonardi on Unsplash
Streaming data is an exciting space in the data field, and it has been getting tremendous traction in recent years. With so much excitement, the open-source project space has become crowded. Many technologies have made stream processing more straightforward than ever: Kafka, Flink, Spark, Storm, Beam, etc., have been on the market for years and have built a solid user base. “Let’s do streaming processing.” It is an…

Think in SQL — Avoid Writing SQL in a Top to Bottom Approach | by Chengzhi Zhao | Feb, 2023

Write Clear SQL by Comprehending Logical Query Processing Order
Photo by Jeffrey Brandjes on Unsplash
You might find writing SQL challenging due to its declarative nature. Especially for engineers familiar with imperative languages like Python, Java, or C, SQL requires switching gears and a shift in mindset. Thinking in SQL is different from thinking in any imperative language, and it should not be learned and developed the same way. When working with SQL, do you write in a top-to-bottom approach? Do you start developing in SQL with the…

5 Fantastic Data Pipeline Orchestration Tools For R | by Chengzhi Zhao | Jan, 2023

Explore Excellent Options for Data Pipeline Orchestration for R Users
Photo by Daria Nepriakhina 🇺🇦 on Unsplash
A data pipeline orchestration tool is critical for producing healthy and reliable data-driven decisions. R is one of the popular languages for data scientists. With its exceptional packages, R is great for data manipulation, statistical analysis, and visualization. One pattern that often brings data scientists’ local R scripts to production is to rewrite them using Python or Scala (Spark), then…

Become Fluent in Python Decorators via Visualization | by Chengzhi Zhao | Jan, 2023

Comprehend Python Decorators By Visualization
Photo by Huyen Bui on Unsplash
A Python decorator is syntactic sugar. You can achieve everything without explicitly using the decorator. However, using a decorator can help your code be more concise and readable. Ultimately, you write fewer lines of code by leveraging Python decorators. Nevertheless, the Python decorator isn't a trivial concept to comprehend. Understanding Python decorators requires building blocks, including closures, functions as objects, and deep knowledge of how…
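
To make the "syntactic sugar" point concrete, here is a minimal sketch. The decorator `log_calls` and the functions `add` and `multiply` are hypothetical names chosen for illustration and do not come from the article. It shows that applying `@log_calls` and reassigning the name by hand should behave the same way.

```python
import functools


def log_calls(func):
    """Hypothetical decorator: records the arguments of every call."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)

    wrapper.calls = []
    return wrapper


# Using the decorator syntax.
@log_calls
def add(a, b):
    return a + b


# Doing the same thing without the decorator syntax.
def multiply(a, b):
    return a * b


multiply = log_calls(multiply)

print(add(1, 2), add.calls)
print(multiply(2, 3), multiply.calls)
```

The inner `wrapper` is a closure over `func`, and `log_calls` receives and returns a function object, which are two of the building blocks the excerpt mentions.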

Here Is What I Learned Using Apache Airflow over 6 Years | by Chengzhi Zhao | Jan, 2023

A journey with Apache Airflow from experiment to production hassle-free
Photo by Karsten Würth on Unsplash
Apache Airflow has undoubtedly been the most popular open-source project for data engineering for years. It gained popularity at the right time with “The Rise of the Data Engineer”, and its core concept of treating code as a first-class citizen instead of drag and drop for data pipelines (aka ETL) is a milestone. Apache Airflow became an Apache Incubator project in March 2016 and became a top-level project in January 2019. I have…

Deep Dive into Handling Apache Spark Data Skew | by Chengzhi Zhao | Jan, 2023

The Ultimate Guide To Handle Data Skew In Distributed Compute
Photo by Lizzi Sassman on Unsplash
“Why is my Spark job running slow?” is an inevitable question when working with Apache Spark. One of the most common scenarios in Apache Spark performance tuning is data skew. In this article, we will cover how to identify whether your Spark job's slowness is caused by data skew, then dive deep into three ways to handle it, with code, including the “salting” technique. How to…
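
As a rough illustration of the idea behind salting, the sketch below simulates hash partitioning in plain Python rather than Spark. It is not the article's code: the partition count, the number of salt buckets, the sample keys, and the helper names `partition_for` and `add_salt` are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed number of partitions for the illustration
SALT_BUCKETS = 8    # assumed number of salt values per key


def partition_for(key: str) -> int:
    """Stable hash partitioner (a stand-in for a shuffle partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def add_salt(key: str, rng: random.Random) -> str:
    """Append a random salt so one hot key becomes several distinct keys."""
    return f"{key}_{rng.randrange(SALT_BUCKETS)}"


# Skewed input: one hot key dominates the dataset.
keys = ["hot"] * 1000 + ["a", "b", "c"] * 10

# Without salting, every "hot" row hashes to the same partition.
plain_load = Counter(partition_for(k) for k in keys)

# With salting, the hot key is spread across several partitions.
rng = random.Random(0)  # seeded so the run is repeatable
salted_keys = [add_salt(k, rng) for k in keys]
salted_load = Counter(partition_for(k) for k in salted_keys)

# Stage 1: aggregate per salted key. Stage 2: strip the salt and merge.
partial_counts = Counter(salted_keys)
final_counts = Counter()
for salted_key, n in partial_counts.items():
    original_key = salted_key.rsplit("_", 1)[0]
    final_counts[original_key] += n

print("rows per partition without salt:", dict(plain_load))
print("rows per partition with salt:   ", dict(salted_load))
print("final counts:", dict(final_counts))
```

The trade-off is an extra aggregation stage: results are first computed per salted key, then merged back under the original key once the salt is removed.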