What Data Engineers Can Learn from Software Engineers, and Vice Versa | by Xiaoxu Gao | Oct, 2022

By Jessie Hobb On Oct 26, 2022

Don’t be limited by your job description

In the past five years of my career, I have worked as both a full-time Software Engineer and a full-time Data Engineer. I’ve noticed that the two roles have enough similarities that I could move from one to the other smoothly, but also tangible differences that pushed me to learn new skills.

In this article, I want to share what the differences are and, more importantly, what each role can learn from the other, from my own perspective. If you are considering switching jobs, I hope this article sheds some light. Even if you are not, it may still help you leave your comfort zone and learn from others.

My situation

This article is very subjective, and whether it applies to you depends on your own circumstances, so I will briefly explain my situation.

Software Engineering -> Data Engineer (Created by Xiaoxu Gao)

I first worked as a Software Engineer at an international bank. My job was to build software pretty much from scratch to integrate different internal bank applications such as the data warehouse and reporting tools. My tech stack was mostly Python as the main programming language, Apache Kafka as the streaming platform, REST API frameworks like Connexion and FastAPI, and a little bit of SQL.

Later, I moved to a startup and became a Data Engineer. I’ve spent most of my time building data pipelines and data models using tools like Airflow and dbt. The entire stack runs on GCP, so I also got to learn cloud services such as BigQuery, Pub/Sub, and Cloud Storage.

As you can tell, although I had two different job titles, they share common ground in the tech stack and the type of work, which is moving data from one place to another. That’s why the switch was easier for me. If you are a Software Engineer focusing on web applications, the switch might be more significant for you.

Software Engineer vs. Data Engineer

Wikipedia defines a Software Engineer as a person who applies the principles of software engineering to design, develop, maintain, test and evaluate computer software.

Software Engineer is a very broad job category. The responsibility of each Software Engineer can be entirely different. Overall, their tasks include:

Collect requirements from the stakeholders, developers, and other interested parties. Software requirements include functional requirements (e.g. business rules) and non-functional requirements (e.g. reliability, scalability, performance).
Design architecture, components, interfaces, and other characteristics of a system. It involves collaboration with other engineering teams to make sure the solution can fit into the bigger picture. It also includes a detailed implementation design like algorithms, data structures, etc.
Develop the software using one or more programming languages.
Test the software using unit testing, integration testing, user acceptance testing, smoke testing, etc. It is an empirical, technical investigation conducted to provide stakeholders with information about the quality of the software.
Maintain the software once it’s live. It refers to activities such as updating the software when a bug is found and monitoring the health of the service.

The term Software Engineer has been formally used since the 1960s. As the domain has become bigger and more complicated, the term is no longer precise enough to describe what a person actually does. Therefore, many new job roles have emerged as branches of Software Engineer:

Full Stack Engineer — focus on web applications.
iOS/Android Developer — focus on mobile applications.
DevOps Engineer — develop and maintain the release, deployment, and operation process of the software.

Similarly, Data Engineer is also considered a branch of Software Engineer with a focus on Big Data. The term data engineer has been widely adopted only since the 2010s.

Wikipedia defines a Data Engineer as a person who builds the system to enable the collection and usage of data.

Right. Instead of building software like a web application, a Data Engineer builds a system. But let’s check out the tasks of a Data Engineer as well:

Collect requirements from all interested parties. This is the same as a Software Engineer.
Build data pipelines that collect raw data from the source system and convert it into information that data users can interpret. Data Engineers normally build data pipelines and data models using existing tools such as Airflow and dbt.
Maintain the pipeline after it’s live. Same as software, it’s incredibly important to monitor the health of data pipelines to ensure the reliability of the data.
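To make the pipeline idea concrete, here is a minimal Python sketch of the collect-and-convert flow described above. The order records, the field names, and the extract/transform/load function names are all made up for illustration; a real pipeline would typically run on tools like Airflow and dbt rather than on plain functions.

```python
from datetime import date

# Hypothetical raw records, as they might arrive from a source system:
# everything is a string and nothing is aggregated yet.
RAW_ORDERS = [
    {"order_id": "1", "amount": "19.90", "ordered_on": "2022-10-01"},
    {"order_id": "2", "amount": "5.10", "ordered_on": "2022-10-01"},
    {"order_id": "3", "amount": "12.00", "ordered_on": "2022-10-02"},
]


def extract():
    """Collect raw data from the source (here: an in-memory list)."""
    return list(RAW_ORDERS)


def transform(raw_rows):
    """Cast the string fields to proper types."""
    return [
        {
            "order_id": int(row["order_id"]),
            "amount_cents": round(float(row["amount"]) * 100),
            "ordered_on": date.fromisoformat(row["ordered_on"]),
        }
        for row in raw_rows
    ]


def load(rows):
    """Build a small 'daily revenue' model keyed by order date."""
    daily_revenue = {}
    for row in rows:
        day = row["ordered_on"]
        daily_revenue[day] = daily_revenue.get(day, 0) + row["amount_cents"]
    return daily_revenue


def run_pipeline():
    """Chain the three steps, the way a scheduler would run them in order."""
    return load(transform(extract()))
```

Calling `run_pipeline()` returns one revenue figure per day, which is the kind of "information" a data user can read directly.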

Generally speaking, Data Engineers are specialists within the field of software engineering. The table shows some differences between the two roles.

What can Data Engineers learn from Software Engineers?

After understanding their similarities and differences, let’s see how they can learn from each other.

Testing, testing, testing

Testing is an essential part of the software engineering lifecycle. It helps the engineering team assess the quality of the software and ensure that it fulfills the stakeholders’ needs.

A set of software testing strategies such as unit testing, integration testing, and user acceptance testing are developed to mitigate the product risks. A common practice is to code these tests into the CI/CD pipeline, so the software won’t be able to deploy without passing the tests.
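As a small illustration of what such a gate looks like, here is a sketch using Python's built-in `unittest` module. The `apply_discount` business rule and the `run_tests` helper are hypothetical; in a real CI/CD pipeline the test runner's exit code would normally decide whether the deployment step runs.

```python
import unittest


def apply_discount(price_cents, percent):
    """Hypothetical business rule: reduce a price by a whole-number percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(2000, 25), 1500)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(2000, 120)


def run_tests():
    """Run the suite and report success, similar to what a CI step checks."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

If `run_tests()` returns `False`, the pipeline stops and the change is not deployed.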

Writing tests is in the DNA of every Software Engineer, and there are methodologies like TDD (Test-Driven Development) and BDD (Behaviour-Driven Development) to help engineers write them. For Data Engineers, however, testing is still seen as a nice-to-have. A couple of reasons could be:

Data pipelines deal with many unknowns. Unlike a regular application, where the inputs are easier to anticipate, data pipelines receive data from many sources in all kinds of formats. It is very hard to predict every type of data and cover every scenario unless we have strict data contracts. (Data contracts are a hot topic right now; you can read more in Chad Sanderson’s blog post.)
The oracle problem. The oracle problem refers to the situation where it’s difficult to determine the expected output. Because of the many unknowns, there are many scenarios where we don’t know what the correct result should be, which makes writing tests harder.
SQL is inherently hard to unit test. SQL is the dominant programming language for Data Engineers, but the code is usually one big block, and it’s difficult to take one piece out and test only that part. Besides, cloud services like BigQuery don’t offer a local environment (unlike Postgres, which you can run locally in Docker), so the remote connection during testing can be a hassle. Right now, most tests are data tests that check data quality, such as freshness, completeness, and uniqueness, rather than business logic.
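The three data quality dimensions mentioned above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the row layout, the column names, and the function names are hypothetical, and tools like dbt or Great Expectations express the same ideas in their own syntax.

```python
from datetime import datetime, timedelta


def check_uniqueness(rows, key):
    """True if no two rows share the same value in the `key` column."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def check_completeness(rows, required_columns):
    """True if every required column is present and not None in every row."""
    return all(
        row.get(column) is not None
        for row in rows
        for column in required_columns
    )


def check_freshness(rows, timestamp_column, now, max_age):
    """True if the newest row is at most `max_age` old. Empty data is stale."""
    if not rows:
        return False
    newest = max(row[timestamp_column] for row in rows)
    return now - newest <= max_age


# Hypothetical sample table and "current time" for the checks.
USERS = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2022, 10, 26, 8, 0)},
    {"id": 2, "email": "b@example.com", "updated_at": datetime(2022, 10, 26, 9, 0)},
]
NOW = datetime(2022, 10, 26, 12, 0)

report = {
    "unique_id": check_uniqueness(USERS, "id"),
    "complete": check_completeness(USERS, ["id", "email"]),
    "fresh": check_freshness(USERS, "updated_at", NOW, timedelta(hours=6)),
}
```

Note that none of these checks say anything about whether the business logic is right; they only describe the state of the data.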

For more on testing in dbt, you can read my article How to do Unit Testing in dbt.

In recent years, as the data engineering field has matured, automated testing has finally started to get more attention. Frameworks like Great Expectations, dbt, and Dagster have developed features that simplify testing, which improves the quality of data pipelines and data models.

Most data engineering teams work with data tools such as Airflow, Dagster, and dbt. Compared to the low-level code a Software Engineer writes, these tools are more abstract and easier to manage. But the core data processing logic throughout the pipeline still has to be written as code.

It’s important to be proficient in at least one programming language other than SQL. Knowing the programming language well also makes it possible to contribute to open-source projects, for instance, Airflow.

Infrastructure-as-code (IaC) and Pipeline-as-code (PaC) have recently become popular in the data engineering field. They reduce the burden of infrastructure management and release cycle management. Being familiar with Terraform, Bash, and GitHub Actions surely boosts the productivity of Data Engineers.

Monitoring the reliability of data products

In Software Engineering, it’s common to set up monitoring once a service is live. Monitoring enables developers to catch problems quickly, while they are still minor.

Data product monitoring, or in other words, data observability, is still a new concept. Data observability allows organizations to understand their data systems fully and fix data problems at an early stage. A data pipeline is a complex system composed of many steps, so it’s frustrating to track down the root cause of a wrong metric in a dashboard.

Data Engineers should adopt a monitoring mindset from the very beginning of the design phase to avoid this frustration.
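One simple way to narrow down where a pipeline went wrong is to record a metric per step and compare it with the previous run. The sketch below does this with row counts. The step names, the counts, and the 50% threshold are assumptions for illustration only; dedicated observability tools track far more signals than this.

```python
def find_suspicious_steps(previous_counts, current_counts, max_drop_ratio=0.5):
    """Return the steps whose row count dropped by more than `max_drop_ratio`.

    Both arguments map a step name to the number of rows that step produced.
    Steps are reported in pipeline order, so the first one is usually the
    best place to start looking for the root cause.
    """
    suspicious = []
    for step, before in previous_counts.items():
        after = current_counts.get(step, 0)
        if before > 0 and (before - after) / before > max_drop_ratio:
            suspicious.append(step)
    return suspicious


# Hypothetical row counts per step for yesterday's and today's run.
yesterday = {"extract": 1000, "clean": 950, "aggregate": 30}
today = {"extract": 1010, "clean": 400, "aggregate": 12}

flagged = find_suspicious_steps(yesterday, today)
```

Here the extract step looks healthy, while the clean step and everything after it lost more than half of its rows, which points at the cleaning logic rather than at the dashboard.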

What can Software Engineers learn from Data Engineers?

Alright, let’s look at the other side. Although Software Engineer is a general role, Software Engineers can still learn from Data Engineers’ expertise.

Be more willing to learn new technologies

Data Engineering is a tool-based field and the landscape keeps growing. Googling “data engineering landscape” will return pictures with hundreds of data tools. It’s impossible to keep up with every single tool, but it’s definitely worth catching up once in a while.

The Modern Data Stack website lists many data tools in each category.

Since becoming a Data Engineer, I have spent more time than ever reading data blogs and product release notes and attending conferences. I find it a good habit for all Software Engineers. No matter what type of work you do, it’s always good to go through LinkedIn posts and Medium to see if there is anything new to absorb in your area.

If you find the landscape of your field has been pretty much stabilized, you might need to rethink whether the job is still interesting or challenging for you.

Be more responsible for the end-to-end system

As described earlier, Data Engineers are responsible for end-to-end systems. It requires engineers to have the ability to take care of multiple components at the same time. When I was a Software Engineer, I was mostly looking at my own component.

Taking care of multiple components is a step toward becoming an architect. A software architect is someone who is responsible for high-level design choices related to overall system behavior. If you are aiming to become a software architect, you should pay more attention to system design.

Cost is an interesting perspective that I never thought about as a Software Engineer. Data Engineers work with big data, so having a cost-effective way to store and transform data becomes extremely important, especially for startups.

Though cost is not the first thing Software Engineers think about, it is good practice to keep efficiency and effectiveness in mind at all times, both financially and in terms of process. For example, ask how the software could be improved to use less memory, CPU, and storage.
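As one small example of thinking about memory, the sketch below computes the same total twice: once by building a full list first, and once by streaming values through a generator. The function names and the `peak_memory` helper are my own for illustration; the helper relies on Python's standard `tracemalloc` module, and the exact numbers will differ per machine and Python version.

```python
import tracemalloc


def total_with_list(n):
    """Materialize all n squares in memory, then sum them."""
    squares = [i * i for i in range(n)]
    return sum(squares)


def total_streaming(n):
    """Sum the squares one at a time without storing them."""
    return sum(i * i for i in range(n))


def peak_memory(func, *args):
    """Return the peak number of bytes allocated while running func(*args)."""
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


list_peak = peak_memory(total_with_list, 100_000)
stream_peak = peak_memory(total_streaming, 100_000)
```

Both functions return the same result, but the streaming version never holds the whole list, so its peak memory stays much lower as the input grows.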

Conclusion

In this article, I talked about what Data Engineers and Software Engineers can learn from each other. They have similar but complementary expertise. Today’s tech market has produced many new job titles, and there will be even more in the future. The boundary between jobs can be clear or vague depending on the company, but don’t be limited by what your job description asks you to do. Keep your mind open and keep learning. Cheers!