Learning Data Engineering in 2023

A basic list of topics and recommended resources in 2023

Adrià Serra
5 min read · Jun 27, 2023

--

In today’s data-driven world, data engineering has emerged as a critical field that plays a pivotal role in extracting actionable insights from vast amounts of data. As businesses and organizations strive to leverage data for informed decision-making, the demand for skilled data engineers has skyrocketed. If you’re interested in embarking on a career in data engineering, or looking to upskill in this domain, this comprehensive learning path will guide you through the essential concepts, tools, and skills needed to become a proficient data engineer.

Understand the Fundamentals

To begin your journey, developing a solid understanding of the fundamental concepts underpinning data engineering is crucial. Start by gaining knowledge of data structures, algorithms, and databases. Familiarize yourself with relational databases, NoSQL databases, and distributed computing concepts.

Resources:

  • Book: “Introduction to the Design and Analysis of Algorithms” by Anany Levitin
  • Course: Coursera’s “Algorithms, Part I” by Princeton University
  • Online Platform: HackerRank (www.hackerrank.com) for practising algorithmic problem-solving
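As a warm-up, it is worth being able to write a classic algorithm like binary search from memory — the kind of exercise the platforms above drill. A minimal Python sketch:

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent (O(log n))."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1   # target is in the upper half
        else:
            hi = mid - 1   # target is in the lower half
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # 3
print(binary_search([2, 3, 5, 7, 11, 13], 4))   # -1
```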

Learn Programming Languages

Proficiency in programming languages is vital for data engineers. Python and SQL are the two primary languages you should focus on. Python provides a rich ecosystem of libraries and frameworks, such as Pandas, NumPy, and Apache Spark, which are widely used in data engineering tasks. SQL is essential for interacting with relational databases, data manipulation, and querying.

Resources:

  • Book: “Python Crash Course” by Eric Matthes
  • Course: Codecademy’s “Learn Python 3”
  • Online Platform: DataCamp (www.datacamp.com) for interactive Python and SQL courses
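To make this concrete, here is a small example of the kind of transformation Pandas makes easy — aggregating order data by customer. The data is hypothetical; in practice it would come from a file or database:

```python
import pandas as pd

# Hypothetical order data standing in for a real source.
orders = pd.DataFrame({
    "customer": ["ana", "bo", "ana", "bo", "ana"],
    "amount": [10.0, 25.0, 5.0, 15.0, 20.0],
})

# Group by customer and compute total and average spend per customer.
summary = orders.groupby("customer")["amount"].agg(["sum", "mean"]).reset_index()
print(summary)
```

The equivalent SQL would be a `GROUP BY customer` with `SUM(amount)` and `AVG(amount)` — moving fluently between the two phrasings is a large part of day-to-day data engineering.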

Dive into Databases

Explore the world of databases by learning SQL and different database management systems (DBMS). Gain hands-on experience with popular relational databases like MySQL, PostgreSQL, and Oracle. Additionally, familiarize yourself with NoSQL databases such as MongoDB and Cassandra, which are well-suited for handling large-scale, unstructured data.

Resources:

  • Book: “SQL Cookbook” by Anthony Molinaro
  • Course: Udacity’s “Intro to Relational Databases”
  • Online Resource: PostgreSQL Tutorial (www.postgresqltutorial.com) for hands-on SQL practice
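You do not need a server to start practising SQL: SQLite ships with Python and makes a convenient sandbox before moving on to MySQL or PostgreSQL. A small example with a typical aggregate query:

```python
import sqlite3

# An in-memory database: nothing to install, nothing to clean up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Ana", "ES"), ("Bo", "SE"), ("Cem", "TR"), ("Dana", "ES")],
)

# Users per country, most common first (country as a tie-breaker).
rows = conn.execute(
    "SELECT country, COUNT(*) AS n FROM users GROUP BY country ORDER BY n DESC, country"
).fetchall()
print(rows)  # [('ES', 2), ('SE', 1), ('TR', 1)]
conn.close()
```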

Master Big Data Technologies

In the era of big data, it’s crucial to understand the technologies that enable the processing and analysis of massive datasets. Apache Hadoop and Apache Spark are two of the most widely used frameworks for big data processing. Learn the basics of distributed computing, Hadoop Distributed File System (HDFS), and MapReduce. Dive into Apache Spark, a lightning-fast data processing engine that supports real-time streaming, machine learning, and graph processing.

Resources:

  • Book: “Hadoop: The Definitive Guide” by Tom White
  • Course: edX’s “Big Data Analysis with Apache Spark” by UC Berkeley
  • Online Resource: Apache Spark Documentation (spark.apache.org/docs/latest) for in-depth Spark guidance
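The MapReduce model itself is simple enough to sketch in plain Python, no cluster required: mappers emit key-value pairs, the framework shuffles the pairs by key, and reducers aggregate each group. The classic illustration is a word count (this is a conceptual toy, not the Hadoop or Spark API):

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])  # 3 2
```

Hadoop and Spark add what the toy leaves out: partitioning the data across machines, running mappers and reducers in parallel, and recovering from node failures.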

Explore Data Warehousing and ETL

Data warehousing is a key component of data engineering. Familiarize yourself with concepts like Extract, Transform, Load (ETL) processes, data integration, and data modelling. Gain knowledge of popular data warehousing platforms such as Amazon Redshift, Google BigQuery, and Snowflake. Learn how to design efficient ETL pipelines to extract data from various sources, transform it, and load it into a data warehouse for analysis.

Resources:

  • Book: “The Data Warehouse Toolkit” by Ralph Kimball and Margy Ross
  • Course: LinkedIn Learning’s “Learning Data Modeling” by Robin Hunt
  • Online Resource: Snowflake’s Documentation (docs.snowflake.com) for understanding cloud data warehousing
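The extract, transform, load stages can be sketched end to end in a few lines of Python. The source records here are hypothetical, and SQLite stands in for a real warehouse such as Redshift, BigQuery, or Snowflake:

```python
import sqlite3

# Extract: in practice this would read from an API, a file, or a source
# database; a hypothetical in-memory list stands in for the source here.
raw_records = [
    {"id": 1, "price": "19.99", "country": "es"},
    {"id": 2, "price": "5.00", "country": "SE"},
    {"id": 2, "price": "5.00", "country": "SE"},  # duplicate to be dropped
]

# Transform: cast types, normalize values, and de-duplicate on the key.
seen, clean = set(), []
for rec in raw_records:
    if rec["id"] in seen:
        continue
    seen.add(rec["id"])
    clean.append((rec["id"], float(rec["price"]), rec["country"].upper()))

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, price REAL, country TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(warehouse.execute("SELECT COUNT(*), SUM(price) FROM sales").fetchone())
```

Real pipelines add incremental loading, schema management, and failure handling, but the three-stage shape stays the same.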

Embrace Data Pipelines and Workflow Tools

Data engineering involves building robust data pipelines to ensure smooth data flow and processing. Apache Airflow is a widely used workflow management tool that allows you to define, schedule, and monitor data pipelines. Dive into Airflow’s concepts like DAGs (Directed Acyclic Graphs), operators, and sensors, and explore its integration with other tools and services.

Resources:

  • Book: “Data Pipelines Pocket Reference” by James Densmore
  • Course: Pluralsight’s “Getting Started with Apache Airflow” by Janakiram MSV
  • Online Resource: Apache Airflow Documentation (airflow.apache.org/docs) for comprehensive Airflow information
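Airflow itself takes some setup, but its core idea — a DAG of tasks executed in dependency order — can be sketched in plain Python. This is a toy illustration of the concept, not Airflow’s API:

```python
def run_dag(tasks, dependencies):
    """Run callables in an order that respects upstream -> downstream edges.

    tasks: {name: callable}; dependencies: {name: [upstream names]}.
    A toy topological execution loop -- Airflow layers scheduling, retries,
    distribution, and monitoring on top of exactly this idea.
    """
    done, order = set(), []
    while len(done) < len(tasks):
        progressed = False
        for name in tasks:
            if name not in done and all(up in done for up in dependencies.get(name, [])):
                tasks[name]()        # run the task once its upstreams are done
                done.add(name)
                order.append(name)
                progressed = True
        if not progressed:
            raise ValueError("cycle detected: not a DAG")
    return order

# A classic extract -> transform -> load shape with a final report step.
log = []
order = run_dag(
    tasks={n: (lambda n=n: log.append(n)) for n in ["extract", "transform", "load", "report"]},
    dependencies={"transform": ["extract"], "load": ["transform"], "report": ["load"]},
)
print(order)  # ['extract', 'transform', 'load', 'report']
```

In Airflow the same shape is expressed with operators and the `>>` dependency syntax inside a `DAG` definition, plus a schedule for when each run should start.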

Get Hands-on with Cloud Platforms

Cloud computing has revolutionized data engineering by providing scalable infrastructure and services. Gain hands-on experience with cloud platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Learn how to provision virtual machines, set up storage systems, and leverage services like AWS S3, GCP BigQuery, and Azure Data Factory for data processing and analytics.

Resources:

  • Online Platform: AWS Free Tier (aws.amazon.com/free) for hands-on experience with AWS services
  • Course: Google Cloud’s “Data Engineering on Google Cloud Platform”

Data Streaming and Real-time Processing

With the increasing demand for real-time data insights, understanding data streaming and real-time processing is crucial. Apache Kafka, a distributed streaming platform, and Apache Flink, a powerful stream processing framework, are essential tools in this realm. Learn how to build real-time data pipelines and perform complex event processing using these technologies.

Resources:

  • Book: “Kafka: The Definitive Guide” by Neha Narkhede, Gwen Shapira, and Todd Palino
  • Course: LinkedIn Learning’s “Learning Apache Kafka” by Ben Sullins
  • Online Resource: Apache Flink’s Documentation (flink.apache.org/documentation) for detailed Flink resources
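Kafka and Flink are infrastructure-heavy, but the core stream-processing idea — consuming an unbounded sequence of events while maintaining windowed aggregates — can be illustrated with a plain Python generator. This is a conceptual sketch, not the Kafka or Flink APIs:

```python
def tumbling_window_sums(events, window_size):
    """Yield the sum of each fixed-size (tumbling) window of event values."""
    window = []
    for value in events:
        window.append(value)
        if len(window) == window_size:
            yield sum(window)  # emit the completed window...
            window = []        # ...and start a fresh one

def sensor_stream():
    """Stand-in for an unbounded source such as a Kafka topic."""
    for reading in [1, 2, 3, 4, 5, 6, 7]:
        yield reading

sums = list(tumbling_window_sums(sensor_stream(), window_size=3))
print(sums)  # [6, 15] -- the trailing partial window (7) is still open
```

Flink generalizes this with sliding and session windows, event-time semantics, and fault-tolerant state, while Kafka provides the durable, replayable event log feeding it.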

Data Governance and Security

As a data engineer, ensuring data governance and security is paramount. Learn how to safeguard data quality, privacy, and security, and how to keep pipelines compliant with regulations such as GDPR.

Resources:

  • Book: “Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program” by John Ladley
  • Course: DataCamp’s “Introduction to Data Governance” by Lynn Winterboer
  • Online Resource: Data Governance Institute (www.datagovernance.com) for industry insights and best practices
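Data quality checks are one place where governance becomes concrete code. A minimal sketch of rule-based record validation — the rules here are hypothetical and purely illustrative:

```python
def validate(record, rules):
    """Return the names of the quality rules that the record violates."""
    return [name for name, check in rules.items() if not check(record)]

# Hypothetical quality rules for a customer record.
rules = {
    "email_present": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

good = {"email": "ana@example.com", "age": 34}
bad = {"email": "", "age": 999}
print(validate(good, rules))  # []
print(validate(bad, rules))   # ['email_present', 'age_in_range']
```

Production systems express the same idea at scale — running such checks on every pipeline run, quarantining failing records, and alerting the owning team.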

Stay Updated and Evolve

Data engineering is a rapidly evolving field, with new technologies and techniques emerging regularly. Stay updated with the latest trends, attend conferences, webinars, and meetups, and actively engage with the data engineering community. Follow influential data engineering blogs, subscribe to newsletters, and participate in forums to keep up with the latest advancements and learn from industry experts.

Conclusion

Embarking on a career in data engineering requires a solid foundation in fundamental concepts, programming languages, databases, big data technologies, and data warehousing. As you progress, delve into advanced topics like data streaming, real-time processing, data governance, and security. Continuously update your skills and stay abreast of the latest tools and technologies in the field. By following this comprehensive learning path and leveraging the recommended resources, you’ll be well-equipped to unlock the power of data and contribute to the exciting world of data engineering.

If you liked this post: I usually write about maths and machine learning, and I am starting to publish about data engineering and programming. Do not hesitate to follow my profile to get notified about new posts.

https://medium.com/@crunchyml


Adrià Serra

Data scientist. This account shares my blog posts about statistics, probability, machine learning, and deep learning. #100daysofML