Having covered the basics for new IT professionals in the articles “Onboarding new IT employees (DevOps, Big Data, Developer)” and “Tutorial: Basic IT knowledge for DevOps, Big Data, Developer”, we would now like to take a closer look at tools and techniques that are specifically relevant for employees in the field of data engineering and data science.
Differentiation between data scientist and data engineer
What does a data scientist do?
The core competencies of a data scientist lie in advanced knowledge of mathematics and statistics, machine learning, artificial intelligence and advanced data analysis. The primary goal of a data scientist is to evaluate analysis results with regard to their usefulness for the company and to present and communicate them in an understandable way. Communication and clear visualization of results are key qualifications, just as important as carrying out the analyses themselves. In contrast to a data engineer, a data scientist is much more concerned with the interpretation and (visual) preparation of data, trains models and machine learning algorithms, and continuously checks them for accuracy.
What does a data engineer do?
The core competencies of a data engineer lie in applied, advanced programming in high-level languages (Python, Scala, Java) and an understanding of distributed systems and the data flows within them. Data analysis and the use of tools for data extraction and transformation are also part of their skill set. In contrast to a data scientist, a data engineer provides data and processes it programmatically.
Generic Skills & Tools
There are a few tools and methods that data engineers and data scientists need in almost all projects. The programming language Python is one of the most common. It is used to prepare data such as CSV files, transform data, and save the results in a database or another file for further processing. The Python ecosystem is huge, and there is a wide range of libraries with pre-built functions that can be used in your own projects. The very popular Apache Spark framework for cluster computing can also be accessed from Python (and Scala, Java) via PySpark. Since many university STEM programs use Python for data preparation, new hires often bring basic knowledge with them.
A complete course to get started with Python can be found here: Learn Python – Full Course for Beginners
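As a small illustration of this kind of data preparation task, here is a minimal sketch in Python using pandas; the file and column names are invented for this example:

```python
import pandas as pd

# Extract: read a raw CSV export (hypothetical file and columns)
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Transform: drop rows without a customer ID and convert cents to euros
orders = orders.dropna(subset=["customer_id"])
orders["price_eur"] = orders["price_cents"] / 100.0

# Save the cleaned result to a new file for further processing
orders.to_csv("orders_clean.csv", index=False)
```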
The object-oriented and functional programming language Scala is less common. Scala programs are compiled to bytecode, just like Java, and run on the JVM. Since many projects in the big data environment were implemented in Scala (e.g. Apache Spark), knowledge of Scala is often necessary to use all of a component’s features or to write particularly efficient programs. In general, at least a basic understanding helps in order to read Scala code and adapt it if necessary.
The complete course for getting started with Scala can be found here: Scala Tutorial
Another important basic technology is the Hortonworks Apache Hadoop Distribution. Due to its free availability, it is used in many companies, both on-premises and in cloud instances. Apache HDFS, Apache Hive, Apache Spark and many other projects are often run on the basis of the Hortonworks platform. A very informative online course that gives an overview of how the individual technologies of the Hortonworks Distribution relate to each other and what they are intended for can be found on Udemy: The Ultimate Hands-On Hadoop – Tame your Big Data!
A good overview of the Apache Hadoop ecosystem and its individual components can be found in this highly recommended video: Hadoop Tutorial For Beginners | Hadoop Ecosystem Explained in 20 min! – Frank Kane
For a comprehensive overview of topics such as data structures (tables, graphs, etc.), data formats, query languages and concepts (SQL, NoSQL), data storage concepts, clustering, challenges of distributed systems, batch and stream processing, and general concepts for data-intensive applications, we recommend the O'Reilly book “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems”. The chapters of the book each form independent units of knowledge, so you do not have to work through it from start to finish but can pick up the parts you need. In our view, the recommended chapters are:
- Part I:
- Chapter 1 – Reliable, Scalable, and Maintainable Applications
- Chapter 2 – Data Models and Query Languages
- Chapter 4 – Encoding and Evolution
- Part II:
- Chapter 7 – Transactions
- Chapter 8 – Trouble with Distributed Systems
- Part III:
- Chapter 10 – Batch Processing
- Chapter 11 – Stream Processing
The book aims to convey the “thinking” behind application development for systems that process large amounts of data on commodity PC and server hardware. Standard hardware can fail, respond with a delay or be (temporarily) blocked, and therefore requires efficient, fault-tolerant frameworks that can perform calculations safely in such “unsafe” environments. In addition, a number of design guidelines must be considered when designing software for distributed environments. We recommend reading the chapters listed above first and consulting the others later to delve deeper into individual topics.
Data Engineering
A data engineer deals with the following topics, among others:
ETL concepts: Extract, Transform, Load is a process in which source data from different systems (accounting, warehousing, advertising data, user behavior, etc.) is merged into a new, central, structured database (data warehouse, data lake). Information is often referenced via IDs (e.g. product numbers are joined with their product data) and invalid entries are cleaned up. In order to provide the data in high quality for further analysis steps, it is important to first understand and document the meaning of the individual pieces of information (e.g. the columns of a table). The precise understanding of data values therefore plays a key role in later analyses. Providing data in a suitable format for further analysis or processing often takes up a large part of the working time.
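To make the ETL idea more concrete, the following hedged sketch uses Python and pandas to merge order lines with product master data via the product ID and to clean invalid entries before loading the result into a central database; all file, table and column names are illustrative:

```python
import sqlite3
import pandas as pd

orders = pd.read_csv("order_lines.csv")        # Extract: source system A
products = pd.read_csv("product_master.csv")   # Extract: source system B

# Transform: resolve product IDs to product data, drop unknown products
merged = orders.merge(products, on="product_id", how="left")
cleaned = merged.dropna(subset=["product_name"])

# Load: write the consolidated result into the central (warehouse) database
with sqlite3.connect("warehouse.db") as conn:
    cleaned.to_sql("fact_order_lines", conn, if_exists="append", index=False)
```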
OLTP vs. OLAP: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are two usage paradigms of database systems. OLTP systems ensure that transactions in a business process are processed correctly. This means, for example, that a value written by transaction A cannot be accidentally overwritten by a transaction B that was started earlier or in parallel and has not yet completed. An OLTP system ensures that the database holds a consistent data set at all times, even with many parallel accesses. The database systems PostgreSQL, MySQL and MariaDB are examples of OLTP systems.
While OLTP systems are geared towards transactional safety, OLAP focuses on complex analytical workloads. Existing data should be viewable from as many different perspectives as possible, for example product sales broken down by region, product group, day of the week, shelf position, and so on. This approach is intended to provide analyses that support management decisions. In OLAP systems, transactions do not have to be shielded from one another; the focus is instead on the performance of data aggregations. The open-source software ClickHouse, for example, was developed to aggregate large amounts of data in a very short time.
The query language SQL is used for OLTP/OLAP systems and should therefore be mastered in any case.
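As a small illustration of the two paradigms described above, the following Python sketch uses SQLite purely for demonstration: a short OLTP-style transaction followed by an OLAP-style aggregation. The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP: a short transaction that either commits completely or not at all
with conn:  # the context manager wraps the statements in one transaction
    conn.execute("INSERT INTO sales VALUES ('north', 'bike', 499.0)")
    conn.execute("INSERT INTO sales VALUES ('south', 'bike', 549.0)")

# OLAP: an analytical query that aggregates the data by region
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
```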
Apache Hive/Spark: After data has been extracted from source systems using ETL processes, analyses are often carried out on it. A data engineer monitors the provisioning and examines the data to ensure high initial quality. With large amounts of data, the preparation can often only be done in a reasonable time by a compute cluster. For this, data engineers write Apache Hive scripts (HQL, the Hive Query Language, is very similar to SQL) or Apache Spark programs that run on a Hadoop platform. Spark programs can be written in Python, Scala or Java, among other languages.
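A minimal PySpark sketch of such a preparation job could look like the following; it assumes a Spark installation with Hive support, and the table names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-aggregation")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the raw events from a Hive table and aggregate per day and product
events = spark.table("raw.events")
daily = (
    events.groupBy("event_date", "product_id")
    .agg(F.count("*").alias("events"), F.sum("revenue").alias("revenue"))
)

# Write the prepared result back to Hive for downstream analyses
daily.write.mode("overwrite").saveAsTable("curated.daily_product_stats")
```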
Apache Airflow: In most cases, the Apache Hive scripts or Apache Spark programs support recurring business processes, so they must be executed regularly. Recurring tasks can be controlled with schedulers such as Cron, Oozie or Apache Airflow. Airflow not only executes the jobs, but also checks whether they completed successfully, controls third-party systems and provides a graphical representation of all jobs to be executed.
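A minimal Airflow sketch of such a recurring job might look like this; it assumes an Airflow 2.x installation, and the spark-submit command and file path are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_product_stats",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run every night at 02:00
    catchup=False,
) as dag:
    # One task that submits the (hypothetical) Spark job to the cluster
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /jobs/daily_product_stats.py",
    )
```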
Data Science
A data scientist deals with the following topics, among others:
Jupyter Notebook / Apache Zeppelin: Data science consists largely of data exploration and the patterns and hypotheses that can be derived from it. An interactive programming environment is ideal for examining and visualizing data. In the data science field, Jupyter Notebook has become the de facto standard for these tasks, and there are many plugins for carrying out various operations on the data. The following video provides a brief overview: What is Jupyter Notebook?
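The following snippet shows the kind of exploration typically run cell by cell in a notebook, sketched with pandas and matplotlib; the dataset and column names are invented:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Quick look at structure and summary statistics
print(df.head())
print(df.describe())

# Visualize revenue per month to spot trends or anomalies
monthly = df.set_index("order_date")["revenue"].resample("M").sum()
monthly.plot(kind="bar", title="Revenue per month")
plt.show()
```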
Apache Spark: Like data engineers, data scientists also use the Apache Spark framework. Their focus is on libraries such as MLlib or Spark ML, which contain many machine learning algorithms. Models can be trained on the provided data, or new data can be scored against previously created models.
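A hedged sketch using the DataFrame-based Spark ML API (pyspark.ml) could look like this; the table and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("churn-model").getOrCreate()

# Load prepared training data (hypothetical table and columns)
data = spark.table("curated.training_data")

# Combine the numeric input columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["age", "visits", "basket_value"], outputCol="features"
)
train_df = assembler.transform(data)

# Train a logistic regression model and inspect its training accuracy
model = LogisticRegression(labelCol="churned", featuresCol="features").fit(train_df)
print(model.summary.accuracy)
```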
TensorFlow: Another very popular framework for machine learning and dataflow-oriented programming. It is flexible enough to be used for purposes such as speech recognition, image recognition and text classification. TensorFlow is used by well over 1,500 GitHub projects, is open source and can be used freely.
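A minimal TensorFlow/Keras sketch for a binary classification task, trained on randomly generated dummy data so that it runs standalone:

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for real, preprocessed features and labels
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

# A small feed-forward network with a sigmoid output for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)
```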
Scikit-learn: This free machine learning library is also very popular. It contains classification, regression and clustering algorithms and is used from Python programs.
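A short scikit-learn sketch of the classification workflow, using the bundled iris dataset so that no external files are needed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a classifier and evaluate it on the held-out test set
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```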
We hope we have been able to give you a good impression of the basic tools, skills and techniques for data engineers and data scientists that are highly relevant to the Big Data field. We are excited to hear about your experiences. Please send us your recipes for success, questions or other suggestions by email to [email protected]
We look forward to your suggestions!