If you’re new to data engineering, or are a practitioner of a related field such as data science or business intelligence, we thought it might be helpful to have a handy list of commonly used terms to get you up to speed. This data engineering glossary is by no means exhaustive, but it should provide some foundational context and information. A few entries are followed by short code examples after the table.
| Term | Definition |
| --- | --- |
| Advanced Analytics | The process of discovering deeper insights in data than most business intelligence (BI) tools typically enable. Advanced analytics employs sophisticated tools and techniques, including machine learning (ML) and artificial intelligence (AI), data/text mining, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, and more. |
| Apache Airflow | A platform to “programmatically author, schedule and monitor workflows,” in the project’s own words (see the example below the table). |
| Artificial Intelligence | AI is a broad term for engineered systems that have been taught to do a task that typically requires human intelligence. |
| BI (Business Intelligence) | Strategies and systems used by enterprises to conduct data analysis and make pertinent business decisions. |
| Big Data | Large volumes of structured or unstructured data. |
| Big Data Processing | To extract value or insights from big data, one must first process it using big data processing software or frameworks, such as Hadoop. |
| BigQuery | Google’s cloud data warehouse. |
| Cassandra | Apache Cassandra, the Apache Foundation’s open-source, distributed NoSQL database. |
| Data Architecture | The composition of models, rules, and standards for all data systems and the interactions between them. |
| Data Catalog | An organized inventory of data assets that relies on metadata to help with data management. |
| Data Engineering | The process by which data engineers make data useful. Data engineers design, build, and maintain data pipelines that transform data from a raw state to a useful one, ready for analysis or data science modeling. |
| Data Ingestion | The process by which data is moved from one or more sources into a storage destination, where it can be put into a data pipeline and transformed for later analysis or modeling. |
| Data Integration | Combining data from various, disparate sources into one unified view. |
| Data Lake | A storage repository where data is kept in its raw format. Data lakes allow for more flexibility than the rigid structure of a data warehouse. |
| Data Lineage | A description of the origin of data and the changes it undergoes over time. |
| Data Management | The practice of collecting, maintaining, and utilizing data securely and effectively. |
| Data Migration | The process of permanently moving data from one storage system to another. Data migration may involve transforming data along the way. |
| Data Mining | The process of finding patterns, correlations, or anomalies within data sets to predict outcomes. |
| Data Pipeline | A set of steps that ingest and integrate data from raw sources and move it to a destination for analysis or data science (see the ETL example below the table). Data pipelines can be automated and maintained so that consumers of the data always have reliable data to work with. |
| Data Science | A practice that uses scientific methods, algorithms, and systems to find insights within structured and unstructured data. |
| Data Visualization | Graphic representation of one or more sets of data. |
| Data Warehouse | A storage system used for data analysis and reporting. |
| Database | An organized collection of structured data. |
| ETL | Extract, transform, load: the three-step data integration process used to blend data from different sources (see the example below the table). |
| Flat File | A type of database that stores data in a plain-text format, such as a CSV file. |
| Flink | Apache Flink, the Apache Foundation’s big data processing tool, with the ability to process streaming data in real time. |
| Hadoop / HDFS | Apache’s open-source software framework for processing big data. HDFS stands for Hadoop Distributed File System. |
| JSON | JavaScript Object Notation, a data-interchange format for storing and transporting data (see the example below the table). |
| Kafka | Apache Kafka, the Apache Foundation’s open-source software platform for streaming (see the example below the table). |
| Kubernetes / k8s | An open-source system for automating the deployment, scaling, and management of containerized applications. Also called k8s. |
| Machine Learning (ML) | ML generally refers to algorithms built to identify patterns in big data. |
| MapReduce | A component of the Hadoop framework used to access big data stored within the Hadoop Distributed File System. |
| Metadata | A set of data that describes and gives information about other data. |
| MySQL | An open-source relational database management system with a client-server model. |
| NoSQL | A non-relational database. |
| Open Source | Software that is freely available to use and modify. |
| Parquet | A column-oriented data storage format that’s part of the Hadoop ecosystem (see the PySpark example below the table). |
| PostgreSQL | A free, open-source relational database management system, also known as Postgres. |
| PySpark | The Python API for Apache Spark (see the example below the table). |
| Redshift | Amazon’s cloud data warehouse. |
| S3 | Amazon’s object storage service (Simple Storage Service). |
| SQL | Structured Query Language, a domain-specific language that tells a database server what to do with data (see the example below the table). |
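A few of the terms above are easier to grasp from a short, runnable example than from a definition alone. The sketches that follow use Python throughout and are illustrative only: every file name, table name, topic, and connection string is a placeholder, not a recommendation.

First, SQL (and relational databases generally). This minimal sketch uses Python’s built-in `sqlite3` module; the `users` table and its columns are made up for illustration.

```python
import sqlite3

# Open an in-memory relational database.
conn = sqlite3.connect(":memory:")

# SQL statements tell the database engine what to do with the data.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))

# Query the structured data back out.
for row in conn.execute("SELECT id, name FROM users"):
    print(row)  # (1, 'Ada')

conn.close()
```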
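JSON’s role as a data-interchange format is visible in a few lines with Python’s standard `json` module; the record shown is invented.

```python
import json

# Serialize a Python dict to a JSON string, e.g. to send between systems.
record = {"user_id": 42, "events": ["login", "purchase"], "active": True}
payload = json.dumps(record)

# Deserialize the JSON string back into Python objects on the other side.
restored = json.loads(payload)
assert restored["events"][1] == "purchase"
```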
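Extract, transform, load is clearer as code. Below is a toy, single-process pipeline built only on the standard library: it extracts rows from a CSV flat file, transforms them, and loads them into a SQLite table standing in for a warehouse. The file layout, column names, and transformation are all invented; a production pipeline would typically run under an orchestrator such as Airflow (next example).

```python
import csv
import sqlite3

def extract(path):
    # Extract: stream raw rows from a source system (here, a CSV flat file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: clean and reshape raw records into an analysis-ready form.
    for row in rows:
        yield (row["order_id"], row["country"].strip().upper(), float(row["amount"]))

def load(rows, conn):
    # Load: write the transformed rows into the destination table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("orders.csv")), conn)
```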
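An Airflow workflow is authored programmatically as a DAG (directed acyclic graph) of tasks. A minimal sketch, assuming a recent Airflow 2.x install (the `schedule` argument replaced `schedule_interval` in 2.4); the DAG id and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")

def load():
    print("writing data to the warehouse")

# Author the workflow in code; Airflow schedules and monitors it.
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```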
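Kafka moves data as a stream of messages published to named topics. A producer sketch, assuming the third-party `kafka-python` client and a broker at a placeholder address; the topic and payload are made up.

```python
from kafka import KafkaProducer  # third-party package: kafka-python

# Connect to a Kafka broker (address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish an event to a topic; downstream consumers can read it in real time.
producer.send("page-views", b'{"user_id": 42, "page": "/pricing"}')
producer.flush()  # block until the message is actually delivered
```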
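Finally, PySpark exposes Spark’s distributed processing from Python, and Parquet is a common column-oriented format for the output. A sketch assuming `pyspark` is installed; the input path and column name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glossary-example").getOrCreate()

# Read raw JSON events into a distributed DataFrame.
events = spark.read.json("events.json")

# Run a simple aggregation; Spark distributes the work across the cluster.
daily_counts = events.groupBy("event_date").count()

# Persist the result as column-oriented Parquet files.
daily_counts.write.parquet("daily_counts.parquet")

spark.stop()
```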
Note: We will continue to add to the above Data Engineering Glossary over time.