If you’re new to data engineering, or are a practitioner of a related field such as data science or business intelligence, we thought it might be helpful to have a handy list of commonly used terms to get you up to speed. This data engineering glossary is by no means exhaustive, but it should provide some foundational context and information. A few entries are followed by short code examples after the table.
| Term | Definition |
| --- | --- |
| Advanced Analytics | The process of discovering deeper insights in data than most business intelligence (BI) tools typically enable. Advanced analytics employs sophisticated tools and techniques, including machine learning (ML) and artificial intelligence (AI), data/text mining, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, and more. |
| Apache Airflow | A platform to “programmatically author, schedule and monitor workflows,” in the project’s own words (see the example below the table). |
| Artificial Intelligence | AI is a broad term for engineered systems that have been taught to do a task that typically requires human intelligence. |
| BI (Business Intelligence) | Strategies and systems used by enterprises to conduct data analysis and make pertinent business decisions. |
| Big Data | Large volumes of structured or unstructured data. |
| Big Data Processing | To extract value or insights from big data, one must first process it using big data processing software or frameworks, such as Hadoop. |
| BigQuery | Google’s cloud data warehouse. |
| Cassandra | Apache Cassandra, the Apache Foundation’s open-source, distributed NoSQL database. |
| Data Architecture | The composition of models, rules, and standards for all data systems and the interactions between them. |
| Data Catalog | An organized inventory of data assets that relies on metadata to help with data management. |
| Data Engineering | The process by which data engineers make data useful. Data engineers design, build, and maintain data pipelines that transform data from a raw state to a useful one, ready for analysis or data science modeling. |
| Data Ingestion | The process by which data is moved from one or more sources into a storage destination, where it can be put into a data pipeline and transformed for later analysis or modeling. |
| Data Integration | Combining data from various, disparate sources into one unified view. |
| Data Lake | A storage repository where data is kept in its raw format. Data lakes allow for more flexibility than the rigid structure of a data warehouse. |
| Data Lineage | A description of the origin of data and the changes it undergoes over time. |
| Data Management | The practice of collecting, maintaining, and utilizing data securely and effectively. |
| Data Migration | The process of permanently moving data from one storage system to another. Data migration may involve transforming data along the way. |
| Data Mining | The process of finding patterns, correlations, or anomalies within data sets to predict outcomes. |
| Data Pipeline | A set of steps that ingest and integrate data from raw sources and move it to a destination for analysis or data science (see the ETL example below the table). Data pipelines can be automated and maintained so that consumers of the data always have reliable data to work with. |
| Data Science | A practice that uses scientific methods, algorithms, and systems to find insights within structured and unstructured data. |
| Data Visualization | Graphic representation of one or more sets of data. |
| Data Warehouse | A storage system used for data analysis and reporting. |
| Database | An organized collection of structured data. |
| ETL | Extract, transform, load: the three-step data integration process used to blend data from different sources (see the example below the table). |
| Flat File | A type of database that stores data in a plain-text format, such as a CSV file. |
| Flink | Apache Flink, the Apache Foundation’s big data processing tool, with the ability to process streaming data in real time. |
| Hadoop / HDFS | Apache’s open-source software framework for processing big data. HDFS stands for Hadoop Distributed File System. |
| JSON | JavaScript Object Notation, a data-interchange format for storing and transporting data (see the example below the table). |
| Kafka | Apache Kafka, the Apache Foundation’s open-source software platform for streaming (see the example below the table). |
| Kubernetes / k8s | An open-source system for automating the deployment, scaling, and management of containerized applications. Also called k8s. |
| Machine Learning (ML) | ML generally refers to algorithms built to identify patterns in big data. |
| MapReduce | A component of the Hadoop framework used to access big data stored within the Hadoop Distributed File System. |
| Metadata | A set of data that describes and gives information about other data. |
| MySQL | An open-source relational database management system with a client-server model. |
| NoSQL | A non-relational database. |
| Open Source | Software that is freely available to use and modify. |
| Parquet | A column-oriented data storage format that’s part of the Hadoop ecosystem (see the PySpark example below the table). |
| PostgreSQL | A free, open-source relational database management system, also known as Postgres. |
| PySpark | The Python API for Apache Spark (see the example below the table). |
| Redshift | Amazon’s cloud data warehouse. |
| S3 | Amazon’s object storage service (Simple Storage Service). |
| SQL | Structured Query Language, a domain-specific language that tells a database server what to do with data (see the example below the table). |
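A few of the terms above are easier to grasp from a short, runnable example than from a definition alone. The sketches that follow use Python throughout and are illustrative only: every file name, table name, topic, and connection string is a placeholder, not a recommendation.

First, SQL (and relational databases generally). This minimal sketch uses Python’s built-in `sqlite3` module; the `users` table and its columns are made up for illustration.

```python
import sqlite3

# Open an in-memory relational database.
conn = sqlite3.connect(":memory:")

# SQL statements tell the database engine what to do with the data.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))

# Query the structured data back out.
for row in conn.execute("SELECT id, name FROM users"):
    print(row)  # (1, 'Ada')

conn.close()
```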
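JSON’s role as a data-interchange format is visible in a few lines with Python’s standard `json` module; the record shown is invented.

```python
import json

# Serialize a Python dict to a JSON string, e.g. to send between systems.
record = {"user_id": 42, "events": ["login", "purchase"], "active": True}
payload = json.dumps(record)

# Deserialize the JSON string back into Python objects on the other side.
restored = json.loads(payload)
assert restored["events"][1] == "purchase"
```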
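Extract, transform, load is clearer as code. Below is a toy, single-process pipeline built only on the standard library: it extracts rows from a CSV flat file, transforms them, and loads them into a SQLite table standing in for a warehouse. The file layout, column names, and transformation are all invented; a production pipeline would typically run under an orchestrator such as Airflow (next example).

```python
import csv
import sqlite3

def extract(path):
    # Extract: stream raw rows from a source system (here, a CSV flat file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: clean and reshape raw records into an analysis-ready form.
    for row in rows:
        yield (row["order_id"], row["country"].strip().upper(), float(row["amount"]))

def load(rows, conn):
    # Load: write the transformed rows into the destination table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("orders.csv")), conn)
```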
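An Airflow workflow is authored programmatically as a DAG (directed acyclic graph) of tasks. A minimal sketch, assuming a recent Airflow 2.x install (the `schedule` argument replaced `schedule_interval` in 2.4); the DAG id and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")

def load():
    print("writing data to the warehouse")

# Author the workflow in code; Airflow schedules and monitors it.
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```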
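Kafka moves data as a stream of messages published to named topics. A producer sketch, assuming the third-party `kafka-python` client and a broker at a placeholder address; the topic and payload are made up.

```python
from kafka import KafkaProducer  # third-party package: kafka-python

# Connect to a Kafka broker (address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish an event to a topic; downstream consumers can read it in real time.
producer.send("page-views", b'{"user_id": 42, "page": "/pricing"}')
producer.flush()  # block until the message is actually delivered
```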
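Finally, PySpark exposes Spark’s distributed processing from Python, and Parquet is a common column-oriented format for the output. A sketch assuming `pyspark` is installed; the input path and column name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glossary-example").getOrCreate()

# Read raw JSON events into a distributed DataFrame.
events = spark.read.json("events.json")

# Run a simple aggregation; Spark distributes the work across the cluster.
daily_counts = events.groupBy("event_date").count()

# Persist the result as column-oriented Parquet files.
daily_counts.write.parquet("daily_counts.parquet")

spark.stop()
```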
Note: We will continue to add to the above Data Engineering Glossary over time.