Overview: Build Your Data Architecture to Enable Use Cases
One of the things that we often wrestle with in building out data lake architecture is how to best lay out the infrastructure to support different analytical use cases, and more specifically, what storage mechanism might yield the best performance.
One of the virtues of data lakes is that you can build hybrid approaches to storing and processing information that allow you to play to the strengths of particular types of storage without being tied to them exclusively. This also gives you a broad range of choices. While these choices bring flexibility, they can also lead to more complexity. Choosing the right set of “good enough” storage formats for your data lake architecture can make implementation a whole lot easier.
This post lays out what we see as some common use cases and the implications for data storage in support of those use cases. It illustrates the diversity of technical approaches available for storing analytical data and how they compare to one another. I have also included a small, informal benchmark to further demonstrate the point.
THE STORAGE CONTENDERS
There are a number of different types of storage that you can integrate into your data lake architecture or broader analytics infrastructure. These include storage layers that are largely independent of compute capacity as well as approaches that integrate the two and may store data in a proprietary format.
The table below describes some of these approaches and includes example technologies.
Storage | Description | Examples |
---|---|---|
Distributed File System/Object Store – Text File | The simplest form of storage beyond local disk storage. Distributed file systems typically provide fault tolerance, redundancy, and the ability to read and write data in parallel from clustered nodes. In this case, data is stored in text files, either in a tabular, delimited format or in a more structured format like JSON. For the purposes of this discussion, we are going to treat object stores like AWS S3 that present an HDFS-like interface as part of this category. | HDFS/CSV, HDFS/JSON |
Distributed File System/Object Store – Column-oriented | Flat text files can be effective for certain use cases, but more often than not, a structured file format with schema information is a better fit. In particular, compressed, column-oriented formats with partitioning tend to perform very well for a range of analytical applications. Data retrieval and aggregation are accelerated because only the columns required are retrieved and data in the same column is contiguous. | HDFS/Parquet, HDFS/ORC |
Distributed File System – Row-oriented | In some cases, column-oriented stores may not be effective: for example, streaming applications or applications where there is a good deal of variability in the schema of written objects. In these cases, a binary, compressed format is still useful. | HDFS/Avro |
RDBMS – Column-oriented | Similar to the compressed column formats described above, but embedded in a fully functional SQL database; storage in a proprietary format further improves performance. | Redshift, Vertica, Snowflake |
RDBMS – Row-oriented | A conventional relational database typically used as the backend data store for applications, but can also be suitable for analytics at a moderate scale. | PostgreSQL, MySQL, Oracle, MS SQL Server |
NoSQL | Key-value and document stores that typically provide for fast writes and retrievals of data. Analytical performance can vary significantly from one architecture to another. | MongoDB, ElasticSearch, Cassandra |
In Memory | These are data stores that manage information in memory on the server/cluster, providing very high performance for queries by eliminating the storage I/O bottleneck. There are also a variety of caching mechanisms, like the one built into Spark, that can provide similar performance once the data has been initially loaded. | MemSQL, MonetDB, Various Caching Mechanisms |
Local File System | The native file system for the server being used. This may also include network-attached storage and external storage arrays, but it does not include distributed systems like HDFS. | ext4, NTFS |
There are other types of data stores that we could add, for example graph databases, but the intent here is to be more representative than exhaustive and we may expand on this topic later and take a more comprehensive look.
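The row-oriented versus column-oriented distinction above can be made concrete with a small, stdlib-only Python sketch. This is an analogy rather than a real storage engine (the variable names are invented for illustration): it shows why aggregating one column is cheaper when that column is stored contiguously.

```python
# Illustrative analogy: row-oriented vs column-oriented layouts.
# All names and values here are invented for this sketch.

# Row-oriented: each record is stored together (like CSV, JSON, or Avro).
trips_rows = [
    {"pickup": 1, "dropoff": 2, "distance": 3.5},
    {"pickup": 1, "dropoff": 3, "distance": 1.25},
    {"pickup": 2, "dropoff": 3, "distance": 7.75},
]

# Column-oriented: each column is stored contiguously (like Parquet or ORC).
trips_cols = {
    "pickup":   [1, 1, 2],
    "dropoff":  [2, 3, 3],
    "distance": [3.5, 1.25, 7.75],
}

# Aggregating one column from the row layout touches every field of every record...
total_row = sum(rec["distance"] for rec in trips_rows)

# ...while the column layout only scans the one column it needs.
total_col = sum(trips_cols["distance"])

assert total_row == total_col == 12.5
```

A real columnar file adds compression and per-column statistics on top of this layout, but the core access-pattern advantage is the same.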
STRENGTHS AND WEAKNESSES
As stated above, each of these storage approaches has different strengths and weaknesses. They may all have a place in your broader analytics architecture, but they need to be applied in the right way.
The table below lays out some of the specific use cases that may be present in your data lake architecture and how they map to different storage approaches.
Use Case | Description | Good Structures | Good Enough Structures | Bad Structures |
---|---|---|---|---|
Feature Engineering/ Model Training (large data set) | The process of building machine learning and AI models to improve decision-making, in this case with a focus on large data sets. | DFS/Compressed Columns, NoSQL, DFS/Compressed Rows | Anything that allows parallel access by multiple cluster nodes | Anything that doesn’t allow parallel access |
Model Training (small data set) | For the training of models on smaller data sets, simple files within the local file system of the computer being used will usually yield the best performance. If you have “small” data, a simple file accessed from tools like RStudio or Jupyter can be adequate. | Local file system | DFS/CSV, DFS/Row Oriented, NoSQL, Relational DB, Column Store | NA – depending on the size of data, it may not really matter |
Traditional BI – Aggregation and filtering – large scale, low concurrency, not real-time | Reporting and ad hoc exploration against larger datasets usually with a small number of concurrent users, perhaps dozens. | DFS/Compressed Columns, Columnar Databases | In-memory Databases, NoSQL – Search Engine-based, RDBMS – Row-oriented | DFS/Row-oriented, DFS/Flat files |
Traditional BI – Aggregation and filtering – moderate scale, high concurrency, real-time/streamed | Reporting and ad hoc exploration against moderate-sized data sets, but with a high number of users (hundreds or thousands) accessing the information concurrently. This is often the case when reporting is embedded within applications. | In-memory Databases, NoSQL – Search Engine-based | DFS/Compressed Column-oriented, Columnar databases | DFS/Flat files |
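If you want to capture a mapping like this in tooling or configuration, a plain lookup table is enough. The sketch below transcribes the two BI rows of the table above into Python; the key names and the `is_reasonable` helper are invented for illustration.

```python
# Illustrative lookup: use cases mapped to storage choices, transcribed
# from the table above. Keys and structure are invented for this sketch.
STORAGE_FIT = {
    "bi_large_scale_low_concurrency": {
        "good": ["DFS/Compressed Columns", "Columnar Databases"],
        "good_enough": ["In-memory Databases", "NoSQL - Search Engine-based",
                        "RDBMS - Row-oriented"],
        "bad": ["DFS/Row-oriented", "DFS/Flat files"],
    },
    "bi_moderate_scale_high_concurrency": {
        "good": ["In-memory Databases", "NoSQL - Search Engine-based"],
        "good_enough": ["DFS/Compressed Column-oriented", "Columnar Databases"],
        "bad": ["DFS/Flat files"],
    },
}

def is_reasonable(use_case: str, storage: str) -> bool:
    """A storage choice is reasonable if it is 'good' or 'good enough'."""
    fit = STORAGE_FIT[use_case]
    return storage in fit["good"] or storage in fit["good_enough"]

assert is_reasonable("bi_large_scale_low_concurrency", "Columnar Databases")
assert not is_reasonable("bi_moderate_scale_high_concurrency", "DFS/Flat files")
```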
STORAGE FORMATS: A PRACTICAL PERSPECTIVE
Pragmatically speaking, you will need to deal with some of these formats just given their prevalence in both source systems and analytics infrastructure. The following are likely to play some role in your implementation based on their role as storage formats within the data lake, or as sources or targets for data.
- Text Files – Information will often come into the data lake in the form of delimited text, JSON, or other similar formats. As discussed above, text formats are seldom the best choice for analysis, so you should generally convert to a compressed format like ORC or Parquet.
- Relational Database Sources – Almost every enterprise has one or more relational databases that are likely to be sources for data that needs to be fed into analytics. Column-stores like Redshift or Snowflake are really a category of relational database and will often be a target for refined information coming from the data lake.
- Compressed Column-oriented Formats – These formats are the workhorse of most data lakes. They provide reasonable performance under a variety of workloads and are space-efficient from a storage perspective. Either Parquet or ORC is likely to play a role in your data lake.
Other types of storage should be considered on a case by case basis.
- Compressed Row-oriented Formats – You may need to deal with the Avro format in particular if you want to ingest and use streaming data from Kafka.
- In Memory Databases – While caching within your data lake platform is likely, you should only use a dedicated in memory caching infrastructure or full-fledged database if you have a specific use case that calls for it. Examples include real-time analytics against streaming data and high-concurrency applications as described above.
- NoSQL Stores – NoSQL databases and search index based platforms like ElasticSearch can be useful for rapidly generating metrics over large sets of data. The performance characteristics and features of these platforms vary widely, so it is best to fully understand specific needs and choose accordingly.
Data lakes are well-positioned to handle a mixed workload. To a large extent, compressed, binary formats stored in HDFS or in a distributed object store that allows parallel access, like AWS S3, can perform well across a very broad range of applications, as reflected in the “Good Enough” column in the table above. Keep in mind that sometimes, good enough is great.
Real Life Example: Benchmarking Analytics Performance
To drive home the point that different storage approaches can substantially impact performance for specific use cases, we tested the analytics performance of three different data storage approaches for some simple processing.
THE DATA
We conducted the benchmark using public data about New York City taxi trips during 2017 and 2018. This data was initially sourced from the New York Taxi and Limousine Commission’s public data – https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. There were two tables used in the analysis:
- taxi_zones – This is a lookup table that provides attributes about different taxi zones within New York City including a name and latitude and longitude for the centroid of the zone. This dataset is 236 rows.
- yellow_tripdata – This is a listing of taxi trips in New York City over the course of two years. This table has approximately 108M rows.
While these are relatively small datasets by Spark standards, they are big enough to illustrate some of the performance differences on a small cluster. They also represent a fairly typical analytics scenario in which data is being aggregated based on a set of dimensional attributes in one table from another large fact table representing a series of events.
THE STORAGE FORMATS
We took a representative set of storage formats that we support within our Magpie platform.
- Parquet Format in S3, Snappy Compression – As described above, Parquet is often the “default” storage format for Spark.
- ORC Format in S3, Snappy Compression – ORC is the predominant format used in Hive-based implementations. More recent versions of Spark have substantially improved their handling of ORC files.
- Avro Format in S3, Snappy Compression – Avro is row-oriented (unlike Parquet and ORC) and cannot selectively return only relevant columns or accelerate aggregation. It is the default format used in the Kafka streaming framework.
- JSON Format in S3, GZIP Compression – We also included a JSON format file to illustrate just how badly text files perform for this kind of analytical processing.
- JSON Format Cached in Memory – We also used Spark’s caching mechanism to cache the JSON version of the table in memory. Our expectation would be that any of the storage formats would perform equivalently once cached in memory. JSON takes about two minutes to warm up the cache on the initial run of a query, so it is probably slower than other approaches in that one respect, but should otherwise be comparable.
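The warm-cache behavior described above follows a familiar pattern. The stdlib-only sketch below is an analogy with invented names, not Spark code: the first read pays a one-time parsing cost, and every subsequent read is served from memory.

```python
import json

# Pretend "storage": JSON text that must be parsed on every cold read.
# All names and values are invented for this sketch.
RAW_JSON = json.dumps([{"trip_distance": d} for d in (1.0, 2.5, 4.0)])

_cache = {}  # warm cache: parsed rows kept in memory after the first access

def load_trips():
    """First call parses the JSON (slow, like a cold cache);
    later calls return the in-memory rows (fast, like a warm cache)."""
    if "trips" not in _cache:
        _cache["trips"] = json.loads(RAW_JSON)  # the one-time warm-up cost
    return _cache["trips"]

first = load_trips()   # pays the parse cost
second = load_trips()  # served from memory
assert first is second  # same in-memory object; no re-parse
assert first[0]["trip_distance"] == 1.0
```

Spark’s cache works at a very different scale (partitioned across executors, with eviction), but the cost profile, slow first touch and fast repeats, is the same shape.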
THE ANALYSIS
We ran three distinct kinds of workloads to illustrate the differences between formats in different scenarios.
- Join with Aggregation – Summarizing metrics by attributes. In this case, summarizing average trip distance and trip counts for every combination of pickup and drop-off taxi zones.
- Join and Filter without Aggregation – In this case, pickup and drop-off locations are joined to the core trip table to get a result set filtered by date. The purpose of these queries is to measure how the performance of the column-oriented formats is affected when their main advantages, fast aggregation and column selection, are not used. Two different queries are used here because we discovered that the filter criteria significantly impacted relative performance. Results for both are included.
- Feature extraction/transformation – We also ran a standard type of feature extraction process against the tables. In this case, we simply scaled some numerical variables present in the table. This often needs to be done to ensure that differences in scale between variables do not unduly impact modeling results. It involves scanning all of the rows of the table.
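For readers unfamiliar with this step, here is a minimal, stdlib-only illustration of the kind of scaling transformation described above. The benchmark itself ran inside Spark; this sketch (function name and sample values invented) only shows the arithmetic.

```python
# Min-max scaling of a numeric column to [0, 1]: a minimal illustration of
# the feature-scaling step described above, not the benchmark code itself.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

distances = [0.5, 1.0, 2.0, 4.5]  # made-up trip distances in miles
scaled = min_max_scale(distances)

assert scaled[0] == 0.0 and scaled[-1] == 1.0
assert all(0.0 <= v <= 1.0 for v in scaled)
```

Whatever engine runs it, this transformation has to visit every row once, which is why it stresses raw scan throughput more than column selectivity.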
This benchmark was run on a single node Magpie cluster with 16 cores and 64GB of RAM. The files were all stored in S3 using either Snappy or GZIP compression. For reference, Magpie is our data lake platform powered by Apache Spark. Performance in our platform is roughly comparable to performance measured in a vanilla Spark environment.
THE RESULTS
The charts below summarize the results of the performance testing. Below each chart is a more detailed evaluation of the results.
JOIN WITH AGGREGATION
The query used to generate these results aggregates data by pickup and dropoff location.
```sql
SELECT pulocationid,
       tz1.zone AS pickup_zone,
       dolocationid,
       tz2.zone AS dropoff_zone,
       sum(trip_distance) / count(*) AS avg_distance,
       count(*) AS trip_count
FROM yellow_tripdata_orc yt
JOIN taxi_zones_orc tz1
  ON tz1.zone_id = yt.pulocationid
JOIN taxi_zones_orc tz2
  ON tz2.zone_id = yt.dolocationid
GROUP BY pulocationid,
         dolocationid,
         pickup_zone,
         dropoff_zone
```
In this case, the results are predictable, with columnar formats performing best, Avro performing slightly worse, and JSON performing significantly worse. As we will see in the other cases, cached data performed best.
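For readers who think in code rather than SQL, the join-with-aggregation above boils down to the following stdlib-only re-statement. The sample data is made up for illustration; this is not the benchmark code.

```python
from collections import defaultdict

# Toy stand-ins for the benchmark tables (values invented for illustration).
taxi_zones = {1: "Midtown", 2: "JFK Airport", 3: "Harlem"}  # zone_id -> zone name
trips = [  # (pulocationid, dolocationid, trip_distance)
    (1, 2, 16.0),
    (1, 2, 18.0),
    (1, 3, 5.0),
]

# Group by (pickup zone, dropoff zone) and compute avg_distance and trip_count,
# mirroring the GROUP BY and sum(trip_distance)/count(*) in the SQL above.
totals = defaultdict(lambda: [0.0, 0])  # key -> [distance_sum, trip_count]
for pu, do, dist in trips:
    agg = totals[(taxi_zones[pu], taxi_zones[do])]
    agg[0] += dist
    agg[1] += 1

results = {k: (s / n, n) for k, (s, n) in totals.items()}
assert results[("Midtown", "JFK Airport")] == (17.0, 2)
assert results[("Midtown", "Harlem")] == (5.0, 1)
```

A columnar format serves this workload well because the engine only needs the three columns involved, and the grouped sums run over contiguous data.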
JOIN WITH FILTERING AND NO AGGREGATION – QUERY 1
For this test, the query joins both trip start and destination information, but only filters on date criteria in the yellow_tripdata table. The query used is shown below.
```sql
SELECT pulocationid,
       tz1.zone AS pickup_zone,
       dolocationid,
       tz2.zone AS dropoff_zone,
       sum(trip_distance) / count(*) AS avg_distance,
       count(*) AS trip_count
FROM yellow_tripdata_json yt
JOIN taxi_zones_json tz1
  ON tz1.zone_id = yt.pulocationid
JOIN taxi_zones_json tz2
  ON tz2.zone_id = yt.dolocationid
GROUP BY pulocationid,
         dolocationid,
         pickup_zone,
         dropoff_zone
```
In this test, ORC performed significantly better than Parquet. This could be due to the specific approach ORC uses for full row retrieval and its indexing mechanisms. More importantly, it illustrates how relative performance can vary based on workload.
A surprising result here is that JSON significantly outperforms Avro for this particular query. This could simply be because of the additional overhead of processing Avro’s more complex format. When we introduced more complex filter criteria, Avro tended to perform better.
JOIN WITH FILTERING AND NO AGGREGATION – QUERY 2
The second query, shown below, differs in that it includes filter criteria for both the start and destination of the trip in the taxi_zones table. This slows down performance overall and changes the relative performance across formats.
```sql
SELECT tz1.zone_id,
       tz1.borough,
       tz1.centroid_lat,
       tz1.centroid_lon,
       tz1.zone AS pickup_zone,
       tz2.zone AS dropoff_zone,
       yt.*
FROM yellow_tripdata yt
JOIN taxi_zones tz1
  ON tz1.zone_id = yt.pulocationid
JOIN taxi_zones tz2
  ON tz2.zone_id = yt.dolocationid
WHERE tz1.zone_id = 234
  AND tz2.zone_id = 170
```
In this case, the performance is more in line with what we would expect. Avro outperforms JSON significantly, but still lags relative to the column-oriented formats. Interestingly, Parquet is faster than ORC in this case. This may imply that Parquet does better with smaller result sets and more complex filters than ORC. More testing would be required to confirm that finding.
FEATURE ENGINEERING
In the feature engineering example above, we applied a basic scaling transformation to all of the numerical variables in the core yellow_tripdata table. This type of transformation typically requires a full traversal of the table, and the results show a less stark difference in performance across types of storage. This suggests that the column selectivity of ORC and Parquet drives some benefit, but that computation, rather than retrieval, is the bigger bottleneck. The exception is JSON, which may still be suffering from its less efficient storage and the need to parse each line as it is processed.
KEY TAKEAWAYS
There are a few observations that we can make based on our test runs.
- Performance for Parquet and ORC is roughly equivalent, or at least in the same neighborhood. There are subtle differences depending on the size of data being aggregated, the level of aggregation, the selectivity of filters, and the total number of columns being pulled. Spark tends to favor Parquet and Hive tends to favor ORC, so it may be advantageous to stick with what is most widely used in your platform.
- As expected, JSON is slow relative to the other approaches, which reinforces the fact that text formats are useful for information exchange but should be converted to a compressed format as quickly as possible for analysis.
- Avro sits in the middle in terms of performance and should be considered based on the specific use case. For example, if a lot of schema variability across rows is expected and can’t be foreseen ahead of time, Avro provides greater flexibility than Parquet or ORC, which depend on a largely consistent schema to drive their performance. When all columns are being selected and there is no aggregation, performance can be comparable to column-oriented approaches. Surprisingly, in our testing, Avro performed worse than even a JSON text file in one specific case, showing once again that performance can vary significantly depending on the specific workload.
- The tables used in this benchmark are relatively narrow. For wider tables that may have hundreds of columns, the difference between columnar approaches and row-oriented approaches will be more pronounced. This is intuitive given the underlying technology and is consistent with what I have observed in practice.
- In-memory approaches can be very useful if your data is actually small enough to fit in memory. In-memory data performed best in all cases once the cache was warm. The difference at this scale, however, was not substantial enough to warrant a blanket recommendation to cache as much as possible.
- We were using S3 for storage in this case (a “default” option for many implementations on AWS). The performance characteristics will be different in a true HDFS cluster where there is less storage I/O overhead.
Note: this is a very informal test and really just shows how much of a difference some simple choices about format can make. This should not be seen as a comprehensive comparison or used to drive decision-making. Your best bet is to experiment with your own data in your own environment to get to an optimal approach. However, this can be used as a backgrounder to get you up to speed on some of the differences.
Conclusions
A modern analytics architecture will require some or all of these different storage types based either on specific requirements or on the existing technology stack within that particular organization and how source systems are set up.
Anticipate using a diverse range of storage types. At a minimum, expect to include Parquet or ORC in your environment. You will end up connecting to relational database sources in many cases, and you may end up using formats like Avro to retrieve and store streamed data.
Most importantly, the type of storage that is best for each use case will likely be specific to your environment. Experimentation will allow you to learn the best option in your particular environment.