PILOTING A DATA LAKE WITH MAGPIE
More and more companies are turning to data lakes as a way to unify and get value out of their growing collections of data. However, it can be challenging to navigate the ever-changing technology landscape around these lakes, set one up, and quickly get value from it.
At Silectis, we recommend that our clients start a data lake pilot with a static data load and a rapid analysis phase, with targeted analytical and technical goals and a set timeframe. This helps companies test the feasibility of a data lake, understand the technology, and demonstrate the value of the lake while investing minimal time and effort. To be as productive as possible during this short pilot phase, technology that speeds up both connecting to data and performing analysis is key.
In this post, we’ll walk through a technical tutorial of how Magpie can fill that need. We’ll show how companies can quickly configure and explore a set of enterprise data sources, enrich that enterprise data with third party sources, and perform initial analysis.
STEP 1: CONFIGURING DATA SOURCES
The first step when getting started with a Magpie data lake is to configure a set of data sources. A data source specifies where data resides and any credentials necessary to connect to that data. For example, if a company is storing user website activity data in an S3 bucket, they could create an S3 data source to access that data in Magpie:
create data source { "name": "click_data", "description": "Bucket containing website activity data", "sourceType": "hdfs", "hdfsType": "S3a", "host": "silectis-click-data", "pathPrefix": "/archive" };
Once that data source is created, a user of the data lake is able to interrogate the bucket to understand its content and manage its directory structure. For example, to list files in the bucket at a particular path prefix:
list files in data source click_data at path "website"
Next, perhaps this company wants to merge that user activity data with sales transactional information from a database. Magpie supports any database with a JDBC connector, as well as custom connectors for specific databases where JDBC is not the most efficient way to access them. For example, for a Redshift warehouse:
create data source { "name": "edw", "sourceType": "redshift", "endpoint": "xyzabc.9876543.us-east-1.redshift.amazonaws.com:5439", "database": "edw", "user": "edw_user", "password": "12345" };
As with the S3 bucket, Magpie can inspect database data sources to reveal the schemas and tables available:
list schemas in data source edw;
list tables in data source edw;
In addition to S3 and database sources, Magpie supports a variety of other data sources, including Kafka, HDFS, Google Cloud Storage, Elasticsearch, and more. These sources make it easy to bring many different types of data together into a unified lake.
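As an illustration, an on-premise HDFS cluster could be registered with the same create data source pattern. The values below are hypothetical, and the exact fields for a non-S3 HDFS source should be confirmed against the Magpie documentation:

create data source {
  "name": "log_archive",
  "description": "On-premise HDFS cluster containing application logs",
  "sourceType": "hdfs",
  "host": "hdfs-namenode.internal",
  "pathPrefix": "/logs"
};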
STEP 2: CREATING TABLES
Once data sources have been configured, Magpie provides helper commands that speed up access to the data in those sources from within the data lake. For example, to load an entire schema from a database, Magpie offers the create schema from data source command:
create schema transactions from data source edw using schema finances;
This command will read the definitions of the tables in the underlying database and create a schema in the data lake that contains all of those tables. When data lake users query those tables, they will be accessing the data directly from the underlying source without having to configure connection parameters, drivers, SDKs, usernames, or passwords. A similar command exists to create a schema from a filesystem path.
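Once the transactions schema exists, its tables can be queried like any other table in the lake. The query below is only a sketch: the table and column names are hypothetical, and it assumes the lake exposes a SQL interface for ad hoc queries.

select customer_id, sum(order_total) as total_spend
from transactions.orders
group by customer_id;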
Magpie also supports creating tables one at a time. For example, to create a table from a filesystem file or folder of JSON files:
create table website_clicks from data source click_data file "website/clicks" with format json;
This create table from file command also supports a number of options in addition to file format, such as date and timestamp formats, and delimiters for text files. Similar commands exist for database tables and queries.
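As a hypothetical variant, a folder of CSV files could be loaded by combining the same command with the header and schema inference options shown later in this post; the exact option names for delimiters and date formats should be taken from the Magpie documentation.

create table website_signups from data source click_data file "website/signups" with format csv with header with infer schema;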
STEP 3: INTEGRATING THIRD PARTY DATA
One of the common uses of a data lake is to enrich existing enterprise data with third party data sources that may provide additional insight or analytical value, such as census data, weather data, or other public data sets. In Magpie, integrating many of these data sets can be accomplished with just a single command.
For example, the company discussed earlier may want to use its data lake to identify target areas of Washington, DC for a marketing campaign, combining website click data and sales data with information about the demographics of DC residents. The DC OpenData portal offers Census demographic information for the city in CSV format. To load this data into Magpie, we can use the save URL as table command:
save url "https://opendata.arcgis.com/datasets/a0823c94029942118c05989ae6949850_31.csv" as table dc_census_data with header with infer schema;
Now, the data is available as another table in the lake to answer the marketing campaign question.
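To illustrate, the census table can now be joined against the click and sales tables created earlier. This is only a sketch: the join keys and column names are hypothetical, and it assumes the lake exposes a SQL interface for queries.

select c.zip_code,
       count(distinct w.session_id) as sessions,
       sum(o.order_total) as revenue
from dc_census_data c
join website_clicks w on w.zip_code = c.zip_code
join transactions.orders o on o.zip_code = c.zip_code
group by c.zip_code;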
NEXT STEPS
With just a few lines of Magpie script, we’ve been able to create a unified data lake, sourcing data from an S3 bucket, a database, and an external website. The next step would be to use the distributed computing power of the lake to explore the data and perform analysis, perhaps applying advanced analytics to drive the marketing campaign.
To learn more about getting a data lake up and running quickly with Magpie, click here.
Jon Lounsbury is the Director of Engineering at Silectis.
You can find him on Github.