In this installment of Silectis Technical Tutorials, we provide a step-by-step introduction to building a machine learning model in R using Apache Spark. This post is intended for R users who understand the basics of machine learning and have an interest in learning about Spark’s machine learning capabilities. OVERVIEW – CLASSIFYING DOCUMENTS …
Tutorials
Apache Spark: A Beginner’s Guide to Optimizing Spark Scripts
This post covers Apache Spark basics, explains why optimizing Spark scripts matters, and shows how to optimize them for both memory and runtime efficiency. It is best suited for data analysts and data scientists who want to optimize existing Spark workflows or create new ones. INTRODUCTION: WHAT IS APACHE …
Tutorial: Creating a Multi-Cloud VPN with Terraform between AWS, GCP, and Azure
At Silectis, we deploy Magpie clusters across AWS, Google Cloud, and Azure. But because some of our internal infrastructure resides only on AWS, we need to establish private connections between these environments so that clusters on Google Cloud and Azure can access those private AWS resources. In this post, we’ll walk through the details of how we …
Improve Your SQL Skills: Master the Gaps and Islands Problem
The “gaps and islands” problem is a scenario in which you need to identify groups of continuous data (“islands”) and groups where the data is missing (“gaps”) across a particular sequence. The problem may sound arcane at first glance, but knowing how to solve it is critical for advanced SQL practitioners. These starting and …
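To make the definition concrete, here is a minimal sketch of the idea in Python rather than SQL. It is not the approach taken in the post itself. The function names `find_islands` and `find_gaps` and the sample sequence are hypothetical, chosen only for illustration, and the sketch assumes the sequence consists of integers where "continuous" means each value is one greater than the previous.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end)


def find_islands(values: Iterable[int]) -> List[Range]:
    """Return inclusive ranges of consecutive integers found in values."""
    islands: List[Range] = []
    for v in sorted(set(values)):
        if islands and v == islands[-1][1] + 1:
            # v extends the current island by one step
            islands[-1] = (islands[-1][0], v)
        else:
            # v starts a new island
            islands.append((v, v))
    return islands


def find_gaps(islands: List[Range]) -> List[Range]:
    """Return inclusive ranges missing between neighboring islands."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(islands, islands[1:])
    ]


# Hypothetical sample sequence with two breaks in it
observed = [1, 2, 3, 7, 8, 10]
islands = find_islands(observed)
gaps = find_gaps(islands)
print(islands)  # [(1, 3), (7, 8), (10, 10)]
print(gaps)     # [(4, 6), (9, 9)]
```

Each island is reported by its first and last value, and each gap is the range between the end of one island and the start of the next.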
How to Ensure Security In Your Data Lake
THE IMPORTANCE OF SECURITY This post is the third in a series about getting up and running with a Magpie Data Lake. In previous posts, we discussed rapidly prototyping a data lake with Magpie and automating loads into a data lake with Magpie. This post will address a third important piece of data lake infrastructure: security …
Data Profiling: A Step-by-Step Introduction
EXAMINING NYC TRANSPORTATION DATA THROUGH MAGPIE’S ONE-CLICK RAPID DATA PROFILING In our first data-centric blog post, we provide a step-by-step introduction to the immediate value generated by Magpie’s ability to show users what is in a dataset before analysis begins. Below, we use publicly available data from New York City’s Open Data …