Spring Cloud Data Flow 101
Why Spring Cloud Data Flow?
In a previous life, I learned an ETL tool called Talend. Talend is an open source tool based on Eclipse.
Similar to other tools in its class, e.g. Informatica, it was very good at working with batch-oriented data.
Although batch processing will never go away, these days there are many valid use cases for relentless streams of data.
- Geospatial locations of vehicles or clients
- Authorizations and declines of transactions
- Sensor data (moisture levels detected by windshield wipers for example)
- Biometric data (heart rate, sleep levels, steps)
The data in such streams can often be more interesting as it's happening. In some cases the data might become less useful the longer it exists after the event.
With this in mind, it's not enough to get the data into a data lake and analyze it at a later point (although this is still useful work). We need the option to filter, route, transform, process and even analyze records in real time as they come in. Being able to do so at scale, on-demand, may even provide a competitive advantage in the marketplace.
The Spring ecosystem provides an amazing set of APIs to handle these problems. Spring Integration, Spring Batch and Spring Data are some of the more important ones.
Each of these projects is awesome on its own, but together they provide the basis for handling streaming data.
But, there are a lot of moving pieces and a learning curve to contend with before taking advantage of these rich tools. Not to mention deploying and operating the final solution.
Spring Cloud Data Flow (SCDF from this point on) solves these challenges for us.
The focus of this article will be how SCDF can work with Pivotal Cloud Foundry (PCF) to provide orchestration of scalable data pipelines.
With such pipelines, insight can be captured around application performance, customer behavior or events that can give your business a competitive advantage.
What are Streams?
A key concept with SCDF is a stream.
An example stream might be: data continually being produced at external sources that we want to capture into multiple data stores (a database and Hadoop).
In the same example, imagine that there might be more data in the stream than we need. Since putting everything in our data stores would not be practical in this case, we only want to capture some of the messages.
Finally, although the stream is constantly running, there are periods of relatively higher message rates throughout a 24-hour period.
From a high level we have: Source of Data -> Processing/Filtering (Optional) -> Sink Destination(s)
Source is where the data is coming from. Sink(s) are where we want it to end up.
Getting more specific about the data flow within the stream:
Data continually available to an API endpoint -> Transform the message to structured JSON -> Filter on 'country=CA' -> Write 'country=CA' to MySQL -> Write all other records to HDFS
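Expressed in SCDF's stream DSL, this pipeline might be sketched as follows. This assumes the out-of-the-box `http`, `filter`, `jdbc` and `hdfs` starter apps; the property names, table name and SpEL expressions are illustrative and depend on your payload (the transform step is omitted for brevity):

```shell
# Main stream: HTTP ingest -> keep only country=CA -> write to MySQL
stream create ca-records --definition "http --port=9000 | filter --expression=#jsonPath(payload,'$.country')=='CA' | jdbc --tableName=ca_records"

# Tap the http source's output and route everything else to HDFS
stream create other-records --definition ":ca-records.http > filter --expression=#jsonPath(payload,'$.country')!='CA' | hdfs"
```

The `:ca-records.http >` syntax taps the output channel of the http source, so both branches see every incoming message without the source being deployed twice.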
This could be done with a tool like Talend, or any other ETL tool. However, the difference when referring to streams is that the incoming data is unbounded. As previously stated, there may be traffic bursts (or even declines), but the messages will never stop coming.
Microservice architectures (a distribution of individual services, each with a very specific purpose) were designed to deal with such scale. Add to this a reliable and robust messaging service, and we have a winning architecture.
Let's discuss how SCDF can provide us such an architecture on-demand.
A key part of SCDF is the Server. This is a lightweight Spring Boot Application. It provides a declarative DSL and a Drag-and-drop visual Dashboard to build data pipelines made of Spring Boot Apps. SCDF also provides the orchestration ability to deploy the desired data pipeline topology onto modern platforms such as Pivotal Cloud Foundry and Kubernetes.
The stream described in the previous section would now look something like this:
These components communicate with each other using a pub/sub message broker, in this example RabbitMQ.
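Before the Server can compose such a stream, the individual Spring Boot apps must be registered with it. A minimal sketch from the SCDF Shell, assuming the RabbitMQ-binder builds of the out-of-the-box starters (the bulk-import URI varies by release line, so check the one current for your version):

```shell
# Bulk-register the out-of-the-box stream app starters built against the RabbitMQ binder
app import --uri https://dataflow.spring.io/rabbitmq-maven-latest

# Confirm the sources, processors and sinks are now available
app list
```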
Let's talk about how PCF can be leveraged to power all this work.
Why Pivotal Cloud Foundry?
Pivotal Cloud Foundry (PCF from this point) is one of the platforms that SCDF can leverage.
PCF is my personal favorite as it provides an opinionated approach to both deploying and operating software. This can be very useful in enterprise environments where politics and communication issues challenge productivity for even the simplest tasks.
Pivotal Web Services (PWS from this point) is a managed Pivotal Cloud Foundry running on AWS. PWS == PCF on AWS.
PCF requires a fair amount of infrastructure to run in a highly available fashion. PWS is a way to try out PCF without having to worry about managing and paying for that infrastructure.
With PWS, you just pay for the compute time you use.
PWS also offers a marketplace that can be used to provision services, such as RabbitMQ.
This means the platform can stand up RabbitMQ on-demand and make the credentials to use it available in the environment your application runs in.
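Provisioning and binding a RabbitMQ instance from the cf CLI might look like this. The service and plan names (`cloudamqp`/`lemur`) are what the PWS marketplace offered at the time of writing and may differ, and `my-app` is a placeholder:

```shell
# Browse what the marketplace offers
cf marketplace

# Provision a RabbitMQ instance on-demand
cf create-service cloudamqp lemur rabbit

# Bind it to an app; the credentials appear in that app's VCAP_SERVICES environment variable
cf bind-service my-app rabbit
cf restage my-app
```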
The SCDF Server (a Boot app) can be deployed into PCF. It runs in a container managed by PCF, but it also has access to PCF's APIs to orchestrate Spring Cloud Stream applications as native applications in PCF.
For the Source -> Filter -> Sink stream, a high-level view of it running on PCF would look something like this.
PCF also provides us with:
- Management and self healing for all components making up a stream
- Ability to auto-scale the Applications making up the stream
- APIs and interfaces for viewing/managing logs and metrics for all components
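These capabilities map onto ordinary cf CLI commands. For instance (the app name `ca-records-filter` is a placeholder for whatever name PCF gives a deployed stream app):

```shell
# Tail recent logs for one component of the stream
cf logs ca-records-filter --recent

# Scale that component to three instances to absorb a burst in message rates
cf scale ca-records-filter -i 3
```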
As you can see, we now have all the pieces we need to get serious about working with streams of data.
The first step is to deploy the SCDF Server into PCF and configure it appropriately.
The steps are simple:
- Configure Correct Services from PWS Marketplace
- Deploy the Jar file for the SCDF Server to PWS with the appropriate configurations
- Set up the SCDF Shell locally. This can be used to communicate with the SCDF Server running in PWS
- Submit streams to the SCDF Server via the Shell or the Server UI
- Monitor the streams in PWS's app manager (or using the CF CLI)
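The steps above can be sketched roughly as follows. The service names/plans, jar file names, API endpoint and org/space values are PWS-era placeholders and will need adjusting for your environment:

```shell
# 1. Provision the backing services the SCDF Server expects
cf create-service cloudamqp lemur rabbit      # message broker for the streams
cf create-service cleardb spark my_mysql      # relational store for SCDF metadata
cf create-service rediscloud 30mb redis       # used for analytics counters

# 2. Push the Server jar and tell it which PCF API to orchestrate against
cf push dataflow-server -p spring-cloud-dataflow-server-cloudfoundry.jar --no-start
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_URL https://api.run.pivotal.io
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_ORG my-org
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SPACE my-space
cf start dataflow-server

# 3. Run the SCDF Shell locally and point it at the Server
java -jar spring-cloud-dataflow-shell.jar --dataflow.uri=https://dataflow-server.cfapps.io
```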
The following Github repo outlines these steps and even provides a shell script to automate them: https://github.com/lshannon/spring-cloud-data-flow-setup
For a demonstration of setting up the SCDF Server, as well as creating some streams, watch the following video (your browser needs to support playing mp4 files).
If your organization hosts its own PCF, there is a more enterprise-grade tile for setting this up: https://network.pivotal.io/products/p-dataflow/
A final note: should you get the SCDF Server running on PWS, DO NOT LEAVE STREAMS RUNNING, as this could result in a costly bill. Make sure to stop all streams and shut down the SCDF Server when it's not in use.
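A teardown sketch (the stream and server app names are placeholders for whatever you created):

```shell
# From the SCDF Shell: undeploy each running stream
stream undeploy --name my-stream

# From the cf CLI: stop the Server itself so it stops consuming paid compute time
cf stop dataflow-server
```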
This article is the first in a series. Now that we have a SCDF Server configured and ready, the following articles will explore different data streaming solutions that can be created on-demand. More to come.