Real-Time Data, CDC, and Apache Spark Essentials: A Deep Dive
Today, with real-time data at your fingertips, waiting hours for insights means losing business opportunities. Businesses need real-time data analysis to make decisions on the fly. But how do you go from batch processing to real-time analytics?
In my podcast episode on AntStack TV (watch here), I explored the essentials of real-time data processing, including Change Data Capture (CDC) and Apache Spark. This blog dives into those details and more.
What is Real-Time Data Analysis and Why is it Important?
Real-time data analysis involves gathering insights from data as it is collected. To break it down:
- Data Analysis: Extracting insights from gathered data to power business decisions.
- Real-Time: Adding a time-sensitive dimension to the analysis, enabling businesses to derive actionable insights immediately as data is generated.
How Can Businesses Start with Real-Time Data?
Businesses traditionally relied on batch processing, where data was collected and processed later, and insights were derived after a delay.
However, today, the demand for immediate insights has become critical. Real-time data analysis empowers businesses to make timely decisions, seize opportunities, and respond to challenges as they arise. Here’s why businesses are adopting real-time data analysis:
- Timely Decision-Making: Instead of waiting days for insights, businesses can act on data immediately.
- Enhanced Customer Experience: Real-time insights allow businesses to offer personalized discounts or stock up on items based on user preferences.
For example, an e-commerce business involves multiple interconnected components: customers browsing products and making purchases, warehouse inventory levels, payment gateways, and shipment management. Each of these components generates a stream of data every second, and real-time processing lets the company keep them synchronized.
By monitoring inventory levels in real time, businesses can spot items running low and replenish them before they sell out. Similarly, by analyzing user behavior patterns and preferences, they can offer targeted discounts and promotions that drive conversions.
Understanding Change Data Capture (CDC)
CDC, or Change Data Capture, is an essential technique for real-time data analysis. It involves tracking changes (inserts, updates, deletions) in a database table and converting these changes into streams of data.
Here’s how CDC works:
To explain how CDC works, imagine you have a database with a table that initially contains no data; that is, its state is at zero.
- Enabling CDC By default, databases do not have CDC enabled, so you need to explicitly enable it for the database. Once you do, the change data is gathered into a separate table.
- Capture Events When a new record is added or updated in a table, its state changes from 0 to 1. This change is considered an event, and CDC captures such events: inserts, updates, and deletions. Note: Different databases have specific mechanisms to enable CDC. Here’s a link for your reference. For example, consider a User table with First Name, Last Name, Email ID, and Phone Number columns. Initially, the table is empty. When you insert a record, CDC captures this event. Each captured change has two parts: Before State: The initial values - null in this case. After State: The values after inserting the record. (A sketch of such an event appears after this list.)
- Stream the Changes Once CDC is enabled, tools like Debezium (a CDC tool) and Apache Kafka (a distributed event streaming platform) can stream these changes to downstream systems for processing. Think of CDC as building a dam on a river: the dam (CDC) captures the flow of water (data), and the floodgates (Debezium) manage its controlled release into downstream channels (Kafka).
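To make the before and after states concrete, here is a simplified sketch of what the change event for that User insert might look like. The field names loosely follow Debezium’s typical payload layout, but the exact structure depends on the connector and database, so treat the values below as illustrative only.

# Simplified sketch of a CDC change event for inserting a User record.
# Field names loosely follow Debezium's payload layout; real events also
# carry additional metadata such as source, transaction, and timestamp info.
change_event = {
    "op": "c",          # "c" = create/insert, "u" = update, "d" = delete
    "before": None,     # no prior state for an insert
    "after": {
        "first_name": "Jane",
        "last_name": "Doe",
        "email_id": "jane.doe@example.com",
        "phone_number": "+1-555-0100",
    },
}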
Managing Data Streams
Apache Kafka is a distributed messaging system and a de facto choice for managing data streams. Here’s how it fits into the CDC pipeline:
Integration with Debezium Debezium is built on top of the Kafka Connect APIs, ensuring seamless integration. Debezium pulls CDC records as JSON events, each carrying metadata along with the before and after state of the database change, and pushes these records into Kafka.
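To see what sits on the other side of the topic, here is a minimal sketch of consuming those change events with the kafka-python client. The topic name and broker address are hypothetical placeholders and depend on how your Debezium connector is configured.

# Minimal sketch of inspecting CDC events from a Kafka topic.
# Assumes the kafka-python client (pip install kafka-python); the topic
# name and broker address below are hypothetical placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.public.users",            # hypothetical CDC topic from Debezium
    bootstrap_servers="localhost:9092",  # replace with your Kafka broker address
    value_deserializer=lambda v: json.loads(v) if v else None,
)

# Each message carries one change event with its before/after states
for message in consumer:
    print(message.value)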
Kafka’s capabilities By default, Kafka retains records for up to 7 days even if no consumer acts on them. It also serves as a database to some extent, acting as a streaming source for downstream processes.
Apache Spark’s Role in Real-Time Processing
Apache Spark plays a key role in real-time data processing by offering both batch and streaming capabilities in a single unified framework. Unlike Apache Flink, which is primarily focused on real-time streaming data, Apache Spark allows you to handle both types of data processing. This flexibility means you don’t need to switch between different tools or domain-specific languages for batch and streaming workloads.
Apache Spark offers wide language support, including SQL, which is familiar to many data engineers and analysts. This makes it easier for anyone with SQL knowledge to interact with Spark and build batch and streaming applications.
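For instance, anyone comfortable with SQL can register a DataFrame as a temporary view and query it with plain SQL. Here is a minimal sketch; the file path and column names are placeholders.

# Minimal sketch of querying a Spark DataFrame with plain SQL.
# The JSON path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlExample").getOrCreate()

orders = spark.read.json("/path/to/orders.json")   # batch source
orders.createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")
daily_revenue.show()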
A popular alternative to Spark is Apache Flink. However, Spark is often preferred over alternatives because it eliminates the need to switch between different processing engines. Additionally, since its release, Spark has proven to be a dependable choice thanks to its robustness and its ability to handle complex data applications.
Setting Up Spark
You can set up Apache Spark in several ways depending on the scale of your data processing needs and the environment you’re working in. Here are some of those ways.
Local Setup You can set up Spark on a single host machine or laptop for small data processing tasks. However, it can be tricky for large-scale data operations.
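As a sketch, getting a local environment running can be as simple as installing PySpark and creating a session that uses all the cores on your machine (the application name below is just a placeholder).

# Minimal local setup sketch: run Spark on a single machine.
# Install first with: pip install pyspark
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")              # use all local cores
    .appName("LocalExperiment")      # placeholder application name
    .getOrCreate()
)
print(spark.version)                 # confirm the session is up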
Cluster Setup To take full advantage of Spark’s capabilities, you can set up a cluster with multiple machines. This requires more manual work, like configuring hardware resources, networks, and software. Setting up a cluster can be challenging in terms of managing distributed resources, configuring Spark, and ensuring proper communication between nodes.
Cloud Solutions You can also use cloud providers offering managed Spark services for easier setup and scalability. For instance, AWS Elastic MapReduce (AWS EMR) provides a managed Spark environment; you can simply launch a Spark cluster on EMR and start processing data. Cloud providers handle scaling, provisioning, and managing the Spark environment, so you can focus on data processing instead of infrastructure.
Databricks Databricks provides an out-of-the-box managed Spark environment, making it a great option for big data processing. You can use Databricks with your cloud provider credentials to quickly set up a Spark environment.
Stream Processing in Spark
Spark offers two primary methods for handling data:
- read(): For batch data processing.
- readStream(): For streaming data processing.
Key Differences between read() and readStream() methods
| Feature | read() (Batch) | readStream() (Streaming) |
| --- | --- | --- |
| Data Source | Static files, databases | Kafka, sockets, message queues |
| Processing Mode | One-time batch | Continuous streaming |
| Execution | Executes immediately | Requires .writeStream.start() |
| Latency | Higher | Low (real-time) |
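For contrast with the streaming example later in this section, here is a minimal batch-read sketch; the CSV path is a hypothetical placeholder.

# Minimal batch-read sketch; the CSV path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchExample").getOrCreate()

# read executes immediately and returns a static DataFrame
df_batch = spark.read.option("header", "true").csv("/path/to/sales.csv")
df_batch.show(5)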
To perform streaming data processing:
When you use the readStream() method, you need to specify the streaming data source. A common source is Apache Kafka, which acts as a message broker providing a continuous data stream.
To connect Spark to Kafka, you need to provide:
- The address of the Kafka cluster
- Metadata information
- Credentials to authenticate with the Kafka cluster
- The Kafka topic you want to read from
Once Spark is connected to Kafka using readStream, it starts reading the streaming data as it arrives in the Kafka topic. Spark automatically creates a streaming DataFrame to represent the real-time data in a tabular format.
The streaming DataFrame can be processed using Spark’s APIs in the same way as static DataFrames. This means you don’t need to use any domain-specific language.
The following code example showcases the readStream() method:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StreamProcessing").getOrCreate()
# Read streaming data from Kafka
df_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<kafka server address>") \
    .option("subscribe", "<kafka topic name>") \
    .load()

# Convert Kafka message values to string
df_stream = df_stream.selectExpr("CAST(value AS STRING)")

# Write the streaming data to the console in append mode
query = df_stream.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()
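Note that the Kafka source is not bundled with a plain PySpark install; to run a job like this you typically need to make the Spark Kafka connector (the spark-sql-kafka package matching your Spark version) available, for example via the --packages option of spark-submit.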
Conclusion
Real-time data analysis, powered by CDC and Apache Spark, is transforming how businesses operate. By enabling immediate insights and decision-making, businesses can stay ahead in competitive markets. As technologies like Spark and Kafka continue to evolve, the possibilities for leveraging real-time data will only expand.
If you want to read more such tech-driven content, visit our blog, or learn more from experts on AntStack TV.