Streaming ETL: Kafka to Kafka
You can quickly build streaming ETL pipelines with Timeplus. For example, suppose the original web access logs in a Kafka topic contain raw IP addresses. To protect user privacy, you can build a data pipeline that reads new data from Kafka, masks the IP addresses, and sends the transformed events to a different Kafka topic.
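To see what the masking produces before wiring up the full pipeline, you can run the hash expression on a sample value (a minimal sketch; the IP address 192.0.2.1 is made up for illustration):
-- hash a sample IP address the same way the pipeline does
SELECT lower(hex(md5('192.0.2.1'))) AS masked_ip;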
Follow the guide for Timeplus Proton or Timeplus Cloud.
Timeplus Proton
You can follow the previous tutorial to set up the sample data, then run the following SQL to build the pipeline.
-- read the topic via an external stream
CREATE EXTERNAL STREAM frontend_events(raw string)
SETTINGS type='kafka',
         brokers='redpanda:9092',
         topic='owlshop-frontend-events';

-- create the other external stream to write data to the other topic
CREATE EXTERNAL STREAM target(
    _tp_time datetime64(3),
    url string,
    method string,
    ip string)
SETTINGS type='kafka',
         brokers='redpanda:9092',
         topic='masked-fe-event',
         data_format='JSONEachRow',
         one_message_per_row=true;
-- set up the ETL pipeline via a materialized view
CREATE MATERIALIZED VIEW mv INTO target AS
    SELECT now64() AS _tp_time,
           raw:requestedUrl AS url,
           raw:method AS method,
           lower(hex(md5(raw:ipAddress))) AS ip
    FROM frontend_events;
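Once the materialized view is running, you can verify the pipeline end to end by reading the masked events back from the target topic (a quick check, assuming the sample data is still flowing):
-- read the masked events back from the target topic
SELECT _tp_time, url, method, ip FROM target;

-- stop the pipeline when you are done
DROP VIEW mv;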
Timeplus Cloud
A blog post walks through the detailed steps to read data from Kafka/Redpanda, apply the transformation, and send the results to Kafka/Redpanda.
A few key steps:
- Connect to Redpanda:
  - Specify the Redpanda broker address and the authentication method.
  - Choose a topic and preview data.
  - Set the name for the external stream, say frontend_events.
- Explore the live data in the stream/topic.
- Write a streaming SQL query to transform the data:
SELECT response:statusCode AS code, hex(md5(ipAddress)) AS hashed_ip, method, requestedUrl
FROM frontend_events
WHERE response:statusCode != '200'
- Send the result to another topic. Timeplus will create a new external stream as the destination and a materialized view as the pipeline; a rough SQL equivalent is sketched after this list. The data lineage visualizes the relationships among these resources.
- New data is available in the target Kafka/Redpanda topic.
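Under the hood, the external stream and materialized view that Timeplus Cloud creates are roughly equivalent to the Proton SQL shown earlier. Here is a minimal sketch, assuming a target topic named masked-fe-event; the stream and view names are hypothetical, and the broker address and settings are illustrative:
-- destination external stream (names and settings are illustrative)
CREATE EXTERNAL STREAM masked_events(
    code string,
    hashed_ip string,
    method string,
    requestedUrl string)
SETTINGS type='kafka',
         brokers='redpanda:9092',
         topic='masked-fe-event',
         data_format='JSONEachRow',
         one_message_per_row=true;

-- materialized view as the pipeline, reusing the transformation query
CREATE MATERIALIZED VIEW masked_mv INTO masked_events AS
    SELECT response:statusCode AS code,
           hex(md5(ipAddress)) AS hashed_ip,
           method,
           requestedUrl
    FROM frontend_events
    WHERE response:statusCode != '200';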