Streaming Data Pipelines: Best Practices for Devs
Building robust streaming data pipelines can be tricky. I'll share lessons learned from building real-time data systems with Kafka, Node.js, and cloud services like GCP Pub/Sub.
As a full-stack developer, I've spent a good chunk of my time wrestling with streaming data. It's not always glamorous, but it's increasingly crucial for building responsive and intelligent applications. Whether it's real-time analytics, event-driven architectures, or just moving data around efficiently, understanding streaming is a must. This post isn't about Netflix (sorry!), but about the best ways to handle streaming data pipelines from a dev perspective.
Understanding the Streaming Landscape
Before diving into code, let's level-set. "Streaming" can mean different things. For our purposes, we're talking about continuously moving data, not just downloading a file. We're thinking Kafka, GCP Pub/Sub, AWS Kinesis, and the infrastructure around them.
Key Streaming Concepts
- Producers: Applications that generate data (e.g., user actions, sensor readings).
- Brokers: The servers that store and route messages (e.g., Kafka brokers; in Pub/Sub, the managed service plays this role).
- Consumers: Applications that process data from the queue.
- Partitions: A way to split a topic for parallelism.
- Offsets: A pointer to a specific message within a partition. Crucial for fault tolerance!
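To make these concepts concrete, here is a toy in-memory model of a topic: an append-only log per partition, with offsets as positions in each log. This is purely illustrative (real brokers add replication, retention, consumer groups, and persistence), and all names are my own:

```typescript
// Toy model of a topic: one append-only log per partition.
type Message = { value: string };

class Partition {
  private log: Message[] = [];
  append(msg: Message): number {
    this.log.push(msg);
    return this.log.length - 1; // offset of the appended message
  }
  read(offset: number): Message | undefined {
    return this.log[offset];
  }
}

class Topic {
  private partitions: Partition[];
  private next = 0; // round-robin cursor (real producers usually hash a key)
  constructor(numPartitions: number) {
    this.partitions = Array.from({ length: numPartitions }, () => new Partition());
  }
  // Producers append and get back a (partition, offset) pair.
  produce(msg: Message): { partition: number; offset: number } {
    const p = this.next++ % this.partitions.length;
    return { partition: p, offset: this.partitions[p].append(msg) };
  }
  // Consumers read by partition and offset -- which is why a saved offset
  // is all a consumer needs to resume after a crash.
  consume(partition: number, offset: number): Message | undefined {
    return this.partitions[partition].read(offset);
  }
}
```

Note that offsets only ever grow within a partition; that monotonicity is what makes checkpointing (and therefore fault tolerance) possible.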
Choosing the Right Tool
There are many options. Kafka is a popular, battle-tested choice, but it's more complex to manage. Cloud-native solutions like GCP Pub/Sub or AWS Kinesis are easier to get started with, but can be more expensive at scale. I've used both extensively. For smaller projects, I often lean towards Pub/Sub for its simplicity. For high-throughput, low-latency scenarios, Kafka is hard to beat. We used Kafka in a recent project for processing millions of events per second from IoT devices.
Building a Node.js Stream Consumer
Let's build a simple consumer using the kafkajs library. This example assumes you have a Kafka cluster running.
const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092'] // Replace with your Kafka brokers
})

const consumer = kafka.consumer({ groupId: 'my-group' })

const run = async () => {
  await consumer.connect()
  await consumer.subscribe({ topic: 'my-topic', fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log({
        partition,
        offset: message.offset,
        value: message.value.toString(),
      })
    },
  })
}

run().catch(console.error)
This code connects to Kafka, subscribes to a topic, and logs each message to the console. Remember to install kafkajs: npm install kafkajs.
Handling Errors and Retries
This is where things get real. Streaming data is inherently unreliable. Networks fail, servers crash, and messages get lost. Don't just blindly consume and process. Implement proper error handling and retry mechanisms.
// Inside the eachMessage function...
eachMessage: async ({ topic, partition, message }) => {
  try {
    // Process the message
    await processMessage(message.value.toString());
  } catch (error) {
    console.error('Error processing message:', error);
    // Implement retry logic here. Consider exponential backoff.
    // For example, push the message to a dead-letter queue.
  }
},
A dead-letter queue (DLQ) is a separate queue where you send messages that failed to be processed after several retries. This prevents your consumer from getting stuck on a bad message.
Using TypeScript for Data Integrity
When dealing with complex data structures in streams, TypeScript is your friend. It helps enforce data contracts and prevent runtime errors.
Defining Data Schemas
Let's say you're streaming user activity events. Define a TypeScript interface:
interface UserActivityEvent {
  userId: string;
  eventType: 'click' | 'view' | 'purchase';
  timestamp: number;
  productId?: string; // Optional for some event types
}
Validating Messages
Now you can validate incoming messages against this schema:
function processMessage(message: string) {
  try {
    const event: UserActivityEvent = JSON.parse(message);
    // Validate the event
    if (!event.userId || !event.eventType || !event.timestamp) {
      throw new Error('Invalid event format');
    }
    // Process the event...
    console.log('Processed event:', event);
  } catch (error) {
    console.error('Error processing message:', error);
    // Handle the error (e.g., send to DLQ)
  }
}
This simple validation step can save you a lot of headaches down the road. We had a situation last year where inconsistent data formats in our Kafka stream were causing intermittent failures in our analytics pipeline. Adding TypeScript validation upfront would have prevented that.
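One way to tighten this further is a user-defined type guard, so the compiler only treats the parsed value as a UserActivityEvent after the runtime checks pass, instead of trusting a cast on JSON.parse. A sketch (the interface is repeated here so the snippet stands alone):

```typescript
interface UserActivityEvent {
  userId: string;
  eventType: 'click' | 'view' | 'purchase';
  timestamp: number;
  productId?: string;
}

// Runtime check that doubles as compile-time narrowing: inside an
// `if (isUserActivityEvent(x))` branch, x has type UserActivityEvent.
function isUserActivityEvent(value: unknown): value is UserActivityEvent {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.userId === 'string' &&
    (v.eventType === 'click' || v.eventType === 'view' || v.eventType === 'purchase') &&
    typeof v.timestamp === 'number' &&
    (v.productId === undefined || typeof v.productId === 'string')
  );
}
```

In the consumer you would then parse into unknown, check with the guard, and route failures to the DLQ; for larger schemas, a validation library can generate these checks for you.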
Scaling Your Streaming Application
Streaming applications often need to handle massive amounts of data. Here are a few strategies for scaling:
Horizontal Scaling
The most common approach is to run multiple instances of your consumer. Make sure your consumer group is configured correctly so that Kafka distributes partitions evenly across instances.
Partitioning Strategy
The number of partitions in your Kafka topic directly impacts the parallelism of your consumer. More partitions mean more consumers can process data concurrently. However, there's a trade-off: too many partitions can increase overhead.
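This trade-off interacts with keyed ordering: the common assignment is hash(key) mod partitionCount, so changing the partition count remaps keys. A sketch with a toy hash (Kafka's default partitioner actually uses murmur2; the helper names here are mine):

```typescript
// Toy key-based partitioner: same key -> same partition, for a fixed count.
function hashKey(key: string): number {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // not murmur2!
  return h;
}

function partitionFor(key: string, numPartitions: number): number {
  return hashKey(key) % numPartitions;
}

// Growing a topic (say 6 -> 8 partitions) remaps many keys, which breaks
// per-key ordering for in-flight data -- another reason to size early.
function remappedKeys(keys: string[], before: number, after: number): string[] {
  return keys.filter((k) => partitionFor(k, before) !== partitionFor(k, after));
}
```

The practical takeaway: pick a partition count with headroom up front, because increasing it later is not a transparent operation for keyed topics.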
Autoscaling with Kubernetes
If you're running your application in Kubernetes, you can use Horizontal Pod Autoscaler (HPA) to automatically scale the number of pods based on CPU or memory utilization. This is a powerful way to handle fluctuating workloads.
Monitoring and Observability
You can't manage what you can't measure. Implement robust monitoring and observability for your streaming pipelines.
Key Metrics to Track
- Consumer Lag: The difference between the latest offset in the topic and the offset of the last message consumed. A high consumer lag indicates that your consumer is falling behind.
- Message Throughput: The rate at which messages are being produced and consumed.
- Error Rate: The number of errors encountered during message processing.
- Resource Utilization: CPU, memory, and network usage of your consumer instances.
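Consumer lag from the list above is simple arithmetic over offsets. A sketch (offsets as plain numbers for clarity; in kafkajs they arrive as strings, and the broker-side log end is often called the high watermark):

```typescript
// Lag per partition: how far the committed offset trails the log end.
// logEndOffset  = offset the next produced message will receive;
// committedOffset = offset of the next message the group will read.
function consumerLag(logEndOffset: number, committedOffset: number): number {
  return Math.max(0, logEndOffset - committedOffset);
}

// Total lag across partitions -- the number a dashboard alert usually watches.
function totalLag(partitions: { logEnd: number; committed: number }[]): number {
  return partitions.reduce((sum, p) => sum + consumerLag(p.logEnd, p.committed), 0);
}
```

Alert on sustained growth in total lag, not its absolute value: a steady plateau usually means a temporary burst, while a rising trend means your consumers can no longer keep up.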
Tools for Monitoring
Prometheus and Grafana are a popular combination for monitoring Kubernetes applications. You can also use cloud-native monitoring services like GCP Cloud Monitoring or AWS CloudWatch.
Serverless Streaming with Cloud Functions
For simpler use cases, consider using serverless functions like GCP Cloud Functions or AWS Lambda to process stream data. These functions can be triggered by messages in Pub/Sub or Kinesis.
Example Cloud Function Trigger
Here's a basic example of a GCP Cloud Function triggered by a Pub/Sub message:
exports.helloPubSub = (event, context) => {
  const message = event.data
    ? Buffer.from(event.data, 'base64').toString()
    : 'Hello, World!';
  console.log(`Received message: ${message}`);
};
Serverless functions are great for event-driven architectures and can be very cost-effective for intermittent workloads. The gotcha here is cold starts. If your function isn't invoked frequently, it might take a few seconds to spin up, which can introduce latency. We use Cloud Run in many projects as it offers a container-based environment with better cold start performance than Cloud Functions.
Conclusion
Building robust streaming data pipelines is challenging, but incredibly rewarding. Remember to choose the right tool for the job, handle errors gracefully, use TypeScript for data integrity, scale your application effectively, and monitor everything. Focus on building resilient consumers. In my experience, a well-designed consumer that can handle failures is far more valuable than optimizing the producer side. Streaming data is a core component of many modern applications, and mastering these concepts will make you a more valuable full-stack developer.
