
Architecting Zero-Downtime Event Ingestion: A Serverless Approach using AWS and Snowflake

Sritej Panchumarthi · Published: January 15, 2026 · System Architecture · 30 min read

Abstract
Ingesting high-velocity telemetry data into analytical warehouses presents significant challenges in terms of scalability, cost, and reliability. This paper details the architecture of a production-grade event lake capable of handling millions of events per minute. We present a serverless design pattern utilizing AWS Kinesis Firehose, Lambda, and Snowflake Snowpipe, achieving sub-minute latency and 99.99% availability while minimizing operational overhead.
Key takeaway: The durable pattern is not "API Gateway to warehouse." It is an edge-protected ingestion path, a validation tier, a buffered delivery stream, an immutable S3 lake, and an asynchronous Snowflake load path. That separation keeps customer telemetry flowing even when analytics systems slow down.

1. Introduction

Modern observability and analytics require near real-time data ingestion. Traditional batch processing (ETL) introduces unacceptable latency, while managing dedicated streaming clusters (e.g., Kafka) incurs high operational costs.

This paper proposes a serverless alternative that leverages the elasticity of AWS Lambda and the columnar storage efficiency of Parquet to deliver a robust ingestion pipeline.

2. System Architecture

The system is designed around three core principles: Resilience (buffering data during downstream outages), Cost-Efficiency (utilizing serverless scaling), and Security (schema validation at the edge).

Fig 1. Event Ingestion Pipeline
[ Client Devices ]
       |
       v
[ CloudFront + WAF ]  <-- Edge Security & DDoS Protection
       |
       v
[ API Gateway ]       <-- Throttling & Auth
       |
       v
[ Lambda Proxy ]      <-- Schema Validation & Enrichment
       |
       v
[ Kinesis Firehose ]  <-- Buffering, Batching & Parquet Conversion
       |
       v
[ S3 Data Lake ]      <-- Raw & Processed Storage
       |
       v
[ Snowflake Pipe ]    <-- Auto-Ingest via SQS Notifications

2.1 Design Goals for a Production Telemetry Lake

Goal                | Architectural decision                       | Why it matters
Absorb spikes       | Firehose buffering and S3 landing zone       | Traffic bursts do not directly pressure Snowflake warehouses.
Preserve raw events | Immutable S3 raw prefix                      | Bad transformations can be replayed without asking clients to resend data.
Control cost        | Parquet conversion and larger files          | Columnar storage reduces query scan size and Snowpipe file overhead.
Protect users       | WAF, schema validation, and PII minimization | Invalid, abusive, or over-privileged payloads are rejected before persistence.

3. Data Storage and Integration (Snowflake)

The destination for our telemetry is Snowflake. We utilize the Storage Integration object to establish a secure, IAM-role-based trust relationship with AWS S3, eliminating the need for long-lived access keys.

3.1 Storage Integration Configuration

-- Run this as ACCOUNTADMIN
CREATE OR REPLACE STORAGE INTEGRATION s3_telemetry_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-telemetry-bucket/events/');

-- Get the AWS User ARN and External ID to use in your IAM Role trust policy
DESC INTEGRATION s3_telemetry_int;
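
The two values that matter in the DESC INTEGRATION output are STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID. As a minimal sketch, the trust policy for the IAM role could be assembled like this. The helper name build_trust_policy and the sample ARN and external ID are placeholders for illustration, not values from a real account.

```python
import json


def build_trust_policy(snowflake_user_arn: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets Snowflake assume the access role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # STORAGE_AWS_IAM_USER_ARN from DESC INTEGRATION
                "Principal": {"AWS": snowflake_user_arn},
                "Action": "sts:AssumeRole",
                # STORAGE_AWS_EXTERNAL_ID from DESC INTEGRATION
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder inputs: substitute the values your own integration reports.
policy = build_trust_policy(
    "arn:aws:iam::111122223333:user/example-snowflake-user",
    "EXAMPLE_EXTERNAL_ID",
)
print(json.dumps(policy, indent=2))
```

The external ID condition is what stops another party from using Snowflake to assume your role, so it should not be omitted from the policy.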

3.2 File Format Strategy

We select Parquet as the interchange format for its columnar layout, efficient compression (Snappy), and embedded schema. Together these keep files in S3 small and let Snowflake map columns by name at load time.

CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET'
  COMPRESSION = 'SNAPPY';

CREATE OR REPLACE STAGE s3_stage
  STORAGE_INTEGRATION = s3_telemetry_int
  URL = 's3://my-telemetry-bucket/events/'
  FILE_FORMAT = parquet_format;

3.3 Event-Driven Ingestion (Snowpipe)

Snowpipe subscribes to S3 event notifications delivered through an SQS queue and loads each new file in a micro-batch. Because these loads run on Snowflake-managed serverless compute, no user-managed warehouse has to stay running to keep data flowing.

CREATE OR REPLACE PIPE telemetry_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw_events
  FROM @s3_stage
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
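
Snowpipe consumes the S3 notifications itself, so no code is needed here. To make the trigger concrete, though, the sketch below parses the kind of S3 event message that arrives on the queue and extracts the object keys. The function name and the sample file name part-0001.parquet are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def object_keys_from_s3_event(message_body: str) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification body."""
    event = json.loads(message_body)
    keys = []
    for record in event.get("Records", []):
        # Only object-creation events should trigger a load.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification ("=" arrives as "%3D").
        key = unquote_plus(record["s3"]["object"]["key"])
        keys.append((bucket, key))
    return keys


sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-telemetry-bucket"},
            "object": {"key": "events/year%3D2026/month%3D01/day%3D15/part-0001.parquet"},
        },
    }]
})
print(object_keys_from_s3_event(sample))
# [('my-telemetry-bucket', 'events/year=2026/month=01/day=15/part-0001.parquet')]
```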

4. Ingestion Infrastructure (AWS)

The ingestion path is provisioned via Terraform. We utilize Amazon Kinesis Data Firehose for its ability to handle buffering, retries, and format conversion (JSON to Parquet) as a managed service.

4.1 S3 Event Notifications

resource "aws_s3_bucket" "telemetry" {
  bucket = "my-telemetry-bucket"
}

# Notify Snowpipe's SQS queue whenever a new object lands under events/
resource "aws_s3_bucket_notification" "snowpipe_trigger" {
  bucket = aws_s3_bucket.telemetry.id

  queue {
    queue_arn     = "arn:aws:sqs:us-east-1:..." # notification_channel value from DESC PIPE output
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "events/"
  }
}

4.2 Firehose Configuration and Transformation

Firehose is configured to convert incoming JSON streams into optimized Parquet files using a schema defined in the AWS Glue Data Catalog.

resource "aws_kinesis_firehose_delivery_stream" "stream" {
  name        = "telemetry-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.telemetry.arn
    prefix     = "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
    error_output_prefix = "errors/!{firehose:error-output-type}/"

    # Buffer hints: flush at 128 MB or 15 minutes (900 s), whichever comes first
    buffering_size     = 128
    buffering_interval = 900

    data_format_conversion_configuration {
      input_format_configuration {
        deserializer {
          open_x_json_ser_de {}
        }
      }
      output_format_configuration {
        serializer {
          parquet_ser_de {}
        }
      }
      schema_configuration {
        database_name = aws_glue_catalog_database.db.name
        role_arn      = aws_iam_role.firehose.arn
        table_name    = aws_glue_catalog_table.schema.name
      }
    }
  }
}
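
To see where a given event will land, it helps to reproduce the prefix expression outside Firehose. The sketch below mirrors the year/month/day layout from the prefix above. It assumes Firehose evaluates the timestamp namespace in UTC from the record's approximate arrival time. The function name firehose_prefix is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def firehose_prefix(arrival: datetime) -> str:
    """Return the S3 prefix the Firehose expression would produce for a timestamp."""
    if arrival.tzinfo is None:
        raise ValueError("arrival must be timezone-aware")
    utc = arrival.astimezone(timezone.utc)  # partitions are assumed to follow UTC
    return f"events/year={utc:%Y}/month={utc:%m}/day={utc:%d}/"


# 23:30 on Jan 15 at UTC-5 is already Jan 16 in UTC, so it lands in day=16.
local_evening = datetime(2026, 1, 15, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(firehose_prefix(local_evening))
# events/year=2026/month=01/day=16/
```

The example is a reminder that partition boundaries follow UTC, not the client's local day, which matters when you query or replay "yesterday's" data.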

5. Data Validation Layer (Lambda)

Direct integration between API Gateway and Firehose is possible but discouraged. We implement a Lambda proxy to enforce schema validation before data enters the persistence layer. This prevents "schema drift" and ensures that only valid data triggers the Parquet conversion logic.

5.1 Schema Enforcement with Pydantic

from pydantic import BaseModel, Field, ValidationError
import boto3
import json
import os

firehose = boto3.client('firehose')
STREAM_NAME = os.environ['STREAM_NAME']

class TelemetryEvent(BaseModel):
    event_id: str = Field(..., min_length=10)
    timestamp: int
    user_id: str
    payload: dict
    source: str = "mobile_app"

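def handler(event, context):
    # 1. Parse Body (a missing or non-string body is also a client error)
    try:
        body = json.loads(event['body'])
    except (KeyError, TypeError, json.JSONDecodeError):
        return {"statusCode": 400, "body": "Invalid JSON"}
    if not isinstance(body, dict):
        # Valid JSON, but a list or scalar cannot be unpacked into the model
        return {"statusCode": 400, "body": "Expected a JSON object"}

    # 2. Validate Schema
    try:
        telemetry = TelemetryEvent(**body)
    except ValidationError as e:
        print(f"Schema Error: {e}")
        return {"statusCode": 422, "body": e.json()}

    # 3. Push to Firehose
    # Note: Add a newline for JSON-lines format if not using Parquet conversion
    # Note: .dict() is the Pydantic v1 name; on Pydantic v2 use telemetry.model_dump()
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={'Data': json.dumps(telemetry.dict()).encode('utf-8')}
    )

    return {"statusCode": 202, "body": "Accepted"}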

6. Edge Security Controls

To protect the ingestion endpoint, we place the API Gateway behind a CloudFront distribution. This allows us to cache OPTIONS requests (CORS) and apply AWS WAF rules for Layer 7 protection.

  • Rate Limiting: Limit to 2000 requests/5min per IP.
  • Geo-Blocking: If your user base is regional, block traffic from countries you do not serve.
  • Managed Rules: Enable `AWSManagedRulesCommonRuleSet` and `AWSManagedRulesAmazonIpReputationList`.

6.1 Privacy and PII Controls

The ingestion layer should treat every payload as untrusted. Before an event reaches Firehose, the Lambda validator should drop unnecessary fields, hash stable identifiers where possible, and reject payloads that include secrets, raw tokens, or free-form data that could accidentally contain sensitive information. For customer-facing telemetry, the schema is a security boundary, not just a developer convenience.

A practical implementation keeps three S3 prefixes: raw/ for unmodified accepted events, curated/ for normalized Parquet, and errors/ for validation failures. Access to raw/ should be narrower than access to curated analytics tables because raw telemetry often carries more compliance risk.
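
A minimal sketch of the three controls described above (drop unnecessary fields, hash stable identifiers, reject likely secrets) might look like the following. The field allowlist matches the TelemetryEvent model. The secret patterns, the sanitize helper, and the keyed-hash choice (HMAC-SHA256) are illustrative assumptions rather than the production rules.

```python
import hashlib
import hmac
import re

ALLOWED_FIELDS = {"event_id", "timestamp", "user_id", "payload", "source"}

# Illustrative patterns only; a real deployment would maintain its own list.
SECRET_PATTERNS = [
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped string
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
]


def contains_secret(value) -> bool:
    """Recursively scan strings, dict values, and list items for secret-like text."""
    if isinstance(value, str):
        return any(p.search(value) for p in SECRET_PATTERNS)
    if isinstance(value, dict):
        return any(contains_secret(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_secret(v) for v in value)
    return False


def sanitize(event: dict, hash_key: bytes) -> dict:
    """Reject secret-bearing payloads, drop unknown fields, and pseudonymize user_id."""
    if contains_secret(event):
        raise ValueError("payload appears to contain a credential or token")
    cleaned = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    if "user_id" in cleaned:
        # Keyed hash: stable for joins, not reversible without the key.
        digest = hmac.new(hash_key, str(cleaned["user_id"]).encode("utf-8"), hashlib.sha256)
        cleaned["user_id"] = digest.hexdigest()
    return cleaned


event = {"event_id": "evt-0000001", "timestamp": 1768435200, "user_id": "u-42",
         "payload": {"screen": "home"}, "device_serial": "X123"}
print(sorted(sanitize(event, b"example-key")))
# ['event_id', 'payload', 'timestamp', 'user_id']
```

In the Lambda, a call like this would sit between schema validation and the Firehose put, with the hash key loaded from a secret store rather than hard-coded.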

7. Operational Excellence: Monitoring & Cost

This system is "serverless," but it still needs watching.

7.1 Key Metrics to Alarm On

  1. Firehose `DeliveryToS3.Success` < 99%: Indicates IAM issues or bucket limits.
  2. Lambda `IteratorAge`: If using a Kinesis Data Streams source, this shows lag.
  3. Snowpipe `PIPE_USAGE_HISTORY`: Monitor credits consumed.
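
As a sketch of the first alarm in the list, the helper below builds the arguments for CloudWatch's put_metric_alarm. It assumes the metric's Average reads as a success ratio between 0 and 1, so 99% becomes 0.99. The helper name, the evaluation window, the missing-data policy, and the SNS topic ARN are assumptions to adapt.

```python
def delivery_success_alarm(stream_name: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for Firehose S3 delivery success."""
    return {
        "AlarmName": f"{stream_name}-delivery-to-s3-success",
        "Namespace": "AWS/Firehose",
        "MetricName": "DeliveryToS3.Success",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "Statistic": "Average",
        "Period": 300,               # 5-minute datapoints
        "EvaluationPeriods": 3,      # assumed: alarm after 15 minutes below threshold
        "Threshold": 0.99,           # assumed ratio scale (0 to 1)
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",  # assumed: idle periods should not page
        "AlarmActions": [sns_topic_arn],
    }


params = delivery_success_alarm(
    "telemetry-stream",
    "arn:aws:sns:us-east-1:111122223333:example-alerts",  # placeholder ARN
)
print(params["MetricName"], params["Threshold"])

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

If silence itself is a failure signal for your traffic pattern, setting TreatMissingData to "breaching" is the stricter choice.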

7.2 The "Small File" Cost Trap

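Snowpipe charges 0.06 credits per 1,000 files loaded. If you configure Firehose to flush every 60 seconds, each delivery stream writes roughly 1,440 files per day (Firehose has no shards; the count is per stream and per output prefix).
The Fix: Set Firehose buffering to 900 seconds (15 minutes) or 128 MB, whichever is reached first. When the time limit is the trigger, this yields 96 files per day instead of 1,440, cutting the per-file Snowpipe overhead by ~15x. The trade-off is freshness: an event can wait up to 15 minutes in the buffer before it reaches S3.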
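
The file-count arithmetic in this section can be checked with a few lines. This sketch assumes one output file per flush and that the time limit (not the size limit) triggers every flush. The 0.06 figure is the per-file rate quoted above, and the function names are illustrative.

```python
CREDITS_PER_1000_FILES = 0.06  # per-file rate quoted in this section
SECONDS_PER_DAY = 86_400


def files_per_day(flush_interval_seconds: int) -> int:
    """Files written per day if every flush is time-triggered and writes one file."""
    return SECONDS_PER_DAY // flush_interval_seconds


def daily_file_overhead_credits(flush_interval_seconds: int) -> float:
    """Per-file Snowpipe overhead in credits per day for one stream."""
    return files_per_day(flush_interval_seconds) / 1000 * CREDITS_PER_1000_FILES


for interval in (60, 900):
    print(interval, files_per_day(interval), f"{daily_file_overhead_credits(interval):.4f}")
# 60 1440 0.0864
# 900 96 0.0058

print(files_per_day(60) / files_per_day(900))
# 15.0
```

The absolute numbers are small for a single stream; the ratio is what matters once the pipeline fans out across many streams or partition prefixes.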

7.3 Runbook for Common Failure Modes

When ingestion latency rises:
  • Check API Gateway 4xx/5xx trends to separate client payload issues from platform failures.
  • Inspect Lambda duration, throttles, and error logs for schema drift or dependency failures.
  • Validate Firehose delivery success, S3 error prefixes, and buffering pressure.
  • Check Snowpipe load history (for example via COPY_HISTORY) and pipe credit usage before scaling any downstream warehouse.
  • Replay a small partition from S3 to confirm whether the issue is ingestion, transformation, or loading.
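
For the replay step, one option is Snowflake's ALTER PIPE ... REFRESH, which re-queues staged files under a prefix. The sketch below only builds the statement for one day's partition. The prefix is relative to the stage URL (which already ends in events/), and the helper name is hypothetical. Note that REFRESH only considers files staged within the last 7 days and skips files the pipe has already loaded.

```python
import re
from datetime import date


def refresh_statement(pipe_name: str, day: date) -> str:
    """Build an ALTER PIPE ... REFRESH statement for a single day partition."""
    # Guard against injecting arbitrary SQL through the pipe name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$.]*", pipe_name):
        raise ValueError(f"unexpected pipe name: {pipe_name!r}")
    # Path below the stage location, matching the Firehose partition layout.
    prefix = f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    return f"ALTER PIPE {pipe_name} REFRESH PREFIX = '{prefix}';"


print(refresh_statement("telemetry_pipe", date(2026, 1, 15)))
# ALTER PIPE telemetry_pipe REFRESH PREFIX = 'year=2026/month=01/day=15/';
```

If the refreshed files load cleanly, the fault is more likely upstream (notifications or delivery) than in the COPY definition itself.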

8. FAQ

Should I use Kafka instead of Firehose?
Use Kafka when you need multi-consumer streaming, strict ordering semantics, or application-level event processing. Use Firehose when the main job is reliable delivery into S3 and Snowflake with minimal operations.

Why keep S3 if Snowflake is the analytics target?
S3 is the replay buffer, audit trail, and cost-efficient long-term store. Snowflake is the query and serving layer. Keeping both avoids turning the warehouse into the only source of truth.

What makes this zero-downtime?
The design avoids hard coupling between request ingestion and warehouse loading. If Snowflake slows down, Firehose and S3 continue absorbing events while operators resolve the downstream issue.

9. Conclusion

You now have a pipeline that:
1. Validates data at the edge (Lambda/Pydantic).
2. Buffers and optimizes storage (Firehose/Parquet).
3. Auto-scales to millions of events per minute.
4. Avoids the fixed cost and operational load of an always-on Kafka cluster.
