Fleet Management for Raspberry Pi AI HAT Devices Using Firebase
2026-02-11

Practical guide to manage OTA, auth, telemetry and health for Raspberry Pi 5 AI HAT fleets using Firebase with cost and scale best practices.

Managing hundreds of Raspberry Pi 5 AI HAT devices should not become your biggest operational cost

If you’re running a fleet of Raspberry Pi 5 devices with an AI HAT in production, you already know the pain: OTA updates that fail on intermittent networks, insecure or brittle device authentication, telemetry overload that balloons your bill, and limited visibility into device health until customers call. This article shows a practical, scalable Firebase-first architecture (with Google Cloud where needed) to solve OTA, device auth, telemetry ingestion, and health monitoring—while keeping performance and costs optimized in 2026.

Why this matters in 2026

Late 2025 and early 2026 accelerated two trends that matter for Pi 5 AI HAT fleets:

  • Edge-first AI: On-device inference (LLM and multimodal) is common on Pi 5-class hardware, increasing update frequency for models and runtime components.
  • Zero-trust device posture: Enterprises demand short-lived credentials and attestation for every device before it can exchange telemetry or request artifacts.

Combine those with cost sensitivity and you need a design that scales linearly in operations and sub-linearly in cost.

Quick architecture overview

Here’s the high-level pattern I recommend:

  1. Device identity & auth: Short-lived custom JWTs minted by a backend (Firebase Auth / Identity Platform + Admin SDK) and device attestation for trust.
  2. OTA delivery & control plane: Cloud Storage for artifacts + Firestore as the desired-state registry + Cloud Functions to orchestrate + FCM to notify devices.
  3. Telemetry ingestion: Lightweight device batching to an authenticated HTTPS endpoint (Cloud Run or Cloud Functions) that writes to Pub/Sub for scale, then to BigQuery for analytics and to Firestore for operational needs.
  4. Health monitoring: Last-seen heartbeats in Firestore + Cloud Monitoring dashboards and alerting integrated with Slack/PagerDuty.

Why Firebase?

Firebase gives you a low-friction developer experience (Auth, Firestore, Cloud Functions, FCM, Cloud Storage) and native SDKs for quick device prototypes. For heavy ingestion and observability we integrate Google Cloud services (Pub/Sub, BigQuery, Cloud Monitoring) that scale more predictably and cost-effectively at fleet scale.

1) Device identity and secure authentication

Core idea: treat every Pi 5 as a long-lived identity (device-id) but never issue long-lived credentials. Mint short-lived custom Firebase tokens from a trusted backend after device attestation.

  1. Manufacturing/first-boot: embed a unique device-id and a device certificate (X.509) or private key into secure storage on the Pi (e.g., TPM, Secure Element, or OS keystore).
  2. First-boot: device authenticates to your enrollment endpoint with its cert and a proof-of-possession signature.
  3. Enrollment backend verifies the cert, creates a device record in Firestore under /devices/{deviceId}, and issues a custom token via the Firebase Admin SDK (Admin SDK custom tokens expire after one hour).
  4. Device exchanges the custom token for a Firebase ID token using the Identity Toolkit REST API, then uses that token to access Firestore and other Firebase services.

Server-side (Node.js) example: issuing a custom token

// backend/index.js (Node.js)
const admin = require('firebase-admin');
admin.initializeApp({ credential: admin.credential.applicationDefault() });

async function mintDeviceToken(deviceId, claims = {}) {
  // Mint a custom token after attestation checks pass; keep claims minimal.
  // Admin SDK custom tokens expire after one hour.
  return admin.auth().createCustomToken(deviceId, claims);
}

On the device, exchange this with the REST endpoint (replace API_KEY):

curl -X POST \
  'https://identitytoolkit.googleapis.com/v1/accounts:signInWithCustomToken?key=API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"token":"CUSTOM_TOKEN","returnSecureToken":true}'

Best practices

  • Short token lifetime: Firebase ID tokens expire after one hour and this is not configurable; refresh them silently in the background and revoke a device's refresh tokens on compromise.
  • Device attestation: use TPM or signed certs. If hardware attestation isn’t possible, use an enrollment-time human verification step.
  • Rotate keys: automate server-side key rotation and certificate revocation lists (CRLs) stored in Firestore.
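To keep re-auth silent, refresh the ID token ahead of expiry using the refresh token returned at sign-in and the securetoken.googleapis.com token endpoint. A minimal sketch (reading the API key from the environment is an assumption):

// device/refresh.js: background token refresh sketch
const API_KEY = process.env.FIREBASE_API_KEY;

async function refreshIdToken(refreshToken) {
  const res = await fetch(`https://securetoken.googleapis.com/v1/token?key=${API_KEY}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: `grant_type=refresh_token&refresh_token=${encodeURIComponent(refreshToken)}`
  });
  if (!res.ok) throw new Error(`refresh failed: ${res.status}`);
  const data = await res.json();
  return {
    idToken: data.id_token,
    refreshToken: data.refresh_token, // may rotate; always persist the latest
    expiresInSec: Number(data.expires_in)
  };
}

// Refresh five minutes before expiry so requests never carry a stale token.
function scheduleRefresh(state) {
  const marginMs = 5 * 60 * 1000;
  setTimeout(async () => {
    Object.assign(state, await refreshIdToken(state.refreshToken));
    scheduleRefresh(state);
  }, state.expiresInSec * 1000 - marginMs);
}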

2) OTA updates: safe, resumable, and verifiable

OTA must be atomic, resumable, and verified. I recommend a control plane (Firestore) that holds desired state and metadata + Cloud Storage for artifacts + signed URLs or signed artifacts for integrity + FCM to nudge devices.

Flow

  1. Developer uploads model/runtime artifact (tar.gz or image) to Cloud Storage. Trigger a Cloud Function to create a signed metadata entry (sha256 and signature) and an immutable object path like gs://ota-bucket/release/v2026-01-01/artifact.tar.gz.
  2. Create a release document under /ota/releases/{releaseId} with desiredVersion and metadata.
  3. Cloud Function on document change generates per-device signed URLs (short-lived) or stores a release manifest in Firestore and sends an FCM message to targeted devices.
  4. Device receives FCM, reads the manifest from Firestore, obtains the signed URL, downloads the artifact, verifies sha256 + signature, applies update in a safe transaction (swap partitions or atomic file replacement), and reports state back to Firestore.
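As an illustration of step 3, here is a sketch of minting a short-lived V4 signed URL with the @google-cloud/storage client. The bucket and object path follow the naming from step 1; note the runtime service account needs signing permission (e.g. roles/iam.serviceAccountTokenCreator):

// functions/signedUrl.js: per-release signed URL sketch
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

async function generateArtifactUrl(releaseId) {
  // 15-minute V4 signed URL; stagger calls across devices to smooth egress.
  const [url] = await storage
    .bucket('ota-bucket')
    .file(`release/${releaseId}/artifact.tar.gz`)
    .getSignedUrl({
      version: 'v4',
      action: 'read',
      expires: Date.now() + 15 * 60 * 1000
    });
  return url;
}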

Cloud Function (simplified) to notify devices

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.onRelease = functions.firestore
  .document('ota/releases/{releaseId}')
  .onCreate(async (snap) => {
    const release = snap.data();
    // Query target devices and send FCM messages.
    // getActiveDeviceFcmTokens() is a stand-in for your own token lookup.
    const tokens = await getActiveDeviceFcmTokens();
    // The legacy sendToDevice() API has been removed from firebase-admin;
    // use sendEachForMulticast(), which accepts up to 500 tokens per call
    // (chunk larger fleets across multiple calls).
    await admin.messaging().sendEachForMulticast({
      tokens,
      notification: { title: 'New OTA available', body: `Version ${release.version}` },
      data: { releaseId: snap.id }
    });
  });

Device-side OTA concerns

  • Support resumable downloads (HTTP range requests) and verify integrity after each chunk. Follow patch governance principles to avoid rolling out faulty or malicious updates.
  • Use A/B partitioning or containerized deployments to avoid bricking devices.
  • Rate-limit concurrent device downloads to reduce egress spikes—Cloud Functions can stagger signed URL generation.
  • Require signed artifacts. Use GPG or a signing key kept in your key management system (KMS).
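To make the first bullet concrete, here is a device-side sketch (Node.js 18+) of a resumable download with a whole-artifact sha256 check; per-chunk verification can be layered on top if your manifest carries chunk hashes. The destination path and manifest fields are assumptions:

// device/download.js: resumable download + integrity check sketch
const fs = require('fs');
const crypto = require('crypto');

async function downloadResumable(url, destPath, expectedSha256) {
  // Resume from where a previous attempt stopped using an HTTP Range header.
  const offset = fs.existsSync(destPath) ? fs.statSync(destPath).size : 0;
  const res = await fetch(url, {
    headers: offset > 0 ? { Range: `bytes=${offset}-` } : {}
  });
  if (res.status !== 200 && res.status !== 206) {
    throw new Error(`download failed: ${res.status}`);
  }

  // If the server ignored the Range header (200 with an offset), start over.
  const resuming = offset > 0 && res.status === 206;
  const out = fs.createWriteStream(destPath, { flags: resuming ? 'a' : 'w' });
  for await (const chunk of res.body) out.write(chunk);
  await new Promise((resolve) => out.end(resolve));

  // Verify the whole artifact before applying; never trust a partial file.
  const hash = crypto.createHash('sha256');
  await new Promise((resolve, reject) => {
    fs.createReadStream(destPath)
      .on('data', (c) => hash.update(c))
      .on('end', resolve)
      .on('error', reject);
  });
  if (hash.digest('hex') !== expectedSha256) {
    fs.unlinkSync(destPath); // corrupt or tampered: discard and restart
    throw new Error('sha256 mismatch');
  }
}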

3) Telemetry ingestion and storage patterns

Telemetry is where costs often explode. Design for on-device aggregation, tiered ingestion, and eventual consistency.

Three-tier telemetry pipeline

  1. Edge aggregation: batch and compress events on-device (1–5s batch windows), prefer delta-only events for metrics (e.g., counters, averages).
  2. Authenticated ingest endpoint: devices POST compressed batches to a Cloud Run/Cloud Function endpoint that verifies Firebase ID tokens.
  3. Scale and storage: push messages to Pub/Sub for fan-out; use Cloud Dataflow / Cloud Run workers to transform and write to BigQuery for analytics, and to Firestore for operational state (a small subset).

Why not write all telemetry to Firestore?

Firestore is excellent for operational state and low-frequency writes (config, device metadata). For high-frequency telemetry, BigQuery via Pub/Sub + Dataflow is far cheaper and built for analytics. Use Firestore for last-seen, current-mode, and the latest error state only.

Device POST example (pseudo)

POST /ingest
Headers: Authorization: Bearer <firebase-id-token>
Body: { deviceId: "pi-0001", events: [ {ts: 167..., cpu: 23, mem: 40}, ... ] }

The server verifies the token with the Admin SDK, acknowledges quickly (200), and pushes the payload to Pub/Sub asynchronously to avoid backpressure on devices.
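A minimal ingest service sketch (Express on Cloud Run; the topic name is an assumption). Caveat: on Cloud Run, work done after the response may be throttled unless CPU is always allocated, so publish before responding if you observe drops:

// ingest/server.js: authenticated ingest sketch
const express = require('express');
const admin = require('firebase-admin');
const { PubSub } = require('@google-cloud/pubsub');

admin.initializeApp();
const topic = new PubSub().topic('telemetry'); // assumed topic name

const app = express();
app.use(express.json({ limit: '1mb' }));

app.post('/ingest', async (req, res) => {
  try {
    const idToken = (req.headers.authorization || '').replace('Bearer ', '');
    const decoded = await admin.auth().verifyIdToken(idToken);

    // Ack fast so devices aren't held open; publishing continues after.
    res.status(200).send('ok');
    await topic.publishMessage({
      json: { deviceId: decoded.uid, events: req.body.events }
    });
  } catch (err) {
    if (!res.headersSent) res.status(401).send('unauthorized');
  }
});

app.listen(process.env.PORT || 8080);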

4) Health checks, monitoring, and alerting

Visibility is non-negotiable. Combine device heartbeats in Firestore with Cloud Monitoring dashboards and alerts.

Health model

  • Heartbeat: device writes lastSeen timestamp to /devices/{id}/status every minute (or at a frequency tuned to your SLA).
  • Health metrics: devices push metrics (CPU, temp, memory, model latency) to telemetry pipeline; aggregated metrics land in BigQuery and Cloud Monitoring.
  • Alerting: Cloud Monitoring rules trigger on thresholds (offline > X minutes, CPU > 95%, model latency spike), and forward notifications to Slack/PagerDuty.

Detecting offline devices efficiently

Avoid scanning all devices. Use a Cloud Function that listens to Firestore status updates and writes a custom time-series metric to Cloud Monitoring. Then create an alerting policy on that metric. For very large fleets, export lastSeen to a BigQuery table and run periodic queries with partition pruning to find anomalies.

Practical tip: use exponential back-off on device heartbeat intervals when the device knows it’s on battery or low connectivity to reduce write volume and cost.
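Here is a device-side heartbeat loop sketch using the Firebase web SDK in Node; it assumes the device has already signed in via the custom-token flow above, that security rules allow a device to write its own document, and uses failed writes as a proxy for low connectivity when backing off:

// device/heartbeat.js: heartbeat with back-off sketch
const { initializeApp } = require('firebase/app');
const { getFirestore, doc, setDoc, serverTimestamp } = require('firebase/firestore');

const app = initializeApp({ apiKey: process.env.FIREBASE_API_KEY, projectId: 'my-project' });
const db = getFirestore(app);

const BASE_MS = 60_000; // one-minute baseline, tuned to your SLA
let intervalMs = BASE_MS;

async function beat(deviceId) {
  try {
    await setDoc(
      doc(db, 'devices', deviceId),
      { status: { lastSeen: serverTimestamp(), mode: 'normal' } },
      { merge: true }
    );
    intervalMs = BASE_MS; // healthy: return to the base rate
  } catch (err) {
    // Offline or on battery: back off exponentially, capped at 15 minutes.
    intervalMs = Math.min(intervalMs * 2, 15 * 60_000);
  }
  setTimeout(() => beat(deviceId), intervalMs);
}

beat('pi-0001');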

5) Scaling and cost optimization strategies

Costs come from reads/writes (Firestore), network egress (Cloud Storage), and analytics storage & queries (BigQuery). Here’s how to control them.

Firestore cost control

  • Keep writes infrequent: only lastSeen and status go to Firestore. Bulk telemetry goes to Pub/Sub/BigQuery.
  • Use subcollections sparingly—each document write counts.
  • Batch writes where possible and use exponential back-off on devices to reduce retries.
  • Leverage Firestore TTL policies (generally available) for ephemeral logs so you don’t accumulate historical operational documents.
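For the batching and TTL bullets, a backend-side sketch that writes ephemeral logs in one batch and stamps each document with an expireAt field (assumes a Firestore TTL policy is configured on that field):

// backend/logsBatch.js: batched writes with a TTL field
const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

async function writeEphemeralLogs(deviceId, logs) {
  const batch = db.batch(); // up to 500 operations per batch
  const expireAt = admin.firestore.Timestamp.fromMillis(
    Date.now() + 7 * 24 * 3600 * 1000 // TTL reaps each doc after ~7 days
  );
  for (const log of logs) {
    const ref = db.collection('devices').doc(deviceId).collection('logs').doc();
    batch.set(ref, { ...log, expireAt });
  }
  await batch.commit();
}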

Network & Storage

  • Host large OTA images in Cloud Storage and distribute with signed URLs. Use multi-region buckets only for global fleets; for regional fleets use regional buckets to cut egress costs.
  • Enable gzip/brotli compression and chunked downloads to improve resiliency and reduce re-downloads.

Telemetry & Analytics

  • Pre-aggregate and sample on-device for high-frequency signals (e.g., internal sensors), send summarized stats instead of raw samples.
  • Use partitioned BigQuery tables (time-partitioned) and scheduled queries to store long-tail aggregates; avoid SELECT * ad-hoc queries over full tables.
  • Leverage BigQuery reservation and flex slots if your query patterns are predictable to lower per-query costs.
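To make the first bullet concrete, a small on-device pre-aggregation sketch: collapse a window of raw samples into one summary record before upload (field names are illustrative):

// device/aggregate.js: window summarization sketch
function summarize(samples) {
  // samples: [{ cpu, tempC }, ...] collected over one batch window
  const n = samples.length;
  const sum = (k) => samples.reduce((acc, s) => acc + s[k], 0);
  const max = (k) => Math.max(...samples.map((s) => s[k]));
  return {
    n,                      // sample count, kept for weighted re-aggregation
    cpuAvg: sum('cpu') / n,
    cpuMax: max('cpu'),
    tempAvg: sum('tempC') / n,
    tempMax: max('tempC'),
    ts: Date.now()
  };
}
// One summary replaces n raw samples: an n-fold ingestion reduction.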

6) Security & compliance checklist

  • Mutual TLS or signed JWTs for device enrollment and token minting.
  • Short-lived credentials: no long-lived API keys on devices.
  • Artifact signing and verify on-device. Never run unsigned code.
  • Network segmentation: devices should only communicate with your authenticated endpoints; firewall egress to limit unexpected destinations.
  • Audit logs: enable Cloud Audit Logs for the Admin SDK and Cloud Functions to track changes.

7) Troubleshooting and observability patterns

Common failure modes: failed OTA, token expiry, telemetry backlog, and device bricking. Build observability into each step:

  • Device-side: structured logs, local ring buffer, and a compressed diagnostic upload triggered via FCM.
  • Backend: correlate Pub/Sub messages with device IDs, push traces into Cloud Trace, and use Cloud Logging to aggregate logs for quick forensic searches.
  • Automated remediation: Cloud Functions can detect crash loops and roll devices back by updating /devices/{id}/desiredVersion.
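A sketch of that remediation function; the health document schema (crashCount, lastKnownGoodVersion) is an assumption about how your devices report state:

// functions/remediate.js: crash-loop rollback sketch
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.onHealthUpdate = functions.firestore
  .document('devices/{deviceId}/health/current')
  .onWrite(async (change, context) => {
    const health = change.after.data();
    if (!health || health.crashCount < 3) return; // not a crash loop yet

    // Roll back by lowering the desired version; the device's OTA agent
    // picks this up on its next manifest check.
    await admin.firestore()
      .doc(`devices/${context.params.deviceId}`)
      .update({ desiredVersion: health.lastKnownGoodVersion });
  });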

Example: Full release cycle (concise)

  1. Dev uploads release → Cloud Function verifies & signs artifact → writes release doc to Firestore.
  2. On write, Cloud Function generates per-device signed URLs (or signed manifest) and sends FCM nudges.
  3. Devices download via signed URL, verify signature, perform A/B swap, and write status back to Firestore.
  4. Telemetry during rollout flows to Pub/Sub → BigQuery; dashboards show rollout success rate and model latency.

Looking ahead: how fleets will evolve through 2026 and beyond

  • Federated updates: decentralized rollout logic where a subset of devices act as distribution seeds to reduce cloud egress on constrained networks.
  • On-device A/B evaluation: perform local canary tests and report telemetry back so you can decide rollback without waiting for full-cloud analytics.
  • Privacy-preserving telemetry: aggregate and anonymize sensitive metrics on-device before leaving the edge.

Checklist: What to implement first (pragmatic)

  1. Device identity and short-lived auth (enrollment + custom tokens).
  2. Basic OTA control plane (Cloud Storage + Firestore + minimal Cloud Function + FCM notify).
  3. Heartbeat in Firestore + simple Cloud Monitoring alerts for offline devices.
  4. Simple telemetry POST endpoint → Pub/Sub → BigQuery for analytics.
  5. Policy for artifact signing and rollback strategy.

Case study snapshot (hypothetical)

We onboarded 3,000 Pi 5 AI HATs across three regions in Q4 2025. By moving raw telemetry off Firestore into Pub/Sub → BigQuery and keeping only lastSeen/status in Firestore, we reduced Firestore costs by 82% and improved OTA success rate to 99.2% after implementing resumable downloads and A/B swaps.

Closing summary and takeaways

Managing a Raspberry Pi 5 AI HAT fleet at scale requires discipline around identity, a control-plane-driven OTA system, a tiered telemetry pipeline, and tight monitoring. Firebase provides fast developer velocity for the control plane (Auth, Firestore, FCM, Cloud Functions), while Google Cloud primitives (Pub/Sub, BigQuery, Cloud Monitoring) give you predictable scale and cost control. In 2026, short-lived credentials, signed artifacts, on-device aggregation, and integration with robust observability are no longer optional—they’re table stakes.

Next steps (actionable)

Start with these practical actions this week:

  • Implement a device enrollment endpoint that issues short-lived custom tokens and store device metadata in Firestore.
  • Create a minimal OTA pipeline: upload artifact to Cloud Storage, add a release doc in Firestore, and trigger a Cloud Function that sends FCM notifications.
  • Switch device telemetry to batched POSTs to a Pub/Sub-backed ingestion endpoint and land analytics in BigQuery.

Call to action

If you’re evaluating this architecture for production, start a migration prototype now—deploy a pilot with 50 devices and measure Firestore writes, BigQuery costs, and OTA success rate. If you want a starter repo, automated Cloud Function templates, and a pre-built Firestore device schema tuned for Pi 5 AI HAT fleets, request the firebase.live starter kit and we’ll share tested recipes and scripts to get your pilot running in days.
