ESP32 Android Auto Head Unit
A DIY Android Auto wireless head unit built on the ESP32-S3 (WT32-SC01 Plus), written entirely in Rust.
Implements the Android Auto WiFi protocol from scratch: TCP connection, TLS handshake, protobuf service discovery, video channel (H.264 decode), touch input, navigation events, and sensor reporting — all running on a $20 microcontroller.
Demo
The ESP32 hosts a WiFi AP. The phone joins it and connects via TCP on port 5277. Android Auto renders to the 480×320 LCD with touch input support.
Phone (Android Auto) ──WiFi──► ESP32-S3 AP ──I80 bus──► 480×320 LCD
◄──touch── FT6336U ◄──I2C──┘
Hardware
| Component | Details |
|---|---|
| Board | WT32-SC01 Plus ($20) |
| SoC | ESP32-S3R8 — dual-core Xtensa LX7 @ 240MHz |
| RAM | 512KB SRAM + 2MB PSRAM (quad, 80MHz) |
| Flash | 16MB (QIO, 80MHz) |
| Display | ST7796 480×320 LCD, I80 8-bit parallel bus @ 40MHz |
| Touch | FT6336U capacitive, I2C @ 400kHz |
| WiFi | 802.11 b/g/n 2.4GHz (built-in) |
Pin Assignments
| Function | GPIO |
|---|---|
| LCD D0–D7 | 9, 46, 3, 8, 18, 17, 16, 15 |
| LCD WR | 47 |
| LCD DC | 0 |
| LCD RST | 4 |
| Backlight | 45 |
| Touch SDA | 6 |
| Touch SCL | 5 |
Features
Build Modes
| Mode | Flag | Description |
|---|---|---|
| Full Video | (default) | H.264 decode + downscale 800×480 → 480×320 + display (~3-5 fps) |
| Crop Video | --crop | Center-crop 480×320 from 800×480, no scaling (faster conversion) |
| Nav-Only | --nav-only | Text-only turn-by-turn navigation, no video decode. PNG turn arrows. |
Protocol Implementation
- Version handshake — negotiates protocol version with phone
- TLS 1.2 — mbedtls with hardware AES/SHA acceleration
- Service discovery — advertises 9 channels (control, input, sensor, video, 3× audio, AV input, navigation, media status)
- Video channel — accepts H.264 stream, sends VideoFocusIndication, acks frames
- Touch input — FT6336U → coordinate mapping → protobuf TouchEvent (PRESS/DRAG/RELEASE)
- Navigation — receives TurnInstruction + DistanceUpdate events (works with OsmAnd, not Google Maps*)
- Sensors — reports DRIVING_STATUS (unrestricted) and NIGHT_DATA
- Audio — stubs: accepts setup, discards audio data (no DAC/I2S output)
- mDNS — advertises _androidauto._tcp for network discovery
* Google Maps renders navigation entirely in the video stream and doesn't send turn-by-turn data over the navigation channel. OsmAnd uses the standard Android Auto Navigation API.
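The touch input item in the list above mentions a coordinate-mapping step. A minimal sketch of what that step could look like in full video mode, assuming the whole 800×480 frame is scaled onto the whole 480×320 panel with no rotation or calibration. The function name `panel_to_video` is hypothetical and not taken from the firmware.

```rust
/// Panel and video dimensions as listed in this README.
const PANEL_W: u32 = 480;
const PANEL_H: u32 = 320;
const VIDEO_W: u32 = 800;
const VIDEO_H: u32 = 480;

/// Hypothetical helper: map a touch point in panel pixels to video pixels.
/// Assumes a plain proportional mapping (full frame scaled to full panel).
pub fn panel_to_video(x: u32, y: u32) -> (u32, u32) {
    // Clamp first so a noisy reading cannot land outside the video frame.
    let x = x.min(PANEL_W - 1);
    let y = y.min(PANEL_H - 1);
    // Integer proportional scaling; multiply before dividing to keep precision.
    (x * VIDEO_W / PANEL_W, y * VIDEO_H / PANEL_H)
}
```

With these numbers the panel center (240, 160) maps to the video center (400, 240).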
Video Pipeline (Full Video Mode)
Phone sends H.264 800×480 @ 30fps
│
▼
TCP receive → TLS decrypt → protobuf parse → mpsc channel (depth 4)
│
▼
Decode thread: esp_h264 SW decoder (tinyh264-based, dual-task)
│ decode: ~100ms per 800×480 frame
▼
I420 → RGB565 strip conversion (dual-core: worker + main thread)
│ 40-line strips, bilinear downscale 800×480 → 480×320
▼
DMA double-buffered to LCD (38.4KB × 2 staging buffers in internal SRAM)
Performance: ~3-5 fps depending on scene complexity. The ESP32-S3's software H.264 decoder is the bottleneck — Espressif benchmarks show ~9 fps for 640×480 with dual-task mode. At 800×480 (Android Auto's minimum), expect ~8-9 fps raw decode throughput.
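To make the I420 → RGB565 stage of the pipeline concrete, here is a sketch of a per-pixel conversion using the common integer BT.601 approximation. It assumes limited-range YUV input and ignores the downscale step and any byte swapping the LCD may need; the function names are illustrative and the firmware's actual conversion may differ.

```rust
/// Convert one YUV sample to RGB565.
/// Assumption: BT.601 coefficients, limited range (Y 16..235, U/V 16..240).
pub fn yuv_to_rgb565(y: u8, u: u8, v: u8) -> u16 {
    let c = y as i32 - 16;
    let d = u as i32 - 128;
    let e = v as i32 - 128;
    // Fixed-point math (scaled by 256), then clamp to the 8-bit range.
    let clamp = |x: i32| x.clamp(0, 255) as u16;
    let r = clamp((298 * c + 409 * e + 128) >> 8);
    let g = clamp((298 * c - 100 * d - 208 * e + 128) >> 8);
    let b = clamp((298 * c + 516 * d + 128) >> 8);
    // Pack as 5 bits red, 6 bits green, 5 bits blue.
    ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
}

/// Convert one row of an I420 frame without scaling.
/// In I420 each chroma sample covers two horizontal pixels, so the
/// chroma rows must hold at least half as many samples as `y_row`.
pub fn convert_row(y_row: &[u8], u_row: &[u8], v_row: &[u8], out: &mut [u16]) {
    for (x, px) in out.iter_mut().enumerate().take(y_row.len()) {
        *px = yuv_to_rgb565(y_row[x], u_row[x / 2], v_row[x / 2]);
    }
}
```

Writing `out` directly into a DMA staging buffer is what would avoid an intermediate framebuffer; the sketch leaves buffer ownership to the caller.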
Video Pipeline (Crop Mode)
Same as above but the I420 → RGB565 conversion copies the center 480×320 pixels 1:1 instead of downscaling. Eliminates bilinear interpolation overhead.
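The center crop comes down to a fixed offset into the source frame. A small sketch of that arithmetic, assuming a tightly packed luma plane (stride equal to the frame width); both helper names are hypothetical.

```rust
/// Top-left corner of a centered dst_w x dst_h window inside a
/// src_w x src_h frame. Saturates to 0 if the window is larger.
pub fn crop_origin(src_w: usize, src_h: usize, dst_w: usize, dst_h: usize) -> (usize, usize) {
    (src_w.saturating_sub(dst_w) / 2, src_h.saturating_sub(dst_h) / 2)
}

/// Borrow the cropped part of one luma row.
/// Assumption: `plane` is tightly packed, so the stride equals `src_w`.
pub fn cropped_luma_row(plane: &[u8], src_w: usize, row: usize, x0: usize, dst_w: usize) -> &[u8] {
    let start = row * src_w + x0;
    &plane[start..start + dst_w]
}
```

For an 800×480 source and a 480×320 window the origin is (160, 80). Touch coordinates in this mode would presumably need the same offset added before being sent to the phone.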
Nav-Only Mode
No H.264 decoder. Receives navigation events via the AA navigation channel and renders:
- Turn maneuver + direction (text)
- Street name
- Distance to next turn
- ETA
- PNG turn arrow image (decoded via miniz_oxide, scaled to 64×64)
Uses strip-based LCD rendering with bitmap font (5×7 base, scalable).
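A scalable 5×7 bitmap font can be drawn by expanding each font pixel into a scale × scale block. The sketch below shows one way to do that into an RGB565 buffer; the glyph encoding (seven bytes, low five bits per row) and the sample glyph are assumptions for illustration, not the firmware's font data.

```rust
/// One 5x7 glyph: seven rows, low 5 bits of each byte are pixels
/// (bit 4 is the leftmost pixel). This encoding is an assumption.
pub type Glyph = [u8; 7];

/// Illustrative glyph for the letter 'T'.
pub const GLYPH_T: Glyph = [0b11111, 0b00100, 0b00100, 0b00100, 0b00100, 0b00100, 0b00100];

/// Draw `glyph` into an RGB565 buffer that is `buf_w` pixels wide,
/// expanding every font pixel into a `scale` x `scale` block.
pub fn draw_glyph(
    buf: &mut [u16],
    buf_w: usize,
    origin: (usize, usize),
    glyph: &Glyph,
    scale: usize,
    color: u16,
) {
    let (x0, y0) = origin;
    for (row, bits) in glyph.iter().enumerate() {
        for col in 0..5usize {
            if (*bits >> (4 - col)) & 1 == 0 {
                continue; // background pixel, leave the buffer untouched
            }
            for dy in 0..scale {
                for dx in 0..scale {
                    let x = x0 + col * scale + dx;
                    let y = y0 + row * scale + dy;
                    // Skip pixels that fall outside the buffer.
                    if x < buf_w {
                        if let Some(px) = buf.get_mut(y * buf_w + x) {
                            *px = color;
                        }
                    }
                }
            }
        }
    }
}
```

With strip-based rendering, `buf` would be one strip rather than a full frame, and `origin` would be given relative to the top of that strip.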
Building
Prerequisites
- Podman (or Docker; change "sudo podman" to "docker" in build.sh)
- USB serial access to the WT32-SC01 Plus (/dev/ttyACM0 or /dev/ttyUSB0)
- espflash for flashing + monitoring
The build uses the official espressif/idf-rust:all_latest container image which includes:
- ESP-IDF v5.5.1
- Rust toolchain for Xtensa (esp channel)
- All ESP32-S3 build tools
Build Commands
# Full video mode (default)
./build.sh
# Crop video mode (faster conversion, cropped view)
./build.sh --crop
# Nav-only mode (no video, turn-by-turn text only)
./build.sh --nav-only
# Build without flashing
./build.sh --build-only
# Combine flags
./build.sh --crop --build-only
Manual Build (without container)
# Requires esp-idf-sys toolchain configured
cargo build --release # full video
cargo build --release --features crop-video # crop mode
cargo build --release --features nav-only # nav-only
Flashing
# Via build script (prompts after build)
./build.sh
# Manual flash + monitor
espflash flash target/xtensa-esp32s3-espidf/release/esp32-android-auto-nav --monitor
# Monitor only (after flashing)
espflash monitor --port /dev/ttyACM0
Connecting a Phone
- Build and flash the firmware to the WT32-SC01 Plus
- On the phone, join the WiFi network:
  - SSID: ESP32-AA-HU
  - Password: androidauto123
- Open Android Auto on the phone:
  - Go to Android Auto settings → enable Developer mode (tap version 10×)
  - Developer Settings → Start head unit server
- The ESP32 scans DHCP client IPs on port 5277 and connects automatically
- Alternatively, the ESP32 also listens on port 5277 for incoming connections
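The outbound connection attempt described in the last two steps can be sketched with the standard library alone. This assumes the list of DHCP client addresses is already available from the AP's lease table (how the firmware obtains it is not shown here); `connect_first` is a hypothetical name.

```rust
use std::net::{Ipv4Addr, SocketAddr, TcpStream};
use std::time::Duration;

/// Port the phone's head unit server listens on, as stated above.
pub const AA_PORT: u16 = 5277;

/// Try each candidate address in order and return the first stream
/// that connects within `timeout`. Returns None if none answer.
pub fn connect_first(
    candidates: &[Ipv4Addr],
    port: u16,
    timeout: Duration,
) -> Option<TcpStream> {
    candidates.iter().find_map(|ip| {
        let addr = SocketAddr::from((*ip, port));
        // A refused or timed-out connection just moves on to the next IP.
        TcpStream::connect_timeout(&addr, timeout).ok()
    })
}
```

A short timeout keeps the scan quick; with a single-client AP the candidate list should only ever hold one address.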
Connection Flow
ESP32 boots → WiFi AP starts → mDNS advertised → listening on :5277
Phone joins WiFi → ESP32 connects to phone:5277 (or phone connects to ESP32:5277)
→ Version handshake → TLS negotiation → Service discovery
→ Video setup → VideoFocusIndication(FOCUSED) → Phone starts streaming
→ Touch events sent back to phone → Video frames displayed
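The flow above is strictly sequential, so it can be modeled as a small state machine. The enum below is an illustrative sketch only: the phase names are taken from the diagram, but the type and its `next` method are hypothetical and do not mirror session.rs.

```rust
/// Session phases in the order shown in the connection flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Listening,
    VersionHandshake,
    TlsNegotiation,
    ServiceDiscovery,
    VideoSetup,
    Streaming,
}

impl Phase {
    /// Advance to the following phase; `Streaming` is terminal
    /// until the session ends and the cycle restarts.
    pub fn next(self) -> Phase {
        match self {
            Phase::Listening => Phase::VersionHandshake,
            Phase::VersionHandshake => Phase::TlsNegotiation,
            Phase::TlsNegotiation => Phase::ServiceDiscovery,
            Phase::ServiceDiscovery => Phase::VideoSetup,
            Phase::VideoSetup | Phase::Streaming => Phase::Streaming,
        }
    }
}
```

Keeping the phase explicit makes it easy to reject messages that arrive out of order, for example video data before service discovery has finished.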
4G Internet While Connected
The ESP32's DHCP server is configured to not advertise a gateway or DNS, so Android keeps using mobile data for internet while connected to the ESP32's WiFi for Android Auto.
For best results, on the phone enable: Developer Options → Mobile data always active
Project Structure
src/
├── main.rs # Entry point, thread spawning, video decode/display loop,
│ # touch polling, WiFi AP, connection cycle
├── session.rs # Android Auto protocol session (message loop, dispatch)
├── frame.rs # Wire protocol: frame read/write, TLS state (mbedtls)
├── channels.rs # Channel descriptors, AV message parsing, video/audio/sensor frames
├── control.rs # Control channel messages (version, TLS, ping, auth, shutdown)
├── common.rs # Common channel messages (channel open request/response)
├── decoder.rs # H.264 SW decoder (esp_h264 FFI), I420→RGB565 conversion
├── display.rs # ST7796 LCD driver (I80 bus, DMA strip rendering)
├── touch.rs # FT6336U capacitive touch driver (I2C)
├── navigation.rs # Navigation event parsing (TurnInstruction, DistanceUpdate)
├── config.rs # Head unit + WiFi configuration
├── cert.rs # TLS certificate for Android Auto authentication
├── mdns.rs # mDNS service advertisement (_androidauto._tcp)
├── bluetooth.rs # BT protocol definitions (unused — ESP32-S3 has no BT Classic)
└── esp_h264_bindings.h # C header for esp_h264 FFI bindgen
protobuf/
├── Wifi.proto # Android Auto WiFi protocol messages
└── Bluetooth.proto # Android Auto BT protocol messages (reference only)
build.sh # Container-based build script (Podman)
build.rs # Build script: protobuf codegen + esp_h264 bindgen
Cargo.toml # Rust dependencies + feature flags
sdkconfig.defaults # ESP-IDF configuration (CPU, PSRAM, WiFi, TLS, H.264, etc.)
partitions.csv # Flash partition table (4MB app partition)
espflash.toml # Flash tool configuration
rust-toolchain.toml # Xtensa Rust toolchain (esp channel)
idf_component.yml # ESP-IDF component: espressif/esp_h264 v1.3.0
Architecture Details
Threading Model
| Thread | Core | Stack | Purpose |
|---|---|---|---|
| Main | 0 | 16KB | WiFi AP, TCP listener, connection cycle, session protocol |
| decode-display | 0/1 | 16KB | H.264 decode + strip conversion + DMA to LCD |
| converter | 1 | 4KB | Dual-core strip helper (scale mode only) |
| touch-poll | any | 4KB | FT6336U I2C polling @ 60Hz |
| nav-ui | any | 8-16KB | Navigation event logging (video mode) or LCD rendering (nav-only) |
Memory Layout
| Region | Size | Usage |
|---|---|---|
| Internal SRAM | ~416KB usable | DMA buffers (38.4KB×2), thread stacks, WiFi, mbedtls, FreeRTOS |
| PSRAM | 2MB | H.264 decoder buffers (~576KB), LWIP buffers, large allocations |
| Flash | 16MB | Firmware (~4MB partition), NVS, PHY calibration |
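The buffer sizes in the table follow directly from the display geometry. A short worked calculation, using only numbers stated in this README (480×320 panel, 40-line strips, RGB565, 800×480 I420 video); the constant and function names are illustrative.

```rust
pub const LCD_W: usize = 480;
pub const LCD_H: usize = 320;
pub const STRIP_LINES: usize = 40;
pub const BYTES_PER_PIXEL: usize = 2; // RGB565

/// One DMA staging strip: 480 * 40 * 2 = 38,400 bytes (38.4KB).
pub const fn strip_bytes() -> usize {
    LCD_W * STRIP_LINES * BYTES_PER_PIXEL
}

/// A full RGB565 framebuffer: 480 * 320 * 2 = 307,200 bytes (300 KiB).
pub const fn full_frame_bytes() -> usize {
    LCD_W * LCD_H * BYTES_PER_PIXEL
}

/// Number of strips needed to cover the panel: 320 / 40 = 8.
pub const fn strips_per_frame() -> usize {
    LCD_H / STRIP_LINES
}

/// One I420 frame: a full-size Y plane plus two quarter-size chroma planes.
pub const fn i420_frame_bytes(w: usize, h: usize) -> usize {
    w * h * 3 / 2
}
```

Two staging strips cost 76,800 bytes of internal SRAM, a quarter of what a single full framebuffer would need. One 800×480 I420 frame is 576,000 bytes, which lines up with the ~576KB decoder buffer figure in the table, though the decoder's actual allocation pattern is not documented here.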
Android Auto Protocol
The implementation follows the Android Auto WiFi protocol:
- Transport: TCP on port 5277, then upgraded to TLS 1.2
- Framing: 4-byte header (channel ID, flags, length) + payload
- Channels: Multiplexed over single TCP connection, each with a numeric ID
- Messages: Protobuf-encoded, prefixed with 2-byte message type
- Video: H.264 baseline profile, 800×480 @ 30fps, requires periodic ack
- Touch: Timestamped (µs precision), mapped from display coords to AA video coords
- Navigation: Protobuf TurnInstruction + DistanceUpdate events
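The framing and message items above can be illustrated with a small parser. The field widths and byte order used here (1-byte channel ID, 1-byte flags, 2-byte big-endian length, 2-byte big-endian message type) are assumptions consistent with the 4-byte and 2-byte sizes stated above; frame.rs is the authoritative layout, and the type names are hypothetical.

```rust
/// Assumed layout of the 4-byte frame header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrameHeader {
    pub channel: u8,
    pub flags: u8,
    pub len: u16,
}

impl FrameHeader {
    /// Parse a header from the first four bytes of a frame.
    pub fn parse(b: [u8; 4]) -> Self {
        FrameHeader {
            channel: b[0],
            flags: b[1],
            len: u16::from_be_bytes([b[2], b[3]]),
        }
    }

    /// Serialize the header back to its wire form.
    pub fn encode(self) -> [u8; 4] {
        let [hi, lo] = self.len.to_be_bytes();
        [self.channel, self.flags, hi, lo]
    }
}

/// Split a decrypted payload into its 2-byte message type and the
/// protobuf body that follows. Returns None if the payload is too short.
pub fn split_message(payload: &[u8]) -> Option<(u16, &[u8])> {
    if payload.len() < 2 {
        return None;
    }
    let msg_type = u16::from_be_bytes([payload[0], payload[1]]);
    Some((msg_type, &payload[2..]))
}
```

Dispatch would then key on `(header.channel, msg_type)` to pick the protobuf type to decode.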
Key Design Decisions
- Strip-based rendering: 40-line strips (38.4KB each) instead of full-frame buffers. Allows DMA double-buffering with only 76.8KB of internal SRAM instead of 300KB.
- No intermediate framebuffer: I420→RGB565 conversion writes directly into DMA staging buffers. Zero-copy from decode to display.
- Drain-and-skip: When frames queue up, older frames are discarded without decoding. Only the latest frame is decoded and displayed. This prevents the decoder from falling behind.
- Always FOCUSED: The head unit always reports VideoFocusIndication(FOCUSED) to the phone. Reporting UNFOCUSED causes the phone to stop sending navigation data too.
- Unsolicited focus kick: After video setup, an unsolicited VideoFocusIndication with unrequested=true is sent to prompt the phone to start streaming. Without this, the phone sends VideoFocusRequest but never StartIndication.
- Non-fatal video acks: If ack writes fail (TCP buffer full, etc.), the error is logged but doesn't kill the session. The phone tolerates missed acks.
- DHCP without gateway/DNS: Prevents Android from switching internet to WiFi.
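The drain-and-skip decision in the list above maps naturally onto a bounded `std::sync::mpsc` channel. The sketch below blocks for one frame, then empties the queue and keeps only the newest; `recv_latest` is a hypothetical name, and the sketch does not address H.264 inter-frame dependencies, which the firmware must handle separately.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Wait for one item, then drain everything else that is queued and
/// keep only the newest. Returns the item to process and the number
/// of older items that were skipped, or None once the sender is gone
/// and the queue is empty.
pub fn recv_latest<T>(rx: &Receiver<T>) -> Option<(T, usize)> {
    // Block until at least one frame is available.
    let mut latest = rx.recv().ok()?;
    let mut skipped = 0;
    loop {
        match rx.try_recv() {
            Ok(newer) => {
                // Drop the older frame without decoding it.
                latest = newer;
                skipped += 1;
            }
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    Some((latest, skipped))
}
```

With a channel depth of 4, at most three frames are discarded per call, which bounds both latency and the memory held by queued frames.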
Configuration
WiFi Settings
Edit src/config.rs:
Self {
ssid: "ESP32-AA-HU".into(),
password: "androidauto123".into(),
listen_port: 5277,
}
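When changing these values, keep in mind that WPA2 passphrases must be 8 to 63 characters and SSIDs at most 32 bytes. The sketch below wraps the three fields shown above in a hypothetical `WifiConfig` struct with a `validate` check; the struct name and the validation are illustrative and may not match what config.rs actually defines.

```rust
/// Hypothetical container for the three settings shown above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WifiConfig {
    pub ssid: String,
    pub password: String,
    pub listen_port: u16,
}

impl Default for WifiConfig {
    fn default() -> Self {
        Self {
            ssid: "ESP32-AA-HU".into(),
            password: "androidauto123".into(),
            listen_port: 5277,
        }
    }
}

impl WifiConfig {
    /// Check the limits imposed by 802.11 and WPA2-PSK.
    pub fn validate(&self) -> Result<(), String> {
        if self.ssid.is_empty() || self.ssid.len() > 32 {
            return Err("SSID must be 1 to 32 bytes".into());
        }
        if !(8..=63).contains(&self.password.len()) {
            return Err("WPA2 passphrase must be 8 to 63 characters".into());
        }
        if self.listen_port == 0 {
            return Err("listen port must be non-zero".into());
        }
        Ok(())
    }
}
```

Calling `validate()` at boot would surface a bad passphrase as a clear log message instead of a failed AP start.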
sdkconfig Tuning
Key settings in sdkconfig.defaults:
| Setting | Value | Purpose |
|---|---|---|
| ESP_DEFAULT_CPU_FREQ_MHZ | 240 | Max CPU for decode performance |
| SPIRAM_SPEED_80M | y | Max PSRAM bandwidth |
| ESP32S3_DATA_CACHE_64KB | y | Maximize cache for PSRAM access |
| ESP_H264_DUAL_TASK | y | Dual-core H.264 decode |
| ESP_H264_DECODER_IRAM | y | Hot decoder code in IRAM (+22KB) |
| COMPILER_OPTIMIZATION_PERF | y | -O2 for ESP-IDF C code |
| MBEDTLS_HARDWARE_AES | y | Hardware AES acceleration |
| MBEDTLS_HARDWARE_SHA | y | Hardware SHA acceleration |
Limitations
- ~3-5 fps in video mode — the ESP32-S3 software H.264 decoder is the bottleneck. Real head units use dedicated video decoder hardware.
- No audio output — audio channels are accepted but data is discarded. Would need I2S + DAC/codec.
- No Bluetooth Classic — ESP32-S3 only has BLE. Phone must manually join WiFi and start head unit server in developer mode.
- Google Maps renders navigation in the video stream, not via the navigation channel. Use OsmAnd for turn-by-turn text in nav-only mode.
- 800×480 minimum — Android Auto protocol doesn't allow requesting lower than 480p resolution.
- Single WiFi client — AP is configured for max 1 connection.
Possible Improvements
- Raspberry Pi Zero 2W proxy — decode H.264 on RPi (hardware VideoCore decoder, ~1ms/frame), send pre-decoded RGB565 frames to ESP32 via SPI. Would achieve 30fps.
- Audio output — add I2S DAC for media/nav audio. The audio channel stubs are already in place.
- WiFi Direct (P2P) — proper AA wireless uses WiFi Direct, which doesn't disable the phone's cellular data. ESP-IDF supports WiFi P2P, but using it adds complexity.
- ESP32-P4 — has a hardware H.264 decoder (25fps @ 640×480, 31fps dual-task). Would be a significant upgrade from SW decode.
Dependencies
| Crate | Version | Purpose |
|---|---|---|
| esp-idf-svc | 0.52 | ESP-IDF high-level services (WiFi, NVS, mDNS) |
| esp-idf-hal | 0.46 | Hardware abstraction (I2C, GPIO) |
| esp-idf-sys | 0.37 | Raw ESP-IDF FFI bindings |
| protobuf | 3.7 | Protocol Buffers (AA protocol messages) |
| anyhow | 1.0 | Error handling |
| bitfield | 0.19 | Frame header bitfield parsing |
| miniz_oxide | 0.7 | PNG inflate (nav-only mode, optional) |
| espressif/esp_h264 | 1.3.0 | H.264 SW decoder (C component, tinyh264-based) |
License
LGPL-3.0-or-later