Features

AI Assistant

Hold to capture voice + screen, transcribe, ask the model, hear the answer, see the on-screen highlight.

The AI Assistant is push-to-talk. Hold the configured hotkey, speak your question, release. ApexDock:

  1. Captures audio from the microphone
  2. Snapshots the active display (if Screen Recording is granted)
  3. Transcribes the audio (Whisper / Deepgram / on-device)
  4. Sends the transcript + screenshot to a language model (Claude / GPT-4)
  5. Speaks the answer (system voice / ElevenLabs)
  6. Highlights the answer's referenced UI element on screen (when the model returns coordinates)
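
In code terms the turn is a straight pipeline. Here is a minimal sketch of that shape; every helper below is a hypothetical stand-in, not ApexDock's real API:

```swift
import Foundation
import CoreGraphics

// Hypothetical stand-ins for the six stages above (not ApexDock's real API).
struct Answer { var text: String; var target: CGRect? = nil }
func transcribe(_ audio: Data) async throws -> String { "" }                          // step 3: STT
func ask(_ prompt: String, image: Data?) async throws -> Answer { Answer(text: "") }  // step 4: LLM
func speak(_ text: String) async throws {}                                            // step 5: TTS
func highlight(_ rect: CGRect) {}                                                     // step 6: crosshair

// One push-to-talk turn: audio plus an optional screenshot in, spoken answer out.
func runTurn(audio: Data, screenshot: Data?) async throws {
    let transcript = try await transcribe(audio)
    // screenshot is nil when Screen Recording hasn't been granted,
    // so the model then sees only the transcript.
    let answer = try await ask(transcript, image: screenshot)
    try await speak(answer.text)
    if let rect = answer.target { highlight(rect) }
}
```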

Enabling

Settings → Assistant → Enable Assistant. The animated waveform appears on the bar; the push-to-talk hotkey starts firing.

Hotkey

Default: hold ⌃⌥ (no key — modifiers only). Customise in Settings → Assistant → Push-to-Talk Hotkey by toggling Control / Option / Shift / Command. The hotkey requires at least one modifier.

The interaction is press and hold — start listening on press, send on release. There's no "click to start, click to stop" mode.
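
A modifier-only chord can't be registered as a normal key equivalent; one plausible way to watch it (a sketch, not ApexDock's implementation, and it assumes the app is trusted for Accessibility) is a global flagsChanged monitor:

```swift
import AppKit

let chord: NSEvent.ModifierFlags = [.control, .option]  // the default ⌃⌥ chord
var isListening = false

// flagsChanged fires on every modifier press and release. Global monitors only
// see events headed for other apps; pair with a local monitor to also catch the
// chord while ApexDock itself is frontmost. Keep the token to remove it later.
let monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { event in
    let mods = event.modifierFlags.intersection(.deviceIndependentFlagsMask)
    if !isListening, mods == chord {
        isListening = true
        // press: start capturing audio
    } else if isListening, !mods.isSuperset(of: chord) {
        isListening = false
        // release: stop capturing and send the question
    }
}
```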

API keys

ApexDock doesn't ship its own model access — bring your own keys. Settings → Assistant → API Keys:

Key        | Used for
Anthropic  | Claude language models
OpenAI     | GPT-4 language models, Whisper transcription
ElevenLabs | High-quality voice synthesis

Keys are stored in the macOS Keychain (one item per provider: Anthropic, OpenAI, ElevenLabs) and are never written to plain-text files or preferences.
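
A generic-password item is the standard container for a secret like this; a minimal sketch (the service and account strings are illustrative, not necessarily what ApexDock uses):

```swift
import Foundation
import Security

// Store or replace one provider key as a generic-password Keychain item.
func saveAPIKey(_ key: String, account: String) -> OSStatus {
    let base: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrService as String: "ApexDock",  // illustrative service name
        kSecAttrAccount as String: account,     // e.g. "Anthropic"
    ]
    _ = SecItemDelete(base as CFDictionary)     // drop any existing item first
    var attributes = base
    attributes[kSecValueData as String] = Data(key.utf8)
    return SecItemAdd(attributes as CFDictionary, nil)
}
```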

Speech-to-text

Settings → Assistant → Speech-to-Text → Provider:

Provider                 | Notes
OpenAI Whisper           | Cloud, requires OpenAI key. Fast, accurate, supports 50+ languages.
Deepgram                 | Cloud, requires Deepgram key (paste it in the same panel). Lowest latency.
macOS Speech Recognition | On-device, no key. Lower accuracy but works offline.

If the configured cloud provider has no key, ApexDock falls back to the on-device path automatically (still requires Speech Recognition permission).
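
The fallback rule is simple enough to state as code (a sketch with hypothetical types):

```swift
enum STTProvider { case openAIWhisper, deepgram, onDevice }

// Cloud when its key is present, otherwise the on-device recognizer
// (which still requires the Speech Recognition permission).
func effectiveProvider(configured: STTProvider, hasKey: Bool) -> STTProvider {
    switch configured {
    case .onDevice:
        return .onDevice
    case .openAIWhisper, .deepgram:
        return hasKey ? configured : .onDevice
    }
}
```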

Language model

Settings → Assistant → Language Model:

  • Provider: Anthropic, OpenAI, or Codex via ChatGPT.
  • Model: dropdown of available models for that provider (Claude Opus / Sonnet / Haiku, the GPT-5.5 family, or whatever your Codex install reports). Switching providers snaps the model to that provider's default.
  • Reasoning Effort (Codex only): Low / Medium / High. Defaults to the model's reported default.

The Codex via ChatGPT provider doesn't take an API key — ApexDock shells out to your local codex CLI, which must be installed and signed in with ChatGPT. Available models and reasoning efforts come from Codex's app-server. Voice-mode queries auto-filter to image-capable models so screenshots can travel with the request.

Text-to-speech

Settings → Assistant → Text-to-Speech → Provider:

Provider     | Notes
System Voice | Free. Uses macOS's built-in voices (the Siri voice when available).
ElevenLabs   | High-quality, requires ElevenLabs key. The voice picker lists every voice in your account.

What happens during a session

  • The bar's standard chrome dims to ~50% opacity.
  • A waveform animation fills the bar showing live audio levels.
  • A small "listening…" pill appears under the assistant button.
  • On release, the bar returns to normal, the pill switches to "thinking…", then "speaking…".

If the model returns a coordinate hint (e.g. "the file size is in the bottom-right"), ApexDock draws a brief crosshair animation at that screen location.
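
One plausible way to draw such a highlight (a sketch, not ApexDock's actual overlay code; it assumes AppKit's bottom-left screen coordinates) is a transparent, click-through borderless window that fades out:

```swift
import AppKit

// Flash a ring at a screen point inside a click-through overlay window.
func showCrosshair(at point: CGPoint, diameter: CGFloat = 60) {
    let frame = NSRect(x: point.x - diameter / 2, y: point.y - diameter / 2,
                       width: diameter, height: diameter)
    let window = NSWindow(contentRect: frame, styleMask: .borderless,
                          backing: .buffered, defer: false)
    window.isOpaque = false
    window.backgroundColor = .clear
    window.level = .screenSaver        // float above normal app windows
    window.ignoresMouseEvents = true   // never intercept clicks

    let ring = NSView(frame: NSRect(origin: .zero, size: frame.size))
    ring.wantsLayer = true
    ring.layer?.borderColor = NSColor.systemYellow.cgColor
    ring.layer?.borderWidth = 3
    ring.layer?.cornerRadius = diameter / 2
    window.contentView = ring
    window.orderFrontRegardless()

    // Fade over ~1s, matching the "fades after about a second" behavior.
    NSAnimationContext.runAnimationGroup({ context in
        context.duration = 1.0
        window.animator().alphaValue = 0
    }, completionHandler: { window.orderOut(nil) })
}
```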

Computer Use

The assistant can drive other apps directly when you opt in. Settings → Assistant → Enable Computer Use turns on a bundled helper (ApexDockComputerUse.app) that lives inside the host app at Contents/Resources/Helpers/. The helper is a separate signed app with its own TCC identity, so its Accessibility, Screen Recording, and Automation grants are independent from ApexDock's.

When enabled, the assistant exposes these local tools to the model:

  • computer_use_list_apps, computer_use_get_app_state — discover apps and read their accessibility tree.
  • computer_use_click, computer_use_secondary_action, computer_use_scroll, computer_use_drag — pointer actions, AX-first with a CGEvent fallback scoped to the target app's PID.
  • computer_use_type_text, computer_use_press_key, computer_use_set_value — keyboard / form input.

Defaults and guarantees:

  • Computer Use is off by default. While disabled, the tool registry hides every computer_use_* schema and refuses stale calls before the helper can launch.
  • The helper is launched on demand and self-terminates after about three idle minutes, so it doesn't run all the time.
  • The helper's Unix socket only accepts the signed ApexDock app (com.gacntsoftware.apexdock) with the shared app-group entitlement; nothing else can connect (see the sketch after this list).
  • Element indices returned from get_app_state are ephemeral — re-fetch state after any UI-changing action.
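
For a sense of how such a socket check can work, here is an assumed sketch (not the helper's real code) that resolves the peer's PID and validates it against a code-signing requirement:

```swift
import Darwin
import Security

// Accept a connected Unix-socket fd only when the peer process satisfies
// the ApexDock signing requirement.
func peerIsApexDock(fd: Int32) -> Bool {
    // Who is on the other end of this socket?
    var pid: pid_t = 0
    var len = socklen_t(MemoryLayout<pid_t>.size)
    guard getsockopt(fd, SOL_LOCAL, LOCAL_PEERPID, &pid, &len) == 0 else { return false }

    // Resolve the PID to a SecCode and check it against a requirement string.
    var code: SecCode?
    let attrs: [String: Any] = [kSecGuestAttributePid as String: pid]
    guard SecCodeCopyGuestWithAttributes(nil, attrs as CFDictionary, [], &code) == errSecSuccess,
          let peer = code else { return false }

    var requirement: SecRequirement?
    let text = "identifier \"com.gacntsoftware.apexdock\" and anchor apple generic" as CFString
    guard SecRequirementCreateWithString(text, [], &requirement) == errSecSuccess else { return false }
    return SecCodeCheckValidity(peer, [], requirement) == errSecSuccess
}
```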

For the full security boundary and assistant usage contract, see the top-level Computer Use guide.

Privacy

  • Audio bytes never persist to disk. Recording goes straight into the STT provider's request body.
  • Screenshots are sent to the language model only when Screen Recording is granted. Otherwise the model sees only the transcript.
  • API keys are stored in Keychain. ApexDock has no telemetry — every request goes from your machine directly to the provider.

Performance

  • Streaming text-to-speech. Voice answers play sentence-by-sentence as the model generates; both the system voice and ElevenLabs start speaking the first sentence before the full response finishes (see the sketch after this list).
  • Skipped screenshots. Clearly non-visual queries ("what's on my calendar?", music control, system settings) skip the screen capture step automatically — saves ~300–500ms and tokens per turn. Ambiguous prompts still capture.
  • Prompt cache. The Anthropic provider marks the system prompt and tool list as cacheable, so back-to-back turns hit the 5-minute ephemeral cache instead of re-sending the prefix.
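
A sentence-splitting sketch of that streaming pattern, shown with the system synthesizer (the token plumbing is hypothetical):

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()
var pending = ""

// Feed streamed model tokens in; flush whole sentences to the synthesizer
// so playback starts before the response is complete.
func receive(token: String) {
    pending += token
    while let boundary = pending.rangeOfCharacter(from: CharacterSet(charactersIn: ".!?")) {
        let sentence = String(pending[..<boundary.upperBound])
        pending.removeSubrange(..<boundary.upperBound)
        synthesizer.speak(AVSpeechUtterance(string: sentence))  // utterances queue in order
    }
}
```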

Notes

  • Audio streams to your transcription provider as you speak, so the model sees the prompt within a beat of release. There's no fixed-length recording window.
  • The model can return both an answer and a target rectangle on screen, which drives the on-screen highlight.
  • The voice answer and the on-screen crosshair play in parallel. The crosshair fades after about a second.
  • Releasing the hotkey before you actually said anything cancels the whole pipeline. No model call, no cost.