Anonymous View
Skip to content

Add secure, type-aware custom object serialization#154

Draft
andystaples wants to merge 2 commits into
microsoft:mainfrom
andystaples:andystaples/add-custom-type-serialization
Draft

Add secure, type-aware custom object serialization#154
andystaples wants to merge 2 commits into
microsoft:mainfrom
andystaples:andystaples/add-custom-type-serialization

Conversation

@andystaples

Copy link
Copy Markdown
Contributor

Summary

Reworks user-payload JSON serialization to be secure and type-aware while
keeping the on-the-wire format stable. Custom objects (dataclasses and
from_json()-capable types) now round-trip everywhere in the programming model,
and deserialization is driven by caller-supplied or annotation-discovered types
rather than by the payload itself.

Motivation

The previous codec (shared.to_json / from_json) had real problems:

  • Nested dataclasses didn't round-trip — only the top-level object got the
    AUTO_SERIALIZED marker, so nested dataclasses/namedtuples decoded
    inconsistently, and everything came back as a SimpleNamespace rather than the
    original type.
  • Payload-driven type selection is unsafe — extending the marker approach to
    reconstruct arbitrary classes named in the payload is classic insecure
    deserialization (OWASP A08). The secure model is type-directed decoding,
    where the destination type comes from the caller/annotations, never the wire.

This also aligns the SDK with the direction of azure-functions-durable and the
.NET DataConverter (plain JSON + caller-supplied target type).

What changed

Type-directed deserialization

  • call_activity, call_sub_orchestrator, and call_entity gain an optional
    return_type; wait_for_external_event gains data_type. When provided, the
    result/event is coerced to that type (dataclasses incl. nested / Optional /
    list fields, and from_json()-capable types). These also refine the static
    type
    of the returned task via @overload
    (e.g. call_activity(..., return_type=Foo) -> CompletableTask[Foo]).
  • New typed accessors on client.OrchestrationState: get_input(),
    get_output(), get_custom_status() — each takes an optional expected_type
    and is overloaded to return T | None. Raw serialized_* fields are retained.
  • Entity get_state(intended_type=...) now routes through the shared codec
    (dataclass + from_json support).

Annotation-based discovery (new durabletask/internal/type_discovery.py)

  • Inbound payloads (orchestrator / activity / entity-operation inputs) and
    call_activity results are reconstructed from the function's type annotations
    when no explicit type is supplied. Discovery is best-effort and
    conservative
    : builtins and unannotated/unknown types pass through unchanged,
    and a payload that fails to coerce falls back to the raw value.

Codec / behavior

  • to_json now emits plain JSON (no internal type marker). Objects exposing
    to_json() are serializable.
  • Serialization failures raise a TypeError that chains the original cause and
    names the offending type.
  • Back-compat: payloads produced by older SDK versions (carrying the legacy
    AUTO_SERIALIZED marker) still deserialize — into a SimpleNamespace when no
    type is supplied, or stripped and coerced when an expected_type is given — so
    in-flight orchestrations replay cleanly across the upgrade.

Behavior change to note

Decoding without a type now yields a plain dict/list (previously a
SimpleNamespace for marked payloads). Callers that want the original type
should pass return_type / data_type / expected_type, or rely on annotation
discovery. Documented under ## Unreleased in CHANGELOG.md.

Tests & validation

  • New unit suites: test_serialization.py, test_type_discovery.py,
    test_orchestration_state.py, plus additions to the entity/orchestration
    executor tests (codec round-trips, nested dataclasses, expected_type
    precedence, legacy-marker back-compat, error chaining, annotation discovery,
    static-type inference).
  • Full local gate: strict pyright (0 errors), flake8 (source + tests),
    pytest (non-e2e), and pymarkdown all green. durabletask-azuremanaged
    unit tests pass (it reuses the core client/worker).

Note

Opening as a draft for review. Static-type inference is refined on the task
object; the result = yield ctx.call_activity(...) pattern still yields Any
due to Python generator send-type limitations (would require an await-style API).

Rewrite the JSON codec in shared.py to emit plain JSON (no internal type marker) and add type-directed deserialization via an optional expected_type. Custom objects round-trip everywhere:

- call_activity/call_sub_orchestrator/call_entity gain return_type; wait_for_external_event gains data_type; these also refine the returned task's static type via overloads.

- Inbound payloads (orchestrator/activity/entity inputs) and call_activity results are reconstructed from function type annotations (new internal type_discovery module), best-effort and conservative.

- Entity get_state and new client OrchestrationState.get_input/get_output/get_custom_status accessors route through the shared codec.

- Fix nested-dataclass round-trip bug; chain serialization errors with the original cause. Legacy AUTO_SERIALIZED payloads still deserialize for in-flight replay.
…m-type-serialization

# Conflicts:
#	CHANGELOG.md
@berndverst

Copy link
Copy Markdown
Member

Expanding on the DataConverter-style abstraction idea from the review — and what it would look like with pydantic specifically.

The gap today

Serialization is currently hardcoded to json.dumps / json.loads with per-type, duck-typed to_json() / from_json() hooks. That works, but it bakes the policy into the SDK and gives users no single place to control encoding: custom datetime/Decimal formats, enum-by-name, set/bytes handling, or whole model frameworks (pydantic, attrs, msgspec) all have to bend to fit the hook convention. .NET solves this with a pluggable DataConverter that both the worker and client consume, defaulting to JsonDataConverter (System.Text.Json, IncludeFields = true). The destination type is always supplied by the caller (Deserialize(data, targetType) / Deserialize<T>(data)) — exactly the secure, type-directed model this PR is moving toward.

A Python-equivalent shape

# durabletask/serialization.py (new public API)
from abc import ABC, abstractmethod
from typing import Any


class DataConverter(ABC):
    @abstractmethod
    def serialize(self, value: Any) -> str | None:
        ...

    @abstractmethod
    def deserialize(self, data: str | None, target_type: type | None = None) -> Any:
        ...

The default preserves today's behavior, so it's a non-breaking refactor:

class JsonDataConverter(DataConverter):
    def serialize(self, value):
        return None if value is None else shared.to_json(value)

    def deserialize(self, data, target_type=None):
        return None if data is None else shared.from_json(data, target_type)

Then wire it through the worker/client once (a constructor arg defaulting to JsonDataConverter()), and route every existing shared.to_json(...) / shared.from_json(..., expected_type) call site through self._converter. All the type-directed plumbing this PR already adds — return_type=, data_type=, and annotation discovery — keeps working unchanged; it just supplies target_type to the converter, which decides how the bytes become that type.

Pydantic example

A team using pydantic models could drop in a converter that gives full validation on the way in and pydantic's JSON encoding on the way out, while still falling back to stdlib JSON for plain dicts/dataclasses:

import pydantic

from durabletask.internal import shared
from durabletask.serialization import DataConverter


class PydanticDataConverter(DataConverter):
    """Round-trips pydantic models; falls back to stdlib JSON for everything else."""

    def serialize(self, value):
        if value is None:
            return None
        if isinstance(value, pydantic.BaseModel):
            return value.model_dump_json()            # pydantic v2
        return shared.to_json(value)                  # dataclasses, dicts, ...

    def deserialize(self, data, target_type=None):
        if data is None:
            return None
        if (
            isinstance(target_type, type)
            and issubclass(target_type, pydantic.BaseModel)
        ):
            return target_type.model_validate_json(data)   # pydantic v2, validates
        return shared.from_json(data, target_type)

Register it once on each end:

converter = PydanticDataConverter()
worker = TaskHubGrpcWorker(data_converter=converter)
client = TaskHubGrpcClient(data_converter=converter)

…and pydantic models round-trip everywhere through the same type-directed flow already in this PR:

class Order(pydantic.BaseModel):
    customer: str
    total: float


class Receipt(pydantic.BaseModel):
    confirmation_id: str


def orchestrator(ctx, order: Order):                       # discovery -> Order (validated)
    receipt = yield ctx.call_activity(charge, input=order, return_type=Receipt)
    return receipt.confirmation_id

Here charge's result is validated against Receipt (ValidationError on bad data, instead of a silent partial dataclass), and order is validated when the orchestrator/activity receives it — using pydantic's own coercion rules rather than the SDK's coerce_to_type.

Why this is worth considering over the duck-typed hooks

  • One documented extension point instead of a method-name convention. As noted in the line comments, to_json() / from_json() are ambiguous: the encode hook must return a structure (not a JSON string, despite the name), and the decode hook receives a parsed dict (not a JSON string, which is what many libraries' from_json expects). A DataConverter makes the contract explicit and avoids the collision.
  • No SDK dependency on pydantic/attrs/msgspec — users opt in by supplying a converter.
  • Validation and custom encoding become first-class (dates, decimals, enums, unions) rather than something each type has to re-implement in a hook.
  • Polyglot parity with the .NET mental model, which helps teams running both stacks.

This is complementary to the PR, not a blocker: the return_type / discovery / coerce_to_type work here is precisely what a converter would consume. It could land as a fast-follow that swaps the hardcoded shared.to_json / shared.from_json calls for a converter indirection, with JsonDataConverter as the default so nothing changes for existing users.

Note

One open question for the design: whether the converter is global (one per worker/client) like .NET, or can also be overridden per call (e.g. call_activity(..., data_converter=...)). .NET is global; per-call would add flexibility but more surface area.

return vars(o)
to_json_hook = getattr(o, "to_json", None)
if callable(to_json_hook):
return to_json_hook()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The to_json() hook name is a footgun here. We feed its return value straight to json.dumps(default=...), which expects a JSON-serializable object (dict/list/etc.) — but the name strongly implies it returns a JSON string. A type whose to_json() returns a string (the conventional meaning) gets double-encoded into a quoted JSON string. Consider naming the hook to_dict (with from_dict on the read side), or at least documenting loudly that it must return a structure, not a string.

"""``default`` hook for :func:`json.dumps` that emits plain JSON.

Called only for values the JSON encoder cannot natively serialize. Note that
namedtuples are handled natively by the encoder (serialized as JSON arrays)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth calling out that this is a real wire-format change for namedtuples, not just an internal detail. The old encoder emitted {field: value, ...} + marker and decoded to a SimpleNamespace; now a namedtuple serializes as a JSON array and decodes to a list. There's also no clean way to restore it — coerce_to_type([...], MyNamedTuple) calls MyNamedTuple([...]), which assigns the whole list to the first field. examples/human_interaction.py raises a namedtuple("Approval", ["approver"]) external event and then reads .approver, so it will break. The PR summary says the on-the-wire format stays stable, which isn't quite true for namedtuples — might be worth either preserving the dict form or documenting the change explicitly.

if isinstance(value, expected_type):
return value

from_json_hook = getattr(expected_type, "from_json", None)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from_json() hook contract is that it receives an already-parsed value (a dict), not a JSON string. That's the opposite of what many libraries mean by from_json (anything modeled on json.loads, or pydantic v1's parse_raw). A class defining from_json(cls, s: str) will silently get a dict and misbehave. Same naming concern as the to_json side — from_dict would be less ambiguous, or document the "receives a parsed value, not a string" contract explicitly.

if origin is typing.Union or origin is types.UnionType:
# Optional[T] / Union[...]: if the value already matches a member type,
# keep it; otherwise coerce to the first non-None member.
non_none = [a for a in args if a is not type(None)]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: for a non-Optional Union[A, B] where the value matches neither member by isinstance, this always coerces to A. Fine for the dominant Optional[T] case, but it can silently mis-coerce genuine unions. Consider leaving the value untouched when it matches no member rather than forcing the first non-None arg.

Comment thread durabletask/worker.py
# An explicit return_type takes precedence; otherwise, when an activity
# function reference is supplied, discover its return annotation.
if return_type is None and not isinstance(activity, str):
return_type = type_discovery.activity_output_type(activity)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where discovery silently opts an orchestration into strict result coercion. Combined with the result path having no fallback (see my comment on the taskCompleted handler), merely adding a return annotation to an activity changes runtime behavior: if a payload fails to coerce to the discovered type, the orchestration now throws during replay instead of returning the raw value. Input discovery falls back to the raw value on failure — the result side should probably match, at least for discovered (implicit) types. An explicit return_type=, where fail-fast is defensible, is a different story.

Comment thread durabletask/worker.py
if not ph.is_empty(event.taskCompleted.result):
result = shared.from_json(event.taskCompleted.result.value)
result = shared.from_json(
event.taskCompleted.result.value, activity_task._expected_type # pyright: ignore[reportPrivateUsage]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asymmetry to flag: the input-coercion paths (orchestrator/activity/entity inputs) wrap coerce_to_type in try/except and fall back to the raw value with a debug log, but the result paths — this taskCompleted handler plus sub-orchestration / external-event / entity-operation completion — call from_json(value, expected_type) with no fallback, so a coercion failure raises inside replay. Worth making the two sides consistent, or at least documenting that result coercion is strict while input coercion is best-effort.

return callable(getattr(cast(Any, annotation), "from_json", None))


@functools.lru_cache(maxsize=None)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lru_cache(maxsize=None) keyed by the function object is unbounded. For the normal case (module-level orchestrators/activities) the key set is small, but a worker that registers dynamically-created functions/closures would accumulate entries for the lifetime of the process. A bounded maxsize, or keying on something stable, would avoid a slow leak.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants