Skip to Content
PostgreSQL ExtensionData Type Conversions

PostgreSQL Extension Data Type Conversions

This page documents the current conversion behavior in pg_kalam.

There are three distinct layers to keep straight:

  1. PostgreSQL-side DDL authoring and logical schema mapping.
  2. PostgreSQL datum conversion on writes.
  3. Arrow batch materialization on reads.

The important constraint is that the canonical KalamDB logical type system is broader than the extension’s dedicated PostgreSQL runtime branches. The logical mapping exists for more types than the extension can currently materialize natively on the PostgreSQL side.

If this page and the code diverge, the code wins.

1. PostgreSQL Authoring to KalamDB Logical Types

These are the common PostgreSQL-facing type names that map into KalamDB’s logical schema model today.

PostgreSQL-side authoring typeKalamDB logical typeArrow or storage representationNotes
BOOLEAN, BOOLBooleanBooleanDedicated native read and write branches exist
SMALLINT, INT2SmallIntInt16Dedicated native read and write branches exist
INTEGER, INT, INT4, MEDIUMINTIntInt32Dedicated native read and write branches exist
BIGINT, INT8BigIntInt64Dedicated native read and write branches exist
REAL, FLOAT, FLOAT4FloatFloat32Dedicated native read and write branches exist
DOUBLE PRECISION, DOUBLE, FLOAT8, FLOAT64DoubleFloat64Dedicated native read and write branches exist
TEXT, VARCHAR, CHAR, STRING, NVARCHARTextUtf8TEXT and VARCHAR have dedicated write fast paths; CHAR currently uses the generic text-output fallback on writes
BYTEA, BYTES, BINARY, VARBINARY, BLOBBytesBinaryRead path has a dedicated BYTEA materializer; write path currently falls back to PostgreSQL text output
JSON, JSONBJsonUtf8 containing JSON textJSON and JSONB are first-class in the extension on both read and write
DATEDateDate32Read path has a dedicated DATE materializer; write path currently falls back to PostgreSQL text output
TIMETimeTime64(Microsecond)Logical mapping exists, but read and write still rely on generic conversion fallback
TIMESTAMPTimestampTimestamp(Microsecond, None)Read path has a dedicated timestamp materializer; write path currently falls back to PostgreSQL text output
NUMERIC(p,s), DECIMAL(p,s)Decimal { precision, scale }Decimal128(p, s)Logical mapping exists; PostgreSQL numeric materialization is not yet a dedicated branch
UUIDUuidFixedSizeBinary(16)Logical mapping exists; PostgreSQL UUID materialization is not yet a dedicated branch
FILEFileUtf8 containing serialized FileRef JSONThe remote KalamDB column is FILE; the local PostgreSQL column is rewritten to JSONB on the USING kalamdb path

Two special cases need separate callouts:

  • KalamDataType::DateTime is the timezone-aware logical type and stores as Timestamp(Microsecond, Some("UTC")). Its reverse PostgreSQL type name is TIMESTAMPTZ, but PostgreSQL-side timezone-aware authoring spellings should be validated against your exact build.
  • KalamDataType::Embedding(n) is the logical embedding type. Its reverse PostgreSQL type name is VECTOR(n), but the KalamDB logical DDL spelling is still EMBEDDING(n), so PostgreSQL-side authoring should be validated carefully instead of assuming full shorthand parity.

2. KalamDB Logical Types Back to PostgreSQL

When the extension needs to generate or mirror PostgreSQL column type names from a KalamDB logical schema, it uses the following canonical mapping.

KalamDB logical typeGenerated or mirrored PostgreSQL type name
BooleanBOOLEAN
SmallIntSMALLINT
IntINTEGER
BigIntBIGINT
FloatREAL
DoubleDOUBLE PRECISION
TextTEXT
BytesBYTEA
DateDATE
TimeTIME
TimestampTIMESTAMP
DateTimeTIMESTAMPTZ
UuidUUID
JsonJSONB
FileJSONB
Decimal { precision, scale }NUMERIC(p, s)
Embedding(n)VECTOR(n)

Current local-surface rules modify that canonical reverse mapping in a few places:

  • On CREATE TABLE ... USING kalamdb, most local PostgreSQL columns keep the type you declared.
  • JSON stays JSON locally if you declared JSON.
  • JSONB stays JSONB locally if you declared JSONB.
  • FILE is the special case: it is always rewritten locally to JSONB.

If you mirror a remote FILE column manually with CREATE FOREIGN TABLE, declare the local PostgreSQL column as JSONB.

3. DDL Normalizations Before Remote Create

The DDL hook performs a small set of PostgreSQL-specific normalizations before sending the user-column definition to KalamDB.

PostgreSQL authoring formRemote KalamDB DDL result
SERIALInteger column plus DEFAULT SNOWFLAKE_ID()
BIGSERIALBigint column plus DEFAULT SNOWFLAKE_ID()
SMALLSERIALSmallint column plus DEFAULT SNOWFLAKE_ID()
GENERATED ... AS IDENTITYExplicit integer type plus DEFAULT SNOWFLAKE_ID()

This normalization happens before the remote CREATE TABLE is sent.

4. Runtime Read and Write Behavior Today

The table above describes the logical schema contract. The next table describes the actual extension implementation status at runtime.

Type familyPostgreSQL -> KalamDB write pathKalamDB -> PostgreSQL read pathCurrent status
BOOLEAN, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE PRECISIONDedicated native Datum -> ScalarValue branchesDedicated native Arrow array branchesBest-covered native round-trip surface
TEXT, VARCHARDedicated text fast pathDedicated Utf8 and LargeUtf8 branchesBest-covered string surface
CHAR, CHARACTERGeneric PostgreSQL type-output fallback to ScalarValue::Utf8Reads through the normal Utf8 text path once the backend returns textLogical mapping exists, but writes are not a dedicated branch
JSON, JSONBDedicated JSON and JSONB normalization to JSON textDedicated PostgreSQL type-input path for JSON and JSONBBest-covered document surface
FILEPostgreSQL sees the FileRef JSON representation, usually as JSONB; binary upload itself typically happens through KalamLink or another file-aware clientDedicated JSONB materialization from UTF-8 FileRef JSONStable local representation, special-case local JSONB surface
BYTEAGeneric PostgreSQL type-output fallback to ScalarValue::Utf8Dedicated Arrow Binary -> bytea branchRead path is native; write path relies on generic coercion
DATEGeneric PostgreSQL type-output fallback to ScalarValue::Utf8Dedicated Arrow Date32 -> DATE branchRead path is native; write path relies on generic coercion
TIMESTAMP and generated TIMESTAMPTZ surfaceGeneric PostgreSQL type-output fallback to ScalarValue::Utf8Dedicated Arrow timestamp branches reuse PostgreSQL’s internal timestamp representationRead path is native; write path relies on generic coercion
TIMEGeneric PostgreSQL type-output fallback to ScalarValue::Utf8Generic fallback string renderingNot yet a dedicated PostgreSQL round-trip branch
UUIDGeneric PostgreSQL type-output fallback to ScalarValue::Utf8No dedicated UUID materialization branch in arrow_to_pgLogical mapping exists; validate against your build before treating UUID as a stable PG round-trip surface
NUMERIC(p,s)Generic PostgreSQL type-output fallback to ScalarValue::Utf8No dedicated numeric materialization branch in arrow_to_pgLogical mapping exists; native PostgreSQL numeric round-trip is not implemented yet
VECTOR(n) or Embedding(n)PostgreSQL-side authoring depends on local type availability and current DDL spellingNo dedicated vector materialization branch in arrow_to_pgReverse mapping exists, but PostgreSQL vector round-trip is not yet a native branch

5. Under the Hood

Write path

For INSERT, UPDATE, and DELETE, the extension does the following:

  1. PostgreSQL exposes a TupleTableSlot to the FDW modify callbacks.
  2. slot_to_row() walks the local columns, skipping _seq and _deleted.
  3. _userid is captured separately and only used as an explicit tenant override for INSERT.
  4. pg_to_kalam::datum_to_scalar() converts each remaining datum into a DataFusion ScalarValue.
  5. Supported primitive OIDs use dedicated branches.
  6. Everything else currently falls back to PostgreSQL’s type output function and becomes ScalarValue::Utf8.
  7. The client serializes mutations as JSON row payloads in the gRPC request.

That is why the write path is broader than the dedicated fast-path branches but still depends on server-side coercion for several PostgreSQL type families.

Read path

For SELECT, the extension does the following:

  1. Builds a ScanRequest with relation identity, table type, tenant context, and simple equality filters.
  2. Sends the request over gRPC.
  3. Receives Arrow IPC batches from KalamDB.
  4. arrow_to_pg::arrow_value_to_datum() materializes supported Arrow types directly into PostgreSQL datums.
  5. JSON and JSONB use PostgreSQL type-input functions so PostgreSQL sees real document datums, not plain text.
  6. Unsupported Arrow types fall back to ScalarValue::to_string().

That final fallback is why logical schema support is currently broader than the stable PostgreSQL materialization surface.

6. What Is Safest Today

If you need the most predictable PostgreSQL behavior today, prefer these surfaces first:

  • BOOLEAN, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE PRECISION
  • TEXT and VARCHAR
  • JSON, JSONB, and FILE as local JSONB
  • DATE, TIMESTAMP, and the timestamp-like generated TIMESTAMPTZ read path

Use extra care and validate against your target build for:

  • TIME
  • UUID
  • NUMERIC(p,s)
  • VECTOR(n) or other embedding-heavy PostgreSQL surfaces

The strongest current automated coverage in the extension repo is around JSON, JSONB, and FILE behavior. The rest of the mapping surface is primarily code-backed rather than broad end-to-end type-matrix coverage.

Last updated on