
Parquet Files

Enrich.sh automatically converts your JSON events into Apache Parquet files — the industry-standard columnar format used by every major data warehouse and analytics tool.

Why Parquet?

| Benefit | Details |
| --- | --- |
| Columnar | Read only the columns you need — fast analytical queries |
| Compressed | ~10x smaller than JSON with Snappy compression |
| Universal | Supported by DuckDB, Snowflake, BigQuery, ClickHouse, Spark, and more |
| Typed | Strong data types preserved — no guessing |

File Organization

Parquet files are organized in a time-partitioned folder structure for efficient querying:

{stream_id}/
  {year}/
    {month}/
      {day}/
        {hour}/
          {timestamp}.parquet

Example:

clickstream/
  2026/
    02/
      18/
        11/
          2026-02-18T11-36-37-431Z.parquet

This Hive-style partitioning means you can efficiently query specific time ranges:

```sql
-- DuckDB: read a specific hour
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/11/*.parquet');

-- Read an entire day
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/**/*.parquet');
```

Supported Data Types

Every column in your stream schema maps to a Parquet type:

| Schema Type | Parquet Type | Tools Read As | Use For |
| --- | --- | --- | --- |
| string | BYTE_ARRAY + UTF8 | string / varchar | URLs, IDs, names, categories |
| json | BYTE_ARRAY + UTF8 | string / varchar | Nested objects as JSON string |
| int32 | INT32 | int32 / integer | Small numbers, ports, counts |
| int64 | INT64 | int64 / bigint | Large numbers, epoch timestamps |
| timestamp | INT64 + TIMESTAMP_MILLIS | timestamp | Date/time values |
| float32 | FLOAT | float / real | Ratings, approximate values |
| float64 | DOUBLE | double | Prices, scores, coordinates |
| boolean | BOOLEAN | bool | Flags, toggles |
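
As an illustration, a stream schema covering all of these types might look like the following. The field names are hypothetical, and the assumption that a schema is a JSON array of column definitions follows the single-column examples below:

```json
[
  { "name": "url",        "type": "string" },
  { "name": "metadata",   "type": "json" },
  { "name": "port",       "type": "int32" },
  { "name": "bytes_sent", "type": "int64" },
  { "name": "created_at", "type": "timestamp" },
  { "name": "rating",     "type": "float32" },
  { "name": "price",      "type": "float64" },
  { "name": "is_bot",     "type": "boolean" }
]
```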

Type Details

string

UTF-8 encoded text. Properly annotated with the STRING logical type so all tools read it as text — never as raw binary.

```json
{ "name": "url", "type": "string" }
```

json

Stored identically to string. Use this type when a column contains serialized JSON objects or arrays. Downstream tools can parse it with JSON_PARSE() or equivalent.

```json
{ "name": "metadata", "type": "json" }
```

timestamp vs int64 — Time Values

You have two options for storing time-related data. Choose the type based on how you want tools to interpret the column:

Option 1: timestamp type

Use timestamp when you want warehouses and tools to automatically recognize the column as a datetime. The value is stored as INT64 with a TIMESTAMP_MILLIS annotation.

```json
{ "name": "created_at", "type": "timestamp" }
```

You can pass either format:

- Epoch milliseconds: 1771413863920
- ISO 8601 string: "2026-02-18T11:24:23.920Z"

Tools like DuckDB, Snowflake, BigQuery, and Spark will display proper datetime values:

2026-02-18T11:24:23.920Z
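
The equivalence of the two accepted formats can be checked with Python's standard library, using the values from the example above:

```python
from datetime import datetime, timezone

epoch_ms = 1771413863920

# Epoch milliseconds -> ISO 8601 (UTC)
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
iso = dt.isoformat(timespec="milliseconds").replace("+00:00", "Z")
print(iso)  # 2026-02-18T11:24:23.920Z

# ISO 8601 -> epoch milliseconds
parsed = datetime.fromisoformat("2026-02-18T11:24:23.920+00:00")
print(round(parsed.timestamp() * 1000))  # 1771413863920
```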

Option 2: int64 type

Use int64 when you want to store the raw numeric value without any timestamp annotation. This is useful when the column holds epoch values but you want full control over how they're interpreted.

```json
{ "name": "event_ts", "type": "int64" }
```

The value is stored as a plain INT64 — no logical type annotation. Tools will show the raw number:

1771413863920

You can still convert manually in SQL:

```sql
-- DuckDB
SELECT epoch_ms(event_ts) AS event_time FROM my_table;

-- Snowflake
SELECT TO_TIMESTAMP(event_ts / 1000) AS event_time FROM my_table;
```

Choosing between timestamp and int64

Use timestamp if you want tools to automatically display datetime values with no extra work. Use int64 if you prefer raw numbers and want to handle conversion yourself.

int32 / int64

Standard integer types. int32 supports values up to ~2.1 billion. int64 supports values up to ~9.2 quintillion.
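
A quick check of those ranges also shows why epoch milliseconds must be stored as int64 (or timestamp) — a current epoch-milliseconds value already overflows int32:

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647 (~2.1 billion)
INT64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807 (~9.2 quintillion)

epoch_ms = 1771413863920  # a 2026 timestamp in milliseconds

print(epoch_ms > INT32_MAX)   # True  -> would overflow int32
print(epoch_ms <= INT64_MAX)  # True  -> fits comfortably in int64
```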

float32 / float64

IEEE 754 floating-point numbers. float64 (double precision) is recommended for financial data or coordinates.

boolean

True/false values. Bit-packed for efficient storage.

Compression

All Parquet files use Snappy compression:

- Fast compression and decompression
- ~60-70% size reduction vs uncompressed
- Universal support across all Parquet readers

Reading Your Files

DuckDB

```sql
SELECT * FROM read_parquet('2026-02-18T11-36-37-431Z.parquet');
```

Python (PyArrow)

```python
import pyarrow.parquet as pq

table = pq.read_table('2026-02-18T11-36-37-431Z.parquet')
print(table.to_pandas())
```

Snowflake

```sql
SELECT $1:url, $1:timestamp FROM @my_stage/2026-02-18T11-36-37-431Z.parquet;
```

BigQuery

```sql
SELECT * FROM `project.dataset.table`;
```

INFO

For full warehouse connection instructions with credentials and S3 endpoints, see the Connect guide.
