
Parquet Files

Enrich.sh automatically converts your JSON events into Apache Parquet files — the industry-standard columnar format used by every major data warehouse and analytics tool.

Why Parquet?

| Benefit | Details |
| --- | --- |
| Columnar | Read only the columns you need — fast analytical queries |
| Compressed | ~10x smaller than JSON with Snappy compression |
| Universal | Supported by DuckDB, Snowflake, BigQuery, ClickHouse, Spark, and more |
| Typed | Strong data types preserved — no guessing |

File Organization

Parquet files are organized in a time-partitioned folder structure for efficient querying:

{stream_id}/
  {year}/
    {month}/
      {day}/
        {hour}/
          {timestamp}.parquet

Example:

clickstream/
  2026/
    02/
      18/
        11/
          2026-02-18T11-36-37-431Z.parquet

This Hive-style partitioning means you can efficiently query specific time ranges:

```sql
-- DuckDB: read a specific hour
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/11/*.parquet');

-- Read an entire day
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/**/*.parquet');
```

Supported Data Types

Every column in your stream schema maps to a Parquet type:

| Schema Type | Parquet Type | Tools Read As | Use For |
| --- | --- | --- | --- |
| string | BYTE_ARRAY + UTF8 | string / varchar | URLs, IDs, names, categories |
| json | BYTE_ARRAY + UTF8 | string / varchar | Nested objects as JSON string |
| int32 | INT32 | int32 / integer | Small numbers, ports, counts |
| int64 | INT64 | int64 / bigint | Large numbers, epoch timestamps |
| timestamp | INT64 + TIMESTAMP_MILLIS | timestamp | Date/time values |
| float32 | FLOAT | float / real | Ratings, approximate values |
| float64 | DOUBLE | double | Prices, scores, coordinates |
| boolean | BOOLEAN | bool | Flags, toggles |
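
As an illustration, a stream schema covering all of these types might look like the following. The field names are hypothetical, and the assumption that a schema is a JSON array of column definitions follows the single-column examples below:

```json
[
  { "name": "url",        "type": "string" },
  { "name": "metadata",   "type": "json" },
  { "name": "port",       "type": "int32" },
  { "name": "bytes_sent", "type": "int64" },
  { "name": "created_at", "type": "timestamp" },
  { "name": "rating",     "type": "float32" },
  { "name": "price",      "type": "float64" },
  { "name": "is_bot",     "type": "boolean" }
]
```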

Type Details

string

UTF-8 encoded text. Properly annotated with the STRING logical type so all tools read it as text — never as raw binary.

```json
{ "name": "url", "type": "string" }
```

json

Stored identically to string. Use this type when a column contains serialized JSON objects or arrays. Downstream tools can parse it with JSON_PARSE() or equivalent.

```json
{ "name": "metadata", "type": "json" }
```

timestamp vs int64 — Time Values

You have two options for storing time-related data. Choose the type based on how you want tools to interpret the column:

Option 1: timestamp type

Use timestamp when you want warehouses and tools to automatically recognize the column as a datetime. The value is stored as INT64 with a TIMESTAMP_MILLIS annotation.

```json
{ "name": "created_at", "type": "timestamp" }
```

You can pass either format:

- Epoch milliseconds: 1771413863920
- ISO 8601 string: "2026-02-18T11:24:23.920Z"

Tools like DuckDB, Snowflake, BigQuery, and Spark will display proper datetime values:

2026-02-18T11:24:23.920Z
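
The equivalence of the two accepted formats can be checked with Python's standard library, using the values from the example above:

```python
from datetime import datetime, timezone

epoch_ms = 1771413863920

# Epoch milliseconds -> ISO 8601 (UTC)
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
iso = dt.isoformat(timespec="milliseconds").replace("+00:00", "Z")
print(iso)  # 2026-02-18T11:24:23.920Z

# ISO 8601 -> epoch milliseconds
parsed = datetime.fromisoformat("2026-02-18T11:24:23.920+00:00")
print(round(parsed.timestamp() * 1000))  # 1771413863920
```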

Option 2: int64 type

Use int64 when you want to store the raw numeric value without any timestamp annotation. This is useful when the column holds epoch values but you want full control over how they're interpreted.

```json
{ "name": "event_ts", "type": "int64" }
```

The value is stored as a plain INT64 — no logical type annotation. Tools will show the raw number:

1771413863920

You can still convert manually in SQL:

```sql
-- DuckDB
SELECT epoch_ms(event_ts) AS event_time FROM my_table;

-- Snowflake
SELECT TO_TIMESTAMP(event_ts / 1000) AS event_time FROM my_table;
```

Choosing between timestamp and int64

Use timestamp if you want tools to automatically display datetime values with no extra work. Use int64 if you prefer raw numbers and want to handle conversion yourself.

int32 / int64

Standard integer types. int32 supports values up to ~2.1 billion. int64 supports values up to ~9.2 quintillion.
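
A quick check of those ranges also shows why epoch milliseconds must be stored as int64 (or timestamp) — a current epoch-milliseconds value already overflows int32:

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647 (~2.1 billion)
INT64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807 (~9.2 quintillion)

epoch_ms = 1771413863920  # a 2026 timestamp in milliseconds

print(epoch_ms > INT32_MAX)   # True  -> would overflow int32
print(epoch_ms <= INT64_MAX)  # True  -> fits comfortably in int64
```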

float32 / float64

IEEE 754 floating-point numbers. float64 (double precision) is recommended for financial data or coordinates.

boolean

True/false values. Bit-packed for efficient storage.

Compression

All Parquet files use Snappy compression:

- Fast compression and decompression
- ~60-70% size reduction vs uncompressed
- Universal support across all Parquet readers

Reading Your Files

DuckDB

```sql
SELECT * FROM read_parquet('2026-02-18T11-36-37-431Z.parquet');
```

Python (PyArrow)

```python
import pyarrow.parquet as pq

table = pq.read_table('2026-02-18T11-36-37-431Z.parquet')
print(table.to_pandas())
```

Snowflake

```sql
SELECT $1:url, $1:timestamp FROM @my_stage/2026-02-18T11-36-37-431Z.parquet;
```

BigQuery

```sql
SELECT * FROM `project.dataset.table`;
```

INFO

For full warehouse connection instructions with credentials and S3 endpoints, see the Connect guide.
