Parquet Files
Enrich.sh automatically converts your JSON events into Apache Parquet files — the industry-standard columnar format used by every major data warehouse and analytics tool.
Why Parquet?
| Benefit | Details |
|---|---|
| Columnar | Read only the columns you need — fast analytical queries |
| Compressed | ~10x smaller than JSON with Snappy compression |
| Universal | Supported by DuckDB, Snowflake, BigQuery, ClickHouse, Spark, and more |
| Typed | Strong data types preserved — no guessing |
File Organization
Parquet files are organized in a time-partitioned folder structure for efficient querying:
```
{stream_id}/
  {year}/
    {month}/
      {day}/
        {hour}/
          {timestamp}.parquet
```
Example:
```
clickstream/
  2026/
    02/
      18/
        11/
          2026-02-18T11-36-37-431Z.parquet
```
This time-based directory partitioning means you can efficiently query specific time ranges with path globs:
```sql
-- DuckDB: read a specific hour
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/11/*.parquet');

-- Read an entire day
SELECT * FROM read_parquet('s3://bucket/clickstream/2026/02/18/**/*.parquet');
```
Supported Data Types
Every column in your stream schema maps to a Parquet type:
| Schema Type | Parquet Type | Tools Read As | Use For |
|---|---|---|---|
| string | BYTE_ARRAY + UTF8 | string / varchar | URLs, IDs, names, categories |
| json | BYTE_ARRAY + UTF8 | string / varchar | Nested objects as JSON string |
| int32 | INT32 | int32 / integer | Small numbers, ports, counts |
| int64 | INT64 | int64 / bigint | Large numbers, epoch timestamps |
| timestamp | INT64 + TIMESTAMP_MILLIS | timestamp | Date/time values |
| float32 | FLOAT | float / real | Ratings, approximate values |
| float64 | DOUBLE | double | Prices, scores, coordinates |
| boolean | BOOLEAN | bool | Flags, toggles |
Type Details
string
UTF-8 encoded text. Properly annotated with the STRING logical type so all tools read them as text — never as raw binary.
```json
{ "name": "url", "type": "string" }
```
json
Stored identically to string. Use this when a column contains serialized JSON objects or arrays. Downstream tools can parse with JSON_PARSE() or equivalent.
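Because the column arrives as a plain UTF-8 string, clients can also parse it themselves. A minimal stdlib sketch, with a made-up `metadata` payload:

```python
import json

# A value as read from a hypothetical `metadata` column
raw = '{"utm_source": "newsletter", "ab_tests": ["checkout-v2"]}'

parsed = json.loads(raw)
print(parsed["utm_source"])
```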
```json
{ "name": "metadata", "type": "json" }
```
timestamp vs int64 — Time Values
You have two options for storing time-related data. Choose the type based on how you want tools to interpret the column:
Option 1: timestamp type
Use timestamp when you want warehouses and tools to automatically recognize the column as a datetime. The value is stored as INT64 with a TIMESTAMP_MILLIS annotation.
```json
{ "name": "created_at", "type": "timestamp" }
```
You can pass either format:
- Epoch milliseconds: `1771413863920`
- ISO 8601 string: `"2026-02-18T11:24:23.920Z"`
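Both forms denote the same instant. A quick stdlib check that the ISO string above corresponds to the epoch-millisecond value:

```python
from datetime import datetime

iso = "2026-02-18T11:24:23.920Z"
# fromisoformat only accepts a trailing 'Z' on Python 3.11+;
# replacing it with an explicit offset keeps this portable
dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
epoch_ms = round(dt.timestamp() * 1000)
print(epoch_ms)  # 1771413863920
```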
Tools like DuckDB, Snowflake, BigQuery, and Spark will display proper datetime values:
```
2026-02-18T11:24:23.920Z
```
Option 2: int64 type
Use int64 when you want to store the raw numeric value without any timestamp annotation. This is useful when the column holds epoch values but you want full control over how they're interpreted.
```json
{ "name": "event_ts", "type": "int64" }
```
The value is stored as a plain INT64 — no logical type annotation. Tools will show the raw number:
```
1771413863920
```
You can still convert manually in SQL:
```sql
-- DuckDB
SELECT epoch_ms(event_ts) AS event_time FROM my_table;

-- Snowflake
SELECT TO_TIMESTAMP(event_ts / 1000) AS event_time FROM my_table;
```
Choosing between timestamp and int64
Use timestamp if you want tools to automatically display datetime values with no extra work. Use int64 if you prefer raw numbers and want to handle conversion yourself.
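The same conversion can also happen client-side. A stdlib sketch using integer math to avoid float rounding on large epoch values (`epoch_ms_to_datetime` is a hypothetical helper, not part of any SDK):

```python
from datetime import datetime, timezone

def epoch_ms_to_datetime(epoch_ms: int) -> datetime:
    """Convert a raw int64 epoch-millisecond value to an aware UTC datetime."""
    seconds, ms = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=ms * 1000)

print(epoch_ms_to_datetime(1771413863920).isoformat())
```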
int32 / int64
Standard integer types. int32 supports values up to ~2.1 billion. int64 supports values up to ~9.2 quintillion.
float32 / float64
IEEE 754 floating-point numbers. float64 (double precision) is recommended for financial data or coordinates.
boolean
True/false values. Bit-packed for efficient storage.
Compression
All Parquet files use Snappy compression:
- Fast compression and decompression
- ~60-70% size reduction vs uncompressed
- Universal support across all Parquet readers
Reading Your Files
DuckDB
```sql
SELECT * FROM read_parquet('2026-02-18T11-36-37-431Z.parquet');
```
Python (PyArrow)
```python
import pyarrow.parquet as pq

table = pq.read_table('2026-02-18T11-36-37-431Z.parquet')
print(table.to_pandas())
```
Snowflake
```sql
SELECT $1:url, $1:timestamp FROM @my_stage/2026-02-18T11-36-37-431Z.parquet;
```
BigQuery
```sql
SELECT * FROM `project.dataset.table`;
```
INFO
For full warehouse connection instructions with credentials and S3 endpoints, see the Connect guide.
