Metadata Fresh: 24h

Collection Anatomy

Technical reference for the files published inside an Internet Archive WARC collection item. Each file serves a specific role in the archival pipeline — from primary capture indices to pre-computed analytics and integrity metadata.

End Of Term 2024 Web Crawls

View on Internet Archive
Identifier
EndOfTerm2024WebCrawls
Total Files
9
Combined Size
0.4 TB

Files for EndOfTerm2024WebCrawls

NameTypeSizeLast Modified
EndOfTerm2024WebCrawls.cdx.gz
Gzip-compressed CDX index containing 3.66 billion capture records sorted by SURT URL key.
Primary Index209.8 GB17-Mar-2026 21:11
EndOfTerm2024WebCrawls.cdx.idx
Secondary byte-offset index (610,297 entries) mapping SURT URL ranges to positions in the .cdx.gz for random-access lookup.
ZipNum Index103.2 MB17-Mar-2026 21:11
EndOfTerm2024WebCrawls.parquet
Columnar Parquet representation of the CDX index. SURT-sorted by urlkey. Supports predicate pushdown for sub-second analytics via DuckDB, Spark, or Pandas.
Columnar Analytics157.1 GB17-Mar-2026 22:01
EndOfTerm2024WebCrawls.report.json.gz
Complete statistical report with extended host enumeration, MIME/status matrices, and sample URLs. Superset of summary.json.
Full Report8.7 MB17-Mar-2026 20:18
EndOfTerm2024WebCrawls.summary.json
Lightweight JSON with verified totals: 3.66B captures, 2.87B unique URLs, 1.18M hosts, MIME×Status matrix, top hosts, and temporal distribution.
Pre-Computed Stats21.6 KB17-Mar-2026 20:18
EndOfTerm2024WebCrawls_files.xml
IA file manifest listing all files in this item with MD5, SHA-1, CRC32 checksums and byte sizes for integrity verification.
Manifest2.5 KB28-Mar-2026 17:08
EndOfTerm2024WebCrawls_meta.sqlite
SQLite database tracking IA S3-layer operations: per-key metadata, operation history, and pending changes queue.
Internal28.0 KB18-Mar-2026 00:42
EndOfTerm2024WebCrawls_meta.xml
IA item metadata: identifier, title, parent collections, access flags, and creation date.
Metadata1.3 KB28-Mar-2026 17:08
start-text.txt
Genesis sentinel file confirming item creation. Content: "woohoo it worked!" — marks the collection's initialization on 2024-09-19.
Sentinel17 B19-Sep-2024 17:05