Collection Anatomy
Technical reference for the files published inside an Internet Archive WARC collection item. Each file serves a specific role in the archival pipeline, from primary capture indices to pre-computed analytics and integrity metadata.
End Of Term 2024 Web Crawls
Identifier: EndOfTerm2024WebCrawls
Total Files: 9
Combined Size: 0.4 TB
Files for EndOfTerm2024WebCrawls
| Name | Description | Type | Size | Last Modified |
|---|---|---|---|---|
| EndOfTerm2024WebCrawls.cdx.gz | Gzip-compressed CDX index containing 3.66 billion capture records sorted by SURT URL key. | Primary Index | 209.8 GB | 17-Mar-2026 21:11 |
| EndOfTerm2024WebCrawls.cdx.idx | Secondary byte-offset index (610,297 entries) mapping SURT URL ranges to positions in the .cdx.gz for random-access lookup. | ZipNum Index | 103.2 MB | 17-Mar-2026 21:11 |
| EndOfTerm2024WebCrawls.parquet | Columnar Parquet representation of the CDX index. SURT-sorted by urlkey. Supports predicate pushdown for sub-second analytics via DuckDB, Spark, or Pandas. | Columnar Analytics | 157.1 GB | 17-Mar-2026 22:01 |
| EndOfTerm2024WebCrawls.report.json.gz | Complete statistical report with extended host enumeration, MIME/status matrices, and sample URLs. Superset of summary.json. | Full Report | 8.7 MB | 17-Mar-2026 20:18 |
| EndOfTerm2024WebCrawls.summary.json | Lightweight JSON with verified totals: 3.66B captures, 2.87B unique URLs, 1.18M hosts, MIME×Status matrix, top hosts, and temporal distribution. | Pre-Computed Stats | 21.6 KB | 17-Mar-2026 20:18 |
| EndOfTerm2024WebCrawls_files.xml | IA file manifest listing all files in this item with MD5, SHA-1, CRC32 checksums and byte sizes for integrity verification. | Manifest | 2.5 KB | 28-Mar-2026 17:08 |
| EndOfTerm2024WebCrawls_meta.sqlite | SQLite database tracking IA S3-layer operations: per-key metadata, operation history, and pending changes queue. | Internal | 28.0 KB | 18-Mar-2026 00:42 |
| EndOfTerm2024WebCrawls_meta.xml | IA item metadata: identifier, title, parent collections, access flags, and creation date. | Metadata | 1.3 KB | 28-Mar-2026 17:08 |
| start-text.txt | Genesis sentinel file confirming item creation. Content: "woohoo it worked!" Marks the collection's initialization on 2024-09-19. | Sentinel | 17 B | 19-Sep-2024 17:05 |
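
The random-access lookup that the .cdx.idx enables can be sketched as a binary search followed by a ranged read. The code below is a minimal illustration, not the archive's own tooling. It assumes that each .cdx.idx line is tab-separated as `<urlkey> <timestamp>`, part name, byte offset, byte length (a layout common to ZipNum-style indices, but not confirmed by this page), that each indexed block is an independent gzip member, and that each CDX line begins with the SURT key followed by a space. The function names `load_idx`, `read_block`, and `lookup` are hypothetical.

```python
import bisect
import gzip


def load_idx(idx_lines):
    """Parse secondary-index lines into sorted keys and (offset, length) pairs.

    Assumed field layout (verify against the real file):
    "<urlkey> <timestamp>" TAB part-name TAB offset TAB length [TAB ...]
    """
    keys, blocks = [], []
    for line in idx_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        keys.append(fields[0])
        blocks.append((int(fields[2]), int(fields[3])))
    return keys, blocks


def read_block(cdx_gz_path, offset, length):
    """Read one compressed block by byte range and return its CDX lines."""
    with open(cdx_gz_path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return gzip.decompress(raw).decode("utf-8", errors="replace").splitlines()


def lookup(cdx_gz_path, keys, blocks, surt_key):
    """Return every CDX line whose first field equals surt_key."""
    # Each idx key is the first record of its block, so matches can start
    # in the last block whose first key sorts at or before the target.
    start = max(bisect.bisect_right(keys, surt_key) - 1, 0)
    hits = []
    for offset, length in blocks[start:]:
        for line in read_block(cdx_gz_path, offset, length):
            key = line.split(" ", 1)[0]
            if key == surt_key:
                hits.append(line)
            elif key > surt_key:
                return hits  # records are sorted, so nothing later can match
    return hits
```

Under those assumptions, a caller would load the index once with `load_idx(open("EndOfTerm2024WebCrawls.cdx.idx", encoding="utf-8"))` and then call `lookup("EndOfTerm2024WebCrawls.cdx.gz", keys, blocks, surt_key)` for each SURT key of interest, decompressing only the few blocks that can contain it rather than the full 209.8 GB file.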