WARCannon - catastrophically powerful parallel WARC processing


WARCannon was built to simplify and cheapify the process of 'grepping the internet'.

With WARCannon, you can:

- Build and test regex patterns against real Common Crawl data
- Easily load Common Crawl datasets for parallel processing
- Scale compute capabilities to asynchronously crunch through WARCs at frankly unreasonable capacity
- Store and easily retrieve the results

How it Works

WARCannon leverages clever use of AWS technologies to horizontally scale to any capacity, minimize cost through spot fleets and same-region data transfer, draw from S3 at incredible speeds (up to 100Gbps per node), parallelize across hundreds of CPU cores, report status via DynamoDB and CloudFront, and store results via S3. In all, WARCannon can process multiple regular expression patterns across 400TB in a few hours for around $100.

Installation

WARCannon requires that you have the following installed:

- awscli (v2)
- terraform (v0.11)
- jq
- jsonnet
- npm (v12 or v14)

ProTip: To keep things clean and distinct from other things you....
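
To make the 'grepping the internet' idea described above concrete, here is a minimal local sketch of applying several regex patterns to a batch of record bodies and merging the match counts. It is not WARCannon's code or API: the pattern names, the `scan_record` and `scan_records` helpers, and the sample records are all hypothetical, and a local thread pool merely stands in for the horizontally scaled workers.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_record(body: str) -> Counter:
    """Count matches of every pattern in a single record body."""
    hits = Counter()
    for name, pattern in PATTERNS.items():
        hits[name] += len(pattern.findall(body))
    return hits


def scan_records(bodies, workers: int = 4) -> Counter:
    """Scan record bodies concurrently and merge the per-record counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hits in pool.map(scan_record, bodies):
            total.update(hits)
    return total


# Made-up record bodies standing in for Common Crawl responses.
SAMPLE = [
    "contact: alice@example.com",
    "key=AKIAABCDEFGHIJKLMNOP and bob@example.org",
    "nothing to see here",
]

if __name__ == "__main__":
    for name, count in sorted(scan_records(SAMPLE).items()):
        print(f"{name}: {count}")
```

Testing patterns against a handful of sample bodies like this is cheap, which is the point of trying a regex locally before spending compute on a full crawl.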

August 12, 2021
