WARCannon - catastrophically powerful parallel WARC processing


WARCannon was built to simplify and cheapify the process of 'grepping the internet'.

With WARCannon, you can:
- Build and test regex patterns against real Common Crawl data
- Easily load Common Crawl datasets for parallel processing
- Scale compute capabilities to asynchronously crunch through WARCs at frankly unreasonable capacity
- Store and easily retrieve the results

How it Works

WARCannon leverages clever use of AWS technologies to:
- horizontally scale to any capacity,
- minimize cost through spot fleets and same-region data transfer,
- draw from S3 at incredible speeds (up to 100Gbps per node),
- parallelize across hundreds of CPU cores,
- report status via DynamoDB and CloudFront, and
- store results via S3.

In all, WARCannon can process multiple regular expression patterns across 400TB in a few hours for around $100.

Installation

WARCannon requires that you have the following installed:
- awscli (v2)
- terraform (v0.11)
- jq
- jsonnet
- npm (v12 or v14)

ProTip: To keep things clean and distinct from other things you....
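To make the 'grepping the internet' idea concrete, here is a minimal Python sketch of applying named regex patterns to the response records of a WARC stream. It is an illustration only, not WARCannon's own code: the function names `iter_warc_records`, `grep_records`, and `make_record` are hypothetical, the parser covers only the basic record layout (version line, headers, blank line, `Content-Length` bytes of body), and it assumes an uncompressed stream. Common Crawl files are gzip-compressed, so a real run would first wrap the file in something like `gzip.open`.

```python
import io
import re
from collections import defaultdict


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def grep_records(stream, patterns):
    """Return {pattern_name: [(target_uri, matched_text), ...]} for response records."""
    hits = defaultdict(list)
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # only look at captured HTTP responses
        uri = headers.get("warc-target-uri", "")
        for name, rx in patterns.items():
            for match in rx.finditer(body):
                hits[name].append((uri, match.group(0).decode("utf-8", "replace")))
    return dict(hits)


def make_record(uri, payload):
    """Hypothetical helper: build one tiny WARC response record for local testing."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two synthetic records; only the first contains something the pattern matches.
sample = io.BytesIO(
    make_record("http://example.com/a", b"HTTP/1.1 200 OK\r\n\r\ncontact: admin@example.com")
    + make_record("http://example.com/b", b"HTTP/1.1 200 OK\r\n\r\nnothing here")
)
patterns = {"email": re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+")}
results = grep_records(sample, patterns)
```

Testing patterns against a handful of local records like this is cheap; the parallel, multi-node part of the job is what the AWS infrastructure described above is for.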

August 12, 2021
© HAKIN9 MEDIA SP. Z O.O. SP. K. 2023