Track Awesome Web Archiving Updates Daily

An Awesome List for getting started with web archiving

😺 iipc/awesome-web-archiving · ⭐ 2.2K · 🏷️ Miscellaneous

Apr 10, 2025

Tools & Software / Search & Discovery

Shine (⭐43) - A prototype web archives exploration UI, developed with researchers as part of the Big UK Domain Data for the Arts and Humanities project. (Stable)

SolrWayback (⭐111) - A Java backend and Vue.js frontend project with free-text search and a built-in playback engine. Requires WARC files to have been indexed with the Warc-Indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. The SolrWayback 4 Bundle release (⭐111) contains all the software and dependencies in an out-of-the-box solution that is easy to install.

Warclight (⭐50) - A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats. (In Development)

Wasp (⭐27) - A fully functional prototype of a personal web archive and search system. (In Development)

Other possible options for building a front-end are listed in the webarchive-discovery wiki, here (⭐122).

Feb 12, 2025

Tools & Software / Acquisition

Community Archive - Open Twitter Database and API with tools and resources for building on archived Twitter data.

Jan 31, 2025

Tools & Software / Search & Discovery

hyphe (⭐350) - A web crawler built for research use, with a graphical user interface for building web corpora made of lists of web actors and maps of links between them. (Stable)

PANDORÆ (⭐12) - Desktop research software that plugs into a Solr endpoint to query, retrieve, normalize and visually explore web archives. (Stable)

Tools & Software / Utilities

httpreserve.info - Service to return the status of a web page or save it to the Internet Archive. HTTPreserve includes disambiguation of well-known short link services. It returns JSON via the browser, or on the command line via cURL using GET. It describes web sites using their earliest and latest dates in the Internet Archive and demonstrates the construction of Robust Links in its output using that range. (Golang). (Stable)

Jan 22, 2025

Resources for Web Publishers

Definition of Web Archivability - This describes the ease with which web content can be preserved. (Archived version from the Stanford Libraries)

Tools & Software / Utilities

The Unarchiver - Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).

Tools & Software / WARC I/O Libraries

Warcat-rs (⭐15) - Command-line tool and Rust library for handling Web ARChive (WARC) files. (In Development)

Nov 12, 2024

Tools & Software / Acquisition

ArchiveWeb.Page - A plugin for Chrome and other Chromium based browsers that lets you interactively archive web pages, replay them, and export them as WARC & WACZ files. Also available as an Electron based desktop application.

Tools & Software / Replay

ReplayWeb.page - A browser-based, fully client-side replay engine for both local and remote WARC & WACZ files. Also available as an Electron based desktop application. (Stable)

Community Resources / Blogs and Scholarship

Common Crawl Foundation Blog - rss

Community Resources / Slack

Common Crawl Foundation Partners (ask greg zat commoncrawl zot org for an invite)

Web Archiving Service Providers / Self-hostable, Open Source

Browsertrix - From Webrecorder, source available at https://github.com/webrecorder/browsertrix (⭐259).

Oct 18, 2024

Tools & Software / Acquisition

SiteStory - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)

May 09, 2024

Tools & Software / WARC I/O Libraries

Jwat (⭐3) - Libraries for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

Jwat-Tools (⭐5) - Tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

May 06, 2024

Tools & Software / Utilities

warc-safe (⭐11) - Automatic detection of viruses and NSFW content in WARC files.

Apr 25, 2024

Tools & Software / Replay

PYWB (⭐1.5k) - A Python 3 implementation of web archival replay tools, sometimes also known as 'Wayback Machine'. (Stable)

Jan 19, 2024

Web Archiving Service Providers / Self-hostable, Open Source

Conifer - From Rhizome, source available at https://github.com/Rhizome-Conifer.

Web Archiving Service Providers / Hosted, Closed Source

Archive-It - From the Internet Archive.

Arkiwera

Hanzo

MirrorWeb

PageFreezer

Smarsh

Dec 26, 2023

Tools & Software / Utilities

Internet Archive Library (⭐1.7k) - A command line tool and Python library for interacting directly with archive.org. (Python). (Stable)

Oct 17, 2023

Tools & Software / Utilities

warcdb (⭐397) - A command line utility (Python) for importing WARC files into a SQLite database. (Stable)

Aug 30, 2023

Tools & Software / Utilities

HTTPreserve linkstat (⭐10) - Command line implementation of httpreserve.info to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like JQ. HTTPreserve Linkstat describes current status, and earliest and latest links on archive.org. (Golang). (Stable)

Jul 14, 2023

Training/Documentation

Training materials:

Jul 05, 2023

Tools & Software / Analysis

Common Crawl Columnar Index - SQL-queryable index, with CDX info plus language classification. (Stable)

Common Crawl Web Graph - A host or domain-level graph of the web, with ranking information. (Stable)

Common Crawl Jupyter notebooks (⭐52) - A collection of notebooks using Common Crawl's various datasets. (Stable)

Jul 04, 2023

Tools & Software / Utilities

cdx-toolkit - Library and CLI to consult cdx indexes and create WARC extractions of subsets. Abstracts away Common Crawl's unusual crawl structure. (Stable)

Tools & Software / Analysis

Web Data Commons - Structured data extracted from Common Crawl. (Stable)

Community Resources / Mailing Lists

Common Crawl

Jun 02, 2023

Tools & Software / Utilities

Go Get Crawl (⭐155) - Extract web archive data using Wayback Machine and Common Crawl. (Stable)

May 26, 2023

Tools & Software / Acquisition

crau (⭐61) - crau is the way (most) Brazilians pronounce crawl. It's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs. (Stable)

Apr 27, 2023

Tools & Software / Acquisition

Scoop (⭐156) - High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. (Stable)

Apr 19, 2023

Tools & Software / Utilities

warcdedupe - WARC deduplication tool (and WARC library) written in Rust. (In Development)

Apr 13, 2023

Tools & Software / Utilities

Warchaeology - Warchaeology is a collection of tools for inspecting, manipulating, deduplicating and validating WARC files. (Stable)

warcrefs (⭐8) - Web archive deduplication tools. (Stable)

Jan 21, 2023

Tools & Software / Acquisition

DiskerNet (⭐3.8k) - A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse making it available for offline replay. (In Development)

Sep 27, 2022

Community Resources / Blogs and Scholarship

WS-DL Blog - Web Science and Digital Libraries Research Group blogs about various Web archiving related topics, scholarly work, and academic trip reports.

Sep 24, 2022

Tools & Software / Acquisition

Auto Archiver (⭐687) - Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the article about Auto Archiver on bellingcat.com.

Aug 23, 2022

Training/Documentation

For researchers using web archives:

May 25, 2022

Training/Documentation

Introductions to web archiving concepts:
- What is a web archive? - A video from the UK Web Archive YouTube Channel
- Wikipedia's List of Web Archiving Initiatives
- Glossary of Archive-It and Web Archiving Terms
- The Web Archiving Lifecycle Model - The Web Archiving Lifecycle Model is an attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.
- Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray

Tools & Software / WARC I/O Libraries

Sparkling (⭐13) - Internet Archive's Sparkling Data Processing Library. (Stable)

Tools & Software / Analysis

Archives Research Compute Hub (⭐17) - Web application for distributed compute analysis of Archive-It web archive collections. (Stable)

Mar 03, 2022

Community Resources / Blogs and Scholarship

The Web as History - An open-source book that provides a conceptual overview to web archiving research, as well as several case studies.

Jan 23, 2022

Tools & Software / Acquisition

Waybackpy (⭐516) - Wayback Machine Save, CDX and availability API interface in Python and a command-line tool. (Stable)

Jan 05, 2022

Tools & Software / WARC I/O Libraries

Unwarcit (⭐10) - Command line interface to unzip WARC and WACZ files (Python).

Dec 13, 2021

Tools & Software / WARC I/O Libraries

FastWARC (⭐102) - A high-performance WARC parsing library (Python).

Nov 08, 2021

Tools & Software / Replay

warc2html (⭐44) - Converts WARC files to static HTML suitable for browsing offline or rehosting.

Oct 07, 2021

Tools & Software / Acquisition

Wayback (⭐1.9k) - A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS and beyond. (Stable)

Jul 20, 2021

Tools & Software / Utilities

gowarcserver (⭐16) - BadgerDB (⭐14k)-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).

Jul 13, 2021

Tools & Software / Acquisition

Web Curator Tool - Open-source workflow management for selective web archiving. (Stable)

Tools & Software / Curation

Zotero Robust Links Extension - A Zotero extension that submits to and reads from web archives. Source on GitHub (⭐19). Supersedes leonkt/zotero-memento (⭐314).

Jun 22, 2021

Tools & Software / Acquisition

WAIL (⭐372) - A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; Python, Electron (⭐126). (Stable)

Warcprox (⭐401) - WARC-writing MITM HTTP/S proxy. (Stable)

May 28, 2021

Tools & Software / Acquisition

ArchiveBox (⭐24k) - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)

Browsertrix Crawler (⭐744) - A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. (Stable)

Brozzler (⭐699) - A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded urls and to extract links. (Stable)

Apr 27, 2021

Community Resources / Twitter

@WebSciDL - ODU Web Science and Digital Libraries Research Group.

Apr 24, 2021

Tools & Software / Search & Discovery

playback (⭐8) - A toolkit for searching archived webpages from Internet Archive, archive.today, Memento and beyond. (In Development)

Apr 16, 2021

Tools & Software / Utilities

httrack2warc (⭐32) - Convert HTTrack archives to WARC format (Java).

Nov 06, 2020

Tools & Software / Acquisition

Cairn (⭐47) - An npm package and CLI tool for saving webpages. (Stable)

Obelisk (⭐278) - Go package and CLI tool for saving a web page as a single HTML file. (Stable)

Tools & Software / Search & Discovery

Mink (⭐53) - A Google Chrome extension for querying Memento aggregators while browsing and integrating live-archived web navigation. (Stable)

Community Resources / Mailing Lists

OpenWayback

WASAPI

Community Resources / Slack

IIPC Slack - Ask @netpreserve for access.

Sep 18, 2020

Tools & Software / Replay

OpenWayback (⭐497) - An open source project that aims to develop the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser. (Stable)

Sep 16, 2020

Community Resources / Blogs and Scholarship

UK Web Archive Blog

Community Resources / Twitter

#WebArchiveWednesday

Jun 05, 2020

Tools & Software / Acquisition

Heritrix (⭐2.9k) - An open source, extensible, web-scale, archival quality web crawler. (Stable)
- Heritrix Q&A (⭐2.9k) - A discussion forum for asking questions and getting answers about using Heritrix.
- Heritrix Walkthrough (⭐8) (In Development)

WebMemex - Browser extension for Firefox and Chrome which lets you archive web pages you visit. (In Development)

Community Resources / Other Awesome Lists

Web Archiving Community (⭐24k)

Awesome Memento (⭐97)

The WARC Ecosystem

The Web Crawl section of COPTR

Mar 26, 2020

Tools & Software / Acquisition

archivenow (⭐420) - A Python library to push web resources into on-demand web archives. (Stable)

Chronicler (⭐87) - Web browser with record and replay functionality. (In Development)

Crawl - A simple web crawler in Golang. (Stable)

crocoite (⭐44) - Crawl websites using headless Google Chrome/Chromium and save resources, static DOM snapshot and page screenshots to WARC files. (In Development)

F(b)arc (⭐77) - A command-line tool and Python library for archiving data from Facebook using the Graph API. (Stable)

freeze-dry (⭐294) - JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions. (In Development)

grab-site (⭐1.5k) - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. (Stable)

html2warc (⭐21) - A simple script to convert offline data into a single WARC file. (Stable)

HTTrack - An open source website copying utility. (Stable)

monolith (⭐13k) - CLI tool to save a web page as a single HTML file. (Stable)

SingleFile (⭐17k) - Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. (Stable)

Social Feed Manager - Open source software that enables users to create social media collections from Twitter, Tumblr, Flickr, and Sina Weibo public APIs. (Stable)

Squidwarc (⭐170) - An open source, high-fidelity, page interacting archival crawler that uses Chrome or Chrome Headless directly. (In Development)

StormCrawler - A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)

twarc (⭐1.4k) - A command line tool and Python library for archiving Twitter JSON data. (Stable)

WARCreate - A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Warcworker (⭐58) - An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable)

Web2Warc (⭐24) - An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX). (Stable)

Wget - An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)

Wget-lua (⭐24) - Wget with Lua extension. (Stable)

Wpull (⭐582) - A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler. (Stable)

Tools & Software / Search & Discovery

Tempas v1 - Temporal web archive search based on Delicious tags. (Stable)

Tempas v2 - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., Obama@2005-2009 in Tempas). (Stable)

Tools & Software / Utilities

MementoMap (⭐10) - A Tool to Summarize Web Archive Holdings (Python). (In Development)

MemGator (⭐62) - A Memento Aggregator CLI and Server (Golang). (Stable)

node-cdxj (⭐0) - CDXJ (⭐14) file parser (Node.js). (Stable)

OutbackCDX (⭐34) - RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as backend for OpenWayback, PyWb and Heritrix (⭐11). (Stable)

py-wasapi-client (⭐16) - Command line application to download crawls from WASAPI (Python). (Stable)

tikalinkextract (⭐10) - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)

wasapi-downloader (⭐6) - Java command line application to download crawls from WASAPI. (Stable)

WarcPartitioner (⭐1) - Partition (W)ARC Files by MIME Type and Year. (Stable)

wikiteam (⭐759) - Tools for downloading and preserving wikis. (Stable)

Tools & Software / WARC I/O Libraries

HadoopConcatGz (⭐9) - A Splittable Hadoop InputFormat for Concatenated GZIP Files (and *.warc.gz). (Stable)

node-warc (⭐97) - Parse WARC files or create WARC files using either Electron or chrome-remote-interface (⭐4.4k) (Node.js). (Stable)

Warcat (⭐156) - Tool and library for handling Web ARChive (WARC) files (Python). (Stable)
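
The concatenated GZIP layout mentioned for HadoopConcatGz above can be illustrated with the Python standard library alone. This is a minimal sketch, not the implementation of any tool listed here: it assumes each record was compressed as its own gzip member (the per-record convention commonly used for *.warc.gz files), and the helper name iter_gzip_members and the toy records are hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in `data`, in order."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decompressor.decompress(data)
        payload += decompressor.flush()
        yield payload
        # Bytes after the end of this member belong to the next member.
        data = decompressor.unused_data


# Build a toy "*.warc.gz": two placeholder records, each its own gzip member.
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n",
]
blob = b"".join(gzip.compress(record) for record in records)

# Splitting by member recovers the individual records.
members = list(iter_gzip_members(blob))

# Reading the whole file as one stream yields the records back to back.
whole = gzip.decompress(blob)
```

Because each member can be decompressed independently, a reader can start at any member boundary instead of at the beginning of the file, which is the property that makes such files splittable.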

Tools & Software / Analysis

ArchiveSpark (⭐148) - An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction as well as derivation. (Stable)

Archives Unleashed Notebooks (⭐26) - Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit. (Stable)

Archives Unleashed Toolkit (⭐143) - Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark. (Stable)

Tweet Archives Unleashed Toolkit (⭐9) - An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark. (In Development)

Mar 10, 2020

Community Resources / Blogs and Scholarship

DSHR's Blog - David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.

Mar 03, 2020

Community Resources / Blogs and Scholarship

Web Archiving Roundtable - Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.

Feb 27, 2020

Community Resources / Twitter

@NetPreserve - Official IIPC handle.

#WebArchiving

Feb 26, 2020

Tools & Software / Replay

InterPlanetary Wayback (ipwb) (⭐625) - Web Archive (WARC) indexing and replay using IPFS.

Feb 25, 2020

Training/Documentation

The WARC Standard:
- The warc-specifications community HTML version of the official specification and hub for new proposals.
- The official ISO 28500 WARC specification homepage.

Tools & Software / Replay

Reconstructive - Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).

Tools & Software / Search & Discovery

webarchive-discovery (⭐122) - WARC and ARC full-text indexing and discovery tools, with a number of associated tools capable of using the index shown below. (Stable)

Tools & Software / Utilities

ArchiveTools (⭐72) - Collection of tools to extract and interact with WARC files (Python).

har2warc (⭐51) - Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python).

Tools & Software / WARC I/O Libraries

jwarc (⭐48) - Read and write WARC files with a type safe API (Java).

warcio (⭐408) - Streaming WARC/ARC library for fast web archive IO (Python). (Stable)

warctools (⭐159) - Library to work with ARC and WARC files (Python).

webarchive (⭐20) - Golang readers for ARC and WARC webarchive formats (Golang).

Tools & Software / Quality Assurance

Chrome Check My Links - Browser extension: a link checker with more options.

Chrome link checker - Browser extension: basic link checker.

Chrome link gopher - Browser extension: link harvester on a page.

Chrome Open Multiple URLs - Browser extension: opens multiple URLs and also extracts URLs from text.

Chrome Revolver - Browser extension: switches between browser tabs.

FlameShot (⭐26k) - Screen capture and annotation on Ubuntu.

PlayOnLinux - For running Xenu and Notepad++ on Ubuntu.

PlayOnMac - For running Xenu and Notepad++ on macOS.

Windows Snipping Tool - Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture).

WineBottler - For running Xenu and Notepad++ on macOS.

xDoTool (⭐3.4k) - Click automation on Ubuntu.

Xenu - Desktop link checker for Windows.

Community Resources / Slack

Archives Unleashed Slack - Fill out this request form for access to a researcher group of people working with web archives.

Archivers Slack - Invite yourself to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.

Oct 16, 2018

Resources for Web Publishers

The Archive Ready tool, for estimating how likely it is that a web page will be archived successfully.

Mar 14, 2018

Tools & Software / Search & Discovery

SecurityTrails - Web-based archive for WHOIS and DNS records. REST API available free of charge.

Jun 26, 2017

Tools & Software

Comparison of web archiving software (⭐95)

Awesome Website Change Monitoring (⭐505)

Jun 21, 2017

Tools & Software / Utilities

webarchive-indexing (⭐45) - Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.