Help Preserve the Internet With Archiveteam's Warrior

The internet is a volatile space. Content is removed or censored every minute of the day, servers crash all the time, and people stop paying for hosting. You might not notice it immediately, but information on the internet is not permanent. If only we could make a backup of the internet…

Well, it turns out you can help! Do you have an underutilized internet connection? You can put it to work archiving news stories, videos, social media posts and more, so that as little as possible is lost. Archived material can serve as evidence, as research material, as a record of history and culture, and a lot more.

This matters even more in light of the recent events in Ukraine. Footage and social media posts about this war are being censored heavily, and it is really important to make sure none of it gets lost: for evidence, for journalism, and so we can learn from it in the future. Follow along, because we are going to start archiving with Archiveteam’s Warrior application.

What is Archiveteam anyway?

Archiveteam is not one person, but a loose collective of people dedicated to preserving online history. They have archived a lot of different things, from small websites to large sites and collections. Some notable examples are the YouTube dislike counts, Geocities (remember those?), public Google Drive and Mediafire files and a lot more. They have been working to preserve the online world as we know it since 2009.

Today you can join this loose collective by installing their ‘Warrior’, an automated program that accepts work from a central server and archives it. The archived material is (most of the time) added to the Internet Archive, so it can be accessed using the Wayback Machine.

Installing The Warrior

The official documentation lists a docker run one-liner to start the warrior. I really don’t like this approach. It is not easily transferred to other servers, and I like to be able to make small adjustments in a file if something needs to change. Another problem, and in my opinion a slightly bigger one, is that it does not store the configuration between updates. This means that you have to set it up again each time the container is updated; not ideal!

That’s why we are going to set up a good ol' docker-compose.yml file. The configuration will be done using environment variables, since there is not a whole lot to set up. The Docker Compose file can also be downloaded from the Selfhosted Heaven GitHub page.

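version: "3.7"
services:
    watchtower:
        container_name: archiveteam-watchtower
        image: containrrr/watchtower
        labels:
            - com.centurylinklabs.watchtower.enable=true
            - com.centurylinklabs.watchtower.scope=archiveteam-warrior
        volumes:
            # Watchtower needs the Docker socket to inspect and update containers
            - '/var/run/docker.sock:/var/run/docker.sock'
        # Check hourly, only for labelled containers in this scope; remove old images
        command: '--label-enable --cleanup --interval 3600 --scope archiveteam-warrior'
        restart: unless-stopped

    warrior:
        container_name: archiveteam-warrior
        # Assumed image name, check the official documentation before use
        image: atdr.meo.ws/archiveteam/warrior-dockerfile
        environment:
            - DOWNLOADER=selfhostedheaven # Change this to your nickname
            - SELECTED_PROJECT=auto
            - CONCURRENT_ITEMS=6
        # Give running items up to five minutes to finish before stopping
        stop_signal: SIGINT
        stop_grace_period: 5m
        labels:
            - com.centurylinklabs.watchtower.enable=true
            - com.centurylinklabs.watchtower.scope=archiveteam-warrior
        ports:
            - '8001:8001'
        restart: unless-stopped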

There are two containers in this Docker Compose file: watchtower and the warrior itself. Watchtower is a container that automatically updates other containers to the newest release. The warrior needs this, because only the newest version of the warrior is allowed to accept work from the central server.
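The labels and the --label-enable and --scope flags in the Compose file decide which containers Watchtower touches. The rule can be illustrated with a small Python sketch. This is only a model of the documented behaviour, not Watchtower's actual code, and the function name is_watched is made up for this example.

```python
# Illustrative model of how Watchtower picks containers to update.
# Not Watchtower's own code; is_watched is a hypothetical helper.
ENABLE_LABEL = "com.centurylinklabs.watchtower.enable"
SCOPE_LABEL = "com.centurylinklabs.watchtower.scope"


def is_watched(labels: dict, scope: str) -> bool:
    """Return True if a container with these labels would be updated."""
    enabled = labels.get(ENABLE_LABEL, "").lower() == "true"  # --label-enable
    same_scope = labels.get(SCOPE_LABEL) == scope             # --scope
    return enabled and same_scope


warrior_labels = {ENABLE_LABEL: "true", SCOPE_LABEL: "archiveteam-warrior"}
unrelated_labels = {}  # some other container on the same host

print(is_watched(warrior_labels, "archiveteam-warrior"))    # True
print(is_watched(unrelated_labels, "archiveteam-warrior"))  # False
```

The practical upshot: other containers running on the same host are left alone, so this Watchtower instance only ever updates the Archiveteam setup.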

The configuration is stored in environment variables. I’ve set my nickname for the leaderboard to selfhostedheaven and I let the warrior automatically decide which project is the most urgent. At the moment for me, it is Reddit, but you can override this by setting SELECTED_PROJECT to one of the other available projects.

I run a total of 6 concurrent items, since that seems to be a good middle ground for the capabilities of my server and internet connection. Lower this if you have little storage space or bandwidth available.

Now that everything is set up, start it with my favorite command, docker-compose up -d, and you’re done! The warrior is now picking up work and helping to archive the internet.

Seeing the Warrior’s Activity

The warrior’s activity can be monitored using the web interface. You can access it in your browser at http://<ip_address_of_system>:8001. The interface is pretty simple to understand. There is really only one main screen where all the magic happens. On it you can see which project is currently being worked on, what each worker is doing and how much data has already been downloaded from the project and uploaded to the tracker.
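If the page does not load, it can help to check whether the port answers at all. Below is a minimal Python sketch for that, using only the standard library. The helper names warrior_url and is_reachable are made up for this example, and the address 127.0.0.1 assumes you run the script on the server itself; replace it with your server's IP otherwise.

```python
import urllib.error
import urllib.request


def warrior_url(host: str, port: int = 8001) -> str:
    """Build the web interface URL; 8001 is the port published in the Compose file."""
    return f"http://{host}:{port}/"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response comes back from the URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure and similar network errors.
        return False


if __name__ == "__main__":
    url = warrior_url("127.0.0.1")  # assumption: script runs on the server itself
    print(url, "is reachable" if is_reachable(url) else "is not reachable")
```

A "not reachable" result right after starting usually just means the container is still booting; if it persists, check that the port mapping and your firewall allow traffic on 8001.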

Monitor the Warrior on the Current Project Page

Currently, my warrior is working on archiving Reddit (do you already follow the selfhosted subreddit?), but you could choose one of the other projects Archiveteam is working on.

The current archival projects

Although you will not see your archived work directly, you can see how you are doing compared to others! Call it ‘gamification’, but it is pretty cool to watch all the activity from the workers fly by on the leaderboard.

Who has archived the most in this project?

Archive the Internet

I know this is a little bit different than my usual blog posts, but I feel it’s important to get as many people as possible to help archive the internet. Especially since censorship is at an all-time high and we need to be able to see the truth. Archiving is a small step in the big picture, but I believe it really helps in the long term.

Will you help archive the internet?
