Digitization of documents in Linux with paperless-ngx Part 1 – Installing and securing it

When I read about paperless-ngx, I liked the idea of having all my documents in central storage so that I could access them from all my devices. Furthermore, those documents would be indexed (also using OCR) so that I could run full-text searches across all of them. Thanks to the tagging system, if used correctly, exporting all my documents for my yearly tax declaration should take just seconds…

Installation

The installation of paperless-ngx is easy. However, there are a few stumbling blocks, because the installation guide currently does not work as-is. If you try the bare-metal installation within a VM (Debian Bookworm), you’ll run into trouble with some dependencies such as python-ipware. I did not try the installation on Debian Bullseye or Ubuntu, so it may work for you there, but it did not for me.

The installation procedure is described here: https://docs.paperless-ngx.com/setup/.

I think the easiest installation method is the Docker install script. However, this one will not just work as-is either. If you run the script as root, it will correctly tell you that you should not do so. And yes, you should not (blindly) run any script found on the internet as root unless you have checked it and fully understand what it does. However, you probably do not want to install paperless-ngx into your normal user account either. So you need to add a user first.

Here are my notes on how to set up paperless-ngx on Debian 12 Bookworm.

1. Install Docker-Engine

Please consult the docker documentation. (Hint: I use Docker Engine, not Docker Desktop – But your requirements might be different). There is also a Debian installation procedure page.

2. Add a user for paperless

adduser paperless --system --home /opt/paperless --group

# give the user paperless docker permissions
usermod -aG docker paperless
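
Before running the installer it can be worth confirming that the group change took effect. The following is a minimal sketch and not part of the official guide; the helper name has_word is made up for illustration.

```shell
#!/bin/sh
# has_word LIST WORD: succeed if WORD is one of the space-separated items in LIST.
# (Hypothetical helper, used here to look for a group name in `id -nG` output.)
has_word() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, assuming the "paperless" user and "docker" group from above exist:
#   has_word "$(id -nG paperless)" docker && echo "paperless is in the docker group"
#   id -u paperless; id -g paperless   # compare with the User ID / Group ID the install script shows later
```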

3. Run the install script using the previously created user

I used the command from the official documentation and only prefixed it with “sudo -Hu paperless”:

sudo -Hu paperless bash -c "$(curl --location --silent --show-error https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/install-paperless-ngx.sh)"

The installation script will ask you a few things. Here is what I set:

# Set the URL this will run on later, e.g.
URL []: https://documents.example.com

# I suggest to leave the default port. Later I use NGINX as a reverse-proxy which
# will forward to port 8000
Port [8000]:

# Not much to say about the timezone I guess...
Current time zone [Europe/Berlin]:

# use sqlite on a low-memory system; otherwise I would recommend postgres.
Database backend (postgres sqlite mariadb) [postgres]: 

# do you also want to feed paperless-ngx with documents like Word, Excel, Powerpoint...?
Enable Apache Tika? (yes no) [no]: yes

# Every language you add needs more resources. So only choose those you really
# need. I have documents in German, English, French and Arabic.
OCR language [eng]: deu+eng+fra+ara

# don't touch.
User ID [107]:
Group ID [115]:

# the user account needs to have access to this directory.
Target folder [/opt/paperless]: 

For the remaining settings just pick the defaults unless you know better.

4. Modify the installation

Something I completely miss in this short part of the documentation is how to modify the configuration of paperless-ngx if you used the Docker install script. I assume this is obvious to people who are used to Docker, but it is not to everyone else.

If you follow the steps above, there will be a docker-compose.env in /opt/paperless/paperless-ngx:

root@paperless:/opt/paperless/paperless-ngx# ls
consume docker-compose.env docker-compose.yml export

You can modify the configuration of paperless-ngx using this file. I added a few settings to it:

PAPERLESS_URL=https://documents.example.com
USERMAP_UID=107
USERMAP_GID=115
PAPERLESS_TIME_ZONE=Europe/Berlin
PAPERLESS_OCR_LANGUAGE=ara+deu+eng+fra
PAPERLESS_SECRET_KEY=------------CHANGEME----------------
PAPERLESS_OCR_LANGUAGES=ara deu eng fra
PAPERLESS_CONSUMER_RECURSIVE=true
PAPERLESS_PORT=8000

You can see a list of the possible values in the paperless-ngx documentation. I would recommend you first try WITHOUT touching any of these settings to get a feeling for and an understanding of what they do. However, I set:

OCR_LANGUAGES and OCR_LANGUAGE: mind that the first separates multiple languages with spaces, while the second uses a +. Also mind that multiple languages will require more resources.

CONSUMER_RECURSIVE to true, because I also want to throw folders into the consume directory.
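
Because the two OCR variables carry the same language codes in different formats, it is easy to mistype one of them. As a small convenience sketch (the function name ocr_install_list is my own, not something paperless-ngx provides), one value can be derived from the other:

```shell
#!/bin/sh
# ocr_install_list: turn the "+"-separated PAPERLESS_OCR_LANGUAGE value
# into the space-separated form used by PAPERLESS_OCR_LANGUAGES.
ocr_install_list() {
    printf '%s\n' "$1" | tr '+' ' '
}

# Example: print both lines ready to paste into docker-compose.env.
langs="ara+deu+eng+fra"
echo "PAPERLESS_OCR_LANGUAGE=$langs"
echo "PAPERLESS_OCR_LANGUAGES=$(ocr_install_list "$langs")"
```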

5. Run, Update, Stop

If you change the settings in docker-compose.env, you just need to restart the environment by issuing docker compose down followed by docker compose up -d. The -d switch detaches the containers; without it they run in the foreground.

First switch to the folder where the docker-compose.env is:

# cd /opt/paperless/paperless-ngx/

Then you can stop the environment using docker compose down:

root@paperless:/opt/paperless/paperless-ngx# sudo -Hu paperless docker compose down
[+] Running 6/6
 ✔ Container paperless-webserver-1  Removed                                                   6.9s 
 ✔ Container paperless-db-1         Removed                                                   0.3s 
 ✔ Container paperless-tika-1       Removed                                                   0.4s 
 ✔ Container paperless-gotenberg-1  Removed                                                  10.2s 
 ✔ Container paperless-broker-1     Removed                                                   0.4s 
 ✔ Network paperless_default        Removed                                                   0.3s

If you want to update paperless-ngx, stop it with docker compose down as above, then run docker compose pull:

root@paperless:/opt/paperless/paperless-ngx# sudo -Hu paperless docker compose pull
[+] Pulling 35/22
 ✔ webserver 17 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled                     10.8s 
 ✔ tika Pulled                                                                                0.5s 
 ✔ gotenberg Pulled                                                                           1.0s 
 ✔ db 13 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿]      0B/0B      Pulled                                  15.7s 
 ✔ broker Pulled                                                                              1.4s

And to start:

root@paperless:/opt/paperless/paperless-ngx# sudo -Hu paperless docker compose up -d
[+] Running 6/6
 ✔ Network paperless_default        Created                                                   0.1s 
 ✔ Container paperless-tika-1       Started                                                   0.1s 
 ✔ Container paperless-broker-1     Started                                                   0.1s 
 ✔ Container paperless-gotenberg-1  Started                                                   0.1s 
 ✔ Container paperless-db-1         Started                                                   0.1s 
 ✔ Container paperless-webserver-1  Started                                                   0.0s

You can also force the recreation of the containers by adding --force-recreate to the docker compose up -d command.
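
The stop, pull and start sequence can be wrapped in a small function. This is only a sketch under the assumptions of this article (user paperless, directory /opt/paperless/paperless-ngx, run as root); the function name update_paperless and the RUN dry-run variable are my own inventions.

```shell
#!/bin/sh
# update_paperless [DIR]: stop the stack, pull new images, start it again.
# Set RUN=echo to only print the commands instead of executing them (dry run).
update_paperless() {
    dir=${1:-/opt/paperless/paperless-ngx}   # assumed install location
    (
        cd "$dir" || exit 1
        ${RUN:-} sudo -Hu paperless docker compose down &&
        ${RUN:-} sudo -Hu paperless docker compose pull &&
        ${RUN:-} sudo -Hu paperless docker compose up -d
    )
}

# Usage:
#   RUN=echo update_paperless    # show what would be executed
#   update_paperless             # actually update
```

The subshell keeps the cd from changing the directory of your interactive session, and the && chain stops at the first failing step.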

Running paperless-ngx behind NGINX

The documentation shows how to use NGINX as a reverse proxy for paperless-ngx. This is a good starting point. However, if you deploy paperless-ngx on the Internet you may want to do a little bit more.

If possible, I would suggest using WireGuard, OpenVPN or an IPsec tunnel (strongSwan) and making this a requirement for connecting to the paperless-ngx instance, because you will usually have sensitive data and documents in paperless-ngx.

Port 8000 not bound to localhost – Attention!

I followed the guide as written on the paperless-ngx website and read it multiple times. Unless I am blind and missed something important, port 8000 is by default not bound to localhost on the container’s host but to all interfaces. An nmap scan from outside shows this:

root@fw2:/var/log/suricata# nmap xx.xx.xx.xx 
Starting Nmap 7.93 ( https://nmap.org ) at 2023-12-17 15:37 CET
Nmap scan report for xx.xx.xx.xx
Host is up (0.0016s latency).
Not shown: 996 closed tcp ports (reset)
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
443/tcp  open  https
8000/tcp open  http-alt
MAC Address: xx:xx:xx:xx:xx:xx (Mathtech)

Nmap done: 1 IP address (1 host up) scanned in 0.36 seconds

Netstat also shows this:

tcp        0      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      861/docker-proxy    
tcp6       0      0 :::8000                 :::*                    LISTEN      867/docker-proxy    

A wget/curl from outside to port 8000 also returns the login page of paperless-ngx. Obviously, we don’t want this. We want our NGINX to be the only system that can reach paperless-ngx.

Edit the docker-compose.yml in /opt/paperless/paperless-ngx and add the loopback address 127.0.0.1 to the line:

    ports:
      - "8000:8000"

below webserver: so that it looks like this:

    ports:
      - "127.0.0.1:8000:8000"

Then stop and start the docker environment and re-check using nmap / netstat…:

root@paperless:/opt/paperless/paperless-ngx# netstat -apn | grep :8000
tcp        0      0 127.0.0.1:8000          0.0.0.0:*               LISTEN      3578/docker-proxy   

Now your paperless-ngx will only be accessible through a reverse-proxy which you configure to use 127.0.0.1:8000.
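
To make the re-check repeatable, the listener output can be filtered for anything that is not bound to loopback. A minimal sketch, assuming the local address is in the fourth column (which is the case for both netstat -ltn and ss -ltn); the function name public_listeners is hypothetical.

```shell
#!/bin/sh
# public_listeners PORT: read netstat/ss listener lines on stdin and print
# every local address on PORT that is NOT bound to loopback only.
public_listeners() {
    awk -v port="$1" '
        {
            addr = $4
            if (addr !~ (":" port "$")) next                    # other port or header line
            if (addr ~ /^127\./ || addr ~ /^\[?::1\]?:/) next   # loopback is fine
            print addr
        }'
}

# Usage (no output means port 8000 is only reachable locally):
#   netstat -ltn | public_listeners 8000
#   ss -ltn | public_listeners 8000
```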

SSL/TLS

I would suggest using a rather strong TLS configuration: the strongest that your browsers and the devices you access paperless-ngx with allow. I added a tls.conf in /etc/nginx/conf.d/ with the following content:

#
# get a more up2date / better configuration from 
# https://ssl-config.mozilla.org/
#
ssl_session_timeout 1d;
ssl_session_cache shared:SSL:10m;
ssl_session_tickets off;

ssl_protocols TLSv1.3;
ssl_prefer_server_ciphers off;

# ssl_protocols above already disables TLSv1.2, which requires that your
# devices and browsers are modern enough. Mind that some of the online scan
# tools won't work to test your configuration because they simply do not
# support TLSv1.3, yet. To also pin the ciphers, uncomment the following
# (ssl_ciphers applies to TLSv1.2 and older, Ciphersuites to TLSv1.3).
#ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384;
#ssl_conf_command Ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256;

add_header Strict-Transport-Security "max-age=63072000" always;

# OCSP stapling
ssl_stapling on;
ssl_stapling_verify on;

# use your own resolver if possible. 
resolver 127.0.0.1;

If you use certbot to handle your configuration files, you may see that certbot adds additional configuration to your site. Those settings are fine, but they’re weaker than the settings above. So you may want to compare them and maybe comment them out in your site configuration:

#include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
#ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

Talking about SSL/TLS hardening, you may want to make your Let’s Encrypt certificate a bit stronger, for example by getting an ECDSA certificate with a stronger curve:

certbot renew --key-type ecdsa --elliptic-curve secp384r1 --cert-name documents.example.com --force-renewal

Did you know that you can check the key type of your certificates using the command certbot certificates?

root@paperless:/etc/letsencrypt/live# certbot certificates | grep "Key Type"
Saving debug log to /var/log/letsencrypt/letsencrypt.log
    Key Type: ECDSA
    Key Type: ECDSA

Headers

One thing you should be aware of is inheritance in NGINX when using add_header, because the configuration snippet in the paperless-ngx documentation adds the Referrer-Policy:

add_header Referrer-Policy "strict-origin-when-cross-origin";

in the location / {} block of the specific site. NGINX usually inherits add_header directives from the parent blocks, with one exception: as soon as a block contains even a single add_header directive of its own, nothing is inherited and you need to re-declare _all_ the headers in that block. Hence you also need to add the HTTP Strict Transport Security header in that location if you used my tls.conf:

add_header Referrer-Policy "strict-origin-when-cross-origin";
add_header Strict-Transport-Security "max-age=63072000" always;
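
Because this inheritance rule is easy to trip over, it helps to check the response headers after every change. The following is a rough sketch; the function name missing_headers is made up, and documents.example.com is the placeholder host used throughout this article.

```shell
#!/bin/sh
# missing_headers NAME...: read raw HTTP response headers on stdin and print
# each of the given header names that does not occur (case-insensitive).
missing_headers() {
    headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
    for name in "$@"; do
        lname=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        printf '%s\n' "$headers" | grep "^$lname:" >/dev/null || echo "$name"
    done
}

# Usage (no output means all listed headers are present):
#   curl -sI https://documents.example.com/ | missing_headers Strict-Transport-Security Referrer-Policy
```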

While we’re talking about headers, you may also take a look at the following headers.conf I’m using. Maybe you want to use it as well; check the linked resources:

#
# securityheaders
#

# see: https://scotthelme.co.uk/hardening-your-http-response-headers/#x-frame-options
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Frame-Options
#      https://infosec.mozilla.org/guidelines/web_security#x-frame-options
#add_header X-Frame-Options "SAMEORIGIN" always;

# see: https://scotthelme.co.uk/hardening-your-http-response-headers/#x-content-type-options
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options
#add_header X-Content-Type-Options "nosniff" always;

# see: https://scotthelme.co.uk/a-new-security-header-referrer-policy/
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referrer-Policy
#add_header Referrer-Policy "no-referrer" always;

# see: https://scotthelme.co.uk/goodbye-feature-policy-and-hello-permissions-policy/
#      https://github.com/w3c/webappsec-permissions-policy/blob/main/permissions-policy-explainer.md
#      https://github.com/w3c/webappsec-permissions-policy/blob/main/features.md
add_header Permissions-Policy "camera=(), microphone=()" always;

# see: https://scotthelme.co.uk/a-new-security-header-feature-policy/
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Feature-Policy
# For compatibility reasons, including Feature-Policy (the former header for
# Permissions-Policy) as well.
add_header Feature-Policy "microphone 'none'" always;

#
# mozilla observatory
#

# A setting of 0 disables this, and currently the observatory will reduce
# your points if you disable it. However, read the github issue - you want
# it disabled.
# see: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
#      https://github.com/mozilla/http-observatory/issues/432
add_header X-XSS-Protection 0 always;

# see: https://scotthelme.co.uk/content-security-policy-an-introduction/
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
#      https://developers.google.com/web/fundamentals/security/csp
# you REALLY want to check what this is doing BEFORE using it. 2nd link.
# also you may want to add some.
#add_header Content-Security-Policy "frame-ancestors 'self'; base-uri 'self'; form-action 'self'" always;

Include these headers in your site configuration. For example like this:

add_header Strict-Transport-Security "max-age=63072000" always;
add_header Referrer-Policy "strict-origin-when-cross-origin";
include conf.d/headers.conf;

CSP

Now let’s talk about CSP. If you took a look at my headers.conf above you probably saw the following:

# see: https://scotthelme.co.uk/content-security-policy-an-introduction/
#      https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
#      https://developers.google.com/web/fundamentals/security/csp
# you REALLY want to check what this is doing BEFORE using it. 2nd link.
# also you may want to add some.
#add_header Content-Security-Policy "frame-ancestors 'self'; base-uri 'self'; form-action 'self'" always;

This is CSP (the abbreviation for Content-Security-Policy). It lets the server send a header which tells the browser what is allowed and disallowed. Assuming someone somehow managed to inject inline JavaScript into paperless-ngx, your browser would block this JavaScript if your CSP says that inline JavaScript is not allowed.

As far as I know, every fetch directive in the CSP (script-src, style-src, img-src and so on) falls back to default-src; directives such as frame-ancestors, base-uri and form-action do not. So if you set default-src to 'none' you effectively block every resource type you did not explicitly allow. The other way around works as well: set default-src to e.g. 'self' and restrict whatever you do not want.

Now for paperless-ngx, you can’t simply use default-src 'self'; and ignore everything else. The web UI of paperless-ngx would (currently) give at least 23 errors (refused to load) due to inline scripts and inline styles. What worked for me was:

add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; img-src data: 'self'; upgrade-insecure-requests" always;

You may as well use your developer console, look up the hashes of all the inline elements and add them to the CSP. But you’ll have to redo this whenever the web UI changes (which might happen with every update).

The paperless-ngx documentation shows a CSP in its apache2 example. You may try to adapt that one. Whichever policy you work with, I strongly suggest using the developer tools and the network tab of your favorite browser to verify that nothing important is blocked.
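
A long policy on a single line is hard to review. As a small aid (my own sketch, the function name csp_directives is hypothetical), the header value can be split into one directive per line:

```shell
#!/bin/sh
# csp_directives: read a Content-Security-Policy value on stdin and print
# one directive per line, with surrounding whitespace removed.
csp_directives() {
    tr ';' '\n' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage against the running site (host name is the article's placeholder):
#   curl -sI https://documents.example.com/ \
#     | grep -i '^content-security-policy:' | cut -d: -f2- | csp_directives
```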

Robots.txt

This is a document management system. We really do not want search engines to index anything here. Paperless-ngx correctly includes the required HTML tags which forbid indexing. However, a robots.txt is missing. Whether a robots.txt makes sense nowadays is beyond the scope of this article, and crawlers may ignore our wish not to be indexed. But it does not hurt to define one. Here is an example of how to do it in NGINX without dealing with files:

location = /robots.txt {
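  # default_type sets the Content-Type of the returned text; add_header would
  # add a second Content-Type header and drop the inherited security headers.
  default_type text/plain;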
  return 200 "User-agent: AdsBot-Google\nUser-agent: *\nDisallow: /\n";
}

So the Site-Configuration may look like this:

server {
  server_name documents.example.com;

  add_header Strict-Transport-Security "max-age=63072000" always;
  add_header Referrer-Policy "strict-origin-when-cross-origin";
  include conf.d/headers.conf;

  location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: AdsBot-Google\nUser-agent: *\nDisallow: /\n";
  }

  location / {
    # Adjust host and port as required; 127.0.0.1 matches the Docker port binding.
    proxy_pass http://127.0.0.1:8000/;

    # These configuration options are required for WebSockets to work.
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
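    proxy_set_header X-Forwarded-Host $server_name;
    # needed if you set PAPERLESS_PROXY_SSL_HEADER (see the paperless-ngx settings)
    proxy_set_header X-Forwarded-Proto $scheme;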
  }

  listen 443 ssl; 
  ssl_certificate /etc/letsencrypt/live/documents.example.com/fullchain.pem; # managed by Certbot
  ssl_certificate_key /etc/letsencrypt/live/documents.example.com/privkey.pem; # managed by Certbot
  ssl_trusted_certificate /etc/letsencrypt/live/documents.example.com/chain.pem;

  #include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
  #ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
  if ($host = documents.example.com) {
    return 301 https://$host$request_uri;
  } # managed by Certbot

  listen 80;
  server_name documents.example.com;
  return 404; # managed by Certbot
}

Paperless-ngx settings

There are a few more settings you can use to further secure your paperless-ngx installation. In /opt/paperless/paperless-ngx, edit the docker-compose.env and set a long, random secret key in:

PAPERLESS_SECRET_KEY=
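
One possible way to produce such a key is to read random bytes from /dev/urandom and hex-encode them. This is just a sketch; the function name gen_secret and the length of 48 bytes are my own choices, not a requirement of paperless-ngx.

```shell
#!/bin/sh
# gen_secret: print 96 hex characters (48 random bytes from /dev/urandom).
gen_secret() {
    head -c 48 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Usage: print a line ready to paste into docker-compose.env.
#   echo "PAPERLESS_SECRET_KEY=$(gen_secret)"
```

If openssl is installed, openssl rand -hex 48 gives an equivalent result.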

Now set the URL of paperless-ngx using:

PAPERLESS_URL=https://documents.example.com

Finally set the IP of your NGINX reverse-proxy and 127.0.0.1 here:

PAPERLESS_TRUSTED_PROXIES=1.2.3.4,127.0.0.1

All communication from outside to the reverse proxy is forced to be HTTPS. Hence I can set the following. However, check the documentation before using it.

PAPERLESS_PROXY_SSL_HEADER=["HTTP_X_FORWARDED_PROTO", "https"]

Since we use a reverse proxy here, we should also tell paperless-ngx to honor the X-Forwarded-Host header it sends:

PAPERLESS_USE_X_FORWARD_HOST=true

Further securing

The documentation shows how to use fail2ban for further securing the stack. I’d suggest you follow that.
