From time to time I work on Django projects that offload two problems that Django solves reasonably well to third-party SaaS services: auth and files. Usually there is no need to purchase a SaaS subscription unless you have specific requirements that Django doesn't have an answer for, such as SSO & directory sync or the need for gigantic files.
I like to keep the stack as simple as possible in my projects. Simple, contained stacks are easier to hack on and contribute to, integration test, and run locally. As an added bonus, it keeps costs down.
In this article I'll explain how I solve the latter problem - file uploads. I've built file upload APIs with Django on more than one occasion and I'll try to synthesize what I've learned. The post below shows a few variants of how to build a file API using the Django REST Framework. It also details how to connect Django to a cloud storage backend for production-grade systems where you're deploying many Django servers.
I've created a sample repository with the examples at: https://github.com/danihodovic/blog-examples/tree/master/file_uploads_in_drf.
How does Django handle files?
Like most things web, Django provides a built-in solution for file uploads. The core upload handlers are the abstraction that takes care of the heavy lifting when it comes to reading, buffering and moving bits & bytes. The handlers are used regardless of whether you are writing function-based views, class-based views or DRF views.
By default Django has two types of upload handlers:
- MemoryFileUploadHandler for files smaller than 2.5MB (configurable)
- TemporaryFileUploadHandler for files larger than 2.5MB
Quoting the docs:
Before you save uploaded files, the data needs to be stored somewhere.
By default, if an uploaded file is smaller than 2.5 megabytes, Django will hold the entire contents of the upload in memory. This means that saving the file involves only a read from memory and a write to disk and thus is very fast.
However, if an uploaded file is too large, Django will write the uploaded file to a temporary file stored in your system’s temporary directory. On a Unix-like platform this means you can expect Django to generate a file called something like /tmp/tmpzfp6I6.upload. If an upload is large enough, you can watch this file grow in size as Django streams the data onto disk.
These specifics – 2.5 megabytes; /tmp; etc. – are “reasonable defaults” which can be customized as described in the next section.
The diagram above briefly describes what the upload handlers do depending on the file size. Smaller files are streamed to memory and copied chunk-by-chunk to the store (local file system storage by default), while larger files are chunked into a temporary file on disk. Django writes temporary files to ensure that the server doesn't run out of RAM when receiving large files.
For the curious, the buffering logic can be found in the upload handler classes, while the logic for moving the uploaded file into the FileSystemStorage lives in Django's storage code.
By default Django will reject HTTP requests larger than 2.5MB:
The amount of request data is correlated to the amount of memory needed to process the request and populate the GET and POST dictionaries. Large requests could be used as a denial-of-service attack vector if left unchecked. Since web servers don’t typically perform deep request inspection, it’s not possible to perform a similar check at that level.
You will probably want to bump that limit by changing DATA_UPLOAD_MAX_MEMORY_SIZE if you're building an upload API.
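Both limits live in settings.py. A minimal sketch with illustrative values (the exact numbers are assumptions, not recommendations):

# settings.py
# Maximum size of non-file request data (e.g. large JSON bodies) before Django
# raises RequestDataTooBig. The default is 2.5MB.
DATA_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024  # 10MB

# Threshold at which uploaded files spill from memory into a temporary file
# on disk. The default is 2.5MB.
FILE_UPLOAD_MAX_MEMORY_SIZE = 5 * 1024 * 1024  # 5MB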
Building a file upload API
Files rarely exist in isolation. They're tied to some kind of relational data. That could be a profile with a profile picture, an mp3 file with a band name or a film with a set of actors. This implies that we need an API for both the relational (JSON) data and an API for uploading the binary contents.
In this example let's assume we're working with a track model. The track has metadata in the form of a title, an artist_name and a release_date. It also has a file field representing the song.
from django.db import models


class Track(models.Model):
    title = models.CharField(max_length=100, null=True)
    artist_name = models.CharField(max_length=100, null=True)
    release_date = models.DateField(blank=True, null=True)
    file = models.FileField(null=True, blank=True, validators=[])
Our Track serializer:
from rest_framework import serializers

from .models import Track


class TrackSerializer(serializers.ModelSerializer):
    class Meta:
        model = Track
        fields = ["id", "title", "artist_name", "release_date", "file"]
        read_only_fields = ["id"]
Our viewset:
from rest_framework.viewsets import ModelViewSet

from .models import Track
from .serializers import TrackSerializer


class TrackViewset(ModelViewSet):
    queryset = Track.objects.all()
    serializer_class = TrackSerializer
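To expose the viewset we register it with a DRF router. A minimal sketch - the module paths and the api/tracks prefix are assumptions chosen to match the curl examples later on; the sample repository may wire things up slightly differently:

# urls.py
from django.urls import include, path
from rest_framework.routers import DefaultRouter

from .views import TrackViewset

router = DefaultRouter()
router.register("tracks", TrackViewset)

urlpatterns = [
    path("api/", include(router.urls)),
]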
The code above gives us a CRUD API for the Track model, but how about uploading files?
When I built my first API I found a few resources in the docs regarding DRF file uploads and the FileUploadParser, but I couldn't find examples on how to build an API that marries relational data with binary data. By reading the DRF source code and working through trial and error I came up with a solution that extends the ModelViewSet with an upload endpoint.
from rest_framework.decorators import action
from rest_framework.exceptions import ValidationError
from rest_framework.parsers import FileUploadParser
from rest_framework.response import Response
from rest_framework.viewsets import ModelViewSet

from .models import Track
from .serializers import TrackSerializer


class TrackViewset(ModelViewSet):
    """
    A viewset with a separate action to upload files
    """

    queryset = Track.objects.all()
    serializer_class = TrackSerializer

    @action(
        detail=True,
        methods=["POST"],
        parser_classes=[FileUploadParser],
        url_path=r"upload/(?P<filename>[a-zA-Z0-9_]+\.mp3)",
    )
    def upload(self, request, **kwargs):
        track = self.get_object()
        if "file" not in request.data:
            raise ValidationError("There is no file in the HTTP body.")
        file = request.data["file"]
        track.file.save(file.name, file)
        return Response(TrackSerializer(track).data)
This adds an API endpoint /api/tracks/{id}/upload/{filename}. The workflow is roughly:
- Create the Track instance and populate it with metadata: the title and the artist_name.
- Send a binary payload to the upload endpoint, which sets the file field for the track.
The devil is in the details:
- we want detail=True so that we can leverage the get_object() method to find the Track referred to in the url or return a 404
- we want to use the FileUploadParser to parse the incoming binary data instead of the default JSON parser
- we set the url_path parameter to capture the filename in the url
In the example above I'm using the FileUploadParser to handle a raw payload in the /upload/ endpoint. However DRF recommends using the MultiPartParser if you're dealing with browsers:
The FileUploadParser is for usage with native clients that can upload the file as a raw data request. For web-based uploads, or for native clients with multipart upload support, you should use the MultiPartParser instead.
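For reference, here's what a multipart-friendly variant of the action could look like. This sketch isn't in the sample repository; it assumes it lives on the same TrackViewset as above, and the rest of the post sticks with the raw FileUploadParser endpoint.

from rest_framework.parsers import MultiPartParser


# An action on the same TrackViewset as above
@action(
    detail=True,
    methods=["POST"],
    parser_classes=[MultiPartParser],
    url_path="upload-multipart",
)
def upload_multipart(self, request, **kwargs):
    track = self.get_object()
    if "file" not in request.data:
        raise ValidationError("There is no file in the HTTP body.")
    file = request.data["file"]
    # MultiPartParser reads the filename from the form part itself,
    # so it doesn't need to be captured in the url
    track.file.save(file.name, file)
    return Response(TrackSerializer(track).data)

A browser form or a curl -F request would hit this endpoint with a regular multipart/form-data body.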
We can test the view by using curl and a sample file from the Github project.
Start the development server:
cd ~/blog-examples/file_uploads_in_drf
./manage.py runserver_plus
Create the Track model instance:
curl -XPOST 'localhost:8000/api/tracks/' --data '{"title": "To The Moon & Back", "artist_name": "Savage Garden"}' -H 'Content-Type: application/json'
{
    "id": 4,
    "title": "To The Moon & Back",
    "artist_name": "Savage Garden",
    "release_date": null,
    "file": null
}
Use the ID returned from the previous request to upload a sample file:
curl -XPOST 'localhost:8000/api/tracks/4/upload/my_song.mp3/' -F 'file=@sample_audio_1MB.mp3'
{
    "id": 4,
    "title": "To The Moon & Back",
    "artist_name": "Savage Garden",
    "release_date": null,
    "file": "/media/my_song.mp3"
}
Bam, we successfully uploaded a file. We can verify it works by downloading it using curl:
curl http://localhost:8000/media/my_song.mp3
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
curl http://localhost:8000/media/my_song.mp3 -o /tmp/my-song.mp3
Cloud Storage (S3, GCS, B2)
Django by default stores the files on the local file system where the server process runs. This works great if you're running a single-server setup, but as soon as you start scaling horizontally you will want to swap out the storage backend. The Django community provides a number of open-source storage backends which I keep track of here: https://django.wtf/category/storage-backend/.
Most commercial Django projects I come across these days use Amazon S3 or some kind of S3 derivative. It's cheap, durable and scalable. django-storages is the most popular community package to integrate with S3-like stores. It allows you to store files in any backend that implements the S3 API, such as:
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- Backblaze B2
- Minio (local and self hosted S3)
- Cloudflare R2
- Oracle Cloud Storage
In addition to the above, django-storages integrates with a number of other storage backends.
The main reasons I like django-storages are:
- it's a bedrock of the Django community. It's been maintained for more than 15 years and has around 250 contributors. It's a project I trust.
- it supports a myriad of cloud providers, which allows a development team to swap providers whenever one hikes its prices. It's an effective tool to fight vendor lock-in.
In the sample code repository I'll demo how to integrate django-storages with Minio, a self-hosted, open-source S3-compatible store. I often use it for local development and CI in place of paid cloud storage.
Django allows you to use multiple file storage backends for different models. In the example below I'll use the FileSystemStorage for the Track model, while the TrackS3 model will store files in Minio.
# settings.py
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
    "minio": {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "bucket_name": "tracks",
            "endpoint_url": "http://localhost:9000",
            "access_key": "minioadmin",  # default Minio credentials
            "secret_key": "minioadmin",  # see docker-compose.yaml
            "default_acl": "private",
            "signature_version": "s3v4",
        },
    },
}
I'll set the storage argument to point to the Minio storage in the FileField:
# models.py
from django.core.files.storage import storages
from django.db import models


class TrackS3(models.Model):
    title = models.CharField(max_length=100)
    artist_name = models.CharField(max_length=100)
    file = models.FileField(null=True, blank=True, storage=storages["minio"])
Apply the changes:
./manage.py makemigrations
./manage.py migrate
./manage.py shell_plus
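As a quick sanity check from the shell_plus session, you can confirm that the new field points at the Minio storage rather than the default backend:

# shell_plus auto-imports the models
TrackS3._meta.get_field("file").storage
# -> an S3Storage instance using the "minio" options, not FileSystemStorage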
Finally, create a separate viewset which swaps out the Track model for the TrackS3 model, as I've done in the sample repository.
Create a new TrackS3 model instance via curl:
curl -XPOST 'localhost:8000/api/tracks-s3-backend/' \
--data '{"title": "To The Moon & Back", "artist_name": "Savage Garden"}' \
-H 'Content-Type: application/json'
Upload a file using the Django API:
curl -XPOST 'localhost:8000/api/tracks-s3-backend/1/upload/my_song.mp3/' \
-F 'file=@sample_audio_1MB.mp3' -i
Logging in to the local Minio admin dashboard at http://localhost:9001 with the credentials minioadmin:minioadmin we can verify that the file was indeed stored in Minio.
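If you'd rather verify from code than from the dashboard, the same boto3 client that django-storages uses under the hood can list the bucket contents. A quick sketch to run from shell_plus (not part of the sample repository):

from django.core.files.storage import storages

storage = storages["minio"]
client = storage.connection.meta.client
response = client.list_objects_v2(Bucket=storage.bucket_name)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])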
Alternative API: Pre-signed S3 urls
Now let's say that you win the lottery and write a Django app that goes viral. You have hundreds of thousands of users uploading files to your website and the servers start falling over because there is too much traffic to handle.
While the django-storages S3 backend implements efficient Boto streaming of the uploaded files, you're still receiving a boatload of data, which would require horizontal scaling of servers and bankrupt you in cloud bills.
Does Django have an answer for you? No, but S3 does.
S3 has the ability to generate presigned urls which can be used for direct uploads. Django asks S3 for a signed url which authenticates the file upload on Django's behalf. All of the authentication details are stored in the url together with the key pointing to the file. The user uploads the file using the url, completely bypassing our Django server. By default the url has an expiration time of an hour.
Here's the code that creates a presigned url for file uploads:
./manage.py shell_plus
track = TrackS3.objects.last()
track.file.storage.connection.meta.client.generate_presigned_url(
    "put_object",
    Params={
        "Bucket": track.file.storage.bucket_name,
        "Key": "mykey",
    },
)
'http://localhost:9000/tracks/mykey?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20240611%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240611T192047Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=87a0808d4934075fde68531725f2ca9371159fe3e3852e213fb30b3105f7613d'
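If an hour is too long or too short, boto3's ExpiresIn argument (in seconds) controls the expiry of the generated url:

track.file.storage.connection.meta.client.generate_presigned_url(
    "put_object",
    Params={
        "Bucket": track.file.storage.bucket_name,
        "Key": "mykey",
    },
    ExpiresIn=600,  # 10 minutes instead of the default 3600 seconds
)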
We can now use curl to directly upload the file to S3:
$ curl -F 'file=@sample_audio_1MB.mp3' -i -XPUT http://localhost:9000/tracks/mykey\?X-Amz-Algorithm\=AWS4-HMAC-SHA256\&X-Amz-Credential\=minioadmin%2F20240611%2Fus-east-1%2Fs3%2Faws4_request\&X-Amz-Date\=20240611T192047Z\&X-Amz-Expires\=3600\&X-Amz-SignedHeaders\=host\&X-Amz-Signature\=87a0808d4934075fde68531725f2ca9371159fe3e3852e213fb30b3105f7613d
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
X-Amz-Id-2: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
X-Amz-Request-Id: 17D8099144F36A11
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 11 Jun 2024 19:22:57 GMT
django-storages uses the relational data stored in your SQL database to point at the key in the S3 bucket. If we're bypassing Django altogether during upload, we won't bind the metadata in SQL to the S3 object. This leads to the major downside of presigned urls: we have to assume that the user will indeed upload the file eventually using the presigned url, because there isn't a simple way to check.
Below is an example in the form of an API that generates presigned upload urls:
from uuid import uuid4

from rest_framework.decorators import action
from rest_framework.response import Response


# An action on the viewset serving the TrackS3 model
@action(
    detail=True,
    methods=["POST"],
    url_path="presigned-url",
)
def presigned_url(self, _request, **_kwargs):
    track = self.get_object()
    bucket = track.file.storage.bucket_name
    key = str(uuid4())
    upload_url = track.file.storage.connection.meta.client.generate_presigned_url(
        "put_object", Params={"Bucket": bucket, "Key": key}
    )
    track.file.name = key
    track.save()
    return Response({"url": upload_url})
In the example above we:
- find the track in question or return a 404 (by calling self.get_object())
- generate a unique S3 key
- create a presigned url
- update the track to assume that the key in S3 already exists
- return the presigned url to the user
We use a UUID because we don't want to overwrite an existing key in S3. Quoting the S3 docs:
When someone uses the URL to upload an object, Amazon S3 creates the object in the specified bucket. If an object with the same key that is specified in the presigned URL already exists in the bucket, Amazon S3 replaces the existing object with the uploaded object. After upload, the bucket owner will own the object.
curl -XPOST 'localhost:8000/api/tracks-s3-backend/<track-id>/presigned-url/'
{
    "url": "http://localhost:9000/tracks/776b0829-ed65-4bd0-967d-551e9adbe739?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20240611%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240611T193707Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=3b73f32daec5958769769c8bf71a6a38f14b135d1f225f1ae6c2f6cf2e3cc4b6"
}
Using the returned url, we can then upload the file:
$ curl -F 'file=@sample_audio_1MB.mp3' -i -XPUT http://localhost:9000/tracks/776b0829-ed65-4bd0-967d-551e9adbe739\?X-Amz-Algorithm\=AWS4-HMAC-SHA256\&X-Amz-Credential\=minioadmin%2F20240611%2Fus-east-1%2Fs3%2Faws4_request\&X-Amz-Date\=20240611T193707Z\&X-Amz-Expires\=3600\&X-Amz-SignedHeaders\=host\&X-Amz-Signature\=3b73f32daec5958769769c8bf71a6a38f14b135d1f225f1ae6c2f6cf2e3cc4b6
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
The resulting file shows up in Minio with a unique ID.
However, if the user doesn't upload a file to the presigned url, trying to access it will result in a 404:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
  <Code>NoSuchKey</Code>
  <Message>Key not found</Message>
</Error>
There are a few potential workarounds for this:
- use the less popular django-s3-storages project, which claims to support S3 upload urls. I'm not sure how they do this.
- configure SQS events when new objects are created and notify Django. This locks you in to S3, since S3-compatible backends don't support it, and it seems like an overly complicated solution.
- periodically scan the S3 bucket and update the relational data if matching keys are found (a rough sketch is shown below).
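Here's a rough sketch of the periodic scan, e.g. run from a management command or a scheduled task. It assumes the TrackS3 model and Minio storage from above; whether you clear the dangling reference or flag it some other way is up to you:

from botocore.exceptions import ClientError

from .models import TrackS3


def reconcile_uploads():
    for track in TrackS3.objects.exclude(file=""):
        storage = track.file.storage
        client = storage.connection.meta.client
        try:
            # head_object raises ClientError if the key doesn't exist in the bucket
            client.head_object(Bucket=storage.bucket_name, Key=track.file.name)
        except ClientError:
            # The presigned url was never used - clear the dangling reference
            track.file = None
            track.save(update_fields=["file"])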
Validating files
The upload API example works great if you trust the sender to pass a valid file each time. That's usually not the case and we have to perform some kind of file validation.
Luckily DRF integrates well with Django and any validators set on the model are automatically called if you use the ModelSerializer.
When you're using ModelSerializer all of this is handled automatically for you. If you want to drop down to using Serializer classes instead, then you need to define the validation rules explicitly.
Suppose that we want to validate the file type and file size of the uploaded file. Let's add basic validation to the model:
from django.core.exceptions import ValidationError
from django.core.validators import FileExtensionValidator
from django.db import models


def validate_file_size(file_obj):
    max_size = 1024 * 1024
    if file_obj.size > max_size:
        raise ValidationError(f"File is larger than {max_size=}")


class Track(models.Model):
    title = models.CharField(max_length=100)
    artist_name = models.CharField(max_length=100)
    release_date = models.DateField(blank=True, null=True)
    file = models.FileField(
        null=True,
        blank=True,
        validators=[
            FileExtensionValidator(allowed_extensions=["mp3"]),
            validate_file_size,
        ],
    )
We have to modify the upload method to invoke validation. Calling serializer.is_valid(raise_exception=True) will automatically call any model-level validators.
@action(
    detail=True,
    methods=["POST"],
    parser_classes=[FileUploadParser],
    url_path=r"upload/(?P<filename>[a-zA-Z0-9_]+\.mp3)",
)
def upload(self, request, **kwargs):
    track = self.get_object()
    if "file" not in request.data:
        raise ValidationError("Empty POST data - missing file in the request")
    # Validation below
    track_data = TrackSerializer(track).data
    track_data["file"] = request.data["file"]
    serializer = TrackSerializer(instance=track, data=track_data)
    serializer.is_valid(raise_exception=True)
    serializer.save()
    return Response(TrackSerializer(track).data)
Sending a file that's too large returns a 400 indicating the error on the file field.
curl -XPOST 'localhost:8000/api/tracks/4/upload/my_song.mp3/' -F 'file=@sample_audio_1MB.mp3' -i
HTTP/1.1 400 Bad Request
{
    "file": [
        "File is larger than max_size=1048576"
    ]
}
Documenting the Upload API
One of the main advantages of DRF is automatically generated documentation. The default documentation is pretty good, but drf-spectacular is even better. It generates Swagger UI docs (as seen below) and ReDoc pages.
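If drf-spectacular isn't wired up yet, the setup roughly follows the drf-spectacular docs (adapt the paths to your project; the schema urls below are the conventional ones, not necessarily what the sample repository uses):

# settings.py
INSTALLED_APPS = [
    # ...
    "drf_spectacular",
]

REST_FRAMEWORK = {
    "DEFAULT_SCHEMA_CLASS": "drf_spectacular.openapi.AutoSchema",
}

# urls.py
from django.urls import path
from drf_spectacular.views import SpectacularAPIView, SpectacularSwaggerView

urlpatterns += [
    path("api/schema/", SpectacularAPIView.as_view(), name="schema"),
    path(
        "api/schema/swagger-ui/",
        SpectacularSwaggerView.as_view(url_name="schema"),
        name="swagger-ui",
    ),
]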
While drf-spectacular by default generates documentation for CRUD endpoints, it doesn't handle custom methods. We'll have to manually decorate our upload function:
from drf_spectacular.utils import extend_schema


@extend_schema(
    operation_id="upload_track",
    request={"application/octet-stream": bytes},
    responses={204: TrackSerializer},
)
@action(
    detail=True,
    methods=["POST"],
    parser_classes=[FileUploadParser],
    url_path=r"upload/(?P<filename>[a-zA-Z0-9_]+\.mp3)",
)
def upload(self, request, **kwargs):
    pass
Access the docs at your project's Swagger UI url. The generated docs look pretty good.