This post is a follow-up to the post on working with local filesystems in Python.
This brief post will describe the benefits of cloudpathlib
in working with remote filesystems.
Creating a Remote Filesystem
We need a remote filesystem to work with. cloudpathlib can be used against Azure, Google Cloud Storage,
or S3-like environments. I haven’t got access to an AWS S3 bucket, and you might not have access either, so to make
this post reproducible we’ll spin up a simple Minio instance. Minio is a lot of things, but for the purposes of this
post we can think of it as a local S3-like filesystem.
The following two commands:
- Pull the Minio docker image.
- Run the Minio server locally, exposing the necessary ports.
$ docker pull minio/minio
$ docker run -p 9000:9000 -p 9001:9001 minio/minio server data --console-address ":9001"
The terminal will print out the default admin credentials: user minioadmin, password minioadmin.
Let’s create a bucket named test to work with. We’ll do this by logging into the Minio Console, which the command above exposes on port 9001.
Getting Started
To install the package, use the following command. The [s3] extra means we’ll also install the dependencies required to work
with S3-like systems.
pip install cloudpathlib[s3]
cloudpathlib has a fairly intuitive interface. We define an S3Client with the endpoint_url and
credentials needed to access our Minio instance. The next line sets this client to be the default global client for all
subsequent filesystem operations, but we can of course take a more fine-grained approach and use different clients for
different operations. The final line defines a CloudPath linked to the previously created test bucket.
from cloudpathlib import S3Client, CloudPath

client = S3Client(
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    aws_session_token=None,
)
client.set_as_default_client()
test_bucket = CloudPath("s3://test")
Simple Operations
Now we’ll join this bucket to test.txt. This path doesn’t exist yet, which we can check with the .exists() method.
test_path = test_bucket.joinpath('test.txt')
test_path.exists()
>>> False
From the code snippet above, we can see that this is very similar to the pathlib interface: the joinpath() method is
the same. The line below writes the string ‘This is a test.’ out to the path.
test_path.write_text('This is a test.')
The code snippet below then reads the text back from that file and checks that the path now exists.
test_path.read_text()
>>> 'This is a test.'
test_path.exists()
>>> True
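For comparison, here is a minimal sketch of the same sequence against the local filesystem, using only the standard library’s pathlib. The temporary directory standing in for the bucket, and the variable names, are assumptions made purely for illustration; the methods joinpath(), exists(), write_text() and read_text() are the same ones called on the CloudPath above.

```python
import tempfile
from pathlib import Path

# A temporary local directory stands in for the bucket in this comparison.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_root = Path(tmp_dir)
    local_path = local_root.joinpath("test.txt")

    existed_before = local_path.exists()  # nothing has been written yet
    local_path.write_text("This is a test.")
    contents = local_path.read_text()
    existed_after = local_path.exists()

print(existed_before, contents, existed_after)
```

Swapping local_root for test_bucket is the only change needed to point the same calls at the remote filesystem.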
Example Use
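cloudpathlib works intuitively with pandas for reading and writing. Reading, assuming an iris.csv file has already been uploaded to the test bucket:
import pandas as pd

iris_path = test_bucket.joinpath('iris.csv')
iris_df = pd.read_csv(iris_path)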
Writing is done via a simple context manager.
iris_out_path = test_bucket.joinpath('iris_out.csv')
with iris_out_path.open('w+') as f:
    iris_df.to_csv(f)
Caching
cloudpathlib comes with caching as standard.
For instance, when getting a file, cloudpathlib will check whether the newest version of it already exists in the local cache.
If it does, the file will be taken from the local cache, saving time. If not, the file will be fetched from the remote
filesystem.
Let’s use the small test.txt file as an example.
%%time
with test_path.open("rb") as f:
    f
The first time we get it from Minio it takes ~60ms.
CPU times: user 32.1 ms, sys: 4.51 ms, total: 36.6 ms
Wall time: 61.7 ms
This halves to ~30ms on the second retrieval, because the local cache is used.
CPU times: user 17.1 ms, sys: 2.83 ms, total: 19.9 ms
Wall time: 29.8 ms
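The check described above can be sketched as a toy model. This is not how cloudpathlib implements its cache; it is a minimal illustration of the idea, assuming freshness is decided by comparing modification times. A local directory stands in for the remote filesystem, and the names fetch_with_cache, remote_dir and cache_dir are hypothetical.

```python
from __future__ import annotations

import os
import shutil
import tempfile
from pathlib import Path


def fetch_with_cache(remote_file: Path, cache_dir: Path) -> tuple[Path, bool]:
    """Return (cached_path, cache_hit), refreshing the cached copy if it is stale."""
    cached_file = cache_dir / remote_file.name
    is_fresh = (
        cached_file.exists()
        and cached_file.stat().st_mtime >= remote_file.stat().st_mtime
    )
    if not is_fresh:
        # copy2 preserves the modification time, so the next check is a hit.
        shutil.copy2(remote_file, cached_file)
    return cached_file, is_fresh


with tempfile.TemporaryDirectory() as tmp_dir:
    remote_dir = Path(tmp_dir, "remote")  # stand-in for the bucket
    cache_dir = Path(tmp_dir, "cache")
    remote_dir.mkdir()
    cache_dir.mkdir()

    remote_file = remote_dir / "test.txt"
    remote_file.write_text("This is a test.")

    _, first_hit = fetch_with_cache(remote_file, cache_dir)   # cache is empty
    _, second_hit = fetch_with_cache(remote_file, cache_dir)  # served from cache

    # Simulate a remote update: new contents and a later modification time.
    remote_file.write_text("This is an updated test.")
    newer = remote_file.stat().st_mtime + 10
    os.utime(remote_file, (newer, newer))
    cached_file, third_hit = fetch_with_cache(remote_file, cache_dir)
    refreshed_text = cached_file.read_text()

print(first_hit, second_hit, third_hit, refreshed_text)
```

The second call skips the copy entirely, which mirrors why the second retrieval in the timings above is faster.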
AnyPath
AnyPath is a virtual superclass of the pathlib Path and the cloudpathlib CloudPath classes.
It’s great when the input path could be local or remote.
For instance, when it’s presented with an S3 path it returns an S3Path.
In[1]: AnyPath('s3://test/test.txt')
Out[1]: S3Path('s3://test/test.txt')
When presented with a local path, a local pathlib PosixPath is returned.
In[2]: AnyPath('words.md')
Out[2]: PosixPath('words.md')
Both S3Path and PosixPath support many of the same operations, e.g. .exists().
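Because of this shared interface, a function can be written once and handed either kind of path. The sketch below is exercised with a local pathlib.Path only; the helper name describe and the file contents are hypothetical. It relies on .name, .exists() and .read_text(), so passing it an S3Path should behave the same way, assuming a configured client.

```python
import tempfile
from pathlib import Path


def describe(path) -> str:
    """Summarise a path using only members shared by Path and CloudPath."""
    if not path.exists():
        return f"{path.name}: missing"
    return f"{path.name}: {len(path.read_text())} characters"


with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = Path(tmp_dir, "words.md")
    missing_summary = describe(words_path)  # file has not been created yet
    words_path.write_text("This is a test.")
    present_summary = describe(words_path)

print(missing_summary)
print(present_summary)
```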
Summary
Overall, cloudpathlib is an intuitive way of working with remote filesystems. Its main strength
is its compatibility with pathlib. Check out the cloudpathlib docs for extra features that
weren’t explored here, like the built-in mocking functions, which make it easier to test code that works with remote
filesystems.