Issue
I have a little sqlite database(110kb) in a s3 bucket. I want to connect to that database every time I run my python application.
An option is just to download database everytime I run the python application and connect it as normally. But I want to know if exists a way to connect to that sqlite database through memory, using S3FileSystem
and open
. I'm using sqlite3
library and python 3.6
Solution
As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.
However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.
SQLite has a concept of an OS Interface or VFS. Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project and I've included the implementation below.
In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.
I should note this isn't optimized at all, so will likely still require reading full databases from S3 depending on your queries. But doesn't sound like an issue in this specific case.
S3FS = S3VFS() # S3VFS defined below
# This odd format is used due to SQLite requirements
sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
S3_PREFIX,
DATABASE_NAME,
S3_BUCKET
)
connection = apsw.Connection(sqlite_uri,
flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
vfs=S3FS.vfsname
)
cursor = connection.cursor()
Once you have the cursor, you can execute standard SQL statements like so:
for x,y,z in cursor.execute("select x,y,z from foo"):
print (cursor.getdescription()) # shows column names and declared types
print (x,y,z)
VFS Implementation (requires APSW library and boto3 for S3 connectivity)
import apsw
import sys
import boto3
VFS_S3_CLIENT = boto3.client('s3')
class S3VFS(apsw.VFS):
def __init__(self, vfsname="s3", basevfs=""):
self.vfsname=vfsname
self.basevfs=basevfs
apsw.VFS.__init__(self, self.vfsname, self.basevfs)
def xOpen(self, name, flags):
return S3VFSFile(self.basevfs, name, flags)
class S3VFSFile():
def __init__(self, inheritfromvfsname, filename, flags):
self.bucket = filename.uri_parameter("bucket")
self.key = filename.filename().lstrip("/")
print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))
def xRead(self, amount, offset):
response = VFS_S3_CLIENT.get_object(Bucket=self.bucket, Key=self.key, Range='bytes={}-{}'.format(offset, offset + amount))
response_data = response['Body'].read()
return response_data
def xFileSize(self):
client = boto3.client('s3')
response = client.head_object( Bucket=self.bucket, Key=self.key)
return response['ContentLength']
def xClose(self):
pass
def xFileControl(self, op, ptr):
return False
def _get_s3_url(self):
return "s3://{}/{}".format(self.bucket, self.key)
Answered By - dacort
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.