Personal Data Warehouses: Reclaiming Your Data


I gave a talk yesterday about personal data warehouses for GitHub’s OCTO Speaker Series, focusing on my Datasette and Dogsheep projects. The video of the talk is now available, and I’m presenting it here along with an annotated summary of the talk, including links to demos and further information.

There’s a short technical glitch with the screen sharing in the first couple of minutes of the talk. I’ve added screenshots to the notes which show what you would have seen if my screen had been correctly shared.

Simon Willison - FOSS Developer and Consultant, Python, Django, Datasette

I’m going to be talking about personal data warehouses: what they are, why you want one, how to build them and some of the interesting things you can do once you’ve set one up.

I’m going to start with a demo.

Cleo wearing a very fine Golden Gate Bridge costume with a prize rosette attached to it

This is my dog, Cleo, when she won first place in a dog costume competition here, dressed as the Golden Gate Bridge!

All of my checkins on a map

So the question I want to answer is: How much of a San Francisco hipster is Cleo?

I can answer it using my personal data warehouse.

I have a database of ten years’ worth of my checkins on Foursquare Swarm, generated using my swarm-to-sqlite tool. Every time I check in somewhere with Cleo I use the Wolf emoji in the checkin message.

All of Cleo's checkins on a map

I can filter for just the checkins where the checkin message contains the wolf emoji.

Which means I can see just her checkins: all 280 of them.

Cleo's top categories

If I facet by venue category, I can see she’s checked in at 57 parks, 32 dog runs, 19 coffee shops and 12 organic groceries.
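As a rough sketch of what that filter-then-facet step amounts to in SQL, the following uses Python's built-in sqlite3 module against a tiny in-memory table. The table name `checkins`, its columns `shout` and `category`, and the sample rows are all made up for illustration; they are not the actual schema produced by swarm-to-sqlite.

```python
import sqlite3

# Hypothetical, simplified schema: one row per checkin.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkins (id INTEGER PRIMARY KEY, shout TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO checkins (shout, category) VALUES (?, ?)",
    [
        ("Morning walk 🐺", "Park"),
        ("Fetch! 🐺", "Park"),
        ("Latte time 🐺", "Coffee Shop"),
        ("Lunch with friends", "Restaurant"),  # no wolf emoji, so not counted
    ],
)


def facet_by_category(conn, emoji="🐺"):
    """Count checkins whose message contains the emoji, grouped by category."""
    sql = """
        SELECT category, COUNT(*) AS n
        FROM checkins
        WHERE shout LIKE '%' || ? || '%'
        GROUP BY category
        ORDER BY n DESC, category
    """
    return conn.execute(sql, [emoji]).fetchall()


facets = facet_by_category(conn)
print(facets)
```

Datasette builds this kind of query for you when you click a facet; the point of the sketch is only that a facet is a filtered `GROUP BY` with a count.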

A map of coffee shops that Cleo has been to

Then I can facet by venue category and filter down to just her 19 checkins at coffee shops.

Turns out she’s a Blue Bottle girl at heart.

Being able to build a map of the coffee shops that your dog likes is obviously a very valuable reason to build your own personal data warehouse.

The Datasette website

Let’s take a step back and talk about how this demo works.

The key to this demo is this web application I’m running called Datasette. I’ve been working on this project for three years now, and the goal is to make it as easy and cheap as possible to explore data in all sorts of shapes and sizes.

A screenshot of the Guardian Data Blog

Ten years ago I was working for the Guardian newspaper in London. One of the things I realized when I joined the organization is that newspapers collect enormous amounts of data. Any time they publish a chart or map in the newspaper, someone has to collect the underlying information.

There was a journalist there called Simon Rogers who was a wizard at collecting any data you could think to ask for. He knew exactly where to get it from, and had collected a huge number of brilliant spreadsheets on his desktop computer.

We decided we wanted to publish the data behind the stories. We started something called the Data Blog, and aimed to accompany our stories with the raw data behind them.

A Google Sheet containing US public debt figures since 2001

We ended up using Google Sheets to publish the data. It worked, but I always felt like there should be a better way to publish this kind of structured data in a way that was as useful and flexible as possible for our audience.

Serverless hosting? Scale to Zero. ... but databases cost extra!

Fast forward to 2017, when I was looking into this new thing called “serverless” hosting, in particular one provider called Zeit Now, which has since rebranded as Vercel.

My favourite aspect of serverless is “scale to zero”: the idea that you only pay for hosting when your project is receiving traffic.

If you’re like me, and you love building side-projects but you don’t like paying $5/month for them for the rest of your life, this is perfect.

The catch is that serverless providers tend to charge you extra for databases, or require you to buy a hosted database from another provider.

But what if your database doesn’t change? Can you bundle your database in the same container as your code?

This was the initial inspiration behind creating Datasette.

A GitHub repository containing the Global Power Plant Database

Like many groups, the maintainers of the Global Power Plant Database publish that data on GitHub.

A Datasette instance showing power plants faceted by country and primary fuel

I have a script that grabs their most recent data and publishes it using Datasette.

Here are the contents of their CSV file, published using Datasette.

Datasette supports plugins. You’ve already seen this plugin in my demo of Cleo’s coffee shops: it’s called datasette-cluster-map, and it works by looking for tables with a latitude and longitude column and plotting the data on a map.

A zoomed in map showing two power plants in Antarctica

Straight away, looking at this data, you notice that there are a couple of power plants down here in Antarctica. This is McMurdo station, and it has a 6.6MW oil generator.

And oh look, there’s a wind farm down there too on Ross Island, knocking out 1MW of electricity.

a screen full of JSON

And anything I can see in the interface, I can get out as JSON. Here’s a JSON file showing all of those nuclear power plants in France.

A screen full of CSV

And here’s a CSV export which I can use to pull the data into Excel or other CSV-compatible software.

An interface for editing a SQL query

I can click “view and edit SQL” to get back the SQL query that was used to generate the page, and I can edit and re-execute that query.

I can get those custom results back as CSV or JSON as well!

Results of a custom SQL query

In most web applications this would be seen as a terrifying security hole: it’s a SQL injection attack, as a documented feature!

A couple of reasons this isn’t a problem here:

Firstly, this is set up as a read-only database: INSERT and UPDATE statements that would modify it are not allowed. There’s a one second time limit on queries as well.

Secondly, everything in this database is designed to be published. There are no password hashes or private user data that could be exposed here.

This also means we have a JSON API that lets JavaScript execute SQL queries against a backend! This turns out to be really useful for rapid prototyping.

The SQLite home page

It’s worth talking about the secret sauce that makes this all possible.

This is all built on top of SQLite. Everyone watching this talk uses SQLite every day, even if you don’t know it.

Most iPhone apps use SQLite, many desktop apps do, and it’s even running inside my Apple Watch.

One of my favourite features is that a SQLite database is a single file on disk. This makes it easy to copy and send around, and it also means I can bundle data up in that single file, include it in a Docker container and deploy it to serverless hosts to serve it on the internet.

A Datasette map of power outages

Here’s another demo that helps show how GitHub fits into all of this.

Last year PG&E, the power company that covers much of California, turned off the power to large swathes of the state.

I got lucky: six months earlier I had started scraping their outage map and recording the history to a GitHub repository.

A list of recent commits to the pge-outages GitHub repository, each one with a commit message showing the number of incidents added, removed or updated

simonw/pge-outages is a git repository with 34,000 commits tracking the history of outages that PG&E had published on their outage map.

You can see that two minutes ago they added 35 new outages.

I’m using this data to publish a Datasette instance with details of their historic outages. Here’s a page showing their current outages, ordered by the number of customers affected by each outage.

Read Tracking PG&E outages by scraping to a git repo for more details on this project.

A screenshot of my blog entry about Git scraping

I recently decided to give this technique a name. I’m calling it Git scraping: the idea is to take any data source on the web that represents a point-in-time and commit it to a git repository that tells the story of the history of that particular thing.
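The core loop of a Git scraper can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the code behind any of the repositories mentioned here: the function names and the `outages.json` filename are hypothetical, and `commit_snapshot` assumes `git` is installed and that `repo_dir` is already an initialized repository.

```python
import os
import subprocess
import tempfile


def save_if_changed(path, new_bytes):
    """Write new_bytes to path only when the content differs.

    Returns True if the file was created or updated, False if unchanged.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == new_bytes:
                return False
    with open(path, "wb") as f:
        f.write(new_bytes)
    return True


def commit_snapshot(repo_dir, filename, message):
    """Commit the file with git (assumes an existing repo and git on PATH)."""
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)


# Demonstrate only the change detection; no git commands are run here.
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "outages.json")
results = [
    save_if_changed(target, b'{"outages": 1}'),  # first snapshot: new file
    save_if_changed(target, b'{"outages": 1}'),  # identical: nothing to commit
    save_if_changed(target, b'{"outages": 2}'),  # changed: worth a commit
]
print(results)
```

A scheduled job would fetch the data source, call `save_if_changed`, and call `commit_snapshot` only when it returns True, so the commit log ends up recording one entry per real change.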

Here’s my article describing the pattern in more detail: Git scraping: track changes over time by scraping to a Git repository.

A screenshot of the NYT scraped election results page

This technique really stood out just last week during the US election.

This is the New York Times election scraper website, built by Alex Gaynor and a growing team of contributors. It scrapes the New York Times election results and uses the data over time to show how the results are trending.

The nyt-2020-election-scraper GitHub repository page

It uses a GitHub Actions script that runs on a schedule, plus a really clever Python script that turns the data into a useful web page.

You can find more examples of Git scraping under the git-scraping topic on GitHub.

A screenshot of the incident map on fire.ca.gov

I’m going to do a bit of live coding to show you how this stuff works.

This is the incidents page from the state of California CAL FIRE website.

Any time I see a map like this, my first instinct is to open up the browser developer tools and try to figure out how it works.

The incident map with an open developer tools network console showing XHR requests ordered by size, largest first

I open the network tab, refresh the page and then filter to just XHR requests.

A neat trick is to order by size, because inevitably the thing at the top of the list is the most interesting data on the page.

a JSON list of incidents

This appears to be a JSON file telling me about all of the current fires in the state of California!

(I set up a Git scraper for this a while ago.)

Now I’m going to take this a step further and turn it into a Datasette instance.

The AllYearIncidents section of the JSON

It looks like the AllYearIncidents key is the most interesting bit here.

A screenshot showing the output of curl

I’m going to use curl to fetch that data, then pipe it through jq to filter for just that AllYearIncidents array.

curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents

Pretty-printed JSON produced by piping to jq

Now I have a list of incidents for this year.

A terminal running a command that inserts the data into a SQLite database

Next I’m going to pipe it into a tool I’ve been building called sqlite-utils, a suite of tools for manipulating SQLite databases.

I’m going to use the “insert” command and insert the data into a ca-fires.db database, in an incidents table.

curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents \
        | sqlite-utils insert ca-fires.db incidents -
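For readers who want to see what that pipeline does step by step, here is a rough Python equivalent using only the standard library. It is a sketch, not how sqlite-utils is implemented: the payload below is a made-up stand-in for the API response, the field names are assumptions, and the `insert_records` helper is hypothetical.

```python
import json
import sqlite3

# Made-up stand-in for the API response; field names are assumptions.
payload = json.loads("""
{
  "AllYearIncidents": [
    {"Name": "Example Fire A", "County": "Riverside",
     "Latitude": 33.9, "Longitude": -117.4},
    {"Name": "Example Fire B", "County": "Napa",
     "Latitude": 38.5, "Longitude": -122.3}
  ]
}
""")


def insert_records(conn, table, records):
    """Create a table from the first record's keys, then insert every record."""
    columns = list(records[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [[r.get(c) for c in columns] for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use "ca-fires.db" to write a real file
incidents = payload["AllYearIncidents"]  # same selection as `jq .AllYearIncidents`
insert_records(conn, "incidents", incidents)
row_count = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
print(row_count)
```

The command-line version is shorter and also handles details this sketch ignores, such as records with differing keys, which is why I reach for the pipeline in practice.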

Now I’ve got a ca-fires.db file. I can open that in Datasette:

datasette ca-fires.db -o

A map of incidents, where one of them is located at the very bottom of the map in Antarctica

And here it is: a brand new database.

You can see straight away that one of the rows has a bad location, hence it appears in Antarctica.

But 258 of them look like they are in the right place.

A list of faceted counties, showing the count of fires for each one

I can also facet by county, to see which county had the most fires in 2020: Riverside had 21.

datasette publish --help shows a list of hosting providers - cloudrun, heroku and vercel

I’m going to take this a step further and put it on the internet, using a command called datasette publish.

Datasette publish supports a number of different hosting providers. I’m going to use Vercel.

A terminal running datasette publish

I’m going to tell it to publish that database to a project called “ca-fires”, and tell it to install the datasette-cluster-map plugin.

datasette publish vercel ca-fires.db \
        --project ca-fires \
        --install datasette-cluster-map

This then takes that database file, bundles it up with the Datasette application and deploys it to Vercel.

A page on Vercel.com showing a deployment in process

Vercel gives me a URL where I can watch the progress of the deploy.

The goal here is to have as few steps as possible between finding some interesting data, turning it into a SQLite database you can use with Datasette and then publishing it online.

Screenshot of Stephen Wolfram's essay Seeking the Productive Life: Some Details of My Personal Infrastructure

I’ve given you a whistle-stop tour of Datasette for the purposes of publishing data, and hopefully doing some serious data journalism.

So what does this all have to do with personal data warehouses?

Last year, I read this essay by Stephen Wolfram: Seeking the Productive Life: Some Details of My Personal Infrastructure. It’s an incredible exploration of forty years of productivity hacks that Stephen Wolfram has applied to become the CEO of a 1,000 person company that works remotely. He’s optimized every aspect of his professional and personal life.

A screenshot showing the section where he talks about his metasearcher

It’s a lot.

But there was one part of this that really caught my eye. He talks about a thing he calls a “metasearcher”: a search engine on his personal homepage that searches every email, journal and file, everything he’s ever done, all in one place.

And I thought to myself, I really want THAT. I love this idea of a personal portal to my own stuff.

And because it was inspired by Stephen Wolfram, but I was planning on building a much less impressive version, I decided to call it Dogsheep.

Wolf, ram. Dog, sheep.

I’ve been building this over the past year.

A screenshot of my personal Dogsheep homepage, showing a list of data sources and saved queries

So essentially this is my personal data warehouse. It pulls in my personal data from as many sources as I can find and gives me an interface to browse that data and run queries against it.

I’ve got data from Twitter, Apple HealthKit, GitHub, Swarm, Hacker News, Photos, a copy of my genome… all sorts of things.

I’ll show a few more demos.

Tweets with selfies by Cleo

Here’s another one about Cleo. Cleo has a Twitter account, and every time she goes to the vet she posts a selfie and says how much she weighs.

A graph showing Cleo's weight over time

Here’s a SQL query that finds every tweet that mentions her weight, pulls out her weight in pounds using a regular expression, then uses the datasette-vega charting plugin to show a self-reported chart of her weight over time.

select
    created_at,
    regexp_match('.*?(\d+(\.\d+))lb.*', full_text, 1) as lbs,
    full_text,
    case
        when (media_url_https is not null)
        then json_object('img_src', media_url_https, 'width', 300)
    end as photo
from
    tweets
    left join media_tweets on tweets.id = media_tweets.tweets_id
    left join media on media.id = media_tweets.media_id
where
    full_text like '%lb%'
    and user = 3166449535
    and lbs is not null
group by
    tweets.id
order by
    created_at desc
limit
    101
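To show what the regular expression in that query captures, here is a small Python sketch applying the same pattern with the standard `re` module. The sample strings are invented for illustration and are not real tweets.

```python
import re

# Same capture as the SQL query: digits, a dot, more digits, then "lb".
WEIGHT_RE = re.compile(r"(\d+(\.\d+))lb")


def extract_lbs(text):
    """Return the weight in pounds as a float, or None if nothing matches."""
    match = WEIGHT_RE.search(text)
    return float(match.group(1)) if match else None


samples = [
    "Went to the vet today, I weigh 49.3lb!",  # made-up example text
    "No weight mentioned here",
    "Only 50lb of kibble left",  # whole number: the pattern needs a decimal part
]
weights = [extract_lbs(s) for s in samples]
print(weights)
```

Note that the pattern requires a decimal part, so a whole-number weight such as "50lb" is skipped. Making the inner group optional, as in `(\d+(\.\d+)?)lb`, would be one way to include those if that were wanted.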

A screenshot showing the result of running a SQL query against my genome

I did 23AndMe a few years ago, so I have a copy of my genome in Dogsheep. This SQL query tells me what colour my eyes are.

Apparently they are blue, 99% of the time.

select rsid, genotype, case genotype
    when 'AA' then 'brown eye color, 80% of the time'
    when 'AG' then 'brown eye color'
    when 'GG' then 'blue eye color, 99% of the time'
    end as interpretation from genome where rsid = 'rs12913832'

A list of tables in my HealthKit database

I have HealthKit data from my Apple Watch.

Something I really like about Apple’s approach to this stuff is that they don’t just upload all of your data to the cloud.

This data lives on your watch and on your phone, and there’s an option in the Health app on your phone to export it, as a zip file full of XML.

I wrote a script called healthkit-to-sqlite that converts that zip file into a SQLite database, and now I have tables for things like my basal energy burned, my body fat percentage and flights of stairs I’ve climbed.
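The general shape of an XML-to-SQLite conversion like this can be sketched with Python's standard library. This is not the healthkit-to-sqlite source: the sample XML is a tiny invented document, the `Record` element and its attribute names are assumptions about the export format, and the one-table-per-record-type layout is just one reasonable design.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny made-up sample; a real export is far larger and has more attributes.
SAMPLE_XML = """
<HealthData>
  <Record type="FlightsClimbed" unit="count" value="3" startDate="2020-11-01 09:00:00"/>
  <Record type="FlightsClimbed" unit="count" value="5" startDate="2020-11-02 09:00:00"/>
  <Record type="BodyFatPercentage" unit="%" value="0.2" startDate="2020-11-01 08:00:00"/>
</HealthData>
"""


def load_records(conn, xml_text):
    """Insert each <Record> element into a table named after its type."""
    root = ET.fromstring(xml_text)
    for record in root.iter("Record"):
        table = record.attrib["type"]
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" '
            "(startDate TEXT, unit TEXT, value TEXT)"
        )
        conn.execute(
            f'INSERT INTO "{table}" VALUES (?, ?, ?)',
            (record.get("startDate"), record.get("unit"), record.get("value")),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_records(conn, SAMPLE_XML)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
total_flights = conn.execute(
    'SELECT SUM(CAST(value AS INTEGER)) FROM "FlightsClimbed"'
).fetchone()[0]
print(tables, total_flights)
```

For a real multi-hundred-megabyte export you would want a streaming parser such as `ET.iterparse` rather than loading the whole document into memory.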

Screenshot showing a Datasette map of my San Francisco Half Marathon route

But the really fun part is that it turns out any time you track an outdoor workout on your Apple Watch it records your exact location every few seconds, and you can get that data back out again!

Here’s a map of my exact route for the San Francisco Half Marathon three years ago.

I’ve started tracking an “outdoor walk” every time I go on a walk now, just so I can get the GPS data out again later.

Screenshot showing a list of commits to my projects, faceted by repository

I have a lot of data from GitHub about my projects: all of my commits, issues, issue comments and releases, everything I can get out of the GitHub API using my github-to-sqlite tool.

So I can do things like see all of my commits across all of my projects, and search and facet them.

I have a public demo of a subset of this data at github-to-sqlite.dogsheep.net.

A faceted interface showing my photos, faceted by city, country and whether they are a favourite

Apple Photos is a particularly interesting source of data.

It turns out the Apple Photos app uses a SQLite database, and if you know what you’re doing you can extract photo metadata from it.

They actually run machine learning models on your own device to figure out what your photos are of!

Some photos I have taken of pelicans, inside Datasette

You can use the machine learning labels to see all of the photos you have taken of pelicans. Here are all of the photos I have taken that Apple Photos has identified as pelicans.

Screenshot showing some of the columns in my photos table

It also turns out they have columns called things like ZOVERALLAESTHETICSCORE, ZHARMONIOUSCOLORSCORE, ZPLEASANTCAMERATILTSCORE and more.

So I can sort my pelican photos with the most aesthetically pleasing first!

Screenshot of my Dogsheep Beta faceted search interface

And a few weeks ago I finally got around to building the thing I’d always wanted: the search engine.

I called it Dogsheep Beta, because Stephen Wolfram has a search engine called Wolfram Alpha.

This is pun-driven development: I came up with this pun a while ago and liked it so much I committed to building the software.

Search results for Cupertino, showing photos with maps

I wanted to know when I had last eaten a waffle-fish ice cream. I knew it was in Cupertino, so I searched Dogsheep Beta for Cupertino and found this photo.

I hope this illustrates how much you can do if you pull all of your personal data into one place!

GDPR really helps

The GDPR law that passed in Europe a few years ago really helps with this stuff.

Companies have to provide you with access to the data that they store about you.

Many big internet companies have responded to this by providing a self-service export feature, usually buried somewhere in the settings.

You can also request data directly from companies, but the self-service option helps them keep their customer support costs down.

This stuff becomes easier over time as more companies build out these features.

Democratizing access. The future is already here, it's just not evenly distributed - William Gibson

The other challenge is how we democratize access to this.

Everything I’ve shown you today is open source: you can install this software and use it yourself, for free.

But there’s a lot of assembly required. You need to figure out authentication tokens, find somewhere to host it, and set up cron jobs and authentication.

But this should be accessible to regular non-uber-nerd humans!

Democratizing access. Should users run their own online Dogsheep? So hard and risky! Tailscale and WireGuard are interesting here. Vendors to provide hosted Dogsheep? Not a great business, risky!. Better options: Desktop app, mobile app.

Expecting regular humans to run a secure web server somewhere is pretty terrifying. I’ve been looking at WireGuard and Tailscale to help make secure access between devices easier, but that’s still very much for super-users only.

Running this as a hosted service doesn’t appeal: taking responsibility for people’s personal data is scary, and it’s probably not a great business.

I think the best options are to run on people’s own personal devices: their mobile phones and their laptops. I think it’s feasible to get Datasette running in those environments, and I really like the idea of users being able to import their personal data onto a device that they control and analyze it there.

Screenshot of Dogsheep on GitHub

The Dogsheep GitHub organization has most of the tools that I’ve used to build out my personal Dogsheep warehouse, many of them using the naming convention of something-to-sqlite.

Q&A, from this Google Doc

Screenshot of the Google Doc

Q: Is there/will there be a Datasette hosted service that I can pay $ for? I would like to pay $5/month to get access to the latest version of Dogsheep with all of the latest plugins!

I don’t want to build a hosting site for personal private data because I think people should stay in control of that themselves, plus I don’t think there’s a particularly good business model for that.

Instead, I’m building a hosted service for Datasette (called Datasette Cloud) which is aimed at companies and organizations. I want to be able to provide newsrooms and other groups with a private, secure, hosted environment where they can share data with each other and run analysis.

Screenshot showing an export running on an iPhone in the Health app

Q: How do you sync your data from your phone/watch to the data warehouse? Is it a manual process?

The health data is manual: the iOS Health app has an export button which generates a zip file of XML which you can then AirDrop to a laptop. I then run my healthkit-to-sqlite script against it to generate the DB file and SCP that to my Dogsheep server.

Many of my other Dogsheep tools use APIs and can run on cron, to fetch the most recent data from Swarm, Twitter, GitHub and so on.

Q: When accessing Github/Twitter etc do you run queries against their API or you periodically sync (retrieve mostly I guess) the data to the warehouse first and then query locally?

I always try to get ALL the data so I can query it locally. The problem with APIs that let you run queries is that inevitably there’s something I want to do that can’t be done through the API, so I’d much rather suck everything down into my own database so I can write my own SQL queries.

Screenshot showing how to run swarm-to-sqlite in a terminal

Here’s an example of my swarm-to-sqlite script, pulling in just checkins from the past two weeks (using authentication credentials from an environment variable).

swarm-to-sqlite swarm.db --since=2w

Here’s a redacted copy of my Dogsheep crontab.

Screenshot of the SQL.js GitHub page

Q: Have you explored doing this as a single page app so that it is possible to deploy this as a static site? What are the constraints there?

It’s actually possible to query SQLite databases entirely within client-side JavaScript using SQL.js (SQLite compiled to WebAssembly).

Screenshot of an Observable notebook running SQL.js

This Observable notebook is an example that uses this to run SQL queries against a SQLite database file loaded from a URL.

Screenshot of a search for cherry trees on sf-trees.com

Datasette’s JSON and GraphQL APIs mean it can easily act as an API backend to SPAs.

I built this site to offer a search engine for trees in San Francisco. View source to see how it hits a Datasette API in the background: https://sf-trees.com/?q=palm

The network pane running against sf-trees.com

You can use the network pane to see that it’s running queries against a Datasette backend.

Screenshot of view-source on sf-trees.com

Here’s the JavaScript code which calls the API.
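The same kind of call can be made from any language. As a hedged sketch in Python, the helper below builds a URL for a Datasette instance's JSON endpoint, passing the SQL in the `sql` parameter, asking for a plain array of rows with `_shape=array`, and supplying named SQL parameters as extra query-string arguments. The host, database name, table and column are placeholders I made up; they are not the actual backend behind sf-trees.com.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_sql_url(base_url, database, sql, **params):
    """Build a URL for a Datasette instance's JSON SQL endpoint.

    Named parameters (":name" in the SQL) become extra query-string arguments.
    """
    query = {"sql": sql, "_shape": "array", **params}
    return f"{base_url.rstrip('/')}/{database}.json?{urlencode(query)}"


def run_query(url):
    """Fetch the URL and decode the JSON response (makes a network request)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_sql_url(
    "https://datasette.example.com",  # placeholder host, not a real deployment
    "trees",                          # hypothetical database name
    "select * from trees where species like :q limit 10",
    q="%palm%",
)
print(url)
```

Calling `run_query(url)` against a real instance would return a list of dictionaries, one per row. Using named parameters rather than string concatenation keeps user input out of the SQL text itself.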

Screenshot of Datasette Canned Query documentation

Q: What possibilities for data entry tools do the writable canned queries open up?

Writable canned queries are a relatively recent Datasette feature that allows administrators to configure an UPDATE/INSERT/DELETE query that can be called by users filling in forms or accessed via a JSON API.

The idea is to make it easy to build backends that handle simple data entry in addition to serving read-only queries. It’s a feature with a lot of potential, but so far I’ve not used it for anything significant.

Currently it can generate a VERY basic form (with single-line input values, similar to this search example) but I hope to expand it in the future to support custom form widgets via plugins for things like dates, map locations or autocomplete against other tables.
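As an illustration of what configuring one of these looks like, here is a sketch that generates a Datasette metadata file from Python. The structure follows my understanding of the canned query configuration (a `queries` block under a database, with `"write": true` marking the query as writable); the database name, query name and `names` table are hypothetical, and the Datasette documentation is the authority on the exact keys.

```python
import json

metadata = {
    "databases": {
        "mydatabase": {          # hypothetical database name
            "queries": {
                "add_name": {    # hypothetical canned query name
                    # :name is a named parameter, filled in from the form or API
                    "sql": "INSERT INTO names (name) VALUES (:name)",
                    "write": True,  # marks this canned query as writable
                }
            }
        }
    }
}

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Saving that output as `metadata.json` and starting Datasette with it should expose the query as a form with a single `name` field, assuming the database has a matching `names` table.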

Q: For the local version where you had a 1-line push to deploy a new datasette: how do you handle updates? Is there a similar 1-line update to update an existing deployed datasette?

I deploy a brand new installation every time the data changes! This works great for data that only changes a few times a day. If I have a project that changes multiple times an hour, I’ll run it on a regular VPS rather than use a serverless hosting provider.
