This morning, we unveiled an interactive Google Map, based on anonymized IP addresses collected from the forum database, that shows how the GATK user community is distributed across the globe. Check out Boston/Cambridge!

For the record, this was originally inspired by the World Map of High-throughput Sequencers by James Hadfield (Cancer Research UK, Cambridge) and Nick Loman (University of Birmingham).

As several people have already expressed interest in how this map was put together, I thought I'd give a brief overview of the technical side below the fold. I'm happy to provide more details and/or code if anyone wants to do something similar.


Making the map

First, I retrieved the IP addresses recorded in the forum user database (last known IP only, to avoid inflating the counts) -- without any associated user information, so the IPs are anonymous. I ran them against the free MaxMind GeoLite2 geolocation database (using their delightfully simple API) to look up the approximate location of each one, which gave me the City, Country and geographical coordinates (Latitude and Longitude). Then it's just a matter of counting how many records we have for each City+Country pairing, and consolidating that information into a JSON file to feed to the Google Maps API that actually draws the map. The responsive clustering of the markers, which is the really cool feature here, is done by a nifty little JavaScript library called markerclusterer.js.
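To make the counting step concrete, here is a minimal Python sketch of the City+Country consolidation. It is not the actual script: the record layout, the field names and the `count_by_city_country` helper are all made up for illustration, and the geolocation lookups are assumed to have been done already.

```python
import json
from collections import Counter

# Hypothetical lookup results, one per user's last known IP.
# The field names are illustrative, not the real GeoLite2 output format.
records = [
    {"city": "Cambridge", "country": "United States", "lat": 42.3751, "lon": -71.1056},
    {"city": "Cambridge", "country": "United States", "lat": 42.3751, "lon": -71.1056},
    {"city": "Cambridge", "country": "United Kingdom", "lat": 52.2000, "lon": 0.1167},
]

def count_by_city_country(records):
    """Count records per (city, country) pair and keep one coordinate pair for each."""
    counts = Counter()
    coords = {}
    for rec in records:
        key = (rec["city"], rec["country"])
        counts[key] += 1
        # Remember the first coordinates seen for this pair.
        coords.setdefault(key, (rec["lat"], rec["lon"]))
    return [
        {
            "city": city,
            "country": country,
            "lat": coords[(city, country)][0],
            "lon": coords[(city, country)][1],
            "count": n,
        }
        for (city, country), n in counts.items()
    ]

markers = count_by_city_country(records)
# This JSON is the kind of payload a map-drawing script could consume.
print(json.dumps(markers, indent=2))
```

Each entry in the resulting list corresponds to one marker on the map, with `count` giving the number of users at that location.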

Problems and perspectives

The map represents data from ~25,000 registered users out of the ~36,000 total. This does mean we're missing a sizeable chunk of the community, because many IPs either were not in the free version of the geolocation database that I used, or resolved to a location that was not associated with a name. Funnily enough, due to a bug in the first version of my script, the unnamed records were all getting assigned to a single pair of coordinates, so the map was showing over 8,000 GATK users way out in a remote part of Australia. Had me wondering whether the Garvan Institute was hiding a massive secret facility out there!

In a future iteration I think I can salvage the unnamed records by consolidating based on coordinates instead of City+Country name, the choice of which in hindsight was not a great design decision. Right now I'm also not correctly handling cases where the same city name exists in several states within the United States -- as a born-and-bred European, I assumed that the City+Country name pair is unique, but now that I think about it, that doesn't hold true in the USA, does it... I don't think this would affect a large number of records, but hey, we care about accuracy, so further refinements will be forthcoming!
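For anyone curious what consolidating on coordinates might look like, here is a rough Python sketch. The record layout, the `count_by_coordinates` helper and the rounding precision are assumptions for illustration only, not the code behind the map.

```python
from collections import Counter

# Hypothetical records: two same-named US cities in different states,
# plus two records that have coordinates but no city or country name.
records = [
    {"city": "Springfield", "country": "United States", "lat": 39.7817, "lon": -89.6501},
    {"city": "Springfield", "country": "United States", "lat": 42.1015, "lon": -72.5898},
    {"city": None, "country": None, "lat": 48.8582, "lon": 2.3387},
    {"city": None, "country": None, "lat": 48.8582, "lon": 2.3387},
]

def count_by_coordinates(records, precision=4):
    """Count records per rounded (lat, lon) pair, ignoring the names entirely."""
    counts = Counter()
    for rec in records:
        if rec["lat"] is None or rec["lon"] is None:
            continue  # nothing to place on the map without coordinates
        key = (round(rec["lat"], precision), round(rec["lon"], precision))
        counts[key] += 1
    return counts

counts = count_by_coordinates(records)
for (lat, lon), n in counts.items():
    print(lat, lon, n)
```

With the names out of the key, the two Springfields stay separate and the unnamed records are counted at their own location instead of being dropped or lumped together.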



Geraldine_VdAuwera on 27 Jul 2016


Whoo, update: using an identifier based on the coordinates instead of the city and country name resolved the two main problems I outlined above -- the unnamed data points and the naming overlap between cities within the US. The updated map is live now.



