This site is a small project to visualize a weather dataset. I’ll use the most recent complete year, which is 2008. The main aim of the project is to experiment with visualizing a large amount of data and to express it in a meaningful way.
I’ll try to write a post on the log for every step of the project, so you can follow my progress over these two or three days of work.
The main components to build this site are:
Workflow to create this site.
There are several things that need to be improved. My intention for these two days was to have a minimum product, something that tackles all the steps of creating a visualization site: grabbing data, analysing and transforming it, uploading it to a service or repository, and finally coding a web visualization. Those steps (even though I’ve learnt A LOT) are quite close to my current expertise and toolbox, but there are more challenges and, with more dedication, more features to add. Finally, I’d open source the site and place all those features in a public ticket system (probably on GitHub) so I can share my progress with others.
I’d like to set up a simple ETL task to automatically retrieve and process the data for any other year from the EC2 instance. It’s free, so I don’t need to switch it off. Not a high priority, though.
Automate all data management. This should be easy because everything is already driven by scripts; it’s just a matter of organising them well and adding more error control and so on. Again, not a high priority.
Explore the Big Data approach. I’d like to try to load all this data into an Elastic MapReduce task, but first I should move it to S3. At the very least, I’d set up a local single-node Hadoop installation (master and slave on the same machine). Then I’d write the map/reduce Python scripts to generate the output text files, or maybe store the results on SimpleDB to check how well they work afterwards in the visualization phase. I might also explore what Hive offers for aggregating data from this huge dataset dynamically, for example extracting the complete recorded series for one variable and region (that is, a set of stations).
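A minimal sketch of what those map/reduce Python scripts could look like, in the style of Hadoop Streaming (each step reads lines and emits tab-separated key/value lines). The input layout is an assumption made up for this example: one reading per line as station_id, date (YYYYMMDD) and temperature, separated by tabs. The real dataset files are laid out differently, and the function names are only illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'station_id:YYYYMM<TAB>temperature' for each valid input line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records
        station_id, date, temp = parts
        try:
            value = float(temp)
        except ValueError:
            continue  # skip non-numeric readings
        yield f"{station_id}:{date[:6]}\t{value}"


def reducer(sorted_lines):
    """Average the values of each key; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values):.2f}"
```

Locally, the shuffle phase can be imitated with a plain sort, as in list(reducer(sorted(mapper(lines)))), which makes it easy to check the logic on a small sample before sending anything to a cluster.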
Definitely study CartoDB.js to use it properly and solve the evident problems I have now with layer management. Instead of manually setting up every map I’ve added to this site, I’d like to have a completely client-driven visualization where, for example, the user selects the variable and the months (with a complete year by default) and the data is requested from the CartoDB service for rendering, using some default CartoCSS rules that work with normalized data, so the styles do not depend on the dataset.
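To illustrate what "normalized data" could mean here, this is a small sketch assuming a plain min-max rescaling followed by a split into a fixed number of classes. Both the choice of min-max and the five classes are assumptions for the example; the styles would then only need one rule per class, whatever the variable.

```python
def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def to_class(norm_value, n_classes=5):
    """Map a normalized value to a class index between 0 and n_classes - 1."""
    return min(int(norm_value * n_classes), n_classes - 1)
```

With this, temperatures in degrees and rainfall in millimetres both end up as class indexes from 0 to 4, so the same colour ramp can be reused for every variable.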
I’d extract some indicators to generate something like an infographic with the relevant data, something appealing that could be added to the site, like a travel map of interesting facts. I’d use Odyssey.js to generate the story and drive the user around the world, adding selected photographs and videos of those places.
Add another layer of regions from this dataset to allow regional aggregations smaller than a country. Maybe something like the typical Business Intelligence OLAP cube approach, to drill down through the data from the country aggregation level to stations, and from year to quarter and month levels (that’s what I can upload to CartoDB), and then use SimpleDB or static JSON files stored on S3 for the maximum level of detail for one specific station.
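The drill-down idea can be sketched as one aggregation function with two parameters, one for the spatial level and one for the time level. The record layout below (country, station, month, value) is hypothetical and only there to make the example self-contained; a region level would just be one more field.

```python
from collections import defaultdict


def quarter(month):
    """Return the quarter (1-4) that a month (1-12) belongs to."""
    return (month - 1) // 3 + 1


def aggregate(records, space="country", time="year"):
    """Average 'value' grouped by a spatial level and a time level.

    space: "country" or "station"; time: "year", "quarter" or "month".
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for rec in records:
        if time == "year":
            period = "year"
        elif time == "quarter":
            period = f"Q{quarter(rec['month'])}"
        else:
            period = f"M{rec['month']:02d}"
        acc = sums[(rec[space], period)]
        acc[0] += rec["value"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Note that this averages the raw records at every level, so a country mean is weighted by the number of readings per station. Averaging station means instead would be a different (and equally defensible) choice.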
With a dynamic selector for the data rendered by country, region or station (so the first and second are choropleth maps and the third a point map, as I’ve done this weekend), I could render a more complex table. That table could also respond to events by highlighting sections of the accompanying graphs or, for example, by letting the user select the variable to render on the graphs by clicking on its column header. A specific graph would be selected for every variable, with several variables sharing the same graph, as I’ve done with temperatures on the climograph I have right now.
I’d develop some kind of station locator using an auto-complete text input, moving the map to that station’s location automatically.
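The lookup behind such an auto-complete input could be sketched like this, assuming the station list is small enough to keep sorted in memory. The tuple layout (name, latitude, longitude) and the function names are assumptions for the example; on the site this logic would probably live in the browser, but the idea is the same.

```python
from bisect import bisect_left


def build_index(stations):
    """Sort stations by lowercased name; stations yields (name, lat, lon)."""
    return sorted((name.lower(), name, lat, lon) for name, lat, lon in stations)


def suggest(index, prefix, limit=5):
    """Return up to 'limit' stations whose name starts with 'prefix'."""
    prefix = prefix.lower()
    # A one-element tuple sorts just before any entry sharing that first key.
    start = bisect_left(index, (prefix,))
    results = []
    for key, name, lat, lon in index[start:]:
        if not key.startswith(prefix) or len(results) == limit:
            break
        results.append((name, lat, lon))
    return results
```

Each suggestion carries its coordinates, so once the user picks one the map can be centred on it directly.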
A permalink system to allow deep links to weather station diagrams and locations, based on their station_id.
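One possible shape for those permalinks, assuming the state is kept in the URL fragment so that no server-side routing is needed. The parameter names "station" and "var" and the example URL are invented for this sketch; only station_id comes from the data.

```python
from urllib.parse import parse_qs, urlencode


def build_permalink(base_url, station_id, variable=None):
    """Build a deep link such as 'http://example.com/map#station=080010'."""
    params = {"station": station_id}
    if variable:
        params["var"] = variable
    return f"{base_url}#{urlencode(params)}"


def parse_permalink(url):
    """Return (station_id, variable) from a permalink; missing parts are None."""
    _, _, fragment = url.partition("#")
    params = parse_qs(fragment)
    station = params.get("station", [None])[0]
    variable = params.get("var", [None])[0]
    return station, variable
```

Keeping station_id as a string matters here, because identifiers with leading zeros would otherwise lose them.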