Wednesday, December 16, 2009

Accessing MERRA: Data Subsetter

This week, I needed a daily-average surface flux subset of MERRA data. Locally, we have the 1-hourly data on a mass storage system, but that is primarily an archive and is not well suited to routine analysis. The source data files are 269 MB each, and I needed one for each day from July 1987 through December 2007, which would have been a request of about 2 TB. So, I used the Data Subsetter at the MDISC to create data files specifically for the comparison.
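As a quick check on those numbers, the arithmetic can be sketched in a few lines of Python. The file size and dates are the ones quoted above; the use of decimal units (1 TB = 1,000,000 MB) is an assumption made for the estimate.

```python
from datetime import date

# Period quoted in the post, counted inclusively (one file per day).
start = date(1987, 7, 1)
end = date(2007, 12, 31)
n_days = (end - start).days + 1

FILE_SIZE_MB = 269  # size of one source file, as quoted in the post

total_mb = n_days * FILE_SIZE_MB
total_tb = total_mb / 1e6  # assumption: decimal units, 1 TB = 1,000,000 MB

print(n_days, round(total_tb, 2))  # prints: 7489 2.01
```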

With the subsetter, I selected the 4 variables needed for the experiment and the time range (Jul 1987-Dec 2007, or 7489 days). The region could have been trimmed, but was left at the default (global). Daily averages were preferable to the 1-hourly averages, so the daily mean box was checked. HDF was suitable, so the format was left at the default (as opposed to NetCDF; more formats may be added later). The subsetter provided a text file with the HTTP links for the reduced data request. Each link activates a program that does the subsetting and streams the requested data back. This text file is used as input to a Linux call to wget, which does the work of opening each link and retrieving the data.
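I used wget for this step (its -i option reads a list of URLs from a file). For anyone who would rather script it, here is a minimal Python sketch of the same idea. It is not part of the subsetter; the function names, the numbered output file names, and the ".hdf" suffix are all hypothetical choices made for illustration, and it assumes the link file has one URL per line.

```python
import shutil
import urllib.request
from pathlib import Path


def read_url_list(list_path):
    """Return the non-empty, non-comment lines of the link file."""
    urls = []
    for line in Path(list_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def download_all(urls, out_dir, suffix=".hdf"):
    """Fetch each URL in turn and save it under a numbered file name.

    The numbered names (subset_00001.hdf, ...) are a hypothetical
    convention; the subsetter links are long query strings, so they
    do not make convenient file names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        target = out_dir / f"subset_{i:05d}{suffix}"
        if target.exists():
            # Already fetched on an earlier run, so a restart is cheap.
            saved.append(target)
            continue
        # Write to a temporary name first so an interrupted transfer
        # is never mistaken for a complete file.
        part = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(part, "wb") as fh:
            shutil.copyfileobj(response, fh)
        part.replace(target)
        saved.append(target)
    return saved
```

A call such as download_all(read_url_list("links.txt"), "merra_subset") would then fetch every file in the list, where both "links.txt" and "merra_subset" are placeholder names. The requests run one at a time, which seems the polite choice given that each link triggers subsetting work on the server.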

The download took about 16 hours (mostly overnight), and the result occupied only 23 GB of disk space. The daily mean check box also saved me the time of computing those daily means myself. Lastly, I used the Mirador search to download one of the unaltered source data files, just to verify the variables and the daily averaging; there was no difference between the subsetter's daily averages and daily averages computed manually.
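The manual check amounts to averaging the 24 hourly fields of one day and comparing the result with the subsetter's daily mean. A minimal sketch of that comparison follows. It assumes the fields have already been read from the HDF files into plain lists of grid values (the reading step is left out), and the function names are hypothetical.

```python
import math


def daily_mean(hourly_fields):
    """Average a day's hourly fields, grid point by grid point.

    hourly_fields is a list of fields, one per hour, each a flat
    list of values with the same number of grid points.
    """
    n_times = len(hourly_fields)
    n_points = len(hourly_fields[0])
    return [
        math.fsum(field[p] for field in hourly_fields) / n_times
        for p in range(n_points)
    ]


def max_abs_difference(field_a, field_b):
    """Largest absolute difference between two fields of equal size."""
    return max(abs(a - b) for a, b in zip(field_a, field_b))
```

With real data, the value returned by max_abs_difference would be compared against a small tolerance rather than exactly zero, since the files store values at limited precision.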

The subsetter has evolved a lot over this past year, and additional functionality is planned. However, in its present form, it should be a very useful tool for accessing MERRA data.

2 comments:

jhz said...

Hi Mike, VERY late comment... Are there some folks who are interested in the raw variables versus the means? If so, what would you say the ratio is of folks who want unprocessed variables versus means?

Michael Bosilovich said...

I'm not exactly sure what you mean by "raw variables", other than taking from the context, "versus means", that it implies instantaneous data. There are several instantaneous data sets. One is vertically integrated quantities, which can be used to derive total time tendency terms for budgets. The others are 3D states, especially the analyzed states.

For fluxes, instantaneous data would not help close budgets, because you need to know what goes on at all times, and it is not practical to write out all time steps. For the surface 1-hourly means, though, there are only 3-4 time steps per hour, so the hourly means are not that far from the instantaneous data.

There really haven't been any requests for more instantaneous data. That said, we are currently reviewing the File Specification Document for MERRA and considering future needs for output data, so more specific suggestions are welcome.