This tutorial is about writing EBAS data files. Although the procedure is the same for all file formats (NASA Ames, NetCDF, CSV), the tutorial focuses on NASA Ames. This is the most relevant case from a user perspective: most users are interested in generating files for submission to EBAS.
First we need to set up an I/O object, in this case of class EbasNasaAmes. If you want to create other file types, use the corresponding class; everything else works the same for all file formats.
from ebas.io.file.nasa_ames import EbasNasaAmes
nas = EbasNasaAmes()
Now we have an object (nas) which represents the file we want to write.
Next we need to add some global metadata to the file. Here the most basic metadata are shown.
import datetime
from nilutility.datatypes import DataObject
def setup_global_metadata(outfile):
    outfile.metadata.revdate = datetime.datetime.utcnow()
    outfile.metadata.datalevel = '2'
    outfile.metadata.station_code = 'NO0002R'
    outfile.metadata.station_name = 'Birkenes II'
    outfile.metadata.matrix = 'pm10'
    outfile.metadata.lab_code = 'NO01L'
    outfile.metadata.instr_type = 'filter_3pack'
    outfile.metadata.instr_name = 'NILU_f3p_d_0001'
    outfile.metadata.ana_lab_code = 'NO01L'
    outfile.metadata.ana_technique = 'IC'
    outfile.metadata.ana_instr_name = 'NILU_IC_03'
    outfile.metadata.ana_instr_manufacturer = 'Dionex'
    outfile.metadata.ana_instr_model = 'ICS-3000'
    outfile.metadata.ana_instr_serialno = '12345'
    outfile.metadata.reference_date = datetime.datetime(2020, 1, 1)
    outfile.metadata.resolution = '1h'
    outfile.metadata.projects = ['CAMP', 'EMEP']
    outfile.metadata.org = DataObject(
        OR_CODE='NO01L',
        OR_NAME='Norwegian Institute for Air Research',
        OR_ACRONYM='NILU', OR_UNIT='Atmosphere and Climate Department',
        OR_ADDR_LINE1='Instituttveien 18', OR_ADDR_LINE2=None,
        OR_ADDR_ZIP='2007', OR_ADDR_CITY='Kjeller', OR_ADDR_COUNTRY='Norway'
    )
    outfile.metadata.originator.append(DataObject(
        PS_LAST_NAME=u'Someone', PS_FIRST_NAME='Else',
        PS_EMAIL='Someone@somewhere.no',
        PS_ORG_NAME='Some nice Institute',
        PS_ORG_ACR='WOW', PS_ORG_UNIT='Super interesting division',
        PS_ADDR_LINE1='Street 18', PS_ADDR_LINE2=None,
        PS_ADDR_ZIP='X-9999', PS_ADDR_CITY='Paradise',
        PS_ADDR_COUNTRY='Norway',
        PS_ORCID=None,
    ))
    outfile.metadata.submitter.append(DataObject(
        PS_LAST_NAME=u'Someone', PS_FIRST_NAME='Else',
        PS_EMAIL='Someone@somewhere.no',
        PS_ORG_NAME='Some nice Institute',
        PS_ORG_ACR='WOW', PS_ORG_UNIT='Super interesting division',
        PS_ADDR_LINE1='Street 18', PS_ADDR_LINE2=None,
        PS_ADDR_ZIP='X-9999', PS_ADDR_CITY='Paradise',
        PS_ADDR_COUNTRY='Norway',
        PS_ORCID=None,
    ))
setup_global_metadata(nas)
Add the sample time intervals. Here we add only three samples for demonstration.
from nilutility.datetime_helper import DatetimeInterval
def add_sample_times(outfile):
    outfile.sample_times = [
        DatetimeInterval(datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 1, 1, 1, 0)),
        DatetimeInterval(datetime.datetime(2020, 1, 1, 1, 0), datetime.datetime(2020, 1, 1, 2, 0)),
        DatetimeInterval(datetime.datetime(2020, 1, 1, 2, 0), datetime.datetime(2020, 1, 1, 3, 0))
    ]
add_sample_times(nas)
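For real data sets with many samples, the sample times can of course be generated programmatically. Here is a minimal sketch; it uses only the DatetimeInterval constructor shown above, and the helper function hourly_intervals is not part of ebas-io:
def hourly_intervals(start, count):
    # build `count` consecutive 1-hour sample intervals starting at `start`
    return [DatetimeInterval(start + datetime.timedelta(hours=i),
                             start + datetime.timedelta(hours=i + 1))
            for i in range(count)]

# equivalent to the call above
nas.sample_times = hourly_intervals(datetime.datetime(2020, 1, 1), 3)
Next, set up the variables: for each variable we define the values, the flags and the variable metadata.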
def setup_variables(outfile):
    # variable 1: sodium
    values = [0.06, 0.0000022, None]  # values for the three samples; missing value is None!
    flags = [[], [], [999]]  # flags for the three samples (the last sample is flagged as missing)
    metadata = DataObject()
    metadata.comp_name = 'sodium'
    metadata.matrix = 'pm10'
    metadata.unit = 'ug/m3'
    metadata.title = 'Na'
    metadata.inlet_type = None
    metadata.flow_rate = 10
    metadata.qa = [
        DataObject({
            'qa_number': 1,
            'qm_id': 'EMEP31',
            'qa_date': datetime.datetime(2013, 10, 16),
            'qa_outcome': True,  # pass
        }),
        DataObject({
            'qa_number': 2,
            'qm_id': 'EMEP32',
            'qa_date': datetime.datetime(2014, 10, 22),
            'qa_outcome': True,  # pass
        })
    ]
    # add the variable
    outfile.variables.append(DataObject(values_=values, flags=flags, flagcol=True,
                                        metadata=metadata))

    # variable 2: magnesium
    values = [0.556, 1.22, None]  # values for the three samples; missing value is None!
    flags = [[], [], [999]]  # flags for the three samples (the last sample is flagged as missing)
    metadata = DataObject()
    metadata.comp_name = 'magnesium'
    metadata.matrix = 'pm10'
    metadata.unit = 'ug/m3'
    metadata.title = 'Mg'
    metadata.inlet_type = None
    metadata.flow_rate = 10
    metadata.qa = [
        DataObject({
            'qa_number': 1,
            'qm_id': 'EMEP31',
            'qa_date': datetime.datetime(2013, 10, 16),
            'qa_outcome': True,  # pass
        }),
        DataObject({
            'qa_number': 2,
            'qm_id': 'EMEP32',
            'qa_date': datetime.datetime(2014, 10, 22),
            'qa_outcome': True,  # pass
        })
    ]
    # add the variable
    outfile.variables.append(DataObject(values_=values, flags=flags, flagcol=True,
                                        metadata=metadata))

    # variable 3: calcium
    values = [0.556, 1.22, None]  # values for the three samples; missing value is None!
    flags = [[], [], [999]]  # flags for the three samples (the last sample is flagged as missing)
    metadata = DataObject()
    metadata.comp_name = 'calcium'
    metadata.matrix = 'pm10'
    metadata.unit = 'ug/m3'
    metadata.title = 'Ca'
    metadata.inlet_type = None
    metadata.flow_rate = 10
    metadata.qa = [
        DataObject({
            'qa_number': 1,
            'qm_id': 'EMEP31',
            'qa_date': datetime.datetime(2013, 10, 16),
            'qa_outcome': False,  # no pass
        }),
        DataObject({
            'qa_number': 2,
            'qm_id': 'EMEP32',
            'qa_date': datetime.datetime(2014, 10, 22),
            'qa_outcome': False,  # no pass
        })
    ]
    # add the variable
    outfile.variables.append(DataObject(values_=values, flags=flags, flagcol=True,
                                        metadata=metadata))
setup_variables(nas)
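The three variable definitions above are nearly identical. Just for illustration, the same structures can be built more compactly with a small helper; this is a sketch, not part of ebas-io, and it uses only the metadata elements already shown above:
def add_variable(outfile, comp_name, title, values, flags, qa_outcome=True):
    # one variable definition, same metadata elements as in setup_variables()
    metadata = DataObject()
    metadata.comp_name = comp_name
    metadata.matrix = 'pm10'
    metadata.unit = 'ug/m3'
    metadata.title = title
    metadata.inlet_type = None
    metadata.flow_rate = 10
    metadata.qa = [
        DataObject({'qa_number': 1, 'qm_id': 'EMEP31',
                    'qa_date': datetime.datetime(2013, 10, 16), 'qa_outcome': qa_outcome}),
        DataObject({'qa_number': 2, 'qm_id': 'EMEP32',
                    'qa_date': datetime.datetime(2014, 10, 22), 'qa_outcome': qa_outcome}),
    ]
    outfile.variables.append(DataObject(values_=values, flags=flags, flagcol=True,
                                        metadata=metadata))

# equivalent to the first variable added in setup_variables():
# add_variable(nas, 'sodium', 'Na', [0.06, 0.0000022, None], [[], [], [999]])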
Now the file is ready to be written (without specifying anything special, it will be printed to stdout).
nas.write()
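If you want the output in a physical file rather than on the terminal, one generic possibility is to redirect stdout. This is a sketch that does not rely on any further ebas-io options (the write method itself may offer dedicated arguments for writing files, which are not covered here), and the file name used is arbitrary:
import contextlib

# EBAS has its own file name convention (see the 'File name' metadata line in the output below);
# here we just capture the stdout output in a file of our own choosing.
with open('my_output.nas', 'w') as out, contextlib.redirect_stdout(out):
    nas.write()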
Here is an annotated version of the output from above.
57 1001
Someone, Else
NO01L, Norwegian Institute for Air Research, NILU, Atmosphere and Climate Department, Instituttveien 18, , 2007, Kjeller, Norway
Someone, Else
CAMP EMEP
1 1
2020 01 01 2020 11 19
0.041667
days from file reference point
5
1 1 1 1 1
Missing values are automatically chosen for each variable. Usually it's all nines, one order of magnitude higher than the highest value in the data. The number of decimals is also chosen according to the data. For variables where the scientific representation has a clear advantage, it is used automatically (see sodium). The missing value for the flag column has been adjusted to the maximum number of flags used at the same time.
9.999999 99.999 99.999 9.9E+99 9.999
end_time of measurement, days from the file reference point
The variables are automatically sorted according to the standard sort order (important for generating reproducible files). Metadata are automatically moved between variable specific metadata and file global metadata:
- Inlet type is dropped completely (it was only set in the global metadata, and all variables override it with None).
- The metadata element flow_rate turns global (all variables use 10 l/min).
- QA metadata: most QA metadata are the same in all variables, but the outcome is 'no pass' for calcium. Thus, all QA metadata turn global, but the outcome is overwritten for calcium (see the remaining variable specific metadata).
calcium, ug/m3, QA1 outcome=no pass, QA2 outcome=no pass
magnesium, ug/m3
sodium, ug/m3
numflag, no unit
0
38
Data definition: EBAS_1.1
Set type code: TU
Timezone: UTC
File name: NO0002R.20200101000000.20201119190121.filter_3pack..pm10.3h.1h.NO01L_NILU_f3p_d_0001.NO01L_NILU_IC_03.lev2.nas
File creation: 20201119190124
Startdate: 20200101000000
Revision date: 20201119190121
Statistics: arithmetic mean
Data level: 2
Period code: 3h
Resolution code: 1h
Station code: NO0002R
Platform code: NO0002S
Station name: Birkenes II
Regime: IMG
Component:
Unit: ug/m3
Matrix: pm10
Laboratory code: NO01L
Instrument type: filter_3pack
Instrument name: NILU_f3p_d_0001
Analytical laboratory code: NO01L
Analytical measurement technique: IC
Analytical instrument name: NILU_IC_03
Analytical instrument manufacturer: Dionex
Analytical instrument model: ICS-3000
Analytical instrument serial number: 12345
Method ref: NO01L_NILU_IC_03
Flow rate turned global:
Flow rate: 10 l/min
QA metadata turned global, but calcium has an overwritten 'no pass'
QA1 measure ID: EMEP31
QA1 date: 20131016
QA1 outcome: pass
QA2 measure ID: EMEP32
QA2 date: 20141022
QA2 outcome: pass
Originator: Someone, Else, Someone@somewhere.no, Some nice Institute, WOW, Super interesting division, Street 18, , X-9999, Paradise, Norway
Submitter: Someone, Else, Someone@somewhere.no, Some nice Institute, WOW, Super interesting division, Street 18, , X-9999, Paradise, Norway
starttime endtime Ca Mg Na flag
Here we see again that the number representations have been optimized: the number of digits per column, and scientific notation in the case of sodium.
The number format for the flag column is also adjusted to the maximum number of flags used at the same time.
0.000000 0.041667 0.556 0.556 6.0E-02 0.000
0.041667 0.083333 1.220 1.220 2.2E-06 0.000
0.083333 0.125000 99.999 99.999 9.9E+99 0.999
In order to be more flexible in generating customized output, some of the standardisations and automatic adjustments shown above can be turned off. This might be useful for reproducing an output file in a certain format (according to a template). However, this has no effect on the data content; it's just a different representation of the same data!
We can force the variables to appear in the same sequence as specified when adding them to the file object. This is done by passing a keyword argument to the write method: suppress=SUPPRESS_SORT_VARIABLES.
As we've seen, the occurrence of all metadata elements is optimized before the object is written. If most variables use the same value for a metadata element, it is turned into a global element, so explicit values per variable are only needed for exceptional variables.
However, sometimes it might be convenient to suppress this automatic behaviour and rather write the metadata exactly as they were defined when creating the object: variable specific metadata stay variable specific, and global metadata stay global.
This can be achieved by passing SUPPRESS_METADATA_OCCURRENCE in the suppress parameter: suppress=SUPPRESS_METADATA_OCCURRENCE.
Suppress options can be combined with a bitwise OR to suppress both at once, e.g.: suppress=SUPPRESS_SORT_VARIABLES|SUPPRESS_METADATA_OCCURRENCE.
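As a minimal sketch (reusing the nas object built above, the same import and call as in the full example further below), the calls look like this:
from ebas.io.file import SUPPRESS_SORT_VARIABLES, SUPPRESS_METADATA_OCCURRENCE

# keep the variable order exactly as specified when building the object
nas.write(suppress=SUPPRESS_SORT_VARIABLES)
# additionally keep all metadata where they were defined (variable specific vs. global)
nas.write(suppress=SUPPRESS_SORT_VARIABLES|SUPPRESS_METADATA_OCCURRENCE)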
Sometimes one would like to statically choose the MISSING VALUE to be used for a variable. This might for example be useful when continuously generating near realtime files with just a short time interval of data in each file. The actual data for a short period might sometimes contain only low values, sometimes higher values. Thus the MISSING VALUE will probably change from file to file (which is perfectly correct, but an aesthetic annoyance).
In order to provide a MISSING VALUE, each variable definition in the file object's variables can have an element vmiss which contains the missing value that should be used.
Attention!
If the provided missing value is not sufficient for the actual data to be written, the missing value is NOT used. However, the notation of the specified missing value is still used.
Example: A variable would automatically get the missing value 99.999, but we explicitly specify a missing value of 9.99E+99, which is NOT sufficient to represent the values in the variable (the data need 5 digits of precision). The provided missing value is not used in this case. However, the fact that scientific format should be used is still respected: ebas-io will calculate a new missing value, but use scientific notation.
Now we start from scratch, build the same I/O object, but tweak some details about how the file should be written. First, rebuild the same object:
nas = EbasNasaAmes()
setup_global_metadata(nas)
add_sample_times(nas)
setup_variables(nas)
But this time, we want to control the missing values used in the output file:
from ebas.io.file import SUPPRESS_SORT_VARIABLES, SUPPRESS_METADATA_OCCURRENCE
# Variable 0: sodium
# We used values between 0.0000022 and 0.06, which would automatically lead to scientific
# notation for this variable. This time, we want to force decimal representation.
# We use 99.9999999 as the missing value.
nas.variables[0].vmiss = '99.9999999'
# If a flag column is written for this variable, use the format 9.999999 instead of 9.999
nas.variables[0].flag_vmiss = '9.999999'
# Variable 1: magnesium
# We used values between 0.556 and 1.22, which would automatically lead to 99.999 as missing value.
# This time we want to use 999.999 as missing value.
nas.variables[1].vmiss = '999.999'
# If a flag column is written for this variable, use the format 9.999999 instead of 9.999
nas.variables[1].flag_vmiss = '9.999999'
# Variable 2: calcium
# We used values between 0.556 and 1.22, which would automatically lead to 99.999 as missing value.
# This time we want to use scientific notation for this variable, although it's not an advantage
# given the range of actual values.
# But we fail in actually using a correct missing value format: '9.9E+99' is not sufficient for
# the data in the variable (a mantissa of at least 3 digits would be necessary). Ebas-io will ignore our
# provided missing value, but still obey our wish to use scientific notation. The calculated missing
# value will be 9.99E+99.
nas.variables[2].vmiss = '9.9E+99'
# If a flag column is written for this variable, use the format 9.999999 instead of 9.999
nas.variables[2].flag_vmiss = '9.999999'
# Additionally we want to prevent the automatic sorting of variables. The order should be
# sodium, magnesium, calcium, as we specified. We need to specify SUPPRESS_SORT_VARIABLES
# in the suppress parameter.
# Finally, we want to prevent any metadata from being moved between variable specific and global.
# All metadata should be written to the file as they are specified in the object. To achieve this,
# we add SUPPRESS_METADATA_OCCURRENCE to the suppress parameter.
nas.write(suppress=SUPPRESS_SORT_VARIABLES|SUPPRESS_METADATA_OCCURRENCE)
By default, ebas-io has a sophisticated way of generating flag columns, called FLAGS_ONE_OR_ALL (see below). Other strategies for generating flag columns can be selected via the flags parameter of the write method, for example FLAGS_ALL:
from ebas.io.file import FLAGS_ALL
nas.write(flags=FLAGS_ALL, suppress=SUPPRESS_SORT_VARIABLES|SUPPRESS_METADATA_OCCURRENCE)
When using flags=FLAGS_AS_IS, one can fully control which variables should get a flag column. EBAS-IO checks that the flag columns are legal and do not change the content. Keep in mind that very likely at least suppress=SUPPRESS_SORT_VARIABLES needs to be set additionally (it is hard to know manually which variables should have flag columns if the variables are re-sorted automatically); otherwise one will end up with one of the problems listed above. Example:
nas.variables[0].flagcol = False
nas.variables[1].flagcol = True
nas.variables[2].flagcol = True
from ebas.io.file import FLAGS_AS_IS
nas.write(flags=FLAGS_AS_IS, suppress=SUPPRESS_SORT_VARIABLES|SUPPRESS_METADATA_OCCURRENCE)