Temporal aggregations on time series data – Writing R functions to tidy meteorological data and getting some insights from it

摘要： In this post we’re going to work with time series data, and write R functions to aggregate hourly and daily time series in monthly time series to catch a glimpse of their underlying patterns. For this analysis we’re going to use public meteorological data recorded by the government of the Argentinian province of San Luis. Data about rainfalls, temperature, humidity and in some cases winds, is published in the REM website (Red de Estaciones Meteorológicas, http://www.clima.edu.ar/). Also, here you can download meteorological data (in .csv format) that has been recorded by weather stations around different places from San Luis.

Even though weather stations are capable to record a huge amount of data, there is also a lot of work to do to obtain some information. For example, if observations are taken every hour, you’ll have 8700 records in 1 year!

So, in this post we’re going to:

Write functions in R to aggregate the big amount of information provided by weather stations.
Make these functions as general as possible in terms of the structure of the input data (time series data) in order to simplify downstream analysis of this data.
Finally, we’re going to plot our results to reach some conclusions about the weather in one town in San Luis Province, Argentina.

We chose to work with data from La Calera. Take a look to our target site on the map! It’s a little town in the Northwest of the San Luis Province, located in a semiarid region. It makes this town and its neighbors really interesting places to work with because in this desert places, climate has well defined seasons and rainfalls occurs mainly in a few months of the year.

We have already downloaded the .csv file of La Calera from the REM website, and you can find it here . Ok! Let’s work!

Tidying data – Writing an R function

First, we’re going to read the .csv with R and see its structure. How many rows does it have? Which are the first and the last dates? Which are the variables recorded? We’ll use head(), nrow() and str() to see it.

						
										# First we read the .csv with the data from La Calera
								
										#
								
										calera_raw <- read.csv(".../meteorologicos_la_calera.csv")
								
										head(calera_raw)
								
										#
								
										#> head(calera)
								
										# fecha mm temp humedad
								
										# 1 04/27/2018 17:00:00 0 22.8 49.8
								
										# 2 04/27/2018 16:00:00 0 24.0 47.9
								
										# 3 04/27/2018 15:00:00 0 26.4 45.5
								
										# 4 04/27/2018 14:00:00 0 25.9 49.7
								
										# 5 04/27/2018 13:00:00 0 26.0 56.7
								
										# 6 04/27/2018 12:00:00 0 25.6 58.5
								
										#
								
										# How many observations are recorded?
								
										nrow(calera_raw) # 91421
								
										#
								
										calera_raw[91421, 1] # [1] 10/16/2007 00:00:00
								
										calera_raw[1, 1] # 04/27/2018 17:00:00
								
										#
								
										str(calera_raw)
								
										#
								
										# > str(calera)
								
										# 'data.frame': 91421 obs. of 4 variables:
								
										# $ fecha : Factor w/ 91421 levels "01/01/2008 00:00:00",..: 30473 30472 30471 30470 30469 30468 30467 30466 30465 30464 ...
								
										# $ mm : num 0 0 0 0 0 0 0 0 0 0 ...
								
										# $ temp : num 22.8 24 26.4 25.9 26 25.6 25.7 25.3 23.9 20.7 ...
								
										# $ humedad: num 49.8 47.9 45.5 49.7 56.7 58.5 57.1 58.3 62 73.6 ...

view raw calera_raw hosted with

by GitHub

As you can see, this weather station recorded rainfalls (mm), humidity (%) and temperature (°C) every hour. Data recording started on October 2007 and finished on April 2018 (the day that we downloaded the .csv). There are in total more than 91.000 observations for each variable!. Hourly data won’t allow us to find some of the climatic patterns that we are looking for, so, we need to summarize its information first as daily data and then as monthly data.

We’re going to start writing an R function that will be useful for organize data in different stages of the analysis (see the flowchart below) . It takes the date of every row and splits it up in 3 columns: one column with the day’s number, a second column with the month’s number, and the last one with the year. In this function, we’re going to use the lubridate package. This package has some useful functions that we need: day(), month() and year() which extract the corresponding time component from a given date. In the following functions, we’ll use these new columns to subset the data.

diagrama_de_flujo_funcion

						
										# Function that creates day, month and year columns.
								
										# 
								
										OrganizeDateColumns <- function(data.x, ncol.date = 1, orders = "mdyHMS", ncol.day = NULL, ncol.month = NULL, ncol.year = NULL) {
								
										#
								
										require(lubridate)
								
										#
								
										# Organizes dates's columns using lubridate
								
										# If columns with days, months or years are NULL, creates these columns with a 
								
										# function inside the function
								
										#
								
										OrganizeDate <- function(ncol.time, fun) {
								
										names.col <- colnames(data.x)
								
										if (is.null(ncol.time) == TRUE) {
								
										if (lubridate::is.POSIXt(data.x[ , ncol.date]) == TRUE) {
								
										col.time <- match.fun(fun)(data.x[ , ncol.date])
								
										data.x[ ,(ncol(data.x) + 1)] <- col.time
								
										colnames(data.x) <- c(names.col, fun)
								
										} else {
								
										data.x[ , ncol.date] <- parse_date_time(x = data.x[ , ncol.date],
								
										orders = orders)
								
										col.time <- match.fun(fun)(data.x[ , ncol.date])
								
										data.x[ ,(ncol(data.x) + 1)] <- col.time
								
										colnames(data.x) <- c(names.col, fun)
								
										}
								
										} else {
								
										col.time <- data.x[ , ncol.time]
								
										colnames(data.x)[ncol.time] <- fun
								
										}
								
										data.x <- data.x
								
										}
								
										# 
								
										# Run internal function to data
								
										#
								
										data.x <- OrganizeDate(ncol.time = ncol.day, fun = "day")
								
										data.x <- OrganizeDate(ncol.time = ncol.month, fun = "month")
								
										data.x <- OrganizeDate(ncol.time = ncol.year, fun = "year")
								
										data.x[ , ncol.date] <- as.Date(paste(data.x$year, data.x$month, data.x$day, sep = "-"))
								
										return(data.x)
								
										}

view raw OrganizeDateColumns.R hosted with

by GitHub

If we use our function, we get:

						
										calera_organized <- OrganizeDateColumns(data.x = calera_raw, ncol.date = 1, orders = "mdyHMS", ncol.day = NULL, ncol.month = NULL, ncol.year = NULL)
								
										head(calera_organized)
								
										#
								
										# > head(calera_organized)
								
										# fecha mm temp humedad day month year
								
										# 1 2018-04-27 0 22.8 49.8 27 4 2018
								
										# 2 2018-04-27 0 24.0 47.9 27 4 2018
								
										# 3 2018-04-27 0 26.4 45.5 27 4 2018
								
										# 4 2018-04-27 0 25.9 49.7 27 4 2018
								
										# 5 2018-04-27 0 26.0 56.7 27 4 2018
								
										# 6 2018-04-27 0 25.6 58.5 27 4 2018

view raw calera_organized hosted with

by GitHub

Aggregate daily data – Writing an R function

Now, we’re going to aggregate our hourly data in daily data. We wrote another function that uses the columns that we have obtained to subset the data. For each day, we created a subset with its corresponding hours and calculated the daily mean or the sum for data (see the diagram below).

flowchart_funcion2

As we’re working with meteorological data, the sum will be useful for precipitations, and the mean for temperature. We calculated the daily mean value as an average of extreme hourly values, due to daily mean temperature it’s commonly estimated as the average of minimum and maximum temperatures (see http://climod.nrcc.cornell.edu/static/glossary.html and Dall’Amico and Hornsteiner, 2006). This is:

For each day of hourly data we have the following set: $T=\{T_0,...,T_{23}\}$ so we calculate the average as:

$average_T= {{max(T) + min(T)} \over {2}}$

In R, we can do it using max() and min() functions. But wait! It’s normal that weather stations don’t work during some days, so you can have entire days where they only recorded “NAs”. So, you have to keep this in mind, because max() and min() functions will return you:

		
								In min(x) : no non-missing arguments to min; returning Inf

								In max(x) : no non-missing arguments to min; returning Inf

To avoid that, we used an if…else statement in line 32.

						
										HourlyToDaily <- function(data.x, ncol.obs, ncol.date, na.presence = TRUE, value = "mean") {
								
										#
								
										if (!any(value == c("mean", "accumulated"))) {
								
										stop(" value must to be 'mean' or 'accumulated'")
								
										}
								
										# Calculate the mean of the variable of interest
								
										first.year <- min(data.x$year, na.rm = na.presence)
								
										last.year <- max(data.x$year, na.rm = na.presence)
								
										x.daily.per.years <- list()
								
										date.tot <- list()
								
										c = 1
								
										# Creates a temporary i data.frame for each year == i
								
										for (i in first.year:last.year) {
								
										x.anual <- data.x[data.x$year == i, ]
								
										# It takes the i year df and creates a j df for each month == j
								
										first.month = min(x.anual$month, na.rm = na.presence)
								
										last.month = max(x.anual$month, na.rm = na.presence)
								
										x.daily.per.month <- list()
								
										date.list <- list()
								
										for (j in first.month:last.month) {
								
										x.month <- x.anual[x.anual$month == j, ]
								
										# It takes the j month df and creates a k df for each day == k
								
										first.day = min(x.month$day, na.rm = na.presence)
								
										last.day = max(x.month$day, na.rm = na.presence)
								
										x.value.max <- list()
								
										x.value.min <- list()
								
										x.value <- list()
								
										date.x <- list()
								
										for (k in first.day:last.day) {
								
										x.day <- x.month[x.month$day == k, ]
								
										# Calculates the mean value or the accumulated value
								
										if (all(unlist(lapply(X = x.day[ , ncol.obs], FUN = is.na)))) {
								
										x.value[k] <- NA
								
										} else {
								
										if (value == "mean") {
								
										x.value.max[k] <- max(x.day[ , ncol.obs], na.rm = na.presence)
								
										x.value.min[k] <- min(x.day[ , ncol.obs], na.rm = na.presence)
								
										x.value[k] <- mean(x = c(x.value.max[[k]], x.value.min[[k]]), na.rm = na.presence)
								
										} else {
								
										x.value[k] <- sum(x.day[ , ncol.obs], na.rm = na.presence)
								
										}
								
										}
								
										date.x[k] <- x.day[ , ncol.date][1]
								
										} # end cycle day
								
										x.daily.per.month[[j]] <- unlist(x.value)
								
										date.list[[j]] <- date.x
								
										} # end cycle month
								
										x.daily.per.years[[c]] <- unlist(x.daily.per.month)
								
										date.tot[[c]] <- unlist(date.list)
								
										c = c + 1
								
										daily.data <- as.data.frame(unlist(x.daily.per.years))
								
										daily.data <- cbind.data.frame(unlist(date.tot), daily.data)
								
										colnames(daily.data) <- c("date", "value")
								
										daily.data$date <- as.Date.numeric(daily.data$date, origin = "1970-01-01")
								
										}
								
										return(daily.data)
								
										}

view raw HourlyToDaily hosted with

by GitHub

Exploratory plots – Daily data from La Calera

Let’s plot daily data to look for patterns!

calera_daily_precipitations calera_daily_mean_temperature

We’ve already have our first results! Rainfalls are higher during a period of each year, and between periods there are gaps of dry weather. On the other side, mean temperatures have an oscillating pattern, reaching 30°C in summer and 0°C in winter. So, as we expected from this temperate desert place, seasons are well-defined and temperature have a wide range through the year.

Aggregate monthly data – Writing an R function

Ok, let’s continue a bit more! To aggregate from daily data to monthly data, we have to use again the first function (add the time component’s columns) and then build a function with “for loops” similar to the second function. In this case, we didn’t use max() or min() functions, because monthly means are simply the average of daily means per month.

We ran the function:

and we obtained the followings monthly time series:

						
										# > head(m_mm_ac)
								
										# month.order year accumulated.value
								
										# 1 10 2007 0.6
								
										# 2 11 2007 26.7
								
										# 3 12 2007 68.3
								
										# > head(m_t_mean)
								
										# month.order year mean.value
								
										# 1 10 2007 22.48750
								
										# 2 11 2007 21.04500
								
										# 3 12 2007 23.29516

view raw m_mm_ac hosted with

by GitHub

Exploratory plots – Monthly data from La Calera

To end this analysis, we’re going to do an exploratory analysis of our monthly data. For this we are going to make a multiple barplot using a ‘for loop’ and an overlayed lineplot for each one of the years in the dataset.

la_calera

As we previously saw, rains are concentrated only in some months. We can distinguish from the bar charts that La Calera has a dry period during winter, when temperatures are lower, and a rainy period during the warm seasons.

Well, the analysis about yearly means of accumulated precipitations and temperatures it is now up to you!

You can see the entire R script of this post in here, and remember that all of our scripts and data used for the blog are available in our GitHubs.

And, if you’re interested in other analysis of different kinds of time series data, you could take a look to our other post:

轉貼自： Monkeysworking

訪客 - Negiht
回報固定連結

It makes this town and its neighbors really interesting places to work with because in this desert places, climate has well defined seasons and rainfalls occurs mainly in a few months of the year. fireboy and watergirl online.

約 6 年前來自 Chicago, IL, USA -87.62979819999998 41.8781136 http://maps.google.com/maps?z=15&q=41.8781136,-87.62979819999998

0 讚分享短網址: Facebook Twitter 回覆

Temporal aggregations on time series data – Writing R functions to tidy meteorological data and getting some insights from it

Tidying data – Writing an R function

Aggregate daily data – Writing an R function

Exploratory plots – Daily data from La Calera

Aggregate monthly data – Writing an R function

Exploratory plots – Monthly data from La Calera

留下你的回應

以訪客張貼回應

在此對話中的人們

回應 (1)

訪客 - Negiht

釘選列表

喜愛列表

Web Services

YOU MAY BE INTERESTED

Popular Tags

	# First we read the .csv with the data from La Calera
	#
	calera_raw <- read.csv(".../meteorologicos_la_calera.csv")
	head(calera_raw)
	#
	#> head(calera)
	# fecha mm temp humedad
	# 1 04/27/2018 17:00:00 0 22.8 49.8
	# 2 04/27/2018 16:00:00 0 24.0 47.9
	# 3 04/27/2018 15:00:00 0 26.4 45.5
	# 4 04/27/2018 14:00:00 0 25.9 49.7
	# 5 04/27/2018 13:00:00 0 26.0 56.7
	# 6 04/27/2018 12:00:00 0 25.6 58.5
	#
	# How many observations are recorded?
	nrow(calera_raw) # 91421
	#
	calera_raw[91421, 1] # [1] 10/16/2007 00:00:00
	calera_raw[1, 1] # 04/27/2018 17:00:00
	#
	str(calera_raw)
	#
	# > str(calera)
	# 'data.frame': 91421 obs. of 4 variables:
	# $ fecha : Factor w/ 91421 levels "01/01/2008 00:00:00",..: 30473 30472 30471 30470 30469 30468 30467 30466 30465 30464 ...
	# $ mm : num 0 0 0 0 0 0 0 0 0 0 ...
	# $ temp : num 22.8 24 26.4 25.9 26 25.6 25.7 25.3 23.9 20.7 ...
	# $ humedad: num 49.8 47.9 45.5 49.7 56.7 58.5 57.1 58.3 62 73.6 ...

	# Function that creates day, month and year columns.
	#
	OrganizeDateColumns <- function(data.x, ncol.date = 1, orders = "mdyHMS", ncol.day = NULL, ncol.month = NULL, ncol.year = NULL) {
	#
	require(lubridate)
	#
	# Organizes dates's columns using lubridate
	# If columns with days, months or years are NULL, creates these columns with a
	# function inside the function
	#
	OrganizeDate <- function(ncol.time, fun) {
	names.col <- colnames(data.x)
	if (is.null(ncol.time) == TRUE) {
	if (lubridate::is.POSIXt(data.x[ , ncol.date]) == TRUE) {
	col.time <- match.fun(fun)(data.x[ , ncol.date])
	data.x[ ,(ncol(data.x) + 1)] <- col.time
	colnames(data.x) <- c(names.col, fun)
	} else {
	data.x[ , ncol.date] <- parse_date_time(x = data.x[ , ncol.date],
	orders = orders)
	col.time <- match.fun(fun)(data.x[ , ncol.date])
	data.x[ ,(ncol(data.x) + 1)] <- col.time
	colnames(data.x) <- c(names.col, fun)
	}
	} else {
	col.time <- data.x[ , ncol.time]
	colnames(data.x)[ncol.time] <- fun
	}
	data.x <- data.x
	}
	#
	# Run internal function to data
	#
	data.x <- OrganizeDate(ncol.time = ncol.day, fun = "day")
	data.x <- OrganizeDate(ncol.time = ncol.month, fun = "month")
	data.x <- OrganizeDate(ncol.time = ncol.year, fun = "year")
	data.x[ , ncol.date] <- as.Date(paste(data.x$year, data.x$month, data.x$day, sep = "-"))
	return(data.x)
	}

	calera_organized <- OrganizeDateColumns(data.x = calera_raw, ncol.date = 1, orders = "mdyHMS", ncol.day = NULL, ncol.month = NULL, ncol.year = NULL)
	head(calera_organized)
	#
	# > head(calera_organized)
	# fecha mm temp humedad day month year
	# 1 2018-04-27 0 22.8 49.8 27 4 2018
	# 2 2018-04-27 0 24.0 47.9 27 4 2018
	# 3 2018-04-27 0 26.4 45.5 27 4 2018
	# 4 2018-04-27 0 25.9 49.7 27 4 2018
	# 5 2018-04-27 0 26.0 56.7 27 4 2018
	# 6 2018-04-27 0 25.6 58.5 27 4 2018

	HourlyToDaily <- function(data.x, ncol.obs, ncol.date, na.presence = TRUE, value = "mean") {
	#
	if (!any(value == c("mean", "accumulated"))) {
	stop(" value must to be 'mean' or 'accumulated'")
	}
	# Calculate the mean of the variable of interest
	first.year <- min(data.x$year, na.rm = na.presence)
	last.year <- max(data.x$year, na.rm = na.presence)
	x.daily.per.years <- list()
	date.tot <- list()
	c = 1
	# Creates a temporary i data.frame for each year == i
	for (i in first.year:last.year) {
	x.anual <- data.x[data.x$year == i, ]
	# It takes the i year df and creates a j df for each month == j
	first.month = min(x.anual$month, na.rm = na.presence)
	last.month = max(x.anual$month, na.rm = na.presence)
	x.daily.per.month <- list()
	date.list <- list()
	for (j in first.month:last.month) {
	x.month <- x.anual[x.anual$month == j, ]
	# It takes the j month df and creates a k df for each day == k
	first.day = min(x.month$day, na.rm = na.presence)
	last.day = max(x.month$day, na.rm = na.presence)
	x.value.max <- list()
	x.value.min <- list()
	x.value <- list()
	date.x <- list()
	for (k in first.day:last.day) {
	x.day <- x.month[x.month$day == k, ]
	# Calculates the mean value or the accumulated value
	if (all(unlist(lapply(X = x.day[ , ncol.obs], FUN = is.na)))) {
	x.value[k] <- NA
	} else {
	if (value == "mean") {
	x.value.max[k] <- max(x.day[ , ncol.obs], na.rm = na.presence)
	x.value.min[k] <- min(x.day[ , ncol.obs], na.rm = na.presence)
	x.value[k] <- mean(x = c(x.value.max[[k]], x.value.min[[k]]), na.rm = na.presence)
	} else {
	x.value[k] <- sum(x.day[ , ncol.obs], na.rm = na.presence)
	}
	}
	date.x[k] <- x.day[ , ncol.date][1]
	} # end cycle day
	x.daily.per.month[[j]] <- unlist(x.value)
	date.list[[j]] <- date.x
	} # end cycle month
	x.daily.per.years[[c]] <- unlist(x.daily.per.month)
	date.tot[[c]] <- unlist(date.list)
	c = c + 1
	daily.data <- as.data.frame(unlist(x.daily.per.years))
	daily.data <- cbind.data.frame(unlist(date.tot), daily.data)
	colnames(daily.data) <- c("date", "value")
	daily.data$date <- as.Date.numeric(daily.data$date, origin = "1970-01-01")
	}
	return(daily.data)
	}

	# > head(m_mm_ac)
	# month.order year accumulated.value
	# 1 10 2007 0.6
	# 2 11 2007 26.7
	# 3 12 2007 68.3
	# > head(m_t_mean)
	# month.order year mean.value
	# 1 10 2007 22.48750
	# 2 11 2007 21.04500
	# 3 12 2007 23.29516

	今日	1700
	昨日	1727
	本週	7479
	本月	15822
	總訪客量	2761856