This data set covers environmental data for the region around the city of Manaus, Brazil, in the central Amazon Basin. The research problem is framed by two questions: "What correlations can arise from climate and fauna data in the Amazon Basin?" and "How useful are the data from instrumented stations in that area for studying fauna and climatic conditions?"
To answer these questions, an initial data set was generated for the main pollutants, drawing on NASA satellites, the Amazon Tall Tower Observatory (ATTO) experiment, and the GOAmazon project, from 2014 to 2015, along with other local data, such as the positions of thermal electric power stations.
It is worth noting that the available data were not consistent across sources, owing to different measurement time intervals, heights, and geographical locations. To support future bioclimatic research, the data set uses a grid with 5-km spacing and a 30-minute time step, covering the surface area bounded by latitudes 2.00 S to 4.25 S and longitudes 59.00 W to 61.25 W. The data from NASA satellites, ATTO, and GOAmazon do not have this spatial and temporal resolution. The Random Forest (RF) machine learning (ML) algorithm was therefore selected from among other ML algorithms and used to populate this new grid. The features were the data from NASA satellites and from the two projects above, covering the pollutants ozone, NOx, particulates with diameters down to 10 nanometers, isoprene, CO, and acetonitrile. The features also included distances and bearings to Manaus and to 11 nearby thermal power stations, day and night periods, and dry and wet seasons.
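The grid and the geometric features described above can be illustrated with a minimal sketch. It is not the code used to build the data set. It assumes a spherical Earth, a fixed degree step derived from the 5-km spacing at the mid-latitude of the domain, and great-circle (haversine) distances and initial bearings. The function names and the Manaus coordinates (an approximate city-centre value) are hypothetical choices made for illustration. The power station coordinates are not given in the text and are therefore omitted.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0      # mean Earth radius; spherical Earth is an assumption
MANAUS = (-3.10, -60.02)      # approximate city-centre coordinates (assumed)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = (math.cos(p1) * math.sin(p2)
         - math.sin(p1) * math.cos(p2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0


def grid_points(lat_north=-2.00, lat_south=-4.25,
                lon_east=-59.00, lon_west=-61.25, step_km=5.0):
    """Yield (lat, lon) nodes spaced roughly step_km apart inside the domain.

    South and west are negative. The degree step is fixed from the
    mid-latitude, so the east-west spacing is only approximately step_km.
    """
    km_per_deg = math.pi * EARTH_RADIUS_KM / 180.0
    dlat = step_km / km_per_deg
    mid_lat = math.radians((lat_north + lat_south) / 2)
    dlon = step_km / (km_per_deg * math.cos(mid_lat))
    n_lat = int((lat_north - lat_south) / dlat) + 1
    n_lon = int((lon_east - lon_west) / dlon) + 1
    for i in range(n_lat):
        for j in range(n_lon):
            yield lat_south + i * dlat, lon_west + j * dlon


def time_steps(start, end, minutes=30):
    """Yield timestamps from start (inclusive) to end (exclusive)."""
    t = start
    while t < end:
        yield t
        t += timedelta(minutes=minutes)


# Geometric features (distance and bearing to Manaus) for every grid node.
geo_features = [
    (lat, lon,
     distance_km(lat, lon, *MANAUS),
     bearing_deg(lat, lon, *MANAUS))
    for lat, lon in grid_points()
]
```

Each row of the final table would pair one grid node with one timestamp from `time_steps`. The same distance and bearing calculation would be repeated for each power station once its coordinates are known.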
The Random Forest interpolation targeted the pollutants acetonitrile, ozone, NOx, Particulate N, isoprene, and CO, which are commonly relevant to bioclimatic research. In total, the resulting data set has 51,148,800 rows and 44 columns: 38 for the features and 6 for the targets.
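As an illustration of the interpolation step, the sketch below fits a Random Forest regressor that maps 38 feature columns to the 6 pollutant targets and then predicts values at unobserved grid rows. The text does not name a library or say whether one multi-output model or six separate models were trained. The use of scikit-learn, the single multi-output model, the hyperparameters, and the column labels are all assumptions here. The arrays are synthetic random numbers standing in for the real observations, not actual measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_FEATURES, N_TARGETS = 38, 6   # column counts stated in the text
TARGETS = ["acetonitrile", "ozone", "NOx",
           "particulate_N", "isoprene", "CO"]  # labels are hypothetical

rng = np.random.default_rng(0)

# Synthetic stand-in for rows where observations exist (NOT real data).
X_obs = rng.normal(size=(500, N_FEATURES))
weights = rng.normal(size=(N_FEATURES, N_TARGETS))
y_obs = X_obs @ weights + 0.1 * rng.normal(size=(500, N_TARGETS))

# RandomForestRegressor accepts a 2-D target array, so one model can
# predict all six pollutants at once.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_obs, y_obs)

# Synthetic stand-in for grid rows that have features but no observations.
X_grid = rng.normal(size=(10, N_FEATURES))
y_grid = model.predict(X_grid)   # one row per grid row, one column per target
```

In a real workflow, part of the observed rows would be held out to check prediction error before the full grid is filled.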