Abstract

Air quality monitoring data are useful in different areas of research and have varied applications, especially in studies of the relationship between air pollution, respiratory problems, and other health hazards. The main atmospheric pollutants are ozone (O3), sulfur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO2), and particulate matter (PM). PM is one of the main objects of study when the aim is to protect people from exposure to pollutants. This study analyzes PM 2.5 at 21 stations in the state of São Paulo monitored by the Environmental Company of São Paulo State (CETESB). It employs cluster analysis, a prominent data mining method for detecting patterns and discovering similarities, which is important for assessing air pollution, especially in a geographically vast area such as the state of São Paulo, which does not follow a single pattern. Another data mining technique (association rules) supports the analysis of the relationship between pollutants and meteorological variables, as it identifies elements that occur together in a wide variety of data. Our objectives include determining stations with similar behaviors and exploring the temporal variation of the pollutant as it relates to the dominant meteorological factors in periods of high concentration. The clustering algorithm automatically separates stations according to their monthly averages of PM 2.5 concentration between 2017 and 2019. The cluster of stations that showed the highest pollution rates essentially included urban centers with industrial and vehicular emissions, while the cluster with the lowest rates was located further inland. A cyclical behavior in pollutant variation was also observed in the three years under study and for both clusters.
For the months with the highest concentration of PM 2.5 , association rule learning was applied to connect air temperature, relative humidity, and wind speed with PM 2.5 and carbon monoxide (CO) concentrations. The obtained results are useful to analyze the temporal and geolocation profiles of pollution by particulate matter, since they identify the behavior of the meteorological factors that predominate in periods of greater concentration.


Introduction
Nine out of 10 people worldwide breathe polluted air, according to the annual report of the World Health Organization (WHO). Every year, seven million people die worldwide from causes directly related to air pollution, yet contamination levels remain high (WHO, 2019).
According to the Environmental Company of São Paulo State (CETESB, 2019), the main air pollutants regulated by the National Environment Council (CONAMA) are: coarse inhalable particles (PM 10), fine inhalable particles (PM 2.5), carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), total suspended particles (TSP), smoke (SMO), and lead (Pb), the latter three being monitored only in specific situations. Studies on the effects of pollution on health (POLEZER et al., 2018; MACHIN; NASCIMENTO, 2018; SEINFELD; PANDIS, 2016; NODARI; SALDANHA, 2016) show that exposure to fine particulate matter (PM 2.5) can cause respiratory problems and even premature deaths, since it penetrates deeply into the respiratory system, reaching the pulmonary alveoli and the bloodstream.
Because it is associated with damage to human health and has impacts on climate and the environment, PM 2.5 was chosen as the study object in this research. PM consists of solid or liquid particles suspended in the atmosphere, which can be generated by several sources, in different sizes and compositions (DIMITRIOU, 2016; ANDRADE et al., 2012; QUALAR, 2019). It is classified by its aerodynamic diameter (a_d): particles with a_d ≤ 2.5 μm are named PM 2.5 (fine inhalable particulate matter) and those with 2.5 μm < a_d ≤ 10 μm are named PM 10 (coarse inhalable particulate matter). These pollutants can come from several sources, such as vehicles, industries, power plants, and fires in general. Regardless of its origin, PM may be transported between cities by air masses through atmospheric circulation (NOGAROTTO, 2019).
Meteorological variables directly interfere with the concentration of atmospheric pollutants by controlling the dispersion process of substances that are toxic and carcinogenic or that potentiate harmful effects on the environment and health (YANAGI; ASSUNÇÃO; BARROZO, 2012). The relationship between pollutant concentration and meteorological variables such as: air temperature (TEMP), relative humidity (RH), wind speed (WS), wind direction (WD), precipitation (PRE), atmospheric instability, and others that vary during the year is well known (GUERRA; MIRANDA, 2011). Given this relationship, studies such as the one by Bisht and Seeja (2018), in India, predict next-day air quality from the previous day's pollutant concentration data (PM 10 , PM 2.5 , NO 2 , CO, and O 3 ) and meteorological variables (RH, PRE, TEMP, WS, and WD), using regression models. Gonçalves et al. (2005), in a research study in the city of São Paulo, proved that during summer, hot and humid days favor the decrease of PM 10 , SO 2 , and O 3 concentrations.
In winter, air quality worsens, especially regarding PM and CO concentrations, since weather conditions in this season of the year are less favorable for their dispersion (SANTOS; CARVALHO; REBOITA, 2016;MORAES et al., 2019;CETESB, 2019). Therefore, the interaction between atmospheric conditions and sources of pollution defines air quality, which in turn determines the emergence of adverse effects on people's health.
A study by Abe and Miraglia (2018) shows a reduction of about 25.45% in PM 2.5 concentration in the city of São Paulo from 2000 to 2011, due to actions to contain the increase in the automotive fleet. Typically, in metropolitan regions, motor vehicles are a major cause of air pollution. A study by Andrade et al. (2012) states that vehicle emissions, biomass burning, and fuel combustion in industries explain at least 40% of PM 2.5 in six Brazilian states, including São Paulo.
In addition to associating air pollutants with meteorological variables, it is also possible to establish a relation between the behaviors of different air pollutants. Moisan, Herrera and Clements (2018) reported an association between car pollution and firewood burning as regards CO concentration in the atmosphere, noting that 54% of PM 2.5 concentration is composed of CO, which shows a direct relationship between these pollutants. They also found a strong negative correlation with the variables TEMP and WS, in addition to a positive relationship with RH. Saide et al. (2011) developed a CO forecasting system as a substitute for PM 10 and PM 2.5 , identifying a high correlation (of above 0.95) between these pollutants in Santiago (Chile), during winter nights. Therefore, by predicting CO, an estimate of PM could be obtained. The greatest benefit of the study was its ability to predict critical episodes up to 48 hours ahead. Reinhardt, Ottmar and Castilla (2011) observed that, in Brazil, the concentration levels of CO and particulate matter are correlated and that, during the burning season, CO levels in rural areas are comparable to those of urban centers, moderately polluted.
Considering this scenario, it is important to investigate the behavior of pollutants, in particular PM 2.5 . Despite the fact that the problem is widely discussed in various spheres of the scientific community, the literature lacks studies whose assessment uses artificial intelligence techniques and involves knowledge about the associations between pollutants, emission sources, and their effects on air quality (AMEER et al., 2019). The analysis of the sources of pollution by PM 2.5 throughout the state of São Paulo is considered a zoning problem, zoning being the discovery of different regions with similar characteristics. Data clustering technique is a prominent method for recognizing new patterns, and it is applied in exploratory data analysis. It is a suitable solution when searching for similar patterns and behaviors in different regions, which leads to the discovery of previously unknown clusters (HAN; KAMBER; PEI, 2011;KWEDLO, 2011).
Research carried out in Brazil (NODARI; SALDANHA, 2016; GUIDETTI; PEREDA, 2018) and in other countries which applied clustering techniques identified regions with similar patterns of air pollution. A study in China (XIAO et al., 2020) performed cluster analysis to measure similarities in the characteristics of industrial emissions from 31 companies in different regions; results showed that pollution characteristics were similar for companies in the same cluster, which contributed to the development of specific measures for pollution control. Also in China, studies involving 13 sites with similar PM 2.5 concentration data resulted in the discovery of three clusters: two of industrial activities and another of agricultural and tourist activities (HUANG et al., 2015).
In the United States, a research study clustered locations according to PM 2.5 levels and obtained clusters by regions with similar industrial activity (AUSTIN et al., 2013). The study by Zou et al. (2014), conducted with data from the U.S. urban census, was used to investigate the population's exposure to air pollution, considering age, race, education level, and income. By applying a spatial clustering method, it was possible to show disparities in the spatial distribution of exposure to pollution throughout the territory.
Alternatively, clustering technique is also used as a preprocessing step for selecting attributes or applying other data mining algorithms. An example is the study by Du and Varde (2016), which applies association rules, clustering, and classification to identify relationships between particulate matter, pollution, and road traffic.
Another way to extract knowledge is by discovering relationships between different attributes in the database; the association rule algorithm has been efficient in this sense, given its applicability in several scenarios, such as the context of air pollution (NEIROTTI et al., 2014;AGRAWAL;SRIKANT, 1994). Association rules also contribute to discovering unexpected rules with a high degree of interest in the context in which they are inserted. In our study, association rules looked for relationships between the behavior of PM 2.5 and meteorological variables, in the different clusters identified in the clustering step. They also attempted to verify whether PM 2.5 and CO were related. Li et al. (2020) proposed, by using association rules, the analysis of data from various air monitoring stations in China and micro stations in the USA, considering the uneven distribution of environmental monitoring data and the characteristics of climate change, and obtained a correlation between pollutants which provides support for the treatment and prevention of air pollution. Souza and Rabelo (2016) applied association rules to identify a set of variables that often occur together: air pollutant concentrations and rates of respiratory problems. Sadat, Karimipour and Sadat (2014) explored, by association rules, the effect of air pollution on asthmatic allergies, indicating that distance from parks and roads, as well as pollutant concentrations of CO, PM 10 , PM 2.5 , and NO 2 , are related to the prevalence of allergies in the most polluted month of the year, while SO 2 and O 3 have no effect on it.
This article proposes a data mining approach to analyze the air quality monitoring database provided by CETESB, between 2017 and 2019. Such analysis was carried out by applying machine learning techniques on two fronts:
• using the partitional clustering algorithm (K-medoids) to form clusters, based on the PM 2.5 concentrations of 21 stations in the state of São Paulo;
• applying the association rules algorithm (Apriori) to discover possible associations between meteorological variables that affect the increase in PM 2.5 concentration and investigate the seasonal relationship between PM 2.5 and CO.
These studies can generate knowledge that contributes to the management of air quality and provides information for an assessment of its impact on health and the environment.

Methods
The methodology used in this study is presented as follows:
• a presentation of the places where the air pollution data were collected and how they were preprocessed so as to be used by machine learning algorithms;
• an explanation of clustering algorithms and association rules, as well as their respective validation metrics.

Study site
Diagnosis of air quality in the state of São Paulo is made by the network of monitoring stations of CETESB, which reports pollutant concentrations and generates an air quality index that ranges from good through moderate, bad, and very bad to terrible. These categories are important for reporting compliance with the air quality standards set by law and for determining when pollution levels represent significant risks to human health.
Assessment is carried out based on the state's air quality standards (Table 1). Both the CONAMA Resolution and the State Decree define intermediate targets (IT) so that air pollution is gradually reduced based on the guidelines proposed by WHO. It can be observed (Table 1) that national values are well above the international quality standard.
To analyze the behavior of PM 2.5 in different areas of the state of São Paulo, we obtained data from all cities that have stations with pollutant monitoring. Altogether, there are 21 stations, listed in Table 2 along with their geolocation (Figure 1).

Database and preprocessing
The first database was obtained from the CETESB website, by the Air Quality platform (QUALAR, 2019), which contains data collected by automatic monitoring stations. Data on monthly average PM 2.5 concentration from January 1 st 2017 to December 31 st 2019 were used. They generated a set of 21 records (stations) and 36 columns (months) representing the three-year period.
On this first database, preprocessing was carried out to identify months with missing values in PM 2.5 monitoring. To perform the study of time series, all values must be completed (CASTRO; FERRARI, 2016). Where values were missing in a given month, the last and next technique was adopted (PLAIA; BONDI, 2006), that is, a missing value is replaced by the average between the values of the previous and the next month. In addition, the data were standardized using the Z-score technique, which transforms the original values so that they have a mean of 0 and a standard deviation of 1, resulting in values that can be compared on the same scale (HAN; KAMBER, 2006; MITSA, 2010; BATISTA; CHIAVEGATTO, 2019).
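The two preprocessing steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names, the fallback to the nearest observed neighbour for consecutive gaps or gaps at the ends of a series, the use of the population standard deviation, the choice to standardize one station's series at a time, and the sample values are all assumptions.

```python
from statistics import mean, pstdev

def fill_last_and_next(series):
    """Replace each missing value (None) with the mean of the nearest
    observed values before and after it. If only one side has an observed
    value (gap at the start or end), that single value is used (assumption)."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        prev_val = next((v for v in reversed(series[:i]) if v is not None), None)
        next_val = next((v for v in series[i + 1:] if v is not None), None)
        neighbours = [v for v in (prev_val, next_val) if v is not None]
        filled[i] = mean(neighbours)  # raises if the whole series is missing
    return filled

def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly PM 2.5 averages (ug/m3) for one station, one month missing.
monthly = [18.2, None, 24.6, 21.0]
filled = fill_last_and_next(monthly)
standardized = z_score(filled)
```

After standardization, stations with different absolute concentration levels can be compared by the shape of their monthly series rather than by magnitude alone.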
To build the second database, used in the step of association rules extraction, we verified the stations that monitor PM 2.5 and that also provide monthly averages of the following meteorological variables: RH, TEMP, WS, in addition to CO concentration (QUALAR, 2019) between 2017 and 2019. Of the 21 stations whose data were obtained for the first database, seven met this new criterion (Table 3).
RBCIAMB | v.56 | n.1 | Mar 2021 | 152-165 - ISSN 2176-9478
For this new dataset, all data must be categorical, since this is a restriction of the Apriori algorithm. Thus, each monthly average value was classified according to two categories: lower or higher than the annual average value of its respective meteorological variable or CO concentration. Table 3 represents an excerpt from the database, referring to the month of July 2018.
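The binarization described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the label format, the treatment of a value exactly equal to the annual average (assigned to "below"), and the sample values are hypothetical and do not come from the study.

```python
from statistics import mean

def discretize(monthly_values, variable):
    """Label each monthly average as above or below the annual average
    of the same variable. Ties are labelled 'below' (assumption)."""
    annual_mean = mean(monthly_values)
    return [
        f"{variable}_above" if value > annual_mean else f"{variable}_below"
        for value in monthly_values
    ]

# Hypothetical monthly relative humidity averages (%) for one station and year.
rh_2018 = [78, 80, 76, 70, 66, 60, 55, 58, 63, 72, 75, 79]
rh_labels = discretize(rh_2018, "RH")
```

Each station's labels for a given month, taken across all variables, then form one categorical transaction of the kind Apriori expects.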
The algorithms applied in this study follow the unsupervised approach of machine learning, divided into two stages:
• application of the partitional clustering algorithm (K-medoids);
• association rules (Apriori).
The next sections discuss these algorithms.

Data clustering technique
Clustering algorithms can be either partitional or hierarchical. Their ability to cluster data based on intrinsic characteristics of the problem makes them attractive for exploratory studies. Such algorithms generate clusters formed by data samples that are similar to each other, according to some measure of similarity. Assuming, for example, a problem of clustering cities by level of air quality, the clustering algorithm will map the cities and return clusters composed of those with similar pollution behavior. Among partitional algorithms, the most common are K-means and K-medoids (JIN; HAN, 2017). The K-medoids algorithm uses objects from the database as the centers of the clusters, called medoids, which have the lowest average dissimilarity to all other objects in the cluster. In K-means, the center of each cluster is the average value of the objects in that cluster. In this case, outliers in the database can influence the formation of the clusters, since they contribute to the calculation of the central values. This problem does not occur in the K-medoids algorithm, since the medoids are real samples of the data and not averages (HAN; KAMBER, 2006). In other words, each medoid is an element of the cluster itself rather than a computed midpoint as in K-means, which makes K-medoids less sensitive to outliers.
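The difference in outlier sensitivity can be illustrated with a single one-dimensional cluster. The sketch below is not part of the study's implementation: the helper names and the concentration values are hypothetical, and absolute difference is assumed as the dissimilarity measure.

```python
def centroid(points):
    """Cluster center as used by K-means: the arithmetic mean."""
    return sum(points) / len(points)

def medoid(points):
    """Cluster center as used by K-medoids: the member with the lowest
    total dissimilarity (here, absolute difference) to all members."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

# Hypothetical PM 2.5 values (ug/m3); the last one is an outlier.
values = [12.0, 13.0, 14.0, 15.0, 60.0]

mean_center = centroid(values)    # pulled toward the outlier
medoid_center = medoid(values)    # stays at a real, typical observation
```

In this example the mean is drawn well above the four typical values, whereas the medoid remains one of them.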
Both algorithms (K-means and K-medoids) were implemented in Python, using the open-source Scikit-Learn and PyClustering libraries, specific for machine learning (PEDREGOSA et al., 2011).
To assess the quality of the clustering between the K-medoids and K-means algorithms, the silhouette coefficient was applied (KAUFMAN; ROUSSEEUW, 2005) to the results obtained by each algorithm. This coefficient measures the robustness of the partitions, helping to select the number of clusters, considering the internal similarity and external dissimilarity between them, that is, it combines cohesion (measures how well an element is within a cluster) and separation (measures how much the clusters are separated from each other). For example, supposing that the clustering algorithm returns two clusters, as in the previous example, the silhouette coefficient will verify whether all the elements of Cluster 1 are similar to each other and different from the elements of Cluster 2. An expected behavior would be that this hypothetical Cluster 1 would include cities with a high concentration of one pollutant and Cluster 2, cities with a low concentration of the same pollutant. Therefore, Cluster 1 and Cluster 2 would be cohesive, since they would have cities that show the same behavior, and also separated from each other for presenting an entirely different pattern.
The average value of the silhouette coefficient lies between -1 and 1 and represents how well the clusters were formed. The ideal values are positive, with a silhouette coefficient close to 1. Equation 1 represents the average silhouette calculation:
$$S = \frac{1}{N} \sum_{i=1}^{N} s(x_i) \qquad (1)$$
Where: $N$ = the number of objects in the database and $s(x_i)$ = the individual value of the silhouette coefficient of element $x_i$, obtained by Equation 2:
$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} \qquad (2)$$
Where: the values $a(x_i)$ and $b(x_i)$ = respectively, the average distance between $x_i$ and all the other objects in its cluster and the average distance from $x_i$ to the objects of the nearest cluster to which $x_i$ does not belong.
The silhouette coefficient was also the evaluation metric chosen to determine which of the two algorithms (K-means and K-medoids) would be used in this study. Therefore, the database of monthly PM 2.5 averages was used and the two algorithms were applied to carry out this evaluation. The one that presented the best silhouette result was adopted for the clustering of stations. This experiment is presented in the Results section.
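The average silhouette calculation can be sketched in plain Python as below. This is an illustrative implementation of the standard definition (KAUFMAN; ROUSSEEUW, 2005), not the code used in the study: `a` denotes the average distance from a point to the other members of its own cluster and `b` the average distance to the members of the nearest other cluster. Euclidean distance, the score of 0 for single-member clusters, and the sample points are assumptions, and at least two clusters are required.

```python
from math import dist  # Euclidean distance, Python 3.8+

def average_silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical two-dimensional points forming two well-separated groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette(pts, [0, 0, 1, 1])  # cohesive, separated clusters
bad = average_silhouette(pts, [0, 1, 0, 1])   # clusters mix the two groups
```

A labelling that matches the natural groups yields a value close to 1, while one that mixes them yields a negative value, which is the behavior used to compare algorithms and values of k.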

Association rules
The Apriori Association Rules algorithm aims to find frequent relationships in the datasets, that is, to generate rules of type X → Y, for which X and Y are items that belong to this dataset (AGRAWAL; SRIKANT, 1994). To analyze the possible patterns found in the months with the highest concentration of PM 2.5 , the Apriori Association Rules algorithm was applied to find a subset of frequent parameters related to the database of PM 2.5 .
The Apriori algorithm searches a transactional database for items that are related. For example, in a hypothetical database that records the monthly values of the concentration of air pollutants and the number of hospital visits involving respiratory diseases, the association rules may return {PM 2.5 , PM 10 } → {increase in visits}, indicating that a high concentration of the pollutants PM 2.5 and PM 10 is accompanied, with a degree of certainty, by an increase in hospital visits. This degree of certainty, which measures the relevance and validity of the rules, is provided by two metrics: support and confidence. Given the rule X → Y, the support (or coverage of the rule) represents the percentage of transactions in the database that contain the items of both X and Y, indicating its relevance (CASTRO; FERRARI, 2016). The confidence (or accuracy) of a rule, in turn, corresponds to the proportion of transactions containing the antecedent (term(s) preceding →) in which the consequent (term after the →) is also observed, that is, it is the conditional probability P(Y|X) that, given the antecedent X of the rule, the consequent Y also occurs (MUELLER, 1995). In this study, the Apriori algorithm was implemented in Python, using the "mlxtend" library.
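The two metrics can be computed directly from their definitions, as in the sketch below. It does not use mlxtend and does not reproduce the study's code or data: the function names, item labels, and the four transactions are hypothetical, chosen only to show how support and confidence are obtained for a rule X → Y.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support of the full
    itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions: one set of categorical items per station.
transactions = [
    {"PM2.5_high", "RH_below", "CO_above"},
    {"PM2.5_high", "RH_below", "CO_above"},
    {"PM2.5_high", "RH_below", "CO_below"},
    {"PM2.5_high", "RH_above", "CO_below"},
]

rule_support = support(transactions, {"CO_above", "RH_below"})
conf_forward = confidence(transactions, {"CO_above"}, {"RH_below"})
conf_reverse = confidence(transactions, {"RH_below"}, {"CO_above"})
```

The forward and reverse rules share the same support but differ in confidence, which is why the direction of a rule matters when it is interpreted.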

Results and Discussion
In the experiment to choose the clustering algorithm, the silhouette coefficient was used as the decision criterion, as it is a measure of quality for the entire structure of the partition. It was also used to choose the number of clusters (k), and, for this, 20 different cluster sizes, related to the number of cities, were tested.
After 100 executions of the K-medoids algorithm, applied to the database of monthly averages of PM 2.5 concentration between 2017 and 2019, the average silhouette coefficient found was 0.26, while for the K-means algorithm, the average value was 0.28. Considering that the silhouette value can vary between -1 and 1, K-medoids was selected because it presents a better average silhouette value and is capable of handling outliers. Figure 2 shows the relationship between the silhouette coefficient value corresponding to the number k of clusters. The best value corresponds to k = 2. Thus, the K-medoids algorithm was applied to obtain two clusters from the set of stations in the state of São Paulo, with PM 2.5 monitoring, and the clustering results were subsequently analyzed. As a result of applying the K-medoids algorithm to the data, with a value of k = 2, the stations were divided into Clusters 1 and 2, shown in Table 4.
In the analyzed period, for all the stations monitored, the average annual concentration of PM 2.5 was 16.43 μg/m 3 (standard deviation 6.45 μg/m 3 ). Figure 3 shows that, between 2017 and 2019, higher concentrations of PM 2.5 predominate in Cluster 1 compared to Cluster 2, since the former consists of stations located in the Metropolitan Region of São Paulo (MRSP), as found in other studies (HUANG et al., 2015; AUSTIN et al., 2013). There is also a seasonal trend in the evolution of pollutant concentration and monthly peaks for both clusters in the same periods, suggesting a recurring pattern in the three years. Despite the similarity in seasonal behavior throughout the period, the month of greatest concentration differs from year to year: September in 2017, July in 2018, and June in 2019. In 2017, the peak concentration of the pollutant was lower than the peak in 2018, while in 2019, the PM 2.5 concentration level was below the one observed in previous years.
These cycles may be related to meteorological phenomena that have taken place over the period, which coincide with the data from CETESB's annual reports (CETESB, 2019), were also identified in the literature (BISHT; SEEJA, 2018), and were analyzed with the association rules algorithm (Apriori). Figure 4 was generated for a better assessment of the physical proximity between the stations in the clusters, showing the geographical location of the stations in each cluster. Clusters 1 and 2 were identified by the colors red and blue, respectively, in Figures 4A and 4B.
The analysis on the map shows that most of the PM 2.5 monitoring stations present in Cluster 1 are in the Metropolitan Regions (MR) of São Paulo, Campinas, and Baixada Santista. Except for the Campinas region, which is also influenced by fires, the main source of pollutants in these MRs is fuel burning by the vehicle fleet and intense industrial emissions (CARDOSO et al., 2017;HUANG et al., 2015;YANAGI;ASSUNÇÃO;BARROZO, 2012). The stations with lower concentrations, represented by Cluster 2, are located further inland in the state and are more distant from each other, except for Ibirapuera station, which, despite being located in the city of São Paulo, is located farther from intense traffic routes. Comparing the results obtained, there is a correspondence between the clusters generated and other studies that investigate air pollution by PM 2.5 in the state of São Paulo: Araújo and Rosário (2020) identified from satellite data that the most polluted regions in the state are the MRs of São Paulo, Campinas, and Baixada Santista.
The analysis of the average monthly variation of PM 2.5 concentration in Clusters 1 and 2 indicates differences in pollutant concentrations between the two clusters, as can be seen in the boxplots in Figure 5. However, the interquartile ranges and maximum values (disregarding outliers) are similar. Table 5 shows that, in 2017, the PM 2.5 concentration level increased from May to October, with a peak of about 29.8 μg/m 3 in September. Likewise, in 2018, the increase occurred from March to September, with a peak of 32.4 μg/m 3 in July, indicating an increase in the pollutant that year. The same behavior was repeated in 2019, from April to October, with a peak of 23.7 μg/m 3 in June, but with a reduction in the pollutant concentration.
Studies show that meteorological factors such as TEMP, reduction in RH, and WS can impair the dispersion of PM 2.5 , increasing health-related risks (INPE, 2019;CETESB, 2019). The studies by Santos, Carvalho and Reboita (2016) and Santos et al. (2019) confirm a significant difference between the concentration of PM 2.5 in dry and rainy periods, indicating the association between meteorological parameters and the pollutant.
To assess such a relationship, data of the months with the highest peaks (Figure 3 and Table 5), that is, September 2017, July 2018, and June 2019, were collected from the transactional base (containing the PM 2.5 concentration values for each station and the behavior of the meteorological variables) and submitted to the Apriori association rule algorithm. With that, we tried to find out which factors were more frequent in the three periods and how these meteorological factors were related.
Figure 4 - (A) Visualization by geolocation of the clusters, created by the K-medoids algorithm; (B) proximity of the elements of Cluster 1 on the map. Cluster 1 in red and Cluster 2 in blue.
In the first run of Apriori, using September 2017 data, nine association rules were obtained, seven of which were repeated, that is, rules that had the same meaning. This takes place because the algorithm analyzes all the possibilities between the items. Therefore, the two main rules for this period are shown in Table 6. Support corresponds to the frequency with which the patterns occur throughout the database, indicating the percentage of occurrence of the transactions. Confidence measures the "strength" of rules, that is, it assesses whether transactions that satisfy the antecedent of the rules also satisfy their consequent. The rules that meet the minimum support and confidence are called "strong rules." It can be concluded that, for the peak month of 2017, starting from Rule 1, a high concentration of PM 2.5 , below-average RH, and above-average CO concentration occur together with a frequency of 85%. This rule also informs that, when the concentration of CO is above the average, RH is below the average with a certainty of 100%. For Rule 2, at the peaks of PM 2.5 concentration, the factors that occur together with a 75% frequency are above-average CO and above-average TEMP. Regarding confidence, when CO is above average, temperature is above average with a certainty of 100%.
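The removal of repeated rules mentioned for the September 2017 run can be sketched as follows. The interpretation is an assumption: rules are treated here as repeated when they involve the same set of items and differ only in how those items are split between antecedent and consequent. The function name, the rule tuples, and all numeric values are hypothetical and are not the rules of Table 6.

```python
def drop_equivalent_rules(rules):
    """Keep one rule per distinct itemset (antecedent plus consequent),
    preferring the variant with the highest confidence.
    Each rule is a tuple: (antecedent, consequent, support, confidence)."""
    best = {}
    for rule in rules:
        antecedent, consequent, _support, confidence = rule
        key = frozenset(antecedent) | frozenset(consequent)
        if key not in best or confidence > best[key][3]:
            best[key] = rule
    return list(best.values())

# Hypothetical Apriori output containing two variants of the same itemset.
rules = [
    (("RH_below",), ("CO_above",), 0.80, 0.90),
    (("CO_above",), ("RH_below",), 0.80, 1.00),
    (("CO_above",), ("TEMP_above",), 0.70, 1.00),
]
unique_rules = drop_equivalent_rules(rules)
```

Keeping the highest-confidence variant is one possible selection criterion; ranking by support first would be an equally defensible choice.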
In the second run of Apriori, July 2018 data were used and 44 rules were obtained; the three non-repeated rules with the greatest support and confidence were chosen for analysis (Table 6).
For the high concentration of PM 2.5 in July 2018, Rule 1 identifies the following factors: below-average TEMP and below-average WS occur together with 100% frequency in the database. For Rule 2, the frequency of occurrence of the two factors is 87% and the probability of low WS given the occurrence of below-average RH is 100%. For Rule 3, three factors appear together with a frequency of 87% and 100% confidence, indicating that whenever the temperature becomes predominantly colder, the CO concentration increases and WS is below average, signaling that in colder seasons there is an increase in CO concentration, stimulated by the low dispersion of this pollutant.
In the last Apriori execution, June 2019 data were used and nine rules were obtained, two of which were the most representative (Table 6). The identified rules were similar to the rules of the previous year, with the predominant variables TEMP, RH, and WS below the average. Also, the months of high concentrations tend to be close from one year to the next.
According to the winter report of CETESB (2020), the winter of 2019 presented a predominance of a hot and dry air mass throughout the state of São Paulo, with low ventilation and absence of rains, making it difficult to disperse pollutants, which corroborates the rules obtained for 2019.
Considering that the periods with the highest concentration of PM 2.5 are the ones that present the greatest risk to the population and that meteorological factors have an influence on the increase in pollutant concentration, the rules presented in Table 6 could give warning indications for the increase in pollutant concentration. In Brazil, the studies by César et al. (2016) and Machin and Nascimento (2018) show the influence of the 5 μg/m 3 increase in the concentrations of PM 2.5, resulting in increases between 20 and 38% in the risk of hospitalization due to pulmonary complications.
Thus, we can conclude that when the concentration of PM 2.5 increases, the measurements show the following behaviors: low RH and above-average TEMP. The results also indicate that high concentrations of PM 2.5 may be associated with below average TEMP, milder WS, and below-average RH. We observed an increase in CO, which suggests an association with the behavior of PM 2.5 in the winter months, also reported by Moisan, Herrera and Clements (2018) and Saide et al. (2011).

Conclusions
The analysis of PM 2.5 carried out in this study was done by the application of a clustering algorithm, which divided the values of measurements of PM 2.5 concentrations from 21 monitored stations, distributed over 36 months, between 2017 and 2019.
The experiments showed that the formation of two clusters is the most adequate. The results show that the stations belonging to the identified clusters have specific characteristics that lead to different pollution rates. The municipalities of the MRSP stand out as those with the highest concentration of PM 2.5 , but cities inland, with a predominance of industrial and vehicular emissions, join these municipalities, forming one of the clusters. The stations of the other cluster, installed in less polluted locations, are in cities further inland, far from sources of pollution such as vehicle emissions and industrial processes. Two very characteristic clusters were formed, with variations in pollutant concentration that followed a pattern throughout each year. A seasonal behavior was observed in the temporal study, which is repeated in every period, in both clusters. There is a higher incidence of PM 2.5 in winter, which peaked (September 2017, July 2018, and June 2019) in critical months, when the meteorological variables (TEMP, RH, WS) contribute to the increase in pollutant concentration.
From the clustering results, another algorithm was applied to meteorological data related to September 2017, July 2018, and June 2019, to find associations with the meteorological factors mentioned above in the periods of greatest concentration of PM 2.5 . The results showed that, in September 2017, the predominant meteorological factors were low RH and above average TEMP. In July 2018 and June 2019, the rules showed that below average TEMP and RH and milder WS were the main meteorological factors that occurred during the period with the highest average pollutant concentration. Finally, we also observed a direct relationship between the concentrations of CO and PM 2.5.
The rules found can be useful in creating warning signs for possible increases in the concentration of PM, since the results confirm a relationship between episodes of high concentration and atmospheric conditions in the region, providing subsidies for managing air quality in the state of São Paulo.