
Countering the hazards of PM2.5 pollutant concentration using regression methods and spatio-temporal similarity, and estimating missing values in their time series (case study: Tehran) | ||
مدیریت مخاطرات محیطی (Environmental Hazards Management) | ||
Article 5, Volume 7, Issue 3, Mehr 1399 (September-October 2020), pages 299-312 | ||
Article type: Applied research | ||
Digital Object Identifier (DOI): 10.22059/jhsci.2020.309653.595 | ||
Authors | ||
Marjan Faraji1; Saed Nadi* 2 | ||
1 PhD student in Surveying Engineering (Remote Sensing), Faculty of Civil and Transportation Engineering, University of Isfahan | ||
2 Assistant Professor, Department of Surveying Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan | ||
Abstract | ||
Given the adverse effects of pollutants on the environment and human health, the analysis of air quality data is very important for protecting the environment and addressing air pollution problems. Missing data in time series, particularly in air pollution data, pose a special challenge for analysis and make imputation methods necessary. Missing values reduce the volume of data, alter the temporal patterns in the data, and lead to wrong conclusions in data analysis. In this study, to impute the missing values in the PM2.5 concentration time series from 12 air pollution monitoring stations in Tehran, a hybrid method based on regression imputation is introduced that accounts for the spatial and temporal dependence and similarity between stations by means of the dynamic time warping algorithm. To evaluate the imputation models, data sets with missing values were simulated with a pattern similar to that of the original data, at 10, 15 and 20 percent missingness. The proposed method was then run in combination with different multiple imputation methods (classification and regression trees, random sample, and predictive mean matching), and the results were compared with single imputation methods. The results indicate that the proposed method combined with tree regression outperforms the other multiple and single imputation methods. | ||
Keywords | ||
PM2.5 pollutant; single and multiple imputation; missing data; hazards; DTW similarity measure | ||
Article title [English] | ||
Countering Hazards Caused by PM2.5 Pollutant Concentration Using Regression Methods and Spatio-temporal Similarity to Impute the Missing Values in Their Time Series (Case Study: Tehran) | ||
Authors [English] | ||
Marjan Faraji1; Saed Nadi2 | ||
1PhD student in Remote Sensing, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan | ||
2Assistant Professor, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan, Isfahan | ||
Abstract [English] | ||
Introduction
With the growing industrialization of cities, air pollution has become one of the most serious environmental hazards in the world's largest cities, including Tehran. Because of the adverse effects of pollutants on the environment and human health, the analysis of air quality data plays an important role in protecting the environment and tackling air pollution problems. Over the last decade, pollution monitoring stations in different cities of the country have collected a large volume of air quality data, including the concentrations of pollutants in the atmosphere. For reasons such as calibration, maintenance, device errors, and processing errors, these records contain missing values at various intervals. The missing values cause problems in data analysis and complicate decisions based on the data. Missing data are a common problem in time series, and introducing efficient models and methods for handling them is an effective step towards reducing bias and increasing the power of air pollution models.

Materials and Methods
This paper uses pollutant concentration data recorded at 12 air quality monitoring stations operated by the air quality control company. Data were collected hourly from Dec. 7, 2016 to Feb. 27, 2019 through the air quality control site. The purpose of this paper is to introduce an innovative method that, when imputing the missing values of each monitoring station, exploits the spatial correlation between the time series of stations that behave similarly over time. In the first step, dynamic time warping is used to calculate the spatio-temporal similarity between the pollutant concentration time series of each pair of stations.
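The pairwise similarity step can be illustrated with a minimal Python sketch (the study itself used R). It assumes gap-free numeric series and an absolute-difference local cost, which the abstract does not specify. The station names, the sample values, and the helper names `dtw_distance`, `pairwise_dtw`, and `most_similar` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] also matches b[j-1]
                                 cost[i][j - 1],      # b[j-1] also matches a[i-1]
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


def pairwise_dtw(series_by_station):
    """DTW distance for every unordered pair of stations."""
    names = sorted(series_by_station)
    return {(s, t): dtw_distance(series_by_station[s], series_by_station[t])
            for s, t in combinations(names, 2)}


def most_similar(target, distances, k=2):
    """Names of the k stations with the smallest DTW distance to `target`."""
    ranked = sorted((d, t if s == target else s)
                    for (s, t), d in distances.items() if target in (s, t))
    return [name for _, name in ranked[:k]]


# Hypothetical hourly PM2.5 values for three stations (illustration only).
stations = {
    "A": [10, 12, 15, 14, 11],
    "B": [10, 10, 12, 15, 14, 11],   # same shape as A, delayed by one hour
    "C": [30, 28, 25, 27, 31],
}
distances = pairwise_dtw(stations)
print(most_similar("A", distances, k=1))
```

Because DTW allows one series to be stretched against the other, station B is treated as identical in shape to A despite the one-hour delay, which a point-by-point distance would penalize.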
Then, for imputation at each target station, the dependence on the stations most similar to that station is used. In the second step, the initial complete data set is formed by deleting the missing values at each station. In the third step, new data sets with 10, 15 and 20% missing values are generated with a pattern similar to that of the original missing data. The fourth step involves implementing and comparing different multiple and single imputation algorithms to fill in the missing data. Finally, the performance of the imputation methods is evaluated using the introduced indicators.

Results and Discussion
In this study, the R programming language was used to implement multiple imputation algorithms (predictive mean matching, classification and regression tree (CART), and random sample) and single imputation algorithms (interpolation methods and last observation carried forward). The CART imputation method showed the best performance among the multiple imputation methods, with an R-squared of 0.66 and a correlation coefficient of 0.8 at 10% missing values, an R-squared of 0.6 and a correlation coefficient of 0.76 at 15%, and an R-squared of 0.58 and a correlation coefficient of 0.75 at 20%. As the percentage of missing values increases, the accuracy measured by the evaluation criteria decreases. The predictive mean matching method and the random sample method performed similarly to each other and worse than the tree regression method. Based on all three evaluation criteria, linear interpolation was the best of the single imputation methods and is therefore the most appropriate single method for these data. Spline interpolation showed the weakest performance among all multiple and single imputation methods.
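The evaluation loop described here (mask known values, impute them, score the imputation) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: gaps are placed at random in short runs rather than copied from the real missing pattern, R-squared is taken as 1 - SS_res/SS_tot, and the synthetic series and the names `mask_random_gaps`, `linear_interpolate`, `locf`, and `evaluate` are hypothetical.

```python
import math
import random


def mask_random_gaps(values, fraction, gap_len=3, seed=0):
    """Blank out about `fraction` of the points in short runs (endpoints are kept)."""
    rng = random.Random(seed)
    target = int(round(fraction * len(values)))
    missing = set()
    while len(missing) < target:
        start = rng.randrange(1, len(values) - gap_len)
        for i in range(start, start + gap_len):
            if len(missing) < target:
                missing.add(i)
    return [None if i in missing else v for i, v in enumerate(values)]


def linear_interpolate(values):
    """Fill each gap on a straight line between its observed neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def locf(values):
    """Last observation carried forward."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def evaluate(truth, imputed, masked):
    """RMSE, R-squared and Pearson r on the masked positions only."""
    idx = [i for i, v in enumerate(masked) if v is None]
    y = [truth[i] for i in idx]
    p = [imputed[i] for i in idx]
    my, mp = sum(y) / len(y), sum(p) / len(p)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - my) ** 2 for a in y)
    ss_p = sum((b - mp) ** 2 for b in p)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    return {
        "rmse": math.sqrt(ss_res / len(y)),
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        "r": cov / math.sqrt(ss_tot * ss_p) if ss_tot and ss_p else float("nan"),
    }


# Synthetic series standing in for one station's hourly PM2.5 record.
truth = [30 + 10 * math.sin(i / 5) for i in range(200)]
for fraction in (0.10, 0.15, 0.20):
    masked = mask_random_gaps(truth, fraction)
    for name, method in (("linear", linear_interpolate), ("locf", locf)):
        print(fraction, name, evaluate(truth, method(masked), masked))
```

Scoring only the masked positions keeps the observed points, which every method reproduces exactly, from inflating the metrics.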
At 10% missing data, linear interpolation has a higher coefficient of determination and correlation and a lower error than the tree regression method. However, linear interpolation performs well only when the gaps are short. When the gaps grow longer, for example at 20% missing data, interpolation methods cannot provide a good imputation, because they assign a fixed value, or values with little variation, to all the missing points in each gap.

Conclusion
Missing data in pollutant concentration time series degrade the performance of data analysis with machine learning algorithms and introduce bias. The results show that determining the spatio-temporal similarity of stations with the dynamic time warping algorithm, and using the pattern of similar stations in combination with regression-based methods, improves model performance for long missing intervals, and that the tree regression model is the most suitable method for multiple imputation. Single imputation methods, though fast and simple, depend on the length of the missing interval, and their performance depends on the variable under study. Therefore, single methods are not recommended for air pollution data with long missing intervals. Because other factors such as meteorological parameters affect air pollution, future studies could increase the accuracy of the model by adding these parameters. | ||
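The idea of borrowing the pattern of a similar station can be shown with a deliberately simplified Python sketch. It swaps in ordinary least squares on a single neighbouring station for the tree-based multiple imputation reported above, so it illustrates the principle only. The values and the names `fit_line` and `regression_impute` are hypothetical, and the sketch assumes at least two co-observed hours with differing neighbour values.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx          # assumes the neighbour values are not all equal
    return slope, my - slope * mx


def regression_impute(target, neighbour):
    """Fill gaps in `target` from a similar station's simultaneous readings."""
    # Train only on hours where both stations reported a value.
    pairs = [(n, t) for t, n in zip(target, neighbour)
             if t is not None and n is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = []
    for t, n in zip(target, neighbour):
        if t is not None:
            filled.append(t)                      # keep observed values
        elif n is not None:
            filled.append(slope * n + intercept)  # predict from the neighbour
        else:
            filled.append(None)                   # both missing: leave the gap
    return filled


# Hypothetical readings: the target station tracks its neighbour closely.
neighbour = [10, 20, 30, 40, 50]
target = [25, None, 65, 85, None]
print(regression_impute(target, neighbour))
```

Unlike interpolation, the filled values follow the neighbour's variation inside the gap, which is why this family of methods degrades less as gaps lengthen.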
Keywords [English] | ||
DTW Similarity Criterion, Single and Multiple imputation, PM2.5 concentration, Missing values | ||