Addressing hazards arising from PM2.5 pollutant concentration using regression methods and spatio-temporal similarity, and estimating missing values in their time series (case study: Tehran)

Faraji, Marjan; Nadi, Saed

doi:10.22059/jhsci.2020.309653.595

Environmental Hazards Management
Article 5, Volume 7, Issue 3, Mehr 1399 (September-October 2020), pages 299-312
Article type: Applied research
Digital Object Identifier (DOI): 10.22059/jhsci.2020.309653.595
Authors
Marjan Faraji¹; Saed Nadi*²
¹PhD student in Surveying Engineering (Remote Sensing), Faculty of Civil Engineering and Transportation, University of Isfahan
²Assistant Professor, Department of Surveying Engineering, Faculty of Civil Engineering and Transportation, University of Isfahan
Abstract
Given the adverse effects of pollutants on the environment and human health, the analysis of air quality data is important for protecting the environment and confronting air pollution problems. Missing data in time series, particularly in air pollution data, pose a special challenge for analysis and make imputation methods necessary. Missing values reduce the volume of data, alter the temporal patterns in the data, and lead to incorrect conclusions in the analysis. In this study, to impute the missing values in the PM2.5 concentration time series from 12 air pollution monitoring stations in Tehran, a hybrid method based on regression imputation is introduced that accounts for the spatial and temporal dependence and similarity between stations by means of the dynamic time warping algorithm. To evaluate the imputation models, data sets with missing values following a pattern similar to that of the original data were simulated at 10, 15 and 20 percent missingness. The proposed method was then run in combination with several multiple imputation methods (classification and regression trees, random sample, and predictive mean matching), and the results were compared with single imputation methods. The results show that the proposed method combined with tree regression outperforms the other multiple and single imputation methods.
Keywords
PM2.5 pollutant; single and multiple imputation; missing data; hazards; DTW similarity measure
Title [English]
Hazards caused by the concentration of pollutants PM2.5 by using Regression Methods and Spatial-temporal Similarity in Order to Impute the Missing Values in their Time Series (Case Study of Tehran)
Authors [English]
Marjan Faraji¹; Saed Nadi²
¹PhD Student in Remote Sensing, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan
²Assistant Professor, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan, Isfahan
Abstract [English]
Introduction
With the growing industrialization of cities, air pollution has become one of the serious environmental hazards in the world's largest cities, including Tehran. Because pollutants harm the environment and human health, the analysis of air quality data plays an important role in protecting the environment and tackling air pollution problems. During the last decade, pollution monitoring stations in different cities of the country have collected a large volume of air quality data on the concentration of pollutants in the atmosphere. For reasons such as calibration, maintenance, device errors, and processing errors, these data show missing values at different intervals. The missing values cause problems in data analysis and make decisions based on the data more difficult. Missing data are a common problem in time series, and introducing efficient models and methods for handling them is an effective step towards reducing bias and increasing the power of air pollution models.
Materials and Methods
This paper uses pollutant concentration data recorded at 12 air quality monitoring stations, which are operated by the air quality control company. Data were collected on an hourly basis from Dec. 7, 2016 to Feb. 27, 2019 through the air quality control site. The purpose of this paper is to introduce an innovative method that, when imputing the missing values at each pollution monitoring station, takes into account the spatial correlation with the time series of stations that behave similarly. In the first step, the spatio-temporal similarity between the pollutant concentration time series of the stations is calculated pairwise using dynamic time warping.
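As a rough illustration of the first step, the following is a minimal sketch of pairwise dynamic time warping between station series, written in plain Python. The absolute-difference cost, the helper names `dtw_distance` and `most_similar_stations`, and the station readings are assumptions made for illustration; they are not taken from the paper, whose own implementation is not shown in this abstract.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping with an absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def most_similar_stations(series_by_station, target, k=2):
    """Rank the other stations by DTW distance to the target (smaller is more similar)."""
    ranked = sorted(
        (dtw_distance(series_by_station[target], series), name)
        for name, series in series_by_station.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]


# Hypothetical hourly PM2.5 readings (made-up numbers, not data from the study)
stations = {
    "A": [30.0, 35.0, 50.0, 45.0, 32.0],
    "B": [31.0, 31.0, 36.0, 51.0, 44.0],  # similar shape to A, shifted by one hour
    "C": [80.0, 20.0, 75.0, 15.0, 70.0],  # very different behaviour
}
print(most_similar_stations(stations, "A", k=1))
```

Because DTW allows one series to be stretched against the other, a station whose pollution peaks arrive an hour later can still be ranked as similar, which a point-by-point distance would penalise.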
Then, for imputation at each target station, the dependence on the stations most similar to it is used. In the second step, the initial complete data set is formed by deleting the missing values at each station. In the third step, new data sets with 10, 15 and 20% of missing values are generated with a pattern similar to that of the original missing data. The fourth step involves implementing and comparing different multiple and single imputation algorithms to fill in the missing data. Finally, the performance of the imputation methods is evaluated with the introduced indicators.
Discussion and Results
In this study, the R programming language was used to implement multiple imputation algorithms (predictive mean matching, classification and regression trees (CART), and random sample) and single imputation algorithms (interpolation methods and last observation carried forward). The CART imputation method showed the best performance among the multiple imputation methods, with an R-squared of 0.66 and a correlation coefficient of 0.8 at 10% missing values, 0.6 and 0.76 at 15%, and 0.58 and 0.75 at 20%. As the percentage of missing values increases, accuracy on the evaluation criteria decreases. The predictive mean matching method and the random sample method performed similarly to each other and worse than the tree regression method. Based on all three evaluation criteria, linear interpolation was better than the other single imputation methods, so among the single methods it is the most appropriate for the given data. The spline interpolation method showed the weakest performance among all multiple and single imputation methods.
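The evaluation loop described above (hide known values, impute them, score the result) can be sketched as follows. This is only an illustration under stated assumptions: values are hidden completely at random rather than with the pattern-preserving scheme the study describes, the test series is a synthetic daily cycle, and the function names (`mask_random`, `locf`, `linear_interpolate`) are hypothetical. RMSE and the Pearson correlation stand in for the study's evaluation indicators.

```python
import math
import random


def mask_random(series, fraction, seed=0):
    """Hide about `fraction` of the interior points (assumption: missing at random)."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(series))
    # Endpoints stay observed so every gap has a neighbour on both sides.
    idx = sorted(rng.sample(range(1, len(series) - 1), n_missing))
    masked = list(series)
    for i in idx:
        masked[i] = None
    return masked, idx


def locf(series):
    """Last observation carried forward."""
    out, last = [], None
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out


def linear_interpolate(series):
    """Fill each gap on the straight line between its observed neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")


# Synthetic 10-day hourly series with a daily cycle (not real PM2.5 data)
truth = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
for fraction in (0.10, 0.15, 0.20):
    masked, idx = mask_random(truth, fraction)
    for name, method in (("linear", linear_interpolate), ("LOCF", locf)):
        filled = method(masked)
        t = [truth[i] for i in idx]
        p = [filled[i] for i in idx]
        print(f"{fraction:.0%} {name}: RMSE={rmse(t, p):.2f} r={pearson(t, p):.2f}")
```

Scoring only the hidden positions keeps the observed values from inflating the metrics, which is why `idx` is returned alongside the masked series.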
Compared with the tree regression method, linear interpolation has the highest coefficient of determination and correlation and the lowest error on data with 10% missing values. However, linear interpolation performs well only when the missing intervals are short. When the missing intervals grow, for example at 20% missing values, interpolation methods cannot impute the lost data well, because they assign a fixed value, or a value with little variation, to all the missing values in each interval.
Conclusion
Missing data in pollutant concentration time series degrade the performance of data analysis in machine learning algorithms and cause bias. The results show that determining the spatio-temporal similarity of stations with the dynamic time warping algorithm and using the pattern of similar stations in combination with regression-based methods improves model performance when missing intervals are long, and that the tree regression model is the most suitable method for multiple imputation. Single imputation methods, though fast and simple, depend on the length of the missing interval, and their performance depends on the variable under study. Therefore, single methods are not recommended for air pollution data with long missing intervals. Because other factors such as meteorological parameters affect air pollution, future studies could increase the accuracy of the model by adding these parameters.
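The idea of borrowing the pattern of a similar station can be illustrated with a deliberately simple stand-in: ordinary least squares on one neighbouring station, in place of the tree-based multiple imputation the study found most suitable. The functions `fit_line` and `regression_impute` and the readings below are hypothetical and serve only to show why a regression on a similar station can follow variation inside a long gap, where interpolation would draw a straight line.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def regression_impute(target, neighbour):
    """Fill gaps in `target` from a similar station's simultaneous readings."""
    # Fit only on hours where both stations reported a value.
    pairs = [(n, t) for t, n in zip(target, neighbour)
             if t is not None and n is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = []
    for t, n in zip(target, neighbour):
        if t is not None:
            filled.append(t)                      # keep observed values
        elif n is not None:
            filled.append(intercept + slope * n)  # predict from the neighbour
        else:
            filled.append(None)                   # both missing: leave the gap
    return filled


# Hypothetical readings: the target has a three-hour gap during a pollution peak
neighbour = [20.0, 25.0, 40.0, 70.0, 45.0, 30.0, 22.0]
target = [24.0, 30.0, None, None, None, 36.0, 26.4]
print(regression_impute(target, neighbour))
```

In this toy case the imputed values rise and fall with the neighbour's peak, whereas linear interpolation across the same gap would only ramp from 30.0 to 36.0.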
Keywords [English]
DTW similarity criterion, Single and multiple imputation, PM2.5 concentration, Missing values
