Big Data Papers

In this class, we will focus on analyzing two kinds of big data: 1) provenance data; and 2) OpenXC data.


1.     Paper 1: (DC) Maria Alejandra Rodriguez, Rajkumar Buyya: A Responsive Knapsack-Based Algorithm for Resource Provisioning and Scheduling of Scientific Workflows in Clouds. ICPP 2015:839-848. (the WRPS algorithm) Download. Youtube video

2.     Paper 2: (BDC) Maciej Malawski, Gideon Juve, Ewa Deelman, Jarek Nabrzyski: Cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. SC 2012:22. (CloudSim and a workflow generator are used for the experiments). Download (the DPDS and SPSS algorithms) Youtube video. (Level-based deadline distribution. Objective function: maximizing the number of completed workflows from an ensemble under both budget and deadline constraints. Limitations: all VMs are the same (a homogeneous resource model), so task placement decisions do not impact the runtime of the tasks (including data transfer time); data transfer time is fixed. A very interesting but strong assumption: the workflow priorities are absolute in the sense that completing a workflow with a given priority is more valuable than completing all other workflows in the ensemble with lower priorities combined.)

3.     Paper 3: (OM) Cui Lin, Shiyong Lu: SCPOR: An elastic workflow scheduling algorithm for services computing. SOCA 2011:1-8. Download. (The SCPOR algorithm) Youtube video

4.     Paper 4: (OM) Cui Lin and Shiyong Lu, SHEFT: An Elastic Workflow Scheduling Algorithm for Cloud Computing, Technical Report TR-BIGDATA-12-2011-LL, Department of Computer Science, Wayne State University, May, 2011. Download. (The SHEFT algorithm) Youtube video

5.     Paper 5: (OM) Haluk Topcuoglu, Salim Hariri, Min-You Wu: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst. (TPDS) 13(3):260-274 (2002). Download. (The HEFT algorithm and the CPOP algorithm) CPOP Youtube video

6.     Paper 6: Nabeel Mohamed, Nabanita Maji, Jing Zhang, Nataliya Timoshevskaya, Wu-chun Feng: Aeromancer: A Workflow Manager for Large-Scale MapReduce-Based Scientific Workflows. TrustCom 2014: 739-746 Download. Youtube video

7.     Paper 7: Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen: A data placement strategy in scientific cloud workflows. Future Generation Comp. Syst. (FGCS) 26(8):1200-1214 (2010). Download.

8.     Paper 8: (BDC) Hamid Arabnejad, Jorge G. Barbosa, Radu Prodan: Low-time complexity budget-deadline constrained workflow scheduling on heterogeneous resources. Future Generation Comp. Syst. (FGCS) 55:29-40 (2016). Download. Youtube video (The DBCS algorithm: no optimization; aims to quickly find a feasible solution that satisfies both budget and deadline constraints for a bounded number of heterogeneous resources. Advantage: low planning-time complexity, O(n^2*p))

9.     Paper 9: Jianwu Wang, Daniel Crawl, Ilkay Altintas, Weizhong Li: Big Data Applications Using Workflows for Data Parallel Computing. Computing in Science and Engineering (CSE) 16(4):11-21 (2014). Download.

10.                        Paper 10: (DC) Saeid Abrishami, Mahmoud Naghibzadeh, Dick H. J. Epema: Deadline-constrained workflow scheduling algorithms for Infrastructure as a Service Clouds. Future Generation Comp. Syst. (FGCS) 29(1):158-169 (2013). Download. (the IC-PCP algorithm). Youtube video

11.                        Paper 11: Andrey Kashlev, Shiyong Lu, and Aravind Mohan, "Big Data Workflows: A Reference Architecture and The Dataview System", Services Transactions on Big Data (STBD), 4(1), pp.1-19, 2017. Download.

12.                        Paper 12: Andrew Wylie, Wei Shi, Jean-Pierre Corriveau, Yang Wang: A Scheduling Algorithm for Hadoop MapReduce Workflows with Budget Constraints in the Heterogeneous Cloud. IPDPS Workshops 2016: 1433-1442. Download.

13.                        Paper 13: Artem Chebotko, Andrey Kashlev, Shiyong Lu, "A Big Data Modeling Methodology for Apache Cassandra", IEEE International Congress on Big Data, pp.238-245, New York, USA, 2015. Download. Youtube video

14.                        Paper 14: Xubo Fei, Shiyong Lu, and Cui Lin, "A MapReduce-Enabled Scientific Workflow Composition Framework", IEEE International Conference on Web Services (ICWS), pp.663-670, Los Angeles, CA, 2009 Download.

15.                        Paper 15: Somayeh Kianpisheh, Nasrollah Moghadam Charkari, Mehdi Kargahi: Reliability-driven scheduling of time/cost-constrained grid workflows. Future Generation Comp. Syst. (FGCS) 55:1-16 (2016). Download. (reliability paper)

16.                        Paper 16: Goncalves, Carlos, Luis Assuncao, and Jose C. Cunha. "Data analytics in the cloud with flexible mapreduce workflows." In Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on, pp. 427-434. IEEE, 2012. Download.

17.                        Paper 17: Liu, Li, Miao Zhang, Rajkumar Buyya, and Qi Fan. "Deadline-constrained coevolutionary genetic algorithm for scientific workflow scheduling in cloud computing." Concurrency and Computation: Practice and Experience 29, no. 5 (2017). Download.

18.                        Paper 18: (BC) Zheng, Wei, and Rizos Sakellariou. "Budget-deadline constrained workflow planning for admission control." Journal of Grid Computing 11, no. 4 (2013): 633-651. Download. (The BHEFT algorithm; considers existing workload allocation, L1) Youtube video

19.                        Paper 19: (BC) Arabnejad, Hamid, and Jorge G. Barbosa. "A budget constrained scheduling algorithm for workflow applications." Journal of Grid Computing 12, no. 4 (2014): 665-679. Download. (The HBCS algorithm) (L1: a bounded number of heterogeneous resources). Youtube video

20.                        Paper 20: Prodan, Radu, and Marek Wieczorek. "Bi-criteria scheduling of scientific grid workflows." Automation Science and Engineering, IEEE Transactions on 7, no. 2 (2010): 364-376. Download. (The DCA algorithm)

21.                        Paper 21: (BDC) Yu, Jia, and Rajkumar Buyya. "Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms." Scientific Programming 14, no. 3-4 (2006): 217-230. Download. (The GA algorithm, Evolutionary approaches). Youtube video.

22.                        Paper 22: Tsai, Chun-Wei, and Joel JPC Rodrigues. "Metaheuristic scheduling for cloud: A survey." Systems Journal, IEEE 8, no. 1 (2014): 279-291. Download.

23.                        Paper 23: (BC) Sakellariou, Rizos, Henan Zhao, Eleni Tsiakkouri, and Marios D. Dikaiakos. "Scheduling workflows with budget constraints." In Integrated research in GRID computing, pp. 189-202. Springer US, 2007. Download. (The LOSS1 algorithm) Youtube video (L1: does not consider data transfer cost or data storage cost, so it is not suitable for big data)

24.                        Paper 24: Kunal Agrawal, Anne Benoit, Loic Magnan, Yves Robert: Scheduling algorithms for linear workflow optimization. IPDPS 2010:1-12. Download.

25.                        Paper 25: Singh, Gurmeet, Carl Kesselman, and Ewa Deelman. "A provisioning model and its comparison with best-effort for performance-cost optimization in grids." In Proceedings of the 16th international symposium on High performance distributed computing, pp. 117-126. ACM, 2007. Download. (Evolutionary approaches)

26.                        Paper 26: Talukder, A. K. M., Michael Kirley, and Rajkumar Buyya. "Multiobjective differential evolution for scheduling workflow applications on global Grids." Concurrency and Computation: Practice and Experience 21, no. 13 (2009): 1742-1756. Download. (Evolutionary approaches)

27.                        Paper 27: Yu, Jia, Michael Kirley, and Rajkumar Buyya. "Multi-objective planning for workflow execution on grids." In Proceedings of the 8th IEEE/ACM International conference on Grid Computing, pp. 10-17. IEEE Computer Society, 2007. Download. (Evolutionary approaches)

28.                        Paper 28: De Oliveira, Daniel, Kary ACS Ocana, Fernanda Baiao, and Marta Mattoso. "A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds." Journal of Grid Computing 10, no. 3 (2012): 521-552. Download.

29. Paper 29: Khalifa, Ahmed E., Iman Elghandour, and Nagwa El-Makky. "IncReStore: Incremental computation of mapreduce workflows." In Data Engineering Workshops (ICDEW), 2016 IEEE 32nd International Conference on, pp. 39-46. IEEE, 2016. Download

30.                        Paper 30: Song, Aibo, Zhiang Wu, Xu Ma, and Junzhou Luo. "CAT: A Cost-Aware Translator for SQL-query workflow to MapReduce jobflow." Data and Knowledge Engineering 102 (2016): 42-56. Download.

31.                        Paper 31: Marozzo, Fabrizio, Domenico Talia, and Paolo Trunfio. "A cloud framework for big data analytics workflows on Azure." In Cloud Computing and Big Data, edited by C. Catlett et al., 23 (2013): 182. Download. Youtube Video

32.                        Paper 32: Vahi, Karan, Mats Rynge, Gideon Juve, Rajiv Mayani, and Ewa Deelman. "Rethinking data management for big data scientific workflows." In Big Data, 2013 IEEE International Conference on, pp. 27-35. IEEE, 2013.  Download. Youtube video

33.                        Paper 33: Juve, Gideon, Ewa Deelman, Karan Vahi, Gaurang Mehta, Bruce Berriman, Benjamin P. Berman, and Phil Maechling. "Scientific workflow applications on Amazon EC2." In 2009 5th IEEE International Conference on E-Science Workshops, pp. 59-66. IEEE, 2009. Download. Youtube video

34.                        Paper 34: Kranjc, Janez, Roman Orač, Vid Podpečan, Nada Lavrač, and Marko Robnik-Šikonja. "ClowdFlows: Online workflows for distributed big data mining." Future Generation Computer Systems (2016). Download. Youtube Video

35.                        Paper 35: Perovšek, Matic, Janez Kranjc, Tomaž Erjavec, Bojan Cestnik, and Nada Lavrač. "TextFlows: A visual programming platform for text mining and natural language processing." Science of Computer Programming 121 (2016): 128-152. Download. Youtube Video  

36.                        Paper 36: Rak, Rafal, Andrew Rowley, William Black, and Sophia Ananiadou. "Argo: an integrative, interactive, text mining-based workbench supporting curation." Database 2012 (2012): bas010. Download. Youtube video

37.                        Paper 37: Kano, Yoshinobu, Paul Dobson, Mio Nakanishi, Jun'ichi Tsujii, and Sophia Ananiadou. "Text mining meets workflow: linking U-Compare with Taverna." Bioinformatics 26, no. 19 (2010): 2486-2487. Download. Youtube Video

38.                        Paper 38: Kano, Yoshinobu, Makoto Miwa, K. Bretonnel Cohen, Lawrence E. Hunter, Sophia Ananiadou, and Jun’ichi Tsujii. "U-Compare: A modular NLP workflow construction and evaluation system." IBM Journal of Research and Development 55, no. 3 (2011): 11-1. Download. Youtube Video 

39.                        Paper 39: Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and Trends® in Information Retrieval 2, no. 1–2 (2008): 1-135. Download. Youtube video part 1, Youtube video part 2.

40.                        Paper 40: Felix Schuster, Manuel Costa, Cédric Fournet, Christos Gkantsidis, Marcus Peinado, Gloria Mainar-Ruiz, Mark Russinovich: VC3: Trustworthy Data Analytics in the Cloud Using SGX. IEEE Symposium on Security and Privacy 2015: 38-54. Download. Youtube video

41.                        Paper 41: Lu, Yi, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan J. Brown. "Incremental genetic K-means algorithm and its application in gene expression data analysis." BMC bioinformatics 5, no. 1 (2004): 172. Download.

42.                        Paper 42: (DC) Saeid Abrishami, Mahmoud Naghibzadeh, Dick H. J. Epema: Cost-Driven Scheduling of Grid Workflows Using Partial Critical Paths. IEEE Trans. Parallel Distrib. Syst. 23(8): 1400-1414 (2012). Download. (The PCP algorithm). Youtube video

43.                        Paper 43: Lin, Cui, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. "A task abstraction and mapping approach to the shimming problem in scientific workflows." In IEEE International Conference on Services Computing, pp. 284-291, 2009. Download.

44.                        Paper 44 (Survey): Smanchat, Sucha, and Kanchana Viriyapant. "Taxonomies of workflow scheduling problem and techniques in the cloud." Future Generation Computer Systems 52 (2015): 1-12. Download.
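
Note on Paper 2 (absolute priorities): one way to make the assumption "completing a workflow is worth more than completing all lower-priority workflows combined" concrete is an exponential score, where a completed workflow with priority p (0 = highest) contributes 2^-p. The sketch below assumes that convention and that no two workflows share a priority; the function name ensemble_score and the sample priorities are illustrative only and are not taken from the paper's code.

```python
from fractions import Fraction


def ensemble_score(completed_priorities):
    """Sum 2**-p over the priorities of the completed workflows.

    Assumption: priorities are distinct non-negative integers and 0 is the
    highest priority. Fractions keep the arithmetic exact.
    """
    return sum((Fraction(1, 2 ** p) for p in completed_priorities), Fraction(0))


# Completing only the highest-priority workflow ...
top_only = ensemble_score([0])

# ... versus completing every lower-priority workflow in a 5-workflow ensemble.
all_lower = ensemble_score([1, 2, 3, 4])

print(top_only, all_lower, top_only > all_lower)
```

Because 2^-p is strictly greater than the sum of 2^-q over any set of distinct q > p, the single higher-priority workflow always outweighs the rest, which is why the note under Paper 2 calls this a strong assumption.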

The top 10 data science algorithms

  1. ID3 (8 lectures, 50 mins) | CART
  2. K-means
  3. SVM
  4. Apriori
  5. EM (1) | EM (2)
  6. AdaBoost
  7. kNN
  8. Naive Bayes
  9. CNN
  10. RNN