7COM1018 Practical Text Mining - University of Hertfordshire
30 Nov 2023
Assignment Briefing
Weighting: 40%
Submission deadline (for students): See Canvas
Authorship: Individual
Target date for returning marked coursework: 12th January 2024
Tutor setting the work: Paul Moggridge
Number of hours you are expected to work on this assignment: 40 hours
This Assignment assesses the following module Learning Outcomes (from Definitive Module Document):
2. be able to appreciate the strengths and limitations of various data mining models;
3. be able to critically evaluate, articulate and utilise a range of techniques for designing data mining systems;
5. be able to critically evaluate different algorithms and models of data mining.
Assignment Tasks:
This assignment must be completed using WEKA and the techniques covered within the module.
Datasets for this Assignment: DM23A.zip
Water Samples - this dataset contains 'a' (clean) and 'b' (contaminated) water samples observed by a monitoring station that should raise the alarm in case of contaminated water. (fictitious data)
Video Game Reviews - this dataset contains 'pos' (positive) and 'neg' (negative) reviews of various video games. (source: Amazon)
Task 1 - Support Vector Machine vs. Decision Tree (50 marks)
For this task, use the 'WaterSamples.arff' file included in the data folder - the class label for this dataset is 'class'.
Part A (Optimise LibSVM)
This part may be completed after Unit 5.
Set the randomisation seed to the last four digits of your student ID number, and use a train:test split of 70%. Provide a screenshot of entering your ID number as the seed. (Failure to do this will result in your report being marked 0.)
Using a LibSVM model with the RBF kernel, create a table (see Table 1 in the appendix) of 2 gamma x 2 cost values containing overall accuracy on the test data. Also include the confusion matrix from each of the four results. Explain how these results demonstrate the model's variation with respect to its parameter values. (10 marks)
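For reference, this step can also be scripted with the WEKA Java API rather than the Explorer. The sketch below is illustrative only: it assumes WEKA 3.8 with the LibSVM package installed, a hypothetical file path, a placeholder seed of 1234 and one example (gamma, cost) pair. The same settings are available in the Explorer GUI.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class Task1PartA {
    public static void main(String[] args) throws Exception {
        // Load the water-samples data; the label is the last attribute, 'class'.
        Instances data = DataSource.read("data/WaterSamples.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        int seed = 1234; // placeholder: use the last four digits of your student ID
        Instances randomised = new Instances(data);
        randomised.randomize(new Random(seed));

        // 70% train / 30% test split.
        int trainSize = (int) Math.round(randomised.numInstances() * 0.7);
        Instances train = new Instances(randomised, 0, trainSize);
        Instances test  = new Instances(randomised, trainSize,
                randomised.numInstances() - trainSize);

        // LibSVM with the RBF kernel and one example (gamma, cost) pair from the table.
        LibSVM svm = new LibSVM();
        svm.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_RBF, LibSVM.TAGS_KERNELTYPE));
        svm.setGamma(0.01); // illustrative value
        svm.setCost(1.0);   // illustrative value
        svm.buildClassifier(train);

        // Overall accuracy and confusion matrix on the held-out test set.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(svm, test);
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}
```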
Use a grid-search on an appropriate range to find optimised values for the gamma and cost parameters. Include an explanation of and results from using the grid-search to find the optimal parameters. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (10 marks)
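A grid search can likewise be scripted. The sketch below simply loops over exponential ranges of gamma and cost and scores each pair on the held-out test set, which mirrors the idea behind WEKA's grid-search tooling in the Explorer; the ranges shown are illustrative assumptions, not prescribed values, and the method reuses the seeded 70/30 split from the previous sketch.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;
import weka.core.SelectedTag;

public class Task1PartAGridSearch {
    // train/test: the same seeded 70/30 split built in the previous sketch.
    static void gridSearch(Instances train, Instances test) throws Exception {
        double bestAcc = -1, bestGamma = 0, bestCost = 0;
        // Exponential grid over gamma and cost (assumed, not prescribed, ranges).
        for (int g = -8; g <= 2; g++) {        // gamma = 2^-8 ... 2^2
            for (int c = -2; c <= 8; c++) {    // cost  = 2^-2 ... 2^8
                LibSVM svm = new LibSVM();
                svm.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_RBF, LibSVM.TAGS_KERNELTYPE));
                svm.setGamma(Math.pow(2, g));
                svm.setCost(Math.pow(2, c));
                svm.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(svm, test);
                if (eval.pctCorrect() > bestAcc) {
                    bestAcc = eval.pctCorrect();
                    bestGamma = Math.pow(2, g);
                    bestCost = Math.pow(2, c);
                }
            }
        }
        System.out.printf("Best gamma=%.5f, cost=%.1f, accuracy=%.2f%%%n",
                bestGamma, bestCost, bestAcc);
    }
}
```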
Part B (Optimise J48)
This part may be completed after Unit 6.
Set the randomisation seed to the last four digits of your student ID number, and use a train:test split of 70%. Provide a screenshot of entering your ID number as the seed. (Failure to do this will result in your report being marked 0.)
Using a J48 (decision-tree) model, create a table (see Table 2 in the appendix) of 3 confidence factor values containing overall accuracy on the test data. Also include the confusion matrix from each of the three results. Explain (2-3 sentences) how these results demonstrate the model's variation with respect to its parameter value. (10 marks)
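As in Part A, this step can be scripted for reference. The sketch below assumes the same seeded 70/30 split as before and three illustrative confidence factor values; it is not a prescribed configuration.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class Task1PartB {
    // train/test: the same seeded 70/30 split used for LibSVM in Part A.
    static void sweepConfidenceFactor(Instances train, Instances test) throws Exception {
        float[] confidenceFactors = {0.1f, 0.25f, 0.5f}; // three illustrative values
        for (float cf : confidenceFactors) {
            J48 tree = new J48();
            tree.setConfidenceFactor(cf); // pruning confidence (the -C option)
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.printf("C=%.2f  accuracy=%.2f%%%n", cf, eval.pctCorrect());
            System.out.println(eval.toMatrixString("Confusion matrix"));
        }
    }
}
```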
Use a parameter search over an appropriate range to find an optimised confidence factor value. Include an explanation of, and results from, using the parameter search to find the optimal value. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (10 marks)
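One common way to automate this search in WEKA is the CVParameterSelection meta-classifier; the sketch below shows the idea, with an assumed search range of 0.05 to 0.5 in 10 steps for the -C option. The chosen range is an assumption for illustration.

```java
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;

public class Task1PartBSearch {
    // train: the 70% training partition from the seeded split.
    static void searchConfidenceFactor(Instances train) throws Exception {
        CVParameterSelection search = new CVParameterSelection();
        search.setClassifier(new J48());
        // Try -C values from 0.05 to 0.5 in 10 steps, scored by internal cross-validation.
        search.addCVParameter("C 0.05 0.5 10");
        search.buildClassifier(train);
        System.out.println("Best J48 options: "
                + Utils.joinOptions(search.getBestClassifierOptions()));
    }
}
```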
Part C (Comparing libSVM & J48)
This part may be completed after Unit 6.
Now compare the LibSVM and J48 models:
Perform 5-fold cross-validation with the LibSVM model using the optimal values for cost and gamma you found in Part A. Explain how you have done so and the results you obtained. Next, perform 5-fold cross-validation with the J48 model, using the optimal value for the confidence factor you found in Part B. Explain how you have done so and the results you obtained. Finally, give 2-3 sentences comparing and evaluating the results. Is one of the models better than the other? Explain why you think so. (10 marks)
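A scripted sketch of this comparison is shown below, using WEKA's Evaluation.crossValidateModel with 5 folds. The gamma, cost and confidence factor values are placeholders to be replaced with the optimised values found in Parts A and B, and the seed should again be the last four digits of your student ID.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class Task1PartC {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/WaterSamples.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);
        int seed = 1234; // placeholder: last four digits of your student ID

        // LibSVM with the optimised parameters from Part A (placeholders here).
        LibSVM svm = new LibSVM();
        svm.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_RBF, LibSVM.TAGS_KERNELTYPE));
        svm.setGamma(0.25); // replace with your optimised gamma
        svm.setCost(8.0);   // replace with your optimised cost

        // J48 with the optimised confidence factor from Part B (placeholder).
        J48 tree = new J48();
        tree.setConfidenceFactor(0.2f); // replace with your optimised value

        for (Classifier model : new Classifier[] {svm, tree}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 5, new Random(seed));
            System.out.printf("%s: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
            System.out.println(eval.toMatrixString("Confusion matrix"));
        }
    }
}
```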
Task 2 - Text Mining (50 Marks)
This part may be completed after Unit 7.
For this task, use the folder of text data 'VideoGamesReviews' included in the data folder.
Part A (Preprocess the Data)
Convert the text data into a set of attribute-value pairs. Explain how you have done so and describe the dataset that you have ended up with. Make your own choices with respect to: use of TF/IDF or word counts, use of a stoplist, and use of a stemmer (do NOT use the Snowball stemmer; it has no effect). (10 marks)
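For reference, the conversion can be scripted with TextDirectoryLoader and the StringToWordVector filter. The sketch below shows one possible configuration (TF-IDF weighting, the built-in Rainbow stoplist, the Lovins stemmer, 1000 words to keep), assuming WEKA 3.8 and a hypothetical folder path; your own choices may differ, and the same options are available in the Explorer's filter dialog.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.stemmers.LovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Task2PartA {
    public static void main(String[] args) throws Exception {
        // Each sub-folder ('pos', 'neg') becomes a class value; each file becomes one instance.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("data/VideoGamesReviews")); // hypothetical path
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1); // the class attribute created by the loader

        // Convert the review text into word attributes (one possible configuration).
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setTFTransform(true);               // TF ...
        s2wv.setIDFTransform(true);              // ... and IDF weighting
        s2wv.setLowerCaseTokens(true);
        s2wv.setStopwordsHandler(new Rainbow()); // a built-in stoplist
        s2wv.setStemmer(new LovinsStemmer());    // a stemmer other than Snowball
        s2wv.setWordsToKeep(1000);
        s2wv.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, s2wv);
        System.out.println(vectors.numInstances() + " instances, "
                + vectors.numAttributes() + " attributes");
    }
}
```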
Perform automated attribute selection. Explain how you have done so (including any parameter values you set) and how the dataset has changed. (10 marks)
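One possible scripted equivalent is WEKA's supervised AttributeSelection filter with an information-gain evaluator and the Ranker search, as sketched below; the number of attributes to keep is an illustrative assumption, and other evaluator/search combinations are equally acceptable.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class Task2AttributeSelection {
    // vectors: the word-vector dataset from the previous sketch, with its class set.
    static Instances selectAttributes(Instances vectors) throws Exception {
        AttributeSelection select = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(300); // illustrative: keep the 300 highest-ranked words
        select.setEvaluator(new InfoGainAttributeEval());
        select.setSearch(ranker);
        select.setInputFormat(vectors);
        return Filter.useFilter(vectors, select);
    }
}
```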
Balance the dataset using any technique you like. Explain how you have done so (including any parameter values you set) and describe the resulting dataset. (10 marks)
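As one example among several acceptable techniques, the supervised Resample filter can bias the sample towards a uniform class distribution, as sketched below; SpreadSubsample or SMOTE (if the package is installed) would be equally valid choices, and the parameter values shown are assumptions.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class Task2Balance {
    // selected: the reduced dataset from the attribute-selection step, with its class set.
    static Instances balance(Instances selected) throws Exception {
        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0);  // push the class distribution towards uniform
        resample.setSampleSizePercent(100.0); // keep the overall dataset size unchanged
        resample.setInputFormat(selected);
        return Filter.useFilter(selected, resample);
    }
}
```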
At this point it is recommended to save a copy of the dataset to avoid the need to repeat the above steps should you be interrupted.
Part B (Compare Models)
Optimise any required parameters and perform 5-fold cross-validation on each of NaiveBayes, LibSVM and J48, producing a table of the three overall-accuracy results and a confusion matrix in each case. Explain how you have done so, the models produced and the results you obtained. (10 marks)
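A scripted sketch of this comparison is shown below, assuming the balanced dataset from Part A and a seed equal to the last four digits of your student ID; any LibSVM or J48 parameters you optimised should be set on the classifier objects before the evaluation loop.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.LibSVM;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class Task2PartB {
    // balanced: the preprocessed, attribute-selected, balanced review dataset.
    static void compareModels(Instances balanced, int seed) throws Exception {
        Classifier[] models = {new NaiveBayes(), new LibSVM(), new J48()};
        for (Classifier model : models) {
            // Set any optimised parameters on each classifier before this point.
            Evaluation eval = new Evaluation(balanced);
            eval.crossValidateModel(model, balanced, 5, new Random(seed));
            System.out.printf("%s: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
            System.out.println(eval.toMatrixString("Confusion matrix"));
        }
    }
}
```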
Compare the performance differences among the three algorithms in a concise paragraph (5-6 sentences). Determine which (if any) algorithm is superior and explain why. Consider factors such as accuracy results, confusion matrices, training time, ease of configuration and model properties. Finally visualize your findings using appropriate plots. (10 marks)
Submission Requirements:
A single PDF document containing your report, to a maximum 10 pages. Write your student ID number at the start of your report. Do not include your name or other details, to keep the marking process anonymous.
A reminder that all work should be your own. Reports without screenshots showing the use of your student ID number as a seed will be marked as zero.
Reports exceeding the maximum length may not be marked beyond the 10 page limit.
Apart from the two required screenshots showing your use of your student ID number, no screenshots should be included in your report. Data required from WEKA should be placed into your own tables; other results, such as confusion matrices, can be copy-pasted into your report.
No research beyond what was covered in the module is required, and so your report should not include citations or a reference list.
Marks awarded for:
See rubric.
Type of Feedback to be given for this assignment:
Along with the marks and rubric comments each student will receive individual written feedback.
Additional information:
Regulations governing assessment offences including Plagiarism and Collusion are available from https://www.herts.ac.uk/__data/assets/pdf_file/0007/237625/AS14-Apx3-Academic-Misconduct.pdf (UPR AS14).
Guidance on avoiding plagiarism can be found here: https://herts.instructure.com/courses/61421 (see the Referencing section)
For postgraduate modules: a score of 50% or above represents a pass mark.
Late submission of any item of coursework: for each day or part thereof (or, for hard-copy submission only, each working day or part thereof) up to five days after the published deadline, coursework relating to modules at Level 7 submitted late (including deferred coursework, but with the exception of referred coursework) will have the numeric grade reduced by 10 grade points until or unless the numeric grade reaches 50. Where the numeric grade awarded for the assessment is less than 50, no lateness penalty will be applied.