Question

First, research/gather the data:
1. Choose one Stack Exchange site dealing with topics that you find interesting; see https://stackexchange.com/sites?view=list#traffic for a list. The site cannot be too small, but also avoid selecting any of the largest ones (especially StackOverflow, Mathematics) unless you really want to challenge yourself. As a rule of thumb, let's say that the site must have at least 10,000 questions and 10,000 answers.
This document was originally developed by Dr. Marek Gagolewski. It was subsequently revised by Dr. Yang Li (Kelvin) during the work at the School of Information Technology, Deakin University, for the unit SIT220/731 Data Wrangling, Trimester 1, 2024.
2. Download the site's most recent data dump from https://archive.org/details/stackexchange/.
3. Read the description of all the data tables published at https://meta.stackexchange.com/questions/2677/.
Then, create a single Quarto .qmd file that you will be rendering to a PDF report (how to do that you will have to learn yourself; this is part of this HD-level task), where you perform what follows.
1. Convert all the data tables (Badges, Comments, PostHistory, PostLinks, Posts, Tags, Users, Votes) from XML to CSV, using custom code that you write yourself. Ideally, you should write a Python function that takes a single input file name (.xml) and output file name (.csv) and performs the conversion of a single dataset.
2. Load the CSV files as pandas data frames.
3. Create at least five nontrivial data visualisations and/or tables, at least three of which are based on the extraction of information from text (e.g., tags, keywords, locations, etc.). You must demonstrate that you have learned how to write your own regular expressions (regexes).
4. Draw insightful and interesting conclusions. Do not forget to reflect on the potential data privacy and ethics issues that arise during the data analysis process.
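As a starting point for step 1, the conversion function could be sketched as below. This is a minimal sketch, not a reference solution: it assumes that each dump file stores its records as `<row ... />` elements whose fields are XML attributes, and the function name `xml_to_csv` is a hypothetical choice. Because rows may omit optional fields, the sketch reads the file twice: once to collect the column names, once to write the rows.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_path, csv_path):
    """Convert one dump table from XML to CSV; return the number of rows written.

    Assumption: every record is a <row .../> element whose attributes are the fields.
    """
    # Pass 1: collect the union of attribute names in first-seen order,
    # because optional fields are simply absent from some rows.
    columns = []
    seen = set()
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            for name in elem.attrib:
                if name not in seen:
                    seen.add(name)
                    columns.append(name)
        elem.clear()  # release the element's data to keep memory use low

    # Pass 2: stream the rows into the CSV file; missing fields become "".
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "row":
                writer.writerow(dict(elem.attrib))
                count += 1
            elem.clear()
    return count
```

A call such as `xml_to_csv("Tags.xml", "Tags.csv")` would then produce a file that `pandas.read_csv` can load for step 2. Streaming with `iterparse` rather than parsing the whole tree is a design choice aimed at the larger tables (e.g., PostHistory), which may not fit comfortably in memory.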
This HD-level task is purposely under-defined; you will not be told precisely what to do. Your aim is to generate some interesting insights into data featuring lots of textual information.
In the course of the report preparation, you should apply a wide range of data frame wrangling and text processing techniques. In particular, you must demonstrate that you mastered regular expressions.
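To illustrate the kind of regex-based text extraction meant here, the sketch below pulls individual tags out of a tag string and counts them. It rests on an assumption about the data: that the `Tags` field of the Posts table packs tags either as `<tag1><tag2>` or as `|tag1|tag2|`, depending on the dump version. The names `TAG_RE`, `extract_tags`, and `count_tags` are hypothetical.

```python
import re
from collections import Counter

# One tag = a maximal run of characters that are not delimiters.
# Assumption: tags are delimited by angle brackets or by pipes.
TAG_RE = re.compile(r"[^<>|]+")


def extract_tags(tag_string):
    """Return the list of tags in one tag string (empty list for missing values)."""
    if not isinstance(tag_string, str):
        return []  # e.g., None or NaN for answers, which carry no tags
    return TAG_RE.findall(tag_string)


def count_tags(tag_strings):
    """Count how often each tag occurs across an iterable of tag strings."""
    counts = Counter()
    for s in tag_strings:
        counts.update(extract_tags(s))
    return counts
```

With the Posts table loaded as a data frame, something like `count_tags(posts["Tags"])` would give tag frequencies that can feed a bar chart or a word cloud. Check the actual delimiter in your chosen dump before relying on the pattern.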
Do not use pie charts (as we discussed during the lecture). Go beyond the basic plots that we have covered in this course. Draw at least one map (e.g., of the world) and a word cloud.
