I have raw exports from a WordPress website, which can be shared as a 350MB zip file containing XML files. The XML files contain structured data representing the news stories on our website, including headline, publication date, author, and article content.
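For orientation, here is a minimal sketch of how the four fields might be read from one file. It assumes the files follow the standard WordPress export (WXR) layout, an RSS document with `dc:creator` and `content:encoded` elements. The element names, the `parse_articles` helper and the sample record are illustrative assumptions, not a description of the actual export.

```python
import xml.etree.ElementTree as ET

# Namespaces used by standard WordPress (WXR) exports.
# Assumption: our files follow this layout.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def parse_articles(xml_text):
    """Return one dict per <item> with headline, date, author, content, categories."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "headline": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            "content": item.findtext("content:encoded", default="", namespaces=NS),
            # WXR marks site categories with domain="category" (tags use "post_tag").
            "categories": [
                c.text for c in item.findall("category")
                if c.get("domain") == "category"
            ],
        })
    return articles


# Hypothetical one-article sample, made up for illustration.
SAMPLE = """<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Example headline</title>
      <pubDate>Mon, 01 Jan 2018 12:00:00 +0000</pubDate>
      <dc:creator>jdoe</dc:creator>
      <category domain="category" nicename="news-politics"><![CDATA[News & Politics]]></category>
      <content:encoded><![CDATA[<p>Body text.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

articles = parse_articles(SAMPLE)
print(articles[0]["headline"], "|", articles[0]["author"])
```

The article content in such exports is typically HTML, so some tag stripping would likely be needed before any text analysis.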
We are trying to understand a few different things about the content on our website.
1. What topics do our journalists write about?
2. What topics do our journalists write about with authority?
3. How well does our site taxonomy describe or organize our content?
4. Based on an analysis of the stories, can we do a better job of relating our content?
I am looking for someone to conduct an analysis of our content. The right candidate will be able to:
* Identify 10-20 topics or subjects that are discussed across a number of articles, and determine the topics on which the website has "topic authority" (we can discuss what constitutes a "topic" and what constitutes "authority" before we begin)
* Generate a list of articles that discuss each of these topics
* Analyze topic authority as a function of Author and Content Category of the website (for example, the website has the categories "News & Politics" and "Business & Technology", among others. What topics are covered by those categories?)
You should be capable of providing the source code and outputs in the form of CSV files and plots (generated using the tool of your choice, but ideally R, Python, or another open-source solution).
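To illustrate the kind of CSV output we have in mind, here is a minimal sketch that tabulates articles per topic by author. It assumes each article has already been given a single topic label by whatever method we agree on. The rows, the `topic_counts` and `write_counts_csv` helpers, and the use of a raw article count are hypothetical placeholders, since the definition of "authority" is still to be discussed.

```python
import csv
import io
from collections import Counter


def topic_counts(rows, key):
    """Count articles per (key value, topic) pair, e.g. key="author" or "category"."""
    return Counter((row[key], row["topic"]) for row in rows)


def write_counts_csv(counts, key, out):
    """Write the counts as CSV rows sorted by key value, then topic."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([key, "topic", "articles"])
    for (value, topic), n in sorted(counts.items()):
        writer.writerow([value, topic, n])


# Hypothetical rows: one per article, with an assumed topic label attached.
rows = [
    {"author": "jdoe", "category": "News & Politics", "topic": "elections"},
    {"author": "jdoe", "category": "News & Politics", "topic": "elections"},
    {"author": "asmith", "category": "Business & Technology", "topic": "startups"},
]

buf = io.StringIO()
write_counts_csv(topic_counts(rows, "author"), "author", buf)
print(buf.getvalue())
```

Passing `"category"` instead of `"author"` gives the same table by Content Category. In practice the output would go to a file opened with `newline=""` rather than an in-memory buffer.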