Tuesday, January 11, 2022

Data Analysis

People who have worked with me know that I love data and analyzing data. Back in October, I came across a tool called "Voyant" (https://voyant-tools.org/) that creates word clouds, and I decided to analyze this blog to get an idea of what I type about the most! It took a little bit of "doing" -- I was hoping I could just point it at the blog and it would do its thing, but... nah... it was a little more work than that.

I captured the URLs of all the blog entries for 2021 (I had done January-September when I was playing in October, so I only had October-December left to complete) and put them in a spreadsheet for easy access.

Then I copied the blog URLs one month at a time and pasted them into the Voyant tool, which analyzed what terms were used most frequently in those blog entries.

When I was playing back in October, I found that the text that it grabbed had all the stuff before and after the blog -- so things like the days of the week and month names came up as frequently used terms because each day's entry has the day and month in it. It also provided the text that it analyzed -- in essence, all the text from that month of blog entries -- as flat text. So, I copied that out of the tool, and pasted it into a Word document.

I edited the Word document for each month, taking out the "boiler-plate" from each entry. Again - I had done all of this for January-September back in October, so I only had 3 months yet to do. Then I combined all the monthly edited texts and created a Word doc with just the text from all the blog entries for 2021.
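For anyone who would rather not do that editing by hand, here is a minimal Python sketch of the same cleanup. It assumes the boiler-plate is mostly the date line at the top of each entry and the comment footer at the bottom; the patterns and the function name strip_boilerplate are my own illustration, not anything Voyant provides, and the real captured text may need more patterns than these.

```python
import re

# Assumed boiler-plate: a date header such as "Tuesday, January 11, 2022".
DATE_LINE = re.compile(
    r"^(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "
    r"(January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}$"
)
# Assumed boiler-plate: the comment footer that follows each entry.
FOOTER_LINES = {"No comments:", "Post a Comment"}


def strip_boilerplate(text):
    """Return the text with date headers and comment-footer lines removed."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if DATE_LINE.match(stripped) or stripped in FOOTER_LINES:
            continue  # skip boiler-plate lines
        kept.append(line)
    return "\n".join(kept)


sample = (
    "Tuesday, January 11, 2022\n"
    "Data Analysis\n"
    "I love data.\n"
    "No comments:\n"
    "Post a Comment"
)
print(strip_boilerplate(sample))
```

Only whole lines that match are dropped, so a day or month name mentioned inside an entry's own text is left alone.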

This, then, I ran through Voyant... and here are the results:

Initial word cloud (I don't know how many words it chooses to map in the initial display)

Top 25 terms -- it appears Carl is a star on the blog!

Top 95 terms

Some additional data provided by Voyant:

This corpus has 1 document with 122,832 total words and 8,380 unique word forms.

Vocabulary Density: 0.068

Readability Index: 8.260

Average Words Per Sentence: 22.6

Most frequent words in the corpus: carl (517); got (377); area (284); gracie (254); just (252)
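A "most frequent words" list like that one can be approximated with a few lines of Python. This is only a sketch of the general idea: the way it splits words and the tiny excluded-word list are my own assumptions, and Voyant's own word splitting and its list of common English words are certainly more thorough.

```python
import re
from collections import Counter

# Hypothetical excluded-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "just", "got"}


def top_terms(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent words in text, skipping excluded words."""
    words = re.findall(r"[a-z']+", text.lower())  # crude word splitting
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = "Carl got the part and Gracie just got the part"
print(top_terms(sample, n=2))
```

Passing a different set as stopwords is the equivalent of telling the tool which extra words not to count.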

The word cloud is supposed to show the most commonly occurring word in the largest font and lower-occurring words in smaller and smaller fonts.

There's probably a lot that I could figure out from the information provided by this tool (like, I have not yet found information about what the "Vocabulary Density" and "Readability Index" values within this tool mean). I was reading in the Help text and found that I could specify words to exclude from counting (the tool already does that for common English words like "the", "a", "and"). I have the content captured now, and I may do some more playing with it if I feel so moved!
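On the "Vocabulary Density" question, the numbers Voyant reported are at least consistent with it being the count of unique word forms divided by the total word count. That is my inference from the arithmetic, not something I have confirmed in the tool's documentation. A quick sketch (the function name and the simple word splitting are my own assumptions):

```python
import re


def vocabulary_density(text):
    """Unique word forms divided by total words (assumed definition)."""
    words = re.findall(r"[a-z']+", text.lower())  # crude word splitting
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Figures reported by Voyant for the 2021 corpus.
total_words = 122_832
unique_word_forms = 8_380
reported_density = round(unique_word_forms / total_words, 3)
print(reported_density)
```

A lower value would mean the same words get reused more often. I have not worked out how the "Readability Index" is calculated, so I am leaving that one alone.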

Regarding the part for Gracie (https://journeyinamazinggrace.blogspot.com/2022/01/altered-plans.html, https://journeyinamazinggrace.blogspot.com/2022/01/well.html, https://journeyinamazinggrace.blogspot.com/2022/01/getting-stuff-done.html) - it evidently did not get shipped on Friday, but *did* get shipped on Monday, so it is supposed to arrive today, Tuesday... not sure what time of day, nor whether they would work on installing it today even if it does arrive. Still trusting that God is orchestrating and we will be done here when He wants us done here!
