Simple frequency analysis of a document

December 22, 2018

During winter break for school, I have decided to get a bit of a jump start on learning C since it will be coming up in future classes. Along the way I just wrote this simple little program. During the programming of it, I just downloaded a free txt file from the gutenberg project. Pass it into the program and it will tell the distribution of letters used in the file.

#include <stdio.h>
#include <stdlib.h>

struct character_stats {
	char l;
	long count;
};


int create_data_structure();
float calc_percent(long char_num, long total_chars);

int main () {

	char file_name[180];
	long total_chars = 0;

	char charList[] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q',
		'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'};	

	puts("Document Frequency Analysis in C\n");
	puts("----------------------------------\n");
	puts("Enter a file name (path included is needed): ");
	scanf("%s", file_name);
	puts("----------------------------------\n");
	puts("Generating the list.....\n");
	struct character_stats list[26];	
	if (create_data_structure(charList, list) == 0) {
		puts("List is generated!\n ------------------------\n");
	} else {
		puts("Error generating list. Exiting program....\n");
		return -1;
	}
	printf("Opening file %s\n", file_name);
	FILE *f;
	f = fopen(file_name, "r");
	if (f == NULL) {
		puts("Error. Exiting program!");
		exit(1);
	} else {
		puts("File is opened.\n");
	}
	puts("Now analyzing.............\n");

	char ch;
	while (ch != EOF) {
		ch = fgetc(f);
		for (int i = 0; i < 26;  i++) {
			if (ch == list[i].l) {
				list[i].count++;
				total_chars++;
			} else {
				continue;
			}
		}
	}
	puts("Results:\n");
	printf("Total characters: %li\n", total_chars);
	
	for (int i = 0; i < 26; i++) {
		printf("%c : %li    %8.6f\n", list[i].l, list[i].count, calc_percent(list[i].count, total_chars));
	}
	puts("-------------------\n");

	fclose(f);

	return 0;
}

int create_data_structure (char chars[], struct character_stats list[]) {
	for (int i = 0; i < 26; i++) {
		list[i].l = chars[i];
		list[i].count = 0;
		printf("%c: %lx\n", list[i].l, list[i].count);
	}
	puts("\n");
	return 0;
}

float calc_percent (long char_num, long total_chars) {
	return (float) char_num / total_chars;
}

Here is some sample output where I used a text file of the federalist papers:

Quote

Document Frequency Analysis in C

----------------------------------

Enter a file name (path included is needed):
fed.txt
----------------------------------

Generating the list.....

a: 0
b: 0
0
d: 0
e: 0
f: 0
g: 0
h: 0
i: 0
j: 0
k: 0
l: 0
m: 0
n: 0
o: 0
p: 0
q: 0
r: 0
s: 0
t: 0
u: 0
v: 0
w: 0
x: 0
y: 0
z: 0

List is generated!
------------------------

Opening file fed.txt
File is opened.

Now analyzing.............

Results:

Total characters: 921002
a : 66073    0.071740
b : 14938    0.016219
c : 30213    0.032804
d : 30947    0.033601
e : 124503    0.135182
f : 24903    0.027039
g : 13112    0.014237
h : 48048    0.052169
i : 69592    0.075561
j : 1942    0.002109
k : 2048    0.002224
l : 32611    0.035408
m : 21649    0.023506
n : 67697    0.073504
o : 73688    0.080009
p : 21015    0.022818
q : 1324    0.001438
r : 54993    0.059710
s : 56289    0.061117
t : 97923    0.106322
u : 25771    0.027981
v : 10676    0.011592
w : 13753    0.014933
x : 2621    0.002846
y : 14208    0.015427
z : 465    0.000505
-------------------

December 24, 2018

Nice. I haven't looked at C in a while, so I really enjoyed reading your code and re-familiarizing myself with it. I never went very far with C so your program actually taught me a few things too.

As far as I can tell, your program fails to account for letters that are uppercase, so you might want to change it to:

#include <ctype.h>
//...
if (tolower(ch) == list[i].l) {
	list[i].count++;
	total_chars++;
}

I also believe that the part directly after:

else {
	continue;
}

Is unnecessary because the program would naturally continue the loop regardless. I don't think that this affects run time or anything, it just seemed strange to me.

It might also be worth noting that "Total characters" is only the total number of alphabetical characters, but that was almost certainly your intention. calc_percent() also doesn't calculate the percentage, it calculates the ratio, so it's a bit of an odd function name.

I did have one question about the program. In the case that the program fails to generate the list you terminated it using "return -1" and in the case that it fails to open the file you used "exit(1)". Was there a reason for this difference?

I've also always really liked trying to make programs as fast as possible even if just by entirely negligible amounts, so if you're like me then you could also make the following changes:

In instances when list.l is equal to 'a' your program still checks the next 25 letters of the alphabet unnecessarily, so you could add a break command to terminate the loop early like this:

if (tolower(ch) == list[i].l) {
	list[i].count++;
	total_chars++;
	break;
}

and if you really want to be obnoxious like me, then you could then actually change the initialization of your list variable to this:

char charList[] = {'e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f', 'g', 'y', 'p', 'b', 'v', 'k', 'j', 'x', 'z', 'q'};

because that's the order of how frequently each character appears in English text on average. That would, however, make the formatting uglier because it wouldn't display the results in alphabetical order anymore.

Edited December 24, 2018 by Freak

December 25, 2018

Thanks for the tips. Yea the upper to lowercase, I was trying to do that just using references to the C library and I realized I didn't quiet implement it right, I'll give yours implementation o tolowercase a shot. Organizing the letters by typical frequency should improve run time. However, eventually I was to change this up to where people could use a sort of config file where they put in their own letters/symbols to search for.

I still got a bit to go in learning C.

December 25, 2018

Ohh, I know why you brought up the lower case/upper case. This is an older version of the code. My netbook has the newer but I thought I had newer code synced with github. When I pulled it to my other laptop, I pulled the older code that didn't do the conversion of upper case to lower case

Sign In

Simple frequency analysis of a document

Recommended Posts

WarFox

Link to comment

Share on other sites

Freak

Link to comment

Share on other sites

WarFox

Link to comment

Share on other sites

WarFox

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Browse

Activity