Streaming Media and RTOS: Why Profile the Source code?

Profiling allows you to learn where your program spent its time and which functions called which other functions while it was executing. This information can show you which pieces of your program are slower than you expected, and might be candidates for rewriting to make your program execute faster. It can also tell you which functions are being called more or less often than you expected. This may help you spot bugs that had otherwise been unnoticed.

Since the profiler uses information collected during the actual execution of your program, it can be used on programs that are too large or too complex to analyze by reading the source. However, how your program is run will affect the information that shows up in the profile data. If you don't use some feature of your program while it is being profiled, no profile information will be generated for that feature.

Profiling has several steps:

You must compile and link your program with profiling enabled. See section Compiling a Program for Profiling.
You must execute your program to generate a profile data file. See section Executing the Program to Generate Profile Data.
You must run gprof to analyze the profile data. See section gprof Command Summary.
The next three chapters explain these steps in greater detail.

The result of the analysis is a file containing two tables, the flat profile and the call graph (plus blurbs which briefly explain the contents of these tables).

The flat profile shows how much time your program spent in each function, and how many times that function was called. If you simply want to know which functions burn most of the cycles, it is stated concisely here. See section How to Understand the Flat Profile.

The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where you might try to eliminate function calls that use a lot of time. See section How to Read the Call Graph.

Environment and prerequisites

For the purpose of this article, I am using Ubuntu Server 9.10 in Oracle VirtualBox, with the following VM characteristics:

Virtual hardware: Single CPU, 2,000 MB RAM, I/O APIC enabled; Acceleration: VT-X, Nested Paging, PAE/NX; Virtual hard disk: 8 GB on SATA controller
Software/OS: Linux 2.6.31-14-generic-pae, Ubuntu; GCC 4.4.1 (Ubuntu 4.4.1-4ubuntu9); SAR utility for CPU reporting (sysstat package); HTOP utility
The results may vary in different scenarios. I would recommend running this code in a highly isolated environment, as it can exhaust the CPU in no time. The code is pretty basic and relies only on basic mathematical operations, with a few system calls for the MD5 sum generation. The internal calls of the functions are counted by global counters to verify with the Gprof calls.

A brief explanation about the code: First, create the files to append the hashes and random number. p is the integer assigned; the value of p must be less than 10000. p++ increases the value stored at the p pointer by 1. func1() is called if the value lies within 9600 and 10000; i.e., func1() is called 399 times. func3() is called for the remaining p values, i.e., 9601.

In func1(), j is incremented. When j reaches 66668, it increases the global counter1 to keep track of function calls. In func2(), m is incremented in the for loop. The filegen() function is called thrice. counter2 is incremented to track the filegen() function. Otherwise, it returns to func1().

In filegen(), tempc.txt is created and opened in append mode; 4000 values are added to this file in every filegen() call. Then the tempc.txt file is hashed, and this hash stored in hashes.txt. So for every single call of func2(), three calls of filegen() are made — 3 MD5 hashes are computed for every increment of p. MD5 hashing is used to keep the CPU busy. func3() comprises of the l variable, which is incremented to 50000, and will increment counter3 every time to track the calls. Every function has different time consumptions.

/*
Name :Subodh Pachghare
Author :I am the author!!!
Description :Function calling test program for Gprof
*/
#include <stdio.h>
int counter1=0; /*Global Variable*/
int counter2=0; /*Global Variable*/
int counter3=0; /*Global Variable*/
int main()
{
int p=0;
system("> /root/hashes.txt"); /*File will be overwritten or Created*/
system("> /root/tempc.txt"); /*File will be overwritten or Created*/
printf("Initial value in Main function= %d\n",p);
while(p<10000) /*Condition for while loop*/
{
if(p>9600 && p<10000)
{
func1();
}
else
{
func3();
}
p++;
}
printf("Last value in Main function (value(p) incremented for) = %d\n",p);
printf("Func1(),Func2(),Func3() & filegen() called for %d %d %d %d times\n",counter1,counter1,counter3,counter2);
}
func1()
{
int j=0;
while(j<=66667) /*Another operation to eat CPU cycles*/
j++;
if(j==66668)
{
counter1++;
}
func2(); /*Call to Function func2()*/
}
func2()
{
int m=0;
for (m=1;m<1000000;m++)
{
if(m<4)
{
filegen(); /*Call to Function filegen()*/
counter2++;
}
else
{
return 0; /*return to the func1()*/
}
}
}
filegen()
{
FILE *fo,*fg;
int f=0;
fo=fopen("/root/tempc.txt","a");
while(f<4000)
{
fprintf(fo,"%d\n",f);
f++;
}
fclose(fo);
system("md5sum /root/tempc.txt >> /root/hashes.txt"); /*Unique MD5 hash generation*/
}
func3()
{
int l=0,b=0;
while(l<=50000)
{
b=l;
l++;
}
if(l==50001)
{
counter3++;
}
}
Killing code before completion: Note that this code cannot be interrupted with Ctrl+C due to presence of system call. So if you want to stop execution in between, use kill -9 `pidof a.out` from another terminal to send a SIGKILL signal.
Reporting CPU activity

To report CPU utilisation in user-space and kernel-space while executing the code, use the SAR (system activity report) from the sysstat package, which is my favourite.

sar -u 2 400
Here, -u is for CPU reporting; 2 implies a 2-second interval; and 400 is to run it 400 times.

Surprisingly, the code eats lots of cycles for kernel-space, as compared to user-space.

Compile the code using the -pg option to include the profiler code in the executable, as follows:

gcc -pg gprof_test_code_svp.c
This will generate an a.out executable file. You can specify another output filename with -o. Now run the binary with ./a.out. After executing the code, the following output will be generated:

root@ubuntu:~# gcc -pg gprof_test_code_svp.c
root@ubuntu:~# ./a.out
Initial value in Main function= 0
Last value in Main function (value(p) incremented for) = 10000
func1(),func2(),func3() & filegen() called for 399 399 9601 1197 times
root@ubuntu:~#
Flat-profile generation

After successful execution of the program, it will generate the file gmon.out. Use Gprof on this file to generate the graph:

gprof -p a.out gmon.out

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
97.98 0.97 0.97 9601 101.03 101.03 func3
2.02 0.99 0.02 399 50.13 50.13 func2
0.00 0.99 0.00 1197 0.00 0.00 filegen
0.00 0.99 0.00 399 0.00 50.13 func1
<output snipped>
This generates flat-profile analytics, where time calls are given for comparison. Indications from this profile are:
The program took 1.46 minutes of clock time to complete. From the flat-profile, func3 consumed 0.97 seconds per call and was called 9601 times; around 101 milliseconds are spent in this function for every call to func3. The same values in the self us/call and total us/call field indicate that the complete time was spent on the function’s own operations, and not on the children functions.

In func1, which was the last entry presented in the profile, there is practically no time spent in itself (self us/call) but as it calls func2, most of its time is spent there. Hence, the total us/call field shows 50.13 milliseconds — the same as that of func2.

In func2, the self us/call and total us/call field are the same — the time is spent in doing self operations.

There is practically no time spent in the filegen function according to Gprof, as the md5sum is a system call, which is not presented in Gprof.

So, using this, we can clearly look for bottlenecks present in the code consuming CPU time.

Call-graph profile generation

To generate a call-graph profile of the functions used in the program, use gprof with the q switch:

gprof -q a.out gmon.out
Call graph (explanation follows)…

granularity: each sample hit covers 4 byte(s) for 1.01% of 0.99 seconds

index % time self children called name
<spontaneous>
[1] 100.0 0.00 0.99 main [1]
0.97 0.00 9601/9601 func3 [2]
0.00 0.02 399/399 func1 [3]
-----------------------------------------------
0.97 0.00 9601/9601 main [1]
[2] 98.0 0.97 0.00 9601 func3 [2]
-----------------------------------------------
0.00 0.02 399/399 main [1]
[3] 2.0 0.00 0.02 399 func1 [3]
0.02 0.00 399/399 func2 [4]
-----------------------------------------------
0.02 0.00 399/399 func1 [3]
[4] 2.0 0.02 0.00 399 func2 [4]
0.00 0.00 1197/1197 filegen [5]
-----------------------------------------------
0.00 0.00 1197/1197 func2 [4]
[5] 0.0 0.00 0.00 1197 filegen [5]
-----------------------------------------------

<Output Snipped>
Indications from call-graph are:

The number after the name indicates the index numbers. So we can see that main spends 0.99 seconds in the children waiting for func3 to complete, which is in the children field. This is substantiated by the 0.97 seconds spent per call in the self field of func3. This is again followed up by func1.
Every entry has the function name in the middle of it with the same index number; i.e., main with value 1 in index 1 field. So you can easily see the cycle for it, in Index 1 field kernel operations calling main, then main calling func3 and so on. The same follows for the rest of the index.
In Index 3, func1 is called by main with 399 calls; spent 0.02 time in children process (this is substantiated by 0.02 in func2 call). func3 is directly executed from main.
The middle line in each section is referenced by a unique index number. Above the line indicates originating function for the function given in index number and below the line indicates the process call forwarded to the next function. For example, in index 3 section, the call originated from main() function to func1() and then to func2(). Here func1() has index number 3.
<spontaneous> indicates that the parent function cannot be determined, generally due to system-related APIs.
A cumulative time recording implies that 98 per cent of the total time was spent in performing func3 operations, as per index 2.
Function calls substantiate that the filegen() function is called thrice for a single call of func2(). func1() and func2() are called for the same time; the counter values present in the code prove this. Counters are displayed after successful execution of the code. func1() and func2() functions comprise the same timings for calculations. func3 is called for 9601 times. The call-graph gives the call-tree data.

Source code annotations in profiling

To generate annotations in the source code presented, the source code has to be included in the executable while compiling code, using the g switch:

gcc gprof_test_code_svp.c –g -pg -lc
Now run the code normally, and use the source code highlighting option -A in Gprof to have annotations
marked in the source code. This gives line-by-line profiling, which is easy to understand.

gprof a.out gmon.out -A
*** File /root/gprof_test_code_svp.c:
/*
Name :Subodh Pachghare
Author :I am the author!!!
Description :Function calling test program for gprof
*/
#include <stdio.h>
int counter1=0; /*Global Variable*/
int counter2=0; /*Global Variable*/
int counter3=0; /*Global Variable*/
int main()
##### -> {
int p=0;
system("> /root/hashes.txt"); /*File will be overwritten or Created*/
system("> /root/tempc.txt"); /*File will be overwritten or Created*/
printf("Initial value in Main function= %d\n",p);
while(p<10000) /*Condition for while loop*/
{
if(p>9600 && p<10000)
{
func1();
}
else
{
func3();
}
p++;
}
printf("Last value in Main function (value(p) incremented for) = %d\n",p);
printf("Func1(),Func2(),Func3() & filegen() called for %d %d %d %d times\n",counter1,counter1,counter3,counter2);
}
func1()
399 -> {
int j=0;
while(j<=66667) /*Another operation to eat CPU cycles*/
j++;
if(j==66668)
{
counter1++;
}
func2(); /*Call to Fucntion func2()*/
}
func2()
399 -> {
int m=0;
for (m=1;m<1000000;m++)
{
if(m<4)
{
filegen(); /*Call to Function filegen()*/
counter2++;
}
else
{
return 0; /*return to the func1()*/
}
}
}
filegen()
1197 -> {
FILE *fo,*fg;
int f=0;
fo=fopen("/root/tempc.txt","a");
while(f<4000)
{
fprintf(fo,"%d\n",f);
f++;
}
fclose(fo);
system("md5sum /root/tempc.txt >> /root/hashes.txt");/*Unique MD5 hash generation*/
}
func3()
9601 -> {
int l=0,b=0;
while(l<=50000)
{
b=l;
l++;
}
if(l==50001)
{
counter3++;
}
}

Top 10 Lines:

Line Count

72 9601
59 1197
32 399
43 399

Execution Summary:

5 Executable lines in this file
5 Lines executed
100.00 Percent of the file executed

11596 Total number of line executions
2319.20 Average executions per line
The timing presented is by user-space; so the system calls generally spend far less time in user-space as compared to kernel-space. That’s why we see the 0.00 in the Time collection of Gprof. From the SAR (system activity report) snapshot attached, it can be seen that system space is using around 94 per cent of the CPU processing, as compared to 5-6 per cent of the user space.

That’s it, guys; happy coding! I will be back with some more cool stuff.

References

“Gprof: a Call Graph Execution Profiler” by Susan L Graham, Peter B Kessler, Marshall K McKusick [PDF]

Streaming Media and RTOS

Monday, September 23, 2013

Why Profile the Source code?

No comments:

Post a Comment

Software Systems Architect

Blog Archive