How analyze the system bottleneck using shell tools

on 29.09.2007 06:25:12 by struggle

Hi,
I think the subject has something to do with Linux shells. I have a
Debian Linux system running a BBS. I find that when the number of
online users reaches 4000, the system slows down. I am hoping to find
a way to test what the main reason for this is.
I know the description is too brief to allow concrete answers, and I
understand that finding the real problem is not easy. So could you
please tell me some cool tools for finding the real problem?

Any suggestion will be appreciated and thanks in advance!

Regards!
Bo

Re: How analyze the system bottleneck using shell tools

on 29.09.2007 08:48:38 by Cyrus Kriticos

Bo Yang wrote:
>
> [...] Linux Debian [...] a bbs system running [...] the system will slow
> down [...] test which is the main reason [...] tell me some cool tools
> [...] find the real problem

Visit comp.os.linux.misc

--
Best regards | "The only way to really learn scripting is to write
Cyrus | scripts." -- Advanced Bash-Scripting Guide

Re: How analyze the system bottleneck using shell tools

on 29.09.2007 13:08:57 by Maxwell Lol

Bo Yang writes:

> Hi,
> I think the subject has something to do with Linux shells. I have a
> Linux Debian system in which there is a bbs system running there. And
> I find that when the online users is up to 4000, the system will slow
> down. I am now just hoping to find a way to test which is the main
> reason for this.
> I know, the question description is too simple to have the concrete
> anwsers. And I understand that finding the real problems is not an
> easy thing. So, here, could you please tell me some cool tools used to
> find the real problem?

There are several possible bottlenecks, and tools you can use to examine these issues.

I like "top" in general (for a text terminal) or "gkrellm" for a
graphic overview.

You have to determine if it's I/O, memory, network or CPU.

First, run "uptime" and look at the last three numbers. These are the
load averages over the last 1, 5 and 15 minutes. If the numbers decrease
from left to right, then you have a recent spike in the number of jobs
running. Learn what numbers are typical for your machine, and what
counts as high. Small spikes are okay. If the number of jobs is large,
everything will slow down.
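
A minimal sketch of that comparison, assuming a Linux system where
/proc/loadavg starts with the same three numbers that uptime prints.
The function name load_trend is made up for illustration; it only
compares the 1-minute figure with the 15-minute one.

```shell
# load_trend ONE FIVE FIFTEEN
# Prints "rising" if the 1-minute load is above the 15-minute load,
# "falling" if it is below, and "steady" if the two are equal.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one + 0 > fifteen + 0)      print "rising"
        else if (one + 0 < fifteen + 0) print "falling"
        else                            print "steady"
    }'
}

# Usage on a live system:
#   read -r one five fifteen rest < /proc/loadavg
#   load_trend "$one" "$five" "$fifteen"
```

"rising" corresponds to the spike case described above; "falling" means
the load was higher a few minutes ago than it is now.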

For memory, run

vmstat 10

This will print an update every 10 seconds (the argument).
Look at the CPU columns, and see if the CPU is busy with system or user work, or idle.

If the CPU is busy, then you have to see what is happening. If the
time is high in system, then the CPU is working inside the kernel
rather than in your application's own code. A typical cause is a lot
of virtual memory management, or network stuff.

The "free" memory will always be low. Unix likes to be efficient, and
having memory and not using it is silly. What's more important is
"si" - swap in. If this jumps up, then you may be low on memory:
it means the system has to load pages from disk into memory a lot.
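
A rough sketch of how to watch one vmstat column over time. The helper
name vmstat_avg is hypothetical. It finds the column by its header name
and averages it, skipping the first data row because vmstat's first
report gives averages since boot rather than current activity.

```shell
# vmstat_avg COLUMN
# Reads vmstat output on stdin and prints the average of COLUMN
# (for example si, so, cs or wa), ignoring the first data row.
vmstat_avg() {
    awk -v want="$1" '
        $1 == "r" {                     # header row: locate the wanted column
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            next
        }
        col && $1 ~ /^[0-9]+$/ {        # numeric data row
            if (++rows == 1) next       # first row is the since-boot average
            sum += $col; n++
        }
        END { if (n) printf "%.1f\n", sum / n; else print "no data" }
    '
}

# Usage: seven reports ten seconds apart, i.e. six usable samples:
#   vmstat 10 7 | vmstat_avg si
```

A value of si that stays near zero suggests the machine is not short
of memory, whatever the "free" column says.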

Memory is the cheapest way to upgrade a CPU.
Other values, like "cs" context switch - might also be an issue.

Next, look at the disk values. Drat. I don't see iostat on my linux box.
What's the equivalent?

For network issues, try netstat -s
But network issues are hard to diagnose.

Hope this helps

Re: How analyze the system bottleneck using shell tools

on 29.09.2007 16:57:24 by struggle

Hi Maxwell,

Thank you for your fast reply.

> I like "top" in general (for a text terminal) or "gkrellm" for a
> graphic overview.
>
> You have to determine if it's I/O, memory, network or CPU.
>
> first - do "uptime" and look at the last three numbers. This is the
> load average for 1, 5 and 15 minutes. If the numbers decrease, then
> you have a spike in the number of jobs running. Learn what numbers are
> typical, and what are high. Small spikes are okay. If the number of
> jobs is large, everything will slow down.
>

Ah, my uptime result is load average: 10.25, 12.51, 12.35. I think this
is a very high load average.

>
> For memory, run
>
> vmstat 10
>
> this will update every 10 seconds (the argument).
> Look at the CPU, and see if the CPU is busy with system, user or idle.
>
> If the CPU is busy, then you have to see what is happening. If the
> time is high in system, then it's OS related, and not your
> application. A typical thing is if it has to do a lot of virtual
> memory management, or network stuff.

The CPU state is usr:15, sys:15, idle:50, wa:10. Is this normal?

>
> The "free" memory will always be low. Unix likes to be efficient, and
> having memory and not using it is silly. What's more important is
> "si" - swap in. If this jumps up, then you may be low in memory.
> That says the system has to load pages from disk into memory a lot.

The free memory is very high in my system and the detailed data is:
Free, buffered, cache
2740848 440376 3169856

Is that normal?
>
> Memory is the cheapest way to upgrade a CPU.
> Other values, like "cs" context switch - might also be an issue.

cs is very high too, about 5000. What does this mean?
There were 1500 tasks in total running on the system when the command executed.
>
> Next, look at the disk values. Drat. I don't see iostat on my linux box.
> What's the equivalent?

The I/O statistics are:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.13    1.26   11.02    7.70    0.00   75.89

Device:            tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
cciss/c0d0      160.50      1542.31      1164.92   3794661538   2866144768

Can you explain some of this for me? Thanks!

> For network issues, try netstat -s
> But network issues is hard to diagnose.
>

This command gives a long list of various parameters. Could you please
tell me which ones are the most important? Thanks!

Thanks again for your help!

Regards!
Bo

Re: How analyze the system bottleneck using shell tools

on 29.09.2007 19:03:24 by Loki Harfagr

On Sat, 29 Sep 2007 22:57:24 +0800, Bo Yang wrote:

> Hi Maxwell,
>
> Thank you for you fast reply.
>
>> I like "top" in general (for a text terminal) or "gkrellm" for a
>> graphic overview.
>>
>> You have to determine if it's I/O, memory, network or CPU.
>>
>> first - do "uptime" and look at the last three numbers. This is the
>> load average for 1, 5 and 15 minutes. If the numbers decrease, then you
>> have a spike in the number of jobs running. Learn what numbers are
>> typical, and what are high. Small spikes are okay. If the number of
>> jobs is large, everything will slow down.
>>
>>
> Ah, my uptime result is load average: 10.25, 12.51, 12.35. I think this
> is a very high load average.

That depends on your machine; you didn't mention its SMP
capacity. What does this give:
$ grep -c ^processor /proc/cpuinfo

If the answer is "8" then yes, your machine is a little bit stressed;
if the answer is "1" and the load average stays around 12 over and over
again, then your machine is dying or soon will be...
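
To put a number on that, a small sketch that divides a load average by
the processor count. The function name load_per_cpu is made up here.
As a rough rule of thumb (an assumption, not a hard limit), a sustained
value well above 1 per CPU means jobs are queueing.

```shell
# load_per_cpu LOAD CPUS
# Prints the load average divided by the number of processors.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus + 0 <= 0) { print "cpu count must be positive"; exit 1 }
        printf "%.2f\n", load / cpus
    }'
}

# Usage on a live Linux system:
#   cpus=$(grep -c '^processor' /proc/cpuinfo)
#   read -r one five fifteen rest < /proc/loadavg
#   load_per_cpu "$fifteen" "$cpus"
```

For the 15-minute figure of 12.35 quoted above, this gives 1.54 on an
8-way machine and 12.35 on a single CPU.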

>
>
>> For memory, run
>>
>> vmstat 10
>>
>> this will update every 10 seconds (the argument). Look at the CPU, and
>> see if the CPU is busy with system, user or idle.
>>
>> If the CPU is busy, then you have to see what is happening. If the
>> time is high in system, then it's OS related, and not your application.
>> A typical thing is if it has to do a lot of virtual memory management,
>> or network stuff.
>
> The CPU state is usr:15, sys:15, idel:50, wa:10. Is this normal?

If you checked with the proposed 'vmstat 10' and the "wa:10"
is almost permanent, you are probably facing the beginning of a
disk access stranglehold. If that's permanent, you'd better make plans
for a HW upgrade. If it only shows up in the rare moments when
userland is overcrowded, then you now have a "measure" of what
your present server can manage with its current settings. Some
other IO settings may make a difference, but you know you're not
very far from the edge, so prepare calmly for a HW upgrade.

>
>
>> The "free" memory will always be low. Unix likes to be efficient, and
>> having memory and not using it is silly. What's more important is "si"
>> - swap in. If this jumps up, then you may be low in memory. That says
>> the system has to load pages from disk into memory a lot.
>
> The free memory is very high in my system and the detailed data is:
> Free, buffered, cache
> 2740848 440376 3169856
>
> Is that normal?

Not with a permanent IOwa around 10. You may try putting your temp
files in RAM (tmpfs, shm). That depends on your apps: some may
use specific tempdirs, and you'll win the race just by giving them the
fastest possible IO access.
That's frequently done for SMTP and/or antispam; as for BBS, I just
don't know these apps, that's your call :-)
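
A sketch of what such a tmpfs setup could look like. The temp directory
and the size are pure assumptions (the BBS software's real tempdir is
not known here), and the helper name tmpfs_fstab_entry is made up.
Keep in mind that anything stored on tmpfs is lost at reboot.

```shell
# tmpfs_fstab_entry MOUNTPOINT SIZE
# Prints an /etc/fstab line that mounts a RAM-backed tmpfs of the
# given size on MOUNTPOINT, world-writable with the sticky bit set.
tmpfs_fstab_entry() {
    printf 'tmpfs %s tmpfs size=%s,mode=1777 0 0\n' "$1" "$2"
}

# Example with a hypothetical temp directory and size:
#   tmpfs_fstab_entry /var/spool/bbs/tmp 512m
# A one-off mount as root, without touching /etc/fstab, would be:
#   mount -t tmpfs -o size=512m,mode=1777 tmpfs /var/spool/bbs/tmp
```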

I won't write past this line: some of the rest is beyond my
knowledge and some of it was addressed a few lines above. I'm
confident others here will give you more precise and complete advice :-)

>>
>> Memory is the cheapest way to upgrade a CPU. Other values, like "cs"
>> context switch - might also be an issue.
>
> cs is very high too, about 5000. What does this mean? There are all 1500
> tasks runing on the system when the command exceute.
>>
>> Next, look at the disk values. Drat. I don't see iostat on my linux
>> box. What's the equivalent?
>
> The i/o statistics is:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4.13 1.26 11.02 7.70 0.00 75.89
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> cciss/c0d0 160.50 1542.31 1164.92 3794661538 2866144768
>
> Can you explain some for me? Thanks!
>
>> For network issues, try netstat -s
>> But network issues is hard to diagnose.
>>
>>
> This command give a long list of various parameters, could you please
> tell me which one is the most important one? Thanks!
>
> Thanks again for your help!
>
> Regards!
> Bo

--
have space suit : "VMSBUX:B0N1@GOHH.GO"
will travel : tr "MLKJHGFDSQNBVCXWPOIUYTREZA" "a-z"

Re: How analyze the system bottleneck using shell tools

on 30.09.2007 03:02:26 by Maxwell Lol

Bo Yang writes:

> The CPU state is usr:15, sys:15, idel:50, wa:10. Is this normal?

Well, the CPU is 50% idle. So you are not limited by CPU horsepower.
The 10% wa looks like time spent waiting for disk I/O.

On large systems, you can have several disks, and several disk
controllers. Both can cause bottlenecks.

Sometimes you can move the I/O onto two disks, so that a single disk
isn't a bottleneck. For instance, you can place swap on one disk, and a
database on a second. Or if you use a RAID, you can stripe a partition
across several disks to increase speed.

If your database is the problem, make sure the disks used for the
database are not used for other things like swap, logging, etc.

> The free memory is very high in my system and the detailed data is:
> Free, buffered, cache
> 2740848 440376 3169856

Looks like you have lots of memory. 6 GB? Is that right?
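
The arithmetic behind that estimate, as a small sketch. It assumes the
three figures are in kilobytes of 1024 bytes, which is vmstat's default
unit; the function name kb_sum_gib is made up.

```shell
# kb_sum_gib KB...
# Adds up any number of kilobyte figures and prints the total in GiB.
kb_sum_gib() {
    printf '%s\n' "$@" |
        awk '{ total += $1 } END { printf "%.2f\n", total / (1024 * 1024) }'
}

# free + buffered + cache from the figures quoted above:
#   kb_sum_gib 2740848 440376 3169856
```

For the quoted figures the total is about 6.06 GiB, of which roughly
2.6 GiB is completely free and the rest is buffers and cache.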

>
> Is that normal?

Well, normally the memory will fill up with a cache of the file
system. This isn't happening in your case. If the system had been
rebooted recently, I'd expect numbers like this. If it's been up for
days, I'd expect it to be more used and less free. If it is always
free, then I think this is very strange. With that much memory, I'd
expect most of the disk data in use to be loaded into memory, and
your disk I/O to drop.

> > Other values, like "cs" context switch - might also be an issue.
>
> cs is very high too, about 5000. What does this mean?

I can't say. You need to learn what is average for your computer.
Once you know what is typical, you can see what is high.
I had a context switch benchmark once. I think it was from lmbench.

http://www.bitmover.com/lmbench/
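
Short of running a benchmark, the current context-switch rate can be
sampled directly. This sketch assumes a Linux /proc/stat, whose "ctxt"
line holds the total number of context switches since boot; the helper
names ctxt_total and rate_per_sec are made up.

```shell
# ctxt_total [FILE]
# Prints the cumulative context-switch counter (default: /proc/stat).
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# rate_per_sec BEFORE AFTER SECONDS
# Prints the average increase per second between two counter readings.
rate_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.0f\n", (b - a) / s }'
}

# Usage: average context switches per second over ten seconds:
#   before=$(ctxt_total); sleep 10; after=$(ctxt_total)
#   rate_per_sec "$before" "$after" 10
```

Sampling this at quiet and busy times of day is one way to learn what
is typical for the machine.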

> The i/o statistics is:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4.13 1.26 11.02 7.70 0.00 75.89
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> cciss/c0d0 160.50 1542.31 1164.92 3794661538 2866144768
>
> Can you explain some for me? Thanks!

tps is transfers per second: the number of I/O requests issued to the
device each second. "Blk_read/s" is the number of blocks read per
second.

I'm not sure what these numbers mean as far as performance goes. It
helps to know what the max values are. (It's been 15 years since I was
a sys admin, and I never used a RAID).
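
One way to make the iostat figures more readable is to convert them to
KiB. This sketch assumes a block is 512 bytes, which is how sysstat's
iostat counts blocks on 2.4 and later kernels; the function names are
hypothetical.

```shell
# blocks_to_kib BLOCKS_PER_SEC
# Converts a rate in 512-byte blocks per second to KiB per second.
blocks_to_kib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 512 / 1024 }'
}

# avg_request_kib TPS BLK_READ_PER_SEC BLK_WRTN_PER_SEC
# Prints the average size of one transfer in KiB.
avg_request_kib() {
    awk -v tps="$1" -v r="$2" -v w="$3" 'BEGIN {
        if (tps + 0 <= 0) { print "0.0"; exit }
        printf "%.1f\n", (r + w) * 512 / 1024 / tps
    }'
}

# With the figures quoted above:
#   blocks_to_kib 1542.31
#   blocks_to_kib 1164.92
#   avg_request_kib 160.50 1542.31 1164.92
```

For the quoted line this works out to about 771 KiB/s read and 582
KiB/s written, at roughly 8.4 KiB per transfer. Note that when iostat
is run without an interval, these are averages since boot, not a
picture of the busy period.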

I used to run some disk benchmarks to find out the max rates. A long
time ago I used a program called "bonnie". I ran it several times,
increasing the size of the benchmark file until the limit was reached.
That gave the max values of the disk.

It measured byte read/write, block read/write and seek times. Searching
now, I see bonnie 2.0.6 and bonnie++ available in source form
here:

http://www.acnc.com/benchmarks.html

>
> > For network issues, try netstat -s
> > But network issues is hard to diagnose.
> >
>
> This command give a long list of various parameters, could you please
> tell me which one is the most important one? Thanks!

I can't say. It's been a while since I have done this.

If your bandwidth is limited, that may be the issue.
And netstat -s won't show that.

These sorts of measurements are more valuable when you run them at an
interval, and you look at the difference between one measurement and
the next. It's probably not a network error, unless you are under a
denial of service attack.

You may want a cron job to collect some data once in a while so you
learn what is normal for your computer.
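
A minimal sketch of such a collector, assuming a Linux /proc/loadavg.
The function name log_snapshot, the script path and the log path are
all made up for illustration.

```shell
# log_snapshot [LOADAVG_FILE]
# Prints one timestamped line with the 1, 5 and 15 minute load averages
# (default source: /proc/loadavg).
log_snapshot() {
    read -r one five fifteen rest < "${1:-/proc/loadavg}"
    printf '%s load=%s,%s,%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$one" "$five" "$fifteen"
}

# Hypothetical crontab entry, assuming this function is saved in
# /usr/local/lib/snapshot.sh; it appends one line every five minutes:
#   */5 * * * * . /usr/local/lib/snapshot.sh && log_snapshot >> /var/log/load-history.log
```

After a week or so the log shows what the quiet and busy hours look
like, which gives a baseline to compare the slow periods against.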

More importantly - what do your users think?
If they do not see a problem, then fine.
Don't cause a problem when there is none.

If they, however, are complaining about performance, then it's
serious.

Re: How analyze the system bottleneck using shell tools

on 30.09.2007 03:53:52 by struggle

>
> If your database is the problem, make sure the disks used for the
> database are not used for other things like swap, logging, etc.

Yes, the BBS does a lot of file I/O. It stores all user articles in
text files instead of a database, and I think this may be the root
of the I/O problem.

>
>> The free memory is very high in my system and the detailed data is:
>> Free, buffered, cache
>> 2740848 440376 3169856
>
> Looks like you have lots of memory. 6 GB? Is that right?

Yes, I have 8G indeed.

>
>> Is that normal?
>
>
> Well, normally the memory will fill up with a cache of the file
> system. This isn't happening in your case. If the system was
> rebooted recently, I'd expect numbers like this. If it's been up for
> days, i'd expect it to be more used and less free. If it is always
> free, then I think this is very strange. All of the disk would be
> loaded into memory, and your disk I/O would drop.

I don't understand why the free memory is so high after the system has
been up for a month.

>>> Other values, like "cs" context switch - might also be an issue.
>> cs is very high too, about 5000. What does this mean?
>
> I can't say. You need to learn what is average for your computer.
> Once you know what is typical, you can see what is high.
> I had a contexct switch benchmark once. I think it was from the lmbench.
>
> http://www.bitmover.com/lmbench/

So, 5000 means 5000 context switches per second. I think that
means that the applications on my system are not CPU-intensive;
they depend more on network and I/O.
I don't understand what you mean by "the average for my computer".
I just want to know what cs value is normal for my hardware. You mean
the benchmark tool can give the answer, right? Could you please
explain more? Thanks!

>> The i/o statistics is:
>> avg-cpu: %user %nice %system %iowait %steal %idle
>> 4.13 1.26 11.02 7.70 0.00 75.89
>>
>> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
>> cciss/c0d0 160.50 1542.31 1164.92 3794661538 2866144768
>>
>> Can you explain some for me? Thanks!
>
> tps is probably transactions per second. The "Blk_read/s" is block
> read per second.
>
> I'm not sure what these numbers mean as far a performance. It helps to
> know what the max values are. (It's been 15 years since I was a sys
> admin, and I never used a RAID).

You mean my machine uses RAID? I am a newcomer to this machine and I am
not familiar with RAID either.

> I used to run some disk benchmarks to find out the max rates. A long
> time ago I used a program called "bonnie" I ran it several times,
> increasing the size of the benchmark file until the limit was reached.
> That gave the max values of the disk.
>
> It measured byte read/write, block read/write and seak times. As I
> search, I see bonnie 2.0.6 and bonnie++ available in source form.
> here:
>
> http://www.acnc.com/benchmarks.html
>
>>> For network issues, try netstat -s
>>> But network issues is hard to diagnose.
>>>
>> This command give a long list of various parameters, could you please
>> tell me which one is the most important one? Thanks!
>
>
> I can't say. It's been a while since I have done this.
>
> If your bandwidth is limited, that may be the issue.
> And netstat -s won't show that.
>
> These sorts of measurements are more valuable when you run them at an
> interval, and you look at the difference between one measurement and
> the next. It's probably not a network error, unless you are under a
> denial of service attack.
>
> You may want a cron job to collect some data once in a while so you
> learn what is normal for your computer.
>
>
> More importantly - what do your users think?
> If they do not see a problem, then fine.
> Don't cause a problem when there is none.
>
> If they, however, are complaining about performance, then it's
> serious.

Besides the above, I have an additional question. Could you please
explain how Linux measures CPU time? I mean, what is counted as sys
time and what is counted as wa time? Thanks!

Thank you for your reply. It gave me very valuable information and
instructions. Thanks again for your help!

Regards!
Bo

Re: How analyze the system bottleneck using shell tools

on 30.09.2007 14:09:12 by Maxwell Lol

Bo Yang writes:

> > Well, normally the memory will fill up with a cache of the file
> > system. This isn't happening in your case. If the system was
> > rebooted recently, I'd expect numbers like this. If it's been up for
> > days, i'd expect it to be more used and less free. If it is always
> > free, then I think this is very strange. All of the disk would be
> > loaded into memory, and your disk I/O would drop.
>
> I don't understand why the free memory is so high after the system has
> started for a month.

I'm not sure either. Perhaps every time an entry is modified, it's
written to disk (the database) and the memory is freed.
As the other poster suggested, perhaps your system is I/O limited.

> So, 5000 means context switch 5000 times per second. And I think that
> means that my system application are all not CPU-intensive
> application. They need network and I/O more.

I think so.

> And I don't understand what do you mean by saying that the average for
> my computer. I just want to know, for my computer, what cs is the
> normal based on my hardware. You mean the benchmark tool can give the
> answer, right? Could you please explain more for me? Thanks!

Each system has limits in I/O, CPU, disk speeds, etc. The benchmark
programs will help you find what the maximum limit is.

If, for instance, your system has a maximum limit of 10,000 context
switches a second, and your system is at 5000 - you are not near the
limit.

On the other hand if the benchmark shows your maximum limit is 5500,
then 5000 is about as fast as it can go, and is the indication of the
problem.
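
The comparison can be written down as a one-line calculation. The
function name percent_of_limit is made up, and the limits used here are
the made-up figures from the example above, not measured values.

```shell
# percent_of_limit CURRENT MAX
# Prints CURRENT as a whole-number percentage of the benchmarked MAX.
percent_of_limit() {
    awk -v cur="$1" -v max="$2" 'BEGIN {
        if (max + 0 <= 0) { print "limit must be positive"; exit 1 }
        printf "%.0f\n", cur * 100 / max
    }'
}

# The two cases from the text:
#   percent_of_limit 5000 10000
#   percent_of_limit 5000 5500
```

The first case comes out at 50 percent of the limit, the second at
about 91 percent, which is the situation that would point to a problem.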

The trouble is - a good benchmark is done when the system is
completely idle. If your server is busy all the time, then the
benchmark will slow down the experience to the users.

If you can run it at a time when the normal load is small, and the
benchmark only takes a few seconds, you could try that. If the
benchmark takes an hour, that might be unacceptable to your users.

> You mean my machine use RAID. I am a new comer for this machine and I
> am not familiar with RAID. too.

A RAID is a Redundant Array of Inexpensive Disks.
The disk controller needs to be RAID compatible.

I'm not an expert, but there are different types of RAID systems. One
way to configure it is to do disk striping. Instead of putting a
partition on a single disk, you put it on two disks - making it one
virtual disk. If one disk has a limit of 5000 blocks per second, by
using 2 disks in parallel, you can increase this to twice that limit.

The down side is that if you lose either disk, the partition fails.

A RAID is one way to speed up I/O. As I said, you can also get a
faster disk, or use a second disk controller. You can also split up your
I/O onto different disks.

> Beside above, I have a additional question. Could you please explain
> how Linux measure the CPU time. I mean what will be count as sys time
> and what will be count as wa time? thanks!

A program runs in user space normally. A program that just did
calculations would be 100% in user space.

When a program needs the resources of the system, it places a call
using one of the functions listed in section 2 of the manual pages.
In other words, the user asks the system to do something. Examples include:
Execute a program
Do disk I/O
Network calls
Communicate with a device driver (terminal, display, printer, mouse)

Sometimes the user makes a request and the system has to do something
to fulfill that request. If the program needs more memory, the user
program is halted, and the system pages in the virtual memory, and the
program continues.

Many times, when the user requests disk or network resources, the
request is initiated, and the user program sleeps until the I/O is
finished.

There is also a clock that runs and gives other programs a chance to
run. So perhaps 1000 times a second, the system wakes up. If a
program is waiting for the disk, the system wakes up the program and
lets it get the results.

So the user time versus system time indicates if the CPU is busy doing
something for the user, or for the OS. I don't know what is normal,
but if the system time is high, and the user time is low, that's
indication of the system struggling. Usually the user time is high.
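
On Linux these counters are exposed in the "cpu" line of /proc/stat,
in the order user, nice, system, idle, iowait, irq, softirq, steal.
The sketch below takes two readings of that line and prints the share
of each category in between. The function name cpu_breakdown is made
up, and folding irq and softirq into "system" is a choice made for this
sketch.

```shell
# cpu_breakdown "BEFORE_LINE" "AFTER_LINE"
# Each argument is an aggregate "cpu ..." line from /proc/stat.
# Prints the percentage of time spent in each state between the two.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - before[i]; total += d[i] }
            if (total <= 0) { print "no ticks elapsed"; exit 1 }
            printf "user=%.1f nice=%.1f system=%.1f idle=%.1f iowait=%.1f steal=%.1f\n",
                d[2] * 100 / total, d[3] * 100 / total,
                (d[4] + d[7] + d[8]) * 100 / total,
                d[5] * 100 / total, d[6] * 100 / total, d[9] * 100 / total
        }'
}

# Usage: two readings ten seconds apart:
#   before=$(grep '^cpu ' /proc/stat); sleep 10; after=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$before" "$after"
```

Time counted as iowait is time the CPU was idle while at least one
task was waiting for I/O to complete, which is why it is reported
separately from both system and idle.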

In older unix systems, the total time was user + system + idle. Your
system also has other values:

>> avg-cpu: %user %nice %system %iowait %steal %idle
>> 4.13 1.26 11.02 7.70 0.00 75.89

Nice is a term for low priority tasks. If you have a task that you
want done, but don't want it to interfere with other programs, you
"nice" it.

I don't know what "steal" is.

You have 75% idle time.
You have more system time than user time.
The user time is only 4%. If this is when the performance is slow, that's bad.

This would indicate that the OS is trying to get the jobs done, but it
has to do a lot of internal bookkeeping. It's struggling (more system than user).
But it's not CPU bound, because you have 75% idle.
The 7.7% IOWAIT suggests the problem is disk related.
The other poster indicated this, and he has more recent knowledge than I.

I think that's your problem.

Re: How analyze the system bottleneck using shell tools

on 30.09.2007 22:32:24 by Michael Heiming

In comp.unix.shell Maxwell Lol wrote:
> Bo Yang writes:
[..]

> You have 75% idle time.
> You have more system time that user time.
> The user time is only 4%. If this is when the performance is slow, that's bad.

> This would indicate that the OS is trying to get the jobs done, but it
> has to do a lot of internal bookkeeping. It's struggling (more system than user).
> But it's not CPU bound, because you have 75% idle.
> The 7.7% IOWAIT suggests the problem is disk related.
> The other posted indicated this, and he has more recent knowledge than I.

> I think that's your problem.

Seconded. The OP seems to be running some kind of HP ProLiant server.
I'd try to get the most out of the FS in use by turning off ACLs,
SELinux and similar fancy stuff. Mount options such as "noatime"
might improve things at least slightly. What are the underlying FS
and the disk/array type?
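
A quick way to see which disk filesystems are still mounted without
noatime, sketched under the assumption of a Linux /proc/mounts. The
function name mounts_without_noatime is made up.

```shell
# mounts_without_noatime [MOUNTS_FILE]
# Prints the mount point of every /dev-backed filesystem whose option
# list does not contain noatime (default source: /proc/mounts).
mounts_without_noatime() {
    awk '$1 ~ /^\/dev\// && ("," $4 ",") !~ /,noatime,/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Usage:
#   mounts_without_noatime
# Remounting one of the listed mount points as root would look like:
#   mount -o remount,noatime /path/to/mountpoint
```

To keep the option across reboots it also has to be added to the
corresponding line in /etc/fstab.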

--
Michael Heiming (X-PGP-Sig > GPG-Key ID: EDD27B94)
mail: echo zvpunry@urvzvat.qr | perl -pe 'y/a-z/n-za-m/'
#bofh excuse 390: Increased sunspot activity.