Lag Busting: How to Make Your Server Run More Smoothly

From NWN Lexicon

Introduction

One of the things all server administrators have in common is the desire to keep their servers running as smoothly as possible. They spend a lot of time working on their servers, and they don't want their players' enjoyment of their work interfered with. One of the most common interferences is "lag". Used loosely, lag is a hiccup in game play, caused by any number of things, ranging in duration from barely noticeable to a minute or more. This tutorial is on the things you can do within the toolset to keep lag to a minimum. It will NOT cover topics like defragging your server's hard drive, optimizing your databases, or any other general computing techniques with application beyond the narrow realm of Neverwinter Nights.

Technical terms used in this tutorial

There are some technical terms or commands used in this tutorial that are used to assist in troubleshooting network problems. While it is not within the scope of this tutorial to teach the reader how to use them, brief definitions follow.

Ping
Ping is a computer network tool used to test whether a particular host is reachable across an Internet Protocol (IP) based network. Ping measures the round-trip time and records any packet loss. Using the "ping" command is often known as "pinging".
Traceroute
Traceroute is a computer network tool used to determine the route taken by packets across an IP network. The traceroute tool is available on most operating systems. Under Windows the command is "tracert" and is available from the Windows command line. In this tutorial, the term "tracert" will mean the Traceroute command.

Lag - What is it?

More properly, lag refers to a delay in information exchanged between client and server, resulting in a hang which breaks immersion and can result in loss of control of one's character, with often unpleasant consequences. Players, however, tend to use the term "lag" much more broadly, to refer to anything that causes a hang or break in play experience. Decoding their meaning is critical to understanding the problem that they're experiencing, and to fixing it, if indeed there's anything you can do - sometimes there isn't. So, for the remainder of this tutorial, we're going to use the term "lag" in this broader sense, and label actual lag "connection lag". Before we can discuss ways to prevent lag, we need to familiarize ourselves with the various types of issues that can give rise to interference with game play. Below is a rough listing.

Connection Lag
Connection lag, also called network lag, arises from a problem with the connection between the player's computer and the server. It can have a number of causes, including active programs on the player's computer, the player's router, the server's router or the server's active programs, or somewhere in between. Connection lag can be detected by pinging the server, and checking ping times. A trace route (the tracert command on the Windows command line) will show roughly where the problem lies. Often there will be nothing you can do about this sort of lag, other than to assist the player in troubleshooting their system, or waiting for network issues to be resolved. If a ping results in an unusually high number, or the tracert fails at a certain hop, the problem is connection lag.
Graphics Lag
This is one of the most common types of lag, and the one most often mistaken by players for network lag, or for some other sort of problem external to their system. Graphics "lag" occurs when the player's computer is overwhelmed with the graphical data it is getting, and fails to render graphics smoothly, resulting in poor frame rate, lockups, and occasionally more exotic issues. It is LARGELY a client-side issue, and the player will need to take steps to fix it, such as changing their graphics settings or getting a different graphics card. There are, however, some things that a server administrator can and should do to prevent this sort of thing, which we will discuss below. If the "lag" a player experiences is intermittent and coincides with times when a lot is happening on their screen, and other players on the server do not experience it when they do, the problem is likely graphics lag. Graphics lag can often also be detected by having the player hit the tilde (~) key, type "fps", and hit enter while playing, which displays the Frames Per Second they are seeing. The higher the number, the better the framerate; the lower, the choppier ("laggier") things will appear.
Server Lag
This is the final type of "lag", and the one you have the most direct control over. It arises when the server is trying to do too much at once. The game engine begins to run on the hairy edge, and it stops doing certain things, based on priorities in the engine. This often is caused by a lot of players on a server, poorly written scripts, poorly built areas, or some combination of the above. Other times, there may be some technical issue at work, like a crippled game server, an out-of-control process, or insufficient RAM. Server lag is often the trickiest to detect, and can be diagnosed by ruling out both connection and graphics lag. In more extreme cases, however, it is not at all hard to detect, as low-priority processes cease. These include the updating of the game clock, resulting in the game being stuck permanently at a certain time, arresting the day/night cycle. There are other low-priority processes as well. These low priority processes may fail with even a couple players on a well-built server and module, so they are not of much help when diagnosing a problem. Some examples of low priority processes include persistent area of effect heartbeat scripts and spawned-in-placeable heartbeat scripts. These scripts often will not fire, even on a healthy server.

What You can do Within the Toolset to Prevent Lag

Instead of trying to exhaustively list all the various things you can do to keep lag down on your servers, we are going to list the most important measures you can take in order to achieve a relatively lag-free server. While doing so, we will explain why each measure is important, and what kinds of lag it can affect.

  1. Make Custom Factions
    You should change all your hostile factioned creatures to custom factions with hostile characteristics, grouping different sets of areas into different factions. In our mod we average about a faction for every 5-10 areas. This was far and away the biggest decrease we ever saw in lag. Why? Monsters generate silent shouts left and right, and fire listening scripts in response to these shouts. If the faction is different, however, most of these calls are cut short, meaning much less overhead from silent shouts. If you REALLY want to get serious about cutting overhead, you could eliminate some of these silent shouts completely, by editing the Artificial Intelligence (AI) scripts that issue them, but we would only recommend that to very advanced scriptwriters. The lag generated by these silent shouts and the scripts that fire in response to them is entirely server lag (type 3 above).
  2. Remove Non-Player Characters (NPCs) When no Players are Around
    You should ensure that no NPCs are standing around when players are not around to see them. They are extra listeners and they fire unnecessary script calls. This is ESPECIALLY the case if they are walking waypoints, which is very expensive, and is insanely expensive if they get stuck. In other words, no area in your module should have any creatures placed in it - they should all be spawned in, either by encounter trigger, or generic trigger, or on enter of the area. Once again, the lag generated by these NPCs/creatures is almost entirely server lag.
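    The cleanup half of this advice can be sketched as an area OnExit script. This is a minimal sketch, not a complete spawn system: it assumes that every creature in the area was spawned in and can safely be destroyed, and that the exit event reports the leaving player as a PC (worth verifying for logouts on your own server). The helper name GetIsPCInArea is our own.

    ```nwscript
    // Area OnExit sketch: when the last player leaves, remove leftover creatures.

    // Returns TRUE if any player other than oIgnore is still in oArea.
    // GetIsPCInArea is a hypothetical helper, not a built-in function.
    int GetIsPCInArea(object oArea, object oIgnore)
    {
        object oPC = GetFirstPC();
        while (GetIsObjectValid(oPC))
        {
            if (oPC != oIgnore && GetArea(oPC) == oArea)
                return TRUE;
            oPC = GetNextPC();
        }
        return FALSE;
    }

    void main()
    {
        object oArea = OBJECT_SELF;
        object oLeaving = GetExitingObject();

        // Only react to players leaving, and only when nobody else remains.
        if (!GetIsPC(oLeaving) || GetIsPCInArea(oArea, oLeaving))
            return;

        object oObject = GetFirstObjectInArea(oArea);
        while (GetIsObjectValid(oObject))
        {
            // Skip players and anything a player controls (summons, familiars).
            if (GetObjectType(oObject) == OBJECT_TYPE_CREATURE
                && !GetIsPC(oObject)
                && !GetIsPC(GetMaster(oObject)))
            {
                DestroyObject(oObject);
            }
            oObject = GetNextObjectInArea(oArea);
        }
    }
    ```

    The loop over all PCs runs only when a player leaves, which is far cheaper than leaving idle creatures to run their own scripts indefinitely.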
  3. Streamline your Scripts
    Almost all scripts can be written more efficiently. Start with the ones that are called most often, but that you can't get rid of. Many of the changes will be intuitive, obvious changes, while others are more a factor of replacing cpu-intensive operations with lighter ones (CopyObject where possible). You should also focus on getting rid of database reads where possible, especially if you are using the standard Bioware database. Database reads are very slow. If you use databasing for tracking player statuses, and are checking those statuses at all regularly, you should instead check them only on client enter, and convert them to locals for faster reads later. There are all kinds of tricks like this, many of which you will pick up naturally over time. Be aware of expensive functions, and avoid them where feasible. Again, the lag caused by inefficient scripts is usually entirely serverside, unless the inefficient script in question is also blasting out graphics, for example.
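    The database-to-local conversion described above might look like the following module OnClientEnter script. It is a sketch under assumptions: the campaign name "MY_CAMPAIGN", the variable name "QUEST_STATUS", and the two helper functions are all hypothetical, and it uses the standard Bioware database calls.

    ```nwscript
    // Module OnClientEnter sketch: read a persistent status once at login
    // and cache it on the PC as a local variable.
    // "MY_CAMPAIGN" and "QUEST_STATUS" are hypothetical names.

    // Fast read used by every other script: no database access.
    int GetQuestStatus(object oPC)
    {
        return GetLocalInt(oPC, "QUEST_STATUS");
    }

    // Write-through: update the cached local and the database together,
    // so the two never disagree.
    void SetQuestStatus(object oPC, int nStatus)
    {
        SetLocalInt(oPC, "QUEST_STATUS", nStatus);
        SetCampaignInt("MY_CAMPAIGN", "QUEST_STATUS", nStatus, oPC);
    }

    void main()
    {
        object oPC = GetEnteringObject();
        if (!GetIsPC(oPC))
            return;

        // The only database read for this status during the whole session.
        int nStatus = GetCampaignInt("MY_CAMPAIGN", "QUEST_STATUS", oPC);
        SetLocalInt(oPC, "QUEST_STATUS", nStatus);
    }
    ```

    In practice the two helpers would live in an include file so that conversations and triggers can call them; they are shown in one script here only to keep the sketch self-contained.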
  4. Eliminate Unnecessary Scriptcalls
    In the spirit of #2 above, you simply don't want scripts firing if they don't need to be. One generally wasteful type of script is the heartbeat script, since they run all the time while their object is valid - in the case of the module and areas, whenever the module is running. Heartbeats are FINE if used sensibly, and kept efficient, but you shouldn't use them if you don't need them, or if there's a more efficient alternative. Often, you can replace a heartbeat script with a pseudo-heartbeat, a kind of function that fires itself recursively on a delay you specify until the conditions that you specify are met. They are basically heartbeats with user-specified intervals (instead of the fixed ~6 second heartbeat event) and specific on/off switches to ensure that they aren't running when they aren't needed. They are, however, roughly five times as cpu-intensive as a heartbeat with identical function, and they tie up memory with their variables while they run, so you should reserve pseudo-heartbeats for relatively short term use. A rough guideline is to use them if they'll be running for 5-10% of the mod's uptime or less, though that is not a scientifically-arrived-at number. In our mod, the only heartbeats we didn't convert in some way were the module heartbeat and a few monster heartbeats - the rest were removed as unneeded or converted to another event or a pseudo. As with inefficient scripts, lag from unnecessary scriptcalls is server lag.
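    A pseudo-heartbeat of the kind described above can be sketched as follows. The function name PseudoHeartbeat, the switch variable "PSEUDO_HB_ON", and the 2-second interval are illustrative choices of ours; the periodic work itself is left as a placeholder.

    ```nwscript
    // Pseudo-heartbeat sketch: a function that reschedules itself with
    // DelayCommand until its on/off switch is cleared.
    // "PSEUDO_HB_ON" is a hypothetical local variable name.
    void PseudoHeartbeat(object oTarget, float fInterval)
    {
        // Off switch: stop once the object is gone or the flag is cleared.
        if (!GetIsObjectValid(oTarget) || !GetLocalInt(oTarget, "PSEUDO_HB_ON"))
            return;

        // ... do the periodic work here ...

        // Schedule the next pass at the interval the caller chose.
        DelayCommand(fInterval, PseudoHeartbeat(oTarget, fInterval));
    }

    void main()
    {
        object oTarget = OBJECT_SELF;

        // Guard against starting a second loop on the same object.
        if (GetLocalInt(oTarget, "PSEUDO_HB_ON"))
            return;

        SetLocalInt(oTarget, "PSEUDO_HB_ON", TRUE);
        PseudoHeartbeat(oTarget, 2.0);
    }
    ```

    Any other script can stop the loop by calling DeleteLocalInt on the same variable; the next scheduled pass then exits without rescheduling itself.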
  5. Review Placeable Use in Your Areas
    Aside from massively bulking up the size of your area files, using lots of placeables can come with other risks. Most importantly, you do not want to place placeables with walkmeshes across tile boundaries. A placeable with a walkmesh is one that characters cannot freely move over - think furniture, not carpet. The reason for this is that the pathing system uses these tile boundaries, and blocking access to them with placeables generates blocked pathing calls, which are enormously expensive. So, you can break this rule when spawns will not be crossing the space in question. We also break it in order to allow our PCs to summon walls via spells, but it is very pricey to do so. Pathing lag is entirely server lag. Another issue to be aware of is that putting many placeables together in an area can generate both massive graphics and server lag. Graphics lag, because more placeables mean more polys for the client to render, and server lag, because the server loads the description of all non-static placeables when they enter PC perception. Piling a few hundred placeables into a small space can cause enormous load, especially if players are repeatedly moving in and out of perceptual range.
  6. Eliminate Unnecessary DelayCommand Calls
    DelayCommand calls are one more thing that by necessity occupies your server's limited resources. The fewer of them that you use, the better off you will be. The most important step to reducing delayed calls is the use of timestamps. By setting up an incrementing count on your mod heartbeat - a very low-cost proposition which can save far more than it costs - you can establish a timestamping system. Then, instead of having a delay run for x amount of time, you can simply check the timestamp, compare it to the current one, and act accordingly. A simple example of this is placeable respawns. Suppose that you use loot placeables set to respawn every 20 minutes (1200 seconds). Instead of having each run a 1200 second DelayCommand, you can simply add a check OnEnter the area, which respawns the placeables if the current time is at least 1200 seconds later than the time at which they were looted/destroyed. This is just one example of a fairly long-term DelayCommand replaceable with a timestamp. Another, shorter-term example would be the application of a temporary bonus caster level to a PC, by means of a timestamped variable. While the current time is less than (the variable's timestamped set time + duration in seconds), the spell script in question adds one to their casterlevel, if the variable is present. If the time is greater, the variable is deleted. This is especially effective to replace delays (or effects) that only need to be checked when another event fires, as with the casterlevel example. Lag from unneeded DelayCommands is server lag.
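    The timestamp approach can be sketched with two small scripts. Everything named here is an assumption made for illustration: the local variables "TIMESTAMP" and "CHEST_LOOTED_AT", the waypoint tag "WP_CHEST", and the blueprint resref "my_loot_chest". The clock counts in 6-second steps because the heartbeat fires roughly every 6 seconds, so it is only approximate on a loaded server.

    ```nwscript
    // Module OnHeartbeat sketch: advance a module-wide clock, in seconds.
    void main()
    {
        object oMod = GetModule();
        SetLocalInt(oMod, "TIMESTAMP", GetLocalInt(oMod, "TIMESTAMP") + 6);
    }
    ```

    ```nwscript
    // Area OnEnter sketch: respawn a looted chest once 1200 seconds have passed.
    // Assumes the loot script copied the module "TIMESTAMP" value onto the area
    // as "CHEST_LOOTED_AT" when the chest was emptied or destroyed.
    void main()
    {
        object oArea = OBJECT_SELF;
        if (!GetIsPC(GetEnteringObject()))
            return;

        // A stamp of 0 is treated as "nothing waiting to respawn", so a chest
        // looted before the first module heartbeat is not handled by this sketch.
        int nLootedAt = GetLocalInt(oArea, "CHEST_LOOTED_AT");
        if (nLootedAt == 0)
            return;

        int nNow = GetLocalInt(GetModule(), "TIMESTAMP");
        if (nNow - nLootedAt >= 1200)
        {
            location lSpot = GetLocation(GetWaypointByTag("WP_CHEST"));
            CreateObject(OBJECT_TYPE_PLACEABLE, "my_loot_chest", lSpot);
            DeleteLocalInt(oArea, "CHEST_LOOTED_AT");
        }
    }
    ```

    No delayed call is pending during the 20 minutes; the only cost is one integer comparison each time a player enters the area.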
  7. Mind Your Shops
    Shops can generate lag, so it's generally wise not to clutter them with useless junk - if nothing else, junk is just that many more objects for the engine to keep track of. Furthermore, shops can generate enormous lag spikes, enough to cause players to disconnect, when items are sold off and the shop is closed. Because of how sales and the sorting of items are handled, selling items to a shop that holds a large inventory of items not matching those sold will generate massive lag. On our server, we split buying and selling into different stores linked by the same conversation for each shopkeeper, in order to avoid this. We have some stores with 13 pages of inventory or more; another solution, however, is just to keep inventory count down to begin with. If you get a lag hiccup when players sell off items, you'll know that you have enough items in the shop for this to be an issue. Again, this is server lag, though it is severe enough to time out player connections.
  8. Turn off the Script Profiler When you aren't Using It
    That's right, turn it OFF. While profilers can be enormously useful in tracking down inefficient scripts, they are also very costly to run. So, once you've done whatever profiling you needed to do, make sure you turn profiling off. This is true whether you are using the new Bioware profiler, or another one, such as the NWNX profiler plugin. Technically this is not something done within the toolset, only related to it, but it's important enough to merit a mention. As with most of the issues discussed here, profiling lag is of the server lag variety.
  9. Carefully Consider the Size of Your Spawns
    Larger spawns mean more lag, for a number of reasons. First, they mean more script calls, which makes more work for the server, and potentially more server lag. Second, they mean more going on in the areas where players are, which means more being streamed to them. That means potentially more connection lag, as more data needs to go out, and potentially more graphics lag, as the players' machines must render more. On my server we use a spawn size of between 6 and 12 per trigger, with a two trigger-at-a-time limit, which errs on the side of allowing too many spawns, but we pay careful attention to efficiency in most other regards, in order to get away with that kind of spawn size. Note that this does not necessarily mean just considering the size of spawn in each trigger - you should carefully review your areas for the potential for massing (triggering many spawns at once in order to kill more quickly via area-of-effect spells), as players are all too happy to engage in massing, despite the detrimental effects on performance for everyone on the server. Avoid giving them huge open spaces full of spawn triggers, unless there is some kind of incentive not to trigger them all, be it difficulty or something else, or unless you have a custom spawn system that limits the number of active encounters at one time.
  10. Encourage Limited Party Size
    The more players in an area, the greater the increase in lag for each additional player added. This is not a flat increase. Each additional player means that each other player in the area needs all that player's actions streamed to them and vice versa. This can generate a massive bandwidth problem (connection lag), as well as server lag, as more objects in the area generally means more for scripts running in that area to loop through, and more scripts running in that area overall. This is especially true with monsters and combat, as they generate a great deal of script activity (and again, players' machines still need to render all that occurs, so graphics lag may enter into the equation as well). On our server we accomplish this with experience and loot incentives, and a hard party cap at 10 members (no XP or random loot for parties over 10). 10 is probably too lax a limit, but ease of finding parties is a factor to consider as well, as is the pre-existing difficulty level of your server and the number of players required to tackle the challenges you set out for your players.
  11. Limit the Number of Items Your Players can Carry
    The more items a player carries, the more load they place on a server. This is one of the areas in which our knowledge is largely anecdotal, but it's also been commonly accepted in the community for some time — so much so that a limiter was already in place under our PW's last admin in 2004, before it changed hands. It was originally set to teleport players to 'Encumbria' and not let them leave until they got below 200 items, but this limit was lowered in several steps to a 150 item limit over a period of some months, with noticeable performance improvements resulting. It also stands to reason, since more items means more for the server to keep track of, and more for scripts to loop through, though those alone probably don't fully account for the performance increases we noticed. Any lag generated by packrat players is probably mostly server lag, unless item data is streamed to other players for some reason (and we have no reason to believe that it is), in which case there might also be a connection lag component.
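    An item limiter along these lines could be sketched as below. This is not the script our PW used; the helper CountCarriedItems, the waypoint tag "WP_ENCUMBRIA", and the choice to run the check from an area OnEnter event are all assumptions. The loop does not count equipped items, and how items inside containers are counted is worth testing on your own server.

    ```nwscript
    // Item limiter sketch: count what a PC carries and react if over the limit.

    // CountCarriedItems is a hypothetical helper, not a built-in function.
    int CountCarriedItems(object oPC)
    {
        int nCount = 0;
        object oItem = GetFirstItemInInventory(oPC);
        while (GetIsObjectValid(oItem))
        {
            nCount++;
            oItem = GetNextItemInInventory(oPC);
        }
        return nCount;
    }

    void main()
    {
        object oPC = GetEnteringObject();
        if (!GetIsPC(oPC) || GetIsDM(oPC))
            return;

        // 150 is the limit mentioned above; tune it for your own server.
        if (CountCarriedItems(oPC) > 150)
        {
            SendMessageToPC(oPC, "You are carrying too many items.");
            // "WP_ENCUMBRIA" is a hypothetical waypoint in a holding area.
            AssignCommand(oPC, JumpToObject(GetWaypointByTag("WP_ENCUMBRIA")));
        }
    }
    ```

    Looping through an inventory is not free either, so a check like this belongs on an infrequent event rather than on a heartbeat.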
  12. Limit the Number of Players on Your Server
    This one is the hardest for some administrators to come to terms with. Every server has a limit, in terms of both processing and bandwidth. If you exceed that limit by allowing more players to log in than your server can handle, your performance will suffer, as you are hit by server and/or connection lag respectively. You may want to consider dropping your player limit and adding additional, linked servers, with a shared servervault, as an alternative solution. On my server, we have split hosting among 4 machines, each running two instances of the mod, and each having a 27-player limit. Performance is usually optimal if we have 20 players or fewer per server instance, but again, there are other considerations to take into account, like party formation, which can be difficult if some players cannot log in due to player limit. Of course, as you streamline your module, you should be able to handle more players with less server lag, but there will still be a bandwidth constraint, and there is only so much that streamlining can accomplish. You'll have to determine your own optimal player limit, as it's determined by a number of factors, including your hardware, your connection, and many of the factors discussed above.
  13. Avoid Using Lots of High-Polygon-Count Models
    Some placeable and creature models in the toolset are composed of a high number of polygons. This means that it requires more effort for graphics cards to render them, and that they take up a larger chunk of NWN-cached graphics. Some, like horses, are so large, that they practically render NWN's caching ineffective, and result in a great deal of graphics lag for players. A typical horse model is around 14 megs, and the entire default cache is only 16M. Players can increase their caching to 32M or even 64M in their .ini files, but only very high end computers are improved by the 64M setting, meaning that for most players, using horse models eats up roughly half their entire useful graphics cache or more (most don't even know to edit their .ini past 16, in our experience). Other models are not quite as high-poly, but should still be used sparingly. The CEP has a number of models of this type. You can judge the relative sizes of different models by looking in the haks/bifs at the .mdl files. The lag resulting from using too many high-poly models is, of course, graphics lag.
  14. Don't be a Peeping Tom
    Be aware that having a DM client in an area has a much more severe impact on performance than adding a number of players. That's because pretty much everything that happens in an area is streamed to the DM. Having large groups of players engaged in heavy combat can stress a server, per #10 above. Having a DM watching it all can really put a hurt on it. Most of the time, this isn't going to be a major concern, but if a server is already straining to perform, you can bring it to its knees by entering the same area in the DM client. The effect is so bad that our veteran players can easily tell when a DM enters an area during a big combat, simply by virtue of the sudden hit to performance. This type of lag is largely connection lag.

Conclusion

Much of the above, especially the lower numbers in the lists, comes down to judgment calls on the part of the builder. Think of them as general guidelines, which you can stray from at a cost. Whether the cost is worth the benefit is up to you. But the higher up on the above list they are, the greater the cost associated with them, generally speaking.


author: FunkySwerve, Acaos, editors: Mistress, Kolyana, contributor: Kookoo, Phann