FF7/Kernel/Low level libraries

From Final Fantasy Inside
< FF7‎ | Kernel
Revision as of 02:04, 11 March 2005 by my_wiki>Synergy Blades
Jump to navigation Jump to search

PC to PSX comparison

The files and data formats used in the PSX version of FF7 and it's PC port are conceptually the same thing, and accomplish the same tasks. That being said, they both have wildly different formats. Both ofwhich were derived from a third original format that is also somewhat different that the first two.

The original PSX FF7 was created in part using Sony's Psy-Q development library. This library usescommon formats that are "native" to the PSX. Often times a toolkit was used to convert commondevelopment- based formats, such as a TGA bitmap or a palleted GIF file, to something a little moresuited to Psy-Q, which would be a TIM file.

During the porting process to the PC, some of the original artwork, (and artists for that matter), wereno longer available. This resulted in the port team having to use the Psy-Q versions of many files,which were ill suited for the PC architecture. In our example, the TIM file was converted to a TEX file,which would be manipulated in the PC's video memory a little more efficiently. Sometimes theoriginal artwork was available, such as the pictures of the characters within the menu, or the original MIDI files. Most often times it was not.

To make things a little more confusing, both systems also archive their data files in different ways,making the extraction and rendering of each file a bit of a bear. For the most part the data within eachfile is the same thing, just a little switched around. This manual will cover the more generic files first,and then common files used in each module.

Data Archives

To save space, quicken access time, and to obfuscate the file structure a little, most of the data files are stored in some kind of archive format. The archives remove such useful items as subdirectories and logical data placement. There is no real "native" format these are based on.

BIN archive data format

The BIN format comes as two different types. They both have the same extension, so one must open the file to see which format is which. They are best described as BIN Types and Bin-GZIP types

BIN Types

These are uncompressed archives. The header consists of a 4 byte header that gives the length of the file without the header and then the data beyond that.

BIN-GZIP Types

Unless otherwise noted, these have a 6 byte header. After this are many gziped sections concatenated together.

Offset Length Description
0x0000 2 bytes Length of gzipped sections
0x0002 2 bytes Unknown
0x0004 2 bytes File number
0x0006 Varies [0x1F8B080000000000...] - Gzip header 1
0x0000 2 bytes Length of gzipped sections
Varies 2 bytes Length of gzipped sections
Varies 2 Unknown
Varies 2 bytes File number
Varies Varies [0x1F8B080000000000...] - Gzip header 2
...

LZS Compressed archive for PSX by Ficedula

Format

The LZS archive has a very small header at 0x00 that has the length of the decompressed file as an unsigned 32 bit integer. After that is the compressed data.

LZS compression

FF7 uses LZS compression on some of their files - more properly, a slightly modified version of LZSS compression as devised by Professor Haruhiko Okumura. LZS data works on a control byte scheme. So each block in the file begins with a single byte indicating how much of the block is uncompressed ('literal data'), and how much is compressed ('references'). You read the byte right-to-left, with 1=literal, 0=reference.

Literal data means just that: read one byte in from the source (compressed) data, and write it straight to the output.

References take up two bytes, and are essentially a pointer to a piece of data that's been written out (i.e. is part of the data you've already decompressed). LZSS uses a 4K buffer, so it can only reference data in the last 4K of data.

Reference format

A reference takes up two bytes, and has two pieces of information in it: offset (where to find the data, or which piece of data is going to be repeated), and length (how long the piece of data is going to be). The two reference bytes look like this:

OOOO OOOO  OOOO LLLL

(O = Offset, L = Length)

So you get a 12-bit offset and a 4-bit length, but both of these values need modifying to work on directly. The length is easy to work with: just add 3 to it. Why? Well, if a piece of repeated data was less than 3 bytes long, you wouldn't bother repeating it - it'd take up no more space to actually just put literal data in. So all references are at least 3 in length. So a length of 0 means 3 bytes repeated, 1 means 4 bytes repeated, so on.

Since we have 4 bits available, that gives us a final length ranging from 3-18 bytes long. (That also means the absolute maximum compression we can ever get using LZSS is a touch under 9:1, since the best possible is to replace 18 bytes of data with two bytes of reference, and then you have to add control bytes as well).

Offset needs a bit work doing on it, depending on how you're actually holding your data. If all you have is an input buffer and an output buffer, what you really need is an output position in your buffer to start reading data from. In other words, if you've already written 10,000 bytes to your output, you want to know where to retrieve the repeated data from - it could fall anywhere in the past 4K of data (i.e. from 5904 through to 9999 bytes).

Here's how you get it:

real_offset = tail - ((tail - 18 - raw_offset) mod 4096)

Here, 'tail' is your current output position (eg. 10,000), 'raw_offset' is the 12-bit data value you've retrieved from the compressed reference, and 'real_offset' is the position in your output buffer you can begin reading from. This is a bit complex because it's not exactly the way LZSS traditionally does (de) compression; it uses a 4K circular buffer; if you do that, the offset is more or less usable directly.

Once you've got to the start position for your reference, you just copy the appropriate length of data over to your output, and you've dealt with that piece of data.

Example

If we're at position 1000 in our output, and we need to read in a new control byte because we've finished with the last one. The next data to look it is:

0x03, 0x53, 0x12 .....

We read in a control byte: $03. In binary, that's 00000011. That informs us that the current block of data has two compressed offsets (@ 2 bytes each), followed by 6 literal data bytes. Once we'd read in the next 10 bytes (the compressed data plus the literal data), we'd be ready to read in our next control byte and start again.

Looking at the first compressed reference, we read in $53 $12. That gives us a base offset of $153 (the 53 from the first byte, and the '1' from the second byte makes up the higher nybble). The base length is $2 (we just take the low nybble of the second byte).

Our final length is obviously just 5.

Our position in output is still 1000. So our final offset is:

= 1000 - ((1000 - 18 - 339) and $FFF)

The 339 is just $153 in decimal. The (and $FFF) is a quick way to do modulus 4096.

= 1000 - (643 and 0xFFF)
= 1000 - 643
= 357

So our final offset is 357. We go to position 357 in our output data, read in 5 bytes (remember the length?), then write those 5 bytes out to our output. Now we're ready to read in the next bit of data (another compressed reference), and do the procedure again...

Complications

Unfortunately, that doesn't quite cover everything - there's two more things to be aware of when decompressing data that *will* ruin you when using FF7 files, since they do use these features.

First, if you end up with an negative offset, i.e. reading data from 'before the beginning of the file', write out nulls (zero bytes). That's because the compression buffer is, by default, initialized to zeros; so it's possible, if the start of the file contains a run of zeros, that the file may reference a block you haven't written... EG: If you're at position 50 in your output, it's possible you may get an offset indicating to go back 60 bytes to offset -10! If you have to read 5 bytes from there, it's simple: you just write out 5 nulls. However, you *could* have to read 15 bytes from there. In that case, you write out 10 nulls (the part of the data 'before' the file start), then the 5 bytes from the beginning of the file.

Secondly, you can have a repeated run. This is almost the opposite problem: when you go off the end of your output. Say you're at offset 100 in your output, and you have to go to offset 95 to read in a reference. That's OK ... but what if the reference length is >5? In that case, you loop the output. So if you had to write out 15 bytes, you'd write out the five bytes that *were* available ... and then write them out again ... then again, to make up the 15 bytes you needed.

The FF7 files use both of these 'tricks', so you can't ignore them!

LGP Archive format for PC by Ficedula

This section explains how the LGP archives from FF7PC are constructed. There's probably no reason why you'd need to know this (Plug: Use my LGP Editor !) but the file format might be useful to SOMEBODY.

Essentially the LGP file is split up into four (maybe less, depending on how you count it) sections.

  1. File header/Table of contents
  2. CRC code
  3. Actual data
  4. File terminator

Section 1: File Header

This contains two parts: A header of fixed size, then the table of contents.

The first item is 12 bytes containing the file creator. This is a standard string, except it is "rightaligned". In other words the blank space comes BEFORE the actual text, not after. Oh: In FF7 it's always "SQUARESOFT" preceded by two nulls to make it 12 bytes. The only other thing you might see is the header "FICEDULA-LGP", which I use to indicate a file is an LGP *patch* one of my programs has constructed, not a complete archive.

Next is a four-byte integer saying how many files the archive contains.

Following this is the table of contents (TOC): One entry per file.

Each entry in the TOC has the following structure:

Offset Length
20 bytes Null terminated string, giving filename
4 byte integer Position in this file where data starts for the file
3 bytes Some sort of check code. Normally seems to be
14,0,0 but it does vary. Unsure about this.

Simple!

Section 2: CRC Code

This code is used to validate the LGP archive. The bad news is I have no idea how to make it (I've figured out how to decode it, ie. find out whether the archive is valid ... but I can't create my own). The good news is you don't need to! The ONLY thing this CRC is based on is the number of files in the archive (maybe the filenames too ... haven't checked that). Anyway, the TOC is the only thing this check relates to. So if you're replicating an archive from FF7 for use in the game with the same number of files and filenames (and what ELSE would you use LGP archives for?) you can just copy the CRC section from an existing file. Cheap but effective :)

Normally it's 3602 bytes long. I think one archive was different? Maybe MAGIC.LGP - can't remember. Anyway, one normally-safe way of calculating the CRC size is to find the end of the TOC and the beginning of the first file. Anything in between is probably CRC code. (Not guaranteed to work! It works with "official" archives but editors - such as mine - can screw around with the TOC to achieve extra things).

Section 3: Actual Data

The data from the files. However it's not that simple: the TOC doesn't list how long each file is (somewhat useful!). It's done here. The offset in the TOC is actually the position of yet another file header. Format is:

Size Description
20 bytes Null terminated string, giving filename
4 bytes File length
Varies The file data itself

Simple!

Section 4: Terminator

After the last piece of data comes the file descriptor. This is a simple string, except instead of being null-terminated it's terminated by the end of the file. It's "FINAL FANTASY 7" for all archives, except LGP patches, where it's "LGP PATCH FILE".

Notes

The game is remarkably flexible about LGP archives. So long as the TOC and the CRC data is intact it'll accept just about anything.

  • Example 1: The filename in the TOC and in the actual file header don't have to match. It only checks the TOC.
  • Example 2: You can point two entries in the TOC at the same data and it works.
  • Example 3: You can have ANY junk in the data section so long as all the TOC entries point to a valid file header. Not every piece of data has to be "accounted" for by the TOC. There can be data not used.

My LGP Editor uses this to its advantage in the Advanced Editor. If you want to replace a file in an LGP archive with your own copy, it just sticks the file on the end of the LGP, writes a new file terminator, and updates the TOC to point at the new file. (Advantage: Fast). It even lets you link two TOC entries to the same data :) or have "inactive" files in the archive that aren't referenced by any TOC entry.

I don't know whether the file terminator has to be intact, but for safety's sake my editor preserves it. The CRC DEFINTELY has to be present and correct. Also, if you're replacing an archive with you're own custom version make sure it has filenames in the TOC matching the ones in the old one, ne?

Oh: The game doesn't check archive sizes so long as all filenames are present. So if you want, you could replace an archive containing 95 files with a 98-file archive, so long as 95 of those 98 names matched those present in the original 95-file archive! (There's no point in doing this - after all, the game won't use any files OTHER than the 95 it's expecting to find).

Other point: I've heard reports on Qhimm's message board that once you've f***ed an archive and the game refuses to read it, it won't EVER read it until you reinstall - even if you fix the problem/restore from a backup. The idea was generally scorned and ignored, but I'll mention it because something like that happened to me. Then again, it COULD have been because I upgraded basically everything in my PC; so no solid conclusion to be drawn here.

Further point: (due to changes in my LGP Tools/Cosmo programs) Sometimes, there are data "gaps" in the file that don't appear to be referenced by any file - even by an inactive file. This happens due to the way my programs update archives. If you're only using the TOC method to get at files (the easy way) then you won't notice this anyway. However, if you're stepping through the file header by header, even reading the unused ones, this can cause problems. If you use my program to update a file with one that's smaller than the original (can happen) then it writes it in, but leaves a gap after it (of course). However, to help you out, after the end of the file, it writes a 4 byte integer saying how much more space to skip over to reach the next file header. This really doesn't affect many things - only tools (like my Advanced LGP Editor) that bypass the TOC to construct their own file lists. FF7 never notices a thing.