TL;DR

we were facing database inconsistency issues detected by the DBCC CHECKDB / DBCC CHECKTABLE checks we run in nightly jobs. as far as we've investigated, the cause was a change from "ntext" / "text" column types to "nvarchar(max)" / "varchar(max)", combined with an optimization to a component that stores large strings. this resulted in smaller nvarchar(max) values and somehow caused the inconsistencies. don't worry, we've tested and found a solution that does not corrupt your data, but nothing is guaranteed, you do this at your own risk!
  1. take a backup, just in case
  2. remove or disable constraints
  3. rename the old table (e.g. MyTable -> MyTable_TEMP)
  4. create the table structure again directly with nvarchar(max) column types
  5. insert the complete data of the old temp table into the new one
  6. drop the old temp table
  7. restore the constraints.
we've never faced this issue again since then. to learn more, read the long version below ;)

the complete story

one day i had some alarming emails in my inbox from our sql server. somehow some "off-row data" was incorrectly referenced. Error message:
Table error: Object ID 1234567, index ID 1, partition ID 987654321, alloc unit ID 123123123123 (type LOB data). 
The off-row data node at page (1:12345678), slot 0, text ID 7766554433 does not match its reference from page (1:978563412), slot 1.
after some googling i found some posts suggesting to use "DBCC CHECKDB('MyDB', REPAIR_ALLOW_DATA_LOSS)". yeah, great start of the day... "allow data loss" sounds really great :D i will not go into details on how to fix these errors, googling dbcc checkdb should be enough to find infos on that topic, and it was not our final solution anyway. we've come up with a "trick" to keep all data and fix the corruption without data loss, but more on that later (or in the TL;DR).

everything on the internet about how that happens blamed the storage subsystem: broken HDD, bit rot, etc. but we're using ZFS as a storage subsystem and it would have told me (weekly scrub) if there was any data loss. i checked the logs, but found nothing that pointed to a storage issue. ok. i resolved the issue by repairing the database with the dbcc tool. this is strange, but i didn't have much time to chase the issue. it's resolved, so what. it only happened once, right?

no... certainly not right. the next week it happened again...

the first try and the MS Support incident

the second time the error happened, i noticed that i had received a "severity 20" error message from sql server a day or two earlier, saying:
DESCRIPTION: A read operation on a large object failed while sending data to the client. 
A common cause for this is if the application is running in READ UNCOMMITTED isolation level. This connection will be terminated.
so we began to investigate further, knowing that we use this isolation level in the software this database belongs to. long story short: we investigated, talked to MS Support, they said ReadUncommitted is the issue, we changed our codebase to not use ReadUncommitted anymore (instead we used snapshot isolation, as advised by MS), weeks passed by, and... oh look, the same inconsistency error. we contacted MS Support again, and they said they cannot help us, we'd have to open a root cause analysis support case, which IMHO costs too much money. so back to the beginning. :(
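for reference, the switch to snapshot isolation boils down to something like this (MyDB is just a placeholder name):

-- allow snapshot isolation on the database
ALTER DATABASE MyDB SET ALLOW_SNAPSHOT_ISOLATION ON;

-- and in the application, instead of READ UNCOMMITTED:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
-- ... reads now see a consistent snapshot instead of dirty, possibly half-written data ...
COMMIT;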

investigating on our own

in the meantime some of our customers contacted us; they also had inconsistencies and were worried about their data. so, back at square one, i thought about what we had changed in the past 3 versions of our software since this problem first appeared. after some research i found these 2 points:
  1. We've improved how we store large strings; due to compression they now consume much less space in the ntext / nvarchar(max) columns of our database
  2. We've changed the database layout of our tables to move away from "ntext" and instead use "nvarchar(max)", as suggested by MS (a small sketch of this change follows below)
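for illustration, such a migration boils down to statements like these; MyTable and MyColumn are placeholder names, not our actual schema:

-- placeholder names; in our product this runs as part of the regular schema upgrade
ALTER TABLE MyTable ALTER COLUMN MyColumn nvarchar(max) NULL;
-- often recommended afterwards, so that existing values get rewritten into the new storage format
UPDATE MyTable SET MyColumn = MyColumn;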
so what if we saved a large string while the column type was still ntext, then later updated the field with our new logic that compresses the content and writes it back to the column that is now of type nvarchar(max), and something with the linking went wrong internally in SQL Server? Remember:
The off-row data node at page [...] does not match its reference from page [...]
hmmm, so i decided to verify my claim. i played around a little bit with DBCC PAGE to look at pages and their content in sql server, and found the row that was reported as corrupted by DBCC CHECKDB:
Table error: Object ID 1234567, index ID 1, partition ID 987654321, alloc unit ID 123123123123 (type LOB data). 
The off-row data node at page (1:12345678), slot 0, text ID 7766554433 does not match its reference from page (1:978563412), slot 1.
ok, so the off-row data (data that was pushed off the row because it was too big) has a pointer back to the row it belongs to, and this pointer was referencing something wrong. so i looked at the page it was referenced from, (1:978563412):
dbcc traceon(3604)
dbcc page(MyDB, 1, 978563412, 1) with tableresults
and got the primary key of the row that was affected. ok, now we've deciphered the message and are able to play around with this data! yay!
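by the way, the raw IDs in the error message can be translated into readable names via the catalog views, roughly like this (the IDs are the ones from the sample message above):

-- which table does Object ID 1234567 belong to?
SELECT OBJECT_NAME(1234567) AS table_name;

-- resolve the partition ID from the error message to table, index and partition
SELECT OBJECT_NAME(p.object_id) AS table_name, p.index_id, p.partition_number
FROM sys.partitions p
WHERE p.partition_id = 987654321;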

playing around until a solution emerges

i tried to change the nvarchar(max) field data to the same content via an update statement: no change, the inconsistency stayed. i copied the row: the new row was clean, no new inconsistencies. hm. ok, so i deleted that row. inconsistency gone. yeah, that's what my friend DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS does, but we don't want that, we want to keep our data. luckily i had some more affected rows, so on to the next one!

this time i copied the row out to a temporary table, deleted it and inserted it back into its original table. tadaaa, consistency restored, but without data loss. this is the solution we came up with. in general we wrote a script that copies the complete table to another table, drops the old one and renames the new one. since then we've not faced an inconsistency in this table again, so we think it has to be related to updating the columns from ntext to nvarchar(max).

these are the steps our "repair" script does for the table where we faced the issue (a small sketch in SQL follows the list). i know this solution is far from perfect: you temporarily have the data for this table twice (copy, rename and then drop), but it's the only solution we've come up with.
  1. take a backup, just in case
  2. remove or disable constraints
  3. rename the old table (e.g. MyTable -> MyTable_TEMP)
  4. create the table structure again directly with nvarchar(max) column types
  5. insert the complete data of the old temp table into the new one
  6. drop the old temp table
  7. restore the constraints.
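a minimal sketch of these steps, assuming a hypothetical table MyTable with an integer primary key and a single large string column (your real schema, constraints and indexes will differ):

-- 1. take a backup first!
-- 2. drop or disable foreign keys and other constraints pointing at MyTable (schema dependent)

-- 3. keep the old data around under a temporary name
EXEC sp_rename 'MyTable', 'MyTable_TEMP';

-- 4. recreate the table, this time directly with nvarchar(max)
CREATE TABLE MyTable (
    Id int NOT NULL PRIMARY KEY,
    LargeText nvarchar(max) NULL
    -- ... remaining columns ...
);

-- 5. copy the complete data over
INSERT INTO MyTable (Id, LargeText)
SELECT Id, LargeText FROM MyTable_TEMP;

-- 6. drop the old table
DROP TABLE MyTable_TEMP;

-- 7. recreate the constraints removed in step 2

if the table has an identity column, you additionally need SET IDENTITY_INSERT MyTable ON/OFF around the INSERT.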

finally, here are some sql statements i found useful; maybe they're helpful to somebody. :)

all these scripts may be used for any purpose, but at your own risk! nothing is guaranteed. DBCC PAGE is an undocumented feature, so be careful: take a backup beforehand and test this stuff on a temporary/testing database before you go ahead and try to fix things on a production database!

check contents of page and get primary key of affected row:
dbcc traceon(3604)
dbcc page(MyDB, FileId, PageId, 3) with tableresults  -- last parameter is the print option (3 = detailed output)
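the other direction works too: to see on which page a specific row currently lives, the (also undocumented) %%physloc%% virtual column can be used, for example:

-- MyTable / Id are placeholders; the formatter returns (file:page:slot) of the in-row part of the row
SELECT Id, sys.fn_PhysLocFormatter(%%physloc%%) AS physical_location
FROM MyTable
WHERE Id = 42;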
find all inconsistent rows and get their primary key:
CREATE TABLE #dbcc_tableresults(
            [Error] int, 
            [Level] int, 
            [State] int,
            [MessageText] varchar(7000), 
            [RepairLevel] varchar(7000), 
            [Status] int,
            [DbId] int, 
            [DbFragId] int,
            [ObjectID] int, 
            [IndexId] int, 
            [PartitionId] bigint,
            [AllocUnitId] bigint,     
            [RidDbId] int,     
            [RidPruId] int,     
            [File] int, 
            [Page] int, 
            [Slot] int,     
            [RefDbId] int,     
            [RefPruId] int,
            [RefFile] int, 
            [RefPage] int, 
            [RefSlot] int, 
            [Allocation] int);
        -- Execute DBCC CHECKTABLE and insert its result into the temp table
        INSERT INTO #dbcc_tableresults(
            [Error], 
            [Level], 
            [State], 
            [MessageText], 
            [RepairLevel],
            [Status], 
            [DbId], 
            [DbFragId],
            [ObjectID], 
            [IndexId], 
            [PartitionId],
            [AllocUnitId],     
            [RidDbId],
            [RidPruId],
            [File], 
            [Page], 
            [Slot], 
            [RefDbId],
            [RefPruId],
            [RefFile],
            [RefPage], 
            [RefSlot], 
            [Allocation])
        EXEC ('DBCC CHECKTABLE(''MyCorruptTableName'') WITH TABLERESULTS')
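afterwards the captured messages can be queried; for the LOB errors from above, something like this pulls out the pages to feed into DBCC PAGE (the LIKE filter is just an assumption about the message text):

SELECT [MessageText], [File], [Page], [Slot], [RefFile], [RefPage], [RefSlot]
FROM #dbcc_tableresults
WHERE [MessageText] LIKE '%off-row data node%';

DROP TABLE #dbcc_tableresults;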
with the above script you are able to create a script that only fixes your inconsistent rows, but in our case it was just a matter of time until the next inconsistency occurred, so we decided to copy the whole table.
i've omitted the script that copies the rows and re-inserts them because it's heavily tied to our DB schema and wouldn't be of great use to others.
maybe this helps other people in my position to find a solution sooner than i did... :)