New user self-registration is disabled due to spam. For an account please email bugs-admin@lists.llvm.org with your e-mail address and full name.

Bug 35212 - rustc ThinLTO-generated IR yields dwarfdump error
Summary: rustc ThinLTO-generated IR yields dwarfdump error
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: DebugInfo (show other bugs)
Version: trunk
Hardware: PC Windows NT
: P enhancement
Assignee: Jonas Devlieghere
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-05 21:40 PST by Alex Crichton
Modified: 2017-12-01 22:37 PST (History)
8 users (show)

See Also:
Fixed By Commit(s):


Attachments
minimized IR (5.74 KB, text/plain)
2017-11-05 21:40 PST, Alex Crichton
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Crichton 2017-11-05 21:40:28 PST
Created attachment 19373 [details]
minimized IR

This is a minimization of an upstream issue at https://github.com/rust-lang/rust/issues/45511 where the Rust compiler, when using ThinLTO, generates DWARF information that the `dwarfdump` program interprets as invalid.

I managed to minimize to the attached IR. The IR attached has been reduced via machines (creduce on Rust source code, bugpoint on LLVM IR) and also by hand. As a result this isn't really a direct translation of anything in Rust but rather an "artisinal" reduction which may or may not have invalidated the reduction itself along the way.

I unfortunately know very little about DWARF much less dwarfdump. I'm operating under the assumption that anything accepted by LLVM *shouldn't* generate an error with dwarfdump, but that may also be totally wrong! If so, please just let me know!

In any case, I can reproduce the error via:


$ llc -filetype=obj -O0 foo.ll -o foo.o
$ dwarfdump -i foo.o > /dev/null

dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x00000078>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)
Comment 1 Alex Crichton 2017-11-07 06:57:08 PST
There's some discussion [1] on the upstream issue where it seems like LLVM may mostly assume that one LLVM module only has one codegen unit of debug information (dwarf-wise at least). I believe that the reason the multiple codegen units came into existence was because of ThinLTO (where ThinLTO inlined debug info presumably). 

One possible fix [2] sounded like it may be to just inline debug info into the original codegen unit during ThinLTO inlining, but I'm not sure if that's (a) the best fix or (b) if it'd actually fix things.


[1]: https://github.com/rust-lang/rust/issues/45511#issuecomment-342314174
[2]: https://github.com/rust-lang/rust/issues/45511#issuecomment-342442707
Comment 2 Jonas Devlieghere 2017-11-08 03:33:30 PST
If I understand correctly, the problem can be summarized as two DISubprogram that each have a different DICompileUnit, but share a scope end up in the same DWARF compile unit. 

Assuming this is the result of (thin)LTO, I'd be interested to have a look at the IR before it's linked in. Maybe that will tell us why they are sharing a scope, which I think is the fundamental issue here.
Comment 3 Alex Crichton 2017-11-08 07:17:23 PST
Heh ok so this is where things could get a bit interesting. Unfortunately these are all massively reduced from the original IR which Rust tends to generate a lot of. I'll try to dump it all in though to see if that could help!

Unfortunately the files were a bit too large to upload here so I've made a gist [1]. There's two tarballs in that gist. First is `inputs.tar.gz`. These are all the input IR files to the ThinLTO backend. Basically rustc created all these codegen units for the crate in question (not a reduced crate, the original source) and then after it produced those modules and ran optimizations on them the next step was to run ThinLTO across all the modules.

I checked and a bunch of the object files ended up having invalid dwarf, but I decided to focus on one. The `stages.tar.gz` is the next most interesting tarball which contains the various stages of ThinLTO processing of the test.test0* module. This module's eventual object file was indeed invalid (according to dwarfdump), but the fun part was that each intermediate stage is valid!

The order of what happened here was:

* thin-lto-input.ll -- input to ThinLTO
* thin-lto-after-rename.ll -- after calling `renameModuleForThinLTO`
* thin-lto-after-resolve.ll -- after calling `thinLTOResolveWeakForLinkerModule`
* thin-lto-after-internalize.ll -- after calling `thinLTOInternalizeModule`
* thin-lto-after-import.ll -- after using `FunctionImporter`
* thin-lto-after-pm.ll -- after running ThinLTO optimization passes

Interestingly if you run all IR files through `llc` to generate an object file, they all generate object files that dwarfdump *doesn't* complain about except the last one, after-pm.ll. I verified this by just running `llc $input -o foo.o -filetype=obj && dwarfdump -i foo.o > /dev/null`, and only the after-pm.ll complained.

Even more interestingly I ran `opt` manually over `after-import.ll` and it didn't produce an invalid object file! The only difference between after-import.ll and after-pm.ll is after running ThinLTO optimization passes, but running the full set of optimization passes is apparently different enough that it correctly handles the dwarf info?

Unfortunately I don't know how to reproduce ThinLTO passes on the command line so I wasn't able to get a command line reproduction which takes `after-import.ll` and then modifies it to a state which is invalid (when passed through llc).

I hope that's not too much information! If I can help clarify anything here I'd be more than willing to help reduce and make sure that it can be debugged. 


https://gist.github.com/alexcrichton/a7dbb0062491640d9d15d06ce10d34ef
Comment 4 Adrian Prantl 2017-11-08 09:25:32 PST
0x00000000: Compile Unit: length = 0x00000132 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x00000136)
0x00000078:   DW_TAG_base_type
                DW_AT_name  ("usize")
                DW_AT_encoding  (DW_ATE_unsigned)
                DW_AT_byte_size (0x08)

<...>
0x00000136: Compile Unit: length = 0x00000066 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x000001a0)
<...>
0x00000171:       DW_TAG_structure_type
                    DW_AT_name ("RawVec<alloc::string::String, alloc::heap::Heap>")
                    DW_AT_byte_size    (0x10)
                    DW_AT_alignment    (8)

0x00000178:         DW_TAG_subprogram
                      DW_AT_linkage_name  ("_ZN5alloc7raw_vec8{{impl}}52allocate_in<alloc::string::String,alloc::heap::Heap>E")
                      DW_AT_name  ("allocate_in<alloc::string::String,alloc::heap::Heap>")
                      DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs")
                      DW_AT_decl_line (82)
                      DW_AT_external  (true)
                      DW_AT_APPLE_optimized (true)
                      DW_AT_inline  (DW_INL_inlined)

0x00000184:         DW_TAG_subprogram
                      DW_AT_linkage_name  ("_ZN5alloc7raw_vec8{{impl}}36with_capacity<alloc::string::String>E")
                      DW_AT_name  ("with_capacity<alloc::string::String>")
                      DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs")
                      DW_AT_decl_line (140)
                      DW_AT_external  (true)
                      DW_AT_APPLE_optimized (true)
                      DW_AT_inline  (DW_INL_inlined)

0x00000190:           DW_TAG_formal_parameter
                        DW_AT_name  ("cap")
                        DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs")
                        DW_AT_decl_line (1)
                        DW_AT_type  (cu + 0x0078 "")


I think there are two things that we can/should do here:
1. It is essentially arbitrary in which CU the DW_TAG_structure_type is being emitted (and it depends on which DISubprogram we are visiting first). For that reason we should fix the code in LLVM's DWARF backend that decides to emit the DW_AT_type as a DW_FORM_ref4 and correctly determine that we need a DW_FORM_ref_addr instead. This can probably be achieved by comparing the parent DIEs of the formal parameter and the type.
2. There's an optimization potential for basic types: Basic types do not really belong to any specific CU and are typically very small, so we should consider emitting them into each CU that references them instead of generating a longer DW_FORM_ref_addr to reference them. While doing this would also fix the above testcase, it won't fix the the issue for anything other than basic types.
Comment 5 David Blaikie 2017-11-08 10:33:33 PST
For now, I thought I'd managed to thread the needle and avoid ThinLTO ever producing two CUs in a single .o (In part because Fission can't cope with that - but maybe I only implemented this for Fission... ah, yeah, -split-dwarf-cross-cu-references causes cross-CU inlined subroutines to be placed in the same unit as the source of the reference to avoid the cross-CU referencing under Split DWARF)

Yeah, this won't be a very well tested scenario at least on Google's side of ThinLTO, due to our predominant use of Fission, etc.

& yeah, I'm surprised this didn't come up with normal LTO?
Comment 6 Alex Crichton 2017-11-08 20:44:26 PST
Oh interesting David! Should we be doing something different on our end to avoid problems like this perhaps?
Comment 7 David Blaikie 2017-11-09 11:16:20 PST
(In reply to Alex Crichton from comment #6)
> Oh interesting David! Should we be doing something different on our end to
> avoid problems like this perhaps?

Are you using Fission/Split DWARF?

If you aren't, then yeah - I imagine there's some multi-CU stuff that's less well tested.

Though I'm surprised these sorts of things wouldn't've come up in plain LTO... so I'm just a bit confused.
Comment 8 Alex Crichton 2017-11-10 01:29:39 PST
Heh yeah I don't believe we're using split dwarf at all in Rust right now. With LTO we just use LLVM's `LinkModules` (I think that's the name?), afaik we don't do anything "nonstandard" for that at least.
Comment 9 Alex Crichton 2017-11-10 02:22:08 PST
Also, is there perhaps an "easy fix" that we could use in rustc? Something that we could hack around for now while waiting for an upstream fix? 

For example would it be "easy" to, just after all the ThinLTO passes, we "fix" the IR to have everything point to the same codegen unit?
Comment 10 Jonas Devlieghere 2017-11-10 02:48:02 PST
I don't think the problem is limited to ThinLTO. The same problem arises when llvm-link'ing the attached bitcode files together. There are already multiple CUs in the input bitcode file, which I'm guessing is how the rust compiler work (which I don't know anything about). 

Adrian, can you remind me again why we didn't consider option (3) where we emit the declarations for the subprogram in the other CU (i.e. the one specified in the IR). It sounds like that would make things easier for split DWARF if we can avoid the cross-CU reference.
Comment 11 Adrian Prantl 2017-11-10 08:11:18 PST
Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.
Comment 12 Alex Crichton 2017-11-10 14:37:28 PST
A recent comment on the rust bug pointed out that we may have encountered this before with normal LTO builds, it just was mostly unnoticed until now! https://github.com/rust-lang/rust/issues/44412
Comment 13 Alex Crichton 2017-11-12 09:58:09 PST
Philip Craig has found out [1] that a relatively small patch [2] seems to fix the issue locally, but perhaps one of y'all who are more familiar could give it a once over to make sure it's on the right track? If it looks good we can try to get it upstream!

[1]: https://github.com/rust-lang/rust/issues/45511#issuecomment-343725384
[2]: https://gist.github.com/philipc/7e1e05de75cdeaab39b0f8d506299bbe
Comment 14 Jonas Devlieghere 2017-11-13 08:51:51 PST
The patch essentially does what I proposed in my last comment: move the subprogram declaration into the other CU. Unfortunately this won't work in the general case, as Arian pointed out.
Comment 15 Philip Craig 2017-11-13 12:50:47 PST
The patch isn't changing which CU the subprogram is emitted in. It's actually doing Adrian's first suggestion.

The problem is that in constructAbstractSubprogramScopeDIE there is a mismatch between `this` and the CU of ContextDIE. `this` is the CU that was specified in the IR, but the CU of ContextDIE is that of the first subprogram that was emitted. The patch fixes this mismatch by looking up the CU of ContextDIE, and switching to use that. This doesn't affect which CU the subprogram is emitted in, because we're always using ContextDie as the parent. What it does mean is that when it gets to addDIEEntry, `this` will be correct, and so DW_FORM_ref_addr will be used when needed.
Comment 16 Jonas Devlieghere 2017-11-13 13:02:45 PST
Great, thanks for the clarification Philip! I guess I was a little confused by what you meant with 'the same CU` in your comment. 

// The scope may be shared with a subprogram that has already been
// constructed in another CU, in which case we need to construct this
// subprogram in the same CU.

Based on your comment on GitHub, do you want me to create a diff for this so we can get this upstream?
Comment 17 Philip Craig 2017-11-13 13:07:22 PST
That'd be much appreciated Jonas. And perhaps make that comment clearer :)
Comment 18 Jonas Devlieghere 2017-11-13 14:29:15 PST
I've created a differential and added the reproducer as a test case: https://reviews.llvm.org/D39981

As a consequence of the change the type actually end up in the "correct" CU. This makes sense to me, as there's no reason anymore for it to be in the "wrong" one.
Comment 19 Jonas Devlieghere 2017-11-15 08:03:49 PST
Committed in r318289
Comment 20 Philip Craig 2017-12-01 22:37:08 PST
We've found another situation in which this occurs at https://github.com/rust-lang/rust/issues/46346. The patch at https://reviews.llvm.org/D39981 fixed the problem for abstract subprogram DIEs created in DwarfCompileUnit::constructAbstractSubprogramScopeDIE, but the problem can also occur within non-inlined subprogram DIEs within DwarfUnit::getOrCreateSubprogramDIE.

To recap, the problem is that when the subprogram DIE is created, it inherits the CU from the context DIE, even if the DISubprogram specifies a different CU. Then while creating attributes within that subprogram DIE tree, DwarfUnit::addDIEEntry assumes that unlinked DIES are within the current CU, and so it doesn't always use  DW_FORM_ref_addr when it should.

The committed patch fixed the problem for DwarfCompileUnit::constructAbstractSubprogramScopeDIE by changing the "current CU" to be the same as the CU from the context DIE. This is a bit of a hack, and it's going to be messier to do the same thing for DwarfUnit::getOrCreateSubprogramDIE since we need to fix the notion of "current CU" in the callers as well.

Additionally, these non-inlined subprogram DIEs have address ranges associated with them, and DwarfDebug::endFunctionImpl is adding these ranges to the CU specified by the DISubprogram, which isn't going to match the CU the subprogram DIE is in.

So we're looking for ideas on how to fix this.

I'd like to return to the idea of emitting the subprograms in the CU specified in the IR. Adrian raised a concern with this:

> Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.

but I think that DwarfUnit::addDIEEntry already handles this case, assuming that its "current CU" is correct. Does that sound right? Assuming this is going to work, do you have any pointers on how to get the code to do that? I guess one way is to disable all sharing of type DIEs between CUs, but would there be any way to avoid that?