Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Robert Picard <picard@netbox.com> |

To |
"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |

Subject |
Re: st: RE: match variable across two tables |

Date |
Sat, 21 Dec 2013 14:54:45 -0500 |

Well "331A" is not a valid NAISC code so you have to decide what to do about that. The sample code I provided earlier requires that naics1-naics8 be string. This can easily be done using tostring naics*, replace Robert On Sat, Dec 21, 2013 at 1:56 PM, Rongrong Zhang <r05zhang@gmail.com> wrote: > Thank you Sarah. > > NAICS1 does not contain all the naics code from the original data. > > I found out why stata import naics1 as str, because there are a few > observations have letters embeded in NAICS1, e.g. 331A as a value of > NAICS1, NAICS2-8 are only numeric . > > I am not proficient in writing a import program, that is why I use > import wizard to import the txt file. > > > > On Sat, Dec 21, 2013 at 12:32 PM, Sarah Edgington <sedging@ucla.edu> wrote: >> Rochelle, >> At this point to determine what to do next you're actually going to have to >> look carefully at your data, all of it, not just the first observation. >> Then you'll have to make some decisions about how to get from the data you >> have to the data you want. >> >> Are naics2-naics8 missing for ALL observations. Stata doesn't make >> decisions about what format to import variables based on only the first >> observation so looking at the first observation is not going to be enough >> information to tell you what happened. >> >> Then you'll want to look at naics1. Does it contain all the naics codes >> from your original table? If naics1 contains all your values, separated by >> spaces, and the rest of the naics variables are ALWAYS missing then you can, >> as I suggested previously, just get rid of the extraneous naics variables >> and use -split- as Robert suggested previously. >> >> If naic2-naics8 contain data for some of your observations then you'll have >> to think harder about your next steps. >> >> For -reshape- to work you need a series of numbered variables that all have >> the same storage format. >> You should have all the tools you need to get to that point. You just have >> to look carefully at your data and figure out what steps you need to take. >> >> -Sarah >> >> >> At 05:33 AM 12/21/2013, you wrote: >>> >>> Hi Sarah, >>> >>> after importing, naics1 was set to str, naics2-8 were set to long, as >>> I said previously, I used File-Import-ASCII data created by >>> spreadsheet, then stata imported my txt file for me, my first >>> observation has non-missing data for naics1 and all missing for >>> naics2-8, I guess that is why stata assigned different types to >>> them.and the log shows command insheet was used. >>> insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt" >>> >>> Best, >>> Rochelle >>> >>> On Fri, Dec 20, 2013 at 6:05 PM, Sarah Edgington <sedging@ucla.edu> wrote: >>> > Rochelle, >>> > The error message isn't because the naics variables are missing, it's >>> > because naics2 (and presumably all of naics2-naics8?) are a different >>> > variable type than naics1. However, reshaping when all but 1 of the >>> > variables being reshaped contain all missing values isn't going to get you >>> > what you want. >>> > >>> > It sounds like something is going awry with your import process. If I >>> > understand you correctly you're saying that naics2-naics8 are missing for >>> > all observations not just the first two that you show, right? >>> > Are the codes all being read into the naics1 variable? That is, is >>> > naics1 a string variable containing multiple codes separated by spaces? If >>> > that's the case you'll want to drop naics2-naics8 and separate naics1 into >>> > multiple variables before reshaping. >>> > -Sarah >>> > >>> > -----Original Message----- >>> > From: owner-statalist@hsphsun2.harvard.edu >>> > [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Rongrong Zhang >>> > Sent: Friday, December 20, 2013 2:06 PM >>> > To: statalist@hsphsun2.harvard.edu >>> > Subject: Re: st: RE: match variable across two tables >>> > >>> > THANKS! I use import wizard and get the data into stata . >>> > >>> > data looks like: >>> > ionumber ioname naics1 naics2 naics3 naics4 naics5 naics6 naics7 naics8 >>> > 1110 Crop production 111 . . . . . . . >>> > 1111A0 Oilseed farming 11111 . . . . . . . >>> > >>> > >>> > that is missing for naics2 ~8. >>> > >>> > insheet using "C:\Users\Questions\Stata list\I-O table__Cleaned.txt" >>> > (10 vars, 564 obs) >>> > >>> > I got an error here: >>> > >>> > reshape long naics, i(ionumber) j(code) string >>> > (note: j = 1 2 3 4 5 6 7 8) >>> > naics2 type mismatch with other naics variables >>> > >>> > I did not have this error when I use your entire program, that is , when >>> > I use your input, then split codelist, >>> > >>> > I wonder if my error is caused by missing values in naics2 >>> > >>> > On Fri, Dec 20, 2013 at 4:40 PM, Robert Picard <picard@netbox.com> >>> > wrote: >>> >> I added double quotes so that your few lines of data could be read >>> >> inline using -input- (since Statalist does not allow attachments). You >>> >> most certainly do not need to input your data into Stata using the >>> >> same command. See -help import- to find better ways to do it. >>> >> >>> >> Robert >>> >> >>> >> On Fri, Dec 20, 2013 at 4:31 PM, Rongrong Zhang <r05zhang@gmail.com> >>> >> wrote: >>> >>> Dear Roberts, >>> >>> >>> >>> Please excuse my late response. Thanks so very much for your code !!! >>> >>> Words can't express my gratitude. >>> >>> >>> >>> my original data has over 600 rows (the I-O table), I posted only a >>> >>> few lines to save space. My question - to add quotes like in your >>> >>> program >>> >>> >>> >>> "1110" "Crop production" >>> >>> >>> >>> is there a stata tool that does it automatically or do I need to >>> >>> insert it manually for all 600 rows? >>> >>> >>> >>> Merry Christmas! >>> >>> >>> >>> Rochelle >>> >>> >>> >>> On Thu, Dec 19, 2013 at 12:22 PM, Robert Picard <picard@netbox.com> >>> >>> wrote: >>> >>>> No need to talk about "fuzzy" matching as NAISC codes are defined >>> >>>> hierarchically. If you do not match at the 6-digit level, you can >>> >>>> try again using 5-digit codes, and so on. >>> >>>> >>> >>>> Your first problem is to reshape Table 1 data from wide to long >>> >>>> format. Your "I-O number codes" are clearly not valid NAISC codes so >>> >>>> the target becomes creating a crosswalk between valid NAICS to "I-O >>> >>>> number codes". >>> >>>> >>> >>>> Once you have the crosswalk, you can do an exact match using -merge-. >>> >>>> For all NAICS code that did not find an exact match, you can do an >>> >>>> update merge to find matching "I-O numbers" using 5-digit NAISC >>> >>>> codes. >>> >>>> You can then repeat down to 2-digit NAICS if you want to. >>> >>>> >>> >>>> Robert >>> >>>> >>> >>>> * ----------------- begin example ------------------------ clear >>> >>>> input str6 ionumber str244 ioname str244 codelist "1110" "Crop >>> >>>> production" >>> >>>> "1111A0" "Oilseed farming" "11111 11112" >>> >>>> "1111B0" "Grain farming" "11113 11114 11115 11116 11119" >>> >>>> "111200" "Vegetable and melon farming" "1112" >>> >>>> "111400" "Greenhouse and nursery production" "1114" >>> >>>> "111910" "Tobacco farming" "11191" >>> >>>> "111920" "Cotton farming" "11192" >>> >>>> "1119A0" "Sugarcane and sugar beet" "11193 111991" >>> >>>> "1119B0" "All other crop farming" "11194 111992 111998" >>> >>>> end >>> >>>> compress >>> >>>> >>> >>>> * split into separate codes and reshape long split codelist, >>> >>>> gen(naics) reshape long naics, i(ionumber) j(code) string >>> >>>> >>> >>>> * drop obs with missing codes >>> >>>> bysort ionumber (code): drop if mi(naics) & _n > 1 replace naics = >>> >>>> ionumber if mi(naics) >>> >>>> >>> >>>> * remove trailing zeros >>> >>>> replace naics = regexr(naics,"0+$","") >>> >>>> >>> >>>> * save naics to ionumber crosswalk >>> >>>> isid naics, sort >>> >>>> list, noobs sepby(ionumber) >>> >>>> tempfile table1 >>> >>>> save "`table1'" >>> >>>> >>> >>>> clear >>> >>>> input str6 naics >>> >>>> "111" >>> >>>> "1111" >>> >>>> "111150" >>> >>>> "111199" >>> >>>> "111219" >>> >>>> "111310" >>> >>>> "111320" >>> >>>> "111332" >>> >>>> "111334" >>> >>>> "111335" >>> >>>> "111339" >>> >>>> "1114" >>> >>>> "111411" >>> >>>> "111419" >>> >>>> "111421" >>> >>>> "111422" >>> >>>> "111920" >>> >>>> "111930" >>> >>>> "111940" >>> >>>> "111998" >>> >>>> end >>> >>>> gen table2id = _n >>> >>>> replace naics = regexr(naics,"0+$","") >>> >>>> >>> >>>> * do an exact match using the crosswalk merge 1:1 naics using >>> >>>> "`table1'", keepusing(ionumber) /// keep(master match) nogen >>> >>>> >>> >>>> * for obs that did not match, try again using 5 digits. >>> >>>> clonevar naics6 = naics >>> >>>> replace naics = substr(naics6,1,5) >>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update >>> >>>> gen(merge5) drop if merge5 == 2 >>> >>>> >>> >>>> * repeat for 4-digit naics >>> >>>> replace naics = substr(naics6,1,4) >>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update >>> >>>> gen(merge4) drop if merge4 == 2 >>> >>>> >>> >>>> * repeat for 3-digit naics >>> >>>> replace naics = substr(naics6,1,3) >>> >>>> merge m:1 naics using "`table1'", keepusing(ionumber) /// update >>> >>>> gen(merge3) drop if merge3 == 2 >>> >>>> * --------------------------- end example --------------- >>> >>>> >>> >>>> >>> >>>> On Wed, Dec 18, 2013 at 9:52 PM, Rongrong Zhang <r05zhang@gmail.com> >>> >>>> wrote: >>> >>>>> Hi Sarah, >>> >>>>> >>> >>>>> Thanks so much for your questions. Let me try to answer them in >>> >>>>> the order they were posted. >>> >>>>> >>> >>>>> Yes, I plan to drop trailing zeros and take all the nonzero digits >>> >>>>> as match criteria. In this case, you are correct in terms of - I >>> >>>>> need processing the data first. - should I use trim ()? >>> >>>>> >>> >>>>> your next question: the structure of data in table 1: do I have a >>> >>>>> single variable that has multiple codes in it. I assume you are >>> >>>>> asking: >>> >>>>> >>> >>>>> e.g 1111B0 Grain farming corresponds to 5 different NAICS code >>> >>>>> and they are 11113 11114 11115 11116 11119. >>> >>>>> >>> >>>>> suppose all these 5 NAICS codes are present in my Table 2, I would >>> >>>>> like to have 5 rows in my final output table like this: >>> >>>>> >>> >>>>> 1111B0 11113 >>> >>>>> 1111B0 11114 >>> >>>>> 1111B0 11115 >>> >>>>> 1111B0 11116 >>> >>>>> 1111B0 11119 >>> >>>>> >>> >>>>> next question : the rule that make an entry a match. If I require 5 >>> >>>>> or >>> >>>>> 6 digit match, then these two tables may not produce many matches. >>> >>>>> that is why I thought of 4 digit matches. Ideally I would like to >>> >>>>> do both exact and "fuzzy" match e.g. using 4 digit, so I have the >>> >>>>> flexibility to control my sample size. >>> >>>>> >>> >>>>> If you or others have questions or suggestions, please let me know. >>> >>>>> >>> >>>>> thanks, >>> >>>>> >>> >>>>> On Wed, Dec 18, 2013 at 3:05 PM, Sarah Edgington <sedging@ucla.edu> >>> >>>>> wrote: >>> >>>>>> Rochelle, >>> >>>>>> This looks like it may be a pretty complicated problem. I don't >>> >>>>>> immediately have any suggestions because I'm not sure I understand either >>> >>>>>> the exact structure of your data or the matching rules you want to follow. >>> >>>>>> >>> >>>>>> You say that if you use exact matching that you want I-O number >>> >>>>>> 1111B0 to match with NAICS code 111150. I take it that is an "exact match" >>> >>>>>> because you want to drop the trailing zero in the NAICS code. So, since >>> >>>>>> 11115 appears in the list of NAICS codes for 1111B0, it would match to >>> >>>>>> 111150 in table 2. This is not to my mind an "exact match" because it >>> >>>>>> requires first modifying the NAICS code in table 2 before you can match. To >>> >>>>>> do that successfully you need to be very clear about what the rule for >>> >>>>>> modification is. >>> >>>>>> Is the rule that if the NAICS code in table 2 has a zero at the end >>> >>>>>> you always drop it? Does it matter how many digits appear before the zero? >>> >>>>>> >>> >>>>>> The next question I have is about the structure of table 1 as it >>> >>>>>> appears in Stata. Do you have a single variable that has multiple codes in >>> >>>>>> it? If so, you're probably going to have to do some additional processing >>> >>>>>> to that variable before trying to match the two tables. >>> >>>>>> >>> >>>>>> The final thing I was unclear on was what you want the final >>> >>>>>> structure of your data to be after matching. How do you want to deal with >>> >>>>>> entries in table 1 that have multiple matches in table 2? Do you want the >>> >>>>>> resulting data to contain multiple observations, one for each of the NAICS >>> >>>>>> codes that the I-O number matches to? >>> >>>>>> >>> >>>>>> Again for the four digit match, you'll want to be very clear on the >>> >>>>>> rules that make an entry a match. I'm not sure if you're asking for a match >>> >>>>>> of the first four digits of the NAICS code in table 1 to only the codes in >>> >>>>>> table 2 that are four digits long. Alternatively perhaps you're looking to >>> >>>>>> match observation in table 1 to ALL the entries in table 2 that share the >>> >>>>>> same first four digits. >>> >>>>>> >>> >>>>>> If you can more precisely describe the structure of your data as it >>> >>>>>> currently exists, the matching rules you want to follow, and the structure >>> >>>>>> you want your final data to be in, you'll increase your chances of getting a >>> >>>>>> helpful answer from the list. >>> >>>>>> >>> >>>>>> -S >>> >>>>>> >>> >>>>>> -----Original Message----- >>> >>>>>> From: owner-statalist@hsphsun2.harvard.edu >>> >>>>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of >>> >>>>>> Rongrong Zhang >>> >>>>>> Sent: Wednesday, December 18, 2013 11:15 AM >>> >>>>>> To: statalist@hsphsun2.harvard.edu >>> >>>>>> Subject: st: match variable across two tables >>> >>>>>> >>> >>>>>> Dear STATALISTER, >>> >>>>>> >>> >>>>>> I have two tables: >>> >>>>>> >>> >>>>>> Table 1 has 3 variables I-O number, I-O Name , Related 1997 >>> >>>>>> NAICS codes. >>> >>>>>> >>> >>>>>> Table 2 has 1 variable 1997 NAICS codes. >>> >>>>>> >>> >>>>>> I want to link these two tables based on NAICS code. However, the >>> >>>>>> level of details on NAICS code does not match one-to-one because >>> >>>>>> the tables come from different data source. My goal is to know >>> >>>>>> which NAICS code correspond to which I-O number. I can’t use Table >>> >>>>>> 1 only, because TABLE 2 is produced from Wharton Research Database >>> >>>>>> which has company level financial data I will use later on. >>> >>>>>> >>> >>>>>> By different details I mean : e.g. >>> >>>>>> >>> >>>>>> table 1: >>> >>>>>> >>> >>>>>> I-O number I-O Name 1997 NAICS codes >>> >>>>>> >>> >>>>>> 1110 Crop production >>> >>>>>> >>> >>>>>> 1111A0 Oilseed farming >>> >>>>>> 11111 11112 >>> >>>>>> >>> >>>>>> 1111B0 Grain farming 11113 11114 11115 >>> >>>>>> 11116 11119 >>> >>>>>> >>> >>>>>> 111200 Vegetable and melon farming >>> >>>>>> 1112 >>> >>>>>> 111400 Greenhouse and nursery production >>> >>>>>> 1114 >>> >>>>>> 111910 Tobacco farming >>> >>>>>> 11191 >>> >>>>>> 111920 Cotton farming >>> >>>>>> 11192 >>> >>>>>> 1119A0 Sugarcane and sugar beet >>> >>>>>> 11193 111991 >>> >>>>>> 1119B0 All other crop farming >>> >>>>>> 11194 111992 111998 >>> >>>>>> >>> >>>>>> in the above example, I present industry 1110 and its >>> >>>>>> subindustries 1111A0, 1111B0, 111200, each of the subindustries >>> >>>>>> correspond to a few (or a single) NAICS code (north america >>> >>>>>> industry classification system). >>> >>>>>> >>> >>>>>> table 2: >>> >>>>>> NAICS CODE >>> >>>>>> 111 >>> >>>>>> 1111 >>> >>>>>> 111150 >>> >>>>>> 111199 >>> >>>>>> 111219 >>> >>>>>> 111310 >>> >>>>>> 111320 >>> >>>>>> 111332 >>> >>>>>> 111334 >>> >>>>>> 111335 >>> >>>>>> 111339 >>> >>>>>> 1114 >>> >>>>>> 111411 >>> >>>>>> 111419 >>> >>>>>> 111421 >>> >>>>>> 111422 >>> >>>>>> 111920 >>> >>>>>> 111930 >>> >>>>>> 111940 >>> >>>>>> 111998 >>> >>>>>> >>> >>>>>> if I enforce exact match, then table 2 111150 matches with table 1 >>> >>>>>> 1111B0, table 2 1112l9 may be matched with 111200 table 1 I-O. >>> >>>>>> >>> >>>>>> My question : >>> >>>>>> 1.could you give a sample code/function to do exact match? note, >>> >>>>>> if first 5digit match, and drop last 0 (naics), we consider that a >>> >>>>>> match 2. if I want to increase match, how could I change the >>> >>>>>> program to do 4 digit match >>> >>>>>> >>> >>>>>> thanks a bunch, >>> >>>>>> >>> >>>>>> -- >>> >>>>>> Best, >>> >>>>>> Rochelle >>> >>>>>> >>> >>>>>> * >>> >>>>>> * For searches and help try: >>> >>>>>> * http://www.stata.com/help.cgi?search >>> >>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >>>>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>>>>> >>> >>>>>> >>> >>>>>> * >>> >>>>>> * For searches and help try: >>> >>>>>> * http://www.stata.com/help.cgi?search >>> >>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >>>>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> -- >>> >>>>> -Best, >>> >>>>> R >>> >>>>> >>> >>>>> * >>> >>>>> * For searches and help try: >>> >>>>> * http://www.stata.com/help.cgi?search >>> >>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >>>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>>> >>> >>>> * >>> >>>> * For searches and help try: >>> >>>> * http://www.stata.com/help.cgi?search >>> >>>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> -Best, >>> >>> R >>> >>> >>> >>> * >>> >>> * For searches and help try: >>> >>> * http://www.stata.com/help.cgi?search >>> >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >>> * http://www.ats.ucla.edu/stat/stata/ >>> >> >>> >> * >>> >> * For searches and help try: >>> >> * http://www.stata.com/help.cgi?search >>> >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >> * http://www.ats.ucla.edu/stat/stata/ >>> > >>> > >>> > >>> > -- >>> > -Best, >>> > R >>> > >>> > * >>> > * For searches and help try: >>> > * http://www.stata.com/help.cgi?search >>> > * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> > * http://www.ats.ucla.edu/stat/stata/ >>> > >>> > >>> > * >>> > * For searches and help try: >>> > * http://www.stata.com/help.cgi?search >>> > * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> > * http://www.ats.ucla.edu/stat/stata/ >>> >>> >>> >>> -- >>> -Best, >>> R >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ >> >> >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**References**:**st: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**st: RE: match variable across two tables***From:*"Sarah Edgington" <sedging@ucla.edu>

**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**Re: st: RE: match variable across two tables***From:*Robert Picard <picard@netbox.com>

**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**Re: st: RE: match variable across two tables***From:*Robert Picard <picard@netbox.com>

**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**RE: st: RE: match variable across two tables***From:*"Sarah Edgington" <sedging@ucla.edu>

**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

**Re: st: RE: match variable across two tables***From:*Sarah Edgington <sedging@ucla.edu>

**Re: st: RE: match variable across two tables***From:*Rongrong Zhang <r05zhang@gmail.com>

- Prev by Date:
**Re: st: Stata - efficiently appending 200+ files (my method takes hours)** - Next by Date:
**st: display results in a table, of bitest for varA, stratified by varB** - Previous by thread:
**Re: st: RE: match variable across two tables** - Next by thread:
**Re: st: RE: match variable across two tables** - Index(es):