Published in: SAS
Text Miner stores the term-by-document frequency matrix in a compressed form: one row per (term, document) pair with its count. You will find this OUT data set in the project data directory of your Text Miner run; its label includes the string "OUT". Since a 30,000-document collection can have 500,000 to a million distinct terms, be sure to restrict your terms of interest with a start list before expanding the matrix. The following code gives an example of creating the co-occurrence matrix: it expands the compressed version into an uncompressed document-by-term table and then computes the co-occurrence counts with proc corr and the sscp option.
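/* Sample of the compressed OUT data set: one row per (term, doc) pair */
data myOUT;
  input term doc count;
  datalines;
1 1 1
1 3 1
1 4 1
2 2 1
2 3 2
3 1 2
3 3 2
3 4 1
4 2 2
4 4 1
5 3 2
;
run;

/* Group the rows by document so BY-group processing can be used */
proc sort data=myOUT;
  by doc term;
run;

/* Expand to one row per document with one column per term (t1-t5) */
data docbyterm;
  set myOUT;
  by doc;
  array t{5} t1-t5;
  retain t1-t5;
  if first.doc then do;
    do i=1 to 5;
      t{i}=0;          /* reset all term counts at the start of each document */
    end;
  end;
  t{term}=count;       /* store this term's frequency in its own column */
  if last.doc then do;
    output;            /* write one row once the document is complete */
  end;
run;

/* The SSCP matrix (sums of squares and cross-products) holds the co-occurrence counts */
proc corr data=docbyterm cov outp=cooccur sscp;
  var t1-t5;
run;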
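One point worth noting about the result: because the SSCP matrix is built from raw frequencies, each off-diagonal entry is a sum of frequency products rather than a count of documents. In the sample data, terms 1 and 3 appear together in documents 1, 3 and 4, yet their SSCP entry is 1*2 + 1*2 + 1*1 = 5. If what you want is the number of documents in which both terms occur, one possible variation (a sketch that is not part of the original post; the names docbyterm_bin and cooccur_bin are made up for illustration) is to convert the counts to 0/1 indicators before calling proc corr:

```sas
/* Assumes docbyterm was created by the code above and has columns t1-t5 */
data docbyterm_bin;
  set docbyterm;
  array t{5} t1-t5;
  do i=1 to 5;
    t{i} = (t{i} > 0);   /* 1 if the term occurs in the document, else 0 */
  end;
  keep doc t1-t5;
run;

/* With 0/1 indicators, each SSCP cross-product counts the documents
   that contain both terms; the diagonal gives each term's document count */
proc corr data=docbyterm_bin sscp outp=cooccur_bin noprint;
  var t1-t5;
run;
```

With the sample data, the entry for terms 1 and 3 in the binarized version would be 3. Which version is appropriate depends on whether frequency-weighted or document-level co-occurrence is wanted.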
URL: https://communities.sas.com/thread/6327?start=0&tstart=0