Factors()

One of the issues that appeared during our first meeting was:

What is this thing called factors?

They appeared when we loaded temperature data using read.table(). Here is a small explanation:

In R, factors are a way to represent categorical values. Because the temperature dataset used the string “---” to indicate missing values, read.table() did not recognize the column as numeric and stored its values as categories instead. In the book “R in a Nutshell” by J. Adler, factors are described as “an ordered collection of items”, and further: “The different values that the factor can take are called levels”.
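The effect can be reproduced without the original file. The sketch below is only an illustration: the column name temp and the values are made up, and stringsAsFactors = TRUE is set explicitly because since R 4.0.0 read.table() no longer converts strings to factors by default (it did at the time of the meeting).

```r
# Hypothetical stand-in for the temperature file: one column, "---" marks a missing value
raw <- "temp
2.1
2.1
---
2.5
3.0"

# stringsAsFactors = TRUE mimics the pre-4.0.0 default of read.table()
d <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE)

class(d$temp)    # "factor": the "---" entry prevents a numeric column
levels(d$temp)   # one level per distinct string, including "---"
```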

Assume the participants of a survey answered (“male”, “male”, “female”, “other”, “male”) when asked about their sex. There is no inherent ordering in these levels. These answers can be represented in R as

> answers <- factor(c("male","male","female","other","male"))

If you query the variable “answers”, R will tell you

> answers
[1] male   male   female other  male
Levels: female male other

If you want to have the answers in character format, you simply do

> as.character( answers )
[1] "male"   "male"   "female" "other"  "male"
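A few base R helpers are handy for inspecting such a factor. The following short sketch reuses the survey answers from above:

```r
answers <- factor(c("male", "male", "female", "other", "male"))

levels(answers)      # the distinct categories, sorted alphabetically by default
nlevels(answers)     # number of categories
table(answers)       # how often each category occurs
is.ordered(answers)  # FALSE: no ordering was declared
```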

Sometimes there is an inherent ordering in the possible answers, for example when participants are asked about the climate in their country and answer (“hot”, “moderate”, “hot”, “cold”, “hot”). These answers can be represented like this:

> answers <- factor(c("hot","moderate","hot","cold","hot"),
+                   levels=c("cold","moderate","hot"), ordered=TRUE)

Here the levels argument tells R the inherent (increasing) ordering of the levels, and the ordered argument marks the factor as ordered. If we query “answers”, we get

> answers
[1] hot      moderate hot      cold     hot
Levels: cold < moderate < hot

Notice that the ordering is indicated by “<”, in contrast to the unordered answers above. Converting to character works as before. Converting to numeric values returns the position of each answer in the level ordering:

> as.numeric( answers )
[1] 3 2 3 1 3
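Because the factor is ordered, comparisons and order-based summaries are meaningful. A brief sketch with the same climate answers:

```r
answers <- factor(c("hot", "moderate", "hot", "cold", "hot"),
                  levels = c("cold", "moderate", "hot"), ordered = TRUE)

answers >= "moderate"   # element-wise comparison against a level
min(answers)            # the lowest level that occurs
sort(answers)           # sorted by level order, not alphabetically
```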

If, as in the temperature data, a variable was read as a factor but this is unwanted, there are several ways around it. Say the data read from the file appear as a factor of the form (“2.1”, “2.1”, “---”, “2.5”, “3.0”)

> temp <- factor(c("2.1","2.1","---","2.5","3.0"))
> temp
[1] 2.1 2.1 --- 2.5 3.0
Levels: --- 2.1 2.5 3.0

but you would like them as numerical values. Simply issuing as.numeric(temp) does not return the actual temperature values but the integer codes that R internally assigned to the levels:

> as.numeric(temp)
[1] 2 2 1 3 4
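These integers are positions in the vector of levels, which R sorts when the factor is created. A small sketch to make the mapping visible:

```r
temp <- factor(c("2.1", "2.1", "---", "2.5", "3.0"))

codes <- as.integer(temp)   # internal integer codes, one per element
levels(temp)                # the level strings the codes point to
levels(temp)[codes]         # indexing the levels by code recovers the original strings
```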

A quick and dirty way to get the actual temperature values and have the “---” appear as NA (“not available”) is to convert to character first and then to numeric:

> as.numeric(as.character(temp))
[1] 2.1 2.1  NA 2.5 3.0
Warning message:
NAs introduced by coercion
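R's help page for factor() mentions a variant that converts only the levels and then indexes them by the factor, which is slightly more efficient than as.numeric(as.character(temp)) on long vectors. A sketch with the same data; it triggers the same coercion warning for "---":

```r
temp <- factor(c("2.1", "2.1", "---", "2.5", "3.0"))

# Convert each distinct level once, then pick the value for every element
values <- as.numeric(levels(temp))[temp]
values
```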

The warning can be avoided by wrapping the above call in suppressWarnings(). A more elegant way to turn the contents of the file into numeric values is to use the options of the read.table() function, but that is another topic.
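Both suggestions can be sketched briefly. The data below are the same illustrative values as above, and the column name temp is made up; na.strings is an argument of read.table() that lists the strings to be read as NA.

```r
temp <- factor(c("2.1", "2.1", "---", "2.5", "3.0"))

# 1) Silence the coercion warning for a conversion we know is intentional
values <- suppressWarnings(as.numeric(as.character(temp)))

# 2) Tell read.table() up front that "---" marks a missing value
raw <- "temp\n2.1\n2.1\n---\n2.5\n3.0"   # stand-in for the file contents
d <- read.table(text = raw, header = TRUE, na.strings = "---")

class(d$temp)   # "numeric": no factor is created in the first place
```

With a real file, the file name would take the place of the text argument.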


First meeting Dec. 15 2011

The R usergroup Dresden meets for the first time on Dec. 15, 2011 at 6 pm for a first joint exchange of experiences. The meeting point is Avantgarde Labs, Löbauer Straße 19, D-01099 Dresden. So that we can plan ahead, please send a short message to r-users-dresden@gmx.de if you want to participate.

Program:

  • Introduction to time series analysis with R
  • Analysis of a temperature time series for Dresden
  • Opportunity for participants to present their own projects, R hacks and graphics
  • Planning of joint projects for the usergroup; new ideas are welcome!

General information:

What is R?

R is a free software environment and programming language for statistical computing and data visualization. With more than 3000 user-contributed packages, R is the de facto standard in academia. It offers a convenient environment for developing and distributing your own applications.

Why a usergroup?

Users from different fields use R for various purposes. Given the importance of R and the enormous variety of its applications, it makes sense to provide a forum where users can exchange ideas, learn from each other and start collaborations.

By working directly with experienced users, beginners and other interested people can learn the R language efficiently and put it to use for their own purposes.