@@ -311,15 +311,7 @@ first position of the substring you supply as the second argument.
 
    generate str_position = strpos(sex, "ale")
 
-Python determines the position of a character in a string with the
-:func:`find` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
 Extracting substring by position
@@ -331,13 +323,7 @@ Stata extracts a substring from a string based on its position with the :func:`s
 
    generate short_sex = substr(sex, 1, 1)
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
-
-.. ipython:: python
-
-   tips["sex"].str[0:1].head()
+.. include:: includes/extract_substring.rst
 
 
 Extracting nth word
@@ -358,16 +344,7 @@ second argument specifies which word you want to extract.
    generate first_name = word(name, 1)
    generate last_name = word(name, -1)
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
-
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["string"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["string"].str.rsplit(" ", expand=True)[0]
-   firstlast
+.. include:: includes/nth_word.rst
 
 
 Changing case
@@ -390,27 +367,13 @@ change the case of ASCII and Unicode strings, respectively.
    generate title = strproper(string)
    list
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
-
-.. ipython:: python
+.. include:: includes/case.rst
 
-   firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
-   firstlast["upper"] = firstlast["string"].str.upper()
-   firstlast["lower"] = firstlast["string"].str.lower()
-   firstlast["title"] = firstlast["string"].str.title()
-   firstlast
 
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst
 
 In Stata, to perform a merge, one data set must be in memory
 and the other must be referenced as a file name on disk. In
@@ -465,38 +428,13 @@ or the intersection of the two by using the values created in the
    restore
    merge 1:n key using df2.dta
 
-pandas DataFrames have a :meth:`DataFrame.merge` method, which provides
-similar functionality. Note that different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge_setup.rst
 
 
 Missing data
 ------------
 
-Like Stata, pandas has a representation for missing data -- the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same; for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
-
-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in Stata you could do this to filter missing values.
@@ -508,30 +446,7 @@ For example, in Stata you could do this to filter missing values.
    * Keep non-missing values
    list if value_x != .
 
-This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data -- some of
-which would be challenging to express in Stata. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   # Drop rows with any missing value
-   outer_join.dropna()
-
-   # Fill forwards
-   outer_join.fillna(method="ffill")
-
-   # Impute missing values with the mean
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
 GroupBy
@@ -548,14 +463,7 @@ numeric columns.
 
    collapse (sum) total_bill tip, by(sex smoker)
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst
 
 
 Transformation
@@ -570,16 +478,7 @@ For example, to subtract the mean for each observation by smoker group.
    bysort sex smoker: egen group_bill = mean(total_bill)
    generate adj_total_bill = total_bill - group_bill
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst
 
 
 By group processing
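
Two of the inline examples removed by this diff are worth checking against whatever the new include files contain. The following is a minimal sketch, assuming pandas is installed. The names `words`, `positions`, and `filled`, and the small `value_x` series, are illustrative assumptions and are not taken from the include files.

```python
import pandas as pd

firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})

# Split on spaces once, then pick words by position. This mirrors
# Stata's word(name, 1) and word(name, -1).
words = firstlast["string"].str.split(" ")
firstlast["First_Name"] = words.str[0]
firstlast["Last_Name"] = words.str[-1]

# str.find returns a zero-based position, or -1 when the substring is absent.
positions = firstlast["string"].str.find("Smith")

# Forward fill with the dedicated method instead of fillna(method="ffill").
value_x = pd.Series([1.0, None, 3.0], name="value_x")
filled = value_x.ffill()

print(firstlast)
print(positions.tolist())
print(filled.tolist())
```

In the removed nth-word snippet, `rsplit(" ", expand=True)[0]` selects the first column of the split result, so `Last_Name` would repeat the first names. Indexing the last element, as sketched here, is one way to avoid that. Likewise, `fillna(method="ffill")` is deprecated in recent pandas releases in favor of `ffill()`.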